Data Types, Sources, and Collection
Data is the raw material of predictive modelling. Before building any model, we must understand what type of data we have, where it comes from, how it is collected, and whether it is reliable enough for analysis.
A predictive model can only learn meaningful patterns when the underlying data is relevant, accurate, well-structured, and collected with a clear business objective in mind.
Why Data Understanding Comes First
In predictive analytics, the quality of the final model depends heavily on the quality of the data. Even the most advanced algorithm cannot produce reliable predictions from poor, incomplete, biased, or irrelevant data.
Data understanding helps us answer important questions such as: What variables are available? Which variable should be predicted? Are the records complete? Are there errors? Is the data recent enough? Does the data represent the real-world problem accurately?
Core Idea: Predictive modelling does not begin with algorithms. It begins with understanding the data that will teach the algorithm how the real world behaves.
What is Data in Predictive Analytics?
Data is a collection of facts, measurements, observations, or records that describe people, products, events, transactions, systems, or processes. In predictive modelling, data usually contains input features (the variables used to make a prediction) and a target variable (the outcome to be predicted).
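As a minimal sketch of the split between input features and a target variable, the snippet below separates a few hypothetical customer records into the two parts. The field names (`tenure_months`, `monthly_bill`, `plan`, `churned`) and the helper `split_features_target` are illustrative assumptions, not part of any particular dataset or library.

```python
# Hypothetical records: each dict is one customer (one row of a table).
records = [
    {"tenure_months": 24, "monthly_bill": 39.5, "plan": "prepaid", "churned": 0},
    {"tenure_months": 3, "monthly_bill": 59.0, "plan": "postpaid", "churned": 1},
    {"tenure_months": 12, "monthly_bill": 45.0, "plan": "prepaid", "churned": 0},
]

TARGET = "churned"  # assumed name of the outcome column


def split_features_target(rows, target):
    """Separate each row into input features (X) and the target value (y)."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_features_target(records, TARGET)
print(X[0])  # features of the first customer, without the target
print(y)     # one target value per customer
```

Keeping the target out of the feature set from the start helps avoid accidentally letting the model see the answer it is supposed to predict.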
Major Types of Data
Data can be classified in different ways. For predictive modelling, it is important to understand both the structure of the data and the statistical nature of each variable.
1. Structured Data
Structured data is organized in a fixed format, usually rows and columns. It is easy to store in databases and spreadsheets, and it is the most commonly used format in traditional predictive modelling.
Examples include sales tables, customer records, banking transactions, attendance data, inventory records, and CRM data.
2. Semi-Structured Data
Semi-structured data does not follow a strict table format but still contains some organizational structure through tags, keys, or metadata.
Examples include JSON files, XML files, web logs, API responses, and email metadata.
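To use semi-structured data in a tabular model, the nested keys usually have to be flattened into columns first. The sketch below shows one possible way to do this with the standard library; the event payload and the `flatten` helper are hypothetical, and in practice a data-frame library would often handle this step.

```python
import json

# Hypothetical clickstream event, as it might arrive from an API or web log.
raw = """
{"user": {"id": 42, "plan": "premium"},
 "event": "add_to_cart",
 "items": [{"sku": "A1", "qty": 2}]}
"""


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level using dotted key names.

    Lists are kept as-is, because how to expand them (one row per item,
    counts, sums) depends on the modelling problem.
    """
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


row = flatten(json.loads(raw))
print(row)
```

Each flattened key (`user.id`, `user.plan`, `event`) can then become a column in a structured table.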
3. Unstructured Data
Unstructured data does not have a predefined tabular structure. It often requires special processing techniques before it can be used for predictive modelling.
Examples include text documents, images, videos, audio recordings, social media posts, customer reviews, and call centre transcripts.
| Data Structure Type | Description | Examples | Predictive Analytics Usage |
|---|---|---|---|
| Structured Data | Organized into rows and columns. | Sales records, customer tables, bank transactions. | Regression, classification, forecasting, customer scoring. |
| Semi-Structured Data | Has flexible structure using tags, keys, or metadata. | JSON, XML, API data, web logs. | Event tracking, clickstream analysis, user behaviour prediction. |
| Unstructured Data | No fixed format or predefined schema. | Images, text, video, audio, reviews. | Sentiment analysis, image classification, speech analytics. |
Statistical Types of Variables
Variables in a dataset can also be classified based on the kind of values they contain. This classification helps determine which cleaning, visualization, encoding, and modelling techniques should be used.
| Variable Type | Meaning | Examples | Model Preparation Need |
|---|---|---|---|
| Numerical (Continuous) | Can take any value within a range. | Income, temperature, height, sales amount. | May require scaling, outlier treatment, or transformation. |
| Numerical (Discrete) | Countable numeric values. | Number of purchases, number of children, visits per month. | May be used directly or transformed depending on distribution. |
| Categorical (Nominal) | Categories without natural order. | Gender, city, product category, payment method. | Usually requires encoding such as one-hot encoding. |
| Categorical (Ordinal) | Categories with meaningful order. | Low/Medium/High, education level, customer rating. | Can be encoded using ordered numerical values. |
| Date/Time (Temporal) | Values related to time. | Transaction date, signup time, delivery date. | Can be converted into month, day, hour, weekday, seasonality features. |
| Text (Unstructured) | Free-form language data. | Reviews, comments, complaints, emails. | Requires text preprocessing, vectorization, or NLP techniques. |
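A rough first pass at identifying variable types can be automated. The sketch below is a heuristic only: the function `guess_variable_type`, the `max_words` threshold, and the sample columns are all assumptions made for illustration. It cannot tell nominal from ordinal categories, which requires domain knowledge.

```python
from datetime import datetime


def guess_variable_type(values, max_words=3):
    """Heuristically guess a column's statistical type.

    The rules and the max_words threshold are illustrative assumptions,
    not a standard; always confirm the result against the data dictionary.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "unknown"
    if all(isinstance(v, bool) for v in non_null):
        return "categorical"
    if all(isinstance(v, int) for v in non_null):
        return "numerical discrete"
    if all(isinstance(v, (int, float)) for v in non_null):
        return "numerical continuous"
    if all(isinstance(v, str) for v in non_null):
        try:
            for v in non_null:
                datetime.fromisoformat(v)  # raises ValueError if not ISO format
            return "date/time"
        except ValueError:
            pass
        # Long strings are treated as free text, short ones as category labels.
        if max(len(v.split()) for v in non_null) > max_words:
            return "text"
        return "categorical"
    return "mixed"


columns = {
    "income": [52000.0, 61000.5, None, 48000.0],
    "num_purchases": [3, 0, 7, 2],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "signup_date": ["2024-01-15", "2023-11-02", "2024-03-30", "2022-07-19"],
    "review": ["Great service and fast delivery", "OK", "Support never answered my calls"],
}
types = {name: guess_variable_type(vals) for name, vals in columns.items()}
print(types)
```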
Sources of Data
Predictive models can use data from many different sources. The source of data affects reliability, freshness, accessibility, privacy requirements, and business relevance.
Internal data: data generated inside the organization through business operations.
- Sales records
- Customer relationship management data
- Inventory systems
- Employee records
- Transaction databases
External data: data collected from outside the organization to improve predictive context.
- Market trends
- Weather data
- Economic indicators
- Competitor pricing
- Government datasets
Digital platform data: data generated by user interactions with digital platforms.
- Website clicks
- App usage logs
- Search behaviour
- Product views
- Cart activity
Sensor and machine data: data generated by machines, sensors, devices, and connected systems.
- Machine temperature
- Vehicle GPS data
- Energy usage
- Equipment vibration
- Smart device signals
Survey and feedback data: data that captures customer opinions, satisfaction, and experiences.
- Surveys
- Reviews
- Support tickets
- Complaint records
- Social media comments
Public datasets: freely available datasets published by institutions, platforms, or communities.
- Government portals
- Research datasets
- Open APIs
- Public statistics
- Industry reports
Primary Data vs Secondary Data
Data can also be classified based on whether it is collected directly for the current purpose or reused from existing sources.
| Type | Meaning | Examples | Advantages | Limitations |
|---|---|---|---|---|
| Primary Data | Collected directly for a specific research or business objective. | Surveys, interviews, experiments, direct observations. | Highly relevant and customized. | Can be expensive and time-consuming to collect. |
| Secondary Data | Already collected by someone else or for another purpose. | Public datasets, company records, reports, databases. | Faster and cheaper to access. | May not perfectly match the current problem. |
Data Collection Methods
Data collection is the process of gathering information from relevant sources for analysis and modelling. The right collection method depends on the business objective, data availability, privacy constraints, and technical infrastructure.
| Collection Method | Description | Example |
|---|---|---|
| Database Extraction | Retrieving data from relational or NoSQL databases. | Extracting customer transactions from a banking database. |
| APIs | Collecting data from software systems through application programming interfaces. | Getting weather data from a weather API for demand forecasting. |
| Surveys and Forms | Collecting responses directly from people. | Customer satisfaction survey for churn prediction. |
| Web Tracking | Recording user activity on websites and applications. | Tracking clicks, product views, and cart additions. |
| Sensor Collection | Collecting real-time signals from devices and machines. | Machine vibration data for predictive maintenance. |
| Web Scraping | Extracting publicly available data from websites where permitted. | Collecting competitor prices for pricing analytics. |
| Manual Entry | Human-entered data collected through forms or spreadsheets. | Sales team entering lead details into a CRM system. |
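As one illustration of database extraction, the sketch below pulls per-customer transaction summaries with a parameterized SQL query. It uses an in-memory SQLite database so it runs on its own; the `transactions` table, its columns, and the `extract_customer_totals` helper are hypothetical stand-ins for a real production schema.

```python
import sqlite3

# In-memory database with a hypothetical transactions table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, amount REAL, txn_date TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        (1, 120.0, "2024-05-01"),
        (1, 80.0, "2024-05-20"),
        (2, 300.0, "2024-04-11"),
        (2, 50.0, "2024-05-03"),
    ],
)


def extract_customer_totals(connection, since):
    """Return (customer_id, transaction count, total amount) since a given date."""
    query = """
        SELECT customer_id, COUNT(*) AS n_txn, SUM(amount) AS total_amount
        FROM transactions
        WHERE txn_date >= ?
        GROUP BY customer_id
        ORDER BY customer_id
    """
    # The ? placeholder keeps the date out of the SQL string itself.
    return connection.execute(query, (since,)).fetchall()


rows = extract_customer_totals(conn, "2024-05-01")
print(rows)  # [(1, 2, 200.0), (2, 1, 50.0)]
```

Aggregating in the query, rather than exporting every raw row, keeps the extract small and already close to the one-row-per-customer shape most models expect.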
Example: Data Collection for Customer Churn Prediction
Business Problem
A telecom company wants to predict which customers are likely to leave in the next 30 days.
To build this model, the company may collect different types of data:
| Data Category | Examples | Why It Matters |
|---|---|---|
| Customer Profile Data | Age, region, plan type, tenure. | Helps identify customer segments with higher churn risk. |
| Usage Data | Call minutes, data usage, recharge frequency. | Declining usage may indicate disengagement. |
| Billing Data | Payment delays, bill amount, failed transactions. | Payment behaviour may signal dissatisfaction or affordability issues. |
| Support Data | Complaints, service tickets, resolution time. | Frequent unresolved complaints may increase churn probability. |
| Target Variable | Churned or not churned within the 30-day prediction window. | This is the outcome the model will learn to predict. |
Once this data is collected, it can be cleaned, analysed, transformed, and used to train a classification model.
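The data categories in the table above typically live in separate systems and need to be joined on a customer identifier before modelling. The following is a minimal sketch under stated assumptions: the dictionaries, field names, and the `build_modelling_table` helper are invented for illustration, and a real pipeline would more likely use database joins or a data-frame library.

```python
# Hypothetical extracts, each keyed by customer ID.
profile = {101: {"tenure_months": 24, "plan_type": "prepaid"},
           102: {"tenure_months": 3, "plan_type": "postpaid"}}
usage = {101: {"data_gb": 12.4}, 102: {"data_gb": 1.1}}
support = {102: {"open_tickets": 2}}  # customer 101 never contacted support
labels = {101: 0, 102: 1}             # 1 = churned, 0 = stayed


def build_modelling_table(profile, usage, support, labels):
    """Combine the sources into one row per labelled customer."""
    rows = []
    for customer_id, churned in labels.items():
        row = {"customer_id": customer_id}
        row.update(profile.get(customer_id, {}))
        row.update(usage.get(customer_id, {}))
        # No support history is recorded as zero tickets, not as a missing value.
        row.update(support.get(customer_id, {"open_tickets": 0}))
        row["churned"] = churned
        rows.append(row)
    return rows


table = build_modelling_table(profile, usage, support, labels)
for row in table:
    print(row)
```

Driving the join from the labelled customers ensures every row has a target value; how to treat customers missing from a source is a modelling decision that should be made explicitly, as with the zero-ticket default here.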
Data Quality Checks During Collection
Collecting data is not enough. The data must also be checked for quality before it is used for modelling.
Data Collection Quality Checklist
- Completeness: Are important fields missing?
- Accuracy: Are values correct and realistic?
- Consistency: Are formats and definitions uniform across sources?
- Timeliness: Is the data recent enough for the problem?
- Relevance: Does the data help explain the target outcome?
- Uniqueness: Are duplicate records removed or controlled?
- Validity: Do values follow expected rules and ranges?
- Privacy: Is sensitive data collected and stored responsibly?
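Several items on this checklist can be checked automatically. The sketch below covers completeness, validity, timeliness, and uniqueness for a handful of hypothetical records; the `quality_report` function, the field names, and the thresholds are assumptions chosen for illustration rather than fixed rules.

```python
from collections import Counter
from datetime import date

# Hypothetical collected records with deliberate quality problems.
records = [
    {"customer_id": 1, "age": 34, "monthly_bill": 45.0, "last_updated": "2024-06-01"},
    {"customer_id": 2, "age": None, "monthly_bill": 60.0, "last_updated": "2024-05-28"},
    {"customer_id": 2, "age": 41, "monthly_bill": -5.0, "last_updated": "2022-01-10"},
]


def quality_report(rows, key, ranges, date_field, today, max_age_days):
    """Count missing values, out-of-range values, stale rows, and duplicate keys."""
    missing, invalid, stale = Counter(), Counter(), 0
    for row in rows:
        for field, value in row.items():
            if value is None:                      # completeness
                missing[field] += 1
        for field, (low, high) in ranges.items():  # validity
            value = row.get(field)
            if value is not None and not (low <= value <= high):
                invalid[field] += 1
        age_days = (today - date.fromisoformat(row[date_field])).days
        if age_days > max_age_days:                # timeliness
            stale += 1
    key_counts = Counter(row[key] for row in rows)  # uniqueness
    duplicates = sum(n - 1 for n in key_counts.values() if n > 1)
    return {"missing": dict(missing), "invalid": dict(invalid),
            "stale": stale, "duplicate_keys": duplicates}


report = quality_report(
    records,
    key="customer_id",
    ranges={"age": (18, 100), "monthly_bill": (0, 10000)},  # assumed business rules
    date_field="last_updated",
    today=date(2024, 6, 30),
    max_age_days=365,
)
print(report)
```

Accuracy, relevance, and privacy are harder to automate and usually need domain review in addition to checks like these.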
Ethical and Privacy Considerations
Data collection must be done responsibly. Predictive models often use personal, behavioural, financial, or health-related information. Organizations must ensure that data is collected legally, stored securely, and used fairly.
Important considerations include consent, data minimization, anonymization, security, transparency, and avoiding discriminatory use of sensitive attributes.
Practical Reminder: Just because data is available does not always mean it should be used. Responsible predictive analytics balances business value with privacy, fairness, and trust.
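Two of these considerations, data minimization and pseudonymization, can be sketched in code. The example below keeps only an allow-list of fields and replaces the customer ID with a keyed hash. The field names, `ALLOWED_FIELDS`, and `SECRET_KEY` are hypothetical, and a keyed hash is pseudonymization rather than full anonymization: whoever holds the key can still link records, so legal and security review is still needed.

```python
import hashlib
import hmac

# Assumption: in a real system the key comes from a secrets manager, not source code.
SECRET_KEY = b"replace-with-a-managed-secret"

# Data minimization: only fields needed for the model are kept.
ALLOWED_FIELDS = {"plan_type", "tenure_months", "churned"}


def pseudonymize(customer_id, key=SECRET_KEY):
    """Replace an identifier with a keyed SHA-256 hash (HMAC)."""
    return hmac.new(key, str(customer_id).encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record):
    """Drop fields outside the allow-list and swap the ID for a pseudonym."""
    safe = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    safe["customer_key"] = pseudonymize(record["customer_id"])
    return safe


raw_record = {
    "customer_id": 101,
    "name": "A. Example",
    "email": "a.example@example.com",
    "plan_type": "prepaid",
    "tenure_months": 24,
    "churned": 0,
}
safe_record = minimize(raw_record)
print(sorted(safe_record))  # ['churned', 'customer_key', 'plan_type', 'tenure_months']
```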
How Data Types Affect Model Building
Different data types require different preprocessing techniques before they can be used in predictive models.
| Data Type | Common Preparation Technique | Reason |
|---|---|---|
| Numerical Data | Scaling, normalization, outlier treatment. | Improves stability and comparability across variables. |
| Categorical Data | One-hot encoding, label encoding, target encoding. | Most algorithms require numeric inputs, so categories must be converted into numbers. |
| Date/Time Data | Extract day, month, year, hour, weekday, season. | Derived time-based features expose calendar and seasonal patterns that a raw timestamp hides. |
| Text Data | Cleaning, tokenization, vectorization, embeddings. | Text must be transformed into numerical representation. |
| Image Data | Resizing, normalization, feature extraction. | Images require special processing for computer vision models. |
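The first three rows of the table can be illustrated with small helper functions. This is a sketch using only the standard library so that each transformation is visible; the helper names and sample values are assumptions, and production code would normally rely on an established preprocessing library instead.

```python
from datetime import date


def min_max_scale(values):
    """Rescale numbers to the 0-1 range so variables become comparable."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Turn each category into its own 0/1 indicator column."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def date_features(iso_dates):
    """Derive calendar features from ISO-formatted date strings."""
    features = []
    for text in iso_dates:
        d = date.fromisoformat(text)
        features.append({
            "month": d.month,
            "weekday": d.weekday(),           # Monday = 0 ... Sunday = 6
            "is_weekend": int(d.weekday() >= 5),
        })
    return features


scaled = min_max_scale([20.0, 60.0, 100.0])
encoded = one_hot(["card", "cash", "card"])
calendar = date_features(["2024-06-01", "2024-06-03"])

print(scaled)    # [0.0, 0.5, 1.0]
print(encoded)
print(calendar)
```

In a real project the scaling range and the category list should be learned from the training data only and then reused on new data, so that the model sees the same encoding at prediction time.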
Best Practices for Data Collection
Key Takeaways
- Data is the foundation of predictive modelling.
- Predictive datasets usually contain input features and a target variable.
- Data can be structured, semi-structured, or unstructured.
- Variables may be numerical, categorical, date/time, or text-based.
- Data can come from internal systems, external sources, digital platforms, sensors, surveys, and public datasets.
- Good data collection requires clear objectives, reliable sources, quality checks, and ethical handling.
- The type and quality of data directly influence model performance.