Data Types, Sources, and Collection

Data is the raw material of predictive modelling. Before building any model, we must understand what type of data we have, where it comes from, how it is collected, and whether it is reliable enough for analysis.

A predictive model can only learn meaningful patterns when the underlying data is relevant, accurate, well-structured, and collected with a clear business objective in mind.

Why Data Understanding Comes First

In predictive analytics, the quality of the final model depends heavily on the quality of the data. Even the most advanced algorithm cannot produce reliable predictions from poor, incomplete, biased, or irrelevant data.

Data understanding helps us answer important questions such as: What variables are available? Which variable should be predicted? Are the records complete? Are there errors? Is the data recent enough? Does the data represent the real-world problem accurately?

Core Idea: Predictive modelling does not begin with algorithms. It begins with understanding the data that will teach the algorithm how the real world behaves.

What is Data in Predictive Analytics?

Data is a collection of facts, measurements, observations, or records that describe people, products, events, transactions, systems, or processes. In predictive modelling, data usually contains input features and a target variable.

📥 Input Features

Variables used to explain or predict an outcome. Examples include age, income, location, purchase frequency, and website activity.

🎯 Target Variable

The outcome the model tries to predict, such as sales amount, churn status, loan default, disease risk, or delivery time.

🧾 Records or Rows

Individual observations in a dataset. Each row may represent a customer, transaction, product, employee, or event.

📊 Variables or Columns

Attributes that describe each observation. Each column captures one measurable property of the entity being studied.
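To make this vocabulary concrete, here is a minimal sketch in plain Python. The three records, the column names, and the choice of `churned` as the target are invented for illustration; in practice a tabular library would usually hold the data.

```python
# Hypothetical dataset: each dict is one record (row),
# and each key is a variable (column).
records = [
    {"age": 34, "income": 52000, "purchase_frequency": 4, "churned": 0},
    {"age": 51, "income": 71000, "purchase_frequency": 1, "churned": 1},
    {"age": 27, "income": 39000, "purchase_frequency": 7, "churned": 0},
]

TARGET = "churned"  # assumed target variable for this illustration


def split_features_target(rows, target):
    """Separate input features (X) from the target variable (y)."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


X, y = split_features_target(records, TARGET)
print(X[0])  # input features of the first record
print(y)     # target values, one per record
```

Keeping the target in its own list makes it harder to feed the outcome back into the model as an input by accident.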

Major Types of Data

Data can be classified in different ways. For predictive modelling, it is important to understand both the structure of the data and the statistical nature of each variable.

1. Structured Data

Structured data is organized in a fixed format, usually rows and columns. It is easy to store in databases and spreadsheets, and it is the most commonly used format in traditional predictive modelling.

Examples include sales tables, customer records, banking transactions, attendance data, inventory records, and CRM data.

2. Semi-Structured Data

Semi-structured data does not follow a strict table format but still contains some organizational structure through tags, keys, or metadata.

Examples include JSON files, XML files, web logs, API responses, and email metadata.

3. Unstructured Data

Unstructured data does not have a predefined tabular structure. It often requires special processing techniques before it can be used for predictive modelling.

Examples include text documents, images, videos, audio recordings, social media posts, customer reviews, and call centre transcripts.

Data Structure Type	Description	Examples	Predictive Analytics Usage
Structured Data	Organized into rows and columns.	Sales records, customer tables, bank transactions.	Regression, classification, forecasting, customer scoring.
Semi-Structured Data	Has flexible structure using tags, keys, or metadata.	JSON, XML, API data, web logs.	Event tracking, clickstream analysis, user behaviour prediction.
Unstructured Data	No fixed format or predefined schema.	Images, text, video, audio, reviews.	Sentiment analysis, image classification, speech analytics.
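Semi-structured data usually has to be flattened into rows and columns before a traditional model can use it. The sketch below shows one way to do that for a nested JSON record; the event payload and the underscore naming scheme are assumptions made for this example, not a fixed standard.

```python
import json

# Hypothetical API response: nested keys rather than fixed columns.
raw_event = """
{
  "user": {"id": "u-101", "region": "north"},
  "event": "add_to_cart",
  "item": {"sku": "A12", "price": 19.99}
}
"""


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


row = flatten(json.loads(raw_event))
print(row)
```

Each flattened key becomes a candidate column, so many such events can be stacked into a structured table.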

Statistical Types of Variables

Variables in a dataset can also be classified based on the kind of values they contain. This classification helps determine which cleaning, visualization, encoding, and modelling techniques should be used.

Variable Type	Meaning	Examples	Model Preparation Need
Numerical (continuous)	Can take any value within a range.	Income, temperature, height, sales amount.	May require scaling, outlier treatment, or transformation.
Numerical (discrete)	Countable numeric values.	Number of purchases, number of children, visits per month.	May be used directly or transformed depending on distribution.
Categorical (nominal)	Categories without natural order.	Gender, city, product category, payment method.	Usually requires encoding such as one-hot encoding.
Categorical (ordinal)	Categories with meaningful order.	Low/Medium/High, education level, customer rating.	Can be encoded using ordered numerical values.
Date/Time (temporal)	Values related to time.	Transaction date, signup time, delivery date.	Can be converted into month, day, hour, weekday, seasonality features.
Text (unstructured)	Free-form language data.	Reviews, comments, complaints, emails.	Requires text preprocessing, vectorization, or NLP techniques.
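The difference between nominal and ordinal variables shows up directly in how they are encoded. The following sketch contrasts the two approaches named in the table; the category lists are hypothetical and would normally be derived from the data.

```python
# Hypothetical category values, chosen only for illustration.
SATISFACTION_ORDER = {"Low": 0, "Medium": 1, "High": 2}  # ordinal: order matters
PAYMENT_METHODS = ["card", "cash", "wallet"]              # nominal: no order


def encode_ordinal(value, order=SATISFACTION_ORDER):
    """Map an ordered category to an integer that preserves its rank."""
    return order[value]


def encode_one_hot(value, categories=PAYMENT_METHODS):
    """Return one 0/1 indicator per category; exactly one of them is 1."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]


print(encode_ordinal("Medium"))
print(encode_one_hot("cash"))
```

One-hot encoding avoids implying that "wallet" is somehow greater than "card", which an integer code would suggest to many algorithms.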

Sources of Data

Predictive models can use data from many different sources. The source of data affects reliability, freshness, accessibility, privacy requirements, and business relevance.

🏢 Internal Business Data

Data generated inside the organization through business operations.

Sales records
Customer relationship management data
Inventory systems
Employee records
Transaction databases

🌐 External Data

Data collected from outside the organization to improve predictive context.

Market trends
Weather data
Economic indicators
Competitor pricing
Government datasets

📱 Digital Behaviour Data

Data generated by user interactions with digital platforms.

Website clicks
App usage logs
Search behaviour
Product views
Cart activity

📡 Sensor and IoT Data

Data generated by machines, sensors, devices, and connected systems.

Machine temperature
Vehicle GPS data
Energy usage
Equipment vibration
Smart device signals

🗣 Customer Feedback Data

Data that captures customer opinions, satisfaction, and experiences.

Surveys
Reviews
Support tickets
Complaint records
Social media comments

🧾 Public and Open Data

Freely available datasets published by institutions, platforms, or communities.

Government portals
Research datasets
Open APIs
Public statistics
Industry reports

Primary Data vs Secondary Data

Data can also be classified based on whether it is collected directly for the current purpose or reused from existing sources.

Type	Meaning	Examples	Advantages	Limitations
Primary Data	Collected directly for a specific research or business objective.	Surveys, interviews, experiments, direct observations.	Highly relevant and customized.	Can be expensive and time-consuming to collect.
Secondary Data	Already collected by someone else or for another purpose.	Public datasets, company records, reports, databases.	Faster and cheaper to access.	May not perfectly match the current problem.

Data Collection Methods

Data collection is the process of gathering information from relevant sources for analysis and modelling. The right collection method depends on the business objective, data availability, privacy constraints, and technical infrastructure.

Typical Data Collection Pipeline

Define Objective → Identify Data Sources → Collect Data → Store Data → Validate Quality

Collection Method	Description	Example
Database Extraction	Retrieving data from relational or NoSQL databases.	Extracting customer transactions from a banking database.
APIs	Collecting data from software systems through application programming interfaces.	Getting weather data from a weather API for demand forecasting.
Surveys and Forms	Collecting responses directly from people.	Customer satisfaction survey for churn prediction.
Web Tracking	Recording user activity on websites and applications.	Tracking clicks, product views, and cart additions.
Sensor Collection	Collecting real-time signals from devices and machines.	Machine vibration data for predictive maintenance.
Web Scraping	Extracting publicly available data from websites where permitted.	Collecting competitor prices for pricing analytics.
Manual Entry	Human-entered data collected through forms or spreadsheets.	Sales team entering lead details into a CRM system.
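As an illustration of database extraction, the sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for a real transactional system. The `transactions` table, its columns, and the sample rows are all assumptions for this example.

```python
import sqlite3

# In-memory database standing in for a real transactional system.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id TEXT, amount REAL, txn_date TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("c1", 120.0, "2024-01-05"),
        ("c1", 80.0, "2024-02-11"),
        ("c2", 45.5, "2024-01-20"),
    ],
)


def extract_customer_totals(connection, since):
    """Return (customer_id, txn_count, total_amount) per customer.

    Only transactions dated on or after `since` are included. Dates are
    ISO-format strings, so plain string comparison orders them correctly.
    """
    query = """
        SELECT customer_id, COUNT(*) AS txn_count, SUM(amount) AS total_amount
        FROM transactions
        WHERE txn_date >= ?
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return connection.execute(query, (since,)).fetchall()


totals = extract_customer_totals(conn, "2024-01-01")
recent = extract_customer_totals(conn, "2024-02-01")
print(totals)
print(recent)
conn.close()
```

Passing the cutoff date as a query parameter, rather than formatting it into the SQL string, keeps the extraction safe against injection.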

Example: Data Collection for Customer Churn Prediction

Business Problem

A telecom company wants to predict which customers are likely to leave in the next 30 days.

To build this model, the company may collect different types of data:

Data Category	Examples	Why It Matters
Customer Profile Data	Age, region, plan type, tenure.	Helps identify customer segments with higher churn risk.
Usage Data	Call minutes, data usage, recharge frequency.	Declining usage may indicate disengagement.
Billing Data	Payment delays, bill amount, failed transactions.	Payment behaviour may signal dissatisfaction or affordability issues.
Support Data	Complaints, service tickets, resolution time.	Frequent unresolved complaints may increase churn probability.
Target Variable	Churned or not churned within the 30-day prediction window.	This is the outcome the model will learn to predict.

Once this data is collected, it can be cleaned, analysed, transformed, and used to train a classification model.
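A rough sketch of how these categories might be combined into one training table follows. The customer IDs, field names, and values are hypothetical, and treating a missing support record as zero complaints is an assumption that a real project would need to confirm.

```python
# Hypothetical extracts from three source systems, keyed by customer_id.
profiles = {
    "c1": {"tenure_months": 24, "plan_type": "prepaid"},
    "c2": {"tenure_months": 3, "plan_type": "postpaid"},
}
usage = {
    "c1": {"call_minutes": 310, "data_gb": 4.2},
    "c2": {"call_minutes": 40, "data_gb": 0.3},
}
support = {"c2": {"open_complaints": 2}}  # c1 never contacted support
churned_ids = {"c2"}  # customers who left within the assumed 30-day window


def build_training_rows(profiles, usage, support, churned_ids):
    """Join per-customer data into one row each and attach the churn label."""
    rows = []
    for customer_id in sorted(profiles):
        row = {"customer_id": customer_id}
        row.update(profiles[customer_id])
        row.update(usage.get(customer_id, {}))
        # Assumption: no support record means zero open complaints.
        row.update(support.get(customer_id, {"open_complaints": 0}))
        row["churned"] = int(customer_id in churned_ids)
        rows.append(row)
    return rows


training_rows = build_training_rows(profiles, usage, support, churned_ids)
for row in training_rows:
    print(row)
```

Each resulting row holds the input features for one customer plus the `churned` label, which is the shape a classification model expects.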

Data Quality Checks During Collection

Collecting data is not enough. The data must also be checked for quality before it is used for modelling.

Data Collection Quality Checklist

Completeness: Are important fields missing?
Accuracy: Are values correct and realistic?
Consistency: Are formats and definitions uniform across sources?
Timeliness: Is the data recent enough for the problem?
Relevance: Does the data help explain the target outcome?
Uniqueness: Are duplicate records removed or controlled?
Validity: Do values follow expected rules and ranges?
Privacy: Is sensitive data collected and stored responsibly?
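Several items on this checklist can be automated. The sketch below counts completeness, uniqueness, and validity problems in a small set of records; the sample data, the `customer_id` key, and the 0 to 120 age range are assumptions chosen for illustration.

```python
# Hypothetical records with deliberate problems: a missing age,
# an impossible age, and a duplicated customer_id.
records = [
    {"customer_id": "c1", "age": 34, "plan_type": "prepaid"},
    {"customer_id": "c2", "age": None, "plan_type": "postpaid"},
    {"customer_id": "c3", "age": 212, "plan_type": "prepaid"},
    {"customer_id": "c1", "age": 34, "plan_type": "prepaid"},
]


def quality_report(rows, key="customer_id", age_range=(0, 120)):
    """Count completeness, uniqueness, and validity problems in `rows`."""
    # Completeness: any field whose value is None counts as missing.
    missing = sum(1 for r in rows for v in r.values() if v is None)

    # Uniqueness: count rows whose key has already been seen.
    seen, duplicates = set(), 0
    for r in rows:
        if r[key] in seen:
            duplicates += 1
        seen.add(r[key])

    # Validity: ages must fall inside the expected range.
    low, high = age_range
    invalid_ages = sum(
        1 for r in rows if r["age"] is not None and not low <= r["age"] <= high
    )
    return {
        "missing_values": missing,
        "duplicate_keys": duplicates,
        "invalid_ages": invalid_ages,
    }


report = quality_report(records)
print(report)
```

Checks such as accuracy and relevance are harder to automate and usually need domain knowledge or comparison against a trusted source.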

Ethical and Privacy Considerations

Data collection must be done responsibly. Predictive models often use personal, behavioural, financial, or health-related information. Organizations must ensure that data is collected legally, stored securely, and used fairly.

Important considerations include consent, data minimization, anonymization, security, transparency, and avoiding discriminatory use of sensitive attributes.

Practical Reminder: Just because data is available does not always mean it should be used. Responsible predictive analytics balances business value with privacy, fairness, and trust.
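Data minimization and pseudonymization can be sketched in a few lines. The example below keeps only the fields a model is assumed to need and replaces the email address with a keyed hash. The field names and the secret key are hypothetical. Note that a keyed hash is pseudonymization rather than full anonymization: anyone holding the key can re-derive the same value from a known identifier.

```python
import hashlib
import hmac

# Assumption: in practice the key comes from a secrets store, never from code.
SECRET_KEY = b"replace-with-a-real-secret"
ALLOWED_FIELDS = {"plan_type", "tenure_months"}  # only what the model needs


def pseudonymize(identifier, key=SECRET_KEY):
    """Replace a direct identifier with a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record, id_field="email", allowed=ALLOWED_FIELDS):
    """Keep only allowed fields and swap the identifier for a pseudonym."""
    slim = {k: v for k, v in record.items() if k in allowed}
    slim["customer_key"] = pseudonymize(record[id_field])
    return slim


raw = {
    "email": "pat@example.com",
    "phone": "555-0100",
    "plan_type": "prepaid",
    "tenure_months": 24,
}
safe = minimize(raw)
print(safe)
```

The same input always maps to the same `customer_key`, so records from different sources can still be joined without exposing the email address.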

How Data Types Affect Model Building

Different data types require different preprocessing techniques before they can be used in predictive models.

Data Type	Common Preparation Technique	Reason
Numerical Data	Scaling, normalization, outlier treatment.	Improves stability and comparability across variables.
Categorical Data	One-hot encoding, label encoding, target encoding.	Most algorithms need categories converted into numerical form.
Date/Time Data	Extract day, month, year, hour, weekday, season.	Raw dates are less useful than meaningful time-based features.
Text Data	Cleaning, tokenization, vectorization, embeddings.	Text must be transformed into numerical representation.
Image Data	Resizing, normalization, feature extraction.	Images require special processing for computer vision models.
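Two of the techniques in this table, scaling numerical data and extracting date features, are sketched below using only the standard library. Min-max scaling is just one of several scaling options, and the income values and example date are invented.

```python
from datetime import date


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; constant columns map to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def date_features(iso_date):
    """Derive simple calendar features from an ISO-format date string."""
    d = date.fromisoformat(iso_date)
    return {
        "year": d.year,
        "month": d.month,
        "day": d.day,
        "weekday": d.weekday(),          # Monday = 0, Sunday = 6
        "is_weekend": d.weekday() >= 5,  # Saturday or Sunday
    }


incomes = [30000, 45000, 60000]  # hypothetical values
scaled = min_max_scale(incomes)
features = date_features("2024-03-16")
print(scaled)
print(features)
```

In a real project the minimum and maximum would be computed on the training data only and then reused for new data, so that information from unseen records does not leak into the model.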

Best Practices for Data Collection

🎯 Start with the Business Objective

Collect only the data that supports the prediction goal and business decision.

🧾 Define Data Clearly

Maintain consistent definitions for variables, target outcomes, timestamps, and customer identifiers.

✅ Validate Early

Check missing values, duplicates, invalid entries, and format issues before modelling begins.

🔒 Protect Privacy

Use secure storage, access control, anonymization, and responsible data handling practices.

Key Takeaways

Data is the foundation of predictive modelling.
Predictive datasets usually contain input features and a target variable.
Data can be structured, semi-structured, or unstructured.
Variables may be numerical, categorical, date/time, or text-based.
Data can come from internal systems, external sources, digital platforms, sensors, surveys, and public datasets.
Good data collection requires clear objectives, reliable sources, quality checks, and ethical handling.
The type and quality of data directly influence model performance.
