Data Types, Sources, and Collection

Data is the raw material of predictive modelling. Before building any model, we must understand what type of data we have, where it comes from, how it is collected, and whether it is reliable enough for analysis.

A predictive model can only learn meaningful patterns when the underlying data is relevant, accurate, well-structured, and collected with a clear business objective in mind.

Why Data Understanding Comes First

In predictive analytics, the quality of the final model depends heavily on the quality of the data. Even the most advanced algorithm cannot produce reliable predictions from poor, incomplete, biased, or irrelevant data.

Data understanding helps us answer important questions such as: What variables are available? Which variable should be predicted? Are the records complete? Are there errors? Is the data recent enough? Does the data represent the real-world problem accurately?

Core Idea: Predictive modelling does not begin with algorithms. It begins with understanding the data that will teach the algorithm how the real world behaves.

What is Data in Predictive Analytics?

Data is a collection of facts, measurements, observations, or records that describe people, products, events, transactions, systems, or processes. In predictive modelling, data usually contains input features and a target variable.

📥 Input Features
Variables used to explain or predict an outcome. Examples include age, income, location, purchase frequency, and website activity.

🎯 Target Variable
The outcome the model tries to predict, such as sales amount, churn status, loan default, disease risk, or delivery time.

🧾 Records or Rows
Individual observations in a dataset. Each row may represent a customer, transaction, product, employee, or event.

📊 Variables or Columns
Attributes that describe each observation. Each column captures one measurable property of the entity being studied.
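
To make the four terms concrete, here is a minimal sketch using only the Python standard library. The dataset, its column names, and the helper `split_features_target` are hypothetical and exist only for illustration.

```python
# Hypothetical toy dataset: each dict is one record (row), each key one variable (column).
records = [
    {"age": 34, "income": 52000, "purchase_frequency": 5, "churned": 0},
    {"age": 51, "income": 38000, "purchase_frequency": 1, "churned": 1},
    {"age": 27, "income": 61000, "purchase_frequency": 8, "churned": 0},
]

TARGET = "churned"  # assumed name of the target variable


def split_features_target(rows, target):
    """Return (X, y): feature dicts without the target, and the list of target values."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


X, y = split_features_target(records, TARGET)
feature_names = sorted(X[0])
print(feature_names)  # ['age', 'income', 'purchase_frequency']
print(y)              # [0, 1, 0]
```

Keeping the target in a separate object from the features makes it harder to feed the answer to the model by accident.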

Major Types of Data

Data can be classified in different ways. For predictive modelling, it is important to understand both the structure of the data and the statistical nature of each variable.

1. Structured Data

Structured data is organized in a fixed format, usually rows and columns. It is easy to store in databases and spreadsheets, and it is the most commonly used format in traditional predictive modelling.

Examples include sales tables, customer records, banking transactions, attendance data, inventory records, and CRM data.

2. Semi-Structured Data

Semi-structured data does not follow a strict table format but still contains some organizational structure through tags, keys, or metadata.

Examples include JSON files, XML files, web logs, API responses, and email metadata.

3. Unstructured Data

Unstructured data does not have a predefined tabular structure. It often requires special processing techniques before it can be used for predictive modelling.

Examples include text documents, images, videos, audio recordings, social media posts, customer reviews, and call centre transcripts.

Data Structure Type | Description | Examples | Predictive Analytics Usage
Structured Data | Organized into rows and columns. | Sales records, customer tables, bank transactions. | Regression, classification, forecasting, customer scoring.
Semi-Structured Data | Has flexible structure using tags, keys, or metadata. | JSON, XML, API data, web logs. | Event tracking, clickstream analysis, user behaviour prediction.
Unstructured Data | No fixed format or predefined schema. | Images, text, video, audio, reviews. | Sentiment analysis, image classification, speech analytics.
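
Semi-structured data usually has to be flattened into rows and columns before a traditional model can use it. The sketch below shows one possible way to do that with the standard `json` module. The sample payload, the dotted column names, and the `flatten` helper are assumptions made for illustration, not a fixed convention.

```python
import json

# Hypothetical API response: nested keys, with optional fields missing in some events.
raw = """
[
  {"user": {"id": 1, "city": "Pune"}, "event": "click", "meta": {"page": "/home"}},
  {"user": {"id": 2}, "event": "view", "meta": {"page": "/cart", "ms": 420}}
]
"""


def flatten(obj, prefix=""):
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


events = [flatten(e) for e in json.loads(raw)]

# Use the union of all keys as columns; fields absent from an event become None.
columns = sorted({key for e in events for key in e})
table = [[e.get(col) for col in columns] for e in events]

print(columns)  # ['event', 'meta.ms', 'meta.page', 'user.city', 'user.id']
for row in table:
    print(row)
```

The `None` entries produced here are exactly the missing values that later quality checks need to account for.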

Statistical Types of Variables

Variables in a dataset can also be classified based on the kind of values they contain. This classification helps determine which cleaning, visualization, encoding, and modelling techniques should be used.

Variable Type | Meaning | Examples | Model Preparation Need
Numerical (Continuous) | Can take any value within a range. | Income, temperature, height, sales amount. | May require scaling, outlier treatment, or transformation.
Numerical (Discrete) | Countable numeric values. | Number of purchases, number of children, visits per month. | May be used directly or transformed depending on distribution.
Categorical (Nominal) | Categories without natural order. | Gender, city, product category, payment method. | Usually requires encoding such as one-hot encoding.
Categorical (Ordinal) | Categories with meaningful order. | Low/Medium/High, education level, customer rating. | Can be encoded using ordered numerical values.
Date/Time (Temporal) | Values related to time. | Transaction date, signup time, delivery date. | Can be converted into month, day, hour, weekday, seasonality features.
Text (Unstructured) | Free-form language data. | Reviews, comments, complaints, emails. | Requires text preprocessing, vectorization, or NLP techniques.
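
As a rough illustration of this classification, the sketch below guesses a variable's type from its values. The function `infer_variable_type`, the `max_words` threshold, and the sample columns are all hypothetical. The rules are simple heuristics, and they cannot tell nominal from ordinal categories, which requires domain knowledge.

```python
from datetime import datetime


def is_iso_date(value):
    """Return True if the string parses as an ISO 8601 date or datetime."""
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def infer_variable_type(values, max_words=3):
    """Heuristically label a column; the rules and threshold are illustrative assumptions."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    if all(isinstance(v, int) and not isinstance(v, bool) for v in present):
        return "numerical (discrete)"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in present):
        return "numerical (continuous)"
    if all(isinstance(v, str) for v in present):
        if all(is_iso_date(v) for v in present):
            return "date/time"
        # Long free-form strings are treated as text, short labels as categories.
        if any(len(v.split()) > max_words for v in present):
            return "text"
        return "categorical"
    return "mixed"


columns = {
    "visits_per_month": [3, 0, 12],
    "income": [52000.0, 38000.5, None],
    "payment_method": ["card", "upi", "card"],
    "signup_date": ["2024-01-15", "2023-11-02", "2024-03-30"],
    "review": ["Great service and fast delivery", "ok", "Too slow"],
}
inferred = {name: infer_variable_type(vals) for name, vals in columns.items()}
for name, kind in inferred.items():
    print(f"{name}: {kind}")
```

A guess like this is only a starting point; a numeric-looking column such as a postal code is still categorical in meaning.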

Sources of Data

Predictive models can use data from many different sources. The source of data affects reliability, freshness, accessibility, privacy requirements, and business relevance.

🏢 Internal Business Data

Data generated inside the organization through business operations.

  • Sales records
  • Customer relationship management data
  • Inventory systems
  • Employee records
  • Transaction databases
🌐 External Data

Data collected from outside the organization to improve predictive context.

  • Market trends
  • Weather data
  • Economic indicators
  • Competitor pricing
  • Government datasets
📱 Digital Behaviour Data

Data generated by user interactions with digital platforms.

  • Website clicks
  • App usage logs
  • Search behaviour
  • Product views
  • Cart activity
📡 Sensor and IoT Data

Data generated by machines, sensors, devices, and connected systems.

  • Machine temperature
  • Vehicle GPS data
  • Energy usage
  • Equipment vibration
  • Smart device signals
🗣 Customer Feedback Data

Data that captures customer opinions, satisfaction, and experiences.

  • Surveys
  • Reviews
  • Support tickets
  • Complaint records
  • Social media comments
🧾 Public and Open Data

Freely available datasets published by institutions, platforms, or communities.

  • Government portals
  • Research datasets
  • Open APIs
  • Public statistics
  • Industry reports

Primary Data vs Secondary Data

Data can also be classified based on whether it is collected directly for the current purpose or reused from existing sources.

Type | Meaning | Examples | Advantages | Limitations
Primary Data | Collected directly for a specific research or business objective. | Surveys, interviews, experiments, direct observations. | Highly relevant and customized. | Can be expensive and time-consuming to collect.
Secondary Data | Already collected by someone else or for another purpose. | Public datasets, company records, reports, databases. | Faster and cheaper to access. | May not perfectly match the current problem.

Data Collection Methods

Data collection is the process of gathering information from relevant sources for analysis and modelling. The right collection method depends on the business objective, data availability, privacy constraints, and technical infrastructure.

Typical Data Collection Pipeline

Define Objective → Identify Data Sources → Collect Data → Store Data → Validate Quality

Collection Method | Description | Example
Database Extraction | Retrieving data from relational or NoSQL databases. | Extracting customer transactions from a banking database.
APIs | Collecting data from software systems through application programming interfaces. | Getting weather data from a weather API for demand forecasting.
Surveys and Forms | Collecting responses directly from people. | Customer satisfaction survey for churn prediction.
Web Tracking | Recording user activity on websites and applications. | Tracking clicks, product views, and cart additions.
Sensor Collection | Collecting real-time signals from devices and machines. | Machine vibration data for predictive maintenance.
Web Scraping | Extracting publicly available data from websites where permitted. | Collecting competitor prices for pricing analytics.
Manual Entry | Human-entered data collected through forms or spreadsheets. | Sales team entering lead details into a CRM system.
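
As one illustration of database extraction, the sketch below uses Python's built-in `sqlite3` module with an in-memory database standing in for a real transaction store. The table name, columns, and sample rows are hypothetical. A production system would connect to its own database and schema.

```python
import sqlite3

# In-memory database used as a stand-in for a real transaction system (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions (customer_id INTEGER, amount REAL, txn_date TEXT)"
)
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [(1, 120.0, "2024-01-05"), (1, 80.0, "2024-02-11"), (2, 300.0, "2024-02-20")],
)

# Aggregate raw transactions into one row per customer, a typical modelling input.
query = """
    SELECT customer_id, COUNT(*) AS n_txns, SUM(amount) AS total_amount
    FROM transactions
    WHERE txn_date >= ?
    GROUP BY customer_id
    ORDER BY customer_id
"""
rows = conn.execute(query, ("2024-01-01",)).fetchall()
conn.close()

print(rows)  # [(1, 2, 200.0), (2, 1, 300.0)]
```

Passing the cutoff date as a `?` parameter rather than formatting it into the SQL string avoids injection problems and keeps the query reusable.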

Example: Data Collection for Customer Churn Prediction

Business Problem

A telecom company wants to predict which customers are likely to leave in the next 30 days.

To build this model, the company may collect different types of data:

Data Category | Examples | Why It Matters
Customer Profile Data | Age, region, plan type, tenure. | Helps identify customer segments with higher churn risk.
Usage Data | Call minutes, data usage, recharge frequency. | Declining usage may indicate disengagement.
Billing Data | Payment delays, bill amount, failed transactions. | Payment behaviour may signal dissatisfaction or affordability issues.
Support Data | Complaints, service tickets, resolution time. | Frequent unresolved complaints may increase churn probability.
Target Variable | Churned or not churned. | This is the outcome the model will learn to predict.

Once this data is collected, it can be cleaned, analysed, transformed, and used to train a classification model.
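
The data categories in this example typically live in separate systems, so they have to be joined on a shared customer identifier before modelling. The sketch below shows one simple way this might look. All customer IDs, field names, values, and the `build_modelling_table` helper are invented for illustration.

```python
# Hypothetical extracts from four source systems, each keyed by customer ID.
profiles = {
    101: {"plan_type": "prepaid", "tenure_months": 14},
    102: {"plan_type": "postpaid", "tenure_months": 3},
    103: {"plan_type": "prepaid", "tenure_months": 40},
}
usage = {
    101: {"call_minutes": 320, "data_gb": 4.5},
    102: {"call_minutes": 45, "data_gb": 0.8},
}
billing = {
    101: {"payment_delays": 0},
    102: {"payment_delays": 3},
    103: {"payment_delays": 1},
}
support = {102: {"open_tickets": 2}}

# Target variable: 1 = churned, 0 = not churned.
churn_labels = {101: 0, 102: 1, 103: 0}


def build_modelling_table(labels, *sources):
    """Join every source onto the labelled customers; absent fields stay missing."""
    table = []
    for customer_id, churned in sorted(labels.items()):
        row = {"customer_id": customer_id}
        for source in sources:
            row.update(source.get(customer_id, {}))
        row["churned"] = churned
        table.append(row)
    return table


table = build_modelling_table(churn_labels, profiles, usage, billing, support)
for row in table:
    print(row)
```

Customer 103 ends up without usage or support fields, which shows why the joined table still needs cleaning before training.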

Data Quality Checks During Collection

Collecting data is not enough. The data must also be checked for quality before it is used for modelling.

Data Collection Quality Checklist

  • Completeness: Are important fields missing?
  • Accuracy: Are values correct and realistic?
  • Consistency: Are formats and definitions uniform across sources?
  • Timeliness: Is the data recent enough for the problem?
  • Relevance: Does the data help explain the target outcome?
  • Uniqueness: Are duplicate records removed or controlled?
  • Validity: Do values follow expected rules and ranges?
  • Privacy: Is sensitive data collected and stored responsibly?
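
Several items on this checklist can be automated. The sketch below covers three of them (completeness, uniqueness, and validity) in plain Python. The records, the `quality_report` helper, and the 18 to 100 age range are hypothetical choices made for illustration.

```python
# Hypothetical collected records with typical problems.
records = [
    {"customer_id": 1, "age": 34, "signup_date": "2024-01-15"},
    {"customer_id": 2, "age": None, "signup_date": "2023-11-02"},  # missing age
    {"customer_id": 2, "age": None, "signup_date": "2023-11-02"},  # duplicate key
    {"customer_id": 3, "age": 212, "signup_date": "2024-03-30"},   # unrealistic age
]


def quality_report(rows, key, required, valid_ranges):
    """Count missing required fields, duplicate keys, and out-of-range values."""
    missing = {
        field: sum(1 for r in rows if r.get(field) is None) for field in required
    }
    duplicate_keys = len(rows) - len({r[key] for r in rows})
    out_of_range = {
        field: sum(
            1 for r in rows
            if r.get(field) is not None and not (low <= r[field] <= high)
        )
        for field, (low, high) in valid_ranges.items()
    }
    return {
        "rows": len(rows),
        "missing": missing,
        "duplicate_keys": duplicate_keys,
        "out_of_range": out_of_range,
    }


report = quality_report(
    records,
    key="customer_id",
    required=["age", "signup_date"],
    valid_ranges={"age": (18, 100)},  # assumed business rule
)
print(report)
```

Checks such as accuracy, relevance, and privacy usually need human review or domain knowledge and are not captured by counts like these.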

Ethical and Privacy Considerations

Data collection must be done responsibly. Predictive models often use personal, behavioural, financial, or health-related information. Organizations must ensure that data is collected legally, stored securely, and used fairly.

Important considerations include consent, data minimization, anonymization, security, transparency, and avoiding discriminatory use of sensitive attributes.

Practical Reminder: Just because data is available does not always mean it should be used. Responsible predictive analytics balances business value with privacy, fairness, and trust.
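
Two of the considerations above, data minimization and protecting identifiers, can be sketched in code. The example below keeps only an allow-list of fields and replaces the email address with a keyed hash (HMAC-SHA256). Note that this is pseudonymization rather than full anonymization, because anyone holding the key can recompute the mapping. The key, field names, and helper functions are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real key belongs in a secrets manager.
SECRET_KEY = b"replace-with-a-managed-secret"

# Data minimization: only fields needed for the prediction goal are kept (assumed list).
ALLOWED_FIELDS = {"plan_type", "tenure_months"}


def pseudonymize(identifier, key=SECRET_KEY):
    """Replace a direct identifier with a stable keyed hash."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def minimize(record, id_field="email"):
    """Drop fields outside the allow-list and swap the identifier for a pseudonym."""
    slim = {k: v for k, v in record.items() if k in ALLOWED_FIELDS}
    slim["customer_key"] = pseudonymize(record[id_field])
    return slim


raw_record = {
    "email": "asha@example.com",
    "phone": "555-0100",
    "plan_type": "prepaid",
    "tenure_months": 14,
}
safe_record = minimize(raw_record)
print(sorted(safe_record))  # ['customer_key', 'plan_type', 'tenure_months']
```

The same input always yields the same `customer_key`, so records can still be joined across tables without exposing the email address itself.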

How Data Types Affect Model Building

Different data types require different preprocessing techniques before they can be used in predictive models.

Data Type | Common Preparation Technique | Reason
Numerical Data | Scaling, normalization, outlier treatment. | Improves stability and comparability across variables.
Categorical Data | One-hot encoding, label encoding, target encoding. | Most models need categories converted into numerical format.
Date/Time Data | Extract day, month, year, hour, weekday, season. | Raw dates are less useful than meaningful time-based features.
Text Data | Cleaning, tokenization, vectorization, embeddings. | Text must be transformed into numerical representation.
Image Data | Resizing, normalization, feature extraction. | Images require special processing for computer vision models.
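
The first three rows of the table can be illustrated with small standard-library functions. This is a minimal sketch: the function names and sample values are hypothetical, and real projects would normally rely on a dedicated preprocessing library.

```python
from datetime import date


def min_max_scale(values):
    """Rescale numbers to the range [0, 1]; a constant column maps to 0.0."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Encode nominal categories as one 0/1 indicator per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


def ordinal_encode(values, order):
    """Encode ordered categories by their position in the given order."""
    rank = {level: i for i, level in enumerate(order)}
    return [rank[v] for v in values]


def date_features(iso_string):
    """Derive calendar features from an ISO date; weekday 0 means Monday."""
    d = date.fromisoformat(iso_string)
    return {"year": d.year, "month": d.month, "day": d.day, "weekday": d.weekday()}


print(min_max_scale([20, 30, 40]))          # [0.0, 0.5, 1.0]
print(one_hot(["card", "upi", "card"])[0])  # {'is_card': 1, 'is_upi': 0}
print(ordinal_encode(["Low", "High", "Medium"], ["Low", "Medium", "High"]))  # [0, 2, 1]
print(date_features("2024-01-15"))
```

In practice the scaling range and category list should be learned from the training data only and then reused unchanged on new data.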

Best Practices for Data Collection

🎯 Start with the Business Objective
Collect only the data that supports the prediction goal and business decision.

🧾 Define Data Clearly
Maintain consistent definitions for variables, target outcomes, timestamps, and customer identifiers.

Validate Early
Check missing values, duplicates, invalid entries, and format issues before modelling begins.

🔒 Protect Privacy
Use secure storage, access control, anonymization, and responsible data handling practices.

Key Takeaways

  • Data is the foundation of predictive modelling.
  • Predictive datasets usually contain input features and a target variable.
  • Data can be structured, semi-structured, or unstructured.
  • Variables may be numerical, categorical, date/time, or text-based.
  • Data can come from internal systems, external sources, digital platforms, sensors, surveys, and public datasets.
  • Good data collection requires clear objectives, reliable sources, quality checks, and ethical handling.
  • The type and quality of data directly influence model performance.