What are the steps involved in cleaning data for ML?
Data cleaning is a crucial step in any machine learning project, as it can affect the quality, accuracy, and performance of your models. However, data cleaning can also be a challenging and time-consuming task, as it involves various steps and techniques to deal with different types of data issues. In this article, we will explore some of the common steps involved in cleaning data for ML, and how they can help you prepare your data for analysis and modeling.
When cleaning data for ML, the first step is to identify the data problems that need to be addressed. These can include missing values, outliers, inconsistent values, duplicates, and errors. Missing values are values that were never recorded or are unavailable, and they can introduce errors or biases into the analysis. Outliers are values that differ significantly from the rest of the data and can distort the distribution, mean, and variance. Inconsistent values are values that are not formatted or standardized in the same way and can cause confusion or misinterpretation. Duplicates are records that are repeated or copied in the data; they inflate the dataset's size and skew results. Errors are values that are incorrect or inaccurate due to human or machine mistakes and can compromise the validity and reliability of the data.
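As a minimal sketch of this identification step in pandas, assuming a hypothetical customers.csv file with a mix of numeric and categorical columns (the file name and the 1.5 × IQR outlier rule are illustrative choices, not requirements):

```python
import pandas as pd
import numpy as np

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Missing values per column
print(df.isna().sum())

# Duplicate rows
print("duplicates:", df.duplicated().sum())

# Basic summary statistics help spot impossible or inconsistent values
print(df.describe(include="all"))

# Flag numeric outliers with the IQR rule (beyond 1.5 * IQR from the quartiles)
numeric = df.select_dtypes(include=np.number)
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outlier_mask.sum())  # outlier count per numeric column
```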
-
Harshit Ahluwalia
Growth Hacker | Data Science, Data Visualization, Creative Ideation | 58K+ Followers
Cleaning data for ML involves:
1. Inspecting & profiling: assess data quality; find anomalies, missing values, and inconsistencies.
2. Cleansing: fill or remove missing values, smooth noisy data, and identify and fix errors or outliers.
3. Validating: apply rules to ensure data accuracy and consistency.
4. Standardizing: normalize data formats to conform to standards (e.g., dates, phone numbers).
5. Deduplicating: remove duplicate records to prevent data skew.
6. Verifying: check for data integrity and coherence post-cleanup.
7. Transforming: scale or convert features into formats suitable for ML algorithms (e.g., one-hot encoding for categorical variables).
8. Documenting: record steps and decisions for reproducibility.
-
Sagar More
💡 10x Top LinkedIn Voice ✍️ Author 🗄️ Enterprise Technology Architect 🌟 Digital Transformation Evangelist 🚀 DevSecOps, SRE & Cloud Strategist 🎙️ Public Speaker 🗣 Guest Lecturer 🎓 1:1 Coach 🤝
Cleaning data for ML starts by donning the detective hat. Identify data problems like missing values, outliers, and inconsistencies. It's the investigative phase where you unveil the hidden imperfections in your dataset, setting the stage for data refinement and ensuring your ML model's success.
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
To clean data for ML, the first step is to identify data problems by exploring and understanding the data. We look for missing data, extreme values (outliers), multiple data types in the same field, special characters where a numeric value is expected, and metric values that do not make sense (e.g., a negative printer price, which could be due to returns, but the reason needs to be confirmed). Share the findings with the business SMEs, as they are closer to the business, and get their views. They can explain some of the data issues you identified and may find additional ones.
The second step in cleaning data for ML is to remove or replace the problems you have identified. Depending on the severity and nature of each issue, you can choose from several strategies, such as deleting, imputing, filtering, standardizing, and deduplicating. Deleting is the simplest approach, though it reduces the amount of data available for analysis. Imputing replaces missing or erroneous values with estimated or calculated values, while filtering removes outliers or extreme values. Standardizing makes inconsistent values consistent and comparable, and deduplicating removes duplicate records.
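A minimal pandas sketch of these strategies, assuming hypothetical columns customer_id, income, segment, and country; the thresholds and fill choices are illustrative, not prescriptive:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Deleting: drop rows missing a critical field
df = df.dropna(subset=["customer_id"])

# Imputing: fill numeric gaps with the median, categorical gaps with the mode
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

# Filtering: cap extreme values at the 1st/99th percentiles (winsorizing)
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=low, upper=high)

# Standardizing: normalize text formatting so equivalent values match
df["country"] = df["country"].str.strip().str.upper()

# Deduplicating: keep the first occurrence of each customer
df = df.drop_duplicates(subset=["customer_id"], keep="first")
```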
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
Once we identify the data issues, we apply various treatments to remove them: missing value treatment, outlier treatment, processing a column so it contains only one data type, and so on. If data is missing because of an issue in data extraction or processing (for example, when merging datasets), that can also be fixed. I also want to highlight that, besides cleaning the data, it is very important to understand its business context and the relationships between different metrics. Finally, in our project plans we should reserve significant time to understand and clean the data; this is critical.
-
Shreya Khandelwal
Data Scientist @IBM | Machine Learning | Predictive Modelling | Amazon Connect | Multi-Cloud Certified | AI & Analytics
The second step in cleaning data is to remove or replace data problems. We can use various techniques:
1. Handling missing values: remove rows with missing values if they are a small percentage of the dataset, or replace them with statistical measures such as the mean, median, or mode.
2. Dealing with outliers: remove extreme outliers if they are insignificant, or transform the values to reduce their impact.
3. Standardizing inconsistent values: standardize inconsistent data to ensure uniformity.
4. Handling duplicates: remove duplicate records or merge them, keeping a single instance.
5. Addressing imbalanced data: use techniques like oversampling or undersampling to balance class distributions.
-
Helen Yu
Founder & CEO @ Tigon Advisory Corp. | Host of CXO Spice | Top 50 Women in Tech | Board Director | AI, Cybersecurity, FinTech, Growth Acceleration
Once the data problem is identified and tested, it is important to select the right strategy to fix it (i.e., deleting, imputing, filtering, standardizing, or deduplicating). I always test with sample data in a testing environment prior to implementing the fix in production. If you decide to delete a data set, make sure that data is not needed elsewhere. Understanding the data flow and how the data is used in the data model will help you fix the root cause of the issue.
Exploring and analyzing the data after cleaning is a crucial step to understand the characteristics, patterns, and relationships of the data, as well as identify any potential issues or opportunities for improvement. Common ways to explore and analyze the data include descriptive statistics, visualizations, dimensionality reduction, and clustering. Descriptive statistics are numerical summaries that describe the basic features of the data, such as mean, median, mode, standard deviation, variance, range, and frequency. Visualizations are graphical representations that illustrate the distribution, trends, and correlations of the data. Dimensionality reduction reduces the number of features or variables in the data by extracting the most relevant or informative ones. Lastly, clustering groups the data into similar or homogeneous subsets based on some similarity or distance measures.
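A short sketch of these exploration techniques with pandas and scikit-learn, assuming a hypothetical cleaned file customers_clean.csv with mostly numeric features; the choice of two principal components and four clusters is purely illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("customers_clean.csv")  # hypothetical cleaned dataset
numeric = df.select_dtypes(include="number")

# Descriptive statistics and pairwise correlations
print(numeric.describe())
print(numeric.corr())

# Dimensionality reduction: project scaled features onto 2 principal components
scaled = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Clustering: group observations into 4 clusters (k chosen for illustration)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)
print(pd.Series(labels).value_counts())
```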
-
Paresh Patil
💡Top Data Science Voice | ML, Deep Learning & Python Expert, Data Scientist | Data Visualization & Storytelling | Actively Seeking Opportunities
Exploring and analyzing data is the compass guiding ML cleaning. Dive deep—chart the distributions, spot odd patterns, and question every outlier. This isn’t mere number-crunching; it’s a data detective’s hunt for clues. Your tools? Visualization and statistical tests. Your goal? To distill chaos into patterns a machine can learn from. It's the blend of art and science that marks a data strategist's pathway to crafting models that don't just compute but understand.
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
I believe analyzing and exploring the data is the most important step in an ML/data science project. We need to identify trends, relationships between metrics, the distribution of each metric, and so on. These patterns and relationships need to be captured by the algorithm for it to produce appropriate predictions, so models are chosen based on how well they capture the data's characteristics. For example, if a metric fluctuates significantly, future actuals may lie outside the historical range. Models that can predict outside the historical range of the response variable can extrapolate: linear regression can, for example, whereas a random forest cannot. In that case, we select models that can extrapolate.
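To illustrate the extrapolation point, here is a minimal sketch on synthetic data; the linear trend y = 3x and the training cutoff at x = 10 are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Synthetic trend y = 3x with training inputs limited to [0, 10)
X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 3 * X_train.ravel()

X_future = np.array([[20.0]])  # outside the historical input range

lr = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print(lr.predict(X_future))  # ~60: the linear model extrapolates the trend
print(rf.predict(X_future))  # ~30: capped near the maximum target seen in training
```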
-
Moazzam Mansoob
Business Intelligence Intern @ Rochester Red Wings | Master's in Data Analytics Engineer @ George Mason University | Python, SQL, AWS, R, PowerBI, QuickSight
Suppose we have a cleaned dataset of customer purchase data that we want to use to train an ML model to predict customer churn. To explore and analyze the data, we can use the following steps:
- Descriptive statistics: understand the distribution of the data and identify any outliers.
- Visualizations: create visualizations of the data to identify patterns and trends.
- Dimensionality reduction: reduce the number of variables in the data.
- Clustering: group the data into similar or homogeneous subsets.
By exploring and analyzing the data, we gain a better understanding of it.
The fourth step in cleaning data for ML is to validate and verify the data after you have explored and analyzed it. This step is essential in ensuring that the data is accurate, consistent, and complete, and meets the requirements and expectations of the analysis and modeling. Common methods of validation and verification include quality checks, which test the validity, reliability, and completeness of the data; integrity checks, which check the consistency and coherence of the data; accuracy checks, which check the correctness and precision of the data; and documentation, which records and explains the data cleaning steps, methods, and results.
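One way to express such quality and integrity checks is as rule-based assertions on the cleaned data. A minimal sketch, assuming a hypothetical orders_clean.csv with order_id, price, region, and order_date columns and an illustrative list of valid regions:

```python
import pandas as pd

df = pd.read_csv("orders_clean.csv")  # hypothetical cleaned dataset

# Rule-based quality checks; each violation count should be zero
checks = {
    "negative_price": (df["price"] < 0).sum(),
    "missing_order_id": df["order_id"].isna().sum(),
    "duplicate_order_id": df["order_id"].duplicated().sum(),
    "unknown_region": (~df["region"].isin(["EMEA", "APAC", "AMER"])).sum(),
    "future_order_date": (pd.to_datetime(df["order_date"]) > pd.Timestamp.today()).sum(),
}

for name, violations in checks.items():
    assert violations == 0, f"validation failed: {name} ({violations} rows)"
print("all validation checks passed")
```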
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
Data validation is important; if the input data is incorrect, the output will be of no use because it will no longer be reliable. We can apply rule-based data quality checks, for example: is a price or revenue figure negative, does a record reference a region that is not in the firm's master geography data, is a customer's annual revenue higher than the firm's total yearly revenue, and so on. Review the data summary with business SMEs, as they know their business and data very well.
-
Sayan Chowdhury
Software Developer @ L&T • Digital Solutions • AI-ML Engineer • Learner & Creator • Writer on 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻 Articles & 𝗺𝗲𝗱𝗶𝘂𝗺.𝗰𝗼𝗺 • 2x 𝗛𝗣𝗔𝗜𝗥 Delegate
Data validation and consistency check involve verifying the cleaned data to ensure that it is free from errors, inconsistencies, and anomalies. This step is essential to maintain data quality and reliability throughout the data cleaning process. Here's what it entails: Identify and rectify any errors in the dataset, including typos, formatting issues, or missing values. Correcting these errors ensures that the data is accurate and reliable. Cross-check the data against domain-specific rules and constraints. For instance, if you're dealing with a dataset of customer ages, verify that there are no entries that violate logical constraints (e.g., ages greater than 1000 years) or business rules (e.g., negative account balances).
-
Hardik J.
Software Engineer | Data Engineer | Cloud Engineer | Machine Learning Engineer | MLOps
Validating and verifying data involves conducting checks to ensure data quality and accuracy. This includes cross-referencing data against reliable sources, performing sanity checks, and validating data against expected ranges or constraints. Data validation is crucial to confirm that the cleaned dataset aligns with the project's goals and domain requirements.
The fifth step in cleaning data for ML is to split and shuffle the data before using it to train and test models. This helps avoid overfitting or underfitting and improves the models' generalization and performance. Common techniques include a train-test split, which divides the data into two subsets, usually in an 80-20 or 70-30 ratio. Cross-validation further splits the training data into smaller subsets, called folds; each fold takes a turn as the validation set for evaluating the model while the remaining folds are used for training. Randomization randomly reorders the data before splitting, to ensure the split is not biased or skewed by any existing order or sequence.
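A minimal scikit-learn sketch of an 80-20 split followed by 5-fold cross-validation; the synthetic dataset and logistic regression model stand in for your own cleaned data and chosen estimator:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for your cleaned data

# Train-test split (80-20), shuffled by default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# 5-fold cross-validation on the training data
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy per fold:", scores.round(3))

# Final check on the held-out test set
print("test accuracy:", model.fit(X_train, y_train).score(X_test, y_test))
```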
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
Data is split into training and test sets to avoid overfitting. We build the model on the training data and check whether it performs well on the test data. We can also split the data into multiple parts, known as folds: with n folds, we train on n-1 folds and validate the model's performance on the remaining one. The objective is to check whether the model sees most of the business scenarios during training or whether many scenarios are missed. If many scenarios are missed, there is a higher chance that some of them will occur in the future, and the model will not be able to predict them because they were not part of the training data.
-
Jefin Paul
Machine Learning and Data Mining Student at Jean Monnet Université
How you split the data can also depend on the dataset and the problem; it's essential to consider the goal of your analysis. As explained above, the data can be split into three subsets (train, validation, and test, e.g., 60-20-20 or 70-20-10), as sketched below. This is particularly useful when you need to fine-tune the model's parameters, and it also helps avoid overfitting. We also often get imbalanced data, which we cannot simply split; we first need to deal with the imbalance. This matters because ML algorithms tend to perform poorly on minority classes, since the algorithm may become biased towards the majority classes. Techniques like oversampling the minority class or undersampling the majority class can help.
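A sketch of a stratified 60-20-20 three-way split, which keeps class proportions roughly equal in each subset; the synthetic ~10%-positive dataset is a stand-in for real imbalanced data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# First cut: 60% train, 40% temporary pool, preserving class ratios
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
# Second cut: split the pool evenly into validation and test (20% each overall)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, "positive rate:", np.round(labels.mean(), 3))
```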
-
Govardhana Miriyala Kannaiah
Founder @ NeuVeu | I bring a 'new' view to your Digital and Cloud Transformation Journey | MLOPS | AIOPS | Kubernetes | Cloud | DevOps | FinOps | GitOps | SRE | Platform Engineering
You can divide the data into chunks for training and testing, often like 80-20 or 70-30. To fine-tune, cross-validation helps by creating smaller subsets for validation. And hey, don't forget to shuffle things around randomly so your data stays unbiased and fresh before splitting.
The sixth and final step in cleaning data for ML is preprocessing and transforming it according to the needs and specifications of your models. This step can help optimize and enhance the features or variables in the data, making them more suitable and effective for your models. Common ways to preprocess and transform the data include encoding, scaling, normalization, and feature engineering. Encoding is converting categorical or nominal features into numerical or ordinal features, such as label encoding, one-hot encoding, or binary encoding. Scaling is adjusting the range or magnitude of numerical features to make them more comparable and compatible, such as min-max scaling, standard scaling, or robust scaling. Normalization is changing the shape or distribution of numerical features to make them more symmetrical and standardized, such as z-score normalization, log normalization, or box-cox normalization. Feature engineering is creating new features or modifying existing ones to make them more relevant or informative for your models, such as feature extraction, feature selection, feature construction, or feature interaction.
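A compact scikit-learn sketch of encoding, scaling, and feature engineering in one place, using a hypothetical feature table with plan, monthly_spend, and tenure_months columns; the derived ratio feature is purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({                       # hypothetical feature table
    "plan": ["basic", "pro", "basic", "enterprise"],
    "monthly_spend": [20.0, 99.0, 25.0, 499.0],
    "tenure_months": [3, 24, 7, 60],
})

# Feature engineering: derive a new ratio feature from existing columns
df["spend_per_tenure_month"] = df["monthly_spend"] / df["tenure_months"]

preprocess = ColumnTransformer([
    # Encoding: one-hot encode the categorical column
    ("plan", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    # Scaling: standardize numeric columns to zero mean, unit variance
    ("num", StandardScaler(), ["monthly_spend", "tenure_months", "spend_per_tenure_month"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, 6): 3 one-hot columns + 3 scaled numeric columns
```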
-
Rakesh Mohandas
Generative AI Ambassadors @ Google | Google Cloud Platform Fundamentals
🔍 Data cleaning is pivotal for ML success. Here's a distilled guide: 1️⃣ Spot issues: Detect missing data, outliers, and inaccuracies. 2️⃣ Fix errors: Impute gaps and rectify anomalies. 3️⃣ Uniform formats: Standardize data for consistency. 4️⃣ Validate: Ensure data integrity and accuracy. 5️⃣ Prep for ML: Split and shuffle to enhance model robustness. 6️⃣ Tailor features: Encode and scale for optimal algorithm performance. Each step is crucial for crafting models that deliver precise insights. #DataScience #MachineLearning #AI #DataQuality
-
Tochukwu Okonkwor
Lead Principal Enterprise/Security Architect @ Xyples | Enterprise, Security and Solution Architect, Automation and Programmability
- Feature scaling: normalize or standardize features to bring them to a similar scale.
- Encoding categorical data: convert categorical variables into numerical representations, e.g., one-hot encoding.
- Dimensionality reduction: apply techniques like Principal Component Analysis (PCA) to reduce feature dimensions.
-
Sitraka Forler
Economist & Senior Data Scientist | NLP | Digital Transformation
I think that oversampling the minority class or undersampling the majority class to balance the data should be done here. You can use methods such as Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples. Secondly, if your data includes text, you may need to perform text preprocessing, which can involve tasks like tokenization, stop-word removal, stemming, or lemmatization, and converting text into numerical representations using techniques like TF-IDF or word embeddings (Word2Vec, GloVe). Finally, we need to remember that different machine learning algorithms have specific preprocessing requirements. Decision trees or random forests may not require feature scaling, while neural networks often benefit from it.
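A brief sketch of both ideas, assuming the third-party imbalanced-learn package is installed for SMOTE; the synthetic labels and the two example documents are illustrative only:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

# Imbalanced stand-in data: roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before SMOTE:", Counter(y))

# Generate synthetic minority-class samples (in a real project, apply this to the
# training split only to avoid leakage; the whole stand-in set is used here for brevity)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_res))

# Text preprocessing: turn raw text into numerical TF-IDF features
docs = ["late delivery, product arrived damaged", "great service and fast delivery"]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(tfidf_matrix.shape)  # 2 documents x vocabulary size
```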
-
Abdullateef Opeyemi Bakare
Energy | AI | Data Science
It is important to understand the context of the data, or read the documentation of the data collection process, before cleaning begins; this can inform decisions throughout the cleaning process. For example, at work I was once asked to assess the quality of a dataset and gave in to the urge to dive head-first into the work without properly reading the documentation, and it backfired. I had incorrectly assessed the data as "complete" because there were no missing values (i.e., NaN, NA, null, etc.) observed in it. A senior colleague pointed me to the documentation, and I found that the missing values were actually coded as negative numbers, each connoting a different circumstance in the dataset.
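A tiny sketch of handling such sentinel-coded missingness in pandas; the column name and the specific negative codes are hypothetical, standing in for whatever the data collection documentation defines:

```python
import numpy as np
import pandas as pd

# Hypothetical survey column where the collectors coded missingness as
# negative sentinels (-7 = refused, -8 = not asked, -9 = don't know)
df = pd.DataFrame({"age": [34, -7, 51, -9, 28, -8]})

print(df["age"].isna().sum())  # 0 -- the column looks "complete" but is not

# Map the documented sentinel codes to real missing values before assessing quality
sentinels = {-7: np.nan, -8: np.nan, -9: np.nan}
df["age"] = df["age"].replace(sentinels)
print(df["age"].isna().sum())  # 3
```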
-
Sayan Chowdhury
Software Developer @ L&T • Digital Solutions • AI-ML Engineer • Learner & Creator • Writer on 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻 Articles & 𝗺𝗲𝗱𝗶𝘂𝗺.𝗰𝗼𝗺 • 2x 𝗛𝗣𝗔𝗜𝗥 Delegate
Documentation in the context of data cleaning serves several important purposes:
- Reproducibility: detailed documentation ensures that the data cleaning process is transparent and reproducible. It allows you or others to understand and replicate the steps taken to transform and clean the data.
- Error tracking: if errors or issues are encountered during data cleaning, documentation helps pinpoint when and where they occurred, making it easier to identify and rectify problems in the data.
- Data quality assessment: documentation allows you to record the quality of the data at different stages of the cleaning process, track improvements over time, and assess the impact of various cleaning procedures.
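One lightweight way to capture this record programmatically is to log each cleaning step alongside its effect on the data. A minimal sketch, assuming a hypothetical customers.csv file and a required customer_id column:

```python
import pandas as pd

def log_step(log, step, df_before, df_after, note=""):
    """Append one record describing a cleaning step and its effect on row count."""
    log.append({
        "step": step,
        "rows_before": len(df_before),
        "rows_after": len(df_after),
        "note": note,
    })

cleaning_log = []
df = pd.read_csv("customers.csv")          # hypothetical dataset

deduped = df.drop_duplicates()
log_step(cleaning_log, "drop_duplicates", df, deduped)

no_missing_id = deduped.dropna(subset=["customer_id"])
log_step(cleaning_log, "drop_missing_id", deduped, no_missing_id,
         note="customer_id is required downstream")

# Persist the log alongside the cleaned data for reproducibility
pd.DataFrame(cleaning_log).to_csv("cleaning_log.csv", index=False)
```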
-
Helen Yu
Founder & CEO @ Tigon Advisory Corp. | Host of CXO Spice | Top 50 Women in Tech | Board Director | AI, Cybersecurity, FinTech, Growth Acceleration
Having a well-defined data strategy and knowing your data are foundational for data cleaning in ML. Run a data quality assessment to check for duplicates, missing values, and outliers.