What are the steps involved in cleaning data for ML?
Data cleaning is a crucial step in any machine learning project, as it can affect the quality, accuracy, and performance of your models. However, data cleaning can also be a challenging and time-consuming task, as it involves various steps and techniques to deal with different types of data issues. In this article, we will explore some of the common steps involved in cleaning data for ML, and how they can help you prepare your data for analysis and modeling.
When cleaning data for ML, the first step is to identify the data problems that need to be addressed. These can include missing values, outliers, inconsistent values, duplicates, and errors. Missing values are values that were never recorded or are unavailable, and they can introduce errors or biases into the analysis. Outliers are values that differ significantly from the rest of the data and can distort the distribution, mean, and variance. Inconsistent values are values that are not formatted or standardized in the same way and can cause confusion or misinterpretation. Duplicates are records that are repeated or copied in the data; they inflate the dataset's size and skew results. Errors are values that are incorrect or inaccurate due to human or machine mistakes and can compromise the validity and reliability of the data.
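As a minimal sketch of this identification step in pandas, assuming a hypothetical customers.csv file with a mix of numeric and categorical columns (the file name and the 1.5 × IQR outlier rule are illustrative choices, not requirements):

```python
import pandas as pd
import numpy as np

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Missing values per column
print(df.isna().sum())

# Duplicate rows
print("duplicates:", df.duplicated().sum())

# Basic summary statistics help spot impossible or inconsistent values
print(df.describe(include="all"))

# Flag numeric outliers with the IQR rule (beyond 1.5 * IQR from the quartiles)
numeric = df.select_dtypes(include=np.number)
q1, q3 = numeric.quantile(0.25), numeric.quantile(0.75)
iqr = q3 - q1
outlier_mask = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
print(outlier_mask.sum())  # outlier count per numeric column
```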
-
Harshit Ahluwalia
Growth Hacker | Data Science, Data Visualization, Creative Ideation | 58K+ Followers
Cleaning data for ML involves:
1. Inspecting & profiling: assess data quality; find anomalies, missing values, and inconsistencies.
2. Cleansing: fill or remove missing values, smooth noisy data, and identify and fix errors or outliers.
3. Validating: apply rules to ensure data accuracy and consistency.
4. Standardizing: normalize data formats to conform to standards (e.g., dates, phone numbers).
5. Deduplicating: remove duplicate records to prevent data skew.
6. Verifying: check for data integrity and coherence post-cleanup.
7. Transforming: scale or convert features into formats suitable for ML algorithms (e.g., one-hot encoding for categorical variables).
8. Documenting: record steps and decisions for reproducibility.
-
Sagar More
💡 10x Top LinkedIn Voice ✍️ Author 🗄️ Enterprise Technology Architect 🌟 Digital Transformation Evangelist 🚀 DevSecOps, SRE & Cloud Strategist 🎙️ Public Speaker 🗣 Guest Lecturer 🎓 1:1 Coach 🤝
Cleaning data for ML starts by donning the detective hat. Identify data problems like missing values, outliers, and inconsistencies. It's the investigative phase where you unveil the hidden imperfections in your dataset, setting the stage for data refinement and ensuring your ML model's success.
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
To clean data for ML, the first step is to identify data problems by exploring and understanding the data. We look for missing data, extreme values (outliers), multiple data types in the same field, special characters where a numeric value is expected, and metric values that do not make sense (e.g., a negative printer price, which could be due to returns, but the reason needs to be confirmed). Share the findings with the business SMEs, as they are closer to the business, and get their views. They can explain some of the data issues you identified and may find additional ones.
The second step in cleaning data for ML is to remove or replace the problems you have identified. Depending on the severity and nature of each issue, you can choose from several strategies, such as deleting, imputing, filtering, standardizing, and deduplicating. Deleting is the simplest approach, though it reduces the amount of data available for analysis. Imputing replaces missing or erroneous values with estimated or calculated values, while filtering removes outliers or extreme values. Standardizing makes inconsistent values consistent and comparable, and deduplicating removes duplicate records.
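A minimal pandas sketch of these strategies, assuming hypothetical columns customer_id, income, segment, and country; the thresholds and fill choices are illustrative, not prescriptive:

```python
import pandas as pd

df = pd.read_csv("customers.csv")  # hypothetical dataset

# Deleting: drop rows missing a critical field
df = df.dropna(subset=["customer_id"])

# Imputing: fill numeric gaps with the median, categorical gaps with the mode
df["income"] = df["income"].fillna(df["income"].median())
df["segment"] = df["segment"].fillna(df["segment"].mode()[0])

# Filtering: cap extreme values at the 1st/99th percentiles (winsorizing)
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=low, upper=high)

# Standardizing: normalize text formatting so equivalent values match
df["country"] = df["country"].str.strip().str.upper()

# Deduplicating: keep the first occurrence of each customer
df = df.drop_duplicates(subset=["customer_id"], keep="first")
```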
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
Once we identify the data issues, we apply various treatments to remove them: missing value treatment, outlier treatment, processing a column so it contains only one data type, and so on. If data is missing because of an issue in data extraction or processing (for example, when merging datasets), that can also be fixed. I also want to highlight that, besides cleaning the data, it is very important to understand its business context and the relationships between different metrics. Finally, in our project plans we should reserve significant time to understand and clean the data; this is critical.
-
Shreya Khandelwal
Data Scientist @IBM | Machine Learning | Predictive Modelling | Amazon Connect | Multi-Cloud Certified | AI & Analytics
The second step in cleaning data is to remove or replace data problems. We can use various techniques:
1. Handling missing values: remove rows with missing values if they are a small percentage of the dataset, or replace them with statistical measures such as the mean, median, or mode.
2. Dealing with outliers: remove extreme outliers if they are insignificant, or transform the values to reduce their impact.
3. Standardizing inconsistent values: standardize inconsistent data to ensure uniformity.
4. Handling duplicates: remove duplicate records or merge them, keeping a single instance.
5. Addressing imbalanced data: use techniques like oversampling or undersampling to balance class distributions.
-
Helen Yu
Founder & CEO @ Tigon Advisory Corp. | Host of CXO Spice | Top 50 Women in Tech | Board Director | AI, Cybersecurity, FinTech, Growth Acceleration
Once the data problem is identified and tested, it is important to select the right strategy to fix it (i.e., deleting, imputing, filtering, standardizing, or deduplicating). I always test with sample data in a testing environment prior to implementing the fix in production. If you decide to delete a data set, make sure that data is not needed elsewhere. Understanding the data flow and how the data is used in the data model will help you fix the root cause of the issue.
Exploring and analyzing the data after cleaning is a crucial step to understand the characteristics, patterns, and relationships of the data, as well as identify any potential issues or opportunities for improvement. Common ways to explore and analyze the data include descriptive statistics, visualizations, dimensionality reduction, and clustering. Descriptive statistics are numerical summaries that describe the basic features of the data, such as mean, median, mode, standard deviation, variance, range, and frequency. Visualizations are graphical representations that illustrate the distribution, trends, and correlations of the data. Dimensionality reduction reduces the number of features or variables in the data by extracting the most relevant or informative ones. Lastly, clustering groups the data into similar or homogeneous subsets based on some similarity or distance measures.
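A short sketch of these exploration techniques with pandas and scikit-learn, assuming a hypothetical cleaned file customers_clean.csv with mostly numeric features; the choice of two principal components and four clusters is purely illustrative:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

df = pd.read_csv("customers_clean.csv")  # hypothetical cleaned dataset
numeric = df.select_dtypes(include="number")

# Descriptive statistics and pairwise correlations
print(numeric.describe())
print(numeric.corr())

# Dimensionality reduction: project scaled features onto 2 principal components
scaled = StandardScaler().fit_transform(numeric)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Clustering: group observations into 4 clusters (k chosen for illustration)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(scaled)
print(pd.Series(labels).value_counts())
```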
-
Paresh Patil
💡Top Data Science Voice | ML, Deep Learning & Python Expert, Data Scientist | Data Visualization & Storytelling | Actively Seeking Opportunities
Exploring and analyzing data is the compass guiding ML cleaning. Dive deep—chart the distributions, spot odd patterns, and question every outlier. This isn’t mere number-crunching; it’s a data detective’s hunt for clues. Your tools? Visualization and statistical tests. Your goal? To distill chaos into patterns a machine can learn from. It's the blend of art and science that marks a data strategist's pathway to crafting models that don't just compute but understand.
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
I believe analyzing and exploring the data is the most important step in an ML/data science project. We need to identify trends, relationships between metrics, the distribution of each metric, and so on. These patterns and relationships need to be captured by the algorithm for it to produce appropriate predictions, so models are chosen based on how well they capture the data's characteristics. For example, if a metric fluctuates significantly, future actuals may lie outside the historical range. Models that can predict outside the historical range of the response variable can extrapolate: linear regression can, for example, whereas a random forest cannot. In that case, we select models that can extrapolate.
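To illustrate the extrapolation point, here is a minimal sketch on synthetic data; the linear trend y = 3x and the training cutoff at x = 10 are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

# Synthetic trend y = 3x with training inputs limited to [0, 10)
X_train = np.arange(0, 10, 0.1).reshape(-1, 1)
y_train = 3 * X_train.ravel()

X_future = np.array([[20.0]])  # outside the historical input range

lr = LinearRegression().fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

print(lr.predict(X_future))  # ~60: the linear model extrapolates the trend
print(rf.predict(X_future))  # ~30: capped near the maximum target seen in training
```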
-
Moazzam Mansoob
Business Intelligence Intern @ Rochester Red Wings | Master's in Data Analytics Engineer @ George Mason University | Python, SQL, AWS, R, PowerBI, QuickSight
Suppose we have a cleaned dataset of customer purchase data that we want to use to train an ML model to predict customer churn. To explore and analyze the data, we can use the following steps:
- Descriptive statistics: understand the distribution of the data and identify any outliers.
- Visualizations: create visualizations of the data to identify patterns and trends.
- Dimensionality reduction: reduce the number of variables in the data.
- Clustering: group the data into similar or homogeneous subsets.
By exploring and analyzing the data, we gain a better understanding of it.
The fourth step in cleaning data for ML is to validate and verify the data after you have explored and analyzed it. This step is essential in ensuring that the data is accurate, consistent, and complete, and meets the requirements and expectations of the analysis and modeling. Common methods of validation and verification include quality checks, which test the validity, reliability, and completeness of the data; integrity checks, which check the consistency and coherence of the data; accuracy checks, which check the correctness and precision of the data; and documentation, which records and explains the data cleaning steps, methods, and results.
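One way to express such quality and integrity checks is as rule-based assertions on the cleaned data. A minimal sketch, assuming a hypothetical orders_clean.csv with order_id, price, region, and order_date columns and an illustrative list of valid regions:

```python
import pandas as pd

df = pd.read_csv("orders_clean.csv")  # hypothetical cleaned dataset

# Rule-based quality checks; each violation count should be zero
checks = {
    "negative_price": (df["price"] < 0).sum(),
    "missing_order_id": df["order_id"].isna().sum(),
    "duplicate_order_id": df["order_id"].duplicated().sum(),
    "unknown_region": (~df["region"].isin(["EMEA", "APAC", "AMER"])).sum(),
    "future_order_date": (pd.to_datetime(df["order_date"]) > pd.Timestamp.today()).sum(),
}

for name, violations in checks.items():
    assert violations == 0, f"validation failed: {name} ({violations} rows)"
print("all validation checks passed")
```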
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
Data validation is important; if the input data is incorrect, the output will be of no use because it will no longer be reliable. We can apply rule-based data quality checks, for example: is a price or revenue figure negative, does a record reference a region that is not in the firm's master geography data, is a customer's annual revenue higher than the firm's total yearly revenue, and so on. Review the data summary with business SMEs, as they know their business and data very well.
-
Sayan Chowdhury
Software Developer @ L&T • Digital Solutions • AI-ML Engineer • Learner & Creator • Writer on 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻 Articles & 𝗺𝗲𝗱𝗶𝘂𝗺.𝗰𝗼𝗺 • 2x 𝗛𝗣𝗔𝗜𝗥 Delegate
Data validation and consistency check involve verifying the cleaned data to ensure that it is free from errors, inconsistencies, and anomalies. This step is essential to maintain data quality and reliability throughout the data cleaning process. Here's what it entails: Identify and rectify any errors in the dataset, including typos, formatting issues, or missing values. Correcting these errors ensures that the data is accurate and reliable. Cross-check the data against domain-specific rules and constraints. For instance, if you're dealing with a dataset of customer ages, verify that there are no entries that violate logical constraints (e.g., ages greater than 1000 years) or business rules (e.g., negative account balances).
-
Hardik J.
Software Engineer | Data Engineer | Cloud Engineer | Machine Learning Engineer | MLOps
Validating and verifying data involves conducting checks to ensure data quality and accuracy. This includes cross-referencing data against reliable sources, performing sanity checks, and validating data against expected ranges or constraints. Data validation is crucial to confirm that the cleaned dataset aligns with the project's goals and domain requirements.
The fifth step in cleaning data for ML is to split and shuffle the data before using it to train and test models. This helps avoid overfitting or underfitting and improves the models' generalization and performance. Common techniques include a train-test split, which divides the data into two subsets, usually in an 80-20 or 70-30 ratio. Cross-validation further splits the training data into smaller subsets, called folds; each fold takes a turn as the validation set for evaluating the model while the remaining folds are used for training. Randomization randomly reorders the data before splitting, to ensure the split is not biased or skewed by any existing order or sequence.
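A minimal scikit-learn sketch of an 80-20 split followed by 5-fold cross-validation; the synthetic dataset and logistic regression model stand in for your own cleaned data and chosen estimator:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, random_state=0)  # stand-in for your cleaned data

# Train-test split (80-20), shuffled by default
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

# 5-fold cross-validation on the training data
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy per fold:", scores.round(3))

# Final check on the held-out test set
print("test accuracy:", model.fit(X_train, y_train).score(X_test, y_test))
```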
-
Jyotishko Biswas
Solves Business Problems using AI | AI Leader | 17+ years exp. in AI | Experienced in Generative AI & LLMs | Guest Speaker on AI - IIM, JK Lakshmipat University etc. | Deployed enterprise-wide AI solutions | ex Deloitte
Data is split into training and test sets to avoid overfitting. We build the model on the training data and check whether it performs well on the test data. We can also split the data into multiple parts, known as folds: with n folds, we train on n-1 folds and validate the model's performance on the remaining one. The objective is to check whether the model sees most of the business scenarios during training or whether many scenarios are missed. If many scenarios are missed, there is a higher chance that some of them will occur in the future, and the model will not be able to predict them because they were not part of the training data.
-
Jefin Paul
Machine Learning and Data Mining Student at Jean Monnet Université
How you split the data can also depend on the dataset and the problem; it's essential to consider the goal of your analysis. As explained above, the data can be split into three subsets (train, validation, and test, e.g., 60-20-20 or 70-20-10), as sketched below. This is particularly useful when you need to fine-tune the model's parameters, and it also helps avoid overfitting. We also often get imbalanced data, which we cannot simply split; we first need to deal with the imbalance. This matters because ML algorithms tend to perform poorly on minority classes, since the algorithm may become biased towards the majority classes. Techniques like oversampling the minority class or undersampling the majority class can help.
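A sketch of a stratified 60-20-20 three-way split, which keeps class proportions roughly equal in each subset; the synthetic ~10%-positive dataset is a stand-in for real imbalanced data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Imbalanced stand-in data: roughly 10% positives
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# First cut: 60% train, 40% temporary pool, preserving class ratios
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
# Second cut: split the pool evenly into validation and test (20% each overall)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=0
)

for name, labels in [("train", y_train), ("val", y_val), ("test", y_test)]:
    print(name, "positive rate:", np.round(labels.mean(), 3))
```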
-
Govardhana Miriyala Kannaiah
Founder @ NeuVeu | I bring a 'new' view to your Digital and Cloud Transformation Journey | MLOPS | AIOPS | Kubernetes | Cloud | DevOps | FinOps | GitOps | SRE | Platform Engineering
You can divide the data into chunks for training and testing, often like 80-20 or 70-30. To fine-tune, cross-validation helps by creating smaller subsets for validation. And hey, don't forget to shuffle things around randomly so your data stays unbiased and fresh before splitting.
The sixth and final step in cleaning data for ML is preprocessing and transforming it according to the needs and specifications of your models. This step can help optimize and enhance the features or variables in the data, making them more suitable and effective for your models. Common ways to preprocess and transform the data include encoding, scaling, normalization, and feature engineering. Encoding is converting categorical or nominal features into numerical or ordinal features, such as label encoding, one-hot encoding, or binary encoding. Scaling is adjusting the range or magnitude of numerical features to make them more comparable and compatible, such as min-max scaling, standard scaling, or robust scaling. Normalization is changing the shape or distribution of numerical features to make them more symmetrical and standardized, such as z-score normalization, log normalization, or box-cox normalization. Feature engineering is creating new features or modifying existing ones to make them more relevant or informative for your models, such as feature extraction, feature selection, feature construction, or feature interaction.
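A compact scikit-learn sketch of encoding, scaling, and feature engineering in one place, using a hypothetical feature table with plan, monthly_spend, and tenure_months columns; the derived ratio feature is purely illustrative:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({                       # hypothetical feature table
    "plan": ["basic", "pro", "basic", "enterprise"],
    "monthly_spend": [20.0, 99.0, 25.0, 499.0],
    "tenure_months": [3, 24, 7, 60],
})

# Feature engineering: derive a new ratio feature from existing columns
df["spend_per_tenure_month"] = df["monthly_spend"] / df["tenure_months"]

preprocess = ColumnTransformer([
    # Encoding: one-hot encode the categorical column
    ("plan", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
    # Scaling: standardize numeric columns to zero mean, unit variance
    ("num", StandardScaler(), ["monthly_spend", "tenure_months", "spend_per_tenure_month"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # (4, 6): 3 one-hot columns + 3 scaled numeric columns
```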
-
Rakesh Mohandas
Generative AI Ambassadors @ Google | Google Cloud Platform Fundamentals
🔍 Data cleaning is pivotal for ML success. Here's a distilled guide: 1️⃣ Spot issues: Detect missing data, outliers, and inaccuracies. 2️⃣ Fix errors: Impute gaps and rectify anomalies. 3️⃣ Uniform formats: Standardize data for consistency. 4️⃣ Validate: Ensure data integrity and accuracy. 5️⃣ Prep for ML: Split and shuffle to enhance model robustness. 6️⃣ Tailor features: Encode and scale for optimal algorithm performance. Each step is crucial for crafting models that deliver precise insights. #DataScience #MachineLearning #AI #DataQuality
-
Tochukwu Okonkwor
Lead Principal Enterprise/Security Architect @ Xyples | Enterprise, Security and Solution Architect, Automation and Programmability
- Feature scaling: normalize or standardize features to bring them to a similar scale.
- Encoding categorical data: convert categorical variables into numerical representations, e.g., one-hot encoding.
- Dimensionality reduction: apply techniques like Principal Component Analysis (PCA) to reduce feature dimensions.
-
Sitraka Forler
Economist & Senior Data Scientist | NLP | Digital Transformation
I think that oversampling the minority class or undersampling the majority class to balance the data should be done here. You can use methods such as Synthetic Minority Over-sampling Technique (SMOTE) to generate synthetic samples. Secondly, if your data includes text, you may need to perform text preprocessing, which can involve tasks like tokenization, stop-word removal, stemming, or lemmatization, and converting text into numerical representations using techniques like TF-IDF or word embeddings (Word2Vec, GloVe). Finally, we need to remember that different machine learning algorithms have specific preprocessing requirements. Decision trees or random forests may not require feature scaling, while neural networks often benefit from it.
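A brief sketch of both ideas, assuming the third-party imbalanced-learn package is installed for SMOTE; the synthetic labels and the two example documents are illustrative only:

```python
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_extraction.text import TfidfVectorizer
from imblearn.over_sampling import SMOTE  # requires the imbalanced-learn package

# Imbalanced stand-in data: roughly 5% positives
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print("before SMOTE:", Counter(y))

# Generate synthetic minority-class samples (in a real project, apply this to the
# training split only to avoid leakage; the whole stand-in set is used here for brevity)
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_res))

# Text preprocessing: turn raw text into numerical TF-IDF features
docs = ["late delivery, product arrived damaged", "great service and fast delivery"]
tfidf_matrix = TfidfVectorizer(stop_words="english").fit_transform(docs)
print(tfidf_matrix.shape)  # 2 documents x vocabulary size
```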
-
Abdullateef Opeyemi Bakare
Energy | AI | Data Science
It is important to understand the context of the data, or read the documentation of the data collection process, before cleaning begins; this can inform decisions throughout the cleaning process. For example, at work I was once asked to assess the quality of a dataset and gave in to the urge to dive head-first into the work without properly reading the documentation, and it backfired. I had incorrectly assessed the data as "complete" because there were no missing values (i.e., NaN, NA, null, etc.) observed in it. A senior colleague pointed me to the documentation, and I found that the missing values were actually coded as negative numbers, each connoting a different circumstance in the dataset.
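A tiny sketch of handling such sentinel-coded missingness in pandas; the column name and the specific negative codes are hypothetical, standing in for whatever the data collection documentation defines:

```python
import numpy as np
import pandas as pd

# Hypothetical survey column where the collectors coded missingness as
# negative sentinels (-7 = refused, -8 = not asked, -9 = don't know)
df = pd.DataFrame({"age": [34, -7, 51, -9, 28, -8]})

print(df["age"].isna().sum())  # 0 -- the column looks "complete" but is not

# Map the documented sentinel codes to real missing values before assessing quality
sentinels = {-7: np.nan, -8: np.nan, -9: np.nan}
df["age"] = df["age"].replace(sentinels)
print(df["age"].isna().sum())  # 3
```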
-
Sayan Chowdhury
Software Developer @ L&T • Digital Solutions • AI-ML Engineer • Learner & Creator • Writer on 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻 Articles & 𝗺𝗲𝗱𝗶𝘂𝗺.𝗰𝗼𝗺 • 2x 𝗛𝗣𝗔𝗜𝗥 Delegate
Documentation in the context of data cleaning serves several important purposes:
- Reproducibility: detailed documentation ensures that the data cleaning process is transparent and reproducible. It allows you or others to understand and replicate the steps taken to transform and clean the data.
- Error tracking: if errors or issues are encountered during data cleaning, documentation helps pinpoint when and where they occurred, making it easier to identify and rectify problems in the data.
- Data quality assessment: documentation allows you to record the quality of the data at different stages of the cleaning process, track improvements over time, and assess the impact of various cleaning procedures.
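One lightweight way to capture this record programmatically is to log each cleaning step alongside its effect on the data. A minimal sketch, assuming a hypothetical customers.csv file and a required customer_id column:

```python
import pandas as pd

def log_step(log, step, df_before, df_after, note=""):
    """Append one record describing a cleaning step and its effect on row count."""
    log.append({
        "step": step,
        "rows_before": len(df_before),
        "rows_after": len(df_after),
        "note": note,
    })

cleaning_log = []
df = pd.read_csv("customers.csv")          # hypothetical dataset

deduped = df.drop_duplicates()
log_step(cleaning_log, "drop_duplicates", df, deduped)

no_missing_id = deduped.dropna(subset=["customer_id"])
log_step(cleaning_log, "drop_missing_id", deduped, no_missing_id,
         note="customer_id is required downstream")

# Persist the log alongside the cleaned data for reproducibility
pd.DataFrame(cleaning_log).to_csv("cleaning_log.csv", index=False)
```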
-
Helen Yu
Founder & CEO @ Tigon Advisory Corp. | Host of CXO Spice | Top 50 Women in Tech | Board Director | AI, Cybersecurity, FinTech, Growth Acceleration
Having a well-defined data strategy and knowing your data are foundational for data cleaning in ML. Run a data quality assessment to check for duplicates, missing values, and outliers.