How can you remove outliers for a specific ML task?
Outliers are data points that deviate significantly from the rest of the distribution. They can hurt the performance and accuracy of your machine learning models, especially when they are not representative of the underlying problem or domain. Therefore, it is important to identify and handle outliers appropriately for your specific ML task, depending on the type of data, the algorithm, and the objective. In this article, you will learn some common methods and criteria for outlier detection and removal, along with examples and code snippets to help you apply them.
Before you can remove outliers, you need to define what constitutes an outlier for your specific ML task. There is no single definition, as different data sets and domains may have different characteristics and assumptions. However, some common criteria include: data points that fall outside a certain range or threshold, such as the mean plus or minus three standard deviations, or more than 1.5 times the interquartile range below the first quartile or above the third quartile; data points that have high leverage or influence on a regression fit, as measured by Cook's distance or DFFITS; and data points that have a low density or probability compared to the rest of the data, as flagged by the local outlier factor or an isolation forest. You can use various statistical tests, graphical methods, or machine learning algorithms to identify outliers based on these criteria, depending on the nature and dimensionality of your data.
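As a minimal sketch of the two threshold-based criteria above (the data values here are made up for illustration, and the cutoffs should be tuned per dataset):

    import pandas as pd

    s = pd.Series([10, 12, 11, 13, 12, 95, 11, 10, 14, -40])

    # Z-score rule: flag points more than three standard deviations from the mean.
    z = (s - s.mean()) / s.std()
    z_outliers = s[z.abs() > 3]

    # IQR rule: flag points more than 1.5 * IQR beyond the first or third quartile.
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    iqr_outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]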
-
Krishna reddy Konda
Computer vision and Machine learning Team leader @ZF TCI
Outliers should not really be removed; in reality they are legitimate data points that are part of the actual data. Nonetheless, there are two scenarios: 1) the data collection might be faulty, producing some wrong responses that should be removed after proper investigation; 2) the data points are too few to capture the actual distribution, which creates an illusion of outliers. Given enough data, these outliers will turn out to be part of the distribution. In the second case, outliers are generally removed because of the limitations of the model rather than any lack of usefulness in the data. In summary, better data collection and increasingly capable models will solve the problem of outliers without actually discarding them.
-
Ayorinde Alase
Business Transformation @AXA ,Machine Learning Researcher, Business Innovation with Data
Removing outliers is crucial because they have the potential to skew the results of your model and lead to poor generalization. However, the definition of an outlier may vary depending on the specific case and domain.
1. Begin by visualizing your data using various plots such as boxplots.
2. Calculate statistical measures such as the mean, standard deviation, and quartiles, and use them to establish a threshold for identifying outliers.
3. There are several methods to handle outliers, and the choice should be based on the nature of your data. Domain knowledge is key in this process, and the interquartile range can be useful.
4. Once you have removed the outliers, perform EDA again to confirm that the distribution appears more reasonable.
-
Zia Beheshtifard
Chief Technology Officer
In my experience, in addition to the mentioned methods for identifying outliers, data clustering can be very useful in this regard. Clustering-based outlier detection leverages the inherent structure and patterns within the data to detect outliers. By grouping data points into clusters, it identifies points that are assigned to small clusters. Various techniques, such as density-based clustering and outlier scoring methods like LOF and Isolation Forest, aid in quantifying the outliers. This approach finds applications in diverse domains, including fraud detection, anomaly detection, and quality control, allowing for the identification of data points that deviate significantly from the distribution of the majority of the dataset.
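A minimal sketch of this clustering-based idea with scikit-learn, using DBSCAN (points labeled -1 belong to no dense cluster and are candidate outliers) and the local outlier factor; the data here is synthetic:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, size=(100, 2)),   # one dense cluster
                   [[8.0, 8.0], [-7.0, 6.0]]])          # two isolated points

    # DBSCAN labels low-density points as noise (-1).
    db_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
    dbscan_outliers = X[db_labels == -1]

    # LOF compares each point's local density to its neighbors'; -1 marks outliers.
    lof_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
    lof_outliers = X[lof_labels == -1]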
Once you have identified the outliers, it is important to consider the trade-offs and implications of removing them from your data set. Removing outliers may reduce noise and distortion; however, it can also mean losing valuable information or insights hidden in those points. It can also reduce the variability and diversity of your data, potentially affecting the generalization and robustness of your ML models. Moreover, there is a risk of introducing bias or distortion if the outliers are not random but systematic or correlated with other variables. Therefore, before removing them, you should always check the validity and relevance of the outliers, and compare the results of your ML models with and without them.
-
Dr Reji Kurien Thomas
I Empower organisations as a Global Technology & Business Transformation Leader | CTO | Harvard Leader | UK House of Lord's Awardee |Fellow Royal Society & CSR Sustainability |Visionary Innovator |CCISO CISM |DBA DSc PhD
Removing outliers for a specific ML task can be done by:
- Identifying outliers using statistical methods like Z-scores or the interquartile range (IQR)
- Visualising data with boxplots or scatter plots to spot anomalies
- Filtering out any data points that fall beyond an acceptable range based on domain knowledge
- Applying robust scaling techniques that reduce the influence of outliers on the model
-
Sanjay Kumar MBA,MS,PhD
When removing outliers from your dataset, it's essential to consider the trade-offs and consequences. While eliminating outliers can reduce noise and distortion, it may also result in the loss of valuable information or insights hidden within them. Additionally, it can reduce data variability and diversity, impacting the generalization and robustness of machine learning models. Removing outliers may introduce bias or distortion if they are systematic or correlated with other variables. To make an informed decision, always assess the validity and relevance of outliers and compare model results with and without them.
-
Abdullateef Opeyemi Bakare
Energy | AI | Data Science
Removing outliers must be done after carefully considering the trade-offs, as Sanjay has correctly posited, and if, after these considerations, you do decide to remove the outliers, it's crucial to document and justify your decision. This helps maintain transparency and ensures that others understand your data preprocessing steps.
An alternative to removing outliers is to replace them with more reasonable values. This can help preserve the size and structure of your data set, as well as some information from the outliers. However, you should be cautious not to introduce more noise or bias in your data by replacing them. There are several methods you can use, such as replacing the outliers with the mean, median, mode, or a constant value. This is a simple and fast approach, but it may reduce the variability and skew the distribution of your data. Alternatively, you could use a random value from a normal or uniform distribution to replace them. This is a more realistic and flexible method, but it may increase the uncertainty and variability of your data. You could also employ a machine learning algorithm such as k-nearest neighbors, linear regression, or neural networks to predict values for the outliers. This is a more sophisticated and accurate method, but it may require more computational resources and assumptions. Python libraries such as pandas, numpy, or sklearn can be used to replace outliers with these methods depending on the type and format of your data.
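As an illustrative sketch of two of these replacement strategies (the column names, values, and thresholds are made up), outliers can be masked and filled with the median, or set to missing and imputed with scikit-learn's KNNImputer:

    import numpy as np
    import pandas as pd
    from sklearn.impute import KNNImputer

    df = pd.DataFrame({"income": [42_000, 45_000, 39_000, 1_200_000, 41_000],
                       "age": [34, 41, 29, 38, 52]})

    # Flag IQR outliers in 'income'.
    q1, q3 = df["income"].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)

    # Option 1: replace with the median (simple and fast, but shrinks variability).
    df_median = df.copy()
    df_median.loc[mask, "income"] = df["income"].median()

    # Option 2: set to NaN and let KNN predict a value from similar rows.
    df_knn = df.copy()
    df_knn.loc[mask, "income"] = np.nan
    df_knn[:] = KNNImputer(n_neighbors=2).fit_transform(df_knn)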
-
Paresh Patil
💡Top Data Science Voice | ML, Deep Learning & Python Expert, Data Scientist | Data Visualization & Storytelling | Actively Seeking Opportunities
Replacing outliers effectively demands a blend of statistical techniques and domain expertise. A common approach is winsorization, capping extreme values to a specified percentile, thus reducing variance without losing data. Another technique is imputation, substituting outliers with mean, median, or mode, which works well for mild anomalies but could introduce bias if overused. Advanced methods like k-nearest neighbors (KNN) or regression imputation leverage patterns within the data for a more informed substitution. Whichever method you choose, it should align with the data distribution and the predictive power of your ML model to ensure data fidelity and model accuracy.
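A minimal winsorization sketch with SciPy, capping the lowest and highest 10% of a made-up array at the corresponding percentiles (the limits are an assumption to tune per dataset):

    import numpy as np
    from scipy.stats.mstats import winsorize

    x = np.array([-20, 1, 2, 2, 3, 3, 4, 4, 5, 50])

    # The most extreme values in each tail are replaced by the values
    # at the 10th and 90th percentiles, respectively.
    x_w = winsorize(x, limits=[0.1, 0.1])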
-
Sayan Chowdhury
Software Developer @ L&T • Digital Solutions • AI-ML Engineer • Learner & Creator • Writer on 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻 Articles & 𝗺𝗲𝗱𝗶𝘂𝗺.𝗰𝗼𝗺 • 2x 𝗛𝗣𝗔𝗜𝗥 Delegate
Mean/Median Imputation: Replace outliers with the mean or median of the feature. This is a straightforward approach and is effective when the data follows a normal distribution.
Custom Value Imputation: Replace outliers with a predefined constant or custom value. This can be useful when you have domain knowledge that suggests a specific value for outliers; e.g., replacing outliers in a survey dataset with a fixed value might be appropriate if they represent data entry errors.
Percentile Imputation: Replace outliers with values at specific percentiles of the data distribution. E.g., you can replace values above the 99th percentile with the value at the 99th percentile (see the NumPy sketch after this list).
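The percentile approach is a near one-liner with NumPy (the values here are illustrative; this is essentially the winsorization shown earlier):

    import numpy as np

    x = np.array([3, 5, 4, 6, 5, 120, 4, 5, 6, 5])

    # Cap everything outside the 1st-99th percentile band at those percentiles.
    lo, hi = np.percentile(x, [1, 99])
    x_capped = np.clip(x, lo, hi)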
-
Christopher Kramer
AI, Machine Learning & Data Science
Replacing outliers is an enticing proposition, but it should be done carefully. You might end up "embedding" patterns (over-representing data) in your sample which can lead to model bias. Understanding your imputation method, and balancing between complexity and simplicity in your imputation logic is paramount.
An alternative to removing or replacing outliers is to rescale or transform them so that they fall within a narrower range. This can reduce the influence of the outliers on ML models while preserving some of their information. However, you should be aware of the effects and limitations of scaling outliers. Common techniques include logarithmic or exponential transformations for skewed or long-tailed data, standardization or normalization for data with different scales, and robust or non-parametric methods for data with outliers. Python libraries such as scipy, statsmodels, and sklearn offer functions for scaling outliers based on the type and distribution of your data.
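For instance, a log transform compresses a long right tail; a sketch with made-up skewed data (log1p handles zeros gracefully, but requires non-negative values):

    import numpy as np

    x = np.array([1, 2, 3, 5, 8, 1000, 4, 2])

    # log1p compresses large values far more than small ones,
    # pulling the long right tail toward the bulk of the data.
    x_log = np.log1p(x)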
-
Paresh Patil
💡Top Data Science Voice | ML, Deep Learning & Python Expert, Data Scientist | Data Visualization & Storytelling | Actively Seeking Opportunities
Scaling outliers requires delicate handling to retain data integrity. Techniques like robust scaling, where the median and interquartile range establish the scale, diminish the impact of outliers on your model's performance. This non-parametric method ensures that extreme values do not distort the overall data distribution, thus preserving essential structures. When scaling, it's critical to understand the underlying distribution and variance, as well as the model's sensitivity to these extremes. Choose your scaling method wisely, as it can significantly influence your ML model's ability to generalize and perform under varying data conditions.
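A minimal sketch of robust scaling with scikit-learn, which centers on the median and scales by the IQR (the data is illustrative):

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    X = np.array([[1.0], [2.0], [2.5], [3.0], [100.0]])

    # (x - median) / IQR: the extreme value barely shifts the scale,
    # unlike mean/std-based standardization.
    X_scaled = RobustScaler().fit_transform(X)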
-
Sayan Chowdhury
Software Developer @ L&T • Digital Solutions • AI-ML Engineer • Learner & Creator • Writer on 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻 Articles & 𝗺𝗲𝗱𝗶𝘂𝗺.𝗰𝗼𝗺 • 2x 𝗛𝗣𝗔𝗜𝗥 Delegate
Winsorization: Winsorization limits the values of outliers by replacing them with the nearest non-outlier data point. You can choose to replace outliers with values at a specific percentile (e.g., the 99th percentile) to constrain their impact. This technique effectively trims extreme values without removing them from the dataset.
Log Transformation: Applying a logarithmic transformation is useful for data that exhibits exponential or highly skewed distributions. It compresses the range of values and reduces the influence of extreme values while preserving their presence in the dataset. The degree of skewness will determine the base of the logarithm (common choices include base 10 or the natural logarithm, base e).
-
Aniket Soni
2x GCP Certified | Databricks Certified Data Engineer Associate | Full-Stack Engineer
Scaling outliers can be a pragmatic approach when dealing with extreme values. Instead of outright removing them, this method can help temper their impact on your machine learning models while retaining some useful information. It's worth noting, however, that scaling outliers is not a one-size-fits-all solution. The technique you choose, whether it's logarithmic transformation, standardization or another method, should align with your data's characteristics and distribution.
The final step in dealing with outliers is to evaluate their effect and significance on your ML models. This can help you decide whether removing, replacing, or scaling outliers is beneficial or detrimental for your specific ML task, and how to refine your data cleaning and preprocessing strategies. Descriptive statistics and visualizations can be used to compare the summary and distribution of your data before and after dealing with outliers, so you can observe changes in the mean, median, variance, range, skewness, or kurtosis. Inferential statistics and hypothesis tests, such as the t-test, ANOVA, or chi-square test, can be used to compare significance and confidence before and after, helping you validate and justify your decisions with p-values. Finally, machine learning metrics and validation techniques can be used to compare the performance and accuracy of your models before and after dealing with outliers, so you can measure and optimize outcomes such as accuracy, precision, recall, F1-score, MSE, R2, or cross-validation scores. Python libraries like pandas, matplotlib, seaborn, scipy, statsmodels, or sklearn can be used for these evaluations, depending on the type and goal of your ML task.
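A minimal sketch of the model-level check, comparing cross-validated scores with and without IQR outliers (the data is synthetic, and the threshold and model choice are assumptions):

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
    y[:5] += 50  # inject a few extreme targets

    # IQR mask on the target.
    q1, q3 = np.percentile(y, [25, 75])
    iqr = q3 - q1
    keep = (y >= q1 - 1.5 * iqr) & (y <= q3 + 1.5 * iqr)

    before = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()
    after = cross_val_score(LinearRegression(), X[keep], y[keep], cv=5, scoring="r2").mean()
    print(f"R2 with outliers: {before:.3f}, without: {after:.3f}")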
-
Sayan Chowdhury
Software Developer @ L&T • Digital Solutions • AI-ML Engineer • Learner & Creator • Writer on 𝗟𝗶𝗻𝗸𝗲𝗱𝗜𝗻 Articles & 𝗺𝗲𝗱𝗶𝘂𝗺.𝗰𝗼𝗺 • 2x 𝗛𝗣𝗔𝗜𝗥 Delegate
Begin by visualizing the data using appropriate plots such as scatter plots, box plots, histograms, or Q-Q plots. Visualization allows you to identify potential outliers and gain an initial understanding of their impact. Then utilize statistical techniques to identify outliers. Common methods include:
Z-Score: Calculate the z-score for each data point and consider those with z-scores above a certain threshold as outliers.
IQR (Interquartile Range): Identify data points that fall outside the bounds of Q1 - 1.5 * IQR and Q3 + 1.5 * IQR as outliers.
Finally, leverage domain knowledge to evaluate whether identified outliers are valid or erroneous. Some data points that appear as outliers might be genuine and important in the context of the problem.
-
Linnéa Haugen
Software Development | Data Analysis | R&D | AI | Signal Processing | Image Processing
I'd argue that evaluating your outliers is the first step. Sometimes removing outliers is the best solution and other times it's not. What's most important is to know your data. If you are using supervised learning and notice inconsistencies in, e.g., detecting where the front door of a car is, then the problem may lie in your annotations. If you are using clustering and find outliers, that may mean you have an underrepresented group of data. So know your data well before discarding anything; that makes your results more reliable and less biased too.
-
Elisa Terumi Rubel Schneider, PhD
Natural Language Processing | Machine Learning | Software Development
It is important to document and justify why outliers were removed or replaced, for the transparency of the process. It is often preferable to consider robust algorithms that are not sensitive to outliers, rather than removing them.
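As a sketch of that robust-algorithm route (the model choice and data are illustrative), scikit-learn's HuberRegressor down-weights extreme residuals instead of requiring outlier removal:

    import numpy as np
    from sklearn.linear_model import HuberRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 10, size=(100, 1))
    y = 3.0 * X.ravel() + rng.normal(scale=0.5, size=100)
    y[:5] += 40  # a few corrupted targets

    # Huber loss is quadratic for small residuals and linear for large ones,
    # so the corrupted points have limited influence on the fit.
    model = HuberRegressor().fit(X, y)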
-
Meenakshi A.
Technologist & Believer in Systems for People and People for Systems
But how do we know whether outliers grow into different categories along the way, for the good of the society and culture of our mother Earth? 😊
-
Durgesh Chalvadi
AI Engineer @IBM WatsonX Client Engineering | Ex-Data Scientist @Tata Research Development and Design Centre TRDDC | PICT '18
PyOD (Python Outlier Detection) is a specialized library for outlier detection. It provides a wide range of algorithms and methods for identifying outliers in your data. You can also explore the z-score and IQR functions in the scipy.stats library to identify and remove outliers based on statistical methods.
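A minimal PyOD sketch, assuming the pyod package is installed (the data is synthetic and the contamination rate is an assumption):

    import numpy as np
    from pyod.models.iforest import IForest

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0], [-5.0, 7.0]]])

    clf = IForest(contamination=0.01).fit(X)
    labels = clf.labels_           # 0 = inlier, 1 = outlier on the training data
    scores = clf.decision_scores_  # higher means more anomalous
    X_clean = X[labels == 0]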