Data science and AI are evolving quickly, and critical business decisions and strategies are increasingly built on the output of their algorithms, so ensuring the efficacy of those algorithms is extremely important. Because the majority of the time in any data science project is spent on data preprocessing, having clean data to work with matters a great deal. As the old saying goes, ‘Garbage in, garbage out’: the outcome of these models is highly dependent on the nature of the data fed in, which is why data quality challenges in data science are becoming increasingly important.
Challenges to Data Quality in Data Science
Let’s understand this problem better using a case. Say you are working for an Indian bank that wants to build a customer acquisition model for one of its products using ML. As with typical ML models, it needs lots and lots of data, and as the size of the data increases, so do your problems with it. While doing data prep for your model you might face quality challenges. Let’s look at a few of them one by one.
The most common causes of data quality issues are duplicate data, inaccurate data, missing data, outliers and bias in data.
Suppose you are creating customer demographic variables for your model and you notice a cluster of customers in your dataset with exactly the same age, gender and pincode. This is quite possible, since many people of the same age and gender can live in the same pincode, but you should inspect the customer details table more closely and check whether the rest of their details (mobile number, education, income, etc.) are also the same. If they all match, the cause is probably data duplication. Multiple copies of the same record not only take a toll on compute and storage but also affect the outcome of machine learning models by introducing bias.
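The check described above can be sketched in plain Python. This is a minimal illustration rather than a production routine: the records, the field names and the `find_duplicates` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical customer records; field names are illustrative only.
customers = [
    {"id": 1, "age": 34, "gender": "F", "pincode": "560001", "mobile": "9000000001", "income": 50000},
    {"id": 2, "age": 34, "gender": "F", "pincode": "560001", "mobile": "9000000001", "income": 50000},
    {"id": 3, "age": 34, "gender": "F", "pincode": "560001", "mobile": "9000000002", "income": 72000},
]


def find_duplicates(records, key_fields):
    """Group record ids by the given fields and return groups with more than one id."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[field] for field in key_fields)
        groups[key].append(rec["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


# Step 1: customers who merely share the same demographics (may be legitimate).
demographic_matches = find_duplicates(customers, ["age", "gender", "pincode"])

# Step 2: customers who also share the remaining details (probable duplicates).
likely_duplicates = find_duplicates(
    customers, ["age", "gender", "pincode", "mobile", "income"]
)

print(demographic_matches)
print(likely_duplicates)
```

Matching on the demographic fields alone flags all three customers, while matching on the wider set of fields narrows the result to the two records that agree on everything, which are the ones worth reviewing as duplicates.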
Suppose you are working with location-specific data. It is quite possible that the pincode column you fetched contains values that are not 6 digits long. This problem is due to inaccurate data, and it can hurt your model wherever data needs to be aggregated at the pincode level. Features with a high proportion of incorrect data should be dropped from your dataset altogether.
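As a rough sketch of how such a check might look, the snippet below counts pincode values that are not exactly six digits. The sample values and the helper names are hypothetical, and the only rule applied is the six-digit length mentioned above.

```python
import re

# Assumed validity rule for this example: exactly six ASCII digits.
PINCODE_PATTERN = re.compile(r"[0-9]{6}")


def is_valid_pincode(value):
    """Return True only for strings made of exactly six digits."""
    return isinstance(value, str) and PINCODE_PATTERN.fullmatch(value) is not None


# Hypothetical values fetched from a pincode column.
pincodes = ["560001", "11001", "4000012", "40001A", "110001"]

invalid_pincodes = [p for p in pincodes if not is_valid_pincode(p)]
invalid_share = len(invalid_pincodes) / len(pincodes)

print(invalid_pincodes)
print(f"Share of invalid pincodes: {invalid_share:.0%}")
```

The share of invalid values is what you would weigh when deciding whether the feature is still usable or should be dropped.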
Some data points may not be available for your entire customer base. Suppose your bank started capturing customers’ salaries only in the last year; customers who have been with the bank for more than a year will not have their salary details captured. However important you think this variable might be for your model, if it is missing for more than 50% of your dataset, it cannot be used in its current form.
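A simple way to apply the 50% rule of thumb is to compute the missing share of each variable and flag the ones above the threshold. The sketch below assumes missing values are represented as `None`; the records and field names are made up for illustration.

```python
MISSING_THRESHOLD = 0.5  # the 50% cut-off mentioned above


def missing_share(records, field):
    """Fraction of records in which the field is absent or None."""
    missing = sum(1 for rec in records if rec.get(field) is None)
    return missing / len(records)


# Hypothetical records: salary was captured for recent customers only.
records = [
    {"age": 45, "salary": None},
    {"age": 38, "salary": None},
    {"age": 29, "salary": 40000},
    {"age": 51, "salary": None},
]

shares = {field: missing_share(records, field) for field in ("age", "salary")}
unusable_fields = [f for f, share in shares.items() if share > MISSING_THRESHOLD]

print(shares)
print(unusable_fields)
```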
Machine learning algorithms are sensitive to the range and distribution of attribute values. Outliers can distort and mislead the training process, resulting in longer training times, less accurate models and ultimately poorer results. Correct outlier treatment can be the difference between an accurate model and an average one.
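One common way to flag outliers is the interquartile range (IQR) rule, which marks values lying more than 1.5 times the IQR outside the quartiles. The article does not prescribe a specific method, so treat the choice of rule, the 1.5 multiplier and the sample incomes below as assumptions for illustration.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)  # the three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical monthly incomes (in thousands) with one extreme value.
incomes = [30, 32, 35, 36, 38, 40, 41, 45, 400]

outliers = iqr_outliers(incomes)
print(outliers)
```

Whether a flagged value should be capped, removed or kept is a separate decision that usually needs domain knowledge.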
Bias in Data:
Bias error occurs when your training dataset does not reflect the realities of the environment in which the model will run. Let’s see how this applies to our case. In acquisition models, the potential customers your model will score in future are typically of two types: credit experienced or new to credit. If your training data contains only credit-experienced customers, your data will be biased and the model will fail in production, because the features that capture a customer’s performance from credit history (bureau data) will not be present for new-to-credit customers. Your model might perform very well on experienced customers but will fail for new ones. ML models are only as good as the data they are trained on; if the training data has systematic bias, your model will also produce biased results.
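One rough way to spot this kind of bias before training is to compare the segment mix of the training data with the mix you expect to score in production. The sketch below is only an illustration: the segment labels and both samples are hypothetical, and in practice the expected production mix would have to come from the business.

```python
from collections import Counter


def segment_shares(labels):
    """Return the share of each segment label in the given list."""
    counts = Counter(labels)
    return {segment: n / len(labels) for segment, n in counts.items()}


# Hypothetical samples: training data has no new-to-credit customers.
training_segments = ["credit_experienced"] * 8
production_segments = ["credit_experienced"] * 6 + ["new_to_credit"] * 4

train_shares = segment_shares(training_segments)
prod_shares = segment_shares(production_segments)

# Segments expected in production but absent from the training data.
missing_segments = sorted(set(prod_shares) - set(train_shares))

# Positive gap: segment is under-represented in training relative to production.
share_gaps = {
    segment: prod_shares[segment] - train_shares.get(segment, 0.0)
    for segment in prod_shares
}

print(missing_segments)
print(share_gaps)
```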
How To Address
Now that we understand the data quality challenges, let’s see how we can tackle them and improve our data quality. Before going further, it is worth accepting that data will never be 100% perfect. There will always be inconsistencies caused by human error, machine error or the sheer complexity that comes with a growing volume of data. While developing ML models, there are a few techniques we can use to address these issues, such as:
- Outlier Detection
- Missing value imputation
- Data deduplication
- Variable selection
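As an example of one of these techniques, missing value imputation can be as simple as filling the gaps with the median of the observed values. Median imputation is just one option among several and is used here as an assumption for illustration; the `impute_median` helper and the sample ages are hypothetical.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill_value = median(observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical age column with two missing entries.
ages = [25, None, 40, 31, None]

imputed_ages = impute_median(ages)
print(imputed_ages)
```

The median is often preferred over the mean when the variable contains outliers, because a few extreme values do not shift it much.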
Apart from these techniques, we can also work with domain experts to add logical, rule-based checks that validate whether the data reflects real-world values. There are also many software solutions on the market for managing and improving data quality in data science, which can help you create better machine learning solutions.
Dirty data is the single greatest threat to success with analytics and machine learning, and it can be the result of duplicate data, human error and nonstandard formats, to name just a few factors. The quality demands of machine learning are steep, and bad data can backfire twice: first when training predictive models, and again in the new data the model uses to inform future decisions. When 70% to 80% of a data scientist’s time in any ML project is spent in the data preparation phase, ensuring that high-quality data is fed into ML algorithms should be of the highest importance. With more and more data being generated and captured every day, addressing this challenge now is more important than ever.