What is Dirty Data and how will it impact businesses today?
There has been a lot of talk about big data and its prominence in industries today, from banking and finance to technology and even pharmaceuticals. However, less discussed is ‘dirty data’, a growing concern amongst data scientists and engineers who are scrambling to cleanse erroneous data to meet business goals.
Do a quick Google search of ‘dirty data’, and the results will identify the term as ‘inaccurate, incomplete, inconsistent, duplicate data sets’. However, there is more to this phenomenon than just fixing and comprehending incorrect data. Currently, data scientists spend between 60 and 80 per cent of their time on data preparation, leaving insufficient time to focus on important areas such as interacting with data, running advanced analytics, training and evaluating models, and deploying models to production.
We spoke with data science specialist Donnie Maclary for more insight
We sat down with our very own data expert, Donnie Maclary, Principal Consultant of Huxley Singapore, who shared some of his views on how dirty data affects organisations today.
“When people think of data, we assume it is packaged nicely with a pretty red bow on top, ready for analysing. Of course, some data is easily accessible; however, this is not always the reality.
Most organisations host massive amounts of unstructured data. Cleaning this data and preparing it for analysis is a time-consuming task that forms part of a data analyst’s or data scientist’s daily duties.
A key fear among business stakeholders is the quality of the data. Keep in mind that such data sets are the essential information businesses use to make key decisions. This is also where management and technology teams need to be in alignment on business needs.”
There is no one-size-fits-all method for this process, as all companies are set up differently with varied levels of accessibility to information. However, here’s what organisations need to know about dirty data.
How did dirty data emerge?
Dirty data is caused by the following:
- Human error
- Disparate systems
- Changing requirements
According to Experian, human error influences over 60% of dirty data, and poor interdepartmental communication is involved in about 35% of inaccurate data records. Intuitively, it seems that a solid data strategy should mitigate these issues, but inadequate data strategies also impact 28% of inaccurate data records.
When different departments enter related data into separate data silos, records can still be duplicated with non-canonical data, such as different misspellings of names and addresses. Data silos with poor constraints can lead to dates, account numbers or personal information being stored in different formats, which makes them difficult or impossible to reconcile automatically.
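To illustrate, here is a minimal Python sketch of reconciling records from two hypothetical silos by normalising names and dates to a canonical form before checking for duplicates. The records, field names and date formats are assumptions for illustration; genuinely misspelt names would additionally require fuzzy matching.

```python
from datetime import datetime

# Hypothetical records from two departmental silos: the same person
# appears with inconsistent casing/whitespace and date formats.
records = [
    {"name": "Jonathan Smith ", "dob": "1985-03-12"},
    {"name": "JONATHAN  SMITH", "dob": "12/03/1985"},
]

def canonical_date(value):
    """Try a few common date formats and emit ISO 8601, else None."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%d %b %Y"):
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable: flag for manual review

def canonical_name(value):
    # Lower-case and collapse internal/surrounding whitespace.
    return " ".join(value.lower().split())

seen, duplicates = set(), []
for rec in records:
    key = (canonical_name(rec["name"]), canonical_date(rec["dob"]))
    if key in seen:
        duplicates.append(rec)  # same person once canonicalised
    seen.add(key)
```

Without the canonicalisation step, a naive comparison would treat these two rows as distinct people, which is exactly how silo duplicates accumulate.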
Harvard Business Review reports that analysts spend 50 per cent of their time searching for data, correcting errors and cross-checking sources.
The impact of Dirty Data on organisations
- Productivity of data experts
- Greater financial losses
- Potential regulatory breaches
In the financial services industry, dirty data goes beyond financial loss. Inaccurate and incomplete data can lead to regulatory breaches, delayed decisions due to manual checks, and sub-optimal trade strategies, to name a few. Gartner research indicates that the “average financial impact of poor data quality on organisations is USD 9.7 million per year”. It is therefore pertinent for all organisations to keep their data clean and accurate.
Don’t let dirty data slow you down
The business impact of dirty data is staggering, but an individual organisation can avoid the morass if it takes the right approach. Clean, reliable data makes the business more agile and responsive while cutting down on wasted efforts by data scientists and knowledge workers.
Here are three ways institutions and firms are tackling dirty data:
- Developing agile processes
Organisations should consider how the tool will evolve processes towards an iterative, agile approach instead of creating new barriers to entry. People will have a greater desire to prepare and understand their data if they can see the impact of their data prep.
The adoption of an agile process also includes the principle of greater transparency and visualisation. It is about planning and tracking based on real progress and data, and this is where traceability becomes so important in agile organisations.
- Standardising data
Developing a data dictionary is no small task. Data stewards and subject matter experts need to commit to ongoing iteration, checking in as requirements change. If a dictionary is out of date, it can actually harm your organisation's data strategy. Communication and ownership should be built into the process from the beginning to determine where the glossary should live and how often it should be updated and refined.
Firms can incorporate strategies such as:
- Standardising data by modifying it to uniformly conform to standards. This can be accomplished by matching and merging records with a standardised file.
- Using filtering techniques to identify duplicate and missing data.
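As a sketch of both techniques, the following Python example standardises country values against a reference mapping (standing in for the standardised file the text mentions) and then filters out duplicate and missing records. All record contents, field names and codes here are hypothetical, chosen only to show the matching-and-filtering pattern.

```python
# Hypothetical standardised reference: every accepted variant maps
# to one canonical country code.
reference = {"singapore": "SG", "sg": "SG", "united kingdom": "GB", "uk": "GB"}

raw = [
    {"id": 1, "country": "Singapore"},
    {"id": 2, "country": "UK"},
    {"id": 3, "country": None},   # missing value
    {"id": 1, "country": "SG"},   # duplicate of id 1 once standardised
]

def standardise(rec):
    """Match the raw value against the reference; None if unmatched."""
    country = (rec["country"] or "").strip().lower()
    return {"id": rec["id"], "country": reference.get(country)}

clean, missing, duplicates, seen = [], [], [], set()
for rec in map(standardise, raw):
    if rec["country"] is None:
        missing.append(rec)            # filter: missing/unmatchable data
    elif (rec["id"], rec["country"]) in seen:
        duplicates.append(rec)         # filter: duplicate after merge
    else:
        seen.add((rec["id"], rec["country"]))
        clean.append(rec)
```

The point of the pattern is that standardisation must happen before filtering: “UK” and “United Kingdom” only become detectable duplicates once both are mapped to the same canonical code.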
- Empower data experts and encourage collaboration and knowledge sharing
Adopting self-service data prep across an organisation requires users to learn the ins and outs of the data. Since this knowledge was historically reserved for IT and data engineering roles, it is crucial that analysts take time to learn about nuances within the data, including its granularity and any transformations that have been applied to the data set. Scheduling regular check-ins or a standardised workflow for questions allows engineers to share the most up-to-date way to query and work with valid data, while empowering analysts to prepare data faster and with greater confidence.
An opportunity for data experts to step up
Dirty data is an opportunity for organisations to review their practices at a granular level. Thus, support and expertise from specialists will always be needed to make sure the right processes are in place.
Data scientists, analysts and engineers are the catalysts to create and incorporate processes to ensure data integrity in a firm. If you would like to find out how these experts are managing the process, do get in touch with us. If you are also a data expert looking to take on a project this 2020, do also leave your contact details via the form below.