Data Wiping & Its Phases: Against Data Quality Problems
Data cleaning, data wiping or scrubbing is a necessary process to ensure the quality of the data to be used for analytics. This step is essential to minimize the risk of basing decision-making on inaccurate, erroneous or incomplete information.
Data Wiping deals with data quality problems at two levels:
· Problems related to data from a single source: at this level the issues stem from missing integrity constraints or poor schema design, which mainly affect the uniqueness and referential integrity of the data. In a more practical sense, this level also covers data-entry issues such as redundant or contradictory values.
· Problems related to data from various sources: as a general rule, these arise from the heterogeneity of data models and schemas, which can cause structural conflicts; at the instance level, they show up as duplications, contradictions and inconsistencies in the data (a sketch of both kinds of check follows this list).
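As an illustration, the sketch below uses pandas on made-up data; the table and column names (customers, billing, customer_id, country) are assumptions for the example, not part of any real system. It checks a single source for duplicate keys and missing values, and then looks for contradictory values once two sources are combined.

import pandas as pd

# Single source: a hypothetical customer table with typical entry problems
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "country": ["ES", "ES", "ES", "FR"],
})

# Single-source checks: duplicated keys and missing values
print("Duplicated customer_id values:", customers["customer_id"].duplicated().sum())
print(customers.isna().sum())

# A second source describing the same customers (e.g. a billing system)
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["ES", "PT", "FR"],
})

# Multi-source check: contradictory values for the same customer
merged = customers.drop_duplicates("customer_id").merge(
    billing, on="customer_id", suffixes=("_crm", "_billing"))
conflicts = merged[merged["country_crm"] != merged["country_billing"]]
print(conflicts[["customer_id", "country_crm", "country_billing"]])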
The phases of data wiping
The ultimate goal of any Data Wiping action is to improve the organization's trust in its data. To carry out a comprehensive data cleaning action, it is necessary to follow these steps:
1. Data analysis: its purpose is to determine what kinds of errors and inconsistencies should be eliminated. In addition to manual inspection of data samples, automation is necessary, in other words, the incorporation of programs that act on the metadata to detect data quality problems that affect its properties (see the sketch after this list).
2. Definition of the transformation flow and mapping rules: depending on the number of data sources, their heterogeneity and the data quality problems anticipated, more or fewer steps will need to be executed in the transformation and adaptation stage. The most appropriate approach is to act at two levels: an early stage that corrects the problems related to data from a single source and prepares it for integration, and a later stage that deals with data problems coming from a variety of sources. To improve control over these procedures, it is convenient to define the ETL processes within a specific framework.
3. Verification: the adequacy and effectiveness of a transformation action must always be tested and evaluated; this is one of the principles of Data Wiping. As a general rule, this validation is applied through multiple iterations of the analysis, design and verification steps, since some errors only become evident after a certain number of transformations have been applied to the data.
4. Transformation: consists of executing the ETL flow to load and refresh the data warehouse, or of transforming the data while answering queries, in the case of multiple source systems.
5. Clean data back-flow: once quality errors have been eliminated, the "clean" data should replace the dirty data in the original sources, so that legacy applications can also benefit from it and future Data Wiping actions on the same data are avoided.
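The following is a minimal sketch of the analysis, transformation and verification steps, assuming a hypothetical pandas DataFrame of orders with made-up column names; a real ETL flow would normally be implemented in a dedicated tool or framework.

import pandas as pd

# Hypothetical extract step: raw orders coming from a source system
orders = pd.DataFrame({
    "order_id": [10, 11, 11, 12],
    "amount": ["100", "250", "250", None],
    "currency": ["EUR", "eur", "eur", "EUR"],
})

# 1. Data analysis: profile the data to find the errors to eliminate
print(orders.dtypes)
print("Duplicates:", orders.duplicated().sum(),
      "Missing:", int(orders.isna().sum().sum()))

# 2. and 4. Transformation: deduplicate, normalize and convert types
clean = (
    orders.drop_duplicates()
          .dropna(subset=["amount"])
          .assign(amount=lambda d: d["amount"].astype(float),
                  currency=lambda d: d["currency"].str.upper())
)

# 3. Verification: assert the rules that define "clean" before loading
assert clean["order_id"].is_unique
assert clean["amount"].notna().all()
assert clean["currency"].isin(["EUR"]).all()

# The load into the data warehouse would follow here
print(clean)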
Importance of Data Wiping
With Data Wiping done, you can start to select the fields that will be useful for making predictions. In this phase you have to keep the "signal" and eliminate the fields that only provide "noise". This job is often called Feature Engineering:
· Discard fields with random content
· Discard dependent fields
· Select those that are "predictors" (see the sketch after this list)
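A small sketch of these three steps using pandas and NumPy on made-up data; the column names, thresholds and the toy target are assumptions for illustration, not a prescription.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical dataset: one informative field and one purely random field
df = pd.DataFrame({
    "monthly_spend": rng.normal(100, 20, n),
    "random_code": rng.integers(0, 10_000, n),   # noise, carries no signal
})
df["yearly_spend"] = df["monthly_spend"] * 12    # fully dependent field
target = (df["monthly_spend"] > 110).astype(int) # toy churn-like label

# Discard dependent fields: drop one of each highly correlated pair
corr = df.corr().abs()
to_drop = [col for i, col in enumerate(corr.columns)
           if any(corr.iloc[i, :i] > 0.95)]

# Discard "random" fields: keep columns with some relationship to the target
candidates = df.drop(columns=to_drop)
predictors = [col for col in candidates.columns
              if abs(candidates[col].corr(target)) > 0.1]

print("Dropped as dependent:", to_drop)
print("Selected predictors:", predictors)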
The transformation of the data, which also belongs to so-called Feature Engineering, tries to generate new predictor fields based on the ones we already have. Knowledge of the domain (of the business, of the area being analyzed) is essential to approach this phase. This phase, together with the selection of predictive fields, is the one that requires the most intellectual and creative effort, since you not only have to know the field of study, but also need to understand in some depth how predictive algorithms work, how they interpret the data internally and how they look for relationships between variables.
As an example, in a project for predicting customer churn one might think that it is enough to have the registration and withdrawal dates, and that the algorithm, by analyzing these two values, is capable of "deducing" the customer's seniority. But that is not the case. The transformation, very simple in this case, would be to add a new field obtained by subtracting the two dates and converting the result into a number of days (or months, or years, whichever we consider best). A small modification like this can greatly improve the predictive ability of the system, as sketched below.
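A sketch of that transformation with pandas; the column names registration_date and withdrawal_date are assumed for the example.

import pandas as pd

# Hypothetical customer table containing only the two original dates
customers = pd.DataFrame({
    "registration_date": pd.to_datetime(["2018-01-15", "2019-06-01"]),
    "withdrawal_date": pd.to_datetime(["2020-03-10", "2021-06-30"]),
})

# New predictor field: customer seniority expressed as a number of days
customers["tenure_days"] = (
    customers["withdrawal_date"] - customers["registration_date"]
).dt.days

print(customers)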
Conclusions
Machine Learning as a Service (MLaaS) platforms are bringing predictive analytics (as opposed to descriptive analytics) substantially closer to companies of any size. What the big players have been doing for years is now being generalized to all companies. The process we are going through is reminiscent of the evolution of databases in the 80s and 90s of the last century: what at first was difficult to explain (how they work and what they are for) is now so integrated into all systems that it is hard to find one that does not have a database in its guts.
The algorithms are important, but they are not the most important thing. The preceding phase of data collection and preparation demands considerable effort and knowledge if a project is to be carried out successfully. This phase can take between 80% and 90% of the project time.
Factors such as experience, intuition and knowledge of the business and its customers are fundamental. You may have these skills in your company, or you may need to hire them externally.