Data Wiping & Its Phases: Against Data Quality Problems
Data cleaning, data wiping or scrubbing is a necessary process to ensure the quality of the data to be used for analytics. This step is essential to minimize the risk of basing decision-making on inaccurate, erroneous or incomplete information.
Data Wiping deals with data quality problems at two levels:
· Problems related to data from a single source: at this level the issues stem from missing integrity constraints or poor schema design, which mainly affect the uniqueness and referential integrity of the data. In a more practical sense, this level also covers data-entry issues such as redundant or contradictory values.
· Problems related to data from various sources: as a general rule, these arise from the heterogeneity of data models and schemas, which can cause structural conflicts; at the instance level, they show up as duplications, contradictions and inconsistencies in the data (a sketch of both kinds of check follows this list).
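As an illustration, the sketch below uses pandas on made-up data; the table and column names (customers, billing, customer_id, country) are assumptions for the example, not part of any real system. It checks a single source for duplicate keys and missing values, and then looks for contradictory values once two sources are combined.

import pandas as pd

# Single source: a hypothetical customer table with typical entry problems
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@x.com", "b@x.com", "b@x.com", None],
    "country": ["ES", "ES", "ES", "FR"],
})

# Single-source checks: duplicated keys and missing values
print("Duplicated customer_id values:", customers["customer_id"].duplicated().sum())
print(customers.isna().sum())

# A second source describing the same customers (e.g. a billing system)
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "country": ["ES", "PT", "FR"],
})

# Multi-source check: contradictory values for the same customer
merged = customers.drop_duplicates("customer_id").merge(
    billing, on="customer_id", suffixes=("_crm", "_billing"))
conflicts = merged[merged["country_crm"] != merged["country_billing"]]
print(conflicts[["customer_id", "country_crm", "country_billing"]])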
The phases of data wiping
The ultimate goal of any Data Wiping action is to improve the organization's trust in its data. To carry out a comprehensive data cleaning action, it is necessary to follow these steps:
1. Data analysis: its purpose is to determine what kinds of errors and inconsistencies should be eliminated. In addition to manual inspection of data samples, automation is necessary, in other words, the incorporation of programs that act on the metadata to detect data quality problems that affect its properties (see the sketch after this list).
2. Definition of the transformation flow and mapping rules: depending on the number of data sources, their heterogeneity and the data quality problems anticipated, more or fewer steps will need to be executed in the transformation and adaptation stage. The most appropriate approach is to act at two levels: an early stage that corrects the problems related to data from a single source and prepares it for integration, and a later stage that deals with data problems coming from a variety of sources. To improve control over these procedures, it is convenient to define the ETL processes within a specific framework.
3. Verification: the adequacy and effectiveness of a transformation action must always be tested and evaluated; this is one of the principles of Data Wiping. As a general rule, this validation is applied through multiple iterations of the analysis, design and verification steps, since some errors only become evident after a certain number of transformations have been applied to the data.
4. Transformation: consists of executing the ETL flow to load and refresh the data warehouse, or of transforming the data while answering queries, in the case of multiple source systems.
5. Clean data back-flow: once quality errors have been eliminated, the "clean" data should replace the dirty data in the original sources, so that legacy applications can also benefit from it and future Data Wiping actions on the same data are avoided.
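The following is a minimal sketch of the analysis, transformation and verification steps, assuming a hypothetical pandas DataFrame of orders with made-up column names; a real ETL flow would normally be implemented in a dedicated tool or framework.

import pandas as pd

# Hypothetical extract step: raw orders coming from a source system
orders = pd.DataFrame({
    "order_id": [10, 11, 11, 12],
    "amount": ["100", "250", "250", None],
    "currency": ["EUR", "eur", "eur", "EUR"],
})

# 1. Data analysis: profile the data to find the errors to eliminate
print(orders.dtypes)
print("Duplicates:", orders.duplicated().sum(),
      "Missing:", int(orders.isna().sum().sum()))

# 2. and 4. Transformation: deduplicate, normalize and convert types
clean = (
    orders.drop_duplicates()
          .dropna(subset=["amount"])
          .assign(amount=lambda d: d["amount"].astype(float),
                  currency=lambda d: d["currency"].str.upper())
)

# 3. Verification: assert the rules that define "clean" before loading
assert clean["order_id"].is_unique
assert clean["amount"].notna().all()
assert clean["currency"].isin(["EUR"]).all()

# The load into the data warehouse would follow here
print(clean)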
Importance of Data Wiping
With Data Wiping done, you can start to select the fields that will be useful for making predictions. In this phase you have to keep the "signal" and eliminate the fields that only provide "noise". This job is often called Feature Engineering:
· Discard fields with random content
· Discard dependent fields
· Select those that are "predictors" (see the sketch after this list)
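A small sketch of these three steps using pandas and NumPy on made-up data; the column names, thresholds and the toy target are assumptions for illustration, not a prescription.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical dataset: one informative field and one purely random field
df = pd.DataFrame({
    "monthly_spend": rng.normal(100, 20, n),
    "random_code": rng.integers(0, 10_000, n),   # noise, carries no signal
})
df["yearly_spend"] = df["monthly_spend"] * 12    # fully dependent field
target = (df["monthly_spend"] > 110).astype(int) # toy churn-like label

# Discard dependent fields: drop one of each highly correlated pair
corr = df.corr().abs()
to_drop = [col for i, col in enumerate(corr.columns)
           if any(corr.iloc[i, :i] > 0.95)]

# Discard "random" fields: keep columns with some relationship to the target
candidates = df.drop(columns=to_drop)
predictors = [col for col in candidates.columns
              if abs(candidates[col].corr(target)) > 0.1]

print("Dropped as dependent:", to_drop)
print("Selected predictors:", predictors)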
The transformation of the data, which also belongs to so-called Feature Engineering, tries to generate new predictor fields based on the ones we already have. Knowledge of the domain (of the business, of the area being analyzed) is essential to approach this phase. This phase, together with the selection of predictive fields, is the one that requires the most intellectual and creative effort, since you not only have to know the field of study, but also need to understand in some depth how predictive algorithms work, how they interpret the data internally and how they look for relationships between variables.
As an example, in a project for predicting customer churn one might think that it is enough to have the registration and withdrawal dates, and that the algorithm, by analyzing these two values, is capable of "deducing" the customer's seniority. But that is not the case. The transformation, very simple in this case, would be to add a new field obtained by subtracting the two dates and converting the result into a number of days (or months, or years, whichever we consider best). A small modification like this can greatly improve the predictive ability of the system, as sketched below.
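A sketch of that transformation with pandas; the column names registration_date and withdrawal_date are assumed for the example.

import pandas as pd

# Hypothetical customer table containing only the two original dates
customers = pd.DataFrame({
    "registration_date": pd.to_datetime(["2018-01-15", "2019-06-01"]),
    "withdrawal_date": pd.to_datetime(["2020-03-10", "2021-06-30"]),
})

# New predictor field: customer seniority expressed as a number of days
customers["tenure_days"] = (
    customers["withdrawal_date"] - customers["registration_date"]
).dt.days

print(customers)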
Conclusions
Machine Learning as a Service (MLaaS) platforms are bringing predictive analytics (as opposed to descriptive analytics) substantially closer to companies of any size. What the big players have been doing for years is now being generalized to all companies. The process we are going through is reminiscent of the evolution of databases in the 80s and 90s of the last century: what at first was difficult to explain (how they work and what they are for) is now so integrated into all systems that it is hard to find one that does not have a database in its guts.
The algorithms are important, but they are not the most important thing. The preceding phase of data collection and preparation demands considerable effort and knowledge if a project is to be carried out successfully. This phase can take between 80% and 90% of the project time.
Factors such as experience, intuition and knowledge of the business and its customers are fundamental. You may have these skills in your company, or you may need to hire them externally.