Phases of Data Wiping: Tackling Data Quality Problems
Data Wiping, also called Data Cleansing or Data Shredding, is a necessary process to ensure the quality of the data that will be used for analytics. This step is essential to minimize the risk of basing decisions on inaccurate, erroneous or incomplete information.
Data Wiping addresses data quality problems at two levels:
· Problems related to data from a single source: at this level sit the issues caused by missing integrity constraints or by a poorly designed schema, which mainly affect the uniqueness of the data and its referential integrity. In a more practical sense, this level also covers data-entry issues such as redundant or contradictory values, among others (see the sketch after this list).
· Problems related to data from various sources: as a general rule, these arise from the heterogeneity of data models and schemas, which can cause structural conflicts; at the instance level, they show up as duplications, contradictions and inconsistencies in the data.
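As a minimal sketch of what single-source checks can look like, the snippet below flags duplicate keys, missing mandatory values and contradictory entries in a small pandas DataFrame; the column names and sample data are assumptions chosen purely for illustration.

```python
import pandas as pd

# Hypothetical single-source extract; the column names are assumptions for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "email": ["a@example.com", None, "b@example.com", "c@example.com", "c@example.com"],
    "birth_date": ["1980-05-01", "1992-11-23", "1992-11-23", "2030-01-01", "1975-03-14"],
})

# Uniqueness: a key column should not contain duplicate values.
duplicate_keys = df[df["customer_id"].duplicated(keep=False)]

# Completeness: mandatory attributes should not be null.
missing_emails = df[df["email"].isna()]

# Plausibility: contradictory values, e.g. birth dates in the future.
birth_dates = pd.to_datetime(df["birth_date"])
implausible_dates = df[birth_dates > pd.Timestamp.today()]

print(duplicate_keys, missing_emails, implausible_dates, sep="\n\n")
```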
The phases of Data Wiping
The ultimate goal of any Data Wiping action is to improve the organization's trust in its data. A comprehensive data cleaning effort involves the following steps:
1. Data analysis: its purpose is to determine what kinds of errors and inconsistencies need to be eliminated. In addition to manual inspection of data samples, automation is necessary, in other words, programs that act on the metadata to detect data quality problems affecting its properties (see the profiling sketch after this list).
2. Definition of the transformation flow and mapping rules: depending on the number of data sources, their heterogeneity and the data quality problems anticipated, more or fewer steps will be needed in the transformation and adaptation stage. The most appropriate approach is to act at two levels: an early one, which corrects problems in data from a single source and prepares it for integration, and a later one, which deals with problems in data from multiple sources. To improve control over these procedures, it is convenient to define the ETL processes within a specific framework (a minimal transformation-flow sketch follows this list).
3. Verification: the adequacy and effectiveness of a transformation action must always be tested and evaluated; this is one of the principles of Data Wiping. As a general rule, this validation is applied through multiple iterations of the analysis, design and verification steps, since some errors only become evident after a certain number of transformations have been applied to the data.
4. Transformation: consists of executing the ETL flow to load and refresh the data warehouse, or during query answering in the case of multiple sources of origin.
5. Clean data back-flow: once quality errors have been eliminated, the "clean" data should replace the unclean data in the original sources, so that legacy applications also benefit from it and Data Wiping actions do not have to be repeated in the future (a back-flow sketch follows this list).
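For step 1, a minimal sketch, assuming the data sample has already been loaded into a pandas DataFrame, of an automated profiling pass over column metadata; the `profile` helper is a hypothetical name, not part of any library:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    # Summarize each column's metadata to surface likely quality problems:
    # declared type, fraction of nulls and number of distinct values.
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "null_ratio": df.isna().mean(),
        "distinct_values": df.nunique(),
    })

# Usage: print(profile(sample_df)), where `sample_df` is a sample taken from the source.
```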
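For steps 2 to 4, the sketch below chains a few hypothetical mapping rules into a single transformation flow and verifies the result before loading. The rules, column names and overall structure are assumptions for illustration, not the API of any particular ETL tool:

```python
import pandas as pd

# Hypothetical mapping rules: each one takes and returns a DataFrame, so the whole
# transformation flow can be defined, reordered and verified as a single unit.
def normalize_emails(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["email"] = df["email"].str.strip().str.lower()
    return df

def drop_incomplete_rows(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna(subset=["email"])

def drop_duplicate_customers(df: pd.DataFrame) -> pd.DataFrame:
    return df.drop_duplicates(subset="customer_id", keep="first")

def verify(df: pd.DataFrame) -> pd.DataFrame:
    # Verification: fail fast if a quality rule is still violated after the flow.
    assert df["customer_id"].is_unique, "duplicate keys survived the flow"
    assert df["email"].notna().all(), "null emails survived the flow"
    return df

TRANSFORMATION_FLOW = [normalize_emails, drop_incomplete_rows, drop_duplicate_customers, verify]

def run_flow(df: pd.DataFrame) -> pd.DataFrame:
    # Execute the transformation stage of the ETL process, rule by rule.
    for rule in TRANSFORMATION_FLOW:
        df = rule(df)
    return df
```

Keeping the rules as plain functions in a list makes it easy to add the later, multi-source rules after the early, single-source ones, as described in step 2.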
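Finally, for step 5, a minimal back-flow sketch assuming the original source is a SQLite database; the file name, table name and `backflow_clean_data` helper are hypothetical:

```python
import sqlite3
import pandas as pd

def backflow_clean_data(clean_df: pd.DataFrame, db_path: str, table: str) -> None:
    # Replace the unclean table in the original source with the cleaned data,
    # so legacy applications read clean data and the cleansing need not be repeated.
    conn = sqlite3.connect(db_path)
    try:
        clean_df.to_sql(table, conn, if_exists="replace", index=False)
    finally:
        conn.close()

# Usage (file and table names are hypothetical):
# backflow_clean_data(run_flow(raw_customers), "crm.db", "customers")
```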