Data Wiping & Its Phases: Against Data Quality Problems
Data Wiping, Data Cleansing or Data Shredding is a necessary process to ensure the quality of the data to be used for analytics. This step is essential to minimize the risk of basing decision-making on inaccurate, erroneous or incomplete information.
Data Wiping deals with solving data quality problems at two levels:
· Problems related to data from a single source: at this level we find issues caused by missing integrity constraints or by a poorly designed schema, which mainly affect the uniqueness and the referential integrity of the data. In a more practical sense, this level also covers data-entry problems, such as redundancies or contradictory values, among others (both levels are illustrated in the sketch after this list).
· Problems related to data from multiple sources: as a general rule, these arise from the heterogeneity of data models and schemas, which can cause structural conflicts; at the instance level, they show up as duplications, contradictions and inconsistencies in the data.
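To make both levels more concrete, the following is a minimal sketch in Python with pandas; the tables (customers, orders, crm) and their columns are hypothetical and only serve to show how these kinds of problems can be detected.

```python
import pandas as pd

# Hypothetical single-source table and a related table from the same system.
customers = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],                       # key 2 is duplicated
    "email": ["a@x.com", None, "b@x.com", "c@x.com"],  # one missing value
    "country": ["ES", "ES", "ES", "PT"],
})
orders = pd.DataFrame({"order_id": [10, 11], "customer_id": [1, 99]})  # 99 has no customer

# Level 1: single-source problems (uniqueness, completeness, referential integrity).
duplicate_keys = customers[customers["customer_id"].duplicated(keep=False)]
missing_emails = customers[customers["email"].isna()]
orphan_orders = orders[~orders["customer_id"].isin(customers["customer_id"])]

# Level 2: multi-source problems (the same entity described differently in another system).
crm = pd.DataFrame({"customer_id": [1, 2], "country": ["ES", "FR"]})
merged = customers.merge(crm, on="customer_id", suffixes=("_app", "_crm"))
contradictions = merged[merged["country_app"] != merged["country_crm"]]

print(duplicate_keys, missing_emails, orphan_orders, contradictions, sep="\n\n")
```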
The phases of Data Wiping
The ultimate goal of any Data Wiping action is to improve the organization's trust in its data. To carry out a comprehensive data cleaning action, the following steps need to be followed (a minimal end-to-end sketch is given after the list):
1. Data analysis: its purpose is to determine which kinds of errors and inconsistencies need to be eliminated. In addition to the manual inspection of data samples, automation is necessary, in other words, the incorporation of programs that act on the metadata to detect data quality problems that affect its properties.
2. Definition of the transformation flow and mapping rules: depending on the number of data sources, their heterogeneity and the anticipated data quality problems, more or fewer steps will be needed in the transformation and adaptation stage. The most appropriate approach is to act at two levels: an early one, which corrects problems related to data from a single source and prepares it for proper integration, and a later one, which deals with problems in data coming from a variety of sources. To improve control over these procedures, it is convenient to define the ETL processes within a well-defined workflow framework.
3. Verification: the adequacy and effectiveness of a transformation action must always be tested and evaluated; this is one of the principles of Data Wiping. As a general rule, this validation is applied through multiple iterations of the analysis, design and verification steps, since some errors only become evident after a certain number of transformations have been applied to the data.
4. Transformation: consists of executing the ETL flow to load and refresh the data warehouse or, in the case of multiple source systems, while answering queries.
5. Clean data back-flow: once the quality errors have been eliminated, the "clean" data should replace the dirty data in the original sources, so that legacy applications can also benefit from it and future Data Wiping actions become unnecessary.
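Tying the five phases together, a minimal sketch in Python with pandas could look like the following; the table, the column names, the file targets and the cleaning rules are hypothetical and merely stand in for a real ETL flow.

```python
import pandas as pd

def analyze(df: pd.DataFrame) -> dict:
    """Phase 1 - data analysis: profile the data to see which problems exist."""
    return {
        "rows": len(df),
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_keys": int(df["customer_id"].duplicated().sum()),
    }

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Phase 2 - mapping rules, executed later in phase 4 as the ETL flow."""
    out = df.copy()
    out["email"] = out["email"].str.strip().str.lower()            # normalize values
    out = out.drop_duplicates(subset="customer_id", keep="first")  # enforce uniqueness
    out = out.dropna(subset=["email"])                             # drop incomplete rows
    return out

def verify(df: pd.DataFrame) -> None:
    """Phase 3 - verification: fail loudly if the cleaned data still breaks the rules."""
    assert df["customer_id"].is_unique, "duplicate keys survived the transformation"
    assert df["email"].notna().all(), "missing emails survived the transformation"

# Hypothetical dirty extract from a source system.
source = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "email": [" A@X.com", "a@x.com", None, "b@x.com"],
})

report = analyze(source)   # phase 1: what is wrong and how often
clean = transform(source)  # phases 2 and 4: apply the mapping rules as an ETL flow
verify(clean)              # phase 3: iterate on transform() until this passes

# Phase 4 (load) and phase 5 (back-flow): in a real scenario the targets would be
# the data warehouse and the original source system; plain CSV files are used here.
clean.to_csv("customers_warehouse.csv", index=False)  # load into the "warehouse"
clean.to_csv("customers_backflow.csv", index=False)   # push clean data back to the source
print(report)
```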