Data Shredding - Ways & Types
Data Shredding (called data cleansing in the DQS documentation) is the
process of analyzing the quality of the data in a data source, manually
approving or rejecting the system's suggestions, and thereby making changes
to the data. Data Shredding in Data Quality Services (DQS) includes a
computer-assisted process that analyzes how well the data conforms to the
knowledge in a knowledge base, and an interactive process that lets the data
steward review and modify the computer-assisted results to ensure that the
outcome is exactly as intended.
The data steward can also perform Data Shredding as part of an Integration
Services package. In this case, the data steward uses the DQS Cleansing
component in Integration Services to clean the data automatically against an
existing knowledge base.
The Data Shredding feature in DQS has the following benefits:
· Identifies incomplete or incorrect data in the data source (an Excel file
or a SQL Server database), and then corrects the invalid data or alerts you
to it.
· Provides a two-step process for Data Shredding: a computer-assisted step
and an interactive step. The computer-assisted step uses the knowledge in a
DQS knowledge base to process the data automatically and propose
replacements or corrections. The interactive step then lets the data steward
approve, reject, or modify the changes that DQS proposed during the
computer-assisted step.
· Standardizes and enriches customer data by using domain values, domain
rules, and reference data. For example, it standardizes term usage by
changing "St." to "Street", and enriches data by filling in missing
elements, changing "1 Microsoft way Redmond 98006" to "1 Microsoft Way,
Redmond, WA 98006".
· Provides a simple, intuitive, and consistent wizard-like interface that
lets the user navigate through a very large data set and inspect it for
errors.
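As a rough illustration of term standardization, the sketch below replaces known abbreviations with their standard terms. The `TERM_RELATIONS` table and the `standardize_terms` function are hypothetical names invented for this example; they are not part of DQS, which applies this kind of knowledge through the domains of its knowledge base.

```python
# Hypothetical lookup of abbreviation -> standard term (not a DQS structure).
TERM_RELATIONS = {"St.": "Street", "Ave.": "Avenue"}


def standardize_terms(value, relations=TERM_RELATIONS):
    """Replace every whitespace-separated token that appears in `relations`.

    Runs of whitespace are collapsed to a single space as a side effect.
    """
    return " ".join(relations.get(token, token) for token in value.split())


print(standardize_terms("12 Main St."))  # 12 Main Street
```

Matching whole tokens rather than substrings keeps a word that merely contains an abbreviation from being rewritten.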
Computer-assisted Data Shredding
The DQS Data Shredding process applies the knowledge base to the data to be
cleaned and proposes changes to the data. The data steward has access to
each proposed change and can evaluate and correct it. To perform Data
Shredding, the data steward proceeds as follows:
1. Create a data quality project, select the knowledge base to use as a
reference for analyzing and cleaning the source data, and select the
Cleansing activity. Multiple data quality projects can use the same
knowledge base.
2. Specify the database table or view, or the Excel file, that contains the
source data to be cleaned. The database or Excel file can be the same one
that was used for knowledge discovery, or a different one.
3. Map the data fields to be cleaned to the appropriate domains or composite
domains in the knowledge base. If a field is mapped to a composite domain,
the mapping is between the field and the composite domain as a whole, not
its individual domains. The mapped field is therefore cleaned according to
the rules specified for the composite domain, not those of the individual
domains within it. For more information about composite domains, see DQS
Knowledge Bases and Domains.
4. Run the computer-assisted process by clicking Start on the Cleanse page.
The Data Shredding process finds the best match for each data instance among
the known values in the data domain. The process applies data quality
knowledge to all of the source data, unlike the knowledge discovery process,
which runs on a percentage of the sample data.
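The idea of finding the best match among known domain values can be sketched with Python's standard `difflib` module. This is only a stand-in: the matching algorithms DQS actually uses are not described here, and `best_match` is a hypothetical helper written for this example.

```python
from difflib import SequenceMatcher


def best_match(value, domain_values):
    """Return (closest known value, similarity between 0.0 and 1.0)."""
    if not domain_values:
        raise ValueError("domain_values must not be empty")
    # Score every known value against the input, ignoring letter case.
    scored = [
        (SequenceMatcher(None, value.lower(), known.lower()).ratio(), known)
        for known in domain_values
    ]
    score, match = max(scored)
    return match, score


match, score = best_match("Seatle", ["Seattle", "Redmond", "Tacoma"])
print(match)  # Seattle
```

The similarity score returned here plays the role of a confidence level in this sketch; a real system would typically combine several signals rather than rely on one string ratio.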
The computer-assisted process produces the data quality information that
Data Quality Client displays during the interactive cleanup process. In
addition to applying syntax error rules, DQS uses reference data and
advanced algorithms to rate each proposed change by confidence level. The
confidence level indicates how certain DQS is of the correction or
suggestion. It is evaluated against the following thresholds:
· An auto-correction threshold, above which DQS proposes a change and makes
it unless the data steward rejects it. You can specify the auto-correction
threshold value on the General Settings tab of the Configuration screen. For
more information, see Configure Threshold Values for Cleansing and Matching.
· An auto-suggestion threshold, lower than the auto-correction threshold,
above which DQS suggests a change and makes it only if the data steward
approves it. You can specify the auto-suggestion threshold value on the
General Settings tab of the Configuration screen. For more information, see
Configure Threshold Values for Cleansing and Matching.
DQS leaves any value whose confidence level is below the auto-suggestion
threshold unchanged unless the data steward specifies a change.
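The two thresholds amount to a three-way decision on the confidence level, sketched below. The function name `classify`, the action labels, and the threshold values 0.8 and 0.7 are illustrative assumptions, as is treating a confidence exactly equal to a threshold as meeting it.

```python
def classify(confidence, auto_correct=0.8, auto_suggest=0.7):
    """Map a confidence level to the action described for the two thresholds."""
    if not 0.0 <= auto_suggest < auto_correct <= 1.0:
        raise ValueError("expected 0 <= auto_suggest < auto_correct <= 1")
    if confidence >= auto_correct:
        return "correct"  # change is made unless the data steward rejects it
    if confidence >= auto_suggest:
        return "suggest"  # change is made only if the data steward approves it
    return "leave"        # value stays as is unless a change is specified


for level in (0.95, 0.75, 0.40):
    print(level, classify(level))
```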
Interactive Data Shredding
Based on the computer-assisted cleanup process, DQS gives the data steward
the information needed to decide whether to change the data. DQS sorts the
data into the following five tabs:
Suggested: Values for which DQS found suggestions with a confidence level
higher than the auto-suggestion threshold but lower than the auto-correction
threshold. Review these values and approve or reject them as appropriate.
New: Valid values for which DQS does not have enough information (a
suggestion) and that therefore cannot be assigned to any other tab. This tab
also contains values with a confidence level lower than the auto-suggestion
threshold but high enough to mark them as valid.
Invalid: Values that were marked as invalid in the domain in the knowledge
base, or values that failed reference data or a domain rule. This tab also
contains values that the user rejects on any of the other four tabs during
the interactive cleanup process.
Corrected: Values that DQS corrected during the automated cleanup process
because it found a correction with a confidence level above the
auto-correction threshold. This tab also contains values for which the user
specified a correct value in the Correct to column during interactive
cleanup and then approved by clicking the radio button in the Approve column
on any of the other four tabs.
Correct: Values that were found to be correct, for example because the value
matched a domain value. If necessary, you can override the DQS cleanup on
this tab by rejecting values, or by specifying an alternative word in the
Correct to column and clicking the radio button in the Approve column. This
tab also contains values that the user approved during interactive cleanup
by clicking the radio button in the Approve column on the New or Invalid
tab.
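How a review decision moves a value between tabs can be summarized as a small transition function. This is a simplified reading of the tab descriptions above, not DQS code: `tab_after_review` is a hypothetical name, the tab holding corrected values is labeled `Corrected` in the sketch, and approving a value in a situation the text does not cover is assumed to leave it where it is.

```python
TABS = ("Suggested", "New", "Invalid", "Corrected", "Correct")


def tab_after_review(current_tab, decision, correct_to=None):
    """Return the tab a value lands on after the user approves or rejects it."""
    if current_tab not in TABS:
        raise ValueError(f"unknown tab: {current_tab!r}")
    if decision == "reject":
        return "Invalid"        # rejected values are collected on Invalid
    if decision == "approve":
        if correct_to is not None:
            return "Corrected"  # a replacement value was supplied and approved
        if current_tab in ("New", "Invalid"):
            return "Correct"    # value approved as is
        return current_tab      # assumption: nothing else changes
    raise ValueError(f"unknown decision: {decision!r}")


print(tab_after_review("New", "approve"))  # Correct
```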
The data steward uses Data Quality Client to review the changes that DQS
proposed and decide whether to implement them. You can verify that the
values DQS designated as correct are in fact correct. You can check whether
the changes that DQS already made with a high level of confidence should
stand. You can decide whether to approve changes that were suggested but not
made automatically. And you can review the values that were left unchanged,
in case you want to make a change that the computer-assisted process did not
find.
DQS merges the changes made by the data steward with the results of the
computer-assisted Data Shredding. These changes remain in the project; they
are not added to the knowledge base. During Data Shredding, the associated
knowledge base is read-only.
When the cleanup is complete, you can export the processed data to a new
table in a SQL Server database, to a .csv file, or to an Excel file. The
source data on which the Data Shredding was performed is kept in its
original state. The data steward can then use the separately exported clean
data to correct the actual source data.
Standardize Cleaned Data
You can choose whether to export the clean data in a standardized format,
based on the output format defined for each domain. When you create a
domain, you can select the format that is applied when the domain's data
values are output.
When you export the clean data on the Export page of the data quality
project wizard, you specify whether to export it in the standardized format
by selecting the Standardize Output check box. By default the check box is
selected, so the clean data is exported in the standardized format.
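The effect of the standardize option on an export can be sketched as follows. The per-domain formats in `OUTPUT_FORMATS`, the field names, and the `export_csv` function are assumptions made for this example; the sketch writes CSV text in memory and, like the export described above, leaves the source rows untouched.

```python
import csv
import io

# Hypothetical per-domain output formats (field name -> formatting function).
OUTPUT_FORMATS = {"City": str.title, "State": str.upper}


def export_csv(rows, standardize=True):
    """Return `rows` (a list of dicts) as CSV text without modifying them."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        out = dict(row)  # work on a copy so the source data keeps its state
        if standardize:
            for field, fmt in OUTPUT_FORMATS.items():
                if field in out:
                    out[field] = fmt(out[field])
        writer.writerow(out)
    return buffer.getvalue()


rows = [{"City": "redmond", "State": "wa"}]
print(export_csv(rows))                     # standardized (the default)
print(export_csv(rows, standardize=False))  # values exported as stored
```

Defaulting `standardize` to `True` mirrors the check box being selected by default.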