
Conducts & Categories of Data Shredding

 




Data Shredding is the process of analyzing the quality of the data in a data source, manually approving or rejecting the system's suggestions, and thereby modifying the data. In Data Quality Services (DQS), Data Shredding consists of two parts: a PC-assisted process that analyzes how well the data conforms to the knowledge in a knowledge base, and an interactive process in which the data manager reviews and modifies the PC-assisted results so that the outcome is exactly as intended.

The data manager can also perform Data Shredding as part of an Integration Services package. In this case, the data manager uses the DQS Cleanup component in Integration Services to clean data automatically against an existing knowledge base. The Data Shredding feature in DQS has the following benefits:

·         Identifies incomplete or incorrect data in the data source (an Excel file or a SQL Server database), and then either corrects the invalid data or alerts you to it.

·         Provides a two-step process for Data Shredding: a PC-assisted step and an interactive step. The PC-assisted step uses the knowledge in a DQS knowledge base to process the data automatically and propose replacements or corrections. The interactive step then lets the data manager approve, reject, or modify the changes that DQS proposed during the PC-assisted step.

·         Standardizes and enriches customer data with domain values, domain rules, and reference data. For example, it can standardize term usage by changing "St." to "Street", and enrich data by filling in missing elements, changing "1 Microsoft way Redmond 98006" to "1 Microsoft Way, Redmond, WA 98006".

·         Provides a simple, intuitive, and consistent wizard-like interface for navigating the data and inspecting it for errors, even in a very large data set.

PC-assisted Data Shredding

The DQS Data Shredding process applies the knowledge base to the data to be cleaned and proposes changes to that data. The data manager has access to each proposed change and can evaluate and correct it. To perform Data Shredding, the data manager proceeds as follows:

1.    Create a data quality project, select the knowledge base to use as a reference when analyzing and cleaning the source data, and select the Cleanup activity. Multiple data quality projects can use the same knowledge base.

 

2.    Specify the table or view from the database or an Excel file that contains the source data to be cleaned. The database or Excel file can be the same as the one used for knowledge discovery, or it can be another database or another Excel file.

3.    Map the data fields to be cleaned to the appropriate domains or composite domains in the knowledge base. If a field is mapped to a composite domain, the mapping is between the field and the composite domain as a whole, not the individual domains inside it. Likewise, the mapped field is cleaned according to the rules specified for the composite domain, not the rules of its individual domains. For more information about composite domains, see DQS Knowledge Bases and Domains.

 

4.    Run the PC-assisted process; to do this, click Start on the Cleanup page.

The Data Shredding process finds, for each data instance, the best match among the known values in the data domain. It applies data quality knowledge to all of the source data, unlike the knowledge discovery process, which runs on a sample of the data.
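To make the idea of a "best match" concrete, here is a minimal sketch in Python. It is not how DQS matches values internally (the text does not describe that algorithm); it substitutes the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure. The function name `best_match` and the sample city names are hypothetical, and the domain list is assumed to be non-empty.

```python
from difflib import SequenceMatcher


def best_match(value, domain_values):
    """Return (closest known value, similarity in [0, 1]) for one source value.

    Stand-in for the DQS matching step: similarity is a case-insensitive
    difflib ratio, which is an assumption made for illustration only.
    """
    scored = [
        (known, SequenceMatcher(None, value.lower(), known.lower()).ratio())
        for known in domain_values
    ]
    # Pick the known value with the highest similarity score.
    return max(scored, key=lambda pair: pair[1])


# Hypothetical domain values and source column.
domain_values = ["Seattle", "Redmond", "Bellevue"]
source_column = ["Redmond", "Seatle", "bellevue"]

# Every source value is scored, not just a sample of them.
matches = {value: best_match(value, domain_values) for value in source_column}
```

The similarity score plays the role of a confidence level here: an exact match scores 1.0, while a misspelling such as "Seatle" scores lower but still points at "Seattle".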

The results of the PC-assisted process are displayed as data quality information in the Data Quality Client during the interactive cleanup process. In addition to accounting for syntax error rules, DQS uses reference data and advanced algorithms to classify data by confidence level. The confidence level indicates how certain DQS is of a correction or suggestion. It is compared against the following thresholds:

·         An auto-correction threshold. When the confidence level is above this threshold, DQS proposes a change and applies it unless the data manager rejects it. You can specify the auto-correction threshold value on the General Settings tab of the Settings screen. For more information, see Configure Threshold Values for Cleanup and Match.

·         An auto-suggest threshold, set below the auto-correction threshold. When the confidence level is above this threshold, DQS suggests a change and applies it only if the data manager approves it. You can specify the auto-suggest threshold value on the General Settings tab of the Settings screen. For more information, see Configure Threshold Values for Cleanup and Match.

DQS leaves any value with a confidence level below the auto-suggest threshold as is unless the data manager specifies a change.
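The two thresholds can be sketched as a small decision function. This is an illustration of the rule described above, not DQS code: the threshold values 0.8 and 0.7, the function name `classify`, and the treatment of a confidence level exactly equal to a threshold are all assumptions.

```python
# Hypothetical threshold values; in DQS they are configured on the
# General Settings tab of the Settings screen.
AUTO_CORRECT_THRESHOLD = 0.8
AUTO_SUGGEST_THRESHOLD = 0.7


def classify(confidence,
             auto_correct=AUTO_CORRECT_THRESHOLD,
             auto_suggest=AUTO_SUGGEST_THRESHOLD):
    """Map a confidence level in [0, 1] to what happens to the value."""
    if auto_suggest > auto_correct:
        raise ValueError("auto-suggest threshold must not exceed auto-correction threshold")
    # Assumption: a confidence equal to a threshold counts as reaching it.
    if confidence >= auto_correct:
        return "correct"   # applied unless the data manager rejects it
    if confidence >= auto_suggest:
        return "suggest"   # applied only if the data manager approves it
    return "leave"         # left as is unless the data manager changes it
```

For example, with these assumed values `classify(0.9)` returns `"correct"`, `classify(0.75)` returns `"suggest"`, and `classify(0.5)` returns `"leave"`.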

Interactive Data Shredding

Based on the PC-assisted cleanup process, DQS provides the data manager with the information they need to make a decision on whether to change the data. DQS classifies the data in the following five tabs:

Suggested: Values for which DQS found a suggestion with a confidence level above the auto-suggest threshold but below the auto-correction threshold. Review these values and approve or reject them as appropriate.

New: Valid values for which DQS does not have enough information to assign them to any other tab. This tab also contains values with a confidence level below the auto-suggest threshold but high enough to be marked as valid.

Invalid: Values that were marked as invalid in the knowledge base domain, or that failed a domain rule or a reference data check. This tab also includes any value that the user rejects on one of the other four tabs during interactive cleanup.

Fixed: Values that DQS corrected during the PC-assisted cleanup process because it found a correction with a confidence level above the auto-correction threshold. This tab also contains values for which the user specified a replacement in the Correct To column and then approved it by clicking the radio button in the Approve column on any of the other four tabs.

Correct: Values that were found to be correct, for example because the value matches a domain value. If necessary, you can override the DQS result by rejecting a value on this tab, or by specifying an alternative in the Correct To column and clicking the radio button in the Approve column. This tab also contains values that the user approved during interactive cleanup by clicking the radio button in the Approve column on the New or Invalid tab.
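The way a value moves between tabs after a manual decision can be sketched as follows. This is a rough model of the rules stated in the tab descriptions, not DQS code. The function name `apply_decision` and the decision strings are hypothetical, and one case is an assumption the text does not spell out: approving a value on the Suggested tab is modeled as approving its proposed replacement, so the caller passes that replacement in `correct_to`.

```python
TABS = ("Suggested", "New", "Invalid", "Fixed", "Correct")


def apply_decision(tab, decision, correct_to=None):
    """Return the tab a value ends up on after the data manager's decision.

    tab        -- tab the value is currently on
    decision   -- "approve" or "reject"
    correct_to -- replacement value being approved, if any
    """
    if tab not in TABS:
        raise ValueError(f"unknown tab: {tab}")
    if decision == "reject":
        # A value rejected on any tab is listed as invalid.
        return "Invalid"
    if decision == "approve":
        if correct_to is not None:
            # An approved replacement counts as a fix.
            return "Fixed"
        if tab in ("New", "Invalid"):
            # Approving the value as it stands marks it correct.
            return "Correct"
        return tab
    raise ValueError(f"unknown decision: {decision}")
```

For instance, `apply_decision("New", "approve")` returns `"Correct"`, while `apply_decision("Invalid", "approve", correct_to="Street")` returns `"Fixed"`.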

 

The data manager uses the Data Quality Client to view the changes proposed by DQS and decide whether to implement them. You can check whether the values that DQS has designated as correct really are correct, and whether the changes DQS has already made with a high level of confidence should stand. You can decide whether to approve the suggested changes. You can also review the values that were left unchanged, in case you want to make a change that the PC-assisted process did not find.

DQS will combine the changes made by the data manager with the results of the PC-assisted Data Shredding. These changes will remain in the project; however, they will not be added to the knowledge base. During Data Shredding, the associated knowledge base is read-only.

When the data cleanup is complete, you can export the processed data to a new table in a SQL Server database, to a .csv file, or to an Excel file. The source data on which Data Shredding was performed is kept in its original state. The data manager can then use the separately exported clean data to correct the actual source data.

Standardizing Clean Data

You can choose whether to export the clean data in a standardized format, based on the output format defined for each domain. When you create a domain, you can select the format that is applied when the domain's data values are output.

When you export clean data on the Export page of the Data Quality Project Wizard, you specify whether to export it in the standardized format by selecting or clearing the Standardize output check box. By default the check box is selected, so the clean data is exported in the standardized format.
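As a rough illustration of the standardize-on-export choice, the sketch below writes cleaned rows to CSV text and applies a per-domain output format only when a `standardize` flag is set. Everything here is hypothetical: the `OUTPUT_FORMATS` mapping, the `export_rows` function, and the sample row stand in for domain output formats and the Standardize output check box, and Python's `csv` module stands in for the DQS export.

```python
import csv
import io

# Hypothetical output formats, one per domain (column).
OUTPUT_FORMATS = {
    "Street": str.title,  # capitalize each word
    "City": str.upper,    # upper case
}


def export_rows(rows, standardize=True):
    """Return CSV text for the cleaned rows; the input rows are not modified."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    for row in rows:
        out = dict(row)  # copy, so the cleaned data keeps its original form
        if standardize:
            for field, fmt in OUTPUT_FORMATS.items():
                if field in out:
                    out[field] = fmt(out[field])
        writer.writerow(out)
    return buffer.getvalue()


cleaned = [{"Street": "1 microsoft way", "City": "Redmond"}]
standardized_csv = export_rows(cleaned)               # default: standardized
raw_csv = export_rows(cleaned, standardize=False)     # values exported as they are
```

Defaulting `standardize` to `True` mirrors the default described above, where the check box is selected unless you clear it.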
