The Importance of Data Pre-Processing in Artificial Intelligence: Data Cleansing
In recent years, the amount of data being generated has increased considerably. Today, in a single second, almost 9,000 tweets are published, more than 900 photographs are uploaded to Instagram, more than 80,000 searches are carried out on Google and close to 3 million emails are sent. This is a simple illustration of the immense amount of information we generate every day. According to the World Economic Forum, by 2025, 463 exabytes of data will be generated daily, the equivalent of more than 212 million DVDs.
In the same way that a cook needs to wash the vegetables, peel the potatoes, marinate the meat and measure the quantities carefully before starting to prepare a dish (among other things), a data scientist needs to perform a series of preliminary operations before starting to solve the problem, in order to make the data “edible”. Given the large amount of data that is generated, this first step is usually of vital importance, both to correct the many deficiencies we may find and to be able to really extract the information that is relevant to our problem.
According to different surveys of data scientists, around 80% of working time is spent on obtaining, cleansing and organizing data, while only 3% of the time is spent building machine learning models.
As we have discussed in previous posts, solving a machine learning problem consists of optimizing a mathematical function. To fulfill this premise, we have to work with numerical data that can be used in a mathematical context.
Generally, when we receive a set of data, in addition to numerical values we must in most cases also work with unstructured data, that is, data that is not governed by a specific scheme, such as texts, images, videos, etc. All of this data, if we want to use it to train our model, must be converted into numbers. In this post we are going to talk about data cleansing tools, and in the next Artificial Intelligence post on the Xeridia blog we will talk in more depth about data transformation tools.
Data Cleansing Techniques in Artificial Intelligence
Data cleansing tools are those that allow us to fix specific defects in the data set we are dealing with that negatively affect, or can interfere with, machine learning. The most common defects that are usually treated with these tools are:
· Absence of values: It is very common that, if we are working with a large data set, some values are empty. This can occur for a myriad of reasons (no actual measurement was taken, an error when storing the information, an error when retrieving it, etc.). The main problem with these absences is that they prevent the machine learning system from training correctly, since missing data cannot be treated numerically. There are several approaches to solving this problem, depending on the data we are dealing with:
o Interpolation: If we are dealing with temporal data, a common policy is to interpolate the missing value from the nearby data points.
o Fill with a fixed value: such as the mean, the mode or even the value 0.
o Fill using regression: There are techniques that try to predict the missing value using the rest of the variables in our data set. One of these methods is MICE (Multivariate Imputation by Chained Equations).
o Consider the void as a category: If the variable where data is missing is categorical, we can add an extra category that groups together all the records whose field is empty.
o Delete the complete record: If none of the previous techniques is adequate to fill in the empty values, it is sometimes decided to discard the record and work exclusively with those that are complete.
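As a rough illustration of these options, here is a minimal sketch using pandas and scikit-learn; the column names and values are invented for the example, and each of the strategies above appears once.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data set with missing values (all names and values are invented).
df = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.5, 24.0, np.nan, 22.1],  # time-ordered readings
    "humidity":    [55.0, 54.0, np.nan, 50.0, 49.5, np.nan],
    "city":        ["Leon", None, "Madrid", "Leon", None, "Madrid"],
})

# 1) Interpolation: sensible for temporal data; a gap is filled from its neighbours.
df["temperature"] = df["temperature"].interpolate()

# 2) Fill with a fixed value: here the column mean (the mode or 0 are also common).
df["humidity_mean_filled"] = df["humidity"].fillna(df["humidity"].mean())

# 3) Fill using regression: MICE-style iterative imputation predicts each missing
#    value from the remaining numeric variables.
numeric_cols = ["temperature", "humidity"]
df[numeric_cols] = IterativeImputer(random_state=0).fit_transform(df[numeric_cols])

# 4) Consider the void as a category: empty fields become their own label.
df["city"] = df["city"].fillna("unknown")

# 5) Delete the complete record: keep only fully observed rows.
df = df.dropna()

print(df)
```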
· Data inconsistency: It often happens that, when we are processing data, we detect errors in the format or in the type of some of the values. This may be due to an error when reading the data or to poor data storage. For example: dates that usually begin with the day of the month but in certain records begin with the year, values that should be numeric but include other types of characters, etc. There are many validations that must be checked and resolved so that the data is consistent. Depending on the variable we are analyzing, different validation techniques should be applied, which may sometimes require expert knowledge of the problem.
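A minimal sketch of this kind of validation, again on invented records: pandas' to_datetime and to_numeric can coerce malformed entries to NaT/NaN so they can be inspected or handled like any other missing value. The format string and column names here are assumptions for the example.

```python
import pandas as pd

# Invented records with formatting problems: one date starts with the year
# instead of the day, and one "numeric" field contains extra characters.
raw = pd.DataFrame({
    "date":  ["03/11/2021", "2021/11/04", "05/11/2021"],
    "price": ["19.99", "21,50 EUR", "18.75"],
})

# Coerce dates with the expected day-first format; non-conforming values become NaT.
raw["date_parsed"] = pd.to_datetime(raw["date"], format="%d/%m/%Y", errors="coerce")

# Strip known noise and coerce to numbers; anything still invalid becomes NaN.
cleaned_price = (raw["price"]
                 .str.replace(",", ".", regex=False)
                 .str.replace(" EUR", "", regex=False))
raw["price_num"] = pd.to_numeric(cleaned_price, errors="coerce")

# Rows flagged here need a closer look (or one of the missing-value strategies above).
print(raw[raw["date_parsed"].isna() | raw["price_num"].isna()])
```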
· Duplicate values: It may happen that we find duplicate records in our data set. It is important to detect these records and eliminate the extra copies of any that appear more than once. Not doing so could lead to the duplicated item being given more weight than the rest of the data by the machine learning method and, therefore, to biased training.
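Detecting and removing repeated records is usually a one-liner. The sketch below, on an invented table, uses pandas' duplicated and drop_duplicates, keeping the first occurrence of each record.

```python
import pandas as pd

# Invented data set in which one record appears twice.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "amount":      [250.0, 80.0, 80.0, 120.0],
})

print("Duplicated rows:\n", df[df.duplicated()])  # shows the repeated record(s)

# Keep a single copy of each record so no observation weighs more than the rest.
df = df.drop_duplicates(keep="first").reset_index(drop=True)
print(df)
```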
· Outliers: Due to storage, measurement or insertion errors, outliers (anomalous data) may appear in any of our fields. These values can greatly distort the distribution of the data, affecting the entire learning process. There are many techniques to try to detect this anomalous data. We will expand on this topic in future blog posts.
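As a first, simple approach (ahead of the techniques we will cover in future posts), the classic interquartile-range rule flags values that fall far outside the bulk of a column; the measurements below are invented for the example.

```python
import pandas as pd

# Invented measurements with one clearly anomalous value.
values = pd.Series([12.1, 11.8, 12.4, 12.0, 11.9, 95.0, 12.2])

# Interquartile-range (IQR) rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # the 95.0 reading is flagged

# Keep only the values inside the accepted range.
filtered = values[(values >= lower) & (values <= upper)]
```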
In the next blog post, we will address the data transformation tools that will allow us to adapt all the information we have to make it valid for training a machine learning model. Subscribe now and stay up to date on the following posts on Artificial Intelligence.