The Importance of Data Pre-Processing in Artificial Intelligence: Data Cleansing
In recent years, the amount of data being generated has increased considerably. Today, in a single second, almost 9,000 tweets are published, more than 900 photographs are uploaded to Instagram, more than 80,000 searches are carried out on Google and close to 3 million emails are sent. This is a simple illustration of the immense amount of information we generate every day. According to the World Economic Forum, by 2025, 463 exabytes of data will be generated daily, the equivalent of more than 212 million DVDs.
In the same way that a cook needs to wash the vegetables, peel the potatoes, marinate the meat and measure the quantities carefully before starting to prepare a dish (among other things), a data scientist needs to perform a series of preliminary operations before starting to solve the problem, in order to make the data “edible”. Given the large amount of data that is generated, this first step is usually of vital importance, both to correct the many deficiencies we may find and to be able to really extract the information that is relevant to our problem.
According to different surveys of data scientists, around 80% of working time is spent on obtaining, cleansing and organizing data, while only 3% of the time is spent building machine learning models.
As we have discussed in previous posts, solving a machine learning problem consists of optimizing a mathematical function. To fulfill this premise, we have to work with numerical data that can be used in a mathematical context.
Generally, when we receive a set of data, in addition to numerical values we must in most cases also work with unstructured data, that is, data that is not governed by a specific scheme, such as texts, images, videos, etc. All of this data, if we want to use it to train our model, must be converted into numbers. In this post we are going to talk about data cleansing tools, and in the next Artificial Intelligence post on the Xeridia blog we will talk in more depth about data transformation tools.
Data Cleansing Techniques in Artificial Intelligence
Data cleansing tools are those that allow us to fix specific defects in the data set we are dealing with that negatively affect, or can interfere with, machine learning. The most common defects that are usually treated with these tools are:
· Absence of values: It is very common that, if we are working with a large data set, some values are empty. This can occur for a myriad of reasons (no actual measurement was taken, an error when storing the information, an error when retrieving it, etc.). The main problem with these absences is that they prevent the machine learning system from training correctly, since missing data cannot be treated numerically. There are several approaches to solving this problem, depending on the data we are dealing with:
o Interpolation: If we are dealing with temporal data, a common policy is to interpolate the missing value from the nearby data points.
o Fill with a fixed value: such as the mean, the mode or even the value 0.
o Fill using regression: There are techniques that try to predict the missing value using the rest of the variables in our data set. One of these methods is MICE (Multivariate Imputation by Chained Equations).
o Consider the void as a category: If the variable where data is missing is categorical, we can add an extra category that groups together all the records whose field is empty.
o Delete the complete record: If none of the previous techniques is adequate to fill in the empty values, it is sometimes decided to discard the record and work exclusively with those that are complete.
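As a rough illustration of these options, here is a minimal sketch using pandas and scikit-learn; the column names and values are invented for the example, and each of the strategies above appears once.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data set with missing values (all names and values are invented).
df = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.5, 24.0, np.nan, 22.1],  # time-ordered readings
    "humidity":    [55.0, 54.0, np.nan, 50.0, 49.5, np.nan],
    "city":        ["Leon", None, "Madrid", "Leon", None, "Madrid"],
})

# 1) Interpolation: sensible for temporal data; a gap is filled from its neighbours.
df["temperature"] = df["temperature"].interpolate()

# 2) Fill with a fixed value: here the column mean (the mode or 0 are also common).
df["humidity_mean_filled"] = df["humidity"].fillna(df["humidity"].mean())

# 3) Fill using regression: MICE-style iterative imputation predicts each missing
#    value from the remaining numeric variables.
numeric_cols = ["temperature", "humidity"]
df[numeric_cols] = IterativeImputer(random_state=0).fit_transform(df[numeric_cols])

# 4) Consider the void as a category: empty fields become their own label.
df["city"] = df["city"].fillna("unknown")

# 5) Delete the complete record: keep only fully observed rows.
df = df.dropna()

print(df)
```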
· Data inconsistency: It often happens that, when we are processing data, we detect errors in the format or in the type of some of the values. This may be due to an error when reading the data or to poor data storage. For example: dates that usually begin with the day of the month but in certain records begin with the year, values that should be numeric but include other types of characters, etc. There are many validations that must be checked and resolved so that the data is consistent. Depending on the variable we are analyzing, different validation techniques should be applied, which may sometimes require expert knowledge of the problem.
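A minimal sketch of this kind of validation, again on invented records: pandas' to_datetime and to_numeric can coerce malformed entries to NaT/NaN so they can be inspected or handled like any other missing value. The format string and column names here are assumptions for the example.

```python
import pandas as pd

# Invented records with formatting problems: one date starts with the year
# instead of the day, and one "numeric" field contains extra characters.
raw = pd.DataFrame({
    "date":  ["03/11/2021", "2021/11/04", "05/11/2021"],
    "price": ["19.99", "21,50 EUR", "18.75"],
})

# Coerce dates with the expected day-first format; non-conforming values become NaT.
raw["date_parsed"] = pd.to_datetime(raw["date"], format="%d/%m/%Y", errors="coerce")

# Strip known noise and coerce to numbers; anything still invalid becomes NaN.
cleaned_price = (raw["price"]
                 .str.replace(",", ".", regex=False)
                 .str.replace(" EUR", "", regex=False))
raw["price_num"] = pd.to_numeric(cleaned_price, errors="coerce")

# Rows flagged here need a closer look (or one of the missing-value strategies above).
print(raw[raw["date_parsed"].isna() | raw["price_num"].isna()])
```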
· Duplicate values: It may happen that we find duplicate records in our data set. It is important to detect these records and eliminate the extra copies of any that appear more than once. Not doing so could lead to the duplicated item being given more weight than the rest of the data by the machine learning method and, therefore, to biased training.
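Detecting and removing repeated records is usually a one-liner. The sketch below, on an invented table, uses pandas' duplicated and drop_duplicates, keeping the first occurrence of each record.

```python
import pandas as pd

# Invented data set in which one record appears twice.
df = pd.DataFrame({
    "customer_id": [101, 102, 102, 103],
    "amount":      [250.0, 80.0, 80.0, 120.0],
})

print("Duplicated rows:\n", df[df.duplicated()])  # shows the repeated record(s)

# Keep a single copy of each record so no observation weighs more than the rest.
df = df.drop_duplicates(keep="first").reset_index(drop=True)
print(df)
```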
· Outliers: Due to storage, measurement or insertion errors, outliers (anomalous data) may appear in any of our fields. These values can greatly distort the distribution of the data, affecting the entire learning process. There are many techniques to try to detect this anomalous data. We will expand on this topic in future blog posts.
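As a first, simple approach (ahead of the techniques we will cover in future posts), the classic interquartile-range rule flags values that fall far outside the bulk of a column; the measurements below are invented for the example.

```python
import pandas as pd

# Invented measurements with one clearly anomalous value.
values = pd.Series([12.1, 11.8, 12.4, 12.0, 11.9, 95.0, 12.2])

# Interquartile-range (IQR) rule: flag points beyond 1.5 * IQR from the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers)  # the 95.0 reading is flagged

# Keep only the values inside the accepted range.
filtered = values[(values >= lower) & (values <= upper)]
```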
In the next blog post, we will address the data transformation tools that will allow us to adapt all the information we have to make it valid for training a machine learning model. Subscribe now and stay up to date on the following posts on Artificial Intelligence.