Significance of Data Pre-Processing in Artificial Intelligence
In recent years, the amount of data we generate has increased considerably. Every second, almost 9,000 tweets are published, more than 900 photographs are uploaded to Instagram, more than 80,000 searches are carried out on Google and close to 3 million emails are sent. This gives a simple idea of the immense amount of information we produce every day. According to the World Economic Forum, by 2025 some 463 exabytes of data will be generated daily, the equivalent of more than 212 million DVDs.
In the same way that a cook needs to wash vegetables, peel potatoes, marinate meat and measure the quantities well before starting to prepare a dish (among other things), a data scientist needs to carry out a series of preliminary operations before starting to solve the problem, in order to make the data “edible”. Given the large amount of data that is generated, this first step is usually of vital importance, both to correct the many deficiencies we may find and to really extract the information that is relevant to our problem. According to different surveys of data scientists, around 80% of working time is spent obtaining, cleaning and organizing data, while only about 3% is spent building machine learning models.
As we have discussed in previous posts, solving a machine learning problem consists of optimizing a mathematical function. To fulfil this premise, we have to work with numerical data that can be used in a mathematical context.
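As a purely illustrative sketch (the notation is generic and not taken from any particular model), training a supervised model can be written as finding the parameters theta that minimize an average loss L over N examples:

\theta^{*} = \arg\min_{\theta} \frac{1}{N} \sum_{i=1}^{N} L\left( f_{\theta}(x_i),\, y_i \right)

The expression only makes sense if the inputs x_i and the targets y_i are numerical, which is exactly why the pre-processing described below is needed.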
Generally, when we receive a data set, in addition to numerical values we must in most cases deal with unstructured data, that is, data that is not governed by a specific scheme, such as texts, images, videos, etc. All of this data, if we want to use it to train our model, must be converted into numbers. In this post we are going to talk about data cleaning tools, and in the next Artificial Intelligence post on the Xeridia blog we will talk in more depth about data transformation.
Data Cleaning Techniques in Artificial Intelligence
Data cleaning tools allow us to fix specific errors in the data set we are dealing with that negatively affect machine learning. The most common defects that are usually treated with these tools are listed below; after the list, a few short code sketches illustrate how each one can be handled.
· Absence of values: It is very common, when working with a large data set, for some values to be empty. This can occur for a myriad of reasons (the measurement was never taken, an error when storing the information, an error when retrieving it, etc.). The main problem with these absences is that they prevent the machine learning system from training correctly, since missing data cannot be treated numerically. There are several approaches to solving this problem, depending on the data we are dealing with:
o Interpolation: If we are dealing with temporal data, a common policy is to interpolate each missing value from the nearby data points.
o Fill with a fixed value: Such as the mean, the mode or even the value 0.
o Fill using regression: There are techniques that try to predict the missing value from the rest of the variables in our data set. One of these methods is MICE (Multivariate Imputation by Chained Equations).
o Consider emptiness as a category: If the variable with missing data is categorical, we can add an extra category that groups together all the records whose field is empty.
o Delete the complete record: If none of the previous techniques is adequate for filling in the empty values, it is sometimes decided to discard that record and work exclusively with those that are complete.
· Data inconsistency: It often happens that, when processing data, we detect errors in the format or type of some of the values. This may be due to a reading error or to poor data storage. For example: dates that should always begin with the day of the month but in certain records begin with the year, values that should be numeric but include other types of characters, etc. There are many validations that must be checked and, where possible, fixed so that the data is consistent. Depending on the variable being analysed, different validation techniques must be applied, which may sometimes require expert knowledge of the problem.
· Duplicate values: We may occasionally find duplicate records in our data set. It is important to detect these records and eliminate the extra copies of those that appear more than once. Failing to do so could cause the machine learning method to give the duplicated item more weight than the rest of the data, and therefore to train in a biased way.
· Outliers: Due to storage, measurement or insertion errors, outliers (anomalous data points) may appear in any of our fields. These values can greatly distort the distribution of the data, affecting the entire learning process. There are many techniques for detecting this anomalous data; we will expand on this topic in future blog posts.
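To make the missing-value strategies listed above more concrete, here is a minimal sketch in Python using pandas and scikit-learn. The toy DataFrame and its column names are invented for illustration, and IterativeImputer is used as a MICE-style (chained-equations) imputer; it is still marked experimental in scikit-learn, hence the extra import.

import numpy as np
import pandas as pd

# IterativeImputer (a MICE-style chained-equations imputer) is still experimental
# in scikit-learn, so it has to be enabled explicitly before importing it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data set with missing values (all names and values are invented for illustration).
df = pd.DataFrame({
    "temperature": [21.0, np.nan, 23.5, 22.0, np.nan],
    "humidity":    [0.40, 0.42, np.nan, 0.38, 0.41],
    "city":        ["Leon", None, "Madrid", "Leon", "Madrid"],
})

# 1) Interpolation: useful for temporal data, fills each gap from the nearby points.
df["temperature_interp"] = df["temperature"].interpolate()

# 2) Fill with a fixed value: here the column mean (the mode or 0 are other options).
df["humidity_filled"] = df["humidity"].fillna(df["humidity"].mean())

# 3) Fill using regression (MICE-style): predict each missing value from the other variables.
imputed = IterativeImputer(random_state=0).fit_transform(df[["temperature", "humidity"]])
df["temperature_mice"] = imputed[:, 0]
df["humidity_mice"] = imputed[:, 1]

# 4) Consider emptiness as a category for categorical variables.
df["city_filled"] = df["city"].fillna("missing")

# 5) Delete the complete record when no filling strategy is appropriate.
df_complete = df.dropna(subset=["temperature", "humidity", "city"])

print(df)
print(df_complete)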
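For data inconsistency, a common first step is to coerce each column to the type and format it is supposed to have and flag whatever does not fit. A minimal pandas sketch, where the column names and the expected date format are assumptions made for the example:

import pandas as pd

# Toy records with inconsistent formats (all values invented for illustration).
df = pd.DataFrame({
    "date":   ["03/12/2021", "2021/12/04", "05/12/2021"],  # one record starts with the year
    "amount": ["19.90", "twenty", "7.50"],                  # one value is not numeric
})

# Coerce to the expected types; anything that does not match becomes NaT/NaN
# instead of silently corrupting the column.
df["date_parsed"] = pd.to_datetime(df["date"], format="%d/%m/%Y", errors="coerce")
df["amount_parsed"] = pd.to_numeric(df["amount"], errors="coerce")

# Records that failed validation can then be reviewed (possibly with expert
# knowledge of the problem) or corrected with a second parsing pass.
invalid = df[df["date_parsed"].isna() | df["amount_parsed"].isna()]
print(invalid)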
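Detecting and removing duplicate records is usually straightforward; the sketch below (again with an invented toy DataFrame) keeps a single copy of each repeated row so that no record weighs more than the rest during training.

import pandas as pd

# Toy data set with an exact duplicate record (values invented for illustration).
df = pd.DataFrame({
    "user":  ["ana", "luis", "ana", "marta"],
    "spend": [10.0, 25.5, 10.0, 7.0],
})

# Inspect which rows are repeated...
print(df[df.duplicated(keep=False)])

# ...and keep only one copy of each.
df_unique = df.drop_duplicates(keep="first").reset_index(drop=True)
print(df_unique)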
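Finally, one very simple way of flagging outliers is the interquartile-range (IQR) rule; it is only one of the many detection techniques mentioned above, and the values and the 1.5 threshold are assumptions made for the example.

import pandas as pd

# Toy numerical column with one clearly anomalous value (invented for illustration).
s = pd.Series([12.1, 11.8, 12.4, 12.0, 250.0, 11.9])

# IQR rule: values further than 1.5 * IQR beyond the quartiles are flagged as outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers)  # the 250.0 measurement is flagged for review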
In the next blog post, we will address the data transformation tools that will allow us to adapt all the information we have to make it valid for training a machine learning model.