Is messy, unclean data getting you down? Here is a general guide to help you clean it.
Businesses gather data. A lot of it, often. However, recording and possessing data is only a small part of analyzing business records. Data must be arranged and must be available for statistical analysis and training Machine Learning models. This process of making sense of data is called feature engineering.
Let us introduce more formally what a feature is. A “feature” is a property or attribute of an object or event. A Machine Learning model or a data scientist uses features to analyze the dataset and predict future events. Now that we have defined what a feature is, let us introduce the idea of “feature engineering”.
“Feature engineering” is the process of extracting and refining “features” from raw, freshly collected data. “Feature engineering” is to “features” as “cooking” is to “vegetables”.
That is no easy task. Data scientists often spend a large chunk of their professional lives doing precisely this.
Feature engineering has no rigid rules since it requires domain knowledge of the dataset. The requirement of feature engineering is often problem-dependent. Nevertheless, feature engineering generally involves the following procedures:
1. Imputation
2. Handling Outliers
3. Log Transform
4. One-Hot Encoding
5. Grouping
6. Feature Split
7. Scaling
8. Extracting Dates
The methods are described below.
Imputation is the process of handling missing or invalid values. Missing values are one of the most perplexing problems faced by Data Scientists and Analysts. Often, numerous cells in a dataset are filled with out-of-range values, NaN values and other null values. Dropping rows with missing values is a quick fix to this problem. However, that can waste valuable information. Dropping columns which have a large share of missing cells is often a better option. There is no single optimal cut-off, but dropping columns with more than about 70% missing values is often adequate.
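As a rough illustration of the column-dropping rule above, here is a minimal sketch in plain Python. The function names, the toy rows and the choice to represent a dataset as a list of dictionaries are assumptions made for this example only; the 70% cut-off is the rule of thumb mentioned in the text.

```python
import math

def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def drop_sparse_columns(rows, threshold=0.7):
    """Drop every column whose share of missing values exceeds `threshold`."""
    if not rows:
        return []
    kept = []
    for col in rows[0].keys():
        missing_share = sum(is_missing(row.get(col)) for row in rows) / len(rows)
        if missing_share <= threshold:
            kept.append(col)
    return [{col: row.get(col) for col in kept} for row in rows]

# Hypothetical toy data: "fax" is missing in 3 of 4 rows (75%), "age" in 1 of 4.
rows = [
    {"age": 34, "fax": None},
    {"age": None, "fax": None},
    {"age": 51, "fax": "555-0100"},
    {"age": 29, "fax": None},
]
cleaned = drop_sparse_columns(rows)  # "fax" is dropped, "age" is kept
```

The row count stays the same here; only the mostly empty column is removed.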
Numerical Imputation is often preferable to dropping entire rows and columns. In this process, a constant or statistically derived value replaces the missing values. Numerical Imputation is preferred because it preserves the dataset size. Imputation with zero and Imputation with the median value are popular methods of Numerical Imputation. For categorical data, imputing a category like “other” is a quick fix for missing values.
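The two imputation styles described above might look like this in plain Python. This is only a sketch: the helper names and sample columns are hypothetical, and missing values are assumed to be represented as None.

```python
import statistics

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def impute_category(values, placeholder="other"):
    """Replace None entries in a categorical column with a placeholder label."""
    return [placeholder if v is None else v for v in values]

# Hypothetical columns with gaps.
ages = [34, None, 51, 29, None]
colours = ["red", None, "blue"]

filled_ages = impute_median(ages)          # [34, 34, 51, 29, 34]
filled_colours = impute_category(colours)  # ["red", "other", "blue"]
```

Both helpers keep the column length unchanged, which is the main appeal of imputation over dropping rows.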
Handling Outliers is much more difficult. The first step in Outlier Handling is detecting the outliers. The best way to do this is to visualize the data, for example with a box plot or a scatter plot. Statistical Methodologies are less precise but fast. Popular means of Outlier Detection with Statistical Methodologies are:
1. Outlier Detection with Standard Deviation
2. Outlier Detection with Percentiles
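Both detection methods listed above can be sketched with Python's standard `statistics` module. The function names, the sample data, the factor of 2 standard deviations and the 5th/95th percentile limits are illustrative assumptions, not fixed rules; suitable values depend on the dataset.

```python
import statistics

def outliers_by_std(values, factor=3.0):
    """Flag values further than `factor` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    lower, upper = mean - factor * std, mean + factor * std
    return [v for v in values if v < lower or v > upper]

def outliers_by_percentile(values, lower_pct=5, upper_pct=95):
    """Flag values below the lower or above the upper percentile."""
    # quantiles(n=100) returns 99 cut points; index i - 1 is the i-th percentile.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    lower, upper = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [v for v in values if v < lower or v > upper]

# Hypothetical measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]

by_std = outliers_by_std(values, factor=2.0)
by_pct = outliers_by_percentile(values)
```

On this small sample both methods flag only the value 95. With so few points a single extreme value also inflates the standard deviation, which is why a factor of 2 is used here instead of the function's default of 3.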
A common way to handle outliers is “Capping”. We choose a lower and an upper limit, for example from percentiles or from the standard deviation, and compare each value against them.
So, instead of dropping the row altogether, we replace the value in it with the value of the limit it exceeds.
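Capping as described above can be sketched in a few lines. The function name, the sample values and the limits of 0 and 50 are hypothetical; in practice the limits would come from one of the detection methods.

```python
def cap(values, lower, upper):
    """Clamp each value into [lower, upper] instead of dropping its row."""
    return [min(max(v, lower), upper) for v in values]

# Hypothetical column with one value above and one below the chosen limits.
capped = cap([10, 12, 95, -40, 11], lower=0, upper=50)
# 95 is replaced by the upper limit 50, and -40 by the lower limit 0.
```

Values already inside the limits pass through unchanged, so the number of rows is preserved.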
Log Transform or Logarithmic Transformation is simple to implement on continuous numerical data, provided the values are positive. Log Transform replaces each value in a column with its logarithm. The choice of the base of this logarithm is up to the programmer or the analyst. When the column contains zeros, a common variant is to take log(x + 1) instead.
This simple method can often work wonders on a skewed dataset. Log Transform dampens the effect of large outliers and reduces the skewness of the data. However, it works best when the data is right-skewed, roughly following a log-normal distribution.
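A minimal sketch of a Log Transform in plain Python follows. It assumes the natural logarithm and the log(1 + x) variant so that zeros are allowed; both are choices made for this example, and the income figures are made up.

```python
import math

def log_transform(values):
    """Apply log(1 + x) to each value; every x must be greater than -1."""
    return [math.log1p(v) for v in values]

# Hypothetical right-skewed column: most values are small, one is very large.
incomes = [0, 9, 99, 999, 99999]
logged = log_transform(incomes)
# The raw values span five orders of magnitude; the logged values stay below 12.
```

The transform preserves the ordering of the values while compressing the distance between the largest value and the rest.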
One-Hot Encoding converts a categorical column into several numerical columns, one per category, each holding only 0 or 1. Numerical data of this kind is more convenient for Machine Learning programs to understand.
Let’s say a column holds 3 categories that repeat across its rows.
After One-Hot Encoding, it becomes 3 columns, one per category, and each row has a 1 in the column matching its original category and a 0 in the other two.
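To make the idea concrete, here is a small sketch of One-Hot Encoding in plain Python. The colour column, the `is_` prefix for the new column names and the function name are all hypothetical choices for illustration.

```python
def one_hot(values):
    """Turn a categorical column into one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical column with 3 categories, one of them repeated.
colours = ["red", "green", "blue", "green"]
encoded = one_hot(colours)
# encoded[0] is {"is_blue": 0, "is_green": 0, "is_red": 1}
```

Every encoded row contains exactly one 1, so the original category can always be recovered from the new columns.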
Grouping is another crucial step in Feature Engineering. In this process, individual observations of a variable are aggregated into groups. The frequency distribution of these groups serves as a convenient means of analyzing the data. Binning is a popular grouping method.
Binning is a simple method to regularize data and handle outliers. Binning involves assigning a new category to a range of data. For example, in a column which has values from 0 to 100,
● Category “LOW” is assigned to values from 0 to 30.
● Category “MID” is assigned to values from 31 to 75.
● Category “HIGH” is assigned to values from 76 to 100.
We can also perform Binning on Categorical Columns. For example, in a column with country names,
● Category “Europe” contains values like “The UK”, “Italy”, “Spain”, etc.
● Category “Asia” contains values like “India”, “China”, “Japan”, etc.
● Category “Americas” contains values like “The USA”, “Mexico”, “Canada”, etc.
And so on. Binning is akin to the system of grading in schools and colleges. Binning is easy to implement and fast. However, every time we Bin something, we sacrifice information about the feature and make the data more regularized.
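The two Binning examples above, numerical and categorical, can be sketched as follows. The function names and the fallback label “Other” for unlisted countries are assumptions made for this example; the ranges and country groups are the ones given in the text.

```python
def bin_score(value):
    """Map a value from 0 to 100 to LOW (up to 30), MID (up to 75) or HIGH."""
    if not 0 <= value <= 100:
        raise ValueError("value must be between 0 and 100")
    if value <= 30:
        return "LOW"
    if value <= 75:
        return "MID"
    return "HIGH"

# Lookup table for the categorical example.
REGION_BY_COUNTRY = {
    "The UK": "Europe", "Italy": "Europe", "Spain": "Europe",
    "India": "Asia", "China": "Asia", "Japan": "Asia",
    "The USA": "Americas", "Mexico": "Americas", "Canada": "Americas",
}

def bin_country(country):
    """Map a country name to its region; unlisted names fall back to 'Other'."""
    return REGION_BY_COUNTRY.get(country, "Other")

score_bins = [bin_score(v) for v in (12, 31, 88)]          # ["LOW", "MID", "HIGH"]
regions = [bin_country(c) for c in ("Italy", "Japan", "Canada")]
```

Once a value is binned, the exact score or country is no longer visible in the new column, which is the loss of information mentioned above.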
Another tool that makes Grouping possible is the GROUP BY clause in SQL, for example in MySQL. The discussion of that, however, is beyond the scope of this article.
Feature Splitting is extracting the usable parts of a feature into separate features. For example, “Names Of Movies and Release Dates” gets split into “Names Of Movies” and “Release Dates”. “Names” are split into “First Names” and “Last Names”.
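Both splits mentioned above can be sketched with ordinary string methods. The input formats are assumptions for this example: a title followed by a year in parentheses, and a full name whose first word is the first name. The sample values are made up.

```python
def split_title_year(text):
    """Split 'Title (YYYY)' into a (title, year) pair."""
    title, _, rest = text.rpartition(" (")
    return title, int(rest.rstrip(")"))

def split_name(full_name):
    """Split a full name at the first space into (first name, last name)."""
    first, _, last = full_name.partition(" ")
    return first, last

# Hypothetical raw values.
title, year = split_title_year("Example Movie (2001)")
first, last = split_name("Ada Lovelace")
```

Real data is rarely this regular, so a production version would need to handle values that do not match the assumed format.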
Scaling is a feature-dependent process. In most cases, numerical data do not have a definite range and often exist in different units and data types (integers, fractional values, boolean, fuzzy, etcetera). Converting them into the same unit is not desirable and will not make much sense in the real world (like converting age and currency into the same unit). However, many Machine Learning algorithms work better when features lie in comparable ranges. Having data in the same range decreases complexity. Scaling solves this problem in two ways:
1. Normalisation: Scale the data into a fixed range, typically between zero and one.
2. Standardisation: Subtract the mean and divide by the standard deviation, so that each feature has a mean of zero and a standard deviation of one. The resulting range is not fixed and depends on the data. This method of scaling is less affected by outliers than Normalisation.
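The two scaling methods above can be written out as a short sketch. The function names and the height values are hypothetical, and the population standard deviation is used here as one reasonable choice.

```python
import statistics

def normalise(values):
    """Min-max scaling into the range [0, 1]; fails if all values are equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardise(values):
    """Scale to mean 0 and standard deviation 1; fails if all values are equal."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical feature measured in centimetres.
heights_cm = [150, 160, 170, 180, 190]

normalised = normalise(heights_cm)      # [0.0, 0.25, 0.5, 0.75, 1.0]
standardised = standardise(heights_cm)  # centred on 0, not limited to [0, 1]
```

Normalisation pins the smallest and largest values to 0 and 1, so one extreme value squeezes everything else together. Standardisation has no fixed range, which is why it tends to cope better with outliers.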
Extracting dates is a simple task. It typically involves three kinds of extraction:
1. Extracting parts of the date into different columns (Year, month, day).
2. Extracting the time-period between different dates.
3. Extracting specific features like holidays, weekdays, etc.
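The three kinds of date features listed above can be sketched with the standard `datetime` module. The function name, the reference date and the one-entry holiday set are hypothetical placeholders; a real project would supply its own calendar.

```python
from datetime import date

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical holiday calendar for this example only.
HOLIDAYS = {date(2024, 1, 1)}

def date_features(d, reference):
    """Derive parts, elapsed time and calendar flags from a single date."""
    return {
        # 1. Parts of the date in separate columns.
        "year": d.year,
        "month": d.month,
        "day": d.day,
        # 2. Time period between this date and a reference date.
        "days_since_reference": (d - reference).days,
        # 3. Specific features such as weekdays and holidays.
        "weekday": WEEKDAY_NAMES[d.weekday()],
        "is_weekend": d.weekday() >= 5,
        "is_holiday": d in HOLIDAYS,
    }

features = date_features(date(2024, 3, 16), reference=date(2024, 1, 1))
```

Here the single date column expands into seven features that a model can use directly.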
As you can see, Feature Engineering is a lengthy process. It will take a toll on someone new to Data Science. People with heavy responsibilities on their shoulders might not have the time and resources to perform such operations on large amounts of data.
Cliently handles it all for you. All these complex operations of cleaning, processing and analysing the data are done by the Proton AutoML software.
Gigabytes of data collected by businesses and stored in secured servers all around the world often lead to dead ends, just because they have missing values or come in different sizes, types, formats, etcetera. Data scientists, warehouse managers, fundraising campaigns and big corporations all face this problem. Feature engineering prevents this from happening. It is thus a crucial step in data analysis and Machine Learning.