

Machine Learning Data Preprocessing - 1

table of contents

EDA - Exploratory Data Analysis
data preprocessing

before you even start processing the data, the first thing you want to do is understand it: its structure and the types of data it contains.

identify the problems with the incoming data: does the data source have all the necessary fields that can be ingested into the db, or will I have to compute columns from other values in the data source?

steps in data understanding :

  • collecting initial data ( data source can be from an external api etc )
  • describing the data - ( statistical, visualization )
  • exploring the data - ( look for patterns in data )
  • verifying data quality

data collection

structured : highly organized ( SQL tables, Excel sheets )
unstructured : images, PDFs
semi-structured : JSON, XML, HTML files

EDA

  • summary statistics
  • visualization

the idea is to better understand the dataset.

To print the dataframe :

print(df)

to view the top 5 rows :

print(df.head())

To view the statistics :

print(df.describe())
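These snippets assume a dataframe df already exists; a minimal self-contained sketch ( the columns and values here are made up for illustration ):

import pandas as pd

# hypothetical example data, just to make the snippets above runnable
df = pd.DataFrame(
    {
        "age": [25, 32, 47, 51, 62, 23, 40],
        "salary": [40000, 55000, 72000, 80000, 95000, 38000, 60000],
    }
)

print(df)             # full dataframe
print(df.head())      # top 5 rows
print(df.describe())  # count, mean, std, min, quartiles, max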

A box plot shows the median, the interquartile range, and potential outliers in the dataset.
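A minimal box plot sketch with pandas / matplotlib, reusing the hypothetical df from above:

import matplotlib.pyplot as plt

# box spans the IQR, the line inside it is the median,
# points beyond the whiskers are potential outliers
df["salary"].plot.box()
plt.show()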

Data profiling

A systematic summary of your data's characteristics.
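Plain pandas covers a lot of basic profiling; a minimal sketch, again on the hypothetical df:

df.info()                          # column dtypes and non-null counts
print(df.describe(include="all"))  # summary stats for numeric and non-numeric columns
print(df.nunique())                # number of distinct values per column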

Data Quality

Checking the authenticity of the data: is it complete, consistent, and accurate?
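Two quick pandas checks commonly used here ( a sketch, reusing the hypothetical df ):

print(df.isnull().sum())      # missing values per column
print(df.duplicated().sum())  # number of fully duplicated rows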

Outlier detection

identify unusual or extreme values that may indicate errors.

Methods :

  • statistical methods ( z-score, IQR )
  • visualization ( box plots, scatter plots )
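A sketch of the two statistical methods with numpy ( the sample values are made up ):

import numpy as np

data = np.array([10, 12, 14, 15, 16, 18, 95])

# IQR method : values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
print(data[(data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)])  # [95]

# z-score method : values with |z| > 3 are flagged
# ( on a sample this small no |z| exceeds 3, which shows the threshold is conservative )
z = (data - data.mean()) / data.std()
print(data[np.abs(z) > 3])  # []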

SESSION - 2

  • data encoding
  • label / ordinal vs one hot encoding
  • feature scaling
  • normalization vs standardization

data encoding

converting textual ( categorical ) values to a numerical representation.

--------------------
size    size_encoded
--------------------
small   0
medium  1
large   2
--------------------

sklearn will be used to do the data encoding :

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder


def encode_df():
    df = pd.DataFrame({"status": ["success", "failure", "pending", "retry"]})

    # the order of this list defines the integer each category maps to:
    # success -> 0, failure -> 1, pending -> 2, retry -> 3
    category_order = ["success", "failure", "pending", "retry"]

    encoder = OrdinalEncoder(categories=[category_order])

    # fit_transform returns floats, so cast to int for readability
    df["status_encoded"] = encoder.fit_transform(df[["status"]])
    df["status_encoded"] = df["status_encoded"].astype(int)

    print(df)


encode_df()

one hot encoding

import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def encode_df():
    df = pd.DataFrame({"status": ["success", "failure", "pending", "retry"]})

    # sparse_output=False returns a dense numpy array instead of a sparse matrix
    encoder = OneHotEncoder(sparse_output=False)
    encoded_data = encoder.fit_transform(df[["status"]])

    # get_feature_names_out gives one column name per category, e.g. status_success
    encoded_df = pd.DataFrame(
        encoded_data, columns=encoder.get_feature_names_out(["status"])
    )

    print(encoded_df)


encode_df()
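OneHotEncoder sorts categories alphabetically, so the output has four 0/1 columns ( status_failure, status_pending, status_retry, status_success ), with exactly one 1 per row.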

Feature Scaling or standardisation

normalization ( min-max scaling )

x' = ( x - x_min ) / ( x_max - x_min )

The output value is normalized to [0, 1]

import numpy as np
from sklearn.preprocessing import MinMaxScaler


def normalize_data():
    # scikit-learn expects a 2D array of shape (n_samples, n_features)
    data = np.array([[20], [30], [40], [50], [100]])

    scaler = MinMaxScaler()
    scaled_data = scaler.fit_transform(data)

    print(scaled_data)


normalize_data()
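With x_min = 20 and x_max = 100, the output is [0, 0.125, 0.25, 0.375, 1.0]. Note how the single large value 100 squeezes the other values into the bottom half of the range: min-max scaling is sensitive to outliers.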

z-score scaling

centers data around mean = 0 and variance = 1 :

z = ( x - mean ) / std

import numpy as np
from sklearn.preprocessing import StandardScaler

def compute_standard_scaler():
    # the extreme value 10000 dominates the mean and standard deviation
    data = np.array([[10], [20], [30], [40], [50], [60], [10000]])
    print(data)

    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)

    return scaled_data


rv = compute_standard_scaler()
print(f"rv : {rv}")

outlier detection methods

  • interquartile range ( IQR )
  • z-score : if z-score > 3 or z-score < -3, then the value is considered an outlier

splitting datasets

  • training set
  • validation set
  • test set

sklearn's train_test_split can be used to do the split, as shown below.
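A minimal sketch ( X and y are hypothetical feature and label arrays ):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # hypothetical features
y = np.arange(10)                 # hypothetical labels

# 60% train, 20% validation, 20% test, via two successive splits
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2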

k-fold cross validation

in k-fold cross validation, the data is split into k random portions of equal size; one portion makes the test data.

all the remaining k-1 portions make the train data. this is repeated k times so that every portion serves as the test data exactly once.
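A sketch with sklearn's KFold ( reusing the hypothetical X from the split above ):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    # each of the 5 folds uses 2 samples as test data and the other 8 as train data
    print(f"fold {fold} : train={train_idx}, test={test_idx}")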

sample code for ordinal encoding, one hot encoding, and normalization ( using a CSV with Gender, Job Title, Salary, and Bonus columns )

import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler


def one_hot_encoder(df):
    # one-hot encode the Gender column into 0/1 indicator columns
    encoder = OneHotEncoder(sparse_output=False)
    encoded_data = encoder.fit_transform(df[["Gender"]])
    encoded_df = pd.DataFrame(
        encoded_data, columns=encoder.get_feature_names_out(["Gender"])
    )
    print(f"encoded_df : {encoded_df}")


def ordinal_encoder(df):
    # use the job titles in order of first appearance as the category order
    job_titles = df["Job Title"].unique()

    encoder = OrdinalEncoder(categories=[job_titles])

    df["job_titles_encoded"] = encoder.fit_transform(df[["Job Title"]])
    df["job_titles_encoded"] = df["job_titles_encoded"].astype(int)

    print(f"df : {df[['Job Title', 'job_titles_encoded']]}")


def normalize_data(df):
    # min-max scale Salary and Bonus independently into [0, 1]
    scaler = MinMaxScaler()
    df["normalized_salary"] = scaler.fit_transform(df[["Salary"]])
    df["normalized_bonus"] = scaler.fit_transform(df[["Bonus"]])

    print(df)

    df.to_csv("normalized_data.csv")


def main():
    filename = "./employment_records.csv"
    df = pd.read_csv(filename)
    one_hot_encoder(df)
    ordinal_encoder(df)
    normalize_data(df)


main()

outlier detection methods :

  • scatter plot
  • box plot
  • IQR method