Machine Learning Data Preprocessing - 1
table of contents
EDA - Exploratory Data Analysis
data preprocessing
before you even start processing the data, the first thing you want to do is understand it: its structure and the types of its fields.
identify problems with the incoming data: does the data source have all the necessary fields that can be ingested into the db, or will I have to compute columns from other values in the data source?
steps in data understanding :
- collecting initial data ( data source can be from an external api etc )
- describing the data - ( statistical, visualization )
- exploring the data - ( look for patterns in data )
- verifying data quality - ( missing values, duplicates, inconsistencies )
data collection
structured : highly organized ( sql, excel )
unstructured : images, PDFs
semi-structured : json files, xml, html
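As a quick illustration, a minimal sketch of loading structured and semi-structured data with pandas and the standard library ( the sample values here are made up ):

import io
import json
import pandas as pd

# structured : tabular data with a fixed schema, e.g. a CSV export from SQL or Excel
csv_text = "id,name\n1,alice\n2,bob"
df = pd.read_csv(io.StringIO(csv_text))
print(df)

# semi-structured : nested key/value data such as JSON
json_text = '{"id": 1, "tags": ["a", "b"], "meta": {"source": "api"}}'
record = json.loads(json_text)
print(record["meta"]["source"])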
EDA
- summary statistics
- visualization
the idea is to better understand the dataset.
To print the dataframe, view the top 5 rows, and view the summary statistics:
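A minimal sketch of all three calls, assuming a small hypothetical DataFrame df:

import pandas as pd

# hypothetical example data
df = pd.DataFrame({"age": [25, 32, 47, 51, 62, 38], "salary": [50, 64, 120, 99, 150, 70]})

print(df)             # print the full dataframe
print(df.head())      # view the top 5 rows
print(df.describe())  # view summary statistics ( count, mean, std, quartiles, ... )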
A box plot shows the median, the interquartile range, and potential outliers in the dataset.
Data profiling
A systematic summary of your data's characteristics.
Data Quality
Checking the accuracy, completeness, and consistency of the data.
Outlier detection
identify unusual or extreme values that may indicate errors.
Methods: statistical methods ( z-score, iqr )
Visualization ( boxplots, scatterplots )
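A minimal sketch of the visualization route with matplotlib's boxplot ( the values are made up, with one deliberately extreme point ):

import matplotlib.pyplot as plt
import numpy as np

# made-up sample with one extreme value
data = np.array([20, 22, 25, 27, 30, 31, 35, 120])
plt.boxplot(data)  # the box spans Q1..Q3, the line inside is the median
plt.title("box plot of sample data")
plt.show()  # the extreme point appears as a separate marker beyond the whiskers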
SESSION - 2
- data encoding
- label / ordinal vs one hot encoding
- feature scaling
- normalization vs standardization
data encoding
converting categorical ( textual ) values into a numerical representation.
--------------------
size      size_encoded
--------------------
small     0
medium    1
large     2
--------------------
scikit-learn ( sklearn ) will be used to do the data encoding:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

def encode_df():
    df = pd.DataFrame({"status": ["success", "failure", "pending", "retry"]})
    # pass an explicit category order so success -> 0, failure -> 1, pending -> 2, retry -> 3
    status_encoded_values = ["success", "failure", "pending", "retry"]
    encoder = OrdinalEncoder(categories=[status_encoded_values])
    df["status_encoded"] = encoder.fit_transform(df[["status"]])
    # fit_transform returns floats; cast to int for cleaner output
    df["status_encoded"] = df["status_encoded"].astype(int)
    print(df)

encode_df()
one hot encoding
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

def encode_df():
    df = pd.DataFrame({"status": ["success", "failure", "pending", "retry"]})
    # sparse_output=False returns a dense numpy array instead of a sparse matrix
    encoder = OneHotEncoder(sparse_output=False)
    encoded_data = encoder.fit_transform(df[["status"]])
    # one binary column per category, named status_<category>
    encoded_df = pd.DataFrame(
        encoded_data, columns=encoder.get_feature_names_out(["status"])
    )
    print(encoded_df)

encode_df()
Feature Scaling or standardization
normalization ( min-max scaling )
The output value is scaled to the range [0, 1] : x_scaled = ( x - min ) / ( max - min )
import numpy as np
from sklearn.preprocessing import MinMaxScaler

def normalize_data():
    data = np.array([[20], [30], [40], [50], [100]])
    scaler = MinMaxScaler()
    # maps the minimum ( 20 ) to 0.0 and the maximum ( 100 ) to 1.0
    scaled_data = scaler.fit_transform(data)
    print(scaled_data)

normalize_data()
z-score scaling
centers data around mean = 0 and variance = 1 : z = ( x - mean ) / std
import numpy as np
from sklearn.preprocessing import StandardScaler

def compute_standard_scaler():
    # the extreme value 10000 pulls the mean and standard deviation up sharply
    data = np.array([[10], [20], [30], [40], [50], [60], [10000]])
    print(data)
    scaler = StandardScaler()
    scaled_data = scaler.fit_transform(data)
    return scaled_data

rv = compute_standard_scaler()
print(f"rv : {rv}")
outlier detection methods
- interquartile range
- z-score : if z-score > 3 or z-score < -3, the value is considered an outlier ( see the sketch after this list ).
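A minimal sketch of both methods on made-up values ( 3 standard deviations and 1.5 * IQR are the conventional thresholds ):

import numpy as np

def find_outliers(values):
    # z-score rule : flag points more than 3 standard deviations from the mean
    z_scores = (values - values.mean()) / values.std()
    z_outliers = values[np.abs(z_scores) > 3]

    # iqr rule : flag points outside [ q1 - 1.5*iqr, q3 + 1.5*iqr ]
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
    return z_outliers, iqr_outliers

data = np.array([10, 11, 12, 12, 12, 13, 13, 11, 12, 14, 13, 12, 11, 12, 13, 10000])
print(find_outliers(data))  # both rules flag 10000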
splitting datasets
- training set
- validation set
- test set
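A minimal sketch using sklearn's train_test_split, applied twice to obtain all three sets ( the 60/20/20 ratio here is just an example ):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # toy features
y = np.arange(10)                 # toy labels

# hold out 20% as the test set, then take 25% of the rest as validation ( 0.25 * 0.8 = 0.2 )
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 6 2 2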
k-fold cross validation
in k-fold cross validation, the data is split into k random portions ( folds ) of equal size; one portion becomes the test data,
and the remaining k-1 portions make the train data. this is repeated k times so that every fold serves as the test set exactly once.
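A minimal sketch with sklearn's KFold on toy data ( k = 5 is an arbitrary choice ):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

# each iteration holds out a different fold of 2 samples as the test set
for fold, (train_idx, test_idx) in enumerate(kf.split(X)):
    print(f"fold {fold}: train={train_idx}, test={test_idx}")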
sample code for ordinal, one hot encoding and normalization
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, MinMaxScaler

def one_hot_encoder(df):
    # one binary column per gender value
    encoder = OneHotEncoder(sparse_output=False)
    encoded_data = encoder.fit_transform(df[["Gender"]])
    encoded_df = pd.DataFrame(
        encoded_data, columns=encoder.get_feature_names_out(["Gender"])
    )
    print(f"encoded_df : {encoded_df}")

def ordinal_encoder(df):
    # use the unique job titles, in order of first appearance, as the category order
    job_titles = df["Job Title"].unique()
    encoder = OrdinalEncoder(categories=[job_titles])
    df["job_titles_encoded"] = encoder.fit_transform(df[["Job Title"]])
    df["job_titles_encoded"] = df["job_titles_encoded"].astype(int)
    print(f"df : {df[['Job Title', 'job_titles_encoded']]}")

def normalize_data(df):
    # scale Salary and Bonus independently to [0, 1]
    scaler = MinMaxScaler()
    salary_df = df[["Salary"]]
    bonus_df = df[["Bonus"]]
    normalized_salary_df = scaler.fit_transform(salary_df)
    normalized_bonus_df = scaler.fit_transform(bonus_df)
    df["normalized_bonus"] = normalized_bonus_df
    df["normalized_salary"] = normalized_salary_df
    print(df)
    df.to_csv("normalized_data.csv")

def f():
    filename = "./employment_records.csv"
    df = pd.read_csv(filename)
    one_hot_encoder(df)
    ordinal_encoder(df)
    normalize_data(df)

f()
outlier detection methods :
- scatter plot
- box plot
- iqr method