
Machine Learning - 1

Machine Learning blog -1

table of contents

  • convert raw data to a usable data format for machine learning.
  • look into various data formats, processing, wrangling.

data collection

this is where it all begins. although business understanding and domain knowledge come first, data collection goes hand in hand with them.

reading a csv file using python

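import csv
filename = "data.csv"  # placeholder path, point it at your own file
with open(filename, newline="") as f:  # text mode; newline="" is what the csv module expects
    rows = list(csv.reader(f, delimiter=","))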

nowadays, pandas is the usual choice for reading csv data, since it loads the file straight into a dataframe that can then be manipulated.
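
as a minimal sketch of the pandas route, the snippet below reads csv text into a dataframe. the sample rows and the column names (company, price) are made up for illustration, and an in-memory string stands in for a real file so the example runs on its own; with a real file you would pass its path to read_csv instead.

```python
import io

import pandas as pd

# Inline sample standing in for a real csv file (illustrative data only).
csv_text = "company,price\nABCD,120\nWXYZ,450\n"

# read_csv accepts a file path or any file-like object; sep="," is the default.
df = pd.read_csv(io.StringIO(csv_text), sep=",")

print(df.head())
print(df.dtypes)
```

the first row is treated as the header by default, and the column types are inferred from the values.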

json is another very common data format that acts as a data source.

we can read json data using the json module available in python or again use pandas to do the job.

import json

filename = "data.json"

with open(filename, "r") as f:
    data = json.load(f)

using pandas :

import pandas as pd
df = pd.read_json(filename, orient="records")

xml is another popular data format, although it is not as common as csv or json these days.
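
for completeness, here is a small sketch of reading xml with the standard library's xml.etree.ElementTree module. the document layout (a companies root with company elements holding a name attribute and a price child) is an assumption made up for this example, not a format taken from any real data source.

```python
import xml.etree.ElementTree as ET

# Hypothetical xml payload, invented for illustration.
xml_text = """
<companies>
    <company name="ABCD"><price>120</price></company>
    <company name="WXYZ"><price>450</price></company>
</companies>
""".strip()

root = ET.fromstring(xml_text)

# Turn each <company> element into a plain dict.
records = [
    {"company": node.get("name"), "price": int(node.findtext("price"))}
    for node in root.findall("company")
]

print(records)
```

a list of dicts like this can be passed to pd.DataFrame if you want to continue in pandas.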

data description

  • numeric
  • text
  • categorical

categorical data is data whose observed values fall into a limited set of classes or labels.

categorical data comes in two kinds:

nominal: classes with no inherent order

ordinal: classes that can be ordered (high, medium, low)
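
one possible way to express the difference in pandas is sketched below. the color and risk values are illustrative, not from any dataset; the point is that an ordinal column carries an explicit order, so comparisons and min/max make sense, while a nominal column does not.

```python
import pandas as pd

# Nominal: labels with no inherent order (illustrative values).
colors = pd.Series(["red", "green", "blue", "green"], dtype="category")

# Ordinal: the same idea, but with an explicit low < medium < high order.
levels = pd.Categorical(
    ["high", "low", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)
risk = pd.Series(levels)

print(colors.cat.ordered)      # nominal column is unordered
print(risk.min(), risk.max())  # order-aware min and max
print((risk > "low").tolist()) # comparisons follow the declared order
```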

data wrangling

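import random
import shutil
import string

import pandas as pd

# shutil.get_terminal_size() falls back to 80 columns when no terminal is
# attached, whereas os.get_terminal_size() raises OSError in that case.
terminal_width = shutil.get_terminal_size().columns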


def create_dummy_data():
    prices = [random.randint(100, 1000) for _ in range(100)]
    companies = [
        "".join(random.choices(string.ascii_uppercase, k=4)) for _ in range(100)
    ]

    df = pd.DataFrame(data={"company": companies, "price": prices})
    return df


def show_df_info(df):
    print("=" * terminal_width)
    print("Dataframe Shape".center(terminal_width))
    rows, cols = df.shape
    print(f"rows : {rows}")
    print(f"cols : {cols}")
    print("=" * terminal_width)
    print("Columns".center(terminal_width))
    column_names = df.columns.values.tolist()
    for column_name in column_names:
        print(f"column_name : {column_name}")
    print("=" * terminal_width)

    print("Column Types".center(terminal_width))
    print(df.dtypes)

    print("=" * terminal_width)
    print("General".center(terminal_width))
    df.info()  # info() prints its own report and returns None, so no print() wrapper

    print("=" * terminal_width)
    print("Summary".center(terminal_width))
    print(df.describe())
    print("=" * terminal_width)


df = create_dummy_data()
show_df_info(df)

Filtering Data