Machine Learning - 1
Machine Learning blog -1
table of contents
- convert raw data to a usable data format for machine learning.
- look into various data formats, processing, wrangling.
data collection
this is where it all begins. although business understanding and domain knowledge come first, data collection goes hand in hand with them.
reading a csv file using python
pandas is the most common way to read csv data in python: it loads the file into a dataframe, which can then be filtered, cleaned, and transformed with dataframe operations.
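as a minimal sketch of both approaches, the snippet below reads the same csv text with the standard-library csv module and with pandas. the sample data and the names CSV_TEXT, rows and csv_df are made up for illustration, and io.StringIO stands in for a real file so the example runs on its own.

```python
import csv
import io

import pandas as pd

# hypothetical sample data; with a real file you would pass its path instead
CSV_TEXT = "company,price\nABCD,120\nWXYZ,340\n"

# standard library: every row comes back as a dict of strings
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))

# pandas: the same text becomes a dataframe with inferred column types
csv_df = pd.read_csv(io.StringIO(CSV_TEXT))

print(rows[0])
print(csv_df.dtypes)
```

note that the csv module leaves every value as a string, while pandas infers that price is an integer column.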
json is another very common data format that acts as a data source.
we can read json data using the json module available in python or again use pandas to do the job.
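using the json module, a minimal sketch could look like this. the sample records and the names JSON_TEXT and json_records are assumptions for illustration only.

```python
import json

# hypothetical sample data; for a file, use json.load(open_file) instead
JSON_TEXT = '[{"company": "ABCD", "price": 120}, {"company": "WXYZ", "price": 340}]'

# json.loads parses a string into plain python lists and dicts
json_records = json.loads(JSON_TEXT)

for record in json_records:
    print(record["company"], record["price"])
```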
using pandas :
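a small sketch, assuming the json holds a list of flat records. the data and the names JSON_TEXT and json_df are hypothetical, and io.StringIO stands in for a file path.

```python
import io

import pandas as pd

# hypothetical sample data: a list of flat records
JSON_TEXT = '[{"company": "ABCD", "price": 120}, {"company": "WXYZ", "price": 340}]'

# read_json turns each record into a row and each key into a column
json_df = pd.read_json(io.StringIO(JSON_TEXT))

print(json_df)
```

deeply nested json usually needs to be flattened first, since each key has to map onto a column.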
xml is another popular data format, although it is not as common as csv or json these days.
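one possible way to read xml is the standard-library xml.etree.ElementTree module. the sketch below assumes a simple, made-up document layout (a companies root with company children); the tag names and the variables XML_TEXT, xml_records and xml_df are illustrative only.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# hypothetical sample document; real xml sources will have their own layout
XML_TEXT = """<companies>
  <company><name>ABCD</name><price>120</price></company>
  <company><name>WXYZ</name><price>340</price></company>
</companies>"""

root = ET.fromstring(XML_TEXT)

# turn every <company> element into a flat dict
xml_records = [
    {"company": node.findtext("name"), "price": int(node.findtext("price"))}
    for node in root.findall("company")
]

# a list of dicts converts directly into a dataframe
xml_df = pd.DataFrame(xml_records)
print(xml_df)
```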
data description
- numeric
- text
- categorical
categorical data takes one of a limited set of values (categories), so each observation can be assigned to a class.
categorical data can be classified as :
- nominal : categories with no inherent order ( for example, company names )
- ordinal : categories that can be ordered ( high, medium, low )
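the difference between nominal and ordinal data can be sketched with the pandas categorical type. the values and the names colors and levels below are made up for illustration.

```python
import pandas as pd

# nominal: categories have no inherent order
colors = pd.Series(["red", "green", "red"], dtype="category")

# ordinal: the order is declared explicitly, from lowest to highest
levels = pd.Series(
    pd.Categorical(
        ["high", "low", "medium", "low"],
        categories=["low", "medium", "high"],
        ordered=True,
    )
)

print(colors.cat.ordered)  # the nominal series is unordered
print(levels.min(), levels.max())  # min and max follow the declared order
print((levels > "low").tolist())  # comparisons only work on ordered categories
```

declaring the order matters: without ordered=True, pandas treats high, medium and low as unrelated labels.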
data wrangling
import random
import shutil
import string

import pandas as pd

# shutil.get_terminal_size() falls back to a default width when no terminal
# is attached (os.get_terminal_size() raises OSError in that case)
terminal_width = shutil.get_terminal_size().columns


def create_dummy_data():
    # 100 rows of random four-letter company names and integer prices
    prices = [random.randint(100, 1000) for _ in range(100)]
    companies = [
        "".join(random.choices(string.ascii_uppercase, k=4)) for _ in range(100)
    ]
    df = pd.DataFrame(data={"company": companies, "price": prices})
    return df


def show_df_info(df):
    # shape: number of rows and columns
    print("=" * terminal_width)
    print("Dataframe Shape".center(terminal_width))
    rows, cols = df.shape
    print(f"rows : {rows}")
    print(f"cols : {cols}")

    # column names
    print("=" * terminal_width)
    print("Columns".center(terminal_width))
    column_names = df.columns.values.tolist()
    for column_name in column_names:
        print(f"column_name : {column_name}")

    # data type of each column
    print("=" * terminal_width)
    print("Column Types".center(terminal_width))
    print(df.dtypes)

    # df.info() prints its report itself and returns None,
    # so it must not be wrapped in print()
    print("=" * terminal_width)
    print("General".center(terminal_width))
    df.info()

    # summary statistics for the numeric columns
    print("=" * terminal_width)
    print("Summary".center(terminal_width))
    print(df.describe())
    print("=" * terminal_width)


df = create_dummy_data()
show_df_info(df)