Indeed Scraper
Disclaimer
This project and blog post are created solely for educational and research purposes. The Indeed scraper is not intended for production use, commercial exploitation, or any activity that violates Indeed’s Terms of Service or applicable laws. The author does not encourage or condone web scraping of sites without prior permission. If you plan to scrape any website, always review and comply with its robots.txt file, terms, and legal requirements. Use the information here responsibly and at your own risk.
In this blog post, we'll look into how we can scrape indeed.com ( a job portal ) and build a FastAPI web page to display the results.
We'll be looking into how we can create a scraper with the Chrome DevTools Protocol ( CDP ). The reason for this choice is that it becomes much easier to replicate the code in the language of your choice. For example, if you're a Java, C++, or JavaScript developer, you don't have to write the scraper in Python, which would be the case had we chosen a framework like pyppeteer or playwright.
Say you're a C++ developer : you could do the same with Boost.Beast for the WebSocket client, CrowCpp for the HTTP server that serves the template ( CrowCpp is a Flask-like framework in C++ ), nlohmann/json for parsing JSON, and so on.
Tools we'll be using :
- Python
- sqlite3 - for storing the scraped results ( feel free to use MySQL, Postgres or any other database )
- websockets lib : For sending WebSocket commands to the Chrome DevTools Protocol server.
- requests lib : For sending HTTP requests to our server.
- Chrome Browser
Coding Starts Here
This is a code-along project; it's something you can follow if you have some familiarity with Python.
We'll start by creating a virtualenv.
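python3 -m venv env
( I'm naming the environment env so it matches the env/ folder in the project structure below. )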
We'll activate the virtualenv
for unix
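source env/bin/activate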
for windows ( not sure if the below command works on windows )
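env\Scripts\activate
( On PowerShell it would be env\Scripts\Activate.ps1 - I haven't verified these on Windows, so treat them as a best guess. )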
We'll install the packages needed :
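( The list below is based on the imports we'll use later in this post. )
pip install fastapi uvicorn jinja2 sqlmodel websockets requests beautifulsoup4 lxml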
We can create a requirements.txt file with the following command :
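pip freeze > requirements.txt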
We'll create the required folders and files next :
indeed-scraper/
▾ api/
▾ templates/
index.html
__init__.py
database.py
main.py
models.py
▸ env/
▾ scraper/
__init__.py
indeed_scraper.py
util.py
▾ scripts/
run.sh
The snippet above shows the project structure from the top-level project folder.
The project folder is called indeed-scraper.
The web app code is located in the api folder and everything related to the scraper is in the scraper folder.
In the api directory, there is an __init__.py file to mark it as a package.
It has 3 other files :
main.py - for the web and api.
database.py - for the database connections.
models.py - for storing the models.
We have a templates folder to store the templates that we'll be serving to the user. We have one index.html file in the templates folder.
Now let's take a look at the scraper directory.
In the scraper directory, there is an __init__.py file to mark it as a package.
We have 2 other files :
indeed_scraper.py - for the scraper logic that drives the browser over CDP.
util.py - for the helper functions the scraper uses ( launching Chrome, parsing the HTML, calling our API ).
We also have a scripts folder with a shell script to run our server. At the time of writing, I'm not sure if we'll be using it; the idea behind the shell script is to be able to schedule runs using cron jobs.
We'll work on the web & api before we implement the scraper, so that we'll have a better understanding of the project.
We'll start by setting up the database. I'll be using sqlite3. Personally I'm a fan of Postgres, but I'm using sqlite3 for this tutorial because it'll be much easier for you guys to follow along. Feel free to use the SQL database of your choice.
I'll start by creating the sqlite3 database file using the following command :
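sqlite3 indeed.db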
Here, sqlite3 is the command line utility and indeed.db is the name of the database file.
I'll create a jobs table to store the jobs. For the sake of the tutorial, we'll only scrape the job titles.
I'll create the table with the following command :
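( Only the title column is strictly needed for this tutorial; the id primary key is my addition for convenience. )
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL
);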
To confirm the table has been created, I'll use the following commands :
.tables -- to list tables
.schema -- to view the schema of the db.
.schema jobs -- to view the schema of the jobs table.
To select all the rows :
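SELECT * FROM jobs;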
I'll insert a dummy row :
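INSERT INTO jobs (title) VALUES ('Software Engineer');
( Any placeholder title works; this just gives the template something to render. )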
To delete all records :
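DELETE FROM jobs;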
In sqlite3, to view the column names when using the select statement :
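.headers on -- show column names in query output
.mode column -- align the output into columns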
To exit out of the sqlite3 command line utility :
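.quit -- .exit works too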
Now we'll start working on the frontend template to display the scraped jobs.
Below is a template I generated with Gemini; this is all we need for our use case.
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Indeed Scraper - Job Titles</title>
<script src="https://cdn.jsdelivr.net/npm/@tailwindcss/browser@4"></script>
</head>
<body class="bg-gray-50 text-gray-800 min-h-screen p-4 sm:p-8">
<header class="text-center mb-6 sm:mb-8">
<h1 class="text-3xl sm:text-4xl font-extrabold text-gray-900 leading-tight tracking-tight">
Scraped Jobs
</h1>
</header>
<main class="container mx-auto max-w-2xl">
<div class="bg-white rounded-lg p-4 sm:p-6 shadow-md">
<ul class="divide-y divide-gray-200">
<li class="py-3 sm:py-4">
<a href="#" class="block hover:bg-gray-50 px-2 rounded-md transition-colors duration-200 -mx-2">
<span class="text-lg font-medium text-gray-900">Software Engineer</span>
</a>
</li>
</ul>
</div>
</main>
</body>
</html>
We will add the Jinja template code to loop over the list of jobs later; for now, you can copy-paste the HTML into your template file.
Now that we have the database set up and the template code ready, we can start working on the API.
We have 2 routes : one for serving the HTML template and the other for inserting the scraped jobs.
main.py
from fastapi import FastAPI, Depends, Request
from fastapi.templating import Jinja2Templates
from .database import get_db_session
from sqlmodel import text
from typing import List
from .models import InsertJobsModel
app = FastAPI()
templates = Jinja2Templates(directory="templates")
@app.get("/")
async def home(request: Request, session=Depends(get_db_session)):
sql = "select * from jobs;"
jobs = session.execute(text(sql)).mappings().all()
return templates.TemplateResponse(
"index.html", context={"request": request, "jobs": jobs}
)
@app.post("/jobs")
async def insert_jobs(payload: List[InsertJobsModel], session=Depends(get_db_session)):
values = [{"title": item.title} for item in payload]
if not values:
return {"message": "Payload is Empty"}
sql = """
insert into jobs (title)
values
(:title)
"""
session.execute(text(sql), values)
session.commit()
return {"message": "Inserted Jobs"}
models.py
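The only thing main.py needs from models.py is the InsertJobsModel with a title field, so a minimal version ( the exact shape below is my sketch ) looks like this :

from pydantic import BaseModel


class InsertJobsModel(BaseModel):
    # Each item posted to /jobs looks like {"title": "..."}
    title: str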
database.py
from sqlmodel import create_engine, Session
from fastapi import Depends
engine = create_engine("sqlite:///../indeed.db")
def get_db_session():
    with Session(engine) as session:
        yield session
templates/index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Indeed Scraper - Job Titles</title>
<script src="https://cdn.jsdelivr.net/npm/@tailwindcss/browser@4"></script>
</head>
<body class="bg-gray-50 text-gray-800 min-h-screen p-4 sm:p-8">
<header class="text-center mb-6 sm:mb-8">
<h1 class="text-3xl sm:text-4xl font-extrabold text-gray-900 leading-tight tracking-tight">
Scraped Jobs
</h1>
</header>
<main class="container mx-auto max-w-2xl">
<div class="bg-white rounded-lg p-4 sm:p-6 shadow-md">
<ul class="divide-y divide-gray-200">
{% for job in jobs %}
<li class="py-3 sm:py-4">
<a href="#" class="block hover:bg-gray-50 px-2 rounded-md transition-colors duration-200 -mx-2">
<span class="text-lg font-medium text-gray-900">{{ job.title }}</span>
</a>
</li>
{% endfor %}
</ul>
</div>
</main>
</body>
</html>
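Before moving on to the scraper, we need the API running. The post doesn't pin down a run command yet ( that's what scripts/run.sh is for later ), so treat the following as my assumption : a typical way is uvicorn, started on port 8080, since the scraper code below posts to http://localhost:8080/jobs.

uvicorn api.main:app --port 8080

One thing to watch out for : the relative paths in the code ( "templates" in main.py and "sqlite:///../indeed.db" in database.py ) resolve against the directory you start the server from, so you may need to adjust them ( for example to "api/templates" and "sqlite:///indeed.db" when running from the project root ).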
Scraper Code
util.py
import json
import requests
import subprocess
from bs4 import BeautifulSoup
def start_browser():
    # Launch Chrome with the DevTools Protocol server exposed on port 9222
    # and an isolated profile directory.
    browser = subprocess.Popen(
        [
            "google-chrome",
            "--remote-debugging-port=9222",
            "--no-first-run",
            "--no-default-browser-check",
            "--disable-default-apps",
            "--user-data-dir=./browser-data-tmp/",
        ]
    )
    return browser


def save_page_content(html_content):
    print("Saving HTML Content")
    with open("page_content.html", "w") as f:
        f.write(html_content)
    return


def parser():
    """
    reads the page_content.html file.
    converts to parsed html.
    reads required data.
    returns the list of job titles.
    """
    with open("page_content.html", "r") as f:
        page_content = f.read()
    if not page_content:
        return None
    soup = BeautifulSoup(page_content, "lxml")
    job_titles = soup.find_all("h2", {"class": "jobTitle"})
    job_titles = [{"title": item.text} for item in job_titles]
    return job_titles


def get_browser_ws_endpoint():
    # Ask the local DevTools HTTP endpoint for the list of open targets
    # and grab the blank tab's WebSocket URL.
    url = "http://localhost:9222/json/list"
    response = requests.get(url=url, verify=False, timeout=10)
    if not response:
        return ""
    response_json = response.json()
    for item in response_json:
        title = item.get("title")
        if title == "New Tab":
            ws_endpoint = item.get("webSocketDebuggerUrl")
            return ws_endpoint
    return ""


def insert_jobs(jobs):
    # Send the scraped jobs to our FastAPI server ( the /jobs route in api/main.py ).
    url = "http://localhost:8080/jobs"
    response = requests.post(url=url, json=jobs)
    print(f"response : {response}")


def get_scraper_steps():
    # Load the list of CDP commands we'll replay over the WebSocket.
    with open("scraper_steps.json", "r") as f:
        data = json.load(f)
    return data
scraper_steps.json
[
{
"id": 1,
"method": "Emulation.setDeviceMetricsOverride",
"params": {
"width": 800,
"height": 800,
"deviceScaleFactor": 1,
"mobile": false
}
},
{
"id": 2,
"method": "Page.navigate",
"params": {
"url": "https://in.indeed.com/"
}
},
{
"id": 2,
"method": "Page.navigate",
"params": {
"url": "https://in.indeed.com/jobs?q=Software+Engineer&l=Delhi%2C+Delhi&fromage=1"
}
},
{
"id": 10,
"method": "Input.dispatchMouseEvent",
"params": {
"type": "mouseMoved",
"x": 200,
"y": 400
}
},
{
"id": 10,
"method": "Input.dispatchMouseEvent",
"params": {
"type": "mouseWheel",
"x": 200,
"y": 400,
"deltaX": 0,
"deltaY": 2000
}
},
{
"id": 5,
"method": "DOM.getDocument",
"params": {
"depth": -1
}
},
{
"id": 6,
"method": "DOM.getOuterHTML",
"params": {
"nodeId": 3
}
}
]
indeed_scraper.py
import json
from time import sleep
from websockets.sync.client import connect
from util import (
start_browser,
parser,
get_browser_ws_endpoint,
get_scraper_steps,
save_page_content,
insert_jobs,
)
def execute():
    """
    create browser instance
    get browser websocket endpoint
    fetch scraper steps
    run through the steps
    get page html
    parse
    insert jobs by making an api call.
    """
    browser = start_browser()
    sleep(5)
    ws_endpoint = get_browser_ws_endpoint()
    print(f"ws_endpoint : {ws_endpoint}")
    scraper_steps = get_scraper_steps()
    print("Executing scraper steps")
    with connect(ws_endpoint, max_size=None) as ws:
        for step in scraper_steps:
            ws_msg = json.dumps(step)
            ws.send(ws_msg)
        # DOM.getOuterHTML was sent with id 6, so keep reading responses
        # until that one arrives, then save the page HTML.
        while True:
            sleep(10)
            response_msg_str = ws.recv()
            response_msg = json.loads(response_msg_str)
            if response_msg.get("id") == 6:
                html_content = response_msg.get("result").get("outerHTML")
                save_page_content(html_content)
                break
    print("Parsing html file")
    jobs = parser()
    print(f"jobs : {jobs}")
    insert_jobs(jobs)


execute()
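To run the scraper end to end, my assumption is that you start it from inside the scraper directory ( so that util.py, scraper_steps.json and page_content.html resolve relative to it ), with Chrome installed as google-chrome and the API from earlier already listening on port 8080 :

python indeed_scraper.py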