Indeed Scraper
Disclaimer
This project and blog post are created solely for educational and research purposes. The Indeed scraper is not intended for production use, commercial exploitation, or any activity that violates Indeed’s Terms of Service or applicable laws. The author does not encourage or condone web scraping of sites without prior permission. If you plan to scrape any website, always review and comply with its robots.txt file, terms, and legal requirements. Use the information here responsibly and at your own risk.
In this blog post, we'll look into how we can scrape indeed.com ( a job portal ) and build a FastAPI web page to display the results.
We'll be looking into how we can create a scraper with the Chrome DevTools Protocol ( CDP ). The reason for this choice is that it becomes much easier to replicate the code in the language of your choice. For example, if you're a Java, C++, or JavaScript developer, you don't have to write the scraper in Python, which would be the case had we chosen a framework like pyppeteer or playwright.
Say you're a C++ developer : you could do the same with Boost.Beast for the WebSocket client, CrowCpp for the HTTP server that serves the template ( CrowCpp is a Flask-like framework in C++ ), nlohmann/json for parsing JSON, and so on.
Tools we'll be using :
- Python
- sqlite3 - for storing the scraped results ( feel free to use MySQL, Postgres or any other database )
- websockets lib : For sending WebSocket commands to the Chrome DevTools Protocol server.
- requests lib : For sending HTTP requests to our server.
- Chrome Browser
Coding Starts Here
This is a code-along project; it's something you can follow if you have some familiarity with Python.
We'll start by creating a virtualenv.
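python3 -m venv env
( I'm naming the environment env so it matches the env/ folder in the project structure below. )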
We'll activate the virtualenv
for unix
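source env/bin/activate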
for windows ( not sure if the below command works on windows )
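env\Scripts\activate
( On PowerShell it would be env\Scripts\Activate.ps1 - I haven't verified these on Windows, so treat them as a best guess. )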
We'll install the packages needed :
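( The list below is based on the imports we'll use later in this post. )
pip install fastapi uvicorn jinja2 sqlmodel websockets requests beautifulsoup4 lxml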
We can create a requirements.txt file with the following command :
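pip freeze > requirements.txt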
We'll create the required folders and files next :
indeed-scraper/
▾ api/
▾ templates/
index.html
__init__.py
database.py
main.py
models.py
▸ env/
▾ scraper/
__init__.py
indeed_scraper.py
util.py
▾ scripts/
run.sh
The snippet above shows the project structure from the top-level project folder.
The project folder is called indeed-scraper.
The web app code is located in the api folder and everything related to the scraper is in the scraper folder.
In the api directory, there is an __init__.py file to mark it as a package.
It has 3 other files :
main.py - for the web and api.
database.py - for the database connections.
models.py - for storing the models.
We have a templates folder to store the templates that we'll be serving to the user. We have one index.html file in the templates folder.
Now let's take a look at the scraper directory.
In the scraper directory, there is an __init__.py file to mark it as a package.
We have 2 other files :
indeed_scraper.py - for the scraper logic that drives the browser over CDP.
util.py - for the helper functions the scraper uses ( launching Chrome, parsing the HTML, calling our API ).
We also have a scripts folder with a shell script to run our server. At the time of writing, I'm not sure if we'll be using it; the idea behind the shell script is to be able to schedule runs using cron jobs.
We'll work on the web & api before we implement the scraper, so that we'll have a better understanding of the project.
We'll start by setting up the database. I'll be using sqlite3. Personally I'm a fan of Postgres, but I'm using sqlite3 for this tutorial because it'll be much easier for you guys to follow along. Feel free to use the SQL database of your choice.
I'll start by creating the sqlite3 database file using the following command :
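sqlite3 indeed.db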
Here, sqlite3 is the command line utility and indeed.db is the name of the database file.
I'll create a jobs table to store the jobs. For the sake of the tutorial, we'll only scrape the job titles.
I'll create the table with the following command :
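( Only the title column is strictly needed for this tutorial; the id primary key is my addition for convenience. )
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    title TEXT NOT NULL
);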
To confirm the table has been created, I'll use the following commands :
.tables -- to list tables
.schema -- to view the schema of the db.
.schema jobs -- to view the schema of the jobs table.
To select all the rows :
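SELECT * FROM jobs;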
I'll insert a dummy row :
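INSERT INTO jobs (title) VALUES ('Software Engineer');
( Any placeholder title works; this just gives the template something to render. )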
To delete all records :
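DELETE FROM jobs;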
In sqlite3, to view the column names when using the select statement :
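.headers on -- show column names in query output
.mode column -- align the output into columns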
To exit out of the sqlite3 command line utility :
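.quit -- .exit works too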
Now we'll start working on the frontend template to display the scraped jobs.
Below is a template I generated with Gemini; this is all we need for our use case.
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Indeed Scraper - Job Titles</title>
<script src="https://cdn.jsdelivr.net/npm/@tailwindcss/browser@4"></script>
</head>
<body class="bg-gray-50 text-gray-800 min-h-screen p-4 sm:p-8">
<header class="text-center mb-6 sm:mb-8">
<h1 class="text-3xl sm:text-4xl font-extrabold text-gray-900 leading-tight tracking-tight">
Scraped Jobs
</h1>
</header>
<main class="container mx-auto max-w-2xl">
<div class="bg-white rounded-lg p-4 sm:p-6 shadow-md">
<ul class="divide-y divide-gray-200">
<li class="py-3 sm:py-4">
<a href="#" class="block hover:bg-gray-50 px-2 rounded-md transition-colors duration-200 -mx-2">
<span class="text-lg font-medium text-gray-900">Software Engineer</span>
</a>
</li>
</ul>
</div>
</main>
</body>
</html>
We will add the Jinja template code to loop over the list of jobs later; for now, you can copy-paste the HTML into your template file.
Now that we have the database set up and the template code ready, we can start working on the API.
We have 2 routes : one for serving the HTML template and the other for inserting the scraped jobs.
main.py
from fastapi import FastAPI, Depends, Request
from fastapi.templating import Jinja2Templates
from .database import get_db_session
from sqlmodel import text
from typing import List
from .models import InsertJobsModel
app = FastAPI()
templates = Jinja2Templates(directory="templates")
@app.get("/")
async def home(request: Request, session=Depends(get_db_session)):
sql = "select * from jobs;"
jobs = session.execute(text(sql)).mappings().all()
return templates.TemplateResponse(
"index.html", context={"request": request, "jobs": jobs}
)
@app.post("/jobs")
async def insert_jobs(payload: List[InsertJobsModel], session=Depends(get_db_session)):
values = [{"title": item.title} for item in payload]
if not values:
return {"message": "Payload is Empty"}
sql = """
insert into jobs (title)
values
(:title)
"""
session.execute(text(sql), values)
session.commit()
return {"message": "Inserted Jobs"}
models.py
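The only thing main.py needs from models.py is the InsertJobsModel with a title field, so a minimal version ( the exact shape below is my sketch ) looks like this :

from pydantic import BaseModel


class InsertJobsModel(BaseModel):
    # Each item posted to /jobs looks like {"title": "..."}
    title: str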
database.py
from sqlmodel import create_engine, Session
from fastapi import Depends
engine = create_engine("sqlite:///../indeed.db")
def get_db_session():
    with Session(engine) as session:
        yield session
templates/index.html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>Indeed Scraper - Job Titles</title>
<script src="https://cdn.jsdelivr.net/npm/@tailwindcss/browser@4"></script>
</head>
<body class="bg-gray-50 text-gray-800 min-h-screen p-4 sm:p-8">
<header class="text-center mb-6 sm:mb-8">
<h1 class="text-3xl sm:text-4xl font-extrabold text-gray-900 leading-tight tracking-tight">
Scraped Jobs
</h1>
</header>
<main class="container mx-auto max-w-2xl">
<div class="bg-white rounded-lg p-4 sm:p-6 shadow-md">
<ul class="divide-y divide-gray-200">
{% for job in jobs %}
<li class="py-3 sm:py-4">
<a href="#" class="block hover:bg-gray-50 px-2 rounded-md transition-colors duration-200 -mx-2">
<span class="text-lg font-medium text-gray-900">{{ job.title }}</span>
</a>
</li>
{% endfor %}
</ul>
</div>
</main>
</body>
</html>
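Before moving on to the scraper, we need the API running. The post doesn't pin down a run command yet ( that's what scripts/run.sh is for later ), so treat the following as my assumption : a typical way is uvicorn, started on port 8080, since the scraper code below posts to http://localhost:8080/jobs.

uvicorn api.main:app --port 8080

One thing to watch out for : the relative paths in the code ( "templates" in main.py and "sqlite:///../indeed.db" in database.py ) resolve against the directory you start the server from, so you may need to adjust them ( for example to "api/templates" and "sqlite:///indeed.db" when running from the project root ).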
Scraper Code
util.py
import json
import requests
import subprocess
from bs4 import BeautifulSoup
def start_browser():
    # Launch Chrome with the DevTools Protocol server exposed on port 9222
    # and an isolated profile directory.
    browser = subprocess.Popen(
        [
            "google-chrome",
            "--remote-debugging-port=9222",
            "--no-first-run",
            "--no-default-browser-check",
            "--disable-default-apps",
            "--user-data-dir=./browser-data-tmp/",
        ]
    )
    return browser


def save_page_content(html_content):
    print("Saving HTML Content")
    with open("page_content.html", "w") as f:
        f.write(html_content)
    return


def parser():
    """
    reads the page_content.html file.
    converts to parsed html.
    reads required data.
    returns the list of job titles.
    """
    with open("page_content.html", "r") as f:
        page_content = f.read()
    if not page_content:
        return None
    soup = BeautifulSoup(page_content, "lxml")
    job_titles = soup.find_all("h2", {"class": "jobTitle"})
    job_titles = [{"title": item.text} for item in job_titles]
    return job_titles


def get_browser_ws_endpoint():
    # Ask the local DevTools HTTP endpoint for the list of open targets
    # and grab the blank tab's WebSocket URL.
    url = "http://localhost:9222/json/list"
    response = requests.get(url=url, verify=False, timeout=10)
    if not response:
        return ""
    response_json = response.json()
    for item in response_json:
        title = item.get("title")
        if title == "New Tab":
            ws_endpoint = item.get("webSocketDebuggerUrl")
            return ws_endpoint
    return ""


def insert_jobs(jobs):
    # Send the scraped jobs to our FastAPI server ( the /jobs route in api/main.py ).
    url = "http://localhost:8080/jobs"
    response = requests.post(url=url, json=jobs)
    print(f"response : {response}")


def get_scraper_steps():
    # Load the list of CDP commands we'll replay over the WebSocket.
    with open("scraper_steps.json", "r") as f:
        data = json.load(f)
    return data
scraper_steps.json
[
{
"id": 1,
"method": "Emulation.setDeviceMetricsOverride",
"params": {
"width": 800,
"height": 800,
"deviceScaleFactor": 1,
"mobile": false
}
},
{
"id": 2,
"method": "Page.navigate",
"params": {
"url": "https://in.indeed.com/"
}
},
{
"id": 2,
"method": "Page.navigate",
"params": {
"url": "https://in.indeed.com/jobs?q=Software+Engineer&l=Delhi%2C+Delhi&fromage=1"
}
},
{
"id": 10,
"method": "Input.dispatchMouseEvent",
"params": {
"type": "mouseMoved",
"x": 200,
"y": 400
}
},
{
"id": 10,
"method": "Input.dispatchMouseEvent",
"params": {
"type": "mouseWheel",
"x": 200,
"y": 400,
"deltaX": 0,
"deltaY": 2000
}
},
{
"id": 5,
"method": "DOM.getDocument",
"params": {
"depth": -1
}
},
{
"id": 6,
"method": "DOM.getOuterHTML",
"params": {
"nodeId": 3
}
}
]
indeed_scraper.py
import json
from time import sleep
from websockets.sync.client import connect
from util import (
start_browser,
parser,
get_browser_ws_endpoint,
get_scraper_steps,
save_page_content,
insert_jobs,
)
def execute():
    """
    create browser instance
    get browser websocket endpoint
    fetch scraper steps
    run through the steps
    get page html
    parse
    insert jobs by making an api call.
    """
    browser = start_browser()
    sleep(5)
    ws_endpoint = get_browser_ws_endpoint()
    print(f"ws_endpoint : {ws_endpoint}")
    scraper_steps = get_scraper_steps()
    print("Executing scraper steps")
    with connect(ws_endpoint, max_size=None) as ws:
        for step in scraper_steps:
            ws_msg = json.dumps(step)
            ws.send(ws_msg)
        # DOM.getOuterHTML was sent with id 6, so keep reading responses
        # until that one arrives, then save the page HTML.
        while True:
            sleep(10)
            response_msg_str = ws.recv()
            response_msg = json.loads(response_msg_str)
            if response_msg.get("id") == 6:
                html_content = response_msg.get("result").get("outerHTML")
                save_page_content(html_content)
                break
    print("Parsing html file")
    jobs = parser()
    print(f"jobs : {jobs}")
    insert_jobs(jobs)


execute()
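To run the scraper end to end, my assumption is that you start it from inside the scraper directory ( so that util.py, scraper_steps.json and page_content.html resolve relative to it ), with Chrome installed as google-chrome and the API from earlier already listening on port 8080 :

python indeed_scraper.py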