Web Scraping Talk

What is Web Scraping?

At its core, web scraping is the process of extracting data from websites.
Think of it as automating what a user would do when copying content by hand, but at scale.


Evolution of Web Scraping

1. The Early Days: Server-Side Rendered Websites

In the beginning, most websites were server-side rendered. That meant you could fetch a page with a simple HTTP request and parse the HTML or JSON response directly.

Example using Python’s requests library:

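import requests

url = "https://example.com"
# A browser-like User-Agent; some servers reject the default one sent by requests
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36"
}
# TLS verification stays on (the default); verify=False would hide certificate problems
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
html_content = response.text

# Explicit encoding so the saved file does not depend on the platform default
with open("http_scraper.html", "w", encoding="utf-8") as f:
    f.write(html_content)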

This worked great when the content was static and delivered directly from the server.
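The example above only saves the page; the parsing step can be sketched with Python's standard library. This is a minimal illustration, not a full parser: the class name, the helper function, and the sample HTML are made up for the example, and the sample string stands in for response.text.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collects the page title and every href found on <a> tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text that appears inside <title> is accumulated
        if self._in_title:
            self.title += data


def extract_title_and_links(html_text):
    parser = TitleAndLinkParser()
    parser.feed(html_text)
    parser.close()
    return parser.title.strip(), parser.links


# Hypothetical sample standing in for response.text
sample_html = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <p>Some text with a <a href="https://www.iana.org/domains/example">link</a>.</p>
    <a href="/about">About</a>
  </body>
</html>
"""

title, links = extract_title_and_links(sample_html)
print(title)
print(links)
```

For anything beyond simple extraction, a dedicated HTML parsing library is usually more convenient, but the idea is the same: the data is already in the response body.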


2. The Rise of Client-Side Rendering

As web applications grew more interactive, JavaScript-heavy websites became the norm.
Now, much of the content was loaded dynamically in the browser after the initial page load.
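One way to see the difference is to look at what a plain HTTP fetch returns for such a site: often little more than an empty container and a script tag. The sketch below is a rough heuristic under that assumption, not a reliable detector; the function name, the 50-character threshold, and both sample pages are invented for illustration.

```python
from html.parser import HTMLParser


class TextAndScriptCounter(HTMLParser):
    """Counts text characters outside <script>/<style> and the number of <script> tags."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def looks_client_rendered(html_text, min_text_chars=50):
    """Heuristic: scripts are present but the raw HTML carries almost no text."""
    counter = TextAndScriptCounter()
    counter.feed(html_text)
    counter.close()
    return counter.script_tags > 0 and counter.text_chars < min_text_chars


# Hypothetical raw responses: a server-rendered page and a single-page-app shell
server_rendered = (
    "<html><body><h1>Products</h1><p>"
    + "Widget A costs 10 dollars. " * 5
    + "</p></body></html>"
)
spa_shell = (
    '<html><head><title>App</title></head><body><div id="root"></div>'
    '<script src="/static/bundle.js"></script></body></html>'
)

print(looks_client_rendered(server_rendered))
print(looks_client_rendered(spa_shell))
```

When the raw response looks like the second sample, the content only exists after the browser has executed the JavaScript bundle.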

To scrape these, we had to spin up a browser in headless mode and automate the interaction.

Example with Pyppeteer:

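import asyncio
from pyppeteer import launch

async def main():
    # executablePath points at a locally installed Chrome; adjust it for your system
    browser = await launch({
        "executablePath": "/opt/google/chrome/google-chrome",
        "headless": True
    })
    try:
        page = await browser.newPage()
        await page.goto("https://example.com")
        # The screenshot shows the page after the browser has run its JavaScript
        await page.screenshot({"path": "example.png"})
    finally:
        # Close the browser even if navigation fails, so no Chrome process is left behind
        await browser.close()

asyncio.run(main())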

3. Driving the Browser with Chrome DevTools Protocol (CDP)

A more advanced (and fun) way of scraping is to control Chrome directly using CDP.
This allows you to send low-level commands to the browser over WebSockets.
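The messages themselves are plain JSON. A command carries an id, a method, and params; the browser's reply repeats the id, while events arrive with a method but no id. The sketch below only builds and classifies such messages without opening a connection; the CommandBuilder class and the classify function are hypothetical helpers, and the sample reply and event are trimmed by hand.

```python
import itertools
import json


class CommandBuilder:
    """Builds CDP command messages with unique, increasing ids."""

    def __init__(self):
        self._ids = itertools.count(1)

    def build(self, method, params=None):
        return json.dumps({
            "id": next(self._ids),
            "method": method,
            "params": params or {},
        })


def classify(raw_message):
    """Return ("reply", id) for command replies and ("event", method) for events."""
    message = json.loads(raw_message)
    if "id" in message:
        return "reply", message["id"]
    return "event", message.get("method")


builder = CommandBuilder()
first = builder.build("Page.navigate", {"url": "https://example.com"})
second = builder.build("DOM.getDocument", {"depth": -1})
print(first)
print(second)

# Hand-written, trimmed samples of what the browser might send back
print(classify('{"id": 1, "result": {"frameId": "ABC"}}'))
print(classify('{"method": "Page.loadEventFired", "params": {"timestamp": 1.0}}'))
```

Matching replies to commands by id matters because the browser may interleave events with replies on the same connection.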

Start Chrome with debugging enabled:

google-chrome --remote-debugging-port=9222 --no-first-run --no-default-browser-check --disable-default-apps --user-data-dir=./browser-data-tmp/

Example CDP Steps

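Here’s a sample sequence of scraping commands in JSON. The Python script in the next section reads them from a file named steps.json. Note that the nodeId in step 4 is hard-coded; node ids are assigned by the browser for each session, so a more robust script would take the id from the DOM.getDocument reply.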

[
    {
        "id": 1,
        "method": "Emulation.setDeviceMetricsOverride",
        "params": {
            "width": 800,
            "height": 800,
            "deviceScaleFactor": 1,
            "mobile": false
        }
    },
    {
        "id": 2,
        "method": "Page.navigate",
        "params": { "url": "https://www.example.com" }
    },
    {
        "id": 3,
        "method": "DOM.getDocument",
        "params": { "depth": -1 }
    },
    {
        "id": 4,
        "method": "DOM.getOuterHTML",
        "params": { "nodeId": 3 }
    }
]
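Steps 3 and 4 are connected: the reply to DOM.getDocument contains the root node, and its nodeId can be fed into DOM.getOuterHTML. The sketch below shows that chaining on hand-written replies. The two helper functions are hypothetical, and the sample replies are heavily trimmed; real ones carry many more fields.

```python
import json


def outer_html_command(get_document_reply, command_id):
    """Build DOM.getOuterHTML for the root node reported by DOM.getDocument."""
    root_node_id = get_document_reply["result"]["root"]["nodeId"]
    return {
        "id": command_id,
        "method": "DOM.getOuterHTML",
        "params": {"nodeId": root_node_id},
    }


def extract_outer_html(get_outer_html_reply):
    """Return the HTML string, or raise if the browser reported an error."""
    if "error" in get_outer_html_reply:
        raise RuntimeError(get_outer_html_reply["error"].get("message", "CDP error"))
    return get_outer_html_reply["result"]["outerHTML"]


# Hypothetical, trimmed replies used only to exercise the helpers
sample_document_reply = {
    "id": 3,
    "result": {"root": {"nodeId": 1, "nodeName": "#document", "children": []}},
}
sample_html_reply = {
    "id": 4,
    "result": {"outerHTML": "<html><body>Hello</body></html>"},
}

command = outer_html_command(sample_document_reply, command_id=4)
print(json.dumps(command))
print(extract_outer_html(sample_html_reply))
```

Reading the id from the reply avoids depending on whichever number the browser happens to assign to a node in a given session.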

Putting It Together in Python

We can tie everything together with Python by launching Chrome, connecting to its debugging endpoint, and executing the steps:

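import json
import requests
import subprocess
from time import sleep
from websockets.sync.client import connect

def create_browser():
    # Start Chrome with the DevTools endpoint exposed on port 9222
    return subprocess.Popen([
        "google-chrome",
        "--remote-debugging-port=9222",
        "--no-first-run",
        "--no-default-browser-check",
        "--disable-default-apps",
        "--user-data-dir=./browser-data-tmp/",
    ])

def get_ws_endpoint():
    # /json/list describes every open target; it is plain HTTP, so no TLS options apply
    response = requests.get("http://localhost:9222/json/list", timeout=10)
    for item in response.json():
        # Match on the target type rather than the locale-dependent "New Tab" title
        if item.get("type") == "page":
            return item.get("webSocketDebuggerUrl")
    return ""

def execute(ws_url):
    with open("steps.json", "r") as f:
        steps = json.load(f)
    with connect(ws_url) as ws:
        for step in steps:
            ws.send(json.dumps(step))
            # Read until the reply with this command's id arrives, skipping any events
            reply = json.loads(ws.recv())
            while reply.get("id") != step["id"]:
                reply = json.loads(ws.recv())
            # Print a truncated view of the reply; step 4 carries the page HTML
            print(step["method"], "->", json.dumps(reply)[:200])
            sleep(10)  # crude wait so the page can settle before the next step

def main():
    browser_instance = create_browser()
    try:
        sleep(5)  # give Chrome time to open the debugging port
        ws_url = get_ws_endpoint()
        if ws_url:
            execute(ws_url)
    finally:
        # Do not leave Chrome running after the script ends
        browser_instance.terminate()

if __name__ == "__main__":
    main()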
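The fixed sleep(5) in main() is a guess at how long Chrome needs to start. A possible alternative, sketched here with the standard library only, is to poll until the debugging port accepts TCP connections. The wait_for_port function, its timeout, and its polling interval are assumptions made for this example; the demo runs against a throwaway local listener rather than a real browser.

```python
import socket
import time


def wait_for_port(host, port, timeout=15.0, interval=0.2):
    """Return True once host:port accepts TCP connections, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet; wait briefly and try again
            time.sleep(interval)
    return False


# Demo against a throwaway local listener instead of a real browser
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

ready = wait_for_port("127.0.0.1", open_port, timeout=2.0)
listener.close()
print(ready)
```

In the script above, a call such as wait_for_port("localhost", 9222) could stand in for sleep(5), with the script giving up if it returns False.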

Final Thoughts

Web scraping has come a long way:
- From simple HTTP requests,
- To headless browsers,
- To direct browser automation with DevTools.

Each method has its trade-offs, and the right approach depends on the complexity of the site you’re scraping.