Ways to Do Web Scraping

In this blog post, I'll discuss the main ways to do web scraping.

There are many ways to do web scraping.

I'll cover the two most common methods, plus one unconventional method that isn't used as often.

I'll be using Python for my examples.

Request-Based Scraping.

If the data you're trying to retrieve comes from a server-rendered page or an API endpoint, then a plain HTTP request works just fine.

For example: if you're scraping a news website whose latest-news page is server rendered at a static URL, you could do something like this with the requests module.

import requests

url = "https://example.com/"

response = requests.get(url, timeout=10)
response.raise_for_status()  # fail loudly on 4xx/5xx errors

print(response.text)  # the HTML body, not just the status line
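Fetching the page is only half the job; you usually need to pull specific data out of the HTML next. BeautifulSoup is the popular choice, but as a dependency-free sketch, the standard library's html.parser can handle simple extractions. The HTML string below is made up for illustration:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags as the parser walks the document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Made-up HTML standing in for a fetched response body
html = '<a href="/news/1">One</a> <a href="/news/2">Two</a>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/news/1', '/news/2']
```

In a real scraper you would feed `response.text` from the requests call above into the parser instead of a hard-coded string.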

Browser-Driven Scraping.

Say you have to click through a lot of a website to get to the content you're looking for, or the site is rendered client-side.

Here you could use a browser automation tool like Puppeteer, Playwright, or Selenium.

import asyncio
from pyppeteer import launch


async def main():
    browser = await launch(executablePath="/opt/google/chrome/google-chrome")
    page = await browser.newPage()
    await page.goto("https://example.com")
    await asyncio.sleep(3)  # crude wait; page.waitForSelector is usually more reliable
    html = await page.content()
    print(html)
    await browser.close()


asyncio.run(main())

Using the Chrome DevTools Protocol.

The Chrome DevTools Protocol allows tools to instrument, inspect, debug, and profile Chromium, Chrome, and other Blink-based browsers. Many existing projects use it; the Chrome DevTools themselves are built on it, and the team maintains its API. Tools like Puppeteer are built on top of this protocol, but nothing stops you from speaking it directly: you send JSON commands to a running browser over a WebSocket.

import json
from websockets.sync.client import connect
from time import sleep


def scraper():
    # WebSocket debugger URL of the page target you want to drive, e.g. taken
    # from http://localhost:9222/json when Chrome is started with
    # --remote-debugging-port=9222
    url = ""
    with connect(url) as ws:
        print("CONNECTED")
        payload = {
            "id": 1,
            "method": "Page.navigate",
            "params": {"url": "https://example.com"},
        }
        ws.send(json.dumps(payload))
        sleep(5)  # give the navigation time to start
        response = ws.recv()  # first message back: the reply to Page.navigate
        print(f"response : {response}")


scraper()
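One practical wrinkle: the WebSocket URL isn't fixed. When Chrome runs with --remote-debugging-port=9222, it exposes an HTTP endpoint at http://localhost:9222/json listing the open targets, each with a webSocketDebuggerUrl field. A minimal sketch of discovering it (the helper names and the sample listing are my own, not from the protocol docs):

```python
import json
from urllib.request import urlopen


def discover_targets(port=9222):
    """Fetch the raw target listing from a locally running Chrome.

    Requires Chrome to have been started with --remote-debugging-port=<port>.
    """
    with urlopen(f"http://localhost:{port}/json") as resp:
        return resp.read().decode()


def pick_ws_url(targets_json):
    """Return the first page target's webSocketDebuggerUrl, or None."""
    for target in json.loads(targets_json):
        if target.get("type") == "page":
            return target.get("webSocketDebuggerUrl")
    return None


# Demo with a hand-written sample listing (same shape as the /json output):
sample = json.dumps([
    {"type": "page",
     "url": "https://example.com",
     "webSocketDebuggerUrl": "ws://localhost:9222/devtools/page/ABC123"},
])
print(pick_ws_url(sample))  # ws://localhost:9222/devtools/page/ABC123
```

The URL returned by pick_ws_url(discover_targets()) is what you would plug into the empty url variable in the scraper above.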