Web scraping in 2026: a practical guide and where AI changes it

Data is even more valuable in 2026 than it was ten years ago. AI models inhale it on the input side, dashboards need it on the output side, and sitting quietly in between is a discipline most people overlook — web scraping.
This article is about what scraping actually is, what it's good for, what stack makes sense today, how to handle anti-bot defenses, and where the whole thing has been changing thanks to LLMs. No specific projects, just a practical guide. If you want to start, or you want to see where things have moved, you're in the right place.
Crawler vs. scraper — which is which
The terms get used interchangeably, but they're two different things:
- A crawler moves through the web by following links. It maps structure. The classic example is Googlebot, which discovers pages across the internet, or archive.org, which stores copies for posterity.
- A scraper opens a specific page and pulls data out of it. Product price, article body, list of reviews, product specs.
In practice the two are almost always combined. The crawler finds where things are, the scraper extracts what you need. When people say "I wrote a scraper," they usually mean both.
What it's actually used for
Use cases that make sense today:
- Price and inventory tracking — competitor drops the price, you know in 5 minutes, not in a week.
- News and content aggregation — RSS on steroids, you define the sources and the signals you care about.
- Market and competitive research — who's selling what, how their marketing reads, what reviews they accumulate.
- Datasets for ML / fine-tuning — your own training data, your own embeddings, your own evals.
- Change monitoring — price change, text change on a page, new blog post from a tracked author → alert on Telegram.
- Content migration — old website without an API, new website without manually copy-pasting 5,000 articles.
The common thread: automating things you'd otherwise do by hand. And where someone else is doing it by hand, you have an edge.
Stack in 2026 — what to use when
This isn't about the "best" library, it's about the right choice for the situation. Here's how I see it:
requests + BeautifulSoup — still the gold standard for static HTML. Cheapest, fastest, simplest. If you just need to download HTML and parse it, there's no reason to reach for anything else. Server-rendered sites (Wikipedia, most e-shops, blogs) fall into this category.
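A minimal sketch of that path (the URL and CSS selectors here are made up for illustration):

```python
# Static-HTML scraping: one GET, one parse. Selectors are hypothetical.
import requests
from bs4 import BeautifulSoup

resp = requests.get(
    "https://example.com/products",   # hypothetical server-rendered listing
    headers={"User-Agent": "MyScraper/1.0 (contact: pavel@example.com)"},
    timeout=10,
)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
for card in soup.select(".product-card"):      # selector is an assumption
    name = card.select_one(".name")
    price = card.select_one(".price")
    if name and price:
        print(name.get_text(strip=True), price.get_text(strip=True))
```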
Playwright or Puppeteer — for JS-heavy sites. SPAs, React apps, infinite scroll, pages that load content dynamically. Spins up a real browser (headless), waits for render, lets you pull what you need. Slower and more memory-hungry, but sometimes necessary. In 2026, it's clearly Playwright — better API, better debugging, more active ecosystem than Puppeteer.
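The same extraction through Playwright, as a sketch with a hypothetical SPA URL and selectors:

```python
# JS-rendered page: launch a headless browser, wait for the client-side
# render, then read the DOM. Requires: pip install playwright && playwright install
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/app")        # hypothetical SPA
    page.wait_for_selector(".product-card")     # wait until content exists
    names = page.locator(".product-card .name").all_inner_texts()
    print(names)
    browser.close()
```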
Scrapy — for large projects and distributed crawling. When you need to crawl thousands of pages, want solid retry logic, pipelines for postprocessing, and want to deploy on a cluster. Overkill for a one-off script. Still the most mature choice for production crawlers.
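To show the shape of a Scrapy project, a minimal spider sketch with a placeholder domain and selectors:

```python
# Run with: scrapy runspider products_spider.py -o products.json
import scrapy

class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]   # hypothetical
    custom_settings = {"DOWNLOAD_DELAY": 1}         # polite by default

    def parse(self, response):
        for card in response.css(".product-card"):
            yield {
                "name": card.css(".name::text").get(),
                "price": card.css(".price::text").get(),
            }
        # follow pagination, if the site has it
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```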
Hosted services — Browserless, Apify, Bright Data, ScrapingBee — for when you don't want to operate anything. You send an HTTP request to their API, they do the scraping and return the result. Expensive, but it saves you the anti-bot fight and the infra. Good choice for getting started or for use cases where you need the result fast.
A practical heuristic: start with requests + BS4. If that's not enough, try Playwright. If Playwright doesn't scale, move to Scrapy. Hosted services are an escape hatch, not a default.
Anti-bot and how to get around it (legally)
Most reasonable websites don't make scraping easy on you — and for good reason. How to handle it in 2026:
User-Agent rotation + realistic headers. Every request with User-Agent: python-requests/2.31 is an instant tell. Send full headers like a real browser would: Accept, Accept-Language, Accept-Encoding, Sec-Ch-Ua, Referer. Libraries (fake-useragent, curl_cffi) handle this for you, including TLS fingerprint mimicry.
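A sketch of what "realistic headers" means in practice; the header values are illustrative, and the exact curl_cffi impersonation targets depend on the installed version:

```python
# Plain requests with browser-like headers, then curl_cffi for TLS mimicry.
import requests

BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "Referer": "https://www.google.com/",
}
resp = requests.get("https://example.com", headers=BROWSER_HEADERS, timeout=10)

# curl_cffi goes further: it mimics a real browser's TLS fingerprint too.
from curl_cffi import requests as cureq
resp2 = cureq.get("https://example.com", impersonate="chrome")
```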
Proxies. When you make thousands of requests, a single IP gives you up immediately. Datacenter proxies are cheap, but big sites detect them. Residential proxies look like a real user from a home network — more expensive, but they pass through Cloudflare and similar layers. Mobile proxies are the most expensive and most resilient, but you rarely need them.
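A sketch of routing requests through a small rotating pool; the proxy URLs are placeholders for whatever your provider gives you:

```python
import random
import requests

PROXIES = [
    "http://user:pass@proxy1.example.net:8000",   # hypothetical endpoints
    "http://user:pass@proxy2.example.net:8000",
]

def fetch(url: str) -> requests.Response:
    # pick a different exit IP per request
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=15,
    )
```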
Headless detection. Cloudflare, PerimeterX, DataDome detect headless browsers from a hundred small things (navigator properties, weird WebGL rendering, timing). Solution: stealth plugins (playwright-stealth, puppeteer-extra-plugin-stealth) or switch to Browser Use / a real browser instance. Never 100% reliable — it's an arms race.
Rate limiting and polite scraping. This isn't just about "not getting banned." It's also ethics. Send at most 1–2 requests per second per domain. Use exponential backoff on errors (429, 503). Cache responses so you don't make the same request twice. Your User-Agent can include a contact email (MyScraper/1.0 (contact: pavel@example.com)) — admins appreciate knowing who you are and why.
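Roughly, polite fetching can look like this sketch, with illustrative delays and limits:

```python
# Per-domain delay, exponential backoff on 429/503, identifiable User-Agent.
import time
import requests

UA = "MyScraper/1.0 (contact: pavel@example.com)"

def polite_get(url: str, max_attempts: int = 5) -> requests.Response:
    for attempt in range(max_attempts):
        resp = requests.get(url, headers={"User-Agent": UA}, timeout=10)
        if resp.status_code not in (429, 503):
            resp.raise_for_status()
            return resp
        time.sleep(2 ** attempt)   # backoff: 1s, 2s, 4s, 8s, ...
    raise RuntimeError(f"Gave up on {url} after {max_attempts} attempts")

for url in ["https://example.com/a", "https://example.com/b"]:
    polite_get(url)
    time.sleep(1)   # roughly 1 request per second per domain
```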
Operations: it's not just "download HTML"
This is the section most tutorials skip. A script that downloads HTML is not a scraper. A scraper is a system.
Scheduling. Cron, scheduler, workflow runner. When does it run? How often? What if the last run hasn't finished? Cron is fine to start. For something more involved, Airflow, Prefect, or simple code with APScheduler.
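For the in-Python route, a minimal APScheduler sketch (the job body is yours to fill in):

```python
from apscheduler.schedulers.blocking import BlockingScheduler

def run_scrape():
    print("scraping...")   # call your actual scraper here

sched = BlockingScheduler()
# every 30 minutes; max_instances=1 skips a run if the previous one
# is still going instead of piling them up
sched.add_job(run_scrape, "interval", minutes=30, max_instances=1)
sched.start()
```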
Retry policy. Networks drop, servers return 503, proxies time out. Without retry logic, your scraper breaks at the first error. Exponential backoff + maximum N attempts + a dead letter queue for requests that failed after N tries.
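One way to sketch that, with a JSONL file standing in for the dead letter queue:

```python
import json
import time
import requests

DEAD_LETTER_FILE = "dead_letter.jsonl"   # hypothetical path

def fetch_with_retry(url: str, attempts: int = 3) -> str | None:
    for i in range(attempts):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException as exc:
            last_error = str(exc)
            time.sleep(2 ** i)            # exponential backoff
    # give up: record the failure instead of losing it silently
    with open(DEAD_LETTER_FILE, "a") as f:
        f.write(json.dumps({"url": url, "error": last_error}) + "\n")
    return None
```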
Error monitoring. A silent scraper is a broken scraper. Failures should fire an alert — Telegram, Slack, email. If your scraper stops collecting data and you find out a week later, you have a week-long hole in your data that nothing will backfill.
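An alert can be as small as one POST to the Telegram Bot API; the token and chat id below are placeholders:

```python
import requests

BOT_TOKEN = "123456:ABC-your-bot-token"   # hypothetical bot token
CHAT_ID = "987654321"                     # hypothetical chat id

def alert(message: str) -> None:
    # one sendMessage call is enough for a failure notification
    requests.post(
        f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage",
        json={"chat_id": CHAT_ID, "text": message},
        timeout=10,
    )

def run_scrape() -> None:
    ...   # your actual scraping entry point

try:
    run_scrape()
except Exception as exc:
    alert(f"Scraper failed: {exc!r}")
    raise
```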
Storage. Start with CSV. As data grows, SQLite or Postgres. As it grows more, object storage (S3, R2) for raw HTML + DB for extracted records. It's worth saving the raw HTML too — when you rewrite the parser, you can re-run on old data without re-scraping.
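A sketch of that split, with local disk and SQLite standing in for S3 and Postgres:

```python
import hashlib
import pathlib
import sqlite3

RAW_DIR = pathlib.Path("raw_html")        # could just as well be S3/R2
RAW_DIR.mkdir(exist_ok=True)

db = sqlite3.connect("scrape.db")
db.execute("CREATE TABLE IF NOT EXISTS records (url TEXT, name TEXT, price TEXT)")

def store(url: str, html: str, record: dict) -> None:
    # raw HTML keyed by a hash of the URL: a future parser rewrite can
    # re-run over these files without hitting the site again
    key = hashlib.sha256(url.encode()).hexdigest()
    (RAW_DIR / f"{key}.html").write_text(html, encoding="utf-8")
    db.execute(
        "INSERT INTO records (url, name, price) VALUES (?, ?, ?)",
        (url, record["name"], record["price"]),
    )
    db.commit()
```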
Idempotency and deduplication. The same product shows up multiple times in the inventory. The same article can have multiple URLs. Dedupe at the data level, not the URL level. A canonical identifier (slug, EAN, content hash) saves you.
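A sketch of an upsert keyed on a canonical identifier (EAN here, hypothetically), so duplicates collapse instead of piling up:

```python
import sqlite3

db = sqlite3.connect("scrape.db")
db.execute("""
    CREATE TABLE IF NOT EXISTS products (
        ean   TEXT PRIMARY KEY,   -- canonical identifier, not the URL
        name  TEXT,
        price TEXT
    )
""")

def upsert(record: dict) -> None:
    # the same product seen under two URLs becomes a single row
    db.execute(
        """INSERT INTO products (ean, name, price) VALUES (?, ?, ?)
           ON CONFLICT(ean) DO UPDATE SET name = excluded.name, price = excluded.price""",
        (record["ean"], record["name"], record["price"]),
    )
    db.commit()
```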
Legal and ethical side
Short but important section. Scraping isn't a purely technical discipline.
robots.txt is a request, not a legal prohibition. It tells you what the site prefers. Ignoring it isn't a crime, but it can be evidence in a civil dispute that you knew what the site didn't want. Respect it as a default. If you ignore it, have a reason.
Site ToS. Some sites explicitly forbid automated access. That is a contractual relationship and can be legally binding — it depends on jurisdiction and specific terms. In the US, the years-long hiQ Labs v. LinkedIn litigation set the precedent that scraping publicly available data isn't a CFAA crime. That doesn't mean it isn't a ToS violation. For corporate use, talk to a lawyer.
GDPR and personal data. If you scrape names, emails, photos of people — you're processing personal data. That has consequences: legitimate purpose, retention, right to be forgotten. Public data is not automatically free data.
Copyright. Article text, images, databases all have an author. Scraping for your own analysis is a different animal than scraping and republishing. If you want to reproduce text, deal with the license.
A practical rule that has never let me down: scrape the way you'd want someone to scrape you.
And now AI is changing it
Here's where it gets interesting. LLMs (Claude, GPT-4, Gemini) over the past two years have transformed what scraping even is. Four big shifts:
1) Extraction via LLM instead of CSS selectors
Classic scraping: inspect the DOM, find .product-card .price, write a selector, parse. Works great — until the site reshuffles its HTML. Then it breaks and you go hunting for a new selector.
LLM-driven extraction: send raw HTML (or a simplified version) to the model and say: "Pull me a list of products with name, price, availability. Return JSON." The model gets the HTML, returns structured data.
Pros: robust against layout changes. When the site reshuffles HTML, your scraper still works. No more if div.class == "product-2024-v3".
Cons: more expensive (you pay for tokens), slower (seconds vs. milliseconds), occasionally hallucinates (the model invents a price that wasn't there).
The hybrid approach I use: CSS selectors as fast path, LLM as fallback. When the selector fails, send HTML to Claude. You get resilience without paying full LLM-scraping price.
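A sketch of that hybrid, assuming hypothetical selectors and the Anthropic Python SDK for the fallback (the model name is a placeholder for whatever you use):

```python
import json
import anthropic
from bs4 import BeautifulSoup

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment

def extract_products(html: str) -> list[dict]:
    # fast path: cheap and fast, breaks when the layout changes
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select(".product-card"):     # hypothetical selectors
        name = card.select_one(".name")
        price = card.select_one(".price")
        if name and price:
            items.append({"name": name.get_text(strip=True),
                          "price": price.get_text(strip=True)})
    if items:
        return items
    # fallback: slower and paid per token, but survives an HTML reshuffle
    msg = client.messages.create(
        model="claude-sonnet-4-5",   # assumption: pick your own model
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": "Extract the products from this HTML as a JSON list of "
                       "objects with name, price, availability. Return only JSON.\n\n"
                       + html,
        }],
    )
    return json.loads(msg.content[0].text)
```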
2) AI-native scraping toolkits
A new generation of tools has emerged where you write tasks in natural language, not in selectors:
- Browser Use — open-source agent that drives a browser. Instead of "click on .btn-submit," you write "click the Sign In button."
- Stagehand — Browserbase library that combines Playwright with LLM APIs.
- Firecrawl — hosted service that returns Markdown from any page, optimized for LLMs.
- ScrapeGraphAI — Python library for LLM scraping, you define a schema and it extracts.
For exploratory scraping, prototypes, one-off extractions — faster than the traditional approach. For production systems with thousands of requests per hour, CSS selectors are still cheaper.
3) MCP servers as a scraping alternative
MCP (Model Context Protocol) has, in 2026, become the standard for how LLMs talk to tools and data. If a data provider has an MCP server, you don't need to scrape — the model requests data through the official interface.
Developer takeaway: if an MCP server exists (or even just a REST API), start there. Scraping makes sense when there's no API or it's heavily restricted.
4) Defense: AI Audit, paywalls, and the ethics of training data
The other side of the coin. AI scrapers (especially for training data) have caused websites to actively block AI agents.
- Cloudflare AI Audit shows sites which AI agents are crawling them and enables "pay per crawl" — the AI company pays for access.
- Major media (NYT, Washington Post) block OpenAI/Anthropic crawlers and sue over historical scraping.
- Bloggers and creators add noai directives, paywalls for AI agents, and explicit robots.txt rules.
What does it mean for you? Scraping for AI training is getting legally and ethically complicated. Scraping for your own use stays largely the same, but detection layers are getting richer here too. Polite scraping is now more about reputation than ever.
Conclusion — what to take home
Five things I'd tell my five-years-younger self:
- Start with the simplest stack — requests + BeautifulSoup for 80% of cases. Reach for Playwright only when you must.
- Storage > parser — save raw HTML, parse later. It saves you future rewrites.
- Polite is the default — rate limit, a contact in the User-Agent, robots.txt. Reputation is capital.
- LLM as fallback, not main weapon — selectors are cheap and fast. LLM is expensive and magical. Combine.
- Monitoring from version one — a silent scraper is a broken scraper. Telegram or Slack alerts save your data.
Scraping in 2026 isn't dead. It's the opposite. It's just smarter, more ethical, more interleaved with AI. Same principles (go slow, respect other people's infrastructure, save raw data), new tools (Playwright, LLM extraction, MCP).
And for those of us building projects around it, this is a great time to do it.