A stealth web scraper that navigates anti-detection systems using Patchright's CDP-patched Chromium — extracting nested detail-page data from every book on books.toscrape.com.
This is a production-grade stealth scraping pipeline built on Patchright and Playwright, demonstrating the exact anti-detection techniques used to bypass commercial bot-detection systems like Cloudflare, DataDome, and PerimeterX. It doesn't just pull listing data — it navigates into every book's detail page to extract UPC, tax breakdown, stock count, and full descriptions. The codebase is structured how a real scraping service would be: typed schemas, retry logic, human-like timing, clean separation of stealth configuration from extraction logic, and a CI pipeline that runs the scraper on a weekly schedule.
When Playwright drives Chrome over CDP (Chrome DevTools Protocol), the browser leaks automation signals at multiple layers:
- JavaScript level:
navigator.webdriveristrue,navigator.pluginsis empty,window.chrome.runtimeis absent. - CDP level: Playwright calls
Runtime.enableon every new context, which is detectable via timing side-channels and service worker introspection. - Binary level: Internal variables like
window.cdc_adoQpoasnfa76pfcZLmcfl_Arrayare injected by the ChromeDriver/CDP integration layer. These cannot be removed withaddInitScript. - TLS level: The TLS fingerprint of a CDP-controlled Chrome differs from a user-launched instance due to modified cipher suite ordering.
You can patch the JS layer with page.addInitScript(), but CDP and binary signals survive any amount of JavaScript injection.
Patchright forks Playwright and applies patches to the Chromium binary itself:
- CDP channel masking — rewrites the internal CDP identifier so automation is not detectable at the protocol level.
Runtime.enablesuppression — prevents the CDP call that advanced detection systems monitor.cdc_*variable removal — strips ChromeDriver artifacts from the binary before launch.- Consistent TLS fingerprint — the patc 8000 hed binary presents the same TLS profile as a user-launched Chrome.
| Tool | Ease of Use | Detection Resistance | Maintenance | Production Readiness |
|---|---|---|---|---|
| Vanilla Playwright | Excellent | Low — fails any serious fingerprinting | Official Chromium team | Not for protected sites |
| playwright-extra + stealth | Good | Medium — JS patches only, CDP leaks remain | Community plugin | Moderate protection sites |
| Puppeteer-stealth | Good | Medium — same JS-only limitation | Less active, Puppeteer-only | Legacy codebases |
| Patchright | Excellent (Playwright API) | High — binary + CDP + JS patches combined | Active fork, tracks Playwright releases | Advanced protection sites |
This repo uses Patchright as the primary driver. Swapping to vanilla Playwright requires changing one import line — useful for sites with no detection.
| Technique | Signal It Hides | Detection System Defeated |
|---|---|---|
navigator.webdriver → undefined |
JS automation flag (true in all CDP browsers) |
Cloudflare JS Challenge, Akamai Bot Manager, PerimeterX |
--disable-blink-features=AutomationControlled |
Chrome's internal automation feature flag | Chrome-specific headless checks |
navigator.plugins → [1, 2, 3] |
Empty plugin list (zero plugins = headless) | FingerprintJS, DataDome, Kasada |
window.chrome.runtime mock |
Absent Chrome API object in headless mode | Scripts checking typeof chrome.runtime |
Notification.permission → 'default' |
Headless returns 'denied' by default |
Permission-based fingerprinting |
screen.colorDepth → 24 |
Non-standard depth in headless environments | Canvas/screen fingerprinting |
| Realistic viewport (1920×1080) | Default headless is 800×600 | Dimension-based bot detection |
| Human-like random delays (300ms–6s) | Constant timing between requests | Behavioral analysis, rate limiting |
| Chrome 131 Windows User-Agent | "HeadlessChrome" in default UA string |
Any UA-string filter |
The scraper navigates to each book's detail page and extracts the full product information table. Every BookDetail object contains:
| Field | Type | Source |
|---|---|---|
title |
string |
<h1> on detail page |
price |
string |
.price_color element |
priceExclTax |
string |
Product info table: "Price (excl. tax)" |
priceInclTax |
string |
Product info table: "Price (incl. tax)" |
tax |
string |
Product info table: "Tax" |
rating |
number |
Star rating class → numeric (1–5) |
availability |
string |
Product info table: "Availability" |
inStock |
number |
Parsed from "In stock (X available)" |
description |
string |
#product_description + p |
upc |
string |
Product info table: "UPC" |
productType |
string |
Product info table: "Product Type" |
numberOfReviews |
number |
Product info table: "Number of reviews" |
detailUrl |
string |
Full URL to the product page |
{
"title": "A Light in the Attic",
"price": "£51.77",
"priceExclTax": "£51.77",
"priceInclTax": "£51.77",
"tax": "£0.00",
"rating": 3,
"availability": "In stock (22 available)",
"inStock": 22,
"description": "It's hard to imagine a world without A Light in the Attic...",
"upc": "a897fe39b1053632",
"productType": "Books",
"numberOfReviews": 0,
"detailUrl": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}git clone https://github.com/M-Hammad-Faisal/stealth-scraper-playwright
cd stealth-scraper-playwright
nvm use
npm install
npx playwright install chromium
npx patchright install chromium
npm run scrapeResults are written to output/results.json with a timestamped backup.
output/results.json:
{
"success": true,
"totalBooks": 60,
"pagesScraped": 3,
"scrapedAt": "2025-01-15T10:23:41.123Z",
"durationMs": 45230,
"detectionEvaded": true,
"books": [
{
"title": "A Light in the Attic",
"price": "£51.77",
"priceExclTax": "£51.77",
"priceInclTax": "£51.77",
"tax": "£0.00",
"rating": 3,
"availability": "In stock (22 available)",
"inStock": 22,
"description": "It's hard to imagine a world without A Light in the Attic...",
"upc": "a897fe39b1053632",
"productType": "Books",
"numberOfReviews": 0,
"detailUrl": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
}
]
}| Script | Command | Description |
|---|---|---|
scrape |
npm run scrape |
Run the stealth scraper end-to-end |
build |
npm run build |
Compile TypeScript to dist/ |
lint |
npm run lint |
Check code with ESLint v9 |
lint:fix |
npm run lint:fix |
Auto-fix linting issues |
format |
npm run format |
Format code with Prettier |
format:check |
npm run format:check |
Check formatting without writing |
typecheck |
npm run typecheck |
Run TypeScript compiler in check mode |
stealth-scraper-playwright/
├── README.md # Project documentation
├── package.json # Dependencies, scripts, metadata
├── tsconfig.json # TypeScript strict config (ES2022, ESNext modules)
├── eslint.config.js # ESLint v9 flat config with type-checked rules
├── .prettierrc # Prettier formatting rules
├── .prettierignore # Files excluded from formatting
├── .nvmrc # Node version pin (20)
├── .gitignore # Excludes node_modules, dist, output, env files
├── .github/
│ └── workflows/
│ └── scrape.yml # CI: lint, typecheck, format, scrape, artifact upload
└── src/
├── index.ts # Entry point: orchestrates init → scrape → save → close
├── scraper.ts # BookScraper class: list + detail page extraction
├── stealth.ts # Launch args and init script overrides (6 techniques)
├── types.ts # All interfaces: BookSummary, BookDetail, ScrapeResult
└── utils/
├── delay.ts # randomDelay, shortDelay, longDelay
├── logger.ts # Timestamped ANSI color logger
└── saveResults.ts # Write results.json + timestamped backup
The GitHub Actions workflow (.github/workflows/scrape.yml) runs on:
- Push to
master— every commit triggers a full pipeline run. - Manual dispatch — trigger from the Actions tab for on-demand scraping.
- Weekly schedule — every Monday at 08:00 UTC for fresh dataset generation.
Pipeline steps in order: checkout → Node.js setup (from .nvmrc) → npm ci → install Playwright browsers → install Patchright browsers → type check → lint → format check → run scraper → upload output/ as artifact (retained 30 days).
- Patchright — CDP-patched Chromium for deep stealth
- Playwright — Browser automation framework
- TypeScript 5.9 — Strict typing, zero
any - ESLint 9 — Flat config with type-checked rules
- Prettier 3 — Consistent code formatting
- Node.js 24+ — Runtime
- GitHub Actions — CI/CD with scheduled runs and artifact storage
- Price monitoring — Track competitor pricing across e-commerce catalogs on a daily/weekly schedule.
- Lead enrichment — Scrape business directories behind Cloudflare to extract verified contact data.
- Competitive intelligence — Monitor product launches, stock levels, and review counts across protected retail sites.
- Availability tracking — Detect restocks and inventory changes for high-demand products.
- Dataset generation for ML — Build labeled training datasets from structured product pages at scale.
MIT — Muhammad Hammad Faisal