Playwright Scraper Skill

Playwright-based web scraping skill with anti-bot protection for reliable data extraction

Playwright Scraper Skill is a community skill for advanced web scraping, covering anti-bot protection bypass, dynamic content extraction, form automation, screenshot capture, and complex site navigation for robust data collection from modern web applications.

What Is This?

Overview

Playwright Scraper Skill gives AI agents and data collection tools web scraping capabilities built on Playwright browser automation, with anti-bot protection handling built in. It covers five areas. Anti-bot bypass techniques address CAPTCHAs, fingerprint detection, and rate limiting through realistic browser behavior patterns and request timing. Dynamic content extraction waits for JavaScript-rendered elements and AJAX-loaded data before scraping. Form automation fills inputs, selects dropdown options, and submits multi-step forms across complex workflows. Screenshot capture documents page state and visual elements for verification. Complex navigation handles single-page application routing, infinite scroll, and pagination patterns. The skill has been successfully tested on sites with sophisticated bot detection systems, including e-commerce platforms and content aggregators that actively block automated access.

Who Should Use This

This skill serves data collection engineers building robust scrapers, market research teams extracting competitor data, and AI agents requiring structured web data from protected sites. It is also well suited for QA engineers who need to automate and verify web interactions as part of testing pipelines.

Why Use It?

Problems It Solves

Modern websites use sophisticated bot detection that blocks simple HTTP scraping attempts. JavaScript-rendered content does not appear in the initial HTML and requires browser execution to extract. Multi-step workflows with form submissions and authentication cannot be scraped with static requests. Building custom scraping solutions with anti-bot protection requires extensive browser automation knowledge and ongoing maintenance as detection methods evolve. Scaling scraping operations across many target sites demands infrastructure for managing browser instances and running concurrent extraction jobs. Verifying scraping accuracy is difficult without visual confirmation of page state during extraction.

Core Highlights

Anti-bot handler bypasses detection through realistic browser behavior patterns. Content extractor waits for dynamic JavaScript rendering before scraping. Form automator handles multi-step submissions and complex input sequences. Screenshot tool captures visual page state for verification and debugging purposes.

How to Use It?

Basic Usage

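from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # Navigate and wait for the JavaScript-rendered elements
    page.goto('https://site.com')
    page.wait_for_selector('.product')

    # Extract the text of every matching element
    products = page.query_selector_all('.product')
    for product in products:  # avoid reusing `p`, the Playwright handle
        print(product.text_content())

    # Release the browser process when finished
    browser.close()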

Real-World Examples

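# These snippets assume `page` is an open Playwright page,
# created as in Basic Usage.

# Log in through a form and wait for the post-login redirect
page.goto('https://app.com/login')
page.fill('#email', 'user@test.com')
page.fill('#password', 'pass')
page.click('button[type=submit]')
page.wait_for_url('**/dashboard')

# Jump to the bottom of the page to trigger lazy-loaded content
page.goto('https://store.com')
page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
page.wait_for_timeout(2000)

# Capture the page state for verification
page.screenshot(path='verification.png')

# Scroll in steps to load more items on an infinite-scroll page
for _ in range(5):
    page.evaluate('window.scrollBy(0, 1000)')
    page.wait_for_timeout(1000)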
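
A fixed number of scroll steps can stop too early or keep scrolling after the page has run out of content. One possible alternative, sketched here under the assumption that new items increase document.body.scrollHeight, is to scroll until the height stops growing. The helper name scroll_until_stable and its parameters are hypothetical, not part of the skill or of Playwright; it uses only page.evaluate and page.wait_for_timeout.

```python
def scroll_until_stable(page, max_rounds=20, pause_ms=1000):
    """Scroll to the bottom repeatedly until the page height stops growing.

    `page` is expected to behave like a Playwright sync Page.
    Returns the number of scroll rounds performed.
    """
    last_height = page.evaluate("document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        page.evaluate("window.scrollTo(0, document.body.scrollHeight)")
        # Give lazy-loaded content a moment to arrive.
        page.wait_for_timeout(pause_ms)
        new_height = page.evaluate("document.body.scrollHeight")
        if new_height == last_height:
            # Nothing new was loaded, so assume the end was reached.
            return rounds
        last_height = new_height
    # Safety cap hit: the page may still have more content.
    return max_rounds
```

The max_rounds cap keeps the loop from running forever on feeds that never end. Sites that load content through a "load more" button, or inside a scrollable container rather than the document body, would need a different stop condition.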

Advanced Tips

Use stealth plugins to mask Playwright automation signatures that websites detect through JavaScript fingerprinting. Implement random delays between actions to mimic human browsing patterns and avoid rate limiting. Rotate user agents and viewport sizes across scraping sessions to appear as different users and reduce detection risk. Additionally, consider intercepting and blocking unnecessary network requests such as images and fonts to improve scraping speed and reduce resource consumption during large-scale extraction runs.
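
Three of these tips (random delays, rotated user agents and viewports, and blocked resource types) can be sketched with standard Playwright calls. This is a minimal illustration, not the skill's own implementation: the user agent strings, viewport sizes, delay range, and all helper names are assumptions chosen for the example. Stealth plugins are left out because they depend on third-party packages.

```python
import random

# Illustrative pools only; choose values that suit your targets.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
]
VIEWPORTS = [
    {"width": 1366, "height": 768},
    {"width": 1920, "height": 1080},
]
BLOCKED_RESOURCE_TYPES = {"image", "font", "media"}


def pick_context_options(rng=random):
    """Choose a user agent and viewport for one scraping session."""
    return {
        "user_agent": rng.choice(USER_AGENTS),
        "viewport": rng.choice(VIEWPORTS),
    }


def human_delay_ms(rng=random, low=500, high=2500):
    """Return a random pause in milliseconds between actions."""
    return rng.uniform(low, high)


def should_block(resource_type):
    """Decide whether a request of this resource type is skipped."""
    return resource_type in BLOCKED_RESOURCE_TYPES


def scrape_title(url):
    """Open `url` with a randomized context and return the page title."""
    # Imported lazily so the helpers above work without Playwright.
    from playwright.sync_api import sync_playwright

    def handle_route(route):
        if should_block(route.request.resource_type):
            route.abort()
        else:
            route.continue_()

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        try:
            context = browser.new_context(**pick_context_options())
            page = context.new_page()
            page.route("**/*", handle_route)
            page.goto(url)
            page.wait_for_timeout(human_delay_ms())
            return page.title()
        finally:
            browser.close()
```

Blocking images also removes them from screenshots, so verification captures may need a context without the route handler.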

When to Use It?

Use Cases

Scrape competitor pricing and product data from e-commerce sites with bot protection for market intelligence. Extract job postings from employment websites that render listings dynamically with JavaScript frameworks. Collect social media public profile data from platforms with infinite scroll and anti-automation measures for research analysis.

Important Notes

Requirements

Playwright installed (pip install playwright) with browser binaries for Chromium, Firefox, or WebKit (playwright install). Sufficient system resources, including memory and CPU, for browser processes during scraping. Understanding of HTML selectors and page structure for accurate data extraction targeting.

Usage Recommendations

Do: implement rate limiting and random delays to avoid overwhelming target servers and triggering defenses. Take screenshots at critical steps to verify scraping logic extracts correct data. Use wait conditions for dynamic elements rather than fixed timeouts for reliability.

Don't: scrape websites in violation of their terms of service or robots.txt directives. Run scrapers without error handling since websites change structure and break selectors. Leave browser processes running after scraping completes since they consume significant system resources.
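
Several of these recommendations can be combined in one pattern: wait on a selector instead of a fixed timeout, close the browser in a finally block, and retry transient failures with a growing delay. The sketch below is one way to do that. The names with_retries and scrape_texts, the attempt count, and the backoff values are assumptions for illustration, and a real scraper would likely catch narrower exceptions such as Playwright's TimeoutError.

```python
import random
import time


def with_retries(action, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, backing off with jitter.

    Re-raises the last exception once `attempts` is exhausted.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff plus random jitter between tries.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))


def scrape_texts(url, selector):
    """Return the text of every element matching `selector` on `url`."""
    # Imported lazily so with_retries is usable without Playwright.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as pw:
        browser = pw.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url)
            # Wait for the element itself rather than a fixed timeout.
            page.wait_for_selector(selector)
            return page.locator(selector).all_text_contents()
        finally:
            # Always release the browser process, even on failure.
            browser.close()
```

A call might look like with_retries(lambda: scrape_texts('https://site.com', '.product')), reusing the placeholder URL and selector from Basic Usage.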

Limitations

Advanced CAPTCHA systems may still block automated access despite anti-bot measures. Scraping is significantly slower than API access due to full browser rendering overhead. Website structure changes break selectors and require regular scraper maintenance. Some sites implement server-side bot detection that cannot be bypassed with client-side techniques alone.
