Apify Ultimate Scraper
Automate and integrate Apify Ultimate Scraper workflows at scale
Apify Ultimate Scraper is an AI skill that provides advanced web scraping techniques and best practices for building robust, production-grade Apify Actors that handle complex scraping scenarios. It covers anti-bot bypass strategies, dynamic content rendering, session management, data quality validation, and scalable architecture patterns for high-volume data collection.
What Is This?
Overview
Apify Ultimate Scraper provides advanced scraping patterns for tackling websites that resist automated data collection. It covers implementing browser fingerprint rotation to avoid detection, managing sessions with cookie persistence across crawls, rendering JavaScript-heavy applications with headless browsers, validating data quality before storage, applying adaptive retry strategies, and scaling across multiple Actor instances.
Who Should Use This
This skill serves experienced scraping developers handling sites with aggressive bot detection, data engineers building production pipelines requiring high reliability, teams scaling beyond single Actor instances, and developers troubleshooting anti-bot related scraping failures.
Why Use It?
Problems It Solves
Modern websites employ sophisticated bot detection that blocks basic scraping approaches. Single-page applications render content dynamically, making plain HTTP scraping ineffective. Large-scale operations hit rate limits, requiring distributed execution. Data quality issues from inconsistent extraction go undetected until downstream systems fail.
Core Highlights
Fingerprint rotation varies browser signatures to avoid detection. Session management maintains authentication across paginated crawls. Adaptive retry logic classifies errors and applies appropriate recovery. Data validation catches extraction failures before corrupted records enter datasets.
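As a sketch of the adaptive-retry idea, the helper below classifies a failed response and maps each class to a recovery action. The category names, status-code thresholds, and recovery flags are illustrative assumptions, not part of the Apify or Crawlee APIs:

```javascript
// Classify a failure so the retry strategy can react appropriately.
// Thresholds and marker regexes here are assumptions to tune per site.
function classifyError(statusCode, body = "") {
  if (statusCode === 429) return "rate-limited"; // back off, keep session
  if (statusCode === 403 || /captcha/i.test(body)) {
    return "blocked"; // retire session, rotate proxy
  }
  if (statusCode >= 500) return "transient"; // retry as-is
  return "fatal"; // do not retry
}

// Map each error class to a recovery action.
function recoveryFor(errorClass) {
  return {
    "rate-limited": { retry: true, backoff: true, retireSession: false },
    "blocked": { retry: true, backoff: false, retireSession: true },
    "transient": { retry: true, backoff: false, retireSession: false },
    "fatal": { retry: false, backoff: false, retireSession: false },
  }[errorClass];
}
```

A crawler's error handler could call classifyError on each failure and retire the session only when the result is "blocked", rather than burning a fresh proxy on every transient error.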
How to Use It?
Basic Usage
import { Actor } from "apify";
import { PlaywrightCrawler } from "crawlee";
await Actor.init();
const input = await Actor.getInput();
const proxyConfig = await Actor.createProxyConfiguration({
groups: ["RESIDENTIAL"],
countryCode: input.country || "US",
});
const crawler = new PlaywrightCrawler({
proxyConfiguration: proxyConfig,
useSessionPool: true,
sessionPoolOptions: {
maxPoolSize: 20,
sessionOptions: { maxUsageCount: 5 },
},
browserPoolOptions: {
fingerprintOptions: {
fingerprintGeneratorOptions: {
browsers: ["firefox", "chrome"],
},
},
},
async requestHandler({ page, request, session }) {
const status = page.url().includes("blocked")
? "blocked" : "ok";
if (status === "blocked") {
session.retire();
throw new Error("Session blocked, rotating");
}
await page.waitForSelector(".content", {
timeout: 15000,
});
const data = await page.evaluate(() => ({
title: document.querySelector("h1")?.textContent,
items: Array.from(
document.querySelectorAll(".item")
).map((el) => el.textContent.trim()),
}));
await Actor.pushData({
...data,
url: request.url,
});
},
failedRequestHandler({ request }) {
console.error(`Failed: ${request.url}`);
},
});
await crawler.run(input.startUrls);
await Actor.exit();
Real-World Examples
class DataValidator {
constructor(schema) {
this.schema = schema;
this.errors = [];
}
validate(record) {
const issues = [];
for (const [field, rules] of
Object.entries(this.schema)) {
const value = record[field];
// Explicit null/empty check so valid falsy values (0, false) pass.
if (rules.required
&& (value == null || value === "")) {
issues.push(`Missing required field: ${field}`);
}
if (rules.type && value != null
&& typeof value !== rules.type) {
issues.push(
`${field}: expected ${rules.type}, got ${typeof value}`
);
}
if (rules.pattern && value != null
&& !rules.pattern.test(value)) {
issues.push(`${field}: failed pattern validation`);
}
}
if (issues.length > 0) {
this.errors.push({ record, issues });
}
return issues.length === 0;
}
report() {
return {
totalErrors: this.errors.length,
commonIssues: this.summarizeIssues(),
};
}
summarizeIssues() {
const counts = {};
for (const err of this.errors) {
for (const issue of err.issues) {
counts[issue] = (counts[issue] || 0) + 1;
}
}
return Object.entries(counts)
.sort(([, a], [, b]) => b - a)
.slice(0, 10);
}
}
const validator = new DataValidator({
title: { required: true, type: "string" },
price: { required: true, type: "number" },
url: { required: true, pattern: /^https?:\/\// },
});
Advanced Tips
Rotate user-agent strings alongside proxy IPs to create unique fingerprints per session. Implement exponential backoff for rate-limited requests, starting at two seconds and capping at sixty. Use request queue priority to process high-value pages before exhausting rate limits.
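The backoff schedule described above (two-second base, sixty-second cap) can be sketched as a small helper; the up-to-20% jitter is an added assumption to avoid synchronized retries:

```javascript
// Exponential backoff: 2 s base doubling per attempt, capped at 60 s,
// plus random jitter so parallel requests don't retry in lockstep.
function backoffMs(attempt, baseMs = 2000, capMs = 60000) {
  const delay = Math.min(baseMs * 2 ** attempt, capMs);
  const jitter = Math.random() * 0.2 * delay; // up to +20%
  return delay + jitter;
}
```

An errorHandler could, for example, sleep for backoffMs(request.retryCount) before allowing the request to be retried.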
When to Use It?
Use Cases
Use Apify Ultimate Scraper when target websites employ Cloudflare or similar bot protection, when scraping JavaScript-heavy applications that require full browser rendering, when building pipelines that handle millions of pages, or when existing Actors fail due to detection.
Related Topics
Browser fingerprinting techniques, headless browser optimization, distributed crawling architectures, data quality frameworks, and web scraping legal considerations complement advanced scraping.
Important Notes
Requirements
An Apify account with residential proxy access for bypassing bot detection. Sufficient memory for browser-based Actors. Understanding of the target site's structure and protection mechanisms.
Usage Recommendations
Do: monitor session health and retire sessions proactively before they get blocked. Validate a sample of extracted data after each run to catch silent failures. Start with conservative concurrency and increase gradually while monitoring block rates.
Don't: use maximum concurrency from the start, as sudden traffic spikes trigger detection; rely on HTTP status codes alone, because soft blocks return 200 with different content; or bypass CAPTCHA protections with automated solvers on sites where this violates terms of service.
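Because soft blocks return 200, detection has to inspect the response body rather than the status code. A minimal sketch, assuming an illustrative marker list that would need tuning per target site:

```javascript
// Soft blocks often serve HTTP 200 with a challenge or stub page instead
// of real content. These markers are illustrative assumptions.
const BLOCK_MARKERS = [
  /access denied/i,
  /verify you are human/i,
  /unusual traffic/i,
];

function looksSoftBlocked(html, expectedSelectorFound) {
  // A page missing its expected content is suspicious even without markers.
  if (!expectedSelectorFound) return true;
  return BLOCK_MARKERS.some((re) => re.test(html));
}
```

In a requestHandler like the one in Basic Usage, the expectedSelectorFound argument would come from checking whether the expected content selector actually appeared on the page.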
Limitations
Advanced techniques cannot guarantee access to sites with the most sophisticated bot protection. Browser-based scraping consumes substantially more resources than HTTP approaches. Anti-bot changes require ongoing maintenance of fingerprinting strategies.