Apify Actor Development
Apify Actor Development automation and integration
Apify Actor Development is an AI skill that guides the creation, testing, and deployment of Apify Actors for web scraping, data extraction, and automation tasks. It covers Actor architecture, input schema design, proxy configuration, storage integration, error handling patterns, and publishing workflows that produce reliable, scalable cloud automation.
What Is This?
Overview
Apify Actor Development provides structured workflows for building Actors on the Apify platform. It handles scaffolding Actor projects with the proper directory structure and configuration, designing input schemas that validate user-provided parameters, and configuring proxy rotation and browser automation for reliable scraping. It also covers integrating with Apify storage systems (datasets, key-value stores, and request queues), implementing retry logic and error handling for resilient crawling, and publishing Actors to the Apify Store with documentation and versioning.
Who Should Use This
This skill serves developers building web scrapers that need cloud execution and scheduling, data engineers creating automated data collection pipelines, teams migrating local scraping scripts to managed cloud infrastructure, and entrepreneurs publishing scraping tools as products on the Apify marketplace.
Why Use It?
Problems It Solves
Running web scrapers locally is unreliable due to IP blocking, browser crashes, and machine availability. Scaling scraping operations requires managing infrastructure that distracts from the actual extraction logic. Without structured input validation, Actors fail with confusing errors when users provide incorrect parameters. Distributing scrapers to non-technical users requires packaging them with documentation and configuration interfaces.
Core Highlights
The Actor SDK provides a standardized lifecycle for initialization, execution, and cleanup. Built-in proxy management rotates IP addresses automatically to avoid blocking. Apify storage APIs handle dataset creation, pagination, and export without custom infrastructure. The Apify Store enables monetization and distribution of scraping tools to a marketplace audience.
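As a rough illustration of the storage API (a minimal sketch; the dataset name "products" and the sample record are placeholders, not taken from the examples below), opening a named dataset, appending a record, and reading results back in pages looks roughly like this:
import { Actor } from "apify";

await Actor.init();

// Open (or lazily create) a named dataset instead of the run's default one.
const dataset = await Actor.openDataset("products");
await dataset.pushData({ title: "Example product", price: "9.99" });

// Read the dataset back in pages without managing any storage infrastructure.
const { items, total } = await dataset.getData({ offset: 0, limit: 1000 });
console.log(`Fetched ${items.length} of ${total} records`);

await Actor.exit();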
How to Use It?
Basic Usage
import { Actor } from "apify";
import { CheerioCrawler } from "crawlee";

await Actor.init();

// Actor.getInput() returns null when no input is supplied, so fall back to an empty object.
const input = (await Actor.getInput()) ?? {};
const { startUrls = [], maxPages = 100 } = input;

const crawler = new CheerioCrawler({
    maxRequestsPerCrawl: maxPages,
    async requestHandler({ request, $, enqueueLinks }) {
        const title = $("h1").text().trim();
        const price = $(".price").text().trim();
        const description = $(".description").text().trim();
        await Actor.pushData({
            url: request.url,
            title,
            price,
            description,
            scrapedAt: new Date().toISOString(),
        });
        await enqueueLinks({
            selector: "a.product-link",
            strategy: "same-domain",
        });
    },
    // In Crawlee v3 the error is passed as the second argument, not on the context object.
    failedRequestHandler({ request }, error) {
        console.error(`Request ${request.url} failed: ${error.message}`);
    },
});

await crawler.run(startUrls);
await Actor.exit();
Real-World Examples
import { Actor } from "apify";
import { PlaywrightCrawler, Dataset } from "crawlee";

await Actor.init();

// Guard against a missing input the same way as above.
const input = (await Actor.getInput()) ?? {};

// Route requests through Apify residential proxies for the requested country.
const proxyConfig = await Actor.createProxyConfiguration({
    groups: ["RESIDENTIAL"],
    countryCode: input.country || "US",
});

const crawler = new PlaywrightCrawler({
    proxyConfiguration: proxyConfig,
    headless: true,
    requestHandlerTimeoutSecs: 120,
    maxConcurrency: 5,
    async requestHandler({ page, request }) {
        // Wait for the client-rendered results before extracting anything.
        await page.waitForSelector(".results-list", { timeout: 30000 });
        const items = await page.$$eval(".result-item", (elements) =>
            elements.map((el) => ({
                name: el.querySelector(".name")?.textContent?.trim(),
                rating: el.querySelector(".rating")?.textContent?.trim(),
                reviews: el.querySelector(".count")?.textContent?.trim(),
            }))
        );
        await Dataset.pushData(
            items.map((item) => ({
                ...item,
                sourceUrl: request.url,
                extractedAt: new Date().toISOString(),
            }))
        );
    },
});

await crawler.run(input.startUrls);
await Actor.exit();
Advanced Tips
Use the Actor input schema to define field types, defaults, and descriptions so the Apify console generates a user-friendly configuration form. Store intermediate state in key-value stores to enable resumable crawls that survive Actor restarts. Set memory limits and timeout values in the Actor configuration to prevent runaway executions from consuming excessive resources. Minimal sketches of the first two tips follow.
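For the first tip, a minimal input schema sketch matching the Basic Usage example above (titles, descriptions, and limits here are illustrative) would typically live in .actor/input_schema.json, or INPUT_SCHEMA.json in older project layouts, and drives the form the Apify console renders:
{
    "title": "Product scraper input",
    "type": "object",
    "schemaVersion": 1,
    "properties": {
        "startUrls": {
            "title": "Start URLs",
            "type": "array",
            "description": "Pages where the crawl should begin.",
            "editor": "requestListSources"
        },
        "maxPages": {
            "title": "Maximum pages",
            "type": "integer",
            "description": "Stop the crawl after this many pages.",
            "default": 100,
            "minimum": 1
        }
    },
    "required": ["startUrls"]
}
For resumable crawls, one rough pattern (a sketch only; the CRAWL_STATE key and the state shape are arbitrary) is to load a checkpoint from the default key-value store at startup and save it whenever the SDK emits its persistState event:
import { Actor } from "apify";

await Actor.init();

// Load the checkpoint written by a previous run, if any.
const state = (await Actor.getValue("CRAWL_STATE")) ?? { processedUrls: [] };

// The SDK emits persistState periodically and before a migration,
// so saving here lets a restarted run resume where it left off.
Actor.on("persistState", async () => {
    await Actor.setValue("CRAWL_STATE", state);
});

// ... run the crawler here, appending finished URLs to state.processedUrls ...

await Actor.setValue("CRAWL_STATE", state);
await Actor.exit();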
When to Use It?
Use Cases
Use Apify Actor Development when building cloud hosted web scrapers that need scheduling and proxy rotation, when creating data collection tools for non-technical users, when migrating local Puppeteer or Playwright scripts to managed infrastructure, or when publishing scraping tools on the Apify Store for distribution.
Related Topics
Crawlee framework for web crawling, Playwright and Puppeteer browser automation, proxy rotation strategies, web scraping ethics and robots.txt compliance, and data pipeline design all complement Apify Actor development.
Important Notes
Requirements
An Apify account for cloud deployment and storage access. Node.js runtime and the Apify SDK installed locally for development. Understanding of web scraping fundamentals including HTTP requests, DOM parsing, and browser automation.
Usage Recommendations
Do: validate all Actor inputs using the input schema to provide clear error messages for misconfigured runs. Use the built-in proxy configuration rather than managing proxies manually. Test Actors locally with apify run before deploying to the cloud.
Don't: hardcode URLs or selectors that change frequently without providing them as configurable input parameters. Ignore rate limiting and robots.txt directives, as aggressive scraping can result in IP bans. Store sensitive credentials in Actor source code instead of using environment variables or the Apify secret store.
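As a rough sketch of these recommendations (the error message and the API_TOKEN variable name are illustrative), failing fast on bad input and reading credentials from the environment rather than from source code might look like this:
import { Actor } from "apify";

await Actor.init();

const input = (await Actor.getInput()) ?? {};

// Fail the run with a clear, user-facing message instead of letting the crawler crash later.
if (!Array.isArray(input.startUrls) || input.startUrls.length === 0) {
    await Actor.fail("Input must contain at least one entry in startUrls.");
}

// Read secrets from environment variables (set in the Apify console or as secret input fields),
// never from hardcoded strings committed with the Actor source.
const apiToken = process.env.API_TOKEN;

// ... rest of the Actor ...
await Actor.exit();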
Limitations
Browser-based Actors consume significantly more memory than HTTP-based crawlers, which affects cost at scale. Website structure changes can break selectors without warning, requiring ongoing maintenance. The Apify free tier limits concurrent runs and compute units, which may constrain large operations.