Defuddle

Automate and integrate Defuddle to simplify and streamline complex processes

Defuddle is a community skill for extracting clean readable content from web pages, covering HTML parsing and cleanup, article content extraction, metadata retrieval, boilerplate removal, and markdown conversion for content processing pipelines.

What Is This?

Overview

Defuddle provides patterns for extracting the main content from web pages while removing navigation, advertisements, and boilerplate elements. It covers HTML parsing that processes raw page source into a structured DOM for content analysis, article extraction that identifies the primary content block using heuristics and scoring algorithms, metadata retrieval that pulls title, author, publication date, and description from page headers and structured data, boilerplate removal that strips navigation menus, sidebars, footers, and ad containers, and markdown conversion that transforms the cleaned HTML into readable markdown format. The skill enables developers to build content processing pipelines that consume web pages as structured text.
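To make the scoring idea concrete, here is a minimal sketch of a text-density heuristic. It is not Defuddle's actual algorithm: the function names `scoreBlock` and `pickMainBlock`, the link-density penalty, and the input shape (pre-split blocks with `text` and `linkText` fields) are all assumptions chosen for illustration.

```javascript
// Hypothetical sketch: score candidate blocks by text length, penalized by
// the share of that text that sits inside links (navigation is mostly links).
// This illustrates the general heuristic only, not Defuddle's internals.

function scoreBlock(block) {
  const textLength = block.text.trim().length;
  if (textLength === 0) return 0;
  const linkLength = block.linkText.trim().length;
  const linkDensity = Math.min(linkLength / textLength, 1);
  return textLength * (1 - linkDensity);
}

function pickMainBlock(blocks) {
  let best = null;
  let bestScore = 0;
  for (const block of blocks) {
    const score = scoreBlock(block);
    if (score > bestScore) {
      best = block;
      bestScore = score;
    }
  }
  return best; // null when no block contains any non-link text
}

// Illustrative input: a link-only nav, an article body, and a short footer.
const blocks = [
  { id: 'nav', text: 'Home About Contact', linkText: 'Home About Contact' },
  { id: 'article', text: 'A long paragraph of article prose with one link.', linkText: 'link' },
  { id: 'footer', text: 'Copyright notice', linkText: '' },
];
console.log(pickMainBlock(blocks).id); // article
```

A real scorer would also weigh element structure (tag names, nesting, class hints), which this sketch leaves out to keep the core idea visible.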

Who Should Use This

This skill serves developers building content aggregation tools, data engineers creating web scraping pipelines that need clean text output, and AI application developers preparing web content for language model consumption.

Why Use It?

Problems It Solves

Raw HTML pages contain navigation, ads, and boilerplate that overwhelm the actual article content. Simple text extraction loses document structure, including headings, lists, and code blocks. Metadata such as author and publish date is scattered across different HTML patterns and structured data formats. Web pages use inconsistent markup, which makes generic CSS selectors unreliable for content extraction.

Core Highlights

Content scorer identifies the main article block by analyzing text density and element structure. Boilerplate stripper removes non-content elements using pattern matching and positional heuristics. Metadata extractor pulls structured data from Open Graph tags, JSON-LD, and HTML meta elements. Markdown renderer converts cleaned HTML to formatted markdown preserving structure.
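The metadata lookup described above can be sketched as a fallback chain across the three sources. This is an illustration under stated assumptions, not Defuddle's implementation: `parseAttributes` and `extractMetadata` are hypothetical helpers, the returned field names are invented for the example, and regex parsing is used only to keep the sketch dependency-free (a DOM parser is the safer choice in practice).

```javascript
// Hypothetical sketch: collect metadata from Open Graph tags, JSON-LD, and
// plain <meta> elements, falling back from one source to the next.

// Parse quoted attributes from a single tag string into a lowercase-keyed map.
function parseAttributes(tag) {
  const attrs = {};
  const re = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  let m;
  while ((m = re.exec(tag)) !== null) {
    attrs[m[1].toLowerCase()] = m[2] !== undefined ? m[2] : m[3];
  }
  return attrs;
}

function extractMetadata(html) {
  // Index every <meta> tag by its property or name attribute.
  const meta = {};
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const a = parseAttributes(tag);
    const key = a.property || a.name;
    if (key && a.content !== undefined) meta[key.toLowerCase()] = a.content;
  }

  // Merge top-level JSON-LD objects; malformed blocks are skipped.
  let jsonLd = {};
  const ldRe = /<script\b[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const [, body] of html.matchAll(ldRe)) {
    try {
      const data = JSON.parse(body);
      if (data && typeof data === 'object' && !Array.isArray(data)) {
        jsonLd = { ...jsonLd, ...data };
      }
    } catch {
      // Malformed JSON-LD is common; skip it rather than fail the page.
    }
  }

  const ldAuthor = jsonLd.author && (jsonLd.author.name || jsonLd.author);
  return {
    title: meta['og:title'] || jsonLd.headline || null,
    author: (typeof ldAuthor === 'string' ? ldAuthor : null) || meta['author'] || null,
    published: meta['article:published_time'] || jsonLd.datePublished || null,
    description: meta['og:description'] || meta['description'] || null,
  };
}

const sample = '<meta property="og:title" content="Hello"><meta name="author" content="Ada">';
console.log(extractMetadata(sample));
// title is "Hello", author is "Ada"; published and description are null
```

Returning `null` for missing fields, rather than guessing, keeps downstream code honest about which metadata a page really provided.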

How to Use It?

Basic Usage

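// Defuddle content extraction
// Note: the constructor input and result field names below follow this page's
// example; verify them against the Defuddle version you have installed.
const { Defuddle } = require('defuddle');

// Parse raw HTML and return the fields most pipelines need.
async function extractArticle(html) {
  const result = new Defuddle(html).parse();
  return {
    title: result.title,
    author: result.author,
    content: result.content,
    markdown: result.markdown,
    wordCount: result.wordCount,
    publishedDate: result.publishedDate,
  };
}

// Usage: top-level await is not available in CommonJS, so wrap it in a function.
async function main(url) {
  const resp = await fetch(url);
  if (!resp.ok) {
    throw new Error(`HTTP ${resp.status} for ${url}`);
  }
  const html = await resp.text();
  const article = await extractArticle(html);
  console.log(article.markdown);
}

// The URL is taken from the first command-line argument.
main(process.argv[2]).catch(console.error);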

Real-World Examples

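// Batch content pipeline
// Note: the Defuddle constructor input and result fields follow this page's
// example; verify them against the Defuddle version you have installed.
const { Defuddle } = require('defuddle');

class ContentPipeline {
  constructor() {
    this.results = [];
  }

  // Fetch and parse each URL in turn, recording success or failure per URL.
  async process(urls) {
    for (const url of urls) {
      try {
        const resp = await fetch(url);
        // Treat HTTP error responses as failures instead of parsing error pages.
        if (!resp.ok) {
          throw new Error(`HTTP ${resp.status}`);
        }
        const html = await resp.text();
        const parsed = new Defuddle(html).parse();
        this.results.push({
          url,
          title: parsed.title,
          markdown: parsed.markdown,
          words: parsed.wordCount,
          status: 'success',
        });
      } catch (err) {
        this.results.push({
          url,
          status: 'error',
          error: err.message,
        });
      }
    }
    return this.results;
  }

  // Report totals and the mean word count across successful extractions.
  summary() {
    const ok = this.results.filter((r) => r.status === 'success');
    return {
      total: this.results.length,
      success: ok.length,
      // Missing word counts are treated as 0; the || 1 avoids dividing by zero.
      avgWords: ok.reduce((s, r) => s + (r.words || 0), 0) / (ok.length || 1),
    };
  }
}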

Advanced Tips

Pre-filter HTML by removing script and style tags before passing it to Defuddle to reduce processing time on pages with heavy JavaScript bundles; keep JSON-LD script blocks if you rely on them for metadata. Use the word count output to discard extracted content that falls below a minimum threshold, which often indicates extraction failure. Cache extraction results keyed by URL and page hash to avoid reprocessing unchanged pages.
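These three tips can be sketched as small helpers. Everything here is illustrative: the helper names, the 50-word default threshold, and the `extract` callback (a caller-supplied function that would wrap the Defuddle call) are assumptions, and the regex pre-filter is a rough approximation that a DOM-based cleanup would handle more reliably.

```javascript
const crypto = require('crypto');

// Crude pre-filter: drop <script> and <style> elements, but keep JSON-LD
// script blocks because they may carry metadata.
function stripScriptsAndStyles(html) {
  return html.replace(
    /<(script|style)\b([^>]*)>[\s\S]*?<\/\1\s*>/gi,
    (match, tag, attrs) => (/application\/ld\+json/i.test(attrs) ? match : '')
  );
}

// Treat very short results as probable extraction failures.
// The default of 50 words is an arbitrary illustrative threshold.
function looksLikeFailedExtraction(wordCount, minWords = 50) {
  return !Number.isFinite(wordCount) || wordCount < minWords;
}

// Cache key built from the URL plus a SHA-256 hash of the page source,
// so a changed page produces a new key.
function cacheKey(url, html) {
  const hash = crypto.createHash('sha256').update(html, 'utf8').digest('hex');
  return `${url}#${hash}`;
}

const cache = new Map();

// `extract` is a hypothetical caller-supplied function (html) => result.
function extractCached(url, html, extract) {
  const key = cacheKey(url, html);
  if (!cache.has(key)) {
    cache.set(key, extract(stripScriptsAndStyles(html)));
  }
  return cache.get(key);
}
```

An in-memory `Map` only lasts for the life of the process; a persistent store would be needed to skip unchanged pages across runs.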

When to Use It?

Use Cases

Extract article content from news sites for a content aggregation feed in clean markdown format. Prepare web page content for language model context windows by removing boilerplate. Build a research tool that pulls structured metadata and content from academic or blog pages.

Important Notes

Requirements

Defuddle npm package for content extraction. Node.js runtime for server-side processing. HTTP client for fetching page HTML before extraction.

Usage Recommendations

Do: validate extracted content length to detect pages where extraction failed or returned only boilerplate. Handle JavaScript-rendered pages by using a headless browser to produce the final HTML before passing to Defuddle. Rate-limit requests when processing multiple URLs to respect target site policies.

Don't: assume all web pages have a single main content block, since some layouts use multi-column designs that split article content. Skip error handling for fetch failures, which are common in batch processing pipelines. Trust metadata fields unconditionally, since some sites provide inaccurate or missing structured data.
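One way to apply the rate-limiting and error-handling recommendations is a per-host throttle in front of each fetch. This is a minimal sketch, not part of Defuddle: `HostThrottle`, `politeFetch`, and the one-second default interval are hypothetical, and the global `fetch` assumes Node.js 18 or later.

```javascript
// Hypothetical per-host throttle: tracks when each host may next be contacted
// and reports how long to wait before the next request to that host.
class HostThrottle {
  constructor(minIntervalMs = 1000) {
    this.minIntervalMs = minIntervalMs;
    this.nextAllowed = new Map(); // hostname -> earliest allowed timestamp (ms)
  }

  // Returns the delay in ms before `url` may be fetched and reserves that slot.
  reserve(url, now = Date.now()) {
    const host = new URL(url).hostname;
    const earliest = this.nextAllowed.get(host) ?? now;
    const startAt = Math.max(earliest, now);
    this.nextAllowed.set(host, startAt + this.minIntervalMs);
    return startAt - now;
  }
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Fetch with throttling and an explicit HTTP status check, so error pages
// surface as exceptions instead of being passed on for extraction.
async function politeFetch(url, throttle) {
  await sleep(throttle.reserve(url));
  const resp = await fetch(url);
  if (!resp.ok) {
    throw new Error(`HTTP ${resp.status} for ${url}`);
  }
  return resp.text();
}
```

Keeping the delay calculation in `reserve` separate from the sleeping and fetching makes the spacing logic easy to check with fixed timestamps, and requests to different hosts do not slow each other down.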

Limitations

Content extraction heuristics work best on article-style pages and may produce poor results on application interfaces or heavily dynamic layouts. JavaScript-rendered content is not available in the raw HTML and requires a headless browser before extraction. Pages behind authentication or paywalls cannot be extracted without valid session credentials.
