Baoyu Url To Markdown
Convert web pages to clean, portable Markdown for content archival and repurposing
Baoyu Url To Markdown is a community skill for converting web pages to clean Markdown text. It covers HTML parsing, content extraction, link preservation, image handling, and structured Markdown output for archiving and repurposing web content.
What Is This?
Overview
Baoyu Url To Markdown provides patterns for fetching web pages and converting their content to clean Markdown. It covers HTML parsing, which processes raw HTML into a structured document tree; content extraction, which isolates the main article body from navigation, ads, and sidebars; link preservation, which converts hyperlinks to Markdown link syntax with absolute URLs; image handling, which downloads or references images with proper Markdown image tags; and structured output, which formats headings, lists, tables, and code blocks as valid Markdown. The skill enables automated conversion of web content into portable Markdown files.
Who Should Use This
This skill serves researchers archiving web articles in a portable text format, content curators converting web content for newsletters or documentation, and developers building tools that ingest web pages as Markdown for further processing.
Why Use It?
Problems It Solves
Web pages contain navigation, ads, and chrome that obscure the actual content. Copying web content loses formatting and link structure. Converting HTML tables and code blocks to Markdown requires careful parsing. Saving web articles for offline reference needs a consistent output format.
Core Highlights
Content extractor isolates main article content from page chrome. HTML-to-Markdown converter handles headings, lists, tables, and code blocks. Link resolver converts relative URLs to absolute for portable Markdown. Image handler downloads or references images with Markdown syntax.
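The link resolver mentioned above boils down to the standard library's `urllib.parse.urljoin`, which turns relative hrefs into absolute URLs. A quick illustration (the example URLs are hypothetical):

```python
from urllib.parse import urljoin

# A relative href only makes sense with the page it came from as the base.
base = 'https://example.com/blog/post.html'

print(urljoin(base, '../about'))             # https://example.com/about
print(urljoin(base, 'images/fig.png'))       # https://example.com/blog/images/fig.png
print(urljoin(base, 'https://other.org/x'))  # already-absolute URLs pass through
```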
How to Use It?
Basic Usage
```python
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


class UrlToMarkdown:
    def __init__(self, selectors: list[str] | None = None):
        # CSS selectors tried in order to locate the main article content.
        self.selectors = selectors or [
            'article',
            'main',
            '.post-content',
            '.entry-content',
        ]

    def fetch(self, url: str) -> str:
        resp = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
        resp.raise_for_status()
        return self._convert(resp.text, url)

    def _extract_main(self, soup: BeautifulSoup):
        for sel in self.selectors:
            el = soup.select_one(sel)
            if el:
                return el
        return soup.body or soup

    def _convert(self, html: str, base_url: str) -> str:
        soup = BeautifulSoup(html, 'html.parser')
        main = self._extract_main(soup)
        # Drop scripts, styles, and remaining page chrome before conversion.
        for tag in main.find_all(['script', 'style', 'nav', 'footer']):
            tag.decompose()
        lines = []
        for el in main.descendants:
            if not el.name:
                continue
            if el.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                level = int(el.name[1])
                lines.append(f'{"#" * level} {el.get_text(strip=True)}')
            elif el.name == 'p':
                text = el.get_text(' ', strip=True)
                for a in el.find_all('a', href=True):
                    # Resolve relative hrefs against the page URL so the
                    # Markdown links work outside the original site.
                    label = a.get_text(strip=True)
                    if label:
                        text = text.replace(
                            label,
                            f'[{label}]({urljoin(base_url, a["href"])})', 1)
                if text:
                    lines.append(text)
        return '\n\n'.join(lines)
```

Real-World Examples
```python
import re
import time
from pathlib import Path


class MarkdownArchiver:
    def __init__(self, output_dir: str, delay: float = 1.0):
        self.converter = UrlToMarkdown()
        self.output = Path(output_dir)
        self.output.mkdir(exist_ok=True)
        self.delay = delay  # seconds to wait between fetches

    def archive(self, urls: list[str]) -> list[dict]:
        results = []
        for url in urls:
            try:
                md = self.converter.fetch(url)
                # Derive a filesystem-safe slug from the last URL segment.
                slug = url.split('/')[-1] or 'index'
                slug = re.sub(r'[^\w-]', '', slug)[:60]
                path = self.output / f'{slug}.md'
                path.write_text(md)
                results.append({
                    'url': url,
                    'file': str(path),
                    'words': len(md.split()),
                })
            except Exception as e:
                results.append({'url': url, 'error': str(e)})
            time.sleep(self.delay)  # be polite to the target server
        return results
```

Advanced Tips
Configure custom CSS selectors per domain for sites where the default content extraction misses the main article. Add a request delay between fetches in batch operations to avoid triggering rate limits. Strip tracking parameters from URLs before archiving for cleaner link references.
When to Use It?
Use Cases
Archive a reading list of web articles as local Markdown files for offline reference. Convert documentation pages to Markdown for inclusion in a project repository. Build a content pipeline that ingests web articles as Markdown for newsletter curation.
Related Topics
Web scraping, HTML parsing, Markdown conversion, content extraction, and document archival.
Important Notes
Requirements
Python requests and BeautifulSoup libraries for HTTP fetching and HTML parsing. Network access to target URLs. Storage for downloaded Markdown files and referenced images.
Usage Recommendations
Do: respect robots.txt and rate limits when fetching pages in batch. Verify extracted content matches the visible article before bulk archival. Use custom selectors for known sites to improve extraction accuracy.
Don't: scrape websites that prohibit automated access in their terms of service. Rely on generic extraction for complex JavaScript-rendered pages without a headless browser. Assume all HTML tables convert cleanly to Markdown without testing.
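The robots.txt advice above can be enforced with `urllib.robotparser` from the standard library. This sketch evaluates rules already in hand (the helper name is ours); in practice you would call `set_url()` and `read()` on the parser to fetch the live robots.txt:

```python
from urllib import robotparser


def robots_allows(robots_lines: list[str],
                  url: str,
                  user_agent: str = '*') -> bool:
    """Return True if the given robots.txt rules permit fetching url."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(user_agent, url)


rules = ['User-agent: *', 'Disallow: /private/']
print(robots_allows(rules, 'https://example.com/articles/post'))  # True
print(robots_allows(rules, 'https://example.com/private/draft'))  # False
```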
Limitations
JavaScript-rendered content is not captured without a headless browser. Complex HTML layouts with nested divs may produce messy Markdown output. Paywalled or authenticated content cannot be accessed without valid session credentials.