Markitdown

Convert various file formats to Markdown with automated processing and integration

Source: K-Dense-AI/claude-scientific-skills

Markitdown is a community skill for converting documents and files to Markdown format, covering PDF extraction, DOCX conversion, HTML parsing, spreadsheet transformation, and presentation conversion for content migration workflows.

What Is This?

Overview

Markitdown provides tools for converting various document formats into clean Markdown text. It covers five conversions. PDF extraction pulls text content from PDF files with heading detection and paragraph preservation. DOCX conversion transforms Word documents into Markdown with heading hierarchy, list formatting, and inline styling. HTML parsing converts web page content to Markdown while stripping navigation and boilerplate elements. Spreadsheet transformation converts Excel and CSV data into Markdown tables with column alignment. Presentation conversion extracts slide content from PowerPoint files into structured Markdown with slide separators. Together, these let content teams migrate documents into Markdown-based systems efficiently.

Who Should Use This

This skill serves content managers migrating documentation to Markdown platforms, developers building document processing pipelines, and technical writers converting legacy content into version-controlled formats.

Why Use It?

Problems It Solves

Manual document conversion to Markdown is tedious and error-prone for large content libraries. PDF text extraction loses heading structure and formatting when using basic extraction tools. Word documents contain proprietary formatting that does not map directly to Markdown without interpretation logic. Spreadsheet data requires manual table formatting to produce readable Markdown tables.

Core Highlights

PDF converter extracts text with heading detection and paragraph grouping. DOCX transformer maps Word styles to Markdown heading levels and formatting. HTML cleaner strips boilerplate and converts content elements. Table formatter converts spreadsheet data into aligned Markdown tables.

How to Use It?

Basic Usage

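from markitdown import MarkItDown


class DocConverter:
    """Thin wrapper around MarkItDown for single-file and batch conversion."""

    def __init__(self):
        self.md = MarkItDown()

    def convert(self, file_path: str) -> str:
        # convert() returns a result object; the Markdown is in text_content.
        result = self.md.convert(file_path)
        return result.text_content

    def convert_batch(self, paths: list[str]) -> dict[str, str]:
        # A failure is stored as an 'Error: ...' string so that one bad file
        # does not stop the rest of the batch.
        results = {}
        for path in paths:
            try:
                results[path] = self.convert(path)
            except Exception as e:
                results[path] = f'Error: {e}'
        return results

    def save_markdown(self, source: str, output: str) -> None:
        content = self.convert(source)
        # Write UTF-8 explicitly; the platform default encoding may not
        # cover every character in the converted text.
        with open(output, 'w', encoding='utf-8') as f:
            f.write(content)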

Real-World Examples

from pathlib import Path


class MigrationPipeline:
    """Converts every supported file in a directory using a DocConverter."""

    SUPPORTED = {'.pdf', '.docx', '.xlsx', '.pptx', '.html', '.csv'}

    def __init__(self, converter: DocConverter, output_dir: str):
        self.converter = converter
        self.out = Path(output_dir)
        # parents=True also creates any missing intermediate directories.
        self.out.mkdir(parents=True, exist_ok=True)

    def scan_directory(self, input_dir: str) -> list[Path]:
        # Scans only the top level of input_dir (no recursion).
        src = Path(input_dir)
        files = []
        for ext in self.SUPPORTED:
            files.extend(src.glob(f'*{ext}'))
        return sorted(files)

    def migrate(self, input_dir: str) -> dict:
        files = self.scan_directory(input_dir)
        report = {'total': len(files), 'success': 0, 'failed': 0}
        for f in files:
            try:
                # Note: files that share a stem (report.pdf, report.docx)
                # map to the same output name and overwrite each other.
                md_name = f.stem + '.md'
                self.converter.save_markdown(str(f), str(self.out / md_name))
                report['success'] += 1
            except Exception:
                report['failed'] += 1
        return report

Advanced Tips

For scanned documents, run the PDF files through an OCR tool before Markdown conversion so that there is text to extract. Post-process converted Markdown with linting tools to normalize heading levels and fix formatting inconsistencies. Use file extension detection to route documents to format-specific conversion handlers.
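Two of these tips can be sketched with the standard library alone. The helper names normalize_headings and pick_handler, and the shape of the handlers mapping, are hypothetical and not part of MarkItDown; this is a minimal illustration under those assumptions, not a replacement for a real Markdown linter.

```python
import re
from pathlib import Path
from typing import Callable

# Matches ATX headings such as "## Title"; group 1 is the run of '#'.
_HEADING = re.compile(r'^(#{1,6})(\s+\S.*)$')


def normalize_headings(markdown: str) -> str:
    """Shift all headings so that the shallowest one becomes level 1."""
    lines = markdown.split('\n')
    in_fence = False
    headings = []  # (line index, level, rest of line)
    for i, line in enumerate(lines):
        if line.lstrip().startswith('```'):
            in_fence = not in_fence  # ignore '#' comments inside code fences
            continue
        if in_fence:
            continue
        match = _HEADING.match(line)
        if match:
            headings.append((i, len(match.group(1)), match.group(2)))
    if not headings:
        return markdown
    shift = min(level for _, level, _ in headings) - 1
    for i, level, rest in headings:
        lines[i] = '#' * (level - shift) + rest
    return '\n'.join(lines)


def pick_handler(
    path: str,
    handlers: dict[str, Callable[[str], str]],
    default: Callable[[str], str],
) -> Callable[[str], str]:
    """Choose a conversion handler by lowercased file extension."""
    return handlers.get(Path(path).suffix.lower(), default)
```

Lowercasing the suffix means Report.PDF and report.pdf reach the same handler, and the default handler catches anything the mapping does not list.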

When to Use It?

Use Cases

Migrate a library of Word documents to a Markdown-based documentation platform. Convert PDF reports to Markdown for ingestion into a knowledge base or search index. Transform spreadsheet data into Markdown tables for embedding in technical documentation.
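To make the spreadsheet use case concrete, the sketch below shows the kind of aligned Markdown table such a conversion aims for. It is a hand-rolled illustration using only the standard library; the function name csv_to_markdown_table is hypothetical, and the table that MarkItDown itself produces may be laid out differently.

```python
import csv
import io


def csv_to_markdown_table(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table with padded columns."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ''
    width = max(len(row) for row in rows)
    # Escape pipes (they would split a cell) and pad short rows.
    rows = [
        [cell.strip().replace('|', '\\|') for cell in row]
        + [''] * (width - len(row))
        for row in rows
    ]
    # Each column is as wide as its longest cell, with a minimum of 3
    # so the separator row always has at least '---'.
    col_widths = [max(3, *(len(row[i]) for row in rows)) for i in range(width)]

    def fmt(row):
        cells = (cell.ljust(w) for cell, w in zip(row, col_widths))
        return '| ' + ' | '.join(cells) + ' |'

    header, *body = rows
    rule = '| ' + ' | '.join('-' * w for w in col_widths) + ' |'
    return '\n'.join([fmt(header), rule, *(fmt(r) for r in body)])
```

The first CSV row is treated as the header, which is an assumption about the input rather than something the function detects.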

Important Notes

Requirements

The MarkItDown Python package (published on PyPI as markitdown) installed. Input files in a supported format. Write access to the output directory for converted files.

Usage Recommendations

Do: validate converted Markdown output against the original document to check for content loss. Handle conversion errors gracefully with logging for batch operations. Preserve original files alongside converted Markdown until migration is verified.

Don't: assume perfect conversion fidelity for complex documents with embedded objects or advanced formatting. Delete original source files before verifying conversion quality. Process untrusted files without sandboxing since document parsing libraries may have security vulnerabilities.
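The recommendation to log failures in batch operations could look roughly like the sketch below. The function name convert_with_logging and the logger name are hypothetical; the convert argument is any callable that takes a path and returns Markdown text, such as a bound method of a MarkItDown wrapper.

```python
import logging
from typing import Callable

logger = logging.getLogger('migration')


def convert_with_logging(
    paths: list[str],
    convert: Callable[[str], str],
) -> tuple[dict[str, str], list[str]]:
    """Convert each path, logging failures instead of aborting the batch."""
    converted = {}
    failed = []
    for path in paths:
        try:
            converted[path] = convert(path)
        except Exception:
            # logger.exception records the traceback along with the message.
            logger.exception('Conversion failed for %s', path)
            failed.append(path)
    return converted, failed
```

Returning the failed paths separately keeps error text out of the converted content and gives a ready list of files to re-check against the originals.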

Limitations

Complex document layouts with multi-column text or floating elements may not convert cleanly. Image extraction from documents requires additional handling beyond text conversion. Scanned PDF files without embedded text need OCR preprocessing before conversion.
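One rough way to spot scanned PDFs in a batch is to check whether conversion produced almost no text. The sketch below is a heuristic only; the name needs_ocr and the min_chars threshold of 20 are arbitrary assumptions to tune per corpus.

```python
def needs_ocr(text_content: str, min_chars: int = 20) -> bool:
    """Heuristic: very little extracted text suggests an image-only PDF."""
    # Count non-whitespace characters only, since a scanned PDF may still
    # yield blank lines or page breaks.
    visible = ''.join(text_content.split())
    return len(visible) < min_chars
```

Files flagged this way can be set aside for OCR preprocessing rather than being written out as near-empty Markdown.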
