Book SFT Pipeline

Book SFT Pipeline: automation and integration for streamlined supervised fine-tuning workflows

Category: productivity · Source: muratcankoylan/Agent-Skills-for-Context-Engineering

Book SFT Pipeline is an AI skill for building supervised fine-tuning data pipelines from book and document corpora. It covers text extraction, chunk segmentation, instruction-response pair generation, quality filtering, deduplication, and dataset formatting, enabling teams to create high-quality training data from long-form text sources.

What Is This?

Overview

Book SFT Pipeline provides structured approaches to transforming books into training datasets. It handles: extracting clean text from PDF, EPUB, and other document formats; segmenting long documents into coherent chunks at chapter or section boundaries; generating instruction-response pairs from text passages using template- or model-based methods; filtering pairs for quality through length and relevance checks; deduplicating examples to prevent redundancy; and formatting output as JSONL for fine-tuning frameworks.
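The final formatting stage emits one JSON object per line. A minimal sketch of the record shape (the instruction/output field names match the export code shown later in this document):

```python
import json

# one training example per line of the exported JSONL file
record = {
    "instruction": "Summarize the following passage:\n\n...",
    "output": "The passage describes ...",
}
line = json.dumps(record)
assert json.loads(line) == record  # each line round-trips cleanly
```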

Who Should Use This

This skill serves ML engineers building fine-tuning datasets from proprietary documents, research teams creating domain-specific corpora from textbooks, organizations converting internal knowledge bases into training data, and developers building data pipelines for custom model adaptation.

Why Use It?

Problems It Solves

Books contain valuable knowledge but are not in instruction-response format needed for SFT. Manual pair creation from documents is prohibitively expensive at scale. Raw PDF extraction produces formatting artifacts that degrade quality. Without deduplication, similar passages generate redundant examples that waste compute.

Core Highlights

Text extraction handles multiple formats with cleanup of formatting artifacts. Smart chunking preserves coherence by splitting at section boundaries. Pair generation transforms passages into instruction-response examples. Quality filtering removes low-value pairs through scoring.

How to Use It?

Basic Usage

from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class TextChunk:
    text: str
    source: str
    chapter: str = ""
    page: int = 0

@dataclass
class SFTPair:
    instruction: str
    response: str
    source: str
    quality_score: float = 0.0

class TextExtractor:
    def extract(self, file_path):
        path = Path(file_path)
        # plain-text formats can be read directly; PDF/EPUB would
        # need a dedicated extraction library
        if path.suffix in (".txt", ".md"):
            return path.read_text(encoding="utf-8")
        raise ValueError(f"Unsupported format: {path.suffix}")

class Chunker:
    def __init__(self, max_tokens=512, overlap=50):
        # max_tokens is approximated by whitespace word count;
        # overlap is nominal here -- the loop below carries the last
        # paragraph forward rather than counting overlap tokens
        self.max_tokens = max_tokens
        self.overlap = overlap

    def chunk(self, text, source=""):
        paragraphs = text.split("\n\n")
        chunks = []
        current = []     # paragraphs in the chunk being built
        current_len = 0  # running word count for the current chunk
        for para in paragraphs:
            words = len(para.split())
            if current_len + words > self.max_tokens and current:
                chunks.append(TextChunk(
                    text="\n\n".join(current), source=source
                ))
                # carry the last paragraph into the next chunk for context
                current = current[-1:]
                current_len = len(current[0].split()) if current else 0
            current.append(para)
            current_len += words
        if current:
            chunks.append(TextChunk(
                text="\n\n".join(current), source=source
            ))
        return chunks

Real-World Examples

import json
import hashlib

class PairGenerator:
    # (instruction template, pair type); the type tag is available
    # for downstream filtering or stratified sampling
    TEMPLATES = [
        ("Summarize the following passage:", "summary"),
        ("What are the key points in this text?", "key_points"),
        ("Explain the main concept described below:", "explain")
    ]

    def generate(self, chunk):
        pairs = []
        for instruction_tmpl, _ptype in self.TEMPLATES:
            # embed a short excerpt so the instruction is self-contained;
            # template-based generation reuses the passage as the response
            instruction = f"{instruction_tmpl}\n\n{chunk.text[:200]}"
            response = chunk.text
            pairs.append(SFTPair(
                instruction=instruction,
                response=response,
                source=chunk.source
            ))
        return pairs

class QualityFilter:
    def __init__(self, min_response_words=20, max_response_words=500):
        self.min_words = min_response_words
        self.max_words = max_response_words

    def filter(self, pairs):
        filtered = []
        for pair in pairs:
            words = len(pair.response.split())
            if self.min_words <= words <= self.max_words:
                # simple length-based heuristic: score rises with
                # word count and caps at 1.0 from 100 words upward
                pair.quality_score = min(words / 100, 1.0)
                filtered.append(pair)
        return filtered

class Deduplicator:
    def deduplicate(self, pairs):
        seen = set()
        unique = []
        for pair in pairs:
            # md5 serves as a cheap content fingerprint, not for security;
            # only byte-identical responses are treated as duplicates
            h = hashlib.md5(pair.response.encode()).hexdigest()
            if h not in seen:
                seen.add(h)
                unique.append(pair)
        return unique

class SFTPipeline:
    def __init__(self):
        self.chunker = Chunker()
        self.generator = PairGenerator()
        self.quality = QualityFilter()
        self.dedup = Deduplicator()

    def process(self, text, source):
        chunks = self.chunker.chunk(text, source)
        pairs = []
        for chunk in chunks:
            pairs.extend(self.generator.generate(chunk))
        pairs = self.quality.filter(pairs)
        pairs = self.dedup.deduplicate(pairs)
        return pairs

    def export_jsonl(self, pairs, output_path):
        # one JSON object per line, with instruction/output keys
        with open(output_path, "w", encoding="utf-8") as f:
            for p in pairs:
                f.write(json.dumps({
                    "instruction": p.instruction,
                    "output": p.response
                }) + "\n")
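The Deduplicator above drops only byte-identical responses. A common refinement, sketched here as an extension rather than part of the skill itself, normalizes case and whitespace before hashing so trivially reformatted duplicates also collapse:

```python
import hashlib
import re

def normalized_fingerprint(text):
    # collapse runs of whitespace and lowercase the text so that
    # cosmetic differences (line wrapping, capitalization) hash equal
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

a = normalized_fingerprint("The  quick brown\nfox.")
b = normalized_fingerprint("the quick brown fox.")
assert a == b  # same fingerprint despite formatting differences
```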

Advanced Tips

Use chapter and section headers as natural chunk boundaries rather than fixed token counts. Generate diverse instruction types from the same passage. Score pairs with a separate model to filter low-quality generations.
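The first tip can be sketched for Markdown sources; the regex and function name here are illustrative, not part of the skill:

```python
import re

def split_at_headers(text):
    # split at ATX headers (#, ##, ...), keeping each header
    # attached to the section body that follows it
    parts = re.split(r"(?m)^(?=#{1,6} )", text)
    return [p.strip() for p in parts if p.strip()]

doc = "# Chapter 1\n\nIntro text.\n\n## Section 1.1\n\nDetails."
sections = split_at_headers(doc)
assert sections[0].startswith("# Chapter 1")
assert sections[1].startswith("## Section 1.1")
```

Sections that still exceed the target context budget can then fall back to the word-count Chunker shown earlier.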

When to Use It?

Use Cases

Use Book SFT Pipeline when creating fine-tuning datasets from textbooks or technical manuals, when converting knowledge bases into training data, when building domain-specific instruction datasets from long-form documents, or when scaling beyond manual annotation.

Related Topics

SFT data formats, text extraction from PDFs, instruction generation techniques, deduplication algorithms, and data quality metrics complement SFT pipeline development.

Important Notes

Requirements

Source documents in extractable formats. Chunking configuration tuned for the target model context length. Filtering criteria appropriate to the training objective.

Usage Recommendations

Do validate generated pairs with human review on a sample before using the full set, preserve source metadata so each example traces back to its original document, and use multiple instruction templates per passage for diversity.

Don't skip deduplication (redundant examples waste training compute), don't use raw extracted text without cleaning formatting artifacts from PDF conversion, and don't generate pairs from tables of contents or reference sections that lack substantive content.
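The point about raw extraction can be sketched with a few cleanup rules; the patterns below are illustrative assumptions, and real corpora usually need format-specific tuning:

```python
import re

def clean_extracted_text(text):
    # rejoin words hyphenated across a line break: "pipe-\nline" -> "pipeline"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # drop lines that contain only a page number
    text = re.sub(r"(?m)^[ \t]*\d+[ \t]*$\n?", "", text)
    # unwrap hard line breaks inside paragraphs (keep blank-line breaks)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # collapse leftover blank-line runs into single paragraph breaks
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

raw = "A fine-tuning pipe-\nline needs clean\ntext.\n\n42\n\nNext paragraph."
cleaned = clean_extracted_text(raw)
assert "pipeline" in cleaned  # hyphenated break rejoined
assert "42" not in cleaned    # page-number line dropped
```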

Limitations

Template-based generation produces less diverse instructions than model-based approaches. Quality scoring heuristics may not correlate with training value. Copyright considerations may restrict using published books as training sources.