Fuzzing Dictionary

Fuzzing Dictionary automation and integration for security testing workflows

Fuzzing Dictionary is a community skill for creating and managing fuzzing input dictionaries. It covers token extraction from target programs, format-aware dictionary construction, mutation seed selection, dictionary optimization, and corpus management for effective coverage-guided fuzzing.

What Is This?

Overview

Fuzzing Dictionary provides patterns for building input dictionaries that improve fuzzing coverage. Token extraction scans target program source code and binaries for magic values, keywords, and format markers. Format-aware construction builds dictionaries matching expected input formats such as JSON, XML, or protocol buffers. Mutation seed selection picks the initial inputs most likely to trigger new code paths. Dictionary optimization removes redundant entries and prioritizes tokens by coverage impact. Corpus management maintains a minimized set of inputs that maximizes coverage. Together, these let security researchers improve fuzzing campaign effectiveness.

Who Should Use This

This skill serves security researchers running fuzzing campaigns against parsers and protocols, software engineers adding fuzz testing to CI pipelines, and QA teams using fuzzing for robustness testing.

Why Use It?

Problems It Solves

Fuzzing without dictionaries wastes time generating inputs the parser immediately rejects. Random mutation alone struggles to produce valid structure needed to reach deep code paths. Large dictionaries with redundant tokens slow down the fuzzing loop without improving coverage. Initial seed corpora without strategic selection lead to slow coverage growth.

Core Highlights

Token extractor scans binaries for string constants and magic values relevant to the input format. Format builder constructs dictionaries matching target format grammar tokens. Seed selector ranks initial inputs by coverage potential and minimality. Optimizer prunes dictionaries to remove redundant tokens.
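
The optimizer can be sketched as a simple pruning pass. The heuristics below, dropping case-insensitive duplicates and over-long tokens, are illustrative assumptions rather than a fixed algorithm; a real optimizer would also measure each token's coverage impact on an instrumented build:

```python
def prune_dictionary(tokens, max_len=32):
    """Drop over-long tokens and case-insensitive duplicates.

    Length and duplicate heuristics stand in for per-token
    coverage measurement, which needs an instrumented target.
    """
    seen = set()
    kept = []
    for tok in sorted(tokens, key=len):  # prefer shorter (atomic) tokens
        key = tok.lower()
        if len(tok) <= max_len and key not in seen:
            seen.add(key)
            kept.append(tok)
    return kept
```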

How to Use It?

Basic Usage

import re
from pathlib import Path

class DictBuilder:
    def __init__(self):
        self.tokens = set()

    def extract_strings(self, source_path: str):
        # Collect short string literals from target source code.
        content = Path(source_path).read_text()
        strings = re.findall(r'"([^"]{1,64})"', content)
        self.tokens.update(strings)

    def add_format_tokens(self, fmt: str):
        # Grammar tokens for common input formats.
        formats = {
            'json': ['{', '}', '[', ']', ':', ',', 'null', 'true', 'false'],
            'xml': ['<?xml', '<!', '</', '/>', '&', 'CDATA'],
            'http': ['GET', 'POST', 'Content-Type', 'Host', 'HTTP/1.1'],
        }
        self.tokens.update(formats.get(fmt, []))

    def export(self, output: str):
        # One quoted token per line (AFL-style dictionary format).
        with open(output, 'w') as f:
            for token in sorted(self.tokens):
                f.write(f'"{token}"\n')
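
A standalone sketch of the same export step, assuming tokens like the ones a DictBuilder collects for JSON; the output uses the one-quoted-token-per-line format that AFL++ and libFuzzer both accept:

```python
import tempfile
from pathlib import Path

# Tokens as DictBuilder.add_format_tokens('json') would collect them.
tokens = {'{', '}', '[', ']', ':', ',', 'null', 'true', 'false'}

out = Path(tempfile.mkdtemp()) / 'json.dict'
with open(out, 'w') as f:
    for token in sorted(tokens):
        f.write(f'"{token}"\n')

lines = out.read_text().splitlines()
print(lines[0])  # the lexically smallest token, quoted
```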

Real-World Examples

class CorpusManager:
    def __init__(self, corpus_dir: str):
        self.dir = Path(corpus_dir)
        self.inputs = []

    def load(self):
        # Read every file in the corpus directory into memory.
        for f in self.dir.glob('*'):
            if f.is_file():
                self.inputs.append({
                    'path': f,
                    'size': f.stat().st_size,
                    'content': f.read_bytes(),
                })

    def minimize(self, max_size: int = 4096) -> list:
        # Keep the smallest unique inputs under the size cap.
        self.inputs.sort(key=lambda x: x['size'])
        unique = []
        seen = set()
        for inp in self.inputs:
            h = hash(inp['content'])
            if h not in seen and inp['size'] <= max_size:
                seen.add(h)
                unique.append(inp)
        return unique
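
The seed selector from Core Highlights is not shown above. One cheap, hedged heuristic is to rank seeds by byte diversity and small size as a proxy for coverage potential; a real selector would measure instrumented coverage instead:

```python
def rank_seeds(inputs):
    """Order corpus entries: byte-diverse, small files first.

    `inputs` is a list of dicts with 'size' and 'content' keys,
    as produced by CorpusManager.load(). Byte diversity is only
    a stand-in for measured coverage.
    """
    def score(inp):
        diversity = len(set(inp['content']))
        return (-diversity, inp['size'])  # more diversity, then smaller
    return sorted(inputs, key=score)
```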

Advanced Tips

Extract tokens from both source code string literals and the binary strings section since compiled code may contain tokens not visible in source. Use format-specific dictionary entries as initial seeds paired with structural mutations to reach deep parsing logic. Run periodic corpus minimization to keep the seed set small and focused on unique coverage.
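
Scanning a binary for tokens can be sketched as a printable-run scan in the style of the Unix strings utility; the length bounds here are arbitrary choices, and real extraction would still need noise filtering:

```python
import re

def extract_binary_strings(path, min_len=4, max_len=64):
    """Pull printable ASCII runs out of a binary, strings(1)-style."""
    data = open(path, 'rb').read()
    pattern = rb'[\x20-\x7e]{%d,%d}' % (min_len, max_len)
    return {m.decode('ascii') for m in re.findall(pattern, data)}
```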

When to Use It?

Use Cases

Build a fuzzing dictionary for a JSON parser by extracting format tokens and string constants. Optimize a fuzzing corpus by removing redundant inputs while preserving coverage. Create format-aware seed inputs for fuzzing an HTTP server request parser.

Related Topics

Fuzzing, dictionary generation, corpus management, coverage-guided testing, security testing, and input mutation.

Important Notes

Requirements

Target program source code or binary for token extraction. Fuzzing engine such as AFL or libFuzzer that supports dictionary inputs. Coverage measurement tool for evaluating dictionary effectiveness.

Usage Recommendations

Do: combine extracted tokens with format-specific grammar tokens for comprehensive dictionaries. Measure coverage improvement when adding dictionary entries to verify they contribute to testing depth. Keep individual dictionary tokens short since the fuzzer combines them during mutation.

Don't: create dictionaries with thousands of entries without measuring coverage impact, since large dictionaries slow the fuzzing loop. Don't include long multi-word strings as single dictionary entries, since the fuzzer works better with atomic tokens. Don't skip corpus minimization, or the seed set grows with redundant inputs.

Limitations

Dictionary-based fuzzing helps reach code behind format checks but cannot replace structure-aware grammar-based fuzzing for deeply nested formats. Token extraction from binaries may include irrelevant strings that add noise to the dictionary. Coverage-guided corpus minimization requires instrumented builds which may not be available for all targets.