SentencePiece

SentencePiece tokenization automation and integration for NLP pipelines

SentencePiece is a community skill for text tokenization using the SentencePiece library, covering subword segmentation, BPE encoding, unigram language models, vocabulary training, and integration with NLP model pipelines.

What Is This?

Overview

SentencePiece provides tools for training and applying subword tokenization models that split text into smaller units for language model processing. It covers subword segmentation, which breaks words into pieces that balance vocabulary size against coverage; BPE encoding, which builds a vocabulary by iteratively merging the most frequent character pairs; unigram language models, which select subword units by probability maximization; vocabulary training, which creates custom tokenizers from domain-specific text corpora; and pipeline integration, which connects trained tokenizers to downstream NLP models. The skill helps developers handle text preprocessing for language models, including transformer architectures such as BERT and T5.

Who Should Use This

This skill serves NLP engineers building custom tokenizers, researchers training language models on specialized corpora, and developers preprocessing text for transformer-based models. It is also useful for teams working with low-resource or morphologically complex languages where standard tokenizers underperform.

Why Use It?

Problems It Solves

Word-level tokenization creates large vocabularies with many rare words that receive poor representations. Character-level tokenization produces very long sequences that are computationally expensive. Pre-built tokenizers may not handle domain-specific terminology or specialized languages effectively. Whitespace-based splitting fails for languages without word boundaries, such as Japanese, Chinese, and Thai.
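As a quick sketch of the whitespace problem, the snippet below trains a tiny model on a few Japanese sentences and then segments text that contains no word boundaries. The corpus contents, file names, and small vocabulary size are illustrative assumptions only; a real model needs a far larger corpus.

import sentencepiece as spm

# Write a toy Japanese corpus, one sentence per line (illustrative).
with open('ja_corpus.txt', 'w', encoding='utf-8') as f:
    f.write('自然言語処理は面白い\n')
    f.write('トークン化は前処理の重要な工程です\n')
    f.write('言語モデルはサブワードを使う\n')

# hard_vocab_limit=False lets training succeed even when the toy
# corpus cannot support the requested vocabulary size.
spm.SentencePieceTrainer.train(
    input='ja_corpus.txt',
    model_prefix='ja_tok',
    vocab_size=60,
    character_coverage=0.9995,
    hard_vocab_limit=False,
)

sp = spm.SentencePieceProcessor(model_file='ja_tok.model')
# Segmentation comes from the learned model, not from whitespace.
print(sp.encode_as_pieces('自然言語処理は面白い'))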

Core Highlights

BPE trainer builds vocabularies from frequent character pair merges. Unigram trainer selects subwords by likelihood maximization. Language-agnostic tokenizer works without whitespace assumptions. Custom vocabulary trainer adapts to domain-specific text corpora.
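To see the two trainers side by side, here is a minimal sketch (reusing the corpus.txt file from Basic Usage below as an assumption) that trains both model types on the same corpus and compares their segmentations:

import sentencepiece as spm

for model_type in ('bpe', 'unigram'):
    spm.SentencePieceTrainer.train(
        input='corpus.txt',
        model_prefix=f'tok_{model_type}',
        vocab_size=8000,
        model_type=model_type,
    )
    sp = spm.SentencePieceProcessor(
        model_file=f'tok_{model_type}.model')
    # The two algorithms typically produce different segmentations.
    print(model_type, sp.encode_as_pieces('subword tokenization'))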

How to Use It?

Basic Usage

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tok',
    vocab_size=8000,
    model_type='bpe',
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file='tok.model')

text = 'Natural language processing with subword tokenization'

pieces = sp.encode_as_pieces(text)
print(f'Pieces: {pieces}')

ids = sp.encode_as_ids(text)
print(f'IDs: {ids}')

decoded = sp.decode(ids)
print(f'Decoded: {decoded}')

print(f'Vocab size: {sp.get_piece_size()}')

Real-World Examples

import sentencepiece as spm


class TokenizerPipeline:
    """Wraps a SentencePiece model for training and batch encoding."""

    def __init__(self, model_path: str):
        self.sp = spm.SentencePieceProcessor(model_file=model_path)

    @classmethod
    def train(cls, corpus: str, prefix: str, vocab_size: int,
              model_type: str = 'unigram'):
        # Train a model from a plain-text corpus, then load the result.
        spm.SentencePieceTrainer.train(
            input=corpus,
            model_prefix=prefix,
            vocab_size=vocab_size,
            model_type=model_type,
        )
        return cls(f'{prefix}.model')

    def tokenize(self, text: str) -> list[int]:
        return self.sp.encode_as_ids(text)

    def detokenize(self, ids: list[int]) -> str:
        return self.sp.decode(ids)

    def batch_encode(self, texts: list[str],
                     max_len: int = 512) -> list[list[int]]:
        # Encode each text and truncate to at most max_len token IDs.
        return [self.tokenize(t)[:max_len] for t in texts]


tok = TokenizerPipeline.train('data.txt', 'custom_tok', 16000)
encoded = tok.batch_encode(['Hello world', 'NLP tokenization'])
for e in encoded:
    print(f'{len(e)} tokens: {e[:10]}')

Advanced Tips

Adjust the character_coverage parameter to balance vocabulary completeness and size. A value of around 0.9995 suits languages with rich character sets such as Japanese or Chinese, while 1.0 is appropriate for languages with small character sets, including most Latin-script languages. Use the unigram model type for better handling of morphologically rich languages. Add special tokens for task-specific markers like classification and separation tokens, as sketched below.
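A minimal sketch of these tips, assuming the corpus.txt file from Basic Usage; the token names <cls> and <sep> are illustrative choices:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tok_special',
    vocab_size=8000,
    model_type='unigram',       # often better for rich morphology
    character_coverage=0.9995,  # tune for the script's character set
    user_defined_symbols=['<cls>', '<sep>'],  # task-specific markers
)

sp = spm.SentencePieceProcessor(model_file='tok_special.model')
# User-defined symbols are kept as single, unsplittable pieces.
print(sp.encode_as_pieces('<cls> classify this text <sep>'))
print(sp.piece_to_id('<cls>'), sp.piece_to_id('<sep>'))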

When to Use It?

Use Cases

Train a custom tokenizer on domain-specific medical or legal text for better subword coverage of technical terminology. Preprocess multilingual text for a transformer model that handles languages without whitespace separation. Build a tokenization pipeline that converts raw text to model-ready integer sequences.
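For the last use case, here is a minimal sketch of a raw-text-to-integer pipeline with padding; the pad_id value and file names are illustrative assumptions:

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='tok_pad',
    vocab_size=8000,
    pad_id=3,  # padding is disabled (-1) by default; reserve an ID
)

sp = spm.SentencePieceProcessor(model_file='tok_pad.model')

def encode_batch(texts: list[str], max_len: int = 16) -> list[list[int]]:
    batch = []
    for text in texts:
        ids = sp.encode_as_ids(text)[:max_len]
        # Right-pad every sequence to a fixed length for batching.
        ids += [sp.pad_id()] * (max_len - len(ids))
        batch.append(ids)
    return batch

print(encode_batch(['Raw input text', 'becomes model-ready integers']))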

Related Topics

Tokenization, NLP, BPE, subword encoding, language models, text preprocessing, and vocabulary training.

Important Notes

Requirements

SentencePiece Python package installed via pip. Training corpus as a plain text file with one sentence per line. Sufficient disk space for model files and vocabulary output.

Usage Recommendations

Do: train tokenizers on text that is representative of the data your model will process. Experiment with vocabulary sizes to find the balance between coverage and sequence length. Save trained models alongside your NLP models for reproducible preprocessing.
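A sketch of the vocabulary-size experiment suggested above, assuming corpus.txt and a few held-out sample sentences; the size grid is an illustrative choice:

import sentencepiece as spm

samples = ['A few held-out sentences', 'representative of your data']

for vocab_size in (4000, 8000, 16000):
    spm.SentencePieceTrainer.train(
        input='corpus.txt',
        model_prefix=f'tok_{vocab_size}',
        vocab_size=vocab_size,
    )
    sp = spm.SentencePieceProcessor(
        model_file=f'tok_{vocab_size}.model')
    avg = sum(len(sp.encode_as_ids(s)) for s in samples) / len(samples)
    # Larger vocabularies generally produce shorter sequences.
    print(f'vocab={vocab_size}: avg {avg:.1f} tokens per sentence')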

Don't: use a tokenizer trained on one domain for a significantly different domain without retraining. Don't set the vocabulary size too small; this forces excessive splitting that loses word meaning. Don't ignore special token handling when integrating with transformer models that expect specific control tokens.

Limitations

Subword tokenization can split rare terms into unintuitive pieces that reduce interpretability. For example, a medical term like "hepatosplenomegaly" may fragment into many small pieces that obscure its meaning. Training requires a representative corpus, and vocabulary size choices affect downstream model quality. BPE and unigram algorithms produce different tokenizations, and the optimal choice depends on the target language.
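As an illustration of the fragmentation issue, assuming the tok.model file trained in Basic Usage:

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='tok.model')
pieces = sp.encode_as_pieces('hepatosplenomegaly')
# On a non-medical corpus this typically splits into many small
# fragments; the exact pieces depend on the training data.
print(f'{len(pieces)} pieces: {pieces}')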