Matchms

Process and compare mass spectrometry data with the matchms Python library

Matchms is a community skill for processing and comparing mass spectrometry data using the matchms library, covering spectrum loading, metadata cleaning, spectral similarity scoring, molecular networking, and library matching for metabolomics analysis.

What Is This?

Overview

Matchms provides tools for working with mass spectrometry spectral data in Python. Spectrum loading reads spectra from standard formats, including MGF, MSP, and mzML files, into structured Python objects. Metadata cleaning normalizes precursor masses, adduct annotations, and instrument type fields across spectra. Spectral similarity scoring computes pairwise similarity between spectra using cosine, modified cosine, and other similarity measures. Molecular networking builds similarity graphs that connect related spectra based on shared fragment patterns. Library matching identifies compounds by comparing experimental spectra against reference spectral libraries. Together, these steps let metabolomics researchers build spectral analysis pipelines.

Who Should Use This

This skill serves metabolomics researchers analyzing mass spectrometry datasets, analytical chemists identifying compounds through spectral matching, and bioinformatics engineers building spectral processing pipelines.

Why Use It?

Problems It Solves

Mass spectrometry data from different instruments uses inconsistent metadata formats that complicate cross-dataset comparisons. Spectral similarity computation requires implementing specialized scoring algorithms that handle fragment matching with tolerance windows. Library matching against reference databases needs optimized search over large spectral collections. Molecular networking from raw spectra requires preprocessing, similarity computation, and graph construction steps.
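The fragment matching with tolerance windows described above can be sketched in plain Python. This is a minimal illustration of greedy cosine scoring, not the matchms implementation. The function name greedy_cosine and the (mz, intensity) tuple representation are assumptions made for the example.

```python
import math


def greedy_cosine(peaks_a, peaks_b, tolerance=0.1):
    """Return (score, n_matches) for two lists of (mz, intensity) peaks.

    Hypothetical helper for illustration only.
    """
    # Collect every candidate pair whose m/z values agree within the tolerance.
    candidates = []
    for i, (mz_a, int_a) in enumerate(peaks_a):
        for j, (mz_b, int_b) in enumerate(peaks_b):
            if abs(mz_a - mz_b) <= tolerance:
                candidates.append((int_a * int_b, i, j))

    # Greedy assignment: take the largest intensity products first and
    # use each peak at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    total = 0.0
    for product, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            total += product

    # Normalize by the intensity vector lengths of both spectra.
    norm_a = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_a))
    norm_b = math.sqrt(sum(intensity ** 2 for _, intensity in peaks_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0, 0
    return total / (norm_a * norm_b), len(used_a)
```

Returning the number of matched peaks alongside the score makes it possible to reject high scores that rest on only one or two shared fragments.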

Core Highlights

The spectrum loader reads multiple spectral file formats into a unified data structure. The metadata cleaner normalizes fields across spectra from different sources. The similarity calculator computes pairwise scores using configurable similarity measures. The network builder constructs molecular similarity graphs from spectral collections.

How to Use It?

Basic Usage

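from matchms import calculate_scores
from matchms.importing import load_from_mgf
from matchms.similarity import CosineGreedy


class SpectraMatcher:
    """Load spectra and score queries against references with CosineGreedy."""

    def __init__(self, tolerance: float = 0.1):
        # tolerance is the maximum m/z difference for two peaks to match
        self.similarity = CosineGreedy(tolerance=tolerance)

    def load_spectra(self, mgf_path: str) -> list:
        # load_from_mgf yields spectra lazily, so materialize them into a list
        return list(load_from_mgf(mgf_path))

    def match(self, queries: list, references: list):
        # calculate_scores takes the references first, then the queries
        return calculate_scores(references, queries, self.similarity)

    def top_matches(self, scores, threshold: float = 0.7) -> list[dict]:
        """Return up to three reference hits per query at or above threshold."""
        matches = []
        for i, query in enumerate(scores.queries):
            # Sorted best-first; each item pairs a reference with its score record
            best = scores.scores_by_query(query, sort=True)
            for ref, vals in best[:3]:
                # The 'score' field name follows older matchms releases; newer
                # releases may prefix it with the similarity name, so check the
                # installed version.
                sc = float(vals['score'])
                if sc >= threshold:
                    matches.append({
                        'query': i,
                        'ref': ref.get('compound_name'),
                        'score': sc,
                    })
        return matches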

Real-World Examples

from matchms.filtering import (
    default_filters,
    normalize_intensities,
    select_by_mz,
)


class SpectraProcessor:
    """Clean spectra and restrict them to an m/z window."""

    def __init__(self, mz_from: float = 0, mz_to: float = 1000):
        self.mz_range = (mz_from, mz_to)

    def clean(self, spectrum):
        # Harmonize metadata, scale intensities to unit height,
        # then keep only peaks inside the m/z window
        s = default_filters(spectrum)
        s = normalize_intensities(s)
        s = select_by_mz(s, mz_from=self.mz_range[0], mz_to=self.mz_range[1])
        return s

    def process_batch(self, spectra: list) -> list:
        # Filters may return None for spectra they reject, so skip those
        cleaned = []
        for s in spectra:
            result = self.clean(s)
            if result is not None:
                cleaned.append(result)
        return cleaned
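
To make the effect of the two peak-level steps concrete, here is a plain-Python sketch of intensity normalization and m/z window selection on lists of (mz, intensity) tuples. The helper names normalize_peaks and select_window are hypothetical stand-ins written for this example; they only approximate what the matchms filters do.

```python
def normalize_peaks(peaks):
    """Scale intensities so the tallest peak has intensity 1.0."""
    if not peaks:
        return []
    top = max(intensity for _, intensity in peaks)
    if top <= 0:
        return list(peaks)
    return [(mz, intensity / top) for mz, intensity in peaks]


def select_window(peaks, mz_from=0.0, mz_to=1000.0):
    """Keep peaks whose m/z lies inside the window (bounds assumed inclusive)."""
    return [(mz, intensity) for mz, intensity in peaks if mz_from <= mz <= mz_to]


# Example peak list whose tallest peak lies outside the 0-1000 window.
peaks = [(50.0, 20.0), (150.0, 100.0), (1200.0, 400.0)]

# Same order as clean(): normalize first, then select.
normalize_then_select = select_window(normalize_peaks(peaks))

# Reverse order: select first, then normalize.
select_then_normalize = normalize_peaks(select_window(peaks))
```

The order matters when the tallest peak lies outside the window. Normalizing first leaves the remaining peaks with a maximum below 1.0, while selecting first rescales them to the tallest peak that survives. Which behavior is preferable depends on the analysis.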

Advanced Tips

Apply metadata cleaning filters before similarity scoring to normalize precursor masses and adduct types across datasets. Use modified cosine similarity for comparing spectra with different precursor masses that share fragment patterns. Build molecular networks by thresholding pairwise similarity scores and visualizing the resulting graph to identify compound families.
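
The thresholding step for molecular networking can be sketched without any graph library. The example below assumes pairwise scores have already been computed and are stored in a dictionary keyed by spectrum index pairs. The function names and the 0.7 threshold are illustrative assumptions, not part of matchms.

```python
def build_network(n_spectra, pair_scores, threshold=0.7):
    """Build an undirected adjacency map from pairwise similarity scores.

    pair_scores maps (i, j) index pairs to a similarity score.
    """
    adjacency = {i: set() for i in range(n_spectra)}
    for (i, j), score in pair_scores.items():
        # Keep an edge only when the pair is similar enough.
        if i != j and score >= threshold:
            adjacency[i].add(j)
            adjacency[j].add(i)
    return adjacency


def connected_components(adjacency):
    """Group nodes into connected components (candidate compound families)."""
    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(sorted(component))
    return components


# Hypothetical scores for five spectra.
example_scores = {(0, 1): 0.9, (1, 2): 0.75, (2, 3): 0.2, (3, 4): 0.8}
families = connected_components(build_network(5, example_scores))
# families == [[0, 1, 2], [3, 4]]
```

Raising the threshold splits the network into smaller, tighter families, while lowering it merges them, so it is worth inspecting the components at more than one cutoff.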

When to Use It?

Use Cases

Match experimental mass spectra against a reference library to identify metabolite compounds. Build a molecular network from untargeted metabolomics data to discover related compound families. Preprocess and normalize spectral collections from multiple instruments for cross-study comparison.

Related Topics

Mass spectrometry, metabolomics, spectral matching, molecular networking, compound identification, spectral similarity, and analytical chemistry.

Important Notes

Requirements

The matchms Python package installed (for example, with pip install matchms). Mass spectrometry data in supported file formats. Reference spectral libraries for compound identification tasks.

Usage Recommendations

Do: apply default preprocessing filters to normalize spectra before computing similarity scores. Set appropriate mass tolerance values based on instrument precision. Use multiple scoring metrics to validate matching confidence.

Don't: compare spectra across datasets without first normalizing metadata fields and intensity values. Use overly loose mass tolerance thresholds that produce false positive matches. Trust single-metric scores without considering the number of matched fragments.
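
One way to act on the last recommendation is to require both a minimum score and a minimum number of matched fragments before accepting a hit. The sketch below works on plain dictionaries. The keys 'score' and 'matches' and both cutoff values are assumptions chosen for illustration and should be tuned to the instrument and library.

```python
def confident_matches(candidates, min_score=0.7, min_matched_peaks=6):
    """Keep candidates that pass both the score and matched-peak cutoffs."""
    return [
        c for c in candidates
        if c["score"] >= min_score and c["matches"] >= min_matched_peaks
    ]


# Hypothetical candidate hits for one query spectrum.
candidates = [
    {"name": "A", "score": 0.95, "matches": 2},   # high score, few fragments
    {"name": "B", "score": 0.82, "matches": 8},   # passes both cutoffs
    {"name": "C", "score": 0.50, "matches": 12},  # many fragments, low score
]
kept = confident_matches(candidates)
```

Here only candidate B survives: A is rejected because its high score rests on two fragments, and C because its score is too low despite many matched fragments.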

Limitations

Spectral matching accuracy depends on reference library completeness for the target compound classes. Similarity scores between spectra from different instruments may vary due to fragmentation differences. Large-scale pairwise comparison is computationally intensive for collections with many thousands of spectra.
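
A quick count of the comparisons involved shows why large collections become expensive. The helper names below are hypothetical, and the arithmetic only counts spectrum pairs without estimating run time.

```python
def all_vs_all_pairs(n_spectra: int) -> int:
    """Unique unordered pairs in an all-vs-all comparison of one collection."""
    return n_spectra * (n_spectra - 1) // 2


def library_search_pairs(n_queries: int, n_references: int) -> int:
    """Comparisons needed to score every query against every reference."""
    return n_queries * n_references


pairs_10k = all_vs_all_pairs(10_000)               # 49,995,000 pairs
search_pairs = library_search_pairs(500, 100_000)  # 50,000,000 comparisons
```

Because the all-vs-all count grows quadratically, doubling the collection size roughly quadruples the work. Restricting comparisons to candidates with similar precursor masses is one common way to reduce it.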