ESM

Apply Meta's Evolutionary Scale Modeling (ESM) protein language models for sequence embeddings, contact prediction, and structure prediction

Category: productivity | Source: K-Dense-AI/claude-scientific-skills

ESM is a community skill for protein analysis using Meta's Evolutionary Scale Modeling library, covering protein language model inference, sequence embeddings, structure prediction, contact maps, and protein representation learning for computational biology.

What Is This?

Overview

ESM provides patterns for applying protein language models to biological sequence analysis. It covers loading pretrained ESM models for protein sequence encoding; extracting residue-level and sequence-level embeddings for downstream tasks; predicting contact maps from sequence information alone; ESMFold structure prediction, which generates 3D protein coordinates from amino acid sequences; and batch inference for processing large protein datasets efficiently. The skill enables researchers to leverage deep learning representations of proteins for function prediction, protein engineering, and structural analysis.

Who Should Use This

This skill serves computational biologists using protein embeddings for function and property prediction, protein engineers designing variants with learned sequence representations, and structural biologists predicting protein structures from sequences.

Why Use It?

Problems It Solves

Traditional sequence analysis methods such as BLAST rely on homology, which fails for orphan proteins with no known relatives. Protein feature engineering for machine learning requires manually selecting physicochemical properties. Structure prediction from sequence historically required multiple sequence alignments, which are slow to compute. Comparing proteins at a functional level needs representations that go beyond raw sequence similarity.

Core Highlights

Pretrained ESM-2 models encode protein sequences into dense vector representations. Residue embeddings capture per-position structural and functional information. ESMFold predicts 3D protein structures from single sequences without alignments. Contact prediction maps identify residue pairs that are spatially close in the folded protein.
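
ESMFold is exposed through the same pretrained-model interface as the encoder models. A minimal sketch, assuming fair-esm is installed with its esmfold extras (which pull in the OpenFold dependencies) and a GPU is available:

import torch
import esm

# Load ESMFold; requires the esmfold extras on top of fair-esm.
model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQ"
with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # predicted structure as PDB text

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)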

How to Use It?

Basic Usage

import torch
import esm

# Load the 650M-parameter ESM-2 model and its tokenizer alphabet.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disable dropout for deterministic inference

data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQ"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRA"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)

# Per-residue embeddings from the final (33rd) layer:
# shape (batch, tokens, 1280), including BOS/EOS and padding tokens.
embeddings = results["representations"][33]
print(f"Shape: {embeddings.shape}")

# Attention-based contact maps, shape (batch, L, L).
contacts = results["contacts"]

# Mean-pool over real residues only. Sequences in a batch are padded
# to a common length, so slice each by its own length, skipping the
# BOS token at position 0.
seq_emb = torch.stack([
    embeddings[i, 1:len(seq) + 1].mean(dim=0)
    for i, (_, seq) in enumerate(data)
])
print(f"Seq embedding: {seq_emb.shape}")

Real-World Examples

import torch
import esm

class ProteinEmbedder:
    def __init__(self, model_name: str = "esm2_t33_650M_UR50D"):
        # Look up any pretrained ESM model by name.
        self.model, self.alphabet = getattr(esm.pretrained, model_name)()
        self.converter = self.alphabet.get_batch_converter()
        self.model.eval()
        # Read the final layer index off the model instead of hardcoding
        # 33, so smaller ESM-2 variants work with the same class.
        self.final_layer = self.model.num_layers

    def embed(self, sequences: list[tuple[str, str]]) -> dict:
        labels, strs, tokens = self.converter(sequences)
        with torch.no_grad():
            out = self.model(
                tokens,
                repr_layers=[self.final_layer],
                return_contacts=True)
        reps = out["representations"][self.final_layer]
        contacts = out["contacts"]
        results = {}
        for i, (name, seq) in enumerate(sequences):
            length = len(seq)
            # Skip the BOS token and any padding past the sequence end.
            residue_emb = reps[i, 1:length + 1]
            results[name] = {
                "residue_emb": residue_emb,
                "seq_emb": residue_emb.mean(dim=0),
                # Contacts already exclude BOS/EOS; crop off padding.
                "contacts": contacts[i, :length, :length],
            }
        return results

embedder = ProteinEmbedder()
seqs = [("lysozyme", "MKALIVLGL"),
        ("insulin", "MALWMRLLPL")]
embs = embedder.embed(seqs)
for name, data in embs.items():
    print(f"{name}: {data['seq_emb'].shape}")

Advanced Tips

Choose ESM model size based on the trade-off between embedding quality and compute resources available. Use GPU inference for batch processing large protein datasets to achieve acceptable throughput. Extract embeddings from multiple representation layers and concatenate them for richer feature vectors in downstream models.
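
As a sketch of the multi-layer tip, reusing model and batch_tokens from Basic Usage (layers 31 through 33 are the final layers of the 650M model; the layer choice here is illustrative):

# Request several representation layers in a single forward pass.
layers = [31, 32, 33]
with torch.no_grad():
    out = model(batch_tokens, repr_layers=layers)

# Concatenate per-residue embeddings along the feature axis, giving
# shape (batch, tokens, 3 * 1280) for the 650M model.
stacked = torch.cat(
    [out["representations"][layer] for layer in layers], dim=-1)
print(stacked.shape)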

When to Use It?

Use Cases

Build a protein function predictor that classifies enzyme activity from ESM embeddings. Create a variant effect predictor that scores mutations using embedding distance from wildtype. Implement a protein search engine that finds functionally similar proteins using embedding similarity.
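
As an illustration of the embedding-similarity use case, a minimal sketch that ranks candidate proteins against a query by cosine similarity of their sequence embeddings, reusing the ProteinEmbedder class from above (the sequences are hypothetical placeholders):

import torch

embedder = ProteinEmbedder()
query_emb = embedder.embed(
    [("query", "MKTVRQERLKSIVRILERSKEPVSGAQ")])["query"]["seq_emb"]

database = [("candidate1", "KALTARQQEVFDLIRDHISQTGMPPTRA"),
            ("candidate2", "MKALIVLGLVLLSVTVQG")]

# Cosine similarity between sequence-level embeddings.
scores = {
    name: torch.nn.functional.cosine_similarity(
        query_emb, entry["seq_emb"], dim=0).item()
    for name, entry in embedder.embed(database).items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.3f}")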

Related Topics

Protein language models, sequence embeddings, protein structure prediction, computational biology, and protein engineering.

Important Notes

Requirements

Python with PyTorch and the esm package installed. A GPU is recommended for inference with large models. Sufficient memory for model weights and batch embedding computation. Protein sequences in FASTA format as input for batch processing.
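
For FASTA input, fair-esm ships a batched dataset helper. A sketch, assuming a local proteins.fasta file (the path is a placeholder):

import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()

# Group sequences into batches with a similar total token count.
dataset = esm.FastaBatchedDataset.from_file("proteins.fasta")
batches = dataset.get_batch_indices(4096, extra_toks_per_seq=1)
loader = torch.utils.data.DataLoader(
    dataset,
    collate_fn=alphabet.get_batch_converter(),
    batch_sampler=batches)

for labels, strs, tokens in loader:
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    # ...pool and store embeddings per sequence here...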

Usage Recommendations

Do: use mean pooling over residue embeddings for sequence-level representations; select a model size that matches your compute budget and accuracy requirements; and process sequences in batches for efficient GPU utilization.

Don't: pass sequences longer than the model's maximum context length without truncation; reach for the largest model when a smaller variant is accurate enough for the task; or ignore memory constraints when batching many long protein sequences together.
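
A sketch of the truncation and batching recommendations together, reusing model and alphabet from earlier with sequences as a list of (name, sequence) tuples, and assuming the optional truncation_seq_length argument that recent fair-esm releases accept on get_batch_converter:

# ESM-2 was trained on crops of about 1024 tokens, so a cap of
# 1022 residues (plus BOS/EOS) is the usual truncation point.
batch_converter = alphabet.get_batch_converter(truncation_seq_length=1022)

# Keep batches small for long sequences: attention memory grows
# quadratically with sequence length.
chunk_size = 8
for i in range(0, len(sequences), chunk_size):
    labels, strs, tokens = batch_converter(sequences[i:i + chunk_size])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])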

Limitations

Large ESM models require significant GPU memory for inference. Embedding quality depends on the model having seen similar sequences during pretraining. ESMFold structure predictions are less accurate than AlphaFold's for proteins with available homologs. Contact map predictions from smaller models may miss long-range interactions that larger models capture. The models are trained on natural protein sequences, so embeddings may not generalize well to synthetic or highly engineered proteins.