ESM
Apply Meta's Evolutionary Scale Modeling library for protein sequence embeddings, contact prediction, and structure prediction
Category: productivity Source: K-Dense-AI/claude-scientific-skills
ESM is a community skill for protein analysis using Meta's Evolutionary Scale Modeling library, covering protein language model inference, sequence embeddings, structure prediction, contact maps, and protein representation learning for computational biology.
What Is This?
Overview
ESM provides patterns for applying protein language models to biological sequence analysis. It covers loading pretrained ESM models for protein sequence encoding, extracting residue-level and sequence-level embeddings for downstream tasks, contact map prediction from sequence information alone, ESMFold structure prediction that generates 3D protein coordinates from amino acid sequences, and batch inference for processing large protein datasets efficiently. The skill enables researchers to leverage deep learning representations of proteins for function prediction, engineering, and structural analysis.
Who Should Use This
This skill serves computational biologists using protein embeddings for function and property prediction, protein engineers designing variants with learned sequence representations, and structural biologists predicting protein structures from sequences.
Why Use It?
Problems It Solves
Traditional sequence analysis methods like BLAST depend on homology that fails for orphan proteins with no known relatives. Protein feature engineering for machine learning requires manual selection of physicochemical properties. Structure prediction from sequence historically required multiple sequence alignments that are slow to compute. Comparing proteins at a functional level needs representations beyond raw sequence similarity.
Core Highlights
Pretrained ESM-2 models encode protein sequences into dense vector representations. Residue embeddings capture per-position structural and functional information. ESMFold predicts 3D protein structures from single sequences without alignments. Contact prediction maps identify residue pairs that are spatially close in the folded protein.
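The contact maps mentioned above can be post-processed without loading a model. The sketch below uses a random symmetric matrix standing in for the contacts tensor an ESM model returns; the 0.5 probability cutoff and minimum sequence separation of 3 are illustrative choices, not library defaults.

```python
import torch

# Hypothetical 6x6 contact-probability matrix standing in for the
# "contacts" tensor from an ESM forward pass (values in [0, 1]).
torch.manual_seed(0)
L = 6
probs = torch.rand(L, L)
probs = (probs + probs.T) / 2  # contact maps are symmetric

# Keep residue pairs above the cutoff with sequence separation >= 3
# (adjacent residues are trivially close and usually excluded).
pairs = [
    (i, j, probs[i, j].item())
    for i in range(L)
    for j in range(i + 3, L)
    if probs[i, j] > 0.5
]
for i, j, p in pairs:
    print(f"residue {i} - residue {j}: p={p:.2f}")
```

The same filtering applies unchanged to a real contacts tensor, since fair-esm returns it with special tokens already removed.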
How to Use It?
Basic Usage
import torch
import esm

# Load a pretrained ESM-2 model (650M parameters, 33 layers) and its alphabet
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()  # disable dropout for deterministic inference

data = [
    ("protein1", "MKTVRQERLKSIVRILERSKEPVSGAQ"),
    ("protein2", "KALTARQQEVFDLIRDHISQTGMPPTRA"),
]
batch_labels, batch_strs, batch_tokens = batch_converter(data)

with torch.no_grad():
    results = model(batch_tokens, repr_layers=[33], return_contacts=True)

# Per-residue embeddings from the final layer: (batch, tokens, hidden_dim)
embeddings = results["representations"][33]
print(f"Shape: {embeddings.shape}")

# Mean-pool over residue positions (dropping BOS/EOS) for sequence-level
# vectors; with mixed-length batches, average over each sequence's true
# length instead so padding tokens are not included (see the class below).
seq_emb = embeddings[:, 1:-1].mean(dim=1)
print(f"Seq embedding: {seq_emb.shape}")
Real-World Examples
import torch
import esm

class ProteinEmbedder:
    def __init__(self, model_name: str = "esm2_t33_650M_UR50D"):
        self.model, self.alphabet = getattr(esm.pretrained, model_name)()
        self.converter = self.alphabet.get_batch_converter()
        self.model.eval()
        # Final-layer index; 33 matches the default 33-layer model.
        # Adjust this when loading a smaller or larger ESM-2 variant.
        self.repr_layer = 33

    def embed(self, sequences: list[tuple[str, str]]) -> dict:
        labels, strs, tokens = self.converter(sequences)
        with torch.no_grad():
            out = self.model(tokens, repr_layers=[self.repr_layer],
                             return_contacts=True)
        reps = out["representations"][self.repr_layer]
        contacts = out["contacts"]

        results = {}
        for i, (name, seq) in enumerate(sequences):
            length = len(seq)
            results[name] = {
                # Skip the BOS token and any padding past the true length
                "residue_emb": reps[i, 1:length + 1],
                "seq_emb": reps[i, 1:length + 1].mean(dim=0),
                # Contacts are returned with special tokens already removed
                "contacts": contacts[i, :length, :length],
            }
        return results

embedder = ProteinEmbedder()
seqs = [("lysozyme", "MKALIVLGL"),
        ("insulin", "MALWMRLLPL")]
embs = embedder.embed(seqs)
for name, data in embs.items():
    print(f"{name}: {data['seq_emb'].shape}")
Advanced Tips
Choose ESM model size based on the trade-off between embedding quality and compute resources available. Use GPU inference for batch processing large protein datasets to achieve acceptable throughput. Extract embeddings from multiple representation layers and concatenate them for richer feature vectors in downstream models.
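The layer-concatenation tip can be sketched with placeholder tensors; in a real run the per-layer dictionary comes from a forward pass with repr_layers=[30, 33], and the shapes below (batch 2, 12 tokens, hidden size 1280) are illustrative.

```python
import torch

# Placeholder per-layer representations standing in for
# results["representations"] from an ESM-2 forward pass requested
# with repr_layers=[30, 33]; shape is (batch, tokens, hidden_dim).
reps = {
    30: torch.randn(2, 12, 1280),
    33: torch.randn(2, 12, 1280),
}

# Concatenate along the feature dimension for richer per-residue vectors
stacked = torch.cat([reps[layer] for layer in (30, 33)], dim=-1)
print(stacked.shape)
```

The feature dimension doubles (1280 to 2560 here), so downstream models trained on single-layer embeddings need their input size adjusted accordingly.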
When to Use It?
Use Cases
Build a protein function predictor that classifies enzyme activity from ESM embeddings. Create a variant effect predictor that scores mutations using embedding distance from wildtype. Implement a protein search engine that finds functionally similar proteins using embedding similarity.
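The embedding-similarity search engine described above reduces to a cosine-similarity ranking once embeddings exist. A minimal sketch, using random tensors in place of real mean-pooled ESM-2 vectors:

```python
import torch
import torch.nn.functional as F

# Hypothetical sequence-level embeddings; in practice these come from
# mean-pooled ESM-2 representations, here they are random stand-ins.
torch.manual_seed(0)
database = torch.randn(100, 1280)   # 100 indexed proteins
query = torch.randn(1280)           # one query protein

# Cosine similarity between the query and every database entry,
# then take the five nearest neighbors.
sims = F.cosine_similarity(query.unsqueeze(0), database, dim=1)
top = torch.topk(sims, k=5)
for rank, (idx, score) in enumerate(zip(top.indices, top.values), 1):
    print(f"{rank}. protein {idx.item()} (cos={score.item():.3f})")
```

For large databases, the same ranking is typically delegated to an approximate nearest-neighbor index rather than a full pairwise scan.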
Related Topics
Protein language models, sequence embeddings, protein structure prediction, computational biology, and protein engineering.
Important Notes
Requirements
Python with PyTorch and the esm package installed. GPU recommended for inference with large models. Sufficient memory for model weights and batch embedding computation. Protein sequences in FASTA format for batch processing input.
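Since batch input arrives as FASTA, a minimal parser that produces the (name, sequence) tuples the batch converter expects can look like the sketch below; it handles wrapped sequence lines but is not a full FASTA implementation.

```python
# Minimal FASTA parser producing (name, sequence) tuples in the format
# the ESM batch converter expects; a sketch, not a complete parser.
def read_fasta(text: str) -> list[tuple[str, str]]:
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            # Keep only the identifier, dropping the description
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

fasta = """>protein1 test entry
MKTVRQERLK
SIVRILERSK
>protein2
KALTARQQEV
"""
print(read_fasta(fasta))
```

For production pipelines, an established parser such as Biopython's SeqIO is the safer choice.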
Usage Recommendations
Do: use mean pooling over residue embeddings for sequence-level representations. Select the appropriate model size for your compute budget and accuracy requirements. Process sequences in batches for efficient GPU utilization.
Don't: pass sequences longer than the model maximum context length without truncation. Use the largest model when a smaller variant provides sufficient accuracy for the task. Ignore memory constraints when batching many long protein sequences together.
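The truncation advice above can be applied before batch conversion. The limit below assumes ESM-2's commonly cited training context of 1024 tokens (1022 residues plus BOS/EOS); verify the value against the specific model in use.

```python
# Assumed residue limit for ESM-2 (1024-token context minus BOS/EOS);
# confirm against the model variant you load.
MAX_RESIDUES = 1022

def truncate(sequences: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Clip each sequence to the model's residue limit before batching."""
    return [(name, seq[:MAX_RESIDUES]) for name, seq in sequences]

long_seq = "M" * 3000
data = truncate([("big_protein", long_seq)])
print(len(data[0][1]))  # 1022
```

Truncation discards C-terminal information; for long proteins, sliding-window embedding over overlapping chunks is a common alternative.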
Limitations
Large ESM models require significant GPU memory for inference. Embedding quality depends on the model having seen similar sequences during pretraining. ESMFold structure predictions are less accurate than AlphaFold for proteins with available homologs. Contact map predictions from smaller models may miss long-range interactions that larger models capture. Embeddings are trained on natural protein sequences and may not generalize well to synthetic or highly engineered proteins.