SolubleMPNN

Design soluble protein variants with SolubleMPNN for improved expression

SolubleMPNN is a protein design skill for creating soluble protein variants. It covers computational protein engineering, solubility prediction, and expression optimization through machine learning.

What Is This?

Overview

SolubleMPNN is a computational tool that redesigns protein sequences to improve solubility while maintaining structural integrity. It is a variant of ProteinMPNN, a message passing neural network (MPNN) for structure-based sequence design, whose weights were trained only on soluble proteins. Given a backbone structure, it proposes amino acid sequences that are compatible with that structure while favoring the sequence features of soluble proteins. By combining deep learning with protein structure information, SolubleMPNN helps researchers turn poorly expressed or aggregation-prone proteins into more soluble variants suitable for therapeutic, industrial, and research applications.

Unlike traditional random mutagenesis or brute-force screening, SolubleMPNN proposes targeted mutations intended to reduce aggregation propensity and increase aqueous solubility. The tool takes a three-dimensional protein structure, analyzes surface-exposed residues, and outputs ranked sequence variants with predicted solubility improvements. Its neural network models capture sequence-structure relationships, which allows the protein surface to be optimized while the core fold and activity are preserved.
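
The surface-analysis step can be illustrated without any external library. The sketch below is a simple heuristic, not SolubleMPNN's own method: it treats a residue as likely exposed when few other Cα atoms lie within 10 Å. The function name `surface_positions` and both cutoffs are assumptions chosen for illustration, and a real workflow would more likely use a solvent-accessibility calculation.

```python
import math

def surface_positions(ca_coords, radius=10.0, max_neighbors=16):
    """Return indices of residues that are likely surface-exposed.

    Hypothetical helper: a residue counts as exposed when at most
    `max_neighbors` other C-alpha atoms lie within `radius` angstroms.
    Both cutoffs are illustrative assumptions, not calibrated values.
    """
    exposed = []
    for i, a in enumerate(ca_coords):
        # Count the other C-alpha atoms inside the sphere around residue i
        neighbors = sum(
            1
            for j, b in enumerate(ca_coords)
            if i != j and math.dist(a, b) <= radius
        )
        if neighbors <= max_neighbors:
            exposed.append(i)
    return exposed
```

Positions returned by a filter like this could serve as the designable set, with all other residues kept fixed.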

Who Should Use This

Protein engineers, synthetic biologists, and pharmaceutical researchers developing recombinant proteins should use this skill. It is particularly valuable for those facing challenges with low protein expression levels, insoluble inclusion bodies, or aggregation-prone targets that require optimization for downstream applications. Academic researchers working on structural biology or functional studies of difficult proteins can also benefit from SolubleMPNN’s predictive capabilities.

Why Use It?

Problems It Solves

Many recombinant proteins express poorly or aggregate in solution, creating bottlenecks in drug development, manufacturing, and basic research. Traditional approaches to improving solubility often require extensive experimental screening of random or semi-rational mutations, which is time-consuming and costly. SolubleMPNN shortens this process by proposing candidate high-solubility variants computationally, so that fewer variants need to be expressed and screened in the lab.

Core Highlights

  • SolubleMPNN uses ProteinMPNN weights trained only on soluble proteins, which biases designs toward soluble-like sequences across diverse protein families.
  • The tool aims to preserve protein function by respecting structural constraints and optimizing only surface-exposed residues, which lowers the risk of disrupting the core fold or active site.
  • Batch processing evaluates multiple design variants in parallel, supporting high-throughput workflows.
  • Integration with standard protein design pipelines allows the tool to be added to existing computational and experimental frameworks.

How to Use It?

Basic Usage

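from solublempnn import ProteinDesigner

# Load the structure to redesign
designer = ProteinDesigner(pdb_file="protein.pdb")
# Generate ten candidate sequences
variants = designer.design(num_variants=10)
# Print each designed sequence with its predicted solubility score
for variant in variants:
    print(f"Sequence: {variant.sequence}")
    print(f"Solubility Score: {variant.score}")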
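
Designed sequences usually need to be handed to other tools or to a gene synthesis order, and FASTA is the common exchange format. The following is a minimal sketch that assumes the variants have already been reduced to (identifier, sequence) pairs; `to_fasta` is a hypothetical helper and not part of the skill's API.

```python
def to_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text.

    Hypothetical helper for illustration; sequences are wrapped at
    `width` characters per line.
    """
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        # Split the sequence into fixed-width chunks
        lines.extend(
            sequence[i:i + width] for i in range(0, len(sequence), width)
        )
    return "\n".join(lines) + "\n"
```

For example, `to_fasta([("variant_1", "MKTAYIAK")])` returns a two-line record that can be written straight to a `.fasta` file.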

Real-World Examples

The first example optimizes an antibody fragment with poor expression. Load the crystal structure, generate ten design variants ranked by predicted solubility, and select the top candidate for experimental validation. This approach can improve expression yields by roughly 5- to 10-fold over wild type, although gains vary by target and must be confirmed experimentally.

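from solublempnn import ProteinDesigner

# Load the Fab crystal structure
designer = ProteinDesigner(pdb_file="antibody_fab.pdb")
# Keep the CDR loops fixed so the antigen-binding residues are not redesigned
variants = designer.design(num_variants=10, preserve_cdr=True)
# Variants are ranked by predicted solubility, so the first is the top candidate
best = variants[0]
print(f"Predicted solubility change: {best.solubility_delta}")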

Example two shows redesigning a therapeutic enzyme that aggregates at high concentrations. Specify regions to preserve for catalytic activity, generate variants, and filter by both solubility and predicted activity retention to identify candidates that maintain enzymatic function while exhibiting improved solubility.

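from solublempnn import ProteinDesigner

designer = ProteinDesigner(pdb_file="enzyme.pdb")
variants = designer.design(
    num_variants=15,
    # Residue ranges to leave unchanged (regions needed for catalysis)
    preserve_regions=[(10, 45), (120, 160)],
    # Keep only variants with predicted activity retention of at least 0.8
    min_activity_retention=0.8,
)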
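
To make the effect of preserved regions concrete, the sketch below expands a list of ranges into the positions that remain open for redesign. It assumes 1-based, inclusive residue numbering, which the text does not specify, and `designable_positions` is a hypothetical helper rather than part of the skill's API.

```python
def designable_positions(seq_length, preserve_regions):
    """Return the 1-based positions that are NOT inside any preserved range.

    Assumption: each range is (start, end), 1-based and inclusive.
    """
    fixed = set()
    for start, end in preserve_regions:
        if start > end:
            raise ValueError(f"invalid range: ({start}, {end})")
        # Clamp the range to the sequence before marking it as fixed
        fixed.update(range(max(start, 1), min(end, seq_length) + 1))
    return [pos for pos in range(1, seq_length + 1) if pos not in fixed]
```

Under these assumptions, a 200-residue enzyme with the two regions above would keep 77 positions fixed and leave 123 open for design.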

Advanced Tips

Combine SolubleMPNN predictions with experimental validation using high-throughput screening to identify optimal variants faster than either approach alone. Use ensemble predictions from multiple model checkpoints to increase confidence in solubility improvements for critical applications. For challenging targets, consider iterative rounds of design and testing to further refine solubility and expression properties.
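
The ensemble idea can be sketched as follows, assuming each variant has already been scored once per checkpoint and that a higher score means better predicted solubility. Both points are assumptions, and `ensemble_rank` is a hypothetical helper. Variants are ordered by mean score, with the spread across checkpoints reported as a rough confidence indicator.

```python
from statistics import mean, pstdev

def ensemble_rank(scores_by_variant):
    """Rank variants by mean score across checkpoints.

    `scores_by_variant` maps a variant id to a list of scores, one per
    checkpoint. Assumption: higher scores are better. Returns a list of
    (variant_id, mean_score, spread) tuples, best first.
    """
    summary = [
        (variant_id, mean(scores), pstdev(scores))
        for variant_id, scores in scores_by_variant.items()
    ]
    # Highest mean first; among equal means, prefer the smaller spread
    summary.sort(key=lambda row: (-row[1], row[2]))
    return summary
```

A variant with a high mean but a large spread is one the checkpoints disagree on, which may be a reason to test it alongside a more consistent candidate.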

When to Use It?

Use Cases

SolubleMPNN is ideal for optimizing recombinant therapeutic proteins for manufacturing scale-up and cost reduction in biopharmaceutical production. It is also valuable for improving expression of difficult-to-express research proteins needed for structural biology, biochemical studies, or diagnostic development. The tool supports redesigning protein variants for cell-free protein synthesis systems, where solubility directly impacts yield and activity. Additionally, it is useful for engineering proteins for industrial applications that require high concentration stability and minimal aggregation, such as enzymes for biocatalysis or biosensors.

Related Topics

This skill complements protein structure prediction tools like AlphaFold and sequence design frameworks such as ProteinMPNN for comprehensive protein engineering workflows. It can be integrated with molecular dynamics simulations and experimental mutagenesis for a holistic approach to protein optimization.
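
For workflows that call ProteinMPNN directly, the soluble-protein weights are selected with a command-line flag. The sketch below only builds the argument list. The script name and flags follow the public ProteinMPNN repository and should be checked against the installed version; the function name and default values are illustrative assumptions.

```python
def build_soluble_mpnn_command(pdb_path, out_folder, num_seqs=10,
                               sampling_temp=0.1, seed=37):
    """Build an argument list for ProteinMPNN with the soluble weights.

    Assumption: flags as documented in the public ProteinMPNN repository
    (protein_mpnn_run.py); verify them against your checkout.
    """
    return [
        "python", "protein_mpnn_run.py",
        "--pdb_path", str(pdb_path),
        "--out_folder", str(out_folder),
        "--num_seq_per_target", str(num_seqs),
        "--sampling_temp", str(sampling_temp),
        "--seed", str(seed),
        "--use_soluble_model",  # load weights trained on soluble proteins only
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the directory that contains the script.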

Important Notes

Requirements

The input is a protein structure file in PDB format with complete atom coordinates. A Python environment with PyTorch and standard scientific computing libraries (such as NumPy and Biopython) is required. GPU acceleration is recommended for large protein complexes or batch designs, where it significantly speeds up computation.
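
A quick completeness check on the input structure can catch problems before design. The sketch below reads fixed-column ATOM records and reports residues that lack any of the four backbone atoms. It assumes standard PDB column positions and ignores HETATM records, and `missing_backbone_atoms` is a hypothetical helper, not part of the skill.

```python
BACKBONE = {"N", "CA", "C", "O"}

def missing_backbone_atoms(pdb_text):
    """Map (chain, residue number, insertion code) to missing backbone atoms.

    Assumption: fixed-column PDB ATOM records (atom name in columns 13-16,
    chain in column 22, residue number in columns 23-26, insertion code in
    column 27). Residues with a complete backbone are omitted.
    """
    seen = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        key = (line[21], int(line[22:26]), line[26:27].strip())
        seen.setdefault(key, set()).add(atom_name)
    return {
        key: BACKBONE - atoms
        for key, atoms in seen.items()
        if not BACKBONE <= atoms
    }
```

An empty result means every residue found has N, CA, C, and O atoms; it does not detect residues that are absent from the file altogether.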

Usage Recommendations

  • Always provide high-quality, experimentally validated protein structures (such as X-ray or cryo-EM PDB files) to maximize prediction accuracy.
  • Limit designable regions to surface-exposed residues unless there is a strong rationale for core mutations, as altering buried residues may destabilize the protein.
  • Use the preserve_regions or preserve_cdr options to protect functional or structurally critical motifs from unintended modification.
  • Generate multiple design variants and prioritize those with both high predicted solubility and minimal deviation from the wild-type sequence for downstream testing.
  • Combine computational predictions with experimental screening to confirm solubility improvements and avoid over-reliance on in silico scores.
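
The prioritization rule in the recommendations above can be sketched as a small ranking step. It assumes each candidate is a (sequence, score) pair with higher scores meaning better predicted solubility, and that variants have the same length as the wild type. The function names are hypothetical.

```python
def count_mutations(wild_type, variant):
    """Count the positions where the variant differs from the wild type."""
    if len(wild_type) != len(variant):
        raise ValueError("sequences must have the same length")
    return sum(a != b for a, b in zip(wild_type, variant))

def prioritize(wild_type, candidates, max_mutations=None):
    """Rank (sequence, score) candidates for experimental testing.

    Assumption: higher score is better. Candidates are ordered by score,
    then by fewest mutations; `max_mutations` optionally drops variants
    that deviate too far from the wild type.
    """
    rows = [
        (seq, score, count_mutations(wild_type, seq))
        for seq, score in candidates
    ]
    if max_mutations is not None:
        rows = [row for row in rows if row[2] <= max_mutations]
    rows.sort(key=lambda row: (-row[1], row[2]))
    return rows
```

Sorting on score first and mutation count second means that, among equally scored designs, the one closest to the wild type is tested first.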

Limitations

  • SolubleMPNN predictions are limited by the accuracy and completeness of the input protein structure; errors or missing regions can reduce reliability.
  • The tool does not guarantee preservation of protein function, especially if mutations are introduced near active sites or essential motifs.
  • Solubility predictions may not fully capture context-dependent effects such as post-translational modifications, oligomerization, or cellular expression environment.
  • The model is optimized for soluble, globular proteins and may perform suboptimally on membrane proteins or proteins with extensive disordered regions.
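
Missing regions in a structure often show up as jumps in residue numbering. The sketch below flags such jumps for one chain, assuming residue numbers have already been extracted from the structure file. It is a rough check only, since some numbering schemes skip numbers by design, and `numbering_gaps` is a hypothetical helper.

```python
def numbering_gaps(residue_numbers):
    """Return (first_missing, last_missing) ranges in a chain's numbering.

    Assumption: a jump of more than one between consecutive residue
    numbers indicates unmodeled residues.
    """
    ordered = sorted(set(residue_numbers))
    return [
        (a + 1, b - 1)
        for a, b in zip(ordered, ordered[1:])
        if b - a > 1
    ]
```

Any range reported here deserves a look before design, because positions next to an unmodeled loop are predicted with less structural context.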