Molfeat
Extract molecular features for drug discovery using automated Molfeat integration
Molfeat is a community skill for computing molecular feature representations using the molfeat library, covering fingerprint generation, descriptor calculation, embedding computation, feature selection, and representation benchmarking for cheminformatics and drug discovery workflows.
What Is This?
Overview
Molfeat provides tools for transforming molecular structures into numerical feature representations for machine learning. It covers five areas. Fingerprint generation computes binary and count-based molecular fingerprints, including Morgan, MACCS, and topological patterns. Descriptor calculation derives physicochemical property vectors from molecular structures using RDKit descriptors. Embedding computation generates learned molecular representations from pre-trained neural network models. Feature selection identifies informative molecular features in large descriptor sets for model training. Representation benchmarking compares feature quality across tasks using standard evaluation metrics. Together, these let researchers convert molecules into ML-ready feature vectors.
Who Should Use This
This skill serves cheminformatics researchers building molecular property prediction models, computational chemists comparing molecular representations, and drug discovery scientists featurizing compound libraries.
Why Use It?
Problems It Solves
Different molecular representation methods are scattered across multiple libraries with inconsistent interfaces. Comparing fingerprint and descriptor methods requires implementing each separately with different API patterns. Pre-trained molecular embeddings from deep learning models need specific loading and inference code per model. Feature selection from high-dimensional molecular descriptors requires systematic evaluation.
Core Highlights
The fingerprint engine computes multiple fingerprint types through a unified interface. The descriptor calculator generates physicochemical property vectors from molecular structures. The embedding generator loads pre-trained models to produce learned molecular representations. The feature evaluator benchmarks representations across prediction tasks.
How to Use It?
Basic Usage
    from molfeat.trans.fp import FPVecTransformer
    # The GIN models are served by molfeat's DGL-based transformer,
    # which needs the optional dgl and dgllife packages.
    from molfeat.trans.pretrained import PretrainedDGLTransformer


    class MolFeaturizer:
        """Compute fingerprints and pre-trained embeddings for SMILES strings."""

        def __init__(self, fp_type: str = 'ecfp:4'):
            # kind selects the fingerprint type, e.g. 'ecfp:4' or 'maccs'.
            self.fp_trans = FPVecTransformer(kind=fp_type, dtype=float)

        def fingerprints(self, smiles: list[str]):
            # Transformers are callable on a list of SMILES strings.
            return self.fp_trans(smiles)

        def embeddings(
            self,
            smiles: list[str],
            model: str = 'gin_supervised_infomax',
        ):
            # Loads the pre-trained model weights on first use.
            trans = PretrainedDGLTransformer(kind=model)
            return trans(smiles)

        def combined(self, smiles: list[str]) -> dict:
            return {
                'fingerprints': self.fingerprints(smiles),
                'embeddings': self.embeddings(smiles),
            }
Real-World Examples
    import numpy as np


    class FeatureComparer:
        """Collect feature matrices from several transformers and compare them."""

        def __init__(self, smiles: list[str]):
            self.smiles = smiles
            self.features = {}

        def add_method(self, name: str, transformer):
            # Any callable that maps a list of SMILES to a 2D feature array works.
            feats = transformer(self.smiles)
            self.features[name] = np.array(feats)

        def dimensionality(self) -> dict:
            # Number of feature columns per method.
            return {name: arr.shape[1] for name, arr in self.features.items()}

        def sparsity(self) -> dict:
            # Percentage of zero entries per method, rounded to one decimal.
            results = {}
            for name, arr in self.features.items():
                zero_pct = np.mean(arr == 0) * 100
                results[name] = round(zero_pct, 1)
            return results

        def summary(self) -> dict:
            return {
                'methods': list(self.features.keys()),
                'dims': self.dimensionality(),
                'sparsity': self.sparsity(),
            }
Advanced Tips
Compare multiple fingerprint types on your specific prediction task since the best representation varies by target property and dataset. Use concatenated features combining fingerprints with learned embeddings to capture both structural patterns and latent molecular properties. Cache computed features for large compound libraries to avoid recomputation during model iteration.
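The caching tip can be sketched with a small in-memory cache keyed by SMILES string. This is a minimal illustration, not part of molfeat: `FeatureCache` and `toy_featurizer` are hypothetical names, and the featurizer is assumed to return one vector per input SMILES in input order. In practice the callable would be a real transformer and the store would likely be persisted to disk.

```python
from typing import Callable, Dict, List, Sequence


class FeatureCache:
    """In-memory feature cache keyed by SMILES string (hypothetical helper)."""

    def __init__(self, featurize_fn: Callable[[List[str]], Sequence[Sequence[float]]]):
        # Assumption: featurize_fn returns one vector per input, in input order.
        self.featurize_fn = featurize_fn
        self._store: Dict[str, List[float]] = {}

    def get(self, smiles: List[str]) -> List[List[float]]:
        # Featurize only unseen SMILES; dict.fromkeys removes duplicates
        # while keeping first-seen order.
        missing = [s for s in dict.fromkeys(smiles) if s not in self._store]
        if missing:
            for smi, vec in zip(missing, self.featurize_fn(missing)):
                self._store[smi] = list(vec)
        return [self._store[s] for s in smiles]


calls = []


def toy_featurizer(batch):
    # Stand-in for a real transformer: records each batch it receives and
    # returns two dummy features (string length, count of uppercase 'C').
    calls.append(list(batch))
    return [[float(len(s)), float(s.count("C"))] for s in batch]


cache = FeatureCache(toy_featurizer)
first = cache.get(["CCO", "c1ccccc1"])
second = cache.get(["CCO", "CCN"])  # only "CCN" reaches the featurizer
print(calls)  # [['CCO', 'c1ccccc1'], ['CCN']]
```

Keying on the raw SMILES string means two different spellings of the same molecule are cached separately; canonicalizing the strings first would avoid that.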
When to Use It?
Use Cases
Compute Morgan fingerprints for a compound library to train a property prediction model. Compare fingerprint and embedding methods on a benchmark dataset to select the best representation. Generate pre-trained molecular embeddings for similarity search across a chemical database.
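For the similarity search use case, one common choice is to rank database entries by cosine similarity to a query embedding. The sketch below assumes embeddings are plain lists of floats that have already been computed; `cosine` and `top_k` are hypothetical helper names, and the source does not prescribe a particular similarity measure.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          library: Sequence[Sequence[float]],
          k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k most similar library entries."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(library)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 2D "embeddings" standing in for real model output.
library = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], library, k=2)
print(hits)  # index 0 scores 1.0, index 2 scores about 0.707
```

A linear scan like this is fine for small libraries; large databases usually need a vectorized or indexed search.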
Related Topics
Molecular featurization, cheminformatics, molecular fingerprints, molecular embeddings, drug discovery, feature engineering, and chemical machine learning.
Important Notes
Requirements
Molfeat Python package with RDKit dependency. Pre-trained model weights for embedding computation. SMILES strings as molecular input format.
Usage Recommendations
Do: benchmark multiple representation methods before selecting features for model training. Validate that input SMILES are parseable before batch featurization to avoid silent failures. Use the unified transformer interface for consistent feature generation across methods.
Don't: assume one fingerprint type is optimal for all prediction tasks without benchmarking. Mix feature representations computed with different settings across training and inference. Ignore feature dimensionality when selecting representations since high-dimensional features may overfit on small datasets.
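The recommendation to validate SMILES before batch featurization can be sketched as a pre-filter. By default this assumes RDKit's `Chem.MolFromSmiles`, which returns `None` for strings it cannot parse; `split_valid` is a hypothetical helper, and the parser is injectable so the logic can be exercised without RDKit installed.

```python
from typing import Callable, List, Optional, Tuple


def _rdkit_parse(smiles: str):
    # Imported lazily so the module loads even when RDKit is absent.
    from rdkit import Chem
    return Chem.MolFromSmiles(smiles)  # None when parsing fails


def split_valid(
    smiles: List[str],
    parse: Optional[Callable[[str], object]] = None,
) -> Tuple[List[str], List[Tuple[int, str]]]:
    """Split input into parseable SMILES and (index, value) pairs that failed."""
    parse = parse or _rdkit_parse
    valid, invalid = [], []
    for idx, smi in enumerate(smiles):
        # Empty strings are treated as invalid here regardless of the parser.
        if smi and parse(smi) is not None:
            valid.append(smi)
        else:
            invalid.append((idx, smi))
    return valid, invalid


# Demonstration with a stub parser that rejects any string containing '?'.
stub = lambda s: None if "?" in s else object()
ok, bad = split_valid(["CCO", "C?C", "", "CCN"], parse=stub)
print(ok)   # ['CCO', 'CCN']
print(bad)  # [(1, 'C?C'), (2, '')]
```

Keeping the original indices of rejected entries makes it possible to realign labels with the surviving molecules afterwards.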
Limitations
Pre-trained embeddings may not generalize well to chemical spaces not represented in their training data. Fingerprint bit collision rates increase as fingerprint length decreases. Feature computation time varies significantly across representation methods, especially for neural network embeddings.
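The link between fingerprint length and bit collisions can be illustrated with a toy folding scheme. The sketch below assumes substructure identifiers are plain integers folded into a bit vector by modulo; `fold` and `collision_count` are hypothetical helpers, and real fingerprint implementations use their own hashing, so only the trend carries over.

```python
from typing import Iterable, List


def fold(feature_ids: Iterable[int], n_bits: int) -> List[int]:
    """Fold integer feature identifiers into a bit vector of length n_bits."""
    bits = [0] * n_bits
    for fid in feature_ids:
        bits[fid % n_bits] = 1  # simplified folding, an assumption of this sketch
    return bits


def collision_count(feature_ids: Iterable[int], n_bits: int) -> int:
    """Number of distinct features that share a bit with an earlier feature."""
    distinct = set(feature_ids)
    return len(distinct) - sum(fold(distinct, n_bits))


# 100 distinct toy identifiers standing in for hashed substructures.
ids = [i * 37 for i in range(100)]
for length in (2048, 64, 32):
    print(length, collision_count(ids, length))
# 2048 0
# 64 36
# 32 68
```

With 100 distinct features, no vector shorter than 100 bits can avoid collisions, which is why very short fingerprints lose structural detail.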