Scikit Bio

Automate and integrate Scikit Bio for powerful bioinformatics data analysis workflows

Scikit-bio is a community skill for bioinformatics data analysis using the scikit-bio Python library, covering sequence alignment, phylogenetic trees, diversity metrics, distance matrices, and statistical ordination for microbial ecology and genomics.

What Is This?

Overview

Scikit-bio is a Python library for biological data analysis. This skill covers five areas: sequence alignment (pairwise alignment of DNA, RNA, and protein sequences, plus data structures for holding multiple sequence alignments), phylogenetic trees (building and manipulating tree structures for evolutionary analysis), diversity metrics (alpha and beta diversity indices computed from community abundance data), distance matrices (pairwise distances computed with ecological and sequence-based metrics), and statistical ordination (methods such as PCoA and CCA for visualizing community composition patterns). The skill helps bioinformaticians analyze biological datasets within standard Python workflows.

Who Should Use This

This skill serves microbial ecologists analyzing microbiome community composition, bioinformaticians processing genomic sequence data, and evolutionary biologists studying phylogenetic relationships. It is also well suited for researchers working with 16S rRNA amplicon data or shotgun metagenomic datasets who need reproducible, scriptable analysis pipelines.

Why Use It?

Problems It Solves

Analyzing microbiome community composition requires computing diversity metrics and ordination from abundance tables. Comparing biological sequences across samples needs pairwise alignment and distance computation. Phylogenetic tree operations including pruning and distance extraction require specialized data structures. Statistical testing of community differences needs methods like PERMANOVA that handle distance matrix inputs directly rather than raw count data.

Core Highlights

The sequence analyzer performs alignment and distance computation across sequence types. The diversity calculator computes alpha and beta diversity indices. The tree manipulator builds and queries phylogenetic tree structures. The ordination engine runs PCoA and related methods for community visualization.

How to Use It?

Basic Usage

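import numpy as np
from skbio import TreeNode
from skbio.diversity import alpha_diversity, beta_diversity

# Rows are samples, columns are features (for example OTUs or ASVs).
counts = np.array([
    [10, 20, 30, 0, 5],
    [5, 5, 5, 5, 5],
    [100, 0, 0, 0, 1],
])
ids = ['S1', 'S2', 'S3']

# Alpha diversity: one Shannon value per sample.
shannon = alpha_diversity('shannon', counts, ids)
print('Shannon:')
print(shannon)

# Beta diversity: pairwise Bray-Curtis dissimilarities between samples.
bc = beta_diversity('braycurtis', counts, ids)
print('\nBray-Curtis:')
print(bc)

# Parse a Newick string into a tree.
tree = TreeNode.read(['((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);'])
print(f'\nTips: {[n.name for n in tree.tips()]}')
# descending_branch_length() sums every branch below the root, so this is
# the total branch length of the tree rather than its depth.
print(f'Total branch length: {tree.descending_branch_length():.1f}')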
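
To make the two metrics above less of a black box, here is a minimal plain-Python sketch of the textbook formulas behind them. The helper names `shannon` and `bray_curtis` are illustrative and are not scikit-bio functions. The logarithm base of 2 is an assumption made for this sketch; check the default base used by your installed scikit-bio version before comparing numbers.

```python
import math


def shannon(counts, base=2):
    """Shannon index H = -sum(p_i * log(p_i)) over features with nonzero counts."""
    total = sum(counts)
    return -sum(
        (c / total) * math.log(c / total, base)
        for c in counts
        if c > 0
    )


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity: sum(|u_i - v_i|) / sum(u_i + v_i)."""
    return sum(abs(a - b) for a, b in zip(u, v)) / (sum(u) + sum(v))


s1 = [10, 20, 30, 0, 5]
s2 = [5, 5, 5, 5, 5]

# Five equally abundant features give the maximum entropy, log2(5).
print(shannon(s2))
# Numerator 5 + 15 + 25 + 5 + 0 = 50, denominator 65 + 25 = 90.
print(bray_curtis(s1, s2))
```

A perfectly even sample maximizes the Shannon index for its number of features, and Bray-Curtis ranges from 0 (identical counts) to 1 (no shared features).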

Real-World Examples

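import numpy as np
import pandas as pd
from skbio.diversity import alpha_diversity, beta_diversity
from skbio.stats.distance import permanova
from skbio.stats.ordination import pcoa


class MicrobiomeAnalysis:
    """Alpha diversity, beta diversity, PCoA, and PERMANOVA for one count table."""

    def __init__(self, counts: np.ndarray, sample_ids: list, groups: list):
        self.counts = counts    # samples x features count matrix
        self.ids = sample_ids   # one ID per row of counts
        self.groups = groups    # one group label per sample

    def alpha(self):
        return {
            'shannon': alpha_diversity('shannon', self.counts, self.ids),
            # Recent scikit-bio releases may expose this metric under a
            # different name (for example 'observed_features'); check the
            # version you have installed if 'observed_otus' is rejected.
            'observed': alpha_diversity('observed_otus', self.counts, self.ids),
        }

    def beta(self):
        return beta_diversity('braycurtis', self.counts, self.ids)

    def ordinate(self):
        # PCoA takes the distance matrix, not the raw counts.
        return pcoa(self.beta())

    def test_groups(self, permutations=999):
        # The grouping Series must be indexed by the same sample IDs as the
        # distance matrix.
        grouping = pd.Series(self.groups, index=self.ids)
        return permanova(self.beta(), grouping, permutations=permutations)


# Random counts for demonstration only; the seed makes the table reproducible.
rng = np.random.default_rng(42)
counts = rng.integers(0, 100, size=(6, 20))
ma = MicrobiomeAnalysis(
    counts,
    [f'S{i}' for i in range(6)],
    ['A', 'A', 'A', 'B', 'B', 'B'],
)
result = ma.test_groups()
print(f'p-value: {result["p-value"]}')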
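
For readers who want to see what a PERMANOVA p-value is made of, the following plain-Python sketch computes the pseudo-F statistic from a distance matrix and estimates a p-value by shuffling group labels. It follows the standard published formulation and is not scikit-bio's implementation; the function names and the toy distance matrix are hypothetical.

```python
import random
from itertools import combinations


def pseudo_f(dm, groups):
    """Pseudo-F from a square distance matrix (list of lists) and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    # Total sum of squares: squared distances over all pairs, divided by N.
    ss_total = sum(dm[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    # Within-group sum of squares: same quantity, computed per group.
    ss_within = 0.0
    for g in labels:
        members = [i for i in range(n) if groups[i] == g]
        ss_within += sum(
            dm[i][j] ** 2 for i, j in combinations(members, 2)
        ) / len(members)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(dm, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, p-value) using label permutations."""
    rng = random.Random(seed)
    observed = pseudo_f(dm, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dm, shuffled) >= observed:
            hits += 1
    # The +1 terms count the observed labelling as one of the permutations.
    return observed, (hits + 1) / (permutations + 1)


# Toy distances: samples 0-1 are close, samples 2-3 are close, groups are far apart.
dm = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.1],
    [0.8, 0.9, 0.1, 0.0],
]
groups = ['A', 'A', 'B', 'B']
observed_f, p_value = permutation_p_value(dm, groups)
print(f'pseudo-F: {observed_f:.3f}, p-value: {p_value:.3f}')
```

With only four samples, one in three label arrangements reproduces the original split, so the p-value cannot be small however clean the separation is. This is one reason small group sizes limit what a permutation test can show.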

Advanced Tips

Use phylogeny-aware beta diversity metrics such as UniFrac when a phylogenetic tree is available, since they can give more sensitive community comparisons. Run PERMANOVA with enough permutations, typically 999 or more, for reliable significance testing on distance matrices. Combine PCoA ordination with group metadata for visual exploration of community structure differences. When working with large datasets, consider subsetting samples first, because the distance matrix grows with the square of the sample count.

When to Use It?

Use Cases

Calculate alpha and beta diversity metrics from a 16S microbiome abundance table. Run PCoA ordination to visualize community composition differences between sample groups. Test whether microbial communities differ significantly between treatment conditions using PERMANOVA.

Related Topics

Scikit-bio, bioinformatics, microbiome analysis, diversity metrics, phylogenetics, ordination, and ecological statistics.

Important Notes

Requirements

Scikit-bio Python package installed via pip or conda. NumPy and pandas for numerical computation and data handling. Community abundance data in matrix format with sample and feature identifiers.

Usage Recommendations

Do: normalize or rarefy abundance data before computing diversity metrics to account for uneven sequencing depth. Use multiple diversity metrics to capture different aspects of community structure. Verify sample identifiers match between abundance tables and metadata.
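
As an illustration of what rarefying to an even depth means, here is a minimal standard-library sketch that subsamples one sample's counts without replacement. The function name `rarefy` and its `seed` parameter are hypothetical; scikit-bio provides its own subsampling utilities, which are the better choice in a real pipeline.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Randomly draw `depth` reads without replacement from one sample."""
    total = sum(counts)
    if depth > total:
        raise ValueError(f'depth {depth} exceeds sample total {total}')
    rng = random.Random(seed)
    # Expand counts into one entry per read, labelled by feature index.
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    drawn = Counter(rng.sample(pool, depth))
    return [drawn.get(i, 0) for i in range(len(counts))]


sample = [10, 20, 30, 0, 5]
rarefied = rarefy(sample, depth=25)
print(rarefied, sum(rarefied))
```

Every sample in a comparison should be rarefied to the same depth, and samples with fewer reads than that depth have to be dropped rather than padded.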

Don't: compare diversity values computed at different rarefaction depths, since this introduces bias. Don't use parametric statistical tests on distance matrices; PERMANOVA and related permutation methods are designed for this data type. Don't interpret ordination axes as representing specific biological variables.

Limitations

Some scikit-bio functions have specific input format requirements that differ from other bioinformatics libraries. Large distance matrices require significant memory for computation and storage: a dense matrix grows with the square of the sample count, which can become a bottleneck with tens of thousands of samples. Phylogenetic diversity metrics require a matching tree that covers all features in the abundance table.
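
A quick back-of-the-envelope check helps when planning an analysis. The sketch below estimates the storage needed for a square distance matrix, assuming dense storage at 8 bytes per value (64-bit floats). Both the helper name and the dense-storage assumption are illustrative, and actual usage during computation can be higher.

```python
def distance_matrix_bytes(n_samples, bytes_per_value=8):
    """Bytes needed for a dense n x n matrix of fixed-width values."""
    return n_samples * n_samples * bytes_per_value


for n in (500, 10_000, 50_000):
    gigabytes = distance_matrix_bytes(n) / 1e9
    print(f'{n:>6} samples: {gigabytes:.3f} GB')
```

Doubling the number of samples quadruples the estimate, so subsetting samples has a much larger effect on memory than trimming features.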