TensorBoard

Visualize and monitor machine learning metrics with TensorBoard automation and integration

This community skill covers ML experiment visualization with TensorBoard: scalar tracking, model graphs, histogram analysis, image logging, and hyperparameter comparison for deep learning training monitoring.

What Is This?

Overview

This skill provides guidance on visualizing machine learning experiments with the TensorBoard dashboard. It covers scalar tracking, which plots loss, accuracy, and custom metrics over training steps; model graphs, which visualize neural network architectures and computation flows; histogram analysis, which shows weight and gradient distributions across training epochs; image logging, which records sample predictions and feature maps during training; and hyperparameter comparison, which evaluates experiment configurations side by side. The skill helps engineers monitor and debug training runs.
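
Each of these capabilities corresponds to a single SummaryWriter call. As a quick orientation before the fuller examples below, here is a minimal sketch of the mapping (the run directory, tags, and tensor shapes are illustrative, not prescribed):

from torch.utils.tensorboard import SummaryWriter
import torch

writer = SummaryWriter('runs/demo')  # illustrative run directory
writer.add_scalar('Loss/train', 0.42, 0)  # scalar tracking: tag, value, step
writer.add_histogram('fc1/weight', torch.randn(256, 784), 0)  # distribution analysis
writer.add_images('predictions', torch.rand(4, 3, 28, 28), 0)  # image batch in (N, C, H, W)
writer.add_hparams({'lr': 0.01}, {'final_loss': 0.42})  # configuration vs. outcome
# writer.add_graph(model, sample_input) records the computation graph (see Basic Usage)
writer.close()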

Who Should Use This

This skill serves ML engineers monitoring training experiments, researchers comparing model architectures, and teams tracking experiment metrics across multiple runs.

Why Use It?

Problems It Solves

Training runs without visualization make it hard to detect divergence or overfitting early. Comparing experiments by reading log files is tedious and error-prone. Model architecture bugs hide in code but become visible in computation graphs. Gradient vanishing or exploding issues go undetected without distribution monitoring.

Core Highlights

Scalar plotter tracks training metrics over time. Graph viewer visualizes model architecture and data flow. Histogram tracker shows parameter distributions across epochs. Image logger records visual outputs during training.

How to Use It?

Basic Usage

from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn

writer = SummaryWriter('runs/experiment_1')

model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Linear(256, 10))

dummy = torch.randn(1, 784)
writer.add_graph(model, dummy)

for epoch in range(100):
    loss = 1.0 / (epoch + 1)
    acc = 1 - loss

    writer.add_scalar('Loss/train', loss, epoch)
    writer.add_scalar('Accuracy/train', acc, epoch)

    # Log weights
    for name, param in model.named_parameters():
        writer.add_histogram(name, param, epoch)

writer.close()
print('Logs saved to runs/experiment_1')

Real-World Examples

from torch.utils.tensorboard import SummaryWriter

class ExperimentTracker:
    """Thin wrapper that namespaces one SummaryWriter per experiment run."""

    def __init__(self, name: str):
        self.writer = SummaryWriter(f'runs/{name}')
        self.name = name
        self.step = 0

    def log_metrics(self, metrics: dict):
        for key, val in metrics.items():
            self.writer.add_scalar(key, val, self.step)
        self.step += 1

    def log_hparams(self, hparams: dict, metrics: dict):
        # Correlates a configuration with its final metrics in the HParams tab.
        self.writer.add_hparams(hparams, metrics)

    def log_images(self, tag: str, images):
        self.writer.add_images(tag, images, self.step)

    def close(self):
        self.writer.close()

configs = [
    {'lr': 0.01, 'batch': 32},
    {'lr': 0.001, 'batch': 64},
]

for i, cfg in enumerate(configs):
    tracker = ExperimentTracker(f'exp_{i}')
    for epoch in range(50):
        loss = 1.0 / (epoch + 1) * cfg['lr'] * 100  # synthetic loss for demonstration
        tracker.log_metrics({'Loss/train': loss})
    tracker.log_hparams(cfg, {'final_loss': loss})
    tracker.close()
    print(f'Exp {i}: lr={cfg["lr"]}')

Advanced Tips

Use separate SummaryWriter instances for each experiment run to enable side-by-side comparison in the TensorBoard dashboard. Log hyperparameters with add_hparams to correlate configurations with final metrics. Use custom scalars for tracking derived metrics like learning rate schedules alongside loss curves.
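
As an example of the last tip, here is a minimal sketch that logs the current learning rate next to the training loss so both appear on the same time axis (the model, scheduler, and tag names are illustrative assumptions):

from torch.utils.tensorboard import SummaryWriter
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Halve the learning rate every 10 steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
writer = SummaryWriter('runs/lr_schedule_demo')  # illustrative run name

for step in range(50):
    loss = model(torch.randn(8, 10)).pow(2).mean()  # dummy objective for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
    writer.add_scalar('Loss/train', loss.item(), step)
    # get_last_lr() returns one entry per parameter group.
    writer.add_scalar('LR', scheduler.get_last_lr()[0], step)

writer.close()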

When to Use It?

Use Cases

Monitor training loss and accuracy curves to detect overfitting and divergence early. Compare hyperparameter configurations across experiment runs to select optimal settings. Visualize model architecture graphs to verify network structure before training.

Related Topics

TensorBoard, PyTorch, TensorFlow, experiment tracking, ML visualization, training monitoring, and model debugging.

Important Notes

Requirements

TensorBoard installed alongside PyTorch or TensorFlow for logging support. A training script instrumented with SummaryWriter calls to record metrics and artifacts. A web browser to access the TensorBoard dashboard served on a local port.
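
The dashboard is typically launched from a shell with tensorboard --logdir runs and opened at the printed URL. It can also be started programmatically; below is a sketch assuming the tensorboard package's program module:

from tensorboard import program

tb = program.TensorBoard()
# argv mirrors the CLI: tensorboard --logdir runs --port 6006
tb.configure(argv=[None, '--logdir', 'runs', '--port', '6006'])
url = tb.launch()  # serves the dashboard from a background thread
print(f'TensorBoard available at {url}')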

Usage Recommendations

Do: log both training and validation metrics to detect overfitting by comparing curves; use meaningful run names that encode key hyperparameters for easy identification; and clean up old run directories to keep the dashboard manageable.
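
A minimal sketch of the first two recommendations (the metric values here are synthetic): a shared Loss/ prefix groups the train and validation curves onto comparable charts, and the run name encodes the key hyperparameters.

from torch.utils.tensorboard import SummaryWriter
import math

# Run name encodes the key hyperparameters for easy identification.
writer = SummaryWriter('runs/lr0.01_batch32')

for epoch in range(100):
    train_loss = math.exp(-epoch / 20)                 # synthetic: keeps improving
    val_loss = math.exp(-epoch / 20) + epoch * 0.005   # synthetic: starts to rise
    writer.add_scalar('Loss/train', train_loss, epoch)
    writer.add_scalar('Loss/val', val_loss, epoch)

writer.close()

In the dashboard, a widening gap between the two curves is the classic overfitting signature.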

Don't: log too frequently, since excessive writes slow training and inflate log sizes; forget to close the SummaryWriter, since buffered events may be lost; or rely solely on final metrics without examining training curves, since the trajectory reveals issues that endpoints miss.
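
The first two pitfalls can be handled structurally. A sketch, assuming SummaryWriter's context-manager support in recent PyTorch releases (the logging interval is an illustrative choice):

from torch.utils.tensorboard import SummaryWriter

LOG_EVERY = 100  # illustrative interval; tune to step time and disk budget

# The with-block closes (and flushes) the writer even if training raises.
with SummaryWriter('runs/throttled_demo') as writer:
    for step in range(10_000):
        loss = 1.0 / (step + 1)  # stand-in for a real training step
        if step % LOG_EVERY == 0:  # throttle writes instead of logging every step
            writer.add_scalar('Loss/train', loss, step)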

Limitations

TensorBoard is designed for single-machine visualization and may slow with very large log directories. Real-time updates depend on the dashboard refresh interval and may lag during fast training. Comparing many experiments simultaneously can make the dashboard cluttered and hard to read.