Clip

CLIP multimodal AI automation and integration for image and text tasks

Clip is a community skill for using the CLIP vision-language model for zero-shot image classification, image-text matching, and multimodal embedding extraction. It covers model loading, image preprocessing, text encoding, similarity computation, and application to visual search and content tagging.

What Is This?

Overview

Clip provides patterns for running OpenAI CLIP, which jointly encodes images and text into a shared embedding space. It covers model loading that initializes CLIP variants with different backbone sizes; image preprocessing that resizes, crops, and normalizes input images for the vision encoder; text encoding that converts label descriptions into embedding vectors; similarity computation that ranks images against text labels, or text against images, by cosine similarity; and embedding extraction that outputs feature vectors for building custom retrieval and classification systems. The skill enables zero-shot visual understanding without task-specific training.

Who Should Use This

This skill serves developers building image search with natural language queries, content teams automating image tagging and categorization, and researchers using multimodal embeddings for cross-modal retrieval tasks.

Why Use It?

Problems It Solves

Traditional image classifiers require labeled training data for every category, and adding a new category means retraining with additional labeled examples. Building image search with natural language queries requires aligned image and text representations. Content moderation and tagging at scale needs automated visual understanding without per-category training.

Core Highlights

Zero-shot classifier matches images to text descriptions without task-specific training. Embedding extractor produces aligned image and text vectors for retrieval. Similarity ranker scores image-text pairs by cosine similarity. Multi-label tagger assigns multiple text labels to images based on similarity thresholds (see the sketch after Basic Usage below).

How to Use It?

Basic Usage

import torch
import clip
from PIL import Image

# Use the GPU when available; CLIP inference is considerably faster on CUDA.
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the ViT-B/32 variant together with its matching image preprocessing pipeline.
model, preprocess = clip.load('ViT-B/32', device=device)

def classify_image(image_path: str, labels: list[str]) -> list[dict]:
    # Resize, center-crop, and normalize the image, then add a batch dimension.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    # Tokenize the candidate labels for the text encoder.
    text = clip.tokenize(labels).to(device)

    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)

        # Normalize both sides so the dot product is cosine similarity,
        # then apply CLIP's logit scale (100) before the softmax.
        image_feat /= image_feat.norm(dim=-1, keepdim=True)
        text_feat /= text_feat.norm(dim=-1, keepdim=True)
        similarity = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)
        probs = similarity[0].cpu().tolist()

    return sorted(
        [{'label': label, 'score': score} for label, score in zip(labels, probs)],
        key=lambda x: x['score'],
        reverse=True,
    )
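
A short usage sketch for classify_image, plus a thresholded variant implementing the multi-label tagger mentioned in Core Highlights. The image path, labels, and the 0.25 cutoff are illustrative assumptions, and tag_image is a hypothetical helper rather than part of the skill itself.

# Hypothetical image and candidate labels for illustration.
results = classify_image(
    'dog.jpg',
    ['a photo of a dog', 'a photo of a cat', 'a photo of a car'])
print(results[0])  # e.g. {'label': 'a photo of a dog', 'score': 0.97}

def tag_image(image_path: str, labels: list[str], threshold: float = 0.25) -> list[str]:
    # Same encoders as classify_image, but raw cosine similarity per label
    # (no softmax), so several labels can pass the threshold independently.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(labels).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
        sims = (img_feat @ txt_feat.T)[0].cpu().tolist()
    return [label for label, sim in zip(labels, sims) if sim >= threshold]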

Real-World Examples

import numpy as np

class CLIPSearchEngine:
    def __init__(self):
        self.embeddings = []
        self.paths = []

    def index(self, image_paths: list[str]):
        # Encode each image once and cache its normalized embedding.
        for path in image_paths:
            img = preprocess(Image.open(path)).unsqueeze(0).to(device)
            with torch.no_grad():
                emb = model.encode_image(img)
                emb /= emb.norm(dim=-1, keepdim=True)
            self.embeddings.append(emb.cpu().numpy())
            self.paths.append(path)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        # Encode the text query into the same embedding space as the images.
        text = clip.tokenize([query]).to(device)
        with torch.no_grad():
            text_emb = model.encode_text(text)
            text_emb /= text_emb.norm(dim=-1, keepdim=True)
        query_vec = text_emb.cpu().numpy()

        # On normalized vectors, cosine similarity is just a dot product.
        sims = [(query_vec @ emb.T).item() for emb in self.embeddings]
        ranked = sorted(enumerate(sims), key=lambda x: x[1], reverse=True)
        return [{'path': self.paths[i], 'score': s} for i, s in ranked[:top_k]]
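
A brief usage sketch for the search engine above; the image paths and query text are placeholders.

# Index a small collection once, then query it with natural language.
engine = CLIPSearchEngine()
engine.index(['photos/beach.jpg', 'photos/mountain.jpg', 'photos/city.jpg'])

for hit in engine.search('a sunny beach with palm trees', top_k=2):
    print(f"{hit['score']:.3f}  {hit['path']}")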

Advanced Tips

Use prompt engineering with templates like 'a photo of a {label}' to improve zero-shot classification accuracy. Normalize embeddings before computing similarity for consistent cosine distance measurements. Use the larger ViT-L/14 model variant for higher accuracy at the cost of increased inference time and memory.
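
A minimal sketch of the prompt-template tip: wrap raw label names in a caption-like sentence before tokenization. The template string and labels are illustrative assumptions; the official CLIP repository also ships larger template sets whose embeddings can be averaged per class.

def with_template(labels: list[str], template: str = 'a photo of a {}') -> list[str]:
    # 'dog' -> 'a photo of a dog'; full phrases match CLIP's training captions better.
    return [template.format(label) for label in labels]

results = classify_image('dog.jpg', with_template(['dog', 'cat', 'car']))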

When to Use It?

Use Cases

Build an image search engine that accepts natural language queries. Classify product images into categories without training a custom model. Tag user-uploaded photos with descriptive labels for content organization.

Related Topics

CLIP, zero-shot classification, visual search, multimodal embeddings, and image-text matching.

Important Notes

Requirements

GPU recommended for efficient inference. OpenAI CLIP library or Hugging Face transformers with CLIP support. PyTorch for model execution and tensor operations.
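
For the Hugging Face route mentioned above, a minimal sketch of the equivalent zero-shot call with transformers; the checkpoint name, image path, and labels are illustrative.

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# openai/clip-vit-base-patch32 mirrors the ViT-B/32 weights used earlier.
hf_model = CLIPModel.from_pretrained('openai/clip-vit-base-patch32')
hf_processor = CLIPProcessor.from_pretrained('openai/clip-vit-base-patch32')

inputs = hf_processor(
    text=['a photo of a dog', 'a photo of a cat'],
    images=Image.open('dog.jpg'),
    return_tensors='pt',
    padding=True,
)
outputs = hf_model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)  # image-to-text match probabilities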

Usage Recommendations

Do: use descriptive label phrases rather than single words for more accurate zero-shot classification. Pre-compute and cache image embeddings for large collections to avoid repeated inference. Evaluate classification accuracy on a labeled test set before deployment.
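
A hedged sketch of the caching recommendation, stacking the per-image embeddings produced by CLIPSearchEngine above and persisting them so large collections are not re-encoded on every run; the file names are placeholders.

# After engine.index(...), persist the normalized embeddings and their paths.
np.save('clip_embeddings.npy', np.vstack(engine.embeddings))
np.save('clip_paths.npy', np.array(engine.paths))

# On the next run, reload instead of re-encoding every image.
engine = CLIPSearchEngine()
engine.embeddings = list(np.load('clip_embeddings.npy')[:, None, :])  # restore (1, dim) rows
engine.paths = list(np.load('clip_paths.npy'))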

Don't: rely on CLIP for safety-critical classification without human verification, as accuracy varies by domain. Don't use CLIP embeddings for facial recognition or biometric applications. Don't assume zero-shot accuracy matches supervised classifiers trained on domain-specific data.

Limitations

Zero-shot accuracy is lower than supervised classifiers for domain-specific categories. CLIP text encoder has a 77-token context limit for label descriptions. Performance degrades on fine-grained categories that require expert visual knowledge. CLIP was trained on internet-sourced image-text pairs and inherits any biases present in that training data.
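
A hedged sketch of working within the 77-token limit: recent versions of the OpenAI clip package raise an error on over-long descriptions unless truncation is requested (verify the behavior against your installed version).

long_label = 'a photo of ' + 'very ' * 100 + 'long description'

try:
    tokens = clip.tokenize([long_label])                   # errors if the text exceeds 77 tokens
except RuntimeError:
    tokens = clip.tokenize([long_label], truncate=True)    # keep only the first 77 tokens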