Clip
CLIP multimodal AI automation and integration for image and text tasks
Clip is a community skill for using the CLIP vision-language model for zero-shot image classification, image-text matching, and multimodal embedding extraction. It covers model loading, image preprocessing, text encoding, and similarity computation, with applications to visual search and content tagging.
What Is This?
Overview
Clip provides patterns for running OpenAI CLIP, which jointly encodes images and text into a shared embedding space. It covers: model loading, which initializes CLIP variants with different backbone sizes; image preprocessing, which resizes, crops, and normalizes input images for the vision encoder; text encoding, which converts label descriptions into embedding vectors; similarity computation, which ranks images against text labels (or text against images) by cosine similarity; and embedding extraction, which outputs feature vectors for building custom retrieval and classification systems. The skill enables zero-shot visual understanding without task-specific training.
Who Should Use This
This skill serves developers building image search with natural language queries, content teams automating image tagging and categorization, and researchers using multimodal embeddings for cross-modal retrieval tasks.
Why Use It?
Problems It Solves
Traditional image classifiers require labeled training data for every category. Adding new categories to a classifier needs retraining with additional labeled examples. Building image search with natural language queries requires aligned image and text representations. Content moderation and tagging at scale needs automated visual understanding without per-category training.
Core Highlights
Zero-shot classifier matches images to text descriptions without task-specific training. Embedding extractor produces aligned image and text vectors for retrieval. Similarity ranker scores image-text pairs by cosine similarity. Multi-label tagger assigns multiple text labels to images based on similarity thresholds.
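The multi-label tagger above can be sketched in plain NumPy, assuming image and label embeddings have already been extracted with CLIP and L2-normalized; the `tag_image` helper and the 0.25 default threshold are illustrative, not part of the skill:

```python
import numpy as np

def tag_image(image_emb: np.ndarray,
              label_embs: np.ndarray,
              labels: list[str],
              threshold: float = 0.25) -> list[dict]:
    """Return every label whose cosine similarity to the image clears the threshold."""
    # With L2-normalized vectors, a dot product is exactly cosine similarity.
    sims = label_embs @ image_emb
    tags = [{'label': labels[i], 'score': float(s)}
            for i, s in enumerate(sims) if s >= threshold]
    return sorted(tags, key=lambda t: t['score'], reverse=True)
```

The threshold is the main tuning knob: raise it for precision, lower it for recall, and calibrate it on a small labeled sample since similarity scales vary across CLIP variants.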
How to Use It?
Basic Usage
```python
import torch
import clip
from PIL import Image

# Use a GPU when available; CLIP inference is much faster on CUDA.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

def classify_image(image_path: str, labels: list[str]) -> list[dict]:
    # Preprocess the image and tokenize the candidate labels.
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize(labels).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(text)
    # Softmax over the image-text dot products yields one score per label.
    similarity = (image_feat @ text_feat.T).softmax(dim=-1)
    probs = similarity[0].cpu().tolist()
    return sorted(
        [{'label': label, 'score': score} for label, score in zip(labels, probs)],
        key=lambda x: x['score'],
        reverse=True,
    )
```

Real-World Examples
```python
import numpy as np

class CLIPSearchEngine:
    """Index image embeddings and serve natural-language queries."""

    def __init__(self):
        self.embeddings = []
        self.paths = []

    def index(self, image_paths: list[str]):
        for path in image_paths:
            img = preprocess(Image.open(path)).unsqueeze(0).to(device)
            with torch.no_grad():
                emb = model.encode_image(img)
            # Normalize so that later dot products equal cosine similarity.
            emb /= emb.norm()
            self.embeddings.append(emb.cpu().numpy())
            self.paths.append(path)

    def search(self, query: str, top_k: int = 5) -> list[dict]:
        text = clip.tokenize([query]).to(device)
        with torch.no_grad():
            text_emb = model.encode_text(text)
        text_emb /= text_emb.norm()
        # Convert the query to NumPy once; the indexed embeddings are NumPy arrays.
        query_vec = text_emb.cpu().numpy()
        sims = [(query_vec @ emb.T).item() for emb in self.embeddings]
        ranked = sorted(enumerate(sims), key=lambda x: x[1], reverse=True)
        return [
            {'path': self.paths[i], 'score': score}
            for i, score in ranked[:top_k]
        ]
```

Advanced Tips
Use prompt engineering with templates like 'a photo of a {label}' to improve zero-shot classification accuracy. Normalize embeddings before comparing them so that dot products correspond to cosine similarity. Use the larger ViT-L/14 model variant for higher accuracy at the cost of increased inference time and memory.
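The prompt-template tip can be sketched as a pair of small helpers: expand each label into several prompt variants before encoding, then average the per-prompt scores back down to one score per label. The template list and the `expand_labels` / `average_by_label` names are illustrative, not part of the CLIP API:

```python
# Illustrative template set; CLIP's zero-shot accuracy often improves
# when labels are wrapped in natural-language prompts like these.
TEMPLATES = [
    'a photo of a {}',
    'a close-up photo of a {}',
    'a low-resolution photo of a {}',
]

def expand_labels(labels: list[str]) -> tuple[list[str], list[int]]:
    """Expand each label into prompt variants, plus an index mapping
    each prompt back to the label it came from."""
    prompts, owner = [], []
    for i, label in enumerate(labels):
        for template in TEMPLATES:
            prompts.append(template.format(label))
            owner.append(i)
    return prompts, owner

def average_by_label(scores: list[float], owner: list[int], n_labels: int) -> list[float]:
    """Average prompt-level scores down to one score per label."""
    totals = [0.0] * n_labels
    counts = [0] * n_labels
    for score, i in zip(scores, owner):
        totals[i] += score
        counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]
```

The expanded prompts would be passed to `clip.tokenize` in place of the raw labels, and the per-prompt similarities fed through `average_by_label` before ranking.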
When to Use It?
Use Cases
Build an image search engine that accepts natural language queries. Classify product images into categories without training a custom model. Tag user-uploaded photos with descriptive labels for content organization.
Related Topics
CLIP, zero-shot classification, visual search, multimodal embeddings, and image-text matching.
Important Notes
Requirements
GPU recommended for efficient inference. OpenAI CLIP library or Hugging Face transformers with CLIP support. PyTorch for model execution and tensor operations.
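A minimal setup sketch, assuming the OpenAI reference implementation of CLIP (installed from its GitHub repository, per that project's README) rather than the Hugging Face port:

```shell
# PyTorch plus CLIP's text-processing dependencies
pip install torch torchvision ftfy regex tqdm
# OpenAI CLIP reference implementation
pip install git+https://github.com/openai/CLIP.git
```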
Usage Recommendations
Do: use descriptive label phrases rather than single words for more accurate zero-shot classification. Pre-compute and cache image embeddings for large collections to avoid repeated inference. Evaluate classification accuracy on a labeled test set before deployment.
Don't: rely on CLIP for safety-critical classification without human verification, as accuracy varies by domain; use CLIP embeddings for facial recognition or biometric applications; or assume zero-shot accuracy matches supervised classifiers trained on domain-specific data.
Limitations
Zero-shot accuracy is lower than supervised classifiers for domain-specific categories. CLIP text encoder has a 77-token context limit for label descriptions. Performance degrades on fine-grained categories that require expert visual knowledge. CLIP was trained on internet-sourced image-text pairs and inherits any biases present in that training data.