Blip 2

Blip 2 automation and integration for multimodal vision-language AI tasks

Blip 2 is a community skill for using the BLIP-2 vision-language model for image captioning, visual question answering, and image-text matching. It covers model loading, image preprocessing, caption generation, VQA inference, and multimodal embedding extraction.

What Is This?

Overview

Blip 2 provides patterns for running the BLIP-2 model, which bridges frozen image encoders and large language models through a lightweight Querying Transformer (Q-Former). It covers model loading (initializing BLIP-2 variants with different LLM backends such as OPT and Flan-T5), image preprocessing (resizing and normalizing input images for the vision encoder), caption generation (producing natural language descriptions of image content), visual question answering (answering free-form questions about an image), and embedding extraction (producing aligned image-text representations for retrieval and matching). The skill enables multimodal understanding and generation with pretrained BLIP-2 models, without task-specific fine-tuning in many common scenarios.

Who Should Use This

This skill serves researchers working on vision-language tasks, developers building image captioning or visual search features, and teams needing multimodal embeddings for image-text retrieval systems. It is also relevant for machine learning engineers integrating vision capabilities into existing language model pipelines.

Why Use It?

Problems It Solves

Training vision-language models from scratch requires massive compute resources. Connecting image encoders to language models needs careful architectural design. Generating accurate image captions requires understanding visual content at a semantic level. Building image search with natural language queries requires aligned multimodal representations. Existing captioning models often produce generic descriptions that miss fine-grained visual details.

Core Highlights

Q-Former bridges frozen image encoders and LLMs without end-to-end training. Caption generator produces detailed natural language descriptions from images. VQA engine answers free-form questions about image content. Embedding extractor outputs aligned vectors for image-text retrieval.

How to Use It?

Basic Usage

from transformers import Blip2Processor, Blip2ForConditionalGeneration
from PIL import Image
import torch

model_id = 'Salesforce/blip2-opt-2.7b'

processor = Blip2Processor.from_pretrained(model_id)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto',
)

def caption_image(image_path: str) -> str:
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, return_tensors='pt').to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(output[0], skip_special_tokens=True)
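A quick usage check, with a placeholder file name chosen only for illustration:

print(caption_image('photo.jpg'))  # prints a short natural-language caption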

Real-World Examples

def answer_question(image_path: str, question: str) -> str:
    image = Image.open(image_path).convert('RGB')
    # BLIP-2 OPT checkpoints are prompted in the "Question: ... Answer:" format.
    prompt = f'Question: {question} Answer:'
    inputs = processor(images=image, text=prompt, return_tensors='pt').to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=50)
    return processor.decode(output[0], skip_special_tokens=True).strip()

def caption_batch(image_paths: list[str]) -> list[dict]:
    return [
        {'image': path, 'caption': caption_image(path)}
        for path in image_paths
    ]
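The embedding extraction mentioned in the overview is not covered by the helpers above. The sketch below shows one illustrative approach using transformers' Blip2Model: it mean-pools the Q-Former query outputs into a single image vector. The pooling and normalization choices are assumptions made here for illustration, and this loads the checkpoint a second time; for embeddings trained specifically for image-text matching, the dedicated BLIP-2 ITM/retrieval checkpoints are the usual choice.

from transformers import Blip2Model

# Reuses model_id and processor from Basic Usage. Note this loads a second
# copy of the weights, so free the captioning model first if VRAM is tight.
embed_model = Blip2Model.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map='auto',
)

def image_embedding(image_path: str) -> torch.Tensor:
    image = Image.open(image_path).convert('RGB')
    inputs = processor(images=image, return_tensors='pt').to(
        embed_model.device, torch.float16)
    with torch.no_grad():
        # One hidden state per learned query token: (1, num_queries, hidden).
        qformer_out = embed_model.get_qformer_features(**inputs)
    # Mean-pool the query outputs into a single vector (illustrative choice).
    emb = qformer_out.last_hidden_state.mean(dim=1).squeeze(0)
    return torch.nn.functional.normalize(emb, dim=-1)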

Advanced Tips

Use the Flan-T5 backend variant for instruction-following VQA tasks that benefit from the instruction-tuned LLM. Load the model in 8-bit precision with bitsandbytes to reduce VRAM requirements while maintaining output quality. Provide prompt prefixes like 'a photo of' to guide caption generation toward specific description styles. When processing batches, group images of similar dimensions together to improve throughput and reduce padding overhead during inference.
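As a rough sketch of two of these tips (8-bit loading and a caption prefix), assuming the bitsandbytes package is installed; the prefix string and token budget below are illustrative values, not requirements:

from transformers import BitsAndBytesConfig

# 8-bit quantized load; requires bitsandbytes and a CUDA GPU.
model_8bit = Blip2ForConditionalGeneration.from_pretrained(
    'Salesforce/blip2-opt-2.7b',
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map='auto',
)

def caption_with_prefix(image_path: str, prefix: str = 'a photo of') -> str:
    image = Image.open(image_path).convert('RGB')
    # The text prefix conditions generation toward a particular caption style.
    inputs = processor(images=image, text=prefix, return_tensors='pt').to(
        model_8bit.device, torch.float16)
    output = model_8bit.generate(**inputs, max_new_tokens=60)
    return processor.decode(output[0], skip_special_tokens=True)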

When to Use It?

Use Cases

Generate alt text for product images in an e-commerce catalog. Build a visual question answering interface that answers customer questions about product photos. Create an image search engine using BLIP-2 embeddings for natural language image retrieval.

Related Topics

Vision-language models, image captioning, visual question answering, multimodal AI, and BLIP-2 architecture.

Important Notes

Requirements

GPU with at least 8GB of VRAM for the 2.7B-parameter OPT variant. Transformers library version 4.27 or later with BLIP-2 support. Pillow for image loading and preprocessing. Sufficient disk space for model weights, which range from 5GB to 30GB depending on the LLM backend variant.

Usage Recommendations

Do: use float16 precision on GPU to halve memory usage with minimal quality loss. Resize large input images before processing to reduce memory consumption. Validate caption output on a test set before deploying for production use. Set max_new_tokens to control caption length and avoid excessively verbose output.
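A minimal sketch of the resizing recommendation above; the 1024-pixel cap is an arbitrary illustrative value (the processor still rescales images to the vision encoder's input size):

def load_resized(image_path: str, max_side: int = 1024) -> Image.Image:
    # Downscale oversized images before preprocessing; thumbnail() keeps
    # the aspect ratio and never upscales smaller images.
    image = Image.open(image_path).convert('RGB')
    image.thumbnail((max_side, max_side))
    return image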

Don't: run the model on CPU for production workloads as inference will be prohibitively slow. Rely solely on generated captions for safety-critical applications without human review. Assume VQA answers are factually correct without verification.

Limitations

Generated captions may hallucinate details not present in the image. VQA accuracy decreases for complex reasoning questions requiring multi-step inference, such as spatial relationship queries or questions involving numerical counting. Larger LLM backends like Flan-T5-XXL improve quality but require significantly more GPU memory. The model processes single images per inference call and does not natively support video input or multi-image reasoning.