Audiocraft

Automate and integrate Audiocraft audio generation into your projects

Audiocraft is a community skill for generating music and audio using Meta AudioCraft models, covering text-to-music generation, audio continuation, melody conditioning, sound effect synthesis, and model configuration for AI audio production workflows.

What Is This?

Overview

Audiocraft provides patterns for using Meta AudioCraft models to generate and manipulate audio programmatically. It covers text-to-music generation (creating musical compositions from natural language descriptions of genre, mood, and instrumentation), audio continuation (extending existing audio clips with coherent follow-up content), melody conditioning (generating music that follows a reference melody while varying arrangement and style), sound effect synthesis (creating environmental sounds and effects from text descriptions), and model configuration (tuning generation parameters for duration, quality, and inference speed). The skill lets developers build audio generation features into applications without deep expertise in digital signal processing or music theory.

Who Should Use This

This skill serves application developers adding AI music generation to creative tools, game developers generating dynamic background music and sound effects, and content creators building automated audio production pipelines. It is also relevant for researchers and hobbyists exploring generative audio techniques.

Why Use It?

Problems It Solves

Licensing commercial music for applications is expensive and legally complex. Creating original music requires specialized skills and production time. Dynamic audio that adapts to context needs real-time generation capability. Sound effect libraries are limited and require searching through large collections. AudioCraft addresses these constraints by enabling on-demand, royalty-free audio generation from simple text descriptions.

Core Highlights

MusicGen generates music from text prompts with controllable genre and mood. AudioGen produces sound effects from text descriptions. Melody conditioning preserves a reference melody while generating new arrangements. Duration and quality parameters balance generation speed with output fidelity.
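
The speed-versus-duration tradeoff follows from MusicGen's autoregressive decoding: audio is produced one EnCodec token frame at a time, so inference cost grows roughly linearly with the requested duration. A back-of-the-envelope sketch, assuming the approximately 50 Hz frame rate and 32 kHz output rate of the published MusicGen checkpoints:

```python
FRAME_RATE_HZ = 50        # approximate EnCodec token frame rate used by MusicGen
SAMPLE_RATE_HZ = 32_000   # output sample rate of the published checkpoints

def generation_cost(duration_s: float) -> dict:
    """Rough size of a MusicGen generation request for a given duration."""
    return {
        # autoregressive decode steps scale linearly with duration
        "decode_steps": int(duration_s * FRAME_RATE_HZ),
        # number of PCM samples in the resulting waveform
        "output_samples": int(duration_s * SAMPLE_RATE_HZ),
    }

print(generation_cost(15))  # the 15-second request used in the example below
```

This is why iterating on prompts at short durations is cheap: halving the duration roughly halves the number of decode steps.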

How to Use It?

Basic Usage

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-medium')
model.set_generation_params(
    duration=15,      # seconds of audio to generate
    temperature=1.0,  # sampling temperature
    top_k=250,        # top-k sampling
    cfg_coef=3.0,     # classifier-free guidance strength
)

descriptions = [
    'upbeat electronic dance music with synth pads and driving bass',
    'calm acoustic guitar with light percussion and ambient pads',
]

wav = model.generate(descriptions)  # tensor of shape [batch, channels, samples]

for i, audio in enumerate(wav):
    torchaudio.save(f'output_{i}.wav', audio.cpu(),
                    sample_rate=model.sample_rate)  # 32000 for MusicGen

Real-World Examples

from audiocraft.models import MusicGen
import torchaudio

model = MusicGen.get_pretrained('facebook/musicgen-melody')
model.set_generation_params(duration=20)

melody, sr = torchaudio.load('reference.wav')
if sr != model.sample_rate:
    # resample the reference to the model's expected rate
    melody = torchaudio.functional.resample(melody, sr, model.sample_rate)

wav = model.generate_with_chroma(
    descriptions=['orchestral arrangement with strings and piano'],
    melody_wavs=melody.unsqueeze(0),  # add a batch dimension
    melody_sample_rate=model.sample_rate,
)

torchaudio.save('conditioned.wav', wav[0].cpu(),
                sample_rate=model.sample_rate)

Advanced Tips

Adjust the cfg_coef parameter to control how closely the generation follows the text description: higher values produce more literal interpretations, while lower values allow more creative variation. Use the small model for rapid prototyping and the large model for production-quality output. Chain audio continuation with text conditioning to create longer compositions that evolve through multiple sections. When crafting prompts, include specific tempo indicators such as slow, mid-tempo, or fast alongside mood descriptors to improve output consistency.
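
The prompt-crafting advice above can be codified in a small helper. This build_prompt function and its field names are hypothetical, not part of the AudioCraft API; they just show one way to keep tempo and mood explicit in every prompt:

```python
def build_prompt(genre, instruments, tempo="mid-tempo", mood=None):
    """Assemble a MusicGen text prompt from structured fields.

    Keeping tempo and mood explicit, per the tip above, tends to make
    outputs more consistent across runs. Hypothetical helper, not an
    AudioCraft API.
    """
    parts = [tempo, genre]
    if mood:
        parts.append(f"{mood} mood")
    if instruments:
        parts.append("with " + " and ".join(instruments))
    return " ".join(parts)

prompt = build_prompt("electronic dance music",
                      ["synth pads", "driving bass"],
                      tempo="fast", mood="upbeat")
# -> "fast electronic dance music upbeat mood with synth pads and driving bass"
```

Structured prompts like this also make it easy to sweep tempo or mood values when iterating at short durations.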

When to Use It?

Use Cases

Generate background music for a video editing application from user text descriptions. Create dynamic game audio that adapts to gameplay context using text-conditioned generation. Build a sound effect generator for a content creation platform.

Related Topics

AI music generation, AudioCraft, MusicGen, audio synthesis, and generative audio.

Important Notes

Requirements

PyTorch with CUDA support for GPU-accelerated generation. The AudioCraft library installed from the Meta research repository. Sufficient GPU memory for model loading; the medium model requires approximately 4 GB of VRAM.
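
A typical setup, assuming a pip-based environment (the exact PyTorch install command varies by CUDA version, so check the PyTorch site for the right index URL):

```shell
# Install PyTorch and torchaudio first, matching your CUDA version
pip install torch torchaudio

# AudioCraft itself; installing from the GitHub repository also works
pip install audiocraft
```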

Usage Recommendations

Do: use descriptive prompts that specify genre, instruments, tempo, and mood for best results. Start with shorter durations to iterate on prompt quality before generating longer pieces. Use melody conditioning when you have a specific musical direction to maintain.

Don't: expect generated music to match the quality of professional studio productions; use the largest model for quick experiments, since it requires significant GPU resources; or generate very long durations in a single pass, as quality degrades beyond 30 seconds.

Limitations

Generated audio quality decreases at durations beyond roughly 30 seconds. The model does not generate vocals or lyrics; output is instrumental only. Real-time generation is not feasible on consumer hardware due to inference time. Model weights require significant disk space, with the large model exceeding 3 GB. Output audio is 32 kHz mono and may need resampling for production use. Generated audio varies between runs with the same prompt due to stochastic sampling.
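
Because output is 32 kHz mono, production pipelines usually resample before mixing, typically with torchaudio.functional.resample as in the melody example above. As a dependency-free illustration of what resampling does, here is a minimal linear-interpolation resampler; it is a sketch only, since linear interpolation aliases audibly and real pipelines should use a band-limited (sinc/polyphase) resampler:

```python
def resample_linear(samples, sr_in, sr_out):
    """Naive linear-interpolation resampler for a mono signal.

    Illustration only: linear interpolation introduces aliasing, so
    use a band-limited resampler (e.g. torchaudio) for real audio.
    """
    n_in = len(samples)
    n_out = max(1, round(n_in * sr_out / sr_in))
    out = []
    for i in range(n_out):
        # fractional position of output sample i in the input signal
        pos = i * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, n_in - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out

ramp = [0.0, 1.0, 2.0, 3.0]        # 4 samples at a nominal 2 Hz
up = resample_linear(ramp, 2, 4)   # 8 samples at a nominal 4 Hz
```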