Welcome to the MediaSFU AI Pipeline Guide! This guide helps you build audio, vision, and multimodal pipelines for AI-powered agents. Throughout this guide, you'll learn how to:
- Configure AI Credentials for Voice and Vision services.
- Build pipelines with STT, TTS, LLM, and custom processing steps.
- Manage data buffers for real-time audio and video frames.
- Handle errors effectively and return results to the client.
By the end of this guide, you'll have a comprehensive understanding of how to integrate speech recognition, text generation, speech synthesis, and image analysis into your MediaSFU applications.
Note: Dashboard-configured AI credentials take precedence over ephemeral parameters for the same keys (unless the dashboard field is empty). Use ephemeral parameters for additional fields not already set on the dashboard.
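The precedence rule in the note above can be sketched as a small merge function. This is only an illustration of the rule as stated, not MediaSFU's implementation: `resolve_credentials` and the key names (`stt_api_key`, `llm_api_key`, `tts_voice`) are hypothetical, and treating `None` and whitespace-only strings as "empty" is an assumption.

```python
from typing import Any, Dict, Mapping


def _is_empty(value: Any) -> bool:
    """Assumption: None and blank strings count as an empty dashboard field."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def resolve_credentials(
    dashboard: Mapping[str, Any], ephemeral: Mapping[str, Any]
) -> Dict[str, Any]:
    """Merge credentials so that non-empty dashboard values win per key.

    Ephemeral parameters are used for keys the dashboard does not set,
    or for keys whose dashboard field is empty.
    """
    resolved = dict(ephemeral)  # start from the ephemeral parameters
    for key, value in dashboard.items():
        if not _is_empty(value):
            resolved[key] = value  # dashboard takes precedence
    return resolved


# Hypothetical key names, for illustration only.
dashboard_fields = {"stt_api_key": "dash-stt", "llm_api_key": ""}
ephemeral_params = {
    "stt_api_key": "temp-stt",
    "llm_api_key": "temp-llm",
    "tts_voice": "voice-a",
}
merged = resolve_credentials(dashboard_fields, ephemeral_params)
```

Here `stt_api_key` resolves to the dashboard value, `llm_api_key` falls back to the ephemeral value because the dashboard field is empty, and `tts_voice` is carried through because the dashboard never set it.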
What the newer Media runtime makes explicit
The raw pipeline array is only one layer of the system. The production path also includes runtime selection, context assembly, observability, and escalation design.
A production turn is more than STT to LLM to TTS
- 01 Entry point attaches: A widget, SIP route, or headless room becomes live and exposes the socket and runtime state that will drive the buffers.
- 02 Turn detection packages input: Voice activity, silence windows, or frame cadence decide when MediaSFU has enough audio or vision data to assemble a turn.
- 03 Context is assembled: Transcript, prompts, provider settings, approved knowledge, and callable tools are combined before model execution.
- 04 The model answers or chooses an action: The agent can respond directly, call a tool, request clarification, or branch into an escalation and handoff path.
- 05 Output and audit artifacts are emitted: TTS playback, structured results, latency traces, summaries, and handoff context are returned to the client or operator surface.
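As a rough mental model, stages 02 through 05 above can be sketched in a few lines of Python. Everything here is hypothetical and is not the MediaSFU runtime API: `detect_turn_end`, `TurnContext`, `TurnResult`, `run_turn`, and the `decide` callback are illustrative names, the silence window is counted in frames as an assumption, and the model is replaced by a caller-supplied function.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# The four outcomes listed in stage 04.
ALLOWED_ACTIONS = {"respond", "tool_call", "clarify", "escalate"}


@dataclass
class TurnContext:
    """Stage 03: everything combined before model execution."""
    transcript: str
    system_prompt: str
    knowledge: List[str] = field(default_factory=list)
    tools: List[str] = field(default_factory=list)


@dataclass
class TurnResult:
    """Stage 05: the output plus a minimal latency trace."""
    action: str
    payload: str
    trace: Dict[str, float] = field(default_factory=dict)


def detect_turn_end(voiced: Sequence[bool], silence_frames: int) -> Optional[int]:
    """Stage 02: return the frame index where the turn closes, or None.

    A turn closes once speech has been heard and is followed by
    `silence_frames` consecutive unvoiced frames.
    """
    heard_speech = False
    silent_run = 0
    for index, is_voiced in enumerate(voiced):
        if is_voiced:
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= silence_frames:
                return index
    return None


def run_turn(
    transcript: str,
    decide: Callable[[TurnContext], Tuple[str, str]],
    system_prompt: str = "You are a helpful agent.",
    knowledge: Sequence[str] = (),
    tools: Sequence[str] = (),
) -> TurnResult:
    """Stages 03 to 05: assemble context, let the model decide, emit a result."""
    context = TurnContext(transcript, system_prompt, list(knowledge), list(tools))
    started = time.perf_counter()
    action, payload = decide(context)
    elapsed_ms = (time.perf_counter() - started) * 1000.0
    if action not in ALLOWED_ACTIONS:
        # Unknown actions are routed to the escalation path in this sketch.
        action, payload = "escalate", f"unsupported action: {action}"
    return TurnResult(action, payload, {"model_ms": elapsed_ms})


def stub_model(context: TurnContext) -> Tuple[str, str]:
    """Stand-in for a real LLM call: use a tool if one is available."""
    if context.tools:
        return "tool_call", context.tools[0]
    return "respond", f"You said: {context.transcript}"


turn_end = detect_turn_end([False, True, True, False, False, False], silence_frames=2)
result = run_turn("book a table", stub_model, tools=["create_booking"])
```

In this sketch the turn closes at frame index 4 (the second silent frame after speech), and the stub model chooses the `tool_call` action because a tool is available. A real deployment would replace `stub_model` with the configured LLM step and attach richer audit artifacts than a single timing value.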
Building custom apps? Start from these GitHub repos: