Before You Continue

This guide is for SDK builders creating custom AI apps.

If you are not building with the SDK, you likely do not need this setup. Use the Widgets Guide for low-configuration embeds, or the Telephony Guide for phone and SIP workflows.

Headless agent quickstart

Use MediaSFU as the room and socket engine behind your custom agent UI.

The source-backed flow is consistent across the repo: protect room creation and join credentials with a backend proxy, mount the SDK in headless mode, then wait for the SDK to expose the room socket before you start voice or multimodal buffers.

Backend proxy first | Headless room attach | Voice + vision buffers

Step 01: Proxy room create and join
Keep production keys on your backend. The repo docs recommend a backend proxy or localLink-based routing so public clients never ship real MediaSFU secrets.

Step 02: Mount MediaSFU with no visible UI
Render the SDK in headless mode and feed room state through sourceParameters plus updateSourceParameters. This is the same pattern used by the headless handler and widget blocks.

Step 03: Wait for room attach
Do not start the agent from the REST create or join response. Wait until MediaSFU pushes roomName and socket or localSocket into SDK state after signaling attaches.

Step 04: Start voice or multimodal buffers
After the room is attached, use the exposed socket to startDataBuffer, react to startBuffers, and listen for pipelineResult or pipelineResultVision instead of making text sessions your default entry point.

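As a rough illustration of Step 01, the sketch below shows how a backend route might assemble the upstream request so that credentials are attached only on the server. The helper name buildUpstreamRequest, the Bearer "apiUserName:apiKey" header format, and the upstream URL passed in are assumptions for illustration; confirm the real endpoint and auth scheme against the MediaSFU API documentation before relying on them.

```typescript
// Hypothetical helper: not part of the MediaSFU SDK.
interface ProxyCredentials {
  apiUserName: string;
  apiKey: string;
}

interface UpstreamRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Builds the server-to-MediaSFU request. The browser only sends `payload`;
// the real credentials are attached here, on the server.
function buildUpstreamRequest(
  payload: Record<string, unknown>,
  creds: ProxyCredentials,
  upstreamUrl: string // assumption: supplied from server config
): UpstreamRequest {
  if (!creds.apiUserName || !creds.apiKey) {
    throw new Error("MediaSFU credentials are not configured on the server");
  }

  // Drop credential-like fields a client may have tried to send.
  const safePayload: Record<string, unknown> = { ...payload };
  delete safePayload["apiUserName"];
  delete safePayload["apiKey"];

  return {
    url: upstreamUrl,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      // Assumed header format; verify against the MediaSFU API docs.
      Authorization: `Bearer ${creds.apiUserName}:${creds.apiKey}`,
    },
    body: JSON.stringify(safePayload),
  };
}
```

A route handler for /api/mediasfu/rooms could pass the result to fetch and relay the JSON response to the client. Keeping the builder pure makes it easy to unit test without network access.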
Headless MediaSFU attach pattern
import React, { useEffect, useMemo, useRef, useState } from "react";
import {
  MediasfuGeneric,
  PreJoinPage,
  type CreateJoinRoomType,
  type CreateMediaSFURoomOptions,
  type JoinMediaSFURoomOptions,
} from "mediasfu-reactjs";

// /api/mediasfu/rooms injects the real apiUserName/apiKey on the server.
const createOrJoinViaProxy: CreateJoinRoomType = async ({ payload }) => {
  const response = await fetch("/api/mediasfu/rooms", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });

  const data = await response.json();
  return { success: response.ok, data };
};

export function HeadlessAgentRoom() {
  const [sourceParameters, setSourceParameters] = useState<Record<string, any>>({});
  // A ref rather than state: flipping it must not re-run the effect below,
  // because the effect cleanup would detach the pipeline listeners.
  const bufferStartedRef = useRef(false);

  const noUIOptions = useMemo<CreateMediaSFURoomOptions | JoinMediaSFURoomOptions>(
    () => ({
      action: "create",
      duration: 15,
      capacity: 5,
      userName: "agent-user",
      eventType: "conference",
      dataBuffer: true,
      bufferType: "all",
    }),
    []
  );

  const bufferConfig = useMemo(
    () => ({
      audio: {
        format: "wav",
        channels: 1,
        sampleRate: 16000,
        pipeline: ["stt", "ttllm", "tts", "return"],
        sttNickName: "support-stt",
        llmNickName: "support-llm",
        ttsNickName: "support-tts",
        returnAudioFormat: "base64",
      },
      vision: {
        fps: 1.0,
        pipeline: ["visionllm", "tts", "return"],
        llmNickName: "support-vision",
        ttsNickName: "support-tts",
        returnAudioFormat: "base64",
      },
    }),
    []
  );

  // Prefer the local socket when it is connected; otherwise use the cloud socket.
  const socket = sourceParameters.localSocket?.id
    ? sourceParameters.localSocket
    : sourceParameters.socket;

  useEffect(() => {
    // Wait for signaling to attach; the REST create/join response is not enough.
    if (!socket?.id || !sourceParameters.roomName) return;

    // The server asks this member to attach to the buffer session.
    const onStartBuffers = () => {
      socket.emit(
        "startBuffer",
        {
          roomName: sourceParameters.roomName,
          member: sourceParameters.member || "agent-user",
        },
        (ack: { success?: boolean; reason?: string }) => {
          if (!ack?.success) {
            console.error("Buffer attach failed", ack?.reason);
          }
        }
      );
    };

    const onPipelineResult = (data: any) => {
      console.log("voice pipeline", data);
    };

    const onPipelineResultVision = (data: any) => {
      console.log("vision pipeline", data);
    };

    socket.on("startBuffers", onStartBuffers);
    socket.on("pipelineResult", onPipelineResult);
    socket.on("pipelineResultVision", onPipelineResultVision);

    // Request the buffer session once; the listeners above stay registered.
    if (!bufferStartedRef.current) {
      bufferStartedRef.current = true;
      socket.emit(
        "startDataBuffer",
        {
          roomName: sourceParameters.roomName,
          config: bufferConfig,
        },
        (ack: { success?: boolean; reason?: string }) => {
          if (!ack?.success) {
            bufferStartedRef.current = false; // allow a retry on the next effect run
            console.error("Buffer session start failed", ack?.reason);
          }
        }
      );
    }

    return () => {
      socket.off("startBuffers", onStartBuffers);
      socket.off("pipelineResult", onPipelineResult);
      socket.off("pipelineResultVision", onPipelineResultVision);
    };
  }, [bufferConfig, socket, sourceParameters.member, sourceParameters.roomName]);

  return (
    <div style={{ width: 0, height: 0, overflow: "hidden" }}>
      <MediasfuGeneric
        PrejoinPage={(options: any) => <PreJoinPage {...options} />}
        credentials={{ apiUserName: "dummy-user", apiKey: "dummy-key" }}
        returnUI={false}
        connectMediaSFU={true}
        noUIPreJoinOptions={noUIOptions}
        sourceParameters={sourceParameters}
        updateSourceParameters={setSourceParameters}
        createMediaSFURoom={createOrJoinViaProxy}
        joinMediaSFURoom={createOrJoinViaProxy}
      />
    </div>
  );
}
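
The readiness check in the component above (prefer localSocket when it has an id, then require both a socket id and a roomName) can be pulled into a small pure helper so it is testable outside React. This is a sketch only: getAttachedRoom and the structural types are hypothetical names, and they model just the fields this check reads rather than the full SDK state.

```typescript
// Minimal structural types: only the fields this check reads.
interface SocketLike {
  id?: string;
}

interface RoomState {
  roomName?: string;
  socket?: SocketLike;
  localSocket?: SocketLike;
}

interface AttachedRoom {
  roomName: string;
  socket: SocketLike;
}

// Returns the room name and usable socket once signaling has attached,
// or null while the room is still connecting.
function getAttachedRoom(state: RoomState): AttachedRoom | null {
  const socket = state.localSocket?.id ? state.localSocket : state.socket;
  if (!socket?.id || !state.roomName) return null;
  return { roomName: state.roomName, socket };
}
```

An effect could call getAttachedRoom(sourceParameters) and return early on null, which keeps the "wait for room attach" rule in one place.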
Full SDK headless guide

Overview

Welcome to the MediaSFU AI Pipeline Guide! This guide helps you build audio, vision, and multimodal pipelines for advanced AI-powered agents. Throughout this guide, you'll learn how to:

  • Configure AI Credentials for Voice and Vision services.
  • Build pipelines with STT, TTS, LLM, and custom processing steps.
  • Manage data buffers for real-time audio and video frames.
  • Handle errors effectively and return results to the client.

By the end of this guide, you'll have a comprehensive understanding of how to integrate speech recognition, text generation, speech synthesis, and image analysis into your MediaSFU applications.

Note: Dashboard-configured AI credentials take precedence over ephemeral parameters for the same keys (unless the dashboard field is empty). Use ephemeral parameters for additional fields not already set on the dashboard.
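
To make the precedence rule in the note concrete, here is a small sketch of how the merge could be modeled. It is not MediaSFU's implementation: resolveSettings is a hypothetical helper, and treating an undefined or whitespace-only string as "empty" is an assumption.

```typescript
type Settings = Record<string, string | undefined>;

// Assumption: "empty" means undefined or a blank string.
function isEmpty(value: string | undefined): boolean {
  return value === undefined || value.trim() === "";
}

// Dashboard values win for the same key unless the dashboard field is empty;
// ephemeral values fill those gaps and supply any additional fields.
function resolveSettings(dashboard: Settings, ephemeral: Settings): Record<string, string> {
  const resolved: Record<string, string> = {};
  const keys = Object.keys(dashboard).concat(Object.keys(ephemeral));
  for (const key of keys) {
    const fromDashboard = dashboard[key];
    const fromEphemeral = ephemeral[key];
    const chosen = isEmpty(fromDashboard) ? fromEphemeral : fromDashboard;
    if (chosen !== undefined && !isEmpty(chosen)) {
      resolved[key] = chosen;
    }
  }
  return resolved;
}
```

For example, with a dashboard of { llmNickName: "dash-llm", ttsNickName: "" } and ephemeral parameters { llmNickName: "temp-llm", ttsNickName: "temp-tts", sttNickName: "temp-stt" }, this sketch keeps "dash-llm", fills ttsNickName from the ephemeral value, and adds sttNickName as an extra field.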

What the newer Media runtime makes explicit

The raw pipeline array is only one layer of the system. The production path also includes runtime selection, context assembly, observability, and escalation design.

Runtime Surface: Route into the real Media runtime
The pipeline does not start at the first STT token. Room attach, widget or SIP entry, and runtime overrides decide which agent config and providers the incoming turn actually uses.

Context Assembly: Load tools and approved knowledge first
Useful turns are transcript plus persona, policy, retrieval, and callable tools. MCP integrations belong in the response path before the model answer is finalized.

Observability: Trace what happened on each turn
Capture transcript, latency, tool use, quality checkpoints, and summaries so you can explain why the agent responded the way it did and whether it met your SLA.

Fallback Design: Keep a human handoff path ready
Strong agent flows define escalation triggers, operator-ready summaries, and takeover paths for cases where confidence, policy, or customer intent requires a person.
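
The observability and fallback points above can be sketched as a per-turn trace record plus a review step that checks the latency SLA and decides whether to escalate. Everything here is hypothetical: the TurnTrace fields, the thresholds, and the reason labels are illustrative choices, not a MediaSFU schema.

```typescript
// Hypothetical per-turn trace; field names are illustrative only.
interface TurnTrace {
  turnId: string;
  transcript: string;
  latencyMs: number;
  toolsCalled: string[];
  confidence: number; // 0..1, however your stack estimates it
  policyFlagged: boolean;
  userAskedForHuman: boolean;
}

interface TurnPolicy {
  maxLatencyMs: number;
  minConfidence: number;
}

interface TurnReview {
  metSla: boolean;
  escalate: boolean;
  reasons: string[];
  operatorSummary: string;
}

// Checks one turn against the SLA and the escalation triggers
// (confidence, policy, customer intent) and builds a handoff summary.
function reviewTurn(trace: TurnTrace, policy: TurnPolicy): TurnReview {
  const reasons: string[] = [];
  if (trace.confidence < policy.minConfidence) reasons.push("low-confidence");
  if (trace.policyFlagged) reasons.push("policy");
  if (trace.userAskedForHuman) reasons.push("user-request");

  const tools = trace.toolsCalled.length > 0 ? trace.toolsCalled.join(", ") : "none";
  return {
    metSla: trace.latencyMs <= policy.maxLatencyMs,
    escalate: reasons.length > 0,
    reasons,
    operatorSummary: `Turn ${trace.turnId}: "${trace.transcript}" (latency ${trace.latencyMs} ms, tools: ${tools})`,
  };
}
```

Storing the review next to the raw trace gives an operator the reason for a handoff together with the context needed to take over.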

A production turn is more than STT to LLM to TTS

  1. Entry point attaches: A widget, SIP route, or headless room becomes live and exposes the socket and runtime state that will drive the buffers.
  2. Turn detection packages input: Voice activity, silence windows, or frame cadence decide when MediaSFU has enough audio or vision data to assemble a turn.
  3. Context is assembled: Transcript, prompts, provider settings, approved knowledge, and callable tools are combined before model execution.
  4. The model answers or chooses an action: The agent can respond directly, call a tool, request clarification, or branch into an escalation and handoff path.
  5. Output and audit artifacts are emitted: TTS playback, structured results, latency traces, summaries, and handoff context are returned to the client or operator surface.
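
To illustrate the turn detection step in the list above, the sketch below closes a turn after a fixed number of consecutive quiet frames. It is a generic silence-window example under stated assumptions (one energy value per audio frame, a fixed threshold), not MediaSFU's actual detection logic; detectTurns and its options are hypothetical names.

```typescript
interface TurnDetectorOptions {
  energyThreshold: number; // frames at or above this count as speech
  silenceFrames: number; // consecutive quiet frames that close a turn
}

// Splits per-frame energy values into turns. Each turn is a
// [startIndex, endIndexExclusive] pair covering its speech frames.
function detectTurns(energies: number[], opts: TurnDetectorOptions): Array<[number, number]> {
  const turns: Array<[number, number]> = [];
  let start = -1; // first speech frame of the open turn, -1 if none is open
  let lastSpeech = -1; // most recent speech frame

  energies.forEach((energy, i) => {
    if (energy >= opts.energyThreshold) {
      if (start === -1) start = i;
      lastSpeech = i;
    } else if (start !== -1 && i - lastSpeech >= opts.silenceFrames) {
      // The silence window elapsed: close the turn at the last speech frame.
      turns.push([start, lastSpeech + 1]);
      start = -1;
    }
  });

  // A turn still open when input ends is not emitted; more frames may follow.
  return turns;
}
```

With energies [0, 0.5, 0.6, 0, 0, 0.7, 0, 0, 0], a threshold of 0.3, and a two-frame silence window, this sketch yields two turns: frames 1 to 2 and frame 5.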

Building custom apps? Start from these GitHub repos: