Integration quickstart

Integrating Agora's real-time audio communication capabilities with OpenAI's language models enables dynamic, conversational AI experiences. This guide shows you how to set up a Python project that combines Agora's server-side Voice SDK with OpenAI's API to create an interactive, voice-driven assistant.

Understand the tech

The RealtimeKitAgent class manages the integration by connecting to an Agora channel for real-time audio streaming and to OpenAI's API for processing audio input and generating AI-driven responses. Audio frames captured from the Agora channel are streamed to OpenAI's API, where the AI processes the input. The API responses, which include transcribed text and synthesized voice output, are then delivered back to the Agora channel.

The code sets up tools that can be executed locally or passed through the API. This allows the AI to perform specific tasks, such as retrieving data from external sources. The agent processes various message types from OpenAI, such as audio responses, transcription updates, and error messages, and sends them to users through the Agora audio channel, facilitating continuous interaction.
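
As a rough illustration of how locally executed tools might be registered and described to the model, the following sketch uses a hypothetical LocalToolRegistry class. It is not the demo's ToolContext: the method names register, describe, and call, and the get_weather tool, are assumptions made for this example only.

```python
import json
from typing import Any, Callable


class LocalToolRegistry:
    """Hypothetical stand-in for a tool context; not the demo's ToolContext."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, dict[str, Any], Callable[..., Any]]] = {}

    def register(
        self,
        name: str,
        description: str,
        parameters: dict[str, Any],
        fn: Callable[..., Any],
    ) -> None:
        # Store the callable together with the metadata the model needs
        self._tools[name] = (description, parameters, fn)

    def describe(self) -> list[dict[str, Any]]:
        # Build the list of tool descriptions that would be sent to the model
        return [
            {"name": name, "description": desc, "parameters": params}
            for name, (desc, params, _) in self._tools.items()
        ]

    def call(self, name: str, arguments_json: str) -> Any:
        # Run a tool locally with JSON-encoded arguments
        _, _, fn = self._tools[name]
        return fn(**json.loads(arguments_json))


def get_weather(city: str) -> dict[str, str]:
    # Placeholder: a real tool would query an external weather service
    return {"city": city, "forecast": "unknown"}


registry = LocalToolRegistry()
registry.register(
    "get_weather",
    "Look up the weather for a city",
    {"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
    get_weather,
)
```

Calling registry.describe() yields the metadata to advertise to the model, and registry.call("get_weather", '{"city": "Paris"}') runs the tool locally when the model requests it.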

The following figure illustrates the integration topology:

Prerequisites

Set up the project

This guide walks you through the core elements of the Agora Conversational AI Demo integrating Agora's Python SDK with OpenAI's Realtime API:

Download the Agora Conversational AI Demo code.
The project is structured as follows:

/realtime_agent
├── __init__.py
├── agent.py
├── agora
│   ├── __init__.py
│   ├── requirements.txt
│   ├── rtc.py
│   └── token_builder
│       ├── AccessToken2.py
│       ├── Packer.py
│       ├── RtcTokenBuilder2.py
│       └── realtimekit_token_builder.py
├── parse_args.py
└── realtimeapi
    ├── __init__.py
    ├── call_tool.py
    ├── client.py
    ├── messages.py
    ├── mic_to_websocket.py
    ├── push_to_talk.py
    ├── send_audio_to_websocket.py
    └── util.py

Note
This project uses the OpenAI realtimeapi-examples package. Download the project and unzip it into your realtime-agent folder.

Overview of key files:
- agent.py: The primary script responsible for executing the RealtimeKitAgent. It integrates Agora's functionality from the rtc.py module and OpenAI's capabilities from the realtimeapi package.
- rtc.py: Contains an implementation of the server-side Agora Python Voice SDK.
- parse_args.py: Handles command-line argument parsing for the application.
- realtimeapi/: Contains the classes and methods that interact with OpenAI's Realtime API.

Create the .env file by copying the .env.example file in the root of the repo:

cp .env.example .env

Fill in the values for the environment variables:

# Agora RTC app ID
AGORA_APP_ID=
AGORA_APP_CERT=

# OpenAI API key for authentication
OPENAI_API_KEY=

Create a virtual environment and activate it:

python3 -m venv venv && source venv/bin/activate

Install the required dependencies:

pip install -r requirements.txt

Run the demo server:

python -m realtime_agent.agent --channel_name=<channel_name> --uid=<agent_uid>
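
The run command takes a channel name and an agent UID. The following is a minimal sketch of how such arguments could be parsed with argparse. The function name parse_agent_args is hypothetical, and the demo's parse_args.py may be implemented differently; only the two flag names come from the command above.

```python
import argparse
from typing import Optional, Sequence


def parse_agent_args(argv: Optional[Sequence[str]] = None) -> dict[str, str]:
    """Hypothetical parser; the demo's parse_args.py may differ."""
    parser = argparse.ArgumentParser(description="Run the voice agent")
    parser.add_argument("--channel_name", required=True, help="Agora channel to join")
    # The UID stays a string, matching how the Python SDK expects it
    parser.add_argument("--uid", required=True, help="UID of the agent")
    args = parser.parse_args(argv)
    return {"channel_name": args.channel_name, "uid": args.uid}
```

Passing argv=None makes argparse read sys.argv, while passing an explicit list keeps the function easy to test.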

Implementation

The RealtimeKitAgent class integrates Agora's audio communication capabilities with OpenAI's AI services. This class manages audio streams, handles communication with the OpenAI API, and processes AI-generated responses, providing a seamless conversational AI experience.

Connect to Agora and OpenAI

The setup_and_run_agent method sets up the RealtimeKitAgent by connecting to an Agora channel using the provided RtcEngine and initializing a session with the OpenAI Realtime API client. It sends configuration messages to set up the session and define conversation parameters, such as the system message and output audio format, before starting the agent's operations. The method uses asynchronous execution to handle both listening for the session start and sending conversation configuration updates concurrently. It ensures that the connection is properly managed and cleaned up after use, even in cases of exceptions, early exits, or shutdowns.

Note

UIDs in the Python SDK are set using a string value. Agora recommends using only numerical values for UID strings to ensure compatibility with all Agora products and extensions.

@classmethod
async def setup_and_run_agent(
    cls,
    *,
    engine: RtcEngine,
    options: RtcOptions,
    inference_config: InferenceConfig,
    tools: ToolContext | None,
) -> None:
    # Create and connect to an Agora channel
    channel = engine.create_channel(options)
    await channel.connect()
    try:
        # Initialize the OpenAI Realtime API client
        async with RealtimeApiClient(
            base_uri="wss://api.openai.com",
            api_key=os.getenv("OPENAI_API_KEY"),
            verbose=False,
        ) as client:
            # Update the session configuration
            await client.send_message(
                messages.SessionUpdate(
                    session=messages.SessionUpdateParams(
                        turn_detection=inference_config.turn_detection,
                        tools=tools.model_description() if tools else None,
                        tool_choice="auto",
                        instructions=inference_config.system_message,
                    )
                )
            )
            # Concurrently wait for the session to start and update the conversation config
            [start_session_message, _] = await asyncio.gather(
                *[
                    anext(client.listen()),
                    client.send_message(
                        messages.UpdateConversationConfig(
                            system_message=inference_config.system_message,
                            output_audio_format=messages.AudioFormats.PCM16,
                            voice=inference_config.voice,
                            tools=tools.model_description() if tools else None,
                            transcribe_input=False,
                        )
                    ),
                ]
            )
            logger.info(
                f"Session started: {start_session_message.session.id} model: {start_session_message.session.model}"
            )
            # Create and run the RealtimeKitAgent
            agent = cls(
                client=client,
                tools=tools,
                channel=channel,
            )
            await agent.run()
    finally:
        # Ensure the Agora engine is destroyed, even if an exception occurs
        engine.destroy()

Initialize the RealtimeKitAgent

The RealtimeKitAgent class constructor accepts an OpenAI RealtimeApiClient, an optional ToolContext for function registration, and an Agora channel for managing audio communication. This setup initializes the agent to process audio streams, register tools (if provided), and interact with the AI model.

def __init__(
    self,
    *,
    client: RealtimeApiClient,
    tools: ToolContext | None,
    channel: Channel,
) -> None:
    self.client = client  # OpenAI Realtime API client
    self.tools = tools  # Optional tool context for function registration
    self._client_tool_futures = {}  # For managing asynchronous tool calls
    self.channel = channel  # Agora channel for audio communication
    self.subscribe_user = None  # Will store the user ID we're subscribing to
    self.audio_queue = asyncio.Queue()  # Model audio awaiting playback; used by the streaming methods

Launch the Agent

The run method orchestrates the main operations of the RealtimeKitAgent. It manages audio streaming, processes tasks related to audio input, output, and model messages, and ensures exception handling is in place.

async def run(self) -> None:
    try:
        # Helper function to log unhandled exceptions in tasks
        def log_exception(t: asyncio.Task[Any]) -> None:
            if not t.cancelled() and t.exception():
                logger.error(
                    "unhandled exception",
                    exc_info=t.exception(),
                )
        
        logger.info("Waiting for remote user to join")
        # Wait for a remote user to join the channel
        self.subscribe_user = await wait_for_remote_user(self.channel)
        logger.info(f"Subscribing to user {self.subscribe_user}")
        # Subscribe to the audio of the joined user
        await self.channel.subscribe_audio(self.subscribe_user)
        # Handle user leaving the channel
        async def on_user_left(agora_rtc_conn: RTCConnection, user_id: int, reason: int):
            logger.info(f"User left: {user_id}")
            if self.subscribe_user == user_id:
                self.subscribe_user = None
                logger.info("Subscribed user left, disconnecting")
                await self.channel.disconnect()
                
        self.channel.on("user_left", on_user_left)
        # Set up a future to track when the agent should disconnect
        disconnected_future = asyncio.Future[None]()
        # Handle connection state changes
        def callback(agora_rtc_conn: RTCConnection, conn_info: RTCConnInfo, reason):
            logger.info(f"Connection state changed: {conn_info.state}")
            if conn_info.state == 1:  # Disconnected state
                if not disconnected_future.done():
                    disconnected_future.set_result(None)
        self.channel.on("connection_state_changed", callback)
        # Start tasks for streaming audio and processing messages
        asyncio.create_task(self._stream_input_audio_to_model()).add_done_callback(
            log_exception
        )
        asyncio.create_task(
            self._stream_audio_queue_to_audio_output()
        ).add_done_callback(log_exception)
        asyncio.create_task(self._process_model_messages()).add_done_callback(
            log_exception
        )
        # Wait until the agent is disconnected
        await disconnected_future
        logger.info("Agent finished running")
    except asyncio.CancelledError:
        logger.info("Agent cancelled")
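
The shutdown logic in run can be hard to follow inside the full method, so here is a stripped-down sketch of the same idea: background tasks run until a future is resolved by a disconnect signal. The worker and simulate_disconnect coroutines are hypothetical stand-ins for the streaming tasks and the SDK callback. Unlike the method above, this sketch keeps references to its tasks and cancels them on exit, which is a design choice of the example rather than something the demo is known to do.

```python
import asyncio
import logging

logger = logging.getLogger("agent_sketch")


def log_exception(task: asyncio.Task) -> None:
    # Surface failures from background tasks instead of losing them
    if not task.cancelled() and task.exception():
        logger.error("unhandled exception", exc_info=task.exception())


async def run_until_disconnected() -> list[str]:
    events: list[str] = []
    disconnected = asyncio.get_running_loop().create_future()

    async def worker() -> None:
        # Stand-in for a streaming task; it runs until cancelled
        while True:
            events.append("tick")
            await asyncio.sleep(0.01)

    async def simulate_disconnect() -> None:
        # Stand-in for the connection-state callback firing
        await asyncio.sleep(0.05)
        if not disconnected.done():
            disconnected.set_result(None)

    # Keep references so the tasks are not garbage-collected while running
    tasks = [asyncio.create_task(worker()), asyncio.create_task(simulate_disconnect())]
    for task in tasks:
        task.add_done_callback(log_exception)

    await disconnected
    # Cancel whatever is still running once the connection is gone
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    events.append("disconnected")
    return events
```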

Stream input audio to the AI model

The _stream_input_audio_to_model method captures audio frames from the Agora channel and sends them to the OpenAI API client for real-time processing by the AI model.

async def _stream_input_audio_to_model(self) -> None:
    # Wait until we have a subscribed user
    while self.subscribe_user is None:
        await asyncio.sleep(0.1)
    # Get the audio frame stream for the subscribed user
    audio_frames = self.channel.get_audio_frames(self.subscribe_user)
    async for audio_frame in audio_frames:
        try:
            # Send the audio frame to the OpenAI model via the API client
            await self.client.send_audio_data(audio_frame.data)
        except Exception as e:
            logger.error(f"Error sending audio data to model: {e}")

Stream audio from the AI model to the user

The _stream_audio_queue_to_audio_output method handles the playback of processed audio data from the AI model. It retrieves audio frames from a queue and sends them to the Agora channel, allowing users to hear AI-generated responses in real time.

async def _stream_audio_queue_to_audio_output(self) -> None:
    while True:
        # Get the next audio frame from the queue (contains audio data from the model)
        frame = await self.audio_queue.get()
        # Send the frame to the Agora channel for playback to the user
        await self.channel.push_audio_frame(frame)
        await asyncio.sleep(0)  # Allow other tasks to run
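
The queue decouples the task that receives model audio from the task that plays it. The sketch below shows that producer/consumer arrangement with plain asyncio. The push callback stands in for the channel's push_audio_frame call, and the None end-of-stream marker is a hypothetical addition that lets the example terminate; the demo's loop runs until it is cancelled.

```python
import asyncio
from typing import Awaitable, Callable


async def play_queue(
    queue: asyncio.Queue, push: Callable[[bytes], Awaitable[None]]
) -> None:
    while True:
        frame = await queue.get()
        if frame is None:  # hypothetical end-of-stream marker, not used by the demo
            break
        await push(frame)


async def demo_playback() -> list[bytes]:
    played: list[bytes] = []

    async def push(frame: bytes) -> None:
        # Stand-in for pushing a frame to the Agora channel
        played.append(frame)

    queue: asyncio.Queue = asyncio.Queue()
    consumer = asyncio.create_task(play_queue(queue, push))
    # Producer side: frames arrive in order and are played in the same order
    for frame in (b"\x01\x02", b"\x03\x04"):
        await queue.put(frame)
    await queue.put(None)
    await consumer
    return played
```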

Process model messages

The _process_model_messages method listens for messages from the OpenAI API client and processes them based on their type. It handles a variety of message types, including audio deltas, transcriptions, and errors.

async def _process_model_messages(self) -> None:
    # Listen for incoming messages from the OpenAI API client
    async for message in self.client.listen():
        # Process each type of message received from the client
        match message:
            case messages.ResponseAudioDelta():
                # Process incoming audio data from the model
                await self.audio_queue.put(base64.b64decode(message.delta))
            case messages.ResponseAudioTranscriptDelta():
                # Handle incoming transcription updates
                logger.info(f"Received text message {message=}")
                await self.channel.chat.send_message(ChatMessage(message=message.model_dump_json(), msg_id=message.item_id))
            case messages.ResponseAudioTranscriptDone():
                # Handle completed transcriptions
                logger.info(f"Text message done: {message=}")
                await self.channel.chat.send_message(ChatMessage(message=message.model_dump_json(), msg_id=message.item_id))
            case messages.InputAudioBufferSpeechStarted():
                # Handle the start of speech in the input audio
                pass
            case messages.InputAudioBufferSpeechStopped():
                # Handle the end of speech in the input audio
                pass
            case messages.InputAudioBufferCommitted():
                # Handle when an input audio buffer is committed
                pass
            case messages.ItemCreated():
                # Handle when a new item is created in the conversation
                pass
            case messages.ResponseCreated():
                # Handle when a new response is created
                pass
            case messages.ResponseOutputItemAdded():
                # Handle when a new output item is added to the response
                pass
            case messages.ResponseContenPartAdded():
                # Handle when a new content part is added to the response
                pass
            case messages.ResponseAudioDone():
                # Handle when the audio response is complete
                pass
            case messages.ResponseContentPartDone():
                # Handle when a content part of the response is complete
                pass
            case messages.ResponseOutputItemDone():
                # Handle when an output item in the response is complete
                pass
            case _:
                # Log any unhandled or unknown message types
                logger.warning(f"Unhandled message {message=}")

Main entry point

The main entry point of the application sets up the Agora RTC engine, configures the options, and launches the RealtimeKitAgent.

if __name__ == "__main__":
    # Load environment variables from .env file
    load_dotenv()
    
    # Parse command line arguments
    options = parse_args_realtimekit()
    logger.info(f"channel_id: {options['channel_name']}, uid: {options['uid']}")
    
    # Ensure the Agora App ID is set
    if not os.environ.get("AGORA_APP_ID"):
        raise ValueError("Need to set environment variable AGORA_APP_ID")
    
    # Run the RealtimeKitAgent
    asyncio.run(
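        RealtimeKitAgent.setup_and_run_agent(
            # Initialize the RtcEngine with Agora credentials
            engine=RtcEngine(appid=os.environ.get("AGORA_APP_ID"), appcert=os.environ.get("AGORA_APP_CERT")),
            # No tools are registered in this example; pass a ToolContext to enable tool calls
            tools=None,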
            # Configure RTC options
            options=RtcOptions(
                channel_name=options['channel_name'],
                uid=options['uid'],
                sample_rate=SAMPLE_RATE,
                channels=CHANNELS
            ),
            # Configure inference settings
            inference_config=InferenceConfig(
                # Set up the AI assistant's behavior
                system_message="""\
You are a helpful assistant. If asked about the weather make sure to use the provided tool to get that information. \
If you are asked a question that requires a tool, say something like "working on that" and dont provide a concrete response \
until you have received the response to the tool call.\
""",
                voice=messages.Voices.Alloy,
                # Configure voice activity detection
                turn_detection=messages.ServerVAD(
                    threshold=0.5,
                    prefix_padding_ms=500,
                    suffix_padding_ms=200,
                ),
            ),
        )
    )
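
The sample_rate and channels options determine how much data each PCM16 audio frame carries. The sketch below works through that arithmetic. This guide does not show the values of SAMPLE_RATE and CHANNELS, so the 24 kHz mono figures here are assumptions for illustration only; the one fixed quantity is that PCM16 uses two bytes per sample.

```python
BYTES_PER_SAMPLE = 2  # PCM16 stores each sample as a 16-bit integer

# Assumed values for illustration; the demo's constants may differ
ASSUMED_SAMPLE_RATE = 24_000
ASSUMED_CHANNELS = 1


def pcm16_frame_bytes(sample_rate: int, channels: int, frame_ms: int) -> int:
    # Samples per channel in one frame, then scaled by channels and sample width
    samples_per_channel = sample_rate * frame_ms // 1000
    return samples_per_channel * channels * BYTES_PER_SAMPLE
```

With the assumed values, a 10 ms frame holds 240 samples and therefore 480 bytes.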

Test the code

Update the values for AGORA_APP_ID, AGORA_APP_CERT, and OPENAI_API_KEY in the project's .env file.

This step ensures that the necessary credentials for Agora and OpenAI are correctly configured in your project.

Execute the following command to run the demo:

python3 agent.py --channel_name=your_channel_name --uid=your_user_id

This command launches the agent.py script, initializing the Agora channel and the OpenAI API connection. Replace your_channel_name with the desired channel name and your_user_id with a unique user ID.
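
Before launching, it can help to confirm that the credentials are actually present in the environment. The following sketch reports any variables that are unset or empty. The helper name missing_env_vars is hypothetical, and treating all three variables as required is an assumption: the demo's entry point only checks AGORA_APP_ID.

```python
import os
from typing import Mapping

REQUIRED_VARS = ("AGORA_APP_ID", "AGORA_APP_CERT", "OPENAI_API_KEY")


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    # Treat unset and empty values the same way: neither is usable
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

An empty list means every variable has a value. Otherwise the returned names show which entries in the .env file still need to be filled in.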
