AssemblyAI’s Streaming Speech-to-Text (STT) service allows you to transcribe live audio streams with high accuracy and low latency. By streaming your audio data to our secure WebSocket API, you can receive transcripts back within a few hundred milliseconds, and our system continues to revise these transcripts with greater accuracy over time as more context arrives.
Install `portaudio` first. Additionally, install the `websocket-client` package.
The streaming endpoint is `wss://api.assemblyai.com/v2/realtime/ws`.
Authenticate your request by including your API key in the authorization header of your WebSocket connection, and provide the sample rate of your audio data as a query parameter to the streaming endpoint.
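As a minimal sketch of the step above, the connection URL and header can be assembled before any socket is opened. The helper name `build_connection` and the placeholder key are illustrative and not part of the API; the header name follows the "authorization header" wording above.

```python
from urllib.parse import urlencode

STREAMING_ENDPOINT = "wss://api.assemblyai.com/v2/realtime/ws"


def build_connection(api_key, sample_rate):
    """Return (url, headers) for a streaming session (hypothetical helper)."""
    if not isinstance(sample_rate, int) or sample_rate <= 0:
        raise ValueError("sample_rate must be a positive integer")
    # The sample rate travels as a query parameter on the endpoint URL.
    url = f"{STREAMING_ENDPOINT}?{urlencode({'sample_rate': sample_rate})}"
    # The API key travels in the authorization header.
    headers = {"Authorization": api_key}
    return url, headers


url, headers = build_connection("YOUR_API_KEY", 16000)
print(url)
```

With the `websocket-client` package, these values would typically be passed as `websocket.WebSocketApp(url, header=headers, ...)`.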
Use the `message` event to load the incoming data as JSON and extract the text.
Use the `message` event to print the transcript, conditionally prepended with a string that signifies if the transcript is partial or final.
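The message handling described above might look roughly like the following sketch. The `Partial:` and `Final:` prefixes and the helper name `format_transcript` are illustrative choices; the `message_type` and `text` fields are the ones documented in the transcript tables on this page.

```python
import json


def format_transcript(raw_message):
    """Return a printable line for a transcript message, or None otherwise."""
    data = json.loads(raw_message)
    text = data.get("text", "")
    if not text:
        return None  # nothing to print (e.g. silence or a non-transcript message)
    if data.get("message_type") == "FinalTranscript":
        return f"Final: {text}"
    if data.get("message_type") == "PartialTranscript":
        return f"Partial: {text}"
    return None


def on_message(ws, message):
    """Handler in the (ws, message) shape used by websocket-client."""
    line = format_transcript(message)
    if line is not None:
        print(line)
```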
Use the `open` event to stream data from the microphone.
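The sketch below shows one way to split raw PCM audio into fixed-size chunks and hand each one to a send function. The chunk size of 3200 frames is an assumption made for illustration, and a buffer of silence stands in for real microphone input so that the example runs without audio hardware.

```python
def iter_chunks(pcm, frames_per_chunk=3200, bytes_per_frame=2):
    """Yield consecutive chunks of raw PCM audio (chunk size is an assumption)."""
    step = frames_per_chunk * bytes_per_frame
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]


def stream_audio(send_binary, chunks):
    """Pass each chunk to send_binary and return how many chunks were sent."""
    sent = 0
    for chunk in chunks:
        send_binary(chunk)
        sent += 1
    return sent


# One second of 16 kHz, 16-bit mono silence stands in for microphone input.
silence = bytes(16000 * 2)
outbox = []
count = stream_audio(outbox.append, iter_chunks(silence))
print(count, "chunks sent")
```

In a real `open` handler, `send_binary` could be something like `lambda chunk: ws.send(chunk, opcode=websocket.ABNF.OPCODE_BINARY)` from `websocket-client`, usually run on a separate thread so the handler returns promptly.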
Include the `word_boost` parameter as an optional query parameter in the URL. See also Adding Custom Vocabulary.
Use the `error` event to handle WebSocket errors and application-level errors, including bad sample rate, authentication failure, insufficient funds, and more. See also Closing and Status Codes for a list of errors.
Additionally, update the WebSocket’s `close` event.
The sample rate of your audio must match the `sample_rate` query param you supply.

To send Mu-law encoded audio, set the `encoding` parameter to `pcm_mulaw`:
Encoding | Description |
---|---|
pcm_s16le (Default) | PCM signed 16-bit little-endian. |
pcm_mulaw | PCM Mu-law. |
The streaming endpoint accepts the following query parameters: `sample_rate` (for example, `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000`), `word_boost`, `encoding`, and `token`.
Sending `audio_data` via JSON is also supported but will be deprecated in the future. Use the binary mode instead.

Field | Example | Description |
---|---|---|
audio_data | "UklGRtjIAABXQVZFZ" | Raw audio data, base64 encoded. |
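For completeness, here is a small sketch of how a JSON `audio_data` message could be built from a raw audio chunk. The helper name `audio_data_message` is illustrative, and since this JSON mode is slated for deprecation, the binary mode remains the better choice.

```python
import base64
import json


def audio_data_message(chunk):
    """Wrap raw audio bytes in the JSON shape described in the table above."""
    encoded = base64.b64encode(chunk).decode("ascii")
    return json.dumps({"audio_data": encoded})


message = audio_data_message(b"RIFF")
print(message)
```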
To end the session, send a JSON message with the following field:

Field | Example | Description |
---|---|---|
terminate_session | true | A boolean value to communicate that you wish to end your streaming session forever. |
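A terminate message is a single-field JSON object, so it can be sketched in a few lines. The helper name `terminate_message` is hypothetical.

```python
import json


def terminate_message():
    """Build the JSON text that asks the server to end the session."""
    return json.dumps({"terminate_session": True})


print(terminate_message())
```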
Your client receives a `SessionBegins` message with the following JSON data:
Field | Example | Description |
---|---|---|
message_type | "SessionBegins" | Describes the type of the message. |
session_id | "d3e8c537-2f11-494b-b497-e59a434588bd" | Unique identifier for the established session. |
expires_at | "2023-05-24T08:09:10.161850" | Timestamp when this session will expire. |
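The fields above can be read with the standard library alone. This sketch assumes `expires_at` always arrives in the ISO 8601 shape shown in the example; it carries no time zone there, so the parsed value is a naive `datetime`. The helper name is illustrative.

```python
import json
from datetime import datetime


def parse_session_begins(raw_message):
    """Return (session_id, expires_at) from a SessionBegins message."""
    data = json.loads(raw_message)
    if data.get("message_type") != "SessionBegins":
        raise ValueError("not a SessionBegins message")
    # expires_at has no time zone in the documented example, so this is naive.
    expires_at = datetime.fromisoformat(data["expires_at"])
    return data["session_id"], expires_at


example = json.dumps({
    "message_type": "SessionBegins",
    "session_id": "d3e8c537-2f11-494b-b497-e59a434588bd",
    "expires_at": "2023-05-24T08:09:10.161850",
})
session_id, expires_at = parse_session_begins(example)
print(session_id, expires_at.isoformat())
```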
Partial transcripts arrive as messages with the following JSON data:

Field | Example | Description |
---|---|---|
message_type | "PartialTranscript" | Describes the type of message. |
audio_start | 0 | Start time of audio sample relative to session start, in milliseconds. |
audio_end | 1500 | End time of audio sample relative to session start, in milliseconds. |
confidence | 0.987190506414702 | The confidence score of the entire transcription, between 0 and 1. |
text | "there is a house in new orleans" | The partial transcript for your audio. |
words | [{"start": 0, "end": 440, "confidence": 1.0, "text": "there"}, ...] | An array of objects, with the information for each word in the transcription text. Includes the start/end time (in milliseconds) of the word, the confidence score of the word, and the text (i.e. the word itself). |
created | "2023-05-24T08:09:10.161850" | The timestamp for the partial transcript. |
Final transcripts arrive as messages with the following JSON data:

Field | Example | Description |
---|---|---|
message_type | "FinalTranscript" | Describes the type of message. |
audio_start | 0 | Start time of audio sample relative to session start, in milliseconds. |
audio_end | 1500 | End time of audio sample relative to session start, in milliseconds. |
confidence | 0.997190506414702 | The confidence score of the entire transcription, between 0 and 1. |
text | "There is a house in New Orleans" | The final transcript for your audio. |
words | [{"start": 0, "end": 440, "confidence": 1.0, "text": "There"}, ...] | An array of objects, with the information for each word in the transcription text. Includes the start/end time (in milliseconds) of the word, the confidence score of the word, and the text (i.e. the word itself). |
created | "2023-05-24T08:09:10.161850" | The timestamp for the final transcript. |
punctuated | true | Whether the text has been punctuated and cased. |
text_formatted | true | Whether the text has been formatted (e.g. Dollar -> $). |
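Because partial transcripts are revised as more audio arrives, a client usually keeps only the latest partial and appends each final. The sketch below works under that assumption, namely that a `FinalTranscript` supersedes the partials that preceded it; the class name `TranscriptBuffer` is hypothetical.

```python
import json


class TranscriptBuffer:
    """Keep finalized text plus the most recent, still-changing partial."""

    def __init__(self):
        self.finals = []
        self.partial = ""

    def feed(self, raw_message):
        data = json.loads(raw_message)
        kind = data.get("message_type")
        if kind == "PartialTranscript":
            # A newer partial replaces the previous one.
            self.partial = data.get("text", "")
        elif kind == "FinalTranscript":
            # Assumption: the final supersedes the partials seen so far.
            self.finals.append(data.get("text", ""))
            self.partial = ""

    def text(self):
        parts = self.finals + ([self.partial] if self.partial else [])
        return " ".join(p for p in parts if p)


buffer = TranscriptBuffer()
for kind, text in [
    ("PartialTranscript", "there is a"),
    ("PartialTranscript", "there is a house"),
    ("FinalTranscript", "There is a house in New Orleans"),
    ("PartialTranscript", "they call"),
]:
    buffer.feed(json.dumps({"message_type": kind, "text": text}))
print(buffer.text())
```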
When the session ends, your client receives a `SessionTerminated` message with the following JSON data:
Field | Example | Description |
---|---|---|
message_type | "SessionTerminated" | Describes the type of the message. |
Error conditions and their status codes:

Error Condition | Status Code | Message |
---|---|---|
bad sample rate | 4000 | "Sample rate must be a positive integer" |
auth failed | 4001 | "Not Authorized" |
insufficient funds | 4002 | "Insufficient Funds" |
free tier user | 4003 | "This feature is paid-only and requires you to add a credit card. Please visit https://app.assemblyai.com/ to add a credit card to your account" |
attempt to connect to nonexistent session id | 4004 | "Session not found" |
session expired | 4008 | "Session Expired" |
attempt to connect to closed session | 4010 | "Session previously closed" |
rate limited | 4029 | "Client sent audio too fast" |
unique session violation | 4030 | "Session is handled by another WebSocket" |
session times out | 4031 | "Session idle for too long" |
audio too short | 4032 | "Audio duration is too short" |
audio too long | 4033 | "Audio duration is too long" |
audio too small to transcode | 4034 | "Audio too small to transcode" |
bad schema | 4101 | "Endpoint received a message with an invalid schema" |
too many streams | 4102 | "This account has exceeded the number of allowed streams" |
reconnected | 4103 | "This session has been reconnected. This WebSocket is no longer valid" |
word boost parameter parsing failed | 4104 | "Could not parse word boost parameter" |
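One way to use the table above in a client is a simple lookup from status code to message, sketched below. The dictionary is copied from the table (the 4003 message is shortened), and the helper name `describe_close` is illustrative.

```python
# Status codes and messages from the table above (4003 is abbreviated).
CLOSE_CODES = {
    4000: "Sample rate must be a positive integer",
    4001: "Not Authorized",
    4002: "Insufficient Funds",
    4003: "This feature is paid-only and requires you to add a credit card",
    4004: "Session not found",
    4008: "Session Expired",
    4010: "Session previously closed",
    4029: "Client sent audio too fast",
    4030: "Session is handled by another WebSocket",
    4031: "Session idle for too long",
    4032: "Audio duration is too short",
    4033: "Audio duration is too long",
    4034: "Audio too small to transcode",
    4101: "Endpoint received a message with an invalid schema",
    4102: "This account has exceeded the number of allowed streams",
    4103: "This session has been reconnected. This WebSocket is no longer valid",
    4104: "Could not parse word boost parameter",
}


def describe_close(code, fallback=None):
    """Return a readable line for a close status code."""
    if code is None:
        return "Connection closed without a status code"
    return f"{code}: {CLOSE_CODES.get(code, fallback or 'Unknown close code')}"


def on_close(ws, close_status_code, close_msg):
    """Handler in the (ws, code, message) shape used by websocket-client."""
    print(describe_close(close_status_code, close_msg))
```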
Include `word_boost` in the URL. The parameter should map to a JSON-encoded list of strings.
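A minimal Python sketch of that encoding follows. The word list is a placeholder; the list is serialized with `json.dumps` and then URL-encoded along with the sample rate.

```python
import json
from urllib.parse import urlencode

# Placeholder vocabulary; replace with the terms you want to boost.
word_boost = ["foo", "bar"]

params = {
    "sample_rate": 16000,
    # The list is JSON-encoded first, then percent-encoded by urlencode.
    "word_boost": json.dumps(word_boost),
}
boosted_url = f"wss://api.assemblyai.com/v2/realtime/ws?{urlencode(params)}"
print(boosted_url)
```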
Send a `POST` request to `https://api.assemblyai.com/v2/realtime/token`. Use the `expires_in` parameter to specify how long the token should be valid for, in seconds.
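The request could be prepared as in the sketch below, which builds the request object without sending it. Two things here are assumptions rather than documented facts: that `expires_in` is sent in a JSON body, and that the API key goes in the authorization header as it does for the WebSocket connection. The range check uses the 60 to 360000 second limit stated for this parameter.

```python
import json
import urllib.request

TOKEN_URL = "https://api.assemblyai.com/v2/realtime/token"


def build_token_request(api_key, expires_in):
    """Prepare (but do not send) a temporary-token request."""
    if not 60 <= expires_in <= 360000:
        raise ValueError("expires_in must be between 60 and 360000 seconds")
    # Assumption: the parameter is sent as a JSON body.
    body = json.dumps({"expires_in": expires_in}).encode("utf-8")
    return urllib.request.Request(
        TOKEN_URL,
        data=body,
        # Assumption: same authorization header as the WebSocket connection.
        headers={"Authorization": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_token_request("YOUR_API_KEY", 3600)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; the shape of the response is not covered here.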
The `expires_in` parameter must have a value between 60 and 360000 seconds.

Pass the temporary token as the `token` query parameter of the streaming URL: `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&token={New Temp Token}`.
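As a short sketch, the token URL can be built the same way as any other query string. The helper name `build_token_url` and the token value are placeholders.

```python
from urllib.parse import urlencode


def build_token_url(token, sample_rate=16000):
    """Return a streaming URL that authenticates with a temporary token."""
    query = urlencode({"sample_rate": sample_rate, "token": token})
    return f"wss://api.assemblyai.com/v2/realtime/ws?{query}"


token_url = build_token_url("TEMP_TOKEN")
print(token_url)
```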