FAQ
Commonly asked questions
Need more help?
Check out our Knowledge Base for additional questions and answers.
The AssemblyAI API supports most common audio and video file formats. We recommend that you submit your audio in its native format without additional transcoding or file conversion. Transcoding or converting it to another format can sometimes result in a loss of quality, especially if you’re converting compressed formats like .mp3. The AssemblyAI API converts all files to 16 kHz uncompressed audio as part of our transcription pipeline.
Note that when you upload a video to our API, the audio will be extracted from it and processed independently, so the list of supported video formats isn’t exhaustive. If you need support for a format that isn’t listed below, please contact our team at support@assemblyai.com.
| Supported audio file types | Supported video file types |
|---|---|
| .3ga | .webm |
| .8svx | .mts, .m2ts, .ts |
| .aac | .mov |
| .ac3 | .mp2 |
| .aif | .mp4, .m4p (with DRM), .m4v |
| .aiff | .mxf |
| .alac | |
| .amr | |
| .ape | |
| .au | |
| .dss | |
| .flac | |
| .flv | |
| .m4a | |
| .m4b | |
| .m4p | |
| .m4r | |
| .mp3 | |
| .mpga | |
| .ogg, .oga, .mogg | |
| .opus | |
| .qcp | |
| .tta | |
| .voc | |
| .wav | |
| .wma | |
| .wv | |
Currently, there are two main limitations, file size and duration:
- Maximum file size for the `/v2/transcript` endpoint: 5 GB
- Maximum duration: 10 hours
- Maximum file size for the `/v2/upload` endpoint: 2.2 GB
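To see how the two endpoints relate in practice, here is a minimal Python sketch using the `requests` library. It uploads a local file to `/v2/upload`, then submits the returned URL to `/v2/transcript` and polls for the result. The API key and file name are placeholders, and the response fields used (`upload_url`, `id`, `status`) follow the public API reference.

```python
import time
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: your AssemblyAI API key
BASE_URL = "https://api.assemblyai.com/v2"
HEADERS = {"authorization": API_KEY}

# Upload a local file (must be under the 2.2 GB /v2/upload limit).
with open("meeting.mp3", "rb") as f:  # placeholder file name
    upload_response = requests.post(f"{BASE_URL}/upload", headers=HEADERS, data=f)
upload_url = upload_response.json()["upload_url"]

# Submit the uploaded audio for transcription via /v2/transcript.
transcript_response = requests.post(
    f"{BASE_URL}/transcript",
    headers=HEADERS,
    json={"audio_url": upload_url},
)
transcript_id = transcript_response.json()["id"]

# Poll until the transcript is completed (or reports an error).
while True:
    result = requests.get(f"{BASE_URL}/transcript/{transcript_id}", headers=HEADERS).json()
    if result["status"] in ("completed", "error"):
        break
    time.sleep(3)

print(result["status"])
```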
The vast majority of files will complete in under 45 seconds, with a Real-Time Factor (RTF) as low as 0.008x.
To put this into perspective:
- 1h3min (75MB) meeting → 35 seconds
- 3h15min (191MB) podcast → 133 seconds
- 8h21min (464MB) video course → 300 seconds
Files submitted for Streaming Speech-to-Text receive a response within a few hundred milliseconds.
The response for a completed request includes `start` and `end` keys. These timestamp values indicate when a given word, phrase, or sentence starts and ends. They are:
- Measured in milliseconds
- Accurate to within about 400 milliseconds
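For example, here is a short sketch of reading those timestamps from a completed transcript. The `words` list and its `text`/`start`/`end` fields reflect the public API reference rather than this FAQ, so treat the exact response shape as indicative; the API key and transcript ID are placeholders.

```python
import requests

API_KEY = "YOUR_API_KEY"              # placeholder
TRANSCRIPT_ID = "YOUR_TRANSCRIPT_ID"  # placeholder: ID of a completed transcript
HEADERS = {"authorization": API_KEY}

transcript = requests.get(
    f"https://api.assemblyai.com/v2/transcript/{TRANSCRIPT_ID}", headers=HEADERS
).json()

# Each entry in "words" carries start/end timestamps, measured in milliseconds.
for word in transcript.get("words", []):
    print(f'{word["text"]}: {word["start"] / 1000:.2f}s - {word["end"] / 1000:.2f}s')
```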
Custom Vocabulary
- Allows submission of words/phrases to boost prediction likelihood
- Helps with under-represented terms in training data
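A hedged sketch of enabling Custom Vocabulary when creating a transcript is shown below. The `word_boost` and `boost_param` field names come from the public API reference rather than this FAQ; the API key and audio URL are placeholders.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
HEADERS = {"authorization": API_KEY}

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=HEADERS,
    json={
        "audio_url": "https://example.com/audio.mp3",  # placeholder audio URL
        # Words/phrases whose prediction likelihood should be boosted.
        "word_boost": ["AssemblyAI", "Real-Time Factor"],
        "boost_param": "high",  # how strongly to boost: low | default | high
    },
)
print(response.json()["id"])
```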
Custom Spelling
- Controls word spelling/formatting in transcript text
- Works like find-and-replace functionality
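Similarly, here is a sketch of Custom Spelling, again with field names (`custom_spelling`, `from`, `to`) taken from the public API reference and placeholders for the key and audio URL.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
HEADERS = {"authorization": API_KEY}

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=HEADERS,
    json={
        "audio_url": "https://example.com/audio.mp3",  # placeholder audio URL
        # Find-and-replace style rules: any spelling in "from" is rewritten as "to".
        "custom_spelling": [
            {"from": ["assembly ai", "assemblyai"], "to": "AssemblyAI"},
            {"from": ["cloud nine"], "to": "Cloud9"},
        ],
    },
)
print(response.json()["id"])
```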
File Storage:
- Files are encrypted in transit
- Deleted immediately after transcription completion
- Uploaded but untranscribed files are deleted after 24 hours
- Upload URLs become invalid after deletion
Transcript Management:
- Transcripts are stored encrypted at rest
- Can be deleted permanently via API request
- List all transcripts with a GET request
Completed transcripts are stored in our database, encrypted at rest, so that we can serve them to you and your application.
To permanently delete a transcript from our database once you’ve retrieved it, you can make a `DELETE` request to the API.
You can retrieve a list of all transcripts that you have created by making a `GET` request to the API.
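Here is a minimal sketch of both operations against the `/v2/transcript` endpoint, assuming a placeholder API key and transcript ID; the `transcripts` field in the listing response follows the public API reference.

```python
import requests

API_KEY = "YOUR_API_KEY"              # placeholder
TRANSCRIPT_ID = "YOUR_TRANSCRIPT_ID"  # placeholder
HEADERS = {"authorization": API_KEY}
BASE_URL = "https://api.assemblyai.com/v2/transcript"

# Permanently delete a single transcript from AssemblyAI's database.
requests.delete(f"{BASE_URL}/{TRANSCRIPT_ID}", headers=HEADERS)

# List the transcripts you have created (the response is paginated).
listing = requests.get(BASE_URL, headers=HEADERS).json()
for transcript in listing.get("transcripts", []):
    print(transcript["id"], transcript["status"])
```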
Best Tier
- Most robust and accurate offering
- Houses most powerful models
- Broadest range of capabilities
- Ideal for accuracy-critical use cases
Nano Tier
- Fast, lightweight offering
- Supports 99 languages
- Cost-effective price point
- Best for extensive language needs
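If it helps, here is a hedged sketch of selecting a tier when creating a transcript. The `speech_model` field and its `"best"`/`"nano"` values come from the public API reference rather than this FAQ; the key and audio URL are placeholders.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
HEADERS = {"authorization": API_KEY}

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=HEADERS,
    json={
        "audio_url": "https://example.com/audio.mp3",  # placeholder audio URL
        "speech_model": "nano",  # "best" for maximum accuracy, "nano" for broad language coverage
    },
)
print(response.json()["id"])
```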
Discounts
- Volume discounts available for large-scale usage
- Contact support@assemblyai.com for eligibility
Support
- JSON responses include an `error` key with descriptive messages
- Email support@assemblyai.com for assistance
- Include transcript IDs and a detailed issue description when contacting support
Any time you make a request to the API, you should receive a JSON response. If you don’t receive the expected output, the JSON contains an `error` key with a message value describing the error.
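For example, here is a small sketch of checking for that `error` key after creating a transcript (the API key and audio URL are placeholders):

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
HEADERS = {"authorization": API_KEY}

response = requests.post(
    "https://api.assemblyai.com/v2/transcript",
    headers=HEADERS,
    json={"audio_url": "https://example.com/audio.mp3"},  # placeholder audio URL
)
body = response.json()

# Failed requests carry an "error" key with a human-readable message.
if "error" in body:
    print("Request failed:", body["error"])
else:
    print("Transcript created:", body["id"])
```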
You can also reach out to our support team at any time by sending an email to support@assemblyai.com. When reaching out, please include a detailed description of any issues you’re experiencing, as well as transcript IDs for affected requests, if possible.
Custom models are rarely more accurate than the best general models due to several factors:
- General models are trained on massive datasets (600,000+ hours of speech data)
- Training data includes diverse audio types:
- Broadcast TV recordings
- Phone calls
- Zoom meetings
- Videos
- Various accents and speakers
Custom models are mainly beneficial for audio with unique characteristics that general models have not encountered, though such cases are rare given how comprehensively general models are trained.
Learn more about this topic on the AssemblyAI blog.