Speaker Diarization
Learn how to detect multiple speakers in an audio file
The Speaker Diarization model lets you detect multiple speakers in an audio file and determine what each speaker said.
If you enable Speaker Diarization, the resulting transcript includes a list of utterances, where each utterance corresponds to an uninterrupted segment of speech from a single speaker.
Speaker Diarization and multichannel
Speaker Diarization doesn’t support multichannel transcription. Enabling both Speaker Diarization and multichannel will result in an error.
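Because the two options are mutually exclusive, it can help to reject the combination before a request is sent. The following is a minimal sketch, not part of any SDK: the helper name `check_diarization_config` and the `multichannel` key are assumptions made for illustration.

```python
def check_diarization_config(config: dict) -> None:
    """Raise ValueError if Speaker Diarization and multichannel are both enabled."""
    # `multichannel` is an assumed key name for the multichannel option.
    if config.get("speaker_labels") and config.get("multichannel"):
        raise ValueError(
            "Speaker Diarization and multichannel cannot be enabled together"
        )


# Passes silently: only Speaker Diarization is enabled.
check_diarization_config({"speaker_labels": True})
```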
Quickstart
To enable Speaker Diarization, set `speaker_labels` to `true` in the transcription config.
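As a rough illustration, the config can be thought of as a JSON request body with `speaker_labels` turned on. In this sketch the `audio_url` field, the function name, and the example URL are assumptions; how the body is sent to the API is left out.

```python
import json


def build_transcript_request(audio_url: str) -> str:
    """Return a JSON request body with Speaker Diarization enabled."""
    body = {
        "audio_url": audio_url,  # assumed field name for the audio location
        "speaker_labels": True,  # enables Speaker Diarization
    }
    return json.dumps(body)


# Placeholder URL, used only to show the shape of the body.
print(build_transcript_request("https://example.com/meeting.mp3"))
```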
Example output
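One way to produce speaker-labeled output is to walk the `utterances` list and print each speaker with their text. The sketch below uses made-up sample data in place of a real response; only the `speaker` and `text` fields described in the API reference are read, and the helper name `format_utterances` is hypothetical.

```python
def format_utterances(transcript: dict) -> list:
    """Return one 'Speaker X: text' line per utterance."""
    return [
        f"Speaker {u['speaker']}: {u['text']}"
        for u in transcript.get("utterances") or []
    ]


# Made-up sample data, shaped like the documented response fields.
sample = {
    "utterances": [
        {"speaker": "A", "text": "Hi, thanks for joining.", "start": 250, "end": 1900},
        {"speaker": "B", "text": "Happy to be here.", "start": 2100, "end": 3400},
    ]
}

for line in format_utterances(sample):
    print(line)
```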
Set number of speakers
If you know the number of speakers in advance, you can improve diarization performance by setting the `speakers_expected` parameter.
The `speakers_expected` parameter is ignored for audio files shorter than 2 minutes.
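The sketch below shows one way to assemble these options on the client side. The helper `diarization_options` and its `duration_seconds` argument are hypothetical. Dropping the hint for short files is a local convenience that mirrors the 2-minute rule, not something the API requires.

```python
MIN_DURATION_SECONDS = 120  # below 2 minutes, speakers_expected is ignored


def diarization_options(speakers_expected=None, duration_seconds=None) -> dict:
    """Build the diarization part of a transcription config."""
    options = {"speaker_labels": True}
    if speakers_expected is not None:
        # Keep the hint unless the file is known to be too short for it to apply.
        if duration_seconds is None or duration_seconds >= MIN_DURATION_SECONDS:
            options["speakers_expected"] = speakers_expected
    return options


print(diarization_options(speakers_expected=3, duration_seconds=600))
```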
API reference
Request
Key | Type | Description |
---|---|---|
speaker_labels | boolean | Enable Speaker Diarization. |
speakers_expected | number | Set the expected number of speakers. |
Response
Key | Type | Description |
---|---|---|
utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file. |
utterances[i].confidence | number | The confidence score for the transcript of this utterance. |
utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file. |
utterances[i].speaker | string | The speaker of this utterance, where each speaker is assigned a sequential capital letter. For example, “A” for Speaker A, “B” for Speaker B, and so on. |
utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file. |
utterances[i].text | string | The transcript for this utterance. |
utterances[i].words | array | A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance. |
utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance. |
utterances[i].words[j].start | number | The starting time for when the j-th word is spoken in the i-th utterance, in milliseconds. |
utterances[i].words[j].end | number | The ending time for when the j-th word is spoken in the i-th utterance, in milliseconds. |
utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance. |
utterances[i].words[j].speaker | string | The speaker who uttered the j-th word in the i-th utterance. |
The response also includes the request parameters used to generate the transcript.
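As an example of working with these fields, the sketch below totals speaking time per speaker from each utterance's `start` and `end` values, both in milliseconds. The utterances are made up for illustration and the function name `speaking_time_ms` is hypothetical.

```python
from collections import defaultdict


def speaking_time_ms(utterances: list) -> dict:
    """Sum utterance durations (end - start, in milliseconds) per speaker."""
    totals = defaultdict(int)
    for u in utterances:
        totals[u["speaker"]] += u["end"] - u["start"]
    return dict(totals)


# Made-up utterances using only the documented fields.
utterances = [
    {"speaker": "A", "start": 0, "end": 1500, "text": "Hello."},
    {"speaker": "B", "start": 1600, "end": 2600, "text": "Hi."},
    {"speaker": "A", "start": 2700, "end": 3200, "text": "Shall we start?"},
]
print(speaking_time_ms(utterances))  # {'A': 2000, 'B': 1000}
```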