Learn how to detect multiple speakers in an audio file.
To enable Speaker Diarization, set speaker_labels to true in the transcription config.
To set the number of speakers, use the speakers_expected parameter.
The speakers_expected parameter is ignored for audio files with a duration less than 2 minutes.

Key | Type | Description |
---|---|---|
speaker_labels | boolean | Enable Speaker Diarization. |
speakers_expected | number | Set the number of speakers. |
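
The two parameters above can be combined into a single request body. The following is a minimal sketch in Python, not the official client: the helper name build_transcription_config, the audio_url field, and the example URL are assumptions made for illustration. Only speaker_labels and speakers_expected come from the table above.

```python
import json
from typing import Optional


def build_transcription_config(audio_url: str,
                               speakers_expected: Optional[int] = None) -> dict:
    """Build a transcription config with Speaker Diarization enabled.

    `audio_url` is an assumed field name used only for this sketch.
    """
    config = {
        "audio_url": audio_url,      # assumption: how the audio is referenced
        "speaker_labels": True,      # enables Speaker Diarization
    }
    # Only send speakers_expected when the caller actually knows the count.
    if speakers_expected is not None:
        if speakers_expected < 1:
            raise ValueError("speakers_expected must be a positive integer")
        config["speakers_expected"] = speakers_expected
    return config


config = build_transcription_config(
    "https://example.com/meeting.mp3",  # placeholder URL
    speakers_expected=2,
)
print(json.dumps(config, indent=2))
```

Leaving speakers_expected out when the count is unknown keeps the request valid without guessing a number.
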
Key | Type | Description |
---|---|---|
utterances | array | A turn-by-turn temporal sequence of the transcript, where the i-th element is an object containing information about the i-th utterance in the audio file. |
utterances[i].confidence | number | The confidence score for the transcript of this utterance. |
utterances[i].end | number | The ending time, in milliseconds, of the utterance in the audio file. |
utterances[i].speaker | string | The speaker of this utterance, where each speaker is assigned a sequential capital letter. For example, “A” for Speaker A, “B” for Speaker B, and so on. |
utterances[i].start | number | The starting time, in milliseconds, of the utterance in the audio file. |
utterances[i].text | string | The transcript for this utterance. |
utterances[i].words | array | A sequential array for the words in the transcript, where the j-th element is an object containing information about the j-th word in the utterance. |
utterances[i].words[j].text | string | The text of the j-th word in the i-th utterance. |
utterances[i].words[j].start | number | The starting time for when the j-th word is spoken in the i-th utterance, in milliseconds. |
utterances[i].words[j].end | number | The ending time for when the j-th word is spoken in the i-th utterance, in milliseconds. |
utterances[i].words[j].confidence | number | The confidence score for the transcript of the j-th word in the i-th utterance. |
utterances[i].words[j].speaker | string | The speaker who uttered the j-th word in the i-th utterance. |
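
To show how the fields above fit together, here is a small Python sketch that walks the utterances array, prints one line per speaker turn, and totals speaking time per speaker. The sample_response data is invented for illustration and the helper names are hypothetical. Only the field names (speaker, start, end, text, confidence) come from the table, and the words array is omitted for brevity.

```python
from collections import defaultdict

# Invented sample data shaped like the response fields described above.
sample_response = {
    "utterances": [
        {"speaker": "A", "start": 250, "end": 1500,
         "confidence": 0.97, "text": "Hi there."},
        {"speaker": "B", "start": 1800, "end": 3000,
         "confidence": 0.94, "text": "Hello, how are you?"},
        {"speaker": "A", "start": 3200, "end": 4000,
         "confidence": 0.96, "text": "Good, thanks."},
    ]
}


def format_utterances(response: dict) -> list:
    """Return one readable line per utterance, with times in seconds."""
    lines = []
    for utterance in response.get("utterances") or []:
        start_s = utterance["start"] / 1000  # milliseconds to seconds
        end_s = utterance["end"] / 1000
        lines.append(
            f"Speaker {utterance['speaker']} "
            f"[{start_s:.2f}s-{end_s:.2f}s]: {utterance['text']}"
        )
    return lines


def talk_time_ms(response: dict) -> dict:
    """Sum the duration of every utterance per speaker, in milliseconds."""
    totals = defaultdict(int)
    for utterance in response.get("utterances") or []:
        totals[utterance["speaker"]] += utterance["end"] - utterance["start"]
    return dict(totals)


for line in format_utterances(sample_response):
    print(line)
print(talk_time_ms(sample_response))
```

The `or []` guard treats a missing or null utterances value as an empty list, which is a defensive choice for this sketch rather than documented behavior.
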
How can I improve the performance of the Speaker Diarization model?
How many speakers can the model handle?
How accurate is the Speaker Diarization model?
Why is the speaker diarization not performing as expected?