I am working with two third-party services: Pyannote AI for Speaker Diarization and Whisper for Speech-To-Text.
The reason I am using Pyannote is that Whisper does not natively support speaker diarization (as far as I know).
So what I need to do is match the speakers from Pyannote's diarization response with the segments from Whisper's transcription response. Since both responses provide timestamps (start and end), I am thinking of relying on them...
Does anyone here have experience with similar "assignment" problems, or any suggestions on the best way to approach this?
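For context, here is a minimal sketch of the timestamp-based matching I have in mind: for each Whisper segment, pick the speaker whose diarization turn overlaps it the most in time. The field names and example timings are hypothetical, not the actual API response shapes.

```python
def overlap(a_start, a_end, b_start, b_end):
    # Length of the intersection of two time intervals (0 if disjoint).
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(transcript_segments, diarization_segments):
    # For each transcription segment, choose the diarization speaker
    # whose turn has the largest temporal overlap with that segment.
    labeled = []
    for seg in transcript_segments:
        best_speaker, best_overlap = None, 0.0
        for turn in diarization_segments:
            ov = overlap(seg["start"], seg["end"], turn["start"], turn["end"])
            if ov > best_overlap:
                best_speaker, best_overlap = turn["speaker"], ov
        labeled.append({**seg, "speaker": best_speaker})
    return labeled

# Hypothetical example data (not real Whisper/Pyannote output).
transcript = [{"start": 0.0, "end": 2.5, "text": "Hello"},
              {"start": 2.6, "end": 5.0, "text": "Hi there"}]
diarization = [{"start": 0.0, "end": 2.4, "speaker": "SPEAKER_00"},
               {"start": 2.4, "end": 5.1, "speaker": "SPEAKER_01"}]

print(assign_speakers(transcript, diarization))
```

This greedy max-overlap assignment is simple but ignores segments that straddle a speaker change; splitting such segments at turn boundaries would be a possible refinement.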