I have a large dataset of transcripts (without timestamps) and the corresponding audio files (average length: one hour). My goal is to temporally align each transcript with its audio file. Can anyone point me to resources, e.g., tutorials or Hugging Face models, that may help with this task? Are there any best practices for doing it without building an entire system from scratch?

…

This task is called forced alignment, and there are reasonably mature tools that do it with classical approaches. I'd suggest perusing forced-alignment.
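
To give a feel for what a forced aligner computes, here is a minimal sketch of the core idea: given per-frame scores for each token and the known token order from the transcript, find the best monotonic assignment of frames to tokens with dynamic programming. This is a toy illustration, not how any particular tool is implemented. The name `force_align` and the emission matrix are made up for the example. In a real aligner the scores would come from an acoustic model, and the algorithm would also handle silence and blank symbols.

```python
import math


def force_align(log_probs, tokens):
    """Align `tokens` to frames monotonically, maximizing the total log-probability.

    log_probs[t][k] is a (hypothetical) log-probability of token id k at frame t.
    Every frame is assigned to exactly one token, and tokens keep transcript order.
    Returns one (token_id, first_frame, last_frame) tuple per token.
    """
    n_frames, n_tokens = len(log_probs), len(tokens)
    if n_tokens == 0 or n_frames < n_tokens:
        raise ValueError("need at least one frame per token")

    neg_inf = -math.inf
    # score[t][j]: best score of frames 0..t when frame t is assigned to tokens[j]
    score = [[neg_inf] * n_tokens for _ in range(n_frames)]
    # advanced[t][j]: True if the best path moved from tokens[j-1] to tokens[j] at frame t
    advanced = [[False] * n_tokens for _ in range(n_frames)]

    score[0][0] = log_probs[0][tokens[0]]
    for t in range(1, n_frames):
        for j in range(min(t + 1, n_tokens)):
            stay = score[t - 1][j]
            move = score[t - 1][j - 1] if j > 0 else neg_inf
            if move > stay:
                best, advanced[t][j] = move, True
            else:
                best = stay
            score[t][j] = best + log_probs[t][tokens[j]]

    if score[-1][-1] == neg_inf:
        raise ValueError("no valid alignment for these scores")

    # Backtrace from the last frame and last token to recover the frame-to-token path.
    path = [0] * n_frames
    j = n_tokens - 1
    for t in range(n_frames - 1, -1, -1):
        path[t] = j
        if advanced[t][j]:
            j -= 1

    # Merge consecutive frames that share a token into (token, first, last) segments.
    segments = []
    for t, j in enumerate(path):
        if segments and segments[-1][0] == j:
            segments[-1][2] = t
        else:
            segments.append([j, t, t])
    return [(tokens[j], first, last) for j, first, last in segments]


# Toy example: 6 frames, vocabulary of 3 token ids, transcript is [0, 1, 2].
toy_probs = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.70, 0.20],
    [0.10, 0.20, 0.70],
    [0.05, 0.05, 0.90],
]
toy_log_probs = [[math.log(p) for p in frame] for frame in toy_probs]
print(force_align(toy_log_probs, [0, 1, 2]))
# → [(0, 0, 1), (1, 2, 3), (2, 4, 5)]
```

To turn frame indices into timestamps, multiply by the frame duration of whatever model produced the scores. Note that this table grows with frames × tokens, so for hour-long recordings you would likely need to process the audio in chunks rather than all at once.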