| Name | Description | Size | Format |
|---|---|---|---|
| | | 2.54 MB | Adobe PDF |
Authors
Advisor(s)
Abstract
Automatic Speech Recognition (ASR) systems struggle with speech disfluencies such as repetitions
and filled pauses, which are common in spontaneous speech and in disorders like stuttering.
This performance gap limits the accessibility and clinical utility of ASR technology. This
research proposes a comprehensive Deep Learning framework to improve the transcription,
classification, and interpretation of disfluent speech. The multi-stage methodology begins by
fine-tuning state-of-the-art ASR models on the FluencyBank corpus to accurately transcribe
disfluent events. Next, a custom-trained ModernBERT model performs token-level
classification to identify and label specific disfluencies in the transcript. Finally, a specialized
Large Language Model (LLM) provides a structured, context-aware analysis of the identified
patterns. The integrated pipeline is demonstrated through a functional prototype, ArtiFlow
(Articulate Flow), which showcases the system's end-to-end capabilities. Empirical results
validate the framework's effectiveness. After a comparative analysis of different ASR model
families, the Whisper Large v3 Turbo model was selected, offering an optimal balance
between high transcription accuracy (12.94% Word Error Rate) and enhanced inference
speed, while leveraging a stable and accessible implementation framework. The subsequent
disfluency classifier achieves a weighted F1-score of 0.9512, and the
selected LLM demonstrates strong capabilities in generating structured analyses. This thesis
establishes a robust, modular system for both identifying and interpreting speech disfluencies,
providing a foundation for advanced support tools. The work contributes to speech processing
and assistive technology, offering significant potential for applications in clinical diagnosis,
speech therapy, and the development of more inclusive human-computer interaction
systems.
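The two metrics quoted in the abstract, Word Error Rate and weighted F1-score, can be illustrated with a minimal sketch. This is not the thesis's evaluation code: the function names, the example sentences, and the label set (`O`, `REP`, `FP`) are hypothetical, chosen only to show how each number is computed under the standard definitions.

```python
from collections import Counter
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Per-label F1, averaged with weights equal to each label's count in y_true."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)
    total = 0.0
    for label, n in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += n * f1
    return total / len(y_true)


# Hypothetical example: the ASR output drops a repeated word (one deletion).
wer = word_error_rate("i i want to go", "i want to go")

# Hypothetical token labels: O = fluent, REP = repetition, FP = filled pause.
f1 = weighted_f1(["O", "O", "REP", "FP"], ["O", "REP", "REP", "FP"])
```

In the first example, a transcript that silently removes the repeated word counts as an error against a verbatim reference, which is why standard ASR models tuned for fluent output tend to score poorly on disfluent speech.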
Description
Dissertation presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science
Keywords
Automatic Speech Recognition; Speech Disfluencies; Large Language Models; Spoken Language Processing; Stuttering; SDG 3 - Good health and well-being
