Joint End-to-End Spoken Language Understanding and Automatic Speech Recognition Training Based on Unified Speech-to-Text Pre-Training
Abstract
Modern spoken language understanding (SLU) approaches optimize the system in an end-to-end (E2E) manner. This approach offers two key advantages. First, it helps mitigate error propagation from upstream systems. Second, combining various information types and optimizing them toward the same objective is straightforward. In this study, we attempt to build an SLU system by integrating information from two modalities, i.e., speech and text, and concurrently optimizing the associated tasks. We leverage a pre-trained model built with speech and text data and fine-tune it for the E2E SLU tasks. The SLU model is jointly optimized with automatic speech recognition (ASR) and SLU tasks under single-mode and dual-mode schemes. In the single-mode model, ASR and SLU results are predicted sequentially, whereas the dual-mode model predicts either ASR or SLU outputs based on the task tag. Our proposed method demonstrates its superiority through benchmarking on the FSC, SLURP, and in-house datasets, showing improvements in intent accuracy, SLU-F1, and word error rate (WER).
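To make the two joint-training schemes concrete, the following is a minimal sketch (not the authors' code) of how decoder targets could be constructed for each scheme; the special tokens <asr>, <slu>, <sep>, and <eos> are illustrative assumptions rather than the paper's actual vocabulary.

```python
# Hypothetical target construction for the two joint ASR+SLU training schemes.
# Token names (<asr>, <slu>, <sep>, <eos>) are assumptions for illustration.

def single_mode_target(transcript_tokens, slu_tokens):
    """Single-mode: one decoder sequence emits the ASR transcript first,
    followed by the SLU output (e.g., intent/slot labels)."""
    return transcript_tokens + ["<sep>"] + slu_tokens + ["<eos>"]

def dual_mode_target(task_tag, transcript_tokens, slu_tokens):
    """Dual-mode: a task tag prepended to the target decides whether the
    decoder produces the ASR output or the SLU output for this example."""
    assert task_tag in ("<asr>", "<slu>")
    body = transcript_tokens if task_tag == "<asr>" else slu_tokens
    return [task_tag] + body + ["<eos>"]

if __name__ == "__main__":
    words = ["turn", "on", "the", "lights"]
    labels = ["intent:activate_lights"]
    print(single_mode_target(words, labels))       # sequential prediction
    print(dual_mode_target("<asr>", words, labels))  # tag selects ASR output
    print(dual_mode_target("<slu>", words, labels))  # tag selects SLU output
```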
Authors: Eesung Kim, Yun Tang, Taeyeon Ki, Divya Neelagiri, Vijendra Raj Apsingekar
Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date: Apr 14, 2024