Publications
Browse our publications below.
Title | Details | Date | Abstract | Link | Research Areas |
---|---|---|---|---|---|
AlpaGasus: Training A Better Alpaca with Fewer Data | Authors: Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin. Published: International Conference on Learning Representations (ICLR) | May 7, 2024 | Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca’s 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches >90% performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. | https://arxiv.org/abs/2307.08701 | Artificial Intelligence |
Multimodal Breathing Rate Estimation Using Facial Motion and RPPG From RGB Camera | Authors: Migyeong Gwak, Korosh Vatanparvar, Li Zhu, Nafiul Rashid, Moshin Ahmed, Jungmok Bae, Jilong Kuang, Alex Gao. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Camera-based respiratory monitoring is contactless, non-invasive, unobtrusive, and easily accessible compared to conventional wearable devices. This paper presents a novel multimodal approach to estimating breathing rate based on tracking the movement and color changes of the face through an RGB camera. A machine learning model determines the final breathing rate between two separately calculated ones from breathing motion and remote photoplethysmography (rPPG) to improve the measurement performance in a broader range of breathing frequencies. Our proposed pipeline is evaluated with 140 facial video recordings from 22 healthy subjects, including 6 controlled and 2 spontaneous breathing tasks ranging from 5 to 30 BPM. The estimation accuracy achieves 1.33 BPM mean absolute error and 86.53% pass rate within a 2 BPM error criterion. To the best of our knowledge, our approach outperforms previous works that use a face region alone with a single RGB camera. | https://ieeexplore.ieee.org/document/10446086 | Artificial Intelligence |
Weakly Supervised Learning for Camera-Based Heart Rate Variability | Authors: Jeremy Speth, Korosh Vatanparvar, Li Zhu, Jilong Kuang, Alex Gao. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Camera-based pulse measurements from remote photoplethysmography (rPPG) have rapidly improved over recent years due to innovations in video processing and deep learning. However, modern data-driven solutions require large training datasets collected under diverse conditions. Collecting such training data is made more challenging by the need for time-synchronized video and physiological signals as ground truth. This paper presents a weakly supervised learning framework, Freq2Time, to train with heart rate (HR) labels. Our framework mitigates the need for simultaneous PPG or ECG as ground truth, since the HR changes relatively slowly and describes the target rPPG signal over a time interval. We show that 3D convolutional neural network (3DCNN) models trained with the Freq2Time framework give state-of-the-art HR performance with MAE of 2.86 bpm, when tested with challenging smartphone video data from 30 subjects. Additionally, our models still learn accurate rPPG time signals, allowing for other physiological metrics such as heart rate variability. | https://ieeexplore.ieee.org/abstract/document/10446054 | Artificial Intelligence |
Heart Rate Variability Estimation with Dynamic Fine Filtering and Global-Local Context Outlier Removal | Authors: Ramesh Kumar Sah, Md Mahbubur Rahman, Viswam Nathan, Li Zhu, Jungmok Bae, Christina Rosa, Wendy Berry Mendes, Jilong Kuang, Alex Jun Gao. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Consumer hearable technologies such as earbuds are increasingly embedding physiological sensors, including photoplethysmography (PPG) and inertial measurements. They create unique opportunities to passively monitor stress and deliver digital interventions such as music. However, PPG signals recorded from ear canals are often very noisy due to head movement and fit issues. This work proposes algorithms to estimate heart rate variability (HRV) features from noisy PPG signals recorded using earbuds. We have used template matching to determine the signal quality for dynamic fine filtering around the estimated heart rate. We have also improved the inter-beat interval (IBI) outlier detection and removal algorithm using the global-local context of the input PPG signal. The mean absolute error of estimating RMSSD decreased from 70.83 milliseconds (ms) to 24.88 ms, and SDNN decreased from 46.89 ms to 16.60 ms. | https://ieeexplore.ieee.org/document/10447778 | Artificial Intelligence |
Ballistocardiogram-Based Heart Rate Variability Estimation for Stress Monitoring using Consumer Earbuds | Authors: David J. Lin, Md Mahbubur Rahman, Li Zhu, Viswam Nathan, Jungmok Bae, Christina Rosa, Wendy B Mendes, Jilong Kuang, Alex J Gao. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Stress can potentially have detrimental effects on both physical and mental well-being, but monitoring it can be challenging, especially in free-living conditions. One approach to address this challenge is to use earbud accelerometers to capture the ballistocardiogram (BCG) response. These sensors allow for noninvasive stress monitoring by estimating physiological indicators linked to stress, such as heart rate variability (HRV). However, ear-worn devices are susceptible to motion artifacts and can exhibit significant BCG signal morphology variations. These challenges necessitate accurate algorithms to estimate HRV for everyday use. Therefore, we developed a method to measure interbeat intervals (IBI) from BCG signals collected from an earbud. To enhance IBI estimation accuracy, we employed a Bayesian method that incorporates robust a priori IBI prediction weighting and sensor fusion techniques. We have also conducted a study involving 97 participants to assess the earbuds’ ability to estimate HRV metrics and classify stressful activities. Our findings demonstrate low IBI estimation error (4.16% ± 1.90%), along with lower errors in subsequent higher-order HRV metrics compared to the state-of-the-art algorithms. | https://ieeexplore.ieee.org/document/10447280 | Artificial Intelligence |
Core Body Temperature and its Role in Detecting Acute Stress: A Feasibility Study | Authors: Mehrab Bin Morshed, Md Mahbubur Rahman, Viswam Nathan, Li Zhu, Jungmok Bae, Christina Rosa, Wendy Berry Mendes, Jilong Kuang, Alex Gao. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Core body temperature (CBT) is one of the critical yet under-explored phenomena in the context of stress detection. Several CBT measurement methods exist, but they are often limited in continuous CBT monitoring. Furthermore, how continuous CBT can be used to model acute stress is little explored. We address these challenges by conducting an in-lab controlled study with 97 participants who participated in baseline and stress-inducing tasks while wearing prototype earbuds capable of collecting CBT. We found that accounting for changes from individual baselines in CBT results in acute stress detection with 94.88% accuracy and 94.4% F1-score, which is 29.31% and 26.07% higher in terms of accuracy and F1-score, respectively, compared to generalized features. | https://ieeexplore.ieee.org/abstract/document/10447599 | Artificial Intelligence |
Joint End-to-End Spoken Language Understanding and Automatic Speech Recognition Training Based on Unified Speech-to-Text Pre-Training | Authors: Eesung Kim, Yun Tang, Taeyeon Ki, Divya Neelagiri, Vijendra Raj Apsingek. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Modern spoken language understanding (SLU) approaches optimize the system in an end-to-end (E2E) manner. This approach offers two key advantages. Firstly, it helps mitigate error propagation from upstream systems. Secondly, combining various information types and optimizing them towards the same objective is straightforward. In this study, we attempt to build an SLU system by integrating information from two modalities, i.e., speech and text, and concurrently optimizing the associated tasks. We leverage a pre-trained model built with speech and text data and fine-tune it for the E2E SLU tasks. The SLU model is jointly optimized with automatic speech recognition (ASR) and SLU tasks under single-mode and dual-mode schemes. In the single-mode model, ASR and SLU results are predicted sequentially, whereas the dual-mode model predicts either ASR or SLU outputs based on the task tag. Our proposed method demonstrates its superiority through benchmarking against FSC, SLURP, and in-house datasets, exhibiting improved intent accuracy, SLU-F1, and Word Error Rate (WER). | https://ieeexplore.ieee.org/document/10447509 | Artificial Intelligence |
End-To-End Personalized Cuff-Less Blood Pressure Monitoring Using ECG and PPG Signals | Authors: Suhas BN, Rakshith Sharma Srinivasa, Yashas Malur Saidutta, Jaejin Cho, Ching-Hua Lee, Chouchang Yang, Yilin Shen, Hongxia Jin. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Cuffless blood pressure (BP) monitoring offers the potential for continuous, non-invasive healthcare but has been limited in adoption by existing models relying on handcrafted features from ECG and PPG signals. To overcome this, researchers have looked to deep learning. Along these lines, in this paper, we introduce a novel end-to-end model based on transformers. Further, we also introduce a novel contrastive loss-based objective for robust training. To study the limits of performance for our proposed ideas, we first study personalized models trained on large subject-specific datasets, and achieve an average mean absolute error of 1.08/0.68 mmHg for systolic (SBP) and diastolic BP (DBP) across all subjects while achieving a best case of 0.29/0.19 mmHg. Further, in the case where subject-specific data is scarce, we leverage transfer learning using multi-subject data, and show that our model outperforms State-of-the-Art (SOTA) methods across varying amounts of subject-specific data. | https://ieeexplore.ieee.org/abstract/document/10445970 | Artificial Intelligence |
Zero-Shot Intent Classification Using a Semantic Similarity Aware Contrastive Loss and Large Language Model | Authors: Jaejin Cho, Rakshith Sharma Srinivasa, Ching-Hua Lee, Yashas Malur Saidutta, Chouchang Yang, Yilin Shen, Hongxia Jin. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Zero-shot systems can reduce the cost of collecting data and training in a new domain since they can work directly with the test data without further training. In this paper, we build zero-shot systems for intent classification, based on a Semantic Similarity-aware Contrastive Loss (SSCL) that addresses an issue in the original contrastive loss (CL), which treats non-corresponding pairs indiscriminately. We confirm that SSCL outperforms CL through experiments. Then, we explore how including text or speech in-domain data during the SSCL training affects the out-of-domain intent classification. During the zero-shot classification, embeddings for a set of classes in the new domain are generated to calculate the similarities between each class embedding and an input utterance embedding, after which the most similar class is predicted for the utterance’s intent. Although manually-collected text sentences per class can be used to generate the class embedding, the data collection can be costly. Thus, we explore how to generate better class embeddings without human-collected text data in the target domain. The best proposed method, employing an instruction-tuned Llama2, a public large language model, shows performance comparable to the case where human-collected text data was used, implying the importance of accurate class embedding generation. | https://ieeexplore.ieee.org/document/10446276 | Artificial Intelligence |
Leveraging Self-Supervised Speech Representations for Domain Adaptation in Speech Enhancement | Authors: Ching-Hua Lee, Chouchang Yang, Rakshith Sharma Srinivasa, Yashas Malur Saidutta, Jaejin Cho, Yilin Shen, Hongxia Jin. Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) | Apr 14, 2024 | Deep learning based speech enhancement (SE) approaches could suffer from performance degradation due to mismatch between training and testing environments. A realistic situation is that an SE model trained on parallel noisy-clean utterances from one environment, the source domain, may fail to perform adequately in another environment, the target (new) domain of unseen acoustic or noise conditions. Even though we can improve the target domain performance by leveraging paired data in that domain, in reality, noisy data is more straightforward to collect. Therefore, it is worth studying unsupervised domain adaptation techniques for SE that utilize only noisy data from the target domain, together with exploiting the knowledge available from the source domain paired data, for improved SE in the new domain. In this paper, we present a novel adaptation framework for SE by leveraging self-supervised learning (SSL) based speech models. SSL models are pre-trained with a large amount of raw speech data to extract representations rich in phonetic and acoustic information. We explore the potential of leveraging SSL representations for effective SE adaptation to new domains. To our knowledge, it is the first attempt to apply SSL models for domain adaptation in SE. | https://ieeexplore.ieee.org/document/10447573 | Artificial Intelligence |
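
The sketches below illustrate, in simplified form, a few of the methods summarized in the table. First, the AlpaGasus entry describes filtering instruction-tuning data by having a strong LLM rate each instruction/response pair and keeping only highly rated examples. The following is a minimal sketch of that selection loop, assuming a hypothetical `score_with_llm` helper (in practice, a prompt to a model such as ChatGPT whose numeric rating is parsed from the reply) and an assumed cutoff of 4.5 on a 0-5 scale; it is not the authors' released code.

```python
from typing import Callable

def filter_ift_data(
    examples: list[dict],                         # each: {"instruction": ..., "response": ...}
    score_with_llm: Callable[[str, str], float],  # hypothetical LLM-based quality scorer
    threshold: float = 4.5,                       # assumed cutoff on a 0-5 rating scale
) -> list[dict]:
    """Keep only instruction/response pairs whose LLM quality score clears the threshold."""
    return [
        ex for ex in examples
        if score_with_llm(ex["instruction"], ex["response"]) >= threshold
    ]
```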
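
Several of the earbud entries report heart rate variability through RMSSD and SDNN computed from inter-beat intervals (IBI). The sketch below shows only those standard summary formulas; the papers' actual contributions (dynamic fine filtering, global-local outlier removal, Bayesian IBI estimation) are assumed to have produced the cleaned IBI series upstream.

```python
import numpy as np

def hrv_metrics(ibi_ms: np.ndarray) -> dict[str, float]:
    """Compute RMSSD (root mean square of successive IBI differences) and
    SDNN (standard deviation of the IBIs), both in milliseconds."""
    diffs = np.diff(ibi_ms)
    rmssd = float(np.sqrt(np.mean(diffs ** 2)))
    sdnn = float(np.std(ibi_ms, ddof=1))
    return {"RMSSD_ms": rmssd, "SDNN_ms": sdnn}

# Example with a short, roughly 75-bpm IBI series (values in ms).
print(hrv_metrics(np.array([812.0, 798.0, 805.0, 821.0, 790.0, 808.0])))
```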
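
Finally, the zero-shot intent classification entry predicts an utterance's intent by comparing its embedding against one embedding per candidate class and taking the most similar class. The sketch below covers only that similarity-and-argmax step; `utterance_emb` and `class_embs` are assumed to come from encoders trained with the paper's SSCL objective (or, in a new domain, from LLM-generated class descriptions), which are not reproduced here.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_intent(utterance_emb: np.ndarray,
                     class_embs: dict[str, np.ndarray]) -> str:
    """Return the intent label whose class embedding is most similar to the utterance."""
    return max(class_embs, key=lambda label: cosine(utterance_emb, class_embs[label]))
```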