Title | Details | Date | Abstract | Link | Research Areas |
---|---|---|---|---|---|
AlpaGasus: Training A Better Alpaca with Fewer Data |
Author: Lichang Chen et al. Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, Hongxia Jin Published: International Conference on Learning Representations (ICLR) |
May 7, 2024 | Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca’s 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches >90% performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. |
https://arxiv.org/abs/2307.08701 | Artificial Intelligence |
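The data selection strategy described above reduces to scoring each (instruction, response) pair with a strong LLM and keeping only the high-scoring pairs. Below is a minimal sketch of that filter; the `rate` stub, the 0-5 scale, and the 4.5 threshold are illustrative assumptions rather than the paper's exact prompt or settings.

```python
# Minimal sketch of LLM-based IFT data filtering (AlpaGasus-style).
# The grading scale, threshold, and rate() stub are illustrative
# assumptions, not the paper's exact configuration.
def rate(instruction: str, response: str) -> float:
    """Placeholder for a call to a strong LLM (e.g., ChatGPT) that
    returns a 0-5 quality score for the (instruction, response) pair."""
    raise NotImplementedError("wire up your LLM API here")

def filter_ift_data(examples, threshold=4.5):
    """Keep only examples the LLM grader scores at or above threshold."""
    kept = []
    for ex in examples:
        score = rate(ex["instruction"], ex["output"])
        if score >= threshold:
            kept.append(ex)
    return kept
```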
Multimodal Breathing Rate Estimation Using Facial Motion and RPPG From RGB Camera |
Author: Migyeong Gwak et al. Korosh Vatanparvar, Li Zhu, Nafiul Rashid, Moshin Ahmed, Jungmok Bae, Jilong Kuang, Alex Gao Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Camera-based respiratory monitoring is contactless, non-invasive, unobtrusive, and easily accessible compared to conventional wearable devices. This paper presents a novel multimodal approach to estimating breathing rate based on tracking the movement and color changes of the face through an RGB camera. A machine learning model selects the final breathing rate between two separately calculated estimates, one from breathing motion and one from remote photoplethysmography (rPPG), to improve measurement performance over a broader range of breathing frequencies. Our proposed pipeline is evaluated on 140 facial video recordings from 22 healthy subjects, including 6 controlled and 2 spontaneous breathing tasks ranging from 5 to 30 BPM. The estimation accuracy achieves a 1.33 BPM mean absolute error and an 86.53% pass rate within a 2 BPM error criterion. To the best of our knowledge, our approach outperforms previous works that use a face region alone with a single RGB camera. |
https://ieeexplore.ieee.org/document/10446086 | Artificial Intelligence |
Weakly Supervised Learning for Camera-Based Heart Rate Variability |
Author: Jeremy Speth et al. Korosh Vatanparvar, Li Zhu, Jilong Kuang, Alex Gao Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Camera-based pulse measurements from remote photoplethysmography (rPPG) have rapidly improved over recent years due to innovations in video processing and deep learning. However, modern data-driven solutions require large training datasets collected under diverse conditions. Collecting such training data is made more challenging by the need for time-synchronized video and physiological signals as ground truth. This paper presents a weakly supervised learning framework, Freq2Time, to train with heart rate (HR) labels. Our framework mitigates the need for simultaneous PPG or ECG as ground truth, since the HR changes relatively slowly and describes the target rPPG signal over a time interval. We show that 3D convolutional neural network (3DCNN) models trained with the Freq2Time framework give state-of-the-art HR performance with MAE of 2.86 bpm, when tested with challenging smartphone video data from 30 subjects. Additionally, our models still learn accurate rPPG time signals, allowing for other physiological metrics such as heart rate variability. |
https://ieeexplore.ieee.org/abstract/document/10446054 | Artificial Intelligence |
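Freq2Time's key idea is that an HR label constrains the dominant frequency of the target rPPG signal over an interval, so a model can be trained without a time-aligned PPG or ECG waveform. The sketch below shows one plausible frequency-domain loss of this kind; the band width, scalar-HR simplification, and loss form are assumptions, not the paper's exact objective.

```python
import torch

def freq2time_style_loss(pred_rppg, hr_bpm, fs=30.0, half_band_hz=0.1):
    """Weak-supervision sketch: push the predicted rPPG's spectral energy
    toward the labeled heart rate. Illustrative reconstruction only; the
    paper's exact objective may differ. hr_bpm is a scalar here (per-sample
    labels and batching are omitted for brevity)."""
    n = pred_rppg.shape[-1]
    spec = torch.fft.rfft(pred_rppg, dim=-1).abs() ** 2
    freqs = torch.fft.rfftfreq(n, d=1.0 / fs, device=pred_rppg.device)  # Hz
    target_hz = hr_bpm / 60.0
    band = (freqs >= target_hz - half_band_hz) & (freqs <= target_hz + half_band_hz)
    ratio = spec[..., band].sum(-1) / (spec.sum(-1) + 1e-8)
    return (1.0 - ratio).mean()  # minimize out-of-band energy fraction
```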
Heart Rate Variability Estimation with Dynamic Fine Filtering and Global-Local Context Outlier Removal |
Author: Ramesh Kumar Sah et al. Md. Mahbubar Rahman, Viswam Nathan, Li Zhu, Jungmok Bae, Christina Rosa, Wendy Berry Mendes, Jilong Kuang, Alex Jun Gao Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Consumer hearable technologies such as earbuds are increasingly embedding physiological sensors, including photoplethysmography (PPG) and inertial measurements. They create unique opportunities to passively monitor stress and deliver digital interventions such as music. However, PPG signals recorded from ear canals are often very noisy due to head movement and fit issues. This work proposes algorithms to estimate heart rate variability (HRV) features from noisy PPG signals recorded using earbuds. We have used template matching to determine the signal quality for dynamic fine filtering around the estimated heart rate. We have also improved the inter-beat interval (IBI) outlier detection and removal algorithm using the global-local context of the input PPG signal. The mean absolute error of estimating RMSSD decreased from 70.83 milliseconds (ms) to 24.88 ms, and SDNN decreased from 46.89 ms to 16.60 ms. |
https://ieeexplore.ieee.org/document/10447778 | Artificial Intelligence |
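The two ingredients named in the abstract, template matching for signal quality and narrow filtering around the estimated heart rate, can be sketched as follows. The filter order, bandwidth, and correlation-based quality index are illustrative choices, not the paper's exact parameters.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def dynamic_fine_filter(ppg, fs, hr_hz, half_bw=0.3):
    """Narrow band-pass around the estimated heart rate (dynamic fine
    filtering). Bandwidth and order are illustrative choices."""
    lo, hi = max(hr_hz - half_bw, 0.5), hr_hz + half_bw
    b, a = butter(3, [lo, hi], btype="bandpass", fs=fs)
    return filtfilt(b, a, ppg)

def sqi_by_template(beats, template):
    """Signal-quality index sketch: mean normalized correlation of each
    beat window against a clean-beat template. Assumes each beat has the
    same length as the template."""
    t = (template - template.mean()) / (template.std() + 1e-8)
    scores = []
    for beat in beats:
        x = (beat - beat.mean()) / (beat.std() + 1e-8)
        scores.append(float(np.dot(x, t)) / len(t))
    return float(np.mean(scores))
```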
Ballistocardiogram-Based Heart Rate Variability Estimation for Stress Monitoring using Consumer Earbuds |
Author: David J. Lin et al. Md Mahbubur Rahman, Li Zhu, Viswam Nathan, Jungmok Bae, Christina Rosa, Wendy B Mendes, Jilong Kuang, Alex J Gao Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Stress can potentially have detrimental effects on both physical and mental well-being, but monitoring it can be challenging, especially in free-living conditions. One approach to address this challenge is to use earbud accelerometers to capture the ballistocardiogram (BCG) response. These sensors allow for noninvasive stress monitoring by estimating physiological indicators linked to stress, such as heart rate variability (HRV). However, ear-worn devices are susceptible to motion artifacts and can exhibit significant BCG signal morphology variations. These challenges necessitate accurate algorithms to estimate HRV for everyday use. Therefore, we developed a method to measure interbeat intervals (IBI) from BCG signals collected from an earbud. To enhance IBI estimation accuracy, we employed a Bayesian method that incorporates robust a priori IBI prediction weighting and sensor fusion techniques. We also conducted a study involving 97 participants to assess the earbuds' ability to estimate HRV metrics and classify stressful activities. Our findings demonstrate low IBI estimation error (4.16% ± 1.90%), along with lower errors in subsequent higher-order HRV metrics compared to state-of-the-art algorithms. |
https://ieeexplore.ieee.org/document/10447280 | Artificial Intelligence |
Core Body Temperature and its Role in Detecting Acute Stress: A Feasibility Study |
Author: Mehrab Bin Morshed et al. Md Mahbubur Rahman, Viswam Nathan, Li Zhu, Jungmok Bae, Christina Rosa, Wendy Berry Mendes, Jilong Kuang, Alex Gao Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Core body temperature (CBT) is one of the critical yet under-explored phenomena in the context of stress detection. Several CBT measurement methods exist, but they are often limited in continuous CBT monitoring. Furthermore, how continuous CBT can be used to model acute stress is little explored. We address these challenges by conducting an in-lab controlled study in which 97 participants completed baseline and stress-inducing tasks while wearing prototype earbuds capable of collecting CBT. We found that accounting for changes from individual baselines in CBT results in acute stress detection with 94.88% accuracy and a 94.4% F1-score, which is 29.31% and 26.07% higher in terms of accuracy and F1-score, respectively, compared to generalized features. |
https://ieeexplore.ieee.org/abstract/document/10447599 | Artificial Intelligence |
Joint End-to-End Spoken Language Understanding and Automatic Speech Recognition Training Based on Unified Speech-to-Text Pre-Training |
Author: Eesung Kim et al. Yun Tang, Taeyeon Ki, Divya Neelagiri, Vijendra Raj Apsingek Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Modern spoken language understanding (SLU) approaches optimize the system in an end-to-end (E2E) manner. This approach offers two key advantages. Firstly, it helps mitigate error propagation from upstream systems. Secondly, combining various information types and optimizing them towards the same objective is straightforward. In this study, we attempt to build an SLU system by integrating information from two modalities, i.e., speech and text, and concurrently optimizing the associated tasks. We leverage a pre-trained model built with speech and text data and fine-tune it for the E2E SLU tasks. The SLU model is jointly optimized with automatic speech recognition (ASR) and SLU tasks under single-mode and dual-mode schemes. In the single-mode model, ASR and SLU results are predicted sequentially, whereas the dual-mode model predicts either ASR or SLU outputs based on the task tag. Our proposed method demonstrates its superiority through benchmarking against FSC, SLURP, and in-house datasets, exhibiting improved intent accuracy, SLU-F1, and Word Error Rate (WER). |
https://ieeexplore.ieee.org/document/10447509 | Artificial Intelligence |
End-To-End Personalized Cuff-Less Blood Pressure Monitoring Using ECG and PPG Signals |
Author: Suhas BN et al. Rakshith Sharma Srinivasa, Yashas Malur Saidutta, Jaejin Cho, Ching-Hua Lee, Chouchang Yang, Yilin Shen, Hongxia Jin Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Cuffless blood pressure (BP) monitoring offers the potential for continuous, non-invasive healthcare but has been limited in adoption by existing models relying on handcrafted features from ECG and PPG signals. To overcome this, researchers have looked to deep learning. Along these lines, in this paper, we introduce a novel end-to-end model based on transformers. Further, we introduce a novel contrastive-loss-based objective for robust training. To study the limits of performance for our proposed ideas, we first study personalized models trained on large subject-specific datasets, and achieve an average mean absolute error of 1.08/0.68 mmHg for systolic (SBP) and diastolic BP (DBP) across all subjects while achieving a best case of 0.29/0.19 mmHg. Further, in the case where subject-specific data is scarce, we leverage transfer learning using multi-subject data, and show that our model outperforms state-of-the-art (SOTA) methods across varying amounts of subject-specific data. |
https://ieeexplore.ieee.org/abstract/document/10445970 | Artificial Intelligence |
Zero-Shot Intent Classification Using a Semantic Similarity Aware Contrastive Loss and Large Language Model |
Author: Jaejin Cho et al. Rakshith Sharma Srinivasa, Ching-Hua Lee, Yashas Malur Saidutta, Chouchang Yang, Yilin Shen, Hongxia Jin Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Zero-shot systems can reduce the cost of collecting data and training in a new domain since they can work directly with the test data without further training. In this paper, we build zero-shot systems for intent classification based on a Semantic Similarity-aware Contrastive Loss (SSCL) that addresses an issue in the original CL, which treats non-corresponding pairs indiscriminately. We confirm through experiments that SSCL outperforms CL. Then, we explore how including in-domain text or speech data during SSCL training affects out-of-domain intent classification. During zero-shot classification, embeddings for a set of classes in the new domain are generated to calculate the similarities between each class embedding and an input utterance embedding, after which the most similar class is predicted as the utterance's intent. Although manually collected text sentences per class can be used to generate the class embeddings, the data collection can be costly. Thus, we explore how to generate better class embeddings without human-collected text data in the target domain. The best proposed method, employing an instruction-tuned Llama2, a public large language model, shows performance comparable to the case where human-collected text data was used, implying the importance of accurate class embedding generation. |
https://ieeexplore.ieee.org/document/10446276 | Artificial Intelligence |
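The zero-shot classification step the abstract describes is essentially a cosine-similarity argmax between an utterance embedding and per-class embeddings. A minimal sketch, assuming the embedding models and class-sentence generation are handled elsewhere:

```python
import numpy as np

def zero_shot_intent(utterance_emb, class_embs):
    """Predict the intent whose class embedding is most similar (cosine)
    to the utterance embedding. utterance_emb: (d,), class_embs: (C, d).
    Sketch of the inference step only; training uses SSCL per the paper."""
    u = utterance_emb / np.linalg.norm(utterance_emb)
    c = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    sims = c @ u                    # cosine similarity per class
    return int(np.argmax(sims)), sims
```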
Leveraging Self-Supervised Speech Representations for Domain Adaptation in Speech Enhancement |
Author: Ching-Hua Lee et al. Chouchang Yang, Rakshith Sharma Srinivasa, Yashas Malur Saidutta, Jaejin Cho, Yilin Shen, Hongxia Jin Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Deep learning based speech enhancement (SE) approaches can suffer from performance degradation due to mismatch between training and testing environments. A realistic situation is that an SE model trained on parallel noisy-clean utterances from one environment, the source domain, may fail to perform adequately in another environment, the target (new) domain of unseen acoustic or noise conditions. Even though we can improve target domain performance by leveraging paired data in that domain, in reality noisy data is more straightforward to collect. Therefore, it is worth studying unsupervised domain adaptation techniques for SE that utilize only noisy data from the target domain, together with the knowledge available from the source domain paired data, for improved SE in the new domain. In this paper, we present a novel adaptation framework for SE that leverages self-supervised learning (SSL) based speech models. SSL models are pre-trained with large amounts of raw speech data to extract representations rich in phonetic and acoustic information. We explore the potential of leveraging SSL representations for effective SE adaptation to new domains. To our knowledge, this is the first attempt to apply SSL models to domain adaptation in SE. |
https://ieeexplore.ieee.org/document/10447573 | Artificial Intelligence |
An MVDR-Embedded U-Net Beamformer for Effective and Robust Multichannel Speech Enhancement |
Author: Ching-Hua Lee et al. Kashyap Patel, Chouchang Yang, Yilin Shen, Hongxia Jin Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | In multichannel speech enhancement (SE) systems based on beamforming, deep neural networks (DNNs) are often used to estimate beamformer weights directly. This approach, however, may not generalize well to new acoustic conditions. Alternatively, DNNs can predict T-F masks for speech and noise patterns that can be used with statistical beamforming. This approach is robust, but its performance is constrained by the latter component, which relies on certain modeling assumptions, e.g., covariance-based modeling in the minimum-variance-distortionless-response (MVDR) beamformer. In this paper, we propose a novel integration of the two types of methodology by introducing an intra-MVDR module embedded in the U-Net architecture that combines the merits of both, i.e., effectiveness and robustness. Simulation results show that the proposed MVDR-embedded U-Net leads to SE improvements that are not achievable by simply enlarging the network with baseline approaches. |
https://ieeexplore.ieee.org/document/10448366 | Artificial Intelligence |
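For reference, the statistical component being embedded is the classical MVDR solution, whose per-frequency weights follow from a noise PSD matrix and a steering vector. A standalone sketch of that step (not the paper's U-Net integration):

```python
import numpy as np

def mvdr_weights(noise_psd, steering):
    """Classical MVDR beamformer weights for one frequency bin:
    w = R^-1 d / (d^H R^-1 d), with R the (M x M) noise PSD matrix and
    d the (M,) steering vector. Standalone statistical step only."""
    r_inv_d = np.linalg.solve(noise_psd, steering)
    return r_inv_d / (steering.conj() @ r_inv_d)
```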
Leveraging automated knowledge transfer to enable smart home planning capabilities of small language model |
Author: Sudipta Paul et al. Lingyu Zhang, Yilin Shen, Hongxia Jin Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Smart home device control is a difficult task when the instruction is abstract and the planner needs to adapt to dynamic home configurations. With their increasing capability, Large Language Models (LLMs) have become the customary choice for zero-shot planning tasks such as smart home device control. Although cloud-supported large language models can seamlessly perform device control tasks, on-device small language models show limited capabilities. In this work, we show how we can leverage large language models to enable small language models for the device control task. Toward this goal, we develop an automated system that generates device control planning data using a large language model, and we use the generated data to finetune small language models. We empirically validate the improvement in small language models' performance on the device control task. |
https://ieeexplore.ieee.org/document/10446064 | Artificial Intelligence |
Extremely Light-Weight Learning Based LDR to PQ HDR Conversion Using Bernstein Curves |
Author: Dung Vo et al. Chenguang Liu, McClain Nelson Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | The paper proposes a novel automatic Low Dynamic Range (LDR) to Perceptual Quantizer (PQ) High Dynamic Range (HDR) system to convert an LDR input into an HDR output. The process is based on a machine learning model that generates a Bernstein inverse tone mapping (iTM) curve. A monotonicity constraint is also imposed to maintain the curve's monotonic characteristic. The proposed model is kept very small, and the iTM is implemented pixel by pixel, so the whole system is extremely light-weight. Experimental results show that the proposed iTM can learn the LDR-to-HDR conversion style of the experts and outperforms other methods. |
https://ieeexplore.ieee.org/abstract/document/10447932 | Artificial Intelligence |
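A Bernstein inverse-tone-mapping curve is a weighted sum of Bernstein basis polynomials, and monotonicity can be guaranteed by keeping the coefficients non-decreasing. A small sketch of the curve evaluation, with the coefficients assumed to come from the learned model:

```python
import numpy as np
from scipy.special import comb

def bernstein_itm(x, coeffs):
    """Evaluate a Bernstein-polynomial inverse-tone-mapping curve at the
    points x in [0, 1] (x is a 1-D array). Non-decreasing coeffs give a
    monotonic curve, matching the constraint in the abstract. Sketch only;
    the paper's model predicts coeffs per image."""
    n = len(coeffs) - 1
    k = np.arange(n + 1)[:, None]                              # (n+1, 1)
    basis = comb(n, k) * x[None, :]**k * (1.0 - x[None, :])**(n - k)
    return coeffs @ basis                                      # curve at x
```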
Multi-Person Respiration Rate Estimation With Single Pair Of Transmit And Receive Antenna |
Author: Hao-Hsuan Chang et al. Vishnu Ratnam, Hao Chen, Junsu Choi, Charlie Jianzhong Zhang Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Human respiration rate (RR) estimation is essential for various healthcare applications, such as sleep apnea detection and early diagnosis of chronic obstructive pulmonary disease. Recently, radio-frequency-based RR estimation has achieved high accuracy for single-person RR detection. However, multi-person RR estimation is still the obstacle blocking wide commercialization of RF-sensing-based RR solutions. In this paper, a novel multi-person RR estimation algorithm that can overcome the frequency resolution limit is presented. The proposed algorithm is not only analytically justified but also verified on a real test-bed involving commercial off-the-shelf WiFi devices. Extensive experimental results show 98% accuracy in people-counting and a root mean square error (RMSE) of 0.13 breaths per minute (bpm) in RR detection. To the best of our knowledge, this is the first WiFi sensing work that can detect different people who share the same RR using only a single pair of transmit and receive antennas. |
https://ieeexplore.ieee.org/document/10446996 | Artificial Intelligence |
Normalization is All You Need: Robust Full-Range Contactless SpO2 Estimation Across Users |
Author: Qijia Shao et al. Li Zhu, Moshin Ahmed , Korosh Vatanparvar, Migyeong Gwak, Nafiul Rashid, Jungmok Bae, Jilong Kuang, Alex Gao Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | The accurate estimation of peripheral capillary oxygen saturation (SpO2) is vital for monitoring respiratory health, with applications spanning medical diagnostics and fitness tracking. Remote photoplethysmography (rPPG) offers a convenient and non-contact approach to SpO2 estimation. However, existing methods predominantly rely on data within the normal SpO2 range, hindering their effectiveness during hypoxemia. Moreover, cross-user variation poses significant challenges for practicality. To address these limitations, we propose a simple yet effective normalization-based SpO2 estimation algorithm. By aligning individual Ratio-of-Ratios (RoR) data with a standard model at the matching SpO2 level, we mitigate cross-user variation, accommodate different camera configurations, and account for lighting changes. Our experiments demonstrate that the proposed method achieves an RMSE of 2.8% with leave-one-subject-out cross-validation across the full SpO2 range (70%-100%), significantly outperforming existing RoR-based and CNN-based SpO2 estimation approaches. Notably, our method excels in accurately identifying hypoxemia, a critical clinical requirement. We anticipate broader applicability of our approach in rPPG-based vital sign monitoring, underlining the potential for enhancing robustness and reliability in various domains. |
https://ieeexplore.ieee.org/document/10446435 | Artificial Intelligence |
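For context, the Ratio-of-Ratios underlying the method is the classical pulse-oximetry feature, typically mapped to SpO2 with a linear calibration. The sketch below uses illustrative calibration coefficients and omits the paper's per-user normalization step:

```python
import numpy as np

def ratio_of_ratios(red, other):
    """Classical RoR from two rPPG color channels: (AC/DC of red) divided
    by (AC/DC of the second channel, e.g., blue or green per the abstract).
    std/mean is used as a simple AC/DC proxy for this sketch."""
    return (red.std() / red.mean()) / (other.std() / other.mean())

def spo2_from_ror(ror, a=110.0, b=25.0):
    """Classical linear pulse-oximetry model SpO2 = a - b * RoR.
    Coefficients a, b are illustrative, not the paper's calibration."""
    return a - b * ror
```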
Unified sRGB Real Noise Synthesizing with Adaptive Feature Modulation |
Author: Wenbo Li et al. Zhipeng Mo, Yilin Shen, Hongxia Jin Published: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
Apr 14, 2024 | Recently, the Neighboring Correlation-Aware (NeCA) noise model has achieved impressive performance on both noise synthesis and the downstream image denoising task. However, its design regarding noise-level prediction requires training NeCA separately for each camera type. To this end, by making use of an adaptive feature modulation technique, we improve NeCA's noise-level prediction model to be unified across different camera types and thus enable a unified sRGB real noise synthesis method. We also find that in the neighboring correlation network of NeCA, there is no mechanism to maintain the signal dependency of the synthesized noise. Therefore, we introduce another adaptive feature modulation technique into the neighboring correlation network to maintain the signal dependency of the noise. |
https://ieeexplore.ieee.org/abstract/document/10447546 | Artificial Intelligence |
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss |
Author: Rakshith Srinivasa et al. Jaejin Cho, Chouchang Yang, Yashas Malur Saidutta, Chinghua Lee, Yilin Shen, Hongxia Jin Published: Neural Information Processing Systems (NeurIPS) |
Dec 10, 2023 | This paper considers contrastive training for cross-modal 0-shot transfer wherein a pre-trained model in one modality is used for representation learning in another domain using pairwise data. The learnt models in the latter domain can then be used for a diverse set of tasks in a 0-shot way, similar to Contrastive Language-Image Pre-training (CLIP) and Locked-image Tuning (LiT) that have recently gained considerable attention. Classical contrastive training employs sets of positive and negative examples to align similar and repel dissimilar training data samples. However, similarity amongst training examples has a more continuous nature, thus calling for a non-binary treatment. To address this, we propose a new contrastive loss function called Continuously Weighted Contrastive Loss (CWCL) that employs a continuous measure of similarity. With CWCL, we seek to transfer the structure of the embedding space from one modality to another. Owing to the continuous nature of similarity in the proposed loss function, these models outperform existing methods for 0-shot transfer across multiple models, datasets and modalities. By using publicly available datasets, we achieve 5-8% (absolute) improvement over previous state-of-the-art methods in 0-shot image classification and 20-30% (absolute) improvement in 0-shot speech-to-intent classification and keyword classification. |
https://openreview.net/pdf?id=hz10oiVMNE | Artificial Intelligence |
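The change CWCL makes to standard contrastive training is to replace the one-hot positive per row with continuous pairwise weights. A minimal PyTorch sketch under that reading; the exact weighting scheme used in the paper may differ:

```python
import torch
import torch.nn.functional as F

def cwcl_loss(z_a, z_b, sim_weights, temperature=0.1):
    """Continuously Weighted Contrastive Loss (sketch). z_a, z_b are
    (B, d) embeddings from the two modalities; sim_weights is a (B, B)
    matrix of continuous similarities in [0, 1] replacing the usual
    one-hot positives. Illustrative reconstruction of the idea only."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.T / temperature
    log_p = F.log_softmax(logits, dim=-1)
    w = sim_weights / sim_weights.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return -(w * log_p).sum(dim=-1).mean()  # weighted cross-entropy per row
```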
Training Energy-Based Normalizing Flow with Score-Matching Objectives |
Author: Yen-Chang Hsu Published: Neural Information Processing Systems (NeurIPS) |
Dec 10, 2023 | In this paper, we establish a connection between the parameterization of flow-based and energy-based models, and present a new flow-based modeling approach called energy-based normalizing flow (EBFlow). We demonstrate that by optimizing EBFlow with score-matching objectives, the computation of Jacobian determinants for linear transformations can be entirely bypassed. This feature enables the use of arbitrary linear layers in the construction of flow-based models without increasing the asymptotic complexity of each training iteration. In addition to the reduction in runtime, we enhance the training stability and empirical performance of EBFlow through a number of techniques developed for score-matching methods. Our experimental results demonstrate that our approach exhibits improved efficiency compared to maximum likelihood estimation, and outperforms the other flow-based models trained using score-matching methods in recent literature. |
https://arxiv.org/pdf/2305.15267.pdf | Artificial Intelligence |
DVSOD: RGB-D Video Salient Object Detection |
Author: Jingjing Li et al. Wei Ji, Size Wang, Wenbo Li, Li Cheng Published: 37th Conference on Neural Information Processing Systems (NeurIPS 2023) |
Dec 10, 2023 | Salient object detection (SOD) aims to identify standout elements in a scene, with recent advancements primarily focused on integrating depth data (RGB-D) or temporal data from videos to enhance SOD in complex scenes. However, the unison of two types of crucial information remains largely underexplored due to data constraints. To bridge this gap, we in this work introduce the DViSal dataset, fueling further research in the emerging field of RGB-D video salient object detection (DVSOD). Our dataset features 237 diverse RGB-D videos alongside comprehensive annotations, including object and instance-level markings, as well as bounding boxes and scribbles. These resources enable a broad scope for potential research directions. We also conduct benchmarking experiments using various SOD models, affirming the efficacy of multimodal video input for salient object detection. Lastly, we highlight some intriguing findings and promising future research avenues. To foster growth in this field, our dataset and benchmark results are publicly accessible at: https://dvsod.github.io/. |
https://papers.nips.cc/paper_files/paper/2023/file/1b88e65f737256d437e56764d39ba06d-Paper-Datasets_and_Benchmarks.pdf | Artificial Intelligence |
Loudspeaker position identification using human speech directivity index |
Author: Adrian Celestinos et al. Carren Zhongran Wang, Victor Manuel Chin Lopez Published: Audio engineering society convention (AES) |
Oct 25, 2023 | Extended Summary: Often, a user of a multichannel loudspeaker system in a regular living room places the loudspeakers in a non-uniform manner, with angles that don't necessarily follow the recommended ITU-R BS.2159-4 standard and with inconsistent distances from each speaker to the listener. By identifying the physical loudspeakers' locations, a spatial correction can be applied to recreate the artistic intention of the producer. The main goal of this proposal is to obtain the user/listener location with respect to the loudspeakers, assuming a multichannel audio system equipped with N loudspeakers and M very near-field (NF) microphones attached to each speaker. This is done by using a supervised machine learning (ML) model trained with the human speech directivity index (DI) computed by room simulations, where the sound source is the typical directivity radiation pattern of human speech and the NF receivers attached to the loudspeakers are located around the listener. The DI represents the ratio of acoustical energy in one specific direction to that in all directions. The human voice presents a unique directivity pattern that is frequency, angle/direction, and distance dependent; the computed DI carries that information. Assuming the setup described above, with multiple microphones/receivers placed in a room and a human speaker as the source, the DI can be extracted from a multichannel voice command recorded from the user. The neural network (NN) is trained with DI data computed from in-room simulations of human speech. An image-source room simulation model is utilized to replicate typical human speech recorded by receivers placed in typical loudspeaker positions around the source (user). Since the NF microphones are attached to the loudspeakers as close as possible to the driver, their directivity is affected by the loudspeaker baffle; the simulation model therefore includes the NF microphone directivity. Typical female and male directivity was included for the source in the simulations. A customized room generator was used to create shoe-box room setups of various sizes. Each setup has material absorption coefficients chosen from a selection pool, with randomized receiver and source locations within some limits. A total of 39 rooms were simulated, spanning three room sizes from 80 to 300 cubic meters. A total of 1140 setups, consisting of 570 x two gender sets of IR data, were computed. Each setup includes four channels of simulated IRs per gender. The result of the simulation was impulse responses (IRs) at the NF microphones/receivers, which were then convolved with anechoic male and female mono recordings. The convolved audio thus represents the voice command audio that the loudspeakers' multichannel NF mics are supposed to "record" in each simulation case. The data was split as 80% for testing, 10% for training and 10% for validation. Before passing the data to the NN model, principal component analysis (PCA) was utilized to increase interpretability and reduce dimensionality; dB values were converted to linear amplitude values to facilitate the PCA analysis. The training, test, and validation sets were then passed to two NN models, one for the distance to the user and one for the incidence angle. The distance NN model included an input layer, two hidden layers, and an output layer. The angle estimation network included an input |
https://www.aes.org/e-lib/browse.cfm?elib=22292 | Digital Media |
In-Ear Headphones on Ear Canal Simulator vs Real Human Ear Geometries: Quantifying the Differences with Simulations |
Author: Andri Bezzola Published: Audio engineering society convention (AES) |
Oct 25, 2023 | Measurements of the pressure at the ear Drum Reference Point (DRP) in humans are extremely difficult and bear a high risk of injury to the eardrum and ear canal. To circumvent this challenge, ear simulators have been developed to predict a population average of the pressure response at DRP; that is, the results of the simulator should mimic the acoustic behavior of an average human ear. However, these simulators do not predict the range and variance of the pressure responses at DRP, only their average. Recent research by Olive et al. suggests that there is significant variance in the frequency response of over-ear headphones, and they emphasize the need for personalized headphone solutions [1]. We can estimate the pressure response at DRP for in-ear headphones outfitted with microphones in the cavity between headphone driver and eardrum by means of a transfer function G(f) that relates the pressure at DRP to the pressure inside the earbud: |D(f)| = |H(f)| + |G(f)|, where |H(f)| is the magnitude of the pressure response in dB at the microphone inside the earbud, |D(f)| is the magnitude of the pressure response in dB at DRP, and G(f) is the transfer function from the pressure at the earbud microphone to the pressure at DRP. If an accurate (personalized) function G_p(f) can be found, then we can accurately predict D_p(f), but obtaining an accurate G_p(f) by measurement is nearly impossible in live humans, and using an average approximation G_a(f) does not accurately describe the personalized sound pressure response at DRP. To quantify the variance of the error between the average G_a(f) provided by the simulator and the personalized G_p(f) in a real human ear, we used finite element simulations of MRI scans of 10 subjects (20 ears) at five different insert depths each. The MRI scans were obtained from the openly available "IHA database of human geometries including torso, head, and complete outer ears for acoustic research" [2]. From these simulations we obtained a spread of personalized H_p(f) and D_p(f), from which we could calculate 100 different personalized G_p(f). This ensemble of personalized G_p(f) was then compared to the G_a(f) obtained on a G.R.A.S. HATS simulator outfitted with 711 couplers. The range of errors observed at frequencies below 1 kHz was less than ±1 dB, assuming no leakage in all cases. Between 1 kHz and 4 kHz, the range of errors was within ±5 dB, and above 4 kHz the error grows to ±15 dB or more. These results align well with the observations of Olive et al. that personalized solutions are needed in headphones, just as room equalization is needed for an optimal listening experience with a conventional speaker setup. The differences in the shapes of ear canals are large enough to warrant the investigation of personalized equalization, which can deviate from the standard voicing that is often performed on ear canal simulators. |
https://www.aes.org/e-lib/browse.cfm?elib=22277 | Digital Media |
Implementation of Simultaneous Deconvolution on a Real-time Smartphone App |
Author: Ashish Rawat et al. Sunil Bharitkar, Allan Devantier, Matthew Ryan McDuffee, Ritesh Banka Published: Audio engineering society convention (AES) |
Oct 25, 2023 | In-room speaker system equalization was traditionally implemented by exciting one speaker at a time. With a higher number of speakers, restrictions of measurement microphone setup, the annoyance factor due to traditional stimuli, and background noises, the process of measuring the impulse response of a multi-channel system in real-time can be cumbersome. With FFT computation restrictions on a smartphone DSP, the accuracy and resolution of the impulse responses are compromised. This paper addresses all of these concerns with a novel approach to implementing the Simultaneous Deconvolution of a multichannel speaker system. It uses a set of circularly shifted Sine-Sweep stimuli to excite the speakers and calculate the impulse responses in real-time on a smartphone app over a cloud-based architecture. An independent recording and playback system, along with manual delays or system delays due to Bluetooth, Wi-Fi, or cloud-based communication, pose further challenges to the accuracy of our measurements. To surmount these complications, we discuss a time-alignment method that uses bin-wise matched filtering of spectrograms, followed by a statistical analysis of its results. |
https://www.aes.org/e-lib/browse.cfm?elib=22287 | Digital Media |
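The simultaneous approach rests on exciting all speakers with circularly shifted copies of one sweep, so a single deconvolution separates the per-speaker impulse responses by their shift offsets. A bare-bones sketch under a circular-convolution assumption (the app's time-alignment and matched-filtering stages are omitted):

```python
import numpy as np

def shifted_sweeps(sweep, n_channels):
    """Build circularly shifted copies of one sine sweep so that N
    loudspeakers can be excited at once and their responses separated
    after deconvolution. Shift spacing is an illustrative choice."""
    n = len(sweep)
    shift = n // n_channels
    return np.stack([np.roll(sweep, k * shift) for k in range(n_channels)])

def deconvolve(recorded, sweep):
    """Circular deconvolution of the mic capture by the base sweep; each
    speaker's IR then appears at its own circular-shift offset. The small
    epsilon guards against near-zero bins in the sweep spectrum."""
    h = np.fft.ifft(np.fft.fft(recorded) / (np.fft.fft(sweep) + 1e-12))
    return np.real(h)
```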
Perceptually Motivated Bitrate Allocation for Object-Based Audio Using Opus Codec |
Author: Toni Hirvonen et al. Carlos Tejeda Ocampo, Ema Souza Blanes, Sunil Bharitkar Published: Audio engineering society convention (AES) |
Oct 25, 2023 | We reviewed the performance of the Opus codec for object-based audio using two different bit-allocation strategies: a vanilla method that uses the same bitrate for each object, and a joint allocation method that distributes the total bitrate among objects according to their energy-based perceptual importance. The proposed joint allocation significantly outperformed the vanilla method at the same total bitrate and achieved an Excellent score in MUSHRA testing. |
https://www.aes.org/e-lib/browse.cfm?elib=22279 | Digital Media |
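An energy-proportional split of the total bitrate, as the joint method describes, can be sketched in a few lines. The dB-to-linear weighting and per-object floor are illustrative assumptions, not the paper's exact importance model:

```python
import numpy as np

def allocate_bitrates(object_energies_db, total_kbps, floor_kbps=16.0):
    """Split a total bitrate across audio objects in proportion to an
    energy-based importance weight, with a per-object floor (assumes
    total_kbps >= floor_kbps * number of objects). Sketch only."""
    e = np.asarray(object_energies_db, dtype=float)
    w = 10.0 ** (e / 10.0)                 # dB -> linear energy
    w = w / w.sum()                        # normalized importance weights
    return floor_kbps + w * (total_kbps - floor_kbps * len(e))
```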
Excitation Stimuli For Simultaneous Deconvolution of Room Responses |
Author: Sunil Bharitkar et al. Ema Souza Blanes, Pascal Brunet Published: Audio engineering society convention (AES) |
Oct 25, 2023 | This paper compares three state-of-the-art stimuli (multitone-pink, MLS, and log-sweep) for the deconvolution of several loudspeaker-room impulse responses using a single time-domain measurement after exciting all the loudspeakers simultaneously. A Bayesian hyper-parameter optimization algorithm constructs the stimulus, where the algorithm optimizes the stimuli parameters by minimizing a time-domain error between the actual impulse responses and the simultaneously deconvolved responses over a training dataset. Objective results are presented for the various stimuli on a test dataset, whereas subjective tests compare the preference for the excitation stimuli played on all the loudspeakers. Additionally, the robustness of the constructed stimuli to various noises at different signal-to-noise ratios (SNR) is compared in the context of simultaneous deconvolution. |
https://www.aes.org/e-lib/browse.cfm?elib=22321 | Digital Media |
Advances in Perceptual Bass Extension for Music and Cinematic Content |
Author: Sunil Bharitkar et al. Ema Souza Blanes, Glenn S Kubota, Ashish Rawat Published: Audio engineering society convention (AES) |
Oct 25, 2023 | Small-form-factor and thin devices exhibit a high-pass frequency response due to loudspeaker-enclosure constraints. The low-frequency reproduction loss from these devices severely degrades the audio experience for music and cinematic content. In this paper, we present a new perceptual bass extension model using a side chain for music and cinematic content, leveraging the principle of the missing fundamental frequency. Optimizing the nonlinear function parameters enables the nonlinear function output to be invariant to input signal level changes. The model employs a unique input gain normalization scheme based on loudness metadata and level-matching between multiple side chains. A loudness compensation algorithm restores the perception of the lost bass, particularly at low playback levels. Subjective testing and perceptually derived objective metrics using television (TV) loudspeakers validate the performance of the approach. |
https://www.aes.org/e-lib/browse.cfm?elib=22055 | Digital Media |
Application of ML-Based Time Series Forecasting to Audio Dynamic Range Compression |
Author: Pascal Brunet et al. Yuan Li, Retiree Published: Audio engineering society convention (AES) |
Oct 25, 2023 | Time Series Forecasting (TSF) is used in astronomy, geology, weather forecasting, and finance, to name a few. Recent research [1] has shown that, combined with Machine Learning (ML) techniques, TSF can be applied successfully to short-term prediction of music signals. We present here an application of this approach to the prediction of music levels and Dynamic Range Compression (DRC). Look-ahead prediction of the audio level allows compression to be applied just in time, avoiding the latency and attack/release time constants that are inherent to traditional DRC and difficult to tune. |
https://www.aes.org/e-lib/browse.cfm?elib=22266 | Digital Media |
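The benefit of look-ahead prediction is that the compressor gain can be computed from the predicted future level and applied exactly when needed, with no attack/release smoothing. A static-curve sketch, with the TSF predictor assumed external and the threshold/ratio values arbitrary:

```python
import numpy as np

def lookahead_drc_gain(predicted_level_db, threshold_db=-20.0, ratio=4.0):
    """Compressor gain from a *predicted* future level, so it can be
    applied just in time without attack/release time constants. The
    threshold and ratio are illustrative; the ML level predictor that
    supplies predicted_level_db is assumed external."""
    over = np.maximum(predicted_level_db - threshold_db, 0.0)
    gain_db = -over * (1.0 - 1.0 / ratio)  # standard static compression curve
    return 10.0 ** (gain_db / 20.0)        # linear gain to apply
```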
Explainable and Accurate Natural Language Understanding for Voice Assistants and Beyond |
Author: Kalpa Gunaratna et al. Vijay Srinivasan, Hongxia Jin Published: International Conference on Information and Knowledge Management (CIKM) |
Oct 21, 2023 | Joint intent detection and slot filling is invaluable for smart voice assistants; this task is also termed joint NLU (Natural Language Understanding). Most of the recent advancements in this area have aimed to improve accuracy using various techniques. Explainability is undoubtedly an important aspect for deep learning-based models, including joint NLU models, since they are considered black-box models. Their decisions are opaque to the outside world and hence tend to lack user trust. In this proposed work, we show that it is possible to make the full joint NLU model inherently explainable at granular levels of explanation without compromising on accuracy. Further, as we make the full joint NLU model explainable, we show that our extensions can be used in other general classification tasks such as sentiment analysis and named entity recognition (NER). |
https://dl.acm.org/doi/pdf/10.1145/3583780.3615277 | Artificial Intelligence |
Massive MIMO Evolution Towards 3GPP Release 18 |
Author: Gilwon Lee et al. Emad Farag, Dalin Zhu, Eko Onggosanusi Published: IEEE Journal on Selected Areas in Communications |
Oct 19, 2023 | Since the introduction of fifth-generation new radio (5G-NR) systems in Third Generation Partnership Project (3GPP) Release 15, swift progress has been made to evolve 5G. Recently, 3GPP began work on Release 18, a.k.a. 5G-Advanced, to standardize systems beyond 5G. 5G-NR can be distinguished from its predecessors in several different aspects. One critical aspect is the design of the radio access network (RAN) via massive multiple-input multiple-output (MIMO) technology, resulting in superior coverage, spectral efficiency, and reliability. This paper makes several important contributions. We first provide a comprehensive overview of the evolution of standardized massive MIMO features from 3GPP Release 15 to 17 for both time/frequency-division duplex (TDD/FDD) operation across both 3GPP frequency ranges, i.e., FR-1 (microwave) and FR-2 (millimeter-wave). In particular, we present the common massive MIMO architectures in commercial radio products, analyze the progress on channel state information (CSI) acquisition frameworks for single- and multi-user (UE) operation, study beam management/indication frameworks for FR-2 bands, and present enhancements for uplink CSI. Secondly, by comparing concepts proposed for the 3GPP Release 18 Work Item relative to those for contemporaneous releases, we shed light on the emerging problems requiring imminent attention at the physical layer. These include advanced codebook design (for FDD) and sounding reference signal design (for TDD systems) optimized for coherent joint transmission (CJT) from multiple transmission/reception points (multi-TRPs); advancements in uplink demodulation reference signal design to support higher-order massive MIMO; enhancements for mobility to provide accurate CSI estimates; and a unified transmission configuration indicator (TCI) framework tailored for FR-2-based multi-TRPs to reduce beam switch latency. For each of these concepts, we provide comprehensive system-level simulation results to highlight their performance gains relative to systems standardized in Releases 15 to 17. Thirdly, via real-world field trials in an outdoor urban environment at Shanghai Jiaotong University, China, we demonstrate the practical gains of multi-TRP CJT with real-time constraints relative to single-TRP transmissions. To the best of our knowledge, a contribution of this type, which amalgamates massive MIMO evolution with novel concepts and results for 3GPP Release 18, has been missing from the literature. |
https://arxiv.org/ftp/arxiv/papers/2210/2210.08218.pdf | Next Generation Communications |
Compositional Generalization in Spoken Language Understanding |
Author: Yilin Shen et al. Hongxia Jin Published: Annual Conference of the International Speech Communication Association (INTERSPEECH) |
Aug 24, 2023 | State-of-the-art spoken language understanding (SLU) models have shown tremendous success on benchmark SLU datasets, yet they still fail in many practical scenarios due to a lack of model compositionality when trained on limited training data. In this paper, we study two types of compositionality: novel slot combination and length generalization. We first conduct an in-depth analysis and find that state-of-the-art SLU models often learn spurious slot correlations during training, which leads to poor performance in both compositional cases. To mitigate these limitations, we create the first compositional splits of benchmark SLU datasets, and we propose the first compositional SLU model, including a compositional loss and paired training that tackle each compositional case respectively. On both benchmark and compositional splits of ATIS and SNIPS, we show that our compositional SLU model significantly outperforms the state-of-the-art BERT SLU model. |
https://www.isca-speech.org/archive/pdfs/interspeech_2023/ray23_interspeech.pdf | Artificial Intelligence |
Robust Keyword Spotting for Noisy Environments by Leveraging Speech Enhancement and Speech Presence Probability |
Author: Chouchang Yang et al. Yashas Malur Saidutta, Rakshith Srinivasa, Chinghua Lee, Yilin Shen, Hongxia Jin Published: Annual Conference of the International Speech Communication Association (INTERSPEECH) |
Aug 21, 2023 | Although various deep keyword spotting (KWS) systems have demonstrated promising performance under relatively noiseless environments, accurate keyword detection in the presence of the strong noise encountered in our daily lives remains challenging. Room acoustics and noise conditions can be highly diverse, which can lead to drastic performance degradation if not handled carefully. In this paper, we aim to make deep KWS systems with small model sizes robust to environmental noise. We propose a noise management module (SE-SPP Net) that estimates both the denoised Mel spectrogram and the position of the speech utterance in the noisy input signal. The latter is estimated as the probability of a particular T-F bin containing speech. Further, it comes at virtually no cost in model size compared to a model estimating only the denoising mask. Our proposed SE-SPP Net with the KWD module can improve keyword spotting performance by up to 7% compared to a similarly sized SOTA model at an SNR of -10 dB. |
https://www.isca-speech.org/archive/pdfs/interspeech_2023/yang23t_interspeech.pdf | Artificial Intelligence |
ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation |
Author: Yilin Shen et al. Hongxia Jin Published: International Conference on Machine Learning (ICML) |
Jul 29, 2023 | Navigating to the right place to localize the desired object is a fundamental ability of embodied agents that interact with objects and complete real-world tasks. Such object navigation tasks usually require large-scale training in visual environments with labeled objects. In this work, we find that the knowledge in pre-trained models for semantic scene understanding and commonsense reasoning can be transferred to open-world object navigation without any navigation experience or any other training on the visual environments, achieving training-free zero-shot object navigation. However, these large pre-trained models may not directly generate navigation actions well. To mitigate the gap between the pre-trained knowledge and navigation actions, we propose a framework combining the commonsense knowledge with an existing exploration method to enable exploration with commonsense (EwC) using Probabilistic Soft Logic (PSL). Extensive experiments on the MP3D, HM3D, and RoboTHOR benchmarks (Deitke et al., 2020; Chang et al., 2017; Ramakrishnan et al., 2021) show that our method improves significantly over prior baselines and achieves new state-of-the-art results for zero-shot object navigation (e.g., a 215% relative Success Rate improvement over CoW (Gadre et al., 2022) on MP3D). Our ablation studies also validate the efficacy of commonsense reasoning. |
https://arxiv.org/pdf/2301.13166.pdf | Artificial Intelligence |
MM-HAR: Multi-Modal Human Activity Recognition Using Consumer Smartwatch and Earbuds |
Author: Nafiul Rashid et al. Ebrahim Nematihosseinabadi, Mohsin Ahmed, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 24, 2023 | Human Activity Recognition (HAR) is one of the important applications of digital health that helps track fitness or avoid sedentary behavior by monitoring daily activities. Due to the growing popularity of consumer wearable devices, smartwatches and earbuds are being widely adopted for HAR applications. However, using just one of these devices may not be sufficient to track all activities properly. This paper proposes a multi-modal approach to HAR using both earbuds and a watch. Using a large dataset of 44 subjects collected from both in-lab and in-home environments, we demonstrate the limitations of using a single modality as well as the importance of a multi-modal approach. Moreover, we train and evaluate the performance of five different machine learning classifiers for various device combinations: buds only, watch only, and both. We believe the detailed analyses presented in this paper may serve as a benchmark for the research community to explore and build upon in the future. |
https://ieeexplore.ieee.org/document/10340984 | Digital Health |
Power Optimized Smartwatch-Earbuds Multimodal System for Monitoring Activities of Daily Living |
Author: Mohsin Ahmed et al. Ebrahim Nematihosseinabadi, Nafiul Rashid, Maksim Shurpo, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 24, 2023 | We implement an end-to-end mobile and wearable system using earbuds and smartwatch IMU sensors to detect various activities of daily living, and we present various power-optimization schemes to reduce power consumption. |
https://ieeexplore.ieee.org/document/10340166 | Digital Health |
Estimating SpO2 with Deep Oxygen Desaturations from Facial Video Under Various Lighting Conditions: A Feasibility Study |
Author: Li Zhu et al. Korosh Vatanparvar, Migyeong Gwak, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 24, 2023 | This paper presents a feasibility study to collect data, process signals, and validate the accuracy of peripheral oxygen saturation (SpO2) estimation from facial video under various lighting conditions. The remote photoplethysmogram (rPPG) signals were first extracted from facial videos recorded using an RGB camera without auto-tuning. These videos were collected from subjects breathing through a mouth tube with their noses clipped. The air inhaled by the subjects was manually controlled to gradually induce hypoxemia and lower the subjects' SpO2 to as low as 81%. We applied the principle of pulse oximetry and extracted the ratio of ratios (RoR) for two color combinations: Red/Blue and Red/Green. Next, we assessed SpO2 estimation accuracy against a SpO2 multi-wavelength analyzer under four lighting conditions: warm color temperature and normal brightness, neutral color temperature and normal brightness, cool color temperature and normal brightness, and neutral color temperature and dim brightness. We achieved an RMSE of 1.93% and a PCC of 0.97 under the warm color temperature and normal brightness lighting condition using leave-one-subject-out cross-validation between two subjects. The results show that it is feasible to estimate SpO2 remotely and accurately using a consumer-level RGB camera with a suitable camera configuration and lighting condition. |
https://ieeexplore.ieee.org/document/10340025 | Digital Health |
RRDetection: Respiration Rate Estimation Using Earbuds During Physical Activities |
Author: Yincheng Jin et al. Mahbubur Rahman, Retiree, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 24, 2023 | Respiratory rate is a fundamental vital sign that is sensitive to different pathological conditions (e.g., adverse cardiac events, pneumonia, and clinical deterioration) and stressors, including emotional stress, cognitive load, heat, cold, physical effort, and exercise-induced fatigue. Traditionally, the capnograph was recognized as the gold standard for respiratory measurement; however, it is expensive and inconvenient to use at home. There have been recent developments in passively monitoring breathing rate using earbuds, but those are limited to resting conditions. In this paper, we develop a novel algorithm based on earbud motion sensor data that can estimate breathing rate while the user is walking or doing physical exercises such as a step test. |
https://ieeexplore.ieee.org/document/10340157 | Digital Health |
Enhanced AI Based CSI Prediction Solutions for Massive MIMO in 5G and 6G Systems |
Author: Daoud Burghal et al. Yang Li, Pranav Madadi, Yeqing Hu, Joonyoung Cho, Jeongho Jeon, Andreas Molisch, Charlie Zhang Published: IEEE Access |
Jun 30, 2023 | Accurate Channel State Information (CSI) is critical for maximizing the throughput of massive Multi-Input Multi-Output (mMIMO) systems. Due to environment dynamics and user mobility, CSI aging is a major challenge to achieving the large throughput potential of mMIMO. CSI prediction can be used to overcome this without increasing the overhead. Motivated by the anticipated native support for Artificial Intelligence (AI) in fifth-generation and beyond cellular standards, we propose deep learning CSI prediction solutions based on 3-Dimensional (3D) Complex Convolutional Neural Networks (CCNN). These solutions provide improved capabilities for capturing temporal and spatial correlations, enhancing CSI prediction performance. In particular, they utilize the angle-delay decomposition of previously observed CSI to predict the future one. In one architecture, the network, dubbed CSI Prediction Network (CSI-PNet), uses small kernels with circular padding to efficiently capture the correlation between propagation paths in the angle domain. This architecture can be further improved by the use of an attention-like model to adaptively vary the weights and enhance prediction performance. We also propose methods to enhance robustness to noise and to time and frequency offsets. We tested these solutions using both 3GPP-compatible simulations and field measurements in a commercial network. Our solutions demonstrate stable performance and significantly outperform several benchmarks, especially at low and medium speeds. They strike a balance between performance and architectural complexity, indicating suitability for actual implementation. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10286287 | Next Generation Communications |
Artificial Intelligence Augmentation for Channel State Information in 5G and 6G |
Author: Yang Li et al. Yeqing Hu, Retiree, Hyo Yol Park, Hayoung Yang, Tiexing Wang, Retiree, Ji-Yun Seol, Charlie Zhang Published: IEEE Wireless Communications |
Jun 16, 2023 | In this article, we present an artificial intelligence (AI) augmentation framework for physical layer communication applicable to both 5G and future 6G networks. The framework classifies the channel state information (CSI), and uses the classified CSI knowledge to optimally adapt transmission configurations and/or improve conventional signal processing modules in estimation. We demonstrate system benefits of such AI-augmentation in different use cases in 5G NR context, such as beamforming mode adaptation, reference signal (RS) resource optimization, link adaptation as well as channel estimation. The framework also allows extension to resolve future 6G challenges. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10077214 | Next Generation Communications |
To Wake-up or Not to Wake-up: Keyword False Alarm Reduction by Successive Refinement |
Author: Yashas Malur Saidutta et al. Rakshith Srinivasa, Chinghua Lee, Chouchang Yang, Yilin Shen, Hongxia Jin Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
Jun 4, 2023 | Keyword spotting systems continuously process audio streams to detect keywords. One of the most challenging aspects of such systems is the case when the system falsely registers a keyword (false alarm) despite the keyword not being uttered. In this paper, we propose a simple yet elegant solution that follows from the law of total probability. We show that existing deep keyword spotting mechanisms can be improved by successive refinement, where the system first classifies whether the input audio is speech or not, then classifies whether the input is keyword-like or not, and finally classifies which keyword was uttered. We show that across multiple models ranging from 13K to 340K parameters, the successive refinement technique reduces false alarms by a factor of 3 on both a held-out test dataset and out-of-domain (unseen) data. Further, our proposed approach is "plug-and-play" and can be applied to any baseline keyword spotting method. |
https://arxiv.org/pdf/2304.03416.pdf | Artificial Intelligence |
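The successive-refinement idea follows directly from the law of total probability, so a small sketch can make the arithmetic concrete. The function below is an illustrative stand-in (not the paper's code): it assumes three stage posteriors are already available from separate classifier heads and simply chains them, showing how non-speech audio is multiplicatively suppressed.

```python
# Hypothetical sketch of successive-refinement scoring for keyword spotting.
# The three stage probabilities would come from three classifier heads; here
# they are plain inputs so the total-probability chain is clear.
import numpy as np

def successive_refinement_score(p_speech, p_keyword_like, p_keyword_classes):
    """Combine stage outputs via the law of total probability.

    p_speech:          P(speech | audio), scalar in [0, 1]
    p_keyword_like:    P(keyword-like | speech, audio), scalar in [0, 1]
    p_keyword_classes: P(class k | keyword-like, audio), array summing to 1
    """
    # The posterior for each keyword is the product along the chain; noise
    # that fails the earlier stages is suppressed multiplicatively, which is
    # what reduces false alarms on background audio.
    return p_speech * p_keyword_like * np.asarray(p_keyword_classes)

# Example: background noise that looks keyword-like only to the last stage.
scores = successive_refinement_score(0.2, 0.3, [0.7, 0.2, 0.1])
fires = scores.max() > 0.5   # detection threshold (illustrative)
print(scores, fires)          # max score 0.042 -> no false alarm
```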
Snapshot Matching Masking for Improved PSD Estimation in Mask-Based Neural Beamforming for Multichannel Speech Enhancement |
Author: Chinghua Lee et al. Chouchang Yang, Yilin Shen, Hongxia Jin Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
Jun 4, 2023 | In multichannel speech enhancement (SE), time-frequency (T-F) mask-based neural beamforming algorithms take advantage of deep neural networks to predict T-F masks that represent speech/noise dominance, which are leveraged to estimate the speech/noise power spectral density (PSD) matrices that are subsequently utilized to obtain the spatial filter weights. However, in the literature most networks are trained to estimate some pre-defined masks, e.g., the ideal binary mask (IBM) and ideal ratio mask (IRM), which lack direct connection to the PSD matrices. In this paper, we propose a new masking strategy where the complex-valued U-Net is utilized to predict a novel T-F mask, namely the Snapshot Matching Mask (SMM), that aims to minimize the distance between the predicted signal snapshots and the true signal snapshots, thereby estimating the PSD matrices in a more systematic way. Performance of the SMM compared with existing IBM- and IRM-based beamformers is presented on several datasets to demonstrate its effectiveness for improved T-F mask-based beamforming. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10096213 | Artificial Intelligence |
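For readers unfamiliar with mask-based neural beamforming, the numpy sketch below shows the generic pipeline the abstract builds on: a T-F mask weights the STFT snapshots to estimate speech and noise PSD matrices, which then yield MVDR filter weights. This is a textbook mask-to-MVDR illustration, not the paper's SMM network, and all variable names are illustrative.

```python
# Minimal numpy sketch (not the paper's SMM) of how T-F masks feed PSD
# estimation and an MVDR beamformer at a single frequency bin.
import numpy as np

def mask_to_psd(Y, mask):
    """Y: (frames, channels) STFT snapshots at one frequency bin.
    mask: (frames,) dominance weights in [0, 1].
    Returns the weighted spatial PSD matrix (channels x channels)."""
    w = mask / (mask.sum() + 1e-8)
    return np.einsum('t,tc,td->cd', w, Y, Y.conj())

def mvdr_weights(phi_speech, phi_noise):
    # Steering vector approximated by the principal eigenvector of the
    # speech PSD matrix (a common heuristic).
    _, vecs = np.linalg.eigh(phi_speech)
    d = vecs[:, -1]
    num = np.linalg.solve(phi_noise, d)          # Phi_n^{-1} d
    return num / (d.conj() @ num)                # normalize: w = Phi_n^{-1}d / (d^H Phi_n^{-1} d)

rng = np.random.default_rng(0)
Y = rng.standard_normal((100, 4)) + 1j * rng.standard_normal((100, 4))
speech_mask = rng.uniform(size=100)
phi_s = mask_to_psd(Y, speech_mask)
phi_n = mask_to_psd(Y, 1.0 - speech_mask)
w = mvdr_weights(phi_s, phi_n)
enhanced = Y @ w.conj()   # beamformed output for this bin
```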
Mouth Breathing Detection Using Audio Captured Through Earbuds |
Author: Tousif Ahmed et al. Mahbubur Rahman, Ebrahim Nematihosseinabadi, Jilong Kuang, Alex Gao Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
Jun 4, 2023 | Mouth breathing has a significant adverse effect on people. For example, mouth breathing has been associated with sleep-related disorders and dental problems. Detecting mouth breathing in the daily environment during resting activities could be helpful for early intervention and for reversing the negative impact. However, mouth breathing detection in the everyday environment has not been sufficiently explored in previous works. This work presents a machine learning approach to detect mouth breathing using the audio captured by commercially available earbuds. Earbuds are becoming popular for health monitoring, and they could provide a more convenient method for detecting mouth breathing without requiring user attention. We conducted a data collection study with 30 participants to train the audio-based classifier. Our results suggest that a convolutional neural network based model can detect mouth breathing with 80.4% accuracy. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10095793 | Digital Health |
Improving Heart Rate and Heart Rate Variability Estimation from Video Through a HR-RR-Tuned Filter |
Author: Retiree et al. Li Zhu, Korosh Vatanparvar, Jilong Kuang, Alex Gao Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
Jun 4, 2023 | This paper presents algorithms to improve the estimation of heart rate (HR) and heart rate variability (HRV) from smartphone video. The remote photoplethysmogram (rPPG) signals are first extracted from the recorded videos. Next, we propose rPPG filtering adaptively tuned by HR and respiratory rate (RR) to better enhance the source signal that modulates HR. Further, we also address the smartphone artifact occasionally seen in smartphone videos by introducing an algorithm designed to suppress it. HR and HRV accuracies are assessed on 22 subjects who were instructed to breathe at seven different RRs. The mean absolute errors of HR and of the standard deviation of the NN intervals (SDNN) are found to be 1.36 ± 0.88 bpm and 23.47 ± 12.09 ms, respectively. Finally, we also conducted experiments to highlight the improvements in accuracy made by the proposed algorithms. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10096576 | Digital Health |
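As a rough illustration of what an HR-RR-tuned filter can look like, the sketch below band-passes an rPPG signal around the current HR estimate with a passband widened by the respiratory rate, so RR-induced sidebands of the pulse survive filtering. The cutoff rules and the `margin` parameter are assumptions for illustration, not the paper's filter design.

```python
# Illustrative HR-RR-tuned band-pass filter; cutoffs are assumptions.
import numpy as np
from scipy.signal import butter, filtfilt

def hr_rr_tuned_filter(rppg, fs, hr_bpm, rr_bpm, margin=0.5):
    """Band-pass the rPPG signal around the HR fundamental, with the
    passband widened by the respiratory rate (all rates in bpm)."""
    f_hr, f_rr = hr_bpm / 60.0, rr_bpm / 60.0
    lo = max(f_hr - f_rr - margin / 60.0, 0.5)       # keep above 30 bpm
    hi = min(f_hr + f_rr + margin / 60.0, fs / 2 - 0.1)
    b, a = butter(3, [lo, hi], btype='bandpass', fs=fs)
    return filtfilt(b, a, rppg)

fs = 30.0                                  # typical smartphone frame rate
t = np.arange(0, 30, 1 / fs)
pulse = np.sin(2 * np.pi * 1.2 * t)        # 72 bpm pulse
noise = 0.5 * np.random.randn(t.size)
clean = hr_rr_tuned_filter(pulse + noise, fs, hr_bpm=72, rr_bpm=15)
```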
BreathIE: Estimating Breathing Inhale Exhale Ratio Using Motion Sensor Data from Commodity Earbuds |
Author: Nafiul Rashid et al. Mahbubur Rahman, Tousif Ahmed, Jilong Kuang, Alex Gao Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
Jun 4, 2023 | The breathing Inhale/Exhale (IE) ratio is one of the critical breathing biomarkers for pulmonary patients and healthy individuals. It can indicate the severity of lung obstruction for chronic lung patients and help detect psycho-social stress in healthy individuals. With the advancement of wearable technologies, common consumer wearables such as smartwatches offer breathing rate measurement. However, IE ratio measurement is still not available in consumer electronic devices. In this paper, we present a novel algorithm, BreathIE, to estimate breathing rate and IE ratio using the low-power motion sensor embedded in consumer-grade earbuds. Moreover, our algorithm is adaptive, dynamically adjusting to varying breathing durations based on the breathing habits of the user at run time. We conducted a study with 30 participants where we used both earbuds and a reference chestband device simultaneously. Experimental evaluation against the annotated reference data shows that our algorithm can estimate breathing rate with a mean absolute error (MAE) of 2.48 breaths per minute and breathing IE ratio with 0.26 MAE, while outperforming state-of-the-art algorithms. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=10096084 | Digital Health |
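The IE ratio itself is simple to compute once inhale and exhale segments are delimited. The sketch below is a simplified illustration using peak/valley detection on a generic breathing waveform; the paper's adaptive IMU algorithm is more involved, and `ie_ratio` and its parameters are invented for this example.

```python
# Simplified IE-ratio arithmetic: inhale = valley-to-peak time,
# exhale = peak-to-valley time; the ratio averages over breaths.
import numpy as np
from scipy.signal import find_peaks

def ie_ratio(breath, fs):
    peaks, _ = find_peaks(breath, distance=fs)      # end of inhale
    valleys, _ = find_peaks(-breath, distance=fs)   # end of exhale
    ratios = []
    for p in peaks:
        before = valleys[valleys < p]
        after = valleys[valleys > p]
        if before.size and after.size:
            inhale = (p - before[-1]) / fs          # valley -> peak
            exhale = (after[0] - p) / fs            # peak -> valley
            ratios.append(inhale / exhale)
    return float(np.mean(ratios)) if ratios else float('nan')

fs = 50.0
t = np.arange(0, 60, 1 / fs)
breath = np.sin(2 * np.pi * 0.25 * t)   # 15 breaths/min, symmetric => ratio ~1
print(ie_ratio(breath, fs))
```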
Machine Learning-Assisted Codebook Design for MMSE Channel Estimation |
Author: Yeqing Hu et al. Yang Li, Tiexing Wang, Charlie Zhang Published: IEEE International Conference on Communications (ICC) |
May 28, 2023 | |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10283641 | Next Generation Communications |
LEARNING TO JOINTLY SHARE AND PRUNE WEIGHTS FOR GROUNDING BASED VISION AND LANGUAGE MODELS |
Author: Shangqian Gao et al. Burak Uzkent, Yilin Shen, Hongxia Jin Published: International Conference on Learning Representation (ICLR) |
May 1, 2023 | Transformers have seen growing interest in processing different modalities, including language and images. As a result, we can process vision and language data using transformers that are architecturally similar. This feature of transformers provides us with several opportunities. This study explores weight sharing across two transformer backbones, weight sharing within the same transformer backbone, and pruning, in a unified framework. More specifically, we investigate weight sharing and pruning for two components of the transformers: (1) Multi-Head Attention (MSA) and (2) Feed-Forward Network (FFN) layers. To jointly perform weight sharing and pruning, we propose to use a regularization term to align model weights and the desired structure during the pre-training step. The structure vectors of sharing and pruning are generated by a hypernetwork, which can capture complex interactions between pruning and sharing across layers and modalities. The hypernetwork and model weights are trained iteratively so that the learned structure evolves along with the model weights. After minimizing the proposed objective in the pre-training step, we perform weight sharing and pruning and fine-tune the model on downstream tasks. We perform experiments on vision and language tasks, including Referring Expression Comprehension (REC) and Visual Question Answering (VQA), using the state-of-the-art models MDETR and GLIP. Our experiments show that we can reduce the size of MDETR and GLIP by 35-40% by sharing and pruning MSA and FFN weights without significant loss in accuracy. |
https://openreview.net/pdf?id=UMERaIHMwB3 | Artificial Intelligence |
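A toy version of the alignment regularizer described above can clarify how structure vectors drive sharing and pruning. In the sketch below (PyTorch; names like `share_gate` are invented), per-row gates pull corresponding rows of two backbones together and push pruned rows toward zero; in the paper these gates come from a hypernetwork rather than a random initialization.

```python
# Toy sketch of a joint share/prune regularizer over two aligned layers.
import torch

def share_prune_regularizer(w_a, w_b, share_gate, keep_gate):
    """w_a, w_b: (rows, cols) weights of two aligned layers (e.g., vision/text FFN).
    share_gate: (rows,) in [0,1]; 1 means this row should be tied across layers.
    keep_gate:  (rows,) in [0,1]; 0 means this row should be pruned."""
    align = (share_gate.unsqueeze(1) * (w_a - w_b) ** 2).sum()   # pull shared rows together
    prune = ((1 - keep_gate).unsqueeze(1) * w_b ** 2).sum()      # push pruned rows to zero
    return align + prune

w_vision = torch.randn(8, 16, requires_grad=True)
w_text = torch.randn(8, 16, requires_grad=True)
share = torch.sigmoid(torch.randn(8))    # in the paper, produced by a hypernetwork
keep = torch.sigmoid(torch.randn(8))
reg = share_prune_regularizer(w_vision, w_text, share, keep)
reg.backward()   # differentiable, so the structure can be learned end-to-end
```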
Breathing Rate Tracking Using Earbuds In Uncontrolled Environments |
Author: Tousif Ahmed et al. Mahbubur Rahman, Mohsin Ahmed, Ebrahim Nematihosseinabadi, Jilong Kuang, Alex Gao Published: International Conference on Human Factors in Computing Systems (CHI) |
Apr 21, 2023 | Breathing rate is critical to the user's respiratory health, stress, and fitness level. Unfortunately, breathing rates are hard to track outside the clinical context, requiring specialized devices. While mobile and wearable device-based approaches could be helpful, existing methods require heavy engagement from the user. Furthermore, they can be inaccurate in the presence of minute body motions and loud noises. This paper presents a robust multimodal approach to tracking the user's breathing rate by using a signal-processing-based algorithm on motion sensors and a lightweight machine learning algorithm on acoustic sensors from the earbuds. A comprehensive user study with 30 participants shows that our system can calculate the breathing rate reasonably (Mean Absolute Error < 2 BPM) in controlled and uncontrolled settings with varying body motion and environmental noise. This work provides an essential direction toward developing continuous and passive breathing rate monitoring in the wild. |
https://dl.acm.org/doi/pdf/10.1145/3544548.3581265 | Digital Health |
GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer |
Author: Burak Uzkent et al. Yilin Shen, Hongxia Jin Published: National Conference on Artificial Intelligence (AAAI) |
Jan 2, 2023 | Transformers have seen growing interest in processing different modalities including language and images. As a result, we can process vision and language data using transformers that are architecturally similar. This feature of transformers provides us with several opportunities. In this study, we explore weight sharing between two architecturally similar transformers for vision and language tasks. More specifically, we investigate sharing two main components of the transformers across the two backbones: (1) Multi-Head Attention (MSA) and (2) Feed-Forward Network (FFN) layers. To achieve this, we propose an additional objective that encourages the minimization of the difference of the MSA weights as well as the FFN weights across the two backbones. After minimizing the corresponding weights in the two backbones, we perform weight sharing and fine-tune the model. We perform experiments on vision and language tasks including Referring Expression Comprehension and VQA using the state-of-the-art model MDETR. Our experiments show that we can reduce the size of MDETR by 35-40% by sharing MSA and FFN weights without significant loss in accuracy. |
https://arxiv.org/pdf/2301.05345.pdf | Artificial Intelligence |
Numerical Optimizations for Weighted Value Decomposition on Language Models |
Author: Ting Hua et al. Yen-Chang Hsu, Felicity Wang, Retiree, Yilin Shen, Hongxia Jin Published: Conference on Empirical Methods in Natural Language Processing (EMNLP) |
Dec 9, 2022 | Singular value decomposition (SVD) is one of the most popular compression methods that approximate a target matrix with smaller matrices. However, standard SVD treats the parameters within the matrix with equal importance, which is a simple but unrealistic assumption. In real cases, the parameters of a trained neural network model affect the task performance unevenly, suggesting non-equal importance among the parameters. Therefore, this paper proposes Fisher information weighted Value Decomposition (FVD) to compress a neural network model with awareness of parameter importance. Unlike standard SVD, FVD is a non-convex optimization problem that lacks a closed-form solution, so optimizing FVD is non-trivial. |
https://arxiv.org/pdf/2211.09718.pdf | Artificial Intelligence |
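Since FVD has no closed form, one natural way to convey the objective is a direct gradient-based minimization of the Fisher-weighted reconstruction error. The PyTorch sketch below illustrates exactly that objective; it is not the paper's numerical optimization method, and the sizes, learning rate, and step count are arbitrary.

```python
# Illustrative gradient-descent solver for a Fisher-weighted factorization
# objective: minimize sum_ij fisher_ij * (W_ij - (U @ V)_ij)^2 over U, V.
import torch

def fvd_factorize(W, fisher, rank, steps=500, lr=1e-2):
    m, n = W.shape
    U = torch.randn(m, rank, requires_grad=True)
    V = torch.randn(rank, n, requires_grad=True)
    opt = torch.optim.Adam([U, V], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (fisher * (W - U @ V) ** 2).sum()   # weighted reconstruction error
        loss.backward()
        opt.step()
    return U.detach(), V.detach()

W = torch.randn(64, 64)
fisher = torch.rand(64, 64)   # per-parameter importance (e.g., squared gradients)
U, V = fvd_factorize(W, fisher, rank=8)   # W is approximated by U @ V
```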
Explainable Slot Type Attentions to Improve Joint Intent Detection and Slot Filling |
Author: Kalpa Gunaratna et al. Vijay Srinivasan, Retiree, Hongxia Jin Published: Conference on Empirical Methods in Natural Language Processing (EMNLP) |
Dec 7, 2022 | Joint intent detection and slot filling is a key research topic in natural language understanding (NLU). Existing joint intent and slot filling systems compute features collectively for all slot types, and importantly, have no way to explain the slot filling model decisions. In this work, we propose a novel approach that: (i) learns to generate slot type specific features to improve accuracy and (ii) provides explanations of slot filling decisions for the first time in a joint NLU model. Further, the model is inherently explainable and does not need any post-hoc processing. We perform an additional constrained supervision using a set of binary classifiers to learn slot type specific features, thus ensuring appropriate attention weights are learned to explain slot filling decisions for utterances. We evaluate our approach on two widely used datasets and show accuracy improvements. Moreover, a detailed analysis is also provided for the exclusive slot explainability of our proposed model. |
https://arxiv.org/pdf/2210.10227.pdf | Artificial Intelligence |
Foreground-Specialized Model Imitation for Instance Segmentation |
Author: Wenbo Li et al. Hongxia Jin Published: Asian Conference on Computer Vision (ACCV) |
Dec 4, 2022 | We leverage knowledge distillation to address object instance segmentation for robots with limited computational power. Instance segmentation is formulated as a multi-task learning problem involving object classification, localization, and mask prediction. However, knowledge distillation is well-suited to only one of these sub-tasks, namely multi-class object classification. To deal with this challenge, we introduce a novel distillation method where the teacher is a small foreground-specialized (FS) model. We train the FS instance segmentation teacher model using images with only foreground objects, i.e., background pixels are removed. The FS teacher is therefore effective at object classification, which is exactly the sub-task the distillation method is designed for. To accommodate the difference between the inputs used by the teacher and the student, we introduce a novel Foreground-Specialized model Imitation (FSI) method with two complementary module components. First, a reciprocal anchor box selection method is introduced to distill from the most informative outputs of the teacher model. Second, to embed foreground-awareness in the student's feature learning, we propose two solutions: either adding a co-learned foreground segmentation branch or applying a soft feature mask. We conducted an extensive evaluation with the state-of-the-art one-stage object instance segmentation method YOLACT, which is suitable for on-device inference. Experimental results on the MS COCO and Pascal VOC datasets demonstrate that our method significantly outperforms knowledge distillation baselines in terms of both accuracy improvement and training efficiency. |
https://openaccess.thecvf.com/content/ACCV2022/papers/Li_Foreground-Specialized_Model_Imitation_for_Instance_Segmentation_ACCV_2022_paper.pdf | Artificial Intelligence |
BreatheBuddy: Tracking Real-time Breathing Exercises for Automated Bio-feedback Using Commodity Earbuds |
Author: Mahbubur Rahman et al. Tousif Ahmed, Mohsin Ahmed, Retiree, Ebrahim Nematihosseinabadi, Jilong Kuang, Alex Gao Published: Mobile HCI |
Oct 1, 2022 | Breathing exercises reduce stress and improve overall mental well-being. There are various types of breathing exercises. Performing the exercises correctly may give the best outcome, and doing them in wrong ways can sometimes have adverse effects. Providing real-time biofeedback can greatly improve the user experience in doing the right exercises in the right ways. In this paper, we present methods to passively track breathing biomarkers in real-time using wireless commodity earbuds and generate feedback on the user's breathing performance. We use the earbuds' low-power accelerometer to generate a comprehensive set of breathing biomarkers including breathing phase, breathing rate, depth of breathing, and breathing symmetry. We have conducted studies where the subjects performed different types of guided breathing exercises while wearing the earbuds. Our algorithms detect breathing phases with ~88.88% accuracy and estimate breathing rate with ~95% accuracy. We further show that our algorithms can be used to generate biofeedback towards designing engaging smartphone user interactions that help users accurately perform various breathing exercises. |
https://dl.acm.org/doi/pdf/10.1145/3546748 | Digital Health |
Real-Time Breathing Phase Detection Using Earbuds Microphone |
Author: Retiree et al. Tousif Ahmed, Mahbubur Rahman, Mohsin Ahmed, Ebrahim Nematihosseinabadi, Jilong Kuang, Alex Gao Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences (IEEE BHI & BSN) |
Sep 27, 2022 | Tracking breathing phases (inhale and exhale) outside the hospital can offer significant health and wellness benefits to users. For example, the breathing phases can provide fine-grained breathing information for proper meditation or breathing exercises. While previous works use smartphones and smartwatches for tracking breathing phases, in this work we use earbuds for breathing phase detection, which has the potential to be a better form factor for breathing exercises as it requires less attention from the user. We propose a convolutional neural network-based algorithm for detecting breathing phases using the audio captured through the earbuds during guided breathing sessions. We conducted a user study with 30 participants in both lab and home environments to develop and evaluate our algorithm. Our algorithm can detect the breathing phases with 85% accuracy using only a 500 ms audio segment. Our work demonstrates the potential of using earbuds for tracking breathing phases in real-time. |
https://ieeexplore.ieee.org/document/9928520 | Digital Health |
Deep Audio Spectral Processing for Respiration Rate Estimation from Commodity Earbuds |
Author: Mohsin Ahmed et al. Tousif Ahmed, Mahbubur Rahman, Retiree, Jilong Kuang, Alex Gao Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences (IEEE BHI & BSN) |
Sep 27, 2022 | Breathing rate is an important health biomarker and a vital indicator of health and fitness. With smart earbuds gaining popularity as a commodity device, recent works have demonstrated the potential for monitoring breathing rate using such earable devices. In this work, we use spectrograms of breathing cycle audio signals captured by earbuds as a spectral feature to train a deep convolutional neural network to infer respiration rate with high accuracy. Using novel earbud audio data collected from 30 subjects, with both controlled breathing over a wide range (from 5 up to 45 breaths per minute) and uncontrolled natural breathing from a 7-day home deployment, experimental results demonstrate that our model can estimate respiration rate with 0.77 MAE for controlled breathing and 0.99 MAE for at-home natural breathing. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9928461 | Digital Health |
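The model family described above, a CNN regressing respiration rate from spectrograms, can be sketched in a few lines. The architecture below is a generic, assumed stand-in (layer sizes and input shape are not from the paper) that maps a single-channel spectrogram to a breaths-per-minute estimate.

```python
# Generic spectrogram-to-respiration-rate CNN regressor (illustrative only).
import torch
import torch.nn as nn

class RespRateCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)   # regress breaths per minute

    def forward(self, spec):           # spec: (batch, 1, freq_bins, frames)
        return self.head(self.features(spec).flatten(1))

model = RespRateCNN()
spec = torch.randn(4, 1, 64, 128)      # e.g., 64 mel bins x 128 frames
bpm = model(spec)                      # (4, 1) predicted respiration rates
```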
Respiration Rate Estimation from Remote PPG via Camera in Presence of Non-Voluntary Artifacts |
Author: Korosh Vatanparvar et al. Migyeong Gwak, Li Zhu, Jilong Kuang, Alex Gao Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences (IEEE BHI & BSN) |
Sep 27, 2022 | Contactless measurement of vitals has been seen as a promising alternative to contact sensors for monitoring health condition. In this paper, we focus on respiration rate (RR) as one of the fundamental biomarkers of a person's cardiac and pulmonary activity. Remote RR estimation has gained traction due to its various potential applications; the use of RGB cameras to extract a remote photoplethysmography (PPG) signal from the subject's face has been explored as one of the enabling technologies for remote RR estimation. The technology is challenged by the wide range of RRs and by non-voluntary motion in uncontrolled settings. We propose a novel methodology to enhance the quality of the respiration signal and remove artifacts from the remote PPG signal, which reduces the MAE from 4.5 bpm to 2.8 bpm for RR in the range of 5-25 bpm. We validate the accuracy of our methodology using smartphone video recordings of 30 subjects with a uniform distribution of skin tone. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9928485 | Digital Health |
Enhancement of Remote PPG and Heart Rate Estimation with Optimal Signal Quality Index |
Author: Jiyang Li et al. Korosh Vatanparvar, Li Zhu, Jilong Kuang, Alex Gao Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences (IEEE BHI & BSN) |
Sep 27, 2022 | With the popularity of non-invasive vital signs detection, |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9928503 | Digital Health |
IMU-based Cough Detection With Lightweight Template Matching Models |
Author: Ebrahim Nematihosseinabadi Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences (IEEE BHI & BSN) |
Sep 10, 2022 | Cough is a major symptom of respiratory-related diseases. There exists a tremendous amount of work on detecting coughs from audio, but there has been no effort to identify coughs solely from an inertial measurement unit (IMU). Coughing causes motion across the whole body, especially on the neck and head. Therefore, head motion data during coughing captured by a head-worn IMU sensor could be leveraged to detect coughs using a template matching algorithm. In time series template matching problems, K-Nearest Neighbors (KNN) combined with an elastic distance measure (especially Dynamic Time Warping (DTW)) achieves outstanding performance. However, it is often regarded as prohibitively time-consuming. The Nearest Centroid Classifier was thereafter proposed, but its accuracy is compromised because only one centroid is obtained for each class. Centroid-based classifiers perform clustering and averaging for each cluster, but require manually setting the number of clusters. We propose a novel self-tuning multi-centroid template-matching algorithm, which can automatically adjust the number of clusters to balance accuracy and inference time. Through experiments conducted on synthetic datasets and a real-world earbud-based cough dataset, we demonstrate the superiority of our proposed algorithm and present results for cough detection with a single accelerometer sensor on the earbuds platform. |
https://arxiv.org/pdf/2109.00630.pdf | Digital Health |
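To ground the template-matching terminology, the sketch below classifies a signal window by its DTW distance to a given set of per-class centroid templates. The paper's contribution, self-tuning the number of centroids per class, is omitted; the centroid sets and templates here are illustrative.

```python
# Centroid-based template matching with a classic DTW distance.
import numpy as np

def dtw(a, b):
    """O(len(a)*len(b)) dynamic-time-warping distance between 1-D series."""
    D = np.full((len(a) + 1, len(b) + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[-1, -1]

def classify(window, centroids):
    """centroids: dict label -> list of centroid templates; nearest wins."""
    best = min(
        (dtw(window, c), label)
        for label, temps in centroids.items() for c in temps
    )
    return best[1]

centroids = {"cough": [np.sin(np.linspace(0, 6, 40))],
             "other": [np.zeros(40), np.full(40, 0.1)]}
print(classify(np.sin(np.linspace(0, 6, 38)) + 0.05, centroids))  # -> "cough"
```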
Instance Contour Adjustment via Structure-driven CNN |
Author: Yi Wei Published: European Conference on Computer Vision (ECCV) |
Jul 31, 2022 | Instance contour adjustment is desirable in image editing: it allows the contour of an instance in a photo to be either dilated or eroded via user sketching. This imposes several requirements on a favorable method, which must generate meaningful textures while preserving clear user-desired contours. Off-the-shelf image editing methods ignore these requirements and are therefore unsuited to the task. We instead propose a specialized two-stage method. The first stage extracts the structural cues from the input image and completes the missing structural cues for the adjusted area. The second stage is a structure-driven CNN which generates image textures following the guidance of the completed structural cues. In the structure-driven CNN, we redesign the context sampling strategy of the convolution operation and the attention mechanism such that they can estimate and rank the relevance of the contexts based on the structural cues, and sample the top-ranked contexts regardless of their distribution on the image plane. Thus, the structure-driven CNN guarantees meaningful image textures with clear, user-desired contours. In addition, our method does not require any semantic label as input, which ensures good generalization capability. We evaluate our method against several baselines adapted from related tasks, and the experimental results demonstrate its effectiveness. |
https://www.ecva.net/papers/eccv_2022/papers_ECCV/papers/136670142.pdf | Artificial Intelligence |
Table2Graph: Transforming Tabular Data to Unified Weighted Graph |
Author: Rui Chen et al. Li Li, Soo-Hyun Choi, Xia Hu Published: International Joint Conference on Artificial Intelligence (IJCAI) |
Jul 23, 2022 | Learning useful interactions between input features is crucial for tabular data modeling. Recent efforts have started to explicitly model feature interactions with graphs, where each feature is treated as an individual node. However, existing graph construction methods either heuristically formulate a fixed feature-interaction graph based on specific domain knowledge, or simply apply an attention function to compute pairwise feature similarities for each sample. While the fixed graph may be sub-optimal for downstream tasks, sample-wise graph construction is time-consuming during model training and inference. To tackle these issues, we propose a framework named Table2Graph that transforms feature interaction modeling into learning a unified graph. Represented as a probability adjacency matrix, the unified graph learns to model the key feature interactions shared by the diverse samples in the tabular data. To optimize the unified graph well, we employ a reinforcement learning policy to capture the key feature interactions stably. A sparsity constraint is also proposed to regularize the learned graph from being overly sparse/smooth. The experimental results on a variety of real-world applications demonstrate the effectiveness and efficiency of Table2Graph in terms of prediction accuracy and feature interaction detection. |
https://www.ijcai.org/proceedings/2022/0336.pdf | Mobile Platform & Solutions |
A New Concept of Knowledge based Question Answering (KBQA) System using Multiple Reasoning Paths |
Author: Yu Wang Published: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) |
Jul 21, 2022 | Knowledge based question answering (KBQA) is a complex task for natural language understanding. Many KBQA approaches have been proposed in recent years, and most of them are trained on labeled reasoning paths. This hinders the system's performance, as many correct reasoning paths are not labeled as ground truth and thus cannot be learned. In this paper, we introduce a new concept of KBQA system which can leverage information from multiple reasoning paths and only requires labeled answers as supervision. We name it the Multiple Reasoning Paths KBQA System (MRPQA). We conduct experiments on several benchmark datasets containing both single-hop simple questions and multi-hop complex questions, including WebQuestionSP (WQSP), ComplexWebQuestion-1.1 (CWQ), and PathQuestion-Large (PQL), and demonstrate strong performance. |
https://aclanthology.org/2022.naacl-main.294.pdf | Artificial Intelligence |
Joint phase-time arrays: a paradigm for frequency-dependent analog beamforming in 6G |
Author: Vishnu Vardhan Ratnam et al. Jianhua Mo, Boon Loong Ng, Ahmad AlAmmouri, Charlie Zhang Published: IEEE Access |
Jul 12, 2022 | Hybrid beamforming is an attractive solution for building cost-effective and energy-efficient transceivers for millimeter-wave and terahertz systems. However, conventional hybrid beamforming techniques rely on analog components that generate a frequency-flat response, such as phase-shifters and switches, which limits the flexibility of the achievable beam patterns. As a novel alternative, this paper proposes a new class of hybrid beamforming called Joint Phase-Time Arrays (JPTA), which additionally use true-time-delay elements in the analog beamforming to create frequency-dependent analog beams. Using two important frequency-dependent beam behaviors as examples, the numerous benefits of such flexibility are illustrated. Subsequently, the JPTA beamformer design problem of generating any desired beam behavior is formulated, and near-optimal algorithms for the problem are proposed. Simulations show that the proposed algorithms can outperform heuristic solutions for the JPTA beamformer update. Furthermore, it is shown that JPTA can achieve the two exemplified beam behaviors with one radio-frequency chain, while conventional hybrid beamforming requires the number of radio-frequency chains to scale with the number of antennas to achieve similar performance. Finally, a wide range of problems to further tap the potential of JPTA are listed as future directions. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9826716 | Next Generation Communications |
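The core JPTA idea, combining a frequency-flat phase shift with a frequency-proportional delay phase, reduces to one line of complex exponentials. The numpy sketch below (all array sizes, the 135 GHz band, and the delay taper are illustrative assumptions) shows how a delay taper makes the broadside gain vary across subcarriers, i.e., a frequency-dependent beam.

```python
# Toy model of a joint phase-time array: phase shifters give a frequency-flat
# phase, true-time delays give a phase linear in frequency, so the composite
# beam can differ per subcarrier.
import numpy as np

def jpta_weights(phases, delays, freqs):
    """phases: (n_ant,) rad; delays: (n_ant,) s; freqs: (n_freq,) Hz.
    Returns (n_freq, n_ant) analog beamforming weights."""
    return np.exp(1j * (phases[None, :] - 2 * np.pi * freqs[:, None] * delays[None, :]))

def array_gain(weights, n_ant, angle, spacing=0.5):
    """Normalized gain toward `angle` for a half-wavelength-spaced ULA."""
    steer = np.exp(1j * 2 * np.pi * spacing * np.arange(n_ant) * np.sin(angle))
    return np.abs(weights @ steer.conj()) / n_ant

n_ant, fc, bw = 8, 135e9, 2e9
freqs = fc + np.linspace(-bw / 2, bw / 2, 5)
# A uniform delay taper makes the beam squint across the band by design.
w = jpta_weights(np.zeros(n_ant), np.arange(n_ant) * 1e-12, freqs)
for f, g in zip(freqs, array_gain(w, n_ant, angle=0.0)):
    print(f"{f / 1e9:.1f} GHz: broadside gain {g:.2f}")
```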
Detecting Physiological Stress Using Earbuds |
Author: Mahbubur Rahman et al. Viswam Nathan, Tousif Ahmed, Retiree, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 11, 2022 | Continuous stress exposure negatively impacts mental and physical well-being. Stress arousal alters heart beat frequency, breathing pattern, and peripheral temperature, among several other bodily responses. Traditionally, stress detection is performed by collecting bio-signals such as electrocardiogram (ECG), breathing, and galvanic skin response using uncomfortable chestbands or chest patches. In this study, we use earbuds that passively measure photoplethysmography (PPG), core body temperature, and inertial measurements simultaneously. We conducted a lab study exposing 18 test subjects to the Trier Social Stress Test (TSST) and to several relaxing activities, including listening to functional music and progressive muscle relaxation, while measuring physiological signals using earbuds. Moreover, we simultaneously collected PPG, ECG, impedance cardiogram (ICG), and blood pressure using gold-standard reference devices. We show that earbuds can reliably capture heart rate and heart rate variability. We further show that earbud signals can be used to classify physiological stress arousal with 91.30% recall and 80.52% precision using a random forest classifier with leave-one-subject-out cross-validation. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9871569 | Digital Health |
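The evaluation protocol named above, a random forest with leave-one-subject-out cross-validation, maps directly onto standard scikit-learn components. The sketch below uses synthetic features and labels purely to show the protocol's shape; the feature set and sizes are not from the study.

```python
# Leave-one-subject-out evaluation of a random forest (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.metrics import recall_score, precision_score

rng = np.random.default_rng(0)
X = rng.standard_normal((180, 12))          # e.g., HR/HRV/temperature features
y = rng.integers(0, 2, 180)                 # stress vs. relax labels
subjects = np.repeat(np.arange(18), 10)     # 18 subjects, 10 windows each

preds = np.empty_like(y)
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subjects):
    # Refit from scratch for each held-out subject so no subject's data
    # leaks between train and test.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    preds[test_idx] = clf.predict(X[test_idx])

print("recall", recall_score(y, preds), "precision", precision_score(y, preds))
```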
Unsupervised Remote Photoplethysmograph and Heart Rate Estimation by Dynamic Region of Interest Tracking |
Author: Retiree et al. Korosh Vatanparvar, Li Zhu, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 11, 2022 | Remote photoplethysmography (PPG) estimates vital signs by measuring changes in the light reflected from the human skin. Compared with traditional PPG techniques, remote PPG enables contactless measurement at reduced cost. In this paper, we propose a novel unsupervised method to extract remote PPG signals and heart rate from videos. We propose an algorithm that dynamically tracks regions of interest (ROIs) and combines the signals from all ROIs based on their signal quality. To maintain a stable frame rate and accuracy, we propose a dynamic down-sampling approach, which makes our system robust to different video resolutions and user-camera distances. We also propose a waiting-time adaptation strategy for HR measurements, which achieves comparable accuracy in HR estimation while reducing the average waiting time. To test the accuracy of the proposed system, we collected data from 30 subjects wearing facial masks. Experimental results show that the proposed system can achieve 3.0 bpm mean absolute error in HR estimation. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9871722 | Digital Health |
Deep Multivariate Domain Translation for Device Invariant Pulmonary Patient Identification from Cough and Speech Sounds |
Author: Mohsin Ahmed et al. Korosh Vatanparvar, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 11, 2022 | Recent work has developed audio based machine learning models to infer pulmonary health, exacerbation, and activity. A major challenge to widespread usage and deployment of such pulmonary health monitoring audio models is maintaining accuracy and robustness across a variety of commodity devices, due to the effect of device heterogeneity. Because of this phenomenon, pulmonary audio models developed with data from one type of device perform poorly when deployed on another type of device. In this work, we propose a framework incorporating a multivariate deep neural network regressor as a feature translator from the source device domain to the target device domain. Our extensive empirical experiments with data from 131 real pulmonary patients and healthy controls show that our framework can recover up to 66.67% of the accuracy lost due to device heterogeneity for two different pulmonary activity based person identification tasks with two common mobile and wearable devices: smartphone and smartwatch. |
https://ieeexplore.ieee.org/document/9871967 | Digital Health |
Motion-based Respiratory Rate Estimation with Motion Artifact Removal Technique in a Facial Video with an RGB Camera |
Author: Migyeong Gwak et al. Korosh Vatanparvar, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 11, 2022 | Respiratory rate (RR) is a significant indicator |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9871231 | Digital Health |
Utilizing Deep Learning on Limited Mobile Speech Recordings for Detection of Obstructive Pulmonary Disease |
Author: Viswam Nathan et al. Korosh Vatanparvar, Jilong Kuang Published: Engineering in Medicine and Biology Conference (EMBC) |
Jul 11, 2022 | Passive assessment of obstructive pulmonary disease has gained substantial interest over the past few years in the mobile and wearable computing communities. One of the promising approaches is speech-based pulmonary assessment, where spontaneous or scripted speech is used to evaluate an individual's pulmonary condition. Recent work on speech-based pulmonary assessment has shown promising results in pulmonary disease detection. However, this approach relies heavily on the accuracy of speech activity detection and on a handful of specific features. Recently, the application of deep learning has shown promising results in the domain of activity recognition involving time series data. In this paper, we |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9871980 | Digital Health |
Lite-MDETR: A Lightweight Multi-Modal Detector |
Author: Qian Lou et al. Yen-Chang Hsu, Burak Uzkent, Ting Hua, Yilin Shen, Hongxia Jin Published: Computer Vision and Pattern Recognition (CVPR) |
Jun 21, 2022 | Recent multi-modal detectors based on transformers and modality encoders have successfully achieved impressive results on end-to-end visual object detection conditioned on a raw text query. However, they require a large model size and an enormous amount of computation to achieve high performance, which makes it difficult to deploy them in mobile applications limited by tight hardware resources. In this paper, we present a lightweight modulated detector, Lite-MDETR, to facilitate efficient end-to-end multi-modal understanding on mobile devices. The key primitive is the proposed Dictionary-Lookup-Transformation (DLT), which replaces the Linear Transformation (LT) in multi-modal detectors: each weight matrix is approximately factorized into a smaller dictionary, indices, and coefficients. This way, the enormous linear projection with weights is converted into an efficient projection with dictionaries plus a few lookups and scalings with indices and coefficients. DLT can be applied to any pretrained multi-modal detector, removing the need to perform expensive training from scratch. To tackle the challenging training of DLT due to the non-differentiable indices, we convert the indices and coefficients into a sparse matrix, train this sparse matrix during the fine-tuning phase, and recover it back to indices and coefficients during the inference phase. Our experiments on phrase grounding, referring expression comprehension and segmentation, and VQA show that Lite-MDETR achieves accuracy similar to prior multi-modal detectors with up to ~4.1x model size reduction. |
https://openaccess.thecvf.com/content/CVPR2022/papers/Lou_Lite-MDETR_A_Lightweight_Multi-Modal_Detector_CVPR_2022_paper.pdf | Artificial Intelligence |
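The dictionary-lookup factorization is easy to demystify with a small numpy example. The sketch below (the sizes and the helper name `dlt_linear` are invented) rebuilds each output column of a weight matrix from a few dictionary atoms selected by indices and mixed by coefficients, and shows the parameter-count saving versus a dense matrix.

```python
# Toy dictionary-lookup projection: W[:, j] = sum_k coef[k, j] * D[:, idx[k, j]].
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_atoms, k = 64, 256, 32, 4    # k lookups per output column

D = rng.standard_normal((d_in, n_atoms))    # shared dictionary
idx = rng.integers(0, n_atoms, (k, d_out))  # which atoms each column uses
coef = rng.standard_normal((k, d_out))      # how to mix them

def dlt_linear(x):
    """Equivalent to x @ W for the implied W, but computed as one small
    projection plus lookups and scalings."""
    z = x @ D                                # cheap projection: (batch, n_atoms)
    return (z[:, idx] * coef).sum(axis=1)    # gather + scale + sum: (batch, d_out)

x = rng.standard_normal((8, d_in))
y = dlt_linear(x)
# Parameter count: dictionary + coefficients (+ small integer indices)
# versus a dense d_in x d_out matrix.
print(D.size + coef.size, "vs", d_in * d_out)
```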
Reducing FDD MMU form factor with active cancellation |
Author: Khurram Muhammad et al. Jin Yuan, Zhang Shaomin, Chance Tarver, Xinguang Xu, Yu Liu, Jie Li, Junghwan Moon, Matthew Tonnemacher, Gary Xu, Charlie Zhang Published: IEEE/MTT-S International Microwave Symposium (IMS) |
Jun 19, 2022 | In this paper, a multi-channel self-interference cancellation (SIC) technique is proposed to reduce the size of an AWS/PCS dual-band 5G FDD massive MIMO base station. Combining two bands with a small frequency offset, such as the PCS and AWS bands, in one base station requires modifying the frequency-duplex cavity filters to provide a wide passband with an extremely narrow gap between the passband and the stop band. Such a duplexer is hard to realize. We propose a novel multi-channel SIC to allow dual-band operation with the same form factor as a single-band base station, and we developed a proof-of-concept (PoC) prototype to demonstrate the feasibility of applying multi-channel SIC to this problem. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9865246 | Next Generation Communications |
CoughTrigger: Earbuds IMU Based Cough Detection Activator Using an Energy-Efficient Sensitivity-Prioritized Time Series Classifier |
Author: Ebrahim Nematihosseinabadi et al. Tousif Ahmed, Minh Dinh, Jilong Kuang, Alex Gao Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
May 23, 2022 | Persistent coughs are a major symptom of respiratory-related diseases. Increasing research attention has been paid to detecting coughs using wearables. Among all types of sensors utilized, the microphone is most widely used to detect coughs. However, the intense power consumption needed to process audio signals prevents acoustic sensors from being continuously powered on battery-limited commercial wearable products, such as earbuds. In this work, we present CoughTrigger, which utilizes a lower-power sensor, an inertial measurement unit (IMU), in earbuds as a cough detection activator to trigger a higher-power sensor for audio processing and classification. It is able to run all the time as a standby service with minimal battery consumption and trigger the audio-based cough detection when a candidate cough is detected from the IMU. Besides, the use of the IMU brings the benefit of improved specificity of cough detection. Experiments were conducted on 45 subjects, achieving 90% sensitivity and 60% specificity for cough detection activation. |
https://arxiv.org/pdf/2111.04185.pdf | Artificial Intelligence |
UbiLung: Multi-modal Passive Sensing for Lung Health Assessment |
Author: Ebrahim Nematihosseinabadi et al. Viswam Nathan, Korosh Vatanparvar, Tousif Ahmed, Mahbubur Rahman, Jilong Kuang, Alex Gao Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
May 23, 2022 | The spirometry test has been the gold standard for measuring a pulmonary patient's lung function for decades. Spirometry is generally done in the hospital setting, where patients need to forcefully blow air into the spirometer's tubes under the guidance of clinicians. Such a procedure is time-consuming, cumbersome, and extremely effort-dependent. Recent advances in ubiquitous computing investigate the feasibility of leveraging commodity devices such as smartphones to replace the standard clinical spirometry test. However, existing solutions are still demanding, usually requiring users to complete a series of tasks such as blowing towards a microphone, and could potentially introduce risks such as dizziness and shortness of breath due to the forced blowing. More importantly, the test is still dependent on the user's effort, which naturally degrades when no supervision exists. We propose UbiLung, a new method that leverages passively sensed modalities for lung function estimation. The method relies on the physiological correlation of the introduced passive modalities to lung function, which obviates the need for active user engagement yet can provide an accurate effort-independent measurement. We focus on sensor modalities that are feasible in passive sensing: cough and speech sounds collected from microphones, and blood volume pulse (BVP) signals collected via photoplethysmography (PPG) sensors. Through feature extraction and selection, our best machine learning models achieve a mean absolute error of 11.1% for estimation of FVC predicted percentage, 11.8% for FEV1 predicted percentage, and 7.4% for FEV1/FVC prediction. This significantly outperforms the baseline, with an average relative improvement of 13.9%. The generalizability of the model was further verified by an average improvement of 7.8% over baselines when applying the model directly to a completely separate and independent dataset. Moreover, we investigated important confounding factors (e.g., age, gender, and smoking behavior), which improve the results by 4.5% on average. In addition to parameter estimation, we also trained models for a series of pulmonary disease diagnosis tasks. Our method achieves an F1-score of 0.982 on healthy vs. diseased, 0.881 on obstructive vs. non-obstructive, 0.854 on COPD vs. asthma, and 0.892 on non-severe vs. severe classification. Our technique is the first multi-modal, effort-independent, passive estimation of lung function, which could shed light on passive monitoring of both pulmonary patients and the general population. |
https://ieeexplore.ieee.org/document/9746614 | Digital Health |
Beam Management with Orientation and RSRP using Deep Learning for Beyond 5G Systems |
Author: Khuong Nhat Nguyen et al. Anum Ali, Jianhua Mo, Boon Loong Ng, Vutha Va, Charlie Zhang Published: IEEE International Conference on Communications (ICC) |
May 16, 2022 | Beam management (BM), i.e., the process of finding and maintaining a suitable transmit and receive beam pair, can be challenging, particularly in highly dynamic scenarios. Side information, e.g., orientation from on-board sensors, can help in user equipment (UE) BM. In this work, we use the orientation information coming from an inertial measurement unit (IMU) for effective BM. We use a data-driven strategy and fuse the reference signal received power (RSRP) information with orientation information using an artificial neural network (ANN). Simulation results show that the proposed strategy performs better than conventional BM and than an orientation-assisted BM strategy from prior work that utilizes a particle filter. Specifically, the proposed data-driven strategy improves beam-prediction accuracy by up to 34% and reduces the mean RSRP loss caused by sub-optimal beam selection by up to 4.2 dB when the UE rotates quickly. |
https://arxiv.org/pdf/2202.02247.pdf | Next Generation Communications |
PolarDenseNet: A Deep Learning Model for CSI Feedback in MIMO Systems |
Author: Pranav Madadi et al. Jeongho Jeon, Joonyoung Cho, Caleb Lo, Juho Lee, Charlie Zhang Published: IEEE International Conference on Communications (ICC) |
May 16, 2022 | In multiple-input multiple-output (MIMO) systems, high-resolution channel state information (CSI) is required at the base station (BS) to ensure optimal performance, especially in the case of multi-user MIMO (MU-MIMO) systems. In the absence of channel reciprocity in frequency division duplex (FDD) systems, the user needs to send the CSI to the BS. The large overhead associated with this CSI feedback in FDD systems often becomes the bottleneck in improving the system performance. In this paper, we propose an AI-based CSI feedback based on |
https://arxiv.org/pdf/2202.01246.pdf | Next Generation Communications |
End-to-end 6G Terahertz Wireless Platform with Adaptive Transmit and Receive Beamforming |
Author: Shadi Abu-Surra et al. Won Suk Choi, SungTae Choi, Eunyoung Seok, Dongjoo Kim, Navneet Sharma, Siddharth Advani, Vitali Loseu, KITAEK BAE, ILJU NA, Gary Xu, Charlie Zhang Published: IEEE International Conference on Communications (ICC) |
May 16, 2022 | 6G is envisioned to provide the ultimate experience for all through hyper-connectivity involving humans and everything, with unprecedented requirements and expectations [1]. In this vision, terahertz (THz) technology is a leading candidate for realizing the 6G requirements. This paper presents the latest developments and results of a terahertz wireless prototyping platform being developed in Samsung's research labs. The platform currently supports real-time transmission of 6 Gbps of data over a 2 GHz channel centered around 135 GHz with adaptive beamforming at the transmitter and receiver. The modem is designed to handle data rates up to 36 Gbps, supports two MIMO streams, and aggregates two 2 GHz channels. This paper also presents the specifications of the current RF units and discusses the challenges faced during the design and fabrication of these units. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9814579 | Next Generation Communications |
Atrial Fibrillation Detection and Atrial Fibrillation Burden Estimation via Wearables |
Author: Li Zhu et al. Viswam Nathan, Jilong Kuang, Jacob Kim, Alex Gao Published: IEEE Journal of Biomedical and Health Informatics |
May 1, 2022 | Atrial Fibrillation (AF) is an important cardiac rhythm disorder which, if left untreated, can lead to serious complications such as a stroke. AF can remain asymptomatic, and it can progressively worsen over time; it is thus a disorder that would benefit from detection and continuous monitoring with a wearable sensor. Here, we develop an AF detection algorithm, deploy it on a smartwatch, and prospectively and comprehensively validate its performance on a real-world population that included patients diagnosed with AF. The algorithm showed a sensitivity of 87.8% and a specificity of 97.4% over every 5-minute segment of PPG evaluated. Furthermore, we introduce novel algorithm blocks and system designs to increase the time of coverage and to monitor for AF even during periods of motion noise and other artifacts that would be encountered in daily-living scenarios. An average of 67.8% of the entire duration the patients wore the smartwatch produced a valid decision. Finally, we present the ability of our algorithm to function throughout the day and estimate the AF burden, a first-of-its-kind measure using a wearable sensor, showing 98% correlation with the ground truth and an average error of 6.2%. |
https://ieeexplore.ieee.org/document/9633021 | Digital Health |
An Information Fusion Approach to Learning With Instance-Dependent Label Noise |
Author: Li Li et al. Rui Chen, Soo-Hyun Choi, Xia Hu Published: International Conference on Learning Representation (ICLR) |
Apr 25, 2022 | Instance-dependent label noise (IDN) widely exists in real-world datasets and usually misleads the training of deep neural networks. The noise transition matrix (i.e., the probability that clean labels flip into noisy labels) is used to characterize the label noise and achieves statistically consistent classifiers for the underlying distribution that the data belongs to. However, most instances are long-tail, i.e., the number of appearances of each instance is usually limited, which leads to a gap between the underlying distribution and the empirical distribution, and to model degeneration. To mitigate this distribution mismatch problem, we propose a posterior transition matrix to posteriorly model label noise given the limited observed noisy labels, achieving statistically consistent classifiers for both the underlying and the empirical distribution. Note that even if an instance is corrupted by the same noise transition matrix, intrinsic randomness leads to different noisy labels, and thus requires different correction methods. Motivated by this observation, we propose an Information Fusion (IF) approach to fine-tune the noise transition matrix based on the estimated posterior transition matrix. Specifically, we adopt the noisy labels and the model's predicted probabilities to estimate the posterior transition matrix and then correct the noise transition matrix in forward propagation. Empirical evaluations on synthetic and real-world datasets demonstrate that our method is superior to state-of-the-art approaches and achieves more stable training when learning from instance-dependent label noise. |
https://openreview.net/pdf?id=ecH2FKaARUp | Artificial Intelligence |
Language model compression with weighted low-rank factorization |
Author: Yen-Chang Hsu et al. Ting Hua, Sung-En Chang, Qian Lou, Yilin Shen, Hongxia Jin Published: International Conference on Learning Representation (ICLR) |
Apr 25, 2022 | Factorizing a large matrix into small matrices is a popular strategy for model compression. Singular value decomposition (SVD) plays a vital role in this compression strategy, approximating a learned matrix with fewer parameters. However, SVD minimizes the squared error toward reconstructing the original matrix without gauging the importance of the parameters, potentially giving a larger reconstruction error for parameters that affect the task accuracy more. In other words, the optimization objective of SVD is not aligned with the task accuracy. In this work, we propose using Fisher information to weigh the importance of parameters affecting the model prediction, and then perform a weighted SVD to factorize the learned matrices of a neural network model. Although our factorized matrices do not necessarily have a smaller reconstruction error, they retain better task accuracy. Our analysis with transformer-based language models shows that our weighted SVD significantly reduces the misalignment between the optimization objectives of low-rank factorization and task accuracy. |
https://openreview.net/pdf?id=uPv9Y3gmAI5 | Artificial Intelligence |
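Because the Fisher-weighted objective has no exact closed-form solution, a common simplification is to collapse the per-parameter importance to a per-row importance, scale the rows, and apply a standard SVD. The numpy sketch below illustrates that pattern; treat it as a plausible reading of the approach, with all sizes and constants chosen for illustration.

```python
# Sketch: row-wise Fisher importance -> scale rows -> plain SVD -> unscale.
import numpy as np

def fisher_weighted_svd(W, fisher, rank):
    """W: (m, n) weight matrix; fisher: (m, n) importance estimates."""
    row_imp = np.sqrt(fisher.sum(axis=1, keepdims=True)) + 1e-8   # (m, 1)
    U, s, Vt = np.linalg.svd(row_imp * W, full_matrices=False)
    A = (U[:, :rank] * s[:rank]) / row_imp    # (m, rank), importance undone
    B = Vt[:rank]                              # (rank, n)
    return A, B                                # W is approximated by A @ B

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 128))
fisher = rng.random((128, 128))               # e.g., accumulated squared gradients
A, B = fisher_weighted_svd(W, fisher, rank=16)
err = np.sqrt((fisher * (W - A @ B) ** 2).sum())   # weighted reconstruction error
```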
CSI Feedback for Distributed MIMO |
Author: Gilwon Lee et al. Md Saifur Rahman, Eko Onggosanusi Published: IEEE Wireless Communications and Networking Conference (WCNC) |
Apr 10, 2022 | In this paper, we consider a distributed multi-input multi-output (D-MIMO) system wherein multiple remote radio heads (RRHs) distributed in a cell are connected to a single baseband unit. To enable coherent joint transmission from multiple RRHs in the D-MIMO system, we propose several channel state information (CSI) codebooks as candidates for enhancements in the context of 3rd Generation Partnership Project (3GPP) 5G New Radio (NR) standardization. The proposed codebooks are developed based on the 5G Release-16 Type-II CSI codebook framework. In addition, we propose dynamic RRH selection (DRS) methods that obtain a performance gain and reduce the amount of feedback by sending the CSI only for the selected RRHs having dominant channel qualities. System-level simulation (SLS) results under realistic scenarios are provided to validate the potential of the proposed CSI codebooks. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9771853 | Next Generation Communications |
DictFormer: Tiny Transformer with Shared Dictionary |
Author: Qian Lou et al. Ting Hua, Yen-Chang Hsu, Yilin Shen, Hongxia Jin Published: International Conference on Learning Representation (ICLR) |
Mar 10, 2022 | We introduce DictFormer, with an efficient shared dictionary, to provide a compact, fast, and accurate transformer model. DictFormer significantly reduces the redundancy in the transformer's parameters by replacing the prior transformer's parameters with a compact, shared dictionary, a few unshared coefficients, and indices. DictFormer also enables faster computation since expensive weight multiplications are converted into cheap shared lookups on the dictionary and a few linear projections. Training the dictionary and coefficients is not trivial since the indices used for looking up the dictionary are not differentiable. We adopt sparse-constraint training with l1-norm relaxation to learn the coefficients and indices in DictFormer. DictFormer is flexible to support different model sizes by dynamically changing the dictionary size. Compared to existing lightweight transformers, DictFormer consistently reduces model size over the Transformer on multiple tasks, e.g., machine translation, abstractive summarization, and language modeling. Extensive experiments show that DictFormer reduces model size by 3.6x to 8.9x with similar accuracy over multiple tasks, compared to the Transformer. |
https://openreview.net/pdf?id=GWQWAeE9EpB | Artificial Intelligence |
ISEEQ: Information Seeking Question Generation using Dynamic Meta-Information Retrieval and Knowledge Graphs |
Author: Kalpa Gunaratna et al. Vijay Srinivasan, Hongxia Jin Published: National Conference on Artificial Intelligence (AAAI) |
Feb 22, 2022 | Conversational Information Seeking (CIS) is a relatively new research area within conversational AI that attempts to seek information from end-users in order to understand and satisfy their needs. If realized, such a system has far-reaching benefits in the real world; for example, a CIS system can assist clinicians in pre-screening or triaging patients in healthcare. A key open sub-problem in CIS that remains unaddressed in the literature is generating Information Seeking Questions (ISQs) based on a short initial query from the end-user. To address this open problem, we propose Information SEEking Question generator (ISEEQ), a novel approach for generating ISQs from just a short user query, given a large text corpus relevant to the user query. First, ISEEQ uses a knowledge graph to enrich the user query. Second, ISEEQ uses the knowledge-enriched query to retrieve relevant context passages to ask coherent ISQs adhering to a conceptual flow. Third, ISEEQ introduces a new deep generative adversarial reinforcement learning-based approach for generating ISQs. We show that ISEEQ can generate high-quality ISQs to promote the development of CIS agents. ISEEQ significantly outperforms comparable baselines on five ISQ evaluation metrics across four datasets having user queries from diverse domains. Further, we argue that ISEEQ is transferable across domains for generating ISQs, as it shows acceptable performance when trained and tested on different pairs of domains. A qualitative human evaluation confirms that ISEEQ-generated ISQs are comparable in quality to human-generated questions and outperform the best comparable baseline. |
https://arxiv.org/pdf/2112.07622.pdf | Artificial Intelligence |
Model-driven Machine Learning Approaches for Mobility Classification in Intelligent 5G Network |
Author: Tiexing Wang et al. Yeqing Hu, Yang Li, Junmo Sung, Rui Wang, Charlie Zhang Published: IEEE Wireless Communications and Networking Conference (WCNC) |
Dec 31, 2021 | Channel information is essential to unleash the benefits of 5G New Radio (NR) by enabling network intelligence that adapts transmissions to users’ channels. In this paper, we propose model-driven feature design and use a support vector machine to classify users’ speed range. Our model-driven features are designed based on stochastic channel modeling. Multiple features are derived from the time-domain cross-correlation and time-domain auto-correlation functions of the sounding reference signals. The classifier is trained and verified with extensive standard-compliant simulated channels at different SNR levels and speeds, and attains greater than 90% accuracy. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9771678 | Next Generation Communications |
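A minimal sketch of the feature-plus-SVM pipeline: normalized autocorrelation of channel estimates at a few lags (Doppler spread shows up as faster correlation decay), fed to a standard SVM. The lags, shapes, and the synthetic training data below are placeholders, not the paper's standard-compliant setup:

```python
import numpy as np
from sklearn.svm import SVC

def correlation_features(h, lags=(1, 2, 4, 8)):
    """Normalized time-domain autocorrelation magnitude of complex
    channel estimates h at a few lags (model-driven mobility features)."""
    h = h - h.mean()
    r0 = np.vdot(h, h).real
    return np.array([np.abs(np.vdot(h[:-l], h[l:])) / r0 for l in lags])

# Hypothetical demo data: one feature row per SRS burst, binary speed label.
rng = np.random.default_rng(0)
X = np.vstack([correlation_features(rng.standard_normal(64)
                                    + 1j * rng.standard_normal(64))
               for _ in range(100)])
y = rng.integers(0, 2, size=100)     # placeholder labels (low/high speed)
clf = SVC(kernel="rbf").fit(X, y)
```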
SAFENet: A Secure, Accurate and Fast Neural Network Inference |
Author: Qian Lou et al. Yilin Shen, Hongxia Jin Published: International Conference on Learning Representations (ICLR) |
Dec 12, 2021 | Advances in neural networks have driven many companies to provide prediction services to users in a wide range of applications. However, current prediction systems raise privacy concerns regarding users’ private data. A cryptographic neural network inference service is an efficient way to allow two parties to execute neural network inference without revealing either party’s data or model. Nevertheless, existing cryptographic neural network inference services suffer from high running latency; in particular, the latency of the communication-expensive cryptographic activation function is three orders of magnitude higher than that of its plaintext-domain counterpart. Since activations are necessary components of modern neural networks, slow cryptographic activation has become the primary obstacle to efficient cryptographic inference. In this paper, we propose a new technique, called SAFENet, to enable a Secure, Accurate and Fast nEural Network inference service. To speed up secure inference and guarantee inference accuracy, SAFENet includes channel-wise activation approximation with multiple-degree options. This is implemented by keeping the most useful activation channels and replacing the remaining, less useful, channels with polynomials of various degrees. SAFENet also supports mixed-precision activation approximation by automatically assigning different replacement ratios to different layers, further increasing the approximation ratio and reducing inference latency. Our experimental results show that SAFENet obtains state-of-the-art inference latency without a decrease in accuracy, reducing latency by 38% to 61% over prior techniques on various encrypted datasets. |
https://openreview.net/pdf?id=Cz3dbFm5u- | Artificial Intelligence |
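The sketch below shows one way channel-wise activation approximation can look in PyTorch: exact ReLU on a kept fraction of channels, a cheap learnable degree-2 polynomial (friendly to cryptographic evaluation) on the rest. The module name, initialization, and fixed channel split are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ChannelwisePolyAct(nn.Module):
    """Keep exact ReLU on the first num_keep channels; approximate the
    remaining channels with a learnable a*x^2 + b*x + c polynomial."""
    def __init__(self, num_channels, keep_ratio=0.5):
        super().__init__()
        self.num_keep = int(num_channels * keep_ratio)
        n_poly = num_channels - self.num_keep
        self.a = nn.Parameter(torch.full((n_poly, 1, 1), 0.25))
        self.b = nn.Parameter(torch.full((n_poly, 1, 1), 0.5))
        self.c = nn.Parameter(torch.zeros(n_poly, 1, 1))

    def forward(self, x):                      # x: (batch, C, H, W)
        kept = torch.relu(x[:, :self.num_keep])
        rest = x[:, self.num_keep:]
        approx = self.a * rest ** 2 + self.b * rest + self.c
        return torch.cat([kept, approx], dim=1)
```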
RRMonitor: A Resource-Aware End-to-End System for Continuous Monitoring of Respiration Rate Using Earable Devices |
Author: Tousif Ahmed et al. Mahbubur Rahman, Mohsin Ahmed, Ebrahim Nematihosseinabadi, Minh Dinh, Nathan Robert Folkman, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Nov 5, 2021 | Respiration rate is considered a critical vital sign, and daily monitoring of respiration rate can provide helpful information about any acute condition in the human body. While researchers have been exploring mobile devices for respiration rate monitoring, passive and continuous monitoring is still not feasible due to many usability challenges (e.g., required active participation) in existing approaches. This paper presents an end-to-end system called RRMonitor that leverages the movement sensors of commodity earbuds to continuously monitor the respiration rate in near real time. While developing the system, we extensively explored key parameters, algorithms, and approaches from the existing literature that are better suited for continuous and passive respiration rate monitoring. RRMonitor can passively track the respiration rate with a mean absolute error as low as 1.64 cycles per minute without requiring active participation from the user. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9631109 | Digital Health |
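For intuition, here is the most basic building block such systems start from: estimating respiration rate as the dominant spectral peak of a motion signal within the plausible breathing band. This is a minimal sketch, not RRMonitor's actual pipeline; sampling rate and band limits are assumed values:

```python
import numpy as np

def breathing_rate_bpm(accel, fs, lo=0.1, hi=0.7):
    """Dominant spectral peak of one accelerometer axis within the
    breathing band (0.1-0.7 Hz, i.e., 6-42 breaths per minute)."""
    accel = accel - accel.mean()
    spec = np.abs(np.fft.rfft(accel * np.hanning(len(accel))))
    freqs = np.fft.rfftfreq(len(accel), d=1.0 / fs)
    band = (freqs >= lo) & (freqs <= hi)
    return 60.0 * freqs[band][np.argmax(spec[band])]

fs = 50                                           # Hz (assumed)
t = np.arange(60 * fs) / fs                       # one minute of data
accel = 0.02 * np.sin(2 * np.pi * 0.25 * t) + 0.01 * np.random.randn(len(t))
print(breathing_rate_bpm(accel, fs))              # ~15 BPM
```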
A Novel Multi-Center Template-Matching Algorithm and Its Application for Cough Detection |
Author: Ebrahim Nematihosseinabadi et al. Tousif Ahmed, Mahbubur Rahman, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Nov 2, 2021 | In time-series classification problems, K-Nearest Neighbors (KNN) combined with an elastic distance measure (especially Dynamic Time Warping (DTW)) achieves outstanding classification performance, but it is often regarded as prohibitively time-consuming. The Nearest Centroid Classifier was proposed as a faster alternative, but its accuracy is compromised because only one centroid is obtained per class. Centroid-based classifiers instead cluster each class and average within each cluster, but they require manually setting the number of clusters. In this work, we propose a novel self-tuning multi-center template-matching algorithm, which can automatically adjust the number of clusters to balance accuracy and inference time. Through experiments conducted on synthetic datasets and a real-world earbud-based cough dataset, we demonstrate the superiority of our proposed algorithm in terms of both accuracy and inference time. |
https://arxiv.org/pdf/2109.00630.pdf | Digital Health |
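A minimal sketch of the matching step: DTW distance plus nearest-of-several-centers classification. The paper's contribution, automatically tuning how many centers each class gets, is omitted here; the template values are made up:

```python
import numpy as np

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def classify(query, centers):
    """centers: {label: [template, ...]} -- multiple centers per class."""
    return min(((dtw(query, t), lbl) for lbl, ts in centers.items() for t in ts))[1]

centers = {"cough": [np.array([0., 1., .5, 0.])],
           "laugh": [np.array([0., .2, .4, .6])]}
print(classify(np.array([0., .9, .4, .1]), centers))   # -> "cough"
```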
Using Neighborhood Context to Improve Information Extraction from Visual Documents Captured on Mobile Phones |
Author: Kalpa Gunaratna et al. Vijay Srinivasan, Sandeep Nama, Hongxia Jin Published: International Conference on Information and Knowledge Management (CIKM) |
Nov 1, 2021 | Information extraction from visual documents is useful in practice for enabling intelligent assistance to users. We present an approach that combines local context information with contextual language models to improve information extraction accuracy. We show that our method performs well across model sizes, including small models that are useful in applications requiring efficient processing (e.g., mobile computing). Our method outperforms a state-of-the-art global-context-based technique, and our implementation on a mobile platform suggests its usefulness in practical real-world applications. |
https://arxiv.org/pdf/2108.10395.pdf | Artificial Intelligence |
SpeechSpiro: Lung Function Assessment from Speech Pattern as an Alternative to Spirometry for Mobile Health Tracking |
Author: Korosh Vatanparvar et al. Viswam Nathan, Ebrahim Nematihosseinabadi, Mahbubur Rahman, Daniel McCaffrey, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Oct 31, 2021 | Respiratory illnesses, which people deal with in various forms such as asthma, chronic obstructive pulmonary disease, or infectious respiratory diseases (e.g., from coronavirus), are common in the United States and globally. The lung function of subjects affected by these illnesses is compromised due to infection and/or inflammation in their respiratory airways. There are clinically validated tests to assess lung function using in-clinic medical equipment and, quite recently, portable spirometry devices. Research has shown that obstruction and restriction in the respiratory airways affect individuals’ voice characteristics, so audio features can be analyzed to predict lung function and the severity of the obstruction. In this paper, we go beyond well-known voice audio features and create a hybrid deep learning model using a CNN-LSTM to discover spatiotemporal patterns in speech and predict lung function parameters with accuracy comparable to conventional devices. We validate the performance and generalizability of our method using data collected from 200 subjects enrolled in two studies, one internal and one in collaboration with a pulmonary hospital. SpeechSpiro measures lung function parameters (e.g., FEV1, FVC, FEV1/FVC) with a mean RMSE of 12% and an R2 of up to 76% using a 60-second phone audio recording of individuals reading a passage. Clinical relevance: speech-based spirometry (SpeechSpiro) eliminates the need for an additional device and carries out lung function assessment outside clinical settings using a smartphone, enabling continuous mobile health tracking for individuals, healthy or with a respiratory illness. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9630077 | Digital Health |
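The shape of a CNN-LSTM regressor for this kind of task, as a hedged sketch: convolutions over per-frame audio features, an LSTM across frames, and a regression head for the lung-function parameters. All layer sizes and the feature dimension are assumptions, not SpeechSpiro's architecture:

```python
import torch
import torch.nn as nn

class CNNLSTMSpiro(nn.Module):
    """1D convs extract local spectral patterns per frame, the LSTM models
    their evolution over the recording, and the head regresses parameters
    such as FEV1, FVC, and FEV1/FVC."""
    def __init__(self, feat_dim=40, hidden=64, num_params=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(feat_dim, 64, 5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, 5, padding=2), nn.ReLU(),
        )
        self.lstm = nn.LSTM(64, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_params)

    def forward(self, x):                  # x: (batch, time, feat_dim)
        z = self.cnn(x.transpose(1, 2)).transpose(1, 2)
        _, (h, _) = self.lstm(z)
        return self.head(h[-1])            # (batch, num_params)
```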
Device Invariant Deep Neural Networks for Pulmonary Audio Event Detection Across Mobile and Wearable Devices |
Author: Mohsin Ahmed et al. Li Zhu, Mahbubur Rahman, Tousif Ahmed, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Oct 31, 2021 | Mobile and wearable devices are increasingly used to develop audio-based machine learning models that infer pulmonary health, exacerbation, and activity. A major challenge to the widespread usage and deployment of such pulmonary health monitoring audio models is maintaining accuracy and robustness across a variety of commodity devices, due to device heterogeneity. Because of this phenomenon, pulmonary audio models developed with data from one type of device perform poorly when deployed on another type of device. In this work, we propose a framework that incorporates feature normalization across individual frequency bins and combines task-specific deep neural networks for model invariance across devices in pulmonary event detection. Our extensive experiments with data from 131 real pulmonary subjects and healthy controls show that our framework can recover up to 163.6% of the accuracy lost due to device heterogeneity for four different pulmonary classification tasks across two broad classification scenarios with two common mobile and wearable devices: smartphone and smartwatch. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9629853 | Digital Health |
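Per-frequency-bin normalization is simple to state in code: z-score each bin over time so that a device's fixed frequency response is removed before the task network sees the features. A minimal sketch, with the spectrogram layout assumed:

```python
import numpy as np

def per_bin_normalize(spec, eps=1e-8):
    """Z-score each frequency bin across time. spec: (freq_bins, frames).
    A device-specific gain per bin becomes a per-bin offset/scale in the
    log-spectrogram, which this normalization cancels."""
    mu = spec.mean(axis=1, keepdims=True)
    sd = spec.std(axis=1, keepdims=True)
    return (spec - mu) / (sd + eps)
```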
Real-Time Limb Motion Tracking with a Single IMU Sensor for Physical Therapy Exercises |
Author: Wenchuan Wei et al. Keiko Kurita, Jilong Kuang, Alex Gao Published: Engineering in Medicine and Biology Conference (EMBC) |
Oct 31, 2021 | Limb exercises are common in physical therapy to improve the range of motion (RoM), strength, and flexibility of the arm or leg. To improve therapy outcomes and reduce cost, motion tracking systems have been used to monitor users’ movements during exercises and provide guidance. Traditional motion tracking systems are based on either cameras or inertial measurement unit (IMU) sensors. Camera-based systems face problems caused by occlusion and lighting, while traditional IMU-based systems require at least two IMU sensors to track the motion of the entire limb, which is inconvenient for users. In this paper, we propose a novel limb motion tracking system that uses a single 9-axis IMU sensor worn on the distal end joint of the limb (i.e., the wrist for the arm or the ankle for the leg). Limb motion tracking using a single IMU sensor is challenging because 1) the noisy IMU data cause drift when estimating position from acceleration, and 2) the single IMU sensor measures the motion of only one joint, while limb motion involves multiple joints. To solve these problems, we propose a recurrent neural network (RNN) model that estimates the 3D positions of the distal end joint as well as the other joints of the limb (e.g., elbow or knee) from the noisy IMU data in real time. Our proposed approach achieves high accuracy, with a median error of 4.4/4.1 cm for the wrist/elbow joint when tracking arm motion, outperforming the state-of-the-art approach by 50%. In addition, the proposed model is lightweight, enabling real-time applications on mobile devices. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9630480 | Digital Health |
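A hedged sketch of the regression setup: a recurrent network mapping a window of 9-axis IMU samples to the 3D positions of two joints. Layer sizes and the choice of LSTM are assumptions, not the paper's exact model:

```python
import torch
import torch.nn as nn

class IMU2Joints(nn.Module):
    """Map 9-axis IMU sequences to per-timestep 3D positions of two
    joints (e.g., wrist and elbow)."""
    def __init__(self, imu_dim=9, hidden=128, num_joints=2):
        super().__init__()
        self.rnn = nn.LSTM(imu_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, num_joints * 3)

    def forward(self, x):                  # x: (batch, time, 9)
        h, _ = self.rnn(x)
        return self.head(h)                # (batch, time, num_joints*3)
```

Regressing positions directly, rather than integrating acceleration, is what sidesteps the drift problem the abstract describes.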
Listen2Cough: Leveraging End-to-End Deep Learning Cough Detection Model to Enhance Lung Health Assessment Using Passively Sensed Audio |
Author: Ebrahim Nematihosseinabadi et al. Korosh Vatanparvar, Viswam Nathan, Tousif Ahmed, Mahbubur Rahman, Daniel McCaffrey, Jilong Kuang, Jun Gao Published: ACM International Conference on Ubiquitous Computing (UbiComp) |
Sep 13, 2021 | The prevalence of ubiquitous computing enables new opportunities for lung health monitoring and assessment. In the past few years, there have been extensive studies on cough detection using passively sensed audio signals. However, the generalizability of a cough detection model applied to external datasets, especially in real-world implementations, is questionable and has not been explored adequately. Beyond detecting coughs, researchers have investigated how cough sounds can be used to assess lung health. However, due to the challenges of collecting both cough sounds and lung health ground truth, previous studies have been hindered by limited datasets. In this paper, we propose Listen2Cough to address these gaps. We first build an end-to-end deep learning architecture using public cough sound datasets to detect coughs within raw audio recordings. We employ a pre-trained MobileNet and integrate a number of augmentation techniques to improve the generalizability of our model. Without additional fine-tuning, our model achieves an F1 score of 0.948 when tested against a new clean dataset and 0.884 on another in-the-wild noisy dataset, an average advantage of 5.8% and 8.4%, respectively, over the best baseline model. Then, to mitigate the issue of limited lung health data, we propose transferring the feature representation from the cough detection task to lung health assessment tasks so that the rich cough data can be leveraged. Our hypothesis is that these tasks extract and utilize similar effective representations from cough sounds. We embed the cough detection model into a multi-instance learning framework with an attention mechanism and further tune the model for lung health assessment tasks. Our final model achieves an F1 score of 0.912 on healthy vs. unhealthy, 0.870 on obstructive vs. non-obstructive, and 0.813 on COPD vs. asthma classification, outperforming the baseline by 10.7%, 6.3%, and 3.7%, respectively. Moreover, the weight values in the attention layer can be used to identify important coughs highly correlated with lung health, which can potentially provide interpretability for expert diagnosis in the future. |
https://dl.acm.org/doi/pdf/10.1145/3448124 | Digital Health |
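Attention-based multi-instance pooling, the mechanism behind the interpretable attention weights, looks roughly like this; dimensions and layer choices are assumptions, not Listen2Cough's exact heads:

```python
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Pool per-cough embeddings into one recording-level embedding with
    learned attention weights; large weights flag influential coughs."""
    def __init__(self, emb_dim=128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(emb_dim, 64), nn.Tanh(), nn.Linear(64, 1))
        self.classifier = nn.Linear(emb_dim, 1)

    def forward(self, bag):                        # bag: (num_coughs, emb_dim)
        w = torch.softmax(self.score(bag), dim=0)  # attention over instances
        pooled = (w * bag).sum(dim=0)              # recording-level embedding
        return torch.sigmoid(self.classifier(pooled)), w
```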
ToA-based Localization of Far-Away Targets: Equi-DOP Surfaces, Asymptotic Bounds, and Dimension Adaptation |
Author: Raghunandan M Rao et al. Boon Loong Ng, YI YANG, Moon-Seok Kang Published: IEEE Transactions on Vehicular Technology |
Sep 3, 2021 | This paper studies the Dilution of Precision (DOP) in time-of-arrival (ToA)-based localization of targets outside the anchors’ convex hull. In the far-away target regime, we derive a closed-form expression for the DOP that reveals a linear asymptotic scaling law. We characterize the asymptotic DOP bounds and equi-DOP surfaces/contours in 3D/2D localization scenarios, which quantify the reliability of location estimates along a trajectory. Motivated by vehicular applications, we propose a range-aided dimension adaptation scheme, in which the localization dimension is adapted in real time using a single range measurement such that the maximum or root-mean-square DOP does not exceed a threshold. Since high-accuracy localization of far-away targets is infeasible due to the linear scaling of DOP with distance, this scheme prioritizes high-performance tracking of nearby targets while monitoring far-away targets with range-only measurements. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9531485 | Next Generation Communications |
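The linear scaling is easy to see numerically with the standard ToA DOP definition, DOP = sqrt(trace((G^T G)^-1)), where each row of G is the unit vector from an anchor to the target; the square anchor layout below is a made-up example, not from the paper:

```python
import numpy as np

def toa_dop(target, anchors):
    """Standard ToA geometric dilution of precision."""
    diff = target - anchors                    # (num_anchors, dim)
    G = diff / np.linalg.norm(diff, axis=1, keepdims=True)
    return np.sqrt(np.trace(np.linalg.inv(G.T @ G)))

anchors = np.array([[0., 0.], [10., 0.], [0., 10.], [10., 10.]])
for d in (50, 100, 200, 400):                  # targets moving away diagonally
    print(d, toa_dop(np.array([d, d]), anchors))
```

Once the target is far outside the anchors' convex hull, doubling the distance roughly doubles the printed DOP, matching the linear asymptotic law.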
Automatic Mixed-Precision Quantization Search of BERT |
Author: Changsheng Zhao et al. Ting Hua, Yilin Shen, Hongxia Jin Published: International Joint Conference on Artificial Intelligence (IJCAI) |
Aug 21, 2021 | Pre-trained language models such as BERT have shown great effectiveness in various natural language processing tasks. However, these models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices. Knowledge distillation, weight pruning, and quantization are the main directions in model compression. In the field of pre-trained language model compression, most existing work aims to obtain a compact model through knowledge distillation from the original larger model, which may suffer a significant accuracy drop even for a relatively small compression ratio. On the other hand, there are only a few quantization attempts designed specifically for natural language processing tasks, and they usually require manual setting of hyper-parameters. In this paper, we propose a BERT compression approach that achieves automatic mixed-precision quantization, conducting quantization and pruning at the same time. Specifically, our proposed method leverages differentiable Neural Architecture Search to automatically assign scales and precision to the parameters in each sub-group, while pruning out redundant groups of parameters. Extensive evaluations on BERT downstream tasks reveal that our proposed method beats the baselines by providing the same performance with a much smaller model size. We also show the possibility of obtaining an extremely lightweight model by combining our solution with orthogonal methods such as DistilBERT. |
https://arxiv.org/pdf/2112.14938.pdf | Artificial Intelligence |
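A DNAS-flavored sketch of differentiable precision assignment: candidate bit-widths are mixed by a softmax over architecture parameters during search, and each weight group keeps its argmax precision afterward. Names, candidate set, and the straight-through quantizer are illustrative assumptions:

```python
import torch
import torch.nn as nn

def fake_quant(w, bits):
    """Uniform symmetric fake-quantization with straight-through rounding."""
    scale = w.abs().max() / (2 ** (bits - 1) - 1)
    q = torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1)
    return w + (q * scale - w).detach()    # forward quantized, backward identity

class MixedPrecisionWeight(nn.Module):
    """Softmax over candidate bit-widths mixes quantized copies of the
    weight; gradients train both the weight and the precision choice."""
    def __init__(self, shape, candidates=(2, 4, 8)):
        super().__init__()
        self.w = nn.Parameter(torch.randn(shape) * 0.02)
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))
        self.candidates = candidates

    def forward(self):
        probs = torch.softmax(self.alpha, dim=0)
        return sum(p * fake_quant(self.w, b) for p, b in zip(probs, self.candidates))
```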
Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU |
Author: Yilin Shen et al. Yen-Chang Hsu, Avik Ray, Hongxia Jin Published: Association for Computational Linguistics (ACL) |
Aug 2, 2021 | Intent classification is a major task in spoken language understanding (SLU). Since most models are built with pre-collected in-domain (IND) training utterances, their ability to detect unsupported out-of-domain (OOD) utterances is critical in practice. Recent works showed that using extra data and labels can improve OOD detection performance, yet collecting such data can be costly. In this paper, we propose to train a joint model on the IND training set alone to support both IND intent classification and OOD detection. Our method, named DDM, explicitly models a domain variable to learn a domain-disentangled utterance representation. DDM can be used as a drop-in replacement for any deep neural intent classifier. To further improve OOD detection performance, we introduce confidence- and feature-based OOD detection methods that combine with DDM and BERT-based models. On three benchmark SLU datasets and one in-house dataset, we show that our method built on BERT and RoBERTa achieves state-of-the-art performance against existing approaches, as well as multiple strong BERT-based baselines, for both intent classification and OOD detection. |
https://arxiv.org/pdf/2106.14464.pdf | Artificial Intelligence |
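For reference, the simplest confidence-based OOD detector that such representation methods are combined with is max-softmax thresholding. This sketch shows only that generic detection step, not DDM itself, and the threshold value is arbitrary:

```python
import torch

def detect_ood(logits, threshold=0.7):
    """Predict the IND intent, but flag the utterance as OOD when the
    top softmax probability falls below the threshold."""
    probs = torch.softmax(logits, dim=-1)
    conf, intent = probs.max(dim=-1)
    return intent, conf < threshold
```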
Towards Motion-Aware Passive Resting Respiratory Rate Monitoring Using Earbuds |
Author: Mahbubur Rahman et al. Tousif Ahmed, Mohsin Ahmed, Ebrahim Nematihosseinabadi, Minh Dinh, Nathan Robert Folkman, Jilong Kuang, Jun Gao Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences |
Jul 27, 2021 | Breathing rate is an important vital sign and an indicator of overall health and fitness. Traditionally, breathing is monitored using specialized devices such as chest bands or spirometers, which are uncomfortable for everyday use. Recent works show the feasibility of estimating breathing rate using earbuds; however, non-breathing head motion is one of the biggest challenges for accurate breathing rate estimation using earbuds or other head-mounted devices such as smart glasses. In this paper, we propose an algorithm to estimate the breathing rate in the presence of non-breathing head motion using the inertial sensors embedded in commodity earbuds. Using a chest band as the reference device, we show that our algorithms can estimate breathing rate in resting positions with ±2.63 breaths per minute (BPM) error. However, when algorithms developed on data without head motion are applied to data with head motion, the error significantly increases. Our head-motion handling algorithm proposed in this paper improves the accuracy by up to 30% in the presence of non-breathing head motion. This paper can help make a big stride towards passive breathing monitoring in everyday life using commodity earbuds, which are increasingly popular. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9507016 | Digital Health |
Better Battery Life: Towards Energy-Efficient Smartwatch-Based Atrial Fibrillation Detection in Ambulatory Free-living Environment |
Author: Retiree et al. Li Zhu, Viswam Nathan, Jilong Kuang Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences |
Jul 27, 2021 | Atrial fibrillation (AF) is an important medical condition that can be passively detected and tracked using a smartwatch. Diagnosis and monitoring of AF can be more effective and reliable if the smartwatch senses continuously, but this can lead to significant battery consumption by the LED in the photoplethysmography (PPG) sensor. In this paper, we explore the feasibility of leveraging downsampling to achieve energy-efficient AF detection. We collect data from participants with paroxysmal AF in real ambulatory free-living environments using a commercial smartwatch and separately study the impact of uniform downsampling and compressed sensing on AF detection. Our results reveal that downsampling enables the AF detection system to consume about 77.4% less LED power than the original sampling strategy without a significant performance drop. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9507025 | Digital Health |
Real-Time 3D Arm Motion Tracking using the 6-axis IMU sensor of a Smartwatch |
Author: Wenchuan Wei et al. Keiko Kurita, Jilong Kuang, Jun Gao Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences |
Jul 27, 2021 | Inertial measurement unit (IMU) sensors are widely used for motion tracking in various applications, e.g., virtual physical therapy and fitness training. Traditional IMU-based motion tracking systems use 9-axis IMU sensors that include an accelerometer, a gyroscope, and a magnetometer. The magnetometer is essential to correct yaw drift in orientation estimation; however, its magnetic field measurement is often disturbed by ferromagnetic materials in the environment and requires frequent calibration. Moreover, most IMU-based systems require multiple IMU sensors to track body motion and are inconvenient to use. In this paper, we propose a novel approach that uses the single 6-axis IMU sensor of a consumer smartwatch, without any magnetometer, to track the user’s 3D arm motion in real time. We use a recurrent neural network (RNN) model to estimate the 3D positions of both the wrist and the elbow from the noisy IMU data. Compared with state-of-the-art approaches that use either a 9-axis IMU sensor or the combination of a 6-axis IMU and an extra device, our proposed approach significantly improves usability and the potential for pervasiveness by requiring neither a magnetometer nor any extra device, while achieving comparable results. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9507012 | Digital Health |
CoughBuddy: Multi-Modal Cough Event Detection Using Earbuds Platform |
Author: Ebrahim Nematihosseinabadi et al. Tousif Ahmed, Mahbubur Rahman, Jilong Kuang, Jun Gao Published: IEEE-EMBS International Conference on Biomedical and Health Informatics(BHI) and the Body Sensor Networks(BSN) Conferences |
Jul 27, 2021 | The prevalence of novel wearable devices has opened new horizons of opportunity for lung health monitoring and assessment in the past decade. There has been an extensive amount of research on cough detection using acoustic features of coughs captured by smartphones and smartwatches. However, the specificity of these algorithms has always been a concern when they are exposed to unseen field data containing cough-like sounds. In this paper, we propose a novel sensor fusion algorithm that employs a hybrid of classification and template matching to tackle the problem of unseen classes. The algorithm utilizes the in-ear audio signal as well as head motion captured by the inertial measurement unit (IMU). We conducted a large study including 45 subjects from healthy and chronic cough cohorts, containing various tasks including coughs and cough-like body sounds in conditions such as quiet/noisy and stationary/non-stationary. Our proposed hybrid algorithm, which comprises audio-event classification and dynamic time warping (DTW)-based IMU template matching, is evaluated for sensitivity and specificity in the aforementioned conditions using leave-one-subject-out validation (LOSOV). Our model achieves an average sensitivity of 83% for stationary tasks with an average specificity of 91.7% for cough-like sounds, reducing the false positive rate by 55%. These results indicate the feasibility and superiority of earbud platforms for detecting pulmonary sound events such as coughs. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9507017 | Digital Health |
Fractionally Spaced Equalizer for Next Generation Terahertz Wireless Communication Systems |
Author: Jeongho Jeon et al. Joonyoung Cho, Shadi Abu-Surra, KITAEK BAE, Charlie Zhang Published: IEEE International Conference on Communications (ICC) |
Jun 14, 2021 | Higher data rates are required to support the exponential growth in wireless traffic, motivating an expansion of the transmission bandwidth for sixth-generation (6G) communications. The available bandwidth in the terahertz (THz) band significantly exceeds that of the mmWave band adopted in fifth-generation (5G) systems; thus, the THz band is envisioned as a pillar for 6G systems that can support data rates on the order of terabits per second (Tb/s). However, wireless communication in the THz band poses several new challenges. One of these is the practical constraint of employing a limited oversampling factor to process wideband THz signals, even while leveraging state-of-the-art analog-to-digital converter techniques. This limited oversampling factor, which can lead to an increased sampling timing offset, degrades demodulation performance when used in conjunction with a conventional symbol-spaced equalizer. Thus, we employ a fractionally spaced equalizer (FSE) in a THz communication system to overcome the impact of the increased sampling timing offset in a practical system that utilizes a limited sampling rate. Analysis and simulations demonstrate that the FSE can perfectly compensate the timing offset by optimally combining the available samples. We also propose an approximation to the noise covariance matrix to reduce the computational complexity of the frequency-domain FSE. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9473698 | Next Generation Communications |
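To illustrate what "fractionally spaced" means in practice, here is a least-squares design of a T/2-spaced time-domain FIR equalizer from training symbols. This is a generic textbook-style sketch under assumed names and a fixed 2x oversampling, not the paper's frequency-domain FSE:

```python
import numpy as np

def design_fse(rx2x, train_syms, taps=31):
    """Least-squares T/2-spaced equalizer design.
    rx2x: received complex samples at 2 samples/symbol.
    train_syms: known training symbols at symbol rate."""
    rows, targets = [], []
    for k, s in enumerate(train_syms):
        seg = rx2x[2 * k: 2 * k + taps]    # T/2-spaced window for symbol k
        if len(seg) < taps:
            break
        rows.append(seg)
        targets.append(s)
    A = np.array(rows)
    w, *_ = np.linalg.lstsq(A, np.array(targets), rcond=None)
    return w    # apply as y[k] = w @ rx2x[2k : 2k + taps]
```

Because the taps sit at half-symbol spacing, the equalizer can interpolate between samples and absorb a sampling timing offset that a symbol-spaced equalizer cannot.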
End-to-end 140 GHz Wireless Link Demonstration with Fully-Digital Beamformed System |
Author: Shadi Abu-Surra et al. Will Choi, SungTae Choi, Eunyoung Seok, Dongjoo Kim, Navneet Sharma, Siddharth Advani, Vitali Loseu, KITAEK BAE, ILJU NA, Gary Xu, Charlie Zhang Published: IEEE International Conference on Communications (Workshop) (ICC W/S) |
Jun 14, 2021 | It is projected that mobile traffic will increase 80× by the year 2030. To meet this increase in demand, utilizing the terahertz bands (0.1 THz to 10 THz) for future 6G wireless systems is inevitable. However, operating at such high frequencies comes with several fundamental and technical challenges. In this work, we present a proof-of-concept system to demonstrate the feasibility of establishing a communication link at a 140 GHz carrier frequency. In addition, this work highlights techniques to tackle the challenges that come with operating in the terahertz regime. To the authors’ knowledge, this is the world’s first end-to-end system with an up-to-16-channel digitally beamformed 140 GHz system and dynamic beam-steering capability. The paper presents lab results that demonstrate a link throughput of 6 Gbps at a 15-meter distance with adaptive beamforming. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9473600 | Next Generation Communications |
An Actor-Critic based End-to-End Neural Coreference System |
Author: Yu Wang et al. Yilin Shen, Hongxia Jin Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
Jun 11, 2021 | In this paper, we introduce a novel actor-critic-based end-to-end neural coreference system that jointly performs mention detection, mention clustering, and coreference resolution. Our model achieves state-of-the-art performance on the CoNLL-2012 Shared Task English test set. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9413579 | Artificial Intelligence |
An adversarial learning based multi-step spoken language understanding system through human-computer interaction |
Author: Yu Wang et al. Yilin Shen, Hongxia Jin Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP) |
Jun 11, 2021 | Most existing spoken language understanding systems can perform semantic frame parsing based only on a single-round user query; they cannot take users’ feedback to update/add/remove slot values through multi-round interaction with users. In this paper, we introduce a novel interactive adversarial reward-learning-based spoken language understanding system that can leverage multi-round user feedback to update slot values. We perform two experiments on the benchmark ATIS dataset and demonstrate that the new system can improve parsing performance by at least 2.5% in terms of F1, with only one round of feedback. The improvement becomes even larger when the number of feedback rounds increases. Furthermore, we also compare the new system with state-of-the-art dialogue state tracking systems and demonstrate that the new interactive system performs better on multi-round spoken language understanding tasks in terms of slot- and sentence-level accuracy. |
https://arxiv.org/pdf/2106.14611.pdf | Artificial Intelligence |
Hyperparameter-free Continuous Learning for NLU Domain Classification |
Author: Ting Hua et al. Yilin Shen, Changsheng Zhao, Yen-Chang Hsu, Hongxia Jin Published: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL) |
Jun 8, 2021 | Domain classification is a fundamental task in natural language understanding (NLU) that often requires fast accommodation to new, emerging domains. |
https://arxiv.org/pdf/2201.01420.pdf | Artificial Intelligence |
Data Augmentation for Voice-Assistant NLU using BERT-based Interchangeable Rephrase |
Author: Akhila Yerukola et al. Hongxia Jin Published: European Association for Computational Linguistics (EACL) |
Apr 21, 2021 | We introduce a data augmentation technique based on byte-pair encoding and a BERT-like self-attention model to boost performance on spoken language understanding tasks. We compare and evaluate this method against a range of augmentation techniques encompassing generative models such as VAEs and performance-boosting techniques such as synonym replacement and back-translation. We show that our method performs strongly on domain and intent classification tasks for a voice assistant and in a user study focused on utterance naturalness and semantic similarity. |
https://arxiv.org/pdf/2104.08268.pdf | Artificial Intelligence |
Early Detection and Burden Estimation of Atrial Fibrillation in Ambulatory Free-living Environment |
Author: Li Zhu et al. Viswam Nathan, Jilong Kuang, Jacob Kim, Jun Gao Published: ACM International Conference on Ubiquitous Computing (UbiComp) |
Mar 1, 2021 | Early detection and accurate burden estimation of AFib can provide the foundation for effective physician treatment and have attracted tremendous attention in recent years. In this paper, we develop a novel smartwatch-based system to detect AFib episodes and estimate AFib burden in an ambulatory free-living environment without user engagement. Our system leverages the built-in PPG sensor to collect heart rhythm data without user engagement. Then, a data preprocessor module includes |
https://dl.acm.org/doi/pdf/10.1145/3463503 | Digital Health |
MIMO Evolution Towards 6G: Modular Massive MIMO in Low-Frequency Bands |
Author: Jeongho Jeon et al. Gilwon Lee, Ahmed Ibrahim, Jin Yuan, Gary Xu, Joonyoung Cho, Eko Onggosanusi, Younsun Kim, Juho Lee, Charlie Zhang Published: IEEE Communications Magazine |
Feb 28, 2021 | As the pace of global 5G network deployments accelerates, now is the moment for the cellular industry to realize sixth-generation (6G) cellular communication. In this article, so-called modular massive MIMO (mmMIMO) is presented as one candidate technology for 6G. 5G relentlessly pushed the boundary of the cellular system’s operating frequency to millimeter-wave bands, and this trend will continue in the 6G era to further embrace the greenfield terahertz (THz) spectrum. Admittedly, however, the technical advances in 5G for low bands fall short, even though low bands are crucial for serving a large number of users over a wide coverage area. Although it would be ideal to utilize massive MIMO in low bands, it is impractical due to the large antenna form factor. mmMIMO is a technology that distributes a large active antenna array across smaller standardized antenna modules, just like LEGO blocks. Through this, the benefits of massive MIMO can be achieved in low bands, unconstrained by spatial limitations. In this article, the concept of mmMIMO, its applicability, and the research efforts needed to realize the technology are discussed. In addition, through the demonstration of a proof-of-concept system, it is shown that the technology will be within reach at the time of massive 6G commercialization around 2030. Lastly, the performance gain of mmMIMO is evidenced by system-level simulation. |
https://ieeexplore.ieee.org/document/9665444 | Next Generation Communications |
BreathTrack: Detecting Regular Breathing Phases from Unannotated Acoustic Data Captured by a Smartphone |
Author: Mahbubur Rahman et al. Tousif Ahmed, Mohsin Ahmed, Korosh Vatanparvar, Ebrahim Nematihosseinabadi, Viswam Nathan, Jilong Kuang, Alex Gao Published: ACM International Conference on Ubiquitous Computing (UbiComp) |
Feb 13, 2021 | Passive and continuous monitoring of breathing biomarkers is vital for assessing well-being and detecting abnormalities in breathing patterns. In this paper, we present a novel method to detect breathing phases during regular breathing, toward passive monitoring of natural breathing using the acoustic sensors embedded in smartphones. Our model eliminates the need for breathing sound annotation by transferring knowledge from the inertial sensor to the acoustic sensor and by fusing signal processing techniques with deep learning methods. Our study with 131 subjects, including healthy subjects and pulmonary patients, shows that our model can detect breathing phases with 77.33% accuracy using acoustic sensors, which enables novel and fine-grained breathing biomarkers such as the inhalation-exhalation ratio and fractional inspiratory time, in addition to the commonly known vital sign, breathing rate. We further show that our algorithm can estimate fractional inspiratory time with 92.08% accuracy, the inhalation-exhalation ratio with 86.76% accuracy, and the commonly known breathing rate with 91.74% accuracy. We also present a respiratory patient detection model as an example application of breathing phase detection and novel biomarker extraction. We show that fractional inspiratory time is significantly correlated with patient severity and that our model can distinguish respiratory patients from healthy individuals with up to 76% accuracy. This paper is the first work to show the feasibility of detecting regular breathing phases toward passively monitoring respiratory well-being using a smartphone. |
https://dl.acm.org/doi/pdf/10.1145/3478123 | Digital Health |
FadeNet: Deep Learning based mm-Wave large-scale channel fading prediction and its applications |
Author: Vishnu Vardhan Ratnam et al. Hao Chen, Charlie Zhang, YOUNG-JIN KIM, Retiree, MINSUNG CHO, SUNG-ROK YOON Published: IEEE Access |
Sep 30, 2020 | Accurate prediction of large-scale channel fading is fundamental to planning and optimization in 5G mm-Wave cellular networks. Current prediction methods, which are either too computationally expensive or inaccurate, are unsuitable for city-scale cell planning and optimization. This paper presents FadeNet, a convolutional-neural-network-enabled alternative for predicting large-scale fading with high computation speed and accuracy. By using carefully designed input features and neural-network architecture, FadeNet accurately predicts the large-scale fading from a base station to each location in its coverage area. Evaluations on realistic data, derived from mm-Wave cells across multiple cities in the USA, suggest that FadeNet can achieve a prediction accuracy of 5.6 dB in root mean square error. In addition, by leveraging the parallel processing capabilities of a graphics processing unit, FadeNet can reduce the prediction time by 40×-1000× in comparison to industry-prevalent methods like ray tracing. Generalizations of FadeNet that can handle variable topographies and base station heights, and its use for optimal cell site selection, are also explored. |
https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9311729 | Next Generation Communications |
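As a toy stand-in for this kind of image-to-image fading predictor, the sketch below maps per-location input feature channels to a path-gain map; the channel choices (e.g., terrain height, distance, line-of-sight indicator) and layer sizes are assumptions, not FadeNet's architecture:

```python
import torch
import torch.nn as nn

class FadePredictor(nn.Module):
    """Map stacked per-location features (as image channels) to a
    per-location large-scale fading map in dB."""
    def __init__(self, in_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, x):          # x: (batch, in_ch, H, W) -> (batch, 1, H, W)
        return self.net(x)
```

Unlike ray tracing, a single forward pass predicts the whole coverage map at once, which is where the reported GPU speedup comes from.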