Snapshot Matching Masking for Improved PSD Estimation in Mask-Based Neural Beamforming for Multichannel Speech Enhancement
In multichannel speech enhancement (SE), time-frequency (T-F) mask-based neural beamforming algorithms use deep neural networks to predict T-F masks that represent speech/noise dominance. These masks are used to estimate the speech and noise power spectral density (PSD) matrices, from which the spatial filter weights are then obtained. However, most networks in the literature are trained to estimate pre-defined masks, e.g., the ideal binary mask (IBM) and the ideal ratio mask (IRM), which lack a direct connection to the PSD matrices. In this paper, we propose a new masking strategy in which a complex-valued U-Net predicts a novel T-F mask, the Snapshot Matching Mask (SMM), that aims to minimize the distance between the predicted and true signal snapshots, thereby estimating the PSD matrices in a more systematic way. We compare the SMM with existing IBM- and IRM-based beamformers on several datasets to demonstrate its effectiveness for improved T-F mask-based beamforming.
Authors: Chinghua Lee, Chouchang Yang, Yilin Shen, Hongxia Jin
Published: International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
Date: Jun 4, 2023
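
For readers unfamiliar with the generic pipeline the abstract refers to (T-F masks, then PSD matrices, then spatial filter weights), the following is a minimal NumPy sketch. It is not the SMM method itself. It assumes real-valued masks in [0, 1] (IRM-like) and an MVDR beamformer in the reference-channel formulation, which the abstract does not specify. All function names (`masked_psd`, `mvdr_weights`, `apply_beamformer`) and the array layout (frequency, time, microphone) are hypothetical choices made for illustration.

```python
import numpy as np


def masked_psd(Y, mask, eps=1e-8):
    """Mask-weighted PSD estimate (illustrative, not the paper's SMM).

    Y:    complex STFT snapshots, shape (F, T, M)
    mask: real-valued T-F mask in [0, 1], shape (F, T)
    Returns one (M, M) PSD matrix per frequency bin, shape (F, M, M).
    """
    # Sum over time of mask(f, t) * y(f, t) y(f, t)^H
    num = np.einsum("ft,ftm,ftn->fmn", mask, Y, Y.conj())
    # Normalize by the total mask weight in each frequency bin
    den = mask.sum(axis=1)[:, None, None] + eps
    return num / den


def mvdr_weights(phi_s, phi_n, ref=0, eps=1e-8):
    """MVDR weights, assumed form: w = (Phi_n^-1 Phi_s) u_ref / tr(Phi_n^-1 Phi_s).

    phi_s, phi_n: speech and noise PSD matrices, shape (F, M, M)
    ref:          index of the reference microphone
    Returns filter weights of shape (F, M).
    """
    n_mics = phi_n.shape[-1]
    # Small diagonal loading keeps the noise PSD invertible
    phi_n = phi_n + eps * np.eye(n_mics)
    ratio = np.linalg.solve(phi_n, phi_s)  # Phi_n^-1 Phi_s, shape (F, M, M)
    trace = np.trace(ratio, axis1=-2, axis2=-1)[:, None]
    return ratio[..., ref] / (trace + eps)


def apply_beamformer(w, Y):
    """Apply w(f)^H y(f, t) at every T-F bin; returns shape (F, T)."""
    return np.einsum("fm,ftm->ft", w.conj(), Y)
```

In a mask-based system, a network would supply `speech_mask` and `noise_mask` for a mixture `Y`, and the enhanced signal would be `apply_beamformer(mvdr_weights(masked_psd(Y, speech_mask), masked_psd(Y, noise_mask)), Y)`. The sketch makes the abstract's point concrete: the masks enter only through the weighted outer-product sums in `masked_psd`, so a mask defined by speech/noise dominance is not necessarily the one that yields the most accurate PSD matrices.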