Loudspeaker position identification using human speech directivity index
Abstract
Extended Summary

Users of multichannel loudspeaker systems in typical living rooms often place the loudspeakers non-uniformly, at angles that do not necessarily follow the recommended ITU-R BS.2159-4 standard and at inconsistent distances from each speaker to the listener. By identifying the physical locations of the loudspeakers, a spatial correction can be applied to recreate the artistic intention of the producer. The main goal of this proposal is to obtain the user/listener location with respect to the loudspeakers, assuming a multichannel audio system equipped with N loudspeakers and M very near-field (NF) microphones attached to each loudspeaker. This is done with a supervised machine learning (ML) model trained on the human speech directivity index (DI) computed from room simulations, in which the sound source has the typical directivity radiation pattern of human speech and the NF receivers attached to the loudspeakers are located around the listener.

The DI represents the ratio of the acoustic energy radiated in one specific direction to that radiated in all directions. The human voice presents a unique directivity pattern that depends on frequency, angle/direction, and distance, and the computed DI carries that information. Given the setup described above, with multiple microphones/receivers placed in a room and a human speaker as the source, the DI can be extracted from a multichannel voice command recorded from the user.

The neural network (NN) is trained with DI data computed from in-room simulations of human speech. An image-source room simulation model replicates typical human speech as recorded by receivers placed at typical loudspeaker positions around the source (the user). Because the NF microphones are attached to the loudspeakers as close as possible to the driver, their directivity is affected by the loudspeaker baffle; this NF microphone directivity is included in the simulation model, and typical female and male speech directivity patterns are used for the source. A customized room generator created shoe-box rooms of various sizes, each with material absorption coefficients chosen from a selection pool and with receiver and source locations randomized within limits. A total of 39 rooms were simulated, spanning three room sizes from 80 to 300 cubic meters, which produced 1140 setups (570 setups × two genders) of IR data, each setup comprising four channels of simulated impulse responses (IRs). The simulated IRs at the NF microphones/receivers were then convolved with anechoic male and female mono speech recordings; the resulting convolved audio represents the voice-command audio that the loudspeakers' multichannel NF microphones would record in each simulation case.

The data was split 80% for training, 10% for testing, and 10% for validation. Before the data was passed to the NN models, principal component analysis (PCA) was applied to increase interpretability and reduce dimensionality; the dB values were converted to linear amplitude values to facilitate the PCA. The training, test, and validation sets were then passed to two NN models, one estimating the distance to the user and the other the incidence angle. The distance NN model included an input layer, two hidden layers, and an output layer; the angle estimation network included an input …
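For reference, the DI described above admits a standard textbook formulation; the expression below is that generic definition, not necessarily the exact estimator used in the paper:

```latex
% Directivity index: energy radiated in a reference direction relative
% to the spherical average of the radiated energy, expressed in dB.
\[
\mathrm{DI}(f) \;=\; 10\,\log_{10}
\frac{\lvert p(r,\theta_0,\phi_0,f)\rvert^{2}}
     {\frac{1}{4\pi}\oint_{4\pi}\lvert p(r,\theta,\phi,f)\rvert^{2}\,\mathrm{d}\Omega}
\]
% Here (theta_0, phi_0) is the reference direction, p is the sound
% pressure at distance r, and the integral runs over the full solid angle.
```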
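As an illustration of the IR-generation step, the sketch below uses the open-source pyroomacoustics image-source implementation as a stand-in for the paper's customized room generator; the room size, absorption value, source/receiver positions, and placeholder speech signal are all assumptions, and the speech-source and NF-microphone directivities described above are omitted for brevity.

```python
# Minimal image-source sketch, assuming pyroomacoustics as a stand-in
# for the paper's customized room generator. All values are illustrative.
import numpy as np
from scipy.signal import fftconvolve
import pyroomacoustics as pra

fs = 16000
# Shoe-box room; in the paper, sizes and absorption are drawn from a pool.
room = pra.ShoeBox([7.0, 6.0, 3.0], fs=fs,
                   materials=pra.Material(0.25), max_order=17)

# Human speaker (the user) as the source.
room.add_source([3.5, 3.0, 1.6])

# Four NF receivers at typical loudspeaker positions around the user.
mics = np.array([[1.0, 6.0, 1.0, 6.0],    # x [m]
                 [1.0, 1.0, 5.0, 5.0],    # y [m]
                 [1.2, 1.2, 1.2, 1.2]])   # z [m]
room.add_microphone_array(mics)

room.compute_rir()  # one image-source IR per (receiver, source) pair

# Convolve each IR with anechoic speech to obtain the simulated
# multichannel "voice command" the NF microphones would record.
speech = np.random.randn(2 * fs)  # placeholder for an anechoic recording
sigs = [fftconvolve(speech, room.rir[m][0]) for m in range(mics.shape[1])]
n = max(len(s) for s in sigs)
voice_command = np.stack([np.pad(s, (0, n - len(s))) for s in sigs])
```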
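The dB-to-linear conversion, PCA reduction, and 80/10/10 split could look like the following sketch, assuming scikit-learn; the feature array di_db, its shape, and the component count are hypothetical.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Hypothetical DI feature matrix in dB:
# 1140 setups x (4 channels x 31 frequency bands).
di_db = np.random.randn(1140, 4 * 31)   # placeholder for computed DI data
di_lin = 10.0 ** (di_db / 20.0)         # dB -> linear amplitude

pca = PCA(n_components=10)              # component count is illustrative
features = pca.fit_transform(di_lin)

# 80/10/10 split, following the proportions given in the abstract.
train, tmp = train_test_split(features, train_size=0.8, random_state=0)
test, val = train_test_split(tmp, test_size=0.5, random_state=0)
```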
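A minimal Keras sketch of the two regression networks is given below; the abstract fixes only the distance model's topology (input, two hidden layers, output), so the layer widths, activations, loss, and the angle model's hidden structure are assumptions.

```python
import tensorflow as tf

def distance_model(n_features):
    # Input, two hidden layers, one output (distance to the user).
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 1
        tf.keras.layers.Dense(64, activation="relu"),   # hidden layer 2
        tf.keras.layers.Dense(1),                       # distance [m]
    ])

def angle_model(n_features):
    # Hidden structure not specified in the abstract; one layer assumed.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1),                       # incidence angle [deg]
    ])

model = distance_model(10)
model.compile(optimizer="adam", loss="mse")  # regression loss (assumed)
```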
Authors: Adrian Celestinos, Carren Zhongran Wang, Victor Manuel Chin Lopez
Published: Audio Engineering Society (AES) Convention
Date: Oct 25, 2023