Automatic Mixed-Precision Quantization Search of BERT


Pre-trained language models such as BERT have shown great effectiveness in various natural language processing tasks. However, these models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices. Knowledge distillation, weight pruning, and quantization are the main directions in model compression. For pre-trained language models, most existing work aims to obtain a compact model through knowledge distillation from the original larger model, which may suffer a significant accuracy drop even at a relatively small compression ratio. On the other hand, there are only a few quantization-based approaches designed for natural language processing tasks, and they usually require manual setting of hyper-parameters. In this paper, we propose a BERT compression approach that achieves automatic mixed-precision quantization, conducting quantization and pruning at the same time. Specifically, our proposed method leverages differentiable Neural Architecture Search to automatically assign scales and precision to the parameters in each sub-group, while pruning out redundant groups of parameters. Extensive evaluations on BERT downstream tasks reveal that our proposed method outperforms baselines by providing the same performance with a much smaller model size. We also show the possibility of obtaining an extremely light-weight model by combining our solution with orthogonal methods such as DistilBERT.
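The general idea of a differentiable precision search can be illustrated with a small sketch. This is not the paper's implementation: the candidate bit-widths, the symmetric uniform quantizer, and the function names (`quantize`, `relaxed_group`, `select_bits`) are all assumptions chosen for illustration. Each parameter sub-group holds one logit per candidate precision; during the search the group's weights are a softmax-weighted mixture of their quantized versions, and a 0-bit candidate stands in for pruning the group.

```python
import math

# Hypothetical candidate precisions; 0 bits means the group is pruned.
CANDIDATE_BITS = (0, 2, 4, 8)

def quantize(weights, bits):
    """Symmetric uniform quantization of one sub-group (illustrative only)."""
    if bits == 0:
        return [0.0 for _ in weights]          # pruned group
    if bits < 2:
        raise ValueError("this sketch supports 0 or >= 2 bits")
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1               # e.g. 127 for 8 bits
    scale = max_abs / levels                   # per-group scale
    return [round(w / scale) * scale for w in weights]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def relaxed_group(weights, logits, candidate_bits=CANDIDATE_BITS):
    """Continuous relaxation: mix the quantized versions by softmax(logits).

    Returns the mixed weights and the expected bit-width, which could be
    added to a training loss as a model-size penalty.
    """
    probs = softmax(logits)
    versions = [quantize(weights, b) for b in candidate_bits]
    mixed = [sum(p * v[i] for p, v in zip(probs, versions))
             for i in range(len(weights))]
    expected_bits = sum(p * b for p, b in zip(probs, candidate_bits))
    return mixed, expected_bits

def select_bits(logits, candidate_bits=CANDIDATE_BITS):
    """After the search, keep the precision with the largest logit."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return candidate_bits[best]

if __name__ == "__main__":
    group = [0.5, -1.0, 0.25]
    logits = [0.0, 0.0, 0.0, 0.0]              # untrained: uniform mixture
    mixed, bits = relaxed_group(group, logits)
    print(mixed, bits)
    print(select_bits([0.1, 2.0, -1.0, 0.5]))
```

In an actual search the logits would be trained by gradient descent together with the model weights, which needs an automatic differentiation framework and a gradient estimator for the rounding step; both are omitted here.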

Authors: Changsheng Zhao, Ting Hua, Yilin Shen, Hongxia Jin

Published: International Joint Conference on Artificial Intelligence (IJCAI)

Date: Aug 21, 2021