GOHSP: A Unified Framework of Graph and Optimization-based Heterogeneous Structured Pruning for Vision Transformer
Transformers have seen growing interest in processing different modalities, including language and images. As a result, vision and language data can be processed with transformers that are architecturally similar, which opens up several opportunities. In this study, we explore weight sharing between two architecturally similar transformers for vision and language tasks. More specifically, we investigate sharing two main components of the transformers across the two backbones: (1) Multi-Head Self-Attention (MSA) layers and (2) Feed-Forward Network (FFN) layers. To achieve this, we propose an additional objective that encourages minimizing the difference between the MSA weights, as well as between the FFN weights, of the two backbones. After minimizing the difference between the corresponding weights in the two backbones, we perform weight sharing and fine-tune the model. We perform experiments on vision and language tasks, including Referring Expression Comprehension and VQA, using the state-of-the-art model MDETR. Our experiments show that we can reduce the size of MDETR by 35-40% by sharing MSA and FFN weights without significant loss in accuracy.
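The two-stage procedure described in the abstract (penalize the weight difference, then share) can be illustrated with a minimal pure-Python sketch. The abstract does not specify the form of the objective or how sharing is carried out, so everything here is an assumption made for illustration: the squared L2 difference, the weighting factor `lam`, the parameter names such as `msa.qkv` and `ffn.fc1`, and tying by reusing the vision backbone's weights in the text backbone.

```python
from typing import Dict, List

# Hypothetical representation: parameter name -> flattened weight values.
Weights = Dict[str, List[float]]


def weight_difference_penalty(vision: Weights, text: Weights,
                              shared_keys: List[str]) -> float:
    """Sum of squared differences over the parameters selected for sharing.

    Assumption: a squared L2 distance; the abstract only says the objective
    encourages minimizing the difference between corresponding weights.
    """
    total = 0.0
    for key in shared_keys:
        v, t = vision[key], text[key]
        if len(v) != len(t):
            raise ValueError(f"size mismatch for parameter {key!r}")
        total += sum((a - b) ** 2 for a, b in zip(v, t))
    return total


def total_loss(task_loss: float, penalty: float, lam: float = 0.1) -> float:
    """Task loss plus the weighted difference penalty (lam is hypothetical)."""
    return task_loss + lam * penalty


def tie_weights(vision: Weights, text: Weights, shared_keys: List[str]) -> None:
    """Make the text backbone reuse the vision backbone's weight storage.

    Assumption: sharing keeps the vision copy; averaging would be another option.
    """
    for key in shared_keys:
        text[key] = vision[key]


# Toy backbones: only the MSA and FFN entries are candidates for sharing.
vision = {"msa.qkv": [0.5, -1.0], "ffn.fc1": [2.0, 0.0], "patch_embed": [1.0]}
text = {"msa.qkv": [0.0, -1.0], "ffn.fc1": [2.0, 1.0], "token_embed": [3.0]}
shared = ["msa.qkv", "ffn.fc1"]

penalty = weight_difference_penalty(vision, text, shared)
loss = total_loss(task_loss=2.0, penalty=penalty, lam=0.1)

tie_weights(vision, text, shared)
penalty_after_tying = weight_difference_penalty(vision, text, shared)
```

In this sketch the penalty would be driven toward zero during training before `tie_weights` is called, so that replacing one copy with the other changes the model as little as possible; modality-specific parameters such as the embeddings are left untouched.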
Authors: Burak Uzkent, Yilin Shen, Hongxia Jin
Published: National Conference on Artificial Intelligence (AAAI)
Date: Jan 2, 2023