Factorization Vision Transformer: Modeling Long-Range Dependency With Local Window Cost

Haolin Qin, Daquan Zhou, Tingfa Xu, Ziyang Bian, Jianan Li

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Transformers have astounding representational power but typically incur considerable computational cost, which grows quadratically with image resolution. The prevailing Swin transformer reduces computational costs through a local window strategy. However, this strategy inevitably causes two drawbacks: 1) the local window-based self-attention (WSA) hinders global dependency modeling capability and 2) recent studies point out that local windows impair robustness. To overcome these challenges, we pursue a preferable trade-off between computational cost and performance. Accordingly, we propose a novel factorization self-attention (FaSA) mechanism that enjoys both the advantages of local window cost and long-range dependency modeling capability. By factorizing the conventional attention matrix into sparse sub-attention matrices, FaSA captures long-range dependencies while aggregating mixed-grained information at a computational cost equivalent to that of the local WSA. Leveraging FaSA, we present the factorization vision transformer (FaViT), which has a hierarchical structure. FaViT achieves high performance and robustness, with linear computational complexity with respect to the input image spatial resolution. Extensive experiments show FaViT's advanced performance in classification and downstream tasks. It also exhibits strong robustness to corrupted and biased data, demonstrating benefits for practical applications. Compared with the baseline model Swin-T, our FaViT-B2 significantly improves classification accuracy by 1% and robustness by 7%, while reducing model parameters by 14%. Our code will soon be publicly available at https://github.com/q2479036243/FaViT.
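
To make the core idea concrete, the following is a minimal PyTorch-style sketch of attention factorized into sparse (dilated) sub-attention matrices: tokens sampled at a fixed stride form a group, and attention is computed only within each group, so the cost matches that of local window self-attention while each group's receptive field spans the whole sequence. This is an illustrative sketch under stated assumptions, not the authors' FaSA implementation; the class name FactorizedSparseAttention and its parameters are hypothetical, and FaSA's mixed-grained aggregation is omitted.

    import torch
    import torch.nn as nn

    class FactorizedSparseAttention(nn.Module):
        """Hypothetical sketch: each head attends within a dilated (strided)
        group of tokens, so the per-layer cost matches local window attention
        while the receptive field covers the full sequence. Not the authors'
        FaSA code."""

        def __init__(self, dim, num_heads=4, window=7):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.window = window                      # tokens per sparse group
            self.scale = self.head_dim ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                         # x: (B, N, C), N divisible by window
            B, N, C = x.shape
            assert N % self.window == 0
            q, k, v = self.qkv(x).chunk(3, dim=-1)

            def heads(t):                             # (B, N, C) -> (B, h, N, d)
                return t.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

            q, k, v = map(heads, (q, k, v))
            stride = N // self.window                 # dilation: group g = {g, g+stride, ...}

            def group(t):                             # (B, h, N, d) -> (B, h, stride, window, d)
                return t.reshape(B, self.num_heads, self.window, stride,
                                 self.head_dim).transpose(2, 3)

            qg, kg, vg = map(group, (q, k, v))
            attn = (qg @ kg.transpose(-2, -1)) * self.scale   # (B, h, stride, window, window)
            attn = attn.softmax(dim=-1)
            out = attn @ vg                                    # sparse sub-attention output
            # Undo grouping and merge heads back to (B, N, C)
            out = out.transpose(2, 3).reshape(B, self.num_heads, N, self.head_dim)
            out = out.transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

    if __name__ == "__main__":
        # Usage sketch: a 14x14 feature map flattened to 196 tokens, embed dim 96.
        x = torch.randn(2, 196, 96)
        y = FactorizedSparseAttention(dim=96, num_heads=4, window=7)(x)
        print(y.shape)                                # torch.Size([2, 196, 96])

With window size w and N tokens, each of the N/w strided groups costs w^2, for a total of N*w, the same order as local window self-attention, while each group spans the entire token sequence.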

Original language: English
Pages (from-to): 1-14
Number of pages: 14
Journal: IEEE Transactions on Neural Networks and Learning Systems
DOIs
Publication status: Accepted/In press - 2023

Keywords

  • Computational efficiency
  • Computational modeling
  • Costs
  • Factorization
  • Robustness
  • Sparse matrices
  • Transformers
  • Windows
  • local window
  • long-range dependency
  • model robustness
  • transformer
