Factorization Vision Transformer: Modeling Long-Range Dependency With Local Window Cost

Haolin Qin, Daquan Zhou, Tingfa Xu, Ziyang Bian, Jianan Li

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Transformers have astounding representational power but typically incur considerable computational cost, which grows quadratically with image resolution. The prevailing Swin transformer reduces computational costs through a local window strategy. However, this strategy inevitably causes two drawbacks: 1) the local window-based self-attention (WSA) hinders global dependency modeling capability and 2) recent studies point out that local windows impair robustness. To overcome these challenges, we pursue a preferable trade-off between computational cost and performance. Accordingly, we propose a novel factorization self-attention (FaSA) mechanism that enjoys both the advantages of local window cost and long-range dependency modeling capability. By factorizing the conventional attention matrix into sparse sub-attention matrices, FaSA captures long-range dependencies while aggregating mixed-grained information at a computational cost equivalent to that of the local WSA. Leveraging FaSA, we present the factorization vision transformer (FaViT), which has a hierarchical structure. FaViT achieves high performance and robustness, with linear computational complexity with respect to the input image spatial resolution. Extensive experiments show FaViT's advanced performance in classification and downstream tasks. It also exhibits strong robustness to corrupted and biased data, demonstrating benefits for practical applications. Compared with the baseline model Swin-T, our FaViT-B2 significantly improves classification accuracy by 1% and robustness by 7%, while reducing model parameters by 14%. Our code will soon be publicly available at https://github.com/q2479036243/FaViT.
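
To make the core idea concrete, the following is a minimal PyTorch-style sketch of attention factorized into sparse (dilated) sub-attention matrices: tokens sampled at a fixed stride form a group, and attention is computed only within each group, so the cost matches that of local window self-attention while each group's receptive field spans the whole sequence. This is an illustrative sketch under stated assumptions, not the authors' FaSA implementation; the class name FactorizedSparseAttention and its parameters are hypothetical, and FaSA's mixed-grained aggregation is omitted.

    import torch
    import torch.nn as nn

    class FactorizedSparseAttention(nn.Module):
        """Hypothetical sketch: each head attends within a dilated (strided)
        group of tokens, so the per-layer cost matches local window attention
        while the receptive field covers the full sequence. Not the authors'
        FaSA code."""

        def __init__(self, dim, num_heads=4, window=7):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.window = window                      # tokens per sparse group
            self.scale = self.head_dim ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                         # x: (B, N, C), N divisible by window
            B, N, C = x.shape
            assert N % self.window == 0
            q, k, v = self.qkv(x).chunk(3, dim=-1)

            def heads(t):                             # (B, N, C) -> (B, h, N, d)
                return t.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

            q, k, v = map(heads, (q, k, v))
            stride = N // self.window                 # dilation: group g = {g, g+stride, ...}

            def group(t):                             # (B, h, N, d) -> (B, h, stride, window, d)
                return t.reshape(B, self.num_heads, self.window, stride,
                                 self.head_dim).transpose(2, 3)

            qg, kg, vg = map(group, (q, k, v))
            attn = (qg @ kg.transpose(-2, -1)) * self.scale   # (B, h, stride, window, window)
            attn = attn.softmax(dim=-1)
            out = attn @ vg                                    # sparse sub-attention output
            # Undo grouping and merge heads back to (B, N, C)
            out = out.transpose(2, 3).reshape(B, self.num_heads, N, self.head_dim)
            out = out.transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

    if __name__ == "__main__":
        # Usage sketch: a 14x14 feature map flattened to 196 tokens, embed dim 96.
        x = torch.randn(2, 196, 96)
        y = FactorizedSparseAttention(dim=96, num_heads=4, window=7)(x)
        print(y.shape)                                # torch.Size([2, 196, 96])

With window size w and N tokens, each of the N/w strided groups costs w^2, for a total of N*w, the same order as local window self-attention, while each group spans the entire token sequence.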

Original language: English
Pages (from-to): 1-14
Number of pages: 14
Journal: IEEE Transactions on Neural Networks and Learning Systems
DOIs
Publication status: Accepted/In press - 2023

Keywords

  • Computational efficiency
  • Computational modeling
  • Costs
  • Factorization
  • Robustness
  • Sparse matrices
  • Transformers
  • Windows
  • local window
  • long-range dependency
  • model robustness
  • transformer
