Abstract
Vision Transformers (ViT) effectively capture global and local image features by connecting image patches and facilitating information transfer between them, which has made them an essential tool in computer vision. However, their computational cost has been a major limiting factor in their application. To reduce the cost introduced by the attention mechanism in the transformer architecture, researchers have explored two approaches: reducing the number of patches involved in the computation and designing new attention mechanisms. Although these methods improve efficiency, they require manual preprocessing and additional model training compared to ViT, which limits their flexibility. In this work, we propose an adaptive attention pattern for vision transformers that is easily implemented within the transformer architecture, and we design a novel window transformer architecture for various vision tasks that requires no preprocessing or additional model training. Our method determines which patches participate in self-attention based on the similarity of image patches in a multidimensional embedding space, thereby reducing the computational cost of the attention calculation. Experimental results show that our method is more effective, involving fewer patches in the attention calculation than window attention architectures without the proposed attention block. Furthermore, to better understand the relationship between the transformer architecture and its input patches, we investigate the impact of different image patches on the performance of transformer-based networks. We find that for typical window transformer networks, only a subset of patches is crucial for accurate object recognition, while the remaining patches mainly contribute to the confidence of the predictions.
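The core idea described above, restricting self-attention to groups of mutually similar patches, can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's actual algorithm: the greedy cosine-similarity clustering, the `threshold` parameter, and the function names are all assumptions made for the example; the point is only that attention cost then scales with the sum of squared cluster sizes rather than the square of the total patch count.

```python
import numpy as np

def cluster_patches(patches, threshold=0.8):
    """Greedy grouping of patches by cosine similarity.
    Illustrative stand-in for the paper's window patch clustering;
    the exact grouping procedure is an assumption."""
    normed = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    clusters = []  # each cluster is a list of patch indices
    for i in range(len(patches)):
        for c in clusters:
            # join the first cluster whose representative is similar enough
            if normed[i] @ normed[c[0]] >= threshold:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def clustered_self_attention(patches, threshold=0.8):
    """Self-attention computed only within each similarity cluster,
    so cost is sum(m_k^2) over cluster sizes m_k instead of N^2."""
    out = np.zeros_like(patches)
    for idx in cluster_patches(patches, threshold):
        x = patches[idx]                        # (m, d) patches of one cluster
        scores = x @ x.T / np.sqrt(x.shape[1])  # scaled dot-product scores
        out[idx] = softmax(scores) @ x          # attend only inside the cluster
    return out
```

For example, if 64 patches split into four clusters of 16, the pairwise-score computation drops from 64² = 4096 to 4 × 16² = 1024 score entries, matching the abstract's claim that fewer patches participate in each attention calculation.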
| Field | Value |
|---|---|
| Original language | English |
| Article number | 114647 |
| Journal | Knowledge-Based Systems |
| Volume | 330 |
| DOIs | |
| Publication status | Published - 25 Nov 2025 |
| Externally published | Yes |
Keywords
- Computer vision
- Image recognition
- Machine learning
- Vision transformer
- Window clustering attention
Cite this
'Not all patches are crucial to image recognition: Window patch clustering attention for transformers', Knowledge-Based Systems, vol. 330, article 114647, 25 Nov 2025.