TY - GEN
T1 - Dual Transformer Encoder Model for Medical Image Classification
AU - Yan, Fangyuan
AU - Yan, Bin
AU - Pei, Mingtao
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Compared with convolutional neural networks, the vision transformer, with its powerful global modeling ability, has achieved promising results in natural image classification and has been applied to medical image analysis. The vision transformer divides the input image into a sequence of tokens with a fixed hidden size and keeps that hidden size constant during training. However, a single fixed size is not suitable for all medical images. To address this issue, we propose a new dual transformer encoder model consisting of two transformer encoders with different hidden sizes, so that the model can be trained on two token sequences of different sizes. In addition, when predicting the category, the vision transformer considers only the class token output by the last layer of the encoder, ignoring the information from the other layers. We use a Layer-wise Class token Attention (LCA) classification module that leverages the class tokens from all layers of the encoders to predict the category. Extensive experiments show that our proposed model achieves better performance than other transformer-based methods, demonstrating its effectiveness.
KW - dual-encoder model
KW - medical image classification
KW - vision transformer
UR - http://www.scopus.com/inward/record.url?scp=85180777351&partnerID=8YFLogxK
U2 - 10.1109/ICIP49359.2023.10222303
DO - 10.1109/ICIP49359.2023.10222303
M3 - Conference contribution
AN - SCOPUS:85180777351
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 690
EP - 694
BT - 2023 IEEE International Conference on Image Processing, ICIP 2023 - Proceedings
PB - IEEE Computer Society
T2 - 30th IEEE International Conference on Image Processing, ICIP 2023
Y2 - 8 October 2023 through 11 October 2023
ER -