TY - GEN
T1 - Image Dense Captioning of Irregular Regions Based on Visual Saliency
AU - Wen, Xiaosheng
AU - Jian, Ping
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Traditional dense captioning describes local details of an image in natural language. It typically applies object detection first and then describes the contents of each detected bounding box, which makes the descriptions rich. However, captioning based on object detection often overlooks the associations between objects and their environment, or among the objects themselves. Moreover, no existing dense captioning method can handle irregular regions. To address these problems, we propose a region division method based on visual saliency, which focuses on areas rather than only on objects. Based on this division, local descriptions of irregular regions are generated. For each region, we combine the whole image with the target region to produce features, which are fed into the captioning model. We trained and tested on the Visual Genome dataset. Experiments show that our model is comparable to the baseline under traditional bounding boxes, and the descriptions it generates for irregular regions are equally good. Our model also performs well in image retrieval experiments and exhibits less information redundancy. In application, it supports manually selecting a region of interest on the image for description, assisting in expanding the dataset.
AB - Traditional dense captioning describes local details of an image in natural language. It typically applies object detection first and then describes the contents of each detected bounding box, which makes the descriptions rich. However, captioning based on object detection often overlooks the associations between objects and their environment, or among the objects themselves. Moreover, no existing dense captioning method can handle irregular regions. To address these problems, we propose a region division method based on visual saliency, which focuses on areas rather than only on objects. Based on this division, local descriptions of irregular regions are generated. For each region, we combine the whole image with the target region to produce features, which are fed into the captioning model. We trained and tested on the Visual Genome dataset. Experiments show that our model is comparable to the baseline under traditional bounding boxes, and the descriptions it generates for irregular regions are equally good. Our model also performs well in image retrieval experiments and exhibits less information redundancy. In application, it supports manually selecting a region of interest on the image for description, assisting in expanding the dataset.
KW - Image dense captioning
KW - image retrieval
KW - irregular region
KW - redundancy
KW - visual saliency
UR - http://www.scopus.com/inward/record.url?scp=85163846819&partnerID=8YFLogxK
U2 - 10.1109/PRMVIA58252.2023.00008
DO - 10.1109/PRMVIA58252.2023.00008
M3 - Conference contribution
AN - SCOPUS:85163846819
T3 - Proceedings - 2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms, PRMVIA 2023
SP - 8
EP - 14
BT - Proceedings - 2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms, PRMVIA 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 International Conference on Pattern Recognition, Machine Vision and Intelligent Algorithms, PRMVIA 2023
Y2 - 24 March 2023 through 26 March 2023
ER -