Abstract
To address the challenges of enhancing multimodal knowledge graph retrieval and improving embodied intelligence in dynamic environments, we propose the Multimodal Knowledge Graph Vision-Language System (MKGVL). The system enhances the reasoning and feedback capabilities of embodied intelligent systems by continuously updating the knowledge graph with real-time feedback from a Vision-Language Model (VLM). By integrating visual encoders, language models, and knowledge graph networks, MKGVL constructs a unified multimodal representation that enables adaptive decision-making and responsiveness to environmental changes. Experimental results show that MKGVL outperforms existing models on fine-grained retrieval tasks, achieving an 11.5% improvement in Rank-1 accuracy and a mean Average Precision (mAP) of 97.49%. Further evaluations on datasets such as ARKitScenes, MultiScan, and 3RScan highlight the model's robustness and adaptability. In addition, deployment on embedded platforms such as the Jetson Orin demonstrates its efficiency in real-time multimodal tasks, particularly in resource-constrained environments. These findings underscore MKGVL's ability to deliver accurate and efficient multimodal processing, making it a strong solution for adaptive knowledge-based systems in complex settings.
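The abstract does not include implementation details, so the following is only a minimal, illustrative sketch of the kind of pipeline it describes: projecting visual, language, and knowledge-graph embeddings into a shared space to form a unified multimodal representation, and updating observed graph nodes from VLM feedback. All names, dimensions, and the update rule (`UnifiedMultimodalFusion`, `update_kg_nodes`, the moving-average rate, etc.) are assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: module names, dimensions, and update rules are
# assumptions, not the published MKGVL implementation.
import torch
import torch.nn as nn


class UnifiedMultimodalFusion(nn.Module):
    """Projects vision, language, and knowledge-graph embeddings into one
    shared space and fuses them into a single multimodal representation."""

    def __init__(self, vis_dim=768, txt_dim=512, kg_dim=256, fused_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, fused_dim)
        self.txt_proj = nn.Linear(txt_dim, fused_dim)
        self.kg_proj = nn.Linear(kg_dim, fused_dim)
        self.fuse = nn.Sequential(
            nn.Linear(3 * fused_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )

    def forward(self, vis_emb, txt_emb, kg_emb):
        # Concatenate the three projected modalities, then fuse with an MLP.
        z = torch.cat(
            [self.vis_proj(vis_emb), self.txt_proj(txt_emb), self.kg_proj(kg_emb)],
            dim=-1,
        )
        return self.fuse(z)


def update_kg_nodes(node_embeddings, node_ids, vlm_feedback, rate=0.1):
    """Toy feedback rule: nudge the embeddings of observed graph nodes toward
    the current VLM-derived observation via an exponential moving average."""
    node_embeddings = node_embeddings.clone()
    node_embeddings[node_ids] = (
        (1 - rate) * node_embeddings[node_ids] + rate * vlm_feedback
    )
    return node_embeddings


if __name__ == "__main__":
    fusion = UnifiedMultimodalFusion()
    vis = torch.randn(1, 768)   # e.g. visual-encoder output for the current frame
    txt = torch.randn(1, 512)   # e.g. language-model embedding of an instruction
    kg = torch.randn(1, 256)    # e.g. pooled embedding of relevant KG nodes
    unified = fusion(vis, txt, kg)
    print(unified.shape)        # torch.Size([1, 512])

    nodes = torch.randn(100, 512)          # 100 knowledge-graph node embeddings
    observed = torch.tensor([3, 17, 42])   # nodes observed in the current scene
    nodes = update_kg_nodes(nodes, observed, unified.squeeze(0))
```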
Original language | English |
---|---|
Title of host publication | 2024 IEEE International Conference on Robotics and Biomimetics, ROBIO 2024 |
Publisher | Institute of Electrical and Electronics Engineers Inc. |
Pages | 1271-1275 |
Number of pages | 5 |
Edition | 2024 |
ISBN (Electronic) | 9781665481090 |
DOIs | |
Publication status | Published - 2024 |
Event | 2024 IEEE International Conference on Robotics and Biomimetics, ROBIO 2024 - Bangkok, Thailand
Duration | 10 Dec 2024 → 14 Dec 2024
Conference
Conference | 2024 IEEE International Conference on Robotics and Biomimetics, ROBIO 2024 |
---|---|
Country/Territory | Thailand |
City | Bangkok |
Period | 10/12/24 → 14/12/24 |