Abstract
To address transparency and trust concerns in autonomous driving systems, which are often criticized as black-box models, we propose RAG-SafeAdapt, a retrieval-augmented vision-language model designed for complex traffic scenarios. This end-to-end framework integrates the CARLA simulator with Retrieval-Augmented Generation (RAG) and multimodal knowledge bases. The system combines visual and language inputs to generate game-theory-based safety recommendations grounded in Responsibility-Sensitive Safety (RSS) principles and Vision-Language Models (VLMs). RAG-SafeAdapt enhances safety analysis and interpretable decision-making and is deployable on embedded platforms such as NVIDIA Orin. Evaluation on the Berkeley DeepDrive eXplanation (BDD-X) and NuScenes-QA datasets demonstrates improved decision explainability and generalization across diverse driving scenarios. Experimental results show strong zero-shot generalization, enhanced transparency, and a reduced collision rate of 0.35%. The framework addresses key challenges in navigation clarity, sensor precision, and adaptability while fostering trust in autonomous driving through optimized real-time safety and human-readable explanations.
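The abstract's safety recommendations build on Responsibility-Sensitive Safety (RSS). As a point of reference, the standard RSS minimum safe longitudinal following distance can be sketched as below; this is a minimal illustration of the published RSS formula, not the paper's actual implementation, and the default response-time and acceleration parameters are illustrative assumptions.

```python
def rss_safe_longitudinal_distance(
    v_rear: float,              # rear (ego) vehicle speed, m/s
    v_front: float,             # front vehicle speed, m/s
    rho: float = 1.0,           # assumed response time of the rear vehicle, s
    a_max_accel: float = 3.0,   # assumed max acceleration during response, m/s^2
    a_min_brake: float = 4.0,   # assumed minimum braking the rear vehicle applies, m/s^2
    a_max_brake: float = 8.0,   # assumed maximum braking the front vehicle may apply, m/s^2
) -> float:
    """Standard RSS minimum safe longitudinal distance:

    d_min = [ v_r*rho + 0.5*a_max_accel*rho^2
              + (v_r + rho*a_max_accel)^2 / (2*a_min_brake)
              - v_f^2 / (2*a_max_brake) ]_+
    """
    # Speed the rear vehicle may reach by the end of its response time.
    v_rear_after_response = v_rear + rho * a_max_accel
    d = (
        v_rear * rho                                    # distance covered during response
        + 0.5 * a_max_accel * rho ** 2                  # extra distance from accelerating
        + v_rear_after_response ** 2 / (2 * a_min_brake)  # rear braking distance
        - v_front ** 2 / (2 * a_max_brake)              # front vehicle's braking distance
    )
    return max(d, 0.0)  # the [x]_+ clipping: distance is never negative
```

For example, a stationary rear vehicle behind a fast-moving front vehicle needs no gap (the formula clips to zero), while two vehicles at equal speed require a positive gap that grows with speed.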
| Original language | English |
|---|---|
| Pages (from-to) | 36-51 |
| Number of pages | 16 |
| Journal | Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering |
| Volume | 240 |
| Issue number | 1 |
| DOIs | |
| Publication status | Published - Jan 2026 |
| Externally published | Yes |
Keywords
- retrieval-augmented generation
- autonomous driving
- explainability
- responsibility-sensitive safety
- sensor fusion
- vision-language models