Vision-To-Voice: An Intelligent Caption Generation and An Avatar-Guided Assistive System for The Visually Challenged
-
https://doi.org/10.14419/ndnxct46
Received date: October 12, 2025
Accepted date: November 9, 2025
Published date: November 19, 2025
-
Keywords: Vision-To-Voice; Image Captioning; Dual-Stream Deep Feature Extraction; Pre-Trained ResNet50; Hybrid Convolutional Neural Network; 2D Avatar; Graphical User Interface.
Abstract
Access to visual information is critical to independent living, yet visually impaired individuals continue to face obstacles in perceiving and understanding their surroundings. Conventional assistive technologies such as screen readers and object detectors generally lack semantic depth, contextual awareness, and interactivity, which makes them inadequate for real-world use. To address these shortcomings, an intelligent and interactive assistive system, Vision-to-Voice, is designed to transform static visual information into meaningful verbal descriptions through deep learning and real-time avatar-guided narration. The proposed system presents a new end-to-end image captioning architecture that incorporates improved preprocessing, a dual-stream deep feature extraction pipeline, and a context-aware caption generation model. During preprocessing, images are normalized and denoised to enhance feature clarity. Feature extraction is performed by a hybrid architecture that fuses global representations from a pre-trained ResNet50 stream with local representations from a custom-trained convolutional stream. A curated dataset of 1,600 visually diverse images and 8,000 corresponding human-written captions is employed, with 80% reserved for training and 20% for testing. Five descriptions are assigned to each image, promoting semantic diversity during training. The captioning model learns from several contextual cues, enabling it to generate rich, human-like captions. System performance is measured quantitatively in terms of accuracy, precision, recall, and F1-score, all of which show significant improvements over traditional single-stream or template-based approaches.
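To make the preprocessing and dual-stream stages concrete, the sketch below shows one plausible realization in TensorFlow/Keras with OpenCV; the paper does not name its framework, and the layer widths, denoising kernel, and concatenation-based fusion are illustrative assumptions rather than the authors' exact configuration. A frozen ImageNet ResNet50 supplies global features while a small custom-trained convolutional stack supplies local detail, and the two are fused into one image embedding for the caption decoder.

```python
# Hedged sketch of preprocessing + dual-stream feature extraction.
# Assumptions: TensorFlow/Keras + OpenCV; illustrative layer sizes.
import cv2
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import ResNet50, resnet50

def preprocess(path, img_size=224):
    """Resize and lightly denoise an image; stands in for the paper's
    normalization/denoising stage."""
    img = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (img_size, img_size))
    img = cv2.GaussianBlur(img, (3, 3), 0)   # simple denoising
    return img.astype(np.float32)            # [0, 255] floats

def build_dual_stream_extractor(img_size=224):
    inputs = layers.Input(shape=(img_size, img_size, 3))

    # Stream 1: pre-trained ResNet50, frozen, with its own ImageNet
    # input preprocessing, global-average-pooled to a 2048-d vector.
    backbone = ResNet50(include_top=False, weights="imagenet", pooling="avg")
    backbone.trainable = False
    global_feats = backbone(layers.Lambda(resnet50.preprocess_input)(inputs))

    # Stream 2: custom convolutional stack trained from scratch on
    # [0, 1]-rescaled pixels, capturing local detail.
    x = layers.Rescaling(1.0 / 255)(inputs)
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D()(x)
    local_feats = layers.GlobalAveragePooling2D()(x)

    # Fuse global and local representations into one image embedding
    # that the caption decoder conditions on.
    fused = layers.Concatenate()([global_feats, local_feats])
    embedding = layers.Dense(256, activation="relu")(fused)
    return Model(inputs, embedding, name="dual_stream_extractor")

extractor = build_dual_stream_extractor()
features = extractor(preprocess("example.jpg")[None, ...])  # (1, 256)
```

Concatenation is the simplest fusion choice; attention-weighted or gated fusion would be natural alternatives, and the 80/20 train/test split described above is applied at the image level so that all five captions of an image fall in the same partition.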
To facilitate real-time use, the trained model is embedded in a graphical user interface (GUI) designed for simple, intuitive navigation. The interface supports image loading, caption generation, and animated speech narration. A 2D avatar is synchronized with the synthesized speech, visually animating the narration so that audio and visuals remain coherent throughout the utterance. Captions are displayed in uppercase characters for improved readability. This dynamic, multimodal feedback loop creates a more inclusive and interactive experience for visually impaired users. The system not only generates captions with higher accuracy but also offers a practical and compassionate assistive solution by pairing cutting-edge vision-language modeling with human-centric design. User-oriented evaluations and test results confirm the framework's viability for real-world accessibility use cases, paving the way for future advances in assistive AI.
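As a minimal sketch of how such a GUI could wire together image loading, uppercase caption display, speech narration, and an avatar animation loop, the example below uses Tkinter and the offline pyttsx3 TTS engine; the paper does not specify its toolkit or synthesizer, the captioning model is stubbed out, and the two-frame text "avatar" merely approximates the synchronized 2D avatar described above.

```python
# Hedged GUI sketch: Tkinter + pyttsx3 stand in for the paper's
# (unspecified) GUI toolkit and speech synthesizer.
import threading
import tkinter as tk
from tkinter import filedialog

import pyttsx3

def generate_caption(image_path):
    # Stand-in for the trained dual-stream captioning model.
    return "a person walking a dog in the park"

def narrate(text):
    done = threading.Event()

    def speak():
        engine = pyttsx3.init()
        engine.say(text)
        engine.runAndWait()       # blocks until speech finishes
        done.set()

    threading.Thread(target=speak, daemon=True).start()

    frames = ["(avatar) -o-", "(avatar) -O-"]   # crude two-frame mouth flap
    def animate(i=0):
        if done.is_set():
            avatar.config(text="(avatar idle)")
        else:
            avatar.config(text=frames[i % 2])
            root.after(200, animate, i + 1)     # re-poll on the Tk loop
    animate()

def on_load():
    path = filedialog.askopenfilename()
    if not path:
        return
    caption = generate_caption(path)
    caption_var.set(caption.upper())            # uppercase for readability
    narrate(caption)

root = tk.Tk()
root.title("Vision-to-Voice")
tk.Button(root, text="Load Image", command=on_load).pack(pady=8)
caption_var = tk.StringVar(value="LOAD AN IMAGE TO BEGIN")
tk.Label(root, textvariable=caption_var, font=("Helvetica", 14)).pack(pady=8)
avatar = tk.Label(root, text="(avatar idle)", font=("Helvetica", 14))
avatar.pack(pady=8)
root.mainloop()
```

Running speech on a worker thread keeps the interface responsive, while all widget updates are scheduled back onto the Tk event loop via `after`; a production avatar would instead drive mouth frames from the synthesizer's per-word or viseme timing callbacks.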
-
How to Cite
Karthika, D., & Balamurugan, D. S. P. (2025). Vision-To-Voice: An Intelligent Caption Generation and An Avatar-Guided Assistive System for The Visually Challenged. International Journal of Basic and Applied Sciences, 14(7), 436-450. https://doi.org/10.14419/ndnxct46
