A unified deep learning model for image captioning and text-to-image synthesis
-
https://doi.org/10.14419/m9njft54
Received date: March 21, 2025
Accepted date: April 27, 2025
Published date: May 19, 2025
-
Keywords: CNN-LSTM; Image Captioning; Stable Diffusion; Text-to-Image
Abstract
Deep learning models have significantly advanced artificial intelligence tasks such as text-to-image generation and image captioning. However, a semantic gap remains between textual descriptions and visual representations, which limits the accuracy and coherence of generated images and captions. This paper proposes a novel deep learning model that integrates Stable Diffusion, Convolutional Neural Networks (CNNs), and Long Short-Term Memory (LSTM) networks to enhance both text-to-image generation and image captioning. The model employs a CNN-LSTM architecture for feature extraction, while Stable Diffusion iteratively refines the output to improve coherence and realism. The approach generates diverse, high-quality images from text inputs and produces accurate captions for images. Experimental evaluations demonstrate the effectiveness of the approach: the image captioning model achieved a BLEU score of 0.89, indicating high captioning accuracy, and the text-to-image results exhibit substantial improvements in visual realism and semantic alignment. The proposed model offers a robust framework for multimodal AI applications, advancing both content synthesis and understanding in multimedia tasks. These findings underscore the potential of deep learning in bridging the gap between textual and visual modalities, contributing to more effective and versatile AI-driven multimedia solutions.
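As context for the reported BLEU score of 0.89, the sketch below shows how sentence-level BLEU is typically computed: the geometric mean of clipped n-gram precisions, multiplied by a brevity penalty. This is a generic illustration of the metric, not the paper's evaluation code, and the tokenized sentences are hypothetical placeholders rather than examples from the paper's dataset.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram
    precisions (n = 1..max_n), scaled by a brevity penalty."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum((cand & ref).values())
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)

    if min(precisions) == 0:
        return 0.0  # any zero precision drives the geometric mean to zero

    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    if len(candidate) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * geo_mean

# Hypothetical caption pair for illustration.
ref = "a dog runs across the grassy field".split()
cand = "a dog runs across the field".split()
score = bleu(cand, ref)
```

A perfect match yields a score of 1.0, while missing or reordered words lower the clipped n-gram precisions and hence the score, which is why BLEU close to 0.89 indicates strong overlap with reference captions.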
-
How to Cite
Mathew, R. M., Lal, S. P., & Ganesh, G. (2025). A unified deep learning model for image captioning and text-to-image synthesis. International Journal of Basic and Applied Sciences, 14(1), 334-338. https://doi.org/10.14419/m9njft54
