A unified deep learning model for image captioning and text-to-image synthesis

  • Authors

    • Rose Mary Mathew, Assistant Professor, Department of Computer Applications, Federal Institute of Science and Technology, Angamaly, Kerala, India https://orcid.org/0000-0003-0555-4873
    • Sujesh P Lal, Assistant Professor, Department of Computer Applications, Federal Institute of Science and Technology, Angamaly, Kerala, India https://orcid.org/0000-0003-0810-7913
    • Gowri Ganesh, PG Student, Department of Computer Applications, Federal Institute of Science and Technology, Angamaly, Kerala, India
    https://doi.org/10.14419/m9njft54

    Received date: March 21, 2025

    Accepted date: April 27, 2025

    Published date: May 19, 2025

  • Keywords: CNN-LSTM; Image Captioning; Stable Diffusion; Text-to-Image
  • Abstract

    Deep learning models have significantly advanced various artificial intelligence tasks, including text-to-image generation and image captioning. However, there remains a semantic gap between textual descriptions and visual representations, which affects the accuracy and coherence of generated images and captions. This paper proposes a novel deep learning model that integrates Stable Diffusion, Convolutional Neural Networks (CNN), and Long Short-Term Memory (LSTM) networks to enhance both text-to-image generation and image captioning tasks. The model employs a CNN-LSTM architecture for feature extraction, while Stable Diffusion refines the output iteratively to improve coherence and realism. The approach generates diverse, high-quality images from text inputs and produces accurate captions for images. Experimental evaluations demonstrate the effectiveness of the approach: the image captioning model achieved a BLEU score of 0.89, highlighting its high accuracy, and the text-to-image generation results exhibit substantial improvements in visual realism and semantic alignment. The proposed model offers a robust framework for multimodal AI applications, advancing both content synthesis and understanding in multimedia tasks. These findings underscore the potential of deep learning in bridging the gap between textual and visual modalities, contributing to more effective and versatile AI-driven multimedia solutions.
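
    The abstract describes a CNN-LSTM captioning pipeline without giving code; the sketch below shows the general pattern in Keras, where a pretrained CNN supplies an image feature vector and an LSTM decodes the caption one word at a time. All hyperparameters (vocab_size, max_len, embed_dim) and the 2048-dimensional feature input are illustrative assumptions, not the authors' published configuration.

    # Minimal CNN-LSTM captioning sketch (illustrative; all sizes are
    # assumed, not taken from the paper).
    from tensorflow.keras import layers, Model

    vocab_size, max_len, embed_dim = 8000, 34, 256  # assumed hyperparameters

    # Image branch: a pooled feature vector from a pretrained CNN
    # (e.g., a 2048-d global-pooling output).
    image_input = layers.Input(shape=(2048,))
    img_dense = layers.Dense(embed_dim, activation="relu")(
        layers.Dropout(0.5)(image_input))

    # Text branch: embed the partial caption and encode it with an LSTM.
    caption_input = layers.Input(shape=(max_len,))
    emb = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(caption_input)
    lstm_out = layers.LSTM(embed_dim)(layers.Dropout(0.5)(emb))

    # Fuse both modalities and predict the next word of the caption.
    merged = layers.add([img_dense, lstm_out])
    hidden = layers.Dense(embed_dim, activation="relu")(merged)
    next_word = layers.Dense(vocab_size, activation="softmax")(hidden)

    model = Model(inputs=[image_input, caption_input], outputs=next_word)
    model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")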
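
    On the generation side, the model relies on Stable Diffusion's iterative denoising. One standard way to run a Stable Diffusion checkpoint is the Hugging Face diffusers library, sketched below; the checkpoint name, step count, and guidance scale are assumptions rather than the paper's reported settings.

    # Illustrative text-to-image generation with Hugging Face diffusers;
    # the checkpoint and sampler settings are assumptions, not the paper's.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # assumed checkpoint
        torch_dtype=torch.float16,
    ).to("cuda")

    prompt = "a dog running across a grassy field at sunset"
    # More denoising steps trade speed for fidelity; guidance_scale controls
    # how strongly the image follows the prompt.
    image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
    image.save("generated.png")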
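
    The reported BLEU score of 0.89 is a corpus-level n-gram overlap metric; the snippet below shows how such a score is typically computed with NLTK. The reference and candidate captions here are invented purely to illustrate the call.

    # Corpus-level BLEU with NLTK (captions invented for illustration).
    from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

    # One list of reference captions per image; each caption is tokenized.
    references = [[["a", "dog", "runs", "on", "the", "grass"]]]
    hypotheses = [["a", "dog", "is", "running", "on", "grass"]]  # model output

    smooth = SmoothingFunction().method1  # avoids zero scores on short texts
    score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
    print(f"BLEU: {score:.2f}")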

  • References

    1. Papa, L., Faiella, L., Corvitto, L., Maiano, L., & Amerini, I. (2023). On the use of stable diffusion for creating realistic faces: From generation to detection. Proceedings of the 2023 IEEE International Workshop on Biometrics and Forensics (IWBF), 1–6. https://doi.org/10.1109/IWBF57495.2023.10156981.
    2. Dhariwal, P., & Nichol, A. (2021). Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 8780–8794.
    3. Wu, Q., Liu, Y., Zhao, H., Kale, A., Bui, T., Yu, T., Lin, Z., Zhang, Y., & Chang, S. (2023). Uncovering the disentanglement capability in text-to-image diffusion models. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
    4. Sun, G., Liang, W., Dong, J., Li, J., Ding, Z., & Cong, Y. (2024). Create your world: Lifelong text-to-image diffusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46, 6454–6470. https://doi.org/10.1109/TPAMI.2024.3382753.
    5. Sairam, G., Mandha, M., Prashanth, P., & Swetha, P. (2022). Image captioning using CNN and LSTM. Proceedings of the 4th Smart Cities Symposium (SCS 2021). https://doi.org/10.1049/icp.2022.0356.
    6. Amritkar, C., & Jabade, V. (2018). Image caption generation using deep learning technique. Proceedings of the 2018 4th International Conference on Computing Communication Control and Automation (ICCUBEA), 1–4. https://doi.org/10.1109/ICCUBEA.2018.8697360.
    7. Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021). Learning transferable visual models from natural language supervision. Proceedings of the 38th International Conference on Machine Learning (ICML).
    8. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125.
    9. Rohitharun, S., Reddy, L. U. K., & Sujana, S. (2022). Image captioning using CNN and RNN. Proceedings of the 2022 2nd Asian Conference on Innovation in Technology (ASIANCON). https://doi.org/10.1109/ASIANCON55314.2022.9909146.
    10. Gorkar, T., Kale, V., Jagdale, V., Tarte, Y., & Battalwar, S. (2023). Image caption generator using deep learning. International Research Journal of Modernization in Engineering Technology and Science, 5(12).
    11. Han, S.-H., & Choi, H.-J. (2020). Domain-specific image caption generator with semantic ontology. Proceedings of the 2020 IEEE International Conference on Big Data and Smart Computing (BigComp), 526–530. https://doi.org/10.1109/BigComp48618.2020.00-12.
    12. Napa, K. K., Dhamodaran, V., Mohan, A., Laxman, K., & Yuvaraj, J. (2019). Detection and recognition of objects in image caption generator system: A deep learning approach. Proceedings of the 2019 International Conference on Advanced Computing and Communication Systems (ICACCS). https://doi.org/10.1109/ICACCS.2019.8728516.
    13. Yang, Z., Liu, Q., & Liu, G. (2020). Better understanding: Stylized image captioning with style attention and adversarial training. Symmetry, 12(12), 1978. https://doi.org/10.3390/sym12121978.
    14. Touvron, H., Cord, M., Sablayrolles, A., Synnaeve, G., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. Proceedings of the 38th International Conference on Machine Learning (ICML), 10347–10357.
    15. Syed, U., & Subbarao, M. (2024). CaptionCraft: VGG with LSTM for image insights. Proceedings of the IEEE International Conference on Emerging Technologies (ICET), 1–5. https://doi.org/10.1109/ICTEST60614.2024.10576172.
    16. Sharma, H., & Padha, D. (2024). Neuraltalk+: Neural image captioning with visual assistance capabilities. Multimedia Tools and Applications, 1–29. https://doi.org/10.1007/s11042-024-19259-9.
    17. Bansal, P., Malik, K., Kumar, S., & Singh, C. (2023). EfficientNet-based image captioning system. Proceedings of the 2023 International Conference on Device Intelligence, Computing and Communication Technologies (DICCT). https://doi.org/10.1109/DICCT56244.2023.10110117.
    18. Latimier, A., Peyre, H., & Ramus, F. (2020). A meta-analytic review of the benefit of spacing out retrieval practice episodes on retention. PsyArXiv preprint. https://doi.org/10.31234/osf.io/kzy7u.
    19. Ignatious, L. A. A., Jeevitha, S., M. M., & M. H. (2019). A semantic-driven CNN-LSTM architecture for personalised image caption generation. Proceedings of the 11th International Conference on Advanced Computing (ICoAC), 356–362. https://doi.org/10.1109/ICoAC48765.2019.246867.

  • How to Cite

    Mathew, R. M., Lal, S. P., & Ganesh, G. (2025). A unified deep learning model for image captioning and text-to-image synthesis. International Journal of Basic and Applied Sciences, 14(1), 334–338. https://doi.org/10.14419/m9njft54