RDFSpark: a new solution for querying massive RDF data using Spark


  • Mouad Banane, Hassan 2 University
  • Abdessamad Belangour, Hassan 2 University






RDF, Spark, SPARQL, MapReduce, Big Data.


The spread of semantic data and the rapid growth of RDF datasets have brought significant new challenges for querying RDF data. In recent years, MapReduce has solved problems at scales that were unimaginable a few years earlier, but like any other tool it has limits. Several research works, such as H2RDF, propose querying large volumes of RDF data with MapReduce. Apache Spark is an open source distributed computing framework known for its speed which, like MapReduce, makes Big Data processing far easier; it is more powerful and robust than MapReduce and is reported to run up to 100 times faster. In this paper, we present RDFSpark, a new distributed RDF query management system based on Spark that targets scalability, fault tolerance, and performance in order to address the low efficiency of RDF data querying. In this approach, a SPARQL query is first analyzed by the parser and compiled, then passed through a series of optimization techniques, and finally translated into a Spark program. Experiments conducted on large volumes of RDF data show that RDFSpark achieves high query performance.
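
The parse, optimize, and translate steps described above can be illustrated with a minimal sketch. It is written in plain Python rather than Spark and is not RDFSpark's actual implementation: the function names, the toy triples, the tiny SPARQL subset, and the "fewest variables first" ordering heuristic are all assumptions made for illustration. On Spark, the per-pattern selection and the join on shared variables would typically correspond to distributed filter and join operations.

```python
import re

# Hypothetical toy dataset of (subject, predicate, object) triples.
TRIPLES = [
    ("alice", "type", "Student"),
    ("bob", "type", "Student"),
    ("alice", "takesCourse", "course1"),
    ("bob", "takesCourse", "course2"),
    ("carol", "type", "Professor"),
]

QUERY = "SELECT ?s ?c WHERE { ?s type Student . ?s takesCourse ?c }"


def parse(query):
    """Parse a very small SPARQL subset into (variables, triple patterns)."""
    variables = re.search(r"SELECT\s+(.*?)\s+WHERE", query, re.S | re.I).group(1).split()
    body = re.search(r"\{(.*)\}", query, re.S).group(1)
    patterns = [tuple(part.split()) for part in body.split(".") if part.strip()]
    return variables, patterns


def optimize(patterns):
    """Assumed heuristic: evaluate patterns with fewer variables first."""
    return sorted(patterns, key=lambda p: sum(term.startswith("?") for term in p))


def match_pattern(pattern, triples):
    """Selection step: return one variable binding per matching triple."""
    bindings = []
    for triple in triples:
        row = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated inside one pattern must bind consistently.
                if row.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            bindings.append(row)
    return bindings


def join(left, right):
    """Join step: merge bindings that agree on all shared variables."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[k] == r[k] for k in l.keys() & r.keys())
    ]


def evaluate(query, triples):
    """Run the parse -> optimize -> execute pipeline on in-memory triples."""
    variables, patterns = parse(query)
    rows = [{}]
    for pattern in optimize(patterns):
        rows = join(rows, match_pattern(pattern, triples))
    return sorted(tuple(row[v] for v in variables) for row in rows)


if __name__ == "__main__":
    print(evaluate(QUERY, TRIPLES))
```

The nested-loop join keeps the sketch short; a distributed engine would instead partition the bindings by the shared variable so that matching rows meet on the same worker.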



