BTC-2019: The 2019 Billion Triple Challenge Dataset

Herrera, José-Miguel; Hogan, Aidan; Käfer, Tobias

doi:10.1007/978-3-030-30796-7_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11779)

Included in the following conference series:

International Semantic Web Conference

Abstract

Six datasets have been published under the title of Billion Triple Challenge (BTC) since 2008. Each such dataset contains billions of triples extracted from millions of documents crawled from hundreds of domains. While these datasets were originally motivated by the annual ISWC competition from which they take their name, they would become widely used in other contexts, forming a key resource for a variety of research works concerned with managing and/or analysing diverse, real-world RDF data as found natively on the Web. Given that the last BTC dataset was published in 2014, we prepare and publish a new version, BTC-2019, containing 2.2 billion quads parsed from 2.6 million documents on 394 pay-level domains. This paper first motivates the BTC datasets with a survey of research works using these datasets. Next, we provide details of how the BTC-2019 crawl was configured. We then present and discuss a variety of statistics that aim to give insights into the content of BTC-2019. Finally, we discuss the hosting of the dataset and the ways in which it can be accessed, remixed and used.

Resource DOI: https://doi.org/10.5281/zenodo.2634588

Resource type: Dataset

Notes

1. https://github.com/ldspider/ldspider
2. One exception is the Crawl-delay definition: all websites are configured for a one-second delay, irrespective of the robots.txt file.
3. The script used to run the call, including all arguments passed to LDspider, is available at https://github.com/jotixh/RDFLiteralDefinitions/blob/master/ldspider-runner/bin/crawl.sh
4. https://github.com/jotixh/RDFLiteralDefinitions/blob/master/ldspider-runner/seed.txt
5. A pay-level domain (PLD) is one that must be paid for to be registered; examples would be dbpedia.org, data.gov and bbc.co.uk, but not en.dbpedia.org, news.bbc.co.uk, etc. Datasets often report fully-qualified domain names (FQDNs) instead, which we argue is not good practice since, for example, sub-domains can be used for individual user accounts (as was the case for sites like Livejournal, which had millions of sub-domains: one for each user).
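
The distinction between a PLD and an FQDN in note 5 can be illustrated with a minimal Python sketch. It is not the method used to compute the BTC-2019 statistics: the function name `pay_level_domain` and the set `SAMPLE_PUBLIC_SUFFIXES` are hypothetical, and the suffix set is a tiny hand-picked sample, whereas a real tool would presumably load the full Public Suffix List.

```python
from typing import Optional

# ASSUMPTION: a tiny, hand-picked sample of public suffixes for illustration.
# A real tool would load the complete Public Suffix List instead.
SAMPLE_PUBLIC_SUFFIXES = {"org", "gov", "com", "uk", "co.uk"}


def pay_level_domain(hostname: str) -> Optional[str]:
    """Return the PLD of a hostname: its public suffix plus one more label.

    Returns None if the hostname is itself a public suffix or if no
    suffix from the sample set matches.
    """
    labels = hostname.lower().rstrip(".").split(".")
    if ".".join(labels) in SAMPLE_PUBLIC_SUFFIXES:
        return None  # e.g. "co.uk" cannot be registered by a single party
    # Scan from the longest candidate suffix to the shortest, so that
    # "co.uk" is matched before "uk".
    for i in range(1, len(labels)):
        if ".".join(labels[i:]) in SAMPLE_PUBLIC_SUFFIXES:
            return ".".join(labels[i - 1:])
    return None


if __name__ == "__main__":
    hosts = ["en.dbpedia.org", "dbpedia.org", "news.bbc.co.uk", "data.gov"]
    # Four FQDNs collapse to three PLDs.
    print(sorted({pay_level_domain(h) for h in hosts}))
```

Grouping by PLD in this way means that a site with one sub-domain per user counts as a single domain rather than as millions.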

Acknowledgements

This work was supported by Fondecyt Grant No. 1181896 and by the Millennium Institute for Foundational Research on Data (IMFD).

Author information

Authors and Affiliations

IMFD; DCC, Universidad de Chile, Santiago, Chile
José-Miguel Herrera & Aidan Hogan
Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Tobias Käfer

Corresponding author

Correspondence to Aidan Hogan.

Editor information

Editors and Affiliations

Fondazione Bruno Kessler, Trento, Italy
Chiara Ghidini
Linköping University, Linköping, Sweden
Olaf Hartig
University of Bonn, Bonn, Germany
Maria Maleshkova
University of Economics Prague, Prague, Czech Republic
Vojtěch Svátek
University of Illinois at Chicago, Chicago, IL, USA
Isabel Cruz
University of Chile, Santiago, Chile
Aidan Hogan
Memect Technology, Beijing, China
Jie Song
Mines Saint-Etienne, Saint-Etienne, France
Maxime Lefrançois
Inria Sophia Antipolis - Méditerranée, Sophia Antipolis, France
Fabien Gandon

About this paper

Cite this paper

Herrera, JM., Hogan, A., Käfer, T. (2019). BTC-2019: The 2019 Billion Triple Challenge Dataset. In: Ghidini, C., et al. The Semantic Web – ISWC 2019. ISWC 2019. Lecture Notes in Computer Science, vol 11779. Springer, Cham. https://doi.org/10.1007/978-3-030-30796-7_11

DOI: https://doi.org/10.1007/978-3-030-30796-7_11
Published: 17 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-30795-0
Online ISBN: 978-3-030-30796-7
eBook Packages: Computer Science, Computer Science (R0)
