A MapReduce-Based Approach for Prefix-Based Labeling of Large XML Data

Ahn, Jinhyun; Im, Dong-Hyuk; Kim, Hong-Gee

doi:10.1007/978-3-319-50112-3_7

Jinhyun Ahn, Dong-Hyuk Im & Hong-Gee Kim

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10055)

Included in the following conference series:

Joint International Semantic Technology Conference

Abstract

A massive amount of XML (Extensible Markup Language) data, which can be viewed as tree data, is available on the web. One of the fundamental building blocks of information retrieval from tree data is answering structural queries. Various labeling schemes have been suggested for rapid structural query processing. We focus on the prefix-based labeling scheme, which labels each node with the concatenation of its parent’s label and its child order. This scheme has also been adopted in RDF (Resource Description Framework) data management systems that index RDF data as trees by grouping subjects. Recently, a MapReduce-based algorithm for the prefix-based labeling scheme was suggested. We observe that this algorithm fails to keep label size minimized, which makes the prefix-based labeling scheme difficult to apply to massive real-world XML datasets. To address this issue, we propose a MapReduce-based algorithm for prefix-based labeling of XML data that reduces label size by adjusting the order of label assignments based on the structural information of the XML data. Experiments with real-world XML datasets show that the proposed approach is more effective than previous approaches.
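
To make the prefix-based labeling scheme concrete, the following is a minimal, single-machine sketch in Python. It is not the MapReduce algorithm proposed in the paper; it only illustrates how a node's label is formed from its parent's label and its child order, and how an ancestor-descendant query reduces to a prefix test. The function names, the root label "1", and the "." separator are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def prefix_labels(xml_text, sep="."):
    """Return (label, tag) pairs in document order.

    Each node's label is its parent's label followed by its 1-based
    position among its siblings. The root label "1" is an assumption.
    """
    root = ET.fromstring(xml_text)
    labeled = []
    stack = [(root, "1")]
    while stack:
        node, label = stack.pop()
        labeled.append((label, node.tag))
        children = list(node)
        # Push children in reverse so they are popped in document order.
        for order in range(len(children), 0, -1):
            stack.append((children[order - 1], f"{label}{sep}{order}"))
    return labeled


def is_ancestor(a, b, sep="."):
    """True if the node labeled `a` is a proper ancestor of the node labeled `b`."""
    # Appending the separator prevents "1.1" from matching "1.10".
    return b.startswith(a + sep)


doc = "<a><b><d/><e/></b><c/></a>"
labels = prefix_labels(doc)
for label, tag in labels:
    print(label, tag)
```

Because every label embeds the labels of all ancestors, labels grow with tree depth and with the magnitude of the child-order components, which is why the order in which labels are assigned affects overall label size.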

Acknowledgement

This work was supported by Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0101-16-0054, WiseKB: Big data based self-evolving knowledge base and reasoning platform) and Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science, ICT & Future Planning (NRF-2014R1A1A1002236).

Author information

Authors and Affiliations

Biomedical Knowledge Engineering Laboratory and Dental Research Institute, Seoul National University, Seoul, South Korea
Jinhyun Ahn & Hong-Gee Kim
Department of Computer and Information Engineering, Hoseo University, Cheonan, South Korea
Dong-Hyuk Im
Institute of Human-Environment Interface Biology, Seoul National University, Seoul, South Korea
Hong-Gee Kim

Corresponding author

Correspondence to Hong-Gee Kim.

Editor information

Editors and Affiliations

Information Technology, Monash University, Melbourne, Victoria, Australia
Yuan-Fang Li
Computer Science and Technology, Nanjing University, Nanjing, China
Wei Hu
Computer Science, National University of Singapore, Singapore, Singapore
Jin Song Dong
University of Huddersfield, Huddersfield, United Kingdom
Grigoris Antoniou
Information and Communication Technology, Griffith University, Brisbane, Queensland, Australia
Zhe Wang
ISTD, Singapore University of Technology and Design, Singapore, Singapore
Jun Sun
Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
Yang Liu

Cite this paper

Ahn, J., Im, DH., Kim, HG. (2016). A MapReduce-Based Approach for Prefix-Based Labeling of Large XML Data. In: Li, YF., et al. Semantic Technology. JIST 2016. Lecture Notes in Computer Science, vol 10055. Springer, Cham. https://doi.org/10.1007/978-3-319-50112-3_7

DOI: https://doi.org/10.1007/978-3-319-50112-3_7
Published: 27 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-50111-6
Online ISBN: 978-3-319-50112-3
eBook Packages: Computer Science, Computer Science (R0)
