A Semi-structured Data Classification Model with Integrating Tag Sequence and Ngram

Zhang, Lijun; Li, Ning; Pan, Wei; Li, Zhanhuai

doi:10.1007/978-3-030-73197-7_14

Lijun Zhang ORCID: orcid.org/0000-0002-2306-9823

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12682)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Many collaboratively built resources, such as Wikipedia, Weibo and Quora, exist in the form of semi-structured data, and semi-structured data classification plays an important role in many data analysis applications. In addition to content information, semi-structured data also contain structural information, so combining structure and content features is a crucial issue in semi-structured data classification. In this paper, we propose a supervised semi-structured data classification approach that utilizes both the structural and content information. In this approach, generalized tag sequences are extracted from the structural information, and nGrams are extracted from the content information. The tag sequences and nGrams are then combined into features called TSGram according to their link relation, and each semi-structured document is represented as a vector of TSGram features. A classification model is built on these TSGram features. Because TSGram features retain the association between the structural and content information, they help improve classification performance. Our experimental results on two real datasets show that the proposed approach is effective.
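
The idea of linking tag sequences to nGrams can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define generalized tag sequences or the link relation, so the sketch assumes that the tag sequence of a piece of text is its root-to-element tag path and that an nGram is linked to the path of the element that directly contains it. The names `word_ngrams` and `tsgram_features` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of `text` as tuples of lowercase tokens."""
    tokens = re.findall(r"\w+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tsgram_features(xml_string, n=2):
    """Count (tag path, n-gram) pairs for one XML document.

    Assumption: the tag sequence of a text node is the root-to-element
    tag path; the paper's generalized tag sequences may differ.
    """
    root = ET.fromstring(xml_string)
    features = Counter()

    def visit(element, path):
        path = path + (element.tag,)
        # Only text directly inside this element is linked to this path;
        # tail text after child elements is ignored in this sketch.
        if element.text and element.text.strip():
            for gram in word_ngrams(element.text, n):
                features[("/".join(path), " ".join(gram))] += 1
        for child in element:
            visit(child, path)

    visit(root, ())
    return features


doc = "<article><title>data mining</title><body>data mining methods</body></article>"
features = tsgram_features(doc)
```

In this toy document the bigram "data mining" produces two distinct features, one under `article/title` and one under `article/body`, which is the kind of structure-content association a plain bag of nGrams would lose. The resulting counts could be mapped to a fixed vocabulary to obtain one vector per document.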

This work was supported in part by the National Natural Science Foundation of China under Grant 61972317, Grant 61672432 and Grant 61732014, and in part by the Fundamental Research Funds for the Central Universities of China under Grant 3102015JSJ0004.


Author information

Authors and Affiliations

School of Computer Science, Northwestern Polytechnical University, Xi’an, 710072, China
Lijun Zhang, Ning Li, Wei Pan & Zhanhuai Li
Key Laboratory of Big Data Storage and Management, Northwestern Polytechnical University, Ministry of Industry and Information Technology, Xi’an, 710072, China
Lijun Zhang, Ning Li, Wei Pan & Zhanhuai Li

Corresponding author

Correspondence to Lijun Zhang.

Editor information

Editors and Affiliations

Aalborg University, Aalborg, Denmark
Christian S. Jensen
Singapore Management University, Singapore, Singapore
Ee-Peng Lim
Academia Sinica, Taipei, Taiwan
De-Nian Yang
The Pennsylvania State University, University Park, PA, USA
Wang-Chien Lee
National Chiao Tung University, Hsinchu, Taiwan
Vincent S. Tseng
Athens University of Economics and Business, Athens, Greece
Vana Kalogeraki
National Cheng Kung University, Tainan City, Taiwan
Jen-Wei Huang
National Tsing Hua University, Hsinchu, Taiwan
Chih-Ya Shen

Cite this paper

Zhang, L., Li, N., Pan, W., Li, Z. (2021). A Semi-structured Data Classification Model with Integrating Tag Sequence and Ngram. In: Jensen, C.S., et al. Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_14

DOI: https://doi.org/10.1007/978-3-030-73197-7_14
Published: 06 April 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-73196-0
Online ISBN: 978-3-030-73197-7
eBook Packages: Computer Science, Computer Science (R0)
