A Semi-structured Data Classification Model with Integrating Tag Sequence and Ngram

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12682)

Abstract

Many collaboratively built resources, such as Wikipedia, Weibo and Quora, exist in the form of semi-structured data, and classifying such data plays an important role in many data analysis applications. In addition to content information, semi-structured data also contain structural information, so combining structure and content features is a crucial issue in semi-structured data classification. In this paper, we propose a supervised classification approach for semi-structured data that uses both kinds of information. Generalized tag sequences are extracted from the structural information, and nGrams are extracted from the content information. The tag sequences and nGrams are then combined, according to their link relation, into features called TSGram, and each semi-structured document is represented as a vector of TSGram features. A classification model is built on these features. Because TSGram features retain the association between the structural and content information, they help improve classification performance. Experimental results on two real datasets show that the proposed approach is effective.
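
To make the idea of pairing a tag sequence with an nGram concrete, the following is a minimal sketch and not the authors' implementation. It assumes XML input, treats the root-to-element tag path as a stand-in for a "tag sequence" (the abstract does not define how tag sequences are generalized), uses whitespace-tokenized word nGrams, and takes "link relation" to mean that the nGram occurs in the text directly under the last tag of the path. The names `word_ngrams` and `tsgram_features` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def word_ngrams(text, n):
    """Return the word n-grams of a whitespace-tokenized, lowercased text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tsgram_features(xml_string, max_n=2):
    """Count (tag path, n-gram) pairs for one document.

    Assumption: the tag path from the root to an element plays the role
    of a tag sequence, and it is linked to every n-gram found in that
    element's leading text. Tail text after child elements is ignored
    to keep the sketch short.
    """
    root = ET.fromstring(xml_string)
    features = Counter()

    def walk(element, path):
        path = path + (element.tag,)
        text = (element.text or "").strip()
        if text:
            tag_sequence = "/".join(path)
            for n in range(1, max_n + 1):
                for gram in word_ngrams(text, n):
                    features[(tag_sequence, gram)] += 1
        for child in element:
            walk(child, path)

    walk(root, ())
    return features


doc = (
    "<article><title>data mining</title>"
    "<body><p>data mining methods</p></body></article>"
)
features = tsgram_features(doc)
for (tag_sequence, gram), count in sorted(features.items()):
    print(tag_sequence, "|", gram, "|", count)
```

In this toy document the nGram "data mining" appears under both `article/title` and `article/body/p`, and the sketch keeps the two occurrences as separate features. That separation is what a plain bag of nGrams would lose, and it illustrates the sense in which a combined feature retains the association between structure and content. The resulting counts could be fed to any standard vector-based classifier.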

This work was supported in part by the National Natural Science Foundation of China under Grant 61972317, Grant 61672432 and Grant 61732014, and in part by the Fundamental Research Funds for the Central Universities of China under Grant 3102015JSJ0004.

Corresponding author

Correspondence to Lijun Zhang.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Zhang, L., Li, N., Pan, W., Li, Z. (2021). A Semi-structured Data Classification Model with Integrating Tag Sequence and Ngram. In: Jensen, C.S., et al. (eds.) Database Systems for Advanced Applications. DASFAA 2021. Lecture Notes in Computer Science, vol 12682. Springer, Cham. https://doi.org/10.1007/978-3-030-73197-7_14

  • DOI: https://doi.org/10.1007/978-3-030-73197-7_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-73196-0

  • Online ISBN: 978-3-030-73197-7

  • eBook Packages: Computer Science, Computer Science (R0)
