Web Page Template and Data Separation for Better Maintainability

Zhao, Chenxu; Zhang, Rui; Qi, Jianzhong

doi:10.1007/978-3-030-02922-7_30

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11233)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Separating a web page into template code and the data records that populate the template is an important problem with a wide range of applications in web page compression and information extraction. We study this problem with the aim of separating a web page into easily maintainable template code and data records. We show that this problem is NP-hard and propose a heuristic algorithm to solve it. The main idea of our algorithm is to parse a web page into a tree and then process the tree recursively in a bottom-up manner in three steps: splitting, folding, and alignment. We perform experiments on real datasets to evaluate how well our proposed algorithms maximize the maintainability of the template code produced. The experimental results show that our proposed algorithms outperform the baseline algorithms by 25% in the maintainability measure.
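
To make the template/data split concrete, here is a minimal Python sketch of the folding idea only. It is not the authors' algorithm: it omits splitting and alignment, folds only runs of adjacent siblings with exactly the same tag structure, does not recurse into folded items, and drops attributes. All names (`Node`, `TreeBuilder`, `shape`, `separate_page`) and the Jinja-like loop syntax in the output are assumptions made for illustration.

```python
from html.parser import HTMLParser


class Node:
    """Minimal DOM node: an element with children, or a text leaf (tag is None)."""

    def __init__(self, tag=None, text=""):
        self.tag, self.text, self.children = tag, text, []


class TreeBuilder(HTMLParser):
    """Builds a Node tree; assumes well-formed HTML without void elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)  # attributes are ignored in this sketch
        self.stack[-1].children.append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(Node(text=data.strip()))


def shape(node):
    """Structural signature: tags only, every text leaf becomes '#text'."""
    if node.tag is None:
        return "#text"
    return (node.tag, tuple(shape(c) for c in node.children))


def texts(node):
    """Text leaves under node, in document order."""
    if node.tag is None:
        return [node.text]
    return [t for c in node.children for t in texts(c)]


def skeleton(node, slot):
    """Render node with each text leaf replaced by a numbered slot of record r."""
    if node.tag is None:
        n = slot[0]
        slot[0] += 1
        return "{{ r[" + str(n) + "] }}"
    inner = "".join(skeleton(c, slot) for c in node.children)
    return "<" + node.tag + ">" + inner + "</" + node.tag + ">"


def fold_children(kids, records):
    """Fold runs of same-shaped element siblings into one loop plus records."""
    parts, i = [], 0
    while i < len(kids):
        j = i + 1
        while (j < len(kids) and kids[i].tag is not None
               and shape(kids[j]) == shape(kids[i])):
            j += 1
        if j - i >= 2:
            records.append([texts(k) for k in kids[i:j]])
            idx = len(records) - 1
            parts.append("{% for r in data[" + str(idx) + "] %}"
                         + skeleton(kids[i], [0]) + "{% endfor %}")
        elif kids[i].tag is None:
            parts.append(kids[i].text)  # unrepeated text stays in the template
        else:
            inner = fold_children(kids[i].children, records)
            parts.append("<" + kids[i].tag + ">" + inner + "</" + kids[i].tag + ">")
        i = j
    return "".join(parts)


def separate_page(html):
    """Return (template, records) for an HTML string."""
    builder = TreeBuilder()
    builder.feed(html)
    records = []
    return fold_children(builder.root.children, records), records


page = "<div><h1>Fruit</h1><ul><li><b>apple</b>3</li><li><b>pear</b>5</li></ul></div>"
template, data = separate_page(page)
print(template)
print(data)
```

In this toy input the two `li` items share one structure, so they collapse into a single loop in the template while their text becomes two data records; the unrepeated heading text stays in the template. Requiring exact structural equality is a deliberate simplification, since real pages contain near-duplicates that need a tolerant comparison.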

Acknowledgment

This work is supported by the Australian Research Council (ARC) Future Fellowships Project FT120100832 and Discovery Project DP180102050.

Author information

Authors and Affiliations

School of CIS, The University of Melbourne, Parkville, Australia
Chenxu Zhao, Rui Zhang & Jianzhong Qi

Corresponding author

Correspondence to Rui Zhang.

Editor information

Editors and Affiliations

Zayed University, Dubai, United Arab Emirates
Hakim Hacid
Poznan University of Economics, Poznan, Poland
Wojciech Cellary
Victoria University, Footscray, VIC, Australia
Hua Wang
UNSW Australia, Sydney, NSW, Australia
Hye-Young Paik
Swinburne University of Technology, Hawthorn, VIC, Australia
Rui Zhou

Cite this paper

Zhao, C., Zhang, R., Qi, J. (2018). Web Page Template and Data Separation for Better Maintainability. In: Hacid, H., Cellary, W., Wang, H., Paik, HY., Zhou, R. (eds) Web Information Systems Engineering – WISE 2018. WISE 2018. Lecture Notes in Computer Science, vol 11233. Springer, Cham. https://doi.org/10.1007/978-3-030-02922-7_30

DOI: https://doi.org/10.1007/978-3-030-02922-7_30
Published: 20 October 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-02921-0
Online ISBN: 978-3-030-02922-7
eBook Packages: Computer Science, Computer Science (R0)