Web Data Extraction from Scientific Publishers’ Website Using Hidden Markov Model

Huang, Jing; Liu, Ziyu; Wang, Beibei; Duan, Mingyue; Yang, Bo

doi:10.1007/978-3-319-99365-2_42

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 11061)

Included in the following conference series:

International Conference on Knowledge Science, Engineering and Management

Abstract

Recently, large amounts of information have been appearing on web pages in an endless stream, and numerous papers are published in more than three thousand journals, especially in the field of technology. It is almost impossible for a user to search for this information item by item: a user who wants information about these thousands of journals, such as the journal introduction, impact factor, ISSN and so on, has to click through many links. To solve this problem, an automatic method is needed that filters the information out of the deep web. The method in this paper helps people quickly obtain the information they need, classified and extracted. This paper contains the following work: firstly, a machine learning method, the Hidden Markov Model (HMM), is used to extract journal information from the publisher’s website, which generalizes better than the heuristic method; then, during the data processing step, a content extraction technique is used to improve the performance of the HMM; finally, we store the extracted information in a structured way and display it. In the experimental step, three algorithms are tested and compared on accuracy, recall and F-measure; the results show that HMM with content extraction (C-HMM) has the best performance.
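As a rough illustration of how an HMM can label page tokens with journal fields, the sketch below decodes the most likely label sequence with the Viterbi algorithm. It is a minimal sketch under stated assumptions, not the authors' C-HMM: the states (other, issn, impact_factor), the coarse token features (word, issn_like, decimal) and every probability are hypothetical values chosen only for the example.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-9):
    """Return the most likely state sequence for the observations.

    Works in log space; probabilities below `floor` are clamped so that
    unseen emissions do not produce log(0).
    """
    if not observations:
        return []

    def lg(p):
        return math.log(max(p, floor))

    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: lg(start_p[s]) + lg(emit_p[s].get(observations[0], 0.0))
             for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + lg(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(observations[t], 0.0))
            back[t][s] = prev

    # Backtrack from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: all names and numbers below are illustrative assumptions.
STATES = ["other", "issn", "impact_factor"]
START_P = {"other": 0.8, "issn": 0.1, "impact_factor": 0.1}
TRANS_P = {
    "other": {"other": 0.6, "issn": 0.2, "impact_factor": 0.2},
    "issn": {"other": 0.7, "issn": 0.2, "impact_factor": 0.1},
    "impact_factor": {"other": 0.7, "issn": 0.1, "impact_factor": 0.2},
}
EMIT_P = {
    "other": {"word": 0.8, "issn_like": 0.05, "decimal": 0.15},
    "issn": {"word": 0.1, "issn_like": 0.85, "decimal": 0.05},
    "impact_factor": {"word": 0.1, "issn_like": 0.05, "decimal": 0.85},
}

# Coarse features for a token stream such as: "ISSN:", "1234-5678", "IF:", "3.21"
TOKENS = ["word", "issn_like", "word", "decimal"]
LABELS = viterbi(TOKENS, STATES, START_P, TRANS_P, EMIT_P)
print(list(zip(TOKENS, LABELS)))
```

In a real system the transition and emission probabilities would be estimated from labeled pages rather than written by hand, and the token features would come from the page's text after content extraction.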

Acknowledgments

This work was supported in part by National Natural Science Foundation of China under grants 61373053 and 61572226, and Jilin Province Key Scientific and Technological Research and Development project under grants 20180201044GX and 20180201067GX.

Author information

Authors and Affiliations

College of Computer Science and Technology, Jilin University, Changchun, 130012, China
Jing Huang, Ziyu Liu, Beibei Wang, Mingyue Duan & Bo Yang
Key Laboratory of Symbol Computation and Knowledge Engineering, Jilin University, Ministry of Education, Changchun, 130012, China
Jing Huang, Ziyu Liu, Beibei Wang, Mingyue Duan & Bo Yang

Corresponding author

Correspondence to Bo Yang.

Editor information

Editors and Affiliations

University of Bristol, Bristol, United Kingdom
Weiru Liu
Università di Trento, Povo, Italy
Fausto Giunchiglia
Jilin University, Changchun, China
Bo Yang

About this paper

Cite this paper

Huang, J., Liu, Z., Wang, B., Duan, M., Yang, B. (2018). Web Data Extraction from Scientific Publishers’ Website Using Hidden Markov Model. In: Liu, W., Giunchiglia, F., Yang, B. (eds) Knowledge Science, Engineering and Management. KSEM 2018. Lecture Notes in Computer Science, vol 11061. Springer, Cham. https://doi.org/10.1007/978-3-319-99365-2_42

DOI: https://doi.org/10.1007/978-3-319-99365-2_42
Published: 12 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-99364-5
Online ISBN: 978-3-319-99365-2
eBook Packages: Computer Science, Computer Science (R0)
