Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition

Chang, Chia-Hui; Wu, Cheng-Ju; Lin, Tzu-Ping

doi:10.1007/978-3-031-09917-5_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13362)

Included in the following conference series:

International Conference on Web Engineering

Abstract

Information extraction, transformation and loading (ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as “Dexi.io” or “Import.io” allow users to specify where to fetch a page and what information or data to extract from it. Although these commercial services already provide a graphical user interface to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page as “NEXT”, “PAGE” or “Other”, where the first two labels guide the system to pages similar to the seed URL. For multilingual support, we exploit the attribute contents of the links as well as Language-Agnostic SEntence Representations (LASER) for anchor text embedding. The experimental results show that the proposed model achieves an average micro F1 score of 0.834 and an average macro F1 score of 0.818 on pagination recognition. In terms of practical deployment, we are able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 min.
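The abstract reports both micro- and macro-averaged F1 over the link labels. As an illustration of how the two averages differ, the following is a minimal sketch, not the authors' evaluation code. The function names `f1` and `micro_macro_f1`, the toy gold and predicted labels, and the choice to average over all three labels (rather than only “NEXT” and “PAGE”) are assumptions made for this example.

```python
from collections import Counter

# Label set taken from the abstract; averaging over all three is an assumption.
LABELS = ("NEXT", "PAGE", "Other")


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels=LABELS):
    """Return (micro_f1, macro_f1) for single-label link classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label gains a false positive
            fn[g] += 1  # gold label gains a false negative
    # Micro: pool the counts over all labels, then compute one F1.
    micro = f1(sum(tp[l] for l in labels),
               sum(fp[l] for l in labels),
               sum(fn[l] for l in labels))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Hypothetical labels for eight links on one page (illustrative data only).
gold = ["NEXT", "PAGE", "PAGE", "PAGE", "Other", "Other", "Other", "Other"]
pred = ["NEXT", "PAGE", "PAGE", "Other", "Other", "Other", "Other", "PAGE"]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

Micro averaging weights every link equally, so frequent labels dominate the score. Macro averaging weights every label equally, so a rare label such as “NEXT” counts as much as a frequent one.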

Notes

1. https://github.com/UnderSam/pagination-prediction

References

Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 337–348. ACM, New York (2003). https://doi.org/10.1145/872757.872799
Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. TACL 7, 597–610 (2018)
Cantino, A.: Selector gadget. https://github.com/cantino/selectorgadget
Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006). https://doi.org/10.1109/TKDE.2006.152
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. NAACL, Association for Computational Linguistics, Minneapolis, Minnesota (2019)
Dexi.io: Dexi.io (2015). https://www.dexi.io/
Ferrara, E., Meo, P.D., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: a survey. Knowl.-Based Syst. 70, 301–323 (2014)
Furche, T., et al.: Diadem: thousands of websites to a single database. Proc. VLDB Endow. 7(14), 1845–1856 (2014). https://doi.org/10.14778/2733085.2733091
Gottlob, G., Koch, C., Baumgartner, R., Herzog, M., Flesca, S.: The Lixto data extraction project: back and forth between theory and practice. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2004, pp. 1–12. ACM, New York (2004). https://doi.org/10.1145/1055558.1055560
Import.io: Import.io (2012). https://www.import.io/product/
Korobov, M., de Prado, I., Haase, M.E.: AutoPager: Detect and classify pagination links (2020). https://github.com/TeamHG-Memex/autopager
Leonhardt, J., Anand, A., Khosla, M.: Boilerplate removal using a neural sequence labeling model. Association for Computing Machinery, New York, NY, USA (2020)
Liao, Y.C.: Event Source Page Discovery via Reinforcement Learning. Master’s thesis, National Central University, Taoyuan, Taiwan (2021)
Liu, B., Grossman, R., Zhai, Y.: Mining data records in web pages. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 601–606. ACM, New York (2003). https://doi.org/10.1145/956750.956826
Meusel, R., Mika, P., Blanco, R.: Focused crawling for structured data. Association for Computing Machinery, New York, NY, USA (2014)
Vogels, T., Ganea, O.-E., Eickhoff, C.: Web2Text: deep structured boilerplate removal. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 167–179. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7_13
Wu, T., Sgro, V.: Methods and systems for automated detection of pagination (2016). US20160103799A1
Yuliana, O.Y., Chang, C.-H.: DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Appl. Intell. 50(2), 271–295 (2019). https://doi.org/10.1007/s10489-019-01499-0
Zhai, Y., Liu, B.: Structured data extraction from the web based on partial tree alignment. IEEE Trans. Knowl. Data Eng. 18(12), 1614–1628 (2006). https://doi.org/10.1109/TKDE.2006.197
Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 2015, vol. 1, pp. 649–657. MIT Press, Cambridge, MA, USA (2015)

Acknowledgement

This paper is partially sponsored by Ministry of Science and Technology, Taiwan under grant MOST-109-2221-E-008-060-MY3.

Author information

Authors and Affiliations

National Central University, Taoyuan, Taiwan
Chia-Hui Chang, Cheng-Ju Wu & Tzu-Ping Lin

Corresponding author

Correspondence to Chia-Hui Chang.

Editor information

Editors and Affiliations

Polytechnic University of Bari, Bari, Italy
Tommaso Di Noia
Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea (Republic of)
In-Young Ko
Johannes Kepler University Linz, Linz, Austria
Markus Schedl
Polytechnic University of Bari, Bari, Italy
Carmelo Ardito

Cite this paper

Chang, CH., Wu, CJ., Lin, TP. (2022). Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition. In: Di Noia, T., Ko, IY., Schedl, M., Ardito, C. (eds) Web Engineering. ICWE 2022. Lecture Notes in Computer Science, vol 13362. Springer, Cham. https://doi.org/10.1007/978-3-031-09917-5_8

DOI: https://doi.org/10.1007/978-3-031-09917-5_8
Published: 01 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09916-8
Online ISBN: 978-3-031-09917-5
eBook Packages: Computer Science, Computer Science (R0)