Abstract
Information extraction, transformation and loading (abbreviated as ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as “Dexi.io” or “Import.io” allow users to specify where to fetch a page and what information or data to extract from it. Although these commercial services already provide a graphical user interface to guide the system to the target pages for each data source, such systems are not scalable because users have to create crawlers one by one. In this paper, we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page as “NEXT”, “PAGE” or “Other”, where the first two labels guide the system to pages similar to the seed URL. To support multiple languages, we exploit the attribute contents of the links as well as Language-Agnostic SEntence Representations (LASER) for anchor text embedding. The experimental results show that the proposed model achieves average micro and macro F1 scores of 0.834 and 0.818, respectively, on pagination recognition. In terms of practical deployment, we are able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 minutes.
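To make the labelling scheme concrete, the following is a minimal sketch of how “NEXT” and “PAGE” labels could steer a crawler from a seed page. It is not the neural sequence model described above: `label_link` is a hypothetical keyword heuristic that stands in for the model, and the names `LinkCollector` and `pagination_targets` are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the attributes and anchor text of every <a> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"attrs": dict(attrs), "text": ""}

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.links.append(self._current)
            self._current = None


def label_link(link):
    """Hypothetical stand-in for the neural labeller (assumption).

    Looks at attribute contents and anchor text only through simple
    keyword rules; the paper's model is a learned sequence model.
    """
    text = link["text"]
    hints = " ".join(str(v).lower() for v in link["attrs"].values() if v)
    if "next" in hints or text.lower() in {"next", ">", "»"}:
        return "NEXT"
    if text.isdigit():
        return "PAGE"
    return "Other"


def pagination_targets(html, base_url):
    """Return (next_url, page_urls) derived from the labelled links."""
    parser = LinkCollector()
    parser.feed(html)
    next_url, page_urls = None, []
    for link in parser.links:
        href = link["attrs"].get("href")
        if not href:
            continue
        label = label_link(link)
        url = urljoin(base_url, href)
        if label == "NEXT" and next_url is None:
            next_url = url
        elif label == "PAGE":
            page_urls.append(url)
    return next_url, page_urls
```

A crawler could call `pagination_targets` on each fetched page and follow the returned URLs; links labelled “Other” are ignored. The example deliberately relies on a `class` attribute rather than the anchor text for the “NEXT” case, since anchor text varies by language.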
References
Arasu, A., Garcia-Molina, H.: Extracting structured data from web pages. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp. 337–348. ACM, New York (2003). https://doi.org/10.1145/872757.872799
Artetxe, M., Schwenk, H.: Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Trans. Assoc. Comput. Linguist. 7, 597–610 (2018)
Cantino, A.: Selector gadget. https://github.com/cantino/selectorgadget
Chang, C.H., Kayed, M., Girgis, M.R., Shaalan, K.F.: A survey of web information extraction systems. IEEE Trans. Knowl. Data Eng. 18(10), 1411–1428 (2006). https://doi.org/10.1109/TKDE.2006.152
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186. NAACL, Association for Computational Linguistics, Minneapolis, Minnesota (2019)
Dexi.io: Dexi.io (2015). https://www.dexi.io/
Ferrara, E., Meo, P.D., Fiumara, G., Baumgartner, R.: Web data extraction, applications and techniques: a survey. Knowl.-Based Syst. 70, 301–323 (2014)
Furche, T., et al.: Diadem: thousands of websites to a single database. Proc. VLDB Endow. 7(14), 1845–1856 (2014). https://doi.org/10.14778/2733085.2733091
Gottlob, G., Koch, C., Baumgartner, R., Herzog, M., Flesca, S.: The Lixto data extraction project: back and forth between theory and practice. In: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2004, pp. 1–12. ACM, New York (2004). https://doi.org/10.1145/1055558.1055560
Import.io: Import.io (2012). https://www.import.io/product/
Korobov, M., de Prado, I., Haase, M.E.: AutoPager: Detect and classify pagination links (2020). https://github.com/TeamHG-Memex/autopager
Leonhardt, J., Anand, A., Khosla, M.: Boilerplate removal using a neural sequence labeling model. Association for Computing Machinery, New York, NY, USA (2020)
Liao, Y.C.: Event Source Page Discovery via Reinforcement Learning. Master’s thesis, National Central University, Taoyuan, Taiwan (2021)
Liu, B., Grossman, R., Zhai, Y.: Mining data records in web pages. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 601–606. ACM, New York (2003). https://doi.org/10.1145/956750.956826
Meusel, R., Mika, P., Blanco, R.: Focused crawling for structured data. Association for Computing Machinery, New York, NY, USA (2014)
Vogels, T., Ganea, O.-E., Eickhoff, C.: Web2Text: deep structured boilerplate removal. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds.) ECIR 2018. LNCS, vol. 10772, pp. 167–179. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-76941-7_13
Wu, T., Sgro, V.: Methods and systems for automated detection of pagination (2016). US20160103799A1
Yuliana, O.Y., Chang, C.-H.: DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Appl. Intell. 50(2), 271–295 (2019). https://doi.org/10.1007/s10489-019-01499-0
Zhai, Y., Liu, B.: Structured data extraction from the web based on partial tree alignment. IEEE Trans. Knowl. Data Eng. 18(12), 1614–1628 (2006). https://doi.org/10.1109/TKDE.2006.197
Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS 2015, vol. 1, pp. 649–657. MIT Press, Cambridge, MA, USA (2015)
Acknowledgement
This paper is partially sponsored by Ministry of Science and Technology, Taiwan under grant MOST-109-2221-E-008-060-MY3.
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Chang, CH., Wu, CJ., Lin, TP. (2022). Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition. In: Di Noia, T., Ko, IY., Schedl, M., Ardito, C. (eds) Web Engineering. ICWE 2022. Lecture Notes in Computer Science, vol 13362. Springer, Cham. https://doi.org/10.1007/978-3-031-09917-5_8
DOI: https://doi.org/10.1007/978-3-031-09917-5_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-09916-8
Online ISBN: 978-3-031-09917-5
eBook Packages: Computer Science, Computer Science (R0)