Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition

  • Conference paper
Web Engineering (ICWE 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13362)

Abstract

Information extraction, transformation, and loading (ETL) tools are important for big data analysis and value-added applications on the Web. Typical Web scraping systems such as “Dexi.io” or “Import.io” allow users to specify where to fetch a page and what information or data to extract from it. Although these commercial services already provide a graphical user interface to guide the system to the target pages of each data source, they do not scale because users have to create crawlers one by one. In this paper, we consider the problem of pagination recognition, which aims to automate the process of finding similar pages by locating the next-page link and the list of page links from any starting URL. We propose a neural sequence model that labels each clickable link in a page as “NEXT”, “PAGE”, or “Other”, where the first two labels guide the system to pages similar to the seed URL. For multilingual support, we exploit the attribute contents of the links as well as Language-Agnostic SEntence Representations (LASER) for anchor text embedding. The experimental results show that the proposed model achieves an average micro-F1 score of 0.834 and an average macro-F1 score of 0.818 on pagination recognition. In terms of practical deployment, we are able to automatically create 1,060 (MDR) and 153 (DCADE) data APIs from 392 event source pages within 62 minutes.
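To make the reported metrics concrete, the following is a minimal sketch of how micro- and macro-averaged F1 can be computed for per-link labels. It is an illustration only, not the authors' evaluation code: the function names, the toy label sequences, and the choice to average over “NEXT” and “PAGE” while treating “Other” as the negative class are all assumptions.

```python
from collections import Counter

# Assumption: only the two pagination labels are scored; "Other" is the negative class.
LABELS = ("NEXT", "PAGE")


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred, labels=LABELS):
    """Return (micro_f1, macro_f1) for two equal-length label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g in labels:
                tp[g] += 1
        else:
            if p in labels:
                fp[p] += 1  # predicted a pagination label that is wrong
            if g in labels:
                fn[g] += 1  # missed a true pagination label
    # Micro: pool the counts of all labels, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    return micro, macro


# Hypothetical labels for six links on one page.
gold = ["NEXT", "PAGE", "PAGE", "Other", "PAGE", "Other"]
pred = ["NEXT", "PAGE", "Other", "PAGE", "PAGE", "Other"]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

Micro-averaging weights every link equally, so the frequent “PAGE” label dominates, whereas macro-averaging gives the rarer “NEXT” label the same weight as “PAGE”.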

Notes

  1. https://github.com/UnderSam/pagination-prediction

Acknowledgement

This paper is partially sponsored by the Ministry of Science and Technology, Taiwan, under grant MOST-109-2221-E-008-060-MY3.

Corresponding author

Correspondence to Chia-Hui Chang.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Chang, CH., Wu, CJ., Lin, TP. (2022). Automatic Web Data API Creation via Cross-Lingual Neural Pagination Recognition. In: Di Noia, T., Ko, IY., Schedl, M., Ardito, C. (eds) Web Engineering. ICWE 2022. Lecture Notes in Computer Science, vol 13362. Springer, Cham. https://doi.org/10.1007/978-3-031-09917-5_8

  • DOI: https://doi.org/10.1007/978-3-031-09917-5_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-09916-8

  • Online ISBN: 978-3-031-09917-5

  • eBook Packages: Computer Science, Computer Science (R0)
