NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model

Shi, Shengsheng; Wei, Wu; Liu, Yulong; Wang, Haitao; Luo, Lei; Yuan, Chunfeng; Huang, Yihua

doi:10.1007/978-3-642-41230-1_3

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8180)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

As the most popular information publishing platform, the Web contains a large amount of valuable data of interest to users and applications. Although many data mining and analysis techniques have been studied in the last decade, there are still few easy-to-use web data mining tools that let users extract useful data from the Web. Web information extraction is a complete process involving web page navigation, data extraction and data integration. Unfortunately, most existing studies and systems give insufficient consideration to this three-stage process. Most of them also lack rules powerful enough to express the flexible extraction logic needed to extract data records with complicated structures. In this paper, we propose a novel web data extraction language, NEXIR, toward a three-stage web data extraction model. First, the language can define rules that let the system automate the navigation of web pages, including deep web pages that require user interaction. Second, the language allows users to define flexible and complicated rules to extract data records from web pages and integrate the extracted data into a pre-defined structure. A language engine and a prototype extraction system have been implemented based on the proposed language. The experimental results show that our language and system are effective and powerful compared with existing data extraction approaches.
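
The three stages named in the abstract (navigation, extraction, integration) can be illustrated with a minimal Python sketch. This is not NEXIR syntax, which the abstract does not show, and it is not the authors' engine. The in-memory PAGES table, the "item" class, the rel="next" link convention, the "title|price" text layout and all function names are assumptions made only for illustration.

```python
from html.parser import HTMLParser

# Hypothetical in-memory "site" (URL -> HTML); stands in for real page fetches.
PAGES = {
    "/list?page=1": '<ul><li class="item">Book A|12.50</li></ul>'
                    '<a rel="next" href="/list?page=2">next</a>',
    "/list?page=2": '<ul><li class="item">Book B|8.00</li></ul>',
}


class PageParser(HTMLParser):
    """Collects the text of <li class="item"> elements and the rel="next" link."""

    def __init__(self):
        super().__init__()
        self.items = []
        self.next_url = None
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._in_item = True
        elif tag == "a" and attrs.get("rel") == "next":
            self.next_url = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())


def parse(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser


def navigate(start_url):
    """Stage 1 (navigation): yield each page's HTML, following 'next' links."""
    url, seen = start_url, set()
    while url and url not in seen and url in PAGES:
        seen.add(url)
        html = PAGES[url]
        yield html
        url = parse(html).next_url


def extract(html):
    """Stage 2 (extraction): return the raw record strings found on one page."""
    return parse(html).items


def integrate(raw_records):
    """Stage 3 (integration): map raw strings onto a pre-defined structure."""
    records = []
    for raw in raw_records:
        title, price = raw.split("|")
        records.append({"title": title, "price": float(price)})
    return records


def run_pipeline(start_url):
    raw = [r for html in navigate(start_url) for r in extract(html)]
    return integrate(raw)


if __name__ == "__main__":
    print(run_pipeline("/list?page=1"))
```

Keeping the stages as separate functions mirrors the separation the abstract argues for: the navigation logic can change (for example, to submit a form on a deep web page) without touching the extraction or integration steps.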

Author information

Authors and Affiliations

Department of Computer Science and Technology, Nanjing University, Nanjing, 210023, China
Shengsheng Shi, Wu Wei, Yulong Liu, Haitao Wang, Lei Luo, Chunfeng Yuan & Yihua Huang
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210023, China
Shengsheng Shi, Wu Wei, Yulong Liu, Haitao Wang, Lei Luo, Chunfeng Yuan & Yihua Huang

Editor information

Editors and Affiliations

The University of New South Wales, Sydney, NSW, Australia
Xuemin Lin
Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos
AT&T Labs-Research, Florham Park, NJ, USA
Divesh Srivastava
Victoria University, Melbourne, Australia
Guangyan Huang

About this paper

Cite this paper

Shi, S. et al. (2013). NEXIR: A Novel Web Extraction Rule Language toward a Three-Stage Web Data Extraction Model. In: Lin, X., Manolopoulos, Y., Srivastava, D., Huang, G. (eds) Web Information Systems Engineering – WISE 2013. WISE 2013. Lecture Notes in Computer Science, vol 8180. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41230-1_3

DOI: https://doi.org/10.1007/978-3-642-41230-1_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41229-5
Online ISBN: 978-3-642-41230-1
eBook Packages: Computer Science, Computer Science (R0)
