ABSTRACT
Deep Web pages convey highly relevant information for application domains such as e-government, e-commerce, and social networking. For this reason, there is sustained interest in extracting data from Deep Web sources efficiently, effectively, and automatically. In this paper we present SILA, a novel Spatial Instance Learning Approach that extracts data records from Deep Web pages by exploiting both the spatial arrangement and the presentation features of data items/fields, as produced by the layout engines of Web browsers when rendering Deep Web pages on screen. SILA is independent of the internal HTML encoding of Web pages and recognizes data records in pages that have multiple data regions, in which data items are arranged according to many different presentation layouts. Experimental results show that SILA achieves very high precision and recall and that it performs much better than the MDR and ViNTs approaches.
- B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. In KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 601--606, New York, NY, USA, 2003. ACM.
- W. Liu, X. Meng, and W. Meng. ViDE: A vision-based approach for deep web data extraction. IEEE Trans. on Knowl. and Data Eng., 22(3):447--460, 2010.
- I. Navarrete and G. Sciavicco. Spatial reasoning with rectangular cardinal direction relations. In ECAI, pages 1--9, 2006.
- E. Oro, M. Ruffolo, and S. Staab. SXPath - extending XPath towards spatial querying on web documents. PVLDB, 4(2):129--140, 2010.
- N. K. Papadakis, D. Skoutas, K. Raftopoulos, and T. A. Varvarigou. STAVIES: A system for information extraction from unknown web data sources through automatic web wrapper generation using clustering techniques. TKDE, 17(12):1638--1652, 2005.
- Y. Zhai and B. Liu. Structured data extraction from the web based on partial tree alignment. TKDE, 18(12):1614--1628, 2006.
- H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully automatic wrapper generation for search engines. In Proceedings of the 14th international conference on World Wide Web, WWW '05, pages 66--75, New York, NY, USA, 2005. ACM.
Index Terms
- SILA: a spatial instance learning approach for deep webpages