Information extraction for deep web using repetitive subject pattern

Thamviset, Wachirawut; Wongthanavasu, Sartra

doi:10.1007/s11280-013-0248-y

Published: 14 August 2013

World Wide Web, Volume 17, pages 1109–1139 (2014)

Wachirawut Thamviset¹ &
Sartra Wongthanavasu¹

Abstract

In this paper, we propose an information extraction (IE) system that extracts data records from semi-structured documents on the Deep Web using a technique called Repetitive Subject Pattern. The technique is based on the hypothesis that each data record in a web page has a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of the data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator flexible: when the automatic process generates a wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in either unsupervised or semi-supervised mode. Experiments show that the proposed technique generates high-quality wrappers, with both recall and precision close to 100% on a number of datasets.
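The four tasks listed in the abstract can be illustrated with a deliberately simplified sketch. It is not the authors' algorithm: it assumes the semi-supervised case where the user supplies the subject string, it treats the "wrapper" as nothing more than the tag path of that string, and it flattens the DOM tree into a list of text nodes. The function names (`parse_page`, `generate_wrapper`, `extract_records`) and the sample page are hypothetical, chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TextPathParser(HTMLParser):
    """Collects a (tag_path, text) pair for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, root first
        self.nodes = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def parse_page(html):
    """Task 1 (simplified): flatten the page into text nodes with tag paths."""
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def generate_wrapper(nodes, subject):
    """Tasks 2-3 (simplified): the wrapper is the tag path of the subject string.

    Assumption: the subject string is supplied by the user and matches one
    text node exactly.
    """
    for path, text in nodes:
        if text == subject:
            return path
    raise ValueError(f"subject string not found: {subject!r}")


def extract_records(nodes, wrapper):
    """Task 4 (simplified): start a new record at every node on the subject path.

    Limitation: any text after the last subject item is attached to the
    last record, because this sketch has no notion of an end boundary.
    """
    records = []
    for path, text in nodes:
        if path == wrapper:
            records.append([text])
        elif records:
            records[-1].append(text)
    return records


SAMPLE = """
<html><body><h1>Results</h1>
<ul>
<li><a href="/b/1">Dune</a><span>Frank Herbert</span><i>$9.99</i></li>
<li><a href="/b/2">Emma</a><span>Jane Austen</span><i>$7.50</i></li>
</ul>
</body></html>
"""

nodes = parse_page(SAMPLE)
wrapper = generate_wrapper(nodes, "Dune")
print(wrapper)                          # html/body/ul/li/a
print(extract_records(nodes, wrapper))
```

Given the sample subject string "Dune", the repeated tag path of the subject items splits the page into two records, one per book. Supplying a different subject string produces a different wrapper, which mirrors the correction step the abstract describes for cases where automatic recognition fails.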

Author information

Authors and Affiliations

Machine Learning and Intelligent Systems (MLIS) Laboratory, Cellular Automata and Knowledge Engineering (CAKE) Laboratory, Department of Computer Science, Faculty of Science, Khon Kaen University, Khon Kaen, 40002, Thailand
Wachirawut Thamviset & Sartra Wongthanavasu

Corresponding author

Correspondence to Wachirawut Thamviset.

About this article

Cite this article

Thamviset, W., Wongthanavasu, S. Information extraction for deep web using repetitive subject pattern. World Wide Web 17, 1109–1139 (2014). https://doi.org/10.1007/s11280-013-0248-y

Received: 31 August 2012
Revised: 16 July 2013
Accepted: 23 July 2013
Published: 14 August 2013
Issue Date: September 2014
DOI: https://doi.org/10.1007/s11280-013-0248-y
