Abstract
In this paper, we propose an information extraction (IE) system that extracts data records from semi-structured documents on the Deep Web using a technique called Repetitive Subject Pattern. The technique is based on the hypothesis that each data record in a web page has a subject item, and that the repetitive pattern of the subject items can be used to identify the boundaries of the data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator flexible: when the automatic process generates a wrong wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in both unsupervised and semi-supervised modes. Experiments show that the proposed technique generates high-quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.
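The four tasks can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the algorithm, so the sketch assumes that the "pattern" of a subject item is simply its tag path in the DOM, and that a record runs from one subject item to the next. The names `TextPathParser`, `parse_page`, `generate_wrapper`, `extract_records`, and the sample page are all hypothetical. The subject string is supplied by hand here, which corresponds to the semi-supervised mode; automatic subject recognition is not shown.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; they are kept out of the tag path.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathParser(HTMLParser):
    """Task 1 (simplified): flatten a page into (tag_path, text) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.nodes = []  # (tag_path, text) for each non-empty text node

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def parse_page(html):
    parser = TextPathParser()
    parser.feed(html)
    parser.close()
    return parser.nodes


def generate_wrapper(nodes, subject):
    """Tasks 2-3 (assumed form): the wrapper is the subject's tag path."""
    for path, text in nodes:
        if text == subject:
            return path
    raise ValueError("subject string not found in page")


def extract_records(nodes, wrapper_path):
    """Task 4 (assumed form): each subject item starts a new record."""
    records = []
    for path, text in nodes:
        if path == wrapper_path:
            records.append([text])    # a repeated subject opens a record
        elif records:
            records[-1].append(text)  # other items join the open record
    return records


# Hypothetical sample page with three data records.
PAGE = """
<html><body>
<h1>Results</h1>
<ul>
<li><b>Alpha Camera</b><span>$120</span></li>
<li><b>Beta Lens</b><span>$80</span></li>
<li><b>Gamma Tripod</b><span>$35</span></li>
</ul>
</body></html>
"""

nodes = parse_page(PAGE)
wrapper = generate_wrapper(nodes, "Beta Lens")  # user-provided sample subject
records = extract_records(nodes, wrapper)
print(wrapper)
print(records)
```

One sample subject is enough to recover all three records because every subject item shares the same tag path. A known limitation of this simplification is that any text following the last record (a page footer, for example) would be appended to that record; a fuller approach would need to bound the record region.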
Cite this article
Thamviset, W., Wongthanavasu, S. Information extraction for deep web using repetitive subject pattern. World Wide Web 17, 1109–1139 (2014). https://doi.org/10.1007/s11280-013-0248-y
DOI: https://doi.org/10.1007/s11280-013-0248-y