Abstract
Extracting loosely structured data records (LSDRs) has applications in many domains, such as forum pattern recognition, weblog data analysis, and book and news review analysis. Yet existing methods work well only for strongly structured data records (SDRs). In this paper, we address the problem of extracting LSDRs by mining strict patterns. Our method uses both content features and tag tree features to recognize LSDRs, and we propose a new algorithm that extracts the data records (DRs) automatically. The experimental results demonstrate that our algorithm extracts LSDRs effectively, with higher precision and recall.
Cite this article
Li, Q., Chen, J. & Wu, Y. Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns. World Wide Web 12, 263–284 (2009). https://doi.org/10.1007/s11280-009-0062-8
DOI: https://doi.org/10.1007/s11280-009-0062-8