Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns

World Wide Web

Abstract

Extracting loosely structured data records (LSDRs) has wide applications in many domains, such as forum pattern recognition, weblog data analysis, and book and news review analysis. Yet existing methods work well only for strongly structured data records (SDRs). In this paper, we address the problem of extracting LSDRs by mining strict patterns. Our method uses both content features and tag-tree features to recognize LSDRs, and we propose a new algorithm that extracts the data records (DRs) automatically. The experimental results demonstrate that our algorithm effectively extracts LSDRs with higher precision and recall.
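
The abstract does not spell out the algorithm, so the following is only an illustrative sketch of the general idea it names: find a small "strict" structural pattern that repeats among sibling subtrees of the tag tree, and let the rest of each subtree vary. Everything here is an assumption made for illustration and not the authors' method: the names `Node`, `TreeBuilder`, `strict_signature`, and `find_records`, the choice of "tag plus first child tag" as the strict pattern, and the use of non-empty text as the content feature.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """One element of a simplified tag tree."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []


class TreeBuilder(HTMLParser):
    """Builds a tag tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def strict_signature(node):
    # Hypothetical "strict pattern": the node's tag plus its first child's tag.
    first = node.children[0].tag if node.children else None
    return (node.tag, first)


def text_of(node):
    # Content feature: all text under the node (own text first, then children).
    parts = list(node.text)
    for child in node.children:
        parts.append(text_of(child))
    return " ".join(p for p in parts if p)


def find_records(node, min_repeat=2):
    """Return groups of sibling subtrees that share a strict signature."""
    groups = []
    counts = Counter(strict_signature(c) for c in node.children)
    for sig, n in counts.items():
        if n >= min_repeat:
            texts = [text_of(c) for c in node.children
                     if strict_signature(c) == sig and text_of(c)]
            if len(texts) >= min_repeat:
                groups.append(texts)
    for child in node.children:
        groups.extend(find_records(child, min_repeat))
    return groups


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


SAMPLE = """
<html><body>
<div id="nav"><a href="#">home</a></div>
<div class="post"><h4>alice</h4><p>Short post.</p></div>
<div class="post"><h4>bob</h4><p>Long <b>post</b></p><ul><li>x</li></ul></div>
</body></html>
"""

records = find_records(parse(SAMPLE))
```

In this toy page the two post blocks have different internal structure (the second contains an extra list), so a comparison of whole subtrees would treat them as dissimilar. Both begin with the same `div` and `h4` pair, though, which is enough for the sketch to group them while leaving the navigation block out.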

Author information

Correspondence to Qing Li.

About this article

Cite this article

Li, Q., Chen, J. & Wu, Y. Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns. World Wide Web 12, 263–284 (2009). https://doi.org/10.1007/s11280-009-0062-8

  • DOI: https://doi.org/10.1007/s11280-009-0062-8