Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns

Li, Qing; Chen, Jing; Wu, Yipu

doi:10.1007/s11280-009-0062-8

Published: 25 April 2009

World Wide Web, Volume 12, pages 263–284 (2009)

Abstract

Extracting loosely structured data records (LSDRs) has wide applications in many domains, such as forum pattern recognition, weblog data analysis, and analysis of book and news reviews. Yet existing methods work well only for strongly structured data records (SDRs). In this paper, we address the problem of extracting LSDRs by mining strict patterns. Our method uses both content features and tag-tree features to recognize LSDRs, and we propose a new algorithm that extracts the data records (DRs) automatically. The experimental results demonstrate that our algorithm is able to effectively extract LSDRs with higher precision and recall.

Author information

Authors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
Qing Li, Jing Chen & Yipu Wu

Corresponding author

Correspondence to Qing Li.

About this article

Cite this article

Li, Q., Chen, J. & Wu, Y. Algorithm for Extracting Loosely Structured Data Records Through Digging Strict Patterns. World Wide Web 12, 263–284 (2009). https://doi.org/10.1007/s11280-009-0062-8

Received: 14 April 2008
Revised: 17 February 2009
Accepted: 16 March 2009
Published: 25 April 2009
Issue Date: September 2009
DOI: https://doi.org/10.1007/s11280-009-0062-8
