L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises

Deng, Xu-Bin; Zhu, Yang-Yong

doi:10.1007/s11390-005-0763-0

Published: November 2005

Volume 20, pages 763–773, (2005)
Journal of Computer Science and Technology

Xu-Bin Deng¹ & Yang-Yong Zhu¹,²

Abstract

In this paper, a new method, named L-tree match, is presented for extracting data from complex data sources. First, based on the data extraction logic presented in this work, a new data extraction model is constructed in which the model components are structurally correlated via a generalized template. Second, a database-populating mechanism is built, along with the object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Third, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from data sources with optional, unordered, nested, and/or noisy components. Finally, the method is applied to extract accurate data from biological documents totaling 100 GB for the first online integrated biological data warehouse of China.
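The abstract does not give the details of the L-tree model or its algorithm, so the following is only a minimal illustrative sketch of the general idea it describes: a template tree drives extraction from a line stream whose records may contain optional, unordered, nested, and noisy components. Every name here (`Node`, `leaves`, `build`, `extract`), the `//` record terminator, and the `ID`/`OS`/`OX`/`CC` line codes in the sample are assumptions made for illustration, not the authors' design.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Node:
    """One hypothetical template component: a leaf with a line pattern, or a group."""
    name: str
    pattern: Optional[str] = None          # regex with one capture group (leaves only)
    children: List["Node"] = field(default_factory=list)
    optional: bool = False


def leaves(node: Node, path: Tuple[str, ...] = ()) -> Iterator[Tuple[Tuple[str, ...], Node]]:
    """Yield (path, leaf) for every leaf below node."""
    path = path + (node.name,)
    if node.pattern is not None:
        yield path, node
    for child in node.children:
        yield from leaves(child, path)


def build(node: Node, found: Dict[Tuple[str, ...], str], path: Tuple[str, ...] = ()):
    """Top-down pass: assemble nested output; None if a required part is missing."""
    path = path + (node.name,)
    if node.pattern is not None:
        return found.get(path)
    result = {}
    for child in node.children:
        value = build(child, found, path)
        if value is None:
            if not child.optional:
                return None                # required component absent
        else:
            result[child.name] = value
    return result


def extract(template: Node, lines: Iterable[str], end_marker: str = "//") -> Iterator[dict]:
    """Yield one nested dict per record; a record ends at a line equal to end_marker.

    Lines after the last end_marker are ignored in this sketch.
    """
    compiled = [(p, re.compile(leaf.pattern)) for p, leaf in leaves(template)]
    found: Dict[Tuple[str, ...], str] = {}
    for line in lines:
        if line.strip() == end_marker:
            record = build(template, found)
            if record is not None:
                yield record
            found = {}
            continue
        # Bottom-up pass: the first leaf whose pattern matches claims the line,
        # in any order; lines matching no leaf are treated as noise and skipped.
        for path, regex in compiled:
            m = regex.match(line)
            if m:
                found.setdefault(path, m.group(1))
                break


# Hypothetical template and input, invented for this example.
template = Node("entry", children=[
    Node("id", r"ID\s+(\S+)"),
    Node("organism", children=[
        Node("species", r"OS\s+(.+)"),
        Node("taxon", r"OX\s+(\S+)", optional=True),
    ]),
    Node("comment", r"CC\s+(.+)", optional=True),
])

text = """\
OS   Homo sapiens
### banner noise ###
ID   P12345
//
ID   Q99999
//
"""

records = list(extract(template, text.splitlines()))
print(records)
```

In this sketch the first record is accepted even though its fields arrive out of template order, a noise line intervenes, and the optional `taxon` and `comment` parts are absent; the second record is dropped because the required `species` leaf is missing.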

Author information

Authors and Affiliations

Department of Computing and Information Technology, Fudan University, Shanghai, 200433, P.R. China
Xu-Bin Deng & Yang-Yong Zhu
Shanghai Center for Bioinformation Technology, Shanghai, 201203, P.R. China
Yang-Yong Zhu

Corresponding author

Correspondence to Xu-Bin Deng.

Additional information

Supported by the National High Technology Development 863 Program of China under Grant No.2002AA231011, and the Major Project of Shanghai Science & Technology Commission under Grant No.02DJ14013.

Xu-Bin Deng received the M.S. degree in computer science from Xinjiang University in 1994. He is a Ph.D. candidate in computer science at Fudan University. His research interests include databases, data mining, and bioinformatics.

Yang-Yong Zhu received the Ph.D. degree in computer science from Fudan University in 1994. He is a professor and Ph.D. supervisor in the Department of Computing and Information Technology, Fudan University. His research interests include databases, knowledge bases, data mining, and bioinformatics.

About this article

Cite this article

Deng, XB., Zhu, YY. L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises. J Comput Sci Technol 20, 763–773 (2005). https://doi.org/10.1007/s11390-005-0763-0

Received: 24 March 2004
Revised: 25 January 2005
Issue Date: November 2005
DOI: https://doi.org/10.1007/s11390-005-0763-0
