L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises

Journal of Computer Science and Technology

Abstract

This paper presents a new method, named L-tree match, for extracting data from complex data sources. First, based on the data extraction logic presented in this work, a new data extraction model is constructed in which the model components are structurally correlated via a generalized template. Second, a database-populating mechanism is built, along with object-manipulating operations needed for flexible database design, to support data extraction from huge text streams. Third, top-down and bottom-up strategies are combined to design a new extraction algorithm that can extract data from sources with optional, unordered, nested, and/or noisy components. Finally, the method is applied to extract accurate data from biological documents amounting to 100 GB for the first online integrated biological data warehouse of China.
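
As a rough illustration of what extraction with optional, unordered, nested, and noisy components can involve, the following is a minimal Python sketch. It is not the L-tree match algorithm, whose details are not given in this abstract. It assumes a hypothetical line-oriented record format in which each field starts with a label prefix and each record ends with `//`; the `Node` class, the `leaves` and `extract` functions, and the labels `ID`, `DE`, `OS`, and `OC` are all invented for this example. The template is walked top-down once to build a lookup table, and each matched line is then placed bottom-up into a nested record.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical template node: a leaf matches lines by prefix, a group nests children."""
    name: str
    prefix: str = ""        # leaf only: label that marks this field in the text
    optional: bool = False  # leaf only: the record is still valid without it
    children: list = field(default_factory=list)


def leaves(node, path=()):
    """Top-down walk of the template, yielding (prefix, path, optional) per leaf."""
    here = path + (node.name,)
    if not node.children:
        yield node.prefix, here, node.optional
    for child in node.children:
        yield from leaves(child, here)


def extract(lines, template, end="//"):
    """Yield one nested dict per record that contains every required field."""
    # Drop the root name from each path so records are keyed by field names.
    table = {prefix: path[1:] for prefix, path, _ in leaves(template)}
    required = {path[1:] for _, path, opt in leaves(template) if not opt}
    record, seen = {}, set()
    for raw in lines:
        line = raw.strip()
        if line == end:
            if required <= seen:      # incomplete records are discarded
                yield record
            record, seen = {}, set()
            continue
        for prefix, path in table.items():
            if line.startswith(prefix):
                # Bottom-up placement: create enclosing groups on demand,
                # so fields may arrive in any order.
                slot = record
                for key in path[:-1]:
                    slot = slot.setdefault(key, {})
                slot[path[-1]] = line[len(prefix):].strip()
                seen.add(path)
                break
        # Lines that match no prefix are treated as noise and skipped.


TEMPLATE = Node("entry", children=[
    Node("id", prefix="ID "),
    Node("description", prefix="DE ", optional=True),
    Node("source", children=[
        Node("organism", prefix="OS "),
        Node("taxonomy", prefix="OC ", optional=True),
    ]),
])

SAMPLE = """\
OS Homo sapiens
## banner line (noise)
ID P12345
//
DE incomplete entry with no ID
//
ID Q99999
DE Example protein
OC Eukaryota
OS Mus musculus
//
""".splitlines()

records = list(extract(SAMPLE, TEMPLATE))
```

Because `extract` is a generator that reads one line at a time, it never holds more than the current record in memory, which is one simple way to approach very large text streams. A real system would also need multi-line fields, repeated fields, and a database-populating step, none of which this sketch attempts.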

Author information

Corresponding author

Correspondence to Xu-Bin Deng.

Additional information

Supported by the National High Technology Development 863 Program of China under Grant No.2002AA231011, and the Major Project of Shanghai Science & Technology Commission under Grant No.02DJ14013.

Xu-Bin Deng received his M.S. degree in computer science from Xinjiang University in 1994. He is a Ph.D. candidate in computer science at Fudan University. His research interests include databases, data mining, and bioinformatics.

Yang-Yong Zhu received his Ph.D. degree in computer science from Fudan University in 1994. He is a professor and Ph.D. supervisor in the Department of Computing and Information Technology, Fudan University. His research interests include databases, knowledge bases, data mining, and bioinformatics.

About this article

Cite this article

Deng, XB., Zhu, YY. L-Tree Match: A New Data Extraction Model and Algorithm for Huge Text Stream with Noises. J Comput Sci Technol 20, 763–773 (2005). https://doi.org/10.1007/s11390-005-0763-0

  • DOI: https://doi.org/10.1007/s11390-005-0763-0