Attribute Value Extraction and Standardization in Data Integration

  • Conference paper
Advances in Web-Age Information Management (WAIM 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2118)

Abstract

Most database systems and data analysis tools work with relational or otherwise well-structured data. When data is collected from various sources for warehousing or analysis, extracting and formatting the input data into the required form is never as trivial a task as it seems. In this paper we present a pattern-matching-based approach for extracting and standardizing attribute values from input data entries in the form of character strings. The core component of the approach is a powerful pattern language, which provides a simple way to specify the semantic features, length limitations, external references, element extraction, and restructuring of attributes. Attribute values can then be extracted from input strings by pattern matching. Constraints on attributes can be enforced so that the attribute values are standardized even when the input data comes from different sources and in different formats. The pattern language and matching algorithms are presented. A prototype system based on the proposed approach is also described.
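As a rough illustration of the general idea (extracting an attribute value by pattern matching, checking a constraint, and emitting one standard form), the sketch below uses ordinary Python regular expressions. It is not the pattern language or the matching algorithm described in the paper; the date attribute, the two input formats, the day/month/year reading of the slash form, and the names `DATE_PATTERNS` and `standardize_date` are all assumptions made for the example.

```python
import re

# Hypothetical patterns, one per assumed source format. Named groups play
# the role of extracted elements. These are plain Python regular
# expressions, not the pattern language proposed in the paper.
DATE_PATTERNS = [
    re.compile(r"(?P<year>\d{4})-(?P<month>\d{1,2})-(?P<day>\d{1,2})"),  # e.g. 2001-07-09
    re.compile(r"(?P<day>\d{1,2})/(?P<month>\d{1,2})/(?P<year>\d{4})"),  # e.g. 9/7/2001 (assumed day/month/year)
]


def standardize_date(entry):
    """Return the date as YYYY-MM-DD, or None if no pattern matches
    or the (deliberately simplified) range constraint fails."""
    text = entry.strip()
    for pattern in DATE_PATTERNS:
        match = pattern.fullmatch(text)
        if match is None:
            continue
        year = int(match.group("year"))
        month = int(match.group("month"))
        day = int(match.group("day"))
        # Simplified constraint: range checks only, no per-month day limits.
        if not (1 <= month <= 12 and 1 <= day <= 31):
            return None
        # Restructure the extracted elements into one standard form.
        return f"{year:04d}-{month:02d}-{day:02d}"
    return None


if __name__ == "__main__":
    for raw in ["2001-07-09", " 9/7/2001 ", "31/13/2001", "July 2001"]:
        print(repr(raw), "->", standardize_date(raw))
```

Entries from either assumed format are mapped to the same standard string, while entries that match no pattern or violate the constraint are rejected instead of being passed through unchanged.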

This work is partially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (HKUST6092/99E) and a grant from the National 973 project of China (No. G1998030414).

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lu, H., Tian, Z. (2001). Attribute Value Extraction and Standardization in Data Integration. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_31

  • DOI: https://doi.org/10.1007/3-540-47714-4_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42298-3

  • Online ISBN: 978-3-540-47714-3

  • eBook Packages: Springer Book Archive
