Abstract
Most database systems and data analysis tools work with relational or otherwise well-structured data. When data is collected from various sources for warehousing or analysis, extracting the input data and formatting it into the required form is never as trivial a task as it seems. In this paper we present a pattern-matching-based approach for extracting and standardizing attribute values from input data entries given as character strings. The core component of the approach is a powerful pattern language that provides a simple way to specify the semantic features, length limitations, external references, element extraction, and restructuring of attributes. Attribute values can then be extracted from input strings by pattern matching. Constraints on attributes can be enforced so that the attribute values are standardized even when the input data comes from different sources and arrives in different formats. The pattern language and matching algorithms are presented. A prototype system based on the proposed approach is also described.
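To make the idea concrete, the following is a minimal sketch of extracting and standardizing one attribute (a date) by pattern matching. It is not the pattern language proposed in the paper: it assumes ordinary Python regular expressions as a stand-in, and the names `MONTHS`, `DATE_PATTERNS`, and `standardize_date` are hypothetical. Character classes play the role of semantic features, repetition counts give length limitations, the month table acts as an external reference, named groups perform element extraction, and the final formatting step restructures the elements into one standard form.

```python
import re
from typing import Optional

# Hypothetical "external reference": month names and their three-letter
# abbreviations mapped to month numbers.
_MONTH_NAMES = [
    "january", "february", "march", "april", "may", "june",
    "july", "august", "september", "october", "november", "december",
]
MONTHS = {}
for number, name in enumerate(_MONTH_NAMES, start=1):
    MONTHS[name] = number
    MONTHS[name[:3]] = number

# One pattern per accepted input format. Character classes and repetition
# counts constrain each element; named groups extract the elements.
DATE_PATTERNS = [
    # e.g. "2001-07-09" or "2001/7/9"
    re.compile(r"^(?P<year>\d{4})[-/](?P<month>\d{1,2})[-/](?P<day>\d{1,2})$"),
    # e.g. "9 July 2001" or "9 Jul. 2001"
    re.compile(r"^(?P<day>\d{1,2})\s+(?P<month>[A-Za-z]{3,9})\.?,?\s+(?P<year>\d{4})$"),
    # e.g. "July 9, 2001" or "Jul. 9, 2001"
    re.compile(r"^(?P<month>[A-Za-z]{3,9})\.?\s+(?P<day>\d{1,2}),?\s+(?P<year>\d{4})$"),
]


def standardize_date(entry: str) -> Optional[str]:
    """Return the date in YYYY-MM-DD form, or None if no pattern applies."""
    text = entry.strip()
    for pattern in DATE_PATTERNS:
        match = pattern.match(text)
        if match is None:
            continue
        raw_month = match.group("month")
        if raw_month.isdigit():
            month = int(raw_month)
        else:
            # Resolve the month name through the external reference table.
            month = MONTHS.get(raw_month.lower())
            if month is None:
                continue
        day = int(match.group("day"))
        year = int(match.group("year"))
        # Simple range constraints on the extracted elements; this sketch
        # does not check the number of days in each month.
        if not (1 <= month <= 12 and 1 <= day <= 31):
            continue
        # Restructure the extracted elements into one standard format.
        return f"{year:04d}-{month:02d}-{day:02d}"
    return None
```

With this sketch, entries such as "2001/7/9", "9 July 2001", and "Jul. 9, 2001" all map to the same standardized value, while an entry whose month is not found in the reference table is rejected with `None`.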
This work is partially supported by a grant from the Research Grant Council of the Hong Kong Special Administrative Region, China (HKUST6092/99E) and a grant from the National 973 Project of China (No. G1998030414).
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Lu, H., Tian, Z. (2001). Attribute Value Extraction and Standardization in Data Integration. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_31
DOI: https://doi.org/10.1007/3-540-47714-4_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42298-3
Online ISBN: 978-3-540-47714-3
eBook Packages: Springer Book Archive