Abstract
The ubiquity of textual information reflects its significance in knowledge discovery. In real-life applications, however, effective use of textual material is frequently hampered by incomplete data. In this paper, we apply a closest fit approach to missing values in textual attributes. To evaluate the closeness of texts in this application, we take an order perspective on text similarity and propose a hybrid order-semisensitive measure, M-similarity, to capture the proximity of texts. The measure combines single item matching, maximum sequence matching and potential matching, striking a balance between the use of sequence information and efficiency. We incorporate M-similarity into two closest fit methods for missing values in textual attributes and evaluate them on Traditional Chinese Medicine (TCM) data sets. Experimental results illustrate the effectiveness of these methods with M-similarity.
This work is supported by China 973 project: 2003CB317006
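The abstract does not define M-similarity, so the following is only a rough sketch of the general idea it describes: blend an order-insensitive score with an order-sensitive one, then fill a missing textual value from the most similar complete record. The function names, the weight `alpha`, the use of Jaccard overlap and longest common subsequence, and the toy records are all assumptions made for illustration. The "potential matching" component is omitted because the abstract gives no detail on it.

```python
from typing import Dict, List, Optional, Sequence

Record = Dict[str, Optional[str]]


def item_overlap(a: Sequence[str], b: Sequence[str]) -> float:
    """Order-insensitive part (assumed): Jaccard overlap of distinct tokens."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Order-sensitive part (assumed): longest common subsequence length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def hybrid_similarity(a: Sequence[str], b: Sequence[str], alpha: float = 0.5) -> float:
    """Weighted blend of the two parts; NOT the paper's M-similarity."""
    if not a and not b:
        return 1.0
    sequence_score = lcs_length(a, b) / max(len(a), len(b))
    return alpha * item_overlap(a, b) + (1 - alpha) * sequence_score


def closest_fit_fill(records: List[Record], text_attr: str, target_attr: str,
                     alpha: float = 0.5) -> List[Record]:
    """Copy target_attr from the most similar record that has a value for it."""
    donors = [r for r in records if r.get(target_attr) is not None]
    filled = []
    for record in records:
        record = dict(record)  # work on a copy, leave the input untouched
        if record.get(target_attr) is None and donors:
            tokens = (record.get(text_attr) or "").split()
            best = max(
                donors,
                key=lambda d: hybrid_similarity(
                    tokens, (d.get(text_attr) or "").split(), alpha
                ),
            )
            record[target_attr] = best[target_attr]
        filled.append(record)
    return filled


# Invented toy records, purely to show the call shape.
toy_records = [
    {"description": "fever cough headache", "label": "X"},
    {"description": "insomnia fatigue", "label": "Y"},
    {"description": "fever headache", "label": None},
]
completed = closest_fit_fill(toy_records, "description", "label")
```

With `alpha` close to 1 the score mostly ignores word order, and with `alpha` close to 0 it depends mostly on shared ordered runs. A single tunable weight is one simple way to get an "order-semisensitive" behaviour, though the paper may balance its components differently.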
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Feng, Y., Wu, Z., Zhou, Z. (2005). Combining an Order-Semisensitive Text Similarity and Closest Fit Approach to Textual Missing Values in Knowledge Discovery. In: Khosla, R., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2005. Lecture Notes in Computer Science, vol 3682. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11552451_130
DOI: https://doi.org/10.1007/11552451_130
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28895-4
Online ISBN: 978-3-540-31986-3
eBook Packages: Computer Science, Computer Science (R0)