Combining an Order-Semisensitive Text Similarity and Closest Fit Approach to Textual Missing Values in Knowledge Discovery

Feng, Yi; Wu, Zhaohui; Zhou, Zhongmei

doi:10.1007/11552451_130

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3682)

Included in the following conference series:

International Conference on Knowledge-Based and Intelligent Information and Engineering Systems

Abstract

The ubiquity of textual information reflects its significance in knowledge discovery. In real-life applications, however, effective use of textual material is often hampered by incomplete data. In this paper, we apply a closest fit approach to missing values in textual attributes. To evaluate the closeness of texts in this application, we present an order perspective of text similarity and propose a hybrid order-semisensitive measure, M-similarity, to capture the proximity of texts. This measure combines single item matching, maximum sequence matching and potential matching, and strikes a balance between the use of sequence information and efficiency. We incorporate M-similarity into two closest fit methods for missing values in textual attributes and evaluate them on data sets from Traditional Chinese Medicine (TCM). Experimental results illustrate the effectiveness of these methods with M-similarity.
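
To make the closest fit idea in the abstract concrete, the following is a minimal Python sketch under stated assumptions. It is not the paper's M-similarity, whose definition is not given here. The function `hybrid_similarity` is a hypothetical stand-in that mixes an order-insensitive token overlap with an order-sensitive longest common subsequence score, and it omits the "potential matching" component entirely. The weight `order_weight`, the attribute names and the sample records are invented for illustration.

```python
from typing import Dict, List, Optional, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def hybrid_similarity(s: str, t: str, order_weight: float = 0.5) -> float:
    """Illustrative similarity in [0, 1]. An assumption, NOT the paper's M-similarity."""
    a, b = s.split(), t.split()
    if not a or not b:
        return 0.0
    # Order-insensitive part: Jaccard overlap of distinct tokens.
    sa, sb = set(a), set(b)
    item_score = len(sa & sb) / len(sa | sb)
    # Order-sensitive part: LCS length normalized by the longer text.
    seq_score = lcs_length(a, b) / max(len(a), len(b))
    return (1 - order_weight) * item_score + order_weight * seq_score


Record = Dict[str, Optional[str]]


def closest_fit_fill(records: List[Record], target_attr: str, reference_attr: str) -> List[Record]:
    """Return copies of records, filling a missing target_attr from the
    complete record whose reference_attr text is most similar."""
    donors = [r for r in records if r.get(target_attr) and r.get(reference_attr)]
    filled = []
    for r in records:
        r = dict(r)  # copy so the input list is left unchanged
        if not r.get(target_attr) and r.get(reference_attr) and donors:
            best = max(
                donors,
                key=lambda d: hybrid_similarity(r[reference_attr], d[reference_attr]),
            )
            r[target_attr] = best[target_attr]
        filled.append(r)
    return filled


# Hypothetical records: the third one is missing its "formula" value.
records = [
    {"symptoms": "headache fever cough", "formula": "A"},
    {"symptoms": "cough fever headache", "formula": "B"},
    {"symptoms": "headache fever chills", "formula": None},
]
filled = closest_fit_fill(records, target_attr="formula", reference_attr="symptoms")
print(filled[2]["formula"])
```

In this sketch the first two records share the same tokens with the incomplete record, so token overlap alone cannot separate them. The sequence term breaks the tie in favour of the record whose token order agrees, which is the kind of trade-off an order-semisensitive measure is meant to address.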

This work is supported by the China 973 project 2003CB317006.

Author information

Authors and Affiliations

College of Computer Science, Zhejiang University, Hangzhou, 310027, P.R. China
Yi Feng, Zhaohui Wu & Zhongmei Zhou

Editor information

Editors and Affiliations

School of Business, La Trobe University, 3086, Melbourne, Victoria, Australia
Rajiv Khosla
Centre for SMART systems Engineering Research Centre, University of Brighton, BN2 4GJ, Moulsecoomb, Brighton, UK
Robert J. Howlett
School of Electrical and Information Engineering, Knowledge Based Intelligent Engineering Systems Centre, University of South Australia, 5095, Mawson Lakes, SA, Australia
Lakhmi C. Jain

About this paper

Cite this paper

Feng, Y., Wu, Z., Zhou, Z. (2005). Combining an Order-Semisensitive Text Similarity and Closest Fit Approach to Textual Missing Values in Knowledge Discovery. In: Khosla, R., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2005. Lecture Notes in Computer Science, vol 3682. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11552451_130

DOI: https://doi.org/10.1007/11552451_130
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-28895-4
Online ISBN: 978-3-540-31986-3
eBook Packages: Computer Science, Computer Science (R0)
