Efficient Subject-Oriented Evaluating and Mining Methods for Data with Schema Uncertainty

Wang, Yue; Tang, Changjie; Wang, Tengjiao; Yang, Dongqing; Zhu, Jun

doi:10.1007/978-3-642-25853-4_25


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7120)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

With the progress of data collection methods, large volumes of data have been gathered in application fields such as medical science, meteorology, and electronic commerce. Analyzing these data requires integrating heterogeneous data sets. For historical reasons, both technical and non-technical, the schemas of the data sets to be integrated are usually complex and differ from one another, so analyzing the integrated data may produce ambiguous results. This paper targets mining this kind of data. Its main contributions are: (1) it proposes schema uncertainty to describe data with non-uniform schemas, and the couple correlation degree (Cor) to evaluate the existence probabilities of records in data with schema uncertainty, based on the subject of analysis; (2) it designs a data structure, the "B-correlation tree", to organize uncertain data hierarchically together with their existence probabilities, and discusses how selecting nodes at different levels of the B-correlation tree affects the distribution; (3) it proposes an efficient Monte Carlo algorithm for analyzing uncertain data, MonteCarlo-evaluate (MCE), based on the B-correlation tree, for data with schema uncertainty; (4) it analyzes the accuracy and convergence of MCE theoretically; (5) it implements a prototype system using the B-correlation tree and MCE on real medical data and synthetic TPC-H benchmark [20] data, and reports experiments testing the effectiveness and efficiency of the proposed methods. The experimental results show that the proposed methods evaluate schema uncertainty efficiently and are therefore suited to analyzing large-scale data with schema uncertainty.
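As a rough illustration of the general idea behind Monte Carlo evaluation over records that carry existence probabilities, the sketch below estimates the expected sum of a numeric attribute by sampling possible worlds. It is not the paper's MCE algorithm and does not use the B-correlation tree or the Cor measure, whose details are not given in this abstract. The function names, the independence assumption between records, and the sample data are all hypothetical.

```python
import random

def monte_carlo_expected_sum(records, num_samples=10_000, seed=0):
    """Estimate E[sum of values over the records that exist].

    records: list of (value, existence_probability) pairs.
    Assumption (not from the paper): records exist independently.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        # Draw one possible world: each record is present with its own probability.
        world_sum = sum(v for v, p in records if rng.random() < p)
        total += world_sum
    return total / num_samples

def exact_expected_sum(records):
    """Closed form for the same quantity, by linearity of expectation."""
    return sum(v * p for v, p in records)

# Hypothetical records: (value, existence probability).
records = [(10.0, 0.9), (20.0, 0.5), (5.0, 0.2)]
estimate = monte_carlo_expected_sum(records)
exact = exact_expected_sum(records)
print(f"estimate={estimate:.3f} exact={exact:.3f}")
```

For a plain sum the closed form makes sampling unnecessary; sampling becomes useful for aggregates with no simple closed form, and the standard error of such an estimate shrinks in proportion to the inverse square root of the number of samples.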

This work is supported by the National Key Technology R&D Program of China (No. 2009BAK63B08), the National High Technology Research and Development Program of China ('863' Program) (No. 2009AA01Z150), the National Science & Technology Pillar Program of China (No. 2009BAH44B03), and the China Postdoctoral Science Foundation.

References

1. Aggarwal, C.C.: Managing and Mining Uncertain Data. In: Advances in Database Systems (2009)
2. Sarma, A.D., Benjelloun, O., Halevy, A., Widom, J.: Working Models for Uncertain Data. In: ICDE 2006 (2006)
3. Cavallo, R., Pittarelli, M.: The Theory of Probabilistic Databases. In: Proceedings of the 13th VLDB Conference, Brighton (1987)
4. Barbará, D., Garcia-Molina, H., Porter, D.: The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering 4(5), 487–502 (1992)
5. Chatfield, C.: Model Uncertainty, Data Mining and Statistical Inference. Journal of the Royal Statistical Society. Series A (Statistics in Society) 158(3), 419–466 (1995)
6. Dong, X.L., Halevy, A., Yu, C.: Data integration with uncertainty. The VLDB Journal 18(2), 469–500 (2009)
7. Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., Zuefle, A.: Probabilistic Frequent Itemset Mining in Uncertain Databases. In: Proc. 15th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD 2009), Paris, France (2009)
8. http://infolab.stanford.edu/trio
9. http://www.almaden.ibm.com/cs/projects/avatar
10. http://www.math.ups.edu/~anierman/umich/prodb
11. Khoussainova, N., Balazinska, M., Suciu, D.: Towards Correcting Input Data Errors Probabilistically Using Integrity Constraints. In: MobiDE 2006, June 25 (2006)
12. Jayram, T.S., McGregor, A.: Estimating Statistical Aggregates on Probabilistic Data Streams. In: PODS 2007, June 11-14 (2007)
13. Metropolis, N., Ulam, S.: The Monte Carlo Method. Journal of the American Statistical Association 44(247), 335–341 (1949)
14. Stigler, S.M.: A Historical View of Statistical Concepts in Psychology and Educational Research. American Journal of Education 101(1), 60–70 (1992)
15. Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C., Haas, P.J.: MCDB: a monte carlo approach to managing uncertain data. In: Proceedings of the 2008 ACM SIGMOD (2008)
16. Xu, F., Beyer, K., Ercegovac, V., Haas, P.J., Shekita, E.J.: E = MC3: managing uncertain enterprise data in a cluster-computing environment. In: Proceedings of the 2009 ACM SIGMOD (2009)
17. Li, Y., Algarni, A., Zhong, N.: Mining positive and negative patterns for relevance feature discovery. In: KDD 2010 Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010)
18. Renyi, A.: Probability theory. North-Holland, Amsterdam (1970)
19. Karp, R., Luby, M.: Monte-Carlo Algorithms for Enumeration and Reliability Problems. In: 24th STOC, pp. 56–64 (1983)
20. http://www.tpc.org/tpch/
21. Yang, H., Cai, H.: Clinicopatholog analysis on 46 inborn anencephaluses. Chinese Journal of Birth Health and Heredity (2008)

Author information

Authors and Affiliations

Key Laboratory of High Confidence Software Technologies, Ministry of Education, Peking University, China
Yue Wang, Tengjiao Wang & Dongqing Yang
School of Computer Science, Sichuan University, Chengdu, 610065, China
Changjie Tang
China Birth Defect Monitoring Centre, Sichuan University, Chengdu, 610065, China
Jun Zhu

Editor information

Editors and Affiliations

Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
Jie Tang & Jianyong Wang
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, SAR, China
Irwin King
Faculty of Engineering and Information Technology, University of Technology, 2007, Sydney, NSW, Australia
Ling Chen

About this paper

Cite this paper

Wang, Y., Tang, C., Wang, T., Yang, D., Zhu, J. (2011). Efficient Subject-Oriented Evaluating and Mining Methods for Data with Schema Uncertainty. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25853-4_25

DOI: https://doi.org/10.1007/978-3-642-25853-4_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25852-7
Online ISBN: 978-3-642-25853-4
eBook Packages: Computer Science, Computer Science (R0)
