Abstract
As data collection methods advance, large volumes of data have been collected in application fields such as medical science, meteorology, and electronic commerce. Analyzing these data requires integrating multiple heterogeneous data sets. For historical reasons, both technical and non-technical, the schemas of the data sets to be integrated are usually complex and differ from one another, so analyzing the integrated data may yield ambiguous results because of the non-uniform schemas. This paper targets mining this kind of data. Its main contributions are: (1) it proposes schema uncertainty to describe data with non-uniform schemas, and the couple correlation degree (Cor) to evaluate, with respect to the subject of analysis, the existence probabilities of records in data with schema uncertainty; (2) it designs a data structure, the "B-correlation tree", that organizes uncertain data hierarchically together with their existence probabilities, and discusses how selecting nodes at different levels of the B-correlation tree affects the distribution; (3) it proposes an efficient Monte Carlo algorithm for analyzing uncertain data, MonteCarlo-evaluate (MCE), based on the B-correlation tree, for data with schema uncertainty; (4) it analyzes the accuracy and convergence of MCE theoretically; (5) it implements a prototype system using the B-correlation tree and MCE on real medical data and synthetic TPC-H benchmark [20] data, and reports experiments testing the effectiveness and efficiency of the proposed methods. The experimental results show that the proposed methods evaluate schema uncertainty in data efficiently and are therefore suited to analyzing large-scale data with schema uncertainty.
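To make the idea of Monte Carlo evaluation over records with existence probabilities concrete, here is a minimal Python sketch. It is not the paper's MCE algorithm and does not use a B-correlation tree or Cor. It assumes, purely for illustration, that each record carries a numeric value and an independent existence probability, and it estimates the expected sum by sampling possible worlds. The names `mc_expected_sum`, `exact_expected_sum`, and the sample data are hypothetical.

```python
import random
from typing import List, Tuple

# Each record is (value, existence_probability). Independence between
# records is an assumption of this sketch, not a claim about the paper.
Record = Tuple[float, float]


def mc_expected_sum(records: List[Record], n_samples: int = 10_000, seed: int = 0) -> float:
    """Estimate the expected sum of values by sampling possible worlds."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        # Draw one possible world: keep each record with its probability.
        world_sum = 0.0
        for value, prob in records:
            if rng.random() < prob:
                world_sum += value
        total += world_sum
    return total / n_samples


def exact_expected_sum(records: List[Record]) -> float:
    """Closed-form expectation, available here only because the aggregate is linear."""
    return sum(value * prob for value, prob in records)


# Hypothetical data: three records with different existence probabilities.
records = [(10.0, 0.9), (20.0, 0.5), (30.0, 0.1)]
estimate = mc_expected_sum(records)
exact = exact_expected_sum(records)
print(f"Monte Carlo estimate: {estimate:.3f}, exact expectation: {exact:.3f}")
```

For a linear aggregate such as a sum the closed form makes sampling unnecessary; it is included only as a check on the estimate. Sampling becomes useful for aggregates without a simple closed form, where the error of the estimate shrinks roughly in proportion to the inverse square root of the number of samples.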
This work is supported by the National Key Technology R&D Program of China (No. 2009BAK63B08), the National High Technology Research and Development Program of China ('863' Program) (No. 2009AA01Z150), the National Science & Technology Pillar Program of China (No. 2009BAH44B03), and the China Postdoctoral Science Foundation.
References
Aggarwal, C.C.: Managing and Mining Uncertain Data. In: Advances In Database Systems (2009)
Sarma, A.D., Benjelloun, O., Halevy, A., Widom, J.: Working Models for Uncertain Data. In: ICDE 2006 (2006)
Cavallo, R., Pittarelli, M.: The Theory of Probabilistic Databases. In: Proceedings of the 13th VLDB Conference, Brighton (1987)
Barbará, D., Garcia-Molina, H., Porter, D.: The management of probabilistic data. IEEE Transactions on Knowledge and Data Engineering 4(5), 487–502 (1992)
Chatfield, C.: Model Uncertainty, Data Mining and Statistical Inference. Journal of the Royal Statistical Society. Series A (Statistics in Society) 158(3), 419–466 (1995)
Dong, X.L., Halevy, A., Yu, C.: Data integration with uncertainty. The VLDB Journal 18(2), 469–500 (2009)
Bernecker, T., Kriegel, H.-P., Renz, M., Verhein, F., Zuefle, A.: Probabilistic Frequent Itemset Mining in Uncertain Databases. In: Proc. 15th ACM SIGKDD Conf. on Knowledge Discovery and Data Mining (KDD 2009), Paris, France (2009)
Khoussainova, N., Balazinska, M., Suciu, D.: Towards Correcting Input Data Errors Probabilistically Using Integrity Constraints. In: MobiDE 2006, June 25 (2006)
Jayram, T.S., McGregor, A.: Estimating Statistical Aggregates on Probabilistic Data Streams. In: PODS 2007, June 11-14 (2007)
Metropolis, N., Ulam, S.: The Monte Carlo Method. Journal of the American Statistical Association 44(247), 335–341 (1949)
Stigler, S.M.: A Historical View of Statistical Concepts in Psychology and Educational Research. American Journal of Education 101(1), 60–70 (1992)
Jampani, R., Xu, F., Wu, M., Perez, L.L., Jermaine, C., Haas, P.J.: MCDB: a monte carlo approach to managing uncertain data. In: Proceedings of the 2008 ACM SIGMOD (2008)
Xu, F., Beyer, K., Ercegovac, V., Haas, P.J., Shekita, E.J.: E = MC3: managing uncertain enterprise data in a cluster-computing environment. In: Proceedings of the 2009 ACM SIGMOD (2009)
Li, Y., Algarni, A., Zhong, N.: Mining positive and negative patterns for relevance feature discovery. In: KDD 2010 Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2010)
Renyi, A.: Probability Theory. North-Holland, Amsterdam (1970)
Karp, R., Luby, M.: Monte-Carlo Algorithms for Enumeration and Reliability Problems. In: 24th STOC, pp. 56–64 (1983)
Yang, H., Cai, H.: Clinicopatholog analysis on 46 inborn anencephaluses. Chinese Journal of Birth Health and Heredity (2008)
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, Y., Tang, C., Wang, T., Yang, D., Zhu, J. (2011). Efficient Subject-Oriented Evaluating and Mining Methods for Data with Schema Uncertainty. In: Tang, J., King, I., Chen, L., Wang, J. (eds) Advanced Data Mining and Applications. ADMA 2011. Lecture Notes in Computer Science, vol 7120. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25853-4_25
DOI: https://doi.org/10.1007/978-3-642-25853-4_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-25852-7
Online ISBN: 978-3-642-25853-4
eBook Packages: Computer Science, Computer Science (R0)