Abstract
Recently, mining frequent substructures from XML data has attracted considerable interest. Various methods have been proposed and examined for mining frequent patterns from XML documents efficiently and effectively. While many of the frequent XML patterns generated are useful and interesting, a large portion of them is commonly not interesting or significant for the application at hand. In this paper, we present a systematic approach for ascertaining whether the discovered XML patterns are significant rather than merely coincidental associations, and we provide a precise statistical approach to support this framework. The proposed strategy combines data mining and statistical measurement techniques to discard the non-significant patterns. We consider the “Prions” database, which describes the protein instances stored for Human Prions Protein. The proposed unified framework is applied to this dataset to demonstrate its effectiveness in assessing the interestingness of discovered XML patterns by statistical means. When the dataset is used for classification or prediction, the proposed approach discards non-significant XML patterns without reducing the accuracy of the pattern set as a whole.
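The idea of discarding patterns that are not statistically associated with a class can be illustrated with a minimal sketch. The abstract does not name the statistical test used, so this example substitutes a Pearson chi-squared test of independence on a 2x2 contingency table (pattern present/absent versus target class/other class). The function names `chi2_2x2` and `significant_patterns`, the record layout, and the threshold `alpha=0.05` are all assumptions made for illustration and are not taken from the paper.

```python
import math


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic and p-value (1 degree of freedom)
    for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        # Degenerate table: treat as no evidence of association.
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-squared variable with 1 d.o.f.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def significant_patterns(records, patterns, target, alpha=0.05):
    """Keep patterns whose presence is associated with the target class.

    records: list of (set_of_pattern_ids, class_label) pairs (assumed layout).
    """
    kept = []
    for pat in patterns:
        a = sum(1 for pats, label in records if pat in pats and label == target)
        b = sum(1 for pats, label in records if pat in pats and label != target)
        c = sum(1 for pats, label in records if pat not in pats and label == target)
        d = sum(1 for pats, label in records if pat not in pats and label != target)
        _, p_value = chi2_2x2(a, b, c, d)
        if p_value < alpha:
            kept.append(pat)
    return kept
```

In practice each record would correspond to one XML document (or transaction) together with the set of frequent subtrees it contains. With many candidate patterns, a correction for multiple testing would likely be needed as well; this sketch omits it.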
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Shaharanee, I.N.M., Hadzic, F., Dillon, T.S. (2010). A Statistical Interestingness Measures for XML Based Association Rules. In: Zhang, BT., Orgun, M.A. (eds) PRICAI 2010: Trends in Artificial Intelligence. PRICAI 2010. Lecture Notes in Computer Science, vol 6230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15246-7_20
DOI: https://doi.org/10.1007/978-3-642-15246-7_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15245-0
Online ISBN: 978-3-642-15246-7
eBook Packages: Computer Science, Computer Science (R0)