Abstract
The volume of unstructured data has grown sharply with the arrival of the Big Data era. The decision tree is one of the most widely used classification models, but it is designed for structured data. Unstructured data such as text must therefore be converted to a structured format before it can be analyzed with a decision tree model. In this paper, we discuss how to construct decision trees for datasets containing unstructured data. For that purpose, a decision tree construction algorithm called CUST is proposed, which can handle unstructured data directly. CUST introduces splitting criteria formed from unstructured attribute values and reduces the number of scans over the dataset by designing appropriate data structures. Experiments on real-world datasets show that CUST improves the efficiency of building classifiers for unstructured data and performs as well as, if not better than, existing solutions in classification accuracy.
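To make the idea of a splitting criterion formed from unstructured attribute values more concrete, the following is a minimal illustrative sketch and not the CUST algorithm itself, whose details the abstract does not give. It assumes each record's text attribute has already been tokenized into a set of terms, uses a "record contains term t" test as the candidate split, and scores candidates by information gain. The function names `entropy` and `best_term_split` and the toy data are hypothetical. A single pass over the records collects per-term class counts, so every candidate split can be scored without rescanning the data.

```python
import math
from collections import Counter


def entropy(class_counts):
    """Shannon entropy (in bits) of a class-count mapping."""
    total = sum(class_counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log2(c / total)
        for c in class_counts.values()
        if c > 0
    )


def best_term_split(docs, labels):
    """Pick the term whose presence/absence test gives the highest information gain.

    docs   : list of sets of terms (hypothetical pre-tokenized text attribute)
    labels : list of class labels, same length as docs
    Returns (term, gain); term is None if no split improves on the parent node.
    """
    n = len(docs)
    class_totals = Counter(labels)
    base = entropy(class_totals)

    # Single scan of the data: class counts for the records containing each term.
    term_class = {}
    for terms, label in zip(docs, labels):
        for t in terms:
            term_class.setdefault(t, Counter())[label] += 1

    best_term, best_gain = None, 0.0
    # Sorted iteration makes tie-breaking deterministic.
    for t in sorted(term_class):
        with_t = term_class[t]
        without_t = class_totals - with_t  # counts for records lacking the term
        n_with = sum(with_t.values())
        n_without = n - n_with
        split_entropy = (
            (n_with / n) * entropy(with_t)
            + (n_without / n) * entropy(without_t)
        )
        gain = base - split_entropy
        if gain > best_gain:
            best_term, best_gain = t, gain
    return best_term, best_gain


if __name__ == "__main__":
    docs = [
        {"cheap", "pills"},
        {"cheap", "offer"},
        {"meeting", "notes"},
        {"project", "meeting"},
    ]
    labels = ["spam", "spam", "ham", "ham"]
    print(best_term_split(docs, labels))
```

In this toy example both "cheap" and "meeting" separate the two classes perfectly; the sketch returns the alphabetically first of the tied terms. A real tree builder would apply such a test recursively to each partition and would need a stopping rule, neither of which is shown here.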
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Gong, S., Liu, H. (2014). Constructing Decision Trees for Unstructured Data. In: Luo, X., Yu, J.X., Li, Z. (eds) Advanced Data Mining and Applications. ADMA 2014. Lecture Notes in Computer Science, vol 8933. Springer, Cham. https://doi.org/10.1007/978-3-319-14717-8_37
DOI: https://doi.org/10.1007/978-3-319-14717-8_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-14716-1
Online ISBN: 978-3-319-14717-8
eBook Packages: Computer Science (R0)