Abstract
Concept drift has regained research interest in recent years, as many applications rely on data sources that change over time. We study text classification with logistic regression on a large news collection of 248K texts spanning seven years. We present extrinsic methods for detecting and quantifying concept drift, based on forming training sets with different windowing techniques. We characterize concept drift on a seven-year Le Monde news corpus and show that classifier performance is overestimated when drift is neglected. For future work, we plan to refine the extrinsic characterization methods and to investigate how learning parameters drift when few examples are available.
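The abstract does not say which windowing techniques are used, so the following is only a minimal sketch of two common ways to form a training set from time-stamped documents: a fixed-width sliding window and a growing window. The function names, the integer period encoding, and the toy data are illustrative assumptions and are not taken from the paper.

```python
from typing import List, Sequence


def sliding_window(periods: Sequence[int], test_period: int, width: int) -> List[int]:
    """Indices of documents from the `width` periods just before `test_period`.

    Hypothetical helper: `periods[i]` is the time period (e.g. month index)
    of document i. Only documents strictly older than the test period are used.
    """
    start = test_period - width
    return [i for i, p in enumerate(periods) if start <= p < test_period]


def growing_window(periods: Sequence[int], test_period: int) -> List[int]:
    """Indices of all documents published before `test_period`."""
    return [i for i, p in enumerate(periods) if p < test_period]


# Toy example (assumed data): eight documents spread over four periods.
doc_periods = [0, 0, 1, 1, 2, 2, 3, 3]

recent_only = sliding_window(doc_periods, test_period=3, width=2)
all_history = growing_window(doc_periods, test_period=3)
```

Training one classifier on `recent_only` and another on `all_history`, then evaluating both on documents from period 3, is one way to compare windowing choices. A randomly shuffled train/test split mixes periods and so can hide drift, which may produce the kind of overestimation the abstract mentions.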
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Šilić, A., Dalbelo Bašić, B. (2012). Exploring Classification Concept Drift on a Large News Text Corpus. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2012. Lecture Notes in Computer Science, vol 7181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28604-9_35
DOI: https://doi.org/10.1007/978-3-642-28604-9_35
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28603-2
Online ISBN: 978-3-642-28604-9
eBook Packages: Computer Science (R0)