Feature selection for classifying multi-labeled past events

International Journal on Digital Libraries


Abstract

The study and analysis of past events can provide numerous benefits. Although event categorization has been studied before, previous work has usually assigned only one category to each event. In this study, we focus on multi-label classification of past events, which is a more general and challenging problem than those addressed in previous studies. We categorize events into thirteen different types using a range of diverse features and classifiers trained on a dataset that has at least 50 labeled news articles for each category. We confirmed that training classifiers on all the features yields statistically significant improvements in micro- and macro-averaged \(F_1\), multi-label accuracy, average precision@5, area under the receiver operating characteristic curve, and example-based loss functions.
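To make the difference between the averaged measures concrete, the following minimal sketch computes micro- and macro-averaged \(F_1\) and an example-based multi-label accuracy on toy data. The label names, the toy predictions, and the function names are illustrative assumptions and are not taken from the paper. The accuracy here uses the Jaccard-style definition common in the multi-label literature, which may differ in detail from the definition used in the study.

```python
from typing import List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]],
                   labels: List[str]) -> Tuple[float, float]:
    """Micro-F1 pools counts over all labels; macro-F1 averages per-label F1."""
    per_label = []
    tot_tp = tot_fp = tot_fn = 0
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if label in t and label in p)
        fp = sum(1 for t, p in zip(y_true, y_pred) if label not in t and label in p)
        fn = sum(1 for t, p in zip(y_true, y_pred) if label in t and label not in p)
        per_label.append(f1(tp, fp, fn))
        tot_tp, tot_fp, tot_fn = tot_tp + tp, tot_fp + fp, tot_fn + fn
    micro = f1(tot_tp, tot_fp, tot_fn)
    macro = sum(per_label) / len(per_label)
    return micro, macro


def multilabel_accuracy(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Mean of |T ∩ P| / |T ∪ P| over examples (assumed definition)."""
    scores = []
    for t, p in zip(y_true, y_pred):
        union = t | p
        scores.append(len(t & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


# Hypothetical labels and predictions, for illustration only.
LABELS = ["politics", "economy", "disaster"]
Y_TRUE = [{"politics", "economy"}, {"disaster"}, {"economy"}]
Y_PRED = [{"politics"}, {"disaster", "economy"}, {"economy"}]

micro, macro = micro_macro_f1(Y_TRUE, Y_PRED, LABELS)
accuracy = multilabel_accuracy(Y_TRUE, Y_PRED)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f} accuracy={accuracy:.3f}")
```

Micro-averaging weights every label decision equally, so frequent categories dominate the score, whereas macro-averaging weights every category equally, which matters when some categories have few labeled articles.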



Notes

  1. https://en.wikipedia.org/wiki/West_African_Ebola_virus_epidemic.

  2. Usually, only very popular or important events have their own names.

  3. https://en.wikipedia.org/wiki/Portal:Current_events.

  4. We use Japanese news articles to evaluate classifications in this paper, as described in Sect. 5. Even though we did not use the listed example events in the evaluation, we show them to help readers understand what kinds of events can be assigned to each of the 13 categories.

  5. Some articles are stored in the CD-Mainichi Newspapers 2012 data (Nichigai Associates, Inc., 2012; in Japanese). The others were collected by Web crawling.

  6. https://doi.org/10.5281/zenodo.3258150. This public dataset excludes the texts of the articles to respect copyright law. However, the texts can still be obtained because the dataset includes the event IDs defined in the Mainichi Newspapers 2012 data or the URLs used for Web crawling. Thus, after buying the Mainichi Newspapers 2012 data or recrawling the URLs with the Wayback Machine (accessed on 18 June 2019), the corresponding texts can be retrieved.

  7. https://www3.nhk.or.jp/news/html/20181122/k10011720261000.html accessed on 22 Nov. 2018.

  8. https://www3.nhk.or.jp/news/html/20181117/k10011714161000.html accessed on 17 Nov. 2018.

  9. In Japanese, this term can be represented as a single word.

  10. https://radimrehurek.com/gensim/models/ldamodel.html,

    https://radimrehurek.com/gensim/models/lsimodel.html,

    https://radimrehurek.com/gensim/models/doc2vec.html and

    https://radimrehurek.com/gensim/models/word2vec.html.
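As a rough illustration of the recrawling step mentioned in note 6, the sketch below builds a Wayback Machine snapshot address from an original article URL. The helper name and the example URL are hypothetical. The default timestamp is only an assumption based on the access date given in the note, and the Wayback Machine serves the capture closest to the requested timestamp, which may not match it exactly.

```python
def wayback_snapshot_url(original_url: str, timestamp: str = "20190618") -> str:
    """Return the Wayback Machine address for a URL near a given date.

    The timestamp uses the YYYYMMDD form; the default is an assumption
    taken from the access date stated in note 6.
    """
    return f"https://web.archive.org/web/{timestamp}/{original_url}"


# Hypothetical article URL, for illustration only.
print(wayback_snapshot_url("https://example.com/news/article.html"))
```

The function performs no network request; fetching the returned address and extracting the article text are left to the crawler of choice.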


Acknowledgements

This work was supported in part by MEXT Grant-in-Aids (#17K12792, #19K20631 and #26750076). We express our gratitude to all the reviewers for their thoughtful comments.

Author information

Corresponding author

Correspondence to Yasunobu Sumikawa.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Sumikawa, Y., Ikejiri, R. Feature selection for classifying multi-labeled past events. Int J Digit Libr 22, 63–83 (2021). https://doi.org/10.1007/s00799-020-00293-5

  • DOI: https://doi.org/10.1007/s00799-020-00293-5
