
Text Categorization from category name in an industry-motivated scenario


Abstract

In this work we suggest a novel Text Categorization (TC) scenario, motivated by an ad-hoc industrial need to assign documents to a set of predefined categories when no labeled training data is available for these categories. The scenario is applicable in many industrial settings and is interesting from an academic perspective. We present a new dataset geared to the main characteristics of the scenario, and use it to investigate the name-based TC approach, which takes the category names as its only input and requires no training data. We evaluate and analyze the performance of state-of-the-art methods on this dataset, identify their shortcomings for our scenario, and suggest ways to overcome them. Specifically, we utilize statistical correlation measured over a target corpus to improve on the state of the art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show that our suggested method achieves superior performance.


Notes

  1. Available at http://u.cs.biu.ac.il/~nlp/downloads/WikiRules.html.

  2. We thereby filtered out rules whose terms do not co-occur frequently enough within Wikipedia, based on the filtering method proposed in Shnarch et al. (2009). We tuned the required threshold parameter (set to 0.01) on our development dataset, described in Sect. 3; a sketch of this filtering step is given after these notes.

  3. If the documents that should not be classified to any of the categories are not taken into consideration, the average number of labels per document is 1.39.

  4. We used a conservative approach, counting a fit only if exactly the same category was assigned by both annotators, and considering assignments to a parent category or sub-category as errors. To account for multi-label categorization, we used the adaptation suggested by Rosenberg and Binkowski (2004), according to which each document has partial membership in each of its multiple labels (see the agreement sketch after these notes).

  5. We also experimented with the multi-label scheme by assigning each document to its top-k highest-scoring categories, yet the results obtained were of the same order. More details on the bootstrapping step are reported in Sect. 5.5.

  6. In our implementation we set the LSA dimensionality to 300 (see the sketch after these notes).

  7. In this calculation we follow the micro-averaging approach commonly used in TC (see the evaluation sketch after these notes).

  8. We use this test to measure the statistical significance of all the results reported in this section.
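
As a companion to note 2, the following is a minimal sketch of the kind of co-occurrence-based rule filtering described there. The conditional-probability score and the rule/article representations are illustrative assumptions, not the exact statistic of Shnarch et al. (2009); only the 0.01 threshold comes from the note.

```python
# Illustrative sketch for note 2: drop lexical reference rules whose terms do
# not co-occur frequently enough in Wikipedia. The score below (an estimate of
# P(rhs in article | lhs in article)) is an assumption for illustration.
from typing import Iterable, List, Tuple

def cooccurrence_score(lhs: str, rhs: str, articles: Iterable[str]) -> float:
    """Estimate the probability of seeing rhs in an article that contains lhs."""
    lhs_count = 0
    joint_count = 0
    for text in articles:
        if lhs in text:
            lhs_count += 1
            if rhs in text:
                joint_count += 1
    return joint_count / lhs_count if lhs_count else 0.0

def filter_rules(rules: List[Tuple[str, str]],
                 articles: List[str],
                 threshold: float = 0.01) -> List[Tuple[str, str]]:
    """Keep only rules whose terms co-occur often enough in the corpus."""
    return [(lhs, rhs) for lhs, rhs in rules
            if cooccurrence_score(lhs, rhs, articles) >= threshold]

# Toy usage: the rule (piglet -> animals) survives, (piglet -> golf) is dropped.
articles = ["a farmer wins a piglet and other animals at the fair",
            "golf is played on a large outdoor course"]
print(filter_rules([("piglet", "animals"), ("piglet", "golf")], articles))
```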
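Similarly, for note 4, here is a minimal sketch of inter-annotator agreement under the partial-membership adaptation of Rosenberg and Binkowski (2004), assuming each of a document's k labels carries weight 1/k; the exact weighting and the observed-agreement formula are assumptions for illustration, not the authors' published computation.

```python
# Sketch for note 4: exact category matches count as a fit; multi-label
# documents are handled with partial membership (weight 1/k per label).
from typing import Dict, Set

def observed_agreement(annot_a: Dict[str, Set[str]],
                       annot_b: Dict[str, Set[str]]) -> float:
    """Average per-document overlap, with partial membership per label."""
    total = 0.0
    for doc_id, labels_a in annot_a.items():
        labels_b = annot_b.get(doc_id, set())
        if not labels_a and not labels_b:
            total += 1.0           # both annotators left the document unlabeled
            continue
        if not labels_a or not labels_b:
            continue               # one side unlabeled: no credit
        shared = labels_a & labels_b
        # each label carries partial membership 1/k for its annotator
        total += 0.5 * (len(shared) / len(labels_a) + len(shared) / len(labels_b))
    return total / len(annot_a)

# Example usage with two hypothetical annotations of the same three plots:
a = {"doc1": {"Sport", "Football"}, "doc2": {"Crime"}, "doc3": set()}
b = {"doc1": {"Football"}, "doc2": {"Crime"}, "doc3": set()}
print(observed_agreement(a, b))    # (0.75 + 1.0 + 1.0) / 3 ≈ 0.9167
```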
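For note 6, a minimal sketch of an LSA setup with dimensionality 300, assuming a scikit-learn TF-IDF plus truncated-SVD pipeline; the library choice and vectorizer settings are assumptions, and only the value 300 comes from the note.

```python
# Sketch for note 6: a 300-dimensional LSA model built with scikit-learn
# (an assumed toolkit; only the dimensionality value comes from the note).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=300))

# With a real target corpus (a list of document strings whose vocabulary
# exceeds 300 terms), fitting yields a 300-dimensional vector per document,
# and category names can be projected into the same space for comparison:
#   doc_vectors  = lsa.fit_transform(target_corpus)
#   name_vectors = lsa.transform(category_names)
```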
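Finally, for notes 7 and 8, a minimal sketch of micro-averaged precision, recall, and F1 computed over all pooled (document, category) decisions, followed by a significance check assuming the test referred to in note 8 is the Wilcoxon signed-rank test cited in the references (Wilcoxon 1945); the scores below are made-up placeholders.

```python
# Sketch for notes 7-8: micro-averaged P/R/F1 over pooled decisions, plus a
# Wilcoxon signed-rank test on paired system scores (placeholder values).
from scipy.stats import wilcoxon

def micro_prf(gold, predicted):
    """gold, predicted: lists of label sets, one set per document."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Paired per-category scores of two systems (hypothetical placeholder values).
scores_baseline = [0.41, 0.38, 0.52, 0.47, 0.33, 0.45]
scores_improved = [0.49, 0.44, 0.55, 0.50, 0.39, 0.51]
stat, p_value = wilcoxon(scores_baseline, scores_improved)
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, p={p_value:.4f}")
```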

References

  • Ali, A., Magdy, W., & Vogel, S. (2013). A tool for monitoring and analyzing healthcare tweets. In HSD workshop, SIGIR 2013.

  • Archak, N., Ghose, A., & Ipeirotis, P. G. (2011). Deriving the pricing power of product features by mining consumer reviews. Management Science, 57(8), 1485–1509.

  • Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics, Companion volume: Short papers, NAACL-Short’09 (pp. 33–36). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1620853.1620864

  • Baroni, M., & Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 conference on empirical methods in natural language processing, EMNLP’10 (pp. 1183–1193). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1870658.1870773

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29. http://dl.acm.org/citation.cfm?id=89086.89095

  • de Buenaga Rodriguez, M., Gómez-Hidalgo, J. M., & Diaz-Agudo, B. (1997). Using WordNet to complement training information in text categorization. In N. Nicolov & R. Mitkov (Eds.), Recent advances in natural language processing II: Selected papers from the second international conference on recent advances in natural language processing (RANLP 1997), March 25–27, 1997, Stanford, CA, USA, Amsterdam Studies in the Theory and History of Linguistic Science, Series IV: Current Issues in Linguistic Theory (pp. 353–364). Amsterdam, The Netherlands: John Benjamins Publishing.

  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Diakopoulos, N., Naaman, M., & Kivran-Swaine, F. (2010). Diamonds in the rough: Social media visual analytics for journalistic inquiry. In 2010 IEEE symposium on visual analytics science and technology (VAST) (pp. 115–122). IEEE.

  • Downey, D., & Etzioni, O. (2008). Look ma, no hands: Analyzing the monotonic feature abstraction for text classification. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS-08), Vancouver, Canada. http://books.nips.cc/papers/files/nips21/NIPS2008_0054.pdf

  • Eichler, K., Gabryszak, A., & Neumann, G. (2014). An analysis of textual inference in German customer emails. In Lexical and computational semantics (*SEM 2014) (p. 69).

  • Eichler, K., Meisdrock, M., & Schmeier, S. (2012). Search and topic detection in customer requests. KI-Künstliche Intelligenz, 26(4), 419–422.

  • Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.

  • Fleischman, M., & Hovy, E. (2002). Fine grained classification of named entities. In Proceedings of the 19th international conference on computational linguistics—COLING’02, (Vol. 1, pp. 1–7). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1072228.1072358

  • Funk, A., Li, Y., Saggion, H., Bontcheva, K., & Leibold, C. (2008). Opinion analysis for business intelligence applications. In Proceedings of the first international workshop on Ontology-supported business intelligence. New York: ACM.

  • Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI (pp. 1606–1611).

  • Giampiccolo, D., Magnini, B., Dagan, I., & Dolan, B. (2007). The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, RTE’07 (pp. 1–9). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1654536.1654538

  • Glickman, O., & Dagan, I. (2005). A probabilistic setting and lexical cooccurrence model for textual entailment. In Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment, EMSEE’05, (pp. 43–48). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1631862.1631870

  • Glickman, O., Shnarch, E., & Dagan, I. (2006). Lexical reference: A semantic matching subtask. In Proceedings of the 2006 conference on empirical methods in natural language processing, EMNLP’06 (pp. 172–179). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1610075.1610103

  • Gliozzo, A., Strapparava, C., & Dagan, I. (2005). Investigating unsupervised learning for text categorization bootstrapping. In Proceedings of the conference on human language technology and empirical methods in natural language processing, HLT’05, (pp. 129–136). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1220575.1220592

  • Gliozzo, A., Strapparava, C., & Dagan, I. (2009). Improving text categorization bootstrapping via unsupervised learning. ACM Transactions on Speech and Language Processing, 6(1), 1:1–1:24. doi:10.1145/1596515.1596516

  • Grefenstette, E., & Sadrzadeh, M. (2011). Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the conference on empirical methods in natural language processing, EMNLP’11 (pp. 1394–1404). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2145432.2145580

  • Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, KDD’04, (pp. 168–177). New York, NY, USA: ACM. doi:10.1145/1014052.1014073

  • Ko, Y., & Seo, J. (2004). Learning with unlabeled data for text categorization using bootstrapping and feature projection techniques. In Proceedings of the 42nd annual meeting on association for computational linguistics, ACL’04. Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1218955.1218988

  • Li, C., Weng, J., He, Q., Yao, Y., Datta, A., Sun, A., & Lee, B. S. (2012). Twiner: Named entity recognition in targeted twitter stream. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR’12 (pp. 721–730). New York, NY, USA: ACM. doi:10.1145/2348283.2348380

  • Liu, B., Li, X., Lee, W. S., & Yu, P. S. (2004). Text classification by labeling words. In Proceedings of the 19th national conference on Artifical intelligence, AAAI’04, (pp. 425–430). San Jose: AAAI Press. http://dl.acm.org/citation.cfm?id=1597148.1597218

  • Liu, B., & Zhang, L. (2012). A survey of opinion mining and sentiment analysis. In Mining text data, (pp. 415–463). Berlin: Springer.

  • Mansuy, T., & Hilderman, R. J. (2006). A characterization of WordNet features in Boolean models for text classification. In Proceedings of the fifth Australasian conference on data mining and analytics, AusDM’06 (Vol. 61, pp. 103–109). Darlinghurst, Australia: Australian Computer Society Inc. http://dl.acm.org/citation.cfm?id=1273808.1273822

  • Maynard, D., & Funk, A. (2012). Automatic detection of political opinions in tweets. In The semantic web: ESWC 2011 workshops (pp. 88–99). Berlin: Springer.

  • McCallum, A., & Nigam, K. (1999). Text classification by bootstrapping with keywords, EM and shrinkage. In ACL’99 workshop for unsupervised learning in natural language processing (pp. 52–58).

  • Metzler, D., Dumais, S. T., & Meek, C. (2007). Similarity measures for short segments of text. In ECIR (pp. 16–27).

  • Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1439.

  • Pasca, M., & Harabagiu, S. M. (2001). The informative role of WordNet in open-domain question answering. In Proceedings of the NAACL 2001 workshop on WordNet and other lexical resources: Applications, extensions and customizations (pp. 138–143).

  • Ritter, A., Mausam, Etzioni, O., & Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD’12 (pp. 1104–1112). New York, NY, USA: ACM. doi:10.1145/2339530.2339704

  • Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In D. Marcu, S. Dumais, & S. Roukos (Eds.), HLT-NAACL 2004: Short papers (pp. 77–80). Boston, MA: Association for Computational Linguistics.

  • Ruhl, J., Datar, M., & Lee, J. (2006). Method, system and graphical user interface for providing reviews for a product. https://www.google.com/patents/US20060143158. US Patent App. 11/012,846

  • Saggion, H., & Funk, A. (2009). Extracting opinions and facts for business intelligence. RNTI Journal, 10(17), 119–146.

  • Scharl, A., & Weichselbraun, A. (2008). An automated approach to investigating the online media coverage of US presidential elections. Journal of Information Technology & Politics, 5(1), 121–132.

  • Scott, S., & Matwin, S. (1999). Feature engineering for text classification. In Proceedings of the sixteenth international conference on machine learning, ICML’99, pp. 379–388. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=645528.657484

  • Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47. doi:10.1145/505282.505283

  • Shah, C., & Croft, W. B. (2004). Evaluating high accuracy retrieval techniques. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR’04 (pp. 2–9). New York, NY, USA: ACM. doi:10.1145/1008992.1008996

  • Shnarch, E., Barak, L., & Dagan, I. (2009). Extracting lexical reference rules from wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, ACL’09 (Vol. 1, pp. 450–458). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1687878.1687942

  • Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38. http://dl.acm.org/citation.cfm?id=234285.234287

  • Socher, R., Huval, B., Manning, C. D., & Ng, A. Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, EMNLP-CoNLL’12 (pp. 1201–1211). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2390948.2391084

  • Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10, 178–185.

  • Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. doi:10.2307/3001968

  • Xu, J., & Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 4–11). New York: ACM.


Acknowledgments

This work was supported by the Next Generation Video (NeGeV) Project. We would like to thank our industrial partners, Comverse Technology Inc. and Orca Interactive Ltd. We thank Libby Barak for helping us in replicating the results of Barak et al. (2009). We thank Naomi Zeichner for preparing the taxonomy and annotating the dataset. Finally, we thank the anonymous reviewers for their useful comments and suggestions.

Author information

Corresponding author

Correspondence to Chaya Liebeskind.

Appendices

Appendix 1

1.1 Our complete IMDB taxonomy

Categories

1. Religion
   1.1. Buddhism
   1.2. Hinduism
   1.3. Christianity
        1.3.1. Christmas
   1.4. Islam
   1.5. Judaism

2. Sport
   2.1. Bicycle
   2.2. Boxing
   2.3. Fishing
   2.4. Football
   2.5. Golf
   2.6. Hockey
   2.7. Martial-arts
        2.7.1. Karate
   2.8. Athletics
   2.9. Running
   2.10. Shooting
   2.11. Skiing
   2.12. Soccer
   2.13. Water sports
        2.13.1. Surfing
        2.13.2. Swimming
   2.14. Tennis
   2.15. Baseball
   2.16. Wrestling
   2.17. Basketball
   2.18. Horseracing
   2.19. Olympic games

3. Interests (NON-CAT)
   3.1. Beach
   3.2. Outdoor
   3.3. Gardening
   3.4. Pets
   3.5. Fitness
   3.6. Cookery
   3.7. Fashion
   3.8. Computing
   3.9. Travel
   3.10. Motoring
        3.10.1. Cars
        3.10.2. Motorcycle
   3.11. Trains
   3.12. Airplanes
   3.13. Ships
   3.14. Radio
   3.15. Business
   3.16. Nature
        3.16.1. Animals
   3.17. Outer Space
   3.18. The environment
   3.19. Showbiz
   3.20. Traditions
   3.21. Infants
   3.22. Military
   3.23. Weather

4. Arts
   4.1. Cinema
   4.2. Advertising
   4.3. Theater
   4.4. Music
        4.4.1. Opera
        4.4.2. Classical music
        4.4.3. Jazz
        4.4.4. Pop/rock
        4.4.5. Country music
        4.4.6. Hip Hop
   4.5. Dance
        4.5.1. Ballet

5. Science
   5.1. Medicine
        5.1.1. Disability
   5.2. Technology
   5.3. Psychology

6. Education
   6.1. School
   6.2. College/University

7. Miscellaneous (NON-CAT)
   7.1. Crime (NON-CAT)
        7.1.1. Prison
        7.1.2. Mafia
        7.1.3. Drugs
        7.1.4. Fraud
        7.1.5. Gambling
        7.1.6. Terrorism
   7.2. Literature
   7.3. History
   7.4. Political
   7.5. Social (NON-CAT)
        7.5.1. Racism
   7.6. Legal
   7.7. Communism
   7.8. War
        7.8.1. World war 1
        7.8.2. World war 2
   7.9. Aliens
   7.10. Comic-book
   7.11. Journalism
   7.12. Mythology

Appendix 2

1.1 The annotation guidelines

You are given a list of films with their plot descriptions and a taxonomy of film categories.

The taxonomy consists of film subject matters and is arranged hierarchically, so that if a sub-category is marked, its ancestors are also considered relevant. The only exception is a category that is present solely to group similar subjects together, in which case it is marked with the text (NON-CAT) next to it.

For example, a film categorized as dealing with 'cars' will also be relevant to 'motoring', but not to 'interests', since the latter is not a true category:

3. Interests (NON-CAT)
   3.9. Travel
   3.10. Motoring
        3.10.1. Cars
        3.10.2. Motorcycle

Note: the taxonomy is not exhaustive; you may find that no category in the taxonomy accurately fits a film, even though you can think of a subject matter that does. If a broader relevant category is present, choose it; otherwise choose none.

For each film, you must decide which categories (if any) out of the taxonomy are relevant to it. You can choose as many or as few categories as you see fit, or none.

Note: if you find more than one relevant category, please put each category on a separate line (insert lines if necessary).

You must categorize according to the following guidelines:

  1. The category's subject matter must be prominent in the background story, not just a passing reference.

     Examples:

     (a) The following film should be categorized as relevant to 'crime':

         "Jessie is an ageing career criminal who has been in more jails, fights, schemes, and lineups than just about anyone else. His son Vito, while currently on the straight and narrow, has had a fairly shady past and is indeed no stranger to illegal activity. They both have great hope for Adam, Vito's son and Jessie's grandson, who is bright, good-looking, and without a criminal past. So when Adam approaches Jessie with a scheme for a burglary he's shocked, but not necessarily disinterested…"

     (b) The following film should be categorized as relevant to 'animals':

         "Farmer Hoggett wins a runt piglet at a local fair and young Babe, as the piglet decides to call himself, befriends and learns about all the other creatures on the farm. He becomes special friends with one of the sheepdogs, Fly. With Fly's help, and Farmer Hoggett's intuition, Babe embarks on a career in sheepherding with some surprising and spectacular results. Babe is a little pig who doesn't quite know his place in the world. With a bunch of odd friends, like Ferdinand the duck who thinks he is a rooster and Fly the dog he calls mom, Babe realizes that he has the makings to become the greatest sheep pig of all time, and Farmer Hoggett knows it. With the help of the sheep dogs Babe learns that a pig can be anything that he wants to be."

     (c) The following film should not be categorized as relevant to 'basketball':

         "This gritty drama follows two high school acquaintances, Hancock, a basketball star, and Danny, a geek turned drifter, after they graduate. The first film commissioned by the Sundance Film Festival, it portrays the other half of the American dream, as Hancock and his cheerleader girlfriend Mary wander toward a middle-class mediocrity that is itself out of reach for Danny and his psychotic wife Bev."

  2. You must not base your decision on prior knowledge of the film, only on information provided in the plot.


About this article

Cite this article

Liebeskind, C., Kotlerman, L. & Dagan, I. Text Categorization from category name in an industry-motivated scenario. Lang Resources & Evaluation 49, 227–261 (2015). https://doi.org/10.1007/s10579-015-9298-3

