Abstract
In this work we propose a novel Text Categorization (TC) scenario, motivated by an ad-hoc industrial need to assign documents to a set of predefined categories when no labeled training data is available for those categories. The scenario is applicable in many industrial settings and is also of academic interest. We present a new dataset geared to the main characteristics of the scenario, and use it to investigate the name-based TC approach, which takes the category names as its only input and requires no training data. We evaluate and analyze the performance of state-of-the-art methods on this dataset to identify their shortcomings in our scenario, and suggest ways of overcoming them. We use statistical correlation measured over a target corpus to improve on the state of the art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show that our suggested method achieves superior performance.






Notes
Available at http://u.cs.biu.ac.il/~nlp/downloads/WikiRules.html.
If the documents that should not be classified to any of the categories are not taken into consideration, the average number of labels per document is 1.39.
We used a conservative approach, counting a fit only if exactly the same category was assigned by both annotators, and considering assignments to parent category or sub-category as errors. To account for multi-label categorization, we used the adaptation suggested by Rosenberg and Binkowski (2004), according to which each document has partial membership with each of its multiple labels.
We also experimented with the multi-label scheme by assigning each document to its top-k highest-scoring categories, yet the results obtained were of the same order. More details on the bootstrapping step are reported in Sect. 5.5.
In our implementation we have set the LSA dimension value to 300.
In this calculation we follow the micro-averaging approach commonly used in TC.
We use this test to measure statistical significance of all the results reported in this Section.
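As an illustration of the micro-averaging approach mentioned in the notes above, the following is a minimal sketch under stated assumptions, not the authors' evaluation code. It pools true positives, false positives and false negatives over all document-category decisions before computing precision, recall and F1. The function name `micro_prf` and the dictionary-of-sets input format are hypothetical choices made for this example.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]], pred: Dict[str, Set[str]]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 for multi-label assignments.

    `gold` and `pred` map a document id to its set of category labels
    (hypothetical input format, assumed for this sketch).
    """
    tp = fp = fn = 0
    # Pool the counts over every document seen in either mapping.
    for doc_id in gold.keys() | pred.keys():
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # labels assigned correctly
        fp += len(p - g)   # labels assigned but not in the gold standard
        fn += len(g - p)   # gold labels that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1
```

Because the counts are pooled before averaging, frequent categories weigh more heavily than rare ones, and a document that should receive no category still contributes false positives if the classifier labels it.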
References
Ali, A., Magdy, W., & Vogel, S. (2013). A tool for monitoring and analyzing healthcare tweets. In HSD workshop, SIGIR 2013.
Archak, N., Ghose, A., & Ipeirotis, P. G. (2011). Deriving the pricing power of product features by mining consumer reviews. Management Science, 57(8), 1485–1509.
Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics, Companion volume: Short papers, NAACL-Short’09 (pp. 33–36). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1620853.1620864
Baroni, M., & Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 conference on empirical methods in natural language processing, EMNLP’10 (pp. 1183–1193). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1870658.1870773
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29. http://dl.acm.org/citation.cfm?id=89086.89095
de Buenaga Rodriguez, M., Gómez-Hidalgo, J. M., & Diaz-Agudo, B. (1997). Using WordNet to complement training information in text categorization. In N. Nicolov, & R. Mitkov (Eds.), Recent advances in natural language processing II: Selected papers from the second international conference on recent advances in natural language processing (RANLP 1997), March 25–27, 1997, Stanford, CA, USA, Amsterdam Studies in the Theory and History of Linguistic Science, Series IV: Current Issues in Linguistic Theory (pp. 353–364). Amsterdam, The Netherlands: John Benjamins Publishing.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Diakopoulos, N., Naaman, M., & Kivran-Swaine, F. (2010). Diamonds in the rough: Social media visual analytics for journalistic inquiry. In 2010 IEEE symposium on visual analytics science and technology (VAST) (pp. 115–122). IEEE.
Downey, D., & Etzioni, O. (2008). Look ma, no hands: Analyzing the monotonic feature abstraction for text classification. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS-08), Vancouver, Canada. http://books.nips.cc/papers/files/nips21/NIPS2008_0054.pdf
Eichler, K., Gabryszak, A., & Neumann, G. (2014). An analysis of textual inference in German customer emails. In Lexical and Computational Semantics (*SEM 2014) (p. 69).
Eichler, K., Meisdrock, M., & Schmeier, S. (2012). Search and topic detection in customer requests. KI-Künstliche Intelligenz, 26(4), 419–422.
Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.
Fleischman, M., & Hovy, E. (2002). Fine grained classification of named entities. In Proceedings of the 19th international conference on computational linguistics, COLING’02 (Vol. 1, pp. 1–7). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1072228.1072358
Funk, A., Li, Y., Saggion, H., Bontcheva, K., & Leibold, C. (2008). Opinion analysis for business intelligence applications. In Proceedings of the first international workshop on Ontology-supported business intelligence. New York: ACM.
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI (pp. 1606–1611).
Giampiccolo, D., Magnini, B., Dagan, I., & Dolan, B. (2007). The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, RTE’07 (pp. 1–9). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1654536.1654538
Glickman, O., & Dagan, I. (2005). A probabilistic setting and lexical cooccurrence model for textual entailment. In Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment, EMSEE’05 (pp. 43–48). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1631862.1631870
Glickman, O., Shnarch, E., & Dagan, I. (2006). Lexical reference: A semantic matching subtask. In Proceedings of the 2006 conference on empirical methods in natural language processing, EMNLP’06 (pp. 172–179). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1610075.1610103
Gliozzo, A., Strapparava, C., & Dagan, I. (2005). Investigating unsupervised learning for text categorization bootstrapping. In Proceedings of the conference on human language technology and empirical methods in natural language processing, HLT’05 (pp. 129–136). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1220575.1220592
Gliozzo, A., Strapparava, C., & Dagan, I. (2009). Improving text categorization bootstrapping via unsupervised learning. ACM Transactions on Speech and Language Processing, 6(1), 1:1–1:24. doi:10.1145/1596515.1596516
Grefenstette, E., & Sadrzadeh, M. (2011). Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the conference on empirical methods in natural language processing, EMNLP’11 (pp. 1394–1404). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2145432.2145580
Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, KDD’04 (pp. 168–177). New York, NY, USA: ACM. doi:10.1145/1014052.1014073
Ko, Y., & Seo, J. (2004). Learning with unlabeled data for text categorization using bootstrapping and feature projection techniques. In Proceedings of the 42nd annual meeting on association for computational linguistics, ACL’04. Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1218955.1218988
Li, C., Weng, J., He, Q., Yao, Y., Datta, A., Sun, A., & Lee, B. S. (2012). TwiNER: Named entity recognition in targeted Twitter stream. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR’12 (pp. 721–730). New York, NY, USA: ACM. doi:10.1145/2348283.2348380
Liu, B., Li, X., Lee, W. S., & Yu, P. S. (2004). Text classification by labeling words. In Proceedings of the 19th national conference on Artificial intelligence, AAAI’04 (pp. 425–430). San Jose: AAAI Press. http://dl.acm.org/citation.cfm?id=1597148.1597218
Liu, B., & Zhang, L. (2012). A survey of opinion mining and sentiment analysis. In Mining text data (pp. 415–463). Berlin: Springer.
Mansuy, T., & Hilderman, R. J. (2006). A characterization of WordNet features in Boolean models for text classification. In Proceedings of the fifth Australasian conference on data mining and analytics, AusDM’06 (Vol. 61, pp. 103–109). Darlinghurst, Australia: Australian Computer Society Inc. http://dl.acm.org/citation.cfm?id=1273808.1273822
Maynard, D., & Funk, A. (2012). Automatic detection of political opinions in tweets. In The semantic web: ESWC 2011 workshops (pp. 88–99). Berlin: Springer.
McCallum, A., & Nigam, K. (1999). Text classification by bootstrapping with keywords, EM and shrinkage. In ACL99 workshop for unsupervised learning in natural language processing (pp. 52–58).
Metzler, D., Dumais, S. T., & Meek, C. (2007). Similarity measures for short segments of text. In ECIR (pp. 16–27).
Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1439.
Pasca, M., & Harabagiu, S. M. (2001). The informative role of WordNet in open-domain question answering. In Proceedings of the NAACL 2001 workshop on WordNet and other lexical resources: Applications, extensions and customizations (pp. 138–143).
Ritter, A., Mausam, Etzioni, O., & Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD’12 (pp. 1104–1112). New York, NY, USA: ACM. doi:10.1145/2339530.2339704
Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In S. Dumais, D. Marcu, & S. Roukos (Eds.), HLT-NAACL 2004: Short Papers (pp. 77–80). Boston, MA: Association for Computational Linguistics.
Ruhl, J., Datar, M., & Lee, J. (2006). Method, system and graphical user interface for providing reviews for a product. https://www.google.com/patents/US20060143158. US Patent App. 11/012,846
Saggion, H., & Funk, A. (2009). Extracting opinions and facts for business intelligence. RNTI Journal, 10(17), 119–146.
Scharl, A., & Weichselbraun, A. (2008). An automated approach to investigating the online media coverage of US presidential elections. Journal of Information Technology & Politics, 5(1), 121–132.
Scott, S., & Matwin, S. (1999). Feature engineering for text classification. In Proceedings of the sixteenth international conference on machine learning, ICML’99 (pp. 379–388). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=645528.657484
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47. doi:10.1145/505282.505283
Shah, C., & Croft, W. B. (2004). Evaluating high accuracy retrieval techniques. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR’04 (pp. 2–9). New York, NY, USA: ACM. doi:10.1145/1008992.1008996
Shnarch, E., Barak, L., & Dagan, I. (2009). Extracting lexical reference rules from Wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, ACL’09 (Vol. 1, pp. 450–458). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1687878.1687942
Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38. http://dl.acm.org/citation.cfm?id=234285.234287
Socher, R., Huval, B., Manning, C. D., & Ng, A. Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, EMNLP-CoNLL’12 (pp. 1201–1211). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2390948.2391084
Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10, 178–185.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. doi:10.2307/3001968
Xu, J., & Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval (pp. 4–11). New York: ACM.
Acknowledgments
This work was supported by the Next Generation Video (NeGeV) Project. We would like to thank our industrial partners, Comverse Technology Inc. and Orca Interactive Ltd. We thank Libby Barak for helping us replicate the results of Barak et al. (2009). We thank Naomi Zeichner for preparing the taxonomy and annotating the dataset. Finally, we thank the anonymous reviewers for their useful comments and suggestions.
Appendices
Appendix 1
1.1 Our complete IMDB taxonomy
Categories
1. Religion
   1.1. Buddhism
   1.2. Hinduism
   1.3. Christianity
      1.3.1. Christmas
   1.4. Islam
   1.5. Judaism
2. Sport
   2.1. Bicycle
   2.2. Boxing
   2.3. Fishing
   2.4. Football
   2.5. Golf
   2.6. Hockey
   2.7. Martial-arts
      2.7.1. Karate
   2.8. Athletics
   2.9. Running
   2.10. Shooting
   2.11. Skiing
   2.12. Soccer
   2.13. Water sports
      2.13.1. Surfing
      2.13.2. Swimming
   2.14. Tennis
   2.15. Baseball
   2.16. Wrestling
   2.17. Basketball
   2.18. Horseracing
   2.19. Olympic games
3. Interests (NON-CAT)
   3.1. Beach
   3.2. Outdoor
   3.3. Gardening
   3.4. Pets
   3.5. Fitness
   3.6. Cookery
   3.7. Fashion
   3.8. Computing
   3.9. Travel
   3.10. Motoring
      3.10.1. Cars
      3.10.2. Motorcycle
   3.11. Trains
   3.12. Airplanes
   3.13. Ships
   3.14. Radio
   3.15. Business
   3.16. Nature
      3.16.1. Animals
   3.17. Outer Space
   3.18. The environment
   3.19. Showbiz
   3.20. Traditions
   3.21. Infants
   3.22. Military
   3.23. Weather
4. Arts
   4.1. Cinema
   4.2. Advertising
   4.3. Theater
   4.4. Music
      4.4.1. Opera
      4.4.2. Classical music
      4.4.3. Jazz
      4.4.4. Pop/rock
      4.4.5. Country music
      4.4.6. Hip Hop
   4.5. Dance
      4.5.1. Ballet
5. Science
   5.1. Medicine
      5.1.1. Disability
   5.2. Technology
   5.3. Psychology
6. Education
   6.1. School
   6.2. College/University
7. Miscellaneous (NON-CAT)
   7.1. Crime (NON-CAT)
      7.1.1. Prison
      7.1.2. Mafia
      7.1.3. Drugs
      7.1.4. Fraud
      7.1.5. Gambling
      7.1.6. Terrorism
   7.2. Literature
   7.3. History
   7.4. Political
   7.5. Social (NON-CAT)
      7.5.1. Racism
   7.6. Legal
   7.7. Communism
   7.8. War
      7.8.1. World war 1
      7.8.2. World war 2
   7.9. Aliens
   7.10. Comic-book
   7.11. Journalism
   7.12. Mythology
Appendix 2
1.1 The annotation guidelines
You are given a list of films with their plot description and a taxonomy of film categories.
The taxonomy is made up of film subject matters and is arranged hierarchically, so that if a sub-category is marked, its ancestors are also relevant. This holds in all cases except when a category is present only to group similar subjects together, in which case it is marked with the text (NON-CAT) next to it.
For example: a film categorized as dealing with ‘cars’ will also be relevant to ‘motoring’, but not to ‘interests’, as the latter is not a category.
3. Interests (NON-CAT)
   3.9 Travel
   3.10 Motoring
      3.10.1 cars
      3.10.2 Motorcycle
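The ancestor rule illustrated by the taxonomy fragment above can be sketched in code. This is a minimal illustration under stated assumptions, not part of the original guidelines: the `PARENT` mapping, the `NON_CAT` set and the function `relevant_categories` are hypothetical names, and only the five nodes of the fragment are encoded.

```python
from typing import Dict, Optional, Set

# Child -> parent links for the fragment shown above (assumed encoding).
PARENT: Dict[str, Optional[str]] = {
    "interests": None,
    "travel": "interests",
    "motoring": "interests",
    "cars": "motoring",
    "motorcycle": "motoring",
}

# Grouping nodes marked (NON-CAT) are never returned as categories.
NON_CAT: Set[str] = {"interests"}


def relevant_categories(label: str) -> Set[str]:
    """Return the marked category plus all of its real-category ancestors."""
    result: Set[str] = set()
    node: Optional[str] = label
    while node is not None:
        if node not in NON_CAT:
            result.add(node)
        node = PARENT[node]  # climb one level; None ends the walk at the root
    return result
```

With this encoding, marking `cars` yields `cars` and `motoring`, while `interests` is skipped because it only groups related subjects.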
Note: the taxonomy is not exhaustive. You may find that no category in the taxonomy accurately fits the film, even though you can think of a subject matter that does. If a broader category is present, choose it; otherwise choose none.
For each film, you must decide which categories (if any) out of the taxonomy are relevant to it. You can choose as many or as few categories as you see fit, or none.
Note: if you find more than one category, please put each category on a separate line (insert lines if necessary).
You must categorize according to the following guidelines:
1. Is the background story prominent, not just a passing reference?
   Examples:
   (a) The following film should be categorized as relevant to ‘crime’:
       “Jessie is an ageing career criminal who has been in more jails, fights, schemes, and lineups than just about anyone else. His son Vito, while currently on the straight and narrow, has had a fairly shady past and is indeed no stranger to illegal activity. They both have great hope for Adam, Vito’s son and Jessie’s grandson, who is bright, good-looking, and without a criminal past. So when Adam approaches Jessie with a scheme for a burglary he’s shocked, but not necessarily disinterested...”
   (b) The following film should be categorized as relevant to ‘animals’:
       “Farmer Hoggett wins a runt piglet at a local fair and young Babe, as the piglet decides to call himself, befriends and learns about all the other creatures on the farm. He becomes special friends with one of the sheepdogs, Fly. With Fly’s help, and Farmer Hoggett’s intuition, Babe embarks on a career in sheepherding with some surprising and spectacular results. Babe is a little pig who doesn’t quite know his place in the world. With a bunch of odd friends, like Ferdinand the duck who thinks he is a rooster and Fly the dog he calls mom, Babe realizes that he has the makings to become the greatest sheep pig of all time, and Farmer Hoggett knows it. With the help of the sheep dogs Babe learns that a pig can be anything that he wants to be.”
   (c) The following film should not be categorized as relevant to ‘basketball’:
       “This gritty drama follows two high school acquaintances, Hancock, a basketball star, and Danny, a geek turned drifter, after they graduate. The first film commissioned by the Sundance Film Festival, it portrays the other half of the American dream, as Hancock and his cheerleader girlfriend Mary wander to a middle-class mediocrity out itself out of reach for Danny and his psychotic wife Bev.”
2. You must not base your decision on prior knowledge of the film, only on information provided in the plot.
Cite this article
Liebeskind, C., Kotlerman, L. & Dagan, I. Text Categorization from category name in an industry-motivated scenario. Lang Resources & Evaluation 49, 227–261 (2015). https://doi.org/10.1007/s10579-015-9298-3