
Text Categorization from category name in an industry-motivated scenario


Abstract

In this work we suggest a novel Text Categorization (TC) scenario, motivated by an ad-hoc industrial need to assign documents to a set of predefined categories when no labeled training data is available for these categories. The scenario is applicable in many industrial settings and is interesting from an academic perspective. We present a new dataset geared to the main characteristics of the scenario, and use it to investigate the name-based TC approach, which takes the category names as its only input and requires no training data. We evaluate and analyze the performance of state-of-the-art methods on this dataset, identify their shortcomings for our scenario, and suggest ways to overcome them. Specifically, we utilize statistical correlation measured over a target corpus to improve on the state of the art, and offer a different classification scheme based on the characteristics of the setting. We evaluate our improvements and adaptations and show that our suggested method achieves superior performance.


Notes

  1. Available at http://u.cs.biu.ac.il/~nlp/downloads/WikiRules.html.

  2. We thereby filtered out rules whose terms do not co-occur frequently enough within Wikipedia, based on the filtering method proposed in Shnarch et al. (2009). We tuned the required threshold parameter (set to 0.01) on our development dataset, described in Sect. 3; a sketch of this filtering step is given after these notes.

  3. If the documents that should not be classified to any of the categories are not taken into consideration, the average number of labels per document is 1.39.

  4. We used a conservative approach, counting a fit only if exactly the same category was assigned by both annotators, and considering assignments to a parent category or sub-category as errors. To account for multi-label categorization, we used the adaptation suggested by Rosenberg and Binkowski (2004), according to which each document has partial membership in each of its multiple labels (see the agreement sketch after these notes).

  5. We also experimented with the multi-label scheme by assigning each document to its top-k highest-scoring categories, yet the results obtained were of the same order. More details on the bootstrapping step are reported in Sect. 5.5.

  6. In our implementation we set the LSA dimensionality to 300 (see the sketch after these notes).

  7. In this calculation we follow the micro-averaging approach commonly used in TC (see the evaluation sketch after these notes).

  8. We use this test to measure the statistical significance of all the results reported in this section.
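
As a companion to note 2, the following is a minimal sketch of the kind of co-occurrence-based rule filtering described there. The conditional-probability score and the rule/article representations are illustrative assumptions, not the exact statistic of Shnarch et al. (2009); only the 0.01 threshold comes from the note.

```python
# Illustrative sketch for note 2: drop lexical reference rules whose terms do
# not co-occur frequently enough in Wikipedia. The score below (an estimate of
# P(rhs in article | lhs in article)) is an assumption for illustration.
from typing import Iterable, List, Tuple

def cooccurrence_score(lhs: str, rhs: str, articles: Iterable[str]) -> float:
    """Estimate the probability of seeing rhs in an article that contains lhs."""
    lhs_count = 0
    joint_count = 0
    for text in articles:
        if lhs in text:
            lhs_count += 1
            if rhs in text:
                joint_count += 1
    return joint_count / lhs_count if lhs_count else 0.0

def filter_rules(rules: List[Tuple[str, str]],
                 articles: List[str],
                 threshold: float = 0.01) -> List[Tuple[str, str]]:
    """Keep only rules whose terms co-occur often enough in the corpus."""
    return [(lhs, rhs) for lhs, rhs in rules
            if cooccurrence_score(lhs, rhs, articles) >= threshold]

# Toy usage: the rule (piglet -> animals) survives, (piglet -> golf) is dropped.
articles = ["a farmer wins a piglet and other animals at the fair",
            "golf is played on a large outdoor course"]
print(filter_rules([("piglet", "animals"), ("piglet", "golf")], articles))
```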
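Similarly, for note 4, here is a minimal sketch of inter-annotator agreement under the partial-membership adaptation of Rosenberg and Binkowski (2004), assuming each of a document's k labels carries weight 1/k; the exact weighting and the observed-agreement formula are assumptions for illustration, not the authors' published computation.

```python
# Sketch for note 4: exact category matches count as a fit; multi-label
# documents are handled with partial membership (weight 1/k per label).
from typing import Dict, Set

def observed_agreement(annot_a: Dict[str, Set[str]],
                       annot_b: Dict[str, Set[str]]) -> float:
    """Average per-document overlap, with partial membership per label."""
    total = 0.0
    for doc_id, labels_a in annot_a.items():
        labels_b = annot_b.get(doc_id, set())
        if not labels_a and not labels_b:
            total += 1.0           # both annotators left the document unlabeled
            continue
        if not labels_a or not labels_b:
            continue               # one side unlabeled: no credit
        shared = labels_a & labels_b
        # each label carries partial membership 1/k for its annotator
        total += 0.5 * (len(shared) / len(labels_a) + len(shared) / len(labels_b))
    return total / len(annot_a)

# Example usage with two hypothetical annotations of the same three plots:
a = {"doc1": {"Sport", "Football"}, "doc2": {"Crime"}, "doc3": set()}
b = {"doc1": {"Football"}, "doc2": {"Crime"}, "doc3": set()}
print(observed_agreement(a, b))    # (0.75 + 1.0 + 1.0) / 3 ≈ 0.9167
```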
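For note 6, a minimal sketch of an LSA setup with dimensionality 300, assuming a scikit-learn TF-IDF plus truncated-SVD pipeline; the library choice and vectorizer settings are assumptions, and only the value 300 comes from the note.

```python
# Sketch for note 6: a 300-dimensional LSA model built with scikit-learn
# (an assumed toolkit; only the dimensionality value comes from the note).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=300))

# With a real target corpus (a list of document strings whose vocabulary
# exceeds 300 terms), fitting yields a 300-dimensional vector per document,
# and category names can be projected into the same space for comparison:
#   doc_vectors  = lsa.fit_transform(target_corpus)
#   name_vectors = lsa.transform(category_names)
```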
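Finally, for notes 7 and 8, a minimal sketch of micro-averaged precision, recall, and F1 computed over all pooled (document, category) decisions, followed by a significance check assuming the test referred to in note 8 is the Wilcoxon signed-rank test cited in the references (Wilcoxon 1945); the scores below are made-up placeholders.

```python
# Sketch for notes 7-8: micro-averaged P/R/F1 over pooled decisions, plus a
# Wilcoxon signed-rank test on paired system scores (placeholder values).
from scipy.stats import wilcoxon

def micro_prf(gold, predicted):
    """gold, predicted: lists of label sets, one set per document."""
    tp = sum(len(g & p) for g, p in zip(gold, predicted))
    fp = sum(len(p - g) for g, p in zip(gold, predicted))
    fn = sum(len(g - p) for g, p in zip(gold, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Paired per-category scores of two systems (hypothetical placeholder values).
scores_baseline = [0.41, 0.38, 0.52, 0.47, 0.33, 0.45]
scores_improved = [0.49, 0.44, 0.55, 0.50, 0.39, 0.51]
stat, p_value = wilcoxon(scores_baseline, scores_improved)
print(f"Wilcoxon signed-rank: statistic={stat:.2f}, p={p_value:.4f}")
```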

References

  • Ali, A., Magdy, W., & Vogel, S. (2013). A tool for monitoring and analyzing healthcare tweets. In HSD workshop, SIGIR 2013.

  • Archak, N., Ghose, A., & Ipeirotis, P. G. (2011). Deriving the pricing power of product features by mining consumer reviews. Management Science, 57(8), 1485–1509.

  • Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of human language technologies: The 2009 annual conference of the North American chapter of the Association for Computational Linguistics, Companion volume: Short papers, NAACL-Short’09 (pp. 33–36). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1620853.1620864

  • Baroni, M., & Zamparelli, R. (2010). Nouns are vectors, adjectives are matrices: Representing adjective-noun constructions in semantic space. In Proceedings of the 2010 conference on empirical methods in natural language processing, EMNLP’10 (pp. 1183–1193). Association for Computational Linguistics, Stroudsburg, PA, USA. http://dl.acm.org/citation.cfm?id=1870658.1870773

  • Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

  • Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29. http://dl.acm.org/citation.cfm?id=89086.89095

  • de Buenaga Rodriguez, M., Gómez-Hidalgo, J. M., & Diaz-Agudo, B. (1997). Using WordNet to complement training information in text categorization. In N. Nicolov & R. Mitkov (Eds.), Recent advances in natural language processing II: Selected papers from the second international conference on recent advances in natural language processing (RANLP 1997), March 25–27, 1997, Stanford, CA, USA, Amsterdam Studies in the Theory and History of Linguistic Science, Series IV: Current Issues in Linguistic Theory (pp. 353–364). Amsterdam, The Netherlands: John Benjamins Publishing.

  • Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

  • Diakopoulos, N., Naaman, M., & Kivran-Swaine, F. (2010). Diamonds in the rough: Social media visual analytics for journalistic inquiry. In 2010 IEEE symposium on visual analytics science and technology (VAST) (pp. 115–122). IEEE.

  • Downey, D., & Etzioni, O. (2008). Look ma, no hands: Analyzing the monotonic feature abstraction for text classification. In Proceedings of the 22nd Annual Conference on Neural Information Processing Systems (NIPS-08), Vancouver, Canada. http://books.nips.cc/papers/files/nips21/NIPS2008_0054.pdf

  • Eichler, K., Gabryszak, A., & Neumann, G. (2014). An analysis of textual inference in German customer emails. In Lexical and computational semantics (*SEM 2014) (p. 69).

  • Eichler, K., Meisdrock, M., & Schmeier, S. (2012). Search and topic detection in customer requests. KI-Künstliche Intelligenz, 26(4), 419–422.

  • Fellbaum, C. (Ed.). (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.

  • Fleischman, M., & Hovy, E. (2002). Fine grained classification of named entities. In Proceedings of the 19th international conference on computational linguistics—COLING’02, (Vol. 1, pp. 1–7). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1072228.1072358

  • Funk, A., Li, Y., Saggion, H., Bontcheva, K., & Leibold, C. (2008). Opinion analysis for business intelligence applications. In Proceedings of the first international workshop on Ontology-supported business intelligence. New York: ACM.

  • Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI (pp. 1606–1611).

  • Giampiccolo, D., Magnini, B., Dagan, I., & Dolan, B. (2007). The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, RTE’07 (pp. 1–9). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1654536.1654538

  • Glickman, O., & Dagan, I. (2005). A probabilistic setting and lexical cooccurrence model for textual entailment. In Proceedings of the ACL workshop on empirical modeling of semantic equivalence and entailment, EMSEE’05, (pp. 43–48). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1631862.1631870

  • Glickman, O., Shnarch, E., & Dagan, I. (2006). Lexical reference: A semantic matching subtask. In Proceedings of the 2006 conference on empirical methods in natural language processing, EMNLP’06 (pp. 172–179). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1610075.1610103

  • Gliozzo, A., Strapparava, C., & Dagan, I. (2005). Investigating unsupervised learning for text categorization bootstrapping. In Proceedings of the conference on human language technology and empirical methods in natural language processing, HLT’05, (pp. 129–136). Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1220575.1220592

  • Gliozzo, A., Strapparava, C., & Dagan, I. (2009). Improving text categorization bootstrapping via unsupervised learning. ACM Transactions on Speech and Language Processing, 6(1), 1:1–1:24. doi:10.1145/1596515.1596516

  • Grefenstette, E., & Sadrzadeh, M. (2011). Experimental support for a categorical compositional distributional model of meaning. In Proceedings of the conference on empirical methods in natural language processing, EMNLP’11 (pp. 1394–1404). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2145432.2145580

  • Hu, M., & Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, KDD’04, (pp. 168–177). New York, NY, USA: ACM. doi:10.1145/1014052.1014073

  • Ko, Y., & Seo, J. (2004). Learning with unlabeled data for text categorization using bootstrapping and feature projection techniques. In Proceedings of the 42nd annual meeting on association for computational linguistics, ACL’04. Stroudsburg, PA, USA: Association for Computational Linguistics. doi:10.3115/1218955.1218988

  • Li, C., Weng, J., He, Q., Yao, Y., Datta, A., Sun, A., & Lee, B. S. (2012). Twiner: Named entity recognition in targeted twitter stream. In Proceedings of the 35th international ACM SIGIR conference on research and development in information retrieval, SIGIR’12 (pp. 721–730). New York, NY, USA: ACM. doi:10.1145/2348283.2348380

  • Liu, B., Li, X., Lee, W. S., & Yu, P. S. (2004). Text classification by labeling words. In Proceedings of the 19th national conference on Artifical intelligence, AAAI’04, (pp. 425–430). San Jose: AAAI Press. http://dl.acm.org/citation.cfm?id=1597148.1597218

  • Liu, B., & Zhang, L. (2012). A survey of opinion mining and sentiment analysis. In Mining text data, (pp. 415–463). Berlin: Springer.

  • Mansuy, T., & Hilderman, R. J. (2006). A characterization of WordNet features in Boolean models for text classification. In Proceedings of the fifth Australasian conference on data mining and analytics, AusDM’06 (Vol. 61, pp. 103–109). Darlinghurst, Australia: Australian Computer Society Inc. http://dl.acm.org/citation.cfm?id=1273808.1273822

  • Maynard, D., & Funk, A. (2012). Automatic detection of political opinions in tweets. In The semantic web: ESWC 2011 workshops (pp. 88–99). Berlin: Springer.

  • McCallum, A., & Nigam, K. (1999). Text classification by bootstrapping with keywords, EM and shrinkage. In ACL’99 workshop for unsupervised learning in natural language processing (pp. 52–58).

  • Metzler, D., Dumais, S. T., & Meek, C. (2007). Similarity measures for short segments of text. In ECIR (pp. 16–27).

  • Mitchell, J., & Lapata, M. (2010). Composition in distributional models of semantics. Cognitive Science, 34(8), 1388–1439.

  • Pasca, M., & Harabagiu, S. M. (2001). The informative role of WordNet in open-domain question answering. In Proceedings of the NAACL 2001 workshop on WordNet and other lexical resources: Applications, extensions and customizations (pp. 138–143).

  • Ritter, A., Mausam, Etzioni, O., & Clark, S. (2012). Open domain event extraction from Twitter. In Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD’12 (pp. 1104–1112). New York, NY, USA: ACM. doi:10.1145/2339530.2339704

  • Rosenberg, A., & Binkowski, E. (2004). Augmenting the kappa statistic to determine interannotator reliability for multiply labeled data points. In D. Marcu, S. Dumais, & S. Roukos (Eds.), HLT-NAACL 2004: Short papers (pp. 77–80). Boston, MA: Association for Computational Linguistics.

  • Ruhl, J., Datar, M., & Lee, J. (2006). Method, system and graphical user interface for providing reviews for a product. https://www.google.com/patents/US20060143158. US Patent App. 11/012,846

  • Saggion, H., & Funk, A. (2009). Extracting opinions and facts for business intelligence. RNTI Journal, 10(17), 119–146.

  • Scharl, A., & Weichselbraun, A. (2008). An automated approach to investigating the online media coverage of US presidential elections. Journal of Information Technology & Politics, 5(1), 121–132.

  • Scott, S., & Matwin, S. (1999). Feature engineering for text classification. In Proceedings of the sixteenth international conference on machine learning, ICML’99, pp. 379–388. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. http://dl.acm.org/citation.cfm?id=645528.657484

  • Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47. doi:10.1145/505282.505283

  • Shah, C., & Croft, W. B. (2004). Evaluating high accuracy retrieval techniques. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR’04 (pp. 2–9). New York, NY, USA: ACM. doi:10.1145/1008992.1008996

  • Shnarch, E., Barak, L., & Dagan, I. (2009). Extracting lexical reference rules from wikipedia. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, ACL’09 (Vol. 1, pp. 450–458). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=1687878.1687942

  • Smadja, F., McKeown, K. R., & Hatzivassiloglou, V. (1996). Translating collocations for bilingual lexicons: A statistical approach. Computational Linguistics, 22(1), 1–38. http://dl.acm.org/citation.cfm?id=234285.234287

  • Socher, R., Huval, B., Manning, C. D., & Ng, A. Y. (2012). Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, EMNLP-CoNLL’12 (pp. 1201–1211). Stroudsburg, PA, USA: Association for Computational Linguistics. http://dl.acm.org/citation.cfm?id=2390948.2391084

  • Tumasjan, A., Sprenger, T. O., Sandner, P. G., & Welpe, I. M. (2010). Predicting elections with Twitter: What 140 characters reveal about political sentiment. ICWSM, 10, 178–185.

  • Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6), 80–83. doi:10.2307/3001968

  • Xu, J., & Croft, W. B. (1996). Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on research and development in information retrieval, (pp. 4–11). New York: ACM.


Acknowledgments

This work was supported by the Next Generation Video (NeGeV) Project. We would like to thank our industrial partners, Comverse Technology Inc. and Orca Interactive Ltd. We thank Libby Barak for helping us in replicating the results of Barak et al. (2009). We thank Naomi Zeichner for preparing the taxonomy and annotating the dataset. Finally, we thank the anonymous reviewers for their useful comments and suggestions.

Author information

Corresponding author

Correspondence to Chaya Liebeskind.

Appendices

Appendix 1

1.1 Our complete IMDB taxonomy

Categories

1. Religion
   1.1. Buddhism
   1.2. Hinduism
   1.3. Christianity
        1.3.1. Christmas
   1.4. Islam
   1.5. Judaism

2. Sport
   2.1. Bicycle
   2.2. Boxing
   2.3. Fishing
   2.4. Football
   2.5. Golf
   2.6. Hockey
   2.7. Martial-arts
        2.7.1. Karate
   2.8. Athletics
   2.9. Running
   2.10. Shooting
   2.11. Skiing
   2.12. Soccer
   2.13. Water sports
        2.13.1. Surfing
        2.13.2. Swimming
   2.14. Tennis
   2.15. Baseball
   2.16. Wrestling
   2.17. Basketball
   2.18. Horseracing
   2.19. Olympic games

3. Interests (NON-CAT)
   3.1. Beach
   3.2. Outdoor
   3.3. Gardening
   3.4. Pets
   3.5. Fitness
   3.6. Cookery
   3.7. Fashion
   3.8. Computing
   3.9. Travel
   3.10. Motoring
        3.10.1. Cars
        3.10.2. Motorcycle
   3.11. Trains
   3.12. Airplanes
   3.13. Ships
   3.14. Radio
   3.15. Business
   3.16. Nature
        3.16.1. Animals
   3.17. Outer Space
   3.18. The environment
   3.19. Showbiz
   3.20. Traditions
   3.21. Infants
   3.22. Military
   3.23. Weather

4. Arts
   4.1. Cinema
   4.2. Advertising
   4.3. Theater
   4.4. Music
        4.4.1. Opera
        4.4.2. Classical music
        4.4.3. Jazz
        4.4.4. Pop/rock
        4.4.5. Country music
        4.4.6. Hip Hop
   4.5. Dance
        4.5.1. Ballet

5. Science
   5.1. Medicine
        5.1.1. Disability
   5.2. Technology
   5.3. Psychology

6. Education
   6.1. School
   6.2. College/University

7. Miscellaneous (NON-CAT)
   7.1. Crime (NON-CAT)
        7.1.1. Prison
        7.1.2. Mafia
        7.1.3. Drugs
        7.1.4. Fraud
        7.1.5. Gambling
        7.1.6. Terrorism
   7.2. Literature
   7.3. History
   7.4. Political
   7.5. Social (NON-CAT)
        7.5.1. Racism
   7.6. Legal
   7.7. Communism
   7.8. War
        7.8.1. World war 1
        7.8.2. World war 2
   7.9. Aliens
   7.10. Comic-book
   7.11. Journalism
   7.12. Mythology

Appendix 2

1.1 The annotation guidelines

You are given a list of films with their plot descriptions and a taxonomy of film categories.

The taxonomy consists of film subject matters and is arranged hierarchically, so that if a sub-category is marked, its ancestors are also considered relevant. The only exception is a category that is present solely to group similar subjects together, in which case it is marked with the text (NON-CAT) next to it.

For example, a film categorized as dealing with 'cars' will also be relevant to 'motoring', but not to 'interests', since the latter is not a true category:

3. Interests (NON-CAT)
   3.9. Travel
   3.10. Motoring
        3.10.1. Cars
        3.10.2. Motorcycle

Note: the taxonomy is not exhaustive; you may find that no category in the taxonomy accurately fits a film, even though you can think of a subject matter that does. If a broader relevant category is present, choose it; otherwise choose none.

For each film, you must decide which categories (if any) out of the taxonomy are relevant to it. You can choose as many or as few categories as you see fit, or none.

Note: if you find more than one relevant category, please put each category on a separate line (insert lines if necessary).

You must categorize according to the following guidelines:

  1. The category's subject matter must be prominent in the background story, not just a passing reference.

     Examples:

     (a) The following film should be categorized as relevant to 'crime':

         "Jessie is an ageing career criminal who has been in more jails, fights, schemes, and lineups than just about anyone else. His son Vito, while currently on the straight and narrow, has had a fairly shady past and is indeed no stranger to illegal activity. They both have great hope for Adam, Vito's son and Jessie's grandson, who is bright, good-looking, and without a criminal past. So when Adam approaches Jessie with a scheme for a burglary he's shocked, but not necessarily disinterested…"

     (b) The following film should be categorized as relevant to 'animals':

         "Farmer Hoggett wins a runt piglet at a local fair and young Babe, as the piglet decides to call himself, befriends and learns about all the other creatures on the farm. He becomes special friends with one of the sheepdogs, Fly. With Fly's help, and Farmer Hoggett's intuition, Babe embarks on a career in sheepherding with some surprising and spectacular results. Babe is a little pig who doesn't quite know his place in the world. With a bunch of odd friends, like Ferdinand the duck who thinks he is a rooster and Fly the dog he calls mom, Babe realizes that he has the makings to become the greatest sheep pig of all time, and Farmer Hoggett knows it. With the help of the sheep dogs Babe learns that a pig can be anything that he wants to be."

     (c) The following film should not be categorized as relevant to 'basketball':

         "This gritty drama follows two high school acquaintances, Hancock, a basketball star, and Danny, a geek turned drifter, after they graduate. The first film commissioned by the Sundance Film Festival, it portrays the other half of the American dream, as Hancock and his cheerleader girlfriend Mary wander toward a middle-class mediocrity that is itself out of reach for Danny and his psychotic wife Bev."

  2. You must not base your decision on prior knowledge of the film, only on information provided in the plot.


About this article

Cite this article

Liebeskind, C., Kotlerman, L. & Dagan, I. Text Categorization from category name in an industry-motivated scenario. Lang Resources & Evaluation 49, 227–261 (2015). https://doi.org/10.1007/s10579-015-9298-3

