Classifying unlabeled short texts using a fuzzy declarative approach

Romero, Francisco P.; Julián-Iranzo, Pascual; Soto, Andrés; Ferreira-Satler, Mateus; Gallardo-Casero, Juan

doi:10.1007/s10579-012-9203-2

Original Paper
Published: 04 November 2012

Volume 47, pages 151–178, (2013)

Language Resources and Evaluation

Abstract

Web 2.0 provides user-friendly tools that allow people to create and publish content online. User-generated content often takes the form of short texts (e.g., blog posts, news feeds, snippets). This has motivated increasing interest in the analysis of short texts and, specifically, in their categorisation. Text categorisation is the task of classifying documents into a certain number of predefined categories. Traditional text classification techniques are mainly based on statistical analysis of word frequency and have proved inadequate for the classification of short texts, where word occurrences are too few. Moreover, the classic approach to text categorisation is based on a learning process that requires a large number of labeled training texts to achieve accurate performance. However, labeled documents might not be available, whereas unlabeled documents can be easily collected. This paper presents an approach to text categorisation which does not need a pre-classified set of training documents. The proposed method only requires the category names as user input. Each of these categories is defined by means of an ontology of terms modelled by a set of what we call proximity equations. Hence, our method is not based on the frequency of occurrence of a category, but depends heavily on the definition of that category and on how well the text fits that definition. Therefore, the proposed approach is an appropriate method for short text classification, where the frequency of occurrence of a category is very small or even zero. Another feature of our method is that the classification process relies on the capacity of Bousi~Prolog, an extension of the standard Prolog language, for flexible matching and knowledge representation. This declarative approach provides a text classifier which is quick and easy to build, and a classification process which is easy for the user to understand. The results of experiments showed that the proposed method achieved a reasonably useful performance.
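
To make the idea of classifying by proximity to a category name more concrete, the following is a minimal Python sketch, not the authors' Bousi~Prolog implementation. It assumes that proximity equations can be approximated by a table of term pairs with a degree in [0, 1]; the table contents, the function names `proximity` and `classify`, the `cut` parameter and the summation used for scoring are all hypothetical choices made for illustration only.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical proximity equations: each pair of terms is related with a
# degree in [0, 1]. Terms and degrees here are invented for illustration.
PROXIMITY: Dict[Tuple[str, str], float] = {
    ("river", "water"): 0.8,
    ("rain", "water"): 0.7,
    ("smog", "air"): 0.9,
    ("wind", "air"): 0.8,
}


def proximity(a: str, b: str) -> float:
    """Degree to which two terms are close (reflexive and symmetric)."""
    if a == b:
        return 1.0
    return max(PROXIMITY.get((a, b), 0.0), PROXIMITY.get((b, a), 0.0))


def classify(
    text: str, categories: Iterable[str], cut: float = 0.5
) -> Tuple[Optional[str], Dict[str, float]]:
    """Score each category by the words that are close to its name."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    scores: Dict[str, float] = {}
    for category in categories:
        degrees = (proximity(w, category) for w in words)
        # Only words whose degree reaches the cut value count as matches
        # (an assumed scoring rule, not taken from the paper).
        scores[category] = sum(d for d in degrees if d >= cut)
    best = max(scores, key=scores.get, default=None)
    if best is None or scores[best] == 0.0:
        return None, scores  # no category fits the text
    return best, scores


if __name__ == "__main__":
    label, scores = classify("Heavy rain raised the river level.", ["water", "air"])
    print(label, scores)
```

Note that the example sentence never contains the word "water", yet it can still be assigned to that category because two of its words are declared close to the category name. This mirrors the abstract's point that the approach does not rely on how often a category term occurs in the text.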

Notes

http://blogpulse.com/.
http://www.wikipedia.org.
This project tries to formalize commonsense knowledge into a logical framework using logical assertions written in a minute representation language called CycL. Cyc is an attempt to do symbolic AI on a massive scale. It is not based on numerical methods such as statistical probabilities, nor is it based on neural networks or fuzzy logic.
Available at http://conceptnet.media.mit.edu/.
http://wn-similarity.sourceforge.net.
Observe that the Bousi~Prolog version executable via Java Web Start is the one corresponding to the low level implementation, which does not fulfill all standard Prolog functionalities. In contrast, the high level implementation used in this work for the development of the experiments is a true extension of the Prolog programming language.
Actually, fuzzy binary relations which are automatically converted into proximity or similarity relations.
This is the default behavior. See later, at the end of this subsection, for more information.
That is, a pair (substitution, approximation degree).
Whenever the elements of the initial matrix fulfill the so-called “transitivity property” (Julián-Iranzo 2008).
http://www.enviweb.cz.
Observe that the whole predicate inspect/3 is a crisp predicate (that is, it only returns “yes”, with approximation degree 1.0, or “no”) because the weak unification operator was designed as a crisp operator (a term is either close or similar to another one or it is not). Hence, the approximation degree for the whole goal is 1.0 in this example, since there are positive answers (three words close or similar to “water” were found in the file runningEX).
http://www.enviweb.cz.
http://www.dmoz.org/.
http://www.daviddlewis.com/resources/testcollections/reuters21578/.
http://www.euronews.net/newswires/.
© European Union, 2010, http://eurovoc.europa.eu/.

References

Apté, C., Damerau, F., & Weiss, S. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems, 12(3), 233–251.
Barak, L., Dagan, I., & Shnarch, E. (2009). Text categorization from category name via lexical reference. In Proceedings of the NAACL ’09, association for computational linguistics (pp. 33–36). Morristown, NJ.
Bollacker, K., Evans, C., Paritosh, P., Sturge, T., & Taylor, J. (2008). Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, ACM (pp. 1247–1250). New York, NY, USA, SIGMOD ’08.
Boutari, A., Carpineto, C., & Nicolussi, R. (2010). Evaluating term concept association measures for short text expansion: Two case studies of classification and clustering. In Proceedings of the seventh international conference on concept lattices and their applications (CLA 2010) (pp. 162–174).
Cambria, E., Grassi, M., Hussain, A., & Havasi, C. (2011a). Sentic computing for social media marketing. Multimedia tools and applications (pp. 1–21). doi: http://dx.doi.org/10.1007/s11042-011-0815-0.
Cambria, E., Hupont, I., Hussain, A., Cerezo, E., & Baldassarri, S. (2011b). Sentic avatar: Multimodal affective conversational agent with common sense. In A. Esposito, A. Esposito, R. Martone, V. Müller, & G. Scarpetta (Eds.), Toward autonomous, adaptive, and context-aware multimodal interfaces. Theoretical and practical issues, Lecture Notes in Computer Science (Vol. 6456, pp. 81–95). Springer, Berlin, Heidelberg.
Cambria, E., Mazzocco, T., Hussain, A., & Eckl, C. (2011c). Sentic Medoids: Organizing affective common sense knowledge in a multi-dimensional vector space. In D. Liu, H. Zhang, M. Polycarpou, C. Alippi, & H. He (Eds.), Advances in neural networks - ISNN 2011, Lecture Notes in Computer Science (Vol. 6677, pp. 601–610). Springer, Berlin, Heidelberg.
Carpineto, C., & Romano, G. (2009). Odp239 dataset. http://credo.fub.it/odp239/. Last Visit March 2011.
Faguo, Z., Fan, Z., Bingru, Y., & Xingang, Y. (2010). Research on Short text classification algorithm based on statistics and rules. In Proceedings of the 2010 third international symposium on electronic commerce and security (pp. 3–7). Washington, DC: IEEE Computer Society, ISECS ’10.
Fellbaum, C. (1998). WordNet: An electronic lexical database. Massachusetts: MIT Press.
Garcés, P., Olivas, J., & Romero, F. (2006). Concept-matching IR Systems versus Word-matching information retrieval systems: Considering fuzzy interrelations for indexing Web Pages. Journal of the American Society for Information Science & Technology, 57(4), 564–576.
Gliozzo, A., Strapparava, C., & Dagan, I. (2005). Investigating unsupervised learning for text categorization bootstrapping. In Proceedings of the conference on HLT ’05, association for computational linguistics (pp. 129–136). Morristown, NJ.
Grassi, M., Cambria, E., Hussain, A., & Piazza, F. (2011). Sentic web: A new paradigm for managing social media affective information. Cognitive Computation, 3, 480–489.
Gruber, T. R. (1995). Toward principles for the design of ontologies used for knowledge sharing. International Journal of Human Computer Studies, 43(5-6), 907–928.
Ha-Thuc, V., & Renders, J. M. (2011). Large-scale hierarchical text classification without labelled data. In Proceedings of the fourth ACM international conference on Web search and data mining, ACM (pp. 685–694). New York, NY, WSDM ’11.
Havasi, C., Speer, R., & Alonso, J. (2007). ConceptNet3: A flexible, multilingual semantic network for common sense knowledge. In Proceedings of recent advances on natural language processing.
Hrebicek, J., & Kubasek, M. (2004). EnviWeb and environmental web services: Case study of an environmental web portal. In Environmental online communication, advanced information and knowledge processing series (pp. 21–24). London: Springer.
Julián-Iranzo, P. (2008). A procedure for the construction of a similarity relation. In L. Magdalena & J. V. M. Ojeda-Aciego (Eds.), Proceedings of the IPMU 2008, June 22–27, 2008, Torremolinos (Málaga), Spain (pp. 489–496). U. Málaga.
Julián-Iranzo, P., & Rubio-Manzano, C. (2009a). A declarative semantics for Bousi∼Prolog. In A. Porto, & F. J. López-Fraguas (Eds.), Proceedings of the 11th international ACM SIGPLAN symposium on principles and practice of declarative programming (pp. 149–160). September 7–9, 2009, Coimbra, Portugal (PPDP’2009), ACM.
Julián-Iranzo, P., & Rubio-Manzano, C. (2009b). A similarity-based WAM for Bousi∼Prolog. In J. Cabestany et al. (Eds.), Bio-Inspired Systems: Computational and ambient intelligence (Vol. 5517, pp. 245–252). 10th international work-conference on artificial neural networks IWANN 2009, Salamanca, Spain, June 10–12, 2009, Proceedings, Part I, Springer, Lecture Notes in Computer Science.
Julián-Iranzo, P., & Rubio-Manzano, C. (2010). An efficient fuzzy unification method and its implementation into the Bousi∼Prolog system. In 2010 IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2010).
Julián-Iranzo, P., Rubio-Manzano, C., & Gallardo-Casero, J. (2009). Bousi∼Prolog: A prolog extension language for flexible query answering. Electronic Notes in Theoretical Computer Science, 248, 131–147.
Ko, Y., & Seo, J. (2004). Learning with unlabeled data for text categorization using bootstrapping and feature projection techniques. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics, Stroudsburg, PA, ACL ’04.
Ko, Y., & Seo, J. (2009). Text classification from unlabeled documents with bootstrapping and feature projection techniques. Information Processing & Management, 45(1), 70–83.
Le, D. N., & Goh, A. (2007). Current practices in measuring ontological concept similarity. In Proceedings of the third international conference on semantics, knowledge and grid (pp. 266–269).
Lenat, D. B. (1995). CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11), 33–38.
Liu, H., & Singh, P. (2004). Commonsense reasoning in and over natural language. Proceedings of the 8th international conference on knowledge-based intelligent information and engineering systems (KES-2004) (pp. 293–306).
Liu, J., Birnbaum, L., & Pardo, B. (2008). Categorizing Blogger’s Interests based on Short Snippets of Blog Posts. In Proceedings of the 17th ACM conference on information and knowledge management, ACM (pp. 1525–1526). New York, NY, CIKM ’08.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). An introduction to information retrieval. Cambridge, England: Cambridge University Press.
Meretakis, D., Fragoudis, D., Lu, D., & Likothanassis, S. (2000). Scalable association based text classification. In Proceedings of 9th ACM international conference of information and knowledge management (pp. 5–11). Washington, USA.
Nigam, K. P., McCallum, A., Thrun, S., & Mitchell, T. (1998). Learning to classify text from labeled and unlabeled documents. In Proceedings of the AAAI-98 conference (pp. 792–799). California: AAAI Press.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Shadbolt, N., Berners-Lee, T., & Hall, W. (2006). The semantic web revisited. IEEE Intelligent Systems, 21(3), 96–101.
Song, Y., Wang, H., Wang, Z., & Li, H. (2011). Short text conceptualization using a probabilistic knowledgebase. Tech. Rep. MSR-TR-2011-26, Microsoft Research Asia.
Soto, A., Olivas, J., & Prieto, M. (2008). Fuzzy approach of synonymy and polysemy for information retrieval. In R. Bello, R. Falcón, W. Pedrycz, & J. Kacprzyk (Eds.), Granular computing: At the junction of rough sets and fuzzy sets, studies in fuzziness and soft computing (Vol. 224, pp. 179–198). Berlin, Heidelberg: Springer.
Speer, R., Havasi, C., & Lieberman, H. (2008). AnalogySpace: Reducing the dimensionality of common sense knowledge. In Proceedings of the 23rd national conference on Artificial intelligence (pp. 548–553). California: AAAI Press.
Sriram, B., Fuhry, D., Demir, E., Ferhatosmanoglu, H., & Demirbas, M. (2010). Short text classification in twitter to improve information filtering. In Proceedings of the 33rd international ACM SIGIR conference on research and development in information retrieval, ACM (pp. 841–842). New York, NY, SIGIR ’10.
Suchanek, F. M., Kasneci, G., & Weikum, G. (2008). YAGO: A large ontology from wikipedia and wordnet. Web Semantics, 6, 203–217.
Van Rijsbergen, C. (1979). Information retrieval. London, UK: Butterworth.
Warshall, S. (1962). A theorem on boolean matrices. Journal of the ACM, 9, 11–12.
Wu, Z., & Palmer, M. (1994). Verbs semantics and lexical selection. In Proceedings of the 32nd annual meeting on association for computational linguistics, association for computational linguistics (pp. 133–138). Morristown, NJ.
Yang, Y., & Liu, X. (1999). A re-examination of text categorization methods. In SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, ACM (pp. 42–49). New York, NY.
Zadeh, L. (1965). Fuzzy sets. Information and Control, 8, 338–353.
Zelikovitz, S., & Hirsh, H. (2000). Improving short text classification using unlabeled background knowledge to assess document similarity. In Proceedings of the seventeenth international conference on machine learning (pp. 1183–1190).

Acknowledgments

This research was partially supported by the Spanish Ministry of Science and Innovation (MEC) under projects TIN2007-67494 and TIN2010-20395, and by the Regional Government of Castilla-La Mancha under projects PEIC09-0196-3018 and PII1I09-0117-4481.

Author information

Authors and Affiliations

Department of Information Technologies and Systems, University of Castilla La Mancha, Paseo de la Universidad, 4, 13071, Ciudad Real, Spain
Francisco P. Romero, Pascual Julián-Iranzo, Mateus Ferreira-Satler & Juan Gallardo-Casero
Department of Computer Science, Universidad Autónoma del Carmen, Ciudad del Carmen, CP 24160, Campeche, México
Andrés Soto

Corresponding author

Correspondence to Francisco P. Romero.

About this article

Cite this article

Romero, F.P., Julián-Iranzo, P., Soto, A. et al. Classifying unlabeled short texts using a fuzzy declarative approach. Lang Resources & Evaluation 47, 151–178 (2013). https://doi.org/10.1007/s10579-012-9203-2

Issue Date: March 2013
DOI: https://doi.org/10.1007/s10579-012-9203-2
