Overview and semantic issues of text mining

Authors:

Anna Stavrianou,

Periklis Andritsos,

Nicolas Nicoloyannis

ACM SIGMOD Record, Volume 36, Issue 3

Pages 23 - 34

https://doi.org/10.1145/1324185.1324190

Published: 01 September 2007

Abstract

Text mining refers to the discovery of previously unknown knowledge in text collections. In recent years, the field has received great attention due to the abundance of textual data. A researcher in this area must cope with issues that arise from the particularities of natural language. This survey discusses such semantic issues along with the approaches and methodologies proposed in the existing literature. It covers syntactic matters and tokenization concerns, and it focuses on the text representation techniques, categorisation tasks and similarity measures that have been suggested.
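To make three of the topics named in the abstract concrete (tokenization, text representation and similarity measures), here is a minimal sketch that is not taken from the paper. It assumes a simple regex tokenizer, a bag-of-words term-frequency representation, and cosine similarity as the measure. The function names `tokenize`, `bag_of_words` and `cosine_similarity` are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its runs of ASCII letters as tokens.

    Assumption: a deliberately naive tokenizer; digits, hyphens and
    non-ASCII letters are treated as separators.
    """
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text):
    """Represent a document as a mapping from term to raw term frequency."""
    return Counter(tokenize(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


doc1 = bag_of_words("Text mining discovers knowledge in text collections.")
doc2 = bag_of_words("Mining text collections for unknown knowledge.")
doc3 = bag_of_words("Vertical partitioning of relational tables.")

print(cosine_similarity(doc1, doc2))  # shares several terms with doc1
print(cosine_similarity(doc1, doc3))  # shares no terms with doc1
```

A bag-of-words vector ignores word order and word sense, so two documents that express the same idea with different vocabulary score as dissimilar. That limitation is one way to see why the semantic issues the survey discusses matter.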

Published In

ACM SIGMOD Record Volume 36, Issue 3

September 2007

52 pages

ISSN: 0163-5808

DOI: 10.1145/1324185

Copyright © 2007 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
