Overview and semantic issues of text mining

Published: 01 September 2007

Abstract

Text mining refers to the discovery of previously unknown knowledge that can be found in text collections. In recent years, the text mining field has received great attention due to the abundance of textual data. Researchers in this area must cope with issues arising from the particularities of natural language. This survey discusses such semantic issues along with the approaches and methodologies proposed in the existing literature. It covers syntactic matters and tokenization concerns, and it focuses on the text representation techniques, categorisation tasks and similarity measures that have been proposed.

Published In

ACM SIGMOD Record, Volume 36, Issue 3
September 2007
52 pages
ISSN: 0163-5808
DOI: 10.1145/1324185

Publisher

Association for Computing Machinery

New York, NY, United States
