research-article

SOFSAT: Towards a Setlike Operator based Framework for Semantic Analysis of Text

Published: 11 December 2018

Abstract

As data reported by humans about our world, text data play a very important role in all data mining applications, yet how to develop a general text analysis system to support all text mining applications is a difficult challenge. In this position paper, we introduce SOFSAT, a new framework that can support set-like operators for semantic analysis of natural text data with variable text representations. It includes three basic set-like operators (TextIntersect, TextUnion, and TextDifference) that are analogous to the corresponding set operators intersection, union, and difference, respectively, which can be applied to any representation of text data, and different representations can be combined via transformation functions that map text to and from any representation. Just as the set operators can be flexibly combined iteratively to construct arbitrary subsets or supersets based on some given sets, we show that the corresponding text analysis operators can also be combined flexibly to support a wide range of analysis tasks that may require different workflows, thus enabling an application developer to "program" a text mining application by using SOFSAT as an application programming language for text analysis. We discuss instantiations and implementation strategies of the framework with some specific examples, present ideas about how the framework can be implemented by exploiting/extending existing techniques, and provide a roadmap for future research in this new direction.
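
To make the idea of composable set-like text operators concrete, the following is a minimal sketch and not the SOFSAT implementation. It assumes the simplest possible representation, a set of lowercased word tokens, and every name in it (to_repr, from_repr, text_intersect, text_union, text_difference) is hypothetical. A semantic instantiation as envisioned in the abstract would swap in richer representations and transformation functions.

```python
from typing import FrozenSet

# Assumption: a text is represented as a set of lowercased word tokens.
# This is purely lexical; it only illustrates the operator interface.
Representation = FrozenSet[str]


def to_repr(text: str) -> Representation:
    """Transformation function: text -> representation."""
    tokens = (word.strip(".,;:!?").lower() for word in text.split())
    return frozenset(tokens) - {""}


def from_repr(rep: Representation) -> str:
    """Transformation function: representation -> text (sorted for determinism)."""
    return " ".join(sorted(rep))


def text_intersect(a: str, b: str) -> str:
    """Content shared by both texts (analogue of set intersection)."""
    return from_repr(to_repr(a) & to_repr(b))


def text_union(a: str, b: str) -> str:
    """Content found in either text (analogue of set union)."""
    return from_repr(to_repr(a) | to_repr(b))


def text_difference(a: str, b: str) -> str:
    """Content of the first text absent from the second (analogue of set difference)."""
    return from_repr(to_repr(a) - to_repr(b))


doc_a = "The cat sat."
doc_b = "A cat ran."

# Operators return text, so they compose: this is the symmetric difference.
unique_content = text_difference(text_union(doc_a, doc_b),
                                 text_intersect(doc_a, doc_b))
print(unique_content)
```

Because each operator takes text and returns text, the outputs can be fed back into other operators, which is the property that lets a developer chain them into larger workflows.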

Cited By

  • (2022) Concept Annotation from Users Perspective: A New Challenge. Companion Proceedings of the Web Conference 2022, pages 1180-1188. DOI: 10.1145/3487553.3524933. Online publication date: 25-Apr-2022
  • (2020) Futures for research in education. Educational Philosophy and Theory, 54:11, pages 1732-1739. DOI: 10.1080/00131857.2020.1824781. Online publication date: 22-Sep-2020
  • (2020) Artificial intelligence for education: Knowledge and its assessment in AI-enabled learning ecologies. Educational Philosophy and Theory, pages 1-17. DOI: 10.1080/00131857.2020.1728732. Online publication date: 18-Feb-2020

Published In

ACM SIGKDD Explorations Newsletter, Volume 20, Issue 2
December 2018
30 pages
ISSN: 1931-0145
EISSN: 1931-0153
DOI: 10.1145/3299986

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Intelligent Text Analysis
  2. Semantic Analysis
  3. Semantic Operator for Text
  4. Text Mining

Qualifiers

  • Research-article
