research-article

SOFSAT: Towards a Setlike Operator based Framework for Semantic Analysis of Text

Authors: Shubhra Kanti Karmaker Santu, Duncan Ferguson, Mary Kalantzis, Duane Searsmith, Chengxiang Zhai

ACM SIGKDD Explorations Newsletter, Volume 20, Issue 2

Pages 21-30

https://doi.org/10.1145/3299986.3299990

Published: 11 December 2018

Abstract

As data reported by humans about our world, text data play a very important role in all data mining applications, yet how to develop a general text analysis system to support all text mining applications is a difficult challenge. In this position paper, we introduce SOFSAT, a new framework that can support set-like operators for semantic analysis of natural text data with variable text representations. It includes three basic set-like operators (TextIntersect, TextUnion, and TextDifference) that are analogous to the corresponding set operators intersection, union, and difference, respectively. These operators can be applied to any representation of text data, and different representations can be combined via transformation functions that map text to and from any representation. Just as the set operators can be flexibly combined iteratively to construct arbitrary subsets or supersets based on some given sets, we show that the corresponding text analysis operators can also be combined flexibly to support a wide range of analysis tasks that may require different workflows, thus enabling an application developer to "program" a text mining application by using SOFSAT as an application programming language for text analysis. We discuss instantiations and implementation strategies of the framework with some specific examples, present ideas about how the framework can be implemented by exploiting/extending existing techniques, and provide a roadmap for future research in this new direction.
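To make the idea of composable set-like operators concrete, the following is a minimal sketch under strong assumptions, not the instantiation proposed in the paper. It assumes the simplest possible representation (a set of lowercase word tokens), and every name in it (`to_rep`, `from_rep`, `text_intersect`, `text_union`, `text_difference`) is hypothetical and chosen only for illustration. A semantic instantiation would replace the token-set representation and the exact-match set operations with something far richer.

```python
import re
from typing import FrozenSet

# Assumed representation for this sketch: a set of lowercase word tokens.
Rep = FrozenSet[str]


def to_rep(text: str) -> Rep:
    """Hypothetical transformation function: raw text -> token-set representation."""
    return frozenset(re.findall(r"[a-z0-9']+", text.lower()))


def from_rep(rep: Rep) -> str:
    """Hypothetical transformation function: representation -> text.

    Tokens are sorted so the output is deterministic; word order is lost.
    """
    return " ".join(sorted(rep))


def text_intersect(a: Rep, b: Rep) -> Rep:
    """Content present in both inputs (exact token match in this sketch)."""
    return a & b


def text_union(a: Rep, b: Rep) -> Rep:
    """Content present in either input."""
    return a | b


def text_difference(a: Rep, b: Rep) -> Rep:
    """Content present in the first input but not in the second."""
    return a - b


review_a = to_rep("The battery life is great but the screen is dim")
review_b = to_rep("Great battery life and a bright screen")
review_c = to_rep("The screen is dim and the speakers are weak")

# Compose operators: what A and B agree on that C does not mention.
result = text_difference(text_intersect(review_a, review_b), review_c)
print(from_rep(result))  # battery great life
```

Because each operator takes and returns the same representation type, calls can be nested freely, which is the property that lets a developer "program" an analysis workflow out of a small operator vocabulary.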

Cited By

Sarkar S, Karmaker S (2022). Concept Annotation from Users Perspective: A New Challenge. Companion Proceedings of the Web Conference 2022, pages 1180-1188. Online publication date: 25-Apr-2022.
https://dl.acm.org/doi/10.1145/3487553.3524933
Cope B, Kalantzis M (2020). Futures for research in education. Educational Philosophy and Theory, 54:11, pages 1732-1739. Online publication date: 22-Sep-2020.
https://doi.org/10.1080/00131857.2020.1824781
Cope B, Kalantzis M, Searsmith D (2020). Artificial intelligence for education: Knowledge and its assessment in AI-enabled learning ecologies. Educational Philosophy and Theory, pages 1-17. Online publication date: 18-Feb-2020.
https://doi.org/10.1080/00131857.2020.1728732

Published In

ACM SIGKDD Explorations Newsletter, Volume 20, Issue 2

December 2018

30 pages

ISSN: 1931-0145

EISSN: 1931-0153

DOI: 10.1145/3299986

Editors: Hanghang Tong (Arizona State University), Xin Luna Dong (Google), Ankur Teredesai (University of Washington Tacoma), Reza Zafarani (Syracuse University)

Copyright © 2018 Authors.

Publisher

Association for Computing Machinery

New York, NY, United States
