An Introduction to the Novel Challenges in Information Retrieval for Social Media

Inches, Giacomo; Crestani, Fabio

doi:10.1007/978-3-642-54798-0_1


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8173)

Abstract

The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of the various online services. This has caused a massive change in the documents being reached and retrieved. In this article we study how Information Retrieval models should change to reflect the changes in the documents being processed. We analyse the properties of online user-generated documents from some of the most established services on the Internet (e.g. Kongregate, Twitter, Myspace and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (e.g. Wall Street Journal, Associated Press, Financial Times). We study the statistical properties of these collections (e.g. Zipf’s Law and Heaps’ Law) and investigate other important features, such as document similarity, term burstiness, emoticons and part-of-speech analysis. We highlight the applicability and limits of traditional content analysis techniques for the new online user-generated documents and show the need for specific processing of those documents in order to provide effective content analysis.
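
As an illustration of the collection statistics named in the abstract, the following is a minimal Python sketch of how rank-frequency data (Zipf's Law, where term frequency falls off roughly as a power of rank) and vocabulary growth (Heaps' Law, where the number of distinct terms grows roughly as a power of the number of tokens) might be measured. It is not the authors' code: the whitespace tokenizer and all function names are assumptions made for this example, and real user-generated text would likely need more careful tokenization.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption for this sketch;
    # emoticons, URLs and punctuation are not treated specially).
    return text.lower().split()


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent term first (Zipf view)."""
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def vocabulary_growth(tokens):
    """Return (tokens_seen, distinct_terms) pairs (Heaps view)."""
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


def loglog_slope(points):
    """Least-squares slope of log(y) against log(x).

    Needs at least two points with distinct, positive x values.
    """
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

Under these assumptions, `loglog_slope(rank_frequency(tokens))` gives a rough estimate of the (negative) Zipf exponent for a collection, and `loglog_slope(vocabulary_growth(tokens))` a rough estimate of the Heaps exponent. Comparing such estimates between a newswire collection and a chat or microblog collection is one simple way to see whether the two kinds of documents behave differently.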



Author information

Authors and Affiliations

Faculty of Informatics, University of Lugano (USI), Lugano, Switzerland
Giacomo Inches & Fabio Crestani


Editor information

Editors and Affiliations

University of Padua, Italy
Nicola Ferro

About this chapter

Cite this chapter

Inches, G., Crestani, F. (2014). An Introduction to the Novel Challenges in Information Retrieval for Social Media. In: Ferro, N. (eds) Bridging Between Information Retrieval and Databases. PROMISE 2013. Lecture Notes in Computer Science, vol 8173. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54798-0_1

DOI: https://doi.org/10.1007/978-3-642-54798-0_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54797-3
Online ISBN: 978-3-642-54798-0
eBook Packages: Computer Science, Computer Science (R0)
