An Introduction to the Novel Challenges in Information Retrieval for Social Media

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8173)

Abstract

The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of the many services available online. This has caused a massive change in the documents being accessed and retrieved. In this article we study how Information Retrieval models should change to reflect the changes in the documents being processed. We analyse the properties of online user-generated documents from some of the most established services on the Internet (e.g. Kongregate, Twitter, Myspace and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (e.g. Wall Street Journal, Associated Press, Financial Times). We study the statistical properties of these collections (e.g. Zipf’s Law and Heaps’ Law) and investigate other important features, such as document similarity, term burstiness, emoticons and part-of-speech analysis. We highlight the applicability and limits of traditional content analysis techniques for the new online user-generated documents and show the need for specific processing of those documents in order to provide effective content analysis.
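As an illustration of the two statistical laws named in the abstract, the following is a minimal Python sketch, not the authors' procedure. Zipf’s Law relates a term's frequency to its rank (frequency is roughly inversely proportional to rank), and Heaps’ Law describes how vocabulary size V grows with the number of tokens n (roughly V = K·n^β). The whitespace tokenizer, the function names and the sample sentence are all assumptions made for the example; real user-generated text would need more careful tokenization.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately naive assumption)."""
    return text.lower().split()


def zipf_table(tokens):
    """Return (rank, term, frequency) triples, most frequent term first.

    Ties in frequency are broken alphabetically so the output is stable.
    """
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, freq) for rank, (term, freq) in enumerate(ranked, start=1)]


def heaps_curve(tokens):
    """Return the vocabulary size observed after each token: V(1), ..., V(N)."""
    seen = set()
    curve = []
    for tok in tokens:
        seen.add(tok)
        curve.append(len(seen))
    return curve


# Hypothetical toy input; any collection of documents could be used instead.
sample = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(sample)
table = zipf_table(tokens)   # rank-frequency data for a Zipf plot
curve = heaps_curve(tokens)  # vocabulary growth data for a Heaps plot
```

Plotting `table` (rank against frequency) and `curve` (token count against vocabulary size) on log-log axes for two collections is one simple way to compare how closely each follows the two laws.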




© 2014 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Inches, G., Crestani, F. (2014). An Introduction to the Novel Challenges in Information Retrieval for Social Media. In: Ferro, N. (eds) Bridging Between Information Retrieval and Databases. PROMISE 2013. Lecture Notes in Computer Science, vol 8173. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54798-0_1

  • DOI: https://doi.org/10.1007/978-3-642-54798-0_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-54797-3

  • Online ISBN: 978-3-642-54798-0

  • eBook Packages: Computer Science, Computer Science (R0)
