Abstract
The importance of the Internet as a communication medium is reflected in the large number of documents generated every day by users of the various online services. This has caused a massive change in the kinds of documents that are accessed and retrieved. In this article we study how Information Retrieval models should change to reflect the changes in the documents being processed. We analyse the properties of online user-generated documents from some of the most established services on the Internet (e.g. Kongregate, Twitter, Myspace and Slashdot) and compare them with a consolidated collection of standard information retrieval documents (e.g. Wall Street Journal, Associated Press, Financial Times). We study the statistical properties of these collections (e.g. Zipf’s Law and Heaps’ Law) and investigate other important features, such as document similarity, term burstiness, emoticons and part-of-speech analysis. We highlight the applicability and limits of traditional content analysis techniques for the new online user-generated documents and show that these documents need specific processing in order to support effective content analysis.
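As an illustration of the two statistical properties named in the abstract, the following minimal Python sketch computes the data points one would plot to inspect Zipf’s Law (term frequency against frequency rank) and Heaps’ Law (vocabulary size against number of tokens seen). It is not the authors' code: the function names, the regular-expression tokenizer and the toy sample sentence are assumptions made purely for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep runs of letters, digits and apostrophes.
    return re.findall(r"[a-z0-9']+", text.lower())


def rank_frequency(tokens):
    # Zipf's Law view: (rank, frequency) pairs, most frequent term first.
    counts = Counter(tokens)
    ranked = sorted(counts.values(), reverse=True)
    return [(rank, freq) for rank, freq in enumerate(ranked, start=1)]


def vocabulary_growth(tokens):
    # Heaps' Law view: (tokens seen so far, distinct terms seen so far).
    seen = set()
    growth = []
    for n, tok in enumerate(tokens, start=1):
        seen.add(tok)
        growth.append((n, len(seen)))
    return growth


# Toy input standing in for a real collection (an assumption, not real data).
sample = "the cat sat on the mat and the dog sat on the log"
tokens = tokenize(sample)
zipf_points = rank_frequency(tokens)
heaps_points = vocabulary_growth(tokens)
```

On a real collection, plotting `zipf_points` and `heaps_points` on log-log axes would show how closely the collection follows each law. Short, noisy user-generated texts may deviate from the curves observed for edited news collections.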
© 2014 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Inches, G., Crestani, F. (2014). An Introduction to the Novel Challenges in Information Retrieval for Social Media. In: Ferro, N. (eds) Bridging Between Information Retrieval and Databases. PROMISE 2013. Lecture Notes in Computer Science, vol 8173. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54798-0_1
DOI: https://doi.org/10.1007/978-3-642-54798-0_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54797-3
Online ISBN: 978-3-642-54798-0
eBook Packages: Computer Science (R0)