A Faceted Crawler for the Twitter Service

  • Conference paper
Web Information Systems Engineering – WISE 2014 (WISE 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8787)

Abstract

Researchers now have at their disposal valuable data from social networking applications, of which Twitter and Facebook are the most prominent examples. To retrieve this content, the Twitter service provides two distinct Application Programming Interfaces (APIs), a probe-based one and a streaming one, each of which imposes different limitations on the data collection process. In this paper, we present a general architecture that facilitates faceted crawling of the service and thereby simplifies retrieval. We give implementation details of our system and provide a simple way to express the crawling process, i.e., the crawl flow. We experimentally evaluate the system on a variety of faceted crawls, demonstrating its efficacy for the online medium.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Valkanas, G., Saravanou, A., Gunopulos, D. (2014). A Faceted Crawler for the Twitter Service. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8787. Springer, Cham. https://doi.org/10.1007/978-3-319-11746-1_13

  • DOI: https://doi.org/10.1007/978-3-319-11746-1_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11745-4

  • Online ISBN: 978-3-319-11746-1

  • eBook Packages: Computer Science, Computer Science (R0)
