A Faceted Crawler for the Twitter Service

Valkanas, George; Saravanou, Antonia; Gunopulos, Dimitrios

doi:10.1007/978-3-319-11746-1_13

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8787)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Researchers nowadays have at their disposal valuable data from social networking applications, of which Twitter and Facebook are the most prominent examples. To retrieve this content, the Twitter service provides two distinct Application Programming Interfaces (APIs), a probe-based one and a streaming one, each of which imposes different limitations on the data collection process. In this paper, we present a general architecture that facilitates faceted crawling of the service and thereby simplifies retrieval. We give implementation details of our system and provide a simple way to express the crawling process, i.e., the crawl flow. We experimentally evaluate the system on a variety of faceted crawls, demonstrating its efficacy for the online medium.
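
To make the notion of a crawl flow concrete, the following is a minimal sketch, not the authors' implementation. It assumes a crawl flow can be modelled as an ordered chain of steps, each of which probes a request/response API for every item produced by the previous step, under a fixed request quota. All names (`ProbeBudget`, `CrawlStep`, `run_crawl_flow`) and the in-memory `FOLLOWERS` and `TWEETS` data are hypothetical and stand in for real API calls.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class ProbeBudget:
    """Hypothetical request quota for a probe-based (request/response) API."""
    remaining: int

    def try_spend(self) -> bool:
        # Spend one request if any quota is left.
        if self.remaining <= 0:
            return False
        self.remaining -= 1
        return True


@dataclass
class CrawlStep:
    """One facet of the crawl: maps an input item to the items it yields."""
    name: str
    probe: Callable[[str], Iterable[str]]


def run_crawl_flow(seeds: List[str], steps: List[CrawlStep],
                   budget: ProbeBudget) -> List[str]:
    """Run the steps in order; the output of one step feeds the next."""
    frontier = list(seeds)
    for step in steps:
        next_frontier: List[str] = []
        for item in frontier:
            if not budget.try_spend():
                # Quota exhausted: stop probing in this step. A real crawler
                # would instead wait for the rate-limit window to reset.
                break
            next_frontier.extend(step.probe(item))
        frontier = next_frontier
    return frontier


# Hypothetical in-memory data standing in for API responses.
FOLLOWERS: Dict[str, List[str]] = {"alice": ["bob", "carol"]}
TWEETS: Dict[str, List[str]] = {"bob": ["t1"], "carol": ["t2", "t3"]}

# Assumed example flow: seed users -> their followers -> the followers' tweets.
FLOW = [
    CrawlStep("followers", lambda user: FOLLOWERS.get(user, [])),
    CrawlStep("tweets", lambda user: TWEETS.get(user, [])),
]

if __name__ == "__main__":
    print(run_crawl_flow(["alice"], FLOW, ProbeBudget(remaining=10)))
```

Expressing the flow as data (a list of steps) rather than as hard-coded control flow is one way to keep different faceted crawls interchangeable; the quota object illustrates why a probe-based API constrains how far a flow can proceed in one rate-limit window.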

Author information

Authors and Affiliations

Dept. of Informatics and Telecommunications, University of Athens, Athens, Greece
George Valkanas, Antonia Saravanou & Dimitrios Gunopulos

Editor information

Editors and Affiliations

University of New South Wales, Sydney, Australia
Boualem Benatallah
Boston University, Boston, MA, USA
Azer Bestavros
Aristotle University of Thessaloniki, Thessaloniki, Greece
Yannis Manolopoulos & Athena Vakali
Victoria University, Footscray, VIC, Australia
Yanchun Zhang

About this paper

Cite this paper

Valkanas, G., Saravanou, A., Gunopulos, D. (2014). A Faceted Crawler for the Twitter Service. In: Benatallah, B., Bestavros, A., Manolopoulos, Y., Vakali, A., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8787. Springer, Cham. https://doi.org/10.1007/978-3-319-11746-1_13

DOI: https://doi.org/10.1007/978-3-319-11746-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11745-4
Online ISBN: 978-3-319-11746-1
eBook Packages: Computer Science, Computer Science (R0)
