
Topic Crawler for Social Networks Monitoring

  • Conference paper
Knowledge Engineering and the Semantic Web (KESW 2013)

Abstract

This paper describes a focused crawler for monitoring social networks, used for information extraction and content analysis. The crawler implements the MapReduce model for distributed computation and is oriented toward large volumes of text data. As a focused crawler, it searches for pages classified as relevant to a specified topic. The classifier is built on a knowledge database that defines words, their classes, and rules for joining words into phrases. The weights of the words and phrases found in a text are combined into a text weight that indicates its relevance to the topic. The system was used to detect a drug community in the Russian segment of the LiveJournal social network; official and slang drug terminology was used to build the knowledge database. Different aspects of the knowledge database and the classifier are studied. A non-homogeneous Poisson process was used to model blog changes, since it allows building a monitoring policy that accounts for both blog update frequency and the time-of-day effect. Evaluation on real data shows a 25% increase in new-post detection.
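
To make the classifier idea concrete, here is a minimal sketch of a text weight computed from word weights and phrase-joining rules. The abstract does not give the actual formula, so everything here is an assumption for illustration: the `WORDS` and `PHRASE_RULES` tables, the sample entries, the normalisation by text length, and the `threshold` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Term:
    cls: str       # hypothetical word class, e.g. "action" or "substance"
    weight: float  # hypothetical relevance weight of the single word


# Hypothetical knowledge base: word -> class and weight.
WORDS = {
    "buy": Term("action", 0.2),
    "sell": Term("action", 0.2),
    "powder": Term("substance", 0.5),
    "dose": Term("quantity", 0.4),
}

# Hypothetical joining rules: a pair of adjacent word classes forms a
# phrase, which contributes an extra weight on top of the word weights.
PHRASE_RULES = {
    ("action", "substance"): 1.0,
    ("quantity", "substance"): 0.8,
}


def text_weight(text: str) -> float:
    """Sum word and phrase weights, normalised by the number of tokens."""
    tokens = text.lower().split()  # naive tokenisation, no punctuation handling
    if not tokens:
        return 0.0
    total = 0.0
    prev_cls = None
    for tok in tokens:
        term = WORDS.get(tok)
        if term is None:
            prev_cls = None  # an unknown word breaks any phrase in progress
            continue
        total += term.weight
        if prev_cls is not None:
            total += PHRASE_RULES.get((prev_cls, term.cls), 0.0)
        prev_cls = term.cls
    return total / len(tokens)


def is_relevant(text: str, threshold: float = 0.3) -> bool:
    """Classify a text as relevant when its weight reaches the threshold."""
    return text_weight(text) >= threshold
```

In this sketch a phrase match counts for more than the words alone, which is one plausible reading of why the knowledge database stores joining rules in addition to single terms.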
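
The monitoring idea can be sketched in the same spirit. For a non-homogeneous Poisson process with rate λ(t), the expected number of new posts in an interval is the integral of λ(t) over it, and the probability of at least one new post is 1 − exp(−that integral). The code below assumes a piecewise-constant rate with one value per hour of day, which is one simple way to capture both update frequency and the time-of-day effect. The function names, the hourly granularity, and the greedy "visit the blog with the most expected new posts" policy are assumptions, not the policy from the paper.

```python
import math


def expected_posts(hourly_rate, start, end):
    """Integrate a piecewise-constant rate over [start, end].

    hourly_rate: 24 values, posts per hour for each hour of the day.
    start, end: times in hours, counted from midnight of a reference day.
    """
    if len(hourly_rate) != 24:
        raise ValueError("hourly_rate must have 24 entries")
    total = 0.0
    t = start
    while t < end:
        hour_end = min(math.floor(t) + 1.0, end)  # stop at the next hour boundary
        total += hourly_rate[int(math.floor(t)) % 24] * (hour_end - t)
        t = hour_end
    return total


def prob_new_post(hourly_rate, start, end):
    """Probability of at least one new post in [start, end]."""
    return 1.0 - math.exp(-expected_posts(hourly_rate, start, end))


def pick_blog(blogs, now):
    """Greedy choice: the blog with the most expected new posts since its last visit.

    blogs: dict mapping blog name -> (hourly_rate, last_visit_time).
    """
    return max(blogs, key=lambda name: expected_posts(blogs[name][0], blogs[name][1], now))
```

For example, a blog whose author posts only in the evening accumulates no expected posts during the day, so a crawler following this sketch would not spend visits on it until the evening hours have passed.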




Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yakushev, A.V., Boukhanovsky, A.V., Sloot, P.M.A. (2013). Topic Crawler for Social Networks Monitoring. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and the Semantic Web. KESW 2013. Communications in Computer and Information Science, vol 394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41360-5_17


  • DOI: https://doi.org/10.1007/978-3-642-41360-5_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-41359-9

  • Online ISBN: 978-3-642-41360-5

  • eBook Packages: Computer Science, Computer Science (R0)
