Abstract
This paper describes a focused crawler for monitoring social networks, used for information extraction and content analysis. The crawler implements the MapReduce model for distributed computation and is designed for large volumes of text data. As a focused crawler, it searches for pages that a classifier marks as relevant to a specified topic. The classifier is built on a knowledge database that defines words, their classes, and the rules for joining words into phrases. From the weights of the words and phrases found in a text, the system computes a text weight that indicates the text's relevance to the topic. The system was used to detect a drug community in the Russian segment of the LiveJournal social network; the knowledge database was developed from official and slang drug terminology. Different aspects of the knowledge database and the classifier are studied. A non-homogeneous Poisson process was used to model how blogs change, since it supports a monitoring policy that accounts for both blog update frequency and the time-of-day effect. Evaluation on real data shows a 25% increase in the detection of new posts.
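To make the monitoring idea concrete, the following is a minimal sketch of how a non-homogeneous Poisson process could drive the choice of which blog to revisit. It is not the authors' implementation. It assumes a piecewise-constant posting rate, constant within each hour of the day, and the names `expected_posts`, `prob_new_post`, `pick_blog_to_visit`, `daily_rate`, and `hourly_profile` are hypothetical and introduced only for illustration.

```python
import math


def expected_posts(daily_rate, hourly_profile, start_hour, end_hour):
    """Expected number of new posts in [start_hour, end_hour].

    Assumption (not from the paper): the rate is piecewise constant,
    lambda(t) = daily_rate * hourly_profile[hour of day], where
    hourly_profile holds 24 non-negative shares that sum to 1.
    Hours are counted from midnight of day 0 and may exceed 24.
    """
    if len(hourly_profile) != 24:
        raise ValueError("hourly_profile must have 24 entries")
    if end_hour < start_hour:
        raise ValueError("end_hour must not precede start_hour")
    total = 0.0
    t = float(start_hour)
    while t < end_hour:
        hour_index = math.floor(t)
        # Integrate up to the next hour boundary or the interval end.
        segment_end = min(hour_index + 1, end_hour)
        rate = daily_rate * hourly_profile[hour_index % 24]  # posts/hour
        total += rate * (segment_end - t)
        t = segment_end
    return total


def prob_new_post(daily_rate, hourly_profile, start_hour, end_hour):
    """Probability of at least one new post: 1 - exp(-Lambda)."""
    big_lambda = expected_posts(daily_rate, hourly_profile,
                                start_hour, end_hour)
    return -math.expm1(-big_lambda)


def pick_blog_to_visit(blogs, now_hour):
    """Return the blog most likely to have a new post since its last visit.

    blogs maps a blog name to (daily_rate, hourly_profile, last_visit_hour).
    """
    return max(
        blogs,
        key=lambda name: prob_new_post(blogs[name][0], blogs[name][1],
                                       blogs[name][2], now_hour),
    )


if __name__ == "__main__":
    uniform = [1 / 24] * 24
    evening_only = [0.0] * 12 + [1 / 12] * 12  # posts only from 12:00 on
    blogs = {
        "steady_blog": (6, uniform, 0),
        "evening_blog": (6, evening_only, 0),
    }
    # At noon the evening blog cannot have posted yet under this model.
    print(pick_blog_to_visit(blogs, now_hour=12))
```

Both example blogs have the same average daily rate, so a policy based on update frequency alone could not tell them apart. The hourly profile is what lets the sketch postpone a visit to a blog whose author is usually inactive at that time of day.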
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yakushev, A.V., Boukhanovsky, A.V., Sloot, P.M.A. (2013). Topic Crawler for Social Networks Monitoring. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and the Semantic Web. KESW 2013. Communications in Computer and Information Science, vol 394. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41360-5_17
DOI: https://doi.org/10.1007/978-3-642-41360-5_17
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41359-9
Online ISBN: 978-3-642-41360-5
eBook Packages: Computer Science, Computer Science (R0)