Abstract
This paper proposes the BBC News Hunter, a crawler that separates topic content from interfering, non-topic information on the BBC news website and extracts it for use in social computing research. The system consists of six subsystems: UI, Control, Download, Analysis, Storage, and Log. Numerical experiments on the BBC news website show satisfactory results, with acceptable average accuracy and efficiency.
Acknowledgements
This work was supported by the National Science Foundation of China under Grants 61503150, 61472158, and 61572228, and by the 2015 Annual Innovation Training Program of Jilin University under Grant 2015540784.
Copyright information
© 2016 Springer Science+Business Media Singapore
About this paper
Cite this paper
Wang, M. et al. (2016). The BBC News Hunter: A Novel Crawler for BBC News. In: Che, W., et al. Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol 624. Springer, Singapore. https://doi.org/10.1007/978-981-10-2098-8_26
DOI: https://doi.org/10.1007/978-981-10-2098-8_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2097-1
Online ISBN: 978-981-10-2098-8
eBook Packages: Computer Science, Computer Science (R0)