The BBC News Hunter: A Novel Crawler for BBC News

Wang, Mingxin; Wang, Ning; Wang, Boran; Tian, Can; Liang, Yanchun; Zhao, Guozhong; Han, Xiaosong

doi:10.1007/978-981-10-2098-8_26

Part of the book series: Communications in Computer and Information Science (CCIS, volume 624)

Included in the following conference series:

International Conference of Pioneering Computer Scientists, Engineers and Educators

Abstract

This paper proposes the BBC News Hunter, a crawler that distinguishes topic information on the BBC News website from the interfering information around it and extracts that topic information for use in social computing studies. The system consists of six subsystems: UI, Control, Download, Analysis, Storage, and Log. Numerical experiments on the BBC News website show satisfactory results, with acceptable average accuracy and efficiency.
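As a rough illustration of how the six subsystems named above might fit together, the following Python sketch wires a download step, an analysis step, a store, and a log under a simple control loop. It is not the authors' implementation: the class names `TopicExtractor` and `Hunter`, the rule of keeping `<h1>` and `<p>` text while skipping `<script>`, `<style>`, `<nav>`, `<footer>`, and `<aside>`, and the sample page are all assumptions made for this example. The download step is passed in as a function so the sketch runs without network access, and the calling code stands in for the UI.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, List


class TopicExtractor(HTMLParser):
    """Analysis (hypothetical rule): keep <h1>/<p> text, skip page chrome."""

    SKIP = {"script", "style", "nav", "footer", "aside"}  # assumed noise tags
    KEEP = {"h1", "p"}                                    # assumed topic tags

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.keep_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.KEEP:
            self.keep_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.KEEP and self.keep_depth:
            self.keep_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep text only inside a topic tag that is not nested in a noise tag.
        if text and self.keep_depth and not self.skip_depth:
            self.chunks.append(text)


@dataclass
class Hunter:
    download: Callable[[str], str]                         # Download: URL -> HTML
    storage: Dict[str, str] = field(default_factory=dict)  # Storage
    log: List[str] = field(default_factory=list)           # Log

    def analyse(self, html: str) -> str:
        parser = TopicExtractor()
        parser.feed(html)
        parser.close()
        return " ".join(parser.chunks)

    def run(self, urls: List[str]) -> Dict[str, str]:
        """Control: download, analyse, store, and log each URL in turn."""
        for url in urls:
            try:
                html = self.download(url)
            except Exception as exc:
                self.log.append(f"FAIL {url}: {exc}")
                continue
            self.storage[url] = self.analyse(html)
            self.log.append(f"OK {url}")
        return self.storage


# Usage with a made-up page; a real run would supply an HTTP download function.
SAMPLE = (
    "<html><nav><p>Home</p></nav><h1>Headline</h1><p>Body text.</p>"
    "<script>var x = 1;</script><footer>Share</footer></html>"
)
hunter = Hunter(download=lambda url: SAMPLE)
print(hunter.run(["https://example.invalid/news/1"]))
```

Passing the download function in keeps the analysis rule testable against fixed HTML, which is one plausible way to measure extraction accuracy separately from crawling efficiency.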

Acknowledgements

This work was supported by the National Science Foundation of China under Grants 61503150, 61472158, and 61572228, and by the 2015 Annual Innovation Training Program of Jilin University under Grant 2015540784.

Author information

Authors and Affiliations

College of Software, Jilin University, Changchun, 130012, China
Mingxin Wang, Ning Wang, Boran Wang & Can Tian
Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
Yanchun Liang & Xiaosong Han
Zhuhai Laboratory of Key Laboratory for Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Jilin University, Zhuhai, 519041, China
Yanchun Liang
Daqing Oilfield Personnel Development Institute, CNPC, Daqing, 163000, China
Guozhong Zhao & Xiaosong Han

Corresponding author

Correspondence to Xiaosong Han.

Editor information

Editors and Affiliations

Harbin Institute of Technology, Harbin, China
Wanxiang Che
Harbin Engineering University, Harbin, China
Qilong Han
Harbin Institute of Technology, Harbin, China
Hongzhi Wang
Northeast Forestry University, Harbin, China
Weipeng Jing
National University of Defense Technology, Changsha, China
Shaoliang Peng
Harbin Engineering University, Harbin, China
Junyu Lin
Harbin Univ. of Science and Technology, Harbin, China
Guanglu Sun
Harbin Univ. of Science and Technology, Harbin, China
Xianhua Song
Harbin Engineering University, Harbin, China
Hongtao Song
Harbin Sea of Clouds & Computer Tech., Harbin, China
Zeguang Lu

About this paper

Cite this paper

Wang, M. et al. (2016). The BBC News Hunter: A Novel Crawler for BBC News. In: Che, W., et al. Social Computing. ICYCSEE 2016. Communications in Computer and Information Science, vol 624. Springer, Singapore. https://doi.org/10.1007/978-981-10-2098-8_26

DOI: https://doi.org/10.1007/978-981-10-2098-8_26
Published: 31 July 2016
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-2097-1
Online ISBN: 978-981-10-2098-8
eBook Packages: Computer Science, Computer Science (R0)
