A Vertical Search Engine for School Information Based on Heritrix and Lucene

Lee, Hyo-Bong; Nazareno, Franco; Jung, Seung-Hyun; Cho, Wan-Sup

doi:10.1007/978-3-642-24082-9_42


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6935)

Included in the following conference series:

International Conference on Hybrid Information Technology


Abstract

Web content is growing exponentially as Internet applications and services continue to develop rapidly. While people enjoy the convenience of the Internet, they face the problem of finding useful information quickly and accurately within this vast body of content. The immediate response to this problem is the web search engine. We developed a vertical search engine for a specific domain, in this case a university. The search engine consists of three components: a crawler, an indexer, and a searcher. The crawler component is implemented with the Heritrix crawler, which is based on the mechanism of recursion and archiving. In the indexer component, a reusable, extensible subsystem for building and managing the index is designed and implemented with the open-source package Lucene. In an experiment on the Chungbuk National University web sites, the system retrieved, on average over a typical keyword set, more than 400 times as many documents as Google or the university's own search engine.
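The division of labor between crawler, indexer, and searcher can be illustrated with a minimal sketch. It is not the authors' implementation and uses neither the Heritrix nor the Lucene API: the crawl result is a hypothetical in-memory dictionary (`PAGES`, with made-up `example.edu` URLs), and the index is a plain inverted index built with the Python standard library. The function names `tokenize`, `build_index`, and `search` are assumptions introduced only for this illustration.

```python
import re
from collections import defaultdict

# Hypothetical crawl result: URL -> page text. In the system described above,
# pages would come from a crawler such as Heritrix; here they are hard-coded.
PAGES = {
    "http://example.edu/admissions": "Admissions schedule for graduate school",
    "http://example.edu/library": "Library opening hours and graduate study rooms",
    "http://example.edu/news": "Campus news and events",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Indexer: map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Searcher: return the sorted URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Intersect the posting sets of all terms (boolean AND semantics).
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(PAGES)
results = search(index, "graduate")
```

Here `search(index, "graduate")` returns the admissions and library pages, while `search(index, "graduate library")` narrows the result to the library page. A production indexer such as Lucene adds analysis, ranking, and on-disk index management on top of this basic term-to-document mapping.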




Author information

Authors and Affiliations

Dept. of Management Information Systems, u-BIZ BK21, Chungbuk National University, Korea
Hyo-Bong Lee & Wan-Sup Cho
Dept. of Bio-Information Technology, Chungbuk National University, Korea
Franco Nazareno
Dept. of Information Industrial Engineering, Chungbuk National University, Korea
Seung-Hyun Jung


Editor information

Editors and Affiliations

Computer Engineering Department, Hannam University, 70 Hannamro, Daedeuk-gu, Daejeon, Korea
Geuk Lee
QinetiQ Company Fellow, Howard Science Limited, 24 Sunrise, WR14 2NJ, Malvern, UK
Daniel Howard
University of Warsaw, 02-097, Warsaw, Poland
Dominik Ślęzak


About this paper

Cite this paper

Lee, HB., Nazareno, F., Jung, SH., Cho, WS. (2011). A Vertical Search Engine for School Information Based on Heritrix and Lucene. In: Lee, G., Howard, D., Ślęzak, D. (eds) Convergence and Hybrid Information Technology. ICHIT 2011. Lecture Notes in Computer Science, vol 6935. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24082-9_42


DOI: https://doi.org/10.1007/978-3-642-24082-9_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-24081-2
Online ISBN: 978-3-642-24082-9
eBook Packages: Computer Science, Computer Science (R0)
