A Vertical Search Engine for School Information Based on Heritrix and Lucene

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6935)

Abstract

Web content is growing exponentially as Internet applications and services continue to expand. While people enjoy the convenience of the Internet, they face the problem of obtaining useful information from this vast content quickly and accurately. The immediate response to this problem is the Web search engine. We developed a vertical search engine for a specific domain, such as a university. The search engine consists of a Crawler, an Indexer, and a Searcher. The crawler component is implemented with the Heritrix crawler, which is based on a mechanism of recursion and archiving. In the indexer component, a reusable, extensible index construction and management subsystem is designed and implemented with the open-source package Lucene. In an experiment on the Chungbuk National University web sites, the system retrieved, on average over a typical keyword set, more than four hundred times as many documents as Google or the university's search engine.
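The three-component design described above (Crawler, Indexer, Searcher) can be illustrated with a minimal sketch. This is not the authors' implementation and uses neither Heritrix nor Lucene: the in-memory `PAGES` site, the `univ.example` URLs, and the function names `crawl`, `build_index`, and `search` are all hypothetical, and a simple inverted index stands in for a Lucene index.

```python
import re
from collections import defaultdict, deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
# A real crawler such as Heritrix would fetch these over HTTP.
PAGES = {
    "http://univ.example/": (
        "Welcome to the university library and admissions",
        ["http://univ.example/lib", "http://univ.example/adm",
         "http://other.example/"],
    ),
    "http://univ.example/lib": (
        "Library opening hours and catalog",
        ["http://univ.example/"],
    ),
    "http://univ.example/adm": (
        "Admissions schedule for graduate school",
        ["http://univ.example/lib"],
    ),
    "http://other.example/": (
        "Unrelated off-site page about library software",
        [],
    ),
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def crawl(seed, domain_prefix):
    """Crawler: follow links transitively from the seed, staying inside
    one domain, and archive the text of every page reached."""
    seen, queue, archive = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        text, links = PAGES.get(url, ("", []))
        archive[url] = text
        for link in links:
            if link.startswith(domain_prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive


def build_index(archive):
    """Indexer: build an inverted index mapping each term to its URLs."""
    index = defaultdict(set)
    for url, text in archive.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search(index, query):
    """Searcher: return the URLs that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(t, set()) for t in terms))
    return sorted(hits)


if __name__ == "__main__":
    archive = crawl("http://univ.example/", "http://univ.example/")
    index = build_index(archive)
    print(search(index, "library"))
```

Restricting the crawl to one URL prefix is what makes the sketch "vertical": the off-site page mentions "library" but is never archived, so it cannot appear in the results.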

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lee, HB., Nazareno, F., Jung, SH., Cho, WS. (2011). A Vertical Search Engine for School Information Based on Heritrix and Lucene. In: Lee, G., Howard, D., Ślęzak, D. (eds) Convergence and Hybrid Information Technology. ICHIT 2011. Lecture Notes in Computer Science, vol 6935. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24082-9_42

  • DOI: https://doi.org/10.1007/978-3-642-24082-9_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24081-2

  • Online ISBN: 978-3-642-24082-9

  • eBook Packages: Computer Science, Computer Science (R0)
