An Efficient Webpage Classification Algorithm Based on LSH

  • Conference paper
Intelligent Computation in Big Data Era (ICYCSEE 2015)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 503))

Abstract

With the explosive growth of information on the Internet, retrieving timely and relevant information is increasingly important. This places higher demands on the speed of webpage classification, which is a common method for retrieving and managing information. To obtain a more efficient classifier, this paper proposes a webpage classification method based on locality-sensitive hashing (LSH). The method comprises three innovative modules: building a feature dictionary, mapping feature vectors to fingerprints using LSH, and extending webpage features. Comparative results show that the proposed algorithm achieves better performance in less time than the naïve Bayes classifier.
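To make the first two modules in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simhash-style fingerprint, which is one common LSH scheme for cosine similarity; the abstract does not say which LSH family the paper uses. The 64-bit width, the MD5-based term hash, the whitespace tokenizer, the nearest-fingerprint decision rule, and all function names are illustrative assumptions. The feature-extension module is omitted because the abstract gives no detail about it.

```python
import hashlib
from collections import Counter

BITS = 64  # fingerprint width; an assumption, not taken from the paper


def term_hash(term):
    """Stable 64-bit hash of a term (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(term.encode("utf-8")).digest()[:8], "big")


def fingerprint(weights):
    """Simhash: each term adds +w or -w per bit; positive totals become 1-bits."""
    totals = [0.0] * BITS
    for term, w in weights.items():
        h = term_hash(term)
        for i in range(BITS):
            totals[i] += w if (h >> i) & 1 else -w
    return sum(1 << i for i in range(BITS) if totals[i] > 0)


def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def featurize(text, dictionary):
    """Term-frequency vector restricted to terms in the feature dictionary."""
    return dict(Counter(t for t in text.lower().split() if t in dictionary))


def train(labelled_pages, dictionary):
    """One fingerprint per class, built from the summed term counts of its pages."""
    totals = {}
    for label, text in labelled_pages:
        totals.setdefault(label, Counter()).update(featurize(text, dictionary))
    return {label: fingerprint(counts) for label, counts in totals.items()}


def classify(text, dictionary, class_fps):
    """Assign the class whose fingerprint is nearest in Hamming distance."""
    fp = fingerprint(featurize(text, dictionary))
    return min(class_fps, key=lambda label: hamming(fp, class_fps[label]))


# Hypothetical toy data, for illustration only.
dictionary = {"goal", "team", "match", "stock", "market", "bank"}
pages = [
    ("sports", "goal team match"),
    ("finance", "stock market bank"),
]
class_fps = train(pages, dictionary)
label = classify("the team won the match with a late goal", dictionary, class_fps)
```

Comparing short integer fingerprints with XOR and a bit count is what would make such a classifier fast at prediction time; whether the paper compares fingerprints this way is not stated in the abstract.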

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, J., Sun, H., Ding, Z. (2015). An Efficient Webpage Classification Algorithm Based on LSH. In: Wang, H., et al. Intelligent Computation in Big Data Era. ICYCSEE 2015. Communications in Computer and Information Science, vol 503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46248-5_31

  • DOI: https://doi.org/10.1007/978-3-662-46248-5_31

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-46247-8

  • Online ISBN: 978-3-662-46248-5

  • eBook Packages: Computer Science, Computer Science (R0)
