Abstract
With the explosive growth of information on the Internet, retrieving timely and relevant information is increasingly important. This places higher demands on the speed of webpage classification, a common method for retrieving and managing information. To obtain a more efficient classifier, this paper proposes a webpage classification method based on locality-sensitive hashing (LSH). The method comprises three new modules: building a feature dictionary, mapping feature vectors to fingerprints using locality-sensitive hashing, and extending webpage features. Comparative results show that the proposed algorithm achieves better performance in less time than the naïve Bayes classifier.
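The abstract does not spell out how feature vectors become fingerprints, so the following is only a minimal sketch of one common LSH scheme (a simhash-style bit fingerprint compared by Hamming distance), not the authors' implementation. The function names `token_hash`, `simhash`, `hamming`, and `classify`, the 64-bit fingerprint width, the use of MD5 as the per-token hash, and the nearest-fingerprint decision rule are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def token_hash(token: str, bits: int = 64) -> int:
    """Hash a token to a `bits`-wide integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weights: dict, bits: int = 64) -> int:
    """Map a weighted feature vector {token: weight} to a bit fingerprint."""
    acc = [0.0] * bits
    for token, w in weights.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # Add the weight where the token's hash has a 1 bit, subtract otherwise.
            acc[i] += w if (h >> i) & 1 else -w
    fp = 0
    for i, v in enumerate(acc):
        if v > 0:
            fp |= 1 << i
    return fp


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def classify(page_tokens, class_fingerprints, bits: int = 64):
    """Return the label whose fingerprint is closest to the page's (hypothetical rule)."""
    fp = simhash(Counter(page_tokens), bits)
    return min(class_fingerprints, key=lambda label: hamming(fp, class_fingerprints[label]))


# Hypothetical per-class token lists standing in for a feature dictionary.
sports = ["match", "goal", "team", "league", "score"]
finance = ["stock", "market", "bank", "interest", "bond"]
class_fps = {
    "sports": simhash(Counter(sports)),
    "finance": simhash(Counter(finance)),
}
label = classify(sports, class_fps)
```

Under these assumptions, classifying a page costs one fingerprint computation plus one XOR and bit count per class, which is the kind of saving over per-term probability estimation that would make an LSH-based classifier faster than naïve Bayes.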
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, J., Sun, H., Ding, Z. (2015). An Efficient Webpage Classification Algorithm Based on LSH. In: Wang, H., et al. Intelligent Computation in Big Data Era. ICYCSEE 2015. Communications in Computer and Information Science, vol 503. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46248-5_31
DOI: https://doi.org/10.1007/978-3-662-46248-5_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46247-8
Online ISBN: 978-3-662-46248-5
eBook Packages: Computer Science (R0)