Abstract
Text classification is among the most widely used machine learning tools in computational linguistics. Web information retrieval is one of the most important areas that have benefited from this technique. Applications range from page classification, used by search engines, to URL classification, used for focused crawling and on-line time-sensitive applications [2]. Because the highest possible accuracy is usually required, a supervised learning approach is generally preferred when an adequately large set of training examples is available. Nonetheless, since building an accurate and representative training set often becomes impractical once the number of classes grows beyond a handful, alternative unsupervised or semi-supervised approaches have emerged. Using standard web directories as a source of examples can be prone to undesired effects due, for example, to the presence of maliciously misclassified web pages. In addition, this option is subject to the existence of all the desired classes in the directory hierarchy.
Taking as input a textual description of each class and a set of URLs, in this paper we propose a new framework to automatically build a representative training set able to reasonably approximate the classification accuracy obtained by means of a manually curated training set. Our approach leverages the observation that a non-negligible fraction of website names results from the juxtaposition of a few keywords. Moreover, the entire URL can often be converted into a meaningful text snippet. When this happens, we can label the URL by measuring its degree of similarity with each class description. The text contained in the pages corresponding to labelled URLs can then be used as a training set for any subsequent classification task (not necessarily on the web). Experiments on a set of 20 thousand web pages belonging to 9 categories show that our auto-labelling framework attains over 88% of the accuracy of a purely supervised classifier trained with manually curated examples.
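The labelling idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vocabulary `VOCAB`, the dictionary-based segmentation, the cosine similarity over word counts, the `threshold` value and the example class descriptions in `CLASSES` are all assumptions introduced here for illustration, since the abstract does not specify how URLs are segmented or which similarity measure is used.

```python
import math
import re
from collections import Counter
from urllib.parse import urlparse

# Hypothetical toy vocabulary; a real system would need a much larger lexicon.
VOCAB = {"cheap", "flights", "travel", "football", "news",
         "online", "shop", "shoes", "hotel", "deals"}

# Hypothetical class descriptions (the framework takes one text per class).
CLASSES = {
    "travel": "cheap flights hotel deals travel",
    "sport": "football news",
}


def segment(token, vocab=VOCAB):
    """Split a run of letters into vocabulary words; return [] if impossible."""
    # best[i] holds the shortest known segmentation of token[:i], or None.
    best = [None] * (len(token) + 1)
    best[0] = []
    for i in range(1, len(token) + 1):
        for j in range(i):
            if best[j] is not None and token[j:i] in vocab:
                candidate = best[j] + [token[j:i]]
                if best[i] is None or len(candidate) < len(best[i]):
                    best[i] = candidate
    return best[len(token)] or []


def url_to_words(url):
    """Turn a URL into a list of keywords taken from its host name and path."""
    parsed = urlparse(url)
    text = parsed.netloc.lower() + " " + parsed.path.lower()
    words = []
    for token in re.split(r"[^a-z]+", text):
        if token:
            words.extend(segment(token))
    return words


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def label_url(url, class_descriptions, threshold=0.2):
    """Return the most similar class, or None when the URL is not informative."""
    words = url_to_words(url)
    if not words:
        return None
    scores = {name: cosine(words, text.lower().split())
              for name, text in class_descriptions.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None


if __name__ == "__main__":
    for url in ("http://www.cheapflights.com/deals", "http://www.xyzzy.org/"):
        print(url, "->", label_url(url, CLASSES))
```

URLs that cannot be segmented into known words are left unlabelled rather than forced into a class, which mirrors the abstract's point that only a fraction of URLs yield a meaningful snippet; the pages behind the labelled URLs would then supply the training text.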
References
Adami, G., Avesani, P., Sona, D.: Clustering documents into a web directory for bootstrapping a supervised classification. Data Knowl. Eng. 54(3), 301–325 (2005). https://doi.org/10.1016/j.datak.2004.11.003
Baykan, E., Henzinger, M., Marian, L., Weber, I.: Purely URL-based topic classification. In: Proceedings of the 18th International Conference on World Wide Web, pp. 1109–1110. ACM (2009)
Baykan, E., Henzinger, M., Marian, L., Weber, I.: A comprehensive study of features and algorithms for URL-based topic classification. ACM Trans. Web 5(3), 15:1–15:29 (2011). https://doi.org/10.1145/1993053.1993057
Baykan, E., Henzinger, M., Weber, I.: A comprehensive study of techniques for URL-based web page language classification. ACM Trans. Web 7(1), 3:1–3:37 (2013). https://doi.org/10.1145/2435215.2435218
Bennett, G., Scholer, F., Uitdenbogerd, A.: A comparative study of probabilistic and language models for information retrieval. In: Proceedings of the Nineteenth Conference on Australasian Database, vol. 75, pp. 65–74. Australian Computer Society Inc. (2008)
Boyd, D., Crawford, K.: Critical questions for big data: provocations for a cultural, technological, and scholarly phenomenon. Inf. Commun. Soc. 15(5), 662–679 (2012)
Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J.: LOF: identifying density-based local outliers. In: ACM Sigmod Record, vol. 29, pp. 93–104. ACM (2000)
Broder, A.Z., Ciccolo, P., Fontoura, M., Gabrilovich, E., Josifovski, V., Riedel, L.: Search advertising using web relevance feedback. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, pp. 1013–1022. ACM, New York (2008). https://doi.org/10.1145/1458082.1458217
Carpineto, C., Osiński, S., Romano, G., Weiss, D.: A survey of web clustering engines. ACM Comput. Surv. 41(3), 17:1–17:38 (2009)
Castellanos, M., Daniel, F., Garrigós, I., Mazón, J.N.: Business intelligence and the web. Inf. Syst. Front. 15(3), 307–309 (2013)
Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: Proceedings of STOC-02, 34th Annual ACM Symposium on the Theory of Computing, Montreal, CA, pp. 380–388 (2002)
Dong, H., Hussain, F.: Focused crawling for automatic service discovery, annotation, and classification in industrial digital ecosystems. IEEE Trans. Ind. Electron. 58(6), 2106–2116 (2011)
Eickhoff, C., Serdyukov, P., de Vries, A.P.: Web page classification on child suitability. In: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM 2010, pp. 1425–1428. ACM, New York (2010). https://doi.org/10.1145/1871437.1871638
Erdélyi, M., Garzó, A., Benczúr, A.A.: Web spam classification: a few features worth more. In: Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, WebQuality 2011, pp. 27–34. ACM, New York (2011). https://doi.org/10.1145/1964114.1964121
Forman, G.: Quantifying counts and costs via classification. Data Min. Knowl. Discov. 17(2), 164–206 (2008)
Fürnkranz, J.: Exploiting structural information for text classification on the WWW. In: Hand, D.J., Kok, J.N., Berthold, M.R. (eds.) IDA 1999. LNCS, vol. 1642, pp. 487–497. Springer, Heidelberg (1999). https://doi.org/10.1007/3-540-48412-4_41
de Groc, C.: Babouk: Focused web crawling for corpus compilation and automatic terminology extraction. In: 2011 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), vol. 1, pp. 497–498 (2011)
Halvorson, T., et al.: The BIZ top-level domain: ten years later. In: Taft, N., Ricciato, F. (eds.) PAM 2012. LNCS, vol. 7192, pp. 221–230. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-28537-0_22
Hao, H.W., Mu, C.X., Yin, X.C., Li, S., Wang, Z.B.: An improved topic relevance algorithm for focused crawling. In: 2011 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 850–855, October 2011
Hernández, I., Rivero, C.R., Ruiz, D., Corchuelo, R.: A statistical approach to URL-based web page clustering. In: Proceedings of the 21st International Conference Companion on World Wide Web, WWW 2012 Companion, pp. 525–526. ACM, New York (2012). https://doi.org/10.1145/2187980.2188109
Hernández, I., Rivero, C.R., Ruiz, D., Corchuelo, R.: CALA: an unsupervised URL-based web page classification system. Knowl.-Based Syst. 57, 168–180 (2014). http://www.sciencedirect.com/science/article/pii/S0950705113003997
Kriegel, H.P., Schubert, M.: Classification of websites as sets of feature vectors. In: Databases and Applications, pp. 127–132 (2004)
Liu, T.Y.: Learning to rank for information retrieval. Found. Trends Inf. Retr. 3(3), 225–331 (2009). https://doi.org/10.1561/1500000016
Liu, T.Y., Xu, J., Qin, T., Xiong, W., Li, H.: LETOR: Benchmark dataset for research on learning to rank for information retrieval. In: Proceedings of SIGIR 2007 Workshop on Learning to Rank for Information Retrieval, pp. 3–10 (2007)
Mangai, J.A., Kumar, V.S., Alias Balamurugan, S.A.: A novel feature selection framework for automatic web page classification. Int. J. Autom. Comput. 9(4), 442–448 (2012)
Marath, S.T., Shepherd, M., Milios, E., Duffy, J.: Large-scale web page classification. In: 2014 47th Hawaii International Conference on System Sciences (HICSS), pp. 1813–1822. IEEE (2014)
Milli, L., Monreale, A., Rossetti, G., Giannotti, F., Pedreschi, D., Sebastiani, F.: Quantification trees. In: 2013 IEEE 13th International Conference on Data Mining (ICDM), pp. 528–536. IEEE (2013)
Özel, S.A.: A web page classification system based on a genetic algorithm using tagged-terms as features. Expert Syst. Appl. 38(4), 3407–3415 (2011). https://doi.org/10.1016/j.eswa.2010.08.126
Patil, A.S., Pawar, B.: Automated classification of web sites using naive Bayesian algorithm. In: Proceedings of the International Multi-Conference of Engineers and Computer Scientists, vol. 1, pp. 14–16 (2012)
Qi, X., Davison, B.D.: Web page classification: features and algorithms. ACM Comput. Surv. (CSUR) 41(2), 12 (2009)
Rajalakshmi, R., Aravindan, C.: Naive Bayes approach for website classification. In: Das, V.V., Thomas, G., Lumban Gaol, F. (eds.) AIM 2011. CCIS, vol. 147, pp. 323–326. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20573-6_55
Robertson, S.E.: Overview of the Okapi projects. J. Doc. 53(1), 3–7 (1997)
Rose, D.E., Levinson, D.: Understanding user goals in web search. In: Proceedings of the 13th International Conference on World Wide Web, pp. 13–19. ACM (2004)
Saad, M.K., Hewahi, N.M.: A comparative study of outlier mining and class outlier mining. Comput. Sci. Lett. 1(1) (2009)
Sebastiani, F.: Text quantification. In: de Rijke, M. (ed.) ECIR 2014. LNCS, vol. 8416, pp. 819–822. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06028-6_104
Smith, M., Martinez, T.: Improving classification accuracy by identifying and removing instances that should be misclassified. In: The 2011 International Joint Conference on Neural Networks (IJCNN), pp. 2690–2697, July 2011
Tang, L., Gao, H., Liu, H.: Network quantification despite biased labels. In: Proceedings of the Eighth Workshop on Mining and Learning with Graphs, pp. 147–154. ACM (2010)
Taylan, D., Poyraz, M., Akyokus, S., Ganiz, M.: Intelligent focused crawler: learning which links to crawl. In: 2011 International Symposium on Innovations in Intelligent Systems and Applications (INISTA), pp. 504–508, June 2011
Trivedi, A., Rai, P., Daumé III, H., DuVall, S.L.: Leveraging social bookmarks from partially tagged corpus for improved web page clustering. ACM Trans. Intell. Syst. Technol. (TIST) 3(4), 67 (2012)
Wilkinson, R., Zobel, J., Sacks-Davis, R.: Similarity measures for short queries. In: Fourth Text REtrieval Conference (TREC-4), pp. 277–285 (1995)
Xu, Z., Yan, F., Qin, J., Zhu, H.: A web page classification algorithm based on link information. In: 2011 Tenth International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES), pp. 82–86, October 2011
Yu, H., Han, J., Chang, K.C.: PEBL: web page classification without negative examples. IEEE Trans. Knowl. Data Eng. 16(1), 70–81 (2004)
Zhong, S., Zou, D.: Web page classification using an ensemble of support vector machine classifiers. J. Netw. 6(11), 1625–1630 (2011)
Funding
This work was supported by: the Regione Toscana of Italy under the grant POR CRO 2007/2013 Asse IV Capitale Umano; the Italian Registry of the ccTLD “it” Registro.it.
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Geraci, F., Papini, T. (2018). Approximating Multi-class Text Classification Via Automatic Generation of Training Examples. In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2017. Lecture Notes in Computer Science, vol 10762. Springer, Cham. https://doi.org/10.1007/978-3-319-77116-8_44
DOI: https://doi.org/10.1007/978-3-319-77116-8_44
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77115-1
Online ISBN: 978-3-319-77116-8
eBook Packages: Computer Science, Computer Science (R0)