Abstract
This paper proposes an information system that classifies Web pages according to a taxonomy used mainly by seven search engines/directories. The proposed classifier is a four-layer generalised regression neural network (GRNN) that performs information segmentation according to information filtering techniques, using content descriptor vectors. Eight categories of Web pages were used to evaluate the robustness of the method, and no restrictions were imposed except for the language of the content, which is English. The system can be used as an assistant and consultative tool for classification purposes, as well as for estimating the population of Web pages at any given point in time.
Abbreviations
- \(tf_k\): Normalised frequency of term k
- \(idf_k\): Inverse document frequency of term k
- \(hf\): Tag hierarchical rating
- \(\bar x\): Mean value
- \(\sigma\): Variance (distributions of normalised and inverse document frequencies over the terms’ rank order)
- \(f(x,z)\): The probability density function (pdf) of the vector random variable x and its scalar random variable z
- \(D_i\): The Euclidean distance between vector random variable x and sample points \(x_i\)
- \(\bar \sigma\): A width parameter, which satisfies the asymptotic behaviour as the number of Parzen windows becomes large
- \(\beta\): The ‘beta’ coefficient for all the local approximators in the middle layer of the proposed neural network classifier
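To make the roles of the distance D_i and the width parameter concrete, the following is a minimal sketch of a general regression estimate in the style of Specht's GRNN: a Gaussian-kernel-weighted average of stored targets. It is not the four-layer classifier of the paper. The function names, the toy two-dimensional descriptor vectors, the category labels and the single shared width `sigma` are all illustrative assumptions, and classification is done here simply by regressing a 0/1 indicator per category and taking the largest score.

```python
import math


def grnn_predict(x, samples, targets, sigma):
    """Kernel-weighted average of targets (generic GRNN-style estimate).

    x        -- query vector (sequence of floats)
    samples  -- stored sample vectors x_i
    targets  -- scalar target z_i for each stored sample
    sigma    -- kernel width (assumed shared by all samples)
    """
    weights = []
    for x_i in samples:
        # Squared Euclidean distance D_i^2 between the query and sample x_i.
        d2 = sum((a - b) ** 2 for a, b in zip(x, x_i))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all kernel weights underflowed; increase sigma")
    return sum(w * z for w, z in zip(weights, targets)) / total


def grnn_classify(x, samples, labels, sigma):
    """Score each category by regressing its 0/1 indicator; return the best."""
    scores = {}
    for category in sorted(set(labels)):
        indicator = [1.0 if lab == category else 0.0 for lab in labels]
        scores[category] = grnn_predict(x, samples, indicator, sigma)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data: two categories, two descriptor dimensions.
toy_samples = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
toy_labels = ["news", "news", "shop", "shop"]

label, scores = grnn_classify((0.05, 0.0), toy_samples, toy_labels, sigma=0.3)
print(label, scores)
```

A small `sigma` makes the estimate follow the nearest stored samples closely, while a large one smooths it towards the overall mean of the targets, which is why the choice of width parameter matters for a classifier of this kind.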
Acknowledgements
The authors thank the anonymous expert editors of Liaison Systems S.A. for their help in validating the Web pages used for training and testing the GRNN classifier. The authors also acknowledge the partial financial support of the European Community for the project ERMIS (Electronic commeRce Measurements through Intelligent agentS, IST-1999-21051), in which an initial version of the proposed GRNN classifier was tested and validated.
Cite this article
Anagnostopoulos, I., Anagnostopoulos, C., Kouzas, G. et al. A generalised regression algorithm for Web page categorisation. Neural Comput & Applic 13, 229–236 (2004). https://doi.org/10.1007/s00521-004-0409-0
DOI: https://doi.org/10.1007/s00521-004-0409-0