A generalised regression algorithm for Web page categorisation

  • Original Article
Neural Computing & Applications

Abstract

This paper proposes an information system that classifies Web pages according to a taxonomy used mainly by seven search engines and directories. The proposed classifier is a four-layer generalised regression neural network (GRNN) that segments information by applying information filtering techniques to content descriptor vectors. Eight categories of Web pages were used to evaluate the robustness of the method, and no restrictions were imposed other than the language of the content, which is English. The system can be used as an assistant and consultative tool for classification purposes, as well as for estimating the population of Web pages at any given point in time.
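The abstract does not say how the content descriptor vectors are built. As a rough illustration only, the sketch below builds term-weight vectors from a normalised term frequency and an inverse document frequency, the two quantities the abbreviations list names tf and idf. The normalisation by the most frequent term, the logarithmic idf, the function name `tfidf_vectors` and the toy documents are all assumptions rather than the authors' method, and the tag hierarchical rating hf is not modelled.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one tf-idf weight vector per document.

    docs: list of token lists (already tokenised; an assumption).
    Returns (vocab, vectors), where vectors[d][k] weights term vocab[k] in doc d.
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf form: log(N / df_k); a term found in every document gets 0.
    idf = {term: math.log(n_docs / df[term]) for term in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed normalisation: divide by the count of the most frequent term.
        max_count = max(counts.values()) if counts else 1
        vectors.append([(counts[term] / max_count) * idf[term] for term in vocab])
    return vocab, vectors


# Hypothetical toy corpus, not data from the paper.
DOCS = [
    ["page", "stock", "market", "price"],
    ["page", "stock", "football", "match"],
    ["page", "football", "match", "goal", "goal"],
]

vocab, vectors = tfidf_vectors(DOCS)
for doc_vector in vectors:
    print([round(weight, 3) for weight in doc_vector])
```

In this sketch a term that occurs in every document, such as "page", receives zero weight, which is the usual motivation for including the idf factor.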

Abbreviations

\(tf_k\) :

Normalised frequency of term k

\(idf_k\) :

Inverse document frequency of term k

hf :

Tag hierarchical rating

\(\bar x\) :

Mean value

σ:

Variance (distributions of normalised and inverse document frequencies over the terms’ rank order)

f(x,z):

The probability density function (pdf) of the vector random variable x and its scalar random variable z

\(D_i\) :

The Euclidean distance between the vector random variable x and the sample points \(x_i\)

\( \bar \sigma \) :

A width parameter, which satisfies the asymptotic behaviour as the number of Parzen windows becomes large

β :

The ‘beta’ coefficient for all the local approximators in the middle layer of the proposed neural network classifier
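Several of the symbols above (the distance D_i, the width parameter and the Parzen windows) belong to the standard general regression neural network estimator. The sketch below shows that textbook estimator used as a classifier with one-hot targets, under stated assumptions: it is not the authors' four-layer implementation, the β coefficient is not modelled, and the function names, the value of `sigma` and the toy data are hypothetical.

```python
import math


def grnn_predict(x, samples, targets, sigma=0.5):
    """Textbook GRNN estimate for a single query vector x.

    samples: training vectors x_i (one pattern unit per sample).
    targets: target vectors, here one-hot class labels.
    sigma:   width (smoothing) parameter; 0.5 is an arbitrary choice.
    """
    n_out = len(targets[0])
    numerator = [0.0] * n_out
    denominator = 0.0
    for x_i, y_i in zip(samples, targets):
        # Squared Euclidean distance D_i^2 between the query and sample x_i.
        d2 = sum((a - b) ** 2 for a, b in zip(x, x_i))
        # Gaussian Parzen window centred on x_i.
        weight = math.exp(-d2 / (2.0 * sigma ** 2))
        denominator += weight
        for k in range(n_out):
            numerator[k] += weight * y_i[k]
    if denominator == 0.0:
        # Every kernel underflowed; fall back to a uniform estimate.
        return [1.0 / n_out] * n_out
    return [value / denominator for value in numerator]


def classify(x, samples, targets, sigma=0.5):
    """Return the index of the class with the largest GRNN output."""
    scores = grnn_predict(x, samples, targets, sigma)
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical two-class toy data, not taken from the paper.
SAMPLES = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
TARGETS = [[1, 0], [1, 0], [0, 1], [0, 1]]

print(classify([0.2, 0.1], SAMPLES, TARGETS))
print(classify([1.0, 0.9], SAMPLES, TARGETS))
```

With one-hot targets the outputs sum to one, so they can be read as smoothed class-membership estimates. A smaller `sigma` makes the estimate follow the nearest samples more closely, while a larger one averages over more of the training set.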

Acknowledgements

The authors are very grateful to the anonymous expert editors of Liaison Systems S.A. for their help in validating the Web pages used for training and testing the GRNN classifier. The authors also acknowledge the partial financial support of the European Community through the project ERMIS (Electronic commeRce Measurements through Intelligent agentS, IST-1999-21051), in which an initial version of the proposed GRNN classifier was tested and validated.

Author information

Corresponding author

Correspondence to Ioannis Anagnostopoulos.

About this article

Cite this article

Anagnostopoulos, I., Anagnostopoulos, C., Kouzas, G. et al. A generalised regression algorithm for Web page categorisation. Neural Comput & Applic 13, 229–236 (2004). https://doi.org/10.1007/s00521-004-0409-0

  • DOI: https://doi.org/10.1007/s00521-004-0409-0
