A Diversified Classification Committee for Recognition of Innovative Internet Domains

  • Conference paper
Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery (BDAS 2015, BDAS 2016)

Abstract

The objective of this paper was to propose a method for classifying innovative domains on the Internet. The proposed approach helped to estimate whether companies are innovative by analyzing their web pages. A Naïve Bayes classification committee was used to classify the domains. The classifiers in the committee were based concurrently on Bernoulli and Multinomial feature distribution models, which were selected depending on the diversity of the input data. Moreover, information retrieval procedures were applied to find the documents within each domain that most likely indicate innovativeness. The proposed methods were verified experimentally. The results showed that the diversified classification committee, combined with the information retrieval approach in the preprocessing phase, improves the classification quality for domains that may represent innovative companies. This approach may also be applied to other classification tasks.
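To make the idea of a committee built on two feature distribution models more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy pages, the labels, the Laplace smoothing, and the rule of averaging the members' posterior probabilities are all assumptions made for illustration. In particular, the abstract states that the models were selected depending on the diversity of the input data, and that selection step is not reproduced here.

```python
import math
from collections import Counter

def train(docs, labels, bernoulli):
    """Fit a Naive Bayes model with Laplace smoothing on tokenized documents."""
    vocab = sorted({t for d in docs for t in d})
    model = {"vocab": vocab, "bernoulli": bernoulli, "prior": {}, "cond": {}}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        model["prior"][c] = math.log(len(class_docs) / len(docs))
        if bernoulli:
            # P(term present | class), estimated from document frequencies
            df = Counter(t for d in class_docs for t in set(d))
            model["cond"][c] = {t: (df[t] + 1) / (len(class_docs) + 2) for t in vocab}
        else:
            # P(term | class), estimated from term frequencies
            tf = Counter(t for d in class_docs for t in d)
            total = sum(tf.values())
            model["cond"][c] = {t: (tf[t] + 1) / (total + len(vocab)) for t in vocab}
    return model

def posterior(model, doc):
    """Return normalized class probabilities for one tokenized document."""
    scores = {}
    for c, log_prior in model["prior"].items():
        s = log_prior
        cond = model["cond"][c]
        if model["bernoulli"]:
            # Bernoulli model: every vocabulary term contributes, present or absent
            present = set(doc)
            for t in model["vocab"]:
                s += math.log(cond[t] if t in present else 1.0 - cond[t])
        else:
            # Multinomial model: each token occurrence contributes once
            for t in doc:
                if t in cond:  # out-of-vocabulary terms are ignored
                    s += math.log(cond[t])
        scores[c] = s
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def committee_predict(models, doc):
    """Average the members' posteriors (an assumed combination rule)."""
    avg = Counter()
    for m in models:
        for c, p in posterior(m, doc).items():
            avg[c] += p / len(models)
    return max(avg, key=avg.get)

# Hypothetical, hand-made "pages"; real input would be preprocessed web documents.
docs = [
    ["patent", "research", "prototype", "research"],
    ["laboratory", "research", "patent"],
    ["discount", "shop", "delivery", "shop"],
    ["shop", "price", "delivery"],
]
labels = ["innovative", "innovative", "other", "other"]

committee = [train(docs, labels, bernoulli=True),
             train(docs, labels, bernoulli=False)]
prediction = committee_predict(committee, ["research", "prototype", "price"])
print(prediction)
```

The two members differ only in how they model a term: the Bernoulli member looks at presence or absence across the whole vocabulary, while the Multinomial member counts occurrences, which is why the two can disagree on long or repetitive documents.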

Acknowledgements

We would like to thank the anonymous reviewers for their comments, which significantly helped to improve the manuscript. Moreover, we are grateful to Krzysztof Wolinski and his team for their support in labeling the data.

Author information

Corresponding author

Correspondence to Marcin Mirończuk.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Mirończuk, M., Protasiewicz, J. (2016). A Diversified Classification Committee for Recognition of Innovative Internet Domains. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_29

  • DOI: https://doi.org/10.1007/978-3-319-34099-9_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-34098-2

  • Online ISBN: 978-3-319-34099-9

  • eBook Packages: Computer Science, Computer Science (R0)
