Abstract
The objective of this paper was to propose a method for classifying innovative domains on the Internet. The proposed approach estimates whether companies are innovative by analyzing their web pages. A committee of Naïve Bayes classifiers was used to classify the domains. The classifiers in the committee were based on both Bernoulli and Multinomial feature distribution models, which were selected depending on the diversity of the input data. Moreover, information retrieval procedures were applied to find the documents within each domain that most likely indicate innovativeness. The proposed methods were verified experimentally. The results showed that the diversified classification committee, combined with the information retrieval approach in the preprocessing phase, improves the classification quality for domains that may represent innovative companies. This approach may also be applied to other classification tasks.
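To make the idea of a committee built on two feature distribution models more concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It trains one Bernoulli and one Multinomial Naïve Bayes classifier on tokenized documents and combines them by averaging their posterior probabilities. The class name NaiveBayes, the function committee_predict, the averaging rule, and the toy documents and labels are all assumptions made for illustration; the abstract does not specify how the committee members are selected or combined.

```python
import math
from collections import Counter


class NaiveBayes:
    """Naive Bayes text classifier with add-one (Laplace) smoothing."""

    def __init__(self, model):
        if model not in ("bernoulli", "multinomial"):
            raise ValueError(f"unknown model: {model}")
        self.model = model

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {term for doc in docs for term in doc}
        self.class_docs = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            # Bernoulli keeps term presence only; Multinomial keeps frequencies.
            terms = set(doc) if self.model == "bernoulli" else doc
            self.counts[label].update(terms)
        return self

    def log_scores(self, doc):
        n_docs = sum(self.class_docs.values())
        scores = {}
        for c in self.classes:
            score = math.log(self.class_docs[c] / n_docs)  # class prior
            if self.model == "multinomial":
                denom = sum(self.counts[c].values()) + len(self.vocab)
                for term in doc:
                    if term in self.vocab:  # terms unseen in training are ignored
                        score += math.log((self.counts[c][term] + 1) / denom)
            else:
                present = set(doc)
                for term in self.vocab:
                    # Smoothed probability that the term occurs in a class document.
                    p = (self.counts[c][term] + 1) / (self.class_docs[c] + 2)
                    score += math.log(p if term in present else 1.0 - p)
            scores[c] = score
        return scores

    def predict_proba(self, doc):
        # Normalize log scores into posteriors (numerically stable softmax).
        scores = self.log_scores(doc)
        top = max(scores.values())
        exps = {c: math.exp(s - top) for c, s in scores.items()}
        norm = sum(exps.values())
        return {c: e / norm for c, e in exps.items()}


def committee_predict(members, doc):
    """Average the members' posteriors (an assumed rule) and pick the top class."""
    classes = members[0].classes
    avg = {
        c: sum(m.predict_proba(doc)[c] for m in members) / len(members)
        for c in classes
    }
    return max(avg, key=avg.get)


# Invented toy data: tokenized page texts with hypothetical labels.
docs = [
    ["new", "patent", "research", "prototype"],
    ["research", "laboratory", "patent", "patent"],
    ["cheap", "shop", "discount", "sale"],
    ["sale", "shop", "delivery", "discount"],
]
labels = ["innovative", "innovative", "other", "other"]

members = [NaiveBayes(m).fit(docs, labels) for m in ("bernoulli", "multinomial")]
print(committee_predict(members, ["patent", "research", "prototype"]))
```

Averaging posteriors is only one of several possible combination rules; majority voting or choosing a single model per input according to some diversity measure would fit the same interface.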
Acknowledgements
We would like to thank the anonymous reviewers for their comments, which significantly helped to improve the manuscript. Moreover, we are grateful to Krzysztof Wolinski and his team for their support in labeling the data.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Mirończuk, M., Protasiewicz, J. (2016). A Diversified Classification Committee for Recognition of Innovative Internet Domains. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_29
DOI: https://doi.org/10.1007/978-3-319-34099-9_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34098-2
Online ISBN: 978-3-319-34099-9
eBook Packages: Computer Science, Computer Science (R0)