A Diversified Classification Committee for Recognition of Innovative Internet Domains

Mirończuk, Marcin; Protasiewicz, Jarosław

doi:10.1007/978-3-319-34099-9_29

Part of the book series: Communications in Computer and Information Science (CCIS, volume 613)

Abstract

The objective of this paper was to propose a method for classifying innovative domains on the Internet. The proposed approach helps to estimate whether a company is innovative by analyzing its web pages. A Naïve Bayes classification committee served as the domain classifier. The classifiers in the committee were based concurrently on Bernoulli and Multinomial feature distribution models, which were selected depending on the diversity of the input data. Moreover, information retrieval procedures were applied to find the documents within each domain that most likely indicate innovativeness. The proposed methods were verified experimentally. The results showed that the diversified classification committee, combined with the information retrieval approach in the preprocessing phase, improves the classification quality for domains that may represent innovative companies. This approach may also be applied to other classification tasks.
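As an illustration only, the idea of a committee that mixes Bernoulli and Multinomial Naïve Bayes models can be sketched in a few lines of Python. The abstract does not state how the committee members are combined or how the feature model is selected, so the soft vote (averaging posterior probabilities) used here is an assumption, as are the class name `NaiveBayes`, the helper `committee_proba`, the add-one smoothing, and the toy training documents.

```python
import math
from collections import Counter


class NaiveBayes:
    """Minimal Naive Bayes over tokenized documents, with add-one smoothing."""

    def __init__(self, mode):
        if mode not in ("bernoulli", "multinomial"):
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.doc_count = Counter(labels)
        self.term_count = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            # Bernoulli counts a term once per document;
            # multinomial counts every occurrence.
            terms = set(doc) if self.mode == "bernoulli" else doc
            self.term_count[label].update(terms)
        return self

    def _log_joint(self, doc, c):
        # log P(c) + log P(doc | c) under the chosen feature model
        score = math.log(self.doc_count[c] / sum(self.doc_count.values()))
        counts = self.term_count[c]
        if self.mode == "multinomial":
            denom = sum(counts.values()) + len(self.vocab)
            for tok in doc:
                if tok in self.vocab:  # out-of-vocabulary tokens are ignored
                    score += math.log((counts[tok] + 1) / denom)
        else:
            present = set(doc)
            for tok in self.vocab:  # absent terms also contribute evidence
                p = (counts[tok] + 1) / (self.doc_count[c] + 2)
                score += math.log(p if tok in present else 1.0 - p)
        return score

    def predict_proba(self, doc):
        logs = {c: self._log_joint(doc, c) for c in self.classes}
        top = max(logs.values())  # subtract the max for numerical stability
        exp = {c: math.exp(v - top) for c, v in logs.items()}
        total = sum(exp.values())
        return {c: v / total for c, v in exp.items()}


def committee_proba(models, doc):
    """Soft vote (an assumption): average the members' posteriors."""
    probs = [m.predict_proba(doc) for m in models]
    return {c: sum(p[c] for p in probs) / len(probs) for c in probs[0]}


# Hypothetical toy data; real input would be tokenized web pages of a domain.
docs = [
    ["patent", "research", "prototype"],
    ["research", "laboratory", "patent", "patent"],
    ["discount", "shop", "delivery"],
    ["shop", "price", "discount", "discount"],
]
labels = ["innovative", "innovative", "other", "other"]

models = [NaiveBayes(mode).fit(docs, labels)
          for mode in ("bernoulli", "multinomial")]
print(committee_proba(models, ["patent", "research"]))
```

The two members differ only in how they treat term frequency: the Bernoulli model scores the presence or absence of every vocabulary term, while the Multinomial model scores each token occurrence. A committee that averages them is one simple way to benefit from both views of the same page.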

Acknowledgements

We would like to thank the anonymous reviewers for their comments, which significantly helped to improve the manuscript. Moreover, we are grateful to Krzysztof Wolinski and his team for their support in labeling the data.

Author information

Authors and Affiliations

Laboratory of Intelligent Information Systems, National Information Processing Institute, al. Niepodległości 188b, 00-608, Warsaw, Poland
Marcin Mirończuk & Jarosław Protasiewicz

Corresponding author

Correspondence to Marcin Mirończuk.

Editor information

Editors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
Stanisław Kozielski, Dariusz Mrozek, Paweł Kasprowski, Bożena Małysiak-Mrozek & Daniel Kostrzewa

Cite this paper

Mirończuk, M., Protasiewicz, J. (2016). A Diversified Classification Committee for Recognition of Innovative Internet Domains. In: Kozielski, S., Mrozek, D., Kasprowski, P., Małysiak-Mrozek, B., Kostrzewa, D. (eds) Beyond Databases, Architectures and Structures. Advanced Technologies for Data Mining and Knowledge Discovery. BDAS 2015, BDAS 2016. Communications in Computer and Information Science, vol 613. Springer, Cham. https://doi.org/10.1007/978-3-319-34099-9_29

DOI: https://doi.org/10.1007/978-3-319-34099-9_29
Published: 28 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-34098-2
Online ISBN: 978-3-319-34099-9
eBook Packages: Computer Science, Computer Science (R0)
