Abstract
Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region, with the aim of stimulating innovation and regional development. For these systems, the collection and organization of information is crucial. In this paper we investigate the possibilities of extracting information from company websites. In particular, we consider faceted classification of companies by keyword extraction using a specialized thesaurus. First, we identify a number of challenges that arise when extracting information about companies from their websites. Then we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. The experiment shows that the approach is at least feasible for the commodities facet. For the sectors facet, the simple keyword extraction methods used do not perform well. We find that good coverage of the words in the text by the thesaurus is crucial, and hence that the results can be improved by adding more alternative labels to the thesaurus terms. Furthermore, we find that weighting terms according to their relations to other terms on the website gives better results than the classical tf.idf weighting, which relies on inverse document frequency.
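To make the two weighting schemes contrasted above concrete, the following is a minimal Python sketch and not the method used in the paper. The toy thesaurus, its labels, the function names and the relation-based formula (own frequency plus the frequencies of related concepts found on the same site) are all assumptions chosen for illustration.

```python
import math

# Hypothetical toy thesaurus (an assumption, not the thesaurus used in the paper).
# "labels" holds the preferred label plus alternative labels;
# "related" lists concepts linked to this one in the thesaurus.
THESAURUS = {
    "beer": {"labels": ["beer", "lager", "ale"], "related": ["brewery"]},
    "brewery": {"labels": ["brewery", "brewing"], "related": ["beer"]},
    "software": {"labels": ["software", "app"], "related": []},
}


def concept_counts(text):
    """Count how often each concept is matched by any of its labels."""
    tokens = text.lower().split()
    counts = {}
    for concept, entry in THESAURUS.items():
        n = sum(tokens.count(label) for label in entry["labels"])
        if n > 0:
            counts[concept] = n
    return counts


def tfidf_weights(counts, doc_freq, n_docs):
    """Classical tf.idf: term frequency times log inverse document frequency."""
    return {c: tf * math.log(n_docs / doc_freq[c]) for c, tf in counts.items()}


def relation_weights(counts):
    """Illustrative relation-based weight (an assumed formula): a concept's
    own frequency plus the frequencies of related concepts on the same site."""
    return {
        c: tf + sum(counts.get(r, 0) for r in THESAURUS[c]["related"])
        for c, tf in counts.items()
    }


site_text = "our brewery sells lager and ale and a mobile app"
counts = concept_counts(site_text)
print(counts)
print(relation_weights(counts))
```

In this sketch the alternative labels "lager" and "ale" are what allow the concept "beer" to be found at all, which mirrors the point about thesaurus coverage. The relation-based weight needs no collection-wide statistics, whereas `tfidf_weights` requires document frequencies from a corpus of websites.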
Acknowledgements
The research presented in this paper was partially funded by the Spanish Ministry of Education, Culture and Sport (Ref. CAS 12/00155).
Copyright information
© 2015 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wartena, C., Garcia-Alsina, M. (2015). Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps. In: Fred, A., Dietz, J., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2013. Communications in Computer and Information Science, vol 454. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46549-3_7
DOI: https://doi.org/10.1007/978-3-662-46549-3_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46548-6
Online ISBN: 978-3-662-46549-3
eBook Packages: Computer Science, Computer Science (R0)