Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps

  • Conference paper
Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2013)

Abstract

Regional Innovation Systems describe the relations between actors, structures and infrastructures in a region, with the aim of stimulating innovation and regional development. For these systems, the collection and organization of information is crucial. In the present paper we investigate the possibilities of extracting information from company websites. In particular, we consider faceted classification of companies by keyword extraction using a specialized thesaurus. First, we identify a number of challenges that arise when extracting information about companies from their websites. Then we describe a small-scale experiment in which keywords related to economic sectors and commodities are extracted from the websites of over 200 companies. The experiment shows that the approach is at least feasible for the commodities facet. For the sectors facet, the simple keyword extraction methods used do not perform well. We find that good coverage of the words in the text by the thesaurus is crucial, and hence that the results can be improved by adding more alternative labels to the thesaurus terms. Furthermore, we find that weighting terms according to their relations to other terms on the website, rather than by inverse document frequency, gives better results than classical tf.idf weighting.
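To make the baseline concrete, the following is a minimal sketch of thesaurus-based keyword extraction with classical tf.idf weighting. It is not the authors' implementation, and it does not cover the relation-based weighting that the abstract reports as performing better. The mini-thesaurus, the sample site texts, and the function names `concept_counts` and `tfidf_keywords` are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-thesaurus: concept -> preferred and alternative labels.
THESAURUS = {
    "Steel": ["steel", "steelworks"],
    "Software": ["software", "apps"],
    "Furniture": ["furniture", "tables", "chairs"],
}

# Hypothetical website texts, one per company.
SITES = {
    "a": "We produce steel and steel tubes. Our software helps.",
    "b": "Custom software and apps.",
    "c": "Chairs, tables and software.",
}


def concept_counts(text, thesaurus):
    """Count how often any label of each concept occurs in the text."""
    lowered = text.lower()
    counts = Counter()
    for concept, labels in thesaurus.items():
        for label in labels:
            # Whole-word match, so that "apps" does not match "wrapps".
            pattern = r"\b" + re.escape(label) + r"\b"
            counts[concept] += len(re.findall(pattern, lowered))
    # Keep only concepts that were actually found.
    return Counter({c: n for c, n in counts.items() if n > 0})


def tfidf_keywords(sites, thesaurus, top_n=3):
    """Rank the concepts found on each site by tf * log(N / df)."""
    per_site = {name: concept_counts(text, thesaurus)
                for name, text in sites.items()}
    n_sites = len(per_site)
    # Document frequency: number of sites on which a concept occurs.
    doc_freq = Counter()
    for counts in per_site.values():
        doc_freq.update(counts.keys())
    ranked = {}
    for name, counts in per_site.items():
        scores = {c: tf * math.log(n_sites / doc_freq[c])
                  for c, tf in counts.items()}
        # Highest score first; ties are broken alphabetically.
        ranked[name] = sorted(scores, key=lambda c: (-scores[c], c))[:top_n]
    return ranked
```

In this toy setup, the alternative label "apps" lets the second site register two matches for the Software concept rather than one, which illustrates why adding alternative labels improves coverage. Because Software occurs on every site, its inverse document frequency is zero and it ranks below the site-specific concepts.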

Notes

  1. http://zbw.eu/stw/versions/latest/about.en.html.

  2. http://zbw.eu/stw/versions/latest/mapping/gnd/about.

  3. http://www.dnb.de/gnd.

  4. http://code.google.com/p/crawler4j/.

Acknowledgements

The research presented in this paper was partially funded by the Spanish Ministry of Education, Culture and Sport (Ref. CAS 12/00155).

Author information

Correspondence to Christian Wartena.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wartena, C., Garcia-Alsina, M. (2015). Keyword Extraction from Company Websites for the Development of Regional Knowledge Maps. In: Fred, A., Dietz, J., Liu, K., Filipe, J. (eds) Knowledge Discovery, Knowledge Engineering and Knowledge Management. IC3K 2013. Communications in Computer and Information Science, vol 454. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46549-3_7

  • DOI: https://doi.org/10.1007/978-3-662-46549-3_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-46548-6

  • Online ISBN: 978-3-662-46549-3

  • eBook Packages: Computer Science, Computer Science (R0)