Learning and exploiting concept networks with ConNeKTion

Applied Intelligence

Abstract

Studying, understanding and exploiting the content of a document collection require automatic techniques that can effectively support users in extracting useful information from it and in reasoning with this information. Concept networks (e.g., taxonomies) may play a relevant role in this perspective, but they are seldom available and cannot be built and maintained manually in a cheap and reliable way. On the other hand, automated learning of these resources from text needs to be robust with respect to missing or partial knowledge, because often only sparse fragments of the target network can be extracted. This work presents ConNeKTion, a tool that is able to learn concept networks from plain text and to structure and enrich them by finding concept generalizations. The proposed methodologies are general and applicable to any language. ConNeKTion also provides functionalities for exploiting the learned knowledge, and a control panel that allows the user to carry out these activities comfortably. Several experiments and applications are reported, showing the usefulness and flexibility of ConNeKTion.

Notes

  1. A vertex-induced sub-graph is a subset of the vertexes of a graph together with all edges in the graph whose endpoints are both in the subset.

  2. A weak component is defined as a maximal sub-graph in which there exists a path between all pairs of vertexes (considering undirected edges).

  3. First, the synsets of each word are extracted from WordNet; then, for each synset, all the associated domains in WordNet Domains are selected; finally, each domain is weighted according to the density function presented in [1], which depends on the number of domains to which each synset belongs, on the number of synsets associated with each word, and on the number of words that make up the sentence. Each synset of a word is then weighted based on the weights of its associated domains, and the one with the highest weight is selected.

  4. Note that this differs from the spreading activation algorithm [4] in that (1) graph traversal is affected neither by weights on edges nor by thresholds; (2) we focus on paths rather than nodes, and specifically on the path(s) between two particular nodes rather than on the activation of the whole graph; hence (3) setting the initial activation weight of start nodes makes no sense in our approach; and (4) this allows us to exploit a bi-directional partial search rather than a mono-directional complete graph traversal.

  5. Again, this is not spreading activation, even though weights on edges are exploited.

  6. A technique to semi-automatically extract a domain-specific ontology from free text without using external resources but focusing on Hub Words. After building the ontology, the ‘Hub Weight’ of a word t is computed as:

    $$W(t) = \alpha w_{0} + \beta n + \gamma \sum_{i=1}^{n} w(t_{i})$$

    where $w_{0}$ is a given initial weight, $n$ is the number of relationships in which $t$ is involved, $w(t_{i})$ is the tf-idf weight of the $i$-th word related to $t$, and $\alpha + \beta + \gamma = 1$. These elements, with some modifications, appear in the first three terms of our formula.
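
As an illustration of the Hub Weight formula in note 6, the following minimal Python sketch computes W(t) from its three ingredients. The function name `hub_weight`, its parameter names and the sample values are assumptions made for this example and do not come from the cited work; the sketch also assumes one related word per relationship, so that n equals the number of tf-idf weights supplied.

```python
from typing import Sequence


def hub_weight(
    w0: float,
    related_tfidf: Sequence[float],
    alpha: float,
    beta: float,
    gamma: float,
) -> float:
    """Compute W(t) = alpha*w0 + beta*n + gamma*sum_i w(t_i).

    w0            -- given initial weight of the word t
    related_tfidf -- tf-idf weights w(t_i) of the words related to t
                     (hypothetical input; one entry per relationship)
    alpha, beta, gamma -- mixing coefficients, required to sum to 1
    """
    if abs(alpha + beta + gamma - 1.0) > 1e-9:
        raise ValueError("alpha + beta + gamma must equal 1")
    n = len(related_tfidf)  # number of relationships involving t
    return alpha * w0 + beta * n + gamma * sum(related_tfidf)


# Illustrative values only: a word with initial weight 1.0 and two related
# words whose tf-idf weights are 0.5 and 0.25.
example = hub_weight(1.0, [0.5, 0.25], alpha=0.5, beta=0.25, gamma=0.25)
print(example)
```

With these sample values the three terms are 0.5 · 1.0, 0.25 · 2 and 0.25 · 0.75. Because β multiplies the raw count n, words involved in many relationships can dominate unless β is kept small.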

References

  1. Angioni M, Demontis R, Tuveri F (2008) A semantic approach for resource cataloguing and query resolution. Commun SIWN Spec Issue Distrib Agent-based Retr Tools 5:62–66

  2. Argamon S, Whitelaw C, Chase P, Hota SR, Garg N, Levitan S (2007) Stylistic text classification using functional lexical features. J Am Soc Inf Sci Technol 58(6):802–822

  3. Cimiano P, Hotho A, Staab S (2005) Learning concept hierarchies from text corpora using formal concept analysis. J Artif Int Res 24(1):305–339

  4. Crestani F (1997) Application of spreading activation techniques in information retrieval. Artif Intell Rev 11:453–482

  5. Deerwester S (1988) Improving information retrieval with latent semantic indexing. In: Borgman CL, Pai EYH (eds) Proceedings of the 51st ASIS annual meeting (ASIS 88), vol 25. American Society for Information Science, Atlanta

  6. Defays D (1977) An efficient algorithm for a complete link method. Comput J 20(4):364–366

  7. Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. In: Machine learning, pp 143–175

  8. Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT Press, Cambridge

  9. Ferilli S (2011) Automatic digital document processing and management, problems, algorithms and techniques, 1st edn. Springer Publishing Company, Incorporated

  10. Ferilli S, Basile TMA, Di Mauro N, Esposito F (2011) Plugging numeric similarity in first-order logic horn clauses comparison. In: Pirrone R, Sorbello F (eds) 7th international conference of the Italian association for artificial intelligence, vol 6934. Springer, LNCS, pp 33–44

  11. Ferilli S, Biba M, Basile TMA, Esposito F (2009) Combining qualitative and quantitative keyword extraction methods with document layout analysis. In: Post-proceedings of the 5th Italian research conference on digital libraries - IRCDL 2009, Padova Italy, 29–30 January 2009, pp 22–33

  12. Ferilli S, Biba M, Di Mauro N, Basile TMA, Esposito F (2009) Plugging taxonomic similarity in first-order logic horn clauses comparison. In: Emergent perspectives in artificial intelligence, lecture notes in artificial intelligence. Springer, pp 131–140

  13. Ferilli S, Leuzzi F, Rotella F (2011) Cooperating techniques for extracting conceptual taxonomies from text. In: Proceedings of the workshop on mining complex patterns at AI*IA 7th conference

  14. Gale WA, Church KW, Yarowsky D (1992) One sense per discourse. In: DARPA speech and natural language workshop

  15. Gupta V, Lehal G (2009) A survey of text mining techniques and applications. J Emerg Tech Web Intell 1(1):60–76

  16. Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The weka data mining software: An update. SIGKDD Explor Newsl 11(1):10–18

  17. Hamming RW (1950) Error detecting and error correcting codes. Bell Syst Tech J 29(2):147–160

  18. Hasegawa R, Kitamura M, Kaiya H, Saeki M (2009) Extracting conceptual graphs from Japanese documents for software requirements modeling. In: Proceedings of the 6th APCCM, APCCM 09, vol 96. Australian Computer Society, Inc., Darlinghurst, Australia, pp 87–96

  19. Hensman S (2004) Construction of conceptual graph representation of texts. In: Proceedings of the student research workshop at HLT-NAACL 2004, HLT-SRWS 04. Association for Computational Linguistics Stroudsburg, pp 49–54

  20. Jones WP, Furnas GW (1987) Pictures of relevance: a geometric analysis of similarity measures. J Amer Soc Inf Sci 38(6):420–442

  21. Karypis G, Han E-H (2000) Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical report tr-00-0016, University of Minnesota

  22. Karypis G, (Sam) Han E-H (2000) Concept indexing: a fast dimensionality reduction algorithm with applications to document retrieval and categorization. Technical report, IN CIKM00

  23. Kimmig A, Costa VS, Rocha R, Demoen B, De Raedt L (2008) On the efficient execution of ProbLog programs. In: Garcia de la Banda M, Pontelli E (eds) ICLP, Lecture notes in computer science, vol 5366. Springer, pp 175–189

  24. Kimmig A, De Raedt L, Toivonen H (2007) Probabilistic explanation based learning. In: ECML, pp 176–187

  25. Kipper K, Dang HT, Palmer M (2000) Class-based construction of a verb lexicon. In: Proceedings of the 17th NCAI and 12th IAAI conference. AAAI Press, pp 691–696

  26. Klein D, Manning CD (2003) Fast exact inference with a factored model for natural language parsing. In: Advances in neural information processing systems, vol 15. MIT Press

  27. Koo S-O, Lim S-Y, Lee S-J (2003) Constructing an ontology based on hub words. In: ISMIS03, pp 93–97

  28. Leuzzi F, Ferilli S, Rotella F (2013) ConNeKTion: a tool for handling conceptual graphs automatically extracted from text. In: Catarci T, Ferro N, Poggi A (eds) Bridging between cultural heritage institutions. Proceedings of the 9th Italian research conference on digital libraries (IRCDL 2013), CCIS, vol 385. Springer

  29. Leuzzi F, Ferilli S, Rotella F (2013) Improving robustness and flexibility of concept taxonomy learning from text. In: Appice A, Ceci M, Loglisci C, Manco G, Masciari E, Ras ZW (eds) New frontiers in mining complex patterns - first International Workshop, NFMCP 2012, Held in Conjunction with ECML/PKDD 2012, Bristol, UK, September 24, 2012 Revised Selected Papers, CCIS, vol 7765. Springer, pp 232–244

  30. Leuzzi F, Ferilli S, Taranto C, Rotella F (2013) A relational unsupervised approach to author identification. In: Workshop new frontiers in mining complex patterns 2013 held at ECML-PKDD 2013

  31. Maedche A, Staab S (2000) Mining ontologies from text. In: EKAW, pp 189–202

  32. Maedche A, Staab S (2000) The text-to-onto ontology learning environment. In: ICCS-2000 — 8th international conference on conceptual structures, software demonstration

  33. Magnini B, Cavaglià G (2000) Integrating subject field codes into wordnet, pp 1413–1418

  34. De Marneffe M-C, Maccartney B, Manning CD (2006) Generating typed dependency parses from phrase structure parses. In: Proceedings international conference on language resources and evaluation (LREC), pp 449–454

  35. Matsuo Y, Ishizuka M (2004) Keyword extraction from a single document using word co-occurrence statistical information. Int J Artif Intell Tools 13:2003

  36. Mccarthy PM, Lewis GA, Dufty DF, Mcnamara DS (2006) Analyzing writing styles with coh-metrix. In: Sutcliffe G, Goebel R (eds) Proceedings of the Florida artificial intelligence research society international conference (FLAIRS). AAAI Press, pp 764–769

  37. Miller GA (1995) Wordnet: A lexical database for English. Commun ACM 38(11):39–41

  38. Ogata N (2001) A formal ontology discovery from web documents. In: Web intelligence: research and development, 1st Asia-Pacific conference (WI 2001), lecture notes on artificial intelligence, no 2198. Springer, pp 514–519

  39. O’Madadhain J, Fisher D, White S, Boey Y (2003) The JUNG (Java Universal Network/Graph) framework. Technical report, UCI-ICS

  40. Qiu L, Kan M-Y, Chua T-S (2004) A public reference implementation of the RAP anaphora resolution algorithm. In: Proceedings of the 4th international conference on language resources and evaluation, LREC 2004, May 26–28, 2004. European Language Resources Association, Lisbon, pp 291–294

  41. De Raedt L, Kimmig A, Toivonen H (2007) Problog: a probabilistic prolog and its application in link discovery. In: Proceedings of 20th IJCAI. AAAI Press, pp 2468–2473

  42. Raghavan S, Kovashka A, Mooney R (2010) Authorship attribution using probabilistic context-free grammars. In: Proceedings of the ACL 2010 conference short papers, ACLShort 10. Association for Computational Linguistics, Stroudsburg, pp 38–42

  43. Robertson SE, Walker S, Jones S, Hancock-Beaulieu MM, Gatford M (1996) Okapi at trec-3, pp 109–126

  44. Rotella F, Ferilli S, Leuzzi F (2013) An approach to automated learning of conceptual graphs from text. In: Ali M, Bosse T, Hindriks KV, Hoogendoorn M, Jonker CM, Treur J (eds) Recent trends in applied artificial intelligence, 26th international conference on industrial, engineering and other applications of applied intelligent systems, IEA/AIE 2013, Amsterdam, The Netherlands, 17–21 June 2013, Proceedings, Lecture notes in computer science, vol 7906. Springer, pp 341–350

  45. Rotella F, Ferilli S, Leuzzi F (2013) A domain based approach to information retrieval in digital libraries. In: Agosti M, Esposito F, Ferilli S, Ferro N (eds) Digital Libraries and archives - 8th Italian research conference, IRCDL 2012, Bari, Italy, 9-10 Feb 2012. Revised selected papers, CCIS, vol 354. Springer-Verlag, Berlin Heidelberg, pp 129–140

  46. Salton G (1971) The SMART retrieval system experiments in automatic document processing. Prentice-Hall, Upper Saddle River

  47. Salton G (1980) Automatic term class construction using relevance–a summary of work in automatic pseudoclassification. Inf Process Manage 16(1):1–15

  48. Salton G, McGill M (1984) Introduction to modern information retrieval. McGraw-Hill Book Company

  49. Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18:613–620

  50. Sato T (1995) A statistical learning method for logic programs with distribution semantics. In: Proceedings of the 12th ICLP 1995. MIT Press, pp 715–729

  51. Semeraro G, Esposito F, Malerba D, Fanizzi N, Ferilli S (1997) A logic framework for the incremental inductive synthesis of datalog theories. In: Fuchs NE (ed) LOPSTR, Lecture notes in computer science, vol 1463. Springer, pp 300–321

  52. Shamsfard M, Barforoush AA (2004) Learning ontologies from natural language texts. Int J Hum-Comput Stud 60(1):17–63

  53. Singhal A, Buckley C, Mitra M, Mitra A (1996) Pivoted document length normalization. ACM Press, pp 21–29

  54. Stamatatos E (2009) A survey of modern authorship attribution methods. J Am Soc Inf Sci Technol 60(3):538–556

  55. van Halteren H (2004) Linguistic profiling for author recognition and verification. In: Proceedings of the 42nd annual meeting on association for computational linguistics, ACL 04. Association for Computational Linguistics, Stroudsburg

  56. Velardi P, Navigli R, Cucchiarelli A, Neri F (2006) Evaluation of OntoLearn, a methodology for automatic population of domain ontologies. In: Ontology learning from text: methods, applications and evaluation. IOS Press

  57. Wu Z, Palmer M (1994) Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on association for computational linguistics. Association for Computational Linguistics, Morristown, pp 133–138

  58. Zesch T, Müller C, Gurevych I (2008) Extracting lexical semantic knowledge from wikipedia and wiktionary. In: Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, Electronic proceedings

  59. Zheng R, Li J, Chen H, Huang Z (2006) A framework for authorship identification of online messages: writing-style features and classification techniques. J Am Soc Inf Sci Technol 57(3):378–393

Acknowledgments

This work was partially funded by the Italian PON 2007-2013 project PON02_00563_3489339 “Puglia@Service”.

Author information

Corresponding author

Correspondence to Stefano Ferilli.

About this article

Cite this article

Rotella, F., Leuzzi, F. & Ferilli, S. Learning and exploiting concept networks with ConNeKTion. Appl Intell 42, 87–111 (2015). https://doi.org/10.1007/s10489-014-0543-z

  • DOI: https://doi.org/10.1007/s10489-014-0543-z
