
LearnSec: A Framework for Full Text Analysis

  • Conference paper
Hybrid Artificial Intelligent Systems (HAIS 2018)

Abstract

Large corpora of scientific research papers have been available for a long time. However, most of these corpora store only the title and abstract of each paper. For some domains this information may not be enough to achieve high performance in text mining tasks. The growing availability of full-text scientific papers has recently eased this problem. A full-text version provides more detailed information but also requires processing a much larger amount of data. A priori, it is difficult to know whether the extra work of full-text analysis has a significant impact on the performance of text mining tasks, or whether the effect depends on the scientific domain or the specific corpus under analysis.

The goal of this paper is to present LearnSec, a framework for full-text analysis that incorporates domain-specific knowledge and information about the content of document sections to improve the classification process with propositional and relational learning.

To demonstrate the usefulness of the tool, we process a scientific corpus based on OHSUMED, generating an attribute/value dataset in Weka format and a First Order Logic dataset in Inductive Logic Programming (ILP) format. The results show that the framework was assessed successfully.
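As a rough illustration of the two output representations mentioned above, the following sketch turns a tiny sectioned corpus into an attribute/value view (ARFF, the file format read by Weka) and a relational view (Prolog-style ground facts of the kind ILP systems consume). It is not LearnSec's actual pipeline or output schema: the toy documents, the function names, and the `class/2` and `has_term/4` predicates are all assumptions made for this example, and a real ILP system would typically also need mode declarations and background knowledge that are not shown.

```python
from collections import Counter

# Hypothetical toy corpus: each document is split into named sections.
DOCS = [
    {"id": "d1", "label": "relevant",
     "sections": {"title": "gene expression",
                  "abstract": "gene expression tumor cells"}},
    {"id": "d2", "label": "nonrelevant",
     "sections": {"title": "retrieval evaluation",
                  "abstract": "retrieval systems evaluation"}},
]

def section_counts(doc):
    """Count term occurrences per section, keyed by (section, term)."""
    counts = Counter()
    for section, text in doc["sections"].items():
        for term in text.lower().split():
            counts[(section, term)] += 1
    return counts

def to_arff(docs, relation="toy_corpus"):
    """Attribute/value view: one numeric attribute per (section, term) pair."""
    per_doc = [section_counts(d) for d in docs]
    features = sorted(set().union(*per_doc))
    labels = sorted({d["label"] for d in docs})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute {s}_{t} numeric" for s, t in features]
    lines.append("@attribute class {" + ",".join(labels) + "}")
    lines += ["", "@data"]
    for doc, counts in zip(docs, per_doc):
        # A Counter returns 0 for (section, term) pairs absent from a document.
        row = [str(counts[f]) for f in features]
        lines.append(",".join(row + [doc["label"]]))
    return "\n".join(lines)

def to_facts(docs):
    """Relational view: ground facts that keep the section as an argument."""
    facts = []
    for doc in docs:
        facts.append(f"class({doc['id']}, {doc['label']}).")
        for (section, term), n in sorted(section_counts(doc).items()):
            facts.append(f"has_term({doc['id']}, {section}, {term}, {n}).")
    return "\n".join(facts)

if __name__ == "__main__":
    print(to_arff(DOCS))
    print()
    print(to_facts(DOCS))
```

In the attribute/value view the section is folded into the attribute name, whereas the relational view keeps it as a separate argument that a learned rule could, in principle, refer to directly.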


Author information

Corresponding author

Correspondence to Carlos Gonçalves.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Gonçalves, C., Iglesias, E.L., Borrajo, L., Camacho, R., Seara Vieira, A., Gonçalves, C.T. (2018). LearnSec: A Framework for Full Text Analysis. In: de Cos Juez, F., et al. Hybrid Artificial Intelligent Systems. HAIS 2018. Lecture Notes in Computer Science, vol 10870. Springer, Cham. https://doi.org/10.1007/978-3-319-92639-1_42

  • DOI: https://doi.org/10.1007/978-3-319-92639-1_42

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92638-4

  • Online ISBN: 978-3-319-92639-1

  • eBook Packages: Computer Science, Computer Science (R0)
