
A hybrid strategy to extract metadata from scholarly articles by utilizing support vector machine and heuristics

Scientometrics

Abstract

The immense growth in online research publications has led the research community to extract valuable information from scientific resources by exploring online digital libraries and publishers' websites. Metadata stored in a machine-comprehensible form can support precise search: semantic queries over a document's metadata and structural elements can list the most relevant articles. Online search engines and digital libraries offer only keyword-based search over the full body text, which returns an excessive number of results. In recent years, the research community has adopted different approaches to extract structural information from research documents. We divide the content of an article into two levels: logical layout and metadata. This strategy gives our technique an advantage over state-of-the-art (SOTA) approaches when extracting metadata from articles with diverse publication styles. The experimental results show that the proposed approach achieves a significant performance gain of 20.26% to 27.14%.


Notes

  1. https://www.niso.org/standards-committees/jats.

  2. https://github.com/angelobo/SemPubEvaluator.

  3. https://github.com/ceurws/lod/wiki/SemPub17_Task2.

  4. https://github.com/knmnyn/ParsCit/blob/master/doc/sectLabelXML.tagged.txt.

  5. https://github.com/WaqasAJanjua/flagpdfe.


Author information

Corresponding author

Correspondence to Muhammad Waqas.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix: Comparison with Cermine and PDFX on GOLD-standard

XQuery for Cermine generated XML files

We used the following XQueries to get the desired metadata from the XML output of the Cermine system.

Journal name: XQuery to find the name of the journal.

    <journaltitle>
      {data($articles/front/journal-meta/journal-title-group/journal-title)}
    </journaltitle>

Title: XQuery to find the title of the article.

    <title>
      {data($articles/front/article-meta/title-group/article-title)}
    </title>

DOI: XQuery to find the DOI of the article.

    <doi>
      {data($articles/front/article-meta/article-id)}
    </doi>

Year: XQuery to find the year of the article.

    <pubyear>
      {data($articles/front/article-meta/pub-date/year)}
    </pubyear>

Volume: XQuery to find the volume of the article.

    <volume>
      {data($articles/front/article-meta/volume)}
    </volume>

Issue: XQuery to find the issue number of the article.

    <issue>
      {data($articles/front/article-meta/issue)}
    </issue>

Pages: XQuery to find the pages of the article.

    <firstpage>
      {data($articles/front/article-meta/fpage)}
    </firstpage>
    <lastpage>
      {data($articles/front/article-meta/lpage)}
    </lastpage>

Keywords: XQuery to find the keywords of the article.

    <keywords>
    {
      for $keyword at $j in $articles/front/article-meta/kwd-group/kwd
      return <keyword>{data($keyword)}</keyword>
    }
    </keywords>

Authors: XQuery to find the authors of the article.

    <authors>
    {
      for $authors at $j in $articles/front/article-meta/contrib-group/contrib/string-name
      return <author>{data($authors)}</author>
    }
    </authors>

Affiliations: XQuery to find the author affiliations of the article.

    <affiliations>
    {
      for $affiliation at $j in $articles/front/article-meta/contrib-group/aff
      return <institution>{data($affiliation/institution)}</institution>
    }
    {
      for $affiliation at $j in $articles/front/article-meta/contrib-group/aff
      return <country>{data($affiliation/country)}</country>
    }
    </affiliations>

H1: XQuery to find the level 1 headings of the article.

    <section1>
    {
      for $h1 at $j in $articles/body/sec
      return <h1>{data($h1/title)}</h1>
    }
    </section1>

H2: XQuery to find the level 2 headings of the article.

    <section2>
    {
      for $h2 at $j in $articles/body/sec/sec
      return <h2>{data($h2/title)}</h2>
    }
    </section2>

H3: XQuery to find the level 3 headings of the article.

    <section3>
    {
      for $h3 at $j in $articles/body/sec/sec/sec
      return <h3>{data($h3/title)}</h3>
    }
    </section3>
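The three heading queries differ only in how many nested sec elements they descend through. The following is a minimal Python sketch of that idea, assuming the standard-library xml.etree.ElementTree module and a hand-written sample document. The sample, the function name headings_by_level, and the max_level parameter are illustrative assumptions and are not taken from the paper. Unlike the XQueries, the sketch lists only sections that actually carry a title.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample body in the nested <sec> layout the queries assume.
SAMPLE_BODY = """<article><body>
  <sec><title>Introduction</title></sec>
  <sec><title>Method</title>
    <sec><title>Features</title>
      <sec><title>Font features</title></sec>
    </sec>
  </sec>
</body></article>"""


def headings_by_level(xml_string, max_level=3):
    """Collect section titles per nesting depth (h1, h2, h3, ...)."""
    root = ET.fromstring(xml_string)  # plays the role of $articles
    result = {}
    for level in range(1, max_level + 1):
        # Level 1 -> body/sec/title, level 2 -> body/sec/sec/title, and so on.
        path = "body" + "/sec" * level + "/title"
        result["h%d" % level] = [
            "".join(title.itertext()).strip() for title in root.findall(path)
        ]
    return result


print(headings_by_level(SAMPLE_BODY))
```

Building the path from the level keeps the three queries in one place, so a deeper heading level would only need a larger max_level.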

References: XQuery to find the number of references in the article.

    <refcnt>
    {
      for $ref at $j in $articles/back/ref-list
      return <ref>{count($ref/ref)}</ref>
    }
    </refcnt>

Abstract: XQuery to find the abstract of the article.

    <abstract>
      {count($articles/front/article-meta/abstract)}
    </abstract>
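As a rough cross-check of the paths used in the queries above, the same lookups can be sketched in Python with the standard-library xml.etree.ElementTree module. The sample document, the helper names text_of and texts_of, and the function extract_metadata are illustrative assumptions, not part of the paper or of Cermine. The root element stands in for the $articles variable, and single-value fields return only the first match.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample in the JATS-like layout the queries assume.
SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-title-group><journal-title>Example Journal</journal-title></journal-title-group>
    </journal-meta>
    <article-meta>
      <article-id>10.0000/example</article-id>
      <title-group><article-title>An Example Title</article-title></title-group>
      <contrib-group>
        <contrib><string-name>A. Author</string-name></contrib>
        <contrib><string-name>B. Author</string-name></contrib>
      </contrib-group>
      <pub-date><year>2020</year></pub-date>
      <kwd-group><kwd>metadata</kwd><kwd>pdf</kwd></kwd-group>
      <abstract><p>Short abstract.</p></abstract>
    </article-meta>
  </front>
  <back>
    <ref-list><ref/><ref/><ref/></ref-list>
  </back>
</article>"""


def text_of(root, path):
    """Text of the first element matching path, or '' if there is none."""
    node = root.find(path)
    return "".join(node.itertext()).strip() if node is not None else ""


def texts_of(root, path):
    """Text of every element matching path, in document order."""
    return ["".join(n.itertext()).strip() for n in root.findall(path)]


def extract_metadata(xml_string):
    root = ET.fromstring(xml_string)  # plays the role of $articles
    meta = "front/article-meta/"
    return {
        "journaltitle": text_of(root, "front/journal-meta/journal-title-group/journal-title"),
        "title": text_of(root, meta + "title-group/article-title"),
        "doi": text_of(root, meta + "article-id"),
        "pubyear": text_of(root, meta + "pub-date/year"),
        "keywords": texts_of(root, meta + "kwd-group/kwd"),
        "authors": texts_of(root, meta + "contrib-group/contrib/string-name"),
        "refcnt": len(root.findall("back/ref-list/ref")),
        # Like the abstract query, this counts abstract elements.
        "abstract": len(root.findall(meta + "abstract")),
    }


print(extract_metadata(SAMPLE))
```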

XQuery for PDFX generated XML files

We used the following XQueries to get the desired metadata from the XML output of the PDFX system.

Title: XQuery to find the title of the article.

    <title>
      {data($articles/article/front/title-group/article-title)}
    </title>

DOI: XQuery to find the DOI of the article.

    <doi>{data($articles/meta/doi)}</doi>

Abstract: XQuery to find the abstract of the article.

    <abstract>
      {count($articles/article/front/abstract)}
    </abstract>

Authors: XQuery to find the authors of the article.

    <authors>
    {
      for $authors at $j in $articles/article/front/region/email
      return <email>{data($authors)}</email>
    }
    </authors>

H1: XQuery to find the level 1 headings of the article.

    <section1>
    {
      for $h1 at $j in $articles/article/body/.//h1
      return <h1>{data($h1)}</h1>
    }
    </section1>

H2: XQuery to find the level 2 headings of the article.

    <section2>
    {
      for $h2 at $j in $articles/article/body/.//h2
      return <h2>{data($h2)}</h2>
    }
    </section2>

H3: XQuery to find the level 3 headings of the article.

    <section3>
    {
      for $h3 at $j in $articles/article/body/.//h3
      return <h3>{data($h3)}</h3>
    }
    </section3>

References: XQuery to find the number of references in the article.

    <ref>
      {count($articles/article/body/section/.//ref-list/ref)}
    </ref>

Table caption: XQuery to find the table captions in the article.

    <tables>
    {
      for $captions at $j in $articles/.//region[contains(@class, 'DoCO:TableBox')]
      return <table>{data($captions/caption)}</table>
    }
    </tables>

Figure caption: XQuery to find the figure captions in the article.

    <figures>
    {
      for $captions at $j in $articles/.//region[contains(@class, 'DoCO:FigureBox')]
      return <fig>{data($captions/caption)}</fig>
    }
    </figures>
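The two caption queries select region elements whose class attribute contains a DoCO box name and then read the caption child. A minimal Python sketch of that selection follows, assuming the standard-library xml.etree.ElementTree module. The sample document and the function name captions are illustrative assumptions and are not real PDFX output. ElementTree paths have no contains() function, so the substring check is done in Python instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample imitating the region/caption layout the queries assume.
SAMPLE = """<pdfx>
  <article><body>
    <region class="DoCO:TableBox"><caption>Table 1 Example table</caption></region>
    <region class="DoCO:FigureBox"><caption>Fig. 1 Example figure</caption></region>
    <region class="DoCO:TextChunk">Plain text.</region>
  </body></article>
</pdfx>"""


def captions(root, box_class):
    """Caption text of every region whose class attribute contains box_class."""
    found = []
    for region in root.iter("region"):  # all region elements, at any depth
        if box_class in region.get("class", ""):
            cap = region.find("caption")
            # A matching region without a caption yields an empty string.
            found.append("".join(cap.itertext()).strip() if cap is not None else "")
    return found


root = ET.fromstring(SAMPLE)  # plays the role of $articles
print(captions(root, "DoCO:TableBox"))
print(captions(root, "DoCO:FigureBox"))
```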

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Waqas, M., Anjum, N. & Afzal, M.T. A hybrid strategy to extract metadata from scholarly articles by utilizing support vector machine and heuristics. Scientometrics 128, 4349–4382 (2023). https://doi.org/10.1007/s11192-023-04774-7


  • DOI: https://doi.org/10.1007/s11192-023-04774-7
