A hybrid strategy to extract metadata from scholarly articles by utilizing support vector machine and heuristics

Waqas, Muhammad; Anjum, Nadeem; Afzal, Muhammad Tanvir

doi:10.1007/s11192-023-04774-7

Published: 22 June 2023

Scientometrics, volume 128, pages 4349–4382 (2023)

Abstract

The immense growth in online research publications has motivated the research community to extract valuable information from scientific resources by exploring online digital libraries and publishers’ websites. Metadata stored in a machine-comprehensible form can support precise search, listing the most relevant articles by applying semantic queries to a document’s metadata and structural elements. Online search engines and digital libraries offer only keyword-based search over the full body text, which returns excessive results. In recent years, the research community has adopted different approaches to extract structural information from research documents. We distribute the content of an article into two logical layouts and metadata levels. This strategy gives our technique an advantage over state-of-the-art (SOTA) techniques when extracting metadata from articles with diverse publication styles. The experimental results show a significant performance gain of 20.26% to 27.14%.

References

Ahmed, M. W., & Afzal, M. T. (2020). FLAG-PDFe: Features oriented metadata extraction framework for scientific publications. IEEE Access, 8, 99458–99469.
Berg, Ø. R., Oepen, S., & Read, J. (2012). Towards high-quality text stream extraction from pdf: Technical background to the acl 2012 contributed task. In Proceedings of the ACL-2012 special workshop on rediscovering 50 years of discoveries (pp. 98–103). Association for Computational Linguistics.
Böschen, I. (2021). Software review: The jatsdecoder package–extract metadata, abstract and sectioned text from niso-jats coded xml documents; insights to pubmed central’s open access database. Scientometrics, 126(12), 9585–9601.
Constantin, A., Pettifer, S., & Voronkov, A. (2013). Pdfx: Fully-automated pdf-to-xml conversion of scientific literature. In Proceedings of the 2013 ACM symposium on document engineering (pp. 177–180). ACM.
Councill, I. G., Giles, C. L., & Kan, M.-Y. (2008). Parscit: An open-source crf reference string parsing package. LREC, 8, 661–667.
Déjean, H. & Meunier, J.-L. (2006). A system for converting pdf documents into structured xml format. In International workshop on document analysis systems (pp. 129–140). Springer.
Dimou, A., Di Iorio, A., Lange, C., & Vahdati, S. (2016). Semantic publishing challenge—Assessing the quality of scientific output in its ecosystem. In A. Dimou, A. Di Iorio, C. Lange, & S. Vahdati (Eds.), Semantic web evaluation challenge (pp. 243–254). Springer.
Do, H. H. N., Chandrasekaran, M. K., Cho, P. S., & Kan, M. Y. (2013). Extracting and matching authors and affiliations in scholarly documents. In Proceedings of the 13th ACM/IEEE-CS joint conference on digital libraries (pp. 219–228). ACM.
Granitzer, M., Hristakeva, M., Jack, K., & Knight, R. (2012). A comparison of metadata extraction techniques for crowdsourced bibliographic metadata management. In Proceedings of the 27th annual ACM symposium on applied computing (pp. 962–964). ACM.
Jinha, A. E. (2010). Article 50 million: An estimate of the number of scholarly articles in existence. Learned Publishing, 23(3), 258–263.
Johnson, R., Watkinson, A., & Mabe, M. (2018). The stm report. An overview of scientific and scholarly publishing (5th ed.). STM Association.
Kiss, T., & Strunk, J. (2006). Unsupervised multilingual sentence boundary detection. Computational Linguistics, 32(4), 485–525.
Klampfl, S., Granitzer, M., Jack, K., & Kern, R. (2014). Unsupervised document structure analysis of digital scientific articles. International Journal on Digital Libraries, 14(3–4), 83–99.
Klink, S., & Kieninger, T. (2001). Rule-based document structure understanding with a fuzzy combination of layout and textual features. International Journal on Document Analysis and Recognition, 4(1), 18–26.
Lin, Y., Michel, J.-B., Aiden, E. L., Orwant, J., Brockman, W., & Petrov, S. (2022). Syntactic annotations for the google books ngram corpus.
Luong, M. T., Nguyen, T. D., & Kan, M. Y. (2012). Logical structure recovery in scholarly articles with rich document features. In Multimedia storage and retrieval innovations for digital library systems (pp. 270–292). IGI Global.
Ma, K. (2018). Automatic literature metadata extraction from datacite services. Recent Patents on Computer Science, 11(1), 25–31.
Ramakrishnan, C., Patnia, A., Hovy, E., & Burns, G. A. (2012). Layout-aware text extraction from full-text pdf of scientific articles. Source Code for Biology and Medicine, 7(1), 7.
Rebholz-Schuhmann, D., Oellrich, A., & Hoehndorf, R. (2012). Text-mining solutions for biomedical research: Enabling integrative biology. Nature Reviews Genetics, 13(12), 829–839.
Santosh, K. (2015). g-dice: Graph mining-based document information content exploitation. International Journal on Document Analysis and Recognition, 18(4), 337–355.
Su, X., Gao, G., Wei, H., & Bao, F. (2016). A knowledge-based recognition system for historical Mongolian documents. International Journal on Document Analysis and Recognition, 19(3), 221–235.
Tkaczyk, D., Bolikowski, L., Czeczko, A., & Rusek, K. (2012). A modular metadata extraction system for born-digital articles. In 2012 10th IAPR international workshop on document analysis systems (DAS) (pp. 11–16). IEEE.
Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P. J., & Bolikowski, Ł. (2015). Cermine: Automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition, 18(4), 317–335.
Tsai, C.-T., Kundu, G., & Roth, D. (2013). Concept-based analysis of scientific literature. In Proceedings of the 22nd ACM international conference on conference on information & knowledge management (pp. 1733–1738). ACM.
Tuarob, S., Bhatia, S., Mitra, P., & Giles, C. L. (2013). Automatic detection of pseudocodes in scholarly documents using machine learning. In 2013 12th international conference on document analysis and recognition (pp. 738–742). IEEE.
Tuarob, S., Kang, S. W., Wettayakorn, P., Pornprasit, C., Sachati, T., Hassan, S.-U., & Haddawy, P. (2020). Automatic classification of algorithm citation functions in scientific literature. IEEE Transactions on Knowledge and Data Engineering, 32(10), 1881–1896. https://doi.org/10.1109/TKDE.2019.2913376
Washio, T., & Motoda, H. (2003). State of the art of graph-based data mining. Acm Sigkdd Explorations Newsletter, 5(1), 59–68.
Wu, J., Williams, K. M., Chen, H.-H., Khabsa, M., Caragea, C., Tuarob, S., Ororbia, A. G., Jordan, D., Mitra, P., & Giles, C. L. (2015). Citeseerx: AI in a digital library search engine. AI Magazine, 36(3), 35–48.

Author information

Authors and Affiliations

Department of Computer Science, Capital University of Science and Technology, Expressway, Kahuta Road, Zone-V, Islamabad, ICT, Pakistan
Muhammad Waqas & Nadeem Anjum
Faculty of Computing, Shifa Tameer-e-Millat University, Pitras Bukhari Road, H-8/4, Islamabad, ICT, 44000, Pakistan
Muhammad Tanvir Afzal

Corresponding author

Correspondence to Muhammad Waqas.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Appendix: Comparison with Cermine and PDFX on GOLD-standard

XQuery for Cermine generated XML files

We used the following XQueries to get the desired metadata from the XML output of the Cermine system.

Journal name: XQuery to find the name of the journal.

    <journaltitle>
    {data($articles/front/journal-meta/journal-title-group/journal-title)}
    </journaltitle>

Title: XQuery to find the title of the article.

    <title>
    {data($articles/front/article-meta/title-group/article-title)}
    </title>

DOI: XQuery to find the DOI of the article.

    <doi>
    {data($articles/front/article-meta/article-id)}
    </doi>

Year: XQuery to find the publication year of the article.

    <pubyear>
    {data($articles/front/article-meta/pub-date/year)}
    </pubyear>

Volume: XQuery to find the volume of the article.

    <volume>
    {data($articles/front/article-meta/volume)}
    </volume>

Issue: XQuery to find the issue number of the article.

    <issue>
    {data($articles/front/article-meta/issue)}
    </issue>

Pages: XQuery to find the first and last pages of the article.

    <firstpage>
    {data($articles/front/article-meta/fpage)}
    </firstpage>
    <lastpage>
    {data($articles/front/article-meta/lpage)}
    </lastpage>

Keywords: XQuery to find the keywords of the article.

    <keywords>{
      for $keyword at $j in $articles/front/article-meta/kwd-group/kwd
      return <keyword>{data($keyword)}</keyword>
    }</keywords>

Authors: XQuery to find the authors of the article.

    <authors>{
      for $authors at $j in $articles/front/article-meta/contrib-group/contrib/string-name
      return <author>{data($authors)}</author>
    }</authors>

Affiliations: XQuery to find the author affiliations of the article.

    <affiliations>
    {
      for $affiliation at $j in $articles/front/article-meta/contrib-group/aff
      return <institution>{data($affiliation/institution)}</institution>
    }
    {
      for $affiliation at $j in $articles/front/article-meta/contrib-group/aff
      return <country>{data($affiliation/country)}</country>
    }
    </affiliations>
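
The author and affiliation queries above iterate over repeated elements and emit one output element per match. For readers without an XQuery processor, the same iteration can be approximated with Python's standard library. This is only a minimal sketch and not the authors' tooling: the sample XML is invented to mirror the element paths in the queries, `$articles` is assumed to be bound to the `article` root element, and the helper name `child_text` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample that mirrors the paths used in the queries;
# it is not real Cermine output.
SAMPLE = """<article><front><article-meta><contrib-group>
  <contrib><string-name>A. Author</string-name></contrib>
  <contrib><string-name>B. Writer</string-name></contrib>
  <aff><institution>Example University</institution><country>Exampleland</country></aff>
  <aff><institution>Sample Institute</institution></aff>
</contrib-group></article-meta></front></article>"""


def child_text(parent, tag):
    """Text of the first child named `tag`, or '' when it is absent."""
    child = parent.find(tag)
    return "".join(child.itertext()).strip() if child is not None else ""


root = ET.fromstring(SAMPLE)
group = "front/article-meta/contrib-group"

# One entry per <string-name>, like the authors query.
authors = [
    "".join(name.itertext()).strip()
    for name in root.findall(f"{group}/contrib/string-name")
]

# One entry per <aff>, like the two loops in the affiliations query;
# an <aff> without the child yields an empty string.
affs = root.findall(f"{group}/aff")
institutions = [child_text(aff, "institution") for aff in affs]
countries = [child_text(aff, "country") for aff in affs]

print(authors, institutions, countries)
```

As in the affiliations query, institutions and countries are collected in two separate passes, so the two lists stay aligned by position only because every `aff` contributes an entry to both.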

H1: XQuery to find the level-1 headings of the article.

    <section1>
    {
      for $h1 at $j in $articles/body/sec
      return <h1>{data($h1/title)}</h1>
    }
    </section1>

H2: XQuery to find the level-2 headings of the article.

    <section2>
    {
      for $h2 at $j in $articles/body/sec/sec
      return <h2>{data($h2/title)}</h2>
    }
    </section2>

H3: XQuery to find the level-3 headings of the article.

    <section3>
    {
      for $h3 at $j in $articles/body/sec/sec/sec
      return <h3>{data($h3/title)}</h3>
    }
    </section3>
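
The three heading queries differ only in how many `sec` steps the path contains, so the heading level corresponds to the nesting depth of the section. A minimal Python sketch of that idea follows, assuming `$articles` is bound to the `article` root element. The sample XML and the function name `headings` are invented for illustration and are not part of the original evaluation.

```python
import xml.etree.ElementTree as ET

# Invented sample with sections nested three levels deep.
SAMPLE = """<article><body>
  <sec><title>Introduction</title></sec>
  <sec><title>Method</title>
    <sec><title>Features</title>
      <sec><title>Font size</title></sec>
    </sec>
  </sec>
</body></article>"""


def headings(root, level):
    """Titles of sections nested exactly `level` deep under <body>."""
    path = "body" + "/sec" * level  # level 2 -> "body/sec/sec"
    out = []
    for sec in root.findall(path):
        title = sec.find("title")
        # A section without a title still yields an (empty) entry,
        # as the XQuery loop over sections would.
        out.append("".join(title.itertext()).strip() if title is not None else "")
    return out


root = ET.fromstring(SAMPLE)
for level in (1, 2, 3):
    print(level, headings(root, level))
```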

References: XQuery to find the number of references in the article.

    <refcnt>
    {
      for $ref at $j in $articles/back/ref-list
      return <ref>{count($ref/ref)}</ref>
    }
    </refcnt>

Abstract: XQuery to find the abstract of the article.

    <abstract>
    {count($articles/front/article-meta/abstract)}
    </abstract>
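
Most of the Cermine queries above read a single element at a fixed path and wrap its text in an output tag. The sketch below approximates those lookups with Python's standard library, assuming `$articles` is bound to the `article` root element. It is illustrative only: the sample XML is invented, and `PATHS`, `text_of`, and `extract_front_matter` are hypothetical names rather than anything from the authors' setup.

```python
import xml.etree.ElementTree as ET

# Invented sample that mirrors the element paths used in the queries;
# it is not real Cermine output.
SAMPLE = """<article>
  <front>
    <journal-meta>
      <journal-title-group><journal-title>Example Journal</journal-title></journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group><article-title>An example title</article-title></title-group>
      <pub-date><year>2020</year></pub-date>
      <volume>12</volume>
      <issue>3</issue>
      <fpage>100</fpage>
      <lpage>115</lpage>
    </article-meta>
  </front>
</article>"""

# Output field name -> path relative to the <article> root.
PATHS = {
    "journaltitle": "front/journal-meta/journal-title-group/journal-title",
    "title": "front/article-meta/title-group/article-title",
    "pubyear": "front/article-meta/pub-date/year",
    "volume": "front/article-meta/volume",
    "issue": "front/article-meta/issue",
    "firstpage": "front/article-meta/fpage",
    "lastpage": "front/article-meta/lpage",
}


def text_of(elem):
    """Rough stand-in for XQuery data(): all descendant text, trimmed."""
    return "".join(elem.itertext()).strip()


def extract_front_matter(xml_string):
    """Return a dict of field name -> text; missing elements map to ''."""
    root = ET.fromstring(xml_string)
    result = {}
    for field, path in PATHS.items():
        node = root.find(path)
        result[field] = text_of(node) if node is not None else ""
    return result


fields = extract_front_matter(SAMPLE)
print(fields)
```

Keeping the paths in one table makes it easy to compare the fields that a tool's output actually contains against the gold standard, since a missing element simply produces an empty value.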

XQuery for PDFX generated XML files

We used the following XQueries to get the desired metadata from the XML output of the PDFX system.

Title: XQuery to find the title of the article.

    <title>
    {data($articles/article/front/title-group/article-title)}
    </title>

DOI: XQuery to find the DOI of the article.

    <doi>{data($articles/meta/doi)}</doi>

Abstract: XQuery to find the abstract of the article.

    <abstract>
    {count($articles/article/front/abstract)}
    </abstract>

Authors: XQuery to find the authors of the article.

    <authors>{
      for $authors at $j in $articles/article/front/region/email
      return <email>{data($authors)}</email>
    }</authors>

H1: XQuery to find the level-1 headings of the article.

    <section1>
    {
      for $h1 at $j in $articles/article/body/.//h1
      return <h1>{data($h1)}</h1>
    }
    </section1>

H2: XQuery to find the level-2 headings of the article.

    <section2>
    {
      for $h2 at $j in $articles/article/body/.//h2
      return <h2>{data($h2)}</h2>
    }
    </section2>

H3: XQuery to find the level-3 headings of the article.

    <section3>
    {
      for $h3 at $j in $articles/article/body/.//h3
      return <h3>{data($h3)}</h3>
    }
    </section3>

References: XQuery to find the number of references in the article.

    <ref>
    {count($articles/article/body/section/.//ref-list/ref)}
    </ref>

Table caption: XQuery to find the table captions in the article.

    <tables>{
      for $captions at $j in $articles/.//region[contains(@class, 'DoCO:TableBox')]
      return <table>{data($captions/caption)}</table>
    }</tables>

Figure caption: XQuery to find the figure captions in the article.

    <figures>{
      for $captions at $j in $articles/.//region[contains(@class, 'DoCO:FigureBox')]
      return <fig>{data($captions/caption)}</fig>
    }</figures>
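
The table and figure caption queries select `region` elements whose `class` attribute contains a given marker. Python's `xml.etree.ElementTree` supports only a subset of XPath and has no `contains()` function, so the sketch below applies the substring test in Python instead. It is an illustration under assumptions: the sample XML and its root element name are invented, and the function name `captions` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; only the <region class="..."> / <caption> shape
# is taken from the queries above.
SAMPLE = """<pdfx>
  <article><body>
    <region class="DoCO:TableBox"><caption>Table 1 Example table</caption></region>
    <region class="DoCO:FigureBox"><caption>Fig. 1 Example figure</caption></region>
    <region class="other">Body text</region>
  </body></article>
</pdfx>"""


def captions(root, class_fragment):
    """Caption text of every <region> whose class contains `class_fragment`."""
    out = []
    for region in root.iter("region"):
        # Substring test standing in for XPath contains(@class, ...).
        if class_fragment in region.get("class", ""):
            cap = region.find("caption")
            out.append("".join(cap.itertext()).strip() if cap is not None else "")
    return out


root = ET.fromstring(SAMPLE)
print(captions(root, "DoCO:TableBox"))
print(captions(root, "DoCO:FigureBox"))
```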

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Waqas, M., Anjum, N. & Afzal, M.T. A hybrid strategy to extract metadata from scholarly articles by utilizing support vector machine and heuristics. Scientometrics 128, 4349–4382 (2023). https://doi.org/10.1007/s11192-023-04774-7

Received: 29 August 2022
Accepted: 08 June 2023
Published: 22 June 2023
Issue Date: August 2023
DOI: https://doi.org/10.1007/s11192-023-04774-7
