Generic features selection for structure classification of diverse styled scholarly articles

Waqas, Muhammad; Anjum, Nadeem

doi:10.1007/s11042-023-16128-9

Published: 16 July 2023

Multimedia Tools and Applications, Volume 83, pages 16623–16655 (2024)

Abstract

The enormous growth of online research publications across diverse domains has led the research community to mine these scientific resources by searching online digital libraries and publishers' websites. A precise search should list the most relevant articles by applying semantic queries to a document's metadata and structural elements. Online search engines and digital libraries offer only keyword-based search over the full body text, which returns an excessive number of results. Online research publishers therefore need to store an article's structural and metadata information in machine-comprehensible form. In recent years, the research community has adopted different approaches to extract structural information from research documents, such as rule-based heuristics and machine-learning-based approaches. Studies suggest that machine-learning-based techniques produce the best results for document structure extraction from publishers with diverse publication layouts. In this paper, we propose thirteen logical layout structural (LLS) components. We identify an innovative two-stage set of generic features associated with the LLS. This approach gives our technique an advantage over the state of the art for structural classification of digital scientific articles with diverse publication styles. We apply chi-square (\(\chi^{2}\)) for feature selection, and the final results show that an SVM (kernel function) produces the best result, with an overall F-measure of 0.95.
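To make the feature-selection step concrete, the following is a minimal sketch of chi-square scoring, assuming categorical features and Pearson's statistic computed on a feature-by-class contingency table. The abstract does not state which chi-square variant or implementation the authors used, so this is an illustration only; the feature names (`is_bold`, `is_centered`), the class labels, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def chi2_score(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the class labels."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))  # joint counts (feature value, class)
    feature_totals = Counter(feature_values)         # marginal counts per feature value
    label_totals = Counter(labels)                   # marginal counts per class
    score = 0.0
    for value, value_count in feature_totals.items():
        for label, label_count in label_totals.items():
            # Count expected in this cell if feature and class were independent.
            expected = value_count * label_count / n
            score += (observed[(value, label)] - expected) ** 2 / expected
    return score


def rank_features(rows, labels, names):
    """Return (feature name, chi-square score) pairs, highest score first."""
    scores = {
        name: chi2_score([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: one row per text block, one column per feature.
names = ["is_bold", "is_centered"]
rows = [(1, 0), (1, 1), (1, 0), (0, 1), (0, 0), (0, 1), (0, 0), (0, 1)]
labels = ["heading"] * 3 + ["body"] * 5

for name, score in rank_features(rows, labels, names):
    print(f"{name}: {score:.3f}")
```

In this toy data `is_bold` separates the two classes perfectly, so it receives the higher score and would be retained first; the top-ranked features would then be passed to a classifier such as a kernel SVM.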

Data Availability

The reference to ds1 is available on page 24 of the article (footnote 5: https://github.com/flagpdfe/LLSExtractor).

Author information

Nadeem Anjum contributed equally to this work.

Authors and Affiliations

Department of Computer Science, Capital University of Science and Technology, Expressway, Kahuta Road, Zone-V, ICT, Islamabad, Pakistan
Muhammad Waqas & Nadeem Anjum

Corresponding author

Correspondence to Muhammad Waqas.

Ethics declarations

Conflict of interest

No conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Waqas, M., Anjum, N. Generic features selection for structure classification of diverse styled scholarly articles. Multimed Tools Appl 83, 16623–16655 (2024). https://doi.org/10.1007/s11042-023-16128-9

Received: 06 December 2021
Revised: 21 June 2023
Accepted: 26 June 2023
Published: 16 July 2023
Issue Date: February 2024
DOI: https://doi.org/10.1007/s11042-023-16128-9
