
Generic features selection for structure classification of diverse styled scholarly articles

Multimedia Tools and Applications

Abstract

The enormous growth of online research publications across diverse domains has led the research community to mine these scientific resources by searching online digital libraries and publishers’ websites. A precise search should list the most relevant articles by applying semantic queries to a document’s metadata and structural elements. Online search engines and digital libraries, however, offer only keyword-based search over the full body text, which returns an excessive number of results. Online research publishers therefore need to store each article’s structural and metadata information in machine-comprehensible form. In recent years, the research community has adopted different approaches to extracting structural information from research documents, such as rule-based heuristics and machine-learning-based approaches. Studies suggest that machine-learning-based techniques produce the best results for document structure extraction across publishers with diverse publication layouts. In this paper, we propose thirteen logical layout structural (LLS) components and identify an innovative two-stage set of generic features associated with them. This approach gives our technique an advantage over the state of the art in structural classification of digital scientific articles with diverse publication styles. We applied chi-square (\(\chi^{2}\)) for feature selection, and the final results show that an SVM (kernel function) produced the best result, with an overall F-measure of 0.95.
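To make the chi-square feature selection step concrete, the following is a minimal sketch of ranking categorical features by the textbook Pearson \(\chi^{2}\) statistic against the class label. It is an illustration under stated assumptions, not the authors' implementation: the feature names (`is_bold`, `ends_with_period`), the two class labels, and the helper names `chi_square_score` and `select_top_k` are hypothetical and do not come from the article.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Pearson chi-square statistic between one categorical feature and the class label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    score = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            # Expected count if the feature and the label were independent.
            expected = f_count * c_count / n
            diff = observed.get((f, c), 0) - expected
            score += diff * diff / expected
    return score


def select_top_k(features, labels, k):
    """Rank named features by chi-square score and keep the k highest."""
    scores = {name: chi_square_score(values, labels) for name, values in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Hypothetical toy data: four text blocks, two candidate layout features.
labels = ["heading", "heading", "body", "body"]
features = {
    "is_bold": [1, 1, 0, 0],           # perfectly aligned with the label
    "ends_with_period": [0, 1, 0, 1],  # independent of the label
}

selected, scores = select_top_k(features, labels, k=1)
print(selected, scores)
```

A feature that separates the classes well receives a high score, while a feature that is independent of the label scores zero. The retained features would then be passed to a classifier such as an SVM.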

Data Availability

The reference to ds1 is already available on page 24 (footnote 5: https://github.com/flagpdfe/LLSExtractor).

Notes

  1. https://pdfbox.apache.org/

  2. https://poppler.freedesktop.org/

  3. https://www.openhub.net/p/jpodlib

  4. https://itextpdf.com/

  5. https://github.com/flagpdfe/LLSExtractor

  6. http://cermine.ceon.pl/index.html

  7. http://pdfx.cs.man.ac.uk/

Author information

Corresponding author

Correspondence to Muhammad Waqas.

Ethics declarations

Conflict of interest

No conflict of interest.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Waqas, M., Anjum, N. Generic features selection for structure classification of diverse styled scholarly articles. Multimed Tools Appl 83, 16623–16655 (2024). https://doi.org/10.1007/s11042-023-16128-9

  • DOI: https://doi.org/10.1007/s11042-023-16128-9
