Abstract
The rapid growth of online research publications across diverse domains has led researchers to look for these scientific resources by searching digital libraries and publishers' websites. A precise search should list the most relevant articles by applying semantic queries to a document's metadata and structural elements. Online search engines and digital libraries, however, offer only keyword-based search over the full body text, which returns an excessive number of results. Online research publishers therefore need to store each article's structural and metadata information in machine-readable form. In recent years, the research community has adopted different approaches to extracting structural information from research documents, including rule-based heuristics and machine-learning-based approaches. Studies suggest that machine-learning-based techniques produce the best results for document structure extraction across publishers with diverse publication layouts. In this paper, we propose thirteen logical layout structural (LLS) components and identify an innovative two-stage set of generic features associated with them. This approach gives our technique an advantage over the state of the art in the structural classification of digital scientific articles with diverse publication styles. We apply chi-square (\(\chi^{2}\)) for feature selection, and the final results show that an SVM with a kernel function performs best, with an overall F-measure of 0.95.
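To make the chi-square feature selection step concrete, the following is a minimal sketch of how a \(\chi^{2}\) score can rank categorical features against class labels. It is an illustration only and not the authors' implementation: the feature names (`is_bold`, `is_centered`), the labels, and the helper functions are hypothetical, and the paper's actual feature set and tooling are not described in this section.

```python
from collections import Counter


def chi_square_score(feature_values, labels):
    """Chi-square statistic of independence between one categorical
    feature and the class label (larger means more dependent)."""
    n = len(labels)
    feature_counts = Counter(feature_values)
    label_counts = Counter(labels)
    observed = Counter(zip(feature_values, labels))
    score = 0.0
    for f, f_count in feature_counts.items():
        for c, c_count in label_counts.items():
            # Expected count if the feature and the label were independent.
            expected = f_count * c_count / n
            score += (observed[(f, c)] - expected) ** 2 / expected
    return score


def select_top_k(feature_table, labels, k):
    """Return the names of the k features with the highest chi-square score."""
    scores = {name: chi_square_score(values, labels)
              for name, values in feature_table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: four text blocks labelled as heading or body.
labels = ["heading", "heading", "body", "body"]
feature_table = {
    "is_bold": [1, 1, 0, 0],      # perfectly aligned with the label
    "is_centered": [1, 0, 1, 0],  # independent of the label
}

selected = select_top_k(feature_table, labels, k=1)
print(selected)
```

In this toy data `is_bold` separates the two classes completely, so it receives the higher score and is the one feature kept. The selected features would then be passed to a classifier such as an SVM.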
Data Availability
The reference to ds1 is already available on page 24 (footnote 5: https://github.com/flagpdfe/LLSExtractor).
Ethics declarations
Conflict of interest
No conflict of interest.
About this article
Cite this article
Waqas, M., Anjum, N. Generic features selection for structure classification of diverse styled scholarly articles. Multimed Tools Appl 83, 16623–16655 (2024). https://doi.org/10.1007/s11042-023-16128-9
DOI: https://doi.org/10.1007/s11042-023-16128-9