Abstract
With the rapid development of mobile Internet and artificial intelligence technology, the digital publishing industry urgently needs intelligent technology to change how it produces content and delivers services. Most of the e-book resources owned by publishing enterprises are in PDF format, which is poorly suited to reading on mobile devices and makes it inconvenient to extract key information directly or to construct a knowledge graph. With this in mind, this article designs an automatic PDF indexing scheme that identifies all the element information in a PDF, outputs it automatically as structured data, and then extracts the key information to generate a keyword library with tag weights. The scheme involves two key technical points: parsing the PDF based on text features and grammar rules, and extracting keywords based on tag weights. The former represents each text block in the PDF as a rectangular area, separates the elements with a clustering algorithm, and finally outputs structured data containing all the information. The latter combines the tags and their weights in the structured data and extracts the keywords with an inter-word relation algorithm. The structured data and keyword database produced by this scheme can be used to produce intelligent e-books and build knowledge graphs, helping publishing enterprises transform from content service providers into intelligent knowledge service providers. This transformation can unlock the core value of the content held by the publishing industry and promote the digitization and intelligentization of the whole industry.
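As a rough illustration of keyword extraction based on tag weights, the sketch below scores each word by the weight of the element tag it appears under. It is a minimal sketch under stated assumptions, not the article's method: the tag names, the weight values, the stopword list, and the function names are all hypothetical, and plain weighted term frequency stands in for the inter-word relation algorithm, which the abstract does not specify.

```python
import re
from collections import Counter

# Hypothetical tag weights; the abstract does not give actual values.
TAG_WEIGHTS = {"title": 3.0, "heading": 2.0, "paragraph": 1.0}

# Small illustrative stopword list (an assumption, not from the article).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "on"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def score_keywords(blocks, top_k=5):
    """Rank words by summed tag weight.

    blocks is a list of (tag, text) pairs, standing in for the structured
    data produced by the PDF parsing step. Unknown tags get weight 1.0.
    """
    scores = Counter()
    for tag, text in blocks:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for word in tokenize(text):
            scores[word] += weight
    return scores.most_common(top_k)


# Toy structured data: one title, one heading, one paragraph.
blocks = [
    ("title", "Automatic indexing of PDF"),
    ("heading", "PDF parsing"),
    ("paragraph", "Parsing groups text blocks in the PDF into elements."),
]
top = score_keywords(blocks, top_k=3)
print(top)
```

In this toy input, "pdf" appears under all three tags, so it accumulates the largest score; a word that appears only in body text contributes far less than one that appears in a title.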
Index Terms
- Research and Implementation of Automatic Indexing Method of PDF for Digital Publishing