research-article

Research and Implementation of Automatic Indexing Method of PDF for Digital Publishing

Published: 14 April 2023

Abstract

With the rapid development of mobile Internet and artificial intelligence technologies, the digital publishing industry urgently needs intelligent technology to change how content is produced and delivered. Most e-book resources owned by publishing enterprises are in PDF format, which is poorly suited to reading on mobile devices and makes it inconvenient to extract key information directly or to construct a knowledge graph. With this in mind, this article designs an automatic PDF indexing scheme that identifies all element information in a PDF, outputs structured data automatically, and then extracts the key information to generate a keyword library with tag weights. The scheme involves two key techniques: parsing PDF based on text features and grammar rules, and extracting keywords based on tag weights. The former represents each text block in the PDF as a rectangular area, divides the elements with a clustering algorithm, and finally outputs structured data containing all the information. The latter combines the tags and their weights in the structured data and extracts keywords with an inter-word relation algorithm. The structured data and keyword database produced by this scheme can be used to produce intelligent e-books and build knowledge graphs, helping publishing enterprises transform from content service providers into intelligent knowledge service providers. This transformation can unlock the core value of the content held by the publishing industry and promote the digitization and intelligent transformation of the whole industry.
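
The two stages described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the clustering algorithm, the tag set, the weights, or the inter-word relation algorithm, so everything here is an assumption. The `Block` type, the `TAG_WEIGHTS` values, and the `max_gap` threshold are hypothetical, a greedy vertical-gap grouping stands in for the clustering step, and a tag-weighted term count stands in for the keyword extraction step. Coordinates are assumed to grow downward on the page.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """A text block drawn as a rectangle (hypothetical type, y grows downward)."""
    x0: float
    y0: float  # top edge
    x1: float
    y1: float  # bottom edge
    text: str
    tag: str   # assumed element tag, e.g. "title", "heading", "body"


# Hypothetical tag weights; the abstract does not give actual values.
TAG_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def group_blocks(blocks, max_gap=5.0):
    """Greedy stand-in for clustering: merge vertically adjacent blocks
    that share a tag and are separated by at most max_gap units."""
    groups = []
    for block in sorted(blocks, key=lambda b: b.y0):
        last = groups[-1][-1] if groups else None
        if last and last.tag == block.tag and block.y0 - last.y1 <= max_gap:
            groups[-1].append(block)
        else:
            groups.append([block])
    return groups


def keyword_scores(blocks, stopwords=frozenset()):
    """Score each term by summing the weight of the tag it appears under."""
    scores = Counter()
    for block in blocks:
        weight = TAG_WEIGHTS.get(block.tag, 1.0)
        for token in block.text.lower().split():
            token = token.strip(".,;:()")
            if token and token not in stopwords:
                scores[token] += weight
    return scores


blocks = [
    Block(0, 0, 100, 10, "PDF Indexing", "title"),
    Block(0, 20, 100, 30, "Parsing PDF text", "body"),
    Block(0, 32, 100, 42, "PDF keywords", "body"),
]
groups = group_blocks(blocks)
scores = keyword_scores(blocks)
```

In this toy input the two body blocks fall into one group because they share a tag and sit 2 units apart, and a term such as "pdf" accumulates weight from every tagged block it appears in, so terms in titles outrank terms that appear only in body text.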


      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing  Volume 22, Issue 3
        March 2023
        570 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3579816

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 14 April 2023
        • Online AM: 30 March 2022
        • Accepted: 17 November 2021
        • Received: 26 March 2021
