research-article

Research and Implementation of Automatic Indexing Method of PDF for Digital Publishing

Authors:
Keliang Chen
School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
ORCID: 0000-0003-0720-2302

Jianming Huang
School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
ORCID: 0000-0002-5988-837X

Qi Zhang
School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
ORCID: 0000-0003-1884-5530

Yansong Cui
School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing, China
ORCID: 0000-0003-3988-001X

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 3, Article No. 69, pp. 1–21. https://doi.org/10.1145/3501400

Published: 14 April 2023

Abstract

With the rapid development of mobile Internet and artificial intelligence technology, the digital publishing industry urgently needs intelligent technology to change how it produces and delivers content. Most e-book resources owned by publishing enterprises are in PDF format, which is poorly suited to reading on mobile devices and makes it inconvenient to extract key information directly or to construct a knowledge graph. With this in mind, this article designs an automatic PDF indexing scheme that identifies all the element information in a PDF, outputs it as structured data, and then extracts the key information to generate a keyword library with tag weights. The scheme involves two key techniques: parsing the PDF based on text features and grammar rules, and extracting keywords based on tag weights. The former represents each text block in the PDF as a rectangular area, divides the page into elements with a clustering algorithm, and finally outputs structured data containing all the information. The latter combines the tags in the structured data with their weights and extracts keywords using an inter-word relation algorithm. The structured data and keyword database produced by this scheme can be used to produce intelligent e-books and build knowledge graphs, helping publishing enterprises transform from content service providers into intelligent knowledge service providers. This transformation can unlock the core value of the content held by the publishing industry and promote the digitization and intelligentization of the whole industry.
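
To make the two steps in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract names neither the clustering algorithm nor the inter-word relation algorithm, so the sketch substitutes a greedy vertical-gap grouping of rectangles and a tag-weighted term frequency. The `TextBlock` structure, the `TAG_WEIGHTS` values, the `max_gap` threshold, and both function names are assumptions made for illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A text block reduced to its bounding rectangle (hypothetical structure)."""
    y0: float  # top edge, page coordinates growing downward
    y1: float  # bottom edge
    text: str
    tag: str = "paragraph"  # structural label assigned after parsing


# Assumed tag weights; the abstract does not give actual values.
TAG_WEIGHTS = {"title": 3.0, "heading": 2.0, "paragraph": 1.0}


def group_by_vertical_gap(blocks, max_gap=4.0):
    """Greedy 1-D clustering: a block joins the current group when the gap
    between its top edge and the group's lowest bottom edge is <= max_gap."""
    groups = []
    for block in sorted(blocks, key=lambda b: b.y0):
        if groups and block.y0 - max(b.y1 for b in groups[-1]) <= max_gap:
            groups[-1].append(block)
        else:
            groups.append([block])
    return groups


def score_keywords(blocks, tag_weights=TAG_WEIGHTS, top_k=5):
    """Weighted term frequency: each occurrence counts as its block's tag weight.
    This is a stand-in for the unspecified inter-word relation algorithm."""
    scores = Counter()
    for block in blocks:
        weight = tag_weights.get(block.tag, 1.0)
        for word in re.findall(r"[a-z]+", block.text.lower()):
            if len(word) > 3:  # crude stand-in for stop-word filtering
                scores[word] += weight
    return scores.most_common(top_k)
```

In this sketch a word that appears in a title contributes more to its score than the same word in body text, which is the general idea behind weighting keywords by tag. A real parser would also use horizontal position, font features, and the grammar rules mentioned in the abstract.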

Index Terms

Research and Implementation of Automatic Indexing Method of PDF for Digital Publishing
1. Computing methodologies
  1. Machine learning
    1. Machine learning algorithms
      1. Feature selection
2. Networks
  1. Network services
    1. Network management

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 22, Issue 3
March 2023
570 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3579816
Editor: Imed Zitouni (Google, USA)
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 14 April 2023
- Online AM: 30 March 2022
- Accepted: 17 November 2021
- Received: 26 March 2021
Author Tags
Natural language processing
automatic indexing
text aggregation
keywords extraction
tag weight
Qualifiers
- research-article