DOI: 10.1145/3297156.3297248
research-article

A Hybrid Approach for Measuring Similarity between Government Documents of China

Published: 08 December 2018

Abstract

In China, the government publishes hundreds of thousands of documents every year. Civil servants struggle to find relevant government documents during office work such as drafting documents, analyzing government policy, and explaining policy to the public. The public also has difficulty finding specific government documents, because the most popular search engines in China, such as Baidu and Sogou, are not specialized in searching and indexing government documents. Determining the similarity between documents is critical to applications such as search and recommendation. Most of the current literature focuses on semantic similarity between words and paragraph fragments. Government documents in China, however, are written according to standards, so they can be treated as semi-structured data after data cleansing. In this paper, we propose a hybrid approach for measuring document-level similarity between government documents of China. We represent each government document by its publisher, domain, type, publishing time, and other contents. We compute the similarity between the corresponding elements of two documents and take the weighted sum of these element similarities as the document similarity. Experimental results show that the proposed hybrid method outperforms classic methods such as LDA and doc2vec.
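
The weighted-sum idea in the abstract can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration, not the paper's method: the `GovDocument` fields, the per-element similarity functions (exact match, exponential time decay, Jaccard token overlap), and the weights are hypothetical placeholders. Content is assumed to be already segmented into whitespace-separated tokens.

```python
from dataclasses import dataclass
from datetime import date
import math


@dataclass(frozen=True)
class GovDocument:
    """Hypothetical semi-structured view of one government document."""
    publisher: str
    domain: str
    doc_type: str
    published: date
    content: str  # assumed pre-segmented into whitespace-separated tokens


def exact_match(a: str, b: str) -> float:
    """Placeholder categorical similarity: 1.0 if equal, else 0.0."""
    return 1.0 if a == b else 0.0


def time_similarity(a: date, b: date, scale_days: float = 365.0) -> float:
    """Placeholder: decays exponentially with the gap between the two dates."""
    return math.exp(-abs((a - b).days) / scale_days)


def jaccard(a: str, b: str) -> float:
    """Placeholder content similarity: Jaccard overlap of token sets."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


# Hypothetical weights (not from the paper); they sum to 1 so that the
# combined score stays in [0, 1].
WEIGHTS = {
    "publisher": 0.2,
    "domain": 0.2,
    "doc_type": 0.1,
    "published": 0.1,
    "content": 0.4,
}


def document_similarity(d1: GovDocument, d2: GovDocument, weights=WEIGHTS) -> float:
    """Weighted sum of the per-element similarities of two documents."""
    scores = {
        "publisher": exact_match(d1.publisher, d2.publisher),
        "domain": exact_match(d1.domain, d2.domain),
        "doc_type": exact_match(d1.doc_type, d2.doc_type),
        "published": time_similarity(d1.published, d2.published),
        "content": jaccard(d1.content, d2.content),
    }
    return sum(weights[name] * scores[name] for name in weights)


if __name__ == "__main__":
    a = GovDocument("State Council", "education", "notice",
                    date(2018, 1, 1), "rural school funding policy")
    b = GovDocument("State Council", "education", "opinion",
                    date(2018, 6, 1), "rural school teacher policy")
    print(document_similarity(a, b))
```

In a real system each placeholder would presumably be replaced by a measure suited to its element, for example an ontology-based measure for publishers or domains and a topic or embedding model for the remaining content. The sketch only shows how the element scores combine.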

Cited By

  • (2019) An Unsupervised Keywords Extraction Approach for Chinese Government Documents. 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 196-201. DOI: 10.1109/ICAIBD.2019.8837026. Online publication date: May 2019.
  • (2019) A Government Policy Analysis Platform Based on Knowledge Graph. 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 208-214. DOI: 10.1109/ICAIBD.2019.8836979. Online publication date: May 2019.

Information & Contributors

Information

Published In

ACM Other conferences
CSAI '18: Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence
December 2018
641 pages
ISBN: 9781450366069
DOI: 10.1145/3297156
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

In-Cooperation

  • Shenzhen University

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Document Similarity
  2. Government Documents
  3. Ontology

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CSAI '18
