A Hybrid Approach for Measuring Similarity between Government Documents of China

Authors:

Zesong Li

CSAI '18: Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence

Pages 431 - 435

https://doi.org/10.1145/3297156.3297248

Published: 08 December 2018

Abstract

In China, the government publishes hundreds of thousands of documents every year. Civil servants struggle to find relevant government documents in the course of their office work, such as drafting documents, analyzing government policy, and explaining policy to the public. The public also finds it difficult to locate specific government documents, since the most popular search engines in China, such as Baidu and Sogou, are not specialized in searching and indexing government documents. Determining the similarity between documents is critical to applications such as search and recommendation. Most of the existing literature focuses on semantic similarity between words and paragraph fragments. Government documents in China, however, are written according to standards, so they can be treated as semi-structured data after data cleansing. In this paper, we propose a hybrid approach for measuring document-level similarity between Chinese government documents. We represent each government document by its publisher, domain, document type, publishing time, and other content. We compute the similarity between the corresponding elements of two documents and take the weighted sum of these element similarities as the overall document similarity. Experimental results show that the proposed hybrid method outperforms classic methods such as LDA and doc2vec.
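The weighted-sum combination described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the per-element similarity functions or the weights, so everything below is an assumption made for illustration: the `GovDocument` fields, the exact-match comparison for publisher, domain, and type, the time-decay formula, the Jaccard overlap for content, and the values in `WEIGHTS` are hypothetical and are not the paper's actual method.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GovDocument:
    """Hypothetical semi-structured view of a government document."""
    publisher: str
    domain: str
    doc_type: str
    published: date
    content_tokens: frozenset  # stand-in for the "other contents" element


def exact_match(a: str, b: str) -> float:
    """Assumed similarity for categorical elements: 1 if equal, else 0."""
    return 1.0 if a == b else 0.0


def time_similarity(a: date, b: date, scale_days: float = 365.0) -> float:
    """Assumed decay: 1 for the same day, 0.5 when scale_days apart."""
    return 1.0 / (1.0 + abs((a - b).days) / scale_days)


def jaccard(a: frozenset, b: frozenset) -> float:
    """Assumed content similarity: token-set overlap."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Illustrative weights only (chosen to sum to 1); not taken from the paper.
WEIGHTS = {
    "publisher": 0.2,
    "domain": 0.2,
    "doc_type": 0.1,
    "published": 0.1,
    "content": 0.4,
}


def document_similarity(x: GovDocument, y: GovDocument, weights=WEIGHTS) -> float:
    """Weighted sum of the element-level similarities."""
    scores = {
        "publisher": exact_match(x.publisher, y.publisher),
        "domain": exact_match(x.domain, y.domain),
        "doc_type": exact_match(x.doc_type, y.doc_type),
        "published": time_similarity(x.published, y.published),
        "content": jaccard(x.content_tokens, y.content_tokens),
    }
    return sum(weights[name] * scores[name] for name in weights)


# Usage example with made-up documents.
doc_a = GovDocument("State Council", "education", "notice", date(2017, 1, 1),
                    frozenset({"rural", "education", "funding"}))
doc_b = GovDocument("State Council", "finance", "notice", date(2018, 1, 1),
                    frozenset({"education", "funding", "subsidy"}))
score = document_similarity(doc_a, doc_b)
```

With these assumed weights, a document compared with itself scores approximately 1, and the two sample documents score approximately 0.55. In practice the content element would more plausibly use a semantic model than raw token overlap, and the weights would need to be tuned on labeled data.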

Cited By

Fang X, Li Z, Li Z, Song Y (2019). An Unsupervised Keywords Extraction Approach for Chinese Government Documents. 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 196-201. Online publication date: May 2019.
https://doi.org/10.1109/ICAIBD.2019.8837026
Wang P, Li Z, Li Z, Fang X (2019). A Government Policy Analysis Platform Based on Knowledge Graph. 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 208-214. Online publication date: May 2019.
https://doi.org/10.1109/ICAIBD.2019.8836979

Index Terms

A Hybrid Approach for Measuring Similarity between Government Documents of China
1. Applied computing
  1. Computers in other domains
    1. Computing in government
      1. E-government
  2. Document management and text processing
    1. Document searching
2. General and reference
  1. Document types
    1. General conference proceedings

Information

Published In

CSAI '18: Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence

December 2018

641 pages

ISBN: 9781450366069

DOI: 10.1145/3297156

Copyright © 2018 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

In-Cooperation

Shenzhen University

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

CSAI '18

CSAI '18: 2018 2nd International Conference on Computer Science and Artificial Intelligence

December 8 - 10, 2018

Shenzhen, China
