DOI: 10.1145/3319921.3319970
research-article

Employing Auto-Annotated Data for Government Document Classification

Published: 15 March 2019

Abstract

In China, government documents are documents with legal effect and a standard form, produced in the course of government administration. With the continued development of e-government in China, government databases have grown enormously. To fully exploit these databases, many applications based on natural language processing (NLP) have been developed. Classification is a fundamental task for many NLP applications, such as automatic document archiving, intelligent search, and personalized recommendation. At present, the government document classification method used in China, which is based on issuing departments, has very low accuracy. Traditional text classifiers based on machine learning or deep learning models rely heavily on human-labeled training data. Because there are no open data sets of government documents, we propose a method for automatically constructing a large-scale annotated data set for government document classification, based on an information retrieval method. Experimental results show that a supervised classification model trained on our automatically constructed data set outperforms the baseline method by 15% in F1-score.
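
The abstract does not spell out the retrieval procedure, so the following is only a minimal sketch of one way retrieval-based auto-annotation could work, not the authors' method. It assumes each category is described by a short keyword query, scores every document against each query with TF-IDF cosine similarity, and keeps the best-scoring label when it clears a threshold. The function names, the whitespace tokenizer, the example queries, and the `min_score` threshold are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization. Chinese text would need a word segmenter.
    return text.lower().split()


def build_idf(docs_tokens):
    # Smoothed inverse document frequency computed over the document collection.
    n = len(docs_tokens)
    df = Counter()
    for tokens in docs_tokens:
        df.update(set(tokens))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(tokens, idf):
    # Terms never seen in the collection are dropped from the vector.
    tf = Counter(tokens)
    return {term: tf[term] * idf[term] for term in tf if term in idf}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def auto_annotate(documents, category_queries, min_score=0.1):
    """Return (document, label) pairs for documents that match a category query.

    Hypothetical helper: documents whose best similarity is below min_score
    are left unlabeled and excluded from the result.
    """
    docs_tokens = [tokenize(doc) for doc in documents]
    idf = build_idf(docs_tokens)
    query_vecs = {label: tfidf(tokenize(query), idf)
                  for label, query in category_queries.items()}
    labeled = []
    for doc, tokens in zip(documents, docs_tokens):
        doc_vec = tfidf(tokens, idf)
        scores = {label: cosine(doc_vec, qv) for label, qv in query_vecs.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            labeled.append((doc, best))
    return labeled


# Illustrative data only; not taken from the paper.
documents = [
    "notice on tax reduction for small enterprises",
    "regulation on school enrollment and education funding",
    "weather report",
]
category_queries = {
    "finance": "tax enterprises budget",
    "education": "school education enrollment",
}
for doc, label in auto_annotate(documents, category_queries):
    print(label, "<-", doc)
```

The resulting pairs could then serve as training data for any supervised classifier. The threshold trades data set size against label noise: a higher `min_score` keeps fewer but more confidently labeled documents.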

Published In

ICIAI '19: Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence
March 2019
279 pages
ISBN:9781450361286
DOI:10.1145/3319921
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

  • Xi'an Jiaotong-Liverpool University
  • University of Texas-Dallas

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Document Classification
  2. Intelligent Search Engine
  3. Personalized Recommendation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICIAI 2019

Cited By

  • (2023) Natural Language Processing Adoption in Governments and Future Research Directions: A Systematic Review. Applied Sciences 13:22 (12346). DOI: 10.3390/app132212346. Online publication date: 15-Nov-2023
  • (2022) Text Mining on Hospital Stay Durations and Management of Sickle Cell Disease Patients. 2022 14th International Conference on Computational Intelligence and Communication Networks (CICN), pp. 1-6. DOI: 10.1109/CICN56167.2022.10008265. Online publication date: 4-Dec-2022
  • (2021) On Natural Language Processing Applications for Military Dialect Classification. 2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 211-218. DOI: 10.1109/ICMLA52953.2021.00040. Online publication date: Dec-2021
  • (2019) An Unsupervised Keywords Extraction Approach for Chinese Government Documents. 2019 2nd International Conference on Artificial Intelligence and Big Data (ICAIBD), pp. 196-201. DOI: 10.1109/ICAIBD.2019.8837026. Online publication date: May-2019
