research-article

Employing Auto-Annotated Data for Government Document Classification

Authors:

Dagang Chen

ICIAI '19: Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence

Pages 121 - 125

https://doi.org/10.1145/3319921.3319970

Published: 15 March 2019

Abstract

In China, government documents are standard-form documents with legal effect that are produced in the course of government administration. With the continuous development of e-government in China, government databases have grown enormously. To fully exploit these databases, many applications based on natural language processing (NLP) have been developed. Classification is a fundamental task for many NLP applications, such as automatic document archiving, intelligent search, and personalized recommendation. At present, the government document classification method used in China, which is based on issuing departments, has very low accuracy. Traditional text classifiers based on machine learning or deep learning models rely heavily on human-labeled training data. Because there are no open data sets of government documents, we propose a method for automatically constructing a large-scale annotated data set for government document classification based on an information retrieval method. Experimental results show that the supervised classification model trained on our automatically constructed data set outperforms the baseline method by 15% in F1-score.
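The abstract does not specify how the retrieval step works, so the following is only a minimal sketch of one way retrieval-based auto-annotation could look, not the author's method. It assumes each class is described by a short keyword query, scores documents against every query with TF-IDF cosine similarity, and keeps a label only when the best score clears a threshold. The function names, the queries, the whitespace tokenizer, and the `min_score` value are all hypothetical; Chinese text would need a proper word segmenter.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace tokenization, for illustration only.
    return text.lower().split()


def cosine(a, b):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def auto_annotate(documents, class_queries, min_score=0.1):
    """Assign each document the label of its best-matching class query.

    Documents whose best score is below min_score stay unlabeled and are
    left out of the returned list of (document, label) pairs.
    """
    doc_tokens = [tokenize(d) for d in documents]
    n = len(documents)
    # Document frequency and smoothed inverse document frequency.
    df = Counter(t for tokens in doc_tokens for t in set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        # Terms never seen in the corpus are ignored.
        return {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}

    doc_vecs = [vectorize(tokens) for tokens in doc_tokens]
    query_vecs = {label: vectorize(tokenize(q)) for label, q in class_queries.items()}

    labeled = []
    for doc, doc_vec in zip(documents, doc_vecs):
        scores = {label: cosine(q_vec, doc_vec) for label, q_vec in query_vecs.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= min_score:
            labeled.append((doc, best))
    return labeled
```

The resulting (document, label) pairs could then serve as training data for any supervised classifier. The threshold trades data set size against label noise: a higher `min_score` keeps fewer but more reliably labeled documents.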

Index Terms

Employing Auto-Annotated Data for Government Document Classification
1. Computing methodologies
  1. Modeling and simulation
    1. Simulation types and techniques
      1. Massively parallel and high-performance simulations
2. Information systems
  1. Data management systems
    1. Database management system engines

Published In

ICIAI '19: Proceedings of the 2019 3rd International Conference on Innovation in Artificial Intelligence

March 2019

279 pages

ISBN:9781450361286

DOI:10.1145/3319921

Copyright © 2019 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

In-Cooperation

Xi'an Jiaotong-Liverpool University
University of Texas-Dallas

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

ICIAI 2019

ICIAI 2019: 2019 The 3rd International Conference on Innovation in Artificial Intelligence

March 15 - 18, 2019

Suzhou, China

