Abstract
Chinese toponym recognition is an important task in named entity recognition and has significant implications for improving geographic information systems. Because social media is real-time and rich in geographical data, identifying the Chinese toponyms in its content, including compound toponyms, informal toponyms, and other forms, is important for automatic geospatial information extraction. However, the strong word-building ability, diverse features, and ambiguity of Chinese toponyms, combined with the linguistic irregularities of social media, make it difficult to locate toponym boundaries accurately and to resolve ambiguities. Furthermore, existing Chinese toponym recognition methods often ignore the fusion of local and global features during feature extraction, which results in the loss of semantic information. To address these problems, we used the Chinese-roberta-wwm-ext pre-trained language model to encode the input text and obtain character-level information. An improved SoftLexicon-based statistical method was employed to acquire word-level semantic information, which was then integrated with the character-level semantic information. A two-channel neural network layer comprising a bidirectional long short-term memory (BiLSTM) network and an inception-dilated convolutional neural network was used to extract global and local features from the text. Additionally, a conditional random field was applied to establish label constraints. The proposed deep neural network model, called CHTopoNER, is designed to identify various forms of Chinese toponyms in irregular Chinese social media content. Its effectiveness was validated on four publicly available annotated toponym datasets and a custom social media dataset. CHTopoNER surpasses state-of-the-art Chinese toponym recognition models and achieves promising results in extracting various types of toponyms and spatial location terms.
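The word-level step mentioned above can be illustrated with a minimal sketch of the basic SoftLexicon idea: for each character, collect the lexicon words that contain it at the beginning (B), in the middle (M), at the end (E), or as a single-character word (S). This is not the improved statistical variant used in CHTopoNER, whose details are not given here. The function name soft_lexicon_sets, the max_len parameter, the toy lexicon, and the example sentence are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def soft_lexicon_sets(
    sentence: str, lexicon: Set[str], max_len: int = 4
) -> List[Dict[str, Set[str]]]:
    """Return, for each character, the B/M/E/S sets of matched lexicon words.

    Hypothetical helper: a plain sketch of SoftLexicon-style matching,
    not the improved method described in the abstract.
    """
    sets: List[Dict[str, Set[str]]] = [
        {"B": set(), "M": set(), "E": set(), "S": set()} for _ in sentence
    ]
    n = len(sentence)
    for start in range(n):
        # Try every substring of up to max_len characters starting here.
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                sets[start]["S"].add(word)  # single-character word
                continue
            sets[start]["B"].add(word)  # character begins the word
            sets[end - 1]["E"].add(word)  # character ends the word
            for mid in range(start + 1, end - 1):
                sets[mid]["M"].add(word)  # character is inside the word
    return sets


# Toy lexicon and sentence, both assumed for illustration only.
toy_lexicon = {"南京", "南京市", "市长", "长江", "长江大桥", "大桥", "江"}
example = soft_lexicon_sets("南京市长江大桥", toy_lexicon)
for char, groups in zip("南京市长江大桥", example):
    print(char, {tag: sorted(words) for tag, words in groups.items()})
```

In the toy sentence, the character 长 both ends one candidate word (市长) and begins two others (长江, 长江大桥). This kind of boundary ambiguity is what word-level sets are meant to expose to the model. In a SoftLexicon-style pipeline, each set would then be pooled into a fixed-size vector and concatenated with the character representation; that step is omitted here.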





Acknowledgements
The authors would like to thank the open courses on natural-language processing, which provided the basic technology for this article. We are grateful for the helpful comments from the journal editors and three anonymous reviewers.
Funding
National Natural Science Foundation of China (42101455).
Author information
Contributions
MZ, XL, ZZ, YQ, ZJ, and PZ contributed to conceptualization, methodology, software, formal analysis, investigation, and data curation; MZ was involved in writing (original draft preparation); XL, ZZ, and YQ contributed to writing (review and editing); ZJ was involved in visualization; ZZ, YQ, ZJ, and PZ contributed to supervision. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare.
About this article
Cite this article
Zhang, M., Liu, X., Zhang, Z. et al. CHTopoNER model-based method for recognizing Chinese place names from social media information. J Geogr Syst 26, 149–179 (2024). https://doi.org/10.1007/s10109-023-00433-w
DOI: https://doi.org/10.1007/s10109-023-00433-w
Keywords
- Named entity recognition
- Chinese place name recognition
- Deep learning
- Geographic information acquisition
- Disambiguation of place names