REBDT: A regular expression boundary-based decision tree model for Chinese logistics address segmentation

Applied Intelligence

Abstract

Chinese logistics address segmentation is a specific subtask of address resolution, and it is challenging because of factors such as language, culture, user privacy, and business value. Although deep learning can overcome the heavy dependence of traditional segmentation methods on domain knowledge, it requires costly manual labeling. In this context, a decision tree model based on regular expression boundaries is proposed, which requires neither additional data nor manual labeling. First, unlike traditional methods that describe entire address elements, a regular expression rule library (RERL) is constructed that describes only the boundaries of address elements. Second, a binary split attribute is defined according to a boundary matching algorithm based on RERL. A decision tree model is then constructed according to the distribution of address element types, and it is used to segment addresses and to evaluate the segmentation quality. The experimental results demonstrate the improvement achieved by our model and further indicate that our proposal can provide a high-quality labeled training set for deep learning models without any professional domain knowledge, even in low-resource scenarios.
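To make the boundary idea concrete, the following is a minimal Python sketch, not the authors' implementation. The rule list `BOUNDARY_RULES`, the helper names `has_boundary` and `segment`, and the fixed province-to-number ordering are all hypothetical, since the actual RERL and tree structure are not given in this preview. Each pattern describes only the trailing marker of an address element, and the yes/no test on whether a marker occurs stands in for the binary split attribute.

```python
import re

# Hypothetical boundary rules (assumption, not the paper's RERL): each
# pattern matches only the trailing marker of an address element.
BOUNDARY_RULES = [
    ("province", re.compile(r"省")),
    ("city", re.compile(r"市")),
    ("district", re.compile(r"[区县]")),
    ("road", re.compile(r"大道|路|街")),
    ("number", re.compile(r"号")),
]


def has_boundary(pattern, text):
    """Binary split attribute: True if the boundary marker occurs in text."""
    return pattern.search(text) is not None


def segment(address):
    """Test each boundary in a fixed order and cut after its first match."""
    segments = []
    rest = address
    for label, pattern in BOUNDARY_RULES:
        if not has_boundary(pattern, rest):
            continue  # "no" branch: this element type is treated as absent
        match = pattern.search(rest)
        if match.start() == 0:
            continue  # marker with no preceding text, so nothing to cut
        segments.append((label, rest[: match.end()]))
        rest = rest[match.end():]
    if rest:
        segments.append(("other", rest))  # trailing text with no boundary
    return segments


if __name__ == "__main__":
    for label, text in segment("湖北省武汉市洪山区珞喻路129号"):
        print(label, text)
```

Cutting at the first match is a deliberate simplification: a place name that itself contains a marker character would be split too early, which is the kind of ambiguity a tree built on the distribution of element types is meant to resolve.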

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2018YFB2100603). The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Author information

Corresponding author

Correspondence to Aiping Xu.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work is partially supported by grants from the National Key R&D Program of China (grant no. 2018YFB2100603).

About this article

Cite this article

Ling, G., Xu, A., Wang, C. et al. REBDT: A regular expression boundary-based decision tree model for Chinese logistics address segmentation. Appl Intell 53, 6856–6872 (2023). https://doi.org/10.1007/s10489-022-03511-6

  • DOI: https://doi.org/10.1007/s10489-022-03511-6
