Abstract
In the compliance domain of tax laws, a barrier to term extraction from documents written in natural language is getting a sizable training set of documents to train a well-grounded term-extraction model. To alleviate term-extraction silence, i.e. the outcome of the automated process missing legitimate term candidates, we extract terms from a string datatype that is written in a quasi-natural language. Domain software applications rely on structured content in XML documents for their processes. One type of XML data is the element; elements have names typically written in a variant of a natural language. Term extraction restores the element names to the detected natural language. These extracted expressions are either novel terms for our terminology or are flagged as synonymous expressions to existing terms. Increasing term coverage improves semantic parsing, query understanding and explanation generation. For a subset of XML documents of one tax-domain software application, we augment the existing terminology by 49%.
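The restoration step described in the abstract can be illustrated with a minimal sketch. It assumes that element names follow camel notation and that abbreviations are resolved through a lookup table. The table `ABBREVIATIONS`, the function names and the lowercasing are hypothetical choices made for illustration; they are not the author's implementation.

```python
import re

# Hypothetical abbreviation table; the actual mappings are not given in the text.
ABBREVIATIONS = {"amt": "amount"}

# A run of capitals not followed by a lowercase letter (e.g. "IRS"),
# a capitalized word, a lowercase word, or a run of digits.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def split_element_name(name):
    """Split a camel-notation element name into its tokens."""
    return _TOKEN.findall(name)


def restore(name):
    """Return a natural-language rendering of an element name."""
    tokens = [token.lower() for token in split_element_name(name)]
    # Expand any token found in the (assumed) abbreviation table.
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


if __name__ == "__main__":
    print(split_element_name("ParkingFeesAndTollsAmt"))
    print(restore("ParkingFeesAndTollsAmt"))
```

The restored expression could then be looked up in the existing terminology to decide whether it is a novel term or a candidate synonym; that comparison is left out of the sketch.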
Notes
- 1.
The corpus referenced here is the collection of tax forms and related instructions published in 2019 by the United States Internal Revenue Service (IRS) and the Canada Revenue Agency (CRA).
- 2.
The spelling-convention indicators are clearly detectable. In camel notation, an uppercase letter signals a word boundary and so marks off the individual tokens. For instance, ParkingFeesAndTollsAmt has four tokens (Parking, Fees, And, Tolls) and one abbreviation (Amt).
- 3.
Single DMEs are embedded in XML paths that provide local context, much as natural-language tokens do in a phrase or utterance. For instance, in IRS1040/WagesSalariesAndTipsWorksheet/TotalWagesSalariesAndTips/TotalOtherNonEarnedIncome, the single DMEs are separated by forward slashes.
- 4.
XML has been anonymized and simplified.
- 5.
On 500 individual tax forms, content structuring from PDFs reaches 95% extraction accuracy.
- 6.
The notion of smallest is conveyed by the relative clause at the end of the utterance, "whichever is less".
- 7.
We use underscores to bind together single tokens in a multi-token term.
- 8.
Publicly available tax terminologies are relatively small (fewer than a thousand entries). Typically, they are published by government agencies, private outfits and international organizations such as the Organisation for Economic Co-operation and Development.
- 9.
Making this type of correction closer to XML generation would amount to costly corrective backtracking.
- 10.
In effect, it is a label for descriptive educational categories.
- 11.
The modifiers are enclosed in parentheses.
- 12.
XPaths never use blank space to separate functional or semantic units.
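Notes 3, 7 and 12 describe how single DMEs sit in slash-separated paths and how tokens are bound with underscores. A minimal sketch of that handling follows, assuming that each path step is one DME, that a whole DME is treated as one multi-token term, and that tokens are lowercased. The function names and these choices are hypothetical, not the author's implementation.

```python
import re

# Same camel-notation tokenization assumption as in the notes:
# capital runs, capitalized words, lowercase words, digit runs.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+")


def split_path(path):
    """Split a slash-separated XML path into single DMEs."""
    return [step for step in path.split("/") if step]


def bind(dme):
    """Lowercase the tokens of one DME and bind them with underscores."""
    return "_".join(token.lower() for token in _TOKEN.findall(dme))


def dme_with_context(path):
    """Pair each bound DME with the bound DMEs that precede it in the path."""
    bound = [bind(step) for step in split_path(path)]
    return [(term, bound[:i]) for i, term in enumerate(bound)]


if __name__ == "__main__":
    path = ("IRS1040/WagesSalariesAndTipsWorksheet/"
            "TotalWagesSalariesAndTips/TotalOtherNonEarnedIncome")
    for term, context in dme_with_context(path):
        print(term, "<-", context)
```

Keeping the preceding path steps next to each term preserves the local context that note 3 compares to the tokens surrounding a word in a phrase.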
Acknowledgments
We thank R. Meike, C. de Peuter and three anonymous reviewers for helpful insight and comments.
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Manandise, E. (2022). Extracting Domain Terms from Data Model Elements. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_24
DOI: https://doi.org/10.1007/978-3-031-08473-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08472-0
Online ISBN: 978-3-031-08473-7
eBook Packages: Computer Science, Computer Science (R0)