
Extracting Domain Terms from Data Model Elements: Task Overview in One Implementation

  • Conference paper
  • In: Natural Language Processing and Information Systems (NLDB 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13286)


Abstract

In the compliance domain of tax laws, a barrier to term extraction from documents written in natural language is obtaining a sizable training set of documents with which to train a well-grounded term-extraction model. To alleviate term-extraction silence, i.e., the automated process missing legitimate term candidates, we extract terms from a string datatype that is written in a quasi-natural language. Domain software applications rely on structured content in XML documents for their processes. One type of XML data is the element; elements have names typically written in a variant of a natural language. Term extraction restores the element names to the detected natural language. The extracted expressions are either novel terms for our terminology or are flagged as synonymous with existing terms. Increasing term coverage improves semantic parsing, query understanding, and explanation generation. For a subset of the XML documents of one tax-domain software application, we augment the existing terminology by 49%.
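
The paper's implementation is not included on this page. As a rough illustration of the restoration step the abstract describes, the following is a minimal Python sketch, assuming a simple camel-case splitting heuristic and a small, hypothetical abbreviation table (e.g., Amt expanding to amount); it is not the author's code.

    import re

    # Hypothetical abbreviation table; the paper's actual expansion list is not given here.
    ABBREVIATIONS = {"Amt": "amount", "Pct": "percent", "Qty": "quantity"}

    def dme_to_phrase(element_name: str) -> str:
        """Restore a camel-case data model element (DME) name to a natural-language phrase."""
        # One token is an uppercase run, a capitalized word, a lowercase run, or a digit run.
        tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", element_name)
        return " ".join(ABBREVIATIONS.get(t, t).lower() for t in tokens)

    print(dme_to_phrase("ParkingFeesAndTollsAmt"))  # -> parking fees and tolls amount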


Notes

  1. The corpus referenced here is the collection of tax forms and related instructions published in 2019 by the United States Internal Revenue Service (IRS) and the Canada Revenue Agency (CRA).

  2. The spelling-convention indicators are clearly detectable. In camel-case notation, an uppercase letter signals a word boundary inside an individual single-token term. For instance, ParkingFeesAndTollsAmt comprises four word tokens and one abbreviation.

  3. Single DMEs are embedded in XML paths that provide local context, much as natural-language tokens do in a phrase or utterance. For instance, IRS1040/WagesSalariesAndTipsWorksheet/TotalWagesSalariesAndTips/TotalOtherNonEarnedIncome. The single DMEs are separated by forward slashes (see the sketch after these notes).

  4. The XML has been anonymized and simplified.

  5. On 500 individual tax forms, content structuring from PDFs achieves 95% extraction accuracy.

  6. The notion of smallest is conveyed by the relative clause "whichever is less" at the end of the utterance.

  7. We use underscores to bind together the single tokens of a multi-token term.

  8. Publicly available tax terminologies are relatively small (fewer than a thousand entries). Typically, they are published by government agencies, private outfits, and international organizations such as the Organisation for Economic Co-operation and Development.

  9. Making this type of correction closer to XML generation would amount to costly corrective backtracking.

  10. In effect, it is a label for descriptive educational categories.

  11. The modifiers are enclosed in parentheses.

  12. XPaths never use blank space to separate functional or semantic units.
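
Footnotes 3, 7, and 12 together describe how an XML path is decomposed: forward slashes, never blank space, separate the single DMEs, and underscores bind the tokens of a multi-token term. The sketch below illustrates that decomposition under the same camel-case splitting assumption as the earlier sketch; the path is the anonymized example from footnote 3, and the function names are placeholders, not the paper's.

    import re

    def split_camel(name: str) -> list[str]:
        # Same camel-case split as in the earlier sketch.
        return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)

    def xpath_to_terms(xpath: str) -> list[str]:
        """Split an XML path into single DMEs (forward slashes separate them;
        blank space never does) and bind each multi-token term with underscores."""
        terms = []
        for dme in xpath.strip("/").split("/"):
            tokens = [t.lower() for t in split_camel(dme)]
            terms.append("_".join(tokens))  # underscores bind multi-token terms
        return terms

    path = ("IRS1040/WagesSalariesAndTipsWorksheet/"
            "TotalWagesSalariesAndTips/TotalOtherNonEarnedIncome")
    print(xpath_to_terms(path))
    # -> ['irs_1040', 'wages_salaries_and_tips_worksheet',
    #     'total_wages_salaries_and_tips', 'total_other_non_earned_income']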


Acknowledgments

We thank R. Meike, C. de Peuter and three anonymous reviewers for helpful insight and comments.

Author information

Corresponding author

Correspondence to Esmé Manandise.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Manandise, E. (2022). Extracting Domain Terms from Data Model Elements. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_24


  • DOI: https://doi.org/10.1007/978-3-031-08473-7_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08472-0

  • Online ISBN: 978-3-031-08473-7

