
Extracting Domain Terms from Data Model Elements: Task Overview in One Implementation

  • Conference paper
  • In: Natural Language Processing and Information Systems (NLDB 2022)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13286)


Abstract

In the compliance domain of tax laws, a barrier to term extraction from documents written in natural language is obtaining a sizable training set of documents with which to train a well-grounded term-extraction model. To alleviate term-extraction silence, i.e., the automated process missing legitimate term candidates, we extract terms from a string datatype that is written in a quasi-natural language. Domain software applications rely on structured content in XML documents for their processes. One type of XML data is the element; elements have names typically written in a variant of a natural language. Term extraction restores the element names to the detected natural language. The extracted expressions are either novel terms for our terminology or are flagged as synonymous with existing terms. Increasing term coverage improves semantic parsing, query understanding, and explanation generation. For a subset of the XML documents of one tax-domain software application, we augment the existing terminology by 49%.
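
The paper's implementation is not included on this page. As a rough illustration of the restoration step the abstract describes, the following is a minimal Python sketch, assuming a simple camel-case splitting heuristic and a small, hypothetical abbreviation table (e.g., Amt expanding to amount); it is not the author's code.

    import re

    # Hypothetical abbreviation table; the paper's actual expansion list is not given here.
    ABBREVIATIONS = {"Amt": "amount", "Pct": "percent", "Qty": "quantity"}

    def dme_to_phrase(element_name: str) -> str:
        """Restore a camel-case data model element (DME) name to a natural-language phrase."""
        # One token is an uppercase run, a capitalized word, a lowercase run, or a digit run.
        tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", element_name)
        return " ".join(ABBREVIATIONS.get(t, t).lower() for t in tokens)

    print(dme_to_phrase("ParkingFeesAndTollsAmt"))  # -> parking fees and tolls amount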


Notes

  1. The corpus referenced here is the collection of tax forms and related instructions published in 2019 by the United States Internal Revenue Service (IRS) and the Canada Revenue Agency (CRA).

  2. The spelling-convention indicators are clearly detectable. In camel-case notation, an uppercase letter signals a word boundary inside an individual single-token term. For instance, ParkingFeesAndTollsAmt comprises four word tokens and one abbreviation.

  3. Single DMEs are embedded in XML paths that provide local context, much as natural-language tokens do in a phrase or utterance. For instance, IRS1040/WagesSalariesAndTipsWorksheet/TotalWagesSalariesAndTips/TotalOtherNonEarnedIncome. The single DMEs are separated by forward slashes (see the sketch after these notes).

  4. The XML has been anonymized and simplified.

  5. On 500 individual tax forms, content structuring from PDFs achieves 95% extraction accuracy.

  6. The notion of smallest is conveyed by the relative clause "whichever is less" at the end of the utterance.

  7. We use underscores to bind together the single tokens of a multi-token term.

  8. Publicly available tax terminologies are relatively small (fewer than a thousand entries). Typically, they are published by government agencies, private outfits, and international organizations such as the Organisation for Economic Co-operation and Development.

  9. Making this type of correction closer to XML generation would amount to costly corrective backtracking.

  10. In effect, it is a label for descriptive educational categories.

  11. The modifiers are enclosed in parentheses.

  12. XPaths never use blank space to separate functional or semantic units.
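
Footnotes 3, 7, and 12 together describe how an XML path is decomposed: forward slashes, never blank space, separate the single DMEs, and underscores bind the tokens of a multi-token term. The sketch below illustrates that decomposition under the same camel-case splitting assumption as the earlier sketch; the path is the anonymized example from footnote 3, and the function names are placeholders, not the paper's.

    import re

    def split_camel(name: str) -> list[str]:
        # Same camel-case split as in the earlier sketch.
        return re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", name)

    def xpath_to_terms(xpath: str) -> list[str]:
        """Split an XML path into single DMEs (forward slashes separate them;
        blank space never does) and bind each multi-token term with underscores."""
        terms = []
        for dme in xpath.strip("/").split("/"):
            tokens = [t.lower() for t in split_camel(dme)]
            terms.append("_".join(tokens))  # underscores bind multi-token terms
        return terms

    path = ("IRS1040/WagesSalariesAndTipsWorksheet/"
            "TotalWagesSalariesAndTips/TotalOtherNonEarnedIncome")
    print(xpath_to_terms(path))
    # -> ['irs_1040', 'wages_salaries_and_tips_worksheet',
    #     'total_wages_salaries_and_tips', 'total_other_non_earned_income']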


Acknowledgments

We thank R. Meike, C. de Peuter and three anonymous reviewers for helpful insight and comments.

Author information

Corresponding author

Correspondence to Esmé Manandise.


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Manandise, E. (2022). Extracting Domain Terms from Data Model Elements. In: Rosso, P., Basile, V., Martínez, R., Métais, E., Meziane, F. (eds) Natural Language Processing and Information Systems. NLDB 2022. Lecture Notes in Computer Science, vol 13286. Springer, Cham. https://doi.org/10.1007/978-3-031-08473-7_24


  • DOI: https://doi.org/10.1007/978-3-031-08473-7_24


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-08472-0

  • Online ISBN: 978-3-031-08473-7

