DOI: 10.1145/3587828.3587878
Research article

Morfawk.ja: A Japanese Token Level Pattern Matching and Processing Language with Dependency Analysis

Published: 20 June 2023

Abstract

The quality of development documents written in natural language is ensured mainly through developer reviews, which consume many man-hours. Natural language processing (NLP) tools can partially automate this time-consuming review. For this purpose, the authors have developed morfgrep and morfawk, morpheme-level pattern matching and processing tools for Japanese text. This paper extends them to perform pattern matching on dependencies among phrases. Example applications of the tools, namely proofreading of development documents and domain-specific term extraction, are also presented.
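
As a rough illustration of what morpheme-level pattern matching means, the sketch below slides a sequence of per-token predicates over a list of already-tokenized morphemes. It is not morfgrep or morfawk syntax, which the abstract does not show: the `Morpheme` type, the `pos` and `surface` helpers, the `find_matches` function, the simplified part-of-speech tags, and the hand-tokenized sample phrase are all assumptions made for this example. Dependency matching among phrases is not covered here.

```python
from typing import Callable, List, NamedTuple, Tuple


class Morpheme(NamedTuple):
    """One token from a morphological analyzer (hypothetical, simplified)."""
    surface: str  # the text as written
    pos: str      # simplified part-of-speech tag (assumed tag set)


Predicate = Callable[[Morpheme], bool]


def pos(tag: str) -> Predicate:
    """Predicate that accepts a morpheme with the given part of speech."""
    return lambda m: m.pos == tag


def surface(text: str) -> Predicate:
    """Predicate that accepts a morpheme with the given surface form."""
    return lambda m: m.surface == text


def find_matches(tokens: List[Morpheme],
                 pattern: List[Predicate]) -> List[Tuple[int, int]]:
    """Return (start, end) token spans where every predicate matches in order."""
    if not pattern:
        return []
    hits = []
    width = len(pattern)
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(pred(m) for pred, m in zip(pattern, window)):
            hits.append((start, start + width))
    return hits


# Hand-tokenized sample phrase; a real pipeline would obtain these
# morphemes from a morphological analyzer instead.
tokens = [
    Morpheme("設計", "noun"),
    Morpheme("の", "particle"),
    Morpheme("変更", "noun"),
    Morpheme("の", "particle"),
    Morpheme("確認", "noun"),
]

# Pattern: noun, the particle "の", noun.
noun_no_noun = [pos("noun"), surface("の"), pos("noun")]
matches = find_matches(tokens, noun_no_noun)

for start, end in matches:
    print("".join(m.surface for m in tokens[start:end]))
```

Matching on token attributes rather than raw characters is what lets a pattern such as "noun, の, noun" flag chained modifiers during proofreading, independent of which nouns appear.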


Published In

ICSCA '23: Proceedings of the 2023 12th International Conference on Software and Computer Applications
February 2023
385 pages
ISBN: 9781450398589
DOI: 10.1145/3587828

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers

Research article, refereed limited