Extract of Japanese Text Characteristics of Simplified Corpora using Non-negative Matrix Factorization
(pp075-098)
Koji Wajima, Kei Koqure, Toshihiro Furukawa, and Tetsuji Satoh
doi: https://doi.org/10.26421/JDI1.1-5
Abstract:
The ways of disseminating information through different media
(Verbreitungsmedien) have changed rapidly owing to technological
progress, especially in information and communication technologies.
Reflecting these changes, communication methods and abilities have
also changed. On the Internet, documents with almost the same content
but different levels of difficulty of expression are mixed together.
A user who intends to search for new
or unfamiliar things may become confused and spend a lot of
time selecting content that they can understand,
because there are large amounts of similar content at different
levels of difficulty. Hence, the characteristics of relevant simplified
corpora are critical for everybody. In this research, we propose a
method that compares two types of documents of different difficulty
and selects, from the various characteristics of a text, those
related to simplicity of expression. In our proposed method,
thousands of text characteristics are compressed and converted by
Non-negative Matrix Factorization (NMF),
and a basis that characterizes the simplified document is selected.
The proposed method combines the characteristics used in most
prior research, covering 32 types of characteristics and 2,196
dimensions. We evaluated the text characteristics in the resulting
NMF bases using a classifier. Applying the proposed method to two
kinds of environmental white papers showed that an effective basis
can be selected. Additionally, we present an estimate of the causal
relationships and an optimization of the parameter. Furthermore, we
show that the method is flexible enough to apply to other media.
Key words: NMF, LSI, LDA, Bayesian Network,
Semantik, SDGs, Verbreitungsmedien
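
As a rough illustration of the NMF step summarized in the abstract, the following Python sketch factorizes a small document-by-feature matrix and picks the basis whose activation differs most between simplified and ordinary documents. It is not the authors' implementation. The toy data, the function names nmf and select_basis, the rank of 3, and the midpoint-threshold rule standing in for the classifier are all assumptions made for this example; the abstract does not specify the update rule, so standard multiplicative updates are used here.

```python
import numpy as np


def nmf(V, rank, n_iter=500, seed=0, eps=1e-9):
    """Approximate non-negative V (docs x features) as W @ H.

    Uses Lee-Seung multiplicative updates for the squared-error loss,
    which keep W and H non-negative at every step.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], rank)) + eps
    H = rng.random((rank, V.shape[1])) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H


def select_basis(W, is_simple):
    """Pick the basis whose mean activation differs most between groups."""
    gap = W[is_simple].mean(axis=0) - W[~is_simple].mean(axis=0)
    return int(np.argmax(np.abs(gap))), gap


# Toy data (hypothetical): 40 documents x 12 features; the first 20
# "simplified" documents have raised values on features 0-2.
rng = np.random.default_rng(42)
is_simple = np.arange(40) < 20
V = rng.random((40, 12))
V[:20, :3] += 3.0

W, H = nmf(V, rank=3)
best, gap = select_basis(W, is_simple)

# Stand-in for the classifier (an assumption): threshold the selected
# basis at the midpoint between the two group means.
score = W[:, best] * np.sign(gap[best])
threshold = (score[is_simple].mean() + score[~is_simple].mean()) / 2
accuracy = float(((score > threshold) == is_simple).mean())
rel_error = float(np.linalg.norm(V - W @ H) / np.linalg.norm(V))
print(f"basis={best} accuracy={accuracy:.2f} rel_error={rel_error:.3f}")
```

In this sketch the row H[best] indicates which original features load on the selected basis, which is one plausible way to read off the characteristics associated with simple expression. A real study would replace the threshold rule with a proper classifier and held-out evaluation.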