Rinton Press - Publisher in Science and Technology


Journal of Data Intelligence ISSN: 2577-610X published since 2020

Vol.1 No.1 March, 2020

Extract of Japanese Text Characteristics of Simplified Corpora using Non-negative Matrix Factorization (pp075-098)
Koji Wajima, Kei Koqure, Toshihiro Furukawa, and Tetsuji Satoh
doi: https://doi.org/10.26421/JDI1.1-5
Abstract: The ways of disseminating information through different media (Verbreitungsmedien) have changed rapidly owing to technological progress, especially in the field of information and communication technologies. Reflecting these changes, communication methods and abilities have also changed. On the Internet, contents that cover almost the same subject matter but are expressed at different levels of difficulty are mixed together. A user who searches for new or unknown things may become confused and spend a lot of time selecting contents they can understand, because there are large amounts of similar contents of differing difficulty. The characteristics of relevant simplified corpora are therefore important for everybody. In this research, we propose a method that compares two types of documents of different difficulty and selects, from various text characteristics, those related to simplicity of expression. In the proposed method, thousands of text characteristics are compressed and converted by Non-negative Matrix Factorization (NMF), and a basis that characterizes the simplified documents is selected. The proposed method combines the characteristics used in most prior research, comprising 32 types and 2,196 dimensions. We evaluated the text characteristics in the resulting NMF bases using a classifier. Applying the proposed method to two kinds of environment white papers showed that an effective basis can be selected. Additionally, we show the estimation of causal relationships and the optimization of the parameters. Furthermore, we show the method's flexibility with respect to other media.
Key words: NMF, LSI, LDA, Bayesian Network, Semantik, SDGs, Verbreitungsmedien
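
The compress-then-select idea in the abstract can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the multiplicative-update NMF, the toy 6 x 4 matrix, the labels, and the helper names (`nmf`, `select_basis`, `frobenius_error`) are all assumptions made for illustration, and the "largest gap in mean activation" rule is a simple stand-in for the classifier-based evaluation the abstract mentions.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Factor non-negative V (docs x features) into W (docs x rank) and
    H (rank x features) using multiplicative updates."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

def select_basis(w, labels):
    """Pick the basis whose mean activation differs most between
    simplified (label 1) and original (label 0) documents."""
    def mean(k, lab):
        vals = [row[k] for row, l in zip(w, labels) if l == lab]
        return sum(vals) / len(vals)
    gaps = [abs(mean(k, 1) - mean(k, 0)) for k in range(len(w[0]))]
    best = max(range(len(gaps)), key=gaps.__getitem__)
    return best, gaps

# Hypothetical toy data: rows are documents, columns are text characteristics.
v = [[5, 4, 0, 1], [4, 5, 1, 0], [5, 5, 0, 0],
     [0, 1, 5, 4], [1, 0, 4, 5], [0, 0, 5, 5]]
labels = [1, 1, 1, 0, 0, 0]  # 1 = simplified, 0 = original (assumed)

w0, h0 = nmf(v, rank=2, iters=0)   # random initial factors
w, h = nmf(v, rank=2)              # factors after the updates
best, gaps = select_basis(w, labels)
print("selected basis:", best)
print("feature weights of that basis:", h[best])
```

In this sketch the rows of `h` play the role of the compressed bases, and the feature weights of the selected row indicate which characteristics it groups together. A real pipeline would more likely rely on an established NMF implementation and a trained classifier rather than this hand-rolled version.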