ISSN: 2577-610X



 

Journal of Data Intelligence  ISSN: 2577-610X      published since 2020
Vol.1 No.1  March, 2020 

Extract of Japanese Text Characteristics of Simplified Corpora using Non-negative Matrix Factorization (pp075-098)
         Koji Wajima, Kei Koqure, Toshihiro Furukawa, and Tetsuji Satoh

        
doi: https://doi.org/10.26421/JDI1.1-5
Abstract:  Ways of disseminating information through different media (Verbreitungsmedien) have changed rapidly owing to technological progress, especially in the field of information and communication technologies. Reflecting these changes, communication methods and abilities have also changed. On the Internet, documents that cover almost the same content but are written at different levels of difficulty are mixed together. A user who searches for new or unfamiliar things may become confused and spend a lot of time selecting content that they can understand, because there are large amounts of similar content at different levels of difficulty. The characteristics of relevant simplified corpora are therefore critical. In this research, we propose a method that compares two types of documents with different difficulty and selects, from various text characteristics, those related to simplicity of expression. In the proposed method, thousands of text characteristics are compressed and converted by Non-negative Matrix Factorization (NMF), and a basis that characterizes the simplified document is selected. The proposed method combines the characteristics used in most previous research, covering 32 types of characteristics with 2,196 dimensions. We evaluated the text characteristics in the resulting NMF bases using a classifier. Applying the proposed method to two kinds of environment white papers showed that an effective basis can be selected. Additionally, we showed the estimation of causal relationships and the optimization of the parameters. Furthermore, we showed that the method is flexible enough to apply to other media.
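The pipeline outlined in the abstract (compress many text characteristics with NMF, then pick a basis that separates simplified from original documents) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy data, the function names nmf and select_basis, the use of Lee-Seung multiplicative updates, and the selection rule based on the gap in mean activation between the two document groups. The paper evaluates bases with a classifier; the gap score here is only a simple stand-in.

```python
import numpy as np


def nmf(X, k, n_iter=200, seed=0, eps=1e-9):
    """Factorize a non-negative matrix X (documents x characteristics)
    as W @ H, using multiplicative updates for the Frobenius objective."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k)) + eps  # document-by-basis activations
    H = rng.random((k, m)) + eps  # basis-by-characteristic weights
    for _ in range(n_iter):
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


def select_basis(W, labels):
    """Hypothetical selection rule: pick the basis whose mean activation
    differs most between simplified (1) and original (0) documents."""
    labels = np.asarray(labels)
    gap = np.abs(W[labels == 1].mean(axis=0) - W[labels == 0].mean(axis=0))
    return int(np.argmax(gap)), gap


# Toy data (assumption): 40 documents, 12 characteristics.
# The "simplified" group over-expresses the first three characteristics.
rng = np.random.default_rng(1)
X = rng.random((40, 12)) * 0.1
labels = np.array([0] * 20 + [1] * 20)
X[labels == 1, :3] += 1.0

W, H = nmf(X, k=3)
best, gap = select_basis(W, labels)
print("selected basis:", best)
print("characteristic weights of that basis:", np.round(H[best], 3))
```

In a real setting the rows of X would hold the 2,196-dimensional characteristic vectors of the documents, and the selected basis would be checked with a classifier rather than with a mean-gap score.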
Key words:  NMF, LSI, LDA, Bayesian Network, Semantik, SDGs, Verbreitungsmedien