Development of a patent document classification and search platform using a back-propagation network

https://doi.org/10.1016/j.eswa.2006.01.013

Abstract

In order to process large numbers of explicit knowledge documents such as patents in an organized manner, automatic document categorization and search are required. In this paper, we develop a document classification and search methodology based on neural network technology that helps companies manage patent documents more effectively. The classification process begins by extracting key phrases from the document set by means of automatic text processing and determining the significance of key phrases according to their frequency in text. In order to maintain a manageable number of independent key phrases, correlation analysis is applied to compute the similarities between key phrases. Phrases with higher correlations are synthesized into a smaller set of phrases. Finally, the back-propagation network model is adopted as a classifier. The target output identifies a patent document’s category based on a hierarchical classification scheme, in this case, the international patent classification (IPC) standard. The methodology is tested using patents related to the design of power hand-tools. Related patents are automatically classified using pre-trained neural network models. In the prototype system, two modules are used for patent document management. The automatic classification module helps the user classify patent documents and the search module helps users find relevant and related patent documents. The result shows an improvement in document classification and identification over previously published methods of patent document management.

Introduction

With the rapid development of information technology, the number of electronic documents and the digital content of documents exceed the capacity of manual control and management. People are increasingly required to handle wide ranges of information from multiple sources. As a result, knowledge management systems are implemented by enterprises and organizations to manage their information and knowledge more effectively. Knowledge management includes sorting useful knowledge from information, storing knowledge in good order, and finding knowledge in an existing knowledge base (Turban & Aronson, 2001). In this research, we focus on explicit knowledge management, i.e., management of well-structured documents such as patent documents (called “patents” in brief). Patents provide exclusive rights and legal protection for patent inventors. In addition, patents play an important role in the advancement and diffusion of technology. The objective of this research is to develop an effective methodology to automatically classify and identify patent documents. Furthermore, a prototype system is implemented and tested using hand tools patents sourced from the World Intellectual Property Organization (WIPO) database for scenario demonstration.

There have been many research efforts devoted to automatic document classification. Some of the classification methodologies are difficult to implement, and others are neither efficient nor effective, requiring developers of knowledge management systems to expend considerable resources testing and evaluating algorithms. The purpose of this research is to develop a document classification method based on neural networks and benchmark the performance against published standards. Through the implementation of a document classification module and a document search module, a prototype patent document management system is created. The patent document management system automates the classification of patent documents and improves the search for documents.

The automatic document classification methodology consists of the following steps. First, significant terms are extracted from patent documents and used to build a key phrase database. Second, the similarities between phrases are computed and recorded in a correlation matrix, which is used to synthesize phrases into a smaller set representing key concepts within the patent domain. After key phrase extraction and synthesis, the consolidated set of key phrases is treated as the input to the back-propagation network model. The neural network model is trained using the key phrases and their frequencies in the sample documents. Training and assessment continue until the model reaches a satisfactory level of accuracy. The final step is to use the trained model for automated patent document classification and search.
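The first two steps above (key phrase frequencies, then correlation-based synthesis) can be illustrated with a minimal sketch. It is not the authors' implementation: the use of Pearson correlation, the greedy grouping rule, the 0.8 threshold, the toy documents, and all function names are assumptions chosen for illustration, and plain substring counting stands in for real text processing.

```python
import math

def phrase_frequencies(doc_text, phrases):
    """Count occurrences of each key phrase in one document (crude substring match)."""
    text = doc_text.lower()
    return [text.count(p.lower()) for p in phrases]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def correlation_matrix(freq_rows):
    """freq_rows[d][p] is the frequency of phrase p in document d."""
    columns = list(zip(*freq_rows))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]

def merge_phrases(phrases, corr, threshold=0.8):
    """Greedy grouping (an assumption): a phrase joins the first group whose
    first member correlates with it at or above the threshold."""
    groups = []
    for i in range(len(phrases)):
        for group in groups:
            if corr[group[0]][i] >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return [[phrases[i] for i in group] for group in groups]

# Toy data, invented for this example only.
docs = [
    "The cordless drill has a chuck. The chuck holds the drill bit.",
    "A drill with a keyless chuck and a battery pack.",
    "The saw blade guard covers the saw blade.",
    "A circular saw with a blade guard and battery.",
]
phrases = ["drill", "chuck", "saw", "blade guard"]

freq_rows = [phrase_frequencies(d, phrases) for d in docs]
corr = correlation_matrix(freq_rows)
groups = merge_phrases(phrases, corr)
print(freq_rows)
print(groups)
```

Each resulting group would then be represented by a single input dimension, for example the summed frequency of its member phrases, which keeps the network input small.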

Literature survey

This section reviews the relevant topics including knowledge and e-document management, document categorization, clustering methodologies and related patent analysis research.

System architecture and methodology

This section describes the methodologies for document classification and document search in detail. First, a document content extraction model is built to represent the document content as a vector of key phrase frequencies. Second, a document classification model based on the back-propagation network (BPN) approach is developed. Finally, a document search model is implemented using a trained back-propagation network.
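To make the classification model concrete, the following is a minimal sketch of a back-propagation network that maps key phrase frequency vectors to categories. The layer sizes, sigmoid activations, squared-error loss, learning rate, epoch count, and toy data are all assumptions for illustration; the excerpt does not state the network configuration used in the paper, and the integer labels stand in for real IPC categories.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class BPN:
    """One-hidden-layer network trained by back-propagation (illustrative only)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra trailing weight for the bias input.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w_hidden]
        hb = hidden + [1.0]
        output = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w_out]
        return hidden, output

    def train_step(self, x, target, lr):
        hidden, output = self.forward(x)
        xb = list(x) + [1.0]
        hb = hidden + [1.0]
        # Error signal at the output layer (squared error, sigmoid units).
        delta_out = [(o - t) * o * (1 - o) for o, t in zip(output, target)]
        # Propagate the error back to the hidden layer using the current weights.
        delta_hidden = [
            h * (1 - h) * sum(delta_out[k] * self.w_out[k][j]
                              for k in range(len(delta_out)))
            for j, h in enumerate(hidden)
        ]
        for k, row in enumerate(self.w_out):
            for j in range(len(row)):
                row[j] -= lr * delta_out[k] * hb[j]
        for j, row in enumerate(self.w_hidden):
            for i in range(len(row)):
                row[i] -= lr * delta_hidden[j] * xb[i]
        return 0.5 * sum((o - t) ** 2 for o, t in zip(output, target))

    def predict(self, x):
        _, output = self.forward(x)
        return max(range(len(output)), key=output.__getitem__)

def train(net, samples, labels, n_classes, epochs=2000, lr=0.5):
    """Return the mean per-sample error for each epoch."""
    history = []
    for _ in range(epochs):
        total = 0.0
        for x, y in zip(samples, labels):
            target = [1.0 if k == y else 0.0 for k in range(n_classes)]
            total += net.train_step(x, target, lr)
        history.append(total / len(samples))
    return history

# Toy key phrase frequency vectors and category indices (hypothetical).
samples = [[2, 2, 0, 0], [1, 1, 0, 0], [0, 0, 2, 1], [0, 0, 1, 1]]
labels = [0, 0, 1, 1]

net = BPN(n_in=4, n_hidden=3, n_out=2)
history = train(net, samples, labels, n_classes=2)
predictions = [net.predict(x) for x in samples]
print(predictions)
```

In practice the model would be evaluated on held-out documents rather than on its own training set; the toy example only shows the training loop and the shape of the inputs and outputs.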

Function modules of the system

The prototype system includes the System Parameters Management Module, the Automatic Categorization Module, and the Document Search Module. The system parameters management module provides the interface to adjust keyword correlation values and neural network weights. The automatic categorization module contains the functions of document upload, content extraction and document categorization. Finally, the document search module provides users an interface to search and download documents.
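The excerpt does not say how the search module uses the trained network, so the sketch below shows only one plausible arrangement, stated as an assumption: the query is classified, candidates are restricted to documents in the same predicted category, and the candidates are ranked by cosine similarity of their key phrase vectors. The `classify` callable, the stub rule used in place of a trained network, and the document identifiers are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)

def search(query_vec, corpus, classify, top_k=3):
    """corpus is a list of (doc_id, vector); classify maps a vector to a category."""
    category = classify(query_vec)
    candidates = [(doc_id, vec) for doc_id, vec in corpus
                  if classify(vec) == category]
    ranked = sorted(candidates,
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]

def stub_classify(vec):
    """Stand-in for a trained classifier (hypothetical rule for the toy data)."""
    return 0 if vec[0] + vec[1] >= vec[2] + vec[3] else 1

corpus = [
    ("P1", [2, 2, 0, 0]),
    ("P2", [1, 1, 0, 0]),
    ("P3", [0, 0, 2, 1]),
    ("P4", [0, 0, 1, 1]),
]

results = search([0, 0, 1, 1], corpus, stub_classify)
print(results)
```

Filtering by predicted category first keeps the similarity ranking within one branch of the classification scheme, at the cost of missing relevant documents when the classifier is wrong.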

Conclusion

The back-propagation network (BPN) algorithm offers two advantages: the ability to solve non-linear problems and the ability to learn by example. Its application has limitations, however: inadequate training data may yield an unreliable model, and the training procedure may require significant computing resources. The first limitation can be addressed by compiling a wide range of examples. Since a well-trained model can help companies better manage documents, the cost of computing resources can be

Acknowledgement

This research is partially funded by research grants from the Taiwan Ministry of Economic Affairs and the National Science Council.

References (35)

  • Chiang, J., & Chen, Y. (2001). Hierarchical fuzzy-kNN networks for news documents categorization. In Proceedings of the...
  • Deng, W., & Wu, W. (2001). Document categorization and retrieval using semantic microfeatures and growing cell...
  • Farkas, J. (1994). Generating document clusters using thesauri and neural networks. In Proceedings of the 1994 Canadian...
  • Grossman, D., et al. (1997). Integrating structured data and text: A relational approach. Journal of the American Society for Information Science.
  • Holland, J. H. (1992). Genetic algorithms. Scientific American.
  • Hou, J. L., et al. (2003). A document content extraction model using keyword correlation analysis. International Journal of Electronic Business Management.
  • Hsu, F. C., et al. (2004). Develop a multi-channel legal knowledge service center with knowledge mining capability. International Journal of Electronic Business Management.