Development of a patent document classification and search platform using a back-propagation network

doi:10.1016/j.eswa.2006.01.013

Expert Systems with Applications

Volume 31, Issue 4, November 2006, Pages 755-765

Abstract

To process large numbers of explicit knowledge documents such as patents in an organized manner, automatic document categorization and search are required. In this paper, we develop a document classification and search methodology based on neural network technology that helps companies manage patent documents more effectively. The classification process begins by extracting key phrases from the document set by means of automatic text processing and determining the significance of key phrases according to their frequency in the text. To maintain a manageable number of independent key phrases, correlation analysis is applied to compute the similarities between key phrases. Phrases with higher correlations are synthesized into a smaller set of phrases. Finally, the back-propagation network model is adopted as a classifier. The target output identifies a patent document’s category based on a hierarchical classification scheme, in this case, the international patent classification (IPC) standard. The methodology is tested using patents related to the design of power hand-tools. Related patents are automatically classified using pre-trained neural network models. In the prototype system, two modules are used for patent document management. The automatic classification module helps the user classify patent documents and the search module helps users find relevant and related patent documents. The results show an improvement in document classification and identification over previously published methods of patent document management.

Introduction

With the rapid development of information technology, the volume of electronic documents and their digital content now exceeds what manual control and management can handle. People are increasingly required to handle wide ranges of information from multiple sources. As a result, enterprises and organizations implement knowledge management systems to manage their information and knowledge more effectively. Knowledge management includes sorting useful knowledge from information, storing knowledge in good order, and finding knowledge in an existing knowledge base (Turban & Aronson, 2001). In this research, we focus on explicit knowledge management, i.e., management of well-structured documents such as patent documents (called “patents” in brief). Patents provide exclusive rights and legal protection for patent inventors. In addition, patents play an important role in the advancement and diffusion of technology. The objective of this research is to develop an effective methodology to automatically classify and identify patent documents. Furthermore, a prototype system is implemented and tested as a demonstration scenario using hand-tool patents sourced from the World Intellectual Property Organization (WIPO) database.

There have been many research efforts devoted to automatic document classification. Some of the classification methodologies are difficult to implement, and others are neither efficient nor effective, requiring developers of knowledge management systems to expend considerable resources testing and evaluating algorithms. The purpose of this research is to develop a document classification method based on neural networks and benchmark the performance against published standards. Through the implementation of a document classification module and a document search module, a prototype patent document management system is created. The patent document management system automates the classification of patent documents and improves the search for documents.

The automatic document classification methodology consists of the following steps. First, significant terms are extracted from patent documents and used to build a key phrase database. Second, the similarities between phrases are computed and recorded in a correlation matrix so that phrases can be synthesized into a smaller set representing key concepts within the patent domain. After key phrase extraction and synthesis, the consolidated set of key phrases is treated as the input of the back-propagation network model. The neural network model is trained using the key phrases and their frequencies in the sample documents. Training and assessment continue until the model reaches a satisfactory level of accuracy. Once the network model is trained, the final step is to use it for automated patent document classification and search.
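The extraction and synthesis steps can be illustrated with a minimal sketch. It is not the authors' implementation: the sample texts, the candidate phrase list, the use of Pearson correlation as the similarity measure, the greedy grouping rule, and the 0.9 threshold are all assumptions chosen only to make the idea concrete.

```python
import math

def phrase_frequencies(doc, phrases):
    """Count how often each candidate key phrase occurs in one document."""
    text = doc.lower()
    return [text.count(p) for p in phrases]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)

def correlation_matrix(freq_rows):
    """Correlate phrase columns across documents (rows = documents)."""
    cols = list(zip(*freq_rows))
    return [[pearson(a, b) for b in cols] for a in cols]

def synthesize(corr, threshold=0.9):
    """Greedily group phrase indices whose correlation meets the threshold."""
    groups, assigned = [], set()
    for i in range(len(corr)):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, len(corr)):
            if j not in assigned and corr[i][j] >= threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups

def merged_vector(freq, groups):
    """Collapse one document's phrase counts into one count per phrase group."""
    return [sum(freq[i] for i in g) for g in groups]

# Hypothetical sample texts and candidate phrases (not from the paper).
docs = [
    "cordless drill with battery pack and drill chuck",
    "battery pack for a cordless drill",
    "saw blade guard for a circular saw",
    "circular saw with saw blade lock",
]
phrases = ["cordless drill", "battery pack", "saw blade", "circular saw"]

rows = [phrase_frequencies(d, phrases) for d in docs]
groups = synthesize(correlation_matrix(rows))
vectors = [merged_vector(r, groups) for r in rows]
print(groups)   # phrase indices grouped into synthesized concepts
print(vectors)  # one reduced frequency vector per document
```

In this toy data the two drill-related phrases always co-occur, as do the two saw-related phrases, so four phrases reduce to two synthesized inputs. A real key phrase database would need a more careful grouping rule than this greedy pass.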

Literature survey

This section reviews relevant topics, including knowledge and e-document management, document categorization, clustering methodologies, and related patent analysis research.

System architecture and methodology

This section describes the detailed methodologies for document classification and document search. First, a document content extraction model is built to represent the document content as a vector of key phrase frequencies. Second, a document classification model based on the back-propagation network (BPN) approach is developed. Finally, a document search model is implemented using a trained back-propagation network.
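To make the classification model concrete, the following is a minimal sketch of a back-propagation network trained on key phrase frequency vectors. It is an illustration under stated assumptions rather than the network used in the paper: the single hidden layer of three sigmoid units, the squared-error loss, the learning rate, the epoch count, the toy frequency vectors, and the placeholder class names `class_A` and `class_B` are all hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class BPN:
    """One-hidden-layer network trained by back-propagation (squared error)."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        # One weight row per unit; the last entry of each row is the bias.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
                   for _ in range(n_out)]

    def forward(self, x):
        xb = list(x) + [1.0]  # append constant bias input
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        hb = h + [1.0]
        y = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in self.w2]
        return xb, hb, y

    def train_step(self, x, target, lr):
        xb, hb, y = self.forward(x)
        # Output-layer error terms for squared error with sigmoid units.
        d_out = [(yk - tk) * yk * (1 - yk) for yk, tk in zip(y, target)]
        # Propagate the error back to the hidden layer before updating w2.
        d_hid = [hb[j] * (1 - hb[j]) *
                 sum(d_out[k] * self.w2[k][j] for k in range(len(d_out)))
                 for j in range(len(self.w1))]
        for k, row in enumerate(self.w2):
            for j in range(len(row)):
                row[j] -= lr * d_out[k] * hb[j]
        for j, row in enumerate(self.w1):
            for i in range(len(row)):
                row[i] -= lr * d_hid[j] * xb[i]
        return sum((yk - tk) ** 2 for yk, tk in zip(y, target)) / 2

    def predict(self, x):
        y = self.forward(x)[2]
        return max(range(len(y)), key=y.__getitem__)

# Hypothetical training data: key phrase frequency vectors and class indices.
classes = ["class_A", "class_B"]  # placeholders, not real IPC codes
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
labels = [0, 0, 1, 1]
targets = [[1.0 if k == c else 0.0 for k in range(len(classes))] for c in labels]

net = BPN(n_in=2, n_hidden=3, n_out=len(classes))
losses = []
for epoch in range(2000):
    losses.append(sum(net.train_step(x, t, lr=0.5) for x, t in zip(X, targets)))

predictions = [net.predict(x) for x in X]
print([classes[p] for p in predictions])
```

The output layer has one unit per category, so the predicted class is simply the unit with the largest activation. How the trained network is reused for search is not detailed in this summary, so the sketch covers classification only.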

Function modules of the system

The prototype system includes the System Parameters Management Module, the Automatic Categorization Module, and the Document Search Module. The system parameters management module provides the interface for adjusting keyword correlation values and neural network weights. The automatic categorization module contains the functions of document upload, content extraction, and document categorization. Finally, the document search module provides users with an interface to search for and download documents.

Conclusion

The back-propagation network (BPN) algorithm offers two advantages: the ability to solve non-linear problems and the ability to learn by example. BPN also has limitations: inadequate training data may yield an unreliable model, and the training procedure may require significant computing resources. The first limitation can be addressed by compiling a wide range of examples. Since a well-trained model can help companies better manage documents, the cost of computing resources can be

Acknowledgement

This research is partially funded by the Taiwan Ministry of Economic Affairs and National Science Council research grants.
