Automatic extraction of titles from general documents using machine learning

https://doi.org/10.1016/j.ipm.2005.12.001

Abstract

In this paper, we propose a machine learning approach to title extraction from general documents. By general documents, we mean documents that can belong to any one of a number of specific genres, including presentations, book chapters, technical papers, brochures, reports, and letters. Previously, methods have been proposed mainly for title extraction from research papers. It has not been clear whether automatic title extraction from general documents is possible. As a case study, we consider extraction from Microsoft Office documents, including Word and PowerPoint. In our approach, we annotate titles in sample documents (for Word and PowerPoint, respectively), take them as training data, train machine learning models, and perform title extraction using the trained models. Our method is unique in that we mainly utilize formatting information, such as font size, as features in the models. It turns out that the use of formatting information can lead to quite accurate extraction from general documents. In an experiment on intranet data, precision and recall for title extraction from Word are 0.810 and 0.837, respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895, respectively. Other important new findings in this work are that we can train models in one domain and apply them to other domains and, more surprisingly, that we can even train models in one language and apply them to other languages. Moreover, we can significantly improve search ranking results in document retrieval by using the extracted titles.

Introduction

Metadata of documents is useful for many kinds of document processing such as search, browsing, and filtering. Ideally, metadata is defined by the authors of documents and is then used by various systems. However, people seldom define document metadata by themselves, even when they have convenient metadata definition tools (Crystal & Land, 2003). Thus, how to automatically extract metadata from the bodies of documents turns out to be an important research issue.

Methods for performing the task have been proposed. However, the focus was mainly on extraction from research papers. For instance, Han et al. (2003) proposed a machine learning based method to conduct extraction from research papers. They formalized the problem as that of classification and employed Support Vector Machines as the classifier. They mainly used linguistic features in the model.

In this paper, we consider metadata extraction from general documents. By general documents, we mean documents that may belong to any one of a number of specific genres. General documents are more widely available in digital libraries, intranets, and the internet, and thus investigation of extraction from them is sorely needed. Research papers usually have well-formed styles and noticeable characteristics. In contrast, the styles of general documents can vary greatly. It has not been clarified whether a machine learning based approach can work well for this task.

There are many types of metadata: title, author, date of creation, etc. As a case study, we consider title extraction in this paper. General documents can be in many different file formats: Microsoft Office, PDF (PS), etc. As a case study, we consider extraction from Microsoft Office documents, including Word and PowerPoint.

We take a machine learning approach. We annotate titles in sample documents (for Word and PowerPoint, respectively), take them as training data to train several types of models, and perform title extraction using any one of the trained model types. In the models, we mainly utilize formatting information, such as font size, as features. We employ the following models: Perceptron with Uneven Margins, Maximum Entropy (ME), Maximum Entropy Markov Model (MEMM), Voted Perceptron Model (VP), and Conditional Random Fields (CRF).
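The idea of classifying units by formatting features can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Unit` fields other than font size, the toy training data, the margin values, and the simplified bias update are hypothetical, and the code is a perceptron with class-dependent margins in the spirit of the uneven-margins model rather than the exact algorithm or feature set used in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Unit:
    """A candidate text unit with formatting attributes (hypothetical fields)."""
    text: str
    font_size: float
    is_bold: bool
    is_centered: bool


def featurize(unit: Unit, max_font_size: float) -> List[float]:
    """Map a unit to a small formatting feature vector."""
    return [unit.font_size / max_font_size, float(unit.is_bold), float(unit.is_centered)]


def decision(w: List[float], b: float, x: List[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train(samples: List[Tuple[List[float], int]],
          tau_pos: float = 1.0, tau_neg: float = 0.1,
          eta: float = 0.1, epochs: int = 500) -> Tuple[List[float], float]:
    """Perceptron with a larger margin for positives (titles) than for negatives."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, y in samples:
            tau = tau_pos if y == 1 else tau_neg
            # Update whenever the example is not beyond its class margin.
            if y * decision(w, b, x) <= tau:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y  # simplified bias update (assumption)
                updated = True
        if not updated:
            break
    return w, b


def predict(w: List[float], b: float, x: List[float]) -> int:
    return 1 if decision(w, b, x) > 0 else -1


# Toy, made-up training data: +1 = title unit, -1 = non-title unit.
units = [
    (Unit("Project Falcon Overview", 24, True, True), 1),
    (Unit("Status Update", 22, False, True), 1),
    (Unit("Prepared by the planning team", 12, False, False), -1),
    (Unit("Confidential", 14, True, False), -1),
]
max_font = max(u.font_size for u, _ in units)
samples = [(featurize(u, max_font), y) for u, y in units]
w, b = train(samples)
```

Giving the positive class a larger margin is one way to compensate for title units being much rarer than non-title units in a document.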

In this paper, we also investigate the following three problems, which do not seem to have been examined previously.

  (1) Comparison between models: among the models above, which model performs best for title extraction;

  (2) Generality of model: whether it is possible to train a model on one domain and apply it to another domain, and whether it is possible to train a model in one language and apply it to another language;

  (3) Usefulness of extracted titles: whether extracted titles can improve document processing such as search.

Experimental results indicate that our approach works well for title extraction from general documents. Our method can significantly outperform two baselines: one that always uses the first lines as titles and another that always uses the lines in the largest font sizes as titles. Precision and recall for title extraction from Word are 0.810 and 0.837, respectively, and precision and recall for title extraction from PowerPoint are 0.875 and 0.895, respectively. It turns out that the use of formatting features is the key to successful title extraction.
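The two baselines mentioned above are simple enough to sketch directly. The representation of a document as a list of `(text, font_size)` pairs, the function names, and the choice to join all largest-font lines with a space are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Tuple

Line = Tuple[str, float]  # (text, font size in points); hypothetical representation


def first_line_baseline(lines: List[Line]) -> str:
    """Always take the first line as the title."""
    return lines[0][0] if lines else ""


def largest_font_baseline(lines: List[Line]) -> str:
    """Always take the line(s) set in the largest font size as the title."""
    if not lines:
        return ""
    max_size = max(size for _, size in lines)
    return " ".join(text for text, size in lines if size == max_size)


# Made-up example: a header line precedes the real title.
doc = [
    ("Contoso Corporation", 10.0),
    ("Quarterly Sales", 24.0),
    ("Report", 24.0),
    ("Prepared by the finance team", 10.0),
]
```

On this example the first-line baseline returns the header rather than the title, which hints at why a learned combination of formatting cues can do better than any single rule.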

(1) We have observed that Perceptron-based models perform better in terms of extraction accuracy. (2) We have empirically verified that the models trained with our approach are generic in the sense that they can be trained on one domain and applied to another, and they can be trained in one language and applied to another. (3) We have found that by using the extracted titles we can significantly improve the precision of document retrieval (by 10%).

We conclude that we can indeed conduct reliable title extraction from general documents and use the extracted results to improve real applications.

The rest of the paper is organized as follows. In Section 2, we introduce related work, and in Section 3, we explain the motivation and problem setting of our work. In Section 4, we describe our method of title extraction, and in Section 5, we describe our method of document retrieval using extracted titles. Section 6 gives our experimental results. We make concluding remarks in Section 7.


Document metadata extraction

Methods have been proposed for performing automatic metadata extraction from documents; however, the main focus was on extraction from research papers.

The proposed methods fall into two categories: the rule based approach and the machine learning based approach.

Giuffrida, Shek, and Yang (2000), for instance, developed a rule-based system for automatically extracting metadata from research papers in Postscript. They used rules like “titles are usually located on the upper portions of the first

Motivation and problem setting

We consider the issue of automatically extracting titles from general documents.

By general documents, we mean documents that belong to any one of a number of specific genres. The documents can be presentations, books, book chapters, technical papers, brochures, reports, memos, specifications, letters, announcements, or resumes. General documents are more widely available in digital libraries, intranets, and the internet, and thus investigation of title extraction from them is sorely needed.

Fig. 1

Outline

Title extraction based on machine learning consists of training and extraction. The same pre-processing step occurs before training and extraction.

During pre-processing, a number of units for processing are extracted from the top region of the first page of a Word document or the first slide of a PowerPoint document. If a line (lines are separated by ‘return’ symbols) has only a single format, then the line becomes a unit. If a line has several parts and each of them has its own format,

Document retrieval method

We describe our method of document retrieval using extracted titles.

Typically, in information retrieval a document is split into a number of fields including body, title, and anchor text. A ranking function in search can use different weights for different fields of the document. Also, titles are typically assigned high weights, indicating that they are important for document retrieval. As explained previously, our experiment has shown that a significant number of documents actually have
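The field weighting described above can be sketched as a weighted sum of per-field term matches. The weight values, the field names used as dictionary keys, and the plain term-count scoring are all assumptions for illustration; this is not the ranking function used in the paper.

```python
from collections import Counter
from typing import Dict

# Hypothetical weights: title matches count more than body matches.
FIELD_WEIGHTS: Dict[str, float] = {"title": 3.0, "anchor": 2.0, "body": 1.0}


def score(query: str, doc_fields: Dict[str, str],
          weights: Dict[str, float] = FIELD_WEIGHTS) -> float:
    """Sum query-term counts in each field, scaled by the field's weight."""
    terms = query.lower().split()
    total = 0.0
    for field, text in doc_fields.items():
        counts = Counter(text.lower().split())
        total += weights.get(field, 0.0) * sum(counts[t] for t in terms)
    return total


# Made-up documents: the same words appear in the title of one and the body of the other.
doc_a = {"title": "budget report", "body": "the annual numbers"}
doc_b = {"title": "untitled", "body": "budget report for the year"}
```

With these weights `doc_a` outranks `doc_b` for the query "budget report", which illustrates why a correctly extracted title can change the ranking when the stored title field is missing or uninformative.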

Data sets and evaluation measures

We used three data sets in our experiments.

First, we downloaded and randomly selected 5000 Word documents and 5000 PowerPoint documents from an intranet of Microsoft. We call it MS hereafter.

Second, we downloaded and randomly selected 500 Word and 500 PowerPoint documents from the DotGov and DotCom domains on the internet, respectively.

Third, we downloaded and randomly selected 4000 Word and 4000 PowerPoint documents written in three other languages, including 4000 documents in Chinese, 2000

Conclusion

In this paper, we have investigated the problem of automatically extracting titles from general documents. We have tried using a machine learning approach to address the problem.

Previous work showed that the machine learning approach can work well for metadata extraction from research papers. In this paper, we showed that the approach can work for extraction from general documents as well. Our experimental results indicated that the machine learning approach can work significantly better than

Acknowledgements

We thank Chunyu Wei and Bojuan Zhao for their work on data annotation. We acknowledge Jinzhu Li for his assistance in conducting the experiments. We thank Ming Zhou, John Chen, and Jun Xu for their valuable comments on early versions of this paper. We also thank the anonymous reviewers of this paper for their many valuable comments.

References (27)

  • Berger, A. L., et al. (1996). A maximum entropy approach to natural language processing. Computational Linguistics.
  • Crystal, A., & Land, P. (2003). Metadata and Search Global Corporate Circle DCMI 2003 Workshop. Available from...
  • Collins, M. (2002). Discriminative training methods for hidden markov models: theory and experiments with perceptron...
  • Cortes, C., et al. (1995). Support-vector networks. Machine Learning.
  • Chieu, H. L., & Ng, H. T. (2002). A maximum entropy approach to information extraction from semi-structured and free...
  • Evans, D. K., Klavans, J. L., & McKeown, K. R. (2004). Columbia newsblaster: multilingual news summarization on the...
  • Ghahramani, Z., et al. (1997). Factorial hidden markov models. Machine Learning.
  • Gheel, J., & Anderson, T. (1999). Data and metadata for finding and reminding. In Proceedings of the 1999 international...
  • Giles, C. L., Petinot, Y., Teregowda, P. B., Han, H., Lawrence, S., & Rangaswamy, A., et al. (2003). eBizSearch: a...
  • Giuffrida, G., Shek, E. C., & Yang, J. (2000). Knowledge-based metadata extraction from PostScript files. In...
  • Han, H., Giles, C. L., Manavoglu, E., Zha, H., Zhang, Z., & Fox, E. A. (2003). Automatic document metadata extraction...
  • Kobayashi, M., et al. (2000). Information retrieval on the Web. ACM Computing Surveys.
  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: probabilistic models for segmenting and...
1 The work was conducted when the author was visiting Microsoft Research Asia.
