Beyond linear chain: a journey through conditional random fields for information extraction from text

Published: 26 June 2014

Abstract

Information Extraction (IE) is a field at the crossroads of Information Retrieval (IR) and Natural Language Processing (NLP) that studies methods for extracting information from text so that it can be used to populate a structured information repository. The main approaches to IE rely on supervised learning; the best-performing of these belong to the class of probabilistic graphical models and, in particular, to the class of Conditional Random Fields (CRFs). In this thesis we investigate two major aspects of textual IE via CRFs: (a) the creation of CRF models that can outperform the commonly adopted linear-chain CRFs, and (b) the creation of methods for ensuring the quality of training data and for assessing the impact of training data quality on the accuracy of CRF systems for IE.
We start by addressing the task of IE from medical documents written in Italian. We propose two novel approaches: (i) a cascaded, two-stage method composed of two layers of CRFs, and (ii) a confidence-weighted ensemble method that combines standard linear-chain CRFs with the proposed two-stage method. Both proposed models are shown to outperform a standard linear-chain CRF system.
We then investigate aspect-oriented sentence-level opinion mining from product reviews, which consists of predicting, for each sentence in a review, whether the sentence expresses a positive, neutral, or negative opinion (or no opinion at all) about a specific aspect of the product. We propose a set of increasingly powerful models based on CRFs, including a hierarchical multi-label CRF scheme that jointly models the overall opinion expressed in a product review and the set of aspect-specific opinions expressed in each of its sentences. The proposed CRF models are shown to obtain better results than linear-chain CRFs.
We then study the impact that the quality of training data has on the accuracy of an IE system, via experiments performed on a dataset for which inter-coder agreement data are available. Finally, we investigate active learning techniques for a type of semi-supervised CRFs specifically devised for partially labeled sequences. We show that margin-based strategies always obtain the best results on the four tasks on which we tested them.

Published In

ACM SIGIR Forum, Volume 48, Issue 1
June 2014
42 pages
ISSN: 0163-5840
DOI: 10.1145/2641383

Publisher

Association for Computing Machinery, New York, NY, United States
