DOI: 10.1145/3152494.3156817
research-article

Mining entities and their values from semi-structured documents in business process outsourcing

Published: 11 January 2018

Abstract

One of the most commonly performed tasks in business process operations is converting heterogeneous documents in formats such as doc, xls, and pdf into structured data that is entered into a typical ERP system. For example, in a logistics-based business process, the objective is to identify the courier company, the delivery address, the booking address, and the shipment and payment details from various types of documents. In this paper, we address the problem of entity extraction from heterogeneous data in an attempt to automate this manual task in the financial domain. We experimented with two supervised algorithms, CRFs and SVM, for sequentially tagging the entities. We present details of our experiments on real-world data collected in a BPO organization. We observe that CRF (81% and 73% F1-measure) performed better than SVMhmm (80% and 56% F1-measure) in the logistics and accounts-payable business process scenarios, respectively. We also observe that structural features, in combination with syntactic and domain-knowledge features, helped in getting better results.
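To make the feature groups named in the abstract more concrete, the following is a minimal sketch of per-token feature extraction for a sequence tagger. It is not the authors' implementation: the feature names, the `DOMAIN_KEYWORDS` lexicon, the helper functions, and the sample text are all hypothetical, chosen only to illustrate what structural, syntactic, and domain-knowledge features might look like.

```python
import re

# Hypothetical domain lexicon; the abstract does not list the actual
# domain-knowledge features used in the paper.
DOMAIN_KEYWORDS = {"invoice", "courier", "shipment", "payment", "address"}


def token_features(lines, line_idx, tok_idx):
    """Build one feature dict for the token at (line_idx, tok_idx)."""
    tokens = lines[line_idx]
    tok = tokens[tok_idx]
    prev_tok = tokens[tok_idx - 1] if tok_idx > 0 else "<BOL>"
    return {
        # Structural features: where the token sits in the document layout.
        "line_index": line_idx,
        "is_first_in_line": tok_idx == 0,
        "follows_colon": prev_tok.endswith(":"),
        # Syntactic (surface-form) features.
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "looks_like_amount": bool(
            re.fullmatch(r"\d+(?:,\d{3})*(?:\.\d+)?", tok)
        ),
        # Domain-knowledge feature: membership in a small keyword lexicon.
        "is_domain_keyword": tok.lower().strip(":") in DOMAIN_KEYWORDS,
    }


def document_features(text):
    """Split text into whitespace tokens per line and featurize each token."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    return [
        token_features(lines, i, j)
        for i, toks in enumerate(lines)
        for j in range(len(toks))
    ]


# Made-up sample resembling a key-value region of a logistics document.
sample = "Courier: Acme Express\nPayment: 1,250.00 INR"
features = document_features(sample)
```

Feature dictionaries of this shape are the kind of input that CRF toolkits typically accept, one dictionary per token, paired with one label per token during training.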


Published In

CODS-COMAD '18: Proceedings of the ACM India Joint International Conference on Data Science and Management of Data
January 2018
379 pages
ISBN: 9781450363419
DOI: 10.1145/3152494
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ACM proceedings
  2. business process
  3. entities
  4. segments
  5. semi-structured document
  6. sequential models

Qualifiers

  • Research-article

Conference

CoDS-COMAD '18

Acceptance Rates

CODS-COMAD '18 Paper Acceptance Rate: 50 of 150 submissions, 33%
Overall Acceptance Rate 197 of 680 submissions, 29%

Bibliometrics

Article Metrics

  • Total Citations: 0
  • Total Downloads: 133
  • Downloads (Last 12 months): 11
  • Downloads (Last 6 weeks): 0

Reflects downloads up to 13 Feb 2025
