demonstration

On Automating Basic Data Curation Tasks

Authors:
Seyed-Mehdi-Reza Beheshti, University of New South Wales, Sydney, Australia
Alireza Tabebordbar, University of New South Wales, Sydney, Australia
Boualem Benatallah, University of New South Wales, Sydney, Australia
Reza Nouri, University of New South Wales, Sydney, Australia

WWW '17 Companion: Proceedings of the 26th International Conference on World Wide Web Companion, April 2017, Pages 165–169
https://doi.org/10.1145/3041021.3054726

Published: 03 April 2017

ABSTRACT

Big data analytics is firmly recognized as a strategic priority for modern enterprises. At the heart of big data analytics lies the data curation process, which consists of tasks that transform raw data (unstructured, semi-structured and structured data sources) into curated data, i.e. contextualized data and knowledge that is maintained and made available for use by end-users and applications. To achieve this, the data curation process may involve techniques and algorithms for extracting, classifying, linking, merging, enriching, sampling, and summarizing data and knowledge. To facilitate the data curation process and enhance the productivity of researchers and developers, we identify and implement a set of basic data curation APIs and make them available as services that assist researchers and developers in transforming their raw data into curated data. The curation APIs enable developers to easily add the following features to their data applications: extracting keywords, parts of speech, and named entities such as Persons, Locations, Organizations, Companies, Products, Diseases, Drugs, etc.; providing synonyms and stems for extracted information items by leveraging lexical knowledge bases for the English language such as WordNet; linking extracted entities to external knowledge bases such as Google Knowledge Graph and Wikidata; discovering similarity among the extracted information items, such as calculating similarity between strings and numbers; classifying, sorting and categorizing data into various types, forms or any other distinct class; and indexing structured and unstructured data. These services can be accessed via a REST API, and the data is returned as a JSON file that can be integrated into data applications. The curation APIs are available as an open source project on GitHub.
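The REST-plus-JSON access pattern described in the abstract can be illustrated with a minimal sketch. The abstract does not give endpoint names, parameters, or the response schema, so the base URL, the `extract/named-entities` path, the `text` query parameter, and the `entities` / `text` / `type` JSON keys below are all hypothetical. The sketch only builds the request and parses a sample response; it does not contact any real service.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; the real service address is not given in the abstract.
BASE_URL = "http://localhost:8080/curation"


def build_request(service: str, text: str, base_url: str = BASE_URL) -> Request:
    """Build a GET request for a (hypothetical) curation service endpoint."""
    query = urlencode({"text": text})  # assumed parameter name
    return Request(
        f"{base_url}/{service}?{query}",
        headers={"Accept": "application/json"},
    )


def parse_entities(payload: str) -> List[Tuple[str, str]]:
    """Convert a JSON response in the assumed shape into (entity, type) pairs."""
    data = json.loads(payload)
    return [(item["text"], item["type"]) for item in data.get("entities", [])]


# Sample response in the assumed shape, used here in place of a live call.
sample_response = json.dumps({
    "entities": [
        {"text": "Sydney", "type": "Location"},
        {"text": "UNSW", "type": "Organization"},
    ]
})

request = build_request("extract/named-entities", "UNSW is in Sydney")
entities = parse_entities(sample_response)
print(request.full_url)
print(entities)
```

Against a running deployment, the request would be sent with `urllib.request.urlopen(request)` and the response body passed to `parse_entities`, once the field names are adjusted to whatever the actual service returns.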

Published in
WWW '17 Companion: Proceedings of the 26th International Conference on World Wide Web Companion
April 2017, 1738 pages
ISBN: 9781450349147
General Chairs: Rick Barrett (W3Events), Rick Cummings (Murdoch University)
Program Chairs: Eugene Agichtein (Emory University), Evgeniy Gabrilovich (Google Research)
Publisher: International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland

Author Tags
big data analytics
curation api
data curation
Qualifiers
- demonstration
Acceptance Rates
WWW '17 Companion Paper Acceptance Rate: 164 of 966 submissions, 17%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%