Why Big Data Industrial Systems Need Rules and What We Can Do About It

Authors: Paul Suganthan G.C., Krishna Gayatri K., Narasimhan Rampalli, Shishir Prasad, Esteban Arcaute, Ganesh Krishnan, Vijay Raghavendra, AnHai Doan

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Pages 265 - 276

https://doi.org/10.1145/2723372.2742784

Published: 27 May 2015

Abstract

Big Data industrial systems that address problems such as classification, information extraction, and entity matching very commonly use hand-crafted rules. Today, however, little is understood about the usage of such rules. In this paper we explore this issue. We discuss how these systems differ from those considered in academia. We describe default solutions, their limitations, and reasons for using rules. We show examples of extensive rule usage in industry. Contrary to popular perceptions, we show that there is a rich set of research challenges in rule generation, evaluation, execution, optimization, and maintenance. We discuss ongoing work at WalmartLabs and UW-Madison that illustrates these challenges. Our main conclusions are (1) using rules (together with techniques such as learning and crowdsourcing) is fundamental to building semantics-intensive Big Data systems, and (2) it is increasingly critical to address rule management, given the tens of thousands of rules that industrial systems often manage today in an ad hoc fashion.

Cited By

Kwashie S, Liu L, Liu J, Stumptner M, Li J, Yang L (2019). Certus. Proceedings of the VLDB Endowment 12:6 (653-666). DOI: 10.14778/3311880.3311883. Online publication date: 1-Feb-2019.
https://dl.acm.org/doi/10.14778/3311880.3311883

Tabebordbar A, Beheshti A, Motahari Nezhad H, Mikkilineni R, Benatallah B, Casati F, Dustdar S, Dodig-Crnkovic G, Mos A (2018). Adaptive rule monitoring system. Proceedings of the 1st International Workshop on Software Engineering for Cognitive Services (45-51). DOI: 10.1145/3195555.3195564. Online publication date: 28-May-2018.
https://dl.acm.org/doi/10.1145/3195555.3195564

Singh R, Meduri V, Elmagarmid A, Madden S, Papotti P, Quiané-Ruiz J, Solar-Lezama A, Tang N (2017). Synthesizing entity matching rules by examples. Proceedings of the VLDB Endowment 11:2 (189-202). DOI: 10.14778/3149193.3149199. Online publication date: 1-Oct-2017.
https://dl.acm.org/doi/10.14778/3149193.3149199

Index Terms

Why Big Data Industrial Systems Need Rules and What We Can Do About It
1. Information systems
  1. Data management systems
    1. Database management system engines

Published In

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

May 2015

2110 pages

ISBN:9781450327589

DOI:10.1145/2723372

General Chair: Timos Sellis (RMIT University, Australia)
Program Chairs: Susan B. Davidson (University of Pennsylvania, USA), Zack Ives (University of Pennsylvania, USA)

Copyright © 2015 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

@WalmartLabs

Conference

SIGMOD/PODS'15: International Conference on Management of Data

Sponsor: SIGMOD

May 31 - June 4, 2015

Melbourne, Victoria, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

Total Citations: 3
Total Downloads: 1,058
Downloads (Last 12 months): 19
Downloads (Last 6 weeks): 4

Reflects downloads up to 07 Mar 2025
