ABSTRACT
Big Data industrial systems that address problems such as classification, information extraction, and entity matching very commonly use hand-crafted rules. Today, however, little is understood about the usage of such rules. In this paper we explore this issue. We discuss how these systems differ from those considered in academia. We describe default solutions, their limitations, and reasons for using rules. We show examples of extensive rule usage in industry. Contrary to popular perceptions, we show that there is a rich set of research challenges in rule generation, evaluation, execution, optimization, and maintenance. We discuss ongoing work at WalmartLabs and UW-Madison that illustrates these challenges. Our main conclusions are (1) using rules (together with techniques such as learning and crowdsourcing) is fundamental to building semantics-intensive Big Data systems, and (2) it is increasingly critical to address rule management, given the tens of thousands of rules industrial systems often manage today in an ad-hoc fashion.
Why Big Data Industrial Systems Need Rules and What We Can Do About It