DOI: 10.1145/2723372.2742784
research-article

Why Big Data Industrial Systems Need Rules and What We Can Do About It

Published: 27 May 2015

ABSTRACT

Big Data industrial systems that address problems such as classification, information extraction, and entity matching very commonly use hand-crafted rules. Today, however, little is understood about the usage of such rules. In this paper we explore this issue. We discuss how these systems differ from those considered in academia. We describe default solutions, their limitations, and reasons for using rules. We show examples of extensive rule usage in industry. Contrary to popular perceptions, we show that there is a rich set of research challenges in rule generation, evaluation, execution, optimization, and maintenance. We discuss ongoing work at WalmartLabs and UW-Madison that illustrates these challenges. Our main conclusions are (1) using rules (together with techniques such as learning and crowdsourcing) is fundamental to building semantics-intensive Big Data systems, and (2) it is increasingly critical to address rule management, given the tens of thousands of rules industrial systems often manage today in an ad-hoc fashion.

Published in

SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN: 9781450327589
DOI: 10.1145/2723372

Copyright © 2015 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

SIGMOD '15 paper acceptance rate: 106 of 415 submissions, 26%. Overall acceptance rate: 785 of 4,003 submissions, 20%.
