Engineering big data solutions

Published: 31 May 2014
DOI: 10.1145/2593882.2593889

ABSTRACT

Structured and unstructured data in operational support tools have long been prevalent in software engineering. Similar data is now becoming widely available in other domains. Software systems that utilize such operational data (OD) to help with software design and maintenance activities are increasingly being built despite the difficulties of drawing valid conclusions from disparate and low-quality data and the continuing evolution of operational support tools. This paper proposes systematizing approaches to the engineering of OD-based systems. To prioritize and structure research areas we consider historic developments, such as big data hype; synthesize defining features of OD, such as confounded measures and unobserved context; and discuss emerging new applications, such as diverse and large OD collections and extremely short development intervals. To sustain the credibility of OD-based systems more research will be needed to investigate effective existing approaches and to synthesize novel, OD-specific engineering principles.


Published in

FOSE 2014: Future of Software Engineering Proceedings
May 2014
224 pages
ISBN: 9781450328654
DOI: 10.1145/2593882

Copyright © 2014 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
