ABSTRACT
Structured and unstructured data in operational support tools have long been prevalent in software engineering. Similar data are now becoming widely available in other domains. Software systems that use such operational data (OD) to help with software design and maintenance activities are increasingly being built, despite the difficulties of drawing valid conclusions from disparate and low-quality data and the continuing evolution of operational support tools. This paper proposes systematizing approaches to the engineering of OD-based systems. To prioritize and structure research areas, we consider historic developments, such as big data hype; synthesize defining features of OD, such as confounded measures and unobserved context; and discuss emerging applications, such as diverse and large OD collections and extremely short development intervals. To sustain the credibility of OD-based systems, more research will be needed to investigate effective existing approaches and to synthesize novel, OD-specific engineering principles.