Article

Engineering big data solutions

Author:
Audris Mockus

Avaya Labs Research, USA

FOSE 2014: Future of Software Engineering Proceedings, May 2014, Pages 85–99, https://doi.org/10.1145/2593882.2593889

Published: 31 May 2014

ABSTRACT

Structured and unstructured data in operational support tools have long been prevalent in software engineering. Similar data are now becoming widely available in other domains. Software systems that utilize such operational data (OD) to help with software design and maintenance activities are increasingly being built, despite the difficulties of drawing valid conclusions from disparate and low-quality data and the continuing evolution of operational support tools. This paper proposes systematizing approaches to the engineering of OD-based systems. To prioritize and structure research areas, we consider historic developments, such as big data hype; synthesize defining features of OD, such as confounded measures and unobserved context; and discuss emerging new applications, such as diverse and large OD collections and extremely short development intervals. To sustain the credibility of OD-based systems, more research will be needed to investigate effective existing approaches and to synthesize novel, OD-specific engineering principles.

Published in
FOSE 2014: Future of Software Engineering Proceedings
May 2014
224 pages
ISBN: 9781450328654
DOI: 10.1145/2593882
General Chair: James Herbsleb (Carnegie Mellon University, USA)
Program Chairs: Matthew B. Dwyer (University of Nebraska at Lincoln, USA), James Herbsleb (Carnegie Mellon University, USA)
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Analytics
Data Engineering
Data Quality
Data Science
Game Theory
Operational Data
Statistics