Experiences with text mining large collections of unstructured systems development artifacts at JPL

Authors:
Dan Port, University of Hawaii, Honolulu, HI, USA
Allen Nikora, California Institute of Technology, Pasadena, CA, USA
Jairus Hihn, California Institute of Technology, Pasadena, CA, USA
LiGuo Huang, Southern Methodist University, Dallas, TX, USA

ICSE '11: Proceedings of the 33rd International Conference on Software Engineering, May 2011, Pages 701–710
https://doi.org/10.1145/1985793.1985891

Published: 21 May 2011

ABSTRACT

Repositories of systems engineering artifacts at NASA's Jet Propulsion Laboratory (JPL) are often so large and poorly structured that they have outgrown our capability to process their contents manually and extract useful information. Sophisticated text mining (TM) methods and tools seem a quick, low-effort way to automate our limited manual efforts. Our experience exploring such methods in three main areas at JPL (historical risk analysis, defect identification based on requirements analysis, and over-time analysis of system anomalies) has shown that obtaining useful results requires substantial unanticipated effort, from preprocessing the data to transforming the output for practical applications. We have not observed any quick 'wins' or realized any benefit from short-term effort avoidance through automation in this area. Surprisingly, we have realized a number of unexpected long-term benefits from the process of applying text mining to our repositories. This paper elaborates on some of these benefits and on the important lessons we learned from preparing and applying text mining to large unstructured system artifacts at JPL, with the aim of benefiting future TM applications in similar problem domains and in the hope that these lessons can be extended to broader application areas.

Index Terms

Experiences with text mining large collections of unstructured systems development artifacts at JPL
1. Social and professional topics
  1. Professional topics
    1. Management of computing and information systems
      1. System management
        Quality assurance

Published in
ICSE '11: Proceedings of the 33rd International Conference on Software Engineering
May 2011
1258 pages
ISBN: 9781450304450
DOI: 10.1145/1985793
General Chair: Richard N. Taylor, UC Irvine, USA
Program Chairs: Harald Gall, University of Zurich, Switzerland; Nenad Medvidović, University of Southern California, USA
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
assurance
experience
requirements assurance
risk
risk assurance
system repository mining
systems development artifact
text mining
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 276 of 1,856 submissions, 15%
