Understanding users’ behavior with software operation data mining

doi:10.1016/j.chb.2013.07.049

Computers in Human Behavior

Volume 30, January 2014, Pages 583-594

https://doi.org/10.1016/j.chb.2013.07.049

Highlights

• We inspect different types of knowledge about software users’ usage behavior.
• We select state-of-the-art data mining techniques to analyze software usage.
• We create a method for mining usage knowledge from software operation data.
• We instantiate the method in a prototype developed in R.
• We evaluate the Usage Mining Method and the prototype by means of a case study.

Abstract

Software usage concerns knowledge about how end-users use the software in the field, and how the software itself responds to their actions. In this paper, we present the Usage Mining Method to guide the analysis of data collected during software operation, in order to extract knowledge about how a software product is used by the end-users. Our method suggests three analysis tasks which employ data mining techniques for extracting usage knowledge from software operation data: users profiling, clickstream analysis and classification analysis. The Usage Mining Method was evaluated through a prototype that was executed in the case of Exact Online, the main online financial management application in the Netherlands. The evaluation confirmed the supportive role of the Usage Mining Method in software product management and development processes, as well as the applicability of the suggested data mining algorithms to carry out the usage analysis tasks.

Introduction

Software usage concerns the utilization of a software product by the end-users. Software usage data may be collected while the end-users are using the software in the field (El-Ramly & Stroulia, 2004). Simmons (2006) points out that system requirements can be extracted from usage, which underlines the beneficial role of user experience in product innovation and differentiation. Software usage knowledge includes the awareness of how end-users use the software in the field, and how the software itself responds to their actions (Van der Schuur, Jansen, & Brinkkemper, 2010).

By tracking software usage, we can monitor which applications are most often used, which features are underutilized, and which functionalities could be expanded (Junco, 2013). This information could, for example, be used to highlight changes in the requirements engineering process. We may also gain insights into how users navigate through the user interface in order to perform an operation, with the goal of improving software usability or reengineering processes. Furthermore, by observing the usage behavior of different customer profiles, the software vendor can implement more directed marketing or customized licensing (Germanakos et al., 2008; Van der Schuur et al., 2010). Improved customer satisfaction, and consequently customer retention and increased sales, are some of the business advantages that could be gained through an automated usage analysis based on real execution data.
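As a rough illustration of the monitoring described above, the following Python sketch counts how often each feature is used and which feature-to-feature steps occur within a session. It is not the authors' R prototype: the log layout (`user_id`, `session_id`, `feature`), the feature names, and the helper functions are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical operation log, in time order: (user_id, session_id, feature).
# The field layout and feature names are assumptions for illustration only.
LOG = [
    ("u1", "s1", "login"),
    ("u1", "s1", "invoices"),
    ("u1", "s1", "export"),
    ("u2", "s2", "login"),
    ("u2", "s2", "invoices"),
    ("u2", "s2", "invoices"),
    ("u3", "s3", "login"),
    ("u3", "s3", "reports"),
]


def feature_usage(log):
    """Count how often each feature appears in the log."""
    return Counter(feature for _, _, feature in log)


def transition_counts(log):
    """Count consecutive feature-to-feature steps within each session."""
    by_session = {}
    for _, session, feature in log:
        by_session.setdefault(session, []).append(feature)
    counts = Counter()
    for path in by_session.values():
        # Pair each feature with the one that follows it in the same session.
        counts.update(zip(path, path[1:]))
    return counts


usage = feature_usage(LOG)
steps = transition_counts(LOG)
print(usage.most_common())
print(steps.most_common(3))
```

Features with low counts in `usage` are candidates for the "underutilized" category, while `steps` gives a first impression of how users move through the interface. A real analysis would work on far larger logs and would likely need the data preparation that the method prescribes.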

Software usage knowledge may be extracted from software operation data, i.e. data that are collected during software operation in the field (Van der Schuur et al., 2010). A noticeable amount of research has already been performed on recording software operation data (Bowring et al., 2002; Nusayr & Cook, 2009). In practice, most vendors tend to handle the acquired data manually, or use general statistics and simple visualization techniques (Kristjansson & Van der Schuur, 2009). However, such analysis cannot yield the interesting patterns that are hidden in large datasets (Kantardzic, 2002).

On the other hand, the web usage mining field has seen a lot of development (Cooley, Mobasher, & Srivastava, 1997). Although many lessons can be learned from it, analyzing how website visitors use the web differs significantly from analyzing how users use software products. The techniques that are used in web usage mining (and other related domains) therefore need to be revised before they are applied to mining usage from software operation data.

While usage knowledge is highly important for making good software products, the rise of cloud computing and Software-as-a-Service (SaaS) applications (Park & Ryoo, 2013) creates an opportunity to mine the easily acquired data. Even though there are algorithms for doing such data analysis, they are hardly ever used for analyzing software usage. Following a meta-algorithmic approach, we will try to answer the research question:

How should we inspect software operation data, in order to gain knowledge about how the software is used by the end-users?

This research suggests how data mining techniques can be integrated to analyze software operation data in a uniform and automated way. Hence, it contributes to the domain of software usage analysis as well as to the software operation knowledge and its use in software product management, development and maintenance processes (Van der Schuur et al., 2010). From a practical perspective, the method that we suggest for usage mining constitutes a reference process model that can be followed by software vendors, to analyze how their customers use their products.

The remainder of this paper has the following structure: In Section 2 we review the research that has been performed in the area of extracting usage knowledge from system utilization. We briefly present our research design in Section 3. In Section 4 we present the method that has been constructed to extract usage knowledge. In Section 5 we describe the usage knowledge subjects that we suggest to extract, and the variables that should be inspected in software operation data, in order to derive conclusions about how software operates in the field. Section 6 describes the data mining techniques that are suggested for mining software usage knowledge. In Section 7 we present the prototype that was constructed as an instantiation of the usage mining method. We evaluate the two artifacts in a case study in Section 8. Finally, in Section 9 we discuss the insights from this research and provide some general conclusions.

Section snippets

Related work

As far as specific research on software usage analysis is concerned, extraction of in-the-field usage knowledge remains an area that needs a lot of enrichment. Data analysis techniques have been previously applied to this field: for software reengineering purposes (El-Ramly et al., 2009; Leffingwell & Widrig, 2003), for program comprehension (Zaidman, Calders, Demeyer, & Paredaens, 2005), for re-documentation of use cases (Smit, Stroulia, & Wong, 2008), or for user interface learning agents (

Research design

The users’ shift to cloud computing applications (Park & Ryoo, 2013) creates the opportunity for software vendors to automatically collect vast amounts of usage data. Although several algorithms have been developed to analyze the behavior of website visitors, they are hardly ever used in the software products domain. This research aims to follow a meta-algorithmic approach, by incorporating the state-of-the-art data mining techniques in a method. Our goal is to show how the appropriate

Usage Mining Method

In this section we present the first design artifact that we constructed in this research. The Usage Mining Method suggests an ordered set of activities that should be followed to extract relevant usage knowledge from software operation data.

In order to provide guidance in analyzing software product users’ usage behavior, we propose the Usage Mining Method (Fig. 2). The method has been constructed with the Method Engineering approach, provided by van de Weerd and Brinkkemper (2008). The method

Software usage knowledge

In this section we suggest what types of knowledge should be extracted from software operation data to gain insights about how the end-users are using a software product. Subsequently, we present the fundamental variables that should be inspected during software operation, in order to gather the data that are necessary to analyze usage.

Based on our findings from our literature research in the domains of usage analysis in software systems (El-Ramly et al., 2009, Simmons, 2006) and web usage

Usage mining tasks and data mining techniques

In this section, we are going to suggest which data mining techniques could be performed on the software operation data, in order to analyze the software usage. More specifically, in order to produce the various usage knowledge types presented in the previous section, we suggest the following usage analysis tasks:

1. Classification Analysis, to understand the factors which influence the decisions that customers take, in the context of the software product utilization.
2. Users Profiling, i.e.
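To make the first task more concrete, the following Python sketch ranks candidate factors by information gain, the entropy-based criterion that many decision-tree learners use to choose splits. It is only an illustration under stated assumptions: the source does not confirm which classification algorithm the method prescribes, this is not the authors' R prototype, and the customer records and the field names `plan`, `used_reports`, and `renewed` are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, factor, target):
    """Reduction in entropy of `target` after splitting rows on `factor`."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[factor], []).append(row[target])
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical customer records; all field names and values are assumptions.
CUSTOMERS = [
    {"plan": "basic", "used_reports": "no", "renewed": "no"},
    {"plan": "basic", "used_reports": "yes", "renewed": "yes"},
    {"plan": "premium", "used_reports": "no", "renewed": "no"},
    {"plan": "premium", "used_reports": "yes", "renewed": "yes"},
]

gains = {factor: information_gain(CUSTOMERS, factor, "renewed")
         for factor in ("plan", "used_reports")}
best_factor = max(gains, key=gains.get)
print(gains, best_factor)
```

In this toy data the renewal decision is fully explained by whether the customer used the reports feature, so that factor receives the highest gain, while the plan type carries no information about the outcome.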

A prototype for usage mining

The Usage Mining Method presented in Section 4 is instantiated in a prototype, which we developed in R (R Development Core Team, 2008) and which implements the method’s activities. The prototype can be used to analyze the software usage of SaaS products with embedded logging procedures that record the operation data. The prototype has the format of an R script, which successively performs the activities of Data Preparation, Exploratory Analysis, Classification Analysis, Users Profiling and

Case study

Following the Design Science Research approach, we have just presented the two design artifacts that we constructed in this research: the Usage Mining Method and the prototype developed in R. In order to evaluate the two artifacts, we performed a case study at a Dutch software company, to implement the Usage Mining Method and run the prototype in the context of a real software product. This section comprises the design of the case study, as well as the execution and interpretation of the results.

Discussion

In this paper we have investigated how we can inspect software operation data, in order to gain knowledge about how the software is used by the end-users. We reviewed related literature on software usage analysis. We constructed and presented a method that could be used to analyze how the end-users are using a software product. We explicated this knowledge by distinguishing four different categories (statistical summaries of sessions and users’ behavior, factors that influence the customers’

References (57)

P. Germanakos et al. Capturing essential intrinsic user behaviour values for the design of comprehensive web-based personalized environments. Computers in Human Behavior (2008)
R. Junco. Comparing actual and self-reported measures of Facebook use. Computers in Human Behavior (2013)
C. Lin et al. Applying social bookmarking to collective information searching (CIS): An analysis of behavioral pattern and peer interaction for co-exploring quality online resources. Computers in Human Behavior (2011)
S. Okazaki. Lessons learned from i-mode: What makes consumers click wireless banner ads? Computers in Human Behavior (2007)
S.C. Park et al. An empirical investigation of end-users’ switching toward cloud computing: A two factor theory perspective. Computers in Human Behavior (2013)
S. Stieger et al. What are participants doing while filling in an online questionnaire: A paradata collection tool and an empirical study. Computers in Human Behavior (2010)
W.M.P. Van der Aalst et al. Process mining: A research agenda. Computers in Industry (2004)
J. Bowring et al. Monitoring deployed software using software tomography. SIGSOFT Software Engineering Notes (2002)
L. Breiman et al. Classification and regression trees (1984)
G. Brock et al. clValid: An R package for cluster validation. Journal of Statistical Software (2008)
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., et al. (2000). CRISP-DM 1.0 Step-by-step...
R. Cooley et al. Web mining: Information and pattern discovery on the world wide web (1997)
R. Cooley et al. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems (1999)
R.G. Cowell et al. Probabilistic networks and expert systems: Exact computational methods for Bayesian networks (2007)
S. Dustdar et al. Discovering web service workflows using web services interaction mining. International Journal of Business Process Integration and Management (2007)
El-Ramly, M., & Stroulia, E. (2004). Mining software usage data. In International Workshop on Mining Software...
M. El-Ramly et al. Legacy systems interaction reengineering
B. Everitt et al. Cluster analysis (2001)
U.M. Fayyad et al. On the handling of continuous-valued attributes in decision tree generation. Machine Learning (1992)
A. Field. Discovering statistics using SPSS (2009)
P. Giudici. Applied data mining: Statistical methods for business and industry (2003)
C.M. Grinstead et al. Introduction to probability (2006)
J. Han. Data mining: Concepts and techniques (2005)
T. Hastie et al. Hierarchical clustering (2009)
S. Haykin. Neural networks: A comprehensive foundation (1998)
A.R. Hevner et al. Design science in information systems research. MIS Quarterly (2004)
A.K. Jain et al. Data clustering: A review. ACM Computing Surveys (1999)
S. Jones et al. An analysis of usage of a digital library

