1 Introduction

Cloud computing offers heterogeneous services such as storage, computation, and applications to users on an on-demand basis. It is a form of utility computing that provides users with pay-per-usage billing and requires little or no initial cost to acquire computational resources, since resources are essentially rented from the cloud service providers. Tech giants such as Amazon, Google, and Microsoft provide cloud services in various forms. Amazon Elastic Compute Cloud (EC2), Simple Queue Service (SQS), Simple Storage Service (S3), Google App Engine (GAE), Windows Azure, SQL Azure, and Windows Intune are some of these services.

Since the emergence of cloud computing as a prominent medium for acquiring high-performance resources (e.g., mass storage and high processing power) at low cost, there have been numerous discussions on its trustworthiness [1]. There has been a persistent lack of trust between cloud users and providers. Eliminating this lack of trust and creating a trustworthy platform for resource management is one of the greatest challenges of cloud computing.

To discuss the issue, let us consider an example scenario where a company, Titans, uses cloud services provided by a company, Hydra. Now, a malicious employee of Titans, Hera, does something illegal (e.g., a Denial-of-Service (DoS) attack on a site or mining sensitive unauthorized data) using the resources provided by Hydra. When the victim charges Hydra, and Hydra in turn charges Titans, Hera denies the charges. Moreover, she accuses another member of Titans, Hercules. Hercules has no way to prove his innocence, as both Hera and Hercules were using the resources at the same time and no one knows who was doing what. Moreover, if Hydra somehow colludes with Hera, Hercules will have no way to escape. This raises a serious trust issue among the users of the cloud.

The trust issue discussed above arises from the fact that cloud users do not have access to the logging and auditing records of the tasks they perform in the cloud. The associated mechanisms are fully controlled by the cloud providers. Hence, users cannot verify their tasks. Furthermore, some applications such as eScience and healthcare need to store records corresponding to computational tasks [2]. These records are used to analyze experimental results and for other purposes. The absence of computational records is a major barrier to the widespread growth of these applications on the cloud computing platform. Another fact to consider is that, in the current cloud model, users cannot be certain about various aspects of resource usage. In particular, a user cannot confirm that (i) the billing statements accurately reflect the real use of resources, (ii) the deployed resources were up and running all the time, and (iii) there have been no security breaches or malfunctions affecting the outsourced resources [1]. As a consequence, mistrust may build up between users and providers.

Substantial work has been done to establish trust between users and providers of the cloud. However, most of these works establish trust in the storage context by maintaining the provenance of data objects [1]. Ensuring trust in the context of processing (i.e., when users use virtual machines or applications provided by the cloud) is still an open issue. Nowadays, many sophisticated cloud-based applications are being proposed, such as software testing using the cloud [3], cloud-based malicious site detection [4, 5], cloud-based data mining [6], and using the cloud to conceal IP addresses [7]. But these applications can be a massive security threat if not monitored properly. For example, a user may use multiple VMs to perform a denial-of-service (DoS) attack on a site. Similarly, one may mine unauthorized data in order to misuse it. In a nutshell, a malicious user may perform illegal activities using the cloud and accuse another user of them. The situation becomes more complicated if the cloud provider colludes with the malicious user. Likewise, a malicious provider may overcharge its clients. To prevent such catastrophes and to establish trust among users, as well as between users and providers, computational transparency is required.

In this paper, we present a middleware service, the AntiqueData proxy, that establishes trust between users and providers by introducing computational transparency. The system works by collecting information from users and providers after every session of data processing and storing this information as computational provenance records. A session in this context refers to the span of time a user spends using a set of cloud resources. The records stored by the AntiqueData proxy are session-specific and are used to maintain transparency. These records are accessible to cloud providers and users in a hierarchically restricted way. The proxy also allows users to check their resource usage and thus ensure proper billing.

2 Background and Related Work

As the cloud grew in popularity as a storage and computation medium, the demand for a trusted cloud structure grew with it, and the idea of provenance emerged in the field of cloud computing. Provenance generally refers to the information that helps to determine the derivation history of a data product, starting from its original sources [8]. Numerous techniques have been proposed for provenance in cloud systems. Among them, the Provenance-Aware Storage System (PASS) [9] is considered a pioneer. Reddy et al. discussed the requirements for adding provenance data to cloud storage and four properties that make a provenance system truly useful. They proposed three protocols that monitor client system calls and store both the provenance and the data in AWS S3. These three protocols use the AWS S3, SimpleDB, and SQS services hierarchically to ensure properties such as provenance-data coupling, efficient querying, and causal ordering, respectively [10, 11]. Zhang recently proposed an approach named dataPROVE [2] that maintains provenance data depending on resource granularity [12]. Reilly and Naughton proposed extending the Condor batch execution system [13] to capture data on execution environments, machine identities, log files, and file permissions. Although a cloud infrastructure presents significant new challenges, the Provenance-Aware Condor system collects the right kind of provenance data. Abbadi et al. [14, 15] proposed the use of middleware at different layers of the cloud structure to maintain the required provenance data.

Most of the approaches mentioned above deal with provenance associated with data storage in the cloud. Moreover, almost all of the proposals involve storing object-based provenance data [1], which are fully deployed and controlled by cloud providers and are not adequately protected. This in turn questions the credibility of provenance data in the cloud and, as a result, affects the integrity of a cloud built upon the mutual trust of client and provider.

Recently, Park et al. [16] introduced RAMP, a new system for capturing and tracing provenance in MapReduce workflows. RAMP (Reduce And Map Provenance) is an extension to Hadoop that supports provenance capture and tracing for workflows of MapReduce jobs. Akoush et al. [17] introduced HadoopProv, a modified version of Hadoop that implements provenance capture and analysis in MapReduce jobs with reduced provenance capture overhead. While all these systems focus on debugging MapReduce workflows, they rely on data provenance to serve that purpose.

3 Threat Model

Numerous heterogeneous applications are being developed on the cloud computing platform. Butler et al. [7] proposed masking all network traffic via IP concealment with OpenVPN relaying to EC2 (MANTICORE). Such masking of network traffic using the cloud may facilitate malicious activities (e.g., cyber criminals may use such applications to remain anonymous and hide their IP addresses). Ferguson et al. [5] proposed using cloud infrastructure to obfuscate phishing scam analysis. Such applications allow blacklisted users to access sites without being detected and fetch content. Zhang et al. [3] proposed the design and implementation of a cloud-based performance testing system for web services. Performance testing applications involve measuring the number of requests that can be served per second. A malicious user may use such applications to perform a Denial-of-Service (DoS) attack on sites by saturating the target machine with external communication requests. Data mining using the cloud computing platform is another popular concept with security issues [18–20]. According to a survey by Rexer Analytics, 7 % of data miners use the cloud to analyze data [21]. Malicious miners may use the raw computing power provided by the cloud to analyze sensitive unauthorized data and thus cause privacy violations. In a nutshell, the growth of cloud computing is leading to the development of increasingly complex applications. If not monitored properly, these applications may facilitate cyber crime on the cloud computing platform.

Moreover, the cloud still remains a black box to its users. The providers control all records related to processing, and users do not have access to these records. As a result, users cannot be certain about billing, malfunctioning of resources, etc. In a 2010 survey by the Fujitsu Research Institute on potential cloud customers, it was found that 88 % of potential cloud consumers were worried about who has access to their data and demanded more awareness of what goes on in the back-end physical server (i.e., the virtual and physical machines). This is an obstacle to the widespread growth of cloud computing [22].

Fig. 1. Ishikawa diagram representing some trust issues in cloud

Figure 1 shows an Ishikawa diagram representing some major issues (causes) that may build mistrust between users and cloud providers. These issues include the increasing use of sophisticated privacy-violating applications, faulty virtual machines, the Security Through Obscurity principle adopted by providers, malicious users, etc.

4 AntiqueData Proxy

Our proposed system, which introduces transparency between users and providers of the cloud during computation or processing, is shown in Fig. 2. The major component of our system is the AntiqueData proxy, implemented on a third-party server, which acts as middleware to ensure transparency. The proxy collects session-wise data from users and providers and stores it in a third-party database. To use a particular resource for computational purposes, a user connects to the proxy, and the proxy in turn connects to a cloud provider; thus the user gets access to a cloud resource. That is, users cannot attach themselves to cloud providers directly, but only via the AntiqueData proxy. To serve its purpose, the proxy maintains information regarding providers and users. This information is maintained in three database tables dedicated to providers, clients, and their relationships.

Fig. 2. System architecture

Cloud Provider Table: Each entry of the Cloud Provider Table contains information regarding a particular cloud provider. The information includes the cloud provider’s name, its reliability expressed in terms of reliability levels, the parameters and methods associated with pricing (i.e., the cost model), its transparency expressed in terms of transparency levels, etc. (Table 1). The reliability level ensures that a cloud provider meets the reliability demand of a client. The cost model is used to validate billing. The expression for transparency is given in Sect. 6.

Table 1. Cloud Provider Table Sample (Partial)

Client Table: The entries of the Client Table correspond to information regarding all the users listed under specific clients. A client in this context refers to an organization or a company, and users refer to the members of the organization or employees of the company. The Client Table information includes the client’s identity, its transparency level, a list of triples combining a virtual ID, a hierarchy level, and a password for each user belonging to the client, etc. (Table 2). The hierarchy level is used for controlling access to the computational provenance records.

Table 2. Client Table Sample (Partial)

Relationship Table: Each entry of the Relationship Table maintains information regarding the relationship between a particular cloud provider and a particular client. The information includes the cloud provider’s name, the client’s identity, the transparency level corresponding to the pair (i.e., the pairwise transparency level), the number of sessions, a list of session IDs, etc. (Table 3).

Table 3. Relationship Table Sample (Partial)
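
To make the data model concrete, the following Java sketch outlines one possible in-memory representation of entries in these three tables; the field names and types are illustrative assumptions based on the descriptions above, not the exact schema of the prototype.

```java
import java.util.List;

// Illustrative entry types for the proxy's three tables (field names assumed).
class CloudProviderEntry {
    String providerName;        // e.g., "HydraCloud"
    int reliabilityLevel;       // reliability expressed as a level
    String costModel;           // parameters/methods associated with pricing
    int transparencyLevel;      // TL derived from alpha (Sect. 6)
}

class ClientEntry {
    String clientId;            // identity of the organization/company
    int transparencyLevel;      // TL derived from beta (Sect. 6)
    List<UserTriple> users;     // one triple per user belonging to the client
}

class UserTriple {
    String virtualUserId;       // anonymized identity used in provenance records
    int hierarchyLevel;         // controls access to provenance records
    String passwordHash;        // stored credential (hashing assumed)
}

class RelationshipEntry {
    String providerName;
    String clientId;
    int pairwiseTransparencyLevel;  // TL derived from gamma (Sect. 6)
    int sessionCount;
    List<String> sessionIds;
}
```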

The primary task of the AntiqueData proxy is to collect a variety of information from users and providers for each session. This information is required to continuously monitor the tasks performed by users in the cloud and also to ensure proper billing by monitoring users’ resource usage. The information includes the session ID (\(i\)), virtual user ID (\(U_{i}\)), login time (\(T_{i}\)), duration of the session (\(T_{i}^{d}\)), location of the user (\(L_{i}\)), work description (\(W_{i}\)), system information (\(S_{i}\)), log of user tasks (\(L_{i}^{t}\)), and usage of resources by the user (\(R_{i}\)). A record is generated from the information collected from the user and the provider. This record is the AntiqueData record, represented by \(\langle i, U_{i}, T_{i}, T_{i}^{d}, L_{i}, W_{i}, S_{i}, L_{i}^{t}, R_{i}\rangle \). The components of the AntiqueData record are described below:

Session ID (\(i\)): \(i\) represents a unique session number associated with each session of computation.

Virtual User ID (\(U_{i}\)): \(U_{i}\) does not represent the real identity of a user; rather, it is used to distinguish among different users within a client. This anonymization of user identity helps users maintain their privacy. The mapping from real users within a client to their virtual identities is maintained by the client.

Login Time (\(T_{i}\)): \(T_{i}\) represents the beginning of a session. It is the Unix timestamp at which a user logs into the AntiqueData proxy to perform processing tasks using cloud resources.

Duration of the Session (\(T_{i}^{d}\)): \(T_{i}^{d}\) represents the duration of a session, i.e., the span of time a user spends using a set of cloud resources for processing purposes.

Location of the User (\(L_{i}\)): \(L_{i}\) represents the location information of the user, collected using geolocation of the user’s IP address [23]. The location information may include country, region/state, city, metro/zip code, organization, etc. Location information is required to resolve disputes among users in case of unauthorized access to a user account (e.g., for a user to show that his account has been compromised).

Work Description (\(W_{i}\)): \(W_{i}\) represents information regarding the work done by the user using cloud resources. This information is provided by the user and includes several fields such as the type of work (e.g., billing, mining, auditing), priority (representing the significance of the work), brief details, etc.

System Information (\(S_{i}\)): \(S_{i}\) represents system information such as virtual resource status, kernel version, operating system, loaded modules, library configurations, the amount of main memory available, the memory allocated to the address space, the file path of an object on the VM, etc. Information about what (file operation), where (both PM and VM), and at what time a file is accessed, duplicated, or transferred (captured within the cloud provider) is also maintained.

Task Log (\(L_{i}^{t}\)): \(L_{i}^{t}\) represents a log containing information regarding tasks (work done by a user is treated as a task sequence in this regard). The primary information is a complete workflow for the tasks done within the provider, such as the complexity of tasks, whether a task involves network access, which blocks of a file have been modified, which records in a database table were changed, or which processes and applications are run on a single machine to perform a particular task. A monitoring report of the movement of packets in the network corresponding to a single file for a given task is also added to the log.

Resource Usage (\(R_{i}\)): \(R_{i}\) represents the usage of different resources by a user during a particular session. This information is required to ensure transparency regarding billing.
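
As a concrete illustration, a minimal sketch of how an AntiqueData record \(\langle i, U_{i}, T_{i}, T_{i}^{d}, L_{i}, W_{i}, S_{i}, L_{i}^{t}, R_{i}\rangle \) might be represented in code is given below; the concrete types chosen for each component are assumptions made for illustration only, not the representation used by the prototype.

```java
import java.util.List;
import java.util.Map;

// Illustrative representation of a single AntiqueData provenance record.
// Types are assumptions: e.g., resource usage is modeled as a simple map.
class AntiqueDataRecord {
    String sessionId;                  // i    : unique session number
    String virtualUserId;              // Ui   : anonymized user identity
    long loginTime;                    // Ti   : Unix timestamp of session start
    long durationSeconds;              // Ti^d : length of the session
    String userLocation;               // Li   : geolocation derived from the user's IP
    String workDescription;            // Wi   : type, priority, brief details
    Map<String, String> systemInfo;    // Si   : OS, kernel, memory, file paths, ...
    List<String> taskLog;              // Li^t : workflow/log entries for the tasks
    Map<String, Double> resourceUsage; // Ri   : per-resource usage for billing checks
}
```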

The AntiqueData records can be stored either on the third-party server (where the proxy resides) or in the cloud. In both cases, the records need to be encrypted before being stored, to ensure their privacy. These records can be accessed by both cloud providers and users, but with hierarchical access control restrictions: a user can access any record corresponding to his own work or to work done by users of lower hierarchy; that is, a user cannot access work done by users of the same (except his own) or higher hierarchy. Hierarchy here refers to the hierarchy level defined in the Client Table. Similarly, a cloud provider can only access its own records. The AntiqueData records are provided to users or providers in XML form.

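The hierarchical access rule above can be stated compactly in code. The sketch below is only illustrative: it assumes that a smaller hierarchy-level value denotes a higher position in the client’s hierarchy, which is an assumption, since the ordering of levels is not fixed here.

```java
// Sketch of the access rule for AntiqueData records. Assumption: a larger
// hierarchyLevel value means a lower position in the client's hierarchy.
class AccessControlSketch {

    // A requester may read a record if it is his own, or if the record's
    // owner sits strictly lower in the hierarchy than the requester.
    static boolean userCanAccess(String requesterVirtualId, int requesterLevel,
                                 String ownerVirtualId, int ownerLevel) {
        if (requesterVirtualId.equals(ownerVirtualId)) {
            return true;                     // own record
        }
        return ownerLevel > requesterLevel;  // strictly lower hierarchy only
    }

    public static void main(String[] args) {
        // A level-1 manager may read a level-3 employee's record...
        System.out.println(userCanAccess("vu-01", 1, "vu-17", 3)); // true
        // ...but not a record of a peer at the same level.
        System.out.println(userCanAccess("vu-01", 1, "vu-02", 1)); // false
    }
}
```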

5 Communication Between Proxy and Cloud Provider

One of the biggest challenges for the AntiqueData proxy is to collect information such as system information, task logs, and resource usage from providers. To do this, an entity (software, plugin, or tool) needs to be present on all the virtual machines corresponding to the provider. The entity needs to serve two purposes: logging the virtual machine to collect \(\{S_i, L_{i}^{t}, R_i\}\) and communicating with the AntiqueData proxy.

Fig. 3. Communication between proxy and provider

The first purpose can be served either by implementing a logger with features such as process monitoring and resource monitoring, or by extending existing data-centric logging mechanisms such as Flogger. Flogger [24] is a distributed file-centric VM/PM logger that monitors file operations and transfers within the cloud. The second purpose can be served by using the communication API provided by the AntiqueData proxy. The API passes information to the proxy using the POST method (Fig. 3).
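
As an illustration of the second purpose, the following sketch reports collected session information to the proxy over an HTTP POST; the endpoint URL and payload fields are hypothetical, since the actual API of the AntiqueData proxy is not specified here.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch of a VM-side entity reporting {Si, Li^t, Ri} to the proxy.
// The endpoint and payload format are assumptions for illustration.
public class ProxyReporter {
    public static void main(String[] args) throws Exception {
        String payload = "session_id=42"
                + "&system_info="    + java.net.URLEncoder.encode("kernel=5.4;mem=2048MB", "UTF-8")
                + "&task_log="       + java.net.URLEncoder.encode("job1:ok;job2:ok", "UTF-8")
                + "&resource_usage=" + java.net.URLEncoder.encode("cpu_hours=1.5;storage_gb=0.2", "UTF-8");

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://proxy.example.org/antiquedata/session")) // hypothetical endpoint
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Proxy replied: " + response.statusCode());
    }
}
```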

There are several issues to consider regarding the entity. The first is cloud provider authorization: a provider may not be willing to use an API or software provided by a third party; in that case, it can implement its own module that serves the logging and communication purposes. The second is that not all providers may be willing to provide the required information. This issue is related to transparency and is discussed in the next section.

6 Transparency

Transparency in the cloud computing context refers to openness in communication between provider and client. It is a dual-key lock that requires the approval of both parties; without the cooperation of either party (client or provider), transparency cannot be achieved in the cloud context. We define two separate transparency parameters, \(\alpha \) and \(\beta \), for the provider and the client, respectively.

The transparency parameter \(\alpha \) for a cloud provider indicates the provider’s consent to provide information. It can be defined as:

$$ \alpha = \frac{\sum _{i=1}^{n} D_i W_i}{\sum _{i=1}^{n} W_i} $$

Here, \(D_i\) is a boolean value indicating the presence or absence of a particular information component (e.g., file path of an object in the VM, resource status, etc.) according to the consent of the cloud provider, \(W_i\) is a real value representing the significance (weight) of the corresponding component in terms of trust establishment, and \(n\) is the total number of information components. The transparency level \(TL\) of a cloud provider is defined from the value of \(\alpha \) as \(TL = \lceil l\alpha \rceil \); the \(l\) transparency levels (\(TL = 1, \ldots, l\)) represent \(l\) ranges of \(\alpha \)’s value.
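
A small worked example may help. The sketch below computes \(\alpha \) and the resulting transparency level for a provider; the component weights and consent values are made-up numbers used only for illustration.

```java
// Illustrative computation of the provider transparency parameter alpha
// and the transparency level TL = ceil(l * alpha). The weights and consent
// flags below are made-up example values.
public class TransparencySketch {

    static double alpha(boolean[] d, double[] w) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < w.length; i++) {
            num += (d[i] ? 1.0 : 0.0) * w[i];
            den += w[i];
        }
        return num / den;
    }

    public static void main(String[] args) {
        boolean[] d = {true, true, false, true}; // consent per information component
        double[] w  = {0.4, 0.3, 0.2, 0.1};      // significance (weight) of each component
        int l = 5;                               // number of transparency levels

        double a = alpha(d, w);                  // = (0.4 + 0.3 + 0.1) / 1.0 = 0.8
        int tl = (int) Math.ceil(l * a);         // = ceil(4.0) = 4
        System.out.printf("alpha = %.2f, TL = %d of %d%n", a, tl, l);
    }
}
```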

Similarly, the transparency parameter \(\beta \) for a client indicates the client’s consent to allow the provider to monitor tasks in the virtual machine. The formulation of \(\beta \) is similar to that of \(\alpha \), except that \(D_i\) is determined by the consent of the client. The transparency level of a client is determined from the value of \(\beta \) as \(TL = \lceil l\beta \rceil \).

The pairwise transparency parameter \(\gamma \) represents transparency for a pair consisting of a provider and a client. The formulation of \(\gamma \) is similar to those of \(\alpha \) and \(\beta \), except that \(D_i\) is determined by the consent of both the provider and the client.

7 Evaluation

The goal of our evaluation is to (i) understand the storage and data transfer costs associated with provenance data, (ii) measure the computational overhead introduced by the system, and (iii) load-test the system.

A single session of computation generates at most 15 KB of provenance data. This is the upper bound for both storage and data transfer, so the storage and data transfer costs introduced by the system are modest.
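As a rough illustration of this bound, a client running 1,000 sessions per day would add at most about 15 MB of provenance data per day, or roughly 5.5 GB per year.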

We have implemented a logger (similar to the one described in Sect. 5) in Java. On a PC with a 2.5 GHz Intel Core i5 processor and 2.88 GB of usable memory running the Windows 7 operating system, the logger on average uses 12,458 KB of memory and 5 % of the CPU.

We have implemented a prototype of the proxy in PHP, tested its consistency, and monitored its performance in terms of user response time. The results are shown in Fig. 4. We can see that as the number of users increases, the average response time also increases drastically. To eliminate this bottleneck, multiple proxies can be introduced.

Fig. 4. Number of users vs. user response time

8 Limitations

The proposed system establishes trust between users and providers by introducing transparency. This transparency is reflected by the pairwise transparency parameter \(\gamma \). If the value of \(\gamma \) is low for the majority of provider–client pairs, due to the Security Through Obscurity principle adopted by the provider or the client, the achieved transparency will not be as expected, and the system will fail to serve its purpose of establishing trust. In addition, one element of the provenance record, the work description \(W_i\), is provided by the user; hasty users may supply insufficient details for such information.

9 Future Work

In the future, we would like to incorporate X.509 certificates to (i) prove that a user is indeed bona fide and (ii) maintain the authenticity of provenance records. A similar concept is found in grid computing: the Globus [25] security model uses X.509 certificates. We would also like to implement a multiple-proxy system that eliminates the bottlenecks associated with a single proxy (e.g., a single point of failure).

10 Conclusion

Establishing trust between users and cloud providers is a challenging task. Users nowadays run a variety of complex applications on the cloud computing platform. These applications are not only sophisticated in nature, but they can also be exploited for cyber crime on the cloud platform. Hence, proper monitoring of processing tasks in the cloud has become a key concern. To ensure proper monitoring of computational tasks (to prevent malicious activities) and to establish trust between users and providers (by introducing transparency), computational provenance records are required. In this paper, we present a middleware service, the AntiqueData proxy, that serves these purposes by maintaining computational provenance records.