Privacy-preserving human action recognition as a remote cloud service using RGB-D sensors and deep CNN

https://doi.org/10.1016/j.eswa.2020.113349

Abstract

Cloud-based expert systems are emerging rapidly. However, in practice, data owners and cloud service providers are not in the same trusted domain. For the sake of data privacy, sensitive data usually has to be encrypted before outsourcing, which makes effective cloud utilization a challenging task. Taking this concern into account, we propose a novel cloud-based approach to securely recognize human activities. A few schemes exist in the literature for secure recognition. However, they suffer from the problem of constrained data and are vulnerable to re-identification attacks, in which advanced deep learning models are used to predict a subject's identity. We address these problems by considering color and depth data and securing them using a position-based superpixel transformation. The proposed transformation actively injects additional noise while resizing the underlying image, which achieves a higher degree of obfuscation. Further, instead of securing the complete video, we secure only four images, that is, one motion history image and three depth motion maps, which greatly reduces the data overhead. Recognition is performed using a four-stream deep Convolutional Neural Network (CNN), where each stream is based on the pre-trained MobileNet architecture. Experimental results show that the proposed approach is the best candidate in the "security-recognition accuracy (%)" trade-off among other image obfuscation schemes as well as the state of the art. Moreover, a number of security tests and analyses demonstrate the robustness of the proposed approach.

Introduction

The evolution of deep learning has strengthened expert systems to perform human action recognition very precisely (Ronao & Cho, 2016). Human action recognition has been studied for decades and is still a very popular topic due to broad real-world applications, such as video retrieval, visual surveillance, human-computer interaction, and robotics for human behavior characterization (Mabrouk & Zagrouba, 2018). However, rich hardware and software resources along with a team of specialized personnel are required to maintain an expert system for human action recognition. This restricts the real-world utility of such systems, as mid-sized organizations are rarely able to maintain an in-house system for long. Recently, cloud service providers have addressed this problem by introducing pay-as-you-go models; examples include Azure Machine Learning,1 IBM Watson Machine Learning,2 etc. These services relieve users of infrastructure maintenance responsibilities by letting them outsource their data to deep learning based services hosted on the cloud. However, the massive data collection required for deep learning raises privacy issues.

Personal and highly sensitive user data, such as photos and video recordings, are stored indefinitely by the companies that collect them. Images and video recordings often contain accidentally captured sensitive items, including faces, license plates, computer screens, etc., which lead to privacy loss. For example, an organization that may want to apply cloud-based deep learning techniques to identify suspicious actions in its critical areas is prevented by privacy concerns from sharing its surveillance data and thus benefiting from large-scale deep learning. Emerging privacy-paradox studies, (Ooi, Hew, & Lin, 2018) and (Hew, Tan, Lin, & Ooi, 2017), have also shown ambiguous user behavior regarding privacy. In this situation, privacy and confidentiality restrictions significantly reduce the use of the facilities offered by cloud service providers. To overcome this problem, strong cryptographic techniques can be used to secure user data before sending it to the cloud. This works well for secure storage; however, it introduces the challenge of processing data in encrypted form, known as the Encrypted Domain (ED). We address these problems and propose a novel approach for the secure outsourcing of user data to a cloud-based expert system.

The proposed approach ensures data privacy and enables users to access a deep learning model for human action recognition over the cloud. It can be applied in a wide range of situations where surveillance data contains sensitive information, including secure and automatic identification of suspicious human actions in parking lots, Intensive Care Units (ICUs) and other sensitive places, secure monitoring of human actions at traffic signals and bank counters, Active and Assisted Living (AAL) systems for smart homes, etc.

In the context of multimedia data, various methods exist to assure image privacy, including mosaicing, blurring, scrambling, and encryption. Each method is unique and finds its scope in situation-specific solutions. For example, automatic face blurring was introduced by YouTube3 in 2012: the company preserved identity by blurring human faces in video content, for example, securing the identity of activists involved in a protest march. However, this approach does not fulfill our privacy goal, as there can be multiple video frames with similar visual information that may enable an adversary to obtain significant information from non-protected parts, such as background objects. The next method, image mosaicing, secures image information by creating large pixel-like patches, known as pixelation. There is a risk, though, as some recent attack models have concentrated on re-identifying identity-related information from obfuscated image parts (McPherson, Shokri, & Shmatikov, 2016). The remaining two methods, image scrambling and encryption, focus on securing the full image information with standard cryptographic methods. However, pixel information is distorted to such an extent that data processing becomes infeasible.

In this paper, our primary objective is to design a novel approach for full image obfuscation such that automatic human action recognition can be performed over the cloud without revealing any identity-related information. We achieve this by transforming the secret image into an extremely low resolution, e.g. from 224 × 224 to 14 × 14, using a position-based superpixel transformation. The proposed transformation concentrates a group of pixels into a single composition value and accumulates random noise on top of it. As a result, the identity-related spatial information of the underlying image is greatly reduced, making it challenging for an adversary to relate two obfuscated images. This effectively removes the possibility of a re-identification attack. Compared to existing image obfuscation methods, the proposed approach improves image security while retaining higher recognition accuracy in the "security-recognition accuracy (%)" trade-off relation.
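To convey the general idea (not the exact composition rule or noise model of the proposed transformation, which are defined later in the paper), the following minimal sketch collapses each 16 × 16 block of a 224 × 224 image into one superpixel value and perturbs it with random noise; all parameter values are illustrative assumptions.

```python
# Illustrative sketch only: block-wise pixel grouping plus additive noise.
# The paper's actual position-based composition and noise model differ.
import numpy as np

def obfuscate(image: np.ndarray, out_size: int = 14,
              noise_scale: float = 8.0, rng=None) -> np.ndarray:
    """Downsample a (224, 224) or (224, 224, C) image to out_size x out_size
    superpixels and add zero-mean random noise."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    bh, bw = h // out_size, w // out_size          # block size, e.g. 16 x 16
    img = image[:bh * out_size, :bw * out_size].astype(np.float64)
    # Group the pixels of each block and collapse them to a single value.
    blocks = img.reshape(out_size, bh, out_size, bw, -1)
    superpixels = blocks.mean(axis=(1, 3))
    # Accumulate random noise so repeated obfuscations of the same image differ.
    noisy = superpixels + rng.normal(0.0, noise_scale, superpixels.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8).squeeze()

# Two obfuscations of the same frame are not identical:
frame = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
a, b = obfuscate(frame), obfuscate(frame)
print(a.shape, np.array_equal(a, b))   # (14, 14, 3) False
```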

Recently, a few research works, (Dai, Saghafi, Wu, Konrad, & Ishwar, 2015), (Ryoo, Kim, & Yang, 2018), (Ryoo, Rothrock, Fleming, & Yang, 2017) and (Chou et al., 2018), have been proposed that use extremely low resolution images to achieve privacy. Their main idea is to resize the original high resolution image into an extremely low resolution so that sensitive information, such as the face of a person or the number plate of a vehicle, is obfuscated. The resized training samples are then used to train a deep learning model for secure recognition. However, some drawbacks are observed, which are described as follows -

  • 1.

    Intrinsic information is lost due to the extremely low resolution, restricting the model to learn in a constrained manner, as only RGB (Red-Green-Blue) images are considered, and

  • 2.

    There is a risk to privacy, as the resulting low resolution images are generated in an analogue manner that falls in the category of full image mosaicing. This makes it possible for the cloud service provider to identify similar images, even if they are claimed to be secure. Recently, McPherson et al. (2016) described this as a potential risk in their article entitled Defeating Image Obfuscation with Deep Learning by demonstrating how deep learning models can be used to re-identify sensitive information. The authors reported high re-identification accuracies over four standard datasets, namely MNIST, CIFAR-10, AT&T, and FaceScrub.

We address the first drawback by utilizing RGB and depth data with a four-stream deep Convolutional Neural Network (CNN). Considering the several advantages of depth data over RGB alone, we use depth data as the second modality to overcome the constrained-data problem.

In the context of the re-identification attack discussed as the second drawback, we propose a non-invertible position-based superpixel transformation for image obfuscation. McPherson et al. (2016) demonstrated that neural networks can automatically discover relevant features and learn to exploit correlations in obfuscated images. This problem is aggravated for the low resolution images used in state-of-the-art schemes, as they are generated in an analogue manner. In contrast, unlike the existing privacy-preserving schemes of (Dai et al., 2015), (Ryoo, Kim, & Yang, 2018), (Ryoo, Rothrock, Fleming, & Yang, 2017) and (Chou et al., 2018), which use simple image resizing for obfuscation, the proposed transformation accumulates random noise. Due to this, an adversary cannot link two encrypted images generated from the same secret image. To summarize, the major contributions of this paper are as follows -

  • Compared to the state-of-the-art works of (Dai et al., 2015), (Ryoo, Kim, & Yang, 2018), (Ryoo, Rothrock, Fleming, & Yang, 2017) and (Chou et al., 2018), the proposed transformation accumulates additional noise. This results in improved security, the robustness of which is validated using several statistical and differential tests.

  • Unlike previous schemes that utilize only RGB data, we use depth maps in integration with RGB as the second modality. Therefore, for a video V of t frames, only four images, that is, one Motion History Image (MHI) and three Depth Motion Maps (DMMs), are secured by transforming them into extremely low resolution (a minimal sketch of how these descriptors are commonly computed follows this list). As a result, the data overhead is significantly reduced.

  • A four-stream deep CNN is used, with one stream for the MHI and one for each DMM. The respective outputs are then fused, resulting in more accurate recognition.

  • The proposed approach outperforms other image obfuscation methods in the "security-recognition accuracy (%)" trade-off relation.
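As background on the two descriptors mentioned above (the sketch referenced in the list), the following minimal code follows the common formulations: a Bobick-and-Davis style temporal template for the MHI, and DMMs accumulated from the front, side, and top projections of each depth frame. The decay rate, quantisation, thresholds, and projection details are illustrative assumptions, not necessarily the exact settings used in this paper.

```python
# Minimal sketches of MHI and DMM computation under standard formulations.
import numpy as np

def motion_history_image(frames, tau=255, delta=30):
    """MHI from grayscale frames: pixels that just moved get value tau,
    older motion decays linearly toward zero."""
    mhi = np.zeros(frames[0].shape, dtype=np.float64)
    for prev, curr in zip(frames[:-1], frames[1:]):
        moving = np.abs(curr.astype(int) - prev.astype(int)) > delta
        mhi = np.where(moving, tau, np.maximum(mhi - 1, 0))
    return mhi.astype(np.uint8)

def depth_motion_maps(depth_frames, num_bins=256):
    """Accumulate absolute differences of the front/side/top projections
    of consecutive depth frames into three DMMs."""
    def project(d):
        # Front view: the depth map itself. Side/top: binary occupancy of
        # the quantised depth values along rows/columns.
        d = np.clip(d, 0, num_bins - 1).astype(int)
        side = np.zeros((d.shape[0], num_bins))
        top = np.zeros((num_bins, d.shape[1]))
        rows, cols = np.nonzero(d)
        side[rows, d[rows, cols]] = 1
        top[d[rows, cols], cols] = 1
        return d, side, top

    dmms = None
    prev = project(depth_frames[0])
    for frame in depth_frames[1:]:
        curr = project(frame)
        # A small noise threshold is often applied here; omitted for brevity.
        diffs = [np.abs(c - p) for c, p in zip(curr, prev)]
        dmms = diffs if dmms is None else [a + d for a, d in zip(dmms, diffs)]
        prev = curr
    return dmms  # [DMM_front, DMM_side, DMM_top]
```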

We review the existing privacy-preserving schemes in Section 2. A brief overview of the proposed approach is provided in Section 3. Section 4 presents a detailed description of the proposed transformation, followed by the complete methodology and model description in Section 5. Recognition results and security analysis are presented in Sections 6 and 7, respectively. An analysis of the data efficiency achieved by the proposed approach is discussed in Section 8. Section 9 provides a comprehensive discussion of all results reported in the paper, along with future directions. Finally, the paper is concluded in Section 10.

Section snippets

Related work

In this section, we first provide a brief overview of existing schemes that support secure data processing and then move towards the recent developments for privacy-preserving human action recognition.

System overview

The proposed approach is designed to run in a cloud environment, and the operations required for its functioning are divided between the trusted domain and the cloud server. Any user-accessible device, located in the actual field where the multimedia is temporarily stored and secured for further transmission to the cloud server, falls in the category of the trusted domain. The remaining tasks, that is, secure storage and recognition, are performed at the cloud server.

The cloud server needs to deploy a

Position-based superpixel transformation for image obfuscation

In order to secure image information, we propose a position-based superpixel transformation f. Superpixel segmentation is the process of clustering connected pixels with similar features in an image so that an abstract image can be obtained. It finds wide scope in various applications such as medical image segmentation (Kitrungrotsakul, Han, & Chen, 2015), image retrieval (Haas, Donner, Burner, Holzer, & Langs, 2011; Stutz, Hermans, & Leibe, 2018), dataset annotation (Liu et al., 2013), etc. However, unlike

Privacy-preserving human action recognition as a service

The two primary objectives of the proposed approach are: (i) obfuscating image information in the trusted domain, and (ii) performing secure human action recognition over the cloud server. The functionality is organized into two phases, which are described as follows -

Recognition results

In this section, we provide network details and validate the effectiveness of the proposed approach on the standard UTD-MHAD dataset.
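The exact network configuration is given in the paper; as a non-authoritative sketch of how a four-stream classifier with late score fusion can be assembled, the code below uses torchvision's MobileNetV2 as a stand-in for the pre-trained MobileNet (v1) backbone. The upsampling step, class count (UTD-MHAD's 27 actions), and averaging fusion rule are assumptions made only for this example.

```python
# Illustrative four-stream sketch; MobileNetV2 substitutes for MobileNet (v1).
import torch
import torch.nn as nn
from torchvision import models


class FourStreamActionNet(nn.Module):
    """Four independent streams (1 MHI + 3 DMMs), fused by score averaging."""

    def __init__(self, num_classes=27):  # UTD-MHAD defines 27 action classes
        super().__init__()
        self.streams = nn.ModuleList()
        for _ in range(4):
            backbone = models.mobilenet_v2(weights=None)  # in practice: ImageNet pre-trained
            backbone.classifier[1] = nn.Linear(backbone.last_channel, num_classes)
            self.streams.append(backbone)
        # Obfuscated inputs are extremely low resolution (e.g. 14 x 14);
        # upsample to the backbone's expected 224 x 224 input.
        self.upsample = nn.Upsample(size=(224, 224), mode="bilinear",
                                    align_corners=False)

    def forward(self, mhi, dmm_front, dmm_side, dmm_top):
        logits = []
        for stream, x in zip(self.streams,
                             (mhi, dmm_front, dmm_side, dmm_top)):
            logits.append(stream(self.upsample(x)))
        # Late fusion: average the per-stream class scores.
        return torch.stack(logits, dim=0).mean(dim=0)


# Usage with dummy 14 x 14 three-channel inputs (batch of 2):
model = FourStreamActionNet()
inputs = [torch.rand(2, 3, 14, 14) for _ in range(4)]
scores = model(*inputs)   # shape: (2, 27)
```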

Security analysis

Motivated by the recent work of McPherson et al. (2016) demonstrating the risk of re-identification, we propose in this paper to obfuscate the full image information using a position-based superpixel transformation. A scheme is considered secure if no adversary can break it with probability significantly greater than random guessing. We achieve a higher degree of obfuscation by accumulating random noise during the position-based superpixel transformation, which results in an extremely low resolution
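As an illustration of the kind of differential check used in such analyses (not the paper's exact test suite), the standard NPCR and UACI metrics between two obfuscated versions of the same image can be computed as follows; the function name and the reuse of the earlier obfuscate() sketch are assumptions for this example only.

```python
# Illustrative differential metrics between two obfuscated images.
import numpy as np

def npcr_uaci(img1: np.ndarray, img2: np.ndarray):
    """NPCR: percentage of differing pixels. UACI: mean absolute intensity
    difference normalised by 255, as a percentage."""
    a = img1.astype(np.float64)
    b = img2.astype(np.float64)
    npcr = 100.0 * np.mean(a != b)
    uaci = 100.0 * np.mean(np.abs(a - b) / 255.0)
    return npcr, uaci

# e.g. compare two obfuscations of the same secret image:
# npcr, uaci = npcr_uaci(obfuscate(frame), obfuscate(frame))
```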

Data overhead

The existing cloud-based paradigm requires the active involvement of a communication network for data transmission. Therefore, the data overhead should be kept very low so that a real-time system can be supported. State-of-the-art privacy-preserving human action recognition schemes utilize extremely low resolution images to achieve obfuscation, which also reduces the data expansion caused by huge multimedia dimensions. For example, a colored video V with 500 frames, each of size 224 × 224
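To put the saving in perspective, a rough uncompressed back-of-the-envelope comparison for the example above (assuming 8-bit channels, one RGB MHI, and three single-channel DMMs at 14 × 14) is:

```python
# Rough, uncompressed size comparison (8-bit channels assumed).
raw_video_bytes = 500 * 224 * 224 * 3            # 500 RGB frames of 224 x 224
obfuscated_bytes = 14 * 14 * 3 + 3 * 14 * 14     # one RGB MHI + three DMMs
print(raw_video_bytes / 1e6, obfuscated_bytes)   # ~75.26 MB vs 1176 bytes
```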

Discussion

In this paper, we have emphasized the development of a novel approach for secure human action recognition using a position-based superpixel transformation. The primary objective of the proposed approach is to obfuscate image information by resizing it into an extremely low resolution with added noise. This prevents the re-identification attack. A four-stream deep CNN is utilized for recognition, where each stream is based on a pretrained MobileNet. After a series of experiments, we found a

Conclusion

In this paper, a privacy-preserving human action recognition approach has been proposed. We have emphasized the feasibility of performing effective human action recognition with higher security and minimal storage overhead. First, recent vulnerabilities related to image obfuscation are discussed along with existing work and its problems. Next, image obfuscation is formulated by computing the MHI and DMMs of the underlying video sequences and transforming them to extremely low resolution with

Declaration of Competing Interest

All authors declare that they have no conflict of interest regarding the publication of this manuscript.

CRediT authorship contribution statement

Amitesh Singh Rajput: Conceptualization, Methodology, Writing - original draft. Balasubramanian Raman: Supervision, Validation, Resources. Javed Imran: Methodology, Writing - review & editing, Validation.

Acknowledgments

We would like to thank the editor and external reviewers for their thoughtful and detailed comments on our paper. We would also like to thank the Information Security Education and Awareness (ISEA) Project (phase II), MeitY, Government of India, for the necessary support.

References (42)

  • M. Ziaeefard et al., Semantic human activity recognition: A literature review, Pattern Recognition (2015).
  • A.F. Bobick et al., The recognition of human movement using temporal templates, IEEE Transactions on Pattern Analysis and Machine Intelligence (2001).
  • E. Bresson et al., A simple public-key cryptosystem with a double trapdoor decryption mechanism and its applications, International Conference on the Theory and Application of Cryptology and Information Security (2003).
  • C. Chen et al., UTD-MHAD: A multimodal dataset for human action recognition utilizing a depth camera and a wearable inertial sensor, International Conference on Image Processing (ICIP) (2015).
  • L. Chen et al., SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning, Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
  • E. Chou et al., Privacy-preserving action recognition for smart hospitals using low-resolution depth images.
  • J. Dai et al., Towards privacy-preserving recognition of human activities, International Conference on Image Processing (ICIP) (2015).
  • J. Deng et al., ImageNet: A large-scale hierarchical image database, Conference on Computer Vision and Pattern Recognition (CVPR) (2009).
  • N.E.D. Elmadany et al., Information fusion for human action recognition via biset/multiset globality locality preserving canonical correlation analysis, IEEE Transactions on Image Processing (2018).
  • S. Haas et al., Superpixel-based interest points for effective bags of visual words medical image retrieval, MICCAI International Workshop on Medical Content-Based Retrieval for Clinical Decision Support (2011).
  • K. He et al., Deep residual learning for image recognition, Conference on Computer Vision and Pattern Recognition (CVPR) (2016).