Efficient Analytics on Encrypted Data

Author: Gidon Gershinsky

SYSTOR '18: Proceedings of the 11th ACM International Systems and Storage Conference

Page 121

https://doi.org/10.1145/3211890.3211907

Published: 04 June 2018

Abstract

Enterprises and non-profit organizations work with sensitive commercial or personal information, which is stored in encrypted form because of business confidentiality requirements, GDPR regulations [4] and other reasons. Unfortunately, straightforward encryption does not work well with modern columnar data formats, such as Apache Parquet [2], that analytic frameworks rely on to accelerate data ingest and processing.

Parquet is a popular file format, widely used in cloud and on-premises data processing by Apache Spark [3], Impala [1] and other systems. Besides column-oriented storage, Parquet enables efficient data encoding, compression and fast access to field values through multi-level internal indexing and statistics. The latter capability is critical for so-called predicate push-down, where an analytic framework fetches and processes only a subset of the full data set, after analyzing Parquet metadata that narrows down the files and data pages relevant to a given query (predicate). Combined with column filtering, this can accelerate analytic workloads by orders of magnitude. However, if Parquet files are bulk-encrypted in storage, their internal modules cannot be extracted and parsed. All files in a requested folder must be delivered in full from storage to the analytic framework location, decrypted and authenticated there, and only then processed. An alternative is to decrypt the files at the storage side upon a read request. However, this makes the encryption keys and the data visible to the storage system and its administrators, and it still requires full decryption of every file in a folder before parsing becomes possible. A third option is to use the encryption client in storage SDKs, available in some clouds. But these clients do not support authenticated encryption for range reads, which predicate push-down requires, and they tie the solution to a specific cloud storage service, making it inapplicable to other clouds or on-premises data centers.
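To make the role of the statistics concrete, here is a minimal sketch of predicate push-down over per-chunk min/max values. It is not Parquet's actual metadata layout or reader API: the `ChunkStats` record, the `chunks_to_read` helper and the sample numbers are all hypothetical, chosen only to show why a reader needs to parse metadata before deciding which byte ranges to fetch.

```python
from dataclasses import dataclass


@dataclass
class ChunkStats:
    """Simplified, hypothetical metadata for one column chunk."""
    offset: int     # byte offset of the chunk inside the file
    length: int     # byte length of the chunk
    min_value: int  # smallest value stored in the chunk
    max_value: int  # largest value stored in the chunk


def chunks_to_read(stats, lo, hi):
    """Keep only chunks whose [min, max] range can overlap lo <= x <= hi."""
    return [s for s in stats if s.max_value >= lo and s.min_value <= hi]


# Three chunks of 1000 bytes each, with disjoint value ranges (made-up data).
stats = [
    ChunkStats(offset=0, length=1000, min_value=1, max_value=99),
    ChunkStats(offset=1000, length=1000, min_value=100, max_value=199),
    ChunkStats(offset=2000, length=1000, min_value=200, max_value=299),
]

# Predicate: 150 <= x <= 160. Only the middle chunk can contain matches.
selected = chunks_to_read(stats, lo=150, hi=160)
bytes_read = sum(s.length for s in selected)
total_bytes = sum(s.length for s in stats)
print(f"range reads: {[(s.offset, s.length) for s in selected]}")
print(f"fetched {bytes_read} of {total_bytes} bytes")
```

If the whole file were encrypted as a single blob, the statistics themselves would be unreadable until the full file had been fetched and decrypted, so no chunk could be skipped.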

We are working on a Parquet modular encryption mechanism [5] that supports authenticated data encryption and efficient filtering in any storage, without revealing the encryption key or the data to the storage system. The mechanism preserves Parquet's encoding, compression, columnar projection and indexing capabilities. It uses the internal modular structure of the format to encrypt each data and metadata component separately, updating the module references as needed, because authenticated encryption algorithms do not preserve the data length. Authentication support allows a reader to verify that a file has not been tampered with or replaced with an old version. We are working with the Apache Parquet community to contribute this mechanism to the open source project. Initially, the mechanism will enable a single encryption key for each file, with a choice of which columns to encrypt and which to leave as plaintext if they contain no sensitive data. Later, this approach will be extended to key-per-column and possibly key-per-row-group encryption. In parallel, we are integrating Apache Spark with the Parquet modular encryption mechanism, to enable Spark to work directly with encrypted data. The integration allows efficient Spark SQL analytics not only on clear-text Parquet files, but on encrypted data as well.
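The idea of encrypting modules separately and then fixing up the references can be illustrated with a toy sketch. This is not the Parquet modular encryption format and it is not secure: the cipher is a stand-in built from SHA-256 and HMAC so that the example needs only the standard library, whereas a real implementation would use a vetted authenticated cipher. All names (`seal`, `open_module`, `write_file`, `read_module`) and the sample columns are hypothetical, and the index is left in plaintext here for brevity.

```python
import hashlib
import hmac
import os

NONCE_LEN, TAG_LEN = 12, 32  # per-module overhead added by sealing


def _subkeys(key):
    """Derive separate encryption and MAC keys (toy derivation)."""
    return hashlib.sha256(b"enc" + key).digest(), hashlib.sha256(b"mac" + key).digest()


def _xor_stream(enc_key, nonce, data):
    """XOR data with a toy SHA-256 counter keystream. NOT a real cipher."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(enc_key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def seal(key, module_id, plaintext):
    """Encrypt then authenticate one module: nonce + ciphertext + tag."""
    enc_key, mac_key = _subkeys(key)
    nonce = os.urandom(NONCE_LEN)
    body = _xor_stream(enc_key, nonce, plaintext)
    tag = hmac.new(mac_key, module_id + nonce + body, hashlib.sha256).digest()
    return nonce + body + tag


def open_module(key, module_id, blob):
    """Verify the tag first, then decrypt; raise if the module was altered."""
    enc_key, mac_key = _subkeys(key)
    nonce, body, tag = blob[:NONCE_LEN], blob[NONCE_LEN:-TAG_LEN], blob[-TAG_LEN:]
    expected = hmac.new(mac_key, module_id + nonce + body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("module failed authentication")
    return _xor_stream(enc_key, nonce, body)


def write_file(key, modules, encrypt):
    """Seal the chosen modules; record offsets computed from the stored lengths."""
    data, index = b"", {}
    for name, payload in modules.items():
        stored = seal(key, name.encode(), payload) if name in encrypt else payload
        index[name] = (len(data), len(stored), name in encrypt)
        data += stored
    return data, index


def read_module(key, data, index, name):
    """Range-read a single module and authenticate it independently."""
    offset, length, encrypted = index[name]
    blob = data[offset:offset + length]
    return open_module(key, name.encode(), blob) if encrypted else blob


key = os.urandom(32)
modules = {"salary": b"100,200,300", "country": b"IL,US,DE"}
data, index = write_file(key, modules, encrypt={"salary"})
print(index)  # the sealed module is NONCE_LEN + TAG_LEN = 44 bytes longer
print(read_module(key, data, index, "salary"))
```

Because each module carries its own nonce and tag, a reader can fetch and verify one byte range without touching the rest of the file, and the index has to store the post-encryption offsets rather than the plaintext ones. Binding the module name into the tag is one simple way to detect modules being swapped inside a file; detecting replacement of a whole file with an older version needs additional measures that this sketch does not attempt.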

References

[1] Apache Software Foundation. 2018. Apache Impala. https://impala.apache.org/. (2018).

[2] Apache Software Foundation. 2018. Apache Parquet. https://parquet.apache.org/. (2018).

[3] Apache Spark. 2018. Spark SQL and DataFrames. https://spark.apache.org/sql/. (2018).

[4] EU. 2018. General Data Protection Regulation. https://gdpr-info.eu/. (2018).

[5] Gidon Gershinsky. 2018. Parquet Modular Encryption Jira. https://issues.apache.org/jira/browse/PARQUET-1178. (2018).

Index Terms

Efficient Analytics on Encrypted Data
1. Security and privacy
  1. Database and storage security
    1. Management and querying of encrypted data

Published In

SYSTOR '18: Proceedings of the 11th ACM International Systems and Storage Conference

June 2018

144 pages

ISBN:9781450358491

DOI:10.1145/3211890

General Chair: David Breitgand (IBM Research)
Program Chairs: Gala Yadgar (Technion), Donald E. Porter (University of North Carolina at Chapel Hill)
Publications Chair: Ittay Eyal (Technion)

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper
Research
Refereed limited

Conference

SYSTOR '18

Sponsor:

Technion
SIGOPS
USENIX Assoc

SYSTOR '18: International Systems and Storage Conference

June 4 - 7, 2018

Haifa, Israel

Acceptance Rates

Overall Acceptance Rate 108 of 323 submissions, 33%
