Multi-core parallel algorithms for hiding high-utility sequential patterns
Introduction
Data is gold if people can extract useful information from it. In information technology, data can be easily collected from various sources such as the internet, sensors, and digital devices. Data quantity and, especially, data quality play crucial roles in any data mining or machine learning algorithm: algorithms given precise input data can produce meaningful mining results. Along with data availability, data sharing is one of the cornerstones of modern science, enabling large-scale analyses and reproducibility [1]. In some cases, encrypting a dataset with cryptographic algorithms before delivery and decrypting it for use may reduce the availability and benefits of data sharing. More importantly, the sensitive information remains in the decrypted data, which poses risks if the data is stolen or used illegally. In strategic alliances, companies need to share information with others while protecting their business confidentiality. Thus, it is necessary to design privacy-preserving algorithms that modify quantitative datasets containing sensitive patterns so that the utility of those patterns falls below a given threshold, while preventing reconstruction of the original dataset from the sanitized one [2].
Nowadays, the primary characteristics of big data, namely velocity, variety, and volume, are growing significantly, making the demand for data analysis and data mining more urgent than ever. However, this growth also brings greater risks of data leakage and security breaches. For example, if a company wants to share its data with an analytics or software development company, its policies and contracts may not protect the sensitive information from third parties. Furthermore, leaking the private information of customers or partners, or bank account details, may bring an end to a business. Therefore, data protection and privacy are critical issues in data management, mining, and analysis. The existing methods can be classified into two main categories: data hiding and knowledge hiding. Data hiding is also known as Privacy-Preserving Data Publishing (PPDP) [3]. It applies basic security methods such as encryption, randomization, and anonymization to transform the raw data into modified versions. However, such methods may reduce data usability and lead to inaccurate or non-retrievable knowledge for mining algorithms. They may therefore cause the loss of the veracity, variability, and value characteristics of big data whenever they are applied. Moreover, data mining algorithms not only discover knowledge but may also disclose sensitive information from the data, which increases the risk of private knowledge hidden in huge datasets being exposed. This problem may lead to serious threats when competitors obtain confidential information [4].
Knowledge hiding is also known as Privacy-Preserving Data Mining (PPDM), which aims to protect the results produced by data mining algorithms. In general, the mining results can be association rules, knowledge patterns, high-utility sequential patterns, clusters, and classifications. PPDM comprises the methodologies used to sanitize sensitive information in the original database. The utility-based mining (UBM) problem was proposed in the past decade and has become an extensive field of study. UBM is widely applied in real-life analytical and mining applications. Thus, in recent years, preserving the privacy of data against UBM algorithms has become a critical challenge. Privacy-Preserving Utility Mining (PPUM) is emerging as a sub-topic of PPDM. PPUM combines UBM and privacy-preserving methods to protect the privacy of the data [5], [6].
High-utility sequential pattern mining (HUSPM) algorithms are designed to discover patterns that have high utility values (e.g., cost or profit) in quantitative sequence datasets (QSDs). In contrast, high-utility sequential pattern hiding (HUSPH) algorithms aim to hide high-utility sequential patterns (HUSPs) in QSDs so that they cannot be discovered by HUSPM algorithms [6]. HUSPH is a topic within PPUM. The key problem of HUSPH is to design a hiding algorithm that maximizes privacy while preserving the usability of the data as much as possible. In other words, analytical or mining tasks must still work well on the sanitized datasets produced by a HUSPH algorithm. HUSPH algorithms can be adapted for privacy-preserving tasks wherever HUSPM is applied, such as economics, healthcare, and the stock market. Recently, several algorithms have been proposed for the HUSPH problem [7], [8], [9], [10], [11]. Among them, HUS-Hiding [10] and FH-HUSP [11] are the latest, and they outperform the other algorithms for this topic. Although the literature on this topic abounds, there is no perfect model for all hiding tasks. In addition, interest in this topic shows no sign of waning. Thus, we focus on designing new HUSPH algorithms in this paper. This work is motivated by the following observations:
Observation 1 In terms of performance, previous algorithms were designed to hide HUSPs sequentially, which leads to long execution times and high memory consumption, especially on large-scale datasets. Therefore, a parallel approach that speeds up the hiding process and reduces memory usage is necessary for this task. The proposed parallel approach uses multi-core processors, which are a popular central processing unit (CPU) architecture nowadays. This approach improves performance and increases the ability of the proposed hiding algorithms to scale to large-scale datasets.
Observation 2 The previous algorithms produce a high missing cost, which means that the difference between the original and the sanitized dataset is still large. Thus, designing a HUSPH algorithm that achieves a good trade-off between privacy and the missing cost is a non-trivial task. In addition, more privacy criteria should be used to evaluate the effectiveness of HUSPH algorithms on the output.
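The multi-core idea in Observation 1 can be illustrated with a small data-parallel sketch. It is not the parallel scheme of the proposed algorithms, whose details are not given in this section. It only shows how per-sequence utility computations can be split across workers and then summed. The function names, the `(item, quantity)` sequence layout, and the profit table are assumptions made for illustration. The utility of a pattern in a sequence is taken as the maximum over its occurrences, a common convention in the HUSPM literature.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def sequence_utility(pattern, profit, qseq):
    """Maximum utility of `pattern` over all its occurrences in one sequence.

    `qseq` is a list of (item, quantity) pairs and `pattern` is a list of
    items (hypothetical, simplified layout). Returns 0 if there is no match.
    """
    # best[j] holds the best utility found so far for the first j pattern items.
    best = [0] + [None] * len(pattern)
    for item, qty in qseq:
        # Descending j so one sequence element fills at most one pattern slot.
        for j in range(len(pattern), 0, -1):
            if best[j - 1] is not None and item == pattern[j - 1]:
                cand = best[j - 1] + qty * profit[item]
                if best[j] is None or cand > best[j]:
                    best[j] = cand
    return best[-1] or 0


def chunk_utility(pattern, profit, chunk):
    """Sum of pattern utilities over one chunk of sequences."""
    return sum(sequence_utility(pattern, profit, s) for s in chunk)


def parallel_utility(dataset, pattern, profit, workers=4,
                     executor_cls=ThreadPoolExecutor):
    """Split the dataset into chunks, evaluate them concurrently, sum the results."""
    size = max(1, -(-len(dataset) // workers))  # ceiling division
    chunks = [dataset[i:i + size] for i in range(0, len(dataset), size)]
    with executor_cls(max_workers=workers) as pool:
        return sum(pool.map(partial(chunk_utility, pattern, profit), chunks))


# Toy data (assumed): unit profits per item and two quantitative sequences.
profit = {"a": 2, "b": 3, "c": 1}
dataset = [
    [("a", 2), ("b", 1), ("c", 3)],
    [("a", 1), ("a", 3), ("c", 2)],
]
total = parallel_utility(dataset, ["a", "c"], profit, workers=2)
```

In CPython, threads do not run CPU-bound code on several cores at once, so passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` would be the way to use multiple cores. The thread pool is the default here only to keep the sketch easy to run.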
The above observations motivate the design, in this work, of HUSPH algorithms that achieve higher performance and fewer side effects. The main contributions of this paper are as follows. To the best of our knowledge, this is the first work that takes advantage of multi-core processors and parallelization to reduce computational cost and improve the scalability of HUSPH algorithms, making them applicable to large-scale datasets. A new data structure named Pattern Utility Set for Hiding (PUSH) is designed to enhance the hiding process. Three algorithms named USHPA, USHP, and USHR are proposed to address the limitations of previous algorithms. We also propose a metric called the privacy factor (PF) to measure the quality of the hiding results. Comparative experiments were conducted on real datasets to compare the proposed algorithms with several state-of-the-art HUSPH algorithms in terms of execution time, memory consumption, scalability, and side effects. The results show that the three proposed algorithms outperform the previous HUSPH algorithms on all metrics. From the application viewpoint, the proposed algorithms provide a tool for preserving private information in scenarios such as banking, healthcare, and online shopping. In particular, users can use this tool to sanitize data before sharing it for scientific and industrial purposes.
The remainder of this paper is structured as follows. Related work on HUSPM and HUSPH is presented in Section 2. The background and preliminaries are provided in Section 3. The proposed USHPA, USHP, and USHR algorithms are introduced in Section 4. The comparative experiments are presented in Section 5. Finally, conclusions and future work are given in Section 6.
Related work
Background of HUSPM
The problem of HUSPM has been considered in previous studies [10], [11], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [28], [31], [32], [53]. A quantitative sequence dataset contains a set of quantitative sequences (sequences for short), each of which is associated with a sequence identity. For example, Table 3 shows an example of a quantitative sequence dataset that contains five sequences and their identities. This dataset is
The proposed HUSPH algorithms
The PAS structure and the pattern utility set for mining improve the performance of HUSPM compared with the LQS-Tree and the utility matrix. Building on the mining structure proposed in [45], we designed a new data structure called Pattern Utility Set for Hiding (PUSH) and employed it in the proposed hiding algorithms.
Experiment
Experiments were performed on a workstation running Windows 10 with an Intel Xeon W3520 central processing unit (CPU) and 8 GB of main memory. This CPU has four physical cores running at 2.67 GHz each and supports eight threads. We used the C# programming language, the Task Parallel Library in the Microsoft .NET Framework, and Microsoft Visual Studio 2019 Community to implement the proposed and compared algorithms. The executable file of the proposed algorithms and
Conclusion
In this paper, we proposed three algorithms named USHPA, USHP, and USHR to efficiently hide all HUSPs in quantitative sequence datasets. The proposed algorithms rely on a data structure named Pattern Utility Set for Hiding (PUSH). This structure keeps the essential information required for the modification. The proposed algorithms apply the separate hiding model, which first uses a mining algorithm to collect the set of HUSPs and then modifies them until their utilities are lower than the given utility threshold. USHPA is a
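The modification step of the separate hiding model can be sketched for a single, already mined pattern. This is a naive greedy illustration, not USHPA, USHP, or USHR, and it does not use the PUSH structure. The function names, the `(item, quantity)` sequence layout, the choice of which quantity to reduce, and the assumption of strictly positive profits and quantities are all made up for the example.

```python
def best_occurrence(pattern, qseq, profit):
    """Return (utility, positions) of the highest-utility occurrence of
    `pattern` in `qseq`, or (0, ()) if the pattern does not occur.

    `qseq` is a list of (item, quantity) pairs; `pattern` is a list of items.
    """
    best = [(0, ())] + [None] * len(pattern)
    for idx, (item, qty) in enumerate(qseq):
        # Descending j so one sequence element fills at most one pattern slot.
        for j in range(len(pattern), 0, -1):
            if best[j - 1] is not None and item == pattern[j - 1]:
                cand = best[j - 1][0] + qty * profit[item]
                if best[j] is None or cand > best[j][0]:
                    best[j] = (cand, best[j - 1][1] + (idx,))
    return best[-1] if best[-1] is not None else (0, ())


def hide_pattern(pattern, dataset, profit, min_util):
    """Reduce quantities in place until the pattern utility is below `min_util`.

    Assumes positive profits and quantities. Returns the final utility.
    """
    if min_util <= 0:
        raise ValueError("min_util must be positive")
    while True:
        occs = [best_occurrence(pattern, s, profit) for s in dataset]
        total = sum(u for u, _ in occs)
        if total < min_util:
            return total
        # Modify the sequence that contributes the most utility.
        k = max(range(len(dataset)), key=lambda i: occs[i][0])
        seq, positions = dataset[k], occs[k][1]
        # Within its best occurrence, reduce the element with the largest utility.
        p = max(positions, key=lambda i: seq[i][1] * profit[seq[i][0]])
        item, qty = seq[p]
        if qty > 1:
            seq[p] = (item, qty - 1)
        else:
            del seq[p]  # quantity would reach zero: drop the element


# Invented toy data for illustration.
profit = {"a": 2, "b": 3, "c": 1}
dataset = [
    [("a", 2), ("b", 1), ("c", 3)],
    [("a", 1), ("a", 3), ("c", 2)],
]
final_utility = hide_pattern(["a", "c"], dataset, profit, min_util=10)
```

Each iteration removes one unit of quantity, so the loop terminates. How many units are removed, and from which items, determines how far the sanitized dataset drifts from the original, which is the trade-off that a hiding algorithm has to manage.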
CRediT authorship contribution statement
Ut Huynh: Data curation, Methodology, Experiment, Writing, Editing. Bac Le: Methodology, Reviewing, Editing, Funding acquisition. Duy-Tai Dinh: Methodology, Experiment, Writing, Editing. Hamido Fujita: Writing – review & editing.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This research is funded by Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2018.307.
References (62)
- et al. HHUIF and MSICF: Novel algorithms for privacy preserving utility mining. Expert Syst. Appl. (2010)
- et al. Privacy preserving mining of association rules. Inf. Syst. (2004)
- et al. An efficient algorithm for hiding high utility sequential patterns. Internat. J. Approx. Reason. (2018)
- et al. Applying the maximum utility measure in high utility sequential pattern mining. Expert Syst. Appl. (2014)
- et al. On incremental high utility sequential pattern mining. ACM Trans. Intell. Syst. Technol. (TIST) (2018)
- et al. ProUM: Projection-based utility mining on sequence data. Inform. Sci. (2020)
- et al. FMaxCloHUSM: An efficient algorithm for mining frequent closed and maximal high utility sequences. Eng. Appl. Artif. Intell. (2019)
- et al. Interactive mining of high utility patterns over data streams. Expert Syst. Appl. (2012)
- et al. Mining top-k high utility patterns over data streams. Inform. Sci. (2014)
- et al. Distributed mining of high utility time interval sequential patterns using MapReduce approach. Expert Syst. Appl. (2020)
- A pure array structure and parallel strategy for high-utility sequential pattern mining. Expert Syst. Appl.
- Fast algorithms for hiding sensitive high-utility itemsets in privacy-preserving utility mining. Eng. Appl. Artif. Intell.
- A fast perturbation algorithm using tree structure for privacy preserving utility mining. Expert Syst. Appl.
- Parallel skyline computation on multicore architectures. Inf. Syst.
- Parallel online spatial and temporal aggregations on multi-core CPUs and many-core GPUs. Inf. Syst.
- Improving matrix-based dynamic programming on massively parallel accelerators. Inf. Syst.
- TopPI: An efficient algorithm for item-centric mining. Inf. Syst.
- Clustering mixed numerical and categorical data with missing values. Inform. Sci.
- Data sharing practices and data availability upon request differ across scientific disciplines. Sci. Data
- Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv.
- Privacy preserving utility mining: A survey.
- A survey of privacy preserving utility mining.
- A novel approach for hiding high utility sequential patterns.
- MHHUSP: An integrated algorithm for mining and hiding high utility sequential patterns.
- An approach to decrease execution time and difference for hiding high utility sequential patterns.
- A fast algorithm for hiding high utility sequential patterns.
- Hiding periodic high-utility sequential patterns.
- A survey of sequential pattern mining. Data Sci. Pattern Recognit.
- Mining sequential patterns.
- A novel approach for mining high-utility sequential patterns in sequence databases. ETRI J.
- USPan: an efficient algorithm for mining high utility sequential patterns.