Elsevier

Knowledge-Based Systems

Volume 237, 15 February 2022, 107793

Multi-core parallel algorithms for hiding high-utility sequential patterns

https://doi.org/10.1016/j.knosys.2021.107793

Abstract

High-utility sequential pattern mining (HUSPM) is used in many domains, such as retail, market basket analysis, click-stream analysis, healthcare data analysis, and bioinformatics. HUSPM algorithms discover useful information from data. However, sensitive patterns can also be exposed to competitors who run a HUSPM algorithm on leaked data. High-utility sequential pattern hiding (HUSPH) is therefore used to protect private information from HUSPM algorithms. This paper proposes three algorithms named High Utility Sequential Pattern Hiding Using Pure Array Structure (USHPA), High Utility Sequential Pattern Hiding Using Parallel Strategy (USHP), and High Utility Sequential Pattern Hiding Using Random Distribution Strategy (USHR) for hiding high-utility sequential patterns in quantitative sequence datasets. These algorithms use a proposed data structure named Pattern Utility Set for Hiding (PUSH) to speed up the hiding process. We also introduce a metric called Privacy Factor to evaluate the quality of hiding results. Comparative experiments were conducted on real datasets to evaluate the performance of the proposed algorithms in terms of runtime, memory consumption, scalability, missing cost, and privacy factor. Results show that the proposed algorithms efficiently sanitize the input datasets and outperform the compared algorithms on all metrics.

Introduction

Data is gold if people can extract useful information from it. Data can be easily collected from various sources such as the internet, sensors, and digital devices. Data quantity and, especially, data quality play crucial roles in any data mining or machine learning algorithm. Algorithms with precise input data can produce meaningful mining results. Along with the availability of data, data sharing is one of the cornerstones of modern science, enabling large-scale analyses and reproducibility [1]. In some cases, encrypting a dataset before delivery and decrypting it for use may reduce the availability and benefits of data sharing. More importantly, the sensitive information remains in the decrypted data, which creates risks if the data is stolen or used illegally. In strategic alliances, companies need to share information with others while protecting their business confidentiality. Thus, it is necessary to design privacy-preserving algorithms that modify quantitative datasets containing sensitive patterns so that the utility of those patterns falls below a given threshold, while preventing reconstruction of the original dataset from the sanitized one [2].

Nowadays, the primary characteristics of big data, including velocity, variety, and volume, are growing significantly, making the demand for data analysis and data mining more urgent than ever. However, this growth brings more risks of data leaks and security breaches. For example, if a company wants to share its data with an analytics or software development company, its policies and contracts may not protect the sensitive information from third parties. Furthermore, leaking private information about customers, partners, or bank accounts may bring an end to a business. Therefore, data protection and privacy are critical issues in data management, mining, and analysis. The existing methods can be classified into two main categories: data hiding and knowledge hiding. Data hiding is also known as Privacy-Preserving Data Publishing (PPDP) [3]. It applies basic security methods such as encryption, randomization, and anonymization to transform the raw data into modified versions. However, such methods may reduce data usability and lead to inaccurate or non-retrievable knowledge for mining algorithms. They may therefore cause a loss of the veracity, variability, and value characteristics of big data whenever they are applied. Moreover, data mining algorithms not only discover knowledge but may also disclose sensitive information in the data, which increases the risk that private knowledge within huge datasets is exposed. This problem may lead to serious threats when competitors obtain confidential information [4].

Knowledge hiding, also known as Privacy-Preserving Data Mining (PPDM), aims to protect mining results from data mining algorithms. In general, the mining results can be association rules, knowledge patterns, high-utility sequential patterns, clusters, and classifications. PPDM comprises the set of methodologies used to sanitize sensitive information in the original database. The utility-based mining (UBM) problem was proposed in the past decade and has become an extensive field of study. UBM is widely applied in various real-life analytical and mining applications. Thus, in recent years, preserving the privacy of data against UBM algorithms has become a critical challenge. Privacy-Preserving Utility Mining (PPUM) is emerging as a sub-topic of PPDM. PPUM combines UBM and privacy-preserving methods to keep data private [5], [6].

HUSPM algorithms are designed to discover patterns that have high utility values (e.g., cost or profit) in quantitative sequence datasets (QSDs). In contrast, HUSPH algorithms aim to hide high-utility patterns in QSDs so that they cannot be discovered by HUSPM algorithms [6]. HUSPH is a topic within PPUM. The key problem of HUSPH is to design a hiding algorithm that maximizes privacy while maintaining the usability of the data as much as possible. In other words, analytical or mining tasks must still work well on the sanitized datasets produced by a HUSPH algorithm. HUSPH algorithms can be adapted for privacy-preserving tasks wherever HUSPM is applied, such as economics, healthcare, and the stock market. Recently, several algorithms have been proposed for the HUSPH problem [7], [8], [9], [10], [11]. Among them, HUS-Hiding [10] and FH-HUSP [11] are the latest algorithms, and they outperform the other algorithms for this topic. Although the literature on this topic abounds, there is no perfect model for all hiding tasks. In addition, interest in this topic in the literature shows no sign of waning. Thus, we focused on designing new HUSPH algorithms in this paper. This work is motivated by the following observations:

Observation 1

In terms of performance, previous algorithms were designed to hide HUSPs sequentially, which leads to long runtimes and high memory consumption, especially on large-scale datasets. Therefore, a parallel approach that speeds up the hiding process and reduces memory usage is necessary for this task. The proposed parallel approach uses multi-core processors, which are a popular central processing unit (CPU) architecture nowadays. This approach improves performance and makes the proposed hiding algorithms more scalable on large-scale datasets.
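As a rough illustration of the data-parallel idea in this observation, the following Python sketch splits a dataset into chunks, processes each chunk in a worker, and merges the partial results. It is not the authors' implementation; the data layout (sequences as lists of item and quantity pairs), the `profit` table, and all function names are assumptions made for this example. Threads are used here for portability; in CPython, CPU-bound work would need `ProcessPoolExecutor` instead to actually run on several cores.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Assumed layout: a sequence is a list of (item, quantity) pairs.
Sequence = List[Tuple[str, int]]


def chunk_utility(chunk: List[Sequence], profit: Dict[str, int]) -> int:
    """Sum quantity * unit profit over every item in a chunk of sequences."""
    return sum(qty * profit[item] for seq in chunk for item, qty in seq)


def parallel_dataset_utility(
    dataset: List[Sequence], profit: Dict[str, int], workers: int = 4
) -> int:
    """Compute the total utility of a dataset by partitioning it across workers."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Ceiling division so that at most `workers` chunks are created.
    size = max(1, -(-len(dataset) // workers))
    chunks = [dataset[i:i + size] for i in range(0, len(dataset), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each worker handles one chunk; partial sums are merged at the end.
        partials = pool.map(chunk_utility, chunks, [profit] * len(chunks))
        return sum(partials)
```

Because each chunk is processed independently and only the partial sums are combined, the same partition-and-merge structure applies to any per-sequence computation that needs no shared mutable state.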

Observation 2

The previous algorithms produce a high missing cost, which means that the difference between the original and the sanitized dataset is still large. Thus, designing a HUSPH algorithm that achieves a good trade-off between privacy and the missing cost is a non-trivial task. In addition, more privacy criteria should be used to evaluate the effectiveness of HUSPH algorithms on the output.

The above observations motivate the design of HUSPH algorithms in this work that achieve higher performance and fewer side effects. The main contributions of this paper are as follows. To the best of our knowledge, this is the first work that takes advantage of multi-core processors and parallelization to reduce computational complexity and improve the scalability of HUSPH algorithms, making them applicable to large-scale datasets. A new data structure named Pattern Utility Set for Hiding (PUSH) was designed to enhance the hiding process. Three algorithms named USHPA, USHP, and USHR were proposed to address the limitations of previous algorithms. We also proposed a metric called privacy factor (PF) to measure the quality of the hiding results. Comparative experiments were conducted on real datasets to compare the proposed algorithms with several state-of-the-art HUSPH algorithms regarding execution time, memory consumption, scalability, and side effects. The results show that the three proposed algorithms outperform the previous HUSPH algorithms on all metrics. From the application viewpoint, the proposed algorithms provide a tool for preserving private information in scenarios such as banking, healthcare, and online shopping. In particular, users can use this tool to sanitize data before sharing it for scientific and industrial purposes.

The remainder of this paper is structured as follows. Related work on HUSPM and HUSPH is presented in Section 2. The background and preliminaries are provided in Section 3. The proposed USHPA, USHP, and USHR algorithms are introduced in Section 4. The comparative experiments are presented in Section 5. Finally, conclusions and future work are given in Section 6.

Section snippets

Related work

Background of HUSPM

The problem of HUSPM has been considered in previous studies [10], [11], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26], [28], [31], [32], [53]. A quantitative sequence dataset $QSD$ contains a set of quantitative sequences (sequences for short) such that $QSD = \{S_1, S_2, \ldots, S_z\}$, where $S_y$ ($1 \le y \le z$) is a sequence whose identifier is $y$. For example, Table 3 shows a quantitative sequence dataset QSD that contains five sequences and their identifiers. This dataset is

The proposed HUSPH algorithms

The PAS structure and the pattern utility set for mining improve the performance of HUSPM compared with the LQS-Tree and the utility matrix. Building on the mining structure proposed in [45], we designed a new data structure called Pattern Utility Set for Hiding (PUSH) and employed it in the proposed hiding algorithms.

Experiment

Experiments were performed on a workstation running Windows 10 with an Intel Xeon W3520 central processing unit (CPU) and 8 GB of main memory. This CPU has four physical cores, each running at 2.67 GHz, and supports eight hardware threads. We used the C# programming language, the Task Parallel Library in the Microsoft .NET Framework, and Microsoft Visual Studio 2019 Community to implement the proposed and compared algorithms. The executable file of the proposed algorithms and

Conclusion

In this paper, we proposed three algorithms named USHPA, USHP, and USHR to hide all HUSPs in quantitative sequence datasets efficiently. The proposed algorithms rely on a data structure named Pattern Utility Set for Hiding (PUSH). This structure keeps the essential information required for the modification. The proposed algorithms apply the separate hiding model, which first uses a mining algorithm to collect the set of HUSPs and then modifies the dataset until their utilities are lower than λ. USHPA is a
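The separate hiding model mentioned in this snippet (collect the patterns first, then modify the data until each utility drops below λ) can be sketched roughly as follows in Python. This is a simplified greedy illustration, not USHPA, USHP, or USHR, and it does not use the PUSH structure. It assumes that each sequence position holds a single item, that a pattern's utility in a sequence is the maximum over its occurrences, that unit profits are positive, and that the threshold is positive; the rule for choosing which quantity to reduce and all names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Sequence = List[Tuple[str, int]]  # assumed layout: (item, quantity) pairs
Occurrence = Tuple[int, List[int]]  # (utility, matched positions)


def best_occurrence(
    seq: Sequence, pattern: List[str], profit: Dict[str, int]
) -> Optional[Occurrence]:
    """Return the highest-utility occurrence of `pattern` in `seq`, or None."""
    # best[j] holds the best match of pattern[:j] among positions seen so far.
    best: List[Optional[Occurrence]] = [(0, [])] + [None] * len(pattern)
    for pos, (item, qty) in enumerate(seq):
        # Iterate j downwards so one position is never matched twice.
        for j in range(len(pattern), 0, -1):
            prev = best[j - 1]
            if pattern[j - 1] == item and prev is not None:
                cand = prev[0] + qty * profit[item]
                if best[j] is None or cand > best[j][0]:
                    best[j] = (cand, prev[1] + [pos])
    return best[len(pattern)]


def pattern_utility(
    dataset: List[Sequence], pattern: List[str], profit: Dict[str, int]
) -> int:
    """Sum the pattern's best-occurrence utility over all sequences."""
    occs = (best_occurrence(seq, pattern, profit) for seq in dataset)
    return sum(occ[0] for occ in occs if occ is not None)


def hide_pattern(
    dataset: List[Sequence],
    pattern: List[str],
    profit: Dict[str, int],
    threshold: int,
) -> List[Sequence]:
    """Return a sanitized copy in which the pattern's utility is below threshold."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    sanitized = [list(seq) for seq in dataset]  # leave the input untouched
    while pattern_utility(sanitized, pattern, profit) >= threshold:
        # Pick the sequence that currently contributes the most utility.
        found = [(best_occurrence(s, pattern, profit), i)
                 for i, s in enumerate(sanitized)]
        found = [(occ, i) for occ, i in found if occ is not None]
        (_, positions), idx = max(found, key=lambda pair: pair[0][0])
        seq = sanitized[idx]
        # Reduce the matched item with the highest unit profit by one unit.
        pos = max(positions, key=lambda p: profit[seq[p][0]])
        item, qty = seq[pos]
        if qty > 1:
            seq[pos] = (item, qty - 1)
        else:
            del seq[pos]  # a zero quantity means the item is removed
    return sanitized
```

Each iteration removes one unit of quantity, so the loop terminates. Reducing one unit at a time is deliberately naive; it only shows the stopping condition of the model, not an efficient way to reach it.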

CRediT authorship contribution statement

Ut Huynh: Data curation, Methodology, Experiment, Writing, Editing. Bac Le: Methodology, Reviewing, Editing, Funding acquisition. Duy-Tai Dinh: Methodology, Experiment, Writing, Editing. Hamido Fujita: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 102.05-2018.307.

References (62)

  • Bac Le et al. A pure array structure and parallel strategy for high-utility sequential pattern mining. Expert Syst. Appl. (2018)
  • Jerry Chun-Wei Lin et al. Fast algorithms for hiding sensitive high-utility itemsets in privacy-preserving utility mining. Eng. Appl. Artif. Intell. (2016)
  • Unil Yun et al. A fast perturbation algorithm using tree structure for privacy preserving utility mining. Expert Syst. Appl. (2015)
  • Hyeonseung Im et al. Parallel skyline computation on multicore architectures. Inf. Syst. (2011)
  • Jianting Zhang et al. Parallel online spatial and temporal aggregations on multi-core CPUs and many-core GPUs. Inf. Syst. (2014)
  • David Bednárek et al. Improving matrix-based dynamic programming on massively parallel accelerators. Inf. Syst. (2017)
  • Vincent Leroy et al. TopPI: An efficient algorithm for item-centric mining. Inf. Syst. (2017)
  • Duy-Tai Dinh et al. Clustering mixed numerical and categorical data with missing values. Inform. Sci. (2021)
  • Leho Tedersoo et al. Data sharing practices and data availability upon request differ across scientific disciplines. Sci. Data (2021)
  • Benjamin C.M. Fung et al. Privacy-preserving data publishing: A survey of recent developments. ACM Comput. Surv. (2010)
  • Wensheng Gan et al. Privacy preserving utility mining: A survey
  • Duy-Tai Dinh et al. A survey of privacy preserving utility mining
  • Tai Dinh et al. A novel approach for hiding high utility sequential patterns
  • Minh Nguyen Quang et al. MHHUSP: An integrated algorithm for mining and hiding high utility sequential patterns
  • Minh Nguyen Quang et al. An approach to decrease execution time and difference for hiding high utility sequential patterns
  • Chunkai Zhang et al. A fast algorithm for hiding high utility sequential patterns
  • Ut Huynh et al. Hiding periodic high-utility sequential patterns
  • Philippe Fournier-Viger et al. A survey of sequential pattern mining. Data Sci. Pattern Recognit. (2017)
  • Rakesh Agrawal et al. Mining sequential patterns
  • Chowdhury Farhan Ahmed et al. A novel approach for mining high-utility sequential patterns in sequence databases. ETRI J. (2010)
  • Junfu Yin et al. USpan: An efficient algorithm for mining high utility sequential patterns
