An improved approximation algorithm for the minimum common integer partition problem

https://doi.org/10.1016/j.ic.2021.104784

Abstract

Given a collection of multisets $\{X_1, X_2, \ldots, X_k\}$ ($k \ge 2$) of positive integers, a multiset $S$ is a common integer partition (CIP) for them if $S$ is an integer partition of every multiset $X_i$, $1 \le i \le k$. The minimum common integer partition (k-MCIP) problem is to find a CIP for $\{X_1, X_2, \ldots, X_k\}$ with the minimum cardinality. We present a 6/5-approximation algorithm for the 2-MCIP problem, improving the previous best algorithm of performance ratio 5/4 designed by Chen et al. in 2006. We then extend it to obtain an absolute $0.6k$-approximation algorithm for k-MCIP when $k$ is even (when $k$ is odd, the approximation ratio is $0.6k + 0.4$).

Introduction

The minimum common integer partition (MCIP) problem was introduced to the computational biology community by Chen et al. [7], formulated from their work on ortholog assignment and DNA fingerprint assembly. Mathematically, a partition of a positive integer $x$ is a multiset $\sigma(x) = \{a_1, a_2, \ldots, a_t\}$ of positive integers such that $a_1 + a_2 + \cdots + a_t = x$, where each $a_i$ is called a part of the partition of $x$ [2], [3]. For example, $\{3,2,2,1\}$ is a partition of $x = 8$; so is $\{6,1,1\}$. A partition of a multiset $X$ of positive integers is the multiset union of the partitions $\sigma(x)$ for all $x$ of $X$, i.e. $\sigma(X) = \biguplus_{x \in X} \sigma(x)$. For example, as $\{3,2,2,1\}$ is a partition of $x_1 = 8$ and $\{3,2\}$ is a partition of $x_2 = 5$, $\{3,3,2,2,2,1\}$ is a partition for $X = \{8,5\}$.
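The multiset-union definition can be illustrated with a minimal Python sketch. The function name `partition_of_multiset` and its input layout are assumptions made for illustration; they do not come from the paper.

```python
from collections import Counter


def partition_of_multiset(parts_by_element):
    """Combine per-element partitions sigma(x) into sigma(X) by multiset union.

    parts_by_element: list of (x, parts) pairs, where parts must be
    positive integers summing to x.
    """
    combined = Counter()
    for x, parts in parts_by_element:
        if sum(parts) != x or any(p <= 0 for p in parts):
            raise ValueError(f"{parts} is not a partition of {x}")
        combined.update(parts)  # multiset union keeps multiplicities
    return combined


# The example from the text: X = {8, 5}.
sigma_X = partition_of_multiset([(8, [3, 2, 2, 1]), (5, [3, 2])])
print(sorted(sigma_X.elements(), reverse=True))  # [3, 3, 2, 2, 2, 1]
```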

Given a collection of multisets $\{X_1, X_2, \ldots, X_k\}$ ($k \ge 2$), a multiset $S$ is a common integer partition (CIP) for them if $S$ is an integer partition of every multiset $X_i$, $1 \le i \le k$. For example, when $k = 2$, $X_1 = \{8,5\}$ and $X_2 = \{6,4,3\}$, $\{3,3,2,2,2,1\}$ is a CIP for them since $\{3,3,2,2,2,1\}$ is also a partition for $X_2 = \{6,4,3\}$: $3+3 = 6$, $2+2 = 4$, and $2+1 = 3$. The minimum common integer partition (MCIP) problem is to find a CIP for $\{X_1, X_2, \ldots, X_k\}$ with the minimum cardinality. For example, one can verify that, for the above $X_1 = \{8,5\}$ and $X_2 = \{6,4,3\}$, $\{6,3,2,2\}$ is a minimum cardinality CIP. We use k-MCIP to denote the restricted version of the MCIP problem when the number of input multisets is fixed to be $k$.
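For very small instances, the definitions above can be checked by exhaustive search. The sketch below is an exponential-time brute force meant only to confirm the running example; it is not the algorithm of this paper, and all function names are illustrative assumptions.

```python
def is_partition_of(S, X):
    """True if multiset S can be split into groups summing to the elements of X."""
    if sum(S) != sum(X):
        return False
    remaining = list(X)               # capacity still to be filled for each x
    parts = sorted(S, reverse=True)   # place large parts first to prune early

    def place(i):
        if i == len(parts):
            return all(r == 0 for r in remaining)
        tried = set()                 # skip bins with identical remaining capacity
        for j, r in enumerate(remaining):
            if r >= parts[i] and r not in tried:
                tried.add(r)
                remaining[j] -= parts[i]
                if place(i + 1):
                    return True
                remaining[j] += parts[i]
        return False

    return place(0)


def is_cip(S, multisets):
    """True if S is an integer partition of every input multiset."""
    return all(is_partition_of(S, X) for X in multisets)


def partitions_into(total, count, largest):
    """Yield non-increasing tuples of `count` positive integers summing to `total`."""
    if count == 0:
        if total == 0:
            yield ()
        return
    for first in range(min(total, largest), 0, -1):
        for rest in partitions_into(total - first, count - 1, first):
            yield (first,) + rest


def minimum_cip(multisets):
    """Return some minimum-cardinality CIP by trying sizes in increasing order."""
    total = sum(multisets[0])
    if any(sum(X) != total for X in multisets):
        raise ValueError("input multisets must have equal sums")
    smallest = max(len(X) for X in multisets)   # a CIP has at least max |X_i| parts
    for size in range(smallest, total + 1):
        for S in partitions_into(total, size, total):
            if is_cip(S, multisets):
                return list(S)


X1, X2 = [8, 5], [6, 4, 3]
print(is_cip([3, 3, 2, 2, 2, 1], [X1, X2]))  # True
print(is_cip([6, 3, 2, 2], [X1, X2]))        # True
print(len(minimum_cip([X1, X2])))            # 4
```

A minimum CIP need not be unique, so `minimum_cip` may return a different four-part multiset than $\{6,3,2,2\}$; only the cardinality is guaranteed to match.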

For simplicity, we denote the optimal, i.e. a minimum cardinality, CIP for $\{X_1, X_2, \ldots, X_k\}$ as $\mathrm{OPT}(X_1, X_2, \ldots, X_k)$, or simply OPT when the input multisets are clear from the context. Analogously, we denote the CIP for $\{X_1, X_2, \ldots, X_k\}$ produced by an algorithm $A$ as $\mathrm{CIP}_A(X_1, X_2, \ldots, X_k)$, or simply $\mathrm{CIP}_A$; without the algorithm subscript, we use CIP to denote any feasible common integer partition.

As mentioned earlier, the MCIP problem was introduced by Chen et al. [7], formulated out of ortholog assignment and DNA fingerprint assembly. Interested readers may refer to their paper for more detailed descriptions and the mappings between the problems. More recently, another application of the MCIP problem, in similarity comparison between two unlabeled pedigrees, was presented in [10]. Pedigrees, commonly known as family trees, record the occurrence and appearance (or phenotypes) of a particular gene or organism and its ancestors from one generation to the next. They are important to geneticists for linkage analysis, as with a valid pedigree the recombination events can be deduced more accurately [8], or disease loci can be mapped consistently [12], [13]. Jiang et al. [10] considered the isomorphism and similarity comparison problems for two-generation pedigrees, and formulated them as the minimum common integer pair partition (MCIPP) problem, which generalizes the MCIP problem. By exploiting certain structural properties of the optimal solutions for the 2-MCIP problem, they were able to show that their MCIPP problem is also fixed-parameter tractable [10].

For an integer $x \in \mathbb{Z}^+$, the number of its integer partitions increases very rapidly with $x$. For example, the integer 3 has three partitions, namely $\{3\}$, $\{2,1\}$, and $\{1,1,1\}$; the integer 4 has five partitions, namely $\{4\}$, $\{3,1\}$, $\{2,2\}$, $\{2,1,1\}$, and $\{1,1,1,1\}$; while the integer 100 has 190,569,292 partitions according to [2].
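The growth of the partition count can be checked numerically with a standard dynamic program that admits one part size at a time. This is a minimal sketch; `count_partitions` is an illustrative name, not something defined in the paper.

```python
def count_partitions(n):
    """Number of integer partitions of n, counted by admitting parts 1, 2, ..., n."""
    ways = [1] + [0] * n              # ways[t]: partitions of t using parts seen so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]


print(count_partitions(3))    # 3
print(count_partitions(4))    # 5
print(count_partitions(10))   # 42
print(count_partitions(100))  # 190569292
```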

Given a collection of multisets $\{X_1, X_2, \ldots, X_k\}$ ($k \ge 2$), they have a CIP if and only if they have the same summation over their elements. Multisets with this property are called related [6], and we assume throughout the paper that the multisets in any instance of MCIP are related, as the verification takes only linear time.
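The linear-time relatedness check amounts to comparing element sums. A minimal sketch, assuming the multisets are given as Python lists and that there are at least two of them (the name `are_related` is illustrative):

```python
def are_related(multisets):
    """True if all input multisets have the same element sum (one pass over the input)."""
    return len({sum(X) for X in multisets}) == 1


print(are_related([[8, 5], [6, 4, 3]]))  # True
print(are_related([[8, 5], [6, 4, 2]]))  # False
```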

One can see that the 2-MCIP problem generalizes the well-known subset sum problem [9], based on the following observation: given a positive integer $x$ and a multiset of positive integers $X = \{a_1, a_2, \ldots, a_m\}$, there exists a sub-multiset of $X$ summing to $x$ if and only if, for the two multisets $X = \{a_1, a_2, \ldots, a_m\}$ and $Y = \{x, \sum_{i=1}^{m} a_i - x\}$, $|\mathrm{OPT}(X, Y)| = m$. Thus 2-MCIP is NP-hard [6]. Chen et al. showed that 2-MCIP is APX-hard [6], via a linear reduction (also called an approximation preserving reduction) from the maximum bounded 3-dimensional matching problem [11]. After the preliminary version of this paper, You et al. presented a fixed-parameter tractable (FPT) algorithm for 2-MCIP in [15].
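A small worked instance may make the subset sum observation concrete. The numbers below are chosen here for illustration and do not appear in the paper.

```latex
\begin{align*}
&X = \{3, 5, 7\},\quad m = 3,\quad \textstyle\sum_i a_i = 15. \\[4pt]
&\text{Target } x = 8:\quad Y = \{8,\ 15 - 8\} = \{8, 7\}. \\
&\quad \{3, 5\} \subseteq X \text{ sums to } 8, \text{ so } X \text{ itself is a CIP: } 8 = 3 + 5,\ 7 = 7. \\
&\quad \text{Every CIP has at least } |X| = 3 \text{ parts, hence } |\mathrm{OPT}(X, Y)| = 3 = m. \\[4pt]
&\text{Target } x = 4:\quad Y = \{4,\ 15 - 4\} = \{4, 11\}. \\
&\quad \text{No sub-multiset of } X \text{ sums to } 4, \text{ so } X \text{ is not a partition of } Y, \\
&\quad \text{and any CIP must split some element of } X:\ |\mathrm{OPT}(X, Y)| \ge 4 > m. \\
&\quad \text{Indeed } \{7, 4, 3, 1\} \text{ is a CIP: } 3 = 3,\ 5 = 4 + 1,\ 7 = 7;\quad 4 = 4,\ 11 = 7 + 3 + 1.
\end{align*}
```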

Let $M = |X_1| + |X_2| + \cdots + |X_k|$ denote the total number of integers in the k-MCIP problem. For the positive algorithmic results, Chen et al. presented a linear time 2-approximation algorithm and an $O(M^9)$-time 5/4-approximation algorithm for 2-MCIP [6], based on a heuristic for the maximum weighted set packing problem [11]. The 5/4-approximation can be taken as a subroutine to design a $0.625k$-approximation algorithm for k-MCIP (when $k$ is even; when $k$ is odd, the approximation ratio is $0.625k + 0.375$) [14]. Woodruff developed a framework for capturing the frequencies of the integers across the input multisets and presented a randomized $O(M \log k)$-time approximation algorithm for k-MCIP, with a worst-case performance ratio of $0.6139k(1 + o(1))$ [14]. The basic idea is that, when there are not too many distinct integers in the input multisets, most of the low frequency integers will have to be split into at least two parts in any common partition. Inspired by this idea, Zhao et al. [16] formulated the k-MCIP problem as a flow decomposition problem in an acyclic $k$-layer network, with the goal of finding a minimum number of directed simple paths from the source to the sink. Since this minimum number can be bounded by the number of arcs in the network according to the well-known flow decomposition theorem [1], Zhao et al. presented a scheme to reduce the number of arcs in the network, resulting in a de-randomized approximation algorithm with a performance ratio of $0.5625k(1 + o(1))$, which is currently the best.

In this paper, we present a polynomial-time 6/5-approximation algorithm for 2-MCIP. Subsequently, we obtain a $0.6k$-approximation algorithm for k-MCIP when $k$ is even (when $k$ is odd, the approximation ratio is $0.6k + 0.4$). It is worth pointing out that the ratio of $0.5625k$ in [16] is asymptotic, in that it holds only for sufficiently large $k$, while our ratio of $0.6k$ is absolute, in that it holds for all $k \ge 2$.

The rest of the paper is organized as follows: In the next section, we introduce some known bounds on the cardinality of the optimal CIPs for 2-MCIP first, then present our 6/5-approximation algorithm and its performance analysis, assuming an important inequality stated in Lemma 4. The entire Section 3 is devoted to the proof of Lemma 4, where multiple amortized analyses are employed. We note that while conceptually simple, some of the amortized analyses are technical and involved, with a number of notations set up for token counting purposes. In Section 4, we extend the 6/5-approximation algorithm to a 0.6k-approximation for k-MCIP when k>2 (a (0.6k+0.4)-approximation when k is odd). We conclude the paper with some future work in Section 5.

A 6/5-approximation algorithm for 2-MCIP

In this section, we deal with the 2-MCIP problem. For ease of presentation, we denote the two multisets of positive integers in an instance as $X = \{x_1, x_2, \ldots, x_m\}$ and $Y = \{y_1, y_2, \ldots, y_n\}$, and assume without loss of generality that they are related. Recall that $\mathrm{OPT}(X, Y)$ denotes the optimal solution, i.e. the minimum cardinality CIP for $\{X, Y\}$, and $\mathrm{CIP}_A(X, Y)$ denotes the solution CIP produced by the algorithm $A$.

Proof of Lemma 4

This section is devoted to the proof of Lemma 4, stating that $3q_3 + 2q_4 + q_5 \le 5(p_3 + p_4 + p_5)$. By Eq. (2.4), it is sufficient to show that $2q_{31} + q_{32} + q_{41} \le 2p_3 + p_4$, which is stated as Lemma 10. To this purpose, we consider the bipartite subgraph $H'$ of the graph $H$ induced by the vertex subsets $Q_{31} \cup Q_{32} \cup Q_{41}$ and $P$. By associating two tokens for each vertex of $Q_{31}$ and one token for each vertex of $Q_{32} \cup Q_{41}$, we re-distribute these tokens to the vertices of $P$ through adjacencies by distinguishing five

A 0.6k-approximation algorithm for k-MCIP

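Given an instance of the k-MCIP problem $\{X_1, X_2, \ldots, X_k\}$, we first divide these $k$ multisets into $\lfloor k/2 \rfloor$ pairs $\{X_{2i-1}, X_{2i}\}$, $i = 1, 2, \ldots, \lfloor k/2 \rfloor$, plus the last multiset $X_k$ if $k$ is odd. Next, we run the algorithm Apx65 on each pair $\{X_{2i-1}, X_{2i}\}$ to obtain a solution $Z_i$, for $i = 1, 2, \ldots, \lfloor k/2 \rfloor$, plus $Z_{(k+1)/2} = X_k$ if $k$ is odd. We continue this dividing and running Apx65 on $\{Z_1, Z_2, \ldots, Z_{\lfloor (k+1)/2 \rfloor}\}$ if $\lfloor (k+1)/2 \rfloor \ge 2$, and repeat until we have only one multiset left, denoted as $\mathrm{CIP}_{\mathrm{final}}$. Clearly, $\mathrm{CIP}_{\mathrm{final}}$ is a common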
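The pair-and-merge scheme described in this section can be sketched as follows. Apx65 itself is not reproduced in this excerpt, so the sketch plugs in a simple greedy routine, `greedy_two_cip`, as a hypothetical stand-in. It returns a valid CIP of two related multisets but carries none of the 6/5 guarantee. The names `greedy_two_cip` and `merge_pairwise` are assumptions made for illustration.

```python
def greedy_two_cip(X, Y):
    """Stand-in 2-MCIP routine (NOT Apx65): repeatedly match one element from each side.

    Emits min(x, y) as a part and returns the leftover difference to its side,
    so the output has at most |X| + |Y| - 1 parts.
    """
    xs, ys = list(X), list(Y)
    if sum(xs) != sum(ys):
        raise ValueError("multisets must be related (equal sums)")
    parts = []
    while xs and ys:
        x, y = xs.pop(), ys.pop()
        parts.append(min(x, y))
        if x > y:
            xs.append(x - y)
        elif y > x:
            ys.append(y - x)
    return parts


def merge_pairwise(multisets, two_cip=greedy_two_cip):
    """Pair up the multisets, replace each pair by a CIP, and repeat until one is left."""
    current = [list(X) for X in multisets]
    while len(current) > 1:
        merged = [two_cip(current[i], current[i + 1])
                  for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:
            merged.append(current[-1])   # the unpaired multiset is carried over unchanged
        current = merged
    return current[0]


# Three related multisets (each sums to 13); the third is chosen for illustration.
result = merge_pairwise([[8, 5], [6, 4, 3], [7, 6]])
print(sorted(result, reverse=True))
```

The scheme is correct for any 2-MCIP routine because a partition of a partition of $X$ is again a partition of $X$, so a CIP of the intermediate multisets is a CIP of all the inputs. On this input the stand-in returns the four parts 6, 3, 2, 2.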

Conclusions

We presented an improved 6/5-approximation algorithm for the 2-MCIP problem; the previous best approximation algorithm has a performance ratio of 5/4 and was designed by Chen et al. in 2006 [5], [6]. Subsequently, we obtained an absolute $0.6k$-approximation algorithm for k-MCIP when $k$ is even (when $k$ is odd, the approximation ratio is $0.6k + 0.4$). It is worth pointing out that the ratio of $0.5625k$ in [16] is asymptotic, in that it holds only for sufficiently large $k$

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgement

We are very grateful to the anonymous reviewers for their many helpful comments and suggestions to improve the presentation.

This research is supported by the NSERC Canada.

An extended abstract appears in ISAAC 2014.
