Generalised Isotonic Regression - In Depth

Posted on November 22, 2023

The final C++ implementation is available on GitHub at Generalised Isotonic Regression, and a variant compiled to WebAssembly for experimentation in the browser can be found here.

What is Isotonic Regression?

Like other regression models, Isotonic Regression tries to find a predictive relationship between a set of features and corresponding observations. A method such as Linear Regression constrains the space of possible relationships such that the predictions vary linearly with changes in the feature space; in other words, it has a constant rate of change. Isotonic Regression, conversely, places no restrictions on the shape of the relationship apart from requiring it to be non-decreasing (alternatively, non-increasing) - typically a monotonic piecewise-constant function - meaning there can be large jumps between subsequent feature values. It can, therefore, be a helpful alternative when a monotonic relationship between variables exists, especially when this relationship does not follow an obvious Linear or Polynomial form typical in regression.

In the classical formulation (with \(L_2\) loss) where \(\mathfrak{I}\) is a partial order defined over our independent variables, Isotonic Regression can be defined as the following optimisation problem.

\[\begin{aligned} &\text{minimise} &\quad \sum_i (\hat{y}_i - y_i)^2 & \\ &\text{subject to} &\quad \hat{y}_i \leq \hat{y}_j, &\quad \forall \left ( i, j \right ) \in \mathfrak{I} \end{aligned} \]

Throughout this post, we will discuss what a Linear Program is and how it leads to the algorithm discussed in (Luss and Rosset 2014) for solving the Isotonic Regression problem in an arbitrary number of dimensions while supporting any convex differentiable loss function.

Linear Programming

Linear Programming is a generic approach to solving constrained problems with a linear objective function and linear constraints. The standard form of such a problem asks us to find a vector \(\bm{x} \in \mathbb{R}^p\) minimising our objective \(o(\bm{x})\) and takes the form:

\[\begin{aligned} &\text{minimise} &o(\bm{x}) \\ &\text{subject to} \quad &b_i(\bm{x}) &\leq 0, &\quad \forall i \\ & &c_j(\bm{x}) &= 0, &\quad \forall j \end{aligned} \]

where our inequalities \(b_i\) and equalities \(c_j\) express linear relationships.

The advantage of prescribing such a form is that generic software can be written to solve any problem that fits the template. Examples include the commercial solver MOSEK and the open-source HiGHS, which I have used in this implementation.

Since \(o\) can be any linear function, we are not restricted to minimisation problems or to inequalities of the form \(\leq 0\). Multiplying the objective by negative one turns a maximisation problem into the above form, and, in the same way, a greater-than inequality can be flipped into a less-than one. Furthermore, an offset can be included to support equalities and inequalities with values other than zero. For example, the linear relation \(x \geq 5\), rewritten as \(5 - x \leq 0\), produces the function \(b_i(x) = 5 - x\), which also fits the above form. We can then use generic interior point or simplex solvers to minimise the objective.

Example Linear Program

Consider, for example, a hobbyist photographer and amateur baker. He wants to maximise the time he spends on his two favourite activities but, unfortunately, cannot finance them without limits. From these wishes, we can create a constrained optimisation problem.

You may choose to play around with setting up and solving some of these problems using an online solver.

Our hobbyist wants to maximise his time doing his favourite activities; perhaps he prefers photography, \(p\), over baking, \(b\) (\(2p + b\)). Unfortunately, photography is considerably more expensive at 5€ per hour than baking at 1€, and he has a limited budget of 10€ constraining the activities (\(5p + b \leq 10\)). Finally, he does not want to make more baked goods than he can eat, so he limits himself to 5 hours of baking (\(b \leq 5\)). Along with the natural assumption that a negative amount of time cannot be spent on an activity, this produces the Linear Program:

\[\begin{aligned} &\text{maximize} & 2 p + b \\ &\text{subject to} \quad & 5 p + b &\leq 10 \\ & & b &\leq 5 \\ & & p &\geq 0 \\ & & b &\geq 0 \end{aligned} \]

With so few constraints and dimensions, this Linear Program is easily solved: we plot the constraints and choose the vertex of the feasible region with the largest objective value. Technically, the optimum is attained at a vertex only if the solution is unique; if there are multiple solutions, any point along the line between two optimal vertices is also optimal.

However, once we have more than three dimensions, it is more challenging to visualise such a problem. Regardless, a similar approach can be taken, checking each vertex of the feasible region of the corresponding multidimensional polytope.

Plotting all of the constraints, we see our maximal vertex in the top right corner, which suggests that our hobbyist spend an hour a week taking pictures and five hours baking bread.

For a more in-depth introduction, see the video The Art of Linear Programming.
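To make this concrete, here is a small sketch (my own, not part of the post's repository) of how the same problem could be set up and solved programmatically with the HiGHS C++ API that reappears later in the implementation; kHighsInf is HiGHS' constant for an infinite bound.

#include <cstdio>
#include "Highs.h"

int main() {
  // maximise 2p + b  subject to  5p + b <= 10,  b <= 5,  p >= 0,  b >= 0
  HighsModel model;
  model.lp_.num_col_ = 2; // the two variables p and b
  model.lp_.num_row_ = 2; // the two inequality rows
  model.lp_.sense_ = ObjSense::kMaximize;
  model.lp_.col_cost_ = {2, 1};                  // objective 2p + b
  model.lp_.col_lower_ = {0, 0};                 // p >= 0, b >= 0
  model.lp_.col_upper_ = {kHighsInf, kHighsInf};
  model.lp_.row_lower_ = {-kHighsInf, -kHighsInf};
  model.lp_.row_upper_ = {10, 5};                // 5p + b <= 10 and b <= 5

  // constraint coefficients stored column-wise:
  // column p = (5, 0), column b = (1, 1)
  model.lp_.a_matrix_.format_ = MatrixFormat::kColwise;
  model.lp_.a_matrix_.start_ = {0, 1, 3};
  model.lp_.a_matrix_.index_ = {0, 0, 1};
  model.lp_.a_matrix_.value_ = {5, 1, 1};

  Highs highs;
  highs.passModel(model);
  highs.run();

  const auto& solution = highs.getSolution();
  printf("p = %f, b = %f\n", solution.col_value[0], solution.col_value[1]);
  // should print p = 1, b = 5, i.e. an objective value of 7
}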

We can also write this problem in the standard form we mentioned above.

\[\begin{aligned} &\text{minimise} & - 2p - b \\ &\text{subject to} \quad & 5 p + b - 10 &\leq 0 \\ & & b - 5 &\leq 0 \\ & & -p &\leq 0 \\ & & -b &\leq 0 \end{aligned} \]

Lagrangian Dual Function

Given the standard form:

\[\begin{aligned} &\text{minimise} &o(\bm{x}) \\ &\text{subject to} \quad &b_i(\bm{x}) &\leq 0, &\quad \forall i \\ & &c_j(\bm{x}) &= 0, &\quad \forall j \end{aligned} \]

we also have the option to define a Lagrangian dual function. Essentially, this new definition relaxes the constraints by multiplying them with a \(\lambda_i \in \mathbb{R}_0^+\) term for each inequality and a \(v_j \in \mathbb{R}\) term for each equality, removing discontinuities in the search space. When violating constraints, we no longer move from a finite to an undefined objective value. Furthermore, because the weighted sum of these constraints is added directly to the objective, the Lagrangian acts as a lower bound on the original problem.

\[ L(\bm{x}, \bm{\lambda}, \bm{v}) = o(\bm{x}) + \sum_i \lambda_i b_i(\bm{x}) + \sum_j v_j c_j(\bm{x}) \]

It is natural to ask what the optimal lower bound is for a given \(\bm{\lambda}\) and \(\bm{v}\). This is the infimum of the above objective.

\[ g(\bm{\lambda}, \bm{v}) = \text{inf}_{\bm{x}} L(\bm{x}, \bm{\lambda}, \bm{v}) \]

Taking this objective, we then define the dual problem of our standard form above as the following maximisation problem, which is always a convex optimisation problem since \(g\) is concave:

\[\begin{aligned} &\text{maximize} & g(\bm{\lambda}, \bm{v}) \\ &\text{subject to} \quad & \bm{\lambda} \succeq 0 \end{aligned} \]

(Boyd and Vandenberghe 2004, pages 215-223)

Example Lagrangian Dual Problem

Using the form of this new dual Lagrangian problem, we can lower-bound our example from above. Our Lagrangian is:

\[\begin{aligned} L(p, b, \bm{\lambda}) &= -2p - b + \lambda_1 \left ( 5p + b - 10 \right ) + \lambda_2 \left ( b - 5 \right ) - \lambda_3 p - \lambda_4 b \\ &= p \left ( 5 \lambda_1 - \lambda_3 - 2 \right ) + b \left ( \lambda_1 + \lambda_2 - \lambda_4 - 1 \right ) + \left ( - 10 \lambda_1 - 5 \lambda_2 \right ) \\ \end{aligned} \]

We have no \(\bm{v}\) term in this case due to a lack of equality constraints, and we have \((p, b)\) in place of \(\bm{x}\). Our dual problem is then formulated as follows:

\[\begin{aligned} &\text{maximize} &\text{inf}_{p, b} \text{ } & p \left ( 5 \lambda_1 - \lambda_3 - 2 \right ) + b \left ( \lambda_1 + \lambda_2 - \lambda_4 - 1 \right ) + \left ( - 10 \lambda_1 - 5 \lambda_2 \right ) \\ &\text{subject to} \quad &\bm{\lambda} & \succeq 0 \end{aligned} \]

What we notice is that this program is unbounded below for all values of \(\bm{\lambda}\) except those where \((5 \lambda_1 - \lambda_3 - 2) = 0\) and \((\lambda_1 + \lambda_2 - \lambda_4 - 1) = 0\). If either of these coefficients is non-zero, \(p\) or \(b\) can be chosen to make the corresponding term arbitrarily negative, and the infimum returns negative infinity. The Lagrangian dual of a linear program is always unbounded below in this way outside such a set, but this is not generally true of other optimisation problems.

Using this, we can derive an analytic solution by first rearranging our two equalities

\[\begin{aligned} \lambda_1 &= 1/5 \lambda_3 + 2/5 \\ \lambda_2 &= \lambda_4 + 1 - \lambda_1 \\ &= \lambda_4 + 3/5 - 1/5 \lambda_3 \\ \end{aligned} \]

and then substituting these into our objective function \(L(p, b, \bm{\lambda})\) and simplifying

\[\begin{aligned} L(p, b, \bm{\lambda}) &= -\left ( 10 \lambda_1 + 5 \lambda_2 \right ) \\ &= -\left ( 2 \lambda_3 + 4 + 5 \lambda_4 + 3 - \lambda_3 \right ) \\ &= -\left ( \lambda_3 + 5 \lambda_4 + 7 \right ) \end{aligned} \]

and realising that any \(\lambda_3 > 0\) or \(\lambda_4 > 0\) only moves us away from the maximum, we reach a maximum of

\[ \lambda_3 = 0, \lambda_4 = 0: \quad -7 \]

Finally, recalling that we negated the objective in the formulation of our original problem to fit the standard form, we get \(7\). In other words, our lower bound is identical to the optimal value from our original problem! For many, many more examples, see this excellent and free textbook, Convex Optimisation.

Activating the curve \(2p + b = 7\) in our feasible region graph above also shows that this new constraint resulting from the Lagrangian dual program intersects the feasible region at precisely one point: our optimal vertex from before.

Karush-Kuhn-Tucker (KKT) Conditions

If we assume, for a moment, that we know the optimal primal value \(\bm{x}^*\) and the optimal dual values \(\bm{\lambda}^*\) and \(\bm{v}^*\) of our Lagrangian, and that our lower bound is equal to the standard form solution, as in the preceding example, we can conclude that the Lagrangian \(L(\bm{x}, \bm{\lambda}^*, \bm{v}^*)\) attains its minimum at \(\bm{x}^*\) and consequently has a gradient of zero there.

\[ \nabla o(\bm{x}^*) + \sum_i \lambda_i^* \nabla b_i (\bm{x}^*) + \sum_j v_j^* \nabla c_j(\bm{x}^*) = 0 \]

We can conclude several additional conditions that must hold at this optimal set of values. Together, these are the Karush-Kuhn-Tucker (KKT) Conditions:

\[\begin{aligned} b_i(\bm{x}^*) &\leq 0, &\quad \forall i \\ c_j(\bm{x}^*) &= 0 , &\quad \forall j \\ \lambda_i^* &\geq 0, &\quad \forall i \\ \lambda_i^* b_i(\bm{x}^*) &= 0, &\quad \forall i \end{aligned} \]

The first two must hold because an optimal \(\bm{x}^*\) must satisfy the constraints of our linear program's standard form. The third holds because the optimal \(\bm{\lambda}^*\) and \(\bm{v}^*\) must satisfy the constraint of the Lagrangian dual program. The final condition builds on the others: from the first, we know that \(b_i(\bm{x}^*) \leq 0\), so any \(\lambda_i > 0\) paired with a strictly negative \(b_i(\bm{x}^*)\) would push the Lagrangian dual objective below its maximum, as one of its summations multiplies the positive \(\lambda_i\) with a negative \(b_i(\bm{x})\). Consequently, attaining the optimal solution requires that either \(\lambda_i\) or \(b_i(\bm{x})\) equal zero, producing the last condition.

This all assumes there is a feasible solution and strong duality, i.e. that the optimal solution to both the dual and primal problem is identical.

It is not always the case that the primal and dual programs produce the same solution. If, however, our problem is convex, as in our example above, then we have a zero duality gap whenever values \(\hat{\bm{x}}\), \(\hat{\bm{\lambda}}\) and \(\hat{\bm{v}}\) jointly satisfy the KKT conditions. Luckily, the constraints in the linear program from the paper we are looking at here are convex, and the objective is restricted to convex differentiable functions.

Without this convexity guarantee, the gradient of the Lagrangian equating to zero does not imply that it has been minimised over \(\bm{x}\), meaning \(g(\hat{\bm{\lambda}}, \hat{\bm{v}})\) doesn't necessarily equal \(L(\hat{\bm{x}}, \hat{\bm{\lambda}}, \hat{\bm{v}})\) for a point \(\hat{\bm{x}}\) satisfying the conditions. When convexity does hold, though, we can use the KKT conditions to show that the following equality holds:

\[\begin{aligned} g(\hat{\bm{\lambda}}, \hat{\bm{v}}) &= L(\hat{\bm{x}}, \hat{\bm{\lambda}}, \hat{\bm{v}}) \\ &= o(\hat{\bm{x}}) + \sum_i \hat{\lambda}_i b_i(\hat{\bm{x}}) + \sum_j \hat{v}_j c_j(\hat{\bm{x}}) \\ &= o(\hat{\bm{x}}) \end{aligned} \]

(Boyd and Vandenberghe 2004, pages 244-245)

Generalised Isotonic Regression

Having covered the necessary background material, we can move to the paper. We aim to solve the following general version of the isotonic regression problem:

\[\begin{aligned} &\text{minimise} &\sum_i f_i(\hat{y}_i) & \\ &\text{subject to} \quad &\hat{y}_i - \hat{y}_j \leq 0, &\quad \forall \left ( i, j \right ) \in \mathfrak{I} \end{aligned} \]

where each \(f_i : \mathbb{R} \rightarrow \mathbb{R}\) is a differentiable convex function. Should we wish to use an \(L_2\) loss function, we would substitute \((\hat{y}_i - y_i)^2\) in place of \(f_i(\hat{y}_i)\).
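With that choice, the gradient of each \(f_i\) is \(2(\hat{y}_i - y_i)\). A hypothetical helper (my own sketch, not from the paper's code) that evaluates these gradients for a vector of observations at a shared estimate, as the implementation below will need, might look like:

#include <Eigen/Dense>

// gradient of the L2 loss f_i(y_hat) = (y_hat - y_i)^2, evaluated for
// every observation y_i at a shared estimate y_hat
Eigen::VectorXd l2_loss_gradient(
    const Eigen::VectorXd& y, // observations
    const double y_hat        // current estimate
) {
  return (2.0 * (y_hat - y.array())).matrix();
}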

As a starting point, we consider the corresponding dual Lagrangian objective function and the KKT conditions. For a single point, \(i\), the Lagrangian is:

\[ L_i(\hat{\bm{y}}, \bm{\lambda}) = f_i(\hat{y}_i) + \sum_j \lambda_{ij} \left ( \hat{y}_i - \hat{y}_j \right ) + \sum_k \lambda_{ki} \left ( \hat{y}_k - \hat{y}_i \right ) \]

In the first summation, we include the constraints for all points \(j\) larger than \(i\), and in the second, those for all points \(k\) smaller. The KKT conditions for a single point \(i\) are:

\[\begin{aligned} \frac{\partial f_i(\hat{y}_i^*)}{\partial \hat{y}_i^*} + \sum_j \lambda_{ij}^* - \sum_k \lambda_{ki}^* &= 0 \\ \hat{y}_i^* - \hat{y}_j^* &\leq 0, &\quad \forall (i, j) \in \mathfrak{I} \\ \text{no equality constraints} \\ \lambda_{ij}^* &\geq 0, &\quad \forall (i, j) \in \mathfrak{I} \\ \lambda_{ij}^* \left ( \hat{y}_i^* - \hat{y}_j^* \right ) &= 0, &\quad \forall (i, j) \in \mathfrak{I} \end{aligned} \]

These conditions reveal several properties of the optimal solution:

  1. From conditions two and four, we conclude that optimal estimates \(\hat{y}_i^*\) and \(\hat{y}_j^*\) for isotonically constrained points \(i \preceq j\) are either identical, as would be the case when the unconstrained estimate otherwise produces \(\hat{y}_i > \hat{y}_j\), or the two estimates trivially satisfy the constraint and \(\lambda_{ij} = 0\).
  2. Consequently, the optimal solution partitions the space into blocks \(V\), where every point within a block has the same estimate \(\hat{y}_V^*\).
  3. The second condition, furthermore, implies isotonicity at the optimum.

Therefore, the paper's authors suggest an algorithm which iteratively selects the block \(V\) with the largest between-group variance and either partitions it or concludes that it is already optimal and moves on to the next block. An example can be seen in the plot below. As with many statistical and machine learning algorithms, running too many iterations appears to overfit the solution.

Optimal Cut

From the first KKT condition above, we know that a block \(V\) that is already optimally partitioned will satisfy:

\[ \sum_{i \in V} \frac{\partial f_i(\hat{y}_V)}{\partial \hat{y}_V} = 0 \]

Summed over all points in a block, the \(\lambda\) terms of the first KKT condition form a telescopic series that cancels to zero, leaving only the gradient terms. Consider three points \(a \preceq b \preceq c\); the \(\lambda\) terms of their conditions cancel out when summed. \[ \begin{aligned} a&: \lambda_{ab} + \lambda_{ac} \\ b&: \lambda_{bc} - \lambda_{ab} \\ c&: - \lambda_{ac} - \lambda_{bc} \\ \end{aligned} \]

If, however, this is not the case, we can find a more optimal solution by partitioning our block into two subblocks \(V_-\) and \(V_+\), where all points \(i\in V_-\) are smaller than those \(j \in V_+\) according to our isotonicity constraints.

\[ \sum_{j \in V_+} \frac{\partial f_j(\hat{y}_V)}{\partial \hat{y}_V} - \sum_{i \in V_-} \frac{\partial f_i(\hat{y}_V)}{\partial \hat{y}_V} < 0 \]

Because the block's common estimate \(\hat{y}_V\) minimises the loss within the block, its gradients sum to zero, so this difference is negative exactly when the first summand is negative and the second positive. In other words, a better partitioning exists when the estimates \(\hat{y}_j\) want to increase while the \(\hat{y}_i\) want to decrease, which maintains isotonicity. After partitioning a block, this difference also measures how well separated the two new blocks are and is used to guide the iterative splitting. At each iteration, we therefore select the block with the maximal difference and either search for a new optimal cut or verify that the block is already optimal.

\[ \text{minimise}_{\left ( V_-, V_+ \right ) \in V} \left ( \sum_{j \in V_+} \frac{\partial f_j(\hat{y}_V)}{\partial \hat{y}_V} - \sum_{i \in V_-} \frac{\partial f_i(\hat{y}_V)}{\partial \hat{y}_V} \right ) \]

This is equivalent to maximising the between-group variance at each step; for details, see (Luss, Rosset, and Shahar 2012).

Introducing a new variable \(x_i\), which can be either \(1\) or \(-1\), this minimisation problem is equivalent to the following binary program:

\[\begin{aligned} &\text{minimise} &\quad \sum_{i \in V} x_i \frac{\partial f_i(\hat{y}_V)}{\partial \hat{y}_V} \\ &\text{subject to} &\quad x_i \leq x_j &\quad \forall (i,j) \in \mathfrak{I}_V \\ & &\quad x_i \in \left \{ -1, +1 \right \} &\quad \forall i \in V \end{aligned} \]

Finally, relaxing the integer constraint, allowing \(x_i\) to vary continuously between \(-1\) and \(1\), and converting it to our standard form, we reach the linear program from the paper:

\[\begin{aligned} &\text{minimise} &\quad \sum_{i \in V} x_i \frac{\partial f_i(\hat{y}_V)}{\partial \hat{y}_V} \\ &\text{subject to} &\quad x_i - x_j \leq 0 &\quad \forall (i,j) \in \mathfrak{I}_V \\ & &\quad -x_i - 1 \leq 0 &\quad \forall i \in V \\ & &\quad x_i - 1 \leq 0 &\quad \forall i \in V \end{aligned} \]

Previous work from the authors of the discussed paper typically considers a maximisation problem instead of the minimisation problem we discuss here; see, for example, (Luss, Rosset, and Shahar 2010, equation 3) and (Luss, Rosset, and Shahar 2012, equation 3). The two variants are equivalent, but the order of comparison is reversed: here we take the derivative of the \(L_2\) loss to be \(2 (\hat{y}_V - y_i)\), whereas the previous works take it to be \(2 (y_i - \hat{y}_V)\). Switching from a minimisation to a maximisation requires few changes. In the case of the final matrix form below, we would swap the maximisation for a minimisation and change the domain of our variables from \(y_i \leq 0\) to \(y_i \geq 0\). In my full implementation, I have actually implemented this opposite variant, following the author's implementation.

Implementation

Having discussed the core of the paper, we now implement the algorithm for solving the generalised isotonic regression problem. To start, we need a method to discover and store the monotonicity constraints between a set of points, points_to_adjacency. Then, we need to encode these constraints into a matrix, adjacency_to_LP_standard_form. Finally, we set up our linear program and solve it, minimum_cut.

Isotonic Constraints

Points can be ordered unambiguously in a single dimension according to the traditional \(\leq\) operator, whereby \(0 \leq 1 \leq 2 \leq ...\). With at least \(2\) dimensions, however, things are more complex. Which of the two points \((1, 3)\) and \((3, 1)\) is larger? One option would be to order the pairs lexicographically, in which case the second is larger. Here, however, we consider a domination-based ordering, whereby a point \((x_1, x_2)\) is considered smaller than or equal to another point \((y_1, y_2)\) if both \(x_1 \leq y_1\) and \(x_2 \leq y_2\). Under this relation, the two points \((1, 3)\) and \((3, 1)\) have no relation to each other. They are both, however, larger than the point \((1, 1)\) and smaller than the point \((3, 3)\). Below, we graphically show this ordering for several points in \(2\)-dimensional space.
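A dominance comparison of two individual points is then just a coordinate-wise check; a hypothetical helper (not part of the implementation below, which vectorises this over many points) might look like:

#include <Eigen/Dense>

// point a precedes point b when every coordinate of a is <= the
// corresponding coordinate of b; e.g. (1, 1) precedes both (1, 3) and
// (3, 1), while (1, 3) and (3, 1) are incomparable
bool precedes(const Eigen::VectorXd& a, const Eigen::VectorXd& b) {
  return (a.array() <= b.array()).all();
}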

We save this ordering as a sparse adjacency matrix to reduce memory consumption. Clicking on the "Adjacency Matrix" button, we see, as is typical of an adjacency matrix, a \(1\) wherever one point directly precedes, i.e. is smaller than or equal to, another. However, we have not added a \(1\) for every pair of comparable points. We could have included all of them, and the implementation would still have worked, but the additional \(1\)'s are redundant; removing them is an optimisation that reduces the number of constraints passed to the linear program solver and improves performance.

A potential alternative to the brute-force approach I have coded here might be to use the approach of (Bentley 1980), determining a ranking based on the number of points dominated by each point. This can be done in \(T(N, k) = O \left ( N \log^{k-1} N \right )\) time, with \(k\) dimensions and \(N\) points. We would then only need to compare points with those of the previous rank.

We now introduce a function for creating this reduced adjacency matrix.

// VectorXu is assumed to be a typedef from the full implementation,
// e.g. using VectorXu = Eigen::Matrix<uint64_t, Eigen::Dynamic, 1>;
// argsort is a helper returning a lexicographic ordering (sketched below)
Eigen::SparseMatrix<bool>
points_to_adjacency(const Eigen::MatrixXd& points) {

  const uint64_t total_points = points.rows();

  Eigen::SparseMatrix<bool> adjacency(total_points, total_points);
  VectorXu degree = VectorXu::Zero(total_points);

  Eigen::VectorX<bool> is_predecessor =
      Eigen::VectorX<bool>::Zero(total_points);
  Eigen::VectorX<bool> is_equal =
      Eigen::VectorX<bool>::Zero(total_points);

First, we sort the points lexicographically in \(O(N \text{log} N)\) time in order to reduce the total comparisons.

  const auto& sorted_idxs = argsort(points);
  const Eigen::MatrixXd sorted_points =
      points(sorted_idxs, Eigen::all).transpose();

Next, we iterate through the points, comparing each with all previous points. We add, as direct predecessors, all points smaller than the current point that do not have a successor which is also smaller than the current point. This takes roughly \(N^2/2\) comparisons.

  for (uint64_t i = 1; i < total_points; ++i) {

    const auto& previous_points = sorted_points(
        Eigen::all,
        VectorXu::LinSpaced(i, 0, i-1)
    ).array();

    const auto& current_point = sorted_points(
        Eigen::all,
        VectorXu::LinSpaced(i, i, i)
    ).array();

    is_predecessor(Eigen::seq(0, i-1)) =
        (
            previous_points <= current_point
        ).colwise().all();

    degree(i) = is_predecessor.count();

    /* If there is a chain of points that are all predecessors,
     * we take only the largest. So, we check if the outgoing
     * edge of a predecessor connects to another predecessor
     * (which would be included instead).
     */
    for (Eigen::Index j = 0; j < adjacency.outerSize(); ++j) {
      if (is_predecessor(j)) {
        for (
          Eigen::SparseMatrix<bool>::InnerIterator it(adjacency, j);
          it;
          ++it
        ) {
          if (it.value() && !is_equal(it.row())) {
            is_predecessor(it.row()) = false;
          }
        }
      }
    }

    adjacency.col(i) = is_predecessor.sparseView();
  }

  return adjacency;
}
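The argsort helper used above is not shown in this simplified listing; a possible sketch, assuming the VectorXu typedef mentioned earlier and returning the indices that order the rows of points lexicographically, is:

#include <algorithm>
#include <cstdint>
#include <Eigen/Dense>

VectorXu argsort(const Eigen::MatrixXd& points) {
  // start from the identity permutation 0, 1, ..., N-1
  VectorXu idxs = VectorXu::LinSpaced(points.rows(), 0, points.rows() - 1);
  std::sort(
      idxs.begin(), idxs.end(),
      [&points](const uint64_t a, const uint64_t b) {
        // compare two rows coordinate by coordinate (lexicographically)
        return std::lexicographical_compare(
            points.row(a).begin(), points.row(a).end(),
            points.row(b).begin(), points.row(b).end());
      });
  return idxs;
}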

Minimum Cut / Maximum Flow

To solve the optimal cut, we use the open-source linear solver HiGHS. This requires us to convert our linear program into matrix form.

\[\begin{aligned} \text{minimise} &\quad \bm{b}^T \bm{x} \\ \text {subject to} &\quad \bm{A} \bm{x} \leq \bm{c} \\ \end{aligned} \]

Our example adjacency matrix from above in this matrix form is as follows:

\[\begin{aligned} &\bm{A} &\bm{x} &\leq &\bm{c} \\ &\begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 1 \\ -1 & 0 & 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 & 0 & 0 \\ 0 & 0 & -1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 & 0 & 0 \\ 0 & 0 & 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & 0 & 0 & -1 \\ 1 & -1 & 0 & 0 & 0 & 0 \\ -1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 & 0 \\ 0 & 1 & 0 & -1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & -1 \\ 0 & 0 & 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 0 & 1 & -1 \end{bmatrix} &\begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \\ x_6 \end{bmatrix} &\leq &\begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \end{aligned} \]

The first six rows represent the \(x_i \leq 1\) constraints. We multiply each side of the \(-1 \leq x_i\) constraints by \(-1\) to get \(-x_i \leq 1\), filling the next six rows. The last seven correspond to the isotonicity constraints \(x_i - x_j \leq 0\) from the adjacency matrix. We encode the objective with the vector \(\bm{b}\).

\[ \bm{b} = \begin{bmatrix} \frac{\partial f_1(\hat{y})}{\partial \hat{y}} \\ \frac{\partial f_2(\hat{y})}{\partial \hat{y}} \\ \frac{\partial f_3(\hat{y})}{\partial \hat{y}} \\ \frac{\partial f_4(\hat{y})}{\partial \hat{y}} \\ \frac{\partial f_5(\hat{y})}{\partial \hat{y}} \\ \frac{\partial f_6(\hat{y})}{\partial \hat{y}} \end{bmatrix} \]

Another example of converting a Linear Program to matrix form can be seen on the first few slides here.

The function adjacency_to_LP_standard_form builds the constraint matrix \(A^T\) for a selected subset of points considered_idxs.

Eigen::SparseMatrix<int>
adjacency_to_LP_standard_form(
    const Eigen::SparseMatrix<bool>& adjacency_matrix,
    const VectorXu& considered_idxs
) {
  // constraints_count (a helper from the full implementation) counts the
  // isotonicity constraints between the points in considered_idxs
  const uint64_t total_constraints =
      constraints_count(adjacency_matrix, considered_idxs);

  const uint64_t total_observations = considered_idxs.rows();
  const uint64_t columns = 2 * total_observations + total_constraints;

  Eigen::SparseMatrix<int, Eigen::ColMajor> standard_form(
      total_observations, columns);

  int idx = 0;
  for (Eigen::Index j = 0; j < total_observations; ++j) {
    for (
      Eigen::SparseMatrix<bool>::InnerIterator it(
          adjacency_matrix, considered_idxs(j));
      it;
      ++it
    ) {
      auto row_idx = std::find(
          considered_idxs.begin(),
          considered_idxs.end(),
          it.row());

      // add isotonicity constraint if both points in
      // considered subset
      if (row_idx != considered_idxs.end()) {

        // smaller point
        standard_form.insert(
            std::distance(considered_idxs.begin(), row_idx),
            2 * total_observations + idx) = -1;

        // larger point
        standard_form.insert(
            j,
            2 * total_observations + idx) = 1;

        ++idx;
      }
    }

    // bound x by -1 and 1 (source and sink)
    standard_form.insert(j, j) = 1;
    standard_form.insert(j, j + total_observations) = -1;
  }

  standard_form.makeCompressed();
  return standard_form;
}

We could pass this to the solver as formulated. Instead, following the suggestion of the paper's author, we implement the dual linear program, switching from the minimum-cut to the maximum-flow variant, a relatively mechanical process now that the problem is in matrix form. The conversion rules are detailed under Constructing the dual LP here.

\[\begin{aligned} \text{minimise} &\quad \bm{b}^T \bm{x} &\quad \text{maximise} &\quad \bm{c}^T \bm{y} \\ \text {subject to} &\quad \bm{A} \bm{x} \leq \bm{c} \quad \rightarrow &\quad \text {subject to} &\quad \bm{A}^T \bm{y} = \bm{b} \\ &\quad \bm{x} \in \mathbb{R}^N &\quad &\quad \bm{y} \leq 0 \end{aligned} \]
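As a quick sanity check (my own, not from the paper), applying these rules to the hobbyist example from earlier, whose standard form has \(\bm{b} = (-2, -1)^T\), \(\bm{c} = (10, 5, 0, 0)^T\) and four inequality rows, gives the dual

\[\begin{aligned} &\text{maximise} & 10 y_1 + 5 y_2 & \\ &\text{subject to} \quad & 5 y_1 - y_3 &= -2 \\ & & y_1 + y_2 - y_4 &= -1 \\ & & \bm{y} &\leq 0 \end{aligned} \]

which, after substituting \(\bm{y} = -\bm{\lambda}\), is exactly the Lagrangian dual problem we derived by hand, with optimal value \(-7\).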

Realising that this Min-Cut problem can be reformulated as a Max-Flow problem permits using many other algorithms. Just recently, this was even discussed on Quanta, with a more technical presentation here. Other algorithms, such as Ford-Fulkerson, operate directly on the maximum flow graph; see the max-flow min-cut theorem for more information about the correspondence between the two problems.

The first step to setting up HiGHS for solving our linear program is to declare the optimisation direction.

Eigen::VectorX<bool>
minimum_cut(
    const Eigen::SparseMatrix<bool>& adjacency_matrix,
    const Eigen::VectorXd& loss_gradient, // z in the paper
    const VectorXu& considered_idxs
) {

  HighsModel model;
  model.lp_.sense_ = ObjSense::kMaximize;

Next, we create the matrix \(A^T\).

  const auto A =
      adjacency_to_LP_standard_form(
          adjacency_matrix, considered_idxs);

  model.lp_.num_col_ = A.cols();
  model.lp_.num_row_ = A.rows();

HiGHS requires that we flatten the matrix \(A^T\) into three vectors corresponding to the compressed column scheme: the first contains the non-zero values (stored column-wise), the second the row index of each value, and the third the position at which each new column starts. This is actually the same format that Eigen uses for its compressed sparse matrices by default.

  std::vector<int64_t> column_start_positions(A.cols() + 1);
  std::vector<int64_t> nonzero_row_index(A.nonZeros());
  std::vector<double> nonzero_values(A.nonZeros());
  uint64_t idx = 0;
  for (Eigen::Index j = 0; j < A.outerSize(); ++j) {
    column_start_positions[j] = idx;
    for (Eigen::SparseMatrix<int>::InnerIterator it(A, j); it; ++it) {
      nonzero_row_index[idx] = it.row();
      nonzero_values[idx] = it.value();
      ++idx;
    }
  }
  column_start_positions[column_start_positions.size()-1] = idx;

  model.lp_.a_matrix_.format_ = MatrixFormat::kColwise;
  model.lp_.a_matrix_.start_ = column_start_positions;
  model.lp_.a_matrix_.index_ = nonzero_row_index;
  model.lp_.a_matrix_.value_ = nonzero_values;

Next, we add the equality constraint \(= \bm{b}\)

  std::vector<double> b(loss_gradient.begin(), loss_gradient.end());
  model.lp_.row_lower_ = b;
  model.lp_.row_upper_ = b;

and the upper bound \(\bm{y} \leq 0\).

  // `infinity` can be HiGHS' kHighsInf or std::numeric_limits<double>::infinity()
  model.lp_.col_lower_ =
      std::vector<double>(A.cols(), -infinity);

  model.lp_.col_upper_ =
      std::vector<double>(A.cols(), 0);

Finally, we determine how the objective is calculated via the vector \(\bm{c}\). In this case, \(1\) for each bounding constraint \(-1 \leq x_i \leq 1\) and \(0\) for each isotonicity constraint.

  // sizes for the considered subset
  const uint64_t total_observations = considered_idxs.rows();
  const uint64_t total_constraints =
      A.cols() - 2 * total_observations;

  std::vector<double> c(A.cols());
  for (size_t i = 0; i < total_observations * 2; ++i)
    c[i] = 1;
  for (size_t i = 0; i < total_constraints; ++i)
    c[2 * total_observations + i] = 0;

  model.lp_.col_cost_ = std::move(c);

Running the solver, we retrieve the values of interest: the vector \(\bm{x}\), the solution to the optimal cut in the original matrix form of the linear program. These are contained in the row_dual field, as they are the solution to the dual of the program we tasked HiGHS with solving.

  Highs highs;
  highs.setOptionValue("solver", "simplex");

  highs.passModel(model);
  highs.run();

  // recover the primal x from the duals of the rows of the HiGHS model
  const auto& row_dual = highs.getSolution().row_dual;
  Eigen::VectorX<bool> solution =
      (Eigen::VectorXd::Map(
          row_dual.data(), row_dual.size()).array() > 0).matrix();

  return solution; // false := V- (-1) and true := V+ (+1)
}

All that remains is continuously iterating, selecting the block with the largest between-group variation and attempting to partition it further until all blocks are optimal.
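To give a feel for this outer loop, here is a rough sketch assuming the \(L_2\) loss, simplified relative to the repository's driver, which, among other things, prioritises the block whose cut has the largest gradient difference rather than using a plain queue. It reuses minimum_cut from above and the assumed VectorXu typedef.

#include <cstdint>
#include <queue>
#include <vector>
#include <Eigen/Dense>
#include <Eigen/Sparse>

Eigen::VectorXd
isotonic_regression_sketch(
    const Eigen::SparseMatrix<bool>& adjacency_matrix,
    const Eigen::VectorXd& y
) {
  const uint64_t total_points = y.rows();
  Eigen::VectorXd y_fit = Eigen::VectorXd::Constant(total_points, y.mean());
  Eigen::VectorXi block = Eigen::VectorXi::Zero(total_points);

  std::queue<int> to_split; // blocks we still need to try to partition
  to_split.push(0);
  int next_block_id = 1;

  while (!to_split.empty()) {
    const int current = to_split.front();
    to_split.pop();

    // collect the points of this block and their mean (the L2 block estimate)
    std::vector<uint64_t> members;
    double sum = 0;
    for (uint64_t i = 0; i < total_points; ++i) {
      if (block(i) == current) {
        members.push_back(i);
        sum += y(i);
      }
    }
    const double estimate = sum / members.size();
    const VectorXu considered_idxs =
        VectorXu::Map(members.data(), members.size());

    // L2 gradient 2 * (y_hat_V - y_i) at the common estimate
    Eigen::VectorXd gradient(members.size());
    for (uint64_t j = 0; j < members.size(); ++j)
      gradient(j) = 2 * (estimate - y(members[j]));

    // try to cut; an all-V+ or all-V- result means the block is optimal
    const Eigen::VectorX<bool> cut =
        minimum_cut(adjacency_matrix, gradient, considered_idxs);
    if (cut.all() || !cut.any()) {
      for (const auto i : members) y_fit(i) = estimate;
      continue;
    }

    // move the V+ half into a fresh block and revisit both halves
    for (uint64_t j = 0; j < members.size(); ++j)
      if (cut(j)) block(members[j]) = next_block_id;
    to_split.push(current);
    to_split.push(next_block_id++);
  }

  return y_fit;
}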

The full, non-simplified version of the implementation is available at GeneralisedIsotonicRegression, where I will continue to add features and attempt to improve the code's performance. You can also play around with a version that I compiled to WebAssembly here.


References

Bentley, Jon Louis. 1980. “Multidimensional divide-and-conquer.” Commun. ACM 23 (4) (April): 214–229. doi:10.1145/358841.358850. https://doi.org/10.1145/358841.358850.
Boyd, Stephen, and Lieven Vandenberghe. 2004. Convex optimization. USA: Cambridge University Press.
Luss, Ronny, and Saharon Rosset. 2014. “Generalized isotonic regression.” Journal of Computational and Graphical Statistics 23 (1): 192–210. doi:10.1080/10618600.2012.741550. https://doi.org/10.1080/10618600.2012.741550.
Luss, Ronny, Saharon Rosset, and Moni Shahar. 2010. “Decomposing isotonic regression for efficiently solving large problems.” In Advances in neural information processing systems, ed by. J. Lafferty, C. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta. Vol. 23. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2010/file/03c6b06952c750899bb03d998e631860-Paper.pdf.
Luss, Ronny, Saharon Rosset, and Moni Shahar. 2012. “Efficient regularized isotonic regression with application to gene–gene interaction search.” The Annals of Applied Statistics 6 (1): 253–283. doi:10.1214/11-AOAS504. https://doi.org/10.1214/11-AOAS504.