Friday, July 15, 2022

Minimizing a convex quadratic with equality constraints

In our last few posts we tackled the question of finding a global minimum for a function $f(x)=c^Tx+\frac{1}{2} x^TCx.$ The heart of the algorithm was finding an orthonormal basis for $\mathbb R^n$ with respect to the inner product $\langle x,y\rangle =x^TCy$.

Now suppose we have a set of equality constraints $Ax=b$. This defines an affine subspace of $\mathbb R^n$ and if we can find an orthonormal basis of the same plane translated to the origin $Ax=0$, then we can just use the same algorithm as before.

There is a neat trick for doing this. First, assume that the rows of $A$ $a_1\ldots,a_m$ are linearly independent. (We will remove that restriction later.) Now add extra rows to $A$ so that its rows form a basis of $\mathbb R^n$, to get a matrix we call $D$. 

Claim: The $m+1,\ldots, n$ columns of $D^{-1}$ form a basis for the subspace $Ax=0$.

Proof:

Suppose $Ax=0$. Then $Dx=(0,\ldots,0,*,\ldots,*)^T$ is $0$ in the first $m$ coordinates. Applying $D^{-1}$ to both sides, we have $x=D^{-1}(0,\ldots,0,*,\ldots,*)^T$, which is in the column space of $D^{-1}$ spanned by the last $n-m$ columns. $\Box$

We proceed by doing Gram-Schmidt orthogonalization to the last $n-m$ columns. So to begin, normalize the $m+1$st column and subtract off projections of this column from all other columns, including the initial $m$. These column operations correspond to similar row operations on $D$, so that we are adding a multiple of the $m+1$st row of $D$ to the vectors spanning the subspace $Ax=b$ (see below for a simple explanation of this phenomenon). Thus by the above claim, the last $n-m-1$ vectors of the new $D^{-1}$ are orthogonal to the subspace spanned by the first $m+1$ columns.

 Repeating this process with the remaining columns as we move to the right, we get a collection of orthonormal vectors spanning our subspace.

 Thus we can use the algorithm developed in the unconstrained case with very little modification, but with a little more work needed in the initialization. 

  • First, you need to be able to identify a point $x_0$ satisfying the constraints: $Ax_0=b$.
  • Second, you need to add rows to $A$ to get an invertible matrix $D$, and start the algorithm off with $GS=D^{-1}$ rather than just the identity.
  • Also as a note, you have to make sure to apply the matrix update rules to the first $m$ columns at each step, not just the columns to the right of the currently active one.

 I hope to write this out as a detailed algorithm in my next post, but for now, let us return to the question of how row/column operations affect the inverse.

Lemma: An elementary row operation on a matrix $M$ corresponds to an elementary column operation on $M^{-1}$.

Proof: An elementary row operation is realized by multiplying by an elementary matrix on the left $EM$. Taking the inverse we have $M^{-1}E^{-1}$. The inverse of an elementary matrix is an elementary matrix, and right multiplication is an elementary column operation. $\Box$




Unconstrained quadratic optimization: the algorithm

 We finally have all the tools in place to describe a step algorithm for minimizing a (convex) quadratic objective function without any constraints.

  • Minimize $f(x)= c\cdot x+\frac{1}{2}x^TCx$ where $C$ is a positive semidefinite matrix.
We start with any seed value $x_0\in \mathbb R^n$. Recall that the modified Gram-Schmidt process starts with a set of independent vectors and gradually replaces them with an orthonormal set. We will start with the standard basis $e_1,\ldots,e_n$, recorded as the columns of a matrix $GS_0=I$. We use a vector $J\in\{Y,N\}^n$ to keep track of which columns have been orthonormalized. So to begin we have $J=\{N,\ldots,N\}$, and as we work on a column to create a conjugate direction, we add $Y$ to that index of $J$. Set $g_0=\nabla f(x_0)=c+Cx_0$ and $j=0$ to initialize, where $j$ represents the iteration of the algorithm.
  • Step 1: Computation of search direction $s_j$: Let $GS_j=[c_{1j},\ldots, c_{nj}]$ and $J_j=\{\alpha_{1j},\ldots,\alpha_{nj}]$. If each component of $J_j$ is equal to $Y$, then stop with optimal solution $x_j$. Otherwise, compute the smallest index $k$ such that $$|g_j^Tc_{kj}|= \max\{|g_j^Tc_{ij}|\,|\, \forall i \alpha_{ij}=N\}.$$ If $g^T_jc_{kj}=0$, stop with optimal solution $x_j$. Otherwise set $s_j=\operatorname{sgn}(g^T_jc_{kj})c_{kj}.$ Go to step 2.
  • Step 2: Computation of step size $t_j$: Compute $\langle s_j,s_j\rangle$. If it is $0$, then stop. The objective function is unbounded from below. Otherwise set: $$t_j=\frac{g_j^Ts_j}{\langle s_j,s_j\rangle} $$ Go to step 3.
  • Step 3: Updating data: Set
    • $x_{j+1}:=x_j-t_js_j$.
    • $g_{j+1} := \nabla f(x_{j+1}) = c+Cx_{j+1}$
    • $f(x_{j+1}) = c^Tx_{j+1} +\frac{1}{2}x_{j+1}^TCx_{j+1}$
    • $d_j = s_j/\sqrt{\langle s_j,s_j\rangle}$
    • $GS_{j+1}$ is formed from $GS_j$ by 
      • replacing the $k$th column $c_k$ by $c_k/(d_j^Tc_k)$ 
      • replacing all columns $c_i$ where $\alpha_{ij}=N$ by $c_i-\frac{d_j^Tc_i}{d_j^Tc_k}c_k$
      • leaving columns $c_i$ where $\alpha_{ij}=Y$ unchanged.
    • $J_{j+1}$ is formed from $J_j$ be setting $\alpha_{k}$ to $Y$ and leaving everything else unchanged.
    • Increment $j:=j+1$ and go to step 1!

To summarize the algorithm works as follows. Pick some point $x_0$. Choose a search direction $s_1$ parallel to the axes that maximizes the gradient slope. Locate the minimum of $f$ on the line through $x_0$ parallel to $s_1$ to get $x_1$. Comb the remaining basis vectors of $\mathbb R^n$ to be orthogonal to $s_1$ with respect to the inner product defined by $C$. Next find a search direction $s_2$ among these combed vectors that maximizes the gradient slope, locate the minimum of $f$ on the line through $x_1$ parallel to $s_2$ to get $x_3$, etc. After at most $n$ steps the algorithm will terminate.

I went ahead and programmed this into C# using the MathNet.Numerics.LinearAlgebra library and it seems to work quite well.

First we have a helper function that updates the matrix $GS$, followed by the main minimization routine.

public Matrix<double> GramSchmidt(Matrix<double> matrix, Vector<double> d, int k, bool[] J)
        {
            Matrix<double> A = matrix.Clone();
            double lambda = d.DotProduct(matrix.Column(k));
            A.SetColumn(k, matrix.Column(k) / lambda);
            for(int i = 0; i < A.ColumnCount; i++)
            {
                if (!J[i]&&i!=k)
                {
                    double mu = d.DotProduct(matrix.Column(i));
                    A.SetColumn(i, matrix.Column(i)-matrix.Column(k) * (mu / lambda));
                }
            }
            return A;
        }

 public Vector<double> Minimize(Vector<double> c,Matrix<double> C,Vector<double> x0, out bool unbounded)
        {
            unbounded = false;
            Vector<double> x = x0;
            int n = c.Count;
            bool[] J = new bool[n];//builds array of falses
            Matrix<double> GS = Matrix<double>.Build.DenseIdentity(n);
            Vector<double> g = c + C * x;
            for(int j = 0; j < n; j++)
            {
                //Step 1:
                bool init = true;
                foreach(bool index in J) { init = init && index; }
                if (init) { return x; }

                double max = 0;
                int k = -1;
                for(int i =0; i < n; i++)
                {
                    if (!J[i] && Math.Abs(g.DotProduct(GS.Column(i))) > max)
                          { max = Math.Abs(g.DotProduct(GS.Column(i)));k = i; }
                }

                if (g.DotProduct(GS.Column(k)) == 0) { return x; }
                Vector<double> s = GS.Column(k);
                if (g.DotProduct(GS.Column(k)) < 0) { s *= -1; }

                //Step 2:
                double normSquaredOfs = (s.ToRowMatrix() * C * s.ToColumnMatrix())[0,0];
                if (Math.Abs(normSquaredOfs) < .0000001) { unbounded = true; return x; }
                double t = g.DotProduct(s) / normSquaredOfs;

                //Step 3:
                x -= t * s;
                g = c + C * x;
                Vector<double> d = 1/ Math.Pow(normSquaredOfs, 0.5)*(C * s);
                GS = GramSchmidt(GS, d, k, J);
                J[k] = true;

            }

            return x;
        }

Finding an orthonormal basis with respect to a symmetric positive definite form

Let $C$ be an $n\times n$ symmetric positive definite matrix. Define an inner product on $\mathbb R^n$ by the formula $\langle x,y\rangle =x^TCy$.

 In the last post we wanted to find a set of conjugate directions $s_1,\ldots,s_n$ which were a set of vectors in $\mathbb R^n$ such that $\langle s_i,s_j\rangle=\delta_{ij}$, where $\delta$ is the Kronecker delta function. To do this we start with any basis of $\mathbb R^n$ $b_1,\ldots,b_n$ and slowly improve it. 

At the $k$th stage of the process we will have a list of vectors $s_1^{(k)},\ldots s_n^{(k)}$ where the first $k$ vectors span $S_k$ and the last $n-k$ vectors span $W_k$. The vectors $s_1^{(k)},\ldots s_k^{(k)}$ form an orthonormal basis for $S_k$ and $\mathbb R^n = S_k\oplus W_k$ is an orthogonal decomposition. 

The basic process (called the Modified Gram-Schmidt process) is as follows. Let $v_1,\ldots,v_k$ be a set of independent vectors.
  • Let $\hat{v}_1 = v_1/\sqrt{\langle v_1,v_1\rangle}$.
  • Let $\hat{v}_i = v_i-\langle v_i,\hat{v}_1\rangle\hat{v}_1$ for all $i\geq 2$.
Then $\hat{v}_1$ is orthogonal to the subspace generated by $\hat{v}_2,\ldots, \hat{v}_k$.

$$\langle \hat{v}_i,\hat{v}_1\rangle = \langle v_i,\hat{v}_1\rangle - \langle v_i,\hat{v}_1\rangle\langle \hat{v}_1,\hat{v}_1\rangle$$
$$\hspace{4em}=\langle v_i,\hat{v}_1\rangle - \langle v_i,\hat{v}_1\rangle\cdot 1=0.$$

This process iterates as follows. Suppose we have a decomposition $S_k\oplus W_k$. Then use this procedure to find a basis for $W_k$ of the form $c_{k+1},\ldots,c_n$ where $c_{k+1}$ has norm $1$ and is orthogonal to the rest of the vectors. Then $c_{k+1}$ is orthogonal to $S_k$ as it lies in $W_k$, so it can be added to $S_k$ to get a new decomposition $S_{k+1}\oplus W_{k+1}$.

In the next post we will use this algorithm to solve the unconstrained quadratic minimization problem. There are a couple things to note. The conjugate directions that are produced this way have a choice of sign, and we will need to pick the right sign when implementing the algorithm. The other thing to note it that we started from the first vector in the list and then steadily moved from left to right in creating the orthonormal basis. However, we have a choice of what order to do this in, and the algorithm will be set up in such a way as to pick certain directions as making the most progress. As far as I can tell this is not important because the unconstrained algorithm will always take $n$ steps to reach a solution (unless you happen to get lucky in your choice of initial point). My guess is that keeping things flexible will help to generalize the argument later.


Thursday, July 14, 2022

Quadratic Programming: The unconstrained case

In order to build our understanding of how quadratic programming works, we are first going to look at the unconstrained case:

  • Minimize $f(x) = c^Tx+\frac{1}{2} x^TCx$ where $C$ is a positive definite matrix.
Note this is a global optimization problem and can be solved using calculus. The minimum occurs where the gradient vanishes:
$$\nabla f = c^T+Cx = 0.$$
So one could simply solve this linear system in whatever way one wishes and be done with the problem, but following Best's textbook, we will present a basic algorithm that we can build off of in the constrained cases.

The basic idea is as follows. Start with any initial approximation $x_0$, and suppose we have a search direction in the form of a vector $s_0$. We want to walk along this search direction as far as possible to minimize $f(x)$.

Let $g_0=\nabla f(x_0)$. Suppose $g_0^Ts_0\neq 0$. Replace $s_0$ by $-s_0$ if necessary so that $g_0^Ts_0>0$. Looking along the line $x_0-t s_0$, we write the Taylor Series as
$$f(x_0-ts_0) = f(x_0) -tg_0^Ts_0+\frac{1}{2}t^2s_0^TCs_0$$

Note that $s_0^TCs_0>0$ since $C$ is a positive definite matrix. So this function has a minimum, which can be found by taking the derivative with respect to $t$:
$$\frac{d}{dt}f(x_0-ts_0)=-g_0^Ts_0+ts_0^TCs_0=0$$
$$t=\frac{-g_0^Ts_0}{s_0^TCs_0}=\frac{-\nabla f(x_0)^Ts_0}{s_0^TCs_0}.$$
This quantity is called the optimal step size and tells you how far to walk along the ray to minimize the objective function on that ray.

One can iterate this by choosing a new search direction, so the question becomes how to choose these directions. A convenient choice of search directions is given by the concept of conjugate directions.

Definition: A set of vectors $s_1,\ldots,s_k$ is said to be conjugate (with respect to $C$) if $s_i^TCs_j=0$ for $i\neq j$ (and $s_i^TCs_i>0$). 

Theorem: Let $s_0,\ldots,s_{n-1}$ be conjugate directions, and $x_0$ be arbitrary. Construct $x_1,\ldots,x_n$ where $x_i=x_{i-1}-t_{i-1}s_{i-1}$, with $t_{i-1}=\frac{-\nabla f(x_{i-1})^Ts_0}{s_{i-1}^TCs_{i-1}}$. Then
  • $\nabla f(x_{j+1})s_i=0$ for $0\leq i\leq j\leq n-1$
  • $x_n$ is the optimal solution for the QP.
In other words, at each iteration the gradient becomes orthogonal to more search directions. In the end it is orthogonal to all of them, and since they span $\mathbb R^n$ (exercise) the gradient must be zero.

Proof:
First we want to show that $\nabla f(x_{j+1})^T s_j = 0$. To see this note that
$$\nabla f(x_{j+1})=\nabla f(x_j) -t_jCs_j$$ by Taylor series, and taking the inner product with $s_j$ we get
$$\nabla f(x_{j+1})^Ts_j=\nabla f(x_j)^Ts_j -t_js_j^TCs_j=0$$ by definition of $t_j$.

Now choose $i<j\leq n-1$. Again by Taylor series, we have
$$\nabla f(x_{j+1})=\nabla f(x_{i+1})+C(x_{j+1}-x_{i+1}).$$
Taking the inner product with $s_i$:
$$\nabla f(x_{j+1})^Ts_i=\nabla f(x_{i+1})^Ts_i+(x_{j+1}-x_{i+1})^TCs_i.$$
$$\hspace 4em = (x_{j+1}-x_{i+1})^TCs_i.$$
$$\hspace 4em = -t_{i+1}s^T_{i+1}Cs_i-\cdots-t_js^T_jCs_i.$$
Each term is equal to $0$ by definition of conjugate directions. $\Box$

The only missing ingredient is identifying a set of conjugate directions for a matrix $C$. We will do this in the next post.

Quadratic Programming

 In my next series of posts, I would like to start exploring the quadratic programming problem. This is similar to the linear case, except the objective function is allowed to be a convex quadratic function 

$$f(x) = c\cdot x+\frac{1}{2} x^TCx$$

where $C$ is a positive semidefinite matrix. We are still considering the constraints to be linear, and in fact, following Best's textbook, we will assume they are inequality constraints of the form $Ax\leq b$. 

Definition: Let $x_0\in\mathbb R^n$. We say a constraint $a_i\cdot x\leq b_i$ is active at $x_0$, if $a_i\cdot x_0=b_i$.

In other words, $x_0$ lies on the facet of the feasible region defined by the $i$th constraint. 

Definition: Let $I(x_0)$ be the set of indices of active constraints for $x_0$. We say $x_0$ is quasi-stationary if $x_0$ is the optimal solution for the problem

  • Minimize $c\cdot x+\frac{1}{2} x^TCx$
  • Subject to $a_i \cdot x=b_i, i\in I(x_0)$

A couple of things to note: any corner of the feasible region is quasi-stationary since the active region is just a point. 

Lemma: An optimal solution for a QP must be quasi-stationary.

Proof: Let $I$ be the set of active constraints for an optimal point $x_0$. Note that $I$ could be empty.

We need to show that $x_0$ is optimal on the entire facet defined by its active constraints. This is obvious, due to convexity of the objective function. If there were a point with smaller objective function, just walk toward it until you hit a facet of higher codimension, meaning that you had not included all active constraints. $\Box$

So we have a very similar situation to the linear programming case. Instead of looking at basic variables, we look at active constraints. Just as in the LP case, a crude but inefficient algorithm is available. Just look at all possible subsets of the constraints, solve them as equalities, and then find the global minimum of the objective function on the hyperplane defined by that facet.

We will conclude this first post on quadratic programming by deriving tests for a point to be quasi-stationary or optimal. 

Suppose we have a quasi-stationary point $x_0$. Recall $\nabla f$ is the vector which gives the direction of fastest increase of $f$. Because $x_0$ is optimal for its set of active constraints, $\nabla f$ must be orthogonal to the hyperplane given by those constraints. Hence $\nabla f$ is in the linear span of $\{a_i\,|\,i\in I(x_0)\}$:

$$-\nabla f = \sum_{i\in I(x_0)} u_i a_i,$$

where we introduce the negative sign for later convenience. If $x_0$ is actually optimal, then the direction of fastest increase must point into the feasible polytope, which means that $u_i\geq 0$ in the above.

A convenient way to encode this is as follows:

Theorem: A point $x_0$ is quasi-stationary for the QP 

$$\min\{f(x)\,|\, a_i\cdot x\leq b_i\}$$ if and only if

  • $x_0$ lies in the feasible region $R$: $\forall i\,\,a_i\cdot x_0\leq b_i $
  • $-\nabla f(x_0) = \sum_{i=1}^m u_ia_i$ for some scalars $u_i\in\mathbb R$
  • $\forall i\,\, u_i(a_i\cdot x_0-b_i)=0$
The last condition gives an alternative: either $u_i=0$ or $i\in I(x_0),$ so is just a reformulation of the statement that the gradient is a linear combination of the active constraint coefficients.

Theorem: A point $x_0$ is optimal if in addition to the above three conditions, each $u_i$ is nonnegative. Writing this in matrix notation:
  • $Ax_0\leq b$
  • $-\nabla f(x_0) = A^Tu$ for some $u\geq 0$
  • $u^T(Ax_0-b)=0$


Wednesday, July 13, 2022

Linear programming versus row echelon form

 As mentioned in my previous post, my plan was to illustrate the LP algorithm we have developed so far over the last several posts with a real world example. I actually did take an example from a project I am working on with a coefficient matrix $A$ of dimensions $5\times 14$. Obviously this must be a massively overdetermined system, but at the moment, my program simply feeds it through the steps by first constructing a phase 1 and then an auxiliary LP.

This did not seem like a great example to show because several rows were simply zero and other rows were copies or negations. So I decided to row reduce it first. However, one thing I noticed Wikipedia ominously mention, is that Gaussian elimination is numerically unstable, and indeed when I did naive elimination, the resulting system was inconsistent. I was able to get results if I used the CoerceZero() method first, which coerces any elements of a matrix sufficiently close to 0 to actually be 0. However, I also noticed a couple elements of the row echelon form were absurdly big. 

So I actually think that row reducing the matrix is actually a bad idea due to the numerical instability. It's possible a more sophisticated approach using the $LU$ decomposition might be better, but in fact, the program works great at the moment without doing the reduction, so perhaps its okay to leave well enough alone, although I will probably at least remove the all $0$ rows. 

I would also like to note that the algorithm for computing row echelon form which I found at Rosetta Code does not appear to be correct, to add insult to injury.




Linear programming - the auxiliary LP

 Yesterday, I wrote a series of posts explaining how to solve a linear programming problem of the form

  • Minimize the objective function $f(x)=c\cdot x$
  • subject to the constraints
    • $Ax=b$ where $A$ is an $m\times n$ matrix
    • $x\geq 0$
The simplex algorithm works great, but you need an initial basic feasible solution to get started. Luckily, one can construct a Phase I program to decide if any feasible solutions exist, and provide such a solution if so. The last remaining piece is to turn this into a basic feasible solution, which we can do with the help of an auxiliary LP.

Recall that the phase I LP is the following

Phase 1 LP (standard)
Introduce new variables $y=(y_1,\ldots,y_n)$.
  • Minimize $\sum_{i=1}^n y_i=e\cdot y $ where $e=(1,\ldots,1)$
  • subject to
    • $Ax+Iy=b$
    • $x,y\geq 0$
If we have an optimal solution $(x^*,y^*)$ with $y^*=0$, then $x^*$ is a solution to the original problem. If all of the $y^*$ variables are non-basic, then we can easily construct a bfs for the original LP. However it is quite possible that this is not the case. To remedy this, we introduce the auxiliary LP.

Auxiliary LP
Start with the phase 1 LP and add one new variable $z$.
  • Minimize $c\cdot x$
  • subject to
    • $Ax+Iy=b$
    • $e\cdot y+z = 0$ 
    • $x,y,z\geq 0$
The auxiliary LP is useful for the following reasons:
  • Fact 1: If $(x,y,z)$ is optimal for the auxiliary LP, then $x$ is optimal for the original LP.
  • Fact 2: If the auxiliary LP is unbounded, then the original LP is unbounded.
  • Fact 3: If $(x^*,0)$ is an optimal solution to the phase 1 LP, then $(x^*,0,0)$ is a bfs for the auxiliary LP. The basic variables are the same as those for $(x^*,0)$ with $z$ added.
Facts 1 and 2 basically come from the fact that any solution to the auxiliary LP must have $y=0$ and $z=0$, because the sum of nonnegative terms $e\cdot y+z$  is equal to $0$. 

The main thing about Fact 3 is to check that the new matrix, after appending the column $\left(\begin{matrix} 0\\ 1\end{matrix}\right)$, is invertible, which I leave as an exercise.

So now we have a complete algorithm to solve LPs of the form indicated at the start of the post. I've implemented this algorithm in C# and it works quite well. The one issue that I kept grappling with is the issue of thresholds, and I essentially just kept tinkering with them until the program stopped giving me error messages. The linear systems I have been dealing with are often highly degenerate, and one thing I am picking up from these posts is that I may want to row reduce the coefficient matrix and throw away extraneous rows before implementing the algorithm. 

So now the question is, what's next? I would like to understand how to do quadratic programming, but I would also like to run through at least one real life (but smallish) example of the LP algorithm we just described both for illustrative purposes, but also so I can see how I might be able to improve it.

Tuesday, July 12, 2022

Linear Programming - Phase 1 algorithms

 In my previous few posts I discussed solving the LP

  • Minimize $c\cdot x$
  • Subject to
    • $Ax=b$ and 
    • $x\geq 0.$
Here $A$ is an $m\times n$ matrix, while $c$ and $x$ are $n$-vectors.

We went through the basic simplex algorithm, but it required an initial basic feasible solution to get started. In this post I want to discuss two algorithms for finding a feasible solution.

We begin with what appears to be the standard Phase 1 LP, taken from Solow's textbook.

Phase 1 LP (standard)
Introduce new variables $y=(y_1,\ldots,y_n)$.
  • Minimize $\sum_{i=1}^n y_i=(1,\ldots,1)\cdot y $
  • subject to
    • $Ax+Iy=b$
    • $x,y\geq 0$
  • Initial bfs: $B=I$ is formed from the last $m$ columns of  the coefficient matrix $[A \, I]$.
Theorem: Let $(x^*,y^*)$ be an optimal solution to the phase 1 LP. Then the original LP is feasible iff $y^*=0$. In that case, $x^*$ is a solution.
Proof:
Clearly if $y^*=0$, then $b=Ax^*+Iy^*=Ax^*$, so $x^*$ is feasible for the original LP. Moreover, $x^*\geq 0$ since $(x^*,y^*)$ is feasible for the phase 1 LP.

Now suppose that the original LP is feasible with solution $x^*$. Then $(x^*,0)$ is feasible for the phase 1 LP. Moreover the objective function $f(x,y)=\sum y_i$ is zero on this solution. Moreover $f(x,y)>0$ if $y\neq 0$, so that means any optimal solution must have $y^*=0$. $\Box$

This gives us a very pretty solution to the question of when the feasible region is nonempty, but if we are to use it to continue with the simplex algorithm, we still need to have a bfs for the original LP, not just a solution. We will return to this question later when we talk about auxilliary LPs. For now, let us look at another Phase 1 LP from Best's book on Quadratic Programming.

The set up for this algorithm is slightly more general. We are looking at a region expressed by the constraints
  • $a_i^Tx\leq b_i$ for $i=1,\ldots, m$
  • $a_i^Tx = b_i\geq 0$ for $i=m+1,\ldots,m+r$
The question is whether the region carved out by these constraints is nonempty.

Phase 1 LP (Best's Algorithm)
Let $d=-\sum_{i=m+1}^{m+r}a_i$, and introduce a single additional variable $\alpha$. 
  • Minimize $d\cdot x+\alpha$.
  • Subject to:
    • $a_i^Tx-\alpha \leq b_i$ for $i=1,\ldots m$
    • $a_i^Tx \leq b_i$ for $i=m+1,\ldots,m+r$
    • $-\alpha\leq 0$
Best says that taking $x=0$ and $\alpha$ sufficiently large is a  feasible solution to the phase I LP, which is true, but he does not say that it is a basic feasible solution. Indeed, this is an inequality constrained LP, so the discussion from our previous posts does not apply. In his book, Best develops an algorithm for optimizing a quadratic objective function subject to both equality and inequality constraints, which he specializes to the case of an linear objective function. I will have to dig into that algorithm more to see what initial data is required. Perhaps simply a feasible solution is enough.
 
Leaving these concerns aside, let us check to see if Best's algorithm really does determine if the original LP is feasible.

Proposition: Best's Phase 1 LP is bounded from below. Hence an optimal solution exists. 
Proof:
The objective function 
$$d\cdot x+\alpha=\alpha-\sum_{i=m+1}^{m+r}a_i\geq \alpha -\sum_{i=m+1}^{m+r}b_i\geq-\sum_{i=m+1}^{m+r}b_i.$$
$\Box$

Theorem: Let $(x^*,\alpha^*)$ be an optimal solution to Best's Phase 1 LP. Then the initial LP is feasible iff $\alpha^*=0$.

Proof: We start with the lower bound on the objective function 
$$d\cdot x^*+\alpha^*\geq -\sum_{i=m+1}^{m+r}b_i.$$
If the initial LP were feasible with solution $x^*$, then $(x^*,0)$ would make the above inequality an equality. Hence if $d\cdot x^*+\alpha^*> -\sum_{i=m+1}^{m+r}b_i$, the initial LP is not feasible. If on the other hand $d\cdot x^*+\alpha^*= -\sum_{i=m+1}^{m+r}b_i$, then $\alpha=0$ and the inequalities $A_ix\leq b$ for $i=m+1,\ldots,m+r$ must actually be equalities, implying $x^*$ is a feasible solution to the original LP. $\Box$

So it looks like Best's Phase 1 algorithm checks out, and is actually quite clever.








Linear Programming: pivoting

 Let us summarize our progress so far. 

  • $x=(x_B,x_N)=(B^{-1}b,0)$ is a basic feasible solution to $Ax=b, x\geq 0$, 
  • $j^*$ is an index such that $d:=(c_N-c_BB^{-1}N)_{j^*}<0$ and
  • $t^*=\min\{-x_i/d_i\,|\, 1\leq i\leq n\text{ and } d_i<0\}$. Let $k^*$ be a choice of index where this minimum is realized.
Then we have concluded that $x+t^*d$ is a feasible solution. In this post we want to show that it is a bfs, and how to calculate the new index sets.

One thing to note: if $t^*=0$ then we aren't actually moving to a different point, but we are moving to a different bfs representation of that same point. In this case the objective function doesn't strictly decrease, which means that if we repeat the basic steps in our algorithm, we might cycle back to the same point. Apparently cycling is quite rare in practice, though one can also preclude its happening through the right choice of pivoting rule - i.e. which $j^*$ and $k^*$ to pick! (One such choice is Bland's Rule.) 

Theorem: With the hypotheses above, $x+t^*d$ is a bfs $(x_{B'},x_N')$ where $B'$ is formed by replacing column $k^*$ of $B$ with column $j^*$ of $N$. 

Proof: By construction, we zeroed out the $x_{k^*}$ variable, so $x_{N'}=0$. If we can show $B'$ is invertible, then automatically $x_{B'}=(B')^{-1}b$ and we are done. To see that $B'$ is invertible, we claim that $B'=BE$ where $E$ is formed by replacing column $k^*$ of the $m\times m$ identity matrix with $-d_B$. 

If $k\neq k*$, $(BE)_{\cdot k}=BE_{\cdot k}=BI_{\cdot k}= B_{\cdot k}$.

On the other hand $(BE)_{\cdot k^*}=BE_{\cdot k^*}=B(-d_B)=B(B^{-1}N_{\cdot j^*})=N_{\cdot j^*}.$ By definition the column vector $N_{\cdot j^*}$ is equal to $B'_{\cdot k^*}$.

Finally we need to show that $E$ is nonsingular. This follows because $(d_B)_{k^*}\neq 0$. So one can use row operations to clear out the rest of the $k^*$ column to get a diagonal matrix with $1$'s on the diagonal, except for an occurrence of $(d_B)_{k^*}$. $\Box$

This operation of updating $B$ to $B'$ is called pivoting.

Now we have all the ingredients for the basic simplex algorithm to minimize an objective function $c\cdot x$ subject to the constraints $Ax=b$ and $x\geq 0$.

  1. Start with an initial bfs $x=(x_B,x_N)=(B^{-1}b,0)\geq 0$.
  2. Compute $c_N-c_BB^{-1}N$.
  3. If  $c_N-c_BB^{-1}N\geq 0$, $x$ is optimal. Terminate program.
  4. Otherwise, select $1\leq j^*\leq n-m$ such that $(c_N-c_BB^{-1}N)_j<0.$
  5. Compute $d_B=-B^{-1}N_{\cdot j^*}.$
  6. If $d_B\geq 0$, the LP is unbounded. Terminate program.
  7. Otherwise, select $1\leq k^*\leq m$ so that $-x_{k^*}/d_{k^*} = \min\{-x_i/d_i\,|\, 1\leq i\leq n\text{ and } d_i<0\}$.
  8. Create a new bfs by replacing the $k^*$ column of $B$ with the $j^*$ column of $N$. 
  9. Go to step 2.

This is all very well, but how do we find that initial bfs? How do we even determine if there is a feasible solution at all? That was actually my primary motivation for looking into this. How can we find a solution to $Ax=b$ such that $x\geq 0$? Let's not even worry about the objective function!

In the next post, I want to introduce what is called the phase I program. This is a linear program formed from the original, which has the property that it has an obvious bfs, and such that the solution can tell us if the initial problem was feasible. In fact, I plan to explore two different constructions. One is given in the book Linear Programming by Daniel Solow. The other is from the book Quadratic Programming with Computer Programs by Michael J. Best. Solow's phase I algorithm looks pretty standard from the various sources I have looked at. Best's algorithm appears to me much more efficient, requiring only one additional variable, and if it works, it seems to be a major improvement. I am a bit skeptical, but hopefully we can sort it out in subsequent posts.

Linear programming: improving a basic feasible solution

 Recall our basic set up. We have a linear program

  • Minimize $c\cdot x$ where $x$ and $c$ are $n$-vectors.
  • subject to the constraints
    • $Ax=b$ ($A$ is an $m\times n$ matrix.
    • $x\geq 0$
and we have decided to look for basic feasible solutions, which correspond to partitioning the index set of columns of $A$ into the $B$ and $N$ indices. Somewhat abusing notation, $B$ and $N$ are also the submatrices of $A$ formed by these columns. $x$ is said to be a basic feasible solution if $x_N=0$, and $x_B=B^{-1}b\geq 0$.

In the last post we saw that if $x$ passes the test for optimality $c_N-c_BB^{-1}N\geq 0$, then it is optimal. In this post we consider what happens if our bfs fails the test. Our first step will be to find a direction $d$ along which the objective function decreases. For calculus aficionados, denoting the objective function by $f(x)=c\cdot x$, the directional derivative in the direction of $d$, is just given by $c\cdot d$. Hence we want to find a direction $d$ such that $c\cdot d<0$. (We also need the direction to point into the polytope.)

Theorem: If $x$ is a bfs which fails the test for optimality, let $j$ be an $N$-index  such that $1\leq j\leq n-m$ and $(c_N-c_BB^{-1}N)_j<0$. Let $d=(d_B,d_N)=(-B^{-1}N_{\cdot j},I_{\cdot j})$, where $I$ is the $(n-m)\times(n-m)$ identity matrix. Then $c\cdot d<0$.
Proof:
$$c\cdot d = c_Bd_B+c_Nd_N=-c_BB^{-1}N_{\cdot j}+c_NI_{\cdot j}$$
$$\hspace 4em = (c_N-c_BB^{-1}N)_j<0$$
$\Box$

So we have a direction $d$, possibly several. Let $j^*$ be some choice of index such that $(c_N-c_BB^{-1}N)_j<0$ which gives us this direction. Next we must determine how far to go in the direction $d.$ That is, if we have a bfs $x$, and a direction $d$, how big can we make $t$ so that $x+td$ remains in our feasible polytope. One nice thing is that the equality constraints $Ax=b$ remain true along the entire line if $d$ is of the special form $(-B^{-1}N_{\cdot j},I_{\cdot j})$.

Lemma: Suppose $x=(x_B,x_N)=(B^{-1}b,0)$ is a bfs for $Ax=b$ and $d=(d_B,d_N)=(-B^{-1}N_{\cdot j},I_{\cdot j})$ for some $j$, then for all $t$ $A(x+td)=b$.
Proof:
It suffices to show that $Ad=0$. 
$$Ad=Bd_B+Nd_N=B(-B^{-1}N_{\cdot j})+NI_{\cdot j}=N_{\cdot j}-N_{\cdot j}=0.$$
$\Box$

So we need to find the maximal $t$ so that the nonnegativity constraints $x\geq 0$ are satisfied.

Lemma: Suppose $x\geq 0$ is a solution for $Ax=b$ and $d$ is a direction having at least one negative component, then
$$t^* = \min\{-x_i/d_i\,|\, 1\leq i\leq n\text{ and } d_i<0\}$$ has the property that $x+t^*d\geq 0$. In particular, if $x,d$ are as in the previous lemma, then $x+t^*d$ is feasible.

If $d\geq 0$, the LP is unbounded: the entire ray $x+td, t\geq 0$ lies in the feasible region and $f$ is decreasing along it.

Proof of lemma: 
By the previous lemma, we just need to satisfy the inequality constraints $x\geq 0$. Let us check for each coordinate $x_i$. If $d_i\geq 0$, then $(x+td)_i = x_i+td_i\geq 0$ for all $t>0$. Otherwise suppose $d_i<0$. Then $(x+t^*d)_i = x_i+t^*d_i$, and we know $t^*\leq -x_i/d_i$ by definition, so $t^*d_i\geq -x_i$ and $x_i+t^*d_i\geq x_i-x_i=0$.
$\Box$

In summary, we have chosen a direction along which the objective function decreases that lies in the affine subspace $Ax=b$. Then we have calculated the maximum distance one can travel along this ray and stay in the positive orthant. 

In the next post we will reinterpret this operation combinatorially and show that $x+t^*d$ is also a bfs.

Linear programming and the test for optimality.

[Note: I am using Linear Programming by Daniel Solow as a reference for this and future posts. This is an economical Dover Book which I highly recommend!]

Our goal in this post is to describe an algorithm for solving a particular type of linear program, one which is an optimization problem of the following form:

  • Minimize $f(x)=c\cdot x$ 
  • Subject to
    • $Ax=b$
    • $x\geq 0$

Let's just take a step back and think about this problem for a second. $Ax=b$ is some affine subspace of $\mathbb R^n$, and we are considering its intersection with the positive orthant. This gives us some higher dimensional polytope, and we are trying to identify the point that minimizes $f$. Since $f$ is a linear function, such a point, if it exists, will have to be one of the vertices of this polytope. Unfortunately, a polytope can have exponentially many vertices in the number of constraints, so any method that seeks to enumerate all vertices first is going to be very slow.

I am going to outline a method, due to Dantzig, called the simplex method. There are two ingredients of this method that make it work:

  • There is a simple test to check if a point is optimal. So if you happen to land on an optimal point, you can easily check if you are done.
  • If you are not on an optimal point, you can find a direction where $f$ decreases, and walk along an edge of the polytope to find a new point. 
You do need an initial vertex to get the algorithm started (more on that later), but once you have it, you can keep walking along edges, decreasing $f$ each time, until you find the minimum.

Next lets think about what a vertex of this polytope is combinatorially. To begin, we will assume that the solution set to $Ax=b$ is generic in the sense that the rows of $A$ are linearly independent. That means it carves out a codimension $m$ affine subspace of $\mathbb R^n$. Generically, this will intersect an $m$ dimensional subspace in a point, so the corners of our polytope will come from setting $n-m$ coordinates equal to $0$. And indeed, if we set $n-m$ variable equal to $0$, we are left with the same number of unknowns as equations, and we expect there to be a unique solution generically. This motivates the definition of what is known as a "basic feasible solution" or bfs for short.

Definition: A basic feasible solution (bfs) to the linear system $Ax=b, x\geq 0$, is a vector $x$ such that
  • there is a nonsingular $m\times m$ submatrix of $A$ formed by choosing $m$ columns, denoted by $B$, and the remaining columns are denoted by $N$ such that
  • $x_B=B^{-1}b$
  • $x_B\geq 0$, and
  • $x_N=0$

Here $x_B$ and $x_N$ are the projections of $x$ onto the subspace spanned by the $B$ and $N$ columns of $A$ respectively. 

Again, a bfs is generically nothing more than a vertex of the solution polytope in the first orthant. If $Ax=b$ is not generic, one could first row reduce the system and throw out the $0$ rows. There are other options available as well to create a system with an initial bfs.

Let us return now to the two ingredients I mentioned above, a test for optimality, and a method to determine a direction to walk if the current point is not optimal.

Test for optimality: Consider the linear program to minimize $f(x)=c\cdot x$ subject to $Ax=b, x\geq 0$. Let $x^*$ be a bfs with $(x_B,x_N)=(B^{-1}b,0)\geq 0$. If $$c_N-c_BB^{-1}N\geq0$$ then $x^*$ is optimal for this LP.

Proof:

Let $x=(x_B,x_N)$ be some other feasible solution. We want to show $c\cdot x\geq c\cdot x^*$.

$$ c\cdot(x-x^*) = c_B(x_B-x_B^*)+c_N(x_N-x_N^*) $$

$$\hspace{2em}=c_B[B^{-1}(b-Nx_N)-B^{-1}b]+c_N(x_N-0)$$

Here we use that $x$ is feasible, so $Ax=Bx_B+Nx_N=b$, and solving for $x_B$, we get $x_B=B^{-1}(b-Nx_N)$.

Continuing,

$$c_B[B^{-1}(b-Nx_N)-B^{-1}b]+c_N(x_N-0)=c_B[B^{-1}b-B^{-1}Nx_N-B^{-1}b]+c_Nx_N$$

$$\hspace{4em}=-c_BB^{-1}Nx_N+c_Nx_N$$

$$\hspace{4em}=(c_N-c_BB^{-1}N)x_N.$$

So to reiterate, for any feasible solution $x$, and any bfs $x^*$, we have the equation

$$c\cdot(x-x^*)=(c_N-c_BB^{-1}N)x_N.$$

By hypothesis, $(c_N-c_BB^{-1}N)\geq 0$, and by feasibility $x_N\geq 0$. So if $x^*$ passes the optimality test, then $c\cdot x\geq c\cdot x^*$ for any feasible solution. Hence it really is optimal. $\Box$

Now, what about the converse? Can we show that if a feasible solution is optimal, then it is a basic feasible solution satisfying the optimality test? This is a good question which we will return to later. To get some intuition, let's do an example.

Example: Minimize $x+2y+3z,$ subject to $x+y+z=1$ and $x,y,z\geq0$. 

There are $3$ basic feasible solutions. $(1,0,0)$, $(0,1,0)$ and $(0,0,1)$. In each case $B$ is the $1\times 1$ matrix with a $1$ in it. So $B=B^{-1}=(1)$, while $N=(1,1)$. For the first basic feasible solution we have $c_N=(2,3)$ and $c_B=(1)$. So we get

$$c_N-c_BB^{-1}N=(2,3)-(1)*1*(1,1)=(1,2)\geq 0.$$

So indeed $y=z=0$ gives an optimal solution. It is interesting to note that this works even in the degenerate case when we consider $x=y=z=0$. 

This seems like a good place to stop for this post. In the next post, we'll look at how to find an improved bfs if your existing bfs fails the test for optimality.



Solving linear systems, the pseudoinverse, and linear programming.

Suppose we want to solve a system of linear equations $Ax=b$. There are many ways to go about this. If $A$ happens to be a square invertible matrix, you can just write $x=A^{-1}b$. If $A$ is not square, or is not invertible, then we have to look for other methods. One trick that often works is to multiply both sides of the equation $Ax=b$ by the transpose of $A$, to get

$$A^TAx=A^Tb$$

The matrix $A^TA$ is a square matrix, and in good cases, it is actually invertible, so we have

$$x=(A^TA)^{-1}A^Tb.$$

This is actually a special case of the Moore-Penrose pseudoinverse of a matrix $A^+$. When $A^TA$ is invertible, we have that $A^+=(A^TA)^{-1}A^T$, but $A^+$ is defined for any matrix.

It is not my intention to get deep into the theory of pseudoinverses, but I do want to point out some important properties. (See wikipedia for more detail.)

  • The system $Ax=b$ has at least one solution if and only if $AA^+b=b$.
  • If a solution exists, $x=A^+b$ is the solution with smallest Euclidean norm. (i.e. it is closest to the origin.)
  • If a solution exists, one can obtain all solutions by $x=A^+b+(I-A^+A)w$ for an arbitrary vector $w$.  

Calculation of pseudoinverses is widely supported in software libraries. For example, in C#, I use the MathNet numerics package, and it is a convenient way to solve linear systems.

Instead of finding the solution closest to the origin, what if you want to find the solution closest to some other point $x_0$?

Exercise: Let $A$ be an $m\times n$ matrix. What is a method for finding the solution to a system of equations $Ax=b$ which is closest to a given $x_0\in\mathbb R^n$.

Solution: Subtract $Ax_0$ from both sides of the equation to get: $$A(x-x_0)=b-Ax_0.$$

Then $x-x_0=A^+(b-Ax_0)$ is the solution such that $x-x_0$ is closest to $0$. Thus $$x=A^+(b-Ax_0)+x_0$$ is closest to $x_0$. $\Box$

We have just solved a simple quadratic optimization problem, that of minimizing the distance to a point of a solution to a linear system. 

Let us consider a slightly more complex question. 

Problem: Minimize the distance to a point $x_0$ of a solution to $Ax=b$, assuming each coordinate of $x$ is nonnegative.

This is a more difficult question, and even the simpler feasibility question is not so obvious:

Feasibility Problem: Does there exist a solution to $Ax=b$ where each coordinate of $x$ is nonnegative?

The feasibility problem can be solved using the technique of linear programming, which for our purposes means that we want to minimize some linear objective function $f\colon \mathbb R^n\to\mathbb R$ subject to the constraints $Ax=b$ and $x\geq 0$. (Although it looks like a typo, the convention is that $x\geq0$ means each coordinate of $x$ is nonnegative.) At first glance, this seems like a harder problem than determining feasibility, but bear with me. Once we have set up the algorithm for solving linear programs of the above form, we will be able to easily apply it to the feasibility issue.

In order to make these posts as readable as possible, I'll stop here and continue in the next post to talk about linear programs.