Friday, August 19, 2022

The True Centroid of a Polygon

My apologies for taking a break from the quadratic programming thread; I will be coming back to it. For now I thought I would record a method for finding the true centroid of a planar polygon, given as a sequence of vertices that traverse the boundary in order. A quick and dirty approximation is to simply take the barycenter of the vertices. This is often good enough, and for convex polygons it is guaranteed to lie in the interior of the polygon, but consider the following example. Take any polygon and split one of its vertices into two vertices connected by a very short edge. The new polygon has approximately the same centroid, but the barycenter of its vertices can change quite a bit, since the split vertex now exerts twice the pull on the barycenter that it did before.

Luckily there is a very simple way to get the centroid, at least for planar polygons. I believe the formulas should generalize to higher dimensional embeddings of polygons, but have not given the matter sufficient thought. We start with Green's Theorem 

$$\iint_R \left(\frac{\partial M}{\partial x}-\frac{\partial L}{\partial y}\right)dx\,dy = \oint_{\partial R} L\,dx+M\,dy$$

for integrating a 1-form on the boundary of a region. 

As a warm-up, let $M=x$ and $L=0$. Then we discover that the area of the polygon is the integral

$$\oint_{\partial R} x\,dy$$ along its boundary. This integral can be evaluated on each line segment comprising the boundary to get a simple combinatorial formula.

Let $p_0=\left(\begin{matrix} x_0\\y_0\end{matrix}\right)$ and $p_1=\left(\begin{matrix} x_1\\y_1\end{matrix}\right)$. Then the path connecting them is $(1-t)p_0+tp_1$ for $0\leq t \leq 1$.  

Then $dy = (y_1-y_0)\,dt$ and $x= (1-t)x_0+tx_1$, so we get

$$\int_0^1 ((1-t)x_0+tx_1)(y_1-y_0)\,dt= \frac{1}{2}(x_1+x_0)(y_1-y_0).$$

This gives us a beautiful formula for area. If the polygon has $n$ points $p_1,\ldots,p_n$ we have

$$A=\frac{1}{2}\sum_{i=1}^n(x_i+x_{i+1})(y_{i+1}-y_i)$$

where the indices are read modulo $n$: $x_{n+1}=x_1$ and $y_{n+1}=y_1$. As an exercise, see if you can find two more similar formulas for area with different choices of $L$ and $M$ above. These types of formulas are the basis for mechanical planimeters, which calculate area by tracing around the perimeter.
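As a quick illustration, here is a minimal C# sketch of this formula (my own, not from the post); the vertices are assumed to be listed in order around the boundary, and the result is the signed area, positive for a counterclockwise traversal.

public static double PolygonArea(double[] xs, double[] ys)
{
    int n = xs.Length;
    double sum = 0;
    for (int i = 0; i < n; i++)
    {
        int j = (i + 1) % n;                        // indices read modulo n
        sum += (xs[i] + xs[j]) * (ys[j] - ys[i]);   // (x_i + x_{i+1})(y_{i+1} - y_i)
    }
    return 0.5 * sum;
}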

Returning to the centroid, we need to calculate the average position of both $x$ and $y$:

$$\overline{x} = \frac{1}{A}\iint_R x\,dx\,dy,\,\,\,\overline{y} = \frac{1}{A}\iint_R y\,dx\,dy$$

Tackling the $x$ coordinate, let $M=\frac{x^2}{2}$ and $L=0$. Then we have

$$\iint_R x\,dx\,dy = \sum_{i=1}^n\int_0^1 \frac{((1-t)x_i+t x_{i+1})^2}{2}(y_{i+1}-y_i)\,dt$$

$$\iint_R x\,dx\,dy = \frac{1}{6}\sum_{i=1}^n (x_i^2+x_ix_{i+1}+x_{i+1}^2)(y_{i+1}-y_i)$$

In a similar way we can derive the formula

$$\iint_R y\,dx\,dy = \frac{1}{6}\sum_{i=1}^n (y_i^2+y_iy_{i+1}+y_{i+1}^2)(x_{i}-x_{i+1}).$$
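Dividing these sums by $6A$ gives the averages $\overline{x}$ and $\overline{y}$. Here is a hedged C# sketch putting it all together; the function and parameter names are mine, and the vertices are again assumed to traverse the boundary in order.

public static (double X, double Y) PolygonCentroid(double[] xs, double[] ys)
{
    int n = xs.Length;
    double a = 0, sx = 0, sy = 0;
    for (int i = 0; i < n; i++)
    {
        int j = (i + 1) % n;
        a  += (xs[i] + xs[j]) * (ys[j] - ys[i]);                                 // 2A
        sx += (xs[i] * xs[i] + xs[i] * xs[j] + xs[j] * xs[j]) * (ys[j] - ys[i]); // 6 * integral of x
        sy += (ys[i] * ys[i] + ys[i] * ys[j] + ys[j] * ys[j]) * (xs[i] - xs[j]); // 6 * integral of y
    }
    double area = 0.5 * a;
    return (sx / (6 * area), sy / (6 * area)); // both signs flip for clockwise input, so the ratio is unaffected
}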


It does not feel to me like these formulas are unique to planar embeddings of polygons, but the technique of proof, using Green's theorem, will not work for higher ambient dimensions.




Friday, July 15, 2022

Minimizing a convex quadratic with equality constraints

In our last few posts we tackled the question of finding a global minimum for a function $f(x)=c^Tx+\frac{1}{2} x^TCx.$ The heart of the algorithm was finding an orthonormal basis for $\mathbb R^n$ with respect to the inner product $\langle x,y\rangle =x^TCy$.

Now suppose we have a set of equality constraints $Ax=b$. This defines an affine subspace of $\mathbb R^n$, and if we can find an orthonormal basis (with respect to $\langle\cdot,\cdot\rangle$) of the parallel subspace through the origin, $Ax=0$, then we can just use the same algorithm as before.

There is a neat trick for doing this. First, assume that the rows $a_1,\ldots,a_m$ of $A$ are linearly independent. (We will remove that restriction later.) Now add extra rows to $A$ so that its rows form a basis of $\mathbb R^n$, and call the resulting matrix $D$.

Claim: Columns $m+1,\ldots, n$ of $D^{-1}$ form a basis for the subspace $Ax=0$.

Proof:

Suppose $Ax=0$. Then $Dx=(0,\ldots,0,*,\ldots,*)^T$ is $0$ in the first $m$ coordinates. Applying $D^{-1}$ to both sides, we have $x=D^{-1}(0,\ldots,0,*,\ldots,*)^T$, which lies in the span of the last $n-m$ columns of $D^{-1}$. Since $D^{-1}$ is invertible, those columns are linearly independent, and the solution space of $Ax=0$ has dimension exactly $n-m$, so they form a basis. $\Box$

We proceed by applying Gram-Schmidt orthogonalization to the last $n-m$ columns. So to begin, normalize the $(m+1)$st column and subtract off projections of this column from all other columns, including the initial $m$. These column operations on $D^{-1}$ correspond to row operations on $D$ (see below for a simple explanation of this phenomenon), and those row operations only rescale the $(m+1)$st row of $D$ or add multiples of other rows to it; in particular the first $m$ rows, which are the rows of $A$, are untouched. Thus by the above claim, the remaining $n-m-1$ columns of the new $D^{-1}$ still lie in the subspace $Ax=0$, and by construction they are now orthogonal to the normalized $(m+1)$st column.

 Repeating this process with the remaining columns as we move to the right, we get a collection of orthonormal vectors spanning our subspace.

 Thus we can use the algorithm developed in the unconstrained case with very little modification, but with a little more work needed in the initialization. 

  • First, you need to be able to identify a point $x_0$ satisfying the constraints: $Ax_0=b$.
  • Second, you need to add rows to $A$ to get an invertible matrix $D$, and start the algorithm off with $GS=D^{-1}$ rather than just the identity (a sketch of this initialization follows the list).
  • Also as a note, you have to make sure to apply the matrix update rules to the first $m$ columns at each step, not just the columns to the right of the currently active one.
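Here is a rough C# sketch of that initialization (my own, not from the post), assuming MathNet.Numerics.LinearAlgebra and System.Collections.Generic are in scope: extend the rows of $A$ to a basis of $\mathbb R^n$ and invert. How the first $m$ columns are then flagged during the iteration is left to the detailed algorithm promised below.

public Matrix<double> InitialGS(Matrix<double> A)
{
    int m = A.RowCount, n = A.ColumnCount;
    var rows = new List<Vector<double>>();
    for (int i = 0; i < m; i++) { rows.Add(A.Row(i)); }

    // Greedily append standard basis vectors that keep the rows independent.
    // Rank() is used for brevity; a tolerance-based test may be safer numerically.
    for (int i = 0; i < n && rows.Count < n; i++)
    {
        var candidate = Vector<double>.Build.Dense(n);
        candidate[i] = 1.0;
        var trialRows = new List<Vector<double>>(rows) { candidate };
        if (Matrix<double>.Build.DenseOfRowVectors(trialRows).Rank() == rows.Count + 1)
        {
            rows.Add(candidate);
        }
    }

    var D = Matrix<double>.Build.DenseOfRowVectors(rows);
    return D.Inverse(); // columns m+1,...,n span {x : Ax = 0}, per the claim above
}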

 I hope to write this out as a detailed algorithm in my next post, but for now, let us return to the question of how row/column operations affect the inverse.

Lemma: An elementary row operation on a matrix $M$ corresponds to an elementary column operation on $M^{-1}$.

Proof: An elementary row operation is realized by multiplying by an elementary matrix on the left $EM$. Taking the inverse we have $M^{-1}E^{-1}$. The inverse of an elementary matrix is an elementary matrix, and right multiplication is an elementary column operation. $\Box$




Unconstrained quadratic optimization: the algorithm

We finally have all the tools in place to describe a step-by-step algorithm for minimizing a (convex) quadratic objective function without any constraints.

  • Minimize $f(x)= c\cdot x+\frac{1}{2}x^TCx$ where $C$ is a positive semidefinite matrix.
We start with any seed value $x_0\in \mathbb R^n$. Recall that the modified Gram-Schmidt process starts with a set of independent vectors and gradually replaces them with an orthonormal set. We will start with the standard basis $e_1,\ldots,e_n$, recorded as the columns of a matrix $GS_0=I$. We use a vector $J\in\{Y,N\}^n$ to keep track of which columns have been orthonormalized. So to begin we have $J=(N,\ldots,N)$, and as we work on a column to create a conjugate direction, we set that entry of $J$ to $Y$. Set $g_0=\nabla f(x_0)=c+Cx_0$ and $j=0$ to initialize, where $j$ represents the iteration of the algorithm.
  • Step 1: Computation of search direction $s_j$: Let $GS_j=[c_{1j},\ldots, c_{nj}]$ and $J_j=(\alpha_{1j},\ldots,\alpha_{nj})$. If each component of $J_j$ is equal to $Y$, then stop with optimal solution $x_j$. Otherwise, compute the smallest index $k$ such that $$|g_j^Tc_{kj}|= \max\{|g_j^Tc_{ij}|\,:\, \alpha_{ij}=N\}.$$ If $g^T_jc_{kj}=0$, stop with optimal solution $x_j$. Otherwise set $s_j=\operatorname{sgn}(g^T_jc_{kj})c_{kj}.$ Go to step 2.
  • Step 2: Computation of step size $t_j$: Compute $\langle s_j,s_j\rangle$. If it is $0$, then stop. The objective function is unbounded from below. Otherwise set: $$t_j=\frac{g_j^Ts_j}{\langle s_j,s_j\rangle} $$ Go to step 3.
  • Step 3: Updating data: Set
    • $x_{j+1}:=x_j-t_js_j$.
    • $g_{j+1} := \nabla f(x_{j+1}) = c+Cx_{j+1}$
    • $f(x_{j+1}) = c^Tx_{j+1} +\frac{1}{2}x_{j+1}^TCx_{j+1}$
    • $d_j = Cs_j/\sqrt{\langle s_j,s_j\rangle}$, so that $d_j^Tc=\langle s_j,c\rangle/\sqrt{\langle s_j,s_j\rangle}$ for any vector $c$
    • $GS_{j+1}$ is formed from $GS_j$ by 
      • replacing the $k$th column $c_k$ by $c_k/(d_j^Tc_k)$ 
      • replacing all other columns $c_i$ ($i\neq k$) where $\alpha_{ij}=N$ by $c_i-\frac{d_j^Tc_i}{d_j^Tc_k}c_k$
      • leaving columns $c_i$ where $\alpha_{ij}=Y$ unchanged.
    • $J_{j+1}$ is formed from $J_j$ by setting $\alpha_{k}$ to $Y$ and leaving everything else unchanged.
    • Increment $j:=j+1$ and go to step 1!

To summarize, the algorithm works as follows. Pick some point $x_0$. Choose a search direction $s_1$ parallel to one of the coordinate axes that maximizes the gradient slope. Locate the minimum of $f$ on the line through $x_0$ parallel to $s_1$ to get $x_1$. Comb the remaining basis vectors of $\mathbb R^n$ to be orthogonal to $s_1$ with respect to the inner product defined by $C$. Next find a search direction $s_2$ among these combed vectors that maximizes the gradient slope, locate the minimum of $f$ on the line through $x_1$ parallel to $s_2$ to get $x_2$, and so on. After at most $n$ steps the algorithm will terminate.

I went ahead and programmed this into C# using the MathNet.Numerics.LinearAlgebra library and it seems to work quite well.

First we have a helper function that updates the matrix $GS$, followed by the main minimization routine.

public Matrix<double> GramSchmidt(Matrix<double> matrix, Vector<double> d, int k, bool[] J)
{
    Matrix<double> A = matrix.Clone();
    // Rescale column k by 1/(d^T c_k).
    double lambda = d.DotProduct(matrix.Column(k));
    A.SetColumn(k, matrix.Column(k) / lambda);
    // Subtract the projection onto column k from every other column not yet finalized.
    for (int i = 0; i < A.ColumnCount; i++)
    {
        if (!J[i] && i != k)
        {
            double mu = d.DotProduct(matrix.Column(i));
            A.SetColumn(i, matrix.Column(i) - matrix.Column(k) * (mu / lambda));
        }
    }
    return A;
}

public Vector<double> Minimize(Vector<double> c, Matrix<double> C, Vector<double> x0, out bool unbounded)
{
    unbounded = false;
    Vector<double> x = x0;
    int n = c.Count;
    bool[] J = new bool[n]; // builds an array of falses
    Matrix<double> GS = Matrix<double>.Build.DenseIdentity(n);
    Vector<double> g = c + C * x; // gradient at x
    for (int j = 0; j < n; j++)
    {
        // Step 1: pick the remaining column with the largest gradient slope.
        bool allDone = true;
        foreach (bool index in J) { allDone = allDone && index; }
        if (allDone) { return x; }

        double max = 0;
        int k = -1;
        for (int i = 0; i < n; i++)
        {
            if (!J[i] && Math.Abs(g.DotProduct(GS.Column(i))) > max)
            {
                max = Math.Abs(g.DotProduct(GS.Column(i)));
                k = i;
            }
        }

        // If the gradient is orthogonal to every remaining column, x is optimal.
        if (k == -1) { return x; }
        Vector<double> s = GS.Column(k);
        if (g.DotProduct(s) < 0) { s *= -1; }

        // Step 2: compute the step size.
        double normSquaredOfs = (s.ToRowMatrix() * C * s.ToColumnMatrix())[0, 0];
        if (Math.Abs(normSquaredOfs) < 1e-7) { unbounded = true; return x; }
        double t = g.DotProduct(s) / normSquaredOfs;

        // Step 3: update the iterate, the gradient, and the Gram-Schmidt matrix.
        x -= t * s;
        g = c + C * x;
        Vector<double> d = 1 / Math.Sqrt(normSquaredOfs) * (C * s);
        GS = GramSchmidt(GS, d, k, J);
        J[k] = true;
    }

    return x;
}
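For instance, one might call it on a small made-up problem like the following (the numbers here are mine, purely for illustration, and the snippet lives inside whatever class holds the routine):

var C = Matrix<double>.Build.DenseOfArray(new double[,] { { 2, 0 }, { 0, 4 } });
var c = Vector<double>.Build.DenseOfArray(new double[] { -2, -8 });
var x0 = Vector<double>.Build.DenseOfArray(new double[] { 0, 0 });
bool unbounded;
var xStar = Minimize(c, C, x0, out unbounded);
// The minimizer solves c + Cx = 0, so xStar should come out to approximately (1, 2).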

Finding an orthonormal basis with respect to a symmetric positive definite form

Let $C$ be an $n\times n$ symmetric positive definite matrix. Define an inner product on $\mathbb R^n$ by the formula $\langle x,y\rangle =x^TCy$.

In the last post we wanted to find a set of conjugate directions $s_1,\ldots,s_n$, that is, a set of vectors in $\mathbb R^n$ such that $\langle s_i,s_j\rangle=\delta_{ij}$, where $\delta$ is the Kronecker delta function. To do this we start with any basis $b_1,\ldots,b_n$ of $\mathbb R^n$ and gradually improve it.

At the $k$th stage of the process we will have a list of vectors $s_1^{(k)},\ldots,s_n^{(k)}$ where the first $k$ vectors span $S_k$ and the last $n-k$ vectors span $W_k$. The vectors $s_1^{(k)},\ldots,s_k^{(k)}$ form an orthonormal basis for $S_k$ and $\mathbb R^n = S_k\oplus W_k$ is an orthogonal decomposition.

The basic process (called the Modified Gram-Schmidt process) is as follows. Let $v_1,\ldots,v_k$ be a set of independent vectors.
  • Let $\hat{v}_1 = v_1/\sqrt{\langle v_1,v_1\rangle}$.
  • Let $\hat{v}_i = v_i-\langle v_i,\hat{v}_1\rangle\hat{v}_1$ for all $i\geq 2$.
Then $\hat{v}_1$ is orthogonal to the subspace generated by $\hat{v}_2,\ldots, \hat{v}_k$.

$$\langle \hat{v}_i,\hat{v}_1\rangle = \langle v_i,\hat{v}_1\rangle - \langle v_i,\hat{v}_1\rangle\langle \hat{v}_1,\hat{v}_1\rangle$$
$$\hspace{4em}=\langle v_i,\hat{v}_1\rangle - \langle v_i,\hat{v}_1\rangle\cdot 1=0.$$

This process iterates as follows. Suppose we have a decomposition $S_k\oplus W_k$. Then use this procedure to find a basis for $W_k$ of the form $c_{k+1},\ldots,c_n$ where $c_{k+1}$ has norm $1$ and is orthogonal to the rest of the vectors. Then $c_{k+1}$ is orthogonal to $S_k$ as it lies in $W_k$, so it can be added to $S_k$ to get a new decomposition $S_{k+1}\oplus W_{k+1}$.
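As a concrete illustration, here is a minimal C# sketch of this process (my own, not from the post), orthonormalizing the columns of a starting basis $B$ from left to right with respect to $\langle x,y\rangle=x^TCy$; it assumes $C$ is symmetric positive definite and the columns of $B$ are independent.

public Matrix<double> ConjugateDirections(Matrix<double> C, Matrix<double> B)
{
    Matrix<double> S = B.Clone();
    int n = S.ColumnCount;
    for (int k = 0; k < n; k++)
    {
        // Normalize column k in the <.,.> norm.
        Vector<double> v = S.Column(k);
        double norm = Math.Sqrt(v.DotProduct(C * v));
        Vector<double> vhat = v / norm;
        S.SetColumn(k, vhat);

        // Subtract the <.,.>-projection onto vhat from every later column.
        Vector<double> Cvhat = C * vhat;
        for (int i = k + 1; i < n; i++)
        {
            S.SetColumn(i, S.Column(i) - Cvhat.DotProduct(S.Column(i)) * vhat);
        }
    }
    return S; // columns now satisfy s_i^T C s_j = delta_ij
}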

In the next post we will use this algorithm to solve the unconstrained quadratic minimization problem. There are a couple of things to note. The conjugate directions that are produced this way have a choice of sign, and we will need to pick the right sign when implementing the algorithm. The other thing to note is that we started from the first vector in the list and then steadily moved from left to right in creating the orthonormal basis. However, we have a choice of what order to do this in, and the algorithm will be set up in such a way as to pick certain directions as making the most progress. As far as I can tell this is not important, because the unconstrained algorithm will always take $n$ steps to reach a solution (unless you happen to get lucky in your choice of initial point). My guess is that keeping things flexible will help to generalize the argument later.


Thursday, July 14, 2022

Quadratic Programming: The unconstrained case

In order to build our understanding of how quadratic programming works, we are first going to look at the unconstrained case:

  • Minimize $f(x) = c^Tx+\frac{1}{2} x^TCx$ where $C$ is a positive definite matrix.
Note this is a global optimization problem and can be solved using calculus. The minimum occurs where the gradient vanishes:
$$\nabla f(x) = c+Cx = 0.$$
So one could simply solve this linear system in whatever way one wishes and be done with the problem, but following Best's textbook, we will present a basic algorithm that we can build off of in the constrained cases.

The basic idea is as follows. Start with any initial approximation $x_0$, and suppose we have a search direction in the form of a vector $s_0$. We want to walk along this search direction as far as possible to minimize $f(x)$.

Let $g_0=\nabla f(x_0)$. Suppose $g_0^Ts_0\neq 0$. Replace $s_0$ by $-s_0$ if necessary so that $g_0^Ts_0>0$. Looking along the line $x_0-t s_0$, we write the Taylor Series as
$$f(x_0-ts_0) = f(x_0) -tg_0^Ts_0+\frac{1}{2}t^2s_0^TCs_0$$

Note that $s_0^TCs_0>0$ since $C$ is a positive definite matrix. So this function has a minimum, which can be found by taking the derivative with respect to $t$:
$$\frac{d}{dt}f(x_0-ts_0)=-g_0^Ts_0+ts_0^TCs_0=0$$
$$t=\frac{g_0^Ts_0}{s_0^TCs_0}=\frac{\nabla f(x_0)^Ts_0}{s_0^TCs_0}.$$
This quantity is called the optimal step size and tells you how far to walk along the ray to minimize the objective function on that ray.
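As a quick numerical check (an example of my own), take $C=\left(\begin{matrix}2&0\\0&4\end{matrix}\right)$, $c=(-2,-8)^T$, $x_0=(0,0)^T$, and the candidate direction $e_1$. Then $g_0=c$ and $g_0^Te_1=-2<0$, so we flip to $s_0=-e_1$, giving $g_0^Ts_0=2$ and $s_0^TCs_0=2$. The optimal step size is $t=2/2=1$, and $x_1=x_0-ts_0=(1,0)^T$, which is indeed where $\partial f/\partial x_1=-2+2x_1$ vanishes, so $f$ is minimized along that line.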

One can iterate this by choosing a new search direction, so the question becomes how to choose these directions. A convenient choice of search directions is given by the concept of conjugate directions.

Definition: A set of vectors $s_1,\ldots,s_k$ is said to be conjugate (with respect to $C$) if $s_i^TCs_j=0$ for $i\neq j$ (and $s_i^TCs_i>0$). 

Theorem: Let $s_0,\ldots,s_{n-1}$ be conjugate directions, and $x_0$ be arbitrary. Construct $x_1,\ldots,x_n$ where $x_i=x_{i-1}-t_{i-1}s_{i-1}$, with $t_{i-1}=\frac{\nabla f(x_{i-1})^Ts_{i-1}}{s_{i-1}^TCs_{i-1}}$. Then
  • $\nabla f(x_{j+1})^Ts_i=0$ for $0\leq i\leq j\leq n-1$
  • $x_n$ is the optimal solution for the QP.
In other words, at each iteration the gradient becomes orthogonal to more search directions. In the end it is orthogonal to all of them, and since they span $\mathbb R^n$ (exercise) the gradient must be zero.

Proof:
First we want to show that $\nabla f(x_{j+1})^T s_j = 0$. To see this note that
$$\nabla f(x_{j+1})=\nabla f(x_j) -t_jCs_j$$ by Taylor series, and taking the inner product with $s_j$ we get
$$\nabla f(x_{j+1})^Ts_j=\nabla f(x_j)^Ts_j -t_js_j^TCs_j=0$$ by definition of $t_j$.

Now choose $i<j\leq n-1$. Again by Taylor series, we have
$$\nabla f(x_{j+1})=\nabla f(x_{i+1})+C(x_{j+1}-x_{i+1}).$$
Taking the inner product with $s_i$:
$$\nabla f(x_{j+1})^Ts_i=\nabla f(x_{i+1})^Ts_i+(x_{j+1}-x_{i+1})^TCs_i.$$
$$\hspace{4em} = (x_{j+1}-x_{i+1})^TCs_i \quad\text{(since } \nabla f(x_{i+1})^Ts_i=0 \text{ by the first part)}$$
$$\hspace{4em} = -t_{i+1}s^T_{i+1}Cs_i-\cdots-t_js^T_jCs_i.$$
Each term is equal to $0$ by definition of conjugate directions. $\Box$

The only missing ingredient is identifying a set of conjugate directions for a matrix $C$. We will do this in the next post.

Quadratic Programming

 In my next series of posts, I would like to start exploring the quadratic programming problem. This is similar to the linear case, except the objective function is allowed to be a convex quadratic function 

$$f(x) = c\cdot x+\frac{1}{2} x^TCx$$

where $C$ is a positive semidefinite matrix. We are still considering the constraints to be linear, and in fact, following Best's textbook, we will assume they are inequality constraints of the form $Ax\leq b$. 

Definition: Let $x_0\in\mathbb R^n$. We say a constraint $a_i\cdot x\leq b_i$ is active at $x_0$, if $a_i\cdot x_0=b_i$.

In other words, $x_0$ lies on the facet of the feasible region defined by the $i$th constraint. 

Definition: Let $I(x_0)$ be the set of indices of active constraints for $x_0$. We say $x_0$ is quasi-stationary if $x_0$ is the optimal solution for the problem

  • Minimize $c\cdot x+\frac{1}{2} x^TCx$
  • Subject to $a_i \cdot x=b_i, i\in I(x_0)$

One thing to note: any corner of the feasible region is quasi-stationary, since the affine subspace cut out by its active constraints is just that single point.

Lemma: An optimal solution for a QP must be quasi-stationary.

Proof: Let $I$ be the set of active constraints for an optimal point $x_0$. Note that $I$ could be empty.

We need to show that $x_0$ is optimal for the equality-constrained problem defined by its active constraints. Suppose not, so there is a point $y$ with $a_i\cdot y=b_i$ for all $i\in I$ and $f(y)<f(x_0)$. By convexity of the objective function, every point strictly between $x_0$ and $y$ has objective value strictly less than $f(x_0)$. Points on this segment near $x_0$ still satisfy the active constraints as equalities, and since the inactive constraints hold strictly at $x_0$, such points remain feasible for small enough steps. This contradicts the optimality of $x_0$ for the QP. $\Box$

So we have a very similar situation to the linear programming case. Instead of looking at basic variables, we look at active constraints. Just as in the LP case, a simple but inefficient algorithm is available: consider every subset of the constraints, solve those constraints as equalities, find the global minimum of the objective function on the affine subspace they define, and keep the best feasible result.

We will conclude this first post on quadratic programming by deriving tests for a point to be quasi-stationary or optimal. 

Suppose we have a quasi-stationary point $x_0$. Recall that $\nabla f$ is the vector which gives the direction of fastest increase of $f$. Because $x_0$ is optimal for its set of active constraints, $\nabla f$ must be orthogonal to the affine subspace cut out by those constraints. Hence $\nabla f$ is in the linear span of $\{a_i\,|\,i\in I(x_0)\}$:

$$-\nabla f = \sum_{i\in I(x_0)} u_i a_i,$$

where we introduce the negative sign for later convenience. If $x_0$ is actually optimal, then the direction of fastest increase must point into the feasible polytope, which means that $u_i\geq 0$ in the above.

A convenient way to encode this is as follows:

Theorem: A point $x_0$ is quasi-stationary for the QP 

$$\min\{f(x)\,|\, a_i\cdot x\leq b_i\}$$ if and only if

  • $x_0$ lies in the feasible region $R$: $\forall i\,\,a_i\cdot x_0\leq b_i $
  • $-\nabla f(x_0) = \sum_{i=1}^m u_ia_i$ for some scalars $u_i\in\mathbb R$
  • $\forall i\,\, u_i(a_i\cdot x_0-b_i)=0$
The last condition gives an alternative: either $u_i=0$ or $i\in I(x_0)$, so it is just a reformulation of the statement that the gradient is a linear combination of the active constraint coefficients.

Theorem: A point $x_0$ is optimal if, in addition to the above three conditions, each $u_i$ is nonnegative. Writing this in matrix notation (a small numerical check is sketched after the list):
  • $Ax_0\leq b$
  • $-\nabla f(x_0) = A^Tu$ for some $u\geq 0$
  • $u^T(Ax_0-b)=0$
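Here is a minimal sketch of how one might verify these conditions numerically with MathNet; the helper name IsOptimal, the caller-supplied multiplier vector u, and the tolerance handling are my own choices, not from the post or from Best's textbook.

public bool IsOptimal(Matrix<double> A, Vector<double> b, Vector<double> c,
                      Matrix<double> C, Vector<double> x0, Vector<double> u, double tol)
{
    Vector<double> slack = A * x0 - b;                              // feasibility: should be <= 0 componentwise
    Vector<double> stationarity = (c + C * x0) + A.Transpose() * u; // -grad f = A^T u, i.e. grad f + A^T u = 0
    bool feasible = slack.Maximum() <= tol;
    bool nonnegativeMultipliers = u.Minimum() >= -tol;
    bool stationary = stationarity.AbsoluteMaximum() <= tol;
    bool complementary = Math.Abs(u.DotProduct(slack)) <= tol;      // u^T (A x0 - b) = 0
    return feasible && nonnegativeMultipliers && stationary && complementary;
}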


Wednesday, July 13, 2022

Linear programming versus row echelon form

As mentioned in my previous post, my plan was to illustrate the LP algorithm we have developed over the last several posts with a real-world example. I took an example from a project I am working on, with a coefficient matrix $A$ of dimensions $5\times 14$. This is a massively underdetermined system, but at the moment my program simply feeds it through the steps by first constructing a phase 1 and then an auxiliary LP.

This did not seem like a great example to show because several rows were simply zero and other rows were copies or negations of each other. So I decided to row reduce it first. However, one thing I noticed Wikipedia ominously mentions is that Gaussian elimination can be numerically unstable, and indeed when I did naive elimination, the resulting system was inconsistent. I was able to get results if I used the CoerceZero() method first, which coerces any elements of a matrix sufficiently close to $0$ to actually be $0$. However, I also noticed that a couple of elements of the row echelon form were absurdly large.

So I now think that row reducing the matrix is a bad idea, due to the numerical instability. It's possible a more sophisticated approach using the $LU$ decomposition might be better, but the program works well at the moment without doing the reduction, so perhaps it's okay to leave well enough alone, although I will probably at least remove the all-$0$ rows.

I would also like to note that the algorithm for computing row echelon form which I found at Rosetta Code does not appear to be correct, to add insult to injury.