<h1>Using Meta-optimization for Predicting Solutions to Optimal Power Flow</h1>
<p>Invenia Blog, 2021-12-17</p>
<p>In a previous <a href="/blog/2021/10/11/opf-nn/">blog post</a>, we reviewed how neural networks (NNs) can be used to predict solutions of <a href="/blog/2021/06/18/opf-intro/">optimal power flow</a> (OPF) problems.
We showed that widely used approaches fall into two major classes: <em>end-to-end</em> and <em>hybrid</em> techniques.
In the case of end-to-end (or direct) methods, a NN is applied as a regressor and either the full set or a subset of the optimization variables is predicted based on the grid parameters (typically active and reactive power loads).
Hybrid (or indirect) approaches include two steps: in the first step, some of the inputs of OPF are inferred by a NN, and in the second step an OPF optimization is performed with the predicted inputs.
This can reduce the computational time by either enhancing convergence to the solution or formulating an equivalent but smaller problem.</p>
<p>Under certain conditions, hybrid techniques provide feasible (and sometimes even optimal) values of the optimization variables of the OPF.
Therefore, hybrid approaches constitute an important research direction in the field.
However, feasibility and optimality have their price: because of the requirement of solving an actual OPF, hybrid methods can be computationally expensive, especially when compared to end-to-end techniques.
In this blog post, we discuss an approach to minimize the total computational time of hybrid methods by applying a meta-objective function that directly measures this computational cost.</p>
<h2 id="hybrid-nn-based-methods-for-predicting-opf-solutions">Hybrid NN-based methods for predicting OPF solutions</h2>
<p>In this post we will focus solely on hybrid NN-based methods.
For more detailed theoretical background we suggest reading our earlier <a href="/blog/2021/10/11/opf-nn/">blog post</a>.</p>
<p>OPF problems are mathematical optimizations with the following concise and generic form:</p>
\[\begin{equation}
\begin{aligned}
& \min \limits_{y}\ f(x, y) \\
& \mathrm{s.\ t.} \ \ c_{i}^{\mathrm{E}}(x, y) = 0 \quad i = 1, \dots, n \\
& \quad \; \; \; \; \; c_{j}^{\mathrm{I}}(x, y) \ge 0 \quad j = 1, \dots, m \\
\end{aligned}
\label{opt}
\end{equation}\]
<p>where \(x\) is the vector of grid parameters and \(y\) is the vector of optimization variables, \(f(x, y)\) is the objective (or cost) function to minimize, subject to equality constraints \(c_{i}^{\mathrm{E}}(x, y) \in \mathcal{C}^{\mathrm{E}}\) and inequality constraints \(c_{j}^{\mathrm{I}}(x, y) \in \mathcal{C}^{\mathrm{I}}\).
For convenience we write \(\mathcal{C}^{\mathrm{E}}\) and \(\mathcal{C}^{\mathrm{I}}\) for the sets of equality and inequality constraints, with corresponding cardinalities \(n = \lvert \mathcal{C}^{\mathrm{E}} \rvert\) and \(m = \lvert \mathcal{C}^{\mathrm{I}} \rvert\), respectively.</p>
<p>This optimization problem can be non-linear, non-convex, and even mixed-integer.
The most widely used approach to solving it is the <a href="https://invenia.github.io/blog/2021/06/18/opf-intro/#solving-the-ed-problem-using-the-interior-point-method">interior-point method</a>.<sup id="fnref:Boyd04" role="doc-noteref"><a href="#fn:Boyd04" class="footnote" rel="footnote">1</a></sup> <sup id="fnref:Nocedal06" role="doc-noteref"><a href="#fn:Nocedal06" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:Wachter06" role="doc-noteref"><a href="#fn:Wachter06" class="footnote" rel="footnote">3</a></sup>
The interior-point method is based on an iteration, which starts from an initial value of the optimization variables (\(y^{0}\)).
At each iteration step, the optimization variables are updated based on the <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton-Raphson method</a> until some convergence criteria have been satisfied.</p>
<p>OPF can also be considered as an operator<sup id="fnref:Zhou20" role="doc-noteref"><a href="#fn:Zhou20" class="footnote" rel="footnote">4</a></sup> that maps the grid parameters (\(x\)) to the optimal value of the optimization variables (\(y^{*}\)).
In a more general sense (Figure 1), the objective and constraint functions, as well as the initial value of the optimization variables, are also arguments of this operator and we can write:</p>
\[\begin{equation}
\Phi: \Omega \to \mathbb{R}^{n_{y}}: \quad \Phi\left( x, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) = y^{*},
\label{opf-operator}
\end{equation}\]
<p>where \(\Omega\) is an abstract set within which the values of the operator arguments are allowed to change and \(n_{y}\) is the dimension of the optimization variables.</p>
<p>Hybrid NN-based techniques apply neural networks to predict some arguments of the operator in order to reduce the computational time of running OPF compared to default methods.
<br />
<br /></p>
<p><img src="/blog/public/images/opf_solve_general.png" alt="opf_solve_general" />
Figure 1. OPF as an operator.</p>
<p><br /></p>
<h3 id="regression-based-hybrid-methods">Regression-based hybrid methods</h3>
<p>Starting the optimization from a specified value of the optimization variables (rather than the default used by a given solver) in the framework of interior-point methods is called <em>warm-start</em> optimization.
Directly predicting the optimal values of the optimization variables (end-to-end or direct methods) can result in a sub-optimal or even infeasible point.
However, as it is presumably close enough to the solution, it seems to be a reasonable approach to use this predicted value as the starting point of a warm-start optimization.
Regression-based indirect techniques follow exactly this route<sup id="fnref:Baker19" role="doc-noteref"><a href="#fn:Baker19" class="footnote" rel="footnote">5</a></sup> <sup id="fnref:Jamei19" role="doc-noteref"><a href="#fn:Jamei19" class="footnote" rel="footnote">6</a></sup>: for a given grid parameter vector \(x_{t}\), a NN is applied to obtain the predicted set-point, \(\hat{y}_{t}^{0} = \mathrm{NN}_{\theta}^{\mathrm{r}}(x_{t})\), where the subscript \(\theta\) denotes all parameters (weights, biases, etc.) of the NN, and the superscript \(\mathrm{r}\) indicates that the NN is used as a regressor.
Based on the predicted set-point, a warm-start optimization can be performed:</p>
\[\begin{equation}
\begin{aligned}
\hat{\Phi}^{\mathrm{warm}}(x_{t}) &= \Phi \left( x_{t}, \hat{y}_{t}^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) \\ &= \Phi \left( x_{t}, \mathrm{NN}_{\theta}^{\mathrm{r}}(x_{t}), f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) \\ &= y_{t}^{*}.
\end{aligned}
\end{equation}\]
<p>Warm-start methods provide an exact (locally) optimal point, as the solution is eventually obtained from an optimization problem identical to the original one.</p>
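<p>The effect of the starting point can be illustrated on a toy problem. The sketch below is not an interior-point OPF solver: it assumes a hypothetical one-dimensional convex objective, \(g(y) = e^{y} - 2y\), and a plain Newton-Raphson iteration, and it only counts how many iterations are needed from a distant default start versus a start close to the optimum. The function name <code>newton_iterations</code> and all numerical values are illustrative.</p>

```python
import math


def newton_iterations(y0: float, tol: float = 1e-10, max_iter: int = 100):
    """Minimize g(y) = exp(y) - 2*y with Newton-Raphson; return (y, n_iter).

    g is a hypothetical convex stand-in for the OPF objective; its
    minimizer is y* = ln(2).
    """
    y = y0
    for n in range(max_iter):
        grad = math.exp(y) - 2.0   # g'(y)
        if abs(grad) < tol:        # convergence criterion on the gradient
            return y, n
        hess = math.exp(y)         # g''(y), always positive
        y -= grad / hess           # Newton-Raphson update
    return y, max_iter


# "Cold" start far from the optimum versus a "warm" start near ln(2).
y_cold, n_cold = newton_iterations(y0=5.0)
y_warm, n_warm = newton_iterations(y0=0.7)
print(f"cold start: {n_cold} iterations, warm start: {n_warm} iterations")
```

<p>Both runs end at the same optimum, as expected for a warm start, but the run that begins near the solution needs fewer Newton steps.</p>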
<p>Training the NN is performed by minimizing the sum (or mean) of some differentiable loss function of the training samples with respect to the NN parameters:</p>
\[\begin{equation}
\min \limits_{\theta} \sum \limits_{t=1}^{N_{\mathrm{train}}} L^\mathrm{r} \left(y_{t}^{*}, \hat{y}_{t}^{0} \right),
\end{equation}\]
<p>where \(N_{\mathrm{train}}\) denotes the number of training samples and \(L^{\mathrm{r}} \left(y_{t}^{*}, \hat{y}_{t}^{0} \right)\) is the (regression) loss function between the true and predicted values of the optimal point.
Typical loss functions are the (mean) squared error, \(SE \left(y_{t}^{*}, \hat{y}_{t}^{0} \right) = \left\| y_{t}^{*} - \hat{y}_{t}^{0} \right\|^{2}\), and the (mean) absolute error, \(AE \left(y_{t}^{*}, \hat{y}_{t}^{0} \right) = \left\| y_{t}^{*} - \hat{y}_{t}^{0} \right\|\).
<br />
<br /></p>
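<p>As a minimal sketch, the two loss functions and the training objective can be written as follows. The norm in the absolute error is not specified above; the example assumes the \(L_{1}\) norm (sum of absolute component differences), and the function names are illustrative.</p>

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def squared_error(y_true: Vector, y_pred: Vector) -> float:
    """Squared Euclidean distance between the true and predicted points."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))


def absolute_error(y_true: Vector, y_pred: Vector) -> float:
    """Absolute error; assumes the L1 norm, which is an assumption here."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred))


def regression_loss(
    targets: Sequence[Vector],
    predictions: Sequence[Vector],
    loss: Callable[[Vector, Vector], float] = squared_error,
) -> float:
    """Sum of the per-sample losses over the training set."""
    return sum(loss(y, y_hat) for y, y_hat in zip(targets, predictions))


targets = [[1.0, 2.0], [0.0, 0.0]]
predictions = [[0.0, 4.0], [0.0, 1.0]]
total_se = regression_loss(targets, predictions, squared_error)
total_ae = regression_loss(targets, predictions, absolute_error)
```

<p>Note that both functions weight every component of the optimization variables equally.</p>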
<p><img src="/blog/public/images/warm_start_opf_loss.png" alt="warm_start_opf" />
Figure 2. Flowchart of the warm-start method (yellow panel) in combination with an NN regressor (purple panel, default arguments of OPF operator are omitted for clarity) trained by minimizing a conventional loss function.</p>
<p><br /></p>
<h3 id="classification-based-hybrid-methods">Classification-based hybrid methods</h3>
<p>An alternative hybrid approach using a NN classifier (\(\mathrm{NN}_{\theta}^{\mathrm{c}}\)) leverages the observation that only a fraction of all constraints is binding at the optimum<sup id="fnref:Ng18" role="doc-noteref"><a href="#fn:Ng18" class="footnote" rel="footnote">7</a></sup> <sup id="fnref:Deka19" role="doc-noteref"><a href="#fn:Deka19" class="footnote" rel="footnote">8</a></sup>, so a reduced OPF problem can be formulated by keeping only the binding constraints.
Since this reduced problem still has the same objective function as the original one, the solution should be equivalent to that of the original full
problem: \(\Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{A}_{t} \right) = \Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) = y_{t}^{*}\), where \(\mathcal{A}_{t} \subseteq \mathcal{C}^{\mathrm{I}}\) is the active or binding subset of the inequality constraints (also note that \(\mathcal{C}^{\mathrm{E}} \cup \mathcal{A}_{t}\) contains all active constraints defining the specific congestion regime).
This suggests a classification formulation in which the grid parameters are used to predict the active set:</p>
\[\begin{equation}
\begin{aligned}
\hat{\Phi}^{\mathrm{red}}(x_{t}) &= \Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \hat{\mathcal{A}}_{t} \right) \\ &= \Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathrm{NN}_{\theta}^{\mathrm{c}}(x_{t}) \right) \\ &= \hat{y}_{t}^{*}.
\end{aligned}
\end{equation}\]
<p>Technically, the NN can be used in two ways to predict the active set.
One approach is to identify all distinct active sets in the training data and train a multiclass classifier that maps the input to the corresponding active set.<sup id="fnref:Deka19:1" role="doc-noteref"><a href="#fn:Deka19" class="footnote" rel="footnote">8</a></sup>
Since the number of active sets increases exponentially with system size, for larger grids it might be better to predict the binding status of each non-trivial constraint by using a binary multi-label classifier.<sup id="fnref:Robson20" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup></p>
<p>However, since the classifiers are not perfect, there may be violated constraints not included in the reduced model.
One of the most widely used approaches to ensure convergence to an optimal point of the full problem is the <em>iterative feasibility test</em>.<sup id="fnref:Pineda20" role="doc-noteref"><a href="#fn:Pineda20" class="footnote" rel="footnote">10</a></sup> <sup id="fnref:Robson20:1" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup>
The procedure has been widely used by power grid operators and it includes the following steps in combination with a classifier:</p>
<ol>
<li>An initial active set of inequality constraints (\(\hat{\mathcal{A}}_{t}^{(1)}\)) is proposed by the classifier and a solution of the reduced problem is obtained.</li>
<li>In each feasibility iteration, \(k = 1, \ldots, K\), the optimal point of the actual reduced problem (\({\hat{y}_{t}^{*}}^{(k)}\)) is validated against the constraints \(\mathcal{C}^{\mathrm{I}}\) of the original full formulation.</li>
<li>At each step \(k\), the violated constraints \(\mathcal{N}_{t}^{(k)} \subseteq \mathcal{C}^{\mathrm{I}} \setminus \hat{\mathcal{A}}_{t}^{(k)}\) are added to the set of considered inequality constraints to form the active set of the next iteration: \(\hat{\mathcal{A}}_{t}^{(k+1)} = \hat{\mathcal{A}}_{t}^{(k)} \cup \mathcal{N}_{t}^{(k)}\).</li>
<li>The procedure repeats until no violations are found (i.e. \(\mathcal{N}_{t}^{(k)} = \emptyset\)), and the optimal point satisfies all original constraints \(\mathcal{C}^{\mathrm{I}}\). At this point, we have found the optimal point of the full problem (\({\hat{y}_{t}^{*}}^{(k)} = y_{t}^{*}\)).</li>
</ol>
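<p>The steps above can be sketched on a toy problem. The example below assumes a hypothetical separable objective \(\sum_{i} (y_{i} - x_{i})^{2}\) with simple lower-bound constraints \(y_{i} \ge b\), so that the reduced problem has a closed-form solution. The names <code>solve_reduced</code>, <code>violated</code> and <code>iterative_feasibility_test</code> are illustrative and not part of any OPF library.</p>

```python
from typing import Iterable, List, Set, Tuple

# Each inequality constraint is a lower bound y[i] >= b, stored as (i, b).
Constraint = Tuple[int, float]


def solve_reduced(x: List[float], constraints: List[Constraint],
                  active: Set[int]) -> List[float]:
    """Minimize sum_i (y_i - x_i)^2 subject only to the constraints in `active`."""
    y = list(x)
    for j in active:
        i, b = constraints[j]
        y[i] = max(y[i], b)  # closed-form optimum for a lower bound
    return y


def violated(y: List[float], constraints: List[Constraint],
             active: Set[int]) -> Set[int]:
    """Indices of constraints outside `active` that y does not satisfy."""
    return {j for j, (i, b) in enumerate(constraints)
            if j not in active and y[i] < b}


def iterative_feasibility_test(x: List[float], constraints: List[Constraint],
                               predicted_active: Iterable[int]):
    """Return (solution, final active set, number of reduced solves)."""
    active = set(predicted_active)
    n_solves = 0
    while True:
        y = solve_reduced(x, constraints, active)
        n_solves += 1
        new = violated(y, constraints, active)
        if not new:               # no violations: y solves the full problem
            return y, active, n_solves
        active |= new             # add violated constraints and repeat


x = [0.0, 0.0, 0.0]
constraints = [(0, 1.0), (1, -1.0), (2, 2.0)]  # constraints 0 and 2 are binding

y_fn, _, solves_fn = iterative_feasibility_test(x, constraints, {0})        # misses 2
y_ok, _, solves_ok = iterative_feasibility_test(x, constraints, {0, 2})     # exact
y_fp, _, solves_fp = iterative_feasibility_test(x, constraints, {0, 1, 2})  # extra 1
```

<p>In this toy setting a false negative costs an additional reduced solve, whereas a false positive only enlarges the reduced problem; all three runs end at the same solution.</p>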
<p>As the reduced OPF is much cheaper to solve than the full problem, this procedure can in theory be very efficient (if it converges in a few iterations), resulting in an optimal solution to the full OPF problem.
Obtaining optimal NN parameters is again based on minimizing a loss function of the training data:</p>
\[\begin{equation}
\min \limits_{\theta} \sum \limits_{t=1}^{N_{\mathrm{train}}} L^{\mathrm{c}} \left( \mathcal{A}_{t}, \hat{\mathcal{A}}_{t}^{(1)} \right),
\end{equation}\]
<p>where \(L^{\mathrm{c}} \left( \mathcal{A}_{t}, \hat{\mathcal{A}}_{t}^{(1)} \right)\) is the classifier loss as a function of the true and predicted active inequality constraints.
For instance, for predicting the binding status of each potential constraint, the typical loss function used is the binary cross-entropy: \(BCE \left( \mathcal{A}_{t}, \hat{\mathcal{A}}_{t}^{(1)} \right) = -\frac{1}{m} \sum \limits_{j=1}^{m} \left[ c^{(t)}_{j} \log \hat{c}^{(t)}_{j} + \left( 1 - c^{(t)}_{j} \right) \log \left( 1 - \hat{c}^{(t)}_{j} \right) \right]\), where \(c_{j}^{(t)}\) and \(\hat{c}_{j}^{(t)}\) are the true value and predicted probability of the binding status of the \(j\)th inequality constraint.
<br />
<br /></p>
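<p>A minimal sketch of the binary cross-entropy for a single sample is given below. The clipping of the predicted probabilities by a small <code>eps</code> is an added numerical safeguard and an assumption of this example, as is the function name.</p>

```python
import math
from typing import Sequence


def binary_cross_entropy(c_true: Sequence[int], c_prob: Sequence[float],
                         eps: float = 1e-12) -> float:
    """Mean binary cross-entropy over the m inequality constraints.

    c_true[j] is 1 if constraint j is binding and 0 otherwise;
    c_prob[j] is the predicted probability that constraint j is binding.
    """
    m = len(c_true)
    total = 0.0
    for c, p in zip(c_true, c_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0); an assumption
        total += c * math.log(p) + (1 - c) * math.log(1.0 - p)
    return -total / m


good = binary_cross_entropy([1, 0], [0.9, 0.1])  # confident and correct
bad = binary_cross_entropy([1, 0], [0.1, 0.9])   # confident and wrong
```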
<p><img src="/blog/public/images/reduced_opf_loss.png" alt="reduced_opf" />
Figure 3. Flowchart of the iterative feasibility test method (yellow panel) in combination with a NN classifier (purple panel; default arguments of OPF operator are omitted for clarity) trained by minimizing a conventional loss function.</p>
<p><br /></p>
<h2 id="meta-loss">Meta-loss</h2>
<p>In the previous sections, we discussed hybrid methods that apply a NN either as a regressor or as a classifier to predict the starting values of the optimization variables and the initial set of binding constraints, respectively, for the subsequent OPF calculations.
During the optimization of the NN parameters, some regression- or classification-based loss function of the training data is minimized.
However, below we will show that these conventional loss functions do not necessarily minimize the most time-consuming part of the hybrid techniques, i.e. the computationally expensive OPF step(s).
Therefore, we suggest a meta-objective function that directly addresses this computational cost.</p>
<h3 id="meta-optimization-for-regression-based-hybrid-approaches">Meta-optimization for regression-based hybrid approaches</h3>
<p>Conventional supervised regression techniques typically use loss functions based on a distance between the training ground-truth and predicted output
value, such as mean squared error or mean absolute error.
In general, each dimension of the target variable is treated equally in these loss functions.
However, the shape of the Lagrangian landscape of the OPF problem as a function of the optimization variables is far from isotropic<sup id="fnref:Mones18" role="doc-noteref"><a href="#fn:Mones18" class="footnote" rel="footnote">11</a></sup>, implying that optimization under such an objective does not necessarily minimize the warm-started OPF solution time.
Also, deriving initial values for the optimization variables by empirical risk minimization does not guarantee feasibility, regardless of how accurately the prediction matches the ground truth.
Interior-point methods start by first moving the system into a feasible region, thereby potentially altering the initial position significantly. Consequently, warm-starting from an infeasible point can be inefficient.</p>
<p>Instead, one can use a meta-loss function that directly measures the computational cost of solving the (warm-started) OPF problem.<sup id="fnref:Jamei19:1" role="doc-noteref"><a href="#fn:Jamei19" class="footnote" rel="footnote">6</a></sup> <sup id="fnref:Robson20:2" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup>
One measure of the computational cost can be defined by the number of iterations \((N(\hat{y}^{0}_{t}))\) required to reach the optimal solution.
Since the warm-started OPF has exactly the same formulation as the original OPF problem, the comparative number of iterations represents the improvement in computational cost.
Alternatively, the total CPU time of the warm-started OPF \((T(\hat{y}^{0}_{t}))\) can also be measured, although, unlike the number of iterations, it is not a noise-free measure.
Figure 4 presents a flowchart of the procedure for using a NN with parameters determined by minimizing the computational time-based meta-loss function (meta-optimization) on the training set.
As this meta-loss is a non-differentiable function with respect to the NN weights, back-propagation cannot be used.
As an alternative, one can employ gradient-free optimization techniques.
<br />
<br /></p>
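<p>The iteration-count meta-loss can be sketched on a toy problem. The example below is an illustration under strong assumptions: the OPF solve is replaced by a Newton-Raphson minimization of the hypothetical function \(\cosh(y - x)\), whose optimum is \(y^{*} = x\), and the NN is replaced by a hypothetical one-parameter predictor \(\hat{y}^{0} = \theta x\). The names <code>solver_iterations</code> and <code>meta_loss</code> are illustrative.</p>

```python
import math
from typing import Sequence


def solver_iterations(x: float, y0: float, tol: float = 1e-8,
                      max_iter: int = 100) -> int:
    """Count Newton-Raphson steps needed to minimize cosh(y - x) from y0.

    cosh(y - x) is a hypothetical stand-in for the OPF objective; its
    minimizer y* = x plays the role of the optimal point for input x.
    """
    y = y0
    for n in range(max_iter):
        if abs(math.sinh(y - x)) < tol:  # gradient small enough: converged
            return n
        y -= math.tanh(y - x)            # Newton step: gradient / Hessian
    return max_iter


def meta_loss(theta: float, xs: Sequence[float]) -> int:
    """Total number of solver iterations over the training inputs."""
    total = 0
    for x in xs:
        y0 = theta * x  # hypothetical one-parameter predictor of the start
        total += solver_iterations(x, y0)
    return total


xs = [1.0, 2.0, 3.0]
losses = {theta: meta_loss(theta, xs) for theta in (0.0, 0.9, 1.0)}
print(losses)
```

<p>The meta-loss is integer-valued and therefore piecewise constant in \(\theta\), so its gradient carries no useful information, which is why a gradient-free optimizer is needed.</p>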
<p><img src="/blog/public/images/meta_warm_start_opf.png" alt="meta_warm_start_opf" />
Figure 4. Flowchart of the warm-start method (yellow panel) in combination with an NN regressor (purple panel; default arguments of the OPF operator are omitted for clarity) trained by minimizing a meta-loss function that is a sum of the computational time of warm-started OPFs of the training data.</p>
<p><br /></p>
<h3 id="meta-optimization-for-classification-based-hybrid-approaches">Meta-optimization for classification-based hybrid approaches</h3>
<p>In the case of the classification-based hybrid methods, the goal is to find NN weights that minimize the total computational time of the iterative feasibility test.
However, minimizing a cross-entropy loss function to obtain such weights is not straightforward.
First, the number of cycles in the iterative procedure is much more sensitive to false negative than to false positive predictions of the binding
status.
Second, different constraints can be more or less important depending on the actual congestion regime and binding status (Figure 5).
These observations suggest the use of a more sophisticated objective function, for instance a weighted cross-entropy with appropriate weights for the corresponding terms.
The weights as hyperparameters can then be optimized to achieve an objective that can adapt to the above requirements.
However, an alternative objective can be defined as the total computational time of the iterative feasibility test procedure.<sup id="fnref:Robson20:3" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup>
<br />
<br /></p>
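<p>One possible form of such a weighted cross-entropy is sketched below. The exact weighting scheme is an assumption of this example: the term that penalizes false negatives (a binding constraint predicted as non-binding) is multiplied by <code>w_fn</code> and the term that penalizes false positives by <code>w_fp</code>. The default values are the 0.75 and 0.25 weights used in the numerical experiments of this post.</p>

```python
import math
from typing import Sequence


def weighted_bce(c_true: Sequence[int], c_prob: Sequence[float],
                 w_fn: float = 0.75, w_fp: float = 0.25,
                 eps: float = 1e-12) -> float:
    """Weighted binary cross-entropy over the m inequality constraints.

    w_fn scales the penalty for missing a binding constraint (false negative);
    w_fp scales the penalty for flagging a non-binding one (false positive).
    """
    m = len(c_true)
    total = 0.0
    for c, p in zip(c_true, c_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0); an assumption
        total += w_fn * c * math.log(p) + w_fp * (1 - c) * math.log(1.0 - p)
    return -total / m


fn_loss = weighted_bce([1], [0.1])  # binding constraint predicted as unlikely
fp_loss = weighted_bce([0], [0.9])  # non-binding constraint predicted as likely
```

<p>With these weights, an equally confident mistake is penalized three times more heavily when it is a false negative than when it is a false positive.</p>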
<p><img src="/blog/public/images/meta_loss_feasibility_test_constraints.png" alt="meta_loss_feasibility_test_constraints" />
Figure 5. Profile of the meta-loss (total computational time) and number of iterations within the iterative feasibility test as functions of the number of constraints for two grids, and a comparison of DC vs. AC formulations. Perfect classifiers with the active set (AS) are indicated by vertical dashed lines, with the false positive (FP) region to the right, and the false negative (FN) region to the left.<sup id="fnref:Robson20:4" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup></p>
<p><br /></p>
<p>The meta-loss objective, therefore, includes the solution time of a sequence of reduced OPF problems.
Similarly to the meta-loss defined for the regression approach, it measures the computational cost of obtaining a solution of the full problem and, unlike weighted cross-entropy, it does not require additional hyperparameters to be optimized.
<br />
<br /></p>
<p><img src="/blog/public/images/meta_reduced_opf.png" alt="meta_reduced_opf" />
Figure 6. Flowchart of the iterative feasibility test method (yellow panel) in combination with an NN classifier (purple panel; default arguments of the OPF operator are omitted for clarity) trained by minimizing a meta-loss function that is a sum of the computational time of the iterative feasibility tests of the training data.</p>
<p><br /></p>
<h3 id="optimizing-the-meta-loss-function-the-particle-swarm-optimization-method">Optimizing the meta-loss function: the particle swarm optimization method</h3>
<p>Neither the number of iterations nor the computational time of the subsequent OPF is a differentiable quantity with respect to the applied NNs.
Therefore, to obtain optimal NN weights by minimizing the meta-loss, a gradient-free method (i.e. one using only the value of the loss function) is required.
Particle swarm optimization was found to be a particularly promising gradient-free approach for this purpose.<sup id="fnref:Robson20:5" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup></p>
<p><a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">Particle swarm optimization</a> (PSO) is a gradient-free <a href="https://en.wikipedia.org/wiki/Metaheuristic">meta-heuristic</a> algorithm inspired by the concept of swarm intelligence that can be found in nature within certain animal groups.
The method applies a set of particles (\(N_{\mathrm{p}}\)) and the particle dynamics at each optimization step is influenced by both the individual (best position found by the particle) and collective (best position found among all particles) knowledge.
More specifically, for a \(D\)-dimensional problem, each particle \(p\) is associated with a velocity vector \(v^{(p)} \in \mathbb{R}^{D}\) and a position vector \(x^{(p)} \in \mathbb{R}^{D}\), both of which are randomly initialized at the beginning of the optimization within the corresponding ranges of interest.
During the course of the optimization, the velocities and positions are updated and in the \(n\)th iteration, the new vectors are obtained as<sup id="fnref:Zhan09" role="doc-noteref"><a href="#fn:Zhan09" class="footnote" rel="footnote">12</a></sup>:</p>
\[\begin{equation}
\begin{aligned}
v_{n+1}^{(p)} & = \underbrace{\omega_{n} v_{n}^{(p)}}_{\mathrm{inertia}} + \underbrace{c_{1} r_{n} \odot (l_{n}^{(p)} - x_{n}^{(p)})}_{\mathrm{local\ information}} + \underbrace{c_{2} q_{n} \odot (g_{n} - x_{n}^{(p)})}_{\mathrm{global\ information}} \\
x_{n+1}^{(p)} & = x_{n}^{(p)} + v_{n+1}^{(p)},
\end{aligned}
\label{eq:pso}
\end{equation}\]
<p>where \(\omega_{n}\) is the inertia weight, \(c_{1}\) and \(c_{2}\) are the acceleration coefficients, \(r_{n}\) and \(q_{n}\) are random vectors whose components are each drawn from a uniform distribution on the \([0, 1]\) interval, \(l_{n}^{(p)}\) is the best local position found by particle \(p\), \(g_{n}\) is the best global position found by all particles together so far, and \(\odot\) denotes the <a href="https://en.wikipedia.org/wiki/Hadamard_product_(matrices)">Hadamard</a> (pointwise) product.</p>
<p>As eq. \ref{eq:pso} shows, PSO is a fairly simple approach: the velocity of each particle is governed by three terms, namely inertia, local knowledge, and global knowledge.
The method was originally proposed for global optimization and it can be easily parallelized across the particles.</p>
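<p>The update rule in eq. \ref{eq:pso} can be sketched in a few lines of plain Python. This is an illustrative implementation under stated assumptions, not the one used for the experiments: the inertia weight is kept constant rather than iteration-dependent, the values of <code>omega</code>, <code>c1</code>, <code>c2</code>, the swarm size and the search bounds are arbitrary choices, and the <code>staircase</code> function is a hypothetical piecewise-constant stand-in for a meta-loss.</p>

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def pso(loss: Callable[[Sequence[float]], float], dim: int,
        n_particles: int = 20, n_iters: int = 100,
        omega: float = 0.7, c1: float = 1.5, c2: float = 1.5,
        bounds: Tuple[float, float] = (-5.0, 5.0),
        seed: int = 0) -> Tuple[List[float], float]:
    """Minimize `loss` using only its values; return (best position, best value)."""
    rng = random.Random(seed)
    lo, hi = bounds
    span = hi - lo
    # Random initial positions and (small) velocities.
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[rng.uniform(-0.1 * span, 0.1 * span) for _ in range(dim)]
         for _ in range(n_particles)]
    l_pos = [list(p) for p in x]      # best position found by each particle
    l_val = [loss(p) for p in x]
    best = min(range(n_particles), key=l_val.__getitem__)
    g_pos, g_val = list(l_pos[best]), l_val[best]  # best position of the swarm

    for _ in range(n_iters):
        for p in range(n_particles):
            for d in range(dim):
                r, q = rng.random(), rng.random()
                v[p][d] = (omega * v[p][d]                         # inertia
                           + c1 * r * (l_pos[p][d] - x[p][d])      # local term
                           + c2 * q * (g_pos[d] - x[p][d]))        # global term
                x[p][d] += v[p][d]
            value = loss(x[p])
            if value < l_val[p]:
                l_pos[p], l_val[p] = list(x[p]), value
        # Update the global best once all particles have moved.
        best = min(range(n_particles), key=l_val.__getitem__)
        if l_val[best] < g_val:
            g_pos, g_val = list(l_pos[best]), l_val[best]
    return g_pos, g_val


def staircase(w: Sequence[float]) -> float:
    """Piecewise-constant toy loss with zero gradient almost everywhere."""
    return float(sum(math.floor(abs(wi)) for wi in w))


best_w, best_loss = pso(staircase, dim=3)
```

<p>Since each particle only needs loss evaluations, the inner loop over particles is the natural place to parallelize.</p>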
<h2 id="numerical-experiments">Numerical experiments</h2>
<p>We demonstrate the increased performance of the meta-loss objective compared to conventional loss functions for the classification-based hybrid approach.
Three synthetic grids using both DC and AC formulations were investigated using 10k and 1k samples, respectively.
The samples were split randomly into training, validation, and test sets containing 70%, 20%, and 10% of the samples, respectively.</p>
<p>A fully connected NN was trained using a conventional and a weighted cross-entropy objective function in combination with a standard gradient-based optimizer (ADAM).
As discussed earlier, the meta-loss is much more sensitive to false negative predictions (Figure 5).
To reflect this in the weighted cross-entropy expression, we applied weights of 0.25 and 0.75 for the false positive and false negative penalty terms.
Then, starting from these optimized NNs, we performed further optimizations, using the meta-loss objective in combination with a PSO optimizer.</p>
<p>Table 1 includes the average computational gains (compared to the full problem) of the different approaches using 10 independent experiments for each.
For all cases, the weighted cross-entropy objective outperforms the conventional cross-entropy, indicating the importance of the false negative penalty term.
However, optimizing the NNs further with the meta-loss objective significantly improves the computational gain for both pre-training objectives.
For AC-OPF, where both cross-entropy objectives yield negative gains on all three grids, it brings the gain into the positive regime.
For details on how to use a meta-loss function, as well as for further numerical results, we refer to our paper.<sup id="fnref:Robson20:6" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup></p>
<p><br />
<br />
Table 1. Average gain with two-sided \(95\)% confidence intervals of classification-based hybrid models in combination with meta-optimization using conventional and weighted binary cross-entropy for pre-training the NN.</p>
<table>
<tr>
<th rowspan="3">Case</th>
<th colspan="4">Gain (%)</th>
</tr>
<tr>
<th colspan="2">Conventional</th>
<th colspan="2">Weighted</th>
</tr>
<tr>
<th>Cross-entropy</th>
<th>Meta-loss</th>
<th>Cross-entropy</th>
<th>Meta-loss</th>
</tr>
<tr>
<td colspan="5" style="text-align:center">DC-OPF</td>
</tr>
<tr>
<td>118-ieee</td>
<td>$$\begin{equation*}38.2 \pm 0.8\end{equation*}$$</td>
<td>$$\begin{equation*}42.1 \pm 2.7\end{equation*}$$</td>
<td>$$\begin{equation*}43.0 \pm 0.5\end{equation*}$$</td>
<td>$$\begin{equation*}44.8 \pm 1.2\end{equation*}$$</td>
</tr>
<tr>
<td>162-ieee-dtc</td>
<td>$$\begin{equation*}8.9 \pm 0.9\end{equation*}$$</td>
<td>$$\begin{equation*}31.2 \pm 1.3\end{equation*}$$</td>
<td>$$\begin{equation*}21.2 \pm 0.7\end{equation*}$$</td>
<td>$$\begin{equation*}36.9 \pm 1.0\end{equation*}$$</td>
</tr>
<tr>
<td>300-ieee</td>
<td>$$\begin{equation*}-47.1 \pm 0.5\end{equation*}$$</td>
<td>$$\begin{equation*}11.8 \pm 5.2\end{equation*}$$</td>
<td>$$\begin{equation*}-10.2 \pm 0.8\end{equation*}$$</td>
<td>$$\begin{equation*}23.2 \pm 1.8\end{equation*}$$</td>
</tr>
<tr>
<td colspan="5" style="text-align:center">AC-OPF</td>
</tr>
<tr>
<td>118-ieee</td>
<td>$$\begin{equation*}-31.7 \pm 1.2\end{equation*}$$</td>
<td>$$\begin{equation*}20.5 \pm 4.2\end{equation*}$$</td>
<td>$$\begin{equation*}-3.8 \pm 2.3\end{equation*}$$</td>
<td>$$\begin{equation*}29.3 \pm 2.0\end{equation*}$$</td>
</tr>
<tr>
<td>162-ieee-dtc</td>
<td>$$\begin{equation*}-60.5 \pm 2.7\end{equation*}$$</td>
<td>$$\begin{equation*}8.6 \pm 7.6\end{equation*}$$</td>
<td>$$\begin{equation*}-28.4 \pm 3.0\end{equation*}$$</td>
<td>$$\begin{equation*}23.4 \pm 2.2\end{equation*}$$</td>
</tr>
<tr>
<td>300-ieee</td>
<td>$$\begin{equation*}-56.0 \pm 5.8\end{equation*}$$</td>
<td>$$\begin{equation*}5.0 \pm 6.4\end{equation*}$$</td>
<td>$$\begin{equation*}-30.9 \pm 2.2\end{equation*}$$</td>
<td>$$\begin{equation*}15.8 \pm 2.3\end{equation*}$$</td>
</tr>
</table>
<p><br /></p>
<h2 id="conclusion">Conclusion</h2>
<p>NN-based hybrid approaches can guarantee optimal solutions and are therefore a particularly interesting research direction among machine-learning-assisted OPF models.
In this blog post, we argued that the computational cost of the subsequent (warm-start or set of reduced) OPF can be straightforwardly reduced by applying a meta-loss objective.
Unlike conventional loss functions that measure some error between the ground truth and the predicted quantities, the meta-loss objective directly addresses the computational time.
Since the meta-loss is not a differentiable function of the NN weights, a gradient-free optimization technique, like the particle swarm optimization approach, can be used.</p>
<p>Although significant improvements of the computational cost of OPF can be achieved by applying a meta-loss objective compared to conventional loss functions, in practice these gains are still far from desirable (i.e. the total computational cost should be a fraction of that of the original OPF problem).<sup id="fnref:Robson20:7" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">9</a></sup></p>
<p>In a subsequent blog post, we will discuss how neural networks can be utilized in a more efficient way to obtain optimal OPF solutions.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:Boyd04" role="doc-endnote">
<p>S. Boyd and L. Vandenberghe, <a href="https://web.stanford.edu/~boyd/cvxbook/">“Convex Optimization”</a>, <em>New York: Cambridge University Press</em>, (2004). <a href="#fnref:Boyd04" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Nocedal06" role="doc-endnote">
<p>J. Nocedal and S. J. Wright, <a href="https://link.springer.com/book/10.1007/978-0-387-40065-5">“Numerical Optimization”</a>, <em>New York: Springer</em>, (2006). <a href="#fnref:Nocedal06" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Wachter06" role="doc-endnote">
<p>A. Wächter and L. Biegler, <a href="https://link.springer.com/article/10.1007/s10107-004-0559-y">“On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming”</a>, <em>Math. Program.</em>, <strong>106</strong>, pp. 25, (2006). <a href="#fnref:Wachter06" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Zhou20" role="doc-endnote">
<p>F. Zhou, J. Anderson and S. H. Low, <a href="https://arxiv.org/abs/1907.02219">“The Optimal Power Flow Operator: Theory and Computation”</a>, <em>arXiv:1907.02219</em>, (2020). <a href="#fnref:Zhou20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Baker19" role="doc-endnote">
<p>K. Baker, <a href="https://arxiv.org/abs/1905.08860">“Learning Warm-Start Points For AC Optimal Power Flow”</a>, <em>IEEE International Workshop on Machine Learning for Signal Processing</em>, pp. 1, (2019). <a href="#fnref:Baker19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Jamei19" role="doc-endnote">
<p>M. Jamei, L. Mones, A. Robson, L. White, J. Requeima and C. Ududec, <a href="https://www.climatechange.ai/papers/icml2019/42/paper.pdf">“Meta-Optimization of Optimal Power Flow”</a>, <em>Proceedings of the 36th International Conference on Machine Learning Workshop</em>, (2019). <a href="#fnref:Jamei19" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Jamei19:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:Ng18" role="doc-endnote">
<p>Y. Ng, S. Misra, L. A. Roald and S. Backhaus, <a href="https://arxiv.org/abs/1801.07809">“Statistical Learning For DC Optimal Power Flow”</a>, <em>arXiv:1801.07809</em>, (2018). <a href="#fnref:Ng18" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Deka19" role="doc-endnote">
<p>D. Deka and S. Misra, <a href="https://arxiv.org/abs/1902.05607">“Learning for DC-OPF: Classifying active sets using neural nets”</a>, <em>arXiv:1902.05607</em>, (2019). <a href="#fnref:Deka19" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Deka19:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:Robson20" role="doc-endnote">
<p>A. Robson, M. Jamei, C. Ududec and L. Mones, <a href="https://arxiv.org/abs/1911.06784">“Learning an Optimally Reduced Formulation of OPF through Meta-optimization”</a>, <em>arXiv:1911.06784</em>, (2020). <a href="#fnref:Robson20" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Robson20:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:Robson20:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a> <a href="#fnref:Robson20:3" class="reversefootnote" role="doc-backlink">↩<sup>4</sup></a> <a href="#fnref:Robson20:4" class="reversefootnote" role="doc-backlink">↩<sup>5</sup></a> <a href="#fnref:Robson20:5" class="reversefootnote" role="doc-backlink">↩<sup>6</sup></a> <a href="#fnref:Robson20:6" class="reversefootnote" role="doc-backlink">↩<sup>7</sup></a> <a href="#fnref:Robson20:7" class="reversefootnote" role="doc-backlink">↩<sup>8</sup></a></p>
</li>
<li id="fn:Pineda20" role="doc-endnote">
<p>S. Pineda, J. M. Morales and A. Jiménez-Cordero, <a href="https://arxiv.org/abs/1907.04694">“Data-Driven Screening of Network Constraints for Unit Commitment”</a>, <em>IEEE Transactions on Power Systems</em>, <strong>35</strong>, pp. 3695, (2020). <a href="#fnref:Pineda20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Mones18" role="doc-endnote">
<p>L. Mones, C. Ortner and G. Csanyi, <a href="https://www.nature.com/articles/s41598-018-32105-x">“Preconditioners for the geometry optimisation and saddle point search of molecular systems”</a>, <em>Scientific Reports</em>, <strong>8</strong>, 1, (2018). <a href="#fnref:Mones18" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Zhan09" role="doc-endnote">
<p>Z. Zhan, J. Zhang, Y. Li and H. S. Chung, <a href="https://ieeexplore.ieee.org/document/4812104">“Adaptive particle swarm optimization”</a>, <em>IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics)</em>, <strong>39</strong>, 1362, (2009). <a href="#fnref:Zhan09" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
<p>Letif Mones</p>
<h1>Using Neural Networks for Predicting Solutions to Optimal Power Flow</h1>
<p>2021-10-11, https://invenia.github.io/blog/2021/10/11/opf-nn</p>
<p>In a previous <a href="/blog/2021/06/18/opf-intro/">blog post</a>, we discussed the fundamental concepts of optimal power flow (OPF), a core problem in operating electricity grids.
In their principal form, AC-OPFs are non-linear and non-convex optimization problems that are in general expensive to solve.
In practice, due to the large size of electricity grids and number of constraints, solving even the linearized approximation (DC-OPF) is a challenging optimization requiring significant computational effort.
Adding to the difficulty, the growing integration of renewable generation has increased uncertainty in grid conditions and increasingly requires grid operators to solve OPF problems in near real-time.
This has led to new research efforts in using machine learning (ML) approaches to shift the computational effort away from real-time (optimization) to offline training, potentially providing an almost instant prediction of the outputs of OPF problems.</p>
<p>This blog post focuses on a specific set of machine learning approaches that applies neural networks (NNs) to predict (directly or indirectly) solutions to OPF.
In order to provide a concise and common framework for these approaches, we consider first the general mathematical form of OPF and the standard technique to solve it: the interior-point method.
Then, we will see that this optimization problem can be viewed as an <em>operator</em> that maps a set of quantities to the optimal value of the optimization variables of the OPF.
For the simplest case, when all arguments of the OPF operator are fixed except the grid parameters, we introduce the term <em>OPF function</em>.
We will show that all the NN-based approaches discussed here can be considered as estimators of either the OPF function or the OPF operator.<sup id="fnref:Falconer21" role="doc-noteref"><a href="#fn:Falconer21" class="footnote" rel="footnote">1</a></sup></p>
<h2 id="general-form-of-opf">General form of OPF</h2>
<p>OPF problems can be expressed in the following concise form of mathematical programming:</p>
\[\begin{equation}
\begin{aligned}
& \min \limits_{y}\ f(x, y) \\
& \mathrm{s.\ t.} \ \ c_{i}^{\mathrm{E}}(x, y) = 0 \quad i = 1, \dots, n \\
& \quad \; \; \; \; \; c_{j}^{\mathrm{I}}(x, y) \ge 0 \quad j = 1, \dots, m \\
\end{aligned}
\label{opt}
\end{equation}\]
<p>where \(x\) is the vector of grid parameters and \(y\) is the vector of optimization variables, \(f(x, y)\) is the objective (or cost) function to minimize, subject to equality constraints \(c_{i}^{\mathrm{E}}(x, y) \in \mathcal{C}^{\mathrm{E}}\) and inequality constraints \(c_{j}^{\mathrm{I}}(x, y) \in \mathcal{C}^{\mathrm{I}}\).
For convenience we write \(\mathcal{C}^{\mathrm{E}}\) and \(\mathcal{C}^{\mathrm{I}}\) for the sets of equality and inequality constraints, respectively, with corresponding cardinalities \(n = \lvert \mathcal{C}^{\mathrm{E}} \rvert\) and \(m = \lvert \mathcal{C}^{\mathrm{I}} \rvert\).
The objective function is optimized solely with respect to the optimization variables (\(y\)), while variables (\(x\)) parameterize the objective and constraint functions.
For example, for a simple <a href="https://invenia.github.io/blog/2021/06/18/opf-intro/#the-economic-dispatch-problem">economic dispatch</a> (ED) problem, \(x\) is a vector of the active and reactive power components of loads, \(y\) includes the voltage magnitudes and active powers of generators, the objective function is the cost of the total real power generation, the equality constraints include the power balance and power flow equations, and the inequality constraints impose lower and upper bounds on other critical quantities.</p>
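<p>As a minimal worked instance of this general form, consider a single-bus dispatch with two generators, ignoring the network and reactive power. The symbols \(d\) (load), \(p_{1}, p_{2}\) (generator outputs), \(a_{1}, a_{2}\) (unit costs) and \(\bar{p}_{1}, \bar{p}_{2}\) (capacities, treated as fixed) are illustrative assumptions and do not appear in the original formulation. With \(x = (d)\) and \(y = (p_{1}, p_{2})\), the problem reads:</p>
\[\begin{aligned}
& \min \limits_{p_{1}, p_{2}}\ a_{1} p_{1} + a_{2} p_{2} \\
& \mathrm{s.\ t.} \ \ p_{1} + p_{2} - d = 0 \\
& \quad \; \; \; \; \; p_{i} \ge 0 \quad i = 1, 2 \\
& \quad \; \; \; \; \; \bar{p}_{i} - p_{i} \ge 0 \quad i = 1, 2
\end{aligned}\]
<p>In this toy case \(n = 1\) and \(m = 4\), and every inequality is written in the same \(c_{j}^{\mathrm{I}}(x, y) \ge 0\) convention as the general form.</p>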
<p>The most widely used approach to solving the above optimization problem (that can be non-linear, non-convex and even mixed-integer) is the <a href="https://invenia.github.io/blog/2021/06/18/opf-intro/#solving-the-ed-problem-using-the-interior-point-method">interior-point method</a>.<sup id="fnref:Boyd04" role="doc-noteref"><a href="#fn:Boyd04" class="footnote" rel="footnote">2</a></sup> <sup id="fnref:Nocedal06" role="doc-noteref"><a href="#fn:Nocedal06" class="footnote" rel="footnote">3</a></sup> <sup id="fnref:Wachter06" role="doc-noteref"><a href="#fn:Wachter06" class="footnote" rel="footnote">4</a></sup>
The interior-point (or barrier) method is a highly efficient iterative algorithm.
However, it requires the computation of the Hessian (i.e. second derivatives) of the Lagrangian of the system with respect to the optimization variables at each iteration step.
Due to the non-convex nature of the power flow equations appearing in the equality constraints, the method can be prohibitively slow for large-scale systems.</p>
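<p>The role of the derivatives can be seen in a minimal log-barrier sketch on a one-dimensional toy problem, \(\min\ (y - 2)^{2}\) subject to \(y - 3 \ge 0\), whose solution is \(y^{*} = 3\). The toy problem, the function name <code>barrier_minimize</code> and all parameter values are illustrative assumptions; production solvers use primal-dual variants with many safeguards that are omitted here.</p>

```python
def barrier_minimize(y0, mu=1.0, shrink=0.2, mu_min=1e-8):
    """Minimize (y - 2)^2 subject to y - 3 >= 0 with a basic log-barrier method.

    Internally uses the slack s = y - 3 > 0, so the barrier objective is
    phi(s) = (s + 1)^2 - mu * log(s).
    """
    if y0 <= 3.0:
        raise ValueError("the starting point must be strictly feasible (y0 > 3)")
    s = y0 - 3.0
    newton_steps = 0
    while mu > mu_min:
        for _ in range(100):
            grad = 2.0 * (s + 1.0) - mu / s   # first derivative of phi
            if abs(grad) < 1e-9:
                break
            hess = 2.0 + mu / s ** 2          # second derivative (the 1-D "Hessian")
            step = -grad / hess               # Newton step
            while s + step <= 0.0:            # halve the step to stay strictly feasible
                step *= 0.5
            s += step
            newton_steps += 1
        mu *= shrink                          # tighten the barrier and re-solve
    return 3.0 + s, newton_steps


y_opt, steps = barrier_minimize(y0=5.0)
print(f"y* = {y_opt:.6f} after {steps} Newton steps")
```

<p>Each Newton step needs both the gradient and the second derivative of the barrier objective; for a real grid the latter is a large Hessian matrix that has to be rebuilt and factorized at every iteration.</p>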
<p>The formulation of eq. \(\eqref{opt}\) gives us the possibility of looking at OPF as an operator that maps the grid parameters (\(x\)) to the optimal value of the optimization variables (\(y^{*}\)).
In a more general sense, the objective and constraint functions are also arguments of this operator.
Also, in this discussion we assume that, if feasible, the exact solution of the OPF problem is always provided by the interior-point method.
Therefore, the operator is parameterized implicitly by the starting value of the optimization variables (\(y^{0}\)).
The actual value of \(y^{0}\) can significantly affect the convergence rate of the interior-point method and the total computational time, and for non-convex formulations (where multiple local minima might exist) even the optimal point can differ.
The general form of the OPF operator can be written as:</p>
\[\begin{equation}
\Phi: \Omega \to \mathbb{R}^{n_{y}}: \quad \Phi\left( x, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) = y^{*},
\label{opf-operator}
\end{equation}\]
<p>where \(\Omega\) is an abstract set within which the values of the operator arguments are allowed to change and \(n_{y}\) is the dimension of the optimization variables.
We note that a special case of the general form is the DC-OPF operator, whose mathematical properties have been thoroughly investigated in a recent work.<sup id="fnref:Zhou20" role="doc-noteref"><a href="#fn:Zhou20" class="footnote" rel="footnote">5</a></sup></p>
<p>In many recurring problems, most of the arguments of the OPF operator are fixed and only (some of) the grid parameters vary.
For these cases, we introduce a simpler notation, the OPF function:</p>
\[\begin{equation}
F_{\Phi}: \mathbb{R}^{n_{x}} \to \mathbb{R}^{n_{y}}: \quad F_{\Phi}(x) = y^{*},
\label{opf-function}
\end{equation}\]
<p>where \(n_{x}\) and \(n_{y}\) are the dimensions of the grid parameters and the optimization variables, respectively.
We also denote the set of all feasible points of OPF as \(\mathcal{F}_{\Phi}\).
The optimal value \(y^{*}\) is a member of the set \(\mathcal{F}_{\Phi}\) and in the case when the problem is infeasible, \(\mathcal{F}_{\Phi} = \emptyset\).</p>
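<p>To make the notion of an OPF function concrete, the sketch below implements \(F_{\Phi}\) for a toy single-bus dispatch with linear costs, where the optimum is given by merit order (cheapest generator first). The function name <code>opf_function</code>, the costs and the capacities are illustrative assumptions, and returning <code>None</code> is just one possible way to signal an empty feasible set.</p>

```python
def opf_function(load, costs=(10.0, 30.0), p_max=(80.0, 100.0)):
    """Toy OPF function F(x) = y*: merit-order dispatch on a single bus.

    x is the total load, y* the list of generator outputs.
    Returns None when the problem is infeasible (empty feasible set).
    """
    if load < 0.0 or load > sum(p_max):
        return None
    dispatch = [0.0] * len(costs)
    remaining = load
    # Fill generators in order of increasing unit cost
    for i in sorted(range(len(costs)), key=lambda i: costs[i]):
        dispatch[i] = min(remaining, p_max[i])
        remaining -= dispatch[i]
    return dispatch


print(opf_function(50.0))   # [50.0, 0.0]
print(opf_function(120.0))  # [80.0, 40.0]
print(opf_function(200.0))  # None
```

<p>Only the load varies between calls, while the objective, the constraints and the solution procedure stay fixed, which is exactly the setting the OPF function describes.</p>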
<p>The daily task of electricity grid operators is to provide solutions for OPF, given constantly changing grid parameters \(x_{t}\).
The <em>standard</em> OPF approach would be to compute \(F_{\Phi}(x_{t}) = y_{t}^{*}\) using some default values of the other arguments of \(\Phi\).
However, in practice this is rarely done, because additional information about the grid is usually available and can be used to obtain the solution more efficiently.
For instance, it is reasonable to assume that for similar grid parameter vectors the corresponding optimal points are also close to each other.
If one of these problems is solved then its optimal point can be used as the starting value for the optimization variables of the other problem, which can then converge significantly faster compared to some default initial values.
This strategy is called <em>warm-start</em> and might be useful for consecutive problems, i.e. \(\Phi \left( x_{t}, y_{t-1}^{*}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) = y_{t}^{*}\).
Another way to reduce the computational time of solving OPF is to reduce the problem size.
The OPF solution is determined by the objective function and <em>all</em> binding constraints.
However, not all constraints are binding at the optimal point: typically a large number of inequality constraints are non-binding and can therefore be removed from the mathematical problem without changing the optimal point, i.e. \(\Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{A}_{t} \right) = y_{t}^{*}\), where \(\mathcal{A}_{t}\) is the set of all binding inequality constraints of the actual problem (equality constraints are always binding).
This formulation is called <em>reduced</em> OPF and it is especially useful for DC-OPF problems.
The three main approaches discussed are depicted in Figure 1.
<br />
<br /></p>
<p><img src="/blog/public/images/opf_solve_types.png" alt="opf_solve_types" />
Figure 1. Main approaches to solving OPF. Varying arguments are highlighted, while other arguments are potentially fixed.</p>
<p><br /></p>
<h2 id="nn-based-approaches-for-predicting-opf-solutions">NN-based approaches for predicting OPF solutions</h2>
<p>The ML methods we will discuss apply either an estimator function (\(\hat{F}_{\Phi}(x_{t}) = \hat{y}_{t}^{*}\)) or an estimator operator (\(\hat{\Phi}(x_{t}) = \hat{y}_{t}^{*}\)) to provide a prediction of the optimal point of OPF based on the grid parameters.</p>
<p>We can categorize these methods in different ways.
For instance, based on the estimator type they use, we can distinguish between <em>end-to-end</em> (aka <em>direct</em>) and <em>hybrid</em> (aka <em>indirect</em>) approaches.
End-to-end methods use a NN as an estimator function and map the grid parameters directly to the optimal point of OPF.
Hybrid or indirect methods apply an estimator operator that includes two steps: in the first step, a NN maps the grid parameters to some quantities, which are then used in the second step as inputs to some optimization problem resulting in the predicted (or even exact) optimal point of the original OPF problem.</p>
<p>We can also group techniques based on the NN predictor type: the NN can be used either as regressor or classifier.</p>
<h2 id="regression">Regression</h2>
<p>The OPF function establishes a non-linear relationship between the grid parameters and optimization variables.
In regression techniques, this complex relationship is approximated by NNs, treating the grid parameters as the input and the optimal values of the optimization variables as the output of the NN model.</p>
<h3 id="end-to-end-methods">End-to-end methods</h3>
<p>End-to-end methods<sup id="fnref:Guha19" role="doc-noteref"><a href="#fn:Guha19" class="footnote" rel="footnote">6</a></sup> apply NNs as regressors, mapping the grid parameters (as inputs) directly to the optimal point of OPF (as outputs):</p>
\[\begin{equation}
\hat{F}_{\Phi}(x_{t}) = \mathrm{NN}_{\theta}^{\mathrm{r}}(x_{t}) = \hat{y}_{t}^{*},
\end{equation}\]
<p>where the subscript \(\theta\) denotes all parameters (weights, biases, etc.) of the NN and the superscript \(\mathrm{r}\) indicates that the NN is used as a regressor.
We note that once the prediction \(\hat{y}_{t}^{*}\) is computed, other dependent quantities (e.g. power flows) can be easily obtained by solving the <a href="https://invenia.github.io/blog/2020/12/04/pf-intro/#the-power-flow-problem">power flow problem</a>, provided that the prediction is a feasible point.<sup id="fnref:Guha19:1" role="doc-noteref"><a href="#fn:Guha19" class="footnote" rel="footnote">6</a></sup> <sup id="fnref:Zamzam19" role="doc-noteref"><a href="#fn:Zamzam19" class="footnote" rel="footnote">7</a></sup></p>
<p>As OPF is a constrained optimization problem, the optimal point is not necessarily a smooth function of the grid parameters: changes of the binding status of constraints can lead to abrupt changes of the optimal solution.
Also, the number of congestion regimes — i.e. the number of distinct sets of active (binding) constraints — increases exponentially with grid size.
Therefore, in order to obtain sufficiently high coverage and accuracy of the model, a substantial amount of training data is required.</p>
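<p>The mapping \(\mathrm{NN}_{\theta}^{\mathrm{r}}(x_{t}) = \hat{y}_{t}^{*}\) can be sketched as the forward pass of a small fully connected network. The architecture, the untrained placeholder parameters in <code>theta</code> and the lossless single-bus balance check are all illustrative assumptions; a real model would be trained on solved OPF samples.</p>

```python
def relu(values):
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Affine layer; `weights` holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def nn_regressor(x, theta):
    """One-hidden-layer regressor: grid parameters -> predicted optimal point."""
    hidden = relu(dense(x, theta["W1"], theta["b1"]))
    return dense(hidden, theta["W2"], theta["b2"])


# Placeholder (untrained) parameters: 2 loads -> 3 hidden units -> 2 generator outputs
theta = {
    "W1": [[0.5, 0.0], [0.0, 0.5], [-1.0, 1.0]],
    "b1": [0.0, 0.0, 0.0],
    "W2": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.5]],
    "b2": [0.0, 0.0],
}

x_t = [40.0, 60.0]                           # loads
y_hat = nn_regressor(x_t, theta)             # predicted generator outputs
balance_residual = sum(y_hat) - sum(x_t)     # zero only if generation meets load
print(y_hat, balance_residual)               # [20.0, 40.0] -40.0
```

<p>Nothing in the forward pass enforces the constraints, so the prediction here misses the power balance by 40 units. This is why the residuals of the constraints have to be checked separately for any end-to-end prediction.</p>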
<h3 id="warm-start-methods">Warm-start methods</h3>
<p>For real power grids, the available training data is rather limited compared to the system size.
As a consequence, it is challenging for end-to-end methods to provide optimal predictions.
Of even greater concern, there is no guarantee that the predicted optimal point is a feasible point (i.e. satisfies all constraints), and violation of important constraints could lead to severe security issues for the grid.</p>
<p>Nevertheless, the predicted optimal point can be utilized as a starting point to initialize an interior-point method.
Interior-point methods (and actually most of the relevant optimization algorithms for OPF) can be started from a specific value of the optimization variables (warm-start optimization).
The idea of the warm-start approaches is to use a hybrid model that first applies a NN for predicting a set-point, \(\hat{y}_{t}^{0} = \mathrm{NN}_{\theta}^{\mathrm{r}}(x_{t})\), from which a warm-start optimization can be performed.<sup id="fnref:Baker19" role="doc-noteref"><a href="#fn:Baker19" class="footnote" rel="footnote">8</a></sup> <sup id="fnref:Jamei19" role="doc-noteref"><a href="#fn:Jamei19" class="footnote" rel="footnote">9</a></sup>
Using the previously introduced notation, we can write this approach as an estimator of the OPF operator, where the starting value of the optimization variables is estimated:</p>
\[\begin{equation}
\begin{aligned}
\hat{\Phi}^{\mathrm{warm}}(x_{t}) & = \Phi \left( x_{t}, \hat{y}_{t}^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) \\ & = \Phi \left( x_{t}, \mathrm{NN}_{\theta}^{\mathrm{r}}(x_{t}), f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) \\ &= y_{t}^{*} \quad .
\end{aligned}
\end{equation}\]
<p>The flowchart of the approach is shown in Figure 2.
<br />
<br /></p>
<p><img src="/blog/public/images/warm_start_opf.png" alt="warm_start_opf" />
Figure 2. Flowchart of the warm-start method (yellow panel) in combination with an NN regressor (purple panel). Default arguments of OPF operator are omitted for clarity.</p>
<p><br />
It is important to note that warm-start methods provide an exact (locally) optimal point as it is eventually obtained from an optimization problem equivalent to the original one.
Predicting an accurate set-point can significantly reduce the number of iterations (and so the computational cost) needed to obtain the optimal point compared to default heuristics of the optimization method.</p>
<p>Although the concept of warm-start interior-point techniques is quite attractive, there are some practical difficulties as well that we briefly discuss.
First, because only primal variables are initialized, the duals still need to converge, as interior-point methods require a minimum number of iterations even if the primals are set to their optimal values.
Trying to predict the duals with an NN as well would make the task even more challenging.
Second, if the initial values of primals are far from optimal (i.e. inaccurate prediction of set-points), the optimization can lead to a different local minimum.
Finally, even if the predicted values are close to the optimal solution, as there are no guarantees on feasibility of the starting point, this could be located in a region resulting in substantially longer solve times or even convergence failure.</p>
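<p>The effect of the starting point on the iteration count can be illustrated with a deliberately simple stand-in: plain gradient descent on an unconstrained quadratic replaces the interior-point solver, and the "predicted" starting point is hard-coded rather than produced by an NN. The function name <code>solve</code> and all numbers are illustrative assumptions, and the difficulties with duals and feasibility discussed above do not arise in this simplified setting.</p>

```python
def solve(y0, target=(60.0, 40.0), curvature=(1.0, 4.0), step=0.2, tol=1e-6):
    """Gradient descent on f(y) = sum_i a_i * (y_i - t_i)^2, started from y0.

    Returns the final iterate and the number of iterations performed.
    """
    y = list(y0)
    iterations = 0
    while True:
        grad = [2.0 * a * (yi - ti) for a, yi, ti in zip(curvature, y, target)]
        if max(abs(g) for g in grad) < tol:
            return y, iterations
        y = [yi - step * g for yi, g in zip(y, grad)]
        iterations += 1


y_cold, cold_iters = solve(y0=(0.0, 0.0))       # default starting point
y_warm, warm_iters = solve(y0=(59.99, 40.01))   # stand-in for an accurate prediction
print(cold_iters, warm_iters)
```

<p>Both runs reach the same optimum, but the run that starts close to it needs fewer iterations, which is the saving that warm-start methods aim for.</p>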
<h2 id="classification">Classification</h2>
<h3 id="predicting-active-sets">Predicting active sets</h3>
<p>An alternative hybrid approach using an NN classifier (\(\mathrm{NN}_{\theta}^{\mathrm{c}}\)) leverages the observation that only a fraction of all constraints is actually binding at the optimum<sup id="fnref:Ng18" role="doc-noteref"><a href="#fn:Ng18" class="footnote" rel="footnote">10</a></sup> <sup id="fnref:Deka19" role="doc-noteref"><a href="#fn:Deka19" class="footnote" rel="footnote">11</a></sup>, so a reduced OPF problem can be formulated by keeping only the binding constraints.
Since this reduced problem still has the same objective function as the original one, the solution should be equivalent to that of the original full
problem: \(\Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{A}_{t} \right) = \Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathcal{C}^{\mathrm{I}} \right) = y_{t}^{*}\), where \(\mathcal{A}_{t} \subseteq \mathcal{C}^{\mathrm{I}}\) is the active or binding subset of the inequality constraints (also note that \(\mathcal{C}^{\mathrm{E}} \cup \mathcal{A}_{t}\) contains all active constraints defining the specific congestion regime).
This suggests a classification formulation, in which the grid parameters are used to predict the active set.
The corresponding NN based estimator of the OPF operator can be written as:</p>
\[\begin{equation}
\begin{aligned}
\hat{\Phi}^{\mathrm{red}}(x_{t}) &= \Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \hat{\mathcal{A}}_{t} \right) \\ &= \Phi \left( x_{t}, y^{0}, f, \mathcal{C}^{\mathrm{E}}, \mathrm{NN}_{\theta}^{\mathrm{c}}(x_{t}) \right) \\ &= \hat{y}_{t}^{*} \quad .
\end{aligned}
\end{equation}\]
<p>Technically, the NN can be used in two ways to predict the active set.
One approach is to identify all distinct active sets in the training data and train a multiclass classifier that maps the input to the corresponding active set.<sup id="fnref:Deka19:1" role="doc-noteref"><a href="#fn:Deka19" class="footnote" rel="footnote">11</a></sup>
Since the number of active sets increases exponentially with system size, for larger grids it might be better to predict the binding status of each non-trivial constraint by using a binary multi-label classifier.<sup id="fnref:Robson20" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">12</a></sup></p>
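<p>The two ways of posing the classification task differ only in how the targets are encoded. The sketch below derives both encodings from the values of the inequality constraints at the optimum, using the \(c_{j}^{\mathrm{I}} \ge 0\) convention (a constraint is binding when its value is zero). The constraint values, the tolerance and the helper name <code>binding_labels</code> are hypothetical.</p>

```python
def binding_labels(constraint_values, tol=1e-6):
    """Multi-label target: 1 if the inequality c_j >= 0 is binding (c_j = 0), else 0."""
    return tuple(int(abs(c) <= tol) for c in constraint_values)


# Hypothetical constraint values c_j(x_t, y_t*) at the optimum of four samples
samples = [
    (0.0, 12.5, 3.0),
    (0.0, 0.0, 7.1),
    (0.0, 12.0, 2.2),
    (4.0, 0.0, 0.0),
]

# Multi-label encoding: one binary output per constraint
multi_label_targets = [binding_labels(c) for c in samples]

# Multiclass encoding: every distinct active set seen in the data becomes one class
class_of = {}
multiclass_targets = [class_of.setdefault(t, len(class_of))
                      for t in multi_label_targets]

print(multi_label_targets)  # [(1, 0, 0), (1, 1, 0), (1, 0, 0), (0, 1, 1)]
print(multiclass_targets)   # [0, 1, 0, 2]
print(len(class_of), "classes seen out of", 2 ** len(samples[0]), "possible")
```

<p>The multi-label output grows linearly with the number of constraints, whereas the number of possible classes in the multiclass encoding grows as \(2^{m}\) and an active set absent from the training data can never be predicted.</p>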
<h3 id="iterative-feasibility-test">Iterative feasibility test</h3>
<p>Imperfect prediction of the binding status of constraints (or active set) can lead to similar security issues as imperfect regression.
This is especially important for false negative predictions, i.e. predicting an actually binding constraint as non-binding.
As there may be violated constraints not included in the reduced model, one can use the <em>iterative feasibility test</em> to ensure convergence to an optimal point of the full problem.<sup id="fnref:Pineda20" role="doc-noteref"><a href="#fn:Pineda20" class="footnote" rel="footnote">13</a></sup> <sup id="fnref:Robson20:1" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">12</a></sup>
The procedure has been widely used by power grid operators.
In combination with a classifier, it includes the following steps:</p>
<ol>
<li>An initial active set of inequality constraints (\(\hat{\mathcal{A}}_{t}^{(1)}\)) is proposed by the classifier and a solution of the reduced problem is obtained.</li>
<li>In each feasibility iteration, \(k = 1, \ldots, K\), the optimal point of the actual reduced problem (\({\hat{y}_{t}^{*}}^{(k)}\)) is validated against the constraints \(\mathcal{C}^{\mathrm{I}}\) of the original full formulation.</li>
<li>At each step \(k\), the violated constraints \(\mathcal{N}_{t}^{(k)} \subseteq \mathcal{C}^{\mathrm{I}} \setminus \hat{\mathcal{A}}_{t}^{(k)}\) are added to the set of considered inequality constraints to form the active set of the next iteration: \(\hat{\mathcal{A}}_{t}^{(k+1)} = \hat{\mathcal{A}}_{t}^{(k)} \cup \mathcal{N}_{t}^{(k)}\).</li>
<li>The procedure is repeated until no violations are found (i.e. \(\mathcal{N}_{t}^{(k)} = \emptyset\)), and the optimal point satisfies all original constraints \(\mathcal{C}^{\mathrm{I}}\). At this point, we have found the optimal point of the full problem (\({\hat{y}_{t}^{*}}^{(k)} = y_{t}^{*}\)).</li>
</ol>
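<p>The steps above can be sketched on a toy single-bus dispatch in which the only inequality constraints subject to prediction are the generator upper bounds. The merit-order routine <code>solve_reduced</code> stands in for a real OPF solver, and the function names, costs, capacities and predicted active sets are illustrative assumptions.</p>

```python
def solve_reduced(load, costs, p_max, active):
    """Merit-order dispatch enforcing only the upper bounds whose index is in `active`."""
    dispatch = [0.0] * len(costs)
    remaining = load
    for i in sorted(range(len(costs)), key=lambda i: costs[i]):
        cap = p_max[i] if i in active else float("inf")
        dispatch[i] = min(remaining, cap)
        remaining -= dispatch[i]
    return dispatch


def iterative_feasibility_test(load, costs, p_max, predicted_active, tol=1e-9):
    """Solve reduced problems, adding violated bounds until none remain."""
    if load > sum(p_max):
        raise ValueError("load exceeds total capacity: the full problem is infeasible")
    active = set(predicted_active)      # enforced set; may contain false positives
    iterations = 0
    while True:
        iterations += 1
        dispatch = solve_reduced(load, costs, p_max, active)
        # Validate against ALL upper bounds of the full formulation
        violated = {i for i, (p, cap) in enumerate(zip(dispatch, p_max))
                    if p > cap + tol}
        if not violated:
            return dispatch, active, iterations
        active |= violated


costs, p_max, load = (10.0, 20.0, 30.0), (50.0, 60.0, 100.0), 130.0
print(iterative_feasibility_test(load, costs, p_max, predicted_active=set()))
# ([50.0, 60.0, 20.0], {0, 1}, 3)
print(iterative_feasibility_test(load, costs, p_max, predicted_active={0, 1}))
# ([50.0, 60.0, 20.0], {0, 1}, 1)
```

<p>With a perfect prediction a single reduced solve suffices, while each false negative in this toy case costs one extra iteration. The final dispatch is the same in both runs, as the procedure guarantees.</p>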
<p>The flowchart of the iterative feasibility test in combination with NN is presented in Figure 3.
As the reduced OPF is much cheaper to solve than the full problem, this procedure can be very efficient if it converges in a few iterations.
<br />
<br /></p>
<p><img src="/blog/public/images/reduced_opf.png" alt="reduced_opf" />
Figure 3. Flowchart of the iterative feasibility test method (yellow panel) in combination with an NN classifier (purple panel). Default arguments of OPF operator are omitted for clarity.</p>
<p><br /></p>
<h2 id="technical-details-of-models">Technical details of models</h2>
<p>In this section, we provide a high level overview of the most general technical details used in the field.</p>
<h3 id="systems-and-samples">Systems and samples</h3>
<p>As discussed earlier, both the regression and classification approaches require a relatively large number of training samples, ranging between a few thousand and hundreds of thousands, depending on the OPF type, system size, and various grid parameters.
Therefore, most works use synthetic grids from the Power Grid Library,<sup id="fnref:Babaeinejadsarookolaee19" role="doc-noteref"><a href="#fn:Babaeinejadsarookolaee19" class="footnote" rel="footnote">14</a></sup> for which samples can be generated straightforwardly.
The size of the investigated systems usually ranges from a few tens to a few thousand buses, and both DC- and AC-OPF problems can be investigated for economic dispatch, security-constrained, unit commitment and even security-constrained unit commitment problems.
The input grid parameters are primarily the active and reactive power loads, although a much wider selection of varied grid parameters is also possible.
The standard technique is to generate feasible samples by varying the grid parameters by \(3\%\) to \(70\%\) around their default values, using multivariate uniform, normal, or truncated normal distributions.</p>
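<p>A minimal sketch of the uniform variant of this sampling scheme is shown below, with each default load multiplied by an independent factor drawn from \(\mathcal{U}(0.9, 1.1)\). The function name <code>sample_loads</code>, the default loads and the seed are illustrative assumptions. Checking feasibility would additionally require solving the OPF for every sample, which is not shown.</p>

```python
import random


def sample_loads(default_loads, low=0.9, high=1.1, n_samples=5, seed=0):
    """Scale every default load by an independent factor drawn from U(low, high)."""
    rng = random.Random(seed)  # local generator keeps the sampling reproducible
    return [[d * rng.uniform(low, high) for d in default_loads]
            for _ in range(n_samples)]


default_loads = [100.0, 50.0, 75.0]
for sample in sample_loads(default_loads, n_samples=3):
    print([round(v, 2) for v in sample])
```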
<p>Finally, we note that given the rapid increase of attention in the field it would be beneficial to have standard benchmark data sets in order to compare different models and approaches.<sup id="fnref:Robson20:2" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">12</a></sup></p>
<h3 id="loss-functions">Loss functions</h3>
<p>For regression-based approaches, the most basic loss function optimized with respect to the NN parameters is the (mean) squared error.
In order to reduce possible violations of certain constraints, an additional penalty term can be added to this loss function.</p>
<p>For classification-based methods, cross-entropy (multiclass classifier) or binary cross-entropy (multi-label classifier) functions can be applied with a possible regularization term.</p>
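<p>The two families of loss functions can be sketched as follows. The quadratic form of the constraint violation penalty, its weight and the function names are assumptions made for illustration, since the works cited here use different penalty terms. Constraint values follow the \(c_{j}^{\mathrm{I}} \ge 0\) convention, so only negative values are penalized.</p>

```python
import math


def mse_with_penalty(y_pred, y_true, constraint_values, weight=1.0):
    """Mean squared error plus a quadratic penalty on violated inequalities (c_j < 0)."""
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)
    penalty = sum(max(0.0, -c) ** 2 for c in constraint_values)
    return mse + weight * penalty


def binary_cross_entropy(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the constraints of a multi-label classifier."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(targets)


# Regression: the second constraint is violated by 1.0, so the penalty adds 1.0
print(mse_with_penalty([1.0, 2.0], [1.0, 4.0], constraint_values=[0.5, -1.0]))  # 3.0
# Classification: predicted binding probabilities vs. true binding status
print(binary_cross_entropy([0.9, 0.2, 0.6], [1, 0, 1]))
```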
<h3 id="nn-architectures">NN architectures</h3>
<p>Most models use fully connected NNs (FCNNs), ranging from shallow to deep architectures.
However, there are also attempts to take the grid topology into account, and convolutional (CNN) and graph (GNN) neural networks have been used for both regression and classification approaches.
GNNs, which can use the graph of the grid explicitly, seemed particularly successful compared to FCNN and CNN architectures.<sup id="fnref:Falconer20" role="doc-noteref"><a href="#fn:Falconer20" class="footnote" rel="footnote">15</a></sup>
<br />
<br /></p>
<p>Table 1. Comparison of some works using neural networks for predicting solutions to OPF. For each reference, the largest investigated system is shown with the corresponding number of buses (\(\lvert \mathcal{N} \rvert\)). Dataset size and grid input types are also presented. For sample generation, \(\mathcal{U}\) and \(\mathcal{TN}\) denote uniform and truncated normal distributions, respectively, and their arguments are the minimum and maximum factors multiplying the default grid parameter value. FCNN, CNN and GNN denote fully connected, convolutional and graph neural networks, respectively. SE, MSE and CE indicate squared error, mean squared error and cross-entropy loss functions, respectively, and cvpt denotes constraint violation penalty term.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Ref.</th>
<th style="text-align: left">OPF</th>
<th style="text-align: left">System</th>
<th style="text-align: left">\(\lvert \mathcal{N} \rvert\)</th>
<th style="text-align: left">Dataset</th>
<th style="text-align: left">Input</th>
<th style="text-align: left">Sampling</th>
<th style="text-align: left">NN</th>
<th style="text-align: left">Predictor</th>
<th style="text-align: left">Loss</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><sup id="fnref:Guha19:2" role="doc-noteref"><a href="#fn:Guha19" class="footnote" rel="footnote">6</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">118-ieee</td>
<td style="text-align: left">118</td>
<td style="text-align: left">813k</td>
<td style="text-align: left">loads</td>
<td style="text-align: left">\(\mathcal{U}(0.9, 1.1)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">MSE + cvpt</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Fioretto19" role="doc-noteref"><a href="#fn:Fioretto19" class="footnote" rel="footnote">16</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">300-ieee</td>
<td style="text-align: left">300</td>
<td style="text-align: left">236k</td>
<td style="text-align: left">loads</td>
<td style="text-align: left">\(\mathcal{U}(0.8, 1.2)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">SE + cvpt</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Zamzam19:1" role="doc-noteref"><a href="#fn:Zamzam19" class="footnote" rel="footnote">7</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">118-ieee</td>
<td style="text-align: left">118</td>
<td style="text-align: left">100k</td>
<td style="text-align: left">loads</td>
<td style="text-align: left">\(\mathcal{TN}(0.3, 1.7)\) <br /> \(\mathcal{U}(0.8, 1.0)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">MSE</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Pan19" role="doc-noteref"><a href="#fn:Pan19" class="footnote" rel="footnote">17</a></sup></td>
<td style="text-align: left">DC-SCED</td>
<td style="text-align: left">300-ieee</td>
<td style="text-align: left">300</td>
<td style="text-align: left">55k</td>
<td style="text-align: left">load</td>
<td style="text-align: left">\(\mathcal{U}(0.9, 1.1)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">MSE + cvpt</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Owerko19" role="doc-noteref"><a href="#fn:Owerko19" class="footnote" rel="footnote">18</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">118-ieee</td>
<td style="text-align: left">118</td>
<td style="text-align: left">13k</td>
<td style="text-align: left">loads</td>
<td style="text-align: left">\(\mathcal{U}(0.9, 1.1)\)</td>
<td style="text-align: left">FCNN <br /> GNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">MSE</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Deka19:2" role="doc-noteref"><a href="#fn:Deka19" class="footnote" rel="footnote">11</a></sup></td>
<td style="text-align: left">DC-ED</td>
<td style="text-align: left">24-ieee-rts</td>
<td style="text-align: left">24</td>
<td style="text-align: left">50k</td>
<td style="text-align: left">load</td>
<td style="text-align: left">\(\mathcal{U}(0.97, 1.03)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">classifier</td>
<td style="text-align: left">CE</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Chatzos20" role="doc-noteref"><a href="#fn:Chatzos20" class="footnote" rel="footnote">19</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">France-Lyon</td>
<td style="text-align: left">3411</td>
<td style="text-align: left">10k</td>
<td style="text-align: left">loads</td>
<td style="text-align: left">\(\mathcal{U}(0.8, 1.2)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">SE + cvpt</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Pan20" role="doc-noteref"><a href="#fn:Pan20" class="footnote" rel="footnote">20</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">30-ieee</td>
<td style="text-align: left">30</td>
<td style="text-align: left">12k</td>
<td style="text-align: left">loads</td>
<td style="text-align: left">\(\mathcal{U}(0.8, 1.2)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">SE + cvpt</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Venzke20" role="doc-noteref"><a href="#fn:Venzke20" class="footnote" rel="footnote">21</a></sup></td>
<td style="text-align: left">DC-ED</td>
<td style="text-align: left">300-ieee</td>
<td style="text-align: left">300</td>
<td style="text-align: left">100k</td>
<td style="text-align: left">load</td>
<td style="text-align: left">\(\mathcal{U}(0.4, 1.0)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">regressor</td>
<td style="text-align: left">MSE</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Robson20:3" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">12</a></sup></td>
<td style="text-align: left">DC-ED</td>
<td style="text-align: left">1354-pegase</td>
<td style="text-align: left">1354</td>
<td style="text-align: left">10k</td>
<td style="text-align: left">load + <br /> 3 other params</td>
<td style="text-align: left">\(\mathcal{U}(0.85, 1.15)\) <br /> \(\mathcal{U}(0.9, 1.1)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">classifier</td>
<td style="text-align: left">CE</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Robson20:4" role="doc-noteref"><a href="#fn:Robson20" class="footnote" rel="footnote">12</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">300-ieee</td>
<td style="text-align: left">300</td>
<td style="text-align: left">1k</td>
<td style="text-align: left">loads + <br /> 5 other params</td>
<td style="text-align: left">\(\mathcal{U}(0.85, 1.15)\) <br /> \(\mathcal{U}(0.9, 1.1)\)</td>
<td style="text-align: left">FCNN</td>
<td style="text-align: left">classifier</td>
<td style="text-align: left">CE</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Falconer20:1" role="doc-noteref"><a href="#fn:Falconer20" class="footnote" rel="footnote">15</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">300-ieee</td>
<td style="text-align: left">300</td>
<td style="text-align: left">10k</td>
<td style="text-align: left">loads + <br /> 5 other params</td>
<td style="text-align: left">\(\mathcal{U}(0.85, 1.15)\) <br /> \(\mathcal{U}(0.9, 1.1)\)</td>
<td style="text-align: left">FCNN <br /> CNN <br /> GNN</td>
<td style="text-align: left">regressor <br /> classifier</td>
<td style="text-align: left">MSE <br /> CE</td>
</tr>
<tr>
<td style="text-align: left"><sup id="fnref:Falconer21:1" role="doc-noteref"><a href="#fn:Falconer21" class="footnote" rel="footnote">1</a></sup></td>
<td style="text-align: left">AC-ED</td>
<td style="text-align: left">2853-sdet</td>
<td style="text-align: left">2853</td>
<td style="text-align: left">10k</td>
<td style="text-align: left">loads</td>
<td style="text-align: left">\(\mathcal{U}(0.8, 1.2)\)</td>
<td style="text-align: left">FCNN <br /> CNN <br /> GNN</td>
<td style="text-align: left">regressor <br /> classifier</td>
<td style="text-align: left">MSE <br /> CE</td>
</tr>
</tbody>
</table>
<h2 id="conclusions">Conclusions</h2>
<p>Because they move the computational effort to offline training, machine learning techniques for predicting OPF solutions have become an active research direction.</p>
<p>Neural network based approaches are particularly promising as they can effectively model complex non-linear relationships between grid parameters and generator set-points in electrical grids.</p>
<p>End-to-end approaches, which try to map the grid parameters directly to the optimal value of the optimization variables, can provide the most computational gain.
However, they require a large training data set to achieve sufficient predictive accuracy; otherwise, the predictions are unlikely to be optimal or even feasible.</p>
<p>Hybrid techniques apply a combination of an NN model and a subsequent OPF step.
The NN model can be used to predict a good starting point for warm-starting the OPF, or to generate a series of efficiently reduced OPF models using the iterative feasibility test.
Hybrid methods can, therefore, improve the efficiency of OPF computations without sacrificing feasibility and optimality.
Neural networks of the hybrid models can be trained by using conventional loss functions that measure some form of the prediction error.
In a subsequent <a href="/blog/2021/12/17/opf-nn-meta/">blog post</a>, we will show that optimizing the computational cost directly can give hybrid models a significantly larger computational gain than training with these conventional loss functions.</p>
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:Falconer21" role="doc-endnote">
<p>T. Falconer and L. Mones, <a href="https://arxiv.org/abs/2110.00306">“Leveraging power grid topology in machine learning assisted optimal power flow”</a>, <em>arXiv:2110.00306</em>, (2021). <a href="#fnref:Falconer21" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Falconer21:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:Boyd04" role="doc-endnote">
<p>S. Boyd and L. Vandenberghe, <a href="https://web.stanford.edu/~boyd/cvxbook/">“Convex Optimization”</a>, <em>New York: Cambridge University Press</em>, (2004). <a href="#fnref:Boyd04" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Nocedal06" role="doc-endnote">
<p>J. Nocedal and S. J. Wright, <a href="https://link.springer.com/book/10.1007/978-0-387-40065-5">“Numerical Optimization”</a>, <em>New York: Springer</em>, (2006). <a href="#fnref:Nocedal06" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Wachter06" role="doc-endnote">
<p>A. Wächter and L. Biegler, <a href="https://link.springer.com/article/10.1007/s10107-004-0559-y">“On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming”</a>, <em>Math. Program.</em>, <strong>106</strong>, pp. 25, (2006). <a href="#fnref:Wachter06" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Zhou20" role="doc-endnote">
<p>F. Zhou, J. Anderson and S. H. Low, <a href="https://arxiv.org/abs/1907.02219">“The Optimal Power Flow Operator: Theory and Computation”</a>, <em>arXiv:1907.02219</em>, (2020). <a href="#fnref:Zhou20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Guha19" role="doc-endnote">
<p>N. Guha, Z. Wang and A. Majumdar, <a href="https://www.climatechange.ai/papers/icml2019/9/paper.pdf">“Machine Learning for AC Optimal Power Flow”</a>, <em>Proceedings of the 36th International Conference on Machine Learning Workshop</em>, (2019). <a href="#fnref:Guha19" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Guha19:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:Guha19:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a></p>
</li>
<li id="fn:Zamzam19" role="doc-endnote">
<p>A. Zamzam and K. Baker, <a href="https://arxiv.org/abs/1910.01213">“Learning Optimal Solutions for Extremely Fast AC Optimal Power Flow”</a>, <em>arXiv:1910.01213</em>, (2019). <a href="#fnref:Zamzam19" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Zamzam19:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:Baker19" role="doc-endnote">
<p>K. Baker, <a href="https://arxiv.org/abs/1905.08860">“Learning Warm-Start Points For AC Optimal Power Flow”</a>, <em>IEEE International Workshop on Machine Learning for Signal Processing</em>, pp. 1, (2019). <a href="#fnref:Baker19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Jamei19" role="doc-endnote">
<p>M. Jamei, L. Mones, A. Robson, L. White, J. Requeima and C. Ududec, <a href="https://www.climatechange.ai/papers/icml2019/42/paper.pdf">“Meta-Optimization of Optimal Power Flow”</a>, <em>Proceedings of the 36th International Conference on Machine Learning Workshop</em>, (2019). <a href="#fnref:Jamei19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Ng18" role="doc-endnote">
<p>Y. Ng, S. Misra, L. A. Roald and S. Backhaus, <a href="https://arxiv.org/abs/1801.07809">“Statistical Learning For DC Optimal Power Flow”</a>, <em>arXiv:1801.07809</em>, (2018). <a href="#fnref:Ng18" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Deka19" role="doc-endnote">
<p>D. Deka and S. Misra, <a href="https://arxiv.org/abs/1902.05607">“Learning for DC-OPF: Classifying active sets using neural nets”</a>, <em>arXiv:1902.05607</em>, (2019). <a href="#fnref:Deka19" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Deka19:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:Deka19:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a></p>
</li>
<li id="fn:Robson20" role="doc-endnote">
<p>A. Robson, M. Jamei, C. Ududec and L. Mones, <a href="https://arxiv.org/abs/1911.06784">“Learning an Optimally Reduced Formulation of OPF through Meta-optimization”</a>, <em>arXiv:1911.06784</em>, (2020). <a href="#fnref:Robson20" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Robson20:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:Robson20:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a> <a href="#fnref:Robson20:3" class="reversefootnote" role="doc-backlink">↩<sup>4</sup></a> <a href="#fnref:Robson20:4" class="reversefootnote" role="doc-backlink">↩<sup>5</sup></a></p>
</li>
<li id="fn:Pineda20" role="doc-endnote">
<p>S. Pineda, J. M. Morales and A. Jiménez-Cordero, <a href="https://arxiv.org/abs/1907.04694">“Data-Driven Screening of Network Constraints for Unit Commitment”</a>, <em>IEEE Transactions on Power Systems</em>, <strong>35</strong>, pp. 3695, (2020). <a href="#fnref:Pineda20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Babaeinejadsarookolaee19" role="doc-endnote">
<p>S. Babaeinejadsarookolaee, A. Birchfield, R. D. Christie, C. Coffrin, C. DeMarco, R. Diao, M. Ferris, S. Fliscounakis, S. Greene, R. Huang, C. Josz, R. Korab, B. Lesieutre, J. Maeght, D. K. Molzahn, T. J. Overbye, P. Panciatici, B. Park, J. Snodgrass and R. Zimmerman, <a href="https://arxiv.org/abs/1908.02788">“The Power Grid Library for Benchmarking AC Optimal Power Flow Algorithms”</a>, <em>arXiv:1908.02788</em>, (2019). <a href="#fnref:Babaeinejadsarookolaee19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Falconer20" role="doc-endnote">
<p>T. Falconer and L. Mones, <a href="https://arxiv.org/abs/2011.03352">“Deep learning architectures for inference of AC-OPF solutions”</a>, <em>arXiv:2011.03352</em>, (2020). <a href="#fnref:Falconer20" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Falconer20:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:Fioretto19" role="doc-endnote">
<p>F. Fioretto, T. Mak and P. V. Hentenryck, <a href="https://arxiv.org/abs/1909.10461">“Predicting AC Optimal Power Flows: Combining Deep Learning and Lagrangian Dual Methods”</a>, <em>arXiv:1909.10461</em>, (2019). <a href="#fnref:Fioretto19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Pan19" role="doc-endnote">
<p>X. Pan, T. Zhao, M. Chen and S. Zhang, <a href="https://arxiv.org/abs/1910.14448">“DeepOPF: A Deep Neural Network Approach for Security-Constrained DC Optimal Power Flow”</a>, <em>arXiv:1910.14448</em>, (2019). <a href="#fnref:Pan19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Owerko19" role="doc-endnote">
<p>D. Owerko, F. Gama and A. Ribeiro, <a href="https://arxiv.org/abs/1910.09658">“Optimal Power Flow Using Graph Neural Networks”</a>, <em>arXiv:1910.09658</em>, (2019). <a href="#fnref:Owerko19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Chatzos20" role="doc-endnote">
<p>M. Chatzos, F. Fioretto, T. W. K. Mak and P. V. Hentenryck, <a href="https://arxiv.org/abs/2006.16356">“High-Fidelity Machine Learning Approximations of Large-Scale Optimal Power Flow”</a>, <em>arXiv:2006.16356</em>, (2020). <a href="#fnref:Chatzos20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Pan20" role="doc-endnote">
<p>X. Pan, M. Chen, T. Zhao and S. H. Low, <a href="https://arxiv.org/abs/2007.01002">“DeepOPF: A Feasibility-Optimized Deep Neural Network Approach for AC Optimal Power Flow Problems”</a>, <em>arXiv:2007.01002</em>, (2020). <a href="#fnref:Pan20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Venzke20" role="doc-endnote">
<p>A. Venzke, G. Qu, S. Low and S. Chatzivasileiadis, <a href="https://arxiv.org/abs/2006.11029">“Learning Optimal Power Flow: Worst-Case Guarantees for Neural Networks”</a>, <em>arXiv:2006.11029</em>, (2020). <a href="#fnref:Venzke20" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Letif Mones
In a previous blog post, we discussed the fundamental concepts of optimal power flow (OPF), a core problem in operating electricity grids. In their principal form, AC-OPFs are non-linear and non-convex optimization problems that are in general expensive to solve. In practice, due to the large size of electricity grids and number of constraints, solving even the linearized approximation (DC-OPF) is a challenging optimization requiring significant computational effort. Adding to the difficulty, the growing integration of renewable generation has increased uncertainty in grid conditions and increasingly requires grid operators to solve OPF problems in near real-time. This has led to new research efforts in using machine learning (ML) approaches to shift the computational effort away from real-time (optimization) to offline training, potentially providing an almost instant prediction of the outputs of OPF problems.
Invenia at JuliaCon 2021
2021-08-10T00:00:00+00:00
https://invenia.github.io/blog/2021/08/10/juliacon2021
<p>Another year has passed and another JuliaCon has happened with great success. This was the second year that the conference was fully online. While it’s a shame that we don’t get to meet all the interesting people from the Julia community in person, it also means that the conference is able to reach an even broader audience. This year, there were over 20,000 registrations and over 43,000 people tuned in on YouTube. That’s roughly double the numbers from last year! As usual, Invenia was present at the conference in various forms: as sponsors, as volunteers to the organisation, as presenters, and as part of the audience.</p>
<p>In this post, we highlight the work we presented this year. If you missed any of the talks, they are all available on YouTube, and we provide the links.</p>
<h2 id="clearing-the-pipeline-jungle-with-featuretransformsjl">Clearing the Pipeline Jungle with FeatureTransforms.jl</h2>
<p>The prevalence of glue code in feature engineering pipelines poses many problems in conducting high-quality, scalable research in machine learning and data science. In worst-case scenarios, the technical debt racked up by overgrown “pipeline jungles” can prevent a project from making any meaningful progress beyond a certain point. In this talk we discuss how we thought about this problem in our own code, and what the ideal properties of a feature engineering workflow should be. The result was <a href="https://github.com/invenia/FeatureTransforms.jl">FeatureTransforms.jl</a>: a package that can help make feature engineering a more sustainable practice for users without sacrificing the desired flexibility.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/49zKPC0r-aU" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="everything-you-need-to-know-about-chainrules-10">Everything you need to know about ChainRules 1.0</h2>
<p><a href="https://en.wikipedia.org/wiki/Automatic_differentiation">Automatic differentiation</a> (AD) is a key component of most machine learning applications as it enables efficient learning of model parameters. AD systems can compute gradients of complex functions by combining the gradients of basic operations that make up the function. To do that, an AD system needs access to rules for the gradients of basic functions. The ChainRules ecosystem provides a set of rules for functions in Julia standard libraries, the utilities for writing custom rules, and utilities for testing those rules.</p>
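<p>To make the idea of combining per-operation rules concrete, here is a minimal sketch in Python (the language used for some of the later examples on this blog). It is not the ChainRules API: the <code>RULES</code> table and the <code>value_and_derivative</code> helper are hypothetical names used only for illustration, and the sketch handles scalar functions composed in a chain.</p>

```python
import math

# Hypothetical rule table: each primitive maps to a function that returns
# (value, pullback), where the pullback propagates a derivative backwards.
RULES = {
    math.sin: lambda x: (math.sin(x), lambda dy: dy * math.cos(x)),
    math.exp: lambda x: (math.exp(x), lambda dy: dy * math.exp(x)),
}


def value_and_derivative(fs, x):
    """Evaluate fs[-1](...fs[0](x)) and its derivative by chaining pullbacks."""
    pullbacks = []
    for f in fs:
        x, pullback = RULES[f](x)  # forward pass: record one pullback per primitive
        pullbacks.append(pullback)
    dy = 1.0
    for pullback in reversed(pullbacks):  # reverse pass: apply the chain rule
        dy = pullback(dy)
    return x, dy


# Derivative of exp(sin(x)) at x = 0.5, assembled purely from the two rules.
value, derivative = value_and_derivative([math.sin, math.exp], 0.5)
```

<p>The composite derivative is never written by hand. It is assembled from the rules of the primitives, which is why a shared, well-tested set of rules is so valuable to AD systems.</p>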
<p>The ChainRules project has now reached the major milestone of releasing its version 1.0. One of the main highlights of this release is the ability to write rules that are defined conditionally based on the properties of the AD system. This provides the ability to write rules for higher order functions, such as <code class="language-plaintext highlighter-rouge">map</code>, by calling back into AD. Other highlights include the ability to opt out of abstractly-typed rules for a particular signature (using AD to compose a rule, instead), making sure that the differential is in the correct subspace, a number of convenient macros for writing rules, and improved testing utilities which now include testing capabilities tailored for AD systems. ChainRules is now also integrated into a number of prominent AD packages, including ForwardDiff2, Zygote, Nabla, Yota, and Diffractor. In the talk, we describe the key new features and guide the user on how to write correct and efficient rules.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/a8ol-1l84gc" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="exprtools-metaprogramming-from-reflection">ExprTools: Metaprogramming from reflection</h2>
<p>We initially created <a href="https://github.com/invenia/ExprTools.jl/">ExprTools.jl</a> to clean up some of our code in <a href="https://github.com/invenia/Mocking.jl">Mocking.jl</a>, so that we could more easily write patches that look like function declarations. ExprTools’ initial features were a more robust version of <code class="language-plaintext highlighter-rouge">splitdef</code> and <code class="language-plaintext highlighter-rouge">combinedef</code> from <a href="https://github.com/FluxML/MacroTools.jl/">MacroTools.jl</a>. These functions allow breaking up the <a href="https://en.wikipedia.org/wiki/Abstract_syntax_tree">abstract syntax tree</a> (AST) for a function definition into a dictionary of all the parts (name, arguments, body, etc.), which can then be manipulated and combined back into an <a href="https://docs.julialang.org/en/v1/manual/metaprogramming/">AST (i.e. an <code class="language-plaintext highlighter-rouge">Expr</code> object)</a> that a macro can return.</p>
<p>With the goal of supporting <a href="https://juliadiff.org/ChainRulesCore.jl/stable/">ChainRules</a> in <a href="https://github.com/invenia/Nabla.jl">Nabla.jl</a>, we recently extended ExprTools.jl with a new method: <code class="language-plaintext highlighter-rouge">signature</code>. Nabla is an operator-overloading AD: to define a new rule for how to perform AD over some function <code class="language-plaintext highlighter-rouge">f</code>, it is overloaded to accept a <code class="language-plaintext highlighter-rouge">::Node{T}</code> argument which contains both the value of type <code class="language-plaintext highlighter-rouge">T</code> and the additional tracking information needed for AD. ChainRules, on the other hand, defines a rule for how to perform AD by defining an overload for the function <a href="https://juliadiff.org/ChainRulesCore.jl/v0.6/#frule-and-rrule-1"><code class="language-plaintext highlighter-rouge">rrule</code></a>. So, for every method of <code class="language-plaintext highlighter-rouge">rrule</code>—e.g. <code class="language-plaintext highlighter-rouge">rrule(f, ::T)</code>—we needed to generate an overload of the original function that takes a node: e.g. <code class="language-plaintext highlighter-rouge">f(::Node{T})</code>. We need to use metaprogramming to define those overloads based on <a href="https://en.wikipedia.org/wiki/Reflective_programming">reflection</a> over the <a href="https://docs.julialang.org/en/v1/base/base/#Base.methods">method table</a> of <code class="language-plaintext highlighter-rouge">rrule</code>. The new <code class="language-plaintext highlighter-rouge">signature</code> function in ExprTools makes this possible. 
<code class="language-plaintext highlighter-rouge">signature</code> works like <code class="language-plaintext highlighter-rouge">splitdef</code>, returning a dictionary that can be combined into an <code class="language-plaintext highlighter-rouge">AST</code>, but, rather than requiring an AST to split up, it takes a <code class="language-plaintext highlighter-rouge">Method</code> object.</p>
<p>This talk goes into how ExprTools can be used, both on ASTs with <code class="language-plaintext highlighter-rouge">splitdef</code>, and on methods with <code class="language-plaintext highlighter-rouge">signature</code>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/CREWoLxpDMo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="fancy-arrays-bof-2">Fancy Arrays BoF 2</h2>
<p>A recurring pattern in the Julia ecosystem is the need to perform operations on named dimensions (e.g. <code class="language-plaintext highlighter-rouge">std(X; dim=:time)</code>, <code class="language-plaintext highlighter-rouge">cat(X, Y; dims=:variate)</code>) or lookup by axis keys (e.g. <code class="language-plaintext highlighter-rouge">X(time = DateTime(2017, 1, 1), variate = :object1)</code>). These patterns improve code readability and reduce bugs by explicitly naming the quantities over which calculations/manipulations should be performed. While <a href="https://github.com/JuliaArrays/AxisArrays.jl">AxisArrays</a> has been the <em>de facto</em> standard over the past few years, several attempts have been made to simplify the interface.</p>
<p>Last year, during JuliaCon 2020, an initial Birds of a Feather (BoF) session on these “fancy” arrays outlined the common requirements folks had and set goals to create new packages that would implement that functionality as minimally as possible. Please refer to last year’s <a href="https://docs.google.com/document/d/1imBX3k0EEejauWVyXONZDRj8LTr0PeLOJNGEgo6ow1g/edit#heading=h.qrm4q6q56yxm">notes</a> for more details. A year on, only a few of those packages remain actively maintained, and they have largely overlapping feature sets. The goal of this year’s BoF was to reduce code duplication and feature siloing by either merging the remaining packages or constructing a shared API.</p>
<p>We identified several recurring issues with our existing wrapper array types, which <a href="https://github.com/JuliaArrays/ArrayInterface.jl">ArrayInterface.jl</a> may help address. Similarly, it became apparent that there was a growing need to address broader ecosystem support, which is complicated by disparate APIs and workflows. Various people supported specifying a minimal API for operating on named dimensions, indexing by axis keys and converting between types, similar to <a href="https://github.com/JuliaData/Tables.jl">Tables.jl</a>. This shared API may address several concerns raised about the lack of consistency and standard packages within the Julia ecosystem (like <code class="language-plaintext highlighter-rouge">xarray</code> in Python). See this year’s <a href="https://docs.google.com/document/d/1RPQw3zMGRVm8cayUrQhFGzlKV5hp-1DJMUE32H_-bgo/edit?usp=sharing">notes</a> for more details.</p>
<h2 id="parameterhandlingjl">ParameterHandling.jl</h2>
<p>Any time you want to fit a model you have to figure out how to manage its parameters, and how to work with standard optimisation and inference interfaces. This becomes complicated for all but the most trivial models. <a href="https://github.com/invenia/ParameterHandling.jl/">ParameterHandling.jl</a> provides an API and tooling to help you manage this complexity in a scalable way. There are a number of packages offering slightly different approaches to solving this problem (such as <a href="https://github.com/FluxML/Functors.jl">Functors.jl</a>, <a href="https://github.com/Metalenz/FlexibleFunctors.jl">FlexibleFunctors.jl</a>, <a href="https://github.com/tpapp/TransformVariables.jl">TransformVariables.jl</a>, and <a href="https://github.com/jonniedie/ComponentArrays.jl">ComponentArrays.jl</a>), but we believe that ParameterHandling.jl has some nice properties (such as the ability to work with arbitrary data types without modification, and the ability to use immutable data structures), and it has been useful in our work. Hopefully future work will unify the best aspects of all of these packages.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/4GmQ4RJGFy0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<h2 id="distributed-computing-using-awsclustermanagersjl">Distributed Computing using AWSClusterManagers.jl</h2>
<p>Cloud computing is a key piece of the workflow for many compute-heavy applications, and Amazon’s AWS is one of the market leaders in this area. Thus, seamlessly integrating Julia and AWS is of great importance to us, and that is why we have been working on <a href="https://github.com/JuliaCloud/AWSClusterManagers.jl">AWSClusterManagers.jl</a>.</p>
<p>AWSClusterManagers.jl allows users to run a distributed workload on AWS as easily as the Base Distributed package. It is one of a few cloud packages which we have recently open-sourced, alongside <a href="https://github.com/JuliaCloud/AWSTools.jl">AWSTools.jl</a> and <a href="https://github.com/JuliaCloud/AWSBatch.jl">AWSBatch.jl</a>. A simple example of how to use the package <a href="https://github.com/mattBrzezinski/AWSClusterManagersDemo.jl">can be found on GitHub</a>.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/YvEnoacr5qw" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>We hope to see everyone again at JuliaCon 2022, and to show a bit more of our Julia applications.</p>
Glenn Moynihan, Frames Catherine White, Matt Brzezinski, Miha Zgubič, Rory Finnegan, Will Tebbutt
Implementing a scalable multi-output GP model with exact inference
2021-07-30T00:00:00+00:00
https://invenia.github.io/blog/2021/07/30/OILMM-pt3
<p>This is the final post in our series about multi-output Gaussian process (GP) models. In the <a href="/blog/2021/02/19/OILMM-pt1/">first post</a>, we described how to generalise single-output GPs to multi-output GPs (MOGPs). We also introduced the <em>Mixing Model Hierarchy</em> (MMH), as a way to classify and organise a large number of MOGP models from the literature. In the <a href="/blog/2021/03/19/OILMM-pt2/">second post</a>, we discussed the <em>Instantaneous Linear Mixing Model</em> (ILMM), the base model of the MMH, showing how its low-rank assumption can be exploited to speed up inference via simple linear algebra tricks. We used those results to motivate the <a href="http://proceedings.mlr.press/v119/bruinsma20a.html"><em>Orthogonal Instantaneous Linear Mixing Model (OILMM)</em></a>, a version of the ILMM which scales even more favourably, allowing us to model up to tens of millions of points on a regular laptop.</p>
<p>In this post, we give concrete implementation details of an inference algorithm for the OILMM, showing how simple it is to have an efficient implementation. We present some simple code and also links to public implementations, in both Python and in Julia.</p>
<p>We start with a quick recap of the OILMM.</p>
<h2 id="the-orthogonal-instantaneous-linear-mixing-model-oilmm">The Orthogonal Instantaneous Linear Mixing Model (OILMM)</h2>
<p>The <a href="http://proceedings.mlr.press/v119/bruinsma20a.html"><em>Orthogonal Instantaneous Linear Mixing Model (OILMM)</em></a> is a multi-output GP (MOGP) model designed with scalability in mind. It describes the data as a linear combination of <em>latent (unobserved) processes</em>, which are themselves described as independent GPs. Mathematically:</p>
\[\begin{align}
y(t) \sim \mathcal{GP}(Hx(t), \delta_{tt'} \Sigma), \\
x(t) \sim \mathcal{GP}(0, K(t, t')),
\end{align}\]
<p>where \(x(t)\) are the latent processes, \(H\) is the <em>mixing matrix</em>, \(\delta_{tt'}\) is a <a href="https://en.wikipedia.org/wiki/Kronecker_delta">Kronecker delta</a>, and \(\Sigma\) is a matrix that specifies the noise model. The expressions above define a general <em>Instantaneous Linear Mixing Model</em> (ILMM). To obtain an <em>orthogonal</em> ILMM (OILMM), the mixing matrix \(H\) must have orthogonal columns and the noise matrix \(\Sigma\) must be of the form \(\Sigma = \sigma^2I + HDH^\top\), where \(I\) is an identity matrix and \(D > 0\) is diagonal.</p>
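<p>As a small illustration of these two structural requirements, the following NumPy sketch builds a mixing matrix with orthogonal columns and the corresponding noise matrix. It assumes the parametrisation \(H = US^{1/2}\) from the notation table of this post. The dimensions, random values, and variable names are arbitrary choices for the example and do not come from any OILMM implementation.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 10, 3      # number of outputs and of latent processes (example values)
sigma2 = 0.1      # isotropic part of the observation noise

# U: p x m matrix with orthonormal columns, taken from a reduced QR factorisation.
U, _ = np.linalg.qr(rng.standard_normal((p, m)))
S = np.diag(rng.uniform(0.5, 2.0, size=m))  # positive diagonal scaling
D = np.diag(rng.uniform(0.1, 1.0, size=m))  # positive diagonal latent noise

H = U @ np.sqrt(S)                          # mixing matrix H = U S^{1/2}
Sigma = sigma2 * np.eye(p) + H @ D @ H.T    # noise matrix of the required form

# Orthogonal columns mean that H^T H is diagonal (here it equals S).
HtH = H.T @ H
```

<p>Because \(H^\top H\) is diagonal, projecting onto the columns of \(H\) does not mix the latent processes, which is the property the inference procedure relies on.</p>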
<p><em>The key aspect of an OILMM is that it turns a MOGP problem into a set of independent single-output GP problems</em>, which brings a very significant gain in scalability. This independence also allows the OILMM to be trivially combined with other single-output GP techniques, such as <a href="http://proceedings.mlr.press/v5/titsias09a/titsias09a.pdf">sparse GPs</a> or <a href="https://users.aalto.fi/~asolin/sde-book/sde-book.pdf">state-space approximations</a>. If you are curious about how this is possible, we recommend checking out our <a href="/blog/2021/03/19/OILMM-pt2/">previous post</a> for a high-level explanation, or our paper, <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Scalable Exact Inference in Multi-Output Gaussian Processes</a>, for a rigorous discussion. We will now focus on the practical computational aspects of the OILMM.</p>
<h2 id="implementing-the-oilmm">Implementing the OILMM</h2>
<p>Let’s start by showing the algorithm for implementing the OILMM in a general regression/prediction setting. All that is needed is a regular GP package. To illustrate, we’ll show code using <a href="https://github.com/JuliaGaussianProcesses/AbstractGPs.jl">AbstractGPs</a> in Julia, and <a href="https://github.com/wesselb/stheno">Stheno</a><sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> in Python<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>. We choose these packages because they are minimal and the code is almost like pseudo-code, making it straightforward to follow even for people who have never used Julia or Python before.</p>
<p>We discuss the procedure for performing inference, for sampling from the posterior, and for computing the log-likelihood of the data. We will assume that the OILMM has \(p\) outputs, \(n\) timestamps, and \(m\) latent processes.</p>
<h3 id="notation">Notation</h3>
<style type="text/css"> td { vertical-align: top; } </style>
<table>
<thead>
<tr>
<th>Symbol</th>
<th>Type</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>\(U\)</td>
<td>Truncated orthogonal \(p \times m\) matrix</td>
<td>Orthogonal part of the mixing matrix, \(H = US^{1/2}\)</td>
</tr>
<tr>
<td>\(S\)</td>
<td>Positive, diagonal \(m \times m\) matrix</td>
<td>Diagonal part of the mixing matrix, \(H = US^{1/2}\)</td>
</tr>
<tr>
<td>\(\sigma^2\)</td>
<td>Positive real number</td>
<td>Part of the observation noise</td>
</tr>
<tr>
<td>\(D\)</td>
<td>Positive, diagonal \(m \times m\) matrix</td>
<td>Part of the observation noise deriving from the latent processes</td>
</tr>
<tr>
<td>\(Y\)</td>
<td>\(p \times n\) matrix</td>
<td>Matrix of observations</td>
</tr>
</tbody>
</table>
<h3 id="performing-inference">Performing inference</h3>
<p>There are five steps to performing inference in the OILMM framework:</p>
<ol>
<li>Build the projection.</li>
<li>Project the observations to the latent space.</li>
<li>Project the noise to the latent space.</li>
<li>Independently condition each latent process on the projected observations, using the projected noise.</li>
<li>Transform the posterior means and covariances back to the space of observations, using the mixing matrix.</li>
</ol>
<h4 id="step-0-preliminaries">Step 0: preliminaries</h4>
<p>Let’s start with some basic definitions for the example code.</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">AbstractGPs</span>
<span class="k">using</span> <span class="n">LinearAlgebra</span>
<span class="k">using</span> <span class="n">Statistics</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">100</span> <span class="c"># Number of timestamps</span>
<span class="c"># Model specification:</span>
<span class="n">p</span> <span class="o">=</span> <span class="mi">10</span> <span class="c"># Number of outputs</span>
<span class="n">m</span> <span class="o">=</span> <span class="mi">3</span> <span class="c"># Number of latent processes</span>
<span class="n">σ²</span> <span class="o">=</span> <span class="mf">0.1</span> <span class="c"># Observation noise</span>
<span class="n">D</span> <span class="o">=</span> <span class="kt">Diagonal</span><span class="x">(</span><span class="n">rand</span><span class="x">(</span><span class="n">m</span><span class="x">))</span> <span class="c"># Latent noises</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">lab</span> <span class="k">as</span> <span class="n">B</span>
<span class="kn">from</span> <span class="nn">matrix</span> <span class="kn">import</span> <span class="n">Dense</span><span class="p">,</span> <span class="n">Diagonal</span>
<span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">GP</span><span class="p">,</span> <span class="n">Matern52</span>
<span class="n">n</span> <span class="o">=</span> <span class="mi">100</span> <span class="c1"># Number of timestamps
</span>
<span class="c1"># Model specification:
</span><span class="n">p</span> <span class="o">=</span> <span class="mi">10</span> <span class="c1"># Number of outputs
</span><span class="n">m</span> <span class="o">=</span> <span class="mi">3</span> <span class="c1"># Number of latent processes
</span><span class="n">noise</span> <span class="o">=</span> <span class="mf">0.1</span> <span class="c1"># Observation noise
</span><span class="n">d</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="n">m</span><span class="p">)</span> <span class="c1"># Latent noises
</span></code></pre></div></div>
<h4 id="step-1-build-the-projection">Step 1: build the projection</h4>
<p>We know that the original space, where the observations are made, and the latent space are connected via the mixing matrix \(H\). Thus, to project observations to the latent space we need to determine the left inverse of \(H\), which we call \(T\). <em>It is easy to see that \(T = S^{-1/2}U^\top\),</em> since \(U^\top U = I_m\) gives \(TH = S^{-1/2}U^\top U S^{1/2} = I_m\). Notice that \(T\) is the same matrix that defines <a href="/blog/2021/03/19/OILMM-pt2/#an-alternative-formulation">the sufficient statistic we used to motivate the OILMM</a>.</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Sample a random mixing matrix.</span>
<span class="n">U</span><span class="x">,</span> <span class="n">s</span><span class="x">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">svd</span><span class="x">(</span><span class="n">randn</span><span class="x">(</span><span class="n">p</span><span class="x">,</span> <span class="n">m</span><span class="x">))</span>
<span class="n">H</span> <span class="o">=</span> <span class="n">U</span> <span class="o">*</span> <span class="kt">Diagonal</span><span class="x">(</span><span class="n">broadcast</span><span class="x">(</span><span class="n">sqrt</span><span class="x">,</span> <span class="n">s</span><span class="x">))</span>
<span class="c"># Build the projection.</span>
<span class="n">T</span> <span class="o">=</span> <span class="kt">Diagonal</span><span class="x">(</span><span class="n">sqrt</span><span class="o">.</span><span class="x">(</span><span class="n">s</span><span class="x">))</span> <span class="o">\</span> <span class="n">transpose</span><span class="x">(</span><span class="n">U</span><span class="x">)</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Sample a random mixing matrix.
</span><span class="n">U</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">svd</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
<span class="n">U</span><span class="p">,</span> <span class="n">S</span> <span class="o">=</span> <span class="n">Dense</span><span class="p">(</span><span class="n">U</span><span class="p">),</span> <span class="n">Diagonal</span><span class="p">(</span><span class="n">s</span><span class="p">)</span>
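<span class="n">H</span> <span class="o">=</span> <span class="n">U</span> <span class="o">@</span> <span class="n">B</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">S</span><span class="p">)</span>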
<span class="c1"># Build the projection.
</span><span class="n">T</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">S</span><span class="p">))</span> <span class="o">@</span> <span class="n">U</span><span class="p">.</span><span class="n">T</span>
</code></pre></div></div>
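<p>As a quick sanity check of the claim above, the following is a minimal sketch in plain NumPy (not the AbstractGPs or Stheno API) that builds \(H\) and \(T\) from a thin SVD and confirms that \(T\) is a left inverse of \(H\). The seed and variable names are illustrative assumptions.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p, m = 10, 3

# Thin SVD of a random p x m matrix: U has orthonormal columns, s > 0.
U, s, _ = np.linalg.svd(rng.standard_normal((p, m)), full_matrices=False)

H = U @ np.diag(np.sqrt(s))        # Mixing matrix H = U S^{1/2}
T = np.diag(1 / np.sqrt(s)) @ U.T  # Projection T = S^{-1/2} U^T

# T is a left inverse of H (T H = I_m), but not a right inverse when m < p:
# H T = U U^T is only a rank-m projection.
left_ok = np.allclose(T @ H, np.eye(m))
right_ok = np.allclose(H @ T, np.eye(p))
```

<p>Here <code>left_ok</code> holds while <code>right_ok</code> does not, which reflects that projecting to the latent space discards the part of the observations orthogonal to the columns of \(U\).</p>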
<h4 id="step-2-project-the-observations">Step 2: project the observations</h4>
<p>Taking the observations to the latent space is done by left-multiplying by \(T\). Thus, <em>the projected observations can be written as \(Y_{\text{proj}} = TY\).</em> Some intuition for why this takes the observations to the latent space is as follows: if the observations \(Y\) were generated as \(Y = HX\), then \(TY = THX = X\) recovers \(X\). We call \(y^{(i)}_ \text{proj}\) the \(i\)-th row of \(Y_\text{proj}\), corresponding to the observations of the \(i\)-th latent process.</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Sample some noisy data.</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">range</span><span class="x">(</span><span class="mf">0.0</span><span class="x">,</span> <span class="mf">10.0</span><span class="x">;</span> <span class="n">length</span><span class="o">=</span><span class="n">n</span><span class="x">)</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">transpose</span><span class="x">(</span><span class="n">rand</span><span class="x">(</span><span class="n">GP</span><span class="x">(</span><span class="n">Matern52Kernel</span><span class="x">())(</span><span class="n">x</span><span class="x">,</span> <span class="n">σ²</span><span class="x">),</span> <span class="n">p</span><span class="x">))</span> <span class="c"># Generate sample data from some GP.</span>
<span class="c"># Project the observations.</span>
<span class="n">Y_proj</span> <span class="o">=</span> <span class="n">T</span> <span class="o">*</span> <span class="n">Y</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Sample some noisy data.
</span><span class="n">x</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Matern52</span><span class="p">())</span>
<span class="n">Y</span> <span class="o">=</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">noise</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="n">p</span><span class="p">).</span><span class="n">T</span> <span class="c1"># Generate sample data from some GP.
</span>
<span class="c1"># Project the observations.
</span><span class="n">Y_proj</span> <span class="o">=</span> <span class="n">T</span> <span class="o">@</span> <span class="n">Y</span>
</code></pre></div></div>
<h4 id="step-3-project-the-noise">Step 3: project the noise</h4>
<p>We start by noting that \(Y_{\text{proj}} = TY\) means that \(Y_{\text{proj}}\) encodes the noise present in the observations \(Y\). <a href="/blog/2021/03/19/OILMM-pt2/">We know that</a>
\(Ty(t) | x(t), H \sim \mathrm{GP}(THx(t), \delta_{tt'}T \Sigma T^\top)\), with \(\Sigma = \sigma^2 I_p + H D H^\top\) the observation noise, so we must <em>compute \(\Sigma_T = T \Sigma T^\top = \sigma^2 S^{-1} + D\), which we call the projected noise.</em> \(\Sigma_T\) is a diagonal matrix by construction.</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ΣT</span> <span class="o">=</span> <span class="n">repeat</span><span class="x">(</span><span class="n">σ²</span> <span class="o">./</span> <span class="n">s</span> <span class="o">+</span> <span class="n">diag</span><span class="x">(</span><span class="n">D</span><span class="x">),</span> <span class="mi">1</span><span class="x">,</span> <span class="n">n</span><span class="x">)</span> <span class="c"># Repeat the same noise matrix for every timestamp.</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">noise_proj</span> <span class="o">=</span> <span class="n">noise</span> <span class="o">/</span> <span class="n">B</span><span class="p">.</span><span class="n">diag</span><span class="p">(</span><span class="n">S</span><span class="p">)</span> <span class="o">+</span> <span class="n">d</span>
</code></pre></div></div>
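<p>The closed form used in the listings above, \(\sigma^2 / s_i + d_i\), can be checked numerically. The sketch below uses plain NumPy and assumes the observation noise has the form \(\Sigma = \sigma^2 I_p + H D H^\top\); the variable names are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, noise = 10, 3, 0.1

U, s, _ = np.linalg.svd(rng.standard_normal((p, m)), full_matrices=False)
d = rng.uniform(size=m)                 # Latent noises (diagonal of D)
H = U @ np.diag(np.sqrt(s))             # H = U S^{1/2}
T = np.diag(1 / np.sqrt(s)) @ U.T       # T = S^{-1/2} U^T

# Assumed observation noise in the original space: sigma^2 I_p + H D H^T.
Sigma = noise * np.eye(p) + H @ np.diag(d) @ H.T

# Projecting it gives a diagonal m x m matrix with entries sigma^2 / s_i + d_i.
Sigma_T = T @ Sigma @ T.T
closed_form = noise / s + d
```

<p>Because the result is diagonal, only the vector <code>closed_form</code> needs to be stored, which is what both listings above do.</p>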
<h4 id="step-4-condition-latent-processes">Step 4: condition latent processes</h4>
<p>Since \(\Sigma_T\) is diagonal and the latent processes are independent under the prior, the latent processes can be conditioned independently. Thus, the GP corresponding to the \(i\)-th latent process is conditioned on \(y^{(i)}_ \text{proj}\) using the corresponding noise \((\Sigma_T)_{ii}\). Since this only involves dealing with single-output GPs, any available GP package can be used for this. This makes it particularly easy to construct the OILMM with your favourite GP package. Moreover, any scaling technique for single-output GPs can be used here, such as the <a href="http://proceedings.mlr.press/v5/titsias09a/titsias09a.pdf">sparse inducing points technique by Titsias</a>.</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lats</span> <span class="o">=</span> <span class="x">[</span><span class="n">GP</span><span class="x">(</span><span class="n">Matern52Kernel</span><span class="x">())</span> <span class="k">for</span> <span class="n">_</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">]</span> <span class="c"># Noiseless latent processes</span>
<span class="c"># Condition the latent processes.</span>
<span class="n">lats_post</span> <span class="o">=</span> <span class="x">[</span><span class="n">posterior</span><span class="x">(</span><span class="n">lats</span><span class="x">[</span><span class="n">j</span><span class="x">](</span><span class="n">x</span><span class="x">,</span> <span class="n">ΣT</span><span class="x">[</span><span class="n">j</span><span class="x">,</span> <span class="o">:</span><span class="x">]),</span> <span class="n">Y_proj</span><span class="x">[</span><span class="n">j</span><span class="x">,</span> <span class="o">:</span><span class="x">])</span> <span class="k">for</span> <span class="n">j</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">]</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lats</span> <span class="o">=</span> <span class="p">[</span><span class="n">GP</span><span class="p">(</span><span class="n">Matern52</span><span class="p">())</span> <span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">m</span><span class="p">)]</span> <span class="c1"># Noiseless latent processes
</span>
<span class="c1"># Condition the latent processes.
</span><span class="n">lats_post</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span><span class="p">.</span><span class="n">condition</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">ni</span><span class="p">),</span> <span class="n">yi</span><span class="p">)</span> <span class="k">for</span> <span class="n">f</span><span class="p">,</span> <span class="n">ni</span><span class="p">,</span> <span class="n">yi</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">lats</span><span class="p">,</span> <span class="n">noise_proj</span><span class="p">,</span> <span class="n">Y_proj</span><span class="p">)]</span>
</code></pre></div></div>
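<p>To make explicit what a GP package does in this step, here is a minimal NumPy sketch of conditioning each latent process independently. It assumes a unit-variance, unit-length-scale Matérn-5/2 kernel, and the helper names <code>matern52</code> and <code>condition</code> are hypothetical; the random <code>Y_proj</code> and <code>noise_proj</code> are stand-ins for the quantities computed in steps 2 and 3.</p>

```python
import numpy as np

def matern52(x1, x2):
    """Matern-5/2 kernel with unit variance and unit length scale."""
    r = np.abs(x1[:, None] - x2[None, :])
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def condition(x, y, noise, x_new):
    """Posterior mean and marginal variance of one latent GP at x_new."""
    K = matern52(x, x) + noise * np.eye(len(x))
    K_star = matern52(x_new, x)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # K^{-1} y
    V = np.linalg.solve(L, K_star.T)
    mean = K_star @ alpha
    var = 1.0 - np.sum(V**2, axis=0)  # k(x*, x*) = 1 for this kernel
    return mean, var

rng = np.random.default_rng(2)
m, n = 3, 50
x = np.linspace(0, 10, n)
Y_proj = rng.standard_normal((m, n))    # Stand-in for the projected observations
noise_proj = np.array([0.2, 0.5, 1.0])  # Stand-in for the diagonal of Sigma_T

# Each latent process is conditioned on its own row, with its own noise.
posts = [condition(x, yi, ni, x) for yi, ni in zip(Y_proj, noise_proj)]
means = np.stack([mu for mu, _ in posts])    # m x n
variances = np.stack([v for _, v in posts])  # m x n
```

<p>The loop solves \(m\) separate \(n \times n\) systems, which is where the linear scaling in \(m\) comes from.</p>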
<h4 id="step-5-transform-posterior-latent-processes-to-observation-space">Step 5: transform posterior latent processes to observation space</h4>
<p>For the predictive mean of the full MOGP, simply compute the predictive means of each of the posterior latent processes, stack them in an \(m \times n'\) matrix \(\mu\) (with each row corresponding to a different latent process and \(n'\) the number of prediction timestamps), and then multiply by the mixing matrix, obtaining \(\text{predictive mean} = H \mu \in \mathrm{R}^{p \times n'}\).</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Compute the posterior marginal means `M`.</span>
<span class="n">M_latent</span> <span class="o">=</span> <span class="n">vcat</span><span class="x">([</span><span class="n">transpose</span><span class="x">(</span><span class="n">mean</span><span class="x">(</span><span class="n">lats_post</span><span class="x">[</span><span class="n">j</span><span class="x">](</span><span class="n">x</span><span class="x">)))</span> <span class="k">for</span> <span class="n">j</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">]</span><span class="o">...</span><span class="x">)</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">H</span> <span class="o">*</span> <span class="n">M_latent</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute the posterior marginal means `M`.
</span><span class="n">M_latent</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="n">f</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">T</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">lats_post</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">M</span> <span class="o">=</span> <span class="n">H</span> <span class="o">@</span> <span class="n">M_latent</span>
</code></pre></div></div>
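<p>For the predictive marginal variances, repeat the same process as with the predictive means, but stacking the variances \(\nu^{(i)}\) of each latent process instead, obtaining
\begin{equation}
\text{predictive marginal variances} = (H \circ H) \nu,
\end{equation}
with \(\circ\) denoting the element-wise (Hadamard) product. This is the variance of the noiseless outputs; the code below also adds the latent noises \(D\) and the observation noise \(\sigma^2\), which gives the marginal variances of noisy observations.</p>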
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Compute the posterior marginal variances `V`.</span>
<span class="n">V_latent</span> <span class="o">=</span> <span class="n">vcat</span><span class="x">([</span><span class="n">transpose</span><span class="x">(</span><span class="n">var</span><span class="x">(</span><span class="n">lats_post</span><span class="x">[</span><span class="n">j</span><span class="x">](</span><span class="n">x</span><span class="x">)))</span> <span class="k">for</span> <span class="n">j</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">]</span><span class="o">...</span><span class="x">)</span>
<span class="n">V</span> <span class="o">=</span> <span class="n">abs2</span><span class="o">.</span><span class="x">(</span><span class="n">H</span><span class="x">)</span> <span class="o">*</span> <span class="x">(</span><span class="n">V_latent</span> <span class="o">.+</span> <span class="n">D</span><span class="o">.</span><span class="n">diag</span><span class="x">)</span> <span class="o">.+</span> <span class="n">σ²</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Compute the posterior marginal variances `V`.
</span><span class="n">V_latent</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="n">f</span><span class="p">.</span><span class="n">kernel</span><span class="p">.</span><span class="n">elwise</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">T</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">lats_post</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">V</span> <span class="o">=</span> <span class="p">(</span><span class="n">H</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">@</span> <span class="p">(</span><span class="n">V_latent</span> <span class="o">+</span> <span class="n">d</span><span class="p">[:,</span> <span class="bp">None</span><span class="p">])</span> <span class="o">+</span> <span class="n">noise</span>
</code></pre></div></div>
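<p>The Hadamard-product shortcut can be verified against the explicit computation. The NumPy sketch below uses random stand-ins for \(H\) and the latent variances \(\nu\), and compares \((H \circ H)\nu\) with the diagonal of \(H \,\mathrm{Diagonal}(\nu_t)\, H^\top\) at each time.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
p, m, n = 4, 2, 5
H = rng.standard_normal((p, m))
nu = rng.uniform(0.1, 1.0, size=(m, n))  # Latent marginal variances, m x n

# Shortcut from the text: (H ∘ H) nu, a single p x m by m x n product.
V_fast = (H**2) @ nu

# Reference: diagonal of H diag(nu[:, t]) H^T at every time t.
V_ref = np.stack(
    [np.diag(H @ np.diag(nu[:, t]) @ H.T) for t in range(n)], axis=1
)
```

<p>Both arrays are \(p \times n\) and agree, but the shortcut never forms any \(p \times p\) matrix.</p>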
<p>It is also possible to compute full predictive covariance matrices, by observing that for any two given points in time, say \(t_1\) and \(t_2\),</p>
<p>\begin{equation}
(\text{predictive covariance})_{t_1t_2} = H K_{\text{posterior}}(t_1, t_2) H^\top \in \mathrm{R}^{p \times p}
\end{equation}</p>
<p>with</p>
<p>\begin{equation}
K_{\text{posterior}}(t_1, t_2) = \mathrm{Diagonal}([k^{(1)}(t_1, t_2), \cdots, k^{(m)}(t_1, t_2)]) \in \mathrm{R}^{m \times m}
\end{equation}</p>
<p>and \(k^{(i)}\) the posterior kernel of the \(i\)-th latent process. Putting this together takes a bit of careful index-keeping and depends on how you are organising the dimensions (time first or outputs first), but the computation itself is still straightforward.</p>
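<p>One possible way to handle the index-keeping is <code>numpy.einsum</code>. The sketch below assumes an outputs-first ordering and uses random positive semi-definite matrices as stand-ins for the posterior kernel matrices of the latent processes; it is an illustration, not the implementation used in any of the packages.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, n = 4, 2, 3
H = rng.standard_normal((p, m))

# Stand-in posterior kernel matrices, one n x n PSD matrix per latent process.
A = rng.standard_normal((m, n, n))
K_lat = A @ A.transpose(0, 2, 1)

# cov[a, s, b, t] = sum_i H[a, i] * H[b, i] * K_lat[i, s, t]  (outputs first)
cov = np.einsum("ai,bi,ist->asbt", H, H, K_lat)
cov_matrix = cov.reshape(p * n, p * n)

# Check one p x p block against H K_posterior(t1, t2) H^T.
t1, t2 = 0, 2
block = H @ np.diag(K_lat[:, t1, t2]) @ H.T
```

<p>Slicing <code>cov[:, t1, :, t2]</code> recovers the \(p \times p\) block for the pair \((t_1, t_2)\). For a time-first layout, the output indices of the <code>einsum</code> string would be reordered to <code>"satb"</code>.</p>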
<h3 id="sampling-from-the-posterior">Sampling from the posterior</h3>
<p>Drawing samples from the posterior is rather similar to computing the predictive mean in step 5 above. Because the posterior latent processes remain independent from each other, we can sample each of them independently, which is a functionality that any GP package provides. This way, we can stack a single sample from each latent process into a matrix \(\hat X \in \mathrm{R}^{m \times n'}\), and then transform it via the mixing matrix, obtaining \(\text{posterior sample} = H \hat X\).</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Sample from the noiseless posterior.</span>
<span class="n">F_latent</span> <span class="o">=</span> <span class="n">vcat</span><span class="x">([</span><span class="n">transpose</span><span class="x">(</span><span class="n">rand</span><span class="x">(</span><span class="n">lats_post</span><span class="x">[</span><span class="n">j</span><span class="x">](</span><span class="n">x</span><span class="x">)))</span> <span class="k">for</span> <span class="n">j</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">]</span><span class="o">...</span><span class="x">)</span>
<span class="n">F</span> <span class="o">=</span> <span class="n">H</span> <span class="o">*</span> <span class="n">F_latent</span>
<span class="n">F_noisy</span> <span class="o">=</span> <span class="n">F</span> <span class="o">.+</span> <span class="n">sqrt</span><span class="x">(</span><span class="n">σ²</span><span class="x">)</span> <span class="o">.*</span> <span class="n">randn</span><span class="x">(</span><span class="n">size</span><span class="x">(</span><span class="n">F</span><span class="x">))</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Sample from the noiseless posterior.
</span><span class="n">F_latent</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">concat</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">().</span><span class="n">T</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">lats_post</span><span class="p">],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
<span class="n">F</span> <span class="o">=</span> <span class="n">H</span> <span class="o">@</span> <span class="n">F_latent</span>
<span class="n">F_noisy</span> <span class="o">=</span> <span class="n">F</span> <span class="o">+</span> <span class="n">B</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">noise</span><span class="p">)</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="o">*</span><span class="n">B</span><span class="p">.</span><span class="n">shape</span><span class="p">(</span><span class="n">F</span><span class="p">))</span>
</code></pre></div></div>
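<p>The same procedure can be sketched without a GP package. The NumPy example below assumes the posterior mean vector and covariance matrix of each latent process are already available (random stand-ins are used here), draws one sample per latent process via a Cholesky factor, and mixes the samples with \(H\).</p>

```python
import numpy as np

rng = np.random.default_rng(6)
p, m, n = 5, 2, 4
H = rng.standard_normal((p, m))

# Stand-ins for the posterior mean and covariance of each latent process.
means = rng.standard_normal((m, n))
A = rng.standard_normal((m, n, n))
covs = A @ A.transpose(0, 2, 1) + 1e-6 * np.eye(n)

# One independent draw per latent process: mean + chol(cov) @ standard normal.
X_hat = np.stack(
    [mu + np.linalg.cholesky(C) @ rng.standard_normal(n) for mu, C in zip(means, covs)]
)
sample = H @ X_hat  # p x n sample of the noiseless outputs
```

<p>Every noiseless sample has rank at most \(m\), since its columns lie in the column space of \(H\); adding observation noise, as in the listings above, removes that restriction.</p>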
<h3 id="computing-the-log-likelihood-of-the-data">Computing the log-likelihood of the data</h3>
<p>Computing the log-likelihood of data is a three-step process. First, we compute the log-likelihood for each latent process independently. Then, we compute a term that is independent of the latent kernels, which we identify as a regularisation term. Finally, we combine the two terms.</p>
<h4 id="step-1-compute-the-likelihood-under-the-latent-processes">Step 1: compute the likelihood under the latent processes</h4>
<p>First, we must project the data to the latent space, as described in the inference section above, giving us \(Y_{\text{proj}} = TY\). Each row of this matrix will correspond to the time series of a different latent process. Since they are fully independent, all we have to do is compute the log-likelihood of the \(i\)-th row under the \(i\)-th latent process. We call this quantity \(\operatorname{LL}_i\). This can be done using any GP package.</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lml_latents</span> <span class="o">=</span> <span class="x">[</span><span class="n">logpdf</span><span class="x">(</span><span class="n">lats</span><span class="x">[</span><span class="n">j</span><span class="x">](</span><span class="n">x</span><span class="x">,</span> <span class="n">ΣT</span><span class="x">[</span><span class="n">j</span><span class="x">,</span> <span class="o">:</span><span class="x">]),</span> <span class="n">Y_proj</span><span class="x">[</span><span class="n">j</span><span class="x">,</span> <span class="o">:</span><span class="x">])</span> <span class="k">for</span> <span class="n">j</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">]</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lml_latents</span> <span class="o">=</span> <span class="p">[</span><span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">ni</span><span class="p">).</span><span class="n">logpdf</span><span class="p">(</span><span class="n">yi</span><span class="p">)</span> <span class="k">for</span> <span class="n">f</span><span class="p">,</span> <span class="n">ni</span><span class="p">,</span> <span class="n">yi</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">lats</span><span class="p">,</span> <span class="n">noise_proj</span><span class="p">,</span> <span class="n">Y_proj</span><span class="p">)]</span>
</code></pre></div></div>
<h4 id="step-2-add-regularisation-term">Step 2: add regularisation term</h4>
<p>The log-likelihood of the data under an OILMM does not depend solely on the latent processes, as it must account for the effects of the projection step. As we show in <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">our work</a>, the log-likelihood can be written as the sum of two terms. The first one is the log-likelihood of the projected data under the latent processes, which we computed in the step above. The second one is a term that accounts for the loss of information during the projection. Since it prevents the data from being projected to zero, it can be seen as a regularisation term. This term is given by:</p>
\[\begin{equation}
\text{regulariser} = - \frac{n}{2} \log |S| - \frac{n (p-m)}{2} \log 2 \pi \sigma^2 -
\frac{1}{2\sigma^2} (\left \| Y \right \|_F^2 - \left \| U^\top Y \right \|_F^2),
\end{equation}\]
<p>with \(\left \| Y \right \|_F\) the <a href="https://mathworld.wolfram.com/FrobeniusNorm.html">Frobenius norm</a>.</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">regulariser</span> <span class="o">=</span> <span class="o">-</span><span class="x">(</span><span class="n">n</span> <span class="o">*</span> <span class="x">(</span><span class="n">sum</span><span class="x">(</span><span class="n">log</span><span class="x">,</span> <span class="n">s</span><span class="x">)</span> <span class="o">+</span> <span class="x">(</span><span class="n">p</span> <span class="o">-</span> <span class="n">m</span><span class="x">)</span> <span class="o">*</span> <span class="n">log</span><span class="x">(</span><span class="mi">2</span><span class="nb">π</span> <span class="o">*</span> <span class="n">σ²</span><span class="x">))</span> <span class="o">+</span> <span class="x">(</span><span class="n">sum</span><span class="x">(</span><span class="n">abs2</span><span class="x">,</span> <span class="n">Y</span><span class="x">)</span> <span class="o">-</span> <span class="n">sum</span><span class="x">(</span><span class="n">abs2</span><span class="x">,</span> <span class="kt">Diagonal</span><span class="x">(</span><span class="n">sqrt</span><span class="o">.</span><span class="x">(</span><span class="n">s</span><span class="x">))</span> <span class="o">*</span> <span class="n">Y_proj</span><span class="x">))</span> <span class="o">/</span> <span class="n">σ²</span><span class="x">)</span> <span class="o">/</span> <span class="mi">2</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">regulariser</span> <span class="o">=</span> <span class="o">-</span><span class="mf">0.5</span> <span class="o">*</span> <span class="p">(</span>
<span class="n">n</span> <span class="o">*</span> <span class="p">(</span><span class="n">p</span> <span class="o">-</span> <span class="n">m</span><span class="p">)</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">pi</span> <span class="o">*</span> <span class="n">noise</span><span class="p">)</span>
<span class="o">+</span> <span class="n">n</span> <span class="o">*</span> <span class="n">B</span><span class="p">.</span><span class="n">logdet</span><span class="p">(</span><span class="n">S</span><span class="p">)</span>
<span class="o">+</span> <span class="p">(</span><span class="n">B</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">Y</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">-</span> <span class="n">B</span><span class="p">.</span><span class="nb">sum</span><span class="p">((</span><span class="n">B</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">S</span><span class="p">)</span> <span class="o">@</span> <span class="n">Y_proj</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">))</span> <span class="o">/</span> <span class="n">noise</span>
<span class="p">)</span>
</code></pre></div></div>
<h4 id="step-3-combine-both-terms">Step 3: combine both terms</h4>
<p>The log-likelihood of the data is given by the sum of the log-likelihoods under each of the latent processes, \(\operatorname{LL}_i\), as computed in step 1 above, and the regularisation term from the previous step. That is:</p>
<p>\begin{equation}
\log p(Y) = \text{regulariser} + \sum_{i=1}^m \text{LL}_i.
\end{equation}</p>
<p>Julia:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loglik</span> <span class="o">=</span> <span class="n">regulariser</span> <span class="o">+</span> <span class="n">sum</span><span class="x">(</span><span class="n">lml_latents</span><span class="x">)</span>
</code></pre></div></div>
<p>Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">loglik</span> <span class="o">=</span> <span class="n">regulariser</span> <span class="o">+</span> <span class="nb">sum</span><span class="p">(</span><span class="n">lml_latents</span><span class="p">)</span>
</code></pre></div></div>
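<p>The decomposition can be checked against a brute-force computation on a small problem. The NumPy sketch below assumes Matérn-5/2 latent kernels with arbitrary length scales and the noise model \(\sigma^2 I_p + H D H^\top\); the helpers <code>matern52</code> and <code>gauss_logpdf</code> are hypothetical names. It compares the OILMM log-likelihood with the log-density of one dense \(pn \times pn\) Gaussian.</p>

```python
import numpy as np

def matern52(x, ell):
    """Matern-5/2 kernel matrix with unit variance and length scale ell."""
    r = np.abs(x[:, None] - x[None, :]) / ell
    return (1 + np.sqrt(5) * r + 5 * r**2 / 3) * np.exp(-np.sqrt(5) * r)

def gauss_logpdf(y, C):
    """Log-density of N(0, C) at the vector y."""
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, y)
    return -0.5 * (len(y) * np.log(2 * np.pi) + z @ z) - np.sum(np.log(np.diag(L)))

rng = np.random.default_rng(5)
p, m, n, noise = 4, 2, 6, 0.1
x = np.linspace(0, 5, n)
U, s, _ = np.linalg.svd(rng.standard_normal((p, m)), full_matrices=False)
d = rng.uniform(0.1, 0.5, size=m)
H = U @ np.diag(np.sqrt(s))
T = np.diag(1 / np.sqrt(s)) @ U.T
Ks = [matern52(x, ell) for ell in (0.5, 2.0)]  # One kernel matrix per latent process
Y = rng.standard_normal((p, n))
Y_proj = T @ Y

# OILMM route: m independent n x n problems plus the regulariser.
lml_latents = [
    gauss_logpdf(Y_proj[i], Ks[i] + (noise / s[i] + d[i]) * np.eye(n)) for i in range(m)
]
regulariser = -0.5 * (
    n * np.sum(np.log(s))
    + n * (p - m) * np.log(2 * np.pi * noise)
    + (np.sum(Y**2) - np.sum((U.T @ Y) ** 2)) / noise
)
loglik = regulariser + sum(lml_latents)

# Naive route: one dense (p n) x (p n) Gaussian over the row-major flattening of Y.
C = noise * np.eye(p * n)
for i in range(m):
    C += np.kron(np.outer(H[:, i], H[:, i]), Ks[i] + d[i] * np.eye(n))
loglik_dense = gauss_logpdf(Y.reshape(-1), C)
```

<p>The two values agree up to floating-point error, although the first route only ever factorises \(n \times n\) matrices.</p>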
<h3 id="summary-of-section">Summary of section</h3>
<p>We highlight that all of the implementation steps above are quite simple to perform using any available GP package, even if it has no MOGP implementation. We consider the simplicity of the implementation one of the strengths of this method. However, if you are interested in dedicated implementations of the method, which work off-the-shelf, we <a href="#open-implementations">list below</a> the available packages for both Python and Julia.</p>
<h2 id="time-and-memory-complexities">Time and memory complexities</h2>
<p>Scalability is one of the key strengths of the OILMM, so it is relevant to discuss the time and memory complexities involved in utilising the method. We’ll consider the case of \(p\) outputs, observed at \(n\) timestamps each, and an OILMM with \(m\) latent processes.</p>
<p>In realistic cases we typically have \(n > p > m\), meaning that costs in both time and memory will be dominated by the storage and inversion of the covariance matrix. We have discussed in detail how we arrive at these costs in our <a href="/blog/2021/02/19/OILMM-pt1/">two previous</a> <a href="/blog/2021/03/19/OILMM-pt2/">posts in the series</a>. In the table below we summarise the results, using the general case for MOGPs and the ILMM as references. The linear scaling in \(m\) presented by the OILMM is a direct consequence of inference being done independently for each latent process, as described in <a href="#step-4-condition-latent-processes">step 4</a> of the “Performing Inference” section above.</p>
<p>Table 1: Time and memory scaling for storing and inverting the covariance matrix under the general MOGPs, the ILMM, and the OILMM.</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Time</th>
<th>Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>General MOGP</td>
<td>\(\mathcal{O}(n^3p^3)\)</td>
<td>\(\mathcal{O}(n^2p^2)\)</td>
</tr>
<tr>
<td>ILMM</td>
<td>\(\mathcal{O}(n^3m^3)\)</td>
<td>\(\mathcal{O}(n^2m^2)\)</td>
</tr>
<tr>
<td>OILMM</td>
<td>\(\mathcal{O}(n^3m)\)</td>
<td>\(\mathcal{O}(n^2m)\)</td>
</tr>
</tbody>
</table>
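<p>To get a feel for Table 1, the scalings can be evaluated for concrete sizes. The sketch below is only illustrative: the function name <code class="language-plaintext highlighter-rouge">covariance_costs</code> and the chosen values of \(n\), \(p\) and \(m\) are assumptions made for this example, and constant factors are dropped, so only the ratios between models carry meaning.</p>

```python
# Leading-order cost models taken from Table 1.
# Constant factors are dropped, so only ratios between models are meaningful.
def covariance_costs(n, p, m):
    """Return {model: (time, memory)} using the scalings in Table 1."""
    return {
        "General MOGP": ((n * p) ** 3, (n * p) ** 2),
        "ILMM": ((n * m) ** 3, (n * m) ** 2),
        "OILMM": (n ** 3 * m, n ** 2 * m),
    }


# Illustrative sizes only (assumed for this sketch), chosen so that n > p > m.
n, p, m = 1_000, 50, 5
costs = covariance_costs(n, p, m)
oilmm_time, oilmm_memory = costs["OILMM"]

# Report every model relative to the OILMM.
for model, (time, memory) in costs.items():
    time_ratio = time // oilmm_time
    memory_ratio = memory // oilmm_memory
    print(f"{model:12s} time x{time_ratio:,d}   memory x{memory_ratio:,d}")
```

<p>With these sizes the general MOGP costs \(p^3 / m = 25{,}000\) times more time than the OILMM, and the ILMM costs \(m^2 = 25\) times more, which follows directly from dividing the entries of Table 1.</p>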
<p>For cases in which \(n\) is so large that even the favourable scaling of the OILMM is prohibitive, it is straightforward to combine the OILMM with scaling techniques designed for single-output GPs, such as the one developed by <a href="http://proceedings.mlr.press/v5/titsias09a/titsias09a.pdf">Titsias</a> or the one developed by <a href="https://ieeexplore.ieee.org/abstract/document/5589113">Hartikainen and Särkkä</a>. Focusing only on the implementation aspects, the addition of these scaling techniques only affects <a href="#step-4-condition-latent-processes">step 4</a> of the “Performing Inference” section and <a href="#step-1-compute-the-likelihood-under-the-latent-processes">step 1</a> of the “Computing the log-likelihood of the data” section above, in which the independent single-output GPs should be treated according to the chosen approximation. In the table below we present the scaling costs for the combination of the OILMM with these two methods, assuming \(r\) inducing points in Titsias’ method, and dimensionality \(d\) in Hartikainen and Särkkä’s method. Typically \(r \ll n\) and \(d \ll n, m\). In <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma et al.</a>, we used Hartikainen and Särkkä’s method and a separable spatio-temporal kernel to efficiently train the OILMM over several million data points.</p>
<p>Table 2: Time and memory scaling for performing inference under the OILMM combined with single-output GP scaling techniques.</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Time</th>
<th>Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>OILMM + Titsias</td>
<td>\(\mathcal{O}(nmr^2)\)</td>
<td>\(\mathcal{O}(nmr)\)</td>
</tr>
<tr>
<td>OILMM + Hartikainen and Särkkä</td>
<td>\(\mathcal{O}(nmd^3)\)</td>
<td>\(\mathcal{O}(nmd^2)\)</td>
</tr>
</tbody>
</table>
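<p>The practical consequence of Table 2 is that both approximations scale linearly in \(n\), whereas the exact OILMM scales cubically in time and quadratically in memory. A minimal sketch of this, using only the scalings stated in the tables, is given below. The function name <code class="language-plaintext highlighter-rouge">oilmm_costs</code> and the values of \(m\), \(r\) and \(d\) are assumptions chosen for illustration.</p>

```python
# Leading-order (time, memory) scalings for exact and approximate OILMM inference.
# The exact row comes from Table 1, the other two rows from Table 2.
def oilmm_costs(n, m, r, d):
    return {
        "OILMM (exact)": (n ** 3 * m, n ** 2 * m),
        "OILMM + Titsias": (n * m * r ** 2, n * m * r),
        "OILMM + Hartikainen and Särkkä": (n * m * d ** 3, n * m * d ** 2),
    }


# Illustrative values only (assumed for this sketch).
m, r, d = 5, 200, 3
small = oilmm_costs(1_000_000, m, r, d)
large = oilmm_costs(2_000_000, m, r, d)

# How much more expensive does each variant become when n doubles?
for model in small:
    time_growth = large[model][0] // small[model][0]
    memory_growth = large[model][1] // small[model][1]
    print(f"{model}: time x{time_growth}, memory x{memory_growth} when n doubles")
```

<p>Doubling \(n\) multiplies the exact time cost by eight and the memory cost by four, while both approximations merely double in time and memory.</p>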
<p>If we are rigorous, there are other costs involved in applying the OILMM, such as the cost of storing the data in memory, <a href="#step-1-build-the-projection">building the matrix T</a> to project the data into the latent space, <a href="#step-2-project-the-observations">performing this projection</a>, and <a href="#step-5-transform-posterior-latent-processes-to-observation-space">building the predictive marginal means and variances</a>. Usually, these costs are largely dominated by the costs shown in Table 1 above, and they become relevant only when the number of timestamps \(n\) becomes comparable with the number of latent processes \(m\), which is not a common setting (that would likely be too little data to efficiently train the model). We summarise these costs in the table below.</p>
<p>Table 3: Time and memory scaling for performing secondary tasks under the OILMM.</p>
<table>
<thead>
<tr>
<th>Task</th>
<th>Time</th>
<th>Memory</th>
</tr>
</thead>
<tbody>
<tr>
<td>Storing data</td>
<td>—</td>
<td>\(\mathcal{O}(np)\)</td>
</tr>
<tr>
<td>Building matrix T</td>
<td>\(\mathcal{O}(m^2p)\)</td>
<td>\(\mathcal{O}(mp)\)</td>
</tr>
<tr>
<td>Projecting data</td>
<td>\(\mathcal{O}(nmp)\)</td>
<td>\(\mathcal{O}(np)\)</td>
</tr>
<tr>
<td>Building marginal statistics</td>
<td>\(\mathcal{O}(nmp)\)</td>
<td>\(\mathcal{O}(np)\)</td>
</tr>
</tbody>
</table>
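<p>The claim that these secondary costs only matter when \(n\) is comparable to \(m\) can be checked by comparing the time scalings of Table 3 with the dominant \(\mathcal{O}(n^3m)\) term of Table 1. The following sketch does that. The helper names and the sizes are hypothetical, and constant factors are again ignored.</p>

```python
def secondary_time(n, p, m):
    """Sum of the time scalings listed in Table 3."""
    build_t = m ** 2 * p      # building the matrix T
    project = n * m * p       # projecting the data
    marginals = n * m * p     # building marginal statistics
    return build_t + project + marginals


def dominant_time(n, m):
    """Time scaling of the OILMM covariance inversion from Table 1."""
    return n ** 3 * m


# Illustrative sizes only (assumed for this sketch).
p, m = 50, 5
for n in (10, 100, 10_000):
    share = secondary_time(n, p, m) / dominant_time(n, m)
    print(f"n = {n:>6,d}: secondary / dominant = {share:.2e}")
```

<p>For \(n = 10\), which is comparable to \(m = 5\), the secondary tasks cost about as much as the inversion itself. By \(n = 10{,}000\) they are roughly a million times cheaper.</p>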
<h2 id="conclusion">Conclusion</h2>
<p>With this post we conclude a three-part series on multi-output Gaussian process models, with emphasis on the OILMM.</p>
<p>In the <a href="/blog/2021/02/19/OILMM-pt1/">first part of the series</a> we presented a very brief introduction to MOGPs, arguing that they can be viewed simply as single-output GPs acting over an extended input space. In that post we also introduced the <em>Mixing Model Hierarchy</em>, which attempts to organise a large number of MOGP models from the literature using a simple and widespread ILMM as reference.</p>
<p>In the <a href="/blog/2021/03/19/OILMM-pt2/">second post</a> we delved a bit deeper into the ILMM, discussing the mathematical tricks that make it more scalable. We used one of these tricks to motivate and present the OILMM, which improves on the scalability of the ILMM.</p>
<p>In this post we learned how to efficiently implement the OILMM in practice, and shared some of our implementations in both Julia and Python.</p>
<p>We hope these posts have served to highlight some of the interesting properties of MOGPs, and might serve as a general (albeit not too technical) introduction to the OILMM. Below we offer some open implementations of the OILMM we have made in Python and in Julia.</p>
<h2 id="open-implementations">Open implementations</h2>
<p>We hope that this post shows how simple it is to implement the OILMM, and can serve as a reference for implementing the model in any language. We also offer open implementations in both Julia and Python, which should make the OILMM readily accessible to everyone. Both implementations are based on the GP package Stheno (about which we already talked in <a href="/blog/2021/01/19/linear-models-with-stheno-and-jax/">our post about linear Gaussian process models using Jax</a>), and present very similar APIs, adapted to the particularities of each language. Below we briefly show a simple application with each of the implementations.</p>
<h3 id="julia">Julia</h3>
<p>This implementation can be found in <a href="https://github.com/willtebbutt/OILMMs.jl">OILMMs.jl</a>.</p>
<p>Without learning:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">AbstractGPs</span>
<span class="k">using</span> <span class="n">LinearAlgebra</span>
<span class="k">using</span> <span class="n">OILMMs</span>
<span class="k">using</span> <span class="n">Random</span>
<span class="k">using</span> <span class="n">TemporalGPs</span>
<span class="c"># Specify and construct an OILMM.</span>
<span class="n">p</span> <span class="o">=</span> <span class="mi">10</span>
<span class="n">m</span> <span class="o">=</span> <span class="mi">3</span>
<span class="n">U</span><span class="x">,</span> <span class="n">s</span><span class="x">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">svd</span><span class="x">(</span><span class="n">randn</span><span class="x">(</span><span class="n">p</span><span class="x">,</span> <span class="n">m</span><span class="x">))</span>
<span class="n">σ²</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">OILMM</span><span class="x">(</span>
<span class="x">[</span><span class="n">to_sde</span><span class="x">(</span><span class="n">GP</span><span class="x">(</span><span class="n">Matern52Kernel</span><span class="x">()),</span> <span class="n">SArrayStorage</span><span class="x">(</span><span class="kt">Float64</span><span class="x">))</span> <span class="k">for</span> <span class="n">_</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">],</span>
<span class="n">U</span><span class="x">,</span>
<span class="kt">Diagonal</span><span class="x">(</span><span class="n">s</span><span class="x">),</span>
<span class="kt">Diagonal</span><span class="x">(</span><span class="n">rand</span><span class="x">(</span><span class="n">m</span><span class="x">)</span> <span class="o">.+</span> <span class="mf">0.1</span><span class="x">),</span>
<span class="x">);</span>
<span class="c"># Sample from the model. LARGE DATA SET!</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">MOInput</span><span class="x">(</span><span class="n">RegularSpacing</span><span class="x">(</span><span class="mf">0.0</span><span class="x">,</span> <span class="mf">1.0</span><span class="x">,</span> <span class="mi">1_000_000</span><span class="x">),</span> <span class="n">p</span><span class="x">);</span>
<span class="n">fx</span> <span class="o">=</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">σ²</span><span class="x">);</span>
<span class="n">rng</span> <span class="o">=</span> <span class="kt">MersenneTwister</span><span class="x">(</span><span class="mi">123456</span><span class="x">);</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">rand</span><span class="x">(</span><span class="n">rng</span><span class="x">,</span> <span class="n">fx</span><span class="x">);</span>
<span class="c"># Compute the logpdf of the data under the model.</span>
<span class="n">logpdf</span><span class="x">(</span><span class="n">fx</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span>
<span class="c"># Perform posterior inference. This produces another OILMM.</span>
<span class="n">f_post</span> <span class="o">=</span> <span class="n">posterior</span><span class="x">(</span><span class="n">fx</span><span class="x">,</span> <span class="n">y</span><span class="x">)</span>
<span class="c"># Compute the posterior marginals.</span>
<span class="c"># We can also use `rand` and `logpdf` as before.</span>
<span class="n">post_marginals</span> <span class="o">=</span> <span class="n">marginals</span><span class="x">(</span><span class="n">f_post</span><span class="x">(</span><span class="n">x</span><span class="x">));</span>
</code></pre></div></div>
<p>With learning:</p>
<div class="language-julia highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">using</span> <span class="n">AbstractGPs</span>
<span class="k">using</span> <span class="n">OILMMs</span>
<span class="k">using</span> <span class="n">TemporalGPs</span>
<span class="c"># Load standard packages from the Julia ecosystem</span>
<span class="k">using</span> <span class="n">LinearAlgebra</span>
<span class="k">using</span> <span class="n">Optim</span> <span class="c"># Standard optimisation algorithms.</span>
<span class="k">using</span> <span class="n">ParameterHandling</span> <span class="c"># Helper functionality for dealing with model parameters.</span>
<span class="k">using</span> <span class="n">Random</span>
<span class="k">using</span> <span class="n">Zygote</span> <span class="c"># Algorithmic Differentiation</span>
<span class="c"># Specify OILMM parameters as a NamedTuple.</span>
<span class="c"># Utilise orthogonal and positive from ParameterHandling.jl to constrain appropriately.</span>
<span class="n">p</span> <span class="o">=</span> <span class="mi">2</span>
<span class="n">m</span> <span class="o">=</span> <span class="mi">1</span>
<span class="n">θ_init</span> <span class="o">=</span> <span class="x">(</span>
<span class="n">U</span> <span class="o">=</span> <span class="n">orthogonal</span><span class="x">(</span><span class="n">randn</span><span class="x">(</span><span class="n">p</span><span class="x">,</span> <span class="n">m</span><span class="x">)),</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">positive</span><span class="o">.</span><span class="x">(</span><span class="n">rand</span><span class="x">(</span><span class="n">m</span><span class="x">)</span> <span class="o">.+</span> <span class="mf">0.1</span><span class="x">),</span>
<span class="n">σ²</span> <span class="o">=</span> <span class="n">positive</span><span class="x">(</span><span class="mf">0.1</span><span class="x">),</span>
<span class="x">)</span>
<span class="c"># Define a function which builds an OILMM, given a NamedTuple of parameters.</span>
<span class="k">function</span><span class="nf"> build_oilmm</span><span class="x">(</span><span class="n">θ</span><span class="o">::</span><span class="kt">NamedTuple</span><span class="x">)</span>
<span class="k">return</span> <span class="n">OILMM</span><span class="x">(</span>
<span class="c"># Here we adopt a state-space approximation for better</span>
<span class="c"># scalability. We could have instead chosen to use regular</span>
<span class="c"># GPs, for instance, `GP(SEKernel())`, without the call to</span>
<span class="c"># `to_sde`.</span>
<span class="x">[</span><span class="n">to_sde</span><span class="x">(</span><span class="n">GP</span><span class="x">(</span><span class="n">Matern52Kernel</span><span class="x">()),</span> <span class="n">SArrayStorage</span><span class="x">(</span><span class="kt">Float64</span><span class="x">))</span> <span class="k">for</span> <span class="n">_</span> <span class="k">in</span> <span class="mi">1</span><span class="o">:</span><span class="n">m</span><span class="x">],</span>
<span class="n">θ</span><span class="o">.</span><span class="n">U</span><span class="x">,</span>
<span class="kt">Diagonal</span><span class="x">(</span><span class="n">θ</span><span class="o">.</span><span class="n">s</span><span class="x">),</span>
<span class="kt">Diagonal</span><span class="x">(</span><span class="n">zeros</span><span class="x">(</span><span class="n">m</span><span class="x">)),</span>
<span class="x">)</span>
<span class="k">end</span>
<span class="c"># Generate some synthetic data to train on.</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">build_oilmm</span><span class="x">(</span><span class="n">ParameterHandling</span><span class="o">.</span><span class="n">value</span><span class="x">(</span><span class="n">θ_init</span><span class="x">));</span>
<span class="kd">const</span> <span class="n">x</span> <span class="o">=</span> <span class="n">MOInput</span><span class="x">(</span><span class="n">RegularSpacing</span><span class="x">(</span><span class="mf">0.0</span><span class="x">,</span> <span class="mf">0.01</span><span class="x">,</span> <span class="mi">1_000_000</span><span class="x">),</span> <span class="n">p</span><span class="x">);</span>
<span class="n">fx</span> <span class="o">=</span> <span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="mf">0.1</span><span class="x">);</span>
<span class="n">rng</span> <span class="o">=</span> <span class="kt">MersenneTwister</span><span class="x">(</span><span class="mi">123456</span><span class="x">);</span>
<span class="kd">const</span> <span class="n">y</span> <span class="o">=</span> <span class="n">rand</span><span class="x">(</span><span class="n">rng</span><span class="x">,</span> <span class="n">fx</span><span class="x">);</span>
<span class="c"># Define a function which computes the negative log marginal likelihood given the parameters.</span>
<span class="k">function</span><span class="nf"> objective</span><span class="x">(</span><span class="n">θ</span><span class="o">::</span><span class="kt">NamedTuple</span><span class="x">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">build_oilmm</span><span class="x">(</span><span class="n">θ</span><span class="x">)</span>
<span class="k">return</span> <span class="o">-</span><span class="n">logpdf</span><span class="x">(</span><span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">θ</span><span class="o">.</span><span class="n">σ²</span><span class="x">),</span> <span class="n">y</span><span class="x">)</span>
<span class="k">end</span>
<span class="c"># Build a version of the objective function which can be used with Optim.jl.</span>
<span class="n">θ_init_flat</span><span class="x">,</span> <span class="n">unflatten</span> <span class="o">=</span> <span class="n">flatten</span><span class="x">(</span><span class="n">θ_init</span><span class="x">);</span>
<span class="n">unpack</span><span class="x">(</span><span class="n">θ</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="o">&lt;:</span><span class="kt">Real</span><span class="x">})</span> <span class="o">=</span> <span class="n">ParameterHandling</span><span class="o">.</span><span class="n">value</span><span class="x">(</span><span class="n">unflatten</span><span class="x">(</span><span class="n">θ</span><span class="x">))</span>
<span class="n">objective</span><span class="x">(</span><span class="n">θ</span><span class="o">::</span><span class="kt">Vector</span><span class="x">{</span><span class="o">&lt;:</span><span class="kt">Real</span><span class="x">})</span> <span class="o">=</span> <span class="n">objective</span><span class="x">(</span><span class="n">unpack</span><span class="x">(</span><span class="n">θ</span><span class="x">))</span>
<span class="c"># Utilise Optim.jl + Zygote.jl to optimise the model parameters.</span>
<span class="n">training_results</span> <span class="o">=</span> <span class="n">Optim</span><span class="o">.</span><span class="n">optimize</span><span class="x">(</span>
<span class="n">objective</span><span class="x">,</span>
<span class="n">θ</span> <span class="o">-></span> <span class="n">only</span><span class="x">(</span><span class="n">Zygote</span><span class="o">.</span><span class="n">gradient</span><span class="x">(</span><span class="n">objective</span><span class="x">,</span> <span class="n">θ</span><span class="x">)),</span>
<span class="n">θ_init_flat</span> <span class="o">+</span> <span class="n">randn</span><span class="x">(</span><span class="n">length</span><span class="x">(</span><span class="n">θ_init_flat</span><span class="x">)),</span> <span class="c"># Add some noise to make learning non-trivial</span>
<span class="n">BFGS</span><span class="x">(</span>
<span class="n">alphaguess</span> <span class="o">=</span> <span class="n">Optim</span><span class="o">.</span><span class="n">LineSearches</span><span class="o">.</span><span class="n">InitialStatic</span><span class="x">(</span><span class="n">scaled</span><span class="o">=</span><span class="nb">true</span><span class="x">),</span>
<span class="n">linesearch</span> <span class="o">=</span> <span class="n">Optim</span><span class="o">.</span><span class="n">LineSearches</span><span class="o">.</span><span class="n">BackTracking</span><span class="x">(),</span>
<span class="x">),</span>
<span class="n">Optim</span><span class="o">.</span><span class="n">Options</span><span class="x">(</span><span class="n">show_trace</span> <span class="o">=</span> <span class="nb">true</span><span class="x">);</span>
<span class="n">inplace</span><span class="o">=</span><span class="nb">false</span><span class="x">,</span>
<span class="x">)</span>
<span class="c"># Compute posterior marginals at optimal solution.</span>
<span class="n">θ_opt</span> <span class="o">=</span> <span class="n">unpack</span><span class="x">(</span><span class="n">training_results</span><span class="o">.</span><span class="n">minimizer</span><span class="x">)</span>
<span class="n">f</span> <span class="o">=</span> <span class="n">build_oilmm</span><span class="x">(</span><span class="n">θ_opt</span><span class="x">)</span>
<span class="n">f_post</span> <span class="o">=</span> <span class="n">posterior</span><span class="x">(</span><span class="n">f</span><span class="x">(</span><span class="n">x</span><span class="x">,</span> <span class="n">θ_opt</span><span class="o">.</span><span class="n">σ²</span><span class="x">),</span> <span class="n">y</span><span class="x">)</span>
<span class="n">fx</span> <span class="o">=</span> <span class="n">marginals</span><span class="x">(</span><span class="n">f_post</span><span class="x">(</span><span class="n">x</span><span class="x">))</span>
</code></pre></div></div>
<h3 id="python">Python</h3>
<p>The dependencies for this implementation can be installed via a call to <code class="language-plaintext highlighter-rouge">pip install oilmm jax jaxlib</code> in the command line.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">jax.numpy</span> <span class="k">as</span> <span class="n">jnp</span>
<span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">EQ</span><span class="p">,</span> <span class="n">GP</span>
<span class="kn">from</span> <span class="nn">oilmm.jax</span> <span class="kn">import</span> <span class="n">OILMM</span>
<span class="k">def</span> <span class="nf">build_latent_processes</span><span class="p">(</span><span class="n">params</span><span class="p">):</span>
<span class="c1"># Return models for latent processes, which are noise-contaminated GPs.
</span> <span class="k">return</span> <span class="p">[</span>
<span class="p">(</span>
<span class="c1"># Create GPs with learnable variances initialised to one and
</span> <span class="c1"># learnable length scales, also initialised to one.
</span> <span class="n">p</span><span class="p">.</span><span class="n">variance</span><span class="p">.</span><span class="n">positive</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">GP</span><span class="p">(</span><span class="n">EQ</span><span class="p">().</span><span class="n">stretch</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">length_scale</span><span class="p">.</span><span class="n">positive</span><span class="p">(</span><span class="mi">1</span><span class="p">))),</span>
<span class="c1"># Use learnable noise variances, initialised to `1e-2`.
</span> <span class="n">p</span><span class="p">.</span><span class="n">noise</span><span class="p">.</span><span class="n">positive</span><span class="p">(</span><span class="mf">1e-2</span><span class="p">),</span>
<span class="p">)</span>
<span class="k">for</span> <span class="n">p</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">params</span><span class="p">,</span> <span class="nb">range</span><span class="p">(</span><span class="mi">3</span><span class="p">))</span>
<span class="p">]</span>
<span class="c1"># Construct model.
</span><span class="n">prior</span> <span class="o">=</span> <span class="n">OILMM</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span> <span class="n">build_latent_processes</span><span class="p">,</span> <span class="n">num_outputs</span><span class="o">=</span><span class="mi">6</span><span class="p">)</span>
<span class="c1"># Create some sample data.
</span><span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">prior</span><span class="p">.</span><span class="n">sample</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># Fit OILMM.
</span><span class="n">prior</span><span class="p">.</span><span class="n">fit</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">trace</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">jit</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">prior</span><span class="p">.</span><span class="n">vs</span><span class="p">.</span><span class="k">print</span><span class="p">()</span> <span class="c1"># Print all learned parameters.
</span>
<span class="c1"># Make predictions.
</span><span class="n">posterior</span> <span class="o">=</span> <span class="n">prior</span><span class="p">.</span><span class="n">condition</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">mean</span><span class="p">,</span> <span class="n">var</span> <span class="o">=</span> <span class="n">posterior</span><span class="p">.</span><span class="n">predict</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">lower</span> <span class="o">=</span> <span class="n">mean</span> <span class="o">-</span> <span class="mf">1.96</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var</span><span class="p">)</span>
<span class="n">upper</span> <span class="o">=</span> <span class="n">mean</span> <span class="o">+</span> <span class="mf">1.96</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="n">var</span><span class="p">)</span>
</code></pre></div></div>
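<p>The factor <code class="language-plaintext highlighter-rouge">1.96</code> in the last two lines gives a central 95% interval, assuming that <code class="language-plaintext highlighter-rouge">mean</code> and <code class="language-plaintext highlighter-rouge">var</code> describe Gaussian predictive marginals. The standard-library sketch below shows where the number comes from and how other levels could be obtained. The helper <code class="language-plaintext highlighter-rouge">central_interval</code> and the numerical inputs are hypothetical and not part of the <code class="language-plaintext highlighter-rouge">oilmm</code> package.</p>

```python
from statistics import NormalDist


def central_interval(mean, var, level=0.95):
    """Central interval of a Gaussian with the given mean and variance."""
    # Quantile of the standard normal that leaves (1 - level) / 2 in each tail.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * var ** 0.5
    return mean - half_width, mean + half_width


# The 97.5% quantile of the standard normal is the familiar 1.96.
z_95 = NormalDist().inv_cdf(0.975)
print(round(z_95, 2))

# Made-up predictive mean and variance for a single output at a single input.
lower, upper = central_interval(mean=0.3, var=0.04)
print(lower, upper)
```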
<h2 id="notes">Notes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>To install Stheno, simply run <code class="language-plaintext highlighter-rouge">pip install stheno</code> from the command line. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>We also use the Python package <a href="https://github.com/wesselb/lab">LAB</a> to make our code agnostic to the linear algebra backend. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Eric Perim, Wessel Bruinsma, and Will Tebbutt
This is the final post in our series about multi-output Gaussian process (GP) models. In the first post, we described how to generalise single-output GPs to multi-output GPs (MOGPs). We also introduced the Mixing Model Hierarchy (MMH), as a way to classify and organise a large number of MOGP models from the literature. In the second post, we discussed the Instantaneous Linear Mixing Model (ILMM), the base model of the MMH, showing how its low-rank assumption can be exploited to speed up inference via simple linear algebra tricks. We used those results to motivate the Orthogonal Instantaneous Linear Mixing Model (OILMM), a version of the ILMM which scales even more favourably, allowing us to model up to tens of millions of points on a regular laptop.
A Gentle Introduction to Optimal Power Flow
2021-06-18T00:00:00+00:00
2021-06-18T00:00:00+00:00
https://invenia.github.io/blog/2021/06/18/opf-intro
<p>In an earlier blog post, we discussed the <a href="/blog/2020/12/04/pf-intro/">power flow problem</a>, which serves as the key component of a much more challenging task: the optimal power flow (OPF).
OPF is an umbrella term that covers a wide range of constrained optimization problems, the most important ingredients of which are: <em>variables</em> that optimize an <em>objective function</em>, some <em>equality constraints</em>, including the power balance and power flow equations, and <em>inequality constraints</em>, including bounds on the variables.
The sets of variables and constraints, as well as the form of the objective, will vary depending on the type of OPF.</p>
<p>We first derive the most widely used OPF variant, the economic dispatch problem, which aims to find a minimum cost solution of power demand and supply in the electricity grid.
Then, we will briefly review other basic types of OPF.
We will see that, unlike the power flow problem, OPF is generally rather expensive to solve.
Therefore, we will discuss approximations (known as relaxations) that simplify the original model, as well as computational intelligence and machine learning techniques that try to reduce computational costs.</p>
<h2 id="the-power-flow-problem">The power flow problem</h2>
<p>In our previous <a href="/blog/2020/12/04/pf-intro/">blog post</a>, we introduced the concept of the power flow problem and derived the equations for a <a href="https://invenia.github.io/blog/2020/12/04/pf-intro/#power-flow-models">simple</a> and for an <a href="https://invenia.github.io/blog/2020/12/04/pf-intro/#towards-more-realistic-power-flow-models">improved</a> model.
Because OPF relies on the power flow problem, we present here the final equations of the improved power flow model.</p>
<p>Let \(\mathcal{N}\) denote the set of buses (represented by nodes in the corresponding graph of the power grid), and \(\mathcal{E}, \mathcal{E}^{R} \subseteq \mathcal{N} \times \mathcal{N}\) the sets of forward and reverse orientations of transmission lines or branches (represented by directed edges in the corresponding graph of the power grid).
Further, let \(\mathcal{G}_{i}\), \(\mathcal{L}_{i}\) and \(\mathcal{S}_{i}\) denote the sets of generators, loads and shunt elements that belong to bus \(i\).
Similarly, the sets of <em>all</em> generators, loads and shunt elements are designated by \(\mathcal{G} = \bigcup \limits_{i \in \mathcal{N}} \mathcal{G}_{i}\), \(\mathcal{L} = \bigcup \limits_{i \in \mathcal{N}} \mathcal{L}_{i}\) and \(\mathcal{S} = \bigcup \limits_{i \in \mathcal{N}} \mathcal{S}_{i}\), respectively.</p>
<p>Power flow models explicitly describe the power balance at each bus, based on <a href="https://en.wikipedia.org/wiki/Tellegen%27s_theorem">Tellegen’s theorem</a>:</p>
\[\begin{equation}
S_{i}^{\mathrm{gen}} - S_{i}^{\mathrm{load}} - S_{i}^{\mathrm{shunt}} = S_{i}^{\mathrm{trans}} \qquad \forall i \in \mathcal{N}.
\label{balance}
\end{equation}\]
<p>Eq. \(\eqref{balance}\) states that the power injected into the bus by the connected generators (\(S_{i}^{\mathrm{gen}}\)), reduced by the power drawn by the connected loads (\(S_{i}^{\mathrm{load}}\)) and shunt elements (\(S_{i}^{\mathrm{shunt}}\)), is equal to the power transmitted to and from the adjacent buses (\(S_{i}^{\mathrm{trans}}\)).</p>
<p>There are two basic power flow models: the <a href="https://invenia.github.io/blog/2020/12/04/pf-intro/#bus-injection-model">bus injection model</a> (BIM) and the <a href="https://invenia.github.io/blog/2020/12/04/pf-intro/#branch-flow-model">branch flow model</a> (BFM) that are based on the undirected and directed graph representation of the power grid, respectively.
In this discussion, we focus on the BFM-based OPF, but we note that an equivalent OPF formulation can be derived from the BIM.
Also, we will use the <a href="https://invenia.github.io/blog/2020/12/04/pf-intro/#improved-bim-and-bfm-models">improved power flow model</a> that we described in the previous blog post.
Under this model, the BFM has four sets of equations: the first set expresses the power balance for each bus, the second set is the series current of the \(\pi\)-section model for each transmission line and the last two sets are the (asymmetric) power flows for the two directions of each transmission line:</p>
<p>\(\begin{equation}
\begin{aligned}
& \sum \limits_{g \in \mathcal{G}_{i}} S_{g}^{G} - \sum \limits_{l \in \mathcal{L}_{i}} S_{l}^{L} - \sum \limits_{s \in \mathcal{S}_{i}} \left( Y_{s}^{S} \right)^{*} \lvert V_{i} \rvert^{2} \\
& \qquad \quad \ = \sum \limits_{(i, j) \in \mathcal{E}} S_{ij} + \sum \limits_{(k, i) \in \mathcal{E}^{R}} S_{ki} \hspace{1.0cm} \forall i \in \mathcal{N}, \\
& I_{ij}^{s} = Y_{ij} \left( \frac{V_{i}}{T_{ij}} - V_{j} \right) \hspace{2.5cm} \forall (i, j) \in \mathcal{E}, \\
& S_{ij} = V_{i} \left( \frac{I_{ij}^{s}}{T_{ij}^{*}} + Y_{ij}^{C} \frac{V_{i}}{\lvert T_{ij} \rvert^{2}}\right)^{*} \hspace{0.7cm} \forall (i, j) \in \mathcal{E}, \\
& S_{ji} = V_{j} \left( -I_{ij}^{s} + Y_{ji}^{C}V_{j} \right)^{*} \hspace{1.8cm} \forall (j, i) \in \mathcal{E}^{R}, \\
\end{aligned}
\label{bfm_concise}
\end{equation}\)
where \(S_{g}^{G}\) and \(S_{l}^{L}\) represent the complex powers of generator \(g\) and load \(l\); \(Y_{s}^{S}\) is the admittance of shunt element \(s\), and \(V_{i}\) is the complex voltage of bus \(i\).
Also, \(I_{ij}^{s}\) and \(S_{ij}\) denote the series current and power flow; \(Y_{ij}\) and \(Y_{ij}^{C}\) represent the series admittance and shunt admittance; and \(T_{ij}\) designates the complex tap ratio of the \(\pi\)-section model of branch \(i \to j\).
<br />
<br /></p>
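<p>To make the three branch equations concrete, the following minimal Python sketch evaluates them for a single branch. The function name <code>branch_flows</code> and all numerical values are illustrative assumptions, not part of the original model description.</p>

```python
import cmath


def branch_flows(v_i, v_j, y_series, y_c_ij=0j, y_c_ji=0j, tap=1 + 0j):
    """Return (I_ij^s, S_ij, S_ji) for one pi-section branch i -> j.

    All arguments are complex numbers in per unit; the names are hypothetical.
    """
    # Series current of the pi-section model.
    i_series = y_series * (v_i / tap - v_j)
    # Power flow leaving bus i towards bus j.
    s_ij = v_i * (i_series / tap.conjugate() + y_c_ij * v_i / abs(tap) ** 2).conjugate()
    # Power flow leaving bus j towards bus i.
    s_ji = v_j * (-i_series + y_c_ji * v_j).conjugate()
    return i_series, s_ij, s_ji


# Illustrative values: two bus voltages and a series impedance z = r + jx.
v_i = cmath.rect(1.02, 0.05)
v_j = cmath.rect(1.00, 0.0)
z = 0.01 + 0.1j
i_s, s_ij, s_ji = branch_flows(v_i, v_j, 1 / z)

# With unit tap ratio and no line charging, the two flows differ by the series loss.
loss = s_ij + s_ji
```

<p>In this simplified setting (unit tap ratio, zero shunt admittance), the sum of the two directed flows equals \(z \lvert I_{ij}^{s} \rvert^{2}\), which illustrates why the two directions are asymmetric.</p>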
<p>Table 1. Comparison of the improved BIM and BFM formulations of power flow. For each formulation, we show the complex variables (and their real components), the total number of complex (and corresponding real) variables, and the number of complex (and corresponding real) equations. \(N = \lvert \mathcal{N} \rvert\), \(E = \lvert \mathcal{E} \rvert\), \(G = \lvert \mathcal{G} \rvert\) and \(L = \lvert \mathcal{L} \rvert\) denote the total number of buses, transmission lines, generators and loads. The variables \(v_{i}\), \(\delta_{i}\), \(i_{ij}^{s}\) and \(\gamma_{ij}^{s}\) designate the voltage magnitude and angle of bus \(i\), and the magnitude and angle of series current flowing from bus \(i\) to bus \(j\). The variables \(p\) and \(q\) denote the active and reactive components of the corresponding complex powers.</p>
<table>
<thead>
<tr>
<th style="text-align: center">Formulation</th>
<th style="text-align: center">Variables</th>
<th style="text-align: center">Number of variables</th>
<th style="text-align: center">Number of equations</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">BIM</td>
<td style="text-align: center">\(V_{i} = (v_{i} e^{\mathrm{j}\delta_{i}})\) <br /> \(S_{g}^{G} = (p_{g}^{G} + \mathrm{j} q_{g}^{G})\) <br /> \(S_{l}^{L} = (p_{l}^{L} + \mathrm{j} q_{l}^{L})\)</td>
<td style="text-align: center">\(N + G + L\) <br /> \((2N + 2G + 2L)\)</td>
<td style="text-align: center">\(N\) <br /> \((2N)\)</td>
</tr>
<tr>
<td style="text-align: center">BFM</td>
<td style="text-align: center">\(V_{i} = (v_{i} e^{\mathrm{j}\delta_{i}})\) <br /> \(S_{g}^{G} = (p_{g}^{G} + \mathrm{j} q_{g}^{G})\) <br /> \(S_{l}^{L} = (p_{l}^{L} + \mathrm{j} q_{l}^{L})\) <br /> \(I_{ij}^{s} = (i_{ij}^{s} e^{\mathrm{j} \gamma_{ij}^{s}})\) <br /> \(S_{ij} = (p_{ij} + \mathrm{j} q_{ij})\) <br /> \(S_{ji} = (p_{ji} + \mathrm{j} q_{ji})\)</td>
<td style="text-align: center">\(N + G + L + 3E\) <br /> \((2N + 2G + 2L + 6E)\)</td>
<td style="text-align: center">\(N + 3E\) <br /> \((2N + 6E)\)</td>
</tr>
</tbody>
</table>
<p><br /></p>
<h2 id="the-economic-dispatch-problem">The economic dispatch problem</h2>
<p>The first OPF model we discuss is a specific variant of the economic dispatch (ED) model, which tries to find the economically optimal (i.e. lowest possible cost) output of the generators that meets the load and physical constraints of the power system.
We first present the elements of an ED formulation using the improved BFM model (the full optimization problem can be found in the <a href="#appendix-a-the-economic-dispatch-problem-using-bfm">Appendix</a>).
Then we discuss the interior-point method, which is one of the most widely used techniques to solve ED problems.</p>
<h3 id="variables">Variables</h3>
<p>In power flow problems, the number of equations determines the maximum number of unknown variables for which the non-linear system can be solved.
For example, as shown in Table 1, BIM has \(N+G+L\) complex variables and \(N\) complex equations, which means that \(G+L\) variables need to be specified for the problem to be uniquely solvable.</p>
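<p>The counting argument can be checked with a few lines of Python. This is only a sketch of the bookkeeping in Table 1; the function name <code>power_flow_counts</code> and the example grid size are assumptions made for illustration.</p>

```python
def power_flow_counts(n_bus, n_branch, n_gen, n_load, model="BIM"):
    """Count complex variables and equations following Table 1."""
    variables = n_bus + n_gen + n_load
    equations = n_bus
    if model == "BFM":
        # BFM adds I_ij^s, S_ij and S_ji per branch, with one equation for each.
        variables += 3 * n_branch
        equations += 3 * n_branch
    elif model != "BIM":
        raise ValueError("model must be 'BIM' or 'BFM'")
    return {
        "variables": variables,
        "equations": equations,
        "to_specify": variables - equations,
    }


# Hypothetical grid: 3 buses, 2 branches, 2 generators, 2 loads.
bim = power_flow_counts(3, 2, 2, 2, model="BIM")
bfm = power_flow_counts(3, 2, 2, 2, model="BFM")
```

<p>For both formulations the number of complex variables that must be specified comes out as \(G + L\), since BFM adds exactly as many equations as variables.</p>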
<p>In ED, the variables are basically the same as in the power flow problems, but, as an optimization problem, ED has a higher number of unknown variables.
For generality, in this discussion we treat all variables in ED formally as unknown.
For variables whose value is specified, we can simply define an equality constraint or, more typically, the corresponding lower and upper bounds with the same specific value.
This latter formalism (of using lower and upper bounds) makes the model construction more straightforward, and also follows general practice.<sup id="fnref:Coffrin18" role="doc-noteref"><a href="#fn:Coffrin18" class="footnote" rel="footnote">1</a></sup>
For the following sections, let \(X\) denote all variables in ED.</p>
<h3 id="objective-function">Objective function</h3>
<p>The objective (or cost) function of the most widely used economic dispatch OPF is the cost of the total real power generation.
Let \(C_{g}(p_{g}^{G})\) denote the individual cost curve of generator \(g \in \mathcal{G}\), which depends solely on the (active) power \(p_{g}^{G}\) generated by generator \(g\).
\(C_{g}\) is a monotonically increasing function of \(p_{g}^{G}\), usually modeled by a quadratic or piecewise linear function. An example of such a function is shown in Figure 1.
The objective function \(C(X)\) can then be written as:</p>
\[\begin{equation}
C(X) = \sum \limits_{g \in \mathcal{G}} C_{g}(p_{g}^{G}).
\end{equation}\]
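<p>As a small illustration of this objective, the sketch below sums quadratic cost curves over a set of generators. The quadratic form \(a p^{2} + b p + c\), the generator names and the coefficients are assumptions chosen for the example only.</p>

```python
def quadratic_cost(p, a, b, c):
    """Cost curve C_g(p) = a*p^2 + b*p + c (increasing for p >= 0 when a, b >= 0)."""
    return a * p ** 2 + b * p + c


def total_cost(dispatch, coefficients):
    """Objective C(X): sum of the individual generator costs."""
    return sum(quadratic_cost(dispatch[g], *coefficients[g]) for g in dispatch)


# Hypothetical coefficients (a, b, c) and active power outputs per generator.
coefficients = {"g1": (0.01, 20.0, 100.0), "g2": (0.02, 15.0, 50.0)}
dispatch = {"g1": 100.0, "g2": 50.0}
cost = total_cost(dispatch, coefficients)
```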
<p>We note that there can be alternative objective functions in OPF that can be used to investigate different features of the power grid.<sup id="fnref:AlRashidi09" role="doc-noteref"><a href="#fn:AlRashidi09" class="footnote" rel="footnote">2</a></sup>
For instance, one can minimize the <em>line loss</em> (i.e. power loss on transmission lines) over the network.
Also, the objective function can express additional environmental and economical costs other than pure generation cost.
As an example, with the increasing concerns over climate change, multiple studies have suggested that reducing overall emissions from generation should be taken into account in the objective function.<sup id="fnref:Gholami14" role="doc-noteref"><a href="#fn:Gholami14" class="footnote" rel="footnote">3</a></sup>
<br />
<br /></p>
<p><img src="/blog/public/images/cost_curve.png" alt="cost_curve" />
Figure 1. Nonlinear generation cost curve (solid line) and its piecewise linear approximation (dashed lines) by three straight-line segments. \(p^{\mathrm{min}}\) and \(p^{\mathrm{max}}\) specify the operation limits.</p>
<p><br /></p>
<h3 id="equality-constraints">Equality constraints</h3>
<p>Equality constraints are expressions of the optimization variables characterized by equations.
Equivalently, one can think of an equality constraint as a function of the optimization variables whose function value is fixed.</p>
<p>Because power flow equations express fundamental physical laws, they must naturally be contained in any OPF.
The power balance equations together with the power flow equations \(\eqref{bfm_concise}\) are treated as equality constraints in OPF.</p>
<p>In the equations of power flow, the voltage angle differences play a key role, rather than the individual angles.
Therefore, without loss of generality we can select a specific value of the voltage angle of a bus as reference and apply an additional equality constraint to this reference bus by setting its voltage angle to zero.
We call this the <em>slack</em> or <em>reference bus</em>.
In a more general setting, there can be multiple <em>slack buses</em> (i.e. buses used to make up any generation and demand mismatch caused by line losses) and corresponding constraints on their voltage angle.
Denoting the set of all reference/slack buses by \(\mathcal{R}\) we can write the corresponding constraints for all \(r \in \mathcal{R}\) as: \(\delta_{r} = 0\).</p>
<h3 id="inequality-constraints">Inequality constraints</h3>
<p>Finally, OPF also includes a set of inequality constraints.
Similarly to equality constraints, an inequality constraint can be described by a function of the optimization variables, whose function value is allowed to change within a specified interval.
Inequality constraints usually express some physical limitations of the elements of the system.
For example, the most typical constraints for the ED problem are:</p>
<ul>
<li>lower and upper bounds on voltage magnitude of each bus \(i\): \(\underline{v}_{i} \le v_{i} \le \overline{v}_{i}\),</li>
<li>lower and upper bounds on voltage angle difference between all adjacent buses \(i\) and \(j\): \(\underline{\delta}_{ij} \le \delta_{ij} \le \overline{\delta}_{ij}\),</li>
<li>lower and upper bounds on output active power of each generator \(g\): \(\underline{p}_{g}^{G} \le p_{g}^{G} \le \overline{p}_{g}^{G}\),</li>
<li>lower and upper bounds on output reactive power of each generator \(g\): \(\underline{q}_{g}^{G} \le q_{g}^{G} \le \overline{q}_{g}^{G}\),</li>
<li>upper bounds on the absolute value of active power flow of all transmission lines: \(\lvert p_{ij}\rvert \le \overline{p}_{ij}\),</li>
<li>upper bounds on the absolute value of reactive power flow of all transmission lines: \(\lvert q_{ij}\rvert \le \overline{q}_{ij}\),</li>
<li>upper bound on the absolute (square) value of complex power flow of all transmission lines: \(\lvert S_{ij}\rvert^{2} = p_{ij}^{2} + q_{ij}^{2} \le \overline{S}_{ij}^{2}\).</li>
</ul>
<p>We note that in the actual implementation, some inequality constraints can appear in alternative forms.
For instance, lower and upper bounds on power flows can be replaced by a corresponding set of constraints on the currents.</p>
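<p>The bound constraints listed above are simple to check for a candidate solution. The following sketch uses hypothetical helper names and tolerances, and also shows how a variable with a specified value can be expressed through identical lower and upper bounds.</p>

```python
def within(lower, value, upper, tol=1e-9):
    """Check a box constraint lower <= value <= upper up to a small tolerance."""
    return lower - tol <= value <= upper + tol


def branch_within_limit(p_ij, q_ij, s_max, tol=1e-9):
    """Check the apparent power limit p_ij^2 + q_ij^2 <= s_max^2."""
    return p_ij ** 2 + q_ij ** 2 <= s_max ** 2 + tol


# Voltage magnitude bounds in per unit (illustrative values).
voltage_ok = within(0.95, 1.0, 1.05)
# A fixed variable: identical lower and upper bounds act as an equality constraint.
fixed_ok = within(1.0, 1.0, 1.0)
fixed_violated = within(1.0, 1.01, 1.0)
# Apparent power limit on one branch.
flow_ok = branch_within_limit(0.6, 0.8, 1.0)
flow_violated = branch_within_limit(0.9, 0.8, 1.0)
```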
<h3 id="solving-the-ed-problem-using-the-interior-point-method">Solving the ED problem using the interior-point method</h3>
<p>Economic dispatch is a non-linear and non-convex optimization problem.
One of the most successful techniques to solve such large-scale optimization problems is the <a href="https://en.wikipedia.org/wiki/Interior-point_method">interior-point method</a>.<sup id="fnref:Nocedal06" role="doc-noteref"><a href="#fn:Nocedal06" class="footnote" rel="footnote">4</a></sup> <sup id="fnref:Wachter06" role="doc-noteref"><a href="#fn:Wachter06" class="footnote" rel="footnote">5</a></sup>
In this section, we briefly discuss the key aspects of this method.</p>
<p>In a general mathematical optimization, we try to find values of variables \(x\) that minimize an objective function \(f(x)\) subject to a set of equality and inequality constraints \(c^{\mathrm{E}}(x)\) and \(c^{\mathrm{I}}(x)\):</p>
\[\begin{equation}
\begin{aligned}
& \min \limits_{x} \ f(x), \\
& \mathrm{s.t.} \; \; c^{\mathrm{E}}(x) = 0, \\
& \quad \; \; \; \; c^{\mathrm{I}}(x) \ge 0. \\
\end{aligned}
\label{opt_problem}
\end{equation}\]
<p>The ED problem can be expressed in such a form. For example, using the BFM formalism, \(x = X^{\mathrm{BFM}}\), \(f(x) = C(X^{\mathrm{BFM}})\), with \(c^{\mathrm{E}}_{i} \in \mathcal{C}^{\mathrm{E}}\), where \(\mathcal{C}^{\mathrm{E}} = \mathcal{C}^{\mathrm{PB}} \cup \mathcal{C}^{\mathrm{REF}} \cup \mathcal{C}^{\mathrm{BFM}}\), and \(c^{\mathrm{I}}_{j} \in \mathcal{C}^{\mathrm{I}}\), where \(\mathcal{C}^{\mathrm{I}} = \mathcal{C}^{\mathrm{INEQ}}\).
Details can be found in the <a href="#appendix-a-the-economic-dispatch-problem-using-bfm">Appendix</a>.</p>
<p>Interior-point methods are also referred to as barrier methods, as they introduce a surrogate model of the above optimization problem by converting the inequality constraints to equality ones (via slack variables) and augmenting the objective function with a logarithmic barrier term:</p>
\[\begin{equation}
\begin{aligned}
& \min \limits_{x, s} \ f(x) - \mu \sum \limits_{i=1}^{\lvert \mathcal{C}^{\mathrm{I}} \rvert} \log s_{i}, \\
& \mathrm{s.t.} \; \; c^{\mathrm{E}}(x) = 0, \\
& \quad \; \; \; \; c^{\mathrm{I}}(x) - s = 0. \\
\end{aligned}
\label{barrier_problem}
\end{equation}\]
<p>Here \(s\) is the vector of slack variables, and the second term in the objective function is called the barrier term, with barrier parameter \(\mu > 0\).
We note that the barrier term implicitly imposes an inequality constraint on the slack variables, because the logarithm is only defined for positive arguments: \(s_i > 0\) for all \(i\).
It is easy to see that the barrier problem is not equivalent to the original non-linear program for \(\mu > 0\).
However, as \(\mu \to 0\), the solution of the barrier problem \(\eqref{barrier_problem}\) approaches that of the original optimization problem \(\eqref{opt_problem}\).
The idea of the barrier method is to find an approximate solution of the original problem by using an appropriate sequence of positive barrier parameters that converges to zero.
The basic interior-point algorithm is based on the <a href="https://en.wikipedia.org/wiki/Karush%E2%80%93Kuhn%E2%80%93Tucker_conditions">Karush–Kuhn–Tucker</a> (KKT) conditions of the barrier problem.
The KKT conditions consist of four sets of equations: the first two sets of equations are derived from the first-order condition of the system <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrangian</a>, i.e. its derivatives with respect to \(x\) and \(s\) are equal to zero, while the third and fourth sets of equations correspond to the equality and inequality constraints, respectively:</p>
\[\begin{equation}
\begin{aligned}
\nabla f(x) - J_{\mathrm{E}}^{T}(x)y - J_{\mathrm{I}}^{T}(x)z = 0, \\
Sz - \mu e = 0, \\
c^{\mathrm{E}}(x) = 0, \\
c^{\mathrm{I}}(x) - s = 0. \\
\end{aligned}
\label{kkt}
\end{equation}\]
<p>The \(J_{\mathrm{E}}(x)\) and \(J_{\mathrm{I}}(x)\) are the <a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian</a> matrices of the equality and of the inequality functions, and
\(y\) and \(z\) are their corresponding <a href="https://en.wikipedia.org/wiki/Lagrange_multiplier">Lagrange multipliers</a>.
Also, we use the following additional notations: \(S = \mathrm{diag}(s)\) and \(Z = \mathrm{diag}(z)\) are diagonal matrices, \(I\) is the identity matrix, and \(e = (1, \dots, 1)^{T}\).</p>
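<p>As a minimal worked example (a toy problem chosen here for illustration, not part of the ED model), consider minimizing \(f(x) = x\) subject to the single inequality constraint \(c^{\mathrm{I}}(x) = x - 1 \ge 0\), with no equality constraints. Since \(\nabla f(x) = 1\) and \(J_{\mathrm{I}}(x) = 1\), the KKT conditions of the corresponding barrier problem reduce to:</p>
\[\begin{aligned}
1 - z = 0, \\
sz - \mu = 0, \\
x - 1 - s = 0, \\
\end{aligned}\]
<p>which gives \(z = 1\), \(s = \mu\) and \(x = 1 + \mu\). For any \(\mu > 0\) the barrier solution lies strictly inside the feasible region, and it approaches the true minimizer \(x = 1\) as \(\mu \to 0\).</p>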
<p>The KKT conditions establish a relationship between the primal \((x, s)\) and <a href="https://en.wikipedia.org/wiki/Duality_(optimization)">dual</a> \((y, z)\) variables, resulting in a non-linear system \(\eqref{kkt}\) that can be expressed by a vector-valued function: \(F_{\mathrm{KKT}}(x, s, y, z) = 0\).
In order to solve this non-linear system, we can apply the <a href="https://en.wikipedia.org/wiki/Newton%27s_method">Newton–Raphson method</a> and look for a search direction based on the following linear problem:</p>
\[\begin{equation}
J_{\mathrm{KKT}}(x, s, y, z)p \ =\ -F_{\mathrm{KKT}}(x, s, y, z),
\label{primal_dual}
\end{equation}\]
<p>where \(J_{\mathrm{KKT}}(x, s, y, z)\) is the Jacobian matrix of the KKT system, and \(p\) is the search direction vector.
The full form of these components is as follows:</p>
\[\begin{equation}
\begin{aligned}
F_{\mathrm{KKT}}(x, s, y, z) & =
\begin{pmatrix}
\nabla f(x) - J_{\mathrm{E}}^{T}(x)y - J_{\mathrm{I}}^{T}(x)z \\
Sz - \mu e \\
c^{\mathrm{E}}(x) \\
c^{\mathrm{I}}(x) - s \\
\end{pmatrix}, \\
J_{\mathrm{KKT}}(x, s, y, z) & =
\begin{pmatrix}
\nabla_{xx}L & 0 & -J_{\mathrm{E}}^{T}(x) & -J_{\mathrm{I}}^{T}(x) \\
0 & Z & 0 & S \\
J_{\mathrm{E}}(x) & 0 & 0 & 0 \\
J_{\mathrm{I}}(x) & -I & 0 & 0 \\
\end{pmatrix}, \\
p & =
\begin{pmatrix}
p_{x} \\
p_{s} \\
p_{y} \\
p_{z} \\
\end{pmatrix},
\end{aligned}
\end{equation}\]
<p>where the Lagrangian of the system is defined as</p>
\[\begin{equation}
L(x, s, y, z) = f(x) - y^{T} c^{\mathrm{E}}(x) - z^{T} \left( c^{\mathrm{I}}(x) - s \right),
\label{lagrangian}
\end{equation}\]
<p>and \(\nabla_{xx}L\) is the <a href="https://en.wikipedia.org/wiki/Hessian_matrix">Hessian</a> of the Lagrangian with respect to the optimization variables.</p>
<p>Equation \(\eqref{primal_dual}\)
is called the primal-dual system, as it includes the dual variables besides the primal ones.
The primal-dual system is solved iteratively: in iteration \((n+1)\), after obtaining \(p^{(n+1)}\), we can update the variables \((x,s,y,z)\) and the barrier parameter \(\mu\):</p>
\[\begin{equation}
\begin{aligned}
& x_{n+1} = x_{n} + \alpha_{x}^{(n+1)} p_{x}^{(n+1)}, \\
& s_{n+1} = s_{n} + \alpha_{s}^{(n+1)} p_{s}^{(n+1)}, \\
& y_{n+1} = y_{n} + \alpha_{y}^{(n+1)} p_{y}^{(n+1)}, \\
& z_{n+1} = z_{n} + \alpha_{z}^{(n+1)} p_{z}^{(n+1)}, \\
& \mu_{n+1} = g(\mu_{n}, x_{n+1}, s_{n+1}, y_{n+1}, z_{n+1}), \\
\end{aligned}
\end{equation}\]
<p>where \(g\) is a function that computes the next value of \(\mu\)
from the previous value and the updated variables.
The \(\alpha_k\)’s are line search parameters for variables \(k = x, s, y, z\)
that are dynamically adapted to get better convergence.</p>
<p>There are several modern interior-point methods that propose different heuristic functions for computing line search parameters \(\alpha_k\) and barrier parameter \(\mu\), in order to cope with non-convexities and non-linearities, and to provide good convergence rates.</p>
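<p>The iteration described above can be sketched in a few lines of Python with NumPy. The example below is a deliberately simplified illustration on a hypothetical two-variable problem, not a production solver: the toy problem, the fraction-to-boundary step rule (one step size for the primal and one for the dual variables), and the barrier update \(\mu = \sigma\, s^{T}z / \lvert \mathcal{C}^{\mathrm{I}} \rvert\) are all assumptions chosen for brevity.</p>

```python
import numpy as np


def max_step(v, dv, tau=0.995):
    """Fraction-to-boundary rule: largest alpha in (0, 1] with v + alpha*dv >= (1 - tau)*v."""
    shrinking = dv < 0
    if not np.any(shrinking):
        return 1.0
    return min(1.0, float(np.min(-tau * v[shrinking] / dv[shrinking])))


def solve_toy_problem(sigma=0.1, tol=1e-10, max_iter=50):
    """Minimize (x1-2)^2 + (x2-2)^2  s.t.  x1 + x2 - 2 = 0  and  0.5 - x1 >= 0."""
    x, s = np.zeros(2), np.array([0.5])   # primal variables and slack
    y, z = np.zeros(1), np.ones(1)        # multipliers of c_E and c_I
    J_E = np.array([[1.0, 1.0]])          # Jacobian of the equality constraint
    J_I = np.array([[-1.0, 0.0]])         # Jacobian of the inequality constraint
    hess = 2.0 * np.eye(2)                # Hessian of the Lagrangian (constraints are linear)
    mu = sigma * float(s @ z) / s.size

    for _ in range(max_iter):
        grad_f = 2.0 * (x - 2.0)
        c_E = np.array([x[0] + x[1] - 2.0])
        c_I = np.array([0.5 - x[0]])
        stationarity = grad_f - J_E.T @ y - J_I.T @ z
        # Stop when the KKT conditions of the original problem (mu = 0) hold.
        residual = np.concatenate([stationarity, s * z, c_E, c_I - s])
        if np.max(np.abs(residual)) < tol:
            break
        # Assemble F_KKT and J_KKT of the barrier problem and solve for p.
        F = np.concatenate([stationarity, s * z - mu, c_E, c_I - s])
        J = np.block([
            [hess, np.zeros((2, 1)), -J_E.T, -J_I.T],
            [np.zeros((1, 2)), np.diag(z), np.zeros((1, 1)), np.diag(s)],
            [J_E, np.zeros((1, 3))],
            [J_I, -np.eye(1), np.zeros((1, 2))],
        ])
        p = np.linalg.solve(J, -F)
        p_x, p_s, p_y, p_z = p[:2], p[2:3], p[3:4], p[4:5]
        # Step sizes that keep the slack and its multiplier strictly positive.
        alpha_primal = max_step(s, p_s)
        alpha_dual = max_step(z, p_z)
        x, s = x + alpha_primal * p_x, s + alpha_primal * p_s
        y, z = y + alpha_dual * p_y, z + alpha_dual * p_z
        # Simple barrier update from the current complementarity gap.
        mu = sigma * float(s @ z) / s.size
    return x, s, y, z


x_opt, s_opt, y_opt, z_opt = solve_toy_problem()
```

<p>For this toy problem the inequality constraint is active at the optimum, so the slack variable goes to zero while its multiplier stays positive. Practical solvers add many safeguards (line searches, regularization, filter or merit functions) that are omitted here.</p>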
<h2 id="opf-variants">OPF variants</h2>
<p>So far we have been focusing on one specific OPF problem, a variant of the economic dispatch that is one of the most basic types in the realm of OPF.
In this section, we briefly discuss other models.</p>
<p>In a more general setting, the task of <em>economic dispatch</em> is to optimize some economic utility function for participants, including both loads and generators.
For instance, incorporating both power supply and power demand into the objective function leads to the basic concept of the electricity market.
At the optimum, the amount of supply and demand power reflects simple market principles subject to the physical constraints of the system: power is primarily bought from generators offering the lowest prices and primarily sold to loads offering the highest prices.</p>
<p>In the so-called <em>volt/var control</em> we optimize the combination of power loss, peak demand, and power consumption, with respect to the voltage level and reactive powers.</p>
<p>In the above problems, the set of generators is specified and all of these generators are supposed to contribute to the total power generation.
However, in real industrial problems, not all generators in the grid need to be active.
Generators differ in many ways besides their generation curve: they have different start up and shut down time scales (i.e. how quickly a generator can reach its operating level and how quickly it can stop running), different costs and ramp rates (i.e. the rates at which a generator can increase or decrease its output), different minimum and maximum power generation, etc. All of these quantities need to be taken into account in the planning procedure.
Therefore, it is very much desirable to allow specific generators to not run in the optimization model.
In order to select the active generators, electricity grid operators solve the <em>unit commitment</em> (UC) problem.
The <em>unit commitment economic dispatch</em> (UCED) problem can be considered an extension of the ED problem, where the set of variables is extended with binary variables, each responsible for including or excluding a specific generator in the optimization.
The UC problem is, therefore, a mixed-integer non-linear programming problem, which is even harder to solve computationally.</p>
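<p>To illustrate the role of the binary variables, the sketch below solves a tiny unit commitment problem by brute force. It makes several simplifying assumptions that are not in the text above: a single time step, no network constraints, linear marginal costs plus a fixed cost for each committed unit, and made-up generator data. Enumerating all on/off combinations scales exponentially with the number of generators, which hints at why realistic UC problems are hard.</p>

```python
from itertools import product


def dispatch_cost(units, demand):
    """Cheapest dispatch of the committed units (merit order), or None if infeasible."""
    p_min_total = sum(u["p_min"] for u in units)
    p_max_total = sum(u["p_max"] for u in units)
    if not (p_min_total <= demand <= p_max_total):
        return None
    # Every committed unit runs at least at its minimum output.
    output = {u["name"]: u["p_min"] for u in units}
    remaining = demand - p_min_total
    # Fill the remaining demand with the cheapest units first.
    for u in sorted(units, key=lambda unit: unit["marginal_cost"]):
        extra = min(remaining, u["p_max"] - u["p_min"])
        output[u["name"]] += extra
        remaining -= extra
    cost = sum(u["fixed_cost"] + u["marginal_cost"] * output[u["name"]] for u in units)
    return cost, output


def unit_commitment(units, demand):
    """Enumerate all on/off combinations and keep the cheapest feasible one."""
    best = None
    for on_off in product([0, 1], repeat=len(units)):
        committed = [u for u, on in zip(units, on_off) if on]
        result = dispatch_cost(committed, demand)
        if result is not None and (best is None or result[0] < best[0]):
            best = (result[0], on_off, result[1])
    return best


# Hypothetical generator data.
units = [
    {"name": "A", "p_min": 10, "p_max": 100, "marginal_cost": 20, "fixed_cost": 100},
    {"name": "B", "p_min": 20, "p_max": 50, "marginal_cost": 10, "fixed_cost": 400},
]
best_cost, best_commitment, best_dispatch = unit_commitment(units, demand=60)
```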
<p>Another aspect that needs to be considered is the security of the grid.
Power grids are complex systems that include thousands of buses and transmission lines, where failure of any element can happen occasionally and impact the operation of the power grid significantly.
Thus, it is very important to account for such contingencies.
The corresponding set of problems is called <em>security constrained optimal power flow</em> (SCOPF) and in combination with ED it is called <em>security constrained economic dispatch</em> (SCED).
SCOPF provides a solution not only for the ideal unfailing case, but at the same time for some contingency scenarios as well.
These are much larger-scale problems, with far more variables and constraints, and they require costly mixed-integer non-linear programming.
For example, the widely used \(N-1\) contingency problem addresses all scenarios where a single component fails.</p>
<p>Finally, we note that a more general (and very large-scale) problem can be formulated by using both UC and SCOPF, leading to the <em>security constrained unit commitment problem</em> (SCUC).</p>
<h2 id="solving-opf-problems">Solving OPF problems</h2>
<p>Solving OPF problems by interior-point methods, as described earlier, is widely used as a reliable standard approach.
However, solving such problems by interior-point methods is not cheap, as it requires the computation of the Hessian of the Lagrangian \(\nabla_{xx} L(x_{n}, s_{n}, y_{n}, z_{n})\) at each iteration step \(n\) as in equation \(\eqref{primal_dual}\).
Because the required computational time has a disadvantageous superlinear scaling with system size, solving large scale problems can be prohibitively difficult.
Moreover, the scaling is even worse for mixed-integer OPF problems.</p>
<p>There are several approaches that address this issue by either simplifying the original OPF problem, or by computing approximate solutions.</p>
<h3 id="convex-relaxations">Convex relaxations</h3>
<p>The first set of approximations we discuss is called convex relaxations.
The main idea of these approaches is to approximate the original non-convex optimization problem with a convex one.
Unlike in non-convex problems, in convex problems any local minimum is also a global minimum (a convex problem may still have several minimizers, but they all attain the same optimal value).
There are very straightforward algorithms with excellent convergence rates to solve convex optimization problems.<sup id="fnref:Nocedal06:1" role="doc-noteref"><a href="#fn:Nocedal06" class="footnote" rel="footnote">4</a></sup> <sup id="fnref:Boyd04" role="doc-noteref"><a href="#fn:Boyd04" class="footnote" rel="footnote">6</a></sup></p>
<p>To demonstrate convex relaxations, let us consider the economic dispatch problem with the basic BIM formulation.
As can be shown, this problem – with some slight modification – can be reformulated as a <a href="https://en.wikipedia.org/wiki/Quadratically_constrained_quadratic_program">quadratically constrained quadratic program</a> (QCQP) using the complex voltages of the buses.<sup id="fnref:Low14" role="doc-noteref"><a href="#fn:Low14" class="footnote" rel="footnote">7</a></sup>
In this reformulated optimization problem, the objective function and all (inequality) constraints have a quadratic form in the complex voltages:</p>
\[\begin{equation}
\begin{aligned}
& \min \limits_{V \in \mathbb{C}^{N}}\ V^{H}CV, \\
& \mathrm{s.t.} \quad V^{H}M_{k}V \le m_{k} \quad k \in \mathcal{K}, \\
\end{aligned}
\label{qcqp}
\end{equation}\]
<p>where the superscript \(H\) denotes <a href="https://en.wikipedia.org/wiki/Conjugate_transpose">conjugate transpose</a>, the \(M_{k}\) are <a href="https://en.wikipedia.org/wiki/Hermitian_matrix">Hermitian</a> matrices (i.e. \(M_{k}^{H} = M_{k}\)), and the \(m_{k}\) are real-valued scalars specifying the corresponding inequality bounds.
QCQP is still a non-convex problem, and it is also <a href="https://en.wikipedia.org/wiki/NP-hardness">NP-hard</a>.
Let \(W = VV^{H}\) denote a rank–1 <a href="https://en.wikipedia.org/wiki/Definite_symmetric_matrix">positive semidefinite</a> matrix (i.e. <a href="https://en.wikipedia.org/wiki/Rank_(linear_algebra)">rank</a>\((W) = 1\) and \(W \succeq 0\)).
Using the identity of the <a href="https://en.wikipedia.org/wiki/Trace_(linear_algebra)">trace function</a> for any Hermitian matrix: \(V^{H}MV = \mathrm{tr}(MVV^{H}) = \mathrm{tr}(MW)\), we can transform the quadratic forms of problem \(\eqref{qcqp}\) to expressions that include \(W\) instead of \(V\):</p>
\[\begin{equation}
\begin{aligned}
& \min \limits_{W \in \mathbb{S}}\ \mathrm{tr}(CW), \\
& \mathrm{s.t.} \; \; \mathrm{tr}(M_{k}W) \le m_{k} \quad k \in \mathcal{K}, \\
& \quad \quad \ W \succeq 0, \\
& \quad \quad \ \mathrm{rank}(W) = 1, \\
\end{aligned}
\end{equation}\]
<p>where \(\mathbb{S}\) denotes the space of \(N \times N\) Hermitian matrices.
The above problem is convex in \(W\), except for the rank–1 equality constraint.
Removing this constraint leads to a <a href="https://en.wikipedia.org/wiki/Semidefinite_programming">semidefinite programming problem</a> (SDP), which is called the SDP relaxation of the OPF.
The SDP relaxation thus replaces the feasible set of the original problem with a convex superset.
Other choices of convex supersets yield different relaxations, like chordal or <a href="https://en.wikipedia.org/wiki/Second-order_cone_programming">second-order cone programming</a> (SOCP), that can be solved even more efficiently than SDP problems.
We note that all of these relaxations provide a lower bound on the optimal objective value of the OPF.
Under certain sufficient conditions, these relaxations can even be exact.<sup id="fnref:Low14a" role="doc-noteref"><a href="#fn:Low14a" class="footnote" rel="footnote">8</a></sup>
Also, an advantage of convex relaxation approaches over other approximations is that if a relaxed problem is infeasible, then the original problem is also infeasible.</p>
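<p>The lifting step \(W = VV^{H}\) behind the SDP relaxation can be verified numerically. The following NumPy sketch uses randomly generated data (an assumption for illustration only) to check the trace identity and the rank and definiteness properties of \(W\).</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# Random complex voltage vector V and a random Hermitian matrix M.
V = rng.normal(size=n) + 1j * rng.normal(size=n)
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
M = (A + A.conj().T) / 2

# Lifted variable W = V V^H.
W = np.outer(V, V.conj())

quadratic_form = np.vdot(V, M @ V)   # V^H M V (vdot conjugates its first argument)
trace_form = np.trace(M @ W)         # tr(M W)

rank_W = np.linalg.matrix_rank(W)
min_eigenvalue = np.linalg.eigvalsh(W).min()
```

<p>The quadratic form is real because \(M\) is Hermitian, and \(W\) has rank one with non-negative eigenvalues. Dropping the rank condition while keeping \(W \succeq 0\) is what turns the problem into an SDP.</p>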
<h3 id="dc-opf">DC-OPF</h3>
<p>One of the most widely used approximations is the DC-OPF approach.
Its name is pretty misleading, as this method does not assume the use of direct current (as opposed to alternating current), although it removes the reactive power flow equations and the corresponding variables and constraints.
The DC approximation is, in fact, a linearization of the original OPF problem (which is sometimes called AC-OPF).</p>
<p>In order to demonstrate the simplifications introduced by the DC-OPF approach, we recall that for a <a href="https://invenia.github.io/blog/2020/12/04/pf-intro/#complex-power-in-ac-circuits">simple power flow</a> model the active and reactive powers flowing from bus \(i\) to \(j\) can be written as:</p>
\[\begin{equation}
\left\{
\begin{aligned}
p_{ij} & = \frac{1}{r_{ij}^{2} + x_{ij}^{2}} \left[ r_{ij} \left( v_{i}^{2} - v_{i} v_{j} \cos(\delta_{ij}) \right) + x_{ij} \left( v_{i} v_{j} \sin(\delta_{ij}) \right) \right], \\
    q_{ij} & = \frac{1}{r_{ij}^{2} + x_{ij}^{2}} \left[ x_{ij} \left( v_{i}^{2} - v_{i} v_{j} \cos(\delta_{ij}) \right) - r_{ij} \left( v_{i} v_{j} \sin(\delta_{ij}) \right) \right]. \\
\end{aligned}
\right.
\label{power_flow_z}
\end{equation}\]
<p>DC-OPF makes three major assumptions<sup id="fnref:Liu09" role="doc-noteref"><a href="#fn:Liu09" class="footnote" rel="footnote">9</a></sup>, each of which leads to a simpler expression for the active and reactive power flows (\(p_{ij}\) and \(q_{ij}\)):</p>
<ol>
<li>The resistance \(r_{ij}\) of each branch is negligible, i.e. \(r_{ij} \approx 0\).
Applying this assumption to \(\eqref{power_flow_z}\), we obtain:</li>
</ol>
\[\begin{equation}
\left\{
\begin{aligned}
p_{ij} & {\ \approx\ } \frac{1}{x_{ij}} \left( v_{i}v_{j}\sin(\delta_{ij}) \right) \\
q_{ij} & {\ \approx\ } \frac{1}{x_{ij}} \left( v_{i}^{2} - v_{i} v_{j} \cos(\delta_{ij}) \right)
\end{aligned}
\right.
\end{equation}\]
<ol start="2">
<li>The bus voltage magnitudes are approximated by one per unit everywhere, i.e. \(v_{i} \approx 1\).
Using this assumption, we obtain:</li>
</ol>
\[\begin{equation}
\left\{
\begin{aligned}
p_{ij} & {\ \approx\ } \frac{\sin(\delta_{ij})}{x_{ij}} \\
q_{ij} & {\ \approx\ } \frac{1 - \cos(\delta_{ij})}{x_{ij}}
\end{aligned}
\right.
\end{equation}\]
<ol start="3">
<li>The voltage angle difference of each branch is very small, i.e. \(\cos(\delta_{ij}) \approx 1\) and \(\sin(\delta_{ij}) \approx \delta_{ij}\). Applying this assumption leads to our final form:</li>
</ol>
\[\begin{equation}
\left\{
\begin{aligned}
p_{ij} & {\ \approx\ } \frac{\delta_{ij}}{x_{ij}}, \\
q_{ij} & {\ \approx\ } 0.
\end{aligned}
\right.
\end{equation}\]
<p>As we can see, DC-OPF reduces the original power flow problem significantly by removing several variables and constraints (the full DC-OPF economic dispatch model is shown in the <a href="#appendix-b-the-dc-opf-economic-dispatch-problem">Appendix</a>).
Also, the remaining set of constraints with a linear objective function results in a linear programming problem that can be solved very efficiently by interior-point methods and by <a href="https://en.wikipedia.org/wiki/Simplex_algorithm">simplex methods</a>.
DC-OPF has proven to be useful for a wide variety of applications.
However, it also has several limitations among which the most severe one is that the solution of DC-OPF may not even be a feasible solution of AC-OPF.<sup id="fnref:Baker19" role="doc-noteref"><a href="#fn:Baker19" class="footnote" rel="footnote">10</a></sup>
In such situations, some constraints need to be tightened, and the DC-OPF computation needs to be re-run.</p>
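<p>The quality of the DC approximation for a single branch can be examined with a short Python sketch. The AC flow is computed directly from complex voltages for a simple series-impedance branch (no shunt elements or transformers), and the numerical values are illustrative assumptions.</p>

```python
import cmath


def ac_branch_flow(v_i, delta_i, v_j, delta_j, r, x):
    """Active and reactive power flowing from bus i to bus j over impedance r + jx."""
    voltage_i = cmath.rect(v_i, delta_i)
    voltage_j = cmath.rect(v_j, delta_j)
    current = (voltage_i - voltage_j) / complex(r, x)
    s_ij = voltage_i * current.conjugate()
    return s_ij.real, s_ij.imag


def dc_branch_flow(delta_i, delta_j, x):
    """DC approximation: p_ij = delta_ij / x_ij, with q_ij neglected."""
    return (delta_i - delta_j) / x


# Small resistance, near-nominal voltages and a small angle difference (per unit, radians).
p_ac, q_ac = ac_branch_flow(1.0, 0.05, 1.0, 0.0, r=0.01, x=0.1)
p_dc = dc_branch_flow(0.05, 0.0, x=0.1)

# With a large resistance the first DC assumption is violated.
p_ac_lossy, _ = ac_branch_flow(1.0, 0.05, 1.0, 0.0, r=0.1, x=0.1)
```

<p>When the three assumptions hold, the AC and DC active power flows agree closely and the reactive flow is small. Once the resistance becomes comparable to the reactance, the gap widens, which is one reason why a DC-OPF solution may violate AC constraints.</p>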
<h3 id="computational-intelligence-and-machine-learning">Computational intelligence and machine learning</h3>
<p>The increasing volatility in grid conditions due to the integration of renewable resources (such as wind and solar) creates a desire for OPF problems to be solved in near real-time to have the most accurate state of the system.
Satisfying this desire requires grid operators to have the computational capacity to solve consecutive instances of OPF problems, and do so with fast convergence.
However, this task is rather challenging for large-scale OPF problems—even when using the DC-OPF approximation.
Therefore, as an alternative to interior-point methods, there has been intense research in the computational intelligence<sup id="fnref:AlRashidi09:1" role="doc-noteref"><a href="#fn:AlRashidi09" class="footnote" rel="footnote">2</a></sup> and machine learning<sup id="fnref:Hasan20" role="doc-noteref"><a href="#fn:Hasan20" class="footnote" rel="footnote">11</a></sup> communities.
Below we provide a list of some well-known approaches.</p>
<p>Computational intelligence<sup id="fnref:AlRashidi09:2" role="doc-noteref"><a href="#fn:AlRashidi09" class="footnote" rel="footnote">2</a></sup> techniques for solving OPF:</p>
<ul>
<li>optimization algorithms that do not require second derivatives, like the <a href="https://en.wikipedia.org/wiki/Limited-memory_BFGS">L-BFGS-B method</a>, <a href="https://en.wikipedia.org/wiki/Coordinate_descent">coordinate-descent algorithm</a>, <a href="https://en.wikipedia.org/wiki/Simulated_annealing">simulated annealing</a>, <a href="https://en.wikipedia.org/wiki/Particle_swarm_optimization">particle swarm optimization</a> and <a href="https://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms">ant colony optimization</a>;</li>
<li>evolutionary algorithms like <a href="https://en.wikipedia.org/wiki/Evolutionary_programming">evolutionary programming</a> and <a href="https://en.wikipedia.org/wiki/Evolution_strategy">evolutionary strategies</a>;</li>
<li><a href="https://en.wikipedia.org/wiki/Genetic_algorithm">genetic algorithms</a>;</li>
<li><a href="https://en.wikipedia.org/wiki/Fuzzy_set">fuzzy set theory</a>.</li>
</ul>
<p>Among machine learning algorithms<sup id="fnref:Hasan20:1" role="doc-noteref"><a href="#fn:Hasan20" class="footnote" rel="footnote">11</a></sup>, <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement</a>, <a href="https://en.wikipedia.org/wiki/Unsupervised_learning">unsupervised</a>, and <a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised</a> (both regression and classification) learning approaches can be used to solve OPF problems.
Most of these techniques reduce the computational effort of real-time prediction by performing offline training.
The advantage of taking such an approach is that, while the offline training may be time-consuming, the evaluation of the trained model is typically very fast.
The trained model can be sufficient for a period of time, but may occasionally require re-training.
These machine learning approaches are used for:</p>
<ul>
<li>direct prediction of OPF variables;</li>
<li>prediction of set-points of OPF variables for subsequent warm-starts of interior-point methods;</li>
<li>prediction of the active constraints of OPF in order to build and solve significantly smaller reduced OPF problems.</li>
</ul>
<p>In a subsequent <a href="/blog/2021/10/11/opf-nn/">blog post</a>, we will talk about using neural networks for some of the above tasks.</p>
<h2 id="appendix-a-the-economic-dispatch-problem-using-bfm">Appendix A: the economic dispatch problem using BFM</h2>
<p>Putting everything together, the form of the economic dispatch OPF with BFM is as follows:</p>
\[\begin{equation}
\begin{aligned}
& \textbf{Variables:} \\
& \quad X^{\mathrm{BFM}}
\begin{cases}
V_{i} = (v_{i}, \delta_{i}) \hspace{10.4em} \forall i \in \mathcal{N} \\
S_g^{G} = (p_{g}^{G}, q_{g}^{G}) \hspace{9.1em} \forall g \in \mathcal{G} \\
S_{ij} = (p_{ij}, q_{ij}) \hspace{7.8em} \forall (i, j) \in \mathcal{E} \\
S_{ji} = (p_{ji}, q_{ji}) \hspace{7.8em} \forall (j, i) \in \mathcal{E}^{R} \\
I_{ij}^{s} = (i_{ij}^{s}, \gamma_{ij}^{s}) \hspace{8.0em} \forall (i, j) \in \mathcal{E} \\
\end{cases} \\[1em]
& \textbf{Objective function:} \\
& \quad \min \limits_{X^{\mathrm{BFM}}} \sum \limits_{g \in \mathcal{G}} C_{g}(p_{g}^{G}) \\[1em]
& \textbf{Constraints:} \\
& \quad \mathcal{C}^{\mathrm{PB}} \hspace{0.3cm}
\begin{cases}
\sum \limits_{g \in \mathcal{G}_{i}} S_{g}^{G} - \sum \limits_{l \in \mathcal{L}_{i}} S_{l}^{L} - \sum \limits_{s \in \mathcal{S}_{i}} \left( Y_{s}^{S} \right)^{*} \lvert V_{i} \rvert^{2} \\
\hspace{0.95cm} \ = \sum \limits_{(i, j) \in \mathcal{E}} S_{ij} + \sum \limits_{(k, i) \in \mathcal{E}^{R}} S_{ki} \hspace{3.3em} \forall i \in \mathcal{N} \\
\end{cases} \\
& \quad \mathcal{C}^{\mathrm{REF}} \hspace{0.2cm}
\begin{cases}
\delta_{r} = 0 \hspace{1.9cm} \hspace{9.3em} \forall r \in \mathcal{R} \\
\end{cases} \\
& \quad \mathcal{C}^{\mathrm{BFM}} \hspace{0.1cm}
\begin{cases}
I_{ij}^{s} = Y_{ij} \left( \frac{V_{i}}{T_{ij}} - V_{j} \right) \hspace{5.2em} \forall (i, j) \in \mathcal{E} \\
S_{ij} = V_{i} \left( \frac{I_{ij}^{s}}{T_{ij}^{*}} + Y_{ij}^{C} \frac{V_{i}}{\lvert T_{ij} \rvert^{2}}\right)^{*} \hspace{2.6em} \forall (i, j) \in \mathcal{E} \\
S_{ji} = V_{j} \left( -I_{ij}^{s} + Y_{ji}^{C}V_{j} \right)^{*} \hspace{3.0em} \forall (j, i) \in \mathcal{E}^{R} \\
\end{cases} \\
& \quad \mathcal{C}^{\mathrm{INEQ}}
\begin{cases}
\underline{v}_{i} \le v_{i} \le \overline{v}_{i} \hspace{10.1em} \forall i \in \mathcal{N} \\
\underline{p}_{g}^{G} \le p_{g}^{G} \le \overline{p}_{g}^{G} \hspace{9.0em} \forall g \in \mathcal{G} \\
\underline{q}_{g}^{G} \le q_{g}^{G} \le \overline{q}_{g}^{G} \hspace{9.0em} \forall g \in \mathcal{G} \\
\underline{\delta}_{ij} \le \delta_{ij} \le \overline{\delta}_{ij} \hspace{7.6em} \forall (i, j) \in \mathcal{E} \\
\lvert p_{ij}\rvert \le \overline{p}_{ij} \hspace{9.4em} \forall (i, j) \in \mathcal{E} \cup \mathcal{E}^{R} \\
\lvert q_{ij}\rvert \le \overline{q}_{ij} \hspace{9.4em} \forall (i, j) \in \mathcal{E} \cup \mathcal{E}^{R} \\
\lvert S_{ij}\rvert^{2} = p_{ij}^{2} + q_{ij}^{2} \le \overline{S}_{ij}^{2} \hspace{3.9em} \forall (i, j) \in \mathcal{E} \cup \mathcal{E}^{R} \\
\end{cases} \\
\end{aligned}
\label{opf_bmf}
\end{equation}\]
<p>\(\mathcal{C}^{\mathrm{PB}}\), \(\mathcal{C}^{\mathrm{REF}}\), and \(\mathcal{C}^{\mathrm{BFM}}\) denote the equality sets for the power balance, the voltage angle of reference buses, and the improved BFM. \(\mathcal{C}^{\mathrm{INEQ}}\) is the set of inequality constraints.</p>
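<p>To make the three \(\mathcal{C}^{\mathrm{BFM}}\) equalities more concrete, the following is a minimal sketch that evaluates them for a single branch using Python's built-in complex numbers. The function name <code>branch_flows</code>, its argument names, and the numerical values are hypothetical and chosen only for illustration. With no transformer (\(T_{ij} = 1\)) and no line charging (\(Y_{ij}^{C} = Y_{ji}^{C} = 0\)), the sum \(S_{ij} + S_{ji}\) should reduce to the series loss \(\lvert I_{ij}^{s} \rvert^{2} / Y_{ij}\).</p>

```python
import cmath

def branch_flows(V_i, V_j, Y_ij, T_ij=1 + 0j, Yc_ij=0j, Yc_ji=0j):
    """Evaluate the BFM equalities for one branch (i, j).

    All arguments are complex numbers in per unit; the names are illustrative.
    """
    # Series current: I_ij^s = Y_ij (V_i / T_ij - V_j)
    I_s = Y_ij * (V_i / T_ij - V_j)
    # Power entering the branch at bus i
    S_ij = V_i * (I_s / T_ij.conjugate() + Yc_ij * V_i / abs(T_ij) ** 2).conjugate()
    # Power entering the branch at bus j
    S_ji = V_j * (-I_s + Yc_ji * V_j).conjugate()
    return I_s, S_ij, S_ji

# Hypothetical branch: small voltage drop and angle difference, mostly reactive line.
V_i = cmath.rect(1.02, 0.0)     # magnitude 1.02, angle 0 rad
V_j = cmath.rect(0.98, -0.05)   # magnitude 0.98, angle -0.05 rad
Z_ij = 0.01 + 0.1j              # series impedance, so Y_ij = 1 / Z_ij

I_s, S_ij, S_ji = branch_flows(V_i, V_j, 1 / Z_ij)
loss = S_ij + S_ji              # complex power dissipated in the branch
print(I_s, S_ij, S_ji, loss)
```

<p>In this simplified setting the real part of <code>loss</code> is the resistive loss of the line, which is non-negative whenever the series resistance is.</p>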
<h2 id="appendix-b-the-dc-opf-economic-dispatch-problem">Appendix B: the DC-OPF economic dispatch problem</h2>
\[\begin{equation}
\begin{aligned}
& \textbf{Variables:} \\
& \quad X^{\mathrm{DC}}
\begin{cases}
\delta_{i} \hspace{14.3em} \forall i \in \mathcal{N} \\
p_{g}^{G} \hspace{13.8em} \forall g \in \mathcal{G} \\
p_{ij} \hspace{12.3em} \forall (i, j) \in \mathcal{E} \\
\end{cases} \\[1em]
& \textbf{Objective function:} \\
& \quad \min \limits_{X^{\mathrm{DC}}} \sum \limits_{g \in \mathcal{G}} C_{g}(p_{g}^{G}) \\[1em]
& \textbf{Constraints:} \\
& \quad \mathcal{C}^{\mathrm{PB}} \hspace{0.3cm}
\begin{cases}
\sum \limits_{g \in \mathcal{G}_{i}} p_{g}^{G} - \sum \limits_{l \in \mathcal{L}_{i}} p_{l}^{L} = \sum \limits_{(i, j) \in \mathcal{E}} p_{ij} \hspace{3.4em} \forall i \in \mathcal{N} \\
\end{cases} \\
& \quad \mathcal{C}^{\mathrm{REF}} \hspace{0.15cm}
\begin{cases}
\delta_{r} = 0 \hspace{1.9cm} \hspace{8.5em} \forall r \in \mathcal{R} \\
\end{cases} \\
& \quad \mathcal{C}^{\mathrm{DC}} \hspace{0.3cm}
\begin{cases}
p_{ij} = \frac{\delta_{ij}}{x_{ij}} \hspace{9.4em} \forall (i, j) \in \mathcal{E} \\
\end{cases} \\
& \quad \mathcal{C}^{\mathrm{INEQ}}
\begin{cases}
\underline{v}_{i} \le v_{i} \le \overline{v}_{i} \hspace{9.2em} \forall i \in \mathcal{N} \\
\underline{p}_{g}^{G} \le p_{g}^{G} \le \overline{p}_{g}^{G} \hspace{8.1em} \forall g \in \mathcal{G} \\
\underline{\delta}_{ij} \le \delta_{ij} \le \overline{\delta}_{ij} \hspace{6.7em} \forall (i, j) \in \mathcal{E} \\
\lvert p_{ij}\rvert \le \overline{p}_{ij} \hspace{8.5em} \forall (i, j) \in \mathcal{E} \\
\end{cases} \\
\end{aligned}
\label{opf_dc}
\end{equation}\]
<p>\(\mathcal{C}^{\mathrm{PB}}\), \(\mathcal{C}^{\mathrm{REF}}\), and \(\mathcal{C}^{\mathrm{DC}}\) denote the equality sets for the power balance, the voltage angle of reference buses, and the DC-OPF power flow. \(\mathcal{C}^{\mathrm{INEQ}}\) is the set of inequality constraints.</p>
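<p>As an illustration of how the equality constraints interact, the sketch below solves a DC power <em>flow</em> (not the full optimization) on a hypothetical three-bus network with NumPy. The topology, reactances, and injections are made-up values. For fixed net injections, the reference constraint \(\delta_{r} = 0\) and the relation \(p_{ij} = \delta_{ij} / x_{ij}\) determine the angles and flows, and the power balance can then be checked directly.</p>

```python
import numpy as np

# Hypothetical 3-bus network: edges (from, to) and their reactances x_ij (per unit).
edges = [(0, 1), (0, 2), (1, 2)]
x = np.array([0.1, 0.2, 0.25])
# Net injection per bus (generation minus load); it sums to zero in a lossless model.
injection = np.array([1.5, -0.5, -1.0])
ref = 0  # reference bus, delta_ref = 0

n_bus = 3
# Edge-bus incidence matrix: +1 at the "from" bus, -1 at the "to" bus.
A = np.zeros((len(edges), n_bus))
for e, (i, j) in enumerate(edges):
    A[e, i], A[e, j] = 1.0, -1.0

# Laplacian weighted by 1 / x_ij, so that injection = B @ delta.
B = A.T @ np.diag(1.0 / x) @ A

# Impose the reference constraint and solve for the remaining angles.
keep = [b for b in range(n_bus) if b != ref]
delta = np.zeros(n_bus)
delta[keep] = np.linalg.solve(B[np.ix_(keep, keep)], injection[keep])

# DC power flow constraint: p_ij = (delta_i - delta_j) / x_ij.
flows = (A @ delta) / x

# Power balance: net injection at each bus equals the net flow leaving it.
balance_residual = A.T @ flows - injection
print(delta, flows, balance_residual)
```

<p>A DC-OPF solver would additionally treat the generator outputs as variables and enforce the inequality constraints; here they are fixed to keep the example short.</p>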
<hr />
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:Coffrin18" role="doc-endnote">
<p>C. Coffrin, R. Bent, K. Sundar, Y. Ng and M. Lubin, <a href="https://arxiv.org/abs/1711.01728">“PowerModels.jl: An Open-Source Framework for Exploring Power Flow Formulations”</a>, <em>Power Systems Computation Conference (PSCC)</em>, pp. 1, (2018). <a href="#fnref:Coffrin18" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:AlRashidi09" role="doc-endnote">
<p>M. R. AlRashidi and M. E. El-Hawary, <a href="https://www.sciencedirect.com/science/article/abs/pii/S0378779608002757">“Applications of computational intelligence techniques for solving the revived optimal power flow problem”</a>, <em>Electric Power Systems Research</em>, <strong>79</strong>, (2009). <a href="#fnref:AlRashidi09" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:AlRashidi09:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a> <a href="#fnref:AlRashidi09:2" class="reversefootnote" role="doc-backlink">↩<sup>3</sup></a></p>
</li>
<li id="fn:Gholami14" role="doc-endnote">
<p>A. Gholami, J. Ansari, M. Jamei and A. Kazemi, <a href="https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/iet-gtd.2014.0235">“Environmental/economic dispatch incorporating renewable energy sources and plug-in vehicles”</a>, <em>IET Generation, Transmission & Distribution</em>, <strong>8</strong>, pp. 2183, (2014). <a href="#fnref:Gholami14" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Nocedal06" role="doc-endnote">
<p>J. Nocedal and S. J. Wright, <a href="https://link.springer.com/book/10.1007/978-0-387-40065-5">“Numerical Optimization”</a>, <em>New York: Springer</em>, (2006). <a href="#fnref:Nocedal06" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Nocedal06:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
<li id="fn:Wachter06" role="doc-endnote">
<p>A. Wächter and L. Biegler, <a href="https://link.springer.com/article/10.1007/s10107-004-0559-y">“On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming”</a>, <em>Mathematical Programming</em>, <strong>106</strong>, pp. 25, (2006). <a href="#fnref:Wachter06" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Boyd04" role="doc-endnote">
<p>S. Boyd and L. Vandenberghe, <a href="https://web.stanford.edu/~boyd/cvxbook/">“Convex Optimization”</a>, <em>New York: Cambridge University Press</em>, (2004). <a href="#fnref:Boyd04" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Low14" role="doc-endnote">
<p>S. H. Low, <a href="https://ieeexplore.ieee.org/document/6756976">“Convex Relaxation of Optimal Power Flow—Part I: Formulations and Equivalence”</a>, <em>IEEE Transactions on Control of Network Systems</em>, <strong>1</strong>, pp. 15, (2014). <a href="#fnref:Low14" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Low14a" role="doc-endnote">
<p>S. H. Low, <a href="https://ieeexplore.ieee.org/document/6815671">“Convex Relaxation of Optimal Power Flow—Part II: Exactness”</a>, <em>IEEE Transactions on Control of Network Systems</em>, <strong>1</strong>, pp. 177, (2014). <a href="#fnref:Low14a" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Liu09" role="doc-endnote">
<p>H. Liu, L. Tesfatsion and A. A. Chowdhury, <a href="https://ieeexplore.ieee.org/document/5275503">“Locational marginal pricing basics for restructured wholesale power markets”</a>, <em>IEEE Power & Energy Society General Meeting</em>, pp. 1, (2009). <a href="#fnref:Liu09" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Baker19" role="doc-endnote">
<p>K. Baker, <a href="https://arxiv.org/abs/1912.00319">“Solutions of DC OPF are Never AC Feasible”</a>, <em>arXiv:1912.00319</em>, (2019). <a href="#fnref:Baker19" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:Hasan20" role="doc-endnote">
<p>F. Hasan, A. Kargarian and A. Mohammadi, <a href="https://ieeexplore.ieee.org/document/9042547">“A Survey on Applications of Machine Learning for Optimal Power Flow”</a>, <em>IEEE Texas Power and Energy Conference (TPEC)</em>, pp. 1, (2020). <a href="#fnref:Hasan20" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:Hasan20:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
</ol>
</div>
Letif Mones
In an earlier blog post, we discussed the power flow problem, which serves as the key component of a much more challenging task: the optimal power flow (OPF). OPF is an umbrella term that covers a wide range of constrained optimization problems, the most important ingredients of which are: variables that optimize an objective function, some equality constraints, including the power balance and power flow equations, and inequality constraints, including bounds on the variables. The sets of variables and constraints, as well as the form of the objective, will vary depending on the type of OPF.
Scaling multi-output Gaussian process models with exact inference
2021-03-19T00:00:00+00:00 2021-03-19T00:00:00+00:00
https://invenia.github.io/blog/2021/03/19/OILMM-pt2
<p>In our <a href="/blog/2021/02/19/OILMM-pt1/">previous post</a>, we explained that multi-output Gaussian processes (MOGPs) are not fundamentally different from their single-output counterparts. We also introduced the <em>Mixing Model Hierarchy</em> (MMH), which is a broad class of MOGPs that covers several popular and powerful models from the literature. In this post, we will take a closer look at the central model from the MMH, the <em>Instantaneous Linear Mixing Model</em> (ILMM). We will discuss the linear algebra tricks that make inference in this model much cheaper than for general MOGPs. Then, we will take an alternative and more intuitive approach and use it as motivation for a yet better scaling model, the <em><a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Orthogonal Instantaneous Linear Mixing Model (OILMM)</a></em>. Like most linear MOGPs, the OILMM represents data in a smaller-dimensional subspace; but contrary to most linear MOGPs, in practice the OILMM scales <em>linearly</em> with the dimensionality of this subspace, retaining exact inference.</p>
<p>Check out our <a href="/blog/2021/02/19/OILMM-pt1/">previous post</a> for a brief intro to MOGPs. We also recommend checking out the definition of the MMH from that post, but this post should be self-sufficient for those familiar with MOGPs. Some familiarity with linear algebra is also assumed. We start with the definition of the ILMM.</p>
<h2 id="the-instantaneous-linear-mixing-model">The Instantaneous Linear Mixing Model</h2>
<p>Some data sets can exhibit a low-dimensional structure, where the data set presents an <a href="https://en.wikipedia.org/wiki/Intrinsic_dimension">intrinsic dimensionality</a> that is significantly lower than the dimensionality of the data. Imagine a data set consisting of coordinates in a 3D space. If the points in this data set form a single straight line, then the data is intrinsically one-dimensional, because each point can be represented by a single number once the supporting line is known. Mathematically, we say that the data lives in a lower dimensional (linear) subspace.</p>
<p>While this lower-dimensionality property may not be exactly true for most real data sets, many large real data sets frequently exhibit approximately low-dimensional structure, as discussed by <a href="https://epubs.siam.org/doi/pdf/10.1137/18M1183480">Udell & Townsend (2019)</a>. In such cases, we can represent the data in a lower-dimensional space without losing a significant part of the information. There is a large field in statistics dedicated to identifying suitable lower-dimensional representations of data (e.g. <a href="https://cseweb.ucsd.edu/~saul/papers/sde_cvpr04.pdf">Weinberger & Saul (2004)</a>) and assessing their quality (e.g. <a href="https://ieeexplore.ieee.org/abstract/document/8017645">Xia <em>et al.</em> (2017)</a>). These <a href="https://en.wikipedia.org/wiki/Dimensionality_reduction">dimensionality reduction</a> techniques play an important role in computer vision (see <a href="https://ieeexplore.ieee.org/document/1177153">Basri & Jacobs (2003)</a> for an example), and in other fields (a <a href="https://www.sciencedirect.com/science/article/abs/pii/S0005109807003950">paper by Markovsky (2008)</a> contains an overview).</p>
<p>The Instantaneous Linear Mixing Model (ILMM) is a simple model that appears in many fields, e.g. machine learning (<a href="https://papers.nips.cc/paper/2007/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf">Bonilla <em>et al.</em> (2007)</a> and <a href="https://arxiv.org/abs/1702.08530">Dezfouli <em>et al.</em> (2017)</a>), signal processing (<a href="https://ieeexplore.ieee.org/abstract/document/4505467">Osborne <em>et al.</em> (2008)</a>), and geostatistics (<a href="https://www.researchgate.net/publication/224839861_Geostatistics_for_Natural_Resource_Evaluation">Goovaerts (1997)</a>). The model represents data in a lower-dimensional linear subspace—not unlike <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>— which implies the model’s covariance matrix is low-rank. As we will discuss in the next sections, this low-rank structure can be exploited for efficient inference. In the ILMM, the observations are described as a linear combination of <em>latent processes</em> (i.e. unobserved stochastic processes), which are modelled as single-output GPs.</p>
<p>If we denote our observations as \(y\), the ILMM models the data according to the following generative model: \(y(t) \cond x(t), H = Hx(t) + \epsilon(t)\). Here, \(y(t) \cond x(t), H\) denotes the value of \(y(t)\) given a known \(H\) and \(x(t)\); \(H\) is a matrix of weights, which we call the <em>mixing matrix</em>; \(\epsilon(t)\) is Gaussian noise; and \(x(t)\) represents the (time-dependent) latent processes, described as independent GPs. Note that we use \(x\) here to denote an unobserved (latent) stochastic process, not an input, which is represented by \(t\). Using an ILMM with \(m\) latent processes to model \(p\) outputs, \(y(t)\) is \(p\)-dimensional, \(x(t)\) is \(m\)-dimensional, and \(H\) is a \(p \times m\) matrix. Since the latent processes are GPs, and <a href="https://www.statlect.com/probability-distributions/normal-distribution-linear-combinations">Gaussian random variables are closed under linear combinations</a>, \(y(t)\) is also a GP. This means that the usual closed-form <a href="https://en.wikipedia.org/wiki/Gaussian_process#Gaussian_process_prediction,_or_Kriging">formulae for inference</a> in GPs can be used. However, naively evaluating these formulae is not computationally efficient, because a large covariance matrix has to be inverted, which is a significant bottleneck. In the next section we discuss tricks to speed up this inversion.</p>
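<p>The generative model above can be sketched in a few lines of NumPy. This is only an illustration under assumed settings: the squared-exponential kernel, the length scales, the noise level, and the helper names <code>rbf_kernel</code> and <code>sample_gp</code> are our own choices, not part of the model definition. The sketch draws \(m\) independent latent GPs, mixes them with a random \(H\), and adds Gaussian noise.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, m = 50, 5, 2              # time points, outputs, latent processes
t = np.linspace(0.0, 1.0, n)

def rbf_kernel(t, lengthscale):
    """Squared-exponential kernel matrix over the inputs t (an assumed choice)."""
    diff = t[:, None] - t[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

def sample_gp(K, rng, jitter=1e-6):
    """Draw one sample from N(0, K); the jitter keeps the Cholesky factorisation stable."""
    chol = np.linalg.cholesky(K + jitter * np.eye(len(K)))
    return chol @ rng.normal(size=len(K))

# Independent latent processes x_1(t), ..., x_m(t), stacked as rows: shape (m, n).
X = np.stack([sample_gp(rbf_kernel(t, ls), rng) for ls in (0.1, 0.3)])

H = rng.normal(size=(p, m))     # mixing matrix, shape (p, m)
noise_std = 0.1

F = H @ X                                     # noise-free outputs, shape (p, n)
Y = F + noise_std * rng.normal(size=(p, n))   # observations y(t) = H x(t) + eps(t)

print(Y.shape, np.linalg.matrix_rank(F))
```

<p>Although \(F\) has \(p\) rows, its rank is only \(m\), which is the low-dimensional structure that the following sections exploit.</p>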
<h2 id="leveraging-the-matrix-inversion-lemma">Leveraging the matrix inversion lemma</h2>
<p>When working with GPs, the computational bottleneck is typically the inversion of the covariance matrix over the training input locations. For the case of the ILMM, observing \(p\) outputs, each at \(n\) input locations, we require the inversion of an \(np \times np\) matrix. This operation quickly becomes computationally intractable. What sets the ILMM apart from the general MOGP case is that the covariance matrices generated via an ILMM have some structure that can be exploited.</p>
<p>There is a useful and widely used result from linear algebra that allows us to exploit this structure, known as the <a href="https://en.wikipedia.org/wiki/Woodbury_matrix_identity">matrix inversion lemma</a> (also known as the Sherman–Morrison–Woodbury formula, or simply as the Woodbury matrix formula). This lemma comes in handy whenever we want to invert a matrix that can be written as the sum of a low-rank matrix and a diagonal one.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Whereas typically the inversion of such matrices scales with the size of the matrix, the lemma cleverly allows the inversion operation to scale with the rank of the low-rank matrix instead. Therefore, if the rank of the low-rank matrix is much smaller than the size of the matrix, the lemma can enable significant computational speed-ups. We can show that the covariance of the ILMM can be written as a sum of a low-rank and diagonal matrix, which means that the matrix inversion lemma can be applied.</p>
<p>For an ILMM that has \(n\) observations for each of \(p\) outputs and uses \(m\) latent processes, the covariance matrix has size \(np \times np\), but its low-rank part has rank at most \(nm\). Thus, by choosing an \(m\) that is smaller than \(p\) and using the matrix inversion lemma, we only need to invert an \(nm \times nm\) matrix, which decreases the memory and time costs associated with the matrix inversion.</p>
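<p>The lemma itself is easy to check numerically. The sketch below uses generic matrices rather than an actual ILMM covariance: a diagonal matrix \(D\) plus a low-rank term \(U C U^\top\), with sizes and values picked arbitrarily. It compares a direct inverse with the Woodbury form \(D^{-1} - D^{-1} U (C^{-1} + U^\top D^{-1} U)^{-1} U^\top D^{-1}\), in which the only dense system solved has the size of the rank.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N, r = 200, 5                      # matrix size and rank of the low-rank part

U = rng.normal(size=(N, r))
C = np.diag(rng.uniform(0.5, 2.0, size=r))   # r x r positive-definite core
d = rng.uniform(0.5, 2.0, size=N)            # diagonal of D

# Direct route: build the full N x N matrix and invert it.
A = np.diag(d) + U @ C @ U.T
A_inv_direct = np.linalg.inv(A)

# Woodbury route: only an r x r system is solved.
Dinv_U = U / d[:, None]                      # D^{-1} U, using that D is diagonal
small = np.linalg.inv(C) + U.T @ Dinv_U      # r x r matrix
A_inv_woodbury = np.diag(1.0 / d) - Dinv_U @ np.linalg.solve(small, Dinv_U.T)

max_abs_error = np.max(np.abs(A_inv_direct - A_inv_woodbury))
print(max_abs_error)
```

<p>In practice one would avoid forming the full inverse at all and apply the Woodbury expression to a vector, but the explicit matrices make the comparison straightforward.</p>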
<p>This is not the first time we have leveraged the matrix inversion lemma to make computations more efficient; see <a href="/blog/2021/01/19/linear-models-with-stheno-and-jax/#making-inference-fast">our post on linear models from a GP point of view</a> for another example. The ubiquity of models that represent data in lower-dimensional linear subspaces makes the use of this lemma widespread. However, this approach requires careful application of linear algebra tricks, and obfuscates the <em>reason</em> why it is even possible to get such a speed-up. In the next section we show an alternative view, which is more intuitive and leads to the same performance improvements.</p>
<h2 id="an-alternative-formulation">An alternative formulation</h2>
<p>Instead of focusing on linear algebra tricks, let’s try to understand <em>why</em> we can even reduce the complexity of the inversion. The ILMM, as a general GP, scales poorly with the number of observations; and the larger the number of outputs, the larger the number of observations. However, the ILMM makes the modelling assumption that the observations can be represented in a lower-dimensional space. Intuitively, that means that every observation contains a lot of redundant information, because it can be summarised by a much lower-dimensional representation. If it were possible to somehow extract these lower-dimensional representations and use them as observations instead, then that could lead to very appealing computational gains. The challenge here is that we don’t have access to the lower-dimensional representations, but we can try to estimate them.</p>
<p>Recall that the ILMM is a probabilistic model that connects every observation \(y(t)\) to a set of latent, unobserved variables \(x(t)\), defining the lower-dimensional representation. It is this lower-dimensional \(x(t)\) that we want to estimate. A natural choice is to find the <a href="https://en.wikipedia.org/wiki/Maximum_likelihood_estimation">maximum likelihood estimate</a> (MLE) \(T(y(t))\) of \(x(t)\) given the observations \(y(t)\):</p>
\[\begin{equation}
T(y(t)) = \underset{x(t)}{\mathrm{argmax}} \, p(y(t) \cond x(t)).
\end{equation}\]
<p>The solution to the equation above is \(T = (H^\top \Sigma^{-1} H)^{-1} H^\top \Sigma^{-1}\) (see prop. 2 of appendix D from <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma <em>et al.</em> (2020)</a>), where \(H\) is the mixing matrix, and \(\Sigma\) is the noise covariance.</p>
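<p>A small numerical check of this expression, assuming a randomly generated \(H\) and a random positive-definite \(\Sigma\) (both hypothetical): the projection satisfies \(TH = I_m\), and at \(\hat{x} = Ty\) the gradient of the Gaussian log-likelihood with respect to \(x\), which is \(H^\top \Sigma^{-1} (y - H\hat{x})\), vanishes.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
p, m = 6, 2

H = rng.normal(size=(p, m))                 # mixing matrix
L = rng.normal(size=(p, p))
Sigma = L @ L.T + p * np.eye(p)             # random positive-definite noise covariance
Sigma_inv = np.linalg.inv(Sigma)

# T = (H^T Sigma^{-1} H)^{-1} H^T Sigma^{-1}, computed with a solve instead of an inverse.
T = np.linalg.solve(H.T @ Sigma_inv @ H, H.T @ Sigma_inv)   # shape (m, p)

# One noisy observation y = H x + eps at a single time point.
x_true = rng.normal(size=m)
y = H @ x_true + rng.multivariate_normal(np.zeros(p), Sigma)

x_hat = T @ y                               # maximum likelihood estimate of x
grad = H.T @ Sigma_inv @ (y - H @ x_hat)    # gradient of log p(y | x) at x_hat

print(T @ H)
print(grad)
```

<p>Since the log-likelihood is a concave quadratic in \(x\), a vanishing gradient identifies the maximiser.</p>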
<p>The advantage of working with \(Ty(t)\) instead of with \(y(t)\) is that \(y(t)\) comprises \(p\) outputs, while \(Ty(t)\) presents only \(m\) outputs. Thus, conditioning on \(Ty(t)\) is computationally cheaper because \(m < p\). When conditioning on \(n\) observations, this approach brings the memory cost from \(\mathcal{O}(n^2p^2)\) to \(\mathcal{O}(n^2m^2)\), and the time cost from \(\mathcal{O}(n^3p^3)\) to \(\mathcal{O}(n^3m^3)\).<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> These savings are identical to those we get by using the matrix inversion lemma, as discussed in the previous section.</p>
<p>In general, we cannot arbitrarily transform observations and use the resulting transformed data as observations instead. We must show that our proposed procedure is valid, i.e. that conditioning on \(Ty(t)\) is equivalent to conditioning on \(y(t)\). We do this by showing that (see <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma <em>et al.</em> (2020)</a>),<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> under the ILMM, \(Ty(t)\) is a <em><a href="https://en.wikipedia.org/wiki/Sufficient_statistic">sufficient statistic</a></em> for \(x(t)\) given \(y\). This is only possible because of the particular structure of the ILMM.</p>
<p>A sufficient statistic \(T(y)\) is a function of the data which is defined with respect to a probabilistic model, \(p(y \cond \theta)\), and an associated unknown parameter \(\theta\). When a statistic is sufficient, it means that computing it over a set of observations, \(y\), provides us with all the information about \(\theta\) that can be extracted from those observations (under that probabilistic model). Thus, there is no other quantity we can compute over \(y\) that will increase our knowledge of \(\theta\). Formally, that is to say that \(p(\theta \cond y) = p(\theta \cond T(y))\). For the ILMM, as \(Ty(t)\) is a sufficient statistic for \(x(t)\), we have that \(p(x(t) \cond y(t)) = p(x(t) \cond Ty(t))\). This property guarantees that the procedure of conditioning the model on the summary of the observations is mathematically valid.</p>
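<p>The identity \(p(x(t) \cond y(t)) = p(x(t) \cond Ty(t))\) can be illustrated at a single time point, where everything is a finite-dimensional Gaussian. In the sketch below, the prior covariance <code>K</code>, the matrices <code>H</code> and <code>Sigma</code>, and the helper <code>condition</code> are illustrative assumptions. We condition once on the full \(p\)-dimensional observation and once on the \(m\)-dimensional summary, whose noise covariance is \(T \Sigma T^\top\), and compare the two posteriors.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
p, m = 6, 2

H = rng.normal(size=(p, m))
L = rng.normal(size=(p, p))
Sigma = L @ L.T + p * np.eye(p)        # noise covariance
K = np.diag([1.5, 0.7])                # prior covariance of x(t) (independent latents)

Sigma_inv = np.linalg.inv(Sigma)
T = np.linalg.solve(H.T @ Sigma_inv @ H, H.T @ Sigma_inv)

def condition(K, A, R, z):
    """Posterior mean and covariance of x ~ N(0, K) given z = A x + e, e ~ N(0, R)."""
    gain = K @ A.T @ np.linalg.inv(A @ K @ A.T + R)
    return gain @ z, K - gain @ A @ K

y = rng.normal(size=p)                 # any observation vector will do

# Condition on the full observation y (a p x p system) ...
mean_full, cov_full = condition(K, H, Sigma, y)
# ... and on the summary T y (an m x m system).
mean_proj, cov_proj = condition(K, T @ H, T @ Sigma @ T.T, T @ y)

print(mean_full, mean_proj)
```

<p>Both routes give the same posterior mean and covariance, but the second one only ever manipulates \(m \times m\) matrices once \(Ty\) has been computed.</p>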
<p>The choice of \(m < p\) is exactly the choice of imposing a low-rank structure onto the model; and the lower the rank (controlled by \(m\), the number of latent processes), the more parsimonious is \(Ty\), the summary of our observations.</p>
<p>Besides being more intuitive, this approach based on the sufficient statistic makes it easy to see how introducing a simple constraint over the mixing matrix \(H\) will allow us to scale the model even further, and obtain linear scaling on the number of latent processes, \(m\).<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup> That’s what we discuss next.</p>
<h2 id="the-orthogonal-instantaneous-linear-mixing-model">The Orthogonal Instantaneous Linear Mixing Model</h2>
<p>Although the ILMM scales much better than a general MOGP, the ILMM still has a cubic (quadratic) scaling on \(m\) for time (memory), which quickly becomes intractable for large systems. We therefore want to identify a subset of the ILMMs that scale even more favourably. We show that a very simple restriction over the mixing matrix \(H\) can lead to linear scaling on \(m\) for both time and memory. We call this model the <em>Orthogonal Instantaneous Linear Mixing Model</em> (OILMM). The plots below show a comparison of the time and memory scaling for the ILMM vs the OILMM as \(m\) grows.</p>
<p><img src="/blog/public/images/oilmm2_scaling.png" alt="OILMM_scaling" />
Figure 1: Runtime (left) and memory usage (right) of the ILMM and OILMM for computing the evidence of \(n = 1500\) observations for \(p = 200\) outputs.</p>
<p>Let’s return to the definition of the ILMM, as discussed in the first section: \(y(t) \cond x(t), H = Hx(t) + \epsilon(t)\). Because \(\epsilon(t)\) is Gaussian noise, we can write that \(y(t) \cond x(t), H \sim \mathrm{GP}(Hx(t), \delta_{tt'}\Sigma)\), where \(\delta_{tt'}\) is the <a href="https://en.wikipedia.org/wiki/Kronecker_delta">Kronecker delta</a>. Because we know that \(Ty(t)\) is a sufficient statistic for \(x(t)\) under this model, we know that \(p(x \cond y(t)) = p(x \cond Ty(t))\). Then what is the distribution of \(Ty(t)\)? A simple calculation gives \(Ty(t) \cond x(t), H \sim \mathrm{GP}(THx(t), \delta_{tt'}T\Sigma T^\top)\). The crucial thing to notice here is that the summarised observations \(Ty(t)\) <em>only</em> couple the latent processes \(x\) via the noise matrix \(T\Sigma T^\top\). If that matrix were diagonal, then observations would not couple the latent processes, and we could condition each of them individually, which is <em>much</em> more computationally efficient.</p>
<p>This is the insight for the <em>Orthogonal Instantaneous Linear Mixing Model</em> (OILMM). If we let \(\Sigma = \sigma^2I_p\), then it can be shown that \(T\Sigma T^\top\) is diagonal if and only if the columns of \(H\) are orthogonal (Prop. 6 from the paper by <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma <em>et al.</em> (2020)</a>), which means that \(H\) can be written as \(H = US^{1/2}\), with \(U\) a matrix of orthonormal columns, and \(S > 0\) diagonal. Because the columns of \(H\) are orthogonal, we name the OILMM <em>orthogonal</em>. In summary: by restricting the columns of the mixing matrix in an ILMM to be orthogonal, we make it possible to <strong>treat each latent process as an independent, single-output GP problem</strong>.</p>
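<p>To see the effect of the orthogonality constraint numerically, the following sketch builds one mixing matrix of the form \(H = US^{1/2}\) (with \(U\) obtained from a QR factorisation) and one unconstrained \(H\), and computes the projected noise \(T\Sigma T^\top\) for \(\Sigma = \sigma^2 I_p\) in both cases. The sizes, the diagonal of \(S\), and the helper names are arbitrary illustrative choices.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, sigma2 = 6, 3, 0.25

def projected_noise(H, Sigma):
    """Return T Sigma T^T with T = (H^T Sigma^{-1} H)^{-1} H^T Sigma^{-1}."""
    Sigma_inv = np.linalg.inv(Sigma)
    T = np.linalg.solve(H.T @ Sigma_inv @ H, H.T @ Sigma_inv)
    return T @ Sigma @ T.T

def off_diag_max(M):
    """Largest absolute off-diagonal entry of a square matrix."""
    return np.max(np.abs(M - np.diag(np.diag(M))))

# OILMM-style mixing matrix: orthonormal columns U scaled by S^{1/2}.
U, _ = np.linalg.qr(rng.normal(size=(p, m)))   # U has shape (p, m), U^T U = I
S = np.diag([3.0, 2.0, 0.5])
H_orth = U @ np.sqrt(S)

# Unconstrained (general ILMM) mixing matrix.
H_gen = rng.normal(size=(p, m))

Sigma = sigma2 * np.eye(p)
noise_orth = projected_noise(H_orth, Sigma)
noise_gen = projected_noise(H_gen, Sigma)

print(off_diag_max(noise_orth), off_diag_max(noise_gen))
```

<p>For the orthogonal case the projected noise comes out as \(\sigma^2 S^{-1}\), so each latent process sees its own independent noise level; for the unconstrained case the off-diagonal entries are non-zero.</p>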
<p>The result actually is a bit more general: we can allow any observation noise of the form \(\sigma^2I_p + H D H^\top\), with \(D>0\) diagonal. Thus, it is possible to have a non-diagonal noise matrix, i.e., noise that is correlated across different outputs, and still be able to decouple the latent processes and retain all computational gains from the OILMM (which we discuss next).</p>
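<p>The more general noise form can be checked in the same way. Substituting \(H = US^{1/2}\) and \(\Sigma = \sigma^2 I_p + HDH^\top\) into the expression for \(T\), a short calculation (ours, not quoted from the paper) suggests that \(T = S^{-1/2}U^\top\) and \(T\Sigma T^\top = \sigma^2 S^{-1} + D\). The sketch below verifies this for arbitrary illustrative values of \(S\) and \(D\).</p>

```python
import numpy as np

rng = np.random.default_rng(5)
p, m, sigma2 = 6, 3, 0.25

U, _ = np.linalg.qr(rng.normal(size=(p, m)))   # orthonormal columns
s = np.array([3.0, 2.0, 0.5])                  # diagonal of S
d = np.array([0.4, 0.1, 0.9])                  # diagonal of D
H = U * np.sqrt(s)                             # same as U @ diag(sqrt(s))

# Noise that is correlated across outputs: Sigma is a full p x p matrix.
Sigma = sigma2 * np.eye(p) + H @ np.diag(d) @ H.T
Sigma_inv = np.linalg.inv(Sigma)

T = np.linalg.solve(H.T @ Sigma_inv @ H, H.T @ Sigma_inv)
projected_noise = T @ Sigma @ T.T
expected = np.diag(sigma2 / s + d)             # sigma^2 S^{-1} + D

print(np.round(projected_noise, 6))
```

<p>Even though \(\Sigma\) itself is not diagonal, the noise seen by the projected observations is, so the latent processes still decouple.</p>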
<p>Computationally, the OILMM approach allows us to go from a cost of \(\mathcal{O}(n^3m^3)\) in time and \(\mathcal{O}(n^2m^2)\) in memory, for a regular ILMM, to \(\mathcal{O}(n^3m)\) in time and \(\mathcal{O}(n^2m)\) in memory. This is because now the problem reduces to \(m\) independent single-output GP problems.<sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> The figure below explains the inference process under the ILMM and the OILMM, with the possible paths and the associated costs.</p>
<p><img src="/blog/public/images/oilmm2_inference.png" alt="inference_process" />
Figure 2: Commutative diagrams depicting that conditioning on \(Y\) in the ILMM (left) and OILMM (right) is equivalent to conditioning respectively on \(TY\) and independently every \(x_i\) on \((TY)_{i:}\), but yield different computational complexities. The reconstruction costs assume computation of the marginals.</p>
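<p>The two routes for the OILMM can be compared on a small synthetic problem. The sketch below is not the authors' implementation: the kernels, sizes, and data are made up, and only posterior means of the latent processes are computed. Route 1 conditions jointly on all \(np\) observations; route 2 projects the data with \(T = S^{-1/2}U^\top\) and then solves \(m\) independent single-output GP problems, each with noise variance \(\sigma^2 / S_{ii}\).</p>

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, m, sigma2 = 30, 5, 2, 0.1
t = np.linspace(0.0, 1.0, n)

def rbf_kernel(t, lengthscale):
    """Squared-exponential kernel matrix (an assumed choice of latent kernel)."""
    diff = t[:, None] - t[None, :]
    return np.exp(-0.5 * (diff / lengthscale) ** 2)

kernels = [rbf_kernel(t, 0.1), rbf_kernel(t, 0.4)]   # one n x n matrix per latent

# OILMM mixing matrix H = U S^{1/2} with orthonormal U.
U, _ = np.linalg.qr(rng.normal(size=(p, m)))
s = np.array([2.0, 0.5])
H = U * np.sqrt(s)

Y = rng.normal(size=(p, n))    # arbitrary data; the equivalence holds for any Y

# Route 1: joint conditioning, which solves one (n p) x (n p) system.
K_joint = np.zeros((m * n, m * n))
for i, K in enumerate(kernels):
    K_joint[i * n:(i + 1) * n, i * n:(i + 1) * n] = K   # independent latents a priori
A = np.kron(H, np.eye(n))                               # maps vec(X) to vec(H X)
cov_y = A @ K_joint @ A.T + sigma2 * np.eye(p * n)
mean_joint = (K_joint @ A.T @ np.linalg.solve(cov_y, Y.reshape(-1))).reshape(m, n)

# Route 2: project, then solve m independent n x n systems.
T = (U / np.sqrt(s)).T                                  # S^{-1/2} U^T, shape (m, p)
TY = T @ Y
mean_indep = np.stack([
    K @ np.linalg.solve(K + (sigma2 / s[i]) * np.eye(n), TY[i])
    for i, K in enumerate(kernels)
])

print(np.max(np.abs(mean_joint - mean_indep)))
```

<p>The largest linear system shrinks from \(np \times np\) to \(n \times n\), solved \(m\) times, which is where the \(\mathcal{O}(n^3m)\) time cost comes from.</p>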
<p>If we take the view of an (O)ILMM representing every data point \(y(t)\) via a set of basis vectors \(h_1, \ldots, h_m\) (the columns of the mixing matrix) and a set of time-dependent coefficients \(x_1(t), \ldots, x_m(t)\) (the latent processes), the difference between an ILMM and an OILMM is that in the latter the coordinate system is chosen to be orthogonal, as is common practice in most fields. This insight is illustrated below.</p>
<p><img src="/blog/public/images/oilmm2_basis.png" alt="basis_sets" style="width:400px;margin:0 auto 0 auto;" />
Figure 3: Illustration of the difference between the ILMM and OILMM. The trajectory of a particle (dashed line) in two dimensions is modelled by the ILMM (blue) and OILMM (orange). The noise-free position \(f(t)\) is modelled as a linear combination of basis vectors \(h_1\) and \(h_2\) with coefficients \(x_1(t)\) and \(x_2(t)\) (two independent GPs). In the OILMM, the basis vectors \(h_1\) and \(h_2\) are constrained to be orthogonal; in the ILMM, \(h_1\) and \(h_2\) are unconstrained.</p>
<p>Another important difference between a general ILMM and an OILMM is that, while in both cases the latent processes are independent <em>a priori</em>, only for an OILMM do they remain so <em>a posteriori</em>. Besides the computational gains we already mentioned, this property also improves interpretability, as the posterior latent marginal distributions can be inspected (and plotted) independently. In comparison, inspecting only the marginal distributions in a general ILMM would neglect the correlations between them, obscuring the interpretation.</p>
<p>Finally, the fact that an OILMM problem is really just a set of single-output GP problems makes the OILMM immediately compatible with any single-output GP approach. This allows us to trivially use powerful approaches, like sparse GPs (as detailed in the paper by <a href="http://proceedings.mlr.press/v5/titsias09a/titsias09a.pdf">Titsias (2009)</a>), or state-space approximations (as presented in the freely available book by <a href="https://users.aalto.fi/~asolin/sde-book/sde-book.pdf">Särkkä & Solin</a>), for scaling to extremely large data sets. We have illustrated this by using the OILMM, combined with state-space approximations, to model 70 million data points (see <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma <em>et al.</em> (2020)</a> for details).</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we have taken a deeper look into the <em>Instantaneous Linear Mixing Model</em> (ILMM), a widely used multi-output GP (MOGP) model which stands at the base of the <em>Mixing Model Hierarchy</em> (MMH), described in detail in our <a href="/blog/2021/02/19/OILMM-pt1/">previous post</a>. We discussed how the <em>matrix inversion lemma</em> can be used to make computations much more efficient. We then showed an alternative but equivalent (and more intuitive) view based on a <em>sufficient statistic</em> for the model. This alternative view gives us a better understanding of <em>why</em> and <em>how</em> these computational gains are possible.</p>
<p>From the sufficient statistic formulation of the ILMM we showed how a simple constraint over one of the model parameters can decouple the MOGP problem into a set of independent single-output GP problems, greatly improving scalability. We call this model the <em>Orthogonal Instantaneous Linear Mixing Model</em> (OILMM), a subset of the ILMM.</p>
<p>In the next and last post in this series, we will discuss implementation details of the OILMM and show some of our implementations in Julia and in Python.</p>
<!-- Footnotes themselves at the bottom. -->
<h2 id="notes">Notes</h2>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>The lemma can also be useful in case the matrix can be written as the sum of a low-rank matrix and a <em>block</em>-diagonal one. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>It is true that computing \(Ty\) also has an associated cost. This cost is \(\mathcal{O}(nmp)\) in time and \(\mathcal{O}(mp)\) in memory.
These costs are usually dominated by the others, as the number of observations \(n\) tends to be much larger than the number of outputs \(p\). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>See prop. 3 of the paper by <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma <em>et al.</em> (2020)</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>It also shows that \(H\) can be trivially made time-dependent. This comes as a direct consequence of the MLE problem which we solve to determine \(T\).
If we adopt a time-dependent mixing matrix \(H(t)\), the solution still has the same form, with the only difference that it will also be time-varying: \(T(t) = (H(t)^\top \Sigma^{-1} H(t))^{-1} H(t)^\top \Sigma^{-1}\). <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:5" role="doc-endnote">
<p>There are costs associated with computing the projector \(T\) and executing the projections.
However, these costs are dominated by the ones related to storing and inverting the covariance matrix in practical scenarios (see appendix C of the paper by <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma <em>et al.</em> (2020)</a>). <a href="#fnref:5" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>
Eric Perim, Wessel Bruinsma, and Will Tebbutt
In our previous post, we explained that multi-output Gaussian processes (MOGPs) are not fundamentally different from their single-output counterparts. We also introduced the Mixing Model Hierarchy (MMH), which is a broad class of MOGPs that covers several popular and powerful models from the literature. In this post, we will take a closer look at the central model from the MMH, the Instantaneous Linear Mixing Model (ILMM). We will discuss the linear algebra tricks that make inference in this model much cheaper than for general MOGPs. Then, we will take an alternative and more intuitive approach and use it as motivation for a yet better scaling model, the Orthogonal Instantaneous Linear Mixing Model (OILMM). Like most linear MOGPs, the OILMM represents data in a smaller-dimensional subspace; but contrary to most linear MOGPs, in practice the OILMM scales linearly with the dimensionality of this subspace, retaining exact inference.
Gaussian Processes: from one to many outputs
2021-02-19T00:00:00+00:00
https://invenia.github.io/blog/2021/02/19/OILMM-pt1
<p>This is the first post in a three-part series we are preparing on multi-output Gaussian Processes. Gaussian Processes (GPs) are a popular tool in machine learning, and a technique that we routinely use in our work.
Essentially, GPs are a powerful Bayesian tool for regression problems (which can be extended to classification problems through some modifications).
As a Bayesian approach, GPs provide a natural and automatic mechanism to construct and calibrate uncertainties.
Naturally, getting <em>well</em>-calibrated uncertainties is not easy and depends on a combination of how well the model matches the data and on how much data is available.
Predictions made using GPs are not just point predictions: they are whole probability distributions, which is convenient for downstream tasks. There are several good references for those interested in learning more about the benefits of Bayesian methods, from <a href="https://towardsdatascience.com/an-introduction-to-bayesian-inference-e6186cfc87bc">introductory</a> <a href="https://towardsdatascience.com/what-is-bayesian-statistics-used-for">blog posts</a> to <a href="https://probml.github.io/pml-book/book0.html">classical</a> <a href="http://www.stat.columbia.edu/~gelman/book/">books</a>.</p>
<p>In this post (and in the forthcoming ones in this series), we are going to assume that the reader has some level of familiarity with GPs in the single-output setting.
We will try to keep the maths to a minimum, but will rely on mathematical notation whenever that helps make the message clear.
For those who are interested in an introduction to GPs—or just a refresher—we point towards <a href="https://distill.pub/2019/visual-exploration-gaussian-processes/">other</a> <a href="https://medium.com/analytics-vidhya/intuitive-intro-to-gaussian-processes-328740cdc37f">resources</a>.
For a rigorous and in-depth introduction, the <a href="http://www.gaussianprocess.org/gpml/">book by Rasmussen and Williams</a> stands as one of the best references (and it is made freely available in electronic form by the authors).</p>
<p>We will start by discussing the extension of GPs from one to multiple outputs, and review popular (and powerful) approaches from the literature.
In the following posts, we will look further into some powerful tricks that bring improved scalability and will also share some of our code.</p>
<h2 id="multi-output-gps">Multi-output GPs</h2>
<p>While most people with a background in machine learning or statistics are familiar with GPs, it is not uncommon to have only encountered their single-output formulation.
However, many interesting problems require the modelling of multiple outputs instead of just one.
Fortunately, it is simple to extend single-output GPs to multiple outputs, and there are a few different ways of doing so. We will call these constructions multi-output GPs (MOGPs).</p>
<p>An example application of a MOGP might be to predict both temperature and humidity as a function of time. Sometimes we might want to include binary or categorical outputs as well, but in this article we will limit the discussion to real-valued outputs.
(MOGPs are also sometimes called multi-task GPs, in which case an output is instead referred to as a task. But the idea is the same.)
Moreover, we will refer to inputs as time, as in the time series setting, but the whole discussion here is valid for any kind of input.</p>
<p>The simplest way to extend GPs to multi-output problems is to model each of the outputs independently, with single-output GPs.
We call this model IGPs (for independent GPs). While conceptually simple, computationally cheap, and easy to implement, this approach fails to account for correlations between outputs.
If the outputs are correlated, knowing one can provide useful information about others (as we illustrate below), so assuming independence can hurt performance and, in many cases, limits IGPs to being used as a baseline.</p>
<p>To define a general MOGP, all we have to do is to also specify how the outputs covary.
Perhaps the simplest way of doing this is by prescribing an additional covariance function (kernel) over outputs, \(k_{\mathrm{o}}(i, j)\), which specifies the covariance between outputs \(i\) and \(j\).
Combining this kernel over outputs with a kernel over inputs, e.g. \(k_{\mathrm{t}}(t, t')\), the full kernel of the MOGP is then given by</p>
\[\begin{equation}
k((i, t), (j, t')) = \operatorname{cov}(f_i(t), f_j(t')) = k_{\mathrm{o}}(i, j) k_{\mathrm{t}}(t, t'),
\end{equation}\]
<p>which says that the covariance between output \(i\) at input \(t\) and output \(j\) at input \(t'\) is equal to the product \(k_{\mathrm{o}}(i, j) k_{\mathrm{t}}(t, t')\).
When the kernel \(k((i, t), (j, t'))\) is a product between a kernel over outputs \(k_{\mathrm{o}}(i, j)\) and a kernel over inputs \(k_{\mathrm{t}}(t, t')\), the kernel \(k((i, t), (j, t'))\) is called <em>separable</em>.
In the general case, the kernel \(k((i, t), (j, t'))\) does not have to be separable, i.e. it can be any arbitrary <a href="https://en.wikipedia.org/wiki/Positive-definite_function">positive-definite function</a>.</p>
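<p>As a minimal sketch of a separable kernel, the snippet below multiplies a kernel over outputs by a kernel over inputs. The squared-exponential form of <code class="language-plaintext highlighter-rouge">k_t</code> and the matrix <code class="language-plaintext highlighter-rouge">B</code> are illustrative assumptions, not choices made in this post, and outputs are indexed from 0 rather than 1.</p>

```python
import math

# Hypothetical covariance between p = 2 outputs (symmetric, positive semi-definite).
B = [[1.0, 0.8],
     [0.8, 1.0]]


def k_o(i, j):
    """Kernel over outputs: covariance between output i and output j (0-based)."""
    return B[i][j]


def k_t(t, t_prime, lengthscale=1.0):
    """Kernel over inputs: a squared-exponential kernel, chosen for illustration."""
    return math.exp(-0.5 * ((t - t_prime) / lengthscale) ** 2)


def k(a, b):
    """Separable MOGP kernel: k((i, t), (j, t')) = k_o(i, j) * k_t(t, t')."""
    (i, t), (j, t_prime) = a, b
    return k_o(i, j) * k_t(t, t_prime)


# Covariance between output 0 at time 0.3 and output 1 at time 1.2.
cross_cov = k((0, 0.3), (1, 1.2))
```

<p>Setting the off-diagonal entries of <code class="language-plaintext highlighter-rouge">B</code> to zero would make outputs uncorrelated, which corresponds to the IGP baseline described above.</p>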
<p>Contrary to IGPs, general MOGPs do model correlations between outputs, which means that they are able to use observations from one output to better predict another output.
We illustrate this below by contrasting the predictions for two of the outputs in the <a href="https://sccn.ucsd.edu/~arno/fam2data/publicly_available_EEG_data.html">EEG dataset</a>, one observed and one not observed, using IGPs and another flavour of MOGPs, the ILMM, which we will discuss in detail in the next section. Contrary to the independent GP (IGP), the ILMM is able to successfully predict F2 by exploiting the observations for F3 (and other outputs not shown).</p>
<p><img src="/blog/public/images/eeg.png" alt="IGP_vs_MOGP" />
Figure 1: Predictions for two of the outputs in the EEG dataset using two distinct MOGP approaches, the ILMM and the IGP.
All outputs are modelled jointly, but we only plot two of them for clarity.</p>
<h3 id="equivalence-to-single-output-gps">Equivalence to single-output GPs</h3>
<p>An interesting thing to notice is that a general MOGP kernel is just another kernel, like those used in single-output GPs, but one that now operates on an <em>extended</em> input space (because it also takes in \(i\) and \(j\) as input).
Mathematically, say one wants to model \(p\) outputs over some input space \(\mathcal{T}\).
By also letting the index of the output be part of the input, we can construct this extended input space: \(\mathcal{T}_{\mathrm{ext}} = \{1,...,p\} \times \mathcal{T}\). Then, a multi-output Gaussian process (MOGP) can be defined via a mean function, \(m\colon \mathcal{T}_{\mathrm{ext}} \to \mathbb{R}\), and a kernel, \(k\colon \mathcal{T}_{\mathrm{ext}}^2 \to \mathbb{R}\).
Under this construction it is clear that any property of single-output GPs immediately transfers to MOGPs, because MOGPs can simply be seen as single-output GPs on an extended input space.</p>
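<p>The extended input space can be sketched in a few lines. The kernel below is a hypothetical separable kernel used only for illustration; the point is that, once each input is an (output index, time) pair, the Gram matrix is built exactly as it would be for a single-output GP.</p>

```python
import math
from itertools import product


def kernel(a, b):
    """Illustrative kernel on the extended input space {1, ..., p} x T."""
    (i, t), (j, u) = a, b
    k_out = 1.0 if i == j else 0.5           # hypothetical covariance between outputs
    k_time = math.exp(-0.5 * (t - u) ** 2)   # squared-exponential kernel over time
    return k_out * k_time


p = 3                          # number of outputs
times = [0.0, 0.5, 1.0, 1.5]   # n = 4 input points

# Every (output index, time) pair is a single input of the extended space.
extended_inputs = list(product(range(1, p + 1), times))

# Ordinary Gram matrix over the n * p extended inputs.
gram = [[kernel(a, b) for b in extended_inputs] for a in extended_inputs]
```

<p>Here the Gram matrix has \(np = 12\) rows and columns, which already hints at the scaling issue discussed further below.</p>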
<p>An equivalent formulation of MOGPs can be obtained by stacking the multiple outputs into a vector, creating a <em>vector-valued GP</em>.
It is sometimes helpful to view MOGPs from this perspective, in which the multiple outputs are viewed as one multidimensional output.
We can use this equivalent formulation to define a MOGP via a <em>vector-valued</em> mean function, \(m\colon \mathcal{T} \to \mathbb{R}^p\), and a <em>matrix-valued</em> kernel, \(k\colon\mathcal{T}^2 \to \mathbb{R}^{p \times p}\). This mean function and kernel are <em>not</em> defined on the extended input space; rather, in this equivalent formulation, they produce <em>multi-valued outputs</em>.
The vector-valued mean function corresponds to the mean of the vector-valued GP, \(m(t) = \mathbb{E}[f(t)]\), and the matrix-valued kernel to the covariance matrix of the vector-valued GP, \(k(t, t') = \mathbb{E}[(f(t) - m(t))(f(t') - m(t'))^\top]\).
When the matrix-valued kernel is evaluated at \(t = t'\), the resulting matrix \(k(t, t) = \mathbb{E}[(f(t) - m(t))(f(t) - m(t))^\top]\) is sometimes called the <em>instantaneous spatial covariance</em>: it describes the covariance between different outputs at a given point in time \(t\).</p>
<p>Because MOGPs can be viewed as single-output GPs on an extended input space, inference works exactly the same way.
However, by extending the input space we exacerbate the scaling issues inherent to GPs, because the total number of observations is the sum of the numbers of observations for each output, and GPs scale badly in the number of observations.
While inference in the single-output setting requires the inversion of an \(n \times n\) matrix (where \(n\) is the number of data points), in the case of \(p\) outputs, assuming that at all times all outputs are observed,<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> and that each output has the same number of input points as in the single-output case, this turns into the inversion of an \(np \times np\) matrix, which can quickly become computationally intractable (i.e. not feasible to compute with limited computing power and time).
That is because the inversion of a \(q \times q\) matrix takes \(\mathcal{O}(q^3)\) time and \(\mathcal{O}(q^2)\) memory, meaning that, with \(q = np\), time and memory requirements scale as \(\mathcal{O}(n^3 p^3)\) and \(\mathcal{O}(n^2 p^2)\), respectively.
In practice this scaling characteristic limits the application of this general MOGP formulation to data sets with very few outputs and data points.</p>
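<p>To get a feeling for these numbers, the back-of-the-envelope sketch below computes the size of the covariance matrix and the corresponding costs. The helper name <code class="language-plaintext highlighter-rouge">dense_gp_cost</code> is hypothetical, the memory figure assumes a dense matrix stored in double precision (8 bytes per entry), and the time figure is only the \(q^3\) term, not a runtime in seconds.</p>

```python
def dense_gp_cost(n, p=1, bytes_per_entry=8):
    """Rough cost of exact inference with a dense (n*p) x (n*p) covariance matrix."""
    q = n * p
    return {
        "matrix_size": q,                       # rows (and columns) of the matrix
        "memory_bytes": bytes_per_entry * q**2,  # O(q^2) storage
        "time_units": q**3,                      # O(q^3) operations, up to a constant
    }


single = dense_gp_cost(n=1000)        # one output
multi = dense_gp_cost(n=1000, p=10)   # ten fully observed outputs

memory_ratio = multi["memory_bytes"] // single["memory_bytes"]
time_ratio = multi["time_units"] // single["time_units"]
```

<p>Under these assumptions, going from one to ten outputs with \(n = 1000\) turns an 8 MB matrix into one of \(8 \times 10^8\) bytes (roughly 0.8 GB), and multiplies the cubic term by a factor of 1000.</p>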
<h3 id="low-rank-approximations">Low-rank approximations</h3>
<p>A popular and powerful approach to making MOGPs computationally tractable is to impose a <a href="https://en.wikipedia.org/wiki/Rank_(linear_algebra)">low-rank</a> structure over the covariance between outputs.
That is equivalent to assuming that the data can be described by a set of latent (unobserved) Gaussian processes in which the number of these <em>latent processes</em> is smaller than the number of outputs.
This builds a simpler lower-dimensional representation of the data. The structure imposed over the covariance matrices through this lower-dimensional representation of the data can be exploited to perform the inversion operation more efficiently (we are going to discuss in detail one of these cases in the next post of this series).
There are a variety of different ways in which this kind of structure can be imposed, leading to an interesting class of models which we discuss in the next section.</p>
<p>This kind of assumption is typically used to make the method computationally cheaper.
However, these assumptions do bring extra <a href="https://en.wikipedia.org/wiki/Inductive_bias">inductive bias</a> to the model. Introducing inductive bias <a href="https://towardsdatascience.com/supercharge-your-model-performance-with-inductive-bias-48559dba5133">can be a powerful tool</a> in making the model more data-efficient and better-performing in practice, provided that such assumptions are adequate to the particular problem at hand.
For instance, <a href="https://epubs.siam.org/doi/pdf/10.1137/18M1183480">low-rank data</a> <a href="https://ieeexplore.ieee.org/document/1177153">occurs naturally</a> in <a href="https://www.sciencedirect.com/science/article/abs/pii/S0005109807003950">different settings</a>.
This also happens to be true in electricity grids, due to the <a href="https://lib.dr.iastate.edu/cgi/viewcontent.cgi?article=1031&context=econ_las_conf">mathematical structure</a> of the price-forming process.
To make good choices about the kind of inductive bias to use, experience, domain knowledge, and familiarity with the data can be helpful.</p>
<h2 id="the-mixing-model-hierarchy">The Mixing Model Hierarchy</h2>
<p>Models that use a lower-dimensional representation for data have been present in the literature for a long time.
Well-known examples include <a href="https://en.wikipedia.org/wiki/Factor_analysis">factor analysis</a>, <a href="https://en.wikipedia.org/wiki/Principal_component_analysis">PCA</a>, and <a href="https://en.wikipedia.org/wiki/Autoencoder#Variational_autoencoder_(VAE)">VAEs</a>.
Even if we restrict ourselves to GP-based models there are still a significant number of notable and powerful models that make this simplifying assumption in one way or another.
However, models in the class of MOGPs that explain the data with a lower-dimensional representation are often framed in many distinct ways, which may obscure their relationship and overarching structure.
Thus, it is useful to try to look at all these models under the same light, forming a well-defined family that highlights their similarities and differences.</p>
<p>In one of the appendices of a <a href="http://proceedings.mlr.press/v119/bruinsma20a.html">recent paper of ours</a>,<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> we presented what we called the <em>Mixing Model Hierarchy (MMH)</em>, an attempt at a unifying presentation of this large class of MOGP models.
It is important to stress that our goal with the Mixing Model Hierarchy is only to organise a large number of pre-existing models from the literature in a reasonably general way, and not to present new models nor claim ownership over important contributions made by other researchers.
Further down in this article we present a diagram that connects several relevant papers to the models from the MMH.</p>
<p>The simplest model from the Mixing Model Hierarchy is what we call the <em>Instantaneous Linear Mixing Model</em> (ILMM), which, despite its simplicity, is still a rather general way of describing low-rank covariance structures. In this model the observations are described as a linear combination of the <em>latent processes</em> (i.e. unobserved stochastic processes), given by a set of constant weights.
That is, we can write \(f(t) | x, H = Hx(t)\),<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> where \(f\) are the observations, \(H\) is a matrix of weights, which we call the <em>mixing matrix</em>, and \(x(t)\) represents the (time-dependent) latent processes—note that we use \(x\) here to denote an unobserved (latent) stochastic process, not an input, which is represented by \(t\).
If the latent processes \(x(t)\) are described as GPs, then due to the fact that linear combinations of Gaussian variables are also Gaussian, \(f(t)\) will also be a (MO)GP.</p>
<p>The top-left part of the figure below illustrates the graphical model for the ILMM.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup>
The graphical model highlights the two restrictions imposed by the ILMM when compared with a general MOGP: <em>(i)</em> the <em>instantaneous spatial covariance</em> of \(f\), \(\mathbb{E}[f(t) f^\top(t)] = H H^\top\), does not vary with time, because neither \(H\) nor \(K(t, t) = I_m\) varies with time; and <em>(ii)</em> the noise-free observation \(f(t)\) is a function of \(x(t')\) for \(t'=t\) only, meaning that, for example, \(f\) cannot be \(x\) with a delay or a smoothed version of \(x\). Reflecting this, we call the ILMM a <em>time-invariant</em> (due to <em>(i)</em>) and <em>instantaneous</em> (due to <em>(ii)</em>) MOGP.</p>
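<p>The first restriction can be made explicit by writing out the covariance of \(f\). As a short sketch, assume zero-mean latent processes (an assumption made here for brevity) and denote their matrix-valued kernel by \(K(t, t') = \mathbb{E}[x(t) x^\top(t')]\). Because \(H\) is constant, it can be pulled out of the expectation:</p>
\[\begin{equation}
\mathbb{E}[f(t) f^\top(t')] = H \, \mathbb{E}[x(t) x^\top(t')] \, H^\top = H K(t, t') H^\top.
\end{equation}\]
<p>Setting \(t' = t\) and using \(K(t, t) = I_m\) recovers \(H H^\top\). Since \(H\) has only \(m\) columns, this \(p \times p\) matrix has rank at most \(m\), which is the low-rank structure over outputs discussed in the previous section.</p>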
<p><img src="/blog/public/images/MMH-graphical_model.png" alt="MMH_graphical_model" />
Figure 2: Graphical model for different models in the MMH.<sup id="fnref:4:1" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>
<p>There are three general ways in which we can generalise the ILMM within the MMH.
<em>The first one is to allow the mixing matrix \(H\) to vary in time</em>.
That means that the amount each latent process contributes to each output varies in time.
Mathematically, \(H \in \R^{p \times m}\) becomes a matrix-valued function \(H\colon \mathcal{T} \to \R^{p \times m}\), and the mixing mechanism becomes</p>
\[\begin{equation}
f(t)\cond H, x = H(t) x(t).
\end{equation}\]
<p>We call such MOGP models <em>time-varying</em>.
Their graphical model is shown in the figure above, on the top right corner.</p>
<p><em>A second way to generalise the ILMM is to assume that \(f(t)\) depends on \(x(t')\) for all \(t' \in \mathcal{T}\).</em>
That is to say that, at a given time, each output may depend on the values of the latent processes at any other time.
We say that these models become <em>non-instantaneous</em>.
The mixing matrix \(H \in \R^{p \times m}\) becomes a matrix-valued time-invariant filter \(H\colon \mathcal{T} \to \R^{p \times m}\), and the mixing mechanism becomes</p>
\[\begin{equation}
f(t)\cond H, x = \int H(t - \tau) x(\tau) \, \mathrm{d}\tau.
\end{equation}\]
<p>We call such MOGP models <em>convolutional</em>.
Their graphical model is shown in the figure above, in the bottom left corner.</p>
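<p>As a rough numerical illustration of the convolutional mixing mechanism, the sketch below approximates the integral by a Riemann sum on a regular time grid. The function name <code class="language-plaintext highlighter-rouge">mix_convolutional</code>, the grid, and the example filter are all hypothetical; the filter is a discrete stand-in for a delta function, chosen because it should reduce the convolutional model back to the instantaneous mixing \(f(t) = H x(t)\).</p>

```python
def mix_convolutional(H, x, dt):
    """Approximate f(t_k) = integral of H(t_k - tau) x(tau) d tau by a Riemann sum.

    H  : function mapping a time lag to a p x m matrix (list of lists)
    x  : list of n latent vectors, x[l] being the m latent values at time l * dt
    dt : spacing of the regular time grid
    """
    n, m = len(x), len(x[0])
    p = len(H(0.0))
    f = [[0.0] * p for _ in range(n)]
    for k_idx in range(n):          # output time t_k = k_idx * dt
        for l_idx in range(n):      # latent time tau_l = l_idx * dt
            H_lag = H((k_idx - l_idx) * dt)
            for i in range(p):
                f[k_idx][i] += dt * sum(H_lag[i][j] * x[l_idx][j] for j in range(m))
    return f


dt = 0.5
W = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]   # hypothetical 3 x 2 mixing weights
x = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]    # m = 2 latent values at n = 3 times


def delta_like_filter(lag):
    """All weight at lag zero: a discrete stand-in for a delta function."""
    scale = 1.0 / dt if lag == 0.0 else 0.0
    return [[scale * w for w in row] for row in W]


f = mix_convolutional(delta_like_filter, x, dt)
```

<p>With this filter every \(f(t_k)\) equals \(W x(t_k)\), so the instantaneous model appears as a special case. A filter that spreads its weight over several lags would instead make each output depend on latent values at other times.</p>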
<p><em>A third generalising assumption that we can make is that \(f(t)\) depends on \(x(t')\) for all \(t' \in \mathcal{T}\) <span style="text-decoration:underline;">and</span> this relationship may vary with time.</em>
This is similar to the previous case, in that both models are non-instantaneous, but with the difference that this one is also time-varying.
The mixing matrix \(H \in \R^{p \times m}\) becomes a matrix-valued time-varying filter \(H\colon \mathcal{T}\times\mathcal{T} \to \R^{p \times m}\), and the mixing mechanism becomes</p>
\[\begin{equation}
f(t)\cond H, x = \int H(t, \tau) x(\tau) \, \mathrm{d}\tau.
\end{equation}\]
<p>We call such MOGP models <em>time-varying</em> and <em>convolutional</em>.
Their graphical model is shown in the figure above in the bottom right corner.</p>
<p>Besides these generalising assumptions, a further extension is to adopt a prior over \(H\).
Using such a prior allows for a principled way of further imposing inductive bias by, for instance, encouraging sparsity.
This extension and the three generalisations discussed above together form what we call the <em><a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Mixing Model Hierarchy (MMH)</a></em>, which is illustrated in the figure below.
The MMH organises multi-output Gaussian process models according to their distinctive modelling assumptions.
The figure below shows how twenty-one MOGP models from the machine learning and geostatistics literature can be recovered as special cases of the various generalisations of the ILMM.</p>
<p><img src="/blog/public/images/MMH-Zoubins_cube.png" alt="MMH" />
Figure 3: Diagram relating several models from the literature to the MMH, based on their properties.</p>
<p>Naturally, these different members of the MMH vary in complexity and each brings its own set of challenges.
In particular, exact inference is computationally expensive or even intractable for many models in the MMH, which requires the use of approximate inference methods such as <a href="https://en.wikipedia.org/wiki/Variational_Bayesian_methods">variational inference</a> (VI), or even <a href="https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo">Markov Chain Monte Carlo</a> (MCMC).
We recommend reading the original papers if one is interested in seeing all the clever ways the authors find to perform inference efficiently.</p>
<p>Although the MMH defines a large and powerful family of models, not all multi-output Gaussian process models are covered by it.
For example, <a href="https://arxiv.org/abs/1211.0358">Deep GPs</a> and their <a href="https://papers.nips.cc/paper/2018/hash/2974788b53f73e7950e8aa49f3a306db-Abstract.html">variations</a> are excluded because they transform the latent processes <em>nonlinearly</em> to generate the observations.</p>
<h2 id="conclusion">Conclusion</h2>
<p>In this post we have briefly discussed how to extend regular, single-output Gaussian Processes (GPs) to multi-output Gaussian Processes (MOGPs), and argued that MOGPs are really just single-output GPs after all.
We have also introduced the Mixing Model Hierarchy (MMH), which classifies a large number of models from the MOGP literature based on the way they generalise a particular base model, the Instantaneous Linear Mixing Model (ILMM).</p>
<p>In the next post of this series, we are going to discuss the ILMM in more detail and show how some simple assumptions can lead to a much more scalable model, which is applicable to extremely large systems that not even the simplest members of the MMH can tackle in general.</p>
<!-- Footnotes themselves at the bottom. -->
<h3 id="notes">Notes</h3>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>When every output is observed at each time stamp, we call them fully observed outputs. If the outputs are not fully observed, only a subset of them might be available for certain times (for example due to faulty sensors).
In this case, the number of data points will be smaller than \(np\), but will still scale proportionally with \(p\).
Thus, the scaling issues will still be present. In the case where only a single output is observed at any given time, the number of observations will be \(n\), and the MOGP would have the same time and memory scaling as a single-output GP. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p><a href="http://proceedings.mlr.press/v119/bruinsma20a.html">Bruinsma, Wessel, et al. “Scalable Exact Inference in Multi-Output Gaussian Processes.” International Conference on Machine Learning. PMLR, 2020</a>. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:3" role="doc-endnote">
<p>Here \(f(t)\mid x,H\) is used to denote the value of \(f(t)\) given a known \(H\) and \(x(t)\). <a href="#fnref:3" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:4" role="doc-endnote">
<p>Although we illustrate the GPs in the graphical models as a Markov chain, that is just to improve clarity.
In reality, GPs are much more general than Markov chains, as there is no conditional independence between timestamps. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">↩</a> <a href="#fnref:4:1" class="reversefootnote" role="doc-backlink">↩<sup>2</sup></a></p>
</li>
</ol>
</div>
Eric Perim, Wessel Bruinsma, and Will Tebbutt
How to Start Contributing to Open Source Software
2021-01-29T00:00:00+00:00
https://invenia.github.io/blog/2021/01/29/contribute-open-source
<p>If you are someone who feels comfortable using code to solve a problem, answer a question, or just implement something for fun, chances are you are relying on <a href="https://opensource.com/resources/what-open-source">open source software</a>.
If you want to contribute to open source software, but don’t know how and where to start, this guide is for you.</p>
<p>In this post, we will first discuss some of the possible mental and technical barriers standing between you and your first meaningful contribution.
Then, we will go through the steps involved in making the first contribution.</p>
<p>For the sake of brevity, we assume some basic familiarity with <a href="https://guides.github.com/introduction/git-handbook/">git</a> (if you know how to commit changes and what a branch is, you’ll be fine).
Because many open source projects are hosted on GitHub, we also discuss <a href="https://guides.github.com/features/issues">GitHub Issues</a>, which refer to GitHub functionality that allows discussion about bugs and new features, and the term “pull/merge request” (PR/MR, they are the same thing), which is a mechanism for submitting your changes to be incorporated into the existing repository.</p>
<h2 id="mental-and-technical-barriers">Mental and Technical Barriers</h2>
<p>It is easy to observe the world of open source and think:
“Oh look at all these smart people doing meaningful work and developing meaningful relationships, building humanity’s digital heritage that helps to run our civilisation.
I wish I could join the party, <em>but</em> &lt;something I believe to be true&gt;.”</p>
<p>Some misconceptions I and other people have had are listed below, along with what we think about them now.</p>
<blockquote>
<p>“I need to be an expert in &lt;the language&gt; AND &lt;the package&gt; to even consider reporting a bug, making a fix, or implementing new functionality.”</p>
</blockquote>
<p>If you are using the package and something doesn’t work as expected, you should report it, for example by opening a GitHub issue.
You don’t need to know how the package works internally.
All you need to do is take a quick look at the documentation to see if anything is written about your problem, and check if a similar issue has been opened already.
If you want to fix something yourself, that’s fantastic! Just jump in and try to figure it out.</p>
<blockquote>
<p>“I know about &lt;a thing&gt;, but other people know much more, and it’s much better if they implemented it.”</p>
</blockquote>
<p>This is almost always the case, unless you are the world’s leading expert on &lt;a thing&gt;.
However, they are likely busy with other important work, and the issue is not a priority for them.
So, either you implement it, or it may not get done at all, so go ahead!
The best part, however, is that these other more knowledgeable people will usually be happy to review your solution and suggest how to make it even better, which means that a great thing gets built and you learn something in the process.</p>
<blockquote>
<p>“I don’t know anyone contributing to the package and they look like a team, isn’t it weird if I just jump in and open an issue/PR?”</p>
</blockquote>
<p>No, it’s not weird.
Teams make their code and issues public because they’re looking for new contributors like you.</p>
<blockquote>
<p>“I won’t be able to create a perfect solution, and people will point out flaws and ask me to change it.”</p>
</blockquote>
<p>Solutions to issues usually come in the form of pull requests.
However, opening a PR is best thought of as a conversation about a solution, rather than a finished product that is either approved or rejected.
Experienced contributors often open PRs to solicit feedback about an idea, because an open PR on GitHub offers convenient tools for discussion about code.
Even if you think your solution is complete, people will likely ask you to make changes, and that’s alright!
If it isn’t explicitly mentioned (it should be), ask why the changes are needed—these are valuable opportunities to learn.</p>
<blockquote>
<p>“I would like to make a contribution, but don’t know where to start.”</p>
</blockquote>
<p>Finding the right place to start can be challenging, but see advice below.
Once you make a contribution or two they will lead you on to others, so you typically only have to overcome this barrier once.</p>
<blockquote>
<p>“I know what is broken and I think I know how to fix it, but don’t know the steps to publish these to the official repository.”</p>
</blockquote>
<p>That’s fantastic! See the rest of the guide.</p>
<blockquote>
<p>“What if people ask for changes? How do I implement those?”</p>
</blockquote>
<p>Somehow I thought implementing review feedback was hard and messy, but, in practice, it’s as easy as adding more commits to a branch.</p>
<h2 id="steps-to-first-contribution">Steps to First Contribution</h2>
<p>Now that we have gone through some of the concerns you may have, here is the step-by-step guide to your first contribution.</p>
<h3 id="1-learn-the-mechanics-of-a-pull-request">1) Learn the mechanics of a pull request</h3>
<p>The workflow is described in this <a href="https://github.com/firstcontributions/first-contributions">excellent repository on GitHub</a>, built just for learning the mechanics of making a pull request.
I recommend not just reading it, but actually working through all the steps yourself.
That exercise should make you comfortable with the process.</p>
<h3 id="2-find-something-you-want-to-fix">2) Find something you want to fix</h3>
<p>A good first project might be solving a bug that affects you, as that means you already have a test case and you will be more motivated to find a solution.
However, if the bug is in a large and complicated library or requires a lot of code refactoring to fix, it is probably better to start somewhere else.</p>
<p>It may be more enjoyable to start with smaller or medium-sized packages because they can be easier to understand.
When you find a package you would like to modify, make sure that it is the original (not a fork) and it is being maintained, which you can check by looking at the issues and pull requests on its GitHub page.</p>
<p>Then, look through the issues and see if there is something that you find interesting.
Pay attention to <code class="language-plaintext highlighter-rouge">good-first-issue</code> labels, which indicate issues that people think are appropriate for first-time contributors.
This usually means that they are nice and not too hard to solve.
You don’t have to restrict yourself to issues with <code class="language-plaintext highlighter-rouge">good-first-issue</code> labels; feel free to tackle anything you feel motivated and able to do.
Keep in mind that it might be better to start with a smaller PR and get that merged first, before tackling a bigger issue.
You don’t want to submit a week’s worth of work only to find out that the package has been abandoned and there is nobody willing to review and merge your PR.</p>
<p>When you find an interesting issue and decide you want to work on it, it is a good idea to comment on the issue first and ask whether anyone is willing to review a potential PR.
Commenting will also create a feeling of responsibility and ownership of the issue which will motivate you and help you finish the PR.</p>
<p>As a few concrete ideas, Invenia is involved with several Julia packages, for example <a href="https://github.com/JuliaCloud/AWS.jl">AWS.jl</a> for interacting with AWS, <a href="https://github.com/invenia/Intervals.jl">Intervals.jl</a> for working with intervals over ordered types, <a href="https://github.com/invenia/BlockDiagonals.jl">BlockDiagonals.jl</a> for working with block-diagonal matrices, and <a href="https://github.com/JuliaDiff/ChainRules.jl">ChainRules.jl</a> for automatic differentiation.
We are happy to help you contribute to these!</p>
<h3 id="3-implement-your-solution-and-open-a-pr">3) Implement your solution, and open a PR</h3>
<p>While you should be familiar with the mechanics of pull requests after step 1, there are some additional social/etiquette considerations.
Generally, the authors of open source packages are delighted when someone uses their package, opens issues about bugs or potential improvements, and especially so when someone opens a pull request with a solution to a known problem.
That said, they will appreciate it if you make things easy for them by linking the issue your PR is solving and briefly explaining why you have chosen this approach.
If you are unsure about whether something was a good choice, point it out in the description.</p>
<p>If your background isn’t in computer science or software engineering, you might not have heard of <a href="https://ocw.mit.edu/ans7870/6/6.005/s16/classes/03-testing/index.html">unit testing</a>.
Testing is a way of ensuring the correctness of the code by checking the output of the code for a number of inputs.
In packages with unit tests, every new feature is typically expected to come with tests.
When fixing a bug, a test that fails using the old code and passes using the new code may also be expected.</p>
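<p>As a minimal sketch of such a regression test, consider a hypothetical <code class="language-plaintext highlighter-rouge">median</code> function whose old version returned the wrong value for inputs of even length. Neither the function nor the bug comes from a real package; both are assumptions made up for illustration, and Python is used here only as an example language.</p>

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Hypothetical bug fix: the old code returned ordered[mid] here as well,
    # which is wrong for sequences of even length.
    return (ordered[mid - 1] + ordered[mid]) / 2


def test_median_even_length():
    # Regression test: fails with the old code (which returned 3),
    # passes with the fixed code.
    assert median([1, 2, 3, 4]) == 2.5


def test_median_odd_length():
    # Existing behaviour that the fix must not break.
    assert median([3, 1, 2]) == 2
```

<p>Test runners such as pytest collect functions named <code class="language-plaintext highlighter-rouge">test_*</code> automatically; in a Julia package the same idea would typically live in <code class="language-plaintext highlighter-rouge">test/runtests.jl</code>.</p>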
<p>You can make the review process more efficient and pleasant by quickly examining the PR yourself.
What needs to be included in the PR depends on the issue at hand, but there are some general questions you can think about:</p>
<ul>
<li>Does my code work in corner cases, did I include reasonable tests?</li>
<li>Did I add or change documentation to match my changes?</li>
<li>Does my code formatting follow the rest of the package? Some packages follow code style guides, such as <a href="https://www.python.org/dev/peps/pep-0008/">PEP8</a> or <a href="https://github.com/invenia/BlueStyle">BlueStyle</a>.</li>
<li>Did I include any lines or files by mistake?</li>
</ul>
<p>Don’t worry if you can’t answer these questions.
It is perfectly fine to ask!
You can also self-review your PR and add some thoughts as comments to the code.</p>
<p>The <a href="http://colprac.sciml.ai">contributor’s guide on collaborative practices</a> is a great resource about the best practices regarding collaboration on open source projects.
The packages that follow it display this badge: <a href="https://github.com/SciML/ColPrac"><img src="https://img.shields.io/badge/ColPrac-Contributor's%20Guide-blueviolet" alt="ColPrac: Contributor's Guide on Collaborative Practices for Community Packages" /></a>
Other packages typically have their own guidelines outlined in a file named <code class="language-plaintext highlighter-rouge">CONTRIBUTING.md</code>.</p>
<h3 id="4-address-feedback-and-wait-for-the-merge">4) Address feedback, and wait for the merge</h3>
<p>Once the PR is submitted, someone will likely respond to it within a few days.
If that doesn’t happen, feel free to “bump” it by adding a comment to the PR asking for it to be reviewed.
Most maintainers do not mind if you bump the PR every ten days or so, and in fact find it useful in case it has slipped under their radar.</p>
<p>Ideally the feedback will be constructive, actionable, and educational.
Sometimes it isn’t, and if you are very unlucky the reviewer might come across as stern and critical.
It helps to remember that such feedback might have been unintentional and that you are in fact on the same side, both wanting the best code to be merged.
Using plural first-person pronouns (we) is a good way to convey this sentiment (and remind the reviewer about it), for example:
“What are the benefits if we implement &lt;a feature&gt; in this way?” is better than “Why do you think &lt;a feature&gt; should be implemented this way?”.</p>
<p>Addressing feedback is easy: simply add more commits to the branch that the PR is for.
Once you think you have addressed all the feedback, let the reviewer know explicitly, as they don’t know whether you plan to add more commits or not.
If all goes well, the reviewer will then merge the PR. Hooray!</p>
<h2 id="conclusions">Conclusions</h2>
<p>Unless you have started programming very recently, you likely already have the technical/programming ability to contribute to open source projects.
You have valuable contributions to make, but psychological/sociological barriers may be holding you back.
Hopefully reading this post will help you overcome them, and we are looking forward to welcoming you to the community and seeing what you come up with!</p>
Miha Zgubic
If you are someone who feels comfortable using code to solve a problem, answer a question, or just implement something for fun, chances are you are relying on open source software. If you want to contribute to open source software, but don’t know how and where to start, this guide is for you.
Navigating Your First Technical Internship
2021-01-22T00:00:00+00:00
https://invenia.github.io/blog/2021/01/22/navigating-first-internship
<p>During the final weeks of my internship with Invenia, while looking back on my time here,
I had the idea to share some thoughts on my experience and what advice I wish I had been given leading up to my start date.
There are countless recipes for a successful internship, and I hope what follows can help guide you towards making yours great.</p>
<p>First and foremost, congratulations on landing your first technical
internship! After hours of fine-tuning your resume, countless cover
letters written, and (if you are in the same boat I was in) having faced
your fair share of rejections, you’ve made it. Before digging into the
rest of this post, take a moment to be proud of the work you’ve put in
and appreciate where it has led you. If perhaps you are reading this in
the middle of your application process and are yet to receive an offer,
don’t worry. The first summer I applied for technical internships, out
of roughly 20 applications sent, I received a single interview and zero
offers. It never feels like it is going to happen until it does. Here’s
a <a href="https://www.freecodecamp.org/news/how-to-land-a-top-notch-tech-job-as-a-student-5c97fec82f3d/">useful
guide</a>
if you’re looking for tips on the application process.</p>
<p>Hopefully the advice that follows in this post will give you some
insight that helps you make a strong impression throughout the course
of your internship. Some more technical subjects or toolkits will be
mentioned for the sake of example, but we won’t get into the details of
these in this post. There are plenty of online resources that provide
great introductions to
<a href="https://guides.github.com/introduction/git-handbook/">git</a>,
<a href="https://www.zdnet.com/article/what-is-cloud-computing-everything-you-need-to-know-about-the-cloud/">the
cloud</a>,
and countless other new topics you might encounter.</p>
<h2 id="before-you-start">Before you start</h2>
<p>The most important thing you can do leading up to your internship is <strong>try
to relax</strong>. It’s common to feel a nervous excitement as your first day
approaches. Remember that you will have plenty of time (give or take 40
hours per week) to focus on this soon enough. Enjoy your
other hobbies and interests while you have the time.</p>
<p>If all distractions fail, a great way to spend any built-up excitement
is to do some research. You likely already have a good sense of what the
company does, the general culture and what your role might entail from
the application and interview processes. Feel free to dig deeper into
these topics. See if there’s information on the company website about
your coworkers, the impact your work might drive and the industry as a
whole. This will help you feel more prepared when you start, as well as
settle some jitters in the weeks leading up.</p>
<h2 id="hitting-the-ground-running">Hitting the ground running</h2>
<p>The day has finally arrived, you’ve made it! Surely your first week will
be filled with lots of meetings, introductions, and learning. My advice
for this information overload is about as basic as it gets: <strong>take
notes, and lots of them</strong>. Ironically, being a software engineer, I am a
big fan of handwriting my notes. I also think it comes across better: the
person speaking can see you writing notes by hand, whereas they cannot see
the screen of someone typing away at a laptop.</p>
<p>In the same vein, any buzzwords you hear (any term, phrase, acronym or
title that you don’t fully understand) should be added to a personal
dictionary. This was really important for me, and I referred to it
weekly throughout my 8-month internship. You should keep this up
throughout the entire internship, although it will likely grow the most
during the first two weeks. <em>What is a Docker container and what makes
them helpful? Who knows what [insert some arbitrary industry acronym
here] stands for?</em> These are questions that are super easy to answer by
asking coworkers or even the internet. Maintaining your own shortlist of
important definitions will help fast track your learning and be a great
tool to pass along to new interns that join during your stay.</p>
<p>One of the best ways you can get to know the company and those you are
working with is simply to <strong>reach out to your coworkers</strong>. This can be
challenging depending on the size of the company you are with. It
becomes even more intimidating if you are in the middle of a global
pandemic and are onboarding from home. Try connecting via Slack, email
or set up calls to get to know those both in and outside of your team.
Learning about the different teams, individuals’ career paths and
building your network is one of the best ways to make the most of your
internship, both personally and professionally. Reaching out like this
can be intimidating, especially early on, but it will show that you are
interested in the company as a whole and highlight that you have strong
initiative. People generally are more than happy to talk about their
work and personal interests. There is no reason not to start this during
your first couple of weeks.</p>
<h2 id="the-bulk">The Bulk</h2>
<p>Throughout my internship, I discovered three main takeaways that should
be considered by anyone working through an internship, especially if it
is your first.</p>
<h3 id="1-who-is-this-imposter">1. Who is this imposter?</h3>
<p>Imposter Syndrome is the feeling of doubting your own skills and
accomplishments, and thinking of yourself as some kind of fraud. For me,
this manifested itself as a stream of questions such as <em>When will they
realize they’ve made a huge mistake in hiring me?</em> and thoughts similar
to <em>I am definitely not smart enough for this!</em> Temper this by
remembering <strong>they hired you for a reason</strong>. After interviewing many
candidates and having spent hours speaking to you and testing your
skills, they picked you. This reminder certainly won’t make all these
anxieties disappear, but can hopefully help mitigate any unnecessary
stress. Being a little nervous can help motivate you to work hard and
push you to succeed. It is worth remembering that all your coworkers
likely went through the same experience, or even still feel this way
from time to time. Feel free to get their insights or experience with
this if you feel comfortable doing so.</p>
<p>These thoughts and feelings might start well before your first day and
last well into the internship itself, as they did in my case. It will
improve with time and experience. Until that happens just
remember to use your available sources of support and try to translate
it into motivation.</p>
<h3 id="2-asking-for-help">2. Asking for help</h3>
<p>A big fear I had during my internship was coming across as naive and
inexperienced. I was very worried about asking a question and getting
“<em>How do you not know that? Don’t you know anything</em>?” as a response.
While this is certainly a normal thought process, it is misguided for a
few reasons. First off, my coworkers are great people, as I am sure are
yours. The odds of someone saying that are slim to none, and if they do,
it tells you a lot more about them than it does about you. Secondly, and
this is an important thing to keep in mind: <strong>no one expects you to know
everything and be great at everything,</strong> especially early on as an
intern. Asking questions is an important part of learning and
internships are no exception. This one comes in two parts: <em>when</em> and <em>how</em> to ask for help.</p>
<p>Let me save you the time I wasted trying to figure out when is the
perfect point to ask a question. While you definitely do not want to
just ask before thinking or doing any digging yourself, no one wants you
endlessly spinning your wheels. Take the time to think about the
problem, see if there are any reliable resources or answers in some
documentation, attempt a couple of solutions, but don’t fuss until the
end of time out of fear of looking dumb.</p>
<p>How you ask for help is the easy one. You’ve done all the work already.
Avoid questions like “<em>Hey how do you do [insert problem]?</em>”, or even
worse “<em>Hey I know this is probably SUPER stupid but I don’t get
[insert problem here] haha!</em>”. Do say something along the lines of
“<em>Hey [insert name]. I have been trying to figure out how to solve
[insert problem] and seem to be stuck. I have tried [insert attempted
solutions] with no success and was hoping you could point me in the
right direction.</em>” You can also frame it as a leading question, such as
“<em>So in order to do X, we have to do Y because of Z</em>?” It doesn’t have
to be a lengthy breakdown of every thought you had; in fact, it really shouldn’t
be. People generally prefer concise messages, so just show that you have
put some thought and effort into it.</p>
<h3 id="3-make-your-voice-heard">3. Make your voice heard</h3>
<p>The scope of this point may vary depending on the size of your company,
its culture and your role; however, the point remains the same. Share
your thoughts on what you are working on. Share which subjects interest
you, both within and outside your role.</p>
<p>I had the incredible opportunity to contribute to my organization in
ways beyond the scope of my job description. Yes, this is because of the
supportive nature of the company and the flexibility that comes with
working at a smaller organization, but it also would not have happened
had I not shared what I am interested in. I gained fantastic experience
in my role, but also developed an appreciation and better understanding
for other work being done at the company.</p>
<p>Share your thoughts at the weekly team meeting, don’t be afraid to
review code or improve documentation, and bounce ideas off coworkers
during coffee breaks (oh, and try not to drink too much coffee). They
hired you in large part for your brain, so don’t be afraid to use it!</p>
<h2 id="final-impressions">Final Impressions</h2>
<p>You’ve made it to the final few weeks of your internship, congrats!
Hopefully you have had a fantastic experience, learning a lot and making
lasting relationships. Now is the time to think of who you would like to
connect with for a chat or call before you finish. This can be for any
number of reasons: giving a little extra thanks, asking for career
advice or even just for the sake of saying farewell to the friends
you’ve made along the way!</p>
<p>Regardless of whether or not you follow any of this advice, I wish you
the best of luck in your internship. While the advice above worked well
for me, it is by no means a one-size-fits-all magical recipe to the
perfect internship. There will certainly be hurdles along the way,
anxieties to overcome, and inevitable mistakes made, all of which will
contribute to making your internship a great learning experience. Good
luck and enjoy the ride.</p>
Tom Wright
Linear Models from a Gaussian Process Point of View with Stheno and JAX
2021-01-19T00:00:00+00:00
https://invenia.github.io/blog/2021/01/19/linear-models-with-stheno-and-jax
<p><em>Cross-posted at <a href="https://wesselb.github.io/2021/01/19/linear-models-with-stheno-and-jax.html">wesselb.github.io</a>.</em></p>
<p>A linear model prescribes a linear relationship between inputs and outputs.
Linear models are amongst the simplest of models, but they are ubiquitous across science.
A linear model with Gaussian distributions on the coefficients forms one of the simplest instances of a <em><a href="https://en.wikipedia.org/wiki/Gaussian_process">Gaussian process</a></em>.
In this post, we will give a brief introduction to linear models from a Gaussian process point of view.
We will see how a linear model can be implemented with <em>Gaussian process probabilistic programming</em> using <a href="https://github.com/wesselb/stheno">Stheno</a>, and how this model can be used to denoise noisy observations.
(Disclosure: <a href="https://willtebbutt.github.io/">Will Tebbutt</a> and Wessel are the authors of Stheno;
Will maintains a <a href="https://github.com/willtebbutt/Stheno.jl">Julia version</a>.)
In short, <a href="https://en.wikipedia.org/wiki/Probabilistic_programming">probabilistic programming</a> is a programming paradigm that brings powerful probabilistic models to the comfort of your programming language, which often comes with tools to automatically perform inference (make predictions).
We will also use <a href="https://github.com/google/jax">JAX</a>’s just-in-time compiler to make our implementation extremely efficient.</p>
<h2 id="linear-models-from-a-gaussian-process-point-of-view">Linear Models from a Gaussian Process Point of View</h2>
<p>Consider a data set \((x_i, y_i)_{i=1}^n \subseteq \R \times \R\) consisting of \(n\) real-valued input–output pairs.
Suppose that we wish to estimate a linear relationship between the inputs and outputs:</p>
\[\label{eq:ax_b}
y_i = a \cdot x_i + b + \e_i,\]
<p>where \(a\) is an unknown slope, \(b\) is an unknown offset, and \(\e_i\) is some error/noise associated with the observation \(y_i\).
To implement this model with Gaussian process probabilistic programming, we need to cast the problem into a <em>functional form</em>.
This means that we will assume that there is some underlying, random function \(y \colon \R \to \R\) such that the observations are evaluations of this function: \(y_i = y(x_i)\).
The model for the random function \(y\) will embody the structure of the linear model \eqref{eq:ax_b}.
This may sound hard, but it is not difficult at all.
We let the random function \(y\) be of the following form:</p>
\[\label{eq:ax_b_functional}
y(x) = a(x) \cdot x + b(x) + \e(x)\]
<p>where \(a\colon \R \to \R\) is a <em>random constant function</em>.
An example of a <em>constant function</em> \(f\) is \(f(x) = 5\).
<em>Random</em> means that the value \(5\) is not fixed, but modelled with a random value drawn from some probability distribution, because we don’t know the true value.
We let \(b\colon \R \to \R\) also be a random <em>constant function</em>, and \(\e\colon \R \to \R\) a random <em>noise function</em>.
Do you see the similarities between \eqref{eq:ax_b} and \eqref{eq:ax_b_functional}?
If all that doesn’t fully make sense, don’t worry; things should become clearer as we implement the model.</p>
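<p>As a rough illustration of the functional form, the following NumPy sketch draws one sample of \(y\) by hand, without any Gaussian process library. The helper name <code class="language-plaintext highlighter-rouge">sample_linear_function</code> and the default variances (1 for the slope, 10 for the offset, 1 for the noise) are assumptions chosen for this example only.</p>

```python
import numpy as np


def sample_linear_function(x, rng, slope_var=1.0, offset_var=10.0, noise_var=1.0):
    """Draw one sample of y(x) = a(x) * x + b(x) + e(x) at the inputs `x`."""
    n = len(x)
    # A random constant function: a single random value repeated at every input.
    a = np.full(n, rng.normal(0.0, np.sqrt(slope_var)))
    b = np.full(n, rng.normal(0.0, np.sqrt(offset_var)))
    # A random noise function: an independent random value at every input.
    e = rng.normal(0.0, np.sqrt(noise_var), size=n)
    return a * x + b + e, a, b


rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y, a, b = sample_linear_function(x, rng)
```

<p>Here <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> hold the same value at every input, which is exactly what makes them constant functions, while the noise takes a fresh value at each input.</p>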
<p>To model random constant functions and random noise functions, we will use <a href="https://github.com/wesselb/stheno">Stheno</a>, which is a Python library for Gaussian process modelling.
We also have a <a href="https://github.com/willtebbutt/Stheno.jl">Julia version</a>, but in this post we’ll use the Python version.
To install Stheno, run the command</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install</span> <span class="nt">--upgrade</span> <span class="nt">--upgrade-strategy</span> eager stheno
</code></pre></div></div>
<p>In Stheno, a Gaussian process can be created with <code class="language-plaintext highlighter-rouge">GP(kernel)</code>, where <code class="language-plaintext highlighter-rouge">kernel</code> is the so-called <a href="https://en.wikipedia.org/wiki/Gaussian_process#Covariance_functions"><em>kernel</em> or <em>covariance function</em> of the Gaussian process</a>.
The kernel determines the properties of the function that the Gaussian process models.
For example, the kernel <code class="language-plaintext highlighter-rouge">EQ()</code> models smooth functions, and the kernel <code class="language-plaintext highlighter-rouge">Matern12()</code> models functions that look jagged.
See the <a href="https://www.cs.toronto.edu/~duvenaud/cookbook/">kernel cookbook</a> for an overview of commonly used kernels and the <a href="https://wesselb.github.io/stheno/docs/_build/html/readme.html#available-kernels">documentation of Stheno</a> for the corresponding classes.
For constant functions, you can simply set the kernel to a constant, for example <code class="language-plaintext highlighter-rouge">1</code>, which then models the constant function with a value drawn from \(\mathcal{N}(0, 1)\). (By default, in Stheno, all means are zero; but, if you like, <a href="https://wesselb.github.io/stheno/docs/_build/html/readme.html#available-means">you can also set a mean</a>.)</p>
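<p>To build some intuition for why a constant kernel produces constant functions, here is a small sketch in plain NumPy (not Stheno's implementation). With \(k(x, x') = 1\), all function values are perfectly correlated, so a sample from the corresponding multivariate Gaussian takes the same value at every input. The variable names are assumptions made for this example.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 50)

# Constant kernel k(x, x') = 1: every pair of inputs has covariance 1.
K = np.ones((len(x), len(x)))

# Draw 5 samples from N(0, K); each row is one sampled function evaluated at `x`.
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=5)

# Each sampled function is flat (up to numerical round-off).
spread = samples.max(axis=1) - samples.min(axis=1)
```

<p>Each row of <code class="language-plaintext highlighter-rouge">samples</code> is a flat line at a random height, and the height itself is a draw from \(\mathcal{N}(0, 1)\).</p>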
<p>Let’s start out by creating a Gaussian process for the random constant function \(a(x)\) that models the slope.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">GP</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">a</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>You can see how the Gaussian process looks by simply sampling from it.
To sample from the Gaussian process <code class="language-plaintext highlighter-rouge">a</code> at some inputs <code class="language-plaintext highlighter-rouge">x</code>, evaluate it at those inputs, <code class="language-plaintext highlighter-rouge">a(x)</code>, and call the method <code class="language-plaintext highlighter-rouge">sample</code>: <code class="language-plaintext highlighter-rouge">a(x).sample()</code>.
This shows that you can really think of a Gaussian process just like you think of a function:
pass it some inputs to get (the model for) the corresponding outputs.</p>
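<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="o">>>></span> <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>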
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">a</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/blog/public/images/linear-models-constant-functions.png" alt="Samples of a Gaussian process that models a constant function" />
Figure 1: Samples of a Gaussian process that models a constant function.</p>
<p>We’ve sampled a bunch of constant functions.
Sweet!
The next step in the model \eqref{eq:ax_b_functional} is to multiply the slope function \(a(x)\) by \(x\).
To multiply <code class="language-plaintext highlighter-rouge">a</code> by \(x\), we multiply <code class="language-plaintext highlighter-rouge">a</code> by the function <code class="language-plaintext highlighter-rouge">lambda x: x</code>, which also casts \(x\) as a function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span><span class="p">)</span>
</code></pre></div></div>
<p>This will give rise to functions like \(x \mapsto 0.1x\) and \(x \mapsto -0.4x\), depending on the value that \(a(x)\) takes.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/blog/public/images/linear-models-slope-functions.png" alt="Samples of a Gaussian process that models functions with a random slope" />
Figure 2: Samples of a Gaussian process that models functions with a random slope.</p>
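<p>One way to sanity-check this construction without Stheno is to note that, if the slope is a single draw \(a \sim \mathcal{N}(0, 1)\), then \(f(x) = a \cdot x\) has covariance \(\operatorname{cov}(f(x), f(x')) = x x'\). The NumPy sketch below compares this with an empirical estimate; the sample size and variable names are arbitrary choices for illustration.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 5)

# Draw many random slopes a ~ N(0, 1) and form the functions f(x) = a * x.
slopes = rng.normal(0.0, 1.0, size=200_000)
f_samples = slopes[:, None] * x[None, :]  # shape (200000, 5)

# Empirical covariance between function values (the mean is known to be zero).
empirical_cov = f_samples.T @ f_samples / len(slopes)

# Covariance implied by the model: k(x, x') = x * x'.
linear_kernel = np.outer(x, x)

# Largest deviation between the estimate and the model covariance.
max_abs_error = np.max(np.abs(empirical_cov - linear_kernel))
```

<p>The estimate approaches the outer product \(x x'\) as the number of sampled slopes grows, which is why every sample is a straight line through the origin.</p>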
<p>This is starting to look good!
The only ingredient that is missing is an offset.
We model the offset just like the slope, but here we set the kernel to <code class="language-plaintext highlighter-rouge">10</code> instead of <code class="language-plaintext highlighter-rouge">1</code>, which models the offset with a value drawn from \(\mathcal{N}(0, 10)\).</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span>
<span class="nb">AssertionError</span><span class="p">:</span> <span class="n">Processes</span> <span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span><span class="p">)</span> <span class="ow">and</span> <span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span><span class="p">)</span> <span class="n">are</span> <span class="n">associated</span> <span class="n">to</span> <span class="n">different</span> <span class="n">measures</span><span class="p">.</span>
</code></pre></div></div>
<p>Something went wrong.
Stheno has an abstraction called <em>measures</em>, where only <code class="language-plaintext highlighter-rouge">GP</code>s that are part of the same measure can be combined into new <code class="language-plaintext highlighter-rouge">GP</code>s;
the abstraction of measures is there to keep things safe and tidy.
What goes wrong here is that <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> are not part of the same measure.
Let’s explicitly create a new measure and attach <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> to it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">Measure</span>
<span class="o">>>></span> <span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span>
<span class="o">>>></span> <span class="n">f</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span> <span class="o">+</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s see what samples from <code class="language-plaintext highlighter-rouge">f</code> look like.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">f</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/blog/public/images/linear-models-linear-functions.png" alt="Samples of a Gaussian process that models linear functions" />
Figure 3: Samples of a Gaussian process that models linear functions.</p>
<p>Perfect!
We will use <code class="language-plaintext highlighter-rouge">f</code> as our linear model.</p>
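<p>To see why the printed kernel has the form it does, here is a minimal sketch in plain Python (no Stheno). It assumes that <code class="language-plaintext highlighter-rouge">GP(1)</code> and <code class="language-plaintext highlighter-rouge">GP(10)</code> behave like zero-mean random constants with variances 1 and 10, so that the covariance between \(f(x)\) and \(f(x')\) should be \(x x' + 10\). The helper names <code class="language-plaintext highlighter-rouge">sample_line</code>, <code class="language-plaintext highlighter-rouge">empirical_cov</code>, and <code class="language-plaintext highlighter-rouge">kernel</code> are ours, for illustration only.</p>

```python
import random

random.seed(0)


def sample_line():
    """Draw one function from the prior: slope ~ N(0, 1), offset ~ N(0, 10)."""
    slope = random.gauss(0.0, 1.0)           # Mirrors a = GP(1): variance 1
    offset = random.gauss(0.0, 10.0 ** 0.5)  # Mirrors b = GP(10): variance 10
    return lambda x: slope * x + offset


def empirical_cov(x1, x2, n_samples=100_000):
    """Monte Carlo estimate of Cov(f(x1), f(x2)); the prior mean is zero."""
    total = 0.0
    for _ in range(n_samples):
        f = sample_line()
        total += f(x1) * f(x2)
    return total / n_samples


def kernel(x1, x2):
    """Closed form: Var(slope) * x1 * x2 + Var(offset)."""
    return 1.0 * x1 * x2 + 10.0


# The two numbers should roughly agree.
print(empirical_cov(1.0, 2.0), kernel(1.0, 2.0))
```

<p>Every draw is a straight line, and the estimated covariance should land near the closed-form value, which is what the <code class="language-plaintext highlighter-rouge">&lt;lambda&gt; + 10 * 1</code> in the output above expresses.</p>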
<p>In practice, observations are corrupted with noise.
We can add some noise to the lines in Figure 3 by adding a Gaussian process that models noise.
You can construct such a Gaussian process by using the kernel <code class="language-plaintext highlighter-rouge">Delta()</code>, which models the noise with independent \(\mathcal{N}(0, 1)\) variables.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">Delta</span>
<span class="o">>>></span> <span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="n">noise</span>
<span class="o">>>></span> <span class="n">y</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span> <span class="o">+</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">Delta</span><span class="p">())</span>
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/blog/public/images/linear-models-noisy-linear-functions.png" alt="Samples of a Gaussian process that models noisy linear functions" />
Figure 4: Samples of a Gaussian process that models noisy linear functions.</p>
<p>That looks more realistic, but perhaps that’s a bit too much noise.
We can tune down the amount of noise, for example, by scaling <code class="language-plaintext highlighter-rouge">noise</code> by <code class="language-plaintext highlighter-rouge">0.5</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span>
<span class="o">>>></span> <span class="n">y</span>
<span class="n">GP</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="o"><</span><span class="k">lambda</span><span class="o">></span> <span class="o">+</span> <span class="mi">10</span> <span class="o">*</span> <span class="mi">1</span> <span class="o">+</span> <span class="mf">0.25</span> <span class="o">*</span> <span class="n">Delta</span><span class="p">())</span>
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">(</span><span class="n">x</span><span class="p">).</span><span class="n">sample</span><span class="p">(</span><span class="mi">20</span><span class="p">));</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/blog/public/images/linear-models-noisy-linear-functions-2.png" alt="Samples of a Gaussian process that models noisy linear functions with less noise" />
Figure 5: Samples of a Gaussian process that models noisy linear functions, with the noise scaled by 0.5.</p>
<p>Much better.</p>
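<p>Note that scaling the noise by <code class="language-plaintext highlighter-rouge">0.5</code> shows up as <code class="language-plaintext highlighter-rouge">0.25 * Delta()</code> in the kernel, because a kernel describes variances and variances scale with the square of the factor. A small sketch with the standard library (the variable names are illustrative) makes this concrete:</p>

```python
import random
import statistics

random.seed(0)

# Independent N(0, 1) draws, playing the role of the unscaled noise.
unit_noise = [random.gauss(0.0, 1.0) for _ in range(20_000)]
# Scaling each draw by 0.5 scales the variance by 0.5 ** 2 = 0.25.
scaled_noise = [0.5 * e for e in unit_noise]

print(statistics.pvariance(unit_noise))    # Close to 1
print(statistics.pvariance(scaled_noise))  # Close to 0.25
```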
<p>To summarise, our linear model is given by</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for slope
</span><span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for offset
</span><span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="c1"># Noiseless linear model
</span>
<span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for noise
</span><span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span> <span class="c1"># Noisy linear model
</span></code></pre></div></div>
<p>We call a program like this a <em>Gaussian process probabilistic program</em> (GPPP).
Let’s generate some noisy synthetic data, <code class="language-plaintext highlighter-rouge">(x_obs, y_obs)</code>, that will make up an example data set \((x_i, y_i)_{i=1}^n\).
We also save the observations without noise added—<code class="language-plaintext highlighter-rouge">f_obs</code>—so we can later check how good our predictions really are.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">x_obs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">50_000</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">f_obs</span> <span class="o">=</span> <span class="mf">0.8</span> <span class="o">*</span> <span class="n">x_obs</span> <span class="o">-</span> <span class="mf">2.5</span>
<span class="o">>>></span> <span class="n">y_obs</span> <span class="o">=</span> <span class="n">f_obs</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">50_000</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/blog/public/images/linear-models-observations.png" alt="Some observations" />
Figure 6: Some observations.</p>
<p>We will see next how we can fit our model to this data.</p>
<h2 id="inference-in-linear-models">Inference in Linear Models</h2>
<p>Suppose that we wish to remove the noise from the observations in Figure 6.
We carefully phrase this problem in terms of our GPPP:
the observations <code class="language-plaintext highlighter-rouge">y_obs</code> are realisations of the <em>noisy</em> linear model <code class="language-plaintext highlighter-rouge">y</code> at <code class="language-plaintext highlighter-rouge">x_obs</code>—realisations of <code class="language-plaintext highlighter-rouge">y(x_obs)</code>—and we wish to make predictions for the <em>noiseless</em> linear model <code class="language-plaintext highlighter-rouge">f</code> at <code class="language-plaintext highlighter-rouge">x_obs</code>—predictions for <code class="language-plaintext highlighter-rouge">f(x_obs)</code>.</p>
<p>In Stheno, we can make predictions based on observations by <em>conditioning</em> the measure of the model on the observations.
In our GPPP, the measure is given by <code class="language-plaintext highlighter-rouge">prior</code>, so we aim to condition <code class="language-plaintext highlighter-rouge">prior</code> on the observations <code class="language-plaintext highlighter-rouge">y_obs</code> for <code class="language-plaintext highlighter-rouge">y(x_obs)</code>.
Mathematically, this process of incorporating information by conditioning happens through <a href="https://en.wikipedia.org/wiki/Bayes%27_theorem">Bayes’ rule</a>.
Programmatically, we first make an <code class="language-plaintext highlighter-rouge">Observations</code> object, which represents the information—the observations—that we want to incorporate, and then condition <code class="language-plaintext highlighter-rouge">prior</code> on this object:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">stheno</span> <span class="kn">import</span> <span class="n">Observations</span>
<span class="o">>>></span> <span class="n">obs</span> <span class="o">=</span> <span class="n">Observations</span><span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span><span class="p">.</span><span class="n">condition</span><span class="p">(</span><span class="n">obs</span><span class="p">)</span>
</code></pre></div></div>
<p>You can also more concisely perform these two steps at once, as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">|</span> <span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span>
</code></pre></div></div>
<p>This mimics the mathematical notation used for conditioning.</p>
<p>With our updated measure <code class="language-plaintext highlighter-rouge">post</code>, which is often called the <em>posterior</em> measure, we can make a prediction for <code class="language-plaintext highlighter-rouge">f(x_obs)</code> by passing <code class="language-plaintext highlighter-rouge">f(x_obs)</code> to <code class="language-plaintext highlighter-rouge">post</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">pred</span> <span class="o">=</span> <span class="n">post</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">x_obs</span><span class="p">))</span>
<span class="o">>>></span> <span class="n">pred</span><span class="p">.</span><span class="n">mean</span>
<span class="o"><</span><span class="n">dense</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x1</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span>
<span class="n">mat</span><span class="o">=</span><span class="p">[[</span><span class="o">-</span><span class="mf">2.498</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="mf">2.498</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="mf">2.498</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span> <span class="mf">5.501</span><span class="p">]</span>
<span class="p">[</span> <span class="mf">5.502</span><span class="p">]</span>
<span class="p">[</span> <span class="mf">5.502</span><span class="p">]]</span><span class="o">></span>
<span class="o">>>></span> <span class="n">pred</span><span class="p">.</span><span class="n">var</span>
<span class="o"><</span><span class="n">low</span><span class="o">-</span><span class="n">rank</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">,</span> <span class="n">rank</span><span class="o">=</span><span class="mi">2</span>
<span class="n">left</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span>
<span class="n">middle</span><span class="o">=</span><span class="p">[[</span> <span class="mf">2.001e-05</span> <span class="o">-</span><span class="mf">2.995e-06</span><span class="p">]</span>
<span class="p">[</span><span class="o">-</span><span class="mf">2.997e-06</span> <span class="mf">6.011e-07</span><span class="p">]]</span>
<span class="n">right</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span><span class="o">></span>
</code></pre></div></div>
<p>The prediction <code class="language-plaintext highlighter-rouge">pred</code> is a <a href="https://en.wikipedia.org/wiki/Multivariate_Gaussian_distribution">multivariate Gaussian distribution</a> with a particular mean and variance, which are displayed above.
You should view <code class="language-plaintext highlighter-rouge">post</code> as a function that assigns a probability distribution—the prediction—to every part of our GPPP, like <code class="language-plaintext highlighter-rouge">f(x_obs)</code>.
Note that the variance of the prediction is a <em>massive</em> matrix of size 50k \(\times\) 50k.
Under the hood, Stheno uses <a href="https://github.com/wesselb/matrix">structured representations for matrices</a> to compute and store matrices in an efficient way.</p>
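<p>The rank-2 structure of the variance is no accident: the model has only two unknowns, the offset and the slope. The following sketch is not how Stheno computes things internally; it is the textbook Bayesian linear regression update, written in plain Python under the assumption that the offset and slope have prior variances 10 and 1 and that the noise variance is 0.25. It uses a smaller data set than the post (<code class="language-plaintext highlighter-rouge">n = 2_000</code> is an arbitrary choice), and all names are illustrative.</p>

```python
import random

random.seed(0)

# Synthetic data in the spirit of the post, but smaller.
n = 2_000
x_obs = [10.0 * i / (n - 1) for i in range(n)]
y_obs = [0.8 * x - 2.5 + 0.5 * random.gauss(0.0, 1.0) for x in x_obs]

prior_var = (10.0, 1.0)  # Prior variances of the offset and the slope
noise_var = 0.25         # Variance of 0.5 * noise

# Sufficient statistics for the features phi(x) = [1, x].
s_1, s_x, s_xx = float(n), sum(x_obs), sum(x * x for x in x_obs)
s_y, s_xy = sum(y_obs), sum(x * y for x, y in zip(x_obs, y_obs))

# Posterior precision = prior precision + Phi^T Phi / noise_var (2x2).
p00 = 1.0 / prior_var[0] + s_1 / noise_var
p01 = s_x / noise_var
p11 = 1.0 / prior_var[1] + s_xx / noise_var

# Invert the 2x2 precision to get the posterior covariance of (offset, slope).
det = p00 * p11 - p01 * p01
c00, c01, c11 = p11 / det, -p01 / det, p00 / det

# Posterior mean of (offset, slope) = covariance @ Phi^T y / noise_var.
offset = (c00 * s_y + c01 * s_xy) / noise_var
slope = (c01 * s_y + c11 * s_xy) / noise_var


def predict(x):
    """Posterior mean and variance of f(x) = offset + slope * x."""
    mean = offset + slope * x
    var = c00 + 2.0 * c01 * x + c11 * x * x  # [1, x] @ cov @ [1, x]^T
    return mean, var


print(offset, slope, predict(0.0))
```

<p>In this view, the variance of the prediction at any two inputs is determined by a single 2 \(\times\) 2 covariance matrix, which appears to be the role played by the <code class="language-plaintext highlighter-rouge">middle</code> factor in the output above.</p>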
<p>Let’s see what the prediction <code class="language-plaintext highlighter-rouge">pred</code> for <code class="language-plaintext highlighter-rouge">f(x_obs)</code> looks like.
The prediction <code class="language-plaintext highlighter-rouge">pred</code> exposes the method <code class="language-plaintext highlighter-rouge">marginals</code> that conveniently computes the mean and associated lower and upper error bounds for you.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">mean</span><span class="p">,</span> <span class="n">error_bound_lower</span><span class="p">,</span> <span class="n">error_bound_upper</span> <span class="o">=</span> <span class="n">pred</span><span class="p">.</span><span class="n">marginals</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">mean</span>
<span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.49818708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49802708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49786708</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.50148996</span><span class="p">,</span>
<span class="mf">5.50164996</span><span class="p">,</span> <span class="mf">5.50180997</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">error_bound_upper</span> <span class="o">-</span> <span class="n">error_bound_lower</span>
<span class="n">array</span><span class="p">([</span><span class="mf">0.01753381</span><span class="p">,</span> <span class="mf">0.01753329</span><span class="p">,</span> <span class="mf">0.01753276</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">0.01761883</span><span class="p">,</span> <span class="mf">0.01761935</span><span class="p">,</span>
<span class="mf">0.01761988</span><span class="p">])</span>
</code></pre></div></div>
<p>The error is very small—on the order of \(10^{-2}\)—which means that Stheno predicted <code class="language-plaintext highlighter-rouge">f(x_obs)</code> with high confidence.</p>
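<p>As a quick arithmetic check, the widths above can be reproduced from the <code class="language-plaintext highlighter-rouge">middle</code> factor of <code class="language-plaintext highlighter-rouge">pred.var</code> printed earlier. The sketch below assumes that the error bounds are the mean plus and minus 1.96 standard deviations (a central 95% interval); that multiplier is inferred from the printed numbers, not taken from Stheno’s documentation, and <code class="language-plaintext highlighter-rouge">marginal_width</code> is a hypothetical helper.</p>

```python
import math

# Entries of the 2x2 `middle` factor of pred.var, copied from the output above.
# The two printed off-diagonal entries differ slightly, so we average them.
m00 = 2.001e-05
m01 = 0.5 * (-2.995e-06 + -2.997e-06)
m11 = 6.011e-07


def marginal_width(x, z=1.96):
    """Width of the interval mean +/- z * std for f(x).

    z = 1.96 is an assumption that happens to match the printed widths.
    """
    var = m00 + 2.0 * m01 * x + m11 * x * x  # [1, x] @ middle @ [1, x]^T
    return 2.0 * z * math.sqrt(var)


print(marginal_width(0.0))   # Compare with 0.01753381 above
print(marginal_width(10.0))  # Compare with 0.01761988 above
```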
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">plt</span><span class="p">.</span><span class="n">scatter</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">mean</span><span class="p">);</span> <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/blog/public/images/linear-models-denoised-observations.png" alt="Mean of the prediction (blue line) for the denoised observations" />
Figure 7: Mean of the prediction (blue line) for the denoised observations.</p>
<p>The blue line in Figure 7 shows the mean of the predictions.
This line appears to nicely pass through the observations with the noise removed.
But let’s see how good the predictions really are by comparing to <code class="language-plaintext highlighter-rouge">f_obs</code>, which we previously saved.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">f_obs</span> <span class="o">-</span> <span class="n">mean</span>
<span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">0.00181292</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.00181292</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.00181292</span><span class="p">,</span> <span class="p">...,</span> <span class="o">-</span><span class="mf">0.00180997</span><span class="p">,</span>
<span class="o">-</span><span class="mf">0.00180997</span><span class="p">,</span> <span class="o">-</span><span class="mf">0.00180997</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">((</span><span class="n">f_obs</span> <span class="o">-</span> <span class="n">mean</span><span class="p">)</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="c1"># Compute the mean square error.
</span><span class="mf">3.281323087544209e-06</span>
</code></pre></div></div>
<p>That’s pretty close!
Not bad at all.</p>
<p>We wrap up this section by encapsulating everything that we’ve done so far in a function <code class="language-plaintext highlighter-rouge">linear_model_denoise</code>, which denoises noisy observations from a linear model:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">):</span>
<span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
<span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for slope
</span> <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for offset
</span> <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="c1"># Noiseless linear model
</span> <span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span> <span class="c1"># Model for noise
</span> <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span> <span class="c1"># Noisy linear model
</span>
<span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">|</span> <span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span> <span class="c1"># Condition on observations.
</span> <span class="n">pred</span> <span class="o">=</span> <span class="n">post</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">x_obs</span><span class="p">))</span> <span class="c1"># Make predictions.
</span> <span class="k">return</span> <span class="n">pred</span><span class="p">.</span><span class="n">marginals</span><span class="p">()</span> <span class="c1"># Return the mean and associated error bounds.
</span></code></pre></div></div>
<p></p>
<p><!-- Prevent tabs. --></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="p">(</span><span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.49818708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49802708</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.49786708</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.50148996</span><span class="p">,</span>
<span class="mf">5.50164996</span><span class="p">,</span> <span class="mf">5.50180997</span><span class="p">]),</span> <span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.50695399</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.50679372</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.50663346</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.49268055</span><span class="p">,</span>
<span class="mf">5.49284029</span><span class="p">,</span> <span class="mf">5.49300003</span><span class="p">]),</span> <span class="n">array</span><span class="p">([</span><span class="o">-</span><span class="mf">2.48942018</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.48926044</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.4891007</span> <span class="p">,</span> <span class="p">...,</span> <span class="mf">5.51029937</span><span class="p">,</span>
<span class="mf">5.51045964</span><span class="p">,</span> <span class="mf">5.51061991</span><span class="p">]))</span>
<span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="mi">233</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">12.6</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">1</span> <span class="n">loop</span> <span class="n">each</span><span class="p">)</span>
</code></pre></div></div>
<p>To denoise 50k observations, <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> takes about 250 ms.
Not terrible, but we can do much better, which is important if we want to scale to larger numbers of observations.
In the next section, we will make this function really fast.</p>
<h2 id="making-inference-fast">Making Inference Fast</h2>
<p>The first step in making <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> fast is to simplify, as much as possible, the linear algebra that happens under the hood when <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> is called.
Fortunately, this happens automatically, due to <a href="https://github.com/wesselb/matrix">the structured representation of matrices</a> that Stheno uses.
For example, when making predictions with Gaussian processes, the main computational bottleneck is usually the construction and inversion of <code class="language-plaintext highlighter-rouge">y(x_obs).var</code>, the variance associated with the observations:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">).</span><span class="n">var</span>
<span class="o"><</span><span class="n">Woodbury</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span>
<span class="n">diag</span><span class="o">=<</span><span class="n">diagonal</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span>
<span class="n">diag</span><span class="o">=</span><span class="p">[</span><span class="mf">0.25</span> <span class="mf">0.25</span> <span class="mf">0.25</span> <span class="p">...</span> <span class="mf">0.25</span> <span class="mf">0.25</span> <span class="mf">0.25</span><span class="p">]</span><span class="o">></span>
<span class="n">lr</span><span class="o">=<</span><span class="n">low</span><span class="o">-</span><span class="n">rank</span> <span class="n">matrix</span><span class="p">:</span> <span class="n">shape</span><span class="o">=</span><span class="mi">50000</span><span class="n">x50000</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">,</span> <span class="n">rank</span><span class="o">=</span><span class="mi">2</span>
<span class="n">left</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span>
<span class="n">middle</span><span class="o">=</span><span class="p">[[</span><span class="mf">10.</span> <span class="mf">0.</span><span class="p">]</span>
<span class="p">[</span> <span class="mf">0.</span> <span class="mf">1.</span><span class="p">]]</span>
<span class="n">right</span><span class="o">=</span><span class="p">[[</span><span class="mf">1.e+00</span> <span class="mf">0.e+00</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">2.e-04</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">4.e-04</span><span class="p">]</span>
<span class="p">...</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]</span>
<span class="p">[</span><span class="mf">1.e+00</span> <span class="mf">1.e+01</span><span class="p">]]</span><span class="o">>></span>
</code></pre></div></div>
<p>Indeed, observe that this matrix has a particular structure:
it is a sum of a diagonal and a low-rank matrix.
In Stheno, the sum of a diagonal and a low-rank matrix is called a <em>Woodbury</em> matrix, because the <a href="https://en.wikipedia.org/wiki/Woodbury_matrix_identity">Sherman–Morrison–Woodbury formula</a> can be used to efficiently invert it.
Let’s see how long it takes to construct <code class="language-plaintext highlighter-rouge">y(x_obs).var</code> and then invert it.
We invert <code class="language-plaintext highlighter-rouge">y(x_obs).var</code> using <a href="https://github.com/wesselb/lab">LAB</a>, which is automatically installed alongside Stheno and exposes an API for working efficiently with structured matrices.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">lab</span> <span class="k">as</span> <span class="n">B</span>
<span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">B</span><span class="p">.</span><span class="n">inv</span><span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">).</span><span class="n">var</span><span class="p">)</span>
<span class="mf">28.5</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">1.69</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">10</span> <span class="n">loops</span> <span class="n">each</span><span class="p">)</span>
</code></pre></div></div>
<p>That’s only 30 ms! Not bad, for such a big matrix. Without exploiting structure, a 50k \(\times\) 50k matrix takes 20 GB of memory to store and about an hour to invert.</p>
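<p>To see why this structure makes inversion cheap, here is a minimal NumPy sketch of the Sherman–Morrison–Woodbury formula, \((D + U C V^\top)^{-1} = D^{-1} - D^{-1} U (C^{-1} + V^\top D^{-1} U)^{-1} V^\top D^{-1}\). This is not LAB’s implementation: the function name <code class="language-plaintext highlighter-rouge">woodbury_inverse_apply</code> is made up for illustration, and we assume that the matrix equals <code class="language-plaintext highlighter-rouge">diag + left @ middle @ right.T</code>, which is our reading of the printout above. The problem is kept small so that the result can be checked against a dense solve.</p>

```python
import numpy as np


def woodbury_inverse_apply(diag, left, middle, right, v):
    """Apply (diag(diag) + left @ middle @ right.T)^{-1} to v without forming the n x n matrix."""
    d_inv_v = v / diag                   # D^{-1} v, costs O(n)
    d_inv_left = left / diag[:, None]    # D^{-1} U, costs O(n r)
    # Small r x r system: C^{-1} + V^T D^{-1} U.
    small = np.linalg.inv(middle) + right.T @ d_inv_left
    correction = d_inv_left @ np.linalg.solve(small, right.T @ d_inv_v)
    return d_inv_v - correction


# Same structure as in the post, but with few enough points for a dense check.
n = 200
x = np.linspace(0, 10, n)
diag = np.full(n, 0.25)                   # Noise variance 0.5 ** 2 on the diagonal
left = np.stack([np.ones(n), x], axis=1)  # Columns [1, x]; rank 2
middle = np.diag([10.0, 1.0])
v = np.sin(x)

fast = woodbury_inverse_apply(diag, left, middle, left, v)
dense = np.linalg.solve(np.diag(diag) + left @ middle @ left.T, v)
print(np.allclose(fast, dense))
```

<p>Only a 2 \(\times\) 2 system is ever solved, so the cost of the structured version grows linearly with the number of observations.</p>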
<p>Secondly, we would like the code implemented by <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> to be as efficient as possible.
To achieve this, we will use <a href="https://github.com/google/jax">JAX</a> to compile <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> with <a href="https://www.tensorflow.org/xla">XLA</a>, which generates blazingly fast code.
We start out by importing JAX and loading the JAX extension of Stheno.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">jax</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">jax.numpy</span> <span class="k">as</span> <span class="n">jnp</span>
<span class="o">>>></span> <span class="kn">import</span> <span class="nn">stheno.jax</span> <span class="c1"># JAX extension for Stheno
</span></code></pre></div></div>
<p>We use JAX’s just-in-time (JIT) compiler <code class="language-plaintext highlighter-rouge">jax.jit</code> to compile <code class="language-plaintext highlighter-rouge">linear_model_denoise</code>:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">linear_model_denoise_jitted</span> <span class="o">=</span> <span class="n">jax</span><span class="p">.</span><span class="n">jit</span><span class="p">(</span><span class="n">linear_model_denoise</span><span class="p">)</span>
</code></pre></div></div>
<p>Let’s see what happens when we run <code class="language-plaintext highlighter-rouge">linear_model_denoise_jitted</code>.
We must pass <code class="language-plaintext highlighter-rouge">x_obs</code> and <code class="language-plaintext highlighter-rouge">y_obs</code> as JAX arrays to use the compiled version.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">linear_model_denoise_jitted</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y_obs</span><span class="p">))</span>
<span class="n">Invalid</span> <span class="n">argument</span><span class="p">:</span> <span class="n">Cannot</span> <span class="n">bitcast</span> <span class="n">types</span> <span class="k">with</span> <span class="n">different</span> <span class="n">bit</span><span class="o">-</span><span class="n">widths</span><span class="p">:</span> <span class="n">F64</span> <span class="o">=></span> <span class="n">S32</span><span class="p">.</span>
</code></pre></div></div>
<p>Oh no!
What went wrong is that the JIT compiler wasn’t able to deal with the complicated control flow from the automatic linear algebra simplifications.
Fortunately, there is a simple way around this:
we can run the function once with NumPy to see how the control flow should go, <em>cache that control flow</em>, and then use this cache to run <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> with JAX.
Sounds complicated, but it’s really just a bit of boilerplate:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">lab</span> <span class="k">as</span> <span class="n">B</span>
<span class="o">>>></span> <span class="n">control_flow_cache</span> <span class="o">=</span> <span class="n">B</span><span class="p">.</span><span class="n">ControlFlowCache</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">control_flow_cache</span>
<span class="o"><</span><span class="n">ControlFlowCache</span><span class="p">:</span> <span class="n">populated</span><span class="o">=</span><span class="bp">False</span><span class="o">></span>
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">populated=False</code> means that the cache is not yet populated.
Let’s populate it by running <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> once with NumPy:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">with</span> <span class="n">control_flow_cache</span><span class="p">:</span>
        <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">control_flow_cache</span>
<span class="o"><</span><span class="n">ControlFlowCache</span><span class="p">:</span> <span class="n">populated</span><span class="o">=</span><span class="bp">True</span><span class="o">></span>
</code></pre></div></div>
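<p>To build some intuition for what caching control flow means, here is a toy record-and-replay sketch in plain Python. It is only an illustration of the idea and not how LAB implements <code class="language-plaintext highlighter-rouge">B.ControlFlowCache</code>: the class <code class="language-plaintext highlighter-rouge">ToyControlFlowCache</code>, its <code class="language-plaintext highlighter-rouge">decide</code> method, and the helper <code class="language-plaintext highlighter-rouge">double_if_nonzero</code> are all hypothetical. The first run evaluates every data-dependent branch on concrete values and records the outcome; later runs read the outcomes back, so the data never has to be inspected again.</p>

```python
class ToyControlFlowCache:
    """Hypothetical record-and-replay cache; not LAB's actual implementation."""

    def __init__(self):
        self.outcomes = []       # Recorded branch decisions, in call order
        self.populated = False
        self._position = 0

    def __enter__(self):
        self._position = 0       # Every run starts from the first decision.
        if not self.populated:
            self.outcomes = []   # Discard leftovers from a failed recording run.
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is None:
            self.populated = True
        return False             # Never swallow exceptions.

    def decide(self, compute_condition):
        """Return a branch decision, computing it only during the recording run."""
        if self.populated:
            outcome = self.outcomes[self._position]  # Replay the stored decision.
        else:
            outcome = bool(compute_condition())      # Record from concrete data.
            self.outcomes.append(outcome)
        self._position += 1
        return outcome


def double_if_nonzero(cache, values, inspected):
    def condition():
        inspected.append(True)   # Track every time the data is actually inspected.
        return any(v != 0 for v in values)

    if cache.decide(condition):
        return [2 * v for v in values]
    return values


cache = ToyControlFlowCache()
inspected = []
with cache:   # Recording run on concrete data
    first = double_if_nonzero(cache, [1, 2, 3], inspected)
with cache:   # Replay run: the branch is taken from the cache
    second = double_if_nonzero(cache, [4, 5, 6], inspected)
print(first, second, len(inspected))
```

<p>The replay run follows the same branch as the recording run without looking at its inputs, which is what a tracing compiler needs. The flip side is that the cached decisions are only valid for inputs that would take the same branches.</p>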
<p>We now construct a compiled version of <code class="language-plaintext highlighter-rouge">linear_model_denoise</code> that uses the control flow cache:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">@</span><span class="n">jax</span><span class="p">.</span><span class="n">jit</span>
<span class="k">def</span> <span class="nf">linear_model_denoise_jitted</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">):</span>
    <span class="k">with</span> <span class="n">control_flow_cache</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
</code></pre></div></div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">linear_model_denoise_jitted</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y_obs</span><span class="p">))</span>
<span class="p">(</span><span class="n">DeviceArray</span><span class="p">([</span><span class="o">-</span><span class="mf">2.4981871</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.4980271</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.49786709</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.50149004</span><span class="p">,</span>
<span class="mf">5.50165005</span><span class="p">,</span> <span class="mf">5.50181005</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">),</span> <span class="n">DeviceArray</span><span class="p">([</span><span class="o">-</span><span class="mf">2.5069514</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.50679114</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.50663087</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.4927699</span> <span class="p">,</span>
<span class="mf">5.49292964</span><span class="p">,</span> <span class="mf">5.49308938</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">),</span> <span class="n">DeviceArray</span><span class="p">([</span><span class="o">-</span><span class="mf">2.4894228</span> <span class="p">,</span> <span class="o">-</span><span class="mf">2.48926306</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.48910332</span><span class="p">,</span> <span class="p">...,</span> <span class="mf">5.51021019</span><span class="p">,</span>
<span class="mf">5.51037046</span><span class="p">,</span> <span class="mf">5.51053072</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">float64</span><span class="p">))</span>
</code></pre></div></div>
<p>Nice!
Let’s see how much faster <code class="language-plaintext highlighter-rouge">linear_model_denoise_jitted</code> is:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">linear_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">)</span>
<span class="mi">233</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">12.6</span> <span class="n">ms</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">1</span> <span class="n">loop</span> <span class="n">each</span><span class="p">)</span>
<span class="o">>>></span> <span class="o">%</span><span class="n">timeit</span> <span class="n">linear_model_denoise_jitted</span><span class="p">(</span><span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">jnp</span><span class="p">.</span><span class="n">array</span><span class="p">(</span><span class="n">y_obs</span><span class="p">))</span>
<span class="mf">1.63</span> <span class="n">ms</span> <span class="err">±</span> <span class="mf">16.5</span> <span class="n">µs</span> <span class="n">per</span> <span class="n">loop</span> <span class="p">(</span><span class="n">mean</span> <span class="err">±</span> <span class="n">std</span><span class="p">.</span> <span class="n">dev</span><span class="p">.</span> <span class="n">of</span> <span class="mi">7</span> <span class="n">runs</span><span class="p">,</span> <span class="mi">1000</span> <span class="n">loops</span> <span class="n">each</span><span class="p">)</span>
</code></pre></div></div>
<p>The compiled function <code class="language-plaintext highlighter-rouge">linear_model_denoise_jitted</code> takes less than 2 ms to denoise 50k observations!
Compared to <code class="language-plaintext highlighter-rouge">linear_model_denoise</code>, that’s a speed-up of two orders of magnitude.</p>
<h2 id="conclusion">Conclusion</h2>
<p>We’ve seen how a linear model can be implemented with a Gaussian process probabilistic program (GPPP) using <a href="https://github.com/wesselb/stheno">Stheno</a>.
Stheno allows us to focus on model construction, and takes away the distraction of the technicalities that come with making predictions.
This flexibility, however, comes at the cost of some complicated machinery that happens in the background, such as structured representations of matrices.
Fortunately, we’ve seen that this overhead can be completely avoided by compiling your program using <a href="https://github.com/google/jax">JAX</a>, which can result in extremely efficient implementations.
To close this post and to warm you up for <a href="https://github.com/wesselb/stheno#examples">what’s further possible with Gaussian process probabilistic programming using Stheno</a>, the linear model that we’ve built can easily be extended to, for example, include a <em>quadratic</em> term:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">quadratic_model_denoise</span><span class="p">(</span><span class="n">x_obs</span><span class="p">,</span> <span class="n">y_obs</span><span class="p">):</span>
    <span class="n">prior</span> <span class="o">=</span> <span class="n">Measure</span><span class="p">()</span>
    <span class="n">a</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>  <span class="c1"># Model for slope
</span>    <span class="n">b</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>  <span class="c1"># Model for coefficient of quadratic term
</span>    <span class="n">c</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>  <span class="c1"># Model for offset
</span>    <span class="c1"># Noiseless quadratic model
</span>    <span class="n">f</span> <span class="o">=</span> <span class="n">a</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">b</span> <span class="o">*</span> <span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="n">x</span> <span class="o">**</span> <span class="mi">2</span><span class="p">)</span> <span class="o">+</span> <span class="n">c</span>
    <span class="n">noise</span> <span class="o">=</span> <span class="n">GP</span><span class="p">(</span><span class="n">Delta</span><span class="p">(),</span> <span class="n">measure</span><span class="o">=</span><span class="n">prior</span><span class="p">)</span>  <span class="c1"># Model for noise
</span>    <span class="n">y</span> <span class="o">=</span> <span class="n">f</span> <span class="o">+</span> <span class="mf">0.5</span> <span class="o">*</span> <span class="n">noise</span>  <span class="c1"># Noisy quadratic model
</span>
    <span class="n">post</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">|</span> <span class="p">(</span><span class="n">y</span><span class="p">(</span><span class="n">x_obs</span><span class="p">),</span> <span class="n">y_obs</span><span class="p">)</span>  <span class="c1"># Condition on observations.
</span>    <span class="n">pred</span> <span class="o">=</span> <span class="n">post</span><span class="p">(</span><span class="n">f</span><span class="p">(</span><span class="n">x_obs</span><span class="p">))</span>  <span class="c1"># Make predictions.
</span>    <span class="k">return</span> <span class="n">pred</span><span class="p">.</span><span class="n">marginals</span><span class="p">()</span>  <span class="c1"># Return the mean and associated error bounds.
</span></code></pre></div></div>
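<p>As a sanity check on what this model computes, here is a rough weight-space equivalent written directly in NumPy, without Stheno. It rests on a few assumptions of ours: that <code class="language-plaintext highlighter-rouge">GP(1)</code> and <code class="language-plaintext highlighter-rouge">GP(10)</code> act as random constants with prior variances 1 and 10, that the noise standard deviation is 0.5, and that the error bounds sit 1.96 standard deviations from the mean. The function name <code class="language-plaintext highlighter-rouge">quadratic_denoise_weight_space</code> and the synthetic data are made up for illustration.</p>

```python
import numpy as np


def quadratic_denoise_weight_space(x_obs, y_obs, noise_std=0.5):
    """Bayesian linear regression on the features [x, x**2, 1]."""
    features = np.stack([x_obs, x_obs**2, np.ones_like(x_obs)], axis=1)  # n x 3
    prior_var = np.array([1.0, 1.0, 10.0])  # Assumed prior variances of a, b, c

    # Posterior over the three weights: only a 3 x 3 matrix is ever inverted.
    precision = np.diag(1.0 / prior_var) + features.T @ features / noise_std**2
    weight_cov = np.linalg.inv(precision)
    weight_mean = weight_cov @ features.T @ y_obs / noise_std**2

    mean = features @ weight_mean
    # Marginal variances of f at the inputs: diag(features @ weight_cov @ features.T)
    var = np.einsum("ij,jk,ik->i", features, weight_cov, features)
    std = np.sqrt(var)
    return mean, mean - 1.96 * std, mean + 1.96 * std  # Assumed 95% bounds


# Synthetic data from a known quadratic, with noise of standard deviation 0.5
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 2000)
truth = 0.8 * x - 0.1 * x**2 + 2.0
y = truth + 0.5 * rng.standard_normal(x.shape)

mean, lower, upper = quadratic_denoise_weight_space(x, y)
print(np.max(np.abs(mean - truth)))  # Largest deviation from the noiseless curve
```

<p>Writing the model out by hand like this is manageable for three weights, but every new term means redoing the algebra, which is exactly the bookkeeping that Stheno takes over.</p>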
<p>To use Gaussian process probabilistic programming for your specific problem, the main challenge is to figure out which model you need to use.
Do you need a quadratic term?
Maybe you need an exponential term!
But once you have, implementing the model and making predictions with Stheno should be simple.</p>
<p>Wessel Bruinsma and James Requeima. Cross-posted at wesselb.github.io.</p>