Gaussian Processes: from one to many outputs

Authors: Eric Perim, Wessel Bruinsma, and Will Tebbutt

This is the first post in a three-part series we are preparing on multi-output Gaussian Processes. Gaussian Processes (GPs) are a popular tool in machine learning, and a technique that we routinely use in our work. Essentially, GPs are a powerful Bayesian tool for regression problems (and can be extended to classification problems through some modifications). As a Bayesian approach, GPs provide a natural and automatic mechanism to construct and calibrate uncertainties. Of course, getting well-calibrated uncertainties is not easy, and depends both on how well the model matches the data and on how much data is available. Predictions made using GPs are not just point predictions: they are whole probability distributions, which is convenient for downstream tasks. There are several good references for those interested in learning more about the benefits of Bayesian methods, from introductory blog posts to classical books.

In this post (and in the forthcoming ones in this series), we are going to assume that the reader has some level of familiarity with GPs in the single-output setting. We will try to keep the maths to a minimum, but will rely on mathematical notation whenever it helps make the message clear. For those who are interested in an introduction to GPs—or just a refresher—we point towards other resources. For a rigorous and in-depth introduction, the book by Rasmussen and Williams stands as one of the best references (and it is made freely available in electronic form by the authors).

We will start by discussing the extension of GPs from one to multiple dimensions, and review popular (and powerful) approaches from the literature. In the following posts, we will look further into some powerful tricks that bring improved scalability and will also share some of our code.

Multi-output GPs

While most people with a background in machine learning or statistics are familiar with GPs, it is not uncommon to have only encountered their single-output formulation. However, many interesting problems require the modelling of multiple outputs instead of just one. Fortunately, it is simple to extend single-output GPs to multiple outputs, and there are a few different ways of doing so. We will call these constructions multi-output GPs (MOGPs).

An example application of a MOGP might be to predict both temperature and humidity as a function of time. Sometimes we might want to include binary or categorical outputs as well, but in this article we will limit the discussion to real-valued outputs. (MOGPs are also sometimes called multi-task GPs, in which case an output is instead referred to as a task; the idea is the same.) Moreover, we will refer to inputs as time, as in the time-series setting, but all of the discussion here is valid for any kind of input.

The simplest way to extend GPs to multi-output problems is to model each of the outputs independently, with single-output GPs. We call this model IGPs (for independent GPs). While conceptually simple, computationally cheap, and easy to implement, this approach fails to account for correlations between outputs. If the outputs are correlated, knowing one can provide useful information about others (as we illustrate below), so assuming independence can hurt performance and, in many cases, relegates this approach to serving as a baseline.

To define a general MOGP, all we have to do is also specify how the outputs covary. Perhaps the simplest way of doing this is by prescribing an additional covariance function (kernel) over outputs, \(k_{\mathrm{o}}(i, j)\), which specifies the covariance between outputs \(i\) and \(j\). Combining this kernel over outputs with a kernel over inputs, e.g. \(k_{\mathrm{t}}(t, t')\), the full kernel of the MOGP is then given by

\[\begin{equation} k((i, t), (j, t')) = \operatorname{cov}(f_i(t), f_j(t')) = k_{\mathrm{o}}(i, j) k_{\mathrm{t}}(t, t'), \end{equation}\]

which says that the covariance between output \(i\) at input \(t\) and output \(j\) at input \(t'\) is equal to the product \(k_{\mathrm{o}}(i, j) k_{\mathrm{t}}(t, t')\). When the kernel \(k((i, t), (j, t'))\) is a product between a kernel over outputs \(k_{\mathrm{o}}(i, j)\) and a kernel over inputs \(k_{\mathrm{t}}(t, t')\), the kernel \(k((i, t), (j, t'))\) is called separable. In the general case, the kernel \(k((i, t), (j, t'))\) does not have to be separable, i.e. it can be any arbitrary positive-definite function.
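
To make this concrete, below is a minimal sketch (our own illustration, not code from any particular library) of a separable MOGP kernel in Python, where a small matrix B plays the role of \(k_{\mathrm{o}}\) evaluated on all pairs of outputs:

import numpy as np

# Covariance between outputs: k_o(i, j) = B[i, j] (an illustrative choice).
B = np.array([[1.0, 0.8],
              [0.8, 1.0]])

def k_t(t, t_prime, lengthscale=1.0):
    # Exponentiated-quadratic kernel over inputs.
    return np.exp(-0.5 * (t - t_prime) ** 2 / lengthscale ** 2)

def k(i, t, j, t_prime):
    # Separable MOGP kernel: a product of the output and input kernels.
    return B[i, j] * k_t(t, t_prime)

# Covariance between output 0 at time 1.0 and output 1 at time 1.5.
print(k(0, 1.0, 1, 1.5))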

In contrast to IGPs, general MOGPs do model correlations between outputs, which means that they are able to use observations from one output to better predict another. We illustrate this below by contrasting the predictions for two of the outputs in the EEG dataset, one observed and one not observed, using IGPs and another flavour of MOGPs, the ILMM, which we will discuss in detail in the next section. Unlike the independent GP (IGP), the ILMM is able to successfully predict F2 by exploiting the observations for F3 (and other outputs not shown).

Figure 1: Predictions for two of the outputs in the EEG dataset using two distinct MOGP approaches, the ILMM and the IGP. All outputs are modelled jointly, but we only plot two of them for clarity.

Equivalence to single-output GPs

An interesting thing to notice is that a general MOGP kernel is just another kernel, like those used in single-output GPs, but one that now operates on an extended input space (because it also takes in \(i\) and \(j\) as input). Mathematically, say one wants to model \(p\) outputs over some input space \(\mathcal{T}\). By also letting the index of the output be part of the input, we can construct this extended input space: \(\mathcal{T}_{\mathrm{ext}} = \{1,...,p\} \times \mathcal{T}\). Then, a multi-output Gaussian process (MOGP) can be defined via a mean function, \(m\colon \mathcal{T}_{\mathrm{ext}} \to \mathbb{R}\), and a kernel, \(k\colon \mathcal{T}_{\mathrm{ext}}^2 \to \mathbb{R}\). Under this construction it is clear that any property of single-output GPs immediately transfers to MOGPs, because MOGPs can simply be seen as single-output GPs on an extended input space.

An equivalent formulation of MOGPs can be obtained by stacking the multiple outputs into a vector, creating a vector-valued GP. It is sometimes helpful to view MOGPs from this perspective, in which the multiple outputs are viewed as one multidimensional output. We can use this equivalent formulation to define a MOGP via a vector-valued mean function, \(m\colon \mathcal{T} \to \mathbb{R}^p\), and a matrix-valued kernel, \(k\colon\mathcal{T}^2 \to \mathbb{R}^{p \times p}\). This mean function and kernel are not defined on the extended input space; rather, in this equivalent formulation, they produce multi-valued outputs. The vector-valued mean function corresponds to the mean of the vector-valued GP, \(m(t) = \mathbb{E}[f(t)]\), and the matrix-valued kernel to the covariance matrix of the vector-valued GP, \(k(t, t') = \mathbb{E}[(f(t) - m(t))(f(t') - m(t'))^\top]\). When the matrix-valued kernel is evaluated at \(t = t'\), the resulting matrix \(k(t, t) = \mathbb{E}[(f(t) - m(t))(f(t) - m(t))^\top]\) is sometimes called the instantaneous spatial covariance: it describes the covariance between different outputs at a given point in time \(t\).

Because MOGPs can be viewed as single-output GPs on an extended input space, inference works in exactly the same way. However, by extending the input space we exacerbate the scaling issues inherent to GPs, because the total number of observations is now the sum of the number of observations for each output, and GPs scale poorly in the number of observations. While inference in the single-output setting requires the inversion of an \(n \times n\) matrix (where \(n\) is the number of data points), in the case of \(p\) outputs, assuming that at all times all outputs are observed,1 this turns into the inversion of an \(np \times np\) matrix (assuming the same number of input points for each output as in the single-output case), which can quickly become computationally intractable (i.e. not feasible to compute with limited computing power and time). That is because the inversion of a \(q \times q\) matrix takes \(\mathcal{O}(q^3)\) time and \(\mathcal{O}(q^2)\) memory, meaning that time and memory requirements will scale, respectively, cubically and quadratically in the number of points in time, \(n\), and outputs, \(p\). In practice, this scaling limits the application of this general MOGP formulation to data sets with very few outputs and data points.
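
To see the scaling concretely, here is a small sketch (our own illustration) that builds the full \(np \times np\) covariance matrix for a separable kernel and performs the Cholesky factorisation that dominates the cost of inference; doubling \(n\) or \(p\) makes this step roughly eight times slower:

import numpy as np

n, p = 500, 2                            # n points in time, p outputs
ts = np.linspace(0, 10, n)
B = np.array([[1.0, 0.8], [0.8, 1.0]])   # Output covariance k_o(i, j).
K_t = np.exp(-0.5 * (ts[:, None] - ts[None, :]) ** 2)  # Input covariance.

# For a separable kernel, the covariance over the extended input space
# is the Kronecker product of the output and input covariances.
K = np.kron(B, K_t)                      # Shape (np, np).

# Inference requires factorising this matrix:
# O((np)^3) time and O((np)^2) memory.
L = np.linalg.cholesky(K + 1e-6 * np.eye(n * p))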

Low-rank approximations

A popular and powerful approach to making MOGPs computationally tractable is to impose a low-rank structure over the covariance between outputs. That is equivalent to assuming that the data can be described by a set of latent (unobserved) Gaussian processes, where the number of latent processes is smaller than the number of outputs. This builds a simpler, lower-dimensional representation of the data. The structure imposed over the covariance matrices through this lower-dimensional representation can be exploited to perform the inversion operation more efficiently (we are going to discuss one of these cases in detail in the next post of this series). There are a variety of ways in which this kind of structure can be imposed, leading to an interesting class of models which we discuss in the next section.

This kind of assumption is typically made to reduce the computational cost of the method. However, such assumptions also bring extra inductive bias to the model. Introducing inductive bias can be a powerful tool in making the model more data-efficient and better-performing in practice, provided that the assumptions are appropriate for the particular problem at hand. For instance, low-rank data occurs naturally in many settings, including electricity grids, due to the mathematical structure of the price-forming process. To make good choices about the kind of inductive bias to use, experience, domain knowledge, and familiarity with the data can be helpful.

The Mixing Model Hierarchy

Models that use a lower-dimensional representation for data have been present in the literature for a long time. Well-known examples include factor analysis, PCA, and VAEs. Even if we restrict ourselves to GP-based models, there is still a significant number of notable and powerful models that make this simplifying assumption in one way or another. However, models in the class of MOGPs that explain the data with a lower-dimensional representation are often framed in distinct ways, which may obscure their relationships and overarching structure. Thus, it is useful to look at all of these models in the same light, forming a well-defined family that highlights their similarities and differences.

In one of the appendices of a recent paper of ours,2 we presented what we called the Mixing Model Hierarchy (MMH), an attempt at a unifying presentation of this large class of MOGP models. It is important to stress that our goal with the Mixing Model Hierarchy is only to organise a large number of pre-existing models from the literature in a reasonably general way, and not to present new models nor claim ownership over important contributions made by other researchers. Further down in this article we present a diagram that connects several relevant papers to the models from the MMH.

The simplest model from the Mixing Model Hierarchy is what we call the Instantaneous Linear Mixing Model (ILMM), which, despite its simplicity, is still a rather general way of describing low-rank covariance structures. In this model the observations are described as a linear combination of the latent processes (i.e. unobserved stochastic processes), given by a set of constant weights. That is, we can write \(f(t) \mid x, H = Hx(t)\),3 where \(f\) are the observations, \(H\) is a matrix of weights, which we call the mixing matrix, and \(x(t)\) represents the (time-dependent) latent processes—note that we use \(x\) here to denote an unobserved (latent) stochastic process, not an input, which is represented by \(t\). If the latent processes \(x(t)\) are GPs, then, because linear combinations of Gaussian variables are also Gaussian, \(f(t)\) is also a (MO)GP.
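
As an illustration (a sketch of our own, with made-up sizes, rather than code from the paper), sampling from an ILMM prior amounts to sampling \(m\) latent GPs on a grid and mixing them with a constant matrix \(H\):

import numpy as np

n, p, m = 200, 5, 2                  # n time points, p outputs, m latents
ts = np.linspace(0, 10, n)

# Covariance of each latent GP over time (an EQ kernel, as an example).
K_t = np.exp(-0.5 * (ts[:, None] - ts[None, :]) ** 2)
L = np.linalg.cholesky(K_t + 1e-6 * np.eye(n))

x = L @ np.random.randn(n, m)        # Sample m independent latent processes.
H = np.random.randn(p, m)            # Constant mixing matrix.
f = x @ H.T                          # f(t) = H x(t): p correlated outputs.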

The top-left part of the figure below illustrates the graphical model for the ILMM.4 The graphical model highlights the two restrictions imposed by the ILMM when compared with a general MOGP: (i) the instantaneous spatial covariance of \(f\), \(\mathbb{E}[f(t) f^\top(t)] = H H^\top\), does not vary with time, because neither \(H\) nor the instantaneous covariance of the latent processes, \(K(t, t) = I_m\), varies with time; and (ii) the noise-free observation \(f(t)\) is a function of \(x(t')\) for \(t' = t\) only, meaning that, for example, \(f\) cannot be \(x\) with a delay or a smoothed version of \(x\). Reflecting this, we call the ILMM a time-invariant (due to (i)) and instantaneous (due to (ii)) MOGP.

Figure 2: Graphical model for different models in the MMH.4

There are three general ways in which we can generalise the ILMM within the MMH. The first one is to allow the mixing matrix \(H\) to vary in time. That means that the amount each latent process contributes to each output varies in time. Mathematically, \(H \in \mathbb{R}^{p \times m}\) becomes a matrix-valued function \(H\colon \mathcal{T} \to \mathbb{R}^{p \times m}\), and the mixing mechanism becomes

\[\begin{equation} f(t) \mid H, x = H(t) x(t). \end{equation}\]

We call such MOGP models time-varying. Their graphical model is shown in the figure above, on the top right corner.

A second way to generalise the ILMM is to assume that \(f(t)\) depends on \(x(t')\) for all \(t' \in \mathcal{T}\). That is to say that, at a given time, each output may depend on the values of the latent processes at any other time. We say that these models become non-instantaneous. The mixing matrix \(H \in \mathbb{R}^{p \times m}\) becomes a matrix-valued time-invariant filter \(H\colon \mathcal{T} \to \mathbb{R}^{p \times m}\), and the mixing mechanism becomes

\[\begin{equation} f(t) \mid H, x = \int H(t - \tau) x(\tau) \,\mathrm{d}\tau. \end{equation}\]

We call such MOGP models convolutional. Their graphical model is shown in the figure above, in the bottom left corner.

A third generalising assumption that we can make is that \(f(t)\) depends on \(x(t')\) for all \(t' \in \mathcal{T}\) and this relationship may vary with time. This is similar to the previous case, in that both models are non-instantaneous, but with the difference that this one is also time-varying. The mixing matrix \(H \in \mathbb{R}^{p \times m}\) becomes a matrix-valued time-varying filter \(H\colon \mathcal{T}\times\mathcal{T} \to \mathbb{R}^{p \times m}\), and the mixing mechanism becomes

\[\begin{equation} f(t) \mid H, x = \int H(t, \tau) x(\tau) \,\mathrm{d}\tau. \end{equation}\]

We call such MOGP models time-varying and convolutional. Their graphical model is shown in the figure above in the bottom right corner.
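
To make the most general mixing mechanism concrete, here is a rough sketch (our own discretisation, with a hypothetical filter, not code from the paper) that approximates the integral above with a sum over a regular grid:

import numpy as np

n, p, m = 100, 3, 2                  # grid size, outputs, latent processes
ts = np.linspace(0, 10, n)
dt = ts[1] - ts[0]

x = np.random.randn(n, m)            # Stand-in for samples of the latent GPs.

def H(t, tau):
    # Hypothetical time-varying filter: a window centred at tau = t.
    return np.exp(-((t - tau) ** 2)) * np.ones((p, m))

# Riemann-sum approximation of f(t) = int H(t, tau) x(tau) dtau.
f = np.stack([
    sum(H(t, tau) @ x[k] for k, tau in enumerate(ts)) * dt
    for t in ts
])                                   # Shape (n, p).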

Besides these generalising assumptions, a further extension is to adopt a prior over \(H\). Using such a prior allows for a principled way of further imposing inductive bias by, for instance, encouraging sparsity. This extension and the three generalisations discussed above together form what we call the Mixing Model Hierarchy (MMH), which is illustrated in the figure below. The MMH organises multi-output Gaussian process models according to their distinctive modelling assumptions. The figure below shows how twenty-one MOGP models from the machine learning and geostatistics literature can be recovered as special cases of the various generalisations of the ILMM.

Figure 3: Diagram relating several models from the literature to the MMH, based on their properties.

Naturally, these different members of the MMH vary in complexity, and each brings its own set of challenges. In particular, exact inference is computationally expensive, or even intractable, for many models in the MMH, requiring the use of approximate inference methods such as variational inference (VI) or even Markov chain Monte Carlo (MCMC). We recommend reading the original papers if you are interested in all the clever ways in which the authors perform inference efficiently.

Although the MMH defines a large and powerful family of models, not all multi-output Gaussian process models are covered by it. For example, Deep GPs and their variants are excluded because they transform the latent processes nonlinearly to generate the observations.

Conclusion

In this post we have briefly discussed how to extend regular, single-output Gaussian processes (GPs) to multi-output Gaussian processes (MOGPs), and argued that MOGPs are really just single-output GPs after all. We have also introduced the Mixing Model Hierarchy (MMH), which classifies a large number of models from the MOGP literature based on the way they generalise a particular base model, the Instantaneous Linear Mixing Model (ILMM).

In the next post of this series, we are going to discuss the ILMM in more detail and show how some simple assumptions can lead to a much more scalable model, which is applicable to extremely large systems that not even the simplest members of the MMH can tackle in general.

Notes

  1. When every output is observed at each time stamp, we call them fully observed outputs. If the outputs are not fully observed, only a subset of them might be available for certain times (for example due to faulty sensors). In this case, the number of data points will be smaller than \(np\), but will still scale proportionally with \(p\). Thus, the scaling issues will still be present. In the case where only a single output is observed at any given time, the number of observations will be \(n\), and the MOGP would have the same time and memory scaling as a single-output GP. 

  2. Bruinsma, Wessel, et al. “Scalable Exact Inference in Multi-Output Gaussian Processes.” International Conference on Machine Learning. PMLR, 2020.

  3. Here \(f(t)\mid x,H\) is used to denote the value of \(f(t)\) given a known \(H\) and \(x(t)\). 

  4. Although we illustrate the GPs in the graphical models as a Markov chain, that is just to improve clarity. In reality, GPs are much more general than Markov chains, as there is no conditional independence between timestamps.


How to Start Contributing to Open Source Software

Author: Miha Zgubic

If you are someone who feels comfortable using code to solve a problem, answer a question, or just implement something for fun, chances are you are relying on open source software. If you want to contribute to open source software, but don’t know how and where to start, this guide is for you.

In this post, we will first discuss some of the possible mental and technical barriers standing between you and your first meaningful contribution. Then, we will go through the steps involved in making the first contribution.

For the sake of brevity, we assume some basic familiarity with git (if you know how to commit changes and what a branch is, you’ll be fine). Because many open source projects are hosted on GitHub, we also discuss GitHub Issues, the GitHub functionality that allows discussion about bugs and new features, and the term “pull/merge request” (PR/MR; the two are the same thing), which is a mechanism for submitting your changes to be incorporated into the existing repository.

Mental and Technical Barriers

It is easy to observe the world of open source and think: “Oh look at all these smart people doing meaningful work and developing meaningful relationships, building humanity’s digital heritage that helps to run our civilisation. I wish I could join the party, but <something I believe to be true>.”

Some misconceptions that I and other people have had are listed below, along with what we think about them now.

“I need to be an expert in <the language> AND <the package> to even consider reporting a bug, making a fix, or implementing new functionality.”

If you are using the package and something doesn’t work as expected, you should report it, for example by opening a GitHub issue. You don’t need to know how the package works internally. All you need to do is take a quick look at the documentation to see if anything is written about your problem, and check if a similar issue has been opened already. If you want to fix something yourself, that’s fantastic! Just jump in and try to figure it out.

“I know about <a thing>, but other people know much more, and it’s much better if they implemented it.”

This is almost always the case, unless you are the world’s leading expert on <a thing>. However, those people are likely busy with other important work, and the issue may not be a priority for them. So either you implement it, or it may not get done at all. Go ahead! The best part, however, is that these more knowledgeable people will usually be happy to review your solution and suggest how to make it even better, which means that a great thing gets built and you learn something in the process.

“I don’t know anyone contributing to the package and they look like a team, isn’t it weird if I just jump in and open an issue/PR?”

No, it’s not weird. Teams make their code and issues public because they’re looking for new contributors like you.

“I won’t be able to create a perfect solution, and people will point out flaws and ask me to change it.”

Solutions to issues usually come in the form of pull requests. However, opening a PR is best thought of as a conversation about a solution, rather than as a finished product that is either approved or rejected. Experienced contributors often open PRs to solicit feedback about an idea, because an open PR on GitHub offers convenient tools for discussion about code. Even if you think your solution is complete, people will likely ask you to make changes, and that’s alright! If the reason for a requested change isn’t explicitly mentioned (it should be), ask why it is needed—these are valuable opportunities to learn.

“I would like to make a contribution, but don’t know where to start.”

Finding the right place to start can be challenging, but see advice below. Once you make a contribution or two they will lead you on to others, so you typically only have to overcome this barrier once.

“I know what is broken and I think I know how to fix it, but don’t know the steps to publish these to the official repository.”

That’s fantastic! See the rest of the guide.

“What if people ask for changes? How do I implement those?”

Somehow I thought implementing review feedback was hard and messy, but, in practice, it’s as easy as adding more commits to a branch.

Steps to First Contribution

Now that we have gone through some of the concerns you may have, here is the step-by-step guide to your first contribution.

1) Learn the mechanics of a pull request

The workflow is described in this excellent repository on GitHub, built just for learning the mechanics of making a pull request. I recommend not just reading it, but actually working through all the steps. That exercise should make you comfortable with the process.

2) Find something you want to fix

A good first project might be solving a bug that affects you, as that means you already have a test case and you will be more motivated to find a solution. However, if the bug is in a large and complicated library or requires a lot of code refactoring to fix, it is probably better to start somewhere else.

It may be more enjoyable to start with smaller or medium-sized packages because they can be easier to understand. When you find a package you would like to modify, make sure that it is the original (not a fork) and it is being maintained, which you can check by looking at the issues and pull requests on its GitHub page.

Then, look through the issues and see if there is something that you find interesting. Pay attention to good-first-issue labels, which indicate issues that people think are appropriate for first-time contributors. This usually means that they are nice and not too hard to solve. You don’t have to restrict yourself to issues with good-first-issue labels; feel free to tackle anything you feel motivated and able to do. Keep in mind that it might be better to start with a smaller PR and get that merged first, before tackling a bigger issue. You don’t want to submit a week’s worth of work only to find out that the package has been abandoned and there is nobody willing to review and merge your PR.

When you find an interesting issue and decide you want to work on it, it is a good idea to comment on the issue first and ask whether anyone is willing to review a potential PR. Commenting will also create a feeling of responsibility and ownership of the issue which will motivate you and help you finish the PR.

As a few ideas, here are some concrete Julia packages that Invenia is involved with, for example AWS.jl for interacting with AWS, Intervals.jl for working with intervals over ordered types, BlockDiagonals.jl for working with block-diagonal matrices, and ChainRules.jl for automatic differentiation. We are happy to help you contribute to these!

3) Implement your solution, and open a PR

While you should be familiar with the mechanics of pull requests after step 1, there are some additional social/etiquette considerations. Generally, the authors of open source packages are delighted when someone uses their package, opens issues about bugs or potential improvements, and especially when someone opens a pull request with a solution to a known problem. That said, they will appreciate it if you make things easy for them by linking the issue your PR solves and briefly explaining why you chose your approach. If you are unsure about whether something was a good choice, point it out in the description.

If your background isn’t in computer science or software engineering, you might not have heard of unit testing. Testing is a way of ensuring the correctness of code by checking its output for a number of inputs. In packages with unit tests, every new feature is typically expected to come with tests. When fixing a bug, you may also be expected to add a test that fails using the old code and passes using the new code.
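
For example, a regression test for a hypothetical bug fix might look like the following sketch (Python, pytest-style; the module and function names are made up for illustration):

# A regression test for a hypothetical bug where `mean_ignoring_nan`
# used to return NaN for inputs containing NaNs. This test fails on
# the old code and passes once the bug is fixed.
from mypackage.stats import mean_ignoring_nan  # Hypothetical module.

def test_mean_ignores_nan():
    assert mean_ignoring_nan([1.0, float("nan"), 3.0]) == 2.0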

You can make the review process more efficient and pleasant by quickly examining the PR yourself. What needs to be included in the PR depends on the issue at hand, but there are some general questions you can think about:

  • Does my code work in corner cases? Did I include reasonable tests?
  • Did I add or change documentation to match my changes?
  • Does my code formatting follow the rest of the package? Some packages follow code style guides, such as PEP8 or BlueStyle.
  • Did I include any lines or files by mistake?

Don’t worry if you can’t answer these questions. It is perfectly fine to ask! You can also self-review your PR and add some thoughts as comments to the code.

The contributor’s guide on collaborative practices is a great resource on best practices for collaboration on open source projects. The packages that follow it display the ColPrac badge. Other packages typically have their own guidelines, outlined in a file named CONTRIBUTING.md.

4) Address feedback, and wait for the merge

Once the PR is submitted someone will likely respond to it in a few days. If it doesn’t happen, feel free to “bump” it by adding a comment to the PR asking for it to be reviewed. Most maintainers do not mind if you bump the PR every ten days or so, and in fact find it useful in case it has slipped under their radar.

Ideally the feedback will be constructive, actionable, and educational. Sometimes it isn’t, and if you are very unlucky the reviewer might come across as stern and critical. It helps to remember that such feedback might have been unintentional and that you are in fact on the same side, both wanting the best code to be merged. Using plural first-person pronouns (we) is a good way to convey this sentiment (and remind the reviewer about it), for example: “What are the benefits if we implement <a feature> in this way?” is better than “Why do you think <a feature> should be implemented this way?”.

Addressing feedback is easy: simply add more commits to the branch that the PR is for. Once you think you have addressed all the feedback, let the reviewer know explicitly, as they don’t know whether you plan to add more commits or not. If all went well the reviewer will then merge the PR, hooray!

Conclusions

Unless you have started programming very recently, you likely already have the technical/programming ability to contribute to open source projects. You have valuable contributions to make, but psychological/sociological barriers may be holding you back. Hopefully reading this post will help you overcome them, and we look forward to welcoming you to the community and seeing what you come up with!


Navigating Your First Technical Internship

Author: Tom Wright

During the final weeks of my internship with Invenia, while looking back on my time here, I had the idea to share some thoughts on my experience and what advice I wish I had been given leading up to my start date. There are countless recipes for a successful internship, and I hope what follows can help guide you towards making yours great.

First and foremost, congratulations on landing your first technical internship! After hours of fine-tuning your resume, writing countless cover letters, and (if you were in the same boat I was in) facing your fair share of rejections, you’ve made it. Before digging into the rest of this post, take a moment to be proud of the work you’ve put in and appreciate where it has led you. If perhaps you are reading this in the middle of your application process and have yet to receive an offer, don’t worry. The first summer I applied for technical internships, out of roughly 20 applications sent, I received a single interview and zero offers. It never feels like it is going to happen until it does. Here’s a useful guide if you’re looking for tips on the application process.

Hopefully the advice that follows in this post will give you some insight that helps you make a strong impression throughout the course of your internship. Some more technical subjects or toolkits will be mentioned for the sake of example, but we won’t get into the details of these in this post. There are plenty of online resources that provide great introductions to git, the cloud, and countless other new topics you might encounter.

Before you start

The most important thing you can do leading up to your internship is try to relax. It’s common to feel a nervous excitement as your first day approaches. Remember that you will have plenty of time (give or take 40 hours per week) to focus on this soon enough, so enjoy your other hobbies and interests while you still can.

If all distractions fail, a great way to spend any built-up excitement is to do some research. You likely already have a good sense of what the company does, its general culture, and what your role might entail from the application and interview processes. Feel free to dig deeper into these topics. See if there’s information on the company website about your coworkers, the impact your work might drive, and the industry as a whole. This will help you feel more prepared, and settle some jitters, in the weeks leading up to your start.

Hitting the ground running

The day has finally arrived, you’ve made it! Surely your first week will be filled with lots of meetings, introductions, and learning. My advice for this information overload is about as basic as it gets: take notes, and lots of them. Ironically for a software engineer, I am a big fan of handwriting my notes. I also think that, from the perspective of the person speaking, it looks better to see someone writing notes by hand than typing away at a laptop whose screen they cannot see.

In the same vein, any buzzwords you hear (any term, phrase, acronym or title that you don’t fully understand) should be added to a personal dictionary. This was really important for me, and I referred to it weekly throughout my 8-month internship. You should keep this up throughout the entire internship; however, the first two weeks are likely when it will grow the most. What is a Docker container, and what makes containers helpful? Who knows what [insert some arbitrary industry acronym here] stands for? These are questions that are super easy to answer by asking coworkers or even the internet. Maintaining your own shortlist of important definitions will help fast-track your learning and be a great tool to pass along to new interns who join during your stay.

One of the best ways you can get to know the company and those you are working with is simply to reach out to your coworkers. This can be challenging depending on the size of the company you are with. It becomes even more intimidating if you are in the middle of a global pandemic and are onboarding from home. Try connecting via Slack or email, or set up calls to get to know those both in and outside of your team. Learning about the different teams and individuals’ career paths while building your network is one of the best ways to make the most of your internship, both personally and professionally. Reaching out like this can be intimidating, especially early on, but it will show that you are interested in the company as a whole and highlight that you have strong initiative. People are generally more than happy to talk about their work and personal interests. There is no reason not to start this during your first couple of weeks.

The Bulk

Throughout my internship, I discovered three main takeaways that should be considered by anyone working through an internship, especially if it is your first.

1. Who is this imposter?

Imposter Syndrome is the feeling of doubting your own skills and accomplishments, and thinking of yourself as some kind of fraud. For me, this manifested itself as a stream of questions such as When will they realize they’ve made a huge mistake in hiring me? and thoughts similar to I am definitely not smart enough for this! Temper this by remembering that they hired you for a reason. After interviewing many candidates and having spent hours speaking to you and testing your skills, they picked you. This reminder certainly won’t make all these anxieties disappear, but it can hopefully mitigate any unnecessary stress. Being a little nervous can help motivate you to work hard and push you to succeed. It is worth remembering that all your coworkers likely went through the same experience, or even still feel this way from time to time. Feel free to get their insights or experience with this if you feel comfortable doing so.

These thoughts and feelings might start well before your first day and last well into the internship itself, as they did in my case. It will improve with time and experience. Until that happens just remember to use your available sources of support and try to translate it into motivation.

2. Asking for help

A big fear I had during my internship was coming across as naive and inexperienced. I was very worried about asking a question and getting “How do you not know that? Don’t you know anything?” as a response. While this is certainly a normal thought process, it is misguided for a few reasons. First off, my coworkers are great people, as I am sure are yours. The odds of someone saying that are slim to none, and if they do, it tells you a lot more about them than it does about you. Secondly, and this is an important thing to keep in mind: no one expects you to know everything and be great at everything, especially early on as an intern. Asking questions is an important part of learning and internships are no exception. This one comes in two parts: when and how to ask for help.

Let me save you the time I wasted trying to figure out the perfect point at which to ask a question. While you definitely do not want to just ask before thinking or doing any digging yourself, no one wants you endlessly spinning your wheels. Take the time to think about the problem, see if there are any reliable resources or answers in some documentation, and attempt a couple of solutions, but don’t fuss until the end of time out of fear of looking dumb.

How you ask for help is the easy one. You’ve done all the work already. Avoid questions like “Hey how do you do [insert problem]?”, or even worse “Hey I know this is probably SUPER stupid but I don’t get [insert problem here] haha!”. Do say something along the lines of “Hey [insert name]. I have been trying to figure out how to solve [insert problem] and seem to be stuck. I have tried [insert attempted solutions] with no success and was hoping you could point me in the right direction.” You can also frame it as a leading question, such as “So in order to do X, we have to do Y because of Z?” It doesn’t have to be a lengthy breakdown of every thought you had; it really shouldn’t be. People generally prefer concise messages, so just show that you have put some thought and effort into it.

3. Make your voice heard

The scope of this point may vary depending on the size of your company, its culture and your role, however, the point remains the same. Share your thoughts on what you are working on. Share what interests you as subjects, both within and outside your role.

I had the incredible opportunity to contribute to my organization in ways beyond the scope of my job description. Yes, this is because of the supportive nature of the company and the flexibility that comes with working at a smaller organization, but it also would not have happened had I not shared what I am interested in. I gained fantastic experience in my role, but also developed an appreciation and better understanding for other work being done at the company.

Share your thoughts at the weekly team meeting, don’t be afraid to review code or improve documentation, and bounce ideas off coworkers during coffee breaks (oh, and try not to drink too much coffee). They hired you in large part for your brain, so don’t be afraid to use it!

Final Impressions

You’ve made it to the final few weeks of your internship, congrats! Hopefully you have had a fantastic experience, learning a lot and making lasting relationships. Now is the time to think of who you would like to connect with for a chat or call before you finish. This can be for any number of reasons: giving a little extra thanks, asking for career advice, or even just saying farewell to the friends you’ve made along the way!

Regardless of whether or not you follow any of this advice, I wish you the best of luck in your internship. While the advice above worked well for me, it is by no means a one-size-fits-all magical recipe to the perfect internship. There will certainly be hurdles along the way, anxieties to overcome, and inevitable mistakes made, all of which will contribute to making your internship a great learning experience. Good luck and enjoy the ride.


Linear Models from a Gaussian Process Point of View with Stheno and JAX

Author: Wessel Bruinsma and James Requeima

Cross-posted at wesselb.github.io.

A linear model prescribes a linear relationship between inputs and outputs. Linear models are amongst the simplest of models, but they are ubiquitous across science. A linear model with Gaussian distributions on the coefficients forms one of the simplest instances of a Gaussian process. In this post, we will give a brief introduction to linear models from a Gaussian process point of view. We will see how a linear model can be implemented with Gaussian process probabilistic programming using Stheno, and how this model can be used to denoise noisy observations. (Disclosure: Will Tebbutt and Wessel are the authors of Stheno; Will maintains a Julia version.) In short, probabilistic programming is a programming paradigm that brings powerful probabilistic models to the comfort of your programming language, which often comes with tools to automatically perform inference (make predictions). We will also use JAX’s just-in-time compiler to make our implementation extremely efficient.

Linear Models from a Gaussian Process Point of View

Consider a data set \((x_i, y_i)_{i=1}^n \subseteq \mathbb{R} \times \mathbb{R}\) consisting of \(n\) real-valued input–output pairs. Suppose that we wish to estimate a linear relationship between the inputs and outputs:

\[\label{eq:ax_b} y_i = a \cdot x_i + b + \epsilon_i,\]

where \(a\) is an unknown slope, \(b\) is an unknown offset, and \(\epsilon_i\) is some error/noise associated with the observation \(y_i\). To implement this model with Gaussian process probabilistic programming, we need to cast the problem into a functional form. This means that we will assume that there is some underlying, random function \(y \colon \mathbb{R} \to \mathbb{R}\) such that the observations are evaluations of this function: \(y_i = y(x_i)\). The model for the random function \(y\) will embody the structure of the linear model \eqref{eq:ax_b}. This may sound hard, but it is not difficult at all. We let the random function \(y\) be of the following form:

\[\label{eq:ax_b_functional} y(x) = a(x) \cdot x + b(x) + \epsilon(x),\]

where \(a\colon \mathbb{R} \to \mathbb{R}\) is a random constant function. An example of a constant function \(f\) is \(f(x) = 5\). Random means that the value \(5\) is not fixed, but modelled with a random value drawn from some probability distribution, because we don’t know the true value. We let \(b\colon \mathbb{R} \to \mathbb{R}\) also be a random constant function, and \(\epsilon\colon \mathbb{R} \to \mathbb{R}\) a random noise function. Do you see the similarities between \eqref{eq:ax_b} and \eqref{eq:ax_b_functional}? If all that doesn’t fully make sense, don’t worry; things should become clearer as we implement the model.
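
As a quick aside (our own derivation, not part of the original exposition), this functional form directly induces the kernel of \(y\). If \(a\), \(b\), and \(\epsilon\) are independent and zero-mean Gaussian, with variances \(\sigma_a^2\), \(\sigma_b^2\), and \(\sigma_\epsilon^2\) respectively, then

\[\operatorname{cov}(y(x), y(x')) = \sigma_a^2 \, x x' + \sigma_b^2 + \sigma_\epsilon^2 \, \delta[x = x'],\]

which is exactly the kernel that the program we build below constructs, with \(\sigma_a^2 = 1\), \(\sigma_b^2 = 10\), and \(\sigma_\epsilon^2 = 0.25\).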

To model random constant functions and random noise functions, we will use Stheno, which is a Python library for Gaussian process modelling. We also have a Julia version, but in this post we’ll use the Python version. To install Stheno, run the command

pip install --upgrade --upgrade-strategy eager stheno

In Stheno, a Gaussian process can be created with GP(kernel), where kernel is the so-called kernel or covariance function of the Gaussian process. The kernel determines the properties of the function that the Gaussian process models. For example, the kernel EQ() models smooth functions, and the kernel Matern12() models functions that look jagged. See the kernel cookbook for an overview of commonly used kernels and the documentation of Stheno for the corresponding classes. For constant functions, you can simply set the kernel to a constant, for example 1, which then models the constant function with a value drawn from \(\mathcal{N}(0, 1)\). (By default, in Stheno, all means are zero; but, if you like, you can also set a mean.)

Let’s start out by creating a Gaussian process for the random constant function \(a(x)\) that models the slope.

>>> from stheno import GP

>>> a = GP(1)

>>> a
GP(0, 1)

You can see how the Gaussian process looks by simply sampling from it. To sample from the Gaussian process a at some inputs x, evaluate it at those inputs, a(x), and call the method sample: a(x).sample(). This shows that you can really think of a Gaussian process just like you think of a function: pass it some inputs to get (the model for) the corresponding outputs.

>>> import numpy as np

>>> import matplotlib.pyplot as plt

>>> x = np.linspace(0, 10, 100)

>>> plt.plot(x, a(x).sample(20)); plt.show()

Samples of a Gaussian process that models a constant function Figure 1: Samples of a Gaussian process that models a constant function.

We’ve sampled a bunch of constant functions. Sweet! The next step in the model \eqref{eq:ax_b_functional} is to multiply the slope function \(a(x)\) by \(x\). To multiply a by \(x\), we multiply a by the function lambda x: x, which represents \(x\) as a function (the identity):

>>> f = a * (lambda x: x)

>>> f
GP(0, <lambda>)

This will give rise to functions like \(x \mapsto 0.1x\) and \(x \mapsto -0.4x\), depending on the value that \(a(x)\) takes.

>>> plt.plot(x, f(x).sample(20)); plt.show()

Samples of a Gaussian process that models functions with a random slope Figure 2: Samples of a Gaussian process that models functions with a random slope.

This is starting to look good! The only ingredient that is missing is an offset. We model the offset just like the slope, but here we set the kernel to 10 instead of 1, which models the offset with a value drawn from \(\mathcal{N}(0, 10)\).

>>> b = GP(10)

>>> f = a * (lambda x: x) + b
AssertionError: Processes GP(0, <lambda>) and GP(0, 10 * 1) are associated to different measures.

Something went wrong. Stheno has an abstraction called measures, where only GPs that are part of the same measure can be combined into new GPs; the abstraction of measures is there to keep things safe and tidy. What goes wrong here is that a and b are not part of the same measure. Let’s explicitly create a new measure and attach a and b to it.

>>> from stheno import Measure

>>> prior = Measure()

>>> a = GP(1, measure=prior)

>>> b = GP(10, measure=prior)

>>> f = a * (lambda x: x) + b

>>> f
GP(0, <lambda> + 10 * 1)

Let’s see what samples from f look like.

>>> plt.plot(x, f(x).sample(20)); plt.show()

Samples of a Gaussian process that models linear functions Figure 3: Samples of a Gaussian process that models linear functions.

Perfect! We will use f as our linear model.

In practice, observations are corrupted with noise. We can add some noise to the lines in Figure 3 by adding a Gaussian process that models noise. You can construct such a Gaussian process by using the kernel Delta(), which models the noise with independent \(\mathcal{N}(0, 1)\) variables.

>>> from stheno import Delta

>>> noise = GP(Delta(), measure=prior)

>>> y = f + noise

>>> y
GP(0, <lambda> + 10 * 1 + Delta())

>>> plt.plot(x, y(x).sample(20)); plt.show()

Samples of a Gaussian process that models noisy linear functions Figure 4: Samples of a Gaussian process that models noisy linear functions.

That looks more realistic, but perhaps that’s a bit too much noise. We can reduce the amount of noise, for example, by scaling noise by 0.5.

>>> y = f + 0.5 * noise

>>> y
GP(0, <lambda> + 10 * 1 + 0.25 * Delta())

>>> plt.plot(x, y(x).sample(20)); plt.show()

Samples of a Gaussian process that models noisy linear functions Figure 5: Samples of a Gaussian process that models noisy linear functions.

Much better.

To summarise, our linear model is given by

prior = Measure()

a = GP(1, measure=prior)            # Model for slope
b = GP(10, measure=prior)           # Model for offset
f = a * (lambda x: x) + b           # Noiseless linear model

noise = GP(Delta(), measure=prior)  # Model for noise
y = f + 0.5 * noise                 # Noisy linear model

We call a program like this a Gaussian process probabilistic program (GPPP). Let’s generate some noisy synthetic data, (x_obs, y_obs), that will make up an example data set \((x_i, y_i)_{i=1}^n\). We also save the observations without noise added—f_obs—so we can later check how good our predictions really are.

>>> x_obs = np.linspace(0, 10, 50_000)

>>> f_obs = 0.8 * x_obs - 2.5

>>> y_obs = f_obs + 0.5 * np.random.randn(50_000)

>>> plt.scatter(x_obs, y_obs); plt.show()

Some observations Figure 6: Some observations.

We will see next how we can fit our model to this data.

Inference in Linear Models

Suppose that we wish to remove the noise from the observations in Figure 6. We carefully phrase this problem in terms of our GPPP: the observations y_obs are realisations of the noisy linear model y at x_obs—realisations of y(x_obs)—and we wish to make predictions for the noiseless linear model f at x_obs—predictions for f(x_obs).

In Stheno, we can make predictions based on observations by conditioning the measure of the model on the observations. In our GPPP, the measure is given by prior, so we aim to condition prior on the observations y_obs for y(x_obs). Mathematically, this process of incorporating information by conditioning happens through Bayes’ rule. Programmatically, we first make an Observations object, which represents the information—the observations—that we want to incorporate, and then condition prior on this object:

>>> from stheno import Observations

>>> obs = Observations(y(x_obs), y_obs)

>>> post = prior.condition(obs)

You can also more concisely perform these two steps at once, as follows:

>>> post = prior | (y(x_obs), y_obs)

This mimics the mathematical notation used for conditioning.

With our updated measure post, which is often called the posterior measure, we can make a prediction for f(x_obs) by passing f(x_obs) to post:

>>> pred = post(f(x_obs))

>>> pred.mean
<dense matrix: shape=50000x1, dtype=float64
 mat=[[-2.498]
      [-2.498]
      [-2.498]
      ...
      [ 5.501]
      [ 5.502]
      [ 5.502]]>

>>> pred.var
<low-rank matrix: shape=50000x50000, dtype=float64, rank=2
 left=[[1.e+00 0.e+00]
       [1.e+00 2.e-04]
       [1.e+00 4.e-04]
       ...
       [1.e+00 1.e+01]
       [1.e+00 1.e+01]
       [1.e+00 1.e+01]]
 middle=[[ 2.001e-05 -2.995e-06]
         [-2.997e-06  6.011e-07]]
 right=[[1.e+00 0.e+00]
        [1.e+00 2.e-04]
        [1.e+00 4.e-04]
        ...
        [1.e+00 1.e+01]
        [1.e+00 1.e+01]
        [1.e+00 1.e+01]]>

The prediction pred is a multivariate Gaussian distribution with a particular mean and variance, which are displayed above. You should view post as a function that assigns a probability distribution—the prediction—to every part of our GPPP, like f(x_obs). Note that the variance of the prediction is a massive matrix of size 50k \(\times\) 50k. Under the hood, Stheno uses structured representations for matrices to compute and store matrices in an efficient way.

Let’s see what the prediction pred for f(x_obs) looks like. The prediction pred exposes the method marginals, which conveniently computes the mean and the associated lower and upper error bounds for you.

>>> mean, error_bound_lower, error_bound_upper = pred.marginals()

>>> mean
array([-2.49818708, -2.49802708, -2.49786708, ...,  5.50148996,
        5.50164996,  5.50180997])

>>> error_bound_upper - error_bound_lower
array([0.01753381, 0.01753329, 0.01753276, ..., 0.01761883, 0.01761935,
       0.01761988])

The error is very small—on the order of \(10^{-2}\)—which means that Stheno predicted f(x_obs) with high confidence.

>>> plt.scatter(x_obs, y_obs); plt.plot(x_obs, mean); plt.show()

Mean of the prediction (blue line) for the denoised observations Figure 7: Mean of the prediction (blue line) for the denoised observations.

The blue line in Figure 7 shows the mean of the predictions. This line appears to nicely pass through the observations with the noise removed. But let’s see how good the predictions really are by comparing to f_obs, which we previously saved.

>>> f_obs - mean
array([-0.00181292, -0.00181292, -0.00181292, ..., -0.00180997,
       -0.00180997, -0.00180997])

>>> np.mean((f_obs - mean) ** 2)  # Compute the mean square error.
3.281323087544209e-06

That’s pretty close! Not bad at all.

We wrap up this section by encapsulating everything that we’ve done so far in a function linear_model_denoise, which denoises noisy observations from a linear model:

def linear_model_denoise(x_obs, y_obs):
    prior = Measure()
    a = GP(1, measure=prior)            # Model for slope
    b = GP(10, measure=prior)           # Model for offset
    f = a * (lambda x: x) + b           # Noiseless linear model
    noise = GP(Delta(), measure=prior)  # Model for noise
    y = f + 0.5 * noise                 # Noisy linear model

    post = prior | (y(x_obs), y_obs)    # Condition on observations.
    pred = post(f(x_obs))               # Make predictions.
    return pred.marginals()             # Return the mean and associated error bounds.

>>> linear_model_denoise(x_obs, y_obs)
(array([-2.49818708, -2.49802708, -2.49786708, ...,  5.50148996,
        5.50164996,  5.50180997]), array([-2.50695399, -2.50679372, -2.50663346, ...,  5.49268055,
        5.49284029,  5.49300003]), array([-2.48942018, -2.48926044, -2.4891007 , ...,  5.51029937,
        5.51045964,  5.51061991]))

>>> %timeit linear_model_denoise(x_obs, y_obs)
233 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

To denoise 50k observations, linear_model_denoise takes about 250 ms. Not terrible, but we can do much better, which is important if we want to scale to larger numbers of observations. In the next section, we will make this function really fast.

Making Inference Fast

To make linear_model_denoise fast, firstly, the linear algebra that happens under the hood when linear_model_denoise is called should be simplified as much as possible. Fortunately, this happens automatically, due to the structured representation of matrices that Stheno uses. For example, when making predictions with Gaussian processes, the main computational bottleneck is usually the construction and inversion of y(x_obs).var, the variance associated with the observations:

>>> y(x_obs).var
<Woodbury matrix: shape=50000x50000, dtype=float64
 diag=<diagonal matrix: shape=50000x50000, dtype=float64
       diag=[0.25 0.25 0.25 ... 0.25 0.25 0.25]>
 lr=<low-rank matrix: shape=50000x50000, dtype=float64, rank=2
     left=[[1.e+00 0.e+00]
           [1.e+00 2.e-04]
           [1.e+00 4.e-04]
           ...
           [1.e+00 1.e+01]
           [1.e+00 1.e+01]
           [1.e+00 1.e+01]]
     middle=[[10.  0.]
             [ 0.  1.]]
     right=[[1.e+00 0.e+00]
            [1.e+00 2.e-04]
            [1.e+00 4.e-04]
            ...
            [1.e+00 1.e+01]
            [1.e+00 1.e+01]
            [1.e+00 1.e+01]]>>

Indeed, observe that this matrix has a particular structure: it is the sum of a diagonal and a low-rank matrix. In Stheno, the sum of a diagonal and a low-rank matrix is called a Woodbury matrix, because the Sherman–Morrison–Woodbury formula can be used to efficiently invert it. Let’s see how long it takes to construct y(x_obs).var and then invert it. We invert y(x_obs).var using LAB, which is automatically installed alongside Stheno and exposes an API for working efficiently with structured matrices.
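
For reference, the identity in question (standard linear algebra, not specific to Stheno) expresses the inverse of a diagonal-plus-low-rank matrix in terms of a much smaller inversion:

\[(D + UCV)^{-1} = D^{-1} - D^{-1} U (C^{-1} + V D^{-1} U)^{-1} V D^{-1},\]

where inverting the diagonal matrix \(D\) is trivial, and the matrix \(C^{-1} + V D^{-1} U\) that actually needs to be inverted is tiny: the low-rank part here has rank 2, so it is only \(2 \times 2\).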

>>> import lab as B

>>> %timeit B.inv(y(x_obs).var)
28.5 ms ± 1.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

That’s only 30 ms! Not bad, for such a big matrix. Without exploiting structure, a 50k \(\times\) 50k matrix takes 20 GB of memory to store and about an hour to invert.

Secondly, we would like the code implemented by linear_model_denoise to be as efficient as possible. To achieve this, we will use JAX to compile linear_model_denoise with XLA, which generates blazingly fast code. We start out by importing JAX and loading the JAX extension of Stheno.

>>> import jax

>>> import jax.numpy as jnp

>>> import stheno.jax  # JAX extension for Stheno

We use JAX’s just-in-time (JIT) compiler jax.jit to compile linear_model_denoise:

>>> linear_model_denoise_jitted = jax.jit(linear_model_denoise)

Let’s see what happens when we run linear_model_denoise_jitted. We must pass x_obs and y_obs as JAX arrays to use the compiled version.

>>> linear_model_denoise_jitted(jnp.array(x_obs), jnp.array(y_obs))
Invalid argument: Cannot bitcast types with different bit-widths: F64 => S32.

Oh no! What went wrong is that the JIT compiler wasn’t able to deal with the complicated control flow from the automatic linear algebra simplifications. Fortunately, there is a simple way around this: we can run the function once with NumPy to see how the control flow should go, cache that control flow, and then use this cache to run linear_model_denoise with JAX. Sounds complicated, but it’s really just a bit of boilerplate:

>>> import lab as B

>>> control_flow_cache = B.ControlFlowCache()

>>> control_flow_cache
<ControlFlowCache: populated=False>

Here populated=False means that the cache is not yet populated. Let’s populate it by running linear_model_denoise once with NumPy:

>>> with control_flow_cache:
        linear_model_denoise(x_obs, y_obs)

>>> control_flow_cache
<ControlFlowCache: populated=True>

We now construct a compiled version of linear_model_denoise that uses the control flow cache:

@jax.jit
def linear_model_denoise_jitted(x_obs, y_obs):
    with control_flow_cache:
        return linear_model_denoise(x_obs, y_obs)

>>> linear_model_denoise_jitted(jnp.array(x_obs), jnp.array(y_obs))
(DeviceArray([-2.4981871 , -2.4980271 , -2.49786709, ...,  5.50149004,
              5.50165005,  5.50181005], dtype=float64), DeviceArray([-2.5069514 , -2.50679114, -2.50663087, ...,  5.4927699 ,
              5.49292964,  5.49308938], dtype=float64), DeviceArray([-2.4894228 , -2.48926306, -2.48910332, ...,  5.51021019,
              5.51037046,  5.51053072], dtype=float64))

Nice! Let’s see how much faster linear_model_denoise_jitted is:

>>> %timeit linear_model_denoise(x_obs, y_obs)
233 ms ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

>>> %timeit linear_model_denoise_jitted(jnp.array(x_obs), jnp.array(y_obs))
1.63 ms ± 16.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

The compiled function linear_model_denoise_jitted takes only about 2 ms to denoise 50k observations! Compared to linear_model_denoise, that’s a speed-up of two orders of magnitude.

Conclusion

We’ve seen how a linear model can be implemented with a Gaussian process probabilistic program (GPPP) using Stheno. Stheno allows us to focus on model construction, and takes away the distraction of the technicalities that come with making predictions. This flexibility, however, comes at the cost of some complicated machinery that happens in the background, such as structured representations of matrices. Fortunately, we’ve seen that this overhead can be completely avoided by compiling your program using JAX, which can result in extremely efficient implementations. To close this post and to warm you up for what’s further possible with Gaussian process probabilistic programming using Stheno, the linear model that we’ve built can easily be extended to, for example, include a quadratic term:

def quadratic_model_denoise(x_obs, y_obs):
    prior = Measure()
    a = GP(1, measure=prior)            # Model for slope
    b = GP(1, measure=prior)            # Model for coefficient of quadratic term
    c = GP(10, measure=prior)           # Model for offset
    # Noiseless quadratic model
    f = a * (lambda x: x) + b * (lambda x: x ** 2) + c
    noise = GP(Delta(), measure=prior)  # Model for noise
    y = f + 0.5 * noise                 # Noisy quadratic model

    post = prior | (y(x_obs), y_obs)    # Condition on observations.
    pred = post(f(x_obs))               # Make predictions.
    return pred.marginals()             # Return the mean and associated error bounds.
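
Assuming x_obs and y_obs are defined as before, calling the quadratic model looks exactly like calling the linear one, returning the marginal means together with the associated error bounds:

>>> mean, lower, upper = quadratic_model_denoise(x_obs, y_obs)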

To use Gaussian process probabilistic programming for your specific problem, the main challenge is to figure out which model you need to use. Do you need a quadratic term? Maybe you need an exponential term! But, using Stheno, implementing the model and making predictions should then be simple.


A Gentle Introduction to Power Flow

Author: Letif Mones

Although governed by simple physical laws, power grids are among the most complex human-made systems. The main source of the complexity is the large number of components of the power systems that interact with each other: one needs to maintain a balance between power injections and withdrawals while satisfying certain physical, economic, and environmental conditions. For instance, a central task of daily planning and operations of electricity grid operators [1] is to dispatch generation in order to meet demand at minimum cost, while respecting reliability and security constraints. These tasks require solving a challenging constrained optimization problem, often referred to as some form of optimal power flow (OPF) [2].

In a series of two blog posts, we are going to discuss the basics of power flow and optimal power flow problems. In this first post, we focus on the most important component of OPF: the power flow (PF) equations. For this, first we introduce some basic definitions of power grids and AC circuits, then we define the power flow problem.

Figure 1. Complexity of electricity grids: the electric power transmission grid of the United States (source: FEMA and Wikipedia).


Power grids as graphs

Power grids are networks that include two main components: buses that represent important locations of the grid (e.g. generation points, load points, substations) and transmission (or distribution) lines that connect these buses. It is pretty straightforward, therefore, to look at power grid networks as graphs: buses and transmission lines can be represented by nodes and edges of a corresponding graph. There are two equivalent graph models that can be used to derive the basic power flow equations [3]:

  • directed graph representation (left panel of Figure 2): \(\mathbb{G}_{D}(\mathcal{N}, \mathcal{E})\);
  • undirected graph representation (right panel of Figure 2): \(\mathbb{G}_{U}(\mathcal{N}, \mathcal{E} \cup \mathcal{E}^{R})\),

where \(\mathcal{N}\), \(\mathcal{E} \subseteq \mathcal{N} \times \mathcal{N}\) and \(\mathcal{E}^{R} \subseteq \mathcal{N} \times \mathcal{N}\) denote the set of nodes (buses), and the forward and reverse orientations of directed edges (branches) of the graph, respectively.
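
To make the two representations concrete, below is a minimal Python sketch using the networkx library on a hypothetical three-bus grid (not one of the synthetic grids of Figure 2):

import networkx as nx

# Hypothetical 3-bus grid: branches (1, 2) and (1, 3) in forward orientation E.
G_D = nx.DiGraph([(1, 2), (1, 3)])

# Undirected representation, encoded as both orientations E ∪ E^R.
G_U = nx.DiGraph(list(G_D.edges()) + [(j, i) for i, j in G_D.edges()])

print(sorted(G_D.edges()))  # [(1, 2), (1, 3)]
print(sorted(G_U.edges()))  # [(1, 2), (1, 3), (2, 1), (3, 1)]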

Figure 2. Directed graph representation of synthetic grid 14-ieee (left) and undirected graph representation of synthetic grid 30-ieee (right). Red and blue circles denote generator and load buses, respectively.


Complex power in AC circuits

Power can be transmitted more efficiently at high voltage: for a fixed amount of transmitted power, a higher voltage means a lower current, which reduces the power dissipated in the transmission lines. Power grids generally use alternating current (AC), since AC voltage can be easily altered (from high to low) via transformers. Therefore, we start with some notation and definitions for AC circuits.

The most important characteristic of AC circuits is that, unlike in direct current (DC) circuits, the currents and voltages are not constant in time: both their magnitude and direction vary periodically. For several technical reasons (such as low losses and disturbances), power generators use sinusoidal alternating quantities, which can be straightforwardly modeled by complex numbers.

We will consistently use capital and small letters to denote complex and real-valued quantities, respectively. For instance, let us consider two buses, \(i, j \in \mathcal{N}\), that are directly connected by a transmission line \((i, j) \in \mathcal{E}\). The complex power flowing from bus \(i\) to bus \(j\) is denoted by \(S_{ij}\) and it can be decomposed into its active (\(p_{ij}\)) and reactive (\(q_{ij}\)) components:

\[\begin{equation} S_{ij} = p_{ij} + \mathrm{j}q_{ij}, \end{equation}\]

where \(\mathrm{j} = \sqrt{-1}\). The complex power flow can be expressed as the product of the complex voltage at bus \(i\), \(V_{i}\), and the complex conjugate of the current flowing between the buses, \(I_{ij}^{*}\):

\[\begin{equation} S_{ij} = V_{i}I_{ij}^{*}. \label{power_flow} \end{equation}\]

It is well known that transmission lines have power losses due to their resistance (\(r_{ij}\)), which is a measure of the opposition to the flow of the current. For AC circuits, a dynamic effect caused by the line reactance (\(x_{ij}\)) also plays a role. Unlike resistance, reactance does not cause any loss of power but has a delayed effect by storing and later returning power to the circuit. The effect of resistance and reactance together can be represented by a single complex quantity, the impedance: \(Z_{ij} = r_{ij} + \mathrm{j}x_{ij}\). Another useful complex quantity is the admittance, which is the reciprocal of the impedance: \(Y_{ij} = \frac{1}{Z_{ij}}\). Similarly to the impedance, the admittance can also be decomposed into its real, conductance (\(g_{ij}\)), and imaginary, susceptance (\(b_{ij}\)), components: \(Y_{ij} = g_{ij} + \mathrm{j}b_{ij}\).
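
Numerically, all of these quantities are plain complex arithmetic. A minimal Python sketch, with hypothetical per-unit line parameters \(r_{ij} = 0.02\) and \(x_{ij} = 0.06\):

# Hypothetical per-unit line parameters.
r, x = 0.02, 0.06
Z = complex(r, x)      # impedance Z = r + jx
Y = 1 / Z              # admittance Y = 1/Z
g, b = Y.real, Y.imag  # conductance and susceptance, Y = g + jb

print(Y)  # (5-15j), i.e. g = 5.0 and b = -15.0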

Therefore, the current can be written as a function of the line voltage drop and the admittance between the two buses, which is an alternative form of Ohm’s law:

\[\begin{equation} I_{ij} = Y_{ij}(V_{i} - V_{j}). \end{equation}\]

Replacing the above expression for the current in the power flow equation (eq. \(\eqref{power_flow}\)), we get

\[\begin{equation} S_{ij} = Y_{ij}^{*}V_{i}V_{i}^{*} - Y_{ij}^{*}V_{i}V_{j}^{*} = Y_{ij}^{*} \left( |V_{i}|^{2} - V_{i}V_{j}^{*} \right). \end{equation}\]

The above power flow equation can be expressed by using the polar form of voltage, i.e. \(V_{i} = v_{i}e^{\mathrm{j} \delta_{i}} = v_{i}(\cos\delta_{i} + \mathrm{j}\sin\delta_{i})\) (where \(v_{i}\) and \(\delta_{i}\) are the voltage magnitude and angle of bus \(i\), respectively), and the admittance components:

\[\begin{equation} S_{ij} = \left(g_{ij} - \mathrm{j}b_{ij}\right) \left(v_{i}^{2} - v_{i}v_{j}\left(\cos\delta_{ij} + \mathrm{j}\sin\delta_{ij}\right)\right), \end{equation}\]

where for brevity we introduced the voltage angle difference \(\delta_{ij} = \delta_{i} - \delta_{j}\). Similarly, using a simple algebraic identity of \(g_{ij} - \mathrm{j}b_{ij} = \frac{g_{ij}^{2} + b_{ij}^{2}}{g_{ij} + \mathrm{j}b_{ij}} = \frac{|Y_{ij}|^{2}}{Y_{ij}} = \frac{Z_{ij}}{|Z_{ij}|^{2}} = \frac{r_{ij} + \mathrm{j}x_{ij}}{r_{ij}^{2} + x_{ij}^{2}}\), the impedance components-based expression has the following form:

\[\begin{equation} S_{ij} = \frac{r_{ij} + \mathrm{j}x_{ij}}{r_{ij}^{2} + x_{ij}^{2}} \left( v_{i}^{2} - v_{i}v_{j}\left(\cos\delta_{ij} + \mathrm{j}\sin\delta_{ij}\right)\right). \end{equation}\]

Finally, the corresponding real equations can be written as

\[\begin{equation} \left\{ \begin{aligned} p_{ij} & = g_{ij} \left( v_{i}^{2} - v_{i} v_{j} \cos\delta_{ij} \right) - b_{ij} \left( v_{i} v_{j} \sin\delta_{ij} \right) \\ q_{ij} & = b_{ij} \left( -v_{i}^{2} + v_{i} v_{j} \cos\delta_{ij} \right) - g_{ij} \left( v_{i} v_{j} \sin\delta_{ij} \right), \\ \end{aligned} \right. \label{power_flow_y} \end{equation}\]

and

\[\begin{equation} \left\{ \begin{aligned} p_{ij} & = \frac{1}{r_{ij}^{2} + x_{ij}^{2}} \left[ r_{ij} \left( v_{i}^{2} - v_{i} v_{j} \cos\delta_{ij} \right) + x_{ij} \left( v_{i} v_{j} \sin\delta_{ij} \right) \right] \\ q_{ij} & = \frac{1}{r_{ij}^{2} + x_{ij}^{2}} \left[ x_{ij} \left( v_{i}^{2} - v_{i} v_{j} \cos\delta_{ij} \right) - r_{ij} \left( v_{i} v_{j} \sin\delta_{ij} \right) \right]. \\ \end{aligned} \right. \label{power_flow_z} \end{equation}\]
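
As a quick sanity check, the complex expression and the two real forms can be compared numerically. A short Python sketch with hypothetical per-unit values:

import cmath, math

# Hypothetical per-unit values for a single line.
r, x = 0.02, 0.06      # line resistance and reactance
v_i, v_j = 1.05, 1.00  # voltage magnitudes
d_ij = 0.1             # angle difference delta_ij (rad), taking delta_j = 0

Y = 1 / complex(r, x)  # series admittance Y = g + jb
g, b = Y.real, Y.imag
V_i = cmath.rect(v_i, d_ij)  # complex voltages in polar form
V_j = cmath.rect(v_j, 0.0)

# Complex form: S_ij = Y* (|V_i|^2 - V_i V_j*).
S = Y.conjugate() * (abs(V_i) ** 2 - V_i * V_j.conjugate())

# Admittance-based real form.
p_y = g * (v_i**2 - v_i * v_j * math.cos(d_ij)) - b * (v_i * v_j * math.sin(d_ij))
q_y = b * (-v_i**2 + v_i * v_j * math.cos(d_ij)) - g * (v_i * v_j * math.sin(d_ij))

# Impedance-based real form.
p_z = (r * (v_i**2 - v_i * v_j * math.cos(d_ij)) + x * v_i * v_j * math.sin(d_ij)) / (r**2 + x**2)
q_z = (x * (v_i**2 - v_i * v_j * math.cos(d_ij)) - r * v_i * v_j * math.sin(d_ij)) / (r**2 + x**2)

print(abs(S - complex(p_y, q_y)), abs(S - complex(p_z, q_z)))  # both ~1e-16: all forms agree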

Power flow models

In the previous section we presented the power flow between two connected buses and established a relationship between complex power flow and complex voltages. In power flow problems, the entire power grid is considered and the task is to calculate certain quantities based on some other specified ones. There are two equivalent power flow models depending on the graph model used: the bus injection model (based on the undirected graph representation) and the branch flow model (based on the directed graph representation). First, we introduce the basic formulations. Then, we show the most widely used technique to solve power flow problems. Finally, we extend the basic equations and derive more sophisticated models including additional components for real power grids.

Bus injection model

The bus injection model (BIM) uses the undirected graph model of the power grid, \(\mathbb{G}_{U}\). For each bus \(i\), we denote by \(\mathcal{N}_{i} \subset \mathcal{N}\) the set of buses directly connected to bus \(i\). Also, for each bus we introduce the following quantities [3, 4] (Figure 3):

  • \(S_{i}^{\mathrm{gen}}\): generated power flowing into bus \(i\).
  • \(S_{i}^{\mathrm{load}}\): demand power or load flowing out of bus \(i\).
  • \(S_{i}\): net power injection at bus \(i\), i.e. \(S_{i} = S_{i}^{\mathrm{gen}} - S_{i}^{\mathrm{load}}\).
  • \(S_{i}^{\mathrm{trans}}\): transmitted power flowing between bus \(i\) and its adjacent buses.

Figure 3. Power balance and quantities of a bus connected to three adjacent buses and including a single generator and a single load.


Tellegen’s theorem establishes a simple relationship between these power quantities:

\[\begin{equation} S_{i} = S_{i}^{\mathrm{gen}} - S_{i}^{\mathrm{load}} = S_{i}^{\mathrm{trans}} \ \ \ \ \forall i \in \mathcal{N}. \label{power_balance} \end{equation}\]

Eq. \(\eqref{power_balance}\) expresses the law of conservation of power (energy): the power injected (\(S_{i}^{\mathrm{gen}}\)) to bus \(i\) must be equal to the power going out from the bus, i.e. the sum of the withdrawn (\(S_{i}^{\mathrm{load}}\)) and transmitted power (\(S_{i}^{\mathrm{trans}}\)). In the most basic model, a bus can represent either a generator (i.e. \(S_{i} = S_{i}^{\mathrm{gen}}\)) or a load (i.e. \(S_{i} = -S_{i}^{\mathrm{load}}\)). For a given bus, the transmitted power is simply the sum of the power flows between the bus and its adjacent buses, \(S_{i}^{\mathrm{trans}} = \sum \limits_{j \in \mathcal{N}_{i}} S_{ij}\). Therefore, the basic BIM has the following concise form:

\[\begin{equation} S_{i} = \sum \limits_{j \in \mathcal{N}_{i}} Y_{ij}^{*} \left( |V_{i}|^{2} - V_{i}V_{j}^{*} \right) \ \ \ \ \forall i \in \mathcal{N} . \label{bim_basic_concise} \end{equation}\]

Branch flow model

We briefly describe an alternative formulation of the power flow problem, the branch flow model (BFM) [5]. The BFM is based on the directed graph representation of the grid network, \(\mathbb{G}_{D}\), and directly models branch flows and currents. Let us fix an arbitrary orientation of \(\mathbb{G}_{D}\) and let \((i, j) \in \mathcal{E}\) denote an edge pointing from bus \(i\) to bus \(j\). Then, the BFM is defined by the following set of complex equations:

\[\begin{equation} \left\{ \begin{aligned} S_{i} & = \sum \limits_{(i, j) \in \mathcal{E}} S_{ij} - \sum \limits_{(k, i) \in \mathcal{E}} \left( S_{ki} - Z_{ki} |I_{ki}|^{2} \right) & \forall i \in \mathcal{N}, \\ I_{ij} & = Y_{ij} \left( V_{i} - V_{j} \right) & \forall (i, j) \in \mathcal{E}, \\ S_{ij} & = V_{i}I_{ij}^{*} & \forall (i, j) \in \mathcal{E}, \\ \end{aligned} \right. \label{bfm_basic_concise} \end{equation}\]

where the first, second and third sets of equations correspond to the power balance, Ohm’s law, and branch power definition, respectively.

The power flow problem

The power flow problem is to find a unique solution of a power system (given certain input variables) and, therefore, it is central to power grid analysis.

First, let us consider the BIM equations (eq. \(\eqref{bim_basic_concise}\)) that define a complex non-linear system with \(N = |\mathcal{N}|\) complex equations, and \(2N\) complex variables, \(\left\{S_{i}, V_{i}\right\}_{i=1}^{N}\). Equivalently, either using the admittance (eq. \(\eqref{power_flow_y}\)) or the impedance (eq. \(\eqref{power_flow_z}\)) components we can construct \(2N\) real equations with \(4N\) real variables: \(\left\{p_{i}, q_{i}, v_{i}, \delta_{i}\right\}_{i=1}^{N}\), where \(S_{i} = p_{i} + \mathrm{j}q_{i}\). The power flow problem is the following: for each bus we specify two of the four real variables and then, using the \(2N\) equations, we derive the remaining variables. Depending on which variables are specified, there are three basic types of buses:

  • Slack (or \(V\delta\) bus) is usually a reference bus, where the voltage angle and magnitude are specified. Also, slack buses are used to make up any generation and demand mismatch caused by line losses. The voltage angle is usually set to 0 and the magnitude to 1.0 per unit.
  • Load (or \(PQ\) bus) is the most common bus type, where only demand but no power generation takes place. For such buses the active and reactive powers are specified.
  • Generator (or \(PV\) bus) specifies the active power and voltage magnitude variables.

Table 1. Basic bus types in the power flow problem.

| Bus type | Code | Specified variables |
| --- | --- | --- |
| Slack | \(V\delta\) | \(v_{i}\), \(\delta_{i}\) |
| Load | \(PQ\) | \(p_{i}\), \(q_{i}\) |
| Generator | \(PV\) | \(p_{i}\), \(v_{i}\) |


Further, let \(E = |\mathcal{E}|\) denote the number of directed edges. The BFM formulation (eq. \(\eqref{bfm_basic_concise}\)) includes \(N + 2E\) complex equations with \(2N + 2E\) complex variables: \(\left\{ S_{i}, V_{i} \right\}_{i \in \mathcal{N}} \cup \left\{ S_{ij}, I_{ij} \right\}_{(i, j) \in \mathcal{E}}\) or equivalently, \(2N + 4E\) real equations with \(4N + 4E\) real variables.

BIM vs BFM

The BIM and the BFM are equivalent formulations, i.e. they define the same physical problem and provide the same solution [6]. Although the formulations are equivalent, in practice we might prefer one model over the other. Depending on the actual problem and the structure of the system, it might be easier to obtain results and derive an exact solution or an approximate relaxation from one formulation than from the other.

Table 2. Comparison of basic BIM and BFM formulations: complex variables, number of complex variables and equations. Corresponding real variables, their numbers as well as the number of real equations are also shown in parentheses.

| Formulation | Variables | Number of variables | Number of equations |
| --- | --- | --- | --- |
| BIM | \(V_{i} \ (v_{i}, \delta_{i})\), \(S_{i} \ (p_{i}, q_{i})\) | \(2N\) \((4N)\) | \(N\) \((2N)\) |
| BFM | \(V_{i} \ (v_{i}, \delta_{i})\), \(S_{i} \ (p_{i}, q_{i})\), \(I_{ij} \ (i_{ij}, \gamma_{ij})\), \(S_{ij} \ (p_{ij}, q_{ij})\) | \(2N + 2E\) \((4N + 4E)\) | \(N + 2E\) \((2N + 4E)\) |


Solving the power flow problem

Power flow problems (formulated as either a BIM or a BFM) define a non-linear system of equations. There are multiple approaches to solving power flow systems, but the most widely used technique is the Newton–Raphson method. Below we demonstrate how it can be applied to the BIM formulation. First, we rearrange eq. \(\eqref{bim_basic_concise}\):

\[\begin{equation} F_{i} = S_{i} - \sum \limits_{j \in \mathcal{N}_{i}} Y_{ij}^{*} \left( |V_{i}|^{2} - V_{i}V_{j}^{*} \right) = 0 \ \ \ \ \forall i \in \mathcal{N}. \end{equation}\]

The above set of equations can be expressed simply as \(F(X) = 0\), where \(X\) denotes the \(N\) complex or, more conveniently, the \(2N\) real unknown variables, and \(F\) represents the \(N\) complex or \(2N\) real equations.

In the Newton–Raphson method the solution is sought iteratively, until a convergence threshold is satisfied. In the \((n+1)\)th iteration we obtain:

\[\begin{equation} X_{n+1} = X_{n} - J_{F}(X_{n})^{-1} F(X_{n}), \end{equation}\]

where \(J_{F}\) is the Jacobian matrix with elements \(\left[J_{F}\right]_{ij} = \frac{\partial F_{i}}{\partial X_{j}}\). We also note that, instead of computing the inverse of the Jacobian matrix, a numerically more stable approach is to first solve the linear system \(J_{F}(X_{n}) \Delta X_{n} = -F(X_{n})\) and then obtain \(X_{n+1} = X_{n} + \Delta X_{n}\).

Since the Newton–Raphson method requires only the first derivatives of the \(F_{i}\) functions with respect to the variables, the technique is in general very efficient, even for large, realistically sized grids.
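
As an illustration, here is a minimal Newton–Raphson sketch in Python, applied to a toy two-equation system rather than an actual power grid:

import numpy as np

def newton_raphson(F, J, x0, tol=1e-8, max_iter=20):
    # Solve F(x) = 0 iteratively, using a linear solve instead of an explicit inverse.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:
            break
        dx = np.linalg.solve(J(x), -Fx)  # J(x_n) Δx_n = -F(x_n)
        x = x + dx                       # x_{n+1} = x_n + Δx_n
    return x

# Toy system: x0^2 + x1^2 = 2 and x0 = x1, with solution (1, 1).
F = lambda x: np.array([x[0] ** 2 + x[1] ** 2 - 2.0, x[0] - x[1]])
J = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])

print(newton_raphson(F, J, [2.0, 0.5]))  # -> [1. 1.]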

Towards more realistic power flow models

In the previous sections we introduced the main concepts needed to understand and solve the power flow problem. However, the models derived so far are rather basic, and real systems require more realistic approaches. Below we construct more sophisticated models by extending the basic equations to account for the following:

  • multiple generators and loads included in buses;
  • shunt elements connected to buses;
  • modeling transformers and phase shifters;
  • modeling asymmetric line charging.

We note that actual power grids can include additional elements besides the ones we discuss.

Improving the bus model

We start with the generated (\(S_{i}^{\mathrm{gen}}\)) and load (\(S_{i}^{\mathrm{load}}\)) powers in the power balance equations, i.e. \(S_{i} = S_{i}^{\mathrm{gen}} - S_{i}^{\mathrm{load}} = S_{i}^{\mathrm{trans}}\) for each bus \(i\). Electric grid buses can actually include multiple generators and loads together. In order to take this structure into account we introduce \(\mathcal{G}_{i}\) and \(\mathcal{L}_{i}\), denoting the set of generators and the set of loads at bus \(i\), respectively. Also, \(\mathcal{G} = \bigcup \limits_{i \in \mathcal{N}} \mathcal{G}_{i}\) and \(\mathcal{L} = \bigcup \limits_{i \in \mathcal{N}} \mathcal{L}_{i}\) indicate the sets of all generators and loads, respectively. Then we can write the following complex-valued expressions:

\[\begin{equation} S_{i}^{\mathrm{gen}} = \sum \limits_{g \in \mathcal{G}_{i}} S_{g}^{G} \ \ \ \ \forall i \in \mathcal{N}, \end{equation}\] \[\begin{equation} S_{i}^{\mathrm{load}} = \sum \limits_{l \in \mathcal{L}_{i}} S_{l}^{L} \ \ \ \ \forall i \in \mathcal{N}, \end{equation}\]

where \(\left\{S_{g}^{G}\right\}_{g \in \mathcal{G}}\) and \(\left\{S_{l}^{L}\right\}_{l \in \mathcal{L}}\) are the complex powers of generator dispatches and load consumptions, respectively. It is easy to see from the equations above that having multiple generators and loads per bus increases the total number of variables, while the number of equations does not change. If the system includes \(N = \lvert \mathcal{N} \rvert\) buses, \(G = \lvert \mathcal{G} \rvert\) generators and \(L = \lvert \mathcal{L} \rvert\) loads, then the total number of complex variables changes to \(N + G + L\) and \(N + G + L + 2E\) for BIM and BFM, respectively.

Buses can also include shunt elements that are used to model connected capacitors and inductors. They are represented by individual admittances resulting in additional power flowing out of the bus:

\[\begin{equation} S_{i} = S_{i}^{\mathrm{gen}} - S_{i}^{\mathrm{load}} - S_{i}^{\mathrm{shunt}} = \sum \limits_{g \in \mathcal{G}_{i}} S_{g}^{G} - \sum \limits_{l \in \mathcal{L}_{i}} S_{l}^{L} - \sum \limits_{s \in \mathcal{S}_{i}} S_{s}^{S} \ \ \ \ \forall i \in \mathcal{N}, \end{equation}\]

where \(\mathcal{S}_{i}\) is the set of shunts attached to bus \(i\) and \(S^{S}_{s} = \left( Y_{s}^{S} \right)^{*} \lvert V_{i} \rvert^{2}\) with admittance \(Y_{s}^{S}\) of shunt \(s \in \mathcal{S}_{i}\). Shunt elements do not introduce additional variables in the system.

Improving the branch model

In real power grids, branches can be transmission lines, transformers, or phase shifters. Transformers are used to control voltages as well as active and, sometimes, reactive power flows. In this section we also take transformers and line charging into account, using a general branch model (Figure 4).

Figure 4. General branch model including transformers and \(\pi\)-section model (source: MATPOWER manual [7]).


Transformers (and also phase shifters) are represented by the complex tap ratio \(T_{ij}\), or equivalently, by its magnitude \(t_{ij}\) and phase angle \(\theta_{ij}\). Also, line charging is usually modeled by the \(\pi\) transmission line (or \(\pi\)-section) model, which places shunt admittances at the two ends of the transmission line, besides its series admittance. We treat the shunt admittances in a fairly general way, i.e. \(Y_{ij}^{C}\) and \(Y_{ji}^{C}\) are not necessarily equal, although a widely adopted practice is to assume equal values and to consider only their susceptance components. Since this model is no longer symmetric, for consistency we introduce the following convention: we select the orientation of each branch \((i, j) \in \mathcal{E}\) so that it matches the structure presented in Figure 4.

In order to derive the corresponding power flow equations of this model (for both directions due to the asymmetric arrangement), we first look for expressions of the current \(I_{ij}\) and \(I_{ji}\).
The transformer alters the input voltage \(V_{i}\) and current \(I_{ij}\). Using the complex tap ratio, the output voltage and current are \(\frac{V_{i}}{T_{ij}}\) and \(T_{ij}^{*}I_{ij}\), respectively. Let \(I_{ij}^{s}\) denote the series current flowing between points \(A\) and \(B\) in Figure 4. By Kirchhoff’s current law, the net current flowing into node \(A\) is equal to the net current flowing out from it, i.e. \(T_{ij}^{*} I_{ij} = I_{ij}^{s} + Y_{ij}^{C} \frac{V_{i}}{T_{ij}}\). Rearranging this expression for \(I_{ij}\) we get:

\[\begin{equation} I_{ij} = \frac{I_{ij}^{s}}{T_{ij}^{*}} + Y_{ij}^{C} \frac{V_{i}}{\lvert T_{ij} \rvert^{2}} \quad \forall (i, j) \in \mathcal{E}. \end{equation}\]

Similarly, applying Kirchhoff’s current law for node \(B\) we can obtain \(I_{ji}\):

\[\begin{equation} I_{ji} = -I_{ij}^{s} + Y_{ji}^{C} V_{j} \quad \forall (j, i) \in \mathcal{E}^{R}. \end{equation}\]

Finally, using Ohm’s law again, the series current has the following simple expression: \(I_{ij}^{s} = Y_{ij} \left( \frac{V_{i}}{T_{ij}} - V_{j} \right)\). Now we can easily obtain the corresponding power flows:

\[\begin{equation} \begin{aligned} S_{ij} & = V_{i}I_{ij}^{*} = V_{i} \left( \frac{I_{ij}^{s}}{T_{ij}^{*}} + Y_{ij}^{C} \frac{V_{i}}{\lvert T_{ij} \rvert^{2}}\right)^{*} & \\ & = Y_{ij}^{*} \left( \frac{\lvert V_{i} \rvert^{2}}{\lvert T_{ij} \rvert^{2}} - \frac{V_{i} V_{j}^{*}}{T_{ij}} \right) + \left( Y_{ij}^{C} \right)^{*} \frac{\lvert V_{i} \rvert^{2}}{\lvert T_{ij} \rvert^{2}} \ \ \ \ & \forall (i, j) \in \mathcal{E} \\ S_{ji} & = V_{j}I_{ji}^{*} = V_{j} \left( -I_{ij}^{s} + Y_{ji}^{C}V_{j} \right)^{*} & \\ & = Y_{ij}^{*} \left( \lvert V_{j} \rvert^{2} - \frac{V_{j}V_{i}^{*}}{T_{ij}^{*}} \right) + \left( Y_{ji}^{C} \right)^{*} \lvert V_{j} \rvert^{2} \ \ \ \ & \forall (j, i) \in \mathcal{E}^{R} \\ \end{aligned} \end{equation}\]
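
To tie these expressions together, here is a hypothetical Python helper (a sketch, not a production implementation) that evaluates the two power flows of a single general branch:

import cmath

def branch_flows(V_i, V_j, Y, Y_C_ij, Y_C_ji, T):
    # Series current: I_ij^s = Y (V_i / T - V_j).
    I_s = Y * (V_i / T - V_j)
    # Power flows at the two ends of the general branch model.
    S_ij = V_i * (I_s / T.conjugate() + Y_C_ij * V_i / abs(T) ** 2).conjugate()
    S_ji = V_j * (-I_s + Y_C_ji * V_j).conjugate()
    return S_ij, S_ji

# Hypothetical example: unit tap ratio and symmetric, purely susceptive charging.
S_ij, S_ji = branch_flows(
    V_i=cmath.rect(1.05, 0.1),
    V_j=cmath.rect(1.00, 0.0),
    Y=1 / complex(0.02, 0.06),
    Y_C_ij=0.01j,
    Y_C_ji=0.01j,
    T=complex(1.0, 0.0),
)
print(S_ij, S_ji)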

Improved BIM and BFM models

Putting everything together from the previous sections, we present the improved BIM and BFM models:

BIM:

\[\begin{equation} \begin{aligned} \sum \limits_{g \in \mathcal{G}_{i}} S_{g}^{G} - \sum \limits_{l \in \mathcal{L}_{i}} S_{l}^{L} - \sum \limits_{s \in \mathcal{S}_{i}} \left( Y_{s}^{S} \right)^{*} \lvert V_{i} \rvert^{2} = \sum \limits_{(i, j) \in \mathcal{E}} Y_{ij}^{*} \left( \frac{\lvert V_{i} \rvert^{2}}{\lvert T_{ij} \rvert^{2}} - \frac{V_{i} V_{j}^{*}}{T_{ij}} \right) + \left( Y_{ij}^{C} \right)^{*} \frac{\lvert V_{i} \rvert^{2}}{\lvert T_{ij} \rvert^{2}} \\ + \sum \limits_{(i, k) \in \mathcal{E}^{R}} Y_{ki}^{*} \left( \lvert V_{i} \rvert^{2} - \frac{V_{i}V_{k}^{*}}{T_{ki}^{*}} \right) + \left( Y_{ik}^{C} \right)^{*} \lvert V_{i} \rvert^{2} \ \ \ \ \forall i \in \mathcal{N}, \\ \end{aligned} \label{bim_concise} \end{equation}\]

where, on the left-hand side, the net power at bus \(i\) is computed from the power injection of multiple generators and power withdrawals of multiple loads and shunt elements, while, on the right-hand side, the transmitted power is a sum of the outgoing (first term) and incoming (second term) power flows to and from the corresponding adjacent buses.

BFM:

\[\begin{equation} \begin{aligned} & \sum \limits_{g \in \mathcal{G}_{i}} S_{g}^{G} - \sum \limits_{l \in \mathcal{L}_{i}} S_{l}^{L} - \sum \limits_{s \in \mathcal{S}_{i}} \left( Y_{s}^{S} \right)^{*} \lvert V_{i} \rvert^{2} = \sum \limits_{(i, j) \in \mathcal{E}} S_{ij} + \sum \limits_{(i, k) \in \mathcal{E}^{R}} S_{ik} & \forall i \in \mathcal{N} \\ & I_{ij}^{s} = Y_{ij} \left( \frac{V_{i}}{T_{ij}} - V_{j} \right) & \forall (i, j) \in \mathcal{E} \\ & S_{ij} = V_{i} \left( \frac{I_{ij}^{s}}{T_{ij}^{*}} + Y_{ij}^{C} \frac{V_{i}}{\lvert T_{ij} \rvert^{2}}\right)^{*} & \forall (i, j) \in \mathcal{E} \\ & S_{ji} = V_{j} \left( -I_{ij}^{s} + Y_{ji}^{C}V_{j} \right)^{*} & \forall (j, i) \in \mathcal{E}^{R} \\ \end{aligned} \label{bfm_concise} \end{equation}\]

As before, BFM is an equivalent formulation to BIM (first set of equations), with the only difference being that BFM treats the series currents (second set of equations) and power flows (third and fourth sets of equations) as explicit variables.

Table 3. Comparison of improved BIM and BFM formulations: complex variables, number of complex variables and equations. Corresponding real variables, their numbers as well as the number of real equations are also shown in parentheses.

| Formulation | Variables | Number of variables | Number of equations |
| --- | --- | --- | --- |
| BIM | \(V_{i} \ (v_{i}, \delta_{i})\), \(S_{g}^{G} \ (p_{g}^{G}, q_{g}^{G})\), \(S_{l}^{L} \ (p_{l}^{L}, q_{l}^{L})\) | \(N + G + L\) \((2N + 2G + 2L)\) | \(N\) \((2N)\) |
| BFM | \(V_{i} \ (v_{i}, \delta_{i})\), \(S_{g}^{G} \ (p_{g}^{G}, q_{g}^{G})\), \(S_{l}^{L} \ (p_{l}^{L}, q_{l}^{L})\), \(I_{ij}^{s} \ (i_{ij}^{s}, \gamma_{ij}^{s})\), \(S_{ij} \ (p_{ij}, q_{ij})\), \(S_{ji} \ (p_{ji}, q_{ji})\) | \(N + G + L + 3E\) \((2N + 2G + 2L + 6E)\) | \(N + 3E\) \((2N + 6E)\) |


Conclusions

In this blog post, we introduced the main concepts of the power flow problem and derived the power flow equations for a basic and a more realistic model using two equivalent formulations. We also showed that the resulting mathematical problems are non-linear systems that can be solved by the Newton–Raphson method.

Solving power flow problems is essential for the analysis of power grids. Based on a set of specified variables, the power flow solution provides the unknown ones. Consider a power grid with \(N\) buses, \(G\) generators and \(L\) loads, and let us specify only the active and reactive powers of the loads, which gives \(2L\) known real-valued variables. Whichever formulation we use, this system is underdetermined, and we would need to specify \(2G\) additional variables (Table 3) to obtain a unique solution. However, generators have distinct power supply capacities and cost curves, so we can ask an interesting question: among the many physically possible specifications of the generator variables, which one minimizes the total economic cost?

This question leads to the mathematical model of economic dispatch, which is one of the most widely used optimal power flow problems. Optimal power flow problems are constrained optimization problems that are based on the power flow equations with additional physical constraints and an objective function. In our next post, we will discuss the main types and components of optimal power flow, and show how exact and approximate solutions can be obtained for these significantly more challenging problems.


  1. J. Tong, “Overview of PJM energy market design, operation and experience”, 2004 IEEE International Conference on Electric Utility Deregulation, Restructuring and Power Technologies. Proceedings, 1, pp. 24, (2004). 

  2. M. B. Cain, R. P. O’Neill, and A. Castillo, “History of optimal power flow and formulations”, Federal Energy Regulatory Commission, 1, pp. 1, (2012). 

  3. S. H. Low, “Convex Relaxation of Optimal Power Flow—Part I: Formulations and Equivalence”, IEEE Transactions on Control of Network Systems, 1, pp. 15, (2014). 

  4. A. J. Wood, B. F. Wollenberg and G. B. Sheblé, “Power generation, operation, and control”, Hoboken, New Jersey: Wiley-Interscience, (2014). 

  5. M. Farivar and S. H. Low, “Branch Flow Model: Relaxations and Convexification—Part I”, IEEE Transactions on Power Systems, 28, pp. 2554, (2013). 

  6. B. Subhonmesh, S. H. Low and K. M. Chandy, “Equivalence of branch flow and bus injection models”, 2012 50th Annual Allerton Conference on Communication, Control, and Computing, pp. 1893, (2012). 

  7. R. D. Zimmerman, C. E. Murillo-Sanchez, “MATPOWER User’s Manual, Version 7.1”, (2020).