Invenia at JuliaCon 2021
10 Aug 2021Another year has passed and another JuliaCon has happened with great success. This was the second year that the conference was fully online. While it’s a shame that we don’t get to meet all the interesting people from the Julia community in person, it also means that the conference is able to reach an even broader audience. This year, there were over 20,000 registrations and over 43,000 people tuned in on YouTube. That’s roughly double the numbers from last year! As usual, Invenia was present at the conference in various forms: as sponsors, as volunteers to the organisation, as presenters, and as part of the audience.
In this post, we highlight the work we presented this year. If you missed any of the talks, they are all available on Youtube, and we provide the links.
Clearing the Pipeline Jungle with FeatureTransforms.jl
The prevalence of glue code in feature engineering pipelines poses many problems in conducting high-quality, scalable research in machine learning and data science. In worst-case scenarios, the technical debt racked up by overgrown “pipeline jungles” can prevent a project from making any meaningful progress beyond a certain point. In this talk we discuss how we thought about this problem in our own code, and what the ideal properties of a feature engineering workflow should be. The result was FeatureTransforms.jl: a package that can help make feature engineering a more sustainable practice for users without sacrificing the desired flexibility.
Everything you need to know about ChainRules 1.0
Automatic differentiation (AD) is a key component of most machine learning applications as it enables efficient learning of model parameters. AD systems can compute gradients of complex functions by combining the gradients of basic operations that make up the function. To do that, an AD system needs access to rules for the gradients of basic functions. The ChainRules ecosystem provides a set of rules for functions in Julia standard libraries, the utilities for writing custom rules, and utilities for testing those rules.
The ChainRules project has now reached the major milestone of releasing its version 1.0. One of the main highlights of this release is the ability to write rules that are defined conditionally based on the properties of the AD system. This provides the ability to write rules for higher order functions, such as map
, by calling back into AD. Other highlights include the ability to opt out of abstractly-typed rules for a particular signature (using AD to compose a rule, instead), making sure that the differential is in the correct subspace, a number of convenient macros for writing rules, and improved testing utilities which now include testing capabilities tailored for AD systems. ChainRules is now also integrated into a number of prominent AD packages, including ForwardDiff2, Zygote, Nabla, Yota, and Diffractor. In the talk, we describe the key new features and guide the user on how to write correct and efficient rules.
ExprTools: Metaprogramming from reflection
We initially created ExprTools.jl to clean-up some of our code in Mocking.jl, so that we could more easily write patches that look like function declarations. ExprTools’ initial features were a more robust version of splitdef
and combinedef
from MacroTools.jl. These functions allow breaking up the abstract syntax tree (AST) for a function definition into a dictionary of all the parts—name, arguments, body etc.—which can then be manipulated and combined back into an AST (i.e. an Expr
object) that a macro can return.
With the goal of supporting ChainRules in Nabla.jl, we recently extended ExprTools.jl with a new method: signature
. Nabla is an operator-overloading AD: to define a new rule for how to perform AD over some function f
, it is overloaded to accept a ::Node{T}
argument which contains both the value of type T
and the additional tracking information needed for AD. ChainRules, on the other hand, defines a rule for how to perform AD by defining an overload for the function rrule
. So, for every method of rrule
—e.g. rrule(f, ::T)
—we needed to generate an overload of the original function that takes a node: e.g. f(::Node{T})
. We need to use metaprogramming to define those overloads based on reflection over the method table of rrule
. The new signature
function in ExprTools makes this possible. signature
works like splitdef
, returning a dictionary that can be combined into an AST
, but, rather than requiring an AST to split up, it takes a Method
object.
This talk goes into how MacroTools can be used, both on ASTs with splitdef
, and on methods with signature
.
Fancy Arrays BoF 2
A recurring pattern in the Julia ecosystem is the need to perform operations on named dimensions (e.g. std(X; dim=:time)
, cat(X, Y; dims=:variate)
) or lookup by axis keys (e.g. X(time = DateTime(2017, 1, 1), variate = :object1)
). These patterns improve code readability and reduce bugs by explicitly naming the quantities over which calculations/manipulations should be performed. While AxisArrays has been the de facto standard over the past few years, several attempts have been made to simplify the interface.
Last year, during JuliaCon 2020, an initial Birds of Feather (BoF) on these “fancy” arrays outlined the common requirements folks had and set goals to create new packages that would implement that functionality as minimally as possible. Please refer to last year’s notes for more details. Over the past year, only a few packages remain actively maintained and have largely overlapping feature sets. The goal of this year’s BoF was to reduce code duplication and feature siloing by either merging the remaining packages or constructing a shared API.
We identified several recurring issues with our existing wrapper array types, which ArrayInterface.jl may help address. Similarly, it became apparent that there was a growing need to address broader ecosystem support, which is complicated by disparate APIs and workflows. Various people supported specifying a minimal API for operating on named dimensions, indexing by axis keys and converting between types, similar to Table.jl. This shared API may address several concerns raised about the lack of consistency and standard packages within the Julia ecosystem (like xarray
s in python). See this year’s notes for more details.
ParameterHandling.jl
Any time you want to fit a model you have to figure out how to manage its parameters, and how to work with standard optimisation and inference interfaces. This becomes complicated for all but the most trivial models. ParameterHandling.jl provides an API and tooling to help you manage this complexity in a scalable way. There are a number of packages offering slightly different approaches to solving this problem (such as Functors.jl, FlexibleFunctors.jl, TransformVariables.jl, and ComponentArrays.jl), but we believe that ParameterHandling.jl has some nice properties (such as the ability to work with arbitrary data types without modification, and the ability to use immutable data structures), and it has been useful in our work. Hopefully future work will unify the best aspects of all of these packages.
Distributed Computing using AWSClusterManagers.jl
Cloud computing is a key piece of the workflow for many compute-heavy applications, and Amazon’s AWS is one of the market leaders in this area. Thus, seamlessly integrating Julia and AWS is of great importance to us, and that is why we have been working on AWSClusterManagers.jl.
AWSClusterManagers.jl allows users to run a distributed workload on AWS as easily as the Base Distributed package. It is one of a few cloud packages which we have recently open-sourced, alongside AWSTools.jl and AWSBatch.jl. A simple example of how to use the package can be found on Github.
We hope to see everyone again at JuliaCon 2022, and to show a bit more of our Julia applications.