The rumor says that Julia can achieve the same computing
performance as compiled languages
like C++ and FORTRAN. After coding in Julia for the past two years I have
definitely fallen in love with its pythonic syntax, multiple dispatch, and MATLAB-like
handiness in linear algebra, while being able to use compilation features
like explicit type declaration for bug-preventive programming. In summary,
Julia’s philosophy brings all the flexibility of an interpreted language,
while its Just-In-Time (JIT) compilation makes it a de facto
compiled language.
Julia’s high-level syntax makes the language approachable for programmers from any
background; however, achieving high performance is something of
an art. In this post I summarize some of the things I’ve learned while crafting
my Julia code for high-performance computing. I will walk through the
process of code optimization using a real-world computing application from
aerodynamics: the vortex particle
method\(^{[1,\,2]}\).
Problem Definition
In the vortex particle method we
are interested in calculating the velocity \(\mathbf{u}\) and velocity Jacobian
\(\mathbf{J}\) that a field of \(N\) vortex particles induces at an arbitrary
position \(\mathbf{x}\). These are calculated as

\(\mathbf{u}(\mathbf{x}) = \sum\limits_p g_\sigma(\mathbf{x}-\mathbf{x}_p)\, \mathbf{K}(\mathbf{x}-\mathbf{x}_p) \times \boldsymbol{\Gamma}_p\)

\(\frac{\partial \mathbf{u}}{\partial x_j}(\mathbf{x}) = \sum\limits_p \left[ \frac{\partial g_\sigma}{\partial x_j}(\mathbf{x}-\mathbf{x}_p)\, \mathbf{K}(\mathbf{x}-\mathbf{x}_p) \times \boldsymbol{\Gamma}_p + g_\sigma(\mathbf{x}-\mathbf{x}_p)\, \frac{\partial \mathbf{K}}{\partial x_j}(\mathbf{x}-\mathbf{x}_p) \times \boldsymbol{\Gamma}_p \right]\)

with \({\bf K}\) the singular Newtonian kernel \({\bf K}\left( {\bf x}\right)=-\frac{ {\bf x} }{4\pi \Vert{\bf x}\Vert^3}\), \(g_\sigma\) a regularizing
function of smoothing radius \(\sigma\), and \(\mathbf{x}_p\) and
\(\boldsymbol{\Gamma}_p\) the position and vectorial strength of the \(p\)-th
particle, respectively. Furthermore, the governing equations of the method
require evaluating the velocity \(\mathbf{u}\) and Jacobian \(\mathbf{J}\) that the
collection of particles induces on itself, leading to the well-known \(N\)-body
problem of computational
complexity \(\mathcal{O}(N^2)\).
The figure at the top of the page shows a simulation of a 6x6x6 box of particles with vorticity
initially concentrated at its center, diffusing as the simulation progresses due
to viscous effects.
C++ Benchmark
Here I have coded a benchmark
of the C++ code shown above, evaluating P2P on a box of 6x6x6=216 particles,
and made it callable in this notebook through the
CxxWrap Julia package:
In [1]:
In [2]:
Samples: 1000
min time: 3.99555 ms
ave time: 4.77778 ms
max time: 6.73414 ms
This was run on my Dell Latitude 5580 laptop (Intel® Core™ i7-7820HQ CPU @
2.90GHz × 8) using only one process, and we see that, in the best case, the C++
kernel is evaluated in ~4 ms. Let’s move on and code this up in Julia.
NOTE: The C++ code was compiled with the -O3 flag for code optimization.
Optimizing Julia
Julia Baseline: Pythonic Programming
Tempted by the Python-like syntax available in Julia, our first inclination is
to make the code as general, simple, and easy to understand as possible. Here is
the most general implementation where no types are specified:
In [3]:
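For reference, here is a sketch of what an untyped particle container might look like; the struct name is the one referenced later, and the field names follow those used in the code at the end of this post:

```julia
# Sketch of an untyped particle: without annotations every field defaults to type Any
struct ParticleAmbiguous
    X          # position
    Gamma      # vectorial strength
    sigma      # smoothing radius
    U          # velocity (output)
    J          # velocity Jacobian (output)
end
```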
In [4]:
C++ is 58.48 times faster than P2P_pythonic (3.996ms vs 233.679ms)
BenchmarkTools.Trial:
memory estimate: 217.57 MiB
allocs estimate: 4087152
--------------
minimum time: 233.679 ms (5.04% GC)
median time: 238.482 ms (4.99% GC)
mean time: 251.540 ms (9.28% GC)
maximum time: 295.517 ms (19.89% GC)
--------------
samples: 4
evals/sample: 1
Here we see that in our pythonic attempt we’ve got code that is pretty neat
and concise, but ~58x slower than the C++ implementation.
Fix #1: Always work with concrete types
The problem with the pythonic approach is that variable types are never
declared, and without foreknowledge of the types to be handled, Julia can’t
optimize the function during compilation. To help us catch ambiguous
(abstract) types in our code, the Julia Base package provides the macro
@code_warntype, which prints the lowered, type-inferred AST used during
compilation, highlighting any abstract types it encounters.
The output is pretty lengthy, so here I have copied and pasted only a snippet:
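For reference, the macro is invoked directly on a function call, along these lines (the argument names here are simply whatever was set up for the benchmark):

```julia
# Inspect the type-inferred AST of the pythonic kernel;
# any field inferred as ::Any gets flagged (shown in red in the REPL)
@code_warntype P2P_pythonic(particles, g, dgdr)
```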
Understanding this lowered AST syntax is sort of an art, but you’ll soon learn
that @code_warntype is your best friend when optimizing code. As we scroll
down the AST we see that the code encounters the type Any in the properties of our
ParticleAmbiguous type, which should immediately raise a red flag (Any
is an abstract type). In fact, when running @code_warntype the output
automatically highlights those ::Any annotations in red.
We can take care of those abstract types by defining the properties of the
struct parametrically:
In [5]:
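A sketch of what the parametric definition might look like; the exact field types are an assumption based on how the fields are used:

```julia
# Parametric particle: every field is concrete once T is known (e.g. Float64)
struct Particle{T}
    X::Array{T, 1}       # position
    Gamma::Array{T, 1}   # vectorial strength
    sigma::T             # smoothing radius
    U::Array{T, 1}       # velocity (output)
    J::Array{T, 2}       # velocity Jacobian (output)
end

# Constructing one with Float64 arrays yields a Particle{Float64}
p = Particle(zeros(3), zeros(3), 1.0, zeros(3), zeros(3, 3))
```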
No further modifications are needed in our P2P_pythonic function, since Julia’s
multiple dispatch and JIT compilation automatically compile a version of the function
specialized for our new Particle{T} type on the fly. Still, we will define an
alias to help us compare benchmarks:
In [6]:
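The alias can be as simple as a one-liner along these lines:

```julia
# Same body as P2P_pythonic; the alias only gives the benchmark a distinct label
P2P_concretetypes(particles, g, dgdr) = P2P_pythonic(particles, g, dgdr)
```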
In [7]:
P2P_concretetypes is 3.06 times faster than P2P_pythonic (76.488ms vs 233.679ms)
BenchmarkTools.Trial:
memory estimate: 189.93 MiB
allocs estimate: 2275776
--------------
minimum time: 76.488 ms (13.63% GC)
median time: 77.860 ms (13.79% GC)
mean time: 84.567 ms (17.89% GC)
maximum time: 145.918 ms (42.64% GC)
--------------
samples: 12
evals/sample: 1
Voilà! By specifying concrete types in our Particle struct, we have now gained
a 3x speed up (we should run @code_warntype again to make sure we got rid of
all abstract types, but I’ll omit it for brevity).
Let’s now see how we compare to C++:
In [8]:
C++ is 19.14 times faster than P2P_concretetypes (3.996ms vs 76.488ms)
Working with concrete types greatly sped up the computation; however, the C++
version is still ~20x faster than Julia. Let’s see what else we can optimize.
Fix #2: Avoid List Comprehensions
The wonders of list comprehensions may tempt you to write some line-efficient
code; however, used loosely they can lead to very inefficient
computation. Take, for example, this list-comprehension sum:
In [9]:
80.975 ns (1 allocation: 896 bytes)
Here is the version of the same function unrolled without the list
comprehension, which does the job 60 times faster:
In [10]:
1.374 ns (0 allocations: 0 bytes)
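As a rough illustration of the pattern being compared (the exact function benchmarked above may differ; the dot-product example and names here are illustrative):

```julia
using BenchmarkTools

# Comprehension version: materializes a temporary array, then sums it
dot_comprehension(X, Y) = sum([X[i]*Y[i] for i in 1:length(X)])

# Unrolled / explicit-loop version: accumulates in place, no temporary array
function dot_loop(X, Y)
    out = zero(eltype(X))
    for i in 1:length(X)
        out += X[i]*Y[i]
    end
    return out
end

X, Y = rand(100), rand(100)
@btime dot_comprehension($X, $Y)   # allocates the intermediate array
@btime dot_loop($X, $Y)            # zero allocations
```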
In our P2P function we have a Kronecker delta cross product that we were
calculating in just one line as a list comprehension.
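Pulled from the pythonic implementation (shown in full at the end of this post), the line in question is:

```julia
# Builds the Kronecker delta [i==j for i in 1:3] as a fresh 3-element array on every call
gsgm*(const4/r^3)*cross([i==j for i in 1:3], Pj.Gamma)
```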
The problem with list-comprehension operations is that they have to allocate memory
to build the generated array. Resist the temptation to use list
comprehensions just to save yourself a few lines, and simply unroll them. As seen below,
we get a 1.5x speed up by unrolling this line:
In [11]:
In [12]:
P2P_nocomprehension is 1.52 times faster than P2P_concretetypes (50.196ms vs 76.488ms)
BenchmarkTools.Trial:
memory estimate: 126.16 MiB
allocs estimate: 1300536
--------------
minimum time: 50.196 ms (11.28% GC)
median time: 51.929 ms (11.65% GC)
mean time: 55.319 ms (15.41% GC)
maximum time: 107.039 ms (50.23% GC)
--------------
samples: 19
evals/sample: 1
Fix #3: Reduce Allocation
Next, we notice that the benchmark is allocating an unusual amount of
memory (126 MiB over 1.3M allocations). I suspect this is
an issue with Julia allowing dynamically sized arrays. The first step to solve
this is to do away with creating any internal array variables. In
the code below, notice that I have replaced the array variables dX and crss
with scalar variables dX1, dX2, dX3 and crss1, crss2, crss3, which leads to
having to fully unroll the inner for-loop:
In [13]:
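The gist of the change is an excerpt like the following; indexing into the array-valued fields is an assumption for this intermediate version:

```julia
# Before: dX = Pi.X - Pj.X allocates a new 3-element array on every particle pair
# After: three scalar components, no allocation
dX1 = Pi.X[1] - Pj.X[1]
dX2 = Pi.X[2] - Pj.X[2]
dX3 = Pi.X[3] - Pj.X[3]
```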
In [14]:
P2P_noallocation is 3.68 times faster than P2P_nocomprehension (13.627ms vs 50.196ms)
BenchmarkTools.Trial:
memory estimate: 21.99 MiB
allocs estimate: 325296
--------------
minimum time: 13.627 ms (7.41% GC)
median time: 15.615 ms (8.40% GC)
mean time: 16.582 ms (12.83% GC)
maximum time: 59.011 ms (66.91% GC)
--------------
samples: 61
evals/sample: 1
Here we have reduced the memory allocated from 126 MiB to 22 MiB, leading to a
~3.7x speed up. Let’s see what else we can do to decrease memory allocation.
Fix #4: No Linear Algebra
The next thing to consider is that doing linear algebra operations
through the Julia base library (e.g., dot(X, Y),
cross(X, Y), norm(X)) is more expensive than explicitly unrolling the
operation into lines of code. I suspect this is a memory allocation
problem, since these functions need to allocate internal arrays to store the
computation before outputting the result. Here is the code without any
linear algebra functions from the base library (notice that I no longer use norm() nor
cross()):
In [15]:
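The pattern looks roughly like this excerpt, adapted from the final version at the end of the post (the array indexing into Pj.Gamma is an assumption at this stage):

```julia
# norm(dX) replaced by an explicit square root
r = sqrt(dX1*dX1 + dX2*dX2 + dX3*dX3)

# cross(-const4*(dX/r^3), Pj.Gamma) written out component by component
crss1 = -const4/r^3 * (dX2*Pj.Gamma[3] - dX3*Pj.Gamma[2])
crss2 = -const4/r^3 * (dX3*Pj.Gamma[1] - dX1*Pj.Gamma[3])
crss3 = -const4/r^3 * (dX1*Pj.Gamma[2] - dX2*Pj.Gamma[1])
```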
In [16]:
P2P_nolinalg is 2.10 times faster than P2P_noallocation (6.487ms vs 13.627ms)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 6.487 ms (0.00% GC)
median time: 7.140 ms (0.00% GC)
mean time: 7.198 ms (0.00% GC)
maximum time: 9.172 ms (0.00% GC)
--------------
samples: 139
evals/sample: 1
By doing away with linear algebra functions we are now not allocating any
memory, reaching an extra 2x speed up.
Fix #5: Fine Tuning
Notice that by now we are achieving benchmarks in the same order of magnitude
as C++ (6.5 ms in Julia vs 4.0 ms in C++):
In [17]:
C++ is 1.62 times faster than P2P_nolinalg (3.996ms vs 6.487ms)
What we did up to this point follows
general principles that apply to any code aiming for high performance.
What follows now is fine-tuning of our code in ways that only apply to the
specific computation we are performing.
For instance, recall that our P2P function receives any user-defined
regularizing kernel, which the function calls through these lines:
function P2P(sources, targets, g::Function, dgdr::Function)
    for Pi in targets
        for Pj in sources
            ...
            # g_σ and ∂gσ∂r
            gsgm = g(r/Pj.sigma)
            dgsgmdr = dgdr(r/Pj.sigma)
            ...
        end
    end
end
For the case of Winckelmans’ kernel, g and dgdr look like this:
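For reference, Winckelmans’ algebraic kernel is commonly written as below; this form is consistent with the powers mentioned in the next paragraph:

```julia
# Winckelmans' regularizing function and its radial derivative
g_wnklmns(r)    = r^3 * (r^2 + 2.5) / (r^2 + 1)^2.5
dgdr_wnklmns(r) = 7.5 * r^2 / (r^2 + 1)^3.5
```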
We notice that each of these functions calculates a non-integer power independently,
(r^2 + 1)^2.5 and (r^2 + 1)^3.5. I have observed that any sort of non-integer
power operation takes Julia longer than any basic arithmetic operation or
even space allocation. Thus, we can save computation by merging these two functions
and reusing the square root calculation:
In [45]:
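A sketch of what the merged function could look like (a reconstruction of the idea, not the exact notebook cell): the only non-integer power left is a single square root, which both expressions reuse.

```julia
function g_dgdr_wnklmns(r)
    aux0 = (r^2 + 1)^2      # integer power: cheap
    aux1 = sqrt(r^2 + 1)    # the one expensive non-integer operation
    # (r^2+1)^2.5 = aux0*aux1  and  (r^2+1)^3.5 = aux0*aux1*(r^2+1)
    g    = r^3 * (r^2 + 2.5) / (aux0*aux1)
    dgdr = 7.5 * r^2 / (aux0*aux1*(r^2 + 1))
    return g, dgdr
end
```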
In [19]:
P2P_reusesqrt is 1.60 times faster than P2P_nolinalg (4.050ms vs 6.487ms)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 4.050 ms (0.00% GC)
median time: 4.479 ms (0.00% GC)
mean time: 4.493 ms (0.00% GC)
maximum time: 5.361 ms (0.00% GC)
--------------
samples: 223
evals/sample: 1
Hence, by simply avoiding one extra power calculation we have now gained a
1.6x speed up.
Final Fix: @inbounds, @simd, @fastmath
This is straight from the Julia official documentation:
Like many modern programming languages, Julia uses bounds checking to ensure
program safety when accessing arrays. In tight inner loops or other performance
critical situations, you may wish to skip these bounds checks to improve runtime
performance. For instance, in order to emit vectorized (SIMD) instructions, your
loop body cannot contain branches, and thus cannot contain bounds checks.
Consequently, Julia includes an @inbounds(…) macro to tell the compiler to
skip such bounds checks within the given block. User-defined array types can use
the @boundscheck(…) macro to achieve context-sensitive code selection.
Write @simd in front of for loops to promise that the iterations are
independent and may be reordered. Note that in many cases, Julia can
automatically vectorize code without the @simd macro; it is only beneficial in
cases where such a transformation would otherwise be illegal, including cases
like allowing floating-point re-associativity and ignoring dependent memory
accesses (@simd ivdep). Again, be very careful when asserting @simd as
erroneously annotating a loop with dependent iterations may result in unexpected
results. In particular, note that setindex! on some AbstractArray subtypes is
inherently dependent upon iteration order. This feature is experimental and
could change or disappear in future versions of Julia.
Use @fastmath to allow floating point optimizations that are correct for real
numbers, but lead to differences for IEEE numbers. Be careful when doing this,
as this may change numerical results. This corresponds to the -ffast-math option
of clang.
I won’t dive into details, but what I have realized is that you can get a speed
up from the three macros listed above only when you are working with data
structures that pass the isbits
test. In practice,
that means our structs can’t contain any arrays, which leads us to
redefining the struct as follows:
In [30]:
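A sketch of the array-free struct; the name is a placeholder, but the field names match those used by P2P_FINAL below. The velocity and Jacobian outputs move out of the particle and into the U and J arrays passed to the function:

```julia
# All fields are plain scalars, so instances pass the isbits test
struct ParticleIsbits{T}
    X1::T; X2::T; X3::T                 # position
    Gamma1::T; Gamma2::T; Gamma3::T     # vectorial strength
    sigma::T                            # smoothing radius
end

isbits(ParticleIsbits(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.1))   # true
```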
With that twist, we now add @simd to the inner for-loop, while
@fastmath and @inbounds wrap all floating-point operations and output
writes:
In [50]:
In [51]:
P2P_FINAL is 1.14 times faster than P2P_reusesqrt (3.552ms vs 4.050ms)
BenchmarkTools.Trial:
memory estimate: 0 bytes
allocs estimate: 0
--------------
minimum time: 3.552 ms (0.00% GC)
median time: 4.090 ms (0.00% GC)
mean time: 4.056 ms (0.00% GC)
maximum time: 4.761 ms (0.00% GC)
--------------
samples: 247
evals/sample: 1
Compared to C++, our Julia code is now just as fast, and in fact a little
faster:
In [52]:
P2P_FINAL is 1.12 times faster than C++ (3.552ms vs 3.996ms)
However, our Julia code is now using fast-math optimizations (the equivalent of
clang’s -ffast-math). A fairer comparison also compiles the C++ code with the
-ffast-math flag in addition to -O3, which I have compiled and wrapped through
CxxWrap:
In [34]:
Samples: 1000
min time: 1.40114 ms
ave time: 1.72617 ms
max time: 2.71737 ms
In [53]:
C++ -ffast-math is 2.54 times faster than P2P_FINAL (1.401ms vs 3.552ms)
And once again, -ffast-math places C++ at a modest 2.5x over the optimized
Julia code; but don’t be discouraged: we are talking about a dynamically typed language
running at roughly 40% of the speed of fully optimized C++! =]
NOTE: In a subsequent test I commented out the two power operations (r =
sqrt(...) and gsgm, dgsgmdr = g_dgdr(r/Pj.sigma)) and the resulting wall-clock
time went down from 6.35 ms to 0.496 ms, meaning that at this stage the
overhead is in non-algebraic operations, and that the function could be further
optimized only by finding a more efficient way of calculating powers in Julia.
Julia as Fast as C++
Finally, let’s take a second to compare where we started and where we ended. Our
initial Julia function was very compact and understandable; however, it was more
than 50x slower than its C++ counterpart:
"""
Pythonic programming approach
"""function P2P_pythonic(particles,g,dgdr)forPiinparticlesforPjinparticlesdX=Pi.X-Pj.Xr=norm(dX)ifr!=0# g_σ and ∂gσ∂rgsgm=g(r/Pj.sigma)dgsgmdr=dgdr(r/Pj.sigma)# K × Γpcrss=cross(-const4*(dX/r^3),Pj.Gamma)# U = ∑g_σ(x-xp) * K(x-xp) × ΓpPi.U[:]+=gsgm*crss# ∂u∂xj(x) = ∑[ ∂gσ∂xj(x−xp) * K(x−xp)×Γp + gσ(x−xp) * ∂K∂xj(x−xp)×Γp ]forjin1:3Pi.J[:,j]+=(dX[j]/(Pj.sigma*r)*(dgsgmdr*crss)-gsgm*(3*dX[j]/r^2)*crss-gsgm*(const4/r^3)*cross([i==jforiin1:3],Pj.Gamma))endendendendend
In [48]:
C++ is 58.48 times faster than P2P_pythonic (3.996ms vs 233.679ms)
After all the optimization we end up with about twice the number of lines, but
65x faster than the original version and only ~2.5x slower than C++ with
-ffast-math:
"""
Optimized Julia code
"""function P2P_FINAL(particles,g_dgdr,U,J)for(i,Pi)inenumerate(particles)@simdforPjinparticles@fastmath@inboundsbegindX1=Pi.X1-Pj.X1dX2=Pi.X2-Pj.X2dX3=Pi.X3-Pj.X3r=sqrt(dX1*dX1+dX2*dX2+dX3*dX3)ifr!=0# g_σ and ∂gσ∂rgsgm,dgsgmdr=g_dgdr(r/Pj.sigma)# K × Γpcrss1=-const4/r^3*(dX2*Pj.Gamma3-dX3*Pj.Gamma2)crss2=-const4/r^3*(dX3*Pj.Gamma1-dX1*Pj.Gamma3)crss3=-const4/r^3*(dX1*Pj.Gamma2-dX2*Pj.Gamma1)# U = ∑g_σ(x-xp) * K(x-xp) × ΓpU[1,i]+=gsgm*crss1U[2,i]+=gsgm*crss2U[3,i]+=gsgm*crss3# ∂u∂xj(x) = ∑[ ∂gσ∂xj(x−xp) * K(x−xp)×Γp + gσ(x−xp) * ∂K∂xj(x−xp)×Γp ]# ∂u∂xj(x) += ∑p[(Δxj∂gσ∂r/(σr) − 3Δxjgσ/r^2) K(Δx)×Γpaux=dgsgmdr/(Pj.sigma*r)*-3*gsgm/r^2# j=1J[1,1,i]+=aux*crss1*dX1J[2,1,i]+=aux*crss2*dX1J[3,1,i]+=aux*crss3*dX1# j=2J[1,2,i]+=aux*crss1*dX2J[2,2,i]+=aux*crss2*dX2J[3,2,i]+=aux*crss3*dX2# j=3J[1,3,i]+=aux*crss1*dX3J[2,3,i]+=aux*crss2*dX3J[3,3,i]+=aux*crss3*dX3# Adds the Kronecker delta termaux=-const4*gsgm/r^3# j=1J[2,1,i]-=aux*Pj.Gamma3J[3,1,i]+=aux*Pj.Gamma2# j=2J[1,2,i]+=aux*Pj.Gamma3J[3,2,i]-=aux*Pj.Gamma1# j=3J[1,3,i]-=aux*Pj.Gamma2J[2,3,i]+=aux*Pj.Gamma1endendendendend
In [54]:
P2P_FINAL is 65.79 times faster than P2P_pythonic (3.552ms vs 233.679ms)
P2P_FINAL is 1.12 times faster than C++ (3.552ms vs 3.996ms)
C++ -ffast-math is 2.54 times faster than P2P_FINAL (1.401ms vs 3.552ms)
Oddly enough, through the optimization process we naturally morphed the code
into something that looks very similar to the C++ version.
In conclusion, if you are trying to get a piece of code fit for high-performance
computing, my recommendation is to forget about elegant and line-efficient
coding (“pythonic” some would say), and code in C++-esque style from the
beginning.
Conclusions
Without foreknowledge of the types to be handled, Julia can’t optimize the
code during compilation. Multiple dispatch and JIT will compile a specialized
version of every function based on the arguments it is given on the fly, but
this is not the case for struct properties. Always define your structs with
their properties explicitly specified as concrete
types.
@code_warntype is your best friend when trying to catch
abstract types in your code.
Avoid memory allocation: If your function is allocating an insane amount
of memory, that’s an indication that you are saving yourself some lines of code
while sacrificing computation efficiency (e.g., inline list-comprehension
operations or array algebra instead of unrolling the operations
into explicit code). I recommend using
@time or @benchmark to check memory allocation.
Avoid storing intermediate calculations in internal arrays; just use scalar Float
variables. For some reason, Julia spends a lot of time allocating space
for internal arrays.
Any non-integer power operations (^ or sqrt) are very expensive in
Julia. If possible, reduce these operations by storing and reusing previous
calculations.
At all costs, avoid using functions from the LinearAlgebra package
(like dot, cross, norm), as they will require allocating memory for
internal arrays.
For some reason, @simd, @fastmath, and @inbounds speed up your code only when
working with isbits types.
Forget about elegant and line-efficient coding (“pythonic” some would say),
and code in C++-esque style in parts of the code where performance is
critical.
Notes
This benchmark was done on a Dell Latitude 5580 laptop (Intel® Core™ i7-7820HQ
CPU @ 2.90GHz × 8) using only one process.
Julia v1.0.3 was used.
The official Julia documentation also provides very useful tips for performance.
Winckelmans, G. S., & Leonard, A. (1993). Contributions to Vortex Particle
Methods for the Computation of Three-Dimensional Incompressible Unsteady Flows.
Journal of Computational Physics, 109(2), 247–273.
https://doi.org/10.1006/jcph.1993.1216
Alvarez, E. J., & Ning, A. (2018). Development of a Vortex Particle Code for
the Modeling of Wake Interaction in Distributed Propulsion. In 2018 Applied
Aerodynamics Conference (pp. 1–22). American Institute of Aeronautics and
Astronautics. https://doi.org/10.2514/6.2018-3646