Saturday, April 27, 2013

Fun with Numba, NumPy and F2PY

I've started using Numba to speed up MD simulations for our course in Molecular Statistics, which I teach together with +Jan Jensen and +Jimmy Charnley Kromann.

One exercise is using the Velo-Verlet algorithm to simulate a Lennard-Jones gas/liquid with periodic boundary conditions. Since we do everything in Python in this course, running the actual simulation is quite slow. What we do is we supply the students with a compiled FORTRAN module, compiled with F2PY which they can use when they've written their own Lennard-Jones gradient code.

The FORmula TRANslator module is extremely fast, compared to Python code. I rewrote the code from last year's course to a function I could use with Numba and compare directly to the F2PY module

For 100 particles:

  • Numba: 0.146 ms/iteration (466x speed up)
  • F2PY: 0.236 ms/iteration (289x speed up)
  • Python: 68.06 ms/iteration
Now, I'm pretty sure a pure FORTRAN implementation would be faster than the F2PY, but I am very impressed with the Numba, since it's standard Python code and MUCH more simple to read than FORTRAN -- in theory all you have to do is write @autojit before a function definition. This is of course in theory ... I didn't do anything to improve on the readability of the Lennard-Jones code here. Before I posted, I had actually inserted a comment saying """This code block is unreadable""".

Now, there were a few issues I discovered:


1) @autojit vs. @jit(argtypes=[double[:,:], double[:]])

Since Numba has to decide what arguments a function takes before it's being compiled, it needs to know what possible type of arguments the function takes. If you're lucky, Numba decides on the right type and you can get away with @autojit, but sometimes autojitting makes the code SLOWER than standard, interpreted Python. In one case I got a factor of 10x slower with autojit, but explicitly stating the function argument types with @jit(argtypes = .... ) I got a speed up of 20x on the same code compared to interpreted Python.


2) Returning tuples in compiled code block

Numba does not allow a tuple to be returned inside a compiled code block. So don't do this.


3) Use Numpy properly

  • Numba is NumPy aware, so code ran faster when numbers were stored in numpy.array types rather than just regular Python lists.
  • At first I was numpy.round() to round to nearest integer in the periodic boundary condition code. Switching three numpy.round() calls to numpy.rint() gave a speed up of around 100x on the code execution.
  • Use U[i, j] instead of U[i][j] on NumPy arrays. Not much of a difference in the vanilla Python code, but MASSIVE speed gains in compiled code.


Python Lennard-Jones code:

import numpy
from numba import double, jit
rc2 = 6.25
rc2i=1.0/rc2
rc6i=rc2i*rc2i*rc2i
ecut=rc6i*(rc6i-1.0)
@jit(argtypes=[double[:,:], double[:]])
def lennardjones(U, box):
# Can't use (ndim, npart) = numpy.shape(U)
# with Numba. No unpacking of tuples.
ndim = len(U)
npart = len(U[0])
F = numpy.zeros((ndim, npart))
Epot = 0.0
Vir = 0.0
for i in range(npart):
for j in range(npart):
if i > j:
X = U[0, j] - U[0, i]
Y = U[1, j] - U[1, i]
Z = U[2, j] - U[2, i]
# Periodic boundary condition
X -= box[0] * numpy.rint(X/box[0])
Y -= box[1] * numpy.rint(Y/box[1])
Z -= box[2] * numpy.rint(Z/box[2])
# Distance squared
r2 = X*X + Y*Y + Z*Z
if(r2 < rc2):
r2i = 1.0 / r2
r6i = r2i*r2i*r2i
Epot = Epot + r6i*(r6i-1.0) - ecut
ftmp = 48. * r6i*(r6i- 0.5) * r2i
F[0, i] -= ftmp * X
F[1, i] -= ftmp * Y
F[2, i] -= ftmp * Z
F[0, j] += ftmp * X
F[1, j] += ftmp * Y
F[2, j] += ftmp * Z
Vir += ftmp
Epot = Epot * 4.0
return Epot, F, Vir
view raw lj.py hosted with ❤ by GitHub

F2PY Lennard-Jones code:

C THE LENNARD-JONES POTENTIAL, GRADIENT AND THE VIRIAL
subroutine LennardJones(U,box,Epot,F,Vir,N)
cf2py intent(in) U
cf2py intent(in) box
cf2py intent(out) Epot
cf2py intent(out) F
cf2py intent(out) Vir
implicit none
double precision U,Epot,F,box,Vir
integer N
dimension U(3,N),F(3,N),box(3)
C
double precision zero,one,X,Y,Z,r2,r2i,r6i,feight,half,
* ftmp,four,rc2,rc2i,rc6i,ecut
integer i,j
C
parameter(zero=0.0D+00,one=1.0D+00,feight=48.0D+00,
* half=0.5D+00,four=4.0D+00,rc2=(6.25D+00),
* rc2i=one/rc2,rc6i=rc2i*rc2i*rc2i,ecut=rc6i*(rc6i-one))
Epot = zero
Vir = zero
do i=1,N
F(1,i) = zero
F(2,i) = zero
F(3,i) = zero
enddo
do i=1,N-1
do j=i+1,N
X = U(1,j) - U(1,i)
Y = U(2,j) - U(2,i)
Z = U(3,j) - U(3,i)
X = X - box(1)*nint(X/box(1))
Y = Y - box(2)*nint(Y/box(2))
Z = Z - box(3)*nint(Z/box(3))
r2 = X*X + Y*Y + Z*Z
if(r2.lt.rc2) then
r2i = one / r2
r6i = r2i*r2i*r2i
Epot = Epot + r6i*(r6i-one) - ecut
C WE MULTIPLY BY 48 HERE AND NOT LAST.
ftmp = feight*r6i*(r6i-half)
F(1,i) = F(1,i) - ftmp*X*r2i
F(2,i) = F(2,i) - ftmp*Y*r2i
F(3,i) = F(3,i) - ftmp*Z*r2i
F(1,j) = F(1,j) + ftmp*X*r2i
F(2,j) = F(2,j) + ftmp*Y*r2i
F(3,j) = F(3,j) + ftmp*Z*r2i
Vir = Vir + ftmp
endif
enddo
enddo
Epot = Epot * four
return
end
view raw LJ.f hosted with ❤ by GitHub

Python Velo-Verlet solver:

import numpy
from forces import lennardjones
# Initialize simulations (code not shown)
temperature = 2.0
n_atoms = 100
rho = 0.01
U = initialize_positions(n_atoms, rho)
box = initialize_box(n_atoms, rho)
V = initialize_velocities(U, temperature)
# Start simulation
n_steps = 10000
dt = 0.001
(epot,F,Vir) = lennardjones(U,box)
for i in range(n_steps):
U += V * dt + 0.5 * F * dt * dt
F0 = F[:]
(epot, F, Vir) = lennardjones(U, box)
V += 0.5 * (F + F0) * dt









Friday, April 26, 2013

Introducing 32 students to Linux and Python via VirtualBox

I am currently teaching the Molecular Statistics course together with +Jimmy Charnley Kromann and +Jan Jensen at the University of Copenhagen. My part of the course is teaching the students how to program simple molecular simulation algorithms in Python. +Jan Jensen does the theoretical lectures, and I give lectures in basic Python programming and how to implement the equations from Jan.

One of the challenges we faced before the start of the course was to just get Python, Numpy and Matplotlib up and running on every single student's laptop. Since the students run everything from Windows XP, 7 and 8 to Linux and Mac, this can pose quite a challenge for the students, some of which have barely ever even installed a program before.

This year, we opted to create a VirtualBox with Ubuntu on it, and make sure everything the need throughout the course was pre-installed. The students could then simply download the VirtualBox program and install our VirtualBox image.

+Jimmy Charnley Kromann made this excellent video of how to get the VirtualBox going before the first class:



My biggest fear was that the students would be frightened and confused by their first meeting with a non-Microsoft system. But that (fortunately!) didn't turn out to be an issue at all.

These are my observation from the first lecture and first exercises class:
  • To my surprise all of the students had managed to bring their laptop with a working VirtualBox to the first class. ALL students! I was pretty thrilled since that meant they all could follow my programming from the projector on their own laptop, which is what I want them to do during my lectures.
  • What was going on on the projector was the same as on their laptop during the lecture, which I think help lessen the confusion.
  • We didn't have to do ANY tech support during the interactive lectures or the following exercises class to get Python working. Only hitch was two students who wanted American keyboard layout (our Box came with a Danish keyboard layout by default). This was quickly (though some might say awkwardly) solved by "setxkbmap us" in the terminal.
  • Some of the groups of students wanted an extra module to make a video of the things they had plotted in Matplotlib. +Jimmy Charnley Kromann then posted simple instructions and the relevant "sudo apt-get install" command on the course website -- so that went relatively easy as well.
  • Things which didn't work very well were the shared clipboard and drag-and-drop of files between the host OS and the VirtualBox. That only worked on 50% of the student's laptops. Must be a bug in VirtualBox somewhere.
  • I only overheard the sentence "Grrr, I hate linux!" once.

So far so good!



Monday, April 22, 2013

Numba -- It's like Python on steroids ... on steroids.

I'm teaching a course in molecular simulations this year in which the students have to program simulations of various simple systems. We're trying to keep things as simple as possible and everything is done in Python.

The task of the second week of the course is to program a simple simulation of a 2D Lennard-Jones gas. The computationally intensive step in this exercise is calculating the energy and gradient. In the past we've (and by we, I mean +Casper Steinmann) written a FORTRAN module and used F2PY to exploit the awesomeness of the good old FORmula TRANslator and have the students link their Python scripts with a precompiled .so module for this to run at an acceptable speed

This solution is of course completely fine, but not very general. I've been looking into Cython, pypy and various other methods to speed things up a bit. But they all seemed quite elaborate an unPythonic, and you might as well do the whole thing in C then.

Yesterday Numba was brought to my attention on this blog. In short, Numba is a way to compile (at execution time) Python functions into C which normally makes your program run substantially faster. The blog has a very nice example where the code gets a 1000x speedup just by adding one function decorator. And the syntax couldn't be simpler.

Let's try Numba on our Lennard-Jones program. Here is the naive function they have to implement:

def calc_energy_and_gradient(X, Y):
""" Calculate the Lennard-Jones energy and
Energy gradient between particles.
X and Y are lists containing x and y
coordinates, respectively.
"""
# Holds the energy and the energy gradient
energy = 0.0
gradient = numpy.zeros((2, len(X)))
# Double loop over all particles
for i in range(len(X)):
for j in range(len(X)):
if i>j:
# Fortran'esque calculation of inverse distances
# to 2nd, 6th and 12th power.
rij2 = (X[i]-X[j])**2 + (Y[i]-Y[j])**2
irij2 = 1.0 / rij2
irij6 = irij2*irij2*irij2
irij12 = irij6*irij6
# Summation of the Lennard-Jones energy
energy += irij12 - irij6
# Summation of the Lennard-Jones energy gradient
gradient[0][i] += -(X[j]-X[i])*irij2 * (irij12 - 0.5*irij6)
gradient[0][j] += (X[j]-X[i])*irij2 * (irij12 - 0.5*irij6)
gradient[1][i] += -(Y[j]-Y[i])*irij2 * (irij12 - 0.5*irij6)
gradient[1][j] += (Y[j]-Y[i])*irij2 * (irij12 - 0.5*irij6)
energy *= 4.0
gradient *= 48.0
return energy, gradient
view raw jlge.py hosted with ❤ by GitHub
The code is then executed via Python as one would normally execute Python code, specifying in the script that the calc_energy_and_gradient() function is to be compiled via Numba. This can be done simply by writing the the @autojit decorator before the function definition. 

You can also specify exactly which type of input the just-in-time compilation takes takes (an example is commented out). This makes the compilation of the function be type specific, but a bit faster.  Make sure to import numba and the specific type of argument (here the arguments are of the type double) and the relevant decorators (jit or autojit).

import numpy
import random
from numba import double # Add this
from numba.decorators import autojit, jit # Add this
#@jit(argtypes=[double[:], double[:]]) # Add this if you want more speed than autojit
@autojit # Add this
def calc_energy_and_gradient(X, Y):
""" Calculate the Lennard-Jones energy and
Energy gradient between particles.
X and Y are lists containing x and y
coordinates, respectively.
"""
[ code here ]
return energy, gradient
n_particles = 16
# Initialize particle coordinates and velocities
X = numpy.array([random.random() for i in range(n_particles)])
Y = numpy.array([random.random() for i in range(n_particles)])
# Calculate initial energy and gradient
for n in range(10000):
energy, gradient = calc_energy_and_gradient(X, Y)
view raw solution2.py hosted with ❤ by GitHub
Conclusion: 10000 gradient and energy evaluations take 16 seconds with the standard Python implementation. 2 seconds with @autojit and 1.5 seconds with the explicit @jit. Not too bad for something that doesn't take ANY code changes.