Scientific Programming Languages
11 Aug 2015, Andrew NingI’ve used a number of scientific programming languages over the past 16 years: C++, C, Matlab, Java, Fortran, Python, and Julia, and I wouldn’t name any one as the “best” (I’ve also used Objective-C, JavaScript, and PHP quite a bit, but not for scientific computing). Usually I try to pick the right tool for the job, not necessarily just the tool I happen to know best (as they say: if all you have is a hammer, everything looks like a nail). However, my usage has evolved over the years from Matlab-centric, to Python-centric, and I’m contemplating a move to Julia-centric. Before explaining why, let’s discuss some of the reasons why I might choose one language over the others.
Edit (May 2019): We’ve gone all in with Julia shortly after this post was written (see bottom of post)
Matlab
Matlab is widely used in university settings. For students it is very affordable, and it is very easy to use. Matlab is oriented towards scientific computing and it comes pre-packaged with a built-in IDE, debugger, and a large collection of built-in methods and toolboxes. I did much of my graduate work and PhD Dissertation using Matlab. Most students like it a lot.
It’s great as a student, particularly an undergraduate student, but as you move to larger problems and/or move out of a university setting its weaknesses become more apparent. Matlab is quite expensive outside of universities, and it runs very slowly. The first concern could be addressed with Octave. Octave is an open-source software designed to mimic Matlab, but it runs even slower and is far less capable in terms of available packages (it’s been years since I’ve used Octave so maybe the gap isn’t so large anymore, but Matlab hasn’t been idle either and has made quite a few performance and capability improvements including a JIT).
Because of the speed and parallelization issues, a typical workflow for me was to prototype a code in Matlab, and then if needed rewrite the entire code in either C, C++, or Fortran. Obviously, rewriting is a pain, but if needed the speedup was usually worth it (usually 1-3 orders of magnitude). Mex files can be used here, but I found them more painful than helpful with large multi-language projects.
I haven’t used Matlab for research work in several years (with one exception noted below). My graduate students rarely use it either. I do, however, often use it in undergraduate classes because it is widely available and easy for students to use. I also have most of my undergraduate research assistants use Matlab. It’s perfectly suited for their problems where high performance is not necessary, and familiarity and ease of use is much more important. That’s not necessarily a deliberate choice, I’d be happy to use Python with them as well, but Matlab is what they already know from other classes. For the level of time they have available and the complexity of their problems, it’s just not worth trying to teach them something else.
There is one exception for my usage, which is that I sometimes still use fmincon from the Optimization Toolbox (which usually involves some sort of terrible hack to make the wrapping work). I’ve used a lot of optimization packages for constrained nonconvex problems, and fmincon is still one of the most robust on the types of problems I solve. These days I primarily use SNOPT through Python, but still find fmincon useful from time to time.
Update (10/1/2015): With Mathworks relatively new Matlab Engine for Python, connected to fmincon from Python was relatively easy. I posted an example here, to hopefully be integrated with pyopt-sparse later. Having access to fmincon is great, and makes me once again interested in keeping a Matlab license around.
C, C++, and Fortran
If the end result is a re-write in a compiled language, why not just start there to begin with? Good question. I have done that on occasion. The problem is that while these languages have great run times, development time is usually much slower (even when accounting for the re-write), and re-writing may not be necessary. Development time is usually a much bigger bottleneck as compared to run time, at least for my use cases.
In languages like Matlab, debugging and inspecting variables, plotting, making small changes and retesting, and using existing functions is just much faster. All these things can be done in C, C++, and Fortran, but it just takes more work and time to repeatedly compile, integrate existing libraries or functions yourself, debug and plot results, etc. It is always more important for your algorithms to be correct than to be fast. Using a language like Matlab allows for rapid development, with more testing and inspecting for a give time allotment. Once you are confident in a code’s correctness, then you can start thinking about speeding it up, if necessary, usually by rewriting in one of these compiled languages. The key point is that rewriting often isn’t necessary. For example, if my vortex lattice code was running in a tenth of a second in Matlab, that was already good enough for the number of cases I needed to run. Code can always be pushed off to other computers or clusters to run overnight and on the weekends easily.
Of all of the languages I am discussing in the post I would say C++ is the most difficult to work with. Despite its complexity, there are still times when C++ is a good choice. It’s object-oriented and it’s fast (if done well). If that is the cross-section you need, like in a CFD code, then it may be the best way to go. XCode is actually a nice IDE for C++ development. It has a great static analyzer that eases some of the pain of working with it.
For people with limited Fortran experience it often has a bad reputation, and in my opinion undeservedly so. It is true that Fortran 77 and older are pretty horrible to work with. GOTO statements, implicit typing, fixed layout, and so on just made life difficult. Lots of awesome numerical packages were developed in Fortran 77 (or older), and so the only exposure many people have had to Fortran is looking at the syntax of some of these old codes.
However, modern Fortran (90 and up) is actually quite nice to work with. A lot of improvements were made including free-form input, array notation (like Matlab’s), modules, dynamic memory allocation, etc. Recent versions have even added object-oriented (OO) features. Fortran is designed for scientific programming (unlike C which is more general), and the syntax is actually easy to use and similar to Matlab’s. It does requiring specifying all of the types at the beginning of a function or subroutine, which slows down development, especially when your interface to the functions is still evolving quite a bit.
As an aside: one thing that drives me crazy in Fortran is how it handles double precision constants. If you define a variable like this:
real(8) :: x
x = 1.0
x will actually hold a single precision number and not double even though its type is double (compiler-dependent). Instead you need to specify the constant as a double as well: x = 1.0d0
. In practice I actually use something like below for portability:
integer, parameter :: dp = selected_real_kind(15, 307)
real(dp) :: x
x = 1.0_dp
Littering _dp
or d0
all over the place is a bit of a pain. I didn’t know this early on, and it was a big gotcha. I was working on an optimization application where I needed exact gradients, and the difference between single and double precision in some places was causing my gradients to be very close but just barely off to the point where it caused some numerical issues. Took me a while to figure this out.
If I needed speed, but didn’t need OO, how would I decide between Fortran and C? Usually just by what packages I needed, and which language I thought would be easier to do the integration in. I still use both Fortran and C quite a bit, but almost always only in connection with Python as I’ll discuss below. I don’t use C++ very much anymore, unless I want something that can stand alone without Python. Otherwise, the combination of Python with C I find to be much easier to develop in and just as fast to run (more on this later).
Java
I used Java quite a bit during graduate school. Most folks don’t think of Java as a scientific langauge. My advisor liked Java. He was ahead of his time in developing interactive modules for teaching. His course notes (e.g., http://adg.stanford.edu/aa241) contained Java applets that allowed you to run interactive examples on wing design, etc., which were super helpful. Recently, there has been a lot of interest in the scientific community to do similar things with Jupyter notebooks.
From some CS courses I took, I had developed a pretty deep familiarity with Java. I also liked object-oriented programming and Java was way easier than C++ for development. None of the other languages I used (Matlab, C, Fortran) accomodated OOP. I used Java for several aircraft analysis tools and it worked very nicely. As a bonus you could load Java *.jar files in Matlab and call the functions from Matlab for plotting and other visualizations.
As a graduate student I also worked on the side for a startup, Complete Solar Solution. I developed an analysis tool, HelioQuote, for them in Java. This was a comprehensive tool that would automatically pull in electric rates for an address, download a Google map of their roof, perform an optimization analysis, and layout solar panels on their house that you could then drag and drop around if desired.
Java was great because it was a full featured programming language, was easy to use, had pretty good GUI support if needed, and actually had surprisingly good performance through an aggressive JIT compiler. The main problem with Java was numerical support was weak. There were very few scientific libraries available through Java. I wrote a bunch of my own methods and interfaces to do basic, but frequently used stuff like integration, root finding, linear solves, etc. I used that library quite a bit, but it was very limited compared to what’s available in Matlab or Python.
Java was pretty nice to work with, but I haven’t used it in years. After diving into Python I found it superior in every way. It is an even easier language to work with and has great scientific support. Its only disadvantage is performance, but this is remedied through its easy connections to C/Fortran.
Python
I dabbled in Python a bit during graduate school, but really only for random fun side projects (like this movie filter project we made as a proof of concept). As a postdoc I started using Python for everything. Python is pretty great. It is a very easy language to use. It has simplistic syntax like Matlab, but unlike Matlab it allows for objected oriented programming (I know Matlab has some OO features, but they are pretty weak), functional programming, or procedural programming.
Python is free. This may have saved us some money (not really we already had Matlab licenses at the lab), but the bigger benefit is that it allowed others to use our code. We had some users/collaborators who did not have Matlab and so developing in Matlab limited our ability to collaborate. Matlab does have a runtime you can distribute to allow users to run your code, but it doesn’t allow them to develop as well.
It’s also open-source. In scientific computing, I’ve needed to dive into the details of certain algorithms many times. In Python I can do this, in Matlab I can rarely do this.
One of the main benefits of Python, for me, was performance. This seems odd to say because Python is not fast. It’s an interpreted langauge with speeds similar to Matlab’s. However, Python allows you to wrap C/Fortran code pretty easily. Now, instead of rewriting my entire code in a compiled language, I would profile, find the bottleneck, and rewrite just that portion in Fortran or C. This allowed me to approach effectively the same speeds I would get in pure Fortran or C, but with an easy-to-use, rapid development environment in Python for scripting, plotting, debugging, etc. It also allowed me to easily take advantage of existing C/C++/Fortran libraries. A wide variety of compiled numerical libraries are already available in numpy, scipy, etc., but for specialized tools (like finite element codes, aircraft panel codes, etc.) I was able to wrap these in Python. Being able to directly pass variables, instead of trying to read/write input files, made using and integrating these codes much easier—especially for optimization applications.
This is perhaps the primary use of Python in scientific computing. It is often referred to as a “glue code”. Meaning, that most of the work is not actually being done in Python, but Python serves as the glue that links codes together. I’ve linked a blade element code in Fortran 95, a beam finite element code in C++, a cost model in C, and dozens of other components in pure Python, all with an optimizer written in Fortran 77. All the interfacing is done in Python making data passing and scripting very easy.
Python has completely replaced Java and C++ for me and almost completely replaced Matlab as well. The combination of Python with either Fortran or C gives me the benefits of a fast compiled language with a lightweight, interpreted, easy-to-use interface. What’s the downside? The main downside is that I have to work with multiple languages still. There are some inefficiencies (in terms of development) in switching between languages.
I should say of course that your mileage may vary. I don’t work in controls, but as far as I’m aware there is nothing as capable as Simulink and I don’t think the support for controls is as strong in Python as compared to Matlab. But that’s not my area, just my impression. For the types of things I do, the only thing I’ve missed are optimization algorithms. The Optimization Toolbox in Matlab is pretty capable and robust. Conversely, the optimization tools built into scipy are not very good in my opinion. Fortunately, this is remedied through the use of pyOpt and now pyOptSparse.
Julia
Julia appears to be the holy-grail of scientific programming languages. An interpreted, easy-to-use language, and with speeds comparable to those of C. In addition, its designed for mathematical computing, including parallelism and cloud computing, and if needed it makes it even easier to call C or Fortran code (no wrappers needed).
I’ve been testing it with my students and with one of my own projects. I’ll provide more details in a subsequent post, but in short I’ve concluded that it’s promising but still too early to fully switch. The surrounding ecosystem is lacking, and so I needed to develop or wrap a lot of methods myself that are just available in Python. There were also a few other minor issues that slowed down development, and in the end even the run time was disappointing (though I made no effort to optimize and I suspect there are some small changes I could make that would make a significant impact on run time).
Julia is at a point where I would say it is fully functional for any of our projects, but in terms of development time (and even run time) it wasn’t yet superior to our current workflow in Python with C/Fortran. I’m definitely rooting for it to get there, and with all the ongoing work by the Julia team I imagine it won’t be long. Working in multiple languages has been frustrating at times, and I’d love to go back to spending most of my time in a single language.
I should mention that Python has several JITs that look very promising, notably Numba and Pyston. However, these are not quite there yet either for the full stack of scipy tools. Additionally, I often rely on automatic differentiation and I’m quite sure how that would work with a JIT. Perhaps these will soon reach a point of full usability across the scientific stack to where one could do high-performance development fully in Python.
Edit (May 2019): In our lab we completely switched over to Julia about three years ago (at least for for all new development). We’ve been very impressed with the performance and happy with the ease of one language. There were a few growing pains, but once we started using Revise.jl, and text editors with Julia support, the new workflow became more natural. We also had to learn (or un-learn) styles for better performance.
I haven’t adopted Julia in teaching yet but would like to. In many of my classes programming is a tool and not the point of the class so it’s hard to justify taking time out to teach Julia when they already know something else. In my graduate optimization class it would be a natural fit except that the optimization solver support (for general nonlinear optimization) isn’t quite to the point where I’d be comfortable with a switch. In our research this isn’t an issue as we use commercial optimizers that we’ve licensed and wrapped in Julia.