Is Julia the best language for quantitative finance?
This article is a repost; original author: Mario Emmanuel
Sep 17, 2019 · 13 min read
I have been working on quantitative intraday strategies over the last few months. As a side result, I have tested workflows for similar tasks in Python, C, Fortran and Julia. Here are my findings.
The context
To give some background on the nature of the projects I have tested, I will begin by clarifying that:
- The projects are related to trading instruments (i.e. I design and simulate algorithmic/quantitative strategies for derivatives markets).
- I have not used Machine Learning or AI techniques in these strategies, just plain/vanilla statistics and simulation.
- The data sets involved are large but not huge: my simulations typically cover around 30 million records per asset/instrument, every record is used several times, and I run parametric and Monte Carlo analyses. This implies a large number of iterations.
- I am not an expert programmer and I am not interested in becoming one; I just want to focus on the market logic and the strategies that exploit profitable edges.
My quest is to find the right tool combination that performs well enough and simplifies my workflow. Hence the review is written from the perspective of an end user of these technologies.
This context has some implications:
- I need a language that can deal easily and without effort with large data sets.
- I need speed.
- I do not need so much speed that multi-core or parallel processing becomes necessary.
- I do not need —at this time— Machine Learning or AI libraries.
This post is the outcome of the journey I took to find an optimal workflow. It is a subjective but informed view of each language's strengths and weaknesses for this particular endeavour. I hope you find it useful and enjoyable.
The beginnings: Python and R
Approaching the field means that you will probably begin with Python or R, and so did I.
The R language has been the natural choice for statistics in the scientific/academic community since well before the term Data Science was coined. R is the open-source implementation of the S language, which was created in the 70s at Bell Labs. While R is revolutionary and well suited to statistics, I found it difficult to master and extremely inefficient at handling large data sets. I was not able to do simple tasks such as loading several years' worth of prices from a CSV.
Trading simulation usually requires large amounts of data, and if you need to split it in advance, look for alternative loading methods, or the loading process is simply complex, it is worth exploring other options. This does not mean that R cannot cope with it; it is just that I did not find it easy and straightforward. Data loading is an integral part of data analysis and should be provided out of the box.
Surprisingly, basic tasks such as loading an average-to-large CSV can be a handicap for certain modern languages. Loading a 200 MB CSV file into memory can expose weaknesses in the first stage of every analysis: data loading and preparation.
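To make this concrete, here is what a careful load looks like in Python with pandas (a minimal sketch; the file name and column layout are hypothetical): declaring column types and timestamp parsing up front, instead of relying on type inference over the whole file.

```python
import pandas as pd

# Hypothetical 1-minute price file: timestamp, open, high, low, close, volume.
dtypes = {
    "open": "float64",
    "high": "float64",
    "low": "float64",
    "close": "float64",
    "volume": "int64",
}

# Declaring dtypes and the timestamp column up front keeps memory usage
# predictable and avoids a second pass to fix mis-inferred columns.
prices = pd.read_csv(
    "prices_1min.csv",
    dtype=dtypes,
    parse_dates=["timestamp"],
)

prices.info(memory_usage="deep")
```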
Python had been my companion language for small and medium projects during the last few years. Python is easy to learn, easy to code in, and has a good reputation in Data Science when combined with NumPy and Pandas. Jupyter as an interactive console has also contributed to making Python one of the most widely chosen languages for Data Science projects. Part of its success is due to the fact that it is a widely used general-purpose language that can also be used for Data Science. That makes it more attractive to newcomers with a programming background.
While Python is easy, we cannot forget that NumPy and Pandas are extensions to the language provided through external libraries. They are not natural parts of the language. Python is also a slow number cruncher because it was designed as a very high-level, loosely typed language. Python enthusiasts will state that you can improve on that and that modern Python has things such as Cython and a myriad of other solutions. But if I have to deal with number crunching, I would rather use a natural number cruncher.
One of the most useful takeaways from Python is the concept of using DataFrames. DataFrames are an outstanding strategy that I have incorporated into my workflow (even when the intensive number-crunching analysis is performed using plain arrays). DataFrames allow you to enrich, debug and visualize the whole workflow during the simulation/analysis process. I use them as a pipeline where the different steps of the simulation add their results to the analysis. I like to think of them as a tabulated board where all results, scores, tests and conditions are annotated and passed between the different simulation stages. Each stage uses the previous data and adds its own data to the board, and at the end of the simulation the DataFrame contains all the information needed to evaluate the simulated strategy.
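As an illustration of this "tabulated board" idea, here is a minimal pandas sketch (the column names and the toy signal logic are made up, not my actual strategy): each stage simply appends its results as new columns to the same DataFrame.

```python
import numpy as np
import pandas as pd

# Start the board with the raw input: one row per bar/record.
board = pd.DataFrame({
    "close": [100.0, 100.5, 101.2, 100.8, 101.5, 102.0],
})

# Stage 1: enrich with derived quantities.
board["ret"] = board["close"].pct_change()
board["ma3"] = board["close"].rolling(3).mean()

# Stage 2: annotate entry conditions computed from stage 1.
board["signal"] = np.where(board["close"] > board["ma3"], 1, 0)

# Stage 3: score the strategy. Every intermediate value stays on the board,
# which makes cross-checking and debugging each stage straightforward.
board["pnl"] = board["signal"].shift(1) * board["ret"]

print(board)
print("cumulative pnl:", board["pnl"].sum())
```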
During the first tests with Python I had many issues loading large data sets, especially on the small server I had allocated for this task. I also experienced poor performance. While it would have been easy to get a more powerful server, I decided to give C a try.
C/C++
You cannot go wrong with C. C is mature, it can do almost anything and it produces very efficient code. C has not changed substantially since the ANSI C standard was formulated back in 1989. Modern standards added functionality, more portable data types and other features, but for simulation even the old standard fits. C++ is widely used, but the object-oriented paradigm provides little value in this kind of task. I think its usage has more to do with the fact that younger people study C++ instead of C than with an actual benefit of the object-oriented paradigm for this particular field.
The main problem with C is that it is difficult to do complex things with it. It shines for small algorithmic routines, and it is actually quite difficult to make it perform slowly. But the big issue is that speed and performance are achieved through pointers, and debugging pointers can be challenging.
I started developing a C library to assist in the simulation tests, and while I managed to get proper results, I noticed that as the project grew, the time and complexity of debugging grew slowly but steadily with it. When that happens, it is an early warning that the project is gaining too much weight in terms of complexity. I was probably trying to do too much with C.
C is an outstanding tool and it is widely used in the HFT industry because it is a natural performer, but you must be ready to cope with longer projects that require skilled resources. Additionally, certain tasks which are super simple in languages like Python can be really challenging in C. Debugging is also a big issue, as already stated.
Python + NumPy + Pandas
Once I realised the complexity of doing everything in C, I came back to Python, trying to find a workaround using smaller subsets and SQL. It also helped that around that time a colleague showed me a library named backtrader, and we spent a couple of joint sessions implementing some tests I proposed to analyse DAX openings.
Backtrader proved to be slow and sometimes a burden, but it also helped to identify some early needs in the workflow. It is a good library if you want something ready-made and are willing to pay the performance and customization penalty. In my experience, if you are in the business you will end up wanting something more specific to your strategies.
With the information gained from those experiences, I managed to get everything fully implemented in Python using Jupyter, and the first strategy was delivered.
While it was a great joy to see the positive and growing equity curve, I noticed severe bottlenecks in the workflow. The simulation was too big for Jupyter. I did not find it natural to include Monte Carlo or parametric analysis in Jupyter, and it was really unclear how to implement the reports needed to cross-validate the results and ensure that no mistakes had been made during the programming phase.
Python was also, overall, very slow for the task, and Jupyter proved to be a good aid in small projects but a bit harder to follow in larger simulations.
Of all the issues, the main ones were clearly performance and the lack of structure in the code.
Calling for structured programming in 2019 might be seen as an anachronism, but in fact simulation programs share many roots with the structured programming of the 80s. Concepts such as object orientation or functional programming are, for these particular tasks, unnecessary layers of complexity, and what I found is that basic, vanilla, plain structured programming is a good approach to solving these types of problems.
I find it hard to properly structure code in Python; the package/module implementation is behind the level of simplicity that Python delivers in other areas. This is a personal opinion, though.
Debugging with Python was as easy as it has always been, and it proved to be a great improvement compared with the previous C solution.
Python + C
The next logical step was to replace the slow areas of Python with C. With the experience gained from my first attempt at a pure C solution, I defined a layer of routines that handled the simulation itself. Because this layer focused only on the core simulation, the routines were simpler and hence easier to debug and implement in C, while Python was in charge of orchestrating everything else.
Orchestrating the simulation workflow does not require speed, and Python provides a friendlier environment. The C routines were packed into a shared library (a DLL on Windows, a shared object on Unix), which was later invoked from Python using ctypes.
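A minimal sketch of that ctypes bridge (the library name, function name and signature below are hypothetical, not the ones from my project): the C side exposes a plain function over arrays, and Python declares the argument and return types before calling it.

```python
import ctypes
import numpy as np

# Load the compiled C library (a DLL on Windows, a .so on Unix).
lib = ctypes.CDLL("./libsimulation.so")

# Hypothetical C signature:
#   double run_backtest(const double *prices, long n_prices);
lib.run_backtest.argtypes = [ctypes.POINTER(ctypes.c_double), ctypes.c_long]
lib.run_backtest.restype = ctypes.c_double

# A contiguous NumPy buffer can be handed to C without copying.
prices = np.ascontiguousarray(np.random.rand(1_000_000), dtype=np.float64)

# The heavy loop runs entirely in C; Python only orchestrates.
result = lib.run_backtest(
    prices.ctypes.data_as(ctypes.POINTER(ctypes.c_double)),
    ctypes.c_long(prices.size),
)
print("backtest result:", result)
```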
This combination allowed C to shine, doing things like preloading all the data (30 million records) in 200-300 milliseconds through serialization, just to give an example of how fast C is. When serialization was not used, CSV reading was also really fast because fixed-length records were used (much faster than Python). At those speeds there is no need for any database; it is just a plain file and memory. The rest of the statistical analysis was also extremely fast, as C allows fast algorithms to be implemented through the use of pointers and arrays. The bottleneck of this setup was always Matplotlib on the Python side.
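The same "plain file plus memory" idea can also be sketched from the Python side with NumPy structured arrays (the record layout below is a made-up example): once records are fixed-length binary, loading becomes a single bulk read instead of a per-line parse.

```python
import numpy as np

# Hypothetical fixed-length record: epoch timestamp plus OHLC stored as
# integer ticks to avoid floating-point precision loss.
record = np.dtype([
    ("ts",    "int64"),
    ("open",  "int32"),
    ("high",  "int32"),
    ("low",   "int32"),
    ("close", "int32"),
])

# Writing: serialize the in-memory array straight to disk.
data = np.zeros(1_000_000, dtype=record)
data.tofile("prices.bin")

# Reading: one bulk read maps the bytes back into typed records,
# with no parsing at all.
loaded = np.fromfile("prices.bin", dtype=record)
print(loaded.shape, loaded.dtype)
```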
Pairing Python and C is, in my opinion, a win-win solution where you can get the best of both worlds.
Fortran
I decided to make an extra effort to test two additional languages: Fortran and Julia. The objective was to bring both the high-level simulation workflow code and the low-level backtesting routines together in the same language. Despite being far from a trendy language, Fortran was the first one I evaluated.
Fortran is one of the oldest programming languages still in service; it predates even COBOL, so it is arguably the oldest. While it was once the first choice for scientific and engineering analysis, it was largely replaced by Matlab during the 90s, but it has remained strong in the computationally demanding areas of science.
Fortran is now a niche language (currently around position 30 in the list of most used programming languages in the world), but it still has an active community in specialised fields such as high-energy physics, astrophysics and molecular dynamics. Together with C, it is also the main programming language for supercomputing. Standards such as OpenMP and MPI, which define ways to distribute computation across cores and distributed machines, are always implemented first for C and Fortran.
While modern Fortran has nothing to do with what most people imagine (upper-case code, goto instructions and punched cards), its syntax does show some age. It allows a good code structure and it is quite easy to learn (even to master). My experience while porting the last simulation to Fortran was that coding was easy and, specifically, that the algorithmic parts previously written in C were much easier to code in Fortran. The code structure was also superior to Python's, although this is a personal opinion and I know many would disagree with this statement.
There were also some problems: a harder integration with graphical tools and the fact that variables need to be declared in advance (something I do not like), but in my opinion the main issue was that debugging was hard because Fortran ultimately sits on top of the C compiler toolchain. So my experience is that debugging Fortran was a bit harder than debugging the Python+C solution.
On the positive side, Fortran also has some unique solutions for dealing with arrays and structured data, including custom index bounds (real :: prices(2000:2020) declares an array indexed from 2000 to 2020, something few mainstream languages offer), vector operations and simple ways to initialise variables and structures.
Fortran or C is the way to go if you need speed and plan to do analysis requiring multiple CPUs and/or multiple cores (the HFT industry uses C++ intensively), and it is no accident that both AMD and Intel keep compiler divisions selling both C++ and Fortran compilers. But if you do not need that much speed, it might be better to trade off some performance for a friendlier environment that is easier to debug and offers DataFrames.
Fortran's performance is astonishing. Fortran was able to read a whole year of prices from a CSV file, convert them to integers to avoid loss of precision, round them down to contract ticks and store them in memory in around one second. This means that a strategy using 20 years of 1-minute prices would be loaded into memory and ready to use in about 20 seconds. It simply outperforms, by far, any other language I tried.
Its native handling of arrays also makes calculations very fast. In many operations Fortran outperforms C. Believe it or not, 60 years later it is still the number one number cruncher, and its structure is well suited to simulation problems because it was designed with that kind of problem in mind.
Julia, the newcomer
The last language I tried was Julia. With the experience gained, and the promise of Julia to deliver the simplicity of Python and Pandas with the speed of C, trying Julia was a must.
I found the language easy to learn and properly structured. The syntax is clear, it is not verbose and it is legible. Modules allow you (once you understand them) to adequately separate the code, as they separate domains and variables. Modules also make it easy to share global variables.
Certain aspects of the language were a bit puzzling at the beginning. It took some time to understand what was mutable and what was not. Variables are not passed by reference or by value; instead, names are bound to particular objects. In that respect, if I understood the mechanism properly, it is similar to Python.
The language is flexible enough to let you enforce types (something I find useful in this domain, as it helps optimisation both in terms of memory use and performance), but for light usage you do not need to specify types if you do not want to. Timestamp handling is pretty easy too, something which is relevant in this industry.
Using Julia feels like being an early adopter of a new technology. The main fear is stability, and you also have to live with the knowledge that most people would choose Python.
Julia incorporates vector notation and DataFrames as if they were part of the language. In reality, DataFrames are not part of the language but are provided by a library (the same as Pandas in Python), yet the feeling is that they are part of the language, something I do not experience when I use Pandas.
Data loading was slightly slower than in other languages, but I was using CSVs and an auxiliary DataFrame to parse the input data. With direct conversion and fixed-length records, I am sure better performance can be achieved. The overall time required to perform the calculations is close to C, so in general you will not notice any difference.
Julia might not be as stable and mature as other languages. On my BSD workstation I had to download the latest version directly from the website, as the versions included by default in the distribution were not able to compile the packages properly.
As already mentioned, Julia is easy to code in and relatively easy to learn, once you overcome some not-so-intuitive aspects of the language such as mutability/immutability, variable scope and how modules work. Its strong points are that DataFrames and vector operations are far better integrated than in Python. Julia compiles code ahead of execution and has been designed to be fast. The package installation service is also better integrated than in Python.
Overall my feeling is that Julia was the right tool for this task. It is modern, easy to learn, it integrates the functionality for operating on vectors and DataFrames better, and its performance is far superior to Python's. It saves you from using C code (more difficult to debug) or Fortran (which might be seen as more exotic in the industry). The only concern is that, since it is not that mature, you might fear issues in larger projects or even computation errors. I have not found any, but I have not spent enough time with the language either.
Summary
The problems found in quantitative finance are largely similar to the ones found in scientific simulation.
Aspects such as workflow, vector operations, debugging, code clarity, presence of DataFrames and performance are the key elements that determine the suitability of a language for this task.
For the most demanding tasks, C and Fortran keep their position as mature solutions, but they are demanding in terms of skills. Fortran is probably not commonly used in this specific industry, but it fits well if you want to use it.
For lighter usage Python is the simplest approach, but it will be slow for certain tasks. The combination of Python and C can be a good trade-off between simplicity and performance when C is used to code the core simulation tasks.
Julia is an extremely promising, clean and fast language. While I did not spend enough time with it, I felt it could be the right tool for this particular job, with the lack of maturity being the biggest concern I would point out. Other than that, it incorporates all the modern functionality and strengths that are useful in quantitative finance, it is easy to learn and it is very fast.
It might not be the right choice for well-funded larger institutions, as it might be seen as a risky proposal, but on the other hand individuals and smaller organisations could gain a competitive advantage by being early adopters. My personal view is that it is worth exploring, as it can contribute significantly to building a production-grade strategy development and analysis workflow, something that is required in the long term if you want to be in the business.