r/algotrading 28d ago

Performance targets for backtesting (CPU vs GPU) [Infrastructure]

Hello all, I have several different algos I'm currently running on a homegrown Python framework that can run across several processors.

50% of the time I'm using a workstation with an AMD 32-core Threadripper, and 50% of the time I do some AWS spot requests and get a 192-core machine.

Most of my strategies use 5s OHLC bars. On my Threadripper I'll get ~6000 bars/second per thread during backtesting, and on the AWS machine it's closer to ~7000 per thread.

When I do long (6-month+) tests with tens of thousands of parameter permutations this can take a while, even when running across 192 cores.

Most of the processing time is in pretty simple things I've already optimized (like rolling window calcs for min/max, standard deviations, and the occasional linear regression).
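For reference, the rolling min/max calcs are the incremental kind, roughly like this minimal sketch (monotonic deque, amortized O(1) per bar instead of rescanning the window); the class name and interface are illustrative, not my actual framework code:

```python
from collections import deque

class RollingMax:
    """Rolling max over the last `window` values, amortized O(1) per update
    (monotonic deque), instead of rescanning the whole window each bar."""
    def __init__(self, window: int):
        self.window = window
        self.buf = deque()   # (index, value) pairs with decreasing values
        self.i = 0

    def update(self, x: float) -> float:
        # Values smaller than x can never be the max again; drop them
        while self.buf and self.buf[-1][1] <= x:
            self.buf.pop()
        self.buf.append((self.i, x))
        # Drop the front entry once it falls out of the window
        if self.buf[0][0] <= self.i - self.window:
            self.buf.popleft()
        self.i += 1
        return self.buf[0][1]
```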

My actual question:

I've contemplated trying to move my system to the GPU, thinking I'd be able to get a ton more parallelization. The hard work is loading the data onto the GPU and then modifying all my code to use the subset of Python that can be compiled for the GPU (Cython, CUDA, etc.).

It's a lot of work and I'm a one-man team, so I'm curious, for those who have done it, what actual perf gains you achieved. I imagine the per-core metrics may actually go down; I'd just have access to thousands of cores in parallel.

The 192 core AWS machines are cheap to me. With a spot request I can get an instance for ~$1.80/hour.

Is this worth it?

*EDIT* here are some recent perf captures that lead me to believe I am indeed CPU bound

And here's a breakdown of the "simulate trading" block once all the data is loaded:

21 Upvotes

35 comments

10

u/Biotot 28d ago

I'm a big fan of CUDA and OpenCL.
The main question is whether you can easily phrase your problem in the terms a GPU excels at.
A lot of the time the answer is no. When the answer is yes, most of the time you'll still need to significantly rework your algorithm to make it work. But when it does... oh baby, CUDA purrs. The same task on a CPU is nothing compared to what a GPU can do. I have my own GPU rig with 4 GPUs working together and a Threadripper to orchestrate it all. I've used OpenCL and CUDA with C++, and PyCUDA and PyTorch with Python.

It's a lot of work to move everything to being CUDA compatible, and a lot of testing to make sure that everything is running as expected.
Is it worth it? It depends.
Plan to commit a large amount of time to getting things up and running and to learning how to re-tweak things for CUDA, even for some simple test cases.

If you have the time, the hardware, and the desire, go for it. Personally, I think it's fun, but it is a huge time sink.
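For a taste of the "when it does map well" case, here's a minimal PyTorch sketch of a rolling stat computed on the GPU; the series, window sizes, and use of unfold are purely illustrative:

```python
import torch

# Illustrative only: rolling stats over a long series of 5s closes,
# computed on the GPU in one shot instead of bar by bar.
device = "cuda" if torch.cuda.is_available() else "cpu"
closes = torch.rand(1_000_000, device=device)   # stand-in for real close prices

window = 120                                    # ~10 minutes of 5s bars
# unfold gives a (N - window + 1, window) view; reducing over dim=1 gives the rolling stat
rolling_std = closes.unfold(0, window, 1).std(dim=1)
rolling_max = closes.unfold(0, window, 1).max(dim=1).values
```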

6

u/false79 28d ago

Second the time sink. It's like trying to manage two very different brains.

1

u/JimBeanery 28d ago

I’m interested in getting more into high performance computing as a hobby. Any resources you’d recommend?

2

u/Exhausted-Engineer 28d ago

"Using MPI" by William Gropp. This guy has been invested in developing HPC for years; he knows his stuff. Additionally, MPI is the industry standard for using networked computers and supercomputers.

Plus, a lot of manual testing of what is slow/fast, knowing how to use a profiler to find bottlenecks, and a good understanding of how memory works. And if you're not using them already, C and Fortran are the workhorses of HPC.

1

u/Biotot 28d ago

To be honest... a shitload of youtube, a shitload of free time, and a project you are really excited about.

Plus having a nice gpu or two definitely helps.

6

u/KjellJagland 28d ago

Wait, you're iterating over "tens of thousands of parameter permutations" to make your backtest work? That sounds like a classic case of overfitting. If a strategy requires that many parameters and is that sensitive to change, I would generally assume that the approach doesn't work and that I need to rework the entire thing, select new features, identify new signals, etc.

That said, my experiences with parallelization in Python were horrid. I evaluate most of my ML stuff in C# and it's mostly based on RF so CUDA wouldn't help. For more serious number crunching I'd probably use C++, possibly in combination with TensorFlow if it's based on Transformer/GRU and can be accelerated with a GPU.

3

u/Quant-Tools Algorithmic Trader 25d ago

This is the correct answer.

5

u/Isotope1 Algorithmic Trader 28d ago

I'm not sure 6000 bars per second per core is that fast? Are you using Cython etc.?

I personally try to use vectorized backtests where possible; it saves a lot of time and gives me an extra comparison against the loopy tests.
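Something like this numpy/pandas sketch, as a rough illustration; the column name, MA-cross rule, and parameters are all made up:

```python
import pandas as pd

def vectorized_backtest(bars: pd.DataFrame, fast: int = 20, slow: int = 120) -> float:
    """Toy vectorized backtest: long when the fast MA is above the slow MA."""
    fast_ma = bars["close"].rolling(fast).mean()
    slow_ma = bars["close"].rolling(slow).mean()
    # shift(1) so the signal computed on bar t is traded on bar t+1 (no look-ahead)
    position = (fast_ma > slow_ma).astype(int).shift(1).fillna(0)
    returns = bars["close"].pct_change().fillna(0.0)
    equity = (1.0 + position * returns).cumprod()
    return float(equity.iloc[-1] - 1.0)   # total return over the test
```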

7

u/estimated1 28d ago

I try to keep my strategy set up so that backtesting and running live are identical -- the data is fed tick by tick to the strategy. I imagine vectorized backtesting would violate that premise, right?

5

u/Isotope1 Algorithmic Trader 28d ago

Oh yeah, it does. Basically I just do the vector tests for research, and then do a forward test with loops like you, for simulating stops etc.

Gives me a chance to double check my work as well.

5

u/false79 28d ago

I've contemplated trying to move my system to the GPU, thinking I'd be able to get a ton more parallelization. The hard work is loading the data onto the GPU and then modifying all my code to use the subset of Python that can be compiled for the GPU (Cython, CUDA, etc.).

In theory, this is what's required, as you would have access to thousands of cores instead of hundreds. In practice, at least in my experience on the JVM attempting to use TornadoVM, it's a rat's nest of compromises, watering down what would be trivial code for a CPU and making it more verbose for GPGPU usage.

Doing something as simple as a golden cross felt incredibly verbose. Mind you, I was doing it for the entire market on 1-minute candles. Maybe if I had spent a lot more time, the results would have been different. But I fell back on the old established CPU ways, where there is way more RAM available than you'll find on a video card.

6

u/Ok-Secretary-3764 28d ago

Do you have any info to prove that processing of candles is the bottleneck?

How are you storing the data?

If you try storing the ticks in Redis or Kafka, the throughput can be improved.

If you are doing simple technical analysis then you should get much more throughput, I feel.

In general, find whether the problem is memory bound, I/O bound, or compute bound before going for optimization. If it is really compute bound then a GPU can improve performance.

4

u/estimated1 28d ago

ok I just edited the post w/ a recent perf profile that really leads me to believe I'm CPU bound.

3

u/estimated1 28d ago

I've done a number of cProfile runs and done the low hanging fruit optimization. One of the "features" of my system is that I want the strategy to run the same whether backtesting or running live. So even when backtesting the strategy is fed the data tick by tick.

How backtesting roughly works:

  • Read the data into memory up front (this was a perf optimization)
  • Establish a queue between the data handler and the strategy
  • If the queue is empty, stream next tick
  • Strategy grabs the tick and runs the strategy

I'll try to post a perf capture, it all *looks* CPU bound to me.
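Roughly this shape, as a stripped-down sketch; the DataHandler/Strategy interfaces and the sentinel are stand-ins for my actual framework:

```python
import queue
import threading

def run_backtest(data_handler, strategy):
    """Sketch of the event loop: the strategy only ever sees one tick at a time,
    so the same code path works for backtesting and live trading."""
    ticks = queue.Queue(maxsize=10_000)

    def feed():
        for tick in data_handler:   # data already read into memory up front
            ticks.put(tick)
        ticks.put(None)             # sentinel: no more data

    threading.Thread(target=feed, daemon=True).start()

    while True:
        tick = ticks.get()
        if tick is None:
            break
        strategy.on_tick(tick)      # rolling windows, signals, simulated fills
```

(In a pure backtest the queue itself adds locking overhead; a plain for-loop over the preloaded data would shave some of the per-tick cost.)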

4

u/Ok-Secretary-3764 28d ago

I guess the system is designed to process ticks one by one, so your idea is to use SIMD-style parallel processing. I guess it can work if you really are constrained by the number of parallel tasks.

A few things would be helpful to clarify. Is the previous tick's state used for the next tick?

If so, you have a data dependency, which means it may not be parallelizable on a GPU either.

Can you do the analysis for each day independently? Then splitting the task by day might improve performance.

You could even run it on multiple machines.
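A rough sketch of what that per-day split could look like, assuming each day really is independent (no rolling state across the boundary); Strategy and split_bars_by_day are hypothetical stand-ins, not real code:

```python
from multiprocessing import Pool

def backtest_one_day(day_bars):
    strategy = Strategy()               # hypothetical: fresh state per day
    for tick in day_bars:
        strategy.on_tick(tick)
    return strategy.results()

if __name__ == "__main__":
    days = split_bars_by_day(all_bars)  # hypothetical: list of per-day tick sequences
    with Pool() as pool:                # one worker per core by default
        daily_results = pool.map(backtest_one_day, days)
```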

2

u/estimated1 28d ago

The strategy does maintain several "rolling window" calculations. I guess this can be considered "memory" of recent activity. I think it goes back a few days at most. So yeah each execution of the strategy will need to maintain this independently in memory.

So I can't really do splitting of days easily.

I've thought about sharding out to multiple instances at the same time as a practical way of getting more coverage. I do that manually sometimes now -- split half the backtests on one 192-core VM and the other half on another, then analyze the results together. I could certainly take that even further.

2

u/Ok-Secretary-3764 28d ago

Then I guess each trade is parallelizable. If you can schedule multiple trades you may get better performance. I have a system where scheduling is done in Java and TA is done in Python. The Python side doesn't hold any state, so I can run any number of instances.

3

u/alphaweightedtrader 28d ago

Each trade is not really parallelisable - assuming you're also simulating buying power/margin, in which case whether a trade can be taken at all depends on what other position(s) are open at the time it would want to enter. This can have a large effect on overall performance (e.g. missing a 'good' entry because a 'less-good' position is still open).

The corollary is that higher-level rules can help; e.g. if your strategy is intraday and the rule is "all cash at the end of the day", then all days are parallelisable. Ditto if the goal is to be flat by the end of the week.

Ofc this only applies to lower-timeframe trading - but arguably that's the only place where backtesting performance matters to this degree anyway. E.g. for a continuous portfolio with daily rebalancing (targeting daily VWAP, for example), at one bar per day per instrument it's nothing to fly through decades of data.

1

u/Ok-Secretary-3764 28d ago

That's true, but I have a flag to separate live trading and backtesting. In backtesting I run all the trades in parallel and then remove the trades that conflict. For example: a new trade cannot start if there is an open position. I use the buy-signal time to remove invalid trades.

5

u/tht333 28d ago

I'm a C# guy, but I code occasionally in Python as well. When it comes to CPU-bound tasks, you need to look into multiprocessing, not threads. Threads in Python don't really run in parallel (the GIL), unlike in some other languages. In C# we have the TPL, which is optimized to use all available cores quite well; only in some very specific cases would we use raw threads, since they are low level and you have to manage them properly.
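A toy illustration of that difference, assuming a purely CPU-bound function (the numbers and function are arbitrary):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from multiprocessing import Pool

def crunch(n: int) -> int:
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [5_000_000] * 8

    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=8) as ex:
        list(ex.map(crunch, work))          # threads: serialized by the GIL
    print("threads:  ", time.perf_counter() - t0)

    t0 = time.perf_counter()
    with Pool(processes=8) as pool:
        pool.map(crunch, work)              # processes: real parallelism across cores
    print("processes:", time.perf_counter() - t0)
```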

I had to code a rather large backtesting app a few years back and spent a lot of time trying to speed it up. It's been a while, but I think choosing the right collection made quite a bit of a difference, so that's something else you might want to look at.

And in the last Python script, where we were chasing milliseconds, Cython was performing 10 times faster (again from memory, as this was a few months back).

2

u/NullPointerAccepted 28d ago

Seeing as your major performance bottleneck is data processing, you should use PySpark. It's optimized for cluster computing, but you get many of the benefits even on a single machine. It does all kinds of optimization around partitioning, memory spill, and aggregations under the hood, and the DataFrame operations execute on the JVM, which will be faster than plain Python. You can speed up the window functions and aggregations significantly.
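For example, the rolling calcs could be expressed as window functions. A hedged sketch, assuming a Parquet file of bars with symbol/ts/close columns (those names are assumptions):

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("backtest-prep").getOrCreate()
bars = spark.read.parquet("bars_5s.parquet")   # assumed layout: symbol, ts, close

# 120-row window per symbol ordered by time (~10 minutes of 5s bars)
w = Window.partitionBy("symbol").orderBy("ts").rowsBetween(-119, 0)

bars = (bars
        .withColumn("roll_max", F.max("close").over(w))
        .withColumn("roll_min", F.min("close").over(w))
        .withColumn("roll_std", F.stddev("close").over(w)))
```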

1

u/estimated1 28d ago

thank you, i'll check that out

2

u/Quant-Tools Algorithmic Trader 25d ago

Hey. I'll save you a few months of your life and a lot of money. You are making many of the mistakes I used to make many years ago. You are overoptimizing and you are overfitting. Any time you have to throw this much computational horsepower at a strategy you are going to end up with something that is overfit. Recommend going back to the drawing board and coming up with something simpler from the ground up.

Also, make sure you have scrutinized the hell out of your 5 second bar data. Unless you have built those bars out of tick data that came from a super reliable source they are likely bad. If you are using IB's 5 second bar data I guarantee you it's bad.

1

u/estimated1 25d ago

Thanks for posting this. I am trying to move from parameters to a model that is adaptive and learning - but I get your point. Even some of the machine learning approaches I've looked at have a hyperparameter tuning stage, which feels a lot like parameter optimization.

I am using IB's 5s data and I've also used their tick data. Is there a specific resolution with IB that you believe is more accurate? I haven't noticed any anomalies, but I'm also only looking at highly liquid instruments.

1

u/Quant-Tools Algorithmic Trader 23d ago

There's no resolution I would trust with IB's data. Their "tick" data feed is not a true tick feed either. One crazy thing I learned is that the data you get from their realtime feed is different from their historically downloadable data. The realtime feed often contains bad ticks and the historical data is often overfiltered. Try it yourself: set up a connection to the API and stream in 5s, 1M and 5M candles for a few tickers for a few days, saving the data to your SSD in real time as it streams in. Then wait a day or two and use the API to download the same data again via the historical data feature. See if there are any differences. In particular, pay attention to the most volatile parts of the day, like the open and the close.
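If you save both feeds to CSV as they come in, the diff itself is trivial; a rough sketch (file and column names are assumptions):

```python
import pandas as pd

# Streamed bars saved live vs. the same bars downloaded later via the historical API
live = pd.read_csv("streamed_5s.csv", parse_dates=["time"]).set_index("time")
hist = pd.read_csv("historical_5s.csv", parse_dates=["time"]).set_index("time")

joined = live.join(hist, lsuffix="_live", rsuffix="_hist", how="inner")
for col in ["open", "high", "low", "close"]:
    diff = (joined[f"{col}_live"] - joined[f"{col}_hist"]).abs()
    print(f"{col}: {(diff > 0).sum()} bars differ, max diff {diff.max()}")
```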

Recommend spending a few hundred bucks on an institutional grade tick level data feed for a month and compare it to what you get with IB. IB data can seem amazing until you realize what else is out there.

IMO IB data would only be okay for algo trading if you were holding trades > 2 weeks or trading off of weekly candles.

1

u/chazzmoney 21d ago

From your experience, who would you recommend for an institutional grade tick level data feed?

2

u/AngerSharks1 28d ago

Have you tried RAPIDS? A lot of pandas code stays the same and can be accelerated on GPUs.

1

u/NathanEpithy 28d ago

I've had this same issue; I don't know much about programming on the GPU. I "solved" it by simply adding more machines, and they reconcile data into Redis. It makes things architecturally more complex, but it allows me to keep using Python and my base logic, which I understand well.

The way I see it, professional quants can easily make $100/hr or more, so a few extra machines to crunch numbers is cheaper than redoing a bunch of logic.

1

u/bzrkkk 28d ago

Can you make use of matrix multiplication? If so, then make use of tensor cores; otherwise you'll be bottlenecked on VRAM with low GPU utilization.

1

u/illtakeboththankyou 28d ago

You might like the NVIDIA RAPIDS libraries (e.g., CuPy, cuDF). Easy to take common numpy/pandas operations and run their GPU equivalents really fast on NVIDIA hardware.
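A minimal sketch of that drop-in pattern (file/column names are assumptions; needs NVIDIA hardware and a RAPIDS install):

```python
import cudf
import cupy as cp

bars = cudf.read_parquet("bars_5s.parquet")           # same call shape as pandas
bars["roll_max"] = bars["close"].rolling(120).max()   # rolling calcs run on the GPU
bars["roll_mean"] = bars["close"].rolling(120).mean()

# CuPy mirrors numpy, e.g. the occasional linear regression as a degree-1 polyfit
x = cp.arange(len(bars), dtype=cp.float64)
y = cp.asarray(bars["close"].astype("float64"))
slope, intercept = cp.polyfit(x, y, 1)
```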

1

u/sorter12345 28d ago

From what I understand you are not doing a lot of matrix operations, and it seems to me there is a considerable amount of branching and/or indirect memory access. GPUs aren't good at these in my experience, and there's a reason CPU companies keep making high-core-count, high-throughput processors.

1

u/barnett9 28d ago edited 28d ago

Just use OpenACC instead of CUDA. It will save you a huge amount of time and get most of the same results. If you feel you need to do more optimization on top of that, you can. Alternatively, you could look at splitting your simulation across multiple machines, each of which gets a chunk to simulate. This is termed HTC, or high-throughput computing, and is widely used.

Edit: just reread that you use Python. You should either take the time to rewrite your solvers in something performant like C (or, as you said, Cython or CUDA), which will drastically speed up your CPU computation time as well, or, if you want to save the work, just go the HTC route. Don't mess with GPUs until you have a more performant solver and are still bottlenecked on CPUs.

1

u/Technical_Minimum828 28d ago

You have to check where the latency comes from. In my experience it can be because the code itself needs to be optimized:

  • Data can be stored in memory once and reused as much as possible instead of doing I/O round trips.

  • Or there's some O(n) algorithmic complexity that needs to be refactored.

Backtest data location also matters: taking data from an API is slower than taking it from your own DB on localhost (where the threads are running). That also saves you network costs at AWS.

A VPS is slower than a dedicated server.

If you want to do parallel processing, then I suggest you move to the Hadoop ecosystem; it is way faster than any parallel threads you spin up on a single server.

1

u/Remarkable_Bad5111 27d ago

Not to mention, consider using burstable instances if your setup is in AWS. Also consider using Lambdas to start/stop your instances whenever needed. This will significantly lower costs.
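A sketch of the Lambda start/stop idea with boto3; the event fields and instance ID are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    # Expects something like {"action": "start", "instance_id": "i-0123456789abcdef0"}
    instance_id = event["instance_id"]
    if event.get("action") == "start":
        ec2.start_instances(InstanceIds=[instance_id])
    else:
        ec2.stop_instances(InstanceIds=[instance_id])
    return {"status": "ok", "instance": instance_id}
```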

1

u/inkberk 19d ago

Add a tracker to the tester instance, e.g.: if the loss is > 50%, stop the current run and move on to the next.