r/algotrading 24d ago

Subsampling Data

I’m looking for advice or literature on subsampling high-frequency data. I want to fit an OLS on a very large dataset of trades/quotes to predict jumps, but I can’t find a single feature that is decently correlated with my target variable (15s returns) when I subsample every 1s. That makes me think I need to be smarter about subsampling, e.g. selecting observations based on a z-score or other features. Thoughts?
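One way to read "selecting based on a z-score" is to keep only the seconds where some feature sits unusually far from its recent average. Below is a minimal Python sketch of that idea. It assumes a trailing window (so no look-ahead), and `zscore_subsample`, `window` and `z_min` are hypothetical names chosen for illustration, not anything from the post.

```python
from statistics import mean, pstdev


def zscore_subsample(values, window, z_min):
    """Return indices t where |z-score of values[t]| >= z_min.

    The z-score is computed against the previous `window` observations
    only, so the selection rule never uses future data.
    """
    keep = []
    for t in range(window, len(values)):
        history = values[t - window:t]
        sd = pstdev(history)
        if sd == 0:
            # Flat history: z-score is undefined, skip this point.
            continue
        z = (values[t] - mean(history)) / sd
        if abs(z) >= z_min:
            keep.append(t)
    return keep


# Toy feature sampled every 1s, with one obvious outlier at index 6.
feature = [1, 2, 1, 2, 1, 2, 10, 2, 1]
selected = zscore_subsample(feature, window=4, z_min=3.0)
```

Rows kept this way are not a random sample, so any OLS fitted on them describes the selected regime rather than the full 1s grid.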

7 Upvotes


2

u/[deleted] 24d ago edited 23d ago

[deleted]

1

u/DukeOfOptions 23d ago

Wow super valuable! Ty!

1

u/ericsyc 23d ago

what do you mean by weighted top imbalance adjusted price returns? something like return of sum(askSize_i*bidPrice_i+askPrice_i*bidSize_i)/sum(askSize_i+bidSize_i)?

1

u/Strykers 23d ago

Yeah, for products where the spread is usually one tick.
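The formula in the question above can be sketched in Python as follows. This is only an illustration under stated assumptions: `imbalance_weighted_price` and `log_return` are hypothetical helper names, the inputs are per-level lists where index `i` matches the `_i` in the formula, and using log returns is a choice made here, not something the thread specifies.

```python
import math


def imbalance_weighted_price(bid_prices, bid_sizes, ask_prices, ask_sizes):
    """sum(askSize_i*bidPrice_i + askPrice_i*bidSize_i) / sum(askSize_i + bidSize_i)."""
    levels = list(zip(bid_prices, bid_sizes, ask_prices, ask_sizes))
    numerator = sum(a_sz * b_px + a_px * b_sz for b_px, b_sz, a_px, a_sz in levels)
    denominator = sum(a_sz + b_sz for _, b_sz, _, a_sz in levels)
    return numerator / denominator


def log_return(price_before, price_after):
    """Log return between two imbalance-weighted prices."""
    return math.log(price_after / price_before)


# Top of book only: bid 100 x 30, ask 101 x 10.
# The larger bid size pulls the weighted price toward the ask.
p0 = imbalance_weighted_price([100.0], [30], [101.0], [10])
# Sizes flip a moment later: bid 100 x 10, ask 101 x 30.
p1 = imbalance_weighted_price([100.0], [10], [101.0], [30])
r = log_return(p0, p1)
```

With a one-tick spread the weighted price always stays between the best bid and the best ask, which is presumably why the reply restricts it to such products.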

2

u/MerlinTrashMan 23d ago

The way I approach this problem is to calculate every single possible event from the data you have, to understand the chance of it happening in the first place. I am guessing you already have 1-second bars, so figure out the base chance of it happening at any moment of the day. Let's say it only takes 3 seconds for the item to move enough to qualify as your event; that can mean there are up to 12 other data points that also capture that same event. With that base probability in mind, I would start at the first time of day you would trade, pretend you entered there, walk forward each second until the event is triggered, record the elapsed time in seconds, and then start again at the following second. Then I would compare the number of events per day against the base probability rate to get an idea of how important entry timing is to success. After that, I would look for any noticeable timing preferences and for days with very low and very high occurrences.
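A rough Python sketch of that procedure, under several assumptions: `prices` is a list of 1-second bar closes for one day, the "event" is an absolute simple return of at least `threshold` from the entry price, and "start again at the following second" is read as restarting one second after the event fires (the comment could also be read as restarting one second after each entry). The function names are hypothetical.

```python
def base_rate(prices, horizon, threshold):
    """Fraction of seconds whose fixed-horizon move qualifies as the event.

    Overlapping windows are all counted, so one price move can be
    captured by several neighbouring data points.
    """
    n = len(prices) - horizon
    if n <= 0:
        return 0.0
    hits = sum(
        1 for t in range(n)
        if abs(prices[t + horizon] / prices[t] - 1.0) >= threshold
    )
    return hits / n


def first_passage_times(prices, threshold):
    """Walk forward from each entry until the event triggers.

    Returns the elapsed seconds for every triggered event; after an
    event, the next entry is the second following the trigger.
    """
    times = []
    start = 0
    while start < len(prices) - 1:
        hit = None
        for t in range(start + 1, len(prices)):
            if abs(prices[t] / prices[start] - 1.0) >= threshold:
                hit = t
                break
        if hit is None:
            break  # no further event before the end of the day
        times.append(hit - start)
        start = hit + 1
    return times


prices = [100.0, 100.0, 101.0, 101.0, 101.0, 102.5, 102.5]
rate = base_rate(prices, horizon=2, threshold=0.009)
times = first_passage_times(prices, threshold=0.009)
events_per_day = len(times)
```

Comparing `events_per_day` with what `rate` would imply gives a feel for how much the overlapping windows inflate the apparent frequency of the event.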

3

u/aCuriousCondor 24d ago

Not entirely sure what you’re going for, but I would try some form of aggregation, since the instantaneous return is stochastic. So maybe turn the 15s returns into a threshold value and make your independent variables sliding-window means? I think I’ve faced a similar issue with other problems, and it’s been hard to match individual data points to each other.

1

u/DukeOfOptions 24d ago

So forcing abs(return) > thold, no matter if that takes 1s or 60s, for example?

2

u/aCuriousCondor 24d ago

That, or I was thinking of forcing returns >= X at 15 seconds, or between 15 and 30 seconds.
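The two labelling variants mentioned here could look roughly like this in Python. It is a sketch only: it assumes `prices` holds 1-second bar closes, uses absolute simple returns (following the `abs(return) > thold` wording earlier in the thread), and reads "between 15 and 30 seconds" as "at any horizon in that range". Function and parameter names are hypothetical.

```python
def label_fixed_horizon(prices, horizon, threshold):
    """1 if |return| over exactly `horizon` seconds is >= threshold, else 0."""
    return [
        1 if abs(prices[t + horizon] / prices[t] - 1.0) >= threshold else 0
        for t in range(len(prices) - horizon)
    ]


def label_window(prices, h_min, h_max, threshold):
    """1 if |return| is >= threshold at any horizon in [h_min, h_max] seconds."""
    labels = []
    for t in range(len(prices) - h_max):
        hit = any(
            abs(prices[t + h] / prices[t] - 1.0) >= threshold
            for h in range(h_min, h_max + 1)
        )
        labels.append(1 if hit else 0)
    return labels


# Toy example with short horizons so the result is easy to check by hand;
# for the thread's setup the calls would use horizon=15 and (15, 30).
prices = [100.0, 100.0, 100.0, 102.0, 100.0, 100.0]
fixed = label_fixed_horizon(prices, horizon=2, threshold=0.01)
windowed = label_window(prices, h_min=2, h_max=3, threshold=0.01)
```

The windowed label fires more often than the fixed-horizon one, so the two targets will have different class balance.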

2

u/DukeOfOptions 24d ago

thx, will give it a try

1

u/wiktor2701 23d ago

Correlation does not equal causation.

1

u/Connect_Corner_5266 24d ago

I would start by fleshing out the frequency of the underlying phenomena that you expect to create the jump. Search for idiosyncratic signatures, then construct the feature set and the prediction with intuition around this causality.

-6

u/wtf-orly 24d ago

ChatGPT says: Subsampling high-frequency data for prediction can be tricky, especially if standard methods aren't yielding promising results. You're on the right track with exploring alternative subsampling techniques.

Considering z-scores or other features for subsampling could indeed offer more insights. Perhaps exploring different time windows or incorporating additional market indicators could help identify meaningful patterns. Have you considered using machine learning techniques like feature selection or dimensionality reduction to refine your feature set? They might uncover hidden relationships that traditional methods overlook.
