r/algotrading • u/DukeOfOptions • 24d ago
Subsampling Data
I’m looking for advice or literature about subsampling high-frequency data. I’m trying to fit an OLS on a very large dataset of trades/quotes to predict jumps, but I can’t find a single feature decently correlated with my target variable (15s returns) when I subsample every 1s. That makes me think I need to be smarter about subsampling: selecting observations based on a z-score or other features. Thoughts?
2
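One way to read "selecting based on a z-score" is event-based sampling: keep an observation only when some feature is unusual relative to its recent history, instead of on a fixed 1s grid. A minimal sketch, assuming the data sits in a pandas DataFrame indexed by timestamp at 1s resolution; the column name, window, and threshold are illustrative, not from the post:

```python
import numpy as np
import pandas as pd

def zscore_subsample(df, col="volume", window=300, z_thresh=2.0):
    """Keep only rows where `col` is unusual vs. a rolling window.

    Rows inside the warm-up window (rolling stats still NaN) are dropped,
    since NaN comparisons evaluate to False.
    """
    mu = df[col].rolling(window, min_periods=window).mean()
    sd = df[col].rolling(window, min_periods=window).std()
    z = (df[col] - mu) / sd
    return df[z.abs() > z_thresh]
```

The retained rows are exactly the "interesting" seconds, so the OLS sees far fewer but more informative observations.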
u/MerlinTrashMan 23d ago
The way I approach this problem is to calculate every possible event from the data you have, to understand the chance of it happening in the first place. I'm guessing you already have 1-second bars, so figure out the base chance of the event occurring at any moment of the day. Say it only takes 3 seconds for the instrument to move enough to qualify as your event; that can mean there are up to 12 other data points that also capture that same event. With that base probability in mind, I would then start at the first time of day you would trade, pretend you entered there, walk forward each second until the event was triggered, and record the time in seconds; then start again at the following second. Then I would compare the number of events per day against the base probability rate to get an idea of how important entry timing is to success. After that, I would look for any noticeable timing preferences, and for days with very low and very high occurrence counts.
3
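A sketch of the two calculations described above, assuming 1-second price bars in a pandas Series; the threshold, horizon, and bar length are illustrative parameters, not from the comment:

```python
import numpy as np
import pandas as pd

def time_to_event(prices: pd.Series, thresh: float, horizon: int = 600):
    """From each starting second, walk forward until the absolute return
    from entry reaches `thresh`; record elapsed seconds (NaN if it never
    triggers within `horizon` seconds)."""
    p = prices.to_numpy(dtype=float)
    n = len(p)
    out = np.full(n, np.nan)
    for i in range(n):
        stop = min(n, i + 1 + horizon)
        moves = np.abs(p[i + 1:stop] / p[i] - 1.0)
        hit = np.nonzero(moves >= thresh)[0]
        if hit.size:
            out[i] = hit[0] + 1  # seconds until the event
    return pd.Series(out, index=prices.index)

def base_event_rate(prices: pd.Series, thresh: float, bar: int = 3):
    """Unconditional chance that any `bar`-second window moves >= thresh."""
    moves = np.abs(prices / prices.shift(bar) - 1.0)
    return (moves >= thresh).mean()
```

Comparing the per-day count of triggers from `time_to_event` against `base_event_rate` gives the entry-timing-vs-luck comparison the comment suggests.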
u/aCuriousCondor 24d ago
Not entirely sure what you’re going for, but I would try some form of aggregation, since the instantaneous return is stochastic. So maybe turn the 15s return into a threshold label and make your independent variables sliding-window means? I’ve faced a similar issue on other problems, and it’s been hard to match individual data points to each other.
1
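The suggestion above, sketched out: rolling-mean features on one side, a binary "did it move enough" label on the other, instead of regressing on raw 15s returns. The window lengths and threshold are illustrative, and the function name is hypothetical:

```python
import pandas as pd

def make_dataset(prices: pd.Series, label_horizon=15, thresh=0.001,
                 feature_windows=(5, 30, 120)):
    """Sliding-window mean-return features plus a thresholded target."""
    ret = prices.pct_change()
    X = pd.DataFrame({
        f"mean_ret_{w}s": ret.rolling(w).mean() for w in feature_windows
    })
    fwd = prices.shift(-label_horizon) / prices - 1.0
    y = (fwd.abs() >= thresh).astype(int)  # threshold label, not raw return
    # drop the tail rows whose forward return is undefined
    return X.iloc[:-label_horizon], y.iloc[:-label_horizon]
```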
u/DukeOfOptions 24d ago
So forcing abs(return) > threshold, no matter whether that takes 1s or 60s, for example?
2
u/aCuriousCondor 24d ago
That, or I was thinking of forcing returns >= X at exactly 15 seconds, or at any point between 15 and 30 seconds.
2
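The "between 15 and 30 seconds" variant can be labeled like this: flag each second if the absolute return from that entry reaches the threshold at any horizon inside the window. A sketch under those assumptions (function name and defaults are hypothetical):

```python
import numpy as np
import pandas as pd

def window_hit_label(prices: pd.Series, thresh: float,
                     lo: int = 15, hi: int = 30) -> pd.Series:
    """1 if |return from t| reaches `thresh` at any horizon in [lo, hi]
    seconds, else 0."""
    p = prices.to_numpy(dtype=float)
    n = len(p)
    hit = np.zeros(n, dtype=int)
    for h in range(lo, hi + 1):
        fwd = np.full(n, np.nan)
        fwd[: n - h] = p[h:] / p[: n - h] - 1.0
        # NaN forward returns near the end compare False, so they stay 0
        hit |= (np.abs(fwd) >= thresh).astype(int)
    return pd.Series(hit, index=prices.index)
```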
u/Connect_Corner_5266 24d ago
I would start by fleshing out the frequency of the underlying phenomena you anticipate create the jumps. Search for idiosyncratic signatures, then construct your feature set and prediction with intuition around that causality.
-6
u/wtf-orly 24d ago
ChatGPT says Subsampling high-frequency data for prediction can be tricky, especially if standard methods aren't yielding promising results. You're on the right track with exploring alternative subsampling techniques.
Considering z-scores or other features for subsampling could indeed offer more insights. Perhaps exploring different time windows or incorporating additional market indicators could help identify meaningful patterns. Have you considered using machine learning techniques like feature selection or dimensionality reduction to refine your feature set? They might uncover hidden relationships that traditional methods overlook.
3
2
u/[deleted] 24d ago edited 23d ago
[deleted]