Baseline solution

Hi everyone,

This is the baseline solution we’re trying to beat:
https://colab.research.google.com/drive/1OfFkPA5wDPgrOxQBVZBlKUa8UHiBMSsZ

It’s a Google Colab notebook, so you can run it in your browser.

All the best.

Thank you for sharing with everybody!

The accuracies in this notebook are indeed very close to those of the benchmark. Playing a bit with the hyperparameters might boost the accuracy, as could switching to very different methods (SVMs, neural networks, …).
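
For instance, here is a minimal hyperparameter-search sketch with scikit-learn’s GridSearchCV; the model and parameter grid are purely illustrative, and X_train/y_train are assumed to come from a split like the one in the notebook:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative model and grid; the notebook's actual choices may differ
param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)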

Now, the testing in the notebook is done on dates that can also appear in the training set: the testing is therefore not out of sample, and the observed performance can be over-optimistic.

PS: I had initially mentioned stratified sampling, but what I had in mind was splitting on dates when defining the test set.

Thank you for the message. Unless I made a mistake, the testing in the notebook is done with the date column dropped:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_input.drop(['ID', 'eqt_code', 'date'], axis=1), df_output['is_positive'], test_size=0.2, random_state=42)

I’ll take a look at stratified sampling to see if I can take the dates into account.

That’s precisely the issue: even with the date column dropped, the testing set usually contains dates already seen in training. If you imagine, for instance, that each date had identical inputs for all stocks, then you would find the same data in both the training and testing sets (since the date and stock are removed), right? This would obviously yield a very optimistic performance estimate.
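
A quick way to check this (a minimal sketch, assuming df_input has a 'date' column and the same random split as in the snippet above):

from sklearn.model_selection import train_test_split

# Reproduce the row-level split on the index alone, then compare dates
train_idx, test_idx = train_test_split(df_input.index, test_size=0.2, random_state=42)
shared_dates = set(df_input.loc[train_idx, 'date']) & set(df_input.loc[test_idx, 'date'])
print(len(shared_dates))  # almost certainly > 0: many dates appear in both sets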

OK, I get it. I had trusted that train_test_split from scikit-learn was well designed and that the two sets were disjoint (to be checked).

The sets are disjoint, but they can contain the same dates. This means that the algorithms predict some stocks on a given day while knowing “in advance” what some other stocks do on that day. Obviously, in the extreme case where all the stocks on a given day had the exact same input data, it would be easy to make predictions for the test-set stocks on that day.

A more robust testing procedure would test on dates that are not in the training set.
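
For example, a minimal sketch using scikit-learn’s GroupShuffleSplit with the dates as groups, so that each date ends up entirely in one of the two sets (variable names follow the snippet above):

from sklearn.model_selection import GroupShuffleSplit

X = df_input.drop(['ID', 'eqt_code', 'date'], axis=1)
y = df_output['is_positive']

# Every row of a given date goes to either training or testing, never both
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=df_input['date']))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]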

I haven’t yet solved the problem raised above, but here are some results obtained with classic stacking:
https://colab.research.google.com/drive/1frboVbBri7JfDz4So2xUUZhMqU9m97XE

PS: I also added SVMs.
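
For reference, a minimal sketch of classic stacking with scikit-learn’s StackingClassifier; the base learners (a random forest and an SVM) are illustrative and may differ from the notebook’s exact setup:

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Out-of-fold predictions from the base learners feed a logistic-regression meta-model
estimators = [
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))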

I have edited the link in my first post: the training and testing sets now have different dates, and the performance was indeed over-optimistic (it is now below the benchmark, but not by much).
