Baseline solution

Hi everyone,

This is the baseline solution we’re trying to beat:
https://colab.research.google.com/drive/1OfFkPA5wDPgrOxQBVZBlKUa8UHiBMSsZ

It’s a Google Colab notebook, so you can run it in your browser.

All the best.

Thank you for sharing with everybody!

The accuracies in this notebook are indeed very close to those of the benchmark. Playing a bit with the hyperparameters might boost the accuracy, as could switching to very different methods (SVMs, neural networks, …).
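
For instance, here is a minimal hyperparameter-search sketch with scikit-learn’s GridSearchCV; the model and parameter grid are purely illustrative, and X_train/y_train are assumed to come from a split like the one in the notebook:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative model and grid; the notebook's actual choices may differ
param_grid = {'n_estimators': [100, 300], 'max_depth': [5, 10, None]}
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)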

Now, the testing in the notebook is done on dates that can also appear in the training set: the testing is therefore not out of sample, and the observed performance can be over-optimistic.

PS: I had initially mentioned stratified sampling, but what I had in mind was splitting on dates when defining the test set.

Thank you for the message. Unless I made a mistake, the testing in the notebook is done with the date column dropped:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_input.drop(['ID', 'eqt_code', 'date'], axis=1), df_output['is_positive'], test_size=0.2, random_state=42)

I’ll take a look at stratified sampling to see if I can take the dates into account.

That’s precisely the issue: even with the date column dropped, the testing set usually contains dates already seen in training. If you imagine, for instance, that each date had identical inputs for all stocks, then you would find the same data in both the training and testing sets (since the date and stock are removed), right? This would obviously yield a very optimistic performance estimate.
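
A quick way to check this (a minimal sketch, assuming df_input has a 'date' column and the same random split as in the snippet above):

from sklearn.model_selection import train_test_split

# Reproduce the row-level split on the index alone, then compare dates
train_idx, test_idx = train_test_split(df_input.index, test_size=0.2, random_state=42)
shared_dates = set(df_input.loc[train_idx, 'date']) & set(df_input.loc[test_idx, 'date'])
print(len(shared_dates))  # almost certainly > 0: many dates appear in both sets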

OK, I get it. I had trusted that train_test_split from scikit-learn was well designed and that the two sets were disjoint (to be checked).

The sets are disjoint, but they can contain the same dates. This means that the algorithms predict some stocks on a given day while knowing “in advance” what some other stocks do on that day. Obviously, in the extreme case where all the stocks on a given day had the exact same input data, it would be easy to make predictions for the test-set stocks on that day.

A more robust testing procedure would test on dates that are not in the training set.
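
For example, a minimal sketch using scikit-learn’s GroupShuffleSplit with the dates as groups, so that each date ends up entirely in one of the two sets (variable names follow the snippet above):

from sklearn.model_selection import GroupShuffleSplit

X = df_input.drop(['ID', 'eqt_code', 'date'], axis=1)
y = df_output['is_positive']

# Every row of a given date goes to either training or testing, never both
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(X, y, groups=df_input['date']))
X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]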

I haven’t yet solved the problem raised above, but here are some results obtained with classic stacking:
https://colab.research.google.com/drive/1frboVbBri7JfDz4So2xUUZhMqU9m97XE

PS: I also added SVMs.
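
For reference, a minimal sketch of classic stacking with scikit-learn’s StackingClassifier; the base learners (a random forest and an SVM) are illustrative and may differ from the notebook’s exact setup:

from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Out-of-fold predictions from the base learners feed a logistic-regression meta-model
estimators = [
    ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
    ('svm', make_pipeline(StandardScaler(), SVC(probability=True, random_state=42))),
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))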

I have edited the link in my first post: the training and testing sets now have different dates, and the performance was indeed over-optimistic (it is now below the benchmark, but not by much).
