[2018] Wide Format Regression

Hello,
For the CFM 2018 Challenge, I am trying to build a linear regression (baseline) model using all the data available on a given day (instead of only the volatilities/returns of the same asset).
However, the score I get on the test set is far from my validation score (validation: 24%, test: >70%).

Is there any explanation of how the dataset was created? My first guess is that the train and test data may be many months apart, which would make such an approach overfit the training period.

*edit: Add context + better style

My bad, the problem was on my side. I noticed that when I shuffled my predictions dataframe (using a mere .sample(frac=1), i.e. keeping the same ID for each prediction), the test score was quite different. Once I put the predictions in the same order as the test dataframe, the score dropped to 24%.
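
For anyone hitting the same issue, here is a minimal sketch of the reordering step. It assumes the submission is scored row by row in the order of the test file, and that both dataframes share an identifier column. The column names `ID` and `TARGET` and the values are hypothetical, not taken from the challenge data.

```python
import pandas as pd

# Hypothetical test set and predictions; "ID" and "TARGET" are assumed names.
test = pd.DataFrame({"ID": [103, 101, 102]})
preds = pd.DataFrame({"ID": [101, 102, 103], "TARGET": [0.1, 0.2, 0.3]})

# Shuffling keeps each ID attached to its prediction but changes the row order.
shuffled = preds.sample(frac=1, random_state=0)

# Restore the row order of the test dataframe by looking up each test ID.
aligned = shuffled.set_index("ID").loc[test["ID"]].reset_index()

# Sanity check before submitting: IDs must appear in the same order as in test.
assert aligned["ID"].tolist() == test["ID"].tolist()
```

The final assert is a cheap guard: if the row order of the predictions ever drifts from the test file, it fails before anything is submitted.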
