For the CFM 2018 Challenge, I am trying to build a linear regression (baseline) model using all the data available on a given day (instead of only the volatilities/returns of the same asset).
But the score I get on the test set is far from my validation score (validation: 24%, test: >70%).
Is there any explanation of how the dataset was created? My first guess is that the test and training data may be many months apart, which would make such an approach overfit to the training period.
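
One way I could check this guess locally is to hold out whole days, the latest ones, for validation instead of splitting rows at random. Below is a minimal sketch under stated assumptions: the data is synthetic, the `days` array and the helper `chronological_day_split` are hypothetical names (not part of the challenge files), and it assumes the day identifiers sort in chronological order, which may not hold for the real dataset. If they do not, the split still keeps each day entirely on one side.

```python
import numpy as np

def chronological_day_split(days, valid_fraction=0.2):
    """Return boolean masks (train, valid) that hold out the latest days.

    Hypothetical helper: assumes larger day identifiers mean later days.
    """
    days = np.asarray(days)
    unique_days = np.unique(days)  # np.unique returns sorted values
    n_valid = max(1, int(round(len(unique_days) * valid_fraction)))
    valid_days = unique_days[-n_valid:]
    valid_mask = np.isin(days, valid_days)
    return ~valid_mask, valid_mask

# Synthetic stand-in data (NOT the challenge data): 50 days, 20 rows per day.
rng = np.random.default_rng(0)
days = np.repeat(np.arange(50), 20)
X = rng.normal(size=(days.size, 5))
true_coef = np.array([0.5, -0.2, 0.1, 0.0, 0.3])
y = X @ true_coef + rng.normal(scale=0.1, size=days.size)

train, valid = chronological_day_split(days, valid_fraction=0.2)

# Ordinary least squares with an intercept, fitted on the earlier days only.
A_train = np.column_stack([np.ones(train.sum()), X[train]])
coef, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)

# Evaluate on the held-out later days.
A_valid = np.column_stack([np.ones(valid.sum()), X[valid]])
valid_mae = np.mean(np.abs(A_valid @ coef - y[valid]))
print(f"train rows: {train.sum()}, valid rows: {valid.sum()}, valid MAE: {valid_mae:.3f}")
```

If the validation score from a split like this moves much closer to the test score than the score from a random row split does, that would support the idea that the gap comes from how the days are separated rather than from the model itself.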
*edit: Add context + better style