[2018] Wide Format Regression

v01dXYZ · January 15, 2020, 9:45am

Hello,
For the CFM 2018 Challenge, I am trying to build a linear regression (baseline) model using all the data available on a given day (instead of only the volatilities/returns of the same asset).
But the score I get on the test set is far from my validation score (valid: 24%, test: >70%).
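
For readers unfamiliar with the idea, a wide-format regression of this kind could be sketched as below. This is only a minimal illustration on random toy data, not the actual challenge pipeline: the column names `day`, `asset`, `vol` and `target` are assumptions, and plain NumPy least squares stands in for whatever regression library was actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical long-format data: one row per (day, asset).
# Column names and values are made up for illustration.
rng = np.random.default_rng(0)
long_df = pd.DataFrame({
    "day": np.repeat(np.arange(50), 3),
    "asset": np.tile(["A", "B", "C"], 50),
    "vol": rng.normal(size=150),
    "target": rng.normal(size=150),
})

# Wide format: one row per day, one column per asset.
X = long_df.pivot(index="day", columns="asset", values="vol")
Y = long_df.pivot(index="day", columns="asset", values="target")

# Ordinary least squares with an intercept: each asset's target is
# regressed on the features of every asset observed that day.
X1 = np.column_stack([np.ones(len(X)), X.to_numpy()])
coef, *_ = np.linalg.lstsq(X1, Y.to_numpy(), rcond=None)

# Predictions keep the same (day x asset) layout as the targets.
pred = pd.DataFrame(X1 @ coef, index=Y.index, columns=Y.columns)
```

With many assets the number of coefficients grows quickly, so a model like this can fit the training days much better than unseen ones.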

Is there any explanation of how the dataset was created? My first guess is that the test and train data are many months apart, which would make such an approach overfit to the training period.

*edit: Add context + better style

v01dXYZ · January 21, 2020, 10:14pm

My bad, it was my own mistake. I noticed that when I shuffled my predictions dataframe (using a mere .sample(frac=1), i.e. keeping the same ID for each prediction), the test scores were quite different. Finally, I put the predictions in the same order as the test dataframe and the score dropped to 24%.
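
The fix described above can be sketched as follows. The observation suggests that predictions are matched to the test set by row position rather than by ID, although that is an inference and not something the challenge documentation is quoted as saying here. The dataframes and the `pred` column are toy assumptions; only the `ID` column and `.sample(frac=1)` come from the post.

```python
import pandas as pd

# Hypothetical test set and predictions sharing an "ID" column.
test = pd.DataFrame({"ID": [103, 101, 104, 102]})
preds = pd.DataFrame({"ID": test["ID"], "pred": [0.3, 0.1, 0.4, 0.2]})

# Shuffling keeps each (ID, pred) pair intact but changes the row order.
shuffled = preds.sample(frac=1, random_state=0)

# A left merge on ID preserves the row order of the left frame (the test set),
# so the predictions come back in the order the test dataframe uses.
aligned = test[["ID"]].merge(shuffled, on="ID", how="left", validate="one_to_one")

# Sanity check before writing the submission file.
assert aligned["ID"].tolist() == test["ID"].tolist()
```

Running a check like the final assertion before each submission would catch this kind of ordering slip early.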
