Sharing some ideas

Hi,

I’m starting this thread with the hope of motivating some participants to share fresh ideas :slight_smile:

My current solution is based on the benchmark described by the hosts (linear regression followed by boosted trees trained on the residuals), plus feature engineering. I’ve dropped most of the original columns, ending up with about 42 features for my best model (public leaderboard 0.4181).
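For reference, a minimal sketch of that two-stage setup (LightGBM and the column names here are assumptions on my part, not necessarily the hosts' exact benchmark):

import lightgbm as lgb
from sklearn.linear_model import LinearRegression

def fit_benchmark(train_df, feature_cols, target_col="target"):
    X, y = train_df[feature_cols], train_df[target_col]

    # Stage 1: linear regression on the raw target.
    linreg = LinearRegression().fit(X, y)

    # Stage 2: boosted trees trained on the residuals of stage 1.
    gbm = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
    gbm.fit(X, y - linreg.predict(X))

    def predict(df):
        # Final prediction = linear part + tree correction.
        return linreg.predict(df[feature_cols]) + gbm.predict(df[feature_cols])

    return predict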

I’ve had some success computing averages of past values, but surprisingly rolling windows, shifts and lags have not worked so far (a sketch of what I mean is below). Before working on solutions other than trees, I wanted to look at periodicity.
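This is the kind of lag/rolling feature I mean (the stock/day/volume column names are placeholders, not the actual dataset's):

import pandas as pd

def add_past_features(df, group_col="stock_id", time_col="day", value_col="volume"):
    df = df.sort_values([group_col, time_col]).copy()
    g = df.groupby(group_col)[value_col]
    df["lag_1"] = g.shift(1)                                                   # yesterday's value
    df["roll_mean_20"] = g.transform(lambda s: s.shift(1).rolling(20).mean())  # past ~1 month
    df["past_mean"] = g.transform(lambda s: s.shift(1).expanding().mean())     # all past values
    return df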

Here is the mean target of the 900 stocks:

It looks like there is some periodicity.

The vertical red lines mark ~20 days and ~60 days, so they represent months and quarters, since there are about 20–22 trading days per month. It seems that ends of months and ends of quarters are prone to high volumes.

The challenge would be to translate this into a feature (my only rough idea so far is sketched below). Any thoughts?
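The only rough translation I can think of so far is something like this (the day counter column and the 21-day month are both assumptions, and without real dates the flags can only be approximate):

TRADING_DAYS_PER_MONTH = 21  # assumption: ~20-22 trading days per month in practice

def add_period_end_flags(df, day_col="day"):
    # Position of each day inside an approximate month / quarter cycle.
    pos_in_month = df[day_col] % TRADING_DAYS_PER_MONTH
    pos_in_quarter = df[day_col] % (3 * TRADING_DAYS_PER_MONTH)
    df["near_month_end"] = (pos_in_month >= TRADING_DAYS_PER_MONTH - 2).astype(int)
    df["near_quarter_end"] = (pos_in_quarter >= 3 * TRADING_DAYS_PER_MONTH - 2).astype(int)
    return df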

Alex


It also seems that while the average auction volume of the stocks increases through time, their standard deviation decreases.


So what does it mean?
In the beginning only a few stocks were affected by the rise, while at the end a majority of stocks are affected. This is just a supposition, and maybe the hosts can give us some details about this supposed phenomenon?
Thanks

PS: it looks like there is a yearly trend in this graph!
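For reference, this kind of mean/std-per-day view can be produced with a simple groupby (the column names here are my own and may differ from yours):

import matplotlib.pyplot as plt

def plot_cross_section(df, day_col="day", target_col="target"):
    # Cross-sectional mean and std of the target, per day.
    stats = df.groupby(day_col)[target_col].agg(["mean", "std"])
    fig, axes = plt.subplots(2, 1, sharex=True)
    stats["mean"].plot(ax=axes[0], title="Mean target across stocks")
    stats["std"].plot(ax=axes[1], title="Std of target across stocks")
    plt.show()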


Hello,

For those struggling with their RAM, especially when doing one-hot encoding: I found this useful function on Kaggle that saves approximately 70% of memory usage on each dataframe.

import numpy as np  # needed for the iinfo/finfo bounds below

def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # Downcast integers to the smallest type that holds the column's range.
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)
            else:
                # Downcast floats; note that float16 trades precision for memory.
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
        print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))
    return df
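A quick usage note: call it once per dataframe after loading. One caveat I would add is that float16 can lose precision, so you may want to protect sensitive columns (the "target" name below is just an example):

target_backup = df["target"].copy()   # hypothetical column name, shown only as an example
df = reduce_mem_usage(df)
df["target"] = target_backup          # restore full precision for the target column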

Hi Alex,

I am also currently trying to reduce the number of features, but it is not optimal yet. Since you did a great job doing so, I am just wondering if you created new features or just removed most of the original ones? I tried to create new features, but it is not helping much yet.

Regarding your question on how to capture the periodicity, I do not have a clear idea yet; should I find something interesting, I will get back to you.

In terms of model, I am also currently running linear regression with LGBM, but I am using K-Fold splits (rough sketch below) - maybe this could be interesting for you. It improved my score quite a bit.
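Roughly what I mean, as a minimal sketch (the splitter, metric and parameters are only illustrative, not exactly my setup):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import KFold

def cv_score(X, y, n_splits=5):
    fold_scores = []
    for train_idx, valid_idx in KFold(n_splits=n_splits, shuffle=False).split(X):
        model = lgb.LGBMRegressor(n_estimators=500, learning_rate=0.05)
        model.fit(X.iloc[train_idx], y.iloc[train_idx])
        preds = model.predict(X.iloc[valid_idx])
        fold_scores.append(np.mean((preds - y.iloc[valid_idx]) ** 2))  # MSE on the held-out fold
    return np.mean(fold_scores)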

David

Hi David,

I first trained a model with all the features, plotted the feature importance, and trained another model using only the features with the highest importance. The gap was very small, and I could drastically reduce the number of features. Then I started computing features based on the remaining ones, and my score did improve. You can also plot some correlations to help you figure out which features are useful.
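In code, the loop looks roughly like this (I use LightGBM as the boosted-tree stage here; the parameters and the top-k cut-off are placeholders):

import pandas as pd
import lightgbm as lgb

def select_top_features(X, y, top_k=42):
    # Train on everything, rank features by importance, retrain on the best ones only.
    full_model = lgb.LGBMRegressor(n_estimators=500).fit(X, y)
    importances = pd.Series(full_model.feature_importances_, index=X.columns)
    keep = importances.sort_values(ascending=False).head(top_k).index.tolist()
    reduced_model = lgb.LGBMRegressor(n_estimators=500).fit(X[keep], y)
    return keep, reduced_model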

Yeah, k-fold splitting is definitely on my todo list, glad it improved your score. Are you using sklearn's TimeSeriesSplit or a custom one?

Good luck
Alex


Hey,

Not sure if this is still useful, but usually, periodicity is dealt with using dummy variables like hour of day, day of week, day of month, month, etc.

Here, we do not have the date, which makes it a little tricky (how do you know which days are off?). I guess you could try building the features by trial and error though.

Also, I see that you have plotted the FFT magnitudes. So maybe you could simply add Fourier cos/sin terms for the frequencies you deem relevant? At least you could hope that the smoothness of the cos/sin would ease up the unknown holiday effect…
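Something along these lines is what I have in mind (the periods are picked by eye, so purely illustrative):

import numpy as np

def add_fourier_features(df, day_col="day", periods=(21, 63)):  # ~month and ~quarter in trading days
    for p in periods:
        angle = 2 * np.pi * df[day_col] / p
        df[f"sin_{p}"] = np.sin(angle)
        df[f"cos_{p}"] = np.cos(angle)
    return df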

What do you think?

Cheers,
Gilto

Also, instead of looking at the target series, you ought to look at the residuals (after lgbm). If you do not see any periodic cycle, then your regression probably took care of it…
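A rough way to check, assuming you keep the daily mean of your out-of-fold residuals in an array:

import numpy as np

def residual_spectrum(daily_mean_residuals):
    # FFT magnitudes of the demeaned residual series; remaining peaks = leftover periodicity.
    r = np.asarray(daily_mean_residuals, dtype=float)
    spectrum = np.abs(np.fft.rfft(r - r.mean()))
    freqs = np.fft.rfftfreq(len(r), d=1.0)  # in cycles per trading day
    return freqs, spectrum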

Hi Gilto,

I agree, we don’t know which days are off.

Interesting, and this can be done for each stock. It reminds me of a cyclical feature encoding that I’ve tried: first compute the week id for each row, then normalize it to [0, 2pi] and apply cos/sin (sketched below). Conclusion: my model scored better without the cyclical encoding. This is supposed to bring week 0 and the last week closer together (last week of December and first week of January, for example) when training neural networks. However, this is not really useful for trees except for hypothetical smoothing purposes…
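For completeness, the encoding looks roughly like this (building the week id from a plain day counter is an approximation on my side):

import numpy as np

def add_cyclical_week(df, day_col="day", weeks_per_year=52):
    week_id = (df[day_col] // 5) % weeks_per_year   # ~5 trading days per week (approximation)
    angle = 2 * np.pi * week_id / weeks_per_year    # normalize the week id to [0, 2*pi)
    df["week_sin"] = np.sin(angle)
    df["week_cos"] = np.cos(angle)
    return df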

However, there are also other features that can be extracted from the FFT… :wink:

Alex
