Hello, in the description of your documents, you define the volatility as the standard deviation of the price over a time period. I am not sure I understand how this relates to the classical definition of volatility in finance. Could you be more explicit about how the volatility is computed?
Good question! The description was not complete enough (it read that volatility was “usually” computed as a standard deviation without mentioning how it is defined in this challenge). I therefore updated the description of volatility in the challenge description. It now reads:
The volatility of an asset is loosely defined as the size of the variations of its price over a period of time: a price that doesn’t change between the beginning and the end of a period is more volatile if it fluctuates more during the period; a price that changes smoothly has a higher volatility if its overall change is large. There are multiple ways of capturing these two aspects of volatility. The data uses a unique (but secret) definition.
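To illustrate the two aspects mentioned in the description, here is a minimal sketch in Python. It is not the definition used in the challenge, which remains secret. The function names `fluctuation_measure` and `net_change_measure` and both price paths are made up for illustration; the first uses the standard deviation of log returns, the second the absolute log change between the first and last price.

```python
import math
import statistics


def log_returns(prices):
    """Log returns between consecutive prices."""
    return [math.log(b / a) for a, b in zip(prices, prices[1:])]


def fluctuation_measure(prices):
    """Population standard deviation of log returns.

    Illustrative stand-in for the 'fluctuates during the period' aspect.
    """
    return statistics.pstdev(log_returns(prices))


def net_change_measure(prices):
    """Absolute log change between the first and the last price.

    Illustrative stand-in for the 'overall change' aspect.
    """
    return abs(math.log(prices[-1] / prices[0]))


# Hypothetical paths: one ends where it started, one rises about 1% per step.
choppy = [100.0, 102.0, 98.0, 101.0, 100.0]
smooth = [100.0, 101.0, 102.01, 103.0301, 104.060401]

for name, path in [("choppy", choppy), ("smooth", smooth)]:
    print(name, fluctuation_measure(path), net_change_measure(path))
```

On these paths the first measure is large for the choppy series and close to zero for the smooth one, while the second measure behaves the other way round, which is why a single number has to combine the two aspects in some way.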
Even if the formula for the volatility is kept secret, could you still tell us what it means when the volatility is equal to zero? It often seems that zeros should be treated as NaN, but sometimes not.
Good question! A zero volatility means that the prices we have during the corresponding 5-minute interval do not vary at all (i.e. the prices at different times in this interval are the same). Thus, for example, if we get a single price during the interval, the volatility is zero.
Is this consistent with what you observe (in particular with respect to NaNs)?
Thank you for your answer. It’s just that I expected the American stock market to be very liquid. I don’t understand how it is possible for a stock to have quite high volatilities at some points in the day, and zero volatilities at the others (see an example below). Plus some NaNs.
If this happened only sporadically, I could understand. But some products (product 211, for example) have many NaNs and zeros on almost every day. Does this mean that you added some very unusual stocks to the data?
Thanks for your interesting question. The dataset does include some illiquid stocks, which may show intermittent trading activity (isolated bursts of activity followed by flat calm).
It should therefore not be surprising to observe zero volatilities at some points in the day and high volatilities at others.
I would add that NaN values can appear when the stock price remains stable: this happens when the source data contains no price update in a 5-minute time slice because the price did not change.
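The difference between a zero and a NaN described above can be sketched as follows. This is only an illustration: the population standard deviation of the prices is an assumed stand-in for the secret formula, and `interval_volatility` is a hypothetical helper name.

```python
import math
import statistics


def interval_volatility(prices):
    """Toy volatility for the prices observed in one 5-minute slice.

    Assumption: population standard deviation of the prices, used here
    only as a stand-in for the (secret) challenge definition.
    """
    if not prices:
        # No price update in the slice: nothing to compute, so NaN.
        return math.nan
    # One price, or several identical prices: zero dispersion, so 0.0.
    return statistics.pstdev(prices)


print(interval_volatility([]))                     # no update in the slice
print(interval_volatility([100.0]))                # a single price
print(interval_volatility([100.0, 100.0, 100.0]))  # several identical prices
print(interval_volatility([100.0, 101.0, 99.5]))   # prices that move
```

Under these assumptions, a slice with no update gives NaN, while a slice with one price or several identical prices gives exactly zero.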
I have two questions, one on the volatility computation and one on the NaNs.
Volatility Computation: We have to predict a volatility over the following two hours (14 to 16). Is that computed as an average of the 5-minute vols over that interval, or do you apply your volatility definition only to the beginning and end points of the interval?
NaN: in the post above you suggest that NaNs occur when there is no price update over the 5-minute slice (e.g. an illiquid stock not moving). In this post, instead, you suggest that NaNs might be caused by a halted stock or by issues with exchange data, and you suggest not imputing zeros. The two statements seem contradictory to me. Could you please explain how we should interpret NaNs, so that we can proceed with an appropriate imputation?
We do not want to share too much about the specific definition of the volatility (so as to avoid leaks in the challenge), but I can say that the same method was used for calculating the volatilities over 5-minute intervals and over the last two hours of the day (the target). This method does not have to be either of the two methods that you suggest.
NaNs can indeed represent any of the possibilities that you mention. There is no explicit distinction in the data about which specific reason explains a given NaN value. However, you might find ways of making some educated guesses, and you can also get a feel for how important the question itself is for the quality of the predictions.
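As one possible way of making such guesses, here is a rough sketch. It rests on a heuristic that is an assumption, not something stated by the data: a NaN surrounded by zeros looks more like a quiet, illiquid stock, while a NaN surrounded by high values looks more like a halt or a data problem. The helper names `nan_context` and `impute` and the sample series are hypothetical.

```python
import math


def nan_context(vols, window=1):
    """Map each NaN index to the mean of its valid neighbours.

    Neighbours are taken within `window` slots on each side; the result is
    NaN when no valid neighbour exists.
    """
    out = {}
    for i, v in enumerate(vols):
        if not math.isnan(v):
            continue
        lo, hi = max(0, i - window), min(len(vols), i + window + 1)
        neighbours = [x for x in vols[lo:hi] if not math.isnan(x)]
        out[i] = sum(neighbours) / len(neighbours) if neighbours else math.nan
    return out


def impute(vols, strategy):
    """Fill NaNs with zero ("zero") or with the last valid value ("ffill").

    With "ffill", leading NaNs stay NaN because there is no earlier value.
    """
    filled, last = [], math.nan
    for v in vols:
        if math.isnan(v):
            filled.append(0.0 if strategy == "zero" else last)
        else:
            filled.append(v)
            last = v
    return filled


# Hypothetical 5-minute volatilities for one product on one day.
series = [0.0, math.nan, 0.0, 0.8, math.nan, 0.9]

print(nan_context(series))
print(impute(series, "zero"))
print(impute(series, "ffill"))
```

Training the same model on each imputed version and comparing validation scores gives a feel for how much the choice matters for the predictions.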