Thursday, July 02, 2015

Time series analysis and data gaps

Most time series techniques, such as the ADF test for stationarity, the Johansen test for cointegration, or the ARIMA model for returns prediction, assume that our data points are collected at regular intervals. In traders' parlance, they assume bar data with fixed bar length. It is easy to see that this mundane requirement immediately presents a problem even if we were just to analyze daily bars: how are we to deal with weekends and holidays?

You can see that the statistics of return bars over weekdays can differ significantly from those over weekends and holidays. Here is a table of comparison for SPY daily returns from 2005/05/04-2015/04/09:

SPY daily returns      | Number of bars | Mean Returns (bps) | Mean Absolute Returns (bps) | Kurtosis (3 is “normal”)
Weekdays only          | 1,958          | 3.9                | 80.9                        | 13.0
Weekends/holidays only | 542            | 0.3                | 82.9                        | 23.7

Though the absolute magnitude of the returns over a weekday is similar to that over a weekend, the mean returns are much more positive on the weekdays. Note also that the kurtosis of returns is almost doubled on the weekends. (Much higher tail risks on weekends with much less expected returns: why would anyone hold a position over weekends?) So if we run any sort of time series analysis on daily data, we are force-fitting a single model to data with heterogeneous statistics, and such a model won't work well.
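A minimal sketch of how such a comparison could be computed, assuming only a list of bar dates and closing prices. The function name `gap_stats` and the synthetic random-walk data are illustrative assumptions, not the code or data behind the table above. A bar counts as a gap bar when more than one calendar day separates it from the previous close, which matches the close-to-close convention for weekend returns.

```python
import numpy as np

def gap_stats(dates, close):
    """Summarize close-to-close log returns, split by whether a bar spans a gap."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    close = np.asarray(close, dtype=float)
    ret = np.diff(np.log(close))                      # one log return per bar
    days = np.diff(dates) / np.timedelta64(1, "D")    # calendar days spanned by each bar
    stats = {}
    for label, mask in (("no_gap", days == 1), ("gap", days > 1)):
        r = ret[mask]
        dev = r - r.mean()
        stats[label] = {
            "n_bars": int(r.size),
            "mean_bps": 1e4 * r.mean(),
            "mean_abs_bps": 1e4 * np.abs(r).mean(),
            "kurtosis": (dev**4).mean() / (dev**2).mean() ** 2,  # 3 for a Gaussian
        }
    return stats

# Synthetic example: Monday-to-Friday dates and a random-walk price series.
rng = np.random.default_rng(0)
all_days = np.arange("2015-01-01", "2015-03-01", dtype="datetime64[D]")
dates = all_days[np.is_busday(all_days)]
close = 100 * np.exp(np.cumsum(0.01 * rng.standard_normal(dates.size)))
stats = gap_stats(dates, close)
```

With real data, `stats["no_gap"]` and `stats["gap"]` would correspond to the two rows of the table.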

The problem is, of course, much worse if we attempt time series analysis on intraday bars. Not only are we faced with the weekend gap, in the case of stocks or ETFs we are faced with the overnight gap as well. Here is a table of comparison for AUDCAD 15-min returns vs weekend returns from 2009/01/01-2015/06/16:

AUDCAD 15-min returns  | Number of bars | Mean Returns (bps) | Mean Absolute Returns (bps) | Kurtosis (3 is “normal”)
Weekdays only          | 158,640        | 0.01               | 4.5                         | 18.8
Weekends/holidays only | 343            | -2.06              | 15.3                        | 4.6

In this case, every important statistic is different (and it is noteworthy that kurtosis is actually lower on the weekends here, illustrating the mean-reverting character of this time series.)

So how should we predict intraday returns with data that has weekend gaps? (The same solution should apply to overnight gaps for stocks, so they are omitted from the following discussion.) Let's consider several proposals:

1) Just delete the weekend returns, or set them as NaN in Matlab, or missing values NA in R. 

This won't work because the first few bars of a week aren't properly predicted by the last few bars of the previous week. We shouldn't use any linear model built with daily or intraday data to predict the returns of the first few bars of a week, whether or not that model contains data with weekend gaps. As for how many bars constitute the "first few bars", it depends on the lookback of the model. (Notice I emphasize linear model here because some nonlinear models can deal with large jumps during the weekends appropriately.)

2) Just pretend the weekend returns are no different from the daily or intraday returns when building/training the time series model, but do not use the model for predicting weekend returns. I.e. do not hold positions over the weekends.

This has been the default, and perhaps simplest (naive?) way of handling this issue for many traders, and it isn't too bad. The predictions for the first few bars in a week will again be suspect, as in 1), so one may want to refrain from trading then. The model built this way isn't the best possible one, but then we don't have to be purists.

3) Use only the most recent period without a gap to train the model. So for an intraday FX model, we would be using the bars in the previous week, sans the weekends, to train the model. Do not use the model for predicting weekend returns nor the first few bars of a week.

This sounds fine, except that there is usually not enough data in just a week to build a robust model, and the resulting model typically suffers from severe data snooping bias.

You might think that it should be possible to concatenate data from multiple gapless periods to form a larger training set. This "concatenation" does not mean just piecing together multiple weeks' time series into one long time series - that would be equivalent to 2) and wrong. Concatenation just means that we maximize the total log likelihood of a model over multiple independent time series, which in theory can be done without much fuss since the log likelihood (i.e. log probability) of independent data is additive. But in practice, most pre-packaged time series model programs do not have this facility. (Do add a comment if anyone knows of such a package in Matlab, R, or Python!) Instead of modifying the guts of a likelihood-maximization routine of a time series fitting package, we will examine a shortcut in the next proposal.
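To make the additivity concrete, here is a minimal sketch for the simplest case, a Gaussian AR(1), where each gapless week is treated as an independent series. It assumes the conditional likelihood (conditioning on the first bar of each week) and Gaussian residuals. The names `ar1_loglik` and `fit_ar1` and the simulated data are hypothetical, not taken from any package.

```python
import numpy as np

def ar1_loglik(segments, c, phi, sigma):
    """Sum of conditional Gaussian AR(1) log likelihoods over independent segments."""
    total = 0.0
    for x in segments:
        x = np.asarray(x, dtype=float)
        resid = x[1:] - c - phi * x[:-1]          # residuals never straddle a gap
        total += (-0.5 * resid.size * np.log(2 * np.pi * sigma**2)
                  - 0.5 * np.sum(resid**2) / sigma**2)
    return total

def fit_ar1(segments):
    """Maximize the summed log likelihood (closed form for a Gaussian AR(1))."""
    y = np.concatenate([np.asarray(x, dtype=float)[1:] for x in segments])
    lag = np.concatenate([np.asarray(x, dtype=float)[:-1] for x in segments])
    A = np.column_stack([np.ones_like(lag), lag])
    (c, phi), *_ = np.linalg.lstsq(A, y, rcond=None)
    sigma = np.sqrt(np.mean((y - c - phi * lag) ** 2))
    return c, phi, sigma

# Simulated example: five independent "weeks" with true phi = -0.3.
rng = np.random.default_rng(0)

def simulate(n, phi=-0.3):
    x = np.zeros(n)
    for t in range(1, n):
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x

weeks = [simulate(500) for _ in range(5)]
c_hat, phi_hat, sigma_hat = fit_ar1(weeks)
best = ar1_loglik(weeks, c_hat, phi_hat, sigma_hat)
```

For this Gaussian case, maximizing the summed log likelihood reduces to a least-squares fit over all within-week (lag, target) pairs.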

4) Rather than using a pre-packaged time series model with maximum likelihood estimation, just use an equivalent multiple linear regression (LR) model. Then fit this LR model to all the data in the training set except the weekend bars, and use it for predicting all future bars except the weekend bars and the first few bars of a week.

This conversion of a time series model into an LR model is fairly easy for an autoregressive model AR(p), but may not be possible for an autoregressive moving average model ARMA(p, q). This is because the latter involves a moving average of the residuals, creating a dependency which I don't know how to incorporate into an LR. But I have found that an AR(p) model, due to its simplicity, often works better out-of-sample than ARMA models anyway. It is, of course, very easy to just omit certain data points from an LR fit, as each data point is presumed independent.
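A minimal sketch of proposal 4, assuming a return series `ret` and a boolean flag `is_gap` marking the weekend bars. The function names are hypothetical. One assumption goes slightly beyond the text: a row is dropped whenever the target bar or any of its p lags is a gap bar, so the first p bars after each gap are excluded from training as well as from prediction.

```python
import numpy as np

def fit_ar_lr(ret, is_gap, p):
    """Fit AR(p) by least squares, skipping every row that touches a gap bar."""
    ret = np.asarray(ret, dtype=float)
    is_gap = np.asarray(is_gap, dtype=bool)
    rows, targets = [], []
    for t in range(p, len(ret)):
        if is_gap[t - p:t + 1].any():      # target or one of its p lags is a gap bar
            continue
        rows.append(ret[t - p:t][::-1])    # lag 1 first, lag p last
        targets.append(ret[t])
    X = np.column_stack([np.ones(len(rows)), np.array(rows)])
    coef, *_ = np.linalg.lstsq(X, np.array(targets), rcond=None)
    return coef                            # [intercept, phi_1, ..., phi_p]

def predict_next(ret, is_gap, coef):
    """Forecast the next bar, or return None if a gap bar is within the lookback."""
    p = len(coef) - 1
    if np.asarray(is_gap[-p:], dtype=bool).any():
        return None
    lags = np.asarray(ret[-p:], dtype=float)[::-1]
    return float(coef[0] + coef[1:] @ lags)

# Simulated AR(2) returns with large jumps overwritten at the gap bars.
rng = np.random.default_rng(1)
n, phi = 20000, (-0.2, 0.1)
ret = np.zeros(n)
for t in range(2, n):
    ret[t] = phi[0] * ret[t - 1] + phi[1] * ret[t - 2] + rng.standard_normal()
is_gap = np.zeros(n, dtype=bool)
is_gap[500::500] = True
ret[is_gap] += 10 * rng.standard_normal(is_gap.sum())
coef = fit_ar_lr(ret, is_gap, p=2)
forecast = predict_next(ret, is_gap, coef)
```

Because the contaminated rows are skipped, the fitted coefficients stay close to the true AR(2) values despite the jumps.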

Here is a plot of the out-of-sample cumulative returns of one such AR model built for predicting 15-minute returns of NOKSEK, assuming midpoint executions and no transaction costs.

[Figure: out-of-sample cumulative returns of the AR model on NOKSEK 15-minute bars]

Whether or not one decides to use this or the other techniques for handling data gaps, it is always a good idea to pay some attention to whether a model will work over these special bars.

===

My Upcoming Workshop


This is a new online workshop focusing on the practical use of AI techniques for identifying predictive indicators for asset returns.

===

Managed Accounts Update

Our FX Managed Account program returned 6.02% in June (YTD: 31.33%).

===

Industry Update
  • I previously reported on a fundamental stock model proposed by Lyle and Wang using a linear combination of just two firm fundamentals ― book-to-market ratio and return on equity. Professor Lyle has posted a new version of this model.
  • Charles-Albert Lehalle, Jean-Philippe Bouchaud, and Paul Besson reported that "intraday price is more aligned to signed limit orders (cumulative order replenishment) rather than signed market orders (cumulative order imbalance), even if order imbalance is able to forecast short term price movements." Hat tip: Mattia Manzoni. (I don't have a link to the original paper: please ask Mattia for that!)
  • A new investment competition to help you raise capital is available at hedgefol.io.
  • Enjoy an Outdoor Summer Party with fellow quants benefiting the New York Firefighters Burn Center Foundation on Tuesday, July 14th with great food and cool drinks on a terrace overlooking Manhattan. Please RSVP to join quant fund managers, systematic traders, algorithmic traders, quants and high frequency sharks for a great evening. This is a complimentary event (donations are welcomed). 
===

Follow me on Twitter: @chanep

77 comments:

Michael Harris said...

Hello Ernie,

Thanks for the article. I have a few questions:

1. My analysis shows that SPY arithmetic returns on Fridays are negative on average for the period of your analysis. Why not worry about that more? Should we also take out Fridays?

2. It appears that returns are highest on Tuesdays on average by an order of magnitude. Do you have the same results?

3. Why not work directly with absolute returns instead of arithmetic returns? If that concerns you, this is what the model of data is saying. Then you have no issues.

4. Did you use actual or adjusted data for SPY?

My best regards to you.

pcavatore said...

I'm missing how you can have an average return for weekends when markets are closed...

Mattia Manzoni said...

Thanks Ernie for your mention in the blog, I'm really excited for that!!!

Only some indications for your readers:

- The intuition underlying this result is taken from the Lehalle-Besson presentation on market resiliency measurements at Market Microstructure 2014, an event held in Paris in December 2014: http://market-microstructure.institutlouisbachelier.org/?lng=EN. The authors show that most intraday returns are generated by limit order movements rather than by market orders;

- The guiding effect of limit orders on returns seems to be a little bit fuzzy: sometimes market orders have more influence, even if limit orders seem to present a heavier signal.

- However, based on personal preliminary analyses, it seems that for short term prediction limit orders are not helpful. I'm trying to produce new results in the next days

Mattia

Anonymous said...

Consider a missing data problem which can be easily resolved with the Kalman filter

Ernie Chan said...

Hi Pcavatore,
By return for a weekend, I meant the return from the close of the trading day before the weekend, to the close of the trading day after the weekend, in the case of SPY. For 15-min AUDCAD returns, weekend return means the return from 5pm ET on Friday, to 5:15pm ET on Sunday when the market re-opens.

Ernie

Ernie Chan said...

Hi Mattia,
Thanks for the link and explanation!
Ernie

Ernie Chan said...

Can you elaborate on why the Kalman filter can solve the missing data problem?

Ernie

Ernie Chan said...

Hi Michael,

1)-2) Indeed there is seasonality in daily returns of many instruments, but seasonality is actually easier to deal with. You just need to build a special model for each day of the week. Also, special time series techniques are developed for that. (See the book by Ruppert on my Recommended Books list on the right sidebar.) Here, we are trying to build a model that works almost every day, except perhaps Monday.

3) Even if we can accurately predict absolute returns, we cannot make a directional trade. (Though maybe you can trade its options.)

4) I used adjusted data.

Ernie

Michael Harris said...

Hello Ernie,

Thanks for the reply.

I repeated your analysis with all available history for SPY and I cannot find any significant variation due to weekends. I repeated it for TLT and found negative mean returns for Wednesdays and Thursdays, but also no significant difference for the weekend. For GLD, the mean weekend return is slightly negative, but so is Thursday's.

Anonymous said...

Hi Ernie,

Could we do pairs trading for negatively correlated assets?

Do you have any examples?

Thanks.

Ernie Chan said...

Hi Michael,
Have you done the analysis for SPY over the same period I quoted? That way, we can compare and find out whether either of our calculations or data sets has errors.
Ernie

Ernie Chan said...

Yes, ES and VX futures are negatively correlated, and I have described a momentum-based pair trading strategy in my book Algorithmic Trading, p. 143.

Ernie

Michael Harris said...

Hello Ernie,

Yes, the results agree for the period you used. Mean is 3.9 bps during the week and for weekend (return from F to M) the mean is something less than 0.1 bp. But for whole history I do not find that difference. I used adjusted data. Thanks.

Ernie Chan said...

Hi Michael,
Interesting that you didn't find such seasonality over the entire history of SPY.
I know that academic researchers like to use as long a history as possible when writing papers (30 years minimum?), but I disagree with that approach.

To me, finance is not like physics. Physical laws may be unchanged since a few seconds after the Big Bang, but financial laws change every few years! I believe we should not use any data prior to 2009 for most analysis. The reason? Please look at Figure 5.10 of my second book Algorithmic Trading.

Ernie

Michael Harris said...

Hello Ernie,

From 01/2009 to last close, for SPY the mean return during the week is 7.2 bps and for the weekend it is 2.5 bps. However, Fridays alone are also 2.9 bps, so from these numbers I do not find any significant effect, unless I am not interpreting them correctly.

As far as using data after 2009, I have a different opinion here. I think the last 6 years were an aberration due to QE and other causes, and soon price series dynamics may change.

Michael

Ernie Chan said...

Hi Michael,
If you find that the Friday return is also significantly (statistically speaking) different from the mean, that would indicate that there is a seasonal effect for Fridays, and it should be treated separately from other days as well.

It is anybody's guess when the current post-crisis regime will end. But I think it is a safe bet that even after QE ends, the statistical characteristics of the market won't be the same as that before the crisis. The proliferation of ETFs, especially levered ETFs, as well as the prevalence of HFT have significantly altered the equity market structure. Regulation changes also played a significant part. Prior to 2005, Regulation NMS was not in place. Short sale rules were different. Prior to 2001, stock quotes were not even in decimal places. In my opinion, it is not helpful to use old data for trading research.

Ernie

Michael Harris said...

Hello Ernie,

You are making some very good points as always. However, it should also be clear that you are an advanced professional involved in arbitrage strategies, and in that area market microstructure changes do matter. Regarding your recent excellent interview in Better System Trader, some may misinterpret your claims because you are not a retail trader with more freedom in applying risk and money management methods. I have traded for a HF during some difficult time periods (financial crisis), and the risk constraints imposed are different and so are the methods to implement them. I do not think that was made clear in your interview, and thus I do not think that many of the retail traders who listened to it understood why you put aside 80% and risk 20% with Kelly leverage. As a matter of fact, I think you were an outlier on that website. Even the interviewer confused your method with Kelly fraction. Therefore, from your point of view the change in market dynamics may be important, but for most technical traders, who trade randomly anyway without realizing it, it is not. E.g. a head and shoulders is a head and shoulders no matter what for them.

I am doing some analysis regarding your claim about changing market dynamics and I hope to post something in the next few days. I am investigating whether price pattern dynamics have changed. Your post was an inspiration for that.

Best regards,

Michael

Ernie Chan said...

Hi Mike,
I agree with your points that the context of our discussions does matter.
Looking forward to the results of your new research!
Ernie

Anonymous said...

Hi Ernie,

In pairs trading,

y_price = alpha + beta*x_price + noise.

How do we use hedge ratio(beta) to choose the weights of two legs(y, x)?

Thanks.

Ernie Chan said...

We trade 1 share of y, and beta shares of x.

Ernie
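As an illustration of the reply above, here is a minimal sketch that estimates alpha and beta by ordinary least squares on prices and forms the spread. The function name `hedge_ratio` and the simulated prices are assumptions for illustration. The short sign on the x leg is the usual convention for going long the spread and is not spelled out in the reply.

```python
import numpy as np

def hedge_ratio(y_price, x_price):
    """OLS fit of y_price = alpha + beta * x_price + noise."""
    beta, alpha = np.polyfit(x_price, y_price, 1)   # slope first, then intercept
    return alpha, beta

# Simulated pair: x is a random walk, y tracks 1.5 * x plus noise.
rng = np.random.default_rng(2)
x_price = 50 + np.cumsum(rng.standard_normal(1000))
y_price = 2.0 + 1.5 * x_price + 0.5 * rng.standard_normal(1000)

alpha, beta = hedge_ratio(y_price, x_price)
spread = y_price - beta * x_price                   # candidate stationary series
position = {"y_shares": 1.0, "x_shares": -beta}     # long 1 unit of the spread
```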

Anonymous said...

Hi Ernie,

Do you think that the hedge ratio method outperforms dollar neutral or beta neutral(CAPM) or volatility neutral for weighting?

Is there any reference for this discussion?

Thanks.

Ernie Chan said...

Using hedge ratio does not guarantee that the performance is better than dollar neutral or volatility weighting. However, the other methods of hedging do not follow from the stationarity of the spread. I.e. even if a spread is truly stationary, there is no guarantee that the other methods would be profitable.

Ernie

Michael Harris said...

Hello Ernie,

Here is my analysis applied to price patterns: http://bit.ly/1J3rMiC



Ernie Chan said...

Hi Michael,
Thanks for the link to your study. Interesting conclusion: I will study it in some detail.
Ernie

Nestor said...

Hi Ernie
I have a couple of questions hope you can help me to solve them:

1. What do you mean by mean absolute return? Is it the absolute value of each return, which you then sum up? (I know absolute return as opposed to relative return, but I think you don't mean the same thing, so I want to be clear on this.) If the calculation is as I just described, I am obtaining 8.2 bp, not 82 bp, for the same data you used.
If the absolute return is simply the cumulative sum of the individual returns of the series, I get something like 70% (not bp) for the “weekday” and -14% for weekends.
Also, is the mean return the mean of the series of returns? If it is, I am getting 0.3 bp for the “weekdays” but -0.3 bp for “weekends”, not 3.9.
I downloaded the data from Bloomberg, in case we have different sources; however, I don't believe that is the problem.

2. If you are measuring the “weekend” return from Friday's close to Monday's close, wouldn't you be measuring Monday's return? Given that the return from Friday's close to Monday's open is zero or at best negligible (0.06 bp for the series you evaluated), you end Friday where you start Monday, so the return from Monday's open to Monday's close is the Monday return. So weekends = Monday?

Thank you very much for helping me with these questions. I'll be waiting for the answers.

Regards
Nestor

Ernie Chan said...

Hi Nestor,
1) Mean absolute return means the average (over several years) of the absolute value of the daily returns. This is a typical way to compute intraday volatility (and not standard deviation of returns), but for consistency, I am using it to compute daily volatilities as well here. It is not equal to either of the definitions you suggested.

In Matlab, it is

mean(abs(ret));

assuming ret=log(close_price) - backshift(1, log(close_price));

where backshift is the function I used throughout my books.

2) Yes, you can call the weekend return "Monday return". It is just a matter of convention. I am excluding Mondays from my weekdays return.

Ernie

HK said...

Hi Ernie,

I looked at some arbitrage trader positions, and it is interesting that a lot of the firms require the trader to show a track record with a Sharpe ratio of 2 to 2.5, and that is after transaction costs. For our own strategy, how good a Sharpe ratio should we look for?

-HK

Ernie Chan said...

Hi HK,
I would say a Sharpe of at least 1 is necessary to indicate statistical significance, unless you have strong fundamental reason to believe that the strategy should work in spite of little statistical evidence.
Ernie

HK said...

Hi Ernie,

Thanks. I still cannot find any good arbitrage opportunity in the Hong Kong market. There is a 0.1% tax on stocks, which means a buy and a sell would cost 0.2%. The two main futures, HSI and HHI, are not cointegrated enough. ETFs are tax free, but most of the ETFs are about the China market.

Should I focus on overseas opportunities?

-HK

Ernie Chan said...

Hi HK,
Most of my UK and European clients and associates are trading US stocks. I don't see why you should only trade HK stocks.
Ernie

Anonymous said...

Hi Ernie,

What is your view for US dollars?

Could it be stronger? Thanks.

Ernie Chan said...

If US interest rates increase, surely USD will go up, since we don't expect the ECB or BOJ to increase rates.

Ernie

Anonymous said...

Hi Ernie,

Do you think the Japanese government wants the yen to devalue further against the US dollar?

Thanks.

Ernie Chan said...

I have no idea - I am not an economist!

Ernie

Anonymous said...

Hi Ernie,

Is it possible to hedge currency risk if we trade stocks in Japan, but base currency is US dollars?

Anonymous said...

Hi Ernie,

How do we compute CAGR?

Ernie Chan said...

You can hedge currency risks most cheaply by using options. See the new book "FX Option Performance" at the top of my Recommended Books list on the right sidebar.

Ernie

Ernie Chan said...

CAGR = compound annual growth rate. You can compute the cumulative compounded return of your strategy by taking the daily returns r_i (more precisely, the daily mark-to-market returns), and compounding them: R=(1+r_1)*(1+r_2)*...*(1+r_N)-1. Then annualize it by CAGR=(1+R)^(252/N)-1, where 252 is the number of trading days in a year.

Ernie
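The CAGR formula in the reply above translates directly into a few lines of Python. This is a minimal sketch; the function name `cagr` and the sample returns are illustrative assumptions.

```python
import math

def cagr(daily_returns, periods_per_year=252):
    """Annualize compounded daily returns: (1 + R) ** (252 / N) - 1."""
    growth = math.prod(1.0 + r for r in daily_returns)   # equals 1 + R
    n = len(daily_returns)
    return growth ** (periods_per_year / n) - 1.0

# Half a year (126 trading days) of constant 0.1% daily returns.
half_year = [0.001] * 126
annualized = cagr(half_year)
```

Here the half-year result is annualized to the same value as a full year of 0.1% daily returns, i.e. 1.001^252 - 1.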

Anonymous said...

Hi Ernie,

Thank you for the information.

How do we compute r_1? Do we need to include deposit in broker account?

Ernie Chan said...

The daily return r_i is the daily P&L (mark-to-market P&L, meaning it includes both realized and unrealized P&Ls) divided by the NAV of your account. Naturally, the NAV includes cash.

Ernie

Anonymous said...

Hi Ernie,

Are you able to share with us some details on the AR model you used in your example?

Many thanks,
John

Ernie Chan said...

Hi John,
It is an AR model with a very long lookback, such as AR(288).
Please note that this cumPL curve was generated assuming mid quote executions. Naturally, in real trading we may have to pay the bid-ask spread a lot of times.

Ernie

Anonymous said...

Hi Ernie,

I've heard some HFT firms require traders to have a track record with Sharpe ratios of at least 3-5. To me that sounds crazy high. I realize no one in their right mind is going to give up any details on a strategy performing that great. But do you have any insight into the sort of strategies and techniques they use? As most HFT firms seem to require a track record with this type of stellar performance, how can anyone actually break into this field?

Many thanks,
Peter

Ernie Chan said...

Hi Peter,
High frequency market-making strategies often can have very high Sharpe ratio, as can latency arbitrage strategies.

Few if any individual traders can afford the infrastructure necessary to implement a high frequency strategy. You are not expected to have such a ready-made strategy to be hired by a HFT firm: you are expected to be a stellar programmer that will assist in the team effort to perfect such strategies once you join the firm.

Ernie

Anonymous said...

Hi Ernie,

Thank you for your blog. I always enjoy your posts and I have your books.

Regarding the AR model, I have a couple of questions:
When finding the best fitting p in AR(p), is this not overfitting the data (data snooping)? You mention AR(288), how do you arrive at 288?
When you then test AR(288) on the test set and confirm its validity, doesn't that out-of-sample test become in-sample, because you used it to verify and select the model?

Is it not a more robust approach to test a p-lags value that shows stable results across multiple FX pairs and time frames?

Thank you.
Kris


Anonymous said...

Hi Ernie,

Where can we get free fundamental data for stocks?

Thanks.

Ernie Chan said...

Hi Kris,
The value of p is found by using Maximum Likelihood Estimation on the training set. There is no data snooping bias when you find the optimal parameter in the training set, and confirm that it works well in the test set. The only case where data snooping bias would occur is when you find the optimal p in the training set, but the model still doesn't work in the test set, so you change the model in some other way, and use the training set to re-optimize the parameters.

One can seldom find a model that works on multiple FX pairs using the same set of parameters.

Ernie



Ernie Chan said...

The Sharadar database in Quandl.com provides a limited amount of free fundamental data.
Ernie

Anonymous said...

Thanks Ernie,

I am currently paper trading a futures model, which was backtested on 1 minute data and can confirm that using the model to trade the first few bars of the week is impossible. My, perhaps somewhat naive explanation is that the information that builds up during the weekend period gets priced into the market within the first few bars, and as you point out, we cannot model this weekend information flow, as there is no corresponding price for it.
In general, this discontinuity (including the overnight gaps up or down) is causing me some headache. I even thought about looking into models for volatility jumps, but they look too complicated to be robust, and volatility may behave differently compared to prices.

Thanks again
Kris

Ernie Chan said...

Hi Kris,
Thanks for sharing your experience with your model. Indeed it is very difficult to have an intraday model that nevertheless holds positions overnight or even over the weekend. Oil and water do not mix well.
Ernie

Anonymous said...

Hi Ernie,

Has the historical intraday data for US stocks on IB, such as 1-min BID bars, been adjusted for dividends and splits?

Thanks.

Ernie Chan said...

No data on IB has been split/dividend-adjusted.

Ernie

Anonymous said...

Hi Ernie,

Thank you for response.

I think it has been at least splits adjusted when I check the data.

The historical data is consistent with the charts on TWS.

But I am not sure if it is dividends adjusted.

Paul said...

And for the weekend gap, can we just connect the Friday close with the next Monday open, and backward adjust the other weekdays in a similar manner? Might this work for strategies that don't hold positions over the weekend?

HK said...

Dear Ernie,

I saw you mentioned your FX performance. For FX, do you trade with spot price or future? Margin?


-HK

Ernie Chan said...

Hi Paul,
Even for an intraday strategy, you will find that the model will make some erroneous predictions for the first few bars of a day. Just because you fooled the model into ignoring the gap doesn't mean there is no gap in the prices which would affect predictions.
Ernie

Ernie Chan said...

Hi HK,
We trade spot FX only, at 10x leverage.
Ernie

Anonymous said...

For those who are doing pairs trading with e-mini futures, there is an interesting blog that shows the real out-of-sample signals (as well as the market timing behind the pairs trading):

index pairs trading


Forex Speaker said...

Master Ernie,
I was actually searching for your e-mail address to send you this message.
I am sorry to write it here, but I don't know how else to contact you.
I wrote an article about the trading lifestyle,
and I presented you to Arabic-speaking traders as an example of a trader's life and activities.
I hope you will check it out.
Thanks in advance.
http://www.forexspeaker.com/2015/08/trading-lifestyle.html

Anonymous said...

Hi Ernie,

Have you tried using different GARCH models to model the time series, or do you feel that an AR model usually offers better out-of-sample performance due to its simplicity? Thanks

Karen

Ernie Chan said...

Hi Karen,
I haven't tried it myself, but it is certainly worth looking into combining GARCH and ARIMA.
Ernie

Kimi said...

Hi Ernie,

May I know the reason for using the 15-minute timeframe for the simulation? Thanks

Dave

Ernie Chan said...

Hi Dave,
This is because there is a 15M gap in daily FX trading, from 17:00-17:15 ET. Using 15M data avoids having to deal with this gap.
Ernie

Anonymous said...

Ernie,

Do you ever make use of wavelets in your systems or training data to remove noise but keep the main signal intact?

Ernie Chan said...

I did that before, and found a slight improvement - but that was when I was young and foolish. I plan to take another stab at it soon.

Ernie

Nick Kirk said...

Hi Ernie. We met at the Systematic Trading Meetup in London.
I have a new Blog to discuss systematic trading strategies and systems dev. Please take a look. I hope to make fairly frequent posts.
http://mintegration.eu

Ernie Chan said...

Thanks for the link, Nick!

Anonymous said...

Hi Ernie!

How have your strategies passed through this Flash Crash?
Do you think FX is vulnerable to some kind of Flash Crash events?

Thank you!

Ernie Chan said...

Our FX strategy is a mean-reverting strategy, and therefore short volatility. So naturally we expect some losses when volatility picks up. It is down about 2.5% in August. However, our fund is up more than 1%, since this strategy is hedged with a long volatility futures strategy, which performed unusually well.

Ernie

Anonymous said...

Hi Ernie,

Is the long volatility currency futures strategy also a mean-reverting strategy?

Thanks.

Ernie Chan said...

No, long volatility strategies such as our futures strategy are usually momentum based.

Ernie

Anonymous said...

Hi Ernie,

Some people said that stocks pairs trading is a "long volatility" trading strategy.

What are your comments? Thanks.

Ernie Chan said...

Most pair trading strategies are mean reverting and thus short volatility.

Ernie

Anonymous said...

Hi Ernie,

Academic papers typically evaluate the success of trading signals and predictive variables using, say, a 5% significance level.

But for a predictive variable to be useful in a trading context we don't need such a stringent criterion, right?

I mean, even if a variable is significant at, say, the 20% level, we can still make money using it. Agree?

Thanks.

Ernie Chan said...

I discussed this issue in Chapter 1 of Algorithmic Trading. Assuming Gaussian distribution of returns, we need a p-Value of 5% if we want to have a Sharpe ratio of 1.6 or higher. Of course one can still make money with a low Sharpe ratio, but the volatility of returns will be high.

Ernie
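A minimal sketch of the relationship between Sharpe ratio and p-value mentioned in the reply above. It assumes i.i.d. Gaussian returns, a null hypothesis of zero mean return, and a one-sided test, so that the test statistic is the annualized Sharpe ratio times the square root of the number of years of data. The function name `sharpe_p_value` is hypothetical, and reading the quoted 1.6 as corresponding to about one year of data is an interpretation, not something stated in the reply.

```python
from math import sqrt
from statistics import NormalDist

def sharpe_p_value(annual_sharpe, years):
    """One-sided p-value for a zero-mean null, given an annualized Sharpe ratio."""
    z = annual_sharpe * sqrt(years)        # mean return divided by its standard error
    return 1.0 - NormalDist().cdf(z)

p_one_year = sharpe_p_value(1.645, years=1)    # close to 0.05
p_four_years = sharpe_p_value(1.0, years=4)    # longer history, lower Sharpe needed
```

Under these assumptions, a lower Sharpe ratio can reach the same significance level if it is sustained over a longer backtest.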