Monday, November 06, 2006

Cointegration is not the same as correlation

A reader asked me recently why I believe that energy stock prices (e.g. XLE) are correlated with the front-month crude oil futures contract (QM). Actually I don’t believe they are necessarily correlated – I only think they are “cointegrated”.

What is the difference between correlation and cointegration? If XLE and QM were really correlated, then when XLE goes up one day, QM would likely go up on the same day too, and vice versa. Their daily (or weekly, or monthly) returns would rise or fall in synchrony. But that’s not what my analysis was about. I claim that XLE and QM are cointegrated, meaning that the two price series cannot wander off in opposite directions for very long without eventually coming back to a mean distance. But that doesn’t mean the two prices have to move in synchrony on a daily basis at all.
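
The distinction can be checked numerically. The following is a minimal sketch on simulated data, not the XLE/QM analysis itself: the series names, step sizes, the 0.9 persistence and the $1 mean spread are all assumptions chosen for illustration. Price A is a random walk, and price C is A plus a noisy spread that keeps decaying back to $1, so the pair is cointegrated by construction even though the daily changes are only weakly correlated.

```python
import math
import random


def simulate_pair(n_days=2000, seed=42):
    """Return two hypothetical price series, A and C, that are cointegrated."""
    rng = random.Random(seed)
    a = 100.0   # price of A, a pure random walk
    gap = 0.0   # deviation of the spread (C - A) from its $1 mean
    prices_a, prices_c = [], []
    for _ in range(n_days):
        a += rng.gauss(0.0, 0.2)               # small random-walk step for A
        gap = 0.9 * gap + rng.gauss(0.0, 1.0)  # AR(1): the deviation decays towards 0
        prices_a.append(a)
        prices_c.append(a + 1.0 + gap)
    return prices_a, prices_c


def daily_changes(prices):
    """Day-over-day price changes."""
    return [later - earlier for earlier, later in zip(prices, prices[1:])]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    prices_a, prices_c = simulate_pair()
    spread = [c - a for a, c in zip(prices_a, prices_c)]
    corr = pearson(daily_changes(prices_a), daily_changes(prices_c))
    print("correlation of daily changes:", round(corr, 3))
    print("average spread:", round(sum(spread) / len(spread), 3))
    print("largest deviation from $1:", round(max(abs(s - 1.0) for s in spread), 3))
```

With these made-up parameters the daily changes should show only a weak correlation, while the spread stays in a bounded band around $1 no matter how far A itself wanders.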

Two hypothetical graphs illustrate the differences. In the first graph, stock A and stock B are correlated. You can see that their prices move in the same direction almost every day.

Now consider stock A and stock C.

Stock C clearly doesn’t move in any correlated fashion with stock A: some days they move in the same direction, other days in opposite directions. Most days stock C doesn’t move at all! But notice that the spread in stock prices between C and A always returns to about $1 after a while. This is a manifestation of cointegration between A and C. In this instance, a profitable trade would be to buy A and short C at around day 10, then exit both positions at around day 19. Another profitable trade would be to buy C and short A at around day 31, then close out the positions around day 40.
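
The two trades just described follow one rule: short the spread when it is well above its mean, buy it when it is well below, and exit when it comes back. Below is a minimal sketch of that rule, assuming a known mean spread of $1; the function name, the entry threshold and the exit tolerance are hypothetical choices, not values taken from the graphs, and transaction costs are ignored.

```python
def spread_trades(prices_a, prices_c, mean_spread=1.0, entry=1.5, exit_tol=0.25):
    """Trade the spread C - A around its mean.

    Returns (entry_day, exit_day, side, pnl) tuples, where pnl is per one share
    of each stock and ignores costs.
    """
    trades = []
    position = 0        # +1 = long C / short A, -1 = short C / long A, 0 = flat
    entry_day = entry_spread = None
    for day, (a, c) in enumerate(zip(prices_a, prices_c)):
        spread = c - a
        deviation = spread - mean_spread
        if position == 0:
            if deviation >= entry:        # C is rich relative to A
                position, entry_day, entry_spread = -1, day, spread
            elif deviation <= -entry:     # C is cheap relative to A
                position, entry_day, entry_spread = 1, day, spread
        elif abs(deviation) <= exit_tol:  # spread is back near its mean
            side = "long C / short A" if position == 1 else "short C / long A"
            trades.append((entry_day, day, side, position * (spread - entry_spread)))
            position = 0
    return trades


if __name__ == "__main__":
    # Made-up prices: A is flat, C runs $2 above its usual $1 premium and comes back.
    prices_a = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]
    prices_c = [11.0, 11.0, 13.0, 12.5, 11.0, 11.0]
    for trade in spread_trades(prices_a, prices_c):
        print(trade)
```

A position that is still open when the data ends is simply not reported, which keeps the sketch short but would need handling in a real backtest.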

Cointegration is the foundation upon which pair trading (“statistical arbitrage”) is built. If two stocks simply move in a correlated manner, there may never be any widening of the spread. Without a temporary widening of the spread in either direction, there is no opportunity to short (or buy) the spread, and no reason to expect the spread to revert to the mean either.

For further reading:

Alexander, Carol (2001). Market Models: A Guide to Financial Data Analysis. John Wiley & Sons.


51 comments:

  1. Interesting post. I've found this kind of cointegration between OIH and USO/XLE.

    I usually like to short OIH due to its higher intra-day volatility than XLE and go long XLE as a hedge, or vice versa.

    It doesn't always cointegrate like you mentioned, but every now and then there is an opportunity, e.g. on short-covering days.

  2. Yes, OIH is certainly an alternative to XLE. OIH is the most liquid oil services ETF. Frankly, I don't remember the reason anymore why I chose XLE instead of OIH to do the analysis. They both cointegrate with USO equally well.

  3. Dear Mr. Chan,

    Wonderful blog you have over here. I looked for something like this for a long time.

    You might also try DVN as an alternative to XLE.

    Cheers,
    Max

  4. Max,
    Glad you like my articles, and thanks for your suggestion.
    Hope to exchange more ideas with you in the future!
    Ernie

  5. Ernie and Yaser: For my trading, I've found XLE has a strong advantage over the alternatives: there is a single-stock future available for XLE, and using the SSF drops the margin requirement from 50% to 20%. This extra leverage is very useful in spread trading. It is difficult to capture spread profits without that leverage, due to the small size of spread changes.

  6. Hi Ernie et al.,

    Just wondering if there are any traders out there that use correlation or cointegration on an intra-day time scale to do day-trading. For example taking data samples as fast as 15 seconds, or maybe longer like every 10 minutes. Is there any useful information in time scales that small? I would think that it would depend highly on the volatility/liquidity of the underlyings so that enough margin could be made on the spread for such a strategy to be profitable. Just wondering if you have any experience or opinions on this.
    Cheers,
    Jack

  7. Hi Jack,
    Theoretically, cointegration is time-scale independent. So we cannot say a pair of stocks is cointegrated on a time scale of years, but not minutes. However, it is meaningful to ask what the average mean-reversion time is. I have described elsewhere on this blog (see the Ornstein-Uhlenbeck formula) a good way to estimate this, and it will help you determine whether the pair of stocks is suitable for trading at the time scale of interest.
    Ernie
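
One common way to estimate a mean-reversion time is sketched below. It assumes the spread behaves like a discretely sampled Ornstein-Uhlenbeck process, regresses each change in the spread on the previous level, and converts the slope λ into a half-life of -ln(2)/λ. The referenced post is not reproduced here, so its calculation may differ in detail; the function names and the simulated spread are illustrative assumptions.

```python
import math
import random


def ols_slope(x, y):
    """Slope of the least-squares line y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def half_life(spread):
    """Estimate the mean-reversion half-life of a spread, in bars.

    Regresses the change spread[t] - spread[t-1] on the lagged level spread[t-1];
    a negative slope (lambda) means the spread is pulled back towards its mean.
    """
    lagged = spread[:-1]
    changes = [now - prev for prev, now in zip(spread, spread[1:])]
    lam = ols_slope(lagged, changes)
    if lam >= 0:
        return math.inf  # no measurable pull back towards the mean
    return -math.log(2) / lam


if __name__ == "__main__":
    # Hypothetical spread: AR(1) around a mean of 1 with persistence 0.9.
    rng = random.Random(0)
    level, spread = 1.0, []
    for _ in range(5000):
        level = 1.0 + 0.9 * (level - 1.0) + rng.gauss(0.0, 1.0)
        spread.append(level)
    print("estimated half-life (bars):", round(half_life(spread), 2))
```

The result is expressed in bars of whatever sampling interval the input uses, so the same pair can be judged against a 15-second, 10-minute or daily trading horizon by feeding in data at that frequency.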

  8. Hi Ernie,

    I have been trading pairs in the Indian stock markets. I find your blog very informative and educational. I really appreciate your efforts towards sharing in-depth knowledge on the subject.

    Can you explain the cointegration method via a spreadsheet and, if possible, share the spreadsheet? I would appreciate it if you could explain it in a non-quantitative style. I want to learn how to interpret the output of the cointegration test, i.e. whether the pair is mean-reverting or not for a given time frame.

    Thanks
    Bhumir

  9. Hi Bhumir,
    Thank you for your interest in my blog. Unfortunately, a cointegration test cannot easily be performed in Excel. I performed mine using Matlab. If you purchase my book, you will find sample code showing how to compute this.
    Ernie

  10. Hi Ernie, I have been reading your book. I must say it's very informative and it has helped me tremendously.

    One question though about LeSage's cadf function when testing for co-integration. I notice that if you reverse the order of the y and x parameters (in cadf(y,x,p,nlag)), the resulting t-statistic can be very different for the same two sets of data.

    Using your Matlab sample code 7_2.m as an example, if y is GLD and x is GDX, I get a t-statistic of -3.52. If y is GDX and x is GLD, I get -4.11. So what do I make out of this? Which result should I rely on to see if there's co-integration between the two sets of data? Or should I use both results (or the average of the two) as a guideline?

    Thanks
    Sam

  11. Hi Sam,
    Yes, indeed the results are different depending on which series you pick as the independent variable.
    My rule of thumb is to be conservative: regard a pair as cointegrating only if both t-stats meet the criterion.
    Ernie

  12. Hey Ernie, this is Peter from the University of Cape Town, South Africa. I am writing to ask whether correlation tells you anything meaningful when one is testing for integration. I am testing integration across African markets for my thesis and have used the Engle-Granger cointegration test, but thought it might be nice to include a correlation matrix, and I don't want to look stupid. Also, just to confirm, does it matter if I only use A as the independent and B as the dependent variable and do not test both ways?

    Thanks
    Please respond asap if possible!

  13. Peter,
    Including a correlation matrix will not convince anybody that the African markets are cointegrated. However, it might serve as a useful comparison in technique.

    Indeed cointegration tests are variable-order-dependent, esp. for borderline cases. Try both orders.
    Ernie

  14. Hi, Ernie:
    I have a rather simple question regarding index tracking using cointegration optimal portfolio (following an earlier paper by Dunis & Ho: Cointegration portfolio of European Equities for Index Tracking). Suppose I am able to find cointegration in the following manner: ln(index)=2*ln(p1)+3*ln(p2), where p1 and p2 are the prices of constituent stocks in the index. The paper suggests using the "normalized" parameters for weights (can you please explain what normalization means in that paper?). I assume it is 2/5=0.4 and 3/5=0.6 for weights. Suppose assets 1 and 2 each have a return of 5%; then the portfolio constructed with the 0.4, 0.6 weights would give a 5%*0.4+5%*0.6=5% return. However, by the original cointegration result, ln(index)=2*ln(p1)+3*ln(p2), and by first differencing it (so that both sides become returns), the index return should be 2*5%+3*5%=25%. Clearly the portfolio is not tracking the index. I am sure there is something not right here... Thanks for your help.

    Fuzhi

  15. Hi Fuzhi,
    You have to apply the normalized weights before computing returns, otherwise the two sides won't match. It would be like comparing the P&L of $1 capital with the P&L of $1M capital if you don't normalize by capital.
    Ernie

  16. Ernie:
    Appreciate very much your reply. However, I am still a little confused. Could you please explain again how you would normalize the weights if this is the cointegration results you get: ln(index)=2*ln(p1)+3*ln(p2)
    where "index" is the index price, "p1" and "p2" are the prices of constituent stocks in the index. Seems all are in percent return terms and have nothing to do with the amount of capital.

    Thanks.

    Fuzhi

  17. Fuzhi,
    The 2 and 3 represent units of capital. So clearly we need to normalize them so that both sides have the same total capital, typically 1 unit.

    In any case, I dislike using logs. I prefer raw prices so that the number of shares is fixed.
    Ernie

  18. Dear Ernie, say I manage to identify a good cointegrated pair. My question now is how to work out a hedge ratio. When I regress the price of A on B, compared to B on A, I end up with two different hedge ratios. Which hedge ratio should I choose? I will need to use the residual to determine a band for entry and exit, and depending on which hedge ratio I use, I end up with two different sets of entry and exit levels.

  19. Suny,
    The eigenvector obtained from the Johansen test can be used to determine a unique linear combination (i.e. hedge ratio) of the 2 price series.
    Ernie
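
The asymmetry raised in the question comes from ordinary least squares itself: the slope from regressing A on B times the slope from regressing B on A equals the squared correlation of the two series, so the two hedge ratios are exact reciprocals only when the fit is perfect. The sketch below demonstrates that identity on made-up prices. It does not implement the Johansen test mentioned in the reply, and the helper names are hypothetical.

```python
def ols_slope(x, y):
    """Slope of the least-squares line y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def r_squared(x, y):
    """Squared Pearson correlation of x and y."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy * sxy / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical prices for two stocks (not real data).
    price_b = [20.0, 21.0, 19.5, 22.0, 23.5, 22.5, 24.0, 25.5]
    price_a = [41.0, 41.5, 40.0, 45.5, 46.0, 46.5, 47.0, 52.0]

    a_on_b = ols_slope(price_b, price_a)  # shares of B per share of A, regressing A on B
    b_on_a = ols_slope(price_a, price_b)  # shares of A per share of B, regressing B on A

    print("hedge ratio from A on B:        ", round(a_on_b, 4))
    print("implied by B on A (reciprocal): ", round(1.0 / b_on_a, 4))
    print("product of slopes vs r-squared: ", round(a_on_b * b_on_a, 6), round(r_squared(price_a, price_b), 6))
```

The noisier the relationship, the further the two hedge ratios drift apart, which is one reason a method that treats both series symmetrically is attractive.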

  20. Hi Ernie,
    As you said earlier on this blog, S1 ~ S2 and S2 ~ S1 should both pass the cointegration test.

    1. Let us say S1 ~ S2 is cointegrated while the reverse order is not. Does that mean such a pair is not cointegrated?

    2. Let us say we have 5 stocks to trade. Which one should I use as the independent variable, with the other 4 as dependent, without trying so many combinations?

  21. Hi Jeet,
    1) This indicates the pair is borderline cointegrating. Trade at your own risk!
    2) You should use the Johansen test: it will give you all good combinations of symbols with no unique "independent" variable.
    Ernie

  22. What should be the logic behind choosing the independent set of stocks and the dependent stock from a basket of stocks?

    The Johansen test might give a result, but I am looking for the logic.

  23. Jeet,
    Logic can only be found if you have a fundamental economic understanding of the relationship between the assets. For example, if you believe that firms A and B are both big customers of firm C, you might argue that C's price should be a dependent variable.
    However, I usually do not find it important to find out why a variable is independent: it makes no difference to the trading model.
    Ernie

  24. Dear Mr.Chan,

    I used your file to test ex7_2.m for GLD and GDX.
    The t-statistic is -9.72, not -3.36.
    What's wrong with it??

    By the way, your book is very good.
    Thanks.

  25. Anon,
    Did you use my data file for the test? Did you set all the parameters for the cadf test to be the same as mine?
    Ernie

  26. Dear Mr.Chan,

    I only used copy and paste (ex7-2, jplv7 m-files and GLD/ GLD).

    Parameters? Besides using those m-files, do I also need to change parameters?

    Based on the results, only the t-statistic is wrong.
    The others are similar.

    I tried GLD against GLD, i.e. the same data twice, and the t-statistic is also -9. @@||

  27. 99,
    There is no need to change the input parameters to the cadf function for testing cointegration.

    Are you using the same input data as I used? Have you made sure the dates of those price series are ascending (most recent data on last row)?

    Ernie

  28. Dear Chan,

    Hmm... I used the GLD/GDX files from your server (2006/05/23~2007/11/30 data).
    I ran those files through "adftest.xls", and the "Dickey Fuller Test Statistic" is right.
    Is my Matlab wrong? @.@||

  29. Dear Chan,

    I sent an email to your Gmail with all the m-files and my Matlab screen.
    If you have time, could you take a look at it?
    Thanks, and sorry to disturb you.

  30. 99,
    I did not receive your email (I checked the spam folder too). Could you please resend it?
    Ernie

  31. I am wondering what everyone feels is the most reliable cointegration test in matlab? I have tried egci adf and jci and get widely different results. Then to make matters worse I check with catalystcorner and many pairs show significantly different results there.

  32. Cointegration is not the same as correlation.
    Certainly. It can be proven that the Pearson correlation coefficient will be close to 1 only if the variance of each asset is small relative to the variance of the random walk process that generates the data.

  33. ADF, Variance Ratio, CADF test

    Dear Dr Chan,

    I have a set of pair data using 5min and daily data@9am.

    5min data: ADF = -2.63 vs 10%critical value=2.59, H2 = 0.35, h=1, CADF= -2.6 vs 10% critical value = -3.

    Daily data: ADF = -1.6 vs 10%critical value=2.59, H2 = 0.4, h=0 p=0.8, CADF= -1.9 vs 10% critical value = -3

    May I know if I can trade this pair? How do I handle different time frames where one shows a trend and the other shows weak mean reversion, as above?

    Thanks
    Leo

  35. Hi Leo,
    It looks to me that for both 5 min and daily data, we can't reject the CADF null hypothesis. But that doesn't mean you can't create a profitable mean-reverting strategy: you just have to backtest it with various parameters.

    It is common for an instrument to mean-revert in one timeframe while trending in another. You just have to adapt your strategy to the respective timeframes accordingly.

    Ernie

  36. Dear Dr Ernest,

    1) It looks to me that for both 5 min or daily data, we can't reject the CADF null hypothesis. But that doesn't mean you can't create a profitable mean-reverting strategy: you just have to backtest it with various parameters.
    >> Do you mean adjusting the moving average and the number of standard deviations for the Bollinger band? If the mean reversion is very weak, as indicated by the CADF test, we can expect a very poor result regardless of how well we optimize the Bollinger band. May I know how we can overcome this?

    2) In example 5.1,
    you use audusd data, but I could only find the inputData_AUDCAD_20120426 from your "box" link. Could there be a mistake? The results I get after I replace audusd with audcad look very close.

    3) You use Johansen weights for the hedge ratio. If we normalize them, i.e. weight1/weight2, we see huge spikes, e.g. 20 times. If we use the second set of Johansen weights, the normalized weights also have spikes. The weights vary drastically from day to day. This looks unstable to me.

    Thanks
    Leo

  37. Hi Leo,
    1) Yes, you can optimize the parameters of the Bollinger band. If the best results are still too weak, you shouldn't be trading these assets using mean reversion models.
    2) My box.net has data on all 3 files: AUDCAD, AUDUSD, and USDCAD.
    3) I think I mentioned to you (or some other reader?) before that the Johansen test is not guaranteed to pick the same eigenvector as the "best" one every day, as the order of the eigenvalues does change. So if you want to enforce continuity, you will have to make sure that the same eigenvector is used until its eigenvalue is much "worse" than the best one.

    Ernie

  38. Dear Dr Ernest,

    1) & 2) You are right. This shows that highly correlated assets may not be highly cointegrated for mean reversion.

    3) "I think I mentioned to you (or some other reader?)": yes. I have tried applying a moving average to the eigenvector, switching to the other eigenvector set when one gets too high, and averaging the two eigenvector sets, but all failed. The change is still drastic. Then I tried to identify the new value that is added and the oldest value that is dropped, and compared them to the average for the period (as you taught me). Graphically I don't see much of a change, but the resulting eigenvector set still changes a lot. What can we do about it? On certain days, for both eigenvector sets, I suddenly have to go long, e.g., 20 times more than usual on an asset.

    Thank you
    Leo

  39. Leo,
    Are you saying that even if you stick with one continuously changing eigenvector, the hedge ratio still changes discontinuously? That seems unlikely.
    Ernie

  40. Hi notcher,
    As I wrote before, my web host considers that file too large to upload. If you email me, I can send you a link to the box.net folder.
    Ernie

  41. Dear Dr Chan,

    May I know if you use signal properties, i.e. ADF and cointegration tests, to check for mean reversion in the lookback period and, if they pass, proceed to use a mean-reverting strategy for the next period, hoping that the next period will continue to be mean-reverting until the ADF and cointegration tests on the lookback period indicate otherwise?

    Thank you
    Leo

  42. Hi Leo,
    Actually, we use the ADF test only as a screen for suitable pairs. Once a pair is deemed suitable for pair trading, we just run a Bollinger-band-type strategy on it.
    Ernie

  43. Dear Ernest Chan,

    Do you use Matlab in production? I know prototyping in Matlab is fast, but Matlab is very slow compared to Python, C++ or Java in terms of processing speed.

    Thank you
    Leo

  44. Hi Leo,
    Yes, I use Matlab for trading some low frequency strategies.

    I disagree that Matlab is slower than Python. Please see this academic study: Aruoba, S. Borağan and Fernández-Villaverde, Jesús. 2014. A Comparison of Programming Languages in Economics. NBER Working Paper No. 20263. Available at economics.sas.upenn.edu/~jesusfv/comparison_languages.pdf

    Ernie

  45. Hi Dr Ernie

    Would you run on Unix or Windows for production? I am thinking of using Cygwin to make my PC behave like Unix.

    That paper is very good.

    Thank you
    Leo

  46. Hi Leo,

    Since I am not a high frequency trader, it doesn't matter to me whether we run it in Linux or Windows. We run it in Windows Server.

    Ernie

  47. Hi,
    pg. 111 of "Algorithmic Trading: winning strategies..." refers to 2 files:
    1)inputData_USDCAD_20120426
    2)inputData_AUDUSD_20120426

    But when I go to "http://epchan.com/book2/" I don't see either of them.
    Where can I find them?

  48. Hi Ernie,
    my email: polargnome@yahoo.com

    Thanks
