## Friday, March 03, 2017

### More Data or Fewer Predictors: Which is a Better Cure for Overfitting?

One of the perennial problems in building trading models is the sparseness of data and the attendant danger of overfitting. Fortunately, there are systematic methods of dealing with both ends of the problem. These methods are well-known in machine learning, though most traditional machine learning applications have a lot more data than we traders are used to. (E.g. Google used 10 million YouTube videos to train a deep learning network to recognize cats' faces.)

To create more training data out of thin air, we can resample (perhaps more vividly, oversample) our existing data. This is called bagging. Let's illustrate this using a fundamental factor model described in my new book. It uses 27 factor loadings such as P/E, P/B, Asset Turnover, etc. for each stock. (Note that I call cross-sectional factors, i.e. factors that depend on each stock, "factor loadings" instead of "factors" by convention.) These factor loadings are collected from the quarterly financial statements of S&P 500 companies, and are available from Sharadar's Core US Fundamentals database (as well as more expensive sources like Compustat). The factor model is very simple: it is just a multiple linear regression model with the next quarter's return of a stock as the dependent (target) variable, and the 27 factor loadings as the independent (predictor) variables. Training consists of finding the regression coefficients of these 27 predictors. The trading strategy based on this predictive factor model is equally simple: if the predicted next-quarter return is positive, buy the stock and hold for a quarter. Vice versa for shorts.
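As a rough illustration of the regression and trading rule just described, here is a minimal Python sketch. It runs on synthetic data, not the Sharadar fundamentals, and the names `fit_ols` and `predict`, the planted coefficients, and the noise level are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: each row is one (stock, date)
# observation and each column is one of the 27 factor loadings.
n_rows, n_predictors = 5000, 27
X = rng.standard_normal((n_rows, n_predictors))
true_beta = np.zeros(n_predictors)
true_beta[:2] = [0.02, -0.01]   # hypothetical: only two loadings matter
y = X @ true_beta + 0.05 * rng.standard_normal(n_rows)  # "next-quarter return"

def fit_ols(X, y):
    """Least-squares coefficients; the intercept is stored in position 0."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    """Predicted returns from fitted coefficients."""
    return coef[0] + X @ coef[1:]

coef = fit_ols(X, y)

# Trading rule from the text: long if the predicted return is positive,
# short if it is negative.
X_new = rng.standard_normal((10, n_predictors))
positions = np.sign(predict(coef, X_new))
```

The same coefficients are applied to every row, whichever stock it came from.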

Note there is already a step taken in curing data sparseness: we do not try to build a separate model with a different set of regression coefficients for each stock. We constrain the model such that the same regression coefficients apply to all the stocks. Otherwise, the training data that we use from 200701-201112 will only have 1,260 rows, instead of 1,260 x 500 = 630,000 rows.
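The pooling described above amounts to stacking a days-by-stocks panel into one long table before fitting a single regression. Below is a small sketch with synthetic numbers and much smaller dimensions than the 1,260 x 500 panel in the post. The array names are illustrative and are not taken from the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic panel (the post's panel would be 1260 days x 500 stocks).
n_days, n_stocks, n_predictors = 60, 20, 27

# loadings[t, s, :] holds the factor loadings of stock s on day t.
loadings = rng.standard_normal((n_days, n_stocks, n_predictors))
# future_ret[t, s] holds the future return of stock s as seen from day t.
future_ret = rng.standard_normal((n_days, n_stocks))

# Stack all stocks into one table so a single set of coefficients is fit.
X = loadings.reshape(n_days * n_stocks, n_predictors)
y = future_ret.reshape(n_days * n_stocks)

# Row t * n_stocks + s of X corresponds to stock s on day t.
```

With the post's dimensions the stacked table would have 1,260 x 500 = 630,000 rows.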

The result of this baseline trading model isn't bad: it has a CAGR of 14.7% and Sharpe ratio of 1.8 in the out-of-sample period 201201-201401. (Caution: this portfolio is not necessarily market or dollar neutral. Hence the return could be due to a long bias enjoying the bull market in the test period. Interested readers can certainly test a market-neutral version of this strategy hedged with SPY.) I plotted the equity curve below.

Next, we resample the data by randomly picking N (=630,000) data points with replacement to form a new training set (a "bag"), and we repeat this K (=100) times to form K bags. For each bag, we train a new regression model. At the end, we average over the predicted returns of these K models to serve as our official predicted returns. This results in marginal improvement of the CAGR to 15.1%, with no change in Sharpe ratio.
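The bagging procedure above can be sketched in Python as follows. This is an illustration on synthetic data, not the code behind the reported numbers. The helper names (`fit_ols`, `predict`, `bagged_predict`) and the data-generating assumptions are mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training and test sets (hypothetical stand-ins for the real data).
n_train, n_test, p = 2000, 200, 27
beta = rng.normal(0.0, 0.01, p)
X_train = rng.standard_normal((n_train, p))
y_train = X_train @ beta + 0.05 * rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, p))

def fit_ols(X, y):
    """Least-squares coefficients; the intercept is stored in position 0."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def bagged_predict(X_train, y_train, X_test, rng, n_bags=100):
    """Average the predictions of n_bags models, each fit on one bag."""
    n = len(X_train)
    total = np.zeros(len(X_test))
    for _ in range(n_bags):
        idx = rng.integers(0, n, size=n)   # N draws with replacement
        coef = fit_ols(X_train[idx], y_train[idx])
        total += predict(coef, X_test)
    return total / n_bags

y_hat_bagged = bagged_predict(X_train, y_train, X_test, rng)
y_hat_single = predict(fit_ols(X_train, y_train), X_test)
```

For a plain linear regression the bagged predictions stay close to those of the single model, which is consistent with the marginal change reported above.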

Now, we try to reduce the predictor set. We use a method called "random subspace". We randomly pick half of the original predictors to train a model, and repeat this K=100 times. Once again, we average over the predicted returns of all these models. Combined with bagging, this results in further marginal improvement of the CAGR to 15.1%, again with little change in Sharpe ratio.
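A sketch of random subspace combined with bagging, again on synthetic data. Since 27 is odd, "half" is taken here to mean 13 predictors, which is an assumption, as are the function names and the data-generating process.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data (hypothetical stand-ins for the real factor loadings).
n_train, n_test, p = 2000, 200, 27
beta = rng.normal(0.0, 0.01, p)
X_train = rng.standard_normal((n_train, p))
y_train = X_train @ beta + 0.05 * rng.standard_normal(n_train)
X_test = rng.standard_normal((n_test, p))

def fit_ols(X, y):
    """Least-squares coefficients; the intercept is stored in position 0."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def subspace_bagged_predict(X_train, y_train, X_test, rng, n_models=100):
    """Average models fit on a random half of the predictors and a bootstrap bag."""
    n, p = X_train.shape
    k = p // 2                                        # 13 of the 27 predictors
    total = np.zeros(len(X_test))
    for _ in range(n_models):
        cols = rng.choice(p, size=k, replace=False)   # random subspace
        rows = rng.integers(0, n, size=n)             # bootstrap bag
        coef = fit_ols(X_train[np.ix_(rows, cols)], y_train[rows])
        total += predict(coef, X_test[:, cols])
    return total / n_models

y_hat = subspace_bagged_predict(X_train, y_train, X_test, rng)
```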

The improvements from either method may not seem large so far, but at least it shows that the original model is robust with respect to randomization.

But there is another method for reducing the number of predictors. It is called stepwise regression. The idea is simple: we pick one predictor from the original set at a time, and add it to the model only if the BIC (Bayesian Information Criterion) decreases. BIC is essentially the negative log likelihood of the training data based on the regression model, with a penalty term proportional to the number of predictors. That is, if two models have the same log likelihood, the one with the larger number of parameters will have a larger BIC and thus be penalized. Once we reach the minimum BIC, we then try to remove one predictor from the model at a time, until the BIC cannot decrease any further. Applying this to our fundamental factor loadings, we achieve a quite significant improvement of the CAGR over the base model: 19.1% vs. 14.7%, with the same Sharpe ratio.
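The forward-then-backward search described above can be sketched as follows. This is an illustrative Python version on synthetic data and not the implementation used for the reported results. The BIC here is the usual Gaussian form, n·ln(RSS/n) + k·ln(n), up to an additive constant. The planted predictors (columns 4 and 11) and all function names are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data in which only columns 4 and 11 carry signal (hypothetical).
n, p = 3000, 27
X = rng.standard_normal((n, p))
y = 0.03 * X[:, 4] - 0.02 * X[:, 11] + 0.05 * rng.standard_normal(n)

def bic(X, y, cols):
    """Gaussian BIC (up to a constant) of an OLS fit on the chosen columns."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    n_obs, k = A.shape
    return n_obs * np.log(rss / n_obs) + k * np.log(n_obs)

def stepwise_bic(X, y):
    selected = []
    best = bic(X, y, selected)
    # Forward pass: add the predictor that lowers BIC the most, if any does.
    while len(selected) < X.shape[1]:
        candidates = [j for j in range(X.shape[1]) if j not in selected]
        scores = {j: bic(X, y, selected + [j]) for j in candidates}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        selected.append(j_best)
        best = scores[j_best]
    # Backward pass: drop a predictor whenever doing so lowers BIC further.
    while selected:
        scores = {j: bic(X, y, [c for c in selected if c != j]) for j in selected}
        j_best = min(scores, key=scores.get)
        if scores[j_best] >= best:
            break
        selected.remove(j_best)
        best = scores[j_best]
    return sorted(selected), best

selected, best_bic = stepwise_bic(X, y)
```

Every remaining candidate is tried at each step, so the search involves no randomness. On this synthetic data the two planted predictors should be among those selected.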

It is also satisfying that the stepwise regression model picked only two variables out of the original 27. Let that sink in for a moment: just two variables account for all of the predictive power of a quarterly financial report! As to which two variables these are - I will reveal that in my talk at QuantCon 2017 on April 29.

===

My Upcoming Workshops

March 11 and 18: Cryptocurrency Trading with Python

I will be moderating this online workshop for my friend Nick Kirk, who taught a similar course at CQF in London to wide acclaim.

May 13 and 20: Artificial Intelligence Techniques for Traders

I will discuss in detail AI techniques such as those described above, with other examples and in-class exercises. As usual, nuances and pitfalls will be covered.

Honey said...

Wow! Only two variables! :D But how could we estimate that those two variables are sufficient for the last 40 days but not for 400 days?

Honey said...

BTW your results also convey that KISS principle Rocks! https://en.wikipedia.org/wiki/KISS_principle

David Bryant said...

Great article. Could also use Akaike information criterion.

Ernie Chan said...

Hi Honey,
Those 2 variables worked for the entire in-sample and out-of-sample period. So I am not sure what you meant by "last 40 days".

Yes, I have always been a big fan of KISS!

Ernie

Ernie Chan said...

Thank you, David!

Yes, AIC and BIC are both used often.

Ernie

Eduardo Gonzatti said...

Dr Chan, nice post, as usual!
Did you try to run a LS version of this simple strategy? Shorting those that seemed overpriced, or, hedging with the SP500 as you said? Could you post the rough numbers, if you did?

Thanks!

Best Regards

Ernie Chan said...

Thanks, Eduardo!

I have not tested the market neutral version. However, as presented, it is already LS, though the 2 sides may not net to 0.

Ernie

Eduardo Gonzatti said...

Do you think that the long bias from the market during the OOS window could have inflated the results by this much?

Best Regards

Ernie Chan said...

Hi Eduardo,
I don't know how much the net exposure may have inflated the result. My guess is that there is a long bias in the portfolio, because most stocks have positive returns during both the training and OOS periods.
Ernie

Eduardo Gonzatti said...

Thanks for the caveat, Dr!

dashiell said...

Great post.

Sorry, this may be a very silly question, but I'm having trouble following what each row of data represents. How do you get 1260 rows for each stock? Based on the number, it seems like that represents daily returns for 5 years, but it didn't seem like daily returns were being used here. Thanks.

Thomas said...

Why not just do L1/L2 regularization?

Ernie Chan said...

Hi dashiell,

5 years * 252 trading days per year = 1260 rows.

Daily returns are used.

Ernie

Ernie Chan said...

Hi Thomas,
Thank you for mentioning L1/L2 regularization as yet another method to reduce overfitting.

I am not claiming that the methods I outlined above are superior to every other ML method out there such as L1/L2 regularization. If you are able to present the L1/L2 results here, you would be most welcome. Otherwise, I will try it and present my findings at QuantCon 2017.

Ernie

Anonymous said...

So you try one variable and check BIC. Then add another (at random?) and check for decreased BIC? Is it as simple as an iterative search for lowest BIC?

Ernie Chan said...

In stepwise regression, we try *every* variable at each iteration, with replacement, and pick the one that has the lowest BIC, either to be added or removed.

So no randomness involved.

Ernie

Ever Garcia said...

Does the transaction cost adjust based on the CAGR for say \$100K?

Anonymous said...

Interesting post. I did not understand your answer to dashiell on the number of training rows available, the post refers to using factor loadings from quarterly reports as the only independent factors in the regression model, how can you use it for daily returns? It would seem you have 4 different points per year so a total of only 5x4=20 points per stock for predicting the next quarterly return, unless I misunderstood the regression setup. Can you clarify please?

Ernie Chan said...

Hi Ever,
I typically just take 5bps as transaction costs for S&P 500 stocks. It doesn't depend on your account NAV.
I haven't, however, deducted transaction costs in my results discussed above.

Ernie

Ernie Chan said...

Even though each stock has new data 4 times a year, the stocks all have different earnings release dates. Hence we have to compare all the stocks' fundamental numbers every day to make investment decisions.

Please see Example 2.2 in my new book for details and codes.

Ernie

Ever Garcia said...

Thanks for the clarification.

Anonymous said...

Hi Ernie,

When we backtest a trading strategy, after we get daily P&L, how do we get daily returns of the strategy?

If we assume different initial equity, we may get different volatility later.

Thanks.

Ernie Chan said...

Hi,
There are 2 types of returns you can consider.

For unlevered returns, divide the P&L by the "gross absolute market value" of your portfolio. For example, if you are long \$10 AAPL and short \$5 MSFT, the gross absolute market value is \$15.

For levered returns, divide the P&L by the NAV of the account. In this case, yes, different leverages (i.e. initial NAV) will result in different returns.

Ernie
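The two definitions in the reply above can be written out in a few lines of Python. Only the AAPL and MSFT position sizes come from the reply. The P&L and NAV figures, and the function names, are made up for illustration.

```python
def unlevered_return(pnl, positions):
    """P&L divided by the gross absolute market value of the positions."""
    gross = sum(abs(value) for value in positions.values())
    return pnl / gross

def levered_return(pnl, nav):
    """P&L divided by the net asset value of the account."""
    return pnl / nav

# Long 10 dollars of AAPL and short 5 dollars of MSFT: gross value is 15.
positions = {"AAPL": 10.0, "MSFT": -5.0}
pnl = 0.30                                   # hypothetical one-day P&L

r_unlevered = unlevered_return(pnl, positions)   # 0.30 / 15
r_levered = levered_return(pnl, nav=7.5)         # hypothetical NAV of 7.50
```

With a NAV of 7.50 the account is levered 2x, so the levered return is twice the unlevered one. A different NAV would change only the levered figure.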

Anonymous said...

Hi Ernie,

Thank you for the quick response!

If we want to do portfolio optimization (portfolio of strategies, not assets), how do we get co-variance matrix according to returns computation methods you mentioned above?

Many thanks.

Ernie Chan said...

Hi,
For portfolio optimization, one should use unlevered returns.
Ernie

Anonymous said...

Hi Ernie,

Using Johansen-based pairs trading, is it possible to get intraday signals when we only use end-of-day prices?

Ernie Chan said...

Hi,
If you determined that a pair is cointegrating based on applying Johansen test to daily prices, it is quite OK to trade it intraday as well.
Ernie

aagold said...

Ernie,

The main result of your article is "we achieve a quite significant improvement of the CAGR over the base model: 19.1% vs. 14.7%, with the same Sharpe ratio."

Are you calculating CAGR using levered or unlevered returns?

I'm surprised the CAGR can be so different when the Sharpe Ratio is the same. Eq. 7.3 from Thorp's Kelly Criterion paper shows that the maximum achievable growth rate of a continuous-time diffusion is a function of Sharpe Ratio S (g_opt = S^2/2 + risk-free-rate). So if you're seeing such a major difference in CAGR without any difference in S then it must be because you're operating far from the Kelly-optimal leverage. If you optimized leverage then you probably wouldn't see any difference in CAGR.

Ernie Chan said...

aagold,
You have a good point.

I was reporting unlevered returns. Indeed, if we were to use Kelly-optimal leverage, the levered returns would be the same - but that's assuming that the returns distribution is close to Gaussian. There are usually fat tails that render the Kelly leverage sub-optimal, and so we would prefer a strategy that has higher unlevered returns if its Sharpe ratio is the same as that of the alternative.
Ernie
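The Kelly argument in this exchange can be checked numerically under the Gaussian (continuous-time diffusion) assumption, where the growth rate at leverage f is r + f·m − (f·s)²/2 for mean excess return m and volatility s. The two strategies below are hypothetical and are not the 27-variable and 2-variable models. They were chosen only because both have a Sharpe ratio of 1.8.

```python
def growth_rate(leverage, mean_excess, vol, risk_free=0.0):
    """Compounded growth rate of a diffusion held at constant leverage."""
    return risk_free + leverage * mean_excess - 0.5 * (leverage * vol) ** 2

def kelly_leverage(mean_excess, vol):
    """Leverage that maximizes growth_rate."""
    return mean_excess / vol ** 2

# Two hypothetical strategies with the same Sharpe ratio (0.09/0.05 = 0.18/0.10).
a = {"mean_excess": 0.09, "vol": 0.05}
b = {"mean_excess": 0.18, "vol": 0.10}

# Unlevered (leverage = 1): the higher-volatility strategy grows faster.
g_a_unlevered = growth_rate(1.0, **a)
g_b_unlevered = growth_rate(1.0, **b)

# At Kelly-optimal leverage both reach the same growth rate, S^2 / 2.
g_a_kelly = growth_rate(kelly_leverage(**a), **a)
g_b_kelly = growth_rate(kelly_leverage(**b), **b)
```

The equality at Kelly leverage depends on the Gaussian assumption. With fat tails it need not hold, which is the caveat raised in the reply.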

aagold said...

Thanks for the quick response.

Just out of curiosity - since the original 27-variable model and the optimized 2-variable version have the same sharpe ratio but different CAGR, one must have both a higher numerator (mean excess return) and a higher denominator (standard deviation) such that the changes cancel out. Which model has both higher mean excess return and higher standard deviation?

I'm trying to figure out if the CAGR improvement of the 2-variable model happened because it's both more volatile and has higher return than the 27-variable version, which implies the 27-variable version was far below kelly-optimal leverage, or if it's the opposite which implies the 27-variable version was far above kelly-optimal leverage.

Ernie Chan said...

aagold,

The 2-variable model has higher CAGR and average annualized (uncompounded) return too. The latter is what goes into Sharpe ratio. Since it has the same Sharpe ratio as the 27-variable model, it must mean that the 2-variable model has higher volatility.

Ernie

aagold said...

Ernie,

I think it would be interesting to explore how Kelly optimization interacts with the predictor-variable reduction work you've discussed.

For example, how does the Kelly-optimized 27-variable CAGR compare to the Kelly-optimized 2-variable CAGR? They won't necessarily turn out the same, even though the out-of-sample unleveraged Sharpe ratios ended up the same, because the leverage optimization is done using in-sample data and the testing is done with out-of-sample data. I suspect the in-sample leverage optimization using the 2-variable model will be more effective than that using the 27-variable model.

I don't have your data and backtesting software, otherwise I'd investigate this myself, but if you're interested in doing that study then I think it would be interesting.

Regards,
aagold

Ernie Chan said...

aagold,
Interesting idea. I will look into this when I have some time.
Thanks,
Ernie

aagold said...

Ernie,

I've got a comment on your use of the term "factor loading" in this and a few other blog posts I've seen. I believe the term "characteristic" is used in the literature for a particular stock or portfolio's B/P, E/P, etc. ratio, not "factor loading". The term "factor loading" is used when a time-series regression is done to calculate how much exposure a stock or portfolio has to a "factor" or "risk factor". For example, if we regress the time series of a particular stock's returns against HML (long portfolio of high B/P stocks, short portfolio of low B/P stocks) returns, the calculated regression coefficient (beta) is that stock's "factor loading" on the "value factor" (HML).

Here's a paper that studies the relationship between these two concepts:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2549578

Here's the abstract:
We develop a methodology for bias-corrected return-premium estimation from cross-sectional regressions of individual stock returns on betas and characteristics. Over the period from July 1963 to December 2013, there is some evidence of positive beta premiums on the profitability and investment factors of Fama and French (2014), a negative premium on the size factor and a less robust positive premium on the market, but no reliable pricing evidence for the book-to-market and momentum factors. Firm characteristics consistently explain a much larger proportion of variation in estimated expected returns than factor loadings, however, even with all six factors included in the model.

Here's another relevant recent paper discussing this issue:

Regards,
aagold

Ernie Chan said...

aagold,
There can be different names for the same object.

My terminology is based on the widely used graduate finance text "Statistics and Data Analysis for Financial Engineering" (2nd ed) by Profs. David Ruppert and David Matteson of Cornell University. In Section 18.5, for example, they discuss cross-sectional factor models, which is the category of factor models I described in this post. On p. 539, they wrote "In a cross-sectional factor model ...; the loadings are directly measured and the factor values are estimated by regression". The loadings here are the fundamental characteristics such as P/E ratio that you referred to.

Ernie

aagold said...

Ok, to each his own I guess, but personally I find that terminology very counter-intuitive and confusing. I think if you do a broader literature search you'll find that calling equity multiples like P/E or P/B (or their reciprocals) "factor loadings" is very rare; in fact, the book you mentioned may be the only place it's done.

Here's a little more evidence:
http://faculty.som.yale.edu/zhiwuchen/Investments/Fama-92.pdf

On page 4 of the classic Fama-French (1992) paper "The Cross-section of Expected Returns", the authors write "Our asset pricing tests use the cross-sectional regression approach of Fama-Macbeth (1973). Each month the cross-section on stocks is regressed on variables hypothesized to explain expected returns". Note: these are the guys who basically invented cross-sectional regressions, and nowhere in this paper, or any other paper from them I'm aware of, do they refer to these variables as "factor loadings".

Regards,
aagold

Ernie Chan said...

aagold,
Yes, the terminology is a bit confusing.

However, I find that using "characteristic" to describe them doesn't make it any easier to remember whether the regression coefficients should be called "factors" or "factor loadings". In time series factor models, the "factors" such as HML are observable, and the factor loadings are unobservable, but in cross-sectional factor models, the "factors" are the regression coefficients which are unobservable. Rather than introducing yet another term "characteristic", I just stuck to 2 terms.

To confuse the matter still further, some books such as Active Portfolio Management by Grinold and Kahn refer to the "characteristics" as "factors", and refer to the regression coefficients as "factor loadings", exactly the opposite terminology of what you and Ruppert's book both use!

Ernie

aagold said...

Hi Ernie,

Thanks for your responses, glad to see you're not annoyed by this topic! :-)
Some people find this type of discussion useless but I find terminology is actually very important in scientific discussions.

Actually I think Grinold & Kahn is very consistent with my use of the term "characteristic". They make heavy use of a concept called "characteristic portfolios" on pages 28-35. Here's what they say: "Assets have a multitude of attributes, such as betas, expected returns, E/P ratios, capitalization, membership in an economic sector, and the like. In this appendix, we will associate a characteristic portfolio with each asset attribute". You can also look at Grinold & Kahn page 55 (eq. 3.16) where they define terms like factor loading/exposure and factor return - all consistent with what I've said.

You wrote, "In time series factor models, the "factors" such as HML are observable, and the factor loadings are unobservable, but in cross-sectional factor models, the "factors" are the regression coefficients which are unobservable." That may be Ruppert's terminology, but have you seen that reversal of roles between "factors" and "factor loadings" in the context of cross-sectional regressions anywhere else? I haven't.

I think this confusion stems from the two-step nature of a "Fama-Macbeth Regression", which involves both a time-series Step 1 and a cross-sectional Step 2.

In both these examples, the authors call the output of Step 2 a "risk premium", not a "factor". Note: it is true that the observable inputs to Step 2 are called "factor loadings" in general discussions of F-M regressions, since it's assumed these betas/exposures were derived in Step 1 time-series regressions. However, in the case when a characteristic such as B/P, E/P, etc. is directly observed rather than being calculated in Step 1, they're generally called "characteristics" rather than exposures or factor loadings.

Regards,
aagold


Ernie Chan said...

Hi aagold,
I am not sure which edition of Grinold & Kahn you are reading, but I am using an old version from 1995. There, on page 48, section titled "Cross-Sectional Comparisons", it starts "These FACTORS [my emphasis] compare attributes of the stocks with no link to the remainder of the economy. These cross-sectional attributes can themselves be classified into two groups: fundamental and market. Fundamental attributes include ratios such as dividend yield and earnings yield, ...".

Hence Grinold & Kahn called "dividend yield" a "factor", while I, Ruppert & Matteson called it a "factor loading", and you called it a "characteristic". Note you did not call it a "factor" - if you did, I might agree that all you did was to reverse the naming convention of Ruppert and Matteson and I would give it a pass.

I think it is clear that there is no industry standard for naming these attributes. What Grinold & Kahn called "characteristic portfolios" Ruppert & Matteson called "hedge portfolios".

I am afraid I do not share your enthusiasm in this case of adopting one author or the other's terminology, so I think I will stop here!

Ernie

GR said...

Hi Ernie,

Really enjoyed your post. It is very timely for me as I am in the process of conducting a very similar test using a different data set. I am currently running into an out of memory problem and was hoping to gain some insight into how you conducted your test?

Your test consisted of 500 stocks x 5 yrs x 252 trading days per year = 630,000 rows for the 500 stocks. Your test had 1 dependent (target) variable and the 27 independent (predictor) variables = 28 columns. Therefore, your Matlab matrix size was 630,000 x 28 = 17,640,000.

I am running Win 10, 64-bit, have 32 GB of RAM, i7-4790 3.60GHz Intel processor and cannot complete a stepwise regression on my database in Matlab due to "out of memory issues" for a matrix sized 50122 x 147 (7,367,934). I have my data saved in a table.

Did you run this simulation on a super computer or on a high performance PC? Looking for any tips to get this simulation going in Matlab is appreciated.

Also, why use 252 days of data for each stock? If the majority of fundamental data only changes quarterly (excluding P/E, etc.), couldn't the simulation be run every 63 (252/4) trading days for each stock? I am concerned about overweighting independent variables that don't change but every 63 days.

GR

Ernie Chan said...

Hi GR,
I ran my backtest on a single PC, with just 16 GB RAM.

My program is similar to that displayed on pages 113-114 of my book Machine Trading. I have never run into memory problems.

We must use daily data, because different stocks' data get updated on different days. We will update the positions whenever there is an update.

Ernie