Friday, March 03, 2017

More Data or Fewer Predictors: Which is a Better Cure for Overfitting?

One of the perennial problems in building trading models is the sparseness of data and the attendant danger of overfitting. Fortunately, there are systematic methods of attacking the problem from both ends: creating more training data, and using fewer predictors. These methods are well known in machine learning, though most traditional machine learning applications have a lot more data than we traders are used to. (E.g. Google used 10 million images taken from YouTube videos to train a deep learning network to recognize cats' faces.)

To create more training data out of thin air, we can resample (perhaps more vividly, oversample) our existing data. This is called bagging. Let's illustrate this using a fundamental factor model described in my new book. It uses 27 factor loadings such as P/E, P/B, Asset Turnover, etc. for each stock. (Note that, by convention, I call cross-sectional factors, i.e. factors that vary from stock to stock, "factor loadings" instead of "factors".) These factor loadings are collected from the quarterly financial statements of S&P 500 companies, and are available from Sharadar's Core US Fundamentals database (as well as from more expensive sources like Compustat). The factor model is very simple: it is just a multiple linear regression model with the next quarter's return of a stock as the dependent (target) variable, and the 27 factor loadings as the independent (predictor) variables. Training consists of finding the regression coefficients of these 27 predictors. The trading strategy based on this predictive factor model is equally simple: if the predicted next-quarter return is positive, buy the stock and hold it for a quarter. Vice versa for shorts.

Note that we have already taken one step toward curing data sparseness: we do not try to build a separate model with a different set of regression coefficients for each stock. We constrain the model such that the same regression coefficients apply to all the stocks. Otherwise, the training data from 200701-201112 would give us only 1,260 rows (one per trading day) per stock, instead of 1,260 x 500 = 630,000 rows pooled across all the stocks.
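
To make the setup concrete, here is a minimal Python sketch of the pooled regression using scikit-learn. This is not the code from the book; the DataFrame data, the column list factor_cols, and the target column next_qtr_ret are just placeholder names standing in for the actual data layout.

from sklearn.linear_model import LinearRegression

def fit_pooled_factor_model(data, factor_cols, target_col="next_qtr_ret"):
    # One set of coefficients shared by all stocks: each row of `data` is a
    # (date, stock) pair with the 27 factor loadings as features and the
    # realized next-quarter return as the target.
    X = data[factor_cols].values
    y = data[target_col].values
    return LinearRegression().fit(X, y)

# Trading rule: long if the predicted next-quarter return is positive, short otherwise.
# signals = fit_pooled_factor_model(train, FACTOR_COLS).predict(test[FACTOR_COLS].values) > 0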

The result of this baseline trading model isn't bad: it has a CAGR of 14.7% and a Sharpe ratio of 1.8 in the out-of-sample period 201201-201401. (Caution: this portfolio is not necessarily market or dollar neutral. Hence the return could be due to a long bias enjoying the bull market in the test period. Interested readers can certainly test a market-neutral version of this strategy hedged with SPY.) I plotted the equity curve below.

[Equity curve of the baseline strategy, 201201-201401]

Next, we resample the data by randomly picking N (= 630,000) data points with replacement to form a new training set (a "bag"), and we repeat this K (= 100) times to form K bags. For each bag, we train a new regression model. At the end, we average the predicted returns of these K models to serve as our official predicted returns. This results in a marginal improvement of the CAGR to 15.1%, with no change in the Sharpe ratio.
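
Here is a rough Python sketch of this bagging procedure, assuming the pooled training and test data are already in NumPy arrays X_train, y_train, and X_test (placeholder names, not variables from the book):

import numpy as np
from sklearn.linear_model import LinearRegression

def bagged_predictions(X_train, y_train, X_test, K=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.zeros((K, len(X_test)))
    for k in range(K):
        idx = rng.integers(0, n, size=n)   # one "bag": n rows drawn with replacement
        model = LinearRegression().fit(X_train[idx], y_train[idx])
        preds[k] = model.predict(X_test)
    return preds.mean(axis=0)              # average of the K models' predicted returns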

Now we try to reduce the predictor set, using a method called "random subspace": we randomly pick half of the original predictors to train each model, and repeat this K = 100 times. Once again, we average the predicted returns of all these models. Combined with bagging, this gives a CAGR of 15.1%, again with little change in the Sharpe ratio.
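
A similar sketch for random subspace combined with bagging, with the same placeholder arrays: each of the K models now sees a bootstrap sample of the rows and a random half of the predictors.

import numpy as np
from sklearn.linear_model import LinearRegression

def random_subspace_predictions(X_train, y_train, X_test, K=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X_train.shape
    preds = np.zeros((K, len(X_test)))
    for k in range(K):
        rows = rng.integers(0, n, size=n)                 # bagging: resample rows
        cols = rng.choice(p, size=p // 2, replace=False)  # random subspace: half the predictors
        model = LinearRegression().fit(X_train[np.ix_(rows, cols)], y_train[rows])
        preds[k] = model.predict(X_test[:, cols])
    return preds.mean(axis=0)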

The improvements from either method may not seem large so far, but at least they show that the original model is robust with respect to randomization.

There is yet another method for reducing the number of predictors. It is called stepwise regression. The idea is simple: we add one predictor from the original set to the model at a time, keeping it only if the BIC (Bayesian Information Criterion) decreases. BIC is essentially the negative log likelihood of the training data under the regression model, plus a penalty term proportional to the number of predictors. That is, if two models have the same log likelihood, the one with the larger number of parameters will have a larger BIC and thus be penalized. Once we reach the minimum BIC, we then try removing one predictor from the model at a time, until the BIC cannot decrease any further. Applying this to our fundamental factor loadings, we achieve a quite significant improvement of the CAGR over the base model: 19.1% vs. 14.7%, with the same Sharpe ratio.
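
For concreteness, here is a rough sketch of such a stepwise search using statsmodels, whose fitted OLS result exposes a bic attribute (recall BIC = k*ln(n) - 2*ln(L), where k is the number of parameters, n the number of data points, and L the maximized likelihood). This version interleaves forward and backward steps, which is one common variant; it is not necessarily the exact routine behind the results above, and the names train, FACTOR_COLS, and next_qtr_ret in the usage comment are placeholders.

import statsmodels.api as sm

def stepwise_bic(X, y):
    # X: DataFrame of candidate predictors, y: Series of target returns.
    selected = []
    best_bic = float("inf")
    improved = True
    while improved:
        improved = False
        best_add = None
        # Forward step: try adding each predictor not yet in the model.
        for col in (c for c in X.columns if c not in selected):
            bic = sm.OLS(y, sm.add_constant(X[selected + [col]])).fit().bic
            if bic < best_bic:
                best_bic, best_add, improved = bic, col, True
        if best_add is not None:
            selected.append(best_add)
        # Backward step: try removing each predictor currently in the model.
        for col in list(selected):
            trial = [c for c in selected if c != col]
            if trial:
                bic = sm.OLS(y, sm.add_constant(X[trial])).fit().bic
                if bic < best_bic:
                    best_bic, improved = bic, True
                    selected.remove(col)
    return selected

# e.g. chosen = stepwise_bic(train[FACTOR_COLS], train["next_qtr_ret"])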

It is also satisfying that the stepwise regression model picked only two variables out of the original 27. Let that sink in for a moment: just two variables account for all of the predictive power of a quarterly financial report! As to which two variables these are - I will reveal that in my talk at QuantCon 2017 on April 29.

===

My Upcoming Workshops

March 11 and 18: Cryptocurrency Trading with Python

I will be moderating this online workshop for my friend Nick Kirk, who taught a similar course at CQF in London to wide acclaim.

May 13 and 20: Artificial Intelligence Techniques for Traders

I will discuss in detail AI techniques such as those described above, with other examples and in-class exercises. As usual, nuances and pitfalls will be covered.

21 comments:

Honey said...

Wow! Only two variables! :D But how could we estimate that those two variables are sufficient for the last 40 days but not for 400 days?

Honey said...

BTW your results also convey that the KISS principle rocks! https://en.wikipedia.org/wiki/KISS_principle

David Bryant said...

Great article. One could also use the Akaike Information Criterion (AIC).

Ernie Chan said...

Hi Honey,
Those 2 variables worked for the entire in-sample and out-of-sample period, so I am not sure what you meant by "last 40 days".

Yes, I have always been a big fan of KISS!

Ernie

Ernie Chan said...

Thank you, David!

Yes, AIC and BIC are both used often.

Ernie

Eduardo Gonzatti said...

Dr Chan, nice post, as usual!
Did you try to run a LS version of this simple strategy? Shorting those that seemed overpriced, or, hedging with the SP500 as you said? Could you post the rough numbers, if you did?

Thanks!

Best Regards

Ernie Chan said...

Thanks, Eduardo!

I have not tested the market neutral version. However, as presented, it is already LS, though the 2 sides may not net to 0.

Ernie

Eduardo Gonzatti said...

Sorry Dr, I misread it!
Do you think that the long bias from the market during the OOS window could have inflated the results by this much?

Best Regards

Ernie Chan said...

Hi Eduardo,
I don't know how much the net exposure may have inflated the result. My guess is that there is a long bias in the portfolio, because most stocks have positive returns during both the training and OOS periods.
Ernie

Eduardo Gonzatti said...

Thanks for the caveat, Dr!

dashiell said...

Great post.

Sorry, this may be a very silly question, but I'm having trouble following what each row of data represents. How do you get 1260 rows for each stock? Based on the number, it seems like that represents daily returns for 5 years, but it didn't seem like daily returns were being used here. Thanks.

Thomas said...

Why not just do L1/L2 regularization?

Ernie Chan said...

Hi dashiell,

5 years * 252 trading days per year = 1260 rows.

Daily returns are used.

Ernie

Ernie Chan said...

Hi Thomas,
Thank you for mentioning L1/L2 regularization as yet another method to reduce overfitting.

I am not claiming that the methods I outlined above are superior to every other ML method out there such as L1/L2 regularization. If you are able to present the L1/L2 results here, you would be most welcome. Otherwise, I will try it and present my findings at QuantCon 2017.

Ernie

Anonymous said...

So you try one variable and check BIC. Then add another (at random?) and check for decreased BIC? Is it as simple as an iterative search for lowest BIC?

Ernie Chan said...

In stepwise regression, we try *every* variable at each iteration, with replacement, and pick the one that has the lowest BIC, either to be added or removed.

So no randomness involved.

Ernie

Ever Garcia said...

Is the CAGR adjusted for transaction costs, say for a $100K account?

Anonymous said...

Interesting post. I did not understand your answer to dashiell on the number of training rows available. The post refers to using factor loadings from quarterly reports as the only independent variables in the regression model, so how can you use it for daily returns? It would seem you have 4 data points per year, for a total of only 5 x 4 = 20 points per stock for predicting the next quarterly return, unless I misunderstood the regression setup. Can you clarify please?

Ernie Chan said...

Hi Ever,
I typically just take 5bps as transaction costs for S&P 500 stocks. It doesn't depend on your account NAV.
I haven't, however, deducted transaction costs in my results discussed above.

Ernie

Ernie Chan said...

Even though each stock has new data only 4 times a year, the stocks all have different earnings release dates. Hence we have to compare all the stocks' fundamental numbers every day to make investment decisions.

Please see Example 2.2 in my new book for details and code.

Ernie

Ever Garcia said...

Thanks for the clarification.