Friday, April 26, 2019

Is News Sentiment Still Adding Alpha?

By Ernest Chan and Roger Hunter

Nowadays it is nearly impossible to step into a quant trading conference without being bombarded with flyers from data vendors and panel discussions on news sentiment. Our team at QTS has made a vigorous effort in the past trying to extract value from such data, with indifferent results. But the central quandary of testing pre-processed alternative data is this: is the null result due to the lack of alpha in such data, or is the data pre-processing by the vendor faulty? We, like many quants, do not have the time to build a natural language processing engine ourselves to turn raw news stories into sentiment and relevance scores (though NLP was the specialty of one of us back in the day), and we rely on the data vendor to do the job for us. The fact that we couldn't extract much alpha from one such vendor does not mean news sentiment is in general useless.

So it was with some excitement that we heard Two Sigma, the $42B+ hedge fund, was sponsoring a news sentiment competition at Kaggle, providing free sentiment data from Thomson-Reuters for testing. That data started from 2007 and covers about 2,000 US stocks (those with daily trading dollar volume of roughly $1M or more), and complemented with price and volume of those stocks provided by Intrinio. Finally, we get to look for alpha from an industry-leading source of news sentiment data!

The evaluation criterion of the competition is effectively the Sharpe ratio of a user-constructed market-neutral portfolio of stock positions held over 10 days.  (By market-neutral, we mean zero beta. Though that isn't the way Two Sigma put it, it can be shown statistically and mathematically that their criterion is equivalent to our statement.) This is conveniently the Sharpe ratio of the "alpha", or excess returns, of a trading strategy using news sentiment.

It may seem straightforward to devise a simple trading strategy to test for alpha with pre-processed news sentiment scores, but Kaggle and Two Sigma together made it unusually cumbersome and time-consuming to conduct this research. Here are some  common complaints from Kagglers, and we experienced the pain of all of them:

  1. As no one is allowed to download the precious news data to their own computers for analysis, research can only be conducted via Jupyter Notebook run on Kaggle's servers. As anyone who has tried Jupyter Notebook knows, it is a great real-time collaborative and presentation platform, but a very unwieldy debugging platform
  2. Not only is Jupyter Notebook a sub-optimal tool for efficient research and software development, we are only allowed to use 4 CPU's and a very limited amount of memory for the research. GPU access is blocked, so good luck running your deep learning models. Even simple data pre-processing killed our kernels (due to memory problems) so many times that our hair was thinning by the time we were done.
  3. Kaggle kills a kernel if left idle for a few hours. Good luck training a machine learning model overnight and not getting up at 3 a.m. to save the results just in time.
  4. You cannot upload any supplementary data to the kernel. Forget about using your favorite market index as input, or hedging your portfolio with your favorite ETP.
  5. There is no "securities master database" for specifying a unique identifier for each company and linking the news data with the price data.
The last point requires some elaboration. The price data uses two identifiers for a company, assetCode and assetName, neither of which can be used as its unique identifier. One assetName such as Alphabet can map to multiple assetCodes such as GOOG.O and GOOGL.O. We need to keep track of GOOG.O and GOOGL.O separately because they have different price histories. This presents difficulties that are not present in industrial-strength databases such as CRSP, and requires us to devise our own algorithm to create a unique identifier. We did it by finding out for each assetName whether the histories of its multiple assetCodes overlapped in time. If so, we treated each assetCode as a different unique identifier. If not, then we just used the last known assetCode as the unique identifier. In the latter case, we also checked that “joining” the multiple assetCodes made sense by checking that the gap between the end of one and the start of the other was small, and that the prices made sense. With only around 150 cases, these could all be checked externally. On the other hand, the news data has only assetName as the unique identifier, as presumably different classes of stocks such as GOOG.O and GOOGL.O are affected by the same news on Alphabet. So each news item is potentially mapped to multiple price histories.

The price data is also quite noisy, and Kagglers spent much time replacing bad data with good ones from outside sources. (As noted above, this can't be done algorithmically as data can neither be downloaded nor uploaded to the kernel. The time-consuming manual process of correcting the bad data seemed designed to torture participants.) It is harder to determine whether the news data contained bad data, but at the very least, time series plots of the statistics of some of the important news sentiment features revealed no structural breaks (unlike those of another vendor we tested previously.) 

To avoid overfitting, we first tried the two most obvious numerical news features: Sentiment and Relevance. The former ranges from -1 to 1 and the latter from 0 to 1 for each news item. The simplest and most sensible way to combine them into a single feature is to multiply them together. But since there can be many news item for a stock per day, and we are only making a prediction once a day, we need some way to aggregate this feature over one or more days. We compute a simple moving average of this feature over the last 5 days (5 is the only parameter of this model, optimized over training data from 20070101 to 20141231). Finally, the predictive model is also as simple as we can imagine: if the moving average is positive, buy the stock, and short it if it is negative. The capital allocation across all trading signals is uniform. As we mentioned above, the evaluation criterion of this competition means that we have to enter into such positions at the market open on day t+1 after all the news sentiment data for day t was known by midnight (in UTC time zone). The position has to be held for 10 trading days, and exit at the market open on day t+11, and any net beta of the portfolio has to be hedged with the appropriate amount of the market index. The alpha on the validation set from 20150101 to 20161231 is about 2.3% p.a., with an encouraging Sharpe ratio of 1. The alpha on the out-of-sample test set from 20170101 to 20180731 is a bit lower at 1.8% p.a., with a Sharpe ratio of 0.75. You might think that this is just a small decrease, until you take a look at their respective equity curves:



One cliché in data science confirmed: a picture is worth a thousand words. (Perhaps you’ve heard of the Anscombe's Quartet?) We would happily invest in a strategy that looked like that in the validation set, but no way would we do so for that in the test set. What kind of overfitting have we done for the validation set that caused so much "variance" (in the bias-variance sense) in the test set? The honest answer is: Nothing. As we discussed above, the strategy was specified based only on the train set, and the only parameter (5) was also optimized purely on that data. The validation set is effectively an out-of-sample test set, no different from the "test set". We made the distinction between validation vs test sets in this case in anticipation of machine learning hyperparameter optimization, which wasn't actually used for this simple news strategy.  

We will comment more on this deterioration in performance for the test set later. For now, let’s address another question: Can categorical features improve the performance in the validation set? We start with 2 categorical features that are most abundantly populated across all news items and most intuitively important: headlineTag and audiences. 

The headlineTag feature is a single token (e.g. "BUZZ"), and there are 163 unique tokens. The audiences feature is a set of tokens (e.g. {'O', 'OIL', 'Z'}), and there are 191 unique tokens. The most natural way to deal with such categorical features is to use "one-hot-encoding": each of these tokens will get its own column in the feature matrix, and if a news item contains such a token, the corresponding column will get a "True" value (otherwise it is "False"). One-hot-encoding also allows us to aggregate these features over multiple news items over some lookback period. To do that, we decided to use the OR operator to aggregate them over the most recent trading day (instead of the 5-day lookback for numerical features). I.e. as long as one news item contains a token within the most recent day, we will set that daily feature to True. Before trying to build a predictive model using this feature matrix, we compared their features importance to other existing features using boosted random forest, as implemented in LightGBM. 



These categorical features are nowhere to be found in the top 5 features compared to the price features (returns). But more shockingly, LightGBM returned assetCode as the most important feature! That is a common fallacy of using train data for feature importance ranking (the problem is highlighted by Larkin.) If a classifier knows that GOOG had a great Sharpe ratio in-sample, of course it is going to predict GOOG to have positive residual return no matter what! The proper way to compute feature importance is to apply Mean Decrease Accuracy (MDA) using validation data or with cross-validation (see our kernel demonstrating that assetCode is no longer an important feature once we do that.) Alternatively, we can manually exclude such features that remain constant through the history of a stock from features importance ranking. Once we have done that, we find the most important features are



Compared to the price features, these categorical news features are much less important, and we find that adding them to the simple news strategy above does not improve performance.

So let's return to the question of why it is that our simple news strategy suffered such deterioration of performance going from validation to test set. (We should note that it isn’t just us that were unable to extract much value from the news data. Most other kernels published by other Kagglers have not shown any benefits in incorporating news features in generating alpha either. Complicated price features with complicated machine learning algorithms are used by many leading contestants that have published their kernels.) We have already ruled out overfitting, since there is no additional information extracted from the validation set. The other possibilities are bad luck, regime change, or alpha decay.  Comparing the two equity curves, bad luck seems an unlikely explanation. Given that the strategy uses news features only, and not macroeconomic, price or market structure features, regime change also seems unlikely. Alpha decay seems a likely culprit - by that we mean the decay of alpha due to competition from other traders who use the same features to generate signals. A recently published academic paper (Beckers, 2018) lends support to this conjecture. Based on a meta-study of most published strategies using news sentiment data, the author found that such strategies generated an information ratio of 0.76 from 2003 to 2007, but only 0.25 from 2008-2017, a drop of 66%!

Does that mean we should abandon news sentiment as a feature? Not necessarily. Our predictive horizon is constrained to be 10 days. Certainly one should test other horizons if such data is available. When we gave a summary of our findings at a conference, a member of the audience suggested that news sentiment can still be useful if we are careful in choosing which country (India?), or which sector (defence-related stocks?), or which market cap (penny stocks?) we apply it to. We have only applied the research to US stocks in the top 2,000 of market cap, due to the restrictions imposed by Two Sigma, but there is no reason you have to abide by those restrictions in your own news sentiment research.

----

Workshop update:

We have launched a new online course "Lifecycle of Trading Strategy Development with Machine Learning." This is a 12-hour, in-depth, online workshop focusing on the challenges and nuances of working with financial data and applying machine learning to generate trading strategies. We will walk you through the complete lifecycle of trading strategies creation and improvement using machine learning, including automated execution, with unique insights and commentaries from our own research and practice. We will make extensive use of Python packages such as Pandas, Scikit-learn, LightGBM, and execution platforms like QuantConnect. It will be co-taught by Dr. Ernest Chan and Dr. Roger Hunter, principals of QTS Capital Management, LLC. See www.epchan.com/workshops for registration details.

Friday, April 05, 2019

The most overlooked aspect of algorithmic trading

Many algorithmic traders justifiably worship the legends of our industry, people like Jim Simons, David Shaw, or Peter Muller, but there is one aspect of their greatness most traders have overlooked. They have built their businesses and vast wealth not just by sitting in front of their trading screens or scribbling complicated equations all day long, but by collaborating and managing other talented traders and researchers. If you read the recent interview of Simons, or the book by Lopez de Prado (head of machine learning at AQR), you will notice that both emphasized a collaborative approach to quantitative investment management. Simons declared that total transparency within Renaissance Technologies is one reason of their success, and Lopez de Prado deemed the "production chain" (assembly line) approach the best meta-strategy for quantitative investment. One does not need to be a giant of the industry to practice team-based strategy development, but to do that well requires years of practice and trial and error. While this sounds no easier than developing strategies on your own, it is more sustainable and scalable - we as individual humans do get tired, overwhelmed, sick, or old sometimes. My experience in team-based strategy development falls into 3 categories: 1) pair-trading, 2) hiring researchers, and 3) hiring subadvisors. Here are my thoughts.

From Pair Programming to Pair Trading


Software developers may be familiar with the concept of "pair programming". I.e. two programmers sitting in front of the same screen staring at the same piece of code, and taking turns at the keyboard. According to software experts, this practice reduces bugs and vastly improves the quality of the code.  I have found that to work equally well in trading research and executions, which gives new meaning to the term "pair trading".

The more different the pair-traders are, the more they will learn from each other at the end of the day. One trader may be detail-oriented, while another may be bursting with ideas. One trader may be a programmer geek, and another may have a CFA. Here is an example. In financial data science and machine learning, data cleansing is a crucial step, often seriously affecting the validity of the final results. I am, unfortunately, often too impatient with this step, eager to get to the "red meat" of strategy testing. Fortunately, my colleagues at QTS Capital are much more patient and careful, leading to much better quality work and invalidating quite a few of my bogus strategies along the way. Speaking of invalidating strategies, it is crucial to have a pair-trader independently backtest a strategy before trading it, preferably in two different programming languages. As I have written in my book, I backtest with Matlab and others in my firm use Python, while the final implementation as a production system by my pair-trader Roger is always in C#. Often, subtle biases and bugs in a strategy will be revealed only at this last step. After the strategy is "cross-validated" by your pair-trader, and you have moved on to live trading, it is a good idea to have one human watching over the trading programs at all times, even for fully automated strategies.  (For the same reason, I always have my foot ready on the brake even though my car has a collision avoidance system.) Constant supervision requires two humans, at least, especially if you trade in international as well as domestic markets.

Of course, pair-trading is not just about finding bugs and monitoring live trading. It brings to you new ideas, techniques, strategies, or even completely new businesses. I have started two hedge funds in the past. In both cases, it started with me consulting for a client, and the consulting progressed to a collaboration, and the collaboration became so fruitful that we decided to start a fund to trade the resulting strategies.

For balance, I should talk about a few downsides to pair-trading. Though the final product's quality is usually higher, collaborative work often takes a lot longer. Your pair-trader's schedule may be different from yours. If the collaboration takes the form of a formal partnership in managing a fund or business, be careful not to share ultimate control of it with your pair-trading partner (sharing economic benefits is of course necessary). I had one of my funds shut down due to the early retirement of my partner. One of the reasons I started trading independently instead of working for a large firm is to avoid having my projects or strategies prematurely terminated by senior management, and having a partner involuntarily shuts you down is just as bad.

Where to find your pair-trader? Publish your ideas and knowledge to social media is the easiest way (note this blog here). Whether you blog, tweet, quora, linkedIn, podcast, or youTube, if your audience finds you knowledgeable, you can entice them to a collaboration.

Hiring Researchers

Besides pair-trading with partners on a shared intellectual property basis, I have also hired various interns and researchers, where I own all the IP. They range from undergraduates to post-doctoral researchers (and I would not hesitate to hire talented high schoolers either.) The difference with pair-traders is that as the hired quants are typically more junior in experience and hence require more supervision, and they need to be paid a guaranteed fee instead of sharing profits only. Due to the guaranteed fee, the screening criterion is more important.  I found short interviews, even one with brain teasers, to be quite unpredictive of future performance (no offence, D.E. Shaw.) We settled on giving an applicant a tough financial data science problem to be done at their leisure. I also found that there is no particular advantage to being in the same physical office with your staff. We have worked very well with interns spanning the globe from the UK to Vietnam.

Though physical meetings are unimportant, regular Google Hangouts with screen-sharing is essential in working with remote researchers. Unlike with pair-traders, there isn't time to work together on coding with all the different researchers. But it is very beneficial to walk through their codes whenever results are available. Bugs will be detected, nuances explained, and very often, new ideas come out of the video meetings. We used to have a company-wide weekly video meetings where a researcher would present his/her results using Powerpoints, but I have found that kind of high level presentation to be less useful than an in-depth code and result review. Powerpoint presentations are also much more time-consuming to prepare, whereas a code walk-through needs little preparation.

Generally, even undergraduate interns prefer to develop a brand new strategy on their own. But that is not necessarily the most productive use of their talent for the firm. It is rare to be able to develop and complete a trading strategy using machine learning within a summer internship. Also, if the goal of the strategy is to be traded as an independent managed account product (e.g. our Futures strategy), it takes a few years to build a track record for it to be marketable. On the other hand, we can often see immediate benefits from improving an existing strategy, and the improvement can be researched within 3 or 4 months. This also fits within the "production chain" meta-strategy described by Lopez de Prado above, where each quant should mainly focus on one aspect of the strategy production.

This whole idea of emphasizing improving existing strategies over creating new strategies was suggested to us by our post-doctoral researcher, which leads me to the next point.

Sometimes one hires people because we need help with something we can do ourselves but don't have time to. This would generally be the reason to hire undergraduate interns. But sometimes, I hire  people who are better than I am at something. For example, despite my theoretical physics background, my stochastic calculus isn't top notch (to put it mildly). This is remedied by hiring our postdoc Ray who found tedious mathematics a joy rather than a drudgery. While undergraduate interns improve our productivity, graduate and post-doctoral researchers are generally able to break new ground for us. For these quants, they require more freedom to pursue their projects, but that doesn't mean we can skip the code reviews and weekly video conferences, just like what we do with pair-traders.

Some firms may spend a lot of time and money to find such interns and researchers using professional recruiters. In contrast, these hires generally found their way to us, despite our minuscule size. That is because I am known as an educator (both formally as adjunct faculty in universities, as well as informally on social media and through books). Everybody likes to be educated while getting paid. If you develop a reputation of being an educator in the broadest sense, you shall find recruits coming to you too.

Hiring Subadvisors

If one decides to give up on intellectual property creation, and just go for returns on investment, finding subadvisors to trade your account isn't a bad option. After all, creating IP takes a lot of time and money, and finding a profitable subadvisor will generate that cash flow and diversify your portfolio and revenue stream while you are patiently doing research. (In contrast to Silicon Valley startups where the cash for IP creation comes from venture capital, cash flow for hedge funds like ours comes mainly from fees and expense reimbursements, which are quite limited unless the fund is large or very profitable.)

We have tried a lot of subadvisors in the past. All but one failed to deliver. Why? That is because we were cheap. We picked "emerging" subadvisors who had profitable, but short, track records, and charged lower fees. To our chagrin, their long and deep drawdown typically immediately began once we hired them. There is a name for this: it is called selection bias. If you generate 100 geometric random walks representing the equity curves of subadvisors, it is likely that one of them has a Sharpe ratio greater than 2 if the random walk has only 252 steps. 

Here, I simulated 100 normally distributed returns series with 252 bars, and sure enough, the maximum Sharpe ratio of those is 2.8 (indicated by the red curve in the graph below.)



(The first 3 readers who can email me a correct analytical expression with a valid proof that describes the cumulative probability P of obtaining a Sharpe ratio greater than or equal to S of a normally distributed returns series of length T will get a free copy of my book Machine Trading. At their option, I can also tweet their names and contact info to attract potential employment or consulting opportunities.)

These lucky subadvisors are unlikely to maintain their Sharpe ratios going forward. To overcome this selection bias, we adopted this rule: whenever a subadvisor approaches us, we time-stamp that as Day Zero. We will only pay attention to the performance thereafter. This is similar in concept to "paper trading" or "walk-forward testing". 

Subadvisors with longer profitable track records do pass this test more often than "emerging" subadvisors. But these subadvisors typically charge the full 2 and 20 fees, and the more profitable ones may charge even more. Some investors balk at those high fees. I think these investors suffer from a behavioral finance bias, which for lack of a better term I will call "Scrooge syndrome". Suppose one owns Amazon's stock that went up 92461% since IPO. Does one begrudge Jeff Bezo's wealth? Does one begrudge the many millions he rake in every day? No, the typical investor only cares about the net returns on equity. So why does this investor suddenly becomes so concerned with the difference between gross and net return of a subadvisor? As long as the net return is attractive, we shouldn't care how much fees the subadvisor is raking in. Renaissance Technologies' Medallion Fund reportedly charges 5 and 44, but most people would jump at the chance of investing if they were allowed.

Besides fees, some quant investors balk at hiring subadvisors because of pride. That is another behavioral bias, which is known as the "NIH syndrome" (Not Invented Here). Nobody would feel diminished buying AAPL even though they were not involved in creating the iPhone at Apple, why should they feel diminished paying for a service that generates uncorrelated returns? Do they think they alone can create every new strategy ever discoverable by humankind?

Epilogue


Your ultimate wealth when you are 100 years old will more likely be determined by the strategies created by your pair-traders, your consultants/employees, and your subadvisors, than the amazing strategies you created in your twenties. Hire well.

===

Industry update


1) A python package for market simulations by Techila is available here. It enables easy parallel computations.

2) A very readable new book on using R in Finance by Jonathan Regenstein, who is the Director of Financial Services Practice at RStudio.

3) PsyQuation now provides an order flow sentiment indicator.

4) Larry Connors published a new book on simple but high Sharpe ratio strategies. I enjoyed reading it very much.

5) QResearch is a backtest platform for the Chinese stock market for non-programmers. 

6) Logan Kane describes an innovative application of volatility prediction here.

7) If you aren't following @VolatilityQ on Twitter, you are missing out on a lot of quant research and alphas.