Friday, June 29, 2018

Loss aversion is not a behavioral bias

In his famous book "Thinking, Fast and Slow", the Nobel laureate Daniel Kahneman described one common example of a behavioral finance bias:

"You are offered a gamble on the toss of a [fair] coin.
If the coin shows tails, you lose $100.
If the coin shows heads, you win $110.
Is this gamble attractive? Would you accept it?"

(I have modified the numbers to be more realistic in a financial market setting, but otherwise it is a direct quote.)

Experiments show that most people would not accept this gamble, even though the expected gain is $5. This is the so-called "loss aversion" behavioral bias, and is considered irrational. Kahneman went on to write that "professional risk takers" (read "traders") are more willing to act rationally and accept this gamble.

It turns out that the loss averse "layman" is the one acting rationally here.

It is true that if we have infinite capital, and can play infinitely many rounds of this game simultaneously, we should expect $5 gain per round. But trading isn't like that. We are dealt one coin at a time, and if we suffer a string of losses, our capital will be depleted and we will be in debtors' prison if we keep playing. The proper way to evaluate whether this game is attractive is to evaluate the expected compound rate of growth of our capital.

Let's say we are starting with a capital of $1,000. The expected return of playing this game once is initially 0.005. The standard deviation of the return is 0.105. To simplify matters, let's say we are allowed to adjust the payoff of each round so we have the same expected return and standard deviation of return each round. For example, if at some point we earned so much that we doubled our capital to $2,000, we are allowed to win $220 or lose $200 per round. What is the expected growth rate of our capital? According to standard stochastic calculus, in the continuous approximation it is the expected return minus half the variance of the return, i.e. 0.005 - 0.105^2/2 = -0.0005125 per round: we are losing, not gaining! The layman is right to refuse this gamble.
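As a quick check on that figure, here is a minimal Python sketch. It computes the continuous-approximation growth rate m - s^2/2 (the standard result for geometric Brownian motion) and, for comparison, the exact expected log growth per round of the discrete coin toss when payoffs scale with capital. The function names are illustrative and not from any library.

```python
import math

def continuous_growth_rate(mean_return, std_return):
    """Growth rate per round in the continuous approximation: m - s^2 / 2."""
    return mean_return - std_return ** 2 / 2.0

def exact_log_growth(win_return, loss_return, p_win=0.5):
    """Exact expected log growth per round when payoffs scale with capital."""
    return (p_win * math.log(1.0 + win_return)
            + (1.0 - p_win) * math.log(1.0 + loss_return))

# Win 11% or lose 10% of current capital with equal probability.
approx = continuous_growth_rate(0.005, 0.105)
exact = exact_log_growth(0.11, -0.10)
print(f"continuous approximation:  {approx:.7f}")
print(f"exact expected log growth: {exact:.7f}")
```

Both numbers come out slightly negative, so the conclusion does not hinge on the continuous approximation.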

Loss aversion, in the context of a risky game played repeatedly, is rational, and not a behavioral bias. Our primitive, primate instinct grasped a truth that behavioral economists cannot.  It only seems like a behavioral bias if we take an "ensemble view" (i.e. allowed infinite capital to play many rounds of this game simultaneously), instead of a "time series view" (i.e. allowed only finite capital to play many rounds of this game in sequence, provided we don't go broke at some point). The time series view is the one relevant to all traders. In other words, take time average, not ensemble average, when evaluating real-world risks.

The important difference between ensemble average and time average has been raised in this paper by Ole Peters and Murray Gell-Mann (another Nobel laureate like Kahneman.) It deserves to be much more widely read in the behavioral economics community. But beyond academic interest, there is a practical importance in emphasizing that loss aversion is rational. As traders, we should not only focus on average returns: risks can depress compound returns severely.


Industry update

1) Alpaca is a new algo-trading API brokerage platform with zero commissions.

2) AlgoTrader started a new quant strategy development and implementation platform.

My Upcoming Workshop

August 4 and 11:  Artificial Intelligence Techniques for Traders

I briefly discussed why AI/ML techniques are now part of the standard toolkit for quant traders here. This real-time online workshop will take you through many of the nuances of applying these techniques to trading.

Friday, February 02, 2018

FX Order Flow as a Predictor

Order flow is signed trade size, and it has long been known to be predictive of future price changes. (See Lyons, 2001, or Chan, 2017.) The problem, however, is that it is often quite difficult or expensive to obtain such data, whether historical or live. This is especially true for foreign exchange transactions, which occur over-the-counter. Recognizing the profit potential of such data, most FX market operators guard them as their crown jewels, never to be revealed to customers. But recently FXCM, an FX broker, has kindly provided me with their proprietary data, and I have made use of that to test a simple trading strategy using order flow on EURUSD.

First, let us examine some general characteristics of the data. It captures all trades transacted on FXCM occurring in 2017, time stamped in milliseconds, and with their trade prices and signed trade sizes. The sign of a trade is positive if it is the result of a buy market order, and negative if it is the result of a sell. If we take the absolute value of these trade sizes and sum them over hourly intervals, we obtain the usual hourly volumes aggregated over the 1 year data set:

It is not surprising that the highest volume occurs between 16:00-17:00 London time, as 16:00 is when the benchmark rate (the "fix") is determined. The secondary peak at 9:00-10:00 is of course the start of the business day in London.

Next, I compute the daily total order flow of EURUSD (with the end of day at New York's midnight), and I establish a histogram of the last 20 days' daily order flow. I then determine the average next-day return of each daily order flow quintile. (I.e. I bin a next-day return based on which quintile the prior day's order flow fell into, and then take the average of the returns in each bin.) The result is satisfying:
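The binning step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 20-day window is taken to include the current day, the quintile is derived from the rank of today's flow within that window, and the function names are hypothetical.

```python
def rolling_quintiles(flows, lookback=20):
    """Quintile (0 = bottom, 4 = top) of each day's flow within its trailing window.

    Assumption: the window includes the current day. Days without a full
    window get None.
    """
    quintiles = []
    for t in range(len(flows)):
        if t + 1 < lookback:
            quintiles.append(None)
            continue
        window = flows[t + 1 - lookback:t + 1]
        below = sum(1 for f in window if f < flows[t])
        quintiles.append(below * 5 // lookback)
    return quintiles

def mean_next_day_return_by_quintile(flows, returns, lookback=20):
    """Average of returns[t + 1], grouped by the quintile of flows[t]."""
    sums, counts = [0.0] * 5, [0] * 5
    quintiles = rolling_quintiles(flows, lookback)
    for t in range(len(flows) - 1):
        q = quintiles[t]
        if q is not None:
            sums[q] += returns[t + 1]
            counts[q] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

# Toy data: steadily rising flow always lands in the top quintile.
toy_flows = list(range(25))
toy_returns = [0.01] * 25
toy_means = mean_next_day_return_by_quintile(toy_flows, toy_returns)
print(toy_means)
```

With real data, `flows` would be the daily signed order flow and `returns` the daily returns measured over the same end-of-day convention.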

We can see that the average next-day returns are almost monotonically increasing with the previous day's order flow. The spread between the top and bottom quintiles is about 12 bps, which annualizes to about 30%. This doesn't mean we will generate 30% annualized returns, since we won't be able to arbitrage between today's return (if the order flow is in the top or bottom quintile) with some previous day's return when its order flow was in the opposite extreme. Nevertheless, it is encouraging. Also, this is an illustration that even though order flow must be computed on a tick-by-tick basis (I am not a fan of the bulk volume classification technique), it can be used in low-frequency trading strategies.

(One may be tempted to also regress future returns against past order flows, but the result is statistically insignificant. Apparently only the top and bottom quintiles of order flow are predictive. This situation is actually quite common in finance, which is why linear regression isn't used more often in trading strategies.)

Finally, one more sanity check before backtesting. I want to see if the buy trades (trades resulting from buy market orders) are filled above the bid price, and the sell trades are filled below the ask price. Here is the plot for one day (times are in New York):

We can see that by and large, the relationship between trade and quote prices is satisfied. We can't really expect that this relationship holds 100%, due to rare occasions that the quote has moved in the sub-millisecond after the trade occurred and the change is reported as synchronous with the trade, or when there is a delay in the reporting of either a trade or a quote change.

So now we are ready to construct a simple trading strategy that uses order flow as a predictor. We can simply buy EURUSD at the end of day when the daily flow is in the top quintile among its last 20 days' values, and hold for one day, and short it when it is in the bottom quintile. Since our daily flow was measured at midnight New York time, we also define the end of day at that time. (Similar results are obtained if we use London or Zurich's midnight, which suggests we can stagger our positions.) In my backtest, I have subtracted 0.20 bps commissions (based on Interactive Brokers), and I assume I buy at the ask and sell at the bid using market orders. The equity curve is shown below:
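A stripped-down version of this rule, written as a hedged Python sketch: it takes precomputed daily quintiles (0 = bottom, 4 = top, None = no signal) and daily returns, goes long after a top-quintile day and short after a bottom-quintile day, and charges a cost proportional to the change in position. The single `cost_per_unit_turnover` parameter is a simplification introduced here; it stands in for the bid-ask spread and commissions, which the backtest described above models with actual quotes.

```python
def backtest_quintile_strategy(quintiles, returns, cost_per_unit_turnover=0.0):
    """Equity curve for: long after top-quintile days, short after bottom-quintile days.

    quintiles[t] is known at the end of day t; the position earns returns[t + 1].
    The final open position is not closed out in this sketch.
    """
    equity = 1.0
    curve = []
    previous_position = 0
    for t in range(len(quintiles) - 1):
        q = quintiles[t]
        position = 1 if q == 4 else -1 if q == 0 else 0
        turnover = abs(position - previous_position)
        daily_return = position * returns[t + 1] - cost_per_unit_turnover * turnover
        equity *= 1.0 + daily_return
        curve.append(equity)
        previous_position = position
    return curve

# Toy example with zero costs.
toy_curve = backtest_quintile_strategy([4, 0, 2, None], [0.0, 0.01, -0.02, 0.03])
print(toy_curve)
```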

The CAGR is 13.7%, with a Sharpe ratio of 1.6. Not bad for a single factor model!

Acknowledgement:  I thank Zachary David for his review and comments on an earlier draft of this post, and of course FXCM for providing their data for this research.


Industry update

1) Qcaid is a cloud-based platform that provides traders with backtesting, execution, and simulation facilities. They also provide servers and data feed.

2) How Cadre Uses Machine Learning to Target Real Estate Markets.

3) Check out Quantopian's new tutorial on getting started in quantitative finance.

4) A new Matlab-based backtest and live trading platform for download here.

5) A nice resource page for open source algorithmic trading tools at QuantNews.

My Upcoming Workshops

February 24 and March 3: Algorithmic Options Strategies

This online course focuses on backtesting intraday and portfolio option strategies. No pesky options pricing theories will be discussed, as the emphasis is on arbitrage trading.

June 4-8: London workshops

These intense 8-16 hour workshops cover Algorithmic Options Strategies, Quantitative Momentum Strategies, and Intraday Trading and Market Microstructure. Typical class size is under 10. They may qualify for CFA Institute continuing education credits. (Bonus: nice view of the Thames, and lots of free food.)

Thursday, January 04, 2018

A novel capital booster: Sports Arbitrage

By Stephen Hope

As traders, we of course need money to make money, but not everyone has 10-50k of capital lying around to start one's trading journey. Perhaps the starting capital is only 1k or less. This article describes how one can take a small amount of capital and multiply it as much as 10 fold in one year by taking advantage of large market inefficiencies (leading to arbitrage opportunities) in the sports asset class. However, impressive returns such as this are difficult to achieve with significantly larger seed capital, as discussed later.

Arbitrage is the perfect trade if you can get your hands on one, but clearly this is exceptionally difficult in the financial markets. In contrast, the sports markets are very inefficient due to the general lack of trading APIs and patchy liquidity etc. Arbitrages can persist for minutes (or even hours at a time).

Consider a very simple example of sports arbitrage; Team A vs Team B and three bookmakers quoting the odds shown in the table below. When the odds are expressed in decimal form we can calculate the implied probability of the event  e occurring as quoted by bookmaker i as  P(i,e) = 1/Odds(i,e)  (shown in brackets in the table).

Three Way Market | Bookmaker B1 | Bookmaker B2 | Bookmaker B3
Team A win | 1.4 (71.4%) | 1.2 (83.3%) | 1.2 (83.3%)
Team A lose | 8.8 (11.4%) | 9.5 (10.5%) | 9.1 (11.0%)
Draw | 5.8 (17.2%) | 6.0 (16.7%) | 6.8 (14.7%)

In the Three Way Market, there are only 3 possible outcomes: Team A wins, Team A loses, or it's a draw. Therefore the sum of the probabilities of these 3 events should equal 100% (in a fair market). However, we can see that the market is not efficient: taking the best available odds for each outcome (1.4 from B1, 9.5 from B2, 6.8 from B3) gives 1/1.4 + 1/9.5 + 1/6.8 ≈ 96.7% < 100%.

This is an arbitrage opportunity in the Three Way market with 3 legs;

1_2_X and Odds = (1.4, 9.5, 6.8)


1 = Three Way Market (home team to win)
2 = Three Way Market (away team to win)
X = Three Way Market (a draw)

The size of the arbitrage is given by 

and in order to realise this arbitrage we need to bet the following percentage stakes against our notional
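A minimal Python sketch of this calculation, assuming the common equal-payout convention: the stake on each leg is proportional to its implied probability, so every outcome returns the same amount, and the arbitrage is measured as profit per unit of total notional. The post's exact definitions may differ, and the function name is illustrative.

```python
def simple_arbitrage(odds):
    """Stake fractions and profit for mutually exclusive, exhaustive bet legs.

    odds: decimal odds, one per outcome, each taken from the best bookmaker.
    Returns (sum of implied probabilities, stake fractions, profit on notional).
    """
    implied = [1.0 / o for o in odds]
    total = sum(implied)
    stakes = [p / total for p in implied]   # fractions of notional, sum to 1
    profit = 1.0 / total - 1.0              # identical payout whichever leg wins
    return total, stakes, profit

total, stakes, profit = simple_arbitrage([1.4, 9.5, 6.8])
print(f"sum of implied probabilities: {total:.4f}")
print("stakes:", [f"{s:.1%}" for s in stakes])
print(f"profit on notional: {profit:.2%}")
```

Under these assumptions each leg pays back stake times odds, which equals 1 divided by the sum of implied probabilities regardless of the result.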

The above example is a 'simple' arbitrage. However, the majority of football arbitrage opportunities are 'complex' arbitrages. Complex in the sense that the bet legs are not mutually exclusive and more than one leg can pay out over some overlapping subset of possible outcomes. The calculation then becomes more complex. 

For example, consider the following 3 leg complex arbitrage;

AH2(-0.25)_X1_1 and Odds = (1.69, 2.1, 5.25);


AH2(-0.25) = Asian Handicap Market (away team to win, handicap -0.25) 
X1 = Double Chance Market (home team to win or draw)
1 = Three Way Market (home team to win)

We can construct a payoff matrix to more easily visualise the outcome dependent payoffs of the 3 bet legs.

Payoff Matrix | Away Team Wins | Draw | Home Team Wins
AH2(-0.25) | 0.69 | -0.5 | -1
X1 | -1 | 1.1 | 1.1
1 | -1 | -1 | 4.25

Matrix Element Meanings
0.69 –> win 0.69 * stake 1 (+ stake 1 returned)
1.1 –> win 1.1 * stake 2 (+ stake 2 returned)
4.25 –> win 4.25 * stake 3 (+ stake 3 returned)
-0.5 –> lose -0.5 * stake 1 (get half of stake 1 back)
-1 –> lose -1 * stake i (lose your full stake)

The structure of the Payoff Matrix reveals a 'potential' arbitrage because there exists no column (event outcome) that contains only negative cash flows. It is a potential 'complex arbitrage' because in the event of a draw or home team win, there exist two bet legs that can give rise to a positive cash flow for the same outcome (remember, -0.5 means half of the stake is returned, so it is still a positive cash flow). However, whether or not the arbitrage can be 'realised' depends on whether or not we can find a solution for the stake percentages for each leg that gives a positive net profit for every outcome. So how do we do this?

Constructed as a dynamic programming optimisation we have;


x = ( x1 , x2 , x3 ... ) are the bet leg stakes
C is a payoff matrix column chosen to maximise
A is the constraints matrix (e.g sum of stakes = 1, stake (i) >= 0 etc)

Solving the optimisation for the AH2(-0.25)_X1_1 example above gives;

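Payoff Matrix | Away Team Wins | Draw | Home Team Wins | Stake %
AH2(-0.25) | 0.69 | -0.5 | -1 | 60.2%
X1 | -1 | 1.1 | 1.1 | 34.1%
1 | -1 | -1 | 4.25 | 5.7%
Net Profit | 1.7% | 1.7% | 1.7% |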

We can see that the arbitrage does indeed have a solution with the stake percentages (60.2%, 34.1%, 5.7%) giving an arbitrage of 1.7% for every possible outcome. There are many thousands of these arbitrage opportunities appearing each day in the sports markets ranging in size from 0.1% - 7%+.
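These stake percentages can be reproduced with a short Python sketch. Instead of a general optimiser, it assumes the best solution equalises the gross return across all outcomes and solves the resulting linear system; that shortcut only applies when there are as many legs as outcomes and the solved stakes turn out non-negative, so it is a simplification rather than the method used above. The gross-return table is derived from the quoted odds and bet definitions (stake returned included, so a half-lost Asian Handicap leg returns 0.5).

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def equal_payout_stakes(gross):
    """gross[i][j]: amount returned per unit staked on leg i if outcome j occurs."""
    n_legs, n_outcomes = len(gross), len(gross[0])
    # Unknowns: one stake per leg, plus the common gross return R.
    # One equation per outcome: sum_i gross[i][j] * x_i - R = 0.
    a = [[gross[i][j] for i in range(n_legs)] + [-1.0] for j in range(n_outcomes)]
    a.append([1.0] * n_legs + [0.0])        # stakes sum to 1
    b = [0.0] * n_outcomes + [1.0]
    solution = solve_linear(a, b)
    return solution[:-1], solution[-1] - 1.0

# Columns: away team wins, draw, home team wins.
gross_returns = [
    [1.69, 0.5, 0.0],    # AH2(-0.25): win, half stake back, lose
    [0.0, 2.1, 2.1],     # X1: lose, win, win
    [0.0, 0.0, 5.25],    # 1: lose, lose, win
]
leg_stakes, net_profit = equal_payout_stakes(gross_returns)
print([f"{s:.1%}" for s in leg_stakes], f"{net_profit:.1%}")
```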

What returns are possible? Consider starting with a seed capital of £1k and a trading frequency of 3 times per week with an average arbitrage size of 1.6%. Initially we compound our winnings, but there are limits to how much you can stake with a given bookmaker. Assume that we cannot increase our notional beyond £5,000 across any multi-leg arbitrage trade. In that case, the initial £1k can grow to approximately £9,500 in one year. Not bad for a few minutes of effort per trade.
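The growth path is easy to reproduce as a rough check. The Python sketch below assumes 52 weeks of trading, that every trade earns exactly the average 1.6%, and that the whole capital is staked each time up to the £5,000 cap; the function and parameter names are illustrative.

```python
def grow_capital(seed, edge, trades_per_week, weeks, max_notional):
    """Compound a fixed edge per trade, with the stake capped at max_notional."""
    capital = seed
    for _ in range(trades_per_week * weeks):
        notional = min(capital, max_notional)
        capital += edge * notional
    return capital

final_capital = grow_capital(seed=1000.0, edge=0.016, trades_per_week=3,
                             weeks=52, max_notional=5000.0)
print(f"capital after one year: £{final_capital:,.0f}")
```

Under these simplified assumptions the result lands a little under £9,400, in the same ballpark as the approximate figure quoted above. Growth is exponential until the cap binds (after roughly 100 trades) and linear afterwards.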

So what's the catch? 

There are really only two pitfalls. 
1)  Scaling: You cannot easily compound your returns as with the financial markets.
2) Limit Risk: Bookmakers don't want you to win and can be inclined to significantly reduce your allowed stake notional if you win too much. Avoiding this requires careful management.

Although sports arbitrage does not easily scale, it is a great way of boosting trading capital by a few thousand pounds per year with very small time effort; capital which could be put to use in the financial or crypto markets.

About the author: Stephen Hope is Co-Founder of Machina Trading, a proprietary crypto & sports trading firm that provides an arbitrage tool called rational bet. He is former Head of Quantitative Trading Strategies at BNP Paribas and received his PhD in Physics from the University of Cambridge.

Friday, November 17, 2017

Optimizing trading strategies without overfitting

By Ernest Chan and Ray Ng


Optimizing the parameters of a trading strategy via backtesting has one major problem: there are typically not enough historical trades to achieve statistical significance. Whatever optimal parameters one finds are likely to suffer from data snooping bias, and there may be nothing optimal about them in the out-of-sample period. That's why parameter optimization of trading strategies often adds no value. On the other hand, optimizing the parameters of a time series model (such as a maximum likelihood fit to an autoregressive or GARCH model) is more robust, since the input data are prices, not trades, and we have plenty of prices. Fortunately, it turns out that there are clever ways to take advantage of the ease of optimizing time series models in order to optimize parameters of a trading strategy.

One elegant way to optimize a trading strategy is to utilize the methods of stochastic optimal control theory - elegant, that is, if you are mathematically sophisticated and able to analytically solve the Hamilton-Jacobi-Bellman (HJB) equation (see Cartea et al.). Even then, this will only work when the underlying time series is a well-known one, such as the continuous Ornstein-Uhlenbeck (OU) process that underlies all mean reverting price series. This OU process is neatly represented by a stochastic differential equation. Furthermore, the HJB equations can typically be solved exactly only if the objective function is of a simple form, such as a linear function. If your price series happens to be neatly represented by an OU process, and your objective is profit maximization, which happens to be a linear function of the price series, then stochastic optimal control theory will give you the analytically optimal trading strategy, with exact entry and exit thresholds given as functions of the parameters of the OU process. There is no more need to find such optimal thresholds by trial and error during a tedious backtest process, a process that invites overfitting to a sparse number of trades. As we indicated above, the parameters of the OU process can be fitted quite robustly to prices, and in fact there is an analytical maximum likelihood solution to this fit given in Leung et al.

But what if you want something more sophisticated than the OU process to model your price series or require a more sophisticated objective function? What if, for example, you want to include a GARCH model to deal with time-varying volatility and optimize the Sharpe ratio instead? In many such cases, there is no representation as a continuous stochastic differential equation, and thus there is no HJB equation to solve. Fortunately, there is still a way to optimize without overfitting.

In many optimization problems, when an analytical optimal solution does not exist, one often turns to simulations. Examples of such methods include simulated annealing and Markov Chain Monte Carlo (MCMC). Here we shall do the same: if we couldn't find an analytical solution to our optimal trading strategy, but could fit our underlying price series quite well to a standard discrete time series model such as ARMA, then we can simply simulate many instances of the underlying price series. We shall backtest our trading strategy on each instance of the simulated price series, and find the trading parameters that most frequently generate the highest Sharpe ratio. This process is much more robust than applying a backtest to the real time series, because there is only one real price series, but we can simulate as many price series (all following the same ARMA process) as we want. That means we can simulate as many trades as we want and obtain optimal trading parameters with as high a precision as we like. This is almost as good as an analytical solution. (See the flow chart below that illustrates this procedure.)

Optimizing a trading strategy using simulated time series

Here is a somewhat trivial example of this procedure. We want to find an optimal strategy that trades AUDCAD on an hourly basis. First, we fit an AR(1)+GARCH(1,1) model to the data using log midprices. The maximum likelihood fit is done using a one-year moving window of historical prices, and the model is refitted every month. We use MATLAB's Econometrics Toolbox for this fit. Once the sequence of monthly models is found, we can use them to predict both the log midprice at the end of the hourly bars and the expected variance of log returns. So a simple trading strategy can be tested: if the expected log return in the next bar is higher than K times the expected volatility (square root of variance) of log returns, buy AUDCAD and hold for one bar, and vice versa for shorts. But what is the optimal K?

Following the procedure outlined above, each time we fit a new AR(1)+GARCH(1,1) model, we use it to simulate the log prices for the next month's worth of hourly bars. In fact, we simulate this 1,000 times, generating 1,000 time series, each with the same number of hourly bars in a month. Then we simply iterate through all reasonable values of K and remember which K generates the highest Sharpe ratio for each simulated time series. We pick the K that most often results in the best Sharpe ratio among the 1,000 simulated time series (i.e. we pick the mode of the distribution of optimal K's across the simulated series). This is the sequence of K's (one for each month) that we use for our final backtest. Below is a sample distribution of optimal K's for a particular month, and the corresponding distribution of Sharpe ratios:
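The simulation loop can be sketched in plain Python (the fit itself was done in MATLAB). Everything numeric here is a placeholder: the model parameters are made-up values rather than fitted ones, the K grid is arbitrary, and the Sharpe ratio is left unannualised because only the ranking across K matters. A K that never trades is excluded by giving it a Sharpe ratio of minus infinity, which is a choice made for this illustration.

```python
import math
import random
from collections import Counter

def simulate_path(n_bars, c, phi, omega, alpha, beta, rng):
    """Simulate AR(1) log prices with GARCH(1,1) shocks.

    Returns one (expected_return, expected_vol, realized_return) tuple per bar;
    the expectations use only information available before the bar.
    """
    price = c / (1.0 - phi)                  # start at the unconditional mean
    var = omega / (1.0 - alpha - beta)       # start at the unconditional variance
    shock = 0.0
    bars = []
    for _ in range(n_bars):
        var = omega + alpha * shock ** 2 + beta * var
        expected_return = c + (phi - 1.0) * price
        shock = math.sqrt(var) * rng.gauss(0.0, 1.0)
        realized_return = expected_return + shock
        bars.append((expected_return, math.sqrt(var), realized_return))
        price += realized_return
    return bars

def sharpe_for_k(bars, k):
    """Per-bar Sharpe ratio of: long if E[r] > k * vol, short if E[r] < -k * vol."""
    pnl = []
    for expected_return, expected_vol, realized_return in bars:
        if expected_return > k * expected_vol:
            pnl.append(realized_return)
        elif expected_return < -k * expected_vol:
            pnl.append(-realized_return)
        else:
            pnl.append(0.0)
    mean = sum(pnl) / len(pnl)
    var = sum((x - mean) ** 2 for x in pnl) / len(pnl)
    return mean / math.sqrt(var) if var > 0 else float("-inf")

def modal_optimal_k(k_grid, n_paths, n_bars, params, seed=0):
    """Find the best K on each simulated path and return the most frequent one."""
    rng = random.Random(seed)
    winners = Counter()
    for _ in range(n_paths):
        bars = simulate_path(n_bars, rng=rng, **params)
        winners[max(k_grid, key=lambda k: sharpe_for_k(bars, k))] += 1
    return winners.most_common(1)[0][0], winners

# Hypothetical parameters; in practice these come from the monthly model fit.
params = dict(c=0.0, phi=0.98, omega=1e-7, alpha=0.05, beta=0.90)
k_grid = [0.0, 0.25, 0.5, 1.0, 2.0]
best_k, counts = modal_optimal_k(k_grid, n_paths=200, n_bars=500, params=params)
print("modal optimal K:", best_k, dict(counts))
```

In a full implementation this would be repeated once per month, each time with the freshly fitted parameters, to produce the sequence of K's used in the final backtest.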

Histogram of optimal K and corresponding Sharpe ratio for 1,000 simulated price series

Interestingly, the mode of the optimal K is 0 for any month. That certainly makes for a simple trading strategy: just buy whenever the expected log return is positive, and vice versa for shorts. The CAGR is about 4.5% assuming zero transaction costs and midprice executions. Here is the cumulative returns curve:

You may exclaim: "This can't be optimal, because I am able to trade AUDCAD hourly bars with much better returns and Sharpe ratio!" Of course, optimal in this case only means optimal within a certain universe of strategies, and assuming an underlying AR(1)+GARCH(1, 1) price series model. Our universe of strategies is a pretty simplistic one: just buy or sell based on whether the expected return exceeds a multiple of the expected volatility. But this procedure can be extended to whatever price series model you assume, and whatever universe of strategies you can come up with. In every case, it greatly reduces the chance of overfitting.

P.S. We invented this procedure for our own use a few months ago, borrowing similar ideas from Dr. Ng’s computational research in condensed matter physics systems (see Ng et al here or here). But later on, we found that a similar procedure has already been described in a paper by Carr et al.


About the authors: Ernest Chan is the managing member of QTS Capital Management, LLC. Ray Ng is a quantitative strategist at QTS. He received his Ph.D. in theoretical condensed matter physics from McMaster University. 


Upcoming Workshops by Dr. Ernie Chan

November 18 and December 2:  Cryptocurrency Trading with Python

I will be moderating this online workshop for Nick Kirk, a noted cryptocurrency trader and fund manager, who taught this widely acclaimed course here and at CQF in London.

Thursday, September 07, 2017

StockTwits Sentiment Analysis

By Colton Smith

Exploring alternative datasets to augment financial trading models is currently the hot trend among the quantitative community. With so much social media data out there, its place in financial models has become a popular research discussion. Surely the stock market’s performance influences the reactions from the public but if the converse is true, that social media sentiment can be used to predict movements in the stock market, then this would be a very valuable dataset for a variety of financial firms and institutions.

When I began this project as a consultant for QTS Capital Management, I did an extensive literature review of the social media sentiment providers and academic research. The main approach is to take the social media firehose, filter it down by source credibility, apply natural language processing (NLP), and create a variety of metrics that capture sentiment, volume, dispersion, etc. The best results have come from using Twitter or StockTwits as the source. A feature of StockTwits that distinguishes it from Twitter is that in late 2012 the option to label your tweet as bullish or bearish was added. If these labels accurately capture sentiment and are used frequently enough, then it would be possible to avoid using NLP. Most tweets are not labeled as seen in Figure 1 below, but the percentage is increasing.

Figure 1: Percentage of Labeled StockTwits Tweets by Year

This blog post will compare the use of just the labeled tweets versus the use of all tweets with NLP. To begin, I did some basic data analysis to better understand the nature of the data. In Figure 2 below, the number of labeled tweets per hour is shown. As expected there are spikes around market open and close.

Figure 2: Number of Tweets Per Hour of the Day

The overall market sentiment can be estimated by aggregating the number of bullish and bearish labeled tweets each day. Based on the previous literature, I expected a significant bullish bias. This is confirmed in Figure 3 below, with the daily mean percentage of bullish tweets being 79%.

Figure 3: Percentage of Bullish Tweets Each Day

When writing a StockTwits tweet, users can tag multiple symbols so it is possible that the sentiment label could apply to more than one symbol. Tagging more than one symbol would likely indicate less specific sentiment and predictive potential so I hoped to find that most tweets only tag a single symbol. Looking at Figure 4 below, over 90% of the tweets tag a single symbol and a very small percentage tag 5+.

Figure 4: Relative Frequency Histogram of the Number of Symbols Mentioned Per Tweet

The time period of data used in my analysis is from 2012-11-01 to 2016-12-31. In Figure 5 below, the top symbols, industries, and sectors by total labeled tweet count are shown. By far the most tweeted-about industries were biotechnology and ETFs. This makes sense because of how volatile these industries are, which hopefully means that they would be the best to trade based on social media sentiment data.

Figure 5: Top Symbols, Industries, and Sectors by Total Tweet Count

Now I needed to determine how I would create the sentiment score to best encompass the predictive potential of the data. Though there are obstacles to trading an open to close strategy including slippage, liquidity, and transaction costs, analyzing how well the sentiment score immediately before market open predicts open to close returns is a valuable sanity check to see if it would be useful in a larger factor model. The sentiment score for each day was calculated using the tweets from the previous market day’s open until the current day’s open:

S-Score =  (#Bullish-#Bearish)/(#Bullish+#Bearish)

This S-Score then needs to be normalized to detect the significance of a specific day’s sentiment with respect to the symbol’s historic sentiment trend. To do this, a rolling z-score is applied to the series. By changing the length of the lookback window the sensitivity can be adjusted. Additionally, since the data is quite sparse, days without any tweets for a symbol are given an S-Score of 0. At the market open each day, symbols with an S-Score above the positive threshold are entered long and symbols with an S-Score below the negative threshold are entered short. Equal dollar weight is applied to the long and short legs. These positions are assumed to be liquidated at the day’s market close. The first test is on the universe of equities with previous day closing prices > $5. With a relatively small long-short portfolio of ~250 stocks, its performance can be seen in Figure 6 below.
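A minimal Python sketch of the scoring step for a single symbol. Two details are assumptions, since the post does not specify them: the z-score compares each day against the preceding `lookback` days (excluding the day itself) using the sample standard deviation, and the thresholds of +1 and -1 are placeholders rather than the values used in the backtest.

```python
import statistics

def s_score(n_bullish, n_bearish):
    """(#Bullish - #Bearish) / (#Bullish + #Bearish); 0 on days with no labeled tweets."""
    total = n_bullish + n_bearish
    return 0.0 if total == 0 else (n_bullish - n_bearish) / total

def rolling_z_scores(scores, lookback):
    """z-score of each value against the preceding `lookback` values (lookback >= 2).

    Returns None where there is not enough history or the history has no variation.
    """
    z_scores = []
    for t, score in enumerate(scores):
        history = scores[max(0, t - lookback):t]
        if len(history) < lookback:
            z_scores.append(None)
            continue
        sd = statistics.stdev(history)
        z_scores.append((score - statistics.fmean(history)) / sd if sd > 0 else None)
    return z_scores

# Daily (bullish, bearish) labeled tweet counts for one symbol; made-up numbers.
daily_counts = [(8, 2), (5, 5), (0, 0), (9, 1), (3, 7), (10, 0)]
scores = [s_score(bull, bear) for bull, bear in daily_counts]
z = rolling_z_scores(scores, lookback=3)

# +1 = enter long at the open, -1 = enter short, 0 = neutral basket.
signals = [None if v is None else (1 if v > 1.0 else -1 if v < -1.0 else 0) for v in z]
print(scores, z, signals)
```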

Figure 6: Price > $5 Universe Open to Close Cumulative Returns

The thresholds were cherry-picked to show the potential of a 2.11 Sharpe ratio, but the results vary depending on the thresholds used. This sensitivity is likely due to the lack of tweet volume on most symbols. Also, the long and short thresholds are not equal, in an attempt to maintain a roughly equal number of stocks in each leg. The neutral basket contains all of the stocks in the universe that do not have an S-Score extreme enough to generate a long or short signal. Using the same thresholds as above, the test was run on a liquidity universe, defined as the top quartile of stocks by 50-day Average Dollar Volume. As seen in Figure 7 below, the Sharpe ratio drops to 1.24 but is still very encouraging.

Figure 7: Liquidity Universe Open to Close Cumulative Returns

The sensitivity of these results needs to be further inspected by performing analysis on separate train and test sets but I was very pleased with the returns that could be potentially generated from just labeled StockTwits data.

In July, I began working for Social Market Analytics, the leading social media sentiment provider. Here at SMA, we run all the StockTwits tweets through our proprietary NLP engine to determine their sentiment scores. Using sentiment data from 9:10 EST, which is an exponentially weighted sentiment aggregation over the last 24 hours, the open to close simulation can be run on the price > $5 universe. Each stock is assigned to a quintile based on its S-Score in relation to the universe’s percentiles that day. A long-short portfolio is constructed in a similar fashion as before, with long positions in the top quintile stocks and short positions in the bottom quintile stocks. In Figure 8 below you can see that the results are much better than when only using sentiment labeled data.

Figure 8: SMA Open to Close Cumulative Returns Using StockTwits Data

The predictive power is there as the long-short boasts an impressive 4.5 Sharpe ratio. Due to having more data, the results are much less sensitive to long-short portfolio construction. To avoid the high turnover of an open-to-close strategy, we have been exploring possible long-term strategies. Deutsche Bank’s Quantitative Research Team recently released a paper about strategies that solely use our SMA data which includes a longer-term strategy. Additionally, I’ve recently developed a strong weekly rebalance strategy that attempts to capture weekly sentiment momentum.

Though it is just the beginning, my dive into social media sentiment data and its application in finance over the course of my time consulting for QTS has been very insightful. It is arguable that by just using the labeled StockTwits tweets, we may be able to generate predictive signals but by including all the tweets for sentiment analysis, a much stronger signal is found. If you have questions please contact me at

Colton Smith is a recent graduate of the University of Washington where he majored in Industrial and Systems Engineering and minored in Applied Math. He now lives in Chicago and works for Social Market Analytics. He has a passion for data science and is excited about his developing quantitative finance career. LinkedIn:

Upcoming Workshops by Dr. Ernie Chan

September 11-15: City of London workshops

These intense 8-16 hour workshops cover Algorithmic Options Strategies, Quantitative Momentum Strategies, and Intraday Trading and Market Microstructure. Typical class size is under 10. They may qualify for CFA Institute continuing education credits.

Friday, July 21, 2017

Building an Insider Trading Database and Predicting Future Equity Returns

By John Ryle, CFA

I’ve long been interested in the behavior of corporate insiders and how their actions may impact their company’s stock. I had done some research on this in the past, albeit in a very low-tech way using mostly Excel. It’s a highly compelling subject, intuitively aligned with a company’s equity performance: if those individuals most in-the-know are buying, it seems sensible that the stock should perform well. If insiders are selling, the opposite is implied. While reality proves more complex than that, a tremendous amount of literature has been written on the topic, and insider activity has been shown to be predictive in prior studies.

In generating my thesis to complete Northwestern’s MS in Predictive Analytics program, I figured that applying some of the more prominent machine learning algorithms to insider trading could be an interesting exercise. I was concerned, however, that as the market had gotten smarter over time, returns from insider trading signals may have decayed as well, as is often the case with strategies exposed to a wide audience. Information is more readily available now than at any time in the past. Not too long ago, investors needed to visit SEC offices to obtain insider filings. The standard filing document, Form 4, has required electronic submission only since 2003. Now anyone can obtain it freely via the SEC’s EDGAR website. If all this data is just sitting out there, can it continue to offer value?

I decided to investigate by gathering the filings directly, scraping the EDGAR site. While there are numerous data providers available (at a cost), I wanted to parse the raw data directly, as this would allow for greater “intimacy” with the underlying data. I’ve spent much of my career as a database developer/administrator, so working with raw text/XML and transforming it into a database structure seemed like fun. Also, since I wanted this to be a true end-to-end data science project, including data wrangling, the often ugly 80% of the real effort, was an important requirement. That being said, mining and cleansing the data was a monstrous amount of work. It took several weekends to work through the code and finally download 2.4 million unique files. I relied heavily on PowerShell scripts to parse the files and shred the XML into database tables in MS SQL Server.
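To give a feel for what "shredding" a filing involves, here is a minimal sketch in Python, used as a stand-in for the PowerShell and SQL Server pipeline described above (that code is not shown here). The sample document is hypothetical and heavily simplified; the element names follow my understanding of the Form 4 ownership XML and should be treated as assumptions to check against real filings.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified Form 4 fragment (an assumption for
# illustration; real filings carry many more fields and edge cases).
SAMPLE_FORM4 = """
<ownershipDocument>
  <issuer><issuerTradingSymbol>XYZ</issuerTradingSymbol></issuer>
  <nonDerivativeTable>
    <nonDerivativeTransaction>
      <transactionDate><value>2010-03-15</value></transactionDate>
      <transactionCoding><transactionCode>P</transactionCode></transactionCoding>
      <transactionAmounts>
        <transactionShares><value>1000</value></transactionShares>
        <transactionPricePerShare><value>12.50</value></transactionPricePerShare>
      </transactionAmounts>
    </nonDerivativeTransaction>
  </nonDerivativeTable>
</ownershipDocument>
"""


def shred_form4(xml_text):
    """Flatten each non-derivative transaction into one dict (one table row)."""
    root = ET.fromstring(xml_text.strip())
    symbol = root.findtext("issuer/issuerTradingSymbol")
    rows = []
    for txn in root.iterfind("nonDerivativeTable/nonDerivativeTransaction"):
        amounts = "transactionAmounts/"
        rows.append({
            "symbol": symbol,
            "date": txn.findtext("transactionDate/value"),
            "code": txn.findtext("transactionCoding/transactionCode"),
            "shares": float(txn.findtext(amounts + "transactionShares/value")),
            "price": float(txn.findtext(amounts + "transactionPricePerShare/value")),
        })
    return rows


rows = shred_form4(SAMPLE_FORM4)
print(rows)
```

Each dict corresponds to one row that could then be bulk-loaded into a database table. A production version would also need to handle missing elements, amended filings and malformed documents.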

With data from the years 2005 to 2015, the initial 2.4 million records were filtered down to 650,000 Insider Equity Buy transactions. I focused on Buys rather than Sells because the signal can be a bit murkier with sells. Insider selling happens for a great many innocent reasons, including diversification and paying living expenses. Also, I focused on equity trades rather than derivatives for similar reasons: it can be difficult to interpret the motivations behind various derivative trades. Open market buy orders, however, are generally quite clear.
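The filter described above can be sketched roughly as follows. The row layout and field names (`table`, `code`) are hypothetical, and the use of transaction code "P" for purchases and "S" for sales reflects my understanding of Form 4 coding rather than the exact rule used in the thesis.

```python
# Hypothetical rows as they might look after shredding filings into a table.
transactions = [
    {"symbol": "AAA", "table": "nonDerivative", "code": "P", "shares": 1000},
    {"symbol": "AAA", "table": "nonDerivative", "code": "S", "shares": 500},
    {"symbol": "BBB", "table": "derivative", "code": "P", "shares": 200},
    {"symbol": "CCC", "table": "nonDerivative", "code": "A", "shares": 300},
]


def is_equity_buy(txn):
    """Keep non-derivative rows whose transaction code marks a purchase."""
    return txn["table"] == "nonDerivative" and txn["code"] == "P"


equity_buys = [t for t in transactions if is_equity_buy(t)]
print(equity_buys)
```

Sales, derivative trades and other codes (such as awards) are dropped, leaving only the rows whose motivation is comparatively easy to interpret.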

After some careful cleansing, I had 11 years’ worth of useful SEC data, but I also needed pricing and market capitalization data, ideally data that would account for survivorship bias by including dead companies. Zacks Equity Prices and Sharadar’s Core US Fundamentals data sets, respectively, did the trick, and I could obtain both via Quandl at reasonable cost (about $350 per quarter).

For exploratory data analysis and model building, I used the R programming language. The models I utilized were linear regression, recursive partitioning, random forest and multivariate adaptive regression splines (MARS). I intended to make use of support vector machine (SVM) models as well, but experienced a great many performance issues when running on my laptop with a mere 4 cores. SVMs scale poorly to large training sets. I failed to overcome this issue and, unfortunately, abandoned the effort after 10-12 crashes.

For the recursive partitioning and random forest models I used functions from Microsoft’s RevoScaleR package, which allows for impressive scalability versus standard tree-based packages such as rpart and randomForest. Similar results can be expected, but the RevoScaleR functions take great advantage of multiple cores. I split my data into a training set for 2005-2011, a validation set for 2012-2013, and a test set for 2014-2015. Overall, performance was fairly similar across the algorithms tested, but in the end, the random forest prevailed.
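The chronological split above is simple to express in code. The sketch below is in Python rather than the R used for the actual modeling, and it assumes each record carries a filing date; the function name is hypothetical.

```python
from datetime import date


def assign_split(filing_date):
    """Map a filing date to train / validation / test by calendar year."""
    year = filing_date.year
    if 2005 <= year <= 2011:
        return "train"
    if 2012 <= year <= 2013:
        return "validation"
    if 2014 <= year <= 2015:
        return "test"
    return None  # outside the 2005-2015 study window


examples = [date(2008, 6, 30), date(2012, 1, 3), date(2015, 12, 31), date(2016, 1, 4)]
print([assign_split(d) for d in examples])
```

Splitting by time rather than at random keeps every validation and test observation strictly later than the training data, which avoids leaking future information into the model.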

For my response variable, I used 3-month relative returns vs the Russell 3000 index. For predictors, I utilized a handful of attributes directly from the filings and from related company information. The models proved quite predictive in the validation set as can be seen in exhibit 4.10 of the paper, and reproduced below:
The random forest’s predicted returns were significantly better for quintile 5, the highest predicted return grouping, relative to quintile 1 (the lowest). Quintiles 2 through 4 also lined up perfectly: actual performance correlated nicely with grouped predicted performance. The results in validation seemed very promising!
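For readers who want to reproduce this kind of check, here is a minimal Python sketch of the two pieces involved: the relative-return response and the grouping of actual returns by predicted-return quintile. It assumes the relative return is the simple difference between the stock's and the index's return over the same window, which is my reading rather than a definition taken from the paper, and the function names and toy inputs are hypothetical.

```python
def relative_return(stock_start, stock_end, index_start, index_end):
    """Stock return minus benchmark return over the same holding window."""
    stock_ret = stock_end / stock_start - 1.0
    index_ret = index_end / index_start - 1.0
    return stock_ret - index_ret


def mean_actual_by_group(predicted, actual, n_groups=5):
    """Rank observations by predicted return, cut the ranking into n_groups
    near-equal buckets (lowest predictions first), and average the actual
    returns within each bucket."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    n = len(predicted)
    if n < n_groups:
        raise ValueError("need at least one observation per group")
    order = sorted(range(n), key=lambda i: predicted[i])
    means = []
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        means.append(sum(actual[i] for i in members) / len(members))
    return means


# Toy inputs (not real data): actual outcomes rise with the predictions,
# so the bucket means increase from the first group to the last.
predicted = list(range(10))
actual = [10 * p for p in predicted]
print(mean_actual_by_group(predicted, actual))
```

A model with real predictive power should show bucket means that increase from the lowest group to the highest. Setting `n_groups=10` gives the decile version of the same check.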

However, when I ran the random forest model on the test set (2014-2015), the relationship broke down substantially, as can be seen in the paper’s Exhibit 5.2, reproduced below:

Fortunately, the predicted 1st decile was in fact the lowest performing actual return grouping. However, the actual returns on all remaining prediction deciles appeared no better than random. In addition, relative returns were negative for every decile.

While disappointing, it is important to recognize that when modeling time-dependent financial data, as the time-distance moves further away from the training set’s time-frame, performance of the model tends to decay. All market regimes, gradually or abruptly, end. This represents a partial (yet unsatisfying) explanation for this relative decrease in performance. Other effects that may have impaired prediction include the use of price, as well as market cap, as predictor variables. These factors certainly underperformed during the period used for the test set. Had I excluded these, and refined the filing specific features more deeply, perhaps I would have obtained a clearer signal in the test set.

In any event, this was a fun exercise where I learned a great deal about insider trading and its impact on future returns. Perhaps we can conclude that this signal has weakened over time, as the market has absorbed the informational value of insider trading data. However, further study, additional feature engineering and clever consideration of additional algorithms may be worth pursuing in the future.

John J Ryle, CFA lives in the Boston area with his wife and two children. He is a software developer at a hedge fund, a graduate of Northwestern’s Master’s in Predictive Analytics program (2017), a huge tennis fan, and a machine learning enthusiast. He can be reached at 

Upcoming Workshops by Dr. Ernie Chan

July 29 and August 5: Mean Reversion Strategies

In the last few years, mean reversion strategies have proven to be the most consistent winner. However, not all mean reversion strategies work in all markets at all times. This workshop will equip you with basic statistical techniques to discover mean reverting markets on your own, and describe the detailed mechanics of trading some of them. 

September 11-15: City of London workshops

These intense 8-16 hour workshops cover Algorithmic Options Strategies, Quantitative Momentum Strategies, and Intraday Trading and Market Microstructure. Typical class size is under 10. They may qualify for CFA Institute continuing education credits.

Industry updates
  • allows users to record order book data for backtesting.
  • Pair Trading Lab offers a web-based platform for easy backtesting of pairs strategies.