Wednesday, July 14, 2021

Metalabeling and the duality between cross-sectional and time-series factors

By Ernest Chan and Akshay Nautiyal

Features are inputs to supervised machine learning (ML) models. In traditional finance, they are typically called “factors”, and they are used in linear regression models to either explain or predict returns. In the former usage, the factors are contemporaneous with the target returns, while in the latter the factors must be from a prior period.

There are generally two types of factors: cross-sectional vs time-series. If you are modeling stock returns, cross-sectional factors are variables that are specific to an individual stock, such as its earnings yield, dividend yield, etc. In our previous blog post, we described how we provide 40 such factors to our subscribers for backtesting and live predictions. But as we advocate using ML for risk management and capital allocation purposes (i.e. metalabeling), not for returns predictions, you may wonder how these factors can help predict the returns of your trading strategy or portfolio. For example, if you have a long-short portfolio of tech stocks such as AAPL, GOOG, AMZN, etc., and want to predict whether the portfolio as a whole will be profitable in a certain market regime, does it really make sense to have the earnings yields of AAPL, GOOG, and AMZN as individual features?

Meanwhile, time-series factors are typically market-wide or macroeconomic variables such as the familiar Fama-French three factors: market (simply, the market index return), SMB (the relative return of small cap vs large cap stocks), and HML (the relative return of value vs growth stocks). These time-series factors are eminently suitable for metalabeling, because they can be used to predict your portfolio or strategy’s returns.

Given that there are many more obvious cross-sectional factors than time-series factors available, it seems a pity that we cannot use cross-sectional factors as features for metalabeling. Actually, we can: Eugene Fama and Ken French themselves showed us how. If we have a cross-sectional factor on a stock, all we need to do is use it to rank the stocks, form a long-short portfolio using the rankings, and use the returns of this portfolio as a time-series factor. The long-short portfolio is called a hedge portfolio.

We illustrate the creation of a hedge portfolio with an example, starting with Sharadar’s fundamental cross-sectional factors (which we generated as shown in the previous blog post). There are 40 cross-sectional factors, updated at three different frequencies: quarterly, yearly, and trailing twelve months. In this exercise, however, we use only the quarterly cross-sectional factors. Given a factor like capex (capital expenditure), we consider the normalized capex (the normalization procedure is described in the previously cited blog post) of approximately 8,500 stocks on particular dates from January 1st, 2010 to the present. There are four dates of interest every year: January 15th, April 15th, July 15th, and October 15th. We call these the ranking dates. On each of these dates we find the percentile rank of each stock based on its normalized capex. The dates are chosen so that the cross-sectional factors of as many stocks as possible reflect the latest quarterly filings.

Once the stocks are ranked by capex on each of the four ranking dates each year, we identify the stocks in the upper quartile (i.e. ranked above the 75th percentile) and the stocks in the lower quartile (i.e. ranked below the 25th percentile). We take a long position in the stocks with the highest normalized capex and a short position in those with the lowest. Together, these two sets make up our long-short hedge portfolio.

Once we have the portfolio on a given ranking date, we generate its daily returns using risk parity allocation (i.e. allocating capital in proportion to inverse volatility). The daily returns of each chosen stock are calculated for each day until the next ranking date. The portfolio weights on each day are the normalized inverses of the rolling standard deviations of returns over a two-month window. These weights change daily and are multiplied by the daily returns of the individual stocks to obtain the daily portfolio returns. If a portfolio stock is delisted between ranking dates, we simply drop it and do not use it to calculate the portfolio returns. The daily returns generated in this process are the capex time-series factor. This process is repeated for all other Sharadar cross-sectional factors.
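The construction described above can be sketched in a few lines of pandas. This is only an illustration on synthetic data, not our production code: the function name `hedge_portfolio_returns`, the 42-trading-day window standing in for "two months", the one-day lag on the volatility estimate, and the choice to normalize the long and short legs separately are all assumptions made for the sketch.

```python
import numpy as np
import pandas as pd

def hedge_portfolio_returns(factor, returns, window=42):
    """Daily long-short returns for one holding period.

    factor  : Series of normalized factor values per ticker on the ranking date.
    returns : DataFrame of daily returns (rows: dates, columns: tickers).
    window  : rolling window in trading days (42 is roughly two months).
    """
    pct = factor.rank(pct=True)
    longs = pct[pct > 0.75].index    # upper quartile: long
    shorts = pct[pct < 0.25].index   # lower quartile: short

    # Inverse rolling volatility, lagged one day so weights use only past data.
    inv_vol = 1.0 / returns.rolling(window).std().shift(1)

    def leg(tickers):
        # Stocks with no return on a day (e.g. delisted) get no weight that day.
        w = inv_vol[tickers].where(returns[tickers].notna())
        w = w.div(w.sum(axis=1), axis=0)   # weights sum to 1 within the leg
        return (w * returns[tickers]).sum(axis=1, min_count=1)

    return (leg(longs) - leg(shorts)).dropna()

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
tickers = [f"S{i}" for i in range(20)]
dates = pd.bdate_range("2021-01-04", periods=100)
returns = pd.DataFrame(rng.normal(0, 0.01, (100, 20)), index=dates, columns=tickers)
factor = pd.Series(rng.normal(size=20), index=tickers)

capex_factor = hedge_portfolio_returns(factor, returns)
print(capex_factor.head())
```

In this toy run the first 42 days serve only as warm-up for the volatility estimate and are dropped from the output. With real data you would pass in enough history before the ranking date and keep only the days up to the next ranking date.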

So, voila! 40 cross-sectional factors become 40 time-series factors, and they can be used for metalabeling any portfolio or trading strategy, whether it trades stocks, futures, FX, or anything at all.

What about the opposite conversion? Can we turn time-series factors into cross-sectional factors suitable for predicting the returns of individual stocks? Actually, there is no need. You can directly add any time-series factor to your feature set for predicting individual stock’s returns. This is equivalent to building a linear factor model with an individual stock’s returns as dependent variable and the time-series factor as independent variable, a process well-known in traditional finance.
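As a minimal sketch of such a linear factor model, the snippet below regresses a synthetic stock's returns on a synthetic time-series factor using ordinary least squares. All numbers (the true beta of 1.5, the noise levels) are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_days = 500
factor = rng.normal(0, 0.01, n_days)        # synthetic time-series factor returns
true_beta = 1.5                             # made-up factor loading
stock = 0.0002 + true_beta * factor + rng.normal(0, 0.005, n_days)

# Fit stock_t = alpha + beta * factor_t + eps_t by least squares.
X = np.column_stack([np.ones(n_days), factor])
(alpha, beta), *_ = np.linalg.lstsq(X, stock, rcond=None)
print(f"alpha={alpha:.5f}, beta={beta:.2f}")
```

This version is explanatory, since the factor is contemporaneous with the return. For a predictive model, the factor column would be lagged so that each return is paired with the factor value from a prior period.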

On a side note: besides these 40 time-series (and their corresponding cross-sectional) features, we have compiled an additional 197 proprietary time-series features available to our Premium subscribers, and available via our API.

Thursday, April 01, 2021

Conditional Parameter Optimization: Adapting Parameters to Changing Market Regimes via Machine Learning

Every trader knows that there are market regimes that are favorable to their strategies, and other regimes that are not. Some regimes are obvious, like bull vs bear markets, calm vs choppy markets, etc. These regimes affect many strategies and portfolios (unless they are market-neutral or volatility-neutral portfolios) and are readily observable and identifiable (but perhaps not predictable). Other regimes are more subtle, and may only affect your specific strategy. Regimes may change every day, and they may not be observable. It is often not as simple as saying the market has two regimes, and we are currently in regime 2 instead of 1. For example, with respect to the profitability of your specific strategy, the market may have 5 different regimes. But it is not easy to specify exactly what those 5 regimes are, and which of the 5 we are in today, not to mention predicting which regime we will be in tomorrow. We won’t even know that there are exactly 5!

Regime changes sometimes necessitate a complete change of trading strategy (e.g. trading a mean-reverting instead of a momentum strategy). Other times, traders just need to change the parameters of their existing trading strategy to adapt to a different regime. My colleagues and I have come up with a novel way of adapting the parameters of a trading strategy, a technique we call “Conditional Parameter Optimization” (CPO). This patent-pending invention allows traders to adopt new parameters as frequently as they like, perhaps for every trading day or even every single trade.

CPO uses machine learning to place orders optimally based on changing market conditions (regimes) in any market. Traders in these markets typically already possess a basic trading strategy that decides the timing, pricing, type, and/or size of such orders. This trading strategy will usually have a small number of adjustable trading parameters. Conventionally, they are often optimized based on a fixed historical data set (“train set”). Alternatively, they may be periodically reoptimized using an expanding or rolling train set. (The latter is often called “Walk Forward Optimization”.) With a fixed train set, the trading parameters clearly cannot adapt to changing regimes. With an expanding train set, the trading parameters still cannot respond to rapidly changing market conditions because the additional data is but a small fraction of the existing train set. Even with a rolling train set, there is no evidence that the parameters optimized in the most recent historical period give better out-of-sample performance. A rolling train set that is too small will also give unstable and unreliable predictive results, given the lack of statistical significance. All these conventional optimization procedures can be called unconditional parameter optimization, as the trading parameters do not intelligently respond to rapidly changing market conditions. Ideally, we would like trading parameters that are much more sensitive to the market conditions and yet are trained on a large enough amount of data.

To address this adaptability problem, we apply a supervised machine learning algorithm (specifically, random forest with boosting) to learn from a large predictor (“feature”) set that captures various aspects of the prevailing market conditions, together with specific values of the trading parameters, to predict the outcome of the trading strategy. (An example outcome is the strategy’s future one-day return.) Once such a machine-learning model is trained to predict the outcome, we can apply it to live trading by feeding in the features that represent the latest market conditions as well as various combinations of the trading parameters. The set of parameters that results in the optimal predicted outcome (e.g., the highest future one-day return) will be selected as optimal, and will be adopted for the trading strategy for the next period. The trader can make such predictions and adjust the trading strategy as frequently as needed to respond to rapidly changing market conditions.
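The idea can be sketched with a toy example. Everything here is an assumption for illustration: a single market feature `vol`, a single trading parameter, a synthetic outcome in which the best parameter rises with volatility, and a plain scikit-learn `RandomForestRegressor` standing in for the learning algorithm we actually use.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 2000
vol = rng.uniform(0.0, 1.0, n)                # hypothetical market-condition feature
param = rng.choice([0.5, 1.0, 1.5, 2.0], n)   # hypothetical parameter used on each day
# Synthetic outcome: the best parameter value rises with volatility.
best = 0.5 + 1.5 * vol
outcome = -(param - best) ** 2 + rng.normal(0, 0.05, n)

# Train on market features AND parameter values together.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=20, random_state=0)
model.fit(np.column_stack([vol, param]), outcome)

def conditional_optimum(current_vol, grid=(0.5, 1.0, 1.5, 2.0)):
    """Return the grid value with the highest predicted outcome under current conditions."""
    X = np.array([[current_vol, p] for p in grid])
    preds = model.predict(X)
    return grid[int(np.argmax(preds))]

print(conditional_optimum(0.05), conditional_optimum(0.95))
```

The key design point is that the parameters enter the model as ordinary input columns, so one trained model can score any parameter combination under today's conditions without being retrained.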

In the example you can download here, I illustrate how we apply CPO using our financial machine learning API to adapt the parameters of a Bollinger Band-based mean reversion strategy on GLD (the gold ETF) and obtain superior results, which I highlight here:

[Table: Annual Return, Sharpe Ratio, and Calmar Ratio for Unconditional Optimization vs. Conditional Optimization. The values are missing from this copy.]

The CPO technique is useful in industry verticals other than finance as well. After all, optimization under time-varying and stochastic conditions is a very general problem. For example, wait times in a hospital emergency room may be minimized by optimizing various parameters, such as staffing level, equipment and supplies readiness, discharge rate, etc. Current state-of-the-art methods generally find the optimal parameters by looking at what worked best on average in the past. There is also no mathematical function that exactly determines wait time based on these parameters. The CPO technique employs other variables such as time of day, day of week, season, weather, whether there are recent mass events, etc. to predict the wait time under various parameter combinations, and thereby find the optimal combination under the current conditions in order to achieve the shortest wait time.

We can provide you with the scripts to run CPO on your own strategy using our API. Please email us for a free trial.

Friday, January 22, 2021

The Amazing Efficacy of Cluster-based Feature Selection

One major impediment to the widespread adoption of machine learning (ML) in investment management is its black-box nature: how would you explain to an investor why the machine makes a certain prediction? What's the intuition behind a certain ML trading strategy? How would you explain a major drawdown? This lack of "interpretability" is not just a problem for financial ML; it is a prevalent issue in applying ML to any domain. If you don’t understand the underlying mechanisms of a predictive model, you may not trust its predictions.

Feature importance ranking goes a long way towards providing better interpretability to ML models. The feature importance score indicates how much information a feature contributes when building a supervised learning model. The importance score is calculated for each feature in the dataset, allowing the features to be ranked. The investor can therefore see the most important predictors (features) used in the predictions, and in fact apply "feature selection" to only include those important features in the predictive model. However, as my colleague Nancy Xin Man and I have demonstrated in Man and Chan 2021a, common feature selection algorithms (e.g. MDA, LIME, SHAP) can exhibit high variability in the importance rankings of features: different random seeds often produce vastly different importance rankings. For example, if we run MDA on some cross-validation set multiple times with different seeds, a feature may be ranked at the top of the list in one run but drop to the bottom in the next. This variability of course eliminates any interpretability benefit of feature selection. Interestingly, despite this variability in importance ranking, feature selection still generally improves out-of-sample predictive performance on multiple data sets that we tested in the above paper. This may be due to the "substitution effect": many alternative (substitute) features can be used to build predictive models with similar predictive power. (In linear regression, the substitution effect is called "collinearity".)

To reduce variability (or what we called instability) in feature importance rankings and to improve interpretability, we found that LIME is generally preferable to SHAP, and definitely preferable to MDA. Another way to reduce instability is to increase the number of iterations during runs of the feature importance algorithms. In a typical implementation of MDA, every feature is permuted multiple times. But standard implementations of LIME and SHAP have set the number of iterations to 1 by default, which isn't conducive to stability. In LIME, each instance and its perturbed samples only fit one linear model, but we can perturb them multiple times to fit multiple linear models. In SHAP, we can permute the samples multiple times. Our experiments have shown that the instability of the top-ranked features does approximately converge to some minimum as the number of iterations increases; however, this minimum is not zero. So there remains some residual variability of the top-ranked features, which may be attributable to the substitution effect as discussed before.
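To see the seed dependence for yourself, here is a rough sketch using scikit-learn's `permutation_importance`, which measures the drop in a model's score when a feature is shuffled and so plays the role of MDA here. The dataset, the forest size, the simple train/test split (rather than cross-validation), and the helper name `top_features` are choices made for this illustration, not the setup used in our papers.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)

def top_features(seed, n_repeats, k=5):
    """Names of the k features with the largest mean accuracy drop for one seed."""
    clf = RandomForestClassifier(n_estimators=50, random_state=seed).fit(X_tr, y_tr)
    result = permutation_importance(
        clf, X_te, y_te, n_repeats=n_repeats, random_state=seed
    )
    order = result.importances_mean.argsort()[::-1][:k]
    return [data.feature_names[i] for i in order]

# Compare the top-5 lists across seeds; n_repeats is the number of permutations per feature.
rankings = {seed: top_features(seed, n_repeats=5) for seed in range(3)}
for seed, names in rankings.items():
    print(seed, names)
```

Raising `n_repeats` corresponds to increasing the number of iterations discussed above. Comparing the printed lists across seeds gives a quick, informal view of how stable the top of the ranking is.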

To further improve interpretability, we want to remove the residual variability. López de Prado, M. (2020) described a method that clusters together features that are similar and should receive the same importance rankings. This promises to be a great way to remove the substitution effect. In our new paper Man and Chan 2021b, we applied a hierarchical clustering methodology prior to MDA feature selection to the same data sets we studied previously. This method is generally called cMDA. As they say in social media click baits, the results will (pleasantly) surprise you.

For the benchmark breast cancer dataset, the top two clusters found were:


Columns: Topic, Cluster Importance Scores, Cluster Rank, Features. (The score and rank values are missing from this copy.)

Topic: Geometry summary
Features: 'mean radius', 'mean perimeter', 'mean area', 'mean compactness', 'mean concavity', 'mean concave points', 'radius error', 'perimeter error', 'area error', 'worst radius', 'worst perimeter', 'worst area', 'worst compactness', 'worst concavity', 'worst concave points'

Topic: Texture summary
Features: 'mean texture', 'worst texture'

Not only do these clusters have clear interpretations (provided by us as a "Topic"), these clusters almost never change in their top importance rankings under 100 random seeds! 
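A clustered importance calculation of this kind can be sketched on the same breast cancer dataset. The sketch below is not the code from our paper: the correlation-based distance, average linkage, the cap of 4 clusters, and the single train/test split are all assumptions chosen to keep the example short. The essential step is that all features in a cluster are shuffled together, so substitutes cannot cover for each other.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(data.data, data.target, random_state=0)

# Hierarchical clustering of features on a correlation-based distance (assumed metric).
corr = np.clip(np.corrcoef(X_tr, rowvar=False), -1.0, 1.0)
dist = np.sqrt(0.5 * (1.0 - corr))
np.fill_diagonal(dist, 0.0)
tree = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(tree, t=4, criterion="maxclust")   # at most 4 clusters (assumed)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
base = clf.score(X_te, y_te)
rng = np.random.default_rng(0)

def cluster_importance(cluster_id, n_repeats=10):
    """Mean accuracy drop when every feature in the cluster is shuffled jointly."""
    cols = np.where(labels == cluster_id)[0]
    drops = []
    for _ in range(n_repeats):
        X_perm = X_te.copy()
        perm = rng.permutation(len(X_perm))
        X_perm[:, cols] = X_perm[perm][:, cols]   # same row shuffle for the whole cluster
        drops.append(base - clf.score(X_perm, y_te))
    return float(np.mean(drops))

scores = {int(c): cluster_importance(c) for c in np.unique(labels)}
print(scores)
```

Printing `data.feature_names[labels == c]` for each cluster id `c` shows which features were grouped together, which is where the interpretation of each cluster comes from.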

Closer to our financial focus, we also applied cMDA to a public dataset with features that may be useful for predicting S&P 500 index excess monthly returns. The two clusters found are:

Columns: Cluster Scores, Cluster Rank, Features. (The score and rank values are missing from this copy.)

First cluster: d/p, d/y, e/p, b/m, ntis, tbl, lty, dfy, dfr, infl

Second cluster: d/e, svar, ltr, tms

The two clusters can clearly be interpreted as fundamental vs technical indicators, and their rankings don't change: fundamental indicators are always found to be more important than technical indicators in all 100 runs with different random seeds.

Finally, we apply this technique to our proprietary features for predicting the success of our Tail Reaper strategy. Again, the top 2 clusters are highly interpretable, and never change with random seeds. (Since these are proprietary features, we omit displaying them.) 

If we select only those clearly interpretable, top clusters of features as input to training our random forest, we find that their out-of-sample predictive performances are also improved in many cases. For example, the accuracy of the S&P 500 monthly returns model improves from 0.517 to 0.583 when we use cMDA instead of MDA, while the AUC score improves from 0.716 to 0.779.


[Table: S&P 500 monthly returns prediction. The table contents are missing from this copy.]

Meanwhile, the accuracy of the Tail Reaper metalabeling model improves from 0.529 to 0.614 when we use cMDA instead of MDA and select all clustered features with above-average importance scores, while the AUC score improves from 0.537 to 0.672.


This added bonus of improved predictive performance is a by-product of capturing all the important, interpretable features, while removing most of the unimportant, uninterpretable features. 

You can try out this hierarchical cluster-based feature selection for free on our financial machine learning SaaS. You can use the no-code version, or ask for our API. Details of our methodology can be found here.

Industry News

  1. Jay Dawani recently published a very readable, comprehensive guide to deep learning "Hands-On Mathematics for Deep Learning".
  2. is a new algo strategy marketplace that allows one to build algo strategies without coding and others to subscribe to them and take trades in their own linked brokerage accounts automatically. It can handle complex strategies such as arbitrage and options strategies. Currently some 400 algos are on offer.
  3. Jonathan Landy, a Caltech physicist, together with 3 of his physicist friends, has started a deep data science and machine learning blog with special emphasis on finance.

Thursday, August 06, 2020

What is the probability of profit of your next trade? (Introducing PredictNow.Ai)

What is the probability of profit of your next trade? You would think every trader can answer this simple question. Say you look at your historical trades (live or backtest) and count the winners and losers, and come up with a percentage of winning trades, say 60%. Is the probability of profit of your next trade 0.6? This might be a good initial estimate, but it is also a completely useless number. Let me explain.


This 0.6 is what may be called an unconditional probability of profit. It is the same for every trade that you will ever make (unless your winning ratio changes significantly in the future), so it is useless as a guide to whether you should take the next specific trade or not. It can of course tell you whether you should trade this strategy in general (e.g. you may not want to trade a strategy with an unconditional probability of profit, a.k.a. winning ratio, less than 0.51). But it can’t do so on a trade-by-trade basis. The latter is the conditional probability of profit. As the adjective suggests, this probability is conditioned on the specific market environment at the time when you expect to trade.

Let's say you are trading a short volatility strategy. It can be an algorithmic, or even discretionary, strategy. If you are trading it during a very calm market, it is likely that your conditional probability of profit would be quite high. If you are trading during a financial crisis, it could be very low. The conditions that can determine the probability may even be quantifiable.  The level of VIX? The recent SPY returns? How about the interest rate change or Nonfarm Payroll number just announced? Or even the % change in Covid-19 cases on the previous day? You may not have taken all these myriad numbers into account when you were building your simple trading strategy, or when you decide to make a discretionary trade, but you can't deny they may have an impact on the conditional probability of profit. So how are we to compute this probability?

Spoiler alert: computing this conditional probability helped us earn a 64% YTD return as of June 2020. You can find out how to do that with our predictive service. But more on that later.


The only known way to compute this conditional probability is machine learning. Let's return to the example of your short volatility strategy above. Suppose you prepare a spreadsheet of the returns of the historical trades you have done, like this:


Figure 1: Spreadsheet with historical returns of short vol trades.


Again, these trades could be generated by an algorithm, or they could be discretionary (perhaps based on some combination of fundamental analysis and intuition, as Warren Buffett does).


Now let's say we only care about whether they are profitable or not, so we ignore the magnitude of returns and label those trades that are profitable 1, otherwise 0. (These are called "metalabels" by Marcos Lopez de Prado, who pioneered this financial machine learning technique. They are “meta” because he assumed the original simple strategy is used to predict the ups and downs of the market itself – those are the base predictions, or labels. The metalabels are on whether those base predictions are correct or not.) The resulting spreadsheet looks like this. 
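In code, the labeling step is a one-liner. The returns below are made-up numbers for illustration; the same list also gives the unconditional probability of profit discussed earlier.

```python
# Hypothetical returns of five historical trades.
trade_returns = [0.012, -0.004, 0.0, 0.031, -0.018]

# Metalabel: 1 if the trade was profitable, otherwise 0.
metalabels = [1 if r > 0 else 0 for r in trade_returns]

# The unconditional probability of profit is just the winning ratio.
unconditional_p = sum(metalabels) / len(metalabels)
print(metalabels, unconditional_p)
```

Note that a trade with exactly zero return is labeled 0 here, following the "profitable 1, otherwise 0" rule.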


Figure 2: Spreadsheet with labels: are historical returns of short vol strategy profitable?

Simple, right? Now comes the hard part. Your intuition tells you that there are some variables that you didn't take into account in your original, simple, trading strategy. There are just too many of these variables, and you don't know how to incorporate them to improve your trading strategy. You don't even know if some of them are useless. But that's not a problem for machine learning. You can add as many variables, called features / predictors / independent variables, as you like, useful or not. The machine learning algorithm will get rid of the useless features via a process called feature selection. But more on that later.


So let's say for every historical trade (represented by a row in the spreadsheet), you collect some features like VIX, 1-day SPY return, change in interest rate on the previous day, etc. We must, of course, ensure that these features' values were known prior to each trade's entry time, otherwise there will be look-ahead bias and you won't be able to use this system for live trading. So here is how your spreadsheet augmented with features may look: 
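One way to guard against look-ahead bias when assembling such a spreadsheet is an "as-of" join that only picks up feature values stamped strictly before each trade's entry time. The sketch below uses pandas `merge_asof`; the timestamps, column names, and VIX values are hypothetical.

```python
import pandas as pd

# Hypothetical trades with their metalabels.
trades = pd.DataFrame({
    "entry_time": pd.to_datetime(["2020-03-02 09:30", "2020-03-03 09:30"]),
    "label": [1, 0],
})
# Hypothetical feature observations with the time each became known.
features = pd.DataFrame({
    "time": pd.to_datetime(["2020-02-28 16:00", "2020-03-02 16:00", "2020-03-03 09:30"]),
    "vix": [40.1, 33.4, 36.8],
})

# For each trade, take the latest feature row stamped strictly before entry.
augmented = pd.merge_asof(
    trades.sort_values("entry_time"),
    features.sort_values("time"),
    left_on="entry_time",
    right_on="time",
    direction="backward",
    allow_exact_matches=False,
)
print(augmented[["entry_time", "label", "vix"]])
```

With `allow_exact_matches=False`, the second trade ignores the observation stamped at exactly 09:30 on its entry day and falls back to the previous close, which is the conservative choice.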

Figure 3: Spreadsheet with features augmented.


OK, now that you have prepared all these historical data, how do you build (or "train", in machine learning parlance) a predictive model based on that? You may not know it, but you have probably used the simplest kind of machine learning model already, maybe way back in a college statistics class. It is called linear regression, or its close sibling logistic regression for our binary (profit or not) classification problem. Those features that you created above are just the independent variables, often called X (a vector of many variables), and the labels are just the dependent variable, often called Y (with values of 0 or 1). But applying linear or logistic regression to a large, disparate set of features to predict a label usually fails, because many relationships cannot be captured by a linear model. The nonlinear co-dependences between these predictors need to be discovered and utilized. For example, maybe when VIX <= 15, the 1-day SPY return isn't useful for predicting the probability of profit of your trade. But when VIX > 15, the 1-day SPY return is very useful. This type of relationship is best discovered using a "supervised" hierarchical learning algorithm called random forest, which is what we have implemented.
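A toy experiment makes the point. The data-generating rule below is invented for illustration (in calm markets the trade always wins; otherwise it wins only after an up day in SPY), and scikit-learn's `LogisticRegression` and `RandomForestClassifier` stand in for the two model families.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 4000
vix = rng.uniform(10, 30, n)
spy_ret = rng.normal(0, 1, n)   # standardized 1-day SPY return (synthetic)

# Assumed rule: SPY return is irrelevant when VIX < 15, decisive otherwise.
label = np.where(vix < 15, 1, (spy_ret > 0).astype(int))

X = np.column_stack([vix, spy_ret])
X_tr, X_te, y_tr, y_te = train_test_split(X, label, random_state=0)

logit_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
forest_acc = (
    RandomForestClassifier(n_estimators=100, random_state=0)
    .fit(X_tr, y_tr)
    .score(X_te, y_te)
)
print(f"logistic regression: {logit_acc:.3f}, random forest: {forest_acc:.3f}")
```

No single straight line in the (VIX, SPY return) plane separates the winners from the losers under this rule, while a tree only needs two splits, one on each variable.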


A random forest algorithm may discover the hypothetical relationship between VIX, 1-day SPY return, …, and whether your short vol trade will be profitable as illustrated in this schematic diagram: 

Figure 4: Example classification tree generated internally.


To build this tree, and all its cousins that together form a "random forest", all you need to do is upload your spreadsheet above to our service, click a button, and it will probably be done in less than 15 minutes, often much sooner. (Certainly faster than a pizza delivery.)


Figure 5: Choosing training mode.


Figure 6: Uploading training data.


Figure 7: Choosing hyperparameters for building random forest.


Once this random forest is built (trained) with historical data, it is ready for your live trading. You can just plug in the latest values for VIX, 1-day SPY, and any other features into a new spreadsheet like this:


Figure 8: Live trading input.

Notice that the format of this spreadsheet is the same as the training data, except that there is of course no known Return: we are hoping to predict that! You can upload this to our service together with the model you just trained, press PREDICT, 

Figure 9: Live prediction.

and voila! You can now download the random forest's prediction of whether that trade will be profitable, and with what conditional probability.

Figure 10: Live prediction, with probability.

One of the output files (left in Figure 10) tells you the most likely outcome of your trade: profit or not. The other file (right one in Figure 10) tells you the probability of that outcome. You can use that probability to size your trade. For example, you may decide that if the probability of profit is higher than 0.6, you will buy $10K of TSLA. But if the probability is between 0.51 and 0.6, you will only buy $5K, while if the probability is lower than 0.51, you won’t buy at all.
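The sizing rule in this example translates directly into a small function. The function name and the treatment of the exact boundary values (0.51 and 0.6 both fall in the middle tier) are choices made for this sketch.

```python
def position_size(p_profit):
    """Dollar amount to buy, given the predicted probability of profit."""
    if p_profit > 0.6:
        return 10_000      # high confidence: full size
    if p_profit >= 0.51:
        return 5_000       # moderate confidence: half size
    return 0               # otherwise: skip the trade

for p in (0.7, 0.55, 0.4):
    print(p, position_size(p))
```

The thresholds and dollar amounts are just those from the example above. In practice they would be chosen to suit your own strategy and risk tolerance.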


Typically the live prediction will take 1 second or less, while the training (which may not need to be re-done more than once a quarter) typically won't take more than 15 minutes even for thousands of rows of historical data with 100 features. You can make live predictions as frequently as you like (i.e. as frequently as your input changes), but if you are a high frequency trader, you would want to use our API so that our predictions can be seamlessly integrated with your trading system.

But predicting the conditional probability of profit for your next trade is not all that we can do. We can also tell you what features are important in making that prediction. In fact, you may be more interested in that than a black-box prediction, because this list of important features, sorted in decreasing order of importance, may help you improve your underlying simple trading strategy. In other words, it can help improve your intuition about what works with your strategy, so you can change your trading rules.


Going back to our example, our service can generate such a graph for you: 

Figure 11: Features with decreasing importance

You can see that VIX was deemed the most important feature, followed by 1-day SPY return, the latest interest rate change, and so on. Our internal predictive algorithm will actually remove all features that are "below average" and retrain the model, but you may benefit from incorporating just VIX and 1-day SPY return in your simple strategy when it generates a trading signal. Remember, your simple strategy does not need to be an algorithmic strategy. It could be discretionary.
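The "drop the below-average features and retrain" step can be sketched as follows. Note the substitutions: this toy uses the random forest's built-in impurity-based importances rather than the importance algorithm our service runs internally, and the feature names and data are synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
names = ["vix", "spy_1d_ret", "rate_change", "noise_1", "noise_2"]
X = rng.normal(size=(n, len(names)))
# Synthetic labels driven mainly by the first two columns.
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(0, 0.5, n) > 0).astype(int)

# Step 1: fit on all features and score their importance.
full = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importance = full.feature_importances_

# Step 2: keep only features with above-average importance, then retrain.
keep = importance > importance.mean()
selected = [name for name, k in zip(names, keep) if k]
reduced = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:, keep], y)

print(dict(zip(names, importance.round(3))))
print("selected:", selected)
```

When predicting with the reduced model, the live input must be restricted to the same selected columns, in the same order.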


(For the machine learning mavens among you, we use SHAP for feature selection, as discussed in our paper.)

You may wonder why our predictive service is restricted to only taking your strategy’s historical or live trades as input and predicting their probabilities of profit. Why can’t it be used directly to predict the market’s return? Of course it can: you only need to pretend that your strategy is buying and holding the market. It can even predict the magnitude, not just the sign, of the return. But as we all know, it is very hard to predict the market’s movement, because of the low signal-to-noise ratio. Your own strategy, however, has presumably found a way to filter out that noise, and machine learning prediction is more likely to succeed in telling you what “regime” is favorable/unfavorable to your strategy, and with what probability. Another usage of our service is to predict numbers that are not subject to arbitrage, such as a company’s earnings surprise, credit rating change, or the US nonfarm payroll surprise (as we have already done successfully). In these usages, there are no adversaries (your fellow traders) trying their hardest to arbitrage away your trading alpha, so these predictions will be more likely to work far into the future. 

(For machine learning mavens, you may wonder why we have only implemented the random forest learning algorithm. The beauty of random forest is that it is simple, but not too simple. Complicated deep learning algorithms such as LSTM can indeed take into account the time series dependence of the features and labels more readily, but they run a serious risk of data snooping due to the large number of parameters to fit. GPT-3, the latest and hottest deep learning algorithm for natural language processing, for example, has 175 billion parameters to fit. Imagine fitting that to 1,000 historical trades!)

So does this stuff really work? We implemented this machine learning system for our Tail Reaper strategy in our fund around August 2019. Yes, the 64% YTD return as of June 2020 (net of 25% incentive fee!) is nice, but what's more amazing is that the machine learning program told us not to enter any trade (due to the low conditional probability of profit) from Nov 2019 - Jan 2020. In retrospect, that made sense because Tail Reaper is a crisis alpha, tail hedge strategy. There was no crisis, no tail movement, from which to reap profits in those calm months. But suddenly, starting on February 1, 2020, this machine learning program told us to expect a crisis. We thought the machine learning program was nuts: there were just a handful of Covid-19 cases in the US at that time! Nonetheless we followed its advice and restarted Tail Reaper. It went on to capture over 12% return later that month, and the rest is history. (Past performance is not necessarily indicative of future results. For detailed disclosure of this strategy, please visit


Figure 12: Tail Reaper equity curve.

For readers interested in a free trial or in participating in a live webinar on how to use our predictive service to predict the conditional probability of profit of your trades, please sign up here.