Monday, December 09, 2019

US nonfarm employment prediction using RIWI Corp. alternative data


Introduction

The monthly US nonfarm payroll (NFP) announcement by the United States Bureau of Labor Statistics (BLS) is one of the most closely watched economic indicators, for economists and investors alike. (When I was teaching a class at a well-known proprietary trading firm, the traders suddenly ran out of the classroom to their desks on a Friday morning just before 8:30am EST.) Naturally, there have been many past efforts to predict this number, ranging from using other macroeconomic indicators such as credit spreads to using Twitter sentiment as predictive features. In this article, I will report on research conducted by Radu Ciobanu and me using the unique and proprietary continuous survey data provided by RIWI Corp. to predict this important number.

RIWI is an alternative data provider that conducts online surveys and risk measurement monitoring in all countries of the world anonymously, without collecting any personally identifiable information or providing incentives to respondents. RIWI’s technology has collected and analyzed more than 1.5 billion responses globally. Critically, in their surveys, they can reach a segment of the population that is usually hidden: three quarters of their respondents across the world have not answered a survey of any kind in the preceding month. Their surveys strive to be as representative of the general online population as possible, without the usual bias towards the loud social media voices. This is important in predictive data for financial markets, where it is vital to separate noise from signal.

The financial market reacts mainly to surprise, i.e. the difference between the actual announced NFP number and the Wall Street consensus. This surprise can move not only the US financial markets, but international markets as well. Case in point: I watched the German DAX index move sharply higher last week (December 6, 2019) due to the huge positive surprise (adding 266K jobs instead of the Wall Street consensus of 183K). Therefore the surprise is what we want to predict. We compared predicting the sign of this surprise using machine learning with the RIWI score as the only feature vs. a number of other benchmarks that do not include the RIWI score, and found that the RIWI score generates higher predictive accuracy than all other benchmarks in cross-validation tests. We also predicted both the magnitude and sign of the NFP surprise: including the RIWI score as one of the features achieved a smaller average cross-validated mean squared error (MSE) than any model that excluded it. Limited out-of-sample results indicate the RIWI score continues to have significant predictive power for both sign and magnitude.

Data

The historical NFP monthly numbers were seasonally adjusted by the BLS. These numbers are released on the first Friday of every month, at 8:30 am ET (except on certain national holidays, when they are released one day earlier or delayed by one week). To compute the surprise, we subtract the Wall Street consensus on the day before the announcement from the actual NFP number.
The RIWI data were based on their online surveys of US consumers, and consist of two datasets. The first covers December 2013 - October 2017 and the second covers September 2018 - September 2019. The former dataset is based on the yes/no answer to the following survey question: ‘Are you working for more than 35 hours per week?’. The latter dataset is based on several survey questions related to opinions regarding US companies or products, along with respondents’ personal background, such as their employment status (full-time/part-time/student/retired), marital status, etc. In order to merge the two datasets, we regard respondents who said they worked “full-time” or “part-time” as equivalent to “working more than 35 hours per week”. If we were to count only the “full-time” respondents, a significant structural break in the time series would be observed between the two periods, as seen in Figure 1 below.



Figure 1: Weighted monthly RIWI score, without seasonal adjustments, including only “Full-Time” respondents, for Dec 2013-Oct 2017 and Sep 2018-Sep 2019.

If we include both “Full-time” and “Part-Time” respondents, we obtain Figure 2 below, which clearly doesn’t have that structural break.




Figure 2: Weighted monthly RIWI score, without seasonal adjustments, including “Full-time + Part-time” respondents, for Dec 2013-Oct 2017 and Sep 2018-Sep 2019.

RIWI provides a weight for each respondent in order to transform the data so that it can reflect the demographics of the general US population, hence the adjective “Weighted” in the figure captions. Note that the survey is conducted such that each respondent can go back and change their answers but they will not show up as more than one sample in the data set. In order to extract a summary score in advance of each month’s NFP announcement, we compute a monthly average of the product of the respondents’ weights and the indicator (0 or 1) of whether the individual respondent is working full or part-time. The monthly average is computed over the same month that the NFP number measures. We call this the “RIWI score”. As the NFP data were seasonally adjusted, we need to do the same to the monthly differences of the RIWI score. We employ the same adjustment that the BLS uses: X12-ARIMA. But for comparison purposes, we did not apply seasonal adjustment to Figures 1 and 2.
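As an illustration, here is a minimal sketch of how such a weighted monthly score and its seasonally adjusted differences might be computed with pandas and statsmodels; the respondent-level column names are assumptions, and the X-13/X-12 wrapper requires the Census Bureau's seasonal adjustment program to be installed.

```python
import pandas as pd
from statsmodels.tsa.x13 import x13_arima_analysis

# Hypothetical respondent-level data: one row per respondent, with the survey
# date, the RIWI demographic weight, and a 0/1 indicator for working
# full-time or part-time.
responses = pd.read_csv("riwi_responses.csv", parse_dates=["date"])

# Monthly RIWI score: the average of weight * employment indicator per month.
responses["weighted_employed"] = responses["weight"] * responses["employed"]
riwi_score = responses.set_index("date")["weighted_employed"].resample("M").mean()

# Month-over-month differences, then seasonal adjustment. statsmodels wraps the
# Census Bureau's X-13ARIMA-SEATS/X-12-ARIMA program, which must be installed
# and discoverable on the PATH.
riwi_diff = riwi_score.diff().dropna()
riwi_diff_sa = x13_arima_analysis(riwi_diff).seasadj
```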

Classification models

Our classification models were used to predict whether the sign of the NFP surprise was positive or negative (there were no zero surprises in the data). The models were trained on the Dec 2013 – Oct 2017 data (“train set”), where cross-validation testing also took place. Out-of-sample testing was done on the Sep 2018 – Oct 2019 data (“test set”). As mentioned above, the test set’s RIWI survey questions were somewhat different from the train set questions. So the test-set result is a joint test of whether the classification model works out-of-sample and whether the slight difference in the RIWI data significantly degrades predictive accuracy.

To provide benchmark comparisons against the RIWI score, we also studied several other standard features, some of which have been found useful for NFP predictions:

· Previous 1-month NFP surprise
· Previous 12-month NFP surprise
· Bloomberg Barclays US Corporate High Yield Average Option Adjusted Spread Index (a.k.a. credit spreads)
· Index of Consumer Sentiment (University of Michigan)

The Bloomberg Barclays US Corporate High Yield Average Option Adjusted Spread Index denotes the difference (spread) between a computed Option Adjusted Spread index of all high yield corporate bonds and a spot US Treasury curve. An Option Adjusted Spread index is computed using constituent bonds’ option adjusted spreads, weighted by market capitalization. In what follows, we will refer to the Bloomberg Barclays US Corporate High Yield Average Option Adjusted Spread Index as the “credit spreads” feature.

Since our machine learning models require approximately stationary features, we use the monthly differences of the RIWI score and of the other features.
The benchmark models we tested are:

  1. Logistic regression* on Previous surprise.
  2. Trend-following model predicts next sign(surprise)=sign(previous surprise).
  3. Contrarian model predicts next sign(surprise)=-sign(previous surprise).
  4. Logistic regression on credit spreads.
  5. Logistic regression on Index of Consumer Sentiment.
*All logistic regressions were L2-regularized.
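As a rough sketch of how the logistic regression benchmarks above could be set up with scikit-learn (the file and column names are placeholders, not the actual data pipeline):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical monthly data frame with a differenced benchmark feature and the
# NFP surprise over the Dec 2013 - Oct 2017 training period.
df = pd.read_csv("nfp_train.csv", parse_dates=["month"], index_col="month")

X = df[["credit_spread_diff"]].values      # one benchmark feature
y = np.sign(df["nfp_surprise"].values)     # +1 / -1 labels

# scikit-learn's LogisticRegression applies an L2 penalty by default.
logreg = LogisticRegression(penalty="l2", C=1.0)
scores = cross_val_score(logreg, X, y, cv=10, scoring="accuracy")
print(f"CV accuracy: {scores.mean():.2f} +/- {scores.std():.2f}")
```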

Here are the results, compared to applying Random Forest to the RIWI score alone:

| ML model | Features | CV accuracy (in-sample) | Out-of-sample accuracy |
|---|---|---|---|
| Contrarian model | Prev 1-month surprise | 0.46 | 0.66 |
| LogReg (Ridge) | Credit spreads | 0.52 | 0.51 |
| LogReg (Ridge) | Prev 1-month surprise | 0.53 | 0.50 |
| LogReg (Ridge) | Consumer sentiment index | 0.53 | 0.50 |
| Random Forest | All features | 0.53 | 0.58 |
| Trend following model | Prev 1-month surprise | 0.54 | 0.33 |
| Random Forest | RIWI score alone | 0.63 +/- 0.03 | 0.58 +/- 0.04 |

Table 1: Classification benchmarks and other features

Based on the predictive accuracy on the cross-validation data, the best machine learning model is one that uses the RIWI score as the only feature. This model applied the random forest classifier to the RIWI score to predict sign(NFP surprise). It obtained an average cross-validated (CV) accuracy of 63% +/- 3% (using 10-fold cross-validation on Dec 2013 – Oct 2017 data) and an out-of-sample accuracy of 58.3% +/- 4%. As the out-of-sample data consist of only 12 data points, we view them as a test of whether the random forest classifier overfitted on the training data, and whether the slightly different RIWI data affected predictions, but not as a fair comparison of the various models. Since the predictive accuracy did not deteriorate significantly on the out-of-sample data, we conclude that overfitting was unlikely, and that the new RIWI data did not differ significantly from the data we trained on. We also applied random forest to all the features including the RIWI score, and found a lower CV accuracy (53%) and a comparable out-of-sample accuracy (58%) relative to using the RIWI score alone.
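A minimal sketch of this procedure, assuming hypothetical monthly data files with the differenced RIWI score and the NFP surprise (names and hyperparameters are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical monthly data: the differenced, seasonally adjusted RIWI score
# and the NFP surprise, split into the train and test periods.
train_df = pd.read_csv("nfp_train.csv", parse_dates=["month"], index_col="month")
test_df = pd.read_csv("nfp_test.csv", parse_dates=["month"], index_col="month")

X_train = train_df[["riwi_score_diff"]].values
y_train = np.sign(train_df["nfp_surprise"].values)
X_test = test_df[["riwi_score_diff"]].values
y_test = np.sign(test_df["nfp_surprise"].values)

clf = RandomForestClassifier(n_estimators=500, random_state=0)

# 10-fold cross-validation on the Dec 2013 - Oct 2017 training data.
cv_acc = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")
print(f"CV accuracy: {cv_acc.mean():.2f} +/- {cv_acc.std():.2f}")

# Out-of-sample accuracy on the Sep 2018 - Oct 2019 data.
clf.fit(X_train, y_train)
print(f"Out-of-sample accuracy: {clf.score(X_test, y_test):.2f}")
```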

Regression models

Our regression models were used to predict the actual NFP surprise (sign + magnitude). The train vs. test data were the same as for the classification models, and the feature set was also the same.

To provide benchmark comparisons against the RIWI score, we studied the following models:

  1. ARMA (2,1) model* that uses past NFP surprises.
  2. Trend-following model predicts next surprise=(previous surprise).
  3. Contrarian model predicts next surprise=-(previous surprise).
*The lags and coefficients were optimized based on AIC minimization on the train set.
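A sketch of how such an AIC-based lag search for the ARMA benchmark could be done with statsmodels (the series name and the order search range are assumptions):

```python
import itertools
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Hypothetical monthly series of past NFP surprises (training period only).
surprise = pd.read_csv("nfp_train.csv", parse_dates=["month"],
                       index_col="month")["nfp_surprise"]

# Search over small (p, q) orders and keep the model with the lowest AIC.
best_aic, best_order = float("inf"), None
for p, q in itertools.product(range(4), range(4)):
    if p == 0 and q == 0:
        continue
    aic = ARIMA(surprise, order=(p, 0, q)).fit().aic
    if aic < best_aic:
        best_aic, best_order = aic, (p, 0, q)

print(f"Best order by AIC: {best_order}, AIC = {best_aic:.2f}")

# One-step-ahead forecast of the next surprise with the selected model.
next_surprise = ARIMA(surprise, order=best_order).fit().forecast(steps=1)
```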

Here are the results, compared to applying Random Forest to the RIWI score alone:

| ML method | Features | CV MSE (in-sample) | Out-of-sample MSE |
|---|---|---|---|
| Trend following model | Prev 1-month surprise | 6788.60 | 19575.16 |
| Contrarian model | Prev 1-month surprise | 5941.78 | 9652.16 |
| ARMA(2,1) | Prev 1-month surprise | 3317.47 | 7192.9 |
| Linear regression (Ridge) | Prev 1-mth surprise + prev 12-mth surprise | 3310.66 | 7302.94 |
| Random Forest | RIWI score | 3280.13 | 7208.01 |
| Random Forest | Credit spreads | 3257.51 | 7227.63 |
| Random Forest | Consumer sentiment index | 3251.48 | 7231.74 |
| Random Forest | All features | 3251.18 | 7268.75 |
| Random Forest | RIWI score + prev 1-mth surprise + prev 12-mth surprise | 3249.35 +/- 70 | 7269.20 +/- 134 |

Table 2: Regression benchmarks

Based on the mean squared error (MSE) of predicted surprises on the cross-validation data, the best machine learning model is one that includes the RIWI score as a feature. It applied the random forest regressor to the RIWI score and the previous 1-month and 12-month surprises in order to predict the actual NFP surprise. It obtained an average cross-validated MSE of 3249.35 +/- 70 and an out-of-sample MSE of 7269.2 +/- 134, marginally outperforming all benchmarks in cross-validation. As with all other benchmarks, including the Contrarian model which requires no training, the out-of-sample MSE increased significantly over the CV MSE. But again, as the out-of-sample data consist of only 12 data points, we don't view them as a fair comparison of the various models. We also applied random forest to all the features including the RIWI score, and found a somewhat higher CV MSE (and hence a worse model) than the model that combines the RIWI score with the previous surprises, though the difference is within error bounds.
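A brief sketch of the cross-validated MSE calculation for this regression, again with hypothetical file, column and hyperparameter choices:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical training data with the three features of the best model.
train_df = pd.read_csv("nfp_train.csv", parse_dates=["month"], index_col="month")
features = ["riwi_score_diff", "surprise_lag1", "surprise_lag12"]
X, y = train_df[features].values, train_df["nfp_surprise"].values

reg = RandomForestRegressor(n_estimators=500, random_state=0)

# scikit-learn reports the negated MSE, so flip the sign back.
neg_mse = cross_val_score(reg, X, y, cv=10, scoring="neg_mean_squared_error")
print(f"CV MSE: {-neg_mse.mean():.1f} +/- {neg_mse.std():.1f}")
```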

Conclusion and Future Work

Using cross-validation on RIWI data from December 2013 - October 2017, we found that the RIWI score (after weighting, seasonal adjustment, and differencing) outperformed all other benchmarks in predictive accuracy for the sign of the NFP surprise. We also found that the similarly transformed RIWI score, when supplemented with other indicators, performed as well as or better than all other benchmarks in predicting the magnitude of the surprise. While such absolute dominance needs to be confirmed in an extended out-of-sample test, we believe there is great potential in using the RIWI score to predict the all-important nonfarm payroll number.

But beyond predicting NFP surprises, RIWI’s data have the potential to be a more accurate gauge of the actual U.S. employment situation, and therefore economic growth, than the NFP number. The “gig economy” is employing more workers whose data do not easily find their way into the official BLS count. (Here is an article on why BLS’ effort to count these workers has been a failure. This Bank of Canada report also concluded that official numbers were undercounting gig workers.) Undocumented workers are not counted in the NFP but they do contribute to the economy. Even illegal activities could have contributed more than 1% to the U.S. GDP, according to this Wall Street Journal report. In contrast, RIWI’s survey methodology was cited in this paper by Harvard researchers among others as the preferred method of collecting data on hard-to-reach populations. One can imagine an ambitious researcher using RIWI data to directly predict GDP growth and achieving better results than using the traditional economic indicators such as NFP.

Acknowledgement

We thank Jason Cho, Head of Data Operations at RIWI, for providing us the Company’s proprietary data for our evaluation purposes.

*Note a PDF version of this article can be downloaded from www.epchan.com.

Wednesday, December 04, 2019

Experiments with GANs for Simulating Returns (Guest post)

By Akshay Nautiyal, Quantinsti


Simulating returns using either traditional closed-form equations or probabilistic models like Monte Carlo has been the standard practice, and the simulations are matched against empirical observations from stock, bond and other financial time-series data. (See Chan and Ng, 2017 and Lopez de Prado, 2018.) Some of the stylised facts of return distributions are as follows:

  1. The tails of an empirical return distribution are always thick, indicating lucky gains and enormous losses are more probable than a Gaussian distribution would suggest. 
  2. Empirical distributions of assets show sharp peaks which traditional models are often not able to gauge. 

To generate simulated return distributions that are faithful to their empirical counterparts, I tried my hand at various kinds of Generative Adversarial Networks (GANs), a specialised type of neural network for learning the features of a stationary series, which we'll describe later. The GAN architectures used here are direct descendants of the simple GAN invented by Goodfellow in his 2014 paper. The ones tried for this exercise were the conditional recurrent GAN and a simple GAN using fully connected layers. The idea behind the architecture is that there are two constituent neural networks. One, called the Generator, takes a vector of random noise as input and generates a time-series window of a couple of days as output. The other, called the Discriminator, takes as input either this generated window or a real window of price returns or other features, and tries to decipher whether the given window is “real” (from the AAPL data) or “fake” (generated by the Generator). The job of the generator is to “fool” the discriminator by generating, as it is being trained, successively more realistic data. The training goes on until:
1) the generator is able to output feature sets whose distribution is essentially identical to that of the real dataset on which both networks were trained, and
2) the discriminator can no longer reliably tell the real data from the generated data.
The mathematical objectives of this training are:
a) Maximise log(D(x)) + log(1 - D(G(z))) - done by the discriminator. That is, increase the expected (over many iterations) log probability that the discriminator D correctly identifies the real samples x, and simultaneously increase the expected log probability that D correctly flags the samples generated by the generator G from noise z as fake.
b) Maximise log(D(G(z))) - done by the generator. As observed empirically while training GANs, at the beginning of training G is an extremely poor “truth” generator while D quickly becomes good at identifying real data, so the term log(1 - D(G(z))) saturates (its gradient becomes very small). In principle the job of G is to minimise log(1 - D(G(z))), which would mean G is creating realistic data that D isn't able to “call out”. But because log(1 - D(G(z))) saturates, we instead train G to maximise log(D(G(z))), which pushes in the same direction but gives stronger gradients early in training.
Together the min-max game that the two networks play between them is formally described as:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$

The real data sample x is drawn from the distribution of empirical returns p_data(x), and z is a random noise variable sampled from a multivariate Gaussian p(z). The expectations are taken over both of these distributions. This happens over multiple iterations.
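To make the objective concrete, here is a minimal PyTorch sketch of one training step under the losses described above; the tiny fully connected networks, the window length and the learning rates are placeholders rather than the architectures used in the experiments.

```python
import torch
import torch.nn as nn

# Placeholder networks: a window of `window` daily returns is generated from a
# `noise_dim`-dimensional Gaussian noise vector. The architectures in the post
# were LSTM-based; simple dense layers are used here only to show the objective.
noise_dim, window = 32, 20
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, window))
D = nn.Sequential(nn.Linear(window, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)

def train_step(real_windows):
    """One alternating update of D and G on a batch of real return windows."""
    batch = real_windows.size(0)
    ones = torch.ones(batch, 1)
    zeros = torch.zeros(batch, 1)

    # Discriminator step: maximise log D(x) + log(1 - D(G(z))).
    z = torch.randn(batch, noise_dim)
    fake = G(z).detach()
    loss_d = bce(D(real_windows), ones) + bce(D(fake), zeros)
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator step: maximise log D(G(z)) (the non-saturating form).
    z = torch.randn(batch, noise_dim)
    loss_g = bce(D(G(z)), ones)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```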

The hypothesis was that the various GANs tried would be able to generate a distribution of returns closer to the empirical distribution of returns than ubiquitous baselines like the Monte Carlo method using geometric Brownian motion.
The experiments
A bird’s-eye view of what we’re trying to do here: we want to learn the joint probability distribution across time windows of all the features, along with the percentage change in the adjusted close, so that they can be simulated organically with all the nuances they naturally exhibit together. For all the GAN training processes, Bayesian optimisation was used for hyperparameter tuning.
In this exercise, we first collected features belonging to the categories of trend, momentum and volatility, such as RSI, MACD, Parabolic SAR and Bollinger Bands, to create a feature set on the adjusted close of AAPL data spanning from the 1980s to today. The window size of the sequential training sample was set based on hyperparameter tuning. Apart from these indicators, the percentage changes of the adjusted OHLCV data were computed and concatenated to the list of features. Both the generator and discriminator were recurrent neural networks (to sequentially take in the multivariate window as input) powered by LSTMs, which passed their output to dense layers. I tried learning the joint distributions of 14 and also of 8 features. The results were suboptimal, probably because of the architecture being used and also because of how notoriously difficult GANs are to train. The suboptimality showed up as the generator's error not decreasing at all (log(1 - D(G(z))) saturating very early in the training) after initially going up, and as the generator producing random return distributions without any particular form.
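For concreteness, the indicator features could be computed along the following lines; TA-Lib is used here purely for illustration (the post does not say which library was used), and the file and column names are assumptions.

```python
import pandas as pd
import talib

# Hypothetical AAPL daily data with OHLCV and adjusted close columns.
ohlcv = pd.read_csv("AAPL.csv", parse_dates=["Date"], index_col="Date")
close = ohlcv["Adj Close"].values.astype(float)
high = ohlcv["High"].values.astype(float)
low = ohlcv["Low"].values.astype(float)

features = pd.DataFrame(index=ohlcv.index)
features["rsi"] = talib.RSI(close, timeperiod=14)
features["macd"], features["macd_signal"], _ = talib.MACD(close)
features["sar"] = talib.SAR(high, low)
features["bb_upper"], features["bb_mid"], features["bb_lower"] = talib.BBANDS(close)

# Percentage changes of the OHLCV columns, concatenated to the indicators.
returns = ohlcv.pct_change().add_prefix("ret_")
features = pd.concat([features, returns], axis=1).dropna()
```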
After the conditional recurrent GANs, which didn’t train well, I tried using simpler multilayer perceptrons for both the generator and the discriminator, into which I passed the entire window of returns of the adjusted close price of AAPL. The optimal window size was derived from hyperparameter tuning using Bayesian optimisation. The distribution generated by this feed-forward GAN is shown in Figure 1.

 Fig 1. Returns by simple feed-forward GAN
One of the common problems I faced was either partial or complete mode collapse: either the generated distribution did not have a sharp peak like the empirical distribution (partial), or every noise sample input into the generator produced a limited set of output samples (complete).

[Figure: per-epoch training log showing the mean and standard deviation of the real (“real data”) and generated (“fake data”) return samples]

The figure above shows mode collapse during training. Each training epoch prints the mean and standard deviation of both the empirical subset (“real data”) that is fed to the discriminator and the subset generated by the generator (“fake data”). As we can see, at the 150th epoch the distribution of the generated “fake data” completely collapses: the mean becomes 1.0 and the standard deviation becomes 0. This means that all the noise samples put into the generator are producing the same output! This phenomenon is called mode collapse, as the frequencies of the other local modes are not in line with the real distribution. The figure below shows the final distribution generated in the training iterations described above:

[Figure: final return distribution generated by the GAN after the training run above, illustrating mode collapse]
A few tweaks that reduced errors for both the generator and the discriminator were: 1) using a different learning rate for the two networks - informally, the discriminator's learning rate should be one order of magnitude higher than the generator's; and 2) instead of using fixed labels of 1 or 0 (where 1 means “real data” and 0 means “fake data”) for training the discriminator, it helps to subtract a small amount of noise from the label 1 and add a similar small amount of noise to the label 0. This has the effect of turning the classification into a regression, using mean squared error loss instead of binary cross-entropy as the objective function. Nonetheless, these tweaks did not completely eliminate the suboptimality and mode collapse problems associated with the recurrent networks.
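A sketch of those two tweaks, reusing the placeholder networks from the earlier snippet (the learning rates and the amount of label noise are illustrative choices):

```python
import torch
import torch.nn as nn

# Placeholder networks, as in the earlier sketch.
noise_dim, window = 32, 20
G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, window))
D = nn.Sequential(nn.Linear(window, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

# Tweak 1: the discriminator's learning rate is one order of magnitude
# higher than the generator's (the specific values are illustrative).
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)

# Tweak 2: noisy "soft" labels with a mean-squared-error loss, instead of
# hard 0/1 labels with binary cross-entropy.
mse = nn.MSELoss()

def discriminator_loss(real_windows):
    batch = real_windows.size(0)
    fake_windows = G(torch.randn(batch, noise_dim)).detach()
    real_labels = 1.0 - 0.1 * torch.rand(batch, 1)   # just below 1
    fake_labels = 0.1 * torch.rand(batch, 1)         # just above 0
    return mse(D(real_windows), real_labels) + mse(D(fake_windows), fake_labels)
```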
Baseline Comparisons
We compared this generated distribution against the distribution of empirical returns and the distribution generated via geometric Brownian motion (GBM) Monte Carlo simulations done on AAPL in Python. The metrics used to compare the empirical returns with those from GBM-MC and the GAN were the Kullback-Leibler divergence, to measure the “distance” between return distributions, and Value at Risk (VaR), to understand the risk implied by each kind of simulation. The paths generated by the GBM-MC can be seen in Fig. 4. Ten paths were simulated 1000 days into the future, based on the mean and variance of the AAPL stock data from the 1980s to 2019. The initial price input into the GBM was the AAPL price on day one.
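A minimal sketch of such a GBM Monte Carlo simulation (the data file is a placeholder, and the parameter estimation is the simplest possible choice):

```python
import numpy as np
import pandas as pd

# Daily AAPL adjusted closes (hypothetical file); estimate per-day GBM
# parameters from the full history of log returns.
prices = pd.read_csv("AAPL.csv", parse_dates=["Date"], index_col="Date")["Adj Close"]
log_ret = np.log(prices).diff().dropna()
mu, sigma = log_ret.mean(), log_ret.std()

n_paths, n_days = 10, 1000
s0 = prices.iloc[0]                       # price on day one, as in the text
rng = np.random.default_rng(0)

# GBM: S_{t+1} = S_t * exp((mu - sigma^2 / 2) + sigma * eps), eps ~ N(0, 1).
shocks = rng.standard_normal((n_days, n_paths))
log_paths = np.cumsum((mu - 0.5 * sigma**2) + sigma * shocks, axis=0)
paths = s0 * np.exp(log_paths)            # shape: (n_days, n_paths)
simulated_returns = np.diff(paths, axis=0) / paths[:-1]   # simple returns per path
```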

   

Fig 2. shows the empirical distribution of returns for AAPL from the 1980s up till now. Fig 3. shows the returns generated by geometric Brownian motion on AAPL.

To compare the various distributions generated in this exercise, I binned the return values into 10,000 bins and then calculated the divergence using the non-normalised frequency value of each bin. The code is:
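(A minimal sketch of that calculation, using numpy histograms and scipy.stats.entropy; the variable names and the small smoothing constant are assumptions, not necessarily the exact code used.)

```python
import numpy as np
from scipy.stats import entropy

def kl_divergence(p_samples, q_samples, n_bins=10_000):
    """KL divergence D(P || Q) between two return samples via histogram binning."""
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    bins = np.linspace(lo, hi, n_bins + 1)

    # Non-normalised bin frequencies; a tiny constant avoids division by zero.
    p_freq = np.histogram(p_samples, bins=bins)[0] + 1e-9
    q_freq = np.histogram(q_samples, bins=bins)[0] + 1e-9

    # scipy normalises pk and qk internally and returns sum(pk * log(pk / qk)).
    return entropy(p_freq, q_freq)

# Example: kl_divergence(empirical_returns, gan_returns)
```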
     
The formula scipy uses behind the scenes (when a second distribution qk is supplied to scipy.stats.entropy) is:
S = sum(pk * log(pk / qk)), where pk and qk are the bin frequencies, normalised internally to sum to 1.
The Kullback-Leibler divergences calculated between the distributions are:
| Comparison | KL divergence |
|---|---|
| Empirical vs GAN | 7.155841564194154 |
| GAN vs Empirical | 10.180867728820251 |
| Empirical vs GBM | 1.9944835997277586 |
| GBM vs Empirical | 2.990622397328334 |

The geometric Brownian motion simulation is a better match for the empirical data than the distribution generated by the multilayer perceptron GAN, though it should be noted that both are extremely poor matches.
The VaR values (calculated over 8 samples) tell us the kind of returns (or losses) we might see beyond a given confidence level - in this case, the percentage losses that have only a 5% or a 1% chance of being exceeded, given each distribution of returns:
| Comparison | Mean and Std Dev of VaR (95% confidence level) | Mean and Std Dev of VaR (99% confidence level) |
|---|---|---|
| GANs | Mean = -0.1965352900, Stdev = 0.007326252 | Mean = -0.27456501573, Stdev = 0.0093324205 |
| GBM with Monte Carlo | Mean = -0.0457949236, Stdev = 0.0003046359 | Mean = -0.0628570539, Stdev = 0.0008578205 |
| Empirical data | -0.0416606773394755 (one ground truth value) | -0.0711425634927405 (one ground truth value) |

The GBM generator's VaRs seem much closer to the VaRs of the empirical distribution.
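For reference, a minimal sketch of how such historical VaR figures can be computed from a sample of returns (the function below is illustrative, not the code used in the experiments):

```python
import numpy as np

def value_at_risk(returns, confidence=0.95):
    """Historical VaR: the return exceeded on the downside with probability
    1 - confidence (e.g. the 5th percentile for 95% confidence)."""
    return np.percentile(returns, 100 * (1 - confidence))

# Example usage on a simulated or empirical return sample:
# var_95 = value_at_risk(simulated_returns, confidence=0.95)
# var_99 = value_at_risk(simulated_returns, confidence=0.99)
```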
Fig 4. The various paths generated by the geometric Brownian motion model using Monte Carlo.

Conclusion
Neither method reproduced the sharp peak seen in the empirical distribution (Figure 2). However, compared to the distributions generated by the various GANs I tried, the spread of the return distribution from the GBM with Monte Carlo was much closer to reality, as shown by the VaR values, and its distance to the empirical distribution was much smaller, as shown by the Kullback-Leibler divergence. This exercise reinforced that GANs, though enticing, are tough to train. Along the way I discovered and read about a few tweaks that can be helpful in GAN training. The common problems I faced were 1) mode collapse, discussed above, and 2) saturation of the generator, i.e. “overpowering” by the discriminator, which causes suboptimal learning of the distribution by the GAN. Although not really successful, this exercise creates scope for exploring the various newer GAN architectures, in addition to the conditional recurrent and multilayer perceptron ones which I tried, and using their fabled ability to learn the subtlest of distributions for financial time-series modelling. Our codes can be found at Github here. Any modifications to the codes that can help improve performance are most welcome!

About Author: 
Akshay Nautiyal is a Quantitative Analyst at Quantinsti, working at the confluence of machine learning and finance. QuantInsti is a premium institute in Algorithmic & Quantitative Trading with instructor-led and self-study learning programs. For example, there is an interactive course on using Machine Learning in Financial Markets that provides hands-on training in complex concepts like LSTMs, RNNs, cross-validation and hyperparameter tuning.

Industry update

1) Cris Doloc published a new book “Computational Intelligence in Data-Driven Trading” that has extensive discussions on applying reinforcement learning to trading.

2) Nicolas Ferguson has translated the Kalman Filter codes in my book Algorithmic Trading to KDB+/Q. It is available on Github. He is available for programming/consulting work.

3) Brian Stanley at QuantRocket.com wrote a blog post on "Is Pairs Trading Still Viable?"

4) Ramon Martin started a new blog with a piece on "DeepTrading with Tensorflow IV".

5) Joe Marwood added my book to his top 100 trading books list.

6) Agustin Lebron's new book The Laws of Trading contains a good interview question on adverse selection (via Bayesian reasoning).

7) Linda Raschke's new autobiography Trading Sardines is hilarious!