In the previous blog post, we gave a very simple example of how traders can use self-attention transformers as a feature selection method: in that case, to select which previous returns of a stock to use for predictions or optimizations. To be precise, the transformer assigns weights to the different transformed features for downstream applications. In this post, we will discuss how traders can incorporate different feature series of this stock while adding a sense of time. The technique we discuss is based partly on Prof. Will Cong’s AlphaPortfolio paper.
Recall that in the simple example in a Poor Person’s Transformer, the input X is just an n-vector of previous returns, X = [R(t), R(t-1), …, R(t-n+1)]T. Some of you fundamental analysts will complain, “What about the fundamentals of a stock? Shouldn’t they be part of the input?” Sure they should! Let’s say, following AlphaPortfolio, we add B/M, EPS, …, all 51 fundamental variables of a company as input features. Furthermore, just as for the returns, we want the n previous snapshots of these variables. So we expand X from 1 to 52 columns (including the returns column). For concreteness, let’s say we use n=12 snapshots, captured at monthly intervals, and regard R(t) as the monthly return from t-1 to t. X is now a 12 × 52 matrix.
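To make the shapes concrete, here is a minimal sketch of assembling X for one stock. The return series and the 51-column fundamentals table are randomly generated placeholders (real data would come from your own database):

```python
import numpy as np
import pandas as pd

n = 12  # number of monthly snapshots
rng = np.random.default_rng(0)
dates = pd.date_range("2024-01-01", periods=n, freq="MS")

# Placeholder monthly returns R(t-11), ..., R(t) and 51 fundamental variables
rets = pd.Series(rng.normal(0.01, 0.05, n), index=dates, name="ret")
fundamentals = pd.DataFrame(rng.normal(size=(n, 51)), index=dates,
                            columns=[f"fund_{i}" for i in range(1, 52)])

# One row per snapshot in time, one column per feature (returns + fundamentals)
X = pd.concat([rets, fundamentals], axis=1).to_numpy()
print(X.shape)  # (12, 52)
```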
For those of us who grew up before GenAI became a thing (e.g. Ernie), we often used tree-based algorithms for supervised learning. Trees work very well with heterogeneous and tabular feature sets, and by limiting the number of nodes or the depth of a branch, we get feature selection by default. With neural networks (NN), before deep learning came around, it was quite common to perform feature selection using L1 regularization, i.e. adding an L1 penalty term to the objective function in order to encourage some of the network parameters to become zero. However, L1 regularization is quite tedious when we have millions or billions of parameters in a deep neural network. In its place, transformers and attention have become the go-to technique for feature selection in a deep neural network (see Chapter 5 of our book). But beyond making feature selection practical for DNNs, the attention mechanism provides one important benefit that is absent from traditional regularization or feature selection methods (such as MDA, SHAP, or LIME; see Chan & Man, https://arxiv.org/abs/2005.12483): the selected features depend on each sample. They aren’t selected globally, as they are in traditional feature selection methods. In other words, the features are selected based on their values themselves. In the language of transformers, we use self-attention for feature selection.
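For concreteness, here is a minimal sketch of that older approach: an L1 penalty on the input-layer weights of a small network (PyTorch, with our hypothetical 52 input features). Note that whichever features get their weights driven to zero are deselected globally, for every sample:

```python
import torch

# Small regression network on 52 input features (returns + fundamentals)
net = torch.nn.Sequential(
    torch.nn.Linear(52, 16),
    torch.nn.ReLU(),
    torch.nn.Linear(16, 1),
)
lam = 1e-3  # L1 penalty strength, a hyperparameter to tune

def loss_fn(pred, target):
    # The L1 penalty on the first layer encourages entire input features
    # to be dropped (their column of weights shrinks to zero)
    l1 = net[0].weight.abs().sum()
    return torch.nn.functional.mse_loss(pred, target) + lam * l1
```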
Transformers are usually illustrated with textual input, e.g. a sentence containing 4 features (words/tokens): “I”, “am”, “a”, “student”. Let’s call this input feature vector X. In a DNN, each feature may be a vector (e.g. we may use a d-dimensional vector to represent a word/token), as opposed to a scalar. So X may actually have dimension n × d, where n is the number of features (not the number of samples!) and d is the dimension of each feature. A financial application where this can be useful is when one feature (row) vector captures the daily return of a stock, its P/E, dividend yield, …, up to d types of features, at a snapshot in time t. Another feature vector captures the same information at time t-1, and so on, up to a lookback of n. So if you have n lookback periods, the feature matrix has dimension n × d. But in many financial applications, each feature is just a real-valued scalar, such as the daily return of a single stock. So X = [r(t), r(t-1), …, r(t-n+1)]T. This is the simple example we will use in our Poor Person’s version of the transformer: d=1, and X is just a column vector with dimension n × 1.
Now, in ordinary transformers, the next step is to transform X into 3 different matrices: Q (query, with width dq), K (key, with width dk), and V (value, with width dv). An element in Q is like “what this feature is looking for in other features that can serve as context”, an element in K is like “this is the context that this feature can provide”, and an element in V is like “this is the feature in a new representation”.
In a typical transformer with self-attention, for each input X, the Q, K, and V matrices are calculated as linear transformations of X: Q = X WQ, K = X WK, V = X WV.
The WQ, WK, WV matrices themselves are learned parameters, learned based on the ultimate objective of this NN (e.g. classification, regression, or optimization), but the resulting attention score is computed as a function of the input sample X. The W’s all have height d, but widths dq, dk, and dv respectively, though dq and dk are often set to the same dimension. The intuition behind these Q, K, V is that we want some linear mixtures of the original feature matrix X that best represent it, reminiscent of the familiar PCA. In the example of the n × d financial feature matrix we described above, we want to linearly project the return and fundamentals of a stock to some “principal component” vector, while preserving the distinctness of each lagged snapshot of these features, since the projection is row-wise. That is, Q, K, and V have the same height as X, so each row still represents a specific snapshot in time, as seen in the figure below, which illustrates the building blocks of a transformer with self-attention.
The figure below shows specifically a transformer with n=4, d=4, and dq=dk=dv=2. It also shows how the Q and K matrices are multiplied together, divided by sqrt(dk) to prevent the magnitudes from exploding, and fed through a softmax function to turn them into attention scores in [0, 1], in a process called “Scaled Dot-Product Attention” (for more details, see again Chapter 5 of our book).
Why sqrt(dk)? We will quote Cong et al.: “Assume that the components of q and k are independent random variables with mean 0 and variance 1. Then their dot product, q · k = Σi qi ki (summed over i = 1, …, dk), has mean 0 and variance dk.” Dividing by sqrt(dk) therefore rescales the dot product back to unit variance. Why softmax? The softmax function normalizes the scaled dot-product into a matrix A = softmax(QKT / sqrt(dk)), where each row consists of normalized weights (i.e. they sum to 1); these are the attention weights applied to the feature value matrix V. To wit, the context matrix is Z = AV.
Source: https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html. Here n=4, d=4, dq=dk=dv=2.
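In code, one self-attention head is only a few lines. Here is a minimal numpy sketch, with randomly initialized W’s standing in for the learned parameters, and the toy dimensions from the figure (n=4, d=4, dq=dk=dv=2):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X is (n, d); Wq, Wk are (d, dq=dk); Wv is (d, dv)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # row-wise linear projections
    scores = Q @ K.T / np.sqrt(K.shape[1])      # (n, n), scaled by sqrt(dk)
    exps = np.exp(scores - scores.max(axis=1, keepdims=True))
    A = exps / exps.sum(axis=1, keepdims=True)  # row-wise softmax
    return A, A @ V                             # attention matrix and context Z

rng = np.random.default_rng(42)
X = rng.normal(size=(4, 4))                     # n=4 snapshots, d=4 features
Wq, Wk, Wv = (rng.normal(size=(4, 2)) for _ in range(3))
A, Z = self_attention(X, Wq, Wk, Wv)
print(A.sum(axis=1))                            # each row of A sums to 1
```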
But in our Poor Person’s transformer, W is just a scalar, and Q, K, and V are all just 1-dimensional vectors. So we might as well eliminate this step and replace them all by the vector X. Note that this doesn’t collapse the matrix QKT into a scalar or vector: it is still an n × n matrix, formed by XXT. Each feature i still multiplies each feature j to form the attention matrix element A(i, j). After the row-wise softmax, the elements of each row of A sum to 1, as in all attention matrices. If you ask “What is the feature importance score of feature j?”, you can sum over all the values of column j, since column j represents the key feature j.
So if feature importance scores or feature selection are all you are after, we are done. But usually we are interested in downstream applications. In our 1-stock, n-returns example, we might be interested in using these n daily returns, with proper feature weights, to predict the next day’s return. In this case, all we need to do is multiply the attention matrix with V, which in our case is equal to X, to create the “context vector” Z = AV = AX. The context vector is an attention-weighted version of our original feature vector X. Downstream, we can use Z as input to an MLP for supervised learning, such as predicting the next day’s return, or for optimization via reinforcement learning.
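Putting the Poor Person’s pipeline together, here is a minimal sketch with a hypothetical vector of past daily returns:

```python
import numpy as np

r = np.array([0.012, -0.004, 0.007, 0.021, -0.015])  # hypothetical returns, n=5
X = r.reshape(-1, 1)                                  # n x 1 column vector

scores = X @ X.T                   # n x n: element (i, j) is r(i) * r(j)
exps = np.exp(scores - scores.max(axis=1, keepdims=True))
A = exps / exps.sum(axis=1, keepdims=True)            # each row sums to 1

importance = A.sum(axis=0)         # column sums = per-sample feature importances
Z = A @ X                          # context vector, input to a downstream MLP
print(importance)
print(Z.ravel())
```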
Does this work? You can ask ChatGPT or some other favorite chatbot to create a program based on this blog post and try it out. Let us know how the results look in the comments!
P.S.
You may get excited by this feature selection method and think we should throw in a bunch of “heterogeneous” features such as volatility, P/E, earnings yield, … of the stock to see if they work better. Unfortunately, the Poor Person’s self-attention method discussed above doesn’t work very well with features that cannot be embedded in the same space. For example, it is nonsensical to add together A(i, j) = volatility * P/E and A(i+1, j) = dividend * P/E to form the feature importance score of P/E. To do that, we need to do some normalization and embedding. Also, maybe we want to tell the transformer that r(t), r(t-1), … is a time series and the features are time-ordered. All topics for the next blog post!
Have LLMs become better at quantitative finance tasks over the past year? The answer may seem obvious if you read the breathless proclamations of AI luminaries, but good quantitative investors should be hype-immune. We want to carefully compare ChatGPT’s unsatisfactory responses to a couple of the prompts outlined in our book (mostly generated in early 2024) with its responses now (April 2025). In addition, if ChatGPT’s response is still not satisfactory, we want to compare the responses of several popular LLMs, some of which weren’t even available, or weren’t yet launched in production mode, in early 2024. Of course, testing all the LLMs (e.g. ChatGPT, Copilot, DeepSeek, Gemini, Grok, Perplexity, …) is very time consuming, so we also use Chatbot Arena to randomly pick 2 for each prompt (see https://www.bloomberg.com/news/articles/2025-02-18/before-deepseek-blew-up-one-website-announced-its-arrival).
Let’s try the first and very simple task: plotting annualized returns of an ETF. Shockingly, ChatGPT failed at this task previously, as you can read in our book. It committed the egregious error of annualizing returns by multiplying them by sqrt(252).
User Prompt
Write a Python program to download the adjusted closing prices of the ETF BILL, and plot its annualized daily returns chart. We only need YYYY as x-axis label.
Unfortunately, many LLMs (such as ChatGPT, Perplexity, gemma-3-4b-it by Google, and command-a-03-2025 by Cohere) misinterpreted the question and produced some complicated calculations for the rolling 252-day compound returns. The only chatbot that correctly and simply multiplied the daily returns by 252 and displayed that was Microsoft’s Copilot. Bravo, Mr. Nadella!
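For reference, here is a minimal sketch of the simple interpretation we had in mind (assuming yfinance for the download; Copilot’s actual code may of course differ):

```python
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import yfinance as yf

# auto_adjust=True puts adjusted prices in the "Close" column
prices = yf.download("BILL", auto_adjust=True)["Close"]
ann_rets = prices.pct_change().dropna() * 252  # annualize by multiplying by 252

fig, ax = plt.subplots()
ax.plot(ann_rets.index, ann_rets.to_numpy())
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y"))  # YYYY-only x labels
ax.set_ylabel("Annualized daily return")
plt.show()
```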
Now let’s proceed to another task that ChatGPT failed miserably before: summarizing a trading strategy based on Simon and Campasano (2014).
User Prompt
Pretend you are a quantitative finance expert. Please summarize the trading strategies described in this paper. [Author Note: We attach the pdf.]
ChatGPT Response
Absolutely—here’s a summary of the trading strategies described in the paper "The VIX Futures Basis: Evidence and Trading Strategies" by David P. Simon and Jim Campasano:
🧠 Core Insight:
The VIX futures basis—the difference between VIX futures prices and the VIX spot index—does not predict changes in the VIX spot index, but does predict changes in VIX futures prices. This creates trading opportunities due to the convergence of futures prices to spot at expiry.
📊 Main Strategy:
Exploit the "Roll" in the VIX Futures Curve
Short VIX futures when the curve is in contango (futures > spot) beyond a threshold.
Buy VIX futures when the curve is in backwardation (futures < spot) beyond a threshold.
These trades take advantage of the fact that VIX futures converge toward the spot VIX price as they approach settlement.
🔧 Implementation Details:
Daily Roll Signal: The basis is divided by the number of business days to settlement.
Enter short if daily roll > $100
Enter long if daily roll < –$100
Exit criteria:
Daily roll falls below $50 (short) or rises above –$50 (long), or
Maximum holding period of 9 business days
⚖️ Hedging Equity Risk:
Because VIX futures prices move inversely to equity markets, trades are hedged using mini-S&P 500 futures.
Short VIX = Short S&P hedge
Long VIX = Long S&P hedge
Hedge ratios are computed using regressions of VIX futures price changes on S&P futures returns, updated daily out-of-sample.
💵 Profitability Summary (2007–2011):
| Strategy | Mean P&L | Sortino Ratio | Notes |
| --- | --- | --- | --- |
| Short VIX (hedged) | $792 | 1.26 | 62 trades, 2:1 win/loss |
| Long VIX (hedged) | $1,018 | 1.03 | 40 trades |
| Unhedged versions | | | Slightly higher P&L, but significantly higher downside risk |
Transaction costs included full bid-ask spread + brokerage fees.
Strategies showed robust performance even out-of-sample and across sub-periods (pre/post-2009).
🧾 Key Takeaway:
The profitability is mainly driven by the roll yield (the premium or discount in futures relative to spot), and much of the risk can be hedged using equity index futures, enabling statistically significant and economically meaningful returns.
Let me know if you’d like this turned into a code simulation, visual walkthrough, or executive-style slide!
This is a much better response than the one recorded in our book. We truly can’t ask for a better answer. Let’s take ChatGPT up on its offer and turn it into code. The results can be found here: https://chatgpt.com/canvas/shared/680a8810a5808191aaa35e4b31d0a813. Looks great, doesn’t it? What we don’t know, however, is whether ChatGPT has used our previous prompt as part of its training data for this version, so this isn’t a completely fair benchmark.
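In case that link doesn’t load for you, here is our own minimal sketch of just the entry/exit rules from the summary above (with hypothetical input series; this is not ChatGPT’s canvas code, and the equity hedge is omitted):

```python
import pandas as pd

def roll_signals(futures, spot, days_to_settle,
                 entry=100.0, exit_=50.0, max_hold=9, multiplier=1000.0):
    """futures, spot, days_to_settle: pd.Series aligned on trading dates.
    Returns a Series of positions: -1 short, +1 long, 0 flat."""
    daily_roll = (futures - spot) / days_to_settle * multiplier  # $ per day
    pos, held = 0, 0
    signals = pd.Series(0, index=futures.index)
    for t, roll in daily_roll.items():
        if pos == 0:
            if roll > entry:      # steep contango: short VIX futures
                pos, held = -1, 0
            elif roll < -entry:   # steep backwardation: long VIX futures
                pos, held = 1, 0
        else:
            held += 1
            exit_short = pos == -1 and roll < exit_
            exit_long = pos == 1 and roll > -exit_
            if exit_short or exit_long or held >= max_hold:
                pos = 0           # exit on signal decay or 9-day time stop
        signals.loc[t] = pos
    return signals
```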
TL;DR
Yes, the answers got much better over the last year. But still, not all LLMs give equally satisfactory answers: sometimes one needs to try quite a few to find a suitable one. We suggest you try out other prompts in our book and see if the answers have improved!