Friday, March 03, 2017

More Data or Fewer Predictors: Which is a Better Cure for Overfitting?

One of the perennial problems in building trading models is the sparseness of data and the attendant danger of overfitting. Fortunately, there are systematic methods of dealing with both ends of the problem. These methods are well known in machine learning, though most traditional machine learning applications have a lot more data than we traders are used to. (E.g., Google used 10 million YouTube videos to train a deep learning network to recognize cats' faces.)

To create more training data out of thin air, we can resample (perhaps more vividly, oversample) our existing data. This is called bagging (bootstrap aggregating). Let's illustrate this using a fundamental factor model described in my new book. It uses 27 factor loadings such as P/E, P/B, Asset Turnover, etc. for each stock. (Note that I call cross-sectional factors, i.e. factors that depend on each stock, "factor loadings" instead of "factors", by convention.) These factor loadings are collected from the quarterly financial statements of S&P 500 companies, and are available from Sharadar's Core US Fundamentals database (as well as more expensive sources like Compustat). The factor model is very simple: it is just a multiple linear regression model with the next quarter's return of a stock as the dependent (target) variable, and the 27 factor loadings as the independent (predictor) variables. Training consists of finding the regression coefficients of these 27 predictors. The trading strategy based on this predictive factor model is equally simple: if the predicted next-quarter return is positive, buy the stock and hold for a quarter. Vice versa for shorts.

Note that one step has already been taken to cure data sparseness: we do not try to build a separate model with a different set of regression coefficients for each stock. Instead, we constrain the model such that the same regression coefficients apply to all the stocks. Otherwise, the training data that we use from 200701-201112 would have only 1,260 rows per stock, instead of 1,260 x 500 = 630,000 rows for the pooled model.
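As a rough illustration (this is not the code from the book), the pooled regression might be set up as in the following Python sketch, where random placeholder data stand in for the actual factor loadings and returns:

```python
# A sketch of the pooled factor model: one set of regression coefficients
# shared by all stocks. X and y are random placeholder data standing in for
# the 27 factor loadings and next-quarter returns, stacked across stocks and days.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_rows, n_factors = 63_000, 27                  # a tenth of the 630,000 rows in the text, to keep the sketch light
X = rng.standard_normal((n_rows, n_factors))    # factor loadings (placeholder)
y = 0.1 * rng.standard_normal(n_rows)           # next-quarter returns (placeholder)

model = LinearRegression().fit(X, y)            # one coefficient per factor loading
predicted_returns = model.predict(X)

# Trading rule: long if the predicted next-quarter return is positive,
# short if it is negative, holding for a quarter.
positions = np.sign(predicted_returns)
```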

The result of this baseline trading model isn't bad: it has a CAGR of 14.7% and Sharpe ratio of 1.8 in the out-of-sample period 201201-201401. (Caution: this portfolio is not necessarily market or dollar neutral. Hence the return could be due to a long bias enjoying the bull market in the test period. Interested readers can certainly test a market-neutral version of this strategy hedged with SPY.) I plotted the equity curve below.

[Equity curve of the baseline factor model, out-of-sample period 201201-201401]

Next, we resample the data by randomly picking N (=630,000) data points with replacement to form a new training set (a "bag"), and we repeat this K (=100) times to form K bags. For each bag, we train a new regression model. At the end, we average over the predicted returns of these K models to serve as our official predicted returns. This results in a marginal improvement of the CAGR to 15.1%, with no change in Sharpe ratio.
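A minimal sketch of this bagging procedure, under the same placeholder-data assumptions as above (with a smaller N so it runs quickly):

```python
# Bagging sketch: draw N rows with replacement to form each of K bags,
# fit a separate regression on each bag, and average the K predictions.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, n_factors, K = 10_000, 27, 100
X = rng.standard_normal((N, n_factors))
y = 0.1 * rng.standard_normal(N)
X_new = rng.standard_normal((500, n_factors))   # e.g. the current cross-section of 500 stocks

preds = np.zeros((K, len(X_new)))
for k in range(K):
    idx = rng.integers(0, N, size=N)            # resample N data points with replacement
    preds[k] = LinearRegression().fit(X[idx], y[idx]).predict(X_new)

official_pred = preds.mean(axis=0)              # average over the K models' predictions
```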

Now, we try to reduce the predictor set. We use a method called "random subspace": we randomly pick half of the original predictors to train a model, and repeat this K=100 times. Once again, we average over the predicted returns of all these models. Combined with bagging, this results in a further marginal improvement of the CAGR to 15.1%, again with little change in Sharpe ratio.
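A sketch of random subspace combined with bagging, again on placeholder data:

```python
# Random subspace sketch: each of the K models is trained on a random half
# of the 27 predictors (here combined with bagging by also resampling rows);
# the K predictions are averaged as before.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, n_factors, K = 10_000, 27, 100
X = rng.standard_normal((N, n_factors))
y = 0.1 * rng.standard_normal(N)
X_new = rng.standard_normal((500, n_factors))

preds = np.zeros((K, len(X_new)))
for k in range(K):
    idx = rng.integers(0, N, size=N)                                   # bagging: resample rows
    cols = rng.choice(n_factors, size=n_factors // 2, replace=False)   # random subspace: half the predictors
    preds[k] = LinearRegression().fit(X[idx][:, cols], y[idx]).predict(X_new[:, cols])

official_pred = preds.mean(axis=0)
```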

The improvements from either method may not seem large so far, but at least they show that the original model is robust with respect to randomization.

But there is another method of reducing the number of predictors, called stepwise regression. The idea is simple: we add one predictor from the original set to the model at a time, keeping it only if the BIC (Bayesian Information Criterion) decreases. BIC is essentially the negative log likelihood of the training data under the regression model, plus a penalty term proportional to the number of predictors. That is, if two models have the same log likelihood, the one with the larger number of parameters will have the larger BIC and is thus penalized. Once we have reached the minimum BIC, we then try removing one predictor from the model at a time, until the BIC cannot decrease any further. Applying this to our fundamental factor loadings, we achieve quite a significant improvement of the CAGR over the base model: 19.1% vs. 14.7%, with the same Sharpe ratio.
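The implementation used for the results above is not reproduced here, but a bare-bones version of forward-backward stepwise selection on BIC could look like the following sketch; statsmodels supplies the BIC of each candidate model, and the synthetic data are constructed so that only two of the 27 predictors are informative:

```python
# Stepwise regression sketch: greedy forward selection on BIC, followed by
# backward elimination. All variable names and the synthetic data are illustrative only.
import numpy as np
import statsmodels.api as sm

def bic_of(X, y, cols):
    """BIC of an OLS model using the given subset of predictor columns."""
    if not cols:
        return sm.OLS(y, np.ones((len(y), 1))).fit().bic   # intercept-only model
    return sm.OLS(y, sm.add_constant(X[:, sorted(cols)])).fit().bic

def stepwise_bic(X, y):
    selected, best_bic = set(), bic_of(X, y, set())
    improved = True
    while improved:                                 # forward passes: try adding each unused predictor
        improved = False
        for j in set(range(X.shape[1])) - selected:
            bic = bic_of(X, y, selected | {j})
            if bic < best_bic:
                best_bic, best_j, improved = bic, j, True
        if improved:
            selected.add(best_j)
    improved = True
    while improved and selected:                    # backward passes: try removing each selected predictor
        improved = False
        for j in selected:
            bic = bic_of(X, y, selected - {j})
            if bic < best_bic:
                best_bic, best_j, improved = bic, j, True
        if improved:
            selected.remove(best_j)
    return sorted(selected)

rng = np.random.default_rng(0)
X = rng.standard_normal((2_000, 27))
y = 0.5 * X[:, 3] - 0.3 * X[:, 11] + rng.standard_normal(2_000)   # only columns 3 and 11 matter
print(stepwise_bic(X, y))                           # should recover (roughly) columns 3 and 11
```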

It is also satisfying that the stepwise regression model picked only two variables out of the original 27. Let that sink in for a moment: just two variables account for all of the predictive power of a quarterly financial report! As to which two variables these are - I will reveal that in my talk at QuantCon 2017 on April 29.

===

My Upcoming Workshops

March 11 and 18: Cryptocurrency Trading with Python

I will be moderating this online workshop for my friend Nick Kirk, who taught a similar course at CQF in London to wide acclaim.

May 13 and 20: Artificial Intelligence Techniques for Traders

I will discuss in detail AI techniques such as those described above, with other examples and in-class exercises. As usual, nuances and pitfalls will be covered.

33 comments:

Honey said...

Wow, only two variables! :D But how could we determine that those two variables are sufficient for the last 40 days but not for 400 days?

Honey said...

BTW, your results also show that the KISS principle rocks! https://en.wikipedia.org/wiki/KISS_principle

David Bryant said...

Great article. Could also use the Akaike information criterion.

Ernie Chan said...

Hi Honey,
Those 2 variables worked for the entire in- and out-of-sample period, so I am not sure what you meant by "last 40 days".

Yes, I have always been a big fan of KISS!

Ernie

Ernie Chan said...

Thank you, David!

Yes, AIC and BIC are both used often.

Ernie

Eduardo Gonzatti said...

Dr Chan, nice post, as usual!
Did you try to run an LS version of this simple strategy, shorting those that seemed overpriced, or hedging with the SP500 as you said? Could you post the rough numbers, if you did?

Thanks!

Best Regards

Ernie Chan said...

Thanks, Eduardo!

I have not tested the market neutral version. However, as presented, it is already LS, though the 2 sides may not net to 0.

Ernie

Eduardo Gonzatti said...

Sorry Dr, I misread it!
Do you think that the long bias from the market during the OOS window could have inflated the results by this much?

Best Regards

Ernie Chan said...

Hi Eduardo,
I don't know how much the net exposure may have inflated the result. My guess is that there is a long bias in the portfolio, because most stocks have positive returns during both the training and OOS periods.
Ernie

Eduardo Gonzatti said...

Thanks for the caveat, Dr!

dashiell said...

Great post.

Sorry, this may be a very silly question, but I'm having trouble following what each row of data represents. How do you get 1260 rows for each stock? Based on the number, it seems like that represents daily returns for 5 years, but it didn't seem like daily returns were being used here. Thanks.

Thomas said...

Why not just do L1/L2 regularization?

Ernie Chan said...

Hi dashiell,

5 years * 252 trading days per year = 1260 rows.

Daily returns are used.

Ernie

Ernie Chan said...

Hi Thomas,
Thank you for mentioning L1/L2 regularization as yet another method to reduce overfitting.

I am not claiming that the methods I outlined above are superior to every other ML method out there such as L1/L2 regularization. If you are able to present the L1/L2 results here, you would be most welcome. Otherwise, I will try it and present my findings at QuantCon 2017.

Ernie

Anonymous said...

So you try one variable and check BIC. Then add another (at random?) and check for decreased BIC? Is it as simple as an iterative search for lowest BIC?

Ernie Chan said...

In stepwise regression, we try *every* variable at each iteration, with replacement, and pick the one that has the lowest BIC, either to be added or removed.

So no randomness involved.

Ernie

Ever Garcia said...

Is the CAGR adjusted for transaction costs, say for a $100K account?

Anonymous said...

Interesting post. I did not understand your answer to dashiell on the number of training rows available. The post refers to using factor loadings from quarterly reports as the only independent variables in the regression model, so how can you use it for daily returns? It would seem you have 4 different data points per year, for a total of only 5 x 4 = 20 points per stock for predicting the next quarterly return, unless I misunderstood the regression setup. Can you clarify please?

Ernie Chan said...

Hi Ever,
I typically just take 5bps as transaction costs for S&P 500 stocks. It doesn't depend on your account NAV.
I haven't, however, deducted transaction costs in my results discussed above.

Ernie

Ernie Chan said...

Even though each stock has new data 4 times a year, the stocks all have different earnings release dates. Hence we have to compare all the stocks' fundamental numbers every day to make investment decisions.

Please see Example 2.2 in my new book for details and code.

Ernie

Ever Garcia said...

Thanks for the clarification.

Anonymous said...

Hi Ernie,

When we backtest a trading strategy, after we get daily P&L, how do we get daily returns of the strategy?

If we assume different initial equity, we may get different volatility later.

Thanks.

Ernie Chan said...

Hi,
There are 2 types of returns you can consider.

For unlevered returns, divide the P&L by the "gross absolute market value" of your portfolio. For example, if you are long $10 of AAPL and short $5 of MSFT, the gross absolute market value is $15.

For levered returns, divide the P&L by the NAV of the account. In this case, yes, different leverages (i.e. initial NAV) will result in different returns.
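A tiny numerical sketch of these two definitions, with hypothetical dollar figures:

```python
# Hypothetical one-day example of the two return definitions.
daily_pnl = 0.30                                  # dollar P&L for the day (hypothetical)
positions = {"AAPL": 10.0, "MSFT": -5.0}          # dollar market value of each position

gross_abs_value = sum(abs(v) for v in positions.values())   # $15 in this example
nav = 7.5                                         # account equity / NAV (hypothetical)

unlevered_return = daily_pnl / gross_abs_value    # independent of leverage
levered_return = daily_pnl / nav                  # depends on the leverage actually used
print(unlevered_return, levered_return)           # 0.02 and 0.04
```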

Ernie

Anonymous said...

Hi Ernie,

Thank you for quick response!

If we want to do portfolio optimization (a portfolio of strategies, not assets), how do we get the covariance matrix according to the returns computation methods you mentioned above?

Many thanks.

Ernie Chan said...

Hi,
For portfolio optimization, one should use unlevered returns.
Ernie

Anonymous said...

Hi Ernie,

Using Johansen-based pairs trading, is it possible to get intraday signals when we only use end-of-day prices?

Ernie Chan said...

Hi,
If you determined that a pair is cointegrating by applying the Johansen test to daily prices, it is quite OK to trade it intraday as well.
Ernie

aagold said...

Ernie,

The main result of your article is "we achieve a quite significant improvement of the CAGR over the base model: 19.1% vs. 14.7%, with the same Sharpe ratio."

Are you calculating CAGR using levered or unlevered returns?

I'm surprised the CAGR can be so different when the Sharpe ratio is the same. Eq. 7.3 from Thorp's Kelly Criterion paper shows that the maximum achievable growth rate of a continuous-time diffusion is a function of the Sharpe ratio S (g_opt = S^2/2 + risk-free rate). So if you're seeing such a major difference in CAGR without any difference in S, then it must be because you're operating far from the Kelly-optimal leverage. If you optimized leverage then you probably wouldn't see any difference in CAGR.

Ernie Chan said...

aagold,
You have a good point.

I was reporting unlevered returns. Indeed, if we were to use Kelly-optimal leverage, the levered returns would be the same - but that's assuming that the returns distribution is close to Gaussian. There are usually fat tails that render the Kelly leverage sub-optimal, and so we would prefer a strategy that has higher unlevered returns if its Sharpe ratio is the same as the alternative's.
Ernie

aagold said...

Thanks for the quick response.

Just out of curiosity - since the original 27-variable model and the optimized 2-variable version have the same Sharpe ratio but different CAGR, one must have both a higher numerator (mean excess return) and a higher denominator (standard deviation) such that the changes cancel out. Which model has both higher mean excess return and higher standard deviation?

I'm trying to figure out if the CAGR improvement of the 2-variable model happened because it's both more volatile and has a higher return than the 27-variable version, which implies the 27-variable version was far below Kelly-optimal leverage, or if it's the opposite, which implies the 27-variable version was far above Kelly-optimal leverage.

Ernie Chan said...

aagold,
Your deduction is correct.

The 2-variable model has higher CAGR and average annualized (uncompounded) return too. The latter is what goes into Sharpe ratio. Since it has the same Sharpe ratio as the 27-variable model, it must mean that the 2-variable model has higher volatility.

Ernie

aagold said...

Ernie,

I think it would be interesting to explore how Kelly optimization interacts with the predictor-variable reduction work you've discussed.

For example, how does the Kelly-optimized 27-variable CAGR compare to the Kelly-optimized 2-variable CAGR? They won't necessarily turn out the same, even though the out-of-sample unleveraged Sharpe ratios ended up the same, because the leverage optimization is done using in-sample data and the testing is done with out-of-sample data. I suspect the in-sample leverage optimization using the 2-variable model will be more effective than that using the 27-variable model.

I don't have your data and backtesting software, otherwise I'd investigate this myself, but if you're interested in doing that study then I think it would be interesting.

Regards,
aagold

Ernie Chan said...

aagold,
Interesting idea. I will look into this when I have some time.
Thanks,
Ernie