UNIVERSITY OF PENNSYLVANIA
Statistical Pair Trading on International ETFs
Rebecca Wu
Troy Shu
STAT 434 Final Project Report Steele
December 18, 2012
P a g e | 1
I. Executive Summary
Pair trading international ETFs with a non-adaptive strategy does not seem to perform
well over longer time frames due to the changing dynamics of mean reversion and momentum in
international ETFs. However, after applying an adaptive “filter” to our pair trading strategy, our
returns improved dramatically, which suggests that successfully capturing these shifts between
mean reversion and momentum can be profitable.
Our first step was to conduct exploratory data analysis on the price and return data for 22
international ETFs. We did not find anything out of the ordinary: the international ETF prices are
highly autocorrelated, not normally distributed and not stationary while the ETF returns are
autocorrelated, heavy-tailed and stationary.
Next, we backtested our international ETF pair trading strategy. Our strategy used the
Augmented Dickey-Fuller stationarity test to select only the cointegrated ETF pairs as potential
trades. After regressing the price of one ETF against the price of the other over a rolling 120-day
formation period, our strategy then ordered the ETF pairs by the magnitude of the current
residual on the 120th day and selected the top 5 ETF pairs with the largest residual/divergence to
trade for the next 20 days.
Initial results were poor: the strategy produced a full period Sharpe Ratio of -0.16 and a
max drawdown of -53.3%. Plotting the rolling Sharpe Ratios showed that they oscillated around
0.00, so the overall risk-reward relationship of our initial strategy remained poor throughout time.
We considered using a GARCH(1,1) model to obtain a clearer picture of the standard deviation
of our strategy returns, and thus a clearer picture of the rolling Sharpe Ratio. However, the fact
that the residuals of our strategy’s returns are heavy-tailed precluded the use of GARCH to
model the standard deviation of our strategy’s returns. We also conducted an analysis using
different Kelly criterion bets and, as expected, the Kelly versions of our strategy had a higher
terminal wealth and compound annualized growth rate than the strategy that did not use Kelly bets.
In looking for ways to improve our initial international pairs trading ETF strategy, we
noticed that there seemed to be “regime shifts” over time between the dominance of mean
reversion or momentum in the returns of the ETFs. We applied a moving average “filter” to the
initial trading strategy to reverse the international ETF pair trades in the correct direction when it
seemed that these regime shifts occurred. Our pair trading strategy’s returns improved
dramatically, producing a full period Sharpe Ratio close to 1 and a max drawdown of -35%.
II. Premise: “Pairs Trading on International ETFs” Paper
We decided to base our project on the premise of the quantitative financial research paper
titled “Pairs Trading on International ETFs”, authored by Schizas, Thomakos, and Wang. In their
paper, Schizas and his colleagues developed an international ETFs pair trading strategy that
produced spectacular results but did not seem to have a strong statistical foundation.
The authors of the paper used 23 international ETFs, representing countries such as the
USA, Germany, Brazil, Japan, and even smaller countries like Belgium and Malaysia. The
authors implemented their backtest using a rolling window: They had a 120-day formation
period in which they ranked all pairs of international ETFs and selected the top five to trade in a
simple 1-to-1 ratio. Then they had a 20-day trading period in which they calculated the ex-post
returns of the ETF pairs that they selected in the formation period. Rolling these two windows
forward together by 20 days produced ex-post returns for another 20 days.
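The rolling scheme described above can be sketched as an index generator. This is only an illustration: the function name `rolling_windows` and the use of 0-based day indices are assumptions made here, not details taken from the paper.

```python
from typing import Iterator, Tuple


def rolling_windows(n_days: int, formation: int = 120, trading: int = 20
                    ) -> Iterator[Tuple[range, range]]:
    """Yield (formation_days, trading_days) index ranges.

    Both windows are rolled forward together by `trading` days, so the
    trading windows tile the sample without overlapping.
    """
    start = 0
    while start + formation + trading <= n_days:
        yield (range(start, start + formation),
               range(start + formation, start + formation + trading))
        start += trading


# Example: 200 days of data give four formation/trading window pairs.
windows = list(rolling_windows(200))
```

Each trading window starts on the day after its formation window ends, so the ranking never uses data from the period in which returns are measured.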
To order the ETF pairs, the authors used the average absolute difference between the
cumulative returns of two ETFs starting from the beginning of the 120-day formation period. In
doing so, they were essentially betting that two international ETFs whose prices have diverged
substantially will tend to converge in the future. However, they did not offer either a fundamental
economic reason or statistical evidence to explain such convergence behavior.
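A minimal sketch of this ranking metric follows, assuming that cumulative returns are measured from the first price of the formation window. The function names are hypothetical, and the paper's exact computation may differ.

```python
def cumulative_returns(prices):
    """Cumulative simple return of each day relative to the first price."""
    base = prices[0]
    return [p / base - 1.0 for p in prices]


def avg_abs_divergence(prices_a, prices_b):
    """Average absolute gap between two ETFs' cumulative returns."""
    cum_a = cumulative_returns(prices_a)
    cum_b = cumulative_returns(prices_b)
    return sum(abs(a - b) for a, b in zip(cum_a, cum_b)) / len(cum_a)


# Toy example: one ETF rises 10% and then 20%, the other stays flat.
gap = avg_abs_divergence([100.0, 110.0, 120.0], [100.0, 100.0, 100.0])
```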
When assessing the performance of their strategy, the authors neglected to provide basic
performance metrics such as monthly return, compound annualized growth rate, or max
drawdown numbers for their strategy. They only provided a single bar chart of monthly returns
and a few equity curves that are depicted below.
Their results seemed too good to be true given that very few months had negative returns,
even throughout the 2008 financial crisis. Furthermore, the negative returns never fell below -5%
while the positive returns frequently exceeded +5%, even reaching levels of 10% or 20% at some
points. The equity curves also seemed suspect since the portfolio for the top five pairs
consistently beat the market throughout the entire period.
The goal of our project was to develop a more statistically sound international ETFs pair
trading strategy by only trading cointegrated international ETF pairs. This avoids basing our
trading decisions on spurious regressions, which can produce a highly significant alpha and beta
even if the two international ETFs are completely independent of each other. In addition, rather
than only trading the pair in a 1-to-1 ratio, we also used the Engle-Granger two-step method to
determine the cointegration ratio and constructed our trades by going long ETF y and short ETF x
in proportion to the cointegration ratio.
III. Description of Our Strategy
In our project, we used the same type of rolling backtest that Schizas and his co-authors
used in their strategy; however, we implemented the Engle-Granger two-step method in selecting
our pairs to trade. First, we performed a regression on the price data from the 120-day formation
period for each of the ETF pairs. Next, we ran a Dickey-Fuller stationarity test on the set of
residuals from the regression to check whether the spread between the two ETF prices was
stationary. If the residuals were not stationary, we eliminated the pair since it did not make
statistical sense to pair trade two ETFs whose prices are expected to diverge.
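The two steps can be illustrated with a short pure-Python sketch on simulated data. Several details are assumptions made for the example rather than facts from our backtest: the regression includes an intercept, the Dickey-Fuller regression uses no lags, and the cutoff of about -3.34 is an approximate asymptotic 5% critical value for a two-variable cointegrating regression that should be checked against published tables.

```python
import random


def ols(y, x):
    """Regress y on x with an intercept; return (alpha, beta)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    beta = sxy / sxx
    return mean_y - beta * mean_x, beta


def df_tstat(e):
    """Dickey-Fuller t-statistic (no constant, no lags): regress the
    change e[t] - e[t-1] on the level e[t-1] and return the slope's t-ratio."""
    lag = e[:-1]
    diff = [b - a for a, b in zip(e[:-1], e[1:])]
    sll = sum(v * v for v in lag)
    rho = sum(l * d for l, d in zip(lag, diff)) / sll
    resid = [d - rho * l for l, d in zip(lag, diff)]
    s2 = sum(r * r for r in resid) / (len(diff) - 1)
    return rho / (s2 / sll) ** 0.5


# Simulated 120-day formation window: x is a random walk and y is tied to x.
rng = random.Random(0)
x = [100.0]
for _ in range(119):
    x.append(x[-1] + rng.gauss(0.0, 1.0))
y = [5.0 + 2.0 * xi + rng.gauss(0.0, 0.1) for xi in x]

alpha, beta = ols(y, x)                                       # step 1: regression
residuals = [yi - alpha - beta * xi for xi, yi in zip(x, y)]
t_stat = df_tstat(residuals)                                  # step 2: unit-root test
is_cointegrated = t_stat < -3.34                              # assumed cutoff
```

A strongly negative t-statistic rejects a unit root in the residuals, which is the condition for keeping a pair.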
After determining the valid pairs to potentially trade, we ranked the remaining pairs
based on the absolute magnitude of the last residual in the 120-day formation period (i.e. the
most recent residual). Then we selected the top five pairs with the greatest level of divergence to
trade for the following 20 days.
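The selection step amounts to a sort on absolute residuals. In the sketch below, the dictionary layout and the pair labels are hypothetical.

```python
def top_divergent_pairs(last_residuals, k=5):
    """Return the k pairs with the largest absolute final residual.

    last_residuals maps a pair label to the residual on the last day of
    the formation period (cointegrated pairs only).
    """
    ranked = sorted(last_residuals,
                    key=lambda pair: abs(last_residuals[pair]),
                    reverse=True)
    return ranked[:k]


# Hypothetical residuals for three pairs; keep the two most divergent.
picks = top_divergent_pairs({"SPY-EWM": -1.0, "SPY-EWJ": 0.3, "EWG-EWJ": 0.8}, k=2)
```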
We constructed dollar neutral positions in the top ETF pairs that we selected using the
estimated betas from the regressions. We made sure that these trades were made in the correct
direction. For example, say that we selected the SPY-EWM (iShares MSCI Malaysia
Index) pair because it had a very negative residual of -1 (very high absolute magnitude)
after regressing y = EWM price on x = SPY price. By definition, residuals were calculated as
y − βx = residual, and so a residual that was very negative meant that our βx leg was
“overpriced” compared to our y leg. So in this case, we shorted our x leg (SPY) and went long
our y leg (EWM).
Once we had the trading period (next 20 days) returns for the five pairs, we simply took
the arithmetic average across all pair returns for each day to arrive at our portfolio’s overall return
for that day; in other words, we equal weighted the five pairs that we traded in our portfolio.
One of our exit criteria was as follows: if, within the 20-day trading period, the price
residual of a pair reversed sign from the original price residual, we exited that pair (and did not
rebalance our portfolio). For example, consider the SPY-EWM pair example that we used
before: the original residual was quite negative, -1. The price residual was calculated every
trading day as y − βx, or the price of EWM minus β times the price of SPY (since we defined x
as the price of SPY and y as the price of EWM before). One day, the price residual became 0.01;
since the sign was now opposite to the sign of our original residual (-1), we exited (on the next
day’s close, since we were only using close data). Intuitively, this made sense because we
originally entered the pair trade to capture the divergence in price as measured by the residual,
with the expectation that this divergence would close and the residual would cross zero; the
difference between the original residual and zero would be our profit. Once the residual
switched signs, we would be trading in the wrong direction. Since we had by then captured most of
the profit in the “spread” as measured by the residual one day before the entry date, we exited as
soon as possible once this happened. We exited all positions in any pairs that we were still
trading once the 20-day trading period was over.
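The entry direction and the sign-reversal exit can be sketched as follows. The function names and the 0-based day convention are assumptions made for the illustration; the rule itself (act on the close of the day after the sign flip, and close everything on the last trading day) follows the description above.

```python
def pair_position(residual):
    """Return (y_side, x_side) with +1 = long and -1 = short,
    for residual = y - beta * x."""
    if residual < 0:
        return (1, -1)   # y looks cheap relative to beta * x
    if residual > 0:
        return (-1, 1)   # y looks expensive relative to beta * x
    return (0, 0)


def exit_day(entry_residual, daily_residuals, holding=20):
    """Index of the trading day on whose close the pair is exited.

    daily_residuals[i] is the residual at the close of trading day i.
    A sign flip seen at the close of day i triggers an exit at the close
    of day i + 1, capped at the last day of the trading period.
    """
    last = min(holding, len(daily_residuals)) - 1
    for i, r in enumerate(daily_residuals[:holding]):
        if r * entry_residual < 0:
            return min(i + 1, last)
    return last


# SPY-EWM example: entry residual of -1, residual turns positive on day 2.
side = pair_position(-1.0)
day = exit_day(-1.0, [-0.8, -0.3, 0.01, 0.2, -0.1], holding=5)
```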
Using a rolling backtest prevented possible data mining issues that could be encountered
from selecting a fixed training and testing period. Additionally, such a rolling backtest made
particular sense in the context of our project for two reasons: (1) our trading strategy was relatively
short-term so it would have been erroneous to include historical data from too far back in time,
and (2) the nature of a pairs trading strategy required screening all the pairs and only trading those
pairs that seemed the most promising. In this case, our pairs were ordered by a univariate metric,
namely the magnitude of the price regression residual.
IV. EDA of the International ETFs Data
After conducting exploratory data analysis on our international ETFs data, we found that
there were no surprises because both the ETF price and return series displayed the expected
behavior of typical financial series. As expected, the ETF prices were highly autocorrelated and
not stationary while the returns exhibited heavy tails, autocorrelation, and stationarity.
Out of the 23 ETFs that Schizas and his colleagues used, we collected closing price data
for 22 ETFs that spanned the time period from April 01, 1996 to December 31, 2011. We
omitted the 23rd ETF, EZU (iShares MSCI EMU Index), because its data was not
found on CRSP. While the majority of ETF records started on April 01, 1996, two ETFs (Korea
and Taiwan) had data starting on May 10, 2000 and June 20, 2000. From the 22 ETFs we could
form a maximum of 231 pairs.
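The pair count is the number of unordered pairs that can be drawn from 22 ETFs:

```latex
\binom{22}{2} = \frac{22 \times 21}{2} = 231
```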
Normality:
When testing the international ETFs return data for normality, both the Shapiro-Wilk
and Jarque-Bera normality tests yielded p-values of 0.00 that strongly reject the null hypothesis
of a normal distribution. The Jarque-Bera test resulted in very high values for the test statistic for
each of the return series, indicating the presence of heavy tails. The normal QQ plots for several
of the largest ETFs show the heavy-tailed distribution clearly.
[Figure: Normal QQ plots of SPY, EWG (Germany), and EWJ (Japan) returns. x-axis: quantiles of standard normal; y-axis: empirical quantiles of returns.]
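For reference, the Jarque-Bera statistic combines sample skewness S and kurtosis K as JB = n/6 · (S² + (K − 3)²/4). The sketch below is a plain-Python illustration on toy data, not the routine we ran. Under normality the statistic is approximately chi-squared with 2 degrees of freedom (5% critical value of about 5.99), so values in the thousands indicate heavy tails.

```python
import random


def jarque_bera(xs):
    """Jarque-Bera statistic computed from population moments."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


# Toy comparison: Gaussian draws versus the same draws plus a few extreme days.
rng = random.Random(1)
normal_like = [rng.gauss(0.0, 0.01) for _ in range(1000)]
heavy_tailed = normal_like + [0.10, -0.10, 0.12, -0.12]
jb_normal = jarque_bera(normal_like)
jb_heavy = jarque_bera(heavy_tailed)
```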
Independence:
Next, we conducted the Ljung-Box test to check for the presence of autocorrelation for all 22
ETFs. The histogram of the p-values from the Ljung-Box tests shows that all the p-values were
very close to zero. Therefore, we strongly rejected the null hypothesis that the return series
contained no autocorrelation for all international ETFs.
[Figure: Histogram of p-value for Ljung Box on ETF Returns. x-axis: Ljung-Box p-value (0.0 to 0.0020); y-axis: number of ETFs.]
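The Ljung-Box statistic is Q = n(n + 2) · Σ ρ_k² / (n − k), summed over lags k = 1..h, where ρ_k is the lag-k sample autocorrelation. The sketch below is illustrative: the default of 10 lags is an assumption, and a statistics package would be needed to convert Q into a p-value from the chi-squared distribution with h degrees of freedom.

```python
def autocorr(xs, k):
    """Lag-k sample autocorrelation."""
    n = len(xs)
    mean = sum(xs) / n
    denom = sum((x - mean) ** 2 for x in xs)
    num = sum((xs[t] - mean) * (xs[t - k] - mean) for t in range(k, n))
    return num / denom


def ljung_box_q(xs, lags=10):
    """Ljung-Box Q statistic over lags 1..lags."""
    n = len(xs)
    return n * (n + 2) * sum(autocorr(xs, k) ** 2 / (n - k)
                             for k in range(1, lags + 1))


# A perfectly alternating series has lag-1 autocorrelation of -0.99 at n = 100.
q1 = ljung_box_q([1.0, -1.0] * 50, lags=1)
```

For one lag, the 5% critical value of the chi-squared distribution is about 3.84, so the toy series above rejects independence by a wide margin.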
Autocorrelation:
In the ACF plots included below for several of the largest ETFs, we found that the lag 1
coefficient had the largest magnitude. Since the lag 1 coefficient was negative, the ETFs seemed
to be short-term mean reverting. There were also some lags between 10 and 20 that had large
positive magnitudes, which may potentially signal medium-term momentum, but the lags could
be too far away to be meaningful.
[Figure: Autocorrelation (ACF) plots of returns for lags 0 to 30. Panel titles: Autocorrelation of SPY Returns; Autocorrelation of EWG Returns; Autocorrelation of EWG Returns.]
We then collected all of the AR(1) coefficients when modeling each ETF as an AR(1)
process. The histogram of these coefficients shows that all of the AR(1)
coefficients were negative, meaning that all of the ETFs in our universe tended to display mean
reversion in the short term.
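One simple way to estimate an AR(1) coefficient is the least-squares slope of each return on the previous return. The sketch below uses that estimator as an assumption; the fitting routine in a statistics package may use a different method, such as Yule-Walker or maximum likelihood.

```python
def ar1_coefficient(xs):
    """OLS slope from regressing x[t] on x[t-1] with an intercept."""
    prev, curr = xs[:-1], xs[1:]
    n = len(prev)
    mean_p, mean_c = sum(prev) / n, sum(curr) / n
    spp = sum((p - mean_p) ** 2 for p in prev)
    spc = sum((p - mean_p) * (c - mean_c) for p, c in zip(prev, curr))
    return spc / spp


# A series that reverses half of each move has an AR(1) coefficient of -0.5.
phi = ar1_coefficient([(-0.5) ** k for k in range(10)])
```

A negative coefficient means that an above-average return tends to be followed by a below-average one, which is the short-term mean reversion described above.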
Stationarity:
We conducted the Augmented Dickey-Fuller stationarity test on the ETF returns to see
whether they contained a unit root. The histogram of the test statistics shows that
the Augmented Dickey-Fuller test statistics for all international ETFs were highly negative: the
most negative test statistic was around -73, and the least negative test statistic was -17. This
result indicated that none of the international ETF return series contained a unit root and that
they were consequently all stationary, which is expected for financial return series.
[Figure: Histogram of AR(1) Coefficient on ETF Returns. x-axis: AR(1) coefficients (-0.15 to 0.0); y-axis: number of ETFs.]
[Figure: Histogram of Augmented Dickey-Fuller Test Statistic. x-axis: ADF test statistic (-80 to 0); y-axis: number of ETFs.]
V. Overview of Our Strategy’s Performance
After confirming that our data was clean, we backtested our pair trading strategy to
generate returns and see how our strategy would have performed over time. We ran our rolling
backtest over the time period from September 04, 2004 to December 31, 2011. We selected
September 04, 2004 as the start date because the previous trading day was the
last day on which at least one of the 22 ETFs had a zero volume day. We did not
want to be trading any low liquidity ETFs.
On the following page, we included a graph of the strategy’s equity curve (cumulative
growth of investing $1 in the strategy). The strategy had a compound annualized growth rate of
-3.7%, a maximum drawdown of -53.3%, and a full period annualized Sharpe Ratio of -0.16 (the
full period annualized Sharpe Ratio was calculated by taking the mean daily excess return
above 10 year Treasuries for the full period, dividing by the daily standard deviation, and then
annualizing this quotient by multiplying by 250/√250 = √250).
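The three performance metrics can be computed from a list of daily returns as sketched below. The function names are hypothetical, the 250-day year follows the annualization above, and the daily risk-free rate is left as a parameter because its exact construction is not given here.

```python
import math
import statistics


def cagr(returns, periods_per_year=250):
    """Compound annualized growth rate of a daily return series."""
    growth = math.prod(1.0 + r for r in returns)
    return growth ** (periods_per_year / len(returns)) - 1.0


def max_drawdown(returns):
    """Largest peak-to-trough decline of the equity curve (a negative number)."""
    level, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        level *= 1.0 + r
        peak = max(peak, level)
        worst = min(worst, level / peak - 1.0)
    return worst


def annualized_sharpe(returns, daily_rf=0.0, periods_per_year=250):
    """Mean daily excess return over its standard deviation, times sqrt(250)."""
    excess = [r - daily_rf for r in returns]
    return (statistics.mean(excess) / statistics.stdev(excess)
            * math.sqrt(periods_per_year))


# Toy example: equity goes 1.00 -> 1.10 -> 0.55 -> 0.66.
dd = max_drawdown([0.10, -0.50, 0.20])
sr = annualized_sharpe([0.01, 0.03])
```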
[Figure: Equity Curve of Our Pair Trading Strategy. y-axis: levels (0 to 1.2).]
Interestingly, the above equity curve shows that our international ETF pairs trading
strategy seemed to consistently lose money from early 2005 to early 2008. The strategy returns
then jumped upwards erratically for a few years from mid 2008 to mid 2011. However, from mid
2011 onwards, the returns became negative again.
These results demonstrated the dynamics of dominance between mean reversion and
momentum in our pair trading strategy. The fact that our strategy consistently lost money from
2005 to 2008 meant that we were consistently making the wrong bets: instead of making a
successful bet on international ETF convergence, the ETF pairs seemed to diverge even more
after we selected them. In other words, there appeared to be momentum in the international ETFs
we traded during that period.
However, there seemed to be a regime shift after 2008, as our pair trading strategy’s
returns improved. This suggests that international ETFs became more mean
reverting than trending in the short to intermediate term. This makes intuitive sense, as the
economies of the world were in crisis during the couple of years after 2008, and so they—or at
least their markets—probably tended to move together. However, our project was an empirical
one, so the “economic story” behind the performance of our trading strategy was left as a future
research topic.
VI. EDA of Strategy Returns
The average strategy returns had a
heavy-tailed distribution and contained
autocorrelation, which was typical of most
financial returns. The ranked pair returns
displayed the same statistical
characteristics as the average returns,
although there did not appear to be a
relationship between the rank of the pair
and the level of autocorrelation or
stationarity.
Normality:
Both the Shapiro-Wilk and Jarque-Bera normality tests yielded p-values of 0.0000,
providing strong evidence that the returns were not normally distributed. The Jarque-Bera test
statistic had a very high value of 10,586.97, signifying the presence of heavy tails. The normal
qq-plot below confirmed this observation by showing that the strategy returns indeed followed a
heavy-tailed distribution.
Independence:
The Ljung-Box test resulted in a p-value of 0.0000, meaning that we strongly rejected the
hypothesis that the returns were independent and free of autocorrelation. The
acf-plot above supported this conclusion since it showed that there were
significant lags at lag 1, lag 5 and lag 6. Fitting an AR(p) model (p=1-6) to
the average returns revealed that the AR(1) model provided the best fit, with
the AR(6) model being a close second, since they had the lowest AIC values.
This outcome was consistent with the results from the acf-plot.
Stationarity:
Finally, the Dickey-Fuller test resulted in a p-value of 1.01e-16, which indicated that the
returns were stationary and did not contain a unit root. This conclusion was expected because
only series that accumulate their historical values, such as price data, should contain a
unit root. Returns data, on the other hand, are changes in price and should be stationary
without containing a unit root.
Analysis of Ranked Pair Returns:
After performing the same analysis on the 5 ranked pair returns series, we found that each
series also had heavy tails and autocorrelation, much like the average strategy returns. Some
ranked pair returns contained more autocorrelation than others, but there did not appear to be a
relationship between the rank of the pair and the degree of autocorrelation or stationarity. The
pairs with rank 0 and 2 only had a few statistically significant lags while ranks 1, 3 and 4
had nearly all significant lags up until lag 20, and the pairs with rank 1 and 3 had the lowest
p-values for the Dickey-Fuller test, on the order of 10^-19, while ranks 0, 2 and 4 had higher
p-values on the order of 10^-16.

AIC of the AR(p) models fitted to the average strategy returns:
AR(p)   AIC
1       -8877
2       -8870
3       -8865
4       -8858
5       -8868
6       -8872
VII. Analysis of Strategy Performance: Rolling Sharpe Ratio
Analyzing the 20-day rolling Sharpe ratio of our trading strategy revealed that the
performance of our trading strategy was not very good given that the Sharpe ratio oscillated
around 0.00 across time. Calculating the rolling Sharpe ratio using the GARCH conditional standard
deviation, rather than the rolling standard deviation, resulted in a larger range of outliers in
addition to a smaller spread between quartile 1 and quartile 3 and did not improve the overall
performance of the strategy. A closer look revealed that the GARCH model should not be used
to fit the trading strategy returns at all.
Sharpe Ratio Using Unconditional Standard Deviation:
We first calculated the rolling Sharpe ratio by dividing the rolling mean by the rolling
standard deviation. Below on the left side of the page, the plot of the rolling mean and rolling
standard deviation showed that there was a positive relationship between risk and reward, since
the mean return tended to increase as the standard deviation increased. On the right side, the plot
of the rolling Sharpe ratio using rolling standard deviation showed that the series seemed to
fluctuate around 0.00.
The box plot below showed that the rolling Sharpe ratio did in fact have a mean of 0.00, and the
values between quartiles 1 and 3 ranged from -0.01 to +0.01. Based on this result, our trading
strategy did not seem to have much value.
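The rolling calculation can be sketched as follows. The 20-day window follows the text; leaving the ratio unannualized and ignoring the risk-free rate are assumptions made for this illustration.

```python
import statistics


def rolling_sharpe(returns, window=20):
    """Rolling mean divided by rolling sample standard deviation."""
    out = []
    for end in range(window, len(returns) + 1):
        chunk = returns[end - window:end]
        sd = statistics.stdev(chunk)
        out.append(statistics.mean(chunk) / sd if sd > 0 else float("nan"))
    return out


# Toy example with a 2-day window: every window has mean 0.02 and stdev 0.01414.
ratios = rolling_sharpe([0.01, 0.03, 0.01, 0.03], window=2)
```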
Sharpe Ratio Using Conditional Standard Deviation:
The plot of the average strategy returns above on the right showed that there was some
volatility clustering from Q4 2008 to Q2 2009, so after fitting a GARCH(1,1) model to the
returns we recalculated the rolling Sharpe ratio by dividing the rolling mean by the conditional
standard deviation. We thought this might improve our results since the GARCH conditional
standard deviation should be better at accounting for volatility clustering than rolling standard
deviation, but the performance of the strategy ended up being worse as the mean of the rolling
Sharpe ratio still remained at 0.00 while the spread between quartiles 1 and 3 shrank even further
as shown in the box plot. It was also evident from the box plot on the previous page that there
were more outliers when using the conditional standard deviation than for the unconditional
standard deviation. One particular outlier could be seen in Q3 2010 of the plot of the rolling
Sharpe ratio using conditional standard deviation.
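A GARCH(1,1) model updates the conditional variance as σ²_t = ω + α·ε²_{t−1} + β·σ²_{t−1}. The sketch below only runs this recursion for given parameters. The parameter values in the example are hypothetical, not our fitted estimates, and starting the recursion at the sample variance is an assumption.

```python
def garch_conditional_std(returns, omega, alpha, beta):
    """Conditional standard deviations from the GARCH(1,1) recursion."""
    mean = sum(returns) / len(returns)
    eps = [r - mean for r in returns]
    var = sum(e * e for e in eps) / len(eps)   # start at the sample variance
    sigmas = []
    for e in eps:
        sigmas.append(var ** 0.5)              # volatility for the current day
        var = omega + alpha * e * e + beta * var
    return sigmas


# Hypothetical parameters: a large return on day 2 raises volatility on day 3.
sigmas = garch_conditional_std([0.0, 0.0, 0.10, 0.0, 0.0],
                               omega=1e-6, alpha=0.10, beta=0.85)
```

Dividing the rolling mean return by these values instead of the rolling standard deviation gives the conditional version of the Sharpe ratio.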
Comparing Conditional vs. Unconditional Standard Deviation:
A comparison of the conditional standard deviation with the unconditional rolling
standard deviation in the plot to the left revealed that the conditional standard deviation
contained much more variance and had higher peaks. The conditional standard deviation had a
variance of 0.00018 while the unconditional standard deviation had a variance of 0.00013. This
explained why the Sharpe ratios calculated using the conditional standard deviation were smaller
than the Sharpe ratios calculated using the unconditional standard deviation, since the
denominator of the ratio was standard deviation.
Comparing the above box plots of the unconditional standard deviation and conditional
standard deviation further supported this observation by showing that the conditional standard
deviation had many more outliers on the upward end than for the unconditional standard
deviation.
Evaluation of GARCH(1,1) Model:
To evaluate whether using a GARCH model was appropriate in the first place, we first
confirmed that there was significant autocorrelation in the trading strategy’s average squared
returns, which suggested that the returns might display time-varying conditional
heteroskedasticity. Additionally, the Lagrange-Multiplier test produced a p-value of 0.0002,
indicating that the residuals of the GARCH model did show an ARCH effect. However, despite
the previous evidence supporting the use of the GARCH model, the normal qq-plot of the
GARCH residuals showed that the residuals were far from normally distributed, which meant
that the GARCH model as fitted was not suitable for modeling the standard deviation of the
trading strategy’s returns.
Furthermore, there was still autocorrelation present in both the residuals and squared residuals,
based on the p-values of 0.0000 from the Ljung-Box test, which meant that the GARCH model
did not successfully model the serial correlation structure in the conditional standard deviation.
Finally, the “C” coefficient in the GARCH model was not statistically significant, with a p-value
of 0.5662 that showed that the coefficient could actually be zero instead.
VIII. Analysis of Strategy Performance: Kelly Betting
We also analyzed our strategy performance by studying the wealth process of an investor
who used Kelly betting when making daily investments in our trading strategy (i.e. he would
lever our strategy’s performance, the average return of the five pairs traded, every day based
on the Kelly Criterion). We simulated both the full and fractional versions of the Kelly Criterion
under varying restrictions. In terms of the full version of the simulation, the long-only,
unleveraged strategy performed better than the long-short, leveraged strategy. On the other hand,
the fractional version of the simulation considerably outperformed the full version altogether.
Regardless of the Kelly strategies’ relative performances to each other, all the strategies
did poorly and had negative CAGR values. However, each CAGR was still higher than our
trading strategy’s CAGR of -3.70%. This was consistent with the fact that Kelly betting focuses
on maximizing long-term terminal wealth, rather than short-term wealth, so the Kelly strategies
should have higher CAGR values than the underlying trading strategy. Overall, the results from
our Kelly simulation further confirmed the weak performance of our trading strategy since even
maximizing the long-term wealth could not produce positive returns.
Full Kelly Criterion (Long-Only, Unleveraged)
In this scenario, the investor began with an
initial wealth of 100 and only made unleveraged bets
if there was a positive expected return (calculated
from time 0 to t). To the right, the plot of the wealth
time series showed that no bets were made for a long
period of time since the expected return was
consistently negative up until approximately day
1200 (year 2008). However, after day 1200, the level of wealth spiked briefly before plummeting
to a value of 88. The simulation resulted in a negative CAGR of -0.65%. Calculating the
summary statistics for the wealth return series showed that the returns
had both a negative mean and Sharpe Ratio, and the downside was 4
times greater than the upside. Since the expected returns were so
consistently negative over time we also tried applying the long-short, leveraged version of the
Kelly betting strategy to take unleveraged short positions when expected returns were negative,
but this strategy did not fare any better as we will discuss next.
Full Kelly Criterion (Long-Short, Leveraged)
We updated the investor’s strategy to include unleveraged short positions as well as leveraged
long positions up to twice the amount of the investor’s total wealth. There was visibly more
variance in the wealth time series, and a higher number of bets were made due to the inclusion
of short positions. However, after ending at a value of 84, the strategy resulted in a negative
CAGR of -1.53%, indicating that it was even less effective than the previous long-only strategy.
The summary statistics for the wealth return series showed that the standard deviation roughly
doubled, while both the mean return and CAGR became more than twice as negative. The maximum
return also nearly doubled, but the minimum return actually remained close to
the same as for the long-only strategy. This suggested that since the max loss of wealth on any
given day stayed roughly the same, we were unfortunately taking more bets that went against us
when compared to the long-only Kelly Criterion.
Summary Statistics, Full Kelly Criterion (Long-Only, Unleveraged):
Max: 2.39%   μ: -0.01%   Min: -8.03%   σ: 0.27%   CAGR: -0.65%
Summary Statistics, Full Kelly Criterion (Long-Short, Leveraged):
Max: 4.28%   μ: -0.03%   Min: -8.04%   σ: 0.52%   CAGR: -1.53%
Fractional Kelly Criterion (Long-Short, Leveraged)
In the full Kelly criterion, a huge assumption was made that the historical returns were
indicative of future expected returns and variance. Fractional Kelly mitigated that risk by
scaling down the size of the bet. After testing a few values of f (0 < f < 1), we found that
decreasing the value of f improved the stability of the returns, with the tradeoff of expected
returns closer to zero. For f=0.20, the strategy had a CAGR of -0.29% and a
mean return of -0.03%. Decreasing to f=0.05 resulted in a higher CAGR
of -0.07%, smaller spread between the minimum and maximum from
-1.61% to +0.86%, lower standard deviation of 0.03%, but also a very
small expected return close to 0. On the other hand, increasing to f=0.50 resulted in a lower
CAGR of -0.74%, larger spread between the minimum and maximum from -4.02% to +2.14%,
higher standard deviation of 0.26%, and also a more negative expected return of -0.05%.
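The three Kelly variants can be sketched with one function. Several details here are assumptions for illustration, not a record of our simulation: the full Kelly bet is taken to be the mean divided by the variance of the returns observed so far, the bet is clipped to [0, 1] for the long-only unleveraged case and to [-1, 2] for the long-short leveraged case, the fraction f scales the clipped bet, and no bet is made during a 20-day warm-up.

```python
import statistics


def kelly_wealth(returns, lower, upper, fraction=1.0, start=100.0, warmup=20):
    """Wealth path when each day's bet uses only the returns seen before it."""
    wealth, path = start, []
    for t, r in enumerate(returns):
        bet = 0.0
        if t >= warmup:
            history = returns[:t]                  # expanding window, no look-ahead
            mu = statistics.fmean(history)
            var = statistics.pvariance(history)
            if var > 0:
                full = mu / var                    # assumed Kelly formula
                bet = fraction * max(lower, min(upper, full))
        wealth *= 1.0 + bet * r
        path.append(wealth)
    return path


# Toy strategy that loses on average: -1% and +0.5% on alternating days.
demo = [-0.01, 0.005] * 30
long_only = kelly_wealth(demo, lower=0.0, upper=1.0)
long_short = kelly_wealth(demo, lower=-1.0, upper=2.0)
fractional = kelly_wealth(demo, lower=-1.0, upper=2.0, fraction=0.2)
```

With a consistently negative expected return, the long-only investor never bets and wealth stays at 100, which mirrors the long flat stretch in the long-only simulation described above.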
IX. Improving Our Strategy’s Performance
Since our international ETF pairs trading strategy seemed to lose money most of the time,
we decided to look at improving our strategy’s performance, first by reversing the direction of
the positions we took, and then by applying a moving average filter to our strategy’s equity curve.
Reversing the Strategy:
By reversing the direction of our original pair trades, we were now betting that the
international ETF pairs would continue to diverge after selecting them; since we selected pairs
based on the magnitude of divergence (as measured by a residual), the bet is essentially that there
is momentum in the divergence of international ETF pairs. The reversed strategy had a
compound annualized growth rate of 3.8%, a maximum drawdown of -66.7%, and a full period
annualized Sharpe Ratio of 0.16.
Summary Statistics for the Fractional Kelly simulation with f=0.20 (Section VIII):
Max: 0.86%   μ: -0.03%   Min: -1.61%   σ: 0.10%   CAGR: -0.29%
We noticed that this strategy’s returns tend to trend over the medium to long term.
Applying a moving average filter to the equity curve seems to be a good way to capture the
trending nature of the strategy’s returns. Specifically, we would calculate the moving average of
the strategy’s equity curve/cumulative growth. If the strategy is currently underperforming its
average performance, we would short the strategy (in this case, since the “strategy” under
consideration is actually the reverse of our original pair trading strategy, shorting the “strategy”
means taking the original unreversed trade). Likewise, if the strategy’s performance is higher
than its average performance, we would long the strategy.
Moving Average Filter:
We decided to test the performance of using a 200-day moving average as a type of trade
filter described above. The first graph is a plot of the reversed strategy’s equity curve along with
its 200 day moving average. The second graph is a plot of the reversed strategy’s equity curve
after filtering the trades by the 200 day moving average as described in the previous paragraph.
The performance numbers of the reversed pair trading strategy with the 200 day moving average
filter were as follows: 23.8% compound annualized growth rate, -35.2% maximum drawdown,
and a full period annualized Sharpe Ratio of 0.95.
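The filter can be sketched as follows. The signal timing is an assumption made for the illustration: each day's position is decided from the equity curve and its moving average as of the previous close, and no position is taken until a full window of history exists.

```python
def ma_filtered_returns(returns, window=200):
    """Go long the strategy when its equity curve closed above its moving
    average on the previous day, short when it closed below."""
    equity, filtered, level = [], [], 1.0
    for r in returns:
        side = 0
        if len(equity) >= window:
            moving_avg = sum(equity[-window:]) / window
            if equity[-1] > moving_avg:
                side = 1       # strategy is outperforming its average: long
            elif equity[-1] < moving_avg:
                side = -1      # strategy is underperforming its average: short
        filtered.append(side * r)
        level *= 1.0 + r
        equity.append(level)
    return filtered


# Toy example with a 3-day window: five up days followed by five down days.
toy = ma_filtered_returns([0.01] * 5 + [-0.01] * 5, window=3)
```

In the toy run the filter is still long on the first down day and flips short afterwards, which shows the one-day lag that this signal timing builds in.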
[Figure: Reversed Strategy Equity Curve. y-axis: levels (0 to 2).]
The filtered strategy still seemed to be very volatile: it actually had roughly the same
daily return standard deviation as the unfiltered strategy (1.79%). However, the compound
annualized growth rate was relatively high at 23.8%, which suggests that, with the 200 day
moving average filter, we are on average correctly capturing the regime shifts between
momentum and mean reversion in our international ETF pairs.
[Figure: Reversed Strategy Equity Curve Plotted with MA 200. Series: reversed strategy equity curve, MA200.]
[Figure: Equity Curve After Applying MA 200 Filter.]
X. Final Considerations
There are several considerations to take into account in the trading strategy that we
developed, and these may either be interpreted as determinants of risk in our strategy or
jumping-off points for extensions from our analysis. First of all, our trading strategy does not
factor in transaction costs, which may be significant since we are making trades somewhat
frequently at a rate of 5 new positions every 20 days; since we are trading pairs, this means 10
trades to enter the positions, and 20 trades including both entry and exit. This averages to about
a single trade every trading day, which may be too frequent for an individual speculator, but may
not be out of the realm of possibility for a large institution like a hedge fund. In addition to
transaction costs, there is slippage due to illiquidity. We noticed that some of the international
ETFs only had average daily volume in the tens of thousands for about the first year in our
backtesting period: an institution could have had trouble trading large quantities of these ETFs
in the early years without moving the market too much. Including transaction costs and slippage
would reduce the overall profitability of our strategy.
Secondly, our trading strategy determines the ranking of the pairs based only on the last
residual of the 120-day formation period. This is to ensure that we are trading pairs that have the
greatest residual—the greatest “divergence”—right before we enter the trades, since we are
betting that the pairs will converge in the near future. We considered incorporating an
exponential moving average of the formation period residuals (weighting the more recent
residuals more) instead of just basing our trading decision on a single data point, but we decided
against it because we did not expect problems with highly fluctuating residuals,
given that our exploratory data analysis on the international ETF data came out clean. Extensions
from our work may potentially consider using either an exponential moving average or some
other method of incorporating the past residuals in the formation period.
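As an illustration of this possible extension (which we did not implement), an exponential moving average of the formation-period residuals could replace the last residual in the ranking. The span of 20 days and the function name are hypothetical choices.

```python
def ema(values, span=20):
    """Exponential moving average that weights recent values more heavily."""
    alpha = 2.0 / (span + 1)
    smoothed = values[0]
    for v in values[1:]:
        smoothed = alpha * v + (1.0 - alpha) * smoothed
    return smoothed


# A single late spike of 10 is damped to 5 with span=3 (alpha = 0.5).
score = ema([0.0, 0.0, 0.0, 10.0], span=3)
```

Ranking pairs by the absolute value of this smoothed residual would make the selection less sensitive to a one-day spike on the last day of the formation period.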
Thirdly, our initial strategy equity curve suggests that international ETFs tend to diverge
during good economic times and converge during bad economic times. Our trading strategy
performed poorly up until the financial crisis in 2008, and then it started performing well as the
world’s economies began to move together during the crisis, until it started dropping
again partway through 2011. This is a potential research question worth investigating.
Lastly, there is the peso problem, where historical data may not reflect all risks,
especially those in the future. An example of this phenomenon can be seen in the backtest period
from 2004 to 2008. The returns to our strategy were very consistent during that time period and
volatility was low (in this specific case, the international ETF pairs tended to diverge even more
after we picked them). If we were to put ourselves in 2008, given the historical data up until that
point, we would not have known when, if ever, the persistence in the momentum of
international ETF pairs would break down; indeed, it did break down immediately, during the
financial crisis, when international ETF pairs suddenly became mean reverting (i.e. the
international ETF pairs started moving together) and our bets on ETF pair convergence started
making money. Our models could not have predicted this risk from the historical data.
This is why adaptive models and trading strategies are valuable in the volatile and
unpredictable world we live in today.