Statistical Arbitrage problem set 2
Aloke Mukherjee
1. Linear models for share volume.
a) Dates with anomalous volume
[Figure: CSCO share volume 2002-2005. Vertical axis: volume, 0.00E+00 to 3.00E+08 shares; horizontal axis: date, 20020102 to 20051130.]
There are many peaks and troughs in this four-year sample of volume data, and the peaks
are much larger than the troughs. The largest volume day was October 8th, 2002, when Cisco
reached its lowest price ($8.60) of the sample period. More than 240 million shares
traded, more than seven standard deviations above the mean volume of 62 million for
the period. The lowest volume day was December 23rd, 2003, when only 7 million shares
traded, slightly more than two standard deviations below the mean.
What factors do high volume days have in common? Looking at high volume days in
2005 (see below) the best predictor for Cisco seems to be the release of an earnings
report. Comparing volume data with option expirations shows that expiries sometimes
coincide with high volume days but not in any consistent fashion. When the price makes
a big move it is often accompanied by large volume although there are many exceptions
and the relationship is not linear.
How about low volume days? Looking at the ten lowest volume days in the sample
reveals that most of them coincide with Christmas or American Thanksgiving. However,
looking at low volume days in 2005 (below) shows that there are many which cannot be
explained by holidays.
High volume days
(2005 days with volume greater than two standard deviations above the 2002-2005 mean)
Date Possible Explanation
20050208 2q 2005 earnings
20050209 2q 2005 earnings
20050511 3q 2005 earnings
20050810 4q/FY2005 earnings
20050811 4q/FY2005 earnings
20051110 1q 2006 earnings
20051118 acquisition of Scientific-Atlanta announced, option expiry
Low volume days
(2005 days with volume more than one standard deviation below the 2002-2005 mean)
Date Possible Explanation
20050520
20050527 Friday before Memorial Day weekend
20050602
20050610
20050614
20050620
20050701 Friday before July 4th weekend
20050718
20050727
20050803-08 Days preceding earnings report?
20050902 Friday before Labour Day weekend
20051007 Friday before Columbus Day weekend
20051014
20051017
20051114
20051125 Day after Thanksgiving (also a Friday)
20051222-23,27,29 Christmas
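The thresholds used for the two tables above (two standard deviations above the mean, one below) amount to a z-score filter. The following is a minimal sketch of that filter; the volume numbers and day indices are synthetic placeholders, not the CSCO data, and the variable names are illustrative only.

```matlab
% Sketch: flag days whose volume is far from the sample mean.
% Synthetic placeholder data (shares per day), NOT the CSCO series.
day    = (1:10)';
volume = 1e6 * [50; 200; 55; 60; 45; 50; 52; 10; 58; 50];

mu    = mean(volume);            % sample mean
sigma = std(volume);             % sample standard deviation
z     = (volume - mu) ./ sigma;  % distance from the mean in std units

hi_thresh =  2;   % cutoff used in the text for high volume days
lo_thresh = -1;   % cutoff used in the text for low volume days

high_days = day(z > hi_thresh);  % candidates for "earnings-like" events
low_days  = day(z < lo_thresh);  % candidates for "holiday-like" lulls
```

In the text the mean and standard deviation come from the full 2002-2005 sample while the flagged days are restricted to 2005; the sketch uses a single sample for both to stay short.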
b) Fit the series using AR(1), AR(2) and AR(3) models, and higher if needed.
To choose an appropriate AR model we look at the magnitude of each coefficient, its
statistical significance and the autocorrelation of the residuals. To accept a
coefficient, it should be large relative to the higher-lag coefficients and have a
t-statistic greater than two, and the associated residuals should have the properties
of white noise, including zero autocorrelation.
Coefficient  AR(1)     AR(2)     AR(3)     AR(4)     AR(5)     AR(10) (t-stat)   AR(1,5,10) (t-stat)
Intercept    2.60E+07  2.36E+07  2.10E+07  1.87E+07  1.68E+07  1.30E+07 (5.25)   1.58E+07 (7.01)
Φ1           0.5831    0.5273    0.5165    0.5038    0.491     0.4798 (15.20)    0.5142 (19.18)
Φ2           -         0.0947    0.0352    0.0308    0.0251    0.0231 (0.63)     -
Φ3           -         -         0.1122    0.0541    0.0502    0.038 (1.17)      -
Φ4           -         -         -         0.1116    0.0567    0.0466 (1.27)     -
Φ5           -         -         -         -         0.1082    0.0809 (2.27)     0.133 (4.72)
Φ6           -         -         -         -         -         0.0255 (0.63)     -
Φ7           -         -         -         -         -         -0.0137 (-0.36)   -
Φ8           -         -         -         -         -         0.0074 (0.21)     -
Φ9           -         -         -         -         -         0.0335 (0.91)     -
Φ10          -         -         -         -         -         0.0701 (2.21)     0.0993 (3.66)
Φ1 is much larger than the following terms. The reduction in the intercept and in the
lower-lag coefficients as terms are added indicates that each added term does have some
predictive power. The AR(1) residuals (below) display significant autocorrelation at
five- and ten-day lags.
[Figure: correlogram for AR(1) residuals, lags 1 to 10; vertical axis from -0.1 to 0.15.]
Fitting an AR(10) model also shows that the five- and ten-day lag coefficients are the
largest after the first, although they are still an order of magnitude smaller than the
first-lag coefficient. The Excel-computed t-stats for the AR(10) model show that only the
1, 5 and 10 day lags are significantly different from zero. Based on these observations
we fit an AR(10) model but force all but the 1, 5 and 10 day coefficients to zero. This
makes intuitive sense: with five trading days per week, significant five- and ten-day
lags suggest volume effects associated with the day of the week. The residual
autocorrelations for this model fall within the confidence interval for zero
autocorrelation (see below).
[Figure: correlogram of AR(10) residuals with only the 1, 5 and 10 day lags, lags 1 to 10; vertical axis from -0.08 to 0.08.]
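The restricted fit described above can be reproduced outside Excel with ordinary least squares. The sketch below is one possible way to do it, not the method used for the table: the series v is synthetic, the names (lags, maxlag, coef, acf, band) are illustrative, and the 2/sqrt(N) band is the usual large-sample approximation to a 95% interval for zero autocorrelation.

```matlab
% Sketch: least-squares fit of
%   V_k = c + phi1*V_{k-1} + phi5*V_{k-5} + phi10*V_{k-10}
% on a synthetic placeholder series (NOT the CSCO volume data).
lags   = [1 5 10];               % lags kept in the restricted model
maxlag = max(lags);
n      = 400;
v      = 6e7 * ones(n, 1);       % start at a constant level
for k = maxlag+1:n
    shock = 1e7 * sin(1.7 * k) + 5e6 * cos(0.3 * k * k);  % stand-in for noise
    v(k)  = 1.5e7 + 0.5 * v(k-1) + 0.15 * v(k-5) + 0.1 * v(k-10) + shock;
end

% One row per day k = maxlag+1..n, columns [1, V_{k-1}, V_{k-5}, V_{k-10}]
y = v(maxlag+1:n);
X = ones(n - maxlag, 1 + numel(lags));
for j = 1:numel(lags)
    X(:, j+1) = v(maxlag+1-lags(j) : n-lags(j));
end

coef  = X \ y;            % [intercept; phi1; phi5; phi10]
resid = y - X * coef;     % in-sample one-step prediction errors

% Sample autocorrelation of the residuals at lags 1..10
r   = resid - mean(resid);
acf = zeros(10, 1);
for h = 1:10
    acf(h) = sum(r(1+h:end) .* r(1:end-h)) / sum(r .^ 2);
end
band = 2 / sqrt(numel(r));   % approximate band for zero autocorrelation
```

Lags whose acf value lies outside plus or minus band would be candidates for inclusion in the model, which mirrors how the 5 and 10 day lags were identified from the AR(1) residuals.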
c) What is your predictor? What is the distribution of the residual?
Based on the above the predictor chosen was:
V_k = 1.58e7 + 0.5142 * V_{k-1} + 0.1330 * V_{k-5} + 0.0993 * V_{k-10}
The distribution of the residual is shown below.
[Figure: distribution of residuals (histogram); horizontal axis from -1 to 1.5 (x 10^8), vertical axis from 0 to 70.]
mean=-4.486772e-009 std=1.995927e+007
skewness=1.642015e+000 kurtosis=9.132899e+000
Nobs=998
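The skewness and kurtosis quoted for the residuals can be computed without any toolbox. The sketch below assumes population (1/N) central moments and the non-excess convention in which a normal distribution has kurtosis 3; whether the figures above were produced with exactly this convention is an assumption. The vector e is a synthetic stand-in for the residuals.

```matlab
% Sketch: moment-based shape statistics for a residual vector.
% skew_fn and kurt_fn are illustrative helper names.
skew_fn = @(x) mean((x - mean(x)).^3) / mean((x - mean(x)).^2)^1.5;
kurt_fn = @(x) mean((x - mean(x)).^4) / mean((x - mean(x)).^2)^2;

% Synthetic right-skewed residuals (NOT the fitted CSCO residuals)
e = [-2; -1; -1; 0; 0; 0; 1; 1; 2; 10];

sk = skew_fn(e);   % positive when the right tail is heavier
ku = kurt_fn(e);   % values above 3 indicate fatter tails than a normal
```

A positive skewness together with kurtosis well above 3, as reported for the volume residuals, is what a few very large positive surprises (such as earnings days) would produce.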
d) Try instead with log of volume.
Fitting AR models to the log series yields results similar to those described above. Here
the residuals are the difference between the log-volume and the predicted log-volume.
The residuals seem to be slightly better behaved but still display considerable kurtosis.
Predictor:
log(V_k) = 4.3404 + 0.5329 * log(V_{k-1}) + 0.1327 * log(V_{k-5}) + 0.0917 * log(V_{k-10})
[Figure: distribution of log-volume residuals (histogram); horizontal axis from -2 to 1.5, vertical axis from 0 to 60.]
mean=1.584126e-015 std=2.825476e-001
skewness=-6.802007e-002 kurtosis=5.359890e+000
Nobs=998
2. Two-variable MA(1) process
γxy(0)/σ^2 =
  Cov(X_n, X_n) = 1 + β11^2 + β12^2      Cov(X_n, Y_n) = β11 β21 + β12 β22
  Cov(Y_n, X_n) = Cov(X_n, Y_n)          Cov(Y_n, Y_n) = 1 + β21^2 + β22^2
γxy(1)/σ^2 =
  Cov(X_{n-1}, X_n) = β11 (1 + β12)      Cov(X_{n-1}, Y_n) = β21
  Cov(Y_{n-1}, X_n) = β12                Cov(Y_{n-1}, Y_n) = β21 (1 + β22)
γxy(-1) = γxy(1)^T since
  Cov(X_{n+1}, X_n) = Cov(X_n, X_{n-1})
  Cov(X_{n+1}, Y_n) = Cov(X_n, Y_{n-1})
  Cov(Y_{n+1}, X_n) = Cov(Y_n, X_{n-1})
3. Technical Trading Rules
The variable length moving average (VMA) strategy applied to Cisco (CSCO) from 2003/05/23
to 2005/12/30 (658 trading days, ~2.6 years) generates a profit of $4.31. The price was
$15.69 on 2003/05/23 and $17.12 on 2005/12/30, so a buy-and-hold strategy would have
earned only $1.43 over the same period.
[Figure: VMA strategy for CSCO, 2003/05/23 - 2005/12/30 (s=1, l=200, b=0.01). Series: short-term avg, long-term avg, p&l and holdings, in $ (0 to 30), versus day of strategy (100 to 600).]
a) Handling dividends and splits
Dividends and splits cause a change in share price without an underlying change in the
value of a position. This means that post-event prices cannot be compared directly with
pre-event prices. The solution is to adjust the pre-event prices so that calculations with
the adjusted price series reflect the true return of the position. The strategy can then be
run on the adjusted price series.
The reference for the following is Yahoo’s discussion of its “adjusted close” data
(http://help.yahoo.com/help/us/fin/quote/quote-12.html):
For splits the prices before the split should be multiplied by the reciprocal of the split
ratio – e.g. a 3:2 split would mean multiplying pre-split prices by 2/3.
For dividends the prices before the ex-date should be reduced in proportion to the ratio
of the dividend to the last close before the ex-date, i.e. multiplied by
(1 - dividend / last close). Simply subtracting the absolute value of the dividend from
previous prices can result in negative historical prices.
(another useful page is this introduction to CRSP price data:
http://www.library.hbs.edu/helpsheets/wrdscrspstock.html)
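The two adjustment rules above can be sketched as multiplicative factors applied to all prices before the event. The prices, event positions and dividend amount below are hypothetical and chosen only to make the arithmetic easy to follow.

```matlab
% Sketch: back-adjusting a close series for one split and one dividend.
% Hypothetical raw closes (NOT CSCO data).
raw = [30; 33; 36; 24; 25; 26];

% 3:2 split: index 4 is the first day trading at the post-split price
split_idx = 4;
split_new = 3;   % new shares ...
split_old = 2;   % ... issued for every old_shares held
adj = raw;
adj(1:split_idx-1) = adj(1:split_idx-1) * (split_old / split_new);
% pre-split prices are now 20, 22, 24

% Dividend of 0.50 with ex-date at index 6; the last close before it is adj(5)
div_idx = 6;
div_amt = 0.50;
factor  = 1 - div_amt / adj(div_idx - 1);     % 1 - 0.50/25 = 0.98
adj(1:div_idx-1) = adj(1:div_idx-1) * factor;
```

Because every pre-event price is scaled by the same factor, day-to-day returns before the event are unchanged, and the return across the split date (36 to 24 in raw prices) becomes zero in the adjusted series.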
b) What value of bid-ask spread will cause your strategy to switch from profit to
loss?
For this period any spread higher than roughly 180 basis points will result in a loss. This
can be seen from the chart above: twelve shares were either bought or sold, assuming that
we buy/sell anything we hold at the end of the strategy. The average price through the
period was approximately $20, so the absolute value of all the trades is roughly $240.
The profit of $4.31 is about 1.8% of that amount, i.e. roughly 180 basis points.
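The break-even arithmetic can be written out using the rounded figures from the text. The sketch follows the convention of the sprd output in vma.m, where the spread is charged as a fraction of the absolute value of every trade; the variable names are illustrative.

```matlab
% Sketch: break-even spread from the rounded figures quoted above.
profit     = 4.31;   % strategy profit in dollars
n_traded   = 12;     % shares bought or sold, including the final unwind
avg_price  = 20;     % approximate average price over the period

traded_val   = n_traded * avg_price;           % about $240
breakeven_bp = 10000 * profit / traded_val;    % about 180 basis points
```

The exact figure returned by vma.m will differ slightly because it sums the actual trade prices rather than using the $20 average.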
Code
vma.m
function [pnl, y, h, sig, avgs, avgl, sprd] = vma(p, s, l, b);
% function [pnl, y, h, sig, avgs, avgl, sprd] = vma(p, s, l, b);
%
% Compute profit-and-loss for the Variable Length Moving
% Average rule and strategy as described in stat arb HW 2:
% http://www.math.nyu.edu/~almgren/statarb/hw2.pdf
%
% We assume we start at zero and the strategy is not started
% until the (l + 1)st day from the beginning of the price series.
% The strategy basically amounts to being long one share when the
% short-term average is above the long-term average and being
% short one share when it is below.
%
% inputs:
% p - price series (should be adjusted for dividends and splits)
% s - short period length
% l - long period length
% b - band parameter (debouncing the buy and sell triggers)
%
% output:
% pnl - profit and loss
% y - running cash
% h - running holdings
% sig - signal value throughout the period
% avgs - short-term moving average
% avgl - long-term moving average
% sprd - spread in basis points which makes strategy breakeven
%
% 2006 aloke mukherjee
% rt will be positive when the short-term avg. goes above the
% long-term and negative when it goes below
avgs = mavg(p, s); % short-term avg
avgl = mavg(p, l); % long-term avg
rt = log(avgs(l+1:end)./avgl(l+1:end));
% create a row vector with signals
% 1 - buy
% 0 - hold
% -1 - sell
sig = zeros(size(rt));
sig(rt > b) = 1;
sig(rt < -b) = -1;
% encodes matrix from assignment describing change in holdings
% columns are current holdings (-1 0 1)
% rows are the current signal values (sell hold buy)
delta = [0 -1 -2;
0 0 0;
2 1 0];
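lasth = 0; % current holding
lasty = 0; % current cash position
cost = 0; % absolute value of all trades made
y = zeros(1, length(sig)); % running cash (preallocated row vector)
h = zeros(1, length(sig)); % running holdings (preallocated row vector)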
% iterate through signals updating holdings and pnl
for i = 1:length(sig)
d = delta(sig(i) + 2, lasth + 2);
y(i) = lasty - p(i+l) * d; lasty = y(i);
h(i) = lasth + d; lasth = h(i);
cost = cost + abs(d) * p(i+l);
end;
% add in cost of selling final holdings
cost = cost + abs(h(end)) * p(end);
% remove the initial part of the moving averages so that all
% the output arrays are the same length
avgs = avgs(l+1:end);
avgl = avgl(l+1:end);
pnl = y + h.*p(l+1:end)';
sprd = 10000 * pnl(end)/cost;
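vma.m calls a helper mavg(p, w) that is not listed here. The following is a hypothetical stand-in, assuming mavg is a trailing simple moving average that returns a vector of the same size as p; the original helper may differ. It would be saved as mavg.m.

```matlab
function a = mavg(p, w)
% MAVG  Trailing simple moving average (hypothetical stand-in for the
%       helper used by vma.m; the original is not shown in this document).
%   a(k) = mean(p(k-w+1:k)) for k >= w. The first w-1 entries, where a
%   full window is not available, are set to NaN. The output has the
%   same size as p, which is what the indexing in vma.m relies on.
a = filter(ones(1, w) / w, 1, p);   % running mean with zero padding at the start
a(1:w-1) = NaN;                     % discard the partially filled windows
```

With this definition vma.m only reads entries l+1 onward, where both the short and the long window are completely filled, so the NaN entries never reach the signal calculation. Note that vma.m also expects p to be a column vector, since it transposes p(l+1:end) when computing pnl.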