NCAA Basketball Tournament: Predicting Performance
Doug Fenton, Ben Nastou, Jon Potter
Mathematics 70; Spring 2001
Project Goals
To examine some of the factors that indicate how well a team will do in the NCAA Men’s Basketball Tournament
To compare factors (such as seed, conference, and individual team) and their effect on a team’s result
To create an effective model for how well a team does based on certain factors
Background
The NCAA Basketball Tournament is a 64-team, single-elimination tournament. This has been the tournament’s format since 1985; we use data from 1985-2000. There are four separate regions, each with sixteen teams seeded 1-16 (with 1 being the best and 16 being the worst). The result (dependent) variable is the number of games a team wins in the tournament.
General Analysis
The first two (independent) variables looked at were the team’s seed and its winning percentage. The regression was as follows (t-statistics in parentheses):
Result = 0.599 - 0.151*seed + 2.32*percent
         (1.87)  (-18.43)     (6.00)
R^2 = 0.387
As these results show, both seed and winning percentage are strongly related to a team’s result: result rises with winning percentage and falls as the seed number increases.
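As a minimal sketch of how the fitted equation above might be used, the snippet below plugs a seed and a winning percentage into it. The function name `predict_wins` is hypothetical, and the code assumes `percent` is a fraction between 0 and 1, which the slide does not state explicitly.

```python
def predict_wins(seed: int, percent: float) -> float:
    """Predicted tournament wins from the slide's fitted equation.

    Assumes `percent` is the winning percentage as a fraction (0 to 1).
    """
    return 0.599 - 0.151 * seed + 2.32 * percent


# Illustrative inputs only: a #1 seed that won 90% of its games
# and a #8 seed that won 70% of its games.
top = predict_wins(1, 0.90)
mid = predict_wins(8, 0.70)
print(round(top, 3), round(mid, 3))
```

The two example teams are made up for illustration; they are not taken from the data set.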
A Quick Look at the Seeds...
Seed | Mean  | Std. Err.
-----+-------+----------
   1 | 3.359 | 1.577
   2 | 2.422 | 1.489
   3 | 1.656 | 1.461
   4 | 1.609 | 1.317
   5 | 1.141 | 1.006
   6 | 1.391 | 1.280
   7 | 0.797 | 0.820
   8 | 0.719 | 1.147
   9 | 0.594 | 0.610
  10 | 0.672 | 0.960
  11 | 0.438 | 0.833
  12 | 0.438 | 0.753
  13 | 0.234 | 0.527
  14 | 0.234 | 0.496
  15 | 0.047 | 0.213
  16 | 0.000 | 0.000
#1 Seed vs. #2 Seed
[Figure: Histogram for #1 Seed; Histogram for #2 Seed]
Seed | Mean  | Std. Err.
-----+-------+----------
   1 | 3.359 | 1.577
   2 | 2.422 | 1.489
Assuming normality in the distribution of results for #1 Seeds and #2 Seeds:
$$T_{n+m-2} = \frac{\bar{x} - \bar{y}}{S_p \sqrt{\frac{1}{n} + \frac{1}{m}}} = 3.46$$
$$T_{n+m-2} = 3.46 > 1.64 = t_{0.05,\,126}$$
We can therefore conclude, at the 5% significance level, that #1 Seeds outperform #2 Seeds.
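The pooled two-sample t statistic above can be checked from the summary figures alone. The sketch below assumes n = m = 64 observations per seed (four regions over sixteen tournaments, consistent with the 126 degrees of freedom) and treats the "Std. Err." column as the sample standard deviation; the function name `pooled_t` is hypothetical.

```python
import math


def pooled_t(mean_x, sd_x, n, mean_y, sd_y, m):
    """Two-sample t statistic using a pooled variance estimate."""
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((n - 1) * sd_x**2 + (m - 1) * sd_y**2) / (n + m - 2)
    return (mean_x - mean_y) / (math.sqrt(sp2) * math.sqrt(1 / n + 1 / m))


# #1 seeds versus #2 seeds, using the means and spreads from the table.
t_stat = pooled_t(3.359, 1.577, 64, 2.422, 1.489, 64)
print(round(t_stat, 2))  # 3.46, matching the slide
```

The same function can be reused for any pair of groups for which a count, mean, and standard deviation are available.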
Differences in Variance
Do #1 Seeds have a different variance from #2 Seeds?
$$H_0: S_1^2 = S_2^2 \quad \text{vs.} \quad H_1: S_1^2 \neq S_2^2$$
$$F = \frac{S_1^2}{S_2^2} = \frac{2.49}{2.22} = 1.12$$
Critical value: F(63, 63) with 95% confidence: .600
H0: p = 1/16
$$t = \frac{9 - 16 \cdot \frac{1}{16}}{\sqrt{16 \cdot \frac{1}{16}\left(1 - \frac{1}{16}\right)}} = 8.26 > 1.74 = t_{0.05,\,15}$$
Therefore, we reject the null hypothesis, which means that the top seed wins the championship more often than it would if the tournament were randomly seeded.
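The statistic above is a standardized binomial count. The sketch below recomputes it under our reading of the numbers: 16 tournaments, 9 championships won by a #1 seed, and a null probability of 1/16 (four #1 seeds among 64 teams). These interpretations and the variable names are assumptions, since the slide shows only the formula. An exact binomial tail probability is added as a check that does not rely on a normal approximation.

```python
import math

n_tournaments = 16     # assumed: the 1985-2000 tournaments
p_random = 1 / 16      # assumed: 4 of the 64 teams are #1 seeds
titles_by_top = 9      # the count appearing in the slide's formula

# Standardized count: (observed - expected) / binomial standard deviation.
expected = n_tournaments * p_random
sd = math.sqrt(n_tournaments * p_random * (1 - p_random))
stat = (titles_by_top - expected) / sd

# Exact one-sided tail P(X >= 9) under Binomial(16, 1/16).
tail = sum(
    math.comb(n_tournaments, k)
    * p_random**k
    * (1 - p_random) ** (n_tournaments - k)
    for k in range(titles_by_top, n_tournaments + 1)
)
print(round(stat, 2))  # 8.26, matching the slide
print(tail)            # a very small probability
```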
Do the High Seeds Typically Outperform the Lower Seeds?
                 | Hi Seeds (1-8) | Lo Seeds (9-16) | Total
Lo Result (0-2)  | 392            | 504             | 896
Hi Result (3-6)  | 120            | 8               | 128
Total            | 512            | 512             | 1024
% Hi Result      | .234           | .0156           |
$$H_0: p_H = p_L \quad \text{vs.} \quad H_1: p_H > p_L$$
$$\hat{p} = \frac{120 + 8}{512 + 512} = .125$$
$$Z = 10.58 > 1.64 = z_{.05} \quad \text{(see Thm. 9.4.1)}$$
Hence, as expected, higher seeds outperform lower seeds.
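The Z value can be reproduced from the counts in the table. The sketch below assumes that Thm. 9.4.1 refers to the usual pooled two-proportion z statistic (the theorem itself is not reproduced in these slides); the function name `two_proportion_z` is hypothetical.

```python
import math


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for H0: p1 = p2 against H1: p1 > p2."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled estimate under H0
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# High results (3-6 wins): 120 of 512 high seeds, 8 of 512 low seeds.
z = two_proportion_z(120, 512, 8, 512)
print(round(z, 2))  # 10.58, matching the slide
```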
The Conference Variables
The teams in our study came from 31 different conferences. These conferences were divided into 4 tiers based on past tournament performance and the number of schools that get into the tournament each year (Tier 1 being the strongest conferences; Tier 4 being the weakest). We then tested how a team’s conference tier was related to its performance.
Comparing Teams’ Conferences
We tested the relationship between a team’s conference tier and its performance in the tournament. We also tested whether the effect of a team’s winning percentage depends on its conference tier. To do so, we created dummy variables for the tiers and interaction terms between tier and winning percentage.
Results (R^2 = .40)
      result |    Coef.   Std. Err.       t    P>|t|
-------------+--------------------------------------
       win % |    1.219      .574       2.12   0.034
      Tier 1 |   -4.40       .543      -8.10   0.000
      Tier 2 |   -2.148      .824      -2.61   0.009
      Tier 3 |   -4.189     1.00       -4.18   0.000
        %*T1 |    7.90       .756      10.44   0.000
        %*T2 |    3.93      1.136       3.46   0.000
        %*T3 |    6.20      1.34        4.61   0.000
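Because each tier enters through both a dummy and an interaction with winning percentage, a tier's net effect depends on how often the team wins. The sketch below combines the two coefficients. It assumes Tier 4 is the omitted baseline and that winning percentage is a fraction between 0 and 1; neither is stated on the slide, and the function names are hypothetical.

```python
# (dummy coefficient, interaction coefficient) taken from the table above.
TIER_COEFS = {
    1: (-4.40, 7.90),
    2: (-2.148, 3.93),
    3: (-4.189, 6.20),
}


def tier_effect(tier: int, percent: float) -> float:
    """Predicted wins relative to a Tier 4 team with the same win fraction."""
    dummy, interaction = TIER_COEFS[tier]
    return dummy + interaction * percent


def break_even(tier: int) -> float:
    """Win fraction at which the tier's net effect is exactly zero."""
    dummy, interaction = TIER_COEFS[tier]
    return -dummy / interaction


for tier in TIER_COEFS:
    print(tier, round(break_even(tier), 3), round(tier_effect(tier, 0.80), 3))
```

Under these assumptions, the negative dummies do not mean stronger conferences do worse: above the break-even win fraction, the positive interaction term outweighs the dummy.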
Is Tournament Fairly Seeded Based on Conference Tier?
To answer this, we looked at only the top 4 seeds, because their result distributions appeared closest to normal. For each of these seeds, we created four groups, one for each tier, to see whether performance was consistent across conference tiers given a team’s seed. ANOVA was used to test:
$$H_0: \mu_{T1} = \mu_{T2} = \mu_{T3} = \mu_{T4} \quad \text{(for each seed 1-4)}$$
Results
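Seed Group |   F   | F-critical
-----------+-------+-----------
Seed 1     | 1.102 | 3.148
Seed 2     | 0.365 | 2.758
Seed 3     | 0.934 | 3.148
Seed 4     | 0.039 | 2.758
For every seed, F is below the critical value, so the null hypothesis of equal mean results across tiers cannot be rejected.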
Analyzing Certain Teams
Dummy variables were created for teams that had been in at least 12 (75%) of the tournaments. There are not enough data points, and the histograms are too skewed, to assume normality for the team data.
A Quick Look at the Teams’ Performances
Team       | Obs | Mean | SD
-----------+-----+------+-----
Duke       |  15 | 3.33 | 2.02
Kentucky   |  13 | 3.08 | 1.75
UNC        |  16 | 2.81 | 1.52
Kansas     |  15 | 2.40 | 1.64
Michigan   |  12 | 2.17 | 2.08
Arkansas   |  13 | 2.08 | 1.89
Syracuse   |  14 | 1.86 | 1.56
Georgetown |  12 | 1.83 | 1.40
Louisville |  12 | 1.75 | 1.66
UCLA       |  13 | 1.62 | 1.71
Arizona    |  16 | 1.56 | 1.86
Indiana    |  15 | 1.40 | 1.80
Purdue     |  14 | 1.36 | 1.01
Oklahoma   |  13 | 1.31 | 1.49
Temple     |  15 | 1.27 | 1.16
Illinois   |  12 | 1.00 | 1.13
Is Duke the Best? Duke vs. Kentucky
[Figure: panels for Duke and Kentucky]
Team       | Obs | Mean | SD
-----------+-----+------+-----
Duke       |  15 | 3.33 | 2.02
Kentucky   |  13 | 3.08 | 1.75
Assuming normality in the distribution of results for both Kentucky and Duke (which may not be a valid assumption):
$$T_{n+m-2} = \frac{\bar{x} - \bar{y}}{S_p \sqrt{\frac{1}{n} + \frac{1}{m}}} = 0.356$$
$$T_{n+m-2} = 0.356 < 0.856 = t_{0.20,\,26}$$
Therefore, we cannot reject the null hypothesis that Duke and Kentucky perform equally well, even at the 20% significance level.
Time Trends?
Team       | E(1985) | Time Trend | Std. Err. | P>|t|
-----------+---------+------------+-----------+------
Arizona    |  0.82   |   0.099    |   0.101   | 0.970
Arkansas   |  2.08   |   0.000    |   0.127   | 0.998
Duke       |  3.83   |  -0.068    |   0.113   | 0.559
Georgetown |  2.70   |  -0.148    |   0.100   | 0.169
Illinois   |  1.35   |  -0.050    |   0.069   | 0.483
Indiana    |  2.51   |  -0.139    |   0.105   | 0.208
Kansas     |  3.38   |  -0.127    |   0.087   | 0.170
Kentucky   |  2.03   |   0.130    |   0.096   | 0.201
Louisville |  3.65   |  -0.230    |   0.094   | 0.035
Michigan   |  2.33   |  -0.027    |   0.156   | 0.869
UNC        |  2.74   |   0.010    |   0.085   | 0.905
Oklahoma   |  2.44   |  -0.152    |   0.072   | 0.059
Purdue     |  0.78   |   0.074    |   0.054   | 0.197
Syracuse   |  1.94   |  -0.012    |   0.091   | 0.900
Temple     |  1.36   |  -0.001    |   0.067   | 0.860
UCLA       |  1.18   |   0.049    |   0.127   | 0.705
According to this time trend regression, Kentucky would have overtaken Duke in 1994.
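The 1994 figure follows from setting the two fitted lines equal. The sketch below assumes that E(1985) is each team's fitted result in 1985 and that the time trend is the change in expected result per year; the function name `crossover_year` is hypothetical.

```python
def crossover_year(e1985_a, trend_a, e1985_b, trend_b, base_year=1985):
    """Year in which two linear time trends predict the same result.

    Solves e1985_a + trend_a * t = e1985_b + trend_b * t for t,
    the number of years after `base_year`.
    """
    years_after = (e1985_a - e1985_b) / (trend_b - trend_a)
    return base_year + years_after


# Duke: E(1985) = 3.83, trend = -0.068; Kentucky: E(1985) = 2.03, trend = 0.130.
year = crossover_year(3.83, -0.068, 2.03, 0.130)
print(round(year, 1))  # about 1994.1
```

Neither trend coefficient is individually significant (P = 0.559 for Duke and 0.201 for Kentucky), so the crossover year is only a rough indication.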
Maybe So...
1995-2000
                        | Kentucky | Duke
------------------------+----------+------
Tournament Appearances  | 6        | 5
Average Tournament Wins | 4        | 2.2
Standard Deviation      | 2        | 1.92
Min                     | 1        | 0
Max                     | 6        | 5
Are Certain Teams Mis-Seeded?
If a team’s dummy variable is significant in a regression that also includes seed, it suggests that the team is often “mis-seeded” (i.e., it is consistently seeded higher or lower than it should be).
Under-Rated?
Team       | Dummy Coef. | Std. Err. |   t   | P>|t|
-----------+-------------+-----------+-------+------
Duke       |    1.342    |   0.278   |  4.82 | 0.000
Kentucky   |    1.200    |   0.299   |  4.02 | 0.000
UNC        |    0.874    |   0.271   |  3.22 | 0.001
Arkansas   |    0.596    |   0.298   |  2.00 | 0.046
Kansas     |    0.478    |   0.281   |  1.70 | 0.089
Michigan   |    0.430    |   0.312   |  1.38 | 0.168
Louisville |    0.257    |   0.310   |  0.83 | 0.408
Georgetown |    0.224    |   0.311   |  0.72 | 0.473
Syracuse   |    0.110    |   0.289   |  0.38 | 0.704
Temple     |    0.007    |   0.233   |  0.03 | 0.979
UCLA       |   -0.039    |   0.300   | -0.13 | 0.897
Oklahoma   |   -0.242    |   0.299   | -0.81 | 0.419
Indiana    |   -0.264    |   0.278   | -0.95 | 0.344
Arizona    |   -0.265    |   0.270   | -0.98 | 0.330
Purdue     |   -0.312    |   0.289   | -1.08 | 0.280
Illinois   |   -0.626    |   0.311   | -2.01 | 0.045
So, for example, Duke can be expected to win more than one more game than other teams of the same seed, and Illinois can be expected to win more than half a game less than other teams of the same seed. If Duke and Illinois are seeded the same, Duke can be expected to win almost two full games more than Illinois (1.342 + 0.626 = 1.968).
Analyzing Experience
An experience variable was created to reflect the total number of previous tournament games (won or lost) a team had played since 1985.
Result = 0.952 + 0.054*experience - 0.051*year
         (12.80)  (12.87)           (-5.51)
R^2 = 0.14
Hence, there is a correlation between experience and result, suggesting that teams that have been in the tournament often typically win more games. In other words, successful teams typically stay successful.
Regression with Experience (R^2=.39)
      result |    Coef.   Std. Err.       t    P>|t|
-------------+--------------------------------------
       win % |    2.575      .391       6.59   0.000
        seed |    -.131      .009     -13.72   0.000
       exper |     .016      .0042      3.84   0.000
        year |   -0.016      .0081     -2.07   0.038
Experience (cont…)
The fact that experience is significant when regressed with seed and winning percentage indicates that it is not fully accounted for in the seeding of teams, and that it is another variable worth looking at when making tournament predictions. The experience variable is significant in a variety of regressions, indicating its robustness as an explanatory variable.
FINAL REGRESSION
      Source |       SS         df        MS
-------------+-------------------------------------
       Model |  767.840021       7   109.691432
    Residual |  1071.90998    1016   1.05502951
-------------+-------------------------------------
       Total |     1839.75    1023    1.7983871

Number of obs =
F( 7, 1016)   =
Prob > F      =
R-squared     =
Adj R-squared =
Root MSE      =

----------------------------------------------------------------
      result |     Coef.    Std. Err.      t    P>|t|   [95% C
-------------+--------------------------------------------------
        seed |  -.109729    .0122139    -8.98   0.000   -.13369
     percent |  2.795619    .4106081     6.81   0.000    1.9898
  experience |  .0071693     .004408     1.63   0.104   -.00148
  onepercent |  .3927863    .1308057     3.00   0.003    .13610
        duke |  1.117162    .2837561     3.94   0.000    .56034
    kentucky |  1.074347    .2931143     3.67   0.000    .49916
        year | -.0079938    .0081567    -0.98   0.327   -.02399
       _cons | -.2546688    .3922204    -0.65   0.516   -1.0243
----------------------------------------------------------------
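The summary statistics of a regression can be recomputed from the sums of squares and degrees of freedom in its ANOVA table. The sketch below does this with the standard definitions, using the Model, Residual, and Total rows shown above. The results are a recomputation under those definitions, not the original program output, and the variable names are our own.

```python
import math

# Sums of squares and degrees of freedom from the ANOVA table.
ss_model, df_model = 767.840021, 7
ss_resid, df_resid = 1071.90998, 1016
ss_total, df_total = 1839.75, 1023

ms_model = ss_model / df_model   # mean square, model
ms_resid = ss_resid / df_resid   # mean square, residual
ms_total = ss_total / df_total   # mean square, total

n_obs = df_total + 1                       # observations
f_stat = ms_model / ms_resid               # overall F statistic
r_squared = ss_model / ss_total            # share of variation explained
adj_r_squared = 1 - ms_resid / ms_total    # adjusted for model size
root_mse = math.sqrt(ms_resid)             # residual standard error

print(n_obs)
print(round(f_stat, 2), round(r_squared, 4))
print(round(adj_r_squared, 4), round(root_mse, 4))
```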
Conclusions
Tournament predictions can be fairly accurate based solely on seed.
There are other predictors, such as winning percentage, conference, and experience, which can be used to refine predictions.
However, better teams don’t always win, so no prediction can be made with certainty.