Friday, 22 January 2016

More Complicated. More Accurate? Salary Projections II

The model becomes nigh unexplainable. The table is impressive. Does that mean it's better?



In my last post, I looked at a simple linear regression for the NHL free agent skaters from this past offseason and saw that a model could be created to predict how much money a given skater earned based on some measurables such as Height, Age, Goals Scored, and Penalty Minutes. The model turned out to be somewhat representative but not exactly indicative, meaning that it nailed a couple of players spot on and only got in the ballpark for others.

There were a few reasons that this model was "simple".

1) I only split position into defensemen and forwards. The reality is that there are multiple forward spots, and I suspect there is a difference in wage earnings between centers and wingers with identical statistics, but for simplicity's sake I lumped all forwards together. The cleanest way to account for all "positions" is to measure defensemen and forwards separately.

2) I only looked at 9 statistics as predictors for a 10th (Average salary). The data set that I culled from contained 121 variables (granted, some were a bit redundant, like Age and Date of Birth, but still). I tried to pick out what I thought was most significant, but without using the full set, the choice is subject to my own personal bias.

3) There were no interactions. This is the easiest model to look at. No variable has anything to do with another. But those who follow hockey even a little bit know that teams (or announcers anyway) put a higher value on big players who can score. All other things being equal, teams are more interested in a big bruising player who scored 8 goals than a small player who scored 8 goals. We'll see if that turns out to be true.
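To make the idea of an interaction concrete, here is a minimal sketch in Python (not the R used for the actual model). Every coefficient in it is hypothetical, picked only for illustration and not taken from any fitted model. It shows that in an additive model one more goal is worth the same for every player, while a goals-by-weight interaction term makes that same goal worth more for the heavier player.

```python
# Hypothetical coefficients, chosen only to illustrate the idea; they are
# NOT taken from the fitted model in this post.
BASE = 500_000          # baseline salary in dollars
PER_GOAL = 20_000       # dollars per goal, regardless of size
PER_POUND = 1_000       # dollars per pound of weight
GOAL_X_WEIGHT = 300     # extra dollars per goal, per pound (the interaction)


def salary_no_interaction(goals, weight):
    """Additive model: a goal is worth the same for every player."""
    return BASE + PER_GOAL * goals + PER_POUND * weight


def salary_with_interaction(goals, weight):
    """Adds a goals*weight term, so a goal is worth more for heavier players."""
    return salary_no_interaction(goals, weight) + GOAL_X_WEIGHT * goals * weight


def value_of_one_more_goal(model, goals, weight):
    """Change in predicted salary from scoring one additional goal."""
    return model(goals + 1, weight) - model(goals, weight)


big, small = 230, 180   # weights in pounds; both players start at 8 goals
for model in (salary_no_interaction, salary_with_interaction):
    print(model.__name__,
          value_of_one_more_goal(model, 8, big),
          value_of_one_more_goal(model, 8, small))
```

With these made-up numbers, the additive model values the extra goal at $20,000 for both players, while the interaction model values it at $89,000 for the 230 lb player and $74,000 for the 180 lb player.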

The results of the last piece were that I could account for most of the difference in salary between players, but not to any acceptable degree. This round turned out a little bit better.

***Disclaimer that few care about this sort of thing: Once again, the data was not normally distributed. This could be due to a small sample size, but I find that unlikely given the sample size of 591 (once outliers were removed). More likely, the data is simply not suited for linear regression, though it is not yet known if it's better suited for a different type of regression (I took a cursory look at logistic (about the same as this one) and quadratic (worse than linear)). This simply means that we technically can't use the model to make accurate predictions. We're going to try to anyway, but I just thought you should know.***

The last model was very nice: a maximum of 9 variables, each with a number attached, very clean, very easy to type out. Interactions muck things up. They make the model very unpleasant to look at and understand. This is the result I got at the end.

> summary(FinalModel)

Call:
lm(formula = AVERAGE ~ Age + HT + Wt + Pos + GP + G + A + PIM + 
    Corsi + Age:HT + Age:Wt + Age:Pos + Age:GP + Age:G + Age:A + 
    Age:PIM + Age:Corsi + HT:Wt + HT:Pos + HT:GP + HT:G + HT:A + 
    HT:PIM + HT:Corsi + Wt:Pos + Wt:GP + Wt:G + Wt:A + Wt:PIM + 
    Wt:Corsi + Pos:GP + Pos:G + Pos:A + Pos:PIM + Pos:Corsi + 
    GP:G + GP:A + GP:PIM + GP:Corsi + G:A + G:PIM + G:Corsi + 
    A:PIM + A:Corsi + PIM:Corsi + Age:HT:Wt + Age:HT:G + Age:HT:A + 
    Age:HT:Corsi + Age:Wt:Pos + Age:Wt:GP + Age:Wt:A + Age:Wt:PIM + 
    Age:Pos:GP + Age:Pos:Corsi + Age:GP:A + Age:GP:PIM + Age:G:A + 
    Age:G:Corsi + Age:A:PIM + Age:A:Corsi + HT:Wt:A + HT:Pos:G + 
    HT:G:PIM + HT:A:PIM + Wt:GP:PIM + Wt:GP:Corsi + Wt:G:Corsi + 
    Wt:A:PIM + Pos:GP:G + Pos:GP:PIM + Pos:GP:Corsi + Pos:G:A + 
    Pos:G:PIM + Pos:A:PIM + Pos:A:Corsi + GP:G:A + GP:A:PIM + 
    GP:A:Corsi + GP:PIM:Corsi + G:A:PIM + G:A:Corsi + G:PIM:Corsi + 
    HT:GP:G, data = GoDataNew[, -2])

Residuals:
    Min      1Q  Median      3Q     Max 
-141063  -19680    -909   15428  158995 

Coefficients:
                      Estimate      Std. Error     t value    Pr(>|t|)    
(Intercept)   -1.266e+07    5.855e+06    -2.163      0.031031 *  
Age                5.611e+05    2.230e+05     2.516      0.012170 *  
HT                 1.636e+05    7.895e+04     2.072      0.038792 *  
Wt                 6.010e+04    2.825e+04     2.127      0.033896 *  
Pos                 9.872e+05   5.258e+05     1.877      0.061041 .  
GP                 1.260e+04    1.622e+04    0.776       0.437905    
G                   -2.462e+05   1.765e+05    -1.395     0.163533    
A                   -2.195e+05   1.430e+05    -1.535     0.125386    
PIM              -3.244e+04   1.949e+04    -1.665     0.096601 .  
Corsi              8.987e+04   4.420e+04     2.033     0.042548 *  
Age:HT        -7.286e+03   3.005e+03     -2.425    0.015657 *  
Age:Wt        -2.655e+03    1.071e+03    -2.480    0.013475 *  
Age:Pos       -4.433e+04    1.800e+04    -2.462    0.014142 *  
Age:GP        -8.842e+02   5.432e+02    -1.628     0.104231    
Age:G            7.806e+03   5.692e+03     1.371     0.170864    
Age:A           -4.757e+03   4.267e+03    -1.115     0.265458    
Age:PIM        9.448e+02  5.468e+02      1.728    0.084611 .  
Age:Corsi     -3.625e+03  1.692e+03     -2.142    0.032653 *  
HT:Wt         -7.639e+02   3.770e+02     -2.026    0.043251 *  
HT:Pos          2.769e+03   3.740e+03      0.740    0.459394    
HT:GP           1.171e+02  1.251e+02       0.936   0.349746    
HT:G             3.265e+03  2.428e+03       1.345    0.179336    
HT:A             1.359e+03  2.033e+03       0.669    0.504082    
HT:PIM        4.540e+01  1.588e+02       0.286    0.775065    
HT:Corsi     -1.214e+03  6.079e+02      -1.997    0.046340 *  
Wt:Pos        -6.374e+03  2.313e+03       -2.755   0.006076 ** 
Wt:GP         -1.198e+02  7.222e+01      -1.658    0.097884 .  
Wt:G             6.753e+01  6.763e+01       0.998    0.318557    
Wt:A             2.987e+03  5.676e+02       5.263    2.10e-07 ***
Wt:PIM        1.358e+02  7.027e+01       1.933    0.053806 .  
Wt:Corsi       1.322e+01  1.676e+01       0.788   0.430852    
Pos:GP          2.221e+03  1.397e+03       1.590   0.112499    
Pos:G            1.113e+05  5.420e+04       2.053   0.040605 *  
Pos:A           -5.476e+03  2.437e+03      -2.247   0.025080 *  
Pos:PIM       2.764e+03  7.851e+02        3.520   0.000470 ***
Pos:Corsi     -5.699e+03  3.016e+03      -1.890   0.059372 .  
GP:G           -1.979e+03  9.277e+02       -2.133   0.033383 *  
GP:A             5.974e+02  1.269e+02       4.709    3.23e-06 ***
GP:PIM        1.674e+02  9.042e+01        1.851   0.064708 .  
GP:Corsi      4.876e+02  1.542e+02        3.162   0.001660 ** 
G:A              -2.856e+02  4.594e+02      -0.622   0.534426    
G:PIM          4.904e+03  1.182e+03       4.150    3.90e-05 ***
G:Corsi        -3.395e+03  8.831e+02      -3.844    0.000136 ***
A:PIM          -4.578e+03  8.828e+02      -5.186    3.12e-07 ***
A:Corsi        -1.185e+03  3.165e+02      -3.745    0.000201 ***
PIM:Corsi     4.258e+01  2.841e+01       1.499   0.134557    
Age:HT:Wt   3.417e+01  1.427e+01   2.394 0.017037 *  
Age:HT:G      -1.145e+02  7.816e+01  -1.464 0.143729    
Age:HT:A       1.417e+02  6.057e+01   2.339 0.019725 *  
Age:HT:Corsi   4.739e+01  2.287e+01   2.072 0.038794 *  
Age:Wt:Pos     2.373e+02  8.654e+01   2.742 0.006325 ** 
Age:Wt:GP      5.046e+00  2.654e+00   1.902 0.057791 .  
Age:Wt:A      -2.195e+01  7.536e+00  -2.913 0.003741 ** 
Age:Wt:PIM    -4.628e+00  2.439e+00  -1.898 0.058295 .  
Age:Pos:GP    -1.271e+02  4.930e+01  -2.577 0.010240 *  
Age:Pos:Corsi  2.407e+02  1.178e+02   2.042 0.041620 *  
Age:GP:A      -2.300e+01  4.600e+00  -4.999 7.95e-07 ***
Age:GP:PIM     2.610e+00  1.891e+00   1.380 0.168259    
Age:G:A        4.859e+01  1.310e+01   3.709 0.000231 ***
Age:G:Corsi   -3.130e+01  1.741e+01  -1.798 0.072843 .  
Age:A:PIM     -9.063e+00  5.019e+00  -1.806 0.071549 .  
Age:A:Corsi    2.992e+01  1.070e+01   2.797 0.005355 ** 
HT:Wt:A       -3.423e+01  7.198e+00  -4.756 2.58e-06 ***
HT:Pos:G      -1.388e+03  7.291e+02  -1.904 0.057524 .  
HT:G:PIM      -5.810e+01  1.571e+01  -3.698 0.000241 ***
HT:A:PIM       4.971e+01  1.408e+01   3.531 0.000452 ***
Wt:GP:PIM     -1.064e+00  3.899e-01  -2.730 0.006561 ** 
Wt:GP:Corsi   -1.910e+00  7.849e-01  -2.434 0.015293 *  
Wt:G:Corsi     2.078e+01  4.032e+00   5.154 3.66e-07 ***
Wt:A:PIM       4.719e+00  1.667e+00   2.831 0.004823 ** 
Pos:GP:G       2.507e+02  1.269e+02   1.977 0.048633 *  
Pos:GP:PIM    -5.621e+01  1.589e+01  -3.539 0.000439 ***
Pos:GP:Corsi  -1.190e+02  2.720e+01  -4.376 1.47e-05 ***
Pos:G:A       -5.000e+02  2.686e+02  -1.862 0.063236 .  
Pos:G:PIM     -6.201e+02  1.405e+02  -4.413 1.25e-05 ***
Pos:A:PIM      2.685e+02  6.108e+01   4.396 1.34e-05 ***
Pos:A:Corsi    3.917e+02  1.181e+02   3.317 0.000974 ***
GP:G:A        -4.584e+00  2.955e+00  -1.551 0.121539    
GP:A:PIM       1.451e+00  1.076e+00   1.349 0.178011    
GP:A:Corsi     4.585e+00  1.974e+00   2.323 0.020563 *  
GP:PIM:Corsi  -2.023e+00  6.383e-01  -3.170 0.001619 ** 
G:A:PIM       -7.154e+00  2.910e+00  -2.459 0.014280 *  
G:A:Corsi     -2.414e+01  6.211e+00  -3.886 0.000115 ***
G:PIM:Corsi    1.143e+01  2.659e+00   4.298 2.07e-05 ***
HT:GP:G        2.472e+01  1.245e+01   1.986 0.047613 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 44200 on 506 degrees of freedom
Multiple R-squared:  0.7897, Adjusted R-squared:  0.7547 
F-statistic: 22.61 on 84 and 506 DF,  p-value: < 2.2e-16

It's big, it's ugly, it's scary. And you can see the point where I gave up trying to make it into nice columns. Needless to say, I'm not going to type out the equation that this produces. The most important part of the summary is the Adjusted R-squared value of .7547. This says that 75.47% of the variation in average salary between players can be explained using the provided variables, after a penalty for the number of terms in the model. It also means that 24.53% of the variation has a source unaccounted for. Whether that's just stats that weren't included or immeasurables like "leadership" or "toughness" is impossible to say.
Quick note: Corsi is actually significant this time. Not super significant, maybe, but still significant. It's hard to get a read on how many dollars it actually contributes (or takes away, perhaps), but that's the problem with models with interactions.
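For anyone wondering how the Adjusted R-squared relates to the plain R-squared in the summary, here is a small Python sketch (Python rather than the R used for the model) that applies the standard adjustment formula to the numbers printed above. The function and variable names are my own. Because the printed R-squared is already rounded to four places, the result agrees with R's .7547 only to about three decimals.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: penalize R-squared for the number of model terms."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Values read off the summary output above.
r_squared = 0.7897      # Multiple R-squared
n_predictors = 84       # numerator degrees of freedom of the F-statistic
residual_df = 506       # residual degrees of freedom

# Observations = residual df + number of terms + 1 (for the intercept).
n_obs = residual_df + n_predictors + 1

adj = adjusted_r_squared(r_squared, n_obs, n_predictors)
print(n_obs, round(adj, 3))
```

The recovered number of observations, 591, matches the sample size mentioned in the disclaimer.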

Examples (random):

Jarred Tinordi (MTL): 22 years old, 78 inches tall (6'6"), 225 lbs, Defenseman, 13 Games Played, 0 Goals, 2 Assists, 19 Penalty Minutes, 0.4 Corsi.
Predicted salary = $983,967,   Actual salary = $850,500,  Difference = $133,467

Barret Jackman (NSH): 33 years old, 72 inches tall (6'), 203 lbs, Defenseman, 80 Games Played, 2 Goals, 13 Assists, 47 Penalty Minutes, 5.8 Corsi.
Predicted salary = $1,750,835,   Actual salary = $2,000,000,    Difference = $249,165

Jason Akeson (PHI): 23 yrs, 70 inches (5'10"), 190 lbs, Forward, 1 GP, 0 Goals, 1 Assist, 0 PIM, 34.2 Corsi (WOW! Small sample size alert).
Predicted salary = $469,916,   Actual salary = $575,000,   Difference = $105,084
(*Worth noting: The NHL minimum salary is $550,000 per year)

Olli Jokinen (NSH): 35 yrs, 74 inches (6'2"), 210 lbs, Forward, 82 GP, 18 Goals, 25 Assists, 62 PIM, -1 Corsi.
Predicted salary = $3,787,174,  Actual salary = $2,500,000,  Difference = $1,287,174
(*first prediction off by over 1 million!)

Colby Robak (FLA): 23 yrs, 75 inches (6'3"), 194 lbs, Defenseman, 16 GP, 0 Goals, 1 Assist, 17 PIM, 1.98 Corsi.
Predicted salary = $764,333,  Actual salary = $675,000,   Difference = $89,333

So what's the conclusion? It missed on a couple and got close on a few. I think the difference for Akeson should actually be $25,000, since my model effectively predicted a minimum salary, but I make the rules and I cut myself no slack (I'm tough on myself).
The Jokinen one is interesting. I think that my model overvalues age, and so Jokinen being 35 is a great thing as far as my model goes, not so great as far as real hockey goes.
As with all of these, external factors come into play. Things like offseason surgery, being a bad/good locker room guy, or legal issues don't get implemented into the statistical model but have a large impact on the real-life money of these guys.
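To put numbers on "missed on a couple and got close on a few", here is a short Python sketch (again Python, not the R the model was built in) that recomputes the differences from the five examples and shows what happens if predictions are floored at the league minimum, as in the Akeson remark. The predicted and actual figures and the $550,000 minimum are copied from the post. The helper names and the idea of applying the floor to every player are my own assumptions.

```python
LEAGUE_MINIMUM = 550_000  # minimum salary as stated in the post

# (player, predicted salary, actual salary), copied from the examples above
examples = [
    ("Tinordi", 983_967, 850_500),
    ("Jackman", 1_750_835, 2_000_000),
    ("Akeson", 469_916, 575_000),
    ("Jokinen", 3_787_174, 2_500_000),
    ("Robak", 764_333, 675_000),
]


def abs_error(predicted, actual, floor=None):
    """Absolute miss in dollars; optionally raise the prediction to a floor."""
    if floor is not None:
        predicted = max(predicted, floor)
    return abs(predicted - actual)


raw = {name: abs_error(p, a) for name, p, a in examples}
floored = {name: abs_error(p, a, LEAGUE_MINIMUM) for name, p, a in examples}

mean_raw = sum(raw.values()) / len(raw)
mean_floored = sum(floored.values()) / len(floored)

for name in raw:
    print(f"{name:8s} raw=${raw[name]:>9,}  floored=${floored[name]:>9,}")
print(f"mean raw miss: ${mean_raw:,.0f}   mean floored miss: ${mean_floored:,.0f}")
```

Only Akeson's miss changes under the floor (from $105,084 to $25,000), and the Jokinen miss dominates the average either way.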

The refinement continues. My next task will be to use many more of the statistics I have available to me and/or break the skaters into forwards and defensemen.

I also think that the team a player comes from or goes to plays a role. A player from a team that just won the Cup is more valuable because they "know how to win" and a team that has been struggling and not in a nice market (*cough cough*Edmonton*cough cough*) will likely have to overpay free agents to convince them to join. I just don't know how to model that cleanly yet.

More to come

Friday, 15 January 2016

How much is a goal worth? Salary Projections I




Salary Projections I: Simple linear regression without interaction



Something that I've been thinking a lot about recently is NHL statistics, specifically, measuring the quality of a player quantitatively. This might be fundamentally impossible. After all, unmeasurables such as "leadership" or the ever-elusive "toughness" can't truly be measured, and neither can true player skill. Toughness is often correlated with blocked shots, hits, and fights. There's also a more subconscious relationship between toughness and penalty minutes. I single out toughness (as opposed to leadership) because it's also correlated with negative things. Let's be honest, blocked shots are really only necessary when your team doesn't have the puck and is giving up shots that must be blocked (right, Kris Russell?). The best we can do with toughness is use a combination of things to approximate the true value. Same for overall quality.

I believe that one of the best indicators of relative overall skill should be salary. Yes, there are players who are underpaid or overpaid relative to their peers, but we can all agree that players with higher skill also usually get paid more money. To try to develop a formula we all can use to figure out how much a player should be (or should have been) paid, I took the list of this past offseason's free agents and ran a simple linear regression. I kept it quite simple to start with, churning out a model using only a few stats. I used a mixture of counting stats and biographical stats (such as age) to a) keep the output simple to understand, b) try to home in on a few common "buzzword" statistics, and c) keep it simple to program, since it's good to start small and work your way up.

The statistics I used are Age, Height, Weight, Position (Forward vs Defense), Games Played the past year (a durability or toughness measure), Goals, Assists, Penalty Minutes, and stat-darling Corsi.
A couple of notes: There are not that many stats that we started with, and we'll pare down from there. You'll notice that position only has two values, forward and defense. As usual, this was a simplicity decision. When I build it up more, Forwards will be broken up into Centers and Wingers, as I think there's a difference there, but for now, Forwards and Defensemen will do. Also, goalies are judged by a completely different set of statistics and so are not lumped in with the skaters.
These statistics were used to try to predict average salary. That is, the amount a player makes on average throughout the course of the contract, and (generally speaking) the amount of money that a team cannot use to acquire other players.



Model Without Interactions

Running the model without interactions is the best way to get easy-to-understand models. At the basic level, models without interaction show the impact specific statistics have on the response. For example: Players A and B are identical in every way (same height, weight, and age; even their stats are identical) EXCEPT for goals. If Player B has 1 more goal than Player A, how much more money is Player B expected to earn than Player A? A model without interactions shows the value of a single stat more clearly.

Running model selection, the best model that we got, given these variables, is a model composed of Age, Weight, Position, Games Played, Goals, Assists, and Penalty Minutes. This is interesting to me because of what didn't make the cut. Only two original stats didn't make the cut: Height and Corsi. Height is often used as a measure of size, but I think the Weight value probably incorporates a bit of height in it (taller players are often heavier), and strength shows itself as a measurable more clearly in weight. Corsi is an interesting cut, considering the propensity of the statistics community to espouse it as a measure of player quality. There are a few potential reasons for this: 1) Corsi simply hasn't made its way into contract negotiations in a significant way. Its impact is overshadowed by the other measurables. Or 2) The model doesn't do a great job overall of measuring correlations, just the best job given the selected original variables (entirely possible, as I'll show later).

**Something to note: The data was not very nice. What I mean by that is that the data must satisfy a number of qualifiers in order to be used to project forward and these qualifiers were not met. Since the purpose of doing all this is to have a projection, we're going to proceed. Sometimes the solution to this issue is more data points (we already have 600) and sometimes, the data just isn't "normal".**

Before proceeding further, I'd like to note that there were a few outliers in this data which I removed from the data set. They didn't seem to change the data too too much but I removed them for the sake of cleanliness. There were 12 of them and I'd take special note if there were something special to note about their data. Most of them were paid on the upper levels of salary ($4.5 million and up) but that's about it. After removing the outliers, the model selection process was run again and nothing changed. This was expected since the removed values were somewhat insignificant on the whole.

Skipping any more nitty-gritty stats stuff (the overall least interesting part of the whole situation), here are the results.

We get a formula that looks like this:

AvgSalary = 23766*Age  + 3836*Weight  - 353809*Position - 9427*Games Played + 78071*Goals + 87140*Assists + 3024*Penalty Minutes - 522951

**Age is measured in years, Weight is measured in pounds, and Position is denoted as 1 for forwards and 0 for defensemen. Games Played is only for the previous year.**
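As a quick way to play with the formula, here is a minimal Python sketch (the model itself was fit in R; this just evaluates the published coefficients). The function and argument names are my own. The stat lines are the four players used as examples further down in this post, and the sketch reproduces their projected salaries. It also shows the "one more goal, everything else equal" comparison described earlier.

```python
def projected_salary(age, weight, is_forward, games_played,
                     goals, assists, penalty_minutes):
    """Evaluate the no-interaction formula from the post.

    is_forward is 1 for forwards and 0 for defensemen.
    """
    return (23766 * age
            + 3836 * weight
            - 353809 * is_forward
            - 9427 * games_played
            + 78071 * goals
            + 87140 * assists
            + 3024 * penalty_minutes
            - 522951)


# (age, weight, is_forward, GP, G, A, PIM) from the post's example players
players = {
    "Smith-Pelly": (21, 222, 1, 19, 2, 8, 2),
    "McKenzie":    (23, 205, 1, 36, 4, 1, 48),
    "Steckel":     (31, 215, 1, 34, 1, 6, 4),
    "de Haan":     (22, 187, 0, 51, 3, 13, 30),
}

projections = {name: projected_salary(*stats) for name, stats in players.items()}
for name, dollars in projections.items():
    print(f"{name:12s} ${dollars:,}")

# Two otherwise identical players who differ by a single goal.
player_a = projected_salary(25, 200, 1, 60, 10, 15, 20)
player_b = projected_salary(25, 200, 1, 60, 11, 15, 20)
print("one extra goal is worth", player_b - player_a)
```

Because there are no interaction terms, the last line prints the Goals coefficient (78071) no matter which baseline stat line is used for the two players.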

Some interesting notes:
- Age is viewed as a positive but games played is a negative. Age being positive makes sense because of the experience factor and the idea that teams know what they're getting with an older player. I don't know why games played is a negative; maybe a more complex model will show it.
- Assists are worth more than goals? Defensemen tend to make more money than forwards and they tend to get more assists than goals. This drives up the perceived value of assists versus goals. If we were to run this for just forwards (future idea?) then I would guess that goals would jump in value.
- Speaking of forwards versus defense, the "player" "loses" almost $354,000 per year just by being a forward. This makes sense in a couple of ways. First, there are fewer defensemen than forwards, so they are generally a little more valuable simply because there are fewer of them. Second, if a forward generated the exact same statistics as a defenseman (say, 4 goals, 19 assists or whatever), that forward would be unlikely to be as valuable as a defenseman who can produce at the exact same level and, presumably, play defense.
- The y-intercept is impossible, but that's not a huge deal since a player with zeros across the board is impossible too.

So now what? One of the things that we can do is determine how well the model actually performs. I'm not using a training/testing split here because the goal is to describe the very data that created the model. Keep in mind that this makes the fit an in-sample measure, so it says less about how the model would do on players it hasn't seen.
So how did we do? Turns out, not that great. This model earned an adjusted R-squared value of .6306. What this means in simple terms is that 63% of the variation in average salary from player to player is explained by the variables we selected (37% by some mystery variable(s) that was/were unaccounted for). It's not bad for a simple model, but I would anticipate that some more complex stuff would model it a bit better.

Examples (kind of randomly selected):
Devante Smith-Pelly (ANA): 21 years old, 222 lbs, Forward, 19 GP, 2 G, 8 A, 2 PIM
      Projected average salary = $1,154,115.  Actual = $800,000.   Difference = $354,115

Curtis McKenzie (DAL): 23 years old, 205 lbs, Forward, 36 GP, 4 G, 1 A, 48 PIM.
      Projected average salary = $661,442. Actual = $675,000.   Difference = $13,558

David Steckel (ANA): 31 years old, 215 lbs, Forward, 34 GP, 1 G, 6 A, 4 PIM.
      Projected average salary = $977,215, Actual = $550,000.   Difference = $427,215

Calvin de Haan (NJD): 22 years old, 187 lbs, Defenseman, 51 GP, 3 G, 13 A, 30 PIM.
      Projected average salary = $1,694,209.  Actual = $1,966,667.   Difference = $272,458

As is somewhat clear, the model works better for some players than others. McKenzie was predicted fairly accurately but Steckel's projected salary was 178% of what he actually earned.

So the model isn't fantastic, but it can do well enough to give a general ballpark.

The next step is to generate a model with interactions, the next level of complexity in this process, and see if the relationships between the variables can be used to project salary more accurately.