Salary Projections I: Simple linear regression without interaction
Something that I've been thinking a lot about recently is NHL statistics, specifically, measuring the quality of a player quantitatively. This might be fundamentally impossible. After all, unmeasurables such as "leadership" or the ever-elusive "toughness" can't truly be measured, and neither can true player skill. Toughness is often correlated with blocked shots, hits, and fights. There's also a more subconscious relationship between toughness and penalty minutes. I single out toughness (as opposed to leadership) because it's also correlated with negative things. Let's be honest, blocked shots are really only necessary when your team doesn't have the puck and is giving up shots that must be blocked (right, Kris Russell?). The best we can do with toughness is use a combination of things to approximate the true value. Same for overall quality.
I believe that one of the best indicators of relative overall skill should be salary. Yes, there are players who are underpaid or overpaid relative to their peers, but we can all agree that players with higher skill also usually get paid more money. To try to develop a formula we all can use to figure out how much a player should be (or should have been) paid, I took the list of this past offseason's free agents and ran a simple linear regression. I kept it quite simple to start with, churning out a model using only a few stats. I used a mixture of counting stats and biographical stats (such as age) to a) keep the output simple to understand, b) try to home in on a few common "buzzword" statistics, and c) keep it simple to program, since it's good to start small and work your way up.
The statistics I used are Age, Height, Weight, Position (Forward vs Defense), Games Played the past year (a durability or toughness measure), Goals, Assists, Penalty Minutes, and stat-darling Corsi.
A couple of notes: We didn't start with that many stats, and we'll pare down from there. You'll notice that position only has two values, forward and defense. As usual, this was a simplicity decision. When I build it up more, Forwards will be broken up into Centers and Wingers, as I think there's a difference there, but for now, Forwards and Defensemen will do. Also, Goalies are judged by a completely different set of statistics and so are not lumped in with the skaters.
These statistics were used to try to predict average salary. That is, the amount a player makes on average throughout the course of the contract, and (generally speaking) the amount of money that a team cannot use to acquire other players.
Model Without Interactions
Running the model without interactions is the best way to get easy-to-understand models. At the basic level, models without interaction show the impact specific statistics have on the response. For example: Player A and Player B are identical in every way, same height, weight, age, even their stats are identical EXCEPT for goals. If Player B has 1 more goal than Player A, how much more money is Player B expected to earn than Player A? A model without interactions shows the value of a single stat more clearly.
Running model selection, the best model that we got, given these variables, is a model made up of Age, Weight, Position, Games Played, Goals, Assists, and Penalty Minutes. This is interesting to me because of what didn't make the cut. Only two original stats didn't make the cut: Height and Corsi. Height is often used as a measure of size but I think the Weight value probably incorporates a bit of height in it (taller players are often heavier) and strength shows itself as a measurable more clearly in weight. Corsi is an interesting cut, considering the propensity of the statistics community to espouse it as a measure of player quality. There are a few potential reasons for this: 1) Corsi simply hasn't made its way into contract negotiations in a significant way, so its impact is overshadowed by the other measurables. Or 2) The model doesn't do a great job overall of capturing the relationships in the data, just the best job given the selected original variables (entirely possible, as I'll show later).
**Something to note: The data was not very nice. What I mean by that is that the data must satisfy a number of assumptions in order to be used to project forward, and these assumptions were not met. Since the purpose of doing all this is to have a projection, we're going to proceed anyway. Sometimes the solution to this issue is more data points (we already have 600) and sometimes the data just isn't "normal".**
Before proceeding further, I'd like to note that there were a few outliers in this data, which I removed from the data set. They didn't seem to change the results too much but I removed them for the sake of cleanliness. There were 12 of them, and I'd point it out if there were anything special about their data. Most of them were paid on the upper levels of salary ($4.5 million and up) but that's about it. After removing the outliers, the model selection process was run again and nothing changed. This was expected since the removed values were somewhat insignificant on the whole.
Skipping any more nitty-gritty stats stuff (the overall least interesting part of the whole situation), on to the results.
We get a formula that looks like this:
AvgSalary = 23766*Age + 3836*Weight - 353809*Position - 9427*Games Played + 78071*Goals + 87140*Assists + 3024*Penalty Minutes - 522951
**Age is measured in years, Weight is measured in pounds, Position is denoted as 1 for forwards and 0 for defensemen, and Games Played is only for the previous year.**
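To make the "Player A versus Player B" idea concrete, here is a minimal Python sketch of the formula above. The coefficients are copied straight from the formula; the function name, the argument names, and the two stat lines are made up purely for illustration. Because the model has no interactions, two players who differ by a single goal differ in projected salary by exactly the Goals coefficient, no matter what the rest of their stat line looks like.

```python
# Coefficients copied from the fitted formula above.
COEFFICIENTS = {
    "age": 23766,              # years
    "weight": 3836,            # pounds
    "is_forward": -353809,     # 1 = forward, 0 = defenseman
    "games_played": -9427,     # previous year only
    "goals": 78071,
    "assists": 87140,
    "penalty_minutes": 3024,
}
INTERCEPT = -522951


def project_salary(age, weight, is_forward, games_played,
                   goals, assists, penalty_minutes):
    """Return the projected average salary in dollars."""
    stats = {
        "age": age,
        "weight": weight,
        "is_forward": is_forward,
        "games_played": games_played,
        "goals": goals,
        "assists": assists,
        "penalty_minutes": penalty_minutes,
    }
    return INTERCEPT + sum(COEFFICIENTS[name] * value
                           for name, value in stats.items())


# Hypothetical players (invented stat lines, not real data):
# identical in every way except that Player B has one more goal.
player_a = dict(age=25, weight=200, is_forward=1, games_played=60,
                goals=10, assists=15, penalty_minutes=20)
player_b = dict(player_a, goals=11)

goal_premium = project_salary(**player_b) - project_salary(**player_a)
print(goal_premium)  # 78071, the Goals coefficient
```

Swapping `goals` for any other stat in `player_b` gives that stat's coefficient in the same way, which is what makes a no-interaction model so easy to read.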
Some interesting notes:
- Age is viewed as a positive but games played is a negative. Age being positive makes sense because of the experience factor and the idea that teams know what they're getting with an older player. I don't know why games played is a negative; maybe a more complex model will show it.
- Assists are worth more than goals? Defensemen tend to make more money than forwards and they tend to get more assists than goals. This drives up the perceived value of assists versus goals. If we were to run this for just forwards (future idea?) then I would guess that goals would jump in value.
- Speaking of forwards versus defense, the "player" "loses" almost $354,000 per year just by being a forward. This makes sense in a couple of ways. First, there are fewer defensemen than forwards and so they are generally a little more valuable simply because there are fewer of them. Second, if you had a forward generate the exact same statistics as a defenseman (say, 4 goals, 19 assists or whatever), that forward would be unlikely to be as valuable as a defenseman who can produce at the exact same level and, presumably, play defense.
- The y-intercept (a negative salary) is impossible but that's not a huge deal since a player with zeros across the board is impossible too.
So now what? One of the things that we can do is determine how well the model actually performs. I didn't set aside a training/testing set here: for now we are only checking how well the regression describes the very data that created it.
So how did we do? Turns out, not that great. This model earned an adjusted R squared value of .6306. What this means in simple terms is that about 63% of the variation in average salary from player to player is explained by the variables we selected (the other 37% comes from some mystery variable(s) that was/were unaccounted for). It's not bad for a simple model but I would anticipate that some more complex stuff would model it a bit better.
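For anyone wondering what the "adjusted" part does, here is a small Python sketch of the standard adjustment, which penalizes plain R squared for the number of predictors in the model. The sample size below is an assumption on my part (roughly 600 players minus the 12 removed outliers), so treat the implied plain R squared as a ballpark rather than a reported result.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Standard adjustment: penalize R squared for each predictor used."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


def r_squared_from_adjusted(adj_r_squared, n_obs, n_predictors):
    """Invert the adjustment to recover the plain R squared."""
    return 1 - (1 - adj_r_squared) * (n_obs - n_predictors - 1) / (n_obs - 1)


N_OBS = 588        # ASSUMPTION: about 600 players minus the 12 outliers
N_PREDICTORS = 7   # Age, Weight, Position, GP, Goals, Assists, PIM

implied_r_squared = r_squared_from_adjusted(0.6306, N_OBS, N_PREDICTORS)
print(f"Implied plain R squared: {implied_r_squared:.4f}")
```

With this many players and only seven predictors, the penalty is tiny, so the adjusted and plain values tell essentially the same story.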
Examples (kind of randomly selected):
Devante Smith-Pelly (ANA): 21 years old, 222 lbs, Forward, 19 GP, 2 G, 8 A, 2 PIM
Projected average salary = $1,154,115. Actual = $800,000. Difference = $354,115
Curtis McKenzie (DAL): 23 years old, 205 lbs, Forward, 36 GP, 4 G, 1 A, 48 PIM.
Projected average salary = $661,442. Actual = $675,000. Difference = $13,558
David Steckel (ANA): 31 years old, 215 lbs, Forward, 34 GP, 1 G, 6 A, 4 PIM.
Projected average salary = $977,215. Actual = $550,000. Difference = $427,215
Calvin de Haan (NYI): 22 years old, 187 lbs, Defenseman, 51 GP, 3 G, 13 A, 30 PIM.
Projected average salary = $1,694,209. Actual = $1,966,667. Difference = $272,458
As is somewhat clear, the model works better for some players than others. McKenzie was predicted fairly accurately but Steckel's projected salary was 178% of what he actually earned.
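The four projections above can be reproduced with a few lines of Python. This is just a sketch: the coefficients, stat lines, and actual salaries are the ones listed in this post, while the function name and the tuple layout are my own choices for illustration. It also prints each projection as a percentage of the actual salary, which is where the 178% figure for Steckel comes from.

```python
def project_salary(age, weight, is_forward, gp, goals, assists, pim):
    """Apply the fitted no-interaction formula (1 = forward, 0 = defenseman)."""
    return (23766 * age + 3836 * weight - 353809 * is_forward
            - 9427 * gp + 78071 * goals + 87140 * assists
            + 3024 * pim - 522951)


# (name, (age, weight, is_forward, GP, G, A, PIM), actual average salary)
EXAMPLES = [
    ("Devante Smith-Pelly", (21, 222, 1, 19, 2, 8, 2), 800_000),
    ("Curtis McKenzie", (23, 205, 1, 36, 4, 1, 48), 675_000),
    ("David Steckel", (31, 215, 1, 34, 1, 6, 4), 550_000),
    ("Calvin de Haan", (22, 187, 0, 51, 3, 13, 30), 1_966_667),
]

results = []
for name, stats, actual in EXAMPLES:
    projected = project_salary(*stats)
    ratio = projected / actual  # above 1 means the model overshoots
    results.append((name, projected, actual, ratio))
    print(f"{name}: projected ${projected:,}, actual ${actual:,} "
          f"({ratio:.0%} of actual)")
```

Looking at the ratio rather than the raw dollar difference makes it easier to compare misses across players with very different salaries.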
So the model isn't fantastic, but it can do well enough to give a general ballpark.
The next step is to generate a model with interactions, the next level of complexity in this process, and see if the relationships between the variables can be used to more accurately project salary.