Get ready for some math!
So I spent some time on Wednesday coming up with a probabilistic model for picking the last nine WCHA games, combining everything into a probabilistic record table, and such. I hit some snags, so I’m not done, but I did want to present my thinking to this point. If you don’t care anything about that, just look above and see the estimates for this weekend’s series.
What I'm assuming is that actual weekend outcomes are normally distributed around the KRACH-based expected value, which is what I've been computing all along. The mean is almost always not 2.000, so shifting the curve off-center pushes part of it outside the 0-4 range; for any game involving UAH, whose KRACH-based expected value has been around 3.5, the simulated values can run out to something like 7. But Bemidji can't pick up seven points this weekend (much as they'd like to, I'm sure), so anything greater than four is, well, a sweep.
Before you black out on me: you know the bell curve we all know well?
Yeah, what we’re doing is putting the bell curve between goal posts at 0.000 and 4.000, the theoretical minimum and maximum. Any time you shift away from the center point at 2.000, the tail bit that goes outside of the goal posts automatically counts as a sweep one way or the other. If one team is expected to sweep, the goal post will be right near the peak of the bell curve, meaning a lot of the data will be outside of the goal post.
So, you can picture moving the whole curve around by its peak with your hand. But now we need to consider two more things: the concept of the standard deviation, and what our breakpoints will be.
I should explain these.
Breakpoints: I looked, and the WCHA has had ties around 10% of the time. As such, I couldn't use a standard rounding routine to decide whether a result was a win or a tie. Instead, I've taken slices of data: sweeps don't start at 3.500, they start at 3.200; three-point weekends happen only between 2.900 and 3.199; splits happen between 1.200 and 2.899, a wide, wide swath that's 42.5% of the entire band; one-point weekends happen between 0.900 and 1.199; and you get swept if it comes up below 0.900.
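Those slices are really just a chain of comparisons. Here's a minimal Python sketch of the bucketing (the function name is mine, not anything from the spreadsheet):

```python
def weekend_points(x):
    """Map a simulated weekend value onto the 0/1/2/3/4 breakpoint slices."""
    if x >= 3.200:      # sweep
        return 4
    elif x >= 2.900:    # win and a tie
        return 3
    elif x >= 1.200:    # split (the wide 42.5% band)
        return 2
    elif x >= 0.900:    # one point: a tie
        return 1
    else:               # swept
        return 0
```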
Standard deviations: If you look at that bell curve image above, you'll see the concept of the standard deviation. It has to do with the spread of the data: how far the points fall from the average. I used a weighting that gives a wider spread in the middle, because a predicted split could easily go sweep with a little puck luck, and a narrower spread at the edges, because a team that's expected to sweep should have a high probability of doing so. The standard deviation changes linearly with how far the expected value sits from the 2.000 center point.
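To make that linear weighting concrete, here's a sketch. The two endpoint spreads here are placeholders I made up for illustration; the real table has its own numbers:

```python
def weekend_stdev(mean, center_spread=0.80, edge_spread=0.35):
    """Linearly interpolate the spread: widest when the expected value
    sits at the 2.000 center point, narrowest out at 0.000 or 4.000.
    The two default spreads are made-up placeholder values."""
    distance = abs(mean - 2.0) / 2.0  # 0 at the center, 1 at the edges
    return center_spread - distance * (center_spread - edge_spread)
```

So a dead-even series gets the full center spread, and a near-certain sweep gets squeezed down toward the edge spread.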
So let’s look at the KRACH-based expected values and estimated standard deviations.
You can see that Bemidji would be expected to win 3.722 points this weekend, and the spread of points is 0.392. As the bell curve above shows, 68% of the area under the BSU-UAH curve falls between 3.330 and 4.114. All of those values are greater than 3.200, so BSU sweeps at least 68% of the time, and that doesn't even account for the fact that everything on the right half of the curve is also in sweep territory, so it's at least 84% that's a for-sure sweep. Two standard deviations down is 3.722 - 0.392 - 0.392 = 2.938, which is above the split point and below the win-and-tie point, and that covers about 98% of the data. That's why it's pretty rare that UAH wins a game.
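You can also get the sweep chance exactly from the normal CDF instead of eyeballing the 68/95 rule. A quick check of the BSU-UAH numbers in Python (no Excel needed; `math.erf` is in the standard library):

```python
import math

def normal_cdf(x, mean, stdev):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (stdev * math.sqrt(2.0))))

# BSU-UAH: expected value 3.722, spread 0.392; sweeps start at 3.200
p_sweep = 1.0 - normal_cdf(3.200, 3.722, 0.392)
print(f"P(BSU sweep) = {p_sweep:.1%}")
```

The exact figure comes out a bit above the 84% back-of-the-envelope number, since the 3.200 cutoff sits inside the one-standard-deviation band.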
If anything, that win-and-tie thing is probably a little large. Those 0/1/2/3/4 breakpoints are experimental, but I think that it’s realistic to say that UAH has a 2.30% chance of picking up a win this season. UAH played teams in the 3-9 morass 18 times, and it has one win and one tie, and that’s 8.33%, which is pretty close to the pick this weekend.
Looking at the other games: we saw that the LSSU and NMU series were pretty close to the split line, and even with the wider spread (standard deviation) in the middle of the curve, the splits happen nearly 80% of the time. As you can see, the distribution leans to the left, with each of these home teams picking up at least a point 90-92% of the time. The spread may not be wide enough, because the chance that the home team sweeps comes out pretty darn small; I'll work on tweaking that. There's also no home-team bonus, mainly because I forgot it yesterday and haven't thought it fully out. It would be some sort of home W% / road W% estimation, but that's kinda noisy, and I've already made this a little noisy even with 1,000 runs.
Let's finish with the Ferris State-Alaska series. You can see that a split is the most likely outcome, but the next most likely isn't Alaska picking up a point; it's Alaska getting swept by the Bulldogs. That falls out of the breakpoints trying to make ties happen around 10% of the time: the model says there's a 22.4% chance the teams tie one night, but that's because the peak of the curve (the expected value) sits right in that break. A home-team bonus would probably pull this out of that zone.
I’m not done, but I did want to get this out there. We’ll see if it holds water.
For those curious, I'm using NORM.INV(RAND(),mean,stdev). The RAND() function spits out a random number, and the mean and standard deviation come from that third table. I have that formula listed 1,000 times (creating that was … fun), and the table we saw up front and just above these paragraphs sums up all of the times that 0, 1, 2, 3, and 4 points come up in the random calculation. That calculation is a fun nested-IF statement that does the math of putting each calculated number into its breakpoint slot. That took some time, and trying to build it in Google Docs nearly led me to chuck my laptop across the room, so I'm using Excel. I'll probably release this after the season is over, because I'm not done with the model just yet and don't want to put something half-assed out there.
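If spreadsheets aren't your thing, the whole 1,000-row setup collapses to a few lines of Python. This is a sketch of the same idea, not the actual spreadsheet: random.gauss plays the role of NORM.INV(RAND(), mean, stdev), and the if-chain is the nested-IF breakpoint logic:

```python
import random
from collections import Counter

def simulate_series(mean, stdev, runs=1000):
    """Monte Carlo version of 1,000 rows of NORM.INV(RAND(), mean, stdev),
    with the nested-IF breakpoint logic folded in."""
    counts = Counter()
    for _ in range(runs):
        x = random.gauss(mean, stdev)  # one simulated weekend
        if x >= 3.200:
            pts = 4   # sweep
        elif x >= 2.900:
            pts = 3   # win and a tie
        elif x >= 1.200:
            pts = 2   # split
        elif x >= 0.900:
            pts = 1   # one point
        else:
            pts = 0   # swept
        counts[pts] += 1
    return counts

# BSU-UAH numbers from the table above
print(simulate_series(3.722, 0.392))
```

Run it a few times and the 4-point bucket dominates, with the counts wobbling a bit from run to run, which is the same noise I mentioned seeing at 1,000 runs.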