RotoGuru Baseball Forum



0 Subject: Ridiculous Stats

Posted by: KrazyKoalaBears
- [51521713] Fri, Apr 06, 10:40

I was double-checking the Offense/Defense and Power Rankings for Gleem and found the following to be just plain ridiculous:

COL's and HOU's current offensive run production of 10.67 and 9.00 runs per game, respectively, renders the following teams in the bottom 25th percentile of offensive run production: STL, SDG, OAK, FLA, TEX, DET, PIT, MON, LOS, MIL, CHC, KAN, BOS, and BAL!

Speaking of BAL, believe it or not, the Orioles' current team ERA is an amazing 1.24 (1st in the majors). Too bad their current run production is 1.33 runs per game (last in the majors)!

How glad is STL to be leaving COL? Their current team ERA of 10.13 not only ranks them in last place in the majors, but is also so bad that it makes them the only team in the bottom 25th percentile in team ERA!

Despite starting the season at Coors Field, the Rockies' current team ERA is a respectable 3.67, which is good enough for the 73rd percentile (just missing the top 25th percentile).

Just a few things that I saw that seemed ridiculous, even if the season is barely started.

1Guru
      ID: 330592710
      Fri, Apr 06, 10:46
How can only one team be in the bottom 25th percentile? By definition, don't 25% of the teams have to be in any 25%-ile range?

Or am I missing something?
2Wammie
      ID: 20039259
      Fri, Apr 06, 10:51
maybe they are using the 6 sigma jazz? but i was thinking the same thing, guru
3Steve Biz
      ID: 02411620
      Fri, Apr 06, 10:52
Technically you're right, Guru, but I think KKB split the ERA range into 4 equal intervals, and STL's ERA was so high that no other team fell within the top interval with them.
4KrazyKoalaBears
      ID: 51521713
      Fri, Apr 06, 10:54
I'm not a statistics expert so I don't really know the answer to that question. ;)

The way I determine percentile is based off the stat that I'm looking at. For example, when looking at Runs Scored per Game, COL is scoring the most at 10.67 and BAL is scoring the least at 1.33. The difference between those 2 is 9.34. So if COL is 100 (percentile; the best) and BAL is 0 (percentile; the worst), then the rest of the teams are spread along the difference. 25% of the difference is 2.335. Add that to the worst team (worst score possible) and that limit is 3.665. Subtract the 2.335 from the best team (best possible score) and you get 8.335.

So that's where the numbers come from and is how I've always understood it. But again, if anyone with some better knowledge could validate/reject this (Madman?) it would be much more reliable than what I have laid out.
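
The arithmetic described above can be sketched in a few lines. This is only an illustration of the range-based cutoffs as the post explains them, not the actual Gleem code; the names `range_cutoffs` and `range_band` are made up for the example, and the two inputs are the runs-per-game figures quoted in the post.

```python
def range_cutoffs(worst, best, fraction=0.25):
    """Return (bottom_cutoff, top_cutoff), each `fraction` of the range in from an end."""
    step = (best - worst) * fraction
    return worst + step, best - step


def range_band(value, worst, best, fraction=0.25):
    """Label a value by which slice of the min-to-max range it falls in."""
    bottom_cutoff, top_cutoff = range_cutoffs(worst, best, fraction)
    if value <= bottom_cutoff:
        return "bottom"
    if value >= top_cutoff:
        return "top"
    return "middle"


# Runs per game quoted in the post: BAL is the worst, COL is the best.
bottom_cutoff, top_cutoff = range_cutoffs(worst=1.33, best=10.67)
print(f"bottom band ends at {bottom_cutoff:.3f}, top band starts at {top_cutoff:.3f}")
print("HOU at 9.00 runs per game:", range_band(9.00, worst=1.33, best=10.67))
```

Note that nothing here looks at how many teams land in each band, which is why one outlier can leave a band nearly empty.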

5CH
      ID: 4910502912
      Fri, Apr 06, 10:54
Perhaps the percentiles are based on ERA rather than the position of the teams? Thus, if 10 were the highest ERA, any team with an ERA >= 7.5 would be in the 25th percentile.
ERA >= 7.5 25th
5 <= ERA < 7.5 50th
2.5 <= ERA < 5 75th
0 <= ERA < 2.5 100th

not sure, just a guess
6 APerfect10
      ID: 422362923
      Fri, Apr 06, 10:54
Go O's!!! Sorry, lifelong O's fan! And will always be an O's fan through good and bad! It's good to see them lead the majors in something during this rebuilding time. Plus it's only 3 games into the season, better not get my hopes up! Lol....
7Madman
      ID: 29246911
      Fri, Apr 06, 10:56
Guru, I count 15 teams in the bottom 25%. This means a league of 60 teams . . . hmmm. Sounds like the players' association would like this information for their upcoming negotiations :)
8KrazyKoalaBears
      ID: 51521713
      Fri, Apr 06, 10:59
Ok, ok, I give up! ;)

Anyone care to give an explanation of the correct way of showing this information?

9Homer
      ID: 534468
      Fri, Apr 06, 10:59
If you simply cross-correlate the Weibull distribution correcting for anomalies in the standard deviation and execute a multi-variate regression, you'll see that this makes perfect sense.
10popgun
      ID: 501261314
      Fri, Apr 06, 11:02
what homer said ;-)
11Madman
      ID: 29246911
      Fri, Apr 06, 11:03
KKB -- to do percentiles, the lowest 1-7 teams are in the bottom 25%, 8-15 are in the bottom 50%, 16-22 are in the bottom 75%, and 23-30 are in the bottom 100% ;). Teams 23-30 are also in the top 25%, if you view the world as half-full.

The raw numbers, however, are more interesting, IMO. Maybe a simple line graph with dots on it for each team and labels would be nice. That way you could get a feel for the spread.
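
The rank-based grouping above can be sketched as follows. This is a minimal illustration that assumes teams are already ranked 1 (worst) through 30 (best); the function name `rank_quartile` is made up for the example. Taking the ceiling of 4 * rank / 30 happens to reproduce the 1-7, 8-15, 16-22, 23-30 split from the post.

```python
def rank_quartile(rank, n_teams=30):
    """Quartile (1 = bottom 25%, 4 = top 25%) for a 1-based rank, where rank 1 is the worst."""
    if not 1 <= rank <= n_teams:
        raise ValueError("rank must be between 1 and n_teams")
    # Integer ceiling of 4 * rank / n_teams, which avoids float rounding.
    return (4 * rank + n_teams - 1) // n_teams


# Collect the ranks that land in each quartile.
groups = {}
for rank in range(1, 31):
    groups.setdefault(rank_quartile(rank), []).append(rank)

for quartile, ranks in sorted(groups.items()):
    print(f"quartile {quartile}: ranks {ranks[0]}-{ranks[-1]}")
```

Only the ordering matters here, so every quartile holds 7 or 8 teams no matter how far apart the raw stats are.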
12KrazyKoalaBears
      ID: 51521713
      Fri, Apr 06, 11:06
Madman, you're right that a line graph may be the way to go.

My understanding of percentiles was that something like a 75th percentile was read as "that team scores better than 75 out of 100 teams". Is this incorrect? If so, which I assume it is, is this even a method of determining anything? ;)

13Madman
      ID: 29246911
      Fri, Apr 06, 11:09
KKB -- no, that's the correct definition of the 75th percentile. For example, teams 23-30 are in the 75th percentile or above.

No, this is not a method of determining anything, other than how many teams you are in front of. In this case, the pure stats are much more interesting, IMO.

If you have 5 million teams, and a distribution that you know something about, percentiles might get more interesting. With 30 teams, and no real a priori information about how they should be distributed, I don't think percentiles are particularly informative. ;)
14KrazyKoalaBears
      ID: 51521713
      Fri, Apr 06, 11:11
Makes sense. Like I said, I'm not a statistics expert. ;)
15winmiller
      ID: 49326610
      Fri, Apr 06, 11:30
KKB,

I found the Gleem Tools 28 day Percentile Schedule to be extremely helpful in hockey. It became clear after just a short while that you were not using percentiles according to the usual definition, which would always place a given number of teams in a quartile regardless of the size of the deviation in the stats. I never brought it up since I liked the way you were doing it better than if you followed the definition. You were picking up the greatest variance between strong and weak schedules.

I hope you use the same method for baseball if you are going to create a Percentile Schedule. Maybe a true statistical type can come up with the correct wording for your method?

Thanks again for the Gleem Tools.

16steve houpt
      ID: 2133444
      Fri, Apr 06, 11:53
Even if you use a "Z" table to determine top and bottom 25 percentiles, there are seven teams in the bottom 25%.

AVG of team ERA's = 4.21 (not league - avg of each team)

Standard Deviation = 2.30.

TEAM [Z] - percentile [ERA]
OAK [-0.806] 21.01% [5.76]
TEX [-0.884] 18.83% [5.91]
KAN [-0.931] 17.59% [6.00]
CHA [-1.113] 13.29% [6.35]
SDG [-1.275] 10.12% [6.86]
MIL [-1.520] 6.43% [7.13]
STL [-3.084] 0.10% [10.13]
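
The "Z" table lookup in this list can be approximated with the standard normal CDF. This is a small sketch that takes the Z values exactly as listed and only converts them to percentiles; the helper name `normal_cdf` is made up for the example, and it assumes team ERAs are roughly normally distributed, which is a strong assumption for 30 teams a few games into the season.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Z values copied from the list above (negative = worse than the average ERA).
z_listed = {
    "OAK": -0.806,
    "TEX": -0.884,
    "KAN": -0.931,
    "CHA": -1.113,
    "SDG": -1.275,
    "MIL": -1.520,
    "STL": -3.084,
}

percentiles = {team: 100.0 * normal_cdf(z) for team, z in z_listed.items()}
for team, pct in percentiles.items():
    print(f"{team} {pct:.2f}%")
```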

I'm not sure how 'Gleem' is figuring.
17 Slow Stick
      ID: 58356611
      Fri, Apr 06, 11:56
Lurker here.

KKB, your new understanding of percentile is correct. What you are illustrating is the Position of each team across the Range of valid values. The statistics purist will not like the term but I would suggest Deviation or Divergence. Or make up a name like Quartile.
18KrazyKoalaBears
      ID: 51521713
      Fri, Apr 06, 12:00
steve, I think that's the problem. I'm not really figuring. ;)

winmiller, I'm working on the "percentile" schedule, but I think I may change the name to something more appropriate.

19Ender
      ID: 52438315
      Fri, Apr 06, 12:23
Actually Quartile IS the name given to the 25th, 50th, and 75th percentiles. So calling what KKB is doing a Quartile really doesn't solve any of the confusion.

The real confusion is that KKB is figuring his percentile based on the stat in question rather than the number of teams in the sample. Percentile only compares the ranked position of a member of the sample to the rest of the sample. That is, a team at the 25th percentile is ranked higher than 25% of the sample, which by the way means there is no 100th percentile (nitpicking for sure).


20JeffG
      ID: 40451227
      Fri, Apr 06, 12:53
Some Statistics 101.

Mean - Average of all values.
Median - Value of middle element when sorted in sequence.
Mode - Value or element that occurs most frequently.
Midrange - Average of the highest and lowest value.

So, KKB was basically computing a midrange, then within each extreme and the midrange, getting another midrange, yielding quarters, then plotting each value in the sub range that the teams fall in.
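
The midrange-of-midranges idea can be sketched like this. It is only an illustration of the description above; the helper names are made up for the example, and the inputs are the runs-per-game extremes from the first post (BAL 1.33, COL 10.67).

```python
import bisect


def midrange(low, high):
    """Average of the lowest and highest value."""
    return (low + high) / 2


def quarter_boundaries(low, high):
    """Midrange of the whole range, plus the midrange of each half."""
    middle = midrange(low, high)
    return [midrange(low, middle), middle, midrange(middle, high)]


def quarter_of(value, low, high):
    """Which of the four equal-width sub-ranges (1 = lowest) a value falls in."""
    return bisect.bisect_right(quarter_boundaries(low, high), value) + 1


boundaries = quarter_boundaries(1.33, 10.67)
print([round(b, 3) for b in boundaries])
print("HOU at 9.00 falls in quarter", quarter_of(9.00, 1.33, 10.67))
```

The outer two boundaries match the 3.665 and 8.335 limits KKB worked out earlier in the thread.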
22Ender
      ID: 52438315
      Fri, Apr 06, 13:03
Nice, Jeff G. I was sure there was a way to explain what KKB actually did. I just couldn't piece it together.
23Madman
      ID: 29246911
      Fri, Apr 06, 13:49
Maybe "Spread Percentage"? So, COL's ERA has a Spread Percentage of 73%.

The interpretation here is that COL's ERA "covers" 73% of the spread between the max and the min. In other words, COL's ERA is better than the worst ERA, and also 73% of the remaining distance toward the best in the league.

I dunno. That's not a stat term, but it might convey the idea.
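
The "Spread Percentage" idea amounts to rescaling a value between the worst and best observed values. Here is a minimal sketch; the function name `spread_percentage` is made up for the example, and the ERA figures are the ones quoted in the first post (STL 10.13 worst, BAL 1.24 best, COL 3.67).

```python
def spread_percentage(value, worst, best):
    """How far `value` sits along the way from `worst` (0%) to `best` (100%)."""
    if worst == best:
        raise ValueError("worst and best must differ")
    # Works whether lower is better (ERA) or higher is better (runs scored),
    # because the direction is carried by the sign of (best - worst).
    return 100.0 * (value - worst) / (best - worst)


col_era_spread = spread_percentage(3.67, worst=10.13, best=1.24)
print(f"COL ERA spread percentage: {col_era_spread:.0f}%")
```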
24Sludge
      ID: 1440310
      Fri, Apr 06, 13:52
Ender

...means there is no 100th percentile (nit picking for sure).

Not exactly. While there are different definitions of what a percentile is, one that applies in all cases is the following:

The p*100th percentile is a value such that p*100% of the observations are at or below this value.

Sample: 1, 3, 4, 5, 5, 6, 7, 7, 8

0th percentile: 0, 0.99999, -1, -100
100th percentile: 9, 8.00001, 8, 100

The moral of the story: A sample percentile does not have to be an observation or even within the range of the observations.

... to be nitpicky.
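
The definition above can be checked directly against the sample. This is a small sketch of the "at or below" rule only (other percentile definitions exist, as noted); the function name `share_at_or_below` is made up for the example.

```python
def share_at_or_below(sample, value):
    """Percentage of observations that are less than or equal to `value`."""
    return 100.0 * sum(1 for x in sample if x <= value) / len(sample)


sample = [1, 3, 4, 5, 5, 6, 7, 7, 8]

# Every candidate listed as a 0th percentile has no observations at or below it.
zeroth = [share_at_or_below(sample, v) for v in (0, 0.99999, -1, -100)]

# Every candidate listed as a 100th percentile has all observations at or below it.
hundredth = [share_at_or_below(sample, v) for v in (9, 8.00001, 8, 100)]

print(zeroth)
print(hundredth)
```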
25Sludge
      ID: 1440310
      Fri, Apr 06, 13:54
Actually, Madman, KKB's definition is a percentile assuming a continuous uniform distribution with parameters given by the min and max observations.
26Madman
      ID: 29246911
      Fri, Apr 06, 14:00
sludge Yes. But an assumption of a continuous uniform distribution is absurd when you're dealing with 30 data points that represent the whole population.

In fact, any calculation that preserves the ordering of the data can give you percentiles if you allow me to pick my assumptions. I don't think you'd accept that on a homework assignment as a stats prof, however. I'm not sure what you're getting at here.
27KrazyKoalaBears
      ID: 51521713
      Fri, Apr 06, 14:03
Ok, so now I'm completely confused. ;)

JeffG, um, yeah. Exactly. I'm not sure that's exactly how I was thinking it, but if it fits, then that's what it is.

I like Madman's term "Spread Percentage" because it does seem to convey what I'm doing.

28Sludge
      ID: 1440310
      Fri, Apr 06, 14:05
Madman

Of course it's absurd! I just didn't want to be the one to say it. :)

I was just pointing out that KKB was (in a technical sense) computing percentiles under certain assumptions.
29Guru
      ID: 330592710
      Fri, Apr 06, 14:09
You could call them "Gleem-iles".

30Madman
      ID: 29246911
      Fri, Apr 06, 14:10
Sludge Gotcha. You were trying to come to his technical salvation and I blew your cover. Sorry. I guess I win the "foot in the mouth" award for the day.