Skewness Mean Median

Fri, 05 Feb 2010 22:55:11 +0000




29 Comments »

  1. [...] This post was mentioned on Twitter by Paul Stamatiou, Brad Feld, Topsy, topsy_top20k, topsy_top20k_en and others. topsy_top20k_en said: You Don’t Mean Average, You Mean Median http://goo.gl/fb/b4iI [...]

    Pingback by Tweets that mention You Don’t Mean Average, You Mean Median -- Topsy.com — January 2, 2010 @ 11:16 am

  2. Median is a sorely underused metric. And I suspect the main reason is that there is no native median() function in SQL.

    Comment by Derek Scruggs — January 2, 2010 @ 6:39 pm
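
One possible workaround for the missing median() function, sketched here with SQLite through Python's standard sqlite3 module. The deals table, its amount column, the sql_median helper, and the sample values are all hypothetical, and other databases need different syntax; treat this as one common idiom rather than the definitive query.

```python
import sqlite3

# Sort the rows, skip to the middle, and average the one or two middle values.
# LIMIT is 1 for an odd row count and 2 for an even one.
MEDIAN_SQL = """
SELECT AVG(amount) FROM (
    SELECT amount FROM deals
    ORDER BY amount
    LIMIT 2 - (SELECT COUNT(*) FROM deals) % 2
    OFFSET (SELECT (COUNT(*) - 1) / 2 FROM deals)
)
"""


def sql_median(values):
    """Load values into a throwaway in-memory table and return their median."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE deals (amount REAL)")
        conn.executemany("INSERT INTO deals (amount) VALUES (?)",
                         [(v,) for v in values])
        return conn.execute(MEDIAN_SQL).fetchone()[0]
    finally:
        conn.close()


# Hypothetical deal sizes in $M: one large deal drags the mean far above the median.
amounts = [0.5, 1.0, 2.0, 5.0, 40.0]
print("mean   =", sum(amounts) / len(amounts))
print("median =", sql_median(amounts))
```

The subquery approach scans the table more than once, so on large tables a window-function or database-specific percentile function may be a better fit where one is available.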

  3. Thanks for doing this – I was too lazy to dig into the data when we were trading mail (and rest assured I'm well versed in mean, median, mode, skewness and kurtosis). But now you've got me interested…

    Comment by Steve Murchie — January 2, 2010 @ 6:54 pm

  4. Hi Brad

    This is slightly OT, but I am (or was) a median problem junkie. By that I mean algorithms for median finding, and lower bound proofs to show limits on all algorithms. Solving the median problem optimally is a really hard nut. It was a long time before it was realized that you could find the median of 5 things in just 6 comparisons (not 7, which is what it takes you to sort 5). There's a great history of algorithms and lower bounds slowly converging on 2n and 3n to find the median of n things. The constant is still not known (AFAIK), and there's even a nice conjecture which, if true, implies the existence of a 2.5n algorithm – without giving any hint about the algorithm. I did an M.Math on the median problem 20+ years ago, and even used to dream about algorithms for it! Then I got better. There's a lot of really nice fundamental CS in there.

    Regards,
    Terry

    Comment by @terrycojones — January 2, 2010 @ 7:04 pm
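
The 6-comparison claim can be checked with a short sketch. This is one well-known scheme, not necessarily the one Terry has in mind; the function name median_of_five and the comparison counter are illustrative, and the five inputs are assumed to be distinct.

```python
def median_of_five(a, b, c, d, e):
    """Return (median, comparisons_used) for five distinct values."""
    comparisons = 0

    def less(x, y):
        nonlocal comparisons
        comparisons += 1
        return x < y

    if not less(a, b):      # 1: order the first pair so that a < b
        a, b = b, a
    if not less(c, d):      # 2: order the second pair so that c < d
        c, d = d, c
    if not less(a, c):      # 3: make (a, b) the pair with the smaller minimum
        a, b, c, d = c, d, a, b

    # Now a < b and a < c < d, so a is below three values and cannot be
    # the median. Drop it; the median is the 2nd smallest of {b, c, d, e}.
    x, y = (b, e) if less(b, e) else (e, b)     # 4: order the pair so x < y

    if less(x, c):                              # 5: x is the smallest of the four
        return (y if less(y, c) else c), comparisons    # 6
    # Otherwise c is the smallest of the four.
    return (x if less(x, d) else d), comparisons        # 6


print(median_of_five(5, 4, 3, 2, 1))
```

Every path through the function performs exactly six comparisons, which can be confirmed by running it over all 120 orderings of five distinct values.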

  5. Most statistical packages do provide this now and as Terry just noted, there's a cheat to estimate it from the mean & SD. Unfortunately, it's damned hard to do statistical inference with median (terrifying math deleted, LOL). From an information theory perspective, median IS the right indicator; the mean is just easier to calculate in large data sets.

    My first job was at an investment bank & you had the same mean/median problem, which gets amplified when you scale up to larger portfolios over time. Investment returns data (and most naturally occurring economic data, period) are not distributed normally – it's actually a Pareto distribution (closely related to Pareto's 80-20 rule). Typically, it's *almost* normal but skewed, with fatter tails (as Taleb notes, more black swans). Anyway, the rule in econometrics & polimetrics: when the median & mean diverge, use the median, even though it ruins many of the statistical inference tools. (Yes, that's why econometric projections often stink: skewed data that they transform into oblivion, LOL.) Thanks, Brad, for resurrecting my inner math nerd.

    Comment by @entrep_thinking — January 2, 2010 @ 7:11 pm

  6. Yeah, but it’s not that hard (usually) to do.  Our friend Google helped me find a bunch of examples:

    http://scorreiait.wordpress.com/2008/10/28/how-to...
    I particularly love the Sybase ASE one. 

    Comment by Brad Feld — January 2, 2010 @ 7:27 pm

  7. Reminds me of the average salary for Geography majors at UNC – 6 figures!

    Related topic: notable Geography majors – Michael Jordan

    Comment by @reecepacheco — January 2, 2010 @ 8:33 pm

  8. Awesome example.

    Comment by Brad Feld — January 2, 2010 @ 8:35 pm

  9. Surprised you didn't also mention the bio/IT gap. This generates a bimodal distribution because the financing needs are categorically different on the seed/early stage end. It seems to me that any funding statistics that do not separate these are suspect. This may also solve part of your outlier problem – for example, in the New England stats above the first seven are all bio/pharma deals.

    Comment by DaveJ — January 2, 2010 @ 9:05 pm

  10. Yup – I mentioned that in the comments with Murchie but ran out of gas in this post.  It’s an important point how IT vs. bio vs. cleantech is categorized, although I’d still suggest that a “$17m seed investment” is an oxymoron regardless of category.

    Comment by Brad Feld — January 2, 2010 @ 9:27 pm

  11. Wouldn't more meaningful metrics be
    a.) how much deployable capital is in funds
    b.) what is the thesis (size/stage/industry/etc.) of those funds
    Using past data isn't a great predictor of future activity.

    Comment by Les Makepeace — January 2, 2010 @ 9:30 pm

  12. I completely agree with your statement “Using past data isn't a great predictor of future activity.”  I think this is one of the problems with all of this analysis – it doesn’t really have any predictive ability, even if it was more accurate.

    Re: more meaningful metrics – I’m not sure. While those are useful for understanding what is going on in a fund, they are probably not that useful in the aggregate.

    Comment by Brad Feld — January 2, 2010 @ 9:50 pm

  13. Next time you're asked, send the journalist that link. S/he will surely appreciate it.

    Comment by @derekscruggs — January 2, 2010 @ 10:14 pm

  14. Looking at this in some more detail (based on what is available to the unwashed masses), the classification of stage seems to be the most bogus part of the reporting. The definitions at https://www.pwcmoneytree.com/MTPublic/ns/nav.jsp?... not only leave room for interpretation, but also fall prey to a common survey error: if it's too hard to think about, pick a random response. Perhaps as a result, you see small investments made in all stages, and they don't map well to the amount raised. For example, BrightKite, which you know well, got a $40K infusion from DFJ in 3Q09 that was listed as "Later Stage". I'm guessing interns are populating their submissions to PWC…

    Anyone from PWC want to open up the dataset for some analysis?

    Comment by Steve Murchie — January 2, 2010 @ 10:20 pm

  15. Hard to imagine PWC not taking your logic into consideration. Excellent case study for College stat classes.

    In regards to capital…I’m a startup raising a seed round. While the data can’t predict future behavior, it provides some solace that I don’t suck as bad as I think. Staying hungry and optimistic for 2010.

    Thanks for the great article!

    Comment by Doug Wulff — January 2, 2010 @ 11:09 pm

  16. Yup – they are lousy classifications.  Expansion and Later are fine, but Seed/Start-Up and Early is just wrong. 

    Comment by Brad Feld — January 2, 2010 @ 11:16 pm

  17. Unfortunately I think most people forget their college stats class the millisecond they complete it, other than people who use stats for a living, and even then, most don’t have a clue what they are doing!  And – as you say – “stay hungry and optimistic” – good things happen to those that stay at it.

    Comment by Brad Feld — January 2, 2010 @ 11:25 pm

  18. I'm not convinced that it's wrong to use the mean — as long as you're using the RIGHT mean. Most things financial are best examined via their logarithms; whereas most things physical are described by Normal distributions, most things financial are described by LogNormal distributions.

    Rather than examining the arithmetic mean or the median, I'd suggest looking at the geometric mean (aka. exp-mean-log). From the data above, this gives a SV mean of $3.8M (after fudging $0 investments to $0.5M) and a NE mean of $5.3M — which eliminates the excessive skew in the arithmetic mean numbers of $6.4M/$8.4M, but still reflects the fact that NE investments tend to be somewhat larger than SV investments… a fact which is entirely lost when you look at the median ($5M in both cases).

    Comment by Colin Percival — January 3, 2010 @ 12:42 am
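
To see how the three statistics behave on skewed data, here is a rough sketch. The deal sizes below are made up for illustration (they are not the PWC figures from the post), and the 0.5 floor simply mirrors the "fudging $0 investments to $0.5M" step described in the comment.

```python
import math
import statistics

# Hypothetical deal sizes in $M, with one $0 entry and one large outlier.
deals = [0.0, 0.5, 1.0, 2.0, 3.0, 5.0, 5.0, 6.0, 8.0, 40.0]

FLOOR = 0.5  # assumed stand-in for $0 deals, since log(0) is undefined


def geometric_mean(values, floor=FLOOR):
    """exp(mean(log(x))), with values below `floor` raised to `floor` first."""
    logs = [math.log(max(v, floor)) for v in values]
    return math.exp(sum(logs) / len(logs))


arithmetic = statistics.mean(deals)
median = statistics.median(deals)
geometric = geometric_mean(deals)

print(f"arithmetic mean = {arithmetic:.2f}")
print(f"median          = {median:.2f}")
print(f"geometric mean  = {geometric:.2f}")
```

The single $40M deal pulls the arithmetic mean well above the other two figures, while the geometric mean still moves when the bulk of the deals gets larger or smaller. The result is sensitive to the choice of floor, so that value should be reported alongside it.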

  19. Excellent post, Brad. How can we evolve the landscape if the basics keep getting messed up?

    I missed a question regarding median, mean and mode on my MCAT. It's basic stuff every college attendee should grasp.

    Comment by MattEmmi — January 3, 2010 @ 1:37 am

  20. “Econometric projections often stink [because they are] skewed data that they transform into oblivion” – great phrase – I’m definitely using it.  They also often stink because they are really studying history and trying to predict the future from it.  Uh – yeah.

    Comment by Brad Feld — January 2, 2010 @ 8:00 pm

  21. @bfeld You might enjoy this classic essay by Stephen Jay Gould, "The Median isn't the Message." http://www.stat.berkeley.edu/users/rice/Stat2/Gou...

    Comment by @bradfordcross — January 3, 2010 @ 3:39 am

  22. Brilliant essay.  And completely consistent with my “who gives a fuck” message to reporters (and others) whenever I am asked about the meaning of the data and averages (or medians) for a period of time.  I love Gould’s reasoning as well as his story telling – he is so good.  Thanks for pointing this one out.

    Comment by Brad Feld — January 3, 2010 @ 4:07 am

  23. Indeed. Several years ago I was working on an essay called "The Psychology of Mathematical Misjudgment" – a more in-depth treatment of these topics with a dual inspiration from Gould's essay and the terrific talk by Charlie Munger "The Psychology of Human Misjudgment." http://vinvesting.com/docs/munger/human_misjudgem...

    Maybe I'll try to find it, dust it off, and publish it this year. It is a really fun topic.

    Comment by @bradfordcross — January 3, 2010 @ 4:27 am

  24. It seems that I forgot to mention that I enjoyed your post.

    I think another big issue here is the data is not a very good sample. The results will naturally be skewed because the data is skewed to larger deals. For example, I know there are a lot of sub-million-dollar deals and I'd imagine that it is difficult to get a broad database of these.

    Services like Crunchbase at least try to pick up the publicly announced deals, but many are still missing, and many more are never announced.

    I'd hazard a guess that another bias is that the % of deals included in these databases within a given range of amounts is proportional to the amount – the smaller the deal, the less likely it is to be represented in the databases (and/or represented correctly.)

    Comment by @bradfordcross — January 3, 2010 @ 4:49 am

  25. Correct – I tried to point that out near the end but didn’t state it as clearly as you just did.  Simply – the data is incomplete and – as a result – the conclusions are crap.

    Comment by Brad Feld — January 3, 2010 @ 5:06 am

  26. Hah!  Yeah – they don’t call me any more – they know better.

    Comment by Brad Feld — January 2, 2010 @ 10:26 pm

  27. http://flawofaverages.com/ has lots of great examples of this fallacy and what to do about them.

    Comment by Jim Franklin — January 3, 2010 @ 5:59 am

  28. In my career in forecasting, great explanations often yield very lousy forecasts (Correlation versus causation). "Simplistic" models often can predict well. Understanding possible causation requires, um, thinking about how things work & fit together … or you can just crunch numbers. Alas, we see this even in serious academic work too.

    Taking skewed, non-normal (& possibly kurtotic) data and torturing it until it fits a normal distribution destroys information & when done badly adds bias & skew. Worse… It's all too tempting to bend the data until it fits our preconceived notions.

    Anyway, thanks!

    Comment by @entrep_thinking — January 3, 2010 @ 10:48 am

  29. [...] This post was mentioned on Twitter by hussein kanji. hussein kanji said: RT @davemcclure: RT @bfeld "You Don’t Mean Average, You Mean Median" http://fndry.gr/1364c #VC #startup #metrics [...]

    Pingback by Tweets that mention You Don’t Mean Average, You Mean Median -- Topsy.com — January 3, 2010 @ 3:04 pm


Yes (to JQ’s comment above), and one often does care more about the outliers tugging on the tail, because of potential large effects on the mean from a relatively small number of datapoints. A large positive outlier skews the mean right, making it less “representative” of the data than you might expect if you’re thinking in terms of normal distributions.

Here is a practical example, on which tens of millions of dollars have been spent, and whose results have helped drive computer design for ~20 years.

LOGNORMAL DISTRIBUTION (RIGHT-SKEWED)

Run N benchmark programs on CPUs X and Y (or simulations of candidate designs), giving runtimes TX(i) and TY(i), for i = 1..N. (Computer designers do this often.)

Compute R(i) = TY(i)/TX(i), yielding N relative performance ratios, so that R(i) = 2 means X is 2X faster than Y on that benchmark. [In the general case, ratio distributions might be Cauchy distributions, horrible beasts, but this one isn't, thank goodness.]

Now, compute Z = some function of the R(i) to yield a single figure of merit, the higher the better. The most common way for 20 years, due to the SPEC consortium, has been to use the Geometric Mean, originally because it was less sensitive to outliers than were the other means. A few years ago, we finally figured out good statistical reasons for doing that.

If the R(i) are viewed as a distribution, they often fit a lognormal distribution, which is right-skewed. There are good technical reasons for this, in that computing performance tends to be driven by a multiplication of multiple independent factors (lognormal), rather than addition (normal).

The Geometric Mean is just a simpler way of computing:

Z = exp(average(ln(R(i))))

The usual form of the Geometric Mean obscures the fact that a lognormal distribution might be lurking underneath, but once you check that, then the whole lovely set of properties of normal distributions leaps to your aid. You can compute useful mean, standard deviation, skewness, kurtosis.
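
As a small illustration of the steps above, the sketch below computes the ratios and the figure of merit both ways. The runtimes are invented for the example, and names such as tx, ty, and ratios are only placeholders for TX(i), TY(i), and R(i).

```python
import math
import statistics

# Hypothetical runtimes in seconds for N = 5 benchmarks on CPUs X and Y.
tx = [10.0, 20.0, 5.0, 8.0, 50.0]
ty = [20.0, 30.0, 20.0, 8.0, 60.0]

# R(i) = TY(i) / TX(i): a ratio of 2 means X finished that benchmark in half the time.
ratios = [y / x for x, y in zip(tx, ty)]

# Usual form of the Geometric Mean: N-th root of the product of the ratios.
gm_product = math.prod(ratios) ** (1.0 / len(ratios))

# Log form: Z = exp(average(ln(R(i)))).
logs = [math.log(r) for r in ratios]
gm_logs = math.exp(statistics.mean(logs))

# In the log domain the ordinary mean and standard deviation apply directly.
log_mean = statistics.mean(logs)
log_sd = statistics.stdev(logs)

print(f"geometric mean (product form) = {gm_product:.4f}")
print(f"geometric mean (log form)     = {gm_logs:.4f}")
print(f"mean of logs = {log_mean:.4f}, sd of logs = {log_sd:.4f}")
```

The two forms agree up to floating-point rounding. The log form is usually the safer one to compute with, since a long product of ratios can overflow or underflow.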

Anyway, lognormal is a very useful distribution. If you keep finding right-skewed data, it’s well worth doing the log transform and then running the usual statistics to see if the logs are normal. If so, you may find that there is some underlying multiplicative mechanism … or it may just be that you have a situation where one side is naturally bounded, but the other isn’t, like human weights or sizes of disk files.
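
A quick way to try the log-transform check is sketched below on synthetic data. The sample is drawn from a lognormal distribution on purpose, so the outcome is rigged; the skewness helper is a plain moment-based estimate written for this example, and the seed and sample size are arbitrary.

```python
import math
import random
import statistics


def skewness(values):
    """Moment-based sample skewness: mean cubed deviation over the cubed population SD."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * sd ** 3)


# Synthetic right-skewed data: exp() of a standard normal variable.
rng = random.Random(0)
sample = [rng.lognormvariate(0.0, 1.0) for _ in range(5000)]

# Log transform, then the same statistic on the transformed values.
logs = [math.log(v) for v in sample]

raw_skew = skewness(sample)
log_skew = skewness(logs)

print(f"skewness of raw data = {raw_skew:.2f}")
print(f"skewness of logs     = {log_skew:.2f}")
```

Skewness near zero after the transform is only a hint that the logs are roughly normal. A fuller check would also look at kurtosis, a histogram, or a normal probability plot.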

If for some weird reason you want to know more about the benchmark stuff:
Google: mashey war benchmark means truce
