## Skewness (Warning: statnerdery ahead)

I’m not all that good at remembering which way various standard distinctions go, especially when I have some underlying doubt about them. In classical hypothesis testing, for example, Type I error involves erroneously rejecting the null hypothesis, while Type II error involves erroneously failing to reject. Since I mostly think in Bayesian terms, I regard the whole classical setup as a fairly arbitrary social convention. One result is that I have to remind myself, fairly regularly, which type of error is which.

I have a different kind of problem with the terminology of skewness. Positive skewness is often called “right skewness”, but it seems to me this is the wrong way around. Suppose I started with a zero-mean symmetrical distribution (say normal) and reduced some of the values near the mode/mean/median. The result would be a distribution with negative mean, mode and median, and positive skewness. In visual terms, the peak of the distribution would be pushed to the left, while the right hand tail would now be long. In ordinary terms, I would say the distribution had been skewed to the left. Any comments?
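The thought experiment above can be checked numerically. The following is a minimal sketch, assuming skewness means the third standardized moment (population form); the helper `skewness` and the small data sets are made up for illustration.

```python
from statistics import fmean, median

def skewness(values):
    """Third standardized moment: mean cubed deviation divided by std dev cubed."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])  # population variance
    m3 = fmean([(x - m) ** 3 for x in values])  # third central moment
    return m3 / m2 ** 1.5

# A symmetric, zero-mean sample (illustrative numbers only).
symmetric = [-3, -2, -1, -1, 0, 0, 0, 1, 1, 2, 3]

# "Reduce some of the values near the mode/mean/median": lower the central values by 1.
shifted = [x - 1 if abs(x) <= 1 else x for x in symmetric]

print(skewness(symmetric))               # essentially zero
print(fmean(shifted), median(shifted))   # both negative
print(skewness(shifted))                 # positive
```

The bulk of the sample moves left, yet the statistic comes out positive, because the untouched right-hand values now sit further from the new mean.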

I too have always thought the definition was counter-intuitive, because intuitively if most of the mass of the distribution is to the left, then I think of it as negatively skewed.

Perhaps it all depends on whether you care most about most of the mass or the extremities.

I tutored a first year quantitative methods class for commerce students, and nobody (including me) could ever get their head around this. I found that the only way I could correctly describe whether a distribution was left- or right skewed was to consciously go against my first impression. In other words, you need to master doublethink.

The negative-positive aspect isn’t a problem for me, since it comes directly from the definition of the central moment. But as regards right-left, I agree with Uncle M – the standard definition makes sense only if you regard skewness as arising from a tug on the tail, rather than a shove on the mode.

Conveniently, another website has just put up a negatively skewed curve:

http://peakenergy.blogspot.com/2009/06/net-hubbert-curve-what-does-it-mean.html

I would call it ‘long tailed to the left’.

Isn’t it fairly simple: with a long right tail, the mean is being skewed to the right (by the outliers in the tail)?

I think “negative rights” and “positive rights” are named the wrong way around. However this means I never forget which is which.

You move the axis to the right? It’s all relative doncha know

Yes (to JQ’s comment above), and one often does care more about the outliers tugging on the tail, because of potential large effects on the mean from a relatively small number of datapoints. A large positive outlier skews the mean right, making it less “representative” of the data than you might expect if you’re thinking in terms of normal distributions.

Here is a practical example, on which tens of millions of dollars have been spent, and whose results have helped drive computer design for ~20 years.

LOGNORMAL DISTRIBUTION (RIGHT-SKEWED)

Run N benchmark programs on CPUs X and Y (or simulations of candidate designs), giving runtimes TX(i) and TY(i), for i = 1..N. (Computer designers do this often.)

Compute R(i) = TY(i)/TX(i), yielding N relative performance ratios, so that R(i) = 2 means X is twice as fast as Y on that benchmark. [In the general case, ratio distributions might be Cauchy distributions, horrible beasts, but this one isn't, thank goodness.]

Now, compute Z = some function of the R(i) to yield a single figure of merit, the higher the better. The most common way for 20 years, due to the SPEC consortium, has been to use the Geometric Mean, originally because it was less sensitive to outliers than the other means. A few years ago, we finally figured out good statistical reasons for doing that.

If the R(i) are viewed as a distribution, they often fit a lognormal distribution, which is right-skewed. There are good technical reasons for this, in that computing performance tends to be driven by a multiplication of multiple independent factors (lognormal), rather than addition (normal).

The Geometric Mean is just a simpler way of computing:

Z = exp(average(ln(R(i))))

The usual form of the Geometric Mean obscures the fact that a lognormal distribution might be lurking underneath, but once you check that, then the whole lovely set of properties of normal distributions leaps to your aid. You can compute useful mean, standard deviation, skewness, kurtosis.
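As a rough illustration of that identity, here is a small sketch using only the Python standard library. The ratios are made-up numbers, not SPEC results.

```python
import math
from statistics import fmean, geometric_mean

# Hypothetical performance ratios R(i) for five benchmarks (illustrative only).
ratios = [0.8, 1.0, 1.25, 2.0, 4.0]

# Geometric mean written as exp(average(ln(R(i)))).
z = math.exp(fmean([math.log(r) for r in ratios]))

print(z)                       # geometric mean via logs
print(geometric_mean(ratios))  # library version, same value up to rounding
print(fmean(ratios))           # arithmetic mean, pulled higher by the 4.0
```

Writing it this way makes the log-scale average explicit, which is the quantity the normal-distribution machinery applies to.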

Anyway, lognormal is a very useful distribution. If you keep finding right-skewed data, it’s well worth doing the log transform and then the usual statistics to see if the logs are normal. If so, you may find that there is some underlying multiplicative mechanism … or it may just be that you have a situation where one side is naturally bounded, but the other isn’t, like human weights or sizes of disk files.
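A minimal sketch of that check, assuming skewness is measured as the third standardized moment. The data are constructed for illustration (exponentials of a symmetric set), so the logs come out symmetric by design; real data would need a proper normality test as well.

```python
import math
from statistics import fmean

def skewness(values):
    """Third standardized moment (population form)."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])
    m3 = fmean([(x - m) ** 3 for x in values])
    return m3 / m2 ** 1.5

# Illustrative right-skewed data: bounded below by zero, long tail to the right.
data = [math.exp(v) for v in [-2, -1, -1, 0, 0, 0, 1, 1, 2]]

# Log transform, then look at the skewness again.
logs = [math.log(x) for x in data]

print(skewness(data))  # clearly positive
print(skewness(logs))  # essentially zero
```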

If for some weird reason you want to know more about the benchmark stuff:

Google: mashey war benchmark means truce

Your problem is that you are thinking about the change from the original symmetrical distribution. Compared to that, your new distribution has been “skewed” leftwards. But starting from a blank page the “new” distribution has a positive skew as the right tail is fatter for longer than the left tail.

I know distributions are abstractions and therefore don’t really have a left and right side but I still tend to think of a large range of problems using the pile-of-sand heuristic, so, skewing is pushing on the side of the pile, and so, my intuitive right skew would be pushing the pile to the right. Which would result in a longer left tail. I just have to remember I’m counterintuitive on skewness. HTH!

For my purposes, treating “the whole classical setup” as a convention is quite satisfactory – labels aren’t always intuitively meaningful to everybody in the same way.

Whether it’s called left or right, the important thing is to avoid implicitly treating a distribution as a Gaussian normal.

Suppose you have a sample of size 10 (yes, small) from a large population, and you want to use the mean as a predictor for what to expect from a similar sample. (After all, much of statistics is about reasoning from sample characteristics to get conclusions about the population).

Suppose that sample item #10 is a large, positive, and truly atypical outlier. It skews your prediction of the mean to the right, and if you take another sample of 10, you will be disappointed, assuming you prefer a larger mean.
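To put numbers on this, here is a small sketch with a made-up sample of 10 in which the last item is the atypical outlier.

```python
from statistics import fmean, median

# Hypothetical sample of size 10; item #10 is a large positive outlier.
sample = [9, 10, 10, 11, 10, 9, 11, 10, 10, 100]

mean_all = fmean(sample)             # dragged to the right by the outlier
mean_typical = fmean(sample[:-1])    # mean of the nine typical items
middle = median(sample)              # barely notices the outlier

print(mean_all, mean_typical, middle)
```

One data point nearly doubles the mean here, while the median stays where the typical values are.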

Exactly this happened with the first SPEC benchmarks in 1989. One benchmark was a little too simple, and it got “cracked” by advanced software (not cheating, just different software technology coming naturally.) That gave unnaturally high numbers, unlike any other program we ever saw. It “skewed” people’s expectations of performance higher (to the right) than was realistic … so we removed it in the next release.

Speaking about skewness, but just slightly off track here… I would like to suggest that some major govt departments (like the ABS and the ATO) have, it appears, from about 1999 (JH’s reign) developed a certain skewness of their own. The female statistics are lacking. From decades and decades ago, female taxable incomes and taxable income recipients in income bands have been reported by the ATO. Suddenly in 1999 the ATO starts reporting a 5% income distribution in which it analyses the percentage of people, e.g. in the first income band, by gender. Alas, it aggregates male and female income in that income band… rendering it useless to compare to long-run income distributions. The ATO now releases unit record data and suggests researchers can rely on this… however, the top 1% of all income earners is truncated (for confidentiality reasons, even though their names are not provided). So the top 1% are missing in action.

Then the ABS omit details they used to report – like the percentage of sole parent families with children, which used to be around 25%. It’s now 16% because they decided to report only the percentage of sole parent families “with children under 15 years of age.” Then there are the wage differentials reported by the ABS between men and women. The ABS only seem to report on the wage differential of full-time working people (males v females). It has been constant at women earning .9 for three years out of the ten. The other years have n/a, n/a, n/a in their boxes. Never mind that women take 75% of all part-time positions. How about some genuine reporting again on female statistics?

I call this skewness and I’m writing to Julia Gillard about it soon, because we need long-term disaggregated statistics that aid comparability. We don’t need JH’s skewed attack (and he attacked the ABS budget) on anything he perceived to be feminism, and believe me, something has gone wrong with the statistics.

They are skewed against women. Yet again!

Sorry JQ – I know this is off track, but something needs to be done to restore the proper status quo that existed in stats reporting by public departments for decades before JH made a mockery of it.

That’s what I call being skewed to the right!

In my high school statistics class, we were told to “name the whale by the tail”. Like you, I thought that the skew should be the other way, but the rhyme is rather convincing.

mean skewed left; variance skewed right

first derivative skewed left; second derivative skewed right

yin skewed left; yang skewed right

I think the explanation might be that positive numbers are to the right and negative numbers are to the left; hence positive skewness (when the statistic is positive) is right skewness, and so on.

I think examples can be constructed so that the mode can be to the left or the right of the mean regardless of whether you have left or right skewness.
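One such construction, as a sketch: a made-up discrete sample in which the skewness statistic (third standardized moment) is positive, yet the mean lies to the left of both the median and the mode.

```python
from statistics import fmean, median, mode

def skewness(values):
    """Third standardized moment (population form)."""
    m = fmean(values)
    m2 = fmean([(x - m) ** 2 for x in values])
    m3 = fmean([(x - m) ** 3 for x in values])
    return m3 / m2 ** 1.5

# Illustrative sample: nine values at -1, ten at 0, and a single value at 4.
data = [-1] * 9 + [0] * 10 + [4]

print(fmean(data))     # -0.25
print(median(data))    # 0
print(mode(data))      # 0
print(skewness(data))  # positive, driven by the lone value at 4
```

The single point in the right tail dominates the cubed deviations, while the cluster at -1 is heavy enough to hold the mean below zero.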