Umlud's place: Misusing statistics: Just because a statistic feels right, doesn't make it actually correct

At the end of an article about the recent judicial decision about same-sex marriage in Arkansas, there was this disclaimer:

"I’m a lawyer, but there’s only a 2% chance I’m licensed in your state."

This statement is an apparent example of a statistic that just "feels right" as opposed to being accurate. Let me explain why.

In most of the United States, lawyers can only practice in a state in which they are a member of the bar. Lawyers who pass a state's bar examination can become members of that state's bar, but this does not mean that they can practice law in another state (unless they are given special dispensation). For example, if you're a member of the California state bar, you can practice anywhere in California, but nowhere in Wyoming. This means that there are 50 separate state bars (most hold a monopoly on the practice of law in their state; a minority do not, but even there it is effectively difficult for outsiders to practice law).

Now, it appears that the blog author got their number by dividing 100% by the number of state bars (50). This calculation "feels right", but it is methodologically wrong, because it presumes that the US readership of the blog is equally distributed across all 50 states. In other words, if there are 50,000 US readers of the blog, and the readership were equally distributed across all states, then there are 1,000 readers in each state; no more and no less. This means that there are 1,000 readers from Wyoming, constituting ~0.2% of Wyoming's population. It also means that there are exactly 1,000 readers from California, constituting a relatively paltry ~0.003% of California's population. This assumption of equal readership is almost assuredly wrong. (However, without readership data, one cannot say for certain that it's wrong.)
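The arithmetic behind the equal-distribution assumption can be sketched in a few lines of Python. This is only an illustration: the 50,000-reader figure is the hypothetical one used above, and the two population values are rough 2014-era approximations that I am assuming, not exact census counts.

```python
# Rough, assumed 2014-era state populations (approximations, not census data).
APPROX_POPULATION = {
    "Wyoming": 584_000,
    "California": 38_800_000,
}


def uniform_readers_per_state(total_readers, n_states=50):
    """Readers per state if readership were spread evenly across all states."""
    return total_readers / n_states


def percent_of_population(readers, population):
    """Readers expressed as a percentage of a state's population."""
    return 100 * readers / population


# Hypothetical US readership of 50,000, as in the example above.
readers = uniform_readers_per_state(50_000)
for state, population in APPROX_POPULATION.items():
    pct = percent_of_population(readers, population)
    print(f"{state}: {readers:.0f} readers, {pct:.4f}% of residents")
```

The same 1,000 readers are a visible slice of Wyoming and a rounding error in California, which is why the equal-split assumption is hard to believe.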

Of course, a state's share of the national population is not likely to map perfectly onto its share of the blog's national readership, either (let alone for the state in which the author is a member of the bar). If it did, then the likelihood of a US reader being in the author's state would be ~12% if the author were a member of the California bar, but only ~0.2% if they were a member of the Wyoming bar. If we presume that state population is a direct and accurate proxy for blog readership, and that the author was accurate in stating their 2% chance, then we would be looking for a state with 2% of the US population, or ~6.3 million people. Looking at the demographics of the US, there is no state with exactly that population, but Missouri (~6.0 million, ~1.9% of the US population) and Tennessee (~6.5 million, ~2.1% of the US population) are the closest. But state populations are not likely a direct proxy for readership, which means that - even if the author is a member of the Missouri or Tennessee state bars - it's unlikely that readers from that state represent 2% of the blog's readership.
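The population-proxy reasoning can be sketched the same way. All figures below are rough 2014-era approximations that I am assuming for illustration, and `closest_states` is a hypothetical helper name, not anything the blog author used.

```python
# Assumed, approximate 2014-era figures; for illustration only.
US_POPULATION = 318_900_000
APPROX_STATE_POPULATION = {
    "California": 38_800_000,
    "Wyoming": 584_000,
    "Missouri": 6_060_000,
    "Tennessee": 6_550_000,
}


def population_share(state_population, us_population=US_POPULATION):
    """A state's population as a percentage of the US population."""
    return 100 * state_population / us_population


def closest_states(target_percent, populations, k=2):
    """The k states whose population share is nearest to target_percent."""
    def distance(state):
        return abs(population_share(populations[state]) - target_percent)
    return sorted(populations, key=distance)[:k]


for state, population in APPROX_STATE_POPULATION.items():
    print(f"{state}: {population_share(population):.2f}% of the US population")

# Which of the listed states come closest to a 2% share?
print(closest_states(2.0, APPROX_STATE_POPULATION))
```

Only the four states mentioned in the text are included, so the "closest" result is closest among those four under the assumed figures.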

Indeed, many additional factors determine a state's readership rate, which makes using only each state's share of the national population flawed as well (although likely more accurate than assuming equal readership numbers across all states). For example, if there were a significant positive association between readership and liberalness in state politics, then the share of readers from California would be higher than 12%, while Wyoming's share would drop below 0.2%. (The shares of readers from Missouri and Tennessee would likely also be lower than their population-based proportions of ~2%.)

The easiest way to determine the percent chance that the average reader resides in the state where the author practices law would be to use the readership statistics of the blog (assuming that (1) usage trackers adequately capture the rates of readership across different states, (2) the geographic distribution of the readership doesn't change dramatically during and after the data collection period, and (3) the likelihood of writing in with a legal question is equivalent across all readers of the blog). After a representative sample of the readership of the blog is collected, all that is needed is to divide the average number of readers from the state in which the author holds bar membership by the average total US readership, and then there would be a far more accurate percent chance to report in the disclaimer.
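As a minimal sketch of that division, assuming the blog's tracker can export per-state reader counts for each collection period: the function name `home_state_share` and every count in `SAMPLE_PERIODS` are made up for illustration, since I have no access to the blog's actual readership data.

```python
from statistics import mean


def home_state_share(periods, home_state):
    """Percent of US readers from home_state, averaged over collection periods.

    periods: list of {state_name: reader_count} dicts, one per period.
    """
    if not periods:
        raise ValueError("need at least one collection period")
    avg_home = mean(p.get(home_state, 0) for p in periods)
    avg_total = mean(sum(p.values()) for p in periods)
    if avg_total == 0:
        raise ValueError("no readers recorded")
    return 100 * avg_home / avg_total


# Entirely hypothetical counts for two collection periods.
SAMPLE_PERIODS = [
    {"California": 900, "Tennessee": 120, "Wyoming": 10, "Other states": 3970},
    {"California": 1100, "Tennessee": 80, "Wyoming": 20, "Other states": 3800},
]

for state in ("California", "Tennessee", "Wyoming"):
    print(f"{state}: {home_state_share(SAMPLE_PERIODS, state):.1f}% of US readers")
```

With real tracker data in place of the invented counts, the value for the author's own state is the number that would belong in the disclaimer.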

Of course, the sense and purpose of such a statistic would not be readily apparent to most people (for whom the 2% chance statistic makes better "gut sense"). For example, if 10% of the blog readership comes from the author's state, then reporting that number would have some readers (perhaps many readers) thinking that the author is a member of 5 different state bars, when in fact the number (10% in this example) is merely a reflection of the author's association with a particular state's bar (and thus the population of potential clients within that state) in a landscape of an unequally distributed readership population. In other words, the reported number would be accurate, but potentially highly confusing, and - therefore - of dubious utility.

In sum, I'm making this long explanation because the statement of "there's only a 2% chance" appears to use an analogue of the bad logic used by some creationists, who use the "50-50 chance" canard to place an equivalence between the existence and non-existence of God. If it's wrong for creationists to misuse statistics, it's wrong for rationalist lawyers to misuse statistics, too.

Note 1: If there's no problem in doing so, you could just tell people which state bar you belong to. In that way, you don't need to resort to either incorrect-statistics-that-"feel"-right or correct-and-obtuse-statistics to describe what your statement of "there's only a 2% chance" is apparently trying to convey.

Note 2: It is perfectly possible that you've done the statistical calculation and that the readership from the state in which you practice law is *indeed* 2% of the total US readership of this blog, but the perfect coincidence with the "feel-good" nonsense statistic of 100%/50 states = 2% makes the statistic seem "fishy". (But not impossible.)

Sunday, May 11, 2014