Nate Silver, who is beyond brilliant, blogs for the New York Times, mainly on statistical prediction in politics and sports. (He made his money in sports predictions and moved into politics. He has what looks like a liberal bent from the questions he asks, but absolute intellectual integrity.)
In one of his most recent posts, he takes on a model that predicts the 2012 House results. It is an absolutely fascinating analysis of the sources of error in this kind of predictive work. In particular, he urges that one must not look at too many potential predictive variables when one has too few cases to study.
A general rule of thumb is that you should have no more than one variable for every 10 or 15 cases in your data set. So a model to explain what happened in 15 elections should ideally contain no more than one or two inputs. By a strict interpretation, in fact, not only should a model like this one not contain more than one or two input variables, but the statistician should not even consider more than one or two variables as candidates for the model, since otherwise he can cherry-pick the ones that happen to fit the data the best (a related problem known as data dredging).
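Silver's point about cherry-picking can be made concrete with a small simulation. This is a minimal sketch, not his method, and every number in it is invented for illustration: generate 15 "election" outcomes that are pure noise, generate 25 candidate predictor variables that are also pure noise, and then keep whichever predictor happens to correlate best with the outcomes. With that many candidates and so few cases, the "best" predictor will routinely look impressively correlated even though it is meaningless.

```python
import random

random.seed(42)

N_ELECTIONS = 15   # cases, as in Silver's example
N_CANDIDATES = 25  # pool of candidate predictors, all pure noise

# Outcome for each of the 15 "elections" -- pure random noise
outcome = [random.gauss(0, 1) for _ in range(N_ELECTIONS)]

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# "Data dredging": try 25 noise predictors, keep the best-looking one
best = max(
    abs(correlation([random.gauss(0, 1) for _ in range(N_ELECTIONS)], outcome))
    for _ in range(N_CANDIDATES)
)
print(f"Best |r| among {N_CANDIDATES} noise predictors: {best:.2f}")
```

Run it and the winning correlation is typically well above 0.5, which would look "highly significant" if you pretended the winning variable was the only one you had ever considered. That is exactly the trap the quoted rule of thumb guards against.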
What the blog focuses on is not so much the prediction itself (a Republican hold of the House) as the asserted extremely high level of statistical confidence in that prediction. The blog undercuts this by showing that adding only one more election to the model radically changes not so much the most likely result as the range of significantly possible results. This is in some ways parallel to the discussion of the Jim Greiner paper (NewsMaker Interview, with links here), which in my opinion should perhaps be less about the accuracy of the result, and more about the chance that the finding of no statistically measurable impact from offers of representation could mask an actual impact, big or small. In other words, I would suggest that the concern is not so much that there is no impact from the offer of representation, but rather that any impact is small. (This is without getting into the detailed discussion of the impact of the fact that about half of those who were denied representation in the randomization process in fact found it somewhere else.)
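The worry that a "no measurable impact" finding could mask a real one is a question of statistical power, and it too can be sketched with a toy simulation. All the numbers below are hypothetical (they are not the Greiner study's figures): suppose offers of representation truly raise the win rate from 50% to 60%, and suppose 100 cases per group. How often would a standard significance test actually detect that real effect?

```python
import math
import random

random.seed(0)

def two_prop_z(p1, p2, n1, n2):
    """z statistic for a difference of two proportions (pooled)."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

def simulate_once(n=100, p_treat=0.60, p_ctrl=0.50):
    """One hypothetical trial: does a real 10-point effect reach p < .05?"""
    wins_t = sum(random.random() < p_treat for _ in range(n))
    wins_c = sum(random.random() < p_ctrl for _ in range(n))
    z = two_prop_z(wins_t / n, wins_c / n, n, n)
    return abs(z) > 1.96  # "statistically significant" at the 5% level

trials = 2000
power = sum(simulate_once() for _ in range(trials)) / trials
print(f"Chance of detecting a real 10-point effect with 100 per group: {power:.0%}")
```

Under these made-up assumptions the test detects the genuine 10-point effect only roughly a third of the time, so a non-significant result is entirely compatible with a real, modest benefit. That is the sense in which a null finding can "mask" an impact.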
While the election work is not necessarily quite the same as the statistical processes we use in access to justice, the general cautions bear attention. Maybe some of you statisticians out there can give it some thought and comment.
Above all, just read Nate Silver (list of posts at this link) for the sheer joy of his intellectual clarity, and use it to think about the kinds of questions we might use data to ask and answer.