We develop two methods of automated content analysis that give approximately unbiased estimates
of quantities of theoretical interest to social scientists. With a small sample of documents
hand coded into investigator-chosen categories, our methods can give accurate estimates of the
proportion of text documents in each category in a larger population. Existing methods that succeed
at maximizing the percentage of documents correctly classified can nonetheless yield substantially
biased estimates of the category proportions of interest. Our first approach corrects this bias for any
existing classifier, with no additional assumptions. Our second method estimates the proportions
without the intermediate step of individual document classification, and thereby greatly reduces
the required assumptions. For both methods, we also provide what is apparently the first statistical
correction for the far-from-perfect levels of intercoder reliability that typically characterize human
attempts to classify documents; with this correction, our methods will normally outperform even hand
coding of the entire population when that is feasible. These methods allow us to measure the classical conception of public
opinion as those views that are actively and publicly expressed, rather than the attitudes or nonattitudes
of the populace as a whole. To do this, we track the daily opinions of millions of people
about President Bush and the candidates for the 2008 presidential nominations using a massive
data set of online blogs that we develop and make available with this article. We also offer easy-to-use
software implementing our methods, which, as we demonstrate, works with many other sources
of unstructured text.
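The bias correction for an existing classifier can be illustrated in miniature. In aggregate, the observed category shares produced by a classifier equal a misclassification matrix times the true shares; once that matrix is estimated from the hand-coded sample, inverting the relationship removes the aggregation bias. The sketch below (NumPy, with made-up illustrative numbers; a generic version of this idea, not the paper's actual estimator) shows the mechanics:

```python
import numpy as np

def corrected_proportions(misclass_matrix, observed_props):
    """Correct aggregate category proportions for classifier error.

    misclass_matrix[i, j] = P(classified as i | truly j), estimated
    from a hand-coded sample; observed_props[i] = share of documents
    the classifier put in category i.  Solving
        observed = M @ true
    for `true` removes the aggregation bias of the raw counts.
    """
    est = np.linalg.solve(misclass_matrix, observed_props)
    # Numerical noise can push estimates slightly outside [0, 1];
    # clip and renormalize so they remain proper proportions.
    est = np.clip(est, 0.0, 1.0)
    return est / est.sum()

# Illustrative two-category example: the classifier recovers 80% of
# true category-1 documents and 70% of true category-2 documents.
M = np.array([[0.8, 0.3],
              [0.2, 0.7]])
true = np.array([0.4, 0.6])   # true proportions (unknown in practice)
observed = M @ true           # raw classification reports [0.5, 0.5]
print(corrected_proportions(M, observed))  # recovers [0.4, 0.6]
```

Note that even a fairly accurate classifier (here, 75% of documents correctly classified) reports 50/50 shares when the truth is 40/60; the correction recovers the true proportions exactly in this noiseless example.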
This paper describes material that is patent pending. Earlier
versions of this paper were presented at the 2006 annual meetings of the Midwest Political Science Association (under a different title) and the Society for Political Methodology.