Extracting Systematic Social Science Meaning from Text

Date Published: Sep 16, 2007


We develop two methods of automated content analysis that give approximately unbiased estimates of quantities of theoretical interest to social scientists. With a small sample of documents hand coded into investigator-chosen categories, our methods can give accurate estimates of the proportion of text documents in each category in a larger population. Existing methods that succeed at maximizing the percentage of documents correctly classified can still yield substantially biased estimates of the category proportions of interest. Our first approach corrects this bias for any existing classifier, with no additional assumptions. Our second method estimates the proportions without the intermediate step of individual document classification, and thereby greatly reduces the required assumptions. For both methods, we also correct statistically, apparently for the first time, for the far less-than-perfect levels of inter-coder reliability that typically characterize human attempts to classify documents, an approach that will normally outperform even hand coding of the entire population when that is feasible. These methods allow us to measure the classical conception of public opinion as those views that are actively and publicly expressed, rather than the attitudes or nonattitudes of the populace as a whole. To do this, we track the daily opinions of millions of people about President Bush and the candidates for the 2008 presidential nominations using a massive data set of online blogs that we develop and make available with this article. We also offer easy-to-use software that implements our methods, which we demonstrate also work with many other sources of unstructured text.
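To illustrate why a classifier with good per-document accuracy can still misestimate category proportions, and how such a bias might be corrected, here is a minimal sketch in Python. The abstract does not spell out the mechanics of the first approach, so this sketch rests on an assumption: that the correction amounts to estimating the classifier's misclassification rates on the hand-coded sample and then solving P(predicted) = P(predicted | true) × P(true) for P(true) in the larger population. All function and variable names are hypothetical and are not taken from the authors' software.

```python
from collections import Counter


def misclassification_matrix(true_labels, predicted_labels, categories):
    """Estimate P(predicted = i | true = j) from the hand-coded sample.

    Rows index the predicted category, columns the true category.
    Every category must appear at least once among the true labels.
    """
    counts = {j: Counter() for j in categories}
    for true, predicted in zip(true_labels, predicted_labels):
        counts[true][predicted] += 1
    return [
        [counts[j][i] / sum(counts[j].values()) for j in categories]
        for i in categories
    ]


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.

    Fails with ZeroDivisionError if the matrix is singular, which happens
    when the classifier's predictions carry no information about the
    true category.
    """
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def corrected_proportions(true_labeled, pred_labeled, pred_population, categories):
    """Correct the raw classified proportions in the unlabeled population.

    Assumption of this sketch: the misclassification rates measured on
    the hand-coded sample also hold in the population.
    """
    matrix = misclassification_matrix(true_labeled, pred_labeled, categories)
    raw_counts = Counter(pred_population)
    raw = [raw_counts[c] / len(pred_population) for c in categories]
    return dict(zip(categories, solve_linear_system(matrix, raw)))


if __name__ == "__main__":
    categories = ["pos", "neg"]
    # Hand-coded sample: 10 truly "pos" and 10 truly "neg" documents.
    hand_coded = ["pos"] * 10 + ["neg"] * 10
    # The classifier labels 8 of the "pos" and 6 of the "neg" correctly.
    classified = ["pos"] * 8 + ["neg"] * 2 + ["pos"] * 4 + ["neg"] * 6
    # Classifier output on 100 unlabeled population documents.
    population = ["pos"] * 52 + ["neg"] * 48
    print(corrected_proportions(hand_coded, classified, population, categories))
```

In this made-up example the classifier labels 52% of the population "pos", but because it mislabels "neg" documents as "pos" more often than the reverse, the corrected estimate of the "pos" share is about 0.30. The figures are invented purely for illustration and do not come from the paper.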


This paper describes material that is patent pending. Earlier versions of this paper were presented at the 2006 annual meetings of the Midwest Political Science Association (under a different title) and the Society for Political Methodology.