Prevent national health crises by mining public discussion and news to predict vaccination uptake
This is the proposal that I submitted yesterday to the Knight Foundation health data challenge. See the proposal, and vote on it if you like it, at the Knight News Challenge.
Vaccines matter. We want to predict uptake by mining news and social sources. Our initial pilot in Ireland will focus on the uptake of the HPV vaccine, a critical public health issue for women in Ireland.
We want to draw on every available source of messy, unstructured data and use smart algorithms to extract meaning from it. Our “social data” will be unstructured news text, which we will transform, using these algorithms, into knowledge about population opinions that is of use in predicting population-level behavior.
While the big-data literature is replete with “momentum” predictors (e.g., Google searches predicting flu epidemics, Twitter rates predicting movie revenues, frequencies of specific words predicting cultural trends), our work is less about tracking an emerging fad or meme and more about capturing what a whole population is thinking about a particular topic at a given point in time.
Focus of this pilot: women’s health
The project will develop text-analytic techniques for large-scale news and social data to capture population-level opinion, with a view to predicting population-level behavior. The focus of this pilot project is the uptake of the HPV vaccine against cervical cancer in Ireland (which began in 2008). HPV-vaccine uptake is a critical public health issue for cancer rates in women, but one that conflicts with religious views on sexual morality.
With sufficiently large news and social data sets, it has been shown that systematic changes in language use reflect population-level opinions well enough to predict population-level decision-making. Specifically, whole distributions of words (which are typically power-law distributions) reflect the degree of agreement or disagreement in a population on an issue. Systematic week-to-week changes in the power-law distributions of news and social data can reflect the emerging coalescence of opinions on a topic. To date, this approach has been demonstrated in the domain of high finance.
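To make the idea of a weekly power law concrete, here is a minimal sketch, not the method used in the research described here. It assumes each week's articles have been joined into one text string, counts single words (rather than the verb-phrases used in the finance study), and estimates the exponent with a simple least-squares fit on log rank versus log frequency, which is a crude estimator. The function names (`rank_frequency`, `power_law_exponent`, `weekly_exponents`) are hypothetical.

```python
import math
import re
from collections import Counter


def rank_frequency(text):
    """Return word counts sorted from most to least frequent."""
    # Assumption: a word is a run of lowercase letters or apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


def power_law_exponent(frequencies):
    """Estimate alpha in freq ~ C * rank**(-alpha) via a log-log least-squares fit."""
    if len(frequencies) < 2:
        raise ValueError("need at least two distinct words")
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for a decaying distribution; alpha is its negation.
    return -sxy / sxx


def weekly_exponents(weekly_texts):
    """Map each week label to the fitted exponent of that week's word distribution."""
    return {week: power_law_exponent(rank_frequency(text))
            for week, text in weekly_texts.items()}
```

Under this sketch, a week in which writers concentrate on a narrower set of words yields a steeper fitted exponent than a week in which usage is spread evenly, so a run of rising exponents would be one rough signal of coalescing language.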
Systematic changes in the weekly power-law distributions of the words in financial articles have been shown to track trends in the major stock indices (DJI, NIKKEI, FTSE). Using 18,000 articles (10M+ words), it has been shown that, as the 2007 stock bubble emerged, week-to-week changes in the power-law distributions of verb-phrases correlated strongly with market movements. (See previous research on this by team members involved in this proposal, which was covered in The Economist here.)
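As an illustration of what "correlated with market movements" could mean in practice, the sketch below takes two equally long weekly series (for example, a fitted power-law exponent per week and an index closing value per week), converts each to week-to-week changes, and computes their Pearson correlation. This is an assumed, simplified analysis for illustration only; the original study's statistical procedure is not described here, and the function names are hypothetical.

```python
import math


def week_to_week_changes(values):
    """Return the difference between each week's value and the previous week's."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least two")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def change_correlation(weekly_exponents, weekly_index_values):
    """Correlate weekly changes in a language measure with weekly index changes."""
    return pearson(week_to_week_changes(weekly_exponents),
                   week_to_week_changes(weekly_index_values))
```

Correlating changes rather than raw levels is a deliberate choice in this sketch: two series that both drift upward over time can look strongly correlated in levels even when their weekly movements are unrelated.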
These distributional shifts show emerging agreement or disagreement in journalistic language: reporters use a progressively narrower set of words to describe the market, reporting on the same small set of companies in the same, overwhelmingly positive language. This demonstration suggests that news and social data may reflect population opinions well enough to be used to track changes in critical social opinions in health.