Blog Archive

Sunday, January 10, 2016

Beware the catalogue and the taxonomy?

Marginalization. Discrimination. Eugenics. In their paper titled "Big data problems we face today can be traced to the social ordering practices of the 19th century," Hamish Robertson and Joanne Travaglia of the University of New South Wales explore traditions of using data for social order, and compare them to modern data practices.

Are catalogues, taxonomies and other organizational/cognitive methods inextricably linked to power and control? Or are they inherently neutral, sense-making tools? The path to power is paved by intent and purpose - in particular statistical descriptive methods that, in crude form, define a 'normal' society, and deviations from that norm.

Robertson and Travaglia raise legitimate concerns, given the history and use of such tools. But rather than focusing on the tools and techniques, they could explore the importance of the accessibility of both data itself and tools/techniques to a much broader set of society.

Whether known as 'democratizing data' or 'occupation', the important question is how we foster and enable independent data analysis, reporting, and journalism. This freedom is precious, for it allows us to see different viewpoints and defend against narrow agendas. DataKind and ProPublica are just two examples of funded organizations that continue this hard and important work.

The critical and central topic for years to come should be whether laws, in conjunction with free enterprise, will be effective at governing data privacy, ownership and control. Can we be clever enough to protect the rights of those who spend capital to acquire data, while fostering free competition for that data? At what point do certain types of data or certain scales of data create unfair or even dangerous, limited power?

And, for organizations like DataKind and ProPublica, how do we further develop their economic sustainability without ties to or dependence on key sponsors who, inevitably, will have their own limited agendas?


Monday, January 4, 2016

Happy New Year!

Changes underway but 2016 content will be the same. New posts coming soon!

Sunday, March 15, 2015

Brian Williams, Dan Rather...and Data Journalism

News media outlets continue to struggle through embarrassing occasions of flagship news presenters such as Brian Williams and Dan Rather losing their credibility. Williams and Rather were powerful representatives of trust and longtime commitment to journalism. These journalists allegedly made conscious decisions to present unreliable stories. Both men shared the same outcome: their professional careers toppled.

The Excitement of Data Journalism…and the Looming Failure

We can make journalism more democratic and participatory through data journalism. No longer will we have to depend upon a handful of news outlets whose content is driven by advertising revenue and subject to the filter of producers and entrenched news readers. Instead, independent teams can work with deep troves of public data to reveal unseen trends and truths that matter.

In order to release, analyze, and present the massive surge of information that is becoming available to us, we have to use code. It's the first tool we have to use to convert the noise to signal. But it seems that we have lost perspective on the remaining steps, process and overall governance of what is good journalism versus mere presentation of data for the sake of supporting a point of view. Are we reducing our definition of data journalism down to...coding? 

This is a theme which I see repeated again and again on Twitter, in blog discussions, and in general purpose data journalism propaganda. It was good, however, to see that Paul Bradshaw created a more thoughtful discussion in his blog. It's a discussion between the well-known and respected Alberto Cairo and other contributors on the importance of coding as a data journalism skill.

Points made:
  • Developing a story is important
  • The key skill for a data journalist is knowing whether the data in question is actually interesting
  • A journalist should know some basic CS and coding
Two points are still missing. First, journalism is a discipline, which is why we have whole schools of higher learning dedicated to the craft. Second, in order for us to be engaged and to be given value, journalism must be credible. Otherwise, we will be no better off than our current state. We refer to journalism as the fourth branch of government, and for good reason.

The Common Thread through Journalism and Credibility

If you Google "Journalism" and "Credibility" you'll get plenty of results that refer to an important word: sources. Data provenance is critical. Clean, credible sources are key to journalism and all other professions in which we are entrusted to tell stories and communicate points of view with data.

But when it comes to data journalism, does anyone have a good example of discussion and attention to the credibility of the data, and its provenance? Or even a solid method for dissecting what the data means? Field by field? Record by record? A method for building a validated data dictionary, prior to taking the data and shaping it into a story? 
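One possible starting point, offered only as a rough sketch and not as an established method: profile every field across every record, then leave explicit blanks for the questions that only the data's source can answer. The function names, the sample data, and the `meaning` and `verified_with` keys below are all hypothetical.

```python
import csv
import io
from collections import Counter


def infer_type(value: str) -> str:
    """Classify a raw string as 'empty', 'integer', 'number', or 'text'."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "number"
    except ValueError:
        return "text"


def draft_data_dictionary(rows):
    """Summarize each field across all records as a starting point for review."""
    rows = list(rows)
    fields = list(rows[0].keys()) if rows else []
    dictionary = {}
    for field in fields:
        # Short rows yield None for missing trailing fields; treat them as empty.
        values = [(row.get(field) or "") for row in rows]
        types = Counter(infer_type(v) for v in values)
        non_empty = [v.strip() for v in values if v.strip()]
        dictionary[field] = {
            "observed_types": dict(types),
            "missing": types.get("empty", 0),
            "distinct": len(set(non_empty)),
            "examples": sorted(set(non_empty))[:3],
            # To be filled in by a person after checking with the data's source.
            "meaning": None,
            "verified_with": None,
        }
    return dictionary


# Hypothetical sample data; a real file would be read with open(path, newline="").
SAMPLE = """county,year,cases
Alpha,2010,12
Beta,2010,
Alpha,2011,n/a
"""

dictionary = draft_data_dictionary(csv.DictReader(io.StringIO(SAMPLE)))
for field, summary in dictionary.items():
    print(field, summary)
```

A field that mixes integers, blanks, and text (like `cases` in the sample) is exactly the kind of thing to ask the source about before any story is shaped around it.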

Readers: what are your experiences and suggestions? 

Saturday, August 2, 2014

The Price of Life: Eating and Isolating Ourselves to Death.

Wired Magazine has published an infographic displaying the National Institutes of Health's budgetary allocations for work to prevent the top causes of death. 

The second sentence of the introductory content asks: why do we spend more than $13,000 for each person who dies of diabetes but only about $3,000 for each heart disease victim? 

Sadly, the most dramatic diseases - heart attacks and cancer - are the ones we pay the most attention to. Yet we pay less attention to the highly prevalent, pervasive and destructive yet manageable diseases: diabetes and mental illness.  Diseases that shouldn't be killing us. Yet they are, every day. Diseases that we can prevent and minimize, as shown here: 

But prevention and minimization are thwarted by a web of poverty, food policies and social stigma that prevent us from solving these destructive patterns. We are eating ourselves to death, and dying from our separation and isolation from one another. 

We in the United States consider ourselves a 'developed' nation. We have a lot to be proud of, but we have a lot more work to do. Without that work, we will continue to allow a human catastrophe.   

Tuesday, March 25, 2014

Vaccinations and Data Viz: A Case Study (Part 3 of 3)

In the third and final part of our case study (see Part 1 and Part 2 if you haven't previously read them), we attempt to develop an alternative to a bubble chart visualization of the outbreaks of vaccine-preventable diseases (VPDs) world-wide. In the first and second parts, we evaluated the visual effectiveness of the bubble chart, the provenance of the data, and whether the context was appropriate. As we explained, we felt that there might be better visual options for this story. 

Again, here's the visual, and the link to the original L.A. Times blog

As we developed our own alternative visuals, we came to our most important conclusion: the story was probably miscast. The story is neither a global nor a national story. It's in fact a local story. And it’s not about history or trends. It’s about risk. Local communities have high local risk. Local communities can take local action to protect themselves against outbreaks, which underscores the importance of being community-minded about how we increase and maintain immunity.

Through this series, we found a few things:
  • We cannot generalize the state of vaccine-preventable diseases (VPDs) - globally, or even nation-wide. There are some areas that have been unaffected; others, substantially. 
  • And, in fact, taken as a whole nationally, despite the "anti-vaccination movement," as it has been dubbed, the immunization rates haven't significantly changed. 
  • But locally, we are having significant and scary outbreaks. This highlights the importance of looking at the parts as well as the sum. Vulnerability is highly varied by community and by context. We have clusters of high risk in what seems to be a low risk population.
So, in this final blog post, we talk about some alternative visualizations that might better convey the story, which is about local vulnerability and risk. But first, to explain each of the points above with some examples: 

1. We cannot generalize as a whole the state of VPDs. Here's an exciting visual that we created that depicts the recent trend in pertussis cases, aka whooping cough, that might be as visually compelling as the original bubble chart. In this case, instead of a geographic bubble chart, we wanted to show a historic trend line. The result: a compelling message that surprised us. Have we really regressed to the 1950s? And we wondered: what if other things, like technology, regressed in our country to the level of the 1950s? Would we accept that? Of course not! 

But not all VPDs have this same trend. Here's an alternative, seemingly positive outcome for measles. 

However, in the face of this assertion, we are also simultaneously experiencing an outbreak of measles in New York's hospitals!  So in fact, it's possible to advertise this victory, and yet measles continues to be imported into the United States, where vulnerable populations may be exposed and come down with the virus.  

2. Despite the anti-vaccine movement, nation-wide immunization rates for the primary VPDs haven't recently declined.  The U.S. Centers for Disease Control and Prevention's survey of vaccinations from 1995 to 2011 does not show (in the aggregate) statistically significant downward trends in vaccinations. 

So again, if we aggregate vaccination trends at the national level, we aren't making a compelling case. 

3. This highlights the important point that community vulnerability - not national vulnerability - is our starting point for action and communication. Why? Because as community members, we should and can take action in our local communities. And furthermore (fortunately!) we don't yet have compelling statistics upon which we can make a compelling case for action. However, we can talk about vulnerability - and risk. 

In 2010, a localized outbreak in California of 9,000 pertussis cases represented one-third of all pertussis cases nationwide. In addition to the waning effectiveness of the current vaccine, it also appears that clustering of unvaccinated individuals played a role. Census districts with a statistically significant higher number of exemptions (referred to as an 'exemption cluster') were 2.5 times more likely to also be in a pertussis outbreak cluster. 

Outbreaks are related to the immunity of the population - driven by, for example, who has been vaccinated or has been previously exposed to the disease. The more that unvaccinated (and presumably vulnerable) people are clustered in a community, the higher the chance they will contract the disease, turning it into an outbreak.  

This diagram (credit: National Institute of Allergy and Infectious Diseases) shows the basics of how a community's composition can affect the likelihood of an outbreak: 

4. We need to bring the data visualizations and the messages down to the local level. The real story is local. Since, fortunately, most localities do not have a history of VPD outbreaks, we have to think about compelling ways to report and show risk.  

So here, to satisfy the brief, we thought about ways to present statistics in terms of the community - in this case, the potential for exposure to others. We wanted to illustrate the connectedness of the community, and the implied consequences of behavior. We imagined a kind of Public Service Announcement (PSA) scorecard for each community, by disease, where the centerpiece visual would be the connectedness of a single infected child to their community, showing the geometric effect where many more could be exposed:

Overall, the scorecard should support the message of consequence in the context of a community. 
To support the PSA scorecard, we relied on discussions from The Journal of Infectious Diseases, the CDC's MMWR reports, and West J Med's report of a 1990 California outbreak of measles to generate the statistics and assumptions embedded in this PSA. We are not epidemiologists or health workers, so this is only a placeholder for what might be more appropriate statistics and better handling by said professionals! It does assume, however, that at least some of the county or district health departments in the United States would be giving thought to the following statistics or calculations: 
  1. Immunization rates in their community;
  2. Estimated 'R' rates of diffusion/transmission, especially in school systems based on attendance and classroom conditions;
  3. The likelihood of infection rates, based on #1 and #2;
  4. Estimated hospitalizations and deaths based on their age and health demographics;
  5. Average hospital and other related medical expense statistics.
It's a strong message, and a worst-case scenario. Although we've qualified some of the statements with hypotheticals, it may be excessive. We assume that health departments have to find the right balance between developing strong language that encourages community health-mindedness and showing sensitivity to those who truly have medical exemptions or other significant religious concerns.  

And, importantly, we assume that county and district health departments would have the resources to collect, compile and regularly produce not only the statistics #1 - #5, but also to be able to produce and distribute the information in the format we've provided above. So, we think of it as a starting point, but perhaps the actual implementation of the solution might have to be iterative and even more grassroots. 
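To make the chain from statistics #1 - #5 to a scorecard concrete, here is a minimal sketch of how the numbers might be combined. It relies on two textbook simplifications: a herd immunity threshold of 1 - 1/R0, and an effective reproduction number equal to R0 times the non-immune share of the community. It also treats spread as a simple geometric series that ignores the shrinking pool of susceptible people. We are not epidemiologists, so the names, the formulas chosen, and especially the input values are illustrative assumptions, not a validated model or real data.

```python
from dataclasses import dataclass


@dataclass
class CommunityInputs:
    """Hypothetical inputs; real values would come from a local health department."""
    immunization_rate: float         # statistic 1: fraction vaccinated, 0..1
    vaccine_effectiveness: float     # fraction of vaccinated people actually protected
    r0: float                        # statistic 2: basic reproduction number
    hospitalization_rate: float      # statistic 4: fraction of cases hospitalized
    cost_per_hospitalization: float  # statistic 5: average cost in dollars


def herd_immunity_threshold(r0: float) -> float:
    """Immune share above which each case infects, on average, fewer than one person."""
    return 1.0 - 1.0 / r0


def effective_r(inputs: CommunityInputs) -> float:
    """Statistic 3 (simplified): average new infections per case in this community."""
    immune_share = inputs.immunization_rate * inputs.vaccine_effectiveness
    return inputs.r0 * (1.0 - immune_share)


def projected_cases(inputs: CommunityInputs, generations: int) -> float:
    """Cumulative cases from one infected child after some transmission generations.

    Geometric series 1 + R + R^2 + ...; ignores depletion of susceptibles.
    """
    r = effective_r(inputs)
    return sum(r ** k for k in range(generations + 1))


def scorecard(inputs: CommunityInputs, generations: int = 3) -> dict:
    """Combine the inputs into the headline numbers a PSA scorecard might show."""
    cases = projected_cases(inputs, generations)
    hospitalizations = cases * inputs.hospitalization_rate
    return {
        "herd_immunity_threshold": herd_immunity_threshold(inputs.r0),
        "effective_r": effective_r(inputs),
        "projected_cases": cases,
        "projected_hospitalizations": hospitalizations,
        "projected_hospital_cost": hospitalizations * inputs.cost_per_hospitalization,
    }


# Purely illustrative values, chosen for easy arithmetic, not real statistics.
example = CommunityInputs(
    immunization_rate=0.5,
    vaccine_effectiveness=1.0,
    r0=4.0,
    hospitalization_rate=0.2,
    cost_per_hospitalization=1000.0,
)
for label, value in scorecard(example).items():
    print(f"{label}: {value:,.2f}")
```

Even a sketch this crude makes the dependency visible: the scorecard is only as credible as the local immunization and transmission estimates that feed it.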

And finally - it goes back to the issue of whether a media organization with a national reach can effectively describe this issue in national terms. Our conclusion: it's valuable in terms of bringing attention to the issue, but the real value, as it often turns out, is illustrating what it means to you and me. 

What are your thoughts? Is this sensible? Excessive? Something else? 

We said this at the beginning: this is NOT the forum to debate whether there is a link between vaccination and autism. It's rather the forum to debate the effectiveness and validity of the original visualization, and our proposed alternatives. Please limit your comments to those on-topic. 

- Michael Thompson, Vivian Peng, Adam Vigiano

Tuesday, February 25, 2014

Vaccinations and Data Viz: A Case Study (Part 2 of 3)

In part one of our case study, we took a look at a recent Los Angeles Times blog titled "The Toll of the Anti-Vaccination Movement, In One Devastating Graphic". We illustrated some problems with news media repurposing information graphics to advance a story or idea, in particular using:
  • 'exciting' versus optimal representation of data 
  • out-of-context data
Our point in the first part of the case study was that news media has a responsibility to produce the best data and presentation of important health-related issues. Displays of information need to be crafted for the piece, and integrated into the piece in the form of citations and qualifications. If we only treat the display of information as a photo accompaniment for an article, we diminish the value and power of the story that can be generated.  

Philosophically, we want to make information exciting, truthful, and human. We think that it's possible to create eye-catching, audience-drawing displays of information that feature truthful illustrations of data. 

In the second part of this blog, we'll be taking a look at the work that might be involved to generate a more compelling and informative visual picture of this particular issue.  

For Comparison: Visual Representation of Data

This is a tracking of Pertussis (Whooping Cough) cases (provided by the U.S. Centers for Disease Control) in the U.S. since 1920, plotted by Vivian Peng.  

Here, we have a striking time series trendline of data that provides a comparison of outbreaks. Looking at this trend, we can hypothesize a correlation for three important inflection points.  The first is the introduction of the DPT vaccine in 1942.  The second is considerable, widespread concern during the late 1970s and early 1980s that the DPT vaccine was causing infantile brain damage (evidence later mounted against this claim). The third is between 2000 and the present day, possibly related to the Wakefield-generated vaccination concerns.   

Let's now look back at the original selection of a visual:

To recap, the source of the data for the visualization was the Council on Foreign Relations' tracking and collection of news media reporting on vaccine-preventable diseases.  This scraping was used to generate ongoing information for monitoring versus annual summary data for reporting.  

The CFR's scraping of media reporting, and the statistics generated using that scraping, are inherently less rigorous than a proven methodology that an organization like the World Health Organization or the United States' Centers for Disease Control would use.  This is not to say that either organization has a perfect process, but we will assume that the CDC and WHO will generally have more complete and validated information. And to be clear, the scraping approach better suited the CFR's desire to track outbreaks in a more timely fashion.   

In defense of the L.A. Times, the visualization (should a reader click through to the CFR site) allows zooming into the map and filtering for certain diseases.

It should be clear to everyone that a time-series of data here would be much more effective at communicating what happened and helping people develop hypotheses. However, it's in competition now with this admittedly striking image. The CFR presentation evokes an impression of organic infection. It's eye-catching, and clearly audience-catching. The time series trendline? Unless the audience is a maven of pure data visualization, it's hard to expect that they'd be drawn to the trendline.  

In the third and final part of the case study, we'll take this rendering and try to make it as visually arresting as possible, while still maintaining the integrity of the information and the reporting. And, we'll look at how the visual can be better integrated into the written copy and overall story. 

We'll be heading into difficult territory, mindful of purists' admonitions such as "no chartjunk" and "every pixel must convey information." It's going to be a difficult brief.  

Michael Thompson, Adam Vigiano, Vivian Peng

Monday, February 3, 2014

Frankenstein's Creature, Vaccinations, and Data Viz: A Case Study (Part 1 of 3)

Beginning in the late 1700s and continuing through the first part of the 1800s, galvanism was an exciting topic for western society. Galvanism, named after Italian scientist Luigi Galvani, supposed that electricity was the primary animator of biological mechanisms. Galvani theorized, based on experiments in his laboratory, that if electricity could be accurately channeled into biological organisms, an "operator" could physiologically direct an organism (alive or deceased) according to his or her whims.  

Inspired by this idea, Mary Shelley wrote her famous work Frankenstein; or the Modern Prometheus. Part science fiction, horror, tragedy, and social commentary, the story tells of Dr. Frankenstein's Creature blundering and crashing about through 19th century society as it becomes self-aware and struggles with human realities. The Creature experiences disaster and psychological torment due to its horrible dislocation from its proper context: a natural birth, a nurtured upbringing, a social life, and a final resting place in death.  

In this blog, we'll deal with a slightly less dramatic but still fundamentally important galvanization: the repurposing of a data visualization for a rhetorical news feature. 


A January 20th feature in the Los Angeles Times business section by Michael Hiltzik, titled "The Toll of the Anti-Vaccination Movement, In One Devastating Graphic" referred to a Council on Foreign Relations visualization of reported outbreaks of vaccine-preventable diseases. 

Hiltzik writes that the outbreaks shown in the visualization are "an artifact of the anti-vaccination movement." The anti-vaccination movement he refers to here is represented primarily by parents who defer or avoid their children's vaccinations due to fears of the supposed linkage between autism and vaccines.  

To be clear: this blog posting is NOT attempting to settle or even debate linking autism to vaccinations. Rather, this blog posting is illustrating the difference between repurposing of information visualizations for the purposes of advancing arguments, and developing original, in-depth data journalism.  

The data visualization originally caught our attention due to its less-than-effective use of a "bubble" chart in a point map format. The article features a still image of the visualization at a global "height" where one can see most of Europe and the Americas. Bubble charts make excellent imagery for various nasty things, evoking thoughts of mold or petri dishes.  But here, bubbles overlap and it is impossible to discern the geographical location of the outbreaks. This is simply solved by visiting the interactive site where we can zoom in and better discern where the outbreak occurred. However, we are then caught up in a slightly less difficult but still significant problem of comparing the relative magnitude of outbreaks. This problem stems from the well-documented perceptual limitations of bubble charts.

The data doesn't offer control groups or relative comparisons to history or proportion of overall population. Consequently, there's no baseline indicating a 'normal' level of outbreak, or expected levels of outbreak relative to the percentage of population vaccinated. Furthermore, the only longitudinal visualization of change is a slider that allows us to see five years of data - and an unclear trend.  
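To illustrate why the missing population baseline matters, here is a small sketch that converts raw case counts into cases per 100,000 residents. The region names and figures are invented purely for illustration; they are not drawn from the CFR data.

```python
def rate_per_100k(cases: int, population: int) -> float:
    """Express a raw case count as cases per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases / population * 100_000


# Hypothetical regions as (reported cases, population); not real data.
regions = {
    "Region A": (500, 10_000_000),
    "Region B": (50, 200_000),
}

for name, (cases, population) in regions.items():
    rate = rate_per_100k(cases, population)
    print(f"{name}: {cases} cases, {rate:.1f} per 100,000")
```

On a bubble map sized by raw counts, the hypothetical Region A would dominate, yet Region B has the higher rate once population is taken into account.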

The source data for the visualization is available at the Council on Foreign Relations website. Although we had expected traditional sources of global disease data like those available from the World Health Organization, we were surprised to learn that the data was instead sourced from local news articles.  

Wanting to understand this more, we put in a call to the Global Health Program at the Council on Foreign Relations. The researcher with whom we spoke said that the visualization had been developed for tracking, in real time, regional and local disease outbreaks. This explained their choice of using news reports, not WHO data (published annually), as the source of data. Its purpose, it sounded, was much more observation and inquiry than statistical analysis and conclusion. Finally, the data visualization had been launched several years before this article - and had not been commissioned by the newspaper for the sake of the article.   

Essentially, the Council on Foreign Relations data visualization appears to have been repurposed by the Los Angeles Times for the sake of advancing an argument that vaccine avoidance and vaccine-preventable disease are related. The purpose of this blog and the next two in this series is not to dwell on a possibly 'galvanized' use of a data visualization, but rather to illustrate the challenge and complexity of crafting proper data journalism. 

The next two blog entries will describe how a newspaper like the Los Angeles Times might more thoroughly and carefully depict the hypothesized linkage between vaccination and an increase in outbreaks of vaccine-preventable disease. In them we'll talk about provenance, context, statistical considerations, narrative, and design choices. We'll also write about the inherent challenges of depicting data about humans, who do not follow the same kinds of rules of physics or other natural laws that these visualizations were originally developed to depict.  

Extending outward - we'll invite this blog's audience to offer their own interpretations, suggestions on improvements, and other technical guidance.  

And most importantly - we'll assume that the readers of such a prominent newspaper as the Los Angeles Times, regardless of their level of education or familiarity with the topic or scientific techniques, can in fact be interested in and learn from a careful storytelling and visual rendering of an observed phenomenon. The readers of the Times deserve that honor!

-Michael Thompson and Adam Vigiano

Note: Out of fairness and collegiality, we tried to reach Michael via e-mail and Twitter to get his perspective on his involvement, process and choices for framing this visualization under his by-line. As of this blog posting he has not responded to us. However, we also did not expect him to readily respond due to his unfamiliarity with our team.