Episode #166 Noah Iliinsky - Telling the Right Story with Data Visualizations A Virtual Seminar Follow-up
The right data can be more effective than words when it comes to telling a story. Even if you have the data, you have to present it in the correct manner. Choosing the right axes, colors and placement are all a big part of putting together a great visualization. Noah Iliinsky demonstrates what goes into creating an effective visualization.
In his virtual seminar, Telling the Right Story with Data Visualizations, Noah Iliinsky demonstrates what goes into creating an effective visualization, even building one in front of the audience. The audience asks a bunch of great questions during the seminar, and Noah joins Adam Churchill to tackle the remaining ones for this podcast.
Tune in to the podcast to hear Noah answer these questions:
- What distinguishes a data visualization from an infographic?
- Are there any good tools for culling through a large chunk of data?
- Are there particular tools or software to apply encodings to visualizations?
- What happens when you have a lot of data but aren’t sure of the story you want to tell?
- Does requirement come before specification?
- How do you prioritize what to tweak when building a visualization?
- How does the process change when you're dealing with only qualitative data?
- Is it better to tell a story or tell the truth?
Adam Churchill: Welcome, everyone, to another edition of the SpoolCast. A few weeks ago, Noah Iliinsky joined us for his seminar, "Telling the Right Story with Data Visualizations." In it he showed our audience how to effectively conceptualize, plan, and ultimately design powerful visualizations that tell the right story. Probably one of the neatest parts of this seminar is he actually built one, step by step, for us. That seminar, like the rest of our seminars, has been added to the UIE User Experience Training Library, which presently holds over 85 recorded seminars from wonderful topic experts just like Noah, giving you the tips and techniques you need to create great design.
Hey, Noah, welcome back.
Noah Iliinsky: Hi, Adam. Thanks.
Adam: So, for those that weren't with us for your presentation, can you give us an overview?
Noah: Sure. We talked about the process of really understanding what your data is and how you want to think about it and how you want to present it, as sort of the precursor to actually doing the design of the visualization. And that earlier part should be very familiar to people who are used to doing any portion of a user experience design process.
So, for the first part, when we were considering what we had and what we wanted to do with it, the key considerations were things like, who was our audience? What are their needs? What functionality or what knowledge do they need to take away from this visual so that they can go do their work, so that they can make the right decisions? What action do they need to be able to take?
We also needed to know about the data itself, whether it was time-series data, whether it was qualitative data, if it was quantitative, whether things were ordered, not ordered, all these different things. We need to have a deeper, intimate knowledge of the shape of the data so that we can encode it properly.
And finally, the thing that sort of drives the entire concept of doing a visualization is, as a designer, what is our goal? Is our goal to convince someone that our product is superior? Is our goal to provide a specific kind of knowledge or specific answers to the audience, who is our customer? Is our goal to just give a summary of what's going on so that other people can just feel like they are informed? So keeping our goal in mind, also, is one of these fundamental drivers.
So, again, the three fundamental considerations are the audience, their needs, their use cases, all that sort of thing, sort of the usual user experience considerations; the data itself, the shape of it, the flavor of it, call it what you want, but what kind of data we have, how much of it we have, and how it relates to itself; and then, finally, our goals as the designer, what it is we're trying to achieve by creating this visual. So that's the inputs. That's the first part.
When you have all that, you can then take these inputs and begin to conceive of the sort of answer you would like to create. I talk about that like it's a spec for the visualization, where the spec is specific enough that it tells you some things about what data you're going to be needing to include, which also means what data you can exclude because you don't need to use it all necessarily. Extra data that's not useful is also called noise. So the spec's going to tell you things like what data to include and what the relationships you want to reveal are.
So, for example, the spec that we used in the virtual seminar, the statement of purpose that was the spec that we used to create this visual was, "How have changes in health-care spending affected life expectancy in different countries between 1995 and 2009?" So that's a statement that refers to the data, refers to the relationship, gives us some boundaries in terms of the time frame. It was not the world's most perfect spec. There's some ambiguity left in it, which was done on purpose because it left us room to kind of play with the relationships that were there and see which of the different relationships available in the data were going to be the most important or the most useful.
But it also gave us some very specific data to include, gave us the boundaries of the years that we were going to include, which also, in that case, meant excluding some of the data that we didn't have, we didn't have as much consistency with. For example, we didn't have data points further back in time for a lot of the countries, and so we didn't include that.
So that's the first part. Understand your inputs. Write a spec that kind of gives you some definition. The other thing, by the way, that the spec gives you is it gives you guidance. It tells you a direction to start in, and it lets you test what you've created against your goals. If you can answer the question or satisfy the statement that is the spec with the visualization that you've created, you're probably on the right track.
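To make the idea of a spec concrete, here is a minimal sketch of a spec acting as a filter on the data: it keeps records inside the stated time frame and drops records that lack the fields the question needs. The `Spec` class, its field names, and the sample records are all hypothetical, invented for illustration; they are not the data or the tooling used in the seminar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """A statement of purpose plus the boundaries it implies (hypothetical structure)."""
    question: str
    fields: tuple
    first_year: int
    last_year: int

    def includes(self, record):
        # Keep a record only if it falls in the time frame and has every needed field.
        in_range = self.first_year <= record["year"] <= self.last_year
        complete = all(record.get(field) is not None for field in self.fields)
        return in_range and complete


spec = Spec(
    question=(
        "How have changes in health-care spending affected life expectancy "
        "in different countries between 1995 and 2009?"
    ),
    fields=("spending", "life_expectancy"),
    first_year=1995,
    last_year=2009,
)

# Hypothetical records, invented for illustration only.
records = [
    {"country": "A", "year": 1990, "spending": 100.0, "life_expectancy": 70.0},
    {"country": "A", "year": 1995, "spending": 120.0, "life_expectancy": 71.0},
    {"country": "B", "year": 2009, "spending": None, "life_expectancy": 65.0},
    {"country": "B", "year": 2009, "spending": 90.0, "life_expectancy": 66.0},
]

# Everything the spec does not need is treated as noise and left out.
kept = [record for record in records if spec.includes(record)]
```

The same object doubles as a test: if the finished visual cannot answer `spec.question`, the design has drifted from its goal.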
So that's part one. That's what we call what to visualize, how to decide what it is we're doing.
Part two, the second phase, is actually doing the visualization, actually taking this data that you've got and starting to make marks on a page, starting to define what do colors mean, what does placement mean, what the size and shape and all these other things mean, where are we going to put the labels. And that is based on several years, decades, of research that's been done into cognitive psychology and how people perceive relative placement and differences in color and social constructs around "What does bigger mean?" or "What does green versus red mean?" all these things.
So the way you start the how to visualize, the very first thing you do there is you choose your axes. And you want to choose axes that are going to reveal the relationship that is the key interaction. So this is why scatter plots, for example, are so powerful and commonly used is they show the key interaction between the values on the X-axis and the values on the Y-axis. So I strongly advocate for well-defined axes and really thinking about those in the first place, because they sort of define the scope of your world, the boundaries of what you can create.
And then, once you've got those, once you've got the data points defined in your world based on the axes, you can start to add other, different dimensions of data. And the way that we choose the visual encodings for those other, different dimensions of data is we want to understand sort of the flavor of the data. Earlier, when I was talking about the data as an input, we want to understand things like, is the data well-ordered or not? If it's ordered, is it quantitative, or is it just ranked? If it's not ordered, is it categorical? Is it qualitative? Is there a relationship there?
The reason we want to know this about the data is because we want to choose a visual property that is compatible with the data. If we have qualitative data, we want to choose visual properties that are going to be more qualitative or categorical. If we have quantitative data, we want to choose a visual property that can represent those quantities effectively.
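One way to picture this matching is a small lookup from the kind of data to visual properties that tend to suit it. The sketch below is illustrative only: the groupings follow commonly cited general guidance and do not reproduce the properties table mentioned in this conversation, and the names `SUITABLE_ENCODINGS` and `suggest_encodings` are hypothetical.

```python
# Illustrative mapping from the "flavor" of the data to visual properties
# that are generally considered compatible with it.
# Assumption: general guidance, not a copy of any specific published table.
SUITABLE_ENCODINGS = {
    "quantitative": ["position", "length", "size"],
    "ordered": ["position", "size", "color lightness"],
    "categorical": ["color hue", "shape", "grouping"],
}


def suggest_encodings(data_kind):
    """Return candidate visual properties for one kind of data."""
    if data_kind not in SUITABLE_ENCODINGS:
        raise ValueError(f"unknown data kind: {data_kind!r}")
    return SUITABLE_ENCODINGS[data_kind]


quantitative_options = suggest_encodings("quantitative")
categorical_options = suggest_encodings("categorical")
```

The point of the lookup is the direction of the decision: start from what the data is, then pick the visual property, rather than picking a look first and forcing the data into it.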
The research has been done. We know what those visual properties are and how they map, and I have actually drawn up a table of this. My table is one of a large number of these tables that have been drawn up over the years. But if people are interested in this, and I highly recommend printing this out and pinning it up on the wall of your cubicle, go to complexdiagrams.com--that's my blog--complexdiagrams.com/properties, and there's a one-page PDF there that you can just print out. It lists the visual properties and the characteristics of the visual properties, and it also tells you what sort of data each one is best used to encode.
So you use something like this to guide the subsequent addition of more dimensions of data onto your visualization. And then, of course, as with all design practices, the answer is iterate, iterate, iterate. You're going to put some things on the page and say, "You know, that's just not quite right. It's not working for me. Let's try something else." And this was actually something I did quite a lot of in the virtual seminar, as people will remember.
We tried some best guesses initially and then looked at it, saw that the best guesses were not as accurate or were not as useful as we had hoped and started to make some changes, started to iterate the process to get more interesting, more useful results that were actually more informative. And that was based on perceiving what was actually there in the data rather than what we hoped or believed might be there.
So, that's the overview of the virtual seminar.
Adam: Well, let's get to some of those questions that we didn't have time to tackle in the seminar. Ann wants to know how you distinguish a data visualization from an infographic.
Noah: That's a great question. This is a question that is still debatable in the industry and in people who are involved in this. My personal favorite definition, and one that has become widely adopted among people whose work I respect and people whose thinking I respect, is it's one of sort of content density and origin. And these things go together.
Data visualizations tend to be very data-rich, hundreds or thousands or millions of data points. They tend to be generated with software. So, somebody is saying, "These are my axes. This is how I want the graph to look." But they're not in there with a mouse putting down a point for every graph. They're generated through a graphing tool, whether it's Excel, whether it's Tableau, whether it's R, or some other software package. And what that means is you can create a new version, you can update, you can add data, you can make changes to it very quickly.
So that's what we call a data visualization. Tends to be data-rich. Tends to be, obviously, guided by the designer but not hand-drawn. Every pixel is not put on the page intentionally. And they're flexible. They're easy to update. So that's a data visualization.
The other end of the spectrum is infographics. Infographics tend to be much more highly illustrated and have a much lower quantity of data. So, there are some terrible infographics that are fairly common on the web these days. There's also a number of quite good ones. Infographics tend to have a relatively small number of data points, maybe 10, in that number, tens, perhaps, of data points being represented. They tend to be highly illustrated. It tends to be something where somebody, perhaps a graphic designer, actually went into a drawing tool, like Adobe Illustrator, and drew this graph, put some text or some caption on it, made a nice fade, maybe they added some glossy effects to it, put some narration, some supporting text, and then there's the next graph.
And so these infographics tend to have a much smaller amount of data, they tend to be hand-generated, and they have, typically, much more effort put into the aesthetic presentation.
Now, the drawback to that is that if the data changes or you want to add a different data set or you want to reuse the same framework, the same sort of structure for something entirely different, there's a lot of manual effort that goes into re-plotting it, putting the data on the page a second time, because it was manually created rather than algorithmically created with software in the first place.
That's the definition I like of infographics versus data visualizations.
Adam: Noah, there were lots of questions from our audience about the tools that you use to do these things. First, what tools do you use to actually get your head around the big chunk of data? Are there tools that you recommend or that you use for playing with it and culling through it?
Noah: Personally, I just start with a spreadsheet, whether that's Excel. I'm on a Mac, so I usually use Numbers if I can. Nothing wrong with a spreadsheet for just sort of first approach to the data. If you're someone who's got access to more sophisticated tools or access to a statistics background, you can use a package like R, which is an open-source statistics package that's very popular and well-supported. Or if you've got academic access to something like SPSS or one of these other professional stats packages, those are also a great answer. But really, there's nothing wrong with just Excel or anything else, just a very basic spreadsheet to kind of see what's there, just to get a sense of the scope of your data, what rows and columns you've got, whether the data's complete or it's got holes in it, that sort of thing.
Adam: There were questions on, once you've got the data pulled out and you want to create your visualization, what software or tools do you use to generate those? And in particular, in the seminar you spent a lot of time talking about the different types of encodings. Are there tools to apply those, and is that something that you can do in Excel?
Noah: Yeah, there's a lot of different tools that you can use, and it depends a lot on what you're comfortable with, what you have access to, how much data you're dealing with. For off-the-shelf visual drawing, nothing wrong with Excel. Well, I should take that back. There's a few things wrong with Excel. All of the defaults in Excel are wrong: the axes, the fact that it makes everything 3D, the colors that it chooses, the labels, the default graph styles. Most of the things in Excel are wrong. You absolutely can use Excel to draw good graphics, but you need to know what you're doing. You need to, for example, refer back to that table that I drew up to make sure that your encodings and your labels and your positions and shapes and whatnot are actually going to be useful to your audience, not distracting. So Excel's a fine place to start with that, or, again, Numbers, or whatever spreadsheet tool you prefer.
If you're on the Windows platform or you can get access to a Windows platform, there's a fantastic tool, happens to be from Seattle, called Tableau, which is a really excellent visual analysis and data visualization software. And because it has been designed specifically to be for data visualization, all those defaults that I just mentioned in Excel, that are wrong in Excel, they get right in Tableau. And Tableau's just a really nice package. There's a free version of it called Tableau Public, which you can download and play with. So that's a great tool.
For people who are more interested in things like data art or a little more free-form representations of data, the best language out there these days is probably Processing. And that also is a very well-supported, free, open-source tool, as is D3.
For just sketching and kind of playing with ideas, if I don't want to be throwing my whole data set around, I usually start with pencil and paper. That's a great place to start. On the Mac, I use OmniGraphSketcher, just to kind of roughly say, "Well, if we do a bar graph or a stacked bar graph, I want it to look like this. And if we do a line graph, I want it to look like that."
For qualitative relationships, for things like influence and flow charts--I do actually do this a lot with flow-through software or flow-through interface--I do all that in OmniGraffle. And OmniGraffle is actually probably my favorite piece of software in the world. I've been using it constantly for like nine years or something, and it's just so well-designed. Love that tool a lot.
Back in the day, before D3 came along, I used to do some of my data munging in Perl and output it in a format called Dot. Dot is a format for drawing directed graphs, and once you have your data in this Dot format, you can visualize it with a piece of software called Graphviz, which is another open-source tool. Graphviz is also incorporated in OmniGraffle, so if you're on a Mac and you use OmniGraffle, you can format your directed graphs in the Dot language and then just open them up in OmniGraffle, and it'll do the initial layout pass. And I like that process a lot, too, because then it lets you go in and manually modify in a familiar environment like OmniGraffle.
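As a rough illustration of that workflow, the sketch below writes a list of relationships out as a directed graph in the DOT language. It uses Python rather than Perl, the helper name `to_dot` is hypothetical, and the edges are invented for the example. It assumes simple labels that contain no double quotes.

```python
def to_dot(edges, name="influence"):
    """Render (source, target) pairs as a directed graph in the DOT language."""
    lines = [f"digraph {name} {{"]
    for source, target in edges:
        # Each edge becomes one "source -> target" statement.
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical relationships, invented for illustration.
edges = [
    ("Audience", "Spec"),
    ("Data", "Spec"),
    ("Goals", "Spec"),
    ("Spec", "Visualization"),
]

dot_text = to_dot(edges)
print(dot_text)
```

Saved to a file such as `graph.dot`, the output can be laid out with Graphviz, for example `dot -Tpng graph.dot -o graph.png`, and then adjusted by hand in a drawing tool.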
Most of these tools that I just mentioned are listed also on my website, complexdiagrams.com/about. And that also has a section on the tools that I tend to use.
Adam: Jeremy wants you to talk more about when you've got a situation where there's lots of internal data but you're just not quite sure what the story you want to tell is yet.
Noah: That's a really excellent question also. A lot of what I talked about in terms of picking the right encodings and putting it on the page just so all assumes you already know what it is you're trying to show. And Jeremy's question gets at the very real notion that a lot of the time you don't know what's there, you don't know what you want to show.
This is a different context. This is a context where you're then looking more at the exploration end of things rather than the presentation or the explanation end of things. If you're still exploring, you get the freedom to be a little more rough with it. You don't have to worry quite so much about showing all the data quite yet or getting the colors just perfect. Instead, you can kind of be a little sloppy in the presentation, because if it's just for you or just for a small team, then you can play with the axes, you can play with what data to include and exclude.
The question of how do you know what to do? It's a little bit of intuition. It's a little bit driven by your goals. It's a little bit driven by experimentation. So, if you take your goals and say, "Well, these are kind of the relationships I think that I'm interested in," maybe you do a really quick graph in Excel and you see if that relationship exists.
Again, this is the process that I actually went through in the virtual seminar. I sort of did a best guess for my first pass at what the axes should look like. And it turned out that it wasn't very interesting, and so I had to make some modifications there. That's a very real example that happens all the time. You say, "Oh, we're going to graph it this way," and you make that graph and it's not useful. And so then you have to start thinking about, what else might be in the data, might be represented by the data? What else might be interesting to learn from the data? Are there other factors that you didn't get data for that might be influencing the situation?
So, like I said, it's a little bit of context, a little bit of letting what you can see guide you. It's a little bit of intuition, of just your awareness of the big picture. And again, it's always going to be guided by the sort of underlying motivation of, "Why are we here? What are we looking for?" Because probably you're looking for something. If somebody says, "Here's my data. Graph it," they usually have some kind of a notion or some kind of a motivation of, "This is the particular kind of answer that we think is going to be useful. This is the particular kind of relationship that we would like to understand better."
So you've usually got some kind of guiding light there to set you off in a particular direction and kind of get you started. And then, yeah, from there, iterate, take the feedback from what you see, iterate again, keep going. That's how a design works, again and again.
Adam: The Qualcomm UX group asks, shouldn't requirement come before specification?
Noah: The answer to that is, it depends. And the reason I say that is because, if you already are in a situation where you know exactly what it is you want to learn from your data, absolutely. You've got some requirements that say, "This is what we want to learn, and that's going to inform the spec." And then the spec itself is what's going to tell you what to draw, which data to include, where to put it on the axes, that kind of thing.
The flip side of the situation is when you don't exactly know what the requirements are. This goes back to Jeremy's question. You don't exactly know what the requirements are. You know that there's probably something interesting in there, but you're not quite sure what it is, and so you can't make a requirement. So you sort of make up a spec. You make up a best-guess spec, you play with it, and then, like I said, you allow what you can see there to inform your subsequent iterations and your subsequent experimentation with the data that you've got available.
Adam: In the second half of the presentation you were, again, in front of the audience, you were literally building a visualization for us. And during that time, Veronica wanted you to say more about how you were deciding which things to tweak. And I guess I'll add, how did you prioritize those decisions?
Noah: Sure. This, again, is similar to Jeremy's question of, how do you figure out what the story is? This is a little bit different than that. We knew what relationship we wanted to reveal, but we were trying to figure out how to let the data show us that. And what happened was, with our first best guess, the interesting data points, the outliers, we knew were not about the relationship that we were looking for.
So we were looking at the relationship between changes in health-care spending and changes in life expectancy. And instead, what we saw, initially, was these huge outliers, where there were large changes in life expectancy, in countries where either a war had begun, a war had ended, or there had been a lot of HIV/AIDS. Those were things that were radically changing life expectancies for better or for worse, much, much more so than the smaller life expectancy changes that were happening in response to changes in health-care spending.
So even when we took those outliers out of the picture, the data wasn't clear. And going back to Veronica's question, how do you decide which to tweak, then? As people may recall, I tried a couple of different strategies. One was trying to just clump the data by region or by economic tier, rather than showing all the countries, because with all the countries there, it was just kind of a mess, it was kind of a glob of data with no clear trending.
So, one attempt was to clarify by doing some clumping, which is a great strategy, by the way, if you can do that. Another attempt was to change the axes, because we decided that maybe the initial declaration, of this should be this axis, and that should be that axis, wasn't the best choice of axes. And I tried a couple of different tweaks of the axes.
So, the response, how did I choose what to tweak? I was attempting to address the specific deficits of what I had in front of me. The specific deficit I had in front of me was, all the data was in the clump, so we tried to clarify that by changing the axis to spread that out a little bit, and changing the density by grouping the data into geographic regions, economic regions, rather than showing data points for 150 or 200 countries all at once.
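The clumping strategy can be sketched as a simple group-and-average step: instead of one point per country, plot one point per region. The rows below are hypothetical placeholders, not the seminar's data, and `clump_by_region` is an illustrative name.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (country, region, change in spending in percent,
# change in life expectancy in years). Invented for illustration only.
rows = [
    ("A", "Region 1", 10.0, 1.0),
    ("B", "Region 1", 20.0, 2.0),
    ("C", "Region 2", 5.0, 0.5),
    ("D", "Region 2", 15.0, 1.5),
]


def clump_by_region(rows):
    """Collapse per-country points into one averaged point per region."""
    grouped = defaultdict(list)
    for _country, region, spending, life in rows:
        grouped[region].append((spending, life))
    return {
        region: (
            mean(spending for spending, _ in points),
            mean(life for _, life in points),
        )
        for region, points in grouped.items()
    }


regional = clump_by_region(rows)
```

A plain average is only one choice. Weighting by population or using the median would change what each regional point means, so that decision belongs in the spec.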
So, a little bit of intuition, again. And a little bit of looking at the problem with what's right in front of you and saying, what would make what's right in front of me, easier to understand? Do I need less density? Do I need these to be spread in a different way? Do I need some color-coding to kind of pull out the layer that's interesting to me?
And then, you experiment. You iterate with those changes that you think might be useful. If those changes aren't useful, you try some different ones and see what happens.
Adam: Dan asks the question, "How does the process change if you're dealing with only qualitative data?" And Noah, correct me if I'm wrong, I think this question came up during the first half when you were talking about massive data set, you can't use it all, you need to kind of pull out the right pieces to tell the right story. Is that your recollection as well?
Noah: Yeah, I think so, when we were talking about all the numerical manipulation you might have to do. And he said, "What if you don't have numbers? What if you've only got qualitative data?"
I am particularly fascinated by qualitative data. Diagramming qualitative data was how I ended up getting into the whole world of data visualization.
Qualitative data is really fascinating to me, because unlike quantitative data, numerical data, where we've got a lot of standards about graphs and conventions, about when you use this graph and when you use that graph. We have a lot of good tools, we have a lot of good examples. We understand the best practices, even though they're not universally followed. We have a really good understanding of how to draw pictures of numbers.
We don't have a lot of conventions, we don't have a lot of standard metaphors around pictures of qualitative data. That tends to be relationship data, is what that tends to be. The things that influence other things, things that have some affinity.
You can sometimes get rank or order out of qualitative data if you're looking at something like a process where things, where you have to follow a sequence, or where you're looking at an org chart or a hierarchy where you can say, this is someone who's at the very top. This is someone who's second tier, this is someone who's third tier.
You may not be able to say that the second tier is twice as good as the third tier, and the first tier is six times as good as the second tier. You may not have that level of quantification, but if you've got some ordering, that's a really useful lever that you can use with your qualitative data.
But what it comes down to, typically, with the qualitative data, is you've often got to invent a new metaphor, unless you're working in some standard metaphor, like a family tree or an org chart, where we already understand what those are, typically. But a lot of the time, with qualitative data, you've got to invent a new metaphor and you've got to teach that metaphor before you can start talking about your data. And that provides a little more of a challenge to a designer, a little higher barrier to success.
But I really like those situations, because I find that level of creative engagement really juicy when you can look at what you've got. The key there is really understanding what the relationships you want to reveal are. Because it's usually about the relationships in the data. It's usually about the hierarchy, the sequence, the influence, or the affinity, or some other way that the data relates to each other. And when you think about how those can relate, then you can put some ordering onto the page in a way that makes sense.
One example that I like to give for this is a menu in a restaurant. We think, well, that's not really data at all, but it is. It's ordered data. It's ranked data. Usually, what happens on a menu is, you're looking at things and they are ordered by what course, they're ordered chronologically.
You're probably going to start with an appetizer, you might have a soup or a salad course there. Maybe if you're in Italy, you'd do, like, a light pasta dish. And then, you have entrees, and then, finally at the end, you have desserts. And so, a menu really has an axis, has a time axis and we don't think about it like that. But that's what they're using to organize, to clump the data into rational chunks.
If you ever sat down in a restaurant and everything was ordered alphabetically, it wouldn't make much sense. You'd have to search through the whole menu to find an appetizer, to find a salad, rather than knowing where to look. You can sort everything on a menu by price, but that would be a little strange, also. You'd probably get things like entrees clumped close together and drinks clumped close together. But again, those relationships don't make much sense.
Where, if you're saying, I'm going to clump my menu items by what course they're for, which, again, maps to the time of consumption, you've got some relationship there that helps you group the data and put it together in a way that is going to allow the audience to make some sense of it, to find what they're looking for a little more easily.
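The menu example can be sketched as two orderings of the same items: alphabetical, and by course. The course list acts as the qualitative axis. The menu items and the names `COURSE_ORDER` and `by_course` are hypothetical.

```python
# Courses in the order a diner usually meets them; this order is the menu's time axis.
COURSE_ORDER = ["appetizer", "soup or salad", "entree", "dessert"]

# Hypothetical menu items, invented for illustration.
menu = [
    ("Apple pie", "dessert"),
    ("Bruschetta", "appetizer"),
    ("Caesar salad", "soup or salad"),
    ("Roast chicken", "entree"),
]


def by_course(items):
    """Order items along the course axis instead of by name."""
    rank = {course: position for position, course in enumerate(COURSE_ORDER)}
    return sorted(items, key=lambda item: rank[item[1]])


alphabetical = sorted(menu)   # dessert ends up first, which helps nobody
grouped = by_course(menu)     # appetizer first, dessert last
```

Nothing about the items changed between the two lists. Only the choice of axis did, and that choice is what lets a reader know where to look.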
So, again, when it comes to qualitative data, it's all about, what are the underlying relationships in the data that are going to be interesting and important and useful to my audience? And then, how do I choose a grouping, an ordering, an arrangement on the page, an axis that reveals those key relationships or key properties of the data?
Note, by the way, when I say axis here, I don't necessarily mean it's got to go from 0 to 100, like it does if it's a graph or if it's something that's quantitative. You can perfectly legitimately do qualitative axes, whether it's ranked from most important to least important, or whether it's categorical. We have our three categories of desserts. On the left, we have pies, in the middle, we have cakes, on the right, we have ice creams. Right?
That's an axis where that grouping is powerful and useful, but it's not ordered, it's not quantified, it's not ranked, it's just, we're going to say, this particular axis is grouped into three parts, and you can find different things in these three parts.
So, the process is going to be different, certainly. You're going to be looking at some different favorite portions of the properties table that I mentioned. But relationships. Understand relationships, understand what's valuable and interesting. In that regard, it is similar to quantitative data.
Adam: The team at Turner Broadcasting was hoping you could speak a bit about the difference between motivations of telling a story and telling the truth.
Noah: This is a subtle difference, this is a gray area. There certainly are people who would argue that any sort of bias, any sort of motivation or perspective or point of view, anything like that is going to, you know, impugn your pure, rational objectivity. And I'm not entirely convinced of that.
I have a physics degree, I have a strong respect for the virtue and the integrity of the data. I also think that the data, as a stand-alone entity, does not have as much value as the data that has some context. And I'm assuming, because it's coming from Turner Broadcasting, that these are people who have some journalistic context for their question.
There are definitely situations where the data is being tortured, the data is being manipulated to show a specific point of view, to support a particular argument. And it's not inherently bad to have the data, the truth of the data support a particular point of view. It's a problem when, the way that the data is represented or the way that the data is edited, the way that the graph is put together, it's a problem when those are done in such a way as to distort the meaning that is there in the data.
So, if some data points are omitted, if the zero is removed from the axis of a graph to take what otherwise would be a relatively flat slope and make it look much steeper. Obviously, if data is just made up. If you're only showing a short time span that omits a longer context. These are all, sort of, the classic traditional ways of taking data that doesn't tell the story you want it to, and making it look like it tells a story that you do want it to.
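The effect of removing the zero from an axis can be shown with a little arithmetic: the same change takes up a much larger share of the chart when the axis range is cropped. The values and the function name `apparent_rise` below are hypothetical, chosen only to illustrate the distortion.

```python
def apparent_rise(start, end, axis_min, axis_max):
    """Fraction of the axis height covered by a change from start to end."""
    return (end - start) / (axis_max - axis_min)


# Hypothetical values: a measure that moves from 50 to 52.
with_zero = apparent_rise(50, 52, axis_min=0, axis_max=100)
cropped = apparent_rise(50, 52, axis_min=49, axis_max=53)

# How many times steeper the same change looks on the cropped axis.
exaggeration = cropped / with_zero
```

With these numbers the change fills 2 percent of the full axis but half of the cropped one, so the slope looks about 25 times steeper even though the data is identical.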
I think it's great when the truth of the data supports the truth of the story. And we should, as savvy consumers of data and savvy designers of data, be aware that you can lie with data just like you can lie with words. As designers, one would hope that you would tell the story that reveals the truth and the integrity of the data, rather than selecting data and manipulating data to support the story that you would like to tell.
This, obviously, is not always going to be the case, but I'm sure that any of the fabulous listeners to this particular podcast and attendees to the seminar have only the highest integrity and would never use data for evil. But instead, would be willing to change their story, change their point of view if the truth of the data reflected a truth that was not the one that they were hoping to see.
Adam: Very complex.
Noah: As all things involving humans and communication are.
Adam: Well, Noah, this was great. Thank you for taking time out of your day to circle back with us and answer some of the questions we had from our audience.
Noah: My pleasure, Adam. It's always a treat to be here.