2. Data Visualization

2.1. Introduction

Why should you care about Data Visualization? Scientists have to report their work to demonstrate their arguments as a research paper, poster presentation, a talk, client reports, and vulgarisation. In this sense, it is a requirement to make clear*, attractive, and convincing data visualization. They play a central part in their communication and often carry the weight of an argumentation. However, it is not only a matter of visuals. The Story behind it is as much important as the visualization.

Before digging further into the concepts behind Data Visualization, it is important to differentiate between Good, Ugly, Bad, and Wrong Figures. An Ugly figure is clear and informative, but lack Aesthetic. A Bad figure is unclear, confusing, or complicated and present Perception issues. A Wrong figure has Mathematical, or Objectivity issues. Finally, a Good figure does not present any of the issues mentioned.

import numpy as np
import matplotlib.pyplot as plt
plt.ion()
plt.style.use('fivethirtyeight')

labels = ["A", "B", "C"]
data = np.array([3, 5, 4])

fig = plt.figure(figsize=(8, 8), facecolor="white")
ax1, ax2, ax3, ax4 = [fig.add_subplot(2, 2, i + 1) for i in range(4)]

ax1.bar(range(len(data)), data, tick_label=labels)
ax1.grid(axis='x')
ax1.set_ylabel("value")
ax1.title.set_text("Good")

ax2.bar(range(len(data)), data, tick_label=labels, color=["#B3D300", "#008FD5", "#6500D1"])
ax2.grid(linewidth=3, color="k")
ax2.grid(axis='x')
ax2.set_ylabel("value")
ax2.title.set_text("Ugly")

ax4.bar(range(len(data)), data, tick_label=labels)
ax4.grid()
ax4.set_yticks(())
ax4.title.set_text("Wrong")

ax3.title.set_text("Bad")
ax3.set_axis_off()

ax31, ax32, ax33 = [fig.add_subplot(2, 6, 7 + i) for i in range(3)]
ax31.bar(range(1), data[0], tick_label=labels[0])
ax31.set_ylabel("value")
ax32.bar(range(1), data[1], tick_label=labels[1])
ax32.set_ylim(0, 5)
ax33.bar(range(1), data[2], tick_label=labels[2])

fig.subplots_adjust(hspace=0.40, wspace=0.40)
fig.canvas.draw()
../_images/1_data_visualization_2_0.png

Fig. 2.1 Good vs. Ugly vs. Bad vs. Wrong example figures. The Ugly figure exhibits highly saturated colors, the colors do not bring any supplemental information, and the grid lines are too heavy on the eye. This is not supposed to be a focal point. The Bad figure is quite hard to read and present perception issues. The axes are not aligned in terms of value. The Wrong figure presents objectivity issues as the scale is not presented, and the nature of the data is not explained.

2.2. Visualizations

2.2.1. Data and Aesthetics

2.2.1.1. Data Types

Before even talking about what kind of aesthetics can be used when creating figures, it is important to understand the type of data you are working with. There are two main kinds of data: Qualitiative and Quantitative. Qualitiative data represents characteristics, descriptors that cannot be measured objectively such as color, smell, texture. Quantitative data can be measured and is either discrete, integers, or continuous, floating numerals.

2.2.1.2. Aesthetics

Even if they seem quite different at first glance, every type of graph has the same elements of language. Those elements are called Aesthetics. Each data value needs to be mapped with only one aesthetic. When this is not the case, data visualization can lead to ambiguities.

import numpy as np
import matplotlib.pyplot as plt
plt.ion()
plt.style.use('fivethirtyeight')

fig = plt.figure(figsize=(8, 8), facecolor="white")

ax = fig.add_subplot(6, 1, 1)
ax.scatter([0, 1, -1], [2, -2, 6])
ax.grid()
ax.set_axis_off()
ax.title.set_text("Position")

ax = fig.add_subplot(6, 1, 2)
for i, marker in enumerate(["o", "+", "x", "s", "v"]):
    ax.scatter([i,], [0,], marker=marker, s=500, color="#008FD5")
ax.set_ylim(-0.5, 0.5)
ax.grid()
ax.set_axis_off()
ax.title.set_text("Shape")
    
ax = fig.add_subplot(6, 1, 3)
for i, size in enumerate([500, 250, 125, 50, 25]):
    ax.scatter([i,], [0,], s=size, color="#008FD5")
ax.set_ylim(-0.5, 0.5)
ax.grid()
ax.set_axis_off()
ax.title.set_text("Size")
    
ax = fig.add_subplot(6, 1, 4)
for i in range(5):
    ax.scatter([i,], [0,], s=500)
ax.set_ylim(-0.5, 0.5)
ax.grid()
ax.set_axis_off()
ax.title.set_text("Color")
    
ax = fig.add_subplot(6, 1, 5)
for i, width in enumerate([20, 10, 5]):
    ax.plot(np.arange(0, 1, 0.1), [-i * 2,] * 10, linewidth=width, color="#008FD5")
ax.set_ylim(-6, 0.5)
ax.grid()
ax.set_axis_off()
ax.title.set_text("Line Width")
    
ax = fig.add_subplot(6, 1, 6)
for i, ltype in enumerate(["-", "--", "-.", ":"]):
    ax.plot(np.arange(0, 1, 0.1), [-i,] * 10, linestyle=ltype, linewidth=width, color="#008FD5")
ax.set_ylim(-4, 0.5)
ax.grid()
ax.set_axis_off()
ax.title.set_text("Line Type")
    
fig.subplots_adjust(hspace=0.40, wspace=0.40)
fig.canvas.draw()
../_images/1_data_visualization_6_0.png

Fig. 2.2 Aesthetics for data visualization.

2.2.2. Visualization Types

Data points can be visualized in many ways. This section lists the visualization types with their corresponding data types they are used for.

Table 2.1 Visualization Types vs. Data Types

Data Type

Visualization Types

Correlation

Scatterplot, Bubble plot, Heatmap, Correlogram

Evolution

Line plot, Area plot, Stacked Densities, Parallel Set

Proportion

Tree plot, Stacked Densities

Ranking

Bar plot (vertical and horizontal), Spider

Distribution

Vilon, Density, Histogram

Uncertainty

Half Eyes, Confidence Bands

Maps

Map, Choropleth map, Connection Map, Buble Map

Flow

Chord Diagram, Graph Network, Parallel Set

2.3. Design Principles

This section discusses the fundamental principles of data visualization: Data-ink Ratio and Storytelling.

2.3.1. Minimalism and Proportional Ink

Data Visualization must encourage accessibility first, visually and contextually. Minimalism for visualization is one way to achieve such a goal and has been formulated as the Data-Ink Ratio. It represents the ink ratio used for representing the data against the total ink used to print the entire graphic. The closer it is to 1, the less distracting the visualization is.

Tufte has provided five laws for Data-ink:

  • Above all else, show the data

  • Maximize data-ink ratio

  • Erase non-data-ink

  • Revise and edit

import numpy as np
import matplotlib.pyplot as plt
plt.ion()
plt.style.use('fivethirtyeight')

fig = plt.figure(figsize=(8, 8), facecolor="white")
ax1, ax2 = [fig.add_subplot(2, 1, i + 1) for i in range(2)]

X, Y, labels, colors = range(3), [65, 23, 14], ["A", "B", "C"], ["#008FD5", "#F6514C", "#E5AE42"]

ax1.barh(X, Y, tick_label=labels, color=colors)
ax1.title.set_text("Object Percentage")
ax1.set_xlabel("Percentage")
ax1.set_ylabel("Object Type")
ax1.set_xticklabels([f"{i * 10}%" for i in range(7)])
ax1.legend(handles=[plt.Rectangle((0, 0), 1, 1, color=color) for color in colors], labels=labels)
for i, y in enumerate(Y):
    ax1.text(y - 5, i - 0.05, f"{y}%", color='white', fontweight='bold')

ax2.barh(X, Y, tick_label=labels, color=[colors[0], "#7EC6E5", "#7EC6E5"])
ax2.title.set_text("Object Percentage")
ax2.set_xticklabels([])
ax2.grid(False)
for i, y in enumerate(Y):
    ax2.text(y - 5, i - 0.05, f"{y}%", color='white', fontweight='bold')

fig.subplots_adjust(hspace=0.40, wspace=0.40)
fig.canvas.draw()
../_images/1_data_visualization_11_0.png

Fig. 2.3 Demonstration of Data-Ink Ratio, without on top figure and with on bottom figure.

2.3.2. Contextualization and Storytelling

The most powerful part of data visualization is Storytelling. Without, the visualization lacks impact. It is first essential to define a Context, the Audience, and the Nature of the data to create a connection between those two entities. Nevertheless, a big part of the interaction with the target is achieved through visuals. Aesthetics can be used to draw attention to certain parts of the graphic.