
Looking in the Dustbin

Data Janitorial Work, Statistical Reasoning, and Information Rhetorics

By Aaron Beveridge¹

Introduction

As digital rhetoric and visual design studies turn their attention to data visualization and infographic media, much work remains to understand quantitative reasoning and statistical analysis as rhetorical constructions. In “Rhetorical Numbers: A Case for Quantitative Writing in the Composition Classroom,” Joanna Wolfe argues that rhetorical approaches to quantitative reasoning remain largely overlooked in writing studies classrooms. As Wolfe explains, “Rather than reject quantitative argument out of hand, contemporary rhetoricians need to train their students to recognize the unethical, deceptive, and misleading as well as thoughtful, insightful, and revealing applications of quantitative rhetoric” (454). Beyond the concerns of the writing classroom, data literacy is quickly becoming a crucial component of public discourse and digital rhetoric on a broader scale. Since 2012, the White House has been following through with President Obama’s Big Data Initiative, and this initiative works to provide “consumers with the full landscape of information they need to make optimal energy decisions.” According to the initiative’s press release, the broad goals of this project include: “enabling civil engineers to monitor and identify at-risk infrastructure; to informing more accurate predictions of natural disasters; […] to advance national goals such as economic growth, education, health, and clean energy.” As more and more traditional “analyses” are replaced with data “analytics,”² critically aware citizens must be provided with access to the underlying datasets on which analytics are built, and to the statistical methods used to process, analyze, and visualize the data. However, improved access to data and greater transparency about analytical methods are meaningless if the available means of statistical analysis are not understood as inventive, persuasive forms of argumentation.

This article provides examples of how the choices made during data processing and statistical analysis affect the visualizations those analyses produce. The use of the term choice here suggests more variability and indeterminacy than what is often implied by the concept of statistical reasoning, but the range of possibilities (the notion that data analysis is inventive and experimental) is a largely unexplored analogue between rhetoric and statistics. Often, when rhetoric and statistics are compared, it is the supposedly “deceptive” nature of statistical “facts” that is discussed. Although the origin of the quote remains uncertain, Mark Twain is credited with popularizing the phrase: “There are three kinds of lies: lies, damned lies, and statistics.” This understanding of statistics as deceptive is similar to the way rhetoric is often described in journalism or politics, where the term usually indicates a particular stylistics that appears overtly persuasive. Yet, in stark contrast, visualized statistical analyses often appear in journalism, entertainment media, and politics as impenetrable and unquestionable forms of evidence. As Wolfe explains, “there is a paradox in that on one hand our culture tends to represent statistical evidence as a type of ‘fact’ and therefore immune to the arts of rhetoric, but on the other hand we are deeply aware and suspicious of the ability of statistics to be ‘cooked,’ ‘massaged,’ ‘spun,’ or otherwise manipulated” (453).

In order to move beyond a simplistic deception/fact binary of statistical reasoning, the inventive aspects of statistical analysis require closer examination. As Wolfe argues, “students should have practice making their own arguments from quantitative data…so they can see the role invention plays” (455). In the tradition of statistical analysis, this inventive practice is called “exploratory” statistics. As John W. Tukey notes in his now canonical text Exploratory Data Analysis, “many of the indications to be discerned in bodies of data are accidental or misleading…To fail to collect all appearances because some—or even most—are only accidents would, however, be gross misfeasance.” Yet, as Tukey explains, exploratory data analysis “can never be the whole story, but nothing else can serve as the foundation stone—as the first step” (3). This inventive, exploratory first step, however, is difficult to uncover or observe in many digital forms of data visualization. Many of the cloud data visualization tools available online have already worked through the exploratory aspects of their analyses, so using them is much like reading the polished final draft of an essay. Just as new writers have to understand that mistake-filled first drafts are a common aspect of writing, data literacy requires an understanding of the common exploratory methods that lead to polished infographics and effective data analytics. The Janitor and Statistics pages of this webtext investigate the exploratory aspects of data janitorial work and statistical reasoning that are relevant to digital rhetoric.

The form of data visualization considered for this article is the word cloud, or tag cloud, visualization. Generally, word clouds are used to provide a visual summary of the most important words in a text or corpus. They serve as banner images for blogs and websites, appear as introductory images in PowerPoint and other slideware presentations, visualize Twitter analyses in entertainment media, and provide visual summaries of tags and other categorical data for archives. Basically, word clouds visualize any form of plain text data to display the most frequent terms in various arrangements, layouts, and color schemes. Drawing on Franco Moretti’s “distant reading” theories and Umberto Eco’s The Infinity of Lists, Derek N. Mueller argues that word clouds are “paratactic lists.” According to Mueller, “Each cloud billows with vaporous logic; its terms inviting associations within the cloud…but also beyond the cloud. They are summary-like without submitting to a reductive logic of coherence and completeness.”
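
As a rough illustration of how plain text becomes a display of its most frequent terms, the following Python sketch counts words and scales the most frequent ones to font sizes. The function names, the sample text, and the linear scaling between 10 and 40 points are all assumptions made for this example; they do not describe how any particular word cloud tool works.

```python
from collections import Counter
import re


def term_frequencies(text):
    """Lowercase the text, split it into words, and count each word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words)


def font_sizes(freqs, top_n=5, min_pt=10, max_pt=40):
    """Scale the top_n most frequent terms linearly between min_pt and max_pt.

    The point sizes are hypothetical defaults chosen for illustration.
    """
    top = freqs.most_common(top_n)
    if not top:
        return {}
    highest = top[0][1]
    lowest = top[-1][1]
    span = (highest - lowest) or 1  # avoid dividing by zero when counts are equal
    return {
        word: min_pt + (count - lowest) * (max_pt - min_pt) / span
        for word, count in top
    }


sample = "data data data cloud cloud word"
sizes = font_sizes(term_frequencies(sample), top_n=3)
for word, size in sizes.items():
    print(word, size)
```

Even in this small sketch, several choices are already in play: what counts as a word, how many terms to keep, and how counts translate into visual emphasis.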

To produce a word cloud, there are many steps that occur between the raw unstructured text data and the final visualization. Some of the steps require data scrubbing or data janitorial work, others involve making quantitative choices about how text data is organized and represented, and finally, choices are made regarding how the colors are programmatically assigned to quantitative ranges of words (ranges in frequency). When a word cloud tool like Wordle.net is used, many of the data janitorial steps that process the text are hidden in the underlying computer code and statistical processes that turn the raw text into a word cloud visualization. Certainly, there is nothing wrong with using cloud tools to produce a data visualization. Indeed, many of the cloud tools provide an accessible way for students and scholars to begin investigating the visual aspects of data analysis. However, as with any form of analysis, the danger of such tools is their standardization and over-application. Often cloud tools require a specific data type, one that is cleaned and processed according to an already well-tested methodology that produces a polished data visualization when such assumptions are appropriately met. In many ways this is similar to the limitations of the five-paragraph essay. While there is nothing inherently wrong with the five-paragraph essay (this article, in fact, uses a five-section theme that is analogous to the five-paragraph essay, as do many academic articles), the risk of any ubiquitous form or methodology is that it is applied to too wide a range of objects and thus loses its descriptive and analytical effectiveness.³ In many respects, this has happened to the word cloud visualization: overuse and ubiquitous application have made this visual less effective than it once was. Therefore, revealing the exploratory aspects of this common tool will provide a pragmatic introduction to the inventive aspects of statistical reasoning and its relevance for data literacy and information rhetorics.
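
The steps described above can be sketched in a few lines of Python to show where choices enter the process. This is only a minimal sketch under stated assumptions: the stopword list, the color thresholds, and the sample sentence are hypothetical, and the code is not a description of what Wordle.net or any other tool does internally.

```python
from collections import Counter
import re

# Hypothetical stopword list; real lists are much longer and vary by tool.
STOPWORDS = {"the", "of", "and", "a", "is"}


def count_terms(text, remove_stopwords):
    """Janitorial step: lowercase, keep alphabetic words, optionally drop stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        words = [w for w in words if w not in STOPWORDS]
    return Counter(words)


def color_for(count, thresholds=((3, "red"), (2, "orange"))):
    """Assign a color to a frequency range; the thresholds are an illustrative choice."""
    for minimum, color in thresholds:
        if count >= minimum:
            return color
    return "gray"


text = "The rhetoric of the data and the rhetoric of statistics is the data"

raw = count_terms(text, remove_stopwords=False)
cleaned = count_terms(text, remove_stopwords=True)

# Without scrubbing, the most frequent term is a function word.
print(raw.most_common(1))

# After scrubbing, content words remain, each mapped to a color by frequency range.
colors = {word: color_for(count) for word, count in cleaned.items()}
print(colors)
```

Changing the stopword list or moving a color threshold produces a different picture of the same text, which is the sense in which these steps are choices and not neutral procedures.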

¹ Thank you to Laurie E. Gries and Nicholas Van Horn for their feedback and assistance in reviewing this webtext.
² Whereas “analysis” often refers to human reading, observation, or investigation of underlying evidence, sources, or data, “analytics” are often systematized statistical analyses of datasets too large for traditional analyses. Furthermore, the term “analytics” also implies that the methodologies may be systematically re-applied in the future on other similar data types to produce a similar visual analysis. For example, Wordle.net may be understood as an “analytic” because it reproduces a word cloud visual analysis of various textual data.
³ Laurie E. Gries and Collin Gifford Brooke argue a similar point about the structured Pecha Kucha model for slideware presentations in “An Inconvenient Tool: Rethinking the Role of Slideware in the Writing Classroom.”