Methodology

The introduction argues that improved access and greater transparency with data visualizations may have a limited effect on data literacy if statistical analyses are not understood as inventive, persuasive forms of argumentation. That does not mean, however, that explaining methods and working to improve access is of lesser importance. Literacy and accessibility are always closely associated, and this is just as true in digital rhetoric as it was in print media. James E. Porter argues that “From the standpoint of digital production, putting the concept of access into action means designing information so as to help audiences with limited access to digital resources engage that information via alternate media and formats” (216). Access, in terms of data literacy, is not as simple as providing access to data; it must also include the accessibility of the methodologies and the technical know-how needed to unpack the production of data visualizations. As Porter explains,

Technical knowledge is integral to digital rhetoric, but that knowledge is not merely mechanical, routinized procedure. Yes, it can be reduced to that (and often is), but when practiced as art (techne) technical knowledge intersects with rhetorical and critical questions in order to assist discursive production and action. (220)

Techne and data visualization are contingent upon the accessibility of:

1. The underlying data: how it is collected, and how it is archived and accessed.
2. How the data is processed, cleaned, and organized in preparation for analysis.
3. How the data is analyzed and visualized.
4. How the analyses and visualizations are delivered to an audience.

The following infographic summarizes these four aspects of data visualization techne for the word clouds shown in this webtext. The remainder of this section provides a rationale for the software and methods used to analyze the text and create the word clouds.1

[Infographic: the four aspects of data visualization techne for this webtext’s word clouds]

The data visualizations produced in the Data Janitor and Statistics sections of this article rely on a form of text analysis called text mining or text data mining. Text mining takes raw unstructured text and turns it into structured data that can be statistically analyzed. For this article, a corpus of text documents is constructed from text data collected from the following Wikipedia pages:

The text contained in the articles was systematically collected with a data mining application called MassMine. MassMine is open source software, funded by the National Endowment for the Humanities, that supports social network data mining for academic research. MassMine was created to address limitations in the accessibility of social network data.2 Some networks are more accessible than others in terms of how they license their data. For example, the use of and access to Twitter and Facebook data are strictly licensed and controlled, but Wikipedia text data is open and accessible. However, accessibility regarding data involves more than data licensing and terms of service—accessibility also has to do with the availability and approachability of the technical skills required to collect, analyze, and visualize data. The MassMine project works to address these problems by simplifying the process of collecting and archiving freely available data from social networks. With Wikipedia, a corpus of articles is created by providing MassMine with a list of article titles, and MassMine then collects and organizes all of the raw text from those articles for analysis.
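The text-mining step described above—turning raw, unstructured text into structured data—can be sketched in a few lines of base R. The sample sentences below are illustrative placeholders, not the article corpus, and this is not the author’s actual code:

```r
# A minimal sketch of the text-mining step: raw unstructured text is
# normalized and reduced to a term-frequency table -- the structured
# data that statistical analyses rely on.

raw_text <- c("Data mining turns raw text into data.",
              "Raw text becomes structured data for analysis.")

tokens <- tolower(raw_text)                 # normalize case
tokens <- gsub("[[:punct:]]", "", tokens)   # strip punctuation
tokens <- unlist(strsplit(tokens, "\\s+"))  # split into individual words

# Count word occurrences: the structured output of the text-mining step.
term_freq <- sort(table(tokens), decreasing = TRUE)
head(term_freq)
```

In a full workflow, a further scrubbing pass would typically remove common stopwords (“the,” “and,” and so on) before any visualization is drawn.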

After collecting text data from Wikipedia, the data scrubbing and analysis are completed with the open source programming language R. R is designed specifically for data extraction and statistical analysis, and it is one of the best tools available for making informatics and data visualization methodologies widely accessible. Amanda Cox, the graphics editor for the New York Times, uses R to produce many of her data visualizations for the Times. Cox has called R “the greatest software on Earth,” and she explains that while it is not the only tool she uses in her work, it is her preferred tool for “sketching” and exploring data when developing visual analyses.3 While R can be used for more complex modeling and statistical predictions, Cox explains that R’s package framework remains friendly to new users who do not want to build all of their data analyses and visualizations from scratch. The Data Stories podcast has a recent interview with Cox available here, and R-Bloggers has videos of Cox talking about using R at the Times available here. These resources provide real-world examples of how data visualizations are produced for a major media outlet.
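To illustrate how R’s package framework keeps this kind of work approachable, the sketch below shows how a word cloud might be drawn from a term-frequency table. It assumes the third-party wordcloud package; the frequencies are invented for illustration and do not come from the article’s corpus:

```r
# A hedged sketch of the visualization step: drawing a word cloud in R
# from a term-frequency table. The frequencies below are illustrative.

freqs <- c(data = 40, mining = 25, text = 22, analysis = 18,
           visualization = 15, rhetoric = 12)

# Drop rare terms before plotting -- a common scrubbing step that keeps
# a word cloud legible.
freqs <- freqs[freqs >= 12]

# Render the cloud only if the (assumed) wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(freqs), freq = as.numeric(freqs),
                       min.freq = 1, random.order = FALSE)
}

length(freqs)  # number of terms that survive the frequency cutoff
```

Because packages like wordcloud handle the layout and scaling, the analyst’s work reduces to preparing the frequency table—the kind of ready-made building block Cox describes.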

All of the raw text data and the R code that produced the visualizations for this article are available here on GitHub. GitHub is a free social coding site that provides support for open development and collaborative programming. Hopefully, providing access to all of the code and text data behind the visualizations in this webtext will allow other scholars and teachers to modify or build upon these examples and produce their own text analyses or word cloud visuals. Data literacy can be intimidating for rhetoric and writing scholars who have limited training in statistics and data visualization, but encouraging open and collaborative development will allow new projects to draw from and build on previous ones. As Mary K. Stewart argues, writing studies must continue to develop “a definition of digital literacy as a learning outcome that has three characteristics: multimodal composition, information, and collaboration.” As GitHub is quickly becoming the largest social coding site on the web, it provides an open and collaborative framework for development that supports the data literacy learning outcomes Stewart describes.4



  1. The infographic below was created at easel.ly—a site that provides free and easy-to-use cloud tools for creating, hosting, and embedding infographics.

  2. MassMine grant details: http://ufdc.ufl.edu/AA00025642/00001

  3. “Sketching” is a term that Cox uses in her podcast interview with Data Stories.

  4. This webtext was drafted with Markdown, Pandoc, and GitHub. GitHub also provides free hosting, including for images. For collaborative web publishing, GitHub is a fantastic resource.


© 2015 Aaron Beveridge