A data structure is a concept from computer science that refers to a container that programmers design to store and retrieve data. Data structures are often given descriptive names like stacks, queues, graphs, and trees. R is both a statistical tool and a high-level programming language, so while it's possible to create any data structure you might need from scratch, most of the data structures you'll need are already built into the language.
For those already familiar with the concept of a variable (see Getting Data from a CSV file), data structures just expand on that concept. You can think of a data structure as a series of data elements that are connected together so that they can be accessed in a specific, well-defined way. The classic example is the stack, which is a group of data elements that can only be accessed in the reverse of the order in which they were stored. Imagine if, in the real world, we made a stack of different-colored dinner plates and placed them on the stack in spectral order: Red, Orange, Yellow, Green, Blue, Indigo, Violet. We would then expect the plates to be retrieved from the stack in reverse-spectrum order: Violet first and Red last. This is exactly how a stack data structure behaves. Computer scientists describe this as "last in, first out" behavior, and it is how the call stack, which programs use to keep track of function calls, works at a very low level. Fortunately, the data structures in R are not as restrictive as the stack. What they do share with the stack is that the way data is added, removed, and accessed is governed by the data structure.
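The plate example can be imitated in R itself. The following is only a rough sketch, assuming an ordinary vector (introduced later in this guide) is used as the stack; the names "plates" and "retrieved" are made up for illustration and are not part of R.

```r
# A vector used as a stack: plates are added to the end and removed from the end.
spectrum <- c("Red", "Orange", "Yellow", "Green", "Blue", "Indigo", "Violet")

plates <- c()                          # start with an empty stack
for (colour in spectrum) {
  plates <- c(plates, colour)          # "push": place a plate on top
}

retrieved <- c()
while (length(plates) > 0) {
  top <- plates[length(plates)]        # the plate currently on top
  plates <- plates[-length(plates)]    # "pop": take it off the stack
  retrieved <- c(retrieved, top)
}

print(retrieved)                       # Violet comes out first, Red last
```

R does not enforce the stack discipline here; the code simply chooses to touch only the last element.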
While single data elements can be stored in variables, it's much more common to store whole data structures. In most cases, it's even possible to treat each data element as a variable in itself and create a data structure composed of other data structures.
R has four main built-in Data Structures: Vector, List, Matrix, and Data Frame.
Before we begin to look at R's data structures, it's important to understand that every variable and every element in a data structure has what is called a "mode." In R, the term mode refers to the type of data that is stored. This concept will be familiar to those who've spent a lot of time working with spreadsheets, where each cell can be defined as a number, text, date, or currency value. In fact, most programming languages and data storage software have a similar concept, usually called "type," which tells the computer (which, after all, just sees ones and zeroes) how to format, display, and treat the data. The primary modes in R are numeric, character, and list: numeric for storing numbers, character for storing strings of characters, and list for storing collections that mix numeric and character data.
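To check the mode of a value yourself, base R provides the "mode()" function. A small sketch follows; the example values and variable names are arbitrary.

```r
# Ask R for the mode of a few example values.
numberMode <- mode(42)                  # a number
textMode   <- mode("Cook")              # a string of characters
listMode   <- mode(list(42, "Cook"))    # a list mixing both

print(numberMode)   # "numeric"
print(textMode)     # "character"
print(listMode)     # "list"
```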
When beginning R, the most important built-in data structures will be data frames and vectors.
I'll start with the most complex, because it's probably the easiest to relate to and will be used throughout the rest of this guide. After you follow the instructions on the Getting Data from a CSV tab, the structure stored in the variable "educ" is a data frame. A data frame is a table where each column has a label and each row has an entry for every column (even if the entry is blank). Most of the commands that tell R to pull information from some kind of outside file, including the "read.csv" command from the previous tab, will attempt to produce a data frame from that file. For statistics, the best way to think of a data frame is as a table where each row represents a case and each column represents an attribute.
To demonstrate, let's look again at the data uploaded in the Getting Data from a CSV tab:
When R prints a data frame, it takes as many columns as fit on the screen and prints every row of them, from left to right. Under that, it prints the entirety of the next few columns and under those, the next few, until it reaches the end of the columns. Later, we'll learn how we can view and edit the entire dataset more easily (see the Create Graphics tab), but for now let's scroll up to the top of the dataset and talk about what's displayed there.
The first few columns are labeled "county_name," "total_population," and "median_income" and each row is numbered and represents a specific Illinois county. At each intersection, we find the data values.
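If the CSV file is not at hand, a small stand-in for "educ" can be built by hand to follow along. This is only a sketch: the three column names come from the description above, but the five rows and every number in them are invented placeholders, not real county data.

```r
# A hand-made stand-in for educ. All values are placeholders for illustration.
educ <- data.frame(
  county_name      = c("Adams", "Alexander", "Bond", "Boone", "Brown"),
  total_population = c(1000, 2000, 3000, 4000, 5000),
  median_income    = c(10000, 20000, 30000, 40000, 50000),
  stringsAsFactors = FALSE   # keep county names as plain character strings
)

print(educ)          # rows are numbered, columns are labeled
print(names(educ))   # the column labels, in position order
print(nrow(educ))    # the number of cases (rows)
```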
There are two primary ways to access data in a data frame. First, you can access it by position. When the data frame is created, each column is assigned a number in addition to its label. In the example above, "county_name" is in position one and "total_population" is in position two. These numbers can be used to create a subset of our data. To do so, we use brackets: either double brackets "[[ ]]" or single brackets "[ ]". One pair of brackets means "create a new data frame from just this column." To create this data frame, we will use "counties<-educ[1]". This assigns a data frame containing only the first column (county_name) to a new variable called "counties." Type "counties" after assigning the variable to see the results.
Notice that we created a new variable name for our new data frame. Just typing "educ[1]" without assigning it to a variable would have caused R to print the data frame to the screen, but it wouldn't be saved for reuse. This would be fine if, for example, we just wanted to print the frequencies of the values in this column, but storing it in a variable provides a way to reuse it as needed.
In our previous example, we used single brackets, as in "educ[1]", which returned a data frame. Using double brackets instead returns the column contents as a vector. For example, "popVector<-educ[[2]]" stores the contents of the second column, total_population, in a vector called "popVector."
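The difference between the two kinds of brackets is easiest to see side by side. The sketch below assumes a small hand-made stand-in for "educ" whose values are invented placeholders; only the column names come from this guide.

```r
# Stand-in data frame; every value is a placeholder.
educ <- data.frame(
  county_name      = c("Adams", "Alexander", "Bond", "Boone", "Brown"),
  total_population = c(1000, 2000, 3000, 4000, 5000),
  median_income    = c(10000, 20000, 30000, 40000, 50000),
  stringsAsFactors = FALSE
)

counties  <- educ[1]      # single brackets: a one-column data frame
popVector <- educ[[2]]    # double brackets: the column's contents as a vector

print(class(counties))        # "data.frame"
print(is.vector(popVector))   # TRUE
print(popVector)
```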
A Vector is a one dimensional data structure (as opposed to the two dimensional data frame) in which all values have the same mode. The values for any single column in a data frame must also have the same mode, which is why we have the opportunity to store single columns as a vector. (Note: Different columns in a data frame may have different modes, however.) You can also create a vector independent of any data frame using the "c()" command as shown below, with the values to be stored inside the parentheses and separated by commas. The syntax here is "aVector<-c(1,2)".
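The single-mode rule can be seen directly. In this short sketch, "aVector" is the vector from the text, while "mixed" is a made-up name used to show what R does when numbers and text are combined in one "c()" call.

```r
aVector <- c(1, 2)        # two numbers
print(mode(aVector))      # "numeric"

# Mixing a number with text does not fail; R converts everything to character
# so that the vector still has a single mode.
mixed <- c(1, "two")
print(mode(mixed))        # "character"
print(mixed)              # "1"   "two"
```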
One reason we might want to do this is to use vectors to select multiple columns from our data frame. The syntax here is "educ[c(1,3)]".
Placing the vector in the brackets created a new data frame consisting of "county_name" and "median_income." Instead of just putting the individual column numbers between the brackets, I had to use the vector "c(1,3)."
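Why the vector is needed becomes clearer when the two forms are compared. The sketch below again assumes a stand-in "educ" filled with placeholder values; "selected" and "oneCell" are names made up for this example.

```r
# Stand-in data frame; every value is a placeholder.
educ <- data.frame(
  county_name      = c("Adams", "Alexander", "Bond", "Boone", "Brown"),
  total_population = c(1000, 2000, 3000, 4000, 5000),
  median_income    = c(10000, 20000, 30000, 40000, 50000),
  stringsAsFactors = FALSE
)

selected <- educ[c(1, 3)]   # columns 1 and 3, returned as a new data frame
print(names(selected))      # "county_name" "median_income"

# Without c(), the comma separates a row from a column instead:
oneCell <- educ[1, 3]       # row 1, column 3
print(oneCell)
```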
Vectors are also useful when selecting a single value from the table. To demonstrate, define a variable "piDigits" that contains the numbers "3, 1, 4, 1, 5, 2, 6." The command for this is "piDigits<-c(3, 1, 4, 1, 5, 2, 6)". To select a single value from piDigits, put the position number of the value in single brackets after the variable name. For example, "piDigits[3]" returns the third value, 4; replace the number in brackets with your desired position number.
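As a quick sketch of selecting by position, using the same vector as the text (the name "thirdValue" is made up for the example):

```r
piDigits <- c(3, 1, 4, 1, 5, 2, 6)

thirdValue <- piDigits[3]   # positions are counted from 1
print(thirdValue)           # 4

print(length(piDigits))     # 7 values, so valid positions run from 1 to 7
```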
Just as a single column is selected from a data frame, a single value is selected from a vector by adding its position in brackets after the variable name. So when we select a column from a data frame as a vector, we can apply the same technique to retrieve a single value. For example, "educ[[1]]" retrieves a vector containing the names of our counties. Since "educ[[1]]" is a vector, we can select a single value by appending a position in single brackets, as in "educ[[1]][5]".
This returns the county name in the position you specify in the single brackets (here, the fifth). "educ[[1]]" on its own will print all the county names with their position numbers.
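The two steps can be chained or written separately. The sketch below uses a stand-in "educ" with placeholder values, so the county it prints reflects the stand-in, not the real file; "countyNames" and "fifthCounty" are made-up names.

```r
# Stand-in data frame; every value is a placeholder.
educ <- data.frame(
  county_name      = c("Adams", "Alexander", "Bond", "Boone", "Brown"),
  total_population = c(1000, 2000, 3000, 4000, 5000),
  median_income    = c(10000, 20000, 30000, 40000, 50000),
  stringsAsFactors = FALSE
)

countyNames <- educ[[1]]        # step 1: the first column as a vector
fifthCounty <- countyNames[5]   # step 2: the fifth value of that vector

print(fifthCounty)
print(educ[[1]][5])             # the same thing in a single expression
```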
It's also possible to select a column by its label, which is very useful when dealing with large datasets that don't easily fit on the screen. Instead of using the column's numerical position, supply the name of the column, in quotes, in between either the single or double bracket (and remember, single brackets make a new data frame with the selected column or columns, whereas the double brackets create a new vector containing the selected column's contents). The syntax here is "educ['county_name']".
R provides a shortened version of this access by label method as well, using the "$". Notice that no quotes are necessary with this method and that it returns a vector. The syntax here is "educ$median_income".
Also, as with access by position, multiple labels can be arranged in a vector to select multiple columns, and a position can be appended to select a single value from a column returned as a vector. For example, "educ[["county_name"]][5]" returns the 5th value in county_name, and "educ[c('county_name', 'total_population')]" creates a data frame containing the county name and total population.
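The label-based forms described above can be compared in one place. This is a sketch against a stand-in "educ" with placeholder values; the variable names on the left-hand side are made up for illustration.

```r
# Stand-in data frame; every value is a placeholder.
educ <- data.frame(
  county_name      = c("Adams", "Alexander", "Bond", "Boone", "Brown"),
  total_population = c(1000, 2000, 3000, 4000, 5000),
  median_income    = c(10000, 20000, 30000, 40000, 50000),
  stringsAsFactors = FALSE
)

nameFrame  <- educ["county_name"]       # single brackets: one-column data frame
nameVector <- educ[["county_name"]]     # double brackets: vector
income     <- educ$median_income        # dollar sign: vector, no quotes needed
fifthName  <- educ[["county_name"]][5]  # one value from the column
twoColumns <- educ[c("county_name", "total_population")]  # two-column data frame

print(class(nameFrame))
print(fifthName)
print(names(twoColumns))
```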
Data Frames and Vectors are the most commonly dealt with data structures in R, but users are likely, at some point, to run across two others. Lists are similar to vectors, but without the restriction that vectors have to handle only a single mode of data. A single list can hold many types of data and each element can be labeled. While it's a good idea to know how to use and access these structures, it's not likely you'll need them right away and they won't be used in the remaining portions of this guide, so feel free to skip this section for now and come back to it later.
To create a list use the "list" function, as shown below. The syntax here is "newlst<-list('yes', 'no', 0)".
Once a list has been created, its contents can be accessed using either double brackets or a vector (so "newlst[[3]]" in the example above would return 0, and "newlst[c(1, 3)]" would return a shorter list containing "yes" and 0). Labels can also be assigned to each element in the list. The syntax here is "salad<-list(dressing='yes', onions='no', croutons=0)".
The names act as labels for accessing the list values. To access list values, you can use double brackets, a vector, or the label appended to the list name with a dollar sign. For example, "salad[["dressing"]]" and "salad$dressing" both return "yes", and "salad[c("onions", "croutons")]" returns a list with the elements $onions ("no") and $croutons (0).
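The list commands above can be tried together. The lists are the ones from the text; the names "third", "firstAndThird", and "toppings" are made up for this sketch.

```r
newlst <- list('yes', 'no', 0)
third         <- newlst[[3]]       # double brackets: the element itself
firstAndThird <- newlst[c(1, 3)]   # vector in single brackets: a shorter list

salad <- list(dressing = 'yes', onions = 'no', croutons = 0)
print(salad[["dressing"]])         # access by label with double brackets
print(salad$dressing)              # the same element using the dollar sign

toppings <- salad[c("onions", "croutons")]   # a list holding two labeled elements
print(toppings)
```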
Matrices are like data frames that can only hold one mode of data. Notice that vectors and matrices are named after mathematical concepts: that's very much by design. R's matrices simply extend the mathematical functionality of the vector into multiple dimensions, allowing you to perform calculations from linear algebra or advanced statistics. While these calculations are beyond the scope of this guide (see Recommended Resources for a list of helpful resources for moving forward with R), you may run into a situation where you need to access items from a matrix, or read R code that does so.
There are two basic ways to create a Matrix. First of all, it's common to create a Matrix where every entry contains the same value. To make a 4 by 3 matrix where every entry is zero we'll use the following command. The syntax here is "zeros<-matrix(0, 4, 3)".
The first number in the parentheses is the value that we're using to populate the matrix, the second is its height (the number of rows), and the third is its width (the number of columns). A matrix can also be created from a preexisting vector by replacing the repeated value with a vector name. First, define the vector with "v1<-c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)". Then use "sequence<-matrix(v1, 4, 3)" to create a matrix populated with the predefined values. By default, R fills the matrix one column at a time, so the first column holds 1 through 4.
Notice that when R displays the matrix, the rows are labeled "[1,]," "[2,]," and so on, and the columns "[,1]," "[,2]," and so on. Displaying a row or column can be accomplished using the same notation. To access the single value in the second column of the third row in this example, use "sequence[3, 2]". To return the entire third row, use "sequence[3,]", and to return the second column, use "sequence[,2]".
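The matrix commands from this section can be run together as follows. The matrices are the ones defined in the text; "singleValue", "thirdRow", and "secondColumn" are names made up for this sketch.

```r
zeros <- matrix(0, 4, 3)           # 4 rows, 3 columns, every entry zero
print(dim(zeros))                  # 4 3

v1 <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
# The name "sequence" follows the text. Base R also has a function called
# sequence(); it still works, because R looks for a function when a name is called.
sequence <- matrix(v1, 4, 3)       # filled one column at a time
print(sequence)

singleValue  <- sequence[3, 2]     # third row, second column
thirdRow     <- sequence[3, ]      # the whole third row
secondColumn <- sequence[, 2]      # the whole second column

print(singleValue)                 # 7
print(thirdRow)                    # 3 7 11
print(secondColumn)                # 5 6 7 8
```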