University Library, University of Illinois at Urbana-Champaign

Introduction to R: Get Statistics

A brief tutorial on the R programming language.

Using R's Built-in Functions

Commands like "read.csv" that can be called in R to perform specific tasks are called functions, but functions are not unique to R. Programmers use the word "function" to refer to a piece of code that can be called by a specific name, which allows them to use that code over and over again. Typically, functions are given some input and then perform a specific task on it. They can be thought of as little machines that take data and transform it somehow, the same way kitchen appliances take in and transform ingredients. The "read.csv" function takes a string of characters, looks for a file with that name and, if it finds one, attempts to create a data frame in R using the contents of the file. Functions in R take input in parentheses immediately following the function name and produce output that might be printed to the console window, saved into a variable, or, in the case of charts and graphs, displayed in a new window. Before we start playing with R's statistical functions, here are a few helpful facts:

  • In R, input is passed to functions by placing it in parentheses immediately following the function name. A piece of data passed to a function in parentheses is called an argument. For a given function to work properly, the arguments usually need to be in a certain format (e.g. the argument to "read.csv" must be a text string that gives the location of an existing CSV file) or an error is raised. A function may have zero, one or many arguments.
  • There is a conventional vocabulary related to functions. As discussed above, input is "passed" as "arguments." We also say that we "call" a function when we use it on the command line or in code. If a function produces output that can be saved to a variable or printed on the screen, we say the function "returns" that output. As with input, the output is usually specified to be in a specific mode.
  • A function can be defined to behave differently when called with different types of arguments. For example, "summary" (see below) can be called on almost any type of data or model and returns the summary statistics that R's designers deemed most relevant for that mode.
  • Sometimes operators appear in the arguments. Operators are symbols like "+", "-", and "=" that perform well-understood mathematical or logical operations. For example, in the "t.test" example below we use the argument "mu=48000." We'll cover what that means below, but for now just know that the equal sign serves a purpose.
  • Calling the "help" function with a function in the arguments will open a browser window to the R documentation page for that function. This will describe arguments, return values, and function behavior in detail.
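The vocabulary in the list above can be seen in a few short calls. This is only a minimal sketch: the "incomes" vector and its values are invented for illustration and are not part of the tutorial's data.

```r
# A small numeric vector to use as input (illustrative values only).
incomes <- c(41000, 52000, 47500, 60250, 38900)

# Call "mean" with one argument and save the value it returns in a variable.
avg <- mean(incomes)

# Pass a second argument by name: "digits = -3" rounds to the nearest thousand.
rounded <- round(avg, digits = -3)

# Some functions take zero arguments; Sys.Date() returns today's date.
today <- Sys.Date()

# "class" reports what kind of object each function returned.
print(class(avg))
print(class(today))

# Open the documentation for a function (uncomment in an interactive session).
# help(mean)
```

Here "mean" and "round" each return a number, while "Sys.Date" returns a date object, which is why "class" reports something different for each.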

Getting a statistical summary

Use the "summary" command to obtain a summary of basic statistics for every variable in a data frame ("summary" can also be used on a vector or matrix). For numeric variables, the statistics shown include "Min." for minimum, "1st Qu." for first quartile, "Median," "Mean," "3rd Qu." for third quartile, and "Max." for maximum. For character variables, R attempts to count the frequency of each response. Notice that R attempts to provide statistics for the "county_name" variable, even though its values don't lend themselves to statistical summary. For variables like this one, which are meant only to identify the row, it attempts to count the number of responses for each county name, even though each name appears only once. Below, we called "summary" on our "educ" variable. The syntax here is "summary(educ)".

[Figure: Summary]
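Since the "educ" data file is not reproduced here, the following sketch builds a small stand-in data frame to show the same call. The name "educ_demo" and every value in it are hypothetical; only the column names are borrowed from the tutorial. Note that in recent versions of R, character columns are reported by length and class rather than by counts, while factor columns are counted by level.

```r
# Hypothetical stand-in for the "educ" data frame; all values are invented.
educ_demo <- data.frame(
  county_name = c("A", "B", "C", "D", "E", "F"),
  median_income = c(41000, 52000, 47500, 60250, 38900, 49000),
  bachelors_or_higher = c(18.5, 31.2, 22.0, 40.1, 15.3, 27.8),
  stringsAsFactors = FALSE
)

# Summarize every column of the data frame at once.
print(summary(educ_demo))

# Summarizing one numeric column returns the six statistics as a named object.
income_summary <- summary(educ_demo$median_income)
print(names(income_summary))

# Individual statistics can be pulled out by name.
print(income_summary[["Min."]])
```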

Generating Crosstabs

To obtain frequencies and cross-tabulations, use the "table" function. When called with one variable in the parentheses, it returns a frequency table for that variable. With two variables in the parentheses, R displays a cross-tab. The syntax here is "table(educ$inc_greater_than_ave, educ$greater_than_thirty_percent_have_bachelors)".

[Figure: Crosstabs]
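As a rough illustration of both forms of the call, here is a sketch using a made-up data frame. The name "educ_demo" and the yes/no values are assumptions for demonstration; the real "educ" columns may be coded differently.

```r
# Hypothetical yes/no indicators for eight counties; values are invented.
educ_demo <- data.frame(
  inc_greater_than_ave =
    c("yes", "no", "no", "yes", "no", "yes", "no", "no"),
  greater_than_thirty_percent_have_bachelors =
    c("yes", "no", "no", "yes", "no", "no", "no", "yes")
)

# One variable: a frequency table.
freq <- table(educ_demo$inc_greater_than_ave)
print(freq)

# Two variables: a cross-tab (rows come from the first, columns from the second).
crosstab <- table(educ_demo$inc_greater_than_ave,
                  educ_demo$greater_than_thirty_percent_have_bachelors)
print(crosstab)

# Individual cells can be read by row and column label.
print(crosstab["yes", "yes"])
```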

Common Statistical Tests

Chi-Squared:

To perform a chi-squared test, you'll first need to create a contingency table (cross-tab) and store it in a variable. When R's "summary" function is called on a contingency table, the output includes the results of a chi-squared test of independence. The syntax here is "tab<-table(educ$inc_greater_than_ave, educ$greater_than_thirty_percent_have_bachelors)", then "summary(tab)".

[Figure: Chi Squared]
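The two-step pattern can be sketched as follows. The counts are invented, and the vectors "income_flag" and "bachelors_flag" are hypothetical stand-ins for the two "educ" columns. The sketch also compares the result with R's "chisq.test" function called with "correct = FALSE", which should give the same statistic for a table like this one.

```r
# Hypothetical indicators for 100 counties; the counts are invented.
income_flag    <- rep(c("yes", "yes", "no", "no"), times = c(30, 10, 15, 45))
bachelors_flag <- rep(c("yes", "no", "yes", "no"), times = c(30, 10, 15, 45))

# Step 1: build the contingency table and store it in a variable.
tab <- table(income_flag, bachelors_flag)

# Step 2: calling "summary" on the table reports a chi-squared test.
result <- summary(tab)
print(result)

# The test statistic, degrees of freedom, and p-value are stored by name.
print(result$statistic)
print(result$parameter)
print(result$p.value)

# Same Pearson statistic via chisq.test, without the continuity correction.
check <- chisq.test(tab, correct = FALSE)
print(check$statistic)
```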

 

One Sample t-Test:

To test whether a sample plausibly comes from a population with a given mean, use the "t.test" function. In the example below, we are testing whether the mean of the county-level median incomes for Illinois is 48,000. The syntax here is "t.test(educ$median_income, mu=48000)".

[Figure: t Test]

Notice that, in this case, the "t.test" function takes two arguments. The first is the set of data, which in this case is educ's column for median_income. The second is the population mean that we're testing against. Instead of just passing the mean as a bare number, we pass it by name: "t.test" has an argument called "mu" that holds the hypothesized mean, and the equals sign assigns our value to that argument. "mu" is optional and has a default value of zero, so if we left it out, "t.test" would test against a mean of zero. In this case, I set "mu=48000," and since the p-value in the output is greater than .05, we fail to reject the hypothesis that the population mean is 48000. In other words, the data are consistent with a true mean of 48000, although the test does not prove that the mean is exactly that value.
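To make the role of "mu" concrete, here is a small sketch that runs the test twice, once with "mu" set and once with the default. The "median_income" vector is invented for illustration and is far smaller than the real data.

```r
# Hypothetical county median incomes; values are invented.
median_income <- c(41000, 52000, 47500, 60250, 38900, 49000, 45300, 51800)

# Test against a hypothesized population mean of 48000.
result <- t.test(median_income, mu = 48000)
print(result)

# The p-value and the hypothesized mean are stored in the returned object.
print(result$p.value)
print(result$null.value)

# Leaving "mu" out tests against the default hypothesized mean of zero.
default_result <- t.test(median_income)
print(default_result$null.value)
```

With these made-up numbers the first test gives a large p-value, so we would not reject a mean of 48000, while the second test (against zero) gives a very small one.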

One-Way ANOVA:

So now let's ask ourselves a real-world question: can we show a relationship between income or population and education? To help us answer this question, we can use one-way ANOVA. R's "oneway.test" function allows us to define a factor by which to divide another variable (provided they both come from the same data frame). In one step, the "oneway.test" function allows us to split "median_income" by the categorical variable "greater_than_thirty_percent_have_bachelors." It then tests the hypothesis that the group means are identical. Since the p-value here is less than .05, we reject that hypothesis and conclude that the group means likely differ. The syntax here is "oneway.test(educ$median_income~educ$greater_than_thirty_percent_have_bachelors)".

[Figure: ANOVA]
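A minimal sketch of the same kind of test on invented data follows. The data frame "educ_demo" and its values are hypothetical, and the sketch uses the "data" argument rather than the "educ$" prefix, which is an equivalent way to point the formula at a data frame. By default "oneway.test" does not assume equal variances across groups; setting "var.equal = TRUE" gives the classical one-way ANOVA.

```r
# Hypothetical data: five counties in each education group; values are invented.
educ_demo <- data.frame(
  median_income = c(41000, 43500, 39800, 45200, 42100,
                    58000, 61500, 55300, 63200, 59900),
  greater_than_thirty_percent_have_bachelors = rep(c("no", "yes"), each = 5)
)

# Default: Welch-type test that does not assume equal group variances.
result <- oneway.test(
  median_income ~ greater_than_thirty_percent_have_bachelors,
  data = educ_demo
)
print(result)

# Classical one-way ANOVA, assuming equal variances in each group.
classic <- oneway.test(
  median_income ~ greater_than_thirty_percent_have_bachelors,
  data = educ_demo,
  var.equal = TRUE
)
print(classic$p.value)
```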

Regression and Statistical Models

To produce a linear model of the educ data we can use the "lm" function. This function creates a linear model for two or more variables using the Ordinary Least Squares method. Here's a simple regression using the percentage of the population with a bachelor's degree or higher to predict median household income. The syntax here is "model<-lm(educ$median_income~educ$bachelors_or_higher)", then calling "model", and "summary(model)".

[Figure: lm functions]

Notice in the above example that I saved the results of the "lm" function into a variable called "model" and then called "summary" on the variable. The tilde in the parentheses tells R that the variables are meant as part of a formula, with the dependent variable on the left and the independent variable(s) on the right. The "lm" function by itself just outputs the coefficients and intercept for the model formula, but we can use the "summary" function to get more useful information, like the value for R squared and the residual standard error. A number of other functions exist to give information about linear models. Some of the more important are "resid," which returns the model's residuals, "coef," which returns the model's coefficients, and "anova," which generates an ANOVA table for the model.
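The helper functions mentioned above can be tried on a small model. This is a sketch on invented data: "educ_demo" and its values are hypothetical, and the "data" argument is used in place of the "educ$" prefix.

```r
# Hypothetical data for eight counties; values are invented.
educ_demo <- data.frame(
  median_income = c(41000, 52000, 47500, 60250, 38900, 49000, 45300, 51800),
  bachelors_or_higher = c(18.5, 31.2, 22.0, 40.1, 15.3, 27.8, 20.4, 29.9)
)

# Dependent variable on the left of the tilde, independent on the right.
model <- lm(median_income ~ bachelors_or_higher, data = educ_demo)

# Printing the model shows only the intercept and coefficient.
print(model)

# "summary" adds R squared and the residual standard error, among other things.
model_summary <- summary(model)
print(model_summary$r.squared)
print(model_summary$sigma)

# Other helpers: coefficients, residuals, and an ANOVA table.
print(coef(model))
print(resid(model))
print(anova(model))
```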

If you want to perform a multiple regression (a regression with more than one independent variable), you can call "lm" with more than one variable on the right side of the tilde and join them with the "+" sign. The syntax here is "model<-lm(educ$median_income~educ$bachelors_or_higher+educ$some_college+educ$high_school)", then calling "summary(model)".

[Figure: multivariate regression]
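Extending the formula with "+" might look like the sketch below. Again, "educ_demo" and all of its values are invented for illustration; only the column names come from the tutorial.

```r
# Hypothetical data for eight counties; values are invented.
educ_demo <- data.frame(
  median_income = c(41000, 52000, 47500, 60250, 38900, 49000, 45300, 51800),
  bachelors_or_higher = c(18.5, 31.2, 22.0, 40.1, 15.3, 27.8, 20.4, 29.9),
  some_college = c(30.1, 27.4, 31.0, 24.8, 28.9, 29.5, 32.2, 26.7),
  high_school = c(38.2, 29.0, 34.5, 24.1, 40.7, 30.3, 35.0, 31.6)
)

# Join the independent variables on the right of the tilde with "+".
model <- lm(
  median_income ~ bachelors_or_higher + some_college + high_school,
  data = educ_demo
)

# One coefficient per independent variable, plus the intercept.
print(coef(model))

# Full summary with standard errors, p-values, and R squared.
print(summary(model))
```

With only eight rows and three predictors, this toy model has very few residual degrees of freedom, so its output is useful only for seeing the shape of the result, not for drawing conclusions.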

For a comprehensive look at R's built-in statistical modeling capabilities, which extend far beyond OLS linear models, see the Official R Tutorial.

Trying to do something you don't see here?

Perhaps R's greatest strength is that it's open source, which means not only that it's free, but also that anyone who wants to can add to it. R is a serious data manipulation tool and has gained wide, real-world acceptance in industry and academia, and the statisticians, scientists, accountants and managers who use it are constantly adding packages that automate tasks in whatever area they specialize in. Often these expert users make these functions available through the Comprehensive R Archive Network, or CRAN. The packages available through CRAN can greatly extend R's functionality, so you can apply R in nearly every area of statistical computing.

Scholarly Commons

Contact:
306 Main Library
Drop-ins welcome
Monday-Friday 8:30am-6:00pm
Phone: 217-244-1331