A brief tutorial on the R programming language.

Commands like "read.csv" that can be called in R to perform specific tasks are called functions, but functions are not unique to R. Programmers use the word "function" to refer to a piece of code that can be called by a specific name, which allows them to use that code over and over again. Typically, a function is given some input and performs a specific task on it. Functions can be thought of as little machines that take data and transform it somehow, the same way kitchen appliances take in and transform ingredients. The "read.csv" function takes a string of characters, looks for a file with that name and, if it finds one, attempts to create a data frame in R from the contents of the file. Functions in R take input in parentheses immediately following the function name and produce output that might be printed to the console window, saved into a variable, or, in the case of charts and graphs, displayed in a new window. Before we start playing with R's statistical functions, here are a few helpful facts:

- In R, input is passed to functions by placing it in parentheses immediately following the function name. A piece of data passed to a function in parentheses is called an argument. For a given function to work properly, the arguments usually need to be in a certain format (e.g. the argument to "read.csv" must be a text string that gives the location of an existing CSV file) or an error is raised. A function may have zero, one or many arguments.
- There is a conventional vocabulary related to functions. As discussed above, input is "passed" as "arguments." We also say that we "call" a function when we use it on the command line or in code. If a function produces output that can be stored in a variable or printed on the screen, we say the function "returns" that output. As with input, the output usually has a specific type.
- A function can be defined to behave differently when called with different types of arguments. For example, "summary" (see below) can be called on almost any type of data or model and returns the summary statistics that R's designers deemed most relevant for that type.
- Sometimes operators may appear in the arguments. Operators are symbols like "+", "-", and "=" that perform well-understood mathematical or logical operations. For example, in the "t.test" function example below we use the argument "mu=48000." We'll cover what that means below, but for now just know that the equals sign serves a purpose.
- Calling the "help" function with a function name as its argument opens the R documentation page for that function (in a browser window or a help pane, depending on how R is set up). The page describes the function's arguments, return value, and behavior in detail.
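The points above can be tried out in a short session. The following is only a sketch: the CSV contents and the names "csv_path", "educ_demo", and "income_mean" are made up for illustration and are not part of the tutorial's data.

```r
# Write a tiny, made-up CSV to a temporary file so read.csv has something to read.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("county,median_income",
             "A,41000",
             "B,52000",
             "C,47000"), csv_path)

# Call read.csv with one argument (a text string giving the file location)
# and save the data frame it returns into a variable.
educ_demo <- read.csv(csv_path)

# Call a function and let its return value print to the console.
mean(educ_demo$median_income)

# Call the same function with an additional named argument;
# the "=" ties the value TRUE to the argument called na.rm.
income_mean <- mean(educ_demo$median_income, na.rm = TRUE)

# summary behaves differently depending on what it is given.
summary(educ_demo$median_income)  # a numeric vector
summary(educ_demo)                # a whole data frame

# Open the documentation page for a function
# (commented out so the script also runs non-interactively).
# help("read.csv")
```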

To perform a chi-squared test, you'll first need to create a contingency table (crosstab) and store it in a variable. When R's "summary" function is called on a contingency table, the output includes the results of a chi-squared test of independence. The syntax here is "tab<-table(educ$inc_greater_than_ave, educ$greater_than_thirty_percent_have_bachelors)", then "summary(tab)".
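Here is a self-contained sketch of those two steps. Because the tutorial's data file is not reproduced here, the "educ" data frame below is simulated, so its numbers are assumptions for illustration only; the column names follow the text.

```r
set.seed(1)  # make the simulated values reproducible

# Simulated stand-in for the tutorial's educ data (values are invented).
n <- 100
bachelors_pct <- runif(n, min = 10, max = 50)
median_income <- 30000 + 500 * bachelors_pct + rnorm(n, sd = 5000)
educ <- data.frame(
  inc_greater_than_ave = median_income > mean(median_income),
  greater_than_thirty_percent_have_bachelors = bachelors_pct > 30
)

# Step 1: build the contingency table and store it in a variable.
tab <- table(educ$inc_greater_than_ave,
             educ$greater_than_thirty_percent_have_bachelors)
tab

# Step 2: summary() on a table reports a chi-squared test of independence.
result <- summary(tab)
result

# Individual pieces of the result can be pulled out by name.
result$p.value
```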

To test whether a sample plausibly comes from a population with a given mean, use the "t.test" function. In the example below, we test whether the mean of the median incomes of Illinois counties is 48,000. The syntax here is "t.test(educ$median_income, mu=48000)".

Notice that, in this case, the "t.test" function takes two arguments. The first is the set of data, which in this case is educ's column for median_income. The second argument is the population mean that we're testing against. Instead of just passing the mean as a bare number, we pass it by name: "t.test" has an argument called "mu" that holds the hypothesized mean, and the equals sign assigns our value to that argument. "mu" is an optional argument with a default value of zero, so if we left it out of the arguments, "t.test" would test against a mean of zero. In this case, I set "mu=48000," and since the p-value in the test output is greater than .05, we cannot reject the hypothesis that the true mean is 48000. In other words, the data are consistent with a population mean of 48000, although the test does not prove that the mean is exactly that value.
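To see the effect of the named argument, the sketch below runs the test twice, once with "mu" set and once with the default. The income values are simulated stand-ins for the tutorial's data, so the resulting p-values carry no real-world meaning.

```r
set.seed(42)  # make the simulated values reproducible

# Simulated stand-in for educ$median_income (values are invented).
educ <- data.frame(median_income = rnorm(100, mean = 48000, sd = 8000))

# Test against a hypothesized mean of 48000 by naming the mu argument.
result <- t.test(educ$median_income, mu = 48000)
result

# The returned object stores its pieces by name.
result$p.value
result$null.value   # the hypothesized mean that was tested

# Leaving mu out falls back to the default, a hypothesized mean of zero.
default_result <- t.test(educ$median_income)
default_result$null.value
```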

So now let's ask ourselves a real-world question: can we show a relationship between income or population and education? To help us answer this question, we can use one-way ANOVA. R's "oneway.test" function allows us to define a factor by which to divide another variable (here, both come from the same data frame). In one step, "oneway.test" lets us split "median_income" into groups by the categorical variable "greater_than_thirty_percent_have_bachelors." It then tests the hypothesis that the group means are identical. Since the p-value is less than .05, we reject that hypothesis here. The syntax here is "oneway.test(educ$median_income~educ$greater_than_thirty_percent_have_bachelors)".
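A minimal sketch of this call, again on simulated data (the values are invented; only the column names come from the text). It also shows an equivalent way to write the formula using the "data" argument of "oneway.test", which avoids repeating "educ$".

```r
set.seed(7)  # make the simulated values reproducible

# Simulated stand-in for the tutorial's educ data (values are invented).
n <- 100
bachelors_pct <- runif(n, min = 10, max = 50)
educ <- data.frame(
  median_income = 30000 + 500 * bachelors_pct + rnorm(n, sd = 5000),
  greater_than_thirty_percent_have_bachelors =
    factor(ifelse(bachelors_pct > 30, "yes", "no"))
)

# Split median_income by the categorical variable and test for equal group means.
result <- oneway.test(
  educ$median_income ~ educ$greater_than_thirty_percent_have_bachelors
)
result

# The same test, with the data frame supplied once through the data argument.
result2 <- oneway.test(
  median_income ~ greater_than_thirty_percent_have_bachelors,
  data = educ
)
result2$p.value
```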

To produce a linear model of the educ data we can use the "lm" function. This function fits a linear model for two or more variables using the Ordinary Least Squares method. Here's a simple regression that uses the percentage of the population with a bachelor's degree or higher to predict median household income. The syntax here is "model<-lm(educ$median_income~educ$bachelors_or_higher)", then calling "model", and "summary(model)".

Notice in the above example that I saved the result of the "lm" function into a variable called "model" and then called "summary" on the variable. The tilde in the parentheses tells R that the variables are part of a formula, with the dependent variable on the left and the independent variable(s) on the right. Printing the model by itself just outputs the intercept and coefficients for the model formula, but we can use "summary" to get more useful information, like the value for R squared and the residual standard error. A number of other functions exist to give information about linear models. Some of the more important are "resid," which returns a vector of the model's residuals, "coef," which returns the model's coefficients, and "anova," which generates an ANOVA table for the model.
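The functions mentioned here can be tried on a simulated data set. The sketch below invents its own "educ" values, so the fitted numbers are illustrative only; the column names follow the text.

```r
set.seed(3)  # make the simulated values reproducible

# Simulated stand-in for the tutorial's educ data (values are invented).
n <- 100
educ <- data.frame(bachelors_or_higher = runif(n, min = 10, max = 50))
educ$median_income <- 30000 + 500 * educ$bachelors_or_higher +
  rnorm(n, sd = 5000)

# Fit the model: dependent variable on the left of the tilde, predictor on the right.
model <- lm(educ$median_income ~ educ$bachelors_or_higher)
model                      # prints only the intercept and coefficient

# summary() adds R squared, the residual standard error, and more.
model_summary <- summary(model)
model_summary
model_summary$r.squared    # R squared
model_summary$sigma        # residual standard error

coef(model)                # intercept and slope
head(resid(model))         # first few residuals
anova(model)               # ANOVA table for the model
```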

If you want to perform a multiple regression, with more than one independent variable, you can call lm with more than one variable on the right side of the tilde and join them with the "+" sign. The syntax here is "model<-lm(educ$median_income~educ$bachelors_or_higher+educ$some_college+educ$high_school)", then calling "summary(model)".
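As a sketch of the same idea on simulated data (all values below are invented), the formula can also be written with bare column names plus the "data" argument of "lm", which is equivalent to prefixing each variable with "educ$" and gives shorter coefficient labels.

```r
set.seed(11)  # make the simulated values reproducible

# Simulated stand-in for the tutorial's educ data (values are invented).
n <- 100
educ <- data.frame(
  bachelors_or_higher = runif(n, min = 10, max = 50),
  some_college        = runif(n, min = 15, max = 35),
  high_school         = runif(n, min = 20, max = 45)
)
educ$median_income <- 25000 + 450 * educ$bachelors_or_higher +
  150 * educ$some_college + 50 * educ$high_school + rnorm(n, sd = 5000)

# Join several independent variables on the right of the tilde with "+".
model <- lm(median_income ~ bachelors_or_higher + some_college + high_school,
            data = educ)
summary(model)

# One coefficient per independent variable, plus the intercept.
coef(model)
```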

For a comprehensive look at R's built-in statistical modeling capabilities, which extend far beyond OLS linear models, see the Official R Tutorial.

Scholarly Commons