Commands like "read.csv" that can be called in R to perform specific tasks are called functions, but functions are not unique to R. Programmers use the word "function" to refer to a piece of code that can be called by a specific name, which allows them to use that code over and over again. Typically, a function is given some input and performs a specific task on it. Functions can be thought of as little machines that take data and transform it somehow, the same way kitchen appliances take in and transform ingredients. The "read.csv" function takes a string of characters, looks for a file with that name and, if it finds one, attempts to create a data frame in R from the contents of the file. Functions in R take input in parentheses immediately following the function name and produce output that might be printed to the console window, saved into a variable, or, in the case of charts and graphs, displayed in a new window. Before we start playing with R's statistical functions, here are a few helpful facts:
Use the "summary" function to obtain basic summary statistics for every variable in a data frame ("summary" can also be used on a vector or matrix). For numeric variables, the statistics shown include "Min." for minimum, "1st Qu." for first quartile, "Median," "Mean," "3rd Qu." for third quartile, and "Max." for maximum. For categorical variables stored as factors, R counts the frequency of each response. (Since R 4.0, "read.csv" keeps text columns as plain character vectors unless "stringsAsFactors = TRUE" is set, and "summary" then reports only their length and type.) Notice that R attempts to provide statistics for "county_name," even though the values don't lend themselves to statistical summary. For a column like this, which is meant to identify the row, it attempts to count the number of responses for each county name, even though each name appears only once. Below, we called "summary" on our "educ" variable. The syntax here is "summary(educ)".
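As a minimal sketch of what "summary" returns, here is a tiny stand-in for the "educ" data frame. The data frame "educ_demo" and all of its values are made up for illustration; only the column names echo the ones used in this tutorial.

```r
# Hypothetical stand-in for the tutorial's educ data frame (values invented).
educ_demo <- data.frame(
  county_name   = c("Adams", "Bond", "Cook", "Kane", "Lake"),
  median_income = c(45000, 52000, 55000, 68000, 80000),
  stringsAsFactors = FALSE
)

# Summarize every column of the data frame at once.
summary(educ_demo)

# Summarize a single numeric column and keep the result in a variable.
income_summary <- summary(educ_demo$median_income)

# Individual statistics can be pulled out by the labels R prints.
income_summary[["Median"]]
income_summary[["Mean"]]
```

Saving the summary into a variable, as in the last lines, is handy when you want to reuse one of the statistics later instead of just reading it off the console.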
To obtain frequencies and cross-tabulations, use the "table" function. When called with one variable in the parentheses, it returns a frequency table for that variable. With two variables in the parentheses, R displays a cross-tab, with the first variable as rows and the second as columns. The syntax here is "table(educ$inc_greater_than_ave, educ$greater_than_thirty_percent_have_bachelors)".
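To see both forms side by side, here is a short sketch on invented data. The data frame "educ_demo" and its six rows are hypothetical; the column names match the ones used above.

```r
# Hypothetical yes/no indicators for six counties (values invented).
educ_demo <- data.frame(
  inc_greater_than_ave = c("yes", "yes", "no", "no", "yes", "no"),
  greater_than_thirty_percent_have_bachelors = c("yes", "no", "no", "no", "yes", "yes")
)

# One variable: a frequency table.
freq <- table(educ_demo$inc_greater_than_ave)
freq

# Two variables: a cross-tab (first argument = rows, second = columns).
cross <- table(educ_demo$inc_greater_than_ave,
               educ_demo$greater_than_thirty_percent_have_bachelors)
cross

# Cells can be looked up by row and column label.
cross["yes", "yes"]
```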
To perform a chi-squared test, you'll first need to create a contingency table (cross-tab) and store it in a variable. When R's "summary" function is called on a contingency table, the output includes the results of a chi-squared test for independence of the table's variables. The syntax here is "tab<-table(educ$inc_greater_than_ave, educ$greater_than_thirty_percent_have_bachelors)", then "summary(tab)".
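The two steps can be sketched on invented data as follows. The vectors "inc" and "bach" are hypothetical stand-ins for the two "educ" columns, with counts chosen only so the table is large enough for the test to be reasonable.

```r
# Hypothetical indicators for 100 counties (counts invented).
inc  <- rep(c("yes", "no"), times = c(40, 60))
bach <- c(rep(c("yes", "no"), times = c(30, 10)),   # the 40 higher-income counties
          rep(c("yes", "no"), times = c(15, 45)))   # the 60 lower-income counties

# Step 1: build the contingency table and store it.
tab <- table(inc, bach)

# Step 2: summary() on a table reports a chi-squared test of independence.
result <- summary(tab)
result

# The pieces of the result are available by name.
result$statistic   # chi-squared statistic
result$parameter   # degrees of freedom
result$p.value

# chisq.test() without the continuity correction runs the same Pearson test.
chisq.test(tab, correct = FALSE)
```

If you want the test by itself rather than as part of a summary, "chisq.test" is the more direct route; note that for 2x2 tables it applies a continuity correction by default, which is why "correct = FALSE" is passed here to match.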
To test whether a sample is consistent with coming from a population with a given mean, use the "t.test" function. In the example below, we are testing whether the mean of the county median incomes for Illinois is 48,000. The syntax here is "t.test(educ$median_income, mu=48000)".
Notice that, in this case, the "t.test" function takes two arguments. The first is the set of data, which in this case is educ's column for median_income. The second argument is the population mean that we're testing against. Instead of just passing the mean as a bare number, we pass it by name using "t.test"'s argument "mu". Writing "mu=48000" tells R which argument the value belongs to; "mu" is an optional argument with a default value that we're overriding. If we left "mu" out of the arguments, the "t.test" function would be called with zero as the mu value. In this case, I set "mu=48000," and since the test reports a p-value greater than .05, we fail to reject the hypothesis that the true mean is 48,000. In other words, the sample is consistent with a population mean of 48,000, although the test does not prove that the mean is exactly that value.
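Here is a small sketch of the named "mu" argument and its default. The "income" vector is invented for illustration and is not the tutorial's data.

```r
# Hypothetical county median incomes (values invented).
income <- c(45000, 52000, 55000, 68000, 80000, 47000, 51000, 49000)

# Test against a hypothesized population mean of 48,000.
res <- t.test(income, mu = 48000)
res

# The result is a list; useful pieces can be extracted by name.
res$p.value      # p-value of the test
res$estimate     # sample mean
res$null.value   # the mu we tested against

# Leaving mu out falls back to the default of zero.
res_default <- t.test(income)
res_default$null.value
```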
So now let's ask ourselves a real-world question: can we show a relationship between income or population and education? To help us answer this question, we can use one-way ANOVA. R's "oneway.test" function allows us to define a factor by which to divide another variable. In one step, "oneway.test" allows us to split "median_income" by the categorical variable "greater_than_thirty_percent_have_bachelors." It then tests the hypothesis that the group means are identical, which we reject here, since the p-value is less than .05. The syntax here is "oneway.test(educ$median_income~educ$greater_than_thirty_percent_have_bachelors)".
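A minimal sketch of the same call on invented data follows. The data frame "educ_demo" is hypothetical, and the formula is written with a "data" argument instead of the "educ$" prefix, which is an equivalent and slightly tidier way to say the same thing.

```r
# Hypothetical data: income split by a yes/no education indicator (values invented).
educ_demo <- data.frame(
  median_income = c(45000, 47000, 49000, 51000, 68000, 72000, 80000, 75000),
  greater_than_thirty_percent_have_bachelors = rep(c("no", "yes"), each = 4)
)

# One-way test of equal group means. By default oneway.test does not
# assume equal variances across groups (var.equal = FALSE).
res <- oneway.test(median_income ~ greater_than_thirty_percent_have_bachelors,
                   data = educ_demo)
res

# A small p-value is evidence against the group means being equal.
res$p.value
```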
To produce a linear model of the educ data we can use the "lm" function. This function fits a linear model for two or more variables using the Ordinary Least Squares method. Here's a simple regression that uses the percentage of the population with a bachelor's degree or higher to predict median household income. The syntax here is "model<-lm(educ$median_income~educ$bachelors_or_higher)", then calling "model", and "summary(model)".
Notice in the above example that I saved the results of the "lm" function into a variable called "model" and then called "summary" on the variable. The tilde in the parentheses tells R that the variables are meant as part of a formula, with the dependent variable on the left and the independent variable(s) on the right. Printing the result of "lm" by itself just outputs the intercept and coefficients for the model formula, but we can use the "summary" function to get more useful information, like the value of R squared and the residual standard error. A number of other functions exist to give information about linear models. Some of the more important are "resid," which returns a vector of the model's residuals, "coef," which returns the model's coefficients, and "anova," which generates an ANOVA table for the model.
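The helper functions mentioned above can be sketched on a small invented data set. The data frame "educ_demo" and its values are hypothetical; the column names follow the tutorial.

```r
# Hypothetical data: percent with a bachelor's degree vs. median income (values invented).
educ_demo <- data.frame(
  bachelors_or_higher = c(10, 15, 20, 25, 30, 35),
  median_income       = c(40000, 46000, 49000, 56000, 59000, 66000)
)

# Dependent variable on the left of the tilde, independent on the right.
model <- lm(median_income ~ bachelors_or_higher, data = educ_demo)

model            # prints just the intercept and slope
summary(model)   # adds standard errors, R squared, residual standard error

coef(model)      # named vector: "(Intercept)" and "bachelors_or_higher"
resid(model)     # one residual per row of the data
anova(model)     # ANOVA table for the fitted model

# Individual values can be pulled out of the summary object.
summary(model)$r.squared
```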
If you want to perform a multiple regression, with more than one independent variable, you can call "lm" with more than one variable on the right side of the tilde and join them with the "+" sign. The syntax here is "model<-lm(educ$median_income~educ$bachelors_or_higher+educ$some_college+educ$high_school)", then calling "summary(model)".
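As a rough sketch, the same pattern with three predictors looks like this. The data frame "educ_demo" is invented for illustration, and with only eight rows the fit itself means very little; the point is the formula syntax.

```r
# Hypothetical data with three education percentages per county (values invented).
educ_demo <- data.frame(
  bachelors_or_higher = c(10, 15, 20, 25, 30, 35, 18, 28),
  some_college        = c(30, 28, 31, 27, 25, 26, 33, 24),
  high_school         = c(40, 38, 30, 32, 28, 22, 35, 30),
  median_income       = c(40000, 46000, 49000, 56000, 59000, 66000, 47000, 60000)
)

# Join the independent variables with "+" on the right of the tilde.
model <- lm(median_income ~ bachelors_or_higher + some_college + high_school,
            data = educ_demo)

summary(model)

# One coefficient per predictor, plus the intercept.
coef(model)
```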
For a comprehensive look at R's built-in statistical modeling capabilities, which extend far beyond OLS linear models, see the Official R Tutorial.
Perhaps R's greatest strength is that it's open source, which means not only that it's free, but also that anyone who wants to can add to it. R is a serious data manipulation tool and has gained wide, real-world acceptance in industry and academia, and the statisticians, scientists, accountants and managers who use it are constantly adding packages that automate tasks in whatever area they specialize in. Often these expert users make their packages available through the Comprehensive R Archive Network, or CRAN. The packages available through CRAN can greatly extend R's functionality, so you can apply R in nearly every area of statistical computing.