###############################
## R Tutorial - First Part   ##
###############################

# Please read the tutorial's general instructions first, and prepare
# for loading an external datafile.

# The content of this file can be pasted directly into the R console.
# It should report errors only at the points where you
# are supposed to have an external dataset available, and at the few
# commands that are marked below as not working on purpose.
# However, it is much handier to use the "display file" command
# within the R menu to look at this file, and paste command by command
# to the console, using control-V. Take your time to experiment a bit
# with the listed commands.

#---------------------------------------------------------------------
# .Rfirst: A Very Short Introduction to R including Data Manipulation
#---------------------------------------------------------------------

#  -----------------
#  | Introduction  |
#  -----------------

# Advantages and disadvantages of R:
#
# + Freely available
# + Contains basic and advanced statistical analysis routines
# + The code is easy to modify
#
# - In comparison to S-Plus, the menu system is less developed
# - Not as widely used yet as SAS or SPSS
#
# Important websites, books, information:
#
# Look at: www.r-project.org ;
# from there, follow the links under "Documentation"
#
# To use any package, you need to know how to reach the help files.
# There are two easy ways to do this in R.
# Typing, for example, help.search("sort")
# lists the functions whose help entries match the word "sort".
# Typing help(sort)
# shows the help page of a function whose name you already know,
# the one named "sort" in this case, so that you can see
# what it can do and how you have to supply the arguments.

# Try to search for some statistical terms you know, and take a
# look at the functions where they appear.
help.search("Shapiro")
help.search("mixed")

#  ---------------------
#  | Data Manipulation |
#  ---------------------

# Content:
# --------
#
# I.   Making data within R
# II.  Reading data in from elsewhere
# III. Manipulating/modifying existing data objects
# IV.  Summary statistics, calculations
# V.   Writing a function
# VI.  Tables

# I. Making data within R
#----------------------------

# It is quite easy to type data directly into R,
# or to construct lists of observations on a variable.
#! In general, missing values in R are coded "NA"
# For instance, let's type in data on numbers of beetles
# caught in pitfall traps:

trapped<-c(0,1,2,1,0,3,3)
trapped # will list the observations.

# We type c() around the observations because c() combines
# its arguments into a single vector (the "c" stands for "combine").

# Another way of doing the same uses the scan() function.
# In that case, you have to type in the observations yourself.
# Try using it. Run this command on its own: scan() reads whatever
# follows it on the console as data.

testtrapped<-scan()

# Supply observations separated by spaces or returns.
# An empty line (hitting return twice) ends the input.

# Different variables can have similar names. You can,
# for example, use a dot to give the names some structure.
# Assume that you have a second set of trappings
# c(4,1,2,1,0,3,3) that differs only in a single
# observation from the first one.
# An economical way to make the second column vector is:

trapped.day1<-trapped # make a new variable from old
trapped.day2<-trapped.day1 # make a second from old
trapped.day2[1]<-4 # modify trapped.day2
trapped.day2 # take a look at it

# Typing ls() returns all the objects stored in memory:
ls()

# By writing trapped.day2[1] we accessed the first
# element of the column vector trapped.day2.

trapped.day2[2] # prints element two.
trapped.day2[c(1,2,3)] # prints elements 1,2,3

# Beware, most datasets in R are stored as data matrices with rows
# and columns, and then a single index like in trapped.day2[1]
# is usually not what you want anymore.
# We see below how we can access the variables within such a data matrix.
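# A small optional sketch of the warning above: how a single index
# behaves for a vector and for a data matrix. The objects v.demo and
# d.demo and their values are made up for illustration only and are
# not part of the beetle data.
v.demo<-c(4,1,2) # a plain vector
v.demo[1] # the first element
d.demo<-data.frame(a=c(4,1,2),b=c(0,3,3)) # a small data matrix with two variables
d.demo[1] # the whole first column, not a single element
d.demo[1,1] # row 1, column 1: one element again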

# R also has a list of datasets that come with the program.
# You can view the list by typing
data()
# To read a dataset in from that list, say dataset "trees", type
data(trees)
ls() # now contains trees
summary(trees)

# Dataframe trees contains three variables.
# We can access the first variable "Girth" by typing
trees$Girth
# or
trees[,1] # note the comma
# A data matrix has rows and columns, therefore two indices
# (separated by a comma) are necessary to identify each element.
# Usually each variable has a separate column,
# and each observation is on a row.

trees[1,] # gives observation one
trees[1,1] # returns the first variable of observation one.

# II. Reading data in from elsewhere
#--------------------------------------

#!!Please stop here for a moment and read the following instructions.
#! We now give an easy way to read data into R and, at the same time,
#! save them in a data object.

# Prepare the datafile in, for example, the Excel software program.
# In that spreadsheet, the first line of the data matrix
# will contain the names of the variables (single words).
# This line is called the header line.
# Every observation is written on a separate line below the header line.
# Save your spreadsheet as a text file (e.g. "namefile.txt").
# In R: Change the working directory to the directory where the data are.
# Use the R menu for that.
# Now type the following command at the command line:

dataset<-read.table("namefile.txt",header=TRUE)

# The data will be read in as a data object with name "dataset".
# You can check that this object exists by typing
ls()
# ls prints all objects in your workspace. "dataset" should be among
# the objects listed.
# Or get a summary of what dataset contains as follows:
summary(dataset)
dim(dataset)
# gives the numbers of lines (observations) and columns (variables) in it.
names(dataset)
# lists the variables in the object with name dataset.
# (ls("dataset") or objects("dataset") do the same, but only after
# dataset has been attached, see below.)
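
# If you have no datafile at hand yet, the reading step above can be
# tried with a small made-up file. This is only a sketch: the object
# names demo.out, demo.file and demo.in, the variable names loc and
# count, and the values are all invented for illustration.
demo.out<-data.frame(loc=c("A","B","C"),count=c(0,3,1))
demo.file<-tempfile(fileext=".txt") # temporary file, removed again below
write.table(demo.out,demo.file,row.names=FALSE,quote=FALSE) # header line plus 3 observations
demo.in<-read.table(demo.file,header=TRUE) # read it back, as for "namefile.txt"
dim(demo.in) # numbers of observations and variables
names(demo.in) # the variable names taken from the header line
unlink(demo.file) # clean up the temporary file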

# If you type the following command, the variables in the object dataset
# can be accessed directly by giving their names:
attach(dataset)
# Try this for a variable within the dataset you have read in.
# For instance type loc at the prompt, if "dataset" contains a variable
# with that name.
# You can also try this with the trees data:
Girth # not found ....
attach(trees)
Girth # aha!
# The function detach() removes a dataset from the search path again,
# as soon as you don't really need direct access to its variables anymore.
# This avoids mixing up variables from different data objects.
detach(trees)

# III. Manipulating/modifying existing data objects
#----------------------------------------------------

# It is relatively easy to manipulate datasets within R.
# Concatenating different datasets, extracting parts of them
# and so on is possible as well.

# For instance, consider the object trapped.day2
# We can access element number eight of it by typing
trapped.day2[8]
# (trapped.day2 has only seven elements so far, so this returns NA)
# ....replace it by typing
trapped.day2[8]<-5
# such that trapped.day2 becomes
trapped.day2

# Replacing several elements at a time can be done as well:
trapped.day2[9:11]<-rep(1,3) # replaces elements 9 to 11 by ones.
# See
help(rep)
# for an explanation of the rep() function used here.
# Try to figure out what the ":" does for you.

trapped.day2[c(1,8)]<-c(1,2) # replaces elements 1, 8 by 1 and 2 resp.

# We can also add observations, even ones with missing values
trapped.day2[12]<-NA
trapped.day2[13]<-12

# Spreadsheet facilities are available by using
data.entry(trapped.day2)
# or we can use yet another way of editing,
edit(trapped.day2)
# edit() returns the edited object; assign the result to keep the
# changes, e.g. trapped.day2<-edit(trapped.day2)
# The function edit() can use different text editors such as
# vi, pico, emacs etc... For details, take a look at
help(edit)

# Using data.entry() to type in a new dataset is a little bit tricky.
# Suppose you want to make trapped.day3.
# This does not work:
data.entry(trapped.day3)
# But the following does work, since we initialize trapped.day3 immediately
data.entry(trapped.day3=c(NA))

# Concatenating column vectors goes as follows:
trapped.all<-c(trapped.day1,trapped.day2)

# For combining data matrices, you use rbind() or cbind().

# Some functions return logical TRUE or FALSE, for instance
trapped.day2==1
# returns TRUE for all elements of trapped.day2 that equal 1,
# FALSE otherwise (and NA where the value is missing).
# Sometimes you just want the positions where a condition holds:
which(trapped.day2==3)

# The function is.na() is also extremely handy.
# It tests for occurrence of missing values "NA"
is.na(trapped.all)
help(is.na)

# You can for instance replace missing values by zeroes as follows:
trapped.all[is.na(trapped.all)]<-0
trapped.all
# If trapped.all contained any missing values,
# they are now replaced by zeroes.

# IV. Summary statistics, calculations
#----------------------------------------

# Obviously, you can use R as a pocket calculator,
# but there's much more.
# We can do calculations on entire vectors,
trapped.day2-trapped.day1
# calculates the elementwise differences between them.
# Note: if you followed the steps above, trapped.day2 now has 13 elements
# and trapped.day1 only 7, so R recycles the shorter vector
# and issues a warning.

# The following functions do what you would expect already
# from reading their names:
length(trapped.all) # length of the vector
diff(trapped.all) # differences between successive observations
diff(c(1:10)) # Predict what this will do...

# Several sample statistics are available as well:
median(trapped.all)
mean(trapped.all)
var(trapped.all)
max(trapped.all)
sum(trapped.all)

# If your data vector contains NA's, missing values,
# you can add the option na.rm=TRUE to remove them, see
help(mean)
# for this.

# cumsum() returns a vector whose elements are
# the cumulative sums of the argument
cumsum(trapped.all)
cummax(trapped.all) # Similarly, cumulative maxima.
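
# A short sketch of the option for missing values mentioned above.
# The vector na.demo is made up for illustration; na.rm=TRUE is the
# argument described in help(mean).
na.demo<-c(0,1,NA,3)
mean(na.demo) # NA, because one observation is missing
mean(na.demo,na.rm=TRUE) # mean of the three observed values
sum(na.demo,na.rm=TRUE) # sum(), var(), max() and median() accept the same option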

# It is also possible to calculate a statistic on a subset of a
# data vector,
mean(subset(trapped.day2,trapped.day2>2))
# Beware, the following command calculates the proportion of
# observations with values larger than 2
# (na.rm=TRUE is needed as long as trapped.day2 contains a missing value):
mean(trapped.day2>2,na.rm=TRUE)

# V. Writing a function
#-------------------------

# Suppose you want the standard deviation of a variable
# and not just the variance, for which we used
# var() in the last section.
std(trapped.all) # This does not work.
# Let's make a function that calculates a standard deviation ourselves.
# We will call it std.
# For a vector x, it has to calculate the square root of the variance.
# It must become a function with one argument x
# that applies the functions sqrt() and var() to x.
# Voila:
std<-function(x)sqrt(var(x))
std(trapped.all) # This works!
# We could have avoided writing this function by searching the help first:
help.search("standard deviation")
# Then we would have seen that this is available:
sd(trapped.all) # Which produces the same result.

# VI. Tables
#--------------

# The function table() builds a contingency
# table of the counts.
# For instance, when we do that for trapped.all,
# we get a table listing how often each count occurs.
table(trapped.all)
help(table)

# -------------------------------------------------------
# When finished running this file, please continue with
# the ".Rgraphics" or ".Rclassics" tutorial files.
# -------------------------------------------------------

# Much of the material in this short tutorial comes from
# Using R for data analysis and graphics. An Introduction
# by J. H. Maindonald (2001)
# Using R for introductory Statistics
# by John Verzani (2002)
# Both are accessible via links in the documentation section of
# the R-project website.

# Tom Van Dooren, version 30/01/2003