###############################
 ##  R Tutorial - First Part  ##
 ###############################

# Please read the tutorial's general instructions first, and prepare 
# for loading an external datafile.
# The content of this file can be pasted directly into the R console.
# It should produce lots of errors only at the point where you 
# are supposed to have an external dataset available.
# However, it is much handier to use the "display file" command 
# within the R menu to look at this file, and paste command by command 
# to the console, using control-V.  Take your time to experiment a bit 
# with the listed commands.

#---------------------------------------------------------------------


# .Rfirst: A Very Short Introduction to R including Data Manipulation


#---------------------------------------------------------------------


# -----------------
# | Introduction  |
# -----------------

# Advantages and disadvantages of R:
#
# + Freely available 
# + Contains basic and advanced statistical analysis routines
# + The code is easy to modify
#
# - In comparison to S-Plus, the menu system is less developed
# - Not as widely used yet as SAS or SPSS
#
# Important websites, books, information:
#
# Look at: www.r-project.org ;
# from there, follow the links under "Documentation"
# 
# To use any package, it is important to know how to reach 
# the help files. There are two easy ways to do this in R: 
# either by typing, for example, 

help.search("sort")

# which will list all functions that contain the word "sort" 
# in their description, or via typing

help(sort)

# when you already know the name of a function, 
# the one named "sort" in this case, and you want to know 
# what it can do and how you have to supply the arguments.
# Try to search for some statistical terms you know, and take a 
# look at the functions where they appear.

help.search("Shapiro")
help.search("mixed")
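
# A few related shortcuts, added here as a small sketch (they are not
# part of the original tutorial text): ?sort is shorthand for help(sort),
# apropos() lists the objects whose names contain a string, and
# example() runs the examples from a help page.

?sort
apropos("sort") # object names containing "sort"
example(sort) # runs the examples given at the end of help(sort)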

# ---------------------
# | Data Manipulation |
# ---------------------

# Content:
# --------
#
# I.	Making data within R
# II.	Reading data in from elsewhere
# III.	Manipulating/modifying existing data objects
# IV.	Summary statistics, calculations
# V.	Writing a function
# VI.	Tables



# I. Making data within R
#----------------------------

# It is quite easy to type in data directly into R, 
# or to construct lists of observations on a variable.
#! In general, missing values in R are coded "NA".
# For instance, let's type in data on numbers of beetles 
# caught in pitfall traps:

trapped<-c(0,1,2,1,0,3,3)
trapped # will list the observations.

# We type c() around the observations because this function combines 
# them into a single vector (hence the "c").
# Another way of doing the same, uses the scan() function.
# In that case, you have to type in observations yourself.
# Try using it.

testtrapped<-scan() 

# Supply observations separated by spaces, or returns.
# You only need to hit return twice to end adding observations.
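
# For a non-interactive try-out, scan() can also read from a character
# string through its text argument. A minimal sketch; the name
# demo.scan is only illustrative:

demo.scan<-scan(text="0 1 2 1 0 3 3")
demo.scan # the same seven counts as in trapped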

# Different variables can have similar names. You can
# for example use a dot to structure a bit more.
# Assume that you have a second set of trappings 
# c(4,1,2,1,0,3,3) that differs only in a single 
# observation from the first one.
# An economic way to make the second column vector is:

trapped.day1<-trapped #make a new variable from old
trapped.day2<-trapped.day1 # make a second from old
trapped.day2[1]<-4 # modify trapped.day2
trapped.day2 # take a look at it

# Typing ls() returns all the objects stored in memory:

ls()

# By writing trapped.day2[1] we accessed the first element of the 
# column vector trapped.day2.

trapped.day2[2] # prints element two.
trapped.day2[c(1,2,3)] # prints elements 1,2,3
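
# Two further ways of indexing, sketched on a fresh vector so that the
# objects above stay untouched (demo.counts is an illustrative name):
# a negative index drops elements, and a logical condition selects them.

demo.counts<-c(4,1,2,1,0,3,3)
demo.counts[-1] # everything except element one
demo.counts[demo.counts>2] # only the counts larger than 2: 4 3 3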

# Beware: most datasets in R are stored as data matrices (data frames), 
# and then a single index such as [1] no longer refers to one observation. 
# We see below how we can access the variables within such a data matrix.

# R also has a list of datasets that come with the program.
# You can view the list by typing

data() 
 
# To read a dataset in from that list, say dataset "trees", type 

data(trees)
ls() # now contains trees
summary(trees)

# Dataframe trees contains three variables.
# We can access the first variable "Girth" by typing

trees$Girth

# or

trees[,1] # note the comma

# a data matrix has rows and columns, therefore two indices 
# (separated by a comma) are necessary to code each element. 
# Each variable has a separate column usually, 
# and each observation is on a row.

trees[1,] # gives observation one
trees[1,1] # returns the first variable of observation one.
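
# Rows and columns can be selected together, and columns can be chosen
# by name. A short sketch using the trees data loaded above:

trees[1:3,c("Girth","Height")] # first three observations, two variables
trees[trees$Height>85,] # all observations with Height above 85
names(trees) # the variable names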


# II. Reading data in from elsewhere
#--------------------------------------

#!!Please stop here for a moment and read the following instructions.

#! We now give an easy way to read data into R and, at the same time,
#! save them in a data-object.

# Prepare the datafile in, for example, the Excel spreadsheet program.
# In that spreadsheet, the first line of the data matrix 
# will contain the names of the variables (single words).
# This line is called the header line.
# Every observation is written on a separate line below the header line.
# Save your spreadsheet as a text file (e.g. "namefile.txt").
# In R: Change the working directory to the directory where the data are. 
# Use the R menu for that.

# Now type the following command at the command line:

dataset<-read.table("namefile.txt",header=TRUE)

# The data will be read in as data-object with name "dataset". 
# You can check that this object exists by typing

ls()

# ls() prints all objects in your workspace. "dataset" should be among 
# the objects listed.
# Or get a summary of what dataset contains as follows:

summary(dataset)

dim(dataset) 

# gives the numbers of lines (observations) and columns (variables) in it.
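names(dataset) # lists the variables in the object with name dataset.

# Once dataset has been attached (see below), ls("dataset") or
# objects("dataset") list the same variable names. Before that,
# they fail because "dataset" is not yet on the search list.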
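
# If you have no data file at hand yet, the reading step can be tried on
# a small temporary file. This is only a sketch: the names demo.file,
# demo.data and demo.read, and the values in them, are made up for
# illustration.

demo.file<-tempfile(fileext=".txt")
demo.data<-data.frame(loc=c("A","B","C"),count=c(0,3,1))
write.table(demo.data,demo.file,row.names=FALSE,quote=FALSE) # header line plus 3 rows
demo.read<-read.table(demo.file,header=TRUE)
dim(demo.read) # 3 observations, 2 variables
names(demo.read)
unlink(demo.file) # remove the temporary file again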

# If you type the following command, the variables in the object dataset 
# can be accessed directly by giving their names:

attach(dataset)

# Try this for a variable within dataset you have read in.
# For instance type loc at the prompt, if "dataset" contains a variable 
# with that name.
# You can also type

Girth # not found ....
attach(trees)
Girth # aha!

# The function detach() is used to make a dataset less accessible as soon 
# as you don't really need the variables anymore.
# This avoids messing up variables from different data-objects.

detach(trees)
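
# An alternative that avoids attaching altogether is with(), which
# evaluates an expression inside a data object. A minimal sketch with
# the trees data:

with(trees,mean(Girth)) # the same as mean(trees$Girth)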


# III. Manipulating/modifying existing data objects
#----------------------------------------------------

# It is relatively easy to manipulate datasets within R.
# Concatenating different datasets, extracting parts of them,
# and so on is also possible.

# For instance, consider the object

trapped.day2

# trapped.day2 has seven elements so far, so asking for element 
# number eight returns NA:

trapped.day2[8]

# Assigning to that position extends the vector:

trapped.day2[8]<-5

# such that trapped.day2 becomes

trapped.day2

# Several elements can be set at a time as well:

trapped.day2[9:11]<-rep(1,3) # sets elements 9 to 11 to ones.

#See

help(rep) #for explanation on the rep() function used here.

# Try to figure out what the ":" does for you.

trapped.day2[c(1,8)]<-c(1,2) #replaces elements 1, 8 by 1 and 2 resp.


# We can also add observations, even ones with missing values

trapped.day2[12]<-NA
trapped.day2[13]<-12
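
# When you assign beyond the end of a vector, R fills any gap with NA.
# A small sketch on a fresh vector (demo.vec is an illustrative name):

demo.vec<-c(1,2,3)
demo.vec[6]<-9
demo.vec # 1 2 3 NA NA 9
length(demo.vec) # now 6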

# spreadsheet facilities are available by using

data.entry(trapped.day2)

# or we can use another way of editing still,

edit(trapped.day2)

# The function edit() can use different text editors such as 
# vi, pico, emacs etc... For details, take a look at

help(edit)

# Using data.entry() to type in a new dataset is a little bit tricky.
# Suppose you want to make trapped.day3.
# This does not work:

data.entry(trapped.day3)

# But the following does work, since we initialize trapped.day3 immediately

data.entry(trapped.day3=c(NA))

# Concatenating column vectors goes as follows:

trapped.all<-c(trapped.day1,trapped.day2)

# For combining data matrices, you use rbind() or cbind().
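
# A minimal sketch of both, with made-up counts (demo.mat and the
# column names day1 and day2 are illustrative):

demo.mat<-cbind(day1=c(0,1,2),day2=c(4,1,2)) # two columns side by side
demo.mat
demo.mat<-rbind(demo.mat,c(5,5)) # add a fourth row below
dim(demo.mat) # 4 rows, 2 columns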

# Some functions return logical TRUE or FALSE, for instance 

trapped.day2==1

# returns TRUE for all elements of trapped.day2 that equal 1, 
# FALSE otherwise (and NA where the value is missing).
# Sometimes you just want the positions where a condition holds, 
# for example where the count equals 3:

which(trapped.day2==3)

# The function is.na() is also extremely handy.
# It tests for the occurrence of missing values "NA".

is.na(trapped.all)
help(is.na)

# You can for instance replace missing values by zeroes as follows:

trapped.all[is.na(trapped.all)]<-0
trapped.all

# If trapped.all contained any missing values, 
# they have now been replaced by zeroes.
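
# Note that missing values cannot be found with ==, since any comparison
# with NA is itself NA. A short sketch (demo.na is an illustrative name):

demo.na<-c(1,NA,3)
demo.na==NA # NA NA NA, not useful
is.na(demo.na) # FALSE TRUE FALSE
sum(is.na(demo.na)) # the number of missing values: 1
which(is.na(demo.na)) # their positions: 2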


# IV. Summary statistics, calculations
#----------------------------------------

# Obviously, you can use R as a pocket calculator, 
# but there's much more.

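# We can do calculations on entire vectors. Note that trapped.day2 is 
# now longer than trapped.day1, so R recycles the shorter vector 
# and gives a warning.

trapped.day2-trapped.day1 # calculates the elementwise differences between them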
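
# With vectors of equal length, arithmetic works element by element, and
# a single number is applied to every element. A sketch with made-up
# counts (demo.a and demo.b are illustrative names):

demo.a<-c(0,1,2)
demo.b<-c(4,1,2)
demo.b-demo.a # 4 0 0
demo.a*2 # 0 2 4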

# The following functions do what you would expect already
# from reading their names:

length(trapped.all) #length of the vector
diff(trapped.all) # differences between successive observations
diff(c(1:10)) # Predict what this will do...

# Several sample statistics are available as well:

median(trapped.all)
mean(trapped.all)
var(trapped.all)
max(trapped.all)
sum(trapped.all)

# If your data vector contains missing values (NA), 
# you can add an option to remove them; see help(mean) for this.
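
# A small sketch of that option, na.rm, on made-up data:

mean(c(1,NA,3)) # NA, because one value is missing
mean(c(1,NA,3),na.rm=TRUE) # 2, the mean of the remaining values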

# cumsum() returns a vector whose elements are
# the cumulative sums of the argument

cumsum(trapped.all) 

cummax(trapped.all) # Similarly, cumulative maxima.

# It is also possible to calculate a statistic on a subset of a 
# datavector, 

mean(subset(trapped.day2,trapped.day2>2))

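# Beware, the following command calculates the proportion of
# observations with values larger than 2 (na.rm=TRUE skips the
# missing value in trapped.day2; without it the result would be NA):

mean(trapped.day2>2,na.rm=TRUE)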
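
# The difference between the two commands, sketched on a small made-up
# vector (demo.v is an illustrative name):

demo.v<-c(1,3,5)
demo.v>2 # FALSE TRUE TRUE
mean(demo.v[demo.v>2]) # mean of the values larger than 2: 4
mean(demo.v>2) # proportion of values larger than 2: 2/3
sum(demo.v>2) # number of values larger than 2: 2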


# V.	Writing a function
#--------------------------

# Suppose you want the standard deviation of a variable
# and not just the variance, for which we used 
# var() in the last section.

std(trapped.all) # This does not work.

# Let's make a function that calculates a standard deviation ourselves.
# We will call it std. 
# For a vector x, it has to calculate the square root of the variance.
# It must become a function with one argument x, 
# that uses other functions sqrt(var(x)) on x.
# Voila:

std<-function(x)sqrt(var(x))

std(trapped.all) # This works!

# We could have avoided writing this function by looking a bit harder for

help.search("standard deviation")

# Then we would have seen that this is available:

sd(trapped.all)

# Which produces the same result.
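
# Functions can also take several arguments, with default values, and
# consist of several statements between braces. A hedged sketch: the
# function std.err below (standard error of the mean) is our own
# illustration, not something R provides.

std.err<-function(x,na.rm=FALSE){
  if(na.rm) x<-x[!is.na(x)] # optionally drop missing values first
  sqrt(var(x)/length(x)) # the last value computed is returned
}

std.err(c(1,2,3,4))
std.err(c(1,2,3,4,NA),na.rm=TRUE) # same result, the NA is dropped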

# VI.	Tables
#--------------

# The function table() builds a contingency
# table of the counts.
# For instance, when we do that for trapped.all,
# we get a table listing the frequencies of counts.

table(trapped.all)
help(table)
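
# With two arguments, table() cross-tabulates them. A minimal sketch on
# made-up data (demo.site, demo.catch and demo.tab are illustrative names):

demo.site<-c("A","A","B","B","B")
demo.catch<-c(0,1,0,0,1)
demo.tab<-table(demo.site,demo.catch)
demo.tab # rows are sites, columns are counts
demo.tab["B","0"] # site B had a zero count twice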

# -------------------------------------------------------
# When finished running this file, please continue with 
# the ".Rgraphics" or ".Rclassics" tutorial files.
# -------------------------------------------------------


# Much of the material in this short tutorial comes from

# Using R for data analysis and graphics. An Introduction
# by J. H. Maindonald (2001)

# Using R for introductory Statistics
# by John Verzani (2002)

# Both are accessible via links in the documentation section of
# the R-project website. 

# Tom Van Dooren, version 30/01/2003