Intro to R for datascience

Course overview

Session 1: Introduction to R

Elementary data structures, data.frames + an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do and how to make it perform the most elementary tricks needed in Data Science? What is CRAN and how to install R packages? R graphics: simple linear regression with plot(), abline(), and fancy with ggplot2().

Session 2: Vectors, Matrices, Data Frames

Introduction to vectors, matrices, and data frames in R. R is a vector programming language, which means you will be using vectors, matrices, and n-dimensional arrays a lot. Vectorizing your code means enhanced performance in terms of speed. Data frame objects in R are elementary carriers of most of your data in R; unlike vectors and matrices, data frames can encompass various data types.

Introduction to R for Data Science :: Session 1

Intro to R for Data Science SlideShare :: Session 1

Introduction to R for Data Science :: Session 1 from Goran S. Milovanovic

R script + Data Set :: Session 1

Download IntroR_Session1.R

Download Session 1 Data Set

########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################
 
# This is an R comment: it begins with "#" and ends with nothing 🙂
# data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida
 
# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures
 
# First question: where are we?
getwd(); # this will tell you the path to the R working directory
 
# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # now filesDir is a of a character type; there are classes and types in R
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)
# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check
 
# Read some data in csv (comma separated values
# - it might turn out that you will be using these very often)
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    header=T,
                    check.names=F,
                    stringsAsFactors=F,
                    row.names=NULL);
 
# read.csv is for reading comma separated values
# type ? in front of any R function for help
?read.csv
# to find our that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one
 
# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!
 
# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []
# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, understood as a character vector, does not have a name
# however, elements OF A list do have names
# can we subset a data.frame object by names?
dataSet$movie;
dataSet$movie[1:10];
dataSet$movie[[1]];
class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame
testWord <- testWord testWord[[1]];
testWord[[1:2]]; # error
testWord[1:2];
# similar
dataSet[1:2]; # first two columns of a dataSet
# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!
# finding elements of vectors
w <- testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)]; # length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
head(testWord,3); # and heads as well
# a data.frame has a head too, and that knowledge often comes handy...
head(dataSet,5); # ... especially when dealing with large data sets
# of course...
tail(dataSet,10);
# another two functions: tail() and head()
# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease);
class(dataSet$reRelease);
# automatic type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
reRelease
is.logical(reRelease);
# vectors, sequences...
# automatic type conversion (coercing) in R: from real to integer
x <- 2:10;
# is the same as...
x <- seq(2,10,by=1);
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully!
as.integer(multipliPi) == round(multipliPi,0) # check documentation
?as.integer # enjoy...
# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)
 
# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates
 
# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?
# What is the size of the data set?
n # any missing data?
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));
# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);
# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue,method="pearson");
cPearson

# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients 
# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters 
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table 
vcov(reg) # covariance matrix for model parameters 
 
# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() is a generic function, check it out ?abline

# and now for a nice plot
library(ggplot2); # first do: install.packages("ggplot2");not now - it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet,
            aes(x = productionCost,
                y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm,
              se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n"); 
print(g);

# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?
 
# print is also a generic function in R: for example,
print("Doviđenja i uživajte u praznicima uz gomilu materijala za čitanje i vežbu!")
 
# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg) # etc.

Readings :: Session 2 [5. May, 2016, @Startit.rs, 19h CET]

The Art of R Programming, Norman Matloff

Chapter 1 – 5, pp. 1 – 54.

Summary of Session 2, 05. may 2016 :: Introduction to R: vectors, matrices, and data frame

Intro to R for Data Science SlideShare :: Session 2

Introduction to R for Data Science :: Session 2 from Goran S. Milovanovic

R script :: Session 2

Download link for IntroR_Session2.R

########################################################
# Introduction to R for Data Science
# SESSION 2 :: 5 May, 2016
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
 
# clear all
rm(list=ls());
 
# Let's start with some vectors
char_list <- character(length = 0) #empty character list
num_list <- numeric(length = 10) #length can be != 0, but 0 is default value
log_list <- logical(length = 3) #default value is FALSE
 
# But you can always use good ol' c() for the same purpose
log_list_2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE) #some Ts and Fs
num_list_2 <- c(1, 4, 12, NA, 101, 999) #numb
char_list_2 <- c("abc", "qwerty", "test", "data", "science")
 
# Factor vectors are also part of R
fac_list <- gl(n = 4, k = 1, length = 8, ordered = T, 
               labels = c("low", "med", "high", "very high")) #only mentioning now :)
 
# Subsetting is regular-thing-to-do when using R
char_list_2[5] #single element can be selected
log_list_2[2:4] #or some interval
num_list_2[3:length(num_list_2)] #or even length() function
 
# New objects can be created when subsetting
test <- num_list_2[-c(2,4)] #or somthing like this - displays all but 2nd and 4th element
test_2 <- num_list_2 %in% test #operator %in% can be very useful
not_na <- num_list_2[!is.na(num_list_2)] #removing NAs using operator ! and is.na() function
 
# Vector ordering
sort(test, decreasing = T) #using sort() function
test[order(test, decreasing = T)] #or with order() function
 
# Vector sequences
seq(1,22,by = 2) #we already mentioned seq()
rep(1, 4) #but rep() is something new :)
rep(num_list_2, 2) #replicate num_list_2, 2 times
 
# Concatenation
new_num_vect <- c(num_list, num_list_2) #using 2 vectors to create new one
new_num_vect
new_combo_vect <- c(num_list_2, log_list) #combination of num and log vector
new_combo_vect #all numbers? false to zero? coercion in action
 
new_combo_vect_2 <- c(char_list_2, num_list_2) #works as well
new_combo_vect_2 #where are the numbers?
class(new_combo_vect_2) #all characters
 
# Matrices are available in R
matr <- matrix(data = c(1,3,5,7,NA,11), nrow = 2, ncol = 3) #2x3 matrix
class(matr) #yes, it's matrix
typeof(matr) #double as expected
 
matr[,2] #2nd column
matr[3,] #oops, out of bounds, there's no 3rd row
matr[2,3] #element in 2nd row and 3rd column
 
matr_2 <- matrix(data = c(1,3,5,"7",NA,11), nrow = 2, ncol = 3) #another 2x3 matrix
class(matr_2) #matrix again
typeof(matr_2) #but not double anymore, type conversion in action!
t(matr_2) #transponed matr_2
 
# What can we do if a matrix needs to encompass different types of data?
# Introducing data frame!
 
library(datasets) #there are some datasets in base R like mtcars
cars_data <- mtcars
 
# Some useful information about data frames
str(cars_data) #lets see what we have here
summary(cars_data) #more information about mtcars dataset
names(cars_data) #column names
?mtcars #dataset documentation is *very* important
 
# Think of data frame columns as vectors! Because they are!
mean(cars_data$mpg) #mean of cars_data mpg (miles per galon) column
median(cars_data$cyl) #median of cars_data cyl (cylinders) column
 
is.list(cars_data[1,]); #but rows are lists!
 
# Lets do some data frame subsetting
 
cars_data[-1, ] # first row out
cars_data[ ,-1] # first column out
 
cars_data[c(1,3)] #keeping 1st and 3rd column only
cars_data[-c(1,3)] #removing 1st and 3rd column
cars_data[ ,-c(1,3)] #same as the previous line of code
 
cars_data[!duplicated(cars_data$mpg), ] #maybe we want to remove all cars with same mpg?
#remember it keeps only the first occurence!
 
subset(cars_data, mpg < 19) #this is one way (and it can be slow!)
cars_data[cars_data$mpg < 19, ] #this is another one (faster)
cars_data[which(cars_data$mpg < 19), ] #and another one (usually even more faster)
 
cars_data[cars_data$mpg > 20 & cars_data$am == 1, ] #multiple conditions
 
cars_data[grep("Merc", row.names(cars_data), value=T), ] #filtering by pattern match
 
# Data frame transformations
cars_data$trans <- ifelse(cars_data$am == 0, "automatic", "manual") #we can add new colums
cars_data$trans <- NULL #or we can remove them
 
cars_data[c(1:3,11,4,7,5:6,8:10)] #this way we change column order
 
# Separation and joining of data frames
low_mpg <- cars_data[cars_data$mpg < 15, ] #new data frame with mpg < 15
high_mpg <- cars_data[cars_data$mpg >= 15, ] #new data frame with mpg >= 15
 
mpg_join <- rbind(low_mpg, high_mpg) # we can combine 2 data frames like this
 
car_condition <- data.frame(sample(c("old","new"), replace = T, size = 32)) #creating random data frame
                                                                            #with "old" and "new" values
names(car_condition) <- "condition" #for all kinds of objects
colnames(car_condition) <- "condition" #for "matrix-like" objects, but same effect here
rownames(car_condition) <- rownames(cars_data) #use row names of one data frame as row names of other
 
mpg_join <- cbind(mpg_join, car_condition) #or combine data frames like this

Readings :: Session 3 [12. May, 2016, @Startit.rs, 19h CET]

Chapters 1 - 5

The Art of R Programming, Norman Matloff

You must be logged in to post a comment.

Intro to R for datascience

Course overview

Session 1: Introduction to R

Session 2: Vectors, Matrices, Data Frames

Session 3: Data Frames, Factors, and Objects in R

Session 4: Data Structures + Control Flow = Programs. Functions in R

Session 5: Structuring Data: String manipulation in R

Session 6: Introduction to GLM: Correlations and Linear Regression in R

Session 7: Introduction to GLM: Multiple Regression in R

Session 8: Extending the Scope of the GLM: Binomial and Multinomial Logistic Regression in R

Session 9: Dimensionality Reduction: Multdimensional Scaling in R with Smacof

Session 10: Non-parametric Methods in R.