
Intro to R for Data Science

Branko Kovač (Data Analyst at CUBE, Data Science Mentor at Springboard, Institut savremenih nauka, Data Science Serbia) and Goran S. Milovanović (Data Scientist at DiploFoundation, Data Science Serbia) are giving a free introductory course on R for Data Science in Belgrade, Serbia. All course materials (slides, R scripts, data sets, summaries, and recommended readings) can be found on this page.

The course is organized by Data Science Serbia in cooperation with Startit, Belgrade. Fifteen participants work with us on site at Startit Centar, Savska 5, Belgrade, every Thursday at 19h CET, beginning 28 April 2016.

The course will be carried out through ten sessions (reproducible R code can be found at the following pages):

Course overview

Session 1: Introduction to R

Elementary data structures, data.frames + an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do and how to make it perform the most elementary tricks needed in Data Science? What is CRAN and how to install R packages? R graphics: simple linear regression with plot(), abline(), and fancy with ggplot().

Session 2: Vectors, Matrices, Data Frames

Introduction to vectors, matrices, and data frames in R. R is a vector programming language, which means you will be using vectors, matrices, and n-dimensional arrays a lot. Vectorizing your code means enhanced performance in terms of speed. Data frame objects in R are elementary carriers of most of your data in R; unlike vectors and matrices, data frames can encompass various data types.

Session 3: Data Frames, Factors, and Objects in R

Session 4: Data Structures + Control Flow = Programs. Functions in R

Session 5: Structuring Data: String manipulation in R

Session 6: Introduction to GLM: Correlations and Linear Regression in R

Session 7: Introduction to GLM: Multiple Regression in R

Session 8: Extending the Scope of the GLM: Binomial and Multinomial Logistic Regression in R

Session 9: Dimensionality Reduction: Multidimensional Scaling in R with Smacof

Session 10: Non-parametric Methods in R

Introduction to R for Data Science :: Session 1

Elementary data structures, data.frames + an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do and how to make it perform the most elementary tricks needed in Data Science? What is CRAN and how to install R packages? R graphics: simple linear regression with plot(), abline(), and fancy with ggplot().
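
The subsetting and coercion examples mentioned above can be sketched on a toy data frame. This is a minimal illustration, not part of the course script: the `films` data frame and its values are hypothetical stand-ins for the course data set.

```r
# Hypothetical toy data frame (not the course data set)
films <- data.frame(movie = c("A", "B", "C"),
                    cost  = c(100, 250, 80),
                    stringsAsFactors = FALSE)

# [ ] keeps the container: the result is still a data.frame (a list)
first_col_df <- films[1]
class(first_col_df)

# [[ ]] extracts the member itself: here a character vector
first_col_vec <- films[[1]]
class(first_col_vec)

# $ is shorthand for [[ ]] with a name
identical(films$movie, films[["movie"]])

# Coercion: mixing types in c() converts everything to the most flexible type
mixed <- c(1, TRUE, "2")
typeof(mixed)

# Explicit conversion from character to numeric
as.numeric(c("123", "4.5"))
```

The single versus double bracket distinction matters most with data frames: `films[1]` can be passed on to functions that expect a data frame, while `films[[1]]` is what vector functions such as `length()` or `substring()` expect.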

Intro to R for Data Science SlideShare :: Session 1

Introduction to R for Data Science :: Session 1 from Goran S. Milovanovic

R script + Data Set :: Session 1

Download IntroR_Session1.R

Download Session 1 Data Set

########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################
 
# This is an R comment: it begins with "#" and ends with nothing 🙂
# data source: http://www.stat.ufl.edu/~winner/datasets.html (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida
 
# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films: https://en.wikipedia.org/wiki/RKO_Pictures
 
# First question: where are we?
getwd(); # this will tell you the path to the R working directory
 
# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # filesDir is now of class "character"; R has both classes and types
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)
# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check
 
# Read some data in csv (comma separated values)
# - it might turn out that you will be using these very often
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    header=TRUE,
                    check.names=FALSE,
                    stringsAsFactors=FALSE,
                    row.names=NULL);
 
# read.csv is for reading comma separated values
# type ? in front of any R function for help
?read.csv
# to find out that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one
 
# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!
 
# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []
# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, understood as a character vector, does not have a name
# however, elements OF A list do have names
# can we subset a data.frame object by names?
dataSet$movie;
dataSet$movie[1:10];
dataSet$movie[[1]];
class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame
# an example character vector (the values are illustrative)
testWord <- c("Ana", "voli", "Milovana");
testWord[[1]];
testWord[[1:2]]; # error
testWord[1:2];
# similar
dataSet[1:2]; # first two columns of a dataSet
# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!
# finding elements of vectors
w <- which(testWord == "Ana"); # which() returns the positions where the condition is TRUE
testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)]; # length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
head(testWord,3); # and heads as well
# a data.frame has a head too, and that knowledge often comes handy...
head(dataSet,5); # ... especially when dealing with large data sets
# of course...
tail(dataSet,10);
# another two functions: tail() and head()
# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease);
class(dataSet$reRelease);
# type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
reRelease <- as.logical(dataSet$reRelease);
reRelease
is.logical(reRelease);
# vectors, sequences...
# automatic type conversion (coercing) in R: from real to integer
x <- 2:10;
# gives the same values as (note: 2:10 is integer, while seq() with by=1 returns double)...
x <- seq(2,10,by=1);
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully!
as.integer(multipliPi) == round(multipliPi,0) # check documentation
?as.integer # enjoy...
# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)
 
# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates
 
# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?
# What is the size of the data set?
n <- nrow(dataSet);
n
# any missing data?
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));
# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);
# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue,method="pearson");
cPearson
# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients 
# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters 
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table 
vcov(reg) # covariance matrix for model parameters 
 
# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() accepts several kinds of input, check it out: ?abline
# and now for a nice plot
library(ggplot2); # first do: install.packages("ggplot2"); not now - it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet,
            aes(x = productionCost,
                y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm,
              se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n"); 
print(g);
# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?
 
# print is a generic function in R: for example,
print("Goodbye, and enjoy the holidays with a pile of material for reading and practice!")
 
# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg) # etc.
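
The script above writes `dataSet$...` directly into the model formula. A common alternative is to name the columns in the formula and pass the data frame through the `data` argument, which also lets `predict()` work on new observations. The sketch below uses the built-in `cars` data set as a stand-in, since the RKO file may not be at hand; the variable names `fit`, `new_speeds`, and `predicted` are illustrative only.

```r
# Stand-in data: the built-in 'cars' data set (columns: speed, dist)
fit <- lm(dist ~ speed, data = cars)

coef(fit)                  # intercept and slope
summary(fit)$r.squared     # proportion of variance explained

# Because the formula uses column names, predict() accepts new data
new_speeds <- data.frame(speed = c(10, 20))
predicted <- predict(fit, newdata = new_speeds)
predicted

# Base graphics: abline() also accepts the fitted model directly
plot(cars$speed, cars$dist, xlab = "Speed", ylab = "Stopping distance")
abline(fit)
```

With the course data, the same pattern would presumably read `lm(totalRevenue ~ productionCost, data = dataSet)`.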

Readings :: Session 2 [5. May, 2016, @Startit.rs, 19h CET]

The Art of R Programming, Norman Matloff

Chapters 1 – 5, pp. 1 – 54.

Summary of Session 2, 5 May 2016 :: Introduction to R: vectors, matrices, and data frames

Introduction to vectors, matrices, and data frames in R. R is a vector programming language, which means you will be using vectors, matrices, and n-dimensional arrays a lot. Vectorizing your code means enhanced performance in terms of speed. Data frame objects in R are elementary carriers of most of your data in R; unlike vectors and matrices, data frames can encompass various data types.
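
A minimal sketch of the two points in this summary: vectorized operations replace explicit loops, and data frames, unlike matrices, keep a separate type per column. The object names are illustrative and not taken from the session script.

```r
x <- 1:10

# Loop version: square each element one at a time
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: the operator works on the whole vector at once
squares_vec <- x^2

all.equal(squares_loop, squares_vec)   # both approaches give the same values

# A matrix holds a single type, so mixing numbers and text coerces everything
m <- cbind(id = 1:3, name = c("a", "b", "c"))
typeof(m)

# A data frame keeps a separate type per column
df <- data.frame(id = 1:3, name = c("a", "b", "c"), stringsAsFactors = FALSE)
sapply(df, class)
```

On vectors this short the speed difference is negligible; the gain from vectorizing shows up on large inputs, and the vectorized form is also shorter to read.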

Intro to R for Data Science SlideShare :: Session 2

Introduction to R for Data Science :: Session 2 from Goran S. Milovanovic

R script :: Session 2

Download link for IntroR_Session2.R

########################################################
# Introduction to R for Data Science
# SESSION 2 :: 5 May, 2016
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################
 
# clear all
rm(list=ls());
 
# Let's start with some vectors
char_list <- character(length = 0) #empty character vector
num_list <- numeric(length = 10) #length can be != 0, but 0 is the default; elements are initialized to 0
log_list <- logical(length = 3) #elements are initialized to FALSE
 
# But you can always use good ol' c() for the same purpose
log_list_2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE) #some Ts and Fs
num_list_2 <- c(1, 4, 12, NA, 101, 999) #some numbers, including a missing value (NA)
char_list_2 <- c("abc", "qwerty", "test", "data", "science")
 
# Factor vectors are also part of R
fac_list <- gl(n = 4, k = 1, length = 8, ordered = T, 
               labels = c("low", "med", "high", "very high")) #only mentioning now :)
 
# Subsetting is regular-thing-to-do when using R
char_list_2[5] #single element can be selected
log_list_2[2:4] #or some interval
num_list_2[3:length(num_list_2)] #or even length() function
 
# New objects can be created when subsetting
test <- num_list_2[-c(2,4)] #or something like this - keeps all but the 2nd and 4th element
test_2 <- num_list_2 %in% test #operator %in% can be very useful
not_na <- num_list_2[!is.na(num_list_2)] #removing NAs using operator ! and is.na() function
 
# Vector ordering
sort(test, decreasing = T) #using sort() function
test[order(test, decreasing = T)] #or with order() function
 
# Vector sequences
seq(1,22,by = 2) #we already mentioned seq()
rep(1, 4) #but rep() is something new :)
rep(num_list_2, 2) #replicate num_list_2, 2 times
 
# Concatenation
new_num_vect <- c(num_list, num_list_2) #using 2 vectors to create new one
new_num_vect
new_combo_vect <- c(num_list_2, log_list) #combination of num and log vector
new_combo_vect #all numbers? false to zero? coercion in action
 
new_combo_vect_2 <- c(char_list_2, num_list_2) #works as well
new_combo_vect_2 #where are the numbers?
class(new_combo_vect_2) #all characters
 
# Matrices are available in R
matr <- matrix(data = c(1,3,5,7,NA,11), nrow = 2, ncol = 3) #2x3 matrix
class(matr) #yes, it's matrix
typeof(matr) #double as expected
 
matr[,2] #2nd column
matr[3,] #oops, out of bounds, there's no 3rd row
matr[2,3] #element in 2nd row and 3rd column
 
matr_2 <- matrix(data = c(1,3,5,"7",NA,11), nrow = 2, ncol = 3) #another 2x3 matrix
class(matr_2) #matrix again
typeof(matr_2) #but not double anymore, type conversion in action!
t(matr_2) #transposed matr_2
 
# What can we do if a matrix needs to encompass different types of data?
# Introducing data frame!
 
library(datasets) #there are some datasets in base R like mtcars
cars_data <- mtcars
 
# Some useful information about data frames
str(cars_data) #lets see what we have here
summary(cars_data) #more information about mtcars dataset
names(cars_data) #column names
?mtcars #dataset documentation is *very* important
 
# Think of data frame columns as vectors! Because they are!
mean(cars_data$mpg) #mean of cars_data mpg (miles per gallon) column
median(cars_data$cyl) #median of cars_data cyl (cylinders) column
 
is.list(cars_data[1,]); #but rows are lists!
 
# Lets do some data frame subsetting
 
cars_data[-1, ] # first row out
cars_data[ ,-1] # first column out
 
cars_data[c(1,3)] #keeping 1st and 3rd column only
cars_data[-c(1,3)] #removing 1st and 3rd column
cars_data[ ,-c(1,3)] #same as the previous line of code
 
cars_data[!duplicated(cars_data$mpg), ] #maybe we want to remove all cars with same mpg?
#remember it keeps only the first occurrence!
 
subset(cars_data, mpg < 19) #this is one way (and it can be slow!)
cars_data[cars_data$mpg < 19, ] #this is another one (faster)
cars_data[which(cars_data$mpg < 19), ] #and another one (usually even faster; which() also drops NAs in the condition)
 
cars_data[cars_data$mpg > 20 & cars_data$am == 1, ] #multiple conditions
 
cars_data[grep("Merc", row.names(cars_data), value=T), ] #filtering by pattern match
 
# Data frame transformations
cars_data$trans <- ifelse(cars_data$am == 0, "automatic", "manual") #we can add new columns
cars_data$trans <- NULL #or we can remove them
 
cars_data[c(1:3,11,4,7,5:6,8:10)] #this way we change column order
 
# Separation and joining of data frames
low_mpg <- cars_data[cars_data$mpg < 15, ] #new data frame with mpg < 15
high_mpg <- cars_data[cars_data$mpg >= 15, ] #new data frame with mpg >= 15
 
mpg_join <- rbind(low_mpg, high_mpg) # we can combine 2 data frames like this
 
car_condition <- data.frame(sample(c("old","new"), replace = T, size = 32)) #creating random data frame
                                                                            #with "old" and "new" values
names(car_condition) <- "condition" #for all kinds of objects
colnames(car_condition) <- "condition" #for "matrix-like" objects, but same effect here
rownames(car_condition) <- rownames(cars_data) #use row names of one data frame as row names of other
 
mpg_join <- cbind(mpg_join, car_condition) #or combine data frames like this
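
One point worth checking when combining data frames as in the last line: `cbind()` pairs rows by position, not by row name, so two data frames whose rows are in a different order end up misaligned. The sketch below shows the effect and two ways to align by row name. The data frames `left` and `right` and their values are hypothetical, not part of the session script.

```r
# Two small hypothetical data frames whose rows are in a different order
left  <- data.frame(mpg = c(21, 14), row.names = c("car_a", "car_b"))
right <- data.frame(condition = c("old", "new"),
                    row.names = c("car_b", "car_a"),
                    stringsAsFactors = FALSE)

# cbind() pairs rows by position: car_a receives the value stored for car_b
by_position <- cbind(left, right)
by_position

# Reordering 'right' by the row names of 'left' aligns the rows first
by_name <- cbind(left, right[rownames(left), , drop = FALSE])
by_name

# merge() with by = "row.names" matches on row names as well
merged <- merge(left, right, by = "row.names")
merged
```

In the session script the condition column is random, so the order does not matter there; with real data, aligning by name (or using `merge()`) avoids attaching values to the wrong rows.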

Readings :: Session 3 [12. May, 2016, @Startit.rs, 19h CET]

The Art of R Programming, Norman Matloff

Chapters 1 – 5
