Branko Kovač (Data Analyst at CUBE, Data Science Mentor at Springboard, Institut savremenih nauka, Data Science Serbia) and Goran S. Milovanović (DataScientist@DiploFoundation, Data Science Serbia) are giving a free introductory course on R for Data Science in Belgrade, Serbia. All course materials (slides, R scripts, data sets, summaries and recommended readings) can be found on this page.
The course is organized by Data Science Serbia in cooperation with Startit, Belgrade. Fifteen participants are working with us in person at Startit Centar, Savska 5, Belgrade, every Thursday at 19h CET, beginning 28 April 2016.
The course will be carried out through ten sessions (reproducible R code can be found at the following pages):
Elementary data structures, data.frames + an illustrative example of a simple linear regression model. An introduction to basic R data types and objects (vectors, lists, data.frame objects). Examples: subsetting and coercion. Getting to know RStudio. What can R do and how to make it perform the most elementary tricks needed in Data Science? What is CRAN and how to install R packages? R graphics: simple linear regression with plot() and abline(), and a fancier version with ggplot() from the ggplot2 package.
Introduction to vectors, matrices, and data frames in R. R is a vector programming language, which means you will be using vectors, matrices, and n-dimensional arrays a lot. Vectorizing your code means enhanced performance in terms of speed. Data frame objects in R are elementary carriers of most of your data in R; unlike vectors and matrices, data frames can encompass various data types.
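As a small illustration of the vectorization point, the sketch below computes the same result with an explicit loop and with a vectorized expression, and then builds a data frame whose columns have different types. All names here (x, squares_loop, squares_vec, df) are made up for the example and do not come from the course scripts; timings depend on your machine, so none are shown.

```r
# Hypothetical example: square the numbers 1..100000 in two ways
x <- 1:100000

# 1) explicit loop, filling a pre-allocated numeric vector
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# 2) vectorized: ^ operates element-wise on the whole vector at once
squares_vec <- x^2

# both approaches produce the same values
identical(squares_loop, squares_vec)

# system.time() reports how long each approach takes on your machine
system.time(for (i in seq_along(x)) squares_loop[i] <- x[i]^2)
system.time(x^2)

# unlike a vector or a matrix, a data frame can mix column types
df <- data.frame(id = 1:3,
                 name = c("a", "b", "c"),
                 flag = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)
sapply(df, class)  # class of each column
```

Running the two system.time() calls side by side is usually enough to see why vectorized expressions are preferred in R.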
########################################################
# Introduction to R for Data Science
# SESSION 1 :: 28 April, 2016
# Data Science Community Serbia + Startit
# :: Branko Kovač and Goran S. Milovanović ::
########################################################

# This is an R comment: it begins with "#" and ends with nothing 🙂

# data source: (modified, from .dat to .csv)
# from the website of Mr. Larry Winner, Department of Statistics, University of Florida
# Data set: RKO Films Costs and Revenues 1930-1941
# More on RKO Films:

# First question: where are we?
getwd(); # this will tell you the path to the R working directory

# Where are my files?
# NOTE: Here you need to change filesDir to match your local path
filesDir <- "/home/goran/Desktop/__IntroR_Session1/";
class(filesDir); # now filesDir is of character type; there are classes and types in R
typeof(filesDir);
# By the way, you do not need to use the semicolon to separate lines of code:
class(filesDir)
typeof(filesDir)

# point R to where your files are stored
setwd(filesDir); # set working directory
getwd(); # check

# Read some data in csv (comma separated values
# - it might turn out that you will be using these very often)
fileName <- "rko_film_1930-1941.csv";
dataSet <- read.csv(fileName,
                    header=T,
                    check.names=F,
                    stringsAsFactors=F,
                    row.names=NULL);
# read.csv is for reading comma separated values
# type ? in front of any R function for help
?read.csv
# to find out that read.csv is a member of a wider read* family of functions
# of which read.table is the most generic one

# now, dataSet is of type...
typeof(dataSet); # in type semantics, dataSet is a list. In R we use lists a lot.
class(dataSet); # in object semantics, dataSet is a data.frame!

# what is the first member of the dataSet list?
dataSet[[1]];
# what are the first two members?
dataSet[1:2];
# mind the difference between subsetting a list with [[]] and []

# does a single member of dataSet have a name?
names(dataSet[[1]]);
# of what type is it?
typeof(dataSet[[1]]);
class(dataSet[[1]]);
# do first two elements have names?
names(dataSet[1:2]); # wow
typeof(dataSet[1:2]);
# the first element of dataSet, understood as a character vector, does not have a name
# however, elements OF A list do have names

# can we subset a data.frame object by names?
dataSet$movie;
dataSet$movie[1:10];
dataSet$movie[[1]];
class(dataSet$movie[[1]]);
typeof(dataSet$movie[[1]]);
# thus, a character vector is the first member = the first column of the dataSet data.frame
testWord <- dataSet$movie; # assumption: testWord is the movie column, as the comment above suggests
testWord[[1]];
testWord[[1:2]]; # error
testWord[1:2]; # similar
dataSet[1:2]; # first two columns of a dataSet

# back to characters
tW <- testWord[1];
tW[1]
tW[2] # NA
# from a viewpoint of a statistical spreadsheet user, NA is used for missing data in R
# what is the second letter in tW == 'Ana'
substring(tW,2,2); # there are functions in R to deal with characters as strings!

# finding elements of vectors
w <- which(testWord == tW); # assumption: look up the title stored in tW
testWord[w];
# how many elements in testWord?
length(testWord);
# subsetting testWord, again
testWord[2:length(testWord)]; # length is another important function, like which() or substring()
tail(testWord,2); # vectors have tails, yay!
head(testWord,3); # and heads as well
# a data.frame has a head too, and that knowledge often comes in handy...
head(dataSet,5); # ... especially when dealing with large data sets
# of course...
tail(dataSet,10); # another two functions: tail() and head()

# further subsetting of a data.frame object
dataSet$reRelease # columns can have names; reRelease is the name of the 2nd column of dataSet
typeof(dataSet$reRelease);
class(dataSet$reRelease);
# type conversion in R: from numeric to logical
is.numeric(dataSet$reRelease);
reRelease <- as.logical(dataSet$reRelease); # assumption: explicit conversion with as.logical()
reRelease
is.logical(reRelease);

# vectors, sequences...
# automatic type conversion (coercing) in R: from real to integer
x <- 2:10; # is the same as...
x <- seq(2,10,by=1);
# multiples of 3.1415927...
multipliPi <- x*pi;
multipliPi
# NOTE multiplication * in R operates element-wise
# This is one of the reasons we call it a vector programming language...
is.double(multipliPi);
# type conversion in R: from double to integer
as.integer(multipliPi)
is.integer(multipliPi)
is.integer(as.integer(multipliPi))
# rounding
round(multipliPi,1)
round(multipliPi,2)
# carefully!
as.integer(multipliPi) == round(multipliPi,0)
# check documentation
?as.integer
# enjoy...

# more coercion...
num <- as.numeric("123");
is.numeric(num)
ch <- as.character(num)
is.character(ch)

# What do we all love in Data Science and Statistics? Random numbers..!
runif(100,0,1) # one hundred uniformly distributed random numbers on a range 0 .. 1
rnorm(100, mean=0, sd=1) # one hundred random deviates from the standard Gaussian
# all probability density and mass functions in R have similar r* functions to generate random deviates

# Enough! Let's do something for real...
# Q: Is it possible to predict the total revenue from movie production cost?
# Are these two related at all?

# What is the size of the data set?
n <- nrow(dataSet); # assumption: size is taken as the number of rows (films)
n
# any missing data? (each sum counts the non-missing values; compare with n)
sum(!(is.na(dataSet$productionCost)));
sum(!(is.na(dataSet$totalRevenue)));

# plot dataSet$productionCost on x-axis and dataSet$totalRevenue on y-axis
plot(dataSet$productionCost, dataSet$totalRevenue);
# are these two correlated?
cPearson <- cor(dataSet$productionCost, dataSet$totalRevenue, method="pearson");
cPearson
# However, who in the World tests the assumptions of the linear model... Kick it!
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost);
summary(reg);
# get residuals
reg$residuals
# get coefficients
reg$coefficients

# some functions to inspect the simple linear model
coefficients(reg) # model coefficients
confint(reg, level=0.95) # CIs for model parameters
fitted(reg) # predicted values
residuals(reg) # residuals
anova(reg) # anova table
vcov(reg) # covariance matrix for model parameters

# plot model
intercept <- reg$coefficients[1];
slope <- reg$coefficients[2];
plot(dataSet$productionCost, dataSet$totalRevenue);
abline(reg$coefficients); # as simple as that; abline() is a generic function, check it out
?abline
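The model above refers to columns as dataSet$... inside the formula. A common alternative, sketched here on the built-in mtcars data because the RKO file may not be at hand, is to pass a data argument to lm(), which also makes predict() on new observations straightforward. The formula mpg ~ wt and the weights in new_cars are purely illustrative choices, not part of the course material.

```r
# Illustrative only: mtcars stands in for the RKO data set
fit <- lm(mpg ~ wt, data = mtcars)  # column names are looked up in mtcars

coef(fit)  # intercept and slope

# predictions for new, made-up car weights (in 1000 lbs, as in mtcars)
new_cars <- data.frame(wt = c(2.5, 3.5))
pred <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)
pred  # columns: fit (prediction), lwr and upr (confidence bounds)
```

Because the formula uses bare column names, the same call works on any data frame that has those columns.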
# and now for a nice plot
library(ggplot2); # first do: install.packages("ggplot2"); not now - it can take a while
# library() is a call to use any R package
# of which the powerful ggplot2 is among the most popular
g <- ggplot(data=dataSet, aes(x = productionCost, y = totalRevenue)) +
  geom_point() +
  geom_smooth(method=lm, se=TRUE) +
  xlab("\nProduction Cost") +
  ylab("Total Revenue\n") +
  ggtitle("Linear Regression\n");
print(g);
# Q1: Is this model any good?
# Q2: Are there any truly dangerous outliers present in the data set?

# print is also a generic function in R: for example,
print("Goodbye, and enjoy the holidays with a pile of material for reading and practice!")

# P.S. Play with:
reg <- lm(dataSet$totalRevenue ~ dataSet$productionCost + dataSet$domesticRevenue);
summary(reg)
# etc.
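One possible way to start on Q1 and Q2 is sketched below, again on the built-in mtcars data rather than the RKO file. It looks at R-squared for overall fit and at Cook's distance for influential observations. The 4/n cut-off is a widely used rule of thumb and an assumption of this sketch, not a recommendation from the course; the names fit, r2, cooks, threshold and influential are made up for the example.

```r
# Illustrative only: mtcars stands in for the RKO data set
fit <- lm(mpg ~ wt, data = mtcars)

# Q1-style check: proportion of variance explained by the model
r2 <- summary(fit)$r.squared
r2

# Q2-style check: Cook's distance measures how much each observation
# influences the fitted coefficients
cooks <- cooks.distance(fit)

# 4/n is a common rule of thumb (an assumption here, not a fixed rule)
threshold <- 4 / nrow(mtcars)
influential <- names(cooks)[cooks > threshold]
influential  # row names of the observations flagged by the rule of thumb
```

Calling plot(fit) on the model additionally draws the standard base R diagnostic plots, which are worth inspecting before trusting any single number.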
The Art of R Programming, Norman Matloff
Chapters 1 – 5, pp. 1 – 54.
########################################################
# Introduction to R for Data Science
# SESSION 2 :: 5 May, 2016
# Data Science Community Serbia + Startit
# :: Goran S. Milovanović and Branko Kovač ::
########################################################

# clear all
rm(list=ls());

# Let's start with some vectors
char_list <- character(length = 0) #empty character vector
num_list <- numeric(length = 10) #length can be != 0, but 0 is default value
log_list <- logical(length = 3) #default value is FALSE

# But you can always use good ol' c() for the same purpose
log_list_2 <- c(TRUE, FALSE, FALSE, TRUE, TRUE, TRUE) #some Ts and Fs
num_list_2 <- c(1, 4, 12, NA, 101, 999) #numbers
char_list_2 <- c("abc", "qwerty", "test", "data", "science")

# Factor vectors are also part of R
fac_list <- gl(n = 4, k = 1, length = 8, ordered = T,
               labels = c("low", "med", "high", "very high")) #only mentioning now :)

# Subsetting is regular-thing-to-do when using R
char_list_2[5] #single element can be selected
log_list_2[2:4] #or some interval
num_list_2[3:length(num_list_2)] #or even length() function

# New objects can be created when subsetting
test <- num_list_2[-c(2,4)] #or something like this - keeps all but 2nd and 4th element
test_2 <- num_list_2 %in% test #operator %in% can be very useful
not_na <- num_list_2[!is.na(num_list_2)] #removing NAs using operator ! and function is.na()

# Vector ordering
sort(test, decreasing = T) #using sort() function
test[order(test, decreasing = T)] #or with order() function

# Vector sequences
seq(1,22,by = 2) #we already mentioned seq()
rep(1, 4) #but rep() is something new :)
rep(num_list_2, 2) #replicate num_list_2, 2 times

# Concatenation
new_num_vect <- c(num_list, num_list_2) #using 2 vectors to create new one
new_num_vect
new_combo_vect <- c(num_list_2, log_list) #combination of num and log vector
new_combo_vect #all numbers? false to zero? coercion in action
new_combo_vect_2 <- c(char_list_2, num_list_2) #works as well
new_combo_vect_2 #where are the numbers?
class(new_combo_vect_2) #all characters

# Matrices are available in R
matr <- matrix(data = c(1,3,5,7,NA,11), nrow = 2, ncol = 3) #2x3 matrix
class(matr) #yes, it's matrix
typeof(matr) #double as expected
matr[,2] #2nd column
matr[3,] #oops, out of bounds, there's no 3rd row
matr[2,3] #element in 2nd row and 3rd column
matr_2 <- matrix(data = c(1,3,5,"7",NA,11), nrow = 2, ncol = 3) #another 2x3 matrix
class(matr_2) #matrix again
typeof(matr_2) #but not double anymore, type conversion in action!
t(matr_2) #transposed matr_2

# What can we do if a matrix needs to encompass different types of data?
# Introducing data frame!
library(datasets) #there are some datasets in base R like mtcars
cars_data <- mtcars

# Some useful information about data frames
str(cars_data) #let's see what we have here
summary(cars_data) #more information about mtcars dataset
names(cars_data) #column names
?mtcars #dataset documentation is *very* important

# Think of data frame columns as vectors! Because they are!
mean(cars_data$mpg) #mean of cars_data mpg (miles per gallon) column
median(cars_data$cyl) #median of cars_data cyl (cylinders) column
is.list(cars_data[1,]); #but rows are lists!

# Let's do some data frame subsetting
cars_data[-1, ] # first row out
cars_data[ ,-1] # first column out
cars_data[c(1,3)] #keeping 1st and 3rd column only
cars_data[-c(1,3)] #removing 1st and 3rd column
cars_data[ ,-c(1,3)] #same as the previous line of code
cars_data[!duplicated(cars_data$mpg), ] #maybe we want to remove all cars with same mpg?
#remember it keeps only the first occurrence!
subset(cars_data, mpg < 19) #this is one way (and it can be slow!)
cars_data[cars_data$mpg < 19, ] #this is another one (faster)
cars_data[which(cars_data$mpg < 19), ] #and another one (usually even faster)
cars_data[cars_data$mpg > 20 & cars_data$am == 1, ] #multiple conditions
cars_data[grep("Merc", row.names(cars_data), value=T), ] #filtering by pattern match

# Data frame transformations
cars_data$trans <- ifelse(cars_data$am == 0, "automatic", "manual") #we can add new columns
cars_data$trans <- NULL #or we can remove them
cars_data[c(1:3,11,4,7,5:6,8:10)] #this way we change column order

# Separation and joining of data frames
low_mpg <- cars_data[cars_data$mpg < 15, ] #new data frame with mpg < 15
high_mpg <- cars_data[cars_data$mpg >= 15, ] #new data frame with mpg >= 15
mpg_join <- rbind(low_mpg, high_mpg) # we can combine 2 data frames like this
car_condition <- data.frame(sample(c("old","new"), replace = T, size = 32)) #creating random data frame
#with "old" and "new" values
names(car_condition) <- "condition" #for all kinds of objects
colnames(car_condition) <- "condition" #for "matrix-like" objects, but same effect here
rownames(car_condition) <- rownames(cars_data) #use row names of one data frame as row names of other
mpg_join <- cbind(mpg_join, car_condition) #or combine data frames like this
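One caveat about the last step: cbind() pairs rows by position, not by row name, and mpg_join was reordered by the rbind() of low_mpg and high_mpg. With random "old"/"new" labels this does not matter, but with real data it can misalign rows. The sketch below uses two small made-up data frames (left, right and their columns are hypothetical) to show the difference, and how merge() with by = "row.names" aligns on row names instead.

```r
# Hypothetical data: the same three row names, but in a different order
left <- data.frame(mpg = c(10, 20, 30),
                   row.names = c("a", "b", "c"))
right <- data.frame(condition = c("new", "old", "new"),
                    row.names = c("c", "a", "b"),
                    stringsAsFactors = FALSE)

# cbind() glues rows together by position and keeps the row names of 'left',
# so row "a" receives the condition that belongs to row "c" in 'right'
by_position <- cbind(left, right)
by_position

# merge() on row names matches "a" with "a", "b" with "b", and so on;
# the matched names end up in a column called Row.names
by_name <- merge(left, right, by = "row.names")
by_name
```

When two data frames are known to be in the same row order, cbind() is fine; otherwise matching on a key, as merge() does here, is the safer choice.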
Chapters 1 - 5
