COMM 550-Lab Six- ANOVA

Back

This guide will walk you through how to perform the various types of ANOVA in R. We'll be using some real survey data which I collected for 550. The survey features players of the online game EVE online and has a number of demographic and game play variables. For the first example we'll be examining the difference in social capital levels between players at different levels of experience. There are competing theories regarding the relationship between gaming and social capital with some authors suggesting that games detract from social capital stocks by taking people away from their friends. An alternative view is that games provide a new venue for interaction and may boost social capital stocks.

Let's start as always by loading the libraries.

library(car)
library(psych)

We'll be looking at one way ANOVA today. This is a statistical test which determines is there is a significant difference between the means of two groups. Let's read the dataset into R and just omit any value with NA for the sake of time. Normally we would clean and examine the data a bit more closely but we have already covered these techniques in other labs.

a.data<-read.csv(url('http://joshaclark.com/wp-content/uploads/2014/05/evedemodata.csv'))

Our grouping variable will be time spent in game. The idea being that new players or casual gamers are different from their more intensely nerdy counterparts. Instead of binning according to some present values we are going to bin by percentile. So in this case I am making two bins, one which holds everyone from zero to the 50th percentile called 'low' and another for those in the 50+ percentile called 'high'

a.data<-na.omit(a.data)
summary(a.data$EVETime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0     8.0    14.0    16.4    20.0    80.0

Take a look at the quantile breakdown. This gives us the values for the 50th and 100th percentiles as well as the lowest score.

splits<-quantile(a.data$EVETime, c(0, .50, 1)) 
splits
##   0%  50% 100% 
##    1   14   80
a.data$timebin<-cut(a.data$EVETime, breaks=splits, labels=c('low', 'high'))
a.data<-na.omit(a.data)
table(a.data$timebin)
## 
##  low high 
##  263  241
table(a.data$EVETime)
## 
## 1.5   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  18  19 
##   1  16  11  21  31  23  12  28   6  70   1  22   3  18  51   9   6   1 
##  20  21  22  23  24  25  27  28  29  30  33  34  35  36  40  45  48  49 
##  61   5   3   1   3  17   2   2   1  32   1   1   8   1  11   2   1   1 
##  50  55  58  60  65  70  75  80 
##  10   1   1   2   1   2   1   3
summary(a.data$timebin)
##  low high 
##  263  241

We need to do a Levene test to make sure that the variance is equal between our bins

leveneTest(a.data$Bridgeindex, a.data$timebin)
## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)  
## group   1    3.11  0.078 .
##       502                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The test is not significant so we have equal variance, meaning that we can go ahead with our ANOVA. The one way ANOVA simply compares means between groups. the dependent variable comes first, separated by a tilde and then the independent variable(s). Note that data can be called with a separate function or included within the formula

a1 <- aov(a.data$Bridgeindex ~ a.data$timebin)

OR

a1.1<-aov(Bridgeindex~timebin, data=a.data)
summary(a1)
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## a.data$timebin   1    1.5   1.539    3.96  0.047 *
## Residuals      502  195.2   0.389                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(a1.1)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## timebin       1    1.5   1.539    3.96  0.047 *
## Residuals   502  195.2   0.389                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both forms of syntax give us the same thing. The high f-value and significant results suggest that there is greater variance between the two groups then we would expect by chance, so there is a difference between less and more experienced players. Let's try doing the ANOVA step by step to see if we can get the same results. This will let us review how ANOVA operates as well as reviewing the math which R just did for us. First things first let's obtain the grand mean of the entire dataset with regards to our dependent variable, bridging social capital.

grand.mean<-mean(a.data$Bridgeindex)

Next let's grab the group means for the low and high bins. One way to do this would be to use a variant of our old friend describe and look at the descriptive statistics by group

describeBy(a.data$Bridgeindex, a.data$timebin)
## group: low
##   vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## 1    1 263 3.89 0.59    3.9    3.92 0.44 1.8   5   3.2 -0.64     1.03 0.04
## -------------------------------------------------------- 
## group: high
##   vars   n mean   sd median trimmed  mad min max range  skew kurtosis   se
## 1    1 241    4 0.66      4    4.05 0.59 1.3   5   3.7 -0.76     1.28 0.04

Group mean for low =3.89

low.mean<-3.89

High mean = 4

high.mean<- 4

Or we can make two new dataframes low and high and calculate the means separately. The subset function is extremely helpful and you should read the documentation when you get a chance.

?subset
low<-subset(a.data, timebin=='low', select=Bridgeindex)
high<-subset(a.data, timebin=='high', select=Bridgeindex)
low.mean<-mean(low$Bridgeindex)
high.mean<-mean(high$Bridgeindex)

Now that we have our various means we calculate our Sum of Squares. Although we don't actually need it for the analysis let's calculate the total sum of squares.

a.data$sstotal<-(a.data$Bridgeindex-grand.mean)^2
ss.total<-sum(a.data$sstotal)

SSbetween

low$ssbtwn<-(low.mean-grand.mean)^2
high$ssbtwn<-(high.mean-grand.mean)^2
ssbtwn.total<-(sum(low$ssbtwn)+sum(high$ssbtwn))

SSwithin is the summation of the score for each user on Bridging social capital minus the mean of their group, then squared.

low$sswithin<-(low$Bridgeindex-low.mean)^2
high$sswithin<-(high$Bridgeindex-high.mean)^2

Add up the two group scores to get the SSwithin total.

sswithin.total<-(sum(low$sswithin)+sum(high$sswithin))

DF for between is number of groups -1 (therefore 1) and DF within is n-#' of groups (511-2)

f<-(ssbtwn.total/1)/(sswithin.total/509)
f
## [1] 4.013

When we compare the by hand ANOVA to the one R calculated we can see the ssbtwn.total is close to the Sum Sq for timebin and the sswithin.total is close to the Sum Sq of the residual, additionally our F value is very close to the one R provides.

summary(a1)
##                 Df Sum Sq Mean Sq F value Pr(>F)  
## a.data$timebin   1    1.5   1.539    3.96  0.047 *
## Residuals      502  195.2   0.389                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ssbtwn.total
## [1] 1.539
sswithin.total
## [1] 195.2
f
## [1] 4.013

ANOVA is helpful because we can also compare across more then two groups. Let's break out sample into three bins and run the ANOVA again.

quantile(a.data$EVETime, c(0, .25, .50, .70, 1)) 
##   0%  25%  50%  70% 100% 
##  1.5  8.0 14.0 20.0 80.0
a.data$timebin4<-cut(a.data$EVETime, breaks=c(0,8,14,20,80), labels=c('bottom', 
                                                                      'middle/low', 
                                                                      'middle/high',
                                                                     'high'))

a2 <- aov(a.data$Bridgeindex ~ a.data$timebin4)
summary(a2)
##                  Df Sum Sq Mean Sq F value Pr(>F)
## a.data$timebin4   3    1.8   0.612    1.57    0.2
## Residuals       500  194.9   0.390

We will continue our examination of ANOVA next class when we look at factorial ANOVA and different strategies for drilling down and comparing between groups.

Back