calculus for statistics
Posted by admin on Friday Aug 27, 2010 Under Numbers

To eliminate rows with condition
# eliminate rows that Age is empty
dat<-dat[-which(temp$Age==""),]
R Error: FEXACT error 7
Testing with small sample size, it is more preferable to use the Fisher’s Exact test than the Chi-square test.
fisher.test(counts, simulate.p.value=TRUE)
If you have too many rows or columns, you may get an error saying,
FEXACT error 7.
LDSTP is too small for this problem.
Try increasing the size of the workspace.
You can still do the test by adding “simulate.p.value=TRUE”
fisher.test(counts, simulate.p.value=TRUE)
How to create contingency table from categorical data in r.
Example:
There are three categorical variables x1, x2, x3 measured from wild cats where
x1 = gender (male, female)
x2 = age (young, kitten, adult)
x3 = test result ( positive = 1, negative =0).
r table will generate two tables: 2by2 table for each of x3=0 and x3=1.
# r code
table(x1, x2, x3)
As shown below, the R output has two parts when x3=0 and x3=1.
Row represents Gender (x1) and the column represents Age (x2).
The numbers are counts of cats that fall into the corresponding categories.
, , = 0
A K Y
F 14 84 2
M 8 97 2
, , = 1
A K Y
F 1 12 0
M 1 36 0
T values
t value = qt(alpha/2, n-1)
#example
> qt(0.975, 8 )
[1] 2.306004
Univariate and Multivariate diagnostics
Univariate diagnostic (Histogram and QQ plot)
Plot a histogram
hist(mydata.st, main="histgram", xlab="X values")
Plot QQ plot
## pch =16 (16 is a symbol for a filled circle)
qqnorm(mydata.st, main="QQ plot", pch=16, col="navy")
Multivariate dignostics
Chi-squre plot
We will graph distance vs chsq
# function to compute distance between X and X.bar
# argument is X, X.bar and S.inv
f.dist <- function(x,x.bar,S.inv){
return(t(x-x.bar)%*%S.inv%*%(x-x.bar))
}
dist_<-apply(mydat.dat, 1, f.dist, x.bar=apply(mydat.dat, 2, mean), S.inv= solve(cov(mydat.dat)))
# Compute u's from chi-square
u.cs <- qchisq((1:150-.5)/150, 4)
Make a plot
plot(u.cs, sort(dist_), pch=16, col="navy", xlab="Theoretical Quantiles", ylab="Sample Quantiles", main="Chi-Square Plot")
If the chi-square plot has a line with a slope=1 and intercept=0, then the data can be assumed to be multivariate normal
Nonparametric classification method
Idea behind knn is that you measure distance between new value (x0) and each of the neighboring points and count the first k shortest distances, then classify the new value to the group that wins the majority rule.
Steps:
1. Choose k as an odd integer
2. Measure the distance between xo and each of the training data points
3. Order the distance
4. Select first k distances
5. Assign the new value to the group that wins the majority rule
R Code for KNN
# importing data into a dataframe
iris.dat <- read.table("iris_short.txt", header=T)
# plot the data
# first 50 data are Setosa and the second 50 Versicolor
plot(iris.dat[1:50,2:3], xlim=c(4.3, 7.0),ylim=c(1.0, 5.1), pch=1, col="red", main="Iris Data")
points(iris.dat[51:100, 2:3], pch=1, col="darkgreen")
# legend (#xcoord, #ycoord, col=c("colors"), pch=c("shape of points"), text.col=c("textcolor")
# legend=c("actual text you want to use")
legend(4.5, 4.9, col=c("red", "darkgreen"), pch=c(1,1), text.col=c("red", "darkgreen"), legend=c("Setosa", "Versicolor"))
# library class is required for knn
library(class)
# generating 100 data points in the range of (4.3, 7.0) and (1.0, 5.1)
x1<-seq(4.3, 7.0, len=100)
x2<-seq(1.0, 5.1, len=100)
x1.new<-rep(x1, 100)
x2.new<-rep(x2, rep(100,100))
iris.knn.2<-knn(iris.dat[,2:3], cl=iris.dat$species, test=cbind(x1.new, x2.new), k=2)
## plotting k nearest neighbors
# pt.col is color that will be assigned to each point based on what knn classifies the data as
# iris.knn.2 will have either 1 or 2.
# if iris.knn.2 >1, then take darkgreen, red o.w
pt.col<- ifelse(c(iris.knn.2) > 1, "darkgreen", "red")
# drawing points on the plot
points(x1.new, x2.new, col=pt.col, pch=20, cex=.1)
# drawing contour
contour(x1,x2,matrix(iris.knn.2,nc=100),add=T,nlevel=1,lty=1,drawlabel=F)
Preparation and EDA
Data should be standardized in factor analysis
scale(crime.dat)
#standardize data
crime.dat.sd= scale(crime.dat)
To obtain number of factors to use for the factor analysis, PCA can be used
#PCA for EDA
crime.pca<-princomp(crime.dat.sd)
Bartlett scores
crime.fa.s<- factanal(crime.dat.sd, 3, scores="Bartlett", rotation="varimax")
crime.fa.s$scores
Factor Analysis
#arg(standarized data, no of factors, rotation)
factanal(crime.da.sd, 3, rotation="none"
Hypothesis Testing
H0: The number of factor is sufficient
Ha: The number of Factor is not sufficient
Decision Rule
Reject H0 if test statistic > Chiq(alpha, df) or if p-value is small
[Note: the test statistic and p-value can be obtained from R output of factoranal]
Performing a PCA after standardizing the variables and obtain estimates for the principal components for the standardized variables.
Reading in athelete’s data
ath.dat <- read.table("athelete.txt")
Standardizing the data
ath.dat.std <- scale(ath.dat)
Correlation matrix (since covariance of standardized data is correlation)
R = cov(ath.dat.std)
Eigen Values
lambda = eigen(R)$val
Eigen values are read to assess which components explain the variance the most. In this case, the first two show values above 1 therefore we will take the first two components.
[1] 1.0900625 1.0290211 0.8809163
Eigen Vector
es= eigen(R)$vec
[,1] [,2] [,3]
[1,] 0.7476122 -0.1215134 -0.6529246
[2,] -0.2827011 0.8313790 -0.4784235
[3,] 0.6009626 0.5422577 0.5871971
Three eigen vectors associated with the labmda values. The three vectors are eigen vectors of the covariance matrix and also the loadings of Principle components.
You can find the loadings (eigen vectors) by obtaining loadings of the princomp output
loadings(ath.pca)
Loadings:
Comp.1 Comp.2 Comp.3
X1 0.748 -0.122 -0.653
X2 -0.283 0.831 -0.478
X3 0.601 0.542 0.587