Custom Search

Logistic Regression and How to interpret it

Posted by admin on Friday Jan 21, 2011 Under Statistics

When to use Simple Logistic Regression

Logistic regression is used when Yi, response variable is binary, 0 or 1.

Meaning of Response Function of binary response var
Yi = beta_0 + beta_1* Xi + ei

Considering Yi as bernoulli random variable,
P(Yi =1 ) = pi ** let’s say probability of success
P(Yi = 0 ) = 1-pi ** probability of failure
E(Yi) = 1(pi) + 0(1-pi) = pi which equals P(Yi=1)
Therefore, we can say Expected value of Yi is same as probability of Yi being 1. (p of success)

Problems with binary response variables are:
1. error term can only take two values
2. variance is dependent of Xs

How to run logistic regression in R

#upload data
dat1<-read.table(dat1.txt, sep='\t', header=T)
test.logr<-glm( result~gender, family=binomial(logit))

Let Yi=1 success and Yi=0 failure and
Let probability of success (p(Yi=1)) be 0.2 and probability of failure (p(Yi=0) be 0.8.
The odds of success is p/(1-p) = 0.2/0.8 =0.25 is 1 to 4

Basically logit is transforming the odds function using log
log(p/(1-p)) . It's monotonic transformation and it can ease the problem of restricted range.

So how does logit look like?
logit(p) = log(p/(1-p)) = b0 + b1X1 + ... bkXk
p = exp(b0+b1x1 + bkxk)/(1+exp(b0+b1x1 + ... + bkxk)

How to Interpret coefficients?

logit(p) = b0 + b1(school),

where school (public =1 and private =0)

success = 0, failure = 1
private = 0 , public =1

b0 is log odds for public since we coded private =0 (baseline)
b1 = log(1.325) = Odds ratio of private to public

Let coefficients be b1 = 0.5234 and b0 = -1.23
How to interpret the coefficients?

By exponentiating b1 (that is log(1.325)), odds ratio may be calculated and it can interpreted as:

Odds for private school being successful are 33% than odds for public school.

To check, you can simply compute odds for public school and private school, then log the ratio log(1.325) then you will get b1 value.

Multiple Logistic Regression Model

It can be interpreted just like a simple logistic regression. But you interpret it as assuming that all other predictor variables are held constant.

With coefficients, you may compute odds ratio and can be worded as follows:
the odds of a student being successful increase by xx percent with each additional year of tutoring (X1) for given soceioeconomic status and location.
the odds of a student being successful in area 1 is at most 7 time as as great as for a student
in area 2. where area1 = 1 and area2 coded as 0

http://division.aomonline.org/rm/1997_forum_regression_models.html

Tags : , , , , | add comments

Plotting density in R

Posted by admin on Wednesday Oct 13, 2010 Under Statistics

How to plot density
plot(density(DATA))

Rainbow color in R
If you want to make a plot have rainbow color range, you can use rainbow function:

rcol=rainbow(length(YOURDATA))
plot(DATAX, DATAY, type=”l”)
points(DATAX, DATAY, pch=16, col=rcol)

Simple Plot

How to change the size of text in a plot?
Use argument cex.[attribute] , and examples are below:

main titles by cex.main
sub titles by cex.sub
axis annonation by cex.axis
xlab and ylab by cex.lab

Legend
legend(x, y = NULL, legend, fill = NULL, col = par(“col”),
border=”black”, lty, lwd, pch,
angle = 45, density = NULL, bty = “o”, bg = par(“bg”),
box.lwd = par(“lwd”), box.lty = par(“lty”), box.col = par(“fg”),
pt.bg = NA, cex = 1, pt.cex = cex, pt.lwd = lwd,
xjust = 0, yjust = 1, x.intersp = 1, y.intersp = 1,
adj = c(0, 0.5), text.width = NULL, text.col = par(“col”),
merge = do.lines && has.pch, trace = FALSE,
plot = TRUE, ncol = 1, horiz = FALSE, title = NULL,
inset = 0, xpd, title.col = text.col, title.adj = 0.5)

symbols for R

Tags : , , , , , | Comments Off

Manipulating data frame/table

Posted by admin on Friday Aug 20, 2010 Under Statistics

To eliminate rows with condition

# eliminate rows that Age is empty
dat<-dat[-which(temp$Age==""),]

Tags : , , , , , , | Comments Off

Contingency Table for Categorical data and R

Posted by admin on Wednesday Aug 11, 2010 Under Statistics

How to create contingency table from categorical data in r.

Example:
There are three categorical variables x1, x2, x3 measured from wild cats where
x1 = gender (male, female)
x2 = age (young, kitten, adult)
x3 = test result ( positive = 1, negative =0).

r table will generate two tables: 2by2 table for each of x3=0 and x3=1.

# r code
table(x1, x2, x3)

As shown below, the R output has two parts when x3=0 and x3=1.
Row represents Gender (x1) and the column represents Age (x2).
The numbers are counts of cats that fall into the corresponding categories.

, , = 0


A K Y
F 14 84 2
M 8 97 2

, , = 1


A K Y
F 1 12 0
M 1 36 0

Tags : , , , , , , , , , | add comments

Validating assumption of multivariate normal data

Posted by admin on Monday Aug 2, 2010 Under Statistics

Univariate and Multivariate diagnostics

Univariate diagnostic (Histogram and QQ plot)

Plot a histogram

hist(mydata.st, main="histgram", xlab="X values")

Plot QQ plot

## pch =16 (16 is a symbol for a filled circle)
qqnorm(mydata.st, main="QQ plot", pch=16, col="navy")


Multivariate dignostics

Chi-squre plot
We will graph distance vs chsq

# function to compute distance between X and X.bar
# argument is X, X.bar and S.inv
f.dist <- function(x,x.bar,S.inv){
return(t(x-x.bar)%*%S.inv%*%(x-x.bar))
}

dist_<-apply(mydat.dat, 1, f.dist, x.bar=apply(mydat.dat, 2, mean), S.inv= solve(cov(mydat.dat)))

# Compute u's from chi-square
u.cs <- qchisq((1:150-.5)/150, 4)

Make a plot
plot(u.cs, sort(dist_), pch=16, col="navy", xlab="Theoretical Quantiles", ylab="Sample Quantiles", main="Chi-Square Plot")

If the chi-square plot has a line with a slope=1 and intercept=0, then the data can be assumed to be multivariate normal

Tags : , , , , , , , , , , | 2 comments

k nearest neighbors classification (knn)

Posted by admin on Friday Jul 16, 2010 Under Statistics

Nonparametric classification method
Idea behind knn is that you measure distance between new value (x0) and each of the neighboring points and count the first k shortest distances, then classify the new value to the group that wins the majority rule.

Steps:
1. Choose k as an odd integer
2. Measure the distance between xo and each of the training data points
3. Order the distance
4. Select first k distances
5. Assign the new value to the group that wins the majority rule

R Code for KNN

# importing data into a dataframe
iris.dat <- read.table("iris_short.txt", header=T)

# plot the data
# first 50 data are Setosa and the second 50 Versicolor
plot(iris.dat[1:50,2:3], xlim=c(4.3, 7.0),ylim=c(1.0, 5.1), pch=1, col="red", main="Iris Data")
points(iris.dat[51:100, 2:3], pch=1, col="darkgreen")

# legend (#xcoord, #ycoord, col=c("colors"), pch=c("shape of points"), text.col=c("textcolor")
# legend=c("actual text you want to use")
legend(4.5, 4.9, col=c("red", "darkgreen"), pch=c(1,1), text.col=c("red", "darkgreen"), legend=c("Setosa", "Versicolor"))

# library class is required for knn
library(class)

# generating 100 data points in the range of (4.3, 7.0) and (1.0, 5.1)
x1<-seq(4.3, 7.0, len=100)
x2<-seq(1.0, 5.1, len=100)
x1.new<-rep(x1, 100)
x2.new<-rep(x2, rep(100,100))
iris.knn.2<-knn(iris.dat[,2:3], cl=iris.dat$species, test=cbind(x1.new, x2.new), k=2)

## plotting k nearest neighbors
# pt.col is color that will be assigned to each point based on what knn classifies the data as
# iris.knn.2 will have either 1 or 2.
# if iris.knn.2 >1, then take darkgreen, red o.w
pt.col<- ifelse(c(iris.knn.2) > 1, "darkgreen", "red")

# drawing points on the plot
points(x1.new, x2.new, col=pt.col, pch=20, cex=.1)

# drawing contour
contour(x1,x2,matrix(iris.knn.2,nc=100),add=T,nlevel=1,lty=1,drawlabel=F)

Tags : , , , , , , , | add comments

Factor Analysis (FA)

Posted by admin on Tuesday Jul 6, 2010 Under Statistics

Preparation and EDA
Data should be standardized in factor analysis
scale(crime.dat)

#standardize data
crime.dat.sd= scale(crime.dat)

To obtain number of factors to use for the factor analysis, PCA can be used
#PCA for EDA
crime.pca<-princomp(crime.dat.sd)

Bartlett scores
crime.fa.s<- factanal(crime.dat.sd, 3, scores="Bartlett", rotation="varimax")
crime.fa.s$scores

Factor Analysis

#arg(standarized data, no of factors, rotation)
factanal(crime.da.sd, 3, rotation="none"

Hypothesis Testing
H0: The number of factor is sufficient
Ha: The number of Factor is not sufficient

Decision Rule
Reject H0 if test statistic > Chiq(alpha, df) or if p-value is small
[Note: the test statistic and p-value can be obtained from R output of factoranal]

Tags : , , , , , , | add comments

Principle Component Analysis (PCA)

Posted by admin on Sunday Jul 4, 2010 Under Statistics



Performing a PCA after standardizing the variables and obtain estimates for the principal components for the standardized variables.

Reading in athelete’s data
ath.dat <- read.table("athelete.txt")

Standardizing the data
ath.dat.std <- scale(ath.dat)

Correlation matrix (since covariance of standardized data is correlation)
R = cov(ath.dat.std)

Eigen Values
lambda = eigen(R)$val

Eigen values are read to assess which components explain the variance the most. In this case, the first two show values above 1 therefore we will take the first two components.

[1] 1.0900625 1.0290211 0.8809163

Eigen Vector
es= eigen(R)$vec

[,1] [,2] [,3]
[1,] 0.7476122 -0.1215134 -0.6529246
[2,] -0.2827011 0.8313790 -0.4784235
[3,] 0.6009626 0.5422577 0.5871971

Three eigen vectors associated with the labmda values. The three vectors are eigen vectors of the covariance matrix and also the loadings of Principle components.

You can find the loadings (eigen vectors) by obtaining loadings of the princomp output

loadings(ath.pca)

Loadings:
Comp.1 Comp.2 Comp.3
X1 0.748 -0.122 -0.653
X2 -0.283 0.831 -0.478
X3 0.601 0.542 0.587

Tags : , , , , , , , | 7 comments