Homework 11, POLS 8505: MEASUREMENT THEORY
Due 13 April 2015



  1. In this problem we are going to try out the metric unfolding version of SMACOF ("Scaling by Maximizing a Convex Function" or "Scaling by Majorizing a Complicated Function") which is discussed in Chapter 8 of Borg and Groenen. We are going to apply it to the 2008 NES Candidate Feeling Thermometers.

    Download the R program:

    #  THIS IS THE ONE DIMENSION SECTION OF THE PROGRAM
    #  ONE DIMENSIONAL RESULTS
    #
    zmetric1 <- result1$conf.col
    zmetric1 <- as.numeric(zmetric1)
    #dim(zmetric) <- c(nq,ndim)   We do not need this because we have a one-dimensional vector
    xmetric1 <- result1$conf.row
    xmetric1 <- as.numeric(xmetric1)
    #dim(xmetric) <- c(np,ndim)
    sse11 <- 0.0
    sse22 <- 0.0
       for (i in 1:np) {
         for(j in 1:nq) {
          dist_i_j <- 0.0
    #
    #  Calculate distance between points
    #
           dist_i_j <- dist_i_j+ (xmetric1[i]-zmetric1[j])*(xmetric1[i]-zmetric1[j])
          sse11 <- sse11 + ((TEIGHT[i,j]) - sqrt(dist_i_j))*((TEIGHT[i,j]) - sqrt(dist_i_j))
          sse22 <- sse22 + ((TEIGHT[i,j]) - sqrt(dist_i_j))*((TEIGHT[i,j]) - sqrt(dist_i_j))*weightmat[i,j]
         }
       }
    obama.voters <- xmetric1[zz[,3]==1]  zz[,3] has the who the respondent voted for
    mccain.voters <- xmetric1[zz[,3]==3]
    non.voters <- xmetric1[NOTVOTE==1 | NOTVOTE==2 | NOTVOTE==3]
    #                                    Note the logic here
    obamaShare <- length(obama.voters)/(length(obama.voters)+length(mccain.voters)+length(non.voters))
    mccainShare <- length(mccain.voters)/(length(obama.voters)+length(mccain.voters)+length(non.voters))
    nonShare <- length(non.voters)/(length(obama.voters)+length(mccain.voters)+length(non.voters))
    #                           
    gvdens <- density(obama.voters)  Get a smoothed histogram of the voters
    gvdens$y <- gvdens$y*obamaShare  Adjust the heigth of the density by the share of the vote
    #                                This makes the three densities add up to 1
    bvdens   <- density(mccain.voters)
    bvdens$y <- bvdens$y*mccainShare
    #
    nvdens   <- density(non.voters)
    nvdens$y <- nvdens$y*nonShare
    #                                Simple trick to make the 3 densities fit nicely in the graph
    maxheight <- 1.1*(max(gvdens$y,bvdens$y,nvdens$y))
    windows()
    plot(gvdens,xlab="",ylab="",
          main="",
          xlim=c(-1.5,1.5),ylim=c(0,maxheight),type="l",lwd=3,col="red",font=2)
    lines(bvdens,lwd=3,col="blue")
    lines(nvdens,lwd=3,col="black")
    # Main title
    mtext("2008 Thermometer Scaling\nObama and McCain Voters and Non-Voters",side=3,line=1.50,cex=1.2,font=2)
    # x-axis title
    mtext("Liberal - Conservative",side=1,line=2.75,cex=1.2)
    # y-axis title
    mtext("Density",side=2,line=2.5,cex=1.2)
    #                           This puts the proportions each candidate gets onto the graph
    text(-1.0,0.53,paste("Obama Voters ",  
                    100.0*round(obamaShare, 3)),col="red",font=2)
    text(-1.0,0.5,paste("McCain Voters ",
                    100.0*round(mccainShare, 3)),col="blue",font=2)
    text(-1.0,0.47,paste("Non Voters  ",
                    100.0*round(nonShare, 3)),col="black",font=2)
    #
    1. Run nes2008_thermometers_smacof_2.r and turn in the two plots AFTER YOU HAVE NEATLY FORMATTED THEM!! In particular, be sure that Obama is on the left side of the plots and McCain is on the right side of the plots. Also, you should adjust the candidate names that are placed on top of the voter distributions so that they can be easily seen.

    2. Report result$stress, sse1, sse2, result1$stress, sse11, sse22.

    3. Report zmetric1. Relative to xmetric1, do the results make sense to you? Explain.

  2. In this problem we are going continue trying out the metric unfolding version of SMACOF ("Scaling by Maximizing a Convex Function" or "Scaling by Majorizing a Complicated Function") which is discussed in Chapter 8 of Borg and Groenen. We are going to apply it to the 1968 NES Candidate Feeling Thermometers and compare it to MLSMU6.FOR -- a metric unfolding program I wrote in 1978.

    Download the R program:

    #  Double-Center Function for a rectangular squared distance matrix
    doubleCenterRect2 <- function(T){
      p <- dim(T)[1]
      n <- dim(T)[2]
      -(T-matrix(apply(T,1,mean),nrow=p,ncol=n)-
         t(matrix(apply(T,2,mean),nrow=n,ncol=p))+ mean(T))/2
    }
    #
    etc etc etc
    TTSQDC <- doubleCenterRect2(TTSQ)  TTSQ is TT squared with the squared mean in the 
    #                                    Missing entries
    #  Perform Singular Value Decomposition
    #
    xsvd <- svd(TTSQDC)
    #
    #  The Two Lines Below Put the Singular Values in a
    #    Diagonal Matrix -- The first one creates an 
    #    identity matrix and the second command puts
    #    the singular values on the diagonal
    #
    Lambda <- diag(nq)
    diag(Lambda) <- xsvd$d
    #
    #  Compute U*LAMBDA*V' for check below
    #
    XREPRODUCE <- xsvd$u %*% Lambda %*% t(xsvd$v)
    ssecheck <- sum(abs(TTSQDC-XREPRODUCE))     Check the Decomposition
    #
    # Starting Coordinates for MLSMU6 Algorithm
    #
    zz <- xsvd$v[,1:2]           Use the Left Singular Vectors as the Stimuli Coordinate Starts
    #
    xx <- rep(0,np*ndim)
    dim(xx) <- c(np,ndim)        Use the Right Singular Vectors as the Stimuli Coordinate Starts
    xx[,1] <- xsvd$u[,1]*sqrt(xsvd$d[1]) Multiplied by the Square Root of the Singular Values
    xx[,2] <- xsvd$u[,2]*sqrt(xsvd$d[2])
    ###################################################
    ### chunk number 15: 
    ###################################################
    ##Here is suma over the NQ dimensions This Function calculates the Sum of Squared Error   
    sumaj <- function(i){                   for the ith Row   
      jjj <- 1:nq
      j <- jjj[!is.na(T[i,])]    Here is the trick to have j only range over the non-missing values   
      s <- (xx[i,1]-zz[j,1])^2+(xx[i,2]-zz[j,2])^2  Note that "j" is a vector so this is a vector calculation   
      s=sqrt(s)
      sx=TEIGHT[i,j]
      sum((s-sx)^2)       Here is the Sum of Squared Error   
    }
    ## now get SUMA over NP dimension
    #xx <- xmetric   # Used as a check on the MLSMU6 Algorithm
    #zz <- zmetric
    sumvector <- sapply(1:np,sumaj)  
    suma <- sum(sumvector)   Here is the Total Sum of Squared Error   
    xxx <- xx
    zzz <- zz
    kp=0
    ktp=0
    sumalast <- 100000.00
    SAVEsumalast <- sumalast
    for(loop in 1:50){
      xxx[,1:2] <- 0
      zzz[,1:2] <- 0
      kp=kp+1
      ktp=ktp+1
    
      for(i in 1:np){      This For Loop is doing the sum of the line equations for the Stimuli   
        for(j in 1:nq){
          if(!is.na(T[i,j])){
            s=0
            for(k in 1:ndim)
              s <- s+(xx[i,k]-zz[j,k])^2
            xc <- ifelse(s==0, 1.0, TEIGHT[i,j]/sqrt(s))
            for(k in 1:ndim)
              zzz[j,k]=zzz[j,k]+(xx[i,k]-xc*(xx[i,k]-zz[j,k]))/xcol[j]
          }
        }
      }
      for(k in 1:ndim){    This For Loop is doing the sum of the line equations for the Respondents   
        for(i in 1:np){
          sw <- 0.0
          for(j in 1:nq){
            if(!is.na(T[i,j])){
              s=0
              for(kk in 1:ndim)
                s <- s+(xx[i,kk]-zzz[j,kk])^2
              xc <- ifelse(s==0, 1.0, TEIGHT[i,j]/sqrt(s))
              xxx[i,k]=xxx[i,k]+(zzz[j,k]-xc*(zzz[j,k]-xx[i,k]))
            }
          }
          xxx[i,k]=xxx[i,k]/xrow[i]
        }
      }
      xx <- xxx
      zz <- zzz
    sumvector <- sapply(1:np,sumaj)
    suma <- sum(sumvector)   Calculate Current SSE   
    #
      print(c(loop,suma))
      done=((sumalast-suma)/suma)<0.0005  Compare the Current with the Previous SSE   
      sumalast=suma
      if(done) break   If Convergence Break out of the Loop   
    }
    1. Run nes68_thermometers_smacof_3.r and turn in the plot that it makes.

    2. Report result$stress, sse1, sse2, and sumalast (this will appear on your screen).

    3. Make two panel plots of the SMACOF configurations against the MLSMU6 configurations. Namely, rotate xx to best match xmetric and apply that same rotation matrix to zz. That is, one two-panel plot shows the respondents and one two panel plot shows the 12 Political figures. Report the rotation matrix, T, and the before and after r-squares by dimension.

  3. In this problem we are going to compare probit and logit coefficients using the analysis in my book on pages 37-40 (also, work through Notes on the Geometry of Logit and Probit). Download the R program:

    #
    # bush_v_gore_example_h106.r -- GLM Examples
    #
    rm(list=ls(all=TRUE))
    #
    library(MASS)
    library(foreign)
    #
    setwd("C:/uga_course_homework_11_2015/")
    data <- read.dta("hdmg106_2009_fixed.dta")  Read in the Stata Data set
    attach(data,warn.conflicts = FALSE)
    #                                           Grab all the variables
    T <- cbind(bush00,gore00,black00,south,hispanic00,income,owner00,dwnom1n,dwnom2n,ybush,ygore)
    TT <- T
    mode(TT) <- "double"                        Name the variables -- handy below
    colnames(TT) <- c("Bushvote","Gorevote","blackpct","southdum","hispandum","income","ownerhome",
                      "dwnom1n","dwnom2n","ybush","ygore")
    #
    nrow <- length(TT[,1])
    ncol <- length(TT[1,])
    #
    #
    #                 STATA OUTPUT FOR REFERENCE
    #  &&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&&
    #. summ
    #
    #    Variable |       Obs        Mean    Std. Dev.       Min        Max
    #-------------+--------------------------------------------------------
    #      bush00 |       432    46.81019    14.36873          6         80
    #      gore00 |       432    49.71296      14.005         20         93
    #     black00 |       432    12.68333    16.25699         .3         96
    #       south |       432    .3101852    .4631056          0          1
    #  hispanic00 |       432    12.31597    16.25281         .6         86
    #-------------+--------------------------------------------------------
    #      income |       432    30.75224    8.319547      15.06     57.219
    #     owner00 |       432    65.86875    11.92404          8       84.3
    #     dwnom1n |       432    .0586875    .4671698      -.815      1.269
    #     dwnom2n |       432   -.0381065    .4706545     -1.123       1.27
    #       ybush |       432    .4861111    .5003865          0          1  1 if Bush > 50% in a CD, 0 otherwise
    #-------------+--------------------------------------------------------
    #       ygore |       432       .4375    .4966535          0          1
    #. probit ybush black00 south hispanic00 income owner00 dwnom1n dwnom2n
    #
    #Probit regression                                 Number of obs   =        432
    #                                                  LR chi2(7)      =     338.29
    #                                                  Prob > chi2     =     0.0000
    #Log likelihood = -130.12789                       Pseudo R2       =     0.5652
    #
    #------------------------------------------------------------------------------
    #       ybush |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    #-------------+----------------------------------------------------------------
    #     black00 |  -.0285973   .0097913    -2.92   0.003    -.0477878   -.0094067
    #       south |   .7695862   .2545783     3.02   0.003     .2706219    1.268551
    #  hispanic00 |  -.0089458   .0069163    -1.29   0.196    -.0225015    .0046099
    #      income |  -.0241489   .0126394    -1.91   0.056    -.0489217    .0006239
    #     owner00 |   .0235461   .0143687     1.64   0.101    -.0046161    .0517083
    #     dwnom1n |   2.761974   .2619813    10.54   0.000       2.2485    3.275448
    #     dwnom2n |   1.136417   .2339829     4.86   0.000     .6778186    1.595015
    #       _cons |  -.9697998   1.077738    -0.90   0.368    -3.082127    1.142528
    #------------------------------------------------------------------------------
    #                            How to run a Probit in R
    probitbush <-glm(ybush ~ black00+south+hispanic00+income+owner00+dwnom1n+dwnom2n,family=binomial(link=probit))
    sumprobitbush <- summary(probitbush)  Get the Summary
    #
    #  ---- Useful Commands To See What is in an Object
    #
    # > length(sumprobitbush)
    # [1] 17
    # > class(sumprobitbush)
    # [1] "summary.glm"
    # > names(sumprobitbush)     The coefficients are in sumprobitbush$coefficients[1:8]
    #  [1] "call"           "terms"          "family"         "deviance"      
    #  [5] "aic"            "contrasts"      "df.residual"    "null.deviance" 
    #  [9] "df.null"        "iter"           "deviance.resid" "coefficients"  
    # [13] "aliased"        "dispersion"     "df"             "cov.unscaled"  
    # [17] "cov.scaled"    
    #
    #                            How to run a Logit in R 
    logitbush <-glm(ybush ~ black00+south+hispanic00+income+owner00+dwnom1n+dwnom2n,family=binomial(link=logit))
    sumlogitbush <- summary(logitbush)  Get the Summary
    1. Run the program and report all the output NEATLY FORMATTED produced by sumprobitbush and sumlogitbush.

    2. Calculate Cosineq (Cosine Theta if you use a handicapped browser) between the probit and logit coefficients (see page 40 of my book).

    3. The probit and logit summaries report Null Deviance, Residual Deviance, and AIC. Null Deviance is the deviation of a model that contains only the intercept term, that is, a fixed probability for all observations. Residual Deviance is simply -2*Log-Likelihood. In this instance, N1=210, the number of "1"s, N0=222, the number of "0"s, K=8, the number of coefficients, and the formulas are:

      Null Deviance = -2*N1*[log(N1/(N1+N2))] -2*N2*[log(N2/(N1+N2))]
      Residual Deviance = -2*[LOG LIKELIHOOD]
      Akaike Information Criterion (AIC) = -2*[LOG LIKELIHOOD] + 2*K

      Calculate these three statistics for the probit output. Show all of your calculations.


  4. In this problem we are going to compare ordered probit and ordered logit coefficients using the analysis in my book on pages 37-40 (also, work through Notes on the Geometry of Logit and Probit). Download the R program:

      oprobit_ologit.r -- Program that runs ordered probit and ordered logit in R using the polr function
      #
      rm(list=ls(all=TRUE))
      library(foreign)
      library(MASS)
      library(stats)
      setwd("C:/uga_course_homework_11_2015")
      #
      #  READ STATA FILE HERE
      data <- read.dta("nes1968_first.dta")  Read in the Stata Data set
      attach(data,warn.conflicts = FALSE)
      #                                      Grab only the variables we need
      T <- cbind(VAR00120,VAR00156,VAR00261,VAR00263,VAR00264,VAR00478,VAR00479,VAR00480,VAR00533)
      TT <- T
      mode(TT) <- "double"                   We do not use these variable names, we just need to keep track
      colnames(TT) <- c("partyid2","education5","income2","sex2","black2","wallace2","humphrey2","nixon2","age2")
      #
      # partyid -- 0,1,2,3,4,5,6             The Codes we need
      # education -- 00 - 22=8th grade or less, 31-61=9th-12th grades, 71=some college, 81=BS/BA, 82-87, advanced
      # income -- 10 - 35
      # sex -- 1=male, 2=female
      # black -- 1=white, 2=black
      # wallace -- 0 - 97
      # humphrey -- 0 - 97
      # nixon -- 0 - 97
      # age -- 20 - 93
      #
      partyidx <- VAR00120                   Recode the variables and insert NA's for missing data
      partyid2 <- ifelse(partyidx > 6,NA,partyidx)
      #
      educationx <- VAR00156                 Note the Logic here
      education1 <- ifelse(educationx < 23,1,educationx)
      education2 <- ifelse(((23 < educationx) & (educationx < 62)),2,education1)
      education3 <- ifelse(educationx==71,3,education2)
      education4 <- ifelse(((72 < educationx) & (educationx <= 87)),4,education3)
      education5 <- ifelse(educationx > 87,NA,education4) 
      #
      incomex <- VAR00261
      income2 <- ifelse(incomex > 35,NA,incomex)
      #
      sex <- VAR00263 - 1       0=Male, 1=Female
      #
      racex <- VAR00264
      race1 <- ifelse(racex==1,0,racex)  White = 0
      race2 <- ifelse(racex==2,1,race1)  Black = 1
      race3 <- ifelse(racex>2,NA,race2)
      #
      wallacex <- VAR00478      Feeling Thermometers, max=97
      wallace1 <- ifelse(wallacex>97,NA,wallacex)
      #
      humphreyx <- VAR00479
      humphrey1 <- ifelse(humphreyx>97,NA,humphreyx)
      #
      nixonx <- VAR00480
      nixon1 <- ifelse(nixonx>97,NA,nixonx)
      #
      agex <- VAR00533          Maximum age in the 1968 data was 93
      age1 <- ifelse(agex>93,NA,agex)
      #
      #  VARIABLES
      # partyid2, education5, income2, sex, reace3, wallace1, humphrey1, nixon1, age1
      #                        Stick Variables in a Matrix
      XDATA <- cbind(partyid2,education5,income2,sex,race3,wallace1,humphrey1,nixon1,age1)
      YDATA <- na.omit(XDATA)  List-Wise Deletion -- NEAT COMMAND
      #
      #  ORDERED PROBIT
      #
      scarecrow <- polr(as.ordered(YDATA[,1]) ~ YDATA[,2]+YDATA[,3]+YDATA[,4]+YDATA[,5]+YDATA[,6]+YDATA[,7]+YDATA[,8]+YDATA[,9],
                        method="probit")
      #
      #  ORDERED LOGIT
      #
      blackcat <- polr(as.ordered(YDATA[,1]) ~ YDATA[,2]+YDATA[,3]+YDATA[,4]+YDATA[,5]+YDATA[,6]+YDATA[,7]+YDATA[,8]+YDATA[,9],
                        method="logistic")
    1. Run the program and report all the output NEATLY FORMATTED produced by summary(scarecrow) and summary(blackcat).

    2. Calculate Cosineq (Cosine Theta if you use a handicapped browser) between the ordered probit and ordered logit coefficients (see page 40 of my book).

    3. The ordered probit and ordered logit summaries report Residual Deviance, and AIC. Calculate these two statistics for both the ordered probit and ordered logit outputs. Show all of your calculations.

  5. The aim of this problem is to familarize you with the R version of Optimal Classification (OC). Download the R program:

      oc_in_R_2015.r -- R Program that runs OC. Reads an ORD file, "rollcall" in pscl, runs Optimal Classification, writes the legislator and roll call coordinates to disk, and makes a two panel plot showing the legislators and the roll calls with arrows on the cutting lines indicated the direction of a YEA vote. The Program uses SEN112KH.ORD (Senate roll call matrix for the 112th Senate).

    1. Run the program and turn in the two plots that the program generates.

    2. Report the fits of the one and two dimensional models -- result1$fits and result$fits.

    3. Report summary(result1) and summary(result) NEATLY FORMATTED.

    4. On your hard drive you will have two files: Sen112_X.txt and Sen112_Z.txt. Plot the Senator Coordinates in Sen112_X.txt (this is the same graph that the program produces) only now put the normal vector for the second roll call in the plot as a black arrow (this is the second row of Sen112_Z.txt) and draw the cutting line as a black line that is perpendicular to the normal vector and passes through the normal vector at the midpoint coordinates (which lie on the normal vector).