Voting study application replication issue and solution #5

@patrickvossler18

Hello, I was attempting to replicate the voting study application in your paper, and I noticed an issue in the parse_data.R file that caused my replicated results not to match the results in the paper.

The cause is that parse_data.R does not set a seed before subsampling DF.nona, so the subsample differs from run to run. Here is the file with set.seed(1) added:

rm(list = ls())
set.seed(1)
# the full dataset is available from
# https://github.com/gsbDBI/ExperimentData/tree/master/Mobilization/ProcessedData
data = read.csv("mobilization_no_unlisted 2.csv")

# W is intent to treat
# contact is received treatment

covariates = c("persons", "state", "county", "competiv",
               "st_sen", "st_hse", "newreg", "vote98",
               "vote00", "age", "female")
X = data[,which(names(data) %in% covariates)]
W = data$W
received_treatment = data$contact
Y = data$vote02

DF = data.frame(X, Y, W)

DF.nona = DF[!is.na(rowSums(DF)),]

idx.all = sample(c(sample(which(DF.nona$W == 0), sum(DF.nona$W) * 3/2), which(DF.nona$W == 1)))
DF.subset = DF.nona[idx.all,]

write.csv(DF.subset, "data_clean.csv", row.names = FALSE)
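To illustrate why the seed matters for this step, here is a minimal sketch on toy data. The vector W and the helper subsample_idx are made up for illustration and are not part of parse_data.R; only the nested sample() call mirrors the script.

```r
# Toy stand-in for DF.nona$W (hypothetical data, not the voting dataset):
# 20 control units and 4 treated units.
W <- rep(c(0, 1), times = c(20, 4))

# Hypothetical helper that mirrors the subsampling line in parse_data.R:
# draw 3/2 as many controls as there are treated units, keep every
# treated unit, then shuffle the combined indices.
subsample_idx <- function(W, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  n_control <- sum(W) * 3 / 2
  sample(c(sample(which(W == 0), n_control), which(W == 1)))
}

# With the same seed, two calls return the same indices in the same order.
idx_a <- subsample_idx(W, seed = 1)
idx_b <- subsample_idx(W, seed = 1)
print(identical(idx_a, idx_b))

# Without a seed, the selected controls generally change between calls.
idx_c <- subsample_idx(W)
idx_d <- subsample_idx(W)
print(identical(idx_c, idx_d))
```

Because every downstream result is computed on DF.subset, a different subsample is enough to shift the reported numbers.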

Additionally, I found this function useful for directly recreating the cleaned data.

make_data <- function(){
    temp_zip <- tempfile()
    temp <- tempfile()
    download.file("https://github.com/gsbDBI/ExperimentData/raw/master/Mobilization/ProcessedData/mobilization_no_unlisted.zip", temp_zip)
    unzip(zipfile = temp_zip, exdir = temp)
    data = read.csv(file.path(temp,"mobilization_no_unlisted.csv"))
    # temp is a directory, so recursive = TRUE is needed to delete it
    unlink(c(temp, temp_zip), recursive = TRUE)
    covariates = c("persons", "state", "county", "competiv",
                   "st_sen", "st_hse", "newreg", "vote98",
                   "vote00", "age", "female")
    X = data[,which(names(data) %in% covariates)]
    W = data$W
    received_treatment = data$contact
    Y = data$vote02
    
    DF = data.frame(X, Y, W)
    
    DF.nona = DF[!is.na(rowSums(DF)),]
    
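    # Seed as in the modified parse_data.R so the subsample is reproducible
    set.seed(1)
    idx.all = sample(c(sample(which(DF.nona$W == 0), sum(DF.nona$W) * 3/2), which(DF.nona$W == 1)))
    DF.subset = DF.nona[idx.all,]
    # Return the cleaned data explicitly
    DF.subset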
}

Using this modified version of parse_data.R, I successfully replicated the results, except for the boosting MSE. I think that difference comes from run-to-run variability in the boosting algorithm?

Method         Reported MSE   Replicated MSE
Boosting       0.00079        0.00123
Lasso          0.00047        0.00047
Single Lasso   0.0006         0.00061
BART           0.00409        0.00405
var(tau)       0.01615        0.016
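
To put the gaps in the table on a common scale, here is a small sketch that computes the relative difference between the reported and replicated MSE for each method. The numbers are copied from the table above; the data frame and column names are made up for illustration.

```r
# Values copied from the table above; object and column names are illustrative.
mse <- data.frame(
  method     = c("Boosting", "Lasso", "Single Lasso", "BART"),
  reported   = c(0.00079, 0.00047, 0.0006, 0.00409),
  replicated = c(0.00123, 0.00047, 0.00061, 0.00405)
)

# Relative difference of the replicated value with respect to the reported one.
mse$rel_diff <- (mse$replicated - mse$reported) / mse$reported

# Sort by absolute relative difference, largest first.
mse <- mse[order(-abs(mse$rel_diff)), ]
print(mse)
```

Boosting is the only method whose replicated MSE differs from the reported one by more than a few percent.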
