Voting study application replication issue and solution

Hello, I was trying to replicate the voting study application from your paper and found an issue in `parse_data.R` that kept my results from matching the ones reported in the paper.

The seed is not set in `parse_data.R` before `DF.nona` is subsampled, so the `sample()` calls draw a different subset on each run. Here is the file with `set.seed(1)` added at the top:

``` r
rm(list = ls())
set.seed(1)
# the full dataset is available from
# https://github.com/gsbDBI/ExperimentData/tree/master/Mobilization/ProcessedData
data = read.csv("mobilization_no_unlisted 2.csv")

# W is intent to treat
# contact is received treatment

covariates = c("persons", "state", "county", "competiv",
               "st_sen", "st_hse", "newreg", "vote98",
               "vote00", "age", "female")
X = data[,which(names(data) %in% covariates)]
W = data$W
received_treatment = data$contact
Y = data$vote02

DF = data.frame(X, Y, W)

DF.nona = DF[!is.na(rowSums(DF)),]

idx.all = sample(c(sample(which(DF.nona$W == 0), sum(DF.nona$W) * 3/2), which(DF.nona$W == 1)))
DF.subset = DF.nona[idx.all,]

write.csv(DF.subset, "data_clean.csv", row.names = FALSE)
```
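To illustrate why the seed matters here, this is a minimal sketch on toy data. The `toy` data frame and the `subsample_idx` helper are hypothetical stand-ins, not part of `parse_data.R`, and `floor()` is my own addition to make the control sample size an explicit integer.

``` r
# Toy stand-in for DF.nona: 1000 rows, roughly 20% treated (hypothetical data)
set.seed(42)
toy <- data.frame(W = rbinom(1000, 1, 0.2))

# Same scheme as the idx.all line: keep every treated row plus 1.5x as many
# randomly chosen controls, then shuffle the combined index
subsample_idx <- function(df) {
  n_treated <- sum(df$W)
  controls <- sample(which(df$W == 0), floor(n_treated * 3 / 2))
  sample(c(controls, which(df$W == 1)))
}

set.seed(1)
idx_a <- subsample_idx(toy)
set.seed(1)
idx_b <- subsample_idx(toy)   # reseeded: repeats the same draw
idx_c <- subsample_idx(toy)   # not reseeded: continues the RNG stream

identical(idx_a, idx_b)
identical(idx_a, idx_c)
```

The first comparison holds because both draws start from the same RNG state. The second draw without reseeding picks a different subset, which is what happens across sessions when the script never calls `set.seed()`.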

Additionally, I found this function useful for directly recreating the cleaned data.

``` r
# Downloads the raw data and rebuilds the cleaned subsample.
# Call set.seed() before make_data() to get a reproducible subsample.
make_data <- function(){
    temp_zip <- tempfile()
    temp <- tempfile()
    download.file("https://github.com/gsbDBI/ExperimentData/raw/master/Mobilization/ProcessedData/mobilization_no_unlisted.zip", temp_zip)
    unzip(zipfile = temp_zip, exdir = temp)
    data = read.csv(file.path(temp,"mobilization_no_unlisted.csv"))
    # temp is a directory, so it is only removed with recursive = TRUE
    unlink(c(temp, temp_zip), recursive = TRUE)
    covariates = c("persons", "state", "county", "competiv",
                   "st_sen", "st_hse", "newreg", "vote98",
                   "vote00", "age", "female")
    X = data[,which(names(data) %in% covariates)]
    W = data$W          # intent to treat
    Y = data$vote02
    
    DF = data.frame(X, Y, W)
    
    # drop rows with any missing value
    DF.nona = DF[!is.na(rowSums(DF)),]
    
    # all treated rows plus 1.5x as many sampled controls, in shuffled order
    idx.all = sample(c(sample(which(DF.nona$W == 0), sum(DF.nona$W) * 3/2), which(DF.nona$W == 1)))
    DF.subset = DF.nona[idx.all,]
    DF.subset           # return the result visibly
}
```

Using this modified version of `parse_data.R`, I replicated all of the results except the boosting MSE. I suspect that discrepancy comes from randomness in the boosting algorithm itself, but I am not sure.

| Method       | Reported MSE | Replicated MSE |
| ------------ | ------------ | -------------- |
| Boosting     | 0.00079      | 0.00123        |
| Lasso        | 0.00047      | 0.00047        |
| Single Lasso | 0.0006       | 0.00061        |
| BART         | 0.00409      | 0.00405        |
| var(tau)     | 0.01615      | 0.016          |
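
To put the size of each gap in perspective, here is a small sketch that computes the relative difference between the reported and replicated values in the table above. The `results` data frame and the `rel_diff` column are names I made up for illustration.

``` r
# Values copied from the table above
results <- data.frame(
  method     = c("Boosting", "Lasso", "Single Lasso", "BART", "var(tau)"),
  reported   = c(0.00079, 0.00047, 0.0006, 0.00409, 0.01615),
  replicated = c(0.00123, 0.00047, 0.00061, 0.00405, 0.016)
)

# Relative difference of the replicated value from the reported one
results$rel_diff <- (results$replicated - results$reported) / results$reported

# Largest discrepancies first
print(results[order(-abs(results$rel_diff)), ], digits = 3)
```

The boosting MSE is about 56% higher than the reported value, while every other row is within 2%.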


