-
Notifications
You must be signed in to change notification settings - Fork 14
Description
Hi optmatch team,
I'm always working on improving MatchIt and found an optimization that might be of interest for you. It is based on the algorithm described in Sävje (2020, Sec 2.3). It relies on the idea that for a 1:1 optimal pair match on a vector score, among the set of optimal solutions that have the same objective value, there will be at least one that has no "crossings", which are described in the paper. By preventing crossing prior to matching, we can dramatically reduce the problem size and speed up the matching, while still yielding a solution as optimal as one that would have been found without this restriction. In that sense, the restriction comes almost for free, similar to setting a caliper value larger than any pair distance in the final solution.
I coded the restriction in Rcpp and in preliminary testing have found it helps while ensuring a truly optimal solution is found. However, due to imprecision based on the numerical tolerance, it might be slightly better or worse than the solution found without it, though that difference goes away with small values of tol.
It works by processing the score vector and treatment status and produces a vector of indices that correspond to the indices in the distance matrix that should be retained for the matching (i.e., not pruned). This makes it work well with the InfinitySparseMatrix class because it corresponds to the subset of the matrix distance matrix that should have non-Inf values, i.e., that would be supplied to discardOthers(). I think an ideal implementation in optmatch would involve running this algorithm when a call to pairmatch() is called with ratio = 1 and the supplied first argument is a numeric vector (or one of the argument types that resolves to a vector, like a glm object). This code could be run prior to match_on() to prevent the calculation of the full distance matrix and instead immediately produce a sparse distance matrix, similar to using the within argument. You might consider adding a preprocess argument to allow the user to select whether they want to implement this (I don't see why they wouldn't except for backward compatibility for reproducing prior results). I believe it would need to be run separately for each subproblem, so perhaps it would only be called after splitting or only when there is no division into subproblems.
The code is below. I provide it with no license, so feel free to use and modify it with no restrictions and no need for attribution. Sorry for not writing comments; let me know if you want some help interpreting it.
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
//Preprocess by pruning unnecessary edges as in Savje (2020) https://doi.org/10.1214/19-STS699.
//Returns a vector of matrix indices for n1xn0 distance matrix.
// [[Rcpp::export]]
IntegerVector preprocess_matchC(IntegerVector t,
NumericVector p) {
R_xlen_t n = t.size();
int n1 = sum(t == 1);
int n0 = n - n1;
int i, j;
Function ord("order");
IntegerVector o = ord(p);
o = o - 1; //location of each unit after sorting
IntegerVector im(n);
int i0 = 0;
int i1 = 0;
for (j = 0; j < n; j++) {
if (t[j] == 0) {
im[j] = i0;
i0++;
}
else {
im[j] = i1;
i1++;
}
}
IntegerVector a(n0), b(n0);
std::vector<int> queue;
queue.reserve(n1);
int k0 = 0;
int ci = 0;
for (i = 0; i < n; i++) {
if (t[o[i]] == 1) {
queue.push_back(i);
continue;
}
if (queue.size() > k0) {
a[ci] = queue[k0];
k0++;
}
else {
a[ci] = i;
}
ci++;
}
k0 = 0;
ci = n0 - 1;
queue.clear();
for (i = n - 1; i >= 0; i--) {
if (t[o[i]] == 1) {
queue.push_back(i);
continue;
}
if (queue.size() > k0) {
b[ci] = queue[k0];
k0++;
}
else {
b[ci] = i;
}
ci--;
}
std::vector<int> keep;
keep.reserve(n1 * n0);
ci = 0;
for (i = 0; i < n; i++) {
if (t[o[i]] == 1) {
continue;
}
for (j = a[ci]; j <= b[ci]; j++) {
if (t[o[j]] == 0) {
continue;
}
keep.push_back(im[o[j]] + n1 * im[o[i]]);
}
ci++;
}
IntegerVector out(keep.size());
for (i = 0; i < keep.size(); i++) {
out[i] = keep[i] + 1;
}
return out;
}
Noah