pairmatch.numeric enhancement following Greifer, Savje

Hi `optmatch` team,

I'm always working on improving `MatchIt` and found an optimization that might be of interest for you. It is based on the algorithm described in [Sävje (2020, Sec 2.3)](https://doi.org/10.1214/19-STS739). It relies on the idea that for a 1:1 optimal pair match on a vector score, among the set of optimal solutions that have the same objective value, there will be at least one that has no "crossings", which are described in the paper. By preventing crossing prior to matching, we can dramatically reduce the problem size and speed up the matching, while still yielding a solution as optimal as one that would have been found without this restriction. In that sense, the restriction comes almost for free, similar to setting a caliper value larger than any pair distance in the final solution.

I coded the restriction in `Rcpp` and in preliminary testing have found it helps while ensuring a truly optimal solution is found. However, due to imprecision based on the numerical tolerance, it might be slightly better or worse than the solution found without it, though that difference goes away with small values of `tol`.

It works by processing the score vector and treatment status and produces a vector of indices that correspond to the indices in the distance matrix that should be retained for the matching (i.e., not pruned). This makes it work well with the `InfinitySparseMatrix` class because it corresponds to the subset of the matrix distance matrix that should have non-`Inf` values, i.e., that would be supplied to `discardOthers()`. I think an ideal implementation in `optmatch` would involve running this algorithm when a call to `pairmatch()` is called with `ratio = 1` and the supplied first argument is a numeric vector (or one of the argument types that resolves to a vector, like a `glm` object). This code could be run prior to `match_on()` to prevent the calculation of the full distance matrix and instead immediately produce a sparse distance matrix, similar to using the `within` argument. You might consider adding a `preprocess` argument to allow the user to select whether they want to implement this (I don't see why they wouldn't except for backward compatibility for reproducing prior results). I believe it would need to be run separately for each subproblem, so perhaps it would only be called after splitting or only when there is no division into subproblems.

The code is below. I provide it with no license, so feel free to use and modify it with no restrictions and no need for attribution. Sorry for not writing comments; let me know if you want some help interpreting it.

```
using namespace Rcpp;

// [[Rcpp::plugins(cpp11)]]

//Preprocess by pruning unnecessary edges as in Savje (2020) https://doi.org/10.1214/19-STS699.
//Returns a vector of matrix indices for n1xn0 distance matrix.

// [[Rcpp::export]]
IntegerVector preprocess_matchC(IntegerVector t,
                                NumericVector p) {
  R_xlen_t n = t.size();
  int n1 = sum(t == 1);
  int n0 = n - n1;

  int i, j;

  Function ord("order");

  IntegerVector o = ord(p);
  o = o - 1; //location of each unit after sorting

  IntegerVector im(n);
  int i0 = 0;
  int i1 = 0;
  for (j = 0; j < n; j++) {
    if (t[j] == 0) {
      im[j] = i0;
      i0++;
    }
    else {
      im[j] = i1;
      i1++;
    }
  }

  IntegerVector a(n0), b(n0);

  std::vector<int> queue;
  queue.reserve(n1);

  int k0 = 0;
  int ci = 0;

  for (i = 0; i < n; i++) {
    if (t[o[i]] == 1) {
      queue.push_back(i);
      continue;
    }

    if (queue.size() > k0) {
      a[ci] = queue[k0];
      k0++;
    }
    else {
      a[ci] = i;
    }

    ci++;
  }

  k0 = 0;
  ci = n0 - 1;

  queue.clear();

  for (i = n - 1; i >= 0; i--) {
    if (t[o[i]] == 1) {
      queue.push_back(i);
      continue;
    }

    if (queue.size() > k0) {
      b[ci] = queue[k0];
      k0++;
    }
    else {
      b[ci] = i;
    }

    ci--;
  }

  std::vector<int> keep;
  keep.reserve(n1 * n0);

  ci = 0;

  for (i = 0; i < n; i++) {
    if (t[o[i]] == 1) {
      continue;
    }

    for (j = a[ci]; j <= b[ci]; j++) {
      if (t[o[j]] == 0) {
        continue;
      }

      keep.push_back(im[o[j]] + n1 * im[o[i]]);
    }

    ci++;
  }

  IntegerVector out(keep.size());
  for (i = 0; i < keep.size(); i++) {
    out[i] = keep[i] + 1;
  }

  return out;
}
```

Noah

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pairmatch.numeric enhancement following Greifer, Savje #243

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

pairmatch.numeric enhancement following Greifer, Savje #243

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions