ldiversity() underestimates distinct l-diversity when quasi-identifiers contain missing values #363

@MuellerRoman

Dear sdcMicro maintainers,

When computing distinct l-diversity with ldiversity() in sdcMicro, equivalence classes in which one or more quasi-identifiers contain missing values appear to receive l-diversity values that are too low.

Minimal reproducible example:

library(sdcMicro)

## create test data
data <- data.frame(
    sex = c(
        "female","female","female",   # EC1 (problematic)
        "male","male",                # EC2 (ok)
        "female","female"             # EC3 (ok)
    ),
    occupation = c(
        NA, NA, NA,                   # EC1: missing QI
        "teacher","teacher",          # EC2
        "nurse","nurse"               # EC3
    ),
    ethnicity = c(
        "other","other","other",       # EC1
        "other","other",               # EC2
        "majority","majority"          # EC3
    ),
    sensitive = c(
        1, 1, 0,                       # EC1 → two distinct values
        1, 0,                          # EC2 → two distinct values
        0, 1                           # EC3 → two distinct values
    )
)

# quasi-identifier variables
qi_vars   <- c("sex", "occupation", "ethnicity")

# create sdc object
sdcObj <- createSdcObj(data,
                       keyVars = qi_vars,
                       sensibleVar = "sensitive")

# compute l-diversity
ldiv_res <- ldiversity(sdcObj)

# extract l-diversity values
ldiv_res <- head(ldiv_res@risk$ldiversity, nrow(data))

# join quasi-identifier information
ldiv_res <- cbind(data, ldiv_res)
print(ldiv_res[, 1:5])

Output:

     sex occupation ethnicity sensitive sensitive_Distinct_Ldiversity
1 female       <NA>     other         1                             1
2 female       <NA>     other         1                             1
3 female       <NA>     other         0                             1
4   male    teacher     other         1                             2
5   male    teacher     other         0                             2
6 female      nurse  majority         0                             2
7 female      nurse  majority         1                             2

Records with sex = "female", occupation = NA, ethnicity = "other" take the sensitive values 1 and 0 (i.e., two distinct values), so the expected distinct l-diversity for this group is $l = 2$. However, ldiversity() reports $l = 1$.
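For comparison, the expected values can be cross-checked in base R without sdcMicro. This is only a minimal sketch: it assumes that records with identical quasi-identifier values form one equivalence class and that NA counts as its own category. That is my reading of the expected behaviour, not necessarily how sdcMicro defines groups internally. The names ec_key and expected_l are made up for this illustration.

```r
# Same toy data as in the example above
data <- data.frame(
    sex        = c("female", "female", "female", "male", "male", "female", "female"),
    occupation = c(NA, NA, NA, "teacher", "teacher", "nurse", "nurse"),
    ethnicity  = c("other", "other", "other", "other", "other", "majority", "majority"),
    sensitive  = c(1, 1, 0, 1, 0, 0, 1)
)
qi_vars <- c("sex", "occupation", "ethnicity")

# One key per record; paste() renders NA as the string "NA", so missing
# values end up in their own category (an assumption of this sketch, and
# it would collide with a literal "NA" category if one existed)
ec_key <- do.call(paste, c(data[qi_vars], sep = "|"))

# Number of distinct sensitive values within each equivalence class
expected_l <- ave(data$sensitive, ec_key, FUN = function(x) length(unique(x)))

print(cbind(data, expected_l))
```

With this grouping, all seven records, including the three with occupation = NA, get expected_l = 2.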

Suspected location of the problem: Measure_Risk.h, line 577 and following.

Thanks!
