Dear sdcMicro maintainers,
When computing distinct l-diversity using ldiversity() in sdcMicro, groups where one or more quasi-identifiers contain missing values seem to yield incorrect l-diversity values.
Minimal reproducible example:
library(sdcMicro)
## create test data
data <- data.frame(
  sex = c(
    "female", "female", "female",  # EC1 (problematic)
    "male", "male",                # EC2 (ok)
    "female", "female"             # EC3 (ok)
  ),
  occupation = c(
    NA, NA, NA,                    # EC1: missing QI
    "teacher", "teacher",          # EC2
    "nurse", "nurse"               # EC3
  ),
  ethnicity = c(
    "other", "other", "other",     # EC1
    "other", "other",              # EC2
    "majority", "majority"         # EC3
  ),
  sensitive = c(
    1, 1, 0,                       # EC1 → two distinct values
    1, 0,                          # EC2 → two distinct values
    0, 1                           # EC3 → two distinct values
  )
)
# quasi-identifier variables
qi_vars <- c("sex", "occupation", "ethnicity")
# create sdc object
sdcObj <- createSdcObj(data,
                       keyVars = qi_vars,
                       sensibleVar = "sensitive")
# compute l-diversity
ldiv_res <- ldiversity(sdcObj)
# extract l-diversity values
ldiv_res <- head(ldiv_res@risk$ldiversity, nrow(data))
# join quasi-identifier information
ldiv_res <- cbind(data, ldiv_res)
print(ldiv_res[, 1:5])
     sex occupation ethnicity sensitive sensitive_Distinct_Ldiversity
1 female       <NA>     other         1                             1
2 female       <NA>     other         1                             1
3 female       <NA>     other         0                             1
4   male    teacher     other         1                             2
5   male    teacher     other         0                             2
6 female      nurse  majority         0                             2
7 female      nurse  majority         1                             2
Individuals with sex = "female", occupation = NA, ethnicity = "other" have sensitive value 1 or 0 (i.e., two distinct values). However, according to the ldiversity() output, the l-diversity for this group is 1, whereas I would expect 2.
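As a cross-check, the expected values can be computed by hand in base R. This is only a minimal sketch: it assumes that a missing quasi-identifier is treated as its own category when forming equivalence classes, which may differ from how sdcMicro handles missing values internally. The names `group_key` and `expected_l` are illustrative and not part of sdcMicro.

```r
# Same test data as in the example above, repeated so the snippet runs on its own
data <- data.frame(
  sex        = c("female", "female", "female", "male", "male", "female", "female"),
  occupation = c(NA, NA, NA, "teacher", "teacher", "nurse", "nurse"),
  ethnicity  = c("other", "other", "other", "other", "other", "majority", "majority"),
  sensitive  = c(1, 1, 0, 1, 0, 0, 1),
  stringsAsFactors = FALSE
)
qi_vars <- c("sex", "occupation", "ethnicity")

# Assumption: NA in a quasi-identifier counts as a separate category,
# so it is replaced by a placeholder string before building the group key
key_parts <- lapply(data[qi_vars], function(x) ifelse(is.na(x), "<NA>", as.character(x)))
group_key <- do.call(paste, c(unname(key_parts), sep = "|"))

# Distinct l-diversity: number of distinct sensitive values per equivalence class,
# repeated for every row of that class
expected_l <- ave(data$sensitive, group_key, FUN = function(v) length(unique(v)))

print(cbind(data, expected_l))
```

Under this assumption every row, including the three rows of EC1, gets a distinct l-diversity of 2.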
Suspected location of the problem: Measure_Risk.h, line 577 and following.
Thanks!