From 95b71878f65687c24a918a48269a483198014dbb Mon Sep 17 00:00:00 2001
From: "Felix.S"
Date: Tue, 28 Oct 2025 08:39:44 +0100
Subject: [PATCH] Felix review

---
 generation/index.qmd |  17 +++++-
 index.qmd            |   2 +-
 renv.lock            | 137 +++++++++++++++++++++----------------------
 sdc/index.qmd        |  50 ++++++++++++----
 synthetic/index.qmd  |  41 ++++++++-----
 5 files changed, 146 insertions(+), 101 deletions(-)

diff --git a/generation/index.qmd b/generation/index.qmd
index 85ea4ad..8076889 100644
--- a/generation/index.qmd
+++ b/generation/index.qmd
@@ -8,7 +8,16 @@ Throughout, we will use the `synthpop` package [@synthpop], which is a powerful
 Other alternatives to create synthetic data are, for example, the `R`-package `mice` [@mice; @volker_anonymiced_2021] or the stand-alone software `IVEware` [@iveware].
 Additionally, we will use the package `densityratio` [@densityratio] to evaluate the utility of synthetic data.

-Make sure to load all of the required packages, and in case you haven't installed them already, install them first, using `install.packages("package_name")`.
+Make sure to load all of the required packages, and in case you haven't installed them already, install them first:
+
+```{r}
+#| label: install-packages
+#| eval: false
+
+install.packages("synthpop")
+install.packages("densityratio")
+install.packages("mvtnorm")
+```

 ---

@@ -99,7 +108,9 @@ __3. Use the `summary()` function to get an overview of the data.__
 summary(data)
 ```

-You may notice a couple of things. First, the data seems to be sorted by age. You may verify this by running `!is.unsorted(data$age)`. Second, you may notice that most variables are non-negative, which might be something you want to take into account when modelling the data (but perhaps, this is not so relevant for the analysis at hand; for now, we assume it is).
+You may notice a couple of things. First, the data seems to be sorted by age. You may verify this by running `!is.unsorted(data$age)`.
+
+Second, you may notice that most variables are non-negative, which might be something you want to take into account when modelling the data (but perhaps, this is not so relevant for the analysis at hand; for now, we assume it is).
 Third, you may notice that the variable `bmi` is mathematically linked to the variables `hgt` and `wgt`. We want to take this into account when modelling the data.
 Finally, note that the data consist of a mix of continuous and categorical variables.

@@ -117,7 +128,7 @@ The other arguments allow to specify various modelling choices, for example whic

 ::: {.callout-tip title = "Modelling choices in synthpop"}

-Some of the most important arguments of the `syn()` function are the following (you can use `?synthpop::syn()` for a more exaustive list). The `syn()` function will take these modelling choices into account when modelling each synthetic variable, and the resulting synthetic data, called `syn` in the output list, adheres to these specifications.
+Some of the most important arguments of the `syn()` function are the following (you can use `?synthpop::syn()` for a more exhaustive list). The `syn()` function will take these modelling choices into account when modelling each synthetic variable, and the resulting synthetic data, called `syn` in the output list, adheres to these specifications.

 ### `method`

diff --git a/index.qmd b/index.qmd
index 61c2c30..66b72e1 100644
--- a/index.qmd
+++ b/index.qmd
@@ -9,7 +9,7 @@ This **self-paced tutorial** will introduce you to the generation and evaluation
 Synthetic data is generated data that can be used as an alternative to privacy-sensitive data, for example to enhance open science practices.
 Advantages of open (synthetic) data are numerous: other researchers can re-run analyses with data that is close to the actual data, which allows them to verify the main results.
Additionally, open (synthetic) data allows researchers to perform exploratory analyses that may lead to novel hypotheses, and in quite some instances performing such analyses with synthetic data yields rather accurate results. -Moreover, realistic synthetic can be used in teaching, or for starting with model building when access to the real data is currently still prohibited. +Moreover, realistic synthetic data can be used in teaching, or for starting with model building when access to the real data is currently still prohibited. All in all, synthetic data makes open science practices easier and might spark collaborations with potential data users. diff --git a/renv.lock b/renv.lock index 5ea8f9c..95a5980 100644 --- a/renv.lock +++ b/renv.lock @@ -113,7 +113,7 @@ "NeedsCompilation": "yes", "Author": "Douglas Bates [aut] (ORCID: ), Martin Maechler [aut, cre] (ORCID: ), Mikael Jagan [aut] (ORCID: ), Timothy A. Davis [ctb] (ORCID: , SuiteSparse libraries, collaborators listed in dir(system.file(\"doc\", \"SuiteSparse\", package=\"Matrix\"), pattern=\"License\", full.names=TRUE, recursive=TRUE)), George Karypis [ctb] (ORCID: , METIS library, Copyright: Regents of the University of Minnesota), Jason Riedy [ctb] (ORCID: , GNU Octave's condest() and onenormest(), Copyright: Regents of the University of California), Jens Oehlschlägel [ctb] (initial nearPD()), R Core Team [ctb] (ROR: , base R's matrix implementation)", "Maintainer": "Martin Maechler ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "R6": { "Package": "R6", @@ -139,7 +139,7 @@ "NeedsCompilation": "no", "Author": "Winston Chang [aut, cre], Posit Software, PBC [cph, fnd]", "Maintainer": "Winston Chang ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "RColorBrewer": { "Package": "RColorBrewer", @@ -156,7 +156,7 @@ "Description": "Provides color schemes for maps (and other graphics) designed by Cynthia Brewer as described at 
http://colorbrewer2.org.", "License": "Apache License 2.0", "NeedsCompilation": "no", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "Rcpp": { @@ -434,7 +434,7 @@ "NeedsCompilation": "no", "Author": "Torsten Hothorn [aut, cre]", "Maintainer": "Torsten Hothorn ", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "backports": { @@ -457,7 +457,7 @@ "Encoding": "UTF-8", "RoxygenNote": "7.3.1", "Author": "Michel Lang [cre, aut] (), Duncan Murdoch [aut], R Core Team [aut]", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "base64enc": { "Package": "base64enc", @@ -476,8 +476,7 @@ "License": "GPL-2 | GPL-3", "URL": "http://www.rforge.net/base64enc", "NeedsCompilation": "yes", - "Repository": "https://packagemanager.posit.co/cran/latest", - "Encoding": "UTF-8" + "Repository": "CRAN" }, "bit": { "Package": "bit", @@ -510,7 +509,7 @@ "NeedsCompilation": "yes", "Author": "Michael Chirico [aut, cre], Jens Oehlschlägel [aut], Brian Ripley [ctb]", "Maintainer": "Michael Chirico ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "bit64": { "Package": "bit64", @@ -544,7 +543,7 @@ "NeedsCompilation": "yes", "Author": "Michael Chirico [aut, cre], Jens Oehlschlägel [aut], Leonardo Silvestri [ctb], Ofek Shilon [ctb]", "Maintainer": "Michael Chirico ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "boot": { "Package": "boot", @@ -789,7 +788,7 @@ "NeedsCompilation": "no", "Author": "Carson Sievert [aut, cre] (), Joe Cheng [aut], Garrick Aden-Buie [aut] (), Posit Software, PBC [cph, fnd], Bootstrap contributors [ctb] (Bootstrap library), Twitter, Inc [cph] (Bootstrap library), Javi Aguilar [ctb, cph] (Bootstrap colorpicker library), Thomas Park [ctb, cph] (Bootswatch library), PayPal [ctb, cph] (Bootstrap accessibility plugin)", "Maintainer": 
"Carson Sievert ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "cachem": { "Package": "cachem", @@ -815,7 +814,7 @@ "NeedsCompilation": "yes", "Author": "Winston Chang [aut, cre], Posit Software, PBC [cph, fnd]", "Maintainer": "Winston Chang ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "class": { "Package": "class", @@ -924,7 +923,7 @@ "NeedsCompilation": "yes", "Author": "Gábor Csárdi [aut, cre], Hadley Wickham [ctb], Kirill Müller [ctb], Salim Brüggemann [ctb] (), Posit Software, PBC [cph, fnd]", "Maintainer": "Gábor Csárdi ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "clipr": { "Package": "clipr", @@ -955,7 +954,7 @@ "NeedsCompilation": "no", "Author": "Matthew Lincoln [aut, cre] (), Louis Maddox [ctb], Steve Simpson [ctb], Jennifer Bryan [ctb]", "Maintainer": "Matthew Lincoln ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "cmm": { "Package": "cmm", @@ -1031,7 +1030,7 @@ "URL": "http://coin.r-forge.r-project.org", "Author": "Torsten Hothorn [aut, cre] (), Henric Winell [aut] (), Kurt Hornik [aut] (), Mark A. 
van de Wiel [aut] (), Achim Zeileis [aut] ()", "Maintainer": "Torsten Hothorn ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "cpp11": { "Package": "cpp11", @@ -1078,7 +1077,7 @@ "NeedsCompilation": "no", "Author": "Davis Vaughan [aut, cre] (), Jim Hester [aut] (), Romain François [aut] (), Benjamin Kietzman [ctb], Posit Software, PBC [cph, fnd]", "Maintainer": "Davis Vaughan ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "crayon": { "Package": "crayon", @@ -1108,7 +1107,7 @@ "NeedsCompilation": "no", "Author": "Gábor Csárdi [aut, cre], Brodie Gaslam [ctb], Posit Software, PBC [cph, fnd]", "Maintainer": "Gábor Csárdi ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "densityratio": { "Package": "densityratio", @@ -1178,7 +1177,7 @@ "NeedsCompilation": "yes", "Author": "Dirk Eddelbuettel [aut, cre] (), Antoine Lucas [ctb], Jarek Tuszynski [ctb], Henrik Bengtsson [ctb] (), Simon Urbanek [ctb] (), Mario Frasca [ctb], Bryan Lewis [ctb], Murray Stokely [ctb], Hannes Muehleisen [ctb], Duncan Murdoch [ctb], Jim Hester [ctb], Wush Wu [ctb] (), Qiang Kou [ctb] (), Thierry Onkelinx [ctb] (), Michel Lang [ctb] (), Viliam Simko [ctb], Kurt Hornik [ctb] (), Radford Neal [ctb] (), Kendon Bell [ctb] (), Matthew de Queljoe [ctb], Dmitry Selivanov [ctb], Ion Suruceanu [ctb], Bill Denney [ctb], Dirk Schumacher [ctb], András Svraka [ctb], Sergey Fedorov [ctb], Will Landau [ctb] (), Floris Vanderhaeghe [ctb] (), Kevin Tappe [ctb], Harris McGehee [ctb], Tim Mastny [ctb], Aaron Peikert [ctb] (), Mark van der Loo [ctb] (), Chris Muir [ctb] (), Moritz Beller [ctb] (), Sebastian Campbell [ctb], Winston Chang [ctb] (), Dean Attali [ctb] (), Michael Chirico [ctb] (), Kevin Ushey [ctb]", "Maintainer": "Dirk Eddelbuettel ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "dplyr": { "Package": "dplyr", @@ -1241,7 +1240,7 @@ 
"NeedsCompilation": "yes", "Author": "Hadley Wickham [aut, cre] (), Romain François [aut] (), Lionel Henry [aut], Kirill Müller [aut] (), Davis Vaughan [aut] (), Posit Software, PBC [cph, fnd]", "Maintainer": "Hadley Wickham ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "e1071": { "Package": "e1071", @@ -1312,7 +1311,7 @@ "NeedsCompilation": "no", "Author": "Hadley Wickham [aut, cre], Yihui Xie [aut] (ORCID: ), Michael Lawrence [ctb], Thomas Kluyver [ctb], Jeroen Ooms [ctb], Barret Schloerke [ctb], Adam Ryczkowski [ctb], Hiroaki Yutani [ctb], Michel Lang [ctb], Karolis Koncevičius [ctb], Posit Software, PBC [cph, fnd]", "Maintainer": "Hadley Wickham ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "farver": { "Package": "farver", @@ -1335,7 +1334,7 @@ "NeedsCompilation": "yes", "Author": "Thomas Lin Pedersen [cre, aut] (), Berendea Nicolae [aut] (Author of the ColorSpace C++ library), Romain François [aut] (), Posit, PBC [cph, fnd]", "Maintainer": "Thomas Lin Pedersen ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "fastmap": { "Package": "fastmap", @@ -1355,7 +1354,7 @@ "NeedsCompilation": "yes", "Author": "Winston Chang [aut, cre], Posit Software, PBC [cph, fnd], Tessil [cph] (hopscotch_map library)", "Maintainer": "Winston Chang ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "fontawesome": { "Package": "fontawesome", @@ -1390,7 +1389,7 @@ "NeedsCompilation": "no", "Author": "Richard Iannone [aut, cre] (), Christophe Dervieux [ctb] (), Winston Chang [ctb], Dave Gandy [ctb, cph] (Font-Awesome font), Posit Software, PBC [cph, fnd]", "Maintainer": "Richard Iannone ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "forcats": { "Package": "forcats", @@ -1539,7 +1538,7 @@ "NeedsCompilation": "yes", "Author": "Jim Hester [aut], Hadley Wickham [aut], Gábor 
Csárdi [aut, cre], libuv project contributors [cph] (libuv library), Joyent, Inc. and other Node contributors [cph] (libuv library), Posit Software, PBC [cph, fnd]", "Maintainer": "Gábor Csárdi ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "future": { "Package": "future", @@ -1647,7 +1646,7 @@ "NeedsCompilation": "no", "Author": "Hadley Wickham [aut, cre] (ORCID: ), Max Kuhn [aut], Davis Vaughan [aut], Posit Software, PBC [cph, fnd] (ROR: )", "Maintainer": "Hadley Wickham ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "ggh4x": { "Package": "ggh4x", @@ -1874,7 +1873,7 @@ "NeedsCompilation": "yes", "Author": "Jim Hester [aut] (), Jennifer Bryan [aut, cre] (), Posit Software, PBC [cph, fnd]", "Maintainer": "Jennifer Bryan ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "gtable": { "Package": "gtable", @@ -1914,7 +1913,7 @@ "NeedsCompilation": "no", "Author": "Hadley Wickham [aut], Thomas Lin Pedersen [aut, cre], Posit Software, PBC [cph, fnd]", "Maintainer": "Thomas Lin Pedersen ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "haven": { "Package": "haven", @@ -1963,7 +1962,7 @@ "NeedsCompilation": "yes", "Author": "Hadley Wickham [aut, cre], Evan Miller [aut, cph] (Author of included ReadStat code), Danny Smith [aut], Posit Software, PBC [cph, fnd]", "Maintainer": "Hadley Wickham ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "highr": { "Package": "highr", @@ -1993,7 +1992,7 @@ "NeedsCompilation": "no", "Author": "Yihui Xie [aut, cre] (), Yixuan Qiu [aut], Christopher Gandrud [ctb], Qiang Li [ctb]", "Maintainer": "Yihui Xie ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "hms": { "Package": "hms", @@ -2070,7 +2069,7 @@ "NeedsCompilation": "yes", "Author": "Joe Cheng [aut], Carson Sievert [aut, cre] (), Barret 
Schloerke [aut] (), Winston Chang [aut] (), Yihui Xie [aut], Jeff Allen [aut], Posit Software, PBC [cph, fnd]", "Maintainer": "Carson Sievert ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "isoband": { "Package": "isoband", @@ -2106,7 +2105,7 @@ "NeedsCompilation": "yes", "Author": "Hadley Wickham [aut, cre] (), Claus O. Wilke [aut] (Original author, ), Thomas Lin Pedersen [aut] ()", "Maintainer": "Hadley Wickham ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "iterators": { "Package": "iterators", @@ -2177,7 +2176,7 @@ "NeedsCompilation": "no", "Author": "Carson Sievert [aut, cre] (), Joe Cheng [aut], RStudio [cph], jQuery Foundation [cph] (jQuery library and jQuery UI library), jQuery contributors [ctb, cph] (jQuery library; authors listed in inst/lib/jquery-AUTHORS.txt)", "Maintainer": "Carson Sievert ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "jsonlite": { "Package": "jsonlite", @@ -2207,7 +2206,7 @@ "Encoding": "UTF-8", "NeedsCompilation": "yes", "Author": "Jeroen Ooms [aut, cre] (), Duncan Temple Lang [ctb], Lloyd Hilaiel [cph] (author of bundled libyajl)", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "kableExtra": { "Package": "kableExtra", @@ -2320,7 +2319,7 @@ "NeedsCompilation": "no", "Author": "Yihui Xie [aut, cre] (, https://yihui.org), Abhraneel Sarma [ctb], Adam Vogt [ctb], Alastair Andrew [ctb], Alex Zvoleff [ctb], Amar Al-Zubaidi [ctb], Andre Simon [ctb] (the CSS files under inst/themes/ were derived from the Highlight package http://www.andre-simon.de), Aron Atkins [ctb], Aaron Wolen [ctb], Ashley Manton [ctb], Atsushi Yasumoto [ctb] (), Ben Baumer [ctb], Brian Diggs [ctb], Brian Zhang [ctb], Bulat Yapparov [ctb], Cassio Pereira [ctb], Christophe Dervieux [ctb], David Hall [ctb], David Hugh-Jones [ctb], David Robinson [ctb], Doug Hemken [ctb], Duncan Murdoch [ctb], Elio 
Campitelli [ctb], Ellis Hughes [ctb], Emily Riederer [ctb], Fabian Hirschmann [ctb], Fitch Simeon [ctb], Forest Fang [ctb], Frank E Harrell Jr [ctb] (the Sweavel package at inst/misc/Sweavel.sty), Garrick Aden-Buie [ctb], Gregoire Detrez [ctb], Hadley Wickham [ctb], Hao Zhu [ctb], Heewon Jeon [ctb], Henrik Bengtsson [ctb], Hiroaki Yutani [ctb], Ian Lyttle [ctb], Hodges Daniel [ctb], Jacob Bien [ctb], Jake Burkhead [ctb], James Manton [ctb], Jared Lander [ctb], Jason Punyon [ctb], Javier Luraschi [ctb], Jeff Arnold [ctb], Jenny Bryan [ctb], Jeremy Ashkenas [ctb, cph] (the CSS file at inst/misc/docco-classic.css), Jeremy Stephens [ctb], Jim Hester [ctb], Joe Cheng [ctb], Johannes Ranke [ctb], John Honaker [ctb], John Muschelli [ctb], Jonathan Keane [ctb], JJ Allaire [ctb], Johan Toloe [ctb], Jonathan Sidi [ctb], Joseph Larmarange [ctb], Julien Barnier [ctb], Kaiyin Zhong [ctb], Kamil Slowikowski [ctb], Karl Forner [ctb], Kevin K. Smith [ctb], Kirill Mueller [ctb], Kohske Takahashi [ctb], Lorenz Walthert [ctb], Lucas Gallindo [ctb], Marius Hofert [ctb], Martin Modrák [ctb], Michael Chirico [ctb], Michael Friendly [ctb], Michal Bojanowski [ctb], Michel Kuhlmann [ctb], Miller Patrick [ctb], Nacho Caballero [ctb], Nick Salkowski [ctb], Niels Richard Hansen [ctb], Noam Ross [ctb], Obada Mahdi [ctb], Pavel N. Krivitsky [ctb] (), Pedro Faria [ctb], Qiang Li [ctb], Ramnath Vaidyanathan [ctb], Richard Cotton [ctb], Robert Krzyzanowski [ctb], Rodrigo Copetti [ctb], Romain Francois [ctb], Ruaridh Williamson [ctb], Sagiru Mati [ctb] (), Scott Kostyshak [ctb], Sebastian Meyer [ctb], Sietse Brouwer [ctb], Simon de Bernard [ctb], Sylvain Rousseau [ctb], Taiyun Wei [ctb], Thibaut Assus [ctb], Thibaut Lamadon [ctb], Thomas Leeper [ctb], Tim Mastny [ctb], Tom Torsney-Weir [ctb], Trevor Davis [ctb], Viktoras Veitas [ctb], Weicheng Zhu [ctb], Wush Wu [ctb], Zachary Foster [ctb], Zhian N. 
Kamvar [ctb] (), Posit Software, PBC [cph, fnd]", "Maintainer": "Yihui Xie ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "labeling": { "Package": "labeling", @@ -2339,7 +2338,7 @@ "stats", "graphics" ], - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "lattice": { @@ -2379,8 +2378,7 @@ "NeedsCompilation": "yes", "Author": "Deepayan Sarkar [aut, cre] (), Felix Andrews [ctb], Kevin Wright [ctb] (documentation), Neil Klepeis [ctb], Johan Larsson [ctb] (miscellaneous improvements), Zhijian (Jason) Wen [cph] (filled contour code), Paul Murrell [ctb], Stefan Eng [ctb] (violin plot improvements), Achim Zeileis [ctb] (modern colors), Alexandre Courtiol [ctb] (generics for larrows, lpolygon, lrect and lsegments)", "Maintainer": "Deepayan Sarkar ", - "Repository": "https://packagemanager.posit.co/cran/latest", - "Encoding": "UTF-8" + "Repository": "CRAN" }, "libcoin": { "Package": "libcoin", @@ -2407,7 +2405,7 @@ "License": "GPL-2", "Author": "Torsten Hothorn [aut, cre] (), Henric Winell [aut] ()", "Maintainer": "Torsten Hothorn ", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "lifecycle": { @@ -2449,7 +2447,7 @@ "NeedsCompilation": "no", "Author": "Lionel Henry [aut, cre], Hadley Wickham [aut] (), Posit Software, PBC [cph, fnd]", "Maintainer": "Lionel Henry ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "listenv": { "Package": "listenv", @@ -2602,7 +2600,7 @@ "URL": "https://github.com/HenrikBengtsson/matrixStats", "BugReports": "https://github.com/HenrikBengtsson/matrixStats/issues", "RoxygenNote": "7.3.2", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "memoise": { @@ -2633,7 +2631,7 @@ "NeedsCompilation": "no", "Author": "Hadley Wickham [aut], Jim Hester [aut], Winston Chang [aut, cre], Kirill 
Müller [aut], Daniel Cook [aut], Mark Edmondson [ctb]", "Maintainer": "Winston Chang ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "mice": { "Package": "mice", @@ -2717,7 +2715,7 @@ "NeedsCompilation": "yes", "Author": "Yihui Xie [aut, cre] (, https://yihui.org), Jeffrey Horner [ctb], Beilei Bian [ctb]", "Maintainer": "Yihui Xie ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "minqa": { "Package": "minqa", @@ -2825,7 +2823,7 @@ "NeedsCompilation": "no", "Author": "Torsten Hothorn [aut, cre] (ORCID: ), Friedrich Leisch [aut] (ORCID: ), Achim Zeileis [aut] (ORCID: )", "Maintainer": "Torsten Hothorn ", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "multcomp": { @@ -2899,7 +2897,7 @@ "NeedsCompilation": "yes", "Author": "Alan Genz [aut], Frank Bretz [aut], Tetsuhisa Miwa [aut], Xuefei Mi [aut], Friedrich Leisch [ctb], Fabian Scheipl [ctb], Bjoern Bornkamp [ctb] (), Martin Maechler [ctb] (), Torsten Hothorn [aut, cre] ()", "Maintainer": "Torsten Hothorn ", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "nlme": { @@ -3328,7 +3326,7 @@ "BugReports": "https://github.com/r-lib/pkgconfig/issues", "Encoding": "UTF-8", "NeedsCompilation": "no", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "plyr": { "Package": "plyr", @@ -3383,7 +3381,7 @@ "License": "GPL (>= 2)", "NeedsCompilation": "yes", "Author": "Charles Kooperberg [aut, cre], Cleve Moler [ctb] (LINPACK routines in src), Jack Dongarra [ctb] (LINPACK routines in src)", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "prettyunits": { @@ -3409,7 +3407,7 @@ "NeedsCompilation": "no", "Author": "Gabor Csardi [aut, cre], Bill Denney [ctb] (), Christophe Regouby [ctb]", "Maintainer": "Gabor Csardi ", - "Repository": 
"https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "progress": { "Package": "progress", @@ -3442,7 +3440,7 @@ "NeedsCompilation": "no", "Author": "Gábor Csárdi [aut, cre], Rich FitzJohn [aut], Posit Software, PBC [cph, fnd]", "Maintainer": "Gábor Csárdi ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "proto": { "Package": "proto", @@ -3622,7 +3620,7 @@ "NeedsCompilation": "yes", "Author": "Hadley Wickham [trl, cre, cph], RStudio [cph], Sridhar Ratnakumar [aut], Trent Mick [aut], ActiveState [cph] (R/appdir.r, R/cache.r, R/data.r, R/log.r translated from appdirs), Eddy Petrisor [ctb], Trevor Davis [trl, aut], Gabor Csardi [ctb], Gregory Jefferis [ctb]", "Maintainer": "Hadley Wickham ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "rbibutils": { "Package": "rbibutils", @@ -3706,7 +3704,7 @@ "NeedsCompilation": "yes", "Author": "Hadley Wickham [aut], Jim Hester [aut], Romain Francois [ctb], Jennifer Bryan [aut, cre] (), Shelby Bearrows [ctb], Posit Software, PBC [cph, fnd], https://github.com/mandreyel/ [cph] (mio library), Jukka Jylänki [ctb, cph] (grisu3 implementation), Mikkel Jørgensen [ctb, cph] (grisu3 implementation)", "Maintainer": "Jennifer Bryan ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "reformulas": { "Package": "reformulas", @@ -3788,7 +3786,7 @@ "NeedsCompilation": "no", "Author": "Kevin Ushey [aut, cre] (ORCID: ), Hadley Wickham [aut] (ORCID: ), Posit Software, PBC [cph, fnd]", "Maintainer": "Kevin Ushey ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "rlang": { "Package": "rlang", @@ -3839,7 +3837,7 @@ "NeedsCompilation": "yes", "Author": "Lionel Henry [aut, cre], Hadley Wickham [aut], mikefc [cph] (Hash implementation based on Mike's xxhashlite), Yann Collet [cph] (Author of the embedded xxHash library), Posit, PBC [cph, fnd]", "Maintainer": "Lionel Henry 
", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "rmarkdown": { "Package": "rmarkdown", @@ -3968,7 +3966,7 @@ "Encoding": "UTF-8", "NeedsCompilation": "no", "Author": "Kevin Ushey [aut, cre], JJ Allaire [aut], Hadley Wickham [aut], Gary Ritchie [aut], RStudio [cph]", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "sandwich": { "Package": "sandwich", @@ -4010,7 +4008,7 @@ "NeedsCompilation": "no", "Author": "Achim Zeileis [aut, cre] (), Thomas Lumley [aut] (), Nathaniel Graham [ctb] (), Susanne Koell [ctb]", "Maintainer": "Achim Zeileis ", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" }, "sass": { @@ -4047,7 +4045,7 @@ "NeedsCompilation": "yes", "Author": "Joe Cheng [aut], Timothy Mastny [aut], Richard Iannone [aut] (), Barret Schloerke [aut] (), Carson Sievert [aut, cre] (), Christophe Dervieux [ctb] (), RStudio [cph, fnd], Sass Open Source Foundation [ctb, cph] (LibSass library), Greter Marcel [ctb, cph] (LibSass library), Mifsud Michael [ctb, cph] (LibSass library), Hampton Catlin [ctb, cph] (LibSass library), Natalie Weizenbaum [ctb, cph] (LibSass library), Chris Eppstein [ctb, cph] (LibSass library), Adams Joseph [ctb, cph] (json.cpp), Trifunovic Nemanja [ctb, cph] (utf8.h)", "Maintainer": "Carson Sievert ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "scales": { "Package": "scales", @@ -4091,7 +4089,7 @@ "NeedsCompilation": "no", "Author": "Hadley Wickham [aut], Thomas Lin Pedersen [cre, aut] (), Dana Seidel [aut], Posit Software, PBC [cph, fnd] (03wc8by49)", "Maintainer": "Thomas Lin Pedersen ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "shape": { "Package": "shape", @@ -4141,7 +4139,7 @@ "Author": "Marek Gagolewski [aut, cre, cph] (), Bartek Tartanus [ctb], Unicode, Inc. 
and others [ctb] (ICU4C source code, Unicode Character Database)", "Maintainer": "Marek Gagolewski ", "License_is_FOSS": "yes", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "stringr": { "Package": "stringr", @@ -4254,7 +4252,7 @@ }, "svglite": { "Package": "svglite", - "Version": "2.2.1", + "Version": "2.2.2", "Source": "Repository", "Title": "An 'SVG' Graphics Device", "Authors@R": "c( person(\"Hadley\", \"Wickham\", , \"hadley@posit.co\", role = \"aut\"), person(\"Lionel\", \"Henry\", , \"lionel@posit.co\", role = \"aut\"), person(\"Thomas Lin\", \"Pedersen\", , \"thomas.pedersen@posit.co\", role = c(\"cre\", \"aut\"), comment = c(ORCID = \"0000-0002-5147-4711\")), person(\"T Jake\", \"Luciani\", , \"jake@apache.org\", role = \"aut\"), person(\"Matthieu\", \"Decorde\", , \"matthieu.decorde@ens-lyon.fr\", role = \"aut\"), person(\"Vaudor\", \"Lise\", , \"lise.vaudor@ens-lyon.fr\", role = \"aut\"), person(\"Tony\", \"Plate\", role = \"ctb\", comment = \"Early line dashing code\"), person(\"David\", \"Gohel\", role = \"ctb\", comment = \"Line dashing code and early raster code\"), person(\"Yixuan\", \"Qiu\", role = \"ctb\", comment = \"Improved styles; polypath implementation\"), person(\"Håkon\", \"Malmedal\", role = \"ctb\", comment = \"Opacity code\"), person(\"Posit Software, PBC\", role = c(\"cph\", \"fnd\"), comment = c(ROR = \"03wc8by49\")) )", @@ -4270,7 +4268,7 @@ "cli", "lifecycle", "rlang (>= 1.1.0)", - "systemfonts (>= 1.2.3)", + "systemfonts (>= 1.3.0)", "textshaping (>= 0.3.0)" ], "Suggests": [ @@ -4547,7 +4545,7 @@ "NeedsCompilation": "yes", "Author": "Hadley Wickham [aut, cre], Davis Vaughan [aut], Maximilian Girlich [aut], Kevin Ushey [ctb], Posit Software, PBC [cph, fnd]", "Maintainer": "Hadley Wickham ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "tidyselect": { "Package": "tidyselect", @@ -4590,7 +4588,7 @@ "NeedsCompilation": "yes", "Author": "Lionel Henry 
[aut, cre], Hadley Wickham [aut], Posit Software, PBC [cph, fnd]", "Maintainer": "Lionel Henry ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "tinytex": { "Package": "tinytex", @@ -4615,7 +4613,7 @@ "NeedsCompilation": "no", "Author": "Yihui Xie [aut, cre, cph] (), Posit Software, PBC [cph, fnd], Christophe Dervieux [ctb] (), Devon Ryan [ctb] (), Ethan Heinzen [ctb], Fernando Cagua [ctb]", "Maintainer": "Yihui Xie ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "truncnorm": { "Package": "truncnorm", @@ -4667,7 +4665,7 @@ "NeedsCompilation": "yes", "Author": "Davis Vaughan [aut, cre], Howard Hinnant [cph] (Author of the included date library), Posit Software, PBC [cph, fnd]", "Maintainer": "Davis Vaughan ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "ucminf": { "Package": "ucminf", @@ -4768,7 +4766,7 @@ "NeedsCompilation": "yes", "Author": "Hadley Wickham [aut], Lionel Henry [aut], Davis Vaughan [aut, cre], data.table team [cph] (Radix sort based on data.table's forder() and their contribution to R's order()), Posit Software, PBC [cph, fnd]", "Maintainer": "Davis Vaughan ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "viridisLite": { "Package": "viridisLite", @@ -4796,7 +4794,7 @@ "RoxygenNote": "7.2.3", "NeedsCompilation": "no", "Author": "Simon Garnier [aut, cre], Noam Ross [ctb, cph], Bob Rudis [ctb, cph], Marco Sciaini [ctb, cph], Antônio Pedro Camargo [ctb, cph], Cédric Scherer [ctb, cph]", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "vroom": { "Package": "vroom", @@ -4904,7 +4902,7 @@ "NeedsCompilation": "no", "Author": "Jim Hester [aut], Lionel Henry [aut, cre], Kirill Müller [aut], Kevin Ushey [aut], Hadley Wickham [aut], Winston Chang [aut], Jennifer Bryan [ctb], Richard Cotton [ctb], Posit Software, PBC [cph, fnd]", "Maintainer": "Lionel 
Henry ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "xfun": { "Package": "xfun", @@ -4952,7 +4950,7 @@ "NeedsCompilation": "yes", "Author": "Yihui Xie [aut, cre, cph] (ORCID: , URL: https://yihui.org), Wush Wu [ctb], Daijiang Li [ctb], Xianying Tan [ctb], Salim Brüggemann [ctb] (ORCID: ), Christophe Dervieux [ctb]", "Maintainer": "Yihui Xie ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "CRAN" }, "xml2": { "Package": "xml2", @@ -4993,7 +4991,7 @@ "NeedsCompilation": "yes", "Author": "Hadley Wickham [aut], Jim Hester [aut], Jeroen Ooms [aut, cre], Posit Software, PBC [cph, fnd], R Foundation [ctb] (Copy of R-project homepage cached as example)", "Maintainer": "Jeroen Ooms ", - "Repository": "https://packagemanager.posit.co/cran/latest" + "Repository": "RSPM" }, "yaml": { "Package": "yaml", @@ -5012,8 +5010,7 @@ "URL": "https://github.com/vubiostat/r-yaml/", "BugReports": "https://github.com/vubiostat/r-yaml/issues", "NeedsCompilation": "yes", - "Repository": "https://packagemanager.posit.co/cran/latest", - "Encoding": "UTF-8" + "Repository": "CRAN" }, "zoo": { "Package": "zoo", @@ -5054,7 +5051,7 @@ "NeedsCompilation": "yes", "Author": "Achim Zeileis [aut, cre] (), Gabor Grothendieck [aut], Jeffrey A. Ryan [aut], Joshua M. Ulrich [ctb], Felix Andrews [ctb]", "Maintainer": "Achim Zeileis ", - "Repository": "https://packagemanager.posit.co/cran/latest", + "Repository": "RSPM", "Encoding": "UTF-8" } } diff --git a/sdc/index.qmd b/sdc/index.qmd index e60e8fc..a88d84a 100644 --- a/sdc/index.qmd +++ b/sdc/index.qmd @@ -10,17 +10,19 @@ Collected research data often contains sensitive information about individuals. For example, social scientists might collect data on income or criminal behavior, and health data often contains medical information of individuals. Such private information may harm the people involved if disclosed to the public. 
Even if no harm is incurred, the trust of individuals in the data collector, or scientific institutions in general, may be damaged if such data is revealed. + At the same time, broad data availability is very valuable to researchers and governmental institutions alike. Using previously collected data, researchers may answer novel research questions and governmental institutions may improve policy. In addition to these high-level applications, open data can also be used to evaluate the reproducibility of research projects or serve as realistic data in education. That is, open data is valuable for many applications, but simply releasing the data is often not an option. -The first step in the process of releasing data is to anonymize it [TODO: see Data Anonymization tutorial]. +The first step in the process of releasing data - both the actual and synthetic data - is to anonymize it [TODO: see Data Anonymization tutorial]. Anonymization requires that potentially identifying information is removed from the collected data. Examples of such identifying information are names, addresses, IP-addresses, that can often be removed without losing important information. However, after de-identifying the data, your data might still contain information that can lead to indirect identification of individuals, for example because the data can be linked to external data sources. Especially in today's age of massive data collection, data sources can be linked in surprising ways. -In 2006, University of Texas researchers reidentified Netflix users by linking their reviews to IMDb data. +For example, in 2006, University of Texas researchers reidentified Netflix users by linking their reviews to IMDb data. + While likely harmless for most, it exposed sensitive details, such as sexual orientation, leading to potential privacy risks. 
# Statistical disclosure control @@ -28,23 +30,29 @@ While likely harmless for most, it exposed sensitive details, such as sexual ori The term "Statistical disclosure control" refers to a suite of statistical methods that aim to protect collected data such that they can be safely released to the public without disclosing confidential information about these individuals. The goal of statistical disclosure control is to release a data set that is as similar as possible to the original data, while at the same time ensuring that no individual can be identified from the released data, nor any sensitive information can be inferred [@hundepool2024]. Disclosure is here defined as the release of information about an individual that would not have become public if the data would be kept private. + Two types of disclosure risk are commonly considered: 1. __Re-identification disclosure__ occurs when individuals can be singled out from the released data, resulting in confidential data being leaked. 2. __Attribute disclosure__ refers to situations in which characteristics of individuals can be learned with (near) certainty from the data release. For example, if a release discloses that all inhabitants of a street in some city are on welfare, knowing that a person lives in this street discloses information on their welfare status. Every data release requires that both re-identification disclosure and attribute disclosure risks are sufficiently small. - + At the same time, the goal of a data release is to allow others to do something useful with the released data, and for this purpose, the released data should be similar to the original data. -Similarity, in this context, means that the released, but protected, data can be used for the same purposes as the original data. 
+ +::: {.callout-tip title="*Similarity*, in this context, means that the released, but protected, data can be used for the same purposes as the original data."} +::: + To a reasonable degree, analyses on the released data should yield results similar to results obtained from the original data. Similarity does not mean that the released records are similar to the original records, but rather that over the whole, the distributions of the observed and released data are similar. -Statistical disclosure control often yields a trade-off between privacy and utility: the stricter the data protection, the better the privacy of respondents is protected, but the more information is lost, and the lower the utility of the data. + +Statistical disclosure control often yields a *trade-off between privacy and utility*: the stricter the data protection, the better the privacy of respondents is protected, but the more information is lost, and the lower the utility of the data. -The level of protection required depends on the data at hand: data from an insensitive experiment with only very general personal information may require very little protection, while an extensive survey on criminal behavior or sensitive issues should typically be well-protected. +The level of protection required depends on the sensitivity of the data at hand: data from an insensitive experiment with only very general personal information may require very little protection, while an extensive survey on criminal behavior or sensitive issues should typically be well-protected. At the same time, the level of utility required depends on the problem at hand: if the released data should allow to replicate complex analyses to a reasonable degree of accuracy, more sophisticated disclosure methods are required than when solely some marginal quantities (e.g., means and standard deviations) should be preserved. 
-Note that the privacy-utility trade-off is very much relative: a released data set can be very useful for some purposes, but almost useless for others. + +Note that the privacy-utility trade-off is relative to the re-use scenario: a released data set can be very useful for some purposes, but almost useless for others. ```{r} @@ -76,12 +84,12 @@ ggplot(data = data.frame( scale_color_manual(values = c("#009933")) + ggtitle("The privacy-utility trade-off for disclosure risk protected data\n") ``` - + # Conventional statistical disclosure control methods -In practice, many techniques for statistical disclosure control have been developped over the year for microdata; that is, data on individual observations, potentially measured at multiple locations, in contrast to tables with aggregated data. +In practice, many techniques for statistical disclosure control have been developed over the last years for microdata; that is, data on individual observations, potentially measured at multiple locations, in contrast to tables with aggregated data. Typically, these techniques limit the amount of information that is released, thereby introducing statistical bias and variance [@fienberg_slavkovic_2011]. For example, one might set a threshold to variables such that extreme values are not released, which typically creates bias in the distribution of the released values. As an other example, adding noise to observed values will increase the variance of the released data relative to the observed data. @@ -89,14 +97,32 @@ Common methods that have traditionally been used for statistical disclosure cont - __Aggregation__: collapsing categories into larger overarching categories (e.g., towns into municipalities or regions, divisions into companies). - __Rounding__: replacing original values with their rounded counterpart (e.g., income in thousands of euros, age in years).
-- __Top coding__: cap all values higher or lower than some threshold to this threshold (sometimes, only relatively extreme values, such as very large income values, yield a high risk of disclosure). +- __Top coding__: cap all values higher or lower than some threshold to this threshold (sometimes, only relatively extreme values, such as very large income values, yield a high risk of disclosure). (This technique is also called "Winsorizing".) - __Microaggregation__: combine observations into groups of some size where people within a group are maximally similar, calculate the group mean for each variable used to form groups, and replace values on these variables by the respective group mean. -- __Suppression__: remove sensitive or identifying values from the released data directly (this can even be an entire variable). +- __Suppression__: remove sensitive or identifying values (i.e., setting them to "missing"/`NA`) from the released data directly (this can even be an entire variable). - __Adding noise__: random noise is added to the observed values, such that the released value is different from the underlying observed value. -Each of these methods methods either introduce errors in the data, such that the released information cannot be inferred to be entirely accurate, or limits the amount of information that is released in such a way that disclosure risks are small. +Each of these methods either introduces errors in the data, such that the released information is not entirely accurate, or limits the amount of information that is released in such a way that disclosure risks are small. However, an important limitation of these methods is that relationships between variables are usually not accounted for. While it is possible for some of these approaches to be applied on a multivariate level, this is often not easy and typically not done in practice.
Thus, when relationships between variables are of interest, for example when the released data should allow to reproduce regression analyses on the observed data, these traditional methods might distort the data too much. Synthetic data might provide a better solution, as instead of distorting the data, it attempts to model the multivariate distribution of the data, and thus allows to capture relationships between variables. The idea of synthetic data will be more thoroughly explained in [the subsequent section](../synthetic/). + + + + +::: {.callout-tip} + +## Summary + +* Research data often include sensitive personal info; open data is valuable, but raw release is rarely acceptable. +* Anonymize first: removing direct identifiers isn’t enough—linkage to external data can still re-identify people. +* **SDC’s goal:** enable useful data release while preventing **re-identification** and **attribute disclosure**; both risks must be low. +* “Similarity” means preserving analyses/distributions (not record-level matches) → inherent **privacy–utility trade-off**. +* Traditional SDC (aggregation, suppression, top coding, rounding, noise, microaggregation) reduces risk but adds bias/variance and can miss multivariate structure; **synthetic data** may better preserve relationships. +::: diff --git a/synthetic/index.qmd b/synthetic/index.qmd index 40d8a8e..f40db3e 100644 --- a/synthetic/index.qmd +++ b/synthetic/index.qmd @@ -19,7 +19,7 @@ These newly drawn samples can then be released to the public, as a surrogate dat In what follows, we assume that there is a observed data set, $X$, consisting of $n$ observations measured on $p$ variables. Furthermore, we assume that these observations adhere to some structure that we can encode in the joint distribution of the variables, $f(X)$. If we would know this distribution, we could sample new observations from it directly, but we typically do not know this. 
-However, we can estimate this joint distribution, leading to $\hat{f}(X)$, and sample new observations from this. +However, we can estimate this joint distribution from a data set at hand, leading to $\hat{f}(X)$, and sample new observations from this. --- @@ -29,12 +29,13 @@ However, we can estimate this joint distribution, leading to $\hat{f}(X)$, and s How to estimate this joint distribution is essentially a question of how to specify the synthetic data model. If our model approximates the true data generating process reasonably well, we can expect fairly high quality synthetic data. -To be a bit more precise, this means that both univariate and multivariate patterns of the observed data are preserved in the synthetic data: the univariate characteristics of the variables are thus preserved, but also relationships between variables are captured in the synthetic data. -Moreover, if we model $f(X)$ accurately and draw new samples from this model, our data should be protected sufficiently well, because many real-world processes are far from deterministic. +To be a bit more precise, this means that both univariate and multivariate patterns of the observed data are preserved in the synthetic data, such as the mean and variance for each variable and the covariances between variables. +Moreover, if we model $f(X)$ accurately and draw new samples from this model, our original data should be protected sufficiently well, because many real-world processes are far from deterministic. These random fluctuations typically provide sufficient uncertainty to prevent both identity and attribute disclosure. + In what follows, we describe the two strategies that have emerged to estimate $\hat{f}(x)$ and draw samples from it. -## Joint modelling +## (A) Joint modelling One approach to modelling synthetic data, is by estimating the multivariate distribution of the data directly, and use this single joint model to draw values for all variables simultaneously.
We can, for example, model the distribution of the data using a normal approximation. @@ -42,12 +43,14 @@ That is, we estimate the means and covariances of the variables in the data, ass For many data sets, however, this normal approximation provides a poor fit. In the past decade, deep learning has become popular for the generation of synthetic data. -An advantage of these methods, in particular generative adversarial networks (GANs) and variational auto-encoders (VAEs), is that they do not assume a particular parametric distribution (like the normal approximation we just discussed). +An advantage of these methods, in particular generative adversarial networks (GANs) and variational auto-encoders (VAEs), is that they do not assume a particular parametric distribution (like the normal approximation). Rather, these methods attempt to learn the distribution of the data, by transforming relatively simple input data in a way that resembles the observed data. While these methods can achieve state-of-the-art performance, extensive tuning is often required to achieve this performance. Moreover, tuning these models such that they produce realistic synthetic data is non-trivial and often performed in an ad hoc fashion, subsantially lowering user-friendliness. -## Fully conditional specification +## (B) Fully conditional specification + + A second strategy is to break the joint distribution of the data into a series of conditional distributions, one for each variable (or a block of variables). That is, we can factor the joint distribution of the data into a series of conditionals @@ -66,13 +69,14 @@ Note that, although in theory the order of synthesis should not matter much, in We will come back to this issue and provide some concrete advice in [the next section](/generation/index.qmd#sec-creating). -Fully conditional specification has two distinct advantages over joint modelling: +Fully conditional specification has two distinct advantages over joint modelling: + 1. 
It allows great flexibility, because a different model can be used for each variable. That is, this linear model can be combined with or replaced by non-linear (e.g., non-parametric) models that capture non-linear relationships (or non-normal error terms). -2. Traditional model evaluation techniques, such as regression diagnostics and posterior predictive checks can be used directly (see [the relevant vignette from the `performance` package](https://easystats.github.io/performance/articles/check_model.html#posterior-predictive-checks) for a practical introduction), and potential improvements can be readily incorporated into the synthesis model.^[Note, however, that no tailor-made software exists that works together with existing synthetic data generation packages.] +2. Traditional model evaluation techniques, such as regression diagnostics and posterior predictive checks can be used directly (see [the relevant vignette from the `performance` package](https://easystats.github.io/performance/articles/check_model.html#posterior-predictive-checks) for a practical introduction), and potential improvements can be readily incorporated into the synthesis model.^[Note, however, that until now (end of 2025) no tailor-made software exists that works together with existing synthetic data generation packages.] For example, if a model with a linear relationship between two variables yields a poor fit, a non-linear relationship can be easily incorporated in the synthesis model. -In what follows, we focus on synthesis methods that employ this conditional modelling strategy (commonly called fully conditional specification or sequential regression). +In what follows, we focus on synthesis methods that employ this *conditional modelling strategy* (also called "fully conditional specification" or "sequential regression"). 
:::{.callout-tip title="Additional resources"} @@ -91,11 +95,11 @@ If you are interested in any additional resources on __fully conditional specifi --- -# Exercise: A simple synthetic data model +# (Optional) Exercise: A simple synthetic data model --- -_The following exercise is purely illustrational. In practice, we don't write our synthesis code from scratch, but we use dedicated `R` packages to generate synthetic data. However, this section describes how such packages work under the hood, and a basic understanding of this will be helpful for specifying and improving the synthesis models. The related code is not essential, but try to understand the general workflow. In the [next section](../generation/), we will use the `synthpop` package to achieve similar results._ +_The following exercise is purely illustrational. In practice, we don't write our synthesis code from scratch, but we use dedicated `R` packages to generate synthetic data. However, this section describes how such packages work under the hood, and a basic understanding of this will be helpful for specifying and improving the synthesis models. This exercise is not essential for proceeding with the tutorial, but try to understand the general workflow. In the [next section](../generation/), we will use the `synthpop` package to achieve similar results._ To get a flavour of the two modelling approaches, you will show that it is indeed possible to model a joint distribution as a series of conditionals. For this exercise (and all exercises in subsequent sections), we will use the `boys` data from the `R`-package `mice`,^[I filled in the missing values through imputation, so we don't have to deal with these. However, this implies that we cannot use the `boys` data from the `mice`-package directly] and for the time being, we focus on the first three variables `age`, `hgt` and `wgt`. @@ -145,7 +149,6 @@ __1.
Load the `R`-package `mvtnorm` using the function `library()`.__ ```{r} #| label: load-mvtnorm #| code-fold: true -#| code-summary: "Show code: Loading `mvtnorm`" #| message: false #| warning: false @@ -251,6 +254,8 @@ hgt_syn <- hgt_coefs[1] + hgt_coefs[2] * age_syn + rnorm(nsyn, 0, hgt_sd) ## Comparing the results + + __8. Now, use the synthetic `age` and `hgt` values to predict an equal number of `wgt` values, and again add normal noise based on the residual standard deviation of the model fitted on the observed data.__ ```{r} @@ -299,6 +304,8 @@ var(syn2) Also the variance covariance matrices are very close, as we would expect. However, these are the parameters that we specified, so these should be reasonably close, because anything else suggests a coding error. We can go further comparing the distributions of the variables, but we save that exercise for later. + + --- ## Conclusion: Fully conditional specification versus joint modelling @@ -307,6 +314,7 @@ Now you have a flavour of how the two modelling procedures allow to generate syn While in this specific example, the outcomes were essentially the same, this is not generally the case. That is, the combination of three conditional normal models with solely main effects is equivalent to a tri-variate normal distribution, but using fully conditional specification you can specify any model that you deem appropriate, even if this does not give rise to a well-defined multivariate probability distribution. As stated in @sec-modelling, we can model a subset of the variables, or the entire data set using non-parametric models, and a different model can be specified for each variable. + Because of this flexibility, and it's presence in existing software, we will use the fully conditional specification strategy throughout. Before we get there, we discuss some additional considerations for generating synthetic data. @@ -325,7 +333,7 @@ We could also miss some important relationships between variables. 
In such cases, the synthetic data might not look like the observed data, and some analyses performed on the synthetic data might yield results that are far from those one would obtain running the same analyses on the observed data. That is, the synthetic data has low utility. 2. It can also be the case that the model fits the data a bit too good. -If the synthetic data model overfits, we risk that the observed data is reproduced when sampling from the synthetic data model, protecting privacy very poorly. +If the synthetic data model overfits, we risk that the observed data is reproduced when sampling from the synthetic data model, protecting privacy very poorly. Moreover, overfitted models have a limited utility, as they typically fail to generalize to new data. You might notice that also here the privacy-utility trade-off plays a major role: a simpler synthesis model typically leaks less information, but might also fail to capture more complex characteristics of the observed data. @@ -502,6 +510,7 @@ This substantially reduces disclosure risks, because this uncertainty renders it Simultaneously, specifying the relationships between variables accurately yields higher utility, which allows data users to do more useful analyses with the synthetic data. Hence, you want to balance data privacy with data utility. + --- @@ -518,10 +527,12 @@ Often, this depends on the analyses conducted with the real data. If the data ought to be used for novel research, including complex modelling, one should consider incorporating these complexities (such as higher-order moments or non-linearities) in the synthesis model. Typically, it is safe to assume that aspects of the data that are not modelled explicitly (e.g., interactions between variables, non-linear relationships, fat tails or even multi-level structure) will not appear in the synthetic data, although some exceptions are possible by choosing flexible synthesis models. 
-Whatever procedure is followed to create the synthetic data, make sure to be transparent: researchers would love to know what your synthetic data set can and cannot be used for. +Whatever procedure is followed to create the synthetic data, make sure to be transparent: researchers should know what your synthetic data set can and cannot be used for. + + At the same time, synthetic data creators should evaluate potential disclosure risks. -These risks should be evaluated at two distinct moment: before creating synthetic data, one should decide what information must be protected, and after creating synthetic data, one should evaluate how well one actually protected that information. +These risks should be evaluated at two distinct moments: before creating synthetic data, one should decide what information must be protected, and after creating synthetic data, one should evaluate how well one actually protected that information. For example, some synthesis methods may simply reproduce original records. The privacy-utility trade-off hints at the idea that the more complex the synthesis model, the larger the risk of disclosure. As such, it is advisable to start with simple synthesis models, and add complexity only when necessary.