Andrea Galloni
A Novel Evaluation Metric for Synthetic Data Generation
Galloni, Andrea; Lendák, Imre; Horváth, Tomáš
Authors
Imre Lendák
Tomáš Horváth
Abstract
Differentially private algorithmic synthetic data generation (SDG) solutions take input datasets Dp consisting of sensitive, private data and generate synthetic data Ds with similar qualities. The importance of such solutions is increasing, both because more and more people realize how much data is collected about them and used in machine learning contexts, and as a consequence of newly introduced data privacy regulations, e.g. the EU’s General Data Protection Regulation (GDPR). We aim to develop a novel, composite SDG evaluation metric which takes into account macro-statistical dataset similarities and data utility in machine learning tasks against privacy boundaries of the synthetic data. We formalize the mathematical foundations for quantitatively measuring both the statistical similarities and the data utility of synthetic data. We use two well-known datasets containing (potentially) personally identifiable information as inputs (Dp) and the existing SDG algorithms PrivBayes and DPGroupFields to generate synthetic data (Ds) based on them. We then test our evaluation metric for different values of the privacy budget ε. Based on our experiments, we conclude that the proposed composite evaluation metric is appropriate for quantitatively measuring the quality of synthetic data generated by different SDG solutions and possesses the expected sensitivity to various privacy budget values.
Citation
Galloni, A., Lendák, I., & Horváth, T. (2020, November). A Novel Evaluation Metric for Synthetic Data Generation. Presented at IDEAL 2020: 21st International Conference on Intelligent Data Engineering and Automated Learning, Guimarães, Portugal
Presentation Conference Type | Conference Paper (published) |
---|---|
Conference Name | IDEAL 2020: 21st International Conference on Intelligent Data Engineering and Automated Learning |
Start Date | Nov 4, 2020 |
End Date | Nov 6, 2020 |
Online Publication Date | Oct 27, 2020 |
Publication Date | 2020 |
Deposit Date | Apr 8, 2024 |
Publisher | Springer |
Pages | 25-34 |
Series Title | Lecture Notes in Computer Science |
Series Number | 12490 |
Series ISSN | 0302-9743 |
Book Title | Intelligent Data Engineering and Automated Learning – IDEAL 2020: 21st International Conference, Guimaraes, Portugal, November 4–6, 2020, Proceedings, Part II |
ISBN | 9783030623647 |
DOI | https://doi.org/10.1007/978-3-030-62365-4_3 |
Keywords | Synthetic data generation, Differential privacy, Evaluation metrics |
Public URL | http://researchrepository.napier.ac.uk/Output/3587400 |