Research Repository

A Novel Evaluation Metric for Synthetic Data Generation

Galloni, Andrea; Lendák, Imre; Horváth, Tomáš

Authors

Andrea Galloni

Imre Lendák

Tomáš Horváth



Abstract

Differentially private algorithmic synthetic data generation (SDG) solutions take input datasets D_p consisting of sensitive, private data and generate synthetic data D_s with similar qualities. The importance of such solutions is increasing, both because more and more people realize how much data is collected about them and used in machine learning contexts, and as a consequence of newly introduced data privacy regulations, e.g. the EU’s General Data Protection Regulation (GDPR). We aim to develop a novel, composite SDG evaluation metric that weighs macro-statistical dataset similarity and data utility in machine learning tasks against the privacy boundaries of the synthetic data. We formalize the mathematical foundations for quantitatively measuring both the statistical similarity and the data utility of synthetic data. We use two well-known datasets containing (potentially) personally identifiable information as inputs (D_p) and the existing SDG algorithms PrivBayes and DPGroupFields to generate synthetic data (D_s) from them. We then test our evaluation metric for different values of the privacy budget ε. Based on our experiments, we conclude that the proposed composite evaluation metric is appropriate for quantitatively measuring the quality of synthetic data generated by different SDG solutions and shows the expected sensitivity to different privacy budget values.
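To make the idea of a composite score concrete, here is a minimal Python sketch that combines a statistical-similarity term with an ML-utility term. The abstract does not give the paper's formulas, so everything below is an illustrative assumption rather than the authors' metric: the total variation distance between per-column value frequencies, the accuracy ratio, the equal default weighting, and all function names are hypothetical choices.

```python
from statistics import mean


def column_similarity(private_col, synthetic_col):
    """Similarity in [0, 1]: one minus the total variation distance
    between the value frequencies of two categorical columns.
    (Assumed similarity measure, not taken from the paper.)"""
    values = set(private_col) | set(synthetic_col)
    n_p, n_s = len(private_col), len(synthetic_col)
    tvd = 0.5 * sum(
        abs(private_col.count(v) / n_p - synthetic_col.count(v) / n_s)
        for v in values
    )
    return 1.0 - tvd


def statistical_similarity(private_rows, synthetic_rows):
    """Average per-column similarity; rows are equal-length tuples."""
    private_cols = zip(*private_rows)
    synthetic_cols = zip(*synthetic_rows)
    return mean(
        column_similarity(p, s) for p, s in zip(private_cols, synthetic_cols)
    )


def utility_ratio(acc_synthetic, acc_private):
    """Accuracy of a model trained on synthetic data relative to one
    trained on the private data, capped at 1. (Assumed utility measure.)"""
    if acc_private <= 0:
        raise ValueError("acc_private must be positive")
    return min(1.0, acc_synthetic / acc_private)


def composite_score(similarity, utility, weight=0.5):
    """Weighted average of both terms; the weighting is an assumption."""
    return weight * similarity + (1.0 - weight) * utility


# Toy data: two categorical columns per row.
private = [("a", 0), ("a", 1), ("b", 1), ("b", 0)]
synthetic = [("a", 0), ("a", 0), ("a", 1), ("b", 1)]

sim = statistical_similarity(private, synthetic)
util = utility_ratio(acc_synthetic=0.72, acc_private=0.80)
score = composite_score(sim, util)
```

In an experiment like the one the abstract describes, such a score would be recomputed for synthetic datasets generated under several privacy budgets, to check whether it responds to the budget as expected.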

Citation

Galloni, A., Lendák, I., & Horváth, T. (2020, November). A Novel Evaluation Metric for Synthetic Data Generation. Presented at IDEAL 2020: 21st International Conference on Intelligent Data Engineering and Automated Learning, Guimarães, Portugal.

Presentation Conference Type: Conference Paper (Published)
Conference Name: IDEAL 2020: 21st International Conference on Intelligent Data Engineering and Automated Learning
Start Date: Nov 4, 2020
End Date: Nov 6, 2020
Online Publication Date: Oct 27, 2020
Publication Date: 2020
Deposit Date: Apr 8, 2024
Publisher: Springer
Pages: 25-34
Series Title: Lecture Notes in Computer Science
Series Number: 12490
Series ISSN: 0302-9743
Book Title: Intelligent Data Engineering and Automated Learning – IDEAL 2020: 21st International Conference, Guimaraes, Portugal, November 4–6, 2020, Proceedings, Part II
ISBN: 9783030623647
DOI: https://doi.org/10.1007/978-3-030-62365-4_3
Keywords: Synthetic data generation, Differential privacy, Evaluation metrics
Public URL: http://researchrepository.napier.ac.uk/Output/3587400