edbt.org

Edbt.org

Measuring Risk and Utility of Anonymized Data Using
Information Theory
Department of Computer Engineering and Maths severe that the anonymized data become useless, i.e., that Before releasing anonymized microdata (individual data) it all information contained in the data is lost. The problem is essential to evaluate whether: i) their utility is high enough of optimizing the trade-off between disclosure risk and in- for their release to make sense; ii) the risk that the anonymized formation loss is known as the statistical disclosure control data result in disclosure of respondent identity or respondent attribute values is low enough. Utility and disclosure riskmeasures are used for the above evaluation, which normally Information loss measures in SDC of microdata are usually lack a common theoretical framework allowing to trade off based on the relative discrepancy between some statistics or utility and risk in a consistent way. We explore in this pa- models computed on the original data and on the masked per the use of information-theoretic measures based on the data [4]. A critique to the above measures is that, for contin- uous attributes, relative discrepancies are unbounded1 anddifficult to combine with disclosure risk; the latter is most often measured as a probability of re-identification and isthus bounded between 0 and 1 [14].
H.2.8 [Database Management]: Database Applications—statistical databases; H.1.1 [Models and Principles]: Sys- Probabilistic information loss measures yielding a figure be- tems and Information Theory; K.4.1 [Computers and So- tween [0, 1] which can be readily compared to disclosure risk have been proposed [9]. Let θ be a population parameter(on the original data) and let ˆ statistic (on the masked data). If the number n of records of the original data is large (> 100), then When a set of microdata (individual data) containing the values of some attributes for individuals or companies is tobe released by a statistical office for research or general use, can be assumed to follow a N (0, 1) distribution.
anonymization is more often than not a legal requirement [6].
Anonymization means processing original microdata so as to A probabilistic information loss measure pil(θ) for param- obtain masked microdata that can be released in such a way eter θ is the probability that the absolute value of the dis- that the respondent subjects to whom the microdata records crepancy Z is ≤ the actual discrepancy in the masked data: refer cannot be re-identified (identity disclosure) and partic-ular attribute values cannot be associated or disassociatedwith a particular respondent subject (attribute disclosure).
In order to reduce the disclosure risk, anonymization distorts the data in some way (e.g. by perturbing them or reducing their detail); as distortion increases, disclosure risk can beexpected to decrease. However, distortion should not be so However, getting information loss and disclosure risk mea-sures bounded in the same [0, 1] does not imply that theirsemantics are entirely comparable. Indeed, probabilistic in-formation loss measures are actually distortions mapped to 1E.g. a statistic whose value is 0 on the original data and 0.1 on the masked data has an infinite relative discrepancy(0.1 − 0)/0.
[0, 1], whereas disclosure risk is normally computed as a re- • More generally, if U and V are random, jointly Gaus- identification probability. If information loss and disclosure risk could be expressed with semantically similar measures, boundedness would be irrelevant: both measures could be where P t is the transpose of P , I the identity matrix The original contribution of this paper, developed in the fol- lowing sections, is to explore the use of information-theoreticmeasures based on the notion of mutual information in viewof providing a unified framework embracing information loss If mutual information can be used to express information and disclosure risk. Section 2 motivates the use of information- loss or/and disclosure risk, then the machinery of informa- tion theory can be used to optimize the trade-off between oretic measures for information loss.
information-theoretic measures for disclosure risk. Based onthe previous measures, Section 5 presents models to trade off information loss against disclosure risk. Section 6 lists conclusions and issues for future research.
The attributes in a microdata set can be classified as: • Identifiers. These are attributes that unambiguously Unbounded loss measures based on relative discrepancies are identify the respondent. Examples are the passport very easy to understand, but rather difficult to trade off number, social security number, name-surname, etc.
• Key attributes. These are attributes which identify the respondent with some degree of ambiguity. (Nonethe-less, a combination of key attributes may provide un- • They can be applied to the same usual statistics θ ambiguous identification.) Examples are address, gen- (means, variances, covariances, etc.) as measures based contain sensitive information on the respondent. Ex- • They are bounded within [0, 1], so they easily compare amples are salary, religion, political affiliation, health Both relative-discrepancy and probabilistic loss measures We assume that the original microdata have been pre-processed lack an underlying theory allowing to optimize their trade- to remove all identifiers from them. Let X, Y be, respec- tively, the key and confidential attributes in the original mi-crodata set. Let X be the key attributes in the masked The mutual information I(X; Y ) between two random vari- microdata set (as in k-anonymization [13], we assume that ables X and Y measures the mutual dependence of the two only key attributes are masked). If we focus on the dam- variables and is measured in bits. Mutual information can age inflicted to key attributes [11], a possible information be expressed as a function of Shannon’s entropy: loss measure is the expected distortion E(d(X, X )) whered(x, x ) is a distortion measure, e.g. d(x, x ) = ||x − x ||2.
I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X) A probably better option is to focus on how masking affects the statistical dependence between the key and confidentialattributes. A possible measure for this is I(X; Y )−I(X ; Y ).
where H(X), H(Y ) are marginal entropies, H(X|Y ), H(Y |X)are conditional entropies and H(X, Y ) is the joint entropyof X and Y .
Lemma 1. If X is a randomized function of X, but not of Y , it holds that I(X; Y ) − I(X ; Y ) ≥ 0.
Given a symmetric, positive definite matrix Σ, let Σ1/2 de-note the unique, symmetric, positive definite square root of Given three random variables X1, X2 and X3, Σ. The following can be stated about mutual information define the conditional mutual information I(X1; X2|X3) as the expected value of I(X1; X2) conditional to X3, that is,I(X1; X2|X3) = EX (I(X • If U and V are jointly Gaussian random vectors, and I(X ; Y )+I(X; Y |X ) = I((X, X ); Y ) = I(X; Y )+I(X ; Y |X) U is the best linear estimate of U from V , then U is The hypothesis of the lemma implies that X and Y are a sufficient statistic, that is, I(U ; V ) = I(U ; V ).
conditionally independent given X, that is, I(X ; Y |X) = 0.
If U and V are random, jointly Gaussian scalars withcorrelation coefficient ρ, then I(U ; V ) = − log p1 − ρ2.
Since I(X; Y |X ) ≥ 0, we have that I(X ; Y ) ≤ I(X; Y ). ✷ Let us now compare the mutual information with the more usual information loss measures based on the mean square The MSE E(d(X, X )) = E(||X−X ||2) seems better adaptedthan I(X; X ) to measuring how well statistical properties • A zero MSE between X and X , that is, E(d(X, X )) = • However, I(X; Y ) − I(X ; Y ) = 0 only implies that X Figure 1: Risk-loss as Lagrangian rate-distortion op-timization. Rate stands for risk in the privacy set- Nonetheless, MSE and the mutual information are not that different, both belonging to the family of so-called Bregmandivergences [10, 2].
By combining the above two approaches with the variousloss and risk measures sketched above, several optimization The difference I(X; Y ) − I(X ; Y ) bears some resemblance to the relative discrepancy between correlation matrices pro-posed as an information loss measure in [4]. However, mu-tual information measures the general dependence between attributes, while the correlation measures only the linear In this model, disclosure risk is minimized while keeping dependence; thus the former seems superior [8]. It will be information loss below a certain upper bound. Disclosure shown below that, under some assumptions, preserving mu- risk is measured using mutual information and information tual information preserves the covariance matrix up to a loss is measured as the expected distortion. Hence: The mutual information I(X ; X) between the released and for a certain pre-specified maximum tolerable loss D.
the original key attributes is a measure of identity disclosure(defined in the introduction above). Note that I(X ; X) was Model 1 was related in [11] to the rate-distortion function previously regarded as a possible information loss measure optimization in information theory (see Figure 1): the risk R was assimilated to the rate and the loss D to the distor-tion. An optimal random perturbation pX |X (x |x) for key The mutual information I(X ; Y ) between the released key attributes was obtained. Let d = D/σ2X be the normalized attributes and the confidential attributes is a measure of at- distortion. For the case of univariate Gaussian, real-valued tribute disclosure (defined in the introduction above). Mea- X and Y , a closed form of the minimum was obtained. The suring risk as I(X ; Y ) conforms to the t-closeness privacy idea is to compute X as X = (1−d)X +dZ, where the noise property [7] requiring that the distance between the distribu- Z is distributed according to N (µ, 1−d σ2 tion of Y within records sharing each combination of values of X and the distribution of Y in the overall dataset be no (x |x) = N ((1 − d)x + dµX , d(1 − d)σ2X ) Several combinations of the above loss and risk measures can be used when trying to optimize the trade-off of informationloss and disclosure risk. Two approaches are conceivable: If we maintain the same risk and loss measures, but take the more natural approach of minimizing D for a maximum Place an upper-bound constraint on the information loss D and minimize the disclosure risk R.
• Place an upper-bound constraint on the disclosure risk R and minimize the information loss D (more natural Information loss measures based on relative discrepanciesare awkward to combine with risk measures in order to opti-mize the risk-loss trade-off. Probabilistic loss measures are a This problem could be related to optimizing the distortion- step forward, but lack a theoretical framework. We have ex- rate function optimization in quantization. This again yields plored here loss and risk measures based on information the- an optimal perturbation pX |X , which can be computed by ory, namely on mutual information. Models for optimizing the information-theoretic risk-loss trade-off when perturbingdata and generating synthetic data have been presented. It has been shown that preserving mutual information offers Model 3 below results from miniziming disclosure risk while keeping information loss below a certain level, and usingmutual information for measuring both the disclosure risk However, this paper is intended to be a starting point rather than a conclusive contribution. The information-theoreticmeasures and the models discussed above are just a firststep. A number of issues for future research lie open ahead: • Relate Model 2 with distortion-rate function optimiza- tion, a well-known problem in quantization. This should subject to I(X; Y ) − I(X ; Y ) ≤ D.
be done in a way analogous to the connection betweenModel 1 and rate-distortion function optimization es- Finally, Model 4 is obtained by maintaining the same mea- sures, but minimizing the information loss while keeping dis-closure risk below an upper bound: • In the context of synthetic data generation, devise information-theoretic loss measures whose minimiza-tion is equivalent to preserving a given model.
• Whenever possible, find closed-form expressions for the • If a closed-form expression is not possible, look for a convex optimization problem to be solved numerically Synthetic data generation, that is, generation of randomsimulated data preserving some properties of original data, • Investigate the connection of mutual information with can be viewed as a form of masking by perturbation [1]. If information loss metrics other than MSE, like [5, 15, we want to generate synthetic key attributes X in such a way that the connection between key attributes and confi-dential attributes is minimally affected, we can use Model 4 to compute pX |X . Synthetic X can be generated by draw- This work was partly funded by the Spanish Government through projects CONSOLIDER INGENIO 2010 CSD2007-00004 “ARES” and TSI2007-65406-C03-01 “E-AEGIS”. The Mutual information vs. covariance preser- first author is partially supported as an ICREA Acad` researcher by the Government of Catalonia; he holds theUNESCO Chair in Data Privacy, but his views do not nec- We justify that preserving mutual information (that is, achiev- essarily reflect the position of UNESCO nor commit that ing D(R) = 0 in Model 4) preserves the covariance matrix Let X and Y be zero-mean, jointly Gaussian random vari- R- and R -valued, respectively. Let X = aT Y be the [1] J. M. Abowd and L. Vilhuber. How protective are best linear MSE estimate of X given Y , for a ∈ synthetic data? In J. Domingo-Ferrer and Y. Saygin, editors, Privacy in Statistical Databases, volume 5262 of Lecture Notes in Computer Science, pages 239–246, On the one hand, X is a sufficient statistic for X given Y , that is, I(X ; Y ) = I(X; Y ) (see [12]).
[2] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the On the other hand, the covariance matrix is preserved when solution of problems in convex programming. USSR Comput. Math., Math. Phys., 7:200–217, 1967.
[3] J. Brickell and V. Shmatikov. The cost of privacy: destruction of data-mining utility in anonymized datapublishing. In Proc. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and [4] J. Domingo-Ferrer and V. Torra. Disclosure protection methods and information loss for microdata. InP. Doyle, J. I. Lane, J. J. M. Theeuwes, andL. Zayatz, editors, Confidentiality, Disclosure andData Access: Theory and Practical Applications forStatistical Agencies, pages 91–110, Amsterdam, 2001.
North-Holland.
[5] V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDDInternational Conference on Knowledge Discovery andData Mining, pages 279–288, 2002.
[6] J. Lane, P. Heus and T. Mulcahy. Data access in a cyber world: making use of cyberinfrastructure.
Transactions on Data Privacy, 1(1): 2–16, 2008.
[7] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: privacy beyond k-anonymity and l-diversity. InProceedings of the IEEE ICDE 2007, 2007.
[8] W. Li. Mutual information functions vs correlation functions. Journal of Statistical Physics, 60:823–837,1990.
[9] J. M. Mateo-Sanz, J. Domingo-Ferrer, and F. Seb´ Probabilistic information loss measures inconfidentiality protection of continuous microdata.
Data Mining and Knowledge Discovery, 11(2), 2005.
181-193.
[10] D. Rebollo-Monedero. Quantization and Transforms for Distributed Source Coding. PhD thesis, StanfordUniversity, 2007.
J. Domingo-Ferrer. From t-closeness to pram andnoise addition via information theory. In Privacy inStatistical Databases-PSD 2008, volume 5262 ofLecture Notes in Computer Science, pages 100–112,Berlin Heidelberg, 2008.
[12] D. Rebollo-Monedero, S. Rane, A. Aaron, and B. Girod. High-rate quantization and transformcoding with side information at the decoder. SignalProcessing, 86(11):3160–3179, 2006.
[13] P. Samarati. Protecting respondents’ identities in microdata release. IEEE Transactions on Knowledgeand Data Engineering, 13(6):1010–1027, 2001.
[14] M. Trottini. Decision models for data disclosure limitation. PhD thesis, Carnegie Mellon University,2003. http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf.
[15] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu. Utility-based anonymization using localrecoding. In Proc. of the 12th ACM SIGKDDInternational Conference on Knowledge Discovery andData Mining, pages 785–790, 2006.

Source: http://www.edbt.org/Proceedings/2009-StPetersburg/workshops/PAIS09/workshoppapers/p-pais04.pdf

Microsoft word - blossom-icf-abbrtext-031308.doc

BLOSSOM: Behavioral Modification and Lorcaserin Second Study for Obesity Management A 52-Week, Double-blind, Randomized, Placebo-controlled, Parallel- group Study to Assess the Safety and Efficacy of Lorcaserin Hydrochloride in Overweight and Obese Patients Why is this research study being done? We are studying an “investigational” drug, lorcaserin hydrochloride (“lorcaserin�

Microsoft word - nescoe rps plan - exelon comments - final

New England States Committee on Electricity Coordinated Competitive Renewable Power Procurement Prepared Comments of Jonathan A. Lesser, PhD Continental Economics, Inc. On Behalf Of EXELON CORPORATION August 30, 2012 EXECUTIVE SUMMARY Exelon Corporation (“Exelon”) urges the New England States Committee on Electricity (“NESCOE”) to abandon its propos