Real-Valued Negative Databases

The negative database (NDB) is the negative representation of original data. Existing work has demonstrated that NDBs can be used to preserve privacy and hide information. However, most work on NDBs is based on the binary representation. In applications that are naturally described in real-valued space, the binary negative database is hard to apply appropriately. Therefore, the real-valued negative database is proposed in this paper, and reversing the real-valued negative database is proved to be an NP-hard problem. Moreover, an effective algorithm for generating real-valued negative databases is given. Finally, an example of applying the real-valued negative database to privacy-preserving data publication is described, and it shows that the real-valued negative database is valuable in practice.


Introduction
Nowadays, databases have become basic tools for storing data. As data privacy is a widespread concern, techniques that preserve privacy while keeping database services available are urgently needed. Traditional databases store data in the form in which it actually is. This is called the positive representation of data, and such databases are called positive databases. The privacy of traditional databases is easily compromised when the databases are leaked. Although some cryptographic methods can be applied to positive databases, it is time-consuming to encrypt every entry, and the encrypted databases cannot support basic database operations efficiently. Another way is to control access to the positive database, but this cannot eliminate all security risks, as there may be internal attacks.
The negative database, which is inspired by the natural immune system, was proposed by Esponda and his colleagues (Esponda et al., 2004a; Esponda et al., 2004b; Esponda et al., 2005; Esponda et al., 2007a; Esponda et al., 2009). In contrast to traditional databases, the negative database only stores the information in the complementary set of the original data. This is called the negative representation of data. It has been proved that reversing the negative database with the binary representation (i.e. recovering the corresponding binary positive database) is NP-hard (Esponda et al., 2004b; Esponda et al., 2009). Therefore, the binary negative database could be employed to protect data privacy. Some algorithms for generating binary negative databases from binary positive databases have been proposed, such as the prefix algorithm (Esponda et al., 2004b; Esponda et al., 2009), the RNDB algorithm (Esponda et al., 2004b; Esponda et al., 2009), the q-hidden algorithm (Jia et al., 2005; Esponda et al., 2007a) and the hybrid-NDB algorithm (Liu et al., 2011). Furthermore, some basic operations upon the negative database have been proposed, such as the negative Cartesian product, negative join and negative intersection (Esponda et al., 2004a; Esponda et al., 2005; Esponda et al., 2007b).
So far, most work on the negative database is based on the binary representation. However, in applications that are naturally described in real-valued space, the negative database with the binary representation is not appropriate. Therefore, the real-valued negative database is proposed in this paper.
The negative database has already been introduced to applications such as privacy preservation (Esponda et al., 2004b; Esponda et al., 2007a; Esponda et al., 2009), sensitive data collection (Esponda, 2006; Horey et al., 2007) and authentication (Dasgupta and Azeem, 2007; Dasgupta and Azeem, 2008). In this paper, an example of applying the real-valued negative database to privacy-preserving data publication is given. This example demonstrates that the real-valued negative database is appropriate for privacy-preserving data publication.

Existing Work about Negative Databases
The negative database (NDB) was proposed by Esponda and his colleagues (Esponda et al., 2004a; Esponda et al., 2004b). Presently, most negative databases are based on the binary representation. The details of the binary negative database are described as follows (Esponda et al., 2004b; Esponda et al., 2009).
Assume the original data is a database which consists of n entries, i.e. DB = {x_1, x_2, …, x_n}, and each entry in DB is a binary string of length m. The universal set is U = {0, 1}^m.
The complementary set of DB is denoted as U − DB, and the negative database NDB only stores the elements that belong to U − DB. As there are usually too many binary strings in U − DB to list explicitly, a "don't care" symbol "*" is introduced to compress NDB to a reasonable size. Each entry in NDB is a string of length m defined upon the alphabet {0, 1, *}. The positions with value 0 or 1 are called specified positions, and those with symbol "*" are called unspecified positions. The symbol "*" represents 0 or 1 at a given position. If all the entries in U − DB are covered by NDB, NDB is said to be complete. A binary string s is said to be matched with (or covered by) an entry y in NDB if and only if, at each position, the value of s is identical to that of y or the corresponding value of y is "*". With the unspecified value "*", multiple different negative databases can be generated from the same positive database.
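The matching rule above can be illustrated with a minimal Python sketch. The function names and the toy database are hypothetical and chosen only for illustration; they are not taken from the cited work.

```python
def matches(s: str, y: str) -> bool:
    """True if binary string s is covered by entry y defined over {0, 1, *}."""
    return len(s) == len(y) and all(c == "*" or c == b for b, c in zip(s, y))


def covered(s: str, ndb: list[str]) -> bool:
    """True if at least one entry of the negative database covers s."""
    return any(matches(s, y) for y in ndb)


# Hypothetical toy example with m = 3.
db = {"000", "011"}
ndb = ["1**", "001", "010"]  # a complete NDB for db

# Completeness: every string outside db is covered, and no string in db is.
universe = [format(i, "03b") for i in range(8)]
is_complete = all(covered(s, ndb) == (s not in db) for s in universe)
```

Here the single entry "1**" stands for four binary strings, which shows how the "*" symbol compresses U − DB.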
It has been proved that reversing the binary negative database (i.e. recovering the corresponding binary positive database) is an NP-hard problem (Esponda et al., 2004b; Esponda et al., 2009). If reversing a negative database is computationally infeasible, the negative database is said to be hard-to-reverse; otherwise it is said to be easy-to-reverse. Some algorithms for generating the binary negative database from a binary positive database have been proposed. The prefix algorithm (Esponda et al., 2004b; Esponda et al., 2009) is the first algorithm for generating binary negative databases, and it is compact and efficient. The binary negative database generated by the prefix algorithm is complete but easy-to-reverse. In order to overcome this shortcoming, the RNDB algorithm (Esponda et al., 2004b; Esponda et al., 2009) was proposed. The RNDB algorithm embeds some random factors to generate binary negative databases which are possibly hard-to-reverse. However, the hard-to-reverse property of the binary negative databases generated by the RNDB algorithm cannot be guaranteed, and the size of those binary negative databases could be too large. The q-hidden algorithm (Jia et al., 2005; Esponda et al., 2007a) was proposed for binary positive databases that contain only one entry, and it is very efficient. The generated binary negative databases are not complete, but hard-to-reverse on average. The hybrid-NDB algorithm (Liu et al., 2011) combines the prefix algorithm with the q-hidden algorithm to generate binary negative databases that are both complete and hard-to-reverse on average. It is noted that the "hard-to-reverse" property mentioned here means that SAT solvers with a local search strategy (e.g. WalkSAT (Selman et al., 1995)) cannot reverse the negative databases on average.
In real-world applications, real-valued databases are often used. However, it is not convenient to employ the binary negative database to represent a real-valued database. Therefore, the real-valued negative database is studied in this paper. It is noted that earlier work related to the negative database is the negative selection algorithm (Forrest et al., 1994; Ji and Dasgupta, 2007). The binary negative database is closely related to the negative selection algorithm with the binary representation (Forrest et al., 1994; Ji and Dasgupta, 2007), while their objectives and generation algorithms are clearly different. Likewise, the real-valued negative database is related to (but different from) the negative selection algorithm with the real-valued representation (González et al., 2003; Ji and Dasgupta, 2004; Ji and Dasgupta, 2006; Ji and Dasgupta, 2007).

The Real-Valued Negative Database
Assume a real-valued positive database (DB) contains n entries, i.e. DB = {x_1, x_2, …, x_n}. There are m attributes {R_1, R_2, …, R_m} in DB, and the domain of each attribute R_k (k = 1…m) is the interval I_k = [l_k, u_k].
The real-valued negative database only stores the information that belongs to the complementary set of the real-valued positive database. Since the instances covered by the real-valued negative database are usually too many to be listed exactly, intervals are introduced to compress them.
Suppose a is an entry with m real values, and v is an entry with m intervals. Entry a is matched with (or covered by) entry v if and only if the following condition is satisfied: a[k] ∈ v[k] for every k = 1…m.
Based on the above matching rule, the real-valued negative database for DB can be defined as follows.
Definition 1. (Real-Valued Negative Database) Given the real-valued positive database DB and the universal set U = I_1 × I_2 × … × I_m, the real-valued negative database (RvNDB) for DB is a compressed representation of U − DB. Each entry in RvNDB consists of m intervals and does not cover any entry in DB.
If RvNDB covers the whole complementary set of DB, RvNDB is said to be complete. Otherwise, RvNDB is said to be incomplete. A simple database query can be processed directly upon the real-valued negative database. For any s (a vector with m real values), if s is covered by RvNDB, it does not belong to DB; if s is not covered by RvNDB and RvNDB is complete, it belongs to DB.
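The matching rule and the simple query described above can be sketched in Python as follows. The `Interval` type, the function names and the toy data are assumptions made for illustration; in particular, the toy positive database is taken to be the single cell [0, 0.5) × [0.5, 1.0] over the domain [0, 1.0] × [0, 1.0].

```python
from typing import NamedTuple, Optional, Sequence


class Interval(NamedTuple):
    lo: float
    hi: float
    closed_right: bool = False  # [lo, hi) by default; [lo, hi] when True

    def contains(self, x: float) -> bool:
        return self.lo <= x < self.hi or (self.closed_right and x == self.hi)


def covers(entry: Sequence[Interval], s: Sequence[float]) -> bool:
    """Entry v covers s iff s[k] lies in v[k] for every attribute k."""
    return len(entry) == len(s) and all(iv.contains(x) for iv, x in zip(entry, s))


def in_db(rvndb, s, complete: bool) -> Optional[bool]:
    """False if s is covered; True if s is uncovered and RvNDB is complete;
    None if s is uncovered but RvNDB is incomplete (no conclusion)."""
    if any(covers(v, s) for v in rvndb):
        return False
    return True if complete else None


# Hypothetical complete RvNDB for the cell [0, 0.5) x [0.5, 1.0].
rvndb = [
    (Interval(0.5, 1.0, True), Interval(0.0, 1.0, True)),
    (Interval(0.0, 0.5), Interval(0.0, 0.5)),
]
```

With this toy data, `in_db(rvndb, (0.8, 0.1), True)` reports that the point is not in DB, while `in_db(rvndb, (0.2, 0.7), True)` reports that it is.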
As any two entries in the real-valued negative database may intersect with each other, one real-valued positive database can be mapped to multiple real-valued negative databases. An example is given in table 1.
Notes: There are two attributes in DB. The domains of the two attributes are both [0, 1.0].

The NP-Hard Property of RvNDBs
In this section, reversing the real-valued negative database (RvNDB) is proved to be an NP-hard problem. The proofs are similar to those in (Esponda et al., 2004b; Esponda et al., 2009). Based on the hardness of reversing the real-valued negative database, the real-valued negative database can be used to preserve privacy.
(1) Divide the domain I_k of each attribute into two segments, [l_k, p_k) and [p_k, u_k], where p_k is a split point inside I_k. Then encode the three intervals [l_k, p_k), [p_k, u_k] and I_k as follows: the two segments are encoded as the binary values 0 and 1, and the whole interval I_k is encoded as "*".
As the interval I_k covers both [l_k, p_k) and [p_k, u_k], the symbol "*" represents either 0 or 1.
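(2) Each clause C_i is mapped to an entry y_i of RvNDB_φ and, at the same time, to an entry e_i of a binary negative database denoted as eNDB_φ = {e_1, e_2, …, e_n}. If the variable x_k appears in C_i, e_i[k] is set to the binary value that makes the corresponding literal false, and y_i[k] is set to the segment encoded by that value; otherwise, y_i[k] is set as I_k and e_i[k] is set as "*".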
After all the clauses of the CNF-SAT instance are mapped to entries, the real-valued negative database RvNDB_φ, expressed with intervals, is constructed. The database eNDB_φ is the binary form of RvNDB_φ, and they can easily be converted to each other. The database eNDB_φ has the same structure as the binary negative database defined in (Esponda et al., 2004b; Esponda et al., 2009).
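A brute-force check of this construction on a tiny instance may make the correspondence concrete. The sketch below assumes that each specified position of an entry holds the value that falsifies the corresponding literal (as the proof of Lemma 3 suggests), that clauses are written DIMACS-style (k for x_k, -k for its negation), and that no clause contains a variable together with its negation. All names and the toy formula are hypothetical.

```python
from itertools import product


def cnf_to_endb(clauses, m):
    """Map each clause to an entry over {0, 1, *} of length m."""
    endb = []
    for clause in clauses:
        e = ["*"] * m  # variables absent from the clause stay unspecified
        for lit in clause:
            # Store the value of x_k that makes this literal false.
            e[abs(lit) - 1] = "0" if lit > 0 else "1"
        endb.append("".join(e))
    return endb


def covered(s, ndb):
    return any(all(c in ("*", b) for b, c in zip(s, y)) for y in ndb)


def satisfies(s, clauses):
    return all(any((s[abs(l) - 1] == "1") == (l > 0) for l in c) for c in clauses)


# Hypothetical instance: (x1 OR NOT x2) AND (x2 OR x3), with m = 3.
clauses = [[1, -2], [2, 3]]
endb = cnf_to_endb(clauses, 3)

# Every assignment is either covered by eNDB or satisfies the formula.
assignments = ["".join(bits) for bits in product("01", repeat=3)]
equivalent = all(covered(s, endb) != satisfies(s, clauses) for s in assignments)
```

The exhaustive check is only feasible for toy sizes; it illustrates the correspondence between uncovered strings and satisfying assignments, not an efficient procedure.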
Lemma 3. No entry in eNDB_φ is a true assignment of φ.
Proof. Obviously, if e_i[k] (k = 1…m) is assigned to the k-th variable x_k, the entry e_i is not a true assignment of C_i. Because each entry in eNDB_φ fails to satisfy at least one clause of φ, it is not a true assignment of φ.
Lemma 4. Each true assignment of the CNF-SAT instance φ corresponds to a real-valued entry not covered by RvNDB_φ, and vice versa.
Proof. For any true assignment a of the CNF-SAT instance φ, as every clause of φ is satisfied by a and no entry in eNDB_φ is a true assignment of φ, a differs from each entry in eNDB_φ in at least one specified bit. That is to say, a is not covered by eNDB_φ. According to equation 2, the assignment a can be converted to an entry v that consists of intervals, and obviously the entry v is not covered by RvNDB_φ.
For any entry w that consists of m real values and is not covered by any entry in RvNDB_φ, it can be encoded to a binary string a as follows.
As w is not covered by any entry
Proof. According to lemma 4, the problem of checking the satisfiability of the instance φ is equivalent to problem 1 for RvNDB_φ. Furthermore, since the instance φ is chosen arbitrarily, any instance of CNF-SAT can be converted to a special real-valued negative database. Therefore, problem 1 is NP-complete.
Theorem 2. Problem 2 is NP-hard.

Generation Algorithm for RvNDBs
Some generation algorithms for the binary negative database have been proposed (Esponda et al., 2004b; Jia et al., 2005; Esponda et al., 2007a; Esponda et al., 2009; Liu et al., 2011). Based on these generation algorithms, an algorithm for generating real-valued negative databases is proposed in this section.
Given a positive database DB = {x_1, x_2, …, x_n} with m attributes, each entry in DB is a vector of m real values. The procedure for generating a real-valued negative database from DB is as follows.
(1) Preprocessing: Divide the domains of the attributes in DB, and convert DB to a real-valued database DB_I which consists of intervals.
(2) Encoding: Encode DB_I to a binary positive database DB_2.
(3) Generating: Input DB_2 to an algorithm for generating a binary negative database from a binary positive database, such as the q-hidden algorithm or the prefix algorithm, and output a binary negative database NDB_2.
(4) Decoding: Decode NDB_2 to a real-valued negative database RvNDB which consists of intervals.

Phase 1: Preprocessing
The preprocessing phase contains two processes: the dividing process and the converting process. In the dividing process, the domain of each attribute in DB is divided into several disjoint intervals. In the converting process, the values of each entry in DB are converted to the intervals to which they belong.
Dividing Process. The domain of each attribute in DB is divided into a set of intervals. For any k (k = 1…m), the interval set is P_k = {p_{k,1}, p_{k,2}, …, p_{k,num_k}}, where num_k is the number of intervals in P_k.
The set P_k should be generated according to the requirements of real-life applications and should satisfy the following basic conditions.
(1) The union of all the intervals in P_k equals I_k, i.e. p_{k,1} ∪ p_{k,2} ∪ … ∪ p_{k,num_k} = I_k.
(2) The intersection of any two different intervals in P_k is the empty set, i.e. p_{k,i} ∩ p_{k,j} = ∅ for all i ≠ j (1 ≤ i, j ≤ num_k).
(3) Since DB will be encoded to a binary database, ideally, the number of intervals in P_k should be a power of 2.

Figure 1. An algorithm for dividing process
Although the dividing process depends on the requirements of real-life applications, a simple algorithm is given in figure 1. The algorithm in figure 1 divides each domain I_k (k = 1…m) equally into num_k intervals. This algorithm can be applied to applications such as privacy-preserving data publication.
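An equal-width division of this kind can be sketched in Python as follows. This is an illustration written from the description above, not a transcription of figure 1; the function name and the tuple layout (low, up, closed_right) are assumptions.

```python
def divide_domain(l_k: float, u_k: float, num_k: int):
    """Split [l_k, u_k] into num_k equal-width intervals.

    Every interval is half-open, [low, up), except the last one, which is
    closed, [low, u_k], so that the union of the intervals is exactly I_k.
    Each interval is returned as a tuple (low, up, closed_right).
    """
    if num_k < 1 or u_k <= l_k:
        raise ValueError("need num_k >= 1 and l_k < u_k")
    width = (u_k - l_k) / num_k
    cuts = [l_k + j * width for j in range(num_k)] + [u_k]
    return [(cuts[j], cuts[j + 1], j == num_k - 1) for j in range(num_k)]


# Hypothetical example: the domain [0, 1.0] divided into 4 intervals.
p_k = divide_domain(0.0, 1.0, 4)
```

Using the exact upper bound u_k as the last cut point avoids a rounding gap at the right end of the domain.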
Converting Process. According to the above dividing process, DB can be converted to a real-valued positive database DB_I which consists of intervals as follows.
Let DB_I = {t_1, t_2, …, t_n}. For each entry x_i (i = 1…n) in DB, the value of the k-th (k = 1…m) attribute is converted to the interval in P_k to which x_i[k] belongs, i.e. t_i[k] = p_{k,j}, where p_{k,j} ∈ P_k and x_i[k] ∈ p_{k,j}.
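The conversion can be sketched in Python by locating each value among the sorted cut points of its attribute. The representation of P_k as a list of cut points and the 0-based interval indexes are assumptions of this sketch (the paper indexes intervals from 1).

```python
from bisect import bisect_right


def convert_entry(x, cuts):
    """Convert an entry of m real values to m interval indexes (0-based).

    cuts[k] is the sorted list of cut points [l_k, ..., u_k] of attribute k;
    interval j of attribute k is [cuts[k][j], cuts[k][j + 1]), and the last
    interval also contains the upper bound u_k.
    """
    t = []
    for value, c in zip(x, cuts):
        if not c[0] <= value <= c[-1]:
            raise ValueError("value outside the attribute domain")
        # bisect_right - 1 finds the interval whose lower bound is <= value;
        # the min() sends value == u_k to the last (closed) interval.
        t.append(min(bisect_right(c, value) - 1, len(c) - 2))
    return t


# Hypothetical example with two attributes.
cuts = [[0.0, 0.25, 0.5, 0.75, 1.0], [0.0, 0.5, 1.0]]
t1 = convert_entry((0.3, 0.5), cuts)
t2 = convert_entry((1.0, 0.0), cuts)
```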

Phase 2: Encoding
In order to generate real-valued negative databases, the real-valued positive database DB_I is encoded to a binary database, and then an algorithm for generating negative databases from binary positive databases can be employed.
For the k-th (k = 1…m) attribute, since any two different intervals are disjoint and the number of intervals in P_k is a power of 2, it is easy to encode the num_k intervals in P_k as num_k binary strings of length log_2(num_k). According to the encoding of the intervals in P_k, the entries in DB_I can be converted to binary strings. The details of the encoding phase are shown in figure 2.
The algorithm shown in figure 3 is used for generating the binary code from an integer. If the length of the binary code is less than l, some zeros will be attached after it. It follows that all the generated binary strings have the same length. In the encoding phase, this algorithm is employed to encode the intervals in P_k (k = 1…m) according to their indexes.
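A fixed-length encoding of interval indexes can be sketched in Python as follows. The sketch assumes 0-based indexes and most-significant-bit-first codes padded with leading zeros; the exact bit order and padding side used in figures 2 and 3 may differ, so this convention is an assumption.

```python
def encode_entry(t, nums):
    """Encode interval indexes as one binary string.

    t[k] is the 0-based index of the interval of attribute k, and nums[k]
    is num_k, the number of intervals of attribute k (a power of 2, >= 2).
    """
    bits = []
    for j, num_k in zip(t, nums):
        length = num_k.bit_length() - 1  # log2(num_k) for a power of 2
        if num_k < 2 or num_k != 1 << length:
            raise ValueError("num_k must be a power of 2 and at least 2")
        if not 0 <= j < num_k:
            raise ValueError("interval index out of range")
        bits.append(format(j, "b").zfill(length))  # pad to a fixed length
    return "".join(bits)


# Hypothetical example: 4 intervals for attribute 1, 2 for attribute 2.
code1 = encode_entry([1, 1], [4, 2])
code2 = encode_entry([3, 0], [4, 2])
```

Because every attribute contributes exactly log_2(num_k) bits, all encoded entries have the same total length, which the binary generation algorithms require.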

[Figure: Encode algorithm]
[Figure 1, surviving steps of the dividing algorithm: add [low, up) to P_k; add [low, u_k] to P_k]

Phase 3: Generating
In the encoding phase, a binary database DB_2 has been generated from the real-valued positive database DB_I. In the generating phase, DB_2 is input to an algorithm for generating negative databases from a binary positive database, such as the prefix algorithm (Esponda et al., 2004b; Esponda et al., 2009), the RNDB algorithm (Esponda et al., 2004b; Esponda et al., 2009) or the q-hidden algorithm (Jia et al., 2005; Esponda et al., 2007a), and the generation algorithm outputs a binary negative database NDB_2 = {z_1, z_2, …, z_N}.
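As an illustration of this phase, a prefix-style generation can be sketched in Python. The sketch follows the general idea of the prefix algorithm as it is commonly described (for each length i, emit every i-bit pattern that is absent from DB_2 but whose (i−1)-bit prefix is present, padded with "*"); it is written from that description and is not a transcription of the cited papers. It assumes a non-empty DB_2.

```python
def prefix_ndb(db, m):
    """Generate a complete binary negative database for a non-empty set db
    of binary strings of length m."""
    ndb = []
    for i in range(1, m + 1):
        present = {x[:i] for x in db}      # i-bit prefixes that occur in db
        parents = {x[:i - 1] for x in db}  # (i-1)-bit prefixes that occur
        for p in sorted(parents):
            for b in "01":
                if p + b not in present:
                    # No entry of db starts with p + b: cover all of them.
                    ndb.append(p + b + "*" * (m - i))
    return ndb


# Hypothetical example with m = 3.
db2 = {"000", "011"}
ndb2 = prefix_ndb(db2, 3)
```

The result is complete, and, as noted earlier, negative databases generated by the prefix algorithm are easy to reverse.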

Phase 4: Decoding
In the generating phase, a binary negative database NDB_2 is obtained from the binary positive database DB_2. It is not convenient to use the binary negative database in the real-valued space. Therefore, in the decoding phase, the binary negative database NDB_2 is converted to a real-valued negative database RvNDB. The algorithm for the decoding phase is given in figure 4. Since the entries in NDB_2 are defined upon the alphabet {0, 1, *}, and the symbol "*" represents either 0 or 1 at a given position, each entry may cover multiple strings of specified values (i.e. 0 and 1). An extra algorithm for decoding a string defined upon the alphabet {0, 1, *} to a set of intervals is given in figure 5.
The algorithm in figure 5 enumerates every specified string which is covered by the string str, and converts these specified strings to intervals. Finally, the adjacent intervals in W are merged.
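The enumerate-and-merge idea can be sketched in Python for the code of a single attribute. The sketch assumes 0-based interval indexes, most-significant-bit-first codes, and intervals given as (low, up) pairs in increasing order with boundary closedness ignored; these conventions and the names are assumptions, not details taken from figure 5.

```python
from itertools import product


def decode_attribute(s, intervals):
    """Decode a string over {0, 1, *} into a merged list of intervals.

    s is the code of one attribute (length log2(len(intervals)), at least 1)
    and intervals is the ordered list P_k of (low, up) pairs.
    """
    stars = [i for i, c in enumerate(s) if c == "*"]
    indexes = []
    # Enumerate every fully specified string covered by s.
    for bits in product("01", repeat=len(stars)):
        chars = list(s)
        for pos, b in zip(stars, bits):
            chars[pos] = b
        indexes.append(int("".join(chars), 2))
    merged = []
    for j in sorted(indexes):
        low, up = intervals[j]
        if merged and merged[-1][1] == low:
            merged[-1] = (merged[-1][0], up)  # merge adjacent intervals
        else:
            merged.append((low, up))
    return merged


# Hypothetical example: the domain [0, 1.0] divided into 4 intervals.
p_k = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
w1 = decode_attribute("0*", p_k)
w2 = decode_attribute("*1", p_k)
```

Note that a single code may decode to several non-adjacent intervals, as `w2` shows; the enumeration is exponential in the number of "*" symbols within one attribute code, which is bounded by log_2(num_k).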

Application to the Privacy-Preserving Data Publication
As sensitive data is involved in many applications nowadays, the preservation of data privacy has become a widespread concern. Privacy-preserving data publication is a technique which can both preserve the privacy and maintain the utility of the published data.
Data generalization is an important technique for protecting sensitive data and preserving privacy (Fung et al., 2010). In the preprocessing phase of the generation algorithm for the real-valued negative database, the conversion from real values to intervals can be regarded as the generalization of real values, and the division of domains determines the generalized intervals. Therefore, when applying the real-valued negative database to privacy-preserving data publication, the first phase can be replaced by a generalization technique, such as an algorithm that satisfies the k-anonymity principle (Sweeney, 2002). Then, a real-valued negative database can be generated from the generalized positive database through the generation algorithm described in the previous section.
An example of applying the real-valued negative database to privacy-preserving data publication is given as follows. The original data is shown in table 2. There are four attributes in the original positive database, and the attribute "Name" is the explicit identifier. The combination of attributes <Age, Postcode> is regarded as the quasi-identifier. The sensitive attribute is "Salary". The domains of the last three attributes
[Figure 5, surviving steps of the algorithm for decoding a string: let str' be the same as str but with the unspecified positions assigned according to T; let temp be the decimal value of str'; add p_{k,temp+1} to W; merge the adjacent intervals in W]
The generalized data which satisfies the 2-anonymity principle (the k-anonymity principle demands that each entry in the published database cannot be distinguished from at least k−1 other entries (Sweeney, 2002)) is shown in table 3. The binary positive database is shown in table 4. The binary negative database generated by the prefix algorithm (Esponda et al., 2004b; Esponda et al., 2009) from the binary positive database is shown in table 5. Finally, the real-valued negative database decoded from the binary negative database is shown in table 6.
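The k-anonymity condition quoted above can be checked with a short Python sketch. The rows below are hypothetical toy data invented for illustration and are not the contents of tables 2 or 3; the function name is likewise an assumption.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at
    least k rows, so each row is indistinguishable from k - 1 others."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical generalized rows (intervals written as strings).
rows = [
    {"Age": "[20, 30)", "Postcode": "[10000, 20000)", "Salary": 5.0},
    {"Age": "[20, 30)", "Postcode": "[10000, 20000)", "Salary": 7.5},
    {"Age": "[30, 40)", "Postcode": "[20000, 30000)", "Salary": 6.0},
    {"Age": "[30, 40)", "Postcode": "[20000, 30000)", "Salary": 9.0},
]
two_anonymous = is_k_anonymous(rows, ("Age", "Postcode"), 2)
three_anonymous = is_k_anonymous(rows, ("Age", "Postcode"), 3)
```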

Discussion
The real-valued negative database can be applied to privacy-preserving data publication. The preprocessing phase of the generation algorithm for the real-valued negative database could be replaced by an existing generalization algorithm. The privacy of the published data is preserved not only through the generalization but also through the real-valued negative database. If high data precision is expected, the generalized intervals can be restricted to small ranges. Even if the sensitive data is not generalized, it is still under the protection of the negative representation. If the real-valued negative database is complete, it can be considered "equivalent" to the generalized positive database, and no extra information is lost. Furthermore, the relationship between the real-valued positive database and the real-valued negative database is one-to-many, and it is hard to check whether two real-valued negative databases correspond to the same positive database (the hardness could be roughly controlled through the generation algorithm for negative databases). Therefore, the real-valued negative database could be properly applied to privacy-preserving data republication (Xiao and Tao, 2007) and the privacy-preserving publication of dynamic data (Jian et al., 2007; Xiao and Tao, 2007; Bu et al., 2008).

Conclusions and Future Work
Since the data in some applications is naturally represented in real-valued space, it is difficult to apply binary negative databases properly. Therefore, the real-valued negative database is proposed in this paper. Reversing the real-valued negative database is proved to be an NP-hard problem, and it follows that the real-valued negative database could be employed to protect data privacy. Based on the generation algorithms for the binary negative database, an effective algorithm for generating real-valued negative databases is proposed in this paper.
The real-valued negative database is applied to privacy-preserving data publication in this paper. The privacy of the published data is under the protection of both the generalization and the negative representation. Furthermore, the balance between security and data precision could be controlled through the level of generalization and the generation algorithm for the real-valued negative database.
Although the definition of and a generation algorithm for the real-valued negative database are given in this paper, further work is expected. Since the generation algorithm for the real-valued negative database is based on the generation algorithms for the binary negative database, more efficient generation algorithms dedicated to the real-valued negative database are expected to be proposed. Operations for the real-valued negative database such as select, delete, insert, project, union, intersection, set difference, Cartesian product and join urgently need to be designed. These database operations are critical for extending the applications of the real-valued negative database. Moreover, concrete and practical solutions for applying the real-valued negative database to privacy-preserving data publication will be considered in the future as well.

Age | Postcode | Salary (k$)
For the domain I_k = [l_k, u_k] of the k-th attribute, l_k is the lower bound and u_k is the upper bound. The bounds l_k and u_k are both real values, i.e. l_k ∈ R, u_k ∈ R. Each entry x_i (i = 1…n) is a vector of m real values, and each value x_i[k] (k = 1…m) belongs to the domain of the k-th attribute, i.e. x_i[k] ∈ I_k.
and the encoding result a[k] is different from e_i[k] as well, i.e. a[k] = ¬e_i[k] (the complement of e_i[k]). According to the encoding of RvNDB_φ, if a[k] is assigned to x_k, the clause C_i will be satisfied. Moreover, since w is not covered by any of the entries in RvNDB_φ, all the clauses in φ are satisfied by a, and a is a true assignment of the CNF-SAT instance φ.
Theorem 1. Problem 1 is NP-complete.
Problem 1. Is the positive database of RvNDB non-empty? That is, is there any entry that is not covered by RvNDB?
Problem 2. Can RvNDB be reversed to obtain the entries in the corresponding positive database? That is, can any entry that is not covered by RvNDB be found?
Given any entry w which consists of m real values, if an algorithm can check in polynomial time whether w is a solution, problem 1 is in NP. Obviously, the complexity of checking whether the entry w is matched with (or covered by) an entry y_i in RvNDB is O(m) (m is the number of attributes). Then the complexity of checking whether RvNDB covers the entry w is O(m|RvNDB|) (|RvNDB| is the number of entries in RvNDB). Therefore, checking whether an entry w is a solution of problem 1 can be done in O(m|RvNDB|), and problem 1 is in NP.
Lemma 2. Any CNF-SAT instance φ can be converted to a real-valued negative database RvNDB_φ.
Proof. Given any CNF-SAT instance φ with n clauses and m variables x_1, x_2, …, x_m, φ = C_1 ∧ C_2 ∧ … ∧ C_n, a real-valued negative database RvNDB_φ = {y_1, y_2, …, y_n} with m attributes and n entries can be constructed as follows.

Table 2. The original database

Table 3. The real-valued positive database which consists of intervals

Table 4. The binary positive database

Table 5. A binary negative database

Table 6. The real-valued negative database