Singular value decomposition-based frequent item set mining method for local differential privacy protection
1. A frequent item set mining method for local differential privacy protection based on singular value decomposition, characterized in that the method comprises the following steps:
step 1: estimating the frequency of the items held by smart home users, sorting the frequencies in descending order, selecting the top K items by frequency as the frequent items, and numbering the frequent items in sequence;
step 2: dividing the users into K groups, combining the K frequent items selected in step 1 in pairs, each pair being a frequent item set, and establishing an initial K x K matrix M according to the frequent item sets; setting an initial frequent item set group FIS, the FIS being empty;
step 3: letting k = 1;
step 4: performing singular value decomposition on M to obtain two orthogonal matrices U and V^T; the server sends U and V^T to the k-th group of users;
step 5: the r-th user in the k-th group of users receives U and V^T, establishes the singular matrix corresponding to the r-th user using the FIS and the frequent items held by that user, perturbs the singular matrix to obtain perturbed information, and uploads the perturbed information to the server, where r = 1, 2, …, R and R is the total number of users in the k-th group;
step 6: the server performs aggregation analysis on the R pieces of perturbed information received, thereby mining the most frequent item set, puts that item set into the FIS, and sets to 0 the elements of M that correspond to the frequent item sets in the FIS;
step 7: letting k = k + 1 and judging whether k is greater than K; if so, stopping the calculation and taking the FIS as the mining result; otherwise, returning to step 4.
2. The singular value decomposition-based frequent item set mining method for local differential privacy protection according to claim 1, characterized in that step 1 specifically comprises: dividing the smart home users into three groups; each user in the first group randomly selects one item, perturbs it, and uploads the perturbed information to the server, and the server performs aggregation analysis on the perturbed information sent by the first group to obtain the top 2 x K items by frequency, which form a candidate item set; the server sends the candidate item set to the second group of users, each user in the second group calculates the intersection of the set of items it owns with the candidate item set, takes the number of items in the intersection as the data to be perturbed, and uploads the perturbed information to the server, and the server performs aggregation and recovery on the perturbed information sent by the second group to obtain a sampling number, and sends the sampling number and the candidate item set to the third group of users; each user in the third group calculates the intersection J of the set of items it owns with the candidate item set and, if the number of items in J is smaller than the sampling number, adds virtual items to J so that the number of items in J equals the sampling number; each user in the third group then randomly selects one item from J, perturbs it, and uploads the perturbed information to the server, and the server performs aggregation and recovery on the perturbed information sent by the third group to obtain the frequency estimates of the items owned by all users.
3. The singular value decomposition-based frequent item set mining method for local differential privacy protection according to claim 1, characterized in that: in step 2, the value of the element in row i and column j of the matrix M is calculated from the frequent item set composed of the i-th frequent item and the j-th frequent item:
m(i,j)=min(f(i),f(j))
where min(·) is the minimum function, f(i) is the frequency of the i-th frequent item, f(j) is the frequency of the j-th frequent item, i = 1, 2, …, K, and j = 1, 2, …, K.
4. The singular value decomposition-based frequent item set mining method for local differential privacy protection according to claim 1, characterized in that: the step 5 specifically comprises the following steps:
step 3.1: comparing the frequent items held by the r-th user with the frequent item sets in the FIS one by one; if the set formed by the frequent items held by the r-th user is identical to a frequent item set in the FIS, setting all elements of the singular matrix S_r corresponding to that user to 0, stopping the comparison, and going to step 3.3; if the intersection of the set formed by the frequent items held by the r-th user and a frequent item set contains 1 or 0 elements, leaving the frequent items held by the r-th user unchanged; if that intersection contains 2 elements, deleting either one of the frequent items in the intersection from the frequent items held by the user so as to update the frequent items held by the user, and then performing the next round of comparison; once all frequent item sets in the FIS have been compared, going to step 3.2;
step 3.2: representing the updated frequent items held by the user with a K x K matrix M_r:
M_r(x, y) = 1 if x ∈ Q and y ∈ Q; M_r(x, y) = 0 otherwise
where M_r(x, y) denotes the element in row x and column y of M_r; x = 1, 2, …, K; y = 1, 2, …, K; and Q is the set of numbers of the updated frequent items held by the user;
calculating the singular matrix of the r-th user from U and V^T:
S_r = U^+ * M_r * (V^T)^+
where U^+ = U^T, T denotes matrix transposition, and (V^T)^+ = V;
step 3.3: mapping S_r onto a value range v, v ∈ [-1, 1], and perturbing the mapped value v to obtain perturbed information t_1 or t_2; the r-th user uploads t_1 to the server with probability p, or uploads t_2 to the server with probability 1 - p, where ε is the privacy parameter governing the perturbation.
5. The singular value decomposition-based frequent item set mining method for local differential privacy protection according to claim 1, characterized in that: the step 6 specifically comprises the following steps:
the server performs aggregation analysis on the received perturbed information to obtain an estimation matrix \hat{S}, and calculates a K x K matrix \hat{M} according to the following formula:
\hat{M} = U * \hat{S} * V^T
the frequent item set corresponding to the element of the matrix M located at the same position as the maximum element of \hat{M} is added to the FIS.
Background
With the development of the economy, science and technology, information technologies such as computing, communication and sensing have advanced rapidly, making modern family life more convenient and comfortable. The smart home has moved from concept to reality and become a familiar term. The intelligence of a smart home lies mainly in the automatic execution of actions by smart devices: typically a sensor detects a trigger condition and, when the condition is met, the system instructs the corresponding device to execute the corresponding action. Such trigger-action combinations are the cornerstone of smart home automation. The system needs to provide reasonable preset combinations to help users deploy smart applications quickly, and it also needs to learn new combinations from users for subsequent improvement and optimization. If both triggers and actions are regarded as items owned by a user, the combinations that occur frequently among the trigger-action combinations owned by the majority of users correspond to the frequent item sets of the data set.
A frequent item set is a common form of correlation within data: it is a collection of several items that frequently appear together in a data set. Frequent item set mining, however, generally requires collecting and analyzing users' raw data. How to mine frequent item sets while protecting users' private information is therefore both the key to implementing various smart applications and a technical problem facing their further development.
In recent years, differential privacy has been proposed as a rigorous definition of privacy protection. It is achieved by adding noise that satisfies specific properties to a user's personal data, so that the statistical information of the overall data is preserved while a third party cannot infer personal information. Traditional differential privacy, however, requires a trusted server, and that server can still observe users' raw data. To avoid relying on a trusted third-party server, local differential privacy techniques add the noise on the user side: the user uploads only the perturbed information to the server, so the original information is never exposed to the third-party server.
One prior-art approach guesses a candidate set of frequent item sets on the assumption that frequent item sets are composed entirely of frequent items. It first finds the frequent items in the data set under the definition of local differential privacy, then constructs the candidate set of frequent item sets according to that assumption and sends it to the users; each user uploads the item set it owns, and the server determines the final top-k frequent item sets by aggregation. However, this method ignores the possibility that a frequent item set may be composed of items that are themselves relatively infrequent, so some frequently associated item sets may be missed and the estimation result may be inaccurate.
The CALM method has also been proposed, in which users are assigned to predefined marginal tables, each user perturbs and uploads the values of the corresponding attributes in its marginal table, and the server obtains the joint distribution over multiple attributes through aggregation and recovery. With this method, however, a user has to upload multiple pieces of data, which not only splits the privacy budget and thereby increases the noise, but also increases the communication overhead.
How to realize a frequent item set mining method based on local differential privacy that balances data privacy, mining accuracy and communication overhead remains a difficult problem.
Disclosure of Invention
The purpose of the invention is as follows: in order to solve the problems in the prior art, the invention provides a frequent item set mining method for local differential privacy protection based on singular value decomposition.
The technical scheme is as follows: the invention provides a frequent item set mining method for local differential privacy protection based on singular value decomposition, which specifically comprises the following steps:
step 1: estimating the frequency of the items held by smart home users, sorting the frequencies in descending order, selecting the top K items by frequency as the frequent items, and numbering the frequent items in sequence;
step 2: dividing the users into K groups, combining the K frequent items selected in step 1 in pairs, each pair being a frequent item set, and establishing an initial K x K matrix M according to the frequent item sets; setting an initial frequent item set group FIS, the FIS being empty;
step 3: letting k = 1;
step 4: performing singular value decomposition on M to obtain two orthogonal matrices U and V; the server sends U and V^T to the k-th group of users, where T denotes matrix transposition;
step 5: the r-th user in the k-th group of users receives U and V^T, establishes the singular matrix corresponding to the r-th user using the FIS and the frequent items held by that user, perturbs the singular matrix to obtain perturbed information, and uploads the perturbed information to the server, where r = 1, 2, …, R and R is the total number of users in the k-th group;
step 6: the server performs aggregation analysis on the R pieces of perturbed information received, thereby mining the most frequent item set, puts that item set into the FIS, and sets to 0 the elements of M that correspond to the frequent item sets in the FIS;
step 7: letting k = k + 1 and judging whether k is greater than K; if so, stopping the calculation and taking the FIS as the mining result; otherwise, returning to step 4.
Further, step 1 specifically comprises: dividing the smart home users into three groups; each user in the first group randomly selects one item, perturbs it, and uploads the perturbed information to the server, and the server performs aggregation analysis on the perturbed information sent by the first group to obtain the top 2 x K items by frequency, which form a candidate item set; the server sends the candidate item set to the second group of users, each user in the second group calculates the intersection of the set of items it owns with the candidate item set, takes the number of items in the intersection as the data to be perturbed, and uploads the perturbed information to the server, and the server performs aggregation and recovery on the perturbed information sent by the second group to obtain a sampling number, and sends the sampling number and the candidate item set to the third group of users; each user in the third group calculates the intersection J of the set of items it owns with the candidate item set and, if the number of items in J is smaller than the sampling number, adds virtual items to J so that the number of items in J equals the sampling number; each user in the third group then randomly selects one item from J, perturbs it, and uploads the perturbed information to the server, and the server performs aggregation and recovery on the perturbed information sent by the third group to obtain the frequency estimates of the items owned by all users.
Further, in step 2, the value of the element in row i and column j of the matrix M is calculated from the frequent item set composed of the i-th frequent item and the j-th frequent item:
m(i,j)=min(f(i),f(j))
where min(·) is the minimum function, f(i) is the frequency of the i-th frequent item, f(j) is the frequency of the j-th frequent item, i = 1, 2, …, K, and j = 1, 2, …, K.
Further, the step 5 specifically includes:
step 3.1: comparing the frequent items held by the r-th user with the frequent item sets in the FIS one by one; if the set formed by the frequent items held by the r-th user is identical to a frequent item set in the FIS, setting all elements of the singular matrix S_r corresponding to that user to 0, stopping the comparison, and going to step 3.3; if the intersection of the set formed by the frequent items held by the r-th user and a frequent item set contains 1 or 0 elements, leaving the frequent items held by the r-th user unchanged; if that intersection contains 2 elements, deleting either one of the frequent items in the intersection from the frequent items held by the user so as to update the frequent items held by the user, and then performing the next round of comparison; once all frequent item sets in the FIS have been compared, going to step 3.2;
step 3.2: representing the updated frequent items held by the user with a K x K matrix M_r:
M_r(x, y) = 1 if x ∈ Q and y ∈ Q; M_r(x, y) = 0 otherwise
where M_r(x, y) denotes the element in row x and column y of M_r; x = 1, 2, …, K; y = 1, 2, …, K; and Q is the set of numbers of the updated frequent items held by the user;
calculating the singular matrix of the r-th user from U and V^T:
S_r = U^+ * M_r * (V^T)^+
where U^+ = U^T and (V^T)^+ = V;
step 3.3: mapping S_r onto a value range v, v ∈ [-1, 1], and perturbing the mapped value v to obtain perturbed information t_1 or t_2; the r-th user uploads t_1 to the server with probability p, or uploads t_2 to the server with probability 1 - p, where ε is the privacy parameter governing the perturbation.
Further, the step 6 specifically includes:
the server performs aggregation analysis on the received perturbed information to obtain an estimation matrix \hat{S}, and calculates a K x K matrix \hat{M} according to the following formula:
\hat{M} = U * \hat{S} * V^T
the frequent item set corresponding to the element of the matrix M located at the same position as the maximum element of \hat{M} is added to the FIS.
Advantageous effects:
(1) the privacy of users' item set information is protected under the strict definition of local differential privacy;
(2) the mining result for frequent item sets is accurate, and frequent item sets composed of items that are not the most frequent are not ignored;
(3) the communication overhead between the user side and the server side is reduced.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate an embodiment of the invention and, together with the description, serve to explain the invention and not to limit the invention.
The invention provides a new privacy-preserving frequent item set mining method based on the local differential privacy framework, which balances data privacy, mining accuracy and communication overhead. The basic idea is as follows: first, on the distributed user side, each user's original high-dimensional sensitive data matrix is represented by a low-dimensional singular value matrix; second, the insensitive singular values are collected and recovered on the server side, finally yielding an accurate estimate of the frequent item set group.
As shown in FIG. 1, this embodiment provides a frequent item set mining method for local differential privacy protection based on singular value decomposition:
Step 1: frequent item frequency estimation
A user may own many kinds of items. If all items that could possibly appear were used as the input domain, the estimation error would be too large, because the differential privacy error grows linearly with the size of the domain, and data utility would suffer. The range of candidate items therefore needs to be narrowed.
The users of the smart home are divided into three groups. Each user in the first group randomly selects one item, perturbs it, and uploads it to the server; the server performs aggregation analysis on the received information to obtain the top 2 x K items by frequency, and these items form the candidate item set. The server sends the candidate item set to the second group of users; each user in the second group calculates the intersection of its own items with the candidate item set, takes the number of items in the intersection as the data to be perturbed, and uploads it; the server performs aggregation and recovery on the perturbed data sent by the second group, determines the sampling number, and sends the sampling number and the candidate item set to the third group of users. Each user in the third group calculates the intersection of its own items with the candidate item set and, if the number of items in the intersection is smaller than the sampling number, adds virtual items to the intersection so that the number of items equals the sampling number; each user in the third group then randomly selects one item from its intersection, perturbs it, and uploads it to the server. The server performs aggregation and recovery on the received information and can thus estimate the frequency of the frequent items held by users and obtain the top-K frequent items, i.e. the K items with the highest frequency.
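The client-side processing of the third group described above can be sketched as follows. This is an illustrative Python sketch only and not part of the claimed method: the function name pad_and_sample, the dummy-item labels and the uniform random choice are assumptions, items are assumed to be strings, and the perturbation applied to the selected item is omitted because the text does not specify it.

```python
import random


def pad_and_sample(user_items, candidate_set, sample_size, rng=random):
    """Return the single item a third-group user would go on to perturb.

    Hypothetical helper: names and dummy labels are illustrative assumptions.
    """
    # Intersection of the user's own items with the candidate item set.
    intersection = sorted(set(user_items) & set(candidate_set))
    # Pad with virtual (dummy) items until the list holds sample_size entries.
    # If the intersection is already that large, nothing is added
    # (assumption: the text does not describe truncation).
    padding = [f"dummy_{n}" for n in range(sample_size - len(intersection))]
    padded = intersection + padding
    if not padded:
        return None  # nothing to report (assumption)
    return rng.choice(padded)
```

For example, a user holding {"light_on", "door_open"} with candidate set {"light_on", "fan_on"} and a sampling number of 3 would draw uniformly from "light_on" and two dummy items.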
Step 2: the users are divided into K groups; on the server side, an initial matrix is established from the frequent item estimation result, and singular value decomposition is performed on it to obtain a left matrix and a right matrix (orthogonal matrices, i.e. the left and right singular matrices), which are sent to the k-th group of users; an initial frequent item set group FIS is set, the FIS being empty at this point;
The final objective of the invention is to obtain a frequent item set group containing the top-K (K most frequent) item sets, so the users need to be divided into K groups. Let i and j denote two different frequent items, and let f(i) and f(j) denote the frequency of the i-th frequent item and the frequency of the j-th frequent item, i = 1, 2, …, K, j = 1, 2, …, K. Based on the frequency estimates of the frequent items obtained in Step 1, the server calculates, for the frequent item set consisting of the i-th and j-th frequent items, the value of the element in row i and column j of the matrix M as m(i,j) = min(f(i), f(j)). The smaller of the two item frequencies is chosen as the matrix element because, for any item i with frequency f(i), the frequency of any frequent item set containing item i can be at most f(i). To ensure the accuracy of the estimate, each item set is thus assigned its maximum possible frequency. Singular value decomposition M = UΣV^T is then used to obtain the two orthogonal matrices U and V, where Σ is a diagonal matrix. To reduce the communication overhead, this embodiment keeps only n = 1 singular value to approximate the matrix, obtains U and V^T of order 1 × n and n × 1 respectively, and sends them to the first group of users, where T denotes matrix transposition.
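By way of illustration, the construction of M and its single-singular-value approximation can be sketched in Python as follows. The sketch is not taken from the original text: the function names are hypothetical, indices are 0-based, the leading singular vectors are treated as length-K lists, and power iteration is used here in place of a full singular value decomposition routine so that the example needs only the standard library.

```python
import math


def build_initial_matrix(freqs):
    """m(i, j) = min(f(i), f(j)) for the K frequent items (0-based indices)."""
    K = len(freqs)
    return [[min(freqs[i], freqs[j]) for j in range(K)] for i in range(K)]


def rank1_svd(M, iterations=200):
    """Approximate the leading singular triple (u, sigma, v) by power iteration.

    Assumption: this replaces a library SVD call for illustration only.
    """
    K = len(M)
    v = [1.0 / math.sqrt(K)] * K
    for _ in range(iterations):
        Mv = [sum(M[i][j] * v[j] for j in range(K)) for i in range(K)]
        # w = M^T (M v): one power-iteration step on M^T M.
        w = [sum(M[i][j] * Mv[i] for i in range(K)) for j in range(K)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # M maps v to zero, e.g. M is the zero matrix
        v = [x / norm for x in w]
    Mv = [sum(M[i][j] * v[j] for j in range(K)) for i in range(K)]
    sigma = math.sqrt(sum(x * x for x in Mv))
    u = [x / sigma for x in Mv] if sigma > 0.0 else [0.0] * K
    return u, sigma, v
```

With estimated frequencies [0.5, 0.3, 0.2], build_initial_matrix yields a symmetric 3 x 3 matrix whose first row is [0.5, 0.3, 0.2]; rank1_svd then returns unit vectors u and v, which in this sketch play the role of the U and V sent to the users.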
Step 3: each user terminal in the k-th group establishes a matrix M_r according to its locally held frequent items and the FIS (when k = 1 the FIS is empty and has no effect, so M_r is established from the locally held frequent items alone), then establishes a singular value matrix according to U and V^T, perturbs it, and uploads it to the server; in this embodiment, perturbation means adding differential privacy noise;
step 3.1: the frequent items held by the r-th user are compared with the frequent item sets in the FIS one by one. If the frequent items held by the r-th user are exactly one of the frequent item sets, all elements of the user's singular value matrix S_r are set to 0, the comparison stops, and the method goes to step 3.3. If the intersection of the frequent items held by the user with a frequent item set contains 1 or 0 items, the frequent items held by the r-th user are unchanged. If the intersection contains 2 items, then, to avoid affecting the estimation of other frequent item sets that contain these frequent items, one frequent item in the intersection is randomly deleted from the frequent items held by the r-th user (for example, if the intersection contains the two frequent items a and b and the user holds the three frequent items a, b and d, then either a or b is randomly deleted from a, b and d), so as to update the frequent items held by the user, and the next round of comparison is performed. Once all frequent item sets in the FIS have been compared, the method goes to step 3.2. The original frequent items held by the user are still retained in this step;
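The comparison in step 3.1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name update_held_items is hypothetical, each entry of the FIS is modelled as a frozenset of two items, and the identity test is applied to the user's working copy of the held items, which the text leaves open.

```python
import random


def update_held_items(held, fis, rng=random):
    """Return (updated_items, zero_flag) for one user.

    zero_flag is True when the user's items coincide with an item set already
    in the FIS, in which case S_r is to be set to 0.
    """
    current = set(held)  # working copy; the caller's original set is untouched
    for itemset in fis:
        if current == itemset:
            return current, True
        common = current & itemset
        if len(common) == 2:
            # Randomly drop one of the two shared items.
            current.discard(rng.choice(sorted(common)))
        # With 0 or 1 shared items the held items stay unchanged.
    return current, False
```

For a user holding a, b and d with FIS = [{a, b}], the function returns either {a, d} or {b, d} with the flag False, matching the example given in step 3.1.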
step 3.2: a K x K matrix M_r is used to represent the frequent items held by the user:
M_r(x, y) = 1 if x ∈ Q and y ∈ Q; M_r(x, y) = 0 otherwise
where M_r(x, y) denotes the element in row x and column y of M_r; x = 1, 2, …, K; y = 1, 2, …, K; and Q is the set of numbers of the updated frequent items held by the user. Every position in the matrix has a value; M_r is not a diagonal matrix like Σ. For example, suppose the updated frequent items owned by the user are a and d, and the K frequent items ranked by frequency are a, b, c and d, numbered 1, 2, 3 and 4. The numbers of the frequent items owned by the user are then 1 and 4, so in the matrix M_r the elements at positions (1,1), (1,4), (4,1) and (4,4) have the value 1 and the rest are 0.
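The worked example above (items a and d, numbered 1 and 4, with K = 4) can be reproduced with a short Python sketch. The function name build_user_matrix is hypothetical, and the item numbers in Q are taken to be 1-based as in the text.

```python
def build_user_matrix(Q, K):
    """K x K indicator matrix: entry (x, y) is 1 when both x and y are in Q.

    Q holds the 1-based numbers of the user's updated frequent items.
    """
    return [[1 if (x in Q and y in Q) else 0 for y in range(1, K + 1)]
            for x in range(1, K + 1)]


M_r = build_user_matrix({1, 4}, 4)
# Rows 1 and 4 are [1, 0, 0, 1]; rows 2 and 3 are all zeros.
```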
After receiving U and V^T, the r-th user of the k-th group obtains the corresponding Moore-Penrose generalized inverse matrices U^+ = U^T and (V^T)^+ = V from the properties of orthogonal matrices, and calculates the singular matrix of the r-th user:
S_r = U^+ * M_r * (V^T)^+
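When only one singular value is kept, the product above reduces to a single number. The following Python sketch illustrates this under the assumption that U and V are passed as length-K lists holding the leading left and right singular vectors; the function name user_singular_value is hypothetical.

```python
def user_singular_value(U, M_r, V):
    """Compute the scalar S_r = U^T * M_r * V for length-K vectors U and V."""
    K = len(U)
    return sum(U[x] * M_r[x][y] * V[y] for x in range(K) for y in range(K))


# Example with K = 4 and a user holding items 1 and 4 (1-based numbering).
U = [0.5, 0.5, 0.5, 0.5]
V = [0.5, 0.5, 0.5, 0.5]
M_r = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 1]]
S_r = user_singular_value(U, M_r, V)
```

Because M_r is an indicator matrix over Q, the result equals the sum of the entries of U indexed by Q multiplied by the sum of the entries of V indexed by Q.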
step 3.3: in this embodiment, U and V^T are of order 1 x n and n x 1 respectively, so S_r contains only one element. The user needs to apply differential privacy perturbation to S_r as follows: first, the range D of the singular values is obtained from the left and right singular matrices U and V^T and projected onto [-1, 1], and the user maps S_r to the corresponding value v ∈ [-1, 1]; then, using a random response mechanism, the user uploads the perturbed information t_1 with probability p, or uploads the perturbed information t_2 with probability 1 - p, where ε is the privacy parameter that determines p. The user therefore only needs to send 1 bit of data to the server, which greatly reduces the communication overhead.
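The exact expressions for t_1, t_2 and p are not reproduced in the text above. Purely as an illustration, the following Python sketch assumes the well-known one-bit mechanism of Duchi et al. for a value in [-1, 1], which satisfies ε-local differential privacy and gives an unbiased report; the linear mapping from the range D = [lo, hi] to [-1, 1] and both function names are likewise assumptions.

```python
import math
import random


def to_unit_range(s, lo, hi):
    """Linearly map s from [lo, hi] onto [-1, 1] (assumed form of the projection)."""
    if hi == lo:
        return 0.0
    return 2.0 * (s - lo) / (hi - lo) - 1.0


def perturb(v, epsilon, rng=random):
    """Assumed one-bit mechanism: report +c or -c, where c = (e^eps + 1) / (e^eps - 1)."""
    if not -1.0 <= v <= 1.0:
        raise ValueError("v must lie in [-1, 1]")
    e = math.exp(epsilon)
    c = (e + 1.0) / (e - 1.0)
    # Probability of reporting +c; it lies between 1/(e+1) and e/(e+1).
    p = (v * (e - 1.0) + e + 1.0) / (2.0 * (e + 1.0))
    return c if rng.random() < p else -c
```

Under this assumed mechanism the expected value of the report equals v, so averaging many reports on the server recovers the mean of the mapped values, and each report can be encoded as a single bit (the sign).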
Step 4: the server performs aggregation analysis on the received perturbed information, mines the most frequent item set, puts it into the FIS, and updates the initial matrix and the FIS. Then k is set to k + 1; if k is greater than K, the calculation stops and the FIS is taken as the final mining result, namely the final top-K frequent item sets; otherwise, the method returns to Step 3;
The server first performs aggregation estimation on the received perturbed information and recovers an estimation matrix \hat{S}, and then calculates the matrix \hat{M} = U * \hat{S} * V^T, which has the same dimensions as the matrix M. The position of the largest element of \hat{M} is mapped to the same position in the matrix M, and the frequent item set corresponding to the element at that position in M is added to the FIS. The elements of M corresponding to the frequent item sets in the FIS are then set to 0, singular value decomposition is performed on the updated M to obtain new orthogonal matrices, and the FIS and the new orthogonal matrices are sent together to the k-th group of users.
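One round of the server-side processing can be sketched in Python as follows. The sketch rests on several assumptions not stated in the text: the reports are unbiased so that their average estimates the mean mapped value, the mapping back from [-1, 1] to the range D = [lo, hi] is linear, U and V are length-K lists, only pairs of distinct items (i < j, 0-based) not yet in the FIS are considered, and both symmetric entries of M are set to 0. The function name aggregate_and_select is hypothetical.

```python
def aggregate_and_select(reports, lo, hi, U, V, M, fis):
    """Estimate S, rebuild M_hat = U * S_hat * V^T, and move the best pair into fis."""
    v_hat = sum(reports) / len(reports)           # mean of the perturbed reports
    s_hat = lo + (v_hat + 1.0) * (hi - lo) / 2.0  # map back from [-1, 1] to [lo, hi]
    K = len(U)
    M_hat = [[U[i] * s_hat * V[j] for j in range(K)] for i in range(K)]
    # Candidate item sets: pairs of distinct items not already mined (assumption).
    candidates = [(i, j) for i in range(K) for j in range(i + 1, K)
                  if frozenset((i, j)) not in fis]
    i, j = max(candidates, key=lambda pos: M_hat[pos[0]][pos[1]])
    fis.append(frozenset((i, j)))
    M[i][j] = M[j][i] = 0.0  # zero the mined item set in M (both symmetric entries)
    return i, j
```

After each call, the updated M would be decomposed again and the new matrices sent, together with the FIS, to the next group of users.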
The embodiments of the present invention have been described in detail with reference to the drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.