Privacy-preserving method and device for determining effective values of business data features

Document No.: 7989; publication date: 2021-09-17

1. A privacy-preserving method for determining effective values of business data features, wherein the business data is distributed among a plurality of participants, the respective business data of the participants would form joint data if spliced together, and the joint data comprises feature values of a plurality of objects for a plurality of feature items; the method is performed by the device of any first participant and comprises:

acquiring joint data fragments of a first participant, acquiring predicted value fragments corresponding to a plurality of objects respectively, and model parameter fragments corresponding to a plurality of feature items respectively; the predicted value fragment and the model parameter fragment are obtained based on a trained service prediction model;

determining, by using secure multi-party computation and through interaction among the plurality of participant devices, correlation data fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants, wherein the correlation data fragments comprise correlation data among the plurality of feature items;

and determining, by using a significance test method and through secure interaction among the plurality of participant devices, based on the model parameter fragments of the participants and the corresponding data in the correlation data fragments, the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model.

2. The method of claim 1, wherein the step of acquiring the joint data fragment of the first participant comprises:

using additive secret sharing, performing splitting and splicing operations on the business data of the plurality of participants through interaction with the other participant devices, so that the plurality of participants each obtain a joint data fragment; the joint data fragments of the plurality of participants, if reconstructed, yield the joint data.

3. The method according to claim 1, wherein the business prediction model is obtained through secure joint training based on the respective joint data fragments of the plurality of participants; the business prediction model is used for performing business prediction on the objects.

4. The method according to claim 3, wherein the step of acquiring the predicted value fragments corresponding to the plurality of objects and the model parameter fragments corresponding to the plurality of feature items comprises:

acquiring the model parameter fragments of the trained business prediction model held locally by the first participant device;

through interaction among the plurality of participant devices, enabling the plurality of participants to each determine predicted value fragments of the objects based on their respective joint data fragments and the trained business prediction model.

5. The method of claim 1, wherein the correlation data comprises covariance matrix data, and the correlation data fragments comprise covariance matrix fragments;

the step of determining the correlation data fragments respectively corresponding to the plurality of participants comprises:

determining intermediate matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants and a functional relation in the business prediction model;

and computing, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants, to obtain the covariance matrix fragments respectively corresponding to the plurality of participants.

6. The method according to claim 5, wherein the step of determining the intermediate matrix fragments respectively corresponding to the plurality of participants comprises:

determining Hessian matrix fragments respectively corresponding to the plurality of participants as the intermediate matrix fragments, based on the joint data fragments and predicted value fragments of the plurality of participants and a Hessian matrix expression derived from the functional relation in the business prediction model; the Hessian matrix expression comprises a joint data matrix and a predicted value matrix.

7. The method of claim 6, wherein the step of determining the Hessian matrix fragments respectively corresponding to the plurality of participants comprises:

performing, by using secret-shared multiplication and based on the expression of the predicted value matrix, element-wise multiplication on the predicted value fragments of the plurality of participants, so that the plurality of participants each obtain an intermediate vector fragment;

taking the elements of the intermediate vector fragment of the first participant as diagonal elements to construct a diagonal predicted value matrix fragment of the first participant;

and determining the Hessian matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value matrix fragments of the plurality of participants and the Hessian matrix expression.

8. The method according to claim 7, wherein the step of determining the Hessian matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value matrix fragments of the plurality of participants and the Hessian matrix expression comprises:

when computing the secure multiplication of the joint data fragments and the predicted value matrix fragments of the plurality of participants, performing secure multiplication between each column vector in the joint data fragments and the corresponding diagonal element in the predicted value matrix fragments.

9. The method of claim 5, wherein the step of computing, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix to obtain the covariance matrix fragments respectively corresponding to the plurality of participants comprises:

obtaining, by using a secret-shared matrix inversion (SMI) algorithm, the covariance matrix fragments respectively corresponding to the plurality of participants through iterative computation based on the intermediate matrix fragments of the plurality of participants.

10. The method according to claim 5, wherein the step of determining the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model comprises:

taking the diagonal elements of the covariance matrix fragments of the plurality of participants as variance fragments respectively corresponding to the plurality of model parameters;

for any model parameter, by using a secret-shared inverse square root (SNSI) algorithm and the significance test method, jointly performing a secure inverse square root operation through interaction among the plurality of participant devices based on the corresponding model parameter fragment of the first participant and the corresponding variance fragments of the plurality of participants, and determining a significance test value fragment of the first participant for the model parameter; and determining the effective value of the feature item corresponding to the model parameter based on the significance test value fragments of the plurality of participants for the model parameter.

11. The method of claim 10, further comprising:

for any first feature item, acquiring the effective value fragments of the first feature item from the other participant devices;

and determining the reconstructed effective value of the first feature item based on the local effective value fragment of the first feature item and the acquired effective value fragments.

12. The method of claim 1, further comprising:

removing, based on the effective values, the feature items whose effective values do not meet a preset condition from the plurality of feature items, so that the plurality of participants perform secure joint training of the business prediction model by using the business data with those feature items removed.

13. The method of claim 1, wherein the object comprises one of a user, a commodity, and an event; the feature items include at least one of: basic attribute information, association relation information, interaction information, and historical behavior information; the business prediction model is used for performing business prediction on the object.

14. The method of claim 1, wherein the business prediction model is derived based on a logistic regression model.

15. A privacy-preserving device for determining effective values of business data features, wherein the business data is distributed among a plurality of participants, the respective business data of the participants would form joint data if spliced together, and the joint data comprises feature values of a plurality of objects for a plurality of feature items; the apparatus is deployed in the device of any first participant and comprises:

the acquisition module is configured to acquire joint data fragments of a first participant, acquire predicted value fragments corresponding to a plurality of objects respectively, and model parameter fragments corresponding to a plurality of feature items respectively; the predicted value fragment and the model parameter fragment are obtained based on a trained service prediction model;

the interaction module is configured to determine, by using secure multi-party computation and through interaction among the plurality of participant devices, correlation data fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants, wherein the correlation data fragments comprise correlation data among the plurality of feature items;

and the test module is configured to determine, by using a significance test method and through secure interaction among the plurality of participant devices, based on the model parameter fragments of the participants and the corresponding data in the correlation data fragments, the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model.

16. The apparatus of claim 15, wherein the acquisition module, when acquiring the joint data fragment of the first participant, is configured to perform:

using additive secret sharing, performing splitting and splicing operations on the business data of the plurality of participants through interaction with the other participant devices, so that the plurality of participants each obtain a joint data fragment; the joint data fragments of the plurality of participants, if reconstructed, yield the joint data.

17. The apparatus of claim 15, wherein the business prediction model is obtained through secure joint training based on the respective joint data fragments of the plurality of participants; the business prediction model is used for performing business prediction on the objects.

18. The apparatus according to claim 17, wherein the acquisition module, when acquiring the predicted value fragments corresponding to the plurality of objects and the model parameter fragments corresponding to the plurality of feature items, is configured to perform:

acquiring the model parameter fragments of the trained business prediction model held locally by the first participant device;

through interaction among the plurality of participant devices, enabling the plurality of participants to each determine predicted value fragments of the objects based on their respective joint data fragments and the trained business prediction model.

19. The apparatus of claim 15, wherein the correlation data comprises covariance matrix data, and the correlation data fragments comprise covariance matrix fragments; the interaction module comprises:

a determining submodule configured to determine intermediate matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants and a functional relation in the business prediction model;

and a calculation submodule configured to compute, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants, to obtain the covariance matrix fragments respectively corresponding to the plurality of participants.

20. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-14.

21. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-14.

Background

The data required for machine learning often involves multiple platforms, multiple domains. For example, in a merchant classification analysis scenario based on machine learning, an electronic payment platform has transaction flow data of merchants, an electronic commerce platform stores sales data of the merchants, and a banking institution has loan data of the merchants. In order to improve service, multiple parties often train a business prediction model in a combined manner on the premise of ensuring privacy and security of business data.

As the amount of data increases, the feature dimensionality of the data grows. Multi-dimensional feature data often contains redundant information, which may degrade the effect of machine learning and reduce the stability of the model. The multi-dimensional feature data can therefore be reduced in dimension according to feature effectiveness: redundant features of low significance for improving model performance are removed, with as little information loss as possible, so that the data is converted into lower-dimensional features.

It is therefore desirable to have an improved scheme for determining feature validity as securely as possible without revealing private data.

Disclosure of Invention

One or more embodiments of this specification describe a privacy-preserving method and apparatus for determining effective values of business data features, which can determine the effective values of feature items for business data distributed among multiple participants securely and without leaking private data. The specific technical solutions are as follows.

In a first aspect, an embodiment provides a privacy-preserving method for determining effective values of business data features, where the business data is distributed among multiple participants, the respective business data of the participants would form joint data if spliced together, and the joint data includes feature values of multiple objects for multiple feature items; the method is performed by the device of any first participant and comprises:

acquiring joint data fragments of a first participant, acquiring predicted value fragments corresponding to a plurality of objects respectively, and model parameter fragments corresponding to a plurality of feature items respectively; the predicted value fragment and the model parameter fragment are obtained based on a trained service prediction model;

determining, by using secure multi-party computation and through interaction among the plurality of participant devices, correlation data fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants, wherein the correlation data fragments comprise correlation data among the plurality of feature items;

and determining, by using a significance test method and through secure interaction among the plurality of participant devices, based on the model parameter fragments of the participants and the corresponding data in the correlation data fragments, the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model.

In one embodiment, the step of acquiring the joint data fragment of the first participant includes:

using additive secret sharing, performing splitting and splicing operations on the business data of the plurality of participants through interaction with the other participant devices, so that the plurality of participants each obtain a joint data fragment; the joint data fragments of the plurality of participants, if reconstructed, yield the joint data.
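The splitting and splicing described above can be illustrated with a minimal Python sketch of additive secret sharing. It assumes a vertical split, integer feature values, and an arbitrary modulus; the names `split`, `joint_fragments`, and `MODULUS` are hypothetical and do not come from this specification. Real-valued features would additionally need a fixed-point encoding, which is omitted here.

```python
import secrets

MODULUS = 2**61 - 1  # assumed modulus, chosen only for illustration


def split(value, n_parties):
    """Split one integer into n additive shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recover the secret by adding all shares mod MODULUS."""
    return sum(shares) % MODULUS


def split_matrix(matrix, n_parties):
    """Share every entry of a matrix; returns one share matrix per party."""
    fragments = [[[0] * len(row) for row in matrix] for _ in range(n_parties)]
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            for p, share in enumerate(split(value, n_parties)):
                fragments[p][i][j] = share
    return fragments


def joint_fragments(blocks):
    """Each party shares its own column block (splitting); each party then
    concatenates the shares it holds of all blocks (splicing)."""
    n = len(blocks)
    shared = [split_matrix(block, n) for block in blocks]  # shared[k][p]
    n_rows = len(blocks[0])
    return [
        [sum((shared[k][p][i] for k in range(n)), []) for i in range(n_rows)]
        for p in range(n)
    ]


def reconstruct_matrix(fragments):
    """Entry-wise reconstruction; the protocol itself never performs this."""
    rows, cols = len(fragments[0]), len(fragments[0][0])
    return [
        [reconstruct([f[i][j] for f in fragments]) for j in range(cols)]
        for i in range(rows)
    ]
```

Each fragment on its own is uniformly random; only the sum of all fragments equals the spliced joint data, which is why the joint data is described as existing only hypothetically.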

In one embodiment, the business prediction model is obtained through secure joint training based on the respective joint data fragments of the plurality of participants; the business prediction model is used for performing business prediction on the objects.

In one embodiment, the step of acquiring the predicted value fragments corresponding to the plurality of objects and the model parameter fragments corresponding to the plurality of feature items includes:

acquiring the model parameter fragments of the trained business prediction model held locally by the first participant device;

through interaction among the plurality of participant devices, enabling the plurality of participants to each determine predicted value fragments of the objects based on their respective joint data fragments and the trained business prediction model.

In one embodiment, the correlation data comprises covariance matrix data, and the correlation data fragments comprise covariance matrix fragments;

the step of determining the correlation data fragments respectively corresponding to the plurality of participants includes:

determining intermediate matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants and a functional relation in the business prediction model;

and computing, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants, to obtain the covariance matrix fragments respectively corresponding to the plurality of participants.

In one embodiment, the step of determining the intermediate matrix fragments respectively corresponding to the plurality of participants includes:

determining Hessian matrix fragments respectively corresponding to the plurality of participants as the intermediate matrix fragments, based on the joint data fragments and predicted value fragments of the plurality of participants and a Hessian matrix expression derived from the functional relation in the business prediction model; the Hessian matrix expression comprises a joint data matrix and a predicted value matrix.

In one embodiment, the step of determining the Hessian matrix fragments respectively corresponding to the plurality of participants includes:

performing, by using secret-shared multiplication and based on the expression of the predicted value matrix, element-wise multiplication on the predicted value fragments of the plurality of participants, so that the plurality of participants each obtain an intermediate vector fragment;

taking the elements of the intermediate vector fragment of the first participant as diagonal elements to construct a diagonal predicted value matrix fragment of the first participant;

and determining the Hessian matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value matrix fragments of the plurality of participants and the Hessian matrix expression.
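As a plaintext reference for what these shared steps compute, the sketch below assumes the logistic regression embodiment, for which the Hessian of the log-loss is the standard expression H = Xᵀ·A·X with A = diag(ŷ ⊙ (1 − ŷ)). The function name `hessian` and the toy inputs are hypothetical; in the scheme itself every quantity would exist only as fragments and the products would be secret-shared multiplications.

```python
def hessian(X, y_hat):
    """Plaintext logistic-regression Hessian H = X^T diag(y_hat*(1-y_hat)) X.

    X is an N x D joint data matrix (list of rows) and y_hat holds the N
    predicted values. This is a reference computation only, not the
    secret-shared protocol.
    """
    n, d = len(X), len(X[0])
    # Intermediate vector: element-wise product of y_hat and (1 - y_hat).
    a = [p * (1.0 - p) for p in y_hat]
    H = [[0.0] * d for _ in range(d)]
    for i in range(n):
        for r in range(d):
            for c in range(d):
                # a[i] is the i-th diagonal element of the predicted value matrix.
                H[r][c] += X[i][r] * a[i] * X[i][c]
    return H
```

The intermediate vector `a` corresponds to the element-wise product in the first step, and using `a[i]` as a diagonal entry corresponds to the second step.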

In one embodiment, the step of determining the Hessian matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value matrix fragments of the plurality of participants and the Hessian matrix expression includes:

when computing the secure multiplication of the joint data fragments and the predicted value matrix fragments of the plurality of participants, performing secure multiplication between each column vector in the joint data fragments and the corresponding diagonal element in the predicted value matrix fragments.
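The saving from this column-wise treatment can be seen in plaintext. The sketch below, with hypothetical helper names and toy integer values, compares a full matrix product with a diagonal matrix against scaling each column by its diagonal element; it assumes the left factor is the transposed joint data matrix, so that each column corresponds to one object.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def diag(a):
    """Build a dense diagonal matrix from a vector."""
    n = len(a)
    return [[a[i] if i == j else 0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    """Dense product: rows(A) * cols(B) * len(B) scalar multiplications."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def scale_columns(M, a):
    """Multiply column i of M by a[i]: one scalar multiplication per entry."""
    return [[M[r][i] * a[i] for i in range(len(a))] for r in range(len(M))]


X = [[1, 2], [3, 4], [5, 6]]   # toy joint data, N = 3 objects, D = 2 features
a = [2, 3, 4]                  # toy diagonal elements, one per object
Xt = transpose(X)
full = matmul(Xt, diag(a))     # D * N * N multiplications
fast = scale_columns(Xt, a)    # D * N multiplications
```

Both routes give the same matrix, but the column-wise route needs D·N multiplications instead of D·N·N, which matters when each multiplication is an interactive secure operation.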

In one embodiment, the step of computing, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix to obtain the covariance matrix fragments respectively corresponding to the plurality of participants includes:

obtaining, by using a secret-shared matrix inversion (SMI) algorithm, the covariance matrix fragments respectively corresponding to the plurality of participants through iterative computation based on the intermediate matrix fragments of the plurality of participants.
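The text does not spell out the iteration used by SMI. One iteration that needs only matrix additions and multiplications, and is therefore a plausible fit for secret-shared arithmetic, is the Newton-Schulz scheme X ← X(2I − AX). The plaintext sketch below is an assumption for illustration, not the specification's algorithm; the starting point I/trace(A) further assumes A is symmetric positive definite.

```python
def matmul(A, B):
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def newton_schulz_inverse(A, iterations=30):
    """Approximate A^-1 with X <- X (2I - A X), using only + and *.

    Assumes A is symmetric positive definite, so that X0 = I / trace(A)
    lies inside the convergence region of the iteration.
    """
    n = len(A)
    trace = sum(A[i][i] for i in range(n))
    X = [[(1.0 / trace if i == j else 0.0) for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        AX = matmul(A, X)
        # R = 2I - A X
        R = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)]
             for i in range(n)]
        X = matmul(X, R)
    return X
```

Because every step is an addition or a multiplication, each could in principle be replaced by its secret-shared counterpart, so that the inverse is obtained as fragments without reconstructing the Hessian.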

In one embodiment, the step of determining the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model includes:

taking the diagonal elements of the covariance matrix fragments of the plurality of participants as variance fragments respectively corresponding to the plurality of model parameters;

for any model parameter, by using a secret-shared inverse square root (SNSI) algorithm and the significance test method, jointly performing a secure inverse square root operation through interaction among the plurality of participant devices based on the corresponding model parameter fragment of the first participant and the corresponding variance fragments of the plurality of participants, and determining a significance test value fragment of the first participant for the model parameter; and determining the effective value of the feature item corresponding to the model parameter based on the significance test value fragments of the plurality of participants for the model parameter.
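A plaintext sketch of this step follows, assuming the significance test is a Wald-type test whose statistic is the parameter divided by its standard error, z = w · var^(−1/2). The inverse square root is approximated with the Newton iteration y ← y(3 − x·y²)/2, which uses only additions and multiplications; this particular iteration, the starting value, and the function names are assumptions, since the text names SNSI without giving its steps.

```python
def inverse_sqrt(x, iterations=60):
    """Approximate x ** -0.5 for x > 0 with y <- y * (3 - x*y*y) / 2.

    The start value min(1, 1/x) is at most 1/sqrt(x), from which the
    iteration increases monotonically towards 1/sqrt(x).
    """
    y = 1.0 / x if x >= 1.0 else 1.0
    for _ in range(iterations):
        y = y * (3.0 - x * y * y) / 2.0
    return y


def wald_statistic(weight, variance):
    """z = w / sqrt(var), computed as a product with the inverse square root."""
    return weight * inverse_sqrt(variance)


def wald_statistics(weights, covariance):
    """Use the covariance diagonal as per-parameter variances."""
    return [wald_statistic(w, covariance[j][j]) for j, w in enumerate(weights)]
```

Writing the division as a multiplication by an inverse square root is what lets the test statistic be formed from a model parameter fragment and a variance fragment with multiplications only.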

In one embodiment, the method further comprises:

for any first feature item, acquiring the effective value fragments of the first feature item from the other participant devices;

and determining the reconstructed effective value of the first feature item based on the local effective value fragment of the first feature item and the acquired effective value fragments.

In one embodiment, the method further comprises:

removing, based on the effective values, the feature items whose effective values do not meet a preset condition from the plurality of feature items, so that the plurality of participants perform secure joint training of the business prediction model by using the business data with those feature items removed.
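A minimal sketch of such a filter is given below. It assumes the reconstructed effective value is a Wald z-statistic, that it is converted to a two-sided p-value under a standard normal distribution, and that the preset condition is p ≤ 0.05; the threshold, the function names, and the feature names in the usage line are all hypothetical.

```python
import math


def two_sided_p_value(z):
    """Two-sided p-value of z under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_features(z_by_feature, alpha=0.05):
    """Keep features whose p-value is at most alpha; report the rest as removed."""
    kept, removed = [], []
    for name, z in z_by_feature.items():
        if two_sided_p_value(z) <= alpha:
            kept.append(name)
        else:
            removed.append(name)
    return kept, removed


kept, removed = select_features({"age": 4.0, "region": 0.5, "clicks": -2.5})
```

With these toy statistics, `region` is the only feature item removed, and the participants would retrain on the remaining feature items.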

In one embodiment, the object comprises one of a user, a commodity, and an event; the feature items include at least one of: basic attribute information, association relation information, interaction information, and historical behavior information; the business prediction model is used for performing business prediction on the object.

In one embodiment, the business prediction model is based on a logistic regression model.

In a second aspect, an embodiment provides a privacy-preserving apparatus for determining effective values of business data features, where the business data is distributed among multiple participants, the respective business data of the participants would form joint data if spliced together, and the joint data includes feature values of multiple objects for multiple feature items; the apparatus is deployed in the device of any first participant and comprises:

the acquisition module is configured to acquire joint data fragments of a first participant, acquire predicted value fragments corresponding to a plurality of objects respectively, and model parameter fragments corresponding to a plurality of feature items respectively; the predicted value fragment and the model parameter fragment are obtained based on a trained service prediction model;

the interaction module is configured to determine, by using secure multi-party computation and through interaction among the plurality of participant devices, correlation data fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants, wherein the correlation data fragments comprise correlation data among the plurality of feature items;

and the test module is configured to determine, by using a significance test method and through secure interaction among the plurality of participant devices, based on the model parameter fragments of the participants and the corresponding data in the correlation data fragments, the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model.

In one embodiment, the acquisition module, when acquiring the joint data fragment of the first participant, is configured to perform:

using additive secret sharing, performing splitting and splicing operations on the business data of the plurality of participants through interaction with the other participant devices, so that the plurality of participants each obtain a joint data fragment; the joint data fragments of the plurality of participants, if reconstructed, yield the joint data.

In one embodiment, the business prediction model is obtained through secure joint training based on the respective joint data fragments of the plurality of participants; the business prediction model is used for performing business prediction on the objects.

In one embodiment, the acquisition module, when acquiring the predicted value fragments corresponding to the plurality of objects and the model parameter fragments corresponding to the plurality of feature items, is configured to perform:

acquiring the model parameter fragments of the trained business prediction model held locally by the first participant device;

through interaction among the plurality of participant devices, enabling the plurality of participants to each determine predicted value fragments of the objects based on their respective joint data fragments and the trained business prediction model.

In one embodiment, the correlation data comprises covariance matrix data, and the correlation data fragments comprise covariance matrix fragments; the interaction module comprises:

a determining submodule configured to determine intermediate matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments and predicted value fragments of the plurality of participants and a functional relation in the business prediction model;

and a calculation submodule configured to compute, based on the intermediate matrix fragments of the plurality of participants, fragments of the inverse of the intermediate matrix respectively corresponding to the plurality of participants, to obtain the covariance matrix fragments respectively corresponding to the plurality of participants.

In a third aspect, embodiments provide a computer-readable storage medium having a computer program stored thereon, which, when executed in a computer, causes the computer to perform the method of any of the first aspect.

In a fourth aspect, an embodiment provides a computing device, including a memory and a processor, where the memory stores executable code, and the processor executes the executable code to implement the method of any one of the first aspect.

In the method and apparatus provided in the embodiments of this specification, the participants interact and, based on the joint data fragment and predicted value fragments of the first participant and those of the other participants, obtain correlation data fragments by using secure multi-party computation; they then determine the effective value of each feature item in improving the model effect by using the model parameter fragments and the correlation data fragments. The secure multi-party computation among the participants operates on fragments of the various data, and its results are also fragments, so private data such as the correlation data among the feature items is never reconstructed during processing, which improves the privacy and security of the data during processing.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a schematic diagram of an implementation scenario of an embodiment disclosed in this specification;

FIG. 2 is a schematic flowchart of a privacy-preserving method for determining effective values of business data features according to an embodiment;

FIG. 3 is a schematic diagram of the calculation flow of secret-shared matrix multiplication as applied in this embodiment;

FIG. 4 is a schematic block diagram of an apparatus for determining effective values of business data features according to an embodiment.

Detailed Description

The scheme provided by the specification is described below with reference to the accompanying drawings.

Fig. 1 is a schematic view of an implementation scenario of an embodiment disclosed in this specification. As shown in fig. 1, in a shared learning scenario, a data set is provided by a plurality of participants 1,2, …, W in common (W is a natural number), and each participant possesses a part of data in the data set, and forms business data (i.e., an original matrix) of the participant. The data set may be a training data set for training a model, a testing data set for testing a model, or a data set to be predicted. The data set may include characteristic data of an object, and the object may be one of various business objects to be analyzed, such as a user, a commodity, an event, and the like. The model may comprise a business prediction model trained in a machine learning manner.

There are at least two ways in which the data set may be distributed. In the first, each participant has different feature data for all objects. For example, every participant has samples of the same N objects, the private data of each sample contains D features, and these features are distributed among W participants, each holding D/W of them. As another example, two platforms have the same set of users but hold different user features in their business data. Each participant holds different kinds of features, and the number of features per participant may be the same (for example, D/W features each) or different. N, D and W are all natural numbers. This is the vertical partitioning scenario, and Table 1 shows the business data distribution in this scenario.

TABLE 1

Where xx represents a specific characteristic value, belonging to the private data of the participant. Each row in table 1 represents one sample data, each column represents the feature value of a feature item of N objects, and D feature items belong to W participants. The feature values of the D feature items of the N objects constitute the entire business data.

Another distribution is that each participant has all the characteristic data of the different objects. For example, there are N samples of the object, the business data of each sample includes D feature items, the N pieces of business data are distributed in W participants, each participant has a part of the samples in all N samples, and the feature items included in each sample are the same. The number of object samples stored by different participants may be the same or different. As another example, there are two banks that serve different groups of users, but they both have the same user credit characteristics. This is a scenario of data horizontal slicing in the data set, and table 2 is service data distribution of the data horizontal slicing scenario.

TABLE 2

Where xx represents a specific characteristic value, belonging to the private data of the participant. Each row in table 2 represents one sample data, each column represents the feature value of a certain feature item of N objects, and N sample data belong to W participants. Different participants have different object samples. The feature values of the D feature items of the N objects constitute the entire business data.

The business data owned by the participants may include a plurality of characteristic items. The feature item of the object may include at least one of: basic attribute information, association relation information, interaction information, historical behavior information and the like of the object. For example, when the object is a user, the basic attribute information may include gender, age, income, and the like of the user, the association information of the user may include other users, companies, regions, and the like, which have an association with the user, the interaction information of the user may include information of clicking, viewing, participating in a certain activity, and the like of the user at a certain website, and the historical behavior information of the user may include historical transaction behavior, payment behavior, purchase behavior, and the like of the user.

When the object is a commodity, the basic attribute information may include a category, a place of production, a price, and the like of the commodity, the association relationship information of the commodity may include a user, a shop, or other commodities, and the like, which have an association relationship with the commodity, the interaction information of the commodity may include interaction characteristics between the user, the shop, and the commodity, and the historical behavior information of the commodity may include information that the commodity is purchased, transferred, returned, and the like.

When the object is an event, the event may include a transaction event, a login event, a purchase event, a social event, and the like. The basic attribute information of the event may be text information for describing the event, the association relation information may include text having a contextual relation with the event, other event information having an association with the event, and the like, and the historical behavior information may include record information of the event developing and changing in a time dimension, and the like.

The participants may correspond to different service platforms, which may include various enterprises, institutions, organizations, and the like. The business data is usually private data of the service platform, and a high level of privacy and security must be maintained while it is processed. Regardless of how the data is distributed, the feature values (i.e., the feature data) corresponding to the feature items of the objects are private data and can be stored as a private data matrix. To keep the private data secure, each participant needs to keep its private data local, must not output plaintext data, and must not aggregate data in plaintext.

To prevent the private data of each participant from being leaked, in one embodiment the participants may use multi-party secure computation: using the predicted values and each participant's original matrix, and through interaction with a third party, they enable the third party to obtain covariance matrix data that represents the correlation among the plurality of feature items. The third party then uses the covariance matrix data and the model parameters, together with a significance test method, to determine the effective value of the feature item corresponding to each model parameter in improving the effect of the business prediction model.

The covariance matrix data itself contains some private information, so the security of the private data can be improved further by protecting the covariance matrix data as well. Referring to Fig. 1, in an embodiment of this specification, each participant stores its own data fragments: its joint data fragment, the predicted value fragments corresponding to the plurality of objects, and the model parameter fragments corresponding to the plurality of feature items. The participant devices interact on the basis of multi-party secure computation and use the joint data fragments and the predicted value fragments to determine the correlation data fragments corresponding to the respective participants, where the correlation data fragments contain correlation data among the plurality of feature items. Using a significance test method, each participant then determines, based on the corresponding data in the model parameter fragments and the correlation data fragments of the participants, the effective value of a feature item in improving the effect of the business prediction model. Because the multi-party secure computation among the participants operates on data fragments, the correlation data obtained also exists only as fragments, and neither the correlation data among the feature items nor other private data is reconstructed. This improves the privacy and security of the data during processing.

In this specification, a plurality of participants have corresponding participant apparatuses, respectively, and the operations in the embodiments of the specification are performed using the corresponding participant apparatuses. Participant devices include, but are not limited to, any apparatus, device, platform, cluster of devices, etc. having computing, processing capabilities. The following describes embodiments of the present invention with reference to specific examples.

Fig. 2 is a flowchart of a method for determining effective values of business data features for protecting privacy according to this embodiment. The business data is distributed among a plurality of participants, and the respective business data of the participants constitutes joint data under assumed splicing. The business data of the participants is highly private: it cannot be sent in plaintext among the participants, and it cannot actually be spliced to form the joint data. The joint data is only a hypothetical data set made up of the business data of the participants. For example, Tables 1 and 2 above are specific forms of the joint data in the vertical slicing and horizontal slicing scenarios, respectively. The joint data includes feature values of a plurality of objects for a plurality of feature items, for example feature values of N objects for D feature items, where N and D are both natural numbers.

For convenience of description, two participants are exemplified in the following examples. For example, the two parties are a first party a and a second party B, respectively, the first party a corresponding to a first party device and the second party B corresponding to a second party device. The participator device is used for executing the operation of the participator and storing the data of the participator. In particular embodiments, the participant device may also obtain data of the participant from other devices. The method of the present embodiment specifically includes the following steps S210 to S230.

Step S210, the first participant device obtains the joint data segment of the first participant a, obtains the predicted value segments corresponding to the plurality of objects, and obtains the model parameter segments corresponding to the plurality of feature items. And the second participant equipment acquires the joint data fragments of the second participant B, acquires the predicted value fragments corresponding to the plurality of objects respectively, and acquires the model parameter fragments corresponding to the plurality of feature items respectively.

Each of the participants holds its own business data, which is both the original data and private data. In the vertical slicing scenario, the participants hold different feature items for the same objects. The participants may each represent their original data as an original matrix. For example, the original matrices of the first participant A and the second participant B may be denoted $X_A$ and $X_B$, their numbers of feature items $d_A$ and $d_B$, and their numbers of objects $n_A$ and $n_B$. The total number of feature items of the joint data is then $D = d_A + d_B$, and the total number of objects or samples is $N = n_A = n_B$. When the columns of the original matrices represent feature items and the rows represent objects or samples, assumed horizontal splicing of the business data of the participants, such as the first participant A and the second participant B, yields the joint data in the form $X = (X_A, X_B)$. This case, in which columns represent feature items and rows represent samples, corresponds to the data distribution in Table 1. In other embodiments, the columns of the original matrices may represent objects and the rows may represent feature items; in that case, assumed vertical splicing of the business data of the participants, such as the first participant A and the second participant B, yields the joint data in the form $X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}$.

In the horizontal slicing scenario, the participants hold the same feature items for different objects. The original matrices of the first participant A and the second participant B are $X_A$ and $X_B$, respectively, their numbers of feature items are $d_A = d_B$, and their numbers of objects are $n_A$ and $n_B$. The total number of feature items of the joint data is then $D = d_A = d_B$, and the total number of objects or samples is $N = n_A + n_B$. When the rows of the participants' original matrices represent objects and the columns represent feature items, assumed vertical splicing of the business data of the participants, such as the first participant A and the second participant B, yields the joint data in the form $X = \begin{pmatrix} X_A \\ X_B \end{pmatrix}$.

The above may correspond to the data distribution scenario in Table 2. When the rows of the original matrices represent feature items and the columns represent objects, assumed horizontal splicing of the business data of the participants, such as the first participant A and the second participant B, yields the joint data in the form $X = (X_A, X_B)$.
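The two splicing directions can be illustrated with a small plaintext sketch. It assumes toy integer matrices in which rows are objects and columns are feature items; the variable names are hypothetical. In the method itself the splicing is only assumed and is never carried out on plaintext data.

```python
import numpy as np

# Hypothetical toy data: rows are objects (samples), columns are feature items.
rng = np.random.default_rng(0)

# Vertical slicing: same N objects, different feature items per participant.
X_A_vert = rng.integers(0, 10, size=(4, 2))   # n_A = 4, d_A = 2
X_B_vert = rng.integers(0, 10, size=(4, 3))   # n_B = 4, d_B = 3
X_vert = np.hstack([X_A_vert, X_B_vert])      # horizontal splicing: X = (X_A, X_B)

# Horizontal slicing: same D feature items, different objects per participant.
X_A_horz = rng.integers(0, 10, size=(3, 5))   # n_A = 3, d_A = 5
X_B_horz = rng.integers(0, 10, size=(2, 5))   # n_B = 2, d_B = 5
X_horz = np.vstack([X_A_horz, X_B_horz])      # vertical splicing: X_A on top of X_B

print(X_vert.shape)   # (N, D) with N = n_A = n_B and D = d_A + d_B
print(X_horz.shape)   # (N, D) with N = n_A + n_B and D = d_A = d_B
```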

In order to enable a plurality of participants to obtain the joint data fragmentation, secret sharing addition can be adopted among the participants to split the business data of the participants into random numbers, and the fragmentation is completed through the transmission of the random numbers among the participants. Specifically, when the first participant device acquires the joint data segment of the first participant a, a secret sharing addition may be adopted, and through interaction with other participant devices, splitting and splicing operations are performed based on the service data of multiple participants, so that the multiple participants respectively acquire the joint data segment. Similarly, the second party B also obtains its joint data shards.

Secret sharing addition can split an original matrix into random matrices, and the fragmentation is completed by transmitting random matrices among the participants. Taking two participants as an example, the first participant A and the second participant B hold the original matrices $X_A$ and $X_B$ of their business data, respectively. The first participant device may generate a random matrix $R_A$ in a finite field and compute $X_A - R_A = X_2$; it may then send either of the two random matrices $R_A$ and $X_2$, for example $X_2$, to the second participant device. Likewise, the second participant device generates a random matrix $R_B$ in the finite field, computes $X_B - R_B = X_3$, and may send either of the two random matrices $R_B$ and $X_3$, for example $X_3$, to the first participant device.

The first participant device may splice $R_A$ and the $X_3$ received from the second participant device into its joint data fragment, and the second participant device may splice $R_B$ and the $X_2$ received from the first participant device into its joint data fragment. Of course, in practical application scenarios there are often three or more participants, and the secret sharing addition procedure extends readily to that case. The data sent among the participants consists only of random matrices, so the private data in the original matrices is not disclosed.

The joint data fragments of the participants yield the joint data under assumed reconstruction. Reconstruction may be implemented by adding the data fragments of all parties; a specific reconstruction may also apply further matrix transformations on top of the addition, for example multiplication by a preset value. Because the joint data contains private data and the participants never aggregate private data in plaintext, the joint data is only a representation under an assumed condition: in practice, the data fragments of the participants are not brought together and reconstructed. This meaning of reconstruction applies throughout the description.

The joint data fragment of the first participant A may be denoted $\langle X \rangle_A$ and the joint data fragment of the second participant B may be denoted $\langle X \rangle_B$, so that the joint data $X = \langle X \rangle_A + \langle X \rangle_B$. Here $\langle X \rangle$ denotes a fragment of the parameter $X$, and the subscript indicates the participant to which the fragment belongs. For uniformity, a data fragment held by a given participant is written below in this "angle brackets plus subscript" form.

In this embodiment, the federated data segments of the participants are obtained based on the business data of the multiple participants, and the sum of the federated data segments of the multiple participants is conceptually or theoretically equal to the federated data.
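The two-party splitting and splicing described above can be sketched as follows for the vertical slicing case. This is a minimal single-process simulation under stated assumptions: the prime modulus `P`, the helper `split`, and the toy matrices are illustrative choices that the specification does not fix, and the final reconstruction is shown only to check the arithmetic.

```python
import numpy as np

P = 2**31 - 1  # illustrative prime modulus for the finite field (an assumption)
rng = np.random.default_rng(42)

def split(x):
    """Split matrix x into (r, x - r) mod P, where r is uniformly random."""
    r = rng.integers(0, P, size=x.shape, dtype=np.int64)
    return r, (x - r) % P

# Vertical slicing: both parties hold the same 3 objects, different feature items.
X_A = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)   # held by A
X_B = np.array([[7], [8], [9]], dtype=np.int64)            # held by B

R_A, X_2 = split(X_A)   # A keeps R_A and sends X_2 to B
R_B, X_3 = split(X_B)   # B keeps R_B and sends X_3 to A

# Each party splices what it kept and what it received into its joint data fragment.
share_A = np.hstack([R_A, X_3])   # <X>_A
share_B = np.hstack([X_2, R_B])   # <X>_B

# Assumed reconstruction, performed here only to verify the sketch.
X = (share_A + share_B) % P
print(np.array_equal(X, np.hstack([X_A, X_B])))
```

Each party only ever sees one random-looking matrix from the other side, which is why the exchanged data reveals nothing about the original matrices on its own.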

In step S210, the predicted value segment and the model parameter segment are data obtained based on the trained service prediction model. The service prediction model is obtained by performing safe joint training based on joint data fragments of a plurality of participants. The business prediction model can be obtained by pre-training. The business prediction model can be a model obtained by training based on a logistic regression model, and can also be obtained by training based on other types of models. The business prediction model is used for performing business prediction on the object, for example, classification prediction or regression prediction can be performed on input feature data of the object.

The participant devices can obtain the predicted value fragments and the model parameter fragments from the trained business prediction model. For example, the first participant device may obtain its local model parameter fragments of the trained business prediction model, and, through secure interaction among the participant devices, each participant may determine its predicted value fragments for the objects based on the joint data fragments of the participants and the trained business prediction model.

And the plurality of participant devices take N objects in the joint data fragments as samples to train a service prediction model. After training, the model parameter fragment of the service prediction model in the present participant device can be obtained. Through the safe interaction among a plurality of participant devices, the joint data fragments of the participants are input into a service prediction model, and each participant device can determine the predicted value fragments of a plurality of objects of the participant.

Therefore, for a participant, in the acquired data, one object corresponds to one predicted value fragment, N objects correspond to N predicted value fragments respectively, and the N predicted value fragments can be used as vector elements to form a vector; when the service data contains D characteristic items, the trained service prediction model contains a plurality of model parameters which respectively correspond to the D characteristic items. For any predicted value data, the corresponding predicted value segments owned by a plurality of participants obtain the predicted value data under the condition of supposing reconstruction. For any model parameter, the corresponding model parameter slices owned by multiple participants obtain the model parameter under the condition of supposing reconstruction.

Step S220, determining, by using multi-party security computation, relevant data segments corresponding to multiple parties respectively based on joint data segments and predicted value segments of the multiple parties through interaction between the multiple party devices, where the relevant data segments include relevant data between multiple feature items.

Under assumed reconstruction, the correlation data fragments of the participants yield the correlation data, that is, the correlation data among the feature items. This includes correlation data between feature items held by the same participant and between feature items held by different participants, and it covers both the correlation between different feature items and the correlation of a feature item with itself.

When the step is implemented, the relevance data fragments respectively corresponding to a plurality of participants can be determined by utilizing the joint data fragments and the predicted value fragments and in a multi-party safe calculation mode based on the existing formula for calculating the relevance data between the characteristic items. The formula capable of expressing the correlation data between the feature items may include a covariance matrix formula, a correlation coefficient formula, and the like.

Secure multi-party computation (MPC) is an existing data privacy protection technology for computations involving multiple parties; specific implementations include homomorphic encryption, garbled circuits, oblivious transfer, secret sharing, and the like. By using secure multi-party computation, the participant devices can perform secure interactive computation on the joint data fragments and predicted value fragments, so that each participant can determine its corresponding correlation data fragment.

And step S230, determining the effective value of the characteristic item corresponding to the model parameter on the effect of improving the service prediction model by adopting a significance test method through the safety interaction among the equipment of the multiple participants and based on the corresponding data in the model parameter fragments and the relevance data fragments of the multiple participants.

The significance test method may include a Wald test, a Likelihood Ratio (LR) test, a Lagrange Multiplier (LM) test, and the like. After the existing formula provided by the significance test method is transformed, the model parameter fragments and the relevance data fragments of a plurality of participants are safely calculated through the safety interaction among the devices of the participants, and the effective value fragments corresponding to the participants are determined.

In this embodiment, the feature items correspond to model parameters, and data corresponding to the feature items exist in both the model parameter patches and the correlation data patches. By using the corresponding data in the model parameter fragment and the correlation data fragment and adopting a significance test method, the significance test value fragments corresponding to the plurality of model parameters respectively, namely the significance test value fragments of the corresponding plurality of feature items, can be determined, and the effective value fragment can be determined based on the significance test value fragments.

When the valid value of a certain feature item needs to be determined, for example, for an arbitrary first feature item, the first participant device may obtain a valid value fragment of the first feature item from other participant devices, and determine a reconstructed valid value of the first feature item based on the local valid value fragment of the first feature item and the obtained valid value fragment. The valid value of the feature item may also be reconstructed in the second participant device or in another participant device, and this embodiment is described only by taking the reconstruction of the valid value in the first participant device as an example.

After obtaining the effective values of the plurality of feature items, the first participant device may further remove, from the plurality of feature items, the feature item whose effective value does not satisfy the preset condition based on the plurality of effective values, so that the plurality of participants perform safe joint training on the service prediction model by using the service data from which the feature item is removed. The service data after the feature items are removed realizes the dimension reduction processing of the original matrix, so that the feature items are more refined, and the safety of the private data is ensured without leakage.
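The feature removal step can be sketched in plaintext as follows. The effective values, the preset condition (keep a feature item when its effective value is at least 1), and the matrix `X_local` are all hypothetical; the specification leaves the preset condition open.

```python
import numpy as np

# Hypothetical reconstructed effective values for D = 5 feature items.
effective_values = np.array([1, 0, 1, 1, 0])

# Assumed preset condition: keep a feature item when its effective value is >= 1.
keep_mask = effective_values >= 1
kept_columns = np.flatnonzero(keep_mask)

# Each participant would drop the removed feature items from the columns it
# holds; here one plaintext matrix stands in for that local business data.
X_local = np.arange(15).reshape(3, 5)
X_reduced = X_local[:, keep_mask]

print(kept_columns)      # indices of the retained feature items
print(X_reduced.shape)   # reduced dimension of the original matrix
```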

One embodiment is described in detail below. For the case where the business prediction model includes a logistic regression model and the significance test method is the Wald test, it gives specific implementations of determining the correlation data fragments in step S220 and of determining the effective value of a feature item in step S230.

The application of the Wald test to logistic regression is first explained in detail below. When a logistic regression model is used to regress the feature data of a sample, the predicted value is calculated as

$$\pi(X) = \frac{1}{1 + e^{-X\beta}} \qquad (1)$$

where $X$ is the feature data of the sample and serves as the independent variable; $\pi(X)$ is the predicted value function of the sample and serves as the dependent variable; $\beta$ is the model parameter, i.e., the feature item coefficients; and $e$ is the natural constant.

The null hypothesis and the alternative hypothesis of the Wald test are:

$H_0: \omega_j = 0$ ($j = 1, 2, \ldots, k$), that is, the independent variable is assumed to have no influence on the estimated value of the dependent variable;

$H_1: \omega_j \neq 0$.

If the null hypothesis is rejected, the dependent variable changes with the independent variable $j$.

The test statistic of the Wald test is

$$\mathrm{Wald}_k = \left( \frac{\beta_k}{SE(\beta_k)} \right)^{2} \qquad (2)$$

$\mathrm{Wald}_k$ is the significance test value, which follows a chi-square distribution with 1 degree of freedom. Here $SE(\beta_k)$ is the standard error of the model parameter $\beta_k$, which is also equal to the square root of the corresponding diagonal element of the covariance matrix:

$$SE(\beta_k) = \sqrt{\mathrm{Cov}(\beta)_{kk}}$$

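The diagonal elements of the covariance matrix are the variances of the feature items. The covariance matrix of the model parameters, $\mathrm{Cov}(\beta)$, is the value of the inverse of the negative Hessian matrix of the log-likelihood function $L(\beta)$:

$$\mathrm{Cov}(\beta) = \left( -\frac{\partial^{2} L(\beta)}{\partial \beta \, \partial \beta^{T}} \right)^{-1} = H^{-1}$$

where

$$H_{kr} = -\frac{\partial^{2} L(\beta)}{\partial \beta_k \, \partial \beta_r} = \sum_{i=1}^{N} x_{ik} \, x_{ir} \, \pi(X_i) \bigl( 1 - \pi(X_i) \bigr)$$

is the element expression of the matrix $H$ (the negative Hessian, referred to below simply as the Hessian matrix), the indices $k$ and $r$ are natural numbers not greater than $D$, $x_{ik}$ and $x_{ir}$ are elements of the joint data $X$, and $X_i$ represents the feature data of the $i$-th sample.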

By deriving the above formula, the matrix $H$ can be expressed as $H = X^{T} M X$, where

$$M = \mathrm{diag}\bigl( \pi(X_1)(1 - \pi(X_1)),\ \pi(X_2)(1 - \pi(X_2)),\ \ldots,\ \pi(X_N)(1 - \pi(X_N)) \bigr)$$

Here $N$ is the total number of samples, i.e., the total number of objects, $D$ is the dimension of the feature data, and $\pi(X_N)$ is the predicted value of the logistic regression model for sample $X_N$. $M$ is a diagonal matrix obtained from the predicted values and may also be referred to as the predicted value matrix.
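As a plaintext check of the matrix form, the sketch below builds $H = X^{T} M X$ for toy data, compares it with the element-wise sum of $x_{ik} x_{ir} \pi(X_i)(1 - \pi(X_i))$ over the samples, and then inverts it to obtain the covariance matrix and the standard errors. The sizes, the random data, and the parameter vector `beta` are assumptions made for illustration; nothing here is secret-shared.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 3                          # toy sizes (assumptions)
X = rng.normal(size=(N, D))            # joint data, rows = samples
beta = np.array([0.5, -1.0, 0.25])     # hypothetical trained model parameters

pi = 1.0 / (1.0 + np.exp(-X @ beta))   # predicted values pi(X_i)
M = np.diag(pi * (1.0 - pi))           # predicted value matrix (N x N, diagonal)
H = X.T @ M @ X                        # H = X^T M X, shape (D, D)

# Element-wise form: H[k, r] = sum_i x_ik * x_ir * pi_i * (1 - pi_i)
H_elem = np.einsum("ik,ir,i->kr", X, X, pi * (1.0 - pi))
print(np.allclose(H, H_elem))

cov = np.linalg.inv(H)                 # covariance matrix of the model parameters
se = np.sqrt(np.diag(cov))             # standard errors SE(beta_k)
print(se)
```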

From the above equation (2), it can be seen that, for the $k$-th model parameter, the larger its standard deviation, that is, the larger the value in row $k$ and column $k$ of the covariance matrix, the greater the oscillation this model parameter causes in the logistic regression model, and the smaller the Wald test value corresponding to the model parameter.

After the significance test value $\mathrm{Wald}_k$ of the $k$-th model parameter has been determined, the $z_k$ statistic can also be obtained according to

$$z_k = \frac{\beta_k}{SE(\beta_k)}$$

and the p value can be obtained according to p_value = 2[1 - norm.cdf(|z_k|)], where norm.cdf is the cumulative distribution function of the normal distribution. When the p value is smaller than the significance level threshold, the null hypothesis is rejected, the model parameter can be retained for modeling, and the effective value of the feature item corresponding to the model parameter may be 1 or another higher value. When the p value is not smaller than the significance level threshold, the null hypothesis is accepted, the model parameter is not retained, and the effective value of the feature item corresponding to the model parameter may be 0 or another lower value. The significance level threshold may typically be 0.05, 0.01, or the like.
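The decision rule above can be sketched in plaintext as follows. The function name, the parameter values, and the standard errors are hypothetical, and the standard normal cumulative distribution function is computed with `math.erf` from the standard library rather than a statistics package.

```python
import math

def wald_effective_values(beta, se, alpha=0.05):
    """Return (wald, p_value, effective_value) for each model parameter."""
    results = []
    for b, s in zip(beta, se):
        z = b / s                       # z statistic
        wald = z * z                    # Wald test value, chi-square with 1 d.o.f.
        # Standard normal CDF evaluated at |z| via the error function.
        cdf = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
        p_value = 2.0 * (1.0 - cdf)     # two-sided p value
        effective = 1 if p_value < alpha else 0
        results.append((wald, p_value, effective))
    return results

beta = [1.2, 0.05, -0.8]     # hypothetical model parameters
se = [0.3, 0.2, 0.25]        # hypothetical standard errors
for wald, p, eff in wald_effective_values(beta, se):
    print(f"Wald={wald:.2f}  p={p:.4f}  effective={eff}")
```

In this sketch the effective value is simply 1 when the null hypothesis is rejected at level `alpha` and 0 otherwise, which is one of the options the text allows.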

Logistic regression analysis is a statistical method that distinguishes independent variables from dependent variables and determines the relationship between them. The regression equation that is built is meaningful only if the independent variables and the dependent variable really are related. Therefore, whether an independent variable is related to the prediction target (the dependent variable), how strong the relationship is, and how reliably that strength can be determined are the questions that regression analysis has to answer. Logistic regression analysis may use the Wald test to test the regression coefficients one by one. If the Wald test indicates that certain independent variables are significant, they should be included in the model; if it indicates that they are not significant, they may be omitted from the model. The model parameters of the business prediction model can be evaluated using logistic regression analysis and the Wald test, and the feature items of the object samples can then be screened based on the evaluation results, which achieves dimension reduction of the business data.

In this embodiment, in step S220, the correlation data includes covariance matrix data, and the correlation data slices include covariance matrix slices. Covariance matrix patches of multiple participants can constitute a covariance matrix assuming reconstruction. The covariance matrix is a matrix formed by the covariance between two feature items in a plurality of feature items in the joint data, wherein the elements on the main diagonal are the variances of the plurality of feature items, and the elements on the off-diagonal are the covariance between the two feature items. The covariance matrix is a symmetric matrix, and when there are D feature entries in the joint data, the covariance matrix may be a symmetric matrix in D × D dimensions.

When determining the pieces of correlation data corresponding to the plurality of participants respectively in step S220, that is, determining the pieces of covariance matrices corresponding to the plurality of participants respectively, the participant devices of the plurality of participants may perform the following steps 1 and 2.

Step 1: based on the joint data fragments and predicted value fragments of the participants and the functional relation in the business prediction model, determine the intermediate matrix fragments corresponding to the participants respectively. For example, the first participant A obtains the intermediate matrix fragment $\langle H \rangle_A$ and the second participant B obtains the intermediate matrix fragment $\langle H \rangle_B$; under assumed reconstruction, the intermediate matrix fragments yield the intermediate matrix $H$. The participants do not actually reconstruct the intermediate matrix; the reconstruction only expresses the relationship among the intermediate matrix fragments.

Step 2: based on the intermediate matrix fragments of the participants, compute the fragments of the inverse of the intermediate matrix corresponding to the participants respectively, thereby obtaining the covariance matrix fragments corresponding to the participants. For example, the first participant A obtains the fragment $\langle H^{-1} \rangle_A$ of the inverse of the intermediate matrix and the second participant B obtains the fragment $\langle H^{-1} \rangle_B$; under assumed reconstruction, these fragments yield the inverse $H^{-1}$ of the intermediate matrix. The participants do not actually reconstruct the inverse of the intermediate matrix; the reconstruction only expresses the relationship among the fragments.

In step 1, when determining the intermediate matrix segments corresponding to the multiple participants, determining the hessian matrix segments corresponding to the multiple participants as the intermediate matrix segments based on the joint data segments and the predicted value segments of the multiple participants and the hessian matrix expression obtained based on the functional relation in the service prediction model; the Hessian matrix expression comprises a joint data matrix and a predicted value matrix.

When the business prediction model is a logistic regression model, its functional relation, that is, the functional relation of the model's predicted value, is as shown in formula (1) above; after the logistic regression model is trained, the corresponding model parameters, for example $\beta$, are obtained. The Hessian matrix expression is in fact the second derivative with respect to the model parameter $\beta$. From equations (1) to (5) above, the Hessian matrix expression obtained from the functional relation in the business prediction model is

$$H = X^{T} M X \qquad (9)$$

Through secure interaction among the participant devices, based on the joint data fragments $\langle X \rangle$ held by the participants and on fragments of the matrix $M$ obtained from the predicted values $\pi(X_N)$, the participants can each determine a fragment of $H$ using formula (9), and this fragment of $H$ serves as the intermediate matrix fragment. $M$ may be referred to as the predicted value matrix.
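One common way to multiply secret-shared matrices is with a Beaver triple supplied by a trusted dealer. The specification does not say which secure multiplication protocol is used, so the sketch below is only an assumed instantiation: the modulus `P`, the helpers `share` and `beaver_matmul`, the dealer, and the integer stand-in for $M$ are all hypothetical. Real predicted values would additionally need a fixed-point encoding, which is omitted here.

```python
import numpy as np

P = 65521  # small illustrative prime so that int64 matrix products cannot overflow
rng = np.random.default_rng(7)

def share(x):
    """Additively share x between two parties modulo P."""
    r = rng.integers(0, P, size=x.shape, dtype=np.int64)
    return r, (x - r) % P

def beaver_matmul(x_sh, y_sh):
    """Return shares of X @ Y given shares of X and Y (dealer-assisted sketch)."""
    n, k = x_sh[0].shape
    m = y_sh[0].shape[1]
    # A trusted dealer (an assumption) samples a matrix triple with C = A @ B.
    A = rng.integers(0, P, size=(n, k), dtype=np.int64)
    B = rng.integers(0, P, size=(k, m), dtype=np.int64)
    a_sh, b_sh, c_sh = share(A), share(B), share((A @ B) % P)
    # The parties open the masked differences E = X - A and F = Y - B.
    E = (x_sh[0] - a_sh[0] + x_sh[1] - a_sh[1]) % P
    F = (y_sh[0] - b_sh[0] + y_sh[1] - b_sh[1]) % P
    z0 = (c_sh[0] + E @ b_sh[0] + a_sh[0] @ F + E @ F) % P
    z1 = (c_sh[1] + E @ b_sh[1] + a_sh[1] @ F) % P
    return z0, z1

X = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int64)
M = np.diag(np.array([2, 1, 3], dtype=np.int64))   # integer stand-in for M
x_sh, m_sh = share(X), share(M)

mx_sh = beaver_matmul(m_sh, x_sh)                   # shares of M @ X
xt_sh = (x_sh[0].T, x_sh[1].T)                      # transposing shares is local
h_sh = beaver_matmul(xt_sh, mx_sh)                  # shares of X^T M X

H = (h_sh[0] + h_sh[1]) % P                         # reconstruction, for checking only
print(np.array_equal(H, (X.T @ M @ X) % P))
```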

In an application scenario, the joint data X is a high-dimensional matrix, and the number N of objects is usually in the hundreds of thousands, millions, or more. As a result, when $H = X^{T} M X$ is calculated from the fragment data of the multiple participants, the amount of interaction data is very large and the processing efficiency is low. To simplify the calculation of the fragments of H and to reduce the interaction data between the participants as far as possible, the form of the matrix M may be transformed so that the participants can determine the fragments of H more simply.

Specifically, when the first participant device uses the joint data fragment $\langle X\rangle_A$ and formula (9) to determine the Hessian matrix fragments $\langle H\rangle$ corresponding to the participants, the following steps 1a to 3a may be performed.

Step 1a: using secret sharing multiplication and based on the expression of the predicted value matrix, element-wise multiplication is performed on the predicted value fragments of the multiple participants, so that each participant obtains an intermediate vector fragment.

For example, in the two-party case, the first participant A and the second participant B may perform element-wise multiplication on their predicted value fragments using secret sharing multiplication, obtaining the intermediate vector fragment of A and the intermediate vector fragment of B. Under hypothetical reconstruction, the intermediate vector fragments of the participants yield the intermediate vector. The participants do not actually reconstruct the intermediate vector; the relation is stated only to show how the fragments relate to one another.

Step 2a: the elements of the intermediate vector fragment of the first participant A are taken as diagonal elements to construct the diagonalized predicted value matrix fragment of A.

The second participant device, like any other participant device, likewise takes the elements of the intermediate vector fragment of the second participant B as diagonal elements to construct the diagonalized predicted value matrix fragment of B.

Step 3a: based on the joint data fragments $\langle X\rangle$ of the multiple participants, the predicted value matrix fragments, and the Hessian matrix expression, the Hessian matrix fragments corresponding to the respective participants are determined. For example, the first participant A and the second participant B may determine the Hessian matrix fragments $\langle H\rangle_A$ and $\langle H\rangle_B$ by secret sharing matrix multiplication.

Through steps 1a and 2a, each participant obtains a diagonalized predicted value matrix fragment from its predicted value fragments. In a diagonalized matrix only the main diagonal elements can be non-zero; all off-diagonal elements are 0. This simplifies the predicted value matrix and thereby improves processing efficiency.

In step 1a, the expression of the predictor matrix M includes

$$\pi(X_N)\,[\pi(X_N) - 1] \qquad (10)$$

Thus, the predicted value fragments owned by the respective participants, such as the predicted value fragment $\langle\pi\rangle_A$ of the first participant A and the predicted value fragment $\langle\pi\rangle_B$ of the second participant B, can be used to obtain another form of formula (10):

$$(\langle\pi\rangle_A + \langle\pi\rangle_B)\,(\langle\pi\rangle_A + \langle\pi\rangle_B - 1) = \langle\text{intermediate vector}\rangle_A + \langle\text{intermediate vector}\rangle_B \qquad (11)$$

The element-wise multiplication can be performed between the participants according to formula (11) using secret sharing multiplication. That is, each group of corresponding predicted value fragments of the participants is taken as the input of a secret sharing multiplication, which is carried out according to the form of the predicted value matrix expression and outputs one element of each participant's intermediate vector fragment. The elements obtained for all groups of predicted value fragments form the intermediate vector fragments. Under hypothetical reconstruction, the intermediate vector fragments yield the intermediate vector.

For example, each predicted value fragment $\langle\pi\rangle_A$ of the first participant A and the corresponding predicted value fragment $\langle\pi\rangle_B$ of the second participant B can be taken as the input of a secret sharing multiplication carried out according to formula (11). The multiplication outputs, for the first participant A and the second participant B respectively, an element of the $\langle\text{intermediate vector}\rangle_A$ fragment and an element of the $\langle\text{intermediate vector}\rangle_B$ fragment.
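One way to see which part of formula (11) actually needs interaction is to expand the product. The split below is an illustrative algebraic identity, not a statement of how the embodiment assigns the terms to the participants:

```latex
\begin{aligned}
(\langle\pi\rangle_A + \langle\pi\rangle_B)\,(\langle\pi\rangle_A + \langle\pi\rangle_B - 1)
  &= \underbrace{\langle\pi\rangle_A(\langle\pi\rangle_A - 1)}_{\text{computable by A alone}}
   + \underbrace{\langle\pi\rangle_B(\langle\pi\rangle_B - 1)}_{\text{computable by B alone}}
   + \underbrace{2\,\langle\pi\rangle_A\,\langle\pi\rangle_B}_{\text{cross term}}
\end{aligned}
```

Only the cross term mixes data held by different participants, so under this reading it is the only term that would require a secret sharing multiplication.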

In step 2a, the first participant A takes the elements of its $\langle\text{intermediate vector}\rangle_A$ fragment as diagonal elements and constructs the diagonal matrix $\langle\Lambda\rangle_A$, which is the diagonalized predicted value matrix fragment of the first participant A. The second participant B takes the elements of its $\langle\text{intermediate vector}\rangle_B$ fragment as diagonal elements and constructs the diagonal matrix $\langle\Lambda\rangle_B$, which is its diagonalized predicted value matrix fragment. When the $\langle\text{intermediate vector}\rangle_A$ fragment has dimension N, the constructed diagonal matrix has dimension N × N. The diagonal elements of the predicted value matrix fragment $\langle\Lambda\rangle_A$ are the elements of the $\langle\text{intermediate vector}\rangle_A$ fragment, and its off-diagonal elements are all 0.

In step 3a, the matrix M in the Hessian matrix expression $H = X^{T} M X$ can be replaced by the predicted value matrix Λ, so the expression becomes $H = X^{T}\Lambda X$. The first participant A and the second participant B may use secret sharing matrix multiplication (SMM), based on the joint data fragment $\langle X\rangle_A$ and predicted value matrix fragment $\langle\Lambda\rangle_A$ of the first participant A and on the joint data fragment $\langle X\rangle_B$ and predicted value matrix fragment $\langle\Lambda\rangle_B$ of the second participant B, to determine according to $H = X^{T}\Lambda X$ the Hessian matrix fragment $\langle H\rangle_A$ of the first participant A and the Hessian matrix fragment $\langle H\rangle_B$ of the second participant B.

Since each predicted value matrix fragment is a diagonal matrix, it contains a large number of 0 elements, and its dimension is N × N. In a business scenario, the sample size N is very large, for example on the order of one hundred thousand, one million, or more, so the dimensionality of the joint data X is very high. When performing the secret sharing matrix multiplication of $X^{T}$ with the diagonal matrix Λ, a more concise approach may be adopted to improve execution efficiency and reduce the communication traffic between the participants.

That is, when the safe multiplication operations of the joint data fragments and the predicted value matrix fragments of the multiple participants are calculated, the column vectors in the joint data fragments are respectively subjected to the safe multiplication operations with the corresponding diagonal elements in the predicted value matrix fragments.

The predicted value matrix fragments are diagonal matrices: only the elements on the main diagonal can be non-zero, and all off-diagonal elements are 0. When the joint data fragment is multiplied with the predicted value matrix fragment, the joint data fragment can be cut into column vectors, and each column vector is multiplied only by the corresponding diagonal element of the predicted value matrix fragment, that is, by a non-0 element. The multiplications of column vectors with 0 elements, whose results are 0, can be omitted. In this way the high-dimensional matrix multiplication among the participants is decomposed, which saves a large amount of computation and reduces the communication traffic among the participants. In privacy-preserving scenarios, communication traffic plays a decisive role in processing efficiency.
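The saving can be illustrated on plaintext data. The following minimal sketch ignores secret sharing entirely and only checks that scaling the columns of $X^{T}$ by the diagonal elements gives the same result as the dense product with the full N × N diagonal matrix; the helper names and toy values are hypothetical.

```python
def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0 for j in range(n)] for i in range(n)]


def scale_columns(xt, diagonal):
    """Compute X^T * Lambda by scaling column j of X^T by diagonal[j]."""
    return [[entry * diagonal[j] for j, entry in enumerate(row)] for row in xt]


# Toy data: N = 3 objects, D = 2 feature items.
X = [[1, 2], [3, 4], [5, 6]]
lam = [2, -1, 3]

dense = matmul(transpose(X), diag(lam))     # D * N * N scalar multiplications
fast = scale_columns(transpose(X), lam)     # D * N scalar multiplications
```

The dense route multiplies every entry of $X^{T}$ against N entries of Λ, almost all of them 0, while the column-scaling route touches each entry once.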

How the multiplication of column vectors with non-0 elements reduces traffic is described below in conjunction with matrix expressions. In the Hessian matrix expression $H = X^{T}\Lambda X$, the specific form of $X^{T}\Lambda$ is

Where X is the joint data, T is the matrix transpose symbol, the predicted value

The calculation of the first column of $X^{T}\Lambda$ is explained below as an example. To obtain the first column of $X^{T}\Lambda$, each element of the vector $x = (x_{11}, \ldots, x_{1D})$ must be multiplied by the first diagonal element of Λ. Taking the multiplication between the first participant A and the second participant B as an example, refer to the flowchart shown in fig. 3, which is a schematic diagram of the calculation flow of the secret sharing matrix multiplication applied in this embodiment.

The first participant A owns the D × 1 vector fragment $\langle x\rangle_A$ and the 1 × 1 value fragment $\langle m\rangle_A$, where m is used as shorthand for that diagonal element. The second participant B owns the D × 1 vector fragment $\langle x\rangle_B$ and the 1 × 1 value fragment $\langle m\rangle_B$.

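In step 1, both participants obtain a random number triple. The first participant A obtains $\langle u\rangle_A$ (D × 1), $\langle v\rangle_A$ (1 × 1), and $\langle z\rangle_A$ (D × 1); the second participant B obtains $\langle u\rangle_B$ (D × 1), $\langle v\rangle_B$ (1 × 1), and $\langle z\rangle_B$ (D × 1). The triple satisfies $z = u \cdot v$, where $z = \langle z\rangle_A + \langle z\rangle_B$, $u = \langle u\rangle_A + \langle u\rangle_B$, and $v = \langle v\rangle_A + \langle v\rangle_B$. Here D × 1 and 1 × 1 are the matrix dimensions.

In step 2, the first participant A masks its private data with the random numbers to obtain its secret matrices, calculating $\langle d\rangle_A = \langle x\rangle_A - \langle u\rangle_A$ and $\langle e\rangle_A = \langle m\rangle_A - \langle v\rangle_A$. The second participant B likewise masks its private data with the random numbers, calculating $\langle d\rangle_B = \langle x\rangle_B - \langle u\rangle_B$ and $\langle e\rangle_B = \langle m\rangle_B - \langle v\rangle_B$.

In step 3, the participants send their secret matrices to each other and combine their own secret matrices with the received ones. The first participant A sends $\langle d\rangle_A$ and $\langle e\rangle_A$ to the second participant B, and the second participant B sends $\langle d\rangle_B$ and $\langle e\rangle_B$ to the first participant A. Both participants then calculate $d = \langle d\rangle_A + \langle d\rangle_B$ and $e = \langle e\rangle_A + \langle e\rangle_B$, so that $d = x - u$ and $e = m - v$.

In step 4, the participants each calculate their own data fragment. The first participant A calculates $\langle Y\rangle_A = \langle z\rangle_A + \langle u\rangle_A \cdot e + d \cdot \langle v\rangle_A + d \cdot e$, and the second participant B calculates $\langle Y\rangle_B = \langle z\rangle_B + \langle u\rangle_B \cdot e + d \cdot \langle v\rangle_B$. The sum is $z + u \cdot e + d \cdot v + d \cdot e$, which with $z = u \cdot v$ expands to $x \cdot m$, so $\langle Y\rangle_A + \langle Y\rangle_B = x \cdot m$.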

thus, the first party a and the second party B are not exposing private data<x>AAnd<m>Aand<x>Band<m>Bin the case of (2), the slices are obtained separately<Y>AAnd<Y>Bthese two slices, when assuming reconstruction, can result in the product of the vector x and the value m. Then, every time matrix multiplication is performed, the amount of communication between the participants including the data communication performed in the above-described step 3 is 2(D +1), and X is calculatedTThe traffic required for Λ is 2(D +1) × N. This reduces the amount of traffic compared to the traffic 2(D × N + N × N) required for general matrix multiplication computation.

In the manner described above, the multiple participants multiply each column of $X^{T}$ by the corresponding diagonal element of Λ. Any one participant, for example the first participant A, thereby obtains a number of fragments $\langle Y\rangle_A$; splicing these fragments gives the matrix that is this participant's fragment of $X^{T}\Lambda$.

After the multiple participants have jointly calculated $X^{T}\Lambda$, SMM may be used to determine the fragments of the Hessian matrix $H = X^{T}\Lambda X$, based on the fragments $\langle X^{T}\Lambda\rangle$ and the joint data fragments $\langle X\rangle$ owned by the respective participants.

The following describes fragment matrix multiplication using SMM, taking two participants as an example. The first participant A owns the fragment $\langle X^{T}\Lambda\rangle_A$ and the joint data fragment $\langle X\rangle_A$, and the second participant B owns the fragment $\langle X^{T}\Lambda\rangle_B$ and the joint data fragment $\langle X\rangle_B$. The goal is to compute $X^{T}\Lambda X$ such that the first participant A obtains $\langle X^{T}\Lambda X\rangle_A$, the second participant B obtains $\langle X^{T}\Lambda X\rangle_B$, and $\langle X^{T}\Lambda X\rangle_A + \langle X^{T}\Lambda X\rangle_B = X^{T}\Lambda X$.

For the processing between the first participant A and the second participant B, refer to the schematic diagram of fig. 3: the data $\langle x\rangle_A$ of the first participant A is replaced by $\langle X^{T}\Lambda\rangle_A$ and $\langle m\rangle_A$ by $\langle X\rangle_A$; the data $\langle x\rangle_B$ of the second participant B is replaced by $\langle X^{T}\Lambda\rangle_B$ and $\langle m\rangle_B$ by $\langle X\rangle_B$; and the matrix dimensions of each parameter are adjusted accordingly. Based on the flow shown in fig. 3, the first participant A and the second participant B then obtain the Hessian matrix fragments $\langle X^{T}\Lambda X\rangle_A$ and $\langle X^{T}\Lambda X\rangle_B$ respectively. In fig. 3, $\langle X^{T}\Lambda X\rangle_A$ corresponds to $\langle Y\rangle_A$ and $\langle X^{T}\Lambda X\rangle_B$ corresponds to $\langle Y\rangle_B$.

The operations performed by the first party a and the second party B are actually performed by the party devices corresponding to the parties.

Returning to step 2: the step of calculating, based on the intermediate matrix fragments $\langle H\rangle$ of the multiple participants, the fragments $\langle H^{-1}\rangle$ of the inverse of the intermediate matrix, and thereby obtaining the covariance matrix fragments $\langle Cov\rangle$ corresponding to the respective participants, may be performed with a secret sharing matrix inversion (SMI) algorithm, which obtains the covariance matrix fragments $\langle Cov\rangle$ through iterative computation based on the intermediate matrix fragments $\langle H\rangle$. The covariance matrix equals the inverse of the intermediate matrix, $Cov = H^{-1}$.

For example, given the intermediate matrix fragment $\langle H\rangle_A$ of the first participant A and the intermediate matrix fragment $\langle H\rangle_B$ of the second participant B, $\langle H^{-1}\rangle_A$ and $\langle H^{-1}\rangle_B$ can be calculated iteratively using SMI. Under hypothetical reconstruction, the fragments $\langle H\rangle_A$ and $\langle H\rangle_B$ yield the intermediate matrix H, and $H^{-1}$ is the inverse of H, but the first participant A and the second participant B do not reconstruct H. It is therefore necessary, given $\langle H\rangle_A$ and $\langle H\rangle_B$ and without reconstructing H, for the first participant A and the second participant B to determine $\langle H^{-1}\rangle_A$ and $\langle H^{-1}\rangle_B$ respectively. Since the intermediate matrix H is not reconstructed, leakage of private data is avoided.

The following describes the iterative calculation of covariance matrix fragments using SMI, taking two participants as an example. The first participant A owns the intermediate matrix fragment $\langle H\rangle_A$ and the second participant B owns the intermediate matrix fragment $\langle H\rangle_B$, with $H = \langle H\rangle_A + \langle H\rangle_B$. The goal is for the first participant A to obtain $\langle H^{-1}\rangle_A$ and the second participant B to obtain $\langle H^{-1}\rangle_B$ such that $H^{-1} = \langle H^{-1}\rangle_A + \langle H^{-1}\rangle_B$.

During initialization, the first participant A and the second participant B each obtain $L_0$ through joint calculation:

$$L_0 = \mathrm{tr}(H)^{-1} = [\mathrm{tr}(\langle H\rangle_A) + \mathrm{tr}(\langle H\rangle_B)]^{-1}$$

where tr denotes the trace of a matrix.

In each iteration, the multiple participants use SMM to calculate according to the following iteration formula:

$$L_{k+1} = L_k\,(2I - H L_k) = (\langle L_k\rangle_A + \langle L_k\rangle_B)\,[\,2I - (\langle H\rangle_A + \langle H\rangle_B)(\langle L_k\rangle_A + \langle L_k\rangle_B)\,]$$

where I is the identity matrix and k is the iteration index. Each iteration requires 2 SMM operations. The number of iterations may be preset, for example to between 20 and 32.
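The behaviour of this iteration can be checked on plaintext data. The sketch below is not the secret-shared protocol: it runs the same update on an ordinary matrix, reads $L_0 = \mathrm{tr}(H)^{-1}$ as that scalar times the identity (an interpretation made here), and uses a small symmetric positive definite toy matrix, for which the iteration converges. Function names are hypothetical.

```python
def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def inverse_by_iteration(h, rounds=20):
    """Approximate H^{-1} with L_{k+1} = L_k (2I - H L_k)."""
    n = len(h)
    trace = sum(h[i][i] for i in range(n))
    # L_0 = tr(H)^{-1}, interpreted here as tr(H)^{-1} times the identity.
    approx = [[1.0 / trace if i == j else 0.0 for j in range(n)]
              for i in range(n)]
    for _ in range(rounds):
        h_l = matmul(h, approx)                      # first product: H L_k
        correction = [[(2.0 if i == j else 0.0) - h_l[i][j]
                       for j in range(n)] for i in range(n)]  # 2I - H L_k
        approx = matmul(approx, correction)          # second product
    return approx


H = [[4.0, 1.0], [1.0, 3.0]]      # toy symmetric positive definite matrix
H_inv = inverse_by_iteration(H)
check = matmul(H, H_inv)          # should be close to the identity
```

The two `matmul` calls per round correspond to the 2 SMM operations per iteration mentioned in the text. The residual $I - H L_k$ is squared in every round, which is why a few tens of rounds suffice.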

Returning to step S230: when determining, based on the model parameter fragments and covariance matrix fragments of the multiple participants, the effective value of the feature item corresponding to a model parameter in improving the effect of the business prediction model, the Wald test formula (2) or formula (8) may be used to calculate a significance test value (or significance level value) for the k-th model parameter, and the effective value of the feature item corresponding to the model parameter in improving the effect of the business prediction model is then determined based on the significance test value and an initial hypothesis.

When determining $Wald_k$ or $z_k$, the numerator is the model parameter and the denominator is the standard deviation of the model parameter. The standard deviation can be obtained as the square root of the variance of the model parameter, and the diagonal elements of the covariance matrix are the variances of the corresponding model parameters. A secret sharing root number inverse (SNSI) algorithm, that is, a secret-shared inverse square root, may then be used to determine the effective value of the feature item corresponding to the model parameter based on the model parameter fragments and covariance matrix fragments of the multiple participants. Specifically, the following steps 1b and 2b may be included.

And step 1b, the plurality of participant devices take diagonal elements in the covariance matrix fragments of the plurality of participants as variance fragments respectively corresponding to the plurality of model parameters. The diagonal element here may refer to the main diagonal element. In the covariance matrix, the main diagonal element is the variance of the feature term. Correspondingly, in covariance matrix slicing, the main diagonal elements are variance slices of feature items.

Step 2b: for any model parameter, the first participant device uses the SNSI algorithm and the significance test method and, based on the corresponding model parameter fragment of the first participant A and the corresponding variance fragments of the multiple participants, jointly performs a secure inverse square root operation through interaction among the multiple participant devices to determine the significance test value fragment of the first participant A for that model parameter. The effective value of the feature item corresponding to the model parameter is then determined based on the significance test value fragments of the multiple participants for that model parameter.

Similarly, for any model parameter, the second participant device uses the SNSI algorithm and the significance test method and, based on the corresponding model parameter fragment of the second participant B and the corresponding variance fragments of the multiple participants, jointly performs the secure inverse square root operation through interaction among the multiple participant devices to determine the significance test value fragment of the second participant B for that model parameter.

In one embodiment, the saliency check value slices of multiple participants may be sent to a certain participant device or a third-party device, the saliency check value is reconstructed by the participant device or the third-party device, and based on the saliency check value, the effective value of the corresponding feature item may be determined according to a predetermined transformation manner. In another embodiment, the significance check value slices of the multiple participants can also be directly used as valid value slices, and the multiple significance check value slices can be reconstructed to obtain valid values.

The significance check value may be calculated based on formula (2), formula (8), or the p_value formula, and the resulting significance check value fragment may be, but is not limited to, a $Wald_k$ value fragment, a $z_k$ value fragment, or a p-value fragment.
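For orientation, the plaintext quantity that the fragments reconstruct to can be sketched as follows. This assumes that $z_k$ is the model parameter divided by its standard deviation, as described above, and that the p-value is the usual two-sided tail probability of a standard normal variable; the p_value formula itself is not reproduced in this section, so that part is an assumption. Function names and toy numbers are hypothetical.

```python
import math


def z_statistic(beta_k, variance_k):
    """Model parameter divided by its standard deviation."""
    return beta_k / math.sqrt(variance_k)


def two_sided_p_value(z):
    """P(|Z| >= |z|) for a standard normal Z (assumed form of the p-value)."""
    return math.erfc(abs(z) / math.sqrt(2.0))


z = z_statistic(0.5, 0.04)     # standard deviation 0.2, so z = 2.5
p = two_sided_p_value(z)       # roughly 0.0124
```

A small p-value under the initial hypothesis that the parameter is zero would indicate that the corresponding feature item contributes to the model.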

Under hypothetical reconstruction, the model parameter fragments of the multiple participants yield the model parameters. For example, for any model parameter $\beta_1$, the model parameter fragment $\langle\beta_1\rangle_A$ of the first participant A and the model parameter fragment $\langle\beta_1\rangle_B$ of the second participant B yield the model parameter $\beta_1$ under hypothetical reconstruction. The model parameter fragments are not actually reconstructed; the relation is stated only to show how the fragments relate to the model parameter.

It can be seen that, in the embodiment, when the significance test value is calculated, diagonal elements in covariance matrix fragments of multiple participants are used, and data in the covariance matrix is not reconstructed, so that security of private data in the covariance matrix can be well protected.

For step 2b, the following describes, for any model parameter $\beta_k$, how the first participant device uses the SNSI algorithm and the significance test method and, based on the model parameter fragment $\langle\beta_k\rangle_A$ of the first participant A and the variance fragments of the multiple participants, jointly performs the secure inverse square root operation through interaction among the multiple participant devices to determine the significance check value fragment of the first participant A for $\beta_k$. In the same way, the second participant device can determine the significance check value fragment of the second participant B for $\beta_k$.

Take formula (8) of the significance test method as an example. For the first participant, formula (8) can be rewritten as

$$\langle z_k\rangle_A = \frac{\langle\beta_k\rangle_A}{\sqrt{\langle Cov_{kk}\rangle_A + \langle Cov_{kk}\rangle_B}} \qquad (12)$$

where $\langle z_k\rangle_A$ is the significance check value fragment of the first participant A for the model parameter $\beta_k$. The numerator is the model parameter fragment of the first participant A. In the denominator, $\langle Cov_{kk}\rangle_A$ is the variance fragment owned by the first participant A for the model parameter $\beta_k$, which is the k-th diagonal element of the covariance matrix fragment of the first participant A, and $\langle Cov_{kk}\rangle_B$ is the variance fragment owned by the second participant B for the model parameter $\beta_k$, which is the k-th diagonal element of the covariance matrix fragment of the second participant B.

The numerator is owned by the first participant A, while the denominator is owned jointly by the first participant A and the second participant B. The problem therefore reduces to calculating the inverse square root in formula (12). In this embodiment, the SNSI algorithm is used to determine the inverse square root of the sum of the variance fragments of the first participant A and the second participant B for the model parameter $\beta_k$; based on this inverse square root and the model parameter fragment $\langle\beta_k\rangle_A$ of the first participant A, the significance check value fragment of the first participant A for $\beta_k$ is obtained. The inverse square root in formula (12) is

$$(\langle Cov_{kk}\rangle_A + \langle Cov_{kk}\rangle_B)^{-1/2}$$

Steps 1c to 3c below describe in detail how this inverse square root is calculated with the SNSI algorithm. For convenience of description, let $n_a = \langle Cov_{kk}\rangle_A$ and $n_b = \langle Cov_{kk}\rangle_B$, and let n denote the variance of the model parameter $\beta_k$, that is, $n = n_a + n_b$. The goal of the calculation is for the first participant device to obtain $c_a$ and the second participant device to obtain $c_b$ such that $c_a + c_b = (n_a + n_b)^{-1/2} = n^{-1/2}$.

Step 1c: through interaction, the first participant device and the second participant device convert the additive fragments into multiplicative fragments.

The first participant device locally generates a random number $x_a$ and computes $x_{ba1} = n_a / x_a$.

The first participant device and the second participant device jointly calculate $n_b / x_a$ through secret sharing matrix multiplication, obtaining $x_{ba2}$ and $x_{bb}$ respectively.

The first participant device calculates $x_{ba} = x_{ba1} + x_{ba2}$ and sends $x_{ba}$ to the second participant device ($x_{ba1}$ and $x_{ba2}$ must not be sent separately).

The second participant device calculates $x_b = x_{ba} + x_{bb}$, so that $n = x_a \times x_b$. This converts the additive fragments $n = n_a + n_b$ into the multiplicative fragments $n = x_a \times x_b$. At this point, the first participant A owns $x_a$ and the second participant B owns $x_b$.
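The conversion can be checked with exact arithmetic. The sketch below is a single-process simulation under two assumptions that are not confirmed by the text: the local term is taken to be $n_a / x_a$, and the jointly computed cross term is taken to be $n_b / x_a$, whose additive split into $x_{ba2}$ and $x_{bb}$ is simulated directly rather than produced by a real secret sharing multiplication. The function name is hypothetical.

```python
from fractions import Fraction
import random


def additive_to_multiplicative(n_a, n_b, rng):
    """Turn n = n_a + n_b into n = x_a * x_b (simulated, exact arithmetic)."""
    # Participant A: random non-zero mask x_a and the assumed local term.
    x_a = Fraction(rng.randint(1, 1000), rng.randint(1, 1000))
    x_ba1 = n_a / x_a
    # Assumed cross term n_b / x_a; in the protocol it would come out of a
    # secret sharing multiplication as x_ba2 (at A) plus x_bb (at B).
    cross = n_b / x_a
    x_ba2 = Fraction(rng.randint(-1000, 1000), 7)
    x_bb = cross - x_ba2
    # A sends only the sum x_ba; B adds its own piece.
    x_ba = x_ba1 + x_ba2
    x_b = x_ba + x_bb
    return x_a, x_b


rng = random.Random(1)
n_a, n_b = Fraction(3, 10), Fraction(-1, 20)   # toy variance fragments, n = 1/4
x_a, x_b = additive_to_multiplicative(n_a, n_b, rng)
```

Sending only the sum $x_{ba}$ means the second participant sees a value blinded by $x_{ba2}$ rather than $n_a / x_a$ itself.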

And 2c, respectively carrying out initialization of iterative estimation values locally by the two participant devices.

Taking the first participant A as an example, the first participant device reinterprets the bits of the 64-bit floating-point number $x_a$ as a 64-bit integer and shifts it right by one bit (dividing by 2 and rounding down), denoting the result $int_a$. It then calculates 0x5fe6eb50c7b537a9 - $int_a$ and reinterprets the result according to the storage format of a 64-bit floating-point number, denoting it $y_a$. In this way, the estimate for $x_a$ is initialized to $y_a$.

Similarly, the second participant device performs the same initialization, so that the estimate for $x_b$ is initialized to $y_b$. At this point, the first participant A owns $y_a$ and the second participant B owns $y_b$.
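The bit manipulation in step 2c can be sketched with the standard `struct` module. The constant is the one given in the text; the function name and the toy values are hypothetical, and the input is assumed to be a positive, finite 64-bit float.

```python
import math
import struct

MAGIC = 0x5FE6EB50C7B537A9  # constant given in step 2c


def initial_inverse_sqrt(x):
    """Rough estimate of x ** -0.5 from the bit pattern of a positive double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))   # float bits as integer
    guess_bits = MAGIC - (bits >> 1)                      # shift right, subtract
    (guess,) = struct.unpack("<d", struct.pack("<Q", guess_bits))
    return guess


# Toy multiplicative fragments with n = x_a * x_b = 0.25.
x_a, x_b = 0.8, 0.3125
y_a, y_b = initial_inverse_sqrt(x_a), initial_inverse_sqrt(x_b)
relative_error = abs(y_a * y_b * math.sqrt(x_a * x_b) - 1.0)
```

Because the inverse square root of a product is the product of the inverse square roots, each participant can initialize its own factor locally and the product $y_a \times y_b$ is already a rough estimate of $n^{-1/2}$, good to within a few percent.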

Step 3c: the two participants jointly calculate $n^{-1/2}$ iteratively using Newton's method.

The initial value of the iteration is $Y_0 = Y_{0a} \times Y_{0b} = y_a \times y_b$, whose factors are owned by the two participants respectively. The iterative formula is as follows

In the iteration process, secret sharing matrix multiplication is used twice and one iteration is performed in total, after which the first participant A and the second participant B obtain the floating-point numbers $c_a$ and $c_b$ respectively.
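As a plaintext reference, the textbook Newton update for the inverse square root is $Y_{k+1} = Y_k\,(1.5 - 0.5\,n\,Y_k^{2})$. Whether the embodiment uses exactly this form, and how it distributes the update over the two participants' factors, is not shown in this section, so the sketch below is an assumption that only illustrates how quickly a rough initial estimate is refined. The function name and the number of rounds are hypothetical.

```python
def newton_inverse_sqrt(n, y0, rounds=5):
    """Refine an estimate of n ** -0.5 with the standard Newton update."""
    y = y0
    for _ in range(rounds):
        # One Newton step for f(y) = 1 / y**2 - n.
        y = y * (1.5 - 0.5 * n * y * y)
    return y


n = 0.25                       # toy variance; the exact answer is 2.0
rough = 1.9                    # an initial estimate that is 5% too small
after_one = newton_inverse_sqrt(n, rough, rounds=1)
refined = newton_inverse_sqrt(n, rough)
```

Starting from a 5% error, a single step already brings the estimate to within about 0.4% of the true value, and each further step roughly squares the relative error.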

The implementation process of step 2b may also be implemented in other manners. For example, firstly, the variance fragment of the first party a and the variance fragment of the second party B are subjected to security standardization, then an iteration initial value is obtained through linear approximation calculation, and finally iteration is performed based on the Goldschmidt algorithm. In this embodiment, the secret sharing matrix multiplication operation may be performed based on the variance shard of the first party a and the variance shard of the second party B, and then other operations may be performed.

In this specification, "first" in terms such as first participant and first feature item, and "second" in terms such as second feature item, are used only for ease of distinction and description and have no limiting meaning.

In this specification, the number of the plurality of participants may be 2, 3 or more, each participant performs various operations through a corresponding participant device, and the participant device may be implemented by any device, platform, device cluster, etc. having computing and processing capabilities.

In the embodiments of the present specification, the case of two participants is described in more detail. For example, in the description of multi-party secure computation algorithms such as secret sharing matrix multiplication, secret sharing root number inversion, and secret sharing matrix inversion, the two-party implementation can easily be extended to scenarios with more participants, and the detailed process is not repeated.

The foregoing describes certain embodiments of the present specification, and other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily have to be in the particular order shown or in sequential order to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.

Fig. 4 is a schematic block diagram of an apparatus for determining a valid value of a service data feature, according to an embodiment. The business data are distributed in a plurality of participants, the business data of each of the participants form combined data under the condition of splicing, and the combined data comprises characteristic values of a plurality of objects aiming at a plurality of characteristic items; the apparatus 400 is deployed in any first participant device, which may be implemented by any apparatus, device, platform, cluster of devices, etc. having computing and processing capabilities. This embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2. The apparatus 400 comprises:

an obtaining module 410, configured to obtain joint data fragments of the first participant, obtain predicted value fragments corresponding to the plurality of objects, respectively, and model parameter fragments corresponding to the plurality of feature items, respectively; the predicted value fragment and the model parameter fragment are obtained based on a trained service prediction model;

an interaction module 420, configured to determine, by using multi-party security computation, relevance data segments corresponding to a plurality of parties, respectively, based on joint data segments and predicted value segments of the plurality of parties, through interaction between the plurality of party devices, where the relevance data segments include relevance data between a plurality of feature items;

the checking module 430 is configured to determine, by using a significance checking method, effective values of feature items corresponding to model parameters in improving the effect of the business prediction model based on the model parameter segments of the multiple participants and corresponding data in the relevance data segments through secure interaction between the multiple participant devices.

In one embodiment, the obtaining module 410, when obtaining the federated data segment of the first participant, includes:

adopting a secret sharing addition, and carrying out splitting and splicing operations based on the service data of a plurality of participants through interaction with other participant equipment, so that the plurality of participants respectively obtain joint data fragments; the federated data fragments of multiple participants result in the federated data assuming reconstruction.
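The splitting and splicing can be sketched for two participants as follows. This is a minimal single-process illustration that assumes additive shares modulo a prime `P` chosen here and assumes that each participant holds different feature columns for the same objects, so that splicing is a column-wise concatenation; neither detail is fixed by the text, and the function names are hypothetical.

```python
import random

P = 2**61 - 1  # assumed prime modulus for the share arithmetic


def split(matrix, rng):
    """Split a matrix into two additive shares modulo P."""
    kept = [[rng.randrange(P) for _ in row] for row in matrix]
    sent = [[(v - k) % P for v, k in zip(row, krow)]
            for row, krow in zip(matrix, kept)]
    return kept, sent


def splice(left, right):
    """Concatenate two matrices column-wise."""
    return [lrow + rrow for lrow, rrow in zip(left, right)]


rng = random.Random(42)
X_A = [[1, 2], [3, 4]]    # participant A's feature columns for two objects
X_B = [[5], [6]]          # participant B's feature column for the same objects

a_kept, a_sent = split(X_A, rng)   # a_sent goes to B
b_kept, b_sent = split(X_B, rng)   # b_sent goes to A
fragment_A = splice(a_kept, b_sent)
fragment_B = splice(a_sent, b_kept)
joint = [[(p + q) % P for p, q in zip(ra, rb)]
         for ra, rb in zip(fragment_A, fragment_B)]
```

Each joint data fragment on its own is uniformly random; only the sum of both fragments would give back the joint data, and that sum is never formed in the method.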

In one embodiment, the business prediction model is obtained by secure joint training based on the respective joint data fragments of the multiple participants; the business prediction model is used for business prediction on the object.

In an embodiment, when obtaining the predicted value slices corresponding to the plurality of objects and the model parameter slices corresponding to the plurality of feature items, the obtaining module 410 includes:

obtaining a model parameter fragment of the trained service prediction model in the local first participant device;

through interaction of equipment of a plurality of participants, the participants are enabled to determine predicted value fragments of the object respectively based on joint data fragments of the participants and the trained service prediction model.

In one embodiment, the correlation data comprises covariance matrix data, and the correlation data patches comprise covariance matrix patches; the interaction module 420 includes:

the determining submodule 421 is configured to determine intermediate matrix fragments corresponding to multiple participants respectively based on joint data fragments and predicted value fragments of the multiple participants and a functional relation in the service prediction model;

the calculating sub-module 422 is configured to calculate inverse partitions of intermediate matrices corresponding to the multiple participants, respectively, based on the intermediate matrix partitions of the multiple participants, so as to obtain covariance matrix partitions corresponding to the multiple participants, respectively.

In an embodiment, the determining submodule 421 is specifically configured to:

determining hessian matrix fragments respectively corresponding to a plurality of participants as intermediate matrix fragments based on joint data fragments and predicted value fragments of the plurality of participants and hessian matrix expression obtained based on a functional relation in the service prediction model; the Hessian matrix expression comprises a joint data matrix and a predicted value matrix.

In one embodiment, the determining submodule 421, when determining the hessian matrix patches corresponding to the plurality of participants respectively, includes:

carrying out corresponding multiplication of vector elements on predicted value fragments of a plurality of participants by using secret sharing multiplication and based on an expression of a predicted value matrix, so that the plurality of participants respectively obtain intermediate vector fragments;

taking elements in the intermediate vector fragment of the first participant as diagonal elements to construct a diagonalized predicted value matrix fragment of the first participant;

and determining Hessian matrix fragments corresponding to the multiple participants respectively based on the joint data fragments, the predicted value matrix fragments and the Hessian matrix expression of the multiple participants.
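The element-wise secret sharing multiplication and the diagonal construction above can be sketched for two participants as follows. This is a minimal sketch under stated assumptions: additive shares modulo a prime, fixed-point encoding, Beaver triples supplied by a trusted dealer, and the product p * (1 - p) as the predicted value expression. None of these choices is specified by the embodiments, and all names are hypothetical.

```python
import random

Q = 2**61 - 1      # prime modulus for additive shares (assumption)
SCALE = 2**16      # fixed-point scale for predicted values (assumption)

def share(x, rng):
    """Split x into two additive shares modulo Q."""
    r = rng.randrange(Q)
    return [r, (x - r) % Q]

def reconstruct(sh):
    return sum(sh) % Q

def beaver_mul(x_sh, y_sh, rng):
    """Multiply two shared values using a triple from a trusted dealer (assumption)."""
    a, b = rng.randrange(Q), rng.randrange(Q)
    a_sh, b_sh, c_sh = share(a, rng), share(b, rng), share(a * b % Q, rng)
    # Both parties open the masked values e = x - a and f = y - b.
    e = reconstruct([(x_sh[i] - a_sh[i]) % Q for i in range(2)])
    f = reconstruct([(y_sh[i] - b_sh[i]) % Q for i in range(2)])
    z_sh = [(c_sh[i] + e * b_sh[i] + f * a_sh[i]) % Q for i in range(2)]
    z_sh[0] = (z_sh[0] + e * f) % Q  # public term is added by one party only
    return z_sh

def diag_fragment(vec):
    """Place one participant's intermediate vector fragment on a diagonal."""
    n = len(vec)
    return [[vec[i] if i == j else 0 for j in range(n)] for i in range(n)]

rng = random.Random(0)
preds = [0.25, 0.5, 0.75]  # hypothetical predicted values
p_sh = [share(round(p * SCALE), rng) for p in preds]
# Shares of (1 - p): one party subtracts its share from the public constant.
q_sh = [[(SCALE - s[0]) % Q, (-s[1]) % Q] for s in p_sh]
v_sh = [beaver_mul(x, y, rng) for x, y in zip(p_sh, q_sh)]

A0 = diag_fragment([s[0] for s in v_sh])  # first participant's matrix fragment
A1 = diag_fragment([s[1] for s in v_sh])  # second participant's matrix fragment
recovered = [reconstruct(s) / SCALE**2 for s in v_sh]
```

Each participant builds its diagonal matrix fragment locally from its own intermediate vector fragment, so the diagonal construction itself requires no interaction.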

In an embodiment, when determining the Hessian matrix fragments respectively corresponding to the plurality of participants based on the joint data fragments, the predicted value matrix fragments, and the Hessian matrix expression of the plurality of participants, the determining submodule 421 is configured for:

when computing the secure multiplication of the joint data fragments and the predicted value matrix fragments of the plurality of participants, performing a secure multiplication of each column vector in the joint data fragments with the corresponding diagonal element in the predicted value matrix fragments.
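The saving from multiplying column vectors by diagonal elements can be illustrated in plaintext: a product with a diagonal matrix equals a per-column scaling, so the off-diagonal zeros never need to enter a secure multiplication. The sketch below assumes the matrix being scaled is the transposed joint data matrix (feature items by objects); the helper names and values are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def diag(v):
    return [[v[i] if i == j else 0.0 for j in range(len(v))] for i in range(len(v))]

def scale_columns(M, v):
    """Multiply column j of M by the diagonal element v[j]."""
    return [[M[i][j] * v[j] for j in range(len(v))] for i in range(len(M))]

# Hypothetical transposed joint data: 2 feature items x 3 objects.
Xt = [[1.0, 1.0, 1.0], [0.5, -1.0, 2.0]]
a = [0.1875, 0.25, 0.1875]      # hypothetical diagonal elements

full = matmul(Xt, diag(a))      # full matrix product with the diagonal matrix
fast = scale_columns(Xt, a)     # one scalar-by-column product per object
```

For n objects and m feature items, the column-wise form uses n * m scalar multiplications instead of the n * n * m required by a general matrix product.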

In one embodiment, the calculation sub-module 422 is specifically configured to:

obtaining, through iterative computation using a secret sharing matrix inversion (SMI) algorithm and based on the intermediate matrix fragments of the multiple participants, the covariance matrix fragments respectively corresponding to the multiple participants.
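The embodiments do not spell out the iteration used by SMI here. One iterative scheme that needs only additions and multiplications, and is therefore compatible with secret sharing, is the Newton-Schulz iteration X <- X (2I - A X). The plaintext sketch below uses it as an assumed stand-in, with an initial value I / trace(A) that is valid for symmetric positive definite matrices; the sample matrix and all names are hypothetical.

```python
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def newton_schulz_inverse(A, iterations=30):
    """Approximate the inverse of a symmetric positive definite matrix A."""
    n = len(A)
    trace = sum(A[i][i] for i in range(n))
    # Initial guess I / trace(A) keeps the residual I - A X inside the unit ball.
    X = [[(1.0 / trace) if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        AX = matmul(A, X)
        R = [[(2.0 if i == j else 0.0) - AX[i][j] for j in range(n)] for i in range(n)]
        X = matmul(X, R)  # X <- X (2I - A X); the residual is squared each step
    return X

H = [[0.75, 0.375], [0.375, 1.3125]]   # hypothetical intermediate (Hessian) matrix
cov = newton_schulz_inverse(H)
```

In a secret-shared setting each matrix product would be replaced by a secure matrix multiplication on fragments, so that every participant ends with a covariance matrix fragment.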

In one embodiment, the verification module 430 is specifically configured to:

using the diagonal elements in the covariance matrix fragments of the multiple participants as variance fragments respectively corresponding to the multiple model parameters;

for any model parameter, using SNSI and a significance test method, jointly performing a secure inverse square root operation through interaction among the plurality of participant devices, based on the corresponding model parameter fragment of the first participant and the corresponding variance fragments of the plurality of participants, and determining the significance test value fragment of the first participant for the model parameter; and determining the effective value of the feature item corresponding to the model parameter based on the significance test value fragments of the multiple participants for the model parameter.
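A plaintext sketch of the significance test may help fix ideas. It assumes a Wald z-test, in which the test value is the model parameter multiplied by the inverse square root of its variance, and takes the two-sided p-value as the effective value; both are assumptions, as is the Newton iteration y <- y (3 - x y^2) / 2 used for the inverse square root. All names are hypothetical.

```python
import math

def inverse_sqrt(x, iterations=60):
    """Newton iteration for 1 / sqrt(x), using only additions and multiplications."""
    y = 1.0 / max(x, 1.0)  # plaintext starting point not exceeding 1 / sqrt(x)
    for _ in range(iterations):
        y = y * (3.0 - x * y * y) / 2.0
    return y

def wald_test(weight, variance):
    """Return the z statistic and two-sided p-value for one model parameter."""
    z = weight * inverse_sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical model parameter and variance (a diagonal element of the covariance matrix).
z, p = wald_test(1.96, 1.0)
```

In the secure setting the starting point would have to be chosen without revealing the variance, for example from a public bound; the multiplications inside the loop are the only operations that require interaction.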

In one embodiment, the apparatus 400 further comprises a reconstruction module (not shown in the figures) configured to:

for any first feature item, obtaining the effective value fragments of the first feature item from the other participant devices;

and determining the reconstructed effective value of the first feature item based on the local effective value fragment of the first feature item and the obtained effective value fragment.
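Reconstruction from fragments can be sketched as follows, assuming additive secret sharing modulo a prime with a fixed-point encoding. The modulus, scale, and function names are hypothetical choices for illustration and are not taken from the embodiments.

```python
import random

Q = 2**61 - 1     # prime modulus (assumption)
SCALE = 2**20     # fixed-point scale (assumption)

def encode(x):
    return round(x * SCALE) % Q

def decode(v):
    # Values above Q // 2 represent negative numbers.
    if v > Q // 2:
        v -= Q
    return v / SCALE

def split(value, n_parties, rng):
    """Split an encoded value into n additive fragments."""
    shares = [rng.randrange(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def reconstruct(local_share, received_shares):
    """Combine the local fragment with the fragments obtained from other devices."""
    return decode((local_share + sum(received_shares)) % Q)

rng = random.Random(1)
fragments = split(encode(0.03125), 3, rng)   # hypothetical effective value
value = reconstruct(fragments[0], fragments[1:])
negative = reconstruct(*(lambda s: (s[0], s[1:]))(split(encode(-1.5), 3, rng)))
```

No single fragment reveals the effective value; only the sum of all fragments does.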

In one embodiment, the apparatus 400 further comprises a removal module (not shown) configured to:

based on the effective values, removing from the plurality of feature items those feature items whose effective values do not satisfy a preset condition, so that the plurality of participants perform secure joint training of the service prediction model using service data that excludes the removed feature items.
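The removal step can be sketched as a simple filter. The preset condition is assumed here to be "effective value not greater than 0.05", treating the effective value as a p-value; the threshold, the feature item names, and the function name are all hypothetical.

```python
def select_features(effective_values, threshold=0.05):
    """Split feature items into kept and removed according to an assumed threshold."""
    kept = [name for name, v in effective_values.items() if v <= threshold]
    removed = [name for name in effective_values if name not in kept]
    return kept, removed

# Hypothetical reconstructed effective values for three feature items.
effective_values = {"age": 0.001, "region": 0.40, "purchase_count": 0.03}
kept, removed = select_features(effective_values)
```

The participants would then retrain the service prediction model using only the columns of their service data that correspond to the kept feature items.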

In one embodiment, the object comprises one of a user, a commodity, and an event; the feature items include at least one of: basic attribute information, association relationship information, interaction information, and historical behavior information; the business prediction model is used to perform business prediction on the object.

In one embodiment, the business prediction model is based on a logistic regression model.

The above device embodiments correspond to the method embodiments, are obtained based on the corresponding method embodiments, and have the same technical effects. For specific descriptions, reference may be made to the corresponding method embodiments; they are not repeated here.

Embodiments of the present specification also provide a computer-readable storage medium having a computer program stored thereon which, when executed in a computer, causes the computer to perform the method described with reference to any one of FIGS. 1 to 3.

Embodiments of the present specification further provide a computing device, which includes a memory and a processor, where the memory stores executable code and the processor, when executing the executable code, implements the method described with reference to any one of FIGS. 1 to 3.

The embodiments in the present specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, since the storage medium and computing device embodiments are substantially similar to the method embodiments, they are described briefly, and reference may be made to the relevant descriptions of the method embodiments.

Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code.

The above-mentioned embodiments further describe the objects, technical solutions and advantages of the embodiments of the present invention in detail. It should be understood that the above description is only exemplary of the embodiments of the present invention, and is not intended to limit the scope of the present invention, and any modification, equivalent replacement, or improvement made on the basis of the technical solutions of the present invention should be included in the scope of the present invention.
