Optimization method and device of voiceprint recognition model, computer equipment and storage medium
1. A method for optimizing a voiceprint recognition model, comprising:
respectively deploying a preset initial voiceprint recognition model to a plurality of terminals, wherein the initial voiceprint recognition model comprises a time-delay neural network (TDNN) and a neural probabilistic linear discriminant analysis (NPLDA) network, and the plurality of terminals comprise a target terminal and a plurality of associated terminals associated with the target terminal;
acquiring voice data to be recognized sent by a target terminal, wherein the voice data to be recognized is voice input by a target user through the target terminal, and the voice duration of the voice data to be recognized is greater than a threshold value;
performing a voiceprint verification operation on the voice data to be recognized through preset historical stored data, wherein the historical stored data comprises a plurality of voice data sets, each voice data set comprises a plurality of pieces of registered voice data of a user, and the user corresponding to each voice data set is different;
when the voice data to be recognized passes the voiceprint verification operation, determining a plurality of anonymous voiceprint vectors as negative sample data and sending the negative sample data to the target terminal, so that the target terminal performs gradient calculation on the initial voiceprint recognition model according to the negative sample data and the local positive sample data of the target terminal to obtain a target model gradient corresponding to the target terminal, wherein the anonymous voiceprint vectors are speech features of other users, and the other users are users except the target user;
acquiring a plurality of association model gradients sent by the association terminals, and aggregating the association model gradients and the target model gradient by adopting a federated averaging algorithm to obtain an aggregation gradient, wherein each association terminal corresponds to one association model gradient;
sending the aggregate gradient to the plurality of terminals such that each terminal optimizes the initial voiceprint recognition model according to the aggregate gradient.
2. The method for optimizing the voiceprint recognition model according to claim 1, wherein before the pre-set initial voiceprint recognition model is deployed to a plurality of terminals, the method for optimizing the voiceprint recognition model further comprises:
and constructing an initial model and performing off-line training on the initial model to obtain an initial voiceprint recognition model.
3. The optimization method of the voiceprint recognition model according to claim 2, wherein the constructing the initial model and performing offline training on the initial model to obtain the initial voiceprint recognition model comprises:
extracting the first 6 time-delay network (TDNN) layers from a neural network feature extractor x-vector, and taking the first 6 TDNN layers as a front part of an initial model;
extracting the last 3 network layers from a neural probabilistic linear discriminant analysis (NPLDA) network, and taking the last 3 network layers as a back part of the initial model;
combining the front part and the back part into an initial model, the initial model comprising a 9-layer network structure;
acquiring an initial training corpus, wherein the initial training corpus comprises voice pairs of the same user and voice pairs of different users;
and training the initial model according to the initial training corpus to obtain an initial voiceprint recognition model.
4. The method of claim 3, wherein the training the initial model according to the initial training corpus to obtain an initial voiceprint recognition model comprises:
inputting the initial training corpus into the initial model, and calculating a detection cost function of the initial model;
and when the value of the detection cost function is smaller than a preset value, determining that the initial model completes training to obtain an initial voiceprint recognition model.
5. The optimization method of the voiceprint recognition model according to claim 1, wherein the performing the voiceprint verification operation on the voice data to be recognized through preset historical stored data comprises:
determining a target user corresponding to the voice data to be recognized as a user to be verified;
determining a target voice data set matched with the user to be verified in a plurality of voice data sets of the preset historical stored data, wherein the target voice data set comprises a plurality of pieces of registered voice data of the user to be verified;
inputting the target voice data set and the voice data to be recognized into the initial voiceprint recognition model, calculating a score value at a preset false acceptance rate, and determining the score value as a target threshold value for voiceprint verification;
and calling the initial voiceprint recognition model according to the target threshold value to perform a 1:1 voiceprint verification operation on the voice data to be recognized.
6. The optimization method of the voiceprint recognition model according to claim 5, wherein after the initial voiceprint recognition model is called according to the target threshold value to perform the 1:1 voiceprint verification operation on the voice data to be recognized, the optimization method of the voiceprint recognition model further comprises:
calculating a target equal error rate value of the initial voiceprint recognition model;
when the target equal error rate value is smaller than or equal to a preset warning value, updating the initial voiceprint recognition model based on the target threshold value;
and when the target equal error rate value is greater than the preset warning value, sending an early warning message to a management center.
7. The optimization method of the voiceprint recognition model according to any one of claims 1 to 6, wherein the acquiring a plurality of association model gradients sent by the plurality of association terminals, and aggregating the plurality of association model gradients and the target model gradient by using a federated averaging algorithm to obtain an aggregation gradient, wherein each association terminal corresponds to one association model gradient, includes:
determining the current weight of the initial voiceprint recognition model and issuing the current weight to each terminal;
acquiring a target weight corresponding to the target model gradient;
acquiring a plurality of association model gradients sent by the association terminals and association weight corresponding to each association model gradient;
calculating, based on a federated averaging algorithm, the target model gradient, the target weight, the plurality of association model gradients and the association weight corresponding to each association model gradient, to obtain an aggregation gradient and an updated weight;
and sending the aggregation gradient and the updated weight to each terminal.
8. An apparatus for optimizing a voiceprint recognition model, comprising:
the model deployment module is used for respectively deploying a preset initial voiceprint recognition model to a plurality of terminals, wherein the initial voiceprint recognition model comprises a time-delay neural network (TDNN) and a neural probabilistic linear discriminant analysis (NPLDA) network, and the terminals comprise a target terminal and a plurality of associated terminals associated with the target terminal;
the data acquisition module is used for acquiring voice data to be recognized sent by a target terminal, wherein the voice data to be recognized is voice input by a target user through the target terminal, and the voice duration of the voice data to be recognized is greater than a threshold value;
the voiceprint verification module is used for performing a voiceprint verification operation on the voice data to be recognized through preset historical stored data, the historical stored data comprises a plurality of voice data sets, each voice data set comprises a plurality of pieces of registered voice data of one user, and the user corresponding to each voice data set is different;
a determining and sending module, configured to determine multiple anonymous voiceprint vectors as negative sample data and send the negative sample data to the target terminal when the to-be-recognized voice data passes the voiceprint verification operation, so that the target terminal performs gradient calculation on the initial voiceprint recognition model according to the negative sample data and local positive sample data of the target terminal to obtain a target model gradient corresponding to the target terminal, where the anonymous voiceprint vectors are speech features of other users, and the other users are users other than the target user;
the acquiring and aggregating module is used for acquiring a plurality of association model gradients sent by the association terminals, and aggregating the plurality of association model gradients and the target model gradient by adopting a federated averaging algorithm to obtain an aggregation gradient, wherein each association terminal corresponds to one association model gradient;
a sending module, configured to send the aggregate gradient to the multiple terminals, so that each terminal optimizes the initial voiceprint recognition model according to the aggregate gradient.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements a method of optimizing a voiceprint recognition model according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to carry out a method of optimization of a voiceprint recognition model according to any one of claims 1 to 7.
Background
As people become more security-conscious about data privacy protection, in application scenarios where important private data cannot leave the local device, improving the security of user data has become an important goal, and voiceprint recognition is a common solution.
Existing voiceprint recognition schemes adopt a centralized training mode, which requires a large amount of speaker voice data to be collected in advance as training data. Because this historical stored data is not real client data, the existing recognition model remains static, cannot reflect the situation of currently active user data, and is difficult to optimize in real time on online user data.
Disclosure of Invention
The embodiment of the application provides a voiceprint recognition model optimization method and device, computer equipment and a storage medium, and aims to solve the problem that an existing voiceprint recognition model is difficult to optimize in real time for online users.
In a first aspect, an embodiment of the present application provides an optimization method for a voiceprint recognition model, including:
respectively deploying a preset initial voiceprint recognition model to a plurality of terminals, wherein the initial voiceprint recognition model comprises a time-delay neural network (TDNN) and a neural probabilistic linear discriminant analysis (NPLDA) network, and the plurality of terminals comprise a target terminal and a plurality of associated terminals associated with the target terminal;
acquiring voice data to be recognized sent by a target terminal, wherein the voice data to be recognized is voice input by a target user through the target terminal, and the voice duration of the voice data to be recognized is greater than a threshold value;
performing a voiceprint verification operation on the voice data to be recognized through preset historical stored data, wherein the historical stored data comprises a plurality of voice data sets, each voice data set comprises a plurality of pieces of registered voice data of a user, and the user corresponding to each voice data set is different;
when the voice data to be recognized passes the voiceprint verification operation, determining a plurality of anonymous voiceprint vectors as negative sample data and sending the negative sample data to the target terminal, so that the target terminal performs gradient calculation on the initial voiceprint recognition model according to the negative sample data and the local positive sample data of the target terminal to obtain a target model gradient corresponding to the target terminal, wherein the anonymous voiceprint vectors are speech features of other users, and the other users are users except the target user;
acquiring a plurality of association model gradients sent by the association terminals, and aggregating the association model gradients and the target model gradient by adopting a federated averaging algorithm to obtain an aggregation gradient, wherein each association terminal corresponds to one association model gradient;
sending the aggregate gradient to the plurality of terminals such that each terminal optimizes the initial voiceprint recognition model according to the aggregate gradient.
In a possible implementation manner, before the pre-set initial voiceprint recognition models are deployed to a plurality of terminals respectively, the optimization method of the voiceprint recognition models further includes:
and constructing an initial model and performing off-line training on the initial model to obtain an initial voiceprint recognition model.
In a possible implementation manner, the building an initial model and performing offline training on the initial model to obtain an initial voiceprint recognition model includes:
extracting the first 6 time-delay network (TDNN) layers from a neural network feature extractor x-vector, and taking the first 6 TDNN layers as a front part of an initial model;
extracting the last 3 network layers from a neural probabilistic linear discriminant analysis (NPLDA) network, and taking the last 3 network layers as a back part of the initial model;
combining the front part and the back part into an initial model, the initial model comprising a 9-layer network structure;
acquiring an initial training corpus, wherein the initial training corpus comprises voice pairs of the same user and voice pairs of different users;
and training the initial model according to the initial training corpus to obtain an initial voiceprint recognition model.
In a possible implementation manner, the training the initial model according to the initial training corpus to obtain an initial voiceprint recognition model includes:
inputting the initial training corpus into the initial model, and calculating a detection cost function of the initial model;
and when the value of the detection cost function is smaller than a preset value, determining that the initial model completes training to obtain an initial voiceprint recognition model.
In a possible implementation manner, the performing the voiceprint verification operation on the voice data to be recognized through preset historical stored data includes:
determining a target user corresponding to the voice data to be recognized as a user to be verified;
determining a target voice data set matched with the user to be verified in a plurality of voice data sets of the preset historical stored data, wherein the target voice data set comprises a plurality of pieces of registered voice data of the user to be verified;
inputting the target voice data set and the voice data to be recognized into the initial voiceprint recognition model, calculating a score value at a preset false acceptance rate, and determining the score value as a target threshold value for voiceprint verification;
and calling the initial voiceprint recognition model according to the target threshold value to perform a 1:1 voiceprint verification operation on the voice data to be recognized.
In a possible implementation manner, after the initial voiceprint recognition model is called according to the target threshold value to perform the 1:1 voiceprint verification operation on the voice data to be recognized, the optimization method of the voiceprint recognition model further includes:
calculating a target equal error rate value of the initial voiceprint recognition model;
when the target equal error rate value is smaller than or equal to a preset warning value, updating the initial voiceprint recognition model based on the target threshold value;
and when the target equal error rate value is greater than the preset warning value, sending an early warning message to a management center.
In a possible implementation manner, the acquiring a plurality of association model gradients sent by the plurality of association terminals, and aggregating the plurality of association model gradients and the target model gradient by using a federated averaging algorithm to obtain an aggregation gradient, wherein each association terminal corresponds to one association model gradient, includes:
determining the current weight of the initial voiceprint recognition model and issuing the current weight to each terminal;
acquiring a target weight corresponding to the target model gradient;
acquiring a plurality of association model gradients sent by the association terminals and association weight corresponding to each association model gradient;
calculating, based on a federated averaging algorithm, the target model gradient, the target weight, the plurality of association model gradients and the association weight corresponding to each association model gradient, to obtain an aggregation gradient and an updated weight;
and sending the aggregation gradient and the updated weight to each terminal.
In a second aspect, an embodiment of the present application provides an apparatus for optimizing a voiceprint recognition model, including:
the model deployment module is used for respectively deploying a preset initial voiceprint recognition model to a plurality of terminals, wherein the initial voiceprint recognition model comprises a time-delay neural network (TDNN) and a neural probabilistic linear discriminant analysis (NPLDA) network, and the terminals comprise a target terminal and a plurality of associated terminals associated with the target terminal;
the data acquisition module is used for acquiring voice data to be recognized sent by a target terminal, wherein the voice data to be recognized is voice input by a target user through the target terminal, and the voice duration of the voice data to be recognized is greater than a threshold value;
the voiceprint verification module is used for performing a voiceprint verification operation on the voice data to be recognized through preset historical stored data, the historical stored data comprises a plurality of voice data sets, each voice data set comprises a plurality of pieces of registered voice data of one user, and the user corresponding to each voice data set is different;
a determining and sending module, configured to determine multiple anonymous voiceprint vectors as negative sample data and send the negative sample data to the target terminal when the to-be-recognized voice data passes the voiceprint verification operation, so that the target terminal performs gradient calculation on the initial voiceprint recognition model according to the negative sample data and local positive sample data of the target terminal to obtain a target model gradient corresponding to the target terminal, where the anonymous voiceprint vectors are speech features of other users, and the other users are users other than the target user;
the acquiring and aggregating module is used for acquiring a plurality of association model gradients sent by the association terminals, and aggregating the association model gradients and the target model gradient by adopting a federated averaging algorithm to obtain an aggregation gradient, wherein each association terminal corresponds to one association model gradient;
a sending module, configured to send the aggregate gradient to the multiple terminals, so that each terminal optimizes the initial voiceprint recognition model according to the aggregate gradient.
In a possible implementation manner, the optimization device of the voiceprint recognition model further includes:
and the construction training module is used for constructing an initial model and performing off-line training on the initial model to obtain an initial voiceprint recognition model.
In one possible embodiment, the building training module includes:
the first extraction unit is used for extracting the first 6 time-delay network (TDNN) layers from a neural network feature extractor x-vector and taking the first 6 TDNN layers as a front part of an initial model;
the second extraction unit is used for extracting the last 3 network layers from the NPLDA and taking the last 3 network layers as a back part of the initial model;
a combining unit configured to combine the front part and the back part into an initial model, the initial model comprising a 9-layer network structure;
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring an initial training corpus, and the initial training corpus comprises voice pairs of the same user and voice pairs of different users;
and the training unit is used for training the initial model according to the initial training corpus to obtain an initial voiceprint recognition model.
In a possible embodiment, the training unit is specifically configured to:
inputting the initial training corpus into the initial model, and calculating a detection cost function of the initial model;
and when the value of the detection cost function is smaller than a preset value, determining that the initial model completes training to obtain an initial voiceprint recognition model.
In one possible embodiment, the voiceprint verification module comprises:
the determining unit is used for determining the target user corresponding to the voice data to be recognized as a user to be verified;
the matching unit is used for determining a target voice data set matched with the user to be verified in a plurality of voice data sets of the preset historical stored data, and the target voice data set comprises a plurality of pieces of registered voice data of the user to be verified;
the first calculation unit is used for inputting the target voice data set and the voice data to be recognized into the initial voiceprint recognition model, calculating a score value at a preset false acceptance rate, and determining the score value as a target threshold value for voiceprint verification;
and the voiceprint verification unit is used for calling the initial voiceprint recognition model according to the target threshold value to perform a 1:1 voiceprint verification operation on the voice data to be recognized.
In one possible embodiment, the voiceprint verification module further comprises:
the second calculation unit is used for calculating a target equal error rate value of the initial voiceprint recognition model;
the updating unit is used for updating the initial voiceprint recognition model based on the target threshold value when the target equal error rate value is smaller than or equal to a preset warning value;
and the sending unit is used for sending an early warning message to a management center when the target equal error rate value is greater than the preset warning value.
In a possible implementation, the acquiring and aggregating module is specifically configured to:
determining the current weight of the initial voiceprint recognition model and issuing the current weight to each terminal;
acquiring a target weight corresponding to the target model gradient;
acquiring a plurality of association model gradients sent by the association terminals and association weight corresponding to each association model gradient;
calculating, based on a federated averaging algorithm, the target model gradient, the target weight, the plurality of association model gradients and the association weight corresponding to each association model gradient, to obtain an aggregation gradient and an updated weight;
and sending the aggregation gradient and the updated weight to each terminal.
In a third aspect, an embodiment of the present application further provides a computer device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the processor implements the optimization method of the voiceprint recognition model according to the first aspect.
In a fourth aspect, the present application further provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, and the computer program, when executed by a processor, causes the processor to execute the optimization method of the voiceprint recognition model according to the first aspect.
The embodiment of the application provides a voiceprint recognition model optimization method and device, computer equipment and a storage medium, which are used to optimize the model in real time for online users and improve the accuracy of the model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of an optimization method of a voiceprint recognition model according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a method for optimizing a voiceprint recognition model according to an embodiment of the present disclosure;
FIG. 3 is a schematic block diagram of an optimization apparatus for a voiceprint recognition model provided in an embodiment of the present application;
fig. 4 is a schematic block diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Referring to fig. 1 and fig. 2, fig. 1 is a schematic view of an application scenario of an optimization method of a voiceprint recognition model according to an embodiment of the present application; fig. 2 is a schematic flowchart of a method for optimizing a voiceprint recognition model according to an embodiment of the present application, where the method is applied to a server and is executed by application software installed in the server.
As shown in fig. 2, the method includes steps S201 to S206.
S201, respectively deploying a preset initial voiceprint recognition model to a plurality of terminals, wherein the initial voiceprint recognition model comprises a time-delay neural network (TDNN) and a neural probabilistic linear discriminant analysis (NPLDA) network, and the plurality of terminals comprise a target terminal and a plurality of associated terminals associated with the target terminal.
The server deploys a preset initial voiceprint recognition model to a plurality of terminals respectively, the initial voiceprint recognition model comprises a time-delay neural network (TDNN) and a neural probabilistic linear discriminant analysis (NPLDA) network, and the plurality of terminals comprise a target terminal and a plurality of associated terminals associated with the target terminal. In this embodiment, a data update occurring on the target terminal is taken as an example for explanation, and the target terminal may be any one of the multiple terminals.
Optionally, before the server deploys the preset initial voiceprint recognition models to the multiple terminals, the method further includes: and the server constructs an initial model and carries out off-line training on the initial model to obtain an initial voiceprint recognition model.
Specifically, the server extracts the first 6 time-delay network (TDNN) layers from a neural network feature extractor x-vector, and takes the first 6 TDNN layers as a front part of an initial model; the server extracts the last 3 network layers from the NPLDA, and takes the last 3 network layers as a back part of the initial model; the server combines the front part and the back part into the initial model, and the initial model comprises a 9-layer network structure; the server acquires an initial training corpus, wherein the initial training corpus comprises voice pairs of the same user and voice pairs of different users; and the server trains the initial model according to the initial training corpus to obtain an initial voiceprint recognition model.
It should be noted that the initial voiceprint recognition model to be trained is built by combining the first 6 time-delay neural network (TDNN) layers of an x-vector extractor with the last 3 layers of a neural probabilistic linear discriminant analysis (NPLDA) network. The combined network architecture is as follows:
Layer 1: TDNN-ReLU (time-delay neural network with linear rectification), context range (t-2, t+2), input dimension 30, output dimension 512;
Layer 2: TDNN-ReLU, context range (t-2, t+2), input dimension 512, output dimension 512;
Layer 3: TDNN-ReLU, context range (t-3, t+3), input dimension 512, output dimension 512;
Layer 4: TDNN-ReLU, context range (t), input dimension 512, output dimension 512;
Layer 5: TDNN-ReLU, context range (t), input dimension 512, output dimension 1500;
Layer 6: statistics pooling (Statistics-Pooling), context range (0, T), input dimension 1500 × T, output dimension 3000;
Layer 7: affine with unit-length normalization, context range (0), input dimension 1500, output dimension 512;
Layer 8: affine, context range (0), input dimension 512, output dimension 512;
Layer 9: quadratic, context range (0), input dimension 512, output dimension 1.
The output of layer 8 is the voiceprint feature vector.
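As a concrete illustration of this combined architecture, the following is a minimal PyTorch sketch (PyTorch is assumed here; the class and parameter names such as VoiceprintModel and TDNNLayer are illustrative, not from the original). It treats each context range as a contiguous window implemented with a 1-D convolution, uses mean-plus-standard-deviation statistics pooling (so the layer-7 affine takes the 3000-dimensional pooled vector), and scores an embedding pair with a PLDA-style quadratic form; the exact layer definitions of the patented model may differ.

```python
import torch
import torch.nn as nn


class TDNNLayer(nn.Module):
    """One time-delay layer: a 1-D convolution over a symmetric frame context, followed by ReLU."""

    def __init__(self, in_dim, out_dim, context):
        super().__init__()
        # context = k means the layer sees frames t-k .. t+k
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=2 * context + 1, padding=context)
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (batch, in_dim, frames)
        return self.relu(self.conv(x))


class VoiceprintModel(nn.Module):
    """Layers 1-8 produce a voiceprint embedding; layer 9 scores a pair of embeddings."""

    def __init__(self, feat_dim=30, emb_dim=512):
        super().__init__()
        # Front part: layers 1-5 (TDNN-ReLU) of an x-vector-style extractor.
        self.frame_layers = nn.Sequential(
            TDNNLayer(feat_dim, 512, context=2),  # layer 1: (t-2, t+2)
            TDNNLayer(512, 512, context=2),       # layer 2: (t-2, t+2)
            TDNNLayer(512, 512, context=3),       # layer 3: (t-3, t+3)
            TDNNLayer(512, 512, context=0),       # layer 4: (t)
            TDNNLayer(512, 1500, context=0),      # layer 5: (t)
        )
        # Back part, NPLDA-style.
        self.embed = nn.Linear(2 * 1500, emb_dim)   # layer 7: affine on pooled mean+std
        self.affine = nn.Linear(emb_dim, emb_dim)   # layer 8: affine
        # Layer 9: parameters of a quadratic (PLDA-like) pairwise scoring function.
        self.P = nn.Parameter(torch.eye(emb_dim))
        self.Q = nn.Parameter(torch.zeros(emb_dim, emb_dim))
        self.bias = nn.Parameter(torch.zeros(1))

    def embedding(self, feats):  # feats: (batch, feat_dim, frames)
        h = self.frame_layers(feats)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)  # layer 6: statistics pooling
        e = self.embed(stats)
        e = e / e.norm(dim=1, keepdim=True)                      # unit-length normalization
        return self.affine(e)                                    # layer-8 output = voiceprint vector

    def score(self, e1, e2):
        """Layer 9: quadratic scoring of an embedding pair, one scalar score per pair."""
        cross = (e1 @ self.P * e2).sum(dim=1)
        quad = (e1 @ self.Q * e1).sum(dim=1) + (e2 @ self.Q * e2).sum(dim=1)
        return cross + quad + self.bias
```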
Optionally, the server trains the initial model according to the initial training corpus to obtain an initial voiceprint recognition model, which specifically includes:
the server inputs the initial training corpus into an initial model, and calculates a detection cost function of the initial model;
and when the value of the detection cost function is smaller than a preset value, the server determines that the initial model completes training to obtain an initial voiceprint recognition model.
It should be noted that the training process uses the value of the model's detection cost function C(θ) (Detection Cost Function, DCF), generally in its normalized form C_Norm(θ) (Normalized Detection Cost Function, Normalized DCF), as the loss function, so that the model is trained toward a smaller minimum detection cost function value (minDCF), which is an important index for evaluating the quality of the model. Through this whole set of processes, end-to-end voiceprint model training can be achieved.
Here C(θ) = C_miss · P_miss(θ) · P_target + C_FA · P_FA(θ) · (1 − P_target).
It will be appreciated that P_miss(θ) is the miss rate (the proportion of missed targets) at threshold θ, and P_FA(θ) is the false alarm rate at threshold θ. C_miss, C_FA and P_target are constants; these three parameters are defined by taking into account the prior probability of the proportion of target speech among the detected speech and the impact of the two events, missing a target and raising a false alarm, on the real situation. This definition has rarely changed since NIST SRE 1996, and usually C_miss = 10, C_FA = 1, P_target = 0.01.
Thus the normalized DCF can be simplified as:
C_Norm(θ) = P_miss(θ) + β · P_FA(θ), where β = C_FA · (1 − P_target) / (C_miss · P_target) = 9.9.
Let i index the test trials and N be the total number of trials. When t_i = 1, trial i is a speech pair from different speakers (a non-target trial); when t_i = 0, trial i is a speech pair from the same speaker (a target trial). s_i is the score output by the model for trial i. The two rates can then be written as
P_miss(θ) = Σ_i (1 − t_i) · 1(s_i < θ) / Σ_i (1 − t_i),   P_FA(θ) = Σ_i t_i · 1(s_i ≥ θ) / Σ_i t_i,
where 1(·) is an indicator function that is 1 if the condition in parentheses is true and 0 if it is false. No gradient can be learned through the indicator function because it changes abruptly, so it is replaced by a function that changes more smoothly: the sigmoid function σ(·).
It is thus possible to obtain the smoothed rates:
P_miss^soft(θ) = Σ_i (1 − t_i) · σ(θ − s_i) / Σ_i (1 − t_i),   P_FA^soft(θ) = Σ_i t_i · σ(s_i − θ) / Σ_i t_i.
Finally, the loss function is obtained:
L(θ) = P_miss^soft(θ) + β · P_FA^soft(θ).
Since minDCF is the minimum DCF value over all threshold values θ, θ can itself be set as a learnable variable and optimized together with the model.
S202, voice data to be recognized sent by the target terminal is obtained, the voice data to be recognized is voice input by the target user through the target terminal, and the voice duration of the voice data to be recognized is larger than a threshold value.
The server acquires voice data to be recognized sent by the target terminal, the voice data to be recognized is voice input by the target user through the target terminal, and the voice duration of the voice data to be recognized is larger than a threshold value.
It should be noted that, for a terminal participating in training, if there is a new voice participating in verification (i.e. voice data to be recognized), it is first necessary to ensure that there is enough valid speech and that only one speaker is speaking.
S203, performing a voiceprint verification operation on the voice data to be recognized through preset historical stored data, wherein the historical stored data comprises a plurality of voice data sets, each voice data set comprises a plurality of pieces of registered voice data of one user, and the user corresponding to each voice data set is different.
Specifically, the server determines the target user corresponding to the voice data to be recognized as a user to be verified; the server determines a target voice data set matched with the user to be verified in a plurality of voice data sets of the preset historical stored data, wherein the target voice data set comprises a plurality of pieces of registered voice data of the user to be verified; the server inputs the target voice data set and the voice data to be recognized into the initial voiceprint recognition model, calculates a score value at a preset False Acceptance Rate (FAR), and determines the score value as the target threshold for voiceprint verification; the server calls the initial voiceprint recognition model according to the target threshold to perform a 1:1 voiceprint verification operation on the voice data to be recognized. The historical stored data comprises a plurality of voice data sets, each voice data set comprises a plurality of pieces of registered voice data of one user, and the user corresponding to each voice data set is different.
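As an illustration of how the verification threshold can be derived from a preset FAR, here is a small NumPy sketch (the function name threshold_at_far and the use of impostor-score quantiles are illustrative assumptions, not taken from the original): the threshold is chosen so that approximately the preset fraction of non-target (impostor) scores falls above it.

```python
import numpy as np


def threshold_at_far(nontarget_scores, target_far=0.01):
    """Return the score threshold whose false acceptance rate on impostor scores is ~target_far.
    nontarget_scores: scores of trials pairing the enrolled user with other speakers."""
    nontarget_scores = np.asarray(nontarget_scores, dtype=float)
    # The (1 - FAR) quantile leaves about a target_far fraction of impostor scores above it.
    return float(np.quantile(nontarget_scores, 1.0 - target_far))
```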
It should be noted that the verification task here is not completed in the terminal but in the server, so the server needs to perform a 1:1 voiceprint verification with a relatively high threshold to ensure that the voice participating in voiceprint training is indeed the voice of the speaker corresponding to the speaker ID. The server also includes detection algorithms for detecting valid speech and for detecting whether only one speaker is present, so as to control the quality of the voice participating in training.
Optionally, after the initial voiceprint recognition model is called according to the target threshold to perform the 1:1 voiceprint verification operation on the voice data to be recognized, the optimization method of the voiceprint recognition model further includes:
the server calculates a target equal error rate (EER) value of the initial voiceprint recognition model, as sketched below;
when the target EER value is smaller than or equal to the preset warning value, the server updates the initial voiceprint recognition model based on the target threshold;
and when the target EER value is greater than the preset warning value, the server sends an early warning message to the management center.
S204, when the voice data to be recognized passes the voiceprint verification operation, determining a plurality of anonymous voiceprint vectors as negative sample data and sending the negative sample data to the target terminal, so that the target terminal performs gradient calculation on the initial voiceprint recognition model according to the negative sample data and the local positive sample data of the target terminal to obtain the target model gradient corresponding to the target terminal, wherein the anonymous voiceprint vectors are speech features of other users, and the other users are users except the target user.
It should be noted that, in order to ensure that the data participating in the training belongs to the user himself or herself, the training data needs to be subjected to a 1:1 voiceprint verification operation in advance. The local terminal needs to keep a piece of registration voice of the user; when training on voices of the same speaker, the terminal trains on two local voice segments. However, when training on voices of different speakers, the voice content of other speakers cannot be transmitted, because the privacy of the other speakers needs to be protected. Here, the inferred voiceprint vectors of other terminals (i.e., the voiceprint features μ_t) can be transmitted to this terminal for training instead.
The training process of the target terminal is as follows:
the target terminal determines target registration voice data corresponding to a target user in a local database;
determining the target registration voice data and the voice data to be recognized μ^(n) as the positive sample data;
obtaining a plurality of anonymous voiceprint vectors μ^(m) from the server and determining the plurality of anonymous voiceprint vectors as negative sample data, wherein the anonymous voiceprint vectors are speech features of other users, the other users are users except the target user, m ≠ n, and the superscript is the ID of the user (namely the speaker);
and optimizing the initial voiceprint recognition model according to the positive sample data and the negative sample data to obtain a target voiceprint recognition model.
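Reusing the VoiceprintModel and soft_detection_cost sketches above, a hedged sketch of this per-terminal update might look as follows (the function name local_gradients and the tensor shapes are illustrative assumptions): the terminal forms a positive pair from its local enrolment voice and the newly verified utterance, negative pairs from the anonymous voiceprint vectors received from the server, and returns the gradient of the smoothed detection-cost loss.

```python
import torch


def local_gradients(model, theta, enrol_feats, new_feats, anon_vectors):
    """Compute the target terminal's model gradient from one positive pair and several negatives.
    enrol_feats/new_feats: (1, feat_dim, frames) acoustic features; anon_vectors: (M, emb_dim)."""
    model.zero_grad()
    e_enrol = model.embedding(enrol_feats)              # local enrolment utterance
    e_new = model.embedding(new_feats)                  # utterance that just passed verification
    pos_scores = model.score(e_enrol, e_new)                           # same-speaker pair
    neg_scores = model.score(e_new.expand(anon_vectors.shape[0], -1),  # pairs with other speakers
                             anon_vectors)
    scores = torch.cat([pos_scores, neg_scores])
    labels = torch.cat([torch.zeros_like(pos_scores), torch.ones_like(neg_scores)])
    loss = soft_detection_cost(scores, labels, theta)   # theta: the learnable decision threshold
    loss.backward()
    return [p.grad.detach().clone() for p in model.parameters()]
```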
S205, obtaining a plurality of association model gradients sent by the plurality of association terminals, and aggregating the association model gradients and the target model gradient by adopting a federated averaging algorithm to obtain an aggregation gradient, wherein each association terminal corresponds to one association model gradient.
Specifically, the server determines the current weight of the initial voiceprint recognition model and issues the current weight to each terminal; the server obtains the target model gradient sent by the target terminal and the target weight corresponding to the target model gradient; the server acquires a plurality of association model gradients sent by the plurality of association terminals and the association weight corresponding to each association model gradient; the server performs a calculation based on the federated averaging algorithm using the target model gradient, the target weight, the plurality of association model gradients and the association weight corresponding to each association model gradient, to obtain an aggregation gradient and an updated weight; the server sends the aggregation gradient and the updated weight to each terminal.
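The aggregation step itself can be sketched as a weighted average of the per-terminal gradients in the style of federated averaging (the function name federated_average and the choice of per-terminal weights are illustrative; the source does not fix how the weights are derived):

```python
def federated_average(gradient_lists, weights):
    """Weighted average of per-terminal gradients (federated-averaging-style aggregation).
    gradient_lists[k]: list of parameter gradients from terminal k (target or association terminal);
    weights[k]: that terminal's aggregation weight, e.g. its local sample count."""
    total = float(sum(weights))
    aggregated = []
    for grads_per_param in zip(*gradient_lists):     # iterate parameter-wise across terminals
        agg = sum(w * g for w, g in zip(weights, grads_per_param)) / total
        aggregated.append(agg)
    return aggregated
```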
And S206, transmitting the aggregation gradient to a plurality of terminals, so that each terminal optimizes the initial voiceprint recognition model according to the aggregation gradient.
The server sends the aggregate gradient to the plurality of terminals so that each terminal optimizes the initial voiceprint recognition model according to the aggregate gradient.
In the embodiment of the application, the training task is coordinated by the central cloud server through the federated learning algorithm and then distributed to each terminal for joint training in a distributed manner. The whole process is completed through encrypted training, so that the data security and privacy of the client are guaranteed while the model is optimized in real time, thereby improving the accuracy of voiceprint verification for online users.
The embodiment of the present application further provides an optimization apparatus for a voiceprint recognition model, where the optimization apparatus for a voiceprint recognition model is used to implement any one of the embodiments of the optimization method for a voiceprint recognition model. Specifically, please refer to fig. 3, fig. 3 is a schematic block diagram of an optimization apparatus of a voiceprint recognition model according to an embodiment of the present application. The optimization apparatus 300 of the voiceprint recognition model can be configured in a server.
As shown in fig. 3, the apparatus 300 for optimizing a voiceprint recognition model includes:
a model deployment module 301, configured to deploy a preset initial voiceprint recognition model to a plurality of terminals respectively, where the initial voiceprint recognition model includes a time-delay neural network (TDNN) and a neural probabilistic linear discriminant analysis (NPLDA) network, and the plurality of terminals include a target terminal and a plurality of associated terminals associated with the target terminal;
a data obtaining module 302, configured to obtain to-be-recognized voice data sent by a target terminal, where the to-be-recognized voice data is voice input by a target user through the target terminal, and a voice duration of the to-be-recognized voice data is greater than a threshold;
a voiceprint verification module 303, configured to perform a voiceprint verification operation on the voice data to be recognized through preset historical stored data, where the historical stored data includes multiple voice data sets, each voice data set includes multiple pieces of registered voice data of one user, and the user corresponding to each voice data set is different;
a determining and sending module 304, configured to determine, when the voice data to be recognized passes the voiceprint verification operation, multiple anonymous voiceprint vectors as negative sample data and send the negative sample data to the target terminal, so that the target terminal performs gradient calculation on the initial voiceprint recognition model according to the negative sample data and local positive sample data of the target terminal, so as to obtain a target model gradient corresponding to the target terminal, where the anonymous voiceprint vectors are speech features of other users, and the other users are users other than the target user;
an acquiring and aggregating module 305, configured to acquire multiple association model gradients sent by the multiple association terminals, and aggregate the multiple association model gradients and the target model gradient by using a federated averaging algorithm to obtain an aggregation gradient, where each association terminal corresponds to one association model gradient;
a sending module 306, configured to send the aggregate gradient to the multiple terminals, so that each terminal optimizes the initial voiceprint recognition model according to the aggregate gradient.
In an embodiment, the apparatus 300 for optimizing a voiceprint recognition model further includes:
and a construction training module 307, configured to construct an initial model and perform offline training on the initial model to obtain an initial voiceprint recognition model.
In one embodiment, the building training module 307 comprises:
the first extraction unit 3071 is configured to extract the first 6 time-delay network (TDNN) layers from the neural network feature extractor x-vector, and use the first 6 TDNN layers as the front part of the initial model;
the second extraction unit 3072 is configured to extract the last 3 network layers from the NPLDA, and use the last 3 network layers as the back part of the initial model;
a combining unit 3073, configured to combine the front part and the back part into the initial model, the initial model comprising a 9-layer network structure;
an obtaining unit 3074, configured to obtain an initial corpus, where the initial corpus includes a voice pair of the same user and a voice pair of different users;
and the training unit 3075 is configured to train the initial model according to the initial training corpus to obtain an initial voiceprint recognition model.
In a possible embodiment, the training unit 3075 is specifically configured to:
inputting the initial training corpus into the initial model, and calculating a detection cost function of the initial model;
and when the value of the detection cost function is smaller than a preset value, determining that the initial model completes training to obtain an initial voiceprint recognition model.
In one possible implementation, the voiceprint verification module 303 includes:
a determining unit 3031, configured to determine a target user corresponding to the voice data to be recognized as a user to be verified;
a matching unit 3032, configured to determine, in a plurality of voice data sets of the preset historical stored data, a target voice data set that matches the user to be verified, where the target voice data set includes a plurality of pieces of registered voice data of the user to be verified;
a first calculating unit 3033, configured to input the target voice data set and the voice data to be recognized into the initial voiceprint recognition model, calculate a score value at a preset false acceptance rate, and determine the score value as the target threshold for voiceprint verification;
and a voiceprint verification unit 3034, configured to invoke the initial voiceprint recognition model according to the target threshold to perform a 1:1 voiceprint verification operation on the voice data to be recognized.
In a possible implementation, the voiceprint verification module 303 further includes:
a second calculating unit 3035, configured to calculate a target equal error rate value of the initial voiceprint recognition model;
an updating unit 3036, configured to update the initial voiceprint recognition model based on the target threshold when the target equal error rate value is less than or equal to a preset warning value;
and a sending unit 3037, configured to send an early warning message to a management center when the target equal error rate value is greater than the preset warning value.
In a possible implementation manner, the acquiring and aggregating module 305 is specifically configured to:
determining the current weight of the initial voiceprint recognition model and issuing the current weight to each terminal;
acquiring a target weight corresponding to the target model gradient;
acquiring a plurality of association model gradients sent by the association terminals and association weight corresponding to each association model gradient;
calculating, based on a federated averaging algorithm, the target model gradient, the target weight, the plurality of association model gradients and the association weight corresponding to each association model gradient, to obtain an aggregation gradient and an updated weight;
and sending the aggregation gradient and the updated weight to each terminal.
In the embodiment of the application, the training task is coordinated by the central cloud server through the federated learning algorithm and then distributed to each terminal for joint training in a distributed manner. The whole process is completed through encrypted training, so that the data security and privacy of the client are guaranteed while the model is optimized in real time, thereby improving the accuracy of voiceprint verification for online users.
The above-described means for optimizing the voiceprint recognition model can be implemented in the form of a computer program which can be run on a computer device as shown in fig. 4.
Referring to fig. 4, fig. 4 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 400 is a server, which may be an independent server or a server cluster composed of a plurality of servers.
Referring to fig. 4, the computer device 400 includes a processor 402, a memory, which may include a storage medium 403 and an internal memory 404, and a network interface 405 connected by a system bus 401.
The storage medium 403 may store an operating system 4031 and computer programs 4032. The computer program 4032, when executed, may cause the processor 402 to perform a method of optimizing a voiceprint recognition model.
The processor 402 is used to provide computing and control capabilities that support the operation of the overall computer device 400.
The internal memory 404 provides an environment for the operation of a computer program 4032 in the storage medium 403, which computer program 4032, when executed by the processor 402, causes the processor 402 to perform a method for optimizing a voiceprint recognition model.
The network interface 405 is used for network communication, such as providing transmission of data information. Those skilled in the art will appreciate that the configuration shown in fig. 4 is a block diagram of only a portion of the configuration associated with the present application and does not constitute a limitation of the computing device 400 to which the present application is applied, and that a particular computing device 400 may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
The processor 402 is configured to run the computer program 4032 stored in the memory to implement the optimization method of the voiceprint recognition model disclosed in the embodiment of the present application.
Those skilled in the art will appreciate that the embodiment of a computer device illustrated in fig. 4 does not constitute a limitation on the specific construction of the computer device, and that in other embodiments a computer device may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. For example, in some embodiments, the computer device may only include a memory and a processor, and in such embodiments, the structures and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 4, and are not described herein again.
It should be understood that in the embodiment of the present Application, the Processor 402 may be a Central Processing Unit (CPU), and the Processor 402 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and the like. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be a nonvolatile computer-readable storage medium or a volatile computer-readable storage medium. The computer readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the optimization method of the voiceprint recognition model disclosed in the embodiments of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses, devices and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again. Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both, and that the components and steps of the examples have been described above in terms of their functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present application may be substantially or partially contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.