System and method for predicting cancer prognosis risk under high-dimensional deletion data
1. A method for predicting cancer prognosis risk under high-dimensional deletion data, characterized by comprising the following steps:
S1: constructing a Cox neural network model, and acquiring a training data set and a verification data set from a target database;
S2: carrying out random sampling perturbation on the training data set according to a Bayesian prior knowledge constraint;
S3: defining a risk function of the Cox neural network model and calculating a loss function according to the training data set subjected to random sampling perturbation;
S4: training the Cox neural network model through the loss function, and updating the network weights of the Cox neural network model;
S5: verifying the updated network weights by using the verification data set; if the verification is passed, obtaining a deep Bayesian perturbation model and executing step S6; otherwise, returning to step S2 to perform random sampling perturbation again;
S6: inputting the data of the target cancer patient into the deep Bayesian perturbation model, and outputting the predicted value of the cancer prognosis risk of the patient.
2. The method according to claim 1, wherein in step S2, the Bayesian prior knowledge constraint is that the survival time of the deleted data has an upper bound and does not deviate too far from the survival time of the non-deleted data; wherein both deleted data and non-deleted data exist in the training data set; the training data set further includes survival times; the specific process of step S2 is as follows:
sorting the samples in the training data set according to survival time, and converting the survival time into a ranking value;
and taking the converted ranking value as the mean of a Gaussian distribution and, with the set variance of the Gaussian distribution, resampling from the distribution to obtain a new survival time for the sample, thereby completing the random sampling perturbation.
3. The method of claim 2, wherein in the random sampling perturbation process, a constant value is set for deleted samples, i.e. samples with deleted data; if the sampling result of a deleted sample falls in the region on the right side of the Gaussian distribution whose proportion equals the constant value, the sample is marked as a non-deleted sample.
4. The method for predicting cancer prognosis risk under high-dimensional deletion data according to claim 2, wherein in the step S3, a survival function S(t) of the Cox neural network model is defined, which is expressed as: $S(t) = \Pr(T > t)$, where $\Pr(T > t)$ denotes the probability that the patient survives beyond time t, and t is less than the time from data collection to the last observation of the patient, i.e. the survival time T; thus, the risk function at time t is defined as:
$$\lambda(t) = \lim_{\delta \to 0} \frac{\Pr(t \le T < t + \delta \mid T \ge t)}{\delta}$$
wherein δ represents a constant (a small time increment); according to the definition of the risk function, the Cox proportional risk function is obtained as follows:
$$\lambda(t \mid x) = \lambda_0(t)\exp(h(x))$$
wherein $x \in X$, X denotes the omics data of all patients in the training data set, x denotes the covariates affecting patient survival time, the risk function is $h(x_i) = \beta x_i$, $\lambda_0(t)$ represents the baseline risk function at time t, and β is a constant, indicating that the risk function is a linear combination of the covariates affecting patient survival time; the optimization objective of the Cox neural network model, i.e., the maximum likelihood function, is thus expressed as:
$$L(\beta) = \prod_{i: E_i = 1} \frac{\exp(h(x_i))}{\sum_{j \in R(T_i)} \exp(h(x_j))}$$
wherein $E_i$ represents the deletion label of sample i, $E_i = 1$ denotes a non-deleted sample, and $E_i = 0$ denotes a deleted sample; $R(T_i)$ represents the set of samples still surviving when sample i dies, and j is an individual of that set; thus, the loss function of a neural network based on the Cox proportional risk is:
$$l(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i)} \exp(h_\theta(x_j)) \Big)$$
wherein θ represents the network weights of the Cox neural network model; then, the above loss function is rewritten to obtain the loss function after Bayesian prior knowledge is introduced through the perturbation sampling mechanism, which is specifically represented as:
$$l_{pb}(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i^{pb})} \exp(h_\theta(x_j)) \Big)$$
wherein $T_i^{pb}$ represents the new survival time after the perturbation; then, in combination with deep learning optimization techniques, an L2 regularization term is introduced into the loss function, and the loss function is finally expressed as:
$$l_{final}(\theta) = l_{pb}(\theta) + \eta \lVert \theta \rVert_2^2$$
wherein η denotes the regularization coefficient.
5. The method of claim 4, wherein in step S6, the omics data X, the survival time T and the deletion label E of the target cancer patient are first obtained and used as input of the deep Bayesian perturbation model; risk prediction is performed by the deep Bayesian perturbation model, and the risk prediction value of the target cancer patient is finally output.
6. A system for predicting cancer prognosis risk under high-dimensional deletion data, which is used for implementing the method for predicting cancer prognosis risk under high-dimensional deletion data according to any one of claims 1-5; the system is characterized by comprising a model building module, a data acquisition module, a random sampling disturbance module, a loss function calculation module, a weight updating module, a verification module and a prediction module; wherein:
the model building module is used for building a Cox neural network model;
the data acquisition module is used for acquiring a training data set and a verification data set from a target database;
the random sampling disturbance module is used for carrying out random sampling disturbance on the training data set according to Bayesian prior knowledge constraint;
the loss function calculation module is used for defining a risk function of the Cox neural network model and calculating a loss function according to the training data set subjected to random sampling disturbance;
the weight updating module is used for training the Cox neural network model through a loss function and updating the network weight of the Cox neural network model;
the verification module is used for verifying the updated network weight by utilizing a verification data set;
the prediction module is used for inputting the data of the target cancer patient into the validated deep Bayesian disturbance model and outputting the predicted value of the cancer prognosis risk of the patient.
7. The system for predicting cancer prognosis risk under high-dimensional deletion data according to claim 6, wherein in the random sampling disturbance module, the Bayesian prior knowledge constraint is that the survival time of the deleted data has an upper bound and does not deviate too far from the survival time of the non-deleted data; wherein both deleted data and non-deleted data exist in the training data set; the training data set further includes survival times; the random sampling disturbance module specifically executes the following steps:
sorting the samples in the training data set according to survival time, and converting the survival time into a ranking value;
and taking the converted ranking value as the mean of a Gaussian distribution and, with the set variance of the Gaussian distribution, resampling from the distribution to obtain a new survival time for the sample, thereby completing the random sampling perturbation.
8. The system of claim 7, wherein during random sampling perturbation the random sampling disturbance module sets a constant value for deleted samples, i.e. samples with deleted data; if the sampling result of a deleted sample falls in the region on the right side of the Gaussian distribution whose proportion equals the constant value, the sample is marked as a non-deleted sample.
9. The system for predicting cancer prognosis risk under high-dimensional deletion data according to claim 6, wherein the following steps are specifically executed in the loss function calculation module:
defining a survival function S(t) of the Cox neural network model, which is specifically expressed as: $S(t) = \Pr(T > t)$, where $\Pr(T > t)$ denotes the probability that the patient survives beyond time t, and t is less than the time from data collection to the last observation of the patient, i.e. the survival time T; thus, the risk function at time t is defined as:
$$\lambda(t) = \lim_{\delta \to 0} \frac{\Pr(t \le T < t + \delta \mid T \ge t)}{\delta}$$
wherein δ represents a constant (a small time increment); according to the definition of the risk function, the Cox proportional risk function is obtained as follows:
$$\lambda(t \mid x) = \lambda_0(t)\exp(h(x))$$
wherein $x \in X$, X denotes the omics data of all patients in the training data set, x denotes the covariates affecting patient survival time, the risk function is $h(x_i) = \beta x_i$, $\lambda_0(t)$ represents the baseline risk function at time t, and β is a constant, indicating that the risk function is a linear combination of the covariates affecting patient survival time; the optimization objective of the Cox neural network model, i.e., the maximum likelihood function, is thus expressed as:
$$L(\beta) = \prod_{i: E_i = 1} \frac{\exp(h(x_i))}{\sum_{j \in R(T_i)} \exp(h(x_j))}$$
wherein $E_i$ represents the deletion label of sample i, $E_i = 1$ denotes a non-deleted sample, and $E_i = 0$ denotes a deleted sample; $R(T_i)$ represents the set of samples still surviving when sample i dies, and j is an individual of that set; thus, the loss function of a neural network based on the Cox proportional risk is:
$$l(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i)} \exp(h_\theta(x_j)) \Big)$$
wherein θ represents the network weights of the Cox neural network model; then, the above loss function is rewritten to obtain the loss function after Bayesian prior knowledge is introduced through the perturbation sampling mechanism, which is specifically represented as:
$$l_{pb}(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i^{pb})} \exp(h_\theta(x_j)) \Big)$$
wherein $T_i^{pb}$ represents the new survival time after the perturbation; then, in combination with deep learning optimization techniques, an L2 regularization term is introduced into the loss function, and the loss function is finally expressed as:
$$l_{final}(\theta) = l_{pb}(\theta) + \eta \lVert \theta \rVert_2^2$$
wherein η denotes the regularization coefficient.
10. The system for predicting cancer prognosis risk under high-dimensional deletion data according to claim 9, wherein the following steps are specifically executed in the prediction module:
obtaining the omics data X, the survival time T and the deletion label E of the target cancer patient, using them as input of the deep Bayesian perturbation model, performing risk prediction by the deep Bayesian perturbation model, and finally outputting the risk prediction value of the target cancer patient.
Background
With the development of medical assistance technology, researchers are increasingly trying to apply it to the adjuvant treatment of cancer. Among them, the survival analysis for cancer prognosis is a key auxiliary technology, which can predict the potential risk of the patient according to various physiological indexes of the patient, thereby helping the doctor to select the corresponding treatment scheme.
The biggest difficulty in cancer survival analysis is making use of the information in deleted (i.e., censored) samples to reveal the complex mechanism linking high-dimensional omics data to patient prognosis risk. The prior art includes mathematical statistical methods designed for deleted data. One approach applies the Cox proportional risk model (Cox model for short) to deep learning: the linear function of the Cox proportional risk is replaced with a function fitted by a neural network, a neural network model based on the Cox proportional risk is established, and the deep learning model is applied to disease prognosis survival analysis with a large number of samples. Another approach applies the Cox proportional risk-based neural network model to survival analysis of cancer patients and combines multiple deep learning optimization techniques, such as regularization and Dropout, to improve the prediction accuracy of the model on small sample data. However, this approach applies a deep learning model, which demands a large data sample size, to a small sample data set and does not further process the deleted data, so the neural network, with its strong fitting capability, is biased in its predictions on deleted data and overfits severely. To address this problem, the prior art has modified the proportional risk assumption in the Cox model and introduced time information into the model, which improves the performance of the Cox proportional risk-based neural network model on data sets with more samples. This improvement, however, mainly benefits data sets that have a large sample size and time information in the required format; it does not solve the overfitting of the Cox proportional risk-based neural network model on small sample data sets, which limits the performance of the method.
Chinese patent publication No. CN111312393A (published 2020-06-19) discloses a time-series deep survival analysis system combined with active learning, which comprises a data acquisition module, an active learning module and a time-series deep survival analysis module. The data acquisition module acquires survival data of an object to be analyzed; the active learning module uses an active learning method to select part of the right-deleted data and label their survival times; and the time-series deep survival analysis module constructs a time-series deep survival analysis neural network model and takes the non-deleted data and the right-deleted data as model input to obtain a survival time prediction result for the object to be analyzed. That invention can make full use of the right-deleted data and the time-series characteristics in the survival data. Compared with traditional survival analysis models, it addresses the problems that high-dimensional data are difficult to process and that models perform poorly when only a small amount of non-deleted data is available; it also adds the extraction and use of features along the time dimension, which expands the application range of the model and improves its performance. However, the method has the defects of high time complexity, high computational cost and low universality.
Disclosure of Invention
The invention aims to provide a system and a method for predicting cancer prognosis risk under high-dimensional deletion data with low time complexity, low computational cost and high universality aiming at the defects in the prior art.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a method for predicting cancer prognosis risk under high-dimensional deletion data comprises the following steps:
S1: constructing a Cox neural network model, and acquiring a training data set and a verification data set from a target database;
S2: carrying out random sampling perturbation on the training data set according to a Bayesian prior knowledge constraint;
S3: defining a risk function of the Cox neural network model and calculating a loss function according to the training data set subjected to random sampling perturbation;
S4: training the Cox neural network model through the loss function, and updating the network weights of the Cox neural network model;
S5: verifying the updated network weights by using the verification data set; if the verification is passed, obtaining a deep Bayesian perturbation model and executing step S6; otherwise, returning to step S2 to perform random sampling perturbation again;
S6: inputting the data of the target cancer patient into the deep Bayesian perturbation model, and outputting the predicted value of the cancer prognosis risk of the patient.
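The control flow of steps S2 to S5 can be sketched as a simple loop. This is only an illustrative outline: the function name `fit_with_perturbation`, the three callables and the `max_rounds` safety cap are assumptions introduced here and do not appear in the scheme itself.

```python
from typing import Any, Callable


def fit_with_perturbation(
    perturb: Callable[[], Any],         # S2: returns a freshly perturbed training set
    train_step: Callable[[Any], None],  # S3 + S4: computes the loss and updates weights
    validate: Callable[[], bool],       # S5: True when verification passes
    max_rounds: int = 100,              # assumed safety cap, not part of the scheme
) -> int:
    """Repeat S2-S4 until S5 passes; return the number of rounds used."""
    for round_idx in range(1, max_rounds + 1):
        perturbed = perturb()       # S2: new random sampling perturbation
        train_step(perturbed)       # S3/S4: loss on perturbed data, weight update
        if validate():              # S5: check on the verification data set
            return round_idx
    raise RuntimeError("verification did not pass within max_rounds")


# Toy usage: "training" just accumulates a counter until it reaches 3.
state = {"w": 0}
rounds = fit_with_perturbation(
    perturb=lambda: 1,
    train_step=lambda batch: state.__setitem__("w", state["w"] + batch),
    validate=lambda: state["w"] >= 3,
)
```

The point of the sketch is that a failed verification returns to the perturbation step rather than reusing the previous sample.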
Wherein, in the step S2, the Bayesian prior knowledge constraint is that the survival time of the deleted data has an upper bound and does not deviate too far from the survival time of the non-deleted data; both deleted data and non-deleted data exist in the training data set, and the training data set further includes survival times; the specific process of step S2 is as follows:
the samples in the training data set are sorted according to survival time, and the survival time is converted into a ranking value, which is specifically expressed as: $T_i' = \mathrm{Rank}(T_i)$, where the subscript i denotes the i-th sample and $T_i'$ represents the new survival time after the ranking preprocessing;
the converted ranking value is taken as the mean of a Gaussian distribution and a set constant α as its variance, and a new survival time of the sample is obtained by resampling from this distribution, specifically represented as $T_i^{pb} \sim N(T_i', \alpha)$, where $T_i^{pb}$ represents the new survival time after the perturbation; this completes the random sampling perturbation.
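The two operations above (rank transform, then Gaussian resampling around the rank) can be sketched with NumPy as follows. The function name `rank_perturb`, the tie-breaking rule and the example values are assumptions made for illustration; only the use of the rank as the mean and α as the variance comes from the text.

```python
import numpy as np


def rank_perturb(times, alpha, rng):
    """Replace survival times by ranks, then sample T_pb ~ N(rank, alpha)."""
    times = np.asarray(times, dtype=float)
    # Rank 1 is the shortest survival time; ties are broken by input order
    # (an assumption, the text does not say how ties are handled).
    ranks = np.empty(len(times), dtype=float)
    ranks[np.argsort(times, kind="stable")] = np.arange(1, len(times) + 1)
    # alpha is a variance, so the standard deviation is sqrt(alpha).
    t_pb = rng.normal(loc=ranks, scale=np.sqrt(alpha))
    return ranks, t_pb


rng = np.random.default_rng(0)
ranks, t_pb = rank_perturb([30.0, 5.0, 12.0, 48.0], alpha=0.25, rng=rng)
# ranks is [3, 1, 2, 4]; t_pb scatters around those values.
```

A small α keeps the perturbed times close to the original ordering, while a larger α lets neighbouring samples swap order more often.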
In the above scheme, the Bayesian prior knowledge cannot be written out explicitly as an integral expression; the idea of Monte Carlo sampling is therefore adopted, and the prior information is introduced approximately by sampling, which takes the place of the integral expression.
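In the random sampling perturbation process, a constant value γ is set for deleted samples, i.e. samples with deleted data; if the sampling result of a deleted sample falls in the region of proportion γ on the right side of the Gaussian distribution, the sample is marked as a non-deleted sample, specifically expressed as:
$$E_i \leftarrow 1 \quad \text{if } E_i = 0 \text{ and } T_i^{pb} > T_i' + \sqrt{\alpha}\,\Phi^{-1}(1-\gamma)$$
where $E_i$ is the deletion label of sample i ($E_i = 0$ for a deleted sample) and $\Phi^{-1}$ denotes the quantile function of the standard normal distribution.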
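One way to read the label rule is as an upper-tail test on the Gaussian centred at the rank. The sketch below assumes that "the region of proportion γ on the right side" means the upper tail holding probability γ; the function name `perturb_labels` and the example values are hypothetical.

```python
from statistics import NormalDist

import numpy as np


def perturb_labels(ranks, t_pb, events, alpha, gamma):
    """Mark deleted samples (E=0) as non-deleted (E=1) when the draw t_pb
    lands in the upper tail of N(rank, alpha) that holds probability gamma."""
    ranks = np.asarray(ranks, dtype=float)
    t_pb = np.asarray(t_pb, dtype=float)
    events = np.asarray(events, dtype=int)
    # Standard normal quantile for the (1 - gamma) level; requires 0 < gamma < 1.
    z = NormalDist().inv_cdf(1.0 - gamma)
    threshold = ranks + z * np.sqrt(alpha)
    flip = (events == 0) & (t_pb > threshold)
    return np.where(flip, 1, events)


new_events = perturb_labels(
    ranks=[1, 2, 3], t_pb=[1.0, 5.0, 3.0], events=[0, 0, 1],
    alpha=1.0, gamma=0.1,
)
# Only the second sample is deleted and far enough to the right to be flipped.
```

Samples that are already non-deleted are never changed, so γ controls only how often a deleted sample is treated as an observed event in a given round.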
In this scheme, the random perturbation operation is performed anew each time the Cox neural network model is trained, and the resulting samples are then input into the Cox neural network model for training.
In step S3, a survival function S(t) of the Cox neural network model is defined, which is specifically expressed as: $S(t) = \Pr(T > t)$, where $\Pr(T > t)$ denotes the probability that the patient survives beyond time t, and t is less than the time from data collection to the last observation of the patient, i.e. the survival time T; thus, the risk function at time t is defined as:
$$\lambda(t) = \lim_{\delta \to 0} \frac{\Pr(t \le T < t + \delta \mid T \ge t)}{\delta}$$
wherein δ represents a constant (a small time increment); according to the definition of the risk function, the Cox proportional risk function is obtained as follows:
$$\lambda(t \mid x) = \lambda_0(t)\exp(h(x))$$
wherein $x \in X$, X denotes the omics data of all patients in the training data set, x denotes the covariates affecting patient survival time, the risk function is $h(x_i) = \beta x_i$, $\lambda_0(t)$ represents the baseline risk function at time t, and β is a constant, indicating that the risk function is a linear combination of the covariates affecting patient survival time; the optimization objective of the Cox neural network model, i.e., the maximum likelihood function, is thus expressed as:
$$L(\beta) = \prod_{i: E_i = 1} \frac{\exp(h(x_i))}{\sum_{j \in R(T_i)} \exp(h(x_j))}$$
wherein $E_i$ represents the deletion label of sample i, $E_i = 1$ denotes a non-deleted sample, and $E_i = 0$ denotes a deleted sample; $R(T_i)$ represents the set of samples still surviving when sample i dies, and j is an individual of that set; thus, the loss function of a neural network based on the Cox proportional risk is:
$$l(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i)} \exp(h_\theta(x_j)) \Big)$$
wherein θ represents the network weights of the Cox neural network model; then, the above loss function is rewritten to obtain the loss function after Bayesian prior knowledge is introduced through the perturbation sampling mechanism, which is specifically represented as:
$$l_{pb}(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i^{pb})} \exp(h_\theta(x_j)) \Big)$$
wherein $T_i^{pb}$ represents the new survival time after the perturbation; then, in combination with deep learning optimization techniques, an L2 regularization term is introduced into the loss function, and the loss function is finally expressed as:
$$l_{final}(\theta) = l_{pb}(\theta) + \eta \lVert \theta \rVert_2^2$$
wherein η denotes the regularization coefficient.
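A minimal NumPy sketch of such a loss is given below. It assumes the standard negative log partial likelihood of the Cox model, evaluated on the perturbed times, plus an L2 penalty; the function name `cox_loss`, the argument names and the coefficient `l2` are illustrative choices, not the exact formulation of the scheme.

```python
import numpy as np


def cox_loss(h, t_pb, events, weights=None, l2=0.0):
    """Negative log partial likelihood on perturbed times, plus an L2 penalty.

    h       : predicted risk scores h_theta(x_i), shape (n,)
    t_pb    : perturbed survival times T_i^pb, shape (n,)
    events  : deletion labels E_i (1 = non-deleted, 0 = deleted), shape (n,)
    weights : optional list of weight arrays for the L2 term
    """
    h = np.asarray(h, dtype=float)
    t_pb = np.asarray(t_pb, dtype=float)
    events = np.asarray(events, dtype=float)
    # at_risk[i, j] is True when sample j is still at risk at time t_pb[i].
    at_risk = t_pb[None, :] >= t_pb[:, None]
    # Log-sum-exp over each risk set, shifted by the maximum for stability.
    shift = h.max()
    log_risk = shift + np.log((at_risk * np.exp(h - shift)[None, :]).sum(axis=1))
    nll = -np.sum(events * (h - log_risk))
    penalty = 0.0 if weights is None else l2 * sum(float(np.sum(w ** 2)) for w in weights)
    return float(nll + penalty)


# Two samples with equal scores: the first event contributes log(2), the last 0.
loss = cox_loss([0.0, 0.0], [1.0, 2.0], [1, 1])
```

Because only the ordering of `t_pb` enters the risk sets, a perturbation changes the loss exactly when it changes which samples are considered at risk for a given event.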
In step S6, the omics data X, the survival time T and the deletion label E of the target cancer patient are first obtained and used as input of the deep Bayesian perturbation model; risk prediction is performed by the deep Bayesian perturbation model, and the risk prediction value of the target cancer patient is finally output.
In this scheme, the omics data X comprise all measured indicators of the patient; building the Cox neural network model amounts to finding the correspondence between the omics data X and the patient risk prediction value H.
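As a rough illustration of a mapping from omics data X to a risk value H, the sketch below uses a one-hidden-layer network in NumPy. The architecture (one ReLU layer, a single linear output), the layer sizes and the names `init_mlp` and `risk_score` are all assumptions; the scheme does not specify the network structure.

```python
import numpy as np


def init_mlp(n_features, n_hidden, rng):
    """Randomly initialise a one-hidden-layer network (assumed architecture)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_features, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, 1)),
        "b2": np.zeros(1),
    }


def risk_score(params, X):
    """Map omics features X of shape (n, n_features) to one risk value per patient."""
    hidden = np.maximum(0.0, X @ params["W1"] + params["b1"])  # ReLU activation
    return (hidden @ params["W2"] + params["b2"]).ravel()


params = init_mlp(n_features=6, n_hidden=4, rng=np.random.default_rng(0))
H = risk_score(params, np.ones((5, 6)))  # five patients, six features each
```

In this sketch the output plays the role of the risk function h(x); a higher value corresponds to a higher predicted prognosis risk.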
The method aims to solve the prediction bias that arises when existing deep learning methods for survival analysis are applied to high-dimensional small sample data containing a large number of deleted samples, and to improve the performance of the Cox neural network model in cancer survival analysis. In this scheme, a Bayesian prior knowledge constraint is introduced for the deleted samples, and optimization modules for sample ranking perturbation and deletion label perturbation are added, providing a deep learning Cox neural network model suitable for high-dimensional small sample cancer data and thereby solving the problem that existing prediction methods perform poorly on such data.
A system for predicting cancer prognosis risk under high-dimensional deletion data is used for realizing a method for predicting cancer prognosis risk under high-dimensional deletion data; the system comprises a model construction module, a data acquisition module, a random sampling disturbance module, a loss function calculation module, a weight updating module, a verification module and a prediction module; wherein:
the model building module is used for building a Cox neural network model;
the data acquisition module is used for acquiring a training data set and a verification data set from a target database;
the random sampling disturbance module is used for carrying out random sampling disturbance on the training data set according to Bayesian prior knowledge constraint;
the loss function calculation module is used for defining a risk function of the Cox neural network model and calculating a loss function according to the training data set subjected to random sampling disturbance;
the weight updating module is used for training the Cox neural network model through a loss function and updating the network weight of the Cox neural network model;
the verification module is used for verifying the updated network weight by utilizing a verification data set;
the prediction module is used for inputting the data of the target cancer patient into the validated deep Bayesian disturbance model and outputting the predicted value of the cancer prognosis risk of the patient.
In the random sampling disturbance module, the Bayesian prior knowledge constraint is that the survival time of the deleted data has an upper bound and does not deviate too far from the survival time of the non-deleted data; both deleted data and non-deleted data exist in the training data set, and the training data set further includes survival times; the random sampling disturbance module specifically executes the following steps:
sorting the samples in the training data set according to survival time, and converting the survival time into a ranking value;
and taking the converted ranking value as the mean of a Gaussian distribution and, with the set variance of the Gaussian distribution, resampling from the distribution to obtain a new survival time for the sample, thereby completing the random sampling perturbation.
In the random sampling perturbation process, the random sampling disturbance module sets a constant value for deleted samples, i.e. samples with deleted data; if the sampling result of a deleted sample falls in the region on the right side of the Gaussian distribution whose proportion equals the constant value, the sample is marked as a non-deleted sample.
Wherein, in the loss function calculation module, the following steps are specifically executed:
defining a survival function S(t) of the Cox neural network model, which is specifically expressed as: $S(t) = \Pr(T > t)$, where $\Pr(T > t)$ denotes the probability that the patient survives beyond time t, and t is less than the time from data collection to the last observation of the patient, i.e. the survival time T; thus, the risk function at time t is defined as:
$$\lambda(t) = \lim_{\delta \to 0} \frac{\Pr(t \le T < t + \delta \mid T \ge t)}{\delta}$$
wherein δ represents a constant (a small time increment); according to the definition of the risk function, the Cox proportional risk function is obtained as follows:
$$\lambda(t \mid x) = \lambda_0(t)\exp(h(x))$$
wherein $x \in X$, X denotes the omics data of all patients in the training data set, x denotes the covariates affecting patient survival time, the risk function is $h(x_i) = \beta x_i$, $\lambda_0(t)$ represents the baseline risk function at time t, and β is a constant, indicating that the risk function is a linear combination of the covariates affecting patient survival time; the optimization objective of the Cox neural network model, i.e., the maximum likelihood function, is thus expressed as:
$$L(\beta) = \prod_{i: E_i = 1} \frac{\exp(h(x_i))}{\sum_{j \in R(T_i)} \exp(h(x_j))}$$
wherein $E_i$ represents the deletion label of sample i, $E_i = 1$ denotes a non-deleted sample, and $E_i = 0$ denotes a deleted sample; $R(T_i)$ represents the set of samples still surviving when sample i dies, and j is an individual of that set; thus, the loss function of a neural network based on the Cox proportional risk is:
$$l(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i)} \exp(h_\theta(x_j)) \Big)$$
wherein θ represents the network weights of the Cox neural network model; then, the above loss function is rewritten to obtain the loss function after Bayesian prior knowledge is introduced through the perturbation sampling mechanism, which is specifically represented as:
$$l_{pb}(\theta) = -\sum_{i: E_i = 1} \Big( h_\theta(x_i) - \log \sum_{j \in R(T_i^{pb})} \exp(h_\theta(x_j)) \Big)$$
wherein $T_i^{pb}$ represents the new survival time after the perturbation; then, in combination with deep learning optimization techniques, an L2 regularization term is introduced into the loss function, and the loss function is finally expressed as:
$$l_{final}(\theta) = l_{pb}(\theta) + \eta \lVert \theta \rVert_2^2$$
wherein η denotes the regularization coefficient.
Wherein, in the prediction module, the following steps are specifically executed:
obtaining the omics data X, the survival time T and the deletion label E of the target cancer patient, using them as input of the deep Bayesian perturbation model, performing risk prediction by the deep Bayesian perturbation model, and finally outputting the risk prediction value of the target cancer patient.
Compared with the prior art, the technical scheme of the invention has the beneficial effects that:
The invention provides a system and a method for predicting cancer prognosis risk under high-dimensional deletion data. By introducing a Bayesian prior knowledge constraint and adding a random sampling perturbation process, the constructed Cox neural network model becomes suitable for proportional risk prediction on high-dimensional small sample cancer data, which solves the problem that existing prediction methods perform poorly on such data and improves the performance of deep learning methods for cancer survival analysis.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of a method for deep learning of prognosis risk of a cancer patient according to an embodiment of the present invention;
FIG. 3 is a graph showing the comparison of omics data for different cancer survival assays according to an embodiment of the present invention (C-index is a comparison index);
FIG. 4 is a graph illustrating comparison of performance of Cox neural network model before and after using DBP optimization module in simulation data according to an embodiment of the present invention.
Detailed Description
The drawings are for illustrative purposes only and are not to be construed as limiting the patent;
for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;
it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Example 1
As shown in fig. 1, a method for predicting cancer prognosis risk under high-dimensional deletion data comprises the following steps:
S1: constructing a Cox neural network model, and acquiring a training data set and a verification data set from a target database;
S2: carrying out random sampling perturbation on the training data set according to a Bayesian prior knowledge constraint;
S3: defining a risk function of the Cox neural network model and calculating a loss function according to the training data set subjected to random sampling perturbation;
S4: training the Cox neural network model through the loss function, and updating the network weights of the Cox neural network model;
S5: verifying the updated network weights by using the verification data set; if the verification is passed, obtaining a deep Bayesian perturbation model and executing step S6; otherwise, returning to step S2 to perform random sampling perturbation again;
S6: inputting the data of the target cancer patient into the deep Bayesian perturbation model, and outputting the predicted value of the cancer prognosis risk of the patient.
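The verification in step S5 needs some measure of predictive quality. The text names the C-index only as the comparison index for FIG. 3, so using it as the verification criterion is an assumption of this sketch, as is the function name `concordance_index`.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs in which the higher predicted risk
    belongs to the sample with the shorter survival time (ties count 0.5)."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier sample is non-deleted
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Risk decreases as survival time increases: perfectly concordant.
c_good = concordance_index([1, 2, 3], [1, 1, 0], [3.0, 2.0, 1.0])
# Risk increases with survival time: fully discordant.
c_bad = concordance_index([1, 2, 3], [1, 1, 0], [1.0, 2.0, 3.0])
```

A value of 0.5 corresponds to random ordering, so a verification rule could, for example, require the C-index on the verification data set to exceed a chosen threshold.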
More specifically, in the step S2, the Bayesian prior knowledge constraint is that the survival time of the deleted data has an upper bound and does not deviate too far from the survival time of the non-deleted data; both deleted data and non-deleted data exist in the training data set, and the training data set further includes survival times; the specific process of step S2 is as follows:
the samples in the training data set are sorted according to survival time, and the survival time is converted into a ranking value, which is specifically expressed as: $T_i' = \mathrm{Rank}(T_i)$, where the subscript i denotes the i-th sample and $T_i'$ represents the new survival time after the ranking preprocessing;
the converted ranking value is taken as the mean of a Gaussian distribution and a set constant α as its variance, and a new survival time of the sample is obtained by resampling from this distribution, specifically represented as $T_i^{pb} \sim N(T_i', \alpha)$, where $T_i^{pb}$ represents the new survival time after the perturbation; this completes the random sampling perturbation.
In the specific implementation process, the Bayesian prior knowledge cannot be written out explicitly as an integral expression; the idea of Monte Carlo sampling is therefore adopted, and the prior information is introduced approximately by sampling, which takes the place of the integral expression.
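One way to read this Monte Carlo argument is sketched below. It assumes that the integral in question is the expectation of the loss over the Gaussian perturbation distribution; the symbol K (number of draws) and the superscript (k) are notation introduced here for illustration only.

```latex
\mathbb{E}_{T^{pb} \sim N(T', \alpha)}\big[\, l(\theta; T^{pb}) \,\big]
  \;\approx\; \frac{1}{K} \sum_{k=1}^{K} l\big(\theta; T^{pb,(k)}\big),
\qquad T_i^{pb,(k)} \sim N(T_i', \alpha)
```

Under this reading, drawing a fresh perturbation for every training round corresponds to using one new Monte Carlo sample per round, so that the average over rounds approximates the expectation.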
In the specific implementation process, Bayesian prior knowledge is introduced for the deleted samples, which corrects the prediction bias of existing deep learning methods applied to small sample cancer data sets and improves the stability and accuracy of the prediction model. At the same time, drawing on the ideas of Dropout and Monte Carlo sampling, the prior knowledge about deleted data is introduced into the model by random sampling perturbation, so that small sample deleted data can be used to predict the prognosis risk of cancer patients more accurately, which gives the method reference value.
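More specifically, in the random sampling perturbation process, a constant value γ is set for deleted samples, i.e. samples with deleted data; if the sampling result of a deleted sample falls in the region of proportion γ on the right side of the Gaussian distribution, the sample is marked as a non-deleted sample, specifically expressed as:
$$E_i \leftarrow 1 \quad \text{if } E_i = 0 \text{ and } T_i^{pb} > T_i' + \sqrt{\alpha}\,\Phi^{-1}(1-\gamma)$$
where $E_i$ is the deletion label of sample i ($E_i = 0$ for a deleted sample) and $\Phi^{-1}$ denotes the quantile function of the standard normal distribution.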
In the specific implementation process, the random perturbation operation is performed anew each time the Cox neural network model is trained, and the resulting samples are then input into the Cox neural network model for training.
More specifically, in step S3, a survival function S(t) of the Cox neural network model is defined, which is specifically represented as: S(t) = Pr(T > t), where Pr denotes the probability that the patient survives beyond time t, and t is less than the time from data collection to the last observation of the patient, i.e. the survival time T; thus, the risk function at time t is defined as:
wherein δ represents a constant; according to the definition of the risk function, the Cox proportional risk function is obtained as follows:
λ(t|x) = λ_0(t)·exp(h(x))
wherein x ∈ X, X denotes the omics data of all patients in the training data set, x denotes the covariates affecting patient survival time, and the risk function is h(x_i) = β·x_i; λ_0(t) represents the baseline risk function at time t, and β is a constant, indicating that the risk function is a linear combination of the covariates of patient survival time; the optimization objective of the Cox neural network model, i.e., the maximum likelihood function, is thus expressed as:
wherein E_i represents the deletion tag of sample i, E_i = 1 denotes a non-deleted sample, and E_i = 0 denotes a deleted sample; R(T_i) represents the set of samples still surviving when sample i dies, and j is an individual in that set; thus, the loss function of a neural network based on the Cox proportional hazards is:
wherein θ represents the network weight of the Cox neural network model; then, the prediction loss function is rewritten to obtain the loss function after Bayesian prior knowledge is introduced through the disturbance sampling mechanism, which is specifically represented as:
wherein T_i^pb represents the new survival time after the perturbation; then, in combination with a deep learning optimization technique, an L2 regularization term is introduced into the loss function, and the loss function is finally expressed as:
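The displayed formulas referred to in this derivation are not reproduced in the present text. For orientation only, the following block gives the standard textbook forms that agree with the symbols defined above (δ, λ_0, h, E_i, R(T_i), θ, T_i^pb). These are assumptions based on the usual Cox partial-likelihood formulation, not the original formulas; in particular, any normalization constant, the symbol E_i^{pb} for the perturbed tag, and the reuse of λ as the regularization coefficient are assumed.

```latex
% Hazard (risk) function at time t
\lambda(t) = \lim_{\delta \to 0}
  \frac{\Pr\left(t \le T < t + \delta \mid T \ge t\right)}{\delta}

% Cox partial likelihood over non-deleted samples
L(\beta) = \prod_{i:\,E_i = 1}
  \frac{\exp\left(h(x_i)\right)}{\sum_{j \in R(T_i)} \exp\left(h(x_j)\right)}

% Loss of the Cox-based neural network with weights \theta
l(\theta) = -\sum_{i:\,E_i = 1}
  \left( h_\theta(x_i) - \log \sum_{j \in R(T_i)} \exp\left(h_\theta(x_j)\right) \right)

% Loss after perturbation (perturbed times and tags)
l_{pb}(\theta) = -\sum_{i:\,E_i^{pb} = 1}
  \left( h_\theta(x_i) - \log \sum_{j \in R(T_i^{pb})} \exp\left(h_\theta(x_j)\right) \right)

% Final loss with L2 regularization; here \lambda is the regularization coefficient
l_{final}(\theta) = l_{pb}(\theta) + \lambda \lVert \theta \rVert_2^2
```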
More specifically, in step S6, the omics data X, survival time T and deletion tag E of the target cancer patient are first obtained and used as the input of the deep Bayesian disturbance model; the model then performs risk prediction and finally outputs the predicted risk value of the target cancer patient.
The method aims to solve the prediction deviation that arises when existing deep learning methods for survival analysis are applied to high-dimensional small-sample data containing a large number of deleted samples, and to improve the performance of the Cox neural network model in cancer survival analysis. The scheme introduces a Bayesian prior knowledge constraint for the deleted samples and, by adding two optimization modules (sample ranking disturbance and deletion tag disturbance), provides a deep learning Cox neural network model suited to high-dimensional small-sample cancer data, thereby addressing the poor performance of existing prediction methods on such data.
Example 2
More specifically, on the basis of embodiment 1, a system for predicting cancer prognosis risk under high-dimensional deletion data is provided, which is used for implementing a method for predicting cancer prognosis risk under high-dimensional deletion data; the system comprises a model construction module, a data acquisition module, a random sampling disturbance module, a loss function calculation module, a weight updating module, a verification module and a prediction module; wherein:
the model building module is used for building a Cox neural network model;
the data acquisition module is used for acquiring a training data set and a verification data set from a target database;
the random sampling disturbance module is used for carrying out random sampling disturbance on the training data set according to Bayesian prior knowledge constraint;
the loss function calculation module is used for defining a risk function of the Cox neural network model and calculating a loss function according to the training data set subjected to random sampling disturbance;
the weight updating module is used for training the Cox neural network model through a loss function and updating the network weight of the Cox neural network model;
the verification module is used for verifying the updated network weight by utilizing a verification data set;
the prediction module is used for inputting the data of the target cancer patient into the validated deep Bayesian disturbance model and outputting the predicted value of the cancer prognosis risk of the patient.
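The way these modules cooperate (perturb, train, verify, and return to the perturbation step if verification fails) can be sketched as a small control loop. The function name `fit_with_perturbation`, the callables passed in, and the `max_rounds` cap are hypothetical; the description itself does not state a limit on the number of rounds.

```python
def fit_with_perturbation(model, train_set, perturb, train_step, validate,
                          max_rounds=100):
    """Repeat perturbation, training and verification until verification passes.

    perturb    : callable(train_set) -> perturbed training set
    train_step : callable(model, perturbed_set) -> updated model
    validate   : callable(model) -> bool, True when verification passes
    max_rounds : safety cap on the number of rounds (an assumption)
    """
    for round_no in range(1, max_rounds + 1):
        perturbed = perturb(train_set)        # random sampling disturbance module
        model = train_step(model, perturbed)  # loss calculation and weight update
        if validate(model):                   # verification module
            return model, round_no            # model is ready for prediction
    raise RuntimeError("verification did not pass within max_rounds")
```

Passing the modules as callables keeps the loop independent of any particular network library, which is a design choice of this sketch rather than a feature stated in the description.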
More specifically, in the random sampling perturbation module, the Bayesian prior knowledge constraint is the constraint that the survival time of the deleted data lies within an upper bound and does not deviate too much from the survival time of the non-deleted data; wherein the deleted data and non-deleted data exist in the training data set; the training data set further includes the survival time; the random sampling perturbation module specifically executes the following steps:
sorting the samples in the training data set according to the survival time, and converting the survival time into a ranking value;
and taking the converted ranking value as the mean value of a Gaussian distribution, and resampling from the distribution according to the set variance of the Gaussian distribution to obtain a new survival time of the sample, thereby completing the random sampling perturbation.
More specifically, in the random sampling perturbation process, a constant value is set for the deleted samples; if the value sampled for a deleted sample falls in the right-hand tail of the Gaussian distribution whose area ratio equals the set constant value, the sample is marked as a non-deleted sample.
More specifically, in the loss function calculation module, the following steps are specifically performed:
defining a survival function S(t) of the Cox neural network model, specifically represented as: S(t) = Pr(T > t), where Pr denotes the probability that the patient survives beyond time t, and t is less than the time from data collection to the last observation of the patient, i.e. the survival time T; thus, the risk function at time t is defined as:
wherein δ represents a constant; according to the definition of the risk function, the Cox proportional risk function is obtained as follows:
λ(t|x) = λ_0(t)·exp(h(x))
wherein x ∈ X, X denotes the omics data of all patients in the training data set, x denotes the covariates affecting patient survival time, and the risk function is h(x_i) = β·x_i; λ_0(t) represents the baseline risk function at time t, and β is a constant, indicating that the risk function is a linear combination of the covariates of patient survival time; the optimization objective of the Cox neural network model, i.e., the maximum likelihood function, is thus expressed as:
wherein E_i represents the deletion tag of sample i, E_i = 1 denotes a non-deleted sample, and E_i = 0 denotes a deleted sample; R(T_i) represents the set of samples still surviving when sample i dies, and j is an individual in that set; thus, the loss function of a neural network based on the Cox proportional hazards is:
wherein θ represents the network weight of the Cox neural network model; then, the prediction loss function is rewritten to obtain the loss function after Bayesian prior knowledge is introduced through the disturbance sampling mechanism, which is specifically represented as:
wherein T_i^pb represents the new survival time after the perturbation; then, in combination with a deep learning optimization technique, an L2 regularization term is introduced into the loss function, and the loss function is finally expressed as:
More specifically, in the prediction module, the following steps are specifically executed:
obtaining the omics data X, survival time T and deletion tag E of the target cancer patient, using them as the input of the deep Bayesian disturbance model, performing risk prediction with the deep Bayesian disturbance model, and finally outputting the risk prediction value of the target cancer patient.
Example 3
More specifically, in order to further illustrate the technical scheme and technical effects of the present invention, the present embodiment applies the content of the present invention to identify a target gene affecting the prognosis of breast cancer, and the specific process is as follows:
obtaining omics expression data X, survival time T and deletion tag E of a target cancer patient, wherein the data are derived from a breast cancer data set (BRCA) in a TCGA cancer public data set;
wherein the omics data are mRNA expression data of breast cancer patients; the data are RNA sequencing data generated on the UNC IlluminaHiSeq_RNASeqV2 platform and are TCGA level 3 data. The deletion tag E = 1 indicates that the patient died within the recorded observation time T; E = 0 indicates that the patient had not died within the recorded time T and that no subsequent information was observed or recorded.
Next, the data are preprocessed: all genes and samples whose proportion of missing values exceeds 20% are removed from the data, and the remaining missing values are filled with 0. As shown in FIG. 2, the input samples are sorted by survival time and the survival time is converted into a ranking value, T_i' = Rank(T_i) = i;
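The preprocessing step can be sketched as follows. This is an illustration under stated assumptions: missing values are represented by `None`, rows are samples and columns are genes, genes are filtered before samples (the description does not state the order), and the function name `preprocess` is hypothetical.

```python
def preprocess(matrix, max_missing=0.2):
    """Drop genes and samples with too many missing values, then fill with 0.

    matrix      : list of rows (samples), each a list of gene values or None
    max_missing : largest tolerated fraction of missing values (20% here)
    Returns (cleaned rows, kept gene indices, kept sample indices).
    """
    n_samples = len(matrix)
    n_genes = len(matrix[0]) if matrix else 0

    # Keep genes (columns) whose missing fraction does not exceed the limit.
    keep_genes = [
        g for g in range(n_genes)
        if sum(row[g] is None for row in matrix) <= max_missing * n_samples
    ]

    # Keep samples (rows) whose missing fraction over the kept genes does
    # not exceed the limit, and fill their remaining gaps with 0.
    rows, keep_samples = [], []
    for idx, row in enumerate(matrix):
        values = [row[g] for g in keep_genes]
        if sum(v is None for v in values) <= max_missing * len(values):
            rows.append([0.0 if v is None else v for v in values])
            keep_samples.append(idx)
    return rows, keep_genes, keep_samples
```

The returned index lists make it possible to apply the same selection to the survival times and deletion tags of the retained samples.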
The converted survival time value is taken as the mean of a Gaussian distribution and a manually set constant α as its variance, and a new survival time of the sample is obtained by resampling from the distribution, which is specifically represented as T_i^pb ~ N(T_i', α);
For a deleted sample, a constant γ is set; if the value sampled for the deleted sample falls in the right-hand tail of the Gaussian distribution whose area ratio is γ, the sample is marked as a non-deleted sample, which is specifically expressed as:
The risk prediction loss function of the Cox neural network model is then calculated from the disturbed data as follows:
wherein λ is the constant coefficient of the regularization term; the model is then optimized by a stochastic gradient descent algorithm and the network weight θ of the Cox neural network model is updated. Each time the neural network is trained, the sampling disturbance operation of the above steps is repeated and the disturbed data are input into the Cox neural network model for training.
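As an illustration of how such a loss could be evaluated on the disturbed data, the sketch below computes the negative log Cox partial likelihood plus an L2 penalty. It assumes the standard partial-likelihood form, an unnormalized sum over non-deleted samples, and a risk set defined by "time greater than or equal to T_i"; the function name `cox_loss` and its parameters are hypothetical, and `lam` stands for the regularization coefficient λ.

```python
import math


def cox_loss(risk, times, events, weights=(), lam=0.0):
    """Negative log partial likelihood with an L2 penalty.

    risk    : predicted risk values h_theta(x_i), one per sample
    times   : (perturbed) survival times
    events  : (perturbed) deletion tags, 1 = non-deleted, 0 = deleted
    weights : flat iterable of network weights theta (for the penalty)
    lam     : regularization coefficient
    """
    loss = 0.0
    for i, (t_i, e_i) in enumerate(zip(times, events)):
        if e_i != 1:
            continue  # deleted samples contribute only through risk sets
        # Risk set: samples still surviving at time T_i.
        at_risk = [risk[j] for j, t_j in enumerate(times) if t_j >= t_i]
        # Numerically stable log-sum-exp over the risk set.
        m = max(at_risk)
        log_sum = m + math.log(sum(math.exp(r - m) for r in at_risk))
        loss -= risk[i] - log_sum
    return loss + lam * sum(w * w for w in weights)
```

In an actual training run the same expression would be written with an automatic-differentiation library so that stochastic gradient descent can update θ; the pure-Python form here only shows the arithmetic.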
In the specific implementation process, the Cox neural network model needs to be trained multiple times; the data are disturbed before each training round and then fed into the Cox neural network model, and disturbance and training alternate until training is finished. The Cox neural network model is a tool for fitting an arbitrary function; after training, it is equivalent to a function that maps omics data to risk values. When omics data X are input, the Cox neural network model outputs the corresponding risk prediction value.
The risk prediction module in FIG. 2 represents the process of predicting a risk value and evaluating the effectiveness of the risk prediction. The disturbance acts only while the Cox neural network model is being trained. The arrow in the figure from the disturbance to risk prediction reflects that training has an optimization objective: the model first predicts the risk, and the optimization objective then evaluates the accuracy of that prediction so that the model can adjust itself. Because the disturbance changes the data fed to the Cox neural network model, it affects the optimization objective of the model.
Based on the above model, when the omics data features X of the target cancer patient are input, the model outputs the patient's risk value and assigns the patient to the high-risk group or the low-risk group. As shown in FIG. 3, in risk prediction on the BRCA data, the mean C-index over 10 independent repeated predictions improves from 0.669 for the existing proportional risk deep learning model to 0.718, and the other C-index statistics, such as the median, highest and lowest values, are also better than those of the comparison methods CoxEN, Random Survival Forest (RSF) and the proportional risk network (Cox-network). Among them, CoxEN is an improved Cox proportional hazards method, and RSF is one of the conventional methods shown in recent years to perform best. CoxNN denotes the Cox neural network, also referred to as the Cox neural network model.
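The C-index used for these comparisons measures how often the predicted risk ordering agrees with the observed survival ordering. The sketch below follows the commonly used Harrell definition, which is an assumption here since the text does not state which variant was used; the function name `concordance_index` is hypothetical.

```python
def concordance_index(times, events, risk):
    """Fraction of comparable pairs whose predicted risks are correctly ordered.

    A pair (i, j) is comparable when sample i is non-deleted (events[i] == 1)
    and times[i] < times[j]. It is concordant when the shorter-lived sample
    has the higher predicted risk; ties in risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue  # a deleted sample cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which gives a scale for reading the reported change from 0.669 to 0.718.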
In addition, 3 data sets were selected from the TCGA public cancer data (BRCA, CESC, COAD) and 3 data sets from the GEO public data (GSE4922, GSE1456, GSE25006). On these 6 public data sets, the proposed approach improved the C-index by 4% on average compared with the best of the other methods. The performance improvement is more pronounced on data sets with a higher proportion of deleted data. These results show that the method can, to a certain extent, resolve the prediction deviation of deep learning models on high-dimensional deleted data and effectively improve the prediction performance of the model.
As shown in FIG. 4, the simulation experiment compares the performance of the deep Bayesian disturbance (DBP) model proposed by the present invention with that of the existing proportional risk deep learning model on data sets with different proportions of deleted samples and different sample sizes. The results show that the performance gain of the DBP model becomes more pronounced as the proportion of deleted samples in the data increases, with the C-index improving by up to 9% on average. The gain comes mainly from introducing the constraint of Bayesian prior information on the deleted data. The DBP model also shows a clear performance improvement on data with a small sample size. Taken together, the results of FIG. 4 indicate that the method significantly improves on the existing method for data sets that are high-dimensional, have few samples and contain many deleted data.
Table 1 examines whether the performance improvement of the method depends on a specific deletion rule. Three different deletion rules (exponential deletion, logarithmic deletion and uniform deletion) are designed in the simulation data, and the performance of the model of the present invention is compared with that of the previous deep learning model on these data. The results show that, compared with the traditional proportional risk neural network, the model achieves a clear performance improvement on the data sets of every deletion rule; when the deletion proportion of the data reaches 75%, the C-index of the model on the three types of data improves by 8.5% on average. The improvement therefore does not depend on a specific deletion rule: it comes from correcting the model's prediction deviation on the deleted data, and the model is generally applicable.
Table 1. C-index values of the DBP model and the existing neural network model on simulation data under different deletion rules
In Table 2, the present example also reports an ablation experiment and a parameter sensitivity experiment on the deletion disturbance probability parameter γ, to explore how much the Bayesian prior knowledge introduced by the deletion disturbance mechanism contributes to the prediction accuracy of the model. The results show that, for γ in the range from 0 to 10%, increasing γ significantly improves model performance. As γ increases further, performance improves only slightly, and when γ exceeds 25% performance hardly improves or decreases slightly. Therefore, the improvement does not depend on fine tuning of γ: model performance is clearly improved as long as γ lies in the range of 10% to 20%, and the model is generally applicable. This also indicates that the deletion disturbance mechanism corrects the prediction deviation of the model by introducing randomness; once the deviation is corrected, the exact size of γ has no significant effect on model performance.
Table 2. Effect of the deletion disturbance probability parameter γ on the experimental results
It should be understood that the above-described embodiments of the present invention are merely examples for clearly illustrating the present invention and are not intended to limit the embodiments of the present invention. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention should be included in the protection scope of the claims of the present invention.