miRNA-disease association relation prediction method based on graph neural network
1. A miRNA-disease association relation prediction method based on a graph neural network is characterized by comprising the following steps:
(1) obtaining miRNA-disease association relation data L:
downloading, from a miRNA-disease association database, M miRNAs r = {r_1, r_2, …, r_m, …, r_M}, U diseases d = {d_1, d_2, …, d_u, …, d_U} associated with them, and R pieces of miRNA-disease association data L = {L_1, L_2, …, L_r, …, L_R}, wherein each miRNA r_m is associated with at least one disease and each disease d_u is associated with at least one miRNA, M ≥ 100, r_m represents the m-th miRNA, U ≥ 100, d_u represents the u-th disease, R ≥ 3000, and L_r represents the r-th piece of miRNA-disease association data;
(2) constructing a miRNA-disease association relationship network Y:
with the M miRNAs r = {r_1, r_2, …, r_m, …, r_M} and the U diseases d = {d_1, d_2, …, d_u, …, d_U} as nodes, and the R pieces of miRNA-disease association data contained in L = {L_1, L_2, …, L_r, …, L_R} as edges, constructing a miRNA-disease association network Y;
(3) acquiring a sample Data set Data and a Label Data set Label:
labeling with 1 each of the R miRNA-disease node pairs P = {P_1, P_2, …, P_r, …, P_R} between which an edge exists in the miRNA-disease association network Y, and labeling with 0 each of the M×U−R miRNA-disease node pairs N = {N_1, N_2, …, N_s, …, N_(M×U−R)} between which no edge exists in Y, to obtain the label data set PLabel = {1_1, 1_2, …, 1_r, …, 1_R} corresponding to P and the label data set NLabel = {0_1, 0_2, …, 0_s, …, 0_(M×U−R)} corresponding to N; composing P and N into a sample data set Data = {SP_1, SP_2, …, SP_k, …, SP_(M×U)}, and composing PLabel and NLabel into a label data set Label = {LB_1, LB_2, …, LB_k, …, LB_(M×U)}, wherein P_r represents the r-th node pair with an existing edge, 1_r represents the label of P_r, N_s represents the s-th node pair without an edge, 0_s represents the label of N_s, SP_k denotes the k-th miRNA-disease node pair, and LB_k represents the label of SP_k;
(4) extracting the h-order closed subgraph of each miRNA-disease node pair SP_k:
(4a) for each miRNA-disease node pair SP_k contained in the sample data set Data, forming the h-order closed subgraph node set corresponding to SP_k from all nodes reachable in the miRNA-disease association network Y within h moving steps starting from its two nodes, thereby obtaining the h-order closed subgraph node set NodesSet^h corresponding to Data, wherein h ≥ 1;
(4b) with all nodes in the h-order closed subgraph node set corresponding to SP_k as nodes and the connecting edges among them as edges, excluding the connecting edge between the two nodes of the miRNA-disease node pair itself, forming the h-order closed subgraph corresponding to SP_k, thereby obtaining the h-order closed subgraph set SubgraphSet^h corresponding to NodesSet^h;
(5) obtaining the node feature matrix of each h-order closed subgraph:
(5a) using the double-radius node labeling method, labeling each node of each h-order closed subgraph in the h-order closed subgraph set SubgraphSet^h with an integer to form the node label vector of that subgraph, and combining the node label vectors of all subgraphs into a node label vector set NL^h;
(5b) encoding each element of each node label vector in the node label vector set NL^h into a one-hot encoded vector to constitute the corresponding node feature matrix, thereby obtaining the h-order closed subgraph node feature matrix set NFSet corresponding to NL^h, wherein the number of rows of each node feature matrix equals the dimension of its node label vector, and the number of columns equals the value of the maximum node label in NL^h plus 1;
(6) obtaining a training sample set TD, a training sample label data set TLabel and a predicted sample set BPD:
(6a) composing the sequence numbers of the positions in Label of the labels with value 1 selected from the label data set Label into a position sequence number set Pidx = {Pidx_1, Pidx_2, …, Pidx_r, …, Pidx_R}; composing R labels with value 0 randomly selected from Label, and the sequence numbers of their positions in Label, into a negative sample label set NLB = {NLB_1, NLB_2, …, NLB_r, …, NLB_R} and a position sequence number set Nidx = {Nidx_1, Nidx_2, …, Nidx_r, …, Nidx_R}, respectively; and then composing the sequence numbers of the positions in Label of the M×U−R labels with value 0 into a position sequence number set ANidx = {ANidx_1, ANidx_2, …, ANidx_s, …, ANidx_(M×U−R)}, wherein Pidx_r denotes the sequence number of the position in Label of the r-th label with value 1, NLB_r denotes the r-th negative sample label, and Nidx_r denotes the sequence number of the position in Label of the r-th selected label with value 0;
(6b) selecting, according to the position sequence number set Pidx, R h-order closed subgraphs from the h-order closed subgraph set SubgraphSet^h to form the h-order closed subgraph set of positive samples PSGS^h; selecting, according to the position sequence number set Nidx, R h-order closed subgraphs from SubgraphSet^h to form the h-order closed subgraph set of negative samples NSGS^h; and then selecting, according to the position sequence number set ANidx, M×U−R h-order closed subgraphs from SubgraphSet^h to form the h-order closed subgraph set ANSGS^h of the samples to be predicted, wherein the r-th element of PSGS^h is the h-order closed subgraph of the r-th positive sample, the r-th element of NSGS^h is the h-order closed subgraph of the r-th negative sample, and the s-th element of ANSGS^h is the h-order closed subgraph of the s-th sample to be predicted;
(6c) selecting, according to the position sequence number set Pidx, R h-order closed subgraph node feature matrices from the h-order closed subgraph node feature matrix set NFSet to form the h-order closed subgraph node feature matrix set of positive samples PNFS^h; selecting, according to the position sequence number set Nidx, R node feature matrices from NFSet to form the h-order closed subgraph node feature matrix set of negative samples NNFS^h; and selecting, according to the position sequence number set ANidx, M×U−R node feature matrices from NFSet to form the h-order closed subgraph node feature matrix set ANNFS^h of the samples to be predicted, wherein the r-th element of PNFS^h is the node feature matrix of the r-th positive sample, the r-th element of NNFS^h is that of the r-th negative sample, and the s-th element of ANNFS^h is that of the s-th sample to be predicted;
(6d) merging the h-order closed subgraph set of positive samples PSGS^h and the h-order closed subgraph set of negative samples NSGS^h into the h-order closed subgraph set TSGS^h of the training sample set; merging the h-order closed subgraph node feature matrix set of positive samples PNFS^h and that of negative samples NNFS^h into the h-order closed subgraph node feature matrix set TNFS^h of the training sample set; and merging the positive sample label data set PLabel and the negative sample label data set NLB into the training sample label data set TLabel = {TL_1, TL_2, …, TL_t, …, TL_(2×R)}, wherein TSG_t^h represents the t-th h-order closed subgraph of the training sample set, TNF_t^h represents the node feature matrix of TSG_t^h, and TL_t represents the label of TSG_t^h;
(6e) composing each h-order closed subgraph TSG_t^h in the h-order closed subgraph set TSGS^h of the training sample set and its corresponding node feature matrix TNF_t^h in the h-order closed subgraph node feature matrix set TNFS^h of the training sample set into a binary tuple (TSG_t^h, TNF_t^h) to obtain the training sample set TD; meanwhile, composing each h-order closed subgraph in the h-order closed subgraph set ANSGS^h of the samples to be predicted and its corresponding node feature matrix in ANNFS^h into a binary tuple to obtain the set BPD of samples to be predicted, wherein the t-th element of TD is the t-th training sample and the s-th element of BPD is the s-th sample to be predicted;
(7) building a graph neural network GNN:
constructing a graph neural network GNN comprising a graph convolution module GCM, a graph pooling layer GPY, a one-dimensional convolution module CNNM_1 and a fully connected layer FC connected in sequence, wherein the weight parameters of the GNN are θ_GNN and the loss function Loss of the GNN is NLLLoss; the GCM comprises I sequentially connected graph convolution layers GCN_1, GCN_2, …, GCN_i, …, GCN_I, I ≥ 2, and CNNM_1 comprises a first one-dimensional convolution layer, a maximum pooling layer MP and a second one-dimensional convolution layer connected in sequence;
(8) iteratively training the graph neural network GNN:
(8a) letting the current iteration number be e and the maximum iteration number be E, E ≥ 20, the weight parameters of the graph neural network GNN at the e-th iteration being θ_GNN^e, and initializing e = 0;
(8b) taking the training sample set TD as the input of the graph neural network GNN: the graph convolution module GCM performs multi-layer graph convolution on the h-order closed subgraph TSG_t^h and the node feature matrix TNF_t^h of each training sample; the graph pooling layer GPY splices the embedding feature vectors of the K nodes with the largest values in the last column of the node embedding feature matrix Z = [Z_1, Z_2, …, Z_i, …, Z_I] obtained by the multi-layer graph convolution of the GCM; the one-dimensional convolution module CNNM_1 performs feature learning on the vector representation V spliced by GPY; and the fully connected layer FC classifies the feature vector V_rf learned by CNNM_1 to obtain the prediction score vector RS = {RS_1, RS_2, …, RS_t, …, RS_(2×R)} of the GNN, wherein K ≥ 10 and RS_t represents the prediction score of the t-th training sample; for each training sample, the formula by which the graph convolution module GCM performs one layer of graph convolution on TSG_t^h and TNF_t^h is:
Z_i = f(D^(−1) A Z_(i−1) W_i), with Z_0 = TNF_t^h and Z = [Z_1, Z_2, …, Z_I]
wherein Z_i represents the node embedding feature matrix output by the i-th graph convolution layer GCN_i after one layer of graph convolution, which is also the input of GCN_(i+1); A is the adjacency matrix of TSG_t^h; D is the degree matrix of TSG_t^h; W_i is the weight parameter of GCN_i to be trained; f(·) is a nonlinear activation function; and [·] denotes that the matrices are spliced by rows;
(8c) calculating the loss value Loss_e between TLabel and RS by means of the loss function Loss from the prediction score vector RS and the training sample label data set TLabel, then calculating the parameter gradients of the GNN through Loss_e by the back propagation method, and finally updating the weight parameters θ_GNN^e of the GNN through the parameter gradients by the gradient descent algorithm;
(8d) judging whether e ≥ E holds; if so, obtaining the trained miRNA-disease association prediction model GNN′; otherwise, letting e = e + 1 and returning to step (8b);
(9) obtaining the miRNA-disease association prediction results:
performing forward propagation with the set BPD of samples to be predicted as the input of the trained graph neural network GNN′, to obtain the prediction scores of the M×U−R miRNA-disease node pairs without edges in the sample data set Data.
2. The graph neural network-based miRNA-disease association prediction method according to claim 1, wherein the double-radius node labeling method in step (5a) labels each node of each h-order closed subgraph in the h-order closed subgraph set SubgraphSet^h with an integer, implemented as follows:
(5a1) labeling the miRNA node miR and the disease node dis of the central miRNA-disease node pair SP_k of the subgraph with the integer 1;
(5a2) judging, for each node x other than the nodes labeled in step (5a1), whether the shortest distance SD_m from x to the miRNA node miR or the shortest distance SD_d from x to the disease node dis is infinite; if so, labeling node x with the integer 0; otherwise, calculating the integer label of node x by the following formula:
f_l(x) = 1 + min(SD_m, SD_d) + (D/2)[(D/2) + (D%2) − 1]
wherein D = SD_m + SD_d, D/2 denotes integer division, and D%2 denotes the remainder.
3. The graph neural network-based miRNA-disease association prediction method according to claim 1, wherein each element of each node label vector in the node label vector set NL^h in step (5b) is encoded into a one-hot encoded vector as follows: each element y of each node label vector in NL^h is encoded into a one-hot encoded vector whose y-th dimension is 1 and whose remaining dimensions are 0, the dimension of the one-hot encoded vector being equal to the value of the maximum node label in NL^h plus 1.
4. The graph neural network-based miRNA-disease association prediction method according to claim 1, wherein the loss value Loss_e between TLabel and RS in step (8c), and the update of the weight parameters θ_GNN^e of the GNN through the parameter gradients of the graph neural network GNN, are calculated by the following formulas, respectively:
Loss_e = −(1/(2×R)) Σ_t log(softmax(RS_t)[TL_t]), t = 1, 2, …, 2×R
θ_GNN^(e+1) = θ_GNN^e − α_GNN · ∇_(θ_GNN^e) Loss_e
wherein RS_t denotes the t-th prediction score, TL_t denotes the sample label corresponding to RS_t, softmax(·) denotes the normalized exponential function, log(·) denotes the logarithmic function, θ_GNN^(e+1) denotes the updated weight parameters of the GNN, θ_GNN^e denotes the weight parameters before updating, α_GNN denotes the learning step size of the GNN, and ∇_(θ_GNN^e) Loss_e denotes the parameter gradient of the GNN.
Background
miRNAs are a class of non-coding single-stranded RNA molecules consisting of 20-25 nucleotides. They participate widely in important biological processes such as cell division, differentiation, apoptosis, cell cycle regulation, inflammation and stress response, and act through feedback mechanisms. Deregulation of miRNAs (including dysregulated expression, gain- or loss-of-function mutations, epigenetic silencing, etc.) often results in abnormal miRNA levels in the human body and thus in the development of many diseases. Therefore, identifying disease-associated miRNAs can improve human understanding of complex diseases.
Searching for potential associations manually through biological experiments is highly accurate, but the process is complicated, the time period is long, and the cost is high, so it is inefficient to verify all miRNA-disease associations by biological experiments alone. An efficient and accurate computational method built on the known miRNA-disease associations can therefore provide guidance for biological experiments and discover miRNA-disease associations more efficiently.
For example, Chen et al., in "Predicting microRNA-disease associations using bipartite local models and hubness-aware regression" (RNA Biology, 2018), disclose a miRNA-disease association prediction method, BLHARMDA. On the basis of the conventional Gaussian interaction profile kernel similarity computed for miRNAs and diseases, the method calculates the Jaccard similarity of miRNAs and of diseases from the known miRNA-disease associations, and then splices the miRNA Jaccard similarity matrix onto the original miRNA similarity matrix, so that the miRNA similarity matrix grows from nm × nm to nm × 2nm; likewise, the disease Jaccard similarity matrix is spliced onto the right side of the disease similarity matrix. BLHARMDA obtains the probability score of an edge between each miRNA and each disease from the miRNA view and the disease view, respectively, through a similarity-matrix-based calculation, and then performs error-corrected k-nearest-neighbor regression on the predicted scores to obtain the final prediction scores.
Similarly, Kai et al., in "Predicting MiRNA-Disease Association by Latent Feature Extraction with Positive Samples" (Genes, 2019), disclose a miRNA-disease association prediction method, LFEMDA. LFEMDA holds that the commonly used miRNA similarity matrix is itself derived from miRNA-disease associations, so reusing it to predict miRNA-disease associations is unreasonable. LFEMDA instead calculates the edit distance between miRNA sequences and takes 1 minus the edit distance as the similarity score between miRNAs, then uses this newly proposed miRNA similarity data for prediction. The idea of LFEMDA is to solve the prediction problem by matrix factorization: each miRNA and each disease is given an initial projection vector in a fixed k-dimensional space, and their inner product represents the miRNA-disease association; several regularization terms then link the miRNA-disease associations with the similarity data, and the prediction scores are obtained by matrix factorization.
However, the miRNA functional similarity and disease semantic similarity networks routinely used in these prediction methods are incomplete and inaccurate, and the methods all rely on one hypothesis: similar miRNAs are associated with the same diseases, and similar diseases are associated with the same miRNAs. This biases the model with prior knowledge from the start, and the incompleteness and inaccuracy of the similarity networks make the hypothesis even less reasonable. Incomplete data and prior knowledge obtained through human experience inevitably make the model's predictions inaccurate.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a miRNA-disease association relation prediction method based on a graph neural network, which is used for solving the technical problem of low prediction accuracy in the prior art.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) obtaining miRNA-disease association relation data L:
downloading, from a miRNA-disease association database, M miRNAs r = {r_1, r_2, …, r_m, …, r_M}, U diseases d = {d_1, d_2, …, d_u, …, d_U} associated with them, and R pieces of miRNA-disease association data L = {L_1, L_2, …, L_r, …, L_R}, wherein each miRNA r_m is associated with at least one disease and each disease d_u is associated with at least one miRNA, M ≥ 100, r_m represents the m-th miRNA, U ≥ 100, d_u represents the u-th disease, R ≥ 3000, and L_r represents the r-th piece of miRNA-disease association data;
(2) constructing a miRNA-disease association relationship network Y:
with the M miRNAs r = {r_1, r_2, …, r_m, …, r_M} and the U diseases d = {d_1, d_2, …, d_u, …, d_U} as nodes, and the R pieces of miRNA-disease association data contained in L = {L_1, L_2, …, L_r, …, L_R} as edges, constructing a miRNA-disease association network Y;
(3) acquiring a sample Data set Data and a Label Data set Label:
labeling with 1 each of the R miRNA-disease node pairs P = {P_1, P_2, …, P_r, …, P_R} between which an edge exists in the miRNA-disease association network Y, and labeling with 0 each of the M×U−R miRNA-disease node pairs N = {N_1, N_2, …, N_s, …, N_(M×U−R)} between which no edge exists in Y, to obtain the label data set PLabel = {1_1, 1_2, …, 1_r, …, 1_R} corresponding to P and the label data set NLabel = {0_1, 0_2, …, 0_s, …, 0_(M×U−R)} corresponding to N; composing P and N into a sample data set Data = {SP_1, SP_2, …, SP_k, …, SP_(M×U)}, and composing PLabel and NLabel into a label data set Label = {LB_1, LB_2, …, LB_k, …, LB_(M×U)}, wherein P_r represents the r-th node pair with an existing edge, 1_r represents the label of P_r, N_s represents the s-th node pair without an edge, 0_s represents the label of N_s, SP_k denotes the k-th miRNA-disease node pair, and LB_k represents the label of SP_k;
(4) extracting the h-order closed subgraph of each miRNA-disease node pair SP_k:
(4a) for each miRNA-disease node pair SP_k contained in the sample data set Data, forming the h-order closed subgraph node set corresponding to SP_k from all nodes reachable in the miRNA-disease association network Y within h moving steps starting from its two nodes, thereby obtaining the h-order closed subgraph node set NodesSet^h corresponding to Data, wherein h ≥ 1;
(4b) with all nodes in the h-order closed subgraph node set corresponding to SP_k as nodes and the connecting edges among them as edges, excluding the connecting edge between the two nodes of the miRNA-disease node pair itself, forming the h-order closed subgraph corresponding to SP_k, thereby obtaining the h-order closed subgraph set SubgraphSet^h corresponding to NodesSet^h;
(5) obtaining the node feature matrix of each h-order closed subgraph:
(5a) using the double-radius node labeling method, labeling each node of each h-order closed subgraph in the h-order closed subgraph set SubgraphSet^h with an integer to form the node label vector of that subgraph, and combining the node label vectors of all subgraphs into a node label vector set NL^h;
(5b) encoding each element of each node label vector in the node label vector set NL^h into a one-hot encoded vector to constitute the corresponding node feature matrix, thereby obtaining the h-order closed subgraph node feature matrix set NFSet corresponding to NL^h, wherein the number of rows of each node feature matrix equals the dimension of its node label vector, and the number of columns equals the value of the maximum node label in NL^h plus 1;
(6) obtaining a training sample set TD, a training sample label data set TLabel and a predicted sample set BPD:
(6a) composing the sequence numbers of the positions in Label of the labels with value 1 selected from the label data set Label into a position sequence number set Pidx = {Pidx_1, Pidx_2, …, Pidx_r, …, Pidx_R}; composing R labels with value 0 randomly selected from Label, and the sequence numbers of their positions in Label, into a negative sample label set NLB = {NLB_1, NLB_2, …, NLB_r, …, NLB_R} and a position sequence number set Nidx = {Nidx_1, Nidx_2, …, Nidx_r, …, Nidx_R}, respectively; and then composing the sequence numbers of the positions in Label of the M×U−R labels with value 0 into a position sequence number set ANidx = {ANidx_1, ANidx_2, …, ANidx_s, …, ANidx_(M×U−R)}, wherein Pidx_r denotes the sequence number of the position in Label of the r-th label with value 1, NLB_r denotes the r-th negative sample label, and Nidx_r denotes the sequence number of the position in Label of the r-th selected label with value 0;
(6b) selecting, according to the position sequence number set Pidx, R h-order closed subgraphs from the h-order closed subgraph set SubgraphSet^h to form the h-order closed subgraph set of positive samples PSGS^h; selecting, according to the position sequence number set Nidx, R h-order closed subgraphs from SubgraphSet^h to form the h-order closed subgraph set of negative samples NSGS^h; and then selecting, according to the position sequence number set ANidx, M×U−R h-order closed subgraphs from SubgraphSet^h to form the h-order closed subgraph set ANSGS^h of the samples to be predicted, wherein the r-th element of PSGS^h is the h-order closed subgraph of the r-th positive sample, the r-th element of NSGS^h is the h-order closed subgraph of the r-th negative sample, and the s-th element of ANSGS^h is the h-order closed subgraph of the s-th sample to be predicted;
(6c) selecting, according to the position sequence number set Pidx, R h-order closed subgraph node feature matrices from the h-order closed subgraph node feature matrix set NFSet to form the h-order closed subgraph node feature matrix set of positive samples PNFS^h; selecting, according to the position sequence number set Nidx, R node feature matrices from NFSet to form the h-order closed subgraph node feature matrix set of negative samples NNFS^h; and selecting, according to the position sequence number set ANidx, M×U−R node feature matrices from NFSet to form the h-order closed subgraph node feature matrix set ANNFS^h of the samples to be predicted, wherein the r-th element of PNFS^h is the node feature matrix of the r-th positive sample, the r-th element of NNFS^h is that of the r-th negative sample, and the s-th element of ANNFS^h is that of the s-th sample to be predicted;
(6d) merging the h-order closed subgraph set of positive samples PSGS^h and the h-order closed subgraph set of negative samples NSGS^h into the h-order closed subgraph set TSGS^h of the training sample set; merging the h-order closed subgraph node feature matrix set of positive samples PNFS^h and that of negative samples NNFS^h into the h-order closed subgraph node feature matrix set TNFS^h of the training sample set; and merging the positive sample label data set PLabel and the negative sample label data set NLB into the training sample label data set TLabel = {TL_1, TL_2, …, TL_t, …, TL_(2×R)}, wherein TSG_t^h represents the t-th h-order closed subgraph of the training sample set, TNF_t^h represents the node feature matrix of TSG_t^h, and TL_t represents the label of TSG_t^h;
(6e) composing each h-order closed subgraph TSG_t^h in the h-order closed subgraph set TSGS^h of the training sample set and its corresponding node feature matrix TNF_t^h in the h-order closed subgraph node feature matrix set TNFS^h of the training sample set into a binary tuple (TSG_t^h, TNF_t^h) to obtain the training sample set TD; meanwhile, composing each h-order closed subgraph in the h-order closed subgraph set ANSGS^h of the samples to be predicted and its corresponding node feature matrix in ANNFS^h into a binary tuple to obtain the set BPD of samples to be predicted, wherein the t-th element of TD is the t-th training sample and the s-th element of BPD is the s-th sample to be predicted;
(7) building a graph neural network GNN:
constructing a graph neural network GNN comprising a graph convolution module GCM, a graph pooling layer GPY, a one-dimensional convolution module CNNM_1 and a fully connected layer FC connected in sequence, wherein the weight parameters of the GNN are θ_GNN and the loss function Loss of the GNN is NLLLoss; the GCM comprises I sequentially connected graph convolution layers GCN_1, GCN_2, …, GCN_i, …, GCN_I, I ≥ 2, and CNNM_1 comprises a first one-dimensional convolution layer, a maximum pooling layer MP and a second one-dimensional convolution layer connected in sequence;
(8) iteratively training the graph neural network GNN:
(8a) letting the current iteration number be e and the maximum iteration number be E, E ≥ 20, the weight parameters of the graph neural network GNN at the e-th iteration being θ_GNN^e, and initializing e = 0;
(8b) taking the training sample set TD as the input of the graph neural network GNN: the graph convolution module GCM performs multi-layer graph convolution on the h-order closed subgraph TSG_t^h and the node feature matrix TNF_t^h of each training sample; the graph pooling layer GPY splices the embedding feature vectors of the K nodes with the largest values in the last column of the node embedding feature matrix Z = [Z_1, Z_2, …, Z_i, …, Z_I] obtained by the multi-layer graph convolution of the GCM; the one-dimensional convolution module CNNM_1 performs feature learning on the vector representation V spliced by GPY; and the fully connected layer FC classifies the feature vector V_rf learned by CNNM_1 to obtain the prediction score vector RS = {RS_1, RS_2, …, RS_t, …, RS_(2×R)} of the GNN, wherein K ≥ 10 and RS_t represents the prediction score of the t-th training sample; for each training sample, the formula by which the graph convolution module GCM performs one layer of graph convolution on TSG_t^h and TNF_t^h is:
Z_i = f(D^(−1) A Z_(i−1) W_i), with Z_0 = TNF_t^h and Z = [Z_1, Z_2, …, Z_I]
wherein Z_i represents the node embedding feature matrix output by the i-th graph convolution layer GCN_i after one layer of graph convolution, which is also the input of GCN_(i+1); A is the adjacency matrix of TSG_t^h; D is the degree matrix of TSG_t^h; W_i is the weight parameter of GCN_i to be trained; f(·) is a nonlinear activation function; and [·] denotes that the matrices are spliced by rows;
(8c) calculating the loss value Loss_e between TLabel and RS by means of the loss function Loss from the prediction score vector RS and the training sample label data set TLabel, then calculating the parameter gradients of the GNN through Loss_e by the back propagation method, and finally updating the weight parameters θ_GNN^e of the GNN through the parameter gradients by the gradient descent algorithm;
(8d) judging whether e ≥ E holds; if so, obtaining the trained miRNA-disease association prediction model GNN′; otherwise, letting e = e + 1 and returning to step (8b);
(9) obtaining the miRNA-disease association prediction results:
performing forward propagation with the set BPD of samples to be predicted as the input of the trained graph neural network GNN′, to obtain the prediction scores of the M×U−R miRNA-disease node pairs without edges in the sample data set Data.
Compared with the prior art, the invention has the following advantages:
1. The graph neural network GNN constructed by the invention comprises a graph convolution module GCM, a graph pooling layer GPY, a one-dimensional convolution module CNNM_1 and a fully connected layer FC connected in sequence, which learn the graph structure features of the h-order closed subgraphs extracted for miRNA-disease node pairs. In the process of training the graph neural network GNN and obtaining the miRNA-disease association prediction results, the GCM can simultaneously fuse multiple kinds of information between miRNA nodes and disease nodes and fully learn the implicit graph topology information, which effectively improves the prediction accuracy of miRNA-disease association compared with the prior art.
2. The constructed graph neural network GNN takes the known miRNA-disease association information as supervision information to automatically learn the graph structure features of the h-order closed subgraphs extracted for miRNA-disease node pairs and to make predictions on miRNA-disease node pairs. This avoids the hypothesis relied on by prior models that functionally similar miRNAs are associated with similar diseases and vice versa, avoids the bias introduced into the model by prior knowledge obtained from human experience, and further improves the prediction accuracy of miRNA-disease association.
3. The graph neural network GNN constructed by the invention uses only miRNA-disease association data; compared with existing models, which usually also require miRNA functional similarity data and disease semantic similarity data, it needs less data preparation.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention will now be described in further detail with reference to the accompanying drawings and specific examples. It should be understood that the specific examples described herein are illustrative only and do not limit the scope of the invention.
referring to fig. 1, the present invention includes the steps of:
step 1) obtaining miRNA-disease association relation data L:
downloading, from the miRNA-disease association database HMDD v2.0, M miRNAs r = {r_1, r_2, …, r_m, …, r_M}, U diseases d = {d_1, d_2, …, d_u, …, d_U} associated with them, and R pieces of miRNA-disease association data L = {L_1, L_2, …, L_r, …, L_R}, wherein each miRNA r_m is associated with at least one disease and each disease d_u is associated with at least one miRNA, M ≥ 100, r_m represents the m-th miRNA, U ≥ 100, d_u represents the u-th disease, R ≥ 1000, and L_r represents the r-th piece of miRNA-disease association data; in this example, M = 495, U = 383, R = 5430.
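For concreteness, association data of this form can be loaded as index sets. The following is a minimal sketch, assuming the HMDD v2.0 pairs have been exported to a two-column CSV file; the file name and format here are assumptions, not part of the invention:

```python
# Minimal sketch: load miRNA-disease association pairs into index sets.
# Assumes a two-column file "hmdd_v2_associations.csv" (miRNA name,
# disease name) -- an assumed export, not a file shipped with HMDD.
import csv

mirnas, diseases, edges = [], [], []
mirna_idx, disease_idx = {}, {}

with open("hmdd_v2_associations.csv", newline="") as f:
    for mir, dis in csv.reader(f):
        if mir not in mirna_idx:
            mirna_idx[mir] = len(mirnas)
            mirnas.append(mir)
        if dis not in disease_idx:
            disease_idx[dis] = len(diseases)
            diseases.append(dis)
        edges.append((mirna_idx[mir], disease_idx[dis]))

M, U, R = len(mirnas), len(diseases), len(edges)
print(M, U, R)  # expected 495, 383, 5430 for HMDD v2.0
```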
step 2) constructing a miRNA-disease association relationship network Y:
with the M miRNAs r = {r_1, r_2, …, r_m, …, r_M} and the U diseases d = {d_1, d_2, …, d_u, …, d_U} as nodes, and the R pieces of miRNA-disease association data contained in L = {L_1, L_2, …, L_r, …, L_R} as edges, constructing a miRNA-disease association network Y;
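A sketch of this construction using the networkx library, continuing from the `edges` list above (the tuple-based node naming is an illustrative choice):

```python
# Sketch of step 2): the association network Y as a bipartite graph.
import networkx as nx

Y = nx.Graph()
Y.add_nodes_from((("r", m) for m in range(M)), bipartite=0)  # miRNA nodes
Y.add_nodes_from((("d", u) for u in range(U)), bipartite=1)  # disease nodes
Y.add_edges_from((("r", m), ("d", u)) for m, u in edges)
```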
step 3) obtaining a sample Data set Data and a Label Data set Label:
labeling with 1 each of the R miRNA-disease node pairs P = {P_1, P_2, …, P_r, …, P_R} between which an edge exists in the miRNA-disease association network Y, and labeling with 0 each of the M×U−R miRNA-disease node pairs N = {N_1, N_2, …, N_s, …, N_(M×U−R)} between which no edge exists in Y, to obtain the label data set PLabel = {1_1, 1_2, …, 1_r, …, 1_R} corresponding to P and the label data set NLabel = {0_1, 0_2, …, 0_s, …, 0_(M×U−R)} corresponding to N; composing P and N into a sample data set Data = {SP_1, SP_2, …, SP_k, …, SP_(M×U)}, and composing PLabel and NLabel into a label data set Label = {LB_1, LB_2, …, LB_k, …, LB_(M×U)}, wherein P_r represents the r-th node pair with an existing edge, 1_r represents the label of P_r, N_s represents the s-th node pair without an edge, 0_s represents the label of N_s, SP_k denotes the k-th miRNA-disease node pair, and LB_k represents the label of SP_k;
step 4) extracting the h-order closed subgraph of each miRNA-disease node pair SP_k:
step 4a) for each miRNA-disease node pair SP_k contained in the sample data set Data, forming the h-order closed subgraph node set corresponding to SP_k from all nodes reachable in the miRNA-disease association network Y within h moving steps starting from its two nodes, thereby obtaining the h-order closed subgraph node set NodesSet^h corresponding to Data, wherein h ≥ 1; in this example, h = 4.
step 4b) with all nodes in the h-order closed subgraph node set corresponding to SP_k as nodes and the connecting edges among them as edges, excluding the connecting edge between the two nodes of the miRNA-disease central node pair, forming the h-order closed subgraph corresponding to SP_k, thereby obtaining the h-order closed subgraph set SubgraphSet^h corresponding to NodesSet^h;
An h-order closed subgraph is the subgraph formed by the nodes reached by moving h steps in the network from the two nodes of a given node pair; in general, the larger h is, the larger the scale of the subgraph.
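A minimal sketch of steps 4a)-4b) on the networkx graph Y built above (the function name is illustrative):

```python
# Sketch of steps 4a)-4b): the h-order closed subgraph of a node pair
# (miR, dis) -- all nodes reachable within h moving steps from either
# endpoint, with the edge between the pair itself excluded.
import networkx as nx

def extract_closed_subgraph(Y, miR, dis, h=1):
    nodes = set()
    for src in (miR, dis):
        # {node: shortest-path length} for all nodes within h steps of src
        nodes |= set(nx.single_source_shortest_path_length(Y, src, cutoff=h))
    sub = Y.subgraph(nodes).copy()
    if sub.has_edge(miR, dis):  # exclude the target edge itself
        sub.remove_edge(miR, dis)
    return sub
```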
step 5) obtaining the node feature matrix of each h-order closed subgraph:
step 5a) using the double-radius node labeling method, labeling each node of each h-order closed subgraph in the h-order closed subgraph set SubgraphSet^h with an integer to form the node label vector of that subgraph, and combining the node label vectors of all subgraphs into a node label vector set NL^h;
In order to mark the different roles of the nodes in a closed subgraph, the double-radius node labeling method assigns each node of the h-order closed subgraph a label according to its position relative to the two central nodes of the pair.
The double-radius node labeling method labels each node of each h-order closed subgraph in SubgraphSet^h with an integer through the following steps:
step 5a1) labeling the miRNA node miR and the disease node dis of the central miRNA-disease node pair SP_k of the subgraph with the integer 1;
step 5a2) judging, for each node x other than the nodes labeled in step (5a1), whether the shortest distance SD_m from x to the miRNA node miR or the shortest distance SD_d from x to the disease node dis is infinite; if so, labeling node x with the integer 0; otherwise, calculating the integer label of node x by the following formula:
f_l(x) = 1 + min(SD_m, SD_d) + (D/2)[(D/2) + (D%2) − 1]
wherein D = SD_m + SD_d, D/2 denotes integer division, and D%2 denotes the remainder.
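As a worked example of this formula, a node with SD_m = 1 and SD_d = 2 has D = 3, so f_l(x) = 1 + 1 + 1·(1 + 1 − 1) = 3. A sketch of the labeling routine, as a straightforward reading of steps 5a1)-5a2) rather than the patented code:

```python
# Sketch of steps 5a1)-5a2): double-radius node labeling of one subgraph.
import networkx as nx

def drnl_labels(sub, miR, dis):
    dm = nx.single_source_shortest_path_length(sub, miR)
    dd = nx.single_source_shortest_path_length(sub, dis)
    labels = {}
    for x in sub.nodes:
        if x == miR or x == dis:
            labels[x] = 1               # step 5a1): the two centers get 1
        elif x not in dm or x not in dd:
            labels[x] = 0               # infinite distance: label 0
        else:
            sd_m, sd_d = dm[x], dd[x]
            D = sd_m + sd_d
            # f_l(x) = 1 + min(SD_m, SD_d) + (D/2)[(D/2) + (D%2) - 1]
            labels[x] = (1 + min(sd_m, sd_d)
                         + (D // 2) * ((D // 2) + D % 2 - 1))
    return labels
```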
step 5b) encoding each element of each node label vector in the node label vector set NL^h into a one-hot encoded vector to constitute the corresponding node feature matrix, thereby obtaining the h-order closed subgraph node feature matrix set NFSet corresponding to NL^h, wherein the number of rows of each node feature matrix equals the dimension of its node label vector, and the number of columns vdim equals the value of the maximum node label in NL^h plus 1;
Each element of each node label vector in the node label vector set NL^h is encoded into a one-hot encoded vector as follows: each element y of each node label vector in NL^h is encoded into a one-hot encoded vector whose y-th dimension is 1 and whose remaining dimensions are 0, the dimension vdim of the one-hot encoded vector being equal to the value of the maximum node label in NL^h plus 1.
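A sketch of the one-hot encoding of step 5b):

```python
# Sketch of step 5b): one-hot encode a node label vector into the node
# feature matrix; vdim is the maximum label over all subgraphs plus 1.
import numpy as np

def one_hot_features(label_vector, vdim):
    X = np.zeros((len(label_vector), vdim), dtype=np.float32)
    X[np.arange(len(label_vector)), label_vector] = 1.0
    return X
```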
Step 6), obtaining a training sample set TD, a training sample label data set TLabel and a predicted sample set BPD:
step 6a) composing the sequence numbers of the positions in Label of the labels with value 1 selected from the label data set Label into a position sequence number set Pidx = {Pidx_1, Pidx_2, …, Pidx_r, …, Pidx_R}; composing R labels with value 0 randomly selected from Label, and the sequence numbers of their positions in Label, into a negative sample label set NLB = {NLB_1, NLB_2, …, NLB_r, …, NLB_R} and a position sequence number set Nidx = {Nidx_1, Nidx_2, …, Nidx_r, …, Nidx_R}, respectively; and then composing the sequence numbers of the positions in Label of the M×U−R labels with value 0 into a position sequence number set ANidx = {ANidx_1, ANidx_2, …, ANidx_s, …, ANidx_(M×U−R)}, wherein Pidx_r denotes the sequence number of the position in Label of the r-th label with value 1, NLB_r denotes the r-th negative sample label, and Nidx_r denotes the sequence number of the position in Label of the r-th selected label with value 0;
step 6b) selecting, according to the position sequence number set Pidx, R h-order closed subgraphs from the h-order closed subgraph set SubgraphSet^h to form the h-order closed subgraph set of positive samples PSGS^h; selecting, according to the position sequence number set Nidx, R h-order closed subgraphs from SubgraphSet^h to form the h-order closed subgraph set of negative samples NSGS^h; and then selecting, according to the position sequence number set ANidx, M×U−R h-order closed subgraphs from SubgraphSet^h to form the h-order closed subgraph set ANSGS^h of the samples to be predicted, wherein the r-th element of PSGS^h is the h-order closed subgraph of the r-th positive sample, the r-th element of NSGS^h is the h-order closed subgraph of the r-th negative sample, and the s-th element of ANSGS^h is the h-order closed subgraph of the s-th sample to be predicted;
step 6c) selecting, according to the position sequence number set Pidx, R h-order closed subgraph node feature matrices from the h-order closed subgraph node feature matrix set NFSet to form the h-order closed subgraph node feature matrix set of positive samples PNFS^h; selecting, according to the position sequence number set Nidx, R node feature matrices from NFSet to form the h-order closed subgraph node feature matrix set of negative samples NNFS^h; and selecting, according to the position sequence number set ANidx, M×U−R node feature matrices from NFSet to form the h-order closed subgraph node feature matrix set ANNFS^h of the samples to be predicted, wherein the r-th element of PNFS^h is the node feature matrix of the r-th positive sample, the r-th element of NNFS^h is that of the r-th negative sample, and the s-th element of ANNFS^h is that of the s-th sample to be predicted;
step 6d) merging the h-order closed subgraph set of positive samples PSGS^h and the h-order closed subgraph set of negative samples NSGS^h into the h-order closed subgraph set TSGS^h of the training sample set; merging the h-order closed subgraph node feature matrix set of positive samples PNFS^h and that of negative samples NNFS^h into the h-order closed subgraph node feature matrix set TNFS^h of the training sample set; and merging the positive sample label data set PLabel and the negative sample label data set NLB into the training sample label data set TLabel = {TL_1, TL_2, …, TL_t, …, TL_(2×R)}, wherein TSG_t^h represents the t-th h-order closed subgraph of the training sample set, TNF_t^h represents the node feature matrix of TSG_t^h, and TL_t represents the label of TSG_t^h;
step 6e) composing each h-order closed subgraph TSG_t^h in the h-order closed subgraph set TSGS^h of the training sample set and its corresponding node feature matrix TNF_t^h in the h-order closed subgraph node feature matrix set TNFS^h of the training sample set into a binary tuple (TSG_t^h, TNF_t^h) to obtain the training sample set TD; meanwhile, composing each h-order closed subgraph in the h-order closed subgraph set ANSGS^h of the samples to be predicted and its corresponding node feature matrix in ANNFS^h into a binary tuple to obtain the set BPD of samples to be predicted,
wherein the t-th element of TD is the t-th training sample and the s-th element of BPD is the s-th sample to be predicted;
step 7) building a graph neural network GNN:
constructing a graph neural network GNN comprising a graph convolution module GCM, a graph pooling layer GPY, a one-dimensional convolution module CNNM_1 and a fully connected layer FC connected in sequence, wherein the weight parameters of the GNN are θ_GNN and the loss function Loss of the GNN is NLLLoss; the GCM comprises I sequentially connected graph convolution layers GCN_1, GCN_2, …, GCN_i, …, GCN_I, I ≥ 2 (in this example, I = 4), and CNNM_1 comprises a first one-dimensional convolution layer, a maximum pooling layer MP and a second one-dimensional convolution layer connected in sequence;
Step 8) carrying out iterative training on the graph neural network GNN:
step 8a) letting the current iteration number be e and the maximum iteration number be E, E ≥ 20, the weight parameters of the graph neural network GNN at the e-th iteration being θ_GNN^e, and initializing e = 0; in this example, E = 50;
step 8b) taking the training sample set TD as the input of the graph neural network GNN: the graph convolution module GCM performs multi-layer graph convolution on the h-order closed subgraph TSG_t^h and the node feature matrix TNF_t^h of each training sample; the graph pooling layer GPY splices the embedding feature vectors of the K nodes with the largest values in the last column of the node embedding feature matrix Z = [Z_1, Z_2, …, Z_i, …, Z_I] obtained by the multi-layer graph convolution of the GCM; the one-dimensional convolution module CNNM_1 performs feature learning on the vector representation V spliced by GPY; and the fully connected layer FC classifies the feature vector V_rf learned by CNNM_1 to obtain the prediction score vector RS = {RS_1, RS_2, …, RS_t, …, RS_(2×R)} of the GNN, wherein K ≥ 10 and RS_t represents the prediction score of the t-th training sample; for each training sample, the formula by which the graph convolution module GCM performs one layer of graph convolution on TSG_t^h and TNF_t^h is:
Z_i = f(D^(−1) A Z_(i−1) W_i), with Z_0 = TNF_t^h and Z = [Z_1, Z_2, …, Z_I]
wherein Z_i represents the node embedding feature matrix output by the i-th graph convolution layer GCN_i after one layer of graph convolution, which is also the input of GCN_(i+1); A is the adjacency matrix of TSG_t^h; D is the degree matrix of TSG_t^h; W_i is the weight parameter of GCN_i to be trained; f(·) is a nonlinear activation function; and [·] denotes that the matrices are spliced by rows;
Each graph convolution layer GCN_i in the graph convolution module GCM uses a plurality of commonly used propagation functions, and the invention further proposes a propagation function better suited to bipartite graph networks; these multiple propagation functions enable the GCM to fuse multiple kinds of information between miRNA nodes and disease nodes and to fully learn the graph topology information implied in the association network.
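The multi-propagation and bipartite-specific rules are not spelled out in this text; the sketch below shows only the common degree-normalized form Z_i = f(D^(−1) A Z_(i−1) W_i) assumed in the formula above, with f taken as tanh (an assumption):

```python
# Sketch of a single propagation layer in the common degree-normalized
# form; the invention's bipartite-specific variant is not reproduced.
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # weight W_i

    def forward(self, A, Z):
        # A: (n, n) adjacency with self-loops; Z: (n, in_dim) node features
        deg = A.sum(dim=1, keepdim=True).clamp(min=1.0)  # diagonal of D
        return torch.tanh(self.W((A / deg) @ Z))         # f(D^-1 A Z W)
```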
The graph convolution module GCM comprises 4 sequentially connected graph convolution layers GCN_1, GCN_2, GCN_3, GCN_4. The input dimension of the weight parameter W_1 of the fully connected layer in GCN_1 is 4 × vdim and its output dimension is 32; the number of rows of the output Z_1 of GCN_1 equals the number of nodes of the subgraph and its number of columns equals 32. The input dimension of the weight parameter W_2 of the fully connected layer in GCN_2 is 128 and its output dimension is 32; the number of rows of the output Z_2 of GCN_2 equals the number of nodes and its number of columns equals 32. The input dimension of the weight parameter W_3 of the fully connected layer in GCN_3 is 128 and its output dimension is 32; the number of rows of the output Z_3 of GCN_3 equals the number of nodes and its number of columns equals 32. The input dimension of the weight parameter W_4 of the fully connected layer in GCN_4 is 128 and its output dimension is 1; the number of rows of the output Z_4 of GCN_4 equals the number of nodes of TSG_t^h and its number of columns equals 1. The number of rows of the spliced output Z of the GCM equals the number of nodes of the subgraph, and its number of columns equals 97;
The graph pooling layer GPY splices the embedding feature vectors of the K nodes with the largest values in the last column of the node embedding feature matrix Z = [Z_1, Z_2, …, Z_i, …, Z_I] obtained by the multi-layer graph convolution of the GCM, obtaining the output vector representation V, wherein the dimension of V is 97 × K, and K is taken as the 60th percentile of the numbers of subgraph nodes in the h-order closed subgraph set SubgraphSet^h arranged in ascending order;
The one-dimensional convolution module CNNM_1 comprises a first one-dimensional convolution layer, a maximum pooling layer MP and a second one-dimensional convolution layer connected in sequence. The first one-dimensional convolution layer has 1 input channel, 16 output channels, a convolution kernel size equal to the number of columns of Z (i.e., 97) and a stride of 97, with input dimension 1 × (97·K) and output dimension 16 × K; the maximum pooling layer MP has a window size of 2 and a stride of 2, with input dimension 16 × K and output dimension 16 × (K/2); the second one-dimensional convolution layer has 16 input channels, 32 output channels, a convolution kernel size of 5 and a stride of 1, with input dimension 16 × (K/2) and output dimension 32 × (K/2 − 4);
The input dimension of the fully connected layer FC is 32 × (K/2 − 4) and its output dimension is 2;
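Wiring these stated dimensions together gives the following skeleton. It is a minimal reconstruction for illustration only: it reuses the single-propagation GCNLayer sketched above (so the layer input dimensions are vdim/32/32/32 rather than the 4 × vdim/128 of the multi-propagation variant), and zero-padding of subgraphs smaller than K is an assumption:

```python
# Skeleton of the GNN of step 7): four graph convolution layers with
# outputs 32/32/32/1 (so Z has 97 columns), SortPooling-style GPY
# keeping K rows, then Conv1d / MaxPool / Conv1d / FC.
class GNN(nn.Module):
    def __init__(self, vdim, K):
        super().__init__()
        self.K = K
        dims = [vdim, 32, 32, 32, 1]
        self.gcns = nn.ModuleList(
            GCNLayer(i, o) for i, o in zip(dims, dims[1:]))
        self.conv1 = nn.Conv1d(1, 16, kernel_size=97, stride=97)
        self.pool = nn.MaxPool1d(2, 2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=5, stride=1)
        self.fc = nn.Linear(32 * (K // 2 - 4), 2)

    def forward(self, A, X):
        Zs, Z = [], X
        for gcn in self.gcns:
            Z = gcn(A, Z)
            Zs.append(Z)
        Z = torch.cat(Zs, dim=1)                    # (n, 97)
        # GPY: keep the K nodes with the largest last-column values
        idx = Z[:, -1].topk(min(self.K, Z.size(0))).indices
        V = Z[idx]
        if V.size(0) < self.K:                      # pad small subgraphs
            V = torch.cat([V, V.new_zeros(self.K - V.size(0), 97)])
        V = V.reshape(1, 1, -1)                     # (1, 1, 97*K)
        V = torch.relu(self.conv1(V))               # (1, 16, K)
        V = self.pool(V)                            # (1, 16, K//2)
        V = torch.relu(self.conv2(V))               # (1, 32, K//2 - 4)
        return torch.log_softmax(self.fc(V.flatten(1)), dim=1)
```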
in the present embodiment, the training sample set TD is used as the input of the graph neural network GNN in batches, and the number of the training samples in each batch is 50;
step 8c) calculating the loss value Loss_e between TLabel and RS by the loss function Loss from the prediction score vector RS and the training sample label data set TLabel, then calculating the parameter gradients of the GNN through Loss_e by the back propagation method, and finally updating the weight parameters θ_GNN^e of the GNN through the parameter gradients by the gradient descent algorithm, the calculation and update formulas being respectively:
Loss_e = −(1/(2×R)) Σ_t log(softmax(RS_t)[TL_t]), t = 1, 2, …, 2×R
θ_GNN^(e+1) = θ_GNN^e − α_GNN · ∇_(θ_GNN^e) Loss_e
wherein RS_t denotes the t-th prediction score, TL_t denotes the sample label corresponding to RS_t, softmax(·) denotes the normalized exponential function, log(·) denotes the logarithmic function, θ_GNN^(e+1) denotes the updated weight parameters of the GNN, θ_GNN^e denotes the weight parameters before updating, α_GNN denotes the learning step size of the GNN, and ∇_(θ_GNN^e) Loss_e denotes the parameter gradient of the GNN;
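One training step of this loop might look as follows; a sketch in which the plain SGD optimizer and the names vdim, K, alpha_gnn and the (A, X, label) batch layout are assumptions:

```python
# Sketch of one training step of 8c), matching the NLLLoss and gradient
# descent update above (the model outputs log-probabilities).
model = GNN(vdim=vdim, K=K)
optimizer = torch.optim.SGD(model.parameters(), lr=alpha_gnn)
criterion = nn.NLLLoss()

def train_step(batch):
    optimizer.zero_grad()
    scores = torch.cat([model(A, X) for A, X, _ in batch])  # RS (log-probs)
    labels = torch.tensor([lb for _, _, lb in batch])       # TL_t
    loss = criterion(scores, labels)                        # Loss_e
    loss.backward()                                         # gradients
    optimizer.step()                                        # theta update
    return loss.item()
```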
step 8d) judging whether e ≥ E holds; if so, obtaining the trained miRNA-disease association prediction model GNN′; otherwise, letting e = e + 1 and returning to step (8b);
step 9) obtaining a prediction result of the miRNA-disease association relation:
performing forward propagation with the set BPD of samples to be predicted as the input of the trained graph neural network GNN′, to obtain the prediction scores of the M×U−R miRNA-disease node pairs without edges in the sample data set Data.
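Scoring the samples to be predicted then reduces to a forward pass; in this sketch, `bpd_samples` is an assumed name for BPD stored as (adjacency, node feature matrix) pairs:

```python
# Sketch of step 9): score every non-edge node pair with the trained GNN'.
model.eval()
with torch.no_grad():
    pred_scores = [model(A, X)[0, 1].exp().item()  # P(edge exists)
                   for A, X in bpd_samples]
```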
The technical effects of the invention are further explained by simulation experiments as follows:
1. simulation conditions and contents:
Simulation experiments were performed on an Intel(R) Core(TM) i7-8700K CPU with a main frequency of 3.70 GHz and 48 GB of memory, using Python 3.6.5 together with PyTorch on the Ubuntu platform.
Simulation 1: the prediction accuracy of the invention and of the prior art is compared by ten-fold cross validation, with the results shown in Table 1. Prior art 1 in Table 1 is the miRNA-disease association prediction method BLHARMDA disclosed by Chen et al. in "Predicting microRNA-disease associations using bipartite local models and hubness-aware regression" (RNA Biology, 2018); prior art 2 is the miRNA-disease association prediction method LFEMDA disclosed by Kai et al. in "Predicting MiRNA-Disease Association by Latent Feature Extraction with Positive Samples" (Genes, 2019). The invention uses the same miRNA-disease association data downloaded from HMDD v2.0 as prior art 1 and prior art 2; the miRNA functional similarity and disease semantic similarity additionally required by prior art 1 and prior art 2 both use the data provided in prior art 1.
Simulation 2: the prediction performance of the invention is simulated; the invention is used to predict all miRNA-disease associations, and the predictions are verified against the relevant databases.
2. Simulation result analysis:
For simulation 1, the evaluation indexes adopted for the prediction accuracy of miRNA-disease association are AUROC and AUPR. AUROC is the area under the receiver operating characteristic (ROC) curve and AUPR is the area under the precision-recall curve; both measure prediction accuracy, and the larger the value, the higher the accuracy.
The AUROC and AUPR values of the ten-fold cross validation of the invention compared with the two prior art methods are shown in Table 1.
TABLE 1 Comparison of the prediction accuracy of the prior art and the present invention

Method           AUROC      AUPR
Prior art 1      0.92838    0.92699
Prior art 2      0.90039    0.91289
The invention    0.93086    0.93247
As Table 1 shows, both the AUROC value and the AUPR value of the invention are higher than those of the prior art, which proves that the invention effectively improves the accuracy of miRNA-disease association prediction.
Simulation 2 results: following the above embodiment, the prediction scores of the associations between the 495 miRNAs and the 383 diseases are obtained and ranked, and the top 180 predictions are verified against the three miRNA-disease association verification databases HMDD v3.0, dbDEMC and miR2Disease, with the following results:
10 of the top 10 predictions were validated, 49 of the top 50, 97 of the top 100, and 169 of the top 180.
The foregoing description is only an example of the present invention and should not be construed as limiting the invention in any way. It will be apparent to those skilled in the art that various changes and modifications in form and detail may be made without departing from the principles and structure of the invention, but such changes and modifications fall within the scope of protection of the invention as defined by the appended claims.