sgRNA target activity prediction method, device, equipment and storage medium


1. A method for predicting target activity of sgRNA, comprising:

acquiring an sgRNA sequence dataset;

performing sequence feature extraction on the sgRNA sequence dataset to obtain a plurality of feature information;

fusing the plurality of feature information to obtain a feature set;

training a seed model based on the feature set to obtain an activity prediction model;

obtaining an sgRNA sequence to be predicted;

and predicting the target activity of the sgRNA sequence to be predicted based on the activity prediction model.

2. The activity prediction method of claim 1, wherein training a seed model based on the feature set to obtain an activity prediction model comprises:

selecting optimal feature information from the feature set to obtain an unbalanced feature set;

carrying out up-sampling processing on the unbalanced feature set to obtain a balanced feature set;

and training the seed model based on the balanced feature set to obtain an activity prediction model.

3. The activity prediction method of claim 2, wherein performing the up-sampling processing on the unbalanced feature set to obtain the balanced feature set comprises:

acquiring a sampling rate;

and performing upsampling processing on the unbalanced feature set based on the sampling rate to obtain a balanced feature set.

4. The activity prediction method of claim 3, wherein obtaining a sampling rate comprises:

operating a support vector machine based on the unbalanced feature set to obtain a support vector set;

determining a plurality of neighborhoods of elements in the set of support vectors;

classifying elements in the support vector set based on the plurality of neighborhoods to obtain majority class samples, boundary samples and minority class samples;

determining a sampling rate based on the majority class samples, the boundary samples, and the minority class samples.

5. The activity prediction method of claim 4, wherein performing the up-sampling processing on the unbalanced feature set based on the sampling rate to obtain the balanced feature set comprises:

acquiring a plurality of nearest neighbors of the boundary samples and the minority class samples;

and interpolating the minority class samples based on the boundary samples and the nearest neighbors, so that the number of interpolated minority class samples is balanced with the number of majority class samples, to obtain the balanced feature set.

6. The activity prediction method of claim 1, wherein the sgRNA sequence dataset comprises: high activity sgRNA sequence data and low activity sgRNA sequence data.

7. A target activity prediction device for sgRNAs, comprising:

the first acquisition module is used for acquiring a sgRNA sequence dataset;

the feature extraction module is used for performing sequence feature extraction on the sgRNA sequence dataset to obtain a plurality of feature information;

the feature fusion module is used for fusing the plurality of feature information to obtain a feature set;

the model training module is used for training the seed model based on the feature set to obtain an activity prediction model;

the second acquisition module is used for acquiring the sgRNA sequence to be predicted;

and the activity prediction module is used for predicting the target activity of the sgRNA sequence to be predicted based on the activity prediction model.

8. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 6.

Background

Single-guide RNA (sgRNA) is a guide RNA formed by fusing two RNAs (tracrRNA and crRNA). The sgRNA binds to the Cas9 protein and directs the Cas9 enzyme to target regions of genomic DNA, where the DNA is cleaved. The sgRNA is an important component of the CRISPR/Cas9 gene knockout system and is of great significance for gene editing and disease treatment. A high-activity sgRNA exhibits higher mutation efficiency at its target, which improves the efficiency of subsequent screening and identification schemes.

Therefore, how to determine the target activity of sgRNAs before gene editing is an urgent problem to be solved.

Disclosure of Invention

The application provides a method, a device, equipment and a storage medium for predicting the target activity of sgRNA, which can determine the target activity of the sgRNA.

In a first aspect, an embodiment of the present application provides a method for predicting a target activity of an sgRNA, including:

acquiring an sgRNA sequence dataset;

performing sequence feature extraction on the sgRNA sequence dataset to obtain a plurality of feature information;

fusing the plurality of feature information to obtain a feature set;

training a seed model based on the feature set to obtain an activity prediction model;

obtaining an sgRNA sequence to be predicted;

and predicting the target activity of the sgRNA sequence to be predicted based on the activity prediction model.

Optionally, training the seed model based on the feature set to obtain an activity prediction model, including:

selecting optimal feature information from the feature set to obtain an unbalanced feature set;

carrying out up-sampling processing on the unbalanced feature set to obtain a balanced feature set;

and training the seed model based on the balanced feature set to obtain an activity prediction model.

Optionally, performing up-sampling processing on the unbalanced feature set to obtain a balanced feature set, including:

acquiring a sampling rate;

and performing upsampling processing on the unbalanced feature set based on the sampling rate to obtain a balanced feature set.

Optionally, obtaining a sampling rate comprises:

operating a support vector machine based on the unbalanced feature set to obtain a support vector set;

determining a plurality of neighborhoods of elements in the set of support vectors;

classifying elements in the support vector set based on the plurality of neighborhoods to obtain majority class samples, boundary samples and minority class samples;

determining a sampling rate based on the majority class samples, the boundary samples, and the minority class samples.

Optionally, performing up-sampling processing on the unbalanced feature set based on the sampling rate to obtain a balanced feature set, including:

acquiring a plurality of nearest neighbors of the boundary samples and the minority class samples;

and interpolating the minority class samples based on the boundary samples and the nearest neighbors, so that the number of interpolated minority class samples is balanced with the number of majority class samples, to obtain the balanced feature set.

Optionally, the sgRNA sequence dataset comprises: high activity sgRNA sequence data and low activity sgRNA sequence data.

A second aspect of an embodiment of the present application provides an apparatus for predicting a target activity of a sgRNA, including:

the first acquisition module is used for acquiring a sgRNA sequence dataset;

the feature extraction module is used for performing sequence feature extraction on the sgRNA sequence dataset to obtain a plurality of feature information;

the feature fusion module is used for fusing the plurality of feature information to obtain a feature set;

the model training module is used for training the seed model based on the feature set to obtain an activity prediction model;

the second acquisition module is used for acquiring the sgRNA sequence to be predicted;

and the activity prediction module is used for predicting the target activity of the sgRNA sequence to be predicted based on the activity prediction model.

A third aspect of embodiments of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, performs the steps in the method according to the first aspect of the present application.

A fourth aspect of the embodiments of the present application provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of the method according to the first aspect of the present application.

By adopting the sgRNA target activity prediction method provided by the embodiments of the present application, the target activity of an sgRNA can be predicted, providing a theoretical basis for corresponding drug development. In addition, up-sampling the unbalanced data set converts it into a balanced data set, which effectively improves the target activity prediction accuracy for the minority class of sgRNAs.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed to be used in the description of the embodiments of the present application will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive exercise.

Fig. 1 is a flowchart of a method for predicting target activity of sgRNAs provided in an embodiment of the present application;

fig. 2 is a schematic diagram of feature dimensions and accuracy curves under different data sets provided in an embodiment of the present application.

FIG. 3 is a schematic diagram of the ACC effect of CS-Smote under an unbalanced data set provided by the embodiment of the present application.

FIG. 4 is a schematic diagram of the G-mean effect of CS-Smote under an unbalanced data set provided by the embodiment of the present application.

Fig. 5 is a diagram illustrating results of different classifiers according to an embodiment of the present application.

Fig. 6 is a schematic diagram comparing the recognition effect with the recognition algorithm in the prior art according to the embodiment of the present application.

Fig. 7 is a schematic structural diagram of a target activity prediction apparatus for sgRNAs provided in an embodiment of the present application.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, the present application is described in further detail with reference to the accompanying drawings and the detailed description.

Referring to fig. 1, a flow diagram of a method of predicting target activity of an sgRNA of the present application is shown. As shown in fig. 1, the method comprises the steps of:

s101, acquiring a sgRNA sequence dataset.

The sgRNA sequence dataset comprises a positive dataset and a negative dataset: the positive dataset consists of high-activity sgRNA sequences, and the negative dataset consists of low-activity sgRNA sequences.

In some alternative embodiments, there are 8 sgRNA activity sequence datasets in total: G17 (1059 positive high-activity sgRNA sequences, 4251 negative low-activity sgRNA sequences), Gr (731 positive, 438 negative), Gnr (237 positive, 237 negative), Gm (830 positive, 231 negative), hela (2019 positive, 536 negative), hct116 (3873 positive, 24536 negative), hek293t (536 positive, 404 negative), and h160 (536 positive, 67 negative).

S102, extracting sequence features of the sgRNA sequence data set to obtain a plurality of feature information.

In some optional embodiments, sequence feature extraction is performed on the sgRNA sequence dataset based on a plurality of different feature extraction algorithms, resulting in a plurality of feature information. In some alternative embodiments, the feature extraction algorithms comprise nucleotide composition methods, sequence autocorrelation methods, pseudo nucleic acid composition methods, and sequence structural feature methods. The nucleotide composition methods comprise a k-mer extraction algorithm and a Subsequence extraction algorithm; the autocorrelation feature extraction algorithms comprise an auto covariance (DAC) based extraction algorithm, a cross covariance (DCC) based extraction algorithm, an auto-cross covariance (DACC) based extraction algorithm, a Geary autocorrelation (GAC) algorithm and a normalized Moreau-Broto autocorrelation (NMBAC) algorithm; the pseudo nucleic acid composition feature extraction algorithms comprise a parallel-correlation pseudo-dinucleotide composition algorithm and a series-correlation pseudo-dinucleotide composition algorithm; the structural feature extraction algorithms comprise a local structure sequence triplet (Triplet) feature extraction algorithm.

In some alternative embodiments, in the k-mer extraction algorithm, one feature file is obtained for each of k = 2 and k = 3 (where k is the length of the adjacent-nucleotide substrings whose occurrence frequencies are counted), so the 10 feature extraction algorithms above yield 11 feature files in total; the dimension distribution of the 11 feature files is shown in fig. 2.
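To make the nucleotide composition features concrete, the following is a minimal illustrative sketch of k-mer feature extraction (function and variable names are illustrative assumptions, not the code used in the embodiment): each sequence is mapped to the normalized frequencies of all 4^k possible k-mers, giving 16 dimensions for k = 2 and 64 dimensions for k = 3.

```python
from itertools import product

def kmer_features(sequence, k):
    """Normalized k-mer composition vector of an sgRNA sequence.

    Returns a vector with 4**k entries, one per possible k-mer over
    {A, C, G, U}, each holding that k-mer's frequency in the sequence.
    """
    alphabet = "ACGU"
    sequence = sequence.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = max(len(sequence) - k + 1, 1)
    for i in range(len(sequence) - k + 1):
        sub = sequence[i:i + k]
        if sub in counts:
            counts[sub] += 1
    return [counts[km] / windows for km in kmers]

# Example: 2-mer (16-dim) plus 3-mer (64-dim) features for one sequence
features = kmer_features("GACGUACGGAACUUGGCAUU", 2) + kmer_features("GACGUACGGAACUUGGCAUU", 3)
print(len(features))  # 80
```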

S103, fusing the plurality of feature information to obtain a feature set.

In some optional embodiments, the 11 feature files are merged and fused in an early-fusion (feature-level fusion) manner. Of course, in some other alternative embodiments, a late-fusion mode may be selected for feature fusion.
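The early (feature-level) fusion described above amounts to concatenating the per-algorithm feature matrices column-wise. A minimal sketch under that assumption (NumPy-based; names are illustrative):

```python
import numpy as np

def early_fusion(feature_blocks):
    """Concatenate per-algorithm feature matrices column-wise.

    feature_blocks: list of arrays, each of shape (n_samples, d_i), one per
    feature file (k-mer, DAC, DCC, Triplet, ...). Returns a fused matrix of
    shape (n_samples, sum(d_i)).
    """
    return np.hstack(feature_blocks)

# Example with three hypothetical feature files for 100 samples
blocks = [np.random.rand(100, 16), np.random.rand(100, 64), np.random.rand(100, 32)]
print(early_fusion(blocks).shape)  # (100, 112)
```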

And S104, training the seed model based on the feature set to obtain an activity prediction model.

Step S104 includes the following sub-steps S1041-S1043.

And S1041, selecting optimal feature information from the feature set to obtain an unbalanced feature set.

In some optional embodiments, feature selection is performed on the feature set using the MRMD2.0 algorithm, so as to obtain a feature subset in which the features are strongly correlated with the instance classes and have low redundancy among themselves.

In the MRMD2.0 algorithm, the correlation between a feature and the instance classes is characterized by the Pearson coefficient: the larger the Pearson coefficient, the stronger and tighter the correlation between the feature and the classes. The redundancy between features is characterized by a distance measure computed from the Euclidean distance ED, the Cosine distance COS and the Tanimoto coefficient TC: the larger this distance, the lower the redundancy between features.

Based on the above theory, the criterion used by the MRMD2.0 algorithm to select features from the feature set is Max(MR_i + MD_i), where MR_i denotes the Pearson coefficient between the ith feature and the sgRNA instance classes, and MD_i denotes the distance measure of the ith feature. The value of max MR_i is calculated as follows:

max MR_i = max |PCC(F_i, C_i)|, 1 ≤ i ≤ M,

PCC(F_i, C_i) = S_{FiCi} / (S_{Fi} · S_{Ci}), with S_{FiCi} = (1/N) · Σ_{k=1}^{N} (f_k − f̄)(c_k − c̄).

The value of max MD_i is calculated as follows:

max MD_i = max (ED_i + COS_i + TC_i) / 3, 1 ≤ i ≤ M,

wherein PCC(·) denotes the Pearson coefficient, F_i denotes the ith feature vector of the sgRNA instances, C_i denotes the corresponding class vector of the sgRNA instances, M is the feature dimension of the sgRNA instances, S_{FiCi} denotes the covariance of all elements in F_i and C_i, S_{Fi} denotes the standard deviation of all elements in F_i, S_{Ci} denotes the standard deviation of all elements in C_i, f_k denotes the kth element of F_i, c_k denotes the kth element of C_i, N is the number of elements in F_i and C_i, f̄ is the mean of all elements in F_i, c̄ is the mean of all elements in C_i, and ED_i, COS_i and TC_i denote the Euclidean distance, Cosine distance and Tanimoto coefficient between the ith sgRNA features, respectively.

An unbalanced feature set is obtained after feature selection.
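The selection criterion can be illustrated with a simplified sketch that scores each feature by the sum of its Pearson relevance (MR) and a distance-based non-redundancy term (MD). This is only an illustration of the Max(MR_i + MD_i) idea, not the actual MRMD2.0 implementation; the rescaling of the distance term and the example dimensions are assumptions.

```python
import numpy as np

def mr_md_scores(X, y):
    """Simplified MR + MD feature scores (illustration only, not MRMD2.0 itself).

    MR_i: absolute Pearson correlation between feature i and the class labels
          (relevance -- larger is better).
    MD_i: mean Euclidean distance from feature column i to every other feature
          column, rescaled to [0, 1] (non-redundancy -- larger is better).
    """
    n_features = X.shape[1]
    mr = np.array([abs(np.corrcoef(X[:, i], y)[0, 1]) for i in range(n_features)])
    md = np.array([
        np.mean([np.linalg.norm(X[:, i] - X[:, j]) for j in range(n_features) if j != i])
        for i in range(n_features)
    ])
    md = md / md.max()        # put the distance term on a comparable scale
    return mr + md            # rank features by MR_i + MD_i

# Example: keep the top 161 features of a fused matrix (dimensions are illustrative)
rng = np.random.default_rng(0)
X_fused, y = rng.random((200, 300)), rng.integers(0, 2, 200)
top = np.argsort(mr_md_scores(X_fused, y))[::-1][:161]
X_selected = X_fused[:, top]
print(X_selected.shape)  # (200, 161)
```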

S1042, carrying out up-sampling processing on the unbalance feature set to obtain a balance feature set.

In some optional embodiments, the CS-Smote algorithm is used to process the unbalanced feature set to improve the prediction performance, and step S1042 includes: acquiring a sampling rate; and performing up-sampling processing on the unbalanced feature set based on the sampling rate to obtain a balanced feature set. More specifically, the step of acquiring the sampling rate comprises:

a. operating a support vector machine based on the unbalanced feature set to obtain a support vector set;

and (4) operating the SVM on the unbalanced feature set S to obtain a support vector set SV.

b. Determining a plurality of neighborhoods of elements in the set of support vectors;

The m nearest neighbors of each point sv_i (sv_i ∈ SV) are calculated using the Euclidean distance, and it is assumed that m' of these m neighbors are majority-class samples, 0 ≤ m' ≤ m.

c. And classifying elements in the support vector set based on the plurality of neighborhoods to obtain majority samples, boundary samples and minority samples.

If m' = m, all m neighbors of sv_i belong to the majority class; sv_i is regarded as a noise point and is deleted.

If m/t < m' < m, the majority-class points account for a sufficiently large proportion of the neighbors of sv_i, and sv_i belongs to the boundary samples.

If 0 < m' < m/t (t is a parameter, generally t = 2), the minority-class points account for a large proportion of the neighbors of sv_i, and sv_i belongs to the safe samples.

Finally, with m determined, the number of minority-class samples in the neighborhood is obtained as m − m'.

d. Determining a sampling rate based on the majority class samples, the boundary samples, and the minority class samples.

For a boundary sample sv_i', the sum of its distances to the m' majority-class samples is denoted a_i1, and the sum of its distances to the m − m' minority-class samples is denoted a_i2; the sampling rate is then U_i = a_i1 / a_i2 (see the sketch following this step).
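A minimal sketch of this sampling-rate step, using scikit-learn's SVC and NearestNeighbors, is given below. It assumes integer class labels 0/1 and restricts the categorization to minority-class support vectors; these choices, like the function name, are illustrative assumptions rather than the embodiment's exact CS-Smote code.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import NearestNeighbors

def boundary_sampling_rates(X, y, m=5, t=2):
    """Sampling rates U_i for boundary support vectors (illustrative sketch).

    X: (n_samples, n_features) unbalanced feature set; y: integer labels 0/1.
    Runs an SVM, then labels each minority-class support vector as noise
    (m' = m), boundary (m/t < m' < m) or safe (m' < m/t) from its m nearest
    neighbors, and returns {index: U_i = a_i1 / a_i2} for boundary samples.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    svm = SVC(kernel="rbf").fit(X, y)
    minority_label = np.argmin(np.bincount(y))
    nn = NearestNeighbors(n_neighbors=m + 1).fit(X)

    rates = {}
    for i in svm.support_:                      # indices of the support vector set SV
        if y[i] != minority_label:
            continue
        dist, idx = nn.kneighbors(X[i:i + 1])
        dist, idx = dist[0][1:], idx[0][1:]     # drop the point itself
        is_majority = y[idx] != minority_label
        m_prime = int(is_majority.sum())
        if m_prime == m:                        # noise point: discard
            continue
        if m / t < m_prime < m:                 # boundary sample
            a1 = dist[is_majority].sum()        # distances to the m' majority neighbors
            a2 = dist[~is_majority].sum()       # distances to the m - m' minority neighbors
            if a2 > 0:
                rates[i] = a1 / a2              # sampling rate U_i
        # otherwise a safe sample: no rate is assigned here
    return rates
```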

The step of performing upsampling processing on the imbalance feature set based on the sampling rate to obtain a balance feature set comprises the following steps:

a. multiple nearest neighbors of the boundary samples and the minority class samples are obtained.

That is, the k nearest minority-class neighbors of each boundary sample sv_i' are calculated.

b. And interpolating the minority samples based on the boundary samples and the nearest neighbors to balance the interpolated minority samples with the number of the majority samples to obtain a balance feature set.

Interpolation is performed according to the sampling rate to generate new minority-class samples s_n. The interpolation formula is as follows:

s_n = sv_i' + ch_i · (k_i − sv_i')

where k_i denotes the ith of the k nearest neighbors, ch_1 ∈ (0, 1) is a random number, and ch_i = μ · ch_{i−1} · (1 − ch_{i−1}) with μ ∈ [3.75, 4).

Finally, a balanced feature set is obtained.
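The interpolation with the chaotic coefficient sequence can be sketched as follows; this is a minimal illustration of the formula above, and the initial ch value and μ are illustrative choices within the stated ranges.

```python
import numpy as np

def chaotic_interpolate(sv_boundary, minority_neighbors, n_new, mu=3.9, ch=0.42):
    """Generate synthetic minority samples around one boundary sample.

    Implements s_n = sv' + ch_i * (k_i - sv') with the chaotic update
    ch_i = mu * ch_{i-1} * (1 - ch_{i-1}), mu in [3.75, 4).
    """
    synthetic = []
    for i in range(n_new):
        neighbor = minority_neighbors[i % len(minority_neighbors)]   # k_i
        synthetic.append(sv_boundary + ch * (neighbor - sv_boundary))
        ch = mu * ch * (1.0 - ch)   # chaotic update keeps ch in (0, 1)
    return np.array(synthetic)

# Example: three synthetic points between one boundary sample and two neighbors
sv = np.array([0.2, 0.8])
neighbors = np.array([[0.1, 0.9], [0.3, 0.7]])
print(chaotic_interpolate(sv, neighbors, 3))
```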

S1043, training the seed model based on the balanced feature set to obtain an activity prediction model.

The seed model can be a random forest model, and the random forest model is subjected to classification training based on the balanced feature set to obtain a trained classification model. Step S1043 specifically includes:

the feature data in the balanced feature set is divided into 10 shares.

Each fold is traversed in turn: one fold is taken as the test set and the remaining 9 folds as the training set, and classification training on the sgRNA activity sequences is performed using the RF algorithm (random forest model).

And evaluating the classification effect.

In some optional embodiments, the indicators for evaluating the classification effect include SE, SP, ACC, MCC and G-mean, which are calculated as follows:

SE = TP / (TP + FN)

SP = TN / (TN + FP)

ACC = (TP + TN) / (TP + TN + FP + FN)

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

G-mean = √(SE × SP)

wherein TP represents the number of high-activity sgRNAs predicted correctly, TN represents the number of low-activity sgRNAs predicted correctly, FP represents the number of low-activity sgRNAs incorrectly predicted as high-activity, and FN represents the number of high-activity sgRNAs incorrectly predicted as low-activity.
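A minimal sketch of the 10-fold cross-validation and evaluation described above, using scikit-learn's RandomForestClassifier; the number of trees and other hyperparameters are illustrative assumptions, not the embodiment's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix, matthews_corrcoef

def evaluate_rf_10fold(X, y, n_estimators=500, seed=0):
    """10-fold cross-validation of a random forest on the balanced feature set,
    reporting the mean SE, SP, ACC, MCC and G-mean over the folds."""
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        clf = RandomForestClassifier(n_estimators=n_estimators, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        tn, fp, fn, tp = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
        se = tp / (tp + fn)                    # sensitivity
        sp = tn / (tn + fp)                    # specificity
        acc = (tp + tn) / (tp + tn + fp + fn)
        mcc = matthews_corrcoef(y[test_idx], pred)
        scores.append((se, sp, acc, mcc, np.sqrt(se * sp)))   # last term: G-mean
    return np.mean(scores, axis=0)

# Example call on a toy, roughly balanced set (dimensions are illustrative)
rng = np.random.default_rng(0)
X_bal, y_bal = rng.random((200, 50)), rng.integers(0, 2, 200)
print(evaluate_rf_10fold(X_bal, y_bal))   # [SE, SP, ACC, MCC, G-mean]
```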

S105, obtaining an sgRNA sequence to be predicted;

s106, predicting the target activity of the sgRNA sequence to be predicted based on the activity prediction model.

The trained classification model is used to construct an sgRNA-RF classifier; the feature subset of the sequence to be predicted is input into the sgRNA-RF classifier to obtain a classification result, completing the prediction of sgRNA activity.

The recognition effect of the present invention is further described below with a set of specific experimental examples.

MRMD2.0 is used to select the important features, and a graph of the relationship between feature dimension and ACC is generated, as shown in fig. 2. Feature selection is first performed on G17 to determine the curve of feature dimension versus accuracy. The highest accuracy, 0.8043, is reached when the feature dimension is 161; 161 is therefore the feature dimension obtained after feature selection. MRMD2.0 is then executed on the other datasets, and the feature dimensions with the highest accuracy are 187, 585, 470, 523, 365, 459 and 156, respectively. By comparing the accuracy differences, it is found that the accuracy difference within each dataset is at most 0.1. Therefore, the feature dimension is taken as 161.

The performance of CS-Smote, the original data, Smote and Bsmote was evaluated with ACC and G-mean on the G17 dataset. The Smote algorithm is a classical oversampling algorithm; the Bsmote algorithm, an improved version of Smote, mainly oversamples the minority samples lying on the border between the minority and majority classes. The results are shown in figs. 3 and 4. As can be seen from fig. 3, the ACC on the original data is generally low: the classification accuracy of the MED12 gene is the highest, about 91%, while that of the NF2 gene is the lowest, about 73%, a difference of about 18%. The classical Smote algorithm gives the best classification effect and the highest classification accuracy on 12 genes; among these, the classification accuracy of the MED12 gene reaches or exceeds 95%, and that of the THY1 gene is the lowest, about 80%. The Bsmote and CS-Smote methods were then tested, and both improve the prediction accuracy relative to the original data. In addition, the algorithm adopted in this analysis achieves the highest classification accuracy on the CD28, CUL3 and TAD2B genes, which demonstrates the effectiveness of the algorithm in this study to a certain extent. The performance of the four cases was then analyzed using the G-mean. As can be seen from fig. 4, the G-mean of CS-Smote is the best, followed by Smote and the original data. CS-Smote attains the highest value on 17 genes, and for most genes its value is about 0.8. This result also demonstrates the effectiveness of the CS-Smote method. Taken together, the ACC and G-mean results show that the CS-Smote algorithm improves the prediction performance on the unbalanced dataset to a certain extent.

Then, through feature selection and unbalanced-data processing, a better dataset is obtained, and the ACC, SN, SP and MCC produced by four classifiers (RF, SVM, NB and J48) are compared. The results are shown in fig. 5. As can be seen from the position of the sgRNA-RF broken line in fig. 5, sgRNA-RF achieves good results on all datasets, and overall good results are obtained for the 4 genes on the 5th dataset. For the Gr, Gnr and Gm datasets, recognition is generally poor, but sgRNA-RF performs best by comparison, demonstrating its effectiveness. However, a classifier still needs to be constructed to accommodate additional datasets.

Finally, the present invention is compared with the results of existing state-of-the-art recognition algorithms; for the comparison, the consistent evaluation index ACC is used on the basis of ensuring that the datasets are consistent, as shown in fig. 6. To demonstrate the effectiveness of sgRNA-RF, it is compared with other published predictors on the G17 dataset using 10-fold cross-validation. Ge-CRISPR, Azimuth, CRISPRPred, sgRNA-psm and sgRNA-expsm are among the most advanced predictors of sgRNA target activity in previous studies. The 10-fold cross-validation comparison shows that the performance is superior to that of previous studies, that sgRNA-RF has a certain effectiveness in sgRNA activity prediction, and that it may provide a new idea for sgRNA research.

Based on the same inventive concept, an embodiment of the present application provides a target activity prediction apparatus for sgRNAs. Referring to fig. 7, fig. 7 is a schematic view of a target activity prediction apparatus for sgRNAs provided in an embodiment of the present application. As shown in fig. 7, the apparatus includes:

a first obtaining module 701, configured to obtain a sgRNA sequence dataset;

a feature extraction module 702, configured to perform sequence feature extraction on the sgRNA sequence dataset to obtain multiple pieces of feature information;

a feature fusion module 703, configured to fuse the plurality of feature information to obtain a feature set;

a model training module 704, configured to train a seed model based on the feature set to obtain an activity prediction model;

a second obtaining module 705, configured to obtain a sgRNA sequence to be predicted;

an activity prediction module 706 configured to predict a target activity of the sgRNA sequence to be predicted based on the activity prediction model.

Based on the same inventive concept, another embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the steps in the method according to any of the above-mentioned embodiments of the present application.

Based on the same inventive concept, another embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and running on the processor, and when the processor executes the computer program, the electronic device implements the steps of the method according to any of the above embodiments of the present application.

For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.

As will be appreciated by one of skill in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present application have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the true scope of the embodiments of the application.

Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.

The sgRNA target activity prediction method, device, equipment and storage medium provided by the present application are described in detail above. The principle and implementation of the present application are explained herein using specific examples, and the description of the above embodiments is only intended to help understand the method and its core idea; meanwhile, for a person skilled in the art, there may be variations in the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as a limitation of the present application.
