Method for predicting human developmental toxicity based on mode of action
1. A method for predicting developmental toxicity in humans based on a mode of action, comprising:
constructing a compound activity dataset, comprising: selecting an adverse outcome pathway based on a mode of action; determining a first number of study events, said first number of study events being capable of describing a second number of toxicity endpoints; and collecting and collating the relevant assays and compound activity data for each study event;
training, from the compound activity dataset and in combination with a third number of molecular descriptor libraries, a fourth number of QSAR models for each toxicity endpoint using a fourth number of different machine learning algorithms; selecting, for each toxicity endpoint, the QSAR model with the best predictive performance; and forming a first prediction model from the resulting set of the second number of QSAR models;
training a second prediction model with a naive Bayes algorithm, using a plurality of compounds having in vivo experimental data together with the predictions of the first prediction model for those compounds;
and inputting a compound to be tested into the first prediction model for qualitative prediction, and inputting the qualitative prediction results into the second prediction model to complete the prediction of the human developmental toxicity of the compound.
2. The method of claim 1, wherein selecting the adverse outcome pathway based on the mode of action and determining the first number of study events comprises:
collecting, based on the mechanism of human developmental toxicity of the adverse outcome pathway, a first number of study events, including molecular initiating events and key events, at the molecular level, the cellular level, and the individual or organ level;
the relevant assays for each study event are collected as follows: starting from the molecular initiating event, all study event data are high-throughput chemical information from in vitro experiments, except for individual/organ-level study events, which are based on animal experiments; the individual- or organ-level study event serves as the adverse outcome, that is, the last study event, and the animal experiments used must comply with the test guidelines of the United States Environmental Protection Agency or the Organisation for Economic Co-operation and Development;
the activity data for each study event are collated as follows:
first, compounds without structural information, such as polymers, ionic compounds and mixtures, are removed;
then, the compound activity is normalized using the following formula:
in the formula, Activity value represents the activity intensity value, Ki represents the inhibition constant, Kd represents the dissociation constant, AC50 represents the half-maximal activity concentration, IC50 represents the half-maximal inhibitory concentration, EC50 represents the half-maximal effective concentration, and uM denotes micromolar concentration;
finally, the cytotoxicity assays are used as an activity filter to remove compounds whose apparent activity is potentially a false positive caused by cytotoxicity.
3. The method of claim 2, wherein, after the activity data are collated, the collected compounds are classified by activity for each study event as:
active: for a given study event, at least one assay shows activity; if the activity comes from in vitro antagonist data, its activity intensity must be greater than the cytotoxicity activity intensity in the same assay;
inactive: for a given study event, all assay results are inactive, or only in vitro antagonist data exist and the antagonist activity intensity is less than or equal to the cytotoxicity activity intensity in the same assay;
agonist: for a given study event, the compound is active and agonist activity data exist;
antagonist: for a given study event, the compound is active, antagonist activity data exist, and the antagonist activity intensity is greater than the cytotoxicity activity intensity.
4. The method according to claim 3, wherein training, from the compound activity dataset and in combination with the third number of molecular descriptor libraries, the fourth number of QSAR models for each toxicity endpoint using the fourth number of different machine learning algorithms, and selecting the QSAR model with the best predictive performance for each toxicity endpoint, comprises:
dividing the constructed compound activity dataset into a training set and a test set at a ratio of 4:1, wherein the training set is used for model construction and internal validation and the test set is used for external validation;
selecting the third number of molecular descriptor libraries to calculate structural information data for the compounds;
for each toxicity endpoint, training with the fourth number of different machine learning algorithms to obtain the fourth number of QSAR models;
evaluating the predictive performance of the QSAR models using a fifth number of different indicators, and selecting the QSAR model with the best predictive ability for each toxicity endpoint;
and constructing the first prediction model from the selected set of the second number of QSAR models.
5. The method of claim 4, wherein the fourth number of machine learning algorithms comprises a k-nearest neighbors algorithm, a naive Bayes algorithm, a random forest, a support vector machine, and a decision tree.
6. The method of claim 4, wherein the third number of molecular descriptor libraries are: OEState, Mold2, and Dragon v.7.
7. The method of claim 4, wherein the fifth number of indicators comprises true positives, false positives, true negatives, false negatives, sensitivity, specificity, accuracy, and area under the curve.
8. The method according to claim 4, wherein, when the QSAR models are built, the training set is internally validated by 5-fold cross-validation to test the stability of the data.
9. The method of claim 1, wherein the in vivo experimental data are animal experimental data, and some of the animal experimental data can detect multiple toxic effects of a compound at the organ/individual level, such that the second prediction model comprises a plurality of independent prediction models, each of which predicts one toxic effect; after the compound to be tested has been predicted by the plurality of independent prediction models of the second prediction model, the compound is determined to have human developmental toxicity if at least one model gives a positive prediction.
10. The method according to any one of claims 1 to 9, further comprising constructing an applicability domain by:
describing the applicability domain of the relevant QSAR model using 26 physicochemical properties of the compounds in the training set, the 26 physicochemical properties spanning 1D, 2D and 3D descriptors and being nAcid, ALogP, AMR, apol, naAromAtom, nAromBond, nAtom, nHeavyAtom, nBonds, nBondsD, nBondsT, nBondsQ, bpol, ETA_Alpha, FMF, nHBAcc, nHBDon, TopoPSA, VABC, MW, AMW, XLogP, TPSA, nRing, nRotB and nRotBt, respectively;
and calculating a distance matrix of the training set using the Euclidean distance to serve as the applicability domain of each QSAR model.
Background
A large body of animal experiments and epidemiological studies has shown that most small-molecule environmental compounds can interfere with nuclear-receptor-mediated nucleic acid translation and expression, thereby affecting human growth and development; this effect is called developmental toxicity. In particular, various official and unofficial organizations, including the U.S. Environmental Protection Agency (U.S. EPA) and the European Chemicals Agency (ECHA), have identified a number of environmental contaminants that can potentially affect adverse outcome pathways (AOPs) and cause dysplasia of the human reproductive organs in both males and females. In males, for example, contaminants can produce male developmental toxicity (MDT) at the individual level by affecting the androgen receptor-mediated adverse outcome pathway, leading to dysplasia of the male reproductive organs, including growth retardation, dysfunction or abnormalities of the testis and prostate. Environmental pollutants with MDT can lead not only to dysplasia of the male reproductive organs but also, potentially, to impaired male reproductive health, including an increased incidence of testicular germ cell tumors, low sperm quality, cryptorchidism and hypospadias, and ultimately to reproductive toxicity. In females, contaminants can produce female developmental toxicity (FDT) at the individual level by affecting the estrogen receptor-mediated adverse outcome pathway, causing dysplasia of the female reproductive organs, including the ovary, fallopian tube, uterus, placenta and breast. A successful pregnancy requires the normal development and function of these organs; dysfunction can lead to reduced fertility or infertility, failure to carry a pregnancy to term, or difficulty in feeding the infant.
In principle, depending on the mode of action, environmental pollutants produce human developmental toxicity at the molecular, cellular and organ/tissue levels by interfering with adverse outcome pathways. The result is incomplete development or malformation of the human reproductive organs, so that these organs cannot develop and function normally, ultimately leading to reproductive dysfunction. At present, the identification of development-related hazards relies primarily on extensive animal experiments and epidemiological findings, together with a small number of known interference mechanisms. Thousands of chemicals are in commerce, but few have been tested for developmental toxicity at the individual level, and few epidemiological studies have used dysplasia of human reproductive organs as an endpoint. Consequently, many traditional animal tests have been conducted over the last two decades to test the developmental toxicity of chemicals to humans. However, since 2004 the European Union has restricted traditional animal testing on ethical grounds, and non-animal testing methods based on in vitro tests and virtual screening have been developed. In vitro tests are time-consuming and costly and cannot cover the tens of thousands of registered chemicals, so virtual screening is particularly important.
Scientists have developed computer-based virtual screening methods to predict the activity of chemicals against toxicity endpoints. Quantitative structure-activity relationship (QSAR) modeling uses molecular descriptors to extract and describe the relationship between the biological activity of a compound and its structural features. As a well-established method, QSAR has been widely used in toxicity prediction. For example, the invention entitled "A virtual screening method of human transthyretin interferent" (patent publication No. CN106407665A, publication date 2017-02-15) uses QSAR to construct a screening method for interferents. The inventions entitled "A fresh water acute benchmark prediction method based on a metal quantitative structure-activity relationship" (patent publication No. CN104820873A, publication date 2015-08-05) and "The seawater acute standard prediction method based on the metal quantitative structure-activity relationship" (patent publication No. CN105447248A, publication date 2015-11-24) likewise use QSAR to predict fresh water and seawater acute benchmarks. Notably, a number of non-QSAR techniques have also been used for developmental toxicity prediction. One example is the invention entitled "A method for evaluating growth and development toxicity of triazole pesticide by use of Drosophila melanogaster" (patent publication No. CN110150236A, publication date 2021-04-06). However, that method not only has narrow applicability (it is limited to triazole pesticides), but its developmental endpoint is also measured in Drosophila melanogaster rather than being a human health toxicity endpoint, which is of greater concern. Another example is the invention entitled "A prediction method and a prediction model for the developmental toxicity of chemicals, and a construction method and application thereof" (patent publication No. CN112063681A, publication date 2020-12-11). Although that method uses human cardiomyocyte data for prediction, its predicted endpoint is the activity of the alpha-actin and SOX17 proteins in human cardiomyocytes; the prediction is limited to the cellular level and cannot address developmental toxicity at the organ and individual levels.
An adverse outcome pathway (AOP) describes the known linkage between a direct molecular initiating event (MIE), such as ligand-receptor binding, and the occurrence of an adverse outcome relevant to risk assessment at different levels of biological organization (e.g., cells, organs, organisms, populations). Establishing an AOP identifies not only each individual toxic event in the course of toxicity but also the relationships between those events, thereby modularizing them. An AOP ultimately links toxic events at the cellular, organ and individual levels, allows the toxic effects of chemicals to be evaluated comprehensively on the basis of the adverse outcome pathway, and allows the toxic mechanism of action of a chemical (e.g., interference with a key event) to be predicted. However, an analysis of the prior art shows that methods for high-throughput prediction of human developmental toxicity based on mode of action are lacking.
Disclosure of Invention
Technical problem: the invention aims to overcome the inability of the prior art to perform effective high-throughput prediction of human developmental toxicity based on mode of action, and provides a method for predicting human developmental toxicity based on mode of action. The method enables high-throughput screening of chemicals that potentially act on adverse outcome pathways and thereby produce developmental toxicity, so that it can be determined accurately and quickly whether a compound has developmental toxicity and whether that toxicity arises from interference with an adverse outcome pathway.
Technical solution: the invention provides a method for predicting human developmental toxicity based on mode of action, comprising:
constructing a compound activity dataset, comprising: selecting an adverse outcome pathway based on a mode of action; determining a first number of study events, said first number of study events being capable of describing a second number of toxicity endpoints; and collecting and collating the relevant assays and compound activity data for each study event;
training, from the compound activity dataset and in combination with a third number of molecular descriptor libraries, a fourth number of QSAR models for each toxicity endpoint using a fourth number of different machine learning algorithms; selecting, for each toxicity endpoint, the QSAR model with the best predictive performance; and forming a first prediction model from the resulting set of the second number of QSAR models;
training a second prediction model with a naive Bayes algorithm, using a plurality of compounds having in vivo experimental data together with the predictions of the first prediction model for those compounds;
and inputting a compound to be tested into the first prediction model for qualitative prediction, and inputting the qualitative prediction results into the second prediction model to complete the prediction of the human developmental toxicity of the compound.
Preferably, selecting the adverse outcome pathway based on the mode of action and determining the first number of study events comprises:
collecting, based on the mechanism of human developmental toxicity of the adverse outcome pathway, a first number of study events, including molecular initiating events and key events, at the molecular level, the cellular level, and the individual or organ level;
the relevant assays for each study event are collected as follows: starting from the molecular initiating event, all study event data are high-throughput chemical information from in vitro experiments, except for individual/organ-level study events, which are based on animal experiments; the individual- or organ-level study event serves as the adverse outcome, that is, the last study event, and the animal experiments used must comply with the test guidelines of the United States Environmental Protection Agency or the Organisation for Economic Co-operation and Development;
the activity data for each study event are collated as follows:
first, compounds without structural information, such as polymers, ionic compounds and mixtures, are removed;
then, the compound activity is normalized using the following formula:
in the formula, Activity value represents the activity intensity value, Ki represents the inhibition constant, Kd represents the dissociation constant, AC50 represents the half-maximal activity concentration, IC50 represents the half-maximal inhibitory concentration, EC50 represents the half-maximal effective concentration, and uM denotes micromolar concentration;
finally, the cytotoxicity assays are used as an activity filter to remove compounds whose apparent activity is potentially a false positive caused by cytotoxicity.
Preferably, after the activity data are collated, the collected compounds are classified by activity for each study event as:
active: for a given study event, at least one assay shows activity; if the activity comes from in vitro antagonist data, its activity intensity must be greater than the cytotoxicity activity intensity in the same assay;
inactive: for a given study event, all assay results are inactive, or only in vitro antagonist data exist and the antagonist activity intensity is less than or equal to the cytotoxicity activity intensity in the same assay;
agonist: for a given study event, the compound is active and agonist activity data exist;
antagonist: for a given study event, the compound is active, antagonist activity data exist, and the antagonist activity intensity is greater than the cytotoxicity activity intensity.
Preferably, training, from the compound activity dataset and in combination with the third number of molecular descriptor libraries, the fourth number of QSAR models for each toxicity endpoint using the fourth number of different machine learning algorithms, and selecting the QSAR model with the best predictive performance for each toxicity endpoint, comprises:
dividing the constructed compound activity dataset into a training set and a test set at a ratio of 4:1, wherein the training set is used for model construction and internal validation and the test set is used for external validation;
selecting the third number of molecular descriptor libraries to calculate structural information data for the compounds;
for each toxicity endpoint, training with the fourth number of different machine learning algorithms to obtain the fourth number of QSAR models;
evaluating the predictive performance of the QSAR models using a fifth number of different indicators, and selecting the QSAR model with the best predictive ability for each toxicity endpoint;
and constructing the first prediction model from the selected set of the second number of QSAR models.
Preferably, the fourth number of machine learning algorithms comprises a k-nearest neighbors algorithm, a naive Bayes algorithm, a random forest, a support vector machine, and a decision tree.
Preferably, the third number of molecular descriptor libraries are: OEState, Mold2, and Dragon v.7.
Preferably, the fifth number of indicators comprises true positives, false positives, true negatives, false negatives, sensitivity, specificity, accuracy and area under the curve.
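As a non-limiting illustration, the indicators listed above may be computed from binary labels and predictions as in the following Python sketch. The function names are hypothetical. The text does not state which indicator decides the "best" model, so ranking by area under the curve in `select_best` is an assumption made only for this example.

```python
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion counts plus sensitivity, specificity and accuracy (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "TP": tp, "FP": fp, "TN": tn, "FN": fn,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / len(y_true) if len(y_true) else 0.0,
    }


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Rank-based AUC: chance that a random active outscores a random inactive."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one active and one inactive compound")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best(results: Dict[str, Dict[str, float]], key: str = "auc") -> str:
    """Name of the model with the highest value of `key` (assumed criterion)."""
    return max(results, key=lambda name: results[name][key])
```

For one toxicity endpoint, `results` would hold one entry per trained algorithm, and `select_best` would be called once per endpoint to assemble the first prediction model.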
Preferably, when the QSAR models are built, the training set is internally validated by 5-fold cross-validation to test the stability of the data.
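The 4:1 division and the 5-fold internal validation described above may be sketched in Python as follows. This is a minimal illustration under stated assumptions: the split is a plain seeded random shuffle (the text does not say whether the split is stratified by activity class), and the function names are hypothetical.

```python
import random
from typing import Iterator, List, Tuple


def split_4_to_1(n_samples: int, seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle sample indices and return (training, test) at a 4:1 ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = n_samples // 5  # one fifth is held out for external validation
    return indices[n_test:], indices[:n_test]


def k_fold(indices: List[int], k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]  # k near-equal, disjoint folds
    for i in range(k):
        fit = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield fit, folds[i]
```

Each QSAR model would be fitted on `fit` and scored on the validation fold five times; the held-out test indices are used only once, for external validation.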
Preferably, the in vivo experimental data are animal experimental data, and some of the animal experimental data can detect multiple toxic effects of a compound at the organ/individual level, so that the second prediction model comprises a plurality of independent prediction models, each of which predicts one toxic effect; after the compound to be tested has been predicted by the plurality of independent prediction models of the second prediction model, the compound is determined to have human developmental toxicity if at least one model gives a positive prediction.
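The second prediction stage may be illustrated by the following Python sketch. It assumes that the qualitative outputs of the first prediction model are encoded as a binary vector (one 0/1 value per QSAR model) and that a Bernoulli naive Bayes variant with Laplace smoothing is used; the text specifies only "a naive Bayes algorithm", so the variant, the encoding, and all names here are assumptions.

```python
import math
from typing import Dict, List, Sequence


class BernoulliNaiveBayes:
    """Naive Bayes over binary features with Laplace (add-one) smoothing."""

    def fit(self, X: Sequence[Sequence[int]], y: Sequence[int]) -> "BernoulliNaiveBayes":
        n_features = len(X[0])
        self.log_prior: Dict[int, float] = {}
        self.p: Dict[int, List[float]] = {}
        for c in (0, 1):
            rows = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log((len(rows) + 1) / (len(X) + 2))
            # P(feature j = 1 | class c), smoothed so it is never 0 or 1
            self.p[c] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                         for j in range(n_features)]
        return self

    def predict(self, x: Sequence[int]) -> int:
        score = {}
        for c in (0, 1):
            s = self.log_prior[c]
            for xj, pj in zip(x, self.p[c]):
                s += math.log(pj if xj else 1.0 - pj)
            score[c] = s
        return 1 if score[1] > score[0] else 0


def has_developmental_toxicity(models: Dict[str, BernoulliNaiveBayes],
                               x: Sequence[int]) -> bool:
    """Positive if at least one independent toxic-effect model is positive."""
    return any(model.predict(x) == 1 for model in models.values())
```

Keeping one model per toxic effect, combined by a logical OR, mirrors the rule stated above that a single positive prediction is sufficient.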
Further, the method also comprises constructing an applicability domain, as follows:
the applicability domain of the relevant QSAR model is described using 26 physicochemical properties of the compounds in the training set, the 26 physicochemical properties spanning 1D, 2D and 3D descriptors and being nAcid, ALogP, AMR, apol, naAromAtom, nAromBond, nAtom, nHeavyAtom, nBonds, nBondsD, nBondsT, nBondsQ, bpol, ETA_Alpha, FMF, nHBAcc, nHBDon, TopoPSA, VABC, MW, AMW, XLogP, TPSA, nRing, nRotB and nRotBt, respectively;
and a distance matrix of the training set is calculated using the Euclidean distance to serve as the applicability domain of each QSAR model.
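A distance-based application (applicability) domain of this kind may be sketched in Python as follows. The text specifies only that a Euclidean distance matrix of the training set is computed; the threshold rule used here (mean pairwise training distance plus `z` standard deviations, compared with the query's nearest-neighbour distance) is an assumption for illustration, as are the function names and the premise that the descriptors have already been scaled.

```python
import math
from itertools import combinations
from statistics import mean, pstdev
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(train: Sequence[Sequence[float]]) -> List[List[float]]:
    """Symmetric matrix of pairwise Euclidean distances in the training set."""
    return [[euclidean(a, b) for b in train] for a in train]


def domain_threshold(train: Sequence[Sequence[float]], z: float = 0.5) -> float:
    """Assumed cut-off: mean pairwise distance plus z standard deviations."""
    pairwise = [euclidean(a, b) for a, b in combinations(train, 2)]
    return mean(pairwise) + z * pstdev(pairwise)


def in_domain(query: Sequence[float], train: Sequence[Sequence[float]],
              threshold: float) -> bool:
    """True if the query's nearest training compound lies within the cut-off."""
    return min(euclidean(query, t) for t in train) <= threshold
```

Predictions for compounds that fall outside the domain would be flagged as less reliable rather than discarded; how they are handled is not specified in the text.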
Advantageous effects: compared with the prior art, the invention has the following advantages:
The method provided by the embodiment of the invention selects an adverse outcome pathway based on a mode of action, determines the study events, collects and collates the relevant assays and compound activity data for each study event, and constructs a compound activity dataset; it then trains a first prediction model from the constructed compound activity dataset in combination with molecular descriptor libraries and machine learning algorithms; next, a second prediction model is trained with a naive Bayes algorithm, using a plurality of compounds with in vivo experimental data together with the first prediction model's predictions for those compounds; a compound to be tested is then passed through the two prediction stages, the first prediction model and the second prediction model, so that it can be determined accurately and quickly whether the compound has developmental toxicity and whether that toxicity arises from interference with an adverse outcome pathway.
With the method provided by the invention, chemicals that potentially act on adverse outcome pathways and thereby produce developmental toxicity can be screened at high throughput, so that it can be determined accurately and quickly whether a compound has developmental toxicity and whether that toxicity arises from interference with an adverse outcome pathway, which overcomes the lack, in the prior art, of high-throughput prediction of human developmental toxicity based on mode of action.
In the embodiment of the invention, the model is built on the molecular mechanism by which environmental pollutants interfere with the signaling of the adverse outcome pathway and, as a breakthrough, directly links the mechanisms of human developmental toxicity produced by compounds at the molecular, cellular and individual/organ levels, thereby providing a key technical method and theoretical basis for the extrapolation of human developmental toxicity based on computational toxicology.
Drawings
FIG. 1 is a flow chart of the method for predicting human developmental toxicity based on mode of action in an embodiment of the invention;
FIG. 2 is a flow chart of chemical prediction by the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway;
FIG. 3 is a flow chart of the construction of the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway;
FIG. 4 is a flow chart of the construction of the first prediction model of the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway;
FIG. 5 is a graph of the external validation results of the first prediction model of the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway;
FIG. 6 is a graph of the prediction results of the first and second prediction models of the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway;
FIG. 7 is a flow chart of prediction by the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway;
FIG. 8 is a graph of the prediction results of the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway.
Detailed Description
Example 1
This example takes the prediction of male developmental toxicity through the androgen receptor-mediated adverse outcome pathway as an illustration and describes a specific implementation of the invention in detail. FIG. 1 is a flow chart of the method for predicting human developmental toxicity based on mode of action in an embodiment of the invention, and FIG. 2 is a flow chart of chemical prediction by the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway. With reference to FIGS. 1 and 2, the method of this example comprises:
Step S100: constructing a compound activity dataset, comprising: selecting an adverse outcome pathway based on a mode of action; determining a first number of study events, said first number of study events being capable of describing a second number of toxicity endpoints; and collecting and collating the relevant assays and compound activity data for each study event.
Here, the first number refers to the number of study events determined. In this example, based on the mechanism of male developmental toxicity of the androgen receptor-mediated adverse outcome pathway, experimental data were collected at the molecular, cellular, and individual or organ levels for seven study events, comprising a molecular initiating event (MIE) and key events (KEs). The seven study events are ligand-receptor binding (MIE), cofactor recruitment (KE1), DNA binding (KE2), abnormal protein transcriptional activity (KE3), abnormal transcription (KE4), cell proliferation (KE5) and abnormal organ development (KE6), as shown in FIG. 3.
Except for the animal-experiment-based organ dysplasia event (KE6), the high-throughput in vitro test data for the first six study events come from the ToxCast™/Tox21 High-Throughput Screening (HTS) project developed by four U.S. government organizations: the National Toxicology Program (NTP), the National Center for Advancing Translational Sciences (NCATS), the Food and Drug Administration (FDA), and the National Center for Computational Toxicology (NCCT) of the U.S. EPA. From the latest ToxCast™/Tox21 database, updated on 26 February 2019, a total of 20 assays describing the first six study events were collected, as shown in Table 1.
Table 1. Summary of the assays used in the model for predicting male developmental toxicity based on the androgen receptor-mediated adverse outcome pathway
For ligand-receptor binding (Receptor Binding), the key first step in the activity of the androgen receptor-mediated adverse outcome pathway, three molecular assays exist (NVS_NR_cAR, NVS_NR_rAR, NVS_NR_hAR).
For cofactor recruitment (COA Recruitment), as KE1, two assays exist (OT_AR_ARSRC1_0480, OT_AR_ARSRC1_0960), which determine the role of the co-activators (COAs) recruited in the androgen receptor-mediated adverse outcome pathway. For DNA binding (Chromatin Binding), as KE2, one assay exists (TOX21_ARE_BLA_agonist_ratio), which determines whether the AR binds to DNA. Notably, the results of in vitro assays based on inhibition of protein production are highly correlated with cell viability; that is, cytotoxic chemicals can produce apparent "false positive" activity. Therefore, the cytotoxicity assay associated with each in vitro assay was also selected as a threshold filter for "false positive" active compounds. Accordingly, TOX21_ARE_BLA_agonist_viability was also selected as a cytotoxicity assay to prevent false positives.
For abnormal protein transcriptional activity, as KE3, two assays exist (ATG_AR_TRANS_dn, ATG_AR_TRANS_up). The two in vitro assays measure the tendency of protein transcriptional activity to increase (ATG_AR_TRANS_up) or decrease (ATG_AR_TRANS_dn), respectively, so two models with different transcription trends were used in the subsequent model construction.
For abnormal transcription (Gene Expression), as KE4, there are two kinds of test, activation of transcription and repression of transcription, comprising 7 assays in total. Activation of androgen receptor-mediated adverse outcome pathway transcription by a compound is covered by three in vitro assays (OT_AR_ARELUC_AG_1440, TOX21_AR_BLA_Agonist_ratio, TOX21_AR_LUC_MDAKB2_Agonist), which determine whether a compound has agonist activity; inhibition of androgen receptor-mediated adverse outcome pathway transcription by a compound is covered by two in vitro assays (TOX21_AR_BLA_Antagonist_ratio, TOX21_AR_LUC_MDAKB2_Antagonist) and two related cytotoxicity assays (TOX21_AR_BLA_Antagonist_viability, TOX21_AR_LUC_MDAKB2_Antagonist_viability), which determine whether a compound has antagonist activity under non-cytotoxic conditions.
Similarly, for cell proliferation (Cell Proliferation), as KE5, there are two kinds of test, comprising 2 assays (ACEA_AR_agonist_80hr, ACEA_AR_antagonist_80hr) and 2 cytotoxicity assays (ACEA_AR_agonist_AUC_viability, ACEA_AR_antagonist_AUC_viability).
In addition, organ dysplasia was identified as KE6, and the rat Hershberger assay was selected in this example to determine, at the organ level, the developmental toxicity of androgenic and antiandrogenic activity on male organs. The rodent Hershberger assay is specified in the test guidelines of both the U.S. EPA (EPA 890.1400) and the OECD (OECD 441). In the U.S. EPA/OECD guidelines, the Hershberger assay uses the castrated male rat model. Rats are castrated at around postnatal day 42 and allowed a post-operative recovery period of at least 7 days to reduce endogenous androgen (testosterone) levels. The result of the Hershberger assay is based on the weight changes of 5 androgen-dependent accessory sex tissues (ASTs): the ventral prostate (VP), the seminal vesicles (SV) (plus fluids and coagulating glands), the levator ani-bulbocavernosus (LABC) muscle, the paired Cowper's (bulbourethral) glands (COW) and the glans penis (GP). When two or more ASTs show a significant increase in organ weight, the compound has an androgenic effect; conversely, when two or more ASTs show a significant decrease in organ weight, the compound has an antiandrogenic effect. Androgenic and antiandrogenic effects are different types of interfering activity produced by chemicals; both lead to abnormal development of the male organs and thus to male developmental toxicity (MDT). In total, 21 in vitro and in vivo assays were selected to screen the dataset of compounds that produce male developmental toxicity by acting on the androgen receptor-mediated adverse outcome pathway.
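The organ-level decision rule of the Hershberger assay described above may be sketched in Python as follows. The sketch reads the criterion as "two or more" of the five tissues (adjustable through `min_tissues`) and assumes that the statistical significance of each weight change has already been determined upstream; the encoding (+1, -1, 0) and the function names are hypothetical.

```python
from typing import Dict

# The five androgen-dependent tissues named in the text.
AST_TISSUES = ("VP", "SV", "LABC", "COW", "GP")


def hershberger_call(changes: Dict[str, int], min_tissues: int = 2) -> Dict[str, bool]:
    """Classify one compound from per-tissue weight changes.

    `changes` maps a tissue code to +1 (significant weight increase),
    -1 (significant weight decrease) or 0 (no significant change).
    """
    increased = sum(1 for t in AST_TISSUES if changes.get(t, 0) > 0)
    decreased = sum(1 for t in AST_TISSUES if changes.get(t, 0) < 0)
    return {
        "androgenic": increased >= min_tissues,
        "antiandrogenic": decreased >= min_tissues,
    }


def is_mdt_positive(call: Dict[str, bool]) -> bool:
    """Either type of interference counts as male developmental toxicity."""
    return call["androgenic"] or call["antiandrogenic"]
```

Returning the two flags separately, rather than a single label, avoids having to decide how to rank the two effects when both criteria are met, which the text does not address.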
The collected activity data are then curated to clean up the compound structure and activity information, specifically as follows:
first, polymers, ionic compounds, mixtures and compounds without structural information (marked "NA" or "FAIL") are removed;
then, the Activity Value (AV) of the compound is normalized using formula (1);
in formula (1), Activity value denotes the activity strength value, Ki the inhibition constant, Kd the dissociation constant, AC50 the half-maximal activity concentration, IC50 the half-maximal inhibitory concentration, EC50 the half-maximal effective concentration, and µM micromolar. Under formula (1), for each experiment, an activity strength of 3 or more (AV ≥ 3) is defined as active, and an activity strength below 3 (AV < 3) as inactive.
After data curation, the collected compounds were classified by activity for each study event, as follows:
active (Active): for a given study event, at least one experiment shows activity; if the activity comes from antagonist in vitro data, the antagonist activity strength must be greater than the cytotoxicity activity strength in the same experiment;
inactive (Inactive): for a given study event, all experimental results are inactive, or only antagonist in vitro data exist and the antagonist activity strength is less than or equal to the cytotoxicity activity strength in the same experiment;
agonist (Agonist): for a given study event, the compound is active and agonist activity data exist;
antagonist (Antagonist): for a given study event, the compound is active, antagonist activity data exist, and the antagonist activity strength is greater than the cytotoxicity activity strength.
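The classification rules above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used: the names `AssayResult`, `is_active` and `classify_event` are hypothetical, and treating an antagonist assay without a paired cytotoxicity value as active is an assumption that the text does not address.

```python
from dataclasses import dataclass
from typing import Optional

AV_CUTOFF = 3.0  # from the text: AV >= 3 is active, AV < 3 is inactive


@dataclass
class AssayResult:
    mode: str                            # "agonist" or "antagonist"
    av: float                            # normalized activity value (AV)
    cytotox_av: Optional[float] = None   # AV of the paired cytotoxicity assay


def is_active(r: AssayResult) -> bool:
    """Apply the AV cutoff and, for antagonist data, the cytotoxicity check."""
    if r.av < AV_CUTOFF:
        return False
    if r.mode == "antagonist" and r.cytotox_av is not None:
        # Antagonist activity only counts if it exceeds cytotoxicity.
        return r.av > r.cytotox_av
    # Assumption: no paired cytotoxicity value means no cytotoxicity check.
    return True


def classify_event(results):
    """Return the active / agonist / antagonist flags for one study event."""
    active = [r for r in results if is_active(r)]
    return {
        "active": bool(active),
        "agonist": any(r.mode == "agonist" for r in active),
        "antagonist": any(r.mode == "antagonist" for r in active),
    }
```

For example, a compound whose only antagonist readout (AV 4.2) does not exceed the cytotoxicity readout (AV 4.5) comes out inactive for that study event.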
Finally, the compound information collected is shown in table 2.
Table 2. Data collection results for the male developmental toxicity prediction model based on the androgen receptor mediated adverse outcome pathway
In Table 2, "1" indicates "active" and "0" indicates "inactive".
As Table 2 shows, the 7 study events describe 11 toxicity endpoints, namely ligand-receptor binding (receptor binding), cofactor recruitment (COA recruitment), DNA binding (chromatin binding), protein transcriptional activity up-regulation (transcription factor activity-agonist), protein transcriptional activity reduction (transcription factor activity-antagonist), transcriptional activation (gene expression-agonist), transcriptional inhibition (gene expression-antagonist), cell proliferation-activation (cell proliferation-agonist), cell proliferation-inhibition (cell proliferation-antagonist), abnormal organ development-weight gain (Hershberger test-agonist) and abnormal organ development-weight loss (Hershberger test-antagonist). The information for each toxicity endpoint is detailed in Table 2.
Because the data used in this embodiment are reliable and valid, the reliability and accuracy of the human developmental toxicity predictions of the subsequently built model are improved.
Step S200: based on the compound activity data set and a third number of molecular descriptor libraries, train, for each toxicity endpoint, a fourth number of QSAR models using a fourth number of different machine learning algorithms; select, for each toxicity endpoint, the QSAR model with the best predictive performance; the resulting set of the second number of QSAR models forms the first prediction model;
wherein the third number refers to the number of the molecular descriptor libraries, and the fourth number refers to the number of the types of the selected machine learning algorithms.
Thus, each toxicity endpoint was modeled using the data in Table 2. In the present embodiment, three molecular descriptor libraries and five machine learning algorithms are used, as shown in Fig. 4. More specifically, the three molecular descriptor libraries are OEState, Mold2 and Dragon v.7, and the five machine learning algorithms are K Nearest Neighbor (KNN), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and Decision Tree (DT). Thus, 5 QSAR models were trained for each toxicity endpoint, giving a total of 55 QSAR models for the 11 toxicity endpoints of the seven study events. From the 5 QSAR models trained for each toxicity endpoint, the one with the best predictive performance was selected, giving 11 QSAR models for the 11 toxicity endpoints; this set of 11 QSAR models forms the first prediction model.
Specifically, with reference to fig. 4, the following steps may be performed.
Step S201: the data set was divided into a training set (80%) and a test set (20%) at a ratio of 4:1 using a "Partitioning Mode" module in KNIME platform software, where the training set was used for model construction and internal validation and the test set was used for external validation.
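As an illustration of the 4:1 division in Step S201, the following Python sketch performs a plain random split. It does not reproduce the KNIME module: the function name `split_4_to_1` and the fixed seed are hypothetical, and whether the original partitioning was stratified by activity class is not stated, so no stratification is assumed here.

```python
import random


def split_4_to_1(records, seed=0):
    """Shuffle the records and cut them into 80% training and 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    cut = (len(shuffled) * 4) // 5          # 4:1 ratio
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = split_4_to_1(range(100))
```

The training part would then be used for model construction and internal validation, and the held-out part for external validation.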
Step S202: the structural information of the compound (indicated as SMILES) was checked for correctness using ChemBioDraw Ultra 14.0 software. This step may be omitted if the structural information of the compound can be confirmed to be correct.
Step S203: the three molecular descriptor libraries OEState, Mold2 and Dragon v.7 are used to compute descriptors from the structural information of the compounds. Together the three libraries contain 6049 1D, 2D and 3D molecular descriptors, which were calculated from the SMILES of each compound on the Online Chemical Modeling Environment (OCHEM) platform. Structure optimization mainly comprises four steps: Standardization, Neutralize, Remove salts and Clean structure. Relevant descriptors are then selected from the computed molecular descriptors; the main steps are removing low-variance descriptors (low variance filter), removing highly correlated descriptors (high correlation filter) and selecting key feature descriptors (feature information selection).
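The first two descriptor selection steps of Step S203 (low-variance and high-correlation filtering) might look like the sketch below. The thresholds `min_variance` and `max_abs_corr`, the greedy keep-first strategy and the function names are assumptions made for illustration; the text gives no cutoff values, and the third step (key feature selection) is not covered.

```python
from math import sqrt
from statistics import mean, pvariance


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length value lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def filter_descriptors(columns, min_variance=1e-3, max_abs_corr=0.95):
    """columns maps descriptor name -> list of values, one per compound.

    Both thresholds are hypothetical defaults, not values from the text.
    """
    # Step 1: drop descriptors that barely vary across compounds.
    kept = {n: v for n, v in columns.items() if pvariance(v) > min_variance}
    # Step 2: greedily keep a descriptor only if it is not highly
    # correlated with any descriptor that has already been kept.
    selected = []
    for name, values in kept.items():
        if all(abs(pearson(values, kept[s])) <= max_abs_corr for s in selected):
            selected.append(name)
    return selected
```

With four toy descriptors, one constant and one an exact multiple of another, only the two informative columns survive.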
Step S204: for each toxicity endpoint, five machine learning algorithms were chosen, respectively, and five QSAR models were trained. Since there are 11 toxicity endpoints, a total of 55 QSAR models were trained.
In the invention, the five machine learning algorithms are K Nearest Neighbor (KNN), Naive Bayes (NB), Random Forest (RF), Support Vector Machine (SVM) and Decision Tree (DT). Therefore, for each toxicity endpoint, the 5 QSAR models obtained by training are a K-nearest neighbor model, a naive Bayes model, a random forest model, a support vector machine model and a decision tree model.
Step S205: validate the predictive performance of the QSAR models using a fifth number of different indexes, and select the QSAR model with the best predictive ability for each toxicity endpoint.
In this embodiment, 8 different indexes may be used to validate the trained QSAR models, the fifth number being the number of indexes. The 8 indexes are True Positive (TP), False Positive (FP), True Negative (TN), False Negative (FN), Sensitivity, Specificity, Accuracy and Area Under the Curve (AUC). In a specific implementation, the best prediction model may be selected using only one or more of the 8 indexes. Table 3 gives the validation results of the 55 trained QSAR models on the 8 indexes.
Table 3. Evaluation results of the first prediction model of the male developmental toxicity prediction model based on the androgen receptor mediated adverse outcome pathway
Also, in the present embodiment, the QSAR model having the best predictive effect for each toxicity endpoint is screened with Accuracy (Accuracy) as the final criterion, as shown in fig. 5.
As can be seen from Fig. 5, the QSAR models trained with the Random Forest (RF) algorithm performed best for ligand-receptor binding (Model 1), cofactor recruitment (Model 2), DNA binding (Model 3), protein transcriptional activity up-regulation (Model 4), transcriptional activation (Model 6), transcriptional inhibition (Model 7), cell proliferation-activation (Model 8) and cell proliferation-inhibition (Model 9).
For protein transcriptional activity reduction (Model 5), the QSAR model trained with the Decision Tree (DT) algorithm performed best. The QSAR models trained with the Support Vector Machine (SVM) algorithm performed best for abnormal organ development-weight gain (Hershberger test-agonist, Model 10) and abnormal organ development-weight loss (Hershberger test-antagonist, Model 11).
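The per-endpoint selection by Accuracy described above amounts to taking, for each toxicity endpoint, the algorithm with the highest score. The sketch below shows that step only; the function name and the accuracy values are hypothetical placeholders, not the Table 3 results.

```python
def select_best_models(accuracy_table):
    """accuracy_table maps endpoint -> {algorithm: accuracy}.

    Returns endpoint -> name of the algorithm with the highest accuracy.
    On ties, the algorithm listed first wins.
    """
    return {
        endpoint: max(scores, key=scores.get)
        for endpoint, scores in accuracy_table.items()
    }


# Hypothetical accuracies for two endpoints, for illustration only.
example = {
    "Model 1": {"KNN": 0.81, "NB": 0.74, "RF": 0.88, "SVM": 0.85, "DT": 0.79},
    "Model 5": {"KNN": 0.70, "NB": 0.66, "RF": 0.72, "SVM": 0.71, "DT": 0.76},
}
best = select_best_models(example)
```

Applied to all 11 endpoints, the returned mapping would name the 11 QSAR models that make up the first prediction model.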
During QSAR model training in this embodiment, five-fold cross-validation is applied to the training set as internal validation, to test the stability of the data.
Additionally, among the 48 compounds with in vivo rat Hershberger test results, 9 compounds were found to have experimental results for all seven study events of the androgen receptor mediated adverse outcome pathway. The predictive ability of the 11 QSAR models in the first prediction model was therefore further verified with these 9 compounds. As shown in Table 4 and Fig. 6(A), the 11 QSAR models in the first prediction model accurately predicted the experimental results of the seven study events for the 9 reference compounds, with an accuracy as high as 92%. This demonstrates that the 11 QSAR models in the first prediction model enable accurate and rapid qualitative prediction of the seven study events of the androgen receptor mediated adverse outcome pathway for a compound.
Table 4. Validation of the first prediction model's predictions for 9 representative compounds in the male developmental toxicity prediction model based on the androgen receptor mediated adverse outcome pathway
In Table 4, unshaded cells indicate that the measured and predicted results agree, and grey cells indicate that they disagree; the value outside the brackets is the experimental value and the value inside the brackets is the predicted value; "1" denotes active and "0" denotes inactive.
The prediction result of the established first prediction model for each research event can also give a human developmental toxicity mechanism of the chemicals, and effective mechanism information is provided for development of green chemicals.
Step S300: train a second prediction model with a naive Bayes algorithm, using a plurality of compounds with in vivo experimental data together with the first prediction model's predictions for those compounds;
in this step, 48 compounds with in vivo experimental data are used, and the first prediction model's predictions for these compounds are shown in Table 5. The prediction results are combined through a naive Bayes algorithm to train the second prediction model, which in this embodiment is a weight-based integrated prediction model of male developmental toxicity. The second prediction model trained with the naive Bayes algorithm is essentially a weighting model.
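To make the idea of combining binary first-layer predictions through naive Bayes concrete, the sketch below implements a small Bernoulli naive Bayes classifier. It is an illustration under assumptions: the text does not say which naive Bayes variant or smoothing was used, so the Bernoulli form with Laplace smoothing (`alpha`) is chosen here, the function names are hypothetical, and the four toy rows are invented rather than taken from Table 5.

```python
from math import log


def train_bernoulli_nb(X, y, alpha=1.0):
    """X: list of 0/1 feature vectors (first-layer predictions per endpoint).
    y: list of 0/1 in vivo labels. alpha: Laplace smoothing (assumed)."""
    n_features = len(X[0])
    model = {}
    for c in (0, 1):
        rows = [x for x, label in zip(X, y) if label == c]
        prior = (len(rows) + alpha) / (len(X) + 2 * alpha)
        # Smoothed probability that each feature equals 1 within class c.
        p_on = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (log(prior), p_on)
    return model


def predict(model, x):
    """Return the class with the highest log posterior score."""
    scores = {}
    for c, (log_prior, p_on) in model.items():
        scores[c] = log_prior + sum(
            log(p if v else 1.0 - p) for v, p in zip(x, p_on)
        )
    return max(scores, key=scores.get)


# Invented toy data: 3 endpoint predictions per compound, 4 compounds.
X_toy = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y_toy = [1, 1, 0, 0]
nb = train_bernoulli_nb(X_toy, y_toy)
```

The per-feature log-probability terms act as learned weights on each endpoint, which is one way to read the statement that the second prediction model is essentially a weighting model.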
Table 5. The 48 compounds used by the male developmental toxicity prediction model based on the androgen receptor mediated adverse outcome pathway, with the first prediction model's prediction results
It should be noted that, because the in vivo animal data used in this step come from the rat Hershberger test, the test can detect both the androgenic effect and the antiandrogenic effect of a compound. The second prediction model therefore ultimately contains two independent models: (i) an androgenic effect prediction model; (ii) an antiandrogenic effect prediction model. When a compound is predicted by the two models separately, it has male developmental toxicity if androgenic activity and/or antiandrogenic activity is present. Meanwhile, because the experimental data available for the second prediction model are limited to the 48 compounds with in vivo experimental data, all the data are used as the training set at this stage, in order to secure the application domain of the model, and internal validation is performed. As before, the predictive performance of the model is validated using one or more of the eight indexes in step S200. Table 6 and Fig. 6(B-C) show the results of the androgenic and antiandrogenic activity prediction models in the second prediction model.
Table 6. Evaluation results of the second prediction model of the male developmental toxicity prediction model based on the androgen receptor mediated adverse outcome pathway
The results show that the antiandrogenic activity prediction model correctly predicts whether each of the 48 compounds has antiandrogenic activity, with a prediction accuracy of 100%. For the androgenic activity prediction model, although all five androgenic compounds were correctly predicted, i.e. there were no false negatives, 36 compounds without androgenic activity were predicted as false positives, i.e. wrongly predicted to have androgenic activity, so the accuracy was only 25%. Although a false negative rate of 0% ensures the usefulness of the androgenic model from a regulatory perspective, so many mispredictions defeat the purpose of prediction.
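The 25% figure can be checked from the counts given above. The sketch below uses the standard definitions of sensitivity, specificity and accuracy; TP = 5, FN = 0 and FP = 36 come from the text, while TN = 7 is derived here as 48 - 5 - 36 and is not stated explicitly. The function name is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard confusion-matrix metrics."""
    return {
        "sensitivity": tp / (tp + fn),                 # true positive rate
        "specificity": tn / (tn + fp),                 # true negative rate
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Androgenic activity model before the extra screening condition:
# 5 true positives, 0 false negatives, 36 false positives, 7 true negatives.
m = classification_metrics(tp=5, fp=36, tn=7, fn=0)
```

Here the accuracy is (5 + 7) / 48 = 0.25 and the sensitivity is 1.0, matching the reported 25% accuracy with no false negatives; the specificity is 7/43, about 16%.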
Therefore, in the examples of the present invention, the chemical structures and characteristics of the 48 compounds were studied in detail. It was found that the 5 androgenic compounds are all steroids (testosterone propionate, 17-methyltestosterone, trenbolone, methyl-1-testosterone, testosterone), whereas the other 43 compounds, which have no androgenic activity, are not steroids. A new screening condition is therefore added: "Is the compound predicted to have androgenic activity a steroid?" The process is shown in Fig. 7.
In the modeling process, this screening condition is applied by prediction and screening based on chemical structure similarity (chemical similarity). In detail, the five androgenic compounds are used as template compounds (positive controls), and the structure of each compound to be predicted is matched one by one against the structures of the template compounds. Chemical structure similarity is characterized by the Tanimoto Similarity Score, computed with the 12 molecular structure fingerprint libraries contained in the PaDEL-Descriptor software. The 12 fingerprint libraries are Fingerprinter, ExtendedFingerprinter, EStateFingerprinter, GraphOnlyFingerprinter, MACCSFingerprinter, PubchemFingerprinter, SubstructureFingerprinter, SubstructureFingerprintCount, KlekotaRothFingerprinter, KlekotaRothFingerprintCount, AtomPairs2DFingerprinter and AtomPairs2DFingerprintCount. The Tanimoto Score takes values between 0 and 1, and a larger score indicates higher chemical similarity. In this example, the cutoff value of the Tanimoto Score is set to 0.8: when the similarity between the tested compound and at least one of the 5 androgenic compounds is greater than or equal to 0.8, the tested compound is considered a steroid that satisfies the condition for androgenic activity. The predictive ability of the androgenic activity prediction model is greatly improved after the new screening condition is added: as shown in Table 6 and Fig. 6(D), it correctly predicts whether each of the 48 compounds has androgenic activity, with a prediction accuracy of 100%.
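The similarity screen can be illustrated with the Tanimoto coefficient on binary fingerprints, defined as the number of shared "on" bits divided by the number of bits set in either fingerprint. The sketch below is a simplification under assumptions: fingerprints are represented as Python sets of bit positions rather than PaDEL-Descriptor output, only a single fingerprint type is used (how the scores from the 12 libraries are combined is not specified in the text), and the function names are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


def passes_steroid_screen(query_fp, template_fps, cutoff=0.8):
    """True if the query reaches the cutoff against at least one template.

    The 0.8 cutoff follows the text; the templates stand in for the five
    androgenic reference compounds.
    """
    return any(tanimoto(query_fp, t) >= cutoff for t in template_fps)


# Toy fingerprints, invented for illustration.
templates = [{1, 2, 3, 4, 5}]
similar = passes_steroid_screen({1, 2, 3, 4, 5, 6}, templates)   # 5/6
dissimilar = passes_steroid_screen({1, 2, 9}, templates)         # 2/6
```

A compound predicted androgenic by the second-layer model would be kept as androgenic only if this screen also returns True.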
Step S400: input the compound to be tested into the first prediction model for qualitative prediction, and input the qualitative prediction results into the second prediction model to complete the prediction of the human developmental toxicity of the chemical. The compound is thus predicted in two layers: the first layer is a qualitative prediction by the first prediction model, and the second layer is an integrated prediction by the second prediction model.
In this embodiment, following the flow shown in Figs. 2 and 6, the proposed method is tested and verified with flutamide. The compound is first input into the first prediction model for qualitative prediction: the 11 QSAR models in the first prediction model predict whether the compound is active or inactive in the seven study events of the androgen receptor mediated adverse outcome pathway. The prediction results of the first prediction model are then input into the second prediction model; because the second prediction model in this embodiment comprises an androgenic activity model and an antiandrogenic activity model, the androgenic and antiandrogenic activity of the compound are predicted separately. Finally, the compound is considered to have male developmental toxicity when it shows androgenic and/or antiandrogenic activity. Fig. 8 shows the integrated display of results in this embodiment: (i) information on the tested compound, including compound name, CASN, structural information and the male developmental toxicity prediction result; (ii) detailed male developmental toxicity prediction results (tabular format); (iii) detailed male developmental toxicity prediction results (radar chart format).
Further, in the examples, the above process can only predict small-molecule organic compounds and cannot predict the male developmental toxicity of heavy metal compounds, mixtures or ionic compounds, because their toxicity mechanisms differ. In this embodiment, therefore, 26 physicochemical properties of the compounds in the training set are used to describe the Application Domain (AD) of the relevant QSAR prediction model. The 26 physicochemical properties, drawn from 1D, 2D and 3D descriptors, are nAcid, ALogP, AMR, apol, naAromAtom, nAromBond, nAtom, nHeavyAtom, nBonds, nBondsD, nBondsT, nBondsQ, bpol, ETA_Alpha, FMF, nHBAcc, nHBDon, TopoPSA, VABC, MW, AMW, XLogP, TPSA, nRing, nRotB and nRotBt. The physicochemical property represented by each molecular descriptor is listed in Table 7. The physicochemical properties of the compounds were calculated using the 1D, 2D and 3D molecular descriptor library of the PaDEL-Descriptor software.
Table 7. Information on the 26 physicochemical properties selected for the application domain of the male developmental toxicity prediction model based on the androgen receptor mediated adverse outcome pathway
The construction of the Application Domain (AD) comprises four steps. First, the 26 physicochemical properties are calculated for all compounds in the training set used for modeling, and each property is Min-Max normalized to a value between 0 and 1 using the Normalizer module of the KNIME platform software. Second, a Distance Matrix (DM) of the training set compounds is calculated using the Euclidean distance in the R ComplexHeatmap package; this DM is the AD of the QSAR model. Third, the 26 physicochemical property values of the tested compound are calculated and normalized on the basis of the AD of the corresponding QSAR model (the 26 physicochemical parameters of the training set compounds); the normalized values are scored for similarity against the DM of the model to find the training set compound closest to the tested compound and the corresponding similarity distance. Finally, when the similarity distance between the tested compound and its closest compound is greater than or equal to 0.5, the tested compound is judged to be within the AD; otherwise it is judged to be outside the AD.
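The first three AD steps (Min-Max normalization on the training set, Euclidean distances, nearest training compound) can be sketched as below. This is not the KNIME/R workflow itself: the function names are hypothetical, toy data with 2 properties stand in for the 26, and the sketch stops at returning the nearest compound and its Euclidean distance, because the text does not specify how that distance is converted into the similarity score compared against the 0.5 cutoff.

```python
from math import dist


def fit_min_max(train):
    """Per-property minimum and maximum over the training compounds."""
    cols = list(zip(*train))
    return [min(c) for c in cols], [max(c) for c in cols]


def scale(row, lo, hi):
    """Min-Max scale one compound using the training-set ranges.

    Constant properties are mapped to 0.0 to avoid division by zero.
    """
    return [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]


def nearest_training_compound(query, train):
    """Return (index, Euclidean distance) of the closest training compound."""
    lo, hi = fit_min_max(train)
    scaled_train = [scale(r, lo, hi) for r in train]
    q = scale(query, lo, hi)
    distances = [dist(q, r) for r in scaled_train]
    i = min(range(len(distances)), key=distances.__getitem__)
    return i, distances[i]


# Toy training set with 2 properties instead of 26.
train = [[0, 0], [10, 100], [5, 50]]
idx, d = nearest_training_compound([5, 60], train)
```

Note that the tested compound is scaled with the training-set minima and maxima, so its normalized values may fall outside the 0 to 1 range when it lies beyond the training data.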
Each QSAR model in the first prediction model has its own AD. Overall, therefore, the application domain of the method for predicting male developmental toxicity based on the androgen receptor mediated adverse outcome pathway is the union of the ADs of the 11 QSAR prediction models.
As can be seen from Fig. 8, the tested compound flutamide was first predicted by the 11 QSAR models of the first layer, and both the prediction result and the AD status are shown (in brackets: "in" means within the AD and "out" means outside the AD). Flutamide was found to lie within the AD of all 11 QSAR models, which supports the validity of its prediction.
Currently, valid experimental data from traditional animal experiments (the rat Hershberger assay) are scarce. In 2018, OECD researchers carried out a highly systematic literature search to collate rat Hershberger data into a database of valid animal experiments (https://www.regulations.gov/document/EPA-HQ-OPPT-2009-). As a result, nearly 3200 study records were collected, and only 134 compounds preliminarily met the OECD/US EPA specification. Further, only 48 compounds have at least two study results that agree with each other; a further 24 compounds have at least two study results that do not agree, so their developmental toxicity at the organ level cannot be determined. Although a QSAR model built on the 48 compounds with valid animal experiment results can effectively predict the toxicity of compounds within its AD (Accuracy 1, Fig. 4), its overly narrow AD greatly limits the practical application of such a prediction model. The method for predicting male developmental toxicity based on the androgen receptor mediated adverse outcome pathway is a two-layer prediction model built from multi-dimensional QSAR models along that pathway. The model can not only analyze the interference mechanism of a compound (which key event is disturbed), but also, by combining the multi-dimensional models, greatly expand the narrow application domain that would result from animal experiment data alone, thereby greatly enhancing the prediction range and predictive ability of the model in practical application.
The above examples are only preferred embodiments of the present invention. It should be noted that those skilled in the art can make various modifications and equivalent substitutions without departing from the spirit of the invention, and all such modifications and equivalents are intended to fall within the scope of the invention as defined in the claims.