Molecular feature extraction and performance prediction method based on graph convolution

Document No.: 9895    Publication date: 2021-09-17

1. A molecular feature extraction and performance prediction method based on graph convolution, characterized by comprising the following steps:

S1: extracting molecular features, constructing an atomic feature matrix and a graph adjacency matrix, and converting the molecular graph into a numerical vector carrying atomic information, chemical bond information and molecular structure information;

S2: constructing a graph convolution layer, and inputting the obtained atomic feature matrix and graph adjacency matrix into the graph convolution layer so that each node aggregates information from its neighboring nodes, wherein the nodes are the atomic nodes of the molecular structure;

S3: constructing a node linear layer, and applying a node-level linear transformation followed by activation to the convolved atomic feature matrix to obtain the feature matrix of the molecule;

S4: constructing a pooling layer, pooling the feature matrix of the molecule, and extracting the feature vector of the molecule;

S5: constructing a molecular graph linear layer, and applying a linear transformation followed by activation to the molecular feature vector.

2. The graph convolution-based molecular feature extraction and performance prediction method of claim 1, wherein constructing the atomic feature matrix includes obtaining atomic feature data corresponding to the atomic nodes, and hash-encoding the atomic feature data to obtain a node feature matrix.

3. The graph convolution-based molecular feature extraction and performance prediction method of claim 1, wherein constructing the graph adjacency matrix includes constructing an n × n binary matrix according to the molecular structure information, where n is the number of atomic nodes; an element is set to 1 if the two corresponding nodes are adjacent and to 0 otherwise.

4. The graph convolution-based molecular feature extraction and performance prediction method of claim 1, wherein the graph convolution layer is constructed as a hidden layer with the following formula:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

wherein $H^{(l)}$ is the n × d matrix of the current hidden layer, n is the number of atomic nodes, d is the atomic feature dimension, $H^{(l+1)}$ is the matrix of the next hidden layer, $W^{(l)}$ is the hidden layer weight matrix, $\tilde{A} = A + I$ is obtained by adding the identity matrix I to the graph adjacency matrix A, σ is a nonlinear activation function, and $\tilde{D}$ is the degree matrix of $\tilde{A}$, calculated as follows:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

5. The graph convolution-based molecular feature extraction and performance prediction method of claim 4, wherein the graph convolution layer comprises 2 or 3 layers.

6. The graph convolution-based molecular feature extraction and performance prediction method of claim 1, wherein constructing the node linear layer includes applying a linear transformation followed by activation to the output of the graph convolution layer, with the following formula:

$$H^{(\text{node MLP})} = \sigma\left(H^{(\text{Conv})} W + B\right)$$

wherein $H^{(\text{node MLP})}$ is the node linear layer output, $H^{(\text{Conv})}$ is the graph convolution layer output, W is the linear layer weight matrix, B is the bias matrix, and σ is a nonlinear activation function.

7. The graph convolution-based molecular feature extraction and performance prediction method of claim 6, wherein the node linear layer has 1 layer.

8. The graph convolution-based molecular feature extraction and performance prediction method of claim 1, wherein the molecular graph linear layer is constructed as a linear hidden layer with the following formula:

$$H_{l+1}^{(\text{graph MLP})} = \sigma\left(H_{l}^{(\text{graph MLP})} W + B\right)$$

wherein $H_{l}^{(\text{graph MLP})}$ is the current linear hidden layer, $H_{l+1}^{(\text{graph MLP})}$ is the next linear hidden layer, W is the linear layer weight matrix, B is the bias matrix, and σ is a nonlinear activation function.

9. The graph convolution-based molecular feature extraction and performance prediction method of claim 1, wherein the molecular graph linear layer has 1 to 3 layers.

Background

The prediction of molecular properties is key to the discovery of effective materials and is an important component of materials genome research. With growing computing power and the continued development of molecular databases, machine learning has been widely applied in chemistry and materials research, for example in electronic structure learning, spectral property prediction and virtual screening for materials design, and machine-learning-assisted methods can establish quantitative structure-activity relationships more accurately and efficiently.

At present, designing molecular fingerprints and constructing suitable molecular representations remain a challenge for molecular machine learning. Molecular feature extraction is an important part of machine-learning-based molecular design and molecular performance prediction: the molecular graph must be converted into a numerical vector that serves as the input of a neural network, and that feature vector must also carry complete atomic information, chemical bond information and molecular structure information.

The traditional ECFP circular fingerprint uses a hash algorithm to encode molecular substructures into a binary vector, but information is lost during the encoding; the CM (Coulomb matrix) fingerprint uses atomic charges and interatomic distances to construct a Coulomb matrix, but this fingerprint is not invariant to permutation of the atom indices.

Disclosure of Invention

The invention provides a molecular feature extraction and performance prediction method based on graph convolution. The method combines the extraction of atomic and chemical bond information with the aggregation of neighboring-node information, avoids the dependence on atom ordering, and integrates the molecular feature vector into the neural network model so that it is learnable. By effectively capturing the neighborhood information of each atomic node, it improves the prediction accuracy of molecular performance, and it is of considerable value in fields such as protein structure inference, compound synthesis, drug design and the development of molecular functional materials.

The invention provides a molecular feature extraction and performance prediction method based on graph convolution, which comprises the following steps:

S1: extracting molecular features, constructing an atomic feature matrix and a graph adjacency matrix, and converting the molecular graph into a numerical vector carrying atomic information, chemical bond information and molecular structure information;

S2: constructing a graph convolution layer, inputting the obtained atomic feature matrix and graph adjacency matrix into it, and obtaining the convolved atomic feature matrix;

S3: constructing a node linear layer, and applying a node-level linear transformation followed by activation to the convolved atomic feature matrix to obtain the feature matrix of the molecule;

S4: constructing a pooling layer, pooling the feature matrix of the molecule, and extracting the feature vector of the molecule;

S5: constructing a molecular graph linear layer, and applying a linear transformation followed by activation to the molecular feature vector.

Further, constructing the atomic feature matrix includes obtaining the atomic feature data corresponding to the atomic nodes and hash-encoding the atomic feature data to obtain the node feature matrix.

Further, constructing the graph adjacency matrix includes constructing an n × n binary matrix according to the molecular structure information, where n is the number of atomic nodes; an element is set to 1 if the two corresponding nodes are adjacent and to 0 otherwise.

Further, the graph convolution layer is constructed as a hidden layer in the neural network model, with the following formula:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

wherein $H^{(l)}$ is the n × d matrix of the current hidden layer, n is the number of atomic nodes, d is the atomic feature dimension, $H^{(l+1)}$ is the matrix of the next hidden layer, $W^{(l)}$ is the hidden layer weight matrix, $\tilde{A} = A + I$ is the adjacency matrix with self-connections, obtained by adding the identity matrix I to the graph adjacency matrix A, σ is a nonlinear activation function, and $\tilde{D}$ is the degree matrix of $\tilde{A}$, calculated as follows:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

Further, the graph convolution layer comprises 2 or 3 layers.

Further, the node linear layer applies a linear transformation followed by activation to the hidden layer output by the graph convolution, with the following formula:

$$H^{(\text{node MLP})} = \sigma\left(H^{(\text{Conv})} W + B\right)$$

wherein $H^{(\text{node MLP})}$ is the node linear layer output, $H^{(\text{Conv})}$ is the graph convolution layer output, W is the linear layer weight matrix, B is the bias matrix, and σ is a nonlinear activation function.

Further, the node linear layer has 1 layer.

Further, the molecular graph linear layer is constructed as a hidden layer in the neural network model, with the following formula:

$$H_{l+1}^{(\text{graph MLP})} = \sigma\left(H_{l}^{(\text{graph MLP})} W + B\right)$$

wherein $H_{l}^{(\text{graph MLP})}$ is the current linear hidden layer, $H_{l+1}^{(\text{graph MLP})}$ is the next linear hidden layer, W is the linear layer weight matrix, B is the bias matrix, and σ is a nonlinear activation function.

Further, the molecular graph linear layer has 1 to 3 layers.

The invention has the following beneficial effects:

1. By quantifying the atoms and the chemical bonds between them, extracting the connectivity between atoms in the molecule, and encoding the atomic, chemical bond and molecular structure information with a hash algorithm, a learnable feature vector without information loss is formed. The feature vector retains complete atomic information, chemical bond information and molecular structure information, so that it can be input into an MLP network for performance prediction.

2. A neural network model is constructed based on graph convolution, and the molecular feature vector is integrated into the neural network model. Through the graph convolution operation, the node-level linear operation, the pooling operation and the graph-level linear operation, the neighborhood information of each atomic node is captured effectively, which improves the prediction accuracy of molecular performance.

Drawings

FIG. 1 is a schematic overall flow diagram of the method of the present invention;

FIG. 2 is a schematic diagram of the input and output of a molecule in the model of the present invention.

Detailed Description

In the following description, technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The embodiment of the invention provides a molecular feature extraction and performance prediction method based on graph convolution, which comprises extracting molecular features, constructing a graph convolution network model, and inputting the obtained molecular features into the network model for molecular performance prediction, wherein the graph convolution network model comprises a graph convolution layer, a node linear layer, a pooling layer and a molecular graph linear layer.

As shown in FIG. 1, molecular features are first extracted according to preset atomic feature parameters, and the feature data are hash-encoded to obtain the node feature matrix of the molecular graph; the molecular structure information is used to obtain the graph adjacency matrix. A graph convolution network model is then constructed, data obtained from a molecular database are input into the model, and the final network model is obtained through training. Finally, the model is validated and tested to predict the performance of molecules.

In this embodiment, the molecular database is the QM9 database, from which 133885 molecules were obtained and divided into a training set, a validation set and a test set at a ratio of 8:1:1.
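The data split can be illustrated with a minimal sketch. It assumes an 8:1:1 ratio and a seeded random shuffle; the text does not say how molecules were assigned to the three sets, and the function name `split_indices` is hypothetical.

```python
import numpy as np

def split_indices(n_samples, seed=0):
    """Shuffle sample indices and split them 8:1:1 (assumed ratio and method)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = n_samples * 8 // 10          # integer arithmetic avoids float rounding
    n_val = n_samples // 10
    train = order[:n_train]
    val = order[n_train:n_train + n_val]
    test = order[n_train + n_val:]         # remainder goes to the test set
    return train, val, test

train_idx, val_idx, test_idx = split_indices(133885)
print(len(train_idx), len(val_idx), len(test_idx))  # 107108 13388 13389
```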

With reference to FIG. 2, the specific process is as follows:

S1: extracting molecular features, constructing an atomic feature matrix and a graph adjacency matrix, and converting the molecular graph into a numerical vector carrying atomic information, chemical bond information and molecular structure information;

Feature data are acquired according to preset atomic feature parameters. In this embodiment, the atomic feature parameters include atom type, atomic number, acceptor, donor, aromaticity, orbital hybridization, hydrogen count and chemical bond type, as described in Table 1:

Table 1: Atomic feature table

Feature parameter        Description
Atom type                H, C, N, O, F, S, Cl, etc.
Atomic number            Number of protons
Acceptor                 Accepts electrons
Donor                    Donates electrons
Aromaticity              In an aromatic system
Orbital hybridization    sp, sp2, sp3
Hydrogen count           Number of attached H atoms
Chemical bond type       Single, double, triple, aromatic bonds

The atom type is the element of each atom in the molecular data and is one-hot encoded. The atomic number is the number of protons of the atom and is encoded as an integer. The acceptor and donor features indicate whether the atom is the electron-accepting or the electron-donating side in the molecular structure and are binary encoded. Aromaticity indicates whether the atom is part of an aromatic system and is binary encoded. The hydrogen count is the number of H atoms attached to the atom and is encoded as an integer. The chemical bond type describes the bonds attached to the atom, namely single, double, triple and aromatic bonds, and is one-hot encoded.
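As an illustration of these encodings, the following minimal sketch turns one atom into a raw feature vector. The dictionary keys, the concatenation order and the choice to set one indicator per bond type present on the atom are assumptions made for the example; the text only names the encoding used for each parameter.

```python
import numpy as np

# Category lists taken from Table 1; their order here is an assumption.
ATOM_TYPES = ["H", "C", "N", "O", "F", "S", "Cl"]
HYBRIDIZATIONS = ["sp", "sp2", "sp3"]
BOND_TYPES = ["single", "double", "triple", "aromatic"]

def one_hot(value, choices):
    """Return a 0/1 list with a 1 at the position of `value` (all zeros if absent)."""
    return [1 if value == c else 0 for c in choices]

def encode_atom(atom):
    """Concatenate the Table 1 encodings for one atom (hypothetical dict layout)."""
    bond_flags = [1 if b in atom["bond_types"] else 0 for b in BOND_TYPES]
    features = (
        one_hot(atom["symbol"], ATOM_TYPES)                 # atom type, one-hot
        + [atom["atomic_number"]]                           # integer
        + [int(atom["acceptor"]), int(atom["donor"]), int(atom["aromatic"])]  # binary
        + one_hot(atom["hybridization"], HYBRIDIZATIONS)    # one-hot
        + [atom["num_h"]]                                   # integer
        + bond_flags                                        # one indicator per bond type
    )
    return np.array(features, dtype=np.int64)

# Carbon atom of formaldehyde (H2C=O): sp2, two attached H, single and double bonds.
carbon = {
    "symbol": "C", "atomic_number": 6, "acceptor": False, "donor": False,
    "aromatic": False, "hybridization": "sp2", "num_h": 2,
    "bond_types": {"single", "double"},
}
vec = encode_atom(carbon)
print(vec.shape)  # (19,)
```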

The corresponding feature data are obtained according to the feature parameters and encoded with a hash algorithm to obtain an n × m node feature matrix, where n is the number of atomic nodes and m is the atomic feature dimension; the resulting node feature matrix is a binary (0, 1) matrix.

An n × n graph adjacency matrix is constructed from the molecular structure information in the feature data, where n is the number of atomic nodes. The element (i, j) is 1 if node i is adjacent to node j and 0 otherwise, so the graph adjacency matrix is a binary matrix with zeros on the diagonal.
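The two matrices of step S1 can be sketched as follows. The hashing scheme is an assumption: the text does not name the hash algorithm or say how hashed values map to columns, so this example hashes string tokens with SHA-256 and sets one bit per token in a row of length m. The helper names and the token format are hypothetical. The adjacency construction follows the description directly.

```python
import hashlib
import numpy as np

def hash_features(atom_tokens, m=64):
    """Map each atom's feature tokens to a row of an n x m binary matrix.

    Assumed scheme: each token sets the bit at (hash(token) mod m).
    """
    x = np.zeros((len(atom_tokens), m), dtype=np.int64)
    for row, tokens in enumerate(atom_tokens):
        for token in tokens:
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            col = int.from_bytes(digest[:8], "big") % m
            x[row, col] = 1
    return x

def adjacency_matrix(n_atoms, bonds):
    """Build the symmetric n x n binary adjacency matrix with a zero diagonal."""
    a = np.zeros((n_atoms, n_atoms), dtype=np.int64)
    for i, j in bonds:
        a[i, j] = 1
        a[j, i] = 1
    return a

# Water: atom 0 is O, atoms 1 and 2 are H; two O-H single bonds.
tokens = [
    ["symbol=O", "atomic_number=8", "num_h=2", "bond=single"],
    ["symbol=H", "atomic_number=1", "num_h=0", "bond=single"],
    ["symbol=H", "atomic_number=1", "num_h=0", "bond=single"],
]
x = hash_features(tokens, m=64)
a = adjacency_matrix(3, [(0, 1), (0, 2)])
print(x.shape, a.tolist())  # (3, 64) [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
```

With this scheme two different tokens can land in the same column, so a small m trades compactness against collisions.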

S2: constructing a graph convolution layer in the network model, wherein the graph convolution layer is a hidden layer over the nodes of the molecular graph that lets each node aggregate information from its neighboring nodes, with the following formula:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right)$$

wherein $H^{(l)}$ is the n × d matrix of the current hidden layer, n is the number of atomic nodes, d is the atomic feature dimension, $H^{(l+1)}$ is the matrix of the next hidden layer, $W^{(l)}$ is the hidden layer weight matrix, $\tilde{A} = A + I$ is the adjacency matrix with self-connections, formed by adding the identity matrix I to A, σ is a nonlinear activation function, and $\tilde{D}$ is the degree matrix of $\tilde{A}$, calculated as follows:

$$\tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}$$

In this embodiment, three graph convolution layers are used, so that an excessive number of hidden layers does not degrade the model training accuracy; $H^{(0)}$ is the atomic node feature matrix input to the first layer.
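A minimal NumPy sketch of this layer is given below. It assumes the symmetric normalization $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$ of the self-connected adjacency matrix, ReLU as the activation σ, and arbitrary layer widths; the function names are hypothetical and the weights are random rather than trained.

```python
import numpy as np

def normalize_adjacency(a):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a binary adjacency matrix A."""
    a_tilde = a + np.eye(a.shape[0])             # add self-connections
    degree = a_tilde.sum(axis=1)                 # D~_ii = sum_j A~_ij
    d_inv_sqrt = np.diag(1.0 / np.sqrt(degree))
    return d_inv_sqrt @ a_tilde @ d_inv_sqrt

def graph_conv(h, a_hat, w):
    """One graph convolution layer: ReLU(a_hat @ h @ w)."""
    return np.maximum(a_hat @ h @ w, 0.0)

# Three-atom graph in which atom 0 is bonded to atoms 1 and 2.
a = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
a_hat = normalize_adjacency(a)

rng = np.random.default_rng(0)
h = rng.integers(0, 2, size=(3, 4)).astype(float)   # stand-in for H^(0)
dims = [4, 8, 8, 8]                                  # assumed widths, 3 layers
for d_in, d_out in zip(dims[:-1], dims[1:]):
    w = rng.standard_normal((d_in, d_out))
    h = graph_conv(h, a_hat, w)
print(h.shape)  # (3, 8)
```

Because `a_hat` mixes each row of `h` with the rows of its neighbors, stacking three layers lets every atom see atoms up to three bonds away.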

S3: constructing a node linear layer, and applying a node-level linear transformation followed by activation to the convolved atomic feature matrix using a fully connected neural network, with the following formula:

$$H^{(\text{node MLP})} = \sigma\left(H^{(\text{Conv})} W + B\right)$$

wherein $H^{(\text{node MLP})}$ is the node linear layer output, $H^{(\text{Conv})}$ is the graph convolution layer output, W is the linear layer weight matrix, B is the bias matrix, and σ is a nonlinear activation function.

In this embodiment, the node linear layer has 1 layer, the initial values of the linear layer weight matrix W are random numbers drawn from a standard normal distribution, and the nonlinear activation function is ReLU, Softmax, or the like.
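The node linear layer can be sketched in a few lines. The example assumes ReLU as σ, a zero initial bias and arbitrary dimensions (5 atoms, 8 input and 16 output features); only the standard normal initialization of W comes from the text.

```python
import numpy as np

def node_linear(h_conv, w, b):
    """Node-level fully connected layer: ReLU(h_conv @ w + b), applied per atom row."""
    return np.maximum(h_conv @ w + b, 0.0)

rng = np.random.default_rng(0)
h_conv = rng.random((5, 8))            # stand-in for the graph convolution output
w = rng.standard_normal((8, 16))       # weights drawn from a standard normal
b = np.zeros((1, 16))                  # assumed zero initial bias, broadcast over rows
h_node = node_linear(h_conv, w, b)
print(h_node.shape)  # (5, 16)
```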

S4: constructing a pooling layer, pooling the feature matrix of the molecule, and extracting the feature vector of the molecule;

The molecular feature matrix output by the node linear layer is pooled; the pooling may compute the average, the maximum, or a similar statistic over the node vectors. In this embodiment, the Q × E molecular feature matrix output by the node linear layer (one row per node) is sum-pooled: each column is summed to obtain a 1 × E matrix.
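The pooling step reduces a Q × E matrix to a 1 × E vector. The sketch below shows the sum pooling used in this embodiment together with the mean and max alternatives mentioned in the text; the function name `pool_nodes` and its `mode` argument are hypothetical.

```python
import numpy as np

def pool_nodes(h_node, mode="sum"):
    """Reduce a Q x E node feature matrix to a 1 x E molecular feature vector."""
    if mode == "sum":
        return h_node.sum(axis=0, keepdims=True)
    if mode == "mean":
        return h_node.mean(axis=0, keepdims=True)
    if mode == "max":
        return h_node.max(axis=0, keepdims=True)
    raise ValueError(f"unknown pooling mode: {mode}")

h_node = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])   # Q = 3 nodes, E = 2
g = pool_nodes(h_node)
print(g)  # [[ 9. 12.]]
```

Sum, mean and max all give the same result when the rows are reordered, which is what makes the pooled vector independent of the atom ordering.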

S5: constructing a molecular graph linear layer, wherein the molecular graph linear layer is built with a fully connected neural network and is a hidden layer in the neural network model; it applies a linear transformation followed by activation to the pooled molecular feature vector matrix and predicts the molecular performance, with the following formula:

$$H_{l+1}^{(\text{graph MLP})} = \sigma\left(H_{l}^{(\text{graph MLP})} W + B\right)$$

wherein $H_{l}^{(\text{graph MLP})}$ is the current linear hidden layer, $H_{l+1}^{(\text{graph MLP})}$ is the next linear hidden layer, W is the linear layer weight matrix, B is the bias matrix, and σ is a nonlinear activation function.

In this embodiment, the molecular graph linear layer has 3 layers; $H_{0}^{(\text{graph MLP})}$ is the molecular feature vector matrix input to the first layer, and the last layer outputs the predicted molecular performance, with dimension 1.
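A sketch of a three-layer graph-level MLP with a one-dimensional output follows. Two choices here are assumptions rather than details from the text: the hidden widths (32 and 16) and the omission of the activation on the last layer, which is a common choice for regression so that the prediction can take negative values.

```python
import numpy as np

def graph_mlp(g, weights, biases):
    """Apply stacked linear layers; ReLU on all but the last layer (assumption)."""
    h = g
    last = len(weights) - 1
    for k, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if k < last:
            h = np.maximum(h, 0.0)
    return h

rng = np.random.default_rng(0)
dims = [16, 32, 16, 1]                 # 3 layers, final output dimension 1
weights = [rng.standard_normal((i, o)) for i, o in zip(dims[:-1], dims[1:])]
biases = [np.zeros((1, o)) for o in dims[1:]]
g = rng.random((1, 16))                # stand-in for the pooled 1 x E vector
y_pred = graph_mlp(g, weights, biases)
print(y_pred.shape)  # (1, 1)
```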

In this embodiment, the root mean square error (RMSE) is used as the loss function to measure the error during model training, and the model parameters are updated by back propagation using an Autogard optimizer. The loss function is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - p_i\right)^2}$$

wherein y is the actual labeled performance, p is the predicted performance, and N is the number of samples.
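The RMSE loss, in its standard definition over N samples with labels y and predictions p, can be computed as in this short sketch; the function name `rmse` is hypothetical.

```python
import numpy as np

def rmse(y, p):
    """Root mean square error between labels y and predictions p."""
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sqrt(np.mean((y - p) ** 2)))

# One of three predictions is off by 2, so RMSE = sqrt(4 / 3).
loss = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
print(round(loss, 4))  # 1.1547
```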

The minimum of the loss function is sought by gradient descent, and the model from the epoch with the lowest loss on the validation set is taken as the final output model. The molecular performance prediction of this final model, which is based on the graph convolution molecular fingerprint of this embodiment, is compared with that of a neural network taking the ECFP fingerprint as input; the molecular performance and prediction accuracy are shown in Table 2 below:

Table 2: Performance prediction comparison table

The invention is not limited to the foregoing embodiments. The invention extends to any novel feature or any novel combination of features disclosed in this specification and any novel method or process steps or any novel combination of features disclosed.
