Deep community discovery method fusing node attributes
1. A deep community discovery method fusing node attributes is characterized by comprising the following steps:
the first step: constructing a modularity matrix: the larger the value of the modularity, the clearer the community structure and the better the community division; the community structure of the network is obtained by maximizing the modularity;
the second step: constructing a deep autoencoder to capture the network structure: by reconstructing the modularity matrix, the nonlinear community structure of the network is preserved in the output H of the last hidden layer;
the third step: combining node attribute information: a penalty is imposed when nodes with the same attributes are divided into different communities, and community discovery is carried out using the fused link relation data and node content data.
2. The method of claim 1, wherein the specific process of constructing the modularity matrix is as follows:
the modularity is defined as the fraction of edges that fall within communities minus the expected fraction of edges between any two nodes under the same community assignment when edges are placed at random:

Q = (1/(2m)) Σ_{i,j} [A_ij − w_i·w_j/(2m)] δ(c_i, c_j)

where m is the total number of edges in the network, A_ij is the adjacency matrix, w_i·w_j/(2m) is the expected number of edges between nodes v_i and v_j if the edges of the network were placed at random, w_i (w_j) is the degree of node v_i (v_j), and δ(c_i, c_j) is the Kronecker delta function; introducing a modularity matrix B ∈ R^{n×n}, where n is the number of nodes in the network and R is the set of real numbers, with elements B_ij = A_ij − w_i·w_j/(2m), the modularity can be written, up to a constant factor, as:
Q = tr(H^T B H)   s.t.   tr(H^T H) = n    (6)
the matrix H is the community indicator matrix: for each row of H, the column with the maximum value is the community to which the corresponding node belongs, H_ij represents the probability that node i belongs to community j, and H^T is the transpose of matrix H.
3. The method of claim 2, wherein the specific process of constructing the deep autoencoder to capture the network structure comprises:
the goal of the autoencoder is to minimize the reconstruction error between the output data and the input data, so that the last hidden layer preserves the features of the original input data to the greatest extent; the modularity matrix is taken as the input of the deep autoencoder, and the nonlinear structure in the modularity matrix is captured through reconstruction:

min_θ ||B − B̂||_F²

where θ = {W^(1), W^(2), b^(1), b^(2)} is the parameter set, B is the modularity matrix with elements B_ij = A_ij − w_i·w_j/(2m), and B̂ is the reconstruction produced by the decoder; by reconstructing the modularity matrix, the nonlinear community structure of the network is preserved in the output H of the last hidden layer.
4. The method of claim 3, wherein the specific process of combining node attribute information is as follows:
suppose two nodes v_i and v_j have a high content-attribute similarity s_ij; then they have a high probability of belonging to the same community, and the community indicator vectors of the two nodes v_i and v_j should be similar;
constructing an attribute similarity matrix S, where the attribute similarity between nodes v_i and v_j is denoted s_ij and is computed with the cosine similarity:

s_ij = (t_i · t_j) / (||t_i|| · ||t_j||)

where t_i is the i-th row of the node attribute matrix and represents the attribute feature vector of node v_i;
for each node v_i, searching for the k nodes whose attribute similarity to v_i is highest; if node v_i is a k-neighbor of node v_j, then node v_j should also be a k-neighbor of node v_i, so that the attribute neighbor graph is symmetric; keeping in the similarity matrix the similarity values s_ij between node v_i and its k-neighbors and setting the elements corresponding to all other non-neighbor nodes to 0;
after the attribute similarity matrix S is obtained, a graph regularization term is introduced into the autoencoder to fuse the attribute information; it is assumed that if nodes v_i and v_j have a high attribute similarity s_ij, then their embedding vectors h_i and h_j should also be similar:

L_reg = (1/2) Σ_{i,j} s_ij ||h_i − h_j||² = tr(H^T L H)

where the Laplacian matrix L = D − S and D is the diagonal degree matrix with D_ii = Σ_j s_ij; the graph regularization term is merged into the reconstruction loss function to obtain the final loss function of SADA:

L_SADA = ||B − B̂||_F² + α · tr(H^T L H)

where α is a parameter controlling the weight of the regularization term; the matrix H is obtained through optimization, and for each row of H the community corresponding to the maximum value is the community to which the node belongs, yielding the final community structure.
Background
The community structure is an important structural feature widely present in networks: connections among nodes within a community are dense, while connections among nodes in different communities are sparse. Community discovery is the process of mining the community structure hidden in network data from a mesoscopic perspective by analyzing the interactions and latent information between nodes in the network. Community discovery provides an effective tool for exploring the potential characteristics of complex networks and has important theoretical and practical significance for understanding the network organization structure, analyzing the latent characteristics of the network, and discovering hidden rules and interaction patterns in the network. Node attributes are important information in a network, and combining them with the topology helps to mine a more accurate community structure.
Although many community detection methods have been proposed and achieve reasonable results, three major challenges remain. First, stochastic models and modularity maximization models are linear models and can only capture the linear structure of a network, whereas real-world network structure has proven to be complex and is best regarded as highly nonlinear. Second, computing eigenvalues requires a large amount of computational space, so scalability is a major bottleneck. Third, how to efficiently integrate different types of information to detect communities remains unsolved: most algorithms use only topology information and ignore important attribute information. Real-world networks carry rich attribute information on each node, and these attributes can be used to improve the effectiveness of community detection; adding attribute information complements the topology and alleviates the problem of network sparsity.
Disclosure of Invention
The invention aims to provide a deep community discovery method fusing node attributes, which utilizes a deep neural network to mine a nonlinear structure and combines node attribute information to obtain a more accurate community structure.
A deep community discovery method fusing node attributes comprises the following steps:
the first step: constructing a modularity matrix: the larger the value of the modularity, the clearer the community structure and the better the community division; the community structure of the network is obtained by maximizing the modularity;
the second step: constructing a deep autoencoder to capture the network structure: by reconstructing the modularity matrix, the nonlinear community structure of the network is preserved in the output H of the last hidden layer;
the third step: combining node attribute information: a penalty is imposed when nodes with the same attributes are divided into different communities, and community discovery is carried out using the fused link relation data and node content data.
Preferably, the specific process of constructing the modularity matrix in the present invention is as follows:
the modularity is defined as the fraction of edges that fall within communities minus the expected fraction of edges between any two nodes under the same community assignment when edges are placed at random:

Q = (1/(2m)) Σ_{i,j} [A_ij − w_i·w_j/(2m)] δ(c_i, c_j)

where m is the total number of edges in the network, A_ij is the adjacency matrix, w_i·w_j/(2m) is the expected number of edges between nodes v_i and v_j if the edges of the network were placed at random, w_i (w_j) is the degree of node v_i (v_j), and δ(c_i, c_j) is the Kronecker delta function; introducing a modularity matrix B ∈ R^{n×n}, where n is the number of nodes in the network and R is the set of real numbers, with elements B_ij = A_ij − w_i·w_j/(2m), the modularity can be written, up to a constant factor, as:
Q = tr(H^T B H)   s.t.   tr(H^T H) = n    (6)
the matrix H is the community indicator matrix: for each row of H, the column with the maximum value is the community to which the corresponding node belongs, H_ij represents the probability that node i belongs to community j, and H^T is the transpose of matrix H.
Preferably, the specific process of constructing the deep autoencoder to capture the network structure in the present invention is as follows:
the goal of the autoencoder is to minimize the reconstruction error between the output data and the input data, so that the last hidden layer preserves the features of the original input data to the greatest extent; the invention takes the modularity matrix as the input of the deep autoencoder and captures the nonlinear structure in the modularity matrix through reconstruction:

min_θ ||B − B̂||_F²

where θ = {W^(1), W^(2), b^(1), b^(2)} is the parameter set, B is the modularity matrix with elements B_ij = A_ij − w_i·w_j/(2m), and B̂ is the reconstruction produced by the decoder; by reconstructing the modularity matrix, the nonlinear community structure of the network is preserved in the output H of the last hidden layer.
Preferably, the specific process of combining the node attribute information in the present invention is as follows:
suppose two nodes v_i and v_j have a high content-attribute similarity s_ij; then they have a high probability of belonging to the same community, and the community indicator vectors of the two nodes v_i and v_j should be similar;
constructing an attribute similarity matrix S, where the attribute similarity between nodes v_i and v_j is denoted s_ij and is computed with the cosine similarity:

s_ij = (t_i · t_j) / (||t_i|| · ||t_j||)

where t_i is the i-th row of the node attribute matrix and represents the attribute feature vector of node v_i;
for each node v_i, searching for the k nodes whose attribute similarity to v_i is highest; if node v_i is a k-neighbor of node v_j, then node v_j should also be a k-neighbor of node v_i, so that the attribute neighbor graph is symmetric; keeping in the similarity matrix the similarity values s_ij between node v_i and its k-neighbors and setting the elements corresponding to all other non-neighbor nodes to 0;
after the attribute similarity matrix S is obtained, a graph regularization term is introduced into the autoencoder to fuse the attribute information; it is assumed that if nodes v_i and v_j have a high attribute similarity s_ij, then their embedding vectors h_i and h_j should also be similar:

L_reg = (1/2) Σ_{i,j} s_ij ||h_i − h_j||² = tr(H^T L H)

where the Laplacian matrix L = D − S and D is the diagonal degree matrix with D_ii = Σ_j s_ij; the graph regularization term is merged into the reconstruction loss function to obtain the final loss function of SADA:

L_SADA = ||B − B̂||_F² + α · tr(H^T L H)

where α is a parameter that controls the weight of the regularization term; the matrix H is obtained through optimization, and for each row of H the community corresponding to the maximum value is the community to which the node belongs, yielding the final community structure.
Over the past several decades, many community discovery algorithms have been proposed. Among them, stochastic models and modularity-based models are particularly popular. The essence of a stochastic model is to map the network into a low-dimensional space and then detect the community structure in that latent space. A modularity-based model obtains a structural feature representation by eigendecomposition of the modularity matrix and then obtains the community structure by applying the K-means clustering algorithm to the eigenvectors corresponding to these eigenvalues, which is equivalent to a low-rank reconstruction of the modularity matrix. Although stochastic models and modularity-based models work differently, they essentially both map the network into a low-dimensional space and then cluster the nodes in the new space to obtain the community structure. The autoencoder, as an unsupervised deep neural network model, aims to minimize the reconstruction error between output data and input data and can maximally retain the characteristics of the original data at the last hidden layer. Both the autoencoder and the modularity-based model aim to obtain a low-dimensional approximation of the corresponding matrix. However, the autoencoder can capture the nonlinear relationships between nodes and has lower complexity. In order to fully exploit the advantages of the autoencoder, the invention provides a community discovery method based on a deep autoencoder that simultaneously fuses structure and node attribute information: a deep neural network is used to mine the nonlinear structure, and node attribute information is combined to obtain a more accurate community structure.
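To make the modularity-based baseline mentioned above concrete, a minimal sketch is given below. It assumes NumPy and scikit-learn; the function name and the choice of retaining one eigenvector per community are illustrative assumptions, not part of the invention.

```python
import numpy as np
from sklearn.cluster import KMeans

def modularity_eig_kmeans(B: np.ndarray, n_communities: int) -> np.ndarray:
    # B is symmetric, so eigh returns real eigenvalues in ascending order.
    vals, vecs = np.linalg.eigh(B)
    top = vecs[:, -n_communities:]   # eigenvectors of the largest eigenvalues (low-rank reconstruction)
    return KMeans(n_clusters=n_communities, n_init=10).fit_predict(top)
```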
Drawings
FIG. 1 is a comparison of the community discovery results of different algorithms on an artificial data set.
Detailed Description
The goal of the autoencoder is to reconstruct the original input so that the output is as close as possible to the input. In this way, the output of the hidden layer can be viewed as a low-dimensional representation of the original data that extracts, to the greatest extent, the features contained in the original data. The autoencoder comprises two symmetric components: an encoder and a decoder. A basic autoencoder can be seen as a three-layer neural network consisting of an input layer, a hidden layer and an output layer.
Given input data x_i, the encoder maps the original data x_i to the hidden-layer output encoding h_i, and h_i can be regarded as a low-dimensional embedding of x_i:
h_i = σ(W^(1) x_i + b^(1))    (1)
the decoder then reconstructs the input data, where x̂_i is the reconstructed output data:

x̂_i = σ(W^(2) h_i + b^(2))
the input data is encoded and decoded to obtain a reconstructed representation of the input data. Wherein θ ═ W(1),W(2),b(1),b(2)Is a parameter set, W(1),W(2)Weight matrices for the encoder and decoder, respectively, b(1),b(2)The bias vectors for the encoder and decoder, respectively. σ (-) is a non-linear activation function, e.g. sigmoid functiontan h functionAnd the like. The self-encoder derives a characterization of the input data by minimizing the error between the input data and the reconstructed data:
in order to capture the high nonlinearity of the topological structure and the node attribute, the section combines a plurality of nonlinear functions into an encoder and a decoder, and learns the characteristics of different abstraction levels by carrying out multi-layer abstraction learning on data.
Wherein K represents the number of hidden layers,is a low-dimensional representation of the feature of node i.
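As a concrete illustration of the stacked encoder–decoder just described, a minimal PyTorch sketch follows. The layer sizes, the sigmoid activations, and the linear final decoder layer (chosen here so that the reconstruction can take the negative values that a modularity matrix contains) are assumptions for illustration, not details prescribed by the description.

```python
import torch
import torch.nn as nn

class StackedAutoencoder(nn.Module):
    def __init__(self, dims):
        # dims: e.g. [n, 256, 64, c] -- input dimension followed by hidden sizes;
        # the last entry is the dimension of the embedding H.
        super().__init__()
        enc = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            enc += [nn.Linear(d_in, d_out), nn.Sigmoid()]   # h^(k) = sigma(W^(k) h^(k-1) + b^(k))
        dec = []
        rev = list(reversed(dims))
        for i, (d_in, d_out) in enumerate(zip(rev[:-1], rev[1:])):
            dec.append(nn.Linear(d_in, d_out))              # mirror-image decoder
            if i < len(rev) - 2:                            # keep the last decoder layer linear so the
                dec.append(nn.Sigmoid())                    # reconstruction can take negative values
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec)

    def forward(self, x):
        h = self.encoder(x)       # output of the last hidden layer (low-dimensional representation)
        x_hat = self.decoder(h)   # reconstruction of the input
        return h, x_hat
```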
Further, the detailed process of the community discovery method of the present invention, which is based on network representation learning, can be described as follows:
the first step is as follows: constructing a modularity matrix
The modularity is defined as the fraction of edges that fall within communities minus the expected fraction of edges between any two nodes under the same community assignment when edges are placed at random:

Q = (1/(2m)) Σ_{i,j} [A_ij − w_i·w_j/(2m)] δ(c_i, c_j)

where m is the total number of edges in the network, A_ij is the adjacency matrix, w_i·w_j/(2m) is the expected number of edges between nodes v_i and v_j if the edges of the network were placed at random, w_i (w_j) is the degree of node v_i (v_j), and δ(c_i, c_j) is the Kronecker delta function. By introducing the modularity matrix B ∈ R^{n×n} with elements B_ij = A_ij − w_i·w_j/(2m), the modularity can be written, up to a constant factor, as:
Q = tr(H^T B H)   s.t.   tr(H^T H) = n    (6)
the community corresponding to the maximum value of each row of the matrix H is the community to which the node belongs, HijRepresenting the probability of the node i belonging to the community j. The larger the value of modularity is, the clearer the community structure is, and the better the community division is. The community structure of the network can be obtained by maximizing the modularity.
The modularity maximization problem can therefore be translated into reconstructing the modularity matrix with a low-rank approximation, and in matrix reconstruction eigenvalue decomposition is closely related to the autoencoder. Consequently, by using the modularity matrix B as the input of the autoencoder, an ideal network representation embedding space can be obtained at the last hidden layer.
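As a concrete illustration of this first step, the modularity matrix can be assembled directly from the adjacency matrix. The sketch below is a hedged example assuming an undirected network given as a dense NumPy array; the helper name modularity_matrix is hypothetical.

```python
import numpy as np

def modularity_matrix(A: np.ndarray) -> np.ndarray:
    # B_ij = A_ij - w_i * w_j / (2m) for an undirected, symmetric adjacency matrix A.
    w = A.sum(axis=1)          # node degrees w_i
    two_m = A.sum()            # equals 2m when A is symmetric and unweighted
    return A - np.outer(w, w) / two_m
```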
The second step: building a deep autoencoder to capture the network structure
The goal of the autoencoder is to minimize the reconstruction error between the output data and the input data, so that the last hidden layer preserves the features of the original input data to the greatest extent. The invention takes the modularity matrix as the input of the deep autoencoder and captures the nonlinear structure in the modularity matrix through reconstruction:

min_θ ||B − B̂||_F²

where θ = {W^(1), W^(2), b^(1), b^(2)} is the parameter set, B is the modularity matrix with elements B_ij = A_ij − w_i·w_j/(2m), and B̂ is the reconstruction produced by the decoder. By reconstructing the modularity matrix, the nonlinear community structure of the network is preserved in the output H of the last hidden layer.
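A hedged training sketch of this reconstruction step follows. It reuses the hypothetical modularity_matrix and StackedAutoencoder helpers from the sketches above; the toy graph, optimizer, learning rate, epoch count and embedding size are illustrative assumptions only.

```python
import numpy as np
import torch

# Toy symmetric adjacency matrix, used only so the sketch runs end to end.
rng = np.random.default_rng(0)
A = np.triu((rng.random((128, 128)) < 0.1).astype(float), 1)
A = A + A.T

B = torch.tensor(modularity_matrix(A), dtype=torch.float32)
model = StackedAutoencoder([B.shape[1], 256, 64, 4])   # 4 = assumed number of communities
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for _ in range(200):
    H, B_hat = model(B)                    # H: output of the last hidden layer
    loss = torch.sum((B - B_hat) ** 2)     # reconstruction error ||B - B_hat||_F^2
    opt.zero_grad()
    loss.backward()
    opt.step()
```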
The third step: combining node attribute information;
the invention assumes that if two nodes v are presentiAnd vjWith a high degree of similarity s in content attributesijThen they have a high probability of belonging to the same community. Their community indication matrix vectors should be similar. Based on this assumption, the present invention incorporates node attribute information into the autoencoder by constructing pairwise constraints on the nodes and introducing a new graph regularization. Heuristic of the Laplacian feature mapping method, when having the same attributesThe nodes are divided into different communities and a penalty is enforced. The framework therefore utilizes the fused link relationship data and the node content data simultaneously for community discovery.
First, an attribute similarity matrix S is constructed, where the attribute similarity between nodes v_i and v_j is denoted s_ij and is computed with the cosine similarity:

s_ij = (t_i · t_j) / (||t_i|| · ||t_j||)

where t_i is the i-th row of the node attribute matrix and represents the attribute feature vector of node v_i.
Based on this similarity, an attribute neighbor graph is constructed using k-neighbor consistency to capture the manifold structure of the node content space. For each node v_i, the k nodes whose attribute similarity to v_i is highest are first found. If node v_i is a k-neighbor of node v_j, then node v_j should also be a k-neighbor of node v_i, so the attribute neighbor graph is symmetric. Finally, the similarity values s_ij between node v_i and its k-neighbors are kept in the similarity matrix, and the elements corresponding to all other non-neighbor nodes are set to 0. The improved attribute similarity not only reflects the attribute similarity between nodes but also preserves the manifold structure among them.
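A hedged NumPy sketch of this symmetric k-neighbor attribute similarity matrix is given below. The helper name, the small constant added for numerical safety, and the use of the union of neighborhoods to enforce symmetry are illustrative assumptions.

```python
import numpy as np

def attribute_similarity(T: np.ndarray, k: int) -> np.ndarray:
    # T: node attribute matrix, row t_i is the attribute vector of node v_i.
    norms = np.linalg.norm(T, axis=1, keepdims=True)
    S_full = (T @ T.T) / (norms * norms.T + 1e-12)    # cosine similarity s_ij
    np.fill_diagonal(S_full, 0.0)
    n = T.shape[0]
    S = np.zeros_like(S_full)
    idx = np.argsort(-S_full, axis=1)[:, :k]          # k most similar nodes per row
    rows = np.repeat(np.arange(n), k)
    S[rows, idx.ravel()] = S_full[rows, idx.ravel()]  # keep only k-neighbor similarities
    return np.maximum(S, S.T)                         # symmetrize: if v_j is a neighbor of v_i,
                                                      # v_i is also treated as a neighbor of v_j
```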
After the attribute similarity matrix S is obtained, a graph regularization term is introduced into the autoencoder to fuse the attribute information. It is assumed that if nodes v_i and v_j have a high attribute similarity s_ij, then their embedding vectors h_i and h_j should also be similar:

L_reg = (1/2) Σ_{i,j} s_ij ||h_i − h_j||² = tr(H^T L H)

where the Laplacian matrix L = D − S and D is the diagonal degree matrix with D_ii = Σ_j s_ij. Finally, the graph regularization term is merged into the reconstruction loss function to obtain the final loss function of SADA:

L_SADA = ||B − B̂||_F² + α · tr(H^T L H)

where α is a parameter that controls the weight of the regularization term. The matrix H is obtained through optimization, and for each row of H the community corresponding to the maximum value is the community to which the node belongs, yielding the final community structure.
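Putting the pieces together, a hedged end-to-end sketch of the combined objective is shown below. It reuses the hypothetical modularity_matrix, attribute_similarity and StackedAutoencoder helpers from the earlier sketches; the default values of k and α, the layer sizes, the epoch count and the optimizer are assumptions, and the community of each node is read off as the argmax of its row of H, as stated in the text.

```python
import numpy as np
import torch

def train_sada(A, T, n_communities, k=10, alpha=0.1, epochs=200, lr=1e-3):
    # Assemble the quantities defined in the text: modularity matrix B,
    # attribute similarity matrix S, and the graph Laplacian L = D - S.
    B = torch.tensor(modularity_matrix(A), dtype=torch.float32)
    S = attribute_similarity(T, k)
    L = torch.tensor(np.diag(S.sum(axis=1)) - S, dtype=torch.float32)

    model = StackedAutoencoder([A.shape[0], 256, 64, n_communities])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        H, B_hat = model(B)
        rec = torch.sum((B - B_hat) ** 2)   # reconstruction term ||B - B_hat||_F^2
        reg = torch.trace(H.T @ L @ H)      # graph regularization tr(H^T L H)
        loss = rec + alpha * reg
        opt.zero_grad()
        loss.backward()
        opt.step()

    H, _ = model(B)
    return H.argmax(dim=1).detach().numpy() # community of each node = argmax of its row of H
```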
The present invention will be described in further detail with reference to the following experiments.
We used the artificial networks proposed by Girvan and Newman to evaluate the effectiveness of the present invention. Each artificial network consists of 128 nodes divided into 4 communities of 32 nodes each. Every node has an average degree of 16, of which Z edges connect to nodes in other communities. The higher the value of Z, the fuzzier the community structure and the greater the difficulty of community discovery. Experiments were carried out on artificial networks with different Z values, using normalized mutual information (NMI) as the index for measuring the performance of community discovery algorithms. FIG. 1 compares the community discovery results on the artificial data set with those of other community discovery algorithms. The experiments show that the method obtains the best community discovery result in all cases; in particular, when the Z value increases and community discovery becomes harder, the method can still recover a clear community structure.
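For completeness, the NMI index used above can be computed with scikit-learn as sketched below. This is a toy example: labels_pred would in practice come from the hypothetical train_sada sketch, and no experimental numbers are implied.

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

labels_true = np.repeat(np.arange(4), 32)     # 4 planted communities of 32 nodes each
labels_pred = labels_true.copy()              # stand-in for the output of train_sada(...)
print(normalized_mutual_info_score(labels_true, labels_pred))   # 1.0 for a perfect match
```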