Multi-modal entity alignment method based on adaptive feature fusion
1. A multi-modal entity alignment method based on adaptive feature fusion is characterized by comprising the following steps:
step 1, acquiring data of two multi-modal knowledge graphs MG1 = (E1, R1, T1, I1) and MG2 = (E2, R2, T2, I2), wherein E represents a set of entities; R represents a relationship set; T represents a triple set, which is a subset of E × R × E; I represents a picture set associated with the entities;
step 2, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 3, respectively generating visual feature representations of respective entities in a visual feature processing module;
step 4, performing entity alignment by combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs through an adaptive feature fusion module;
the adaptive feature fusion module described in step 4, for each entity pair (e1, e2), e1 ∈ E1, e2 ∈ E2, computing Sim_s(e1, e2) and Sim_v(e1, e2) and predicting potential alignment entities by using the overall similarity score, wherein the similarity score is:
Sim(e1, e2) = Att_s·Sim_s(e1, e2) + Att_v·Sim_v(e1, e2),
wherein Sim_s and Sim_v respectively represent the similarities of the structural and visual feature representations of the entities, and Att_s and Att_v respectively represent the contribution rate weights of the structural feature representation and the visual feature representation;
Att_s = K / (1 + b·e^(−a·(degree + N_hop))),
Att_v = 1 − Att_s,
wherein K, b and a are hyper-parameters, degree represents the degree of an entity, and N_hop represents the degree of closeness of the entity to the seed entities:
N_hop = w1·N_hop^1 + w2·N_hop^2,
wherein N_hop^1 and N_hop^2 respectively represent the numbers of entities 1 hop and 2 hops away from the seed entities; w1 and w2 are hyper-parameters.
2. The multi-modal entity alignment method based on adaptive feature fusion as claimed in claim 1, wherein the visual feature processing module in step 3 comprises: step 301, generating picture-entity similarities by using the pre-trained image-text matching model CVSE; step 302, setting a similarity threshold to filter noise pictures; and step 303, giving corresponding weights to the pictures based on the picture-entity similarities to generate the visual feature representation of the entity.
3. The multi-modal entity alignment method based on adaptive feature fusion as claimed in claim 2, wherein in step 301, a pre-trained image-text matching model is used to calculate the similarity score of each picture in the entity picture set, a pre-trained consensus-aware visual semantic embedding model CVSE being adopted; the inputs of the CVSE model are the picture embedding p_i of entity e_i and the text information t_i, wherein the picture embedding p_i ∈ R^(n×36×2048), n is the number of pictures in the picture set corresponding to the entity, and 36×2048 is the feature vector dimension generated for each picture by the pre-trained target detection algorithm Faster-RCNN; the entity text information t_i input to the model is obtained by expanding the entity name into a sentence: t_i = {A photo of [Entity Name].}; then the picture embedding and the text information are fed into the CVSE model to obtain the similarity scores of the pictures in the entity picture set:
Sim_v = CVSE(p_i, t_i),
wherein the Softmax layer of the CVSE is removed; with the picture embedding p_i and the text information t_i as input, the model generates the similarity scores Sim_v ∈ R^n of the pictures, n being the number of pictures in the picture set corresponding to the entity;
in step 302, a similarity threshold α is set to filter the noise pictures:
set(i)' = { j' ∈ set(i) | Sim_v(j') > α },
wherein set(i) represents the initial picture set, set(i)' represents the picture set after noise filtering, and Sim_v(j') represents the similarity score of picture j' with the entity;
in step 303, a more accurate visual feature representation V_i of entity e_i is generated:
V_i = Σ_{j'=1}^{n'} Att_i(j')·Img(j'),
wherein V_i represents the visual feature of entity i; Img(j') is the image feature generated by the ResNet model for picture j'; n' is the number of pictures after noise removal; and Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
4. The multi-modal entity alignment method based on adaptive feature fusion as claimed in claim 2 or 3, wherein the structural feature learning module in step 2 captures entity adjacency structure information and generates the entity structural feature representation by using a graph convolutional neural network:
H^(l+1) = σ(D^(−1/2)·Â·D^(−1/2)·H^(l)·W^(l)),
wherein H^(l) and H^(l+1) respectively represent the feature matrices of the entity nodes at layer l and layer l+1; D^(−1/2)·Â·D^(−1/2) represents the normalized adjacency matrix, D is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, with A_ij = 1 if a relationship exists between entity i and entity j; I denotes the identity matrix; the activation function σ is set to ReLU; and W^(l) is the trainable parameter matrix of layer l;
since the entity structure vectors of different knowledge graphs are not in the same space, the entity structure vectors of the different knowledge graphs need to be mapped into the same space by using the known entity pairs S, the specific training goal being to minimize the following loss:
L = Σ_{(e1,e2)∈S} Σ_{(e1',e2')∈S'} ( d(e1, e2) + γ − d(e1', e2') )_+,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated by replacing e1 or e2 of a known seed entity pair (e1, e2) with a random entity; h_e represents the structure vector of entity e; d(e1, e2) = ‖h_e1 − h_e2‖_1 represents the Manhattan distance between entities e1 and e2; γ represents the margin separating positive and negative samples; and stochastic gradient descent is adopted for model optimization.
5. The method according to claim 4, wherein before the structural feature representation is obtained in step 2 and the visual feature representation is obtained in step 3, an unsupervised triple screening module is used to quantify the importance of the triples (h, r, t), wherein h represents a head entity, t represents a tail entity, and r represents a relationship, and part of the invalid triples are filtered out based on the importance scores.
6. The method according to claim 5, wherein in the triple screening module, a relation-entity graph, also called the relation dual graph of the knowledge graph, is first constructed with relations as nodes and entities as edges; the knowledge graph is defined as G_e = (V_e, E_e), wherein V_e is the set of entities and E_e is the set of relationships; the relation dual graph G_r = (V_r, E_r) takes relations as nodes, and an edge exists between two relation nodes if the two different relations are connected by the same entity, wherein V_r is the set of relation nodes and E_r is the set of edges; based on the relation dual graph, the PageRank algorithm is used to calculate a relation score:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbour relations of relation r; and L(v) represents the number of connections of relation v;
the triple scoring function is thus calculated:
Score(h,r,t) = PR(r),
and based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
Background
In recent years, knowledge graphs have become a widely used representation of structured data. A knowledge graph represents real-world knowledge or events in the form of triples and is widely used in various downstream artificial intelligence tasks. At present, multi-modal knowledge graphs are often constructed from limited data sources and suffer from information loss and low coverage, so the knowledge utilization rate is not high. Considering that manual completion of a knowledge graph is costly and inefficient, one feasible approach to improving coverage is to automatically integrate useful knowledge from other knowledge graphs. Entities serve as the hubs linking different knowledge graphs and are therefore essential for integrating multiple multi-modal knowledge graphs. The process of identifying entities that express the same meaning in different multi-modal knowledge graphs is referred to as multi-modal entity alignment.
Multi-modal entity alignment requires utilizing and fusing information from multiple modalities. However, existing multi-modal entity alignment methods encounter two bottlenecks. First, structural differences between graphs are difficult to handle. Based on the assumption that equivalent entities in different knowledge graphs usually have equivalent neighbour entities, current mainstream entity alignment methods mainly rely on the structural information of the knowledge graph. In the real world, however, different knowledge graphs may exhibit large structural differences owing to their different construction processes. For such problems, triples can be generated based on link prediction to enrich the structural information; although this alleviates structural diversity to a certain extent, the reliability of the generated triples must be considered, and completion is difficult when the numbers of triples differ by severalfold. Second, visual information is poorly utilized. Current automated methods for constructing multi-modal knowledge graphs typically supplement an existing knowledge graph with information of other modalities. To obtain visual information, these methods mainly use crawlers to fetch pictures related to an entity from the internet. Inevitably, some of the acquired pictures have low relevance to the entity, i.e., they are noise pictures. Current methods cannot distinguish noise pictures among the entity-related pictures, so part of the noise is mixed into the visual information of the entity, which in turn reduces the accuracy of entity alignment based on visual information.
Disclosure of Invention
The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention discloses a multi-modal entity alignment method based on adaptive feature fusion. Aiming at the currently poor utilization of the visual information of multi-modal knowledge graphs, the method uses a pre-trained image-text matching model to calculate entity-picture similarity scores, sets a similarity threshold to filter noise pictures, gives the pictures different weights based on similarity, and finally generates the visual feature representation of the entity. In addition, in order to capture the dynamically changing confidence of the structural information and to fully exploit the complementarity of different modal information, an adaptive feature fusion mechanism is designed that dynamically fuses the structural information and the visual information of an entity based on the entity degree and the distance between the entity and the seed entities; this mechanism can cope with the challenge that long-tail entities account for a large proportion of entities and their structural information is relatively scarce.
The technical scheme of the invention is that a multi-modal entity alignment method based on adaptive feature fusion comprises the following steps:
step 1, acquiring data of two multi-modal knowledge graphs MG1 = (E1, R1, T1, I1) and MG2 = (E2, R2, T2, I2), wherein E represents a set of entities; R represents a relationship set; T represents a triple set, which is a subset of E × R × E; I represents a picture set associated with the entities;
step 2, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 3, respectively generating visual feature representations of respective entities in a visual feature processing module;
step 4, in the adaptive feature fusion module, for each entity pair (e1, e2), e1 ∈ E1, e2 ∈ E2, computing Sim_s(e1, e2) and Sim_v(e1, e2) and predicting potential alignment entities by using the overall similarity score, wherein the similarity score is:
Sim(e1, e2) = Att_s·Sim_s(e1, e2) + Att_v·Sim_v(e1, e2),
wherein Sim_s and Sim_v respectively represent the similarities of the structural and visual feature representations of the entities, and Att_s and Att_v respectively represent the contribution rate weights of the structural feature representation and the visual feature representation;
Att_s = K / (1 + b·e^(−a·(degree + N_hop))),
Att_v = 1 − Att_s,
wherein K, b and a are hyper-parameters, degree represents the degree of an entity, and N_hop represents the degree of closeness of the entity to the seed entities:
N_hop = w1·N_hop^1 + w2·N_hop^2,
wherein N_hop^1 and N_hop^2 respectively represent the numbers of entities 1 hop and 2 hops away from the seed entities; w1 and w2 are hyper-parameters.
Specifically, the visual feature processing module in step 3 comprises: step 301, generating picture-entity similarities by using the pre-trained image-text matching model CVSE; step 302, setting a similarity threshold to filter noise pictures; and step 303, giving corresponding weights to the pictures based on the picture-entity similarities to generate the visual feature representation of the entity.
Further, in step 301, a pre-trained image-text matching model is used to calculate the similarity score of each picture in the entity picture set, a pre-trained consensus-aware visual semantic embedding model CVSE being adopted; the inputs of the CVSE model are the picture embedding p_i of entity e_i and the text information t_i, wherein the picture embedding p_i ∈ R^(n×36×2048), n is the number of pictures in the picture set corresponding to the entity, and 36×2048 is the feature vector dimension generated for each picture by the pre-trained target detection algorithm Faster-RCNN; the entity text information t_i input to the model is obtained by expanding the entity name into a sentence: t_i = {A photo of [Entity Name].}; then the picture embedding and the text information are fed into the CVSE model to obtain the similarity scores of the pictures in the entity picture set:
Sim_v = CVSE(p_i, t_i),
wherein the Softmax layer of the CVSE is removed; with the picture embedding p_i and the text information t_i as input, the model generates the similarity scores Sim_v ∈ R^n of the pictures, n being the number of pictures in the picture set corresponding to the entity;
in step 302, a similarity threshold α is set to filter the noise pictures:
set(i)' = { j' ∈ set(i) | Sim_v(j') > α },
wherein set(i) represents the initial picture set, set(i)' represents the picture set after noise filtering, and Sim_v(j') represents the similarity score of picture j' with the entity;
in step 303, a more accurate visual feature representation V_i of entity e_i is generated:
V_i = Σ_{j'=1}^{n'} Att_i(j')·Img(j'),
wherein V_i represents the visual feature of entity i; Img(j') is the image feature generated by the ResNet model for picture j'; n' is the number of pictures after noise removal; and Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
Specifically, the structural feature learning module in step 2 captures entity adjacent structure information by using a graph convolution neural network and generates an entity structural feature representation:
H^(l+1) = σ(D^(−1/2)·Â·D^(−1/2)·H^(l)·W^(l)),
wherein H^(l) and H^(l+1) respectively represent the feature matrices of the entity nodes at layer l and layer l+1; D^(−1/2)·Â·D^(−1/2) represents the normalized adjacency matrix, D is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, with A_ij = 1 if a relationship exists between entity i and entity j; I denotes the identity matrix; the activation function σ is set to ReLU; and W^(l) is the trainable parameter matrix of layer l;
since the entity structure vectors of different knowledge graphs are not in the same space, the entity structure vectors of the different knowledge graphs need to be mapped into the same space by using the known entity pairs S, the specific training goal being to minimize the following loss:
L = Σ_{(e1,e2)∈S} Σ_{(e1',e2')∈S'} ( d(e1, e2) + γ − d(e1', e2') )_+,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated by replacing e1 or e2 of a known seed entity pair (e1, e2) with a random entity; h_e represents the structure vector of entity e; d(e1, e2) = ‖h_e1 − h_e2‖_1 represents the Manhattan distance between entities e1 and e2; γ represents the margin separating positive and negative samples; and stochastic gradient descent is adopted for model optimization.
Further, before the structural feature representation is obtained in step 2 and the visual feature representation is obtained in step 3, an unsupervised triple screening module is used to quantify the importance of the triples (h, r, t), wherein h represents the head entity, t represents the tail entity, and r represents the relationship, and part of the invalid triples are filtered out based on the importance scores.
Specifically, in the triple screening module, a relation-entity graph, also called the relation dual graph of the knowledge graph, is first constructed with relations as nodes and entities as edges; the knowledge graph is defined as G_e = (V_e, E_e), wherein V_e is the set of entities and E_e is the set of relationships; the relation dual graph G_r = (V_r, E_r) takes relations as nodes, and an edge exists between two relation nodes if the two different relations are connected by the same entity, wherein V_r is the set of relation nodes and E_r is the set of edges; based on the relation dual graph, the PageRank algorithm is used to calculate a relation score:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbour relations of relation r; and L(v) represents the number of connections of relation v;
the triple scoring function is thus calculated:
Score(h,r,t) = PR(r),
and based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
Compared with the prior art, the method has the following advantages: aiming at the poor utilization of visual information, this work builds on a pre-trained image-text matching model, calculates entity-picture similarity scores, filters noise pictures, and obtains a more accurate entity visual feature representation based on the similarity scores; and an adaptive feature fusion mechanism is designed that fuses the structural and visual features of entities with variable attention, fully exploiting the complementarity of multi-modal information and improving the alignment effect.
Drawings
FIG. 1 shows a schematic flow diagram of an embodiment of the invention;
FIG. 2 illustrates a multi-modal entity alignment framework diagram of an embodiment of the present invention;
FIG. 3 shows a schematic flow chart of a visual feature processing module according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention will be described in further detail with reference to the accompanying drawings, and it is apparent that the described embodiments are only a part of the embodiments of the present invention, not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
FIG. 1 shows a multi-modal entity alignment method based on adaptive feature fusion, comprising the following steps:
step 1, acquiring data of two multi-modal knowledge graphs;
step 2, in a structural feature learning module, respectively learning the structure vectors of the entities of the two multi-modal knowledge graphs by using a graph convolutional neural network to generate structural feature representations of the respective entities;
step 3, respectively generating visual feature representations of the respective entities in a visual feature processing module;
and step 4, combining the entity structural features and the entity visual features of the two multi-modal knowledge graphs to perform entity alignment through the adaptive feature fusion module.
Multi-modal knowledge graphs typically contain information of multiple modalities. Without loss of generality, this work focuses only on the structural and visual information of the knowledge graph. Given two multi-modal knowledge graphs MG1 = (E1, R1, T1, I1) and MG2 = (E2, R2, T2, I2), wherein E represents a set of entities, R represents a relationship set, T represents a triple set, which is a subset of E × R × E, and I represents a picture set associated with the entities, the seed entity pair set S denotes the set of aligned entity pairs used for training. The multi-modal entity alignment task aims to find new entity pairs using the known entity pair information and to predict the potential alignment results P = {(e1, e2) | e1 ∈ E1, e2 ∈ E2, e1 = e2}, where the equal sign indicates that the two entities point to the same object in the real world.
Given an entity, the process of finding its corresponding entity in another knowledge graph can be regarded as a ranking problem. That is, in a certain feature space, the similarity (distance) between the given entity and all entities in the other knowledge graph is calculated to produce an ordering, and the entity with the highest similarity (smallest distance) is taken as the alignment result.
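By way of illustration, a minimal Python sketch of this ranking step follows (it assumes a precomputed similarity matrix; all names are illustrative and not part of the claimed method):

    import numpy as np

    def predict_alignment(sim_matrix):
        # sim_matrix[i, j]: similarity between entity i of one knowledge graph
        # and entity j of the other. For each source entity, candidates are
        # ranked by descending similarity; the top-ranked candidate is the
        # predicted alignment.
        return np.argmax(sim_matrix, axis=1)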
As shown in FIG. 2, the present invention first designs a multi-modal entity alignment framework: a graph convolutional neural network is used to learn the structure vectors of entities and generate entity structural features; a visual feature processing module is designed to generate entity visual features; and the information of the two modalities is then combined to perform entity alignment based on an adaptive feature fusion mechanism. In addition, in order to mitigate the structural differences between knowledge graphs, this embodiment designs a triple screening mechanism that integrates the relation scores and the entity degrees to filter out part of the triples. In FIG. 2, MG1 and MG2 represent different multi-modal knowledge graphs; KG1 and KG2 represent knowledge graphs, and KG1' represents the knowledge graph after processing by the triple screening module.
The visual feature processing module: in order to address the poor utilization of visual information in multi-modal entity alignment methods, and inspired by image-text matching models, this work designs a visual feature processing module that generates more accurate visual features for entities to assist entity alignment. FIG. 3 details the generation of entity visual features. In the absence of supervision data, the pre-trained image-text matching model CVSE is adopted to generate picture-entity similarities; a similarity threshold is then set to filter noise pictures; corresponding weights are assigned to the pictures based on the similarity scores, and the visual feature representation of the entity is finally generated.
Calculating picture-entity similarity scores. This step uses a pre-trained image-text matching model to calculate a similarity score for each picture in the entity picture set. The pre-trained consensus-aware visual semantic embedding model CVSE (Consensus-aware Visual Semantic Embedding) is adopted, with model parameters obtained by training on the MSCOCO and Flickr30k datasets. The model inputs are the picture embedding p_i of entity e_i and the text information t_i, wherein the picture embedding p_i ∈ R^(n×36×2048), n is the number of pictures in the picture set corresponding to the entity, and 36×2048 is the feature vector dimension generated for each picture by the pre-trained target detection algorithm Faster-RCNN. The entity text information t_i input to the model is obtained by expanding the entity name [Entity Name] into a sentence: t_i = {A photo of [Entity Name].}
Then, the picture embedding and the text information are fed into the CVSE model, and the similarity scores of the pictures in the entity picture set are obtained:
Sim_v = CVSE(p_i, t_i),
wherein the Softmax layer of the CVSE is removed; with the picture embedding p_i and the text information t_i as input, the model generates the similarity scores Sim_v ∈ R^n of the pictures, n being the number of pictures in the picture set corresponding to the entity.
Filtering noise pictures. Some pictures in an entity's picture set have low similarity to the entity, which impairs the precision of the visual information. In view of this, a similarity threshold α is set to filter the noise pictures:
set(i)' = { j' ∈ set(i) | Sim_v(j') > α },
wherein set(i) represents the initial picture set, set(i)' represents the picture set after noise filtering, and Sim_v(j') represents the similarity score of picture j' with the entity.
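A minimal sketch of the scoring-and-filtering step follows; the callable cvse_score is a hypothetical wrapper around the pre-trained CVSE model with its Softmax layer removed, assumed to return one score per picture:

    import numpy as np

    def filter_noise_pictures(picture_embeddings, text_info, cvse_score, alpha):
        # picture_embeddings: array of shape (n, 36, 2048), the Faster-RCNN
        # region features of the n pictures associated with one entity.
        # cvse_score returns Sim_v, an (n,) vector of picture-entity scores.
        sim_v = cvse_score(picture_embeddings, text_info)
        keep = sim_v > alpha  # retain only pictures scoring above the threshold
        return picture_embeddings[keep], sim_v[keep]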
Generating the entity visual feature representation. The picture filtering mechanism yields the refined entity picture set; weights are assigned based on the picture similarity scores, and a more accurate visual feature representation V_i of entity e_i is finally generated:
V_i = Σ_{j'=1}^{n'} Att_i(j')·Img(j'),
wherein V_i represents the visual feature of entity i; Img(j') is the image feature generated by the ResNet model for picture j'; n' is the number of pictures after noise removal; and Att_i represents the picture attention weights:
Att_i = Softmax(Sim_v'),
wherein Sim_v' is the similarity scores of the picture set set(i)'.
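A sketch of the aggregation step, assuming the filtered pictures have already been encoded by a ResNet model (resnet_features is an (n', 2048) array; the names are illustrative):

    import numpy as np

    def visual_feature(resnet_features, sim_v_filtered):
        # Att_i = Softmax(Sim_v'): turn the retained similarity scores
        # into attention weights over the n' denoised pictures.
        e = np.exp(sim_v_filtered - np.max(sim_v_filtered))
        att = e / e.sum()
        # V_i: attention-weighted sum of the per-picture ResNet features.
        return att @ resnet_features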
The structural feature learning module: this embodiment employs a graph convolutional neural network (GCN) to capture entity adjacency structure information and generate entity structure representation vectors. The GCN is a convolutional network that acts directly on graph-structured data, generating node structure vectors by capturing the structural information around each node:
H^(l+1) = σ(D^(−1/2)·Â·D^(−1/2)·H^(l)·W^(l)),
wherein H^(l) and H^(l+1) respectively represent the feature matrices of the entity nodes at layer l and layer l+1; D^(−1/2)·Â·D^(−1/2) represents the normalized adjacency matrix, D is the degree matrix, and Â = A + I, wherein A represents the adjacency matrix, with A_ij = 1 if a relationship exists between entity i and entity j; I denotes the identity matrix; the activation function σ is set to ReLU; and W^(l) is the trainable parameter matrix of layer l.
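A dense NumPy sketch of one propagation layer follows; real implementations use sparse matrices, and this only illustrates the formula above:

    import numpy as np

    def gcn_layer(A, H, W):
        # A: adjacency matrix; add self-loops to obtain Â = A + I.
        A_hat = A + np.eye(A.shape[0])
        # D is the degree matrix of Â; form D^(-1/2)·Â·D^(-1/2).
        d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
        A_norm = d_inv_sqrt @ A_hat @ d_inv_sqrt
        # H^(l+1) = ReLU(A_norm · H^(l) · W^(l)).
        return np.maximum(A_norm @ H @ W, 0.0)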
Since the entity structure vectors of different knowledge graphs are not in the same space, it is necessary to map them into the same space using the known entity pairs S. The specific training objective is to minimize the following loss:
L = Σ_{(e1,e2)∈S} Σ_{(e1',e2')∈S'} ( d(e1, e2) + γ − d(e1', e2') )_+,
wherein (x)_+ = max{0, x}; S' represents the set of negative samples, generated by replacing e1 or e2 of a known seed entity pair (e1, e2) with a random entity; h_e represents the structure vector of entity e; d(e1, e2) = ‖h_e1 − h_e2‖_1 represents the Manhattan distance between entities e1 and e2; γ represents the margin separating positive and negative samples; and stochastic gradient descent is adopted for model optimization.
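A sketch of this training objective under the above definitions; for simplicity it pairs each positive with one pre-sampled negative, whereas the full objective sums over the whole negative set:

    import numpy as np

    def manhattan(u, v):
        # d(e1, e2) = ||h_e1 - h_e2||_1
        return np.abs(u - v).sum()

    def alignment_loss(h, positives, negatives, gamma):
        # h: dict mapping entity id -> structure vector;
        # positives: seed pairs S; negatives: corrupted pairs from S'.
        loss = 0.0
        for (e1, e2), (e1n, e2n) in zip(positives, negatives):
            # Hinge: a positive pair should be at least gamma closer
            # than the corresponding negative pair.
            loss += max(0.0, manhattan(h[e1], h[e2]) + gamma
                             - manhattan(h[e1n], h[e2n]))
        return loss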
The adaptive feature fusion module: a multi-modal knowledge graph contains information of at least two modalities, and multi-modal entity alignment needs to fuse information of different modalities. Existing approaches combine the different embeddings into one unified representation space, which requires additional training to jointly represent unrelated features. A preferable strategy is to first compute a similarity matrix within each feature-specific space and then combine the feature similarity scores.
Formally, given the structural feature representation S and the visual feature representation V, for each entity pair (e1, e2), e1 ∈ E1, e2 ∈ E2, the similarity of e1 and e2 is calculated, and the similarity scores are then used to predict potential alignment entities. To calculate the overall similarity, a feature-specific similarity score is first calculated between the entity pair, i.e., Sim_s(e1, e2) and Sim_v(e1, e2). Next, the above similarity scores are combined:
Sim(e1, e2) = Att_s·Sim_s(e1, e2) + Att_v·Sim_v(e1, e2),
wherein Att_s and Att_v represent the contribution rate weights of the structural information and the visual information, respectively.
The features of different modalities characterize an entity from different perspectives and have a certain correlation and complementarity. Current methods combine the structural information and the visual information with fixed contribution rate weights, ignoring the differences in the contribution rate of structural information across entities. For entities with poor structural information, the visual feature representation should be trusted more. Moreover, intuitively, the closeness of the association between an entity and the seed entities is positively correlated with the accuracy of its structural features.
In order to capture the dynamic change of the contribution rates of different modal information, and inspired by degree-aware joint attention mechanisms, the adaptive feature fusion mechanism is designed on the basis of the entity degree, further combined with the closeness of the association between the entity and the seed entities:
Att_s = K / (1 + b·e^(−a·(degree + N_hop))),
Att_v = 1 − Att_s,
wherein K, b and a are hyper-parameters, degree represents the degree of an entity, and N_hop represents the degree of closeness of the entity to the seed entities:
N_hop = w1·N_hop^1 + w2·N_hop^2,
wherein N_hop^1 and N_hop^2 respectively represent the numbers of entities 1 hop and 2 hops away from the seed entities; w1 and w2 are hyper-parameters.
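A sketch of the fusion weights follows; the logistic form below is an assumption consistent with the hyper-parameters K, b and a named above, not a definitive specification:

    import numpy as np

    def fusion_weights(degree, n_hop1, n_hop2, K, b, a, w1, w2):
        # N_hop = w1·N_hop^1 + w2·N_hop^2: closeness to the seed entities.
        n_hop = w1 * n_hop1 + w2 * n_hop2
        # Assumed generalized-logistic attention: confidence in the structural
        # features grows with degree and N_hop and saturates at K.
        att_s = K / (1.0 + b * np.exp(-a * (degree + n_hop)))
        return att_s, 1.0 - att_s

    def overall_similarity(sim_s, sim_v, att_s, att_v):
        # Sim(e1, e2) = Att_s·Sim_s(e1, e2) + Att_v·Sim_v(e1, e2)
        return att_s * sim_s + att_v * sim_v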
Further, before performing step 2 to obtain the structural feature representation and step 3 to obtain the visual feature representation, the importance of the triples (h, r, t) is quantified using an unsupervised triple screening module, and part of the invalid triples are filtered out based on the importance scores.
The structural information of a knowledge graph is represented as triples (h, r, t), where h represents the head entity, t represents the tail entity, and r represents the relationship. The numbers of triples in different knowledge graphs differ greatly, which substantially weakens entity alignment based on structural information. In order to mitigate the structural differences between knowledge graphs, this work designs an unsupervised triple screening module that quantifies the importance of triples and filters out part of the invalid triples based on the importance scores. The triple importance score incorporates the PageRank score of the relation r and the degrees of the entities h and t.
Calculating the PageRank score. First, a relation-entity graph, also called the relation dual graph of the knowledge graph, is constructed with relations as nodes and entities as edges. The knowledge graph is defined as G_e = (V_e, E_e), wherein V_e is the set of entities and E_e is the set of relationships; the relation dual graph G_r = (V_r, E_r) takes relations as nodes, and an edge exists between two relation nodes if the two different relations are connected by the same entity, wherein V_r is the set of relation nodes and E_r is the set of edges.
Based on the relation dual graph generated above, this embodiment calculates the relation scores using the PageRank algorithm. The PageRank algorithm is a representative algorithm for link analysis on graph data and belongs to the unsupervised learning methods. Its basic idea is to define a random walk model on a directed graph, describing a walker randomly visiting the nodes along the directed edges. Under certain conditions, the probability of visiting each node converges in the limit to a stationary distribution; the stationary probability of each node is its PageRank value and represents the importance of the node. Inspired by this algorithm, the PageRank value of a relation is calculated on the relation dual graph of the knowledge graph to represent the importance of the relation:
PR(r) = Σ_{v∈B_r} PR(v) / L(v),
wherein PR(r) is the PageRank score of relation r; B_r represents the set of neighbour relations of relation r; and L(v) represents the number of connections (i.e., the degree) of relation v.
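A power-iteration sketch over the relation dual graph; it omits a damping factor, matching the simplified formula above, and the iteration count is illustrative:

    def pagerank_relations(neighbors, num_iter=50):
        # neighbors[r]: the set B_r of relations sharing an entity with r.
        # Assumes every relation has at least one neighbour.
        ranks = {r: 1.0 / len(neighbors) for r in neighbors}
        for _ in range(num_iter):
            # PR(r) = sum over v in B_r of PR(v) / L(v)
            ranks = {r: sum(ranks[v] / len(neighbors[v]) for v in neighbors[r])
                     for r in neighbors}
        return ranks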
The triple scoring mechanism. The screening of triples should, on the one hand, filter out redundant or invalid relations and, on the other hand, preserve the structural characteristics of the knowledge graph. Since a long-tail entity lacking structural information has only a few related triples, directly filtering out a relation based on its importance score would aggravate the lack of structural information of long-tail entities. Therefore, this embodiment provides two triple scoring functions; the first directly adopts the PageRank score as the triple scoring function:
Score(h,r,t) = PR(r),
and based on the triple score Score(h, r, t), a threshold β is set, the triples with Score(h, r, t) > β are retained, and the knowledge graph is refined.
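The refinement step then reduces to a filter over the triple set (a sketch; pr_scores is the output of the PageRank step above):

    def refine_triples(triples, pr_scores, beta):
        # Keep (h, r, t) only when Score(h, r, t) = PR(r) exceeds the threshold β.
        return [(h, r, t) for (h, r, t) in triples if pr_scores[r] > beta]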
In the experiments, this embodiment uses the dataset MMKG, extracted from the knowledge bases FreeBase, DBpedia and YAGO, respectively. These datasets are based on FB15K: entities in FB15K are aligned with equivalent entities in the other knowledge graphs using the SameAs (equivalence) links between knowledge graphs, thereby generating DB15K and Yago15K. The experiments herein are performed on the two pairs of multi-modal knowledge graphs FB15K-DB15K and FB15K-YAGO15K.
Since the data set does not provide pictures, in order to obtain entity-related pictures, this embodiment uses URI data and designs a web crawler to parse query results from Image Search engines (i.e., Google Images, Bing Images, and Yahoo Image Search). Then, pictures obtained by different search engines are distributed to different MMKGs. In order to simulate the construction process of a real-world multi-modal knowledge graph, pictures with high similarity in an equivalent entity image set are removed, and a certain number of noise pictures are introduced. Table 1 describes the details of the data set. In experiments, pairs of known equivalent entities are used for model training and testing.
TABLE 1 Multi-modal knowledge graph statistics

Data set    Entities    Relations    Triples     Pictures    Equivalent pairs
FB15K       14,951      1,345        592,213     13,444      —
DB15K       14,777      279          99,028      12,841      12,846
Yago15K     15,404      32           122,886     11,194      11,199
Evaluation metrics: the experiments use Hits@k (k = 1, 10) and the mean reciprocal rank (MRR) as evaluation metrics. For each entity in the test set, the entities in the other graph are ranked in descending order of their similarity scores with that entity. Hits@k denotes the percentage of test entities for which the correct entity appears among the top k candidates. MRR denotes the average of the reciprocal ranks of the correctly aligned entities. Hits@1 reflects the alignment accuracy and is the most important metric, while Hits@10 and MRR provide supplementary information. Note that higher values of Hits@k and MRR indicate better performance, and the Hits@k results are expressed as percentages. The best results are marked in bold in the tables.
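For reference, a sketch of how Hits@k and MRR can be computed from a similarity matrix (gold[i] is the index of the correct counterpart of test entity i; the names are illustrative):

    import numpy as np

    def hits_and_mrr(sim_matrix, gold, ks=(1, 10)):
        # Rank candidates for each test entity in descending similarity,
        # then locate the 1-based rank of the correct entity.
        order = np.argsort(-sim_matrix, axis=1)
        ranks = np.array([int(np.where(order[i] == gold[i])[0][0]) + 1
                          for i in range(len(gold))])
        hits = {k: float((ranks <= k).mean()) for k in ks}  # Hits@k
        mrr = float((1.0 / ranks).mean())                   # mean reciprocal rank
        return hits, mrr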
The experiments use the graph convolutional neural network to generate entity structural features, with the number of negative samples set to 15, γ = 3, 400 rounds of training, and dimension d_s = 300; the visual features are generated by the visual feature processing module with dimension d_v = 2048. The proportion of seed entities is set to 20% and 50%, and 10% of the entities are selected as a validation set for tuning the hyper-parameters in the formulas, wherein b = 1.5 and a = 1; the value of the parameter K is related to the seed entity proportion, being 0.6 when seed = 0.2 and 0.8 when seed = 0.5. The hyper-parameters w1 and w2 are set to 0.8 and 0.1, respectively.
TABLE 2 Multi-modal entity alignment results
The method of the present embodiment, and this method with the triple screening module removed, are compared with 2 methods: (1) GCN-align, which uses a GCN to generate the entity structure and visual feature matrices and combines the two features with fixed weights for entity alignment; and (2) HMEA, which uses a hyperbolic graph convolutional network (HGCN) to generate the structure and visual feature matrices of entities and combines the structural and visual features by weight in hyperbolic space for entity alignment. The method of the present embodiment achieves the best multi-modal entity alignment effect to date.
In addition, to verify the effectiveness of the triple screening module proposed by the present invention, we compare three screening mechanisms, F_PageRank, F_Random and F_our, which respectively denote direct PageRank score screening, random screening, and improved PageRank score screening. To control the experimental variables, the three screening mechanisms are used to screen the same number of triples, about 290,000; the structural features are learned based on the graph convolutional neural network with all parameters kept consistent.
The experimental results show that, compared with the baseline retaining all triples, random screening F_Random increases Hits@1 by about 1.5% and 2.5% for seed = 0.2 and seed = 0.5, respectively, indicating that graph structural differences do affect entity alignment. Compared with random screening, the screening mechanism based on the PageRank score improves by about 3% when the proportion of seed entities is 50%. According to the results, the improved PageRank score screening mechanism obtains the best alignment results: on FB15K-DB15K its Hits@1 improves over the baseline by more than 8% and 3%, respectively; on FB15K-Yago15K, Hits@1 increases by about 9% and 5%, respectively.
Since the richness of structural information is related to the degree of an entity, the entities are divided into three categories according to their degrees, and the accuracy of multi-modal entity alignment under the adaptive fusion mechanism proposed in this embodiment and under the fixed weight mechanism is tested on each category. The seed entity proportion is set to 20%; the experiments are performed on FB15K-DB15K and FB15K-Yago15K, with the remaining parameters consistent with the experiments above.
Table 3 shows the multi-modal entity alignment results of adaptive feature fusion and fixed weight fusion, wherein Fixed and Adaptive denote the fixed weight fusion mechanism and the adaptive feature fusion mechanism, respectively; Group 1, Group 2 and Group 3 denote the first 1/3, middle 1/3 and last 1/3 of the entities, respectively, sorted from small to large by entity degree. As can be seen from Table 3, the adaptive feature fusion mechanism achieves a better entity alignment effect on all categories of entities than fixed weight fusion. Notably, the improvement on Group 1 is significantly higher than on Group 2 and Group 3, which demonstrates that the adaptive feature fusion mechanism of this embodiment can significantly improve the alignment accuracy of long-tail entities, i.e., entities lacking structural information.
TABLE 3 Multi-modal entity alignment results of adaptive feature fusion and fixed weight fusion
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.