Discrete hash retrieval method based on robust matrix decomposition
1. A discrete hash retrieval method based on robust matrix decomposition, characterized by comprising the following steps:
step S1, collecting samples of two modalities, images and texts, through the Internet to establish a data set, and dividing the data set into a training set and a test set;
step S2, extracting features of the images and texts in the training set and the test set using a bag-of-words (BOW) model for images and a BOW model for texts, respectively;
step S3, learning the consistency and the inconsistency between the image features and the text features using a matrix decomposition algorithm, wherein the consistency is represented by a shared hash code and the inconsistency is constrained by minimizing the commonality between the inconsistent parts of the two modalities, and constructing a total objective function;
step S3 comprises the following substeps:
step S31, the features of the training-set samples are represented per modality, where n is the number of sample pairs and the zero-centered feature vectors represent data from the image and text modalities, respectively; the features of the images and texts are mapped to a d-dimensional feature space using a radial basis kernel function, yielding the mapped features of the image and of the text, respectively;
step S32, incorporating the consistency and the inconsistency between the image and text modalities into a matrix-decomposition model, whose objective function is defined as follows:
wherein one balance parameter weights the image and text modalities and another balances consistency against inconsistency; two latent factor matrices represent the image and text modalities, respectively; two further matrices represent the inconsistent parts of the image and text modalities, respectively; B represents the consistent part of the modalities, namely the hash code of the image and text samples; the dot product of the two inconsistent parts serves as the constraint on the inconsistency, which reflects noise present in the samples or characteristics specific to each modality, and all of its elements should be minimized; the constraint is defined as follows:
in matrix form, the above formula can be further written as:
wherein the trace operator denotes the trace of a matrix;
step S33, the objective function for learning the hash functions of the image and text modalities from the hash code B is defined as:
wherein a balance parameter weights the hash functions of the two modalities, and two projection matrices represent the image and text modalities, respectively;
step S34, the total objective function of the method is therefore:
wherein a further parameter balances the weight of the regularization term;
step S4, solving the total objective function of step S3 to obtain the hash code B of the image-text sample pairs and the projection matrices of the image and text modalities; because the objective function is non-convex, an iterative optimization algorithm is used to obtain a locally optimal solution, comprising the following substeps:
step S41: with the other variables fixed, solving for the first variable, where the identity matrix in the solution is k-dimensional;
step S42: with the other variables fixed, solving for the second variable;
step S43: with the other variables fixed, solving for the third variable;
step S44: with the other variables fixed, solving for the fourth variable;
step S45: with the other variables fixed, solving for the fifth variable, where the identity matrix in the solution is d-dimensional;
step S46: with the other variables fixed, solving for the sixth variable;
step S47: with the other variables fixed, solving for the hash code B:
removing the terms unrelated to B, the total objective function simplifies to a discrete least-squares problem, which is difficult to solve because of the discrete constraint on B; the method therefore uses discrete cyclic coordinate descent to solve for B directly, bit by bit; the i-th row of the hash code B is treated as the unknown while the matrix formed by the remaining rows of B is held fixed; the other matrices in the problem are split in the same way into their i-th row and the matrix formed by the remaining rows; removing the constant terms yields a subproblem in the i-th row alone, from which the update of that row is obtained; the i-th row of the hash code B is first updated with the result, and the update is then repeated until all bits of the hash code have been updated; repeating this process yields the hash code B of the training-set samples;
step S48: checking whether the maximum number of iterations has been reached or the difference between the losses of the last two iterations is less than 0.001; if neither holds, the iteration continues; otherwise the loop stops;
step S5, when a user submits a query sample, computing the hash code of the query sample using the projection matrix of the image modality or the projection matrix of the text modality, computing the Hamming distances between the query sample and the samples of the other modality in the data set, and returning the cross-media retrieval results in ascending order of Hamming distance.
2. The discrete hash retrieval method based on robust matrix decomposition of claim 1, wherein step S1 comprises collecting samples of the two modalities, images and texts, from social networking sites, and constructing the data set by pairing image and text samples according to their co-occurrence relationship.
3. The discrete hash retrieval method based on robust matrix decomposition of claim 1, wherein in step S2, image features are extracted using a bag-of-words model with SIFT features as visual words, and text features are extracted using a conventional bag-of-words model.
4. The discrete hash retrieval method based on robust matrix decomposition of claim 1, wherein in step S5, when the user submits a query sample, where r = 1 denotes the image modality and r = 2 denotes the text modality, the hash code of the query sample is computed from the projection matrix of the image modality or the projection matrix of the text modality according to r, the Hamming distances between the query sample and the samples of the other modality in the data set are computed, and the cross-media retrieval results are returned in ascending order of Hamming distance.
Background
With the rapid development of computer technology and social networking, the amount of multimedia data, including text, images, and video, has increased rapidly in recent years, and fast similarity retrieval over large-scale data sets has become a basic requirement. Hashing techniques have received much attention because of their high efficiency in large-scale applications. The key to hashing is to find a compact binary representation of high-dimensional data points that preserves the data structure or semantic similarity. In the learned Hamming space, retrieval can then be carried out efficiently with XOR operations, which makes hashing applicable to large-scale data sets. However, most retrieval methods are limited to a single modality, where the retrieved data is of the same type as the query. Because of the heterogeneity gap between modalities, these methods cannot be applied directly when the query and the data to be retrieved are of different types.
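To make the role of the XOR operation concrete, the following minimal sketch computes Hamming distances between bit-packed binary codes. The function names `pack_bits` and `hamming_xor` are hypothetical, and the codes are assumed to be stored as 0/1 bits; the text itself does not prescribe a storage layout.

```python
import numpy as np

# Lookup table: number of set bits in every possible byte value.
_POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.int64)

def pack_bits(codes):
    """Pack an (n, k) array of 0/1 bits into bytes, one row per sample."""
    return np.packbits(np.asarray(codes, dtype=np.uint8), axis=1)

def hamming_xor(query_packed, db_packed):
    """Hamming distance between one packed query and every packed database code."""
    # XOR leaves a 1 wherever the two codes differ; counting the 1s gives the distance.
    differing = np.bitwise_xor(db_packed, query_packed)
    return _POPCOUNT[differing].sum(axis=1)
```

Because the distance reduces to an XOR followed by a bit count, comparing a query against a large database needs only integer operations on compact codes.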
Data generated on the Internet is typically represented in different modalities, such as text, images, and video, and in practical applications a search engine needs to return samples of various modalities as search results. Cross-media hash retrieval, which encodes heterogeneous samples into hash codes, has therefore become a research hotspot and is receiving increasing attention because of its advantages in computational efficiency and storage overhead. Although previous methods have made great progress, they have the following disadvantages. First, they model only the consistency between different modalities and ignore the potential inconsistency (caused by noise or modality-specific characteristics) in multimodal data; although they can learn hash codes for samples of different modalities, they cannot obtain satisfactory retrieval performance. How to jointly model the consistency and the inconsistency among modalities in a unified learning framework, so as to improve the quality of the hash codes, therefore remains to be solved. Second, the discrete constraint on the hash codes makes the objective function difficult to solve; most hashing methods first relax the discrete constraint to obtain a continuous solution and then quantize it to obtain the hash codes, but this introduces quantization error and degrades retrieval performance.
Disclosure of Invention
The present invention aims to overcome the above deficiencies of the prior art and to provide a discrete hash retrieval method based on robust matrix decomposition.
The technical scheme provided by the invention is as follows: a discrete hash retrieval method based on robust matrix decomposition, characterized by comprising the following steps:
step S1, collecting samples of two modalities, images and texts, through the Internet to establish a data set, and dividing the data set into a training set and a test set;
step S2, extracting features of the images and texts in the training set and the test set using a bag-of-words (BOW) model for images and a BOW model for texts, respectively;
step S3, learning the consistency and the inconsistency between the image features and the text features using a matrix decomposition algorithm, wherein the consistency is represented by a shared hash code and the inconsistency is constrained by minimizing the commonality between the inconsistent parts of the two modalities, and constructing a total objective function;
step S3 comprises the following substeps:
step S31, the features of the training-set samples are represented per modality, where n is the number of sample pairs and the zero-centered feature vectors represent data from the image and text modalities, respectively; the features of the images and texts are mapped to a d-dimensional feature space using a radial basis kernel function, yielding the mapped features of the image and of the text, respectively;
step S32, incorporating the consistency and the inconsistency between the image and text modalities into a matrix-decomposition model, whose objective function is defined as follows:
wherein one balance parameter weights the image and text modalities and another balances consistency against inconsistency; two latent factor matrices represent the image and text modalities, respectively; two further matrices represent the inconsistent parts of the image and text modalities, respectively; B represents the consistent part of the modalities, namely the hash code of the image and text samples; the dot product of the two inconsistent parts serves as the constraint on the inconsistency, which reflects noise present in the samples or characteristics specific to each modality, and all of its elements should be minimized; the constraint is defined as follows:
in matrix form, the above formula can be further written as:
wherein the trace operator denotes the trace of a matrix;
step S33, the objective function for learning the hash functions of the image and text modalities from the hash code B is defined as:
wherein a balance parameter weights the hash functions of the two modalities, and two projection matrices represent the image and text modalities, respectively;
step S34, the total objective function of the method is therefore:
wherein a further parameter balances the weight of the regularization term;
step S4, solving the total objective function of step S3 to obtain the hash code B of the image-text sample pairs and the projection matrices of the image and text modalities; because the objective function is non-convex, an iterative optimization algorithm is used to obtain a locally optimal solution, comprising the following substeps:
step S41: with the other variables fixed, solving for the first variable, where the identity matrix in the solution is k-dimensional;
step S42: with the other variables fixed, solving for the second variable;
step S43: with the other variables fixed, solving for the third variable;
step S44: with the other variables fixed, solving for the fourth variable;
step S45: with the other variables fixed, solving for the fifth variable, where the identity matrix in the solution is d-dimensional;
step S46: with the other variables fixed, solving for the sixth variable;
step S47: with the other variables fixed, solving for the hash code B:
removing the terms unrelated to B, the total objective function simplifies to a discrete least-squares problem, which is difficult to solve because of the discrete constraint on B; the method therefore uses discrete cyclic coordinate descent to solve for B directly, bit by bit; the i-th row of the hash code B is treated as the unknown while the matrix formed by the remaining rows of B is held fixed; the other matrices in the problem are split in the same way into their i-th row and the matrix formed by the remaining rows; removing the constant terms yields a subproblem in the i-th row alone, from which the update of that row is obtained; the i-th row of the hash code B is first updated with the result, and the update is then repeated until all bits of the hash code have been updated; repeating this process yields the hash code B of the training-set samples;
step S48: checking whether the maximum number of iterations has been reached or the difference between the losses of the last two iterations is less than 0.001; if neither holds, the iteration continues; otherwise the loop stops;
step S5, when a user submits a query sample, computing the hash code of the query sample using the projection matrix of the image modality or the projection matrix of the text modality, computing the Hamming distances between the query sample and the samples of the other modality in the data set, and returning the cross-media retrieval results in ascending order of Hamming distance.
Preferably, step S1 comprises collecting samples of the two modalities, images and texts, from social networking sites, and constructing the data set by pairing image and text samples according to their co-occurrence relationship.
Preferably, in step S2, image features are extracted using a bag-of-words model with SIFT features as visual words, and text features are extracted using a conventional bag-of-words model.
Preferably, in step S5, when the user submits a query sample, where r = 1 denotes the image modality and r = 2 denotes the text modality, the hash code of the query sample is computed from the projection matrix of the image modality or the projection matrix of the text modality according to r, the Hamming distances between the query sample and the samples of the other modality in the data set are computed, and the cross-media retrieval results are returned in ascending order of Hamming distance.
The invention has the beneficial effects that: the invention eliminates the inconsistency among different modes through the matrix decomposition model, and simultaneously keeps the consistency of the generated hash codes. Therefore, the model can better capture the internal structure of the training data and has stronger robustness to noise. In addition, unlike most previous methods that relax the discrete constraint, the discrete hash code can be directly obtained in the optimization process.
The invention designs a general objective function based on matrix decomposition, models the consistency and the inconsistency of multi-modal data at the same time, the consistency represents the consistent Hash codes of image and text samples, and the inconsistency represents the characteristics of noise or different modes existing in the samples. Thus, the hash code may well capture commonality between different modalities, thereby improving the quality of the generated hash code. The invention provides an effective iteration-based discrete optimization scheme to solve the total objective function, and the discrete hash code can be directly generated to avoid quantization errors. The invention has high retrieval accuracy, is easy to be applied to large-scale data sets, and has wide application prospect.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
Embodiments of the invention are described in detail below with reference to the accompanying drawing.
Although the present invention is described for two modalities, image and text, the algorithm is easily extended to other modalities and to more than two modalities. For convenience of description, only the image and text modalities are considered here.
As shown in FIG. 1, the discrete hash retrieval method based on robust matrix decomposition is implemented by computer through the following steps:
Step S1, collecting samples of two modalities, images and texts, through the Internet to establish a data set, and dividing the data set into a training set and a test set. Samples of the two modalities are collected from social networking sites, and image and text samples are paired according to their co-occurrence relationship to construct the data set. This embodiment uses the MIRFlickr-25K data set, which consists of images in 24 categories and their corresponding text labels; 75% of the image-text label pairs are randomly selected from the data set to form the training set, and the remainder forms the test set.
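The random 75%/25% split of step S1 can be sketched as follows. This is only an illustration: the function name `split_pairs`, the fixed seed, and the rounding of the training-set size are assumptions, and the pair count of 20015 is the one reported in the experiment.

```python
import numpy as np

def split_pairs(n_pairs, train_fraction=0.75, seed=0):
    """Randomly split pair indices 0..n_pairs-1 into training and test indices."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_pairs)          # shuffle image-text pair indices together
    n_train = int(round(train_fraction * n_pairs))
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_pairs(20015)
```

Splitting by pair index keeps each image with its co-occurring text, so both modalities of a pair always land in the same subset.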
Step S2, extracting features of the images and texts in the training set and the test set using BOW models for images and texts, respectively: image features are extracted using a bag-of-words model with SIFT features as visual words, and text features are extracted using a conventional bag-of-words model.
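The two bag-of-words representations of step S2 can be sketched as below. The sketch assumes that local descriptors (for example SIFT) and a visual-word codebook have already been computed elsewhere, and that text has already been tokenized; the names `bow_vector` and `visual_word_histogram` are hypothetical.

```python
from collections import Counter

import numpy as np

def bow_vector(tokens, vocabulary):
    """Conventional text bag-of-words: count each vocabulary word in the token list."""
    counts = Counter(tokens)
    return np.array([counts[word] for word in vocabulary], dtype=float)

def visual_word_histogram(descriptors, codebook):
    """Image bag-of-words: assign each local descriptor to its nearest codeword."""
    # Squared Euclidean distance between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)
    return np.bincount(words, minlength=len(codebook)).astype(float)
```

In both cases the result is a fixed-length count vector, which is what allows image and text samples to enter the same matrix-decomposition model.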
Step S3, learning the consistency and the inconsistency between the image features and the text features using a matrix decomposition algorithm, wherein the consistency is represented by a shared hash code and the inconsistency is constrained by minimizing the commonality between the inconsistent parts of the two modalities, and constructing a total objective function;
step S3 comprises the following substeps:
step S31, the features of the training-set samples are represented per modality, where n is the number of sample pairs and the zero-centered feature vectors represent data from the image and text modalities, respectively; the features of the images and texts are mapped to a d-dimensional feature space using a radial basis kernel function, with d = 500, yielding the mapped features of the image and of the text, respectively;
step S32, incorporating the consistency and the inconsistency between the image and text modalities into a matrix-decomposition model, whose objective function is defined as follows:
wherein the balance parameter weighting the image and text modalities is set to 0.6; the balance parameter weighting consistency against inconsistency is set to 0.1; two latent factor matrices represent the image and text modalities, respectively; two further matrices represent the inconsistent parts of the image and text modalities, respectively; B represents the consistent part of the modalities (i.e., the consistent hash code of the image and text samples); the constraint on the inconsistency, which reflects noise present in the samples or characteristics specific to each modality, requires that the sum of the dot products of the inconsistent parts of the modalities be as small as possible, i.e., all elements of the dot product of the two inconsistent parts should be minimized; the constraint is defined as follows:
in matrix form, the above formula can be further written as:
wherein the trace operator denotes the trace of a matrix;
step S33, the objective function for learning the hash functions of the image and text modalities from the hash code B is defined as:
wherein the parameter balancing the hash functions of the two modalities is set to 1000, and two projection matrices represent the image and text modalities, respectively;
step S34, the total objective function of the method is therefore:
wherein the parameter balancing the weight of the regularization term is set to 0.1;
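As an illustration of steps S31 and S32, the sketch below zero-centers a feature matrix, maps it to a d-dimensional space with a radial basis kernel, and evaluates the inconsistency constraint in both its element-wise and trace forms. It is a minimal sketch under assumptions not stated in the text: the kernel map is computed against d anchor points, the bandwidth `sigma` is supplied by the caller, "dot product" is read as the element-wise product, and the names `zero_center`, `rbf_map`, `inconsistency_penalty`, `E1`, and `E2` are hypothetical.

```python
import numpy as np

def zero_center(X):
    """Subtract the per-feature mean so every column of X has zero mean."""
    return X - X.mean(axis=0)

def rbf_map(X, anchors, sigma):
    """Map rows of X (n, m) to (n, d) RBF features against d anchor rows (assumed scheme)."""
    sq = (X ** 2).sum(axis=1)[:, None] - 2.0 * X @ anchors.T + (anchors ** 2).sum(axis=1)[None, :]
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    return np.exp(-sq / (2.0 * sigma ** 2))

def inconsistency_penalty(E1, E2):
    """Sum of the element-wise product of E1 and E2, and the equivalent trace form."""
    elementwise = np.sum(E1 * E2)
    trace_form = np.trace(E1.T @ E2)
    return elementwise, trace_form
```

The two returned values agree because the sum of all entries of an element-wise product equals the trace of the transposed first factor times the second, which is the kind of rewriting the matrix form in step S32 relies on.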
step S4, solving the total objective function of step S3 to obtain the hash code B of the image-text sample pairs and the projection matrices of the image and text modalities; because the objective function is non-convex, an iterative optimization algorithm is used to obtain a locally optimal solution, comprising the following substeps:
step S41: with the other variables fixed, solving for the first variable, where the identity matrix in the solution is k-dimensional, with k = 32;
step S42: with the other variables fixed, solving for the second variable;
step S43: with the other variables fixed, solving for the third variable;
step S44: with the other variables fixed, solving for the fourth variable;
step S45: with the other variables fixed, solving for the fifth variable, where the identity matrix in the solution is d-dimensional, with d = 500;
step S46: with the other variables fixed, solving for the sixth variable;
step S47: with the other variables fixed, solving for the hash code B:
removing the terms unrelated to B, the total objective function simplifies to a discrete least-squares problem, which is difficult to solve because of the discrete constraint on B; the method therefore uses discrete cyclic coordinate descent to solve for B directly, bit by bit; the i-th row of the hash code B is treated as the unknown while the matrix formed by the remaining rows of B is held fixed; the other matrices in the problem are split in the same way into their i-th row and the matrix formed by the remaining rows; removing the constant terms yields a subproblem in the i-th row alone, from which the update of that row is obtained; the i-th row of the hash code B is first updated with the result, and the update is then repeated until all bits of the hash code have been updated; repeating this process yields the hash code B of the training-set samples;
step S48: checking whether the maximum number of iterations has been reached or the difference between the losses of the last two iterations is less than 0.001; if neither holds, the iteration continues; otherwise the loop stops;
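The alternating scheme of step S4 can be illustrated on a deliberately simplified problem. The sketch below is not the patent's objective: it assumes a single modality and the toy objective ||X - U B||_F^2 + lam * ||U||_F^2 with B restricted to {-1, +1}, and all names (`update_codes`, `fit`, `lam`) are hypothetical. It shows three ingredients named in the text: a closed-form update involving a k-dimensional identity matrix, a bit-by-bit discrete cyclic coordinate descent update of B, and the stopping rule of step S48.

```python
import numpy as np

def update_codes(X, U, B, sweeps=3):
    """Discrete cyclic coordinate descent for min ||X - U B||_F^2 with B in {-1, +1}."""
    Q = U.T @ X              # (k, n) linear term
    G = U.T @ U              # (k, k) coupling between code rows
    B = B.copy()
    for _ in range(sweeps):
        for i in range(B.shape[0]):
            # Contribution of every other row j != i to row i's subproblem.
            others = G[i] @ B - G[i, i] * B[i]
            z = Q[i] - others
            # Optimal row is sign(z); keep the previous bit where z is exactly zero.
            B[i] = np.where(z > 0, 1.0, np.where(z < 0, -1.0, B[i]))
    return B

def fit(X, k=32, lam=0.1, max_iter=50, tol=1e-3, seed=0):
    """Alternate between U and B until the loss change drops below tol."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    B = np.where(rng.standard_normal((k, n)) >= 0, 1.0, -1.0)
    losses = []
    for _ in range(max_iter):
        # Ridge-style closed form: U = X B^T (B B^T + lam * I_k)^{-1}.
        U = np.linalg.solve(B @ B.T + lam * np.eye(k), B @ X.T).T
        B = update_codes(X, U, B)
        loss = np.linalg.norm(X - U @ B) ** 2 + lam * np.linalg.norm(U) ** 2
        converged = bool(losses) and abs(losses[-1] - loss) < tol
        losses.append(loss)
        if converged:
            break
    return U, B, losses
```

Each row update solves its one-row subproblem exactly with the other rows held fixed, so the toy loss never increases from one iteration to the next and no relaxation or quantization step is needed.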
step S5, when the user submits a query sample (r = 1 denotes the image modality and r = 2 denotes the text modality), the hash code of the query sample is computed from the projection matrix of the image modality or the projection matrix of the text modality according to r, the Hamming distances between the query sample and the samples of the other modality in the data set are computed, and the cross-media retrieval results are returned in ascending order of Hamming distance.
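Step S5 can be sketched as follows. The exact hash function is not reproduced in the text, so the sketch assumes the common form sign(W · φ(x)), where `W` is the projection matrix of the query's modality and `phi_q` is the kernel-mapped query feature; the names `hash_query` and `rank_by_hamming` are hypothetical, and codes are stored as -1/+1 values.

```python
import numpy as np

def hash_query(phi_q, W):
    """Hash code of a query: sign of the projected feature (assumed form)."""
    return np.where(W @ phi_q >= 0, 1, -1)

def rank_by_hamming(b_query, B_db):
    """Rank database codes (k, n) by Hamming distance to a length-k query code."""
    k = b_query.shape[0]
    # For -1/+1 codes, Hamming distance = (k - inner product) / 2.
    dist = (k - b_query @ B_db) // 2
    order = np.argsort(dist, kind="stable")  # ascending distance, ties in database order
    return order, dist
```

For a cross-media query, `B_db` would hold the codes of the other modality, so an image query returns text samples and vice versa.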
Experimental results:
This embodiment was validated on the MIRFlickr-25K data set, which contains 20015 image-text sample pairs divided into 24 semantic categories. 75% of the sample pairs were randomly selected to form the training set, and the remaining 25% formed the test set. Images were represented by 150-dimensional texture features and texts by 500-dimensional BOW (bag-of-words) features; the features were normalized and zero-centered (mean-removed). Mean average precision (MAP@50) was used as the performance evaluation criterion, where 50 indicates that MAP is computed over the first 50 returned samples. The scheme was compared with MTFH (X. Liu, Z. Hu, H. Ling, and Y.-M. Cheung, "MTFH: A matrix tri-factorization hashing framework for efficient cross-modal retrieval," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 3, pp. 964-981, 2021). The accuracy at code lengths of 16, 24, 32, and 64 bits on the image-to-text and text-to-image retrieval tasks is shown in Table 1.
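The MAP@50 criterion can be computed as in the sketch below. Several conventions for truncated average precision exist; this one normalizes by the number of relevant items found in the top results, which is an assumption, as are the function names. Each input list holds 1 for a relevant returned sample and 0 otherwise, in ranked order.

```python
import numpy as np

def average_precision_at(relevant, top=50):
    """Average precision over the first `top` returned samples of one query."""
    rel = np.asarray(relevant[:top], dtype=float)
    if rel.sum() == 0:
        return 0.0
    ranks = np.arange(1, len(rel) + 1)
    precision_at_rank = np.cumsum(rel) / ranks
    # Average the precision values taken at the ranks of relevant items.
    return float((precision_at_rank * rel).sum() / rel.sum())

def mean_average_precision(relevance_lists, top=50):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision_at(r, top) for r in relevance_lists]))
```

A returned sample would typically be counted as relevant when it shares at least one semantic category with the query, although the text does not state the relevance rule explicitly.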
It can be seen that the invention designs a uniform objective function based on matrix decomposition, models the consistency and inconsistency of multi-modal data simultaneously, the consistency part represents the consistent hash code of the image and text sample, and the inconsistent part represents the noise existing in the sample or the diversity among different modes. Thus, the hash code may well capture commonality between different modalities, thereby improving the quality of the generated hash code. An efficient iteration-based discrete optimization scheme is provided to solve the above objective function. Therefore, the discrete hash code can be directly generated, and quantization errors are avoided. The invention has high retrieval accuracy, is easy to be applied to large-scale data sets, and has wide application prospect.
It should be understood that parts of the specification not set forth in detail are well within the prior art. The above examples are only for describing the preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, and various modifications and improvements made to the technical solution of the present invention by those skilled in the art without departing from the spirit of the present invention should fall within the protection scope defined by the claims of the present invention.