Target retrieval method, device and storage medium
1. A target retrieval method is characterized by comprising the following steps:
acquiring original video data from monitoring equipment, and segmenting the original video data to obtain a plurality of original video segments;
performing fusion feature analysis on each original video segment to obtain a fusion feature vector corresponding to each original video segment;
importing target video data to be retrieved, and segmenting the target video data to be retrieved to obtain a plurality of target video segments to be retrieved;
performing fusion feature analysis on each target video segment to be retrieved respectively to obtain a fusion feature vector to be retrieved corresponding to the target video segment to be retrieved;
and respectively carrying out similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved to obtain an analysis result corresponding to each target video segment to be retrieved, and taking each analysis result as a retrieval result corresponding to each target video segment to be retrieved.
2. The object retrieval method according to claim 1, wherein each of the original video segments includes a plurality of frame pictures belonging to the same pedestrian ID; the process of respectively performing fusion feature analysis on each original video segment to obtain a fusion feature vector corresponding to each original video segment includes:
taking a plurality of frame pictures belonging to the same pedestrian ID as a frame picture unit group, and respectively extracting the features of the plurality of frame pictures in each frame picture unit group through a preset two-dimensional convolutional neural network to obtain a plurality of frame picture features of each group belonging to the same pedestrian ID;
taking a plurality of frame picture features belonging to the same pedestrian ID as a feature unit group, and respectively carrying out feature aggregation processing on the plurality of frame picture features in each feature unit group by utilizing a time modeling algorithm to obtain each group of feature vectors belonging to the same pedestrian ID;
respectively carrying out local feature extraction on each feature vector by utilizing an SSD (Single Shot MultiBox Detector) target detection framework algorithm to obtain local feature vectors corresponding to the feature vectors;
and respectively calculating fusion feature vectors of the feature vectors and the local feature vectors corresponding to the feature vectors to obtain fusion feature vectors corresponding to the original video segments.
3. The method according to claim 2, wherein said calculating fused feature vectors for each of said feature vectors and said local feature vectors corresponding to each of said feature vectors to obtain fused feature vectors corresponding to each of said original video segments comprises:
calculating fusion eigenvectors of each eigenvector and the local eigenvector corresponding to each eigenvector respectively through a first formula to obtain fusion eigenvectors corresponding to each original video segment, wherein the first formula is as follows:
T(fc_i, g_i) = cov(fc_i, g_i) / sqrt(D(fc_i) × D(g_i))
wherein T(fc_i, g_i) is the ith fused feature vector, fc_i is a feature vector, g_i is the corresponding local feature vector, cov(fc_i, g_i) is the covariance of the feature vector fc_i and the local feature vector g_i, D(fc_i) is the variance of the feature vector fc_i, D(g_i) is the variance of the local feature vector g_i, and the value range of T is [-1, 1].
4. The target retrieval method according to claim 2, wherein the process of performing similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved to obtain an analysis result corresponding to each target video segment to be retrieved includes:
respectively carrying out similarity calculation on each fusion feature vector according to each fusion feature vector to be retrieved to obtain a plurality of similarities corresponding to each fusion feature vector to be retrieved;
and respectively carrying out maximum value screening on a plurality of similarities corresponding to the feature vectors to be retrieved to obtain the maximum similarity corresponding to the feature vectors to be retrieved, and taking the pedestrian ID belonging to the maximum similarity corresponding to the feature vectors to be retrieved as the analysis result corresponding to the target video segment to be retrieved.
5. The target retrieval method of claim 4, wherein the process of performing similarity calculation on each fused feature vector according to each fused feature vector to be retrieved to obtain a plurality of similarities corresponding to each fused feature vector to be retrieved comprises:
and respectively carrying out similarity calculation on each fusion feature vector according to a second formula and each fusion feature vector to be retrieved to obtain a plurality of similarities corresponding to each fusion feature vector to be retrieved, wherein the second formula is as follows:
cos θ = (T_i(x) · S(x)) / (‖T_i(x)‖ × ‖S(x)‖)
wherein cos θ is the similarity, T_i(x) is the fused feature vector, S(x) is the fused feature vector to be retrieved, and ‖·‖ denotes the vector modulus.
6. The target retrieval method according to claim 1, wherein the target video data to be retrieved includes a plurality of original pedestrian IDs corresponding to the target video segment to be retrieved; when a retrieval result is obtained, the method also comprises the step of predicting the accuracy of the retrieval result, and the process comprises the following steps:
and predicting the accuracy of the original pedestrian IDs and the retrieval results by utilizing a top1 algorithm to obtain the retrieval accuracy.
7. An object retrieval apparatus, comprising:
the system comprises an original data segmentation module, a video segmentation module and a video segmentation module, wherein the original data segmentation module is used for acquiring original video data from monitoring equipment and segmenting the original video data to obtain a plurality of original video segments;
the original video segment processing module is used for respectively carrying out fusion feature analysis on each original video segment to obtain fusion feature vectors corresponding to each original video segment;
the device comprises a to-be-retrieved data segmentation module, a retrieval module and a retrieval module, wherein the to-be-retrieved data segmentation module is used for importing target video data to be retrieved and segmenting the target video data to be retrieved to obtain a plurality of target video segments to be retrieved;
the to-be-retrieved video segment processing module is used for respectively carrying out fusion characteristic analysis on each to-be-retrieved target video segment to obtain to-be-retrieved fusion characteristic vectors corresponding to the to-be-retrieved target video segments;
and the retrieval result obtaining module is used for respectively carrying out similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved to obtain an analysis result corresponding to each target video segment to be retrieved, and taking each analysis result as a retrieval result corresponding to each target video segment to be retrieved.
8. The object retrieval device according to claim 7, wherein each of the original video segments includes a plurality of frame pictures belonging to the same pedestrian ID; the original video segment processing module is specifically configured to:
taking a plurality of frame pictures belonging to the same pedestrian ID as a frame picture unit group, and respectively extracting the features of the plurality of frame pictures in each frame picture unit group through a preset two-dimensional convolutional neural network to obtain a plurality of frame picture features of each group belonging to the same pedestrian ID;
taking a plurality of frame picture features belonging to the same pedestrian ID as a feature unit group, and respectively carrying out feature aggregation processing on the plurality of frame picture features in each feature unit group by utilizing a time modeling algorithm to obtain each group of feature vectors belonging to the same pedestrian ID;
respectively carrying out local feature extraction on each feature vector by utilizing an SSD (Single Shot MultiBox Detector) target detection framework algorithm to obtain local feature vectors corresponding to the feature vectors;
and respectively calculating fusion feature vectors of the feature vectors and the local feature vectors corresponding to the feature vectors to obtain fusion feature vectors corresponding to the original video segments.
9. An object retrieval apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that when the computer program is executed by the processor, the object retrieval method according to any one of claims 1 to 6 is implemented.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out an object retrieval method according to any one of claims 1 to 6.
Background
In recent years, owing to the rapid development of related technologies in the communication and Internet industries and the increasing mobility and light weight of video capture devices, the accumulation, distribution and update of video information have grown explosively. As a result, rapidly extracting, managing and utilizing such unstructured information has become very difficult, and an efficient video retrieval system is urgently needed to automatically extract and archive video content. Search engines play an irreplaceable role in the modern Internet: according to statistics from the authoritative Alexa website, the 10 websites with the highest global Internet traffic all provide a search engine function. In search engine technology, the most common approach is to annotate video content with text and then perform video retrieval based on a database system. However, when a large amount of video information is involved, this approach is slow and costly, and errors and omissions easily occur, because describing images or videos through text leaves an intention gap that is difficult to bridge. For example, tracking and searching for a specific pedestrian target across cameras in a natural scene is very difficult; at present this can only be done by manual search and recording, which is slow and has low search accuracy. Therefore, how to retrieve targets in video quickly and accurately, and how to better meet the requirements of the industry, remain urgent problems to be solved.
Disclosure of Invention
Aiming at the defects of the prior art, the present invention provides a target retrieval method, a target retrieval device and a storage medium.
The technical scheme for solving the technical problems is as follows: a target retrieval method includes the following steps:
acquiring original video data from monitoring equipment, and segmenting the original video data to obtain a plurality of original video segments;
performing fusion feature analysis on each original video segment to obtain a fusion feature vector corresponding to each original video segment;
importing target video data to be retrieved, and segmenting the target video data to be retrieved to obtain a plurality of target video segments to be retrieved;
performing fusion feature analysis on each target video segment to be retrieved respectively to obtain a fusion feature vector to be retrieved corresponding to the target video segment to be retrieved;
and respectively carrying out similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved to obtain an analysis result corresponding to each target video segment to be retrieved, and taking each analysis result as a retrieval result corresponding to each target video segment to be retrieved.
Another technical solution of the present invention for solving the above technical problems is as follows: a target retrieval apparatus comprising:
the system comprises an original data segmentation module, a video segmentation module and a video segmentation module, wherein the original data segmentation module is used for acquiring original video data from monitoring equipment and segmenting the original video data to obtain a plurality of original video segments;
the original video segment processing module is used for respectively carrying out fusion feature analysis on each original video segment to obtain fusion feature vectors corresponding to each original video segment;
the device comprises a to-be-retrieved data segmentation module, a retrieval module and a retrieval module, wherein the to-be-retrieved data segmentation module is used for importing target video data to be retrieved and segmenting the target video data to be retrieved to obtain a plurality of target video segments to be retrieved;
the to-be-retrieved video segment processing module is used for respectively carrying out fusion characteristic analysis on each to-be-retrieved target video segment to obtain to-be-retrieved fusion characteristic vectors corresponding to the to-be-retrieved target video segments;
and the retrieval result obtaining module is used for respectively carrying out similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved to obtain an analysis result corresponding to each target video segment to be retrieved, and taking each analysis result as a retrieval result corresponding to each target video segment to be retrieved.
Another technical solution of the present invention for solving the above technical problems is as follows: an object retrieval apparatus comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the object retrieval method as described above being implemented when the computer program is executed by the processor.
Another technical solution of the present invention for solving the above technical problems is as follows: a computer-readable storage medium, storing a computer program which, when executed by a processor, implements an object retrieval method as described above.
The invention has the beneficial effects that: a plurality of original video segments are obtained by segmenting the original video data, and a plurality of fusion feature vectors are obtained through fusion feature analysis of each original video segment; a plurality of target video segments to be retrieved are obtained by segmenting the target video data to be retrieved, and a plurality of fusion feature vectors to be retrieved are obtained through fusion feature analysis of each target video segment to be retrieved; the retrieval result corresponding to each target video segment to be retrieved is then obtained through similarity analysis of each fusion feature vector against each fusion feature vector to be retrieved. The appearance and facial features of the target to be retrieved are fully fused, so the retrieval accuracy is greatly improved, manpower is saved, targets in a video can be retrieved quickly and accurately, and the requirements of the industry are better met.
Drawings
Fig. 1 is a schematic flow chart of a target retrieval method according to an embodiment of the present invention;
fig. 2 is a block diagram of a target retrieval apparatus according to an embodiment of the present invention.
Detailed Description
The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a schematic flow chart of a target retrieval method according to an embodiment of the present invention.
As shown in fig. 1, a target retrieval method includes the following steps:
acquiring original video data from monitoring equipment, and segmenting the original video data to obtain a plurality of original video segments;
performing fusion feature analysis on each original video segment to obtain a fusion feature vector corresponding to each original video segment;
importing target video data to be retrieved, and segmenting the target video data to be retrieved to obtain a plurality of target video segments to be retrieved;
performing fusion feature analysis on each target video segment to be retrieved respectively to obtain a fusion feature vector to be retrieved corresponding to the target video segment to be retrieved;
and respectively carrying out similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved to obtain an analysis result corresponding to each target video segment to be retrieved, and taking each analysis result as a retrieval result corresponding to each target video segment to be retrieved.
It should be understood that the monitoring device is an actual monitoring device at a specific urban intersection. The monitoring device must be installed in the middle of the intersection, should be about 3 meters above the ground, and its shooting angle to the ground is 45 degrees.
It should be understood that the original video data is taken as input, and k original video segments { Ck } are output after the original video data is segmented through an algorithm; and taking the target video data to be retrieved as input, and segmenting the target video data to be retrieved through an algorithm to output k target video segments { Ck } to be retrieved.
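For illustration only, the segmentation step could be sketched as follows; the use of OpenCV, the function name split_video and the fixed segment length are assumptions made for this sketch and are not part of the disclosure.

```python
# Illustrative sketch only: cutting a video into continuous, non-overlapping T-frame
# segments {C_k}. The use of OpenCV and the function name are assumptions.
import cv2

def split_video(path, frames_per_segment=16):
    """Return a list of segments, each a list of T consecutive frames."""
    cap = cv2.VideoCapture(path)
    frames, segments = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        if len(frames) == frames_per_segment:
            segments.append(frames)  # one segment C_k with T frames
            frames = []
    cap.release()
    return segments  # a trailing incomplete segment is simply dropped in this sketch
```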
In this embodiment, a plurality of original video segments are obtained by segmenting the original video data, and a plurality of fusion feature vectors are obtained through fusion feature analysis of each original video segment; a plurality of target video segments to be retrieved are obtained by segmenting the target video data to be retrieved, and a plurality of fusion feature vectors to be retrieved are obtained through fusion feature analysis of each target video segment to be retrieved; the retrieval result corresponding to each target video segment to be retrieved is then obtained through similarity analysis of each fusion feature vector against each fusion feature vector to be retrieved. The appearance and facial features of the target to be retrieved are fully fused, so the retrieval accuracy is greatly improved, manpower is saved, targets in the video can be retrieved quickly and accurately, and the requirements of the industry are better met.
Optionally, as an embodiment of the present invention, each of the original video segments includes a plurality of frame pictures belonging to the same pedestrian ID; the process of respectively performing fusion feature analysis on each original video segment to obtain a fusion feature vector corresponding to each original video segment includes:
taking a plurality of frame pictures belonging to the same pedestrian ID as a frame picture unit group, and respectively extracting the features of the plurality of frame pictures in each frame picture unit group through a preset two-dimensional convolutional neural network to obtain a plurality of frame picture features of each group belonging to the same pedestrian ID;
taking a plurality of frame picture features belonging to the same pedestrian ID as a feature unit group, and respectively carrying out feature aggregation processing on the plurality of frame picture features in each feature unit group by utilizing a time modeling algorithm to obtain each group of feature vectors belonging to the same pedestrian ID;
respectively carrying out local feature extraction on each feature vector by utilizing an SSD (Single Shot MultiBox Detector) target detection framework algorithm to obtain local feature vectors corresponding to the feature vectors;
and respectively calculating fusion feature vectors of the feature vectors and the local feature vectors corresponding to the feature vectors to obtain fusion feature vectors corresponding to the original video segments.
It should be understood that a long video is cut into k continuous, non-overlapping original video segments {C_k}; each original video segment contains T frames (i.e., the frame pictures), and each small video segment (i.e., each original video segment) contains only one pedestrian (i.e., all of its frames belong to the same pedestrian ID).
Specifically, a series of image-level features (i.e., the T frame picture features) are aggregated into a single video-segment-level feature (i.e., the feature vector), and the extracted feature contains time information and pedestrian information. This comprises three parts: an image-level feature extractor, a temporal modeling method (i.e., the time modeling algorithm) that aggregates the temporal features, and a loss function. The processing steps are as follows:
1. Extract the feature vector of each frame picture (i.e., the frame picture feature) using the image-level feature extractor.
2. Aggregate the extracted features (i.e., the frame picture features) into the feature of the video sequence (i.e., the feature vector) using the temporal modeling method (i.e., the time modeling algorithm).
It should be understood that the feature extractor uses a 2D CNN (two-dimensional convolutional neural network), namely the standard ResNet-50 model, as the image-level feature extractor. The input is a series of picture frames, and the output after the feature extractor is a series of image-level features {f_ci^t, t ∈ [1, n]} (i.e., a number of said frame picture features), which form an n × D matrix, where n is the length of the video sequence and D is the dimension of the output image-level feature vector.
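A per-frame feature extractor of this kind could be sketched as follows; the choice of PyTorch/torchvision and the function name extract_frame_features are assumptions for illustration only.

```python
# Illustrative sketch: a standard ResNet-50 used as the image-level feature extractor.
# The torchvision model choice and the preprocessing are assumptions of this sketch.
import torch
from torchvision import models

backbone = models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()  # expose the 2048-d pooled feature of each frame
backbone.eval()

def extract_frame_features(frames):
    """frames: float tensor [T, 3, H, W]; returns [T, 2048] image-level features f_ci^t."""
    with torch.no_grad():
        return backbone(frames)
```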
It should be understood that SSD stands for Single Shot MultiBox Detector. This target detection framework has an obvious speed advantage over Faster R-CNN and an obvious mAP advantage over YOLO. It inherits from YOLO the idea of converting detection into regression, completing target localization and classification in a single pass; drawing on the anchor points in Faster R-CNN, it provides similar prior boxes; and it adds a detection mode based on a feature pyramid, i.e., predicting the target on feature maps with different receptive fields.
Specifically, the temporal modeling method adopts temporal attention pooling. Temporal attention pooling makes full use of all the image-level features by applying an attention-weighted average to them: frames in which the pedestrian is of high quality are given high weights, pictures in which the pedestrian is of low quality are given low weights, and a weighted summation is carried out. The formula is as follows:
f_ci = Σ_{t=1}^{T} a_ci^t × f_ci^t
wherein f_ci represents the sequence feature (i.e., the feature vector), which contains time information and pedestrian information, f_ci^t represents the feature of frame t, and a_ci^t, t ∈ [1, T], is the attention of frame t for a given segment C_i.
The tensor output by the last convolutional layer of ResNet-50 has size [2048, w, h], where w and h depend on the input picture size. The attention-generating network takes a series of image-level features of size [T, 2048, w, h] as input and outputs T attention scores.
To produce the above attention scores/weights, a spatial convolution layer is applied (convolution kernel width w, height h, 2048 input channels, d_t output channels), denoted [w, h, 2048, d_t]. The output of this convolution layer is followed by a temporal convolution layer with d_t input channels, 1 output channel and a kernel size of 3, denoted [3, 3, d_t, 1]. The final output is a scalar s_c^t, t ∈ [1, T], the importance score of frame t for segment C.
Once the temporal attention score s_c^t is obtained, the final attention score a_c^t can be calculated by the Softmax function:
a_c^t = exp(s_c^t) / Σ_{j=1}^{T} exp(s_c^j)
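A minimal sketch of this temporal attention pooling, assuming PyTorch and assuming the ResNet feature map exactly matches the spatial convolution kernel size, might look as follows; the module name, the default d_t value and the spatial size are illustrative choices.

```python
# Minimal sketch of temporal attention pooling in PyTorch. Layer shapes follow the
# description above; the default d_t and the fixed spatial size are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAttentionPooling(nn.Module):
    def __init__(self, in_channels=2048, dt=256, spatial_size=(8, 4)):
        super().__init__()
        # spatial convolution [w, h, 2048, d_t]; the kernel covers the whole feature map
        self.spatial = nn.Conv2d(in_channels, dt, kernel_size=spatial_size)
        # temporal convolution [3, 3, d_t, 1]: d_t input channels, 1 output channel, kernel 3
        self.temporal = nn.Conv1d(dt, 1, kernel_size=3, padding=1)

    def forward(self, frame_maps):
        # frame_maps: [T, 2048, h, w] conv maps of one segment, with (h, w) == spatial_size
        s = self.spatial(frame_maps).flatten(1)           # [T, d_t]
        s = self.temporal(s.t().unsqueeze(0)).squeeze()   # importance scores s_c^t, shape [T]
        a = F.softmax(s, dim=0)                           # attention scores a_c^t
        frame_feats = frame_maps.mean(dim=(2, 3))         # [T, 2048] per-frame features f_ci^t
        return (a.unsqueeze(1) * frame_feats).sum(dim=0)  # weighted sum -> segment feature f_ci
```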
it will be appreciated that the feature f will be obtainedciThe local feature extraction (namely the feature vector) is further carried out, and the local feature which is most distinctive for the pedestrian is the face, so the local feature extraction of the patent aims at the face feature, then the feature is extracted for the face, and g is usediRepresents local features (i.e., the local feature vector), which are information containing face and temporal features.
In the above embodiment, the fusion feature vectors corresponding to the original video segments are obtained by analyzing the fusion features of the original video segments, so as to provide basic data for subsequent processing, and fully fuse the appearance and facial features of the target to be retrieved, thereby greatly improving the retrieval accuracy, realizing rapid and accurate target retrieval in the video, and better meeting the requirements of the industry.
Optionally, as an embodiment of the present invention, the calculating of the fusion feature vector for each feature vector and the local feature vector corresponding to each feature vector to obtain the fusion feature vector corresponding to each original video segment includes:
calculating fusion eigenvectors of each eigenvector and the local eigenvector corresponding to each eigenvector respectively through a first formula to obtain fusion eigenvectors corresponding to each original video segment, wherein the first formula is as follows:
T(fc_i, g_i) = cov(fc_i, g_i) / sqrt(D(fc_i) × D(g_i))
wherein T(fc_i, g_i) is the ith fused feature vector, fc_i is a feature vector, g_i is the corresponding local feature vector, cov(fc_i, g_i) is the covariance of the feature vector fc_i and the local feature vector g_i, D(fc_i) is the variance of the feature vector fc_i, D(g_i) is the variance of the local feature vector g_i, and the value range of T is [-1, 1].
It should be understood that an algorithm programmed in the MATLAB language performs a linear correlation calculation between the sequence feature f_ci (i.e., the feature vector) and the local feature g_i (i.e., the local feature vector) to obtain a fusion feature T_i(x) (i.e., the fused feature vector) that fully fuses the two features, making the features more discriminative; this module's fused features are used for the later retrieval.
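Reading the first formula as the linear correlation described above, the fusion step could be sketched as follows; the use of NumPy, the function name fuse_features and the assumption that both vectors have the same length are illustrative choices.

```python
# Minimal sketch of the fusion step, read as the linear correlation described above:
# T(f_ci, g_i) = cov(f_ci, g_i) / sqrt(D(f_ci) * D(g_i)). NumPy usage is an assumption.
import numpy as np

def fuse_features(f_ci, g_i):
    """Return the fusion value T(f_ci, g_i), which lies in [-1, 1]."""
    f_ci = np.asarray(f_ci, dtype=float)
    g_i = np.asarray(g_i, dtype=float)
    cov = np.mean((f_ci - f_ci.mean()) * (g_i - g_i.mean()))  # cov(f_ci, g_i)
    return cov / np.sqrt(f_ci.var() * g_i.var())              # / sqrt(D(f_ci) * D(g_i))
```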
In this embodiment, the fused feature vector corresponding to each original video segment is obtained by calculating, through the first formula, the fused feature vector of each feature vector and its corresponding local feature vector, so that the features are more discriminative, a basis is provided for later retrieval, rapid and accurate target retrieval in video is achieved, and the requirements of the industry are better met.
Optionally, as an embodiment of the present invention, the process of performing similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved, to obtain an analysis result corresponding to each target video segment to be retrieved includes:
respectively carrying out similarity calculation on each fusion feature vector according to each fusion feature vector to be retrieved to obtain a plurality of similarities corresponding to each fusion feature vector to be retrieved;
and respectively carrying out maximum value screening on a plurality of similarities corresponding to the feature vectors to be retrieved to obtain the maximum similarity corresponding to the feature vectors to be retrieved, and taking the pedestrian ID belonging to the maximum similarity corresponding to the feature vectors to be retrieved as the analysis result corresponding to the target video segment to be retrieved.
It should be appreciated that the cosine similarity between each fused feature T_i (i.e., the fusion feature vector) and the query fused feature S(x) (i.e., the fusion feature vector to be retrieved) is computed by traversal, and the class (i.e., the pedestrian ID) with the highest feature similarity is the class (i.e., the pedestrian ID) matched to the query picture.
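A minimal sketch of this traversal, assuming the gallery is held as a list of (pedestrian ID, fused feature) pairs, is shown below; the data layout and function name are illustrative.

```python
# Sketch of the retrieval step: cosine similarity between the query fused feature S(x)
# and every gallery fused feature T_i(x); the gallery layout is an assumption.
import numpy as np

def retrieve(query_feature, gallery):
    """Return the pedestrian ID with the highest cosine similarity to the query."""
    q = np.asarray(query_feature, dtype=float)
    best_id, best_sim = None, -1.0
    for pedestrian_id, feature in gallery:
        f = np.asarray(feature, dtype=float)
        cos_theta = np.dot(q, f) / (np.linalg.norm(q) * np.linalg.norm(f))
        if cos_theta > best_sim:  # maximum-value screening over the similarities
            best_id, best_sim = pedestrian_id, cos_theta
    return best_id, best_sim
```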
In the above embodiment, the similarity of each fusion feature vector is analyzed according to each fusion feature vector to be retrieved to obtain the analysis result corresponding to each target video segment to be retrieved, so that the retrieval accuracy can be greatly improved, the target can be quickly and accurately retrieved in the video, and the requirements of the industry are better met.
Optionally, as an embodiment of the present invention, the process of performing similarity calculation on each fusion feature vector according to each fusion feature vector to be retrieved, to obtain a plurality of similarities corresponding to each fusion feature vector to be retrieved includes:
and respectively carrying out similarity calculation on each fusion feature vector according to a second formula and each fusion feature vector to be retrieved to obtain a plurality of similarities corresponding to each fusion feature vector to be retrieved, wherein the second formula is as follows:
cos θ = (T_i(x) · S(x)) / (‖T_i(x)‖ × ‖S(x)‖)
wherein cos θ is the similarity, T_i(x) is the fused feature vector, S(x) is the fused feature vector to be retrieved, and ‖·‖ denotes the vector modulus.
In the above embodiment, the similarity between each fused feature vector and each fused feature vector to be retrieved is calculated according to the second formula to obtain a plurality of similarities corresponding to each fused feature vector to be retrieved, so that the retrieval accuracy can be greatly improved, rapid and accurate target retrieval in video is achieved, and the requirements of the industry are better met.
Optionally, as an embodiment of the present invention, the target video data to be retrieved includes a plurality of original pedestrian IDs corresponding to the target video segment to be retrieved; when a retrieval result is obtained, the method also comprises the step of predicting the accuracy of the retrieval result, and the process comprises the following steps:
and predicting the accuracy of the original pedestrian IDs and the retrieval results by utilizing a top1 algorithm to obtain the retrieval accuracy.
It should be appreciated that top-1 is used to calculate the matching accuracy of the retrieved pictures.
Specifically, Top-1 (i.e., the top1 algorithm) means that the predicted label is the class with the largest value in the final probability vector, and the prediction is counted as correct only if this highest-probability class is the correct class. For example, when predicting a picture (e.g., with the one thousand categories of ImageNet), the 1000 categories are ranked from high to low by probability, and top-1 accuracy refers to the proportion of cases in which the first-ranked category matches the picture's true category.
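For the single-label matching described here, top-1 accuracy reduces to the fraction of queries whose highest-similarity pedestrian ID equals the original pedestrian ID; a minimal sketch follows, with the list-based layout being an assumption.

```python
# Sketch of the top-1 accuracy check described above: a query is counted as correct
# when its highest-similarity pedestrian ID equals the original pedestrian ID.
def top1_accuracy(original_ids, retrieved_ids):
    """Both arguments are equal-length lists of pedestrian IDs, one entry per query."""
    correct = sum(1 for gt, pred in zip(original_ids, retrieved_ids) if gt == pred)
    return correct / len(original_ids)
```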
In the above embodiment, the top1 algorithm is used to predict the accuracy of a plurality of original pedestrian IDs and a plurality of search results to obtain the search accuracy, so that the verification of the search results is realized, and the requirements of the industry are better met.
Optionally, as an embodiment of the present invention, the method further includes storing each of the feature vectors, each of the local feature vectors, and each of the fused feature vectors, where the process of storing each of the feature vectors, each of the local feature vectors, and each of the fused feature vectors includes:
and establishing a plurality of databases corresponding to the pedestrian IDs, and respectively storing each feature vector, each local feature vector and each fusion feature vector into the database corresponding to the pedestrian IDs according to the pedestrian IDs.
It should be understood that the database needs special handling, because the fused feature values (i.e., the fused feature vectors) are computed using the MATLAB language, whereas f_ci (i.e., the feature vectors) and g_i (i.e., the local feature vectors) are feature value files that all exist in JSON format. Using JSON in MATLAB requires downloading the jsonlab library; this is done so that the feature files can be passed smoothly as input to the fusion algorithm.
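Purely as an illustration of the JSON-format feature files mentioned above, one possible per-pedestrian-ID storage layout is sketched below in Python; the directory structure, file naming and key names are assumptions.

```python
# Illustration only: one possible JSON layout for the per-pedestrian-ID feature files;
# the directory structure, file naming and key names are assumptions.
import json
import os

def save_features(root, pedestrian_id, feature_vector, local_feature_vector, fused_feature_vector):
    """Write one pedestrian ID's feature vectors into its own JSON file."""
    os.makedirs(root, exist_ok=True)
    record = {
        "pedestrian_id": pedestrian_id,
        "feature_vector": [float(v) for v in feature_vector],
        "local_feature_vector": [float(v) for v in local_feature_vector],
        "fused_feature_vector": [float(v) for v in fused_feature_vector],
    }
    with open(os.path.join(root, f"{pedestrian_id}.json"), "w") as fh:
        json.dump(record, fh)
```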
Optionally, as an embodiment of the present invention, the step of performing fusion feature analysis on the target video data to be retrieved to obtain a fusion feature vector to be retrieved includes:
acquiring a target to be retrieved (namely the target video data to be retrieved), then performing data acquisition and video processing, and processing the video into k video segments {C_k}, wherein each video segment comprises T frames and contains only one pedestrian;
aggregating a series of image-level features into a video-segment-level feature v_i;
further extracting local (temporal and face) features from the obtained feature v_i, and recording the extracted features as c_i;
fusing the extracted local feature c_i with the feature v_i, wherein the fused feature is S(x) (namely the fusion feature vector to be retrieved).
Fig. 2 is a block diagram of a target retrieval apparatus according to an embodiment of the present invention.
Alternatively, as another embodiment of the present invention, as shown in fig. 2, an object retrieval apparatus includes:
the system comprises an original data segmentation module, a video segmentation module and a video segmentation module, wherein the original data segmentation module is used for acquiring original video data from monitoring equipment and segmenting the original video data to obtain a plurality of original video segments;
the original video segment processing module is used for respectively carrying out fusion feature analysis on each original video segment to obtain fusion feature vectors corresponding to each original video segment;
the device comprises a to-be-retrieved data segmentation module, a retrieval module and a retrieval module, wherein the to-be-retrieved data segmentation module is used for importing target video data to be retrieved and segmenting the target video data to be retrieved to obtain a plurality of target video segments to be retrieved;
the to-be-retrieved video segment processing module is used for respectively carrying out fusion characteristic analysis on each to-be-retrieved target video segment to obtain to-be-retrieved fusion characteristic vectors corresponding to the to-be-retrieved target video segments;
and the retrieval result obtaining module is used for respectively carrying out similarity analysis on each fusion feature vector according to each fusion feature vector to be retrieved to obtain an analysis result corresponding to each target video segment to be retrieved, and taking each analysis result as a retrieval result corresponding to each target video segment to be retrieved.
Optionally, as an embodiment of the present invention, each of the original video segments includes a plurality of frame pictures belonging to the same pedestrian ID; the original video segment processing module is specifically configured to:
taking a plurality of frame pictures belonging to the same pedestrian ID as a frame picture unit group, and respectively extracting the features of the plurality of frame pictures in each frame picture unit group through a preset two-dimensional convolutional neural network to obtain a plurality of frame picture features of each group belonging to the same pedestrian ID;
taking a plurality of frame picture features belonging to the same pedestrian ID as a feature unit group, and respectively carrying out feature aggregation processing on the plurality of frame picture features in each feature unit group by utilizing a time modeling algorithm to obtain each group of feature vectors belonging to the same pedestrian ID;
respectively carrying out local feature extraction on each feature vector by utilizing an SSD (Single Shot MultiBox Detector) target detection framework algorithm to obtain local feature vectors corresponding to the feature vectors;
and respectively calculating fusion feature vectors of the feature vectors and the local feature vectors corresponding to the feature vectors to obtain fusion feature vectors corresponding to the original video segments.
Alternatively, another embodiment of the present invention provides an object retrieval apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the object retrieval method as described above is implemented. The device may be a computer or the like.
Alternatively, another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the object retrieval method as described above.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention essentially or partially contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.