Real-time target detection method and device for lightweight image and video data
1. A real-time target detection method for lightweight image and video data, characterized by comprising the following steps:
1) acquiring data to be detected;
2) inputting the data to be detected into a trained target detection model to obtain a target identification result for the data to be detected; the target detection model comprises a feature extraction part and a detection end part, wherein the feature extraction part adopts a YOLO-Lite network in which a residual block and a parallel connection structure are additionally arranged for fusing the deep features and shallow features obtained by the YOLO-Lite network, so as to output feature maps at different scales; the detection end part comprises convolutional layers and splicing (concatenation) layers and is used for fusing the feature maps of different scales obtained by the feature extraction part and generating corresponding prediction results on the feature maps of each scale.
2. The method of claim 1, wherein the feature extraction part comprises 3 × 3 convolutional layers, a 1 × 1 convolutional layer, residual blocks, upsampling layers, and pooling layers; the 3 × 3 convolutional layers are used for feature extraction from the image data, the pooling layers are disposed between the convolutional layers and resample the extracted features to reduce the dimensionality of the features extracted by the convolutional layers, the residual blocks continuously transfer shallow features to the deep layers, and the upsampling layers restore the spatial size of the feature maps.
3. The method of claim 2, wherein the residual block comprises a 1 × 1 convolutional layer and a 3 × 3 convolutional layer.
4. The method of claim 1, wherein the parallel connection structure is configured to perform multi-resolution reconstruction of deep and shallow features at multiple scales, so that the feature maps at each scale carry both deep and shallow features.
5. The method of claim 1, wherein the detection end part comprises three detection modules, each comprising convolutional layers and a splicing layer, and the input of the splicing layer of each detection module is connected to different convolutional layers and pooling layers of the feature extraction part so as to fuse feature maps of different scales.
6. A real-time target detection apparatus for lightweight image and video data, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor is coupled to the memory and, when executing the computer program, implements the real-time target detection method for lightweight image and video data according to any one of claims 1 to 5.
Background
In recent years, target detection based on convolutional neural networks has been a popular research direction in computer vision. It focuses on target localization and classification, and its results are widely applied to face recognition, pose estimation, and various intelligent applications. Current convolutional neural network structures are developing in deeper and more complex directions; although their accuracy can reach or even exceed the level of human vision, they often involve enormous computation and very high energy consumption, making them impractical on GPU-free and mobile devices. With the development of embedded and mobile intelligent devices with limited computing power and power budgets, such as unmanned vehicles, small intelligent unmanned aerial vehicles, and augmented reality glasses, lightweight real-time network models have become a key research topic for convolutional neural network target detection on mobile terminals.
Recent studies fall into two directions: some researchers improve detection accuracy by constructing increasingly complex neural networks, such as ResNet (Deep Residual Networks), YOLOv3, and HRNet (High-Resolution Networks), while others construct small and efficient lightweight neural networks by optimizing various structures, such as MobileNet V1, MobileNet V2, Tiny-YOLO, YOLO-Lite, and MTYOLO. The end-to-end, regression-based deep learning target detection methods of the YOLO and SSD families achieve real-time target detection on GPU-equipped computers while maintaining relatively high average precision, but their large computational cost makes real-time, accurate detection difficult on GPU-free computers and portable devices with limited computing capacity.
Disclosure of Invention
The invention aims to provide a real-time target detection method and device for lightweight image and video data, in order to solve the problems of complex models and heavy computation in current real-time target detection.
To solve this technical problem, the invention provides a real-time target detection method for lightweight image and video data, comprising the following steps:
1) acquiring data to be detected;
2) inputting the data to be detected into a trained target detection model to obtain a target identification result for the data to be detected; the target detection model comprises a feature extraction part and a detection end part, wherein the feature extraction part adopts a YOLO-Lite network in which a residual block and a parallel connection structure are additionally arranged for fusing the deep features and shallow features obtained by the YOLO-Lite network, so as to output feature maps at different scales; the detection end part comprises convolutional layers and splicing (concatenation) layers and is used for fusing the feature maps of different scales obtained by the feature extraction part and generating corresponding prediction results on the feature maps of each scale.
The invention also provides a real-time target detection device for lightweight image and video data, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor is coupled to the memory and, when executing the computer program, implements the above real-time target detection method for lightweight image and video data.
By adding a residual block and a parallel connection structure to the YOLO-Lite backbone network, the invention fuses deep features and shallow features, outputs feature maps at different scales, and makes maximal use of the original features; at the same time, the feature maps of different scales are fused and corresponding prediction results are generated at each scale. Compared with the YOLOv3 structure, the structure of the invention is shallower and narrower, has fewer trainable parameters, a markedly reduced computational cost, and a faster running speed; compared with the YOLO-Lite structure, it greatly improves detection precision at a modest cost in running speed, and it lowers the requirements on hardware.
Further, the feature extraction part comprises 3 × 3 convolutional layers, a 1 × 1 convolutional layer, residual blocks, upsampling layers, and pooling layers, wherein the 3 × 3 convolutional layers extract features from the image data, the pooling layers are arranged between the convolutional layers and resample the extracted features to reduce their dimensionality, the residual blocks continuously transfer shallow features to the deep layers, and the upsampling layers restore the spatial size of the feature maps.
Further, the residual block includes a 1 × 1 convolutional layer and a 3 × 3 convolutional layer.
Further, the parallel connection structure performs multi-resolution reconstruction of the deep and shallow features at multiple scales, so that the feature maps at each scale carry both deep and shallow features.
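By way of illustration only, the following minimal PyTorch sketch shows one way such multi-resolution fusion can be realized, assuming three feature maps at successively halved resolutions; the channel counts, module name, and nearest-neighbor upsampling are illustrative assumptions, not the exact configuration of the claimed network.

```python
# Minimal sketch of multi-resolution fusion across three scales.
# Channel counts are assumptions, not the patented network's exact values.
import torch.nn as nn
import torch.nn.functional as F

class ParallelFusion(nn.Module):
    """Fuse deep (low-resolution) and shallow (high-resolution) features."""
    def __init__(self, ch_shallow=64, ch_mid=128, ch_deep=256):
        super().__init__()
        # 1x1 convolutions align channel counts before fusion
        self.align_mid = nn.Conv2d(ch_mid, ch_shallow, kernel_size=1)
        self.align_deep = nn.Conv2d(ch_deep, ch_shallow, kernel_size=1)

    def forward(self, shallow, mid, deep):
        # Upsample the deeper maps to the shallow map's resolution and sum,
        # so the output scale carries both deep and shallow information.
        h, w = shallow.shape[2:]
        mid_up = F.interpolate(self.align_mid(mid), size=(h, w), mode="nearest")
        deep_up = F.interpolate(self.align_deep(deep), size=(h, w), mode="nearest")
        return shallow + mid_up + deep_up
```

The same pattern, applied symmetrically at each scale, yields feature maps that all contain both deep and shallow features.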
Furthermore, the detection end part comprises three detection modules, each comprising convolutional layers and a splicing layer, and the input of the splicing layer of each detection module is connected to different convolutional layers and pooling layers of the feature extraction part, so as to fuse feature maps of different scales.
Drawings
FIG. 1 is a schematic diagram of a Mixed YOLOv3-Lite network structure adopted by the real-time target detection method of the present invention;
FIG. 2 is a schematic diagram of the structure of a residual block used in the present invention;
FIG. 3 is a schematic diagram of a HRNet network structure adopted by the present invention;
FIG. 4 shows some detection results of the present invention on the PASCAL VOC 2007 test set;
FIG. 5 is a graph comparing the effect of the present invention on the VisDrone2018-Det data set with the prior art detection model;
FIG. 6-a is a schematic diagram of the result of static image detection on VisDrone2018-Det Val according to the present invention;
FIG. 6-b is a schematic diagram showing the result of dynamic image detection on VisDrone2018-Det Val according to the present invention;
FIG. 6-c is a schematic diagram of the results of orthographic (overhead) image detection on VisDrone2018-Det Val according to the present invention;
FIG. 6-d is a schematic diagram of the detection results under poor illumination on VisDrone2018-Det Val according to the present invention.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings.
Embodiments of the detection method
Building on existing target detection algorithms, and in order to reduce the computational cost and make the method usable on GPU-free computers and portable equipment, the invention provides a new real-time target detection method for lightweight image and video data. The detection method adopts a Mixed YOLOv3-Lite network built on the YOLO-Lite network. YOLO-Lite is a "shallow and narrow" network: compared with deeper networks, its computation and parameter counts are reduced substantially and its detection speed is improved significantly. The backbone network of YOLO-Lite consists of 7 convolutional layers and 5 max-pooling layers, with the structure shown in Table 1: six 3 × 3 convolutional layers for feature extraction, one 1 × 1 convolutional layer for dimensionality reduction of the extracted features, and five max-pooling layers for feature compression.
TABLE 1
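For concreteness, a minimal PyTorch sketch of such a backbone (six 3 × 3 convolutions, one 1 × 1 convolution, five max-pooling layers) follows; the channel widths are illustrative assumptions taken from typical YOLO-Lite configurations, not necessarily the exact values of Table 1.

```python
# Sketch of a YOLO-Lite-style backbone; channel widths are assumptions.
import torch.nn as nn

def conv3x3(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1),
    )

yolo_lite_backbone = nn.Sequential(
    conv3x3(3, 16),    nn.MaxPool2d(2),  # 224 -> 112
    conv3x3(16, 32),   nn.MaxPool2d(2),  # 112 -> 56
    conv3x3(32, 64),   nn.MaxPool2d(2),  # 56  -> 28
    conv3x3(64, 128),  nn.MaxPool2d(2),  # 28  -> 14
    conv3x3(128, 128), nn.MaxPool2d(2),  # 14  -> 7
    conv3x3(128, 256),
    nn.Conv2d(256, 128, kernel_size=1),  # 1x1 layer compresses the features
)
```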
Although the YOLO-Lite network has low computation and parameter counts and a greatly improved processing speed, its accuracy is low; the invention therefore adds resblocks (residual blocks) and the parallel connection structure of HRNet on top of the YOLO-Lite network. Specifically, as shown in FIG. 1, the Mixed YOLOv3-Lite network of the invention includes a feature extraction part and a detection end part. The feature extraction part is formed by adding resblocks and the HRNet parallel connection structure to the YOLO-Lite backbone, and comprises twelve 3 × 3 convolutional layers, one 1 × 1 convolutional layer, 3 residual blocks, 3 upsampling layers, and 8 max-pooling layers, giving it high detection performance. The convolutional layers are connected in sequence, with the max-pooling layers, residual blocks, and upsampling layers inserted between them.
The residual structure adopted in the method is shown in FIG. 2 and is consistent with the residual structure in YOLOv3, where ReLU is the activation function. Adding shortcuts to the network addresses the problem that, in a VGG-style network, model accuracy stops rising or even falls once the number of layers grows beyond a certain point. The principle of the parallel connection (parallel high-to-low resolution subnetworks) is shown in FIG. 3, where the dashed box marks the parallel connection structure. In the invention, the parallel connection performs resolution reconstruction and fusion on three feature maps of different scales and outputs each of them to the detection end for target detection, thereby improving the detection accuracy of the network.
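A minimal sketch of this residual structure is given below: a 1 × 1 convolution followed by a 3 × 3 convolution, with a shortcut (identity) connection and ReLU activations, as in FIG. 2. The bottleneck halving the channel count is an assumption borrowed from the YOLOv3 convention.

```python
# Sketch of the FIG. 2 residual block: 1x1 conv, 3x3 conv, shortcut, ReLU.
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        hidden = channels // 2  # bottleneck width (assumed, YOLOv3-style)
        self.conv1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.conv2 = nn.Conv2d(hidden, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + x)  # shortcut passes shallow features onward
```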
The detection end part comprises 3 detection modules, denoted prediction one, prediction two, and prediction three. Prediction one comprises a splicing layer, a first convolutional layer, a second convolutional layer, a third convolutional layer, and a convolution block connected in sequence. The splicing layer fuses the feature data of the fifth pooling layer, the seventh pooling layer, and the 1 × 1 convolutional layer of the feature extraction part and outputs the fused features to the first convolutional layer of prediction one; the first and second convolutional layers both use 1 × 1 convolution kernels, the third convolutional layer uses a 3 × 3 convolution kernel, and after the three-layer convolution the features are passed to the convolution block, which produces the prediction for image features at this scale. Prediction two and prediction three are similar in structure and function, except that the inputs of their splicing layers differ. In addition, prediction two and prediction three each have one more splicing layer, placed between the two 1 × 1 convolution kernels: for prediction two, the output of the second convolutional layer of prediction one is fused with the output of the first convolutional layer of prediction two; similarly, for prediction three, the output of the second convolutional layer of prediction two is fused with the output of the first convolutional layer of prediction three.
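The following sketch illustrates the prediction-one pattern (splicing layer, two 1 × 1 convolutions, one 3 × 3 convolution, prediction convolution). The channel counts, anchor count, and class count are illustrative assumptions, and the three fused inputs are assumed to have already been brought to the same spatial resolution by the feature extraction part.

```python
# Sketch of a "prediction one"-style detection module.
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, c_in1, c_in2, c_in3, num_anchors=3, num_classes=20):
        super().__init__()
        c_cat = c_in1 + c_in2 + c_in3
        self.conv1 = nn.Conv2d(c_cat, 128, kernel_size=1)
        self.conv2 = nn.Conv2d(128, 128, kernel_size=1)
        self.conv3 = nn.Conv2d(128, 256, kernel_size=3, padding=1)
        # each anchor predicts 4 box offsets + 1 objectness + class scores
        self.pred = nn.Conv2d(256, num_anchors * (5 + num_classes), kernel_size=1)

    def forward(self, f_pool5, f_pool7, f_1x1):
        # splicing layer: inputs must share spatial size (assumed here)
        x = torch.cat([f_pool5, f_pool7, f_1x1], dim=1)
        x = self.conv3(self.conv2(self.conv1(x)))
        return self.pred(x)
```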
After the network structure is established, it is used as the target detection model: the model is trained on known image and video datasets to obtain a trained target detection model, and real-time image and video data to be detected are then input into the trained model to detect targets in real time. The method can be applied to fields such as intelligent vehicle control, enabling intelligent driving by detecting target objects (obstacles) on the road in real time.
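An illustrative use of a trained model on a live video stream might look as follows; the weight file name, the frame source, and the 224 × 224 input size follow the PASCAL VOC setting described later and are placeholders for this sketch.

```python
# Hypothetical real-time inference loop; file name and camera are assumptions.
import cv2
import torch

model = torch.load("mixed_yolov3_lite.pt", map_location="cpu")
model.eval()

cap = cv2.VideoCapture(0)  # camera or video stream
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    blob = cv2.resize(frame, (224, 224))
    x = torch.from_numpy(blob).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        detections = model(x)  # per-scale prediction tensors
    # ...decode boxes, apply NMS, draw results on the frame...
cap.release()
```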
Device embodiment
The detection device of the invention includes a memory, a processor, and a computer program stored in the memory and runnable on the processor; the processor is coupled to the memory and implements the real-time target detection method of the invention when executing the computer program. The processor may belong to a GPU-free device, a mobile terminal, or the like.
Experimental examples
To better illustrate the effect of the invention, the method is verified below on the PASCAL VOC and VisDrone2018-Det datasets. The experimental hardware platform is a server with an Intel i7-9700K CPU, an NVIDIA RTX2080Ti GPU, and 48 GB of RAM, used mainly for network model training; performance tests in a GPU-free environment were run on the same machine with the GPU disabled. In addition, an NVIDIA Jetson AGX Xavier was used as the embedded mobile terminal for performance testing; it is configured with an 8-core ARM v8.2 64-bit CPU, a 512-core Volta GPU of NVIDIA's own design, and 16 GB of RAM.
The PASCAL VOC dataset is a public target detection dataset containing 20 target classes. The experiment was trained and tested using a mixed dataset from PASCAL VOC 2007 and 2012, in which the training set contains 16,511 images and the test set contains 4,592 images. VisDrone2018-Det is a large dataset collected by unmanned aerial vehicles, with rich and diverse scenes and environmental elements; it comprises 8,599 images (6,471 for training, 548 for validation, and 1,580 for testing) with rich annotations, including object bounding boxes, object categories, occlusion, truncation ratio, and so on. Since the annotations of the training and validation sets are public, they are used as the training set and test set, respectively, in this experiment. Data statistics for the PASCAL VOC and VisDrone datasets are shown in Table 2.
TABLE 2
Mixed YOLOv3-Lite was trained for 60 epochs on the PASCAL VOC 2007 & 2012 training set, and the final model parameters were obtained after the loss function converged. The input image size for both training and testing was set to 224 × 224, consistent with YOLO-Lite. Since no evaluation data for YOLOv3 on the PASCAL VOC dataset has been published, YOLOv3 was trained for 60 epochs under the same experimental environment with the same parameter settings, for comparison as the baseline model. Mean average precision (mAP), precision, recall, and F1 score are used to evaluate the detection effect of the models; FLOPs, parameter count, and model size are used to evaluate model complexity; and overall performance is reflected in the frame rate (FPS) metric. The results of the baseline model and the model of the invention on the PASCAL VOC dataset are shown in Table 3.
TABLE 3
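For reference, the precision, recall, and F1 score reported above are related as in the sketch below, where a detection counts as a true positive when its IoU with a ground-truth box exceeds a threshold (assumed 0.5 here, the usual PASCAL VOC setting).

```python
# Precision, recall and F1 from true/false positives and false negatives.
def prf1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example: 90 correct detections, 10 spurious, 30 missed
print(prf1(90, 10, 30))  # (0.9, 0.75, 0.818...)
```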
The experimental results show that, in this environment, YOLO-Lite reaches 369 FPS (RTX2080Ti) and 102 FPS (no GPU), which is very fast, but its mean average precision is only 33.77%. The mean average precision of YOLOv3 is 55.81%, but its speed, about 86 FPS (RTX2080Ti) and 11 FPS (no GPU), is clearly inferior to YOLO-Lite, making real-time detection difficult on a GPU-free computer or mobile terminal. Compared with YOLO-Lite, the Mixed YOLOv3-Lite method greatly improves mean average precision, by 14.48%, with only a slight increase in model size and computation; compared with YOLOv3, at the cost of some mean average precision, the model size is reduced by a factor of 12, the computation by a factor of 7, and the FPS without a GPU is improved about 6-fold, while indicators such as recall and F1 score also improve slightly. Some detection results of Mixed YOLOv3-Lite on the PASCAL VOC 2007 test set are shown in FIG. 4.
Mixed YOLOv3-Lite was then trained for 60 epochs on the VisDrone2018-Det training set with an input image size of 832 × 832, tested on the validation set, and compared with the published data of SlimYOLOv3. The experimental results are shown in Table 4, and histograms of precision, recall, F1 score, mean average precision, model size, and model computation are given in FIG. 5. It can be seen at a glance that the mean average precision of Mixed YOLOv3-Lite is clearly superior to that of the Tiny-YOLOv3 and SlimYOLOv3 families, and its computation and model size also hold an absolute advantage. Mixed YOLOv3-Lite reached 47 FPS in the experimental environment using the NVIDIA RTX2080Ti GPU, whereas the FPS data of the Tiny-YOLOv3 and SlimYOLOv3 networks were measured in an NVIDIA GTX1080Ti environment.
TABLE 4
The per-class detection performance of Mixed YOLOv3-Lite (832 × 832) on the VisDrone2018-Det validation set is shown in Table 5. The class distribution of VisDrone2018-Det is highly unbalanced and very challenging: for example, car has many instances, about 36.29% of all instances in the dataset, while awning-tricycle is relatively rare at only 1.37%. This imbalance complicates detector optimization: the average precision for car reaches 70.79%, while that for awning-tricycle is only 6.24%. In the design of Mixed YOLOv3-Lite, only the convolutional layer structure was recombined and pruned, and the class imbalance problem was not addressed specifically, which provides guidance for further optimizing the network later. Some detection results on the VisDrone2018-Det validation set are shown in FIGS. 6-a, 6-b, 6-c, and 6-d; it can be seen that the method identifies targets accurately under a variety of conditions.
TABLE 5
Jetson AGX Xavier is a small, low-power, fully functional computing system from NVIDIA whose module measures no more than 105 mm × 105 mm; it is designed for neural network application platforms such as robots and industrial automation. Consuming only 10 to 30 watts, it can provide powerful and efficient AI, computer vision, and high-performance computing capabilities when deployed in smart devices such as unmanned vehicles and robots. Mixed YOLOv3-Lite was tested on a Jetson AGX Xavier device, with the results shown in Table 6: with a 224 × 224 input image it reaches 43 FPS, 3.31 times that of YOLOv3, and when the input image is enlarged to 832 × 832 for unmanned aerial vehicle imagery it still reaches 13 FPS. Although there is a gap with YOLO-Lite, this meets the real-time requirement.
TABLE 6
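A minimal way to measure such end-to-end frame rates on a device like the Jetson AGX Xavier is sketched below; the input size, run count, and warm-up count are illustrative choices, not the exact benchmark protocol used above.

```python
# Simple FPS measurement for a detection model on one device.
import time
import torch

def measure_fps(model, size=224, runs=200, warmup=20):
    model.eval()
    x = torch.randn(1, 3, size, size)
    with torch.no_grad():
        for _ in range(warmup):      # warm up caches / lazy initialization
            model(x)
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        elapsed = time.perf_counter() - start
    return runs / elapsed            # frames per second
```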
These experiments show that, compared with YOLOv3, the Mixed YOLOv3-Lite network adopted by the invention has a shallower and narrower structure, fewer trainable parameters, a markedly reduced computational cost, and a faster running speed; compared with YOLO-Lite, it greatly improves detection precision at a modest cost in running speed. It lowers the requirements on hardware, adapts to target recognition in various kinds of image data, and has broad application prospects.