Target detection method, device and system and advanced driving assistance system
1. A method of object detection, the method comprising:
carrying out image target detection on the acquired image data to obtain image detection frames of a plurality of image detection targets;
carrying out point cloud target detection on the collected point cloud data by using a preset neural network to obtain a plurality of 3D target frames of point cloud detection targets;
projecting the 3D target frame of each point cloud detection target to an image pixel coordinate system to obtain each corresponding 2D projection frame;
calculating an IOU value between the 2D projection frame of each point cloud detection target and each image detection frame based on the image pixel coordinate system;
and fusing each image detection target and each point cloud detection target according to the IOU value to obtain a fused target.
2. The method of claim 1, wherein fusing each of the image detection targets with each of the point cloud detection targets according to the IOU value to obtain a fused target comprises:
respectively determining the 2D projection frames of the point cloud detection targets corresponding to the maximum IOU values between the 2D projection frames of all the point cloud detection targets and the image detection frames of all the image detection targets;
respectively judging whether each maximum IOU value is larger than an IOU threshold value;
and if the maximum IOU value is larger than the IOU threshold value, taking the point cloud detection target corresponding to the maximum IOU value as a fusion target.
3. The method of claim 2, wherein after said respectively judging whether each of said maximum IOU values is greater than the IOU threshold value, the method further comprises:
if the maximum IOU value is smaller than the IOU threshold, judging whether the confidence of the image detection target corresponding to the maximum IOU value is larger than a first confidence threshold;
and if so, taking the image detection target corresponding to the maximum IOU value as the fusion target.
4. The method of claim 2, further comprising:
traversing all the point cloud detection targets, and judging whether the confidence of the unfused point cloud detection targets is greater than a second confidence threshold;
and if so, taking the unfused point cloud detection target as the fusion target.
5. The method of claim 1, wherein the preset neural network comprises a Point-RPN network and a Point-RCNN network;
the carrying out point cloud target detection on the collected point cloud data by using the preset neural network to obtain the 3D target frames of the plurality of point cloud detection targets comprises the following steps:
inputting the Point cloud data into the Point-RPN network for processing to obtain a plurality of 3D suggestion boxes;
inputting the plurality of 3D suggestion boxes to the Point-RCNN network so as to obtain corresponding Point cloud data from each 3D suggestion box;
extracting the characteristics of the point cloud data to obtain a global characteristic with a preset dimension;
and passing the global feature of the preset dimension through a classification branch and a regression branch respectively to classify each 3D suggestion frame, and, when a 3D suggestion frame is classified as a positive sample, regressing the 3D suggestion frame to obtain a 3D target frame of the point cloud detection target.
6. The method of claim 5, wherein the inputting the Point cloud data into the Point-RPN network for processing to obtain a plurality of 3D suggestion boxes comprises:
performing grid division on the point cloud data according to different scales to perform voxelization processing to obtain voxelization point cloud data of different scales;
respectively extracting feature maps of corresponding scales from voxelized point cloud data of different scales;
and fusing the feature map of each scale with the feature map of the corresponding scale in the backbone network, and obtaining a plurality of 3D suggestion boxes based on the fused feature maps.
7. The method of claim 1, wherein the projecting the 3D object box of each of the point cloud detection objects to an image pixel coordinate system to obtain a corresponding each 2D projection box comprises:
converting the parameters of the 3D target frame of each point cloud detection target into corresponding vertex coordinates in a three-dimensional space;
based on a camera perspective transformation principle, converting vertex coordinates of each 3D target frame in the three-dimensional space into corresponding pixel coordinates in an image pixel coordinate system;
and taking the vertex of a circumscribed rectangular frame formed by all projection points as the coordinates of each 2D projection frame according to the pixel coordinates corresponding to each 3D target frame in the image pixel coordinate system.
8. An object detection device, comprising:
the image target detection module is used for carrying out image target detection on the acquired image data to obtain a plurality of image detection frames of image detection targets;
the point cloud target detection module is used for carrying out point cloud target detection on the collected point cloud data by utilizing a preset neural network so as to obtain a plurality of 3D target frames of point cloud detection targets;
the projection processing module is used for projecting the 3D target frame of each point cloud detection target to an image pixel coordinate system to obtain each corresponding 2D projection frame;
the IOU value calculation module is used for calculating the IOU value between the 2D projection frame of each point cloud detection target and each image detection frame based on the image pixel coordinate system;
and the fusion processing module is used for performing target-level fusion on each image detection target and each point cloud detection target in the point cloud target queue according to the IOU value so as to obtain a fusion target.
9. An object detection system, comprising: a camera and a radar; the camera is used for collecting image data, and the radar is used for collecting point cloud data;
further comprising: a processor, a memory and a computer program stored on the memory and operable on the processor, the processor executing the computer program to perform the object detection method of any one of claims 1 to 7.
10. An advanced driving assistance system characterized by comprising the object detection system according to claim 9.
11. A computer-readable storage medium storing a computer program for executing the object detection method according to any one of claims 1 to 7.
[ background of the invention ]
With the rapid development of automatic driving technology in recent years, environmental perception during vehicle operation has become increasingly important. Since automatic driving places high demands on safety, improving perception accuracy is a core problem, which makes sensor fusion indispensable.
In the field of Advanced Driving Assistance Systems (ADAS), environmental sensing systems commonly adopt multi-sensor fusion technology to improve sensing performance, and a typical scheme is the fusion of a camera and a radar.
Current fusion approaches based on a camera and a lidar mainly include decision-level, target-level and feature-level fusion. In decision-level fusion, no matching relationship between the sensors is established before the sensing information of each sensor is transmitted to the decision system, which increases the decision difficulty to a certain extent and is not conducive to making a better decision scheme. Feature-level fusion has been a popular research direction in the past two years, but for data of two different modalities, namely images and point clouds, direct feature fusion does not necessarily achieve optimal performance, and how to perform more effective feature fusion is still a direction to be explored. Most existing target-level fusion schemes use a serial structure: for example, detection frames are first obtained in the image through a Convolutional Neural Network (CNN), then the point cloud data inside each image detection frame is extracted for foreground/background point segmentation and regression of a 3D target frame. The 3D detection performance of such a method is strongly affected by the image target detection performance, and missed or false detections in the image directly cause missed or false detections of the point cloud targets. As another example, the point cloud data may be processed by conventional methods: ground points are segmented and filtered out by a certain strategy, the remaining points are spatially clustered to extract a 3D region of interest of a target, the 3D region of interest is projected onto the image to extract the corresponding 2D region, and a conventional 2D CNN is then used to classify the target and regress the target frame.
[ summary of the invention ]
In view of this, embodiments of the present invention provide a target detection method, device and system, and an advanced driving assistance system, so as to solve the technical problem in the prior art that the detection performance of a target-level fusion scheme based on a serial structure is limited by the performance of its preceding (image detection) stage.
In one aspect, an embodiment of the present invention provides a target detection method, including: carrying out image target detection on the acquired image data to obtain image detection frames of a plurality of image detection targets; carrying out point cloud target detection on the collected point cloud data by using a preset neural network to obtain a plurality of 3D target frames of point cloud detection targets; projecting the 3D target frame of each point cloud detection target to an image pixel coordinate system to obtain each corresponding 2D projection frame; calculating an IOU value between the 2D projection frame of each point cloud detection target and each image detection frame based on the image pixel coordinate system; and fusing each image detection target and each point cloud detection target according to the IOU value to obtain a fused target.
Optionally, the fusing each image detection target and each point cloud detection target according to the IOU value to obtain a fused target includes: respectively determining the 2D projection frames of the point cloud detection targets corresponding to the maximum IOU values between the 2D projection frames of all the point cloud detection targets and the image detection frames of all the image detection targets; respectively judging whether each maximum IOU value is larger than an IOU threshold value; and if the maximum IOU value is larger than the IOU threshold value, taking the point cloud detection target corresponding to the maximum IOU value as a fusion target.
Optionally, after respectively determining whether each of the maximum IOU values is greater than the IOU threshold, the method further includes: if the maximum IOU value is smaller than the IOU threshold, judging whether the confidence of the image detection target corresponding to the maximum IOU value is larger than a first confidence threshold; and if so, taking the image detection target corresponding to the maximum IOU value as the fusion target.
Optionally, the target detection method further includes: traversing all the point cloud detection targets, and judging whether the confidence of the unfused point cloud detection targets is greater than a second confidence threshold; and if so, taking the unfused point cloud detection target as the fusion target.
Optionally, the preset neural network includes a Point-RPN network and a Point-RCNN network; the carrying out point cloud target detection on the collected point cloud data by using the preset neural network to obtain the 3D target frames of the plurality of point cloud detection targets includes: inputting the point cloud data into the Point-RPN network for processing to obtain a plurality of 3D suggestion boxes; inputting the plurality of 3D suggestion boxes to the Point-RCNN network so as to obtain the corresponding point cloud data from each 3D suggestion box; extracting features of the point cloud data to obtain a global feature of a preset dimension; and passing the global feature of the preset dimension through a classification branch and a regression branch respectively to classify each 3D suggestion frame, and, when a 3D suggestion frame is classified as a positive sample, regressing the 3D suggestion frame to obtain a 3D target frame of the point cloud detection target.
Optionally, the inputting the point cloud data into the Point-RPN network for processing to obtain a plurality of 3D suggestion boxes includes: performing grid division on the point cloud data according to different scales for voxelization processing, to obtain voxelized point cloud data of different scales; respectively extracting feature maps of the corresponding scales from the voxelized point cloud data of different scales; and fusing the feature map of each scale with the feature map of the corresponding scale in the backbone network to obtain a plurality of 3D suggestion frames.
Optionally, the projecting the 3D target frame of each point cloud detection target to an image pixel coordinate system to obtain each corresponding 2D projection frame includes: converting the parameters of the 3D target frame of each point cloud detection target into corresponding vertex coordinates in a three-dimensional space; based on a camera perspective transformation principle, converting vertex coordinates of each 3D target frame in the three-dimensional space into corresponding pixel coordinates in an image pixel coordinate system; and converting the pixel coordinates corresponding to each 3D target frame in the image pixel coordinate system into the coordinates of each corresponding 2D projection frame.
On the other hand, an embodiment of the present invention further provides an object detection apparatus, including: the image target detection module is used for carrying out image target detection on the acquired image data to obtain a plurality of image detection frames of image detection targets; the point cloud target detection module is used for carrying out point cloud target detection on the collected point cloud data by utilizing a preset neural network so as to obtain a plurality of 3D target frames of point cloud detection targets; the projection processing module is used for projecting the 3D target frame of each point cloud detection target to an image pixel coordinate system to obtain each corresponding 2D projection frame; the IOU value calculation module is used for calculating the IOU value between the 2D projection frame of each point cloud detection target and each image detection frame based on the image pixel coordinate system; and the fusion processing module is used for performing target-level fusion on each image detection target and each point cloud detection target in the point cloud target queue according to the IOU value so as to obtain a fusion target.
In another aspect, an embodiment of the present invention further provides an advanced driving assistance system, including the above object detection system.
In still another aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is used to execute the above object detection method.
Compared with the prior art, the technical scheme at least has the following beneficial effects:
According to the target detection method provided by the embodiment of the invention, image target detection is carried out on the collected image data to obtain a plurality of image detection targets, and point cloud target detection is carried out on the collected point cloud data to obtain point cloud detection targets. Then, the 3D target frame of each point cloud detection target is projected to the image pixel coordinate system to obtain each corresponding 2D projection frame; in the image pixel coordinate system, the IOU value between the 2D projection frame of each point cloud detection target and each image detection frame is calculated, and the image detection targets and point cloud detection targets that satisfy the set fusion logic are fused at the target level based on the IOU values to obtain fusion targets. Whether the unfused image detection targets and point cloud detection targets are also taken as fusion targets is then decided according to their respective confidences. Image detection targets carry rich semantic information, so their classification performance is good, but their ranging accuracy is low; point cloud data acquired by radar, on the other hand, has high ranging accuracy. Fusing the two compensates for their respective weaknesses and yields a more accurate and effective environment perception result. Therefore, the embodiment of the invention fully exploits the respective advantages of image detection and point cloud 3D target detection, and achieves better 3D detection performance than either sensing modality alone.
Further, when point cloud target detection is performed on the point cloud data with a preset neural network (for example, a Point-RCNN network), a two-stage point cloud target detection algorithm (comprising a Point-RPN network and a Point-RCNN network) is designed by drawing on the algorithmic idea of Faster R-CNN in image target detection: the point cloud data is processed by the Point-RPN network to generate a series of 3D suggestion frames (proposals), and the Point-RCNN network takes the 3D proposals generated by the Point-RPN network as input and further classifies and regresses them, which greatly improves the performance of the point cloud 3D detection algorithm.
Further, an improved PointPillars algorithm is used in the Point-RPN network. Compared with the existing PointPillars algorithm, the improved PointPillars algorithm adopts multi-scale point cloud grid division and fuses the multi-scale point cloud grid features, thereby improving the detection performance of the Point-RPN network.
[ description of the drawings ]
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a target detection method according to the present invention;
fig. 2 is a schematic diagram of a network structure using a Point-RPN network in the target detection method according to the embodiment of the present invention;
fig. 3 is a schematic network structure diagram of a Point-RCNN network in the target detection method according to the embodiment of the present invention;
fig. 4 is a schematic flowchart of a specific embodiment of a target-level fusion algorithm in the target detection method according to the embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of an object detection apparatus according to the present invention;
fig. 6 is a schematic structural diagram of an object detection system according to an embodiment of the present invention.
[ detailed description ]
For better understanding of the technical solutions of the present invention, the following detailed descriptions of the embodiments of the present invention are provided with reference to the accompanying drawings.
It should be understood that the described embodiments are only some embodiments of the invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of an embodiment of a target detection method according to an embodiment of the present invention. Referring to fig. 1, the object detection method includes:
Step 101, carrying out image target detection on the acquired image data to obtain image detection frames of a plurality of image detection targets;
Step 102, carrying out point cloud target detection on the collected point cloud data by using a preset neural network to obtain a plurality of 3D target frames of point cloud detection targets;
Step 103, projecting the 3D target frame of each point cloud detection target to an image pixel coordinate system to obtain each corresponding 2D projection frame;
Step 104, calculating an IOU value between the 2D projection frame of each point cloud detection target and each image detection frame based on the image pixel coordinate system;
Step 105, performing target-level fusion on each image detection target and each point cloud detection target according to the IOU value to obtain a fusion target.
The target detection method provided by this embodiment is applied to a target detection system having a camera and a radar. The camera may be a single camera or multiple cameras, and the radar may be a lidar.
In the process of target detection, the camera captures the scene to acquire 2D image data, and the radar scans the scene to acquire point cloud data. Point cloud data refers to a set of vectors in a three-dimensional coordinate system. These vectors are usually expressed as X, Y, Z three-dimensional coordinates and are mainly used to represent the shape of the outer surface of an object; besides the geometric position information represented by (X, Y, Z), a point in the point cloud may also carry RGB color, gray value, depth, segmentation result, etc.
The target detection method described in this embodiment is executed by a processor in the target detection system.
As described in step 101, image target detection is performed on the acquired image data to obtain a plurality of image detection frames of image detection targets.
Specifically, the image data collected from the scene may contain multiple types of detection targets (e.g., motor vehicles, non-motor vehicles, pedestrians, etc.). Different types of image detection targets can be detected from the image by using a Multi-Object Detection (MOD) network, so as to obtain the image detection frames of the plurality of image detection targets. The multi-object detection network can be implemented with a deep learning method based on a Convolutional Neural Network (CNN); it mainly detects multiple types of targets in the scene. After detection, the position and size information of each image detection target in the image is obtained, and the detected image detection targets are marked in the image data by image detection frames.
Further, since the decision system cannot make accurate control decisions directly from the position and size information of an image detection target in the image, the relative position of the image detection target with respect to the camera is estimated by monocular ranging (for a single camera) or binocular ranging (for two or more cameras), so as to obtain the approximate position of the image detection target in the scene with reasonable ranging accuracy.
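To make the data flow of the later fusion steps concrete, the following minimal Python sketch shows one possible way to represent the outputs of the image detection stage (2D detection frames with class and confidence). The MOD/CNN detector itself is treated as a black box here; the class name, field names and example values are illustrative assumptions, not part of this embodiment.

```python
from dataclasses import dataclass
import numpy as np


@dataclass
class ImageDetection:
    box: np.ndarray      # image detection frame [u_min, v_min, u_max, v_max] in pixels
    label: int           # class id, e.g. 0 = motor vehicle, 1 = non-motor vehicle, 2 = pedestrian
    confidence: float    # detection confidence in [0, 1]


# Example output of the (black-box) MOD detector for one frame: two detected targets.
image_targets = [
    ImageDetection(np.array([320.0, 180.0, 420.0, 300.0]), 0, 0.91),
    ImageDetection(np.array([500.0, 200.0, 540.0, 290.0]), 2, 0.67),
]
```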
And step 102, performing point cloud target detection on the collected point cloud data by using a preset neural network to obtain a plurality of 3D target frames of point cloud detection targets.
In this embodiment, the preset neural network may adopt the Point-RCNN architecture, which is a two-stage network comprising a Point-RPN network and a Point-RCNN network.
Specifically, this step includes:
Step 1021, inputting the point cloud data into the Point-RPN network for processing to obtain a plurality of 3D suggestion frames;
Step 1022, inputting the plurality of 3D suggestion boxes to the Point-RCNN network, so as to obtain the corresponding point cloud data from each 3D suggestion box;
Step 1023, extracting features of the point cloud data to obtain a global feature of a preset dimension;
Step 1024, passing the global feature of the preset dimension through a classification branch and a regression branch respectively to classify each 3D suggestion frame, and, when a 3D suggestion frame is classified as a positive sample, regressing the 3D suggestion frame to obtain a 3D target frame of the point cloud detection target.
In this embodiment, the collected point cloud data is processed by the Point-RPN network to obtain a number of preliminary 3D proposals. These preliminary 3D proposals are not accurate enough and are therefore referred to as 3D suggestion frames.
The Point-RPN network is implemented with an improved PointPillars algorithm.
Those skilled in the art will understand that the existing PointPillars algorithm preprocesses the point cloud data by grid division: it divides the point cloud only in the two dimensions of the top view, maps the point cloud data in each grid into a feature vector of fixed length through a small PointNet-style network to generate a pseudo image of fixed length and width, and then uses a 2D convolutional neural network for further feature extraction and the subsequent target detection task. The improved PointPillars algorithm adopts multi-scale grid division and fuses the multi-scale grid features, thereby improving the detection performance of the Point-RPN network.
Specifically, the step 1021 includes:
Step 10211, performing grid division on the point cloud data according to different scales for voxelization processing, to obtain voxelized point cloud data of different scales;
Step 10212, extracting feature maps of the corresponding scales from the voxelized point cloud data of different scales respectively;
Step 10213, fusing the feature map of each scale with the feature map of the corresponding scale in the backbone network, to obtain a plurality of 3D suggestion boxes.
The following describes a specific implementation process of the steps 10211 to 10213 in conjunction with a network structure of a Point-RPN network.
Fig. 2 is a schematic diagram of a network structure using a Point-RPN network in the target detection method according to the embodiment of the present invention.
Referring to fig. 2, the input point cloud data is subjected to grid (pillar) division by the 3D target detector (VoxelFPN). Multi-scale pillar division is adopted: as shown in fig. 2, the point cloud is divided into pillars of three scales, s × s, 2s × 2s and 4s × 4s. The grid division process voxelizes the point cloud data, and the point cloud data in each grid is the voxelized point cloud data.
Then, an output feature map is obtained from the grid features of each scale through a voxel feature extraction module, with sizes (C_1, H, W), (C_2, H/2, W/2) and (C_3, H/4, W/4) respectively, and the feature maps output by the voxel feature extraction module are input to a multi-scale feature fusion (multi-scale feature aggregation) module. In the multi-scale feature fusion module, the output feature maps are respectively fused with the feature maps of the corresponding scales in the backbone network, thereby realizing fusion between the multi-scale grid (pillar) features. A plurality of 3D suggestion frames are then obtained by the detection module based on the fused grid features.
Furthermore, the backbone network borrows the design idea of the FPN, adopting a bottom-up and top-down structure to realize fusion of multi-scale features. When the 3D suggestion boxes are generated, a non-maximum suppression (NMS) operation is performed within each scene to filter out 3D suggestion boxes with large overlap, the top N targets by confidence are retained (if there are fewer than N, all targets are output), and these are output to the Point-RCNN network.
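The multi-scale grid (pillar) division described above can be illustrated with the simplified sketch below: points are assigned to pillars of side s, 2s and 4s on the top-view plane. The detection range, the value of s and the grouping strategy are illustrative assumptions; the pillar feature encoding, the backbone network and the FPN-style fusion are omitted.

```python
import numpy as np


def pillar_indices(points_xy: np.ndarray, pillar_size: float,
                   x_range=(0.0, 70.4), y_range=(-40.0, 40.0)) -> np.ndarray:
    """Map each point to its (row, col) pillar index on the top-view plane.
    The detection range defaults here are illustrative, not from this embodiment."""
    col = np.floor((points_xy[:, 0] - x_range[0]) / pillar_size).astype(int)
    row = np.floor((points_xy[:, 1] - y_range[0]) / pillar_size).astype(int)
    return np.stack([row, col], axis=1)


def multiscale_voxelize(points: np.ndarray, s: float = 0.16):
    """Group points into pillars at the three scales s, 2s and 4s (s is an assumed value)."""
    grouped = {}
    for scale in (s, 2 * s, 4 * s):
        idx = pillar_indices(points[:, :2], scale)
        pillars = {}
        for i, key in enumerate(map(tuple, idx)):
            pillars.setdefault(key, []).append(points[i])
        grouped[scale] = {k: np.stack(v) for k, v in pillars.items()}
    return grouped
```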
In step 1022, the plurality of 3D suggestion boxes are input to the Point-RCNN network, so as to obtain the corresponding point cloud data from each of the 3D suggestion boxes.
As shown in step 1023, feature extraction is performed on the point cloud data to obtain a global feature with a preset dimension.
Fig. 3 is a schematic network structure diagram of a Point-RCNN network in the target detection method according to the embodiment of the present invention.
Referring to fig. 3, the point cloud data and the 3D suggestion boxes output by the Point-RPN network are taken as the input of the Point-RCNN network, and the 3D suggestion boxes are further classified and regressed. Combining the 3D suggestion boxes and point cloud data output by the Point-RPN network, the Point-RCNN network acquires the corresponding point cloud data from each 3D suggestion box and encodes the features of the point cloud data in each 3D suggestion box with a point cloud encoder. The point cloud encoder adopts PointNet++ as its backbone network and uses set abstraction (SA) to gradually downsample the point cloud features and extract deep features, finally obtaining a global feature of a preset dimension (C dimensions). The dimension of the global feature can be set freely, for example C = 512.
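As a rough illustration of how a fixed C-dimensional global feature, a classification score and box regression values can be obtained from the variable number of points inside one 3D suggestion box, the sketch below uses a heavily simplified PointNet-style encoder (a shared per-point MLP followed by max pooling). The actual Point-RCNN stage described above uses a PointNet++ backbone with set abstraction; the layer sizes and input dimensions here are assumptions.

```python
import torch
import torch.nn as nn


class ProposalEncoder(nn.Module):
    """Simplified PointNet-style encoder: per-point MLP + max pooling -> C-dim global feature.
    Only illustrates the shape of the computation, not the PointNet++/SA backbone itself."""

    def __init__(self, in_dim: int = 4, c_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, c_dim),
        )
        # Classification and regression branches operating on the global feature.
        self.cls_branch = nn.Linear(c_dim, 1)   # confidence / positive-sample score
        self.reg_branch = nn.Linear(c_dim, 7)   # (x, y, z, w, l, h, theta) residuals

    def forward(self, pts: torch.Tensor):
        # pts: (N_points, in_dim) -- points gathered from one 3D suggestion box
        feat = self.mlp(pts)                    # (N_points, c_dim) per-point features
        global_feat = feat.max(dim=0).values    # (c_dim,) global feature by max pooling
        return self.cls_branch(global_feat), self.reg_branch(global_feat)


# Example: 200 points (x, y, z, intensity) from one 3D suggestion box.
score, box_residual = ProposalEncoder()(torch.randn(200, 4))
```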
In step 1024, the global feature of the preset dimension is passed through a classification branch and a regression branch respectively to classify each 3D suggestion box, and when a 3D suggestion box is classified as a positive sample, the 3D suggestion box is regressed to obtain the 3D target box of the point cloud detection target.
Specifically, with continued reference to fig. 3, each 3D proposal is classified by the classification branch, which performs confidence prediction on the obtained C-dimensional feature vector. According to the classification result, if a 3D suggestion frame is a positive sample (i.e. it belongs to one of the classification categories), the 3D suggestion frame is further regressed through the regression branch to obtain the 3D target frame of the point cloud detection target.
For each 3D target frame, let the regression result obtained from the Point-RPN network be (x_i, y_i, z_i, w_i, l_i, h_i, θ_i). Then, according to (x_i, y_i, z_i) and θ_i, the point cloud data inside the 3D target frame can be transformed by translation and rotation into a local canonical coordinate system, and the offsets of the regression parameters are calculated in this coordinate system. Let the ground-truth (GT) box have the parameters (x_gt, y_gt, z_gt, w_gt, l_gt, h_gt, θ_gt); the corresponding orientation regression offset is then:
Δθ = θ_gt − θ_i
The offset of the center point is the offset in the local canonical coordinate system, the offset of the scale is the offset relative to the mean scale of the corresponding target class over all training data, and the orientation-angle offset is the deviation from the orientation angle of the ground-truth box. The loss function used is the Smooth L1 loss.
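A minimal sketch of the regression loss follows, showing only the orientation offset Δθ = θ_gt − θ_i explicitly, since that is the only offset formula stated above; the center and scale offsets would be computed analogously in the local canonical coordinate system. The tensor layout (seven box parameters with θ in the last column) is an assumption.

```python
import torch
import torch.nn.functional as F


def box_regression_loss(pred_offsets: torch.Tensor, gt_boxes: torch.Tensor,
                        rpn_boxes: torch.Tensor) -> torch.Tensor:
    """Smooth L1 loss on the orientation offset only; boxes are (N, 7) tensors
    laid out as (x, y, z, w, l, h, theta), which is an assumed convention."""
    delta_theta_target = gt_boxes[:, 6] - rpn_boxes[:, 6]   # theta_gt - theta_i
    delta_theta_pred = pred_offsets[:, 6]                   # predicted orientation offset
    return F.smooth_l1_loss(delta_theta_pred, delta_theta_target)
```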
Further, the distance between the point cloud detection target and the radar is measured by utilizing the ranging function of the radar. The radar ranging technology may adopt the prior art, and is not described herein again.
And as shown in step 103, projecting the 3D object frame of each point cloud detection object to an image pixel coordinate system to obtain corresponding 2D projection frames.
Those skilled in the art will appreciate that the same coordinate system is required to fuse the data collected by the various sensors. In this embodiment, the image pixel coordinate system is selected as the uniform coordinate system before the image data and the point cloud data are fused. And then, projecting the 3D target frame of the point cloud detection target to an image pixel coordinate system to obtain a corresponding 2D projection frame, and fusing according to the relative position relation between the 2D projection frame and the image detection frame of the image detection target.
In this embodiment, the present step includes:
Step 1031, converting the parameters of the 3D target frame of each point cloud detection target into the corresponding vertex coordinates in three-dimensional space;
Step 1032, based on the camera perspective transformation principle, converting the vertex coordinates of each 3D target frame in three-dimensional space into the corresponding pixel coordinates in the image pixel coordinate system;
Step 1033, according to the pixel coordinates corresponding to each 3D target frame in the image pixel coordinate system, taking the vertices of the circumscribed rectangular frame formed by all the projection points as the coordinates of each corresponding 2D projection frame.
Specifically, assume the point cloud target queue is Q_lidar, wherein (x_i, y_i, z_i, w_i, l_i, h_i, θ_i) are the parameters of the 3D target frame of the i-th point cloud detection target. A 3D target frame can also be represented by its 8 vertices in three-dimensional space, so (x_i, y_i, z_i, w_i, l_i, h_i, θ_i) can be converted into the corresponding 8 vertex coordinates.
Further, according to the camera perspective transformation principle, a point in three-dimensional space with homogeneous coordinates (x, y, z, 1) in the camera coordinate system is converted into homogeneous pixel coordinates (u, v, 1) in the image pixel coordinate system.
Accordingly, the vertex coordinates of each point cloud detection target in three-dimensional space are converted into the corresponding pixel coordinates in the image pixel coordinate system, and the vertices of the circumscribed rectangular frame formed by all the projection points are taken as the coordinates of the corresponding 2D projection frame, i.e. (u_min, v_min) and (u_max, v_max), where u_min, v_min, u_max and v_max are the minimum and maximum values of u and v over the projected vertices.
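The projection of steps 1031 to 1033 can be sketched as follows, assuming the 3D target frame is already expressed in the camera coordinate system and the camera intrinsic matrix K is known; in practice the lidar-to-camera extrinsic transformation would be applied first, and the axis and yaw conventions used here are assumptions.

```python
import numpy as np


def box_corners_3d(x, y, z, w, l, h, theta) -> np.ndarray:
    """8 vertex coordinates of a 3D target frame given center, size and yaw.
    Axis convention (l along x, w along y, yaw about z) is an assumption."""
    dx, dy, dz = l / 2.0, w / 2.0, h / 2.0
    corners = np.array([[ dx,  dy,  dz], [ dx, -dy,  dz], [-dx, -dy,  dz], [-dx,  dy,  dz],
                        [ dx,  dy, -dz], [ dx, -dy, -dz], [-dx, -dy, -dz], [-dx,  dy, -dz]])
    rot = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0,            0.0,           1.0]])
    return corners @ rot.T + np.array([x, y, z])


def project_to_2d_frame(corners_cam: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Project the 8 vertices (already in camera coordinates) through the intrinsic
    matrix K and take the circumscribed rectangle as the 2D projection frame
    [u_min, v_min, u_max, v_max]."""
    uvw = corners_cam @ K.T                  # homogeneous pixel coordinates (u*z, v*z, z)
    uv = uvw[:, :2] / uvw[:, 2:3]            # perspective division -> (u, v)
    return np.array([uv[:, 0].min(), uv[:, 1].min(), uv[:, 0].max(), uv[:, 1].max()])
```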
The IOU value between the 2D projection frame of each point cloud detection target and each image detection frame is calculated based on the image pixel coordinate system as described in step 104.
As understood by those skilled in the art, the IOU value is an index for evaluating the degree of overlap of two bounding boxes, i.e., the ratio of their intersection to their union: the IOU value equals the area of the intersection of the two bounding boxes divided by the area of their union. When the two bounding boxes have no intersection, the IOU is 0; when the two bounding boxes completely coincide, the IOU is 1. Thus, the IOU value ranges over [0, 1].
In this embodiment, the IOU value between the 2D projection frame of each point cloud detection target and each image detection frame is calculated in the image pixel coordinate system. For example, if there are m point cloud detection targets and n image detection targets, calculating the IOU value between the 2D projection frame of each point cloud detection target and the image detection frame of each image detection target yields an m × n IOU matrix.
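A straightforward sketch of the IOU computation and of the m × n IOU matrix described above, for axis-aligned boxes given as [u_min, v_min, u_max, v_max]:

```python
import numpy as np


def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IOU of two axis-aligned boxes [u_min, v_min, u_max, v_max]."""
    iu = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iv = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iu * iv
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def iou_matrix(proj_frames: np.ndarray, image_frames: np.ndarray) -> np.ndarray:
    """m x n IOU matrix between 2D projection frames and image detection frames."""
    m, n = len(proj_frames), len(image_frames)
    M = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            M[i, j] = iou(proj_frames[i], image_frames[j])
    return M
```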
And step 105, performing target-level fusion on each image detection target and each point cloud detection target according to the IOU value to obtain a fusion target.
In this embodiment, the present step includes:
step 1051, respectively determining the 2D projection frames of the point cloud detection targets corresponding to the maximum IOU values between the 2D projection frames of all the point cloud detection targets and the image detection frames of all the image detection targets.
Specifically, according to the above description, if the IOU value between the 2D projection frame of a point cloud detection target and an image detection frame is larger, the degree of coincidence between the 2D projection frame of the point cloud detection target and the image detection frame is higher. Therefore, the 2D projection frame of the point cloud detection target corresponding to the maximum IOU value between the 2D projection frames of all the point cloud detection targets and the image detection frame of each image detection target is determined respectively. That is, the 2D projection frame of the point cloud detection target having the highest degree of coincidence with the image detection frame of each image detection target is found.
In addition, for the 2D projection frame of each point cloud detection target, the image detection frame of the image detection target corresponding to the maximum IOU value among all the image detection targets may likewise be determined.
Step 1052, determining whether each of the maximum IOU values is greater than the IOU threshold.
The IOU threshold may be set in advance, for example, if the value range of the IOU value is [0,1], the IOU threshold is set to 0.8.
Step 1053, if the maximum IOU value is larger than the IOU threshold value, taking the point cloud detection target corresponding to the maximum IOU value as a fusion target.
Specifically, if the maximum IOU value between the two is greater than the IOU threshold, the corresponding point cloud detection target and image detection target are fused. Because the image detection frame of an image detection target does not carry information such as the three-dimensional spatial position, scale and orientation of the target, the point cloud detection target corresponding to the maximum IOU value is taken as the fusion target when performing the target-level fusion. For the fusion target, the detection frame is the 3D target frame of the point cloud detection target, and the mean of the confidences of the point cloud detection target and the image detection target corresponding to the maximum IOU value is used as the confidence of the fusion target. Further, the position information of the fusion target in 3D space is obtained according to the radar ranging principle.
Step 1054, if the maximum IOU value is smaller than the IOU threshold, determining whether the confidence of the image detection target corresponding to the maximum IOU value is greater than a first confidence threshold.
Specifically, if the maximum IOU value of the two is smaller than the IOU threshold, it is further determined whether the confidence of the image detection target corresponding to the maximum IOU value is greater than the first confidence threshold. The confidence of the image detection target is the reliability of the image detection target for detecting a certain class of targets. The first confidence threshold may be preset by itself.
Step 1055, if yes, taking the image detection target corresponding to the maximum IOU value as the fusion target.
That is, if the image detection target corresponding to the maximum IOU value is not fused with the point cloud detection target, but because the confidence of the image detection target is higher (i.e. greater than the set first confidence threshold), the image detection target is still used as the fusion target. Further, the position information of the image detection target in the 3D space is obtained according to a monocular distance measurement or binocular distance measurement algorithm.
Otherwise, if the confidence of the image detection target is smaller than the first confidence threshold, the image detection target is directly filtered.
Step 1056, traversing all point cloud detection targets, and judging whether the confidence of the point cloud detection targets which are not fused is larger than a second confidence threshold.
Specifically, each point cloud detection target is traversed once, if the point cloud detection target is fused with the image detection target, skipping is performed, and if the point cloud detection target is not fused, the confidence of the point cloud detection target is further judged, that is, whether the confidence of the point cloud detection target is greater than a second confidence threshold is judged. Wherein the second confidence threshold value can be preset by itself.
Step 1057, if so, taking the unfused point cloud detection target as the fusion target.
Specifically, if the confidence of the unfused point cloud detection target is higher (i.e. greater than the set second confidence threshold), the point cloud detection target is still used as the fusion target. Otherwise, if the confidence of the point cloud detection target is smaller than the second confidence threshold, the point cloud detection target is directly filtered.
It can be seen that the finally obtained fusion targets also include the individual image detection targets and point cloud detection targets that were not fused with each other. Because image detection targets have rich semantic information, their classification performance is good, but the accuracy of monocular or binocular ranging is low; point cloud detection targets, on the other hand, use radar ranging and have high ranging accuracy. Fusing the two therefore compensates for their respective weaknesses and outputs a more accurate and effective environment perception result.
Fig. 4 is a schematic flowchart of a specific embodiment of the target-level fusion algorithm in the target detection method according to the embodiment of the present invention. Referring to fig. 4, the target-level fusion algorithm includes the following steps (a code sketch of this flow is given after the step list):
Step 401, assume the point cloud target queue Q_lidar contains the m point cloud detection targets, each described by the parameters (x_i, y_i, z_i, w_i, l_i, h_i, θ_i) of its 3D target frame;
Step 402, assume the image target queue Q_image contains the n image detection targets with their image detection frames and confidences;
Step 403, project each point cloud detection target to the image to obtain its 2D projection frame;
Step 404, calculate the IOU matrix M_(m×n) from the image detection frames in the image target queue Q_image and the 2D projection frames;
Step 405, traverse the image detection targets i (i = 1, 2, 3, ..., n) in turn;
Step 406, find the point cloud detection target j whose 2D projection frame has the maximum IOU value with image detection target i;
Step 407, judge whether this maximum IOU value is greater than T, where T is the IOU threshold; if the judgment result in step 407 is yes, execute step 409; if the judgment result in step 407 is no, execute step 408;
Step 408, judge whether the confidence of the image detection target is greater than C_1, where C_1 is the first confidence threshold; if the judgment result in step 408 is yes, execute step 410;
Step 409, fuse image detection target i and point cloud detection target j, and output the resulting fusion target to the fusion target queue Q_fusion;
Step 410, output image detection target i to the fusion target queue Q_fusion;
Step 411, judge whether i <= n; if the judgment result in step 411 is yes, that is, the image target queue has not been fully traversed, set i = i + 1 and return to step 406; if the judgment result in step 411 is no, execute step 412;
Step 412, output the point cloud detection targets that were not fused and whose confidence is greater than C_2 to the fusion target queue Q_fusion, where C_2 is the second confidence threshold.
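The step flow above can be sketched in Python as follows. The use of the 3D target frame and the mean of the two confidences for a fused target follows the description in this embodiment, but the concrete data structures (dicts with 'box' and 'confidence') and the threshold values T, C_1 and C_2 are illustrative assumptions.

```python
import numpy as np


def fuse_targets(lidar_targets, image_targets, M: np.ndarray,
                 iou_thresh: float = 0.8, c1: float = 0.5, c2: float = 0.5):
    """Target-level fusion following steps 401-412.
    lidar_targets / image_targets: lists of dicts with 'box' and 'confidence'.
    M: m x n IOU matrix (rows: point cloud targets, columns: image targets).
    iou_thresh (T), c1 (C_1) and c2 (C_2) are illustrative threshold values."""
    fusion_queue = []
    fused_lidar = set()

    # Traverse the image detection targets (steps 405-411).
    for i, img in enumerate(image_targets):
        j = int(np.argmax(M[:, i]))                 # point cloud target with max IOU
        if M[j, i] > iou_thresh:                    # step 407 -> 409: fuse the pair
            lidar = lidar_targets[j]
            fusion_queue.append({
                "box": lidar["box"],                # keep the 3D target frame
                "confidence": 0.5 * (lidar["confidence"] + img["confidence"]),
            })
            fused_lidar.add(j)
        elif img["confidence"] > c1:                # step 408 -> 410: keep the image target
            fusion_queue.append({"box": img["box"], "confidence": img["confidence"]})

    # Step 412: keep un-fused point cloud targets whose confidence exceeds C_2.
    for j, lidar in enumerate(lidar_targets):
        if j not in fused_lidar and lidar["confidence"] > c2:
            fusion_queue.append({"box": lidar["box"], "confidence": lidar["confidence"]})

    return fusion_queue
```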
Fig. 5 is a schematic structural diagram of an embodiment of an object detection apparatus according to the present invention. Referring to fig. 5, the object detection device 5 includes:
and an image target detection module 51, configured to perform image target detection on the acquired image data to obtain image detection frames of a plurality of image detection targets. And the point cloud target detection module 52 is configured to perform point cloud target detection on the collected point cloud data by using a preset neural network to obtain a plurality of 3D target frames of point cloud detection targets. And the projection processing module 53 is configured to project the 3D target frame of each point cloud detection target to an image pixel coordinate system to obtain each corresponding 2D projection frame. An IOU value calculating module 54, configured to calculate an IOU value between the 2D projection frame of each point cloud detection target and each image detection frame based on the image pixel coordinate system. And the fusion processing module 55 is configured to perform target-level fusion on each image detection target and each point cloud detection target according to the IOU value to obtain a fusion target.
Wherein the fusion processing module 55 includes: a projection frame determining unit 551, configured to respectively determine the 2D projection frames of the point cloud detection targets corresponding to the maximum IOU values between the 2D projection frames of all the point cloud detection targets and the image detection frame of each image detection target; an IOU value judging unit 552, configured to judge whether each of the maximum IOU values is greater than the IOU threshold; and a fusion processing unit 553, configured to take the point cloud detection target corresponding to the maximum IOU value as a fusion target if the maximum IOU value is greater than the IOU threshold.
The fusion processing module 55 further includes: a first confidence determining unit 554, configured to determine whether a confidence of the image detection target corresponding to the maximum IOU value is greater than a first confidence threshold if the maximum IOU value is smaller than the IOU threshold; the fusion processing unit 553 is further configured to, if the determination result of the first confidence determining unit 554 is yes, take the image detection target corresponding to the maximum IOU value as the fusion target.
The fusion processing module 55 further includes: the second confidence judgment unit 555 is configured to traverse all the point cloud detection targets, and judge whether the confidence of the point cloud detection targets that are not fused is greater than a second confidence threshold; the fusion processing unit 553 is further configured to, if the determination result of the second confidence degree determination unit 555 is yes, take the unfused point cloud detection target as the fusion target.
The preset neural network comprises a Point-RPN network and a Point-RCNN network. The point cloud target detection module 52 includes: a Point-RPN network processing unit 521, configured to input the point cloud data into the Point-RPN network for processing to obtain a plurality of 3D suggestion frames, and to input the plurality of 3D suggestion boxes into the Point-RCNN network so as to obtain the corresponding point cloud data from each 3D suggestion box; and a Point-RCNN network processing unit 522, configured to perform feature extraction on the point cloud data to obtain a global feature of a preset dimension, to pass the global feature of the preset dimension through a classification branch and a regression branch respectively to classify each 3D suggestion frame, and, when a 3D suggestion frame is classified as a positive sample, to regress the 3D suggestion frame to obtain a 3D target frame of the point cloud detection target.
The Point-RPN network processing unit 521 includes: a voxelization processing unit (not shown in the figure) for performing grid division on the point cloud data according to different scales to perform voxelization processing, so as to obtain voxelization point cloud data of different scales; a feature map extracting unit (not shown in the figure) for extracting feature maps of corresponding scales from the voxelized point cloud data of different scales respectively; and a 3D suggestion frame determining unit (not shown in the figure) configured to select a feature map corresponding to feature maps of different scales from the backbone network, and fuse the feature map and the feature map to obtain a plurality of 3D suggestion frames.
The projection processing module 53 includes: and the space conversion unit 531 is configured to convert parameters of the 3D target frame of each point cloud detection target into corresponding vertex coordinates in a three-dimensional space. A coordinate transformation unit 532, configured to transform vertex coordinates of each 3D object frame in the three-dimensional space into corresponding pixel coordinates in an image pixel coordinate system based on a camera perspective transformation principle. A 2D projection frame coordinate determining unit 533, configured to convert the pixel coordinates corresponding to each 3D target frame in the image pixel coordinate system into the coordinates of each corresponding 2D projection frame.
The specific implementation process of the modules and units may refer to the method embodiments, and will not be described herein again.
Fig. 6 is a schematic structural diagram of an object detection system according to an embodiment of the present invention.
Referring to fig. 6, the object detection system 6 includes: a camera 61, a radar 62, a processor 63, a memory 64 and a computer program stored on said memory and executable on said processor. The processor 63, when executing the computer program, performs the object detection method described in the above method embodiments. The camera 61 is used for collecting image data, and the radar 62 is used for collecting point cloud data. The camera 61 may comprise a single camera or a plurality of cameras. The radar 62 may be a laser radar.
For the specific process of the executed target detection method when the processor 63 executes the computer program, reference may be made to the above method embodiments, which are not described herein again.
The embodiment of the invention also provides an advanced driving assistance system, which comprises the above target detection system. In an automobile driving scenario, the advanced driving assistance system mainly performs 3D detection of motor vehicle targets, non-motor vehicle targets and pedestrian targets in the forward-looking area of the running vehicle. By using the target detection system provided in the present application, the advanced driving assistance system realizes 3D detection of these targets by fusing point cloud data and image data, and improves the perception accuracy and detection effect for the environment in the forward-looking area.
An embodiment of the present invention further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program is configured to execute each step in the above-described embodiment of the object detection method.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the examples of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be understood that the term "and/or" as used herein is merely one type of association that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination" or "in response to a detection", depending on the context. Similarly, the phrases "if determined" or "if detected (a stated condition or event)" may be interpreted as "when determined" or "in response to a determination" or "when detected (a stated condition or event)" or "in response to a detection (a stated condition or event)", depending on the context.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions in actual implementation, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a Processor (Processor) to execute some steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.