Pedestrian and vehicle abnormal behavior detection method based on time-space characteristics
1. A pedestrian and vehicle abnormal behavior detection method based on time-space characteristics is characterized by comprising the following steps:
step 10) detecting and obtaining a first abnormal area and a first normal area of the sample video based on a deep learning method;
step 20) detecting and obtaining a second abnormal area and a second normal area of the sample video based on an optical flow method;
and step 30) combining the first abnormal area, the first normal area, the second abnormal area and the second normal area according to the Bayesian theory to obtain the abnormal area of the sample video.
2. The pedestrian and vehicle abnormal behavior detection method based on spatiotemporal features according to claim 1, wherein the step 30) specifically comprises:
step 301) a first posterior probability is calculated by using the formula (1):
in the formula, $p(A_d \mid o)$ represents the first posterior probability, $p(d)$ represents the accuracy of the first abnormal region detected in step 10), $n_{A_d}$ represents the number of pixel points of the second abnormal region that fall within the first abnormal region, $N_{A_d}$ represents the total number of pixel points within the first abnormal region, and $n_{N_d}$ represents the number of pixel points of the second abnormal region that fall within the first normal region;
step 302) calculating a second posterior probability by using the formula (2):
in the formula, $p(A_o \mid d)$ represents the second posterior probability, $p(o)$ represents the accuracy of the second abnormal region detected in step 20), $n_{A_o}$ represents the number of pixel points of the first abnormal region that fall within the second abnormal region, $N_{A_o}$ represents the total number of pixel points within the second abnormal region, and $n_{N_o}$ represents the number of pixel points of the first abnormal region that fall within the second normal region;
step 303) taking the first posterior probability as the weight of the pixel point in the first abnormal region, taking the second posterior probability as the weight of the pixel point in the second abnormal region, and obtaining the abnormal region of the sample video by using the formula (3):
in the formula, $A_f$ represents the abnormal region of the sample video, $A_{A_d}$ represents the first abnormal region, and $A_{A_o}$ represents the second abnormal region.
3. The pedestrian and vehicle abnormal behavior detection method based on spatiotemporal features according to claim 1, wherein the step 10) specifically comprises:
step 101) inputting continuous sample video frames into a convolutional neural network model to obtain a primary prediction result of the sample video frames; the primary prediction result comprises a feature vector in each frame of image and predicted primary region information, and the region information comprises region center coordinates, region length and width, confidence and classification results;
step 102) inputting the primary prediction results of the continuous sample video frames into a long short-term memory (LSTM) network model to obtain secondary prediction results of the sample video frames; the secondary prediction result comprises secondary region information predicted in each frame of image, and the region information comprises region center coordinates, region length and width, confidence and classification results;
step 103) forming a first abnormal area by the areas of which the classification results in the secondary prediction results are abnormal; and all other areas except the first abnormal area in the image form a first normal area.
4. The pedestrian and vehicle abnormal behavior detection method based on the time-space characteristics as claimed in claim 3, wherein the convolutional neural network model adopts a YOLOv3 architecture; the input image is divided into 13 × 13, 26 × 26 and 52 × 52 grids respectively, 3 prediction frames are set in each grid cell to be responsible for detecting an object falling into that cell, and the object center coordinates, the length and width of the prediction frames and the confidence are obtained; the information of the 3 prediction frames for the entire image is calculated by a decoding process, and the prediction frame region having the highest score is selected as the detected region.
5. The pedestrian and vehicle abnormal behavior detection method based on spatio-temporal features of claim 4, wherein the loss function of the convolutional neural network model is composed of confidence error, IOU error and classification error, and is updated and optimized by using a stochastic gradient descent optimizer.
6. The pedestrian and vehicle abnormal behavior detection method based on spatiotemporal features according to claim 3, characterized in that the long short-term memory network model comprises 1 long short-term memory network layer and 1 output layer; the loss function of the long short-term memory network model consists of a confidence error, an IOU error and a classification error, and the model is updated and optimized by a stochastic gradient descent optimizer.
7. The pedestrian and vehicle abnormal behavior detection method based on spatiotemporal features according to claim 1, wherein the step 20) specifically comprises:
step 201) carrying out foreground extraction on a sample video frame by frame based on ViBe, and extracting a dynamic region;
step 202) calculating the optical flow of the dynamic area according to an optical flow method, classifying according to the optical flow characteristics of the dynamic area, and extracting a second abnormal area and a second normal area of the sample video.
8. The method for detecting the abnormal behavior of the pedestrian and the vehicle based on the spatiotemporal characteristics according to claim 7, wherein the step 202) specifically comprises:
calculating the optical flow of the pixel points in the dynamic area by using the formula (4):
$I_x u + I_y v = -I_t$
in the formula, $n_i$ denotes the i-th pixel point in a preset window, $I_x$ denotes the partial derivative of the image at the pixel point in the x direction, $I_y$ denotes the partial derivative of the image at the pixel point in the y direction, $I_t$ denotes the partial derivative of the image at the pixel point with respect to time t, u denotes the velocity of the pixel point in the x direction, and v denotes the velocity of the pixel point in the y direction;
taking the average value of the optical flows of the pixels in the dynamic area as the optical flow feature of the dynamic area, giving a ground-truth label to each dynamic area in a video frame by manually marking the abnormal areas in the training process, and automatically adjusting the parameters of the optical flow model by a grid search method; performing label prediction on the dynamic areas by using a Support Vector Machine (SVM) as a classifier, taking the sum of the dynamic areas labeled abnormal as the second abnormal area, and taking the sum of the dynamic areas labeled normal as the second normal area.
Background
With the development of modern society, automobiles have become an integral part of daily life and bring great convenience to everyday travel. At the same time, the continuous growth in the number of vehicles and in traffic density has caused a series of problems such as traffic congestion and frequent traffic accidents. Detecting abnormal behavior of pedestrians and vehicles can help drivers travel safely and remind pedestrians to avoid areas of abnormal behavior; for an autonomous vehicle, the detection can even be combined with the vehicle control system so that abnormal areas are avoided automatically, which is of great help in ensuring safety in traffic scenes.
At present, deep learning-based methods are the main trend in research at home and abroad on detecting abnormal behaviors of vehicles and pedestrians. The convolutional neural network (CNN) has strong characterization capability and high detection accuracy, but it can only extract spatial features, so it is often combined with a long short-term memory (LSTM) network to extract spatiotemporal features. For example, a detection method has been proposed that encodes and decodes video frames based on a convolutional autoencoder network and a long short-term memory network. Optical flow-based methods extract the optical flow features of video frames and classify the frames according to those features to obtain abnormal behavior areas. For example, the LK optical flow method has been used to extract target object features, which are then fed into a feedforward neural network for classification.
Deep learning-based methods are widely applied because of their strong characterization capability and high detection precision, but they require a large amount of sample data. With small sample data, the judgment of abnormal behaviors is easily affected by changes in parameters such as color and brightness. Optical flow-based methods extract optical flow features from consecutive frames; they are not sensitive enough to weakly moving targets and are less robust than deep learning-based methods, but for common behaviors they can achieve high detection accuracy without a large amount of sample data. Deep learning-based methods perform better and better as sample data increases, but their deficiency in the early small-sample stage needs to be compensated.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a pedestrian and vehicle abnormal behavior detection method based on time-space characteristics that has high detection accuracy with small sample data.
In order to solve the technical problem, the invention provides a pedestrian and vehicle abnormal behavior detection method based on time-space characteristics, which comprises the following steps:
step 10) detecting and obtaining a first abnormal area and a first normal area of the sample video based on a deep learning method;
step 20) detecting and obtaining a second abnormal area and a second normal area of the sample video based on an optical flow method;
and step 30) combining the first abnormal area, the first normal area, the second abnormal area and the second normal area according to the Bayesian theory to obtain the abnormal area of the sample video.
As a further improvement of the embodiment of the present invention, the step 30) specifically includes:
step 301) a first posterior probability is calculated by using the formula (1):
in the formula, $p(A_d \mid o)$ represents the first posterior probability, $p(d)$ represents the accuracy of the first abnormal region detected in step 10), $n_{A_d}$ represents the number of pixel points of the second abnormal region that fall within the first abnormal region, $N_{A_d}$ represents the total number of pixel points within the first abnormal region, and $n_{N_d}$ represents the number of pixel points of the second abnormal region that fall within the first normal region;
step 302) calculating a second posterior probability by using the formula (2):
in the formula, $p(A_o \mid d)$ represents the second posterior probability, $p(o)$ represents the accuracy of the second abnormal region detected in step 20), $n_{A_o}$ represents the number of pixel points of the first abnormal region that fall within the second abnormal region, $N_{A_o}$ represents the total number of pixel points within the second abnormal region, and $n_{N_o}$ represents the number of pixel points of the first abnormal region that fall within the second normal region;
step 303) taking the first posterior probability as the weight of the pixel point in the first abnormal region, taking the second posterior probability as the weight of the pixel point in the second abnormal region, and obtaining the abnormal region of the sample video by using the formula (3):
in the formula, $A_f$ represents the abnormal region of the sample video, $A_{A_d}$ represents the first abnormal region, and $A_{A_o}$ represents the second abnormal region.
As a further improvement of the embodiment of the present invention, the step 10) specifically includes:
step 101) inputting continuous sample video frames into a convolutional neural network model to obtain a primary prediction result of the sample video frames; the primary prediction result comprises a feature vector in each frame of image and predicted primary region information, and the region information comprises region center coordinates, region length and width, confidence and classification results;
step 102) inputting the primary prediction results of the continuous sample video frames into a long short-term memory (LSTM) network model to obtain secondary prediction results of the sample video frames; the secondary prediction result comprises secondary region information predicted in each frame of image, and the region information comprises region center coordinates, region length and width, confidence and classification results;
step 103) forming a first abnormal area by the areas of which the classification results in the secondary prediction results are abnormal; and all other areas except the first abnormal area in the image form a first normal area.
As a further improvement of the embodiment of the present invention, the convolutional neural network model adopts a YOLOv3 architecture; the input image is divided into 13 × 13, 26 × 26 and 52 × 52 grids respectively, 3 prediction frames are set in each grid cell to be responsible for detecting an object falling into that cell, and the object center coordinates, the length and width of the prediction frames and the confidence are obtained; the information of the 3 prediction frames for the entire image is calculated by a decoding process, and the prediction frame region having the highest score is selected as the detected region.
As a further improvement of the embodiment of the invention, the loss function of the convolutional neural network model consists of a confidence error, an IOU error and a classification error, and the model is updated and optimized by a stochastic gradient descent optimizer.
As a further improvement of the embodiment of the present invention, the long short-term memory network model includes 1 long short-term memory network layer and 1 output layer; the loss function of the long short-term memory network model consists of a confidence error, an IOU error and a classification error, and the model is updated and optimized by a stochastic gradient descent optimizer.
As a further improvement of the embodiment of the present invention, the step 20) specifically includes:
step 201) carrying out foreground extraction on a sample video frame by frame based on ViBe, and extracting a dynamic region;
step 202) calculating the optical flow of the dynamic area according to an optical flow method, classifying according to the optical flow characteristics of the dynamic area, and extracting a second abnormal area and a second normal area of the sample video.
As a further improvement of the embodiment of the present invention, the step 202) specifically includes:
calculating the optical flow of the pixel points in the dynamic area by using the formula (4):
$I_x u + I_y v = -I_t$
in the formula, $n_i$ denotes the i-th pixel point in a preset window, $I_x$ denotes the partial derivative of the image at the pixel point in the x direction, $I_y$ denotes the partial derivative of the image at the pixel point in the y direction, $I_t$ denotes the partial derivative of the image at the pixel point with respect to time t, u denotes the velocity of the pixel point in the x direction, and v denotes the velocity of the pixel point in the y direction;
taking the average value of the optical flows of the pixels in the dynamic area as the optical flow feature of the dynamic area, giving a ground-truth label to each dynamic area in a video frame by manually marking the abnormal areas in the training process, and automatically adjusting the parameters of the optical flow model by a grid search method; performing label prediction on the dynamic areas by using a Support Vector Machine (SVM) as a classifier, taking the sum of the dynamic areas labeled abnormal as the second abnormal area, and taking the sum of the dynamic areas labeled normal as the second normal area.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects: the pedestrian and vehicle abnormal behavior detection method based on time-space characteristics fuses, based on the Bayesian theory, the result detected by the deep learning method and the result detected by the optical flow method. This compensates for the insufficient performance of the deep learning method with small sample data and for the insufficient sensitivity of the optical flow method to weakly moving objects, so that the detection precision and performance for abnormal behaviors are improved without the support of large sample data.
Drawings
FIG. 1 is a flow chart of a pedestrian and vehicle abnormal behavior detection method based on temporal and spatial features according to an embodiment of the invention;
FIG. 2 is a diagram of the architecture of a long short term memory network in accordance with an embodiment of the present invention;
fig. 3(a) is an original picture in the embodiment of the present invention, and fig. 3(b) is a binary image of a dynamic region extracted from fig. 3(a) based on ViBe in the embodiment of the present invention.
Detailed Description
The technical solution of the present invention will be explained in detail below.
The embodiment of the invention provides a pedestrian and vehicle abnormal behavior detection method based on time-space characteristics, which comprises the following steps of:
step 10) detecting and obtaining a first abnormal area and a first normal area of the sample video based on a deep learning method;
step 20) detecting and obtaining a second abnormal area and a second normal area of the sample video based on an optical flow method;
and step 30) combining the first abnormal area, the first normal area, the second abnormal area and the second normal area according to the Bayesian theory to obtain the abnormal area of the sample video.
The method provided by the embodiment of the invention fuses, based on the Bayesian theory, the result detected by the deep learning method and the result detected by the optical flow method. This compensates for the insufficient performance of the deep learning method with small sample data and for the insufficient sensitivity of the optical flow method to weakly moving objects, so that the detection precision and performance for abnormal behaviors are improved without the support of large sample data.
Preferably, step 30) specifically comprises:
step 301) a first posterior probability is calculated by using the formula (1):
in the formula, $p(A_d \mid o)$ represents the first posterior probability, $p(d)$ represents the accuracy of the first abnormal region detected in step 10), $n_{A_d}$ represents the number of pixel points of the second abnormal region that fall within the first abnormal region, $N_{A_d}$ represents the total number of pixel points within the first abnormal region, and $n_{N_d}$ represents the number of pixel points of the second abnormal region that fall within the first normal region.
Step 302) calculating a second posterior probability by using the formula (2):
in the formula, $p(A_o \mid d)$ represents the second posterior probability, $p(o)$ represents the accuracy of the second abnormal region detected in step 20), $n_{A_o}$ represents the number of pixel points of the first abnormal region that fall within the second abnormal region, $N_{A_o}$ represents the total number of pixel points within the second abnormal region, and $n_{N_o}$ represents the number of pixel points of the first abnormal region that fall within the second normal region.
Step 303) taking the first posterior probability as the weight of the pixel points in the first abnormal region, taking the second posterior probability as the weight of the pixel points in the second abnormal region, obtaining a fused region by using the formula (3), and taking the fused region as the abnormal region of the sample video.
In the formula, $A_f$ denotes the fused region, $A_{A_d}$ denotes the first abnormal region, and $A_{A_o}$ denotes the second abnormal region.
If the first posterior probability is larger than the second posterior probability, the first abnormal area is selected as the fused area; if the first posterior probability is smaller, the second abnormal area is selected as the fused area.
Compared with traditional methods, the Bayesian theory starts from a subjective prior probability and continuously corrects it, which improves accuracy. Performing region fusion with the Bayesian theory and using the posterior probability as the weight of each region's detection accuracy can improve the detection precision of the fused region.
Detection based on the deep learning method has better detection precision and robustness but depends on a large amount of sample data. In practice it is often difficult to obtain a large number of targeted samples, and training has to be performed with little sample data, which easily leads to missed detection of the same type of abnormal behavior when parameters such as brightness change. The optical flow-based method performs well on common behaviors without a large amount of data and detects speed-related abnormal behaviors such as sudden braking well, but as sample data accumulates its detection accuracy generally becomes lower than that of the deep learning method. Fusion based on the Bayesian theory can compensate for the deficiency of the deep learning method when sample data is small, while retaining high detection precision when sample data is large.
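As a minimal sketch of how the quantities used in step 301) to step 303) could be obtained, the following Python code represents each region as a set of (row, column) pixel coordinates, counts the overlaps defined for formulas (1) and (2), and then selects the region with the larger posterior probability. The function names, the set representation and the tie-breaking rule are assumptions made for illustration. The posterior probabilities are passed in as inputs because the sketch does not reproduce formulas (1) and (2), and the values 0.7 and 0.4 in the example are placeholders.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]  # (row, column); hypothetical representation of a pixel point


def overlap_counts(region_a: Set[Pixel], region_b: Set[Pixel]) -> Tuple[int, int, int]:
    """Count pixels of region_b relative to region_a.

    With region_a = first abnormal region and region_b = second abnormal region,
    the result is (n_Ad, N_Ad, n_Nd). Swapping the arguments gives (n_Ao, N_Ao, n_No).
    """
    n_inside = len(region_b & region_a)   # pixels of region_b that fall within region_a
    n_total = len(region_a)               # all pixels of region_a
    n_outside = len(region_b - region_a)  # pixels of region_b that fall in the normal area of a
    return n_inside, n_total, n_outside


def select_fused_region(first_abnormal: Set[Pixel], second_abnormal: Set[Pixel],
                        p_first: float, p_second: float) -> Set[Pixel]:
    """Return the region with the larger posterior (ties go to the first: an assumption)."""
    return first_abnormal if p_first >= p_second else second_abnormal


# Two overlapping 4 x 4 squares as toy abnormal regions.
a_d = {(r, c) for r in range(0, 4) for c in range(0, 4)}
a_o = {(r, c) for r in range(2, 6) for c in range(2, 6)}
print(overlap_counts(a_d, a_o))  # (4, 16, 12)
fused = select_fused_region(a_d, a_o, p_first=0.7, p_second=0.4)  # placeholder posteriors
```

Representing regions as pixel sets keeps the counting explicit; a binary mask per frame would serve equally well.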
Preferably, step 10) specifically comprises:
step 101) inputting continuous sample video frames into a convolutional neural network model to obtain a primary prediction result of the sample video frames. The primary prediction result comprises a feature vector in each frame of image and predicted primary region information, and the region information comprises region center coordinates, region length and width, confidence and classification results.
Step 102) inputting the primary prediction results of the continuous sample video frames into the long short-term memory (LSTM) network model to obtain the secondary prediction results of the sample video frames. The secondary prediction result comprises secondary region information predicted in each frame of image, and the region information comprises region center coordinates, region length and width, confidence and classification results.
Step 103) forming the regions of which the classification results in the secondary prediction results are abnormal into a first abnormal region, and forming all other regions except the first abnormal region into a first normal region.
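Step 103) can be sketched as follows, under the assumption that each secondary prediction is an axis-aligned rectangle given by its center coordinates, width and height in pixels. The `Region` class, the label string "abnormal" and the function names are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


@dataclass
class Region:
    cx: float          # region center, x coordinate
    cy: float          # region center, y coordinate
    width: float       # region extent along x
    height: float      # region extent along y
    confidence: float
    label: str         # hypothetical classification result: "abnormal" or "normal"


def region_pixels(region: Region, image_w: int, image_h: int) -> Set[Pixel]:
    """Rasterize a rectangle into pixel coordinates, clipped to the image."""
    x0 = max(0, int(round(region.cx - region.width / 2)))
    x1 = min(image_w, int(round(region.cx + region.width / 2)))
    y0 = max(0, int(round(region.cy - region.height / 2)))
    y1 = min(image_h, int(round(region.cy + region.height / 2)))
    return {(y, x) for y in range(y0, y1) for x in range(x0, x1)}


def split_first_regions(regions: Iterable[Region], image_w: int,
                        image_h: int) -> Tuple[Set[Pixel], Set[Pixel]]:
    """Return (first abnormal region, first normal region) for one frame."""
    abnormal: Set[Pixel] = set()
    for region in regions:
        if region.label == "abnormal":
            abnormal |= region_pixels(region, image_w, image_h)
    all_pixels = {(y, x) for y in range(image_h) for x in range(image_w)}
    return abnormal, all_pixels - abnormal  # everything else is the first normal region
```

Note that a region classified as normal contributes nothing by itself: the first normal region is simply the complement of the first abnormal region within the image.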
Detecting abnormal behavior regions with the convolutional neural network model alone can only extract the spatial features in an image and lacks temporal connection. By combining the model with the LSTM, the method of this embodiment establishes the relation between frames and screens and retains information from previous frames, which is beneficial to tracking and detecting abnormal behavior areas.
Further, the convolutional neural network model adopts a YOLOv3 architecture. The input image is divided into 13 × 13, 26 × 26 and 52 × 52 grids, 3 prediction frames are set in each grid cell for detecting an object falling into that cell, and the center coordinates of the object, the length and width of the prediction frames and the confidence are obtained. The information of the 3 prediction frames for the entire image is calculated by a decoding process, and the prediction frame region having the highest score is selected as the detected region.
Further, the loss function of the convolutional neural network model consists of a confidence error, an IOU error and a classification error, and the model is updated and optimized by a stochastic gradient descent optimizer.
The method adopts a convolutional neural network model with the YOLOv3 architecture, which, compared with the RCNN series and SSD, has a higher detection speed and a lower background false detection rate. The YOLOv3 network model uses grid divisions of 3 different sizes, which helps improve the detection precision for target areas of different sizes.
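To give a sense of scale, the following sketch counts the candidate prediction frames produced by the three grid divisions and picks the frame with the highest score. The tuple layout `(cx, cy, w, h, score)` and the function names are assumptions for illustration; how the score is derived from confidence and classification results is not specified here and is left to the caller.

```python
from typing import List, Sequence, Tuple

GRID_SIZES = (13, 26, 52)   # grid divisions named in the text
FRAMES_PER_CELL = 3         # prediction frames per grid cell

Frame = Tuple[float, float, float, float, float]  # (cx, cy, w, h, score); hypothetical layout


def total_prediction_frames(grid_sizes: Sequence[int] = GRID_SIZES,
                            frames_per_cell: int = FRAMES_PER_CELL) -> int:
    """Number of candidate prediction frames over all grid divisions."""
    return sum(g * g * frames_per_cell for g in grid_sizes)


def best_frame(frames: List[Frame]) -> Frame:
    """Select the prediction frame with the highest score."""
    return max(frames, key=lambda frame: frame[4])


print(total_prediction_frames())  # 10647
```

The count follows from (169 + 676 + 2704) × 3 = 10647 candidates per image.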
Preferably, as shown in fig. 2, the long short-term memory network model includes 1 long short-term memory network layer and 1 output layer. The loss function of the long short-term memory network model consists of a confidence error, an IOU error and a classification error, and the model is updated and optimized by a stochastic gradient descent optimizer.
If a DeepSORT-based method were adopted, the temporal relation established would be between adjacent frames only, and information from image frames far in the past could not be retained. The method of this embodiment adopts the LSTM network to retain image frame information over longer time intervals, so that the detection area can be better tracked and predicted.
Preferably, step 20) specifically comprises:
step 201) carrying out foreground extraction on the sample video frame by frame based on ViBe, and extracting a dynamic area.
Specifically, a sample set is stored for each pixel in the video frame; the sample set consists of past pixel values of that pixel and pixel values of its neighboring pixels. Each new pixel value in the video frame is compared with the sample set of the corresponding pixel to judge whether the pixel is a background point, and the dynamic region is thereby extracted. For example, the original picture is shown in fig. 3(a), and the binary image of the dynamic region extracted based on ViBe is shown in fig. 3(b).
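The background test described above can be sketched as follows for gray-scale frames. A pixel is treated as background when enough samples in its sample set lie close to its new value. The parameter names and the default values `radius=20` and `min_matches=2` are illustrative assumptions and are not taken from this document; the random update of the sample sets that ViBe also performs is omitted.

```python
from typing import List, Sequence


def is_background(pixel_value: float, samples: Sequence[float],
                  radius: float = 20, min_matches: int = 2) -> bool:
    """Return True when at least min_matches samples lie within radius of the new value."""
    matches = 0
    for sample in samples:
        if abs(pixel_value - sample) < radius:
            matches += 1
            if matches >= min_matches:
                return True
    return False


def foreground_mask(frame: List[List[float]],
                    sample_sets: List[List[Sequence[float]]],
                    radius: float = 20, min_matches: int = 2) -> List[List[int]]:
    """Binary image of the dynamic region: 1 marks a moving pixel, 0 marks background."""
    return [
        [0 if is_background(frame[r][c], sample_sets[r][c], radius, min_matches) else 1
         for c in range(len(frame[r]))]
        for r in range(len(frame))
    ]


# One row, two pixels: the first matches its samples, the second does not.
frame = [[100, 200]]
sample_sets = [[[98, 103, 150], [90, 95, 100]]]
print(foreground_mask(frame, sample_sets))  # [[0, 1]]
```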
Step 202) calculating the optical flow of the dynamic area according to an optical flow method, classifying according to the optical flow characteristics of the dynamic area, and extracting a second abnormal area and a second normal area of the sample video.
Further, the step 202) specifically includes:
calculating the optical flow of the pixel points in the dynamic area by using the formula (4):
$I_x u + I_y v = -I_t$
in the formula, $n_i$ denotes the i-th pixel point in a preset window, $I_x$ denotes the partial derivative of the image at the pixel point in the x direction, $I_y$ denotes the partial derivative of the image at the pixel point in the y direction, $I_t$ denotes the partial derivative of the image at the pixel point with respect to time t, u denotes the velocity of the pixel point in the x direction, and v denotes the velocity of the pixel point in the y direction.
Dividing a video frame into several small regions and assuming that the instantaneous velocity is constant within each region, for a window of a given size one can obtain:
so as to obtain the optical flow of the area.
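One common way to solve for a single velocity per window is the Lucas-Kanade least-squares formulation: the constraint $I_x(n_i) u + I_y(n_i) v = -I_t(n_i)$ is written for every pixel point $n_i$ in the window and the resulting over-determined system is solved through its normal equations. The sketch below assumes this formulation, which may differ in detail from the expression intended above, and the function name and list-based inputs are hypothetical.

```python
from typing import Sequence, Tuple


def window_flow(ix: Sequence[float], iy: Sequence[float],
                it: Sequence[float]) -> Tuple[float, float]:
    """Least-squares (u, v) for one window from per-pixel derivatives I_x, I_y, I_t."""
    sxx = sum(a * a for a in ix)
    sxy = sum(a * b for a, b in zip(ix, iy))
    syy = sum(b * b for b in iy)
    sxt = sum(a * t for a, t in zip(ix, it))
    syt = sum(b * t for b, t in zip(iy, it))
    # Normal equations: [[sxx, sxy], [sxy, syy]] @ [u, v] = [-sxt, -syt]
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("window has too little texture to determine the flow")
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v


# Derivatives generated from a known velocity (u, v) = (2, -1).
ix = [1.0, 0.0, 1.0, 2.0]
iy = [0.0, 1.0, 1.0, 1.0]
it = [-(a * 2.0 + b * -1.0) for a, b in zip(ix, iy)]
print(window_flow(ix, iy, it))
```

The determinant check guards against windows in which the gradients all point the same way, where the two unknowns cannot be separated.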
The average value of the optical flows of the pixels in a dynamic area is taken as the optical flow feature of that dynamic area. During training, each dynamic area in a video frame is given a ground-truth label by manually marking the abnormal areas, and the parameters of the optical flow model are adjusted automatically by a grid search method. A Support Vector Machine (SVM) is used as the classifier to predict the label of each dynamic area; the dynamic areas labeled abnormal together form the second abnormal area, and the dynamic areas labeled normal together form the second normal area.
Common foreground extraction methods include the frame difference method, the optical flow method, and the like. Dynamic targets extracted by the frame difference method tend to have holes in their interior, that is, only the outline of a moving object is detected; dynamic targets extracted by the optical flow method are easily affected by noise and occlusion, which causes large errors. The ViBe adopted in the method of this embodiment has a good detection effect, a high detection speed, and a certain robustness to noise.
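The feature extraction and the final grouping can be sketched as follows. The code computes the mean optical flow of one dynamic area and merges the dynamic areas by predicted label; the SVM itself and the grid search are not shown. The function names, the use of pixel sets for dynamic areas and the convention that label 1 means abnormal are assumptions for illustration.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]  # (row, column)


def mean_flow_feature(flow_u: List[List[float]], flow_v: List[List[float]],
                      area: Set[Pixel]) -> Tuple[float, float]:
    """Average (u, v) over the pixels of one dynamic area."""
    if not area:
        raise ValueError("dynamic area is empty")
    n = len(area)
    mean_u = sum(flow_u[r][c] for r, c in area) / n
    mean_v = sum(flow_v[r][c] for r, c in area) / n
    return mean_u, mean_v


def merge_by_label(areas: Sequence[Set[Pixel]],
                   labels: Sequence[int]) -> Tuple[Set[Pixel], Set[Pixel]]:
    """Return (second abnormal region, second normal region); label 1 means abnormal (assumed)."""
    abnormal: Set[Pixel] = set()
    normal: Set[Pixel] = set()
    for area, label in zip(areas, labels):
        (abnormal if label == 1 else normal).update(area)
    return abnormal, normal


flow_u = [[1.0, 3.0], [0.0, 0.0]]
flow_v = [[0.0, 2.0], [0.0, 0.0]]
print(mean_flow_feature(flow_u, flow_v, {(0, 0), (0, 1)}))  # (2.0, 1.0)
```

In a full pipeline the tuples returned by `mean_flow_feature` would form the feature vectors passed to the classifier, and its predictions would be the `labels` argument of `merge_by_label`.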
The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are intended to further illustrate the principles of the invention, and that various changes and modifications may be made without departing from the spirit and scope of the invention, which is also intended to be covered by the appended claims. The scope of the invention is defined by the claims and their equivalents.