Method and related device for separating foreground and background of video
1. A method for separating foreground and background of a video, comprising:
S1, respectively carrying out multi-scale feature extraction and feature fusion on the first three frames of images of the video according to an initial weight through a preset multi-scale feature extraction model to obtain feature fusion graphs corresponding to the first three frames of images;
S2, constructing a background model according to the feature fusion graphs corresponding to the first three frames of images of the video and updating the initial weight to obtain a first background model and a first weight;
S3, performing multi-scale feature extraction and feature fusion on the ith frame image of the video according to the first weight through the preset multi-scale feature extraction model to obtain a feature fusion image corresponding to the ith frame image in the video, wherein i ∈ [4, N], and N is the number of frames of the video;
S4, separating foreground points and background points of the ith frame image according to the first background model and the feature fusion image corresponding to the ith frame image to obtain a foreground-background separation result of the ith frame image;
and S5, updating the first background model and the first weight according to the feature fusion image corresponding to the ith frame image and the foreground-background separation result to obtain the updated first background model and the updated first weight, setting i = i + 1, and returning to step S3 until i = N, to complete the foreground-background separation of the video.
2. The method for separating foreground from background in video according to claim 1, wherein step S1 specifically includes:
respectively performing multi-scale feature extraction on the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video through the preset multi-scale feature extraction model to obtain scale feature layers corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video, wherein the scale feature layer corresponding to each frame image comprises feature layers with multiple scales;
and respectively carrying out feature fusion on the scale feature layers corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video according to the initial weight through the preset multi-scale feature extraction model to obtain feature fusion images corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video.
3. The method for separating the foreground from the background of the video according to claim 1, wherein the step S2 specifically includes:
S21, constructing an initial background model based on the feature fusion graphs corresponding to the 1st frame image and the 3rd frame image of the video;
S22, separating foreground points and background points in the 2nd frame image according to the initial background model and the feature fusion image corresponding to the 2nd frame image of the video to obtain a foreground-background separation result of the 2nd frame image;
S23, updating the initial background model and the initial weight according to the feature fusion image corresponding to the 2nd frame image of the video and the foreground-background separation result to obtain a first background model and a first weight.
4. The method for separating the foreground from the background of the video according to claim 3, wherein the step S21 specifically includes:
and calculating the average characteristics of the characteristic fusion images corresponding to the 1 st frame image and the 3 rd frame image of the video to obtain an initial background model.
5. The method for separating the foreground from the background of the video according to claim 3, wherein the step S22 specifically includes:
calculating the absolute difference between the feature values at the same position in the feature fusion image corresponding to the 2nd frame image of the video and in the initial background model;
determining the pixel points at the positions corresponding to the feature values whose absolute differences are greater than or equal to the foreground-background separation threshold as foreground points, and determining the pixel points at the positions corresponding to the feature values whose absolute differences are less than the foreground-background separation threshold as background points, to obtain a foreground-background separation result of the 2nd frame image;
wherein the foreground-background separation threshold is calculated based on an average pixel difference value of the 1st frame image and the 3rd frame image of the video.
6. The method for separating the foreground from the background of the video according to claim 3, wherein the step S23 specifically includes:
acquiring pixel points which are background points in the 2 nd frame image from a foreground and background separation result corresponding to the 2 nd frame image of the video to obtain target pixel points;
updating the characteristic value of the initial background model, which is located at the same position as the target pixel point, according to the characteristic value of the target pixel point in the characteristic fusion image corresponding to the 2 nd frame image to obtain a first background model;
determining the ratio of foreground points in the 2 nd frame image according to a foreground and background separation result corresponding to the 2 nd frame image of the video;
and updating the initial weight according to the ratio of foreground points in the 2 nd frame image to obtain a first weight.
7. A video foreground and background separation apparatus, comprising:
the first feature extraction and fusion unit is used for respectively carrying out multi-scale feature extraction and feature fusion on the first three frames of images of the video according to the initial weight through a preset multi-scale feature extraction model to obtain feature fusion images corresponding to the first three frames of images;
the building and updating unit is used for building a background model and updating the initial weight according to the feature fusion graph corresponding to the first three frames of images of the video to obtain a first background model and a first weight;
the second feature extraction and fusion unit is used for performing multi-scale feature extraction and feature fusion on the ith frame image of the video according to the first weight through the preset multi-scale feature extraction model to obtain a feature fusion image corresponding to the ith frame image in the video, wherein i ∈ [4, N], and N is the number of frames of the video;
the separation unit is used for separating foreground points and background points of the ith frame image according to the first background model and the feature fusion image corresponding to the ith frame image to obtain a foreground-background separation result of the ith frame image;
and the updating unit is used for updating the first background model and the first weight according to the feature fusion image corresponding to the ith frame image and the foreground-background separation result to obtain the updated first background model and the updated first weight, setting i = i + 1, and triggering the second feature extraction and fusion unit until i = N, to complete the foreground-background separation of the video.
8. The apparatus according to claim 7, wherein the constructing and updating unit specifically comprises:
the construction subunit is used for constructing an initial background model based on the feature fusion graphs corresponding to the 1 st frame image and the 3 rd frame image of the video;
the separation subunit is configured to separate foreground points and background points in the 2 nd frame image according to the initial background model and the feature fusion image corresponding to the 2 nd frame image of the video, so as to obtain a foreground-background separation result of the 2 nd frame image;
and the updating subunit is used for updating the initial background model and the initial weight according to the feature fusion image corresponding to the 2 nd frame image of the video and the foreground-background separation result to obtain a first background model and a first weight.
9. An electronic device, comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the video foreground and background separation method of any one of claims 1-6 according to instructions in the program code.
10. A computer-readable storage medium for storing program code for performing the video foreground and background separation method of any one of claims 1-6.
Background
Foreground-background separation of a surveillance video, namely separating the moving foreground targets and the static background from the video stream, has important application value in real life, for example in target tracking and urban traffic monitoring.
Most existing video foreground-background separation methods model a single pixel along the time axis or rely only on local information, so changes in the global information are not fully utilized and the separation effect is not ideal.
Disclosure of Invention
The application provides a video foreground-background separation method and a related device thereof, which are used for addressing the technical problem that the separation effect of existing foreground-background separation methods is unsatisfactory.
In view of this, a first aspect of the present application provides a method for separating a foreground from a background of a video, including:
S1, respectively carrying out multi-scale feature extraction and feature fusion on the first three frames of images of the video according to an initial weight through a preset multi-scale feature extraction model to obtain feature fusion graphs corresponding to the first three frames of images;
S2, constructing a background model according to the feature fusion graphs corresponding to the first three frames of images of the video and updating the initial weight to obtain a first background model and a first weight;
S3, performing multi-scale feature extraction and feature fusion on the ith frame image of the video according to the first weight through the preset multi-scale feature extraction model to obtain a feature fusion image corresponding to the ith frame image in the video, wherein i ∈ [4, N], and N is the number of frames of the video;
S4, separating foreground points and background points of the ith frame image according to the first background model and the feature fusion image corresponding to the ith frame image to obtain a foreground-background separation result of the ith frame image;
and S5, updating the first background model and the first weight according to the feature fusion image corresponding to the ith frame image and the foreground-background separation result to obtain the updated first background model and the updated first weight, setting i = i + 1, and returning to step S3 until i = N, to complete the foreground-background separation of the video.
Optionally, step S1 specifically includes:
respectively performing multi-scale feature extraction on the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video through the preset multi-scale feature extraction model to obtain scale feature layers corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video, wherein the scale feature layer corresponding to each frame image comprises feature layers with multiple scales;
and respectively carrying out feature fusion on the scale feature layers corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video according to the initial weight through the preset multi-scale feature extraction model to obtain feature fusion images corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video.
Optionally, step S2 specifically includes:
S21, constructing an initial background model based on the feature fusion graphs corresponding to the 1st frame image and the 3rd frame image of the video;
S22, separating foreground points and background points in the 2nd frame image according to the initial background model and the feature fusion image corresponding to the 2nd frame image of the video to obtain a foreground-background separation result of the 2nd frame image;
S23, updating the initial background model and the initial weight according to the feature fusion image corresponding to the 2nd frame image of the video and the foreground-background separation result to obtain a first background model and a first weight.
Optionally, step S21 specifically includes:
and calculating the average characteristics of the characteristic fusion images corresponding to the 1 st frame image and the 3 rd frame image of the video to obtain an initial background model.
Optionally, step S22 specifically includes:
calculating the absolute difference value of the feature fusion image corresponding to the 2 nd frame image of the video and the feature value of the same position in the initial background model;
determining the pixel points at the positions corresponding to the feature values whose absolute differences are greater than or equal to the foreground-background separation threshold as foreground points, and determining the pixel points at the positions corresponding to the feature values whose absolute differences are less than the foreground-background separation threshold as background points, to obtain a foreground-background separation result of the 2nd frame image;
wherein the foreground-background separation threshold is calculated based on an average pixel difference value of the 1st frame image and the 3rd frame image of the video.
Optionally, step S23 specifically includes:
acquiring pixel points which are background points in the 2 nd frame image from a foreground and background separation result corresponding to the 2 nd frame image of the video to obtain target pixel points;
updating the characteristic value of the initial background model, which is located at the same position as the target pixel point, according to the characteristic value of the target pixel point in the characteristic fusion image corresponding to the 2 nd frame image to obtain a first background model;
determining the ratio of foreground points in the 2 nd frame image according to a foreground and background separation result corresponding to the 2 nd frame image of the video;
and updating the initial weight according to the ratio of foreground points in the 2 nd frame image to obtain a first weight.
A second aspect of the present application provides a device for separating foreground and background of a video, including:
the first feature extraction and fusion unit is used for respectively carrying out multi-scale feature extraction and feature fusion on the first three frames of images of the video according to the initial weight through a preset multi-scale feature extraction model to obtain feature fusion images corresponding to the first three frames of images;
the building and updating unit is used for building a background model and updating the initial weight according to the feature fusion graph corresponding to the first three frames of images of the video to obtain a first background model and a first weight;
the second feature extraction and fusion unit is used for performing multi-scale feature extraction and feature fusion on the ith frame image of the video according to the first weight through the preset multi-scale feature extraction model to obtain a feature fusion image corresponding to the ith frame image in the video, wherein i ∈ [4, N], and N is the number of frames of the video;
the separation unit is used for separating foreground points and background points of the ith frame image according to the first background model and the feature fusion image corresponding to the ith frame image to obtain a foreground-background separation result of the ith frame image;
and the updating unit is used for updating the first background model and the first weight according to the feature fusion image corresponding to the ith frame image and the foreground-background separation result to obtain the updated first background model and the updated first weight, setting i = i + 1, and triggering the second feature extraction and fusion unit until i = N, to complete the foreground-background separation of the video.
Optionally, the constructing and updating unit specifically includes:
the construction subunit is used for constructing an initial background model based on the feature fusion graphs corresponding to the 1 st frame image and the 3 rd frame image of the video;
the separation subunit is configured to separate foreground points and background points in the 2 nd frame image according to the initial background model and the feature fusion image corresponding to the 2 nd frame image of the video, so as to obtain a foreground-background separation result of the 2 nd frame image;
and the updating subunit is used for updating the initial background model and the initial weight according to the feature fusion image corresponding to the 2 nd frame image of the video and the foreground-background separation result to obtain a first background model and a first weight.
A third aspect of the application provides an electronic device comprising a processor and a memory;
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the video foreground and background separation method of any of the first aspect according to instructions in the program code.
A fourth aspect of the present application provides a computer-readable storage medium for storing program code for performing the video foreground and background separation method of any one of the first aspects.
According to the technical scheme, the method has the following advantages:
the application provides a method for separating a foreground from a background of a video, which comprises the following steps: s1, respectively carrying out multi-scale feature extraction and feature fusion on the first three frames of images of the video according to the initial weight through a preset multi-scale feature extraction model to obtain feature fusion graphs corresponding to the first three frames of images; s2, constructing a background model and updating an initial weight according to the feature fusion graph corresponding to the first three frames of images of the video to obtain a first background module and a first weight; s3, performing multi-scale feature extraction and feature fusion on the ith frame image of the video according to the first weight through a preset multi-scale feature extraction model to obtain a feature fusion image corresponding to the ith frame image in the video, wherein i belongs to [4, N ], and N is the frame number of the video; s4, separating foreground points and background points of the ith frame image according to the first background model and the feature fusion image corresponding to the ith frame image to obtain a foreground and background separation result of the ith frame image; and S5, updating the first background model and the first weight according to the feature fusion image corresponding to the ith frame image and the foreground-background separation result to obtain the updated first background model and the updated first weight, setting i to i +1, and returning to the step S3 until i to N to complete the foreground-background separation of the video.
In the method, considering that a change at one pixel point influences other pixel points to different degrees, feature extraction at different scales is carried out on each frame image of the video, and the feature maps of different scales are fused according to different weights so that changes in the global information are taken into account. A background model is constructed and the weights are initialized from the first three frames of the video, foreground-background separation is carried out from the 4th frame onwards, and the background model and the weights are adaptively updated according to the separation result of each image. This effectively suppresses the 'hole' artifacts common to ordinary background separation methods, thereby improving the video foreground-background separation effect and alleviating the technical problem that the separation effect of existing foreground-background separation methods is unsatisfactory.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic flowchart of a method for separating a foreground from a background of a video according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a preset multi-scale feature extraction model according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of building a background model and updating initial weights according to an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a video foreground and background separation process provided in an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
For easy understanding, please refer to fig. 1, an embodiment of a method for separating foreground and background of a video provided by the present application includes:
and S1, respectively carrying out multi-scale feature extraction and feature fusion on the first three frames of images of the video according to the initial weight through a preset multi-scale feature extraction model to obtain feature fusion images corresponding to the first three frames of images.
Traditional background separation algorithms model a single pixel along the time axis or rely only on local information, so changes in the global information are not fully utilized and the separation effect is not ideal. The embodiment of the application considers that a change at any pixel point influences the other pixel points to different degrees; therefore, feature extraction at different scales is performed on the input image, and the feature maps of different scales are fused according to different weights so that changes in the global information are taken into account.
Specifically, multi-scale feature extraction is performed on the 1st frame image, the 2nd frame image and the 3rd frame image of the video respectively through the preset multi-scale feature extraction model to obtain the scale feature layers corresponding to the 1st, 2nd and 3rd frame images of the video, wherein the scale feature layers corresponding to each frame image comprise feature layers of multiple scales; feature fusion is then performed on the scale feature layers corresponding to the 1st, 2nd and 3rd frame images respectively, according to the initial weight, through the preset multi-scale feature extraction model, to obtain the feature fusion maps corresponding to the 1st, 2nd and 3rd frame images of the video.
Referring to fig. 2, the preset multi-scale feature extraction model comprises 4 convolutional layers, 3 up-sampling layers and a feature fusion layer. The first convolutional layer (conv1) has a 3 × 3 convolution kernel, a stride of 1 and zero padding. Assuming the input image of the preset multi-scale feature extraction model is 300 × 300, the conv1 layer convolves the input image to extract its edge information and produces a feature layer F1 of size 300 × 300; feature layer F1 has the same size as the input image, and each feature point of F1 characterizes 1 × 1 information.
The second convolutional layer (conv2) has a 3 × 3 convolution kernel, a stride of 3 and no padding. The conv2 layer convolves feature layer F1 to extract a 100 × 100 feature layer F2; each feature point of F2 characterizes 3 × 3 information.
The third convolutional layer (conv3) has a 3 × 3 convolution kernel, a stride of 3 and no padding. The conv3 layer convolves feature layer F2 to extract a feature layer F3 of size 34 × 34; each feature point of F3 characterizes 9 × 9 information.
The fourth convolutional layer (conv4) has a 3 × 3 convolution kernel, a stride of 3 and no padding. The conv4 layer convolves feature layer F3 to extract a 12 × 12 feature layer F4; each feature point of F4 characterizes 27 × 27 information.
Because the 4 feature layers extracted by the preset multi-scale feature extraction model have different scales, they cannot be directly fused. An up-sampling layer is therefore added after each of conv2, conv3 and conv4, and feature layers F2, F3 and F4 are up-sampled so that the sampled feature layers F2, F3 and F4 have the same size as feature layer F1, i.e. the same size as the input image. The feature fusion layer then performs feature fusion on feature layer F1 and the sampled feature layers F2, F3 and F4 corresponding to the input image according to the initial weight W = {W1, W2, W3, W4}, obtaining the feature fusion map corresponding to the input image.
The initial weights W1, W2, W3 and W4 have the same size as the input image and are the weight coefficients of feature layer F1 and the sampled feature layers F2, F3 and F4, respectively. Since the feature points of the lower-numbered feature layers carry richer detail and texture information, their weight coefficients are larger. Assuming that the weight coefficients satisfy a Gaussian distribution and performing a normalization operation, we obtain:
W1 = 0.6439·E, W2 = 0.2369·E, W3 = 0.0871·E, W4 = 0.0321·E;
where E is an identity matrix of the same size as the input image.
The 1st frame image, the 2nd frame image and the 3rd frame image of the video are each subjected to multi-scale feature extraction through the preset multi-scale feature extraction model to obtain the scale feature layers corresponding to the 1st, 2nd and 3rd frame images, where the scale feature layers corresponding to each of the three frames comprise feature layers of 4 scales. Feature fusion is then performed, according to the initial weight and through the preset multi-scale feature extraction model, on the scale feature layers corresponding to the 1st frame image to obtain the feature fusion map B1 corresponding to the 1st frame image, on the scale feature layers corresponding to the 2nd frame image to obtain the feature fusion map B2 corresponding to the 2nd frame image, and on the scale feature layers corresponding to the 3rd frame image to obtain the feature fusion map B3 corresponding to the 3rd frame image.
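For illustration only, the following Python/NumPy sketch approximates this multi-scale extraction and weighted fusion. The learned convolution kernels of conv1 to conv4 and the up-sampling layers are replaced by fixed mean filters whose window sizes (1 × 1, 3 × 3, 9 × 9, 27 × 27) match the neighborhoods each feature layer is described as characterizing; the helper name `multiscale_fusion` and the use of mean filters are assumptions of the sketch, not the patent's trained model.

```python
import numpy as np
from scipy.ndimage import uniform_filter

# Gaussian-derived, normalized initial weights W1..W4 from the description above
INIT_WEIGHTS = np.array([0.6439, 0.2369, 0.0871, 0.0321])

def multiscale_fusion(frame, weights=INIT_WEIGHTS):
    """Approximate feature layers F1..F4 with mean filters over 1x1, 3x3, 9x9
    and 27x27 neighborhoods (stand-ins for conv1..conv4 plus up-sampling),
    then fuse the four same-size layers with the per-layer weights."""
    frame = frame.astype(np.float64)
    layers = [uniform_filter(frame, size=s, mode="nearest") for s in (1, 3, 9, 27)]
    return sum(w * f for w, f in zip(weights, layers))
```

Under these assumptions, the feature fusion maps of the first three frames would be obtained as B1 = multiscale_fusion(frame1), and similarly B2 and B3 for the 2nd and 3rd frames.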
S2, constructing a background model according to the feature fusion maps corresponding to the first three frames of images of the video and updating the initial weight to obtain a first background model and a first weight.
Specifically, referring to fig. 3, step S2 specifically includes:
S21, constructing an initial background model based on the feature fusion maps corresponding to the 1st frame image and the 3rd frame image of the video.
The average features of the feature fusion maps B1 and B3 corresponding to the 1st frame image and the 3rd frame image of the video are calculated to obtain the initial background model, namely the feature value b(i, j) of the initial background model at position (i, j) is:
b(i, j) = (B1(i, j) + B3(i, j)) / 2,
where B1(i, j) and B3(i, j) are the feature values at position (i, j) in the feature fusion maps B1 and B3, respectively.
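Continuing the illustrative sketch above, with B1 and B3 produced by the assumed `multiscale_fusion` helper, the initial background model is simply the element-wise average of the two fused maps:

```python
def init_background(B1, B3):
    """Initial background model: element-wise average of the feature fusion
    maps of the 1st and 3rd frame images."""
    return (B1 + B3) / 2.0

# e.g. B_bg = init_background(multiscale_fusion(frame1), multiscale_fusion(frame3))
```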
S22, separating foreground points and background points in the 2nd frame image according to the initial background model and the feature fusion map corresponding to the 2nd frame image of the video to obtain a foreground-background separation result of the 2nd frame image.
Calculating the absolute difference between the feature values at the same position in the feature fusion map corresponding to the 2nd frame image of the video and in the initial background model;
determining pixel points at positions corresponding to the feature values whose absolute differences are greater than or equal to the foreground-background separation threshold as foreground points, and determining pixel points at positions corresponding to the feature values whose absolute differences are less than the foreground-background separation threshold as background points, to obtain the foreground-background separation result of the 2nd frame image;
the foreground-background separation threshold is calculated based on the average pixel difference value between the 1st frame image and the 3rd frame image of the video, namely:
T = (K / (N × M)) × Σi Σj |I3(i, j) − I1(i, j)|, with the sums running over i = 1..N and j = 1..M,
where T is the foreground-background separation threshold, K is a coefficient (preferably, K is 4), N is the number of rows of pixels in the 1st frame image or the 3rd frame image of the video, M is the number of columns of pixels in the 1st frame image or the 3rd frame image of the video, I3(i, j) is the pixel value of the pixel point at position (i, j) in the 3rd frame image of the video, and I1(i, j) is the pixel value of the pixel point at position (i, j) in the 1st frame image of the video.
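Under this reading (the threshold equals K times the mean absolute pixel difference between frames 1 and 3), a sketch of the threshold computation could look as follows; the helper name is an assumption:

```python
import numpy as np

def separation_threshold(frame1, frame3, K=4):
    """Foreground-background separation threshold: K (preferably 4) times the
    mean absolute pixel difference between the 1st and 3rd frame images."""
    diff = np.abs(frame3.astype(np.float64) - frame1.astype(np.float64))
    return K * diff.mean()
```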
Specifically, when |b2(i, j) − b(i, j)| ≥ T, the pixel point at position (i, j) in the 2nd frame image of the video is determined to be a foreground point;
when |b2(i, j) − b(i, j)| < T, the pixel point at position (i, j) in the 2nd frame image of the video is determined to be a background point;
where b2(i, j) is the feature value at position (i, j) in the feature fusion map B2 corresponding to the 2nd frame image of the video, and b(i, j) is the feature value at position (i, j) in the initial background model.
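A minimal per-pixel version of this decision rule, assuming the fused maps and threshold from the sketches above, could be:

```python
import numpy as np

def separate(B_frame, B_bg, T):
    """Return a boolean foreground mask: True where the fused-feature
    difference from the background model reaches the threshold T."""
    return np.abs(B_frame - B_bg) >= T

# e.g. foreground-background separation result of the 2nd frame:
# fg2 = separate(B2, B_bg, T)
```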
And S23, updating the initial background model and the initial weight according to the feature fusion map corresponding to the 2nd frame image of the video and the foreground-background separation result to obtain a first background model and a first weight.
Wherein, the updating process of the initial background model comprises the following steps:
acquiring pixel points which are background points in the 2nd frame image from the foreground-background separation result corresponding to the 2nd frame image of the video to obtain target pixel points;
and updating the feature values of the initial background model located at the same positions as the target pixel points according to the feature values of the target pixel points in the feature fusion map corresponding to the 2nd frame image, to obtain the first background model.
Specifically, when the pixel point at position (i, j) in the 2nd frame image of the video is a background point, the pixel point at position (i, j) is taken as a target pixel point, and the feature value b(i, j) of the initial background model at the same position (i, j) is updated according to the feature value b2(i, j) of the target pixel point in the feature fusion map corresponding to the 2nd frame image, to obtain the first background model. The updated value of b(i, j) is denoted b*(i, j), i.e. the feature value at position (i, j) in the first background model.
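The exact update formula for b*(i, j) is not reproduced above. As an illustration only, the sketch below assumes a running-average style update at background points with a blending rate `alpha`; both the rule and the rate are assumptions of the sketch, not taken from the disclosure.

```python
import numpy as np

def update_background(B_bg, B_frame, fg_mask, alpha=0.1):
    """Update the background model only at background points (fg_mask == False)
    by blending in the current frame's fused features; foreground positions are
    left unchanged. The blending rate alpha is an illustrative assumption."""
    B_new = B_bg.copy()
    bg = ~fg_mask
    B_new[bg] = (1.0 - alpha) * B_bg[bg] + alpha * B_frame[bg]
    return B_new
```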
The updating process of the initial weight is as follows:
determining the ratio of foreground points in the 2nd frame image according to the foreground-background separation result corresponding to the 2nd frame image of the video;
and updating the initial weight according to the ratio of foreground points in the 2nd frame image to obtain the first weight.
In the embodiment of the application, the weights are updated according to the contribution degree of each feature point of each feature layer: weights with a large contribution degree are increased, weights with a small contribution degree are decreased, and the weights are then normalized.
Here Wt(i, j) is the initial weight of the feature point of the t-th layer located at position (i, j), and W*t(i, j) is its updated weight; the updated weights of all layers together constitute the updated first weight. Pt(i, j) is the contribution degree of the feature point at position (i, j) of the t-th feature layer (i.e. feature layer F1 and the sampled feature layers F2, F3 and F4), and represents the proportion of foreground pixel points contained in the original image information characterized by that feature point; Pk(i, j) is the contribution degree of the feature point of the k-th feature layer at position (i, j), used in the normalization over all feature layers.
It should be noted that Pt(i, j) can be obtained by calculating the proportion of foreground points in the neighborhood (for example, a 3 × 3 neighborhood) of the feature point (i, j) of the t-th feature layer, i.e. the number of foreground points in the 3 × 3 neighborhood divided by (3 × 3); which feature points in the neighborhood are foreground points is determined from the foreground-background separation result.
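The normalized update rule itself is not reproduced above. The following sketch assumes a simple multiplicative re-weighting of each layer's per-pixel weight by its contribution degree, followed by per-pixel normalization over the four layers; this is one plausible reading of the description, not the patent's exact formula, and the per-layer neighborhood sizes are assumptions carried over from the fusion sketch.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def update_weights(weights, fg_mask, scales=(1, 3, 9, 27), eps=1e-6):
    """Assumed rule: scale each layer's per-pixel weight W_t(i, j) by its
    contribution degree P_t(i, j) (proportion of foreground points in the
    neighborhood that layer's feature point characterizes), then renormalize
    so the four weights sum to 1 at every pixel."""
    fg = fg_mask.astype(np.float64)
    P = [uniform_filter(fg, size=s, mode="nearest") for s in scales]   # P_t(i, j)
    W = np.array([w * (p + eps) for w, p in zip(weights, P)])          # increase/decrease by contribution
    return W / W.sum(axis=0, keepdims=True)                            # per-pixel normalization
```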
And S3, performing multi-scale feature extraction and feature fusion on the ith frame image of the video according to the first weight through a preset multi-scale feature extraction model to obtain a feature fusion image corresponding to the ith frame image in the video.
After the background model and the first weight have been obtained from the first three frame images of the video, the remaining frame images of the video (namely the 4th frame image to the last frame image) are taken as the images to be separated for foreground-background separation. The ith frame image of the video is input into the preset multi-scale feature extraction model for multi-scale feature extraction to obtain its scale feature layers, where i ∈ [4, N] and N is the number of frames of the video; the feature fusion layer in the preset multi-scale feature extraction model then performs feature fusion on the scale feature layers corresponding to this frame image according to the updated first weight, to obtain the feature fusion map corresponding to the ith frame image of the video. The specific process is similar to the extraction of the feature fusion maps in the preceding steps and is not repeated here.
S4, separating foreground points and background points of the ith frame image according to the first background model and the feature fusion image corresponding to the ith frame image, and obtaining a foreground and background separation result of the ith frame image.
The absolute difference between the feature values at the same position in the feature fusion map corresponding to the ith frame image of the video and in the first background model is calculated; pixel points at the positions corresponding to the feature values whose absolute differences are greater than or equal to the foreground-background separation threshold are determined as foreground points, and pixel points at the positions corresponding to the feature values whose absolute differences are less than the foreground-background separation threshold are determined as background points, to obtain the foreground-background separation result of the ith frame image, and a foreground-background separation binary image of the ith frame image is generated according to the separation result. The specific separation process may refer to the foreground-background separation process of the 2nd frame image and is not repeated here.
And S5, updating the first background model and the first weight according to the feature fusion map corresponding to the ith frame image and the foreground-background separation result to obtain the updated first background model and the updated first weight, setting i = i + 1, and returning to step S3 until i = N, to complete the foreground-background separation of the video.
Acquiring pixel points which are background points in the ith frame image from the foreground-background separation result corresponding to the ith frame image of the video to obtain target pixel points; and updating the feature values of the first background model located at the same positions as the target pixel points according to the feature values of the target pixel points in the feature fusion map corresponding to the ith frame image, to obtain the updated first background model.
Determining the proportion of foreground points in the ith frame image according to the foreground-background separation result corresponding to the ith frame image of the video; and updating the first weight according to the proportion of foreground points in the ith frame image to obtain the updated first weight.
The specific updating process of the first background model may refer to the updating process of the initial background model, and the specific updating process of the first weight may refer to the updating process of the initial weight, which is not described herein again.
After the first background model and the first weight have been updated according to the feature fusion map corresponding to the ith frame image of the video and the foreground-background separation result, i is set to i + 1 and the process returns to step S3 to perform foreground-background separation and background-model and weight updating on the next frame image of the video, until i = N and the foreground-background separation of the video is completed, as illustrated in fig. 4.
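Putting the illustrative helpers above together, the overall procedure of steps S1 to S5 could be sketched as follows; every function name here comes from the assumed sketches earlier in this description, not from the patent itself.

```python
def separate_video(frames, K=4):
    """frames: list of 2-D grayscale arrays (frame 1 .. frame N).
    Returns one boolean foreground mask per frame from the 4th frame onwards."""
    weights = INIT_WEIGHTS
    B1, B2, B3 = (multiscale_fusion(f, weights) for f in frames[:3])   # S1
    T = separation_threshold(frames[0], frames[2], K)                  # threshold from frames 1 and 3
    B_bg = init_background(B1, B3)                                     # S21
    fg2 = separate(B2, B_bg, T)                                        # S22
    B_bg = update_background(B_bg, B2, fg2)                            # S23
    weights = update_weights(weights, fg2)
    masks = []
    for frame in frames[3:]:                                           # i = 4 .. N
        B_i = multiscale_fusion(frame, weights)                        # S3
        fg_i = separate(B_i, B_bg, T)                                  # S4
        masks.append(fg_i)
        B_bg = update_background(B_bg, B_i, fg_i)                      # S5
        weights = update_weights(weights, fg_i)
    return masks
```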
In the embodiment of the application, considering that a change at one pixel point influences other pixel points to different degrees, feature extraction at different scales is carried out on each frame image of the video, and the feature maps of different scales are fused according to different weights so that changes in the global information are taken into account. A background model is constructed and the weights are initialized from the first three frames of the video, foreground-background separation is carried out from the 4th frame onwards, and the background model and the weights are adaptively updated according to the separation result of each image. This effectively suppresses the 'hole' artifacts common to ordinary background separation methods, thereby improving the video foreground-background separation effect and alleviating the technical problem that the separation effect of existing foreground-background separation methods is unsatisfactory.
The foregoing is an embodiment of a method for separating a foreground from a background of a video provided by the present application, and the following is an embodiment of a device for separating a foreground from a background of a video provided by the present application.
The embodiment of the application provides a video foreground and background separation apparatus, comprising:
the first feature extraction and fusion unit is used for respectively carrying out multi-scale feature extraction and feature fusion on the first three frames of images of the video according to the initial weight through a preset multi-scale feature extraction model to obtain feature fusion images corresponding to the first three frames of images;
the building and updating unit is used for building a background model and updating the initial weight according to the feature fusion graph corresponding to the first three frames of images of the video to obtain a first background model and a first weight;
the second feature extraction and fusion unit is used for performing multi-scale feature extraction and feature fusion on the ith frame image of the video according to the first weight through the preset multi-scale feature extraction model to obtain a feature fusion image corresponding to the ith frame image in the video, wherein i ∈ [4, N], and N is the number of frames of the video;
the separation unit is used for separating foreground points and background points of the ith frame image according to the first background model and the feature fusion image corresponding to the ith frame image to obtain a foreground-background separation result of the ith frame image;
and the updating unit is used for updating the first background model and the first weight according to the feature fusion image corresponding to the ith frame image and the foreground-background separation result to obtain the updated first background model and the updated first weight, setting i = i + 1, and triggering the second feature extraction and fusion unit until i = N, to complete the foreground-background separation of the video.
As a further refinement, the first feature extraction and fusion unit is specifically configured to:
respectively performing multi-scale feature extraction on the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video through a preset multi-scale feature extraction model to obtain scale feature layers corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video, wherein the scale feature layer corresponding to each frame image comprises feature layers with multiple scales;
and respectively carrying out feature fusion on scale feature layers corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video according to the initial weight through a preset multi-scale feature extraction model to obtain feature fusion graphs corresponding to the 1 st frame image, the 2 nd frame image and the 3 rd frame image of the video.
As a further improvement, the building and updating unit specifically includes:
the construction subunit is used for constructing an initial background model based on the feature fusion graphs corresponding to the 1 st frame image and the 3 rd frame image of the video;
the separation subunit is configured to separate foreground points and background points in the 2 nd frame image according to the initial background model and the feature fusion image corresponding to the 2 nd frame image of the video, so as to obtain a foreground-background separation result of the 2 nd frame image;
and the updating subunit is used for updating the initial background model and the initial weight according to the feature fusion image corresponding to the 2 nd frame image of the video and the foreground-background separation result to obtain a first background model and a first weight.
As a further improvement, the building subunit is specifically for:
and calculating the average characteristics of the characteristic fusion images corresponding to the 1 st frame image and the 3 rd frame image of the video to obtain an initial background model.
As a further improvement, the separation subunit is specifically adapted to:
calculating the absolute difference value of the feature value of the same position in the feature fusion image corresponding to the 2 nd frame image of the video and the initial background model;
determining pixel points at positions corresponding to the feature values whose absolute differences are greater than or equal to the foreground-background separation threshold as foreground points, and determining pixel points at positions corresponding to the feature values whose absolute differences are less than the foreground-background separation threshold as background points, to obtain a foreground-background separation result of the 2nd frame image;
the foreground and background separation threshold is calculated based on the average pixel difference value of the 1 st frame image and the 3 rd frame image of the video.
As a further improvement, the update subunit is specifically configured to:
acquiring pixel points which are background points in a 2 nd frame image from a foreground and background separation result corresponding to the 2 nd frame image of the video to obtain target pixel points;
updating the characteristic value of the initial background model, which is positioned at the same position as the target pixel point, according to the characteristic value of the target pixel point in the characteristic fusion image corresponding to the 2 nd frame image to obtain a first background model;
determining the ratio of foreground points in a 2 nd frame image according to a foreground and background separation result corresponding to the 2 nd frame image of the video;
and updating the initial weight according to the ratio of foreground points in the 2 nd frame image to obtain a first weight.
In the embodiment of the application, considering that a change at one pixel point influences other pixel points to different degrees, feature extraction at different scales is carried out on each frame image of the video, and the feature maps of different scales are fused according to different weights so that changes in the global information are taken into account. A background model is constructed and the weights are initialized from the first three frames of the video, foreground-background separation is carried out from the 4th frame onwards, and the background model and the weights are adaptively updated according to the separation result of each image. This effectively suppresses the 'hole' artifacts common to ordinary background separation methods, thereby improving the video foreground-background separation effect and alleviating the technical problem that the separation effect of existing foreground-background separation methods is unsatisfactory.
The embodiment of the application also provides electronic equipment, which comprises a processor and a memory;
the memory is used for storing the program codes and transmitting the program codes to the processor;
the processor is configured to perform the video foreground and background separation method in the foregoing method embodiments according to instructions in the program code.
The embodiment of the present application further provides a computer-readable storage medium, which is used for storing a program code, where the program code is used for executing the video foreground and background separation method in the foregoing method embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, "a plurality" means two or more. "and/or" for describing an association relationship of associated objects, indicating that there may be three relationships, e.g., "a and/or B" may indicate: only A, only B and both A and B are present, wherein A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of single item(s) or plural items. For example, at least one (one) of a, b, or c, may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for executing all or part of the steps of the method described in the embodiments of the present application through a computer device (which may be a personal computer, a server, or a network device). And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.