Non-green-curtain portrait real-time matting algorithm based on multitask deep learning


1. A green curtain-free portrait real-time matting algorithm based on multitask deep learning is characterized by comprising the following steps:

step 1: performing two-classification adjustment on an original multi-classification multi-target detection data set, inputting an image or video containing portrait information, and performing corresponding data preprocessing on the image or video to obtain preprocessed data of an original input file;

step 2: adopting encoder-logistic regression to construct a deep learning network for human body target detection, inputting the preprocessing data obtained in the step 1, constructing a loss function, training and optimizing the deep learning network for human body target detection, and obtaining a human body target detection model;

step 3: extracting feature maps from the encoder of the human body target detection model in step 2, performing feature splicing and fusing multi-scale image features to form the encoder of the portrait Alpha mask matting network, and realizing an encoder-sharing structure between human body target detection and the portrait Alpha mask matting network;

step 4: constructing a decoder of the portrait Alpha mask matting network, forming, together with the encoder-sharing structure of step 3, an end-to-end encoder-decoder portrait Alpha mask matting network structure, and constructing a loss function for training and optimizing the portrait Alpha mask matting network, taking an image containing human body information and a ternary map as input;

step 5: inputting the preprocessed data obtained in step 1 into the network trained in step 4, and outputting a portrait foreground candidate frame ROI Box and the portrait trimap ternary map within the candidate frame through the logistic regression of the human body target detection model in step 2;

step 6: inputting the portrait foreground candidate frame ROI Box and the portrait trimap ternary map from step 5 into the portrait Alpha mask matting network constructed in step 4, and finally obtaining the portrait Alpha mask prediction result.

2. The multitask deep learning based non-green-curtain portrait real-time matting algorithm according to claim 1, wherein in step 1, the data preprocessing comprises video frame processing and input image resizing.

3. The multitask deep learning based green-curtain-free portrait real-time matting algorithm according to claim 1, characterized in that in step 2, the deep learning network for human body target detection is realized by model prediction with a deep residual neural network as the main body.

4. The multitask deep learning based non-green-curtain portrait real-time matting algorithm according to claim 1, characterized in that in step 4, the decoder takes upsampling, convolution, an ELU activation function, and fully-connected layer (FC) output as its main structure.

5. The multitask deep learning based non-green-curtain portrait real-time matting algorithm according to claim 4, characterized in that the upsampling is used to recover the feature size of the downsampled images in the encoder, and a SeLU activation function is used, where the hyperparameters λ and α are fixed constants, and the expression of the activation function is as shown in formula (2):

SeLU(x) = λ·x, if x > 0;  λ·α·(e^x − 1), if x ≤ 0    (2)

6. The multitask deep learning based green-curtain-free portrait real-time matting algorithm according to claim 1, wherein in step 4, the loss function for training and optimizing the portrait Alpha mask matting network is constructed as follows:

4.1) the Alpha mask prediction error, as shown in formula (3):

Loss_alp = √((Alpha_pre − Alpha_gro)² + ε²)    (3)

where Loss_alp represents the Alpha mask prediction error, Alpha_pre and Alpha_gro are the predicted and ground-truth Alpha mask values respectively, and ε is a very small constant;

4.2) the image composition error, as shown in formula (4):

Loss_com = √((c_pre − c_gro)² + ε²)    (4)

where Loss_com represents the image composition error, c_pre and c_gro are the predicted and ground-truth Alpha composite images respectively, and ε is a very small constant;

4.3) the overall loss function is the weighted sum of the Alpha mask prediction error and the image composition error, as shown in formula (5):

Loss_overall = ω1·Loss_alp + ω2·Loss_com,  ω1 + ω2 = 1    (5);

where Loss_overall represents the overall loss function, and ω1 and ω2 are the weights of the Alpha mask prediction error Loss_alp and the image composition error Loss_com respectively.

7. The multitask deep learning based green-curtain-free portrait real-time matting algorithm according to claim 1, wherein in step 5, the portrait foreground expansion candidate frame ROI Box and the portrait trimap ternary map within the candidate frame are output, specifically including:

5.1) the portrait foreground expansion candidate frame judgment standard RIOU improves the original judgment basis; the improved judgment standard RIOU is shown in formula (7):

where ROI_edge is the minimum bounding rectangle candidate frame that can wrap both ROI_p and ROI_g, [·] denotes the candidate frame area, ROI_p represents the predicted value of the portrait foreground candidate frame, and ROI_g represents the true value of the portrait foreground candidate frame;

5.2) for the human body foreground/background binary classification result, noise is first removed with an erosion algorithm, a clear edge contour is then generated with a dilation algorithm, and the portrait ternary map is finally obtained, as shown in formula (8):

trimap_i = 1, if f(pixel_i);  0, if b(pixel_i);  0.5, otherwise    (8)

where the foreground indicator f(pixel_i) means that the i-th pixel pixel_i belongs to the foreground, the background indicator b(pixel_i) means that pixel_i belongs to the background, the remaining case covers pixels that cannot be confirmed as foreground or background, and trimap_i represents the Alpha mask channel value of pixel_i.

Background

In recent years, with the rapid development of the internet information age, digital content has become ubiquitous in daily life. Among this huge volume of digital content, digital image information, which includes images and videos, has the advantages of intuitive information transmission and rich, diverse content forms, and is gradually becoming an important carrier for information dissemination. However, editing and processing digital image information is complex and difficult, the related industries have certain entry thresholds, and practitioners often consume a large amount of manpower and time to create content. There is therefore a growing need for efficient and easily accessible means of content production. Digital image matting technology is one of the key research topics in digital image editing and processing.

Digital image matting mainly aims to separate the foreground and background of an image or a video, so as to achieve high-precision foreground extraction and virtual background replacement. Portrait matting is the main application field of digital image matting; it emerged in the middle of the twentieth century alongside the production requirements of the movie industry. Using portrait matting, an actor's figure can be extracted during early-stage film special-effect work and composited with a virtual scene background. After decades of industrial and technological development, film and television special effects that comprehensively utilize digital image matting can reduce content production costs and ensure the safety of participants while providing audiences with a gripping viewing experience, and image matting has become an irreplaceable part of film and television production.

In early studies, digital portrait matting techniques required users to provide prior background knowledge. Traditional film and television production usually adopts a solid-color green or blue curtain, whose color differs greatly from human skin and clothing, as the shooting background, and completes portrait matting by contrasting the pixel differences between the subject and the background. However, setting up a professional green curtain background demands considerable expertise and strictly controlled lighting conditions, so it is difficult for general users to use green curtain technology at low cost. With the rapid development of the digital era, public demand for digital portrait matting has expanded to scenes such as picture editing and network conferences, to meet needs ranging from entertainment to privacy protection. Research on digital portrait matting has been ongoing for decades and has attracted considerable attention. However, existing algorithms mainly suffer from three types of defects. First, some approaches require a portrait ternary map annotated through human interaction, and constructing the ternary map consumes a great deal of manpower and time. Second, most algorithms are time-consuming, process few image frames per second, and cannot achieve real-time portrait matting. Finally, existing fast portrait matting algorithms generally require both a photo of the scene containing the shot subject and a photo of the same background without the subject, which limits the usage scenarios of such algorithms.

Disclosure of Invention

The invention provides a non-green-curtain portrait real-time matting algorithm based on multi-task deep learning, aiming at the defects of the prior art and the technical problem of digital image matting.

The invention provides a non-green-curtain portrait real-time matting algorithm based on multitask deep learning, which addresses key technologies in the portrait matting process under complex natural environments, such as human body target detection, ternary map generation, and portrait Alpha mask matting, and realizes a threshold-free, real-time, automatic portrait matting function in the absence of professional green curtain equipment. The invention can be applied in applications such as network conferences and photography editing, providing convenient digital portrait matting services for general users.

The purpose of the invention is realized by the following technical scheme:

a green curtain-free portrait real-time matting algorithm based on multitask deep learning comprises the following steps:

step 1: performing two-classification adjustment on an original multi-classification multi-target detection data set, inputting an adjusted data set image or video file (namely inputting an image or video containing portrait information), and performing corresponding data preprocessing on the image or video to obtain preprocessed data of an original input file;

step 2: adopting encoder-logistic regression (encoder-logistic) to construct a deep learning network for human body target detection, inputting the preprocessing data obtained in the step 1, constructing a loss function, training and optimizing the deep learning network for human body target detection, and obtaining a human body target detection model;

step 3: extracting feature maps from the encoder of the human body target detection model in step 2, performing feature splicing and fusing multi-scale image features to form the encoder of the portrait Alpha mask matting network, and realizing an encoder-sharing structure between human body target detection and the portrait Alpha mask matting network;

step 4: constructing a decoder of the portrait Alpha mask matting network, forming, together with the encoder-sharing structure of step 3, an end-to-end encoder-decoder portrait Alpha mask matting network structure, and constructing a loss function for training and optimizing the portrait Alpha mask matting network, taking an image containing human body information and a ternary map as input;

step 5: inputting the preprocessed data obtained in step 1 into the network trained in step 4, and outputting a portrait foreground candidate frame ROI Box and the portrait trimap ternary map within the candidate frame through the logistic regression of the human body target detection model in step 2;

step 6: inputting the portrait foreground candidate frame ROI Box and the portrait trimap ternary map from step 5 into the portrait Alpha mask matting network constructed in step 4, and finally obtaining the portrait Alpha mask prediction result.

In step 1, the two-classification adjustment modifies the original 80-class multi-classification data set COCO-80 into a 'human body/other' two-classification and supplements the data set accordingly. By abandoning the task of identifying other object types, fine-tuning improves the accuracy of the subsequent network model for human body recognition.
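A minimal illustrative sketch of this adjustment follows, assuming a standard COCO-style annotation JSON; the helper name and file paths are hypothetical, not part of the invention:

```python
# Sketch of the "human body/other" two-classification adjustment of COCO-80.
import json

PERSON_CATEGORY_ID = 1  # in COCO, category id 1 is "person"

def to_binary_dataset(coco_json_in: str, coco_json_out: str) -> None:
    """Relabel every annotation as class 0 ("person") or class 1 ("other")."""
    with open(coco_json_in) as f:
        coco = json.load(f)
    for ann in coco["annotations"]:
        ann["category_id"] = 0 if ann["category_id"] == PERSON_CATEGORY_ID else 1
    coco["categories"] = [{"id": 0, "name": "person"}, {"id": 1, "name": "other"}]
    with open(coco_json_out, "w") as f:
        json.dump(coco, f)

# Example (paths hypothetical):
# to_binary_dataset("instances_train2017.json", "instances_train2017_binary.json")
```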

In step 1, the data preprocessing includes video frame processing and input image resizing:

the video frame processing comprises the following steps:

video framing, namely converting the video into frame images through ffmpeg; the processed video file is then treated as image files in subsequent work using the same method. Specifically, the frame images are stored in the project directory with the original video number as the folder name and all frame images as image files under that folder;
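A hedged sketch of this framing step, assuming the ffmpeg command-line tool is installed; the frame-name pattern and directory layout are illustrative assumptions:

```python
# ffmpeg writes one image file per frame into a folder named after the video.
import subprocess
from pathlib import Path

def video_to_frames(video_path: str, project_dir: str = "frames") -> Path:
    out_dir = Path(project_dir) / Path(video_path).stem  # video number as the folder name
    out_dir.mkdir(parents=True, exist_ok=True)
    # One PNG per frame; subsequent work then treats the video as image files.
    subprocess.run(["ffmpeg", "-i", video_path, str(out_dir / "%06d.png")], check=True)
    return out_dir
```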

said input image resizing comprises:

resizing the input image, namely unifying the sizes of different input images by cutting and filling while keeping the network feature map size consistent with the original image. Specifically, a scaling coefficient is calculated with the longest edge of the original image as the reference edge, the image is compressed proportionally until the longest edge matches the input standard specified by the subsequent network, and the vacant short-edge content is then filled with a gray background.
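A minimal sketch of this resize-and-pad preprocessing using OpenCV; the 416 × 416 target size and the gray value 128 are assumptions for illustration, since the document does not specify the network's input standard:

```python
# Scale by the longest (reference) edge, then pad the short edge with gray.
import cv2
import numpy as np

def resize_with_padding(image: np.ndarray, target: int = 416) -> np.ndarray:
    h, w = image.shape[:2]
    scale = target / max(h, w)                   # scaling coefficient from the longest edge
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(image, (new_w, new_h))
    canvas = np.full((target, target, 3), 128, dtype=np.uint8)  # gray background
    top, left = (target - new_h) // 2, (target - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```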

Step 2, inputting the preprocessed data obtained in the step 1, and training and optimizing a human body target detection network (namely a deep learning network for human body target detection) by taking a candidate frame error, a candidate frame confidence error and a human body two-class cross entropy error as loss functions;

the deep learning network for human body target detection is realized by model prediction with a deep residual neural network as the main body;

the model of the depth residual error neural network main body is composed of an encoder part and a logistic regression part, and specifically comprises the following steps:

the encoder portion is a full convolution residual neural network. In the network, residual blocks res _ block with different depths are formed by layer jump connection, and the image containing portrait information is subjected to feature extraction to obtain a feature sequence x. Aiming at the image frame obtained after the processing in the step 1Extracting a characteristic sequence with the length of TVtRepresenting the t-th image frame, xtRepresenting a sequence of features of the t-th image frame.

The feature extraction comprises the following steps:

the method comprises the steps of utilizing a deep learning technology to conduct a cognitive process of an original image or a frame image after video preprocessing, and converting the image into a feature sequence which can be identified by a computer.

The logistic regression part performs multi-scale detection on the candidate frame center position (x_i, y_i), the candidate frame length and width (w_i, h_i), the candidate frame confidence C_i, the in-frame object classification p_i(c), c ∈ classes, and the human body foreground f(pixel_i) and background b(pixel_i) classification results, where classes is the set of all classes in the training samples and pixel_i is the i-th pixel in the candidate frame.

In step 3, feature maps are extracted from the encoder of the human body target detection model of step 2 at three different scales (large, medium, and small), and the multi-scale image features are spliced and fused to form the encoder of the portrait Alpha mask matting network, realizing the encoder-sharing structure of human body target detection and the portrait Alpha mask matting network.

In step 3, the deep residual neural network constructed in step 2 is accessed in the forward direction, and the outputs of the residual blocks res_block with downsampling multiples of 8, 16, and 32 are obtained respectively. The outputs are passed through a 3 × 3 convolution kernel conv and a 1 × 1 convolution kernel conv and then spliced to form a large/medium/small multi-scale fused image feature structure, which serves as the encoder of the portrait Alpha mask matting network, realizing the encoder-sharing structure of human body target detection and the portrait Alpha mask matting network.

The encoder-sharing structure of human body target detection and the portrait Alpha mask matting network in step 3 specifically includes:

3.1) the full convolution deep residual neural network is accessed in the forward direction to obtain the outputs of the residual blocks res_block with downsampling multiples of 8, 16, and 32 respectively. Downsampling is performed with convolution kernels of stride 2; let core_8, core_16, core_32 be the corresponding convolution kernels in the downsampling process, with kernel size x, y. If the input size is m, n, the output size is m/2, n/2, and the convolution formula for the output is shown in formula (1), where fun(·) is the activation function and β is the bias:

output_{m/2,n/2} = fun(∑∑ input_{m,n} · core_{x,y} + β)    (1)

3.2) the corresponding outputs are fused and spliced to form a large/medium/small multi-scale fused image feature structure, which serves as the encoder of the portrait Alpha mask matting network, realizing the encoder-sharing structure of the portrait Alpha mask matting network and human body target detection.
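A sketch of this shared-encoder fusion under stated assumptions: the 8×/16×/32× res_block outputs each pass through a 3 × 3 convolution (enlarging the receptive field) and a 1 × 1 convolution (reducing channel dimension), are brought onto a common grid, and are spliced; the channel numbers and the choice of the 8× grid as the fusion target are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionEncoderHead(nn.Module):
    def __init__(self, c8: int, c16: int, c32: int, out_c: int = 128):
        super().__init__()
        def branch(c_in: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(c_in, c_in, kernel_size=3, padding=1),  # 3x3: receptive field
                nn.Conv2d(c_in, out_c, kernel_size=1),            # 1x1: channel reduction
            )
        self.b8, self.b16, self.b32 = branch(c8), branch(c16), branch(c32)

    def forward(self, f8, f16, f32):
        size = f8.shape[-2:]                              # fuse on the 8x-downsampled grid
        y16 = F.interpolate(self.b16(f16), size=size, mode="nearest")
        y32 = F.interpolate(self.b32(f32), size=size, mode="nearest")
        return torch.cat([self.b8(f8), y16, y32], dim=1)  # spliced multi-scale features
```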

In step 4, the decoder takes upsampling, convolution, an ELU activation function, and fully-connected layer (FC) output as its main structure; taking an image containing human body information and a ternary map as input, it constructs a network loss function centered on both the Alpha mask prediction error and the image composition error, and trains and optimizes the portrait Alpha mask matting network.

The upsampling is used to restore the feature size of the images downsampled in the encoder. A SeLU activation function is adopted, where the hyperparameters λ and α are fixed constants, and the expression of the activation function is shown in formula (2):

SeLU(x) = λ·x, if x > 0;  λ·α·(e^x − 1), if x ≤ 0    (2)
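A self-contained check of formula (2); the numeric values of the fixed constants λ and α below are the standard published SELU values and are an assumption here, since the document only states that they are fixed:

```python
import math

LAMBDA = 1.0507009873554805  # assumed standard SELU lambda
ALPHA = 1.6732632423543772   # assumed standard SELU alpha

def selu(x: float) -> float:
    # Formula (2): lambda*x for x > 0, lambda*alpha*(e^x - 1) for x <= 0.
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)

# selu(1.0) ≈ 1.0507, selu(-1.0) ≈ -1.1113
```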

in step 4, constructing an image Alpha mask matting network loss function, specifically comprising:

4.1) the Alpha mask prediction error, as shown in formula (3):

Loss_alp = √((Alpha_pre − Alpha_gro)² + ε²)    (3)

where Alpha_pre and Alpha_gro are the predicted and ground-truth Alpha mask values respectively, and ε is a very small constant.

4.2) the image composition error, as shown in formula (4):

Loss_com = √((c_pre − c_gro)² + ε²)    (4)

where c_pre and c_gro are the predicted and ground-truth Alpha composite images respectively, and ε is a very small constant.

4.3) the overall loss function is the weighted sum of the Alpha mask prediction error and the image composition error, as shown in formula (5):

Loss_overall = ω1·Loss_alp + ω2·Loss_com,  ω1 + ω2 = 1    (5)
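A sketch of formulas (3) through (5) in PyTorch, assuming the ε-smoothed form written above; the reduction by mean and the equal weights ω1 = ω2 = 0.5 are assumptions, since the document only requires ω1 + ω2 = 1:

```python
import torch

EPS = 1e-6  # the very small constant epsilon

def alpha_loss(alpha_pre: torch.Tensor, alpha_gro: torch.Tensor) -> torch.Tensor:
    return torch.sqrt((alpha_pre - alpha_gro) ** 2 + EPS ** 2).mean()   # formula (3)

def composition_loss(c_pre: torch.Tensor, c_gro: torch.Tensor) -> torch.Tensor:
    return torch.sqrt((c_pre - c_gro) ** 2 + EPS ** 2).mean()           # formula (4)

def overall_loss(alpha_pre, alpha_gro, c_pre, c_gro, w1: float = 0.5, w2: float = 0.5):
    assert abs(w1 + w2 - 1.0) < 1e-9                                    # formula (5) constraint
    return w1 * alpha_loss(alpha_pre, alpha_gro) + w2 * composition_loss(c_pre, c_gro)
```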

and 5, inputting the image preprocessing data obtained in the step 1 to the trained human body target detection network model, and predicting to obtain a portrait foreground expansion candidate frame ROI Box and a portrait ternary map trimap in the expansion candidate frame after logistic regression.

The portrait foreground expansion candidate frame ROI Box performs edge expansion on the basis of the ordinary target recognition candidate frame, solving the problem of fine human body edges being placed outside the candidate frame during target detection. The portrait ternary map within the expansion candidate frame is obtained by eroding and dilating the human body foreground/background binary result associated with the two-class cross-entropy term of the step-2 loss function.

In the step 5, the output portrait foreground expansion candidate frame ROI Box and the portrait trimap ternary map in the candidate frame specifically include:

and 5.1) the portrait foreground expansion candidate frame judgment standard RIOU improves the original judgment basis. In order to make the candidate frame have stronger inclusion capability and avoid the problem that the human body subtle edge is placed outside the candidate frame in the target detection process, the improved judgment standard RIOU is shown as formula (7):

where ROI_edge is the minimum bounding rectangle candidate frame that can wrap both ROI_p and ROI_g, [·] denotes the candidate frame area, ROI_p represents the predicted value of the portrait foreground candidate frame, and ROI_g represents the true value of the portrait foreground candidate frame;

and 5.2) for the human body front/background classification results, firstly removing noise by adopting a corrosion algorithm, and then generating a clear edge profile by adopting an expansion algorithm. And (3) obtaining a portrait ternary map, as shown in formula (8):

wherein the foreground f (pixel)i) And background b (pixel)i) Representing the ith pixeliBelonging to foreground or background, trimapiRepresenting the ith pixeliAlpha mask channel value of (1), otherwise indicates a case where the pixel cannot be confirmed to belong to the front/back scene.
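A hedged OpenCV sketch of this erosion-then-dilation trimap generation; the kernel size, iteration counts, and the 0.5 marker for the unknown band are illustrative assumptions:

```python
import cv2
import numpy as np

def make_trimap(mask: np.ndarray, kernel_size: int = 5, iters: int = 3) -> np.ndarray:
    """mask: uint8 binary map, 1 = person foreground, 0 = background."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    eroded = cv2.erode(mask, kernel, iterations=iters)    # erosion removes noise
    dilated = cv2.dilate(mask, kernel, iterations=iters)  # dilation bounds a clear contour band
    trimap = np.full(mask.shape, 0.5, dtype=np.float32)   # unconfirmed front/back region
    trimap[dilated == 0] = 0.0                            # outside dilation: background b(pixel_i)
    trimap[eroded == 1] = 1.0                             # inside erosion: foreground f(pixel_i)
    return trimap
```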

In step 6, the original portrait foreground expansion candidate frame ROI Box of step 5 undergoes feature mapping and is then input, together with the portrait ternary map trimap within the expansion candidate frame, to the portrait Alpha mask matting network model, which reduces the convolution calculation scale and accelerates network computation. After upsampling in the decoder restores the original image resolution, the portrait Alpha mask prediction result is obtained through the fully-connected layer FC output, finally completing the portrait matting task as a whole.

The method performs two-classification adjustment on the original data set, takes an image or video containing portrait information as input, and obtains preprocessed network input data through video framing and input image resizing; it constructs a human body target detection deep learning network, extracts image features through a deep residual neural network, and obtains the portrait foreground expansion candidate frame ROI Box and the portrait ternary map trimap within the expansion candidate frame by logistic regression; it constructs a portrait Alpha mask matting deep learning network, effectively accelerates the network's computation through an encoder-sharing mechanism, and outputs the portrait foreground Alpha mask prediction result end to end to achieve the portrait matting effect. The method successfully removes the green curtain constraint from the portrait matting process: only the original image or video needs to be provided, with no manually annotated portrait ternary map, which greatly benefits users. Finally, the encoder-sharing mechanism proposed by the invention accelerates the task computation, provides a real-time portrait matting effect at high-definition image quality, and meets users' needs in a variety of scenarios.

Compared with the prior art, the invention has the following advantages:

the invention relates to a non-green-curtain portrait real-time matting algorithm based on multitask deep learning, which realizes a threshold-free real-time automatic portrait automatic matting function under the condition of lacking of professional green curtain equipment by surrounding key technologies such as human body target detection, ternary diagram generation, portrait Alpha mask matting and the like in the portrait matting process under a complex natural environment. The algorithm solves the limitation of the traditional digital image matting technology on equipment and sites, is applied to application programs such as network conferences, photography editing and the like, and provides real-time and convenient digital image matting service for general users. The innovation of the invention is embodied in the following aspects:

1) the invention innovatively modifies and supplements the traditional multi-classification multi-target detection data set COCO-80 to form a dedicated 'human body/other' binary data set. While notably reducing the difficulty of constructing training samples, fine-tuning improves the accuracy of the subsequent network model for human body recognition;

2) the invention innovatively proposes a new target detection candidate frame judgment standard RIOU, giving the candidate frame stronger inclusion capability and avoiding the problem of fine human body edges falling outside the candidate frame during target detection;

3) the invention innovatively provides an encoder sharing mechanism of a human body target detection network and a portrait Alpha mask matting network, greatly reduces the time consumption of an algorithm in an image feature identification process, and realizes high-definition real-time portrait matting.

Drawings

FIG. 1 is a schematic diagram of a network structure of a green-curtain-free real-time portrait matting algorithm based on multitask deep learning according to the present invention;

FIG. 2 is a schematic diagram of a multi-classification raw data set two-classification process according to the present invention;

FIG. 3 is a schematic diagram of a human target detection task flow of the algorithm of the present invention;

FIG. 4 is a schematic diagram of the portrait Alpha mask matting task flow of the algorithm according to the present invention;

FIG. 5 is a schematic overall flow chart of the algorithm of the present invention;

Detailed Description

The following further describes the real-time matting algorithm for the non-green-curtain portrait based on the multitask deep learning with reference to the accompanying drawings.

A green curtain-free portrait real-time matting algorithm based on multitask deep learning comprises the following steps:

step 1: improving an original data set, inputting an improved data set image or video file, and performing corresponding data preprocessing on the image or video to obtain preprocessed data of the original input file;

in step 1, the raw data set improvement and data preprocessing specifically include:

1.1) two-classification adjustment and supplement of a multi-classification multi-target detection data set, wherein the two-classification adjustment modifies 80 object multi-classification original data sets COCO-80 into two classifications of 'human body/other', and supplements the data set according to the standard;

1.2) video frame processing, namely converting a video into a frame image through ffmpeg, and processing a processed video file as an image file in subsequent work by adopting the same method;

1.3) resizing the input image, unifying the sizes of different input images in a cutting and filling mode, and keeping the size of the network characteristic graph consistent with that of the original image.

Step 2: an encoder-logistic regression (encoder-logistic) is adopted to construct a deep learning network for human body target detection. Inputting the preprocessing data obtained in the step 1, constructing a loss function, and training and optimizing a human body target detection network;

the human body target detection deep learning network specifically comprises:

2.1) the encoder part is a full convolution residual neural network. In the network, residual blocks res_block of different depths are formed by layer-jump (skip) connections, and feature extraction is performed on the image containing portrait information to obtain a feature sequence;

2.2) constructing the loss function, adding the human body two-class cross-entropy error as an additional loss term on top of the general target detection task;

2.3) the logistic regression part performs multi-scale detection on the candidate frame center position (x_i, y_i), the candidate frame length and width (w_i, h_i), the candidate frame confidence C_i, and the in-frame object classification p_i(c), c ∈ classes, where classes is the set of all classes in the training samples, specifically [class0: person, class1: others], and pixel_i is the i-th pixel in the candidate frame.

step 3: fusing multi-scale image features to form the encoder of the portrait Alpha mask matting network, and realizing the encoder-sharing structure of human body target detection and the portrait Alpha mask matting network;

human target detection and portrait Alpha mask keying network's multiscale encoder sharing structure specifically includes:

3.1) the full convolution deep residual neural network is accessed in the forward direction to obtain the outputs of the residual blocks res_block with downsampling multiples of 8, 16, and 32 respectively. Downsampling is performed with convolution kernels of stride 2; let core_8, core_16, core_32 be the corresponding convolution kernels in the downsampling process, with kernel size x, y. If the input size is m, n, the output size is m/2, n/2, and the convolution calculation formula for the output is shown in formula (1), where fun(·) is the activation function and β is the bias:

output_{m/2,n/2} = fun(∑∑ input_{m,n} · core_{x,y} + β)    (1)

3.2) the corresponding outputs are fused and spliced to form a large/medium/small multi-scale fused image feature structure, which serves as the encoder of the portrait Alpha mask matting network, realizing the encoder-sharing structure of the portrait Alpha mask matting network and human body target detection.

step 4: constructing the decoder of the portrait Alpha mask matting network and combining it with the shared encoder of step 3 to form an end-to-end encoder-decoder portrait Alpha mask matting network structure; constructing a loss function with an image containing human body information and a ternary map as input, and training and optimizing the portrait Alpha mask matting network;

the human image Alpha mask matting network decoder takes up sampling, convolution, ELU activation function and full connection layer FC output as a main structure, and specifically comprises the following steps:

4.1) upsampling is realized through an unpooling operation, restoring the feature size of the images downsampled in the encoder;

and 4.2) adopting a SeLU activation function to enable partial neuron output in the deep learning network to be set to be 0, so as to form a sparse network structure. Wherein, the hyper-parameter λ, α of the SeLU activation function is a fixed constant, and the expression of the activation function is shown as formula (2):

constructing an image Alpha mask matting network loss function, which specifically comprises the following steps:

4.3) the Alpha mask prediction error, as shown in formula (3):

Loss_alp = √((Alpha_pre − Alpha_gro)² + ε²)    (3)

where Alpha_pre and Alpha_gro are the predicted and ground-truth Alpha mask values respectively, and ε is a very small constant;

4.4) the image composition error, as shown in formula (4):

Loss_com = √((c_pre − c_gro)² + ε²)    (4)

where c_pre and c_gro are the predicted and ground-truth Alpha composite images respectively;

4.5) the overall loss function is the weighted sum of the Alpha mask prediction error and the image composition error, as shown in formula (5):

Loss_overall = ω1·Loss_alp + ω2·Loss_com,  ω1 + ω2 = 1    (5)

and 5, step 5: inputting the image preprocessing data obtained in the step 1 into the trained network, and outputting a portrait foreground expansion candidate frame ROI Box and a portrait trimap ternary map in the candidate frame through the human body target detection network logistic regression in the step 2;

the output portrait foreground expansion candidate frame ROI Box and the portrait trimap ternary map in the candidate frame specifically comprise:

and 5.1) expanding the candidate frame judgment standard RIOU by the portrait foreground, and changing the original judgment basis. In order to make the candidate frame have stronger inclusion capability and avoid the problem that the human body subtle edge is placed outside the candidate frame in the target detection process, the improved judgment standard RIOU is shown as formula (7):

where ROI_edge is the minimum bounding rectangle candidate frame that can wrap both ROI_p and ROI_g, and [·] denotes the candidate frame area;

and 5.2) for the human body front/background classification results, firstly removing noise by adopting a corrosion algorithm, and then generating a clear edge profile by adopting an expansion algorithm. And (3) obtaining a portrait ternary map, as shown in formula (8):

wherein the foreground f (pixel)i) And background b (pixel)i) Representing the ith pixeliBelonging to foreground or background, trimapiRepresenting the ith pixeliAlpha mask channel values of (1).

step 6: inputting the portrait foreground candidate frame ROI Box and the portrait trimap ternary map from step 5 into the portrait Alpha mask matting network constructed in step 4, and finally obtaining the portrait Alpha mask prediction result.

More specifically, the green-curtain-free real-time portrait matting algorithm based on multitask deep learning is divided into two algorithm tasks: the first part is the human body target detection task, and the second part is the portrait foreground Alpha mask matting task, specifically as follows:

in step 1, data pre-processing including video frame processing and input image resizing is performed:

the video frame processing comprises the following steps:

converting the video into frame images through ffmpeg, storing them in the project directory with the original video number as the folder name and all frame images as image files under that folder; the processed video file is treated as image files in subsequent work using the same method;

the input image resizing comprises:

unifying the sizes of different input images: a scaling coefficient is calculated with the longest edge of the original image as the reference edge, the image is compressed proportionally until the longest edge matches the input standard specified by the subsequent network, and the vacant short-edge content is filled with a gray background by padding, keeping the network feature map size consistent with the original image. This avoids abnormal network output values caused by input image dimension errors.

As shown in FIG. 2, the original 80-class multi-classification data set COCO-80 is modified into a 'human body/other' two-classification through classification adjustment, and the data set is supplemented accordingly. By abandoning the task of identifying other object types, fine-tuning improves the accuracy of the subsequent network model for human body recognition.

As shown in fig. 3, the human body target detection deep learning network of the first partial task of the whole network is realized by model prediction with a deep residual neural network as the main body. The deep residual neural network model is composed of an encoder part and a logistic regression part, specifically:

step 1: the encoder portion is a full convolution residual neural network. In the network, residual blocks res _ block with different depths are formed by layer jump connection, and the image containing portrait information is subjected to feature extraction to obtain a feature sequence x. Aiming at the image frame obtained after processingExtracting a characteristic sequence with the length of TVtRepresenting the t-th image frame, xtRepresenting a sequence of features of the t-th image frame.

The feature extraction comprises the following steps:

the method comprises the steps of utilizing a deep learning technology to conduct a cognitive process of an original image or a frame image after video preprocessing, and converting the image into a feature sequence which can be identified by a computer.

Step 2: the logistic regression part is a function of the candidate box center position (x)i,yi) The frame candidate length width (w)i,hi) Candidate frame confidence CiCandidate in-frame object classification pi(c) C ∈ classes, and human foreground f (pixel)i) And background b (pixel)i) And (5) carrying out multi-scale detection on the classification result. Wherein the classes are all classes in the training sample, and are specifically class0: person, class1: others],pixeliAnd the ith pixel point in the candidate frame is obtained.

As shown in fig. 4, the portrait Alpha mask matting network of the second partial task of the whole network is composed of the shared encoder and the portrait Alpha mask matting decoder, and is specifically implemented as follows:

step 1: and forward accessing the depth residual error neural network to obtain the outputs of residual error blocks res _ block with the down-sampling multiples of 8 times, 16 times and 32 times respectively. In order to reduce the negative effect of gradient caused by pooling in the down-sampling process, a convolution kernel with the step length stride of 2 is adopted. Setting core8,core16,core32The convolution kernel in the corresponding down-sampling process is the channel number channel _ n and the corresponding input8,input16,input32Equal, convolution kernel size is x, y. If input size is m, n, output size is m/2, n/2, and the convolution formula corresponding to the output is shown in formula (1), where fun (·) is an activation function, β is a bias quantity:

output_{m/2,n/2} = fun(∑∑ input_{m,n} · core_{x,y} + β)    (1)

step 2: the corresponding output is respectively passed through 3 × 3 convolution kernels conv33 × 3 to enlarge the characteristic map receptive field, and the local context information of the image characteristic is increased. The characteristic channel dimension is then reduced by a convolution kernel conv1 of 1 x 1. The image characteristic structure fused in large, medium and small multi-scale is formed by fusion splicing and serves as an encoder of the portrait Alpha mask matting network, and the human body target detection and encoder sharing structure of the portrait Alpha mask matting network is achieved.

step 3: the decoder takes upsampling, convolution, an ELU activation function, and fully-connected layer (FC) output as its main structure. Taking an image containing human body information and a ternary map as input, a network loss function centered on both the Alpha mask prediction error and the image composition error is constructed, and the portrait Alpha mask matting network is trained and optimized.

The upsampling is realized through an unpooling operation: each value in the input image features is mapped and filled into the corresponding region of the output upsampled image features, and the blank area left after upsampling is filled with the same value, thereby restoring the size of the image features downsampled in the encoder.
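A minimal sketch of this value-replicating upsampling, which behaves like nearest-neighbor interpolation: each input value fills its corresponding output region, and the blank area receives the same value. The factor of 2, matching one stride-2 downsampling step, is an assumption:

```python
import torch
import torch.nn.functional as F

def upsample_features(features: torch.Tensor, factor: int = 2) -> torch.Tensor:
    # features: (N, C, H, W) -> (N, C, H * factor, W * factor); each value is
    # replicated across its factor x factor region of the output.
    return F.interpolate(features, scale_factor=factor, mode="nearest")
```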

The SeLU activation function is adopted so that part of the neuron outputs in the deep learning network are set to 0, forming a sparse network structure, which effectively reduces overfitting of the matting network and avoids the vanishing-gradient problem of the traditional sigmoid activation function during backpropagation. The hyperparameters λ and α of the SeLU activation function are fixed constants, and the expression of the activation function is shown in formula (2):

SeLU(x) = λ·x, if x > 0;  λ·α·(e^x − 1), if x ≤ 0    (2)

the Alpha mask prediction error is shown in formula (3):

wherein alpha ispregroThe predicted and true Alpha mask values, respectively, ε is a very small constant.

The image composition error is shown in formula (4):

Loss_com = √((c_pre − c_gro)² + ε²)    (4)

where c_pre and c_gro are the predicted and ground-truth Alpha composite images respectively, and ε is a very small constant.

The final overall loss function is the weighted sum of the Alpha mask prediction error and the image composition error, as shown in formula (5):

Loss_overall = ω1·Loss_alp + ω2·Loss_com,  ω1 + ω2 = 1    (5)

As shown in FIG. 5, after the training of the algorithm provided by the invention is completed, the portrait matting inference process can be performed in real time.

Step 1: inputting image preprocessing data to a trained human body target detection network model, and predicting to obtain a portrait foreground expansion candidate frame ROI Box and a portrait ternary map trimap in the expansion candidate frame after logistic regression.

General target recognition screens and judges candidate frames using the intersection-over-union ratio IOU as the standard, as shown in formula (6), where ROI_p and ROI_g are the predicted and true candidate frames respectively:

IOU = [ROI_p ∩ ROI_g] / [ROI_p ∪ ROI_g]    (6)

the invention provides an improved portrait foreground extension candidate frame judgment standard RIOU, in order to enable a candidate frame to have stronger inclusion capability and avoid the problem that human body fine edges are arranged outside the candidate frame in the target detection process, the improved judgment standard RIOU is as shown in a formula (7):

where ROI_edge is the minimum bounding rectangle candidate frame that can wrap both ROI_p and ROI_g, and [·] denotes the candidate frame area.
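Helper sketches for these judgment standards: iou() implements the cross-over ratio of formula (6), and roi_edge() builds the minimum bounding rectangle ROI_edge named above. Since the body of formula (7) is not reproduced in the text, the GIoU-style combination in riou() is one plausible reading and an assumption, not the patented formula:

```python
# Boxes are (x1, y1, x2, y2) tuples.

def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def iou(roi_p, roi_g):
    inter = area(intersection(roi_p, roi_g))
    union = area(roi_p) + area(roi_g) - inter
    return inter / union if union > 0 else 0.0          # formula (6)

def roi_edge(roi_p, roi_g):
    """Minimum bounding rectangle candidate frame wrapping both ROI_p and ROI_g."""
    return (min(roi_p[0], roi_g[0]), min(roi_p[1], roi_g[1]),
            max(roi_p[2], roi_g[2]), max(roi_p[3], roi_g[3]))

def riou(roi_p, roi_g):
    inter = area(intersection(roi_p, roi_g))
    union = area(roi_p) + area(roi_g) - inter
    edge = area(roi_edge(roi_p, roi_g))
    # Assumed GIoU-style reading: penalize the enclosing rectangle's slack.
    return iou(roi_p, roi_g) - ((edge - union) / edge if edge > 0 else 0.0)
```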

Step 2: for the human body front/background classification result, firstly, the corrosion algorithm is adopted to remove noise, and then, the expansion algorithm is used to generate a clear edge profile. And (3) obtaining a portrait ternary map, as shown in formula (8):

wherein the foreground f (pixel)i) And background b (pixel)i) Representing the ith pixeliBelonging to foreground or background, trimapiRepresenting the ith pixeliAlpha mask channel values of (1).

step 3: after feature mapping, the original portrait foreground expansion candidate frame ROI Box of step 2 is input, together with the portrait ternary map trimap within the expansion candidate frame, to the portrait Alpha mask matting network model, which reduces the convolution calculation scale and accelerates network computation. After upsampling in the decoder restores the original image resolution, the fully-connected layer FC outputs the portrait Alpha mask prediction result Alpha. Combined with the original input image, the portrait matting task is finally completed through foreground extraction, as shown in formula (9), where I is the input image, F is the portrait foreground, and B is the background image:

I=αF+(1-α)B (9)
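A sketch of applying formula (9) for background replacement. Treating the input image I as the foreground estimate F is a common simplification and an assumption here; shapes are assumed to be float images of shape (H, W, 3) with alpha in [0, 1] of shape (H, W, 1):

```python
import numpy as np

def replace_background(image: np.ndarray, alpha: np.ndarray,
                       new_background: np.ndarray) -> np.ndarray:
    # composite = alpha * F + (1 - alpha) * B, per formula (9)
    return alpha * image + (1.0 - alpha) * new_background
```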

the foregoing is illustrative of the present invention and is not to be construed as limiting thereof. One of ordinary skill in the art would recognize that any variations or modifications would come within the scope of the present invention.
