Semantic segmentation model training method, semantic segmentation device and electronic equipment
1. A semantic segmentation model training method comprises the following steps:
determining a depth map and a two-dimensional segmentation result map of a sample picture based on a depth estimation network and a semantic segmentation network of an initial semantic segmentation model;
determining a 3D point cloud based on the depth map of the sample picture, and performing semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture;
determining a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, and a predetermined loss function, wherein the predetermined loss function comprises a consistency loss function, and the consistency loss function is used for determining consistency loss of the two-dimensional segmentation result map and the three-dimensional segmentation result map;
and updating the model parameters of the initial semantic segmentation model based on the determined loss function values until convergence, so as to obtain a target semantic segmentation model.
2. The method of claim 1, wherein the consistency loss function is:
wherein one term of the function represents the 2D semantic segmentation result corresponding to a pixel point of the sample picture, and the other term represents the 3D semantic segmentation result corresponding to that pixel point.
3. The method of claim 1, wherein the determining a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, and a predetermined loss function comprises:
determining the loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, the depth map, the two-dimensional segmentation label, the three-dimensional segmentation label and the depth information label of the sample picture, and the predetermined loss function; wherein the predetermined loss function further comprises: a two-dimensional loss function, a three-dimensional loss function and a depth loss function.
4. The method of claim 1, wherein the determining a 3D point cloud based on the depth map of the sample picture comprises:
and converting the depth map into a 3D point cloud based on the internal reference of the camera corresponding to the sample picture and the depth map.
5. The method of claim 4, wherein the converting the depth map into a 3D point cloud based on the internal reference of the camera to which the sample picture corresponds and the depth map comprises:
by the following formula:
converting the depth map into a 3D point cloud; wherein $d$ is the depth corresponding to each pixel point $(u, v)$, $K$ is the internal reference of the camera, and $P_c$ is the 3D coordinate $[X_c, Y_c, Z_c]$ in the camera coordinate system.
6. The method of claim 1, wherein performing semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture comprises:
and obtaining a semantic segmentation result of the 3D point cloud based on a preset 3D point cloud semantic segmentation network.
7. A method of semantic segmentation, comprising:
determining a target image to be segmented;
inputting the target image to be segmented into the target semantic segmentation model trained by any one of claims 1 to 6 to obtain a semantic segmentation result of the target image to be segmented.
8. The method of claim 7, wherein the method further comprises: inputting the target image to be segmented into the target semantic segmentation model according to any one of claims 1 to 6 to obtain a depth map of the target image to be segmented.
9. A semantic segmentation model training apparatus, comprising:
the first determining module is used for determining a depth map and a two-dimensional segmentation result map of the sample picture based on a depth estimation network and a semantic segmentation network of the initial semantic segmentation model;
the 3D semantic segmentation module is used for determining a 3D point cloud based on the depth map of the sample picture and performing semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture;
a loss value determination module, configured to determine a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, and a predetermined loss function, where the predetermined loss function includes a consistency loss function, and the consistency loss function is used to determine consistency loss of the two-dimensional segmentation result map and the three-dimensional segmentation result map;
and the updating module is used for updating the model parameters of the initial semantic segmentation model based on the determined loss function value until convergence to obtain the target semantic segmentation model.
10. The apparatus of claim 9, wherein the consistency loss function is:
wherein one term of the function represents the 2D semantic segmentation result corresponding to a pixel point of the sample picture, and the other term represents the 3D semantic segmentation result corresponding to that pixel point.
11. The apparatus according to claim 9, wherein the loss value determining module is specifically configured to determine the loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, the depth map, the two-dimensional segmentation label, the three-dimensional segmentation label and the depth information label of the sample picture, and the predetermined loss function; wherein the predetermined loss function further comprises: a two-dimensional loss function, a three-dimensional loss function and a depth loss function.
12. The apparatus of claim 9, wherein the apparatus further comprises:
and the conversion module is used for converting the depth map into a 3D point cloud based on the internal reference of the camera corresponding to the sample picture and the depth map.
13. The apparatus of claim 12, wherein the conversion module is specifically configured to, by the following formula:
convert the depth map into a 3D point cloud; wherein $d$ is the depth corresponding to each pixel point $(u, v)$, $K$ is the internal reference of the camera, and $P_c$ is the 3D coordinate $[X_c, Y_c, Z_c]$ in the camera coordinate system.
14. The apparatus of claim 9, wherein the 3D semantic segmentation module is specifically configured to obtain the semantic segmentation result of the 3D point cloud based on a predetermined 3D point cloud semantic segmentation network.
15. A semantic segmentation apparatus comprising:
the second determination module is used for determining a target image to be segmented;
an obtaining module, configured to input the target image to be segmented into the target semantic segmentation model trained according to any one of claims 1 to 6, and obtain a semantic segmentation result of the target image to be segmented.
16. The apparatus according to claim 15, wherein the obtaining module is further configured to input the target image to be segmented into the target semantic segmentation model according to any one of claims 1 to 6, and obtain a depth map of the target image to be segmented.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
Background
In the field of computer vision, current applications of neural networks mainly include image recognition, target localization and detection, and semantic segmentation. Image recognition determines what an image shows, target localization and detection determine where a target is in the image, and semantic segmentation answers both questions at the pixel level.
Semantic segmentation of an image literally means that a computer segments the image according to its semantics; for example, when the left image in fig. 1 is input, the computer can output the right image. In speech recognition, semantics refers to the meaning of the speech; in the image field, semantics refers to the content of the image and an understanding of what the picture means. For example, the semantics of the left image is that three people are riding three bicycles. Segmentation means separating the different objects in the picture at the pixel level by labeling each pixel of the original image, for example the person regions and the bicycle regions in the right image of fig. 1.
Disclosure of Invention
The disclosure provides a semantic segmentation model training method, a semantic segmentation device and electronic equipment.
According to a first aspect of the present disclosure, there is provided a semantic segmentation model training method, including:
determining a depth map and a two-dimensional segmentation result map of a sample picture based on a depth estimation network and a semantic segmentation network of an initial semantic segmentation model;
determining a 3D point cloud based on the depth map of the sample picture, and performing semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture;
determining a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map and a predetermined loss function, wherein the predetermined loss function comprises a consistency loss function, and the consistency loss function is used for determining consistency loss of the two-dimensional segmentation result map and the three-dimensional segmentation result map;
and updating the model parameters of the initial semantic segmentation model based on the determined loss function value until convergence, so as to obtain the target semantic segmentation model.
According to a second aspect of the present disclosure, there is provided a semantic segmentation method, including:
determining a target image to be segmented;
and inputting the target image to be segmented into the target semantic segmentation model trained in the first aspect to obtain a semantic segmentation result of the target image to be segmented.
According to a third aspect of the present disclosure, there is provided a semantic segmentation model training apparatus, including:
the first determining module is used for determining a depth map and a two-dimensional segmentation result map of the sample picture based on a depth estimation network and a semantic segmentation network of the initial semantic segmentation model;
the 3D semantic segmentation module is used for determining a 3D point cloud based on the depth map of the sample picture and performing semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture;
a loss value determination module for determining a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, and a predetermined loss function, wherein the predetermined loss function includes a consistency loss function, and the consistency loss function is used for determining consistency loss of the two-dimensional segmentation result map and the three-dimensional segmentation result map;
and the updating module is used for updating the model parameters of the initial semantic segmentation model based on the determined loss function value until convergence, so as to obtain the target semantic segmentation model.
According to a fourth aspect of the present disclosure, there is provided a semantic segmentation apparatus including:
the second determination module is used for determining a target image to be segmented;
and the obtaining module is used for inputting the target image to be segmented into the target semantic segmentation model trained in the first aspect, and obtaining a semantic segmentation result of the target image to be segmented.
According to a fifth aspect of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method.
According to a sixth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the above method.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the above method.
The technical scheme provided by the disclosure has the following beneficial effects:
Compared with the prior art, which mainly improves the accuracy of semantic segmentation through network improvements and multi-modal feature fusion, the scheme provided by the embodiment of the disclosure determines a depth map and a two-dimensional segmentation result map of a sample picture based on a depth estimation network and a semantic segmentation network of an initial semantic segmentation model; then determines a 3D point cloud based on the depth map of the sample picture, and performs semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture; then determines a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map and a predetermined loss function, wherein the predetermined loss function comprises a consistency loss function, and the consistency loss function is used for determining consistency loss of the two-dimensional segmentation result map and the three-dimensional segmentation result map; and finally updates the model parameters of the initial semantic segmentation model based on the determined loss function value until convergence, so as to obtain the target semantic segmentation model. That is, when the semantic segmentation model is trained, a consistency loss function between the 3D semantic segmentation and the 2D semantic segmentation is introduced so that the 3D segmentation result is as consistent as possible with the 2D segmentation result; the 3D information thereby guides the 2D semantic segmentation, and the precision and robustness of the trained model can be improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary graph of semantic segmentation provided in accordance with the present disclosure;
FIG. 2 is a flow diagram of a semantic segmentation model training method provided in accordance with the present disclosure;
FIG. 3 is an exemplary diagram of a semantic segmentation model training method provided in accordance with the present disclosure;
FIG. 4 is a flow diagram of a semantic segmentation method provided in accordance with the present disclosure;
FIG. 5 is a schematic structural diagram of a semantic segmentation model training apparatus provided by the present disclosure;
FIG. 6 is a schematic structural diagram of a semantic segmentation apparatus provided in the present disclosure;
FIG. 7 is a block diagram of an electronic device used to implement an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Example one
An embodiment of the present disclosure provides a semantic segmentation model training method; as shown in fig. 2, the method includes:
step S101, determining a depth map and a two-dimensional segmentation result map of a sample picture based on a depth estimation network and a semantic segmentation network of an initial semantic segmentation model;
specifically, a sample picture is input into an initial semantic segmentation network model, and a depth map and a two-dimensional segmentation result map of the sample picture are determined through a depth estimation network and a semantic segmentation network of the initial semantic segmentation model.
Depth estimation uses an RGB image to estimate the distance of each pixel in the image from the shooting source. When a photograph is taken, a three-dimensional scene is projected onto a two-dimensional plane to form a two-dimensional image; depth estimation aims to recover three-dimensional information from the two-dimensional picture, which is the inverse process. Depth estimation (estimating depth from 2D images) is a key step in scene reconstruction and understanding tasks and is part of 3D reconstruction in the field of computer vision. Monocular depth estimation based on deep learning rests on the premise that the relationships between pixel values reflect depth relationships; the method fits a function that maps an image to a depth map, so obtaining depth from a single picture is equivalent to inferring three-dimensional space from a two-dimensional image. Depth estimation can be divided into monocular depth estimation and multi-view depth estimation.
Specifically, the depth estimation network may be a depth estimation network based on a convolutional neural network, such as the network introduced in the paper "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network", or may be another network capable of implementing the functions of the present application.
Semantic segmentation of an image assigns a semantic category to each pixel in the input image to obtain a pixel-wise dense classification.
Specifically, the semantic segmentation network may be a network based on Fully Convolutional Networks (FCN), SegNet, E-Net, Link-Net or Mask R-CNN, or may be another network capable of implementing the functions of the present application, which is not limited herein.
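As an illustration of pixel-wise dense classification, the following minimal sketch assigns each pixel the class with the highest score. The function name `argmax_segmentation` and the `scores[class][row][column]` layout are assumptions made for this example and are not prescribed by the disclosure; a real segmentation network would produce the scores.

```python
def argmax_segmentation(scores):
    """Turn per-class score maps into a per-pixel label map.

    scores[c][v][u] is the (hypothetical) score of class c at pixel (u, v).
    Returns labels[v][u], the index of the highest-scoring class.
    """
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    labels = []
    for v in range(height):
        row = []
        for u in range(width):
            # Pick the class whose score is largest at this pixel.
            best = max(range(num_classes), key=lambda c: scores[c][v][u])
            row.append(best)
        labels.append(row)
    return labels


# Toy 2x2 image with two classes (values are illustrative only).
scores = [
    [[0.9, 0.2], [0.1, 0.3]],  # class 0
    [[0.1, 0.8], [0.9, 0.7]],  # class 1
]
labels = argmax_segmentation(scores)
```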
Specifically, the sample set of the application can be determined in a manual labeling manner, or unmarked sample data is processed in an unsupervised or weakly supervised manner to obtain the sample set. The training sample set may include positive samples and negative samples.
Step S102, determining a 3D point cloud based on a depth map of a sample picture, and performing semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture;
Point cloud data refers to a set of vectors in a three-dimensional coordinate system. 3D data can be represented in various formats, including depth images, point clouds, meshes and volumetric grids. As a common format, the point cloud representation retains the original geometric information in 3D space without any discretization. It is therefore a preferred representation for many applications related to scene understanding, such as autonomous driving and robotics. 3D point cloud segmentation requires an understanding of both the global geometry and the fine-grained details of each point. According to the segmentation granularity, 3D point cloud segmentation methods can be divided into three categories: semantic segmentation (scene level), instance segmentation (object level) and part segmentation (part level). Given a point cloud, semantic segmentation aims to divide the point cloud into several subsets according to the semantics of the points; it has two paradigms, namely projection-based and point-based.
Specifically, the 3D point cloud segmentation network of the present application may be a projection-based network or a point-based network, or may be another network capable of implementing the functions of the present application.
Step S103, determining a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map and a predetermined loss function, wherein the predetermined loss function comprises a consistency loss function, and the consistency loss function is used for determining consistency loss of the two-dimensional segmentation result map and the three-dimensional segmentation result map;
specifically, the consistency loss function value may be determined based on a two-dimensional segmentation result map, a three-dimensional segmentation result map, two-dimensional segmentation and three-dimensional segmentation labels of the sample picture, and a predetermined loss function.
And step S104, updating model parameters of the initial semantic segmentation model based on the determined loss function value until convergence, and obtaining a target semantic segmentation model.
Specifically, the target semantic segmentation model may be obtained by performing back propagation in the direction that reduces the consistency loss and adjusting the parameters of the model until convergence.
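The "update until convergence" step can be sketched as a generic loop. This is only a toy illustration: the helper `train_until_convergence`, the tolerance-based stopping rule and the one-parameter quadratic loss are assumptions for this example; the disclosure does not specify a convergence criterion or an optimizer.

```python
def train_until_convergence(step_fn, tol=1e-6, max_steps=10_000):
    """Call step_fn repeatedly until the loss stops changing.

    step_fn performs one parameter update and returns the loss measured
    before that update. Training stops when two consecutive losses differ
    by less than tol (an assumed convergence criterion).
    """
    previous = float("inf")
    for step in range(1, max_steps + 1):
        loss = step_fn()
        if abs(previous - loss) < tol:
            return step, loss
        previous = loss
    return max_steps, previous


# Toy stand-in for a model: one parameter w with loss (w - 3)^2.
state = {"w": 0.0}


def toy_step(lr=0.1):
    w = state["w"]
    loss = (w - 3.0) ** 2
    grad = 2.0 * (w - 3.0)       # derivative of the loss with respect to w
    state["w"] = w - lr * grad   # gradient-descent update
    return loss


steps, final_loss = train_until_convergence(toy_step)
```

In an actual training run, `step_fn` would wrap a forward pass, the loss computation and a back-propagation update of the segmentation model.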
The embodiment of the present disclosure provides a possible implementation manner, where the consistency loss function is:
wherein one term of the function represents the 2D semantic segmentation result corresponding to a pixel point of the sample picture, and the other term represents the 3D semantic segmentation result corresponding to that pixel point.
For the embodiment of the present disclosure, the two-dimensional semantic segmentation guided by 3D information is realized by the consistency loss function.
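To make the idea of a consistency loss concrete, the sketch below compares the 2D and 3D results pixel by pixel. The specific form used here, a mean per-pixel L1 distance between class-probability vectors, is an assumption chosen purely for illustration and is not taken from the disclosure; the name `consistency_loss` is likewise hypothetical.

```python
def consistency_loss(probs_2d, probs_3d):
    """Mean per-pixel L1 distance between 2D and 3D class probabilities.

    probs_2d[i] and probs_3d[i] are the class-probability lists predicted
    for pixel i by the 2D branch and the 3D branch respectively.
    NOTE: the L1 form is an illustrative assumption.
    """
    if len(probs_2d) != len(probs_3d):
        raise ValueError("2D and 3D results must cover the same pixels")
    total = 0.0
    for p2, p3 in zip(probs_2d, probs_3d):
        total += sum(abs(a - b) for a, b in zip(p2, p3))
    return total / len(probs_2d)


# Two pixels, two classes: the branches agree on the first pixel only.
probs_2d = [[0.9, 0.1], [0.4, 0.6]]
probs_3d = [[0.9, 0.1], [0.6, 0.4]]
loss = consistency_loss(probs_2d, probs_3d)
```

The loss is zero when both branches produce identical results and grows as they diverge, which is the property a consistency term needs.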
The embodiment of the present disclosure provides a possible implementation manner, where determining a loss function value based on a two-dimensional segmentation result map, a three-dimensional segmentation result map, and a predetermined loss function includes:
determining the loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, the depth map, the two-dimensional segmentation label, the three-dimensional segmentation label and the depth information label of the sample picture, and the predetermined loss function; wherein the predetermined loss function further comprises: a two-dimensional loss function, a three-dimensional loss function and a depth loss function.
For the embodiment of the present disclosure, four loss functions are considered, namely a two-dimensional loss function, a three-dimensional loss function, a depth loss function and a consistency loss function, so that the precision and robustness of the trained model are further improved. In addition, a weight may be set for each of the loss functions, where the weight may be an empirical value or a value obtained through training.
Illustratively, the loss function of the training semantic segmentation model of the present disclosure may be:
$L_{all} = L_{2D\text{-}seg} + L_{depth} + L_{3D\text{-}seg} + L_{consist}$
where $L_{2D\text{-}seg}$, $L_{depth}$, $L_{3D\text{-}seg}$ and $L_{consist}$ respectively represent the two-dimensional loss function, the depth loss function, the three-dimensional loss function and the consistency loss function.
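The combined loss, including the optional per-term weights mentioned above, can be sketched as follows. The function name `total_loss` and the default weights of 1.0 are assumptions for this example; with the defaults the function reduces to the unweighted sum of the four terms.

```python
def total_loss(l_2d_seg, l_depth, l_3d_seg, l_consist,
               weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms.

    weights holds one (hypothetical) coefficient per term, in the order
    2D segmentation, depth, 3D segmentation, consistency.
    """
    terms = (l_2d_seg, l_depth, l_3d_seg, l_consist)
    return sum(w * t for w, t in zip(weights, terms))


# Illustrative loss values only.
unweighted = total_loss(0.5, 0.25, 0.75, 0.5)
weighted = total_loss(0.5, 0.25, 0.75, 0.5, weights=(1.0, 1.0, 1.0, 0.5))
```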
The embodiment of the present disclosure provides a possible implementation manner, where determining a 3D point cloud based on a depth map of a sample picture includes:
and converting the depth map into a 3D point cloud based on the internal reference of the camera corresponding to the sample picture and the depth map.
Converting the depth map into a point cloud is a coordinate system transformation, namely from the image coordinate system to the world coordinate system, and the constraint of the transformation is the camera internal reference (that is, the camera's intrinsic parameters). In other words, based on the camera internal reference, the depth map can be converted into a 3D point cloud, which solves the problem of how to convert the depth map into a point cloud.
The embodiment of the present disclosure provides a possible implementation manner, in which, based on an internal reference of a camera corresponding to a sample picture and a depth map, the depth map is converted into a 3D point cloud, including:
by the following formula:
converting the depth map into a 3D point cloud; wherein $d$ is the depth corresponding to each pixel point $(u, v)$, $K$ is the internal reference of the camera, and $P_c$ is the 3D coordinate $[X_c, Y_c, Z_c]$ in the camera coordinate system.
Specifically, according to the formula, the problem of how to convert the depth map into point cloud data is solved.
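Assuming the conversion is the standard pinhole back-projection $P_c = d \cdot K^{-1} [u, v, 1]^T$, which is consistent with the symbol definitions above, it could be sketched as follows. The function name `depth_to_point_cloud` and the parameters `fx`, `fy`, `cx`, `cy` (focal lengths and principal point making up $K$) are illustrative assumptions, and lens distortion is ignored.

```python
def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth map into camera-frame 3D points.

    depth[v][u] is the depth d of pixel (u, v). With the assumed intrinsic
    matrix K = [[fx, 0, cx], [0, fy, cy], [0, 0, 1]], the product
    d * K^-1 * [u, v, 1]^T gives (X_c, Y_c, Z_c) as computed below.
    Points are returned in row-major pixel order.
    """
    points = []
    for v, row in enumerate(depth):
        for u, d in enumerate(row):
            x = (u - cx) * d / fx
            y = (v - cy) * d / fy
            points.append((x, y, d))
    return points


# Toy 2x2 depth map and made-up intrinsics.
depth = [[2.0, 2.0], [4.0, 4.0]]
points = depth_to_point_cloud(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

Because one point is produced per pixel in a fixed order, the correspondence between pixels and points is preserved, which is what later allows 2D and 3D results to be compared pixel by pixel.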
The embodiment of the present application provides a possible implementation manner, wherein performing semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture includes:
and obtaining a semantic segmentation result of the 3D point cloud based on a predetermined 3D point cloud semantic segmentation network.
Specifically, the point cloud segmentation network in the embodiment of the present disclosure may be PointNet, PointNet++ or the like, or may be another point cloud segmentation network capable of implementing the present application. In this way, semantic segmentation of the point cloud is achieved.
Illustratively, to better understand the training method of the semantic segmentation network model of the present disclosure, fig. 3 shows an exemplary diagram of a training flow, which includes: 1. encoding an input image; 2. obtaining a depth map and a two-dimensional segmentation result respectively based on a depth estimation network (depth head) and a two-dimensional semantic segmentation network (seg head) of the initial model; 3. converting the depth map into a 3D point cloud; 4. performing semantic segmentation on the 3D point cloud to obtain a 3D semantic segmentation result; 5. determining the corresponding loss based on the consistency of the 3D and 2D semantic segmentation, and adjusting the parameters of the initial model by back propagation in the direction in which the loss value becomes smaller.
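Once a point cloud segmentation network has produced one label per point, the labels can be arranged back into an image-shaped three-dimensional segmentation result map. The sketch below assumes one point per pixel stored in row-major order; that ordering and the helper name `point_labels_to_map` are assumptions for illustration and are not stated in the disclosure.

```python
def point_labels_to_map(point_labels, height, width):
    """Arrange per-point labels into a height x width label map.

    Assumes point i came from pixel (u, v) = (i % width, i // width),
    i.e. the point cloud was built in row-major pixel order.
    """
    if len(point_labels) != height * width:
        raise ValueError("expected exactly one label per pixel")
    return [point_labels[v * width:(v + 1) * width] for v in range(height)]


# Six hypothetical point labels for a 2x3 image.
seg_map_3d = point_labels_to_map([0, 1, 1, 2, 2, 0], height=2, width=3)
```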
Example two
According to a second aspect of the present disclosure, there is provided a semantic segmentation method, as shown in fig. 4, including:
step S401, determining a target image to be segmented;
step S402, inputting the target image to be segmented into the target semantic segmentation model trained in the first embodiment, and obtaining the semantic segmentation result of the target image to be segmented.
The target image to be segmented can be a picture directly shot by a camera or extracted from a shot video.
As one application scenario of the embodiment of the present disclosure, a driving image captured by the vehicle-mounted camera of an unmanned vehicle may be used; each target in the driving image is then determined, and the image is automatically segmented and classified so that obstacles such as pedestrians and vehicles can be avoided.
The method can also be used for medical image analysis. With the rise of artificial intelligence, combining neural networks with medical diagnosis has become a research hotspot and intelligent medical research is gradually maturing; in the field of intelligent medical treatment, semantic segmentation can be used for tumor image segmentation, caries diagnosis and the like.
In addition, the method disclosed by the application can also be used in the security field and other application scenes comprising semantic segmentation and target detection segmentation.
For the embodiment of the disclosure, the used semantic segmentation model is a model obtained by introducing 3D information to guide two-dimensional semantic segmentation training, and the semantic segmentation accuracy can be improved.
The embodiment of the present application provides a possible implementation manner, wherein the method further includes: inputting the target image to be segmented into the target semantic segmentation model to obtain a depth map of the target image to be segmented.
Specifically, according to the embodiment of the application, the depth information of the target picture to be segmented can be obtained through the trained semantic segmentation model.
For example, in one application scenario, the depth information may be used as reference data for vehicle obstacle avoidance during automatic driving.
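One way the two model outputs might be combined for obstacle avoidance is to look up the smallest predicted depth among pixels of a given class. This is a hypothetical usage sketch: the helper `nearest_distance`, the class index `PEDESTRIAN` and the toy values are assumptions, not part of the disclosure.

```python
def nearest_distance(labels, depth, target_class):
    """Smallest depth among pixels labelled target_class, or None.

    labels[v][u] is the predicted class and depth[v][u] the predicted
    depth of pixel (u, v); both maps must have the same shape.
    """
    distances = [
        d
        for label_row, depth_row in zip(labels, depth)
        for label, d in zip(label_row, depth_row)
        if label == target_class
    ]
    return min(distances) if distances else None


PEDESTRIAN = 1  # hypothetical class index
labels = [[0, 1], [1, 0]]
depth = [[10.0, 6.5], [4.0, 12.0]]
closest = nearest_distance(labels, depth, PEDESTRIAN)
```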
Example three
The embodiment of the present disclosure provides a semantic segmentation model training device, as shown in fig. 5, including:
a first determining module 501, configured to determine a depth map and a two-dimensional segmentation result map of a sample picture based on a depth estimation network and a semantic segmentation network of an initial semantic segmentation model;
the 3D semantic segmentation module 502 is configured to determine a 3D point cloud based on the depth map of the sample picture, and perform semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture;
a loss value determination module 503, configured to determine a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, and a predetermined loss function, where the predetermined loss function includes a consistency loss function, and the consistency loss function is used to determine consistency loss of the two-dimensional segmentation result map and the three-dimensional segmentation result map;
and an updating module 504, configured to update the model parameters of the initial semantic segmentation model based on the determined loss function value until convergence, so as to obtain a target semantic segmentation model.
The embodiment of the present application provides a possible implementation manner, where the consistency loss function is:
wherein one term of the function represents the 2D semantic segmentation result corresponding to a pixel point of the sample picture, and the other term represents the 3D semantic segmentation result corresponding to that pixel point.
The embodiment of the present application provides a possible implementation manner, wherein the loss value determining module is specifically configured to determine the loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, the depth map, the two-dimensional segmentation label, the three-dimensional segmentation label and the depth information label of the sample picture, and the predetermined loss function; wherein the predetermined loss function further comprises: a two-dimensional loss function, a three-dimensional loss function and a depth loss function.
The embodiment of the present application provides a possible implementation manner, wherein the apparatus further includes:
a conversion module, configured to convert the depth map into a 3D point cloud based on the depth map and the intrinsic parameters of the camera corresponding to the sample picture.
The embodiment of the present application provides a possible implementation manner, wherein the conversion module is specifically configured to convert the depth map into a 3D point cloud according to the following formula:
wherein d is the depth corresponding to each pixel point (u, v), K is the intrinsic parameter matrix of the camera, and P_c is the 3D coordinate [X_c, Y_c, Z_c] in the camera coordinate system.
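The conversion can be sketched with the standard pinhole back-projection P_c = d * K^-1 * [u, v, 1]^T, which is consistent with the symbols defined above but is assumed here rather than quoted from the present application. The function name `depth_to_point_cloud` and the nested-list representation of the depth map and of K are hypothetical; K is assumed to be upper triangular with a last row of [0, 0, 1].

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def depth_to_point_cloud(depth: Sequence[Sequence[float]],
                         K: Sequence[Sequence[float]]) -> List[Point3D]:
    """Back-project each pixel (u, v) with depth d into the camera frame.

    Solves K * P_c = d * [u, v, 1]^T for P_c = [X_c, Y_c, Z_c], assuming
    K = [[fx, skew, cx], [0, fy, cy], [0, 0, 1]].
    """
    fx, skew, cx = K[0]
    _, fy, cy = K[1]
    points: List[Point3D] = []
    for v, row in enumerate(depth):      # v is the row index of the pixel
        for u, d in enumerate(row):      # u is the column index of the pixel
            z_c = d
            y_c = (v - cy) * d / fy
            x_c = ((u - cx) * d - skew * y_c) / fx
            points.append((x_c, y_c, z_c))
    return points


# A 1 x 2 depth map and a toy intrinsic matrix (fx = fy = 2, cx = cy = 1).
K = [[2.0, 0.0, 1.0],
     [0.0, 2.0, 1.0],
     [0.0, 0.0, 1.0]]
cloud = depth_to_point_cloud([[4.0, 2.0]], K)
```

Because one 3D point is produced per pixel and in pixel order, a per-point segmentation of this cloud can be mapped back to the image grid by position.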
The embodiment of the application provides a possible implementation manner, wherein the 3D semantic segmentation module is specifically used for obtaining a semantic segmentation result of the 3D point cloud based on a predetermined 3D point cloud semantic segmentation network.
The beneficial effects achieved by this embodiment of the present application are the same as those of the method embodiment described above, and are not described herein again.
EXAMPLE III
An embodiment of the present disclosure provides a semantic segmentation apparatus, as shown in fig. 6, including:
a second determining module 601, configured to determine a target image to be segmented;
an obtaining module 602, configured to input the target image to be segmented into the target semantic segmentation model trained in the first embodiment, and obtain a semantic segmentation result of the target image to be segmented.
The embodiment of the application provides a possible implementation manner, wherein the obtaining module is further configured to input the target image to be segmented into the target semantic segmentation model trained in the first embodiment, so as to obtain a depth map of the target image to be segmented.
The beneficial effects achieved by this embodiment of the present application are the same as those of the method embodiment described above, and are not described herein again.
In the technical solution of the present disclosure, the acquisition, storage, application and other handling of the personal information of the users involved all comply with the provisions of relevant laws and regulations, and do not violate public order and good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
The electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as provided by the embodiments of the present disclosure.
Compared with the prior art, which improves the accuracy of semantic segmentation mainly through network improvements and multi-modal feature fusion, the electronic device determines a depth map and a two-dimensional segmentation result map of a sample picture based on a depth estimation network and a semantic segmentation network of an initial semantic segmentation model; then determines a 3D point cloud based on the depth map of the sample picture, and performs semantic segmentation on the 3D point cloud to obtain a three-dimensional segmentation result map of the sample picture; then determines a loss function value based on the two-dimensional segmentation result map, the three-dimensional segmentation result map, and a predetermined loss function, wherein the predetermined loss function includes a consistency loss function used to determine the consistency loss between the two-dimensional segmentation result map and the three-dimensional segmentation result map; and finally updates the model parameters of the initial semantic segmentation model based on the determined loss function value until convergence, so as to obtain the target semantic segmentation model. That is, when the semantic segmentation model is trained, a 3D and 2D consistency loss function is introduced so that the 3D segmentation result and the 2D segmentation result agree as closely as possible; the 3D information is thereby used to guide the 2D semantic segmentation, which can improve the precision and robustness of the trained model.
The readable storage medium is a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method as provided by an embodiment of the present disclosure.
Compared with the prior art, which improves the accuracy of semantic segmentation mainly through network improvements and multi-modal feature fusion, the readable storage medium introduces a 3D and 2D consistency loss function when the semantic segmentation model is trained, so that the 3D segmentation result and the 2D segmentation result agree as closely as possible; the 3D information is thereby used to guide the 2D semantic segmentation, which can improve the precision and robustness of the trained model. The training steps are the same as those described above for the electronic device, and are not described herein again.
The computer program product comprises a computer program which, when executed by a processor, implements the method as shown in the first aspect of the disclosure.
Compared with the prior art, which improves the accuracy of semantic segmentation mainly through network improvements and multi-modal feature fusion, the computer program product introduces a 3D and 2D consistency loss function when the semantic segmentation model is trained, so that the 3D segmentation result and the 2D segmentation result agree as closely as possible; the 3D information is thereby used to guide the 2D semantic segmentation, which can improve the precision and robustness of the trained model. The training steps are the same as those described above for the electronic device, and are not described herein again.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the various methods and processes described above, such as the semantic segmentation model training method or the semantic segmentation method. For example, in some embodiments, the semantic segmentation model training method or the semantic segmentation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the semantic segmentation model training method or the semantic segmentation method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by way of firmware) to perform the semantic segmentation model training method or the semantic segmentation method.
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.