Method, system, apparatus and storage medium for determining a gaze target

1. A method of determining a gaze target, comprising:

concatenating a head picture, a head position mask and a gaze vector space along a channel dimension to obtain a first input parameter;

inputting the first input parameter into a first backbone network for feature extraction, so that the first backbone network performs feature extraction on the first input parameter;

obtaining a first gaze vector space feature output by the first backbone network, and inputting the first gaze vector space feature into a coarse-grained module so that the coarse-grained module encodes the first gaze vector space feature;

obtaining a coarse-grained first three-dimensional gaze vector output by the coarse-grained module, and performing matrix multiplication on the first three-dimensional gaze vector and the gaze vector space to obtain a gaze area heat map; and

determining a gaze target based on the first gaze vector space feature.

2. The method of claim 1, further comprising:

acquiring a target scene by using a depth camera to obtain a scene picture and a first depth picture;

extracting the head picture and a first position from the scene picture by using a preset head detection algorithm, wherein the first position is the position of the eyes in the target scene;

converting the head picture by using a preset head mask generation algorithm to obtain the head position mask;

registering the scene picture and the first depth picture by using a preset registration algorithm to obtain a registered second depth picture;

constructing the gaze vector space by using the second depth picture, parameters of the depth camera, and the first position.

3. The method of claim 2, further comprising:

inputting the scene picture, the gaze area heat map and the head position mask as fourth input parameters into a third backbone network for feature extraction, so that the third backbone network performs feature extraction on the fourth input parameters;

and obtaining a visual saliency feature output by the third backbone network, and multiplying the visual saliency feature by a head attention map to obtain a visual saliency feature with attention, wherein the head attention map is obtained after the head feature and the head position feature are reshaped by a fully connected layer, the head feature is extracted from the head picture by a preset head path extraction model, and the head position feature is obtained after the head position mask is pooled by a preset pooling algorithm.

4. The method according to claim 3, wherein the determining a gaze target based on the first gaze vector space feature comprises:

inputting the first gaze vector space feature, the head feature and the visual saliency feature with attention as second input parameters into a fine-grained module, and encoding the second input parameters by the fine-grained module to obtain a fine-grained second three-dimensional gaze vector;

and taking the second three-dimensional gaze vector and a two-dimensional gaze heat map as third input parameters, processing the third input parameters in three-dimensional space by a preset joint inference algorithm, and determining the gaze target, wherein the two-dimensional gaze heat map is obtained by encoding and decoding the head feature and the visual saliency feature with attention through a preset encoder-decoder architecture.

5. The method of claim 3, wherein the preset head path extraction model is a second backbone network, the method further comprising:

inputting the head picture into the second backbone network for feature extraction, so that the second backbone network performs feature extraction on the head picture;

obtaining the head feature output by the second backbone network.

6. The method according to claim 4, wherein the processing of the third input parameters in three-dimensional space by the preset joint inference algorithm to determine the gaze target specifically comprises:

projecting a proposal region into three-dimensional space by using a preset pinhole camera model, obtaining a gaze vector set located in the proposal region in the three-dimensional space, selecting, from the gaze vector set by using the preset joint inference algorithm, a third three-dimensional gaze vector with the highest fitting degree to the second three-dimensional gaze vector, and determining the pointing point of the third three-dimensional gaze vector as the gaze target, wherein the center point of the proposal region is the position corresponding to the maximum value of the two-dimensional gaze heat map, and the length and width of the proposal region are not greater than those of the two-dimensional gaze heat map.

7. The method of claim 2, wherein the constructing the gaze vector space by using the second depth picture, the parameters of the depth camera, and the first position comprises:

adding two channels to the second depth picture to construct a pixel space picture, wherein the two channels are respectively used for representing the abscissa and the ordinate of a pixel coordinate system;

projecting the abscissa and the ordinate represented by the two channels of the pixel space picture into three-dimensional space according to the parameters of the depth camera, to obtain a first three-dimensional space picture;

and performing regularization processing on a second three-dimensional space picture, which is generated by subtracting the first position from the first three-dimensional space picture, to obtain the gaze vector space.

8. A system for determining a gaze target, the system comprising:

a parameter acquisition module, configured to concatenate a head picture, a head position mask and a gaze vector space along a channel dimension to obtain a first input parameter;

a feature extraction module, configured to input the first input parameter into a first backbone network for feature extraction, so that the first backbone network performs feature extraction on the first input parameter;

a coarse-grained processing module, configured to obtain a first gaze vector space feature output by the first backbone network and input the first gaze vector space feature into the coarse-grained module, so that the coarse-grained module encodes the first gaze vector space feature;

a heat map generation module, configured to obtain a coarse-grained first three-dimensional gaze vector output by the coarse-grained module, and perform matrix multiplication on the first three-dimensional gaze vector and the gaze vector space to obtain a gaze area heat map; and

a gaze target determination module, configured to determine a gaze target based on the first gaze vector space feature.

9. An apparatus for determining a gaze target, comprising:

a processor;

a memory for storing the processor-executable instructions;

wherein the processor is configured to execute the instructions to implement the method of determining a gaze target according to any one of claims 1 to 7.

10. A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of determining a gaze target according to any one of claims 1 to 7.

Background

Human gaze analysis is an important component of human-computer interaction: a person's line of sight reflects not only the region and target being attended to, but also the person's intention and mental activity. With the development of science and technology, gaze target analysis has become a research hotspot for many technology companies, so that gaze target determination can better serve various application scenarios.

Current methods for determining a gaze target mainly use a probabilistic graphical model to extract and analyze scene information and human skeleton information in a two-dimensional scene, and thereby infer latent intentions, gaze directions and gaze targets in the scene point cloud. However, the prior art can only extract two-dimensional scene information in combination with human skeleton information, and is difficult to deploy in a wide range of three-dimensional application scenarios. How to extract three-dimensional scene information to determine the gaze target is therefore an urgent problem for current research and development personnel.

Disclosure of Invention

Embodiments of the present invention aim to provide a method, a system, an apparatus and a storage medium for determining a gaze target, so as to determine the gaze target in a three-dimensional scene. The specific technical solution is as follows:

a method of determining a gaze target, the method comprising:

and splicing the head picture, the head position mask and the gazing vector space on the channel dimension to obtain a first input parameter.

And inputting the first input parameter into a first backbone network for feature extraction, so that the first backbone network performs feature extraction on the first input parameter.

And obtaining a first watching vector space characteristic output by the first backbone network, and inputting the first watching vector space characteristic into a coarse-grained module so that the coarse-grained module encodes the first watching vector space characteristic.

And obtaining a first three-dimensional watching vector of the coarse granularity output by the coarse granularity module, and performing matrix multiplication on the first three-dimensional watching vector and the watching vector space to obtain a watching area heat map.

A gaze target is determined based on the first gaze vector spatial feature.

Optionally, the method further includes:

and acquiring a target scene by using a depth camera to obtain a scene picture and a first depth picture.

And extracting the head picture and a first position from the scene picture by using a preset head detection algorithm, wherein the first position is the position of eyes in the target scene.

And converting the head picture by using a preset head mask generation algorithm to obtain the head position mask.

And registering the scene picture and the first depth picture by using a preset registration algorithm to obtain a registered second depth picture.

Constructing the gaze vector space using the second depth picture, parameters of the depth camera, and the first location.

Optionally, the method further includes:

and inputting the scene picture, the watching area heat map and the head position mask as fourth input parameters into a third trunk network for feature extraction, so that the third trunk network performs feature extraction on the fourth input parameters.

And obtaining a visual saliency feature output by the third backbone network, and obtaining the visual saliency feature with attention after multiplying the visual saliency feature by a mapping map of head attention, wherein the mapping map of head attention is obtained after the shape of the head feature and the head position feature is changed by a full connection layer, the head feature is extracted from the head picture by a preset head path extraction model, and the head position feature is obtained after the head position mask is pooled by a preset pooling algorithm.

Optionally, the determining a gazing target based on the first gazing vector spatial feature specifically includes:

and inputting the first watching vector space feature, the head feature and the attention-bearing visual saliency feature as second input parameters into a fine-grained module, and coding the second input parameters by the fine-grained module to obtain a fine-grained second three-dimensional watching vector.

And taking the second three-dimensional gazing vector and the two-dimensional gazing heat map as third input parameters, carrying out calculation processing on the third input parameters in a three-dimensional space by a preset joint inference algorithm, and determining the gazing target, wherein the two-dimensional gazing heat map is obtained by encoding and decoding the head characteristics and the visual saliency characteristics with attention by a preset encoder-decoder framework.

Optionally, the preset head path extraction model is a second backbone network, and the method further includes:

inputting the head picture into the second backbone network for feature extraction, so that the second backbone network performs feature extraction on the head picture.

Obtaining the head feature output by the second backbone network.

Optionally, the processing of the third input parameters in three-dimensional space by the preset joint inference algorithm to determine the gaze target specifically includes:

Projecting a proposal region into three-dimensional space by using a preset pinhole camera model, obtaining a gaze vector set located in the proposal region in the three-dimensional space, selecting, from the gaze vector set by using the preset joint inference algorithm, a third three-dimensional gaze vector with the highest fitting degree to the second three-dimensional gaze vector, and determining the pointing point of the third three-dimensional gaze vector as the gaze target, wherein the center point of the proposal region is the position corresponding to the maximum value of the two-dimensional gaze heat map, and the length and width of the proposal region are not greater than those of the two-dimensional gaze heat map.

Optionally, the constructing the gaze vector space by using the second depth picture, the parameters of the depth camera, and the first position specifically includes:

Adding two channels to the second depth picture to construct a pixel space picture, wherein the two channels are respectively used for representing the abscissa and the ordinate of a pixel coordinate system.

Projecting the abscissa and the ordinate represented by the two channels of the pixel space picture into three-dimensional space according to the parameters of the depth camera, to obtain a first three-dimensional space picture.

Performing regularization processing on a second three-dimensional space picture, which is generated by subtracting the first position from the first three-dimensional space picture, to obtain the gaze vector space.

A system for determining a gaze target, the system comprising:

and the parameter acquisition module is used for splicing the head picture, the head position mask and the gazing vector space on the channel dimension to acquire a first input parameter.

And the feature extraction module is used for inputting the first input parameter into a first trunk network for feature extraction so that the first trunk network performs feature extraction on the first input parameter.

And the coarse-granularity processing module is used for obtaining the first watching vector space characteristic output by the first trunk network and inputting the first watching vector space characteristic into the coarse-granularity module so that the coarse-granularity module encodes the first watching vector space characteristic.

And the heat map generation module is used for obtaining a first three-dimensional watching vector of the coarse granularity output by the coarse granularity module, and performing matrix multiplication on the first three-dimensional watching vector and the watching vector space to obtain a watching area heat map.

A gaze target determination module to determine a gaze target based on the first gaze vector spatial feature.

An apparatus for determining a gaze target, comprising:

a processor;

a memory for storing the processor-executable instructions.

Wherein the processor is configured to execute the instructions to implement the method of determining a gaze target as described in any one of the above.

A computer-readable storage medium, wherein instructions in the computer-readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method of determining a gaze target as described in any one of the above.

According to the method, system, apparatus and storage medium for determining a gaze target provided by the embodiments of the present invention, a gaze vector space is constructed and the information of the three-dimensional space acquired by the depth camera is fully utilized, so that scene information in a three-dimensional scene can be directly extracted and used. Meanwhile, the present invention uses a three-dimensional gaze path feature extraction model, combined with the gaze vector space, to extract and process the features of the scene information provided by the three-dimensional scene and thereby determine the gaze target in three-dimensional space. The scene information of the three-dimensional space is thus fully utilized, and the gaze target can be determined in three-dimensional space without additional human skeleton information, so that the method can be applied to a wide range of three-dimensional scenes.

Of course, it is not necessary for any product or method of practicing the invention to achieve all of the above-described advantages at the same time.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.

Fig. 1 is a flow chart of a method for determining a gaze target according to an embodiment of the present invention;

fig. 2 is a schematic diagram of a method for determining a gaze target according to an alternative embodiment of the present invention;

fig. 3 is a schematic diagram of a two-dimensional gaze heat map provided according to an alternative embodiment of the present invention;

fig. 4 is a schematic view of a gaze vector in a three-dimensional coordinate system according to an alternative embodiment of the present invention;

fig. 5 is a block diagram of a system for determining a gaze target according to an embodiment of the present invention;

fig. 6 is a block diagram of an apparatus for determining a gaze target according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

An embodiment of the present invention provides a method for determining a gaze target, as shown in fig. 1, including:

s101, splicing the head picture, the head position mask and the gazing vector space on a channel dimension to obtain a first input parameter.

The head image is obtained by extracting and processing the position of the head in a scene image acquired by a depth camera through a preset head detection algorithm.

The head position mask is obtained by converting the head picture through a preset head mask generating algorithm. Specifically, the conversion process is: the preset head mask generation algorithm is used for converting the head picture into data, setting a head region in the head picture to be 1 and setting other scene regions to be 0, and finally generating a mask for representing the head position.
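
For illustration only, the following is a minimal sketch of this mask conversion, assuming the preset head detection algorithm returns an axis-aligned head bounding box; the box coordinates and image size are hypothetical.

```python
# Minimal illustrative sketch (not the claimed algorithm): build the head position mask
# from a hypothetical head bounding box (x1, y1, x2, y2) returned by a head detector.
import numpy as np

def head_position_mask(scene_h, scene_w, head_box):
    """Single-channel mask: 1 inside the head region, 0 for all other scene regions."""
    x1, y1, x2, y2 = head_box
    mask = np.zeros((scene_h, scene_w), dtype=np.float32)
    mask[y1:y2, x1:x2] = 1.0
    return mask

# hypothetical 640x480 scene with the head detected at (250, 80)-(330, 180)
mask = head_position_mask(480, 640, head_box=(250, 80, 330, 180))
```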

Any implementation that provides the above functions of the preset head detection algorithm and the preset head mask generation algorithm may be used; the present invention places no further limitation on the specific contents of these two algorithms.

The gaze vector space is a three-channel picture representing the possible gaze vectors in the three-dimensional space; it reflects, for each point of the scene, the gaze vector from the human eye to the corresponding visual target point in the three-dimensional space.

Optionally, in the embodiment of the present invention, a depth camera (RGB-D camera) is used to acquire the images. The camera actively measures the distance from each pixel to the camera plane, and its built-in algorithm registers the acquired three-channel color scene picture with the acquired first depth picture to obtain a registered second depth picture.

S102, inputting the first input parameter into a first backbone network for feature extraction, so that the first backbone network performs feature extraction on the first input parameter.

The feature extraction performed by the first backbone network on the first input parameter is specifically the extraction of three-dimensional gaze path features from the first input parameter. In this way, the scene information of the three-dimensional scene is fully utilized, and the gaze target can be determined without human skeleton information.

Optionally, in an alternative embodiment of the present invention, the first backbone network adopts a 50-layer residual neural network (ResNet-50); any structure that achieves the purpose of the present invention may be adopted for the first backbone network, and the present invention does not limit it.
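
For illustration only, the following is a minimal sketch of steps S101 and S102 under assumed channel counts, using torchvision's ResNet-50 with a widened first convolution as a stand-in for the first backbone network; treating the pooled output as the first gaze vector space feature is also an assumption.

```python
# Minimal illustrative sketch of S101/S102: a 3-channel head picture, a 1-channel head
# position mask and a 3-channel gaze vector space are concatenated into a 7-channel input
# and fed to a ResNet-50 whose first convolution has been widened accordingly.
import torch
import torch.nn as nn
from torchvision.models import resnet50

head_img   = torch.randn(1, 3, 224, 224)   # head picture
head_mask  = torch.randn(1, 1, 224, 224)   # head position mask
gaze_space = torch.randn(1, 3, 224, 224)   # gaze vector space

# concatenate the three inputs along the channel dimension -> first input parameter
first_input = torch.cat([head_img, head_mask, gaze_space], dim=1)

backbone = resnet50()
backbone.conv1 = nn.Conv2d(7, 64, kernel_size=7, stride=2, padding=3, bias=False)  # accept 7 channels
backbone.fc = nn.Identity()                 # keep the pooled 2048-dim feature

# treated here, by assumption, as the first gaze vector space feature
first_gaze_space_feature = backbone(first_input)
```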

S103, obtaining a first gaze vector space feature output by the first backbone network, and inputting the first gaze vector space feature into the coarse-grained module so that the coarse-grained module encodes the first gaze vector space feature.

The coarse-grained module consists of an encoder and a fully connected layer, and encodes the first gaze vector space feature to generate a coarse first three-dimensional gaze vector.
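
For illustration only, a minimal sketch of such a coarse-grained module follows; the 2048-dimensional input, the hidden size and the unit-length normalization of the output are assumptions.

```python
# Minimal illustrative sketch of a coarse-grained module: an encoder followed by a fully
# connected layer that regresses a coarse 3D gaze vector.
import torch
import torch.nn as nn

class CoarseGrainedModule(nn.Module):
    def __init__(self, in_dim=2048, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.fc = nn.Linear(hidden, 3)                      # coarse 3D gaze vector (x, y, z)

    def forward(self, feat):
        v = self.fc(self.encoder(feat))
        return v / (v.norm(dim=-1, keepdim=True) + 1e-8)    # normalize to a unit direction

first_gaze_vector = CoarseGrainedModule()(torch.randn(1, 2048))
```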

S104, obtaining the coarse-grained first three-dimensional gaze vector output by the coarse-grained module, and performing matrix multiplication on the first three-dimensional gaze vector and the gaze vector space to obtain a gaze area heat map.

The gaze area heat map represents the degree to which each visual region in the three-dimensional space is gazed at.
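
For illustration only, the following sketch reads the "matrix multiplication" of S104 as a per-pixel dot product between the coarse three-dimensional gaze vector and the gaze vector stored at each pixel of the gaze vector space; this reading and the tensor shapes are assumptions.

```python
# Minimal illustrative sketch of S104: pixels whose gaze vector aligns with the predicted
# coarse direction receive higher heat map values.
import torch

gaze_space = torch.randn(3, 224, 224)                        # one 3D gaze vector per pixel
gaze_space = gaze_space / (gaze_space.norm(dim=0, keepdim=True) + 1e-8)

coarse_vec = torch.tensor([0.1, -0.2, 0.97])
coarse_vec = coarse_vec / coarse_vec.norm()

# per-pixel dot product -> gaze area heat map
gaze_area_heatmap = torch.einsum('c,chw->hw', coarse_vec, gaze_space).clamp(min=0)
```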

S105, determining a gaze target based on the first gaze vector space feature.

Optionally, the method further includes:

and acquiring a target scene by using a depth camera to obtain a scene picture and a first depth picture.

And extracting a head picture and a first position from the scene picture by using a preset head detection algorithm, wherein the first position is the position of eyes in the target scene.

And converting the head picture by using a preset head mask generation algorithm to obtain a head position mask.

And registering the scene picture and the first depth picture by using a preset registration algorithm to obtain a second registered depth picture.

A gaze vector space is constructed using the second depth picture, the parameters of the depth camera, and the first location.

The parameters of the depth camera comprise internal parameters and external parameters, the internal parameter data comprise an optical center and a focal length of the depth camera, and the external parameter data comprise a space coordinate system and a rotation matrix of the depth camera. It should be noted that the parameters of the depth camera are technical parameters known by those skilled in the art, and the present invention is not described herein in detail.

Optionally, the method further includes:

and inputting the scene picture, the watching area heat map and the head position mask as fourth input parameters into a third trunk network for feature extraction, so that the third trunk network performs feature extraction on the fourth input parameters.

And obtaining the visual saliency characteristics output by the third backbone network, and multiplying the visual saliency characteristics with a mapping graph of the head attention to obtain the visual saliency characteristics with attention, wherein the mapping graph of the head attention is obtained by changing the shapes of the head characteristics and the head position characteristics through a full connection layer, the head characteristics are extracted from a head picture through a preset head path extraction model, and the head position characteristics are obtained by pooling a head position mask through a preset pooling algorithm.

The visual saliency characteristic is used for representing the attention degree of human facing a visual saliency area in a scene through visual saliency. The visually significant region is a region in which a human being focuses on a scene.

The above-mentioned map of the head attention is a matrix for characterizing the visual attention.
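
For illustration only, a minimal sketch of the head attention path follows: the head position mask is pooled, concatenated with the head feature, reshaped by a fully connected layer into a spatial attention map, and used to re-weight every channel of the visual saliency feature. All sizes, the pooling resolution and the sigmoid activation are assumptions.

```python
# Minimal illustrative sketch of the head attention map and its application to the
# visual saliency feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

head_feat = torch.randn(1, 2048)                 # head feature from the head-path backbone
head_mask = torch.randn(1, 1, 224, 224)          # head position mask
saliency  = torch.randn(1, 256, 7, 7)            # visual saliency feature map

pooled_mask = F.adaptive_avg_pool2d(head_mask, (32, 32)).flatten(1)   # head position feature
fc = nn.Linear(2048 + 32 * 32, 7 * 7)            # fully connected layer that changes the shape

attn = torch.sigmoid(fc(torch.cat([head_feat, pooled_mask], dim=1))).view(1, 1, 7, 7)
saliency_with_attention = saliency * attn        # broadcast over all saliency channels
```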

It will be understood by those skilled in the art that step S105 can be implemented in various ways, for example by using a probabilistic HAO model over the scene point cloud to determine the gaze target.

Of course, in an alternative embodiment of the present invention, the step S105 may also be implemented as follows:

Optionally, in an alternative embodiment of the present invention, the determining a gaze target based on the first gaze vector space feature specifically includes:

Inputting the first gaze vector space feature, the head feature and the visual saliency feature with attention as second input parameters into a fine-grained module, and encoding the second input parameters by the fine-grained module to obtain a fine-grained second three-dimensional gaze vector.

Taking the second three-dimensional gaze vector and the two-dimensional gaze heat map as third input parameters, processing the third input parameters in three-dimensional space by a preset joint inference algorithm, and determining the gaze target, wherein the two-dimensional gaze heat map is obtained by encoding and decoding the head feature and the visual saliency feature with attention through a preset encoder-decoder architecture.

The fine-grained module consists of an encoder and a fully connected layer, and encodes the second input parameters to generate a refined, accurate second three-dimensional gaze vector.

The two-dimensional gaze heat map is a heat map reflecting the degree to which each visual region is gazed at on the two-dimensional picture.
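
For illustration only, a minimal sketch of an encoder-decoder producing a two-dimensional gaze heat map follows; the fusion of the head feature and the visual saliency feature with attention into a single 288-channel feature map and the layer configuration are assumptions, not the patented architecture.

```python
# Minimal illustrative sketch of the two-dimensional gaze heat map branch.
import torch
import torch.nn as nn

class HeatmapEncoderDecoder(nn.Module):
    def __init__(self, in_ch=288):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1))

    def forward(self, fused):
        return torch.sigmoid(self.decoder(self.encoder(fused)))

fused = torch.randn(1, 288, 7, 7)                 # hypothetical fused feature map
heatmap_2d = HeatmapEncoderDecoder()(fused)       # (1, 1, 14, 14) two-dimensional gaze heat map
```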

Optionally, the preset head path extraction model is a second backbone network, and the method further includes:

and inputting the head picture into a second backbone network for feature extraction, so that the second backbone network performs feature extraction on the head picture.

And obtaining the head characteristics output by the second backbone network.

Alternatively, in another alternative embodiment of the present invention, the second backbone network can adopt a 50-layer residual neural network structure (Resnet-50) as the first backbone network, and can also adopt other structures, so as to achieve the purpose of the present invention, which is not limited by the present invention.

Optionally, in an alternative embodiment of the present invention, the processing of the third input parameters in three-dimensional space by the preset joint inference algorithm to determine the gaze target specifically includes:

Projecting a proposal region into three-dimensional space by using a preset pinhole camera model, obtaining a gaze vector set located in the proposal region in the three-dimensional space, selecting, from the gaze vector set by using the preset joint inference algorithm, a third three-dimensional gaze vector with the highest fitting degree to the second three-dimensional gaze vector, and determining the pointing point of the third three-dimensional gaze vector as the gaze target, wherein the center point of the proposal region is the position corresponding to the maximum value of the two-dimensional gaze heat map, and the length and width of the proposal region are not greater than those of the two-dimensional gaze heat map.

To facilitate an understanding of the above method, it is explained here in connection with an alternative embodiment of the method, as shown in fig. 2:

As shown in fig. 2, in the three-dimensional gaze feature extraction path, the gaze vector space 201, the head position mask 202 and the head picture 203 are concatenated 205 along the channel dimension and input into the first backbone network 206 as the first input parameter. The first backbone network 206 performs feature extraction on the first input parameter to obtain the first gaze vector space feature 207, the coarse-grained module 208 encodes the first gaze vector space feature 207 to obtain the first three-dimensional gaze vector 209, and matrix multiplication 210 of the first three-dimensional gaze vector 209 with the gaze vector space 201 yields the gaze area heat map 211.

In the head feature extraction path, the second backbone network 212 performs feature extraction on the input head picture 203 to obtain a head feature 213.

The head position mask 202 is pooled by a preset pooling algorithm 214 to obtain the head position feature 215; the head position feature 215 and the head feature 213 are concatenated 205 along the channel dimension, input into the fully connected layer 216 and reshaped to obtain the head attention map 217.

In the visual saliency feature extraction path, the third backbone network 228 performs feature extraction on the input head position mask 202, scene picture 204 and gaze area heat map 211 to obtain the visual saliency feature 218; after each channel of the visual saliency feature 218 is multiplied 219 by the head attention map 217, the visual saliency feature with attention 220 is obtained.

In the three-dimensional gaze vector prediction branch, the first gaze vector space feature 207, the visual saliency feature with attention 220 and the head feature 213 are concatenated 221 along the channel dimension and input into the fine-grained module 222 as the second input parameters; after the fine-grained module 222 encodes the second input parameters, the second three-dimensional gaze vector 223 is obtained.

In the two-dimensional gaze heat map prediction branch, the head feature 213 and the visual saliency feature with attention 220 are encoded and decoded by the preset encoder-decoder architecture 224 to obtain the two-dimensional gaze heat map 225.

The second three-dimensional gaze vector 223 and the two-dimensional gaze heat map 225 are taken as the third input parameters, and the preset joint inference algorithm 226 processes the third input parameters to determine the gaze target 227.

It should be noted that the matrix multiplication 210 and the multiplication 219 may or may not be the same operation, and likewise the channel-dimension concatenation 205 and the channel-dimension concatenation 221; the specific choice depends on the actual application scenario, which the present invention does not limit.

For further clarity, please refer to fig. 3 and fig. 4 to understand the above process of determining the gaze target by the preset joint inference algorithm:

Fig. 3 shows a two-dimensional gaze heat map 225 obtained by the above method, in which the elliptical areas of varying dot density are the higher-value areas of the two-dimensional gaze heat map. A proposal region 229 is selected in the two-dimensional gaze heat map 225, and the center point of the proposal region 229 is the point corresponding to the maximum value of the two-dimensional gaze heat map.

Optionally, the length and width of the proposal region 229 are not greater than those of the two-dimensional gaze heat map; the present invention does not limit the specific values of the length and width of the proposal region.

The proposal region is projected into three-dimensional space by using the preset pinhole camera model, and a gaze vector set is obtained under a three-dimensional coordinate system containing an X axis, a Y axis and a Z axis, as shown in fig. 4. The cube 229 in fig. 4 is the region obtained by projecting the proposal region 229 of fig. 3 into the three-dimensional space; the points shown include a black hollow circle 231, a white hollow circle 230, white solid circles and black solid circles.

In fig. 4, the black hollow circle 231 is the first position, the white hollow circle 230 is the determined gaze target, the white solid circles are the candidate gaze points inside the proposal region 229, and the black solid circles are gaze points in the three-dimensional space that lie outside the proposal region 229.

The third three-dimensional gaze vector with the highest fitting degree to the second three-dimensional gaze vector in the gaze vector set, that is, the vector taking the black hollow circle 231 as its emitting point and the white hollow circle 230 as its target point, is selected, and the target point located at the white hollow circle 230 is determined as the gaze target.
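
For illustration only, the following sketch of the joint inference step assumes a pinhole back-projection of the proposal-region pixels using the registered depth picture and camera intrinsics, and uses cosine similarity as the "fitting degree"; the similarity measure, the region size and the intrinsic values are all assumptions.

```python
# Minimal illustrative sketch of the joint inference step.
import numpy as np

def back_project(us, vs, depth, fx, fy, cx, cy):
    """Pinhole camera model: pixel (u, v) with depth z -> 3D point (x, y, z)."""
    z = depth[vs, us]
    return np.stack([(us - cx) * z / fx, (vs - cy) * z / fy, z], axis=-1)

def joint_inference(heatmap_2d, depth, eye_pos, second_vec, fx, fy, cx, cy, half=16):
    # proposal region centred at the heat map maximum, clipped to the picture borders
    v0, u0 = np.unravel_index(np.argmax(heatmap_2d), heatmap_2d.shape)
    vs, us = np.mgrid[max(v0 - half, 0):min(v0 + half, depth.shape[0]),
                      max(u0 - half, 0):min(u0 + half, depth.shape[1])]
    candidates = back_project(us.ravel(), vs.ravel(), depth, fx, fy, cx, cy)

    # gaze vector set: from the eye position (first position) to every candidate target point
    gaze_set = candidates - eye_pos
    gaze_set /= np.linalg.norm(gaze_set, axis=1, keepdims=True) + 1e-8
    second_vec = second_vec / np.linalg.norm(second_vec)

    best = np.argmax(gaze_set @ second_vec)        # highest fitting degree (cosine similarity)
    return candidates[best]                        # pointing point of the selected third vector

# illustrative call with synthetic data
gaze_target = joint_inference(np.random.rand(480, 640), np.full((480, 640), 2.0),
                              eye_pos=np.zeros(3), second_vec=np.array([0.1, 0.0, 1.0]),
                              fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```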

Optionally, in an alternative embodiment of the present invention, the constructing the gaze vector space by using the second depth picture, the parameters of the depth camera and the first position specifically includes:

Adding two channels to the second depth picture to construct a pixel space picture, wherein the two channels are respectively used for representing the abscissa and the ordinate of a pixel coordinate system.

Projecting the abscissa and the ordinate represented by the two channels of the pixel space picture into three-dimensional space according to the parameters of the depth camera, to obtain a first three-dimensional space picture.

Performing regularization processing on a second three-dimensional space picture, which is generated by subtracting the first position from the first three-dimensional space picture, to obtain the gaze vector space.
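
For illustration only, a minimal sketch of this construction follows; reading the "regularization processing" as per-pixel normalization of the gaze vectors and the camera intrinsic values themselves are assumptions.

```python
# Minimal illustrative sketch of the gaze vector space construction: two coordinate
# channels are added, every pixel is projected into 3D with the camera intrinsics
# (first three-dimensional space picture), the first position (eye) is subtracted,
# and the result is normalized per pixel.
import numpy as np

def gaze_vector_space(depth, eye_pos, fx, fy, cx, cy):
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))    # the two added coordinate channels

    # pinhole projection of every pixel into 3D -> first three-dimensional space picture
    points = np.stack([(us - cx) * depth / fx, (vs - cy) * depth / fy, depth], axis=0)

    # subtract the eye position and normalize -> gaze vector space of shape (3, H, W)
    vectors = points - eye_pos.reshape(3, 1, 1)
    return vectors / (np.linalg.norm(vectors, axis=0, keepdims=True) + 1e-8)

space = gaze_vector_space(np.full((480, 640), 2.0), np.zeros(3),
                          fx=525.0, fy=525.0, cx=319.5, cy=239.5)
```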

According to the embodiment of the present invention, the gaze vector space is constructed and the information of the three-dimensional space acquired by the depth camera is fully utilized, so that scene information in the three-dimensional scene can be directly extracted and used. Meanwhile, the three-dimensional gaze path feature extraction model, combined with the gaze vector space, extracts and processes the features of the scene information provided by the three-dimensional scene to determine the gaze target in the three-dimensional space. The scene information of the three-dimensional space is thereby fully utilized, and the gaze target can be determined in three-dimensional space without additional human skeleton information, so that the method can be applied to a wide range of three-dimensional scenes.

Corresponding to the above embodiment of the method of determining a gaze target, the present invention also provides a system for determining a gaze target, as shown in fig. 5, comprising:

The parameter acquisition module 501 is configured to concatenate the head picture, the head position mask and the gaze vector space along a channel dimension to obtain a first input parameter.

The feature extraction module 502 is configured to input the first input parameter into a first backbone network for feature extraction, so that the first backbone network performs feature extraction on the first input parameter.

The coarse-grained processing module 503 is configured to obtain a first gaze vector space feature output by the first backbone network, and input the first gaze vector space feature into the coarse-grained module, so that the coarse-grained module encodes the first gaze vector space feature.

The heat map generation module 504 is configured to obtain a coarse-grained first three-dimensional gaze vector output by the coarse-grained module, and perform matrix multiplication on the first three-dimensional gaze vector and the gaze vector space to obtain a gaze area heat map.

A gaze target determination module 505, configured to determine a gaze target based on the first gaze vector space feature.

Optionally, the system further includes:

and the first picture extraction submodule is used for acquiring a target scene by using the depth camera to obtain a scene picture and a first depth picture.

And the second picture extraction submodule is used for extracting a head picture and a first position from the scene picture by using a preset head detection algorithm, wherein the first position is the position of eyes in the target scene.

And the first picture conversion sub-module is used for converting the head picture by using a preset head mask generation algorithm to obtain a head position mask.

And the first image registration sub-module is used for registering the scene image and the first depth image by using a preset registration algorithm to obtain a second depth image after registration.

And the first gaze vector space construction submodule is used for constructing a gaze vector space by utilizing the second depth picture, the parameters of the depth camera and the first position.

Optionally, the gaze target determination module 505 further includes:

and the first parameter input sub-module is used for inputting the scene picture, the watching area heat map and the head position mask as fourth input parameters into a third trunk network for feature extraction, so that the third trunk network performs feature extraction on the fourth input parameters.

The first multiplication processing submodule is used for obtaining the visual saliency characteristics output by the third backbone network, multiplying the visual saliency characteristics with a mapping graph of the head attention to obtain the visual saliency characteristics with attention, wherein the mapping graph of the head attention is obtained after the head characteristics and the head position characteristics change shapes through a full connection layer, the head characteristics are extracted from a head picture through a preset head path extraction model, and the head position characteristics are obtained after a head position mask is pooled through a preset pooling algorithm.

Optionally, the gaze target determination module 505 may be specifically configured to:

and inputting the first watching vector space feature, the head feature and the attention-bearing visual saliency feature as second input parameters into a fine-grained module, and coding the second input parameters by the fine-grained module to obtain a fine-grained second three-dimensional watching vector.

And taking the second three-dimensional watching vector and the two-dimensional watching heat map as third input parameters, carrying out calculation processing on the third input parameters in a three-dimensional space by a preset joint inference algorithm, and determining a watching target, wherein the two-dimensional watching heat map is obtained by coding and decoding the head features and the attention-bearing visual saliency features by a preset coder decoder framework. Optionally, the first multiplication processing sub-module further includes:

and the third picture extraction submodule is used for inputting the head picture into a second backbone network for feature extraction so that the second backbone network can extract features of the head picture.

And the head characteristic acquisition submodule acquires the head characteristics output by the second backbone network.

Optionally, the gaze target determination module 505 may also be specifically configured to:

Project a proposal region into three-dimensional space by using a preset pinhole camera model, obtain a gaze vector set located in the proposal region in the three-dimensional space, select, from the gaze vector set by using the preset joint inference algorithm, a third three-dimensional gaze vector with the highest fitting degree to the second three-dimensional gaze vector, and determine the pointing point of the third three-dimensional gaze vector as the gaze target, wherein the center point of the proposal region is the position corresponding to the maximum value of the two-dimensional gaze heat map, and the length and width of the proposal region are not greater than those of the two-dimensional gaze heat map.

Optionally, the first gaze vector space construction sub-module may be specifically configured to:

and adding two channels to the second depth picture to construct a pixel space picture, wherein the two channels are respectively used for representing the abscissa and the ordinate of a pixel coordinate system.

And projecting the abscissa and the ordinate represented by the two channels of the pixel space picture into a three-dimensional space after the abscissa and the ordinate are processed by the depth camera according to the parameters of the depth camera, so as to obtain a first three-dimensional space picture.

And performing regularization processing on a second three-dimensional space picture generated after the first three-dimensional space picture is subtracted from the first position to obtain a fixation vector space.

As shown in fig. 6, an embodiment of the present invention further provides an apparatus for determining a gaze target, including:

a processor 601;

a memory 602 for storing instructions executable by the processor 601.

Wherein the processor is configured to execute the instructions to implement any of the above methods of determining a gaze target.

A computer readable storage medium having instructions which, when executed by a processor 601 of a device for determining a gaze target, enable the device for determining a gaze target to perform any of the above methods of determining a gaze target.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

It is noted that, herein, relational terms such as first and second, and the like, may be used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
