Tag identification method and device
1. A tag identification method, comprising:
extracting multi-modal features of a video to be identified, and determining a first tag of the video to be identified by using the multi-modal features;
classifying the video to be identified to obtain a category of the video to be identified, and identifying a second tag of the video to be identified based on the category of the video to be identified;
and obtaining a third tag of the video to be identified through a trending tag set updated in real time, and determining a video tag of the video to be identified by combining the first tag, the second tag and the third tag.
2. The tag identification method according to claim 1, wherein identifying the second tag of the video to be identified based on the category of the video to be identified comprises:
when the category of the video to be identified is a film, TV and variety category, identifying a target object in the video to be identified;
and determining the second tag of the video to be identified based on the target object.
3. The tag identification method according to claim 2, wherein the target object comprises a person, and determining the second tag of the video to be identified based on the target object comprises:
identifying person features in the video to be identified, and determining a person tag of the video to be identified;
obtaining film/TV/variety titles associated with the person tag through a knowledge graph model;
calculating a similarity between the video to be identified and the film/TV/variety video corresponding to each title, and determining a film/TV/variety tag of the video to be identified;
and determining the second tag according to the person tag and the film/TV/variety tag.
4. The tag identification method according to claim 1, wherein identifying the second tag of the video to be identified based on the category of the video to be identified comprises:
when the category of the video to be identified is a game category, matching the video to be identified against game template data, and determining the second tag of the video to be identified according to the matched game template data.
5. The tag identification method according to claim 4, wherein the game template data comprises skill box templates, and matching the video to be identified against the game template data and determining the second tag of the video to be identified according to the matching result comprises:
matching the video to be identified against a plurality of skill box templates, and determining a target skill box matching the video to be identified;
obtaining a target game character associated with the target skill box;
and determining the second tag according to the target game character.
6. The tag identification method according to claim 1, wherein obtaining the third tag of the video to be identified through the trending tag set updated in real time comprises:
obtaining the trending tag set updated in real time, and obtaining video data corresponding to the trending tag set;
calculating a similarity between a title of the video to be identified and titles of the video data, and screening out, from the video data, a plurality of target videos whose similarity meets a preset threshold;
and determining the third tag of the video to be identified through the trending tags corresponding to the target videos.
7. The tag identification method according to claim 1, wherein determining the video tag of the video to be identified by combining the first tag, the second tag and the third tag comprises:
determining, according to trust policies respectively corresponding to the first tag and the second tag, a tag among the first tag and the second tag that meets its trust policy as a fourth tag;
and determining the video tag of the video to be identified according to the fourth tag and the third tag.
8. The tag identification method according to claim 7, wherein determining, according to the trust policies respectively corresponding to the first tag and the second tag, a tag among the first tag and the second tag that meets its trust policy as the fourth tag comprises:
for each first tag, determining the first tag as a fourth tag if the model that outputs the first tag is the trusted model corresponding to the first tag and the confidence with which the trusted model outputs the first tag meets the trust policy corresponding to the first tag;
and for each second tag, determining the second tag as a fourth tag if the model that outputs the second tag is the trusted model corresponding to the second tag and the confidence with which the trusted model outputs the second tag meets the trust policy corresponding to the second tag.
9. The tag identification method according to claim 7 or 8, wherein determining the first tag of the video to be identified by using the multi-modal features comprises: extracting the multi-modal features of the video to be identified through a first model, and determining the first tag of the video to be identified according to the multi-modal features;
identifying the second tag of the video to be identified based on the category of the video to be identified comprises: identifying the video to be identified based on a second model corresponding to the category of the video to be identified, to obtain the second tag of the video to be identified;
before determining, according to the trust policies respectively corresponding to the first tag and the second tag, a tag among the first tag and the second tag that meets its trust policy as the fourth tag, the method further comprises:
identifying first tags of a plurality of video samples through the first model as a first test result, and identifying second tags of the plurality of video samples through the second model as a second test result;
calculating a first confidence of the first model for each tag included in the first test result, and calculating a second confidence of the second model for each tag included in the second test result;
and for each target tag appearing in both the first test result and the second test result, determining, from the first model and the second model, a trusted model corresponding to the target tag according to the first confidence and the second confidence of the target tag, and taking the correspondence between the target tag and the trusted model as the trust policy corresponding to the target tag.
10. A tag identification apparatus, comprising:
a multi-modal feature extraction module, configured to extract multi-modal features of a video to be identified and determine a first tag of the video to be identified by using the multi-modal features;
a category identification module, configured to classify the video to be identified to obtain a category of the video to be identified, and identify a second tag of the video to be identified based on the category of the video to be identified;
and a trending tag identification module, configured to obtain a third tag of the video to be identified through a trending tag set updated in real time, and determine a video tag of the video to be identified by combining the first tag, the second tag and the third tag.
Background
Video tags play an important role in services such as video recommendation and video search. A video tag can accurately characterize a video, help profile users' interests and habits, and provide a comprehensive and accurate basis for services such as video recommendation and video search.
In the related art, video tags are mainly predicted by a trained model. However, such a solution achieves good results only on general, abstract tags and cannot identify detailed, fine-grained tags, so it can hardly meet the practical requirements of video tagging.
It is to be noted that the information disclosed in the above background section is only for enhancement of understanding of the background of the present application and therefore may include information that does not constitute prior art known to a person of ordinary skill in the art.
Disclosure of Invention
The present application aims to provide a tag identification method, a tag identification apparatus, an electronic device and a computer-readable storage medium, so as to improve the accuracy of tag identification to a certain extent and to improve the timeliness of tags.
According to a first aspect of the present application, there is provided a tag identification method, the method comprising: extracting multi-modal features of a video to be identified, and determining a first tag of the video to be identified by using the multi-modal features; classifying the video to be identified to obtain a category of the video to be identified, and identifying a second tag of the video to be identified based on the category of the video to be identified; and obtaining a third tag of the video to be identified through a trending tag set updated in real time, and determining a video tag of the video to be identified by combining the first tag, the second tag and the third tag.
In an exemplary embodiment of the present application, based on the foregoing embodiments, identifying the second tag of the video to be identified based on the category of the video to be identified includes:
when the category of the video to be identified is a film, TV and variety category, identifying a target object in the video to be identified;
and determining the second tag of the video to be identified based on the target object.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the target object includes a person, and determining the second tag of the video to be identified based on the target object includes:
identifying person features in the video to be identified, and determining a person tag of the video to be identified;
obtaining film/TV/variety titles associated with the person tag through a knowledge graph model;
calculating a similarity between the video to be identified and the film/TV/variety video corresponding to each title, and determining a film/TV/variety tag of the video to be identified;
and determining the second tag according to the person tag and the film/TV/variety tag.
In an exemplary embodiment of the present application, based on the foregoing embodiments, identifying the second tag of the video to be identified based on the category of the video to be identified includes:
when the category of the video to be identified is a game category, matching the video to be identified against game template data, and determining the second tag of the video to be identified according to the matched game template data.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the game template data includes skill box templates, and matching the video to be identified against the game template data and determining the second tag of the video to be identified according to the matching result includes the following steps:
matching the video to be identified against a plurality of skill box templates, and determining a target skill box matching the video to be identified;
obtaining a target game character associated with the target skill box;
and determining the second tag according to the target game character.
In an exemplary embodiment of the present application, based on the foregoing embodiment, obtaining the third tag of the video to be identified through the trending tag set updated in real time includes:
obtaining the trending tag set updated in real time, and obtaining video data corresponding to the trending tag set;
calculating a similarity between the title of the video to be identified and the titles of the video data, and screening out, from the video data, a plurality of target videos whose similarity meets a preset threshold;
and determining the third tag of the video to be identified through the trending tags corresponding to the target videos.
In an exemplary embodiment of the present application, based on the foregoing embodiments, determining the video tag of the video to be identified by combining the first tag, the second tag and the third tag includes:
determining, according to trust policies respectively corresponding to the first tag and the second tag, a tag among the first tag and the second tag that meets its trust policy as a fourth tag;
and determining the video tag of the video to be identified according to the fourth tag and the third tag.
In an exemplary embodiment of the present application, based on the foregoing embodiments, determining, according to the trust policies respectively corresponding to the first tag and the second tag, a tag among the first tag and the second tag that meets its trust policy as the fourth tag includes:
for each first tag, determining the first tag as a fourth tag if the model that outputs the first tag is the trusted model corresponding to the first tag and the confidence with which the trusted model outputs the first tag meets the trust policy corresponding to the first tag;
and for each second tag, determining the second tag as a fourth tag if the model that outputs the second tag is the trusted model corresponding to the second tag and the confidence with which the trusted model outputs the second tag meets the trust policy corresponding to the second tag.
In an exemplary embodiment of the present application, based on the foregoing embodiment, determining the first tag of the video to be identified by using the multi-modal features includes: extracting the multi-modal features of the video to be identified through a first model, and determining the first tag of the video to be identified according to the multi-modal features;
identifying the second tag of the video to be identified based on the category of the video to be identified includes: identifying the video to be identified based on a second model corresponding to the category of the video to be identified, to obtain the second tag of the video to be identified;
before determining, according to the trust policies respectively corresponding to the first tag and the second tag, a tag among the first tag and the second tag that meets its trust policy as the fourth tag, the method further includes:
identifying first tags of a plurality of video samples through the first model as a first test result, and identifying second tags of the plurality of video samples through the second model as a second test result;
calculating a first confidence of the first model for each tag included in the first test result, and calculating a second confidence of the second model for each tag included in the second test result;
and for each target tag appearing in both the first test result and the second test result, determining, from the first model and the second model, a trusted model corresponding to the target tag according to the first confidence and the second confidence of the target tag, and taking the correspondence between the target tag and the trusted model as the trust policy corresponding to the target tag.
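Purely as an illustration, the trust policy construction described above can be sketched in code. The following Python sketch is not the claimed implementation: all function and variable names are hypothetical, and the confidence is simplified to a per-tag score measured on the test samples.

```python
# Minimal sketch of trust policies for tags both models can output.
# All names are hypothetical; "confidence" is simplified to a per-tag
# score of each model measured on the labeled video samples.

def build_trust_policy(first_confidence, second_confidence):
    """For each tag appearing in both test results, trust the model
    with the higher confidence for that tag.

    first_confidence / second_confidence: dict mapping tag -> confidence
    of the first (multi-modal) and second (category-specific) model.
    Returns a dict mapping tag -> name of its trusted model.
    """
    policy = {}
    for tag in first_confidence.keys() & second_confidence.keys():
        if first_confidence[tag] >= second_confidence[tag]:
            policy[tag] = "first_model"
        else:
            policy[tag] = "second_model"
    return policy


def select_fourth_tags(predictions, policy):
    """Keep a predicted tag only if it came from its trusted model.

    predictions: iterable of (tag, source_model_name) pairs produced by
    both models for the video to be identified.
    """
    return [tag for tag, source in predictions if policy.get(tag) == source]


# Toy example: the second (fine-grained) model is trusted for "actor A",
# the first (multi-modal) model for "TV drama".
policy = build_trust_policy({"actor A": 0.72, "TV drama": 0.91},
                            {"actor A": 0.88, "TV drama": 0.64})
print(select_fourth_tags([("actor A", "second_model"),
                          ("TV drama", "first_model")], policy))
# ['actor A', 'TV drama']
```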
According to a second aspect of the present application, there is provided a tag identification apparatus, the apparatus comprising: a multi-modal feature extraction module, a category identification module and a trending tag identification module.
The multi-modal feature extraction module is configured to extract multi-modal features of a video to be identified and determine a first tag of the video to be identified by using the multi-modal features.
The category identification module is configured to classify the video to be identified to obtain a category of the video to be identified, and identify a second tag of the video to be identified based on the category of the video to be identified.
The trending tag identification module is configured to obtain a third tag of the video to be identified through a trending tag set updated in real time, and determine a video tag of the video to be identified by combining the first tag, the second tag and the third tag.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the category identification module includes: a film/TV/variety identification module, configured to identify a target object in the video to be identified when the category of the video to be identified is the film, TV and variety category; and a film/TV/variety tag determination module, configured to determine the second tag of the video to be identified based on the target object.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the target object includes a person, and the film/TV/variety tag determination module may include a person identification module, a knowledge graph module, a similarity calculation module and a tag output module.
The person identification module is configured to identify person features in the video to be identified and determine a person tag of the video to be identified.
The knowledge graph module is configured to obtain film/TV/variety titles associated with the person tag through a knowledge graph model.
The similarity calculation module is configured to calculate a similarity between the video to be identified and the film/TV/variety video corresponding to each title, and determine a film/TV/variety tag of the video to be identified.
The tag output module is configured to determine the second tag according to the person tag and the film/TV/variety tag.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the category identification module is configured to: when the category of the video to be identified is a game category, match the video to be identified against game template data, and determine the second tag of the video to be identified according to the matched game template data.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the game template data includes skill box templates, and the category identification module may include a skill box matching module, a game character determination module and a game tag determination module.
The skill box matching module is configured to match the video to be identified against a plurality of skill box templates and determine a target skill box matching the video to be identified.
The game character determination module is configured to obtain a target game character associated with the target skill box.
The game tag determination module is configured to determine the second tag according to the target game character.
In an exemplary embodiment of the present application, based on the foregoing embodiments, the trending tag identification module may include a video data acquisition module, a video similarity calculation module and a trending tag determination module.
The video data acquisition module is configured to obtain the trending tag set updated in real time and obtain video data corresponding to the trending tag set.
The video similarity calculation module is configured to calculate a similarity between the title of the video to be identified and the titles of the video data, and screen out, from the video data, a plurality of target videos whose similarity meets a preset threshold.
The trending tag determination module is configured to determine the third tag of the video to be identified according to the trending tags corresponding to the target videos.
In an exemplary embodiment of the present application, based on the foregoing embodiments, the trending tag identification module may be configured to: determine, according to trust policies respectively corresponding to the first tag and the second tag, a tag among the first tag and the second tag that meets its trust policy as a fourth tag; and determine the video tag of the video to be identified according to the fourth tag and the third tag.
In an exemplary embodiment of the present application, the trending tag identification module may be configured to: for each first tag, determine the first tag as a fourth tag if the model that outputs the first tag is the trusted model corresponding to the first tag and the confidence with which the trusted model outputs the first tag meets the trust policy corresponding to the first tag; and for each second tag, determine the second tag as a fourth tag if the model that outputs the second tag is the trusted model corresponding to the second tag and the confidence with which the trusted model outputs the second tag meets the trust policy corresponding to the second tag.
In an exemplary embodiment of the present application, based on the foregoing embodiment, determining the first tag of the video to be identified by using the multi-modal features includes: extracting the multi-modal features of the video to be identified through a first model, and determining the first tag of the video to be identified according to the multi-modal features; identifying the second tag of the video to be identified based on the category of the video to be identified includes: identifying the video to be identified based on a second model corresponding to the category of the video to be identified, to obtain the second tag of the video to be identified. The apparatus further includes a first test result module, a second test result module, a confidence test module and a trust policy determination module.
The first test result module is configured to identify first tags of a plurality of video samples through the first model as a first test result.
The second test result module is configured to identify second tags of the plurality of video samples through the second model as a second test result.
The confidence test module is configured to calculate a first confidence of the first model for each tag included in the first test result and calculate a second confidence of the second model for each tag included in the second test result.
The trust policy determination module is configured to, for each target tag appearing in both the first test result and the second test result, determine, from the first model and the second model, a trusted model corresponding to the target tag according to the first confidence and the second confidence of the target tag, and take the correspondence between the target tag and the trusted model as the trust policy corresponding to the target tag.
According to a third aspect of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the tag identification method of any of the embodiments of the first aspect described above.
According to a fourth aspect of the present application, there is provided an electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the tag identification method of any embodiment of the first aspect above via execution of the executable instructions.
According to a fifth aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the tag identification method provided in the above embodiments.
The exemplary embodiments of the present application may have some or all of the following advantages:
in the tag identification scheme provided by an exemplary embodiment of the present application, multi-modal features of a video to be identified are extracted, and a first tag of the video to be identified is determined by using the multi-modal features; the video to be identified is classified, and a second tag is further identified according to the category of the video to be identified; a third tag is also identified by using a trending tag set; and finally, the video tag of the video to be identified is determined by combining the first tag, the second tag and the third tag. In this technical solution, first, the first tag of the video is determined as a whole through the multi-modal features, which ensures the generality of the tags; second, videos are classified at a finer granularity and the second tag is determined specifically for the category of the video, which can improve tag accuracy; third, real-time hot content is identified through the trending tag set, which can improve the timeliness of the tags. Moreover, the first tag, the second tag and the third tag depend on little data and can be flexibly updated and migrated, which improves reusability while reducing maintenance difficulty.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a system architecture diagram illustrating an application scenario to which a tag identification method according to an embodiment of the present application may be applied.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Fig. 3 schematically shows a flow diagram of a tag identification method according to an embodiment of the present application.
Fig. 4 schematically shows a flow diagram for identifying a second tag according to an embodiment of the application.
Fig. 5 schematically shows a flow diagram for identifying a second tag according to another embodiment of the present application.
Fig. 6 schematically shows a flow diagram for identifying a third tag according to an embodiment of the present application.
Fig. 7 schematically shows a flow diagram for identifying a third tag according to another embodiment of the present application.
Fig. 8 schematically shows a flowchart for determining a trust policy according to an embodiment of the present application.
Fig. 9 schematically shows a system architecture diagram of a tag identification method according to an embodiment of the present application.
Fig. 10 is a schematic diagram of a tag architecture of a tag identification method according to an embodiment of the present application.
Fig. 11 is a schematic diagram illustrating a display effect of a video tag according to a tag identification method of an embodiment of the present application.
Fig. 12 shows a schematic structural diagram of a tag identification device according to an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known technical solutions have not been shown or described in detail to avoid obscuring aspects of the present application.
Furthermore, the drawings are merely schematic illustrations of the present application and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
In the related art for identifying video tags, the tags of a video can be predicted by using multi-modal features. The overall model is divided into two stages: single-modality feature extraction first, followed by fusion prediction over the multi-modal features. This approach is flexible and can meet the requirements of big-data scenarios. For example, a pre-trained model extracts word vectors from the video's text to obtain the text modality, a convolutional network extracts image features to obtain the image modality, and another convolutional network extracts audio features to obtain the audio modality; the multi-modal information is then interactively fused into a unified representation vector, from which the tags of the video are predicted. However, this scheme performs well only on conceptual tags and a small number of entity tags; for expert-domain tags that require fine-grained understanding, its recognition accuracy can hardly meet practical requirements. Moreover, the scheme requires a large amount of accumulated training data, while videos in real scenarios are updated very quickly, so the timeliness of the tags predicted by the model is difficult to maintain.
In view of this, the present exemplary embodiment provides a tag identification method capable of overcoming one or more of the problems described above. Fig. 1 is a schematic diagram illustrating a system architecture of an exemplary application environment to which a tag identification method according to an embodiment of the present application can be applied.
As shown in fig. 1, the system architecture 100 may include one or more of terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 101, 102, 103 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
The tag identification method provided by the embodiment of the present application is generally executed by the server 105, and accordingly, the tag identification apparatus is generally disposed in the server 105. However, it is easily understood by those skilled in the art that the tag identification method provided in the embodiment of the present application may also be executed by the terminal devices 101, 102, and 103, and accordingly, the tag identification apparatus may also be disposed in the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment.
FIG. 2 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system 200 of the electronic device shown in fig. 2 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 2, the computer system 200 includes a Central Processing Unit (CPU) 201 that can perform various appropriate actions and processes in accordance with a program stored in a Read-Only Memory (ROM) 202 or a program loaded from a storage section 208 into a Random Access Memory (RAM) 203. The RAM 203 also stores various programs and data necessary for system operation. The CPU 201, the ROM 202 and the RAM 203 are connected to each other via a bus 204. An input/output (I/O) interface 205 is also connected to the bus 204.
The following components are connected to the I/O interface 205: an input portion 206 including a keyboard, a mouse, and the like; an output section 207 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage section 208 including a hard disk and the like; and a communication section 209 including a network interface card such as a LAN card, a modem, or the like. The communication section 209 performs communication processing via a network such as the internet. A drive 210 is also connected to the I/O interface 205 as needed. A removable medium 211 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 210 as necessary, so that a computer program read out therefrom is mounted into the storage section 208 as necessary.
In particular, according to embodiments of the present application, the processes described below with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 209 and/or installed from the removable medium 211. The computer program, when executed by a Central Processing Unit (CPU)201, performs various functions defined in the methods and apparatus of the present application. In some embodiments, the computer system 200 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to perform machine vision tasks such as recognition, tracking and measurement on targets, with further image processing so that the result is an image better suited for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multi-dimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR (Optical Character Recognition), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric technologies such as face recognition and fingerprint recognition.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), speech synthesis (Text To Speech, TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to linguistics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how computers can simulate or implement human learning behaviors to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and learning from demonstrations.
The technical scheme of the embodiment of the application is explained in detail as follows:
fig. 3 schematically shows a flow chart of a tag identification method according to an embodiment of the application. The tag identification method may be applied to the server 105, and may also be applied to one or more of the terminal devices 101, 102, and 103, which is not particularly limited in this exemplary embodiment. Referring to fig. 3, the tag identification method may include the steps of:
step S310, multi-modal characteristics of the video to be recognized are extracted, and the first label of the video to be recognized is determined by utilizing the multi-modal characteristics.
Step S320, the videos to be recognized are classified and recognized, the category of the videos to be recognized is obtained, and the second label of the videos to be recognized is recognized based on the category of the videos to be recognized.
Step S330, acquiring a third label of the video to be identified through the new hot label set updated in real time, and determining the video label of the video to be identified by combining the first label, the second label and the third label.
In the tag identification method provided by this exemplary embodiment, first, the first tag of the video is determined as a whole through multi-modal features, which ensures the generality of the tags; second, videos are classified at a finer granularity and the second tag is determined specifically for the category of the video, which can improve tag accuracy; third, real-time hot content is identified through the trending tag set, which can improve the timeliness of the tags. Moreover, the first tag, the second tag and the third tag depend on little data and can be flexibly updated and migrated, which improves reusability while reducing maintenance difficulty.
Next, the above steps in the present exemplary embodiment will be described in more detail.
In step S310, multi-modal features of the video to be identified are extracted, and the first tag of the video to be identified is determined by using the multi-modal features.
A modality is a form of information: each source or form of information may be referred to as a modality. For example, information may be speech, video or text, and humans have vision, hearing, smell and so on, each of which may be called a modality. In this embodiment, the multi-modal features may include audio features, image features and text features. To extract the multi-modal features of the video to be identified, the image features, audio features and text features of the video to be identified may be extracted separately and then fused to obtain the multi-modal features.
Specifically, the features of each modality are first extracted separately. The video to be identified contains many frames, from which a certain number of frames may be sampled for feature extraction, for example one frame per second, one frame every 2 seconds, and so on. Image features may be extracted in various ways, for example by EfficientNet, a convolutional neural network; they may also be extracted by other convolutional neural networks such as VGGNet or ResNet, which this embodiment does not limit. The video to be identified also contains audio, which is sampled for audio feature extraction: for example, 0.96-second audio segments are sampled, a mel spectrogram is computed, and audio features are then extracted using VGGish. VGGish is a TensorFlow-based VGG model that can extract semantic embedding features from the audio spectrum. The video to be identified usually also has a title or name, and the text features of the title or name may be extracted using the BERT model.
After the features of each modality are extracted, they may be aggregated into the multi-modal features, i.e. several single-modality features are fused into one multi-modal feature. For example, the image features, audio features and text features may be added to obtain the multi-modal features, multiplied element-wise, or fused by a linear function, as sketched below.
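As an illustration only, the fusion step might look as follows in code. The inputs stand in for pooled EfficientNet, VGGish and BERT features; the 512-dimension size and the projection layer are placeholders, not the embodiment's actual networks.

```python
# Illustrative sketch of multi-modal feature fusion (PyTorch).
import torch

def fuse_features(image_feat, audio_feat, text_feat, mode="sum"):
    """Fuse single-modality features into one multi-modal feature.

    All inputs are assumed already projected to the same shape,
    e.g. (batch, 512).
    """
    if mode == "sum":                    # fusion by addition
        return image_feat + audio_feat + text_feat
    if mode == "product":                # element-wise multiplication
        return image_feat * audio_feat * text_feat
    if mode == "linear":                 # simple linear combination
        concat = torch.cat([image_feat, audio_feat, text_feat], dim=-1)
        proj = torch.nn.Linear(concat.shape[-1], image_feat.shape[-1])
        return proj(concat)              # illustration only: untrained layer
    raise ValueError(f"unknown fusion mode: {mode}")

img = torch.randn(1, 512)   # stand-in pooled frame features
aud = torch.randn(1, 512)   # stand-in pooled audio features
txt = torch.randn(1, 512)   # stand-in title features
print(fuse_features(img, aud, txt, mode="sum").shape)  # torch.Size([1, 512])
```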
After the multi-modal features are extracted, they can be classified to obtain the first tag. Classification here means classifying the domain to which the video content belongs, so the first tag may cover all domains: for example, it may include TV drama, movie, game, food, variety and so on, as well as other domain dimensions such as live streaming and news. Correspondingly, the second tag comprises more detailed tags within each domain, including but not limited to the title of a TV drama, its cast, the event a video corresponds to, and the like. For example, the first tag may be "TV drama" while the second tag is "The First Half of My Life"; or the first tag may be "game" while the second tag is "Honor of Kings"; and so on. All tags may be divided into first tags and second tags by a predetermined tag system. The first tag and the second tag may each comprise multiple tags, and the tags they contain may overlap in some domain dimensions: for example, in the person dimension, the first tag may be person A and the second tag may also be person A; but the second tag is more detailed than the first tag in that dimension, so besides person A it may also include person B, person C and so on.
For example, the video to be identified may be input into a NeXtVLAD model, and the model's recognition result used as the first tag. NeXtVLAD can extract the features of video frames and compress them into feature vectors, thereby classifying the video while better fusing multi-modal information. Meanwhile, text information such as the title or name of the video to be identified may be input into TextCNN and BI-LSTM models; TextCNN or BI-LSTM can generate a representation vector of the text information and classify the text, and the resulting classification is also used as a first tag. For example, the first tag may be: TV drama, mainland drama, drama clips, and so on. The NeXtVLAD model introduces an attention mechanism to aggregate the features of individual video frames; it is applicable to video classification with any number of frames and aggregates image and audio features well. TextCNN is a convolutional-neural-network-based text classification model that encodes text information for classification; BI-LSTM is a bidirectional LSTM model that takes the context order of the text into account and classifies more accurately. To improve the classification effect of the multi-modal features, the tags identified by the NeXtVLAD model and the tags identified by the TextCNN or BI-LSTM model may both be obtained, and the union of the two sets of identified tags may be taken as the first tag, ensuring the generality of the first tag, as sketched below.
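A minimal sketch of this union, assuming both classifiers expose a `predict_tags` style interface returning (tag, probability) pairs; the wrapper names are hypothetical:

```python
# Sketch only: combining video-model and text-model predictions by union.
# `nextvlad_model` and `text_model` are hypothetical wrappers around the
# NeXtVLAD and TextCNN/BI-LSTM classifiers described above.

def identify_first_tags(video, title, nextvlad_model, text_model, threshold=0.5):
    video_tags = {t for t, p in nextvlad_model.predict_tags(video) if p >= threshold}
    text_tags = {t for t, p in text_model.predict_tags(title) if p >= threshold}
    return video_tags | text_tags   # the union keeps the first tag general
```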
In step S320, the video to be identified is classified to obtain the category of the video to be identified, and the second tag of the video to be identified is identified based on the category of the video to be identified.
In this exemplary embodiment, a machine learning model may be used to classify the video to be identified; for example, a classification model is trained using the NeXtVLAD algorithm described above, so that the video to be identified is classified by its multi-modal features to obtain the category of the video. For example, the categories of videos to be identified may include a film, TV and variety category, a game category, a food category and a news category. For each category, corresponding video sample data may be collected and annotated in advance, and the annotated video sample data may be used to train a convolutional neural network to obtain a video classification model. The trained video classification model then classifies the video to be identified and outputs its category. In addition, videos may be classified at a finer granularity according to actual needs; for example, the categories may include TV drama, movie, entertainment news, current affairs news, domestic cuisine, foreign cuisine and so on, which this embodiment does not limit.
A corresponding tag prediction model may be trained for each video category. After the category of the video to be identified is determined, the video is input into the tag prediction model corresponding to that category for finer-grained classification, and the second tag of the video to be identified is determined (see the dispatch sketch after this paragraph). In an exemplary embodiment, when the category of the video to be identified is the film, TV and variety category, a target object in the video is identified, and the second tag of the video to be identified is then determined based on the target object. A target object is an object contained in the video, including but not limited to a person, an animal, an article or a scene. For example, the target object may be a scene in the video: the corresponding tag prediction model identifies the scene and derives the location features of the video, which are then used as the second tag. As another example, the target object may be a person in the video: the person is identified to determine person features, and the resulting person features are used as the second tag. In other embodiments, the target object may also be a vehicle, a room, a road and so on, which the model identifies as the second tag; such variants also fall within the scope of protection of the present application.
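A sketch of this routing, with hypothetical category names and placeholder predictors:

```python
# Sketch of dispatching a video to a category-specific tag predictor.
# Category names and predictor bodies are illustrative placeholders.

def film_tv_variety_tags(video):
    return ["<person/scene tags>"]      # e.g. the face pipeline of fig. 4

def game_tags(video):
    return ["<game tags>"]              # e.g. skill box template matching

SECOND_TAG_PREDICTORS = {
    "film_tv_variety": film_tv_variety_tags,
    "game": game_tags,
}

def identify_second_tags(video, classify):
    """`classify` stands for the trained video classification model."""
    category = classify(video)                      # e.g. "game"
    predictor = SECOND_TAG_PREDICTORS.get(category)
    return predictor(video) if predictor else []
```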
In an exemplary embodiment, when the target object is a person, the method of identifying the second tag of the video to be identified may include the following steps S410 to S440, as shown in fig. 4.
In step S410, person features in the video to be identified are identified, and the person tag of the video to be identified is determined. The person features may include facial features, body features, or features of part of the body. This exemplary embodiment describes the identification process taking faces as an example. Specifically, the video to be identified is input into a face recognition model, which samples the video to obtain a number of image frames, performs face detection and face alignment on the image frames, extracts face embedding features, retrieves matches from a face database, and outputs the face identity with the highest confidence as the person tag. For example, if actor A is retrieved with confidence 0.7 and actor B with confidence 0.9, actor B is output as the retrieval result, and "actor B" is the person tag of the video to be identified.
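The retrieval step alone might look like the following sketch, assuming cosine similarity over face embeddings; the upstream detection and alignment are abstracted away, and all names are illustrative:

```python
# Sketch of face retrieval: compare a face embedding extracted from the
# video against a gallery and keep the identity with the highest score.
import numpy as np

def retrieve_person(face_embedding, gallery, min_confidence=0.5):
    """gallery: dict mapping person name -> reference embedding."""
    best_name, best_score = None, -1.0
    q = face_embedding / np.linalg.norm(face_embedding)
    for name, ref in gallery.items():
        score = float(q @ (ref / np.linalg.norm(ref)))  # cosine similarity
        if score > best_score:
            best_name, best_score = name, score
    # The highest-confidence identity becomes the person tag.
    return best_name if best_score >= min_confidence else None
```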
In step S420, the film/TV/variety titles associated with the person tag are obtained through a knowledge graph model. A film/TV/variety title may be a movie title, a TV series title, a variety show title, or a combination of any two or all three of them.
A knowledge graph is a large-scale semantic network whose main purpose is to describe the associations between entities or concepts in the real world; it collects a large amount of data and organizes it into a knowledge base that machines can process, enabling visual presentation.
The knowledge graph model in this embodiment may be built by collecting film and television data in advance, such as video data of films and TV shows, cast and crew information, and plot descriptions, determining the relationships between persons and works, and organizing them into a data model that a computer can identify and process. The film/TV/variety titles associated with a person tag are queried through the knowledge graph model; for example, if the person tag is "actor A", the titles of all works actor A has appeared in can be output through the knowledge graph model. It will be appreciated that one person tag may be associated with multiple titles, i.e. one person may appear in multiple works; multiple person tags may also be associated with the same title, i.e. multiple actors may appear in the same work. If multiple person tags are identified in step S410, the titles associated with all of those person tags may be obtained, as in the toy sketch below.
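A toy sketch of this lookup, with a dict standing in for the knowledge graph model (the data is illustrative):

```python
# Toy knowledge graph: person tags mapped to associated works.
KNOWLEDGE_GRAPH = {
    "actor A": ["drama X", "movie Y"],
    "actor B": ["drama X", "variety Z"],
}

def works_for_person_tags(person_tags):
    """Union of film/TV/variety titles associated with the person tags."""
    titles = set()
    for tag in person_tags:
        titles.update(KNOWLEDGE_GRAPH.get(tag, []))
    return titles

print(works_for_person_tags(["actor A", "actor B"]))
# {'drama X', 'movie Y', 'variety Z'}
```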
In step S430, the similarity between the video to be identified and the film/TV/variety video corresponding to each title is calculated to determine the film/TV/variety tag of the video to be identified. In this exemplary embodiment, one or more titles may be associated with the person tag. The video corresponding to each title is obtained, the similarity between the video to be identified and each such video is calculated, and the title with the highest similarity is used as the film/TV/variety tag of the video to be identified. Specifically, for each work, multiple frames may be extracted in advance to build a picture library, for example pictures of key plot segments and manually annotated pictures; key frames are then extracted from the video to be identified, the similarity between each key frame and each picture in a picture library is calculated, and the resulting similarities are summed to obtain the similarity between that picture library and the video to be identified; finally, the picture library with the highest similarity to the video to be identified is selected, and the work corresponding to that picture library is used as the final film/TV/variety tag. For example, suppose the person tag is associated with work A, work B and work C: the similarity between the video to be identified and each of the three works' videos is calculated respectively, and the work with the highest similarity is taken as the final film/TV/variety tag. A sketch of this selection follows.
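Here `frame_similarity` stands in for whatever image similarity measure is actually used; summing per-frame scores follows the text above:

```python
# Sketch of picking the film/TV/variety tag by summed frame similarity.

def best_matching_work(key_frames, picture_libraries, frame_similarity):
    """picture_libraries: dict mapping work title -> list of key pictures."""
    scores = {
        title: sum(frame_similarity(f, p)
                   for f in key_frames for p in pictures)
        for title, pictures in picture_libraries.items()
    }
    # The work whose picture library is most similar becomes the tag.
    return max(scores, key=scores.get)
```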
In step S440, the second tag is determined according to the character tags and the series tag. The character tags and the series tag may be combined as the second tag; for example, if the character tags are "Actor A, Actor B" and the series tag is "Series A", the second tag may include "Actor A, Actor B, Series A". In addition, other information related to the series tag can be acquired through the knowledge graph model, and the acquired information added to the second tag together with the character tags and the series tag, so that the tags of the video to be identified are determined more comprehensively: for example, all actors and the director of the series, or the information most strongly associated with the series, such as its lead actors. In the knowledge graph model, different weights can be assigned to the relationships between entities, the weight indicating the strength of the relationship: the stronger the relationship between two entities, the higher the weight. Therefore, after the series most similar to the video to be identified is obtained, the entity whose relationship with that series has the highest weight can be retrieved from the knowledge graph model and output as part of the second tag, or a certain number of related entities can be output according to their relationship weights. For example, if the character tags are "Actor A, Actor B" and the series tag is "Series A", and the knowledge graph model shows that the entity with the highest-weight relationship to "Series A" is "Actor C", the relationship being "lead actor", then "Actor A, Actor B, Actor C, Series A" may be used as the second tag of the video to be identified.
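As a sketch, the weighted graph queried in step S440 can be represented as an adjacency map from a series to (entity, relation, weight) triples; the structure, weights, and names below are illustrative assumptions:

    from typing import NamedTuple

    class Relation(NamedTuple):
        entity: str     # related entity, e.g. an actor or director
        relation: str   # relation type, e.g. "lead actor"
        weight: float   # strength of the association

    # Hypothetical adjacency map standing in for the knowledge graph model.
    GRAPH = {
        "Series A": [
            Relation("Actor C", "lead actor", 0.95),
            Relation("Director D", "director", 0.8),
            Relation("Actor A", "supporting actor", 0.7),
        ],
    }

    def top_related(series: str, k: int = 1) -> list:
        # Return the k entities most strongly associated with the series.
        neighbors = sorted(GRAPH.get(series, []), key=lambda r: r.weight, reverse=True)
        return [r.entity for r in neighbors[:k]]

    second_tag = ["Actor A", "Actor B", "Series A"] + top_related("Series A")
    # -> ["Actor A", "Actor B", "Series A", "Actor C"]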
In an exemplary embodiment, the category of the video to be identified may also be the game category; in that case, the video to be identified may be matched against game template data, and the second tag of the video determined according to the matched template. The game template data may include multiple kinds of template information, such as game character templates, game scene templates, models of in-game objects, and skill box templates, and may further include other templates, such as game map templates, which this embodiment does not specifically limit. The game template data can be constructed by collecting in advance the game characters, skill boxes, and game scenes of each game as templates. The templates contained in the game template data are matched against the video to be identified one by one to determine which template the video matches, and the information of that template is used as the second tag; for example, if the video to be identified matches the template of game scene A, the second tag may be "game scene A".
In an exemplary embodiment, when the game template data includes a skill box template, the method of determining the second tag may include the following steps S510, S520, and S530, as shown in fig. 5.
In step S510, the video to be identified is matched against multiple skill box templates, and the target skill box matching the video is determined. In a game application, a skill box is the region of the game interface containing the control used to trigger a skill. Different game characters have different skills, so the skill boxes of the characters in each game application can be collected in advance and stored as skill box templates, of which there may be many. A certain number of frames are then extracted from the video to be identified and matched against the skill box templates, and the template that matches those frames is determined as the target skill box. In one example, there are 100 skill box templates; the 100 templates are matched one by one against the frames extracted from the video to be identified, and if the matching template is determined to be "skill box template A", then template A is the target skill box.
In step S520, the target game character associated with the target skill box is acquired. A skill box can be associated with a game character, and the association between skill boxes and game characters can be built through a knowledge graph or as key-value pairs in a relational database. After the target skill box is determined, the knowledge graph or the corresponding database may be queried to determine the target game character associated with it.
In step S530, the second tag is determined according to the target game character. The target game character can be used as the second tag of the video to be identified; for example, if skill box A is associated with game character B, the second tag of the video to be identified can be "game character B". After the target game character corresponding to the video to be identified is determined, other related information, such as the game name and the game genre, can be retrieved according to the target game character, and the retrieved information can also be added to the second tag.
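One plausible realization of steps S510 to S530 uses classical template matching; in the Python sketch below the OpenCV calls are standard, while the template set, acceptance threshold, and skill-box-to-character table are illustrative assumptions:

    import cv2

    # Hypothetical template set and skill-box -> character lookup (the
    # association of step S520 could equally live in a knowledge graph).
    SKILL_BOX_TEMPLATES = {}   # name -> grayscale template image
    SKILL_BOX_TO_CHARACTER = {"skill box A": "game character B"}
    MATCH_THRESHOLD = 0.8      # assumed acceptance threshold

    def match_target_skill_box(frames):
        # Step S510: match extracted frames against every skill box
        # template and return the name of the best match, if any.
        best_name, best_score = None, MATCH_THRESHOLD
        for name, template in SKILL_BOX_TEMPLATES.items():
            for frame in frames:
                result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
                _, score, _, _ = cv2.minMaxLoc(result)
                if score > best_score:
                    best_name, best_score = name, score
        return best_name

    def second_tag_from_skill_box(frames):
        # Steps S520-S530: map the target skill box to its game character.
        box = match_target_skill_box(frames)
        return [SKILL_BOX_TO_CHARACTER[box]] if box else []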
In this exemplary embodiment, the category of the video to be identified may further include various other categories, such as food and news; for each category, a tag prediction model may be trained in advance to predict fine-grained second tags within that category, making the tags more accurate. For example, for the food category, a large number of pictures of dishes from different cuisines can be collected and a classification model trained; videos of the food category are then classified by this model, which outputs the cuisine corresponding to the video to be identified, and the output cuisine is used as its second tag. A finer-grained model can also be trained for each cuisine to identify a more specific dish or food name. For example, a classification model trained for the dessert category can identify which dessert appears in the video: if a frame of the video to be identified is input into this model, it may output a result such as "cake" or "egg tart", and the output result is used as the second tag, improving tag accuracy.
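The coarse-to-fine classification just described amounts to cascading two classifiers; a sketch under the assumption that both models are already trained (the classifier functions are stand-ins, not real APIs):

    def classify_cuisine(frame) -> str:
        return "dessert"     # stand-in for a trained cuisine classifier

    def classify_dessert(frame) -> str:
        return "egg tart"    # stand-in for a trained dessert classifier

    CUISINE_MODELS = {"dessert": classify_dessert}  # one fine model per cuisine

    def food_second_tags(frame) -> list:
        # First predict the cuisine, then, if a finer-grained model exists
        # for it, predict the specific dish (e.g. "cake" or "egg tart").
        cuisine = classify_cuisine(frame)
        tags = [cuisine]
        fine_model = CUISINE_MODELS.get(cuisine)
        if fine_model:
            tags.append(fine_model(frame))
        return tags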
With reference to fig. 3, in step S330, a third tag of the video to be identified is obtained through the new hot tag set updated in real time, and the video tag of the video to be identified is determined by combining the first tag, the second tag, and the third tag.
The new hot tag set includes multiple new hot tags, where a new hot tag is a tag associated with a current trending event. In practice, short-video content is updated very quickly and easily becomes a viral or trending topic, and such trending content usually lasts only a short time before fading, so its timeliness is strong. The new hot tag set may therefore be updated in real time, for example once a day or once every 8 hours; the update frequency may be determined according to actual needs and is not limited in this embodiment. For example, new hot tags discovered on a given day can be annotated manually by the operations team and stored in the new hot tag set, realizing real-time updating of the set. Trending and viral events can be judged manually in advance, which accelerates the discovery of new hot tags and keeps them timely.
After the updated new hot tag set is obtained, the video to be identified can be matched against the new hot tags in the set; if the title information of the video to be identified exactly matches a new hot tag, that tag is used as the third tag of the video. For example, if the new hot tag set contains four trending events A, B, C, and D, the title of the video to be identified is matched against the four events; if A matches a field in the title, A is used as the third tag of the video to be identified.
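Taking "exactly matches" to mean that a tag appears verbatim in the title, the match can be sketched as follows (the tag and title strings are illustrative):

    def match_hot_tags(title: str, hot_tags: list) -> list:
        # Return every new hot tag appearing verbatim in the video title.
        return [tag for tag in hot_tags if tag in title]

    third_tags = match_hot_tags("Highlights of event A finals", ["event A", "event B"])
    # -> ["event A"]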
Because new hot tags are short-lived, while model-based prediction requires long data accumulation and responds slowly, the new hot tag set in this exemplary embodiment meets the practical requirement that tags respond quickly, improving both the timeliness and the accuracy of the tags.
In an exemplary embodiment, the method of acquiring the third tag of the video to be recognized may include the following steps S610, S620, and S630, as shown in fig. 6.
In step S610, the new hot tag set updated in real time is obtained, together with the video data corresponding to it. The set may be updated once a day, once every 8 hours, once every 4 hours, and so on, and each update may consist of manually adding newly discovered hot tags to the set. The set may be stored under a specific directory, and the updated set is then read from that predetermined directory at the predetermined update time. Video data carrying the new hot tags may be pulled from a content repository based on the updated set; the same video data may carry two or more new hot tags.
In step S620, the similarity between the title of the video to be identified and the titles of the video data is calculated, and the target videos whose similarity meets a preset threshold are screened out of the video data. The new hot tag set may include multiple new hot tags, and each new hot tag may correspond to multiple items of video data. For example, the similarity between the title of video data N and the title of the video to be identified is calculated; if it meets the preset threshold, video data N is used as a target video. In this way the similarity for every item of video data is calculated, yielding all target videos whose similarity meets the threshold. The preset threshold may be determined according to actual requirements, for example 0.6, 0.65, 0.7, or 0.8, which this embodiment does not specifically limit.
In step S630, the third tag of the video to be identified is determined according to the new hot tags corresponding to the target videos. Specifically, the new hot tags can be voted on according to how many target videos carry each tag, and the tag with the highest score is used as the third tag of the video to be identified. Each time a new hot tag is matched, it scores one point. For example, if target video a carries tags A and B, target video b carries B and C, target video c carries C, and target video d carries D and B, then A scores 1, B scores 3, C scores 2, and D scores 1; B has the highest score, so new hot tag B can be used as the third tag of the video to be identified. Alternatively, more than one of the new hot tags corresponding to the target videos may be selected as third tags, for example the highest-scoring tag B together with the second-highest tag C, and so on.
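The voting in step S630 reduces to counting how many target videos carry each tag; a sketch with Python's standard Counter, reusing the example above:

    from collections import Counter

    # New hot tags carried by each matched target video.
    target_video_tags = {
        "video a": ["A", "B"],
        "video b": ["B", "C"],
        "video c": ["C"],
        "video d": ["D", "B"],
    }

    votes = Counter(tag for tags in target_video_tags.values() for tag in tags)
    # votes == Counter({"B": 3, "C": 2, "A": 1, "D": 1})

    third_tags = [tag for tag, _ in votes.most_common(1)]  # ["B"]
    # or keep the top two: votes.most_common(2) -> [("B", 3), ("C", 2)]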
In this exemplary embodiment, a new hot tag prediction model may also be trained on the new hot tag set and its corresponding video data, and the new hot tags of the video to be identified predicted by that model. The model may also be combined with the tag set to determine the new hot tag set to use: verification data is obtained and identified both with the new hot tag set and with the new hot tag prediction model, the new hot tags of the verification data are determined, and the recall rate of each approach on the verification data is calculated. If the new hot tag set achieves the higher recall rate, the current set is deployed for identifying the video to be identified; if the prediction model achieves the higher recall rate, the new hot tags adopted by the model are merged into the new hot tag set, and the merged set is deployed.
Fig. 7 schematically shows a flow chart of a method for obtaining the third tag of the video to be identified through the new hot tag set. As shown in fig. 7, in step S710, the new hot tags annotated by the operations team are stored, producing the updated new hot tag set. In step S720, a new hot pool is constructed. The new hot pool may include video data carrying new hot tags: the operations team can attach a new hot tag to related videos and publish them online for browsing, and the video data carrying new hot tags is then pulled from the online database according to the updated tag set. In step S730, the titles contained in the new hot pool are matched against the title of the video to be identified. If the similarity between a title in the pool and the title of the video to be identified meets the threshold, the video data corresponding to that title is output as a matching result; after every title in the pool has been matched one by one, all target videos whose similarity meets the threshold are obtained. For example, if the new hot pool contains 100 videos in total, their titles are matched one by one against the title of the video to be identified; if 10 of them meet the threshold, those 10 videos, or the new hot tags corresponding to them, are output as the matching result. In step S740, the title of the video to be identified is matched against the new hot tags themselves; the matching result is the set of new hot tags whose similarity to the title meets a threshold. In step S750, the overall matching result is obtained, comprising the new hot tags of the target videos matched in step S730 and the new hot tags matched in step S740. In step S760, the score of each new hot tag is calculated from the matching result: each tag scores as many points as the number of times it was matched, e.g., if new hot tag A was matched 4 times, it scores 4. In step S770, the new hot tag prediction result of the video to be identified is determined according to the scores; for example, the highest-scoring tag, or a certain number of tags ranked by score, may be used as the prediction result. In this step, the predicted new hot tags are the third tag of the video to be identified.
In step S330, after the third tag of the video to be identified is obtained, the final video tag may be determined by combining the first tag, the second tag, and the third tag. For example, a union operation may be performed on the three tags, and the union used as the video tag of the video to be identified, so as to mark the video. Taking the union of the first, second, and third tags as the final video tag maximizes the recall rate of the video tags. After the video tag is determined, other videos with similar tags can be recommended to the user in video recommendation and video search scenarios. In an exemplary embodiment, a fourth tag may instead be determined from the tags among the first and second tags that satisfy their respective trust policies, and the video tag of the video to be identified is then determined from the fourth tag and the third tag. A trust policy is the condition under which a tag is trusted, for example that the recognition probability of the tag exceeds a threshold. There may be multiple first tags and multiple second tags, each with its own trust policy: for example, if the trust policy of tag A requires a recognition probability above 80% and tag A was recognized with a probability below that, tag A does not satisfy its trust policy and is removed from the final result.
Before identifying the video to be identified, the trust policy of each tag may be determined in advance; specifically, a number of labeled video samples may first be obtained. In an embodiment of the present application, the first tag and the second tag are obtained through models, i.e., steps S310 and S320 are performed by models that output the first tag and the second tag, and the model responsible for identifying the first tag may differ from the model responsible for identifying the second tag. For example, the multi-modal features of the video to be identified are extracted through a first model, which determines the first tag from those features, while the second tag is identified through a second model. The first model extracts multi-modal features of the video so as to classify it using those features, and may be, for example, a NeXtVLAD model. The second model is determined by the category of the video to be identified: if the category is the movie and television category, any one or more related models, such as a face recognition model, a knowledge graph association model for movie and television series, and a series name retrieval model, may serve as the second model to understand the video at a finer granularity; if the category is the game category, one or more game models, such as a game character tag model for recognizing game characters and a game name tag model for recognizing game names, may serve as the second model. In this way, customized tag models (such as the movie and television models and game models above) are fused on top of a general video tag model, improving the recognition of entity tags. After the first and second models are trained, their confidences can be determined, and the trust policies derived from those confidences. Specifically, for each first tag, if the model that output it is the confidence model corresponding to that tag, the first tag is determined to be a fourth tag, where the confidence model corresponding to a first tag is the model whose confidence for that tag satisfies the tag's trust policy; likewise, for each second tag, if the model that output it is the confidence model corresponding to that tag, the second tag is determined to be a fourth tag.
Whether a tag (first or second) becomes a fourth tag therefore depends on whether the model that output it is the confidence model corresponding to that tag, and the confidence model for each tag may be predetermined. For a first tag, the confidence model is the model whose confidence for that tag satisfies the trust policy. For example, if the first model identifies a video as having first tag A, and the trust policy records the confidence model of tag A as the first model, tag A can be determined to be a fourth tag; conversely, if the trust policy records the confidence model of tag A as the second model, the model that output tag A is not its confidence model, and tag A cannot be a fourth tag. The same applies to second tags. The confidence is the accuracy of the model in recognizing the tag; after the first and second models are trained, the accuracy of each model on each tag can be computed by testing the models.
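A sketch of this fourth-tag filter, assuming the trust policy is stored as a simple mapping from each tag to the name of its confidence model (all names are illustrative):

    # Trust policy: which model is the confidence model for each tag.
    TRUST_POLICY = {"A": "second_model", "B": "first_model", "C": "second_model"}

    def fourth_tags(first_tags: list, second_tags: list) -> list:
        # Keep only the tags that were output by their own confidence model.
        kept = [t for t in first_tags if TRUST_POLICY.get(t) == "first_model"]
        kept += [t for t in second_tags if TRUST_POLICY.get(t) == "second_model"]
        return kept

    video_tags = set(fourth_tags(["A", "B"], ["C"])) | {"hot tag X"}  # union with the third tag
    # fourth tags -> ["B", "C"]; first tag "A" is dropped because its
    # confidence model is the second model, not the model that output it.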
Before determining the fourth tag among the first and second tags of the video to be identified, the confidence model corresponding to each tag must be determined, that is, the trust policy must be established. In an exemplary embodiment, the process of determining the trust policy may include steps S810 to S840, as shown in fig. 8:
step S810, identifying a first label of a plurality of video samples through the first model as a first test result.
Step S820, identifying second labels of the plurality of video samples through the second model to serve as second test results.
Step S830, calculating a first confidence of the first model for each label included in the first test result, and calculating a second confidence of the second model for each label included in the second test result.
Step 840, for target tags which are coincident in the first test result and the second test result, determining a confidence model corresponding to each target tag from the first model and the second model according to the first confidence degree and the second confidence degree of each target tag, and taking the corresponding relation between the target tags and the confidence models as a credit acquisition strategy corresponding to the target tags.
In step S810, a video sample is a video whose video tags have already been determined; for example, a certain number of videos may be obtained in advance, their video tags determined manually, and each video marked with its tags to obtain the video samples. Inputting the video samples into the first model yields the first test result.
In step S820, the second test result is obtained by inputting the video samples into the second model. The first test result may include multiple first tags, and the second test result multiple second tags.
In step S830, whether each first tag is correct can be verified against the known video tags of the samples, so as to calculate the first confidence of the first model for each tag in the first test result. Confidence is the accuracy of model recognition: the higher the accuracy, the higher the confidence. For example, if 10 of the video samples carry tag A and the first model identifies tag A for all 10, the first model's confidence for tag A is 1; if it identifies tag A for 9 of the 10, the confidence is 0.9. Similarly, the confidence of the second model for each tag in the second test result is calculated as the second confidence.
In step S840, an intersection of the first and second test results is first computed to determine the target tags appearing in both. The first confidence of the first model and the second confidence of the second model for each target tag are then compared, and the model with the higher confidence is taken as that tag's confidence model. For example, if the target tags are A, B, C, and D, the first model's confidences for the four target tags are 0.75, 0.7, 0.7, and 0.8, and the second model's confidences are 0.8, 0.75, 0.6, and 0.75, then the confidence model of tag A is the second model, of tag B the second model, of tag C the first model, and of tag D the first model. After a confidence model has been determined for each target tag, the correspondence between target tags and confidence models is stored as the trust policy. For a tag whose first and second confidences are equal, either model may be selected as the confidence model, for example the first model.
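A sketch of deriving the trust policy from the measured per-tag confidences, reusing the numbers above (the tie rule of keeping the first model is the one just described):

    def build_trust_policy(first_conf: dict, second_conf: dict) -> dict:
        # For the target tags (tags present in both test results), pick the
        # model with the higher confidence; ties go to the first model.
        policy = {}
        for tag in sorted(first_conf.keys() & second_conf.keys()):
            policy[tag] = "first_model" if first_conf[tag] >= second_conf[tag] else "second_model"
        return policy

    first_conf = {"A": 0.75, "B": 0.7, "C": 0.7, "D": 0.8}
    second_conf = {"A": 0.8, "B": 0.75, "C": 0.6, "D": 0.75}
    print(build_trust_policy(first_conf, second_conf))
    # {'A': 'second_model', 'B': 'second_model', 'C': 'first_model', 'D': 'first_model'}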
After the first and second tags of the video to be identified are obtained, the stored trust policy can be queried to determine the confidence model corresponding to each first and second tag: if the model that output a tag differs from its confidence model, the tag does not satisfy the trust policy; if they are the same, it does. For example, suppose the first tags of the video to be identified are A, B, C and the second tags are A, C, E, F, G; the confidence models of A, B, C, E, F, and G can be looked up in the trust policy. If the confidence model of tag A is the second model, then tag A output by the first model does not satisfy the trust policy, while if the confidence model of tag C is the second model, tag C output by the second model does satisfy it.
The tags included in the first and second test results may not cover the entire tag system: for example, the tag system may define 1000 tags while the video samples cover only 100 or 50 of them. For tags not covered by the test results, a probability threshold may be used as the trust policy: a probability threshold is determined, tags whose recognition probability exceeds it are deemed to satisfy the trust policy, and tags that do not exceed it are deemed not to.
After the tags satisfying the trust policy among the first and second tags of the video to be identified are taken as fourth tags, the union of the fourth tags and the third tag may be output as the final video tag of the video. The third tag can also be screened, with only the third and fourth tags that meet a condition used as the final video tags; for example, a threshold may be set, and the third and fourth tags meeting the threshold output as the video tags of the video to be identified.
Taking the union of the first, second, and third tags maximizes the recall rate of the tags. In this embodiment, the first and second tags are additionally screened: the trust policy keeps the tags with higher confidence as the video tags of the video to be identified, preventing low-confidence tags from degrading recognition accuracy, so the accuracy of the video tags is improved while the recall rate is preserved as far as possible.
Fig. 9 schematically shows a system architecture diagram of the tag identification method of this exemplary embodiment. As shown in fig. 9, the system architecture 900 may include a first tag prediction model 901, a second tag prediction model 902, a third tag prediction model 903, and a fusion module 904. The video to be identified can be input into models 901, 902, and 903 simultaneously: the first tag prediction model 901 extracts the multi-modal features of the video and predicts its first tag from them; the second tag prediction model 902 predicts the second tag of the video; the third tag prediction model 903 predicts the third tag; and the fusion module 904 then fuses the predictions of models 901, 902, and 903 to obtain the final video tag of the video to be identified.
The second tag prediction model 902 may include, for example, a classification model 9021 and multiple vertical models, such as a movie and television show identification model 9022, a game identification model 9023, a news tag identification model 9024, and a food tag identification model 9025. The classification model 9021 classifies the video to be identified, predicts its category, and performs gating according to the identified category, distributing the video to the corresponding vertical model: for example, if the category is the game category, the video is sent to the game identification model 9023; if the category is the movie and television category, it is sent to the movie and television show identification model 9022, and so on.
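A sketch of this gating step, dispatching the video by predicted category to a per-category vertical model; the category names and stand-in handlers are illustrative assumptions:

    # Stand-ins for the trained vertical models (9022-9025), each mapping
    # a video to its fine-grained second tags.
    def movie_tv_model(video):  return ["Series A", "Actor A"]
    def game_model(video):      return ["game character B"]
    def news_model(video):      return ["news event N"]
    def food_model(video):      return ["dessert", "egg tart"]

    VERTICAL_MODELS = {
        "movie_tv": movie_tv_model,  # model 9022
        "game": game_model,          # model 9023
        "news": news_model,          # model 9024
        "food": food_model,          # model 9025
    }

    def predict_second_tags(video, classify):
        # Classification model 9021 picks the category; the matching
        # vertical model then produces the second tags.
        category = classify(video)
        handler = VERTICAL_MODELS.get(category)
        return handler(video) if handler else []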
A vertical model performs tag identification for a specific category and provides a finer-grained understanding. Multiple vertical models can be designed according to the tag classification system; for example, with a tag system as shown in fig. 10, four vertical models can be designed: a movie and television show identification model, a game identification model, a news tag identification model, and a food tag identification model.
For example, the movie and television show identification model 9022 may perform face recognition on the video to be identified, determine the persons it contains, and predict its character tags; it may then acquire the series names associated with those persons through the knowledge graph, retrieve the videos of the related series from a series database, compare them with the video to be identified, and determine the series tag corresponding to the video. The series tag can in turn be used to acquire further related information through the knowledge graph, such as the lead actor and actors other than those already identified, and this related information is output to the fusion module 904 together with the series tag, the character tags, and so on, as the second tag. The game identification model 9023 may identify the game name, game character, game skills, and other information of the video to be identified by skill box template matching, and output the result to the fusion module 904 as the second tag. The news tag identification model 9024 can be trained on news events collected in advance, so that it identifies the news event of the video to be identified and outputs it to the fusion module as the second tag. The food tag identification model 9025 can be trained on collected pictures of various dishes; it identifies the food contained in the video to be identified and outputs the food name to the fusion module as the second tag.
The third tag prediction model 903 may predict the new hot tags of the video to be identified through the new hot tag set updated in real time, and output them to the fusion module 904 as the third tag.
The fusion module 904 may compute the union of the first, second, and third tags output by all the models and output the union as the video tag. Fig. 11 schematically shows how the video tags may be presented. As shown in fig. 11, the video tags recognized for video A may include the series name, genre, region, highlight, person name, and the like, and the recognition confidence may also be displayed in the result: for example, the series name tag 1101 may show that the series name "A" was recognized with confidence "1.0", and the genre tag 1102 may show that the genre, for example "behind-the-scenes footage", was recognized with confidence "1.0".
For example, when the output results of the first tag prediction model 901 and the second tag prediction model 902 are fused, the recall rate of model 901 alone, of model 902 alone, and of the combination of their outputs can be measured, and the three recall rates compared so that fusion uses whichever scheme performs best. Specifically, after models 901 and 902 are obtained, a verification data set D may be acquired and its samples identified with each model, yielding recognition results E and F, together with the tag set U obtained by merging the two, i.e., U includes the first tags in result E of model 901 and the second tags in result F of model 902. Then, for a tag X, the recall rate e of model 901, the recall rate f of model 902, and the recall rate u of the merged set U are calculated on the verification data set D and compared; if e is the largest, the fusion module 904 may combine the first tag output by model 901 with the third tag identified by model 903 as the final video tag.
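A sketch of this per-tag recall comparison, assuming ground-truth tag sets are available for the verification set D; the recall definition used (the fraction of true occurrences of tag X that a scheme recovers) is the standard one:

    def recall_for_tag(tag: str, predictions: list, ground_truth: list) -> float:
        # Among samples truly carrying the tag, the fraction for which the
        # scheme predicted it.
        relevant = [i for i, gt in enumerate(ground_truth) if tag in gt]
        if not relevant:
            return 0.0
        hits = sum(1 for i in relevant if tag in predictions[i])
        return hits / len(relevant)

    # E, F: per-sample tag sets from models 901 and 902; U: per-sample unions.
    # e = recall_for_tag("X", E, truth); f = recall_for_tag("X", F, truth)
    # u = recall_for_tag("X", [a | b for a, b in zip(E, F)], truth)
    # Whichever of e, f, u is largest decides how tags are fused for "X".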
In addition, the modules of this embodiment can be split and recombined: for example, the first tag prediction model 901 and the second tag prediction model 902 may jointly predict the tags of a video to obtain the final video tag; the first tag prediction model 901 may instead be combined with the third tag prediction model 903 to identify the video tags; or the third tag prediction model 903 may be combined with other tag prediction models to predict video tags jointly, and so on. The tag identification method provided by this embodiment therefore has strong reusability and flexibility.
To further verify the effectiveness of the present application, the inventors compared the recognition effect of the tag identification method implemented on the system architecture 900 described above with that of the video tag models of other methods. Experiments showed that the other video tag models achieved an accuracy of 82.2% and a recall rate of 65.3%. When only the first tag prediction model 901 and the second tag prediction model 902 of this architecture were used to identify video tags, accuracy improved by 0.7 percentage points to 82.9% and recall by 2.4 points to 67.7%. When the system architecture 900 was adopted as a whole, accuracy improved by 1 point to 83.2% and recall by 20 points to 85.3%, while the response speed to new hot tags was greatly improved, ensuring the timeliness of the tags.
These comparison results show that identifying video tags with the tag identification method of the present application yields more accurate video tags.
Those skilled in the art will appreciate that all or part of the steps implementing the above embodiments may be realized as computer programs executed by a processor (including a CPU and a GPU). When executed by the processor, these programs perform the functions defined by the method provided above. The programs may be stored in a computer-readable storage medium, such as a read-only memory, a magnetic disk, or an optical disc.
Furthermore, it should be noted that the above-mentioned figures are only schematic illustrations of the processes involved in the method according to the exemplary embodiment of the present application, and are not intended to be limiting. It will be readily understood that the processes shown in the above figures are not intended to indicate or limit the chronological order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.
The following introduces the tag identification apparatus provided in the present technical solution:
A tag identification apparatus is provided in this exemplary embodiment. Referring to fig. 12, the tag identification apparatus 1200 includes: a multi-modal feature extraction module 1201, a category identification module 1202, and a new hot tag identification module 1203.
The multi-modal feature extraction module 1201 is configured to extract multi-modal features of a video to be identified and determine a first tag of the video using those features.
The category identification module 1202 is configured to classify the video to be identified, obtain its category, and identify a second tag of the video based on that category.
The new hot tag identification module 1203 is configured to obtain a third tag of the video to be identified through the new hot tag set updated in real time, and determine the video tag of the video by combining the first tag, the second tag, and the third tag.
In an exemplary embodiment of the present application, based on the foregoing embodiments, the category identification module 1202 includes: a movie and television identification module, configured to identify a target object in the video to be identified when the category of the video to be identified is the movie and television category; and a series tag determination module, configured to determine the second tag of the video to be identified based on the target object.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the target object includes a person, and the series tag determination module may include a character recognition module, a knowledge graph module, a similarity calculation module, and a tag output module.
The character recognition module is configured to recognize character features in the video to be identified and determine the character tags of the video to be identified.
The knowledge graph module is configured to acquire the series names associated with the character tags through a knowledge graph model.
The similarity calculation module is configured to calculate the similarity between the video to be identified and the series videos corresponding to the series names, and determine the series tag of the video to be identified.
The tag output module is configured to determine the second tag according to the character tags and the series tag.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the category identification module 1202 is configured to: when the category of the video to be identified is the game category, match the video to be identified against game template data, and determine the second tag of the video to be identified according to the matched game template data.
In an exemplary embodiment of the present application, based on the foregoing embodiment, the game template data includes skill box templates, and the category identification module may include a skill box matching module, a game character determination module, and a game tag determination module.
The skill box matching module is configured to match the video to be identified against multiple skill box templates and determine the target skill box matching the video.
The game character determination module is configured to acquire the target game character associated with the target skill box.
The game tag determination module is configured to determine the second tag according to the target game character.
In an exemplary embodiment of the present application, based on the foregoing embodiments, the new hot tag identification module 1203 may include a video data acquisition module, a video similarity calculation module, and a new hot tag determination module.
The video data acquisition module is configured to acquire the new hot tag set updated in real time and the video data corresponding to it.
The video similarity calculation module is configured to calculate the similarity between the title of the video to be identified and the titles of the video data, and screen out of the video data the target videos whose similarity meets a preset threshold.
The new hot tag determination module is configured to determine the third tag of the video to be identified according to the new hot tags corresponding to the target videos.
In an exemplary embodiment of the present application, based on the foregoing embodiments, the new hot tag identification module may be configured to: determine, according to the trust policies corresponding to the first tag and the second tag respectively, the tags among them that satisfy their trust policies as a fourth tag; and determine the video tag of the video to be identified according to the fourth tag and the third tag.
In an exemplary embodiment of the present application, the new hot tag identification module may be configured to: for each first tag, determine the first tag as a fourth tag if the model that output it is the confidence model corresponding to that tag, where the confidence model corresponding to a first tag is the model whose confidence for that tag satisfies the tag's trust policy; and for each second tag, determine the second tag as a fourth tag if the model that output it is the confidence model corresponding to that tag, where the confidence model corresponding to a second tag is likewise the model whose confidence for that tag satisfies the tag's trust policy.
In an exemplary embodiment of the present application, based on the foregoing embodiment, determining the first tag of the video to be identified using the multi-modal features includes: extracting the multi-modal features of the video to be identified through a first model and determining the first tag from those features. Identifying the second tag of the video to be identified based on its category includes: identifying the video with a second model corresponding to its category to obtain the second tag. The apparatus further includes a first test result module, a second test result module, a confidence test module, and a trust policy determination module.
The first test result module is configured to identify the first tags of a plurality of video samples through the first model, as the first test result.
The second test result module is configured to identify the second tags of the plurality of video samples through the second model, as the second test result.
The confidence test module is configured to calculate a first confidence of the first model for each tag contained in the first test result, and a second confidence of the second model for each tag contained in the second test result.
The trust policy determination module is configured to determine, for the target tags appearing in both the first and second test results, the confidence model corresponding to each target tag from the first model and the second model according to its first and second confidences, and to take the correspondence between target tags and confidence models as the trust policy for those tags.
The specific details of each module or unit in the above tag identification apparatus have been described in detail in the corresponding tag identification method, and therefore are not described herein again.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of the units do not in any case limit the units themselves.
As another aspect, the present application also provides a computer-readable medium, which may be contained in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable medium carries one or more programs, which when executed by the electronic device, cause the electronic device to implement the tag identification method described in the above embodiments.
For example, the electronic device may implement the following as shown in fig. 3: s310, extracting multi-modal characteristics of a video to be recognized, and determining a first label of the video to be recognized by utilizing the multi-modal characteristics; s320, classifying and identifying the video to be identified to obtain the category of the video to be identified, and identifying a second label of the video to be identified based on the category of the video to be identified; and S330, acquiring a third label of the video to be identified through a new hot label set updated in real time, and determining the video label of the video to be identified by combining the first label, the second label and the third label.
As another example, the electronic device may implement the steps shown in fig. 4-7.
It should be noted that although several modules or units of the device for action execution are mentioned in the detailed description above, such a division is not mandatory. Indeed, according to embodiments of the present application, the features and functions of two or more of the modules or units described above may be embodied in a single module or unit, and conversely the features and functions of one module or unit described above may be further divided among multiple modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.