Artificial intelligence-based video processing method, apparatus, device, and storage medium

Document No.: 8331  Publication date: 2021-09-17  Views: 19  Language: Chinese

1. An artificial intelligence-based video processing method, characterized by comprising the following steps:

performing frame extraction and shot segmentation on a video to be processed to obtain at least one image to be processed;

performing target detection on a current image to be processed in the at least one image to be processed using a target detection model, to obtain at least one detection region corresponding to at least one type of detection object; the target detection model is configured to detect, from the current image to be processed, the at least one detection region occupied by the at least one type of detection object;

determining a subject detection region from the at least one detection region based on the size information of each of the at least one detection region;

determining a scene recognition result of the current image to be processed based on the subject size information and the detection object of the subject detection region, thereby obtaining a scene recognition result of each image to be processed in the at least one image to be processed; the scene recognition result represents the distance between the image content and the visual starting point;

and performing intelligent processing of the video to be processed based on the scene recognition result of each image to be processed.
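The frame extraction and shot segmentation step above can be sketched as follows. This is a minimal illustration, not the patented implementation: frames are represented by normalized color histograms, shot boundaries are detected with a simple histogram-difference heuristic, and the first frame of each shot is sampled as an image to be processed. The function names and the threshold value are assumptions.

```python
from typing import List, Sequence


def frame_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two normalized frame histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def split_shots(histograms: List[Sequence[float]], threshold: float = 0.5) -> List[int]:
    """Return the frame indices at which a new shot starts (index 0 always starts a shot)."""
    boundaries = [0]
    for i in range(1, len(histograms)):
        if frame_difference(histograms[i - 1], histograms[i]) > threshold:
            boundaries.append(i)
    return boundaries


def sample_one_per_shot(boundaries: List[int]) -> List[int]:
    """Pick the first frame of each shot as an image to be processed."""
    return list(boundaries)
```

In practice the histograms would come from decoded video frames (e.g., via OpenCV); the list-based form here keeps the sketch self-contained.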

2. The method of claim 1, wherein the subject size information comprises a subject region height and a subject region width; the at least one type of detection object comprises a person object; and the determining the scene recognition result of the current image to be processed based on the subject size information and the detection object of the subject detection region comprises:

when the detection object is the person object, obtaining a size ratio and a first area of the subject detection region from the subject region height and the subject region width;

when the first area is greater than or equal to a first preset area threshold, determining the scene recognition result according to the size ratio; or,

when the first area is smaller than the first preset area threshold and greater than or equal to a second preset area threshold, determining the scene recognition result based on the subject region height or the subject region width, the second preset area threshold being smaller than the first preset area threshold; or,

and when the first area is smaller than the second preset area threshold, determining the scene recognition result as an extreme long shot and marking the current image to be processed as an image that does not meet the preset result.

3. The method of claim 2, wherein the size ratio is a ratio of the subject region height to the subject region width; and the determining the scene recognition result according to the size ratio comprises:

when the size ratio is greater than a first preset size-ratio threshold and less than or equal to a second preset size-ratio threshold, determining the scene recognition result as a face close-up, the second preset size-ratio threshold being greater than the first preset size-ratio threshold; or,

when the size ratio is greater than the second preset size-ratio threshold and less than or equal to a third preset size-ratio threshold, determining the scene recognition result as a person close shot, the third preset size-ratio threshold being greater than the second preset size-ratio threshold; or,

when the size ratio is greater than the third preset size-ratio threshold and less than or equal to a fourth preset size-ratio threshold, determining the scene recognition result as a person full shot, the fourth preset size-ratio threshold being greater than the third preset size-ratio threshold; or,

and when the size ratio is greater than the fourth preset size-ratio threshold, determining the scene recognition result as a long shot.

4. The method of claim 3, wherein the determining the scene recognition result based on the subject region height or the subject region width comprises:

when the subject region height or the subject region width is greater than or equal to a preset first side-length threshold, determining the scene recognition result as a person full shot; or,

when the subject region height or the subject region width is smaller than the preset first side-length threshold and greater than or equal to a preset second side-length threshold, determining whether the size ratio is greater than or equal to the second preset size-ratio threshold;

when the size ratio is greater than or equal to the second preset size-ratio threshold, determining the scene recognition result as a long shot; or,

when the size ratio is smaller than the second preset size-ratio threshold, determining the scene recognition result as an extreme long shot; or,

and when the subject region height or the subject region width is smaller than the preset second side-length threshold, determining the scene recognition result as an extreme long shot and marking the current image to be processed as an image that does not meet the preset result.
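The decision tree of claims 2 to 4 can be sketched in code as follows. All threshold values are hypothetical placeholders (the claims deliberately leave the concrete values open), box sizes are expressed as fractions of the image, and reading "the subject region height or the subject region width" as max(height, width) is an assumption.

```python
# Hypothetical thresholds; the claims only fix their relative ordering.
# Areas and side lengths are fractions of the image; ratio = height / width.
AREA_T1, AREA_T2 = 0.25, 0.05          # first / second preset area thresholds
RATIO_T1, RATIO_T2, RATIO_T3, RATIO_T4 = 0.8, 1.5, 2.5, 3.5
SIDE_T1, SIDE_T2 = 0.5, 0.2            # first / second preset side-length thresholds


def classify_person(h: float, w: float):
    """Return (scene_label, keep) following the decision tree of claims 2-4.

    `keep` is False when the image is marked as not meeting the preset result.
    """
    area, ratio, side = h * w, h / w, max(h, w)
    if area >= AREA_T1:                            # claim 3: classify by size ratio
        if RATIO_T1 < ratio <= RATIO_T2:
            return "face close-up", True
        if RATIO_T2 < ratio <= RATIO_T3:
            return "person close shot", True
        if RATIO_T3 < ratio <= RATIO_T4:
            return "person full shot", True
        if ratio > RATIO_T4:
            return "long shot", True
        return "face close-up", True               # ratio <= RATIO_T1: unspecified; treated as close-up
    if area >= AREA_T2:                            # claim 4: classify by side length
        if side >= SIDE_T1:
            return "person full shot", True
        if side >= SIDE_T2:
            return ("long shot", True) if ratio >= RATIO_T2 else ("extreme long shot", True)
        return "extreme long shot", False          # below second side-length threshold: mark
    return "extreme long shot", False              # claim 2: area too small, mark and skip
```

The point of the sketch is that the whole classifier is a handful of comparisons on one bounding box, which is what lets the method avoid a dedicated scene recognition network.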

5. The method of claim 2, wherein the subject size information comprises a subject region height and a subject region width; the at least one type of detection object comprises an object; and the determining the scene recognition result of the current image to be processed based on the subject size information and the detection object of the subject detection region comprises:

when the detection object is the object, obtaining a second area of the subject detection region from the subject region height and the subject region width;

when the second area is greater than or equal to a preset third area threshold, determining the scene recognition result as an object close-up; or,

when the second area is smaller than the preset third area threshold and greater than or equal to a preset fourth area threshold, determining the scene recognition result as an object close shot, the preset fourth area threshold being smaller than the preset third area threshold; or,

when the second area is smaller than the preset fourth area threshold and greater than or equal to the second preset area threshold, determining the scene recognition result as an object full shot, the preset fourth area threshold being greater than the second preset area threshold; or,

and when the second area is smaller than the second preset area threshold, determining the scene recognition result as an object long shot.
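Claim 5's area-only cascade for non-person objects can be sketched similarly. The threshold values here are again hypothetical placeholders that merely satisfy the ordering the claim requires (third > fourth > second preset area threshold):

```python
# Hypothetical thresholds, as fractions of the image area.
PRESET_THIRD_AREA = 0.40
PRESET_FOURTH_AREA = 0.15
SECOND_PRESET_AREA = 0.05


def classify_object(height: float, width: float) -> str:
    """Map a non-person subject box to a scene label following claim 5's cascade."""
    area = height * width
    if area >= PRESET_THIRD_AREA:
        return "object close-up"
    if area >= PRESET_FOURTH_AREA:
        return "object close shot"
    if area >= SECOND_PRESET_AREA:
        return "object full shot"
    return "object long shot"
```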

6. The method of claim 1, wherein after the performing frame extraction and shot segmentation on the video to be processed to obtain at least one image to be processed, the method further comprises:

when target detection is performed on the current image to be processed in the at least one image to be processed using the target detection model and no detection region is detected, marking the current image to be processed as not meeting the preset result, and not using the current image to be processed for video processing.

7. The method according to any one of claims 1 to 6, wherein before the target detection is performed on each image to be processed in the at least one image to be processed using the target detection model, the method further comprises:

performing target detection on a training sample image set using an initial target detection model, and determining, from the training sample image set, sample images that do not meet a preset result;

taking the sample images that do not meet the preset result as incremental training samples and obtaining the labeling results of the incremental training samples, so as to obtain an incremental training sample set;

and iteratively training the initial target detection model based on the incremental training sample set and the training sample image set to obtain the target detection model.
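The incremental training procedure of claim 7 amounts to hard-example mining: images the initial detector fails on are annotated and folded back into training. A schematic sketch with stubbed-out training and annotation callables (all names here are illustrative, not from the source):

```python
from typing import Any, Callable, List, Tuple


def build_incremental_set(detector: Callable[[Any], list],
                          sample_images: List[Any],
                          annotate: Callable[[Any], Any]) -> List[Tuple[Any, Any]]:
    """Run the initial detector over the training sample images; images where no
    region is detected become incremental samples, which are then annotated."""
    hard = [img for img in sample_images if not detector(img)]
    return [(img, annotate(img)) for img in hard]


def iterative_train(train_step: Callable[[list, Any], Any],
                    base_set: List[Tuple[Any, Any]],
                    incremental_set: List[Tuple[Any, Any]],
                    rounds: int = 3) -> Any:
    """Iteratively retrain on the union of the base and incremental sets."""
    model = None
    for _ in range(rounds):
        model = train_step(base_set + incremental_set, model)
    return model
```

The design point is that labeling effort concentrates on the detector's failure cases rather than on a freshly collected mass dataset, which is the efficiency claim the Background section makes.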

8. An artificial intelligence-based video processing apparatus, comprising:

the video frame extraction module, configured to perform frame extraction and shot segmentation on a video to be processed to obtain at least one image to be processed;

the target detection module, configured to perform target detection on a current image to be processed in the at least one image to be processed using a target detection model, to obtain at least one detection region corresponding to at least one type of detection object; the target detection model is configured to detect, from the current image to be processed, the at least one detection region occupied by the at least one type of detection object;

the scene recognition module, configured to determine a subject detection region from the at least one detection region according to the size information of each detection region in the at least one detection region; and to determine a scene recognition result of the current image to be processed based on the subject size information and the detection object of the subject detection region, thereby obtaining a scene recognition result of each image to be processed in the at least one image to be processed; the scene recognition result represents the distance between the image content and the visual starting point;

and the video processing module, configured to perform intelligent processing of the video to be processed based on the scene recognition result of each image to be processed.

9. An electronic device, comprising:

a memory for storing executable instructions;

a processor, configured to implement the method of any one of claims 1 to 7 when executing the executable instructions stored in the memory.

10. A storage medium storing executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.

Background

At present, in video processing technology, such as video cover image generation and automatic highlight-video generation, the ability to recognize the scene scale (near versus far) of a video image is critically important. For example, when generating a cover image, the generation service often needs rich image material spanning long shots, medium shots, full shots, and close shots, rather than material consisting only of large face crops. In video generation, a story-layered video is typically assembled by alternating segments, with a panoramic or long-shot segment as the opening scene, person close shots as transitions, and person close-ups as key segments. The currently common approach is to train a conventional deep-learning near/far scene recognition model on massive labeled data, which involves manually defining the scene categories, collecting a large number of images from scratch, manual labeling, label cleaning, and training the recognition model. This approach requires a large amount of manual effort for collecting and labeling massive data in order to reach high recognition accuracy, so the workload of model training is large, training takes a long time, and the efficiency of model training, and hence of video processing, is reduced. Moreover, when the existing method is applied to a video analysis task, the near/far scene recognition model must analyze an extremely large number of images, the analysis task is complex, and the computational load of the video analysis task grows, further reducing video processing efficiency.

Disclosure of Invention

Embodiments of the present application provide an artificial intelligence-based video processing method, apparatus, device, and storage medium, which can improve the efficiency of video processing.

The technical scheme of the embodiment of the application is realized as follows:

An embodiment of the present application provides an artificial intelligence-based video processing method, comprising the following steps:

performing frame extraction and shot segmentation on a video to be processed to obtain at least one image to be processed;

performing target detection on a current image to be processed in the at least one image to be processed using a target detection model, to obtain at least one detection region corresponding to at least one type of detection object; the target detection model is configured to detect, from the current image to be processed, the at least one detection region occupied by the at least one type of detection object;

determining a subject detection region from the at least one detection region based on the size information of each of the at least one detection region;

determining a scene recognition result of the current image to be processed based on the subject size information and the detection object of the subject detection region, thereby obtaining a scene recognition result of each image to be processed in the at least one image to be processed; the scene recognition result represents the distance between the image content and the visual starting point;

and performing intelligent processing of the video to be processed based on the scene recognition result of each image to be processed.

An embodiment of the present application provides an artificial intelligence-based video processing apparatus, comprising:

the video frame extraction module, configured to perform frame extraction and shot segmentation on a video to be processed to obtain at least one image to be processed;

the target detection module, configured to perform target detection on a current image to be processed in the at least one image to be processed using a target detection model, to obtain at least one detection region corresponding to at least one type of detection object; the target detection model is configured to detect, from the current image to be processed, the at least one detection region occupied by the at least one type of detection object;

the scene recognition module, configured to determine a subject detection region from the at least one detection region according to the size information of each detection region in the at least one detection region; and to determine a scene recognition result of the current image to be processed based on the subject size information and the detection object of the subject detection region, thereby obtaining a scene recognition result of each image to be processed in the at least one image to be processed; the scene recognition result represents the distance between the image content and the visual starting point;

and the video processing module, configured to perform intelligent processing of the video to be processed based on the scene recognition result of each image to be processed.

In the above apparatus, the subject size information comprises a subject region height and a subject region width, and the at least one type of detection object comprises a person object. The scene recognition module is further configured to: when the detection object is the person object, obtain a size ratio and a first area of the subject detection region from the subject region height and the subject region width; when the first area is greater than or equal to a first preset area threshold, determine the scene recognition result according to the size ratio; or, when the first area is smaller than the first preset area threshold and greater than or equal to a second preset area threshold, determine the scene recognition result based on the subject region height or the subject region width, the second preset area threshold being smaller than the first preset area threshold; or, when the first area is smaller than the second preset area threshold, determine the scene recognition result as an extreme long shot and mark the current image to be processed as an image that does not meet the preset result.

In the above apparatus, the size ratio is a ratio of the subject region height to the subject region width. The scene recognition module is further configured to: when the size ratio is greater than a first preset size-ratio threshold and less than or equal to a second preset size-ratio threshold, determine the scene recognition result as a face close-up, the second preset size-ratio threshold being greater than the first preset size-ratio threshold; or, when the size ratio is greater than the second preset size-ratio threshold and less than or equal to a third preset size-ratio threshold, determine the scene recognition result as a person close shot, the third preset size-ratio threshold being greater than the second preset size-ratio threshold; or, when the size ratio is greater than the third preset size-ratio threshold and less than or equal to a fourth preset size-ratio threshold, determine the scene recognition result as a person full shot, the fourth preset size-ratio threshold being greater than the third preset size-ratio threshold; or, when the size ratio is greater than the fourth preset size-ratio threshold, determine the scene recognition result as a long shot.

In the above apparatus, the scene recognition module is further configured to: determine the scene recognition result as a person full shot when the subject region height or the subject region width is greater than or equal to a preset first side-length threshold; or, when the subject region height or the subject region width is smaller than the preset first side-length threshold and greater than or equal to a preset second side-length threshold, determine whether the size ratio is greater than or equal to the second preset size-ratio threshold, determining the scene recognition result as a long shot when it is, and as an extreme long shot when it is not; or, when the subject region height or the subject region width is smaller than the preset second side-length threshold, determine the scene recognition result as an extreme long shot and mark the current image to be processed as not meeting the preset result.

In the above apparatus, the subject size information comprises a subject region height and a subject region width, and the at least one type of detection object comprises an object. The scene recognition module is further configured to: when the detection object is the object, obtain a second area of the subject detection region from the subject region height and the subject region width; when the second area is greater than or equal to a preset third area threshold, determine the scene recognition result as an object close-up; or, when the second area is smaller than the preset third area threshold and greater than or equal to a preset fourth area threshold, determine the scene recognition result as an object close shot, the preset fourth area threshold being smaller than the preset third area threshold; or, when the second area is smaller than the preset fourth area threshold and greater than or equal to the second preset area threshold, determine the scene recognition result as an object full shot, the preset fourth area threshold being greater than the second preset area threshold; or, when the second area is smaller than the second preset area threshold, determine the scene recognition result as an object long shot.

In the above apparatus, the target detection model is further configured to, after frame extraction and shot segmentation of the video to be processed yield the at least one image to be processed, perform target detection on the current image to be processed in the at least one image to be processed; when no detection region is detected, the current image to be processed is marked as not meeting the preset result and is not used for video processing.

In the above apparatus, the artificial intelligence-based video processing apparatus further comprises a model training module, configured to, before the target detection model performs target detection on each image to be processed in the at least one image to be processed, perform target detection on a training sample image set using an initial target detection model and determine, from the training sample image set, sample images that do not meet a preset result; take the sample images that do not meet the preset result as incremental training samples and obtain their labeling results, so as to obtain an incremental training sample set; and iteratively train the initial target detection model based on the incremental training sample set and the training sample image set to obtain the target detection model.

An embodiment of the present application provides an electronic device, including:

a memory for storing executable instructions;

and a processor, configured to implement the artificial intelligence-based video processing method provided by the embodiments of the present application when executing the executable instructions stored in the memory.

An embodiment of the present application provides a storage medium storing executable instructions which, when executed by a processor, implement the artificial intelligence-based video processing method provided by the embodiments of the present application.

The embodiment of the application has the following beneficial effects:

At least one detection region is obtained through a pre-trained target detection model, and the scene recognition result of the current image to be processed is determined based on the size information and the detection object of the subject detection region among the at least one detection region. This removes the need to train a scene recognition model from scratch and eliminates the per-image inference workload such a model would add during video processing, thereby improving video processing efficiency.
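The per-image flow described above (detect, pick the subject region by size, classify by object type) can be sketched as follows. Choosing the subject as the largest-area box is an assumed interpretation of "based on the size information", and the per-type classifier callables are passed in rather than fixed:

```python
from typing import Callable, List, Optional


def pick_subject(regions: List[dict]) -> Optional[dict]:
    """Choose the subject detection region: here, the box with the largest area."""
    if not regions:
        return None
    return max(regions, key=lambda r: r["h"] * r["w"])


def recognize_scene(regions: List[dict],
                    classify_person: Callable[[float, float], str],
                    classify_object: Callable[[float, float], str]) -> Optional[str]:
    """Return the frame's scene label, or None when no region was detected
    (the frame is then marked as not meeting the preset result)."""
    subject = pick_subject(regions)
    if subject is None:
        return None
    classify = classify_person if subject["cls"] == "person" else classify_object
    return classify(subject["h"], subject["w"])
```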

Drawings

FIG. 1 is an alternative schematic block diagram of an architecture of an artificial intelligence based video processing system according to an embodiment of the present application;

FIG. 2 is a schematic diagram of an alternative structure of an artificial intelligence based video processing apparatus according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of an alternative artificial intelligence based video processing method according to an embodiment of the present application;

FIG. 4 is a diagram illustrating an effect of an alternative different scene recognition result provided by an embodiment of the present application;

FIG. 5 is a schematic flow chart of an alternative artificial intelligence based video processing method according to an embodiment of the present application;

FIG. 6 is a schematic diagram illustrating an effect of comparing different scenes of a character with a picture relationship provided in an embodiment of the present application;

FIG. 7 is a schematic flow chart of an alternative artificial intelligence based video processing method according to an embodiment of the present application;

FIG. 8 is a schematic flow chart of an alternative artificial intelligence based video processing method according to an embodiment of the present application;

FIG. 9 is a block diagram of an alternative functional block of an artificial intelligence based video processing system according to an embodiment of the present application;

Fig. 10 is a schematic diagram of an alternative process of training a target detection model according to an embodiment of the present application.

Detailed Description

To make the objectives, technical solutions, and advantages of the present application clearer, the present application is described in further detail below with reference to the accompanying drawings. The described embodiments should not be considered as limiting the present application, and all other embodiments obtained by a person of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.

In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.

Where the terms "first", "second", and "third" appear in the specification, they are used merely to distinguish similar items and do not indicate a particular ordering of items. It should be understood that "first", "second", and "third" may be interchanged in a suitable order where permitted, so that the embodiments of the application described herein can be practiced in an order other than that illustrated or described herein.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.

Before further detailed description of the embodiments of the present application, terms and expressions referred to in the embodiments of the present application will be described, and the terms and expressions referred to in the embodiments of the present application will be used for the following explanation.

1) Deep learning technology: a machine learning technique using a deep neural network system.

2) Labeling: drawing a rectangular labeling box around an object in an image and assigning the object label corresponding to the box, i.e., the labeling type.

3) Recognition model: a mathematical model obtained by machine learning from labeled sample data (correspondences between pictures and specified labels). The model's parameters are obtained during training; at prediction time the trained parameters are loaded and the model computes the probability that an input sample belongs to a given object label within a specified range.

4) Detection model: a mathematical model obtained by machine learning from labeled sample data (pictures together with the correspondences between a number of specified labeling boxes and labels). The model's parameters are obtained during training; at prediction time the trained parameters are loaded and the model predicts, for an input sample, bounding boxes containing real objects and the probability that each box belongs to a given object label within a specified range.
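A detection model's output, as described above, can be represented by a minimal structure like the following (the field names and the normalized-coordinate convention are illustrative, not from the source):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    """One predicted box: the object label, its predicted probability, and the
    box geometry, here given as fractions of the image."""
    label: str
    score: float
    x: float
    y: float
    w: float
    h: float


def keep_confident(detections: List[Detection], min_score: float = 0.5) -> List[Detection]:
    """Discard low-probability predictions before any downstream use."""
    return [d for d in detections if d.score >= min_score]
```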

5) Detection model labeling: drawing a rectangular labeling box around each object in an image and assigning the object label corresponding to the box, i.e., the labeling type, as used for training the detection model.

6) Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.

7) Computer Vision technology (Computer Vision, CV): computer vision is the science of studying how to make machines "see": using cameras and computers in place of human eyes to identify, track, and measure targets, and further performing image processing so that the processed image is better suited for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can capture information from images or multidimensional data. Computer vision technologies generally include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technologies, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, and also include common biometric technologies such as face recognition and fingerprint recognition.

8) Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.

With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.

9) Scene scale: the scene scale is the difference in the range a subject occupies in the camera frame, caused by the different distances between the video camera and the subject. Scene scales are generally divided into five categories which, from near to far, are close-up, close shot, medium shot, panorama (full shot) and long shot (distant view).

The scheme provided by the embodiments of the present application relates to artificial intelligence technologies such as image detection and image recognition, and is specifically explained by the following embodiments. The embodiments of the present application provide a video processing method, apparatus, device and storage medium based on artificial intelligence, which can improve the efficiency of video processing. An exemplary application of the electronic device provided in the embodiments of the present application is described below. The electronic device provided in the embodiments of the present application can be implemented as various types of user terminals such as a notebook computer, a tablet computer, a desktop computer, a set-top box, or a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, or a portable game device), and can also be implemented as a server. In the following, an exemplary application in which the device is implemented as a terminal will be explained.

Referring to fig. 1, fig. 1 is an alternative architecture diagram of an artificial intelligence based video processing system 100 provided in this embodiment of the present application, in which terminals (a terminal 400-1 and a terminal 400-2 are exemplarily shown) are connected to a server 200 through a network 300, and the network 300 may be a wide area network or a local area network, or a combination of both networks.

The terminal 400-1 is configured to obtain a video to be processed from the database 500 through the server 200, and perform frame extraction and shot division on the video to be processed to obtain at least one image to be processed; use a target detection model to perform target detection on a current image to be processed in the at least one image to be processed to obtain at least one detection area corresponding to at least one type of detection object; determine a subject detection region from the at least one detection region based on the size information of each of the at least one detection region; determine a scene recognition result of the current image to be processed based on the subject size information and the detection object of the subject detection region, and further obtain the scene recognition result of each image to be processed in the at least one image to be processed, where the scene recognition result represents the distance between the image content and the visual starting point; and implement intelligent processing of the video to be processed based on the scene recognition result of each image to be processed, and display the processing result of the video to be processed, such as an intelligently clipped video or a video cover map, on the graphical interface 410-1. The terminal 400-1 is further configured to send the processing result to the server 200 through the network 300, so that the server 200 pushes the processing result to the terminal 400-2 for display on the graphical interface 410-2 of the terminal 400-2. The server 200 is configured to respond to the acquisition request of the terminal 400-1 by sending the video to be processed from the database 500 to the terminal 400-1, receive the processing result of the video to be processed from the terminal 400-1, and push the processing result to the terminal 400-2 through the network 300.

In some embodiments, the server 200 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a CDN, and big data and artificial intelligence platforms. The terminals 400-1 and 400-2 may each be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiments of the present application.

Referring to fig. 2, fig. 2 is a schematic structural diagram of the terminal 400-1 according to an embodiment of the present application. The terminal 400-1 shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in the terminal 400-1 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connection and communication among these components. In addition to a data bus, the bus system 440 includes a power bus, a control bus, and a status signal bus. However, for clarity of illustration, the various buses are all labeled as the bus system 440 in fig. 2.

The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.

The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable the presentation of media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.

The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard disk drives, optical disk drives, and the like. Memory 450 optionally includes one or more storage devices physically located remote from processor 410.

The memory 450 includes either volatile memory or nonvolatile memory, and may include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in embodiments herein is intended to comprise any suitable type of memory.

In some embodiments, memory 450 is capable of storing data, examples of which include programs, modules, and data structures, or a subset or superset thereof, to support various operations, as exemplified below.

An operating system 451, including system programs for handling various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and handling hardware-based tasks;

a network communication module 452 for communicating to other computing devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: bluetooth, wireless compatibility authentication (WiFi), and Universal Serial Bus (USB), etc.;

a presentation module 453 for enabling presentation of information (e.g., user interfaces for operating peripherals and displaying content and information) via one or more output devices 431 (e.g., display screens, speakers, etc.) associated with user interface 430;

an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.

In some embodiments, the apparatus provided by the embodiments of the present application may be implemented in software, and fig. 2 shows an artificial intelligence based video processing apparatus 455 stored in a memory 450, which may be software in the form of programs and plug-ins, and the like, and includes the following software modules: a video framing module 4551, an object detection module 4552, a scene recognition module 4553 and a video processing module 4554, which are logical and thus may be arbitrarily combined or further divided according to the functions implemented.

The functions of the respective modules will be explained below.

In other embodiments, the artificial intelligence based video processing apparatus (hereinafter, referred to as a video processing apparatus) provided in this embodiment may be implemented in hardware, and by way of example, the apparatus provided in this embodiment may be a processor in the form of a hardware decoding processor, which is programmed to execute the artificial intelligence based video processing method provided in this embodiment, for example, the processor in the form of the hardware decoding processor may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), or other electronic components.

The video processing method based on artificial intelligence provided by the embodiment of the present application will be described in conjunction with exemplary applications and implementations of the terminal provided by the embodiment of the present application.

Referring to fig. 3, fig. 3 is an alternative flow chart diagram of an artificial intelligence based video processing method provided by the embodiment of the present application, which will be described in conjunction with the steps shown in fig. 3.

S101, performing frame extraction and lens division processing on a video to be processed to obtain at least one image to be processed;

In the embodiment of the present application, the video processing apparatus may perform shot division on the video to be processed to obtain at least one shot, where each shot in the at least one shot includes at least one candidate picture. The video processing apparatus extracts a preset number of pictures from the at least one candidate picture included in each shot as images to be processed, so that at least one image to be processed can be extracted from the at least one shot.

In some embodiments, the video processing apparatus may perform shot division on the video to be processed through the open-source Python shot-segmentation library scenedetect v5.0 to obtain a plurality of shots corresponding to the video to be processed, where each shot includes a plurality of pictures. Since the pictures within the same shot often show the same subject, in order to reduce the amount of calculation, the video processing apparatus extracts the middle two frames from each shot as images to be processed, thereby obtaining at least one image to be processed from the plurality of shots.
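The middle-frame extraction described above can be sketched as a small helper. This is an illustrative sketch only: the function and variable names are assumptions, not part of the disclosure, and the shot boundaries are assumed to come from a shot-division tool such as scenedetect.

```python
def middle_two_frames(shot_start, shot_end):
    """Return the indices of the middle two frames of one shot.

    shot_start/shot_end are inclusive frame indices of the shot;
    the names are illustrative, not taken from the disclosure.
    """
    mid = (shot_start + shot_end) // 2
    return [mid, min(mid + 1, shot_end)]


def extract_frames(shots):
    """shots: list of (start, end) frame-index pairs, one per shot.

    Returns the to-be-processed frame indices: the middle two
    frames of every shot, as described above.
    """
    images = []
    for start, end in shots:
        images.extend(middle_two_frames(start, end))
    return images


# e.g. a video split into three shots
print(extract_frames([(0, 99), (100, 149), (150, 151)]))
# [49, 50, 124, 125, 150, 151]
```

Taking only two frames per shot keeps the downstream detection workload proportional to the number of shots rather than the number of frames.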

S102, performing target detection on a current image to be processed in at least one image to be processed by using a target detection model to obtain at least one detection area corresponding to at least one type of detection object; the target detection model is used for detecting at least one detection area occupied by at least one type of detection object from the current image to be processed.

In this embodiment, the video processing apparatus may perform target detection on each to-be-processed image in the at least one to-be-processed image using a multi-classification target detection model. For a current image to be processed in at least one image to be processed, the video processing apparatus may extract image features from the current image to be processed by using a target detection model, identify and predict an image region including at least one type of detection object in the current image to be processed based on the extracted image features and at least one type of preset detection object, and finally output at least one detection region corresponding to the at least one type of detection object through the target detection model, thereby completing target detection on the current image to be processed.

In the embodiment of the present application, the area information of the at least one detection area includes the height and width of the detection area, the position coordinates of the detection area in the image to be processed, and the confidence level that the detection area corresponds to each type of detection object in the at least one type of detection object, respectively, that is, the probability that each type of detection object is included in the detection area.
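The region information listed above can be modeled as a simple record. The following is a hedged sketch in which the class name, field names and example confidences (`DetectionRegion`, `scores`, etc.) are illustrative assumptions, not names from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class DetectionRegion:
    """One detection area: height, width, position, and per-class confidence."""
    height: float                  # region height in pixels
    width: float                   # region width in pixels
    position: Tuple[float, float]  # top-left coordinates in the image to be processed
    # confidence that the region contains each type of detection object
    scores: Dict[str, float] = field(default_factory=dict)

    def detected_object(self) -> str:
        """The object type with the highest confidence for this region."""
        return max(self.scores, key=self.scores.get)


region = DetectionRegion(40, 20, (10, 10), {"person": 0.92, "cat": 0.05})
print(region.detected_object())  # person
```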

S103, determining a main body detection area from the at least one detection area according to the size information of each detection area in the at least one detection area.

In this embodiment of the application, the video processing apparatus may determine a main body detection area in the current image to be processed according to size information of each detection area in the at least one detection area, where image content in the main body detection area represents main body content of the current image to be processed.

In some embodiments, the dimensional information may be a detection region height and width; the video processing apparatus may calculate the area of each detection region from the height and width of the detection region, and may determine a detection region having the largest area from among the at least one detection region as the main body detection region. In other embodiments, the video processing apparatus may determine, according to the aspect ratio of each detection region, a detection region with an aspect ratio within a preset ratio range as a main detection region, specifically select the detection region according to actual conditions, and the embodiment of the present application is not limited thereto.
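The largest-area rule above (one of the two options mentioned) can be sketched as follows; the function name and data layout are illustrative assumptions.

```python
def select_subject_region(regions):
    """Pick the detection region with the largest area (height * width)
    as the subject detection region.

    regions: iterable of (height, width) pairs. This sketches only the
    area-based rule, not the aspect-ratio variant also mentioned above.
    """
    return max(regions, key=lambda r: r[0] * r[1])


# The 100 x 80 region has the largest area, so it becomes the subject.
print(select_subject_region([(40, 20), (100, 80), (30, 30)]))  # (100, 80)
```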

S104, determining a scene identification result of the current image to be processed based on the main body size information and the detection object of the main body detection area, and further obtaining the scene identification result of each image to be processed in at least one image to be processed; and the scene identification result represents the distance between the image content and the visual starting point.

In the embodiment of the application, because the main body detection area represents the main body content in the current image to be processed, the video processing device can determine the scene identification result of the current image to be processed according to the determined size information of the main body detection area and the prior knowledge of the size information of the detection object. The video processing device performs the same processing on each image to be processed in at least one image to be processed, and can obtain the scene identification result of each image to be processed.

In the embodiment of the present application, the scene recognition result may describe the size and range of the shot subject and of the image presented within the frame structure of the screen, and represents the distance from the image content to the visual starting point. In some embodiments, the scene recognition result may be a big long shot, a long shot, a panorama, a close shot, or a close-up.

In some embodiments, the big long shot, long shot, panorama, close shot and close-up may be as shown in FIG. 4.

And S105, intelligently processing the video to be processed based on the scene recognition result of each image to be processed.

In the embodiment of the present application, based on the scene recognition result of each image to be processed, the video processing apparatus may select images to be processed of different scene scales for intelligent video editing, thereby realizing intelligent processing of the video to be processed; the video processing apparatus may also, according to the requirements of the practical application, select a suitable target image to be processed based on the scene recognition result of each image to be processed, and generate a cover map corresponding to the video to be processed using the target image to be processed, so as to push the cover map and the video to be processed together.

In some embodiments, for an intelligent video clip scenario, the video processing apparatus may stitch and clip images to be processed of different scene scales by a progressive or jump-type assembly method. For example, for progressive assembly, the video processing apparatus may stitch the images to be processed whose scene recognition results are close-up, close shot, medium shot, panorama and long shot in order from near to far to obtain a video clip with a gradually receding effect, or may stitch the images to be processed whose scene recognition results are long shot, panorama, medium shot, close shot and close-up in order from far to near to obtain a video clip with a gradually approaching effect. For jump-type assembly, the video processing apparatus may select images to be processed of different scene scales and stitch them in a jumping order according to different clipping requirements, to obtain video clips with obvious visual change characteristics. The specific selection is performed according to actual conditions, and the embodiments of the present application are not limited thereto.
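The progressive assembly described above amounts to sorting clips by their scene-scale rank. The sketch below assumes a hypothetical rank table and label strings; both are illustrative choices, not part of the disclosure.

```python
# Hypothetical rank of each scene label from near (0) to far (4).
SCALE_RANK = {"close-up": 0, "close": 1, "medium": 2, "panorama": 3, "long": 4}


def progressive_order(clips, far_away=True):
    """Order (clip_id, scene_label) pairs for a progressive clip.

    far_away=True yields near-to-far order (gradually receding);
    far_away=False yields far-to-near order (gradually approaching).
    """
    return sorted(clips, key=lambda c: SCALE_RANK[c[1]], reverse=not far_away)


clips = [("b", "panorama"), ("a", "close-up"), ("c", "long"), ("d", "medium")]
print([cid for cid, _ in progressive_order(clips)])  # ['a', 'd', 'b', 'c']
```

Jump-type assembly could reuse the same rank table with any non-monotonic ordering chosen by the clipping requirement.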

It can be understood that, in the embodiment of the present application, the video processing apparatus obtains at least one detection region through a pre-trained target detection model, and determines the scene recognition result of the current image to be processed based on the size information and the detection object of the subject detection region among the at least one detection region. This removes the need to train a dedicated scene recognition model from scratch and reduces the image processing workload of a scene recognition model in the video processing process, thereby improving video processing efficiency.

In some embodiments, referring to fig. 5, fig. 5 is an optional flowchart of the artificial intelligence based video processing method provided in the embodiment of the present application, where the body size information includes: a body region height and a body region width; at least one type of detection object includes: the person object, determining the scene recognition result of the current image to be processed based on the subject size information of the subject detection area and the detection object in S104, may be implemented through S1041-S1043, which will be described with reference to the respective steps.

S1041, when the detection object is a human object, obtaining the size ratio and the first area of the main body detection area according to the height and the width of the main body area.

In this embodiment of the application, the video processing apparatus may determine whether the detection object corresponding to the main detection region is a human object according to a confidence of the main detection region for each type of detection object in the at least one type of detection object. When the detection object is determined to be the person object, the video processing device may determine the scene identification result of the current image to be processed according to the subject size information of the subject detection area and by combining the prior knowledge of the size information of the person object in the video.

In this embodiment, when the detection object is a human object, the video processing apparatus may calculate the size ratio and the first area of the main body detection area according to the height of the main body area and the width of the main body area.

In some embodiments, the video processing apparatus may normalize the size information of the subject detection region so that a uniform scene recognition standard is used for images to be processed of different resolutions. For example, the video processing apparatus may normalize the size information of the subject detection region to the (0,1) interval: for a current image to be processed of 448 × 448, where the pixel height of the subject detection region is 40 and the pixel width is 20, the video processing apparatus takes 40/448 as the subject region height, 20/448 as the subject region width, and (40/448) × (20/448) as the first area of the subject detection region.
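The normalization step above can be sketched directly; the function name is an illustrative assumption, and the numbers mirror the 448 × 448 example.

```python
def normalize_region(pixel_h, pixel_w, image_h, image_w):
    """Normalize a subject region to the (0, 1) interval so that one set
    of thresholds works across image resolutions.

    Returns (height, width, first_area) as described above.
    """
    h = pixel_h / image_h
    w = pixel_w / image_w
    return h, w, h * w


# The 40 x 20 region in a 448 x 448 image from the example above.
h, w, area = normalize_region(40, 20, 448, 448)
print(h, w, area)
```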

S1042, when the first area is larger than or equal to a first preset area threshold, determining a scene recognition result according to the size ratio.

In this embodiment of the application, when the first area of the subject detection region is greater than or equal to the first preset area threshold, the video processing apparatus may determine the scene recognition result according to the size ratio, such as the ratio of the height of the subject detection region to the width of the subject detection region, or the ratio of the width of the subject detection region to the height of the subject detection region.

In some embodiments, the first preset area threshold may take 0.2 for the body region height and body region width normalized to the (0,1) interval. Other values can be selected according to the definitions of different specific scenes, and the selection is specifically performed according to the actual situation, which is not limited in the embodiments of the present application.

In some embodiments, the relationship between different scene scales of a person object and the screen may be as shown in FIG. 6. As can be seen from fig. 6, for a person object, in pictures that mainly present the person, the face close-up and the close shot, panorama and long shot of the human body may occupy a larger area of the whole picture; the width-height difference of a face close-up is small, and the width-height difference of the close shot, panorama and long shot of the human body gradually increases as the represented distance grows from near to far. Meanwhile, in pictures that present the person and the environment together, the human body panorama and the human body long shot, including the big long shot, may occupy a smaller area of the whole image; in this case, the region occupied by the human body panorama is characterized by a longer side length, the region occupied by the long shot is characterized by a shorter side length and a larger width-height difference, and the region occupied by the big long shot is characterized by a shorter side length and a smaller width-height difference. Therefore, the video processing apparatus may determine the scene recognition result of the current image to be processed according to the size information of the subject detection region, in combination with the characteristics of the person object under the different scene scales in fig. 6.

In some embodiments, the size ratio is a ratio of a height of the body region to a width of the body region, and when the first area is greater than or equal to the first preset area threshold, determining the scene recognition result according to the size ratio in S1042 may be implemented by performing S201-S204, which will be described with reference to the steps.

S201, when the size ratio is greater than a first preset size ratio threshold and less than or equal to a second preset size ratio threshold, determining the scene recognition result as a face close-up; the second preset size ratio threshold is greater than the first preset size ratio threshold.

In this embodiment of the application, the first preset size ratio threshold may be the minimum height-to-width ratio of the area occupied by a person object, set according to prior knowledge; a ratio smaller than the first preset size ratio threshold would violate basic knowledge of human proportions. The second preset size ratio threshold may be a threshold set corresponding to the conventional scale of a face close-up (e.g., above the shoulders of a human body), where the second preset size ratio threshold is greater than the first preset size ratio threshold. Therefore, when the first area is greater than or equal to the first preset area threshold, and the size ratio is greater than the first preset size ratio threshold and less than or equal to the second preset size ratio threshold, it indicates that the subject detection region has a large area and no obvious width-height difference, which meets the characteristics of a face close-up, and the video processing apparatus can determine the scene recognition result as a face close-up.

In some embodiments, the first preset size ratio threshold may take 1 and the second preset size ratio threshold may take 2. Other values can be set according to the actual situation, and the selection is specifically performed according to the actual situation, which is not limited in the embodiments of the present application.

S202, when the size ratio is larger than a second preset size ratio threshold and is smaller than or equal to a third preset size ratio threshold, determining the scene identification result as a human body close scene; the third predetermined size ratio threshold is greater than the second predetermined size ratio threshold.

In the embodiment of the present application, the third preset size ratio threshold is greater than the second preset size ratio threshold; that is, an aspect ratio greater than that of a face close-up may be considered the aspect ratio corresponding to a human body close shot (e.g., above the chest of the human body). Therefore, when the first area is greater than or equal to the first preset area threshold, and the size ratio is greater than the second preset size ratio threshold and less than or equal to the third preset size ratio threshold, it indicates that the subject detection region has a large area and an obvious width-height difference, which meets the characteristics of a human body close shot, and the video processing apparatus determines the scene recognition result as a human body close shot.

In some embodiments, the third preset size ratio threshold may be 4, or other values may be set according to an actual situation, specifically, the third preset size ratio threshold is selected according to the actual situation, and the embodiment of the present application is not limited.

And S203, when the size ratio is larger than a third preset size ratio threshold and is smaller than or equal to a fourth preset size ratio threshold, determining the scene identification result as the human body panorama.

In this embodiment of the application, the fourth preset size ratio threshold is greater than the third preset size ratio threshold; it may be considered that, relative to the close shot, the occupied area of the person object is further elongated in the height direction. Under the condition that the first area is greater than or equal to the first preset area threshold, when the size ratio is greater than the third preset size ratio threshold and less than or equal to the fourth preset size ratio threshold, it indicates that the subject detection region has a large area and an obvious width-height difference, which meets the characteristics of the human body panorama, and the video processing apparatus determines the scene recognition result as a human body panorama.

In some embodiments, the fourth preset size ratio threshold may be 6, or other values may be set according to an actual situation, specifically, the fourth preset size ratio threshold is selected according to the actual situation, and the embodiment of the present application is not limited.

And S204, when the size ratio is greater than the fourth preset size ratio threshold, determining the scene recognition result as a long shot.

In the embodiment of the application, when the size ratio is greater than the fourth preset size ratio threshold, it indicates that the width-height difference of the subject detection region is further increased and the person object appears visually farther away; correspondingly, the video processing apparatus determines the scene recognition result as a long shot.
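Putting S201-S204 together, the large-area branch can be sketched as one function, using the example thresholds quoted above (1, 2, 4 and 6); the function name and label strings are illustrative assumptions, not part of the disclosure.

```python
def classify_large_area(size_ratio, t1=1.0, t2=2.0, t3=4.0, t4=6.0):
    """Scene recognition for the large-area branch (S201-S204).

    size_ratio = subject region height / subject region width, for a
    person-object region whose first area is >= the first preset area
    threshold. t1..t4 are the four preset size ratio thresholds.
    """
    if t1 < size_ratio <= t2:
        return "face close-up"   # S201: large area, small width-height gap
    if t2 < size_ratio <= t3:
        return "human close shot"  # S202
    if t3 < size_ratio <= t4:
        return "human panorama"    # S203
    if size_ratio > t4:
        return "long shot"         # S204
    return None  # ratio <= t1 violates basic human proportions


print(classify_large_area(1.5))  # face close-up
print(classify_large_area(7.0))  # long shot
```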

S1043, when the first area is smaller than a first preset area threshold and larger than or equal to a second preset area threshold, determining a scene recognition result based on the height of the main body area or the width of the main body area; the second preset area threshold is smaller than the first preset area threshold.

In this embodiment of the application, the second preset area threshold is smaller than the first preset area threshold, and when the first area of the main body detection area corresponding to the person object is smaller than the first preset area threshold and is greater than or equal to the second preset area threshold, it is indicated that the area occupied by the main body detection area in the current image to be processed is smaller, the person object corresponding to the main body detection area is farther from the visual starting point, and the video processing device may further determine the scene recognition result based on the height of the main body area or the width of the main body area.

In some embodiments, for the height and the width of the main body region normalized to the (0,1) interval, the second preset area threshold may be set to 0.01, or may be set to another value, which is specifically selected according to an actual situation, and the embodiment of the present application is not limited.

In some embodiments, S1043 may be implemented by performing S301-S305, which will be described in conjunction with the various steps.

S301, when the height or width of the main body area is larger than or equal to a preset first edge length threshold value, determining the scene identification result as a human body panorama.

In the embodiment of the application, when the first area is smaller than the first preset area threshold and is greater than or equal to the second preset area threshold, and when the height of the main body region or the width of the main body region is greater than or equal to the preset first edge length threshold, it indicates that the area of the main body detection region is smaller, and the edge length of one edge is longer, which meets the characteristics of the human body panorama, and the video processing device determines the scene recognition result as the human body panorama.

In some embodiments, for the height and the width of the main body region normalized to the (0,1) interval, the preset first edge length threshold may be set to 0.3, or may be set to another value according to an actual situation, which is specifically selected according to the actual situation, and the embodiment of the present application is not limited.

S302, when the height or width of the main body area is smaller than a preset first edge length threshold and larger than or equal to a preset second edge length threshold, whether the size ratio is larger than a second preset size ratio threshold is judged.

In this embodiment, when the first area is smaller than the first preset area threshold and is greater than or equal to the second preset area threshold, and when the height of the main body region or the width of the main body region is smaller than the preset first edge length threshold and is greater than or equal to the preset second edge length threshold, the video processing apparatus may further determine whether the size ratio is greater than the second preset size ratio threshold.

In some embodiments, for the height of the main body region and the width of the main body region normalized to the (0,1) interval, the preset second side length threshold may be set to 0.1, or may be set to another value according to an actual situation, which is specifically selected according to the actual situation, and the embodiment of the present application is not limited.

And S303, when the size ratio is larger than or equal to a second preset size ratio threshold, determining the scene identification result as a long shot.

In the embodiment of the present application, under the condition that the first area is smaller than the first preset area threshold and greater than or equal to the second preset area threshold, and the height or the width of the main body region is smaller than the preset first edge length threshold and greater than or equal to the preset second edge length threshold, a size ratio greater than or equal to the second preset size ratio threshold indicates that the main body detection region has a small area, short edges, and an obvious width-height difference. These match the characteristics of a long shot, so the video processing device determines the scene identification result as the long shot.

And S304, when the size ratio is smaller than the second preset size ratio threshold, determining the scene identification result as a large long shot.

In the embodiment of the present application, under the condition that the first area is smaller than the first preset area threshold and greater than or equal to the second preset area threshold, and the height or the width of the main body region is smaller than the preset first edge length threshold and greater than or equal to the preset second edge length threshold, a size ratio smaller than the second preset size ratio threshold indicates that the main body detection region has a small area, short edges, and no obvious width-height difference. These match the characteristics of a large long shot, so the video processing device determines the scene identification result as the large long shot.

S305, when the height of the main body region or the width of the main body region is smaller than the preset second edge length threshold, determining the scene identification result as a large long shot, and marking the current image to be processed as not conforming to the preset result.

In this embodiment of the application, when the first area is smaller than the first preset area threshold and greater than or equal to the second preset area threshold, and the height or the width of the main body region is smaller than the preset second edge length threshold, a certain edge of the main body detection region is too narrow, and the video processing device may determine the scene identification result as a large long shot. Moreover, because an excessively narrow edge suggests that the current image to be processed may be a large long shot or an incorrectly detected borderline case, the video processing device may further mark the current image to be processed as not conforming to the preset result, so that the subsequent video processing process can further confirm such images; images marked as not conforming to the preset result may also be applied to further training of the target detection model, as described in the model training section.
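As a minimal sketch, the small-area person branch of S301-S305 can be expressed as a single function. The function name, the return convention, and reading the "height or width" comparisons as the longer edge are assumptions; the default thresholds (0.3, 0.1, 2) follow the example values given in the text.

```python
def classify_small_area_person(h, w, edge_t1=0.3, edge_t2=0.1, ratio_t2=2.0):
    """Sketch of S301-S305 for a person subject whose first area lies
    between the second and first preset area thresholds.
    h, w: subject-region height/width normalized to (0, 1).
    Returns (label, flagged); flagged marks images that do not
    conform to the preset result."""
    ratio = h / w                        # size ratio, as defined in the text
    longer_edge = max(h, w)
    if longer_edge >= edge_t1:           # S301: one edge is long enough
        return "human body panorama", False
    if longer_edge >= edge_t2:           # S302: edges short but not too narrow
        if ratio >= ratio_t2:            # S303: obvious width-height difference
            return "long shot", False
        return "large long shot", False  # S304: no obvious difference
    # S305: an edge is too narrow -> flag for later re-confirmation
    return "large long shot", True
```

In practice this function would only be called after the area checks of S1043 have placed the first area in the relevant interval.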

And S1044, when the first area is smaller than the second preset area threshold, determining the scene identification result as a large long shot, and marking the current image to be processed as not conforming to the preset result.

In the embodiment of the application, when the first area is smaller than the second preset area threshold, the main body detection area occupies too small a portion of the whole picture. The video processing device may determine the scene recognition result as a large long shot and further mark the current image to be processed as not conforming to the preset result, so that the subsequent video processing process can further confirm such images; images marked as not conforming to the preset result may also be applied to further training of the target detection model, as described in the model training section.

In some embodiments, after S102, when the video processing apparatus obtains the at least one detection area and the number of detection areas in which the detection object is a person object is greater than a preset person count threshold, the video processing apparatus may determine the scene identification result of the current image to be processed as a crowd scene.

It should be noted that, in the embodiment of the present application, the size ratio may also be a ratio of a width of the main body region to a height of the main body region, and accordingly, the first preset size ratio threshold, the second preset size ratio threshold, the third preset size ratio threshold, and the fourth preset size ratio threshold may be set correspondingly based on a predefined ratio of the width to the height, and are specifically selected according to an actual situation, which is not limited in the embodiment of the present application.

It can be understood that, in the embodiment of the application, the video processing device can judge the far and near view of an image through prior knowledge of far and near views on top of a pre-trained target detection model. This avoids collecting labeled samples at large scale solely for the recognition task, and avoids the long overall computation time caused by introducing an additional deep learning model. The result is closer to the definition of far and near views, and the efficiency and accuracy of the scene recognition results, and therefore of the video processing, are improved.

In some embodiments, referring to fig. 7, fig. 7 is an alternative flowchart of the artificial intelligence based video processing method provided by the embodiment of the present application. The body size information includes: a body region height and a body region width; the at least one type of detection object includes: an object, which may be, for example, a building, a vehicle, or a tree. The process of determining the scene identification result of the current image to be processed based on the main body size information of the main body detection area and the detection object in S104 may be implemented through S1045 to S1049, which will be described with reference to each step.

And S1045, when the detection object is an object, obtaining a second area of the main body detection area according to the height and the width of the main body area.

And S1046, when the second area is larger than or equal to a preset third area threshold value, determining the scene recognition result as an object close-up.

In the embodiment of the application, when the second area is greater than or equal to the preset third area threshold, the main body detection area occupied by the object takes up a relatively large proportion of the current image to be processed, and the distance from the object to the visual starting point is relatively short, so the video processing device determines the scene recognition result as the object close-up.

In some embodiments, for the height and the width of the main body region normalized to the (0,1) interval, the preset third area threshold may be set to 0.3, or may be set to another value according to an actual situation, specifically selected according to the actual situation, and the embodiment of the present application is not limited.

S1047, when the second area is smaller than the preset third area threshold and larger than or equal to a preset fourth area threshold, determining the scene identification result as an object near view; the preset fourth area threshold is smaller than the preset third area threshold.

In this embodiment, the preset fourth area threshold is smaller than the preset third area threshold. When the second area is smaller than the preset third area threshold and greater than or equal to the preset fourth area threshold, the main body detection area occupied by the object is smaller than the area corresponding to a close-up but falls within the area range corresponding to a near view, and the video processing device determines the scene identification result as the object near view.

In some embodiments, for the height and the width of the main body region normalized to the (0,1) interval, the preset fourth area threshold may be set to 0.1, or may be set to another value according to the actual situation, which is specifically selected according to the actual situation; the embodiment of the present application is not limited.

S1048, when the second area is smaller than a preset fourth area threshold and larger than or equal to a second preset area threshold, determining the scene identification result as an object panorama; the preset fourth area threshold is larger than the second preset area threshold.

In this embodiment, the predetermined fourth area threshold is greater than the second predetermined area threshold. And when the second area is smaller than a preset fourth area threshold and larger than or equal to a second preset area threshold, the video processing device determines the scene identification result as the object panorama.

In some embodiments, for the height and the width of the main body region normalized to the (0,1) interval, the second preset area threshold may be set to 0.01, or may be set to another value according to the actual situation, which is specifically selected according to the actual situation; the embodiment of the present application is not limited.

And S1049, when the second area is smaller than the second preset area threshold, determining the scene recognition result as an object long shot.

In this embodiment of the application, when the second area is smaller than the second preset area threshold, the object occupies only a very small portion of the picture, so the video processing apparatus determines the scene identification result as an object long shot.

It can be understood that, in the embodiment of the present application, the video processing apparatus may also perform far and near view identification on object-type detection objects, so that far and near view identification of multiple types of detection objects is realized without greatly increasing the model computation and processing load, which improves video processing efficiency.

In some embodiments, after S101, S001 may be further included, which will be described in conjunction with the steps.

And S001, when the target detection model is used to perform target detection on the current image to be processed in the at least one image to be processed and no detection area is detected, marking the current image to be processed as not conforming to the preset result and not using the current image to be processed for video processing.

In the embodiment of the application, when the video processing apparatus uses the target detection model to perform target detection on the current image to be processed and no detection area is detected, the current image to be processed may contain a target object that the target detection model cannot identify, so the scene identification result cannot be inferred from an identifiable target object. The video processing device marks the current image to be processed as not conforming to the preset result and does not use it for video processing; images marked as not conforming to the preset result can be applied to further training of the target detection model, as described in the model training section.

In some embodiments, referring to fig. 8, fig. 8 is an optional flowchart of the artificial intelligence based video processing method provided in the embodiments of the present application, and based on fig. 3 and before S102, S401 to S403 may also be included, which will be described with reference to each step.

S401, carrying out target detection on the training sample image set by using the initial target detection model, and determining sample images which do not accord with a preset result from the training sample image set.

In this embodiment, the video processing apparatus may use an initial target detection model with its default initial training weights to perform target detection on a training sample image set. Based on the target detection results and the scene recognition method described above, it determines, from the training sample image set, training sample images in which no target is detected, and/or in which the first area of the main body detection region is smaller than the second preset area threshold, and/or in which the height or the width of the main body region is smaller than the preset second edge length threshold, as the sample images that do not conform to the preset result.

In some embodiments, the initial object detection model may be a YOLOv5 model, or another type of object detection network model, which is specifically selected according to the actual situation; the embodiments of the present application are not limited.

In some embodiments, the open-source COCO training set may be used as the training sample image set, or other image sets may be used, which is specifically selected according to the actual situation; the embodiments of the present application are not limited thereto.

S402, taking the sample image which does not accord with the preset result as an incremental training sample, obtaining the labeling result of the incremental training sample, and further obtaining an incremental training sample set.

In the embodiment of the application, the video processing device takes the sample image which does not conform to the preset result as the incremental training sample, obtains the labeling result of the incremental training sample, and further obtains the incremental training sample set.

In the embodiment of the present application, the sample image that does not conform to the preset result includes undetected subject objects, and these undetected subject objects may be missed by the initial target detection model or may be newly added type objects that do not belong to the preset detection object type of the initial target detection model. The video processing device collects the labeling results of the sample images which do not accord with the preset results, the incremental training samples and the labeling results corresponding to the incremental training samples are used as incremental training sample sets, and then the incremental training sample sets are put into training of the initial target detection model to enhance the target detection capability of the initial target detection model.

In the embodiment of the application, the labeling result is a manually labeled region giving the position, size, and category of the main object in each sample image that does not conform to the preset result.

And S403, performing iterative training on the initial target detection model based on the incremental training sample set and the training sample image set to obtain the target detection model.

In this embodiment of the application, the video processing apparatus may use the incremental training sample set and the training sample image set together as a full training data set, divide it into a plurality of batches to obtain at least one batch of training data, and update the network weights using standard Stochastic Gradient Descent (SGD) optimization. Specifically, the video processing device may perform image enhancement preprocessing on each batch of training data and input the preprocessed batch into the initial target detection model for forward calculation, obtaining a training prediction region set for each batch in the current training round. According to the training prediction region set and the labeled region set corresponding to each batch, the video processing device calculates the confidence loss of the training prediction region set, and the classification cross entropy loss, the center position coordinate loss, and the width-height loss of the positive prediction regions in the training prediction region set; these four losses are then added to obtain the total loss of the current training round. According to the total loss, the video processing device obtains, through the SGD algorithm, gradient values for adjusting the weights of each layer of the neural network of the initial target detection model, updates the weights of each layer with these gradient values, and performs the next round of training based on the updated weights until a preset training target is reached, for example when the number of training rounds reaches a preset count or the total loss falls below a preset loss threshold; training then ends and the target detection model is obtained.

In some embodiments, the video processing device may calculate the total loss of each training round by equation (1), as follows:

$$
\begin{aligned}
L_{total} = {} & \sum_{i=0}^{S^2}\sum_{j}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& + \sum_{i=0}^{S^2}\sum_{j}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
& + \sum_{i=0}^{S^2}\sum_{j}\mathbb{1}_{ij}^{obj}\left(C_i-\hat{C}_i\right)^2 + \lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j}\mathbb{1}_{ij}^{noobj}\left(C_i-\hat{C}_i\right)^2 \\
& + \sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
\tag{1}
$$

where $S^2$ is the set of positions with a training prediction region in the current training data, $obj$ marks positive detection boxes containing an object in the training prediction region, and $noobj$ marks negative detection boxes without an object. $x_i$ and $y_i$ are the center coordinates of the labeled region, and $\hat{x}_i$ and $\hat{y}_i$ are the center coordinates of the training prediction region, so the first term is the center position coordinate loss. $w_i$ and $h_i$ are the width and height of the labeled region, and $\hat{w}_i$ and $\hat{h}_i$ are the width and height of the training prediction region, so the second term is the width-height loss. $C_i$ is the confidence score of the training prediction region and $\hat{C}_i$ is the overlap of the training prediction region with the labeled region, so the third and fourth terms are the confidence loss, with $\lambda_{noobj}$ a balancing coefficient for the negative boxes. The final term is the prediction class loss.
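Under the definitions above, a minimal numpy sketch of equation (1) for one batch of candidate boxes might look as follows. The array layout (x, y, w, h, C, class probabilities per box) and the lambda_noobj balancing factor are assumptions, and the squared-error class term follows the equation rather than the cross-entropy variant mentioned in the text.

```python
import numpy as np

def total_loss(pred, gt, obj_mask, lambda_noobj=0.5):
    """Sketch of the total loss in equation (1) for one batch of boxes.
    pred, gt: arrays of shape (N, 5 + K) holding (x, y, w, h, C, p_1..p_K);
    obj_mask: boolean array of shape (N,), True for positive boxes."""
    pos, neg = obj_mask, ~obj_mask

    # center position coordinate loss (positive boxes only)
    center = np.sum((gt[pos, :2] - pred[pos, :2]) ** 2)

    # width-height loss computed on square roots, as in the equation
    wh = np.sum((np.sqrt(gt[pos, 2:4]) - np.sqrt(pred[pos, 2:4])) ** 2)

    # confidence loss: positive boxes plus down-weighted negative boxes
    conf = (np.sum((gt[pos, 4] - pred[pos, 4]) ** 2)
            + lambda_noobj * np.sum((gt[neg, 4] - pred[neg, 4]) ** 2))

    # class loss over positive boxes
    cls = np.sum((gt[pos, 5:] - pred[pos, 5:]) ** 2)

    return center + wh + conf + cls
```

A real implementation would compute this per grid cell and anchor inside the detection head, but the term structure is the same.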

It can be understood that, in the embodiment of the application, the video processing device may mine hard training examples for target detection through the scene recognition method and use them as incremental samples to train the initial target detection model, so that the target detection capability of the target detection model is improved, which further improves the accuracy of target detection and of video processing.

Next, an exemplary application of the embodiment of the present application in a practical application scenario will be described.

In the embodiment of the present application, referring to fig. 9, fig. 9 shows an alternative functional module structure diagram of an artificial intelligence based video processing system. In the system of fig. 9, the video processing apparatus applies the far and near view identification method of the embodiment of the present application to a video analysis service downstream of the existing key target detection module of the video analysis and understanding system. This helps improve video understanding and video embedding effects while avoiding the introduction of an additional far and near view recognition model with a complex analysis task and high computation pressure, so the whole video processing system is lighter, the time consumed by far and near view recognition is reduced, and video processing efficiency is improved. An exemplary application process of the embodiment of the present application is further described below with reference to fig. 10, taking intelligent generation of a video cover map based on the video processing system shown in fig. 9 as an example. For the far and near view recognition module shown in fig. 10: during model training, the module may output the recognition results to the model training module for collecting sample images with missed targets; during video processing, the module may output the recognition results to the video processing module for generating the cover map.

In the embodiment of the application, for the model training process, the video processing device can input the original sample set into the target detection model, and the target detection model is used for carrying out target detection on the current sample image in the original sample set to obtain the current sample detection result. The video processing apparatus may analyze the current sample detection result by using the far-near view recognition module in fig. 10 and the methods in S501-S504, so as to obtain a far-near view recognition result, as follows:

s501, when the current sample detection result is at least one target detection frame, determining a target detection frame with the largest area from the at least one target detection frame as a main body detection area; and normalizing the area, height and width of the main body detection area to a numerical value interval of (0,1), so as to obtain the area a of the main body detection area, the height h of the main body area, the width w of the main body area and the aspect ratio of the main body detection area, wherein ratio is equal to h/w.
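The subject selection and normalization in S501 can be sketched as follows; the pixel-coordinate box format (x, y, width, height) and the function name are illustrative assumptions.

```python
def subject_features(boxes, img_w, img_h):
    """Sketch of S501: choose the largest detection box as the subject
    detection area and normalize its measurements to the (0, 1) interval.
    boxes: list of (x, y, w, h) tuples in pixels.
    Returns (a, h, w, ratio) for the subject box."""
    _, _, bw, bh = max(boxes, key=lambda b: b[2] * b[3])
    w = bw / img_w                 # normalized subject-region width
    h = bh / img_h                 # normalized subject-region height
    a = w * h                      # normalized area of the subject box
    ratio = h / w                  # aspect ratio, ratio = h / w as in the text
    return a, h, w, ratio
```

For example, a 200x100-pixel box in a 1000x500 frame yields w = 0.2, h = 0.2, a = 0.04, ratio = 1.0.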

S502, when the detection object in the main detection area is a human object, determining a far and near view recognition result of the current sample image by a human far and near view determination method, specifically, the human far and near view determination method may be implemented by S01-S03, as follows:

and S01, when a is greater than or equal to 0.2, determining the far and near scene recognition result of the current sample image according to the value of ratio.

In S01, the first predetermined area threshold is 0.2. S01 may include S01-1 through S01-4 as follows:

and S01-1, when the ratio is more than 1 and less than or equal to 2, determining the far and near scene recognition result of the current sample image as the face close-up.

In S01-1, the first preset size ratio threshold is 1, and the second preset size ratio threshold is 2.

And S01-2, when the ratio is more than 2 and less than or equal to 4, determining the far and near scene recognition result of the current sample image as the human body near scene.

In S01-2, the third predetermined size ratio threshold is 4.

And S01-3, when the ratio is more than 4 and less than or equal to 6, determining the far and near scene recognition result of the current sample image as the human body panorama.

In S01-3, the fourth predetermined size ratio threshold is 6.

And S01-4, when the ratio is more than 6, determining the far and near scene recognition result of the current sample image as a far scene.

And S02, when a is greater than or equal to 0.01 and less than 0.2, determining the far and near scene recognition result of the current sample image based on the value of w or h.

In S02, the second preset area threshold is 0.01. S02 may include S02-1 through S02-4 as follows:

and S02-1, when w or h is greater than or equal to 0.3, determining the far and near scene recognition result of the current sample image as the human body panorama.

In S02-1, the first edge length threshold is preset to be 0.3.

And S02-2, when w or h is less than 0.3 and is greater than or equal to 0.1, if the ratio is greater than or equal to 2, determining the far and near view identification result of the current sample image as a far view.

In S02-2, a second side length threshold is preset to be 0.1.

And S02-3, when w or h is less than 0.3 and is greater than or equal to 0.1, if the ratio is less than 2, determining the perspective identification result of the current sample image as a large perspective.

And S02-4, when w or h is smaller than 0.1, determining the far and near scene identification result of the current sample image as a large far scene, and marking the current sample image as not conforming to the preset result.

And S03, when a is smaller than 0.01, determining the far and near scene identification result of the current sample image as a large far scene, and marking the current sample image as not meeting the preset result.
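Putting S01-S03 together, the person far/near-view rules with the example thresholds above can be sketched as one function. The function name and return convention are assumptions, "w or h" is read as the longer edge, and a size ratio of 1 or below is not covered by the text, so it is returned as None.

```python
def person_scene(a, h, w):
    """Sketch of S01-S03: classify a person subject from its normalized
    box area a, height h, and width w (all in (0, 1)).
    Returns (label, flagged); flagged marks images that do not
    conform to the preset result."""
    ratio = h / w
    if a >= 0.2:                                    # S01
        if 1 < ratio <= 2:
            return "face close-up", False           # S01-1
        if 2 < ratio <= 4:
            return "human body near scene", False   # S01-2
        if 4 < ratio <= 6:
            return "human body panorama", False     # S01-3
        if ratio > 6:
            return "long shot", False               # S01-4
        return None, False                          # ratio <= 1: not covered
    if a >= 0.01:                                   # S02
        edge = max(h, w)                            # longer-edge reading of "w or h"
        if edge >= 0.3:
            return "human body panorama", False     # S02-1
        if edge >= 0.1:                             # S02-2 / S02-3
            return ("long shot", False) if ratio >= 2 else ("large long shot", False)
        return "large long shot", True              # S02-4: flag for review
    return "large long shot", True                  # S03: flag for review
```

The flagged images correspond to the "not conforming to the preset result" marks that feed back into model training.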

S503, when the detection object in the main detection area is an object, determining a far and near view recognition result of the current sample image by an object far and near view determination method, specifically, the object far and near view determination method may be implemented by S11-S14, as follows:

and S11, when a is greater than or equal to 0.3, determining the far and near scene recognition result of the current sample image as the object close-up.

In S11, the third area threshold is preset to be 0.3.

And S12, when a is less than 0.3 and greater than or equal to 0.1, determining the far and near scene recognition result of the current sample image as the object near scene.

In S12, the fourth area threshold is preset to be 0.1.

And S13, when a is less than 0.1 and greater than or equal to 0.01, determining the far and near scene recognition result of the current sample image as the object panorama.

And S14, when a is less than 0.01, determining the far and near scene recognition result of the current sample image as a far scene.
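The object branch S11-S14 reduces to a few area comparisons; a minimal sketch with the thresholds above (the function name and labels are illustrative):

```python
def object_scene(a):
    """Sketch of S11-S14: classify an object subject from the normalized
    area a of its main body detection region (a in (0, 1))."""
    if a >= 0.3:
        return "object close-up"    # S11: third area threshold 0.3
    if a >= 0.1:
        return "object near view"   # S12: fourth area threshold 0.1
    if a >= 0.01:
        return "object panorama"    # S13: second preset area threshold 0.01
    return "object long shot"       # S14
```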

And S504, when the current sample detection result is no target, marking the current sample image as not conforming to a preset result.

In the embodiment of the application, for the model training process, the video processing device performs the same processing on each sample image in the original sample set by the methods of S501-S504, uses all the obtained sample images that do not conform to the preset result as missed-target sample images, and obtains their labeling results, thereby obtaining the incremental training sample set. The video processing device then trains the target detection model using the original sample set together with the incremental training sample set, and uses the trained target detection model in the video processing process.

In the embodiment of the application, in the video processing process, the video processing device performs frame extraction and shot segmentation on the video to be processed to obtain at least one image to be processed, and performs target detection on the at least one image to be processed using the trained target detection model to obtain at least one detection area of each image to be processed. Through the far and near view recognition module, the video processing device applies processing consistent with S501-S504 to the at least one detection area of each image to be processed to obtain a far and near view recognition result for each image. Finally, the video processing device can determine a target image from the far and near view recognition results according to the actual cover map requirement and generate a cover map using the target image. Illustratively, when the actual cover map requirement is to use a person close-up as the cover map of the video to be processed, the video processing device determines an image to be processed belonging to a person close-up as the target image according to the far and near view recognition result of each image, and generates a person close-up cover map from the target image, thereby completing the video processing process.
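The cover-map selection step described above can be sketched as follows; the parallel-list interface and names are illustrative assumptions:

```python
def pick_cover_frame(frames, scene_results, wanted="face close-up"):
    """Sketch: return the first extracted frame whose far/near-view
    recognition result matches the cover map requirement.
    frames and scene_results are parallel lists."""
    for frame, label in zip(frames, scene_results):
        if label == wanted:
            return frame
    return None  # no frame meets the cover requirement
```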

It can be understood that, in the embodiment of the present application, the video processing apparatus may also use the algorithm logic of the far and near view recognition module when training the target detection model: it mines missed-target sample images from the original sample set, collects their labeling results to form a labeled training sample set, and performs incremental training and updating of the target detection neural network model with this set, which improves the accuracy of the target detection model and therefore of the entire video processing system. Moreover, by pairing the far and near view recognition logic with the existing target detection model, the video processing device can quickly determine the far and near view recognition result of an image from the target detection result without much additional processing, which improves video processing efficiency.

Continuing with the exemplary structure of the artificial intelligence based video processing device 455 provided by the embodiments of the present application as software modules, in some embodiments, as shown in fig. 2, the software modules stored in the artificial intelligence based video processing device 455 of the memory 450 may include:

the video frame extracting module 4551 is configured to perform frame extraction and shot segmentation on a video to be processed to obtain at least one image to be processed;

the target detection model 4552 is configured to perform target detection on a current image to be processed in the at least one image to be processed, so as to obtain at least one detection area corresponding to at least one type of detection object; the target detection model is used for detecting the at least one detection area occupied by the at least one type of detection object from the current image to be processed;

a scene recognition module 4553, configured to determine a subject detection region from the at least one detection region according to size information of each detection region of the at least one detection region; determining a scene identification result of the current image to be processed based on the main body size information and the detection object of the main body detection area, and further obtaining a scene identification result of each image to be processed in the at least one image to be processed; the scene identification result represents the distance between the image content and the visual starting point;

a video processing module 4554, configured to implement intelligent processing on the video to be processed based on the scene identification result of each image to be processed.

In some embodiments, the body size information includes: a body region height and a body region width; the at least one type of detection object comprises: a person object; the scene identification module 4553 is further configured to, when the detection object is the person object, obtain a size ratio and a first area of the main body detection area according to the height of the main body area and the width of the main body area; when the first area is larger than or equal to a first preset area threshold value, determining the scene identification result according to the size ratio; or when the first area is smaller than the first preset area threshold and larger than or equal to a second preset area threshold, determining the scene identification result based on the height of the main body region or the width of the main body region; the second preset area threshold is smaller than the first preset area threshold; or when the first area is smaller than the second preset area threshold, determining the scene recognition result as a large long-distance scene, and marking the current image to be processed as the image not meeting the preset result.

In some embodiments, the size ratio is the ratio of the subject region height to the subject region width. The scene recognition module 4553 is further configured to determine the scene recognition result as a face close-up when the size ratio is greater than a first preset size ratio threshold and less than or equal to a second preset size ratio threshold, the second preset size ratio threshold being greater than the first preset size ratio threshold; or, when the size ratio is greater than the second preset size ratio threshold and less than or equal to a third preset size ratio threshold, determine the scene recognition result as a person close scene, the third preset size ratio threshold being greater than the second preset size ratio threshold; or, when the size ratio is greater than the third preset size ratio threshold and less than or equal to a fourth preset size ratio threshold, determine the scene recognition result as a person panorama, the fourth preset size ratio threshold being greater than the third preset size ratio threshold; or, when the size ratio is greater than the fourth preset size ratio threshold, determine the scene recognition result as a long-distance scene.
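The ratio bands can be expressed directly as chained comparisons. The concrete threshold values below are illustrative only (the patent fixes only their ordering, T1 < T2 < T3 < T4), and the English labels follow the translation in this section.

```python
# Illustrative size-ratio thresholds; only the ordering T1 < T2 < T3 < T4 is given.
T1, T2, T3, T4 = 1.0, 1.6, 3.0, 5.0

def classify_by_size_ratio(ratio):
    """Map the subject region's height/width ratio to a scene label."""
    if T1 < ratio <= T2:
        return "face close-up"
    if T2 < ratio <= T3:
        return "person close scene"
    if T3 < ratio <= T4:
        return "person panorama"
    if ratio > T4:
        return "long-distance scene"
    return None  # ratio <= T1 is not covered by this branch
```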

In some embodiments, the scene recognition module 4553 is further configured to determine the scene recognition result as a person panorama when the subject region height or the subject region width is greater than or equal to a preset first edge length threshold; or, when the subject region height or the subject region width is smaller than the preset first edge length threshold and greater than or equal to a preset second edge length threshold, determine whether the size ratio is greater than or equal to the second preset size ratio threshold: if so, determine the scene recognition result as a long-distance scene; otherwise, determine the scene recognition result as a large long-distance scene; or, when the subject region height or the subject region width is smaller than the preset second edge length threshold, determine the scene recognition result as a large long-distance scene and mark the current image to be processed as not meeting the preset result.
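A sketch of this edge-length fallback follows. The text's "height or width" comparison is read here as a comparison against the longer edge, which is one plausible interpretation; the pixel thresholds are assumed values.

```python
# Hypothetical edge-length fallback for mid-sized person regions.
EDGE_T1, EDGE_T2 = 400, 120   # assumed pixel thresholds, EDGE_T2 < EDGE_T1
RATIO_T2 = 1.6                # assumed second preset size ratio threshold

def classify_by_edge_length(region_h, region_w):
    """Fallback classification when the first area is in the middle band."""
    longest = max(region_h, region_w)   # reading "height or width" as the longer edge
    ratio = region_h / region_w
    if longest >= EDGE_T1:
        return "person panorama"
    if longest >= EDGE_T2:
        # Middle band: disambiguate with the size ratio.
        if ratio >= RATIO_T2:
            return "long-distance scene"
        return "large long-distance scene"
    return "large long-distance scene"  # also marked as not meeting the preset result
```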

In some embodiments, the subject size information includes a subject region height and a subject region width, and the at least one type of detection object includes an object. The scene recognition module 4553 is further configured to, when the detection object is the object, obtain a second area of the subject detection region from the subject region height and the subject region width; when the second area is greater than or equal to a preset third area threshold, determine the scene recognition result as an object close-up; or, when the second area is smaller than the preset third area threshold and greater than or equal to a preset fourth area threshold, the preset fourth area threshold being smaller than the preset third area threshold, determine the scene recognition result as an object close scene; or, when the second area is smaller than the preset fourth area threshold and greater than or equal to the second preset area threshold, the preset fourth area threshold being greater than the second preset area threshold, determine the scene recognition result as an object panorama; or, when the second area is smaller than the second preset area threshold, determine the scene recognition result as an object long-distance scene.
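The object branch reduces to a single descending cascade of area bands. As before, the normalized threshold values here are illustrative assumptions; only their ordering (second < fourth < third) comes from the text.

```python
# Hypothetical normalized-area thresholds for object subjects: A2 < A4 < A3.
A3, A4, A2 = 0.30, 0.10, 0.02

def classify_object_by_area(region_h, region_w, image_h, image_w):
    """Map the second area of an object subject region to a scene label."""
    area = (region_h * region_w) / (image_h * image_w)  # normalized second area
    if area >= A3:
        return "object close-up"
    if area >= A4:
        return "object close scene"
    if area >= A2:
        return "object panorama"
    return "object long-distance scene"
```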

In some embodiments, the target detection model 4552 is further configured to perform frame extraction and shot segmentation on the video to be processed to obtain the at least one image to be processed, and to perform target detection on a current image to be processed in the at least one image to be processed; when no detection region is detected, the current image to be processed is marked as not meeting the preset result and is not used for video processing.

In some embodiments, the artificial-intelligence-based video processing apparatus further includes a model training module. The model training module is configured to, before the target detection model performs target detection on each image to be processed of the at least one image to be processed to obtain a target detection result, perform target detection on a training sample image set using an initial target detection model and determine, from the training sample image set, sample images that do not meet the preset result; take the sample images that do not meet the preset result as incremental training samples and obtain labeling results for the incremental training samples, thereby obtaining an incremental training sample set; and iteratively train the initial target detection model based on the incremental training sample set and the training sample image set to obtain the target detection model.
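The incremental training step described above can be sketched as a small loop over the training set. All callables (the detector, the preset-result check, the labeler, and the retraining routine) are placeholders standing in for the patent's unspecified components.

```python
# Hedged sketch of the incremental training flow: detect on the training set,
# collect samples whose result does not meet the preset criterion, label them,
# and retrain on the union. Every callable here is a placeholder.
def incremental_train(model, train_images, detect, meets_preset, label, retrain):
    """Mine hard samples with the initial model and retrain on them."""
    hard = [img for img in train_images if not meets_preset(detect(model, img))]
    increments = [(img, label(img)) for img in hard]   # labeled incremental set
    return retrain(model, train_images, increments)    # iterative retraining
```

The point of the loop is that only samples the initial model fails on need new annotation, which keeps the labeling cost of each training round low.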

It should be noted that the above description of the apparatus embodiments is similar to that of the method embodiments, and the apparatus embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the apparatus embodiments of the present application, refer to the description of the method embodiments of the present application.

Embodiments of the present application provide a computer program product or a computer program that includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, so that the computer device performs the artificial-intelligence-based video processing method of the embodiments of the present application.

Embodiments of the present application provide a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform a method provided by the embodiments of the present application, for example, the methods illustrated in Figs. 3, 5, 7, and 8.

In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM; or may be any device including one of or any combination of the above memories.

In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).

By way of example, executable instructions may be deployed to be executed on one computing device or on multiple computing devices at one site or distributed across multiple sites and interconnected by a communication network.

In summary, in the embodiments of the present application, the video processing device can judge whether an image shows a near or far scene using prior knowledge of near and far scenes together with a pre-trained target detection model. This avoids large-scale collection of labeled samples for recognition, and avoids the excessive overall computation time that an additional deep learning model would introduce. The result is closer to the cinematic definition of near and far scenes, which improves the efficiency and accuracy of scene recognition and, in turn, of video processing.

The above description is only an example of the present application and is not intended to limit the protection scope of the present application. Any modification, equivalent replacement, or improvement made within the spirit and scope of the present application shall fall within the protection scope of the present application.
