Task scheduling method, system, GPU and device for Warp-level scheduling


1. A task scheduling method for Warp-level scheduling, characterized in that the method comprises the following steps:

when a task is a first task, analyzing offline the hardware information and configuration information of the task submitted by a user;

acquiring, based on the hardware information and configuration information of the task, the maximum parallelism when the task runs in parallel with mainstream tasks;

when the task is a non-first task, making an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions, packing the selected task pair into a new task, and submitting the new task to the GPU, so that the two original GPU tasks in the task pair are scheduled at the warp level.

2. The method of claim 1, wherein analyzing offline the hardware information and configuration information of the task submitted by the user, when the task is a first task, comprises: collecting the usage of each data path by the task submitted by the user, and the performance of the task submitted by the user under different thread block configurations.

3. The method for Warp-level task scheduling as claimed in claim 1 or 2, wherein acquiring the maximum parallelism when the task runs in parallel with mainstream tasks, based on the hardware information and configuration information of the task, is implemented as follows:

acquiring, based on the configuration information of the task, the thread block configuration interval over which the task performance remains unchanged;

acquiring the compute/memory-access type of the task based on the configuration information and hardware information of the task;

based on the compute/memory-access types and thread block configuration intervals of the tasks, traversing the parallel performance of the two tasks under all thread block configurations to obtain the maximum parallelism.

4. The method of claim 1, wherein making an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions comprises:

if a task pair among the arriving tasks has undergone parallelism amplification and tuning and a corresponding maximum-parallelism decision has been obtained, selecting that task pair according to the tuned maximum-parallelism decision; and for the remaining task pairs that have not undergone parallelism amplification, making the packing decision based on the hardware information of the tasks.

5. The method of claim 4, wherein making the packing decision for the remaining task pairs that have not undergone parallelism amplification, based on the hardware information of the tasks, comprises:

acquiring the hardware information of each task, and summing, data path by data path, the data path utilization rates in the hardware information of the two tasks;

judging whether the summed utilization of any data path exceeds a set threshold:

if so, determining that the two tasks cannot be packed into a new task;

if not, determining that the two tasks can be packed into a new task.

6. The method of claim 1, wherein packing the selected task pair into a new task comprises the following steps:

adjusting the grid dimensions and thread block dimensions of the two tasks to one dimension;

calculating the grid dimension and thread block dimension of the new task;

converting the two tasks from global functions to device functions, and constructing the parameters passed to the new kernel function based on the original grid dimensions and thread block dimensions of the two tasks;

constructing the body of the new kernel function;

adjusting the parameters and synchronization functions of the new device functions.

7. A task scheduling system for Warp-level scheduling, characterized in that the system comprises:

an offline analysis module, configured to analyze offline the hardware information and configuration information of a task submitted by a user when the task is a first task;

a parallelism amplification module, configured to acquire, based on the hardware information and configuration information of the task, the maximum parallelism when the task runs in parallel with mainstream tasks;

a decision module, configured to make an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions when the task is a non-first task;

and a task packing module, configured to pack the selected task pair into a new task and submit the new task to the GPU, so that the two original GPU tasks in the task pair are scheduled at the warp level.

8. The Warp-level task scheduling system of claim 7, wherein the task packing module comprises:

a dimension control unit, configured to adjust the grid dimensions and thread block dimensions of the two tasks to one dimension and to calculate the grid dimension and thread block dimension of the new task;

a parameter construction unit, configured to convert the two tasks from global functions to device functions and to construct the parameters passed to the new kernel function based on the original grid dimensions and thread block dimensions of the two tasks;

a content construction unit, configured to construct the body of the new kernel function;

and an adjustment unit, configured to adjust the parameters and synchronization functions of the new device functions.

9. A GPU, characterized in that the GPU is configured to run a computer program to implement the task scheduling method for Warp-level scheduling according to any one of claims 1 to 6.

10. An electronic device, characterized in that the electronic device employs the GPU according to claim 9.

Background

The rapid development of general-purpose GPUs has led to their massive adoption in cloud computing and data centers. On one hand, traditional scientific disciplines such as mathematics and neuroscience use GPUs for their scientific computations. On the other hand, many directions in the computer field have begun to employ GPUs to accelerate related computing tasks. For example, machine learning and deep learning rely on GPUs for training and serving, while data stream tasks rely on GPUs to accelerate data processing. Thus, to support an increasing variety of GPU tasks, a large number of enterprises and academic institutions have begun to build GPU-based private clouds.

As applications become increasingly diverse, GPU architectures are also continually evolving. Originally, the streaming multiprocessor (SM) of a GPU contained only one type of computational unit, the CUDA core, which is further subdivided into INT32, FP32 and FP64 units. With the rapid development and wide application of deep learning, NVIDIA introduced a dedicated computing unit, the Tensor Core. Since the most common computation in deep learning is matrix multiplication, the sole function of the Tensor Core is matrix multiplication. It is easy to see that conventional tasks and data stream processing tasks mainly use the CUDA cores, while deep learning tasks mainly use the Tensor Cores. Meanwhile, the CUDA cores and Tensor Cores are two completely independent kinds of computing hardware, which means the two can compute simultaneously. Therefore, it is possible to improve GPU utilization, and thus overall GPU throughput, by using the two types of computing hardware at the same time.

Scheduling on existing GPUs depends on the submission configuration of GPU tasks. A GPU task contains one or more kernel functions. When a GPU task is submitted to a GPU, the kernel functions in the task are executed sequentially. When a kernel function is launched, it starts a specified number of thread blocks with a specified configuration, and each thread block contains a specified number of threads. When a thread block is dispatched to an SM, every 32 threads execute together as a warp, and when one warp stalls on memory accesses, another warp is scheduled for computation. Essentially, the task scheduling granularity on the GPU hardware is the warp level.
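
To make the launch configuration and warp granularity concrete, the following is a minimal CUDA sketch (illustrative only, not part of the patent) of a kernel launch whose blocks the hardware splits into 32-thread warps; the kernel name and sizes are arbitrary assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out) {
    // The hardware splits each block into warps of 32 consecutive threads.
    int warpId = threadIdx.x / 32;                      // warp index within the block
    int lane   = threadIdx.x % 32;                      // lane index within the warp
    int gid    = blockIdx.x * blockDim.x + threadIdx.x;
    out[gid]   = static_cast<float>(warpId * 32 + lane);
}

int main() {
    const int numBlocks = 4, threadsPerBlock = 128;     // 128 threads = 4 warps per block
    float *out = nullptr;
    cudaMalloc(&out, numBlocks * threadsPerBlock * sizeof(float));
    // The launch configuration fixes the grid and block sizes from which
    // the SM scheduler forms and interleaves warps.
    dummyKernel<<<numBlocks, threadsPerBlock>>>(out);
    cudaDeviceSynchronize();
    cudaFree(out);
    return 0;
}
```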

Existing scheduling systems rely on the official NVIDIA MPS (Multi-Process Service) for scheduling. MPS allows a GPU task to apply for a specified proportion of SM resources, and the task can use only that share of SM resources. When one task has not used up the computing resources on the current SM, another task can be scheduled onto that SM. This means the scheduling granularity of MPS is the SM level, or more precisely the task level or kernel level, so this kernel-level scheduling granularity cannot directly exploit both compute units. Thus the first difficulty in using both kinds of hardware in parallel is coarse-grained task scheduling.

In addition to the issue of scheduling granularity, a second difficulty in using both kinds of hardware in parallel is contention for the memory system. Since two tasks share one GPU memory system, memory contention is very likely to occur, and neither existing tools nor previous work manage this contention. Although the two computing hardware units can in theory be used in parallel, when a task places particularly high demands on the memory system, the two computing units still cannot be used in parallel. Finally, when a large number of tasks arrive, deciding which tasks can run together is a third challenge.

NVIDIA MPS is a multitask parallel interface offered officially by NVIDIA. MPS allows a task to apply for a specified proportion of SM resources, and the task can use only these SM resources. When one task has not used up the computing resources on the current SM, another task can be scheduled onto that SM. This means the scheduling granularity of MPS is the SM level, or more precisely the task level or kernel level, so this kernel-level scheduling granularity cannot directly exploit both compute units. Thus a first drawback of NVIDIA MPS is coarse-grained task scheduling.

In addition to the scheduling-granularity issue, a second drawback is that MPS cannot sense contention for the memory system. Because two tasks share one memory system, memory contention easily occurs, which further harms the parallelism of the two tasks. Finally, when a large number of tasks arrive, a third drawback is that MPS cannot decide which tasks should be co-located.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present invention is to provide a task scheduling method, system, GPU and device for Warp-level scheduling, which are used to implement high-throughput Warp-level task scheduling.

To achieve the above and other related objects, the present invention provides a task scheduling method for Warp-level scheduling, including: when a task is a first task, analyzing offline the hardware information and configuration information of the task submitted by a user; acquiring, based on the hardware information and configuration information of the task, the maximum parallelism when the task runs in parallel with mainstream tasks; when the task is a non-first task, making an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions, packing the selected task pair into a new task, and submitting the new task to the GPU, so that the two original GPU tasks in the task pair are scheduled at the warp level.

In an embodiment of the present invention, when the task is a first task, analyzing offline the hardware information and configuration information of the task submitted by the user includes: collecting the usage of each data path by the task submitted by the user, and the performance of the task submitted by the user under different thread block configurations.

In an embodiment of the present invention, acquiring the maximum parallelism when the task runs in parallel with mainstream tasks, based on the hardware information and configuration information of the task, is implemented as follows: acquiring, based on the configuration information of the task, the thread block configuration interval over which the task performance remains unchanged; acquiring the compute/memory-access type of the task based on the configuration information and hardware information of the task; and, based on the compute/memory-access types and thread block configuration intervals of the tasks, traversing the parallel performance of the two tasks under all thread block configurations to obtain the maximum parallelism.

In an embodiment of the present invention, making an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions includes: if a task pair among the arriving tasks has undergone parallelism amplification and tuning and a corresponding maximum-parallelism decision has been obtained, selecting that task pair according to the tuned maximum-parallelism decision; and for the remaining task pairs that have not undergone parallelism amplification, making the packing decision based on the hardware information of the tasks.

In an embodiment of the present invention, making the packing decision for the remaining task pairs that have not undergone parallelism amplification, based on the hardware information of the tasks, includes: acquiring the hardware information of each task, and summing, data path by data path, the data path utilization rates in the hardware information of the two tasks; judging whether the summed utilization of any data path exceeds a set threshold: if so, determining that the two tasks cannot be packed into a new task; if not, determining that the two tasks can be packed into a new task.

In an embodiment of the present invention, packing the selected task pair into a new task includes: adjusting the grid dimensions and thread block dimensions of the two tasks to one dimension; calculating the grid dimension and thread block dimension of the new task; converting the two tasks from global functions to device functions, and constructing the parameters passed to the new kernel function based on the original grid dimensions and thread block dimensions of the two tasks; constructing the body of the new kernel function; and adjusting the parameters and synchronization functions of the new device functions.

The embodiment of the present invention further provides a task scheduling system for Warp-level scheduling, which includes: an offline analysis module, configured to analyze offline the hardware information and configuration information of a task submitted by a user when the task is a first task; a parallelism amplification module, configured to acquire, based on the hardware information and configuration information of the task, the maximum parallelism when the task runs in parallel with mainstream tasks; a decision module, configured to make an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions when the task is a non-first task; and a task packing module, configured to pack the selected task pair into a new task and submit the new task to the GPU, so that the two original GPU tasks in the task pair are scheduled at the warp level.

In an embodiment of the present invention, the task packing module includes: a dimension control unit, configured to adjust the grid dimensions and thread block dimensions of the two tasks to one dimension and to calculate the grid dimension and thread block dimension of the new task; a parameter construction unit, configured to convert the two tasks from global functions to device functions and to construct the parameters passed to the new kernel function based on the original grid dimensions and thread block dimensions of the two tasks; a content construction unit, configured to construct the body of the new kernel function; and an adjustment unit, configured to adjust the parameters and synchronization functions of the new device functions.

Embodiments of the present invention further provide a GPU configured to run a computer program to implement the task scheduling method for Warp-level scheduling as described above.

The embodiment of the present invention further provides an electronic device that uses the above GPU.

As described above, the task scheduling method, system, GPU and device of Warp level scheduling of the present invention have the following beneficial effects:

the present invention realizes high-throughput Warp-level task scheduling that requires no prior awareness from the user, can indirectly provide scheduling support for future GPUs configured with multiple kinds of computing units, and enables a dynamic task scheduling system for GPU chips configured with multiple kinds of computing units to provide task scheduling services to users.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.

Fig. 1 is a schematic overall flowchart of a task scheduling method of Warp level scheduling in an embodiment of the present application.

Fig. 2 is a schematic diagram illustrating a schematic structure of a task scheduling method of Warp level scheduling in an embodiment of the present application.

Fig. 3 is a schematic diagram illustrating a specific implementation procedure of the task scheduling method for Warp level scheduling in an embodiment of the present application.

Fig. 4 is a schematic block diagram of a task scheduling system for Warp level scheduling in an embodiment of the present application.

Fig. 5 is a schematic block diagram of a GPU in an embodiment of the present application.

Description of the element reference numerals

100 Task scheduling system for Warp-level scheduling

110 Offline analysis module

120 Parallelism amplification module

130 Decision module

140 Task packing module

S100 to S300 Steps

S1 to S8 Steps

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.

This embodiment aims to provide a task scheduling method, system, GPU and device for Warp-level scheduling, which are used to achieve high-throughput Warp-level scheduling.

This embodiment aims to design and implement a system and method that run in a data center and manage resources shared among a large number of programs, maximizing overall task throughput while ensuring the correctness of both co-located programs.

The principles and implementation of the Warp-level task scheduling method, system, GPU and device of the present invention are described in detail below, so that those skilled in the art can understand them without creative effort.

Example 1

Specifically, as shown in fig. 1, this embodiment provides a task scheduling method for Warp-level scheduling, which includes:

step S100, when the task is a first task, analyzing offline the hardware information and configuration information of the task submitted by a user;

step S200, acquiring, based on the hardware information and configuration information of the task, the maximum parallelism when the task runs in parallel with mainstream tasks;

step S300, when the task is a non-first task, making an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions, packing the selected task pair into a new task, and submitting the new task to the GPU, so that the two original GPU tasks in the task pair are scheduled at the warp level.

The following describes steps S100 to S300 of the task scheduling method of Warp level scheduling according to this embodiment in detail with reference to fig. 2.

Step S100, when the task is a first task, the hardware information and configuration information of the task submitted by the user are analyzed offline.

In this embodiment, when the task is a first task, analyzing offline the hardware information and configuration information of the task submitted by the user includes: collecting the usage of each data path by the task submitted by the user, and the performance of the task submitted by the user under different thread block configurations.

Specifically, in this embodiment, the demand that the task places on the GPU memory system is first analyzed, and the usage of each memory channel by the task submitted by the user is collected. Second, the change in task performance with the number of submitted thread blocks (blocks) is analyzed, and the interval of submitted thread block counts over which task performance is preserved is collected.
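
For illustration, the following is a minimal sketch of the kind of per-task profile this offline analysis might collect; the struct, its field names, and the tolerance-based interval heuristic are assumptions of this description, not structures defined by the patent.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <utility>

// Hypothetical per-task profile gathered by offline analysis.
struct TaskProfile {
    std::map<std::string, double> dataPathUtilization; // percent used per data path
    std::map<int, double> perfByBlockCount;            // performance vs. blocks per SM
};

// Derive the block-count interval over which performance stays within a tolerance
// of its best value (one possible reading of the "performance preserved" interval).
std::pair<int, int> stableBlockInterval(const TaskProfile &p, double tolerance = 0.05) {
    double best = 0.0;
    for (const auto &kv : p.perfByBlockCount) best = std::max(best, kv.second);
    int lo = -1, hi = -1;
    for (const auto &[blocks, perf] : p.perfByBlockCount) {
        if (perf >= best * (1.0 - tolerance)) {
            if (lo < 0) lo = blocks;   // first qualifying block count
            hi = blocks;               // last qualifying block count
        }
    }
    return {lo, hi};
}
```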

Step S200, the maximum parallelism when the task runs in parallel with mainstream tasks is acquired based on the hardware information and configuration information of the task.

In this embodiment, the parallelism of the two packed tasks is maximized, i.e., the parallelism of the two kinds of computing hardware is maximized. Maximizing the parallelism of two tasks requires managing their memory system contention. A task's memory system demand depends mainly on the number of thread blocks (blocks) it computes simultaneously on a single SM (compute unit): the more blocks that compute simultaneously on an SM, the more concurrent memory requests there are, i.e., the higher the memory bandwidth requirement. Therefore, this embodiment dynamically adjusts the number of blocks submitted by the two tasks with the support of source-to-source compilation, thereby maximizing the parallelism of the two tasks on the GPU.

Through corresponding research, GPU tasks can be divided into three types: memory-intensive tasks, balanced tasks and compute-intensive tasks. Compute-intensive tasks have low bandwidth requirements: their performance improves as the number of blocks grows, but beyond a certain block count the improvement levels off. Memory-intensive tasks require more memory bandwidth and less computation: as the number of blocks increases, their performance drops because of memory-access contention among blocks. The main characteristic of balanced tasks is that changing the block count has essentially no effect on their performance.

In order to maximize the parallelism of two tasks, this embodiment designs a task-pair parallelism-maximization algorithm. The idea of the algorithm is as follows: first, profile the performance curve of each task as a function of the number of submitted thread blocks (blocks), where the block count is the number of blocks of a GPU task resident on a single streaming multiprocessor; second, determine the type of each task and its starting block count; third, traverse all block configuration pairs of each task pair to find the parallel configuration with the maximum parallelism; fourth, return all the configuration pairs.

Specifically, in this embodiment, acquiring the maximum parallelism when the task runs in parallel with mainstream tasks, based on the hardware information and configuration information of the task, is implemented as follows:

1) acquiring, based on the configuration information of the task, the thread block configuration interval over which the task performance remains unchanged;

2) acquiring the compute/memory-access type of the task based on the configuration information and hardware information of the task;

3) based on the compute/memory-access types and thread block configuration intervals of the tasks, traversing the parallel performance of the two tasks under all thread block configurations to obtain the maximum parallelism.

In the present embodiment, an algorithm for maximizing parallelism of tasks is shown in table 1 below.

TABLE 1
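
Since Table 1 is not reproduced here, the following is a minimal host-side sketch of the traversal described above; the type names, the throughput metric and its placeholder implementation are assumptions for illustration only, not the patent's actual algorithm listing.

```cpp
#include <utility>

enum class TaskType { ComputeIntensive, MemoryIntensive, Balanced };

struct TaskInfo {
    TaskType type;   // compute/memory-access type from offline analysis
    int minBlocks;   // start of the stable per-SM block interval
    int maxBlocks;   // end of the stable per-SM block interval
};

// Placeholder for the profiling step that co-runs the two tasks with the given
// per-SM block counts and returns a combined-throughput score (higher is better).
double measureCoRunThroughput(const TaskInfo &a, int blocksA,
                              const TaskInfo &b, int blocksB) {
    return static_cast<double>(blocksA + blocksB);  // stub; a real system would measure
}

// Traverse all block-count pairs inside each task's interval and keep the
// configuration with the highest co-run score (steps three and four above).
std::pair<int, int> maximizeParallelism(const TaskInfo &a, const TaskInfo &b) {
    double best = -1.0;
    std::pair<int, int> bestCfg{a.minBlocks, b.minBlocks};
    for (int ba = a.minBlocks; ba <= a.maxBlocks; ++ba) {
        for (int bb = b.minBlocks; bb <= b.maxBlocks; ++bb) {
            double score = measureCoRunThroughput(a, ba, b, bb);
            if (score > best) { best = score; bestCfg = {ba, bb}; }
        }
    }
    return bestCfg;  // per-SM block counts to submit for each task when packed
}
```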

Step S300, when the task is a non-first task, an online task-pair packing decision is made based on the hardware information of the task and the collected maximum-parallelism decisions, the selected task pair is packed into a new task, and the new task is submitted to the GPU, so that the two original GPU tasks in the task pair are scheduled at the warp level.

When a large number of tasks arrive, the system decides in real time which two tasks can jointly exploit the parallelism of the two kinds of computing hardware. Because two tasks that run together share the GPU memory hierarchy, when one task does not use all of the GPU memory resources, another task can be scheduled onto the GPU.

If a task has already undergone parallelism-amplification analysis, it is packed directly using the task configuration obtained from that analysis. If an arriving task has not undergone parallelism-amplification analysis, the online decision on the packed task pair is made from the collected memory usage information. The corresponding online packing decision algorithm checks whether the combined usage of any data path exceeds a set threshold, which is initially set to (but not limited to) 100 and is adjusted based on user and historical information. When the utilization of any data path exceeds the threshold, the two tasks are determined to be unable to form a packed pair.

Specifically, in this embodiment, making the online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions includes:

if a task pair among the arriving tasks has undergone parallelism amplification and tuning and a corresponding maximum-parallelism decision has been obtained, selecting that task pair according to the tuned maximum-parallelism decision; and for the remaining task pairs that have not undergone parallelism amplification, making the packing decision based on the hardware information of the tasks.

In this embodiment, making the packing decision for the task pairs that have not undergone parallelism amplification, based on the hardware information of the tasks, includes:

acquiring the hardware information of each task, and summing, data path by data path, the data path utilization rates in the hardware information of the two tasks;

judging whether the summed utilization of any data path exceeds a set threshold:

if so, determining that the two tasks cannot be packed into a new task;

if not, determining that the two tasks can be packed into a new task.

Specifically, in this embodiment, the decision algorithm for making the packing decision, based on the hardware information of the tasks, for the remaining task pairs that have not undergone parallelism amplification is shown in Table 2 below.

TABLE 2
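
Since Table 2 is not reproduced here, the following is a minimal sketch of the data-path check described above; the struct and function names are assumptions, and it presumes both profiles record the same set of data paths. The default threshold of 100 (percent) follows the text above and is tunable.

```cpp
#include <map>
#include <string>

// Mirrors the data-path utilization portion of the per-task profile sketched earlier.
struct PathUsage {
    std::map<std::string, double> dataPathUtilization;  // percent per data path
};

// Two tasks may be packed only if, for every data path, the sum of their
// utilizations stays at or below the threshold.
bool canPack(const PathUsage &a, const PathUsage &b, double threshold = 100.0) {
    for (const auto &[path, utilA] : a.dataPathUtilization) {
        auto it = b.dataPathUtilization.find(path);
        double utilB = (it != b.dataPathUtilization.end()) ? it->second : 0.0;
        if (utilA + utilB > threshold) return false;  // contention on this data path
    }
    return true;
}
```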

In this embodiment, packing the selected task pair into a new task includes: adjusting the grid dimensions and thread block dimensions of the two tasks to one dimension; calculating the grid dimension and thread block dimension of the new task; converting the two tasks from global functions to device functions, and constructing the parameters passed to the new kernel function based on the original grid dimensions and thread block dimensions of the two tasks; constructing the body of the new kernel function; and adjusting the parameters and synchronization functions of the new device functions.

Since NVIDIA currently only provides MPS for such scheduling, and the scheduling granularity of MPS is the kernel level, MPS cannot utilize both kinds of computing hardware simultaneously. Meanwhile, once tasks are scheduled onto the GPU, the scheduling granularity on an SM is the warp level. To benefit from warp-level scheduling, this embodiment packs the two tasks, ensuring that they are submitted to the GPU simultaneously and can thus exploit the computing parallelism of the two kinds of computing units.

A specific implementation procedure for packing the two tasks into a new task is as follows.

First, the grid dimensions and block dimensions of the two GPU tasks are adjusted to one dimension, and the original grid and block dimensions are used as parameters of the new kernel function.

Second, the grid dimension of the new task is calculated, which is the maximum of the grid dimensions of the two tasks.

Third, the block dimension of the new task is calculated, which is the sum of the block dimensions of the two tasks.

Fourth, the two tasks are converted from global functions to device functions, and the parameters of the new kernel function are constructed from the original parameters of the two tasks and the dimensions of the original tasks.

Fifth, the body of the new kernel function is constructed. Within each block, the first M threads are responsible for kernel A, and the last N threads are responsible for kernel B.

Sixth, the parameters of the new device functions are adjusted. In addition to the original parameters, each new device function also receives the dimension information of its original kernel function; the original thread ID and block ID are recomputed from the passed-in dimension information to guarantee computational correctness.

Seventh, to ensure correct synchronization within each original task's threads, the __syncthreads() function is adjusted to a bar.sync barrier instruction.

In this embodiment, warp-level task packing uses source-to-source compilation and resolves partial thread synchronization to support warp-level task scheduling, thereby ensuring that the two tasks can simultaneously use the two kinds of GPU computing hardware.
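
By way of illustration, the following is a minimal CUDA sketch of a packed kernel for two toy one-dimensional kernels, following the seven steps above; the kernel names, parameter layout and launch wrapper are assumptions of this description, not the patent's actual generated code.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

// Original kernels rewritten as device functions (step four); each receives its
// original grid/block dimensions so it can recompute its own block/thread IDs (step six).
__device__ void kernelA_dev(float *a, int n, int origGridDim, int origBlockDim,
                            int blockId, int threadId) {
    int gid = blockId * origBlockDim + threadId;
    if (blockId < origGridDim && gid < n) a[gid] *= 2.0f;
}

__device__ void kernelB_dev(float *b, int n, int origGridDim, int origBlockDim,
                            int blockId, int threadId) {
    int gid = blockId * origBlockDim + threadId;
    if (blockId < origGridDim && gid < n) b[gid] += 1.0f;
}

// Packed kernel (step five): within each block, the first blockA threads run task A
// and the remaining blockB threads run task B, so both tasks share the SM's warps.
__global__ void packedKernel(float *a, int nA, int gridA, int blockA,
                             float *b, int nB, int gridB, int blockB) {
    if (threadIdx.x < blockA) {
        kernelA_dev(a, nA, gridA, blockA, blockIdx.x, threadIdx.x);
    } else if (threadIdx.x < blockA + blockB) {
        kernelB_dev(b, nB, gridB, blockB, blockIdx.x, threadIdx.x - blockA);
    }
    // Step seven (not shown): __syncthreads() calls inside the original kernels would
    // be replaced by bar.sync on separate barriers so each task synchronizes only itself.
}

// Launch: grid = max of the two grids (step two), block = sum of the two blocks (step three).
void launchPacked(float *a, int nA, int gridA, int blockA,
                  float *b, int nB, int gridB, int blockB) {
    dim3 grid(std::max(gridA, gridB));
    dim3 block(blockA + blockB);
    packedKernel<<<grid, block>>>(a, nA, gridA, blockA, b, nB, gridB, blockB);
}
```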

As shown in fig. 2 and fig. 3, to further aid understanding, the implementation process of warp-level task scheduling in this embodiment is described below.

S1, the user submits a task: the user writes their own program as required by calling the corresponding APIs.

S2, it is checked whether the task is a first task, i.e., whether the task arrives for the first time; if so, the method proceeds to S3 for offline analysis; otherwise, it proceeds to S5.

S3, offline analysis: the relevant information of the task is collected.

S4, parallelism amplification: based on the information from offline analysis, the block interval over which each task's performance varies is obtained; then, by adjusting the number of blocks submitted by each task, the maximum parallelism of the two tasks on the GPU is sought.

S5, online decision on the packed task pair: the online decision is made based on the task information obtained from offline analysis and parallelism amplification; if the task has not undergone parallelism-amplification analysis, the online packing decision is made from the memory system usage.

S6, task packing: warp-level task packing is performed based on the packing decision from the online decision step; the two tasks are combined into a new task through source-to-source compilation and submitted to the GPU, so that the two tasks enjoy warp-level scheduling.

S7, task running: the task is executed on the corresponding node; after execution finishes, the method proceeds to S8: outputting the result.

Example 2

As shown in fig. 4, the present embodiment provides a task scheduling system 100 for Warp-level scheduling, which includes: an offline analysis module 110, a parallelism amplification module 120, a decision module 130, and a task packing module 140.

In this embodiment, the offline analysis module 110 is configured to analyze the hardware information and the configuration information of the task submitted by the user offline when the task is the first task.

In this embodiment, when the task is a first task, analyzing offline the hardware information and configuration information of the task submitted by the user includes: collecting the usage of each data path by the task submitted by the user, and the performance of the task submitted by the user under different thread block configurations.

Specifically, in this embodiment, the demand that the task places on the GPU memory system is first analyzed, and the usage of each memory channel by the task submitted by the user is collected. Second, the change in task performance with the number of submitted thread blocks (blocks) is analyzed, and the interval of submitted thread block counts over which task performance is preserved is collected.

In this embodiment, the parallelism amplification module 120 is configured to acquire, based on the hardware information and configuration information of the task, the maximum parallelism when the task runs in parallel with mainstream tasks.

The principle of the parallelism amplification module 120 is the same as that of step S200 in embodiment 1; similar or identical technical features are not repeated.

In this embodiment, the decision module 130 is configured to make an online task-pair packing decision based on the hardware information of the task and the collected maximum-parallelism decisions when the task is a non-first task.

When a large number of tasks arrive, the decision module 130 decides in real time which two tasks can jointly exploit the parallelism of the two kinds of computing hardware. Because two tasks that run together share the GPU memory hierarchy, when one task does not use all of the GPU memory resources, another task can be scheduled onto the GPU.

After a task has been analyzed by the parallelism amplification module 120, task packing is performed directly using the task configuration produced by the parallelism amplification module 120. When an arriving task has not been analyzed by the parallelism amplification module 120, the decision module 130 makes the online decision on the packed task pair according to the memory usage information acquired by the offline analysis module 110.

In this embodiment, the task packing module 140 is configured to pack the selected task pair into a new task and submit the new task to the GPU, so that the two original GPU tasks in the task pair are scheduled at the warp level.

Specifically, in this embodiment, the task packing module 140 includes a dimension control unit, a parameter construction unit, a content construction unit and an adjustment unit.

In this embodiment, the dimension control unit is configured to adjust the grid dimensions and thread block dimensions of the two tasks to one dimension and to calculate the grid dimension and thread block dimension of the new task; the parameter construction unit is configured to convert the two tasks from global functions to device functions and to construct the parameters passed to the new kernel function based on the original grid dimensions and thread block dimensions of the two tasks; the content construction unit is configured to construct the body of the new kernel function; and the adjustment unit is configured to adjust the parameters and synchronization functions of the new device functions.

In this embodiment, the principles of the decision module 130 and the task packing module 140 are the same as those of step S300 in embodiment 1; similar or identical technical features are not repeated.

Example 3

The embodiment of the present invention further provides a GPU configured to run a computer program to implement the task scheduling method for Warp-level scheduling described in embodiment 1.

As shown in fig. 5, in the present embodiment, the GPU is a new type of GPU equipped with Tensor computation cores (Tensor Cores). Fig. 5 is a diagram of a GPU chip equipped with Tensor Cores. In fig. 5, SM denotes a coarse-grained compute unit of the GPU. Each SM contains two kinds of fine-grained computing units, namely stream processing cores (CUDA cores) and Tensor cores. Registers and the first-level cache (L1) are shared within each SM, and all SMs share the second-level cache (L2 cache) and DRAM memory.

Example 4

Embodiments of the present invention also provide an electronic device, which is, for example, a fixed terminal, such as a server, a desktop, or the like.

The electronic device applies the GPU described in embodiment 3, so that the electronic device performs the steps of the Warp-level task scheduling method of embodiment 1. Since the specific implementation of those steps has been described in detail in embodiment 1, it is not repeated here.

In summary, the present invention realizes high-throughput Warp-level task scheduling that requires no prior awareness from the user, can indirectly provide scheduling support for future GPUs configured with multiple kinds of computing units, and enables a dynamic task scheduling system for GPU chips configured with multiple kinds of computing units to provide task scheduling services to users. The invention therefore effectively overcomes various defects in the prior art and has high industrial utilization value.

The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical teachings disclosed by the present invention shall still be covered by the claims of the present invention.
