Many-core architecture with heterogeneous processors and data processing method thereof

Document No. 7240 | Published: 2021-09-17

1. A many-core architecture with heterogeneous processors, comprising: a many-core array, the many-core array comprising a plurality of computing cores and at least one processing core having a function different from that of the computing cores, wherein the processing core and the computing cores are provided with synchronized clocks, and the processing core communicates with adjacent computing cores through inter-core routing.

2. The many-core architecture with heterogeneous processors of claim 1, wherein the many-core array comprises:

a plurality of computing cores; and

at least one processing core integrated with an FPGA and/or at least one processing core integrated with a DSP.

3. The many-core architecture with heterogeneous processors of claim 2, wherein the many-core array is a two-dimensional matrix network, at least one processing core integrated with an FPGA and/or at least one processing core integrated with a DSP is disposed at a corner of the many-core array, and the processing core integrated with an FPGA and/or the processing core integrated with a DSP communicates with its two adjacent computing cores through two inter-core routing paths.

4. The many-core architecture with heterogeneous processors of claim 2, wherein the many-core array is a two-dimensional matrix network, at least one FPGA-integrated processing core and/or at least one DSP-integrated processing core is disposed inside the many-core array, and the FPGA-integrated processing core and/or the DSP-integrated processing core communicates with four computing cores adjacent thereto through four inter-core routing paths.

5. The many-core architecture with heterogeneous processors of claim 2, wherein the many-core array is a two-dimensional matrix network, at least one DSP-integrated processing core is disposed at a corner of the many-core array, the DSP-integrated processing core communicates with two adjacent computing cores via two inter-core routing paths, and at least one FPGA-integrated processing core is disposed inside the many-core array, the FPGA-integrated processing core communicates with four adjacent computing cores via four inter-core routing paths.

6. The many-core architecture with heterogeneous processors of claim 2, wherein the many-core array is a two-dimensional matrix network, at least one processing core of an integrated FPGA is disposed at a corner of the many-core array, the processing core of the integrated FPGA communicates with two adjacent computing cores via two inter-core routing paths, and at least one processing core of an integrated DSP is disposed inside the many-core array, the processing core of the integrated DSP communicates with four adjacent computing cores via four inter-core routing paths.

7. A data processing method for a many-core architecture with heterogeneous processors, the data processing method being implemented using the many-core architecture with heterogeneous processors as claimed in any one of claims 1 to 6, the data processing method comprising: transmitting operation data of a current computing core, through inter-core routing, to at least one processing core having a different function from the current computing core for computation.

8. A many-core chip employing a many-core architecture with heterogeneous processors as claimed in any of claims 1-6.

9. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer instructions, wherein the one or more computer instructions are executed by the processor to implement a data processing method for a many-core architecture with heterogeneous processors as claimed in claim 7.

10. A computer-readable storage medium having a computer program stored thereon, the computer program being executable by a processor to implement a data processing method of a many-core architecture with heterogeneous processors as claimed in claim 7.

Background

Currently, most AI chips adopt a homogeneous design, i.e., every processor core has the same structure. To control chip area and power consumption, each core's functions are kept simple, so the cores lack even basic logic-judgment and control mechanisms, and many new algorithms and models are therefore unsupported. Whenever a complex operation is encountered, either a special-purpose circuit must be designed or an off-chip CPU must handle it, resulting in low operational efficiency and high energy consumption.

Disclosure of Invention

In order to solve the above problems, an object of the present invention is to provide a many-core architecture with heterogeneous processors and a data processing method thereof. By adding a processing core integrated with an FPGA and a processing core integrated with a DSP to the many-core array, the architecture enables the array to handle various complex logic controls and complex scientific computations, improving operational efficiency and reducing energy consumption.

The invention provides a many-core architecture with heterogeneous processors, comprising: a many-core array, the many-core array comprising a plurality of computing cores and at least one processing core having a function different from that of the computing cores, wherein the processing core and the computing cores are provided with synchronized clocks, and the processing core communicates with adjacent computing cores through inter-core routing.

As a further improvement of the present invention, the many-core array comprises:

a plurality of computing cores; and

at least one processing core integrated with an FPGA and/or at least one processing core integrated with a DSP.

As a further improvement of the present invention, the many-core array is a two-dimensional matrix network, at least one FPGA-integrated processing core and/or at least one DSP-integrated processing core is disposed at a corner of the many-core array, and each such processing core communicates with its two adjacent computing cores through two inter-core routing paths.

As a further improvement of the present invention, the many-core array is a two-dimensional matrix network, at least one FPGA-integrated processing core and/or at least one DSP-integrated processing core is disposed inside the many-core array, and each such processing core communicates with its four adjacent computing cores through four inter-core routing paths.

As a further improvement of the present invention, the many-core array is a two-dimensional matrix network, at least one DSP-integrated processing core is disposed at a corner of the many-core array and communicates with its two adjacent computing cores through two inter-core routing paths, and at least one FPGA-integrated processing core is disposed inside the many-core array and communicates with its four adjacent computing cores through four inter-core routing paths.

As a further improvement of the present invention, the many-core array is a two-dimensional matrix network, at least one FPGA-integrated processing core is disposed at a corner of the many-core array and communicates with its two adjacent computing cores through two inter-core routing paths, and at least one DSP-integrated processing core is disposed inside the many-core array and communicates with its four adjacent computing cores through four inter-core routing paths.

As a further improvement of the present invention, the FPGA-integrated processing core is configured to handle operations that the computing cores cannot process, performing logic control and instruction judgment, and the DSP-integrated processing core is configured to handle operations that the computing cores cannot process, performing non-customized operations.

The invention also provides a data processing method implemented using the above many-core architecture with heterogeneous processors, the method comprising: transmitting operation data of a current computing core, through inter-core routing, to at least one processing core having a different function from the current computing core for computation.

As a further improvement of the invention, the operation data of the current computing core is transmitted from the current computing core to at least one processing core of the integrated FPGA and/or at least one processing core of the integrated DSP for operation through inter-core routing.

As a further improvement of the present invention, a single operation task is divided into a plurality of sub operation tasks, which are distributed to the plurality of computing cores and to at least one FPGA-integrated processing core and/or at least one DSP-integrated processing core, and these cores process their respective sub operation tasks in parallel.
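As a hedged sketch of the task-splitting step above (the core mix, chunking rule, and round-robin assignment are assumptions for illustration; the text does not specify a scheduling policy):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: split one operation task into sub operation
# tasks and run them in parallel across a mixed pool of cores.

def split_task(data, n_parts):
    """Divide a single operation task into n sub operation tasks."""
    step = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + step] for i in range(0, len(data), step)]

def run_on_core(core_kind, chunk):
    # Placeholder work: each core type would run its own kernel here.
    return (core_kind, sum(chunk))

cores = ["AI", "AI", "AI", "FPGA", "DSP"]   # assumed core mix
sub_tasks = split_task(list(range(10)), len(cores))

# Each core processes its own sub operation task concurrently.
with ThreadPoolExecutor(max_workers=len(cores)) as pool:
    results = list(pool.map(run_on_core, cores, sub_tasks))
```

The threads here stand in for physical cores only to show the distribution pattern; on the actual chip the sub-tasks would travel over inter-core routing.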

As a further improvement of the present invention, the operation data of the sub-operation tasks corresponding to the processing core of the integrated FPGA and/or the processing core of the integrated DSP is transmitted from the current computing core to the processing core of the integrated FPGA and/or the processing core of the integrated DSP through inter-core routing for operation.

As a further improvement of the invention, the FPGA-integrated or DSP-integrated processing core closest to the computing core holding the operation data of the current single sub operation task is located through inter-core routing; the operation data is transmitted to that processing core for computation, and after the computation finishes, the result is transmitted through inter-core routing to the next computing core for further processing.
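The "closest processing core" lookup above can be sketched as follows. Approximating inter-core routing distance by Manhattan distance on the two-dimensional mesh is an assumption; the text does not state the routing metric:

```python
# Hypothetical nearest-core search on a two-dimensional mesh, using
# Manhattan distance as a stand-in for inter-core routing hop count.

def nearest_core(src, candidates):
    """Return the candidate core position closest to src."""
    return min(candidates,
               key=lambda c: abs(c[0] - src[0]) + abs(c[1] - src[1]))

# Example layout: DSP-integrated cores at two corners of a 4 x 4 array.
dsp_cores = [(0, 0), (3, 3)]
print(nearest_core((1, 0), dsp_cores))   # one hop from (0, 0)
```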

As a further improvement of the invention, for multiple current sub operation tasks, the FPGA-integrated and/or DSP-integrated processing cores closest to the computing cores holding the corresponding operation data are located through inter-core routing; the operation data are transmitted to the respective processing cores for computation, and after each computation finishes, the results are transmitted through inter-core routing to the next computing cores for further processing.

As a further improvement of the present invention, when the operation data that the computing core cannot process comprises logic control and judgment instruction data, it is transmitted from the current computing core through inter-core routing to the FPGA-integrated processing core for computation; when it comprises non-customized operation data, it is transmitted from the current computing core through inter-core routing to the DSP-integrated processing core for computation.
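The dispatch rule above can be sketched as a simple routing function. The operation-category names below are assumptions chosen for illustration, not terms defined by the invention:

```python
# Hypothetical dispatch rule: logic control / judgment operations go
# to an FPGA-integrated core, non-customized operations (e.g. sin,
# cos, log) go to a DSP-integrated core, and everything else stays
# on the ordinary computing core.

LOGIC_OPS = {"branch", "compare", "select"}          # assumed set
NON_CUSTOMIZED_OPS = {"sin", "cos", "log", "exp"}    # assumed set

def route_operation(op):
    if op in LOGIC_OPS:
        return "FPGA"
    if op in NON_CUSTOMIZED_OPS:
        return "DSP"
    return "compute"   # e.g. customized floating-point / integer ops

assert route_operation("compare") == "FPGA"
assert route_operation("sin") == "DSP"
assert route_operation("matmul") == "compute"
```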

The invention also provides a many-core chip which adopts the many-core architecture with the heterogeneous processors.

The invention also provides an electronic device comprising a memory and a processor, wherein the memory is used for storing one or more computer instructions, and the one or more computer instructions are executed by the processor to realize the data processing method of the many-core architecture with the heterogeneous processors.

The invention also provides a computer readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the data processing method of the many-core architecture with heterogeneous processors.

The invention has the beneficial effects that:

The existing network-on-chip structure need not be modified; a newly added processing core type is supported merely by adding a node type, realizing the many-core architecture with heterogeneous processors. Integrating FPGA and DSP processing cores into the many-core array enables it to handle various complex logic controls and complex scientific computations, greatly saving transmission bandwidth, reducing energy consumption, improving operational efficiency, and accelerating neural-network inference/training. Moreover, no instruction scheduling by an on-chip processor is needed, so a pipelined data processing mode can be formed, reducing processing latency and improving processing efficiency. Finally, exploiting the programmable nature of the FPGA and DSP, the cores can be recompiled as computing, logic, and scheduling needs change, offering greater flexibility than existing compute-core-only many-core architectures.

Drawings

In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is to be understood that the drawings in the following description are merely exemplary of the disclosure, and that other drawings may be derived from those drawings by one of ordinary skill in the art without undue inventive faculty.

Fig. 1 is a schematic diagram of a many-core architecture with heterogeneous processors according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.

It should be noted that, if directional indications (such as up, down, left, right, front, and back) are involved in the disclosed embodiment, the directional indications are only used to explain the relative position relationship between the components, the motion situation, and the like in a specific posture (as shown in the drawing), and if the specific posture is changed, the directional indications change accordingly.

In addition, in the description of the present disclosure, the terms used are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The terms "comprises" and/or "comprising" specify the presence of stated elements, steps, operations, and/or components, but do not preclude the presence or addition of one or more other elements, steps, operations, and/or components. The terms "first," "second," and the like may be used to describe various elements but do not necessarily denote order and do not limit those elements; they serve only to distinguish one element from another. In the description of the present disclosure, "a plurality" means two or more unless otherwise specified. These and other aspects will become apparent from the following drawings and description, which are provided only to illustrate the described embodiments of the disclosure. One skilled in the art will readily recognize that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described in the present disclosure.

Currently, most AI chips adopt a homogeneous design, i.e., every processor core has the same structure. To control chip area and power consumption, each core's functions are kept simple, so the cores lack even basic logic-judgment and control mechanisms, and many new algorithms and models are unsupported. When a complex operation is encountered, a special-purpose circuit must be designed or an off-chip CPU must process it. A special-purpose circuit makes later optimization of the chip inflexible and delays development, while off-chip CPU processing interrupts the originally designed pipelined parallel processing, degrading performance.

A many-core architecture with heterogeneous processors of an embodiment of the present disclosure includes: a many-core array, the many-core array comprising a plurality of computing cores and at least one processing core having a function different from that of the computing cores, wherein synchronized clocks are provided between the processing cores and the computing cores, and the processing cores communicate with adjacent computing cores through inter-core routing.

In one implementation, a many-core array includes:

a plurality of computing cores; and

At least one processing core integrated with an FPGA and/or at least one processing core integrated with a DSP.

In the many-core architecture with heterogeneous processors described in this embodiment of the disclosure, the existing network-on-chip structure need not be modified: a newly added processing core type is supported merely by adding a node type, thereby realizing the heterogeneous many-core architecture. The newly added processing cores (the FPGA-integrated processing core and the DSP-integrated processing core) can be used like ordinary computing cores, so the upper-layer software and the various applications that already call it need not be modified. The added heterogeneous processors can simultaneously support various operations, including logic control and judgment instruction operations and non-customized operations, where a non-customized operation may refer to any operation other than those supported by the computing cores; the many-core chip can therefore support various algorithms and models. When a complex operation is encountered, the DSP-integrated processing core is invoked to complete the corresponding non-customized operation, and when multiple DSP-integrated processing cores are provided, multiple small networks can perform non-customized operations simultaneously. When a complex logic-control operation is encountered, the FPGA-integrated processing core is invoked to complete it, and when multiple FPGA-integrated processing cores are provided, multiple small networks can perform logic control simultaneously. When the original computing cores perform operations, the FPGA and/or DSP need not be scheduled through an on-chip CPU, and the parallel pipelined processing of data is maintained, reducing processing latency and improving processing efficiency. Since the newly added FPGA-integrated and DSP-integrated processing cores are programmable, they can be recompiled as computing, logic, and scheduling needs change, offering greater flexibility than existing compute-core-only many-core architectures.

In one implementation, the computing core comprises an AI computing unit, a storage unit, and a router; the FPGA-integrated processing core comprises an FPGA computing unit, a storage unit, and a router; and the DSP-integrated processing core comprises a DSP computing unit, a storage unit, and a router. The computing core, the FPGA-integrated processing core, and the DSP-integrated processing core have the same storage-unit and routing configuration; for example, their storage units may have the same capacity. The three core types are designed as interchangeable modules with the same timing control and communication mode, forming a uniformly structured two-dimensional mesh array.
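As a minimal model of this uniform-tile design (the record fields and the 256 KB capacity are assumptions for illustration, not values given by the text), each core type can be represented as the same tile shape differing only in its compute unit:

```python
from dataclasses import dataclass

# Hypothetical model: every tile pairs a compute unit with an
# identical storage unit and router, so the array can integrate
# the three core types as interchangeable modules.

@dataclass
class Core:
    kind: str               # "AI", "FPGA", or "DSP" compute unit
    row: int                # position in the two-dimensional mesh
    col: int
    storage_kb: int = 256   # assumed uniform storage capacity
    has_router: bool = True # every tile carries an inter-core router

compute = Core("AI", 0, 1)
fpga = Core("FPGA", 1, 1)
dsp = Core("DSP", 0, 0)

# All tiles expose the same storage/routing interface.
assert compute.storage_kb == fpga.storage_kb == dsp.storage_kb
```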

In another implementation, the storage capacities of the storage units of the computing core, the FPGA-integrated processing core, and the DSP-integrated processing core may differ from one another, either entirely or in part.

In general, the many-core array may be a two-dimensional matrix network, a two-dimensional ring network, a two-dimensional star network, or a three-dimensional hierarchical network, selected and designed according to the functions and requirements of the chip. The numbers of FPGA-integrated and DSP-integrated processing cores can likewise be designed adaptively. For example, when the volume of logic control and judgment instructions is small, one FPGA-integrated processing core may suffice in the chip's many-core array, and when it is large, the number of FPGA-integrated processing cores may be increased appropriately. Similarly, when the volume of non-customized operations is small, one DSP-integrated processing core may be provided, and when it is large, the number of DSP-integrated processing cores may be increased appropriately.

As previously described, a non-customized operation may refer to any operation other than those supported by a computing core, and may include, for example and without limitation, sin, cos, log, and exponent. For instance, when the operations hard-wired or customized in an ordinary computing core cover floating-point and integer arithmetic but not sin, cos, log, or exponent, those non-customized operations may be processed by the DSP-integrated processing core.

For example, during face recognition, the positions of faces in the acquired pictures differ and the shooting angles differ, so the selected regions of interest (ROIs) differ in location and size. Because the data may change dynamically during processing, non-customized operations that the computing core cannot handle may arise. In the prior art, such data could only be transmitted to a CPU for processing, whereas the present disclosure processes these non-customized operations directly on chip through the DSP-integrated processing core. In addition, designing multiple FPGA-integrated and/or DSP-integrated processing cores in one many-core array allows the array to support simultaneous operation of multiple small networks, improving the overall operation speed of the chip.

In a preferred embodiment, the many-core array is designed as a two-dimensional square network; the symmetrical structure optimizes data transmission and processing, chip power consumption, heat dissipation, and the like, improving the overall performance of the chip.

As for the positions of the FPGA-integrated and DSP-integrated processing cores in the many-core array, their optimal positions can be determined according to the positions of the computing cores to which operation data is input, so that the closest FPGA-integrated and/or DSP-integrated processing core can be found quickly, improving the overall operation rate of the chip.

In one implementation, the many-core array is a two-dimensional matrix network, at least one FPGA-integrated processing core and/or at least one DSP-integrated processing core is disposed at a corner of the many-core array, and each such processing core communicates with its two adjacent computing cores through two inter-core routing paths. For example, a single FPGA-integrated or DSP-integrated processing core may occupy one corner; multiple FPGA-integrated or multiple DSP-integrated processing cores may occupy multiple corners; or any combination of one or more FPGA-integrated and one or more DSP-integrated processing cores may be placed at the corners. Each additional FPGA-integrated processing core increases the chip's capacity for logic control, instruction judgment, and the like, and each additional DSP-integrated processing core increases its capacity for non-customized operations, improving operational efficiency. In a specific design, however, the number and positions of the FPGA-integrated and DSP-integrated processing cores must be chosen with the chip's area, function, energy consumption, and the like taken into account.

In another implementation, the many-core array is a two-dimensional matrix network, at least one FPGA-integrated processing core and/or at least one DSP-integrated processing core is disposed inside the many-core array, and each such processing core communicates with its four adjacent computing cores through four inter-core routing paths. As at the corners, any combination of one or more FPGA-integrated and one or more DSP-integrated processing cores may be placed inside the array. As in the above embodiment, more processing cores increase the chip's complex-computation capability, but in a specific design their number and positions must be chosen with the chip's area, function, energy consumption, and the like taken into account.

In one implementation, multiple FPGA-integrated or multiple DSP-integrated processing cores may be placed on a diagonal inside the many-core array; this diagonally symmetric arrangement is particularly well suited to two-dimensional square networks. For example, in a two-dimensional 4 × 4 network, one FPGA-integrated or one DSP-integrated processing core may be placed at (row 2, column 2) and another at (row 3, column 3) of the many-core array, or at (row 2, column 3) and (row 3, column 2). Alternatively, processing cores may be placed at all four interior positions, (row 2, column 2), (row 2, column 3), (row 3, column 2), and (row 3, column 3); this arrangement is generally used when the volume of logic control and judgment instructions or of non-customized operations is large. FPGA-integrated processing cores may also occupy one diagonal pair of interior positions while DSP-integrated processing cores occupy the other. As in the foregoing embodiments, more processing cores increase the complex-operation capability of the chip, but the number and positions must be chosen with the chip's area, function, energy consumption, and the like taken into account.

In one implementation, the many-core array is a two-dimensional matrix network; at least one DSP-integrated processing core is disposed at a corner of the many-core array and communicates with its two adjacent computing cores through two inter-core routing paths, while at least one FPGA-integrated processing core is disposed inside the many-core array and communicates with its four adjacent computing cores through four inter-core routing paths. For example, one DSP-integrated processing core may be placed at a corner of the many-core array and one FPGA-integrated processing core inside it; or one DSP-integrated processing core at each of several corners and one FPGA-integrated processing core inside; or one DSP-integrated processing core at one corner and several FPGA-integrated processing cores inside; or DSP-integrated processing cores at several corners and several FPGA-integrated processing cores inside. As in the embodiments above, the positions and numbers of the two kinds of processing cores are chosen according to the volume of logic control and judgment instruction operations and non-customized operations the chip must handle, weighed against the chip's area, function, energy consumption, and the like. Fig. 1 shows an example of a many-core architecture with heterogeneous processors according to an embodiment of the present disclosure: in a 4 × 4 two-dimensional network, DSP-integrated processing cores are disposed at the four corners, and FPGA-integrated processing cores at the second row, second column and the third row, third column.
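The layout of Fig. 1 and its routing-path counts can be modeled as a small sketch. Coordinates and type tags are illustrative assumptions; the point is that a corner processing core has exactly two mesh neighbors while an interior one has four, matching the two and four inter-core routing paths described above.

```python
# Model of the Fig. 1 example: a 4x4 two-dimensional network with
# DSP-integrated processing cores at the four corners and FPGA-integrated
# processing cores at (1,1) and (2,2) (0-based row, column).

N = 4
layout = {(r, c): "compute" for r in range(N) for c in range(N)}
for corner in [(0, 0), (0, N - 1), (N - 1, 0), (N - 1, N - 1)]:
    layout[corner] = "dsp"
layout[(1, 1)] = "fpga"
layout[(2, 2)] = "fpga"

def neighbors(r, c):
    """Adjacent cores reachable through inter-core routing on the 2D mesh."""
    candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
    return [(x, y) for x, y in candidates if 0 <= x < N and 0 <= y < N]

print(len(neighbors(0, 0)))  # 2: a corner core has two routing paths
print(len(neighbors(1, 1)))  # 4: an interior core has four routing paths
```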

In one implementation, the many-core array is a two-dimensional matrix network; an FPGA-integrated processing core is disposed at a corner of the many-core array and communicates with its two adjacent computing cores through two inter-core routing paths, while a DSP-integrated processing core is disposed inside the many-core array and communicates with its four adjacent computing cores through four inter-core routing paths. For example, one FPGA-integrated processing core may be placed at a corner of the many-core array and one DSP-integrated processing core inside it; or one FPGA-integrated processing core at each of several corners and one DSP-integrated processing core inside; or one FPGA-integrated processing core at one corner and several DSP-integrated processing cores inside; or FPGA-integrated processing cores at several corners and several DSP-integrated processing cores inside. As in the embodiments above, the positions and numbers of the two kinds of processing cores are chosen according to the required volume of logic control and judgment instructions and non-customized operations, weighed against the chip's area, function, energy consumption, and the like.

In the embodiments above, the FPGA-integrated processing core may serve as a general computing core and may also process complex operation data that the computing cores cannot handle, such as the operation data of logic control and judgment instructions; likewise, the DSP-integrated processing core may serve as a general computing core and may also process complex operation data that the computing cores cannot handle, such as non-customized operation data.

The data processing method for the many-core architecture with heterogeneous processors in the embodiments of the present disclosure uses the architecture described above and comprises: transmitting the operation data of the current computing core, through inter-core routing, to at least one processing core whose function differs from that of the computing core for computation.

In one implementation, the processing cores that differ in function from the computing cores may be designed as FPGA-integrated processing cores and/or DSP-integrated processing cores. The data processing method then comprises: transmitting the operation data of the current computing core, through inter-core routing, to at least one FPGA-integrated processing core and/or at least one DSP-integrated processing core for computation.

In one implementation, the many-core architecture with heterogeneous processors of the embodiments of the present disclosure preserves parallel pipeline processing of data: a single operation task is divided into multiple sub-operation tasks, which are allocated to multiple computing cores, at least one FPGA-integrated processing core, and/or at least one DSP-integrated processing core, and the cores process their respective sub-operation tasks in parallel. In other words, each computing core and processing core handles only its own sub-operation task, and the sub-operation tasks are processed simultaneously. In this mode, inter-core data transmission and computation require no intervention or scheduling by an on-chip CPU, which reduces processing latency and improves processing efficiency.
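The split-and-process-in-parallel flow above can be sketched in software. This is a hedged analogy, not the hardware mechanism: a thread pool stands in for the core array, and the partial-sum workload is an invented placeholder for a sub-operation task.

```python
# Sketch of parallel pipeline processing: one operation task divided into
# sub-operation tasks that the "cores" (here, worker threads) process
# simultaneously, with no central scheduler touching each transfer.
from concurrent.futures import ThreadPoolExecutor

def split_task(data, n_subtasks):
    """Divide a single operation task into roughly equal sub-operation tasks."""
    k, m = divmod(len(data), n_subtasks)
    chunks, start = [], 0
    for i in range(n_subtasks):
        end = start + k + (1 if i < m else 0)
        chunks.append(data[start:end])
        start = end
    return chunks

def core_process(chunk):
    """Stand-in for one core's sub-operation (here: a partial sum)."""
    return sum(chunk)

data = list(range(100))
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(core_process, split_task(data, 4)))
print(sum(partials))  # 4950: same result as processing the task on one core
```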

In one implementation, after a single operation task is divided into sub-operation tasks, the operation data of each sub-operation task is transmitted from the computing cores to the corresponding processing cores: operation data belonging to sub-operation tasks assigned to the FPGA-integrated and/or DSP-integrated processing cores is routed from the current computing core, through inter-core routing, to those processing cores for computation.

In one implementation, the FPGA-integrated processing core may be used to process logic control and judgment instructions and the like, while the DSP-integrated processing core may be used to process non-customized operations and the like. Operation data of logic control and judgment instructions that the computing cores cannot process is transmitted from the current computing core, through inter-core routing, to the FPGA-integrated processing core for computation; non-customized operation data that the computing cores cannot process is likewise transmitted, through inter-core routing, to the DSP-integrated processing core for computation.
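The division of labor above amounts to a dispatch rule keyed on operation type. The following sketch makes that rule explicit; the operation-type tags and core names are assumptions for illustration only.

```python
# Illustrative dispatch rule: logic control / judgment instructions go to an
# FPGA-integrated core, non-customized operations to a DSP-integrated core,
# and ordinary operations stay on the plain computing cores.

def route_operation(op_type):
    """Choose the kind of core that should process an operation."""
    if op_type in ("logic_control", "judgment"):
        return "fpga_core"
    if op_type == "non_customized":
        return "dsp_core"
    return "compute_core"

print(route_operation("judgment"))        # fpga_core
print(route_operation("non_customized"))  # dsp_core
print(route_operation("matmul"))          # compute_core
```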

In one implementation, when the many-core array contains multiple processing cores and the current computing core encounters a sub-operation task it does not support (for example, a logic control and judgment instruction or a non-customized operation), inter-core routing is used to locate the FPGA-integrated or DSP-integrated processing core closest to the computing core holding the operation data of that sub-operation task. The operation data is transmitted to that processing core for computation, and once the computation completes, the result is transmitted through inter-core routing to the next computing core to continue the operation. Quickly locating the nearest FPGA-integrated or DSP-integrated processing core improves the overall operation rate of the chip.
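The "closest processing core" search can be sketched as a minimum over routing hop counts, assuming hop count on the 2D mesh equals Manhattan distance (a common assumption for XY-routed networks-on-chip; the patent does not specify the routing algorithm). Core positions here are illustrative.

```python
# Sketch of nearest-processing-core lookup on a 2D mesh, assuming the
# inter-core routing hop count is the Manhattan distance between cores.

def nearest_core(src, candidates):
    """Return the candidate core position reachable from `src` in the
    fewest routing hops."""
    return min(candidates,
               key=lambda p: abs(p[0] - src[0]) + abs(p[1] - src[1]))

fpga_cores = [(1, 1), (2, 2)]   # illustrative interior FPGA positions

# A computing core at (0, 1) offloads an unsupported sub-operation task
# to the nearest FPGA-integrated core, then forwards the result onward.
print(nearest_core((0, 1), fpga_cores))  # (1, 1)
```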

In one implementation, when the many-core array contains only a single logic control and judgment instruction sub-operation task or a single non-customized sub-operation task: inter-core routing locates the FPGA-integrated processing core closest to the computing core holding the operation data of the current logic control and judgment instruction sub-operation task, that data is transmitted to the FPGA-integrated processing core for logic control and judgment instruction computation, and when the computation finishes, the result is transmitted through inter-core routing to the next computing core to continue the operation; or inter-core routing locates the DSP-integrated processing core closest to the computing core holding the non-customized operation data of the current non-customized sub-operation task, that data is transmitted to the DSP-integrated processing core for non-customized computation, and when the computation finishes, the result is transmitted through inter-core routing to the next computing core to continue the operation.

In one implementation, when the many-core array contains multiple processing cores and the current computing core encounters multiple unsupported sub-operation tasks (for example, one logic control and judgment instruction sub-operation and one non-customized sub-operation; one of the former and several of the latter; several of the former and one of the latter; or several of each), inter-core routing is used to locate, for each sub-operation task, the FPGA-integrated and/or DSP-integrated processing core closest to the computing core holding that task's operation data. The operation data is transmitted to those processing cores for computation, and as each computation completes, its result is transmitted through inter-core routing to the next computing core to continue the operation.

In one implementation, when the many-core array contains a single logic control and judgment instruction sub-operation task and a single non-customized sub-operation task: inter-core routing locates the FPGA-integrated processing core closest to the computing core holding the logic control and judgment instruction operation data, that data is transmitted there for logic control and judgment instruction computation, and the result is then routed to the next computing core to continue the operation; and inter-core routing locates the DSP-integrated processing core closest to the computing core holding the non-customized operation data, that data is transmitted there for non-customized computation, and the result is then routed to the next computing core to continue the operation.

In one implementation, when the many-core array contains a single logic control and judgment instruction sub-operation task and multiple non-customized sub-operation tasks: inter-core routing locates the FPGA-integrated processing core closest to the computing core holding the logic control and judgment instruction operation data, that data is transmitted there for logic control and judgment instruction computation, and the result is then routed to the next computing core to continue the operation; and inter-core routing locates, for each non-customized sub-operation task, the DSP-integrated processing core closest to the computing core holding its operation data, that data is transmitted to those DSP-integrated processing cores for non-customized computation, and each result is then routed to the next computing core to continue the operation.

In one implementation, when the many-core array contains multiple logic control and judgment instruction sub-operation tasks and a single non-customized sub-operation task: inter-core routing locates, for each logic control and judgment instruction sub-operation task, the FPGA-integrated processing core closest to the computing core holding its operation data, that data is transmitted to those FPGA-integrated processing cores for logic control and judgment instruction computation, and each result is then routed to the next computing core to continue the operation; and inter-core routing locates the DSP-integrated processing core closest to the computing core holding the non-customized operation data, that data is transmitted there for non-customized computation, and the result is then routed to the next computing core to continue the operation.

In one implementation, when the many-core array contains multiple logic control and judgment instruction sub-operation tasks and multiple non-customized sub-operation tasks: inter-core routing locates, for each logic control and judgment instruction sub-operation task, the FPGA-integrated processing core closest to the computing core holding its operation data, that data is transmitted to those FPGA-integrated processing cores for logic control and judgment instruction computation, and each result is then routed to the next computing core to continue the operation; and inter-core routing locates, for each non-customized sub-operation task, the DSP-integrated processing core closest to the computing core holding its operation data, that data is transmitted to those DSP-integrated processing cores for non-customized computation, and each result is then routed to the next computing core to continue the operation.
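The four cases above (one or many tasks of each kind) reduce to one rule: each pending sub-operation task goes to the nearest processing core of the matching type. The following combined sketch makes that uniform rule explicit; positions, type tags, and the Manhattan-distance hop model are illustrative assumptions.

```python
# Combined sketch: dispatch every pending sub-operation task to the nearest
# processing core of the appropriate kind, whether there is one such task
# or several of each kind.

FPGA_CORES = [(1, 1), (2, 2)]                      # interior, illustrative
DSP_CORES = [(0, 0), (0, 3), (3, 0), (3, 3)]       # corners, illustrative

def hops(a, b):
    """Assumed inter-core routing cost: Manhattan distance on the mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def dispatch(tasks):
    """tasks: list of (op_type, source computing-core position).
    Returns (op_type, source, chosen processing core) for each task."""
    plan = []
    for op_type, src in tasks:
        pool = FPGA_CORES if op_type == "judgment" else DSP_CORES
        plan.append((op_type, src, min(pool, key=lambda p: hops(src, p))))
    return plan

pending = [("judgment", (0, 1)),        # -> nearest FPGA-integrated core
           ("non_customized", (0, 1)),  # -> nearest DSP-integrated core
           ("non_customized", (2, 3))]
for op, src, dst in dispatch(pending):
    print(op, src, "->", dst)
```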

The present disclosure also relates to a many-core chip that adopts the many-core architecture with heterogeneous processors of the embodiments above. The many-core chip exploits the FPGA's strength in processing complex logic control and judgment instructions and the DSP's strength in scientific computation; by merely adding a new node type, without changing the existing network-on-chip structure, the chip can support a variety of complex logic control and complex operations while still performing ordinary AI computation. Because the FPGA and DSP are integrated inside the many-core array and communicate with the outside through the routing of the AI computing core array, the chip's energy consumption is reduced. The many-core chip can be applied in the field of artificial intelligence: by adding heterogeneous cores to the AI computing array, algorithms, logic control and judgment instructions, non-customized operations, and the like that the AI computing cores cannot process, or process inefficiently, can be handled, greatly saving transmission bandwidth, reducing energy consumption, improving operation efficiency, and accelerating the inference/training process of a neural network.

The present disclosure also relates to an electronic device, including a server, a terminal, and the like. The electronic device comprises: at least one processor; a memory communicatively coupled to the at least one processor; and a communication component communicatively coupled to the memory, the communication component receiving and transmitting data under control of the processor. The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to implement the data processing method of the many-core architecture with heterogeneous processors in the embodiments above.

In an alternative embodiment, the memory, as a non-volatile computer-readable storage medium, stores non-volatile software programs, non-volatile computer-executable programs, and modules. The processor executes the various functional applications and data processing of the device by running the non-volatile software programs, instructions, and modules stored in the memory, thereby realizing the data processing method of the many-core architecture with heterogeneous processors.

The memory may include a program storage area and a data storage area, where the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store an option list and the like. Further, the memory may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor; such remote memory may be connected to the external device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

One or more modules are stored in the memory and, when executed by the one or more processors, perform the data processing method of the many-core architecture with heterogeneous processors in any of the method embodiments described above.

The product described above can execute the data processing method of the many-core architecture with heterogeneous processors provided in the embodiments of the present application, and possesses the corresponding functional modules and beneficial effects for executing the method.

The present disclosure also relates to a computer-readable storage medium storing a computer-readable program for causing a computer to perform some or all of the above-described embodiments of a data processing method for a many-core architecture with heterogeneous processors.

That is, as those skilled in the art can understand, all or part of the steps of the methods in the embodiments above may be implemented by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Furthermore, those of ordinary skill in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.

It will be understood by those skilled in the art that while the present invention has been described with reference to exemplary embodiments, various changes may be made and equivalents may be substituted for elements thereof without departing from the scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.
