Instruction processing apparatus, processor, chip, computing device, and corresponding method
1. An instruction processing apparatus comprising:
an execution unit;
at least one performance register set for recording execution information of the execution unit during execution of an instruction to be debugged;
a configuration register for configuring debug parameters; and
trigger circuitry configured to enable one or more of the at least one performance register set to record the execution information in accordance with the debug parameters.
2. The instruction processing apparatus according to claim 1, wherein the performance register set comprises:
at least one time counter for recording a time at which a first preset event occurs; and/or
at least one event counter for recording a number of occurrences of a second preset event.
3. The instruction processing apparatus according to claim 2, wherein the first preset event comprises one or more of: start of instruction execution, completion of instruction execution, start of a memory read, completion of a memory read, start of a memory write, and completion of a memory write;
the second preset event comprises one or more of: use of a compute unit, a read of a memory block, and a write of a memory block.
4. The instruction processing apparatus according to any of claims 1-3, wherein the instruction to be debugged is one or more instructions in a target instruction sequence marked by a marker instruction, the debug parameters comprise a debug mode, and the debug mode comprises a marker mode,
the trigger circuit is further configured to:
in a case where the debug mode is the marker mode, identify whether a current instruction is the marker instruction; and
in response to determining that the current instruction is the marker instruction, enable one or more of the at least one performance register set to record the execution information.
5. The instruction processing apparatus of claim 4, wherein the marker instruction comprises a first field indicating a marker type, the marker type comprising a debug start and a debug end, and a second field indicating a performance register set,
the trigger circuit is further configured to: in response to determining that the current instruction is the marker instruction, enable the performance register set corresponding to the second field to record the execution information.
6. The instruction processing apparatus according to any one of claims 1 to 3, wherein the debug parameters include a debug mode including an auto mode, the debug parameters further including a loop parameter corresponding to the auto mode, the loop parameter being for dividing the instruction to be debugged into a plurality of instruction groups,
the trigger circuit is further configured to:
in a case where the debug mode is the auto mode, cyclically enabling each of the at least one performance register set to record execution information during execution of each of the plurality of instruction groups, respectively.
7. The instruction processing apparatus of claim 6, wherein the loop parameter comprises:
a start instruction number indicating the number of a first instruction in a first instruction group in a loop;
a single-debug instruction count indicating the number of instructions included in each instruction group;
an instruction interval indicating the difference between the numbers of the first instructions in two adjacent instruction groups; and
a loop increment indicating an amount by which the start instruction number is increased after one loop is completed.
8. The instruction processing apparatus according to any one of claims 1-7, further comprising:
a timer configured to start timing when the instruction to be debugged starts executing and end timing when the instruction to be debugged finishes executing.
9. A processor comprising at least one instruction processing apparatus as claimed in any one of claims 1 to 8.
10. A chip comprising at least one processor as claimed in claim 9.
11. A computing device comprising the chip of claim 10.
12. A method of evaluating instruction execution performance, comprising:
acquiring execution information of an instruction to be debugged when executed by one or more instruction processing apparatuses, wherein each instruction processing apparatus is the instruction processing apparatus according to any one of claims 1-8;
determining a performance index of the instruction to be debugged according to the execution information; and
judging, according to the performance index, whether the execution performance of the instruction to be debugged reaches a target.
13. The method according to claim 12, wherein the instruction to be debugged is executed by an instruction processing apparatus, the performance index includes a first data amount read from an external memory by the instruction processing apparatus per unit time and a second data amount calculated by a calculation unit of the instruction processing apparatus per unit time,
the judging, according to the performance index, whether the execution performance of the instruction to be debugged reaches a target comprises:
judging whether the first data amount matches the second data amount; and
in response to determining that the first data amount matches the second data amount, determining that the execution performance of the instruction to be debugged reaches the target.
14. The method of claim 12, wherein the instruction to be debugged is executed by an instruction processing apparatus, and the performance index comprises at least one of: a first proportion of the number of partial writes to an external memory by the instruction processing apparatus to the total number of writes to the external memory, a second proportion of the number of times access to an internal memory by the instruction processing apparatus is blocked to the total number of accesses to the internal memory, and a third proportion of the number of uses of a computing unit in the instruction processing apparatus to the maximum available number of uses of the computing unit,
the judging, according to the performance index, whether the execution performance of the instruction to be debugged reaches a target comprises: judging whether the execution performance of the instruction to be debugged reaches the target according to the relative magnitude of the performance index and a preset value.
15. The method according to any one of claims 12 to 14, wherein the instruction to be debugged is executed by a plurality of instruction processing apparatuses in cooperation, each of the plurality of instruction processing apparatuses executes a part of the instruction to be debugged, the execution information includes time period information for which the plurality of instruction processing apparatuses each execute the corresponding part of the instruction,
the determining the performance index of the instruction to be debugged according to the execution information comprises:
generating an execution timing diagram of the instruction to be debugged according to the respective time period information of the plurality of instruction processing apparatuses.
16. The method according to any one of claims 12 to 14, wherein the instruction to be debugged is divided into a plurality of instruction fragments, each instruction fragment includes a plurality of instruction groups, and two adjacent instruction fragments share a part of the same instruction groups; the instruction to be debugged is executed by the one or more instruction processing apparatuses a plurality of times, and on each execution the execution time information of one instruction fragment is recorded by the performance register set of the corresponding instruction processing apparatus, the execution time information including an execution time of each instruction group included in the corresponding instruction fragment,
the acquiring execution information of the instruction to be debugged when executed by one or more instruction processing apparatuses comprises: respectively acquiring the execution time information of the corresponding instruction fragment recorded during each execution of the instruction to be debugged;
the determining the performance index of the instruction to be debugged according to the execution information comprises:
determining a time offset according to the execution times of the same instruction group in two adjacent instruction fragments;
calibrating, according to the time offset, the execution time information of the instruction fragments other than the first instruction fragment among the plurality of instruction fragments; and
determining the total execution time of the instruction to be debugged according to the calibrated execution time information.
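The calibration described in claim 16 can be illustrated with a small software sketch. This is an illustration only, not the claimed method: the representation of a fragment as a mapping from instruction-group identifiers to recorded start times (each in its own run's time base) is an assumption made for the sketch.

```python
def calibrate_fragments(fragments):
    """Align per-run fragment timings to a common time base.

    Each fragment is a dict {instruction_group_id: recorded_time} in its
    own run's time base. Two adjacent fragments share at least one
    instruction group; the difference between the two recorded times of a
    shared group gives the time offset, which is applied to every later
    fragment so that all times are expressed in the first fragment's base.
    """
    calibrated = [dict(fragments[0])]
    for prev, cur in zip(fragments, fragments[1:]):
        # Find an instruction group recorded in both adjacent fragments.
        shared = next(g for g in cur if g in prev)
        # Offset between the (already calibrated) previous run's clock
        # and the current run's clock.
        offset = calibrated[-1][shared] - cur[shared]
        calibrated.append({g: t + offset for g, t in cur.items()})
    return calibrated


# Usage: the second run's clock lags by 7 time units; the shared group
# "g2" reveals the offset.
runs = [{"g1": 0, "g2": 10}, {"g2": 3, "g3": 13}]
aligned = calibrate_fragments(runs)
```

After calibration, the total execution time can be read off the aligned times directly, since all fragments now share one time base.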
17. An apparatus for evaluating instruction execution performance, comprising:
an information acquisition module configured to acquire execution information of an instruction to be debugged when executed by one or more instruction processing apparatuses, each instruction processing apparatus being the instruction processing apparatus according to any one of claims 1 to 8;
an index determining module configured to determine a performance index of the instruction to be debugged according to the execution information; and
a performance judging module configured to judge, according to the performance index, whether the execution performance of the instruction to be debugged reaches a target.
18. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 12-16.
19. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 12-16.
20. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 12-16.
Background
How program instructions are written can affect the computational efficiency of a processor. This effect is particularly evident in scenarios where hardware accelerators (e.g., special-purpose processors such as NPUs, GPUs, and FPGAs) are employed for neural network computations. A hardware accelerator generally includes a plurality of co-processing units and an internal memory, and each co-processing unit integrates a large number of concurrent computing units for accelerating computing tasks. When the hardware accelerator performs computation, the neural network algorithm needs to be mapped onto the concurrent computing units of the hardware accelerator through program instructions. Different mapping schemes are possible for the same neural network algorithm. Due to the limitations of the internal structure, computing resources, and data paths of the hardware accelerator, different mapping schemes may all produce correct results, yet their computing efficiency may differ greatly.
Thus, in some cases, a user may wish to obtain information about the execution of a processor when executing a certain instruction or instructions in order to evaluate the performance of the instructions accordingly.
The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.
Disclosure of Invention
The present disclosure provides an instruction processing apparatus, a processor, a chip, a computing device and a corresponding method.
According to an aspect of the present disclosure, there is provided an instruction processing apparatus including: an execution unit; at least one performance register set, which is used for recording the execution information of the execution unit during the execution of the instruction to be debugged; the configuration register is used for configuring debugging parameters; and trigger circuitry configured to enable one or more of the at least one performance register set to record the execution information in accordance with the debug parameters.
According to another aspect of the present disclosure, there is provided a processor including at least one instruction processing apparatus as described above.
According to another aspect of the present disclosure, there is provided a chip comprising at least one processor as described above.
According to another aspect of the present disclosure, there is provided a computing device comprising the above chip.
According to another aspect of the present disclosure, there is provided a method of evaluating instruction execution performance, comprising: acquiring execution information of instructions to be debugged when the instructions are executed by one or more instruction processing devices; determining a performance index of the instruction to be debugged according to the execution information; and judging whether the execution performance of the instruction to be debugged reaches a target or not according to the performance index.
According to another aspect of the present disclosure, there is provided an apparatus for evaluating execution performance of an instruction, including: the information acquisition module is configured to acquire execution information of the instructions to be debugged when the instructions are executed by one or more instruction processing devices; the index determining module is configured to determine a performance index of the instruction to be debugged according to the execution information; and the performance judging module is configured to judge whether the execution performance of the instruction to be debugged reaches a target according to the performance index.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method of evaluating instruction execution performance.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium having computer instructions stored thereon is provided. The computer instructions are for causing a computer to perform the above-described method of evaluating instruction execution performance.
According to another aspect of the disclosure, a computer program product is provided, comprising a computer program. The computer program, when executed by a processor, implements the above-described method of evaluating instruction execution performance.
According to one or more embodiments of the present disclosure, by providing a configuration register, a trigger circuit, and at least one performance register set in an instruction processing apparatus, execution information of an execution unit during execution of an instruction to be debugged can be recorded.
According to embodiments of the present disclosure, only a small number of registers (a configuration register and at least one performance register set) and a structurally simple trigger circuit need to be provided in the instruction processing apparatus to record the execution information of the instruction to be debugged. The recording and transfer of the execution information can reuse the existing register read/write paths in the instruction processing apparatus; no additional data path or dedicated memory structure connected to the instruction processing apparatus is required, so the production cost and design cost are low and the occupied chip area is small. In addition, recording the execution information in this way is non-intrusive to the original data paths of the instruction processing apparatus and the memories connected to it, so the execution information of the instruction to be debugged can be recorded accurately.
Further, according to the embodiment of the disclosure, the execution performance of the instruction to be debugged can be accurately evaluated according to the execution information of the instruction to be debugged, so that the instruction to be debugged is optimized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments of the embodiments and, together with the description, serve to explain the exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.
FIG. 1 shows a schematic diagram of an instruction processing apparatus according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of a processor according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of a chip according to an embodiment of the disclosure;
FIG. 4 illustrates a flow diagram of a method of evaluating instruction execution performance according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of an execution timing diagram according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of calibrating execution time information according to an embodiment of the disclosure;
FIG. 7 shows a flow diagram of an instruction evaluation and optimization process according to an embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating an apparatus for evaluating instruction execution performance according to an embodiment of the present disclosure; and
FIG. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the present disclosure, unless otherwise specified, the use of the terms "first", "second", etc. to describe various elements is not intended to limit the positional relationship, the timing relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.
The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the elements may be one or more. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.
In some cases, a user may wish to obtain information about the execution of a processor when executing a certain instruction or instructions in order to evaluate the performance of the instructions accordingly.
In the related art, the following two schemes are mainly used for acquiring information executed by a processor:
one solution is to add special hardware in the processor to record and transmit the execution information during the execution process of the processor. In order to record and uniformly process and transmit the execution information of each component (such as a processor core, a memory, etc.) in the processor, a special storage module needs to be arranged to temporarily store the information, and an independent data path is designed to summarize and uniformly output the execution information of each component. Moreover, in order to ensure sufficient output bandwidth, a special data compression module may be further required to reduce the amount of transferred data, and these designs all require additional chip area, thereby increasing the production cost of the chip. Meanwhile, the added hardware needs to pay extra development and verification cost, so that the design cost of the chip is increased.
The other solution is to transfer and record the relevant execution information by multiplexing the existing storage resources and data paths in the processor. The disadvantage of this solution is that recording and transferring the execution information intrudes on the working state of the processor. Because the recorded execution information shares storage bandwidth and data paths with the control flow and data flow generated during the processor's normal operation, the processor is slowed down by the occupied storage and network bandwidth, and the execution information finally obtained cannot truly reflect the processor's normal working state. This is especially true for a neural network processor: since the computing, storage, and network bandwidth of a high-density, large-data-volume, high-throughput neural network (e.g., a Deep Convolutional Neural Network, DCNN) are in most cases the processor's critical resources, the distortion of the obtained execution information is further increased, making it difficult for a user to derive the key instruction optimization directions and optimization points from it.
To this end, the disclosed embodiments provide an instruction processing apparatus capable of accurately recording execution information when an instruction is executed at low cost. Further, based on the execution information of the instructions recorded by the instruction processing device, the present disclosure also provides a method and a device for evaluating the execution performance of the instructions, which can accurately evaluate the execution performance of the instructions based on the execution information of the instructions so as to optimize the instructions. In the embodiment of the present disclosure, a process of executing an instruction and recording execution information of the instruction is denoted as "debugging" the instruction, and accordingly, the instruction that is executed and recorded with the execution information is "an instruction to be debugged".
Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of an instruction processing apparatus 100 according to an embodiment of the disclosure. Instruction processing apparatus 100 may be, for example, a single-core processor, a processor core of a multi-core processor, or a processing element in an electronic system. It should be noted that the processor herein includes, but is not limited to, a Central Processing Unit (CPU), a Neural-network Processing Unit (NPU), a Graphics Processing Unit (GPU), a Tensor Processing Unit (TPU), a Digital Signal Processor (DSP), and the like.
As shown in FIG. 1, the instruction processing apparatus 100 includes: an execution unit 110; at least one performance register set 122 (four performance register sets 122-1 through 122-4 are shown in FIG. 1) for recording execution information during execution of the instruction to be debugged by the execution unit 110; a configuration register 124 for configuring debug parameters; and a trigger circuit 130 configured to enable one or more of the at least one performance register set 122 to record the execution information according to the debug parameters.
According to embodiments of the present disclosure, only a small number of registers (namely, a configuration register and at least one performance register set) and a structurally simple trigger circuit need to be provided in the instruction processing apparatus to record the execution information of the instruction to be debugged. The recording and transfer of the execution information can reuse the existing register read/write paths in the instruction processing apparatus; no additional data path or dedicated memory structure connected to the instruction processing apparatus is required, so the production cost and design cost are low and the occupied chip area is small. In addition, recording the execution information in this way is non-intrusive to the original data paths of the instruction processing apparatus and the memories connected to it, so the execution information of the instruction to be debugged can be recorded accurately.
According to some embodiments, execution unit 110 includes circuitry operable to execute instructions, which may include, for example, decoders and different types of compute units.
The decoder may, for example, fetch instructions in the form of high-level machine instructions or macro-instructions in memory 102 and decode these instructions to generate low-level micro-operations, micro-code entry points, micro-instructions, or other low-level instructions or control signals. The low-level instructions or control signals may operate at a low level (e.g., circuit level or hardware level) to implement the operation of high-level instructions. The decoder may be implemented in different ways including, but not limited to, microcode, look-up tables, hardware implementations, Programmable Logic Arrays (PLAs), etc. The present disclosure is not limited by the manner in which the decoder is implemented, and any manner in which the decoder can be implemented is within the scope of the present disclosure.
The computing unit performs an operation according to the decoded instruction. The computing units include, but are not limited to, Arithmetic Logic Units (ALUs), multiplier array circuits, adder array circuits, vector processing circuits, format conversion circuits (e.g., converting floating-point numbers to fixed-point numbers or vice versa), and the like.
According to some embodiments, as shown in fig. 1, instruction processing apparatus 100 includes a register unit 120, register unit 120 is coupled to execution unit 110, and execution unit 110 may read and write registers in register unit 120 through a predetermined register read and write path. Register unit 120 may include different bit widths, different types, and different numbers of register sets or registers that may be used to store control information, state information, operands of instructions, etc. during execution of the instructions by execution unit 110.
In the embodiment of the present disclosure, the register unit 120 includes the at least one performance register set 122 and the configuration register 124, so that the execution unit 110 or the trigger circuit 130 can read and write the performance register set 122 and the configuration register 124 through the original register read and write path of the register unit 120 without providing an additional data path.
It is understood that, in addition to the performance register set 122 and the configuration register 124, other registers (e.g., general purpose registers, vector registers, etc.) may be included in the register unit 120, without limitation.
According to some embodiments, the performance register set comprises at least one time counter for recording the time of occurrence of a first preset event and/or at least one event counter for recording the number of occurrences of a second preset event. According to some embodiments, the first preset event may for example comprise one or more of: instruction start execution, instruction execution completion, memory read start, memory read end, memory write start, and memory write end. The second preset event may for example comprise one or more of the following: use compute unit, read memory block, write memory block.
For example, as shown in FIG. 1, performance register set 122-1 includes a time counter 122-1A and an event counter 122-1B. The time counter 122-1A records the current time when a specific event (a first preset event) occurs, for example, the time when the instruction to be debugged starts executing, the time when it finishes executing, or the time when it enters or exits a specific module (such as a memory). The event counter 122-1B performs a self-increment operation when a specific event (a second preset event) occurs, thereby recording the number of occurrences of that event, for example, the number of accesses to the memory 102, the number of accesses to an external memory (not shown in FIG. 1), or the number of uses of the various computing units (e.g., multipliers, adders) inside the execution unit 110.
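The behavior of one performance register set can be sketched in software. This is purely an illustrative model of the counters' semantics, not the claimed hardware; the event names and the externally supplied `now` timestamp are assumptions of the sketch:

```python
class PerformanceRegisterSet:
    """Software model of one performance register set (e.g., 122-1):
    time counters latch the time of first preset events, and event
    counters self-increment on second preset events, but only while
    the set is enabled by the trigger circuit."""

    def __init__(self):
        self.enabled = False
        self.time_records = {}   # first-preset-event name -> latched time
        self.event_counts = {}   # second-preset-event name -> count

    def record_time(self, event, now):
        # Time counter: latch the current time when a first preset event
        # (e.g., "instr_start", "instr_end") occurs.
        if self.enabled:
            self.time_records[event] = now

    def count_event(self, event):
        # Event counter: self-increment when a second preset event
        # (e.g., "use_compute_unit") occurs.
        if self.enabled:
            self.event_counts[event] = self.event_counts.get(event, 0) + 1


# Usage: record one instruction's execution window and compute-unit uses.
regs = PerformanceRegisterSet()
regs.enabled = True
regs.record_time("instr_start", 100)
regs.count_event("use_compute_unit")
regs.count_event("use_compute_unit")
regs.record_time("instr_end", 250)
```

Note that when `enabled` is false the counters ignore all events, which mirrors the fact that recording only happens while the trigger circuit has enabled the register set.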
Configuration registers 124 are used to configure the debug parameters. The trigger circuit 130 is configured to enable one or more of the at least one performance register set 122 to record execution information of the instruction to be debugged according to the debug parameters in the configuration registers 124. Only one trigger circuit 130 is shown in FIG. 1, but it will be appreciated that in other embodiments, multiple trigger circuits may be provided, each configured to enable one or more performance register sets. For example, the same number of trigger circuits as performance register sets may be provided, each trigger circuit enabling a corresponding one of the performance register sets.
According to some embodiments, the trigger circuit 130 may further comprise one or more trigger subcircuits, each for enabling one or more counters (time counters or event counters) in a respective performance register set.
According to some embodiments, the debug parameters configurable by the configuration registers 124 include a debug mode, and accordingly, the configuration registers 124 include a mode register 124A for configuring the debug mode; the name of the mode register 124A may be, for example, TRACE_MODE. There may be multiple debug modes, distinguished by the value in the mode register 124A. For example, the debug mode may include a marker mode and an auto mode. When the value in the mode register 124A is 0, the current debug mode is the marker mode; when the value is 1, the current debug mode is the auto mode.
In the marker mode, instruction processing apparatus 100 takes one or more instructions in the target instruction sequence marked by the marker instruction as an instruction to be debugged, and records execution information of execution unit 110 during execution of the instruction to be debugged. For example, the tagged instructions include, for example, a debug start tagged instruction (trace _ begin) and a debug end tagged instruction (trace _ end), and an instruction in the target instruction sequence between the debug start tagged instruction trace _ begin and the debug end tagged instruction trace _ end is an instruction to be debugged. Accordingly, according to some embodiments, the trigger circuit 130 is further configured to: in the case that the debug mode is the flag mode, identifying whether the current instruction (i.e., the instruction currently executed by the execution unit 110) is a flag instruction; and in response to determining that the current instruction is a marker instruction, enabling one or more of the at least one performance register set 122 to record execution information.
According to some embodiments, the marker instruction includes a first field indicating a marker type (debug start or debug end) and a second field indicating a performance register set. For example, when the value of the first field is 1, the current marker instruction is the debug start marker instruction trace_begin; when the value of the first field is 0, it is the debug end marker instruction trace_end. The value of the second field may be, for example, an identifier of a performance register set. Accordingly, the trigger circuit 130 is further configured to: in response to determining that the current instruction is a marker instruction, enable the performance register set corresponding to the second field to record the execution information. That is, the marker instruction starts or ends the debugging of the instruction to be debugged, and can specify into which performance register set the execution information collected during debugging is stored.
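As an illustration, the marker-instruction decoding performed by the trigger circuit 130 can be sketched in software. The field positions and widths below are hypothetical assumptions chosen for illustration only; the disclosure does not fix a concrete encoding:

```python
MARK_BEGIN = 1  # first field: 1 = trace_begin (debug start)
MARK_END = 0    # first field: 0 = trace_end (debug end)

def decode_marker(instr: int):
    """Split a marker instruction word into its two fields.

    Assumed layout (illustrative only): bit 8 holds the marker type,
    bits 0-7 hold the performance register set identifier.
    """
    mark_type = (instr >> 8) & 0x1   # first field: debug start or debug end
    reg_set_id = instr & 0xFF        # second field: performance register set id
    return mark_type, reg_set_id

# A trace_begin marker that targets performance register set 2:
assert decode_marker((MARK_BEGIN << 8) | 2) == (MARK_BEGIN, 2)
```

Under this assumed layout, a trace_end marker for the same register set would simply clear bit 8.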
According to some embodiments, where the debug mode is the automatic mode, the debug parameters also include loop parameters corresponding to the automatic mode; accordingly, the configuration registers 124 include registers for configuring the loop parameters (e.g., the start register 124B, number register 124C, step register 124D, and increment register 124E shown in FIG. 1 and described below). The loop parameters control the debugging process in the automatic mode. In the automatic mode, some or all of the instructions in the target instruction sequence can be taken as the instruction to be debugged based on the configured loop parameters, and the execution information of the execution unit 110 during execution of the instruction to be debugged is recorded automatically, without the user manually inserting marker instructions into the target instruction sequence. The automatic mode is suitable for cases where the instruction to be debugged contains a large number of instructions, for example, the code of multiple convolution layers of a neural network, or the code of a residual unit of ResNet.
In particular, the loop parameters are used to divide the instruction to be debugged into a plurality of instruction groups, and accordingly, the trigger circuit 130 is further configured to: in the case that the debug mode is the automatic mode, cyclically enable each of the at least one performance register set to record the execution information during execution of each of the plurality of instruction groups. According to some embodiments, the execution unit 110 may automatically generate a marker instruction for each instruction group (i.e., add a trace_begin instruction at the front of each instruction group and a trace_end instruction at the back of each instruction group) based on the loop parameters in the configuration registers 124, the marker instruction including the second field indicating the performance register set. Accordingly, the trigger circuit 130 may cyclically enable the corresponding performance register set to record execution information based on the marker instructions of each instruction group.
For example, the instruction processing apparatus 100 shown in FIG. 1 has four performance register sets 122-1 to 122-4 in total. Suppose that, according to the loop parameters stored in the configuration registers 124, the instruction to be debugged is divided into nine instruction groups, i.e., instruction groups 1 to 9. Then, in the automatic mode, the trigger circuit 130 enables performance register set 122-1 to record the execution information of instruction group 1, then enables performance register set 122-2 for instruction group 2, then 122-3 for instruction group 3, then 122-4 for instruction group 4, then enables performance register set 122-1 again for instruction group 5, and so on, until the execution information of all instruction groups has been recorded.
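The round-robin assignment described above can be sketched as a minimal illustration; the 1-based register set numbering follows the example:

```python
def assign_register_sets(num_groups: int, num_sets: int):
    """Return the performance register set (1-based) that records each
    instruction group when the sets are enabled cyclically."""
    return [(g % num_sets) + 1 for g in range(num_groups)]

# 9 instruction groups recorded by 4 register sets, as in the example:
assert assign_register_sets(9, 4) == [1, 2, 3, 4, 1, 2, 3, 4, 1]
```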
According to some embodiments, after a performance register set 122 has recorded the execution information of an instruction group, the recorded execution information may be transferred to the memory 102 for storage (and further, from the memory 102 to a larger-capacity external memory, not shown in FIG. 1), so that the execution information of the instruction group currently stored in the performance register set 122 is not overwritten and lost when the execution information of the next instruction group is recorded.
According to some embodiments, the loop parameters include: a start instruction number, indicating the number of the first instruction in the first instruction group of a loop; a single-debug instruction count, indicating the number of instructions included in a single instruction group; an instruction interval, indicating the difference between the numbers of the first instructions in two adjacent instruction groups; and a loop increment, indicating the value by which the start instruction number increases after one loop ends. Accordingly, the configuration registers 124 may include a start register 124B for configuring the start instruction number, a number register 124C for configuring the single-debug instruction count, a step register 124D for configuring the instruction interval, and an increment register 124E for configuring the loop increment.
These loop parameters allow the division of the instruction groups to be flexibly configured, enabling flexible debugging in the automatic mode.
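One possible reading of how the four loop parameters divide the instruction to be debugged into groups is sketched below. It assumes that one loop enables each available performance register set once; that assumption, and the sample values, are illustrative, not mandated by the disclosure:

```python
def divide_into_groups(begin, num, step, inc, num_sets, num_loops):
    """Enumerate (first, last) instruction numbers of each instruction group.

    begin: start instruction number (start register, TRACE_BEGIN)
    num:   instructions per group (number register, TRACE_NUM)
    step:  difference between first instructions of adjacent groups (TRACE_STEP)
    inc:   amount added to the start number after each loop (increment register)
    Assumes each loop yields one group per performance register set.
    """
    groups = []
    for loop in range(num_loops):
        start = begin + loop * inc
        for s in range(num_sets):
            first = start + s * step
            groups.append((first, first + num - 1))
    return groups

# 4 register sets, 1-instruction groups, interval 1, loop increment 4:
assert divide_into_groups(0, 1, 1, 4, 4, 2) == [
    (0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7)]
```

With a larger interval than group size, the groups become sparse samples of the instruction sequence instead of a contiguous window.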
Tables 1 to 3 below show which instructions to be debugged each performance register set (122-1 to 122-4) corresponds to in each loop when the mode register 124A (register name TRACE_MODE; value 0 indicates marker mode, value 1 indicates automatic mode), the start register 124B (register name TRACE_BEGIN), the number register 124C (register name TRACE_NUM), the step register 124D (register name TRACE_STEP), and the increment register 124E (register name TRACE_OVERRIDE) are set to different values.
TABLE 1
TABLE 2
TABLE 3
According to some embodiments, the instruction processing apparatus 100 further comprises a timer (not shown in FIG. 1) configured to start timing when the instruction to be debugged starts executing and stop timing when the instruction to be debugged completes executing, so as to generate a timeline of the debugging process. When a first preset event occurs, the trigger circuit 130 may enable the corresponding performance register set and write the current value of the timer into the time counter of that performance register set.
The timer may be implemented, for example, as a write-only (WO) counter in the register unit 120. When the instruction to be debugged starts executing, the timer begins to self-increment; when execution of the instruction to be debugged completes, timing ends and the value in the timer is cleared.
To avoid obscuring the description, a relatively simple instruction processing apparatus 100 is shown in FIG. 1. It will be appreciated that in other embodiments, the instruction processing apparatus may also include other modules, such as an instruction fetch unit, a cache, and so forth. The present disclosure does not limit the specific structure of the instruction processing apparatus 100.
The instruction processing apparatus of the present disclosure may be applied in a processor, serving as a processing core that executes specific computation tasks in the processor. Accordingly, embodiments of the present disclosure also provide a processor comprising at least one instruction processing apparatus.
FIG. 2 shows a schematic diagram of a processor 200 according to an embodiment of the disclosure. As shown in FIG. 2, the processor 200 includes three instruction processing apparatuses 210-230, a control unit 240, a memory 250, and a DMA (Direct Memory Access) unit 260. The processor 200 may be, for example, an NPU, and the instruction processing apparatuses 210-230 may be co-processing units in the NPU having the features of the instruction processing apparatus 100 described above. The control unit 240 issues computation tasks to the instruction processing apparatuses 210-230 and coordinates their computation processes. The DMA 260 may be connected to other modules (e.g., other processors, memory, etc.) of the chip on which the processor 200 is located through the on-chip interconnect unit 202.
The processor of the disclosed embodiments may be integrated into a chip, enabling the chip to provide the processing functions supported by the processor. Accordingly, embodiments of the present disclosure also provide a chip comprising at least one such processor.
FIG. 3 shows a schematic diagram of a chip 300 according to an embodiment of the disclosure. The chip 300 may be, for example, an embedded chip. As shown in FIG. 3, the chip 300 includes an on-chip interconnect unit 310, and a central processing unit (CPU) 320, one or more coprocessors 330, a memory unit 340, and a display unit 350 interconnected by the on-chip interconnect unit 310. The coprocessor 330 may be, for example, an NPU, GPU, TPU, or the like. One or more of the central processor 320 and the coprocessors 330 may be a processor into which the instruction processing apparatus of the disclosed embodiments is integrated. The memory unit 340 may be, for example, a Static Random-Access Memory (SRAM), a High Bandwidth Memory (HBM), a Graphics Double Data Rate memory (GDDR), or the like. The display unit 350 is used to drive one or more external displays.
The above chip may be included in a computing device to implement corresponding functions in the computing device, including but not limited to executing related control programs, performing data analysis, operations and processing, network communication, and controlling peripherals of the computing device. Accordingly, embodiments of the present disclosure also provide a computing device comprising the above chip. The computing device may be, for example, an in-vehicle device, an industrial control device, a sensing device, or a smart home device (smart speaker, smart door lock, smart display device), but is not limited thereto.
As described above, the instruction processing apparatus of the embodiments of the present disclosure can accurately record the execution information of the instruction to be debugged. Based on this recorded execution information, embodiments of the present disclosure further provide a method of evaluating instruction execution performance.
The method of evaluating instruction performance of the disclosed embodiments may be executed on a debug host coupled with the instruction processing apparatus. For example, the instruction processing apparatus (or the processor or chip in which it is located) may be connected to the debug host through a debug interface, such as a PCIe (Peripheral Component Interconnect Express) interface or a JTAG (Joint Test Action Group) interface. The execution information of the instruction to be debugged recorded by the instruction processing apparatus is transmitted to the debug host, which then executes the method of evaluating instruction performance of the disclosed embodiments based on that execution information. The debug host may be, for example, a desktop personal computer, a notebook computer, or the like. In some embodiments, the debug host may also be a server or a mobile device.
FIG. 4 illustrates a flow diagram of a method 400 of evaluating instruction execution performance according to an embodiment of the disclosure. As shown in FIG. 4, the method 400 includes:
step 410, obtaining execution information of the instruction to be debugged as executed by one or more instruction processing apparatuses, where the one or more instruction processing apparatuses are instruction processing apparatuses according to embodiments of the present disclosure (for example, the instruction processing apparatus 100 shown in FIG. 1);
step 420, determining performance indicators of the instruction to be debugged according to the execution information; and
step 430, judging whether the execution performance of the instruction to be debugged reaches the target according to the performance indicators.
According to the embodiments of the disclosure, the execution performance of the instruction to be debugged can be accurately evaluated from its execution information, so that the instruction to be debugged can then be optimized.
The various steps of method 400 are described in detail below.
As described above, the instruction processing apparatus may collect various execution information during the execution of the instruction to be debugged, where the execution information includes the time and the number of times of occurrence of various events. The execution information obtained in step 410 may be all or part of the execution information collected by the instruction processing apparatus.
The kind of the execution information acquired in step 410 may be determined according to the kind of the performance index to be calculated in step 420.
According to some embodiments, the instruction to be debugged is executed by one instruction processing apparatus, and the performance indicators in step 420 include a first data amount read from the external memory by the instruction processing apparatus per unit time and a second data amount calculated by the calculation unit of the instruction processing apparatus per unit time. Accordingly, step 430 includes: judging whether the first data amount matches the second data amount; and in response to determining that they match, determining that the execution performance of the instruction to be debugged reaches the target.
According to some embodiments, the first data amount read from the external memory by the instruction processing apparatus per unit time may be calculated according to the formula m·n/(t2−t1), where m is the maximum throughput of the bus (i.e., the maximum amount of data that can be read at a time), n is the number of times a read-data-valid signal appears on the bus during execution of the instruction to be debugged, t1 is the time at which reading the external memory starts, and t2 is the time at which reading the external memory ends. Here n, t1, and t2 are execution information obtained in step 410, while m is a theoretical value (not part of the execution information to be obtained).
According to some embodiments, the second data amount calculated by the calculation unit of the instruction processing apparatus per unit time may be calculated according to the formula p·q/(t4−t3), where p is the theoretical throughput of the calculation unit (e.g., a multiplier-adder) of the instruction processing apparatus (i.e., the maximum amount of data that can be calculated at a time), q is the total number of times the calculation unit is occupied during execution of the instruction to be debugged, t3 is the time at which the calculation unit starts calculating, and t4 is the time at which it finishes. Here q, t3, and t4 are execution information obtained in step 410, while p is a theoretical value (not part of the execution information to be obtained).
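The two formulas above can be computed directly from the recorded execution information; the sample values below are hypothetical:

```python
def first_data_amount(m, n, t1, t2):
    """External-memory read bandwidth: m*n / (t2 - t1).
    m: maximum bus throughput per read; n: count of read-data-valid signals."""
    return m * n / (t2 - t1)

def second_data_amount(p, q, t3, t4):
    """Computation throughput: p*q / (t4 - t3).
    p: theoretical throughput of the calculation unit; q: times it was occupied."""
    return p * q / (t4 - t3)

# e.g., 64-byte bus beats, 1000 valid reads over 100 time units:
assert first_data_amount(64, 1000, 0, 100) == 640.0
```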
It is understood that the criterion for judging whether the first and second data amounts match may be set according to the specific situation. In some embodiments, all of the data read from the external memory participates in only one operation; in this case, matching may mean that the first and second data amounts are equal (or approximately equal). In other embodiments, the data read from the external memory may each participate in several operations; in this case, matching may mean that the second data amount is an integer multiple (or approximately an integer multiple) of the first data amount.
Based on the above embodiment, the first data amount represents the memory access performance of the instruction to be debugged, and the second data amount represents its computation performance. The better the two match, the better the execution performance: storage resources and computation resources are both fully utilized during execution, and there is no obvious bottleneck. If the two do not match, the instruction to be debugged can be optimized until they do.
In the above embodiments, the first data amount and the second data amount may be used to characterize the overall performance of the instruction to be debugged, and therefore, both may be denoted as "characterization parameters".
According to some embodiments, the instruction to be debugged is executed by one instruction processing apparatus, and the performance indicators in step 420 include at least one of: a first ratio of the number of partial writes to the external memory by the instruction processing apparatus to the total number of writes to the external memory; a second ratio of the number of blocked accesses to the internal memory to the total number of accesses to the internal memory; and a third ratio of the number of uses of the calculation unit in the instruction processing apparatus to the maximum number of times the calculation unit could be used. Accordingly, step 430 includes: judging whether the execution performance of the instruction to be debugged reaches the target according to the performance indicators relative to preset values.
In the embodiments of the present disclosure, a "partial write" to a memory means that the amount of data written in a single write is less than the data bit width of the memory. For example, if the data bit width of a memory is 4 bytes and a write operation writes 2 bytes into it, that write operation is a "partial write".
According to some embodiments, the number of partial writes a1 by the instruction processing apparatus to the external memory (i.e., a memory located outside the processor in which the instruction processing apparatus is located, such as HBM, GDDR, etc.) and the total number of writes b1 to the external memory may be obtained in step 410; accordingly, the first ratio is a1/b1. If the first ratio is too large (larger than a preset value), it causes a large performance loss on the external memory. To improve performance, the instruction to be debugged may be optimized to increase aligned accesses to the external memory (i.e., the amount of data in a single write is an integer multiple of the data bit width of the external memory), decreasing the first ratio.
According to some embodiments, to improve the access performance of an internal memory (i.e., a memory located inside the processor in which the instruction processing apparatus is located), the storage space of the internal memory is generally divided into a plurality of blocks (banks). Data located in the same block share one data read-write interface and cannot be accessed by multiple instruction processing apparatuses at the same time, whereas data in different blocks can be accessed by different instruction processing apparatuses simultaneously through different read-write interfaces. During execution of the instruction to be debugged, if the data to be accessed are located in the same block, the accesses can only be completed sequentially through arbitration, and requests that lose arbitration are temporarily blocked, reducing access performance.
According to some embodiments, the number of blocked accesses a2 by the instruction processing apparatus to the internal memory and the total number of accesses b2 to the internal memory may be obtained in step 410; accordingly, the second ratio is a2/b2. If the second ratio is too large (larger than a preset value), the data distribution in the internal memory is unreasonable and access blocking occurs easily. To improve performance, the instruction to be debugged may be optimized so that data are distributed across different blocks of the internal memory wherever possible, decreasing the second ratio and reducing the likelihood of access blocking.
According to some embodiments, in NPUs and GPUs the calculation units (e.g., multiplier-adders) are arranged as an array (i.e., a calculation unit array) for stronger parallel computing capability. However, some computation tasks are relatively small in scale, leaving a portion of the calculation units unused and wasting computing power.
According to some embodiments, the number of uses a3 of the calculation units and the number of uses b3 of the calculation unit array in the instruction processing apparatus may be obtained in step 410; accordingly, the third ratio is a3/(b3 × c), where c is the number of calculation units included in the calculation unit array. If the third ratio is too small (smaller than a preset value), the calculation units are not fully utilized and there is a large waste of resources. To improve performance, the instruction to be debugged may be optimized to increase the third ratio so that the calculation units are fully utilized.
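The three ratios can be combined into a simple screening check; the preset thresholds below are illustrative assumptions, since the disclosure leaves the preset values open:

```python
def screen_traceability(a1, b1, a2, b2, a3, b3, c,
                        max_partial=0.1, max_blocked=0.1, min_util=0.5):
    """Return the problems indicated by the three traceability ratios.

    a1/b1: partial writes over total writes to external memory.
    a2/b2: blocked accesses over total accesses to internal memory.
    a3/(b3*c): calculation-unit uses over the maximum available uses.
    The threshold keyword arguments are hypothetical preset values.
    """
    issues = []
    if a1 / b1 > max_partial:
        issues.append("too many partial writes to external memory")
    if a2 / b2 > max_blocked:
        issues.append("too many blocked internal-memory accesses")
    if a3 / (b3 * c) < min_util:
        issues.append("calculation unit array underutilized")
    return issues

# Half of all external-memory writes are partial writes -> flagged:
assert screen_traceability(50, 100, 5, 100, 400, 100, 4) == [
    "too many partial writes to external memory"]
```

Each flagged issue points at the corresponding optimization direction described above (aligned accesses, data distribution across banks, or fuller use of the calculation unit array).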
In the above embodiments, the first, second, and third ratios help locate specific problems in the instruction to be debugged, characterizing which factors affect execution efficiency and providing the user with concrete optimization directions. They may therefore be denoted as "traceability parameters".
According to some embodiments, the instruction to be debugged is executed cooperatively by a plurality of instruction processing apparatuses, each executing a part of the instruction to be debugged. In this case, the time period information of each instruction processing apparatus executing its corresponding partial instructions may be acquired in step 410, and accordingly, in step 420, an execution timing diagram of the instruction to be debugged is generated from the time period information of the plurality of instruction processing apparatuses. Further, the bottleneck resource and optimization direction of the instruction to be debugged can be determined from the execution timing diagram.
For example, a neural network algorithm may be executed cooperatively by multiple co-processing units (i.e., instruction processing apparatuses) in an NPU, each executing a part of the algorithm: some co-processing units perform matrix operations, some perform vector operations, some perform data format conversion, and so on. Through step 410, the time period information of each co-processing unit executing its corresponding partial instructions can be obtained; for example, the time periods during which co-processing unit COP0 executes instructions include t0–t1 and t3–t5, the time periods of co-processing unit COP1 include t2–t4 and t6–t7, and so on. Subsequently, in step 420, the time period information of the co-processing units may be arranged along a unified timeline to generate the execution timing diagram of the instruction to be debugged.
FIG. 5 illustrates an example of an execution timing diagram according to an embodiment of the present disclosure. The execution timing diagram in FIG. 5 is generated from the time period information of instructions executed by six co-processing units COP0–COP5 of an NPU. Each row corresponds to one co-processing unit, and each shaded block in a row corresponds to a time period during which that co-processing unit is occupied, i.e., in a busy state. As shown in FIG. 5, co-processing unit COP3 is busy continuously, so COP3 is the bottleneck resource. To improve execution performance, the part of the instruction to be debugged executed by COP3 can be optimized to improve COP3's execution efficiency and reduce its execution time, thereby reducing the time other co-processing units spend waiting for COP3 and improving the execution efficiency of the instruction to be debugged on the NPU as a whole.
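Identifying the bottleneck resource from per-unit busy periods can be sketched as follows; the concrete time periods are hypothetical, loosely following the COP0/COP1 example above:

```python
def busy_fraction(periods, t_start, t_end):
    """Fraction of the window [t_start, t_end) a unit spends busy,
    given its list of (start, end) busy periods."""
    return sum(e - s for s, e in periods) / (t_end - t_start)

timeline = {  # hypothetical busy periods per co-processing unit
    "COP0": [(0, 1), (3, 5)],
    "COP1": [(2, 4), (6, 7)],
    "COP3": [(0, 8)],  # busy for the whole run
}
# The unit with the highest busy fraction is the bottleneck resource:
bottleneck = max(timeline, key=lambda u: busy_fraction(timeline[u], 0, 8))
assert bottleneck == "COP3"
```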
According to some embodiments, due to chip area overhead, typically only a limited number of performance register sets can be integrated in an instruction processing apparatus. To obtain a timing diagram of the entire execution process when the instruction to be debugged contains a large number of instructions, the instruction to be debugged must be executed multiple times (for example, multiple times by the same instruction processing apparatus, or once each by several different instruction processing apparatuses), recording the execution information of a different part of the instructions each time and then merging the execution information obtained from the multiple runs. Note that because the hardware states of different instruction processing apparatuses differ, and the storage environment of the same instruction processing apparatus differs at different times, the time spent on the same instruction varies across runs. Therefore, one cannot execute a single instruction or instruction group in isolation to record its execution information; instead, the instruction to be debugged must be executed in full each time, and the required execution information of the instruction or instruction group extracted from that full run. This ensures that the environment of each run is substantially the same and that the merged timing diagram is consistent with the actual operating state of the instruction to be debugged.
Further, since the hardware states of different instruction processing apparatuses differ and the storage environment of the same apparatus differs at different times, the time to execute the instruction to be debugged may fluctuate between runs, so the execution information of the runs cannot be directly merged along a common timeline. Therefore, according to some embodiments, the execution time information of each run may first be calibrated, and the calibrated execution time information then aggregated and merged to obtain the timing diagram of the instruction to be debugged and determine its total execution time. Under this scheme, the execution time information can be obtained and merged in real time while the instruction to be debugged is executing (generating the timing diagram in real time), or obtained and merged all at once after the entire run completes (which further allows averaging over multiple runs to improve the accuracy of the time calibration and reduce incidental effects).
In addition, when the number of instructions to be debugged is huge, if the user has multiple execution environments (e.g., chips, processors, or other hardware environments containing an instruction processing apparatus), the debugging work can be distributed across the multiple environments and executed in parallel, each environment recording the execution time information of a part of the instruction to be debugged; the timing diagram of the entire instruction to be debugged is finally obtained by merging, thereby shortening the evaluation time.
Specifically, according to some embodiments, the instruction to be debugged is divided into a plurality of instruction fragments, each containing a plurality of instruction groups (in some cases an instruction group may contain only one instruction), with two adjacent instruction fragments sharing some identical instruction groups. The instruction to be debugged is executed multiple times by one or more instruction processing apparatuses, and during each run (in the automatic mode described above) the performance register sets of the corresponding instruction processing apparatus record the execution time information of one instruction fragment, which includes the execution time of each instruction group contained in that fragment.
In step 410, the execution time information of the corresponding instruction fragment recorded in each run is obtained. Accordingly, in step 420: a time offset is determined from the execution times of the shared instruction group of two adjacent instruction fragments; the execution time information of the instruction fragments other than the first is calibrated according to the time offset; and the total execution time of the instruction to be debugged is determined from the calibrated execution time information. The time offset may be, for example, the average of the difference between the execution start times and the difference between the execution end times of the shared instruction group.
Further, in step 430, whether the execution performance of the instruction to be debugged reaches the target may be judged according to the total execution time relative to a preset value. If the total execution time is less than the preset value, the execution performance can be considered to reach the target; otherwise, the instruction to be debugged may be optimized to reduce its execution time until the total execution time is less than the preset value.
FIG. 6 shows a schematic diagram of calibrating execution time information according to an embodiment of the disclosure. In the embodiment shown in FIG. 6, the instructions to be debugged, instr0–instr6, are divided into two instruction fragments 630 and 640, each containing four instruction groups, each instruction group containing one instruction. As shown in FIG. 6, instruction fragment 630 includes instructions instr0 through instr3, and instruction fragment 640 includes instructions instr3 through instr6; the two fragments share the same instruction (group), instruction instr3.
The instruction to be debugged is executed twice. In both executions, the corresponding instruction processing apparatus records the execution time information of the corresponding instruction segment in the automatic mode, with the debug parameter values shown in Table 1 above. During the first execution, the execution time information of instruction segment 630 is recorded, i.e., the execution times of instructions instr0 through instr3 (as shown in Table 1, these can be recorded by performance register sets 122-1 through 122-4, respectively); these execution times are shown as rectangular blocks 610 through 613 in FIG. 6. Similarly, during the second execution, the execution time information of instruction segment 640 is recorded, i.e., the execution times of instructions instr3 through instr6, shown as rectangular blocks 623 through 626 in FIG. 6.
When merging the execution time information of instruction segments 630 and 640, the time offset is first determined from the execution times of the instruction shared by both, i.e., execution times 613 and 623 of instruction instr3. The time offset Δt may be the average of the difference Δt1 between the start times and the difference Δt2 between the end times of execution times 613 and 623, that is, Δt = (Δt1 + Δt2)/2. It is understood that the calculated Δt1, Δt2, and Δt may each be positive or negative.
The execution time information of instruction segment 640 is then calibrated according to the time offset. Specifically, for instruction instr3, which segment 640 shares with segment 630, the execution time 613 from the earlier segment 630 is used directly, and the execution time 623 from the later segment 640 is discarded. The execution times of instructions instr4 through instr6 are calibrated by the time offset Δt: Δt is added to each of the execution times 624 through 626, yielding the calibrated execution times 624' through 626', as shown in FIG. 6.
Then, the total execution time of the instructions to be debugged, instr0 through instr6, is determined from the calibrated execution time of each instruction, namely execution times 610 through 613 and 624' through 626'. The total execution time is the difference between the end time of the last instruction, instr6, and the start time t0 of the first instruction, instr0.
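The merge just walked through (derive the offset from the shared instruction, calibrate the later segment, then take the overall span) can be sketched in Python. The data layout (a dict mapping instruction name to a (start, end) pair) and the sample numbers are illustrative, not taken from the disclosure.

```python
def merge_segments(seg_a, seg_b):
    """Merge per-instruction (start, end) times of two adjacent instruction
    segments recorded in separate runs; the segments share one instruction."""
    shared = (set(seg_a) & set(seg_b)).pop()
    d1 = seg_a[shared][0] - seg_b[shared][0]   # start-time difference (Δt1)
    d2 = seg_a[shared][1] - seg_b[shared][1]   # end-time difference (Δt2)
    dt = (d1 + d2) / 2                          # time offset Δt, may be negative
    merged = dict(seg_a)  # keep the earlier record of the shared instruction
    for name, (s, e) in seg_b.items():
        if name != shared:
            merged[name] = (s + dt, e + dt)     # calibrate the later segment
    return merged

# illustrative numbers for segments 630 and 640 of FIG. 6
seg_630 = {"instr0": (0, 2), "instr1": (2, 5), "instr2": (5, 7), "instr3": (7, 10)}
seg_640 = {"instr3": (1, 4), "instr4": (4, 6), "instr5": (6, 9), "instr6": (9, 11)}
merged = merge_segments(seg_630, seg_640)
# total time: end of last instruction minus start of first instruction
total = merged["instr6"][1] - merged["instr0"][0]  # = 17.0 here
```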
Based on the instruction processing apparatus and the method for evaluating instruction execution performance of the embodiments of the present disclosure, the instruction to be debugged can be evaluated and optimized for performance. FIG. 7 illustrates an exemplary flow diagram of an instruction evaluation and optimization process according to an embodiment of the disclosure. In this embodiment, the instruction to be debugged is a neural network algorithm instruction, and each co-processing unit in the NPU serves as an instruction processing apparatus.
As shown in fig. 7, in step 702, the configuration registers of the co-processing units are configured, and the instruction to be debugged is executed multiple times using the auto mode, so as to generate an execution timing chart.
Subsequently, in step 704, the execution timing chart is used to check whether the synchronization among the co-processing units is as expected, i.e., whether each co-processing unit starts its computation at the expected time.
If it is not as expected (step 706), step 710 is performed: the program code is modified to resolve the synchronization issue, and step 702 is performed again.
If it is as expected (step 706), the flow proceeds to step 708, where the bottleneck co-processing unit and the bottleneck instruction are identified according to the occupation time of each co-processing unit, and then step 712 is performed.
In step 712, the characterization parameters of the bottleneck instruction are checked to determine whether the bottleneck instruction has an optimization space.
In step 714, if there is optimization space, step 718 is executed: the source parameters of the instruction are checked, an optimization point and an optimization direction are found according to the source parameters, and the relevant instruction is modified. Step 720 is then performed: the modified instruction or instruction group is marked with the marker instruction, its execution information is recorded in the marking mode by configuring the configuration register, and a performance evaluation is made based on the execution information. Then, in step 722, it is checked whether the characterization parameters have improved; if not, step 718 is executed again, and if so, step 724 is executed.
If there is no optimization space in step 714, step 716 is performed to modify the mapping scheme. Then step 724 is executed: the instruction to be debugged is rerun and it is checked whether the total execution time reaches the optimization target. In step 726, if the target is reached, the flow proceeds to step 728 and the optimization is complete; if the target is not reached, step 702 is executed again.
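The iteration of steps 702 through 728 can be sketched as a control loop. Every callback name below (run_auto, sync_ok, and so on) is hypothetical and stands in for a manual or tool-assisted step of FIG. 7; only the control flow follows the figure.

```python
def optimize(run_auto, sync_ok, fix_sync, find_bottleneck, has_headroom,
             tune_instruction, remap, target_met, max_rounds=10):
    """Control-flow sketch of the evaluate-and-optimize loop of FIG. 7."""
    for _ in range(max_rounds):
        timing = run_auto()                    # step 702: auto-mode runs -> timing chart
        if not sync_ok(timing):                # steps 704/706: synchronization check
            fix_sync()                         # step 710: fix code, then rerun
            continue
        bottleneck = find_bottleneck(timing)   # step 708: bottleneck unit/instruction
        if has_headroom(bottleneck):           # steps 712/714: optimization space?
            tune_instruction(bottleneck)       # steps 718-722: tune until improved
        else:
            remap()                            # step 716: change the mapping scheme
        if target_met():                       # steps 724/726: rerun and check target
            return True                        # step 728: optimization complete
    return False
```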
The instruction processing apparatus and the scheme for evaluating instruction execution performance described above have the following beneficial effects:
1. The hardware overhead is low: data is transmitted by reusing the existing register read/write channel of the instruction processing apparatus, with no additional storage resources or bus channels required;
2. The data path and memory of the processor in which the instruction processing apparatus is located (the instruction processing apparatus may be one processor core of a multi-core processor) are not intruded upon, so the debugging process does not disturb the original instruction execution and the recorded execution information is not skewed;
3. Two debug modes are provided: the marking mode and the automatic mode. The marking mode is highly targeted and quick to debug with, making it suitable for rapidly evaluating the effect of an optimization. The automatic mode can capture the overall execution of an instruction to be debugged of larger instruction scale (such as several convolution layers of a neural network, or a residual unit of ResNet), which makes performance bottlenecks easy to locate, but it requires executing the instruction to be debugged many times and therefore takes somewhat longer. A user can flexibly choose between the two modes as needed;
4. A simple and feasible scheme is provided for merging the execution time information obtained from the multiple executions of the instruction to be debugged in the automatic mode. Through the simple data processing step of adding an offset derived from a shared reference, the execution time information from multiple executions can be merged to obtain a timing chart, with acceptable post-processing precision and a controllable amount of computation;
5. For the automatic mode, resources can be traded for time: debugging efficiency can be improved by running multiple pieces of hardware in parallel;
6. The scheme is non-intrusive to software stacks such as drivers, frameworks, and runtimes. Because the software stack of the processor (for example, a neural network processor) does not need to be changed, the compatibility cost is low.
It is to be understood that the instruction processing apparatus of the embodiments of the present disclosure is preferably a single processor core, in a single-core or multi-core processor, for processing coarse-grained instructions; accordingly, the instruction to be debugged is preferably a coarse-grained instruction. A coarse-grained instruction is an instruction capable of performing a series of operations. For example, for an NPU, a coarse-grained instruction may be a convolution instruction whose operations include batch data loads, multiplications, additions, and so on. Because a coarse-grained instruction performs a series of operations, a program written for a specific task (i.e., the target instruction sequence above) contains relatively few coarse-grained instructions, and accordingly relatively few instructions to be debugged; the instruction processing apparatus therefore spends little time debugging them, and the debugging efficiency is high. In addition, for a coarse-grained instruction, different scales of processed data often yield different computation and memory-access efficiencies.
For a multi-core processor comprising a plurality of instruction processing apparatuses, acquiring and analyzing the execution information recorded when each instruction processing apparatus executes the coarse-grained instructions to be debugged makes it possible, on the one hand, to check whether the dependencies between the asynchronously executing instruction processing apparatuses are reasonable and, on the other hand, to optimize how the instructions to be debugged are divided among the instruction processing apparatuses (that is, which instruction processing apparatus executes which part of the instructions to be debugged), thereby optimizing the execution efficiency of the instructions to be debugged on the processor as a whole.
In addition, it can be understood that the instruction processing apparatus of the embodiments of the present disclosure may also be a single processor core, in a single-core or multi-core processor, for processing fine-grained instructions; accordingly, the instruction to be debugged is a fine-grained instruction. A fine-grained instruction is an instruction capable of performing one operation, or a small number of operations of the same type; the term is relative to coarse-grained instructions. A fine-grained instruction performs fewer operations, and fewer types of operations, than a coarse-grained instruction. Because each fine-grained instruction performs so few operations, a program written for a specific task (i.e., the target instruction sequence above) usually contains many fine-grained instructions, and accordingly many instructions to be debugged; the instruction processing apparatus therefore spends a long time debugging them, and the debugging efficiency is low. On the other hand, because fine-grained instructions generally do not exhibit large performance changes with the size of the processed data, the execution information recorded when the instruction processing apparatus executes fine-grained instructions to be debugged is generally of limited use for optimizing them.
According to the embodiment of the disclosure, an apparatus for evaluating instruction execution performance is also provided.
FIG. 8 shows a block diagram of an apparatus 800 for evaluating instruction execution performance according to an embodiment of the present disclosure. As shown in fig. 8, the apparatus 800 includes:
an information obtaining module 810, configured to obtain execution information of the instruction to be debugged as executed by one or more instruction processing apparatuses, where the one or more instruction processing apparatuses are instruction processing apparatuses according to an embodiment of the present disclosure (for example, the instruction processing apparatus 100 shown in fig. 1);
an index determining module 820 configured to determine a performance index of the instruction to be debugged according to the execution information; and
a performance judging module 830 configured to judge, according to the performance index, whether the execution performance of the instruction to be debugged reaches a target.
According to the embodiment of the disclosure, the execution performance of the instruction to be debugged can be accurately evaluated according to the execution information of the instruction to be debugged, so that the instruction to be debugged is optimized.
According to some embodiments, the instruction to be debugged is executed by one instruction processing apparatus, and the performance index includes a first data amount read from an external memory by the instruction processing apparatus per unit time and a second data amount computed by a computing unit of the instruction processing apparatus per unit time. The performance judging module comprises: a matching unit configured to determine whether the first data amount and the second data amount match; and a judging unit configured to judge, in response to determining that the first data amount matches the second data amount, that the execution performance of the instruction to be debugged reaches the target.
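A minimal sketch of the matching test is given below, under the assumption that "match" means the two per-unit-time data amounts agree within a relative tolerance (a roofline-style balance of memory reads against compute consumption). The disclosure does not fix a concrete criterion; the function name and the 10% tolerance are illustrative.

```python
def amounts_match(first_amount, second_amount, rel_tol=0.1):
    """Judge whether the data read from external memory per unit time
    (first data amount) matches the data computed by the compute units
    per unit time (second data amount), within a relative tolerance.
    The 10% default tolerance is an assumption, not from the disclosure."""
    larger = max(first_amount, second_amount)
    return abs(first_amount - second_amount) <= rel_tol * larger
```

If the two amounts match, memory supply and compute demand are balanced and the execution performance may be judged to reach the target; a large mismatch points at either a memory-bound or a compute-bound bottleneck.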
According to some embodiments, the instruction to be debugged is executed by one instruction processing apparatus, and the performance index includes at least one of: a first proportion of the number of partial writes to the external memory by the instruction processing apparatus to the total number of writes to the external memory, a second proportion of the number of blocked accesses to the internal memory by the instruction processing apparatus to the total number of accesses to the internal memory, and a third proportion of the number of uses of the computing unit in the instruction processing apparatus to the maximum available number of uses of the computing unit. The performance judging module is further configured to: judge whether the execution performance of the instruction to be debugged reaches the target by comparing the performance index with a preset value.
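One way such ratio-type indices could be compared with their preset values is sketched below. The key names are illustrative, and the direction of each comparison is an assumption: fewer partial writes and fewer blocked accesses are taken as better, while higher compute-unit utilization is taken as better.

```python
def performance_ok(indices, presets):
    """Compare each ratio-type performance index against its preset value.
    Comparison directions are assumptions, not fixed by the disclosure."""
    ok_partial = indices["partial_write_ratio"] <= presets["partial_write_ratio"]
    ok_blocked = indices["blocked_access_ratio"] <= presets["blocked_access_ratio"]
    ok_util = indices["compute_utilization"] >= presets["compute_utilization"]
    return ok_partial and ok_blocked and ok_util
```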
According to some embodiments, the instruction to be debugged is executed by a plurality of instruction processing apparatuses in cooperation, each of the plurality of instruction processing apparatuses executing a part of the instruction to be debugged, and the execution information includes time period information for each of the plurality of instruction processing apparatuses executing its corresponding part of the instruction. The index determination module is further configured to: generate an execution timing chart of the instruction to be debugged according to the respective time period information of the plurality of instruction processing apparatuses.
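One way such an execution timing chart could be assembled is shown below, assuming each instruction processing apparatus reports a list of (start, end) periods. The data layout and names are illustrative; a real implementation would render these rows as the bars of a Gantt-style chart.

```python
def timing_chart(periods):
    """Flatten per-apparatus (start, end) periods into rows sorted by
    start time, a minimal textual form of the execution timing chart.
    `periods` maps apparatus name -> list of (start, end) tuples."""
    rows = [(start, end, dev)
            for dev, spans in periods.items()
            for start, end in spans]
    rows.sort()  # chronological order by start time
    return rows
```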
According to some embodiments, the instruction to be debugged is divided into a plurality of instruction segments, each instruction segment including a plurality of instruction groups, with two adjacent instruction segments sharing some identical instruction groups; the instruction to be debugged is executed multiple times by the one or more instruction processing apparatuses, and during each execution the execution time information of one instruction segment is recorded by the performance register set of the corresponding instruction processing apparatus, the execution time information including the execution time of each instruction group contained in the corresponding instruction segment. The information acquisition module is further configured to: obtain the execution time information of the corresponding instruction segment recorded during each execution of the instruction to be debugged. The index determination module includes: an offset determination unit configured to determine a time offset from the execution times of the same instruction group in two adjacent instruction segments; a time calibration unit configured to calibrate, according to the time offset, the execution time information of the instruction segments other than the first instruction segment among the plurality of instruction segments; and a total time determination unit configured to determine the total execution time of the instruction to be debugged from the calibrated execution time information.
It should be understood that the various modules of the apparatus 800 shown in fig. 8 may correspond to the various steps in the method 400 described with reference to fig. 4. Thus, the operations, features and advantages described above with respect to the method 400 are equally applicable to the apparatus 800 and the modules comprised thereby. Certain operations, features and advantages may not be described in detail herein for the sake of brevity.
Although specific functionality is discussed above with reference to particular modules, it should be noted that the functionality of the various modules discussed herein may be divided into multiple modules and/or at least some of the functionality of multiple modules may be combined into a single module. For example, the metric determination module 820 and the performance determination module 830 described above may be combined into a single module in some embodiments.
It should also be appreciated that various techniques may be described herein in the general context of software, hardware elements, or program modules. The various modules described above with respect to fig. 8 may be implemented in hardware or in hardware combined with software and/or firmware. For example, the modules may be implemented as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium. Alternatively, the modules may be implemented as hardware logic/circuitry. For example, in some embodiments, one or more of the information obtaining module 810, the index determining module 820, and the performance judging module 830 may be implemented together in a system on a chip (SoC). The SoC may include an integrated circuit chip comprising one or more components of a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, Digital Signal Processor (DSP), etc.), memory, one or more communication interfaces, and/or other circuitry, and may optionally execute received program code and/or include embedded firmware to perform functions.
According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.
Referring to fig. 9, a block diagram of an electronic device 900, which may be a server or a client of the present disclosure, will now be described; it is an example of a hardware device that may be applied to aspects of the present disclosure. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown here, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the device 900; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 908 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 909 allows the device 900 to exchange information/data with other devices over a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as Bluetooth™ devices, 802.11 devices, Wi-Fi devices, Wi-Max devices, cellular communication devices, and/or the like.
The computing unit 901 may be any of various general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the various methods and processes described above, such as the method 400. For example, in some embodiments, the method 400 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method 400 described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the method 400 by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the methods, systems, and apparatus described above are merely exemplary embodiments or examples, and that the scope of the present disclosure is not limited by these embodiments or examples but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.