Method, apparatus, chip, device, medium and program product for performing operations

Document No.: 7353 Publication date: 2021-09-17

1. A method of performing arithmetic operations in deep learning training, comprising:

obtaining an instruction for the arithmetic operation, the arithmetic operation comprising a plurality of vector operations;

determining, for each vector operation of the plurality of vector operations, two source operand vectors for comparison; and

executing the vector operation on the two source operand vectors using an instruction format for the vector operation to obtain an operation result comprising a destination operand vector.

2. The method of claim 1, wherein the two source operand vectors each have a first number of elements, performing the vector operation on the two source operand vectors comprising:

for each element in the two source operand vectors, performing a second number of element-by-element comparison operations in parallel according to the data type of the element, wherein the first number is greater than or equal to the second number.

3. The method of claim 2, further comprising:

determining the values of the corresponding elements in the destination operand vector.

4. The method of claim 1, wherein the instruction format comprises a field for the two source operand vectors, a field for the destination operand vector, a field for a data type, an opcode field, and/or a reserved field.

5. The method of claim 4, wherein in the opcode field, an opcode comprises one of: comparing whether one object is less than another object; comparing whether one object is greater than another object; and comparing whether one object is equal to another object.

6. The method of claim 4, wherein the data type comprises one of: floating point number, half floating point number, signed integer, and unsigned integer.

7. The method of claim 1, wherein each vector operation of the plurality of vector operations is performed in the order of load, ALU operation, and store, and the execution of two adjacent vector operations of the plurality of vector operations partially overlaps.

8. An apparatus that performs arithmetic operations in deep learning training, comprising:

at least one vector acceleration module, the at least one vector acceleration module comprising:

an obtaining module configured to obtain an instruction for the arithmetic operation, the arithmetic operation comprising a plurality of vector operations;

a vector determination module configured to determine, for each vector operation of the plurality of vector operations, two source operand vectors for comparison; and

a vector calculation module configured to perform the vector operation on the two source operand vectors using an instruction format for the vector operation to obtain an operation result including a destination operand vector.

9. The apparatus of claim 8, wherein the two source operand vectors each have a first number of elements, the performing the vector operation on the two source operand vectors comprising:

for each element in the two source operand vectors, performing a second number of element-by-element comparison operations in parallel according to the respective data type of the element, wherein the first number is greater than or equal to the second number.

10. The apparatus of claim 9, wherein performing the vector operation on the two source operand vectors further comprises:

determining the values of the corresponding elements in the destination operand vector.

11. The apparatus of claim 8, wherein the instruction format comprises a field for the two source operand vectors, a field for the destination operand vector, a field for a data type, an opcode field, and/or a reserved field.

12. The apparatus of claim 11, wherein in the opcode field, an opcode comprises one of: comparing whether one object is less than another object; comparing whether one object is greater than another object; and comparing whether one object is equal to another object.

13. The apparatus of claim 11, wherein the data type of the destination operand vector comprises one of: floating point number, half floating point number, signed integer, and unsigned integer.

14. The apparatus of claim 8, wherein each vector operation of the plurality of vector operations is performed in the order of load, ALU operation, and store, and the execution of two adjacent vector operations of the plurality of vector operations partially overlaps.

15. A chip, comprising:

at least one processor; and

the apparatus of any of claims 8-14 communicatively connected with the at least one processor.

16. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.

17. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.

18. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.

Background

With the wide application of deep learning training, there are increasing demands for improving its speed. Operations in deep learning training may involve various kinds, such as scalar operations and vector operations. Deep learning algorithms often require complex operations, such as tensor operations, for various application scenarios. A tensor operation can be decomposed by a compiler into a series of continuous vector operations, and executing these vector operations usually occupies a large amount of computing resources, so that large numbers of vector operations cannot be processed in time; a system performing deep learning training may even abort the execution of the operations due to insufficient computing resources. Therefore, it is desirable to improve the efficiency of large numbers of consecutive vector operations in order to increase the speed of the overall deep learning training.

Disclosure of Invention

The present disclosure provides a method, an apparatus, a chip, an electronic device, a storage medium, and a program product for performing an arithmetic operation.

According to a first aspect of the present disclosure, there is provided a method of performing an arithmetic operation in deep learning training, comprising: obtaining an instruction for an arithmetic operation, the arithmetic operation comprising a plurality of vector operations; determining, for each vector operation of a plurality of vector operations, two source operand vectors for comparison; and performing a vector operation on the two source operand vectors using the instruction format for the vector operation to obtain an operation result including a destination operand vector.

According to a second aspect of the present disclosure, there is provided an apparatus that performs an arithmetic operation in deep learning training, including: an obtaining module configured to obtain an instruction for an arithmetic operation, the arithmetic operation including a plurality of vector operations; a vector determination module configured to determine, for each vector operation of a plurality of vector operations, two source operand vectors for comparison; and a vector calculation module configured to perform a vector operation on the two source operand vectors using an instruction format for the vector operation to obtain an operation result including a destination operand vector.

According to a third aspect of the present disclosure, there is provided a chip comprising at least one processor; and an apparatus according to the second aspect of the disclosure in communicative connection with the at least one processor.

According to a fourth aspect of the present disclosure, there is provided an electronic device comprising at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to implement a method according to the first aspect of the disclosure.

According to a fifth aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to implement a method according to the first aspect of the present disclosure.

According to a sixth aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, performs the method according to the first aspect of the present disclosure.

According to the technology of the present disclosure, a method of performing an arithmetic operation in deep learning training is provided, with which instructions for the arithmetic operation can be vectorized according to different data types, increasing the parallelism of the arithmetic operation and thereby accelerating it. As a result, in deep learning training, the large numbers of continuous vector operations that occupy substantial computing resources can be processed more efficiently, which in turn improves the computing speed of the overall deep learning training.

It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:

FIG. 1 is a schematic block diagram of a deep learning training environment 100 in which methods of performing arithmetic operations of certain embodiments of the present disclosure may be implemented;

FIG. 2 is a flow chart of a method 200 of arithmetic operations according to an embodiment of the present disclosure;

FIG. 3 is a flow chart of a method 300 of arithmetic operations according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of acceleration vector operations according to an embodiment of the present disclosure;

FIG. 5 is a diagram of a scenario in which continuous vector operations are performed, in which embodiments of the present disclosure may be implemented;

FIG. 6 is a block diagram of an apparatus 600 for performing arithmetic operations for implementing an embodiment of the present disclosure;

FIG. 7 is a schematic block diagram of a chip 700 for performing arithmetic operations to implement an embodiment of the present disclosure; and

FIG. 8 is a block diagram of an electronic device 800 for implementing a method of performing arithmetic operations of an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

As described in the background above, with the wide application of deep learning training, there are increasing demands for improving its speed. Operations in a deep learning algorithm may involve various kinds, such as scalar operations and vector operations. A typical tensor operation in a deep learning algorithm may be decomposed into a plurality of successive vector operations involving computation of a SETcc (set on condition code) operation; for example, SETlt and SETgt both belong to the SETcc operations, whose main behaviors are shown in table 1.

TABLE 1 SETcc operation

In the SETcc operation, the destination operand is set to 0 or 1 of the data type according to the comparison of the two source operands, and the data type of the destination operand is kept consistent with that of the source operands. The element-wise (EW) comparison operation is a common operation in deep learning algorithms, and SETlt and SETgt are used to calculate the backward gradient of the EW comparison operation during algorithm training. Table 2 below shows a common EW comparison algorithm.

TABLE 2 EW Algorithm
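The SETcc behavior described above can be sketched in a few lines of Python. This is a minimal illustration of the semantics only, not the hardware implementation; the function name and condition-code strings are our own.

```python
def setcc(cond, a, b):
    """Scalar SETcc sketch: set the destination to 1 or 0, in the same data
    type as the source operands, based on a condition-code comparison."""
    result = {"lt": a < b, "gt": a > b, "eq": a == b}[cond]
    # The destination keeps the source operands' data type (e.g. float -> 1.0)
    return type(a)(1) if result else type(a)(0)

setcc("lt", 1.5, 2.0)   # → 1.0 (float, matching the sources)
setcc("gt", 3, 7)       # → 0 (int)
```

SETlt and SETgt then correspond to `cond="lt"` and `cond="gt"` applied per element during the backward pass of an EW comparison.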

In deep learning training, one may consider how to increase the computation speed of the training process by accelerating vector operations in the acceleration unit for the backward training algorithm in an Artificial Intelligence (AI) chip processor. When the number of arithmetic operations is particularly large, the speed of those operations is a major limitation on the computational power of the AI chip processor. First, in deep learning training, executing a large number of vector operations often requires a large amount of computing resources, so large numbers of consecutive vector operations cannot be processed in time; a system performing deep learning training may even abort execution due to insufficient computing resources. Second, mainstream deep learning algorithms in the conventional technology have certain problems in handling large numbers of vector operations. For example, the vector acceleration units of conventional CPU and GPU processors do not support SETcc instructions, and when deep learning algorithm training involves SETcc operations, two solutions are currently adopted: (1) serialized operations using scalar units; (2) acceleration by starting multi-core parallelism.
Scheme (1) is generally used in CPU processors from Intel/ARM vendors, where the number of processor cores is usually small; from the viewpoint of the programming model, it is not suitable to execute the same algorithm kernel on multiple processor cores simultaneously, so processing can only be serialized through each core's scalar processing unit. The serial processing time is long, with a delay N times that of parallel processing (typically N is 8 or 16). Scheme (2) is generally used in GPU processors, where the number of threads is large and a task can easily be divided across multiple threads from the programming model, improving speed relative to serial processing, but at the cost of high inter-thread synchronization overhead. As a result, the conventional technology underutilizes the chip processor, so the performance-to-power ratio of the chip processor is not high and the efficiency of deep learning suffers.

To address, at least in part, one or more of the above issues and other potential issues, embodiments of the present disclosure propose a scheme for performing arithmetic operations in deep learning training. In this scheme, vectorizing the instructions for the operation increases the parallelism of the operation and can improve its computation speed. Second, this approach avoids the inefficiency of serialized CPU processing, because multiple element operations are performed simultaneously. Third, it avoids the synchronization overhead of GPU processing, because threads are not needed to synchronize progress on the same computing task. By utilizing the technical scheme of the present disclosure, the AI chip processor is used effectively, so the speed of deep learning training is effectively improved.

FIG. 1 illustrates a schematic block diagram of a deep learning training environment 100 in which methods of performing arithmetic operations in certain embodiments of the present disclosure may be implemented. In accordance with one or more embodiments of the present disclosure, deep learning training environment 100 may be a cloud environment. As shown in fig. 1, deep learning training environment 100 includes a computing device 110. In the deep learning training environment 100, the input data 120 is provided to the computing device 110 as input to the computing device 110. The input data 120 may include, for example, data associated with an arithmetic operation for deep learning, data associated with an instruction for an arithmetic operation, and the like. As also shown in fig. 1, computing device 110 includes a scalar processing unit 113 and a vector acceleration unit 115.

According to one or more embodiments of the present disclosure, when an arithmetic operation for deep learning needs to be performed, the associated data is provided as input data 120 to the computing device 110. The scalar processing unit 113 (sometimes referred to as a core module) in the computing device 110 then performs basic scalar processing on the input data 120 and converts it into the form of instructions for the operation (for example, a SETcc instruction or a vector SETcc (vSETcc) instruction) through operations such as instruction fetch (IF) and instruction decode (ID), although the scope of the present disclosure is not limited thereto. Instructions for arithmetic operations may then be written back to the memory of the scalar processing unit 113 after being processed by the arithmetic logic unit (ALU), or may be distributed to the vector acceleration unit 115 (sometimes also referred to as a vector acceleration module).

In an embodiment of the present disclosure, support for a new instruction, vSETcc, is proposed on the basis of the existing 32-bit instruction set architecture to support operations on the input data 120; its format is shown in table 3. The design of the instruction format mainly considers two issues: (1) compatibility: by employing an independent opcode field, existing instruction formats are not affected; (2) extensibility: the instruction format fully considers possible future extension requirements and takes a specific field as a reserved field. It should be understood that the vSETcc instruction is given as one example of implementing an arithmetic operation, and a person skilled in the art will be able to use the content and spirit of the present disclosure to define instructions implementing similar or new functions. By way of example only, an implementation of the vSETcc instruction is shown in table 3.

TABLE 3 vSETcc instruction format

As shown in table 3, in the vSETcc instruction, a specific field (e.g., the xfinct field) is used as a reserved field. It should be understood that other fields may also be reserved for possible future extension requirements. As also shown in table 3, in the opcode field, an opcode refers to a specific vector operation, for example, to distinguish whether a condition code belongs to one of "object less than another object (LessThan)", "object greater than another object (GreaterThan)", and "object equal to another object (Equal)". In addition, table 3 shows the supported data types of the vector data, such as floating point number (float), half floating point number (bfloat), signed integer (int), unsigned integer (unsigned), and the like. It should be understood that although only the above data types are shown here, other data types may be used, such as 16-bit signed two's-complement integers (short), 64-bit signed two's-complement integers (long), double-precision 64-bit floating point numbers (double) conforming to the IEEE 754 standard, single 16-bit Unicode characters (char), a boolean representing one bit of information, and so forth.
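Since table 3 is not reproduced in this text, the following sketch of packing the vSETcc fields into a 32-bit word uses assumed field positions, widths, and encodings; only the set of fields (opcode, data type, two sources, destination, reserved bits) comes from the description above.

```python
# Hypothetical encodings; the real opcode/vtype values are defined by table 3.
OPCODES = {"LessThan": 0b00, "GreaterThan": 0b01, "Equal": 0b10}
VTYPES = {"float": 0b00, "bfloat": 0b01, "int": 0b10, "unsigned": 0b11}

def encode_vsetcc(opcode, vtype, src0, src1, dst):
    """Pack the fields of a hypothetical 32-bit vSETcc word.
    Bit positions are illustrative; the top bits stay zero as a reserved field."""
    assert src0 < 32 and src1 < 32 and dst < 32   # assume 5-bit register indices
    word = OPCODES[opcode] << 26      # independent opcode field
    word |= VTYPES[vtype] << 24       # data-type field (vtype)
    word |= src0 << 16                # first source operand vector register
    word |= src1 << 8                 # second source operand vector register
    word |= dst                       # destination operand vector register
    return word

word = encode_vsetcc("LessThan", "float", 1, 2, 3)
```

The independent opcode field keeps the encoding compatible with existing formats, and the untouched high bits model the reserved field kept for future extension.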

In the vector acceleration unit 115, an instruction for an arithmetic operation (e.g., a SETcc instruction) is vectorized so that the element-by-element operations of each vector operation (also referred to as a vectorization operation) are executed in parallel, while the plurality of vector operations are executed in a pipelined sequence. The scalar processing unit 113 and the vector acceleration unit 115 interact through a simple interface, which achieves a degree of independence in module development and reduces the impact on existing processor units.

It should be appreciated that the deep learning training environment 100 is merely exemplary and not limiting, and is scalable: it may include more computing devices 110, and more input data 120 may be provided to them, so that more users can use the additional computing devices 110 and input data 120 to determine and perform arithmetic operations for multiple deep learning tasks, simultaneously or otherwise. In addition, the computing device 110 may also include other elements, such as a data storage element, an information preprocessing element, and so forth.

FIG. 2 illustrates a flow diagram of a method 200 of performing an arithmetic operation in accordance with an embodiment of the present disclosure. In particular, the method 200 of performing an arithmetic operation may be performed by the computing device 110 in the deep learning training environment 100 shown in fig. 1. It should be understood that the method 200 of performing an arithmetic operation may also include additional operations not shown and/or may omit illustrated operations, as the scope of the present disclosure is not limited in this respect.

At block 202, the computing device 110 obtains an instruction for an arithmetic operation, the arithmetic operation comprising a plurality of vector operations. According to one or more embodiments of the present disclosure, the instruction for the arithmetic operation may be the input data 120, or may be an instruction processed by the scalar processing unit 113 in the computing device 110.

At block 204, the computing device 110 determines, for each vector operation of the plurality of vector operations obtained at block 202, two source operand vectors for comparison. In accordance with one or more embodiments of the present disclosure, the source operands involved in each vector operation are distributed by data type into a vector register file (VRF), a cache, or another type of temporary storage. Since the method 200 aims to speed up operations within the framework of an existing chip processor, the problems to be solved are reducing the delay of serially processing scalar operations while reducing or avoiding synchronization overhead among different threads. The method 200 solves these problems by vectorizing the instructions for arithmetic operations, for example using the vSETcc instruction format.

At block 206, the computing device 110 performs a vector operation on the two source operand vectors using the instruction format for the vector operation to obtain an operation result including a destination operand vector. In accordance with one or more embodiments of the present disclosure, for data that needs to be operated on, for example data to be compared, the data are combined in the form of vectors, a corresponding operation is performed on each element in the vectors, and this process of obtaining the calculation result is a vectorization operation or vector operation. By vectorizing the instructions for the operation, the parallelism of the operation is increased, which can improve its computation speed.
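The step at block 206 can be modeled with NumPy, where the element-wise comparison is applied to whole vectors at once and the destination keeps the sources' data type. This is a behavioral sketch only; the function name and condition strings are our own, not part of the instruction set.

```python
import numpy as np

def vsetcc(cond, src0, src1):
    """Model of a vectorized SETcc: compare two source operand vectors
    element-by-element; return a destination vector of 0s and 1s in the
    same data type as the sources."""
    mask = {"lt": src0 < src1,
            "gt": src0 > src1,
            "eq": src0 == src1}[cond]          # boolean result per element
    return mask.astype(src0.dtype)             # destination keeps source dtype

src0 = np.array([1.0, 3.0, 2.0], dtype=np.float32)
src1 = np.array([2.0, 1.0, 2.0], dtype=np.float32)
dst = vsetcc("lt", src0, src1)                 # → [1.0, 0.0, 0.0] as float32
```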

FIG. 3 illustrates a flow diagram of a method 300 of performing an arithmetic operation in accordance with an embodiment of the present disclosure. In particular, the method 300 of performing an arithmetic operation may also be performed by the computing device 110 in the deep learning training environment 100 shown in fig. 1. It should be understood that the method 300 of performing an arithmetic operation may be considered an extension of the method 200 of performing an arithmetic operation, and that it may also include additional operations not shown and/or may omit the operations shown, as the scope of the present disclosure is not limited in this respect.

At block 302, the computing device 110 obtains an instruction for an arithmetic operation, the arithmetic operation comprising a plurality of vector operations. The specific content of the step referred to in the block 302 is the same as that of the step referred to in the block 202, and is not described herein again.

At block 304, the computing device 110 determines, for each vector operation of the plurality of vector operations obtained at block 202, two source operand vectors for comparison. The specific content of the step referred to in the block 304 is the same as that of the step referred to in the block 204, and is not described herein again.

At block 306, the computing device 110 uses the instruction format for vector operations to perform, for each element in the two source operand vectors (each having a first number of elements), a second number of element-by-element comparison operations in parallel according to the respective data type of the element, where the first number is greater than or equal to the second number, to obtain an operation result that includes a destination operand vector.

In accordance with one or more embodiments of the present disclosure, for data that needs to be operated on, e.g., compared, combining the data into vectors and operating on the resulting two source operand vectors is superior to operating on two scalar source operands, because elements of the same type are handled together. The two source operand vectors each have a first number of elements, and a second number of element-by-element comparison operations is then performed on those elements in parallel according to their data type. It will be appreciated that on a chip with limited resources the number of processing units may be relatively small; thus, for a first number of elements requiring operations, the number of element operations performed in the corresponding processing unit may be equal to or less than the number of elements, and elements whose operation has not yet completed wait in turn for the next parallel processing cycle. In other words, in the solution of the present disclosure, the number of elements in the source operand vector (i.e., the first number) may be greater than or equal to the number of element-by-element operations performed in parallel (i.e., the second number). Therefore, the technical scheme of the present disclosure can be used not only on next-generation chip processors with powerful computing capability, but also on existing chip processors with limited resources, thereby improving the utilization of existing chip processors.
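The relationship between the first number (vector length) and the second number (parallel lanes) can be sketched as follows; each outer loop iteration models one parallel processing cycle in which at most `lanes` comparisons run at once. The names and batching loop are our illustration, not the hardware scheduler.

```python
def vsetcc_lanes(cond, src0, src1, lanes):
    """Compare two source operand vectors element-by-element when only
    `lanes` (the second number) comparisons can run per cycle; elements
    beyond that wait for the next parallel processing cycle."""
    compare = {"lt": lambda a, b: a < b,
               "gt": lambda a, b: a > b,
               "eq": lambda a, b: a == b}[cond]
    dst = []
    for i in range(0, len(src0), lanes):       # one parallel cycle per batch
        for a, b in zip(src0[i:i + lanes], src1[i:i + lanes]):
            dst.append(type(a)(1) if compare(a, b) else type(a)(0))
    return dst

# First number 4, second number 2: the 4 elements take two parallel cycles
vsetcc_lanes("lt", [1.0, 3.0, 2.0, 5.0], [2.0, 1.0, 4.0, 0.0], lanes=2)
# → [1.0, 0.0, 1.0, 0.0]
```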

Fig. 4 is a schematic diagram of a process 400 of accelerating vector operations according to an embodiment of the present disclosure. In accordance with one or more embodiments, in FIG. 4, the vector operation is performed by first loading data from memory into the corresponding source operand register set VRF (401), and after the operands are ready, they are sent to one of the corresponding comparison sub-modules (431 and 437) for operation, and the operation result is finally written back (store) into the memory space. It should be appreciated that for partially reusable data, the process of loading from memory may be omitted.

As shown in fig. 4, source operand vectors src0 (a 1xN1 vector) 411 and src1 (a 1xN1 vector) 413, each having a first number N1 of elements, are distributed by data type to a second number N2 of operation submodules of the same data type; that is, the number of elements participating in the current element-by-element comparison operation in parallel is N2. As previously mentioned, on a resource-limited chip the number of processing units may be relatively small; by setting the number N2 of comparison submodules that perform the element-by-element comparison operation in parallel to be less than or equal to the number N1 of elements in the source operand vector, the technical scheme can make effective use of the chip processor, thereby effectively increasing the training speed of the deep learning algorithm on the chip processor. After calculation by the operation submodule of the matching data type (taking the floating point operation submodule 431 as an example), each floating point element in src0 411 is compared with the corresponding floating point element in src1 413, and the multiplexer 451 determines whether the comparison result between them is true or false. It should be understood that the condition code under which the comparison is made may be one of "object less than another object (LessThan)", "object greater than another object (GreaterThan)", and "object equal to another object (Equal)". Then, after the multiplexer 471 makes a determination on the data type (vtype), the destination operand dst 491 is set to the constant 1 of that data type if the comparison result holds, and to the constant 0 of that data type otherwise.

In accordance with one or more embodiments of the present disclosure, because in fig. 4 the comparison submodule of each data type operates on instructions of all data types, the valid data type is determined at the comparison calculation result. It should be appreciated that the determination of the data type in FIG. 4 may instead be made at the source operand vectors, so that only the comparison submodule of one type is determined to execute before the operation is performed. Further, it should be understood that the specific data types listed in FIG. 4 are shown by way of example only and do not exclude other possible data types.

Fig. 5 is a scenario diagram of performing a continuous vector operation 500 in which an embodiment of the present disclosure may be implemented. As shown in fig. 5, each of the successive vector operations is performed in the order of load (LD), ALU operation, and store (ST), and the execution of two adjacent vector operations partially overlaps rather than being fully serialized. By implementing such execution of continuous vector operations, combined with the parallel execution of element-by-element comparison operations in fig. 4, the technical solution of the present disclosure offers a significant advance in processing large numbers of complex operations compared with conventional CPU and GPU processors: it not only reduces the delay of serial processing but also avoids the problem of large inter-thread synchronization overhead in parallel processing.
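The partially overlapping LD/ALU/ST schedule can be modeled as a simple three-stage pipeline. Under this model (our own sketch, not the chip's actual scheduler, which fig. 5 depicts), n vector operations complete in n + 2 cycles instead of the 3n cycles of fully serial execution.

```python
def pipeline_schedule(n_ops):
    """Return {cycle: [(op_index, stage), ...]} for n_ops vector operations
    flowing through the LD -> ALU -> ST stages, one stage per cycle, with
    adjacent operations overlapping by two stages."""
    timeline = {}
    for k in range(n_ops):
        for offset, stage in enumerate(("LD", "ALU", "ST")):
            timeline.setdefault(k + offset, []).append((k, stage))
    return timeline

sched = pipeline_schedule(4)
total_cycles = len(sched)    # 4 + 2 = 6 cycles, versus 3 * 4 = 12 serially
```

In cycle 1, for example, operation 1 is in LD while operation 0 is already in ALU, which is exactly the partial overlap of two adjacent vector operations described above.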

The deep learning training environment 100 associated with a method of performing arithmetic operations in certain embodiments of the present disclosure, a method 200 of performing arithmetic operations in accordance with an embodiment of the present disclosure, a method 300 of performing arithmetic operations in accordance with an embodiment of the present disclosure, accelerated vector operations in accordance with an embodiment of the present disclosure, and related matter of performing continuous vector operations in accordance with an embodiment of the present disclosure are described above with reference to fig. 1-5. It should be understood that the above description is intended to better illustrate what is recited in the present disclosure, and is not intended to be limiting in any way.

It should be understood that the number of the various elements and the sizes of the physical quantities employed in the drawings of the present disclosure are examples only and do not limit the scope of the present disclosure. These numbers and sizes may be set arbitrarily as needed without affecting the normal implementation of the embodiments of the present disclosure.

Details of the method 200 of performing an arithmetic operation and the method 300 of performing an arithmetic operation according to an embodiment of the present disclosure have been described above with reference to fig. 1 to 5. Hereinafter, respective modules in the apparatus that performs an arithmetic operation will be described with reference to fig. 6.

FIG. 6 is a block diagram of an apparatus 600 for performing arithmetic operations for implementing embodiments of the present disclosure. As shown in fig. 6, the apparatus 600 for performing an arithmetic operation includes: an obtaining module 610 configured to obtain an instruction for an arithmetic operation, the arithmetic operation including a plurality of vector operations; a vector determination module 620 configured to determine, for each vector operation of a plurality of vector operations, two source operand vectors for comparison; and a vector calculation module 630 configured to perform a vector operation on the two source operand vectors using an instruction format for the vector operation to obtain an operation result including a destination operand vector.

In one or more embodiments, wherein the two source operand vectors each have a first number of elements, performing the vector operation on the two source operand vectors comprises: for each element in the two source operand vectors, a second number of element-by-element comparison operations are performed in parallel according to the respective data type of the element, where the first number is greater than or equal to the second number.
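The per-element semantics of the comparison described above can be sketched in Python. The function name and the 0/1 result encoding for the destination operand vector are illustrative assumptions; in hardware, a second number of these comparisons execute in parallel, whereas the Python loop only models the element-by-element meaning:

```python
def elementwise_compare(src_a, src_b, op):
    """Compare two source operand vectors element by element.

    Returns a destination operand vector whose elements are 1 where the
    comparison holds and 0 otherwise (an assumed encoding for illustration).
    """
    assert len(src_a) == len(src_b), "source operand vectors must match in length"
    ops = {
        "lt": lambda x, y: x < y,   # is one object smaller than another
        "gt": lambda x, y: x > y,   # is one object larger than another
        "eq": lambda x, y: x == y,  # are the two objects equal
    }
    compare = ops[op]
    # In hardware these comparisons run in parallel across lanes.
    return [1 if compare(a, b) else 0 for a, b in zip(src_a, src_b)]
```

For example, `elementwise_compare([1, 5, 3], [2, 4, 3], "lt")` yields `[1, 0, 0]`, which corresponds to determining the values of the corresponding elements in the destination operand vector.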

In one or more embodiments, wherein performing the vector operation on the two source operand vectors further comprises: the values of the corresponding elements in the destination operand vector are determined.

In one or more embodiments, the instruction format includes a field for the two source operand vectors, a field for the destination operand vector, a field for the data type, an opcode field, and/or a reserved field.

In one or more embodiments, in the opcode field, the opcode includes one of: comparing whether an object is smaller than another object; comparing whether an object is larger than another object; and comparing whether an object is equal to another object.
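An instruction format with these fields can be sketched as a simple bit-packing scheme in Python. The claims name only the fields, not their widths or positions, so the 32-bit layout, the field widths, and the opcode/data-type code values below are all illustrative assumptions:

```python
OPCODES = {"lt": 0b00, "gt": 0b01, "eq": 0b10}                 # opcode field values (assumed)
DTYPES = {"fp32": 0b00, "fp16": 0b01, "int": 0b10, "uint": 0b11}  # data type field values (assumed)

def encode(op, dtype, dst, src1, src2):
    """Pack one compare instruction into a 32-bit word.

    Assumed layout (high to low bits): 2-bit opcode, 2-bit data type,
    three 5-bit vector register indices, 13-bit reserved field.
    """
    word = OPCODES[op]                 # opcode field
    word = (word << 2) | DTYPES[dtype] # data type field
    word = (word << 5) | dst           # destination operand vector register
    word = (word << 5) | src1          # first source operand vector register
    word = (word << 5) | src2          # second source operand vector register
    word <<= 13                        # reserved field, zero-filled
    return word

def decode(word):
    """Unpack a word produced by encode() back into its fields."""
    word >>= 13                        # drop the reserved field
    src2 = word & 0x1F; word >>= 5
    src1 = word & 0x1F; word >>= 5
    dst = word & 0x1F; word >>= 5
    dtype = word & 0x3; word >>= 2
    op = word & 0x3
    return op, dtype, dst, src1, src2
```

A round trip such as `decode(encode("eq", "fp16", 3, 1, 2))` recovers the five field values, which is the property an instruction decoder relies on.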

In one or more embodiments, the data type includes one of: floating point number, half-precision floating point number, signed integer, and unsigned integer.

In one or more embodiments, each vector operation of the plurality of vector operations is performed in the order of load, ALU operation, store, and the execution of two adjacent vector operations of the plurality of vector operations partially overlaps.

Through the above description with reference to FIGS. 1 to 6, the technical solution according to the embodiments of the present disclosure has many advantages over conventional solutions. For example, by vectorizing the instructions for the arithmetic operation, increasing the parallelism of the operation, and performing parallel element-by-element comparisons across consecutive vector operations, the technical solution of the present disclosure can effectively improve the computation speed of deep learning training.

In the technical solution of the present disclosure, the acquisition, storage, and application of users' personal information comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

FIG. 7 is a schematic block diagram of a chip 700 for performing arithmetic operations according to an embodiment of the present disclosure. As shown in FIG. 7, the chip 700 may include a processor 710 that converts input data into instructions through operations such as instruction fetch and decode and dispatches them to a vector acceleration module 720; in turn, the vector acceleration module 720 may return the accelerated vector operation result to the processor 710. It should be understood that the chip 700 may include multiple processors 710 and multiple vector acceleration modules 720, and that the vector acceleration module 720 may be the apparatus 600 shown in FIG. 6, or a combination of multiple such apparatuses. It should also be appreciated that the chip 700 may be implemented separately or in combination with other existing hardware architectures to increase the operation speed and utilization of the hardware system comprising the chip.
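The fetch/decode/dispatch interplay between the processor 710 and the vector acceleration module 720 can be sketched as follows. The class and method names are illustrative assumptions, and decoded instructions are modeled as Python tuples rather than the binary words the real chip would use:

```python
class VectorAccelerationModule:
    """Stands in for module 720: executes one comparison vector operation."""

    def execute(self, opcode, src1, src2):
        compare = {"lt": lambda a, b: a < b,
                   "gt": lambda a, b: a > b,
                   "eq": lambda a, b: a == b}[opcode]
        # Return the destination operand vector (0/1 encoding assumed).
        return [1 if compare(a, b) else 0 for a, b in zip(src1, src2)]

class Processor:
    """Stands in for processor 710: decodes input and dispatches to 720."""

    def __init__(self, accelerator):
        self.accelerator = accelerator

    def run(self, program):
        # Each decoded instruction is modeled as (opcode, src1, src2);
        # the accelerator returns the result for each dispatched instruction.
        return [self.accelerator.execute(*insn) for insn in program]
```

For example, `Processor(VectorAccelerationModule()).run([("lt", [1, 5], [2, 4])])` returns the destination vectors for the dispatched instructions, mirroring the round trip between 710 and 720.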

The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.

FIG. 8 illustrates a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in FIG. 8, the device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 802 or a computer program loaded from a storage unit 808 into a random access memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.

A number of components in the device 800 are connected to the I/O interface 805, including: an input unit 806, such as a keyboard, a mouse, or the like; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, or the like; and a communication unit 809 such as a network card, modem, wireless communication transceiver, etc. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.

Computing unit 801 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 801 performs the various methods and processes described above, such as the methods 200, 300. For example, in some embodiments, the methods 200, 300 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed onto device 800 via ROM 802 and/or communications unit 809. When loaded into RAM 803 and executed by the computing unit 801, a computer program may perform one or more of the steps of the methods 200, 300 described above. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the methods 200, 300 in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.

The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
