Method and system for plug-in management of dirty data in Flink-based real-time tasks
1. A method for plug-in management of dirty data in Flink-based real-time tasks, characterized by comprising the following steps:
dirty data configuration information is acquired, and a dirty data manager and a corresponding dirty data plug-in object are instantiated according to it;
the dirty data manager collects the dirty data and exception causes generated by a task and stores them in a message queue;
the dirty data manager starts a dirty data consumer in the dirty data plug-in object;
the dirty data consumer polls the message queue and consumes the dirty data in it;
each time the dirty data consumer successfully consumes dirty data from the message queue, the dirty data manager adds 1 to a dirty data consumption count, and when that count reaches a preset dirty data consumption limit, the task is determined to have failed; each time the dirty data consumer fails to consume dirty data from the message queue, the dirty data manager adds 1 to a failed-consumption count, and when that count reaches a preset failed-consumption limit, the task is determined to have failed.
2. The method of claim 1, wherein, before the dirty data manager starts the dirty data consumer in the dirty data plug-in object, the method further comprises:
the dirty data manager identifies the first piece of dirty data in the message queue.
3. The method of claim 1, wherein, before the dirty data manager collects the dirty data and exception causes generated by the task and stores them in the message queue, the method further comprises:
the dirty data manager initializes the message queue.
4. The method of claim 1, wherein the dirty data consumer polling the message queue and consuming the dirty data in it comprises:
the dirty data consumer consumes the dirty data in the message queue by polling performed in a dirty data consumer subclass.
5. A system for plug-in management of dirty data in Flink-based real-time tasks, characterized by comprising: a dirty data manager and a dirty data consumer, each instantiated according to dirty data configuration information;
the dirty data manager is used for collecting the dirty data and exception causes generated by a task and storing them in a message queue, and for starting the dirty data consumer in a dirty data plug-in object;
the dirty data consumer is used for polling the message queue and consuming the dirty data in it;
the dirty data manager is further used for adding 1 to a dirty data consumption count each time the dirty data consumer successfully consumes dirty data from the message queue, and determining that the task has failed when that count reaches a preset dirty data consumption limit; and for adding 1 to a failed-consumption count each time the dirty data consumer fails to consume dirty data from the message queue, and determining that the task has failed when that count reaches a preset failed-consumption limit.
6. The system of claim 5, wherein the dirty data manager is further configured to identify the first piece of dirty data in the message queue before starting the dirty data consumer in the dirty data plug-in object.
7. The system of claim 5, wherein the dirty data manager is further configured to initialize the message queue before collecting the dirty data and exception causes generated by the task and storing them in the message queue.
8. The system of claim 5, wherein the dirty data consumer polls the message queue and consumes the dirty data in it as follows:
the dirty data consumer is specifically used for consuming the dirty data in the message queue by polling performed in a dirty data consumer subclass.
Background
From the perspective of a big data warehouse, the error data generated while a task runs, whether the task is real-time or offline, is itself valid data and should be kept as part of the historical data. Current Flink real-time tasks do not record error data to storage in a usable way: the error data cannot be fed into other tasks for further computation and is simply filtered out, so part of the data is lost and the accuracy of the task result suffers.
As for the strategy applied when error data is encountered, the task either fails immediately and restarts, or ignores all dirty data; whether the task stops cannot be controlled according to the number of errors.
Disclosure of Invention
The present invention aims to provide a method and system for plug-in management of dirty data in Flink-based real-time tasks that overcomes, or at least partially solves, the above problems.
To achieve this purpose, the technical solution of the invention is realized as follows:
The invention provides a method for plug-in management of dirty data in Flink-based real-time tasks, which comprises the following steps: dirty data configuration information is acquired, and a dirty data manager and a corresponding dirty data plug-in object are instantiated according to it; the dirty data manager collects the dirty data and exception causes generated by a task and stores them in a message queue; the dirty data manager starts a dirty data consumer in the dirty data plug-in object; the dirty data consumer polls the message queue and consumes the dirty data in it; each time the dirty data consumer successfully consumes dirty data from the message queue, the dirty data manager adds 1 to a dirty data consumption count, and when that count reaches a preset dirty data consumption limit, the task is determined to have failed; each time the dirty data consumer fails to consume dirty data from the message queue, the dirty data manager adds 1 to a failed-consumption count, and when that count reaches a preset failed-consumption limit, the task is determined to have failed.
Before the dirty data manager starts the dirty data consumer in the dirty data plug-in object, the method further comprises: the dirty data manager identifies the first piece of dirty data in the message queue.
Before the dirty data manager collects the dirty data and exception causes generated by the task and stores them in the message queue, the method further comprises: the dirty data manager initializes the message queue.
The dirty data consumer polling the message queue and consuming the dirty data in it comprises: the dirty data consumer consumes the dirty data in the message queue by polling performed in a dirty data consumer subclass.
The invention also provides a system for plug-in management of dirty data in Flink-based real-time tasks, which comprises: a dirty data manager and a dirty data consumer, each instantiated according to dirty data configuration information; the dirty data manager is used for collecting the dirty data and exception causes generated by a task and storing them in a message queue, and for starting the dirty data consumer in a dirty data plug-in object; the dirty data consumer is used for polling the message queue and consuming the dirty data in it; the dirty data manager is further used for adding 1 to a dirty data consumption count each time the dirty data consumer successfully consumes dirty data from the message queue, and determining that the task has failed when that count reaches a preset dirty data consumption limit; and for adding 1 to a failed-consumption count each time the dirty data consumer fails to consume dirty data from the message queue, and determining that the task has failed when that count reaches a preset failed-consumption limit.
The dirty data manager is further configured to identify the first piece of dirty data in the message queue before starting the dirty data consumer in the dirty data plug-in object.
The dirty data manager is further configured to initialize the message queue before collecting the dirty data and exception causes generated by the task and storing them in the message queue.
The dirty data consumer polls the message queue and consumes the dirty data in it as follows: the dirty data consumer is specifically used for consuming the dirty data in the message queue by polling performed in a dirty data consumer subclass.
Therefore, the method and system for plug-in management of dirty data in Flink-based real-time tasks provided by the embodiments of the invention adopt the producer-consumer design pattern and use a message queue as a buffer: the dirty data manager collects the dirty data and the dirty data consumer processes it. The dirty data consumer only needs to handle the consumption of messages and does not need to care whether the task fails, so the management of dirty data is decoupled from its processing.
In addition, the specific processing of dirty data is implemented through object inheritance, with each subclass providing its own implementation.
The invention thus lets a task decide whether to stop according to how dirty data processing is going, and applies plug-in management to the dirty data in real-time tasks, which effectively improves fault tolerance during task execution.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for plug-in management of dirty data in Flink-based real-time tasks according to an embodiment of the present invention;
FIG. 2 is a flowchart of the steps executed by the dirty data manager in the method for plug-in management of dirty data in Flink-based real-time tasks according to the embodiment of the present invention;
FIG. 3 is a flowchart of the steps executed by the dirty data consumer in the method for plug-in management of dirty data in Flink-based real-time tasks according to the embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system for plug-in management of dirty data in Flink-based real-time tasks according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
FIG. 1 shows a flowchart of a method for plug-in management of dirty data in Flink-based real-time tasks according to an embodiment of the present invention. Referring to FIG. 1, the method includes:
S1, dirty data configuration information is acquired, and the dirty data manager and the corresponding dirty data plug-in object are instantiated according to it.
Specifically, the dirty data rule is configured, the dirty data manager Dirty-Manager is instantiated, and the corresponding dirty data plug-in object Dirty-plugins is instantiated. The dirty data manager records the dirty data and the exception cause, stores them in the message queue, manages the start-up of the dirty data plug-in, and maintains the dirty data message queue. The dirty data consumer consumes and processes the dirty data in the queue.
The dirty data manager records the dirty data and the exception cause and stores them in the message queue, acting as the producer. The dirty data consumer Dirty-consumer, instantiated in the corresponding dirty data plug-in object Dirty-plug, asynchronously polls and consumes the data in the queue, acting as the consumer; different consumers implement different processing.
S2, the dirty data manager collects the dirty data and exception causes generated by the task and stores them in the message queue.
As an optional implementation of the embodiment of the present invention, before the dirty data manager collects the dirty data and exception causes generated by the task and stores them in the message queue, the method further includes: the dirty data manager initializes the message queue.
Specifically, the dirty data and the exception cause are stored in a message queue. When the dirty data manager Dirty-Manager collects dirty data and exception information, it generates a Dirty-entry object and stores it in the Dirty-queue message queue so that the dirty data consumer can subsequently consume it.
S3, the dirty data manager starts a dirty data consumer in the dirty data plug-in object.
As an optional implementation of the embodiment of the present invention, before the dirty data manager starts the dirty data consumer in the dirty data plug-in object, the method further includes: the dirty data manager identifies the first piece of dirty data in the message queue. Once the first piece of dirty data in the message queue has been identified, the dirty data consumer can be started to consume it.
S4, the dirty data consumer polls the message queue and consumes the dirty data in it.
Specifically, the dirty data consumer only consumes the dirty data and is not concerned with whether the task fails.
As an optional implementation of the embodiment of the present invention, the dirty data consumer polling the message queue and consuming the dirty data in it includes: the dirty data consumer consumes the dirty data in the message queue by polling performed in a dirty data consumer subclass. Because consumption is implemented by subclasses, the handling of dirty data can be managed as plug-ins.
S5, each time the dirty data consumer successfully consumes dirty data from the message queue, the dirty data manager adds 1 to the dirty data consumption count, and when that count reaches the preset dirty data consumption limit, the task is determined to have failed; each time the dirty data consumer fails to consume dirty data from the message queue, the dirty data manager adds 1 to the failed-consumption count, and when that count reaches the preset failed-consumption limit, the task is determined to have failed.
Specifically, the dirty data manager treats the task as failed if the cumulative number of dirty data records consumed in the task reaches its threshold and/or the number of consumption failures reaches its threshold; otherwise the task is treated as successful.
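Steps S2 and S5 can be illustrated with a minimal Java sketch. The class and method names (DirtyEntry, DirtyManagerSketch, collect, onConsumed, onConsumeFailed, taskFailed) are hypothetical and chosen only for illustration; the text does not specify them, and the use of an in-memory LinkedBlockingQueue as the message queue is likewise an assumption.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical value object pairing a dirty record with its exception cause.
final class DirtyEntry {
    final String data;
    final String reason;

    DirtyEntry(String data, String reason) {
        this.data = data;
        this.reason = reason;
    }
}

// Hypothetical manager: producer side of the queue plus the two counters of S5.
final class DirtyManagerSketch {
    // Assumption: an in-memory queue stands in for the message queue.
    private final BlockingQueue<DirtyEntry> queue = new LinkedBlockingQueue<>();
    private final AtomicLong consumedCount = new AtomicLong();
    private final AtomicLong failedCount = new AtomicLong();
    private final long consumedLimit;
    private final long failedLimit;

    DirtyManagerSketch(long consumedLimit, long failedLimit) {
        this.consumedLimit = consumedLimit;
        this.failedLimit = failedLimit;
    }

    // S2: store the dirty record and its exception cause in the queue.
    void collect(String data, Throwable cause) {
        queue.add(new DirtyEntry(data, String.valueOf(cause.getMessage())));
    }

    BlockingQueue<DirtyEntry> queue() {
        return queue;
    }

    // S5: called after the consumer handled one entry successfully.
    void onConsumed() {
        consumedCount.incrementAndGet();
    }

    // S5: called after the consumer failed to handle one entry.
    void onConsumeFailed() {
        failedCount.incrementAndGet();
    }

    // The task counts as failed once either counter reaches its preset limit.
    boolean taskFailed() {
        return consumedCount.get() >= consumedLimit
                || failedCount.get() >= failedLimit;
    }
}
```

In this sketch the consumer never inspects the limits itself; it only reports each outcome back to the manager, which keeps the decision about task failure on the manager side as the text describes.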
Therefore, the method for plug-in management of dirty data in Flink-based real-time tasks provided by the embodiment of the invention adopts the producer-consumer design pattern and uses a message queue as a buffer: the dirty data manager collects the dirty data and the dirty data consumer processes it. The dirty data consumer only needs to handle the consumption of messages and does not need to care whether the task fails, so the management of dirty data is decoupled from its processing.
In addition, the specific processing of dirty data is implemented through object inheritance, with each subclass providing its own implementation.
The invention thus lets a task decide whether to stop according to how dirty data processing is going, and applies plug-in management to the dirty data in real-time tasks, which effectively improves fault tolerance during task execution.
The method for plug-in management of dirty data in Flink-based real-time tasks provided by the invention is further explained below, taking as an example a dirty-log plug-in that prints dirty data to a log:
Referring to FIG. 2, Dirty-manager is the dirty data manager, implemented concretely as DirtyManager. It records the dirty data and the exception cause and stores them in a message queue, and it also manages the start-up of the dirty data plug-in and maintains the dirty data message queue. The execution flow of the dirty data manager includes:
Step 1, DirtyManager, the dirty data manager, instantiates the corresponding dirty data plug-in object according to the specified configuration information, for example the following configuration of the dirty-log plug-in, which prints dirty data to a log:
{"type":"log","print.rate":"100","dirty.limit":"1000","error.limit":"1000","log.properties":"/data/log.properties"}
The configuration information contains both the type of the dirty data plug-in and the parameters specific to that type.
Step 2, the dirty-log plug-in is instantiated and its consume method is executed: it polls the dirty data in the dirty-queue message queue and, according to the print rate in the configuration ("print.rate":"100"), prints one dirty record to the log for every 100 dirty records.
Step 3, when DirtyManager collects dirty data and exception information, it generates a dirty-entry object and stores it in the dirty-queue message queue. If this is the first piece of dirty data, the consumer thread pool is started, and the consumer thread asynchronously polls the queue and consumes the queued data.
Step 4, if the dirty data is delivered to the dirty data plug-in and consumed successfully, that is, processing it raises no exception, the total number of consumed dirty data records, totalCount, is increased by 1. Otherwise, if delivery of the dirty data times out or the plug-in throws an exception while consuming it, the exception is printed to the log and the number of dirty data records that failed to be consumed, errorCount, is increased by 1.
Step 5, if totalCount reaches the configured dirty.limit value, or errorCount reaches the configured error.limit value, the task fails and is not retried.
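As an illustration of how the configuration from step 1 might feed the limit check in step 5, the following Java sketch reads the flat key/value configuration and exposes the two limits. The class DirtyConfig and its methods are hypothetical; the regular-expression parsing is a simplification that assumes a flat JSON object whose values are all strings, as in the example above, and a real implementation would more likely use a JSON library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical holder for the dirty-log plug-in settings shown in step 1.
final class DirtyConfig {
    // Matches "key":"value" pairs in a flat JSON object with string values only.
    private static final Pattern PAIR =
            Pattern.compile("\"([^\"]+)\"\\s*:\\s*\"([^\"]*)\"");

    final String type;
    final long printRate;
    final long dirtyLimit;
    final long errorLimit;
    final String logProperties;

    private DirtyConfig(Map<String, String> raw) {
        this.type = raw.get("type");
        // Long.parseLong throws NumberFormatException if a key is missing.
        this.printRate = Long.parseLong(raw.get("print.rate"));
        this.dirtyLimit = Long.parseLong(raw.get("dirty.limit"));
        this.errorLimit = Long.parseLong(raw.get("error.limit"));
        this.logProperties = raw.get("log.properties");
    }

    // Collects every "key":"value" pair, then builds the typed configuration.
    static DirtyConfig parse(String json) {
        Map<String, String> raw = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(json);
        while (m.find()) {
            raw.put(m.group(1), m.group(2));
        }
        return new DirtyConfig(raw);
    }

    // Step 5: the task fails when either configured limit is reached.
    boolean limitReached(long totalCount, long errorCount) {
        return totalCount >= dirtyLimit || errorCount >= errorLimit;
    }
}
```

With the example configuration, limitReached stays false while totalCount and errorCount are both below 1000, and becomes true as soon as either of them reaches 1000.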
Referring to FIG. 3, DirtyConsumer is the consumer role in a dirty data plug-in and provides the functions of consuming the dirty data queue and processing the dirty data. AbstractDirtyConsumer serves as the abstract implementation: it implements the Runnable interface, its run method executes the abstract consume method, and the consume method is implemented by subclasses, which is how plug-in management of dirty data is achieved. The DirtyConsumer only needs to handle the consumption of messages and does not need to care whether the task fails, so the management of dirty data is decoupled from its processing. The execution flow of the DirtyConsumer includes:
Step 1, a subclass of AbstractDirtyConsumer is instantiated, and when the first piece of dirty data is generated, the consume thread pool is started and a consume thread is activated;
Step 2, the consume thread executes the consume method and polls the dirty data queue to consume the dirty data in it. Different implementations of the method in different subclasses produce different dirty data processing effects.
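The consumer flow above can be sketched in Java as an abstract class whose run method polls the queue and delegates each record to an abstract consume method. This is a sketch under stated assumptions rather than the actual implementation: the class names AbstractDirtyConsumer and LogDirtyConsumer, the stop method, the 50 ms poll timeout, and the use of plain strings as queue entries are all hypothetical. The log subclass collects the lines it would print in a list instead of calling a real logger, so that its behaviour is easy to inspect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Abstract consumer: run() polls the queue, subclasses decide what consume() does.
abstract class AbstractDirtyConsumer implements Runnable {
    protected final BlockingQueue<String> queue;
    final AtomicLong totalCount = new AtomicLong(); // successfully consumed
    final AtomicLong errorCount = new AtomicLong(); // consume() threw
    private volatile boolean running = true;

    AbstractDirtyConsumer(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    // Plug-in specific handling of one dirty record.
    protected abstract void consume(String dirty) throws Exception;

    // Lets the loop exit once the queue has been drained.
    void stop() {
        running = false;
    }

    @Override
    public void run() {
        while (running || !queue.isEmpty()) {
            try {
                // Assumed poll timeout; returns null if nothing arrived in time.
                String dirty = queue.poll(50, TimeUnit.MILLISECONDS);
                if (dirty == null) {
                    continue;
                }
                try {
                    consume(dirty);
                    totalCount.incrementAndGet();
                } catch (Exception e) {
                    errorCount.incrementAndGet();
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
        }
    }
}

// Hypothetical log plug-in: keeps one line for every printRate-th record.
final class LogDirtyConsumer extends AbstractDirtyConsumer {
    final List<String> printed = new ArrayList<>();
    private final long printRate;
    private long seen;

    LogDirtyConsumer(BlockingQueue<String> queue, long printRate) {
        super(queue);
        this.printRate = printRate;
    }

    @Override
    protected void consume(String dirty) {
        if (++seen % printRate == 0) {
            printed.add(dirty); // a real plug-in would write to its logger here
        }
    }
}
```

In practice an instance would be submitted to a thread pool, for example through an ExecutorService, when the first piece of dirty data arrives. A different plug-in would only need to supply another subclass with its own consume method.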
FIG. 4 shows a schematic structural diagram of a system for plug-in management of dirty data in Flink-based real-time tasks according to an embodiment of the present invention. The system applies the method described above, so only its structure is briefly described below; for other details, refer to the description of the method. Referring to FIG. 4, the system includes: a dirty data manager and a dirty data consumer, each instantiated according to dirty data configuration information;
the dirty data manager is used for collecting the dirty data and exception causes generated by a task and storing them in the message queue, and for starting the dirty data consumer in a dirty data plug-in object;
the dirty data consumer is used for polling the message queue and consuming the dirty data in it;
the dirty data manager is further used for adding 1 to the dirty data consumption count each time the dirty data consumer successfully consumes dirty data from the message queue, and determining that the task has failed when that count reaches the preset dirty data consumption limit; and for adding 1 to the failed-consumption count each time the dirty data consumer fails to consume dirty data from the message queue, and determining that the task has failed when that count reaches the preset failed-consumption limit.
As an optional implementation of the embodiment of the present invention, the dirty data manager is further configured to identify the first piece of dirty data in the message queue before starting the dirty data consumer in the dirty data plug-in object.
As an optional implementation of the embodiment of the present invention, the dirty data manager is further configured to initialize the message queue before collecting the dirty data and exception causes generated by the task and storing them in the message queue.
As an optional implementation of the embodiment of the present invention, the dirty data consumer polls the message queue and consumes the dirty data in it as follows: the dirty data consumer is specifically used for consuming the dirty data in the message queue by polling performed in a dirty data consumer subclass.
Therefore, the system for plug-in management of dirty data in Flink-based real-time tasks provided by the embodiment of the invention adopts the producer-consumer design pattern and uses a message queue as a buffer: the dirty data manager collects the dirty data and the dirty data consumer processes it. The dirty data consumer only needs to handle the consumption of messages and does not need to care whether the task fails, so the management of dirty data is decoupled from its processing.
In addition, the specific processing of dirty data is implemented through object inheritance, with each subclass providing its own implementation.
The invention thus lets a task decide whether to stop according to how dirty data processing is going, and applies plug-in management to the dirty data in real-time tasks, which effectively improves fault tolerance during task execution.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.