Power network safety data cleaning system and method based on machine learning
1. A power network data cleaning method based on machine learning is characterized by comprising the following steps:
step one, a data acquisition module (1) acquires power network data and stores the power network data into an original power network data storage unit (41);
step two, the data processing module (2) extracts power network data from the original power network data storage unit (41), the data processing module (2) analyzes the power network data to acquire operation event information, and the data processing module (2) stores the acquired operation event information into the operation event information storage unit (42);
step three, the data analysis module (3) analyzes all the operation event information through a machine learning clustering algorithm, extracts abnormal data and outlier data from the operation event information, and stores the abnormal data, the outlier data and the normal data separately to finish data cleaning.
2. The machine learning-based power network data cleaning method according to claim 1, wherein the specific process by which the data processing module (2) analyzes the power network data to obtain the operation event information in step two is as follows: the data processing module (2) first extracts a regular expression from a rule storage unit (43) and parses the power network data acquired by the data acquisition module (1) according to the regular expression to obtain field information of the power network data; the data processing module (2) then performs data feature matching on the field information, classifies the operation event information according to the matching results, and stores all power network data belonging to the same operation event information category in the operation event information storage unit (42) as one type of operation event information data.
3. The machine learning-based power network data cleaning method according to claim 1, wherein the machine learning clustering algorithm in the third step is specifically a K-means algorithm, and the specific process of analyzing all operation event information by the data analysis module (3) through the K-means algorithm is as follows: the data analysis module (3) establishes a data analysis model based on a K-means algorithm, the data analysis module (3) extracts all operation event information in a database (4) and carries out feature extraction processing on the operation event information, a feature value sample set is established according to feature values of all operation event information obtained through feature extraction, a plurality of feature values are selected from the feature value sample set to establish a training sample set, the data analysis module (3) trains the data analysis model according to the training sample set, the feature value sample set is analyzed according to the trained data analysis model, feature value distribution conditions of all operation event information are output according to analysis results, and the data analysis module (3) obtains abnormal data and outlier data in the sample set according to the feature value distribution conditions.
4. The machine learning-based power network data cleaning method according to claim 3, wherein after the data analysis module (3) acquires the abnormal data and the outlier data, the data analysis module (3) first judges the cause of the abnormal data and the outlier data; if the cause is judged to be that the power network is under attack, the data analysis module (3) labels the abnormal data and the outlier data and stores them into the abnormality monitoring data unit (44); if the cause is judged to be a detection error, the data analysis module (3) tests each abnormal data and each outlier data for being an abnormal value through the Grubbs method; if the abnormal data or the outlier data is judged to be an abnormal value, the data analysis module (3) removes it, fills the missing value of the operation event information corresponding to the removed abnormal data or outlier data through the K-means algorithm, and stores the data into the database (4) again after filling is completed, so that data cleaning is completed; if the abnormal data or the outlier data is judged not to be an abnormal value, the data analysis module (3) labels the abnormal data or the outlier data and stores it into the database (4) again.
5. The machine learning-based power network data cleaning method according to claim 4, wherein when the abnormal data or the outlier data is determined to be an abnormal value, the data analysis module (3) further extracts the operation event information corresponding to the abnormal data or the outlier data from the operation event information storage unit (42), and the data analysis module (3) performs clearing and missing value filling processing on data associated with the abnormal data or the outlier data in the operation event information according to an association rule in the rule storage unit (43).
6. The machine learning-based power network data cleaning method according to claim 1, wherein in the second step, before the data processing module (2) stores all the acquired operation event information in the operation event information storage unit (42), the data processing module (2) further performs time stamping processing on each operation event information.
7. The machine learning-based power network data cleaning method according to claim 6, wherein after the data processing module (2) performs time stamping processing on each operation event information, the data processing module (2) further establishes a corresponding database (4) index for each operation event information.
8. A power network data cleaning system based on machine learning, characterized by comprising a data acquisition module (1), a data processing module (2) and a data analysis module (3), wherein the data acquisition module (1) is connected with the data processing module (2) and is used for acquiring power network data, the data analysis module (3) is connected with the data processing module (2), the data processing module (2) is used for analyzing and classifying the power network data, and the data analysis module (3) is used for extracting the abnormal data and the outlier data in the operation event information.
9. The machine learning-based power network data cleaning system according to claim 8, further comprising a database (4), wherein the database (4) comprises an original power network data storage unit (41), an operation event information storage unit (42), a rule storage unit (43) and an abnormality monitoring data unit (44), the original power network data storage unit (41) is connected with the data acquisition module (1) and the data processing module (2), the operation event information storage unit (42) is connected with the data processing module (2) and the data analysis module (3), the rule storage unit (43) is connected with the data processing module (2), and the abnormality monitoring data unit (44) is connected with the data analysis module (3).
Background
With the continuous development of power grids, more and more security systems are applied to protect the operation safety of power grids. Measured power grid data often suffer from problems such as missing data, abnormal data or erroneous data, so their application reliability is low. If the measured power grid data are used directly when responding to a security threat to the power grid, the effectiveness of the response measures formulated from those data is impaired.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a power network safety data cleaning system and method based on machine learning.
The purpose of the invention is realized by the following technical scheme:
a power network data cleaning method based on machine learning comprises the following steps:
step one, a data acquisition module acquires power network data and stores the power network data into an original power network data storage unit;
step two, the data processing module extracts power network data from the original power network data storage unit, analyzes the power network data to acquire operation event information, and stores the acquired operation event information into the operation event information storage unit;
step three, the data analysis module analyzes all the operation event information through a machine learning clustering algorithm, extracts abnormal data and outlier data from the operation event information, and stores the abnormal data, the outlier data and the normal data separately to finish data cleaning.
The power network data are classified and the operation event information is completed from the power network data, which makes it easier to judge the safety of the power network afterwards. When abnormal data and outlier data are analyzed, only the power network data within the same type of operation event information are compared, which reduces the amount of computation and improves analysis efficiency.
Further, the specific process by which the data processing module analyzes the power network data to obtain the operation event information in step two is as follows: the data processing module first extracts a regular expression from a rule storage unit and parses the power network data acquired by the data acquisition module according to the regular expression to obtain field information of the power network data; the data processing module then performs data feature matching on the field information, classifies the operation event information according to the matching result, and stores all power network data belonging to the same operation event information category in the operation event information storage unit as one type of operation event information data.
The regular expression parses the power network data into separate fields, so that data feature matching can be carried out according to the type of each field, which improves classification accuracy.
Further, the machine learning clustering algorithm in the third step is specifically a K-means algorithm, and the specific process of analyzing all the operation event information by the data analysis module through the K-means algorithm is as follows: the data analysis module establishes a data analysis model based on a K-means algorithm, extracts all operation event information in a database and carries out feature extraction processing on the operation event information, a feature value sample set is established according to feature values of all operation event information obtained through feature extraction, a plurality of feature values are selected from the feature value sample set to establish a training sample set, the data analysis module trains the data analysis model according to the training sample set, the feature value sample set is analyzed according to the trained data analysis model, the feature value distribution condition of all operation event information is output according to the analysis result, and the data analysis module acquires abnormal data and outlier data in the sample set according to the feature value distribution condition.
The K-means algorithm can effectively extract abnormal data and outlier data from mass data, and the recognition accuracy is high.
Further, after the data analysis module acquires the abnormal data and the outlier data, the data analysis module first judges the cause of the abnormal data and the outlier data. If the cause is judged to be that the power network is under attack, the data analysis module labels the abnormal data and the outlier data and stores them into the abnormality monitoring data unit. If the cause is judged to be a detection error, the data analysis module tests each abnormal data and each outlier data for being an abnormal value through the Grubbs method; if the abnormal data or the outlier data is judged to be an abnormal value, the data analysis module removes it, fills the missing value of the operation event information corresponding to the removed abnormal data or outlier data through the K-means algorithm, and stores the data in the database again after filling is completed to finish data cleaning; if the abnormal data or the outlier data is judged not to be an abnormal value, the data analysis module labels it and stores it into the database again.
If the abnormal data or outlier data stem from a detection error and are judged to be abnormal values, they are removed and filled with normal values to repair the error; if they are not abnormal values, they only need to be labeled. In either case they do not need to be monitored. If the abnormal data and outlier data are generated because the power grid is under attack, the corresponding operation events need to be monitored in real time, so that a repeated attack can be prevented or responded to promptly. Judging the cause of the abnormal data and outlier data therefore improves monitoring efficiency and reduces monitoring workload.
Further, when the abnormal data or the outlier data is judged to be an abnormal value, the data analysis module further extracts the operation event information corresponding to the abnormal data or the outlier data from the operation event information storage unit, and the data analysis module removes and fills missing values of data related to the abnormal data or the outlier data in the operation event information according to the association rule in the rule storage unit.
And error correction processing is also carried out on the associated data, so that the accuracy of the data in the operation event information is ensured.
Furthermore, in the second step, before the data processing module stores all the acquired operation event information in the operation event information storage unit, the data processing module also adds a timestamp to each operation event information.
Furthermore, after the data processing module adds a timestamp to each running event information, the data processing module also establishes a corresponding database index for each running event information.
After the database index is established for each operation event information, the required operation event information can be quickly and accurately acquired when the operation event information needs to be consulted subsequently.
A power network data cleaning system based on machine learning comprises a data acquisition module, a data processing module and a data analysis module, wherein the data acquisition module is connected with the data processing module and is used for acquiring power network data, the data analysis module is connected with the data processing module, the data processing module is used for analyzing and classifying the power network data, and the data analysis module is used for extracting abnormal data and outlier data in the operation event information.
Further, the power network data cleaning system based on machine learning further comprises a database, wherein the database comprises an original power network data storage unit, an operation event information storage unit, a rule storage unit and an abnormality monitoring data unit, the original power network data storage unit is connected with both the data acquisition module and the data processing module, the operation event information storage unit is connected with both the data processing module and the data analysis module, the rule storage unit is connected with the data processing module, and the abnormality monitoring data unit is connected with the data analysis module.
The invention has the beneficial effects that:
The power network data are analyzed and classified to obtain the operation event information generated during power grid operation, and data cleaning is carried out with the operation event information as the unit, so that the amount of computation is low, the cleaning efficiency is improved, and the application reliability of the cleaned power network data is improved. When the abnormal data and outlier data in the operation event information are extracted, their causes are judged: abnormal data and outlier data caused by measurement errors are labeled or corrected, and only the abnormal data and outlier data generated while the power grid is under attack are subjected to data cleaning, so that the data cleaning efficiency is improved.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of an embodiment of the present invention;
wherein: 1. data acquisition module; 2. data processing module; 3. data analysis module; 4. database; 41. original power network data storage unit; 42. operation event information storage unit; 43. rule storage unit; 44. abnormality monitoring data unit.
Detailed Description
The invention is further described below with reference to the figures and examples.
Example:
a method for cleaning power network data based on machine learning, as shown in fig. 1, includes the following steps:
step one, a data acquisition module 1 acquires power network data, and the data acquisition module 1 stores the power network data into an original power network data storage unit 41;
step two, the data processing module 2 extracts the power network data from the original power network data storage unit 41, the data processing module 2 analyzes the power network data to obtain operation event information, and the data processing module 2 stores the obtained operation event information into the operation event information storage unit 42;
and step three, the data analysis module 3 analyzes all the operation event information through a machine learning algorithm, abnormal data and outlier data in the operation event information are extracted, and the data analysis module 3 respectively stores the abnormal data, the outlier data and the normal data to finish data cleaning.
The specific process by which the data processing module 2 analyzes the power network data to obtain the operation event information in step two is as follows: the data processing module 2 first extracts the regular expression from the rule storage unit 43 and parses the power network data acquired by the data acquisition module 1 according to the regular expression to obtain field information of the power network data; the data processing module 2 then performs data feature matching on the field information, classifies the operation event information according to the matching result, and stores all the power network data belonging to the same operation event information category into the operation event information storage unit 42 as one type of operation event information data.
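The parsing and classification step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the regular expression `ACCESS_LOG_RULE` stands in for a rule held in the rule storage unit 43 and targets an Nginx-style access log line (the data source named later in this embodiment), the status-class rule in `classify` is a hypothetical feature-matching rule, and the sample lines are made up.

```python
import re
from collections import defaultdict

# Hypothetical rule standing in for a regular expression from the rule storage unit.
# It matches the leading part of an Nginx "combined"-style access log line.
ACCESS_LOG_RULE = re.compile(
    r'(?P<remote_addr>\S+) \S+ \S+ \[(?P<time_local>[^\]]+)\] '
    r'"(?P<method>[A-Z]+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<body_bytes>\d+)'
)


def parse_fields(line):
    """Return the named fields of one log line, or None if the rule does not match."""
    match = ACCESS_LOG_RULE.match(line)
    return match.groupdict() if match else None


def classify(fields):
    """Assumed feature-matching rule: bucket events by HTTP status class."""
    status = int(fields["status"])
    if status >= 500:
        return "server_error"
    if status >= 400:
        return "client_error"
    return "normal_request"


def group_by_category(lines):
    """Parse every line and group the parsed records by operation event category."""
    categories = defaultdict(list)
    for line in lines:
        fields = parse_fields(line)
        if fields is not None:  # lines the rule cannot parse are skipped
            categories[classify(fields)].append(fields)
    return dict(categories)


# Made-up sample input for illustration only.
sample_lines = [
    '10.0.0.5 - - [12/Mar/2021:10:15:32 +0800] "GET /api/meter/17 HTTP/1.1" 200 512',
    '10.0.0.9 - - [12/Mar/2021:10:15:40 +0800] "POST /api/login HTTP/1.1" 401 64',
    'not a log line',
]
events = group_by_category(sample_lines)
```

Each key of `events` corresponds to one operation event information category, and its list holds the parsed records that would be stored together in the operation event information storage unit 42.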
The machine learning clustering algorithm in step three is specifically the K-means algorithm, and the data analysis module 3 analyzes all the operation event information through the K-means algorithm as follows. The data analysis module 3 establishes a data analysis model based on the K-means algorithm. In this embodiment, the data analysis module 3 extracts the Nginx access log data from the database 4 and performs feature extraction on it, obtaining feature values such as request time, request address, request header, request method, request body and response status code. A feature value sample set is constructed from all the feature values obtained by feature extraction, and the request address feature values and request header feature values are selected from the feature value sample set to construct a training sample set. The data analysis module 3 trains the data analysis model on the training sample set, analyzes the feature value sample set with the trained data analysis model, and outputs the feature value distribution of all the operation event information according to the analysis result. The data analysis module 3 then obtains the abnormal data and outlier data in the Nginx access log data according to the feature value distribution.
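The clustering step can be sketched as follows. The text does not specify how feature values are encoded numerically or how the feature value distribution is turned into an outlier decision, so both are assumptions here: each event is reduced to a hypothetical two-dimensional numeric feature vector, and a point is flagged when its distance to its own centroid exceeds `factor` times the median distance within that cluster. The K-means loop is plain Lloyd iteration seeded with the first `k` points so that the result is deterministic.

```python
import math
import statistics


def assign(points, centroids):
    """Assignment step: index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's K-means, seeded with the first k points (deterministic)."""
    centroids = [tuple(points[i]) for i in range(k)]
    for _ in range(iterations):
        assignments = assign(points, centroids)
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assign(points, centroids)


def flag_outliers(points, centroids, assignments, factor=3.0):
    """Assumed rule: flag points much farther from their centroid than is typical."""
    flagged = []
    for c in range(len(centroids)):
        indices = [i for i, a in enumerate(assignments) if a == c]
        if not indices:
            continue
        distances = {i: math.dist(points[i], centroids[c]) for i in indices}
        cutoff = factor * statistics.median(distances.values())
        flagged.extend(i for i in indices if distances[i] > cutoff)
    return sorted(flagged)


# Made-up feature vectors: two tight groups plus one point far from both.
features = [
    (1.0, 1.0), (8.0, 8.0), (1.1, 0.9), (8.1, 7.9), (0.9, 1.1), (7.9, 8.1),
    (1.0, 1.2), (8.0, 8.2), (1.2, 1.0), (8.2, 8.0), (8.0, 14.0),
]
centroids, assignments = kmeans(features, k=2)
outlier_indices = flag_outliers(features, centroids, assignments)
```

With this toy data only the last point is flagged. The median-based cutoff is one simple choice; a production system would need to tune both `k` and the cutoff rule against real operation event data.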
After the data analysis module 3 acquires the abnormal data and the outlier data, the data analysis module 3 first judges the cause of the abnormal data and the outlier data. To do so, the data analysis module 3 identifies the acquisition nodes corresponding to the abnormal data and the outlier data, retrieves from the database 4 all the power network data of those acquisition nodes over a recent period of time, and obtains the fluctuation curve and the fluctuation amplitude of the power network data over that period. If the cause is judged to be that the power network is under attack, the data analysis module 3 labels the abnormal data and the outlier data and stores the labeled data into the abnormality monitoring data unit 44. If the cause is judged to be a detection error, the data analysis module 3 tests each abnormal data and each outlier data for being an abnormal value through the Grubbs method. If the abnormal data or the outlier data is judged to be an abnormal value, the data analysis module 3 removes it, uses the K-means algorithm to select from the operation event information storage unit 42 the value most recently judged to be normal data, fills the missing value of the operation event information corresponding to the removed abnormal data or outlier data with that value, and stores the operation event information into the database 4 again after filling is completed to complete the data cleaning. If the abnormal data or the outlier data is judged not to be an abnormal value, the data analysis module 3 labels it and stores the labeled data into the database 4 again.
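The branching and the fill step described above can be sketched as follows. How the cause is judged from the fluctuation curve is not specified in the text, so the sketch takes the cause as an input. The function names, the destination strings and the readings are hypothetical, and the fill is reduced to carrying forward the most recent reading kept as normal.

```python
def route(cause, grubbs_rejects):
    """Decide what happens to one flagged record, following the three branches above."""
    if cause == "attack":
        return "label_and_store_in_monitoring_unit"
    if grubbs_rejects:
        # Detection error confirmed as an abnormal value: remove, fill, store again.
        return "remove_and_fill"
    # Detection error but not an abnormal value: label and store again.
    return "label_and_store_in_database"


def fill_with_last_normal(values, rejected):
    """Replace rejected readings with the most recent reading kept as normal.

    `rejected` is a set of indices judged to be abnormal values. A rejected
    reading with no earlier normal reading is left as None.
    """
    cleaned = []
    last_normal = None
    for index, value in enumerate(values):
        if index in rejected:
            cleaned.append(last_normal)
        else:
            cleaned.append(value)
            last_normal = value
    return cleaned


# Made-up readings; indices 2 and 4 are assumed to have been rejected.
readings = [220.1, 219.8, 950.0, 220.3, 0.0, 220.0]
cleaned = fill_with_last_normal(readings, rejected={2, 4})
```

Here `cleaned` keeps the four normal readings and repeats the preceding normal reading in the two rejected positions.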
Because the standard deviation of the abnormal data and the outlier data is unknown, the Grubbs method is adopted to judge abnormal values. The Grubbs method works at a set confidence level and also uses the mean and the standard deviation of the data, so its judgment accuracy is high. The abnormal value judgment formula of the Grubbs method is as follows:
$$T = \frac{|a - b|}{S}$$
wherein: T is the abnormality judgment value, a is the abnormal data or outlier data, b is the average value of the data in the operation event information corresponding to the abnormal data or outlier data, and S is the standard deviation of the data in the operation event information corresponding to the abnormal data or outlier data.
After the T value is obtained, the Grubbs critical value table is retrieved from the rule storage unit 43 and the T value is compared with the corresponding entry in the table; if the calculated T value is larger than the corresponding table entry, the abnormal data or the outlier data is determined to be an abnormal value.
When the abnormal data or the outlier data is determined to be an abnormal value, the data analysis module 3 further extracts the operation event information corresponding to the abnormal data or the outlier data from the operation event information storage unit 42, and performs removal and missing value filling on the data associated with the abnormal data or the outlier data in the operation event information according to the association rule in the rule storage unit 43.
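The Grubbs check can be sketched as follows, using T, a, b and S as defined above. The text does not state the confidence level or the table contents, so the table here is an assumption: `GRUBBS_CRITICAL_05` holds commonly tabulated one-sided critical values at significance level 0.05 for sample sizes 3 to 10 and should be checked against the table actually kept in the rule storage unit 43. S is taken as the sample standard deviation, and the readings are made up.

```python
import statistics

# Assumed table: commonly tabulated one-sided Grubbs critical values at
# significance level 0.05, keyed by sample size n. Verify before real use.
GRUBBS_CRITICAL_05 = {
    3: 1.153, 4: 1.463, 5: 1.672, 6: 1.822,
    7: 1.938, 8: 2.032, 9: 2.110, 10: 2.176,
}


def grubbs_statistic(a, data):
    """T = |a - b| / S, with b the mean and S the sample standard deviation of data."""
    b = statistics.mean(data)
    s = statistics.stdev(data)
    if s == 0:
        return 0.0  # all values identical, so nothing can stand out
    return abs(a - b) / s


def is_abnormal_value(a, data, table=GRUBBS_CRITICAL_05):
    """True when T exceeds the table entry for the sample size of data."""
    return grubbs_statistic(a, data) > table[len(data)]


# Made-up readings from one operation event category; 15.0 is the suspect value.
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 15.0]
suspect_is_abnormal = is_abnormal_value(15.0, readings)
ordinary_is_abnormal = is_abnormal_value(10.2, readings)
```

For these seven readings the suspect value 15.0 gives T of roughly 2.26, above the n = 7 entry of 1.938, so it is judged an abnormal value, while 10.2 is not.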
In the second step, before the data processing module 2 stores all the acquired operation event information in the operation event information storage unit 42, the data processing module 2 further performs timestamp processing on each operation event information.
After the data processing module 2 adds the timestamp to each operation event information, the data processing module 2 also establishes a corresponding database 4 index for each operation event information.
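Time stamping and indexing can be sketched as follows, using an in-memory SQLite database as a stand-in for the database 4. The table name `operation_events`, its columns and the index name are hypothetical, and the stored events are made up.

```python
import sqlite3
from datetime import datetime, timezone


def open_event_store():
    """Create an in-memory store with an index over category and timestamp."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE operation_events ("
        " id INTEGER PRIMARY KEY,"
        " category TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " recorded_at TEXT NOT NULL)"
    )
    # The index lets later lookups by category and time avoid a full table scan.
    conn.execute(
        "CREATE INDEX idx_events_category_time"
        " ON operation_events (category, recorded_at)"
    )
    return conn


def store_event(conn, category, payload, recorded_at=None):
    """Stamp the event (current UTC time unless one is given) and store it."""
    stamp = (recorded_at or datetime.now(timezone.utc)).isoformat()
    conn.execute(
        "INSERT INTO operation_events (category, payload, recorded_at)"
        " VALUES (?, ?, ?)",
        (category, payload, stamp),
    )
    return stamp


conn = open_event_store()
store_event(conn, "client_error", "POST /api/login 401",
            datetime(2021, 3, 12, 2, 15, 40, tzinfo=timezone.utc))
store_event(conn, "normal_request", "GET /api/meter/17 200",
            datetime(2021, 3, 12, 2, 15, 32, tzinfo=timezone.utc))
rows = conn.execute(
    "SELECT payload, recorded_at FROM operation_events"
    " WHERE category = ? ORDER BY recorded_at",
    ("client_error",),
).fetchall()
```

Storing the timestamp as an ISO 8601 string in UTC keeps lexical order equal to chronological order, which is what makes the composite index useful for time-range queries within one category.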
A power network data cleaning system based on machine learning, as shown in fig. 2, includes a data acquisition module 1, a data processing module 2, a data analysis module 3 and a database 4, wherein the data acquisition module 1 is connected with the data processing module 2 and is used for acquiring power network data, the data analysis module 3 is connected with the data processing module 2, the data processing module 2 is used for analyzing and classifying the power network data, and the data analysis module 3 is used for extracting abnormal data and outlier data in the operation event information.
The database 4 comprises an original power network data storage unit 41, an operation event information storage unit 42, a rule storage unit 43 and an abnormality monitoring data unit 44, wherein the original power network data storage unit 41 is simultaneously connected with the data acquisition module 1 and the data processing module 2, the operation event information storage unit 42 is simultaneously connected with the data processing module 2 and the data analysis module 3, the rule storage unit 43 is connected with the data processing module 2, and the abnormality monitoring data unit 44 is connected with the data analysis module 3.
The power grid acquisition nodes corresponding to the abnormal data or the outlier data stored in the abnormality monitoring data unit 44 are monitored in real time.
The above-described embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention in any way, and other variations and modifications may be made without departing from the spirit of the invention as set forth in the claims.