Flow baseline model construction method for transformer substation core industrial control service
1. A flow baseline model construction method for the core industrial control services of a transformer substation, comprising the following steps:
S1, parsing the data streams of the transformer substation to obtain basic service features, and calculating, from the obtained basic service features, high-order features that characterize service flow interaction;
S2, clustering similar services;
S3, generating the final original input feature matrix from the obtained basic features and high-order features;
S4, constructing an autoencoder based on a convolutional neural network, and learning the correlations within the original input feature matrix obtained in step S3, so as to obtain high-dimensional intermediate-layer features;
S5, reducing the dimensionality of the high-dimensional intermediate-layer features obtained in step S4, and calculating low-dimensional key features that characterize the service interaction process;
S6, calculating, for each dimension, the normal distribution parameters over a plurality of periods;
S7, constructing, for each service, a joint Gaussian distribution function over the multi-dimensional key features, thereby obtaining the final service baseline model.
2. The flow baseline model construction method for the core industrial control service of the transformer substation according to claim 1, wherein step S1 of parsing the data streams of the transformer substation, obtaining basic service features, and calculating high-order features characterizing service flow interaction from the obtained basic service features specifically comprises feature extraction consisting of creating flow features, updating flow features and outputting flow features; the basic features in the original data stream are read and parsed in a loop, and flow information is initialized according to the type of the data stream and added to a feature stack; finally, the high-order features are calculated from the basic features, and the features of the flows in the stack are updated and output;
creating flow features: flow features are created according to the result of matching the basic features of the original flow against the feature stack; the basic flow features are the configuration features of the physical layer, link layer, network layer and transport layer of the data stream; the source IP, source port, destination IP, destination port and protocol type among the basic features form a five-tuple; if the five-tuple of the current flow matches no entry in the feature stack, a flow feature is created: flow information is initialized according to the direction of the first data flow, including packet size, packet header length, flow identifier and initial window size; the initialized information is pushed onto the stack to complete the creation process;
updating flow features: if the five-tuple of the current flow matches an entry in the feature stack and the flow feature output conditions are not met, the flow feature is updated; the update comprises a basic feature update and a high-order feature update; the basic feature update comprises cumulative updates of the packet size, the packet header length, the label count and the packet count, and is realized by accumulating counts; the high-order features are then calculated from the basic features and include flow active/idle time, flow rate, byte rate, flow inter-arrival time, payload statistics, and sub-flow statistics such as the number of sub-flows; the high-order feature update establishes the interaction relationship of the data flow through the sending sequence and the receiving sequence in the basic features, and calculates the feature changes of the forward link and the reverse link separately in combination with timestamps;
outputting flow features: the flow features in the stack are output when any of the following conditions holds:
1) the current flow has timed out, as judged from the basic feature TTL;
2) the current flow has finished, as judged from the FIN flag in the basic features;
3) traversal of the current pcap file has finished;
wherein, under conditions 1) and 2), the feature stack update for the current data flow is completed and the corresponding flow feature is output; condition 3) indicates that the current pcap file has ended, and all flow features in the stack are output.
3. The flow baseline model construction method for the core industrial control service of the transformer substation according to claim 2, wherein the step S2 is performed for similar service clustering, specifically clustering according to quintuple extracted from a service data stream; the clustering rules are as follows:
rule one: service types are divided jointly according to the fixed destination port and the protocol type;
rule two: service flows that have the same source device but different source IPs, and the same destination IP, destination port and protocol type, are classified into the same type;
rule three: service flows that have the same source IP and the same destination device but different destination IPs, and the same destination port and protocol type, are classified into the same type.
4. The flow baseline model construction method for the core industrial control service of the transformer substation according to claim 3, wherein step S3 of generating the final original input feature matrix from the obtained basic features and high-order features is specifically as follows: the extracted basic features and high-order features include a packet length feature, a packet number feature, a packet time feature, a packet content feature and a configuration feature; the packet length feature represents the amount of information carried by the service; the packet number feature represents the service interaction mode; the packet time feature represents the frequency of service interaction; the packet content feature represents the content flags of the TCP/IP-based service; the configuration feature represents the configuration information of the network device; the forward and reverse features form a service interaction matrix that describes the overall characteristics of the service; the packet length feature, the packet number feature and the packet time feature each contain four high-order features, namely the maximum, minimum, mean and variance; the packet content feature comprises the count values of the different TCP/IP content flags; and the forward and reverse features in the service interaction matrix are grouped and ordered by attribute into packet length, packet number, packet time and packet content features to generate the final original feature matrix input to the network.
5. The flow baseline model construction method for the core industrial control service of the transformer substation according to claim 4, wherein the construction of the convolutional neural network-based autoencoder in step S4 specifically comprises the following steps:
the input of the autoencoder is a 2 × N original feature matrix $x$, where $N$ is the unidirectional feature dimension of a service flow; the autoencoder is based on two-layer convolutional neural networks; the structures of the encoder and the decoder are mirror-symmetric; the first convolutional layer function of the encoder is $f_1(\cdot)$ and the second convolutional layer function is $f_2(\cdot)$; the output of the encoder is the service intermediate-layer feature vector $h$, and the encoder establishes the mapping between the intermediate-layer feature vector and the input feature matrix as $h = f_2(f_1(x))$; the decoder is also a two-layer convolutional neural network, in which the first decoder layer decodes the result of the second encoder layer with decoding function $g_2(\cdot)$, and the second decoder layer decodes the result of the first encoder layer with decoding function $g_1(\cdot)$; the final decoded output is $\hat{x}$, and the decoder establishes the mapping between the output feature matrix and the intermediate-layer feature vector as $\hat{x} = g_1(g_2(h))$; finally, the intermediate-layer feature vector $h$ is used to represent the correlations among the original features;
the autoencoder network is trained by minimizing the mean absolute error as the loss function:
$$L(x,\hat{x}) = \frac{1}{m}\sum_{i=1}^{m}\left|x_i - \hat{x}_i\right|$$
where $x$ is the original data stream feature matrix; $\hat{x}$ is the feature matrix reconstructed by the autoencoder; $m$ is the feature matrix dimension; $x_i$ is the i-th dimension feature value in the feature matrix; $\hat{x}_i$ is the i-th dimension feature value in the feature matrix reconstructed by the autoencoder;
the original feature matrix is taken as the input feature matrix, the autoencoder is trained iteratively in batches, and the objective function is used to judge whether the network has learned the original feature distribution: when the objective function value of every training batch meets the set requirement, the autoencoder can be used to fit the original feature distribution of each service, and the intermediate-layer vector $h$ then represents the correlations among the original features.
6. The flow baseline model construction method for the core industrial control service of the transformer substation according to claim 5, wherein step S5 of reducing the dimensionality of the high-dimensional intermediate-layer features obtained in step S4 and calculating low-dimensional key features characterizing the service interaction process is specifically as follows: the core industrial control services in the transformer substation are passed through the autoencoder to obtain the high-dimensional intermediate-layer feature $h$, and principal component analysis maps the high-dimensional features onto new low-dimensional orthogonal features, eliminating redundant data in the high-dimensional features and yielding the service key features;
after service classification, $n$ classes of core service are obtained in the transformer substation, and the intermediate-layer feature $h$ has $m$ dimensions; the features of the same core industrial control service follow the same distribution, and the mean of each intermediate-layer dimension of the same service is calculated by sampling, giving the $m \times n$ intermediate-layer feature matrix $X$ of the different services;
zero-mean processing is applied to the intermediate-layer feature matrix $X$ to obtain the matrix $X'$, and the covariance matrix $C$ of $X'$, which is a symmetric matrix, is calculated as follows:
$$X' = X - \bar{X}, \qquad \bar{X} = \frac{\operatorname{sum}(X)}{n}$$
$$C = X' \times (X')^{T}$$
where $\bar{X}$ is the row mean of the intermediate-layer feature matrix; $n$ is the number of columns of the feature matrix; $\operatorname{sum}(\cdot)$ sums the feature matrix by rows;
the eigenvalues of the service covariance matrix $C$ and the corresponding eigenvectors are calculated; the eigenvalues are sorted in descending order, a threshold $\delta$ is set according to the influence of the eigenvalues, and a transfer matrix $P$ is formed from the eigenvectors whose eigenvalues are greater than $\delta$; the intermediate-layer feature $h$ of the autoencoder is multiplied by the transfer matrix $P$ to obtain the key feature $K$ after dimensionality reduction, where $h$ has dimension $1 \times m$, $P$ has dimension $m \times v$, $K$ has dimension $1 \times v$, and $v$ is the number of eigenvalues greater than the threshold $\delta$: $K = h \times P$.
7. The flow baseline model construction method for the core industrial control service of the transformer substation according to claim 6, wherein the step S6 is to calculate normal distribution parameters in a plurality of periods for each dimension, specifically to convert the same-service characteristic distribution of unknown prior probability into normal distribution by using a central limit theorem; the specific parameters are calculated as:
a. the same service satisfies the same probability distribution, and the key features of the periodically sampled data are obtained; assume that the service set is $S = \{S_1, S_2, \ldots, S_n\}$; within one period, the key feature of each service $S_i \in S$ is:
$$Key_i = (k_{i1}, k_{i2}, \ldots, k_{iv})^{T}$$
where $Key_i$ is the key feature vector of the i-th class of service; $k_{ij}$ is the value of the j-th dimension of the key feature vector of the i-th class of service;
b. calculating the key feature mean of the same service over $L$ periods; the feature vector of service $S_i$ over $L$ periods is:
$$V_{iL} = (Key_{i1}, Key_{i2}, \ldots, Key_{iL})$$
where $Key_{ij}$ is the key feature vector of the i-th class of service in the j-th period, $j = 1, 2, \ldots, L$;
the mean vector $\bar{V}_i$ of each dimension of the key feature vectors of the i-th service over the $L$ periods is then obtained, each element of which satisfies a normal distribution:
$$\bar{V}_i = (\bar{k}_{i1}, \bar{k}_{i2}, \ldots, \bar{k}_{iv})^{T}, \qquad \bar{k}_{ij} \sim N(\mu_{ij}, \sigma_{ij}^{2})$$
c. using maximum likelihood estimation to obtain the normal distribution parameters of each dimension of the key features of service $S_i$ over the $L$ periods; assuming the $L$ periods are sampled $W$ times, the normal distribution parameters of the j-th key feature dimension of service $S_i$ are $(\mu_{ij}, \sigma_{ij}^{2})$, and each parameter is calculated as:
$$\mu_{ij} = \frac{1}{W}\sum_{n=1}^{W}\bar{k}_{ij}^{(n)}, \qquad \sigma_{ij}^{2} = \frac{1}{W}\sum_{n=1}^{W}\left(\bar{k}_{ij}^{(n)} - \mu_{ij}\right)^{2}$$
where $W$ is the number of repeated periodic samplings; $\bar{k}_{ij}^{(n)}$ is the mean of the j-th dimension feature of the i-th class of service after the n-th periodic sampling;
d. obtaining the parameters of every dimension of the key features of service $S_i$: $\{(\mu_{ij}, \sigma_{ij}^{2})\},\ j = 1, 2, \ldots, v$.
8. The flow baseline model construction method for the core industrial control service of the transformer substation according to claim 7, wherein step S7 of constructing, for each service, a joint Gaussian distribution function over the multi-dimensional key features to obtain the final service baseline model is specifically as follows: after the distribution parameters of each dimension of the key features are obtained, the dimensions are integrated through a joint Gaussian distribution to obtain the data baseline model of the service; the key to the joint Gaussian distribution lies in calculating the correlation coefficients between the dimensions, given by
$$\rho_{ab} = \frac{COV(\bar{k}_{ia}, \bar{k}_{ib})}{\sigma_{ia}\,\sigma_{ib}}$$
where $COV(\cdot)$ is the covariance between the means of the a-th and b-th dimensions of the key features of the i-th class of service;
a covariance matrix $\Sigma$ is obtained from the correlation coefficients between the dimensions, and the final service baseline model is a joint Gaussian probability model expressed as:
$$p(Key_i) = \frac{1}{(2\pi)^{v/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(Key_i - \mu_i)^{T}\,\Sigma^{-1}\,(Key_i - \mu_i)\right)$$
the covariance matrix $\Sigma$ is calculated by maximum likelihood estimation as follows, where $v$ is the key feature dimension, $\mu_i = (\mu_{i1}, \ldots, \mu_{iv})^{T}$ is the mean vector, and $\bar{V}_i^{(n)}$ is the $v$-dimensional key feature mean vector from the n-th of $W$ samplings:
$$\Sigma = \frac{1}{W}\sum_{n=1}^{W}\left(\bar{V}_i^{(n)} - \mu_i\right)\left(\bar{V}_i^{(n)} - \mu_i\right)^{T}$$
Background
The industrial control system is the nerve center of an industrial system: it controls the transmission and interaction of information within the industrial system, covers many types of control system, is composed of automation components and control components for acquisition and monitoring, and ensures the automatic operation of industrial infrastructure. Industrial control systems are characterized by structural stability, a limited set of services and periodic interaction. The development of the industrial internet has deeply integrated traditional industrial control systems with internet technology, so that production resources within the industrial system are shared efficiently, and automated, intelligent production reduces cost and raises efficiency.
The intelligent development of the power grid has brought information technology into every link of power generation, transmission, transformation, distribution, consumption and dispatching, giving the grid a wider network, more users, more interaction and faster technology updates. The transformer substation, as a downward extension of the power backbone communication network, carries services such as telemetry, remote signaling and remote control of distributed energy/energy storage stations, and is characterized by complex and widely distributed terminal equipment, service continuity requirements, real-time service requirements and transmission over dedicated protocols. Core services in the transformer substation communicate over TCP/IP using protocols specific to the power grid, and the "three-remote" data dominate their communication traffic: the state of terminal equipment is acquired periodically and operation instructions are returned, which appears as a large number of time-driven state detection and information collection services. By constructing a baseline model that describes the profile of each service in the transformer substation, on the one hand, operation and maintenance personnel can be given the real-time running state of the services; on the other hand, in an open industrial internet environment, the baseline model can be extended to abnormal behavior detection, providing a means of active defense and enriching the available anomaly detection methods.
However, the conventional method for constructing the baseline model is time-consuming, labor-consuming and low in reliability.
Disclosure of Invention
The invention aims to provide a flow baseline model construction method which is high in reliability, high in efficiency, simple and easy to implement and faces to the core industrial control service of a transformer substation.
The invention provides a flow baseline model construction method for transformer substation core industrial control service, which comprises the following steps:
S1, parsing the data streams of the transformer substation to obtain basic service features, and calculating, from the obtained basic service features, high-order features that characterize service flow interaction;
S2, clustering similar services;
S3, generating the final original input feature matrix from the obtained basic features and high-order features;
S4, constructing an autoencoder based on a convolutional neural network, and learning the correlations within the original input feature matrix obtained in step S3, so as to obtain high-dimensional intermediate-layer features;
S5, reducing the dimensionality of the high-dimensional intermediate-layer features obtained in step S4, and calculating low-dimensional key features that characterize the service interaction process;
S6, calculating, for each dimension, the normal distribution parameters over a plurality of periods;
S7, constructing, for each service, a joint Gaussian distribution function over the multi-dimensional key features, thereby obtaining the final service baseline model.
Step S1 of parsing the data streams of the transformer substation, obtaining basic service features, and calculating high-order features characterizing service flow interaction from the obtained basic service features specifically comprises feature extraction consisting of creating flow features, updating flow features and outputting flow features; the basic features in the original data stream are read and parsed in a loop, and flow information is initialized according to the type of the data stream and added to a feature stack; finally, the high-order features are calculated from the basic features, and the features of the flows in the stack are updated and output;
creating flow features: flow features are created according to the result of matching the basic features of the original flow against the feature stack; the basic flow features are the configuration features of the physical layer, link layer, network layer and transport layer of the data stream; the source IP, source port, destination IP, destination port and protocol type among the basic features form a five-tuple; if the five-tuple of the current flow matches no entry in the feature stack, a flow feature is created: flow information is initialized according to the direction of the first data flow, including packet size, packet header length, flow identifier, initial window size, etc.; the initialized information is pushed onto the stack to complete the creation process;
updating flow features: if the five-tuple of the current flow matches an entry in the feature stack and the flow feature output conditions are not met, the flow feature is updated; the update comprises a basic feature update and a high-order feature update; the basic feature update comprises cumulative updates of the packet size, the packet header length, the label count, the packet count, etc., and is realized by accumulating counts; the high-order features are then calculated from the basic features and include flow active/idle time, flow rate, byte rate, flow inter-arrival time, payload statistics, sub-flow statistics such as the number of sub-flows, etc.; the high-order feature update establishes the interaction relationship of the data flow through the sending sequence and the receiving sequence in the basic features, and calculates the feature changes of the forward link and the reverse link separately in combination with timestamps;
outputting flow features: the flow features in the stack are output when any of the following conditions holds:
1) the current flow has timed out, as judged from the basic feature TTL;
2) the current flow has finished, as judged from the FIN flag in the basic features;
3) traversal of the current pcap file has finished;
wherein, under conditions 1) and 2), the feature stack update for the current data flow is completed and the corresponding flow feature is output; condition 3) indicates that the current pcap file has ended, and all flow features in the stack are output.
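The create/update/output cycle of step S1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the packet dictionary keys, the `FlowFeature` fields and the `extract_flows` name are hypothetical, the timeout is approximated by an idle-time threshold rather than by the TTL field, and flows are tracked in one direction only, so the forward/reverse split is omitted.

```python
from dataclasses import dataclass


@dataclass
class FlowFeature:
    """Running basic features of one flow (hypothetical field names)."""
    first_ts: float
    last_ts: float
    packets: int = 0
    bytes_total: int = 0
    header_bytes: int = 0

    def update(self, ts, size, header_len):
        # Basic features are updated by accumulating counts.
        self.packets += 1
        self.bytes_total += size
        self.header_bytes += header_len
        self.last_ts = ts

    def high_order(self):
        # High-order features derived from the accumulated basic features.
        duration = self.last_ts - self.first_ts
        byte_rate = self.bytes_total / duration if duration > 0 else 0.0
        return {"duration": duration, "byte_rate": byte_rate,
                "packets": self.packets}


def extract_flows(packets, timeout=60.0):
    """packets: iterable of dicts with keys ts, src_ip, src_port, dst_ip,
    dst_port, proto, size, header_len, fin (all assumed for this sketch)."""
    table = {}      # five-tuple -> FlowFeature, playing the role of the feature stack
    finished = []   # (five-tuple, high-order features) in output order
    for p in packets:
        key = (p["src_ip"], p["src_port"], p["dst_ip"], p["dst_port"], p["proto"])
        flow = table.get(key)
        # Output condition 1 (assumed idle-time timeout): emit the stale flow.
        if flow is not None and p["ts"] - flow.last_ts > timeout:
            finished.append((key, table.pop(key).high_order()))
            flow = None
        # No match in the table: create a new flow feature.
        if flow is None:
            flow = table[key] = FlowFeature(first_ts=p["ts"], last_ts=p["ts"])
        flow.update(p["ts"], p["size"], p["header_len"])
        # Output condition 2: the FIN flag ends the flow.
        if p.get("fin"):
            finished.append((key, table.pop(key).high_order()))
    # Output condition 3: end of capture, flush everything still in the table.
    for key, flow in table.items():
        finished.append((key, flow.high_order()))
    return finished
```

A real parser would read the packets from a pcap file and also keep the per-direction sequences and timestamps that the high-order feature update relies on.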
Step S2, clustering similar services, specifically clustering according to quintuple extracted from the service data stream; the clustering rules are as follows:
rule one: service types are divided jointly according to the fixed destination port and the protocol type;
rule two: service flows that have the same source device but different source IPs, and the same destination IP, destination port and protocol type, are classified into the same type;
rule three: service flows that have the same source IP and the same destination device but different destination IPs, and the same destination port and protocol type, are classified into the same type.
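One possible reading of the three rules is sketched below in Python. It assumes a hypothetical `ip_to_device` mapping that says which IP addresses belong to the same physical device; the text does not state how that mapping is obtained, and the way the rules are combined into a single grouping key is an interpretation rather than the stated method.

```python
from collections import defaultdict


def cluster_flows(five_tuples, ip_to_device=None):
    """Group five-tuples (src_ip, src_port, dst_ip, dst_port, proto).

    ip_to_device is a hypothetical dict mapping an IP address to a device
    name; addresses missing from it are treated as their own device.
    """
    ip_to_device = ip_to_device or {}
    clusters = defaultdict(list)
    for ft in five_tuples:
        src_ip, _src_port, dst_ip, dst_port, proto = ft
        # Rules two and three: different IPs of one device count as the same endpoint.
        src_dev = ip_to_device.get(src_ip, src_ip)
        dst_dev = ip_to_device.get(dst_ip, dst_ip)
        # Rule one: the fixed destination port and the protocol type define the
        # service type; the ephemeral source port is ignored.
        clusters[(src_dev, dst_dev, dst_port, proto)].append(ft)
    return dict(clusters)
```

With this key, two flows from different addresses of the same source device to the same destination IP, port and protocol fall into one cluster, while a flow to a different destination port forms its own cluster.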
Step S3 of generating the final original input feature matrix from the obtained basic features and high-order features is specifically as follows: the extracted basic features and high-order features include a packet length feature, a packet number feature, a packet time feature, a packet content feature and a configuration feature; the packet length feature represents the amount of information carried by the service; the packet number feature represents the service interaction mode; the packet time feature represents the frequency of service interaction; the packet content feature represents the content flags of the TCP/IP-based service; the configuration feature represents the configuration information of the network device; the forward and reverse features form a service interaction matrix that describes the overall characteristics of the service; the packet length feature, the packet number feature and the packet time feature each contain four high-order features, namely the maximum, minimum, mean and variance; the packet content feature comprises the count values of the different TCP/IP content flags; and the forward and reverse features in the service interaction matrix are grouped and ordered by attribute into packet length, packet number, packet time and packet content features to generate the final original feature matrix input to the network.
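As a rough illustration of how a 2 × N service interaction matrix might be assembled, the Python sketch below builds one row for the forward direction and one for the reverse direction. The function names and the selection of nine columns are assumptions made for the example: the packet number feature is reduced to a single count, and the packet content and configuration features are left out.

```python
import numpy as np


def direction_features(lengths, times):
    """Feature row for one direction: packet length stats, packet count,
    inter-arrival time stats (max, min, mean, variance each)."""
    lengths = np.asarray(lengths, dtype=float)
    gaps = np.diff(np.asarray(times, dtype=float))
    if gaps.size == 0:
        gaps = np.zeros(1)  # a single packet has no inter-arrival gap

    def stats(a):
        return [a.max(), a.min(), a.mean(), a.var()]

    return np.array(stats(lengths) + [float(lengths.size)] + stats(gaps))


def interaction_matrix(forward, reverse):
    """forward/reverse: (packet_lengths, timestamps) per direction.
    Returns a 2 x N matrix, here with N = 9."""
    return np.vstack([direction_features(*forward),
                      direction_features(*reverse)])
```

Each row lists the same attributes in the same order, so column j of the forward row and column j of the reverse row describe the same attribute, which is the alignment a convolution over the matrix relies on.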
The construction of the convolutional neural network-based autoencoder described in step S4 specifically includes the following steps:
the input of the autoencoder is a 2 × N original feature matrix $x$, where $N$ is the unidirectional feature dimension of a service flow; the autoencoder is based on two-layer convolutional neural networks; the structures of the encoder and the decoder are mirror-symmetric; the first convolutional layer function of the encoder is $f_1(\cdot)$ and the second convolutional layer function is $f_2(\cdot)$; the output of the encoder is the service intermediate-layer feature vector $h$, and the encoder establishes the mapping between the intermediate-layer feature vector and the input feature matrix as $h = f_2(f_1(x))$; the decoder is also a two-layer convolutional neural network, in which the first decoder layer decodes the result of the second encoder layer with decoding function $g_2(\cdot)$, and the second decoder layer decodes the result of the first encoder layer with decoding function $g_1(\cdot)$; the final decoded output is $\hat{x}$, and the decoder establishes the mapping between the output feature matrix and the intermediate-layer feature vector as $\hat{x} = g_1(g_2(h))$; finally, the intermediate-layer feature vector $h$ is used to represent the correlations among the original features;
the autoencoder network is trained by minimizing the mean absolute error as the loss function:
$$L(x,\hat{x}) = \frac{1}{m}\sum_{i=1}^{m}\left|x_i - \hat{x}_i\right|$$
where $x$ is the original data stream feature matrix; $\hat{x}$ is the feature matrix reconstructed by the autoencoder; $m$ is the feature matrix dimension; $x_i$ is the i-th dimension feature value in the feature matrix; $\hat{x}_i$ is the i-th dimension feature value in the feature matrix reconstructed by the autoencoder;
the original feature matrix is taken as the input feature matrix, the autoencoder is trained iteratively in batches, and the objective function is used to judge whether the network has learned the original feature distribution: when the objective function value of every training batch meets the set requirement, the autoencoder can be used to fit the original feature distribution of each service, and the intermediate-layer vector $h$ then represents the correlations among the original features.
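The loss and the batch-wise acceptance check can be sketched in Python as below. The convolutional encoder and decoder themselves are omitted; `mae_loss` and `training_converged` are hypothetical names, and treating "meets the set requirement" as "every batch loss is at or below a threshold" is an assumption.

```python
import numpy as np


def mae_loss(x, x_hat):
    """Mean absolute error between a feature matrix and its reconstruction."""
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    return float(np.mean(np.abs(x - x_hat)))


def training_converged(batch_losses, threshold):
    """True when every batch loss meets the (assumed) threshold requirement."""
    return len(batch_losses) > 0 and all(l <= threshold for l in batch_losses)
```

For example, reconstructing `[[1, 2], [3, 4]]` as `[[1, 1], [3, 6]]` gives absolute errors 0, 1, 0 and 2, so the loss is 0.75.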
Step S5 of reducing the dimensionality of the high-dimensional intermediate-layer features obtained in step S4 and calculating low-dimensional key features characterizing the service interaction process is specifically as follows: the core industrial control services in the transformer substation are passed through the autoencoder to obtain the high-dimensional intermediate-layer feature $h$, and principal component analysis maps the high-dimensional features onto new low-dimensional orthogonal features, eliminating redundant data in the high-dimensional features and yielding the service key features;
after service classification, $n$ classes of core service are obtained in the transformer substation, and the intermediate-layer feature $h$ has $m$ dimensions; the features of the same core industrial control service follow the same distribution, and the mean of each intermediate-layer dimension of the same service is calculated by sampling, giving the $m \times n$ intermediate-layer feature matrix $X$ of the different services;
zero-mean processing is applied to the intermediate-layer feature matrix $X$ to obtain the matrix $X'$, and the covariance matrix $C$ of $X'$, which is a symmetric matrix, is calculated as follows:
$$X' = X - \bar{X}, \qquad \bar{X} = \frac{\operatorname{sum}(X)}{n}$$
$$C = X' \times (X')^{T}$$
where $\bar{X}$ is the row mean of the intermediate-layer feature matrix; $n$ is the number of columns of the feature matrix; $\operatorname{sum}(\cdot)$ sums the feature matrix by rows;
the eigenvalues of the service covariance matrix $C$ and the corresponding eigenvectors are calculated; the eigenvalues are sorted in descending order, a threshold $\delta$ is set according to the influence of the eigenvalues, and a transfer matrix $P$ is formed from the eigenvectors whose eigenvalues are greater than $\delta$; the intermediate-layer feature $h$ of the autoencoder is multiplied by the transfer matrix $P$ to obtain the key feature $K$ after dimensionality reduction, where $h$ has dimension $1 \times m$, $P$ has dimension $m \times v$, $K$ has dimension $1 \times v$, and $v$ is the number of eigenvalues greater than the threshold $\delta$: $K = h \times P$.
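The dimensionality reduction of step S5 can be illustrated with NumPy as follows. This is a sketch under assumptions: the function names are hypothetical, the threshold `delta` is passed in directly because the text does not say how it is derived from the eigenvalues, and `numpy.linalg.eigh` is used because $C$ is symmetric.

```python
import numpy as np


def pca_transfer_matrix(X, delta):
    """X: m x n matrix (rows = intermediate-layer dimensions, columns = services).
    Returns the m x v transfer matrix P and the eigenvalues in descending order."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=1, keepdims=True)   # zero-mean each row
    C = Xc @ Xc.T                            # m x m symmetric matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > delta                   # keep components above the threshold
    return eigvecs[:, keep], eigvals


def key_features(h, P):
    """Project an intermediate-layer feature h (length m) onto P, giving K (length v)."""
    return np.asarray(h, dtype=float) @ P
```

The sign of each eigenvector is arbitrary, so the sign of a component of `K` may differ between runs or libraries; only its magnitude and the relative geometry are meaningful.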
Step S6, calculating normal distribution parameters in a plurality of periods aiming at each dimension, specifically converting the same-service characteristic distribution of unknown prior probability into normal distribution by using a central limit theorem; the specific parameters are calculated as:
a. the same service satisfies the same probability distribution, and the key features of the periodically sampled data are obtained; assume that the service set is $S = \{S_1, S_2, \ldots, S_n\}$; within one period, the key feature of each service $S_i \in S$ is:
$$Key_i = (k_{i1}, k_{i2}, \ldots, k_{iv})^{T}$$
where $Key_i$ is the key feature vector of the i-th class of service; $k_{ij}$ is the value of the j-th dimension of the key feature vector of the i-th class of service;
b. calculating the key feature mean of the same service over $L$ periods; the feature vector of service $S_i$ over $L$ periods is:
$$V_{iL} = (Key_{i1}, Key_{i2}, \ldots, Key_{iL})$$
where $Key_{ij}$ is the key feature vector of the i-th class of service in the j-th period, $j = 1, 2, \ldots, L$;
the mean vector $\bar{V}_i$ of each dimension of the key feature vectors of the i-th service over the $L$ periods is then obtained, each element of which satisfies a normal distribution:
$$\bar{V}_i = (\bar{k}_{i1}, \bar{k}_{i2}, \ldots, \bar{k}_{iv})^{T}, \qquad \bar{k}_{ij} \sim N(\mu_{ij}, \sigma_{ij}^{2})$$
c. using maximum likelihood estimation to obtain the normal distribution parameters of each dimension of the key features of service $S_i$ over the $L$ periods; after the $L$ periods are sampled $W$ times, the normal distribution parameters of the j-th key feature dimension of service $S_i$ are $(\mu_{ij}, \sigma_{ij}^{2})$, and each parameter is calculated as:
$$\mu_{ij} = \frac{1}{W}\sum_{n=1}^{W}\bar{k}_{ij}^{(n)}, \qquad \sigma_{ij}^{2} = \frac{1}{W}\sum_{n=1}^{W}\left(\bar{k}_{ij}^{(n)} - \mu_{ij}\right)^{2}$$
where $W$ is the number of repeated periodic samplings; $\bar{k}_{ij}^{(n)}$ is the mean of the j-th dimension feature of the i-th class of service after the n-th periodic sampling;
d. obtaining the parameters of every dimension of the key features of service $S_i$: $\{(\mu_{ij}, \sigma_{ij}^{2})\},\ j = 1, 2, \ldots, v$.
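The per-dimension parameter estimation of step S6 can be sketched as follows. The function names and array layout are assumptions: `period_mean` averages the key feature vectors of one sampling over its L periods, and `fit_normal_params` applies the maximum likelihood estimates for a normal distribution (sample mean and variance with divisor W) across W samplings.

```python
import numpy as np


def period_mean(keys):
    """keys: L x v array, one key feature vector per period.
    Returns the length-v mean vector over the L periods."""
    return np.asarray(keys, dtype=float).mean(axis=0)


def fit_normal_params(samples):
    """samples: W x v array, one period-mean vector per sampling.
    Returns (mu, var), each of length v, as maximum likelihood estimates."""
    samples = np.asarray(samples, dtype=float)
    mu = samples.mean(axis=0)
    var = samples.var(axis=0)   # ddof=0, i.e. divide by W, is the MLE
    return mu, var
```

Note that the maximum likelihood variance divides by W rather than W - 1, so it is slightly biased low for small W.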
Step S7 of constructing, for each service, a joint Gaussian distribution function over the multi-dimensional key features to obtain the final service baseline model is specifically as follows: after the distribution parameters of each dimension of the key features are obtained, the dimensions are integrated through a joint Gaussian distribution to obtain the data baseline model of the service; the key to the joint Gaussian distribution lies in calculating the correlation coefficients between the dimensions, given by
$$\rho_{ab} = \frac{COV(\bar{k}_{ia}, \bar{k}_{ib})}{\sigma_{ia}\,\sigma_{ib}}$$
where $COV(\cdot)$ is the covariance between the means of the a-th and b-th dimensions of the key features of the i-th class of service;
a covariance matrix $\Sigma$ is obtained from the correlation coefficients between the dimensions, and the final service baseline model is a joint Gaussian probability model expressed as:
$$p(Key_i) = \frac{1}{(2\pi)^{v/2}\,|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(Key_i - \mu_i)^{T}\,\Sigma^{-1}\,(Key_i - \mu_i)\right)$$
the covariance matrix $\Sigma$ is calculated by maximum likelihood estimation as follows, where $v$ is the key feature dimension, $\mu_i = (\mu_{i1}, \ldots, \mu_{iv})^{T}$ is the mean vector, and $\bar{V}_i^{(n)}$ is the $v$-dimensional key feature mean vector from the n-th of $W$ samplings:
$$\Sigma = \frac{1}{W}\sum_{n=1}^{W}\left(\bar{V}_i^{(n)} - \mu_i\right)\left(\bar{V}_i^{(n)} - \mu_i\right)^{T}$$
The flow baseline model construction method for transformer substation core industrial control service provided by the invention extracts the key features of the service interaction flow using a deep learning network and performs mathematical modeling on those key features, thereby constructing a service baseline model; from a practical standpoint, the invention provides an effective modeling scheme for service monitoring and anomaly detection in the transformer substation, and offers high reliability, high efficiency, and simple implementation.
Drawings
FIG. 1 is a schematic process flow diagram of the process of the present invention.
FIG. 2 is a schematic diagram of a convolutional neural network-based self-encoder architecture according to the present invention.
FIG. 3 is a graph illustrating a loss function curve of network training according to the method of the present invention.
Fig. 4 is a schematic diagram of the distribution of characteristics of the intermediate layer of the core industrial control service in the substation according to the embodiment of the method of the present invention.
Fig. 5 is a schematic diagram of distribution of key features of core industrial control services in a substation according to an embodiment of the method of the present invention.
Fig. 6 is a schematic diagram of the distributions of the 1st, 4th, and 10th dimensions of the service key features according to an embodiment of the method of the present invention.
Detailed Description
FIG. 1 is a schematic flow chart of the method of the present invention: the invention provides a flow baseline model construction method for transformer substation core industrial control service, which comprises the following steps:
s1, analyzing a data stream of a transformer substation, obtaining basic service characteristics, and calculating to obtain high-order characteristics representing service stream interaction according to the obtained basic service characteristics; specifically, the feature extraction comprises creating stream features, updating stream features and outputting stream features; circularly reading and analyzing basic features in the original data stream, initializing stream information according to the type of the data stream, and adding the stream information into a feature stack; finally, calculating high-order characteristics based on the basic characteristics, updating the characteristics of the flow in the stack and outputting;
creating flow features: flow features are created based on the result of matching the basic features of the original flow against the feature stack; the basic flow features are the configuration features of the physical layer, link layer, network layer and transport layer in the data flow; the source IP, source port, destination IP, destination port and protocol type among the basic features form a five-tuple; if the five-tuple of the current flow matches no entry in the feature stack, a flow feature is created: flow information is initialized according to the direction of the first data flow, including packet size, packet header length, flow identifier, initial window size, etc.; the initialized information is pushed onto the stack to complete the creation process;
updating flow features: if the five-tuple of the current flow matches an entry in the feature stack and the flow feature output condition is not met, the flow feature is updated; the update covers both basic features and high-order features; the basic feature update is done by cumulative counting and includes the cumulative packet size, cumulative packet header length, cumulative flag count, cumulative packet count, etc.; the high-order features are then calculated from the basic features and include flow active/idle time, flow rate, byte rate, flow inter-arrival time, payload statistics, sub-flow statistics such as the number of sub-flows, etc.; the high-order feature update establishes the interaction relationship of the data flow from the sending and receiving sequences in the basic features and, using the timestamps, calculates the feature changes of the forward link and the reverse link separately; for features describing the interaction behavior of the data flow, such as packet size, byte count and duration, the maximum, minimum, mean and variance are calculated; for flag features describing the content of the data flow, only accumulation is performed; for configuration features of the data flow, such as the timestamp and the five-tuple, the values are only recorded and not updated; 77-dimensional features are finally generated;
outputting flow features: the flow features in the stack are output when any of the following conditions holds:
1) the current flow has timed out, as judged from the basic feature TTL;
2) the current flow has ended, as judged from the FIN flag in the basic features;
3) the current pcap file has been fully traversed;
wherein, under conditions 1) and 2), the feature stack update for the current data flow is completed and the corresponding flow feature is output; under condition 3), the current pcap file has ended and all flow features in the stack are output;
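The create/update/output cycle described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the names `FlowFeature`, `extract_flows` and `FLOW_TIMEOUT` are hypothetical, the timeout is approximated by an idle gap between packet timestamps rather than by the TTL field, forward and reverse directions are not separated, and only a handful of the 77 features are computed.

```python
from dataclasses import dataclass, field

FLOW_TIMEOUT = 120.0  # seconds; hypothetical idle timeout (assumption)


@dataclass
class FlowFeature:
    """Running statistics for one flow, keyed by its five-tuple."""
    first_ts: float
    last_ts: float
    packet_count: int = 0
    byte_count: int = 0
    sizes: list = field(default_factory=list)

    def update(self, ts, size):
        # Basic features are updated by cumulative counting.
        self.packet_count += 1
        self.byte_count += size
        self.sizes.append(size)
        self.last_ts = ts

    def summary(self):
        # High-order features derived from the basic ones.
        mean = self.byte_count / self.packet_count
        var = sum((s - mean) ** 2 for s in self.sizes) / self.packet_count
        return {
            "packets": self.packet_count,
            "bytes": self.byte_count,
            "duration": self.last_ts - self.first_ts,
            "size_max": max(self.sizes),
            "size_min": min(self.sizes),
            "size_mean": mean,
            "size_var": var,
        }


def extract_flows(packets):
    """packets: iterable of (timestamp, five_tuple, size, fin_flag)."""
    stack = {}     # the "feature stack": five-tuple -> FlowFeature
    finished = []  # (five_tuple, summary) in output order
    for ts, key, size, fin in packets:
        flow = stack.get(key)
        if flow is not None and ts - flow.last_ts > FLOW_TIMEOUT:
            # Condition 1: the flow timed out; output it and start a new one.
            finished.append((key, stack.pop(key).summary()))
            flow = None
        if flow is None:
            # No match in the stack: create a new flow feature.
            flow = stack[key] = FlowFeature(first_ts=ts, last_ts=ts)
        flow.update(ts, size)
        if fin:
            # Condition 2: FIN seen; update first, then output.
            finished.append((key, stack.pop(key).summary()))
    # Condition 3: capture exhausted; output everything left in the stack.
    finished.extend((key, flow.summary()) for key, flow in stack.items())
    return finished
```

A real extractor would parse the pcap file itself and keep separate forward and reverse statistics per flow; the sketch only shows how the three output conditions interact with the stack.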
s2, clustering similar services; specifically, clustering is carried out according to the five-tuple extracted from the service data flow; the clustering rules are as follows:
rule one: service types are divided jointly by the fixed destination port and the protocol type;
rule two: service flows that come from the same kind of source device with different source IPs and share the same destination IP, destination port and protocol type are classified into the same class;
rule three: service flows that share the same source IP and go to the same kind of destination device with different destination IPs, the same destination port and the same protocol type are classified into the same class;
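One way to read the three rules is as merge conditions over five-tuples. The sketch below is an illustration under stated assumptions, not the method of the source: `cluster_flows` is a hypothetical name, `device_kind` is an assumed lookup from IP address to device type (the source does not say how device types are known), and flows are merged transitively with a small union-find structure.

```python
from collections import defaultdict


def cluster_flows(flows, device_kind):
    """flows: list of (src_ip, src_port, dst_ip, dst_port, proto).

    device_kind: dict mapping an IP address to a device type (assumed input).
    Returns a sorted list of clusters, each a list of flow indices.
    """
    parent = list(range(len(flows)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # rule key -> index of the first flow with that key
    for idx, (src_ip, _sport, dst_ip, dst_port, proto) in enumerate(flows):
        # Rule one: both keys contain (dst_port, proto), so flows that
        # differ in destination port or protocol are never merged.
        keys = (
            # Rule two: same kind of source device, same destination.
            ("rule2", device_kind[src_ip], dst_ip, dst_port, proto),
            # Rule three: same source IP, same kind of destination device.
            ("rule3", src_ip, device_kind[dst_ip], dst_port, proto),
        )
        for key in keys:
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(flows)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

The source port is deliberately ignored, since none of the three rules refers to it.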
s3, generating a final original input feature matrix according to the acquired basic features and the high-order features; the extracted basic features and high-order features comprise a packet length feature, a packet number feature, a packet time feature, a packet content feature and a configuration feature; wherein the packet length feature represents the amount of information carried by the service; the packet number feature represents the service interaction mode; the packet time feature represents the frequency of service interaction; the packet content feature represents the content flags of TCP/IP-based services; the configuration feature represents the configuration information of the network device; the forward/reverse features form a service interaction matrix that describes the overall characteristics of the service; meanwhile, the packet length feature, the packet number feature and the packet time feature each contain four high-order features, namely the maximum, minimum, mean and variance; the packet content feature comprises the count values of the different TCP/IP content flags; the forward/reverse features in the service interaction matrix are grouped and ordered by attribute, covering the packet length, packet number, packet time and packet content features, to generate the final original feature matrix used as the network input;
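As a small illustration of the 2 × N layout, the sketch below stacks a forward and a reverse feature dictionary into one matrix with a fixed column order grouped by attribute. The feature names in `FEATURE_ORDER` and the function `build_input_matrix` are hypothetical; the source does not list the actual columns.

```python
# Hypothetical column order, grouped by attribute: packet length,
# packet number, packet time (inter-arrival), packet content (flag counts).
FEATURE_ORDER = [
    "len_max", "len_min", "len_mean", "len_var",
    "count",
    "iat_max", "iat_min", "iat_mean", "iat_var",
    "fin_count", "syn_count",
]


def build_input_matrix(forward, reverse):
    """Return a 2 x N list of lists: row 0 forward, row 1 reverse."""
    return [
        [float(direction[name]) for name in FEATURE_ORDER]
        for direction in (forward, reverse)
    ]
```

Keeping the column order fixed matters because the convolution kernels in the next step slide over neighbouring columns, so related attributes should sit next to each other.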
s4, constructing a convolutional neural network-based self-encoder, and learning the incidence relation of the original input feature matrix obtained in the step S3, so as to obtain high-dimensional middle layer features;
the original features represent the basic interaction mode of the industrial control service flow; services of the same class share the same interaction mode, including the same service interaction period, similar service interaction contents and fixed service interaction rules, and the association among the original features describes the service characteristics more fully; the self-encoder is a neural network that uses the back-propagation algorithm to make its output equal to its input and is composed of an encoder and a decoder, wherein the encoder represents the mapping from the input data to the latent-space representation, the decoder represents the mapping from the latent-space representation to the output data, and the latent-space representation captures the association among the input data; therefore, the invention uses the self-encoder model to extract feature associations;
the self-encoder is constructed by adopting the following steps:
the input of the self-encoder is a 2 × N original feature matrix x, where N is the unidirectional feature dimension of the service flow; the self-encoder is based on two-layer convolutional neural networks, and the structures of the encoder and the decoder are mirror-symmetric; in the encoder, the first convolutional layer function is $f_1(\cdot)$ and the second convolutional layer function is $f_2(\cdot)$; the output of the encoder is the service intermediate layer feature vector h, and the encoder establishes the mapping between the intermediate layer feature vector and the input feature matrix as $h = f_2(f_1(x))$; the decoder is also a two-layer convolutional neural network, in which the first decoder layer decodes the result of the second encoder layer with decoding function $g_2(\cdot)$ and the second decoder layer decodes the result of the first encoder layer with decoding function $g_1(\cdot)$; the final decoded output is $\hat{x}$, and the decoder establishes the mapping between the output feature matrix and the intermediate layer feature vector as $\hat{x} = g_1(g_2(h))$; finally, the intermediate layer feature vector h represents the association among the original features;
the objective function constrains the direction of network training and is used to train the parameters of the encoder and decoder mapping functions $f_1(\cdot)$, $f_2(\cdot)$, $g_1(\cdot)$ and $g_2(\cdot)$; the self-encoder realizes feature extraction by reconstructing the input data and requires its output to match its input as closely as possible, so the objective function should minimize the difference between input and output, that is, the difference between corresponding dimensions of the original feature matrix x and the reconstruction matrix $\hat{x}$; the ultimate training goal is $g_i = f_i^{-1}$, which gives $\hat{x} = x$; the self-encoder network is trained by minimizing the mean absolute error as the loss function:
$$L(x, \hat{x}) = \frac{1}{m}\sum_{i=1}^{m}\left|x_i - \hat{x}_i\right|$$
where x is the original data flow feature matrix; $\hat{x}$ is the feature matrix reconstructed by the self-encoder; m is the feature matrix dimension; $x_i$ is the i-th dimension feature value in the feature matrix; $\hat{x}_i$ is the i-th dimension feature value in the feature matrix reconstructed by the self-encoder;
with the original feature matrix as the input feature matrix, the self-encoder is trained iteratively in batches, and the objective function is used to judge whether the network has learned the original feature distribution: when the objective function value of every training batch meets the set requirement, the self-encoder can fit the original feature distribution of each service, and the intermediate layer vector h then represents the association among the original features;
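The mean absolute error loss and the batch-wise stopping check can be written out in a few lines of plain Python. This is a sketch only: `mean_absolute_error` follows the loss described above, while `training_converged`, its `threshold` and its `patience` parameter are assumptions standing in for the unspecified "set requirement".

```python
def mean_absolute_error(x, x_hat):
    """Mean absolute error between a feature matrix and its reconstruction.

    Both arguments are lists of rows; m is the total number of entries.
    """
    flat_x = [v for row in x for v in row]
    flat_hat = [v for row in x_hat for v in row]
    if len(flat_x) != len(flat_hat):
        raise ValueError("input and reconstruction must have the same size")
    return sum(abs(a - b) for a, b in zip(flat_x, flat_hat)) / len(flat_x)


def training_converged(batch_losses, threshold, patience=3):
    """True once the last `patience` batch losses are all below `threshold`.

    `threshold` and `patience` are hypothetical knobs, not from the source.
    """
    recent = batch_losses[-patience:]
    return len(recent) == patience and all(loss < threshold for loss in recent)
```

For a 2 × 2 example, `mean_absolute_error([[1.0, 2.0], [3.0, 4.0]], [[1.5, 2.0], [2.0, 4.0]])` averages the absolute differences 0.5, 0, 1 and 0 over four entries.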
s5, reducing the dimension of the high-dimensional middle layer features obtained in the step S4, and calculating to obtain low-dimensional key features representing the service interaction process; specifically, after an industrial control service in the transformer substation passes through the self-encoder, the high-dimensional intermediate layer feature h is obtained; principal component analysis maps the high-dimensional feature onto new low-dimensional orthogonal features, eliminating redundant data in the high-dimensional feature and yielding the service key features;
after service classification, n classes of substation core services are obtained; the intermediate layer feature h has m dimensions, and the features of the same core industrial control service follow the same distribution; the mean of each dimension of the intermediate layer features of each service is calculated by sampling, giving an m × n intermediate layer feature matrix X over the different services;
zero-mean processing is applied to the intermediate layer feature matrix X to obtain the matrix X', and the covariance matrix C of X', which is a symmetric matrix, is calculated as follows:
$$X' = X - \bar{X}, \qquad \bar{X} = \frac{sum(X)}{n}$$
$$C = X' \times (X')^T$$
where $\bar{X}$ is the row mean of the intermediate layer feature matrix; n is the number of columns of the feature matrix; sum() is the sum of the feature matrix by rows;
the eigenvalues of the service covariance matrix C and the corresponding eigenvectors are calculated; the eigenvalues are sorted in descending order, a threshold δ is set according to the influence factor of the eigenvalues, the transfer matrix P is formed from the eigenvectors whose eigenvalues are larger than δ, and the dot product of the self-encoder intermediate layer feature h with the transfer matrix P gives the dimension-reduced key feature K, where h has dimension 1 × m, P has dimension m × v, K has dimension 1 × v, and v is the number of eigenvalues larger than the threshold δ: $K = h \times P$;
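The PCA step can be sketched with NumPy as follows. The function names `pca_transfer_matrix` and `key_features` are hypothetical, and the threshold `delta` is passed in by the caller because the source does not give a formula for it.

```python
import numpy as np


def pca_transfer_matrix(X, delta):
    """X: m x n matrix (m feature dimensions, n services).

    Returns (P, eigvals): P is the m x v transfer matrix built from the
    eigenvectors whose eigenvalues exceed delta; eigvals are all
    eigenvalues in descending order.
    """
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=1, keepdims=True)  # zero-mean each row
    C = X_centered @ X_centered.T                   # m x m, symmetric
    eigvals, eigvecs = np.linalg.eigh(C)            # ascending order
    order = np.argsort(eigvals)[::-1]               # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > delta
    return eigvecs[:, keep], eigvals


def key_features(h, P):
    """Project a 1 x m intermediate feature h to the 1 x v key feature K."""
    return np.asarray(h, dtype=float) @ P
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because C is symmetric, so the eigenvalues come back real and the eigenvectors orthonormal. Tiny negative eigenvalues caused by floating-point rounding are simply discarded by the threshold.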
s6, calculating normal distribution parameters in a plurality of periods aiming at each dimension; converting the same-service characteristic distribution of unknown prior probability into normal distribution by using a central limit theorem; the specific parameters are calculated as:
a. because traffic of the same service follows the same probability distribution, the key features are obtained from periodically sampled data; let the service set be $S = \{S_1, S_2, \ldots, S_n\}$; within one period, the key features of each service $S_i \in S$ are:
$$Key_i = (k_{i1}, k_{i2}, \ldots, k_{iv})^T$$
where $Key_i$ is the key feature vector of the i-th service class, and $k_{ij}$ is the value of the j-th dimension of the key feature vector of the i-th service class;
b. calculating the mean of the key features of the same service over L periods; the feature vectors of service $S_i$ over L periods are:
$$V_{iL} = (Key_{i1}, Key_{i2}, \ldots, Key_{iL})$$
where $Key_{il}$ is the key feature vector of the i-th service class in the l-th period, $l = 1, 2, \ldots, L$;
solving the per-dimension mean vector of the key feature vectors $V_{iL}$ of the i-th service over the L periods, $\overline{Key}_i = (\bar{k}_{i1}, \bar{k}_{i2}, \ldots, \bar{k}_{iv})^T$, wherein each element satisfies a normal distribution:
$$\bar{k}_{ij} \sim N(\mu_{ij}, \sigma_{ij}^2)$$
c. obtaining, by maximum likelihood estimation, the normal distribution parameters of the key features of service $S_i$ in each dimension over the L periods; assuming W rounds of L-period sampling, the normal distribution parameters obtained for the j-th dimension key feature of service $S_i$ are $(\mu_{ij}, \sigma_{ij}^2)$, calculated as follows:
$$\mu_{ij} = \frac{1}{W}\sum_{n=1}^{W} \bar{k}_{ij}^{(n)}, \qquad \sigma_{ij}^2 = \frac{1}{W}\sum_{n=1}^{W}\left(\bar{k}_{ij}^{(n)} - \mu_{ij}\right)^2$$
where W is the number of repeated periodic samplings, and $\bar{k}_{ij}^{(n)}$ is the mean of the j-th dimension feature of the i-th service class after the n-th round of sampling;
d. obtaining the parameters of every dimension of the key features of service $S_i$: $(\mu_{i1}, \sigma_{i1}^2), (\mu_{i2}, \sigma_{i2}^2), \ldots, (\mu_{iv}, \sigma_{iv}^2)$;
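Steps a to d amount to repeatedly averaging groups of key feature vectors and then fitting one normal distribution per dimension. The sketch below assumes that each sampling round draws a random group without replacement; the function names `group_means` and `normal_mle` and the parameters `group_size`, `rounds` and `seed` are hypothetical.

```python
import random


def group_means(samples, group_size, rounds, seed=0):
    """Draw `rounds` random groups of key feature vectors and average each.

    samples: list of key feature vectors (lists of equal length v).
    Returns a list of `rounds` mean vectors.
    """
    rng = random.Random(seed)
    dims = len(samples[0])
    means = []
    for _ in range(rounds):
        group = rng.sample(samples, group_size)
        means.append(
            [sum(vec[j] for vec in group) / group_size for j in range(dims)]
        )
    return means


def normal_mle(means):
    """Per-dimension maximum likelihood estimates (mu_j, sigma_j^2).

    means: W rows of group means, one row per sampling round.
    """
    W = len(means)
    dims = len(means[0])
    params = []
    for j in range(dims):
        mu = sum(row[j] for row in means) / W
        var = sum((row[j] - mu) ** 2 for row in means) / W  # MLE divides by W
        params.append((mu, var))
    return params
```

The variance here divides by W, which is the maximum likelihood estimate; the unbiased sample variance would divide by W − 1 instead.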
S7, constructing a multi-dimensional key feature joint Gaussian distribution function for each service to finally obtain the service baseline model; specifically, after the distribution parameters of each dimension of the key features are obtained, the dimensions are integrated through a joint Gaussian distribution to obtain the data baseline model of the service; the key point of the joint Gaussian distribution is calculating the correlation coefficient between dimensions, which is given by
$$\rho_{ab} = \frac{COV(\bar{k}_{ia}, \bar{k}_{ib})}{\sigma_{ia}\,\sigma_{ib}}$$
where $COV(\bar{k}_{ia}, \bar{k}_{ib})$ is the covariance between the means of the a-th and b-th dimensions of the key features of the i-th service class, and $\sigma_{ia}$, $\sigma_{ib}$ are the standard deviations of those two dimensions;
the covariance matrix $\Sigma$ is obtained from the correlation coefficients between dimensions, and the final service baseline model is a joint Gaussian probability model expressed as:
$$p(Key_i; \mu_i, \Sigma) = \frac{1}{(2\pi)^{v/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(Key_i - \mu_i)^T \Sigma^{-1} (Key_i - \mu_i)\right)$$
the covariance matrix $\Sigma$ is calculated by maximum likelihood estimation as follows, where v is the key feature dimension, $\mu_i = (\mu_{i1}, \ldots, \mu_{iv})^T$ is the mean vector, W is the number of sampling rounds, and $\overline{Key}_i^{(n)}$ is the key feature mean vector of the i-th service class in the n-th round:
$$\Sigma = \frac{1}{W}\sum_{n=1}^{W}\left(\overline{Key}_i^{(n)} - \mu_i\right)\left(\overline{Key}_i^{(n)} - \mu_i\right)^T$$
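A joint Gaussian baseline of this kind can be fitted and evaluated with NumPy as sketched below. The names `fit_baseline` and `log_density` are hypothetical, and scoring a new observation by its log-density is an illustrative use that the source does not spell out.

```python
import numpy as np


def fit_baseline(means):
    """means: W x v array of per-round key feature means.

    Returns the maximum likelihood estimates (mu, sigma) of a joint
    Gaussian: the mean vector and the v x v covariance matrix.
    """
    M = np.asarray(means, dtype=float)
    mu = M.mean(axis=0)
    centered = M - mu
    sigma = centered.T @ centered / M.shape[0]  # MLE: divide by W
    return mu, sigma


def log_density(k, mu, sigma):
    """Log of the joint Gaussian density at the key feature vector k."""
    v = mu.shape[0]
    diff = np.asarray(k, dtype=float) - mu
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("covariance matrix must be positive definite")
    maha = diff @ np.linalg.solve(sigma, diff)  # squared Mahalanobis distance
    return -0.5 * (v * np.log(2 * np.pi) + logdet + maha)
```

Working with the log-density and `np.linalg.solve` avoids forming the explicit inverse of the covariance matrix and the underflow that the raw density would suffer from in ten or more dimensions.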
the invention is further illustrated by the following examples:
example 1: referring to fig. 2, the encoder and the decoder in the self-encoder structure are both two-layer convolutional networks; the parameters of the first convolutional layer in the encoder are 20 convolution kernels of size (2, 5) with stride 1, and the parameters of the second convolutional layer are 8 convolution kernels of size (1, 1) with stride 1; the decoder parameters mirror those of the encoder. The input is a 2 × 22 forward/reverse original service feature matrix x, and the output is $\hat{x}$. Because the self-encoder is a neural network whose output should equal its input, the loss function is defined as the mean square error between the input and the output. The network is optimized by stochastic gradient descent; the original data are randomly divided into a training set and a validation set in the ratio 8:2, the training set is used to train the model parameters, the validation set is used to test the fit of the model, and training runs for 20 batches.
Fig. 3 shows the loss function curves of the model on the training set and the validation set, from which it can be seen that: 1) the loss function decreases continuously during learning, so the model can learn the original feature distribution and reconstruct the input features; 2) by the 10th training batch the model has stabilized and the decrease of the loss function slows; 3) the validation set and the training set show the same downward trend, indicating that the current model is not overfitting.
Example 2: extraction of intermediate layer features and key features. Referring to fig. 2, the output of the encoder in the trained self-encoder is taken as the service intermediate layer feature; with the network parameters of example 1, a 176-dimensional intermediate layer feature is obtained; a service baseline model is difficult to construct directly at this dimensionality, so PCA linear dimension reduction is applied to the intermediate layer features to extract the service key features.
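The figure of 176 dimensions can be cross-checked against the layer parameters of example 1 with simple shape arithmetic. The source does not state the padding convention, so the sketch below only tests which convention is numerically consistent with 176; `conv_out` and `encoder_output_dim` are hypothetical helper names.

```python
def conv_out(size, kernel, stride=1, padding="valid"):
    """Output length of a convolution along one axis."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1


def encoder_output_dim(height, width, pad_h, pad_w):
    """Flattened encoder output size for the layers of example 1.

    Layer 1: kernels of size (2, 5), stride 1; padding per axis is assumed.
    Layer 2: 8 kernels of size (1, 1), stride 1, which keep the spatial size.
    """
    h = conv_out(height, 2, padding=pad_h)
    w = conv_out(width, 5, padding=pad_w)
    h = conv_out(h, 1)
    w = conv_out(w, 1)
    return 8 * h * w
```

For the 2 × 22 input, no padding on either axis gives 8 × 1 × 18 = 144 and full "same" padding gives 8 × 2 × 22 = 352; only collapsing the height while preserving the width gives 8 × 1 × 22 = 176. This suggests, but does not prove, that the original network padded along the feature axis only.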
Among the core services of the transformer substation, the periodic key services account for 87% of the total substation traffic; fig. 4(a), (b) and (c) show the intermediate layer feature distributions of these services; it can be seen that the self-encoder intermediate layer output features of services of the same class are stable and that their fluctuation trends are consistent, so services of the same class have the same distribution characteristics.
PCA dimension reduction is performed on the high-dimensional intermediate layer features; the eigenvalues of the service matrix lie in the range [-2.296e-8, 1.245]; the key feature dimension is determined to be 12 according to the order of magnitude of the eigenvalues, and the transfer matrix is generated from the corresponding eigenvectors. The dot product of the intermediate layer features with the transfer matrix gives the service key features. Fig. 5(a), (b) and (c) show the key feature distributions of the key services. As can be seen from the figures, the key features obtained after PCA dimension reduction of services of the same class share the same distribution property as the intermediate layer features.
Example 3: key features solicit parameters and baseline configuration: after 12-dimensional key features of the service are obtained, normal distribution parameters are calculated for each dimension by using a central limit theorem, taking a specific class of service as an example, the key features are randomly grouped according to 20 times of sampling, feature mean values are calculated for each dimension in each group of key features, the first-dimensional feature distribution of the service is shown by referring to fig. 6(a), the fourth-dimensional feature distribution of the service is shown by referring to fig. 6(b), the tenth-dimensional feature distribution of the service is shown by referring to fig. 6(c), and after distribution conversion, each dimension in the key features of the service can be represented by normal distribution.
After the feature distribution of each dimension of the service is obtained, the service baseline model is constructed using the joint Gaussian distribution; the key step is calculating the covariance matrix $\Sigma$ between the dimensions; the key feature dimension of the service was determined to be 10 in example 2, so the covariance matrix is 10 × 10, and the covariance result of the maximum likelihood estimation is as follows:
The final service baseline model is constructed from the maximum likelihood estimation results $(\mu, \Sigma)$.