Network encryption traffic identification method based on deep learning
1. A network encryption traffic identification method based on deep learning is characterized in that: the method is realized by the following steps:
step one, acquiring a data set;
step two, preprocessing the data set;
step three, balancing a data set by utilizing an SMOTE algorithm to obtain a data sample flow;
step four, training the DenseNet model, and automatically extracting the characteristics by using the trained model;
and step five, adding a softmax layer, and identifying and judging the encrypted flow.
2. The method for network encryption traffic identification based on deep learning of claim 1, wherein: the step of preprocessing the data set specifically comprises:
step 2.1, data load extraction:
reading the pcap file with the Scapy module; after the traffic data is read, parsing the structure of each data flow with Scapy, extracting the load information of each flow, namely the payload bytes, and storing it;
step 2.2, data load processing:
uniformly truncating the payload data to a length of n = 1024 bytes, wherein for over-long data only the first n bytes are kept, and data shorter than n bytes is padded with 0; removing the data link layer bytes of the data packet; then, to eliminate the influence of experimental error, padding the UDP header with 0; and normalizing the extracted packet bytes from [0,255] to [0,1], and filling the data of each packet into a matrix of dimension 32 × 32.
3. The method for network encryption traffic identification based on deep learning of claim 1, wherein: the method for obtaining the data sample flow by utilizing the SMOTE algorithm to balance the data set comprises the following steps:
firstly, finding the center of the minority-class samples according to Euclidean distance, and dividing the samples into core-layer sample points, second-layer sample points and outermost-layer sample points according to their Euclidean distance from the center point; wherein the sample points are distributed evenly among the layers;
secondly, setting a different selection probability for each layer, wherein the selection probabilities of the three layers are assigned in order of distance from the center point, from near to far; finally, performing linear interpolation to achieve sample balance;
the specific algorithm process is realized as follows:
a) let T be the number of samples in the minority class of the training set, and let the goal be to synthesize N new minority-class samples (N must be a positive integer and N > T); the feature vector of a minority-class sample is $X_i$, $i \in \{1, \ldots, T\}$;
b) calculating the center point of all minority-class samples according to Euclidean distance, dividing the minority-class samples by distance from the center point into core-layer sample points $X_{i1}$, second-layer sample points $X_{i2}$ and outermost-layer sample points $X_{i3}$, and setting the selection probability of each layer from high to low in that order;
c) selecting any point X from the minority-class samples, selecting its K nearest neighbors Y of the same class according to the KNN algorithm, and then generating a random number $\zeta_1$ between 0 and 1, thereby synthesizing K new samples Z, where Z is defined as follows:
$Z = X + \zeta_1 \times (Y - X)$  (2)
d) repeating step c) N times, thereby synthesizing KN new samples $X_{new}$, $new \in \{1, \ldots, N\}$;
e) carrying out the operations of steps b) to d) on all minority-class samples, thereby achieving sample balance.
4. The method for network encryption traffic identification based on deep learning of claim 1 or 2, wherein: in the step of training the DenseNet model and automatically extracting features with the trained model,
in DenseNet, the input of each layer comes from the outputs of all previous layers, and the input formula of each layer and the total number of connections are:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$  (3)
$C_{sum} = L(L+1)/2$  (4)
wherein $X_l$ is the input of the l-th layer, $H_l$ represents a nonlinear transformation, $C_{sum}$ is the total number of connections, and L is the number of layers.
Background
In research on deep-learning-based encrypted traffic identification, the ultimate task is fine-grained identification of the application service carried by the encrypted traffic. The relevant work covers both traffic identification and encrypted traffic identification. Network traffic identification has many conventional techniques, and as the technology has advanced they can be roughly divided into the following categories: port-number-based identification, deep packet inspection (DPI), and machine-learning-based identification.
As Internet applications have diversified, the accuracy of port-number-based identification has steadily declined. With the growth of peer-to-peer (P2P) traffic, many application services use dynamic ports, i.e. they no longer use the well-known port numbers in the mapping table, and many web and FTP servers allow port numbers to be specified manually to increase server flexibility. In addition, to hide their traffic from detection, many malicious programs use dynamic ports and port disguise, which further reduces the accuracy of port-based identification.
Deep packet inspection (DPI) has very high identification accuracy and is simple and effective, but it has disadvantages: for example, its labor cost is high, and with the current spread of network applications and encryption technology, DPI can no longer meet traffic identification requirements.
Machine-learning-based identification is currently the common traffic identification technique, and it is therefore often used in encrypted traffic identification research. Like traffic of unknown protocols, application services that use encryption are increasing. For encrypted traffic, Okada Y et al. identify the application layer protocol by using information in the data flow that is unaffected by encryption, such as packet byte length and duration. Based on the correlation between unencrypted and encrypted traffic, Alshammari R et al. use machine learning algorithms and achieve good results in identifying encrypted traffic. Haffner P et al. use a variety of supervised learning algorithms to demonstrate the feasibility of machine learning for encrypted traffic identification. Zhang Bo realizes network application identification by combining feature statistics with machine learning. For encrypted traffic represented by Skype and SSH, Korczyński M et al. successfully identify the type of application service carried under the Skype protocol, such as video, voice and text. Alshammari R et al. select several machine learning algorithms to identify encrypted traffic using the differing flow attributes of different kinds of encrypted traffic, and have carried out a number of studies in this area, including the identification of P2P traffic. Because different encryption protocols encrypt and encapsulate data differently, different application data exhibit different characteristics and regularities; likewise, when a large amount of traffic of the same encryption protocol is available, regularities can be found in it, which makes identification and classification of encryption protocols possible.
In recent years, with the rapid development of network technologies and the increasing emphasis on private data, encryption technologies such as SSL, SSH and Tor have been widely used in network communications, and network encrypted traffic is growing rapidly and changing the threat landscape. Attackers use encryption as a tool for hiding their activities, and encrypted traffic gives malicious attackers an opportunity to conceal command-and-control activity. Encrypted traffic must be identified before it can be analyzed. High-accuracy identification and detection of encrypted traffic is therefore of practical importance for ensuring network information security and maintaining normal network operation. Traditional feature engineering is time-consuming and labor-intensive when extracting and selecting traffic features, which is why the present research on encrypted traffic identification is meaningful.
Disclosure of Invention
The invention aims to solve the problems of time and labor consumption caused by flow characteristic extraction and selection in the traditional characteristic engineering, and provides a network encryption flow identification method based on deep learning.
A network encryption traffic identification method based on deep learning is realized by the following steps:
step one, acquiring a data set;
step two, preprocessing the data set;
step three, balancing a data set by utilizing an SMOTE algorithm to obtain a data sample flow;
step four, training the DenseNet model, and automatically extracting the characteristics by using the trained model;
and step five, adding a softmax layer, and identifying and judging the encrypted flow.
In an embodiment of the present invention, preferably, the step of preprocessing the data set includes:
step 2.1, data load extraction:
reading the pcap file with the Scapy module; after the traffic data is read, parsing the structure of each data flow with Scapy, extracting the load information of each flow, namely the payload bytes, and storing it;
step 2.2, data load processing:
uniformly truncating the payload data to a length of n = 1024 bytes, wherein for over-long data only the first n bytes are kept, and data shorter than n bytes is padded with 0; removing the data link layer bytes of the data packet; then, to eliminate the influence of experimental error, padding the UDP header with 0; and normalizing the extracted packet bytes from [0,255] to [0,1], and filling the data of each packet into a matrix of dimension 32 × 32.
In an embodiment of the present invention, preferably, the step of obtaining the data sample stream by balancing the data set with the SMOTE algorithm includes:
firstly, finding the center of the minority-class samples according to Euclidean distance, and dividing the samples into core-layer sample points, second-layer sample points and outermost-layer sample points according to their Euclidean distance from the center point; wherein the sample points are distributed evenly among the layers;
secondly, setting a different selection probability for each layer, wherein the selection probabilities of the three layers are assigned in order of distance from the center point, from near to far; finally, performing linear interpolation to achieve sample balance;
the specific algorithm process is realized as follows:
a) let T be the number of samples in the minority class of the training set, and let the goal be to synthesize N new minority-class samples (N must be a positive integer and N > T); the feature vector of a minority-class sample is $X_i$, $i \in \{1, \ldots, T\}$;
b) calculating the center point of all minority-class samples according to Euclidean distance, dividing the minority-class samples by distance from the center point into core-layer sample points $X_{i1}$, second-layer sample points $X_{i2}$ and outermost-layer sample points $X_{i3}$, and setting the selection probability of each layer from high to low in that order;
c) selecting any point X from the minority-class samples, selecting its K nearest neighbors Y of the same class according to the KNN algorithm, and then generating a random number $\zeta_1$ between 0 and 1, thereby synthesizing K new samples Z, where Z is defined as follows:
$Z = X + \zeta_1 \times (Y - X)$  (2)
d) repeating step c) N times, thereby synthesizing KN new samples $X_{new}$, $new \in \{1, \ldots, N\}$;
e) carrying out the operations of steps b) to d) on all minority-class samples, thereby achieving sample balance.
In one embodiment of the present invention, preferably, in the step of training the DenseNet model and automatically extracting features with the trained model,
in DenseNet, the input of each layer comes from the outputs of all previous layers, and the input formula of each layer and the total number of connections are:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$  (3)
$C_{sum} = L(L+1)/2$  (4)
wherein $X_l$ is the input of the l-th layer, $H_l$ represents a nonlinear transformation, $C_{sum}$ is the total number of connections, and L is the number of layers.
The invention has the beneficial effects that:
The invention also considers the influence of class imbalance in the sample data set on classification precision, and provides a deep-learning-based encrypted traffic identification model for class-imbalanced data. First, the model preprocesses the data set: the first n bytes of each data flow are kept, and flows shorter than n bytes are padded with 0; to prevent physical hardware from affecting the classification, the data link layer bytes of each packet are removed; since the UDP header is 12 bytes shorter than the TCP header, the UDP header is padded with 0 to eliminate the influence of experimental error; and to obtain the best classification effect, the extracted packet bytes are normalized. Next, the data set is balanced with the SMOTE algorithm to obtain the data sample flow. Then DenseNet is trained to extract features automatically, which solves the problem that traditional feature engineering is time-consuming and labor-intensive when extracting and selecting traffic features. Finally, a softmax layer is added to make the decision and complete the identification of the encrypted traffic.
The invention adopts an improved SMOTE algorithm to balance the classes when the data set is class-imbalanced. The DenseNet network structure is used for encrypted traffic identification; it alleviates the vanishing-gradient problem caused by increasing network depth, strengthens the propagation of feature maps, encourages feature reuse, and realizes short-circuit connections between layers, so that the network can be trained deeper, more effectively and more accurately.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a flow chart of data preprocessing according to the present invention;
FIG. 3 is a data load processing diagram of the present invention;
FIG. 4 is a schematic diagram of the SMOTE algorithm;
FIG. 5 is a schematic diagram of the improved SMOTE algorithm of the present invention;
FIG. 6 is a dense block structure diagram;
FIG. 7 is a schematic diagram of convolution and the effect of the convolution kernel;
FIG. 8 is a structural diagram of DenseNet;
FIG. 9 is a schematic view of maximum pooling.
Detailed Description
The first embodiment is as follows:
in this embodiment, as shown in fig. 1, a method for identifying network encryption traffic based on deep learning is implemented by the following steps:
step one, acquiring a data set;
step two, preprocessing the data set;
step three, balancing a data set by utilizing an SMOTE algorithm to obtain a data sample flow;
step four, training the DenseNet model, and automatically extracting the characteristics by using the trained model;
and step five, adding a softmax layer, and identifying and judging the encrypted flow.
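Step five can be illustrated with a minimal sketch of a softmax decision over the class scores produced by the trained network. The function names `softmax` and `predict`, and the example class labels, are assumptions made for illustration and do not appear in the original text.

```python
import math

def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits, labels):
    """Return the most probable label and its probability (hypothetical helper)."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]

# Hypothetical scores for three traffic classes.
label, prob = predict([0.1, 3.0, -1.0], ["SSL", "SSH", "Tor"])
print(label, round(prob, 3))
```

The class with the largest probability is taken as the identified traffic type.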
The second embodiment is as follows:
Different from the first embodiment, in the method for identifying network encrypted traffic based on deep learning of the present embodiment, the step of preprocessing the data set specifically includes:
step 2.1, data load extraction:
reading the pcap file with the Scapy module; after the traffic data is read, parsing the structure of each data flow with Scapy, extracting the load information of each flow, namely the payload bytes, and storing it in preparation for further processing. The data preprocessing flow is shown in fig. 2.
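The load extraction of step 2.1 might look roughly like the following sketch. It assumes Scapy is installed and that a flow is identified by a direction-independent five-tuple; the helper names `flow_key` and `extract_flow_payloads` are hypothetical, and the original text does not specify how flows are keyed or stored.

```python
from collections import defaultdict

def flow_key(src, sport, dst, dport, proto):
    """Direction-independent five-tuple key (assumed flow definition)."""
    a, b = sorted([(src, sport), (dst, dport)])
    return (a, b, proto)

def extract_flow_payloads(pcap_path):
    """Group payload bytes by flow. Requires Scapy; imported lazily."""
    from scapy.all import rdpcap, IP, TCP, UDP, Raw

    flows = defaultdict(bytes)
    for pkt in rdpcap(pcap_path):
        # Skip packets without an IP layer or without payload bytes.
        if not (pkt.haslayer(IP) and pkt.haslayer(Raw)):
            continue
        if pkt.haslayer(TCP):
            l4, proto = pkt[TCP], "TCP"
        elif pkt.haslayer(UDP):
            l4, proto = pkt[UDP], "UDP"
        else:
            continue
        key = flow_key(pkt[IP].src, l4.sport, pkt[IP].dst, l4.dport, proto)
        flows[key] += bytes(pkt[Raw].load)  # append payload bytes to the flow
    return dict(flows)
```

Calling `extract_flow_payloads("capture.pcap")` (the file name is a placeholder) would return a dictionary mapping each flow key to its concatenated payload bytes.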
Step 2.2, data load processing:
When an application transmits data, the file may be a picture, audio, video or the like; such files are generally large, and a single TCP packet is far from enough to carry the complete information. TCP therefore segments the data, so the extracted data contains a large number of segments. The length of one such frame is generally 1514 bytes, and after the Ethernet-layer header, the IP-layer header and the TCP header are removed, the length of the payload data is generally 1460 bytes.
However, the UDP protocol has no function for segmenting very long data, so when the data length exceeds the 1500 bytes specified by the maximum transmission unit (MTU), IP fragmentation is performed at the network layer. In addition, the UDP header differs in length from the TCP header, and because UDP is a connectionless protocol, the length of the data carried over UDP differs markedly from that carried over TCP.
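The byte counts quoted above can be checked with a short calculation. The header sizes below are the standard minimum sizes (Ethernet II without a VLAN tag, IPv4 and TCP without options), which is an assumption about the captured traffic rather than something stated in the text.

```python
ETH_HEADER = 14   # Ethernet II header, no VLAN tag, no frame check sequence
IP_HEADER = 20    # IPv4 header without options
TCP_HEADER = 20   # TCP header without options
UDP_HEADER = 8    # UDP header is always 8 bytes
MTU = 1500        # maximum transmission unit in bytes

tcp_payload = 1514 - ETH_HEADER - IP_HEADER - TCP_HEADER  # full-size TCP frame
udp_payload = MTU - IP_HEADER - UDP_HEADER                # largest unfragmented UDP payload
header_gap = TCP_HEADER - UDP_HEADER                      # bytes to pad in the UDP header

print(tcp_payload, udp_payload, header_gap)  # 1460 1472 12
```

The 12-byte gap is the amount of zero padding applied to the UDP header in the processing step that follows.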
To ensure the consistency of the data, the payload data is uniformly truncated to a length of n = 1024 bytes: for over-long data only the first n bytes are kept, and data shorter than n bytes is padded with 0. To prevent physical hardware from affecting the classification, the data link layer bytes of each packet are removed. Then, since the UDP header is 12 bytes shorter than the TCP header, the UDP header is padded with 0 to eliminate the influence of experimental error. To obtain the best classification effect, the extracted packet bytes are normalized from [0,255] to [0,1], and the data of each packet is filled into a matrix of dimension 32 × 32. The specific data load processing is shown in fig. 3.
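The processing chain described above can be sketched in plain Python as follows. The function names are hypothetical, and the sketch assumes a 14-byte Ethernet II header and a 20-byte IPv4 header without options; the original text does not state these sizes, nor the exact position of the zero padding.

```python
ETH_HEADER_LEN = 14  # assumption: Ethernet II frame without VLAN tag

def strip_link_layer(frame: bytes) -> bytes:
    """Drop the data link layer bytes so hardware addresses cannot bias the model."""
    return frame[ETH_HEADER_LEN:]

def pad_udp_header(ip_packet: bytes, ip_header_len: int = 20) -> bytes:
    """Append 12 zero bytes to the 8-byte UDP header to match a 20-byte TCP header."""
    udp_end = ip_header_len + 8
    return ip_packet[:udp_end] + bytes(12) + ip_packet[udp_end:]

def to_matrix(data: bytes, n: int = 1024, side: int = 32):
    """Keep the first n bytes, zero-pad short data, scale to [0, 1], reshape to side x side."""
    fixed = data[:n].ljust(n, b"\x00")
    scaled = [b / 255.0 for b in fixed]
    return [scaled[r * side:(r + 1) * side] for r in range(side)]

# Hypothetical UDP frame: Ethernet + IPv4 + UDP headers followed by 4 payload bytes.
frame = bytes(14) + bytes([0x45]) + bytes(19) + bytes(8) + b"\xff" * 4
matrix = to_matrix(pad_udp_header(strip_link_layer(frame)))
print(len(matrix), len(matrix[0]))  # 32 32
```

Each resulting 32 × 32 matrix can then be treated as a single-channel image by the network.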
The third embodiment is as follows:
Different from the first or second embodiment, in the method for identifying network encrypted traffic based on deep learning of this embodiment, the step of obtaining the data sample flow by balancing the data set using the SMOTE algorithm specifically includes:
SMOTE, the Synthetic Minority Oversampling Technique, was proposed by Chawla in 2002 to address the problem of data imbalance and is an improvement on the random oversampling algorithm. It is now a common means of handling imbalanced data, accepted in both academia and industry, and it avoids the overfitting and poor generalization caused by a strategy of simply copying samples. The basic idea of the SMOTE algorithm is as follows: first, a sample point is randomly selected and the Euclidean distances between it and the other sample points of the same class are calculated; K sample points are then selected using the K-nearest-neighbor idea; finally, a new sample is generated by random linear interpolation between two points, so that minority-class samples are added to balance the data set. The Euclidean distance formula is as follows:
$dist(X, Y) = \sqrt{\sum_{k=1}^{d} (x_k - y_k)^2}$  (1)
where dist(X, Y) is the Euclidean distance between sample point X and sample point Y, and $x_k$ and $y_k$ are the k-th of their d feature components.
The traditional SMOTE algorithm randomly selects samples among the K nearest neighbors for linear interpolation. The invention does not adopt this random selection, but proceeds as follows:
firstly, finding the center of the minority-class samples according to Euclidean distance, and dividing the samples into core-layer sample points, second-layer sample points and outermost-layer sample points according to their Euclidean distance from the center point; wherein the sample points are distributed evenly among the layers;
secondly, setting a different selection probability for each layer, wherein the selection probabilities of the three layers are assigned in order of distance from the center point, from near to far; finally, performing linear interpolation to achieve sample balance. Fig. 4 is a schematic diagram of the SMOTE algorithm. Fig. 5 is a schematic diagram of the improved SMOTE algorithm of the present invention.
The specific algorithm process is realized as follows:
a) let T be the number of samples in the minority class of the training set, and let the goal be to synthesize N new minority-class samples (N must be a positive integer and N > T); the feature vector of a minority-class sample is $X_i$, $i \in \{1, \ldots, T\}$;
b) calculating the center point of all minority-class samples according to Euclidean distance, dividing the minority-class samples by distance from the center point into core-layer sample points $X_{i1}$, second-layer sample points $X_{i2}$ and outermost-layer sample points $X_{i3}$, and setting the selection probability of each layer from high to low in that order;
c) selecting any point X from the minority-class samples, selecting its K nearest neighbors Y of the same class according to the KNN algorithm, and then generating a random number $\zeta_1$ between 0 and 1, thereby synthesizing K new samples Z, where Z is defined as follows:
$Z = X + \zeta_1 \times (Y - X)$  (2)
d) repeating step c) N times, thereby synthesizing KN new samples $X_{new}$, $new \in \{1, \ldots, N\}$;
e) carrying out the operations of steps b) to d) on all minority-class samples, thereby achieving sample balance.
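Steps a) to d) can be sketched in plain Python as below. This is only one possible reading of the procedure: the function name `layered_smote`, the layer probabilities (0.5, 0.3, 0.2), the use of the arithmetic mean as the center point, and the drawing of a fresh random number for each neighbor are all assumptions, since the original text gives no concrete values.

```python
import math
import random

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def layered_smote(minority, n_rounds, k=3, layer_probs=(0.5, 0.3, 0.2), seed=0):
    """Synthesize k * n_rounds minority samples with layer-weighted seed selection."""
    rng = random.Random(seed)
    dim = len(minority[0])
    # Step b): center point, then three equally sized layers by distance to it.
    center = [sum(s[d] for s in minority) / len(minority) for d in range(dim)]
    ranked = sorted(minority, key=lambda s: euclid(s, center))
    third = math.ceil(len(ranked) / 3)
    layers = [ranked[:third], ranked[third:2 * third], ranked[2 * third:]]
    usable = [(layer, p) for layer, p in zip(layers, layer_probs) if layer]
    pools = [layer for layer, _ in usable]
    weights = [p for _, p in usable]
    synthetic = []
    for _ in range(n_rounds):  # step d): repeat step c)
        # Step c): pick a seed X, favoring layers closer to the center.
        x = rng.choice(rng.choices(pools, weights=weights)[0])
        others = [s for s in minority if s is not x]
        neighbours = sorted(others, key=lambda s: euclid(s, x))[:k]
        for y in neighbours:
            zeta = rng.random()  # random number between 0 and 1
            synthetic.append([xi + zeta * (yi - xi) for xi, yi in zip(x, y)])  # formula (2)
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [2.0, 2.0]]
new_samples = layered_smote(minority, n_rounds=4, k=3)
print(len(new_samples))  # 12
```

Because each synthetic point lies on the segment between a seed and one of its neighbors, the new samples stay inside the region spanned by the existing minority-class samples.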
The fourth embodiment is as follows:
Different from the third embodiment, in the method for identifying network encrypted traffic based on deep learning of the present embodiment, the DenseNet model is trained and the trained model is used to automatically extract features as follows:
In a conventional convolutional neural network with L layers there are L connections, but in DenseNet the input of each layer comes from the outputs of all previous layers. The input formula of each layer and the total number of connections are:
$X_l = H_l([X_0, X_1, \ldots, X_{l-1}])$  (3)
$C_{sum} = L(L+1)/2$  (4)
wherein $X_l$ is the input of the l-th layer, $H_l$ represents a nonlinear transformation, $C_{sum}$ is the total number of connections, and L is the number of layers.
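The dense connectivity of formulas (3) and (4) can be illustrated with a toy sketch in which feature maps are plain lists and each $H_l$ is an arbitrary function. The name `dense_block` and the toy transformation are assumptions for illustration only, not the network used by the invention.

```python
def dense_block(x0, layers):
    """Apply X_l = H_l([X_0, ..., X_{l-1}]) for each H_l and count the connections."""
    outputs = [x0]
    connections = 0
    for h in layers:
        # Concatenate the outputs of all previous layers, as in formula (3).
        concatenated = [v for out in outputs for v in out]
        connections += len(outputs)  # one connection from every earlier output
        outputs.append(h(concatenated))
    return outputs, connections

# Toy nonlinear transformation: ReLU of the sum of all inputs.
toy_layer = lambda values: [max(0.0, sum(values))]
outputs, connections = dense_block([1.0, 2.0], [toy_layer] * 4)

L = 4
print(connections, L * (L + 1) // 2)  # 10 10
```

With L layers, the l-th layer receives l incoming connections, so the total is 1 + 2 + ... + L = L(L+1)/2, which matches formula (4).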
The structure in fig. 6 is a dense block. Each nonlinear transformation includes Batch Normalization (BN), which normalizes the input to the activation function and thereby counteracts the influence of shifts and growth in the input data. It also includes ReLU, a nonlinear activation function, whose formula is as follows:
$\mathrm{ReLU}(x) = \max(0, x)$  (5)
Convolution is also included, in which one neuron is connected to only part of the neurons of the neighboring layer. Weight sharing means that neurons in the same feature plane of a convolutional layer share weights, i.e. the convolution kernel. Sharing weights reduces the connections between network layers and lowers the risk of overfitting. The convolution kernel is usually initialized as a matrix of small random decimals and acquires reasonable weights through network training. The effect of convolution and of the convolution kernel is schematically illustrated in fig. 7. The structure in fig. 8 is a complete DenseNet comprising 3 dense blocks.
Between the dense blocks there are convolutional layers and pooling layers; convolution has been described above. Pooling, also known as subsampling, typically takes two forms: mean subsampling and maximum subsampling. Pooling layers are periodically inserted between convolutional layers. The purpose of subsampling is to stop attending to the exact locations of features and let the system attend only to the relative locations between features, thereby progressively reducing the spatial size of the data. This reduces the number of parameters and the amount of computation, and avoids overfitting to some extent. Maximum pooling is schematically shown in FIG. 9.
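Weight sharing and maximum pooling can be illustrated on a small example. The sketch below uses nested lists instead of a deep learning framework; the function names and the 2 × 2 kernel are assumptions for illustration, and the "convolution" is the unflipped sliding-window form commonly used in convolutional networks.

```python
def conv2d(image, kernel):
    """Slide one shared kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def relu(feature_map):
    """Apply formula (5) element-wise."""
    return [[max(0, v) for v in row] for row in feature_map]

def max_pool(feature_map, size=2):
    """Keep the maximum of each non-overlapping size x size window."""
    rows = range(0, len(feature_map) - size + 1, size)
    cols = range(0, len(feature_map[0]) - size + 1, size)
    return [[max(feature_map[r + i][c + j]
                 for i in range(size) for j in range(size))
             for c in cols]
            for r in rows]

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
print(max_pool(image))                      # [[6, 8], [14, 16]]
print(conv2d(image, [[1, 0], [0, -1]]))     # every entry is -5
```

The same four kernel weights are reused at every position, which is what keeps the number of parameters small, while pooling halves each spatial dimension.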
To objectively evaluate the performance of the algorithm, three scoring indexes are chosen: precision P, recall R and the $F_1$-measure. Precision is the proportion of samples correctly predicted as positive among all samples predicted as positive; recall is the proportion of samples correctly predicted as positive among all samples that are actually positive; the $F_1$ value is a comprehensive evaluation index defined as the harmonic mean of precision and recall.
The calculation formulas are as follows:
$P = T_P / (T_P + F_P)$  (6)
$R = T_P / (T_P + F_N)$  (7)
$F_1 = 2PR / (P + R)$  (8)
wherein $T_P$ (true positives) is the number of encrypted traffic samples that are correctly identified; $F_P$ (false positives) is the number of samples that are incorrectly identified as encrypted traffic; and $F_N$ (false negatives) is the number of encrypted traffic samples that are incorrectly identified as non-encrypted traffic.
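Formulas (6) to (8) can be computed directly from the three counts. The function name `precision_recall_f1` and the example counts below are hypothetical and are not results reported for the invention.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (P, R, F1) from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0   # formula (6)
    recall = tp / (tp + fn) if tp + fn else 0.0      # formula (7)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # formula (8)
    return precision, recall, f1

# Hypothetical counts: 90 encrypted flows found, 10 false alarms, 30 missed.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.9 0.75 0.818
```

The guards against zero denominators are a design choice for the sketch; the formulas themselves are undefined in those cases.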
The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.