Internal threat early warning method based on user portrait
1. An internal threat early warning method based on user portrait, characterized in that
the method comprises the following steps: acquiring data and preprocessing the data to obtain intrinsic characteristic data;
performing user portrait using a hierarchical clustering method based on the intrinsic characteristic data to obtain user groups;
and, for the user groups, issuing an early warning when an internal threat attack occurs.
2. The internal threat early warning method based on user portrait of claim 1, characterized in that
the specific steps of acquiring the data and preprocessing the data to obtain the intrinsic characteristic data are as follows:
acquiring experimental data;
extracting, in parallel on the Spark platform and using the user name as the key, the intrinsic data corresponding to each user in the data set;
and, for each user, extracting the corresponding intrinsic characteristic attribute data and normalizing it.
3. The internal threat early warning method based on user portrait of claim 2, characterized in that
the experimental data is the internal threat test data set released by the CERT Division of Carnegie Mellon University.
4. The internal threat early warning method based on user portrait of claim 3, wherein the intrinsic characteristic attribute data includes business attribute data and personal attribute data.
5. The internal threat early warning method based on user portrait of claim 4, characterized in that
the business attribute data comprises role, project, business unit, functional unit, department, group and supervisor, and the personal attribute data comprises openness, conscientiousness, extraversion, agreeableness and neuroticism.
6. The internal threat early warning method based on user portrait of claim 4, characterized in that
the specific steps of performing user portrait using the hierarchical clustering method are as follows:
calculating a first attribute similarity for the personal attribute data based on the Euclidean distance;
calculating a second attribute similarity for the business attribute data based on the degree of matching between attribute values;
calculating a total attribute similarity based on the first attribute similarity and the second attribute similarity;
determining the final number of user groups for the hierarchical clustering based on the silhouette coefficient.
Background
In addition to traditional host and network behavior data, internal threat researchers are increasingly exploring the association between users' intrinsic characteristic data and internal threats. In a real network environment, users differ in personality and experience and fall into many types. A hierarchical clustering method is therefore proposed for portraying users by their intrinsic characteristic attributes: hierarchical clustering does not require the number of clusters to be specified in advance, can represent the hierarchical relationship of the data as needed, and can reveal hierarchical relationships among users in the user portrait field. In this field a user has two different types of data, qualitative and quantitative, whereas attribute similarity in traditional hierarchical clustering algorithms is mostly calculated with a single measure. Applying such a measure directly to the heterogeneous data of the user portrait field leads to inaccurate clustering and inaccurate portraits. An attribute similarity calculation method that jointly handles quantitative and qualitative data is therefore proposed.
Disclosure of Invention
The invention aims to provide an internal threat early warning method based on user portrait, in order to solve the problem that existing methods achieve low accuracy in both the portrait and the clustering.
In order to achieve the above object, the present invention provides an internal threat early warning method based on user portrait, comprising: acquiring data and preprocessing the data to obtain intrinsic characteristic data; performing user portrait using a hierarchical clustering method based on the intrinsic characteristic data to obtain user groups; and, for the user groups, issuing an early warning when an internal threat attack occurs.
The specific steps of acquiring the data and preprocessing the data to obtain the intrinsic characteristic data are as follows: acquiring experimental data; extracting, in parallel on the Spark platform and using the user name as the key, the intrinsic data corresponding to each user in the data set; and, for each user, extracting the corresponding intrinsic characteristic attribute data and normalizing it.
The experimental data is the internal threat test data set released by the CERT Division of Carnegie Mellon University.
Wherein the intrinsic characteristic attribute data comprises business attribute data and personal attribute data.
The business attribute data comprises role, project, business unit, functional unit, department, group and supervisor, and the personal attribute data comprises openness, conscientiousness, extraversion, agreeableness and neuroticism.
The specific steps of performing user portrait using the hierarchical clustering method to obtain user groups are as follows: calculating a first attribute similarity for the personal attribute data based on the Euclidean distance; calculating a second attribute similarity for the business attribute data based on the degree of matching between attribute values; calculating a total attribute similarity based on the first attribute similarity and the second attribute similarity; and determining the final number of user groups for the hierarchical clustering based on the silhouette coefficient.
The invention relates to an internal threat early warning method based on user portrait, which comprises: acquiring data and preprocessing the data to obtain intrinsic characteristic data; performing user portrait using a hierarchical clustering method based on the intrinsic characteristic data to obtain user groups; and, for the user groups, issuing an early warning when an internal threat attack occurs. To address the diversity of internal users, hierarchical clustering is proposed as the method for portraying internal users, which improves the accuracy of the portrait. Users' intrinsic characteristic data contain both quantitative and qualitative data, whereas most traditional clustering algorithms use a single similarity measure such as Euclidean distance or cosine similarity and therefore cannot be applied well to user portrait. The attribute similarities of the quantitative data and of the qualitative data are therefore calculated separately and combined by weighted summation as the similarity measure, which improves the accuracy of the clustering.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a first attacker distribution diagram of the present invention;
FIG. 2 is a second attacker distribution diagram of the present invention;
FIG. 3 is a flow chart of the internal threat early warning method based on user portrait of the present invention;
FIG. 4 is a flow chart of acquiring data and preprocessing the data to obtain intrinsic characteristic data according to the present invention;
FIG. 5 is a flow chart of performing user portrait using a hierarchical clustering method based on intrinsic characteristic data to obtain user groups according to the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary, are intended to explain the invention, and are not to be construed as limiting the invention.
Referring to FIGS. 1 to 5, the present invention provides an internal threat early warning method based on user portrait, comprising:
S101, acquiring data and preprocessing the data to obtain intrinsic characteristic data;
The specific steps are as follows:
S201, acquiring experimental data;
The data set used in the experiment is the CERT-IT data set, an internal threat test data set released by the CERT Division of Carnegie Mellon University. The data set has multiple versions, from r1 to r6; version r5.2 is used here. The CERT data set consists of a number of files that log employees' behavior in the organization. logon.csv, http.csv, email.csv, device.csv and psychometric.csv contain logins and logouts, website access, e-mail, the copying of files to removable disks, the times and actions of connecting and disconnecting removable disks, and employees' psychological test scores; an LDAP file contains users' positions, departments, work periods and the projects they participate in. The method uses the LDAP file with position information and the psychometric file with users' Big Five personality test scores.
S202, extracting, in parallel on the Spark platform and using the user name as the key, the intrinsic data corresponding to each user in the data set;
Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. It is a general parallel framework similar to Hadoop MapReduce, developed at the UC Berkeley AMP Lab (Algorithms, Machines, and People Lab) and used to build large, low-latency data analysis applications. Spark retains the advantages of Hadoop MapReduce but, unlike MapReduce, can keep the intermediate output of a job in memory, so that it does not need to be read from and written to HDFS. Spark is therefore better suited to iterative algorithms such as those used in data mining and machine learning. Spark is implemented in Scala and uses Scala as its application framework; the two are tightly integrated, so that Scala can manipulate distributed data sets as easily as local collection objects. Although Spark was created to support iterative jobs on distributed data sets, it is in practice a complement to Hadoop and can run in parallel on the Hadoop file system, which is supported through the third-party cluster framework Mesos.
S203, for each user, extracting the corresponding intrinsic characteristic attribute data and normalizing it.
The intrinsic characteristic attribute data of a single user includes business attribute data and personal attribute data. The corresponding intrinsic characteristic attribute data are extracted and processed to obtain 15 features in total, wherein the business attribute data comprises role, project, business unit, functional unit, department, group and supervisor, and the personal attribute data comprises openness, conscientiousness, extraversion, agreeableness and neuroticism.
The characteristics are shown in table 1:
TABLE 1
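Steps S202 and S203 can be illustrated with a small, single-machine sketch in plain Python. It is not the Spark job itself: it only shows the idea of collecting each user's records under the user name as the key and then normalizing the numeric personality scores. The field names, the sample rows and the choice of min-max normalization are all assumptions made for illustration; the text does not state which normalization is used.

```python
from collections import defaultdict

# Hypothetical sample rows; field names and values are illustrative
# assumptions and are not taken from the CERT files.
ldap_rows = [
    {"user": "U001", "role": "Engineer", "department": "R&D"},
    {"user": "U002", "role": "Salesman", "department": "Sales"},
]
psych_rows = [
    {"user": "U001", "O": 40, "C": 20, "E": 30, "A": 25, "N": 35},
    {"user": "U002", "O": 20, "C": 40, "E": 10, "A": 45, "N": 15},
]
TRAITS = ["O", "C", "E", "A", "N"]


def merge_by_user(*sources):
    """Collect every row of every source under its user name (the key)."""
    merged = defaultdict(dict)
    for rows in sources:
        for row in rows:
            merged[row["user"]].update(row)
    return dict(merged)


def min_max_normalize(users, keys):
    """Scale each numeric feature to [0, 1] across all users (assumed scheme)."""
    for key in keys:
        values = [u[key] for u in users.values()]
        lo, hi = min(values), max(values)
        span = hi - lo
        for u in users.values():
            # A constant feature carries no information; map it to 0.0.
            u[key] = (u[key] - lo) / span if span else 0.0
    return users


users = min_max_normalize(merge_by_user(ldap_rows, psych_rows), TRAITS)
```

In a Spark setting the same grouping would typically be expressed as a keyed join or aggregation over the user name, which is what allows the extraction to run in parallel.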
S102, performing user portrait by using a hierarchical clustering method based on intrinsic characteristic data to obtain a user group;
A hierarchical clustering algorithm based on an improved attribute similarity calculation is adopted to portray users. The basic idea of hierarchical clustering is as follows: each sample is first regarded as its own class, and a distance between classes is defined; the pair of classes with the smallest distance is merged into a new class; the distances between the new class and the other classes are then recalculated according to an inter-class clustering criterion function, and the two classes with the smallest distance are merged again. Each merge reduces the number of classes by one, and the process continues until all samples are merged into a single class or into the specified number of classes.
Internal threat user portrait faces the problems of large data volume and heterogeneous data. Because distance and similarity rules are easy to define in hierarchical clustering and are subject to few restrictions, hierarchical clustering works well on large samples and is therefore feasible for portraying internal threat users. In addition, given the diversity of users and the hierarchical relationships among them, a hierarchical clustering method, which does not require the number of clusters to be specified in advance and can reveal the hierarchical relationships among clusters, can complete the user portrait better.
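The merging procedure described above can be sketched as a minimal agglomerative clustering routine over a precomputed distance matrix. The text does not name the inter-class criterion function, so average linkage is used here purely as an assumption; the function name and the toy data are likewise hypothetical.

```python
def agglomerative(dist, k):
    """Merge the closest pair of clusters until k clusters remain.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns a list of clusters, each a list of sample indices.
    """
    k = max(k, 1)
    # Start with every sample in its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        # Average linkage: mean distance over all cross-cluster pairs
        # (an assumed criterion; the source does not specify one).
        return sum(dist[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > k:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge cluster y into cluster x; the cluster count drops by one.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy example: four points on a line at positions 0, 1, 10 and 11.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
groups = agglomerative(dist, 2)
```

This brute-force version recomputes every linkage on each pass, which is fine for illustration but not for large user populations.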
The method comprises the following specific steps:
S301, calculating a first attribute similarity for the personal attribute data based on the Euclidean distance;
Let the Big Five personality scores of the personal attributes of user A and user B be $d_{scoreA} = [O_A, C_A, E_A, A_A, N_A]$ and $d_{scoreB} = [O_B, C_B, E_B, A_B, N_B]$, where O, C, E, A and N denote the values of openness, conscientiousness, extraversion, agreeableness and neuroticism, respectively. The attribute similarity of the personal attributes is calculated as shown in the following formula.
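As a hedged illustration of S301, the sketch below computes the Euclidean distance between two five-element score vectors and maps it to a similarity. The exact mapping from distance to similarity is not reproduced in this text, so the form 1 / (1 + d) is an assumption chosen only because it yields 1 for identical vectors and decreases as the distance grows; the function name is hypothetical.

```python
import math


def personal_similarity(score_a, score_b):
    """Similarity of two [O, C, E, A, N] score vectors.

    Uses the Euclidean distance d and the assumed mapping 1 / (1 + d).
    """
    if len(score_a) != len(score_b):
        raise ValueError("score vectors must have the same length")
    d = math.dist(score_a, score_b)  # Euclidean distance (Python 3.8+)
    return 1.0 / (1.0 + d)


# Identical vectors give the maximum similarity of 1.0.
same = personal_similarity([0.2, 0.4, 0.6, 0.8, 1.0], [0.2, 0.4, 0.6, 0.8, 1.0])
# Vectors at Euclidean distance 5 give 1 / 6 under this mapping.
far = personal_similarity([0, 0, 0, 0, 0], [3, 4, 0, 0, 0])
```

If the scores are normalized to [0, 1] beforehand, the distance between two users is bounded, which keeps this similarity on a comparable scale with the qualitative one.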
S302, calculating a second attribute similarity for the business attribute data based on the degree of matching between attribute values;
Suppose that, after conversion, the qualitative data of user A and user B are $d_{busA} = [a_1, a_2, a_3, a_4, a_5, a_6]$ and $d_{busB} = [b_1, b_2, b_3, b_4, b_5, b_6]$, respectively. Let num be the number of attributes whose values are identical in $d_{busA}$ and $d_{busB}$, and let n denote the number of business attribute features. The similarity of the qualitative data is calculated as shown in the following formula.
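A minimal sketch of S302 follows, under the assumption that the qualitative similarity is the fraction num / n of attribute positions at which the two users hold the same value. The text defines num and n but the formula itself is not reproduced here, so this ratio is an interpretation; the function name and sample values are hypothetical.

```python
def business_similarity(bus_a, bus_b):
    """Share of business attributes on which two users agree (assumed num / n)."""
    if len(bus_a) != len(bus_b) or not bus_a:
        raise ValueError("attribute vectors must be non-empty and equally long")
    n = len(bus_a)
    # num counts the positions where both users hold the same value.
    num = sum(1 for a, b in zip(bus_a, bus_b) if a == b)
    return num / n


# Hypothetical encoded attribute vectors for two users.
user_a = ["Engineer", "P1", "BU1", "FU2", "D3", "T7"]
user_b = ["Engineer", "P4", "BU1", "FU2", "D5", "T9"]
sim = business_similarity(user_a, user_b)
```

Because only equality is tested, the attribute values can stay as category labels and need no numeric encoding.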
S303, calculating the total attribute similarity based on the first attribute similarity and the second attribute similarity;
The total attribute similarity of two users is obtained as a weighted combination of the attribute similarity of their quantitative data and the attribute similarity of their qualitative data, calculated as shown in the following formula.
where λ is the weight.
The weighted summation of the two attribute similarities solves the problem of calculating attribute similarity across different types of data and improves the accuracy of the clustering. By changing the weight according to business requirements, the influence of each type of data on the clustering result can be adjusted dynamically, which improves the adaptability of the clustering algorithm.
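The weighted combination in S303 can be sketched as below. The text states only that the two similarities are weighted and summed with weight λ; assigning λ to the personal (quantitative) similarity and 1 - λ to the business (qualitative) similarity is an assumption, as are the function name and the conversion to a distance for use in clustering.

```python
def total_similarity(sim_personal, sim_business, lam=0.5):
    """Weighted sum of the two attribute similarities.

    lam weights the personal similarity and (1 - lam) the business
    similarity; this assignment is an assumption for illustration.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * sim_personal + (1.0 - lam) * sim_business


def to_distance(similarity):
    """Turn a similarity in [0, 1] into a distance for clustering (assumed)."""
    return 1.0 - similarity


balanced = total_similarity(0.8, 0.5, lam=0.5)
business_heavy = total_similarity(0.8, 0.5, lam=0.2)
```

Lowering λ in this sketch makes organizational attributes dominate the grouping, which is one way the weight can be tuned to business requirements.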
S304, determining the final number of user groups for the hierarchical clustering based on the silhouette coefficient.
In a clustering algorithm, the choice of the number of clusters K is crucial, and a suitable K makes the clustering result more representative. The silhouette coefficient method is therefore used here to determine the final number of groups K for the hierarchical clustering.
The silhouette coefficient is calculated as follows:
For the i-th object, calculate the average distance to all other objects in the cluster to which it belongs, denoted $a_i$ (reflecting its degree of cohesion).
For the i-th object, calculate the average distance to all points in the other clusters that do not contain it, denoted $b_i$ (reflecting its degree of separation).
The silhouette coefficient of the i-th object is $l_i = (b_i - a_i) / \max(a_i, b_i)$.
The silhouette coefficient takes values in $[-1, 1]$. The closer $l_i$ is to 1, the more reasonable the clustering result for sample i; the closer $l_i$ is to -1, the less reasonable the clustering result for sample i and the more likely the sample should be assigned to another cluster.
The total silhouette coefficient when the number of cluster groups is K is calculated as shown in the following formula:
where n is the total number of samples.
In actual business, if the number of clusters is too large, each group contains too few users and the clustering of users loses its significance; if the number of clusters is too small, each group contains too many users and joint supervision becomes more difficult. The value of K is therefore chosen as the most reasonable value within the range [20, 40].
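The selection of K in S304 can be sketched as follows. Two assumptions are made: the total silhouette coefficient is taken as the mean of the per-object coefficients over all n samples, and $b_i$ is taken as the smallest mean distance from object i to any other cluster, which is the standard silhouette definition and one reading of the description above. The function names and toy data are hypothetical.

```python
def mean_silhouette(dist, labels):
    """Mean silhouette coefficient for a labeling over a distance matrix."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes 0 (common convention)
        # a_i: mean distance to the other members of the same cluster.
        a_i = sum(dist[i][j] for j in own if j != i) / (len(own) - 1)
        # b_i: smallest mean distance to the members of any other cluster.
        b_i = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a_i, b_i)
        if denom > 0:
            total += (b_i - a_i) / denom
    return total / len(labels)


def best_k(dist, candidates):
    """Pick the K whose labeling has the highest mean silhouette.

    candidates maps each K to the cluster labels obtained for that K.
    """
    return max(candidates, key=lambda k: mean_silhouette(dist, candidates[k]))


# Toy example: four points on a line at positions 0, 1, 10 and 11.
points = [0, 1, 10, 11]
dist = [[abs(p - q) for q in points] for p in points]
good = mean_silhouette(dist, [0, 0, 1, 1])
bad = mean_silhouette(dist, [0, 1, 0, 1])
chosen = best_k(dist, {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]})
```

In the setting described here, `candidates` would hold one labeling per K in the range 20 to 40, each obtained by cutting the hierarchical clustering at that number of groups.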
S103, for the user groups, issuing an early warning when an internal threat attack occurs.
For the user groups obtained by the hierarchical clustering analysis above, when an internal threat attack occurs, supervision of the users in the same group is strengthened, so that attacks are discovered and prevented in advance and early warning is achieved. For example, if a GAN is used for anomaly detection and the anomaly score (obtained from the reconstruction error) of some behavior of a user exceeds the anomaly threshold, so that the behavior is judged abnormal, the threat degree applied in anomaly detection for the other users in the same group is increased. The anomaly scores of those users' real attack behaviors then far exceed the set threshold, which improves detection precision. Moreover, because most attacks do not occur suddenly but are preceded by precursor behaviors, precursor behaviors whose scores originally did not exceed the anomaly threshold can be detected once the threat degree of the detection result has been raised, which achieves threat early warning.
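The group-based escalation described above can be sketched as a simple post-processing step on anomaly scores. The text does not specify how the threat degree is raised, so multiplying the scores of group peers by a fixed factor is an assumption; the function name, the boost value and the sample scores are hypothetical, and the anomaly detector itself is outside this sketch.

```python
def flag_with_group_boost(scores, groups, threshold, boost=1.5):
    """Flag users whose (possibly boosted) anomaly score exceeds the threshold.

    scores maps user -> anomaly score, groups maps user -> group id.
    Once any user in a group exceeds the threshold, the scores of the
    other users in that group are multiplied by boost (assumed scheme).
    """
    # Groups in which at least one user is already judged abnormal.
    alerted = {groups[u] for u, s in scores.items() if s > threshold}
    flagged = {}
    for user, score in scores.items():
        if score <= threshold and groups[user] in alerted:
            score *= boost  # heightened supervision for group peers
        flagged[user] = score > threshold
    return flagged


# Hypothetical scores: U1 is clearly abnormal, U2 and U3 are borderline.
scores = {"U1": 0.9, "U2": 0.6, "U3": 0.6}
groups = {"U1": 0, "U2": 0, "U3": 1}
result = flag_with_group_boost(scores, groups, threshold=0.8)
```

Here U2 shares a group with the detected attacker and is flagged after the boost, while U3, with the same raw score but in a different group, is not.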
Verification of the effectiveness of the user portrait method based on hierarchical clustering: effectiveness is verified mainly from the distribution of attackers across the groups after clustering, combined with the times of the attack behaviors.
The experiment uses the users' intrinsic characteristic data and the attack behavior data of July for verification. The first attacker distribution diagram is shown in FIG. 1.
As can be seen from the figure, group 0# contains 12 attackers, accounting for 41% of all attackers, and the 4# groups each contain 5 attackers, accounting for 17% of all attackers. Most internal threat attackers are therefore similar in their intrinsic characteristics, and similar attacking users can be found through the user portrait method, which provides a basis for internal threat early warning.
Meanwhile, the destructive behaviors of the attackers in July are ordered and combined with the attackers' groups to draw a data table; the results are shown in the following table:
As can be seen from the table, the vast majority of the attackers are in the same group. For example, if an internal threat from KEW0198 is detected, supervision of DAS1320 is strengthened, for instance by increasing the weight of abnormal behaviors in that user's anomaly detection or by raising the threat degree of that user's anomaly detection results. The user's real attack behaviors and attack precursor behaviors then stand out more than in the results of a single internal threat detection algorithm, so that the system security administrator is warned of the threat earlier and further loss to the organization is prevented.
Verification of the superiority of the user portrait method based on hierarchical clustering: the K-means algorithm is selected for comparison in the experiment.
The experiment uses the users' intrinsic characteristic data and the attack behavior data of July for verification, and the second attacker distribution diagram shown in FIG. 2 is obtained.
The destructive behaviors of the attackers in July are likewise ordered and combined with the attackers' groups to draw a data table; the results are shown in the following table.
As can be seen from the figure and table above, although the K-means algorithm groups users more evenly, it does not place the abnormal users in the same group well, so joint supervision and early warning cannot be achieved. This also demonstrates the superiority of the present method.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.