Method and system for processing security log elements
1. A method for processing security log elements, the method comprising:
collecting a log file;
analyzing the log file, and judging the type of the log file;
if the log file is of an unknown log type, the natural language processing module classifies and extracts element information of the log file;
the asset comparison module verifies and matches the element information of the log file to obtain subject and object information;
and storing the extracted element information of the log file and the matched subject and object into an event information base.
2. The security log element processing method according to claim 1, wherein
the step of analyzing the log file further comprises:
extracting each element information in the log file, and separating the element information based on a line separation mode;
and storing the separated log file into an original log information base.
3. The security log element processing method according to claim 2, further comprising:
judging the type of the log file in the original log information base;
if the log file is of a known log type, processing element information of the log file based on a preset event processing rule;
and storing the processed log file into an event information base.
4. The security log element processing method according to claim 1 or 2, wherein
the natural language processing module classifies the log files by means of a CharCNN text classification model.
5. The security log element processing method according to claim 1 or 2, wherein
the natural language processing module extracts the generation time of the log file as the time element.
6. The security log element processing method according to claim 1 or 2, wherein
the natural language processing module extracts protocol elements in the log file using keywords generated by the word segmentation function in HanLP;
the natural language processing module also extracts IP values using the text similarity algorithm function in HanLP.
7. The security log element processing method according to claim 1 or 2, wherein
the natural language processing module extracts the action elements and the result type elements in the log file.
8. The security log element processing method according to claim 1 or 2, further comprising:
constructing an equipment asset information base, wherein the equipment asset information base stores subject and object information of the behavior log;
and constructing a classification and semantic word bank, and storing in it the words corresponding to the equipment asset type information, or the high-frequency words counted and classified after word segmentation.
9. A security log element processing system, comprising: a log collection module, a log analysis and judgment module, a natural language processing module, an asset comparison module, a data storage module and an event information base;
the log collection module is used for collecting log files from a log source;
the log analysis and judgment module is used for analyzing the log file and judging the type of the log file;
if the log file is of an unknown log type, the natural language processing module classifies and extracts element information of the log file;
the asset comparison module is used for verifying and matching the element information of the log file to obtain subject and object information;
and the data storage module is used for storing the extracted element information of the log file and the matched subject and object into an event information base.
10. The security log element processing system of claim 9,
further comprising: an original log information base, an equipment asset information base, and a classification and semantic word bank.
Background
With the rapid development of computer technology and the wide application of technologies such as cloud computing, the Internet of Things, big data, the mobile internet and artificial intelligence, people's work, life, study and entertainment have become far more convenient, but many network security problems have arisen as well. Network security problems from both internal and external sources can cause significant losses to organizations, especially enterprises, and pose a significant threat to information security.
At present, enterprises have built a variety of information systems to meet the needs of daily office work and production, and security inspection and protection products provide effective security protection inside the organization. The system operation logs, application access logs and similar records produced while users work with these systems are also an important resource for analyzing network security. However, the log types, formats and organization differ from one information system to another, and new information systems are continually being built. Faced with such massive and heterogeneous behavior data, how to analyze and process it reasonably and efficiently into unified, standardized information is a technical problem that urgently needs to be solved.
Disclosure of Invention
The invention provides a security log element processing method that analyzes and classifies unknown logs efficiently at large data volumes and supports network security analysis, threat detection and the like.
The method comprises the following steps:
collecting a log file;
analyzing the log file, and judging the type of the log file;
if the log file is of an unknown log type, the natural language processing module classifies and extracts element information of the log file;
the asset comparison module verifies and matches the element information of the log file to obtain subject and object information;
and storing the extracted element information of the log file and the matched subject and object into an event information base.
The invention also relates to extracting each element information in the log file and separating the element information based on a line separation mode;
and storing the separated log file into an original log information base.
The invention also relates to the judgment of the type of the log file in the original log information base;
if the log file is of a known log type, processing element information of the log file based on a preset event processing rule;
and storing the processed log file into an event information base.
The invention also relates to a natural language processing module which realizes the classification of the log files through a CharCNN text classification model.
The natural language processing module extracts the generation time of the log file through the time element.
The natural language processing module extracts the protocol elements in the log file using the keywords generated by the word segmentation function in HanLP.
The natural language processing module also extracts IP values using the text similarity algorithm function in HanLP.
The natural language processing module extracts the action elements and the result type elements in the log file.
The invention also relates to a device asset information base which stores the subject and object information of the behavior log;
and constructing a classification and semantic word bank, storing words and words corresponding to the asset type information of the equipment in the classification and semantic word bank, or counting classified words of high-frequency words after word segmentation processing.
The invention also relates to a security log element processing system, which comprises: a log collection module, a log analysis and judgment module, a natural language processing module, an asset comparison module, a data storage module and an event information base;
the log collection module is used for collecting log files from a log source;
the log analysis and judgment module is used for analyzing the log file and judging the type of the log file;
if the log file is of an unknown log type, the natural language processing module classifies and extracts element information of the log file;
the asset comparison module is used for verifying and matching the element information of the log file to obtain subject and object information;
and the data storage module is used for storing the extracted element information of the log file and the matched subject and object into an event information base.
The system further comprises: an original log information base, an equipment asset information base, and a classification and semantic word bank.
According to the technical scheme, the invention has the following advantages:
Compared with traditional template-based log analysis, the security log element extraction method based on natural language processing can automatically extract elements such as time, protocol, subject, object, action and result from unknown logs, which reduces manual involvement. Combining the extraction with recorded asset information can greatly improve the accuracy of the extracted elements. The method thereby solves the problem that template-based approaches cannot process unknown logs, and makes unknown logs effectively usable.
Drawings
In order to more clearly illustrate the technical solution of the present invention, the drawings used in the description will be briefly introduced, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained based on these drawings without creative efforts.
FIG. 1 is a flow diagram of a method for secure log element processing;
FIG. 2 is a schematic diagram of an embodiment of a method for processing a security log element;
FIG. 3 is a flowchart for implementing log classification based on CharCNN;
FIG. 4 is a schematic diagram of a secure log element processing system.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The security log element processing method serves routine office and production security inspection and provides effective security protection inside a data system. It analyzes the system operation logs and application access logs produced while users work with the information system, and processes these log files into a unified, normalized form.
The elements and algorithm steps of each example described in the embodiments disclosed in the method for processing secure log elements provided by the present invention can be implemented by electronic hardware, computer software, or a combination of the two, and in order to clearly illustrate the interchangeability of hardware and software, the components and steps of each example have been generally described in terms of functions in the above description. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The block diagram shown in the drawings of the security log element processing method provided by the present invention is only a functional entity and does not necessarily correspond to a physically independent entity. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
Furthermore, the described features, structures, or characteristics of the methods of secure log element processing provided by the present invention may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to provide a thorough understanding of embodiments of the invention. One skilled in the relevant art will recognize, however, that the invention may be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations or operations have not been shown or described in detail to avoid obscuring aspects of the invention.
In the method for processing the safety log elements, a log processing system is used for intensively collecting syslog logs and logs with certain text formats such as XML, JSON, CSV, fixed width, general lists and key value pairs, and then customizing different rule templates to analyze the logs.
The invention also incorporates natural language processing, an important area of artificial intelligence that studies how to give machines the ability to analyze and understand language, and one that is widely applied across industries. Compared with traditional template-based log analysis, the security log element processing method based on natural language processing can automatically extract elements such as time, protocol, subject, object, action and result from unknown logs, which reduces manual involvement. Combining the extraction with recorded asset information can greatly improve the accuracy of the extracted elements. The method thereby solves the problem that template-based approaches cannot process unknown logs, and makes unknown logs effectively usable.
Specifically, as shown in fig. 1 and 2, S101, collecting a log file;
The log file may be collected from a log source. The log source may be a terminal or a server that generates log files, software running in the system that generates log files, or the like.
When the log files are collected and analyzed, the contents of the log files from the different data sources are read first. The log contents can be split using common line separation and are analyzed to set the log type. The original log records are then organized according to the structure defined for original logs in the original log information base.
Illustratively, the data structure is as follows: { access time: 2021010101: 11:11, filename: device a _20210101.log, log type: unknown log, log content: text content }. The original log record is written into the original log information base at the same time.
The access time, file name, log type, log content, etc. are the element information of the log file.
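The record structure and the line separation step described above can be sketched as follows. This is a minimal illustration only, not the implementation of the invention: the class name `OriginalLogRecord`, the function `split_into_records`, the sample file name and the timestamp format are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OriginalLogRecord:
    """One entry of the original log information base (field names are assumed)."""
    access_time: str
    filename: str
    log_type: str
    log_content: str


def split_into_records(raw_text: str, filename: str, access_time: str,
                       log_type: str = "unknown log") -> List[OriginalLogRecord]:
    """Split raw log text line by line and wrap each non-empty line in a record."""
    records = []
    for line in raw_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines
            records.append(OriginalLogRecord(access_time, filename, log_type, line))
    return records


# Hypothetical sample input; a real system would read this from a log source.
sample = "GET /index.html from 10.0.0.5\n\nlogin failed for admin\n"
records = split_into_records(sample, "device_a_20210101.log", "2021-01-01 01:11:11")
```

Each record keeps the log type as an ordinary attribute, so the later judgment step only has to read that field.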
S102, analyzing the log file, and judging the type of the log file;
in this embodiment, the log type attribute in the original log record in the original log information base is logically determined.
If the log file is of a known log type, processing element information of the log file based on a preset event processing rule; and storing the processed log file into an event information base.
For example: { access time: 2021010101: 11:11, filename: device a _20210101.log, log type: known log, log content: text content }.
When element information is processed according to the preset event processing rule, the log files can, for example, be sorted and screened by access time, and file names and similar fields can be extracted and classified.
S103, if the log file is of an unknown log type, the natural language processing module classifies and extracts element information of the log file;
After receiving the element information of the original log file, the natural language processing module classifies the log content, performs word segmentation on the log record, and extracts special element information, such as the log time, subject IP and object IP, from the log record using dedicated methods.
Specifically, the log classification is to perform text classification processing on the input log file content to realize log file label classification of a log source.
The invention uses the CharCNN text classification model to classify log files. CharCNN performs text classification at the character level and extracts high-level abstract concepts from the characters. To implement CharCNN, an alphabet is first constructed. The alphabet used in the invention contains 69 characters and is one-hot encoded; an all-zero vector is added to handle characters that are not in the alphabet, giving 70 entries in total, so each character is converted into a 70-dimensional vector. The algorithm also encodes the characters in reverse order, that is, it reads the text backwards, so that the most recently read character is always at the start of the output.
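The character encoding described above can be sketched as follows. The text does not list the 69 characters, so the alphabet below (lowercase letters, digits, ASCII punctuation and the space) is an assumption chosen only because it has 69 members. The parameter `max_len`, the decision to leave the 70th dimension unused, and the zero-padding of short inputs are likewise assumptions of this sketch.

```python
import string

# Assumed 69-character alphabet: 26 letters + 10 digits + 32 punctuation marks + space.
ALPHABET = string.ascii_lowercase + string.digits + string.punctuation + " "
VECTOR_DIM = len(ALPHABET) + 1  # 70 dimensions, as stated in the description
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def quantize(text, max_len=16):
    """Encode text as max_len one-hot rows, reading the text backwards.

    Characters outside the alphabet, and padding positions, become all-zero rows.
    """
    rows = []
    for ch in reversed(text.lower()):
        if len(rows) == max_len:
            break  # truncate: only the last max_len characters are kept
        vec = [0] * VECTOR_DIM
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            vec[idx] = 1
        rows.append(vec)
    while len(rows) < max_len:
        rows.append([0] * VECTOR_DIM)  # pad short inputs with zero rows
    return rows


matrix = quantize("GET /a", max_len=8)
```

Because the text is read in reverse, `matrix[0]` encodes the last character of the input, which matches the statement that the latest character read is where the output starts.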
As shown in FIG. 3, the invention proposes neural networks of two sizes for the convolution and pooling layers of the model. Both are 9 layers deep, with 6 convolutional layers and 3 fully connected layers, and both use 1-D convolution. In addition, two dropout layers are inserted between the three fully connected layers to regularize the model.
The log file is convolved by the model into features of a preset length, which are then passed to a max pooling layer. Max pooling discards non-maximum values, which reduces the computational complexity of the upper layers. Together, the convolution and pooling layers provide a form of translation invariance. When element information of the log file is classified and extracted with max pooling, a single piece of element information may be shifted in several directions, for example up, down, left, right, upper left, lower left, upper right and lower right. The max pooling layer of the invention operates on an 8 x 8 window, which reduces the size of the element information handled by the natural language processing module and of the extraction model, increases computation speed, and improves the robustness of the extracted features.
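To make the roles of 1-D convolution and max pooling concrete, the following is a minimal pure-Python sketch on a single channel. It is not the network of FIG. 3: the function names, the kernel values and the window size of 2 are illustrative assumptions, and a real model would use learned kernels over the 70-dimensional character vectors.

```python
def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, the operation deep-learning layers call convolution."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(values, window):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(values[i:i + window])
            for i in range(0, len(values) - window + 1, window)]


features = conv1d([1, 2, 3, 4, 5], [1, 0, -1])   # length 5 -> length 3
pooled = max_pool1d([1, 3, 2, 5, 4, 0], 2)       # length 6 -> length 3

# A small shift inside one pooling window leaves the pooled output unchanged.
shifted_a = max_pool1d([0, 9, 0, 0], 2)
shifted_b = max_pool1d([9, 0, 0, 0], 2)
```

The last two calls show the translation invariance mentioned above: the peak moves by one position, but both inputs pool to the same output, and each pooling step shortens the sequence that the upper layers must process.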
The invention also relates to a plurality of extraction modes of the element information of the log file, including the time element extraction of the log file.
Time element extraction mainly extracts the time at which the log was generated from the log record. Time strings of the form 2021010101: 11:11 in the log record are extracted using the time parsing function in HanLP. A log record may contain several time strings; all of them are extracted and processed, and the log generation time can then be corrected, for example manually.
The time element may be the time at which an element of the log was generated, the time at which the log was processed, the time at which the log was sent, or the like. An example is the access time in { access time: 2021010101: 11:11, filename: device a _20210101.log, log type: known log, log content: text content }.
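The idea of collecting every time string in a record can be sketched as follows. Note that this sketch does not use HanLP: it substitutes regular expressions and `datetime.strptime` from the Python standard library, and the two timestamp layouts it recognizes are assumptions, since the description does not fix the format.

```python
import re
from datetime import datetime

# Assumed timestamp layouts; a deployment would list the formats its devices emit.
_TIME_PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"), "%Y-%m-%d %H:%M:%S"),
    (re.compile(r"\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}"), "%Y/%m/%d %H:%M:%S"),
]


def extract_times(log_content):
    """Return all parsable timestamps in the order they appear in the text."""
    found = []
    for pattern, fmt in _TIME_PATTERNS:
        for match in pattern.finditer(log_content):
            try:
                found.append((match.start(), datetime.strptime(match.group(), fmt)))
            except ValueError:
                continue  # looks like a timestamp but is not a valid date or time
    return [ts for _, ts in sorted(found)]


times = extract_times(
    "2021-01-01 01:11:11 session opened, previous login 2020/12/31 23:59:59"
)
```

All candidates are returned rather than only the first, so that a later step (or a person) can decide which one is the log generation time.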
The invention also relates to protocol element extraction. The protocol element is obtained by applying the word segmentation function in HanLP to the log content of the log record and comparing the resulting keywords with protocol types such as http, https and ftp in the classified word list.
Of course, in some embodiments, the clients, servers may communicate using any currently known or future developed network Protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). When extracting elements, protocol elements are mainly extracted.
The invention also relates to subject and object element extraction. The text similarity algorithm function in HanLP is used to extract content of the form of an IP value, such as 127.0.0.1, from the log content of the log record, and the subject and object are marked simply according to the order in which the IPs appear. The subject and the object may be terminals, servers or other devices that send or generate logs.
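A minimal sketch of order-based subject and object marking is given below. It replaces the HanLP text similarity function with a regular expression plus the standard `ipaddress` module, handles IPv4 only, and assumes that the first IP found is the subject and the second is the object; the description only says that the order of the IPs is used, so that mapping is an assumption.

```python
import ipaddress
import re

# Candidate dotted quads that are not part of a longer dotted number.
_IPV4_CANDIDATE = re.compile(r"(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])")


def extract_subject_object(log_content):
    """Return (subject_ip, object_ip) based on the order of valid IPv4 addresses."""
    ips = []
    for match in _IPV4_CANDIDATE.finditer(log_content):
        try:
            ipaddress.IPv4Address(match.group())  # rejects values such as 999.1.1.1
        except ValueError:
            continue
        ips.append(match.group())
    subject = ips[0] if ips else None
    obj = ips[1] if len(ips) > 1 else None
    return subject, obj


pair = extract_subject_object("192.168.1.10 GET http://10.0.0.5/index.html")
```

Because order alone is a weak signal, the result is best treated as a provisional marking that the asset comparison step later confirms or rejects.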
The invention also relates to action element extraction. The action element is obtained by applying the word segmentation function in HanLP to the log content of the log record and comparing the resulting keywords with action types such as get, post, put, add and delete in the classified word list.
The invention also relates to result element extraction. The result element is obtained by applying the word segmentation function in HanLP to the log content of the log record and comparing the resulting keywords with result types such as success and fault in the classified word list. Result element extraction is optional.
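Protocol, action and result extraction all follow the same pattern of comparing segmented keywords with a classified word list, which can be sketched as follows. The sketch uses a simple regular-expression tokenizer in place of HanLP word segmentation, and the dictionary `CLASSIFIED_WORDS` is a hypothetical stand-in for the classification and semantic word bank, filled with the example words from the description.

```python
import re

# Hypothetical excerpt of the classified word list, using the examples in the text.
CLASSIFIED_WORDS = {
    "protocol": {"http", "https", "ftp"},
    "action": {"get", "post", "put", "add", "delete"},
    "result": {"success", "fault"},
}


def tokenize(log_content):
    """Stand-in for word segmentation: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", log_content.lower())


def extract_keyword_elements(log_content):
    """Return the first keyword of each element type, or None if none is found."""
    elements = {name: None for name in CLASSIFIED_WORDS}
    for token in tokenize(log_content):
        for name, words in CLASSIFIED_WORDS.items():
            if elements[name] is None and token in words:
                elements[name] = token
    return elements


elements = extract_keyword_elements("GET https://example.com/index.html result=success")
```

A missing element is reported as `None`, which fits the statement that result extraction may be left out.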
After extracting the element information of the log file, the natural language processing module may classify the log files according to the type of element extracted, or store them by class.
The log files may be classified by time element into different preset time periods.
Log files having the same protocol element may be classified together.
Log files having the same subject and object elements may be classified together.
Log files having the same action element and result element may be classified together.
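The classification options listed above amount to grouping records by the value of one extracted element. A minimal sketch follows; the function name `group_by_element` and the dictionary layout of an event are assumptions made for illustration.

```python
from collections import defaultdict


def group_by_element(events, element):
    """Group event dictionaries by the value of one extracted element."""
    groups = defaultdict(list)
    for event in events:
        groups[event.get(element)].append(event)
    return dict(groups)


# Hypothetical extracted events.
events = [
    {"protocol": "http", "action": "get"},
    {"protocol": "ftp", "action": "put"},
    {"protocol": "http", "action": "post"},
]
by_protocol = group_by_element(events, "protocol")
```

The same function can group by action, result or any other element key, so one helper covers each of the classification modes.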
S104, the asset comparison module verifies and matches the element information of the log file to obtain subject and object information;
The asset comparison module verifies and matches the element information of the log file. In particular, after the natural language processing module has extracted the subject and object IP information from the log record, the asset comparison function can be used to verify that the subject and the object are correct.
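The verification step can be pictured as a lookup of each extracted IP in the equipment asset information base. The sketch below is illustrative only: `ASSET_BASE` is a hypothetical in-memory stand-in for that base, and the returned structure is an assumption.

```python
# Hypothetical asset base: IP address -> registered equipment asset type.
ASSET_BASE = {"192.168.1.10": "terminal", "10.0.0.5": "server"}


def verify_subject_object(subject_ip, object_ip, asset_base):
    """Check both extracted IPs against the registered assets."""
    result = {}
    for role, ip in (("subject", subject_ip), ("object", object_ip)):
        asset_type = asset_base.get(ip)
        result[role] = {
            "ip": ip,
            "asset_type": asset_type,
            "verified": asset_type is not None,  # unknown IPs stay unverified
        }
    return result


checked = verify_subject_object("192.168.1.10", "8.8.8.8", ASSET_BASE)
```

An unverified IP is kept rather than discarded, so a downstream step can decide whether to register the asset or flag the event.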
And S105, storing the element information of the extracted log file and the matched subject and object into an event information base.
After natural language processing has extracted the log element information from the log record, the information is stored. For example, the event data structure is as follows: { event generation time: 2021010101: 11:11, subject IP: 127.0.0.1, object IP: 127.0.0.1, protocol type: http, action: get, result: success }. Finally, the assembled event information is written into the event information base.
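Writing an assembled event into the event information base can be sketched as follows. The description does not name a storage technology, so the use of SQLite here is an assumption, as are the table name, the column names and the sample timestamp.

```python
import sqlite3

EVENT_FIELDS = ("event_time", "subject_ip", "object_ip", "protocol", "action", "result")


def open_event_base(path=":memory:"):
    """Open (or create) a stand-in event information base."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "event_time TEXT, subject_ip TEXT, object_ip TEXT, "
        "protocol TEXT, action TEXT, result TEXT)"
    )
    return conn


def store_event(conn, event):
    """Insert one event; elements that were not extracted are stored as NULL."""
    conn.execute(
        "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
        tuple(event.get(field) for field in EVENT_FIELDS),
    )
    conn.commit()


conn = open_event_base()
store_event(conn, {
    "event_time": "2021-01-01 01:11:11",  # hypothetical value
    "subject_ip": "127.0.0.1",
    "object_ip": "127.0.0.1",
    "protocol": "http",
    "action": "get",
    "result": "success",
})
rows = conn.execute("SELECT protocol, action, result FROM events").fetchall()
```

Storing the elements as separate columns keeps the events in a structured form that later analysis can query directly.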
The security log element processing method can provide important help for network security analysis, threat detection, event handling, emergency response and the like.
The security log element processing method can effectively process various logs of unknown types and formats, overcomes the defects that the undefined logs are easy to omit and cannot be analyzed in the conventional template-based method, and improves the discovery capability of threat events and the availability and usability of the system.
The invention analyzes and classifies the safety logs through natural language processing, so that the logs of various types of network equipment are easier for users to understand.
By grouping the keywords obtained from analyzing unknown logs and recording them in the classified word bank, the invention can improve the speed and accuracy of log element identification; the keywords can also be reused as input to other models.
Generally, the security logs generated by equipment from the same manufacturer are similar in content. Classifying logs from this angle can shorten or eliminate the training period of the sentence analysis model, and can improve the accuracy of log element extraction for equipment of the same manufacturer, whether of the same type, a different type or an unknown type.
Because the training sentences are closely related to the personal characteristics of the developer, the training words extracted as elements are closer to the developer's language habits and directly reflect those characteristics.
Based on the foregoing method, the present invention further provides a security log element processing system which, as shown in FIG. 4, comprises: a log collection module 1, a log analysis and judgment module 2, a natural language processing module 3, an asset comparison module 4, a data storage module 5 and an event information base;
the log collection module 1 is used for collecting log files from a log source; the log analysis and judgment module 2 is used for analyzing the log file and judging the type of the log file; if the log file is of an unknown log type, the natural language processing module 3 classifies and extracts the element information of the log file;
the asset comparison module 4 is used for verifying and matching the element information of the log file to obtain subject and object information;
the data storage module 5 is used for storing the extracted element information of the log file and the matched subject and object into an event information base.
Taking network security log processing as an example, the security log element processing system provided by the invention collects and analyzes the log files of a log source and marks the log type in the analysis result. If the log type is a known log, the rule model processing marked by the dotted line in FIG. 2 is carried out through to the end. If the log type is unknown, the log passes through the log classification and element extraction stages of the natural language processing component, followed by asset comparison and event organization and storage, which completes the writing of the standardized event into the event information base.
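The two branches of this flow can be sketched as a small dispatch function. Every name below is hypothetical: the three `demo_` functions are placeholders for the rule model, the natural language processing module and the asset comparison module, and a plain list stands in for the event information base.

```python
def process_record(record, rule_processor, nlp_extractor, asset_checker, event_base):
    """Route one original log record down the known or the unknown branch."""
    if record["log_type"] == "known log":
        event = rule_processor(record)                  # preset event processing rule
    else:
        event = asset_checker(nlp_extractor(record))    # NLP extraction, then asset comparison
    event_base.append(event)                            # both branches end in the event base
    return event


# Placeholder stages, for illustration only.
def demo_rule_processor(record):
    return {"source": "rule", "content": record["log_content"]}


def demo_nlp_extractor(record):
    return {"source": "nlp", "content": record["log_content"]}


def demo_asset_checker(event):
    return dict(event, verified=True)


event_base = []
stages = (demo_rule_processor, demo_nlp_extractor, demo_asset_checker, event_base)
process_record({"log_type": "known log", "log_content": "a"}, *stages)
process_record({"log_type": "unknown log", "log_content": "b"}, *stages)
```

Passing the stages in as functions keeps the routing logic independent of how each module is actually implemented.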
The system provided by the invention is composed of a storage database and a natural language processing module 3, wherein the natural language processing module 3 is used for realizing the extraction method of the security log elements.
Specifically, the storage database involved in the system of the present invention includes an original log information base, a device asset information base, a classification and semantic word base, and an event information base.
The invention constructs an original log information base, which stores the originally input network security logs by class. It holds the basic source information, source classification and original content of every original log, and the unknown class is stored as one of the log type classes.
The invention also constructs an equipment asset information base, which stores information such as the subjects and objects of behavior logs. The asset information must be preprocessed in advance, and assets can be recorded by manual registration or by import. Equipment asset types include but are not limited to terminals, servers and network equipment. After elements have been extracted from a log, the equipment asset information base can also be used, by comparison, to identify whether an element is a subject or an object associated with the original log.
The invention also constructs a classification and semantic word bank, which stores the words corresponding to equipment asset types or other attribute information, or the high-frequency words counted after word segmentation. Examples are equipment asset types, network areas, IP addresses and domain names, as well as the elements in the original logs that relate to the operations and operation results recorded by the network logs.
The invention also constructs an event information base, a normalized store of events produced by analyzing the original logs. Log elements including but not limited to time, protocol, subject, object, action and result are extracted by a template-based or natural language processing-based method and stored in the event information base in a structured data format, which facilitates subsequent analysis and use.
The natural language processing module 3 analyzes log files for which no log processing rule is defined in the system. Using a natural language processing model together with the classification and semantic word bank, it segments the file contents and extracts entity information such as time, protocol, subject IP, object IP, action and result as element information. It then compares the identified IP information with the equipment types in the equipment asset information base to check the accuracy of the subject and object elements. Finally, the extracted log elements are organized into event information and stored in the event information base.
Further, the natural language processing module 3 classifies log files by means of a CharCNN text classification model.
It also extracts the generation time of the log file as the time element, extracts protocol elements in the log file using the keywords generated by the word segmentation function in HanLP, and extracts IP values using the text similarity algorithm function in HanLP. The natural language processing module 3 also extracts the action elements and result type elements in the log file.
Therefore, by designing and implementing this scheme, the invention analyzes and classifies unknown logs efficiently at large data volumes and supports network security analysis, threat detection and the like.
Compared with traditional template-based log analysis, the security log element extraction method based on natural language processing can automatically extract elements such as time, protocol, subject, object, action and result from unknown logs, which reduces manual involvement. Combining the extraction with recorded asset information can greatly improve the accuracy of the extracted elements. The system thereby solves the problem that template-based approaches cannot process unknown logs, and makes unknown logs effectively usable.
The elements and algorithm steps of the examples described in connection with the embodiments of the security log element processing method and system provided by the present invention can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
For the security log element processing method and system provided by the present invention, program code for performing the operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.