Data flow classification management method and system

Document No.: 7504; published: 2021-09-17

1. A data flow classification management method, characterized in that the method comprises the following steps:

step one: collecting data from the industry in which the user operates and training on it to establish a classification model or a set of keywords, the industry classification model or keywords being set according to the user's industry classification; classifying files that are operated on in a computer by means of the classification model or by keyword matching; if an operated file belongs to a class specified by the user or hits a keyword specified by the user, marking the file and saving its feature vector; recording the application features of the application process that operates on the marked file and establishing an application feature library; taking a screenshot of the application that operates on the marked file and at the same time performing text recognition on the screenshot to identify whether the file name or file content information of the marked file being operated on appears in the image; when a marked file is compressed, recording and associating the hash value of the marked file with the hash value of the resulting compressed file; when a compressed file is decompressed, if a decompressed file is identified as a target file, recording and associating the hash value of the target file with the hash value of the compressed file; storing the association information in a graph database for flow analysis; and uploading the marked file, the application feature record of the application that operated on the file, the operation information and the screenshot to a server;

step two: identifying and analyzing file flow on the server side: the server collects the marked files, identifies identical marked files by hash comparison and groups highly similar files into the same class; specifically, the similarity between the feature vector of a marked file and the central feature vector of each existing class is calculated, and if the similarity to a class is greater than a set threshold, the file is added to that class and the central feature vector of the class is updated; otherwise the file forms a class of its own; when a newly collected marked file is classified, the new file is associated with the historical classified files by operation mode and operation time, and the association is recorded in the graph database; the origin and subsequent flow of a selected marked file are traced by querying its flow association information in the graph database and are displayed as a graph;

step three: early warning: counting and classifying the sending-out, uploading, downloading, copying, printing and burning operations performed on the selected marked file on each computer, and determining a marked file with a high operation count to be a high-risk file; analyzing the collected marked files by looking up, in the application feature library, the process information and digital signature of the application that accessed the file; and if the application is not in the application feature library, accesses user privacy locations and at the same time performs network connection operations, judging the access to be a background access by a remote control program and issuing an early warning.

2. The data flow classification management method according to claim 1, characterized in that: in step one, the application feature record comprises a digital signature, manufacturer information, product information, operation mode and file data behavior.

3. The data flow classification management method according to claim 1, characterized in that: in step two, the operation modes comprise sending out, uploading, downloading, copying, printing, burning, compressing and decompressing.

4. A data flow classification management system, characterized in that it is built on the data flow classification management method according to any one of claims 1-3 and comprises: a content classification module, a file content marking module, a file monitoring module, an image processing module, a remote control analysis module and a network flow analysis module;

the content classification module is used for setting the industry file classification information of the data flow classification management system and for classifying and marking files according to the file types set by the user;

the file content marking module is used for sampling the content of files of the classified types and calculating a feature value;

the file monitoring module is used for monitoring operations on and content changes to local files in real time, including uploading, printing, copying, downloading, compressing and packaging;

the image processing module is used for taking screenshots of operations whenever any process of the host operating system accesses a file of a specified class;

the remote control analysis module is used for analyzing the file traversal operations of all applications, distinguishing user operations from non-user operations, and preventing remote control programs from viewing the user's key locations, so as to protect user privacy;

the network flow analysis module is used for correlating the content feature values of the classified files of all file types specified by the user and calculating on which computers files with those features are distributed, thereby generating a flow chart.

5. The data flow classification management system according to claim 4, characterized in that: the content classification module samples the content of files operated on in the computer, compares the type of the file content on the terminal computer against the file content classification information set by the user, records the digital signature of each application that accesses such a file as a sample for the application feature library, and records the sample in a hash database on the computer for subsequent queries;

the file content marking module extracts the theme of the classified file content from the main body of the file, compares it for similarity with the file content classification information set by the user to form a unique central feature of the file content, adds it by feature value to the hash database on the computer, and synchronizes it to the server;

the file monitoring module examines all files operated on in the computer and identifies sending-out, printing, copying, uploading, downloading, burning and intercepting operations on them; when such an operation is identified, the file information and application process information are looked up in the application feature library, it is judged whether a file of a content class set by the user is being operated on, and the file information is recorded, added to the hash database on the computer and synchronized to the server;

the image processing module runs on the computer and, after an application operates on a file belonging to a content class set by the user, takes a screenshot, performs text recognition on the image to identify whether the name or content information of the operated file appears in it, and transmits the image back to the server for recording;

the remote control analysis module analyzes files on the computer that have been identified as belonging to a content class set by the user, and looks up the process information and digital signature of the application accessing the file in the application feature library; if the application is not in the application feature library, accesses user privacy locations such as the desktop and My Documents, and at the same time performs network connection operations, the access is judged to be a background access by a remote control program and the data is transmitted back to the server for recording;

the network flow analysis module performs, through the server, a comprehensive aggregate analysis of the data uploaded by the terminal computers of the whole network: it performs association analysis on the results provided by the file content marking module of each terminal computer, associates the operations performed on each file, associates the data uploaded by the image processing module of each terminal computer, performs association analysis on the data about all classified files uploaded by the file monitoring module of each terminal computer, and performs flow analysis according to the distribution of all classified files.

Background

Prior-art methods for determining whether a document has been stolen or leaked work roughly as follows. Files on a target computer are actively scanned, by queue scanning and recursive scanning, to identify whether they contain set keywords; the file type is judged from the file extension and the keywords are then matched. Alternatively, keywords are used to identify whether document information contains signs of outward flow; for the documents so identified, an administrator inspects the content and manually reviews audit records to check and tally how the file has circulated. These prior-art methods have the following defects: outgoing transfers of a file cannot be discovered in time; the flow of a file within the network cannot be detected; operations on a file cannot be captured as images and recorded as they happen; records must be picked out by hand from massive volumes of audit data; background access by remote control programs cannot be identified or blocked; and similar files or files of the same class cannot be recognized, so the misjudgment rate is high.

Disclosure of Invention

In view of the above background, the present invention provides a data flow classification management method and a data flow classification management system.

In order to achieve the purpose, the invention provides the following technical scheme:

A data flow classification management method comprises the following steps:

step one: collecting data from the industry in which the user operates and training on it to establish a classification model or a set of keywords, the industry classification model or keywords being set according to the user's industry classification; classifying files that are operated on in a computer by means of the classification model or by keyword matching; if an operated file belongs to a class specified by the user or hits a keyword specified by the user, marking the file and saving its feature vector; recording the application features of the application process that operates on the marked file and establishing an application feature library; taking a screenshot of the application that operates on the marked file and at the same time performing text recognition on the screenshot to identify whether the file name or file content information of the marked file being operated on appears in the image; when a marked file is compressed, recording and associating the hash value of the marked file with the hash value of the resulting compressed file; when a compressed file is decompressed, if a decompressed file is identified as a target file, recording and associating the hash value of the target file with the hash value of the compressed file; storing the association information in a graph database for flow analysis; and uploading the marked file, the application feature record of the application that operated on the file, the operation information and the screenshot to a server;

step two: identifying and analyzing file flow on the server side: the server collects the marked files, identifies identical marked files by hash comparison and groups highly similar files into the same class; specifically, the similarity between the feature vector of a marked file and the central feature vector of each existing class is calculated, and if the similarity to a class is greater than a set threshold, the file is added to that class and the central feature vector of the class is updated; otherwise the file forms a class of its own; when a newly collected marked file is classified, the new file is associated with the historical classified files by operation mode and operation time, and the association is recorded in the graph database; the origin and subsequent flow of a selected marked file are traced by querying its flow association information in the graph database and are displayed as a graph;

step three: early warning: counting and classifying the sending-out, uploading, downloading, copying, printing and burning operations performed on the selected marked file on each computer, and determining a marked file with a high operation count to be a high-risk file; analyzing the collected marked files by looking up, in the application feature library, the process information and digital signature of the application that accessed the file; and if the application is not in the application feature library, accesses user privacy locations and at the same time performs network connection operations, judging the access to be a background access by a remote control program and issuing an early warning.

In the above technical solution, the application feature record of step one includes a digital signature, manufacturer information, product information, operation mode and file data behavior.

In the above technical solution, the operation modes of step two include sending out, uploading, downloading, copying, printing, burning, compressing and decompressing.

The invention also discloses a data flow classification management system that applies the above technical solution, comprising: a content classification module, a file content marking module, a file monitoring module, an image processing module, a remote control analysis module and a network flow analysis module;

the content classification module is used for setting the industry file classification information of the data flow classification management system and for classifying and marking files according to the file types set by the user;

the file content marking module is used for sampling the content of files of the classified types and calculating a feature value;

the file monitoring module is used for monitoring operations on and content changes to local files in real time, such as uploading, printing, copying, downloading, compressing and packaging;

the image processing module is used for taking screenshots of operations whenever any process of the host operating system accesses a file of a specified class;

the remote control analysis module is used for analyzing the file traversal operations of all applications, distinguishing user operations from non-user operations, and preventing remote control programs from viewing the user's key locations, so as to protect user privacy;

the network flow analysis module is used for correlating the content feature values of the classified files of all file types specified by the user and calculating on which computers files with those features are distributed, thereby generating a flow chart.

In the above technical solution, the network flow analysis module also lets the user set the specified file content types or select a file classification scheme for the user's industry, such as building, drawing or contract content classes; once the setting is complete, the system synchronously issues the classification information set by the user to every terminal computer in the network;

in the above technical solution, the content classification module is further configured to sample the content of files operated on in the computer, compare the type of the file content on the terminal computer against the file content classification information set by the user, record the digital signature of each application that accesses such a file as a sample for the application feature library, and record the sample in a hash database on the computer for subsequent queries;

in the above technical solution, the file content marking module is further configured to extract the theme of the classified file content from the main body of the file, compare it for similarity with the file content classification information set by the user to form a unique central feature of the file content, add it by feature value to the hash database on the computer, and synchronize it to the server;

in the above technical solution, the file monitoring module is further configured to examine all files operated on in the computer and identify the operation mode applied to each file, for example: sending-out, printing, copying, uploading, downloading, burning and intercepting operations; when such an operation is identified, the file information and application process information are looked up in the application feature library, it is judged whether a file of a content class set by the user is being operated on, and the file information is recorded, added to the hash database on the computer and synchronized to the server;

in the above technical solution, the image processing module also runs on the computer: when an application operates on a file belonging to a content class set by the user, it takes a screenshot, performs text recognition on the image to identify whether the name or content information of the operated file appears in it, and transmits the image back to the server for recording.

In the above technical solution, the remote control analysis module is further configured to analyze files on the computer that have been identified as belonging to a content class set by the user, and to look up the process information and digital signature of the application accessing the file in the application feature library. If the application is not in the application feature library, accesses user privacy locations such as the desktop and My Documents, and at the same time performs network connection operations, the access is judged to be a background access by a remote control program, and the data is transmitted back to the server for recording.

In the above technical solution, the network flow analysis module performs, through the server, a comprehensive aggregate analysis of the data uploaded by the terminal computers of the whole network: it performs association analysis on the results provided by the file content marking module of each terminal computer, associates the operations performed on each file, associates the data uploaded by the image processing module of each terminal computer, performs association analysis on the data about all classified files uploaded by the file monitoring module of each terminal computer, and performs flow analysis according to the distribution of all classified files. For example: a building-class file on terminal computer A is sent out via WeChat; terminal computer B receives it and copies it to a USB flash drive; the file is then copied from the USB flash drive to terminal computer C and printed; at the same time, a remote control program on terminal computer C sends the file out to the xx mailbox.

In the above technical solution, the network flow analysis module is also used for displaying the flow results of the classified files on the terminal computers of the whole network as a graph, for sending e-mails, and for producing file flow reports. The flow direction of each file can thus be checked, which prevents intellectual property files within the enterprise from flowing to the Internet or to competitors and helps the enterprise manage and control each classified file in a standardized way.

The beneficial effects of the invention are as follows: the invention overcomes the difficulties of file flow management in the prior art, in which file flow within the network could not be controlled or managed and audit data had to be reviewed through laborious manual work. It helps enterprises manage their internal intellectual property files, prevents files from being sent out by remote control programs, prevents important intellectual property files from being sent out to the Internet, and lets enterprises check the flow direction of files in real time and manage them in a standardized way.

The method removes the bottleneck of traditional file keyword matching: it extracts the theme from the main body of the file content to obtain the central content of the file, which is then used for classification. File flow across the whole network can be reviewed without manual auditing of the data, which provides enterprises with file flow data, facilitates standardized management and control, and prevents enterprise intellectual property files from flowing to the Internet or to competitors.

Detailed Description

The technical solutions of the present invention are described clearly and completely below with reference to its embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person skilled in the art can derive from the embodiments given in this patent application without inventive effort fall within the scope of protection of this patent application.

An embodiment of the data flow classification management method includes the following steps:

Step one: collecting data from the industry in which the user operates and training on it to establish a classification model or a set of keywords, the industry classification model or keywords being set according to the user's industry classification; classifying files that are operated on in a computer by means of the classification model or by keyword matching; if an operated file belongs to a class specified by the user or hits a keyword specified by the user, marking the file and saving its feature vector; recording the application features of the application process that operates on the marked file and establishing an application feature library; taking a screenshot of the application that operates on the marked file and at the same time performing text recognition on the screenshot to identify whether the file name or file content information of the marked file being operated on appears in the image; when a marked file is compressed, recording and associating the hash value of the marked file with the hash value of the resulting compressed file; when a compressed file is decompressed, if a decompressed file is identified as a target file, recording and associating the hash value of the target file with the hash value of the compressed file; storing the association information in a graph database for flow analysis; and uploading the marked file, the application feature record of the application that operated on the file, the operation information and the screenshot to a server. The application feature record comprises a digital signature, manufacturer information, product information, operation mode and file data behavior, and the operation modes comprise sending out, uploading, downloading, copying, printing, burning, compressing and decompressing.
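The keyword branch of step one and its compression bookkeeping can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `Marker` class, its method names, the use of SHA-256 as the hash, and plain substring matching are all hypothetical choices, since the text specifies neither a hash function nor a matching rule.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hex (hash choice is an assumption)."""
    return hashlib.sha256(data).hexdigest()


@dataclass
class Marker:
    """Hypothetical terminal-side store of marked files and archive associations."""
    keywords: Set[str]
    marked: Dict[str, str] = field(default_factory=dict)            # file hash -> file name
    archive_links: Set[Tuple[str, str]] = field(default_factory=set)  # (file hash, archive hash)

    def mark_if_hit(self, name: str, content: bytes) -> bool:
        """Mark the file if its text contains any user-specified keyword."""
        text = content.decode("utf-8", errors="ignore").lower()
        if any(k.lower() in text for k in self.keywords):
            self.marked[sha256_hex(content)] = name
            return True
        return False

    def on_compress(self, members: List[bytes], archive: bytes) -> None:
        """Associate each already-marked member with the hash of the archive."""
        archive_hash = sha256_hex(archive)
        for content in members:
            file_hash = sha256_hex(content)
            if file_hash in self.marked:
                self.archive_links.add((file_hash, archive_hash))

    def on_decompress(self, archive: bytes, extracted: Dict[str, bytes]) -> None:
        """Classify each extracted file; link target files back to the archive."""
        archive_hash = sha256_hex(archive)
        for name, content in extracted.items():
            if self.mark_if_hit(name, content):
                self.archive_links.add((sha256_hex(content), archive_hash))
```

Because both directions store the same (file hash, archive hash) pair, a file that is packed on one computer and unpacked on another resolves to the same association, which is what lets the later flow analysis follow a file through an archive.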

Step two: identifying and analyzing file flow on the server side: the server collects the marked files, identifies identical marked files by hash comparison and groups highly similar files into the same class; specifically, the similarity between the feature vector of a marked file and the central feature vector of each existing class is calculated, and if the similarity to a class is greater than a set threshold, the file is added to that class and the central feature vector of the class is updated; otherwise the file forms a class of its own; when a newly collected marked file is classified, the new file is associated with the historical classified files by operation mode and operation time, and the association is recorded in the graph database; the origin and subsequent flow of a selected marked file are traced by querying its flow association information in the graph database and are displayed as a graph.
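The threshold-based grouping in step two can be illustrated with a short sketch. The text only says "similarity calculation", so cosine similarity, the running-mean update of the central feature vector, and the names `CentroidClassifier` and `assign` are assumptions made for illustration.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class CentroidClassifier:
    """Hypothetical server-side grouping of marked files by feature vector."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.centroids: List[List[float]] = []  # central feature vector per class
        self.counts: List[int] = []             # number of files per class

    def assign(self, vec: List[float]) -> int:
        """Return the class index for vec, creating a new class if none is close enough."""
        best, best_sim = -1, -1.0
        for i, centroid in enumerate(self.centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = i, sim
        if best >= 0 and best_sim > self.threshold:
            # Join the class and update its central feature vector (running mean).
            n = self.counts[best]
            self.centroids[best] = [
                (c * n + v) / (n + 1) for c, v in zip(self.centroids[best], vec)
            ]
            self.counts[best] = n + 1
            return best
        # Otherwise the file forms a class of its own.
        self.centroids.append(list(vec))
        self.counts.append(1)
        return len(self.centroids) - 1
```

A running mean keeps the update incremental, so the server does not need to retain every member vector; other update rules would fit the description equally well.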

Step three: early warning: counting and classifying the sending-out, uploading, downloading, copying, printing and burning operations performed on the selected marked file on each computer, and determining a marked file with a high operation count to be a high-risk file; analyzing the collected marked files by looking up, in the application feature library, the process information and digital signature of the application that accessed the file; and if the application is not in the application feature library, accesses user privacy locations and at the same time performs network connection operations, judging the access to be a background access by a remote control program and issuing an early warning. For example: a building-class file exists on terminal computer A; after it is sent out via WeChat, terminal computer B receives it and copies it to a USB flash drive; the file on the USB flash drive is copied to terminal computer C and printed; at the same time, a remote control program on terminal computer C sends the file out to the mailbox "123456@qq.com"; together these events form the network flow relationship graph of the classified file. If the operation mode of a newly collected file record is "received via WeChat", and the historical classified files contain a WeChat send of the file at a consistent time, the two records are associated by a "WeChat outgoing" relation and recorded in the graph database.
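The WeChat example suggests how new records might be linked to history by operation mode and time. The sketch below is one possible reading: the `Event` record, the `PAIRS` table, the operation names and the time window are hypothetical, and a plain list of edges stands in for the graph database.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Edge = Tuple[str, str, str]  # (source computer, relation label, target computer)


@dataclass(frozen=True)
class Event:
    computer: str
    file_class: int
    operation: str    # e.g. "wechat_send", "wechat_receive" (names are assumptions)
    timestamp: float  # seconds


# Hypothetical pairing table: the outgoing operation that a receiving operation answers.
PAIRS = {"wechat_receive": "wechat_send", "usb_copy_in": "usb_copy_out"}


def link_event(new: Event, history: List[Event], window: float = 60.0) -> Optional[Edge]:
    """Link a newly collected record to the latest matching historical record."""
    wanted = PAIRS.get(new.operation)
    if wanted is None:
        return None
    candidates = [
        e for e in history
        if e.operation == wanted
        and e.file_class == new.file_class
        and 0 <= new.timestamp - e.timestamp <= window  # "the time is consistent"
    ]
    if not candidates:
        return None
    source = max(candidates, key=lambda e: e.timestamp)
    return (source.computer, wanted, new.computer)


def trace(edges: List[Edge], start: str) -> List[Edge]:
    """Follow the flow downstream from a starting computer, breadth first."""
    path, frontier, seen = [], [start], {start}
    while frontier:
        node = frontier.pop(0)
        for src, label, dst in edges:
            if src == node and dst not in seen:
                path.append((src, label, dst))
                seen.add(dst)
                frontier.append(dst)
    return path
```

With the A to B to C scenario, `trace(edges, "A")` returns the WeChat edge followed by the USB edge, which is the ordered flow that would be drawn as a graph.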

Correspondingly, the above technical solution is implemented by a data flow classification management system, which includes: a content classification module, a file content marking module, a file monitoring module, an image processing module, a remote control analysis module and a network flow analysis module.

The content classification module is used for setting the industry file classification information of the data flow classification management system and for classifying and marking files according to the file types set by the user. It is also used for sampling the content of files operated on in the computer, comparing the type of the file content on the terminal computer against the file content classification information set by the user, recording the digital signature of each application that accesses such a file as a sample for the application feature library, and recording the sample in a hash database on the computer for subsequent queries.

The file content marking module is used for sampling the content of files of the classified types and calculating a feature value. It extracts the theme of the classified file content from the main body of the file, compares it with the file content classification information set by the user to form a unique central feature of the file content, adds it by feature value to the hash database on the computer, and synchronizes it to the server.

The file monitoring module is used for monitoring operations on and content changes to local files in real time, such as uploading, printing, copying, downloading, compressing and packaging. It also examines all files operated on in the computer and identifies the operation mode applied to each file, for example: sending-out, printing, copying, uploading, downloading, burning and intercepting operations. When such an operation is identified, the file information and application process information are looked up in the application feature library, it is judged whether a file of a content class set by the user is being operated on, and the file information is recorded, added to the hash database on the computer and synchronized to the server.

The image processing module is used for taking screenshots of operations whenever any process of the host operating system accesses a file of a specified class. It runs on the computer: when an application operates on a file belonging to a content class set by the user, it takes a screenshot, performs text recognition on the image to identify whether the name or content information of the operated file appears in it, and transmits the image back to the server for recording.
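The check performed after text recognition can be sketched on its own. The example assumes the OCR step has already produced a string (the text names no OCR engine, so none is called here); `screenshot_shows_file` and `content_phrases` are hypothetical names.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so OCR line breaks do not block matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def screenshot_shows_file(ocr_text: str, file_name: str, content_phrases: List[str]) -> Dict:
    """Report whether the recognized text mentions the file name or known content phrases.

    ocr_text is assumed to come from an external text-recognition step.
    """
    haystack = normalize(ocr_text)
    name = normalize(file_name)
    name_hit = bool(name) and name in haystack
    phrase_hits = [p for p in content_phrases if normalize(p) and normalize(p) in haystack]
    return {
        "file_name": name_hit,
        "phrases": phrase_hits,
        "match": name_hit or bool(phrase_hits),
    }
```

Exact substring matching is the simplest possible rule; real OCR output contains character errors, so a tolerant comparison would likely be needed in practice.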

The remote control analysis module is used for analyzing the file traversal operations of all applications, distinguishing user operations from non-user operations, and preventing remote control programs from viewing the user's key locations, so as to protect user privacy. It analyzes files on the computer that have been identified as belonging to a content class set by the user, and looks up the process information and digital signature of the application accessing the file in the application feature library. If the application is not in the application feature library, accesses user privacy locations such as the desktop and My Documents, and at the same time performs network connection operations, the access is judged to be a background access by a remote control program, and the data is transmitted back to the server for recording.
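The three conditions that together trigger the remote-control judgment can be written as a single predicate. This is a sketch under assumptions: the `Access` record, the (signer, product) key for the application feature library, and the example privacy directories are hypothetical, and real signature verification is out of scope.

```python
from dataclasses import dataclass
from pathlib import PureWindowsPath
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Access:
    signer: Optional[str]          # digital-signature subject; None if unsigned
    product: str
    path: str                      # file path that the process accessed
    has_network_connection: bool   # process holds a network connection at the same time


# Hypothetical privacy locations (desktop and document folders of an example user).
PRIVACY_DIRS = [
    PureWindowsPath(r"C:\Users\alice\Desktop"),
    PureWindowsPath(r"C:\Users\alice\Documents"),
]


def in_privacy_location(path: str) -> bool:
    """True if path is a privacy directory or lies anywhere beneath one."""
    p = PureWindowsPath(path)
    return any(d == p or d in p.parents for d in PRIVACY_DIRS)


def is_remote_control_access(a: Access, feature_library: Set[Tuple[str, str]]) -> bool:
    """All three conditions must hold: unknown application, privacy location, network use."""
    known = a.signer is not None and (a.signer, a.product) in feature_library
    return (not known) and in_privacy_location(a.path) and a.has_network_connection
```

Requiring all three conditions keeps known, signed applications and purely local tools from raising warnings; only an unrecognized process that reads private locations while connected to the network is flagged.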

The network flow analysis module is used for correlating the content feature values of the classified files of all file types specified by the user and calculating on which computers files with those features are distributed, thereby generating a flow chart. It also lets the user set the specified file content types or select a file classification scheme for the user's industry, such as building, drawing or contract content classes; once the setting is complete, the system synchronously issues the classification information set by the user to every terminal computer in the network. Through the server, the module performs a comprehensive aggregate analysis of the data uploaded by the terminal computers of the whole network: it performs association analysis on the results provided by the file content marking module of each terminal computer, then associates the operations performed on each file, associates the data uploaded by the image processing module of each terminal computer, performs association analysis on the data about all classified files uploaded by the file monitoring module of each terminal computer, and finally performs flow analysis according to the distribution of all classified files. For example: a building-class file on terminal computer A is sent out via WeChat; terminal computer B receives it and copies it to a USB flash drive; the file is then copied from the USB flash drive to terminal computer C and printed; at the same time, a remote control program on terminal computer C sends the file out to the xx mailbox.

The network flow analysis module is also used for displaying the flow results of the classified files on the terminal computers of the whole network as a graph, for sending e-mails, and for producing file flow reports. The flow direction of each file can thus be checked, which prevents intellectual property files within the enterprise from flowing to the Internet or to competitors and helps the enterprise manage and control each classified file in a standardized way.
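The aggregation behind the flow chart and the high-risk determination of step three can be sketched as one pass over uploaded records. The record layout, the operation names in `RISKY_OPS`, and the threshold value are assumptions; the text only says that files with a high operation count are treated as high-risk.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

# Operation names are hypothetical labels for the six counted operation modes.
RISKY_OPS = {"send_out", "upload", "download", "copy", "print", "burn"}


def summarize(records: Iterable[Tuple[str, str, str]], high_risk_threshold: int = 3) -> Dict:
    """Aggregate (computer, file_id, operation) records uploaded by terminal computers."""
    op_counts: Counter = Counter()
    distribution = defaultdict(set)
    for computer, file_id, operation in records:
        distribution[file_id].add(computer)   # on which computers the file appears
        if operation in RISKY_OPS:
            op_counts[file_id] += 1           # count only the monitored operation modes
    high_risk = sorted(f for f, n in op_counts.items() if n >= high_risk_threshold)
    return {
        "distribution": {f: sorted(c) for f, c in distribution.items()},
        "operation_counts": dict(op_counts),
        "high_risk": high_risk,
    }
```

The `distribution` mapping answers "on which computers are files with these features found", and `high_risk` lists the files whose count of monitored operations reaches the chosen threshold.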

The above describes only specific embodiments of the present invention, but the scope of protection of the present invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope of the present invention shall be covered by its scope of protection. The scope of protection of this patent shall therefore be determined by the scope of protection of the claims.
