Access behavior identification method suitable for encrypted HTTP/2 webpage

Document No.: 7882   Publication date: 2021-09-17

1. An access behavior identification method for encrypted HTTP/2 web pages, characterized by comprising the following steps:

step 1: extracting information about the encrypted HTTP/2 web page as the fingerprint feature of the web page;

step 2: monitoring the TLS streams used to access the website and, when concurrent requests exist, blocking them so that requests are sent one at a time, each request being sent only after the response to the previous request has been received;

step 3: extracting all response data transmitted between two adjacent requests, matching the response data within a preset sliding time window against the fingerprint feature, and, if the response data fully matches the fingerprint feature, determining that the web page was accessed within the time window corresponding to those requests.

2. The access behavior identification method for encrypted HTTP/2 web pages according to claim 1, wherein step 1 comprises the following process:

step 101: accessing the same encrypted HTTP/2 web page multiple times, at different times and with different browsers, and obtaining the plaintext traffic and ciphertext traffic of the web page accesses under each condition;

step 102: extracting, from the plaintext traffic and ciphertext traffic, the type, domain name, and corresponding ciphertext length mean of each Web resource contained in the web page, and using them as the fingerprint feature of the web page.

3. The method according to claim 2, wherein in step 101 at least 4 browsers are used to access the encrypted HTTP/2 web page at no fewer than 10 random time points spaced more than 6 hours apart, and the plaintext traffic and ciphertext traffic of at least 40 accesses are collected to form a data set.

4. The method according to claim 2, wherein step 102 comprises extracting, from the plaintext traffic and ciphertext traffic, the type $T(R_i)$, domain name $N(R_i)$, and corresponding ciphertext length mean $E(R_i)$ of each Web resource $R_i$ contained in the web page, where $1 \le i \le N$ and $N$ is the number of Web resources contained in the web page; and then constructing the fingerprint feature of the web page as $FP = \{FR_i \mid 1 \le i \le N,\ T(R_i) \in TP\}$, where $FR_i = \{N(R_i), L(R_i)\}$ denotes the feature set of Web resource $R_i$; $TP = \{\text{document}, \text{javascript}, \text{css}\}$ denotes the three Web resource types that are significantly related to the web page content; and $L(R_i) = [E(R_i)\cdot(1-\alpha),\ E(R_i)\cdot(1+\alpha)]$ denotes the ciphertext length interval obtained by scaling $E(R_i)$ by the elasticity coefficient $\alpha$.

5. The method according to claim 1, wherein step 2 comprises monitoring the TLS streams of all connections to the web page and, when multiple concurrent requests are initiated in a TLS stream, blocking the concurrent requests so that only one request is sent at a time, the next request being sent only after the response content of the previous request has been received.

6. The method according to claim 5, wherein monitoring the TLS streams of all connections to the web page comprises obtaining the resource domain name set NL of the web page and monitoring the TLS streams of all connections whose domain names are in NL, where $NL = \{N(R_i) \mid 1 \le i \le N,\ T(R_i) \in TP\}$, $N(R_i)$ is the domain name of Web resource $R_i$ contained in the web page, $N$ is the number of Web resources contained in the web page, $T(R_i)$ is the type of Web resource $R_i$ contained in the web page, and $TP = \{\text{document}, \text{javascript}, \text{css}\}$ denotes the three Web resource types that are significantly related to the web page content.

7. The method according to claim 1, wherein extracting all response data transmitted between two adjacent requests in step 3 comprises, in the monitored TLS stream used to access the website, in which requests are sent one at a time after the concurrent requests have been blocked, treating all response data transmitted between any two adjacent requests $req_j$ and $req_{j+1}$ as the Web resource $R'_j$ returned in response to $req_j$, and extracting its domain name $N(R'_j)$ and ciphertext length $l(R'_j)$.

8. The method according to claim 7, wherein matching the response data within the preset sliding time window against the fingerprint feature in step 3 comprises obtaining, within a preset sliding time window of length $t$, the Web resources of the accessed web pages as the attribute set $RP = \{FR'_j \mid 0 \le j \le N'\}$ to be matched against the fingerprint feature, where $N'$ is the number of Web resources accessed within the sliding time window and $FR'_j = \{N(R'_j), l(R'_j)\}$ denotes the attributes of Web resource $R'_j$.

9. The method according to claim 8, wherein a full match with the fingerprint feature in step 3 means the following: within the sliding time window, the attribute set RP of the Web resources accessed by the user is matched against the fingerprint feature FP of the web page to obtain the set of user-accessed Web resources that match FP, $RM = \{R'_j \mid N(R'_j) = N(R_i),\ l(R'_j) \in L(R_i),\ 1 \le i \le |FP|,\ 0 \le j \le N'\}$, where $|FP|$ is the number of feature Web resources in the fingerprint feature; that is, among the Web resources accessed by the user, those are selected that have the same domain name as some Web resource in the feature fingerprint and a ciphertext length within that resource's length interval; the number of elements in RM, i.e., the number of successfully matched Web resources, is denoted $|RM|$, and when $|RM| = |FP|$ the fingerprint feature is fully matched.

10. The method according to claim 1, wherein in step 3 the preset sliding time window is no shorter than the time required to send a single request and finish receiving its response.

Background

In recent years, with the development of internet technology, more and more websites have adopted the TLS protocol to transmit web page content, which poses no small challenge to internet behavior management. When a user accesses a web page, the TLS protocol encrypts the page content with a negotiated encryption algorithm, so an internet behavior management system can no longer identify the user's web page access behavior by analyzing the transmitted plaintext content as it did in the past.

To solve this problem, some internet behavior management systems deploy a CA certificate on the user's device and proxy the user's TLS traffic, decrypting the page content the user accesses in order to identify the web page access behavior. However, this approach is costly to manage and may violate user privacy, and it is currently being replaced by web page fingerprinting. Because the TLS protocol does not significantly change features such as packet size, transmission direction, transmission order, and transmission interval while web page content is transferred, web page fingerprinting uses these features to construct a fingerprint of the web page and thereby identify web page access behavior. However, as the HTTP/2 protocol becomes increasingly widespread, web page fingerprinting faces a new problem.

The HTTP/2 protocol is the successor to HTTP/1.1, and its greatest improvement in traffic transmission is the introduction of a multiplexing mechanism. Whereas HTTP/1.1 can receive the response to only one request at a time on a connection, multiplexing allows HTTP/2 to receive the responses to multiple requests simultaneously, which greatly improves transmission efficiency. However, this mechanism changes how HTTP web content is transmitted, so the packet features described above can no longer be observed during transmission. Therefore, the web page fingerprinting technology described above is no longer applicable to encrypted HTTP/2 web pages that use the TLS protocol.

Disclosure of Invention

To solve the technical problem that access behavior currently cannot be identified when encrypted HTTP/2 web pages based on the TLS protocol are accessed, the invention provides an access behavior identification method for encrypted HTTP/2 web pages.

To achieve this technical purpose, the technical solution of the invention is as follows.

An access behavior identification method for encrypted HTTP/2 web pages comprises the following steps:

step 1: extracting information about the encrypted HTTP/2 web page as the fingerprint feature of the web page;

step 2: monitoring the TLS streams used to access the website and, when concurrent requests exist, blocking them so that requests are sent one at a time, each request being sent only after the response to the previous request has been received;

step 3: extracting all response data transmitted between two adjacent requests, matching the response data within a preset sliding time window against the fingerprint feature, and, if the response data fully matches the fingerprint feature, determining that the web page was accessed within the time window corresponding to those requests.

In the access behavior identification method for encrypted HTTP/2 web pages, step 1 comprises the following process:

step 101: accessing the same encrypted HTTP/2 web page multiple times, at different times and with different browsers, and obtaining the plaintext traffic and ciphertext traffic of the web page accesses under each condition;

step 102: extracting, from the plaintext traffic and ciphertext traffic, the type, domain name, and corresponding ciphertext length mean of each Web resource contained in the web page, and using them as the fingerprint feature of the web page.

In the method, in step 101, at least 4 browsers are used to access the encrypted HTTP/2 web page at no fewer than 10 random time points spaced more than 6 hours apart, and the plaintext traffic and ciphertext traffic of at least 40 accesses are collected to form a data set.

In the method, step 102 comprises extracting, from the plaintext traffic and ciphertext traffic, the type $T(R_i)$, domain name $N(R_i)$, and corresponding ciphertext length mean $E(R_i)$ of each Web resource $R_i$ contained in the web page, where $1 \le i \le N$ and $N$ is the number of Web resources contained in the web page; and then constructing the fingerprint feature of the web page as $FP = \{FR_i \mid 1 \le i \le N,\ T(R_i) \in TP\}$, where $FR_i = \{N(R_i), L(R_i)\}$ denotes the feature set of Web resource $R_i$; $TP = \{\text{document}, \text{javascript}, \text{css}\}$ denotes the three Web resource types that are significantly related to the web page content; and $L(R_i) = [E(R_i)\cdot(1-\alpha),\ E(R_i)\cdot(1+\alpha)]$ denotes the ciphertext length interval obtained by scaling $E(R_i)$ by the elasticity coefficient $\alpha$.

In the method, step 2 comprises monitoring the TLS streams of all connections to the web page and, when multiple concurrent requests are initiated in a TLS stream, blocking the concurrent requests so that only one request is sent at a time, the next request being sent only after the response content of the previous request has been received.

In the method, monitoring the TLS streams of all connections to the web page comprises obtaining the resource domain name set NL of the web page and monitoring the TLS streams of all connections whose domain names are in NL, where $NL = \{N(R_i) \mid 1 \le i \le N,\ T(R_i) \in TP\}$, $N(R_i)$ is the domain name of Web resource $R_i$ contained in the web page, $N$ is the number of Web resources contained in the web page, $T(R_i)$ is the type of Web resource $R_i$ contained in the web page, and $TP = \{\text{document}, \text{javascript}, \text{css}\}$ denotes the three Web resource types that are significantly related to the web page content.

In the method, extracting all response data transmitted between two adjacent requests in step 3 comprises, in the monitored TLS stream used to access the website, in which requests are sent one at a time after the concurrent requests have been blocked, treating all response data transmitted between any two adjacent requests $req_j$ and $req_{j+1}$ as the Web resource $R'_j$ returned in response to $req_j$, and extracting its domain name $N(R'_j)$ and ciphertext length $l(R'_j)$.

In the method, matching the response data within the preset sliding time window against the fingerprint feature in step 3 comprises obtaining, within a preset sliding time window of length $t$, the Web resources of the accessed web pages as the attribute set $RP = \{FR'_j \mid 0 \le j \le N'\}$ to be matched against the fingerprint feature, where $N'$ is the number of Web resources accessed within the sliding time window and $FR'_j = \{N(R'_j), l(R'_j)\}$ denotes the attributes of Web resource $R'_j$.

In the method, a full match with the fingerprint feature in step 3 means the following: within the sliding time window, the attribute set RP of the Web resources accessed by the user is matched against the fingerprint feature FP of the web page to obtain the set of user-accessed Web resources that match FP, $RM = \{R'_j \mid N(R'_j) = N(R_i),\ l(R'_j) \in L(R_i),\ 1 \le i \le |FP|,\ 0 \le j \le N'\}$, where $|FP|$ is the number of feature Web resources in the fingerprint feature. That is, among the Web resources accessed by the user, those are selected that have the same domain name as some Web resource in the feature fingerprint and a ciphertext length within that resource's length interval. The number of elements in RM, i.e., the number of successfully matched Web resources, is denoted $|RM|$; when $|RM| = |FP|$, the fingerprint feature is fully matched.

In the method, in step 3, the preset sliding time window is no shorter than the time required to send a single request and finish receiving its response.

The technical effect of the invention is that it makes full use of the characteristics of the HTTP/2 and TLS protocols: by blocking concurrent HTTP/2 requests, it restores the transmitted traffic to an identifiable pattern, and by extracting the plaintext domain name attribute and the ciphertext packet attributes from the TLS stream, it identifies access to encrypted HTTP/2 web pages without decrypting the user's traffic, with high reliability and stability.

Embodiments of the present invention will be described below with reference to the drawings.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

Referring to FIG. 1, the access behavior identification method for encrypted HTTP/2 web pages provided by this embodiment includes the following steps:

Step 1: Collect the plaintext and ciphertext traffic generated when accessing a target encrypted HTTP/2 web page H, including:

Step 1.1: First, construct a traffic collection environment. In this embodiment, the browser is configured to allow the TLS session keys to be exported, so that the collected traffic can be decrypted to obtain plaintext traffic data.

Step 1.2: Use an automation script written with WebDriver to drive different browsers to access the target encrypted HTTP/2 web page H at different times, generating TLS-encrypted traffic. In this embodiment, 15 access time points are used, spaced 8 hours apart, and four browsers (IE, Chrome, Firefox, and Edge) each access the page at every time point, yielding data for 60 accesses in total.

Step 1.3: Capture the TLS-encrypted traffic of each access with tshark and decrypt it with the exported TLS session keys to obtain the plaintext traffic.
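By way of illustration only, the access schedule of Step 1.2 can be enumerated as below. This is a minimal sketch: the function name `build_access_plan` and the start date are hypothetical, and the actual browser automation and tshark capture are not shown.

```python
from datetime import datetime, timedelta
from itertools import product

# Browsers named in this embodiment.
BROWSERS = ["IE", "Chrome", "Firefox", "Edge"]

def build_access_plan(start, time_points=15, interval_hours=8, browsers=BROWSERS):
    """Return (visit_time, browser) pairs: every browser visits at every time point."""
    times = [start + timedelta(hours=interval_hours * k) for k in range(time_points)]
    return list(product(times, browsers))

# The start date is arbitrary and chosen only for the example.
plan = build_access_plan(datetime(2021, 1, 1, 0, 0))
print(len(plan))   # 15 time points x 4 browsers
print(plan[0])
print(plan[-1])
```

Each pair in `plan` corresponds to one access whose ciphertext capture and decrypted plaintext would be stored together in the data set.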

Step 2: Extract the features of the target web page H from the processed traffic, including:

Step 2.1: Parse the collected plaintext traffic of the accesses to web page H to obtain each Web resource $R_i$ contained in the target HTTP/2 web page, where $1 \le i \le N$ and $N$ is the number of Web resources contained in web page H. A Web resource is the content returned for one of the HTTP requests issued while the web page loads, such as HTML text, JS scripts, and JPG images.

Step 2.2: Extract the type $T(R_i)$ and domain name $N(R_i)$ of each Web resource $R_i$ from the plaintext traffic, locate the corresponding ciphertext traffic of that resource in the ciphertext traffic, and average its ciphertext lengths over the accesses made by the different browsers at the different times to obtain the ciphertext length mean $E(R_i)$.
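As a non-limiting illustration of Step 2.2, the ciphertext length mean $E(R_i)$ can be computed by grouping the per-access length observations by resource. The resource identifiers and length values below are invented for the example, and the helper name `ciphertext_length_means` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def ciphertext_length_means(observations):
    """observations: iterable of (resource_id, ciphertext_length), one pair per access.

    Returns a dict mapping each resource to its mean ciphertext length E(R_i).
    """
    lengths = defaultdict(list)
    for resource_id, length in observations:
        lengths[resource_id].append(length)
    return {rid: mean(vals) for rid, vals in lengths.items()}

# Invented sample data: the same resources observed across several accesses.
samples = [
    ("index.html", 10480), ("index.html", 10520),
    ("app.js", 52000), ("app.js", 52400), ("app.js", 51900),
]
E = ciphertext_length_means(samples)
print(E)
```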

Step 3: Construct the fingerprint feature of web page H, including:

Step 3.1: From the Web resources of web page H, select those whose type is in the set $TP = \{\text{document}, \text{javascript}, \text{css}\}$ as the fingerprint Web resources. TP denotes the three Web resource types that are significantly related to the web page content, and these types are used to pick suitable resources for constructing the feature fingerprint. Here, document is the document type, whose resources include html, php, jsp, asp, and the like; javascript is the scripting language type, whose resources are mainly js; css is the cascading style sheet type, whose resources are mainly css. Other Web resources, such as fonts or icons, are not significantly related to the web page content and are not used to identify the web page.

Step 3.2: Construct the feature set $FR_i = \{N(R_i), L(R_i)\}$ of each fingerprint Web resource, where $L(R_i) = [E(R_i)\cdot(1-\alpha),\ E(R_i)\cdot(1+\alpha)]$ denotes the ciphertext length interval obtained by scaling $E(R_i)$ by the elasticity coefficient $\alpha$. The value of $\alpha$ is chosen according to the stability of the network environment and is set to 5% in this embodiment; for example, a mean ciphertext length of 10000 then yields the interval [9500, 10500]. Because the target web page must be identified in a setting where the traffic cannot be decrypted, the ciphertext content cannot be inspected, so only the ciphertext length and the domain name are available as features. The length characterizes a resource of the target web page, and the domain name indicates the website on which the target web page resides.

Step 3.3: Merge the feature sets of the fingerprint Web resources into the fingerprint feature FP of the whole web page H, where $FP = \{FR_i \mid 1 \le i \le N,\ T(R_i) \in TP\}$.

Step 3.4: Obtain the Web resource domain name set $NL = \{N(R_i) \mid 1 \le i \le N,\ T(R_i) \in TP\}$ of web page H, which is used to identify the TLS streams that need to be monitored later. The domain name indicates the website on which the target web page resides and is used to determine which network connections must be monitored during the subsequent identification process, which improves identification efficiency and reduces the false alarm rate.
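Steps 3.1 to 3.4 can be sketched as follows, purely as an example. The class names `Resource` and `Feature`, the function `build_fingerprint`, and the sample domains and lengths are assumptions made for illustration; only the type filter TP, the interval formula, and the value $\alpha = 5\%$ come from the description above.

```python
from typing import NamedTuple

TP = {"document", "javascript", "css"}  # resource types kept in the fingerprint

class Resource(NamedTuple):
    rtype: str        # T(R_i)
    domain: str       # N(R_i)
    mean_len: float   # E(R_i)

class Feature(NamedTuple):
    domain: str   # N(R_i)
    low: float    # E(R_i) * (1 - alpha)
    high: float   # E(R_i) * (1 + alpha)

def build_fingerprint(resources, alpha=0.05):
    """Return FP: one Feature per resource whose type is in TP."""
    return [
        Feature(r.domain, r.mean_len * (1 - alpha), r.mean_len * (1 + alpha))
        for r in resources
        if r.rtype in TP
    ]

# Invented resources of a hypothetical page H.
resources = [
    Resource("document", "www.example.com", 10000),
    Resource("javascript", "static.example.com", 52000),
    Resource("font", "static.example.com", 30000),  # dropped: type not in TP
]
FP = build_fingerprint(resources)
NL = {f.domain for f in FP}  # domain name set used to choose streams to monitor
print(FP)
print(NL)
```

Because NL is derived from the same filtered resources as FP, only connections to domains that can contribute to a match are monitored.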

Step 4: Identify and block the HTTP/2 traffic, restoring it to the HTTP/1.1 transmission pattern, including:

Step 4.1: In the user network in which access to the target HTTP/2 web page is to be identified, use the Web resource domain name set NL of web page H to identify and monitor each TLS stream S that may be accessing the target web page H.

Step 4.2: when multiple concurrent requests are initiated in the TLS stream S, it is determined that the stream is a multiplexed HTTP/2 stream.

Step 4.3: Block the concurrent requests in TLS stream S so that only one request is sent at a time, the next request being sent if and only if the response content of the previous request has been received, thereby restoring the HTTP/2 stream to the transmission pattern of HTTP/1.1. Note that this embodiment downgrades only the traffic transmission pattern; the transferred content is unaffected, and the browser and the website still communicate over HTTP/2, since the protocol indicated in the transferred content is still HTTP/2.
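The queueing discipline of Step 4.3 can be modeled as below. This is only a sketch of the one-request-in-flight rule: the class `RequestSerializer` and its methods are hypothetical, and how the monitoring device holds back requests on the wire or detects that an encrypted response is complete is not specified here and is not modeled.

```python
from collections import deque

class RequestSerializer:
    """Release at most one outstanding request per monitored TLS stream."""

    def __init__(self):
        self._pending = deque()   # requests held back while one is in flight
        self._in_flight = None    # request currently awaiting its response
        self.sent = []            # order in which requests were released

    def submit(self, request):
        """Called when the client issues a request: forward it or queue it."""
        if self._in_flight is None:
            self._forward(request)
        else:
            self._pending.append(request)

    def response_complete(self):
        """Called once the full response to the in-flight request has arrived."""
        self._in_flight = None
        if self._pending:
            self._forward(self._pending.popleft())

    def _forward(self, request):
        self._in_flight = request
        self.sent.append(request)

s = RequestSerializer()
for req in ["GET /", "GET /app.js", "GET /style.css"]:  # three concurrent requests
    s.submit(req)
print(s.sent)           # only the first request has been released so far
s.response_complete()   # first response finished, so the second request is released
print(s.sent)
```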

Step 5: Identify the Web resource traffic $R'_j$ transported in TLS stream S and extract the features of these Web resources to construct the feature set of the Web resources accessed by the user, including:

Step 5.1: Treat all response data transmitted between any two adjacent TCP requests $req_j$ and $req_{j+1}$ in TLS stream S as the Web resource $R'_j$ returned in response to $req_j$.

Step 5.2: Extract the domain name $N(R'_j)$ and ciphertext length $l(R'_j)$ of Web resource $R'_j$, where $N(R'_j)$ is obtained from a plaintext field of the TLS handshake traffic and $l(R'_j)$ is obtained by accumulating the lengths of all response data transmitted between $req_j$ and $req_{j+1}$.

Step 5.3: Specify a time interval $t$, set to 3 s in this embodiment, and use a sliding time window of length $t$ to obtain the feature set $RP = \{FR'_j \mid 0 \le j \le N'\}$ of the Web resources accessed by the user, where $N'$ is the number of Web resources accessed within the time window and $FR'_j = \{N(R'_j), l(R'_j)\}$ denotes the features of Web resource $R'_j$.
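Steps 5.1 to 5.3 might be sketched as follows. Several points are assumptions made for the example rather than details of the embodiment: each request is taken to be a single upstream record, the last resource is closed at the end of the capture, a window is started at every observed request, and the names `Record`, `Observed`, `segment_resources`, and `sliding_windows` are hypothetical.

```python
from typing import NamedTuple

class Record(NamedTuple):
    ts: float        # capture time in seconds
    direction: str   # "up" = client to server (request), "down" = response data
    length: int      # ciphertext length of the record

class Observed(NamedTuple):
    ts: float     # time of request req_j
    domain: str   # N(R'_j)
    length: int   # l(R'_j)

def segment_resources(records, domain):
    """Attribute all response bytes between adjacent requests to the earlier request."""
    resources, current_ts, total = [], None, 0
    for rec in records:
        if rec.direction == "up":
            if current_ts is not None:
                resources.append(Observed(current_ts, domain, total))
            current_ts, total = rec.ts, 0
        elif current_ts is not None:
            total += rec.length
    if current_ts is not None:  # close the last resource at the end of the capture
        resources.append(Observed(current_ts, domain, total))
    return resources

def sliding_windows(resources, t=3.0):
    """Yield one window RP per request: resources requested within [ts, ts + t)."""
    for i, first in enumerate(resources):
        yield [r for r in resources[i:] if r.ts < first.ts + t]

# Invented records of one serialized TLS stream to a hypothetical domain.
records = [
    Record(0.00, "up", 120), Record(0.05, "down", 6000), Record(0.06, "down", 4000),
    Record(0.10, "up", 90), Record(0.20, "down", 52000),
    Record(5.00, "up", 95), Record(5.10, "down", 800),
]
observed = segment_resources(records, "www.example.com")
windows = list(sliding_windows(observed, t=3.0))
print(observed)
print([len(w) for w in windows])
```

The segmentation relies on the serialization of Step 4: without it, response data from several resources would be interleaved between two requests and the accumulated lengths would not correspond to single resources.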

Step 6: Identify whether access to the target web page H occurred within the time window, including:

Step 6.1: In each sliding time window, match the attribute set RP of the Web resources accessed by the user against the fingerprint feature FP of web page H to obtain the set of user-accessed Web resources that match FP, $RM = \{R'_j \mid N(R'_j) = N(R_i),\ l(R'_j) \in L(R_i),\ 1 \le i \le |FP|,\ 0 \le j \le N'\}$, where $|FP|$ is the number of feature Web resources in the fingerprint feature. That is, among the Web resources accessed by the user, those are selected that have the same domain name as some Web resource in the feature fingerprint and a ciphertext length within that resource's length interval.

Step 6.2: Denote the number of elements in RM, i.e., the number of successfully matched Web resources, as $|RM|$. When $|RM| = |FP|$, the feature fingerprint of web page H has been matched within the time window and the user is considered to have accessed web page H; otherwise, no access to web page H is identified.
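The matching of Step 6 can be illustrated as below, as a sketch under stated assumptions. The names `Feature`, `Observed`, `matched_resources`, and `page_accessed` and all sample values are hypothetical. The code counts every observed resource that matches some fingerprint entry, which is a literal reading of the definition of RM; how several observed resources falling into the same interval should be counted is not stated above.

```python
from typing import NamedTuple

class Feature(NamedTuple):
    domain: str   # N(R_i)
    low: float    # lower bound of L(R_i)
    high: float   # upper bound of L(R_i)

class Observed(NamedTuple):
    domain: str   # N(R'_j)
    length: int   # l(R'_j)

def matched_resources(RP, FP):
    """RM: observed resources with the same domain as, and a length inside, some FP entry."""
    return [
        r for r in RP
        if any(r.domain == f.domain and f.low <= r.length <= f.high for f in FP)
    ]

def page_accessed(RP, FP):
    """Report an access when |RM| = |FP|, the full-match condition of Step 6.2."""
    return len(matched_resources(RP, FP)) == len(FP)

# Invented fingerprint (alpha = 5%) and two invented windows.
FP = [
    Feature("www.example.com", 9500, 10500),
    Feature("static.example.com", 49400, 54600),
]
hit = [
    Observed("www.example.com", 10120),
    Observed("static.example.com", 51800),
    Observed("static.example.com", 300),   # unrelated resource, not matched
]
miss = [Observed("www.example.com", 10120), Observed("static.example.com", 300)]
print(page_accessed(hit, FP), page_accessed(miss, FP))
```

A stricter variant could instead count the distinct fingerprint entries that were matched; that choice is left open here.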
