Crawler detection method, device and readable storage medium

By using machine learning to generate analysis models from historical access traffic, this technology addresses the issue of relying on the professional experience of security personnel in existing technologies. It achieves more efficient web crawler detection, reduces false positive rates and costs, and improves network security.

CN116599686B · Active · Publication Date: 2026-06-23 · XIAMEN WANGSU CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
XIAMEN WANGSU CO LTD
Filing Date
2023-03-16
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing web crawler detection technologies rely on the professional experience of security personnel, resulting in high false alarm rates, high costs, and poor protection effectiveness, making it difficult to effectively counter advanced crawlers' bypass strategies.

Method used

By using machine learning to analyze historical access traffic, an analytical model is generated. This model is then used to detect web crawlers, reducing reliance on the professional experience of security personnel and improving detection accuracy while lowering the false positive rate.

Benefits of technology

It reduces reliance on the professional experience of security personnel, lowers false alarm rates and network security maintenance costs, and improves network protection effectiveness.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a crawler detection method and device and a readable storage medium. An analysis server periodically acquires traffic data of a target website, groups the traffic data according to IP addresses of various access requests in the traffic data, and obtains groups corresponding to different IP addresses. Then, the analysis server extracts features of the groups to obtain a first feature set of each IP address, and the features in the first feature set are used to represent access behaviors of the IP address. After that, the analysis server inputs the features in the first feature set of each IP address into an analysis model, and determines which IP addresses are normal IP addresses and which IP addresses are abnormal IP addresses among multiple IP addresses accessing the target website in a current period. By using the scheme, the historical access traffic is analyzed in advance to obtain the analysis model, the network crawler is detected by using the analysis model, and the dependence on professional experience of security personnel is reduced.

Description

TECHNICAL FIELD

[0001] Embodiments of the present application relate to the technical field of network security, and particularly relate to a crawler detection method, device and readable storage medium.

BACKGROUND

[0002] A network crawler, also known as a web spider, is a program or script that automatically captures network information according to certain rules. Some unscrupulous people use network crawlers for malicious crawling operations.

[0003] Traditional network crawler detection techniques mainly rely on frequency limits, cookie / js feature detection, browser fingerprint analysis, and business flow analysis. With the escalation of the confrontation between crawlers and anti-crawlers, some advanced crawlers continuously adjust their crawling methods, thereby successfully bypassing the protection strategy and crawling the target website.

[0004] In order to deal with advanced crawlers, network security personnel continuously analyze online data and develop new protection strategies. This approach relies heavily on the professional experience of security personnel, and a large number of crawlers can still bypass the new protection strategies, resulting in poor network security.

SUMMARY

[0005] Embodiments of the present application provide a crawler detection method, device and readable storage medium, which learn from historical access traffic in advance to obtain an analysis model and use the analysis model to detect network crawlers, thereby reducing the dependence on the professional experience of security personnel and improving network security.

[0006] In a first aspect, embodiments of the present application provide a crawler detection method, comprising:

[0007] periodically obtaining access requests for requesting access to a target website to obtain traffic data;

[0008] grouping the traffic data according to IP addresses of the access requests to obtain a plurality of groups, the access requests belonging to the same group having the same IP address;

[0009] for each IP address, determining a first feature set according to the group of the IP address to obtain a first feature set of each IP address, the features in the first feature set being used to represent the access behavior of the IP address;

[0010] inputting the first feature set of each IP address into a pre-trained analysis model to cause the analysis model to output abnormal IPs and normal IPs.

[0011] In a second aspect, embodiments of this application provide an electronic device, including: a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein when the processor executes the computer program, the electronic device implements the method described in various possible implementations of the first aspect above.

[0012] In a third aspect, embodiments of this application provide a computer-readable storage medium storing computer instructions, which, when executed by a processor, are used to implement the methods described in various possible implementations of the first aspect above.

[0013] In a fourth aspect, embodiments of this application provide a computer program product comprising a computer program, wherein when the computer program is executed by a processor, it implements the method described in various possible implementations of the first aspect above.

[0014] In the web crawler detection method, device, and readable storage medium provided in this application, an analysis server periodically acquires traffic data from a target website. The analysis server groups the traffic data based on the IP addresses of each access request, resulting in groups corresponding to different IP addresses. Next, the analysis server extracts features from each group to obtain a first feature set for each IP address. These features characterize the access behavior of the IP address. Then, the analysis server inputs the features from the first feature set of each IP address into an analysis model to determine which IP addresses among the multiple IP addresses accessing the target website within the current period are normal and which are abnormal. This approach analyzes historical access traffic in advance to obtain an analysis model, which is then used to detect web crawlers. This reduces reliance on the professional experience of security personnel, lowers false positive rates and network security maintenance costs, and improves network protection effectiveness.

BRIEF DESCRIPTION OF THE DRAWINGS

[0015] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0016] Figure 1 is a schematic diagram of the network architecture to which the crawler detection method provided in this application is applicable;

[0017] Figure 2 is a flowchart of the crawler detection method provided in the embodiments of this application;

[0018] Figure 3 is a schematic diagram of the crawler detection method provided in the embodiments of this application;

[0019] Figure 4 is a schematic diagram of a crawler detection device provided in an embodiment of this application;

[0020] Figure 5 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application;

[0021] Figure 6 is a schematic diagram of a crawler detection device provided in an embodiment of this application;

[0022] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

DETAILED DESCRIPTION

[0023] To prevent web crawlers from scraping website content, the industry uses robot protocols to regulate their behavior; these protocols are also known as crawler protocols or robots.txt. However, some malicious web crawlers do not comply with robot protocols, requiring crawler detection technologies to identify them.

[0024] Current web crawler detection technologies primarily rely on frequency limiting, cookie / JS feature detection, browser fingerprinting, and business flow analysis to detect and block abnormal web crawlers. However, as the battle between web crawlers and anti-crawler systems intensifies, advanced crawlers in real-world scenarios rapidly adjust their crawling methods by simulating normal access, controlling access frequency, and altering crawling behavior. To counter these crawlers, cybersecurity personnel need to continuously analyze online data to develop new protective measures. Even so, a large number of crawlers still successfully bypass these measures and crawl target websites, posing a significant threat to network security. Moreover, this approach heavily relies on the professional experience of security personnel, resulting in high costs, high false positive rates, and poor protection effectiveness.

[0025] Based on this, embodiments of this application provide a web crawler detection method, device, and readable storage medium. The method analyzes historical access traffic in advance to obtain an analysis model, and uses the analysis model to detect web crawlers. This reduces the reliance on the professional experience of security personnel, while also reducing the false alarm rate and network security maintenance costs, and improving the effectiveness of network protection.

[0026] Figure 1 is a schematic diagram of the network architecture to which the crawler detection method provided in this application applies. Referring to Figure 1, the network architecture includes a detection server 11, an analysis server 12, a website server 13, and a terminal device 14. A network connection is established between the detection server 11 and the analysis server 12, between the detection server 11 and the website server 13, and between the detection server 11 and the terminal device 14.

[0027] The detection server 11 is equipped with a bot edge engine and other technologies to perform crawler detection on access requests initiated by terminal devices to access the website server 13. If an access request is initiated by a crawler program, the request is handled by interception, monitoring, human-machine verification (CAPTCHA), or similar protection. If an access request is initiated by a normal user, the request is forwarded to the website server 13.

[0028] A pre-trained analysis model is deployed on analysis server 12. Analysis server 12 periodically acquires traffic data from the target website, i.e., website server 13, such as acquiring traffic data every five minutes. The five-minute traffic data is grouped according to IP address, and features characterizing the access behavior of each IP address are extracted from each group, resulting in a first feature set for each IP address. Then, the first feature set of each IP address is input into the analysis model to obtain abnormal IPs and normal IPs. Normal IPs are the IP addresses of real users, and abnormal IPs are the IP addresses of web crawlers. Afterwards, the analysis server sends the abnormal IPs to detection server 11, which then intercepts the traffic of the abnormal IPs.

[0029] Website server 13 is used to provide various services and can be either hardware or software. When website server 13 is hardware, it can be a single server or a distributed server cluster composed of multiple servers. When website server 13 is software, it can be multiple software modules or a single software module, etc., and this application embodiment is not limited.

[0030] The terminal device 14 can be either hardware or software. When the terminal device 14 is hardware, it can be, for example, a mobile phone, tablet computer, e-book reader, laptop computer, desktop computer, server, etc. When the terminal device 14 is software, it can be installed in the hardware devices listed above. In this case, the terminal device 14 can be, for example, multiple software modules or a single software module, etc., and the embodiments of this application are not limited.

[0031] It should be noted that, although in Figure 1 the detection server 11 and the analysis server 12 are deployed separately, this embodiment is not limited to this; in other feasible implementations, the detection server 11 and the analysis server 12 can also be deployed together.

[0032] It should be understood that the numbers of detection servers 11, analysis servers 12, website servers 13, and terminal devices 14 shown in Figure 1 are merely illustrative. In actual implementation, any number of detection servers 11, analysis servers 12, website servers 13, and terminal devices 14 can be deployed according to actual needs.

[0033] Below, based on the network architecture shown in Figure 1, the web crawler detection method provided in this application embodiment is described. For example, please refer to Figure 2.

[0034] Figure 2 is a flowchart of the crawler detection method provided in this embodiment. This embodiment is described from the perspective of the analysis server 12 in the architecture shown in Figure 1. This embodiment includes:

[0035] 201. Periodically obtain access requests to the target website to obtain traffic data.

[0036] In this embodiment of the application, the analysis server periodically obtains traffic data of the target website, and the detection period is, for example, 1 minute, 5 minutes, 10 minutes, etc., which is not limited in this embodiment of the application.

[0037] 202. The traffic data is grouped according to the IP address of each access request to obtain multiple groups, and access requests belonging to the same group have the same IP address.

[0038] After acquiring traffic data for each detection period, the analysis server determines the IP address of each access request in the traffic data.

[0039] After the analysis server determines the IP address of each access request, it groups the traffic data according to the IP address, thus placing access requests with the same IP address into the same group. The number of groups indicates how many IPs are accessing the target website within that detection period.
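
The grouping in step 202 can be sketched as follows. This is a minimal illustration, not the implementation of the application: it assumes each access request is a Python dict with a hypothetical "ip" field, and the sample addresses are made up.

```python
from collections import defaultdict

def group_by_ip(requests):
    """Group access requests so that each group holds the requests of one IP."""
    groups = defaultdict(list)
    for req in requests:
        groups[req["ip"]].append(req)  # "ip" is an assumed field name
    return dict(groups)

# Traffic data for one detection period (made-up sample values).
traffic = [
    {"ip": "203.0.113.7", "url": "/"},
    {"ip": "198.51.100.2", "url": "/search"},
    {"ip": "203.0.113.7", "url": "/buy"},
]
groups = group_by_ip(traffic)
# The number of groups equals the number of distinct IPs seen in the period.
ip_count = len(groups)
```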

[0040] 203. For each IP address, a first feature set is determined based on the group of that IP address to obtain a first feature set for each IP address, wherein the features in the first feature set are used to characterize the access behavior of the IP address.

[0041] For example, for each IP address, the analysis server applies algorithms to each metric in the group to extract features. These metrics include, but are not limited to, various fields in the request header, the interval between access requests, and the IP segment in which the IP address is located. The fields in the request header include, but are not limited to, Accept, Content-Type, and Cookie. The algorithms include, but are not limited to, calculating the mean, variance, median, mode, and harmonic mean, fitting a normal distribution or a Poisson distribution, the bag-of-words model, and term frequency–inverse document frequency (TF-IDF).

[0042] Feature extraction is also known as feature engineering. Through feature engineering, the analysis server obtains the first feature set for each IP address. The features in the first feature set are used to characterize the access behavior of the IP address, such as the total number of accesses, access duration, access frequency, and the number of URL types accessed.
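
As a rough illustration of step 203, the sketch below derives a few behavior features from the request timestamps of one IP, using some of the statistics named above (mean, variance, median). The function name, the feature names, the use of seconds, and the fallback used when all requests share one timestamp are assumptions made for this example only.

```python
from statistics import mean, median, pvariance

def interval_features(timestamps):
    """Summarise the request timing of one IP (timestamps in seconds)."""
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]  # intervals between requests
    duration = ts[-1] - ts[0] if ts else 0.0
    return {
        "total_visits": len(ts),
        "visit_duration": duration,
        # Assumption: frequency = requests per second over the visit duration.
        "access_frequency": len(ts) / duration if duration > 0 else float(len(ts)),
        "gap_mean": mean(gaps) if gaps else 0.0,
        "gap_variance": pvariance(gaps) if gaps else 0.0,
        "gap_median": median(gaps) if gaps else 0.0,
    }

features = interval_features([0.0, 2.0, 4.0, 10.0])
```

A script that fires requests at a fixed rate tends to show a very small interval variance, which is one reason interval statistics are useful as behavior features.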

[0043] 204. Input the first feature set of each IP address into the pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs.

[0044] After obtaining the first feature set of each IP address, the analysis server inputs the features from each IP address's first feature set into a pre-trained analysis model and runs the model, causing it to output abnormal and normal IP addresses. In this way, the analysis server determines which IP addresses among multiple IP addresses accessing the target website within the current period are normal IP addresses and which are abnormal IP addresses.
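
Step 204 can be pictured with the sketch below. The application does not specify the type of the analysis model, so placeholder_model is only a stand-in rule that lets the example run; the feature name access_frequency and the threshold 5.0 are likewise assumptions.

```python
def classify(feature_sets, model):
    """Split IPs into abnormal and normal according to the model's verdict."""
    abnormal, normal = [], []
    for ip, features in feature_sets.items():
        (abnormal if model(features) else normal).append(ip)
    return abnormal, normal

def placeholder_model(features):
    # Stand-in for the pre-trained analysis model. This fixed rule is NOT the
    # method of the application; it only makes the sketch runnable.
    return features["access_frequency"] > 5.0

feature_sets = {
    "203.0.113.7": {"access_frequency": 12.0},
    "198.51.100.2": {"access_frequency": 0.2},
}
abnormal_ips, normal_ips = classify(feature_sets, placeholder_model)
```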

[0045] Next, the analysis server sends the abnormal IPs to the detection server. The detection server adds the abnormal IPs to a blacklist. Then, if an access request's IP address matches the blacklist, the detection server assumes the request was initiated by a web crawler. For example, the analysis server retrieves traffic data from 10:00 to 10:05. After obtaining this 5-minute traffic data, starting at 10:05, the analysis server groups the traffic data by IP address, extracts features, and uses a detection model to identify abnormal IPs. Simultaneously, starting at 10:05, the next detection cycle begins, and the analysis server retrieves traffic data from 10:05 to 10:10.

[0046] Assuming the analysis takes 1 minute, meaning the analysis server finishes its analysis at 10:06 and sends the abnormal IP address to the detection server, then between 10:06 and 10:11, each time the detection server receives an access request from a terminal device, it determines the IP address of the access request and whether that IP address is abnormal. If the access address is an abnormal IP address, the detection server monitors and intercepts the access request.

[0047] At 10:11, the detection server receives the abnormal IPs for the new detection period, updates the blacklist based on the received abnormal IPs, and continues to detect web crawlers based on the updated blacklist.

[0048] It should be noted that although the above description uses a detection cycle of 5 minutes as an example to illustrate the detection method of the automated framework crawler described in this application embodiment, this application embodiment is not limited thereto. In other feasible implementations, the detection cycle can be 1 minute, 3 minutes, 10 minutes, 15 minutes, etc.

[0049] Additionally, it should be noted that although the analysis time is 1 minute in the above embodiment, this application embodiment is not limited to this. In other optional implementations, the analysis time may be 1 second, 5 seconds, 10 seconds, etc.
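
The blacklist handling in the 10:06 / 10:11 example above can be sketched as follows. The class and method names are hypothetical, and the text does not say whether a new report replaces or extends the previous blacklist; this sketch assumes replacement.

```python
class Blacklist:
    """Holds the abnormal IPs reported for the most recent detection period."""

    def __init__(self):
        self._ips = set()

    def update(self, abnormal_ips):
        # Assumption: each report replaces the previous one. Merging the new
        # IPs into the existing set would be an equally plausible reading.
        self._ips = set(abnormal_ips)

    def matches(self, ip):
        """Return True if a request from this IP should be treated as a crawler."""
        return ip in self._ips

blacklist = Blacklist()
blacklist.update(["203.0.113.7"])              # report received at 10:06
hit_before = blacklist.matches("203.0.113.7")  # checked between 10:06 and 10:11
blacklist.update(["198.51.100.2"])             # report received at 10:11
hit_after = blacklist.matches("203.0.113.7")   # checked after 10:11
```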

[0050] The web crawler detection method provided in this application involves an analysis server periodically acquiring traffic data from a target website. The traffic data is grouped according to the IP addresses of each access request, resulting in groups corresponding to different IP addresses. Next, the analysis server extracts features from each group to obtain a first feature set for each IP address. These features characterize the access behavior of the IP address. Then, the analysis server inputs the features from the first feature set of each IP address into an analysis model to determine which IP addresses among multiple IP addresses accessing the target website within the current period are normal and which are abnormal. This approach pre-analyzes historical access traffic to obtain an analysis model, which is then used to detect web crawlers. This reduces reliance on the professional experience of security personnel, lowers false positive rates and network security maintenance costs, and improves network protection effectiveness.

[0051] Optionally, in the above embodiments, in step 203, for each IP address, the features in the first feature set include, but are not limited to:

[0052] 1. Total number of visits

[0053] 2. Visit duration

[0054] 3. Access frequency

[0055] 4. Average number of response header bytes;

[0056] Each IP address group contains multiple access requests, and each access request has a corresponding response message. The response message includes a response line, response headers, and a response body. The analysis server determines the average number of bytes in the response headers based on the response headers of each response message.

[0057] 5. Average response body byte count;

[0058] The analysis server determines the average number of bytes in the response body based on the number of bytes in the response body of each response message.

[0059] 6. Average response time;

[0060] 7. Static resource request ratio

[0061] Each IP address group contains multiple access requests. Some of these requests access static resources, while others do not. Therefore, by analyzing these multiple access requests, the analysis server can identify the requests that access static resources and thus determine the proportion of static resource requests.

[0062] 8. Number of UA types

[0063] Each IP address group contains multiple access requests, and each access request has a User-Agent (UA) request header. The analysis server determines the value of the UA request header for each access request and performs deduplication to obtain the number of UA types.

[0064] 9. Average visits per User Agent

[0065] 10. Number of Referer types

[0066] Each IP address group contains multiple access requests, and each access request has a Referer header. The analysis server determines the value of the Referer header for each access request and performs deduplication to obtain the number of Referer types.

[0067] 11. Number of URL types

[0068] Each IP address group contains multiple access requests, and each access request has a URL. The analysis server determines the URL of each access request and performs deduplication to obtain the number of URL types.

[0069] 12. Average number of visits per URL

[0070] 13. Number of types of client unique identifiers (client ID, CID)

[0071] The detection server sends a piece of code to the website server to calculate the CID. Upon receiving the access request from the terminal device, the website server uses this code to calculate the CID based on the terminal device's screen resolution, mouse movements, etc., and then returns it to the terminal device. Each subsequent access to the target website by the terminal device will include this CID. If an access request does not carry a CID, it indicates an abnormal access request. Furthermore, if a group contains multiple different CIDs, it means that the IP address was used by multiple devices to complete a single access within a short period, making it an abnormal IP address.

[0072] 14. Average number of visits per Client ID

[0073] 15. The proportion of access requests with Client ID

[0074] The group of a single IP address can contain multiple access requests; some requests carry a Client ID, while others do not. Therefore, the analysis server can determine the proportion of access requests that carry a Client ID.

[0075] 16. Information entropy of URL

[0076] The analysis server determines a URL sequence based on the access request from the IP address. This URL sequence is actually several URLs arranged in order. For example, if a user buys a plane ticket on an airline website, their actions are: opening the homepage, searching for tickets, and then purchasing a ticket. The URLs of the homepage, the ticket search page, and the ticket purchase page form a URL sequence.

[0077] After the analysis server determines the URL sequence, it determines the information entropy of the URL sequence. For example, the analysis server determines the information entropy of the URL according to the information entropy calculation formula.

[0078] 17. Standard deviation of the feature vector obtained by applying the bag-of-words model to the URLs

[0079] 18. The number of parameters in the access request

[0080] The group of an IP address may contain multiple access requests. Some access requests may carry parameters, while others may not. The analysis server determines the number of parameters based on the access requests that carry parameters.

[0081] 19. The proportion of access requests with parameters

[0082] The group of an IP address contains multiple access requests. Some access requests carry parameters, while others do not. The analysis server determines the number of access requests carrying parameters and the total number of access requests contained in the group, thereby obtaining the proportion of access requests with parameters.

[0083] 20. TF-IDF characteristics of URLs

[0084] 22. TF-IDF characteristics of access request suffixes

[0085] The suffix may be, for example, .png, etc., but this application embodiment is not limited thereto.

[0086] 23. Bag-of-words features of response status codes

[0087] 24. Bag-of-words feature of request methods

[0088] Request methods include common HTTP request methods such as GET and POST.

[0089] 25. TF-IDF features of Content-Type

[0090] Each access request includes a Content-Type header, and the analysis server determines the TF-IDF characteristics of the Content-Type.

[0091] 26. TF-IDF features of Referer

[0092] ...
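
Several of the features listed above reduce to deduplication, ratios, and an entropy calculation over one IP's group. The sketch below is illustrative only. It assumes each request is a dict with hypothetical "url" and "user_agent" fields, that the group is non-empty, that URLs are compared by path without the query string, that the static suffix list is as shown, and that the information entropy is the Shannon entropy H = -sum(p_i * log2(p_i)) over the URL frequencies; the application does not state its exact formula.

```python
import math
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

STATIC_SUFFIXES = (".png", ".jpg", ".css", ".js")  # assumed static resource types

def group_features(group):
    """Compute a few per-IP features from one non-empty group of requests."""
    total = len(group)
    urls = [req["url"] for req in group]
    paths = [urlsplit(u).path for u in urls]
    # Number of query parameters carried by each request.
    param_counts = [
        len(parse_qsl(urlsplit(u).query, keep_blank_values=True)) for u in urls
    ]
    # Shannon entropy (base 2) of the empirical distribution of URL paths.
    counts = Counter(paths)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return {
        "ua_types": len({req["user_agent"] for req in group}),       # item 8
        "url_types": len(counts),                                    # item 11
        "avg_visits_per_url": total / len(counts),                   # item 12
        "static_ratio": sum(p.lower().endswith(STATIC_SUFFIXES) for p in paths) / total,  # item 7
        "param_ratio": sum(n > 0 for n in param_counts) / total,     # item 19
        "url_entropy": entropy,                                      # item 16
    }

group = [
    {"url": "/", "user_agent": "A"},
    {"url": "/search?from=XMN&to=PEK", "user_agent": "A"},
    {"url": "/logo.png", "user_agent": "B"},
    {"url": "/", "user_agent": "A"},
]
features = group_features(group)
```

A crawler that walks every page exactly once produces a high URL entropy, while a client that repeats a few pages produces a low one, so the value helps separate browsing patterns.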

[0093] It should be noted that in the embodiments of this application, feature engineering is required both when detecting crawlers with the detection model (inference) and when training the detection model. During inference, features are extracted for each IP address based on its group; during model training, features are extracted for each sample IP address based on its traffic group. The two processes are similar. The difference is that the TF-IDF features and / or bag-of-words features need to be stored during model training, because both kinds of feature depend on a dictionary. Computing these features during inference requires the dictionary that was built and stored in advance during training.
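
The need to store the dictionary can be seen in a small bag-of-words sketch. It is written in plain Python under assumed names (fit_vocabulary, bag_of_words) and uses JSON as an arbitrary storage format; a TF-IDF variant would additionally need to store the inverse document frequencies computed during training.

```python
import json

def fit_vocabulary(training_tokens):
    """Build the dictionary (token -> column index) from training samples."""
    vocab = {}
    for tokens in training_tokens:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def bag_of_words(tokens, vocab):
    """Count tokens against a fixed dictionary; unseen tokens are ignored."""
    vec = [0] * len(vocab)
    for tok in tokens:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec

# Training: build the dictionary from request methods and persist it.
vocab = fit_vocabulary([["GET", "GET", "POST"], ["GET", "HEAD"]])
stored = json.dumps(vocab)

# Inference: reload the stored dictionary so columns line up with training.
loaded = json.loads(stored)
vector = bag_of_words(["POST", "GET", "GET", "PUT"], loaded)
```

If inference rebuilt its own dictionary from the current traffic, the same column could mean GET in training and PUT in inference, and the model's input would no longer be comparable.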

[0094] Optionally, in the above embodiments, before the analysis server groups the traffic data according to the IP address of each access request to obtain multiple groups, i.e., before performing step 202 above, the analysis server also preprocesses the traffic data to filter out traffic that may affect the output results of the detection model. Traffic that does not need to be intercepted includes, but is not limited to:

[0095] A. Traffic originating from IP addresses that are on the whitelist.

[0096] The analysis server maintains a whitelist. Access requests from IP addresses on the whitelist are considered legitimate and do not require blocking. Therefore, it is necessary to filter out traffic from IP addresses that match the whitelist.

[0097] The IP addresses in the whitelist are provided by the customer, such as the company hosting the target website's server. Additionally, network operations and maintenance personnel can also add IP addresses to the whitelist; these IP addresses may be internal testing IPs or IPs used to access payment interfaces.

[0098] B. Traffic from known crawlers.

[0099] Known crawlers refer to the crawlers used by search engines to obtain website information. They are usually harmless crawlers, and there is no need to block access requests from known crawlers. Therefore, it is necessary to filter out traffic from known crawlers in the traffic data.

[0100] Additionally, if a client expects a particular search engine to be unable to crawl website information, there is no need to filter out the traffic from that particular search engine's crawler.

[0101] C. Traffic that conforms to custom rules.

[0102] This application supports custom data filtering logic, which can filter traffic with specific characteristics, i.e., traffic that conforms to custom rules, according to actual scenario requirements. Custom rules include, but are not limited to:

[0103] a) Traffic originating from a specific IP address or IP range.

[0104] The analysis server pre-stores target addresses or target network segments. If the IP address of an access request in the traffic data is the target IP address, or if the IP address belongs to the target network segment, the analysis server filters out the traffic of that IP address.

[0105] b) IP addresses that accessed resources with a specific Uniform Resource Locator (URL).

[0106] The analysis server pre-stores target URLs. If a request in the traffic data accesses a URL that is the target URL, the analysis server filters out traffic that accesses the target URL. The target URL could be, for example, the URL of a payment page.

[0107] c) IP addresses from a specific region.

[0108] The analysis server pre-stores a list of target regions. If an IP address belongs to any region in the list, there is no need to intercept access requests initiated by that IP address.

[0109] If the location of an IP address in the traffic data matches the target region list, then the traffic from that IP address will be filtered out.

[0110] d) IPs with specific behaviors.

[0111] The analysis server stores a threshold number of visits in advance. If the number of visits from an IP address is less than the threshold number, it means that the IP address is not the IP address of the web crawler, so the traffic of the IP address is filtered out.

[0112] This approach preprocesses traffic data to filter out traffic from IP addresses that do not need to be blocked, reducing the amount of data while improving the speed of crawler detection.
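
A possible shape for this preprocessing step is sketched below, covering the whitelist (A), the target IP or network segment (a), the target URL (b), and the visit threshold (d). The function and parameter names are hypothetical, the known-crawler and region filters are left out because they need external data, and dropping only the matching request for a target URL, rather than all traffic of that IP, is an assumption.

```python
import ipaddress
from collections import Counter

def preprocess(requests, whitelist, target_networks, target_urls, min_visits):
    """Drop traffic that does not need to be analysed before grouping by IP."""
    networks = [ipaddress.ip_network(n) for n in target_networks]
    visits = Counter(req["ip"] for req in requests)
    kept = []
    for req in requests:
        ip = ipaddress.ip_address(req["ip"])
        if req["ip"] in whitelist:
            continue  # A. whitelisted IP
        if any(ip in net for net in networks):
            continue  # a) target IP address or network segment
        if req["url"] in target_urls:
            continue  # b) request for a target URL, e.g. a payment page
        if visits[req["ip"]] < min_visits:
            continue  # d) too few visits in this period to be a crawler
        kept.append(req)
    return kept

traffic = (
    [{"ip": "203.0.113.7", "url": "/list"}] * 3
    + [{"ip": "10.0.0.5", "url": "/list"}] * 3
    + [{"ip": "198.51.100.2", "url": "/list"}]
    + [{"ip": "192.0.2.9", "url": "/list"}] * 3
)
kept = preprocess(
    traffic,
    whitelist={"192.0.2.9"},
    target_networks=["10.0.0.0/8"],
    target_urls={"/pay"},
    min_visits=2,
)
```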

[0113] Optionally, in the above embodiments, after the analysis server outputs abnormal and normal IPs using the analysis model, it can directly send the abnormal IPs to the detection server, which then stores the abnormal IPs in a blacklist. Subsequently, when the detection server receives a new access request, it determines the IP address of the new access request and then checks whether that IP address matches the blacklist. For example, the detection server updates the blacklist at 10:06 and receives a new access request at 10:06:20. If the IP address matches the blacklist, the detection server considers the access request to be initiated by a web crawler and performs monitoring, interception, and other processing on the access request.

[0114] In addition, after the analysis server outputs abnormal and normal IPs using the analysis model, it can perform certain processing on the output results, hereinafter referred to as post-processing. Then, it determines whether to block access requests from abnormal IPs. The post-processing process is described in detail below.

[0115] In one approach, after the analysis server outputs abnormal and normal IPs using an analysis model, for each abnormal IP, it determines whether the abnormal IP meets preset conditions based on the IP's grouping. If the abnormal IP meets at least one preset condition, it is considered a normal IP and does not need to be sent to the detection server. If the abnormal IP does not meet any preset condition, it is considered abnormal and needs to be sent to the detection server so that the detection server can intercept access requests initiated by the abnormal IP. The preset conditions include, but are not limited to:

[0116] i. An abnormal IP address has accessed the target URL

[0117] The target URL is, for example, the URL of the payment page.

[0118] ii. The access behavior of the abnormal IP corresponds to the target URL sequence

[0119] The analysis server determines a URL sequence based on the traffic in the group of the abnormal IP. This URL sequence represents the access behavior of the abnormal IP, such as which pages were visited in sequence. If this URL sequence matches the target URL sequence pre-stored by the analysis server, the analysis server treats the abnormal IP as a normal IP.

[0120] iii. The value of the request header for access requests from abnormal IP addresses is the target value.

[0121] The group of the abnormal IP contains multiple access requests, each with request headers. If a target request header, such as the User-Agent (UA) request header, has the target value, the analysis server treats the abnormal IP as a normal IP. Conversely, if the UA request header has any other value, the analysis server considers the abnormal IP to be indeed abnormal.

[0122] iv. The access frequency of abnormal IP addresses is less than the preset frequency.

[0123] v. The number of accesses from abnormal IP addresses is less than the preset number.

[0124] The analysis server pre-stores thresholds such as preset frequency and preset number of times, and filters out some abnormal IPs based on the set thresholds.

[0125] If the access frequency or number of accesses of an abnormal IP is less than the preset frequency or number of accesses, the analysis server will treat the abnormal IP as a normal IP; otherwise, the analysis server will consider the abnormal IP to be an abnormal IP.

[0126] vi. The abnormal IP has specific access characteristics, such as front-end adversarial features or human-machine verification features.

[0127] vii. Combinations of the above filtering methods.

[0128] In this approach, if the analysis model outputs an abnormal IP address and that IP address is still abnormal after post-processing, then all access requests initiated by that abnormal IP address are intercepted, monitored, or otherwise processed.

[0129] This approach involves the analysis server outputting abnormal and normal IPs using the analysis model. Then, based on the grouping of abnormal IPs, it further determines whether an abnormal IP is indeed abnormal. IP addresses that have been reconfirmed and whose confirmation results are all abnormal are sent to the detection server, thus avoiding excessive blocking and improving the accuracy of crawler detection.
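The reconfirmation of abnormal IPs against preset conditions i to v can be sketched as below. All concrete values (target URL, target URL sequence, header value, preset frequency, preset number) and all names are illustrative assumptions. The sketch also assumes that a sequence "matches" when the target sequence occurs contiguously in the visited URLs, that one request carrying the target header value is enough, and that frequency means requests per second over the collection period.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    url: str
    headers: dict = field(default_factory=dict)


@dataclass
class FirstPresetConditions:
    # Illustrative values only; none of them are specified in the source.
    target_urls: frozenset = frozenset({"/pay"})
    target_sequence: tuple = ("/cart", "/checkout")
    target_header: tuple = ("User-Agent", "TrustedPartner/1.0")
    preset_frequency: float = 1.0   # requests per second
    preset_count: int = 5


def contains_sequence(urls, target):
    # True if `target` occurs as a contiguous run inside `urls`.
    n = len(target)
    return n > 0 and any(
        tuple(urls[i:i + n]) == tuple(target) for i in range(len(urls) - n + 1)
    )


def still_abnormal(group, cond, period_seconds):
    """Return True only if the abnormal IP meets none of the preset conditions."""
    urls = [r.url for r in group]
    name, value = cond.target_header
    checks = [
        any(u in cond.target_urls for u in urls),              # condition i
        contains_sequence(urls, cond.target_sequence),         # condition ii
        any(r.headers.get(name) == value for r in group),      # condition iii
        len(group) / period_seconds < cond.preset_frequency,   # condition iv
        len(group) < cond.preset_count,                        # condition v
    ]
    return not any(checks)


crawler = [Request("/list", {"User-Agent": "bot"}) for _ in range(600)]
buyer = crawler + [Request("/pay")]
print(still_abnormal(crawler, FirstPresetConditions(), 60))
print(still_abnormal(buyer, FirstPresetConditions(), 60))
```

Only IPs for which `still_abnormal` returns `True` would be forwarded to the detection server in this sketch.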

[0130] In another approach, after the analysis server outputs abnormal and normal IPs using the analysis model, for each abnormal IP, the analysis server obtains new access requests initiated by that abnormal IP within a preset future time period. If the new access request meets a second preset condition, the abnormal IP is allowed to access the target website; if the new access request does not meet any of the second preset conditions, the abnormal IP is not allowed to access the target website, i.e., the new access requests initiated by the abnormal IP are blocked. The second preset conditions include, but are not limited to:

[0131] Ⅰ. Request access to the target URL.

[0132] After identifying abnormal IPs using its analysis model, the analysis server forwards these IPs to the detection server. However, the detection server does not intercept or monitor all new access requests initiated by these abnormal IPs. If a new access request from an abnormal IP address corresponds to a target URL, the detection server allows the request, permitting the abnormal IP address to access the target website. If the URL corresponding to a new access request from an abnormal IP address is not the target URL, the detection server does not allow the request, prohibiting the abnormal IP address from accessing the target website.

[0133] II. The value of the request header is the target value.

[0134] Similarly, if the header value of a new access request initiated by an abnormal IP address is the target value, the detection server will allow the access request, meaning the abnormal IP address is allowed to access the target website. If the header value of a new access request initiated by an abnormal IP address is not the target value, the detection server will not allow the request, meaning the abnormal IP address is not allowed to access the target website.

[0135] III. Access frequency is less than the preset frequency

[0136] Each time the detection server receives a new access request from an abnormal IP address, it updates the access frequency. If the access frequency is less than the preset frequency, the access request is allowed, meaning the abnormal IP address is permitted to access the target website. If the access frequency is greater than or equal to the preset frequency, the detection server does not allow the abnormal IP address to access the target website.

[0137] IV. The number of visits is less than the preset number.

[0138] Each time the detection server receives a new access request from an abnormal IP address, it updates the access count. If the access count is less than the preset number, the access request is allowed, meaning the abnormal IP address is permitted to access the target website. If the access count is greater than or equal to the preset number, the detection server does not allow the abnormal IP address to access the target website.

[0139] V. Newly initiated access requests have specific access characteristics, such as front-end adversarial features or human-machine verification features. In this case, the detection server allows the access request, that is, allows the abnormal IP address to access the target website. If the newly initiated access request does not have specific access characteristics, the detection server does not allow it, that is, does not allow the abnormal IP address to access the target website.

[0140] In this approach, when an abnormal IP address is output by the analysis model, some access requests initiated by that abnormal IP address are intercepted and monitored.

[0141] Using this approach, the analysis server outputs abnormal and normal IPs using the analysis model, and then selectively filters new access requests initiated by abnormal IPs to avoid excessive blocking.
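A minimal sketch of this selective filtering on the detection server is given below. It covers only conditions I, II, and IV; the class name `AbnormalIpGate`, the field names, and the concrete values are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SecondPresetConditions:
    # Illustrative values; the source does not specify any of them.
    target_urls: frozenset = frozenset({"/pay"})
    target_header: tuple = ("User-Agent", "TrustedPartner/1.0")
    preset_count: int = 3


@dataclass
class AbnormalIpGate:
    """Hypothetical per-request gate for IPs already flagged as abnormal."""
    cond: SecondPresetConditions
    counts: dict = field(default_factory=dict)

    def allow(self, ip, url, headers):
        # Update the per-IP access count on every new request (condition IV).
        self.counts[ip] = self.counts.get(ip, 0) + 1
        name, value = self.cond.target_header
        return (
            url in self.cond.target_urls                    # condition I
            or headers.get(name) == value                   # condition II
            or self.counts[ip] < self.cond.preset_count     # condition IV
        )


gate = AbnormalIpGate(SecondPresetConditions())
decisions = [gate.allow("203.0.113.7", "/list", {}) for _ in range(3)]
decisions.append(gate.allow("203.0.113.7", "/pay", {}))
print(decisions)
```

In this sketch the third request to `/list` is refused because the count has reached the preset number, while the later request to the target URL is still allowed.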

[0142] Optionally, in the above embodiments, before the analysis server inputs the first feature set of each IP address into the pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs, it also trains an initial model on sample data to obtain the analysis model. Figure 3 is a flowchart of training the analysis model in the web crawler detection method provided in this embodiment. This embodiment includes:

[0143] 301. Obtain the historical traffic of the target website.

[0144] For example, the analysis server can obtain historical traffic covering 1 hour, 3 hours, or 1 day, but this application embodiment is not limited to this.

[0145] 302. Group access requests with the same IP address in the historical traffic into a group to obtain multiple sample IPs and traffic groups for each sample IP.

[0146] For clarity, access requests in historical traffic will be referred to as sample requests, and the IP addresses of sample requests will be referred to as sample IPs.

[0147] After acquiring historical traffic, the analysis server identifies the sample IP address for each sample request within that historical traffic. Then, the historical traffic is grouped based on these sample IP addresses, placing requests from the same sample IP address into the same traffic group. The number of traffic groups indicates the number of sample IP addresses that accessed the target website historically.
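The grouping in step 302 can be sketched as follows, assuming (for illustration only) that each sample request is a dictionary with an `ip` key and a `url` key.

```python
from collections import defaultdict


def group_by_ip(sample_requests):
    """Place sample requests from the same sample IP into the same traffic group."""
    groups = defaultdict(list)
    for request in sample_requests:
        groups[request["ip"]].append(request)
    return dict(groups)


history = [
    {"ip": "203.0.113.7", "url": "/list?page=1"},
    {"ip": "198.51.100.2", "url": "/index"},
    {"ip": "203.0.113.7", "url": "/list?page=2"},
]
traffic_groups = group_by_ip(history)
print(len(traffic_groups))  # number of sample IPs in the historical traffic
```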

[0148] It should be noted that before the analysis server executes step 302, it also performs data preprocessing on the historical traffic. For details, please refer to the above-mentioned preprocessing of traffic data, which will not be repeated here.

[0149] In addition, in this embodiment, the data preprocessing for historical traffic includes not only filtering but also parameter extraction and static resource judgment. Parameter extraction refers to determining the URL accessed by each sample request and extracting the parameters and suffix from the URL. For example, if a sample request's URL is https://xxx.com/xxx?a=b&c=d, the analysis server extracts two key-value pairs: a=b and c=d. Suffixes, such as .png, are used to determine whether the sample request accesses a static resource, such as an image or a video; the kinds of static resources are not limited in this embodiment.
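Parameter extraction and static resource judgment can be sketched with the Python standard library as below. The function name and the list of static suffixes are assumptions for illustration; the source only names .png as an example suffix.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qsl, urlsplit

# Assumed set of static-resource suffixes; only ".png" is named in the source.
STATIC_SUFFIXES = {".png", ".jpg", ".gif", ".css", ".js", ".mp4"}


def extract_url_features(url):
    parts = urlsplit(url)
    # Key-value pairs from the query string, e.g. a=b and c=d.
    params = parse_qsl(parts.query, keep_blank_values=True)
    # Suffix of the last path segment, e.g. ".png"; empty if there is none.
    suffix = PurePosixPath(parts.path).suffix.lower()
    return {"params": params, "suffix": suffix, "is_static": suffix in STATIC_SUFFIXES}


page = extract_url_features("https://xxx.com/xxx?a=b&c=d")
image = extract_url_features("https://xxx.com/img/logo.png")
print(page["params"])
print(image["suffix"], image["is_static"])
```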

[0150] 303. Based on the traffic groups of each sample IP, tag the corresponding sample IP, and the tag is used to indicate whether the sample IP is a normal IP or an abnormal IP.

[0151] The analysis server uses semi-automated tagging technology to tag each sample IP, thereby identifying which IPs are normal and which are abnormal among the multiple sample IPs.

[0152] 304. Based on the traffic groups of each sample IP, determine the second feature set to obtain the second feature set of each sample IP.

[0153] For the feature extraction process, refer to the description of step 203 in Figure 2 above; it is not repeated here. The features in the second feature set are also described above and are not repeated here.

[0154] 305. The analysis model is trained based on the labels of each sample IP and the second feature set of each sample IP.

[0155] The analysis server uses the labels obtained in step 303 and the second feature set of each sample IP obtained in step 304 to train the initial machine learning model, thereby obtaining the above-mentioned analysis model. The machine learning model includes, but is not limited to, random forest model, XGBoost model, support vector machine (SVM), deep neural network model, etc., and the embodiments of this application are not limited thereto.

[0156] This approach utilizes the labels and second feature sets of sample IPs to train an initial machine learning model, resulting in an analytical model for crawler detection. This significantly reduces reliance on professional network maintenance, lowers labor costs, and improves the accuracy of crawler detection.

[0157] Optionally, in the above embodiments, the analysis server may use regular expression matching, threshold judgment, or other methods to label sample IPs. Taking threshold judgment as an example, the following explains in detail how the analysis server labels sample IPs. Referring to Figure 4, which is a flowchart of the labeling process in the web crawler detection method provided in this application embodiment, this embodiment includes:

[0158] 401. For each sample IP, determine the number of accesses made by the sample IP and the number of types of URLs it accessed.

[0159] Each sample IP address corresponds to a traffic group, which contains multiple sample requests. The analysis server determines the number of accesses based on the number of sample requests. Simultaneously, the analysis server determines the URL of each sample request and deduplicates it, thus obtaining the number of URL types.

[0160] 402. The analysis server determines whether the number of URL types is less than the first threshold and whether the number of accesses is greater than the second threshold. If both conditions hold, proceed to step 403; otherwise, proceed to step 404.

[0161] 403. The sample IP is determined to be an abnormal IP.

[0162] 404. The analysis server determines whether the number of URL types is greater than or equal to the first threshold and less than the third threshold, and whether the sample IP has not accessed static resources. If the number of URL types is greater than or equal to the first threshold and less than the third threshold, and the sample IP has not accessed static resources, proceed to step 403; if the number of URL types is greater than or equal to the third threshold, or if the sample IP has accessed static resources, proceed to step 405.

[0163] The first threshold is, for example, 3, and the second and third thresholds are, for example, 10; the embodiments of this application are not limited thereto.

[0164] 405. The analysis server determines whether the access behavior of the sample IP conforms to a preset behavior. If it does, proceed to step 403; if it does not, proceed to step 406.

[0165] Preset behaviors include: the sample IP has accessed a specific URL, the sample IP belongs to the malicious IP intelligence database, and the sample IP has accessed the URL too many times. These preset behaviors can be customized according to the customer's specific scenario.

[0166] 406. The sample IP is determined to be a normal IP.
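Steps 401 to 406 can be sketched as a single labeling function, shown below under stated assumptions. The default thresholds use the example values 3, 10, and 10. Step 406 is read here as the normal outcome, since step 403 is already the abnormal outcome. Cases the flow does not spell out (for instance, few URL types together with few accesses) are assumed to fall through to step 405. The names `LabelThresholds` and `label_sample_ip` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelThresholds:
    first: int = 3    # lower bound on the number of URL types
    second: int = 10  # bound on the number of accesses
    third: int = 10   # upper bound on the number of URL types


def label_sample_ip(urls, accessed_static, matches_preset_behavior,
                    th=LabelThresholds()):
    # Step 401: number of accesses and number of distinct URL types.
    access_count = len(urls)
    url_types = len(set(urls))

    # Step 402 -> 403: few URL types but many accesses.
    if url_types < th.first and access_count > th.second:
        return "abnormal"

    # Step 404 -> 403: moderate number of URL types and no static resources.
    if th.first <= url_types < th.third and not accessed_static:
        return "abnormal"

    # Step 405: preset behaviors (specific URL, malicious IP intelligence, ...).
    if matches_preset_behavior:
        return "abnormal"

    # Step 406 (assumed to be the normal outcome).
    return "normal"


print(label_sample_ip(["/a"] * 20, False, False))
print(label_sample_ip(["/a", "/b", "/c", "/d", "/e"], True, False))
```

Keeping the thresholds in one small object makes it straightforward to replace them later when the thresholds are adjusted.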

[0167] After the analysis server labels each sample IP, for websites training machine learning models for the first time, it also performs sampling analysis on the labeled sample IPs and fine-tunes the labeling rules based on the analysis results to ensure that the labeling results are close to reality. The labeling rules include the threshold rules and regular expression matching rules mentioned above.

[0168] Using this approach, the analysis server tags each sample IP address based on threshold judgments, resulting in high accuracy and speed.

[0169] Optionally, in the implementation shown in Figure 2 above, before the first feature set is determined based on the group of each IP address, each IP address in the multiple groups is labeled. The labeling method is the same as the process for labeling sample IPs described above and is not repeated here.

[0170] During web crawler detection with the analysis model, the purpose of labeling the IP address of each group is not to detect web crawlers but to adjust thresholds, such as the first, second, and third thresholds used to label sample IPs for model training. When the analysis model later needs to be retrained, the adjusted thresholds are used to label the sample IPs, which improves the accuracy of model training.

[0171] After the analysis server labels the IP addresses corresponding to each group, it determines whether the labels of the IP addresses are consistent with the output of the analysis model to obtain an error value. The error value is the number of IP addresses whose labels are inconsistent with the output of the analysis model. Then, a threshold is adjusted based on the error value; this threshold is the one used during the training of the analysis model.
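As a small illustration of the error value, the sketch below counts the IP addresses whose label differs from the analysis model's output. The dictionary layout (IP address mapped to "normal" or "abnormal") and the sample data are assumptions.

```python
def error_value(labels, model_output):
    """Count IP addresses whose label is inconsistent with the model output."""
    return sum(1 for ip, label in labels.items() if model_output.get(ip) != label)


# Ten IP addresses, all labeled abnormal by the labeling rules ...
labels = {f"203.0.113.{i}": "abnormal" for i in range(10)}
# ... while the model reports the first six of them as normal.
model_output = {
    ip: ("normal" if i < 6 else "abnormal") for i, ip in enumerate(labels)
}
print(error_value(labels, model_output))  # 6
```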

[0172] For example, the analysis server uses a proportional-integral-derivative (PID) control algorithm or other feedback algorithms to perform closed-loop adjustment of the thresholds used during model training. Taking the PID algorithm as an example, the closed-loop adjustment algorithm is shown in the following formula:

[0173] $$u(t) = K_p\,e(t) + K_i \int_0^t e(\tau)\,d\tau + K_d\,\frac{de(t)}{dt}$$

[0174] where $K_p$ is the proportional gain, a tuning parameter;

[0175] $K_i$ is the integral gain, also a tuning parameter;

[0176] $K_d$ is the derivative gain, also a tuning parameter;

[0177] $e$ is the error value, $e$ = set point (SP) - process value (PV), where the process value is the feedback value;

[0178] $t$ is the current time;

[0179] $\tau$ is the integration variable, ranging from 0 to the current time $t$.

[0180] In the above formula, the error value e, t, and the three PID tuning parameters are defined as follows:

[0181] The error value e represents the difference between the output of the analysis model and the results of the semi-automatic labeling. In other words, e equals the number of IP addresses whose labels differ from the analysis model's output. For example, if there are 10 groups (that is, 10 IP addresses) and the labeling results for 6 of them differ from the analysis model's output, then e equals 6.

[0182] The unit for time t is currently hours.

[0183] The three PID tuning parameters $K_p$, $K_i$, and $K_d$ are adjusted according to the actual situation.

[0184] Substituting the error value e, the time t, and the three PID tuning parameters into the closed-loop adjustment algorithm above yields the control output value u. If an initial threshold, such as the first threshold mentioned above, is X, then at time t the first threshold is adjusted to X + u(t).

[0185] In addition, to keep the threshold from becoming abnormal, whether because attacks surge or drop suddenly during actual business operations or because the PID parameters are set improperly, upper and lower limits can be set for the threshold: the adjusted threshold must stay within ±10% of the initial value.
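The closed-loop adjustment with the ±10% limit can be sketched as a discrete-time PID update, shown below. This is an assumed discretization (the integral is approximated by a running sum and the derivative by a backward difference), not necessarily the one used in this application. The class name `ThresholdPID` and the gain values are hypothetical, and a positive initial threshold is assumed.

```python
class ThresholdPID:
    """Discrete PID sketch that adjusts a threshold X to X + u(t), with clamping."""

    def __init__(self, initial, kp, ki, kd, limit=0.10):
        self.initial = initial          # initial threshold X (assumed positive)
        self.kp, self.ki, self.kd = kp, ki, kd
        self.limit = limit              # adjusted value stays within +/-10% of X
        self.integral = 0.0
        self.prev_error = None

    def update(self, error, dt=1.0):
        # Integral term: running sum of e * dt (dt in hours in the source).
        self.integral += error * dt
        # Derivative term: backward difference; zero on the first sample.
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / dt
        self.prev_error = error

        u = self.kp * error + self.ki * self.integral + self.kd * derivative

        # Clamp the adjusted threshold to the allowed band around X.
        low = self.initial * (1 - self.limit)
        high = self.initial * (1 + self.limit)
        return min(max(self.initial + u, low), high)


gentle = ThresholdPID(initial=10, kp=0.05, ki=0.0, kd=0.0)
aggressive = ThresholdPID(initial=10, kp=1.0, ki=0.0, kd=0.0)
print(gentle.update(error=6))       # small correction, inside the band
print(aggressive.update(error=6))   # large correction, clamped to the upper limit
```

Since the error value is a non-negative count, the sign of the gains decides whether the threshold moves up or down; choosing that sign is part of the tuning and is left open here.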

[0186] After adjusting the threshold, the adjusted threshold will be used during model training when the model needs to be updated.

[0187] In this method, the IP address of each group is labeled during crawler detection (inference) with the analysis model. The label of each IP address is compared with the output of the analysis model, the error value is determined from the comparison, and the threshold used in the model training process is then adjusted according to the error value, thereby achieving closed-loop adjustment and improving the accuracy of the analysis model.

[0188] Figure 5 is a schematic diagram of the closed-loop adjustment process in the crawler detection method provided in this application embodiment. Referring to Figure 5, the model training process involves data preprocessing, semi-automatic labeling, feature engineering, and model training, ultimately producing the analysis model.

[0189] The inference process using the analysis model involves data preprocessing, semi-automatic labeling, feature engineering, and model inference. The inference results indicate which of the multiple IP addresses are abnormal and which are normal.

[0190] The inference step is followed by data post-processing, which reconfirms whether each abnormal IP address is indeed abnormal. After obtaining the detection results, the analysis server sends the abnormal IP addresses to the detection server. The detection server then performs protective actions, including but not limited to blocking, monitoring, and CAPTCHA challenges.

[0191] Referring to Figure 5, after the detection results are obtained, they can be compared with the semi-automatic labeling results to determine the error value, and the threshold used in the model training process can then be adjusted according to the error value.

[0192] Although Figure 5 shows the error value being obtained by comparing the detection results with the semi-automatic labeling results, the embodiments of this application are not limited to this; for example, the error value can also be obtained by comparing the inference results, i.e., the output of the analysis model, with the semi-automatic labeling results.

[0193] Optionally, in the above embodiments, during the tagging process, the analysis server can determine whether an IP address is a normal IP or an abnormal IP based on regular expression matching, threshold judgment, or other methods. An IP address may be labeled as an abnormal IP address if it has triggered a threshold, or if it satisfies a certain regular expression in the regular expression matching. Similarly, an IP address may be labeled as a normal IP address if it has not triggered a threshold, or if it does not satisfy a regular expression.

[0194] Therefore, abnormal IPs can be divided into two categories:

[0195] The first category is threshold-type abnormal IPs, which are IP addresses that have triggered specific thresholds, such as frequency limits or access-count limits.

[0196] The second category is non-threshold abnormal IPs, which mainly include IP addresses that triggered regular expressions.

[0197] Obviously, when adjusting the threshold, there is no need to consider non-threshold-type abnormal IPs. Therefore, optionally, in the above threshold adjustment process, in order to more accurately adjust the thresholds used in the model training process, during the process of the analysis server obtaining the error value, firstly, normal IPs and threshold-type abnormal IPs are identified from the IP addresses of each group. Then, it is determined whether the labels of each normal IP and each threshold-type abnormal IP are consistent with the output of the analysis model to obtain the error value. Among them, threshold-type abnormal IPs are IP addresses that are labeled as abnormal IPs because they do not meet the preset threshold, i.e., the first type of IP addresses mentioned above.

[0198] This approach filters out non-threshold abnormal IPs during closed-loop adjustment, thereby accurately adjusting the thresholds used in model training and ultimately training a high-precision analysis model, thus improving the accuracy of crawler detection.
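One way to exclude non-threshold abnormal IPs from the error value is sketched below. It assumes that the labeling step records, alongside each label, which kind of rule produced it (`"threshold"` or `"regex"`); that bookkeeping and all names are assumptions for illustration.

```python
def threshold_error_value(labels, model_output):
    """Error value over normal IPs and threshold-type abnormal IPs only.

    `labels` maps ip -> (label, source); `source` is "threshold" or "regex"
    for abnormal IPs and None for normal IPs.
    """
    error = 0
    for ip, (label, source) in labels.items():
        if label == "abnormal" and source != "threshold":
            continue  # non-threshold abnormal IPs do not inform threshold tuning
        if model_output.get(ip) != label:
            error += 1
    return error


labels = {
    "203.0.113.1": ("abnormal", "threshold"),
    "203.0.113.2": ("abnormal", "regex"),
    "203.0.113.3": ("normal", None),
}
model_output = {
    "203.0.113.1": "normal",     # mismatch, counted
    "203.0.113.2": "normal",     # mismatch, ignored (regex-based label)
    "203.0.113.3": "abnormal",   # mismatch, counted
}
print(threshold_error_value(labels, model_output))  # 2
```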

[0199] The following are embodiments of the apparatus described in this application, which can be used to execute the embodiments of the method described in this application. For details not disclosed in the apparatus embodiments of this application, please refer to the embodiments of the method described in this application.

[0200] Figure 6 is a schematic diagram of a web crawler detection device provided in an embodiment of this application. The web crawler detection device 600 includes:

[0201] The acquisition module 61 is used to periodically acquire access requests to the target website in order to obtain traffic data;

[0202] The grouping module 62 is used to group the traffic data according to the IP address of each access request to obtain multiple groups, and access requests belonging to the same group have the same IP address;

[0203] The processing module 63 is used to determine a first feature set for each IP address based on the grouping of the IP addresses, so as to obtain a first feature set for each IP address, wherein the features in the first feature set are used to characterize the access behavior of the IP address.

[0204] The detection module 64 is used to input the first feature set of each IP address into the pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs.

[0205] In one feasible implementation, the acquisition module 61 is used to acquire the historical traffic of the target website before the detection module 64 inputs the first feature set of each IP address into the pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs.

[0206] The grouping module 62 is also used to divide access requests with the same IP address in the historical traffic into a group to obtain multiple sample IPs and traffic groups of each sample IP.

[0207] The processing module 63 is further configured to: label the corresponding sample IPs according to the traffic groups of each sample IP, wherein the labels are used to indicate whether the sample IPs are normal or abnormal; determine a second feature set according to the traffic groups of each sample IP to obtain a second feature set for each sample IP; and train the analysis model according to the labels of each sample IP and the second feature set of each sample IP.

[0208] In one feasible implementation, when the processing module 63 tags the corresponding sample IPs according to the traffic groups of each sample IP, it determines the number of accesses and the number of URL types accessed by each sample IP; when the number of types is less than a first threshold and the number of accesses is greater than a second threshold, the sample IP is determined to be an abnormal IP; when the number of types is greater than or equal to the first threshold, the number of types is less than a third threshold, and the sample IP does not access static resources, the sample IP is determined to be an abnormal IP.

[0209] In one feasible implementation, the processing module 63 inputs the first feature set of each IP address into a pre-trained analysis model. After the analysis model outputs abnormal IPs and normal IPs, it further determines, for each abnormal IP, whether the abnormal IP meets a first preset condition based on the grouping of the abnormal IP. The first preset condition includes at least one of the following conditions: the abnormal IP has accessed a target URL; the access behavior of the abnormal IP corresponds to a target URL sequence; the request header value of the access request of the abnormal IP is a target value; the access frequency of the abnormal IP address is less than a preset frequency; and the number of accesses of the abnormal IP address is less than a preset number. When the abnormal IP meets the first preset condition, the abnormal IP is determined to be a normal IP.

[0210] In one feasible implementation, the processing module 63 inputs the first feature set of each IP address into a pre-trained analysis model. After the analysis model outputs abnormal IPs and normal IPs, it is further used to obtain, for each abnormal IP, new access requests initiated by the abnormal IP within a future preset time period; determine whether the new access requests meet a second preset condition, the second preset condition including at least one of the following conditions: access target URL, request header value is a target value, access frequency is less than a preset frequency, access count is less than a preset count; when the abnormal IP meets the second preset condition, the abnormal IP is allowed to access the target website.

[0211] In one feasible implementation, before the grouping module 62 groups the traffic data according to the IP address of each access request to obtain multiple groups, the processing module 63 is further configured to filter out traffic that does not need to be intercepted from the traffic data. The traffic that does not need to be intercepted includes at least one of the following: traffic whose IP address matches the whitelist, traffic from known crawlers, and traffic that conforms to custom rules. The custom rules include at least one of the following: the IP address is the target address or belongs to the target network segment, the IP address accesses the target Uniform Resource Locator URL, and the IP address comes from the target region.
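The filtering of traffic that does not need to be intercepted can be sketched as follows. The whitelist entries, target network segment, and crawler tokens are illustrative assumptions. Recognizing a known crawler by a User-Agent token alone is a simplification, because that header can be forged; a real deployment would typically add further verification.

```python
import ipaddress

# Illustrative configuration; none of these values come from the source.
WHITELIST = {"192.0.2.10"}
TARGET_NETWORKS = [ipaddress.ip_network("198.51.100.0/24")]
KNOWN_CRAWLER_TOKENS = ("Googlebot", "bingbot")


def should_skip(request):
    """True if the request belongs to traffic that does not need interception."""
    ip = request["ip"]
    if ip in WHITELIST:                                   # whitelist match
        return True
    user_agent = request.get("user_agent", "")
    if any(token in user_agent for token in KNOWN_CRAWLER_TOKENS):
        return True                                       # known crawler
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in TARGET_NETWORKS)    # custom rule: segment


def prefilter(traffic):
    return [r for r in traffic if not should_skip(r)]


traffic = [
    {"ip": "192.0.2.10", "user_agent": "Mozilla/5.0"},
    {"ip": "203.0.113.5", "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1)"},
    {"ip": "198.51.100.77", "user_agent": "Mozilla/5.0"},
    {"ip": "203.0.113.9", "user_agent": "Mozilla/5.0"},
]
remaining = prefilter(traffic)
print([r["ip"] for r in remaining])
```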

[0212] In one feasible implementation, before determining the first feature set based on the grouping of each IP address, the processing module 63 is further configured to label the corresponding IP addresses based on the grouping of each IP address; for each IP address, determine whether the label of the IP address is consistent with the output of the analysis model to obtain an error value, wherein the error value is the number of IP addresses whose labels are inconsistent with the output of the analysis model; and adjust a threshold based on the error value, wherein the threshold is a threshold used in the training of the analysis model.

[0213] In one feasible implementation, the processing module 63 is used to determine normal IPs and threshold-type abnormal IPs from the IP addresses of each group. The threshold-type abnormal IPs are IP addresses that are labeled as abnormal IPs because they do not meet a preset threshold. The module 63 is used to determine whether the labels of each normal IP and each threshold-type abnormal IP are consistent with the output of the analysis model in order to obtain an error value.

[0214] The crawler detection device provided in this application embodiment can perform the actions of the analysis server in the above embodiments. Its implementation principle and technical effect are similar and are not described again here.

[0215] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. The electronic device 700 is, for example, the analysis server described above, and includes:

[0216] Processor 71 and memory 72;

[0217] The memory 72 stores computer instructions;

[0218] The processor 71 executes the computer instructions stored in the memory 72, causing the processor 71 to perform the crawler detection method implemented by the analysis server as described above.

[0219] The specific implementation process of processor 71 can be found in the above method embodiments, and its implementation principle and technical effect are similar. It will not be repeated here.

[0220] Optionally, the electronic device 700 also includes a communication component 73. The processor 71, memory 72, and communication component 73 can be connected via a bus 74.

[0221] This application also provides a computer-readable storage medium storing computer instructions, which, when executed by a processor, are used to implement the crawler detection method implemented by the analysis server as described above.

[0222] This application also provides a computer program product, which includes a computer program that, when executed by a processor, implements the crawler detection method implemented by the analysis server as described above.

[0223] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this application are indicated by the following claims.

[0224] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A method for detecting web crawlers, characterized in that it comprises:
periodically collecting access requests to a target website to obtain traffic data;
grouping the traffic data according to the IP address of each access request to obtain multiple groups, wherein access requests belonging to the same group have the same IP address;
for each IP address, determining a first feature set based on the group of that IP address, wherein the features in the first feature set characterize the access behavior of the IP address;
inputting the first feature set of each IP address into a pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs;
wherein, before inputting the first feature set of each IP address into the pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs, the method further comprises:
obtaining historical traffic of the target website;
grouping access requests with the same IP address in the historical traffic to obtain multiple sample IPs and a traffic group for each sample IP;
labeling each sample IP based on its traffic group, wherein the label indicates whether the sample IP is a normal IP or an abnormal IP;
determining a second feature set for each sample IP based on its traffic group;
training the analysis model based on the label and the second feature set of each sample IP;
wherein labeling each sample IP based on its traffic group comprises:
for each sample IP, determining the number of accesses made by the sample IP and the number of categories of URLs accessed;
when the number of categories is less than a first threshold and the number of accesses is greater than a second threshold, determining that the sample IP is an abnormal IP;
when the number of categories is greater than or equal to the first threshold and less than a third threshold, and the sample IP does not access static resources, determining that the sample IP is an abnormal IP.
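
As an illustration only, the grouping and sample-labeling steps of claim 1 might be sketched in Python as follows. The threshold values, the list of static-resource suffixes, and the rule that a URL's category is its first path segment are all assumptions made for this sketch; the claim names the thresholds but does not fix their values or define how URLs are categorized. Feature extraction and model training are not shown.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical values: the claim names these thresholds but gives no numbers.
FIRST_THRESHOLD = 3     # lower bound on the number of URL categories
SECOND_THRESHOLD = 100  # access count above which a narrow visitor is suspicious
THIRD_THRESHOLD = 10    # upper bound of the "medium variety" band
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")  # assumed list


def group_by_ip(requests):
    """Group access requests (dicts with 'ip' and 'url' keys) by source IP."""
    groups = defaultdict(list)
    for req in requests:
        groups[req["ip"]].append(req)
    return dict(groups)


def url_category(url):
    """Assumed categorization: the first path segment of the URL."""
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return segments[0] if segments else "/"


def is_static(url):
    """True if the URL path ends with one of the assumed static suffixes."""
    return urlsplit(url).path.lower().endswith(STATIC_SUFFIXES)


def label_sample_ip(group):
    """Apply the two labeling rules of claim 1 to one IP's traffic group."""
    access_count = len(group)
    categories = len({url_category(r["url"]) for r in group})
    touches_static = any(is_static(r["url"]) for r in group)

    # Rule 1: few URL categories but many accesses.
    if categories < FIRST_THRESHOLD and access_count > SECOND_THRESHOLD:
        return "abnormal"
    # Rule 2: medium variety of categories and no static resources fetched.
    if FIRST_THRESHOLD <= categories < THIRD_THRESHOLD and not touches_static:
        return "abnormal"
    return "normal"
```

The labels produced this way would then be paired with each sample IP's second feature set to train the analysis model; any classifier could fill that role, since the claim does not name one.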

2. The method according to claim 1, characterized in that, after inputting the first feature set of each IP address into the pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs, the method further comprises: for each abnormal IP, determining, based on the group of the abnormal IP, whether the abnormal IP meets a first preset condition, the first preset condition including at least one of the following: the abnormal IP has accessed a target URL, the access behavior of the abnormal IP corresponds to a target URL sequence, the value of a request header of an access request of the abnormal IP is a target value, the access frequency of the abnormal IP is less than a preset frequency, and the number of accesses of the abnormal IP is less than a preset number; and when the abnormal IP meets the first preset condition, determining that the abnormal IP is a normal IP.
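
A minimal sketch of the re-check in claim 2 is given below, under stated assumptions. The function names and parameters are hypothetical. "Corresponds to a target URL sequence" is interpreted here as the target URLs appearing in order within the visited URLs, and any one configured sub-condition is treated as sufficient; the claim does not settle either point. The access-frequency sub-condition is omitted for brevity.

```python
def contains_sequence(visited, target):
    """True if `target` appears within `visited` as an ordered subsequence."""
    it = iter(visited)
    # `url in it` advances the iterator, so order is enforced.
    return all(url in it for url in target)


def meets_first_condition(group, *, target_urls=frozenset(), target_sequence=(),
                          header_name=None, header_value=None, max_count=None):
    """Check the configured sub-conditions against one abnormal IP's group.

    `group` is a list of request dicts with a 'url' key and an optional
    'headers' dict. Only sub-conditions that are configured are evaluated.
    """
    urls = [r["url"] for r in group]
    checks = []
    if target_urls:
        checks.append(any(u in target_urls for u in urls))
    if target_sequence:
        checks.append(contains_sequence(urls, target_sequence))
    if header_name is not None:
        checks.append(any(r.get("headers", {}).get(header_name) == header_value
                          for r in group))
    if max_count is not None:
        checks.append(len(group) < max_count)
    return any(checks)


def review_abnormal_ips(abnormal_groups, **rules):
    """Split abnormal IPs into those kept abnormal and those restored to normal."""
    still_abnormal, restored = [], []
    for ip, group in abnormal_groups.items():
        (restored if meets_first_condition(group, **rules) else still_abnormal).append(ip)
    return still_abnormal, restored
```

For example, `review_abnormal_ips(groups, target_sequence=("/login", "/cart", "/pay"))` would restore to normal any flagged IP that visited those pages in that order.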

3. The method according to claim 1, characterized in that, after inputting the first feature set of each IP address into the pre-trained analysis model so that the analysis model outputs abnormal IPs and normal IPs, the method further comprises: for each abnormal IP, obtaining new access requests initiated by the abnormal IP within a preset future time period; determining whether the newly initiated access requests meet a second preset condition, the second preset condition including at least one of the following: a target URL is accessed, the value of a request header is a target value, the access frequency is less than a preset frequency, and the number of accesses is less than a preset number; and when the abnormal IP meets the second preset condition, allowing the abnormal IP to access the target website.
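
The observation step of claim 3 could look roughly like the sketch below. It assumes each request carries a numeric timestamp under a hypothetical `ts` key (seconds), measures frequency as requests per second over the window, and treats any one configured sub-condition as sufficient. None of these details come from the claim, and the request-header sub-condition is left out to keep the sketch short.

```python
def requests_in_window(requests, start, window_seconds):
    """Keep the requests whose timestamp falls in [start, start + window_seconds)."""
    end = start + window_seconds
    return [r for r in requests if start <= r["ts"] < end]


def meets_second_condition(new_requests, window_seconds, *,
                           target_urls=frozenset(), max_frequency=None,
                           max_count=None):
    """Evaluate the configured sub-conditions on newly observed requests."""
    checks = []
    if target_urls:
        checks.append(any(r["url"] in target_urls for r in new_requests))
    if max_frequency is not None:
        # Frequency is taken as requests per second over the whole window.
        checks.append(len(new_requests) / window_seconds < max_frequency)
    if max_count is not None:
        checks.append(len(new_requests) < max_count)
    return any(checks)


def allow_abnormal_ip(requests, flagged_at, window_seconds, **rules):
    """Decide whether a flagged IP may access the site after the observation window."""
    observed = requests_in_window(requests, flagged_at, window_seconds)
    return meets_second_condition(observed, window_seconds, **rules)
```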

4. The method according to claim 1, characterized in that, before grouping the traffic data according to the IP address of each access request to obtain multiple groups, the method further comprises: filtering out, from the traffic data, traffic that does not need to be blocked, the traffic that does not need to be blocked including at least one of the following: traffic whose IP address matches a whitelist, traffic from known crawlers, and traffic that conforms to custom rules, wherein the custom rules include at least one of the following: the IP address is a target address or belongs to a target network segment, the IP address accesses a target Uniform Resource Locator (URL), and the IP address comes from a target region.
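
The pre-filtering of claim 4 can be illustrated with Python's standard `ipaddress` module, as in the sketch below. The whitelist entries, network segment, crawler tokens, and target URL are hypothetical placeholders. Recognizing known crawlers by a User-Agent substring is an assumption of this sketch, since the claim does not say how known crawlers are identified. The target-region rule is omitted because it would need a geolocation database.

```python
import ipaddress

# Hypothetical configuration; none of these values come from the patent.
WHITELIST = {"203.0.113.7"}
TARGET_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]
KNOWN_CRAWLER_TOKENS = ("Googlebot", "bingbot")  # assumed User-Agent substrings
TARGET_URLS = {"/healthz"}


def needs_no_blocking(req):
    """True if the request matches the whitelist, a known crawler, or a custom rule."""
    if req["ip"] in WHITELIST:
        return True
    user_agent = req.get("user_agent", "")
    if any(token in user_agent for token in KNOWN_CRAWLER_TOKENS):
        return True
    address = ipaddress.ip_address(req["ip"])
    if any(address in network for network in TARGET_NETWORKS):
        return True
    return req["url"] in TARGET_URLS


def filter_traffic(requests):
    """Drop traffic that does not need to be blocked before grouping by IP."""
    return [r for r in requests if not needs_no_blocking(r)]
```

Only the requests returned by `filter_traffic` would go on to the grouping and feature-extraction steps.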

5. The method according to claim 1, characterized in that, before determining, for each IP address, the first feature set based on the group of that IP address, the method further comprises: labeling each IP address based on its group; for each IP address, determining whether the label of the IP address is consistent with the output of the analysis model to obtain an error value, the error value being the number of IP addresses whose labels are inconsistent with the output of the analysis model; and adjusting a threshold based on the error value, the threshold being a threshold used in training the analysis model.

6. The method according to claim 5, characterized in that determining, for each IP address, whether the label of the IP address is consistent with the output of the analysis model to obtain an error value comprises: identifying, from the IP addresses of the groups, normal IPs and threshold-type abnormal IPs, the threshold-type abnormal IPs being IP addresses labeled as abnormal IPs because they do not meet a preset threshold; and determining whether the label of each normal IP and each threshold-type abnormal IP is consistent with the output of the analysis model to obtain the error value.
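
The error value described in claims 5 and 6 amounts to a mismatch count, which might be sketched as follows. The function name and the dictionary-based representation of labels and model outputs are assumptions for illustration.

```python
def error_value(labels, model_output, threshold_type_abnormal):
    """Count IPs whose rule-based label disagrees with the analysis model.

    labels: dict mapping IP -> "normal" or "abnormal" (rule-based label)
    model_output: dict mapping IP -> "normal" or "abnormal" (model decision)
    threshold_type_abnormal: set of IPs labeled abnormal because of a threshold

    Following claim 6, only normal IPs and threshold-type abnormal IPs are
    compared; IPs labeled abnormal for other reasons are skipped.
    """
    errors = 0
    for ip, label in labels.items():
        comparable = label == "normal" or ip in threshold_type_abnormal
        if comparable and model_output[ip] != label:
            errors += 1
    return errors
```

The claims state that a training threshold is adjusted based on this value but do not say in which direction or by how much, so the adjustment rule is left out of the sketch.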

7. An electronic device comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the electronic device is caused to implement the method according to any one of claims 1 to 6.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the method according to any one of claims 1 to 6 is implemented.