Network anti-crawling method, system and computer device

By dynamically injecting SDKs through a reverse proxy server and employing human-machine adversarial verification, the problem of web crawlers stealing private data is solved, achieving a non-intrusive anti-crawling effect and protecting the security and resources of target websites.

CN117527281B | Active | Publication Date: 2026-06-30 | ZHONGAN INFORMATION TECH SERVICES CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHONGAN INFORMATION TECH SERVICES CO LTD
Filing Date: 2022-07-29
Publication Date: 2026-06-30

Patent Timeline

29 Jul 2022: Application
30 Jun 2026: Publication (CN117527281B)

IPC: H04L9/40

AI Tagging

Technology Topics

Web site Software engineering

AI Technical Summary

Technical Problem

Existing measures against web crawlers that steal private data rely on code-intrusive modifications that disrupt the target website's code structure; a non-intrusive anti-crawling measure is lacking. Unchecked crawler traffic leads to wasted system resources and service unavailability.

Method used

By dynamically injecting the SDK into the access response through a reverse proxy server, combined with a human-machine adversarial verification unit, malicious access behavior can be identified and blocked, achieving non-intrusive anti-crawling.

Benefits of technology

It effectively prevents web crawlers from stealing private data, protects target websites from being accessed by malicious programs, and avoids waste of system resources and service unavailability.

✦ Generated by Eureka AI based on patent content.

Abstract

This application relates to a method, system, and computer device for preventing web scraping. The method includes: when a client sends an access request to a target website, determining whether the target website has configured anti-scraping measures based on preset anti-scraping information; if the target website has configured anti-scraping measures, injecting an SDK into the access response, generating a first access response, and returning the first access response to the client, wherein the access response is generated by the target website based on the access request; the client asynchronously submits its runtime information to a human-machine adversarial verification unit based on the first access response; when the client sends an access request to a subpage of the target website, determining whether the interface corresponding to the subpage is a protected interface; if the interface corresponding to the subpage is a protected interface, the human-machine adversarial verification unit performs human-machine adversarial verification based on the client's runtime information; if the human-machine adversarial verification passes, the access request is sent to the subpage. This method can prevent the target website from being accessed by malicious programs.

Description

Technical Field

[0001] This application relates to the field of network security technology, and in particular to a method, system and computer device for preventing web scraping.

Background Technology

[0002] Because online services are publicly accessible, they are easily reached by malicious web crawlers and other automated programs. These programs access websites at high frequency to request information, or simulate login requests so as to use online services as if they were legitimate users, with the aim of excessively collecting users' private data held by internet companies. The consequences include wasted system resources, service unavailability, and a degraded experience for normal users. The exploitation of e-commerce websites by malicious actors is a typical example of this type of attack.

[0003] In existing technologies, such as patent application number "CN113660238", a method is disclosed that pre-embeds a point to obtain the client's operating environment and determines whether the terminal currently accessing the business transaction system is a risky device based on the client's operating environment. This method of obtaining the client's operating environment through pre-embedded points is a code-intrusive modification method. This approach disrupts the code structure; that is, when pre-embedding points on a target website, those skilled in the art need to manipulate the target website's code logic, otherwise system vulnerabilities may arise, rendering the target system inaccessible.

[0004] Therefore, there is an urgent need for a non-intrusive network anti-crawling method, system, and computer device that can prevent web crawlers from stealing private data.

Summary of the Invention

[0005] Therefore, it is necessary to provide a web anti-crawling method, system, and computer device that can effectively prevent web crawlers from stealing private data, addressing the aforementioned technical problems.

[0006] In one aspect, a method for preventing web scraping is provided, the method comprising:

[0007] When a client sends an access request to a target website, it determines whether the target website has configured anti-crawling based on preset anti-crawling information. If the target website has configured anti-crawling, the SDK is injected into the access response, a first access response is generated, and the first access response is returned to the client. The access response is generated by the target website based on the access request.

[0008] Based on the first access response, the client asynchronously submits its operating information to the human-machine adversarial verification unit;

[0009] When the client sends an access request to a subpage of the target website, it determines whether the interface corresponding to the subpage is a protected interface. If the interface corresponding to the subpage is a protected interface, the human-machine confrontation verification unit performs human-machine confrontation verification based on the client's operation information. When the human-machine confrontation verification result is passed, the access request is sent to the subpage.

[0010] In one embodiment, determining whether the target website has configured anti-crawling measures based on preset anti-crawling information includes: obtaining the target website attributes, which include at least one of a domain name, a webpage address, and an access request method; determining whether there is at least one piece of information content in the preset anti-crawling information that is consistent with the target website attributes, which includes the target website domain name, the target website webpage address, and the access request method for accessing the target website; if there is at least one piece of information content in the preset anti-crawling information that is consistent with the target website attributes, then it is determined that the target website has configured anti-crawling measures; otherwise, it is determined that the target website has not configured anti-crawling measures.

[0011] In one embodiment, the human-machine adversarial verification unit performs human-machine adversarial verification based on the client's operating information, including: performing client risk identification based on the client's operating information, obtaining a client risk identification result, wherein the client risk identification result is normal or suspicious; if the client risk identification result is normal, then sending the access request to the sub-webpage; if the client risk identification result is suspicious, then causing the client to initiate human-machine liveness verification.

[0012] In one embodiment, initiating human-machine liveness verification by the client includes: determining whether the client manually triggers verification; if the client manually triggers verification, obtaining the verification result and verifying whether the verification result passes; otherwise, determining that the human-machine adversarial verification result fails; if the verification result passes, determining that the human-machine adversarial verification result passes; otherwise, determining that the human-machine adversarial verification result fails.

[0013] In one embodiment, based on the client's operational information, client risk identification is performed to obtain client risk identification results, including: obtaining the number of access requests sent by the client to the sub-webpage, or monitoring the frequency of the client sending access requests to the sub-webpage; if the number of access requests sent by the client to the sub-webpage exceeds a first threshold, or the frequency of the client sending access requests to the sub-webpage exceeds a second threshold, then the client risk identification result is determined to be suspicious; otherwise, the client risk identification result is determined to be normal.

[0014] In one embodiment, the client's operational information includes: user agent information, identification information, behavioral information, and client internet address; the behavioral information includes at least one of: the number of access requests sent by the client to the sub-webpage and the frequency at which the client sends access requests to the sub-webpage.

[0015] In one embodiment, before injecting the SDK into the access response and generating the first access response, the method further includes: obtaining the type of the access response and determining whether the type of the access response is a Hypertext Markup Language type; if the type of the access response is a Hypertext Markup Language type, then injecting the SDK into the access response and generating the first access response; otherwise, then directly sending the access response to the client.

[0016] In one embodiment, determining whether the target website has configured anti-crawling measures based on preset anti-crawling information further includes: if the target website has not configured anti-crawling measures, then directly forwarding the access response to the client.

[0017] In another aspect, a web scraping prevention system is provided, the system comprising:

[0018] A reverse proxy server is used to determine whether the target website has configured anti-crawling based on preset anti-crawling information when a client sends an access request to the target website. If the target website has configured anti-crawling, the SDK is injected into the access response, a first access response is generated, and the first access response is returned to the client. The access response is generated by the target website based on the access request.

[0019] The human-machine confrontation verification unit is used to receive the client's running information asynchronously submitted by the client based on the first access response;

[0020] The reverse proxy server is also used to determine whether the interface corresponding to the sub-page is a protected interface when the client sends an access request to the sub-page of the target website.

[0021] The human-machine confrontation verification unit is also used to perform human-machine confrontation verification based on the client's running information when the interface corresponding to the sub-web page is a protected interface;

[0022] The reverse proxy server is also used to send the access request to the sub-webpage when the human-machine confrontation verification result is passed.

[0023] In another aspect, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps:

[0024] When a client sends an access request to a target website, it determines whether the target website has configured anti-crawling based on preset anti-crawling information. If the target website has configured anti-crawling, the SDK is injected into the access response, a first access response is generated, and the first access response is returned to the client. The access response is generated by the target website based on the access request.

[0025] Based on the first access response, the client asynchronously submits its operating information to the human-machine adversarial verification unit;

[0026] When the client sends an access request to a subpage of the target website, it determines whether the interface corresponding to the subpage is a protected interface. If the interface corresponding to the subpage is a protected interface, the human-machine confrontation verification unit performs human-machine confrontation verification based on the client's operation information. When the human-machine confrontation verification result is passed, the access request is sent to the subpage.

[0027] The above-mentioned anti-crawling method, system, and computer device, wherein the method includes, when a client sends an access request to a target website, determining whether the target website has configured anti-crawling based on preset anti-crawling information; if the target website has configured anti-crawling, injecting an SDK into the access response, generating a first access response, and returning the first access response to the client, wherein the access response is generated by the target website based on the access request;

[0028] Based on the first access response, the client asynchronously submits its operating information to the human-machine adversarial verification unit;

[0029] When the client sends an access request to a subpage of the target website, it is determined whether the interface corresponding to the subpage is a protected interface. If the interface is protected, the human-machine adversarial verification unit performs human-machine adversarial verification based on the client's operating information. When the human-machine adversarial verification result is passed, the access request is sent to the subpage. The network anti-crawling method described in this application uses reverse proxy technology to dynamically inject the SDK into the access response that the target website generates for the access request. This allows suspicious or malicious access behaviors to be identified and intercepted in a non-intrusive manner, without disrupting the target website's code structure, and effectively blocks crawlers or other malicious programs from further accessing the target website.

Attached Figure Description

[0030] Figure 1 is a flowchart illustrating a network anti-scraping method in one embodiment;

[0031] Figure 2 is a flowchart illustrating a network anti-scraping method in one embodiment;

[0032] Figure 3 is a flowchart illustrating the step of a client sending an access request to a subpage in one embodiment;

[0033] Figure 4 is a structural block diagram of a network anti-crawling system in one embodiment;

[0034] Figure 5 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0035] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0036] Example 1

[0037] As shown in Figure 1, the anti-scraping method provided in this application includes the following steps:

[0038] When a client sends an access request to a target website, it determines whether the target website has configured anti-crawling based on preset anti-crawling information. If the target website has configured anti-crawling, the SDK is injected into the access response, a first access response is generated, and the first access response is returned to the client. The access response is generated by the target website based on the access request.

[0039] Based on the first access response, the client asynchronously submits its operating information to the human-machine adversarial verification unit;

[0040] When the client sends an access request to a subpage of the target website, it determines whether the interface corresponding to the subpage is a protected interface. If the interface corresponding to the subpage is a protected interface, the human-machine confrontation verification unit performs human-machine confrontation verification based on the client's operation information. When the human-machine confrontation verification result is passed, the access request is sent to the subpage.

[0041] In the aforementioned anti-scraping methods, the SDK is dynamically injected into the access response based on the reverse proxy server. The target website does not need to pre-install security features or make customized development modifications. It only needs to generate anti-scraping information in advance based on the target website that needs protection. Based on the anti-scraping information, the SDK is dynamically injected into the access response in a non-intrusive manner to obtain the client's operating environment. This can effectively identify whether the requester initiating the access request is a real user, a crawler, or malicious traffic, so as to achieve accurate anti-scraping of the target website and prevent the target website from being accessed by malicious programs.

[0042] In one embodiment, determining whether the target website has configured anti-crawling measures based on preset anti-crawling information includes: obtaining the target website attributes, which include at least one of a domain name, a webpage address, and an access request method; determining whether there is at least one piece of information content in the preset anti-crawling information that is consistent with the target website attributes, which includes the target website domain name, the target website webpage address, and the access request method for accessing the target website; if there is at least one piece of information content in the preset anti-crawling information that is consistent with the target website attributes, then it is determined that the target website has configured anti-crawling measures; otherwise, it is determined that the target website has not configured anti-crawling measures.
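
The matching rule in this embodiment can be sketched as follows. This is a minimal illustration, not the patented implementation: the `AntiCrawlRule` layout, the function name `is_anti_crawl_configured`, and the use of exact string comparison are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional
from urllib.parse import urlsplit


@dataclass(frozen=True)
class AntiCrawlRule:
    """One entry of the preset anti-crawling information (hypothetical layout)."""
    domain: Optional[str] = None   # target website domain name
    url: Optional[str] = None      # target website webpage address
    method: Optional[str] = None   # access request method, e.g. "GET"


def is_anti_crawl_configured(request_url: str, request_method: str,
                             rules: Iterable[AntiCrawlRule]) -> bool:
    """Return True if at least one rule field is consistent with a target attribute."""
    domain = urlsplit(request_url).hostname
    method = request_method.upper()
    for rule in rules:
        if rule.domain is not None and rule.domain == domain:
            return True
        if rule.url is not None and rule.url == request_url:
            return True
        if rule.method is not None and rule.method.upper() == method:
            return True
    # No piece of preset information matched: anti-crawling is not configured.
    return False
```

For example, with `rules = [AntiCrawlRule(domain="shop.example.com")]`, a request to `https://shop.example.com/list` is treated as configured, while a request to any other host is forwarded unchanged.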

[0043] In one embodiment, the human-machine adversarial verification unit performs human-machine adversarial verification based on the client's operating information, including: performing client risk identification based on the client's operating information, obtaining a client risk identification result, wherein the client risk identification result is normal or suspicious; if the client risk identification result is normal, then sending the access request to the sub-webpage; if the client risk identification result is suspicious, then causing the client to initiate human-machine liveness verification.

[0044] In one embodiment, initiating human-machine liveness verification by the client includes: determining whether the client manually triggers verification; if the client manually triggers verification, obtaining the verification result and verifying whether the verification result passes; otherwise, determining that the human-machine adversarial verification result fails; if the verification result passes, determining that the human-machine adversarial verification result passes; otherwise, determining that the human-machine adversarial verification result fails.
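
The two-stage decision described in the two preceding embodiments (risk identification first, liveness verification only for suspicious clients) can be written as a small decision function. The names `Risk` and `adversarial_verification` are hypothetical, and how the challenge itself is presented to the user is left out of this sketch.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    NORMAL = "normal"
    SUSPICIOUS = "suspicious"


def adversarial_verification(risk: Risk, manually_triggered: bool,
                             challenge_passed: Optional[bool]) -> bool:
    """Return True when the access request may be sent on to the sub-webpage."""
    if risk is Risk.NORMAL:
        # Normal clients are forwarded without a liveness challenge.
        return True
    if not manually_triggered:
        # Suspicious client that never triggered the challenge by hand: fail.
        return False
    # Suspicious client that triggered the challenge: pass only if it succeeded.
    return bool(challenge_passed)
```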

[0045] In one embodiment, based on the client's operational information, client risk identification is performed to obtain client risk identification results, including: obtaining the number of access requests sent by the client to the sub-webpage, or monitoring the frequency of the client sending access requests to the sub-webpage; if the number of access requests sent by the client to the sub-webpage exceeds a first threshold, or the frequency of the client sending access requests to the sub-webpage exceeds a second threshold, then the client risk identification result is determined to be suspicious; otherwise, the client risk identification result is determined to be normal.
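
One possible way to realize the count and frequency checks is a per-client monitor with a sliding time window. The sketch below assumes that "frequency" means requests observed within the last `window_seconds`; the patent does not fix the window length or the threshold values, so the class name `RequestMonitor` and all parameters are illustrative.

```python
from collections import deque
from typing import Deque


class RequestMonitor:
    """Tracks one client's requests to a sub-webpage (illustrative only)."""

    def __init__(self, max_count: int, max_per_window: int, window_seconds: float):
        self.max_count = max_count            # first threshold: total requests
        self.max_per_window = max_per_window  # second threshold: requests per window
        self.window_seconds = window_seconds
        self.total = 0
        self.recent: Deque[float] = deque()

    def record(self, timestamp: float) -> str:
        """Record one request and return 'normal' or 'suspicious'."""
        self.total += 1
        self.recent.append(timestamp)
        # Drop requests that have fallen out of the sliding window.
        while self.recent and timestamp - self.recent[0] >= self.window_seconds:
            self.recent.popleft()
        if self.total > self.max_count or len(self.recent) > self.max_per_window:
            return "suspicious"
        return "normal"
```

A client that sends four requests within one second against `max_per_window=3` is flagged on the fourth request; once it slows down, the frequency check clears while the total count keeps accumulating.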

[0046] In one embodiment, the client's operational information includes: user agent information, identification information, behavioral information, and client internet address; the behavioral information includes at least one of: the number of access requests sent by the client to the sub-webpage and the frequency at which the client sends access requests to the sub-webpage.

[0047] In one embodiment, as shown in Figure 2, before injecting the SDK into the access response and generating the first access response, the method further includes: obtaining the type of the access response and determining whether the type of the access response is a Hypertext Markup Language (HTML) type; if the type of the access response is an HTML type, injecting the SDK into the access response and generating the first access response; otherwise, directly sending the access response to the client.
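
The type check and injection step might look like the following sketch. It assumes the response type is read from the HTTP `Content-Type` header and that the SDK is delivered as a script tag placed before the closing `</body>` tag; the SDK URL `/__anti_crawl/sdk.js` and the function names are hypothetical, since the patent does not specify how the SDK is embedded.

```python
import re

SDK_TAG = '<script src="/__anti_crawl/sdk.js" async></script>'  # hypothetical SDK URL
_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def is_html(content_type: str) -> bool:
    """True when the media type of a Content-Type header value is text/html."""
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def build_first_access_response(content_type: str, body: str) -> str:
    """Inject the SDK into HTML responses; pass every other type through unchanged."""
    if not is_html(content_type):
        return body
    matches = list(_BODY_CLOSE.finditer(body))
    if not matches:
        # No closing body tag found: append the SDK at the end of the document.
        return body + SDK_TAG
    idx = matches[-1].start()
    return body[:idx] + SDK_TAG + body[idx:]
```

Because the rewrite happens on the proxy, the target website's own templates stay untouched, which is the non-intrusive property the method relies on.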

[0048] In one embodiment, determining whether the target website has configured anti-crawling measures based on preset anti-crawling information further includes: if the target website has not configured anti-crawling measures, then directly forwarding the access response to the client.

[0049] Example 2

[0050] In one embodiment, the web scraping prevention method includes:

[0051] Pre-generated anti-crawling information includes the target website domain name, the target website webpage address, and the access request method to the target website;

[0052] When a client sends an access request to a target website, the target website's attributes are obtained, including at least one of the following: domain name, webpage address, and access request method. If at least one of the target website's attributes matches the preset anti-scraping information, the target website is determined to have anti-scraping configured. If none of the target website's attributes matches the preset anti-scraping information, the target website is determined not to have anti-scraping configured.

[0053] When it is determined that the target website has configured anti-scraping measures, the type of the access response is obtained and it is determined whether that type is Hypertext Markup Language (HTML). If the type of the access response is HTML, the SDK is injected into the access response to generate a first access response, and the first access response is returned to the client. The access response is generated by the target website based on the access request.

[0054] If it is determined that the target website does not have anti-scraping measures configured, the access response generated by the target website will be sent directly to the client.

[0055] Once the client receives the first access response, it will asynchronously submit the client's runtime environment to the human-machine adversarial verification unit, but human-machine adversarial verification will not be performed at this time.
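
The separation between submitting runtime information and verifying it can be illustrated with a simple in-memory store on the verification-unit side. The field names follow the operational information listed in the embodiments (user agent, identification, behavioral information, internet address); the class names and the choice to key records by client identifier are assumptions for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ClientRuntimeInfo:
    user_agent: str                   # user agent information
    client_id: str                    # identification information
    ip_address: str                   # client internet address
    request_count: int = 0            # behavioral information: number of requests
    requests_per_second: float = 0.0  # behavioral information: request frequency


class RuntimeInfoStore:
    """Holds the latest report per client; submitting does not trigger verification."""

    def __init__(self) -> None:
        self._by_client: Dict[str, ClientRuntimeInfo] = {}

    def submit(self, info: ClientRuntimeInfo) -> None:
        # Only record the report here; verification runs later, when a
        # protected interface is requested.
        self._by_client[info.client_id] = info

    def lookup(self, client_id: str) -> Optional[ClientRuntimeInfo]:
        return self._by_client.get(client_id)
```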

[0056] The target website contains many subpages, and not every interface corresponding to a subpage is a protected interface. As shown in Figure 3, when a user accesses a subpage of the target website through a client, the human-machine adversarial verification unit performs human-machine adversarial verification based on the client's operating environment information only if the interface corresponding to that subpage is a protected interface. In practical application scenarios, those skilled in the art can determine, based on the actual situation, which interfaces corresponding to subpages of the target website are protected interfaces and which are not.
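
Since the choice of protected interfaces is left to the operator, one simple representation is a list of path patterns checked on the proxy. The sketch below uses shell-style wildcards from Python's `fnmatch` module; the pattern syntax and the function name `is_protected_interface` are assumptions, not something the patent prescribes.

```python
from fnmatch import fnmatchcase
from typing import Iterable
from urllib.parse import urlsplit


def is_protected_interface(url: str, protected_patterns: Iterable[str]) -> bool:
    """Return True if the URL path matches any operator-configured pattern."""
    path = urlsplit(url).path or "/"
    return any(fnmatchcase(path, pattern) for pattern in protected_patterns)
```

With `protected_patterns = ["/api/orders/*", "/login"]`, a request for `/api/orders/42` goes through human-machine adversarial verification, while `/about` is forwarded directly.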

[0057] In one embodiment, the human-machine adversarial verification unit performs human-machine adversarial verification based on the client's operating information, including: performing client risk identification based on the client's operating information, obtaining a client risk identification result, wherein the client risk identification result is normal or suspicious; if the client risk identification result is normal, then sending the access request to the sub-webpage; if the client risk identification result is suspicious, then causing the client to initiate human-machine liveness verification.

[0058] In one embodiment, initiating human-machine liveness verification on the client includes: determining whether the client manually triggers verification; if the client manually triggers verification, obtaining the verification result and verifying whether the verification result passes; otherwise, determining that the human-machine adversarial verification result fails; if the verification result passes, determining that the human-machine adversarial verification result passes; otherwise, determining that the human-machine adversarial verification result fails. It should be understood that the human-machine liveness verification also includes facial recognition verification, SMS verification code verification, and other human-machine liveness verification methods that can determine that the user accessing the target website through the client is a real person.

[0059] In one embodiment, based on the client's operational information, client risk identification is performed to obtain client risk identification results, including: obtaining the number of access requests sent by the client to the sub-webpage, or monitoring the frequency of the client sending access requests to the sub-webpage; if the number of access requests sent by the client to the sub-webpage exceeds a first threshold, or the frequency of the client sending access requests to the sub-webpage exceeds a second threshold, then the client risk identification result is determined to be suspicious; otherwise, the client risk identification result is determined to be normal.

[0060] In one embodiment, the preset anti-scraping information can also be generated based on the client risk identification results, that is, within a certain time threshold, the number of access requests sent by the client to the target website is obtained, or the frequency of the client sending access requests to the target website is monitored; if within a certain time threshold, the number of access requests sent by the client to the target website exceeds a third threshold, or the frequency of the client sending access requests to the target website exceeds a fourth threshold, then anti-scraping information is generated based on the target website attributes.
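
A rough sketch of generating anti-scraping information from observed traffic is shown below. It assumes the time window is measured back from the most recent request and that frequency is computed as requests in the window divided by the window length; the function name, the dictionary layout, and both interpretations are illustrative assumptions.

```python
from typing import Dict, Optional, Sequence
from urllib.parse import urlsplit


def generate_anti_crawl_info(timestamps: Sequence[float], url: str, method: str,
                             window_seconds: float, third_threshold: int,
                             fourth_threshold: float) -> Optional[Dict[str, Optional[str]]]:
    """Return a rule built from the target attributes, or None if traffic looks normal.

    timestamps: request times (in seconds) of one client to the target website.
    """
    if not timestamps:
        return None
    latest = max(timestamps)
    in_window = [t for t in timestamps if latest - t < window_seconds]
    count = len(in_window)
    frequency = count / window_seconds  # requests per second within the window
    if count > third_threshold or frequency > fourth_threshold:
        return {
            "domain": urlsplit(url).hostname,
            "url": url,
            "method": method.upper(),
        }
    return None
```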

[0061] In one embodiment, the client's operational information includes: user agent information, identification information, behavioral information, and client internet address; the behavioral information includes at least one of: the number of access requests sent by the client to the sub-webpage and the frequency at which the client sends access requests to the sub-webpage.

[0062] It should be understood that, although the steps in the flowcharts of Figures 1-3 are shown sequentially as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless otherwise specified herein, there is no strict order in which these steps are executed, and they can be performed in other orders. Moreover, at least some of the steps in Figures 1-3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times. Nor are they necessarily executed sequentially; they can be executed in turn or alternately with other steps, or with at least some of the sub-steps or stages of other steps.

[0063] Example 3

[0064] In one embodiment, as shown in Figure 4, a web anti-scraping system is provided, including:

[0065] A reverse proxy server is used to determine whether the target website has configured anti-crawling measures based on preset anti-crawling information when a client sends an access request to the target website. If the target website has configured anti-crawling measures, the SDK is injected into the access response to generate a first access response, which is then returned to the client. The access response is generated by the target website based on the access request. If the target website has not configured anti-crawling measures, the access response is directly forwarded to the client.

[0066] The human-machine confrontation verification unit is used to receive the client's running information asynchronously submitted by the client based on the first access response;

[0067] The reverse proxy server is also used to determine whether the interface corresponding to the sub-page is a protected interface when the client sends an access request to the sub-page of the target website.

[0068] The human-machine confrontation verification unit is also used to perform human-machine confrontation verification based on the client's running information when the interface corresponding to the sub-web page is a protected interface;

[0069] The reverse proxy server is also used to send the access request to the sub-webpage when the human-machine confrontation verification result is passed.
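
The division of work between the two components for a sub-webpage request can be sketched with two small classes. Everything concrete here is an assumption made for illustration: the class and method names, the use of an exact-match path set for protected interfaces, the 403 status returned on failure, and the choice to treat a client that never submitted runtime information as failing verification.

```python
from typing import Callable, Dict, Set, Tuple

Response = Tuple[int, str]  # (status code, body)


class VerificationUnit:
    """Receives runtime reports asynchronously and verifies on demand."""

    def __init__(self) -> None:
        self._reports: Dict[str, dict] = {}

    def receive(self, client_id: str, info: dict) -> None:
        self._reports[client_id] = info

    def verify(self, client_id: str) -> bool:
        info = self._reports.get(client_id)
        # Assumption: no report at all (SDK never ran) counts as a failure.
        return info is not None and not info.get("suspicious", False)


class ReverseProxy:
    """Checks protected interfaces and forwards requests that pass."""

    def __init__(self, unit: VerificationUnit, protected_paths: Set[str],
                 upstream: Callable[[str], Response]) -> None:
        self.unit = unit
        self.protected_paths = protected_paths
        self.upstream = upstream  # stands in for the target website

    def handle_subpage(self, client_id: str, path: str) -> Response:
        if path in self.protected_paths and not self.unit.verify(client_id):
            return 403, "blocked"
        return self.upstream(path)
```

The target website only ever sees requests that were unprotected or passed verification, so none of this logic has to live in its own code.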

[0070] In one embodiment, the system further includes a console that is communicatively connected to the reverse proxy server for storing preset anti-scraping information.

[0071] In one embodiment, the reverse proxy server is further configured to obtain the target website attributes, which include at least one of a domain name, a webpage address, and an access request method; determine whether there is at least one piece of information content in the preset anti-crawling information that is consistent with the target website attributes, which includes the target website domain name, the target website webpage address, and the access request method for accessing the target website; if there is at least one piece of information content in the preset anti-crawling information that is consistent with the target website attributes, then it is determined that the target website has configured anti-crawling; otherwise, it is determined that the target website has not configured anti-crawling.

[0072] In one embodiment, the human-machine confrontation unit includes a risk identification module and a human-machine liveness verification module; the risk identification module is used to identify client risks based on the client's operating information and obtain client risk identification results, wherein the client risk identification results are normal or suspicious; if the client risk identification results are normal, the access request is sent to the sub-webpage; if the client risk identification results are suspicious, the client initiates human-machine liveness verification.

[0073] In one embodiment, the human-machine liveness verification module is used to determine whether the client manually triggers verification; if the client manually triggers verification, the verification result is obtained and it is verified whether the verification result passes; otherwise, the human-machine confrontation verification result is determined to be failed; if the verification result passes, the human-machine confrontation verification result is determined to be passed; otherwise, the human-machine confrontation verification result is determined to be failed.
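
The two-stage decision described in the two preceding paragraphs (risk identification first, liveness verification only for suspicious clients) might be sketched as below. The names `Risk`, `liveness_passed`, and `confrontation_verification` are hypothetical, and the boolean inputs stand in for whatever signals a real liveness challenge would produce.

```python
from enum import Enum
from typing import Optional


class Risk(Enum):
    NORMAL = "normal"
    SUSPICIOUS = "suspicious"


def liveness_passed(manually_triggered: bool,
                    challenge_ok: Optional[bool]) -> bool:
    """Liveness verification: fail unless the client manually triggered the
    verification and the obtained verification result passes."""
    if not manually_triggered:
        return False
    return challenge_ok is True


def confrontation_verification(risk: Risk,
                               manually_triggered: bool = False,
                               challenge_ok: Optional[bool] = None) -> bool:
    """Return True when the access request may be sent to the sub-webpage."""
    if risk is Risk.NORMAL:
        return True  # normal clients are forwarded without a challenge
    return liveness_passed(manually_triggered, challenge_ok)
```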

[0074] In one embodiment, the risk identification module is further configured to obtain the number of access requests sent by the client to the sub-page, or monitor the frequency of the client sending access requests to the sub-page; if the number of access requests sent by the client to the sub-page exceeds a first threshold, or the frequency of the client sending access requests to the sub-page exceeds a second threshold, then the client risk identification result is determined to be suspicious; otherwise, the client risk identification result is determined to be normal.
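
As an illustration of the threshold check above, the following sketch tracks one client's requests to a sub-page. The class name `RequestRiskTracker`, the sliding-window definition of "frequency", and the parameter names are assumptions; the text only states that a count is compared with a first threshold and a frequency with a second threshold.

```python
from collections import deque
from typing import Deque


class RequestRiskTracker:
    """Tracks access requests from one client to one sub-page (hypothetical)."""

    def __init__(self, max_total: int, max_per_window: int,
                 window_seconds: float = 1.0) -> None:
        self.max_total = max_total            # first threshold (request count)
        self.max_per_window = max_per_window  # second threshold (frequency)
        self.window_seconds = window_seconds
        self.total = 0
        self._recent: Deque[float] = deque()  # timestamps inside the window

    def record(self, timestamp: float) -> str:
        """Record one request and return 'normal' or 'suspicious'."""
        self.total += 1
        self._recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while self._recent and timestamp - self._recent[0] >= self.window_seconds:
            self._recent.popleft()
        if self.total > self.max_total or len(self._recent) > self.max_per_window:
            return "suspicious"
        return "normal"
```

Timestamps are passed in explicitly so the behavior is deterministic; a live proxy would more likely supply `time.monotonic()`.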

[0075] In one embodiment, the reverse proxy server is further configured to obtain the type of the access response and determine whether the type of the access response is Hypertext Markup Language (HTML). If the type of the access response is HTML, the SDK is injected into the access response to generate a first access response; otherwise, the access response is sent directly to the client.
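
A minimal sketch of the response-type check and injection follows. It assumes that the response type is read from the HTTP `Content-Type` header, that the SDK is delivered as a script tag, and that it is placed just before the closing `</body>` tag; the SDK URL and the function names are hypothetical, as the text does not specify where or how the SDK is inserted.

```python
import re

# Hypothetical script tag that loads the anti-crawling SDK.
SDK_TAG = '<script src="/anti-crawl-sdk.js" async></script>'

_BODY_CLOSE = re.compile(r"</body\s*>", re.IGNORECASE)


def is_html(content_type: str) -> bool:
    """True if the Content-Type media type is text/html (parameters ignored)."""
    return content_type.split(";", 1)[0].strip().lower() == "text/html"


def inject_sdk(content_type: str, body: str, sdk_tag: str = SDK_TAG) -> str:
    """Return the first access response body, or the body unchanged if not HTML."""
    if not is_html(content_type):
        return body  # non-HTML responses are sent to the client as-is
    matches = list(_BODY_CLOSE.finditer(body))
    if not matches:
        return body + sdk_tag  # no closing body tag: append at the end
    pos = matches[-1].start()
    return body[:pos] + sdk_tag + body[pos:]
```

Because the injection happens in the proxy, the target website's own code is left untouched, which is the non-intrusive property the application emphasizes.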

[0076] For specific limitations regarding network anti-scraping systems, please refer to the limitations on network anti-scraping methods mentioned above, which will not be repeated here. Each module in the aforementioned network anti-scraping system can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call and execute the corresponding operations of each module.

[0077] Example 4

[0078] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 5. The computer device includes a processor, memory, network interface, and database connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database stores intercepted access request data. The network interface communicates with external terminals via a network connection. When executed by the processor, the computer program implements a network anti-scraping method.

[0079] Those skilled in the art will understand that the structure shown in Figure 5 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0080] In one embodiment, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to perform the following steps:

[0081] When a client sends an access request to a target website, it determines whether the target website has configured anti-crawling based on preset anti-crawling information. If the target website has configured anti-crawling, the SDK is injected into the access response, a first access response is generated, and the first access response is returned to the client. The access response is generated by the target website based on the access request.

[0082] Based on the first access response, the client asynchronously submits its operating information to the human-machine adversarial verification unit;

[0083] When the client sends an access request to a subpage of the target website, it determines whether the interface corresponding to the subpage is a protected interface. If the interface corresponding to the subpage is a protected interface, the human-machine confrontation verification unit performs human-machine confrontation verification based on the client's operation information. When the human-machine confrontation verification result is passed, the access request is sent to the subpage.
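
The sub-page gate in the last step above might look like the following sketch. The function `handle_subpage_request`, the `verify` and `forward` callbacks, and the set of protected paths are hypothetical. The sketch further assumes that requests to unprotected interfaces are forwarded without verification and that a failed verification blocks the request; the text states only the protected, passing case explicitly.

```python
from typing import Callable, Optional, Set, Tuple


def handle_subpage_request(path: str,
                           client_id: str,
                           protected_paths: Set[str],
                           verify: Callable[[str], bool],
                           forward: Callable[[str], str]
                           ) -> Tuple[bool, Optional[str]]:
    """Return (forwarded, response) for one sub-page access request."""
    if path not in protected_paths:
        # Assumption: unprotected interfaces bypass verification.
        return True, forward(path)
    if verify(client_id):
        # Verification passed: send the access request to the sub-page.
        return True, forward(path)
    # Assumption: a failed verification means the request is not forwarded.
    return False, None
```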

[0084] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0085] Obtain the target website attributes, which include at least one of the following: domain name, webpage address, and access request method; determine whether there is at least one piece of information in the preset anti-crawling information that is consistent with the target website attributes, which includes the target website domain name, target website webpage address, and access request method; if there is at least one piece of information in the preset anti-crawling information that is consistent with the target website attributes, then it is determined that the target website has configured anti-crawling measures; otherwise, it is determined that the target website has not configured anti-crawling measures.

[0086] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0087] Based on the client's operational information, client risk identification is performed, and the client risk identification result is obtained, which is either normal or suspicious. If the client risk identification result is normal, the access request is sent to the sub-webpage. If the client risk identification result is suspicious, human-machine liveness verification is initiated on the client.

[0088] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0089] Determine whether the client manually triggers the verification; if the client manually triggers the verification, obtain the verification result and verify whether the verification result passes; otherwise, determine that the human-machine confrontation verification result fails; if the verification result passes, determine that the human-machine confrontation verification result passes; otherwise, determine that the human-machine confrontation verification result fails.

[0090] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0091] The system acquires the number of access requests sent by the client to the sub-page, or monitors the frequency of access requests sent by the client to the sub-page. If the number of access requests sent by the client to the sub-page exceeds a first threshold, or the frequency of access requests sent by the client to the sub-page exceeds a second threshold, the client risk identification result is determined to be suspicious; otherwise, the client risk identification result is determined to be normal.

[0092] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0093] Obtain the type of the access response and determine whether the type of the access response is Hypertext Markup Language (HTML). If the type of the access response is HTML, inject the SDK into the access response to generate a first access response; otherwise, directly send the access response to the client.

[0094] In one embodiment, the processor, when executing a computer program, also performs the following steps:

[0095] If the target website does not have anti-scraping measures configured, the access response will be directly forwarded to the client.

[0096] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0097] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0098] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A network anti-crawling method, characterized in that the method includes: when a client sends an access request to a target website, determining whether the target website has configured anti-crawling based on preset anti-crawling information; if the target website has configured anti-crawling, injecting an SDK into the access response to generate a first access response, and returning the first access response to the client, wherein the access response is generated by the target website based on the access request; the client asynchronously submitting its running information to a human-machine confrontation verification unit based on the first access response; when the client sends an access request to a subpage of the target website, determining whether the interface corresponding to the subpage is a protected interface; if the interface corresponding to the subpage is a protected interface, performing, by the human-machine confrontation verification unit, human-machine confrontation verification based on the client's running information; and when the human-machine confrontation verification result is passed, sending the access request to the subpage.

2. The network anti-crawling method according to claim 1, characterized in that determining whether the target website has configured anti-crawling based on preset anti-crawling information includes: obtaining the target website attributes, which include at least one of a domain name, a webpage address, and an access request method; determining whether at least one item of the preset anti-crawling information is consistent with the target website attributes, wherein the preset anti-crawling information includes the target website domain name, the target website webpage address, and the access request method for accessing the target website; and if at least one item of the preset anti-crawling information is consistent with the target website attributes, determining that the target website has configured anti-crawling; otherwise, determining that the target website has not configured anti-crawling.

3. The network anti-crawling method according to claim 2, characterized in that the human-machine confrontation verification unit performing human-machine confrontation verification based on the client's running information includes: performing client risk identification based on the client's running information to obtain a client risk identification result, which is either normal or suspicious; if the client risk identification result is normal, sending the access request to the sub-webpage; and if the client risk identification result is suspicious, initiating human-machine liveness verification on the client.

4. The network anti-crawling method according to claim 3, characterized in that initiating human-machine liveness verification on the client includes: determining whether the client manually triggers the verification; if the client manually triggers the verification, obtaining the verification result and verifying whether the verification result passes; otherwise, determining that the human-machine confrontation verification result is failed; and if the verification result passes, determining that the human-machine confrontation verification result is passed; otherwise, determining that the human-machine confrontation verification result is failed.

5. The network anti-crawling method according to claim 3, characterized in that performing client risk identification based on the client's running information to obtain the client risk identification result includes: obtaining the number of access requests sent by the client to the sub-page, or monitoring the frequency of access requests sent by the client to the sub-page; and if the number of access requests sent by the client to the sub-page exceeds a first threshold, or the frequency of the client sending access requests to the sub-page exceeds a second threshold, determining that the client risk identification result is suspicious; otherwise, determining that the client risk identification result is normal.

6. The network anti-crawling method according to claim 5, characterized in that the client's running information includes: user agent information, identification information, behavioral information, and the client internet address; the behavioral information includes at least one of the number of access requests sent by the client to the sub-webpage and the frequency at which the client sends access requests to the sub-webpage.

7. The network anti-crawling method according to claim 6, characterized in that before injecting the SDK into the access response and generating the first access response, the method further includes: obtaining the type of the access response and determining whether the type of the access response is Hypertext Markup Language; if the type of the access response is Hypertext Markup Language, injecting the SDK into the access response to generate the first access response; otherwise, sending the access response directly to the client.

8. The network anti-crawling method according to claim 1, characterized in that determining whether the target website has configured anti-crawling based on preset anti-crawling information further includes: if the target website has not configured anti-crawling, forwarding the access response directly to the client.

9. A network anti-crawling system, characterized in that the system includes: a reverse proxy server, used to determine, when a client sends an access request to a target website, whether the target website has configured anti-crawling based on preset anti-crawling information, and, if the target website has configured anti-crawling, to inject an SDK into the access response, generate a first access response, and return the first access response to the client, wherein the access response is generated by the target website based on the access request; and a human-machine confrontation verification unit, used to receive the client's running information asynchronously submitted by the client based on the first access response; wherein the reverse proxy server is also used to determine whether the interface corresponding to a sub-page of the target website is a protected interface when the client sends an access request to the sub-page; the human-machine confrontation verification unit is also used to perform human-machine confrontation verification based on the client's running information when the interface corresponding to the sub-page is a protected interface; and the reverse proxy server is also used to send the access request to the sub-page when the human-machine confrontation verification result is passed.

10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 8.

Citation Information

Patent Citations

Method and system for recognizing web crawler
CN107147640A
Method and system for realizing streaming crawler
CN113297449A
