Network crawler system and method including advertisement filtering

By combining a scheduler and a machine learning filtering engine, dual filtering of advertising content is achieved, solving the problems of resource waste and accuracy caused by web crawler engines crawling advertising content, and improving crawling efficiency and accuracy.

CN117633327B · Active · Publication Date: 2026-06-30 · CHINA TELECOM CORP LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHINA TELECOM CORP LTD
Filing Date: 2023-12-08
Publication Date: 2026-06-30

Abstract

The application discloses a web crawler system and method including advertisement filtering. In the system: a scheduler distributes crawling tasks to multiple crawlers according to the targets to be crawled; each crawler executes its crawling task and sends the crawling result to a content parser; the content parser determines, among the crawling results, a first crawling result that does not need to be crawled again and a second crawling result that needs to be crawled again, parses the first crawling result to obtain first crawled content, and sends the second crawling result to a static rule filtering engine; the static rule filtering engine filters the second crawling result to obtain a third crawling result and sends it to a machine learning filtering engine; the machine learning filtering engine filters the third crawling result to obtain a second target to be crawled and feeds that target back to the scheduler; and a result processor outputs the first crawled content. The application solves the technical problem that existing web crawler engines crawl a large amount of advertisement content, which puts heavy resource pressure on both the crawling party and the content provider.

Description

Technical Field

[0001] This application relates to the field of web crawler technology, and more specifically, to a web crawler system and method that includes ad filtering.

Background Technology

[0002] With the rapid development of the internet, internet advertising has become one of the main sources of revenue for online content providers, meaning that online content is currently flooded with advertisements. However, for web crawlers, advertising content is content that should not be crawled. Crawling advertising content is neither valuable to the crawler owner nor can it generate real marketing results. It also incurs additional expenses for advertisers and additional operating costs for online providers.

[0003] Currently, to address the aforementioned issues, web crawlers typically use additional storage space to store advertising content and perform data cleaning and other tasks after crawling to identify and filter ads. However, this approach has certain drawbacks: storing advertising content requires a large amount of storage space, which in turn affects the accuracy of ad identification and filtering. At the same time, data cleaning and other operations consume a lot of computing resources and time, resulting in a waste of storage and computing resources.

[0004] There is currently no effective solution to the above problems.

Summary of the Invention

[0005] This application provides a web crawler system and method that includes ad filtering, in order to at least solve the technical problem that existing web crawler engines will put a lot of resource pressure on both the crawler and the content provider when crawling a large amount of advertising content.

[0006] According to one aspect of the embodiments of this application, a web crawler system including ad filtering is provided, comprising: a scheduler, multiple crawlers, a content parser, a result processor, a static rule filtering engine, and a machine learning filtering engine, wherein the scheduler is used to distribute crawling tasks to the multiple crawlers according to the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine; each crawler is used to execute the distributed crawling task and send the crawling result to the content parser; the content parser is used to determine, among the crawling results sent by each crawler, the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again, parse the first crawling result to obtain the first crawled content, and send the second crawling result to the static rule filtering engine; the static rule filtering engine is used to filter the second crawling result according to the preset advertising filtering rules to obtain the third crawling result, and send the third crawling result to the machine learning filtering engine; the machine learning filtering engine is used to filter the third crawling result according to the pre-trained advertising filtering model to obtain the second target to be crawled, and feed the second target to be crawled back to the scheduler; and the result processor is used to output the first crawled content.

[0007] Optionally, the static rule filtering engine includes: a filtering rule management module and a first advertising filtering module. The filtering rule management module is used to periodically obtain open-source advertising filtering rules from the target community and store the advertising filtering rules. The first advertising filtering module is used to filter the second crawling result according to the advertising filtering rules stored in the filtering rule management module to obtain the third crawling result and send the third crawling result to the machine learning filtering engine.

[0008] Optionally, the machine learning filtering engine includes: a data management module, a machine learning module, and a second ad filtering module. The data management module stores a set of labeled training samples, which includes initial training samples and new training samples added gradually to the set. The types of training samples include: ad content and non-ad content. The machine learning module trains an ad filtering model based on the initial training samples and periodically retrains the ad filtering model based on all training samples in the set, updating the model parameters. The second ad filtering module identifies the third crawling result based on the ad filtering model. If the third crawling result is ad content, it filters the third crawling result; if the third crawling result is non-ad content, it feeds the third crawling result back to the scheduler as a second target to be crawled. The third crawling result is added to the training sample set as a new training sample, and the identification result is used as annotation information.

[0009] Optionally, the scheduler includes a crawling target management module, which is used to manage each target to be crawled according to the remote dictionary service and determine the crawling task corresponding to the target to be crawled, wherein the crawling task includes at least a Uniform Resource Locator.

[0010] Optionally, the scheduler also includes: a crawler management module, a task distribution module, and a result awareness module. The crawler management module is used to periodically detect the activity of multiple crawlers and prevent the distribution of crawling tasks to abnormal crawlers when abnormal crawlers are detected. The task distribution module is used to distribute crawling tasks to multiple crawlers according to the token bucket algorithm. The result awareness module is used to obtain the crawling results of each crawler according to the message queue method.

[0011] Optionally, each crawler is used to simulate browser behavior using the Selenium automated testing tool, execute the distributed crawling tasks, and send the crawling results to the content parser.

[0012] Optionally, the content parser includes: a first parsing module and a second parsing module, wherein the first parsing module is used to classify the crawling results sent by each crawler according to the document object model tree parser to obtain a first crawling result and a second crawling result, and send the second crawling result to the static rule filtering engine; the second parsing module is used to parse the first crawling result according to the Extensible Markup Language path syntax to obtain the first crawled content.

[0013] According to another aspect of the embodiments of this application, a web crawler method including ad filtering is also provided, comprising: distributing crawling tasks to multiple crawlers according to a target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by a machine learning filtering engine; calling multiple crawlers to execute crawling tasks and obtaining crawling results; calling a content parser to determine the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again in the crawling results of each crawler, and parsing the first crawling content corresponding to the first crawling result; calling a static rule filtering engine to filter the second crawling result according to a preset ad filtering rule to obtain a third crawling result; calling a machine learning filtering engine to filter the third crawling result according to a pre-trained ad filtering model to obtain a second target to be crawled; and calling a result processor to output the first crawling content.

[0014] According to another aspect of the embodiments of this application, a non-volatile storage medium is also provided, the non-volatile storage medium including a stored computer program, wherein the device where the non-volatile storage medium is located executes the above-described web crawler method including ad filtering by running the computer program.

[0015] According to another aspect of the embodiments of this application, an electronic device is also provided, the electronic device including: a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the above-described web crawler method including ad filtering through the computer program.

[0016] In this embodiment, the scheduler distributes crawling tasks to multiple crawlers based on the target to be crawled. The target to be crawled includes a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine. Each crawler executes the distributed crawling task and sends the crawling result to the content parser. The content parser determines the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again from the crawling results sent by each crawler, parses the first crawling result to obtain the first crawled content, and sends the second crawling result to the static rule filtering engine. The static rule filtering engine filters the second crawling result according to the preset advertising filtering rules to obtain a third crawling result and sends the third crawling result to the machine learning filtering engine. The machine learning filtering engine filters the third crawling result according to the pre-trained advertising filtering model to obtain a second target to be crawled and feeds the second target to be crawled back to the scheduler. The result processor outputs the first crawled content. Specifically, by using the ad filtering rules in the static rule filtering engine and the ad filtering model in the machine learning filtering engine, dual filtering of ad content is achieved. This allows the web crawler to bypass ad content and crawl only non-ad content, effectively solving the technical problem that existing web crawlers place significant resource pressure on both the crawling party and the content provider when crawling large amounts of ad content.

Attached Figure Description

[0017] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:

[0018] Figure 1 is a schematic diagram of an optional web crawler system including ad filtering according to an embodiment of this application;

[0019] Figure 2 is a schematic diagram of another optional web crawler system including ad filtering according to an embodiment of this application;

[0020] Figure 3 is a schematic diagram illustrating the operation of an optional web crawler system including ad filtering according to an embodiment of this application;

[0021] Figure 4 is a schematic diagram of the structure of an optional computer terminal according to an embodiment of this application;

[0022] Figure 5 is a flowchart illustrating an optional web crawler method including ad filtering according to an embodiment of this application.

Detailed Implementation

[0023] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.

[0024] It should be noted that the terms "first," "second," etc., used in the specification, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0025] To better understand the embodiments of this application, the following is a translation and explanation of some nouns or terms that appear in the description of the embodiments of this application:

[0026] Redis (Remote Dictionary Server) is an open-source, network-enabled, in-memory or persistent, log-structured, key-value database written in ANSI C.

[0027] Uniform Resource Locator (URL): a method of representing the location of information on the World Wide Web service of the Internet.

[0028] Document Object Model (DOM): a platform- and language-independent standard programming interface for processing Extensible Markup Language documents.

[0029] Extensible Markup Language Path (XPath): a language used to locate a part of an Extensible Markup Language formatted document.

[0030] The token bucket algorithm is one of the most commonly used algorithms for network traffic shaping and rate limiting. It is used to control the amount of data sent to the network and allows bursts of data to be sent.

[0031] Example 1

[0032] Traditional web crawler engines use additional storage space to store advertising content and perform data cleaning and filtering after crawling. However, this approach wastes storage and computing resources and affects the accuracy of advertising identification and filtering. To address these issues, this application first provides a web crawler system that includes advertising filtering, as shown in Figure 1. The system includes: a scheduler 11, multiple crawlers 12 (1 to n), a content parser 13, a result processor 14, a static rule filtering engine 15, and a machine learning filtering engine 16, wherein:

[0033] Scheduler 11 can distribute crawling tasks to multiple crawlers according to the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine;

[0034] Each crawler 12 (1 to n) can execute the assigned crawling tasks and send the crawling results to the content parser;

[0035] The content parser 13 can determine the first crawl result that does not need to be crawled again and the second crawl result that needs to be crawled again in the crawl results sent by each crawler, parse the first crawl content corresponding to the first crawl result, and send the second crawl result to the static rule filtering engine.

[0036] The static rule filtering engine 15 can filter the second crawling result according to the preset advertising filtering rules to obtain the third crawling result, and send the third crawling result to the machine learning filtering engine.

[0037] The machine learning filtering engine 16 can filter the third crawling result based on the pre-trained advertising filtering model to obtain the second target to be crawled, and feed the second target to be crawled back to the scheduler.

[0038] The result processor 14 can output the first crawled content.

[0039] Figure 2 shows a schematic diagram of a more specific web crawler system including ad filtering. The functions of each module in the system are explained in detail below with reference to Figure 2.

[0040] Optionally, the scheduler 11 includes: a crawling target management module 111, a crawler management module 112, a task distribution module 113, and a result perception module 114.

[0041] The crawling target management module can manage each crawling target based on the remote dictionary service and determine the crawling task corresponding to the crawling target. The crawling task includes at least a Uniform Resource Locator (URL).

[0042] Specifically, the target to be crawled can be set by relevant personnel, and the crawling task can be the target URL corresponding to the target webpage. Adding the target URL to the seed pool starts the crawling process. Accordingly, the seed pool usually includes information such as homepages, related topic pages, and popular links of various websites. The crawling seed pool is a series of seed links collected from the Internet by a web crawler program. These seed links can serve as the starting point for the web crawler to continue crawling target webpages. Through continuous updating and filtering, the web crawler can obtain new seed links from the seed pool, thereby expanding the scope and depth of crawling and maintaining the continuity and breadth of crawling activities. Managing crawling tasks through a remote dictionary service can realize a distributed scheduling engine.
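The seed pool described above can be sketched as follows. This is a minimal in-memory stand-in, not the remote dictionary service itself: the class name SeedPool, its methods, and the choice to de-duplicate URLs after stripping the fragment are all assumptions made for illustration. In a Redis-backed version the pending queue and the seen set would presumably live in shared Redis structures so that several scheduler instances can use them.

```python
from collections import deque
from urllib.parse import urldefrag


class SeedPool:
    """In-memory stand-in for a seed pool kept in a remote dictionary service.

    Hypothetical helper: the names and the de-duplication policy are
    assumptions, not taken from the application.
    """

    def __init__(self):
        self._pending = deque()  # URLs waiting to be crawled, oldest first
        self._seen = set()       # every URL ever accepted, for de-duplication

    def add(self, url):
        """Add a target URL; return False if it was already known."""
        url, _fragment = urldefrag(url)  # "#section" suffixes address the same page
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def next_task(self):
        """Return the next crawling task, or None when the pool is exhausted."""
        if not self._pending:
            return None
        # A crawling task carries at least a URL, as stated in [0041].
        return {"url": self._pending.popleft()}

    def __len__(self):
        return len(self._pending)
```

The scheduler would call add for targets set by staff and for targets fed back by the machine learning filtering engine, and next_task whenever a crawler is free.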

[0043] The crawler management module can periodically detect the activity of multiple crawlers and prevent the distribution of crawling tasks to abnormal crawlers when abnormal crawlers are detected; the task distribution module can distribute crawling tasks to multiple crawlers according to the token bucket algorithm; and the result perception module can obtain the crawling results of each crawler according to the message queue.

[0044] Among them, periodic probing can periodically detect the availability of network devices and services. By sending probe data packets or requests, the monitoring system can sense the status of network devices and services and promptly detect faults or problems, helping network administrators to find and resolve network faults in a timely manner, and ensuring the stability and reliability of the network.
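One way to picture the periodic probing is the sketch below. It assumes each crawler exposes some probe that either returns a truthy value or fails; the class CrawlerMonitor, the max_failures threshold, and the idea of tolerating one missed probe are hypothetical choices, since the application does not specify how a crawler is judged abnormal.

```python
class CrawlerMonitor:
    """Tracks which crawlers may still receive tasks (illustrative only)."""

    def __init__(self, max_failures=2):
        self._probes = {}    # crawler id -> zero-argument callable, truthy when alive
        self._failures = {}  # crawler id -> consecutive failed probes
        self._max_failures = max_failures

    def register(self, crawler_id, probe):
        self._probes[crawler_id] = probe
        self._failures[crawler_id] = 0

    def sweep(self):
        """Probe every crawler once; intended to be called on a timer."""
        for crawler_id, probe in self._probes.items():
            try:
                alive = bool(probe())
            except Exception:
                alive = False  # an unreachable crawler counts as a failed probe
            if alive:
                self._failures[crawler_id] = 0
            else:
                self._failures[crawler_id] += 1

    def available(self):
        """Crawlers that have not yet reached the failure threshold."""
        return sorted(
            cid for cid, count in self._failures.items()
            if count < self._max_failures
        )
```

The task distribution module would then hand out tasks only to the identifiers returned by available.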

[0045] Correspondingly, the token bucket algorithm in the task distribution module maintains a token bucket and continuously adds tokens to it. Each time a crawler requests a task, it needs to take a token from the token bucket for processing. If there are not enough tokens in the token bucket, the crawler's request will be rejected or placed in a queue to wait. By controlling the token generation rate and the capacity of the token bucket, the frequency of requests is limited to achieve the purpose of flow control, effectively prevent the system from being overwhelmed by too many requests, and further protect the stability and reliability of the system.
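The token bucket behaviour described above can be sketched in a few lines. The class and parameter names are illustrative, and this version refills lazily on each request instead of running a background timer, which is a common simplification and not necessarily what the application's task distribution module does. The clock parameter exists only so that the behaviour can be checked without waiting.

```python
import time


class TokenBucket:
    """Rate limiter: `rate` tokens are added per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity   # start full so an initial burst is allowed
        self._updated = clock()

    def _refill(self):
        now = self._clock()
        elapsed = now - self._updated
        # Add the tokens earned since the last call, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._updated = now

    def try_acquire(self, n=1.0):
        """Take n tokens if available; return False so the caller can wait or queue."""
        self._refill()
        if self._tokens >= n:
            self._tokens -= n
            return True
        return False
```

With rate=2 and capacity=3, a crawler can request three tasks at once, after which requests succeed at no more than two per second on average.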

[0046] Furthermore, the result awareness module can obtain the crawling results of each crawler through a message queue-based reactive decoupling component. Each crawler (message producer) sends the crawling results (messages) to the message queue, and the result awareness module (message consumer) receives the crawling results from the message queue. Through the mediation of the message queue, loosely coupled communication between components can be achieved, improving the scalability, maintainability and reliability of the system.
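The producer and consumer roles can be illustrated with Python's standard queue module standing in for a real message broker. The function names run_crawlers and drain, the fetch callable, and the dictionary shape of a result are assumptions for the sketch; a deployed system would more likely use a networked message queue between separate processes.

```python
import queue
import threading


def run_crawlers(urls, fetch, results, workers=2):
    """Producers: worker threads fetch URLs and publish results to the queue."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # no tasks left for this crawler
            # The crawler never calls the scheduler directly; it only publishes.
            results.put({"url": url, "body": fetch(url)})

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


def drain(results):
    """Consumer: the result awareness module collects whatever was published."""
    collected = []
    while True:
        try:
            collected.append(results.get_nowait())
        except queue.Empty:
            return collected
```

Because both sides only touch the queue, crawlers can be added or removed without changing the consumer, which is the loose coupling the paragraph above refers to.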

[0047] Optionally, each crawler can simulate browser behavior using the Selenium automated testing tool, execute the distributed crawling tasks, and send the crawling results to the content parser.

[0048] Specifically, developers can write test scripts in different programming languages, such as Java, Python, and C#, and use these scripts to simulate user actions in the browser, such as clicking, filling out forms, and submitting, in order to verify functionality and stability.

[0049] Optionally, the content parser 13 includes a first parsing module 131 and a second parsing module 132.

[0050] The first parsing module can classify the crawling results sent by each crawler according to the document object model tree parser to obtain the first crawling result and the second crawling result, and send the second crawling result to the static rule filtering engine; the second parsing module can parse the first crawling result according to the extensible markup language path syntax to obtain the first crawled content.

[0051] Among them, a BeautifulSoup-based document object model tree parser can be used to classify the crawling results. BeautifulSoup can automatically handle irregular tags, convert entity references, and perform other operations, making it convenient and quick to parse and extract data from HTML (Hyper Text Markup Language) and XML (eXtensible Markup Language) files.
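The two parsing modules can be sketched with the standard library alone. This example swaps in xml.etree.ElementTree for BeautifulSoup and uses the limited XPath subset that ElementTree supports, so it only handles well-formed markup. The split it makes is an interpretation and not something the application spells out: paragraph text is treated as content that needs no further crawling, while link and iframe targets are treated as results that need to be crawled again. The function name parse_page is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin


def parse_page(base_url, xhtml):
    """Return (first crawled content, second crawling results) for one page."""
    root = ET.fromstring(xhtml)  # requires well-formed markup, unlike BeautifulSoup

    # Assumed "first crawling result": paragraph text, extracted via a path query.
    content = ["".join(p.itertext()).strip() for p in root.findall(".//p")]

    # Assumed "second crawling result": resources the page points to, which may
    # be ads and therefore go through the filtering engines before any re-crawl.
    follow_ups = [urljoin(base_url, a.get("href")) for a in root.findall(".//a[@href]")]
    follow_ups += [urljoin(base_url, f.get("src")) for f in root.findall(".//iframe[@src]")]
    return content, follow_ups
```

Relative links are resolved against the page URL so that every follow-up is an absolute URL the filters can inspect.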

[0052] Optionally, the static rule filtering engine 15 includes: a filtering rule management module 151 and a first advertising filtering module 152.

[0053] The filtering rule management module can periodically obtain open-source advertising filtering rules from the target community and store them. The first advertising filtering module can filter the second crawling result according to the advertising filtering rules stored in the filtering rule management module to obtain the third crawling result, and then send the third crawling result to the machine learning filtering engine.

[0054] It should be noted that ad filtering rules can be contributed by all natural persons accessing the Internet on a voluntary basis.
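A static rule filter of this kind might look like the sketch below. It understands only a small, simplified subset of the syntax used by community filter lists: lines starting with "!" are comments, "||domain^" blocks a domain and its subdomains, and anything else is treated as a plain substring of the URL. Real lists contain many more rule types, and the class name StaticRuleFilter and the example rules are made up for illustration.

```python
from urllib.parse import urlsplit


class StaticRuleFilter:
    """Applies a simplified subset of community ad-filter rules to URLs."""

    def __init__(self, rules):
        self._domains = set()
        self._substrings = []
        for rule in rules:
            rule = rule.strip()
            if not rule or rule.startswith("!"):  # blank line or comment
                continue
            if rule.startswith("||") and rule.endswith("^"):
                self._domains.add(rule[2:-1].lower())  # domain-anchored rule
            else:
                self._substrings.append(rule)          # plain substring rule

    def is_ad(self, url):
        host = (urlsplit(url).hostname or "").lower()
        parts = host.split(".")
        # Check the host and each parent domain, so cdn.ads.example matches ads.example.
        if any(".".join(parts[i:]) in self._domains for i in range(len(parts))):
            return True
        return any(fragment in url for fragment in self._substrings)

    def filter(self, urls):
        """Keep only URLs that no rule matches (the third crawling result)."""
        return [url for url in urls if not self.is_ad(url)]
```

The filtering rule management module would rebuild this object whenever it fetches a fresh copy of the rule list.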

[0055] Optionally, the machine learning filtering engine 16 includes: a data management module 161, a machine learning module 162, and a second advertising filtering module 163.

[0056] The data management module can store a set of labeled training samples, which includes initial training samples and new training samples added to the set gradually. The types of training samples include advertising content and non-advertising content. The machine learning module can train an ad filtering model based on the initial training samples and periodically retrain the ad filtering model based on all training samples in the training sample set, updating the model parameters of the ad filtering model. The second ad filtering module can identify the third crawling result based on the ad filtering model. If the third crawling result is advertising content, it filters the third crawling result. If the third crawling result is non-advertising content, it feeds the third crawling result back to the scheduler as the second target to be crawled. The third crawling result is added to the training sample set as a new training sample, and the identification result is used as annotation information.

[0057] Understandably, the ad filtering model in the machine learning module of the machine learning filtering engine can be implemented based on a text convolutional neural network. It can periodically use the current training dataset from the data management module and newly collected labeled training samples to continuously train and optimize the ad filtering model, generating corresponding model files. This allows the ad filtering model to better adapt to constantly changing ad formats and user behaviors, improving filtering accuracy and effectiveness. Subsequently, the second ad filtering module can obtain the latest model file from the machine learning module in real time and judge the third crawling result to determine whether the current third crawling result should be filtered. If so, the result is discarded; otherwise, it is retransmitted to the scheduler.
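The routing performed by the second ad filtering module can be sketched independently of the model behind it. In the example below, KeywordAdModel is a deliberately trivial stand-in and is not a text convolutional neural network; it exists only so that the routing logic can run. The function name route, the (url, text) input shape, and the label strings are likewise assumptions.

```python
class KeywordAdModel:
    """Trivial stand-in for the trained ad filtering model (not a text CNN)."""

    def __init__(self, ad_words=("sale", "discount", "sponsored")):
        self.ad_words = set(ad_words)

    def predict(self, text):
        """Return True when the text looks like advertising."""
        words = (word.strip(".,!?") for word in text.lower().split())
        return any(word in self.ad_words for word in words)


def route(candidates, model, training_set, feed_back):
    """Classify each (url, text) pair, record it as a sample, and route it."""
    for url, text in candidates:
        is_ad = model.predict(text)
        # The identification result doubles as the annotation, as in [0056].
        training_set.append((text, "ad" if is_ad else "non-ad"))
        if not is_ad:
            feed_back(url)  # becomes a second target to be crawled
        # Ad content is simply dropped and never reaches the scheduler.
```

Any object with a compatible predict method could replace the stand-in, including a periodically retrained model loaded from the latest model file.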

[0058] Figure 3 shows a flowchart illustrating the operation of an optional web crawler system including ad filtering, which specifically includes the following steps:

[0059] S1: the staff sets the target to be crawled and adds the target to the seed pool;

[0060] S2: until the seed pool is exhausted, the scheduler continues to distribute crawling tasks to multiple crawlers;

[0061] S3: after each crawler completes its crawling task, it sends the crawling results to the content parser;

[0062] S4: the content parser determines whether each crawling result needs to be crawled again. For a second crawling result, which needs to be crawled again, step S5 is executed; for a first crawling result, which does not need to be crawled again, step S6 is executed;

[0063] S5: determine whether the second crawling result matches the advertising filtering rules in the static rule filtering engine. If it does, proceed to step S7; otherwise, proceed to step S8 and then to step S9;

[0064] S6: use the result processor to output the first crawled content corresponding to the first crawling result;

[0065] S7: discard the crawling result;

[0066] S8: output the third crawling result;

[0067] S9: determine whether the ad filtering model of the machine learning filtering engine identifies the third crawling result as advertising content. If it does, proceed to step S7; otherwise, feed the result back to the scheduler as a second target to be crawled and return to step S2.
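Steps S1 to S9 can be read as a single loop. The sketch below wires them together with injected callables so that it stays self-contained; run_flow, its parameters, and the max_pages safety bound are assumptions added for illustration, and the single-threaded loop ignores the distribution across multiple crawlers.

```python
from collections import deque


def run_flow(seeds, fetch, parse, rule_matches, model_says_ad, max_pages=100):
    """Single-threaded walk through steps S1 to S9; returns the output content."""
    pool, seen = deque(seeds), set(seeds)    # S1: targets enter the seed pool
    output = []
    pages = 0
    while pool and pages < max_pages:        # S2: continue until the pool is exhausted
        url = pool.popleft()
        pages += 1
        page = fetch(url)                    # S3: crawler returns its result
        content, follow_ups = parse(page)    # S4: split first and second results
        output.extend(content)               # S6: output the first crawled content
        for target in follow_ups:
            if rule_matches(target):         # S5 matched: S7, discard
                continue
            # S8: the target survived the static rules (third crawling result).
            if model_says_ad(target):        # S9 identified an ad: S7, discard
                continue
            if target not in seen:           # S9 otherwise: feed back, return to S2
                seen.add(target)
                pool.append(target)
    return output
```

The seen set is what keeps the feedback loop from revisiting pages that link to each other.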

[0068] In this embodiment, the scheduler distributes crawling tasks to multiple crawlers based on the target to be crawled. The target to be crawled includes a first target set by the target object and a second target to be crawled fed back by the machine learning filtering engine. Each crawler executes the distributed crawling task and sends the crawling result to the content parser. The content parser determines the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again from the crawling results sent by each crawler, parses the first crawling content corresponding to the first crawling result, and sends the second crawling result to the static rule filtering engine. The static rule filtering engine filters the second crawling result according to the preset advertising filtering rules to obtain a third crawling result and sends the third crawling result to the machine learning filtering engine. The machine learning filtering engine filters the third crawling result according to the pre-trained advertising filtering model to obtain the second target to be crawled and feeds back the second target to be crawled to the scheduler. The result processor outputs the first crawling content. Specifically, by using the ad filtering rules in the static filtering rule engine and the ad filtering model in the machine learning filtering engine, dual filtering of ad content is achieved. This allows the web crawler to bypass ad content and only crawl non-ad content, effectively solving the technical problem that existing web crawlers would place significant resource pressure on both the crawler and the content provider when crawling large amounts of ad content.

[0069] Example 2

[0070] According to an embodiment of this application, a web crawler method including ad filtering is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0071] The method embodiments provided in this application can be executed on mobile terminals, computer terminals, or similar computing devices. Figure 4 shows a hardware block diagram of a computer terminal (or mobile device) for implementing a web crawler method that includes ad filtering. As shown in Figure 4, the computer terminal 40 (or mobile device 40) may include one or more processors 402 (shown as 402a, 402b, ..., 402n in the figure; a processor 402 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 404 for storing data, and a transmission device 406 for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the bus), a network interface, a power supply, and/or a camera. Those skilled in the art will understand that the structure shown in Figure 4 is for illustrative purposes only and does not limit the structure of the aforementioned electronic device. For example, the computer terminal 40 may also include more or fewer components than those shown in Figure 4, or have a configuration different from that shown in Figure 4.

[0072] It should be noted that the aforementioned one or more processors 402 and / or other data processing circuits are generally referred to herein as "data processing circuits". These data processing circuits may be embodied, in whole or in part, in software, hardware, firmware, or any other combination thereof. Furthermore, the data processing circuits may be a single, independent processing module, or may be integrated, in whole or in part, into any other element within the computer terminal 40 (or mobile device). As involved in the embodiments of this application, the data processing circuits serve as a processor control mechanism (e.g., selection of a variable resistor termination path connected to an interface).

[0073] The memory 404 can be used to store software programs and modules of application software, such as the program instructions / data storage device corresponding to the web crawler method including ad filtering in the embodiments of this application. The processor 402 executes various functional applications and data processing by running the software programs and modules stored in the memory 404, thereby implementing the above-mentioned web crawler method including ad filtering. The memory 404 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 404 may further include memory remotely located relative to the processor 402, and these remote memories can be connected to the computer terminal 40 via a network. Examples of the above-mentioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0074] The transmission device 406 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the communication provider of the computer terminal 40. In one example, the transmission device 406 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 406 may be a Radio Frequency (RF) module, used for wireless communication with the Internet.

[0075] The display can be, for example, a touchscreen liquid crystal display (LCD) that allows the user to interact with the user interface of the computer terminal 40 (or mobile device).

[0076] Under the aforementioned operating environment, this application provides a web crawler method that includes ad filtering. As shown in Figure 5, the method includes the following steps:

[0077] Step S502: Distribute crawling tasks to multiple crawlers according to the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine;

[0078] Step S504: Call multiple crawlers to execute crawling tasks and obtain crawling results;

[0079] Step S506: Call the content parser to determine, among the crawling results of each crawler, the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again, and parse the first crawling result to obtain the first crawled content;

[0080] Step S508: Call the static rule filtering engine to filter the second crawling result according to the preset advertising filtering rules to obtain the third crawling result;

[0081] Step S510: Call the machine learning filtering engine to filter the third crawling result based on the pre-trained advertising filtering model to obtain the second target to be crawled.

[0082] Step S512: Call the result processor to output the first crawled content.
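
Steps S502 to S512 can be sketched as a single-process loop. This is only an illustrative outline, not the implementation of the embodiments: the function `run_pipeline`, its parameters, and the `max_pages` limit are hypothetical names introduced here, and the sketch assumes that the "second crawling result" takes the form of follow-up targets (for example, URLs) discovered on a page, which the text does not state explicitly.

```python
from collections import deque
from typing import Callable, Iterable, List, Tuple


def run_pipeline(
    seed_targets: Iterable[str],
    crawl: Callable[[str], str],
    parse: Callable[[str], Tuple[str, List[str]]],
    passes_static_rules: Callable[[str], bool],
    is_ad: Callable[[str], bool],
    max_pages: int = 100,  # hypothetical safety limit, not part of the method
) -> List[str]:
    """Run steps S502-S512 sequentially and return the first crawled content."""
    queue = deque(seed_targets)        # S502: first targets set by the target object
    seen = set(queue)
    output: List[str] = []
    while queue and len(output) < max_pages:
        target = queue.popleft()       # S502: hand one task to a crawler
        page = crawl(target)           # S504: crawler executes the task
        content, follow_ups = parse(page)   # S506: split content from follow-ups
        output.append(content)         # collected here, emitted at S512
        # S508: static rules remove follow-ups that match known ad patterns.
        third_results = [t for t in follow_ups if passes_static_rules(t)]
        # S510: the model removes remaining ads; survivors become second targets.
        for candidate in third_results:
            if not is_ad(candidate) and candidate not in seen:
                seen.add(candidate)
                queue.append(candidate)     # fed back to the scheduler
    return output                      # S512: result processor output
```

In a deployment with multiple crawlers, the queue would be shared between processes; the loop above only shows the order in which the components hand results to one another.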

[0083] The following section explains each step of a web crawler method that includes ad filtering, using a specific implementation process as an example.

[0084] Optionally, each crawler can simulate browser behavior using the Selenium automated testing tool, execute the distributed crawling tasks, and send the crawling results to the content parser.

[0085] Specifically, developers can write test scripts in different programming languages, such as Java, Python, and C#, and use these scripts to simulate user actions in the browser, such as clicking, filling out forms, and submitting, in order to verify the browser's functionality and stability.
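
A crawler of this kind might look like the following minimal sketch, assuming Selenium 4 with a locally installed Chrome. The names `CrawlResult`, `make_headless_chrome`, and `execute_task` are hypothetical and introduced only for illustration; `webdriver.Chrome`, `ChromeOptions.add_argument`, `driver.get`, and `driver.page_source` are standard Selenium calls.

```python
from dataclasses import dataclass


@dataclass
class CrawlResult:
    """What a crawler sends to the content parser (hypothetical structure)."""
    url: str
    html: str


def make_headless_chrome():
    """Create a headless Chrome driver; needs Selenium 4 and a local Chrome."""
    # Imported lazily so the rest of the sketch can be read and tested
    # without Selenium being installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # run without a visible window
    return webdriver.Chrome(options=options)


def execute_task(driver, url: str) -> CrawlResult:
    """Load one URL in the browser and return the rendered page source."""
    driver.get(url)  # blocks until the page load completes (default strategy)
    return CrawlResult(url=url, html=driver.page_source)
```

A caller would typically create the driver once with `make_headless_chrome()`, call `execute_task` for each distributed task, and release the browser with `driver.quit()` in a `finally` block.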

[0086] Optionally, the content parser includes a first parsing module and a second parsing module. The first parsing module can classify the crawling results sent by each crawler according to the document object model tree parser to obtain a first crawling result and a second crawling result, and send the second crawling result to the static rule filtering engine; the second parsing module can parse the first crawling result according to the Extensible Markup Language path syntax to obtain the first crawled content.

[0087] A document object model (DOM) tree parser based on BeautifulSoup can be used to classify the crawling results. BeautifulSoup automatically handles irregular tags, converts entity references, and performs similar clean-up operations, which makes it convenient and quick to parse and extract data from HTML (HyperText Markup Language) and XML (eXtensible Markup Language) files.
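
The two parsing modules can be illustrated with the Python standard library. Note the substitutions: `html.parser` stands in for the BeautifulSoup tree parser, and the limited XPath subset of `xml.etree.ElementTree` stands in for full XPath, which also means the input must be well-formed markup. The sketch further assumes that a result "needs to be crawled again" when it is a URL referenced by the page; the text does not define that criterion. `ReferenceCollector`, `classify`, and `extract_text` are hypothetical names.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import List


class ReferenceCollector(HTMLParser):
    """First parsing module: collect URLs that a page references."""

    # Tag -> attribute that carries a URL (assumed selection, not from the text).
    URL_ATTRS = {"a": "href", "iframe": "src", "script": "src", "img": "src"}

    def __init__(self) -> None:
        super().__init__()
        self.references: List[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.URL_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value:
                self.references.append(value)


def classify(html: str) -> List[str]:
    """Return the referenced URLs, treated here as the second crawling result."""
    collector = ReferenceCollector()
    collector.feed(html)
    collector.close()
    return collector.references


def extract_text(xhtml: str, path: str) -> List[str]:
    """Second parsing module: select nodes by path and return their text."""
    root = ET.fromstring(xhtml)  # raises ParseError on malformed markup
    return ["".join(node.itertext()).strip() for node in root.findall(path)]
```

For example, `extract_text(page, ".//div[@class='article']/p")` returns the text of each paragraph inside the article container, while `classify(page)` lists the links, frames, scripts, and images that would be handed to the static rule filtering engine.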

[0088] Optionally, the static rule filtering engine includes a filtering rule management module and a first ad filtering module. The filtering rule management module can periodically obtain open-source ad filtering rules from the target community and store the ad filtering rules; the first ad filtering module can filter the second crawling result according to the ad filtering rules stored in the filtering rule management module to obtain the third crawling result, and send the third crawling result to the machine learning filtering engine.
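
As an illustration of the first ad filtering module, the sketch below applies a small, simplified subset of the Adblock Plus style filter syntax used by community lists: `||domain^` rules block a domain and its subdomains, and any other rule is treated as a plain substring of the URL. Exception rules (`@@`), wildcards, and rule options are deliberately not handled, and the periodic download of rules is not shown. The class name `StaticRuleFilter` and its methods are hypothetical.

```python
from typing import Iterable, List
from urllib.parse import urlsplit


class StaticRuleFilter:
    """Filter URLs against a simplified subset of Adblock-style rules."""

    def __init__(self, rule_lines: Iterable[str]) -> None:
        self.blocked_domains = set()
        self.blocked_substrings: List[str] = []
        for line in rule_lines:
            rule = line.strip()
            # Skip blanks, comments, list headers, and (unsupported) exceptions.
            if not rule or rule[0] in "![" or rule.startswith("@@"):
                continue
            if rule.startswith("||") and rule.endswith("^"):
                self.blocked_domains.add(rule[2:-1].lower())
            else:
                self.blocked_substrings.append(rule)

    def is_blocked(self, url: str) -> bool:
        host = (urlsplit(url).hostname or "").lower()
        labels = host.split(".")
        # Check the host itself and every parent domain against domain rules.
        for i in range(len(labels)):
            if ".".join(labels[i:]) in self.blocked_domains:
                return True
        return any(fragment in url for fragment in self.blocked_substrings)

    def filter(self, urls: Iterable[str]) -> List[str]:
        """Return the URLs that survive, i.e. the third crawling result."""
        return [url for url in urls if not self.is_blocked(url)]
```

Keeping domain rules in a set makes the host check independent of the number of rules, which matters because community lists can contain tens of thousands of entries.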

[0089] It should be noted that ad filtering rules can be contributed by all natural persons accessing the Internet on a voluntary basis.

[0090] Optionally, the machine learning filtering engine includes: a data management module, a machine learning module, and a second ad filtering module. The data management module can store a set of labeled training samples, which includes initial training samples and new training samples added gradually to the set. The types of training samples include: ad content and non-ad content. The machine learning module can train an ad filtering model based on the initial training samples and periodically retrain the ad filtering model based on all training samples in the set, updating the model parameters. The second ad filtering module can identify the third crawling result based on the ad filtering model. If the third crawling result is ad content, it filters the third crawling result; if the third crawling result is non-ad content, it feeds the third crawling result back to the scheduler as a second target to be crawled. The third crawling result is added to the training sample set as a new training sample, and the identification result is used as annotation information.

[0091] It can be understood that the ad filtering model in the machine learning module of the machine learning filtering engine can be implemented based on a text convolutional neural network. The machine learning module can periodically use the current training dataset from the data management module, together with newly collected labeled training samples, to continue training and optimizing the ad filtering model and to generate the corresponding model files. This allows the ad filtering model to better adapt to constantly changing ad formats and user behaviors, improving filtering accuracy and effectiveness. Subsequently, the second ad filtering module can obtain the latest model file from the machine learning module in real time and evaluate the third crawling result to determine whether it should be filtered. If so, the result is discarded; otherwise, it is sent back to the scheduler.
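
The interaction between the data management module, the machine learning module, and the second ad filtering module can be sketched as follows. To keep the example self-contained, a tiny naive Bayes classifier stands in for the text convolutional neural network, and retraining is triggered after a fixed number of new samples rather than on a timer; both are assumptions of this sketch. The names `train`, `predict_is_ad`, `MachineLearningFilter`, and `retrain_every` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Optional, Tuple

Sample = Tuple[str, bool]  # (text, is_ad)


def train(samples: List[Sample]) -> Dict:
    """Fit word counts per class (naive Bayes stand-in for the text CNN)."""
    words = {True: Counter(), False: Counter()}
    classes: Counter = Counter()
    for text, is_ad in samples:
        classes[is_ad] += 1
        words[is_ad].update(text.lower().split())
    vocab = set(words[True]) | set(words[False])
    # One extra slot is reserved for words never seen in training.
    return {"words": words, "classes": classes, "vocab_size": len(vocab) + 1}


def predict_is_ad(model: Dict, text: str) -> bool:
    """Compare log-probabilities of the two classes with add-one smoothing."""
    total = sum(model["classes"].values())
    scores = {}
    for label in (True, False):
        counts = model["words"][label]
        denom = sum(counts.values()) + model["vocab_size"]
        score = math.log((model["classes"][label] + 1) / (total + 2))
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]


class MachineLearningFilter:
    """Holds the sample set, the current model, and the filtering decision."""

    def __init__(self, initial_samples: List[Sample], retrain_every: int = 2):
        self.samples = list(initial_samples)      # data management module
        self.retrain_every = retrain_every        # assumed retraining trigger
        self.new_since_training = 0
        self.model = train(self.samples)          # machine learning module

    def process(self, text: str) -> Optional[str]:
        """Return the text if it is fed back to the scheduler, else None."""
        is_ad = predict_is_ad(self.model, text)
        self.samples.append((text, is_ad))  # identification result as label
        self.new_since_training += 1
        if self.new_since_training >= self.retrain_every:
            self.model = train(self.samples)      # retrain on all samples
            self.new_since_training = 0
        return None if is_ad else text
```

Calling `process` on each third crawling result returns `None` for content judged to be advertising and returns the content itself when it should become a second target to be crawled.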

[0092] In this embodiment, crawling tasks are distributed to multiple crawlers based on the target to be crawled. The target to be crawled includes a first target set by the target object and a second target fed back by the machine learning filtering engine. Multiple crawlers are invoked to execute the crawling tasks, obtaining crawling results. A content parser is invoked to determine the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again from the crawling results of each crawler, and the first crawled content corresponding to the first crawling result is parsed. A static rule filtering engine is invoked to filter the second crawling result according to preset advertising filtering rules, obtaining a third crawling result. A machine learning filtering engine is invoked to filter the third crawling result according to a pre-trained advertising filtering model, obtaining a second target to be crawled. A result processor is invoked to output the first crawled content. By using the advertising filtering rules in the static rule filtering engine and the advertising filtering model in the machine learning filtering engine, dual filtering of advertising content is achieved. This allows the crawler engine to bypass advertising content and crawl only non-advertising content, effectively solving the technical problem that existing web crawler engines place significant resource pressure on both the crawler and the content provider when crawling large amounts of advertising content.

[0093] Embodiment 3

[0094] According to an embodiment of this application, a non-volatile storage medium is also provided, which includes a stored computer program, wherein the device where the non-volatile storage medium is located executes the web crawler method including ad filtering in Embodiment 2 by running the computer program.

[0095] Specifically, the device containing the non-volatile storage medium executes the following steps by running the computer program: distributing crawling tasks to multiple crawlers based on the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine; calling multiple crawlers to execute crawling tasks and obtaining crawling results; calling a content parser to determine the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again in the crawling results of each crawler, and parsing the first crawled content corresponding to the first crawling result; calling a static rule filtering engine to filter the second crawling result according to preset advertising filtering rules to obtain a third crawling result; calling a machine learning filtering engine to filter the third crawling result according to a pre-trained advertising filtering model to obtain a second target to be crawled; and calling a result processor to output the first crawled content.

[0096] According to an embodiment of this application, a processor is also provided for running a computer program, wherein the computer program executes the web crawler method including ad filtering in Embodiment 2.

[0097] Specifically, the computer program executes the following steps during runtime: distributing crawling tasks to multiple crawlers based on the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine; calling multiple crawlers to execute crawling tasks and obtaining crawling results; calling a content parser to determine the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again from the crawling results of each crawler, and parsing the first crawled content corresponding to the first crawling result; calling a static rule filtering engine to filter the second crawling result according to preset advertising filtering rules to obtain a third crawling result; calling a machine learning filtering engine to filter the third crawling result according to a pre-trained advertising filtering model to obtain a second target to be crawled; and calling a result processor to output the first crawled content.

[0098] According to an embodiment of this application, an electronic device is also provided, comprising: a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the web crawler method including ad filtering in Embodiment 2 through the computer program.

[0099] Specifically, the computer program executes the following steps during runtime: distributing crawling tasks to multiple crawlers based on the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine; calling multiple crawlers to execute crawling tasks and obtaining crawling results; calling a content parser to determine the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again from the crawling results of each crawler, and parsing the first crawled content corresponding to the first crawling result; calling a static rule filtering engine to filter the second crawling result according to preset advertising filtering rules to obtain a third crawling result; calling a machine learning filtering engine to filter the third crawling result according to a pre-trained advertising filtering model to obtain a second target to be crawled; and calling a result processor to output the first crawled content. The sequence numbers of the above embodiments are merely for description and do not represent the superiority or inferiority of the embodiments.

[0100] In the above embodiments of this application, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0101] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some interfaces; indirect couplings or communication connections between units or modules may be electrical or other forms.

[0102] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0103] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0104] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard drive, magnetic disk, or optical disk.

[0105] The above are merely preferred embodiments of this application. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of this application, and these improvements and modifications should also be considered within the scope of protection of this application.

Claims

1. A web crawler system incorporating ad filtering, characterized in that it comprises: a scheduler, multiple crawlers, a content parser, a result processor, a static rule filtering engine, and a machine learning filtering engine, wherein: the scheduler is used to distribute crawling tasks to the multiple crawlers according to the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine; each of the crawlers is configured to execute the assigned crawling task and send the crawling results to the content parser; the content parser is used to determine the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again in the crawling results sent by each crawler, parse the first crawled content corresponding to the first crawling result, and send the second crawling result to the static rule filtering engine; the static rule filtering engine is used to filter the second crawling result according to the preset advertising filtering rules to obtain the third crawling result, and send the third crawling result to the machine learning filtering engine; the machine learning filtering engine is used to filter the third crawling result based on the pre-trained advertising filtering model to obtain the second target to be crawled, and to feed the second target to be crawled back to the scheduler; and the result processor is used to output the first crawled content.

2. The system according to claim 1, characterized in that the static rule filtering engine includes: a filtering rule management module and a first ad filtering module, wherein: the filtering rule management module is used to periodically obtain open-source advertising filtering rules from the target community and store the advertising filtering rules; and the first ad filtering module is used to filter the second crawling result according to the advertising filtering rules stored in the filtering rule management module to obtain the third crawling result, and send the third crawling result to the machine learning filtering engine.

3. The system according to claim 1, characterized in that the machine learning filtering engine includes: a data management module, a machine learning module, and a second ad filtering module, wherein: the data management module is used to store a set of labeled training samples, wherein the set of training samples includes: initial training samples and new training samples that are gradually added to the set of training samples, and the types of training samples include: advertising content and non-advertising content; the machine learning module is used to train the ad filtering model based on the initial training samples, and periodically retrain the ad filtering model based on all training samples in the training sample set to update the model parameters of the ad filtering model; and the second ad filtering module is used to identify the third crawling result according to the ad filtering model, wherein if the third crawling result is ad content, the third crawling result is filtered; if the third crawling result is non-ad content, the third crawling result is fed back to the scheduler as the second target to be crawled; and the third crawling result is added to the training sample set as a new training sample, with the identification result used as annotation information.

4. The system according to claim 1, characterized in that the scheduler includes a crawling target management module, wherein the crawling target management module is used to manage each of the targets to be crawled according to the remote dictionary service and determine the crawling task corresponding to the target to be crawled, wherein the crawling task includes at least a Uniform Resource Locator.

5. The system according to claim 4, characterized in that the scheduler also includes: a crawler management module, a task distribution module, and a result awareness module, wherein: the crawler management module is used to periodically detect the activity of multiple crawlers and, when an abnormal crawler is detected, prohibit the distribution of crawling tasks to the abnormal crawler; the task distribution module is used to distribute crawling tasks to multiple crawlers according to the token bucket algorithm; and the result awareness module is used to obtain the crawling results of each crawler according to the message queue method.

6. The system according to claim 1, characterized in that each of the crawlers is configured to simulate browser behavior using the Selenium automated testing tool, execute the distributed crawling tasks, and send the crawling results to the content parser.

7. The system according to claim 1, characterized in that the content parser includes: a first parsing module and a second parsing module, wherein: the first parsing module is used to classify the crawling results sent by each crawler according to the document object model tree parser, to obtain the first crawling result and the second crawling result, and to send the second crawling result to the static rule filtering engine; and the second parsing module is used to parse the first crawling result according to the Extensible Markup Language path syntax to obtain the first crawled content.

8. A web crawler method incorporating ad filtering, characterized in that it comprises: distributing crawling tasks to multiple crawlers according to the target to be crawled, wherein the target to be crawled includes: a first target to be crawled set by the target object and a second target to be crawled fed back by the machine learning filtering engine; invoking the multiple crawlers to perform the crawling tasks and obtain crawling results; invoking the content parser to determine the first crawling result that does not need to be crawled again and the second crawling result that needs to be crawled again from the crawling results of each crawler, and to parse the first crawled content corresponding to the first crawling result; invoking the static rule filtering engine to filter the second crawling result according to the preset advertising filtering rules to obtain a third crawling result; invoking the machine learning filtering engine to filter the third crawling result based on the pre-trained advertising filtering model to obtain the second target to be crawled; and invoking the result processor to output the first crawled content.

9. A non-volatile storage medium, characterized in that the non-volatile storage medium includes a stored computer program, wherein the device containing the non-volatile storage medium executes the web crawler method including ad filtering as described in claim 8 by running the computer program.

10. An electronic device, characterized in that it comprises: a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the web crawler method including ad filtering as described in claim 8 through the computer program.