Multi-agent cooperative data collection method and device, medium and computer equipment
In the multi-agent collaborative data acquisition method, an RPA recognition agent captures web page source code containing dynamically rendered content, and a coding agent generates the acquisition code. A data acquisition agent runs that code in an isolated sandbox environment to parse the data, and the result is checked by format verification. This solves the problem that static tools cannot collect dynamic web page data and achieves efficient and secure data acquisition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CSC FINANCIAL CO LTD
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
AI Technical Summary
Existing static HTML data collection tools cannot completely and accurately collect dynamically changing web page data, resulting in incomplete data collection and insufficient real-time performance.
A multi-agent collaborative data acquisition method is adopted. An RPA recognition agent simulates browser access to the web page and captures its source code, a coding agent generates data acquisition code, and a data acquisition agent runs that code in an isolated sandbox environment to perform structured parsing. A format verification agent then verifies the data, which is finally stored in the target database.
The method ensures the integrity and real-time nature of data collection, reduces human error, improves development efficiency, safeguards system security and stability, and achieves fast and accurate data collection.
Figure CN122240905A_ABST
Description
Technical Field
[0001] This invention relates to the field of data acquisition technology, and in particular to a multi-agent collaborative data acquisition method, apparatus, medium, and computer equipment.
Background Technology
[0002] With the deep penetration of the digital economy, web page data collection has become a core requirement for many industries. In the government sector, government service platforms need to regularly collect policy documents and service guides published by various provinces and cities to ensure information synchronization. In the scientific research sector, academic teams need to collect paper abstracts and citation data from academic journal websites for bibliometric analysis. In the financial sector, securities firms need to collect publicly available stock quotes and financial news and public opinion information from financial websites to support quantitative trading model calculations, business opportunity mining, and public opinion risk control. These requirements are characterized by "massive data volume and dynamic changes in web page structure," which places stringent demands on the real-time performance, accuracy, and automation of data collection.
[0003] Currently, web page data is typically collected using static HTML scraping tools. However, web page content is often rendered and updated dynamically, so static scraping tools cannot capture it accurately and completely.
Summary of the Invention
[0004] This invention provides a multi-agent collaborative data acquisition method, device, medium, and computer equipment, which mainly improves the integrity and accuracy of web page data acquisition.
[0005] According to a first aspect of the present invention, a multi-agent collaborative data acquisition method is provided, comprising:
In response to a data collection signal for a target public webpage, obtaining data collection requirement information for the target public webpage, simulating browser access to the target public webpage through an RPA recognition agent, and, during the access, capturing webpage source code containing dynamically rendered content;
Generating, by a coding agent, data collection code containing data collection rules based on the data collection requirement information and the webpage source code, wherein the data collection rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules;
Running, by a data acquisition agent, the data collection code in a preset isolated sandbox environment, performing structured parsing of the webpage source code by running the data collection code, and determining and collecting the required data in the target public webpage based on the structured parsing results.
[0006] Optionally, determining and collecting the required data in the target public webpage based on the structured parsing results includes:
Determining structured snapshot data based on the structured parsing results;
Performing, by a format verification agent, data verification on the structured snapshot data based on preset data verification rules, wherein the preset data verification rules include at least one of data type verification rules, data format verification rules, and data range verification rules;
If the data verification succeeds, using the structured snapshot data as the required data;
If the data verification fails, generating a data verification failure report and sending it to the coding agent, and controlling the coding agent to regenerate new data collection code containing new data collection rules based on the data collection requirement information, the webpage source code, and the data verification failure report;
Running, by the data acquisition agent, the new data collection code in the preset isolated sandbox environment, re-performing structured parsing of the webpage source code by running the new data collection code, and determining and collecting the required data in the target public webpage based on the new structured parsing results.
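The verification-and-regeneration loop described above can be illustrated with a minimal Python sketch. It is not the patented implementation: `FieldRule`, `verify_snapshot`, `collect_with_feedback`, and the `generate` and `execute` callbacks are hypothetical names introduced here, and the list of failure messages stands in for the data verification failure report.

```python
import re
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class FieldRule:
    """Hypothetical verification rule for one field of a structured snapshot."""
    name: str
    expected_type: type                  # data type verification rule
    pattern: Optional[str] = None        # data format verification rule (regex)
    min_value: Optional[float] = None    # data range verification rule (numeric fields only)
    max_value: Optional[float] = None


def verify_snapshot(snapshot: Dict[str, Any], rules: List[FieldRule]) -> List[str]:
    """Return failure messages; an empty list means the snapshot passed."""
    failures = []
    for rule in rules:
        if rule.name not in snapshot:
            failures.append(f"{rule.name}: missing")
            continue
        value = snapshot[rule.name]
        if not isinstance(value, rule.expected_type):
            failures.append(f"{rule.name}: expected {rule.expected_type.__name__}")
            continue
        if rule.pattern is not None and not re.fullmatch(rule.pattern, str(value)):
            failures.append(f"{rule.name}: format mismatch")
        if rule.min_value is not None and value < rule.min_value:
            failures.append(f"{rule.name}: below {rule.min_value}")
        if rule.max_value is not None and value > rule.max_value:
            failures.append(f"{rule.name}: above {rule.max_value}")
    return failures


def collect_with_feedback(
    generate: Callable[[Optional[List[str]]], str],   # stand-in for the coding agent
    execute: Callable[[str], Dict[str, Any]],         # stand-in for the data acquisition agent
    rules: List[FieldRule],
    max_attempts: int = 3,                            # assumed retry limit, not from the source
) -> Dict[str, Any]:
    """Regenerate collection code until the snapshot passes verification."""
    report: Optional[List[str]] = None
    for _ in range(max_attempts):
        code = generate(report)          # the failure report is fed back on retries
        snapshot = execute(code)
        report = verify_snapshot(snapshot, rules)
        if not report:
            return snapshot
    raise RuntimeError(f"verification still failing: {report}")
```

The retry limit is an assumption made for the sketch; the source text does not state how many regeneration rounds are allowed.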
[0007] Optionally, after determining and collecting the required data in the target public webpage based on the structured parsing results, the method further includes:
Determining the data collection logs of the target public webpage, wherein the data collection logs include task logs and URL logs;
Determining the data attribute information of the required data, wherein the data attribute information includes at least one of a timestamp field and a business identifier field;
Determining, by a data storage agent, the target storage location information of the required data based on the data attribute information, wherein the target storage location information includes a target database instance identifier and a target data table identifier;
Writing the required data, the data collection logs, the webpage source code, and the data collection code into the target data table of the target database based on the target database instance identifier and the target data table identifier.
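As an illustration of how data attribute information could be mapped to a storage location, the following Python sketch derives a database instance identifier and a data table identifier from a timestamp field and a business identifier field. The function name `route_storage` and the naming scheme (`db_<business>` and `<business>_<YYYYMM>`) are assumptions made for this example; the source does not specify a scheme.

```python
from datetime import datetime
from typing import Tuple


def route_storage(timestamp: str, business_id: str) -> Tuple[str, str]:
    """Map data attributes to (database instance identifier, data table identifier).

    Assumed scheme: one database instance per business line and
    one table per business line and calendar month.
    """
    ts = datetime.fromisoformat(timestamp)   # expects an ISO 8601 timestamp
    instance_id = f"db_{business_id}"
    table_id = f"{business_id}_{ts:%Y%m}"
    return instance_id, table_id
```

For example, a record with timestamp `2026-03-12T09:30:00` and business identifier `news` would be routed to table `news_202603` in instance `db_news` under this assumed scheme.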
[0008] Optionally, the coding agent generates the data collection code containing data collection rules based on the data collection requirement information and the webpage source code by performing the following steps:
Generating a code parsing prompt based on the webpage source code; identifying, based on the code parsing prompt, the location path, attribute features, and data type of each target field to be collected in the webpage source code; and generating a field-path-type mapping table from the location path, attribute features, and data type;
Generating a rule construction prompt based on the field-path-type mapping table, and generating the data collection rules based on the rule construction prompt;
Generating a code writing prompt based on the field-path-type mapping table and the data collection rules, and generating the data collection code based on the code writing prompt, the data collection requirement information, and the webpage source code.
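One possible shape for the field-path-type mapping table, and for turning it into a code writing prompt, is sketched below in Python. The `FieldMapping` structure, the function `build_code_writing_prompt`, and the prompt wording are all hypothetical; the source does not disclose the actual table layout or prompt text.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FieldMapping:
    """One row of a hypothetical field-path-type mapping table."""
    field: str        # target field to be collected
    path: str         # location path in the page source (e.g. an XPath)
    attribute: str    # attribute feature holding the value
    data_type: str    # expected data type


def build_code_writing_prompt(mappings: List[FieldMapping], rules: List[str]) -> str:
    """Assemble a plain-text prompt from the mapping table and the collection rules."""
    lines = ["Write data collection code for the following fields:"]
    for m in mappings:
        lines.append(
            f"- {m.field}: path={m.path}, attribute={m.attribute}, type={m.data_type}"
        )
    lines.append("Apply these data collection rules:")
    lines.extend(f"- {rule}" for rule in rules)
    return "\n".join(lines)
```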
[0009] Optionally, the method further includes:
In response to a data collection signal for a current webpage, obtaining current data collection requirement information, capturing the current webpage source code through the RPA recognition agent, determining the current structural features of the current webpage source code, and generating a current structural fingerprint of the current webpage source code based on the current structural features;
Matching current data collection code in a code cache library based on the current structural fingerprint and the current data collection requirement information, wherein the code cache library stores the data collection code corresponding to various webpage source codes and various data collection requirement information;
Running, by the data acquisition agent, the current data collection code in the preset isolated sandbox environment, performing structured parsing of the current webpage source code by running the current data collection code, and determining and collecting the required data in the current webpage based on the current structured parsing results.
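The structural fingerprint and code cache lookup can be sketched as follows. This Python example assumes that the "structural features" are the nesting depth, tag name, and class names of each element, and that text content is ignored; the source does not define the features, so this choice, together with the names `structural_fingerprint` and `CodeCache`, is an assumption.

```python
import hashlib
from html.parser import HTMLParser
from typing import Dict, Optional, Tuple

# Elements that never have a closing tag, so they must not change the depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _StructureCollector(HTMLParser):
    """Records depth, tag name, and class names of each element; ignores text."""

    def __init__(self) -> None:
        super().__init__()
        self.tokens = []
        self._depth = 0

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.tokens.append(f"{self._depth}:{tag}.{'.'.join(sorted(classes.split()))}")
        if tag not in VOID_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self._depth = max(0, self._depth - 1)


def structural_fingerprint(page_source: str) -> str:
    """Hash the element structure so pages with the same layout share a fingerprint."""
    collector = _StructureCollector()
    collector.feed(page_source)
    collector.close()
    return hashlib.sha256("|".join(collector.tokens).encode("utf-8")).hexdigest()


class CodeCache:
    """Hypothetical code cache keyed by (structural fingerprint, requirement key)."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], str] = {}

    def put(self, page_source: str, requirement_key: str, code: str) -> None:
        self._store[(structural_fingerprint(page_source), requirement_key)] = code

    def get(self, page_source: str, requirement_key: str) -> Optional[str]:
        return self._store.get((structural_fingerprint(page_source), requirement_key))
```

Because only structure is hashed, two pages that differ only in their text values share a fingerprint, so previously generated collection code can be reused without invoking the coding agent again.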
[0010] Optionally, the RPA recognition agent simulates browser access to the target public webpage and, during the access, captures the webpage source code containing dynamically rendered content by performing the following steps:
Parsing the URL address and page metadata of the target public webpage, determining the rendering engine type and asynchronous loading strategy parameters of the target public webpage based on the parsing results, and configuring and generating a browser running instance with dynamic monitoring capabilities based on the rendering engine type and the asynchronous loading strategy parameters;
Controlling the browser running instance to load and monitor the target public webpage, and determining, based on the monitoring results, whether dynamic rendering of the target public webpage is complete, which includes: determining whether the page network requests of the target public webpage are idle and whether the DOM tree structure has remained unchanged within a preset time window; if the page network requests are idle and the DOM tree structure has remained unchanged within the preset time window, determining that dynamic rendering of the target public webpage is complete;
Once dynamic rendering is complete, determining the webpage source code based on the DOM tree structure that has remained unchanged within the preset time window.
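The completion test described above (network requests idle and DOM tree unchanged for a preset time window) can be expressed as a small decision function over a stream of monitoring samples. The Python sketch below is only an illustration: the `Observation` structure, the use of a DOM hash to detect changes, and the function `render_complete_time` are assumptions, and a real agent would obtain these samples from a browser automation tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Observation:
    """One hypothetical monitoring sample taken from the browser instance."""
    time_s: float           # seconds since the page load started
    pending_requests: int   # number of in-flight network requests
    dom_hash: str           # hash of the serialized DOM tree at this moment


def render_complete_time(observations: Iterable[Observation],
                         window_s: float) -> Optional[float]:
    """Return the first sample time at which rendering counts as complete.

    Rendering is complete once the network has been idle and the DOM hash
    has stayed the same for at least `window_s` seconds.
    """
    quiet_start: Optional[float] = None
    last_hash: Optional[str] = None
    for obs in observations:
        dom_changed = obs.dom_hash != last_hash
        last_hash = obs.dom_hash
        if obs.pending_requests > 0:
            quiet_start = None              # network busy: no quiet window
        elif dom_changed or quiet_start is None:
            quiet_start = obs.time_s        # DOM changed or idle began: restart window
        if quiet_start is not None and obs.time_s - quiet_start >= window_s:
            return obs.time_s
    return None                             # never stabilized in the observed samples
```

A `None` result corresponds to the case in which a caller would fall back to its timeout handling.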
[0011] Optionally, after determining and collecting the required data in the target public webpage based on the structured parsing results, the method further includes:
Determining the region image of the required data in the target public webpage;
Determining the semantic similarity between the data semantics in the region image and the data semantics of the required data;
Determining the collection accuracy of the required data based on the semantic similarity;
If the collection accuracy does not meet the requirements, re-collecting data from the target public webpage.
[0012] According to a second aspect of the present invention, a multi-agent collaborative data acquisition device is provided, comprising:
A code capture unit, configured to, in response to a data collection signal for a target public webpage, obtain data collection requirement information for the target public webpage, simulate browser access to the target public webpage through an RPA recognition agent, and capture webpage source code containing dynamically rendered content during the access;
A code generation unit, configured to generate, through a coding agent, data collection code containing data collection rules based on the data collection requirement information and the webpage source code, wherein the data collection rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules;
A data acquisition unit, configured to run the data collection code in a preset isolated sandbox environment through a data acquisition agent, perform structured parsing of the webpage source code by running the data collection code, and determine and collect the required data in the target public webpage based on the structured parsing results.
[0013] According to a third aspect of the present invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the above-described multi-agent collaborative data acquisition method.
[0014] According to a fourth aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the above-described multi-agent collaborative data acquisition method.
[0015] Compared with the current approach of collecting web page data with static HTML acquisition tools, the multi-agent collaborative data acquisition method, apparatus, medium, and computer equipment provided by the present invention offer the following advantages. The RPA recognition agent captures web page source code containing dynamically rendered content, which ensures the integrity and real-time nature of data acquisition and keeps the acquired data up to date. The coding agent automatically generates data acquisition code based on the requirement information and the web page source code, so complex acquisition logic need not be written manually, which greatly improves development efficiency, reduces human error, and shortens the data acquisition cycle. The preset isolated sandbox environment provides an independent and safe space for running the data acquisition code, preventing malicious code or abnormal operations that may arise during data acquisition from affecting the main system and ensuring the security and stability of the system. Multi-agent collaboration avoids human intervention, thereby accelerating the data acquisition process and ensuring its stable and accurate execution. Each agent is an independent module with clear functions and interfaces; when a problem occurs in one agent, it can be debugged and repaired separately without affecting the normal operation of the other agents.
Attached Figure Description
[0016] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:
Figure 1 shows a flowchart of a multi-agent collaborative data acquisition method provided by an embodiment of the present invention;
Figure 2 shows a flowchart of another multi-agent collaborative data acquisition method provided by an embodiment of the present invention;
Figure 3 shows a schematic structural diagram of a multi-agent collaborative data acquisition device provided by an embodiment of the present invention;
Figure 4 shows a schematic structural diagram of another multi-agent collaborative data acquisition device provided by an embodiment of the present invention;
Figure 5 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of the present invention.
Detailed Implementation
[0017] The present invention will be described in detail below with reference to the accompanying drawings and embodiments. It should be noted that, unless otherwise specified, the embodiments and features described in the present application can be combined with each other.
[0018] Currently, methods of collecting webpage data based on static HTML scraping tools cannot adapt to the dynamic rendering of webpage information, thus failing to collect webpage data completely and effectively.
[0019] To address the aforementioned problems, embodiments of the present invention provide a multi-agent collaborative data acquisition method. As shown in Figure 1, the method includes:
101. In response to a data collection signal for a target public webpage, obtain the data collection requirement information for the target public webpage, simulate browser access to the target public webpage through an RPA recognition agent, and capture the webpage source code containing dynamically rendered content during the access.
[0020] The data collection requirement information includes, but is not limited to, the URL of the target public webpage, the data fields to be obtained (such as price, title, and comments), the expected data format, and specific collection constraints (such as maximum waiting time and concurrency limits).
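The requirement information listed above could be carried in a small validated record. The Python sketch below is one possible shape; the class name `CollectionRequirement`, the field names, and the default values are assumptions and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class CollectionRequirement:
    """Hypothetical container for data collection requirement information."""
    url: str                      # target public webpage URL
    fields: Tuple[str, ...]       # data fields to obtain, e.g. ("price", "title")
    output_format: str = "json"   # expected data format (assumed default)
    max_wait_s: float = 30.0      # maximum waiting time (assumed default)
    max_concurrency: int = 1      # concurrency limit (assumed default)

    def __post_init__(self) -> None:
        parsed = urlparse(self.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an http(s) URL: {self.url!r}")
        if not self.fields:
            raise ValueError("at least one data field is required")
        if self.max_wait_s <= 0 or self.max_concurrency < 1:
            raise ValueError("collection constraints must be positive")
```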
[0021] In this embodiment of the invention, to improve the code-grabbing capability of the RPA recognition agent, it is first necessary to train and construct the RPA recognition agent. Based on this, the method includes: constructing an initial RPA recognition agent; obtaining a first sample dataset, wherein the first sample dataset includes publicly available sample web pages with web page source code annotation information; dividing the sample dataset into a training set and a test set; training the initial RPA recognition agent using the training set; and testing the trained initial RPA recognition agent using the test set; finally, selecting the trained initial RPA recognition agent that meets the test conditions as the RPA recognition agent. The structure of the RPA recognition agent is the same as that of the initial RPA recognition agent. The RPA recognition agent includes a rendering recognition layer, an input encoding layer, and a code feedback layer. The rendering recognition layer identifies whether the web page has been rendered. Once the web page is rendered, the input encoding layer converts the rendered web page into a structured feature vector, and the code feedback layer performs code recognition on the structured feature vector to obtain the web page source code.
[0022] Specifically, in the agent training process, firstly, an initial RPA (Robotic Process Automation) recognition agent is constructed, and secondly, a first sample dataset is obtained. This dataset ensures it contains all necessary files, including multiple publicly available, non-privacy webpage samples covering different fields such as e-commerce, news, social media, and government affairs, including static HTML, dynamic SPAs, and complex interactive pages. The data is then converted to a format that the initial RPA recognition agent can understand. Finally, the agent is trained and tested. Specifically, the dataset can be divided first: using random or specific strategies (such as stratified sampling), the first sample dataset is divided into a training set and a test set. The agent is then trained using the training set, and tested using the test set to evaluate its performance on unseen webpages. Precision, recall, and other metrics on the test set are calculated and recorded. If the agent's performance does not meet the requirements, it can return to the training phase for further iterations or adjustments. This process yields a satisfactory RPA recognition agent.
[0023] Furthermore, the system listens for data collection trigger signals, including manual user commands, scheduled events, and external API calls. Upon receiving a signal, it immediately parses the corresponding data collection requirement information. Based on the parsed requirements, the system invokes the RPA recognition agent to control a simulated browser that accesses the target public webpage, triggering the webpage's lazy loading mechanism and asynchronous rendering. The RPA recognition agent monitors changes in the DOM tree of the target public webpage in real time until the target dynamic content is fully rendered or a preset timeout threshold is reached. Once it is confirmed that the target dynamic content has been rendered to the page, the RPA recognition agent extracts the current complete webpage source code directly through the browser kernel interface. This source code contains not only the initial HTML structure of the webpage but also the dynamic DOM nodes, inline styles, and data objects injected at run time. After the code is captured, the agent automatically closes the simulated browser instance, releases memory and network resources, and reports the collection results and execution status (success / failure / exception reason) to the task scheduling center, completing a single collection loop. In this embodiment of the invention, the RPA recognition agent captures webpage source code containing dynamically rendered content, which ensures the integrity and real-time nature of data collection and keeps the collected data up to date.
[0024] It should be noted that the target public webpages in this embodiment of the invention originate from publicly accessible websites that allow data scraping. These websites do not have access restrictions or authorization issues, are not privacy websites, and the webpages within them are not privacy webpages. Before the identification agent, encoding agent, and data acquisition agent acquire data from the webpages on this website, the identities of each agent have been legally verified. Each agent does not engage in activities such as cracking the front-end encryption algorithm of the target public webpage, forging device fingerprints, bypassing CAPTCHAs, or frequently acquiring data, and therefore does not interfere with the normal operation of the website. Furthermore, the webpage data collected by each agent does not involve protected works, does not contain personal privacy, and its usage does not constitute a replacement for the original website.
[0025] 102. The coding agent generates data collection code containing data collection rules based on data collection requirements and webpage source code. The data collection rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules.
[0026] In this embodiment of the invention, to ensure the coding capability of the coding agent, it is first necessary to train and construct the coding agent. During the training process, an initial coding agent is first constructed, and then a second sample dataset is acquired. The dataset ensures that it contains all necessary files, including the source code of multiple publicly available, non-privacy webpage samples covering different fields such as e-commerce, news, social media, and government affairs, as well as the data collection requirements of sample users for those webpage samples. It also includes data collection code annotation information that meets the user's data collection requirements. The data is converted to a format that the initial coding agent can understand, and finally, the coding agent is trained and tested. Specifically, the dataset can be divided first: using random or specific strategies (such as stratified sampling), the second sample dataset is divided into a training set and a test set. The agent is then trained using the training set, and tested using the test set to evaluate its performance on unseen data collection requirements and webpage source code. Precision, recall, and other metrics on the test set are calculated and recorded. If the coding agent's performance does not meet the requirements, it can return to the training phase for further iterations or adjustments. This process yields a coding agent that meets the requirements. The coded intelligent agent consists of an input layer, a hidden layer, and an encoding layer. The input layer takes the information on the collection requirements and the source code of the web page and inputs it into the hidden layer for feature extraction and fusion. The encoding layer encodes the output features of the hidden layer to obtain the data collection code.
[0027] Furthermore, the acquired data collection requirements information and the real-time crawled webpage source code are encapsulated to construct a structured prompt context. This context clarifies the "input source" (source code), the "target" (required fields), and the "constraints" (processing rules). First, based on the requirements information, the coding agent generates three types of core data collection rules: Field extraction rules: defining regular expressions or parsing logic for cleaning text from complex DOM nodes, removing HTML tags, and extracting specific substrings; Data format conversion rules: formulating data type mapping strategies, for example, converting the string "¥1,200.00" to the floating-point number 1200.00, or converting the relative time "3 hours ago" to a standard ISO timestamp; Abnormal data handling rules: setting fault tolerance mechanisms, including default value filling when fields are missing and retry logic after network timeout. Based on data collection rules, the coding agent parses the DOM structure of the source code, automatically identifies the HTML tags, XPath paths, or CSS selectors containing the target data, and generates independently executable data collection code. This generated code embeds all the aforementioned extraction, transformation, and exception handling logic, eliminating the need for manual hard coding. The collection code is self-explanatory, containing comments specific to the current webpage structure for easy maintenance. This invention, through a coding agent that automatically generates data collection code based on requirements and webpage source code, eliminates the need for manually writing complex collection logic, significantly improving development efficiency, reducing human error, and shortening the development cycle of data collection projects.
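The two format conversion examples mentioned above, together with default-value filling for missing or malformed fields, can be sketched in Python as follows. The helper names `parse_price`, `parse_relative_time`, and `extract_with_default` are hypothetical, and the relative-time parser covers only English phrases in minutes, hours, and days; generated collection code in practice would depend on the target page.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Any, Callable, Dict

_UNIT_TO_KWARG = {"minute": "minutes", "hour": "hours", "day": "days"}


def parse_price(text: str) -> float:
    """Data format conversion rule: '¥1,200.00' -> 1200.0."""
    cleaned = re.sub(r"[^\d.\-]", "", text)   # drop currency symbols and separators
    return float(cleaned)                      # raises ValueError if nothing numeric remains


def parse_relative_time(text: str, now: datetime) -> str:
    """Data format conversion rule: '3 hours ago' -> ISO 8601 timestamp."""
    match = re.fullmatch(r"\s*(\d+)\s+(minute|hour|day)s?\s+ago\s*", text)
    if match is None:
        raise ValueError(f"unsupported relative time: {text!r}")
    delta = timedelta(**{_UNIT_TO_KWARG[match.group(2)]: int(match.group(1))})
    return (now - delta).isoformat()


def extract_with_default(record: Dict[str, str], field: str,
                         convert: Callable[[str], Any], default: Any = None) -> Any:
    """Abnormal data handling rule: fill a default when a field is missing or malformed."""
    raw = record.get(field)
    if raw is None:
        return default
    try:
        return convert(raw)
    except ValueError:
        return default
```

Passing `now` explicitly keeps the relative-time conversion deterministic, which makes the generated rule easier to verify.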
[0028] 103. The data acquisition agent runs data acquisition code in a preset isolated sandbox environment. The data acquisition code performs structured parsing of the webpage source code. Based on the structured parsing results, the required data in the target public webpage is determined and collected.
[0029] In this embodiment of the invention, to ensure the data acquisition capability of the data acquisition agent, it is first necessary to train and construct the data acquisition agent. Specifically, during the training process of the data acquisition agent, an initial data acquisition agent is first constructed, and then a third sample dataset is acquired. It is ensured that the dataset contains all necessary files, including data collection code for publicly available webpage samples with webpage requirement data annotation information. These publicly available webpage samples cover multiple publicly available, non-privacy webpage samples from different fields such as e-commerce, news, social media, and government affairs. The data is converted into a format that the initial data acquisition agent can understand. Finally, the data acquisition agent is trained and tested. Specifically, the dataset can be divided first: the third sample dataset can be divided into a training set and a test set using random or specific strategies (such as stratified sampling). The agent is then trained using the training set, and tested using the test set to evaluate its performance on unseen data. Precision, recall, and other metrics on the test set are calculated and recorded. If the performance of the data acquisition agent does not meet the requirements, it can return to the training phase for further iterations or adjustments. This process yields a data acquisition agent that meets the requirements. This data acquisition agent can be a large model.
[0030] In this embodiment of the invention, a pre-set isolated sandbox environment is pre-activated. This environment has the following characteristics: resource isolation: it has independent memory space, file system view, and network access control list to prevent the collection code from accessing sensitive resources of the host machine or initiating unauthorized external connections; timeout circuit breaker: it forcibly limits the maximum execution time of the code to prevent infinite loops or resource exhaustion; read-only input: the web page source code is mounted in the sandbox as a read-only data stream to ensure that the original data is not tampered with. Then, the data acquisition agent is invoked to inject the data collection code into the sandbox environment for execution. During execution, the data acquisition agent monitors the execution status of the code and intercepts any abnormal behavior that attempts to jump out of the sandbox boundary, execute system commands, or make network requests to non-target domain names. Within the sandbox, the code uses a built-in parsing engine to perform structured parsing of the mounted web page source code. Then, during code execution, based on embedded field extraction rules, the parsed source code DOM tree or data object is traversed to accurately locate the HTML node containing the required data. Data format conversion rules are executed on the required data to remove noise characters in real time. If a node is missing or the format is incorrect, abnormal data processing rules are triggered. Finally, based on the structured parsing results, the data acquisition agent extracts the required data that meets the data collection needs. This required data includes, for example, abstracts of papers found on publicly available web pages. 
The pre-set isolation sandbox environment in this embodiment provides an independent and secure space for the data collection code to run, preventing malicious code or abnormal operations that may occur during the data collection process from affecting the main system, thus ensuring the system's security and stability.
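Two of the sandbox characteristics described above, the timeout circuit breaker and the read-only input, can be approximated with a separate interpreter process. The Python sketch below is a simplified stand-in and not a real sandbox: it runs the collection code in a child process with a time limit and passes the page source on standard input, but it provides no memory, file system, or network isolation, which would require operating-system or container-level mechanisms. The function name `run_in_sandbox` and the result format are assumptions.

```python
import subprocess
import sys
from typing import Dict


def run_in_sandbox(code: str, page_source: str, timeout_s: float = 5.0) -> Dict[str, str]:
    """Run collection code in a child interpreter with a hard time limit.

    The page source is supplied on stdin, so the child cannot modify the
    parent's copy. This gives process separation and a timeout only.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],   # -I: isolated mode, ignores env vars and user site
            input=page_source,
            capture_output=True,
            text=True,
            timeout=timeout_s,                    # timeout circuit breaker
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": ""}
    if proc.returncode != 0:
        return {"status": "error", "output": proc.stderr}
    return {"status": "ok", "output": proc.stdout}
```

A call such as `run_in_sandbox("while True: pass", "", timeout_s=0.5)` returns a `timeout` status instead of blocking the caller, which is the behavior the circuit breaker is meant to guarantee.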
[0031] In this embodiment of the invention, the agents communicate and exchange data in real time through standardized data interfaces and message queues. The page loading capability of the RPA recognition agent is combined with the code generation capability of the coding agent to solve the problem of adapting to dynamic pages. The agent interaction architecture of this embodiment adopts a layered, decoupled design, divided into a data layer, an agent layer, and a user interaction layer. The data layer includes a temporary data pool, a code cache, a rule base, and a target database. Each agent in the agent layer is deployed independently, using message queues for task distribution and result feedback and REST APIs for data transmission. The user interaction layer provides a web-based visual interface that lets users configure collection requirements (such as target URL lists, collection fields, data formats, and collection frequencies), view task progress (including a real-time display of the working status of each agent), manage verification rules (adding, modifying, or deleting rules), and export collection logs. Users submit data collection tasks via the web interface. The system generates a unique task ID and writes the task information to a message queue. The RPA recognition agent listens to the message queue, obtains the task ID and URL list, and then begins capturing the source code. The coding agent listens for the "source code capture complete" message from the RPA recognition agent, obtains the source code and collection requirements, and generates the collection code. The data acquisition agent listens for the "code generation complete" message from the coding agent, executes the code, and outputs the required data. 
This embodiment of the invention avoids manual intervention through multi-agent collaborative work, thereby accelerating the data collection process and ensuring its stability and accuracy. Each agent is an independent module with clearly defined functions and interfaces. When a problem occurs in one agent, it can be debugged and repaired individually without affecting the normal operation of other agents.
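The message-driven hand-off between the three agents can be illustrated with an in-process queue. The Python sketch below is a single-threaded simplification: the event names, the message fields, and the `capture`, `generate`, and `execute` callbacks are hypothetical stand-ins for the RPA recognition agent, the coding agent, and the data acquisition agent, and a deployed system would use a real message broker with independently running agents.

```python
import queue
import uuid
from typing import Any, Callable, Dict, List


def run_task(
    url: str,
    fields: List[str],
    capture: Callable[[str], str],              # stand-in for the RPA recognition agent
    generate: Callable[[List[str], str], str],  # stand-in for the coding agent
    execute: Callable[[str, str], Any],         # stand-in for the data acquisition agent
) -> Dict[str, Any]:
    """Drive one collection task through an in-process message queue."""
    bus: "queue.Queue[Dict[str, Any]]" = queue.Queue()
    task_id = uuid.uuid4().hex                  # unique task ID
    bus.put({"event": "task_submitted", "task_id": task_id, "url": url, "fields": fields})
    trace: List[str] = []
    while True:
        msg = bus.get()
        trace.append(msg["event"])
        if msg["event"] == "task_submitted":
            source = capture(msg["url"])
            bus.put({**msg, "event": "source_captured", "source": source})
        elif msg["event"] == "source_captured":
            code = generate(msg["fields"], msg["source"])
            bus.put({**msg, "event": "code_generated", "code": code})
        elif msg["event"] == "code_generated":
            data = execute(msg["code"], msg["source"])
            return {"task_id": task_id, "data": data, "trace": trace}
```

Because each stage reacts only to the message produced by the previous one, an individual agent can be replaced or debugged without changing the others, which mirrors the decoupling described in this embodiment.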
[0032] According to the multi-agent collaborative data acquisition method provided by this invention, compared with the current method of collecting web page data based on static HTML collection tools, this invention uses the RPA identification agent to capture web page source code containing dynamically rendered content, ensuring the integrity and real-time nature of data acquisition and ensuring that the collected data is always up-to-date. Because the coding agent automatically generates data acquisition code based on the requirement information and the web page source code, complex acquisition logic does not need to be written manually, which greatly improves development efficiency, reduces human error, and shortens the data acquisition cycle. A preset isolated sandbox environment provides an independent and secure space for the data acquisition code to run, preventing malicious code or abnormal operations that may exist during the data acquisition process from affecting the main system, ensuring the system's security and stability. Multi-agent collaborative work avoids manual intervention, thereby accelerating the data acquisition process and ensuring its stable and accurate execution. Each agent is an independent module with clearly defined functions and interfaces; when a problem occurs in one agent, it can be debugged and repaired individually without affecting the normal operation of other agents.
[0033] Furthermore, to better illustrate the above-described multi-agent collaborative data acquisition process, as a refinement and extension of the above embodiments, this invention provides another multi-agent collaborative data acquisition method. As shown in Figure 2, the method includes: 201. In response to the data collection signal of the target public webpage, obtain the data collection requirement information of the target public webpage, and use the RPA identification agent to simulate a browser accessing the target public webpage. During the access process of the target public webpage, capture the webpage source code containing dynamically rendered content.
[0034] In this embodiment of the invention, when it is necessary to collect required data from a target public webpage, the RPA identification agent first needs to capture the complete webpage source code. Based on this, step 201 specifically includes: parsing the URL address and page metadata of the target public webpage; determining the rendering engine type and asynchronous loading strategy parameters of the target public webpage based on the parsing results; configuring and generating a browser running instance with dynamic monitoring capabilities based on the rendering engine type and the asynchronous loading strategy parameters; controlling the browser running instance to load and monitor the target public webpage; and determining whether the target public webpage has been dynamically rendered based on the monitoring results. The method for determining whether the target public webpage has been dynamically rendered includes: determining whether the page network request of the target public webpage is in an idle state and whether the DOM tree structure has not changed within a preset time window; if the page network request is in an idle state and the DOM tree structure has not changed within the preset time window, then determining that the target public webpage has been dynamically rendered; and determining the webpage source code based on the DOM tree structure that has not changed within the preset time window when the target public webpage has been dynamically rendered.
[0035] Page metadata includes, but is not limited to, the webpage's HTTP response headers, meta tags, and initial HTML structure. Specifically, the RPA identification agent first pre-parses the target public webpage's URL and page metadata. Based on the parsing results, the agent automatically infers the page's rendering engine type, such as React, Vue, Angular, or native JS, as well as asynchronous loading strategy parameters, such as AJAX polling intervals and lazy loading trigger conditions. Based on these characteristics, a browser instance with dynamic monitoring capabilities is dynamically configured and launched, with corresponding pre-set listener hooks to adapt to the specific rendering mechanism. After the browser instance loads the target public webpage, it enters the real-time monitoring phase. The agent monitors the browser's network activity, confirming that all asynchronous data requests have stopped and the network is idle, i.e., no new or ongoing requests within a set time period. Simultaneously, it continuously tracks changes in the Document Object Model (DOM). If, within a pre-set time window set according to actual needs, no nodes are added, deleted, or attributes are changed in the DOM tree structure, the DOM is considered to have reached a stable state. When both of these conditions are met, the dynamic rendering of the target public webpage is considered complete, ensuring that all content that needs to be generated on the page is presented. Subsequently, based on the unchanged DOM tree structure, the system extracts the complete webpage source code through the browser kernel interface. This source code not only contains the initial HTML but also fully preserves the DOM nodes corresponding to dynamically injected text, image links, style information, and interaction logic. 
This embodiment of the invention, by configuring and generating a browser instance with dynamic monitoring capabilities, overcomes the problems of static crawling's inability to capture dynamic content and the inefficiency of simple delays in existing technologies, achieving efficient and accurate source code capture of dynamic websites.
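The two-condition completion test described above (network idle and DOM unchanged within a preset time window) can be sketched as a pure function over a series of monitoring samples. This is an illustrative sketch only: the `Observation` record, the use of a DOM hash to detect changes, and the one-second default window are assumptions, and a real agent would obtain the samples from a browser instance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    t: float          # seconds since the page started loading
    in_flight: int    # number of pending network requests at time t
    dom_hash: str     # hash of the serialized DOM tree at time t (assumed signal)


def rendering_complete_at(observations, window=1.0):
    """Return the first sample time at which the network has been idle and the
    DOM unchanged for at least `window` seconds, or None if that never happens."""
    run_start = None   # time at which the current quiet run began
    run_hash = None    # DOM hash observed throughout the current quiet run
    for obs in observations:
        if obs.in_flight > 0:
            # Network activity: any quiet run is interrupted.
            run_start, run_hash = None, None
            continue
        if run_start is None or obs.dom_hash != run_hash:
            # First idle sample, or the DOM changed: restart the quiet run here.
            run_start, run_hash = obs.t, obs.dom_hash
            continue
        if obs.t - run_start >= window:
            return obs.t
    return None
```

Compared with a fixed delay, this check returns as soon as both conditions hold, and it keeps waiting when late asynchronous content is still arriving.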
[0036] In another embodiment of the present invention, the RPA identification agent retrieves the source code and stores it in a temporary data pool along with the URL and timestamp. When retrieving the source code, the RPA identification agent integrates a "dynamic waiting + fixed timeout" mechanism for loading and waiting, that is, it listens to the page loading status and the JS events associated with the target, while setting a maximum timeout to avoid task blocking due to page freeze. The source code is retrieved to obtain all resources of the page, from which the complete DOM tree is obtained, and finally merged into a complete source code package of "resources + DOM".
[0037] 202. Generate data collection code containing data collection rules by using a coding agent based on the data collection requirement information and the webpage source code. The data collection rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules.
[0038] In this embodiment of the invention, in order to collect demand data on a webpage, it is also necessary to generate data collection code through an intelligent coding agent. Therefore, step 202 specifically includes: generating a code parsing prompt instruction based on the webpage source code; identifying the location path, attribute features, and data type of the target field to be collected in the webpage source code based on the code parsing prompt instruction; generating a field-path-type mapping table based on the location path, attribute features, and data type; generating a rule-building prompt instruction based on the field-path-type mapping table; generating data collection rules based on the rule-building prompt instruction; generating a code writing prompt instruction based on the field-path-type mapping table and the data collection rules; and generating the data collection code based on the code writing prompt instruction according to the data collection demand information and the webpage source code.
[0039] Specifically, the coding agent first receives the webpage source code and automatically generates code parsing hints. These hints guide the coding agent to perform semantic analysis on the webpage's DOM tree, accurately identifying the target fields to be collected. During this process, the coding agent extracts three core dimensions for each target field: location path (e.g., XPath, CSS selectors, or relative hierarchical paths); attribute features (e.g., text content, HTML attribute values (href, src), style class names, etc.); and data type (inferring its original type, such as string, number, date, etc.). Based on the above analysis results, the coding agent outputs a structured field-path-type mapping table, serving as the data dictionary for subsequent rule formulation. Then, based on the generated mapping table, rule-building hints are constructed. These hints require the coding agent to combine data collection requirements (e.g., target output format, cleaning standards) to formulate specific data collection rules for each field in the mapping table. These rules include extraction logic (how to extract effective information from complex nodes), transformation logic (how to format the original data), and fault tolerance logic (the handling strategy when the path fails or data is missing). This process ensures effective decoupling and alignment between business logic and the underlying data structure. Finally, the system merges the field-path-type mapping table with the determined data collection rules, constructs code writing prompts, and the coding agent automatically writes complete data collection code based on these prompts, comprehensively considering the constraints of the data collection requirements and the actual structural characteristics of the webpage source code. 
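As a rough illustration of the field-path-type mapping table and of how a rule-building prompt might be derived from it, consider the following sketch. The record layout, the prompt wording, and the names `FieldMapping` and `build_rule_prompt` are hypothetical; the embodiment does not prescribe a concrete format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMapping:
    name: str        # target field to collect
    path: str        # location path, here a CSS selector
    attribute: str   # attribute feature, e.g. "text" or "href"
    dtype: str       # inferred original type, e.g. "string", "number", "date"


def build_rule_prompt(mappings, requirements):
    """Turn the mapping table plus the collection requirements into a prompt
    asking for extraction, transformation, and fault-tolerance rules."""
    lines = [
        f"Collection requirements: {requirements}",
        "For each field below, state the extraction, transformation, "
        "and fault-tolerance rule:",
    ]
    for m in mappings:
        lines.append(f"- {m.name}: path={m.path}; attribute={m.attribute}; type={m.dtype}")
    return "\n".join(lines)
```

Keeping the mapping table as structured data, separate from the prompt text, is one way to preserve the decoupling between business requirements and page structure described above.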
It should be noted that the coding agent can use a large model to write the data collection code, that is, inputting the data collection requirements and the webpage source code into the large model, and directly outputting the data collection code through the large model. The generated data collection code not only embeds precise locators and converters, but also includes a complete exception handling mechanism and logging module, ensuring that it can stably and accurately extract target data from dynamic webpages during actual operation. In another embodiment of the invention, the coding agent also includes a code optimization mechanism. The code optimization mechanism automatically detects syntax errors in the generated collection code. If syntax errors are found, it automatically generates "correction prompts," such as "Syntax error exists on line 15 of the code: missing colon, please correct," and re-invokes the coding agent to generate collection code based on the correction prompts.
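The code optimization mechanism can be illustrated, assuming the generated collection code is Python, with the standard `ast` module. In this sketch `generate` is a hypothetical callable standing in for the coding agent; it receives the previous correction prompt (or `None` on the first call) and returns code as a string. The retry limit is likewise an assumption.

```python
import ast


def correction_prompt(code):
    """Return a correction prompt if `code` has a syntax error, else None."""
    try:
        ast.parse(code)
    except SyntaxError as exc:
        return (f"Syntax error exists on line {exc.lineno} of the code: "
                f"{exc.msg}, please correct")
    return None


def generate_valid_code(generate, max_attempts=3):
    """Call the (hypothetical) coding agent until its output parses cleanly."""
    prompt = None
    for _ in range(max_attempts):
        code = generate(prompt)
        prompt = correction_prompt(code)
        if prompt is None:
            return code
    raise ValueError("no syntactically valid code produced")
```

Note that `ast.parse` detects only syntax errors; logical errors in the collection rules are left to the later format verification step.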
[0040] In another embodiment of the present invention, the data collection code generated by the coding agent can be stored in the code cache.
[0041] 203. The data acquisition agent runs data acquisition code in a preset isolated sandbox environment, and performs structured parsing of the webpage source code by running the data acquisition code.
[0042] 204. Determine structured snapshot data based on the structured parsing results.
[0043] 205. The format verification agent performs data verification on the structured snapshot data based on preset data verification rules, wherein the preset data verification rules include at least one of data type verification rules, data format verification rules, and data range verification rules.
[0044] 206. If the data verification is successful, the structured snapshot data will be used as the required data. If the data verification fails, a data verification failure report will be generated and sent to the coding agent. The coding agent will then be controlled to regenerate new data collection code containing new data collection rules based on the data collection requirement information, the webpage source code, and the data verification failure report.
[0045] 207. The data acquisition agent runs new data acquisition code in a preset isolated sandbox environment. By running the new data acquisition code, the webpage source code is re-structured and parsed. Based on the new structured parsing results, the required data in the target public webpage is determined and collected.
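The idea of running generated collection code apart from the main system, as in steps 203 and 207, can be approximated by executing it in a separate interpreter process with a time limit. This is a simplified sketch under stated assumptions: the generated code is Python and defines a function `collect(source)` returning a JSON-serializable dict (a hypothetical contract), and a child process provides only process-level separation, not the full isolation a production sandbox (for example a container with resource and network limits) would need.

```python
import json
import subprocess
import sys


def run_isolated(code, source, timeout=5.0):
    """Run generated collection code in a separate Python process.

    The page source is passed on stdin and the result is read back as JSON
    from stdout. subprocess.TimeoutExpired is raised if the time limit is hit.
    """
    runner = code + "\nimport json, sys\nprint(json.dumps(collect(sys.stdin.read())))\n"
    proc = subprocess.run(
        [sys.executable, "-I", "-c", runner],   # -I: isolated mode, ignores user site/env
        input=source,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    if proc.returncode != 0:
        # A crash in the generated code stays inside the child process.
        raise RuntimeError(proc.stderr.strip())
    return json.loads(proc.stdout)
```

An exception or runaway loop in the generated code then surfaces as an error or a timeout in the caller instead of affecting the process that orchestrates the agents.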
[0046] Specifically, the data acquisition agent retrieves the collection code matching the task ID and URL from the code cache and downloads the corresponding webpage source code from the temporary data pool. The data acquisition agent loads and runs the data acquisition code in a pre-defined isolated sandbox environment. This code performs deep structured parsing of the input webpage source code (such as DOM traversal and JSON extraction), extracts preliminary intermediate results, and encapsulates them into structured snapshot data. This snapshot data can be in JSON format and includes task information, core data (target field values), execution logs (code execution time), etc., and also fully retains field names, original values, and metadata information as the benchmark object for subsequent verification. The format verification agent configures data verification rules through a web interface. Each rule includes attributes such as "rule ID, field name, rule type, rule parameters, and error message," and supports rule types such as data type verification, format verification, and range verification. The structured snapshot data is validated according to predefined data validation rules. Validation dimensions include at least: data type validation (verifying whether numeric fields are numbers, date fields conform to time formats, etc.); data format validation (checking whether strings conform to regular expressions, such as email, phone numbers, and URL formats); and data range validation (determining whether values are within a reasonable range, such as non-negative prices or percentages between 0 and 100). If validation is successful (all fields pass the predefined rules), the format validation agent determines the data is valid, directly confirms the structured snapshot data as the final required data, and pushes it to storage or downstream business systems. 
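The three rule types named above (data type, data format, and data range) can be sketched as follows. The `Rule` attributes mirror the "rule ID, field name, rule type, rule parameters, and error message" listed in this paragraph, but the parameter keys (`expected`, `pattern`, `min`, `max`) and the report layout are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    field: str
    rule_type: str       # "type", "format", or "range"
    params: dict
    error_message: str


TYPE_CHECKS = {
    # bool is excluded because it is a subclass of int in Python
    "number": lambda v: isinstance(v, (int, float)) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def passes(rule, value):
    if rule.rule_type == "type":
        return TYPE_CHECKS[rule.params["expected"]](value)
    if rule.rule_type == "format":
        return isinstance(value, str) and re.fullmatch(rule.params["pattern"], value) is not None
    if rule.rule_type == "range":
        return TYPE_CHECKS["number"](value) and rule.params["min"] <= value <= rule.params["max"]
    raise ValueError(f"unknown rule type: {rule.rule_type}")


def validate(snapshot, rules):
    """Return a failure report with one entry per violated rule (empty = valid)."""
    report = []
    for rule in rules:
        value = snapshot.get(rule.field)
        if not passes(rule, value):
            report.append({
                "rule_id": rule.rule_id,
                "field": rule.field,
                "rule_type": rule.rule_type,
                "actual": value,
                "expected": rule.params,
                "error": rule.error_message,
            })
    return report
```

Recording the actual value next to the expected parameters gives the coding agent the information it needs to diagnose why a field failed.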
If validation fails (if any field is found to be non-compliant with the rules), the format validation agent immediately terminates the current process and generates a detailed data validation failure report. This report records information such as the failed fields, error type, and the difference between the actual value and the expected rule. After generating a failure report, the data collection requirements, the original webpage source code, and the newly generated data validation failure report are sent as joint inputs to the coding agent. The coding agent analyzes the reasons for the failure (such as selector bias, missing cleaning logic, or incorrect conversion formulas), and derives and generates new data collection code containing new data collection rules. The newly generated data collection code is then injected into the isolation sandbox for execution, repeating the above "parsing → validation" process until the output required data fully meets the validation standards. This embodiment of the invention effectively solves the data collection error problem caused by minor adjustments to the webpage structure or incomplete initial rules by adding a format validation agent, significantly improving the accuracy and robustness of data collection. In another embodiment of the invention, the format validation agent sets a maximum number of retries during the validation process. If the URL still fails after exceeding the retries, it is marked as an "abnormal URL," and the user is notified for manual intervention.
[0047] Furthermore, after collecting requirement data from the target public webpage, it is necessary to store information such as source code, collection logs, and requirement data. Based on this, the method includes: determining the data collection logs of the target public webpage, wherein the data collection logs include task logs and URL logs; determining the data attribute information of the requirement data, wherein the data attribute information includes at least one of a timestamp field and a business identifier field; determining the target storage location information of the requirement data based on the data attribute information using a data storage agent, wherein the target storage location information includes a target database instance identifier and a target data table identifier; and writing the requirement data, data collection logs, webpage source code, and data collection code into the target data table of the target database based on the target database instance identifier and the target data table identifier.
[0048] Specifically, after completing the collection of required data for the target public webpage, a data collection log is generated and defined synchronously. This log includes a task log recording the execution process and exception information, as well as a URL log recording the access trajectory. Simultaneously, a deep scan of the collected required data is performed to extract key data attribute information, focusing on identifying timestamp fields and business identifier fields (such as order number, user ID, product code, etc.) as the basis for subsequent storage routing. The data storage agent determines the target storage location information for this batch of data based on business identifier mapping rules and timestamp partitioning strategies, such as selecting specific database cluster nodes based on data volume or business domain; and locking specific data table names based on business type or time period. Then, based on the determined target database instance identifier and target data table identifier, the data storage agent packages the webpage data collection process data and writes it to the designated data table in the target database. This embodiment of the invention, through a data storage agent, enables structured storage of webpage collection process data, associating and archiving data such as "data, logs, source code, and code," constructing a complete data collection evidence chain, and improving the traceability and troubleshooting efficiency of the data system.
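A rule-based version of the storage routing described above might look like the following. In this embodiment the data storage agent makes the decision (and may be a large model); the sketch replaces it with a fixed lookup for illustration. The business-identifier format (`domain-key`), the instance names, and the monthly table partitioning are all assumptions.

```python
from datetime import datetime

# Hypothetical mapping from business domain to database instance identifier.
INSTANCE_BY_DOMAIN = {"quote": "db-market", "news": "db-news"}


def route(record):
    """Pick (database instance identifier, data table identifier) for a record
    from its business identifier field and timestamp field."""
    domain = record["business_id"].split("-", 1)[0]
    instance = INSTANCE_BY_DOMAIN.get(domain, "db-default")
    ts = datetime.fromisoformat(record["timestamp"])
    table = f"{domain}_{ts:%Y%m}"      # one table per business domain and month
    return instance, table
```

For example, a record with business identifier `quote-600030` and timestamp `2026-03-12T09:30:00` would be routed to instance `db-market`, table `quote_202603`.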
[0049] In this embodiment of the invention, to ensure the effectiveness of the format verification agent and the data storage agent, it is first necessary to train and construct the corresponding agents. Taking the format verification agent as an example, firstly, structured snapshot data (initial requirement data collected from the webpage) is obtained from publicly available webpage samples labeled with format verification results. This obtained data is then used as the fourth sample dataset to train and construct the format verification agent. During the training of the data storage agent, the sample dataset includes the process of collecting requirement data from publicly available webpages with accurate data storage result annotation information. The accurate storage result annotation information includes the target database instance identifier and the target data table identifier. The data storage agent is then trained based on this dataset. It should be noted that both the format verification agent and the data storage agent described above can be large models.
[0050] In another embodiment of the present invention, after collecting the demand data from the webpage, in order to ensure the quality of data collection, it is also necessary to verify the accuracy of the demand data. Based on this, the method includes: determining the area image of the demand data in the target public webpage; determining the semantic similarity between the data semantics in the area image and the data semantics of the demand data; determining the collection accuracy of the demand data based on the semantic similarity; and, if the collection accuracy does not meet the requirements, re-collecting data in the target public webpage.
[0051] Specifically, after initially collecting the requirement data based on the structured parsing results, the process traces back to the original target public webpage rendering layer. Based on the coordinates or selector path of the requirement data in the DOM tree, a corresponding area image is extracted from the target public webpage. This area image fully preserves the visual presentation environment of the requirement data on the page. Then, visual text features from the area image and text features from the collected requirement data are extracted to construct data semantic vectors for both. The semantic similarity between these two semantic vectors is then calculated and compared with a preset similarity threshold set according to actual requirements. If the similarity is greater than the threshold, the requirement data is considered accurately obtained; if it is less than or equal to the threshold, the requirement data is considered inaccurate. In this case, the above process is used to re-acquire the requirement data from the webpage, and the data accuracy verification is repeated until the collected requirement data meets the accuracy verification conditions. This embodiment of the invention effectively solves the parsing error problem that may occur during pure text parsing through a cross-validation mechanism based on visual dimensions, thereby improving the accuracy of data collection from complex webpage layouts.
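The threshold comparison in this cross-validation step can be sketched with cosine similarity. The embodiment does not specify how the semantic vectors are built, so the sketch substitutes a simple bag-of-words count vector for both sides and assumes the text in the area image has already been recognized; the default threshold of 0.9 is also an assumption.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Stand-in for a semantic vector: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def is_accurate(image_text, collected_text, threshold=0.9):
    """Accurate only if similarity is strictly greater than the threshold,
    matching the rule above (less than or equal means inaccurate)."""
    return cosine(vectorize(image_text), vectorize(collected_text)) > threshold
```

With real embeddings in place of `vectorize`, the same comparison would tolerate paraphrase and formatting differences that word counts cannot.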
[0052] In another embodiment of the present invention, if it is necessary to collect non-privacy data on another public webpage, in order to improve data collection efficiency and save code writing resources, the specific collection method includes: responding to the data collection signal of the current webpage, obtaining the current data collection requirement information, using an RPA identification agent to capture the current webpage source code, and determining the current structural features of the current webpage source code, generating the current structural fingerprint of the current webpage source code based on the current structural features; matching the current data collection code in the code cache library based on the current structural fingerprint and the current data collection requirement information, and running the current data collection code in a preset isolated sandbox environment through the data acquisition agent, performing structured parsing of the current webpage source code by running the data collection code, and determining and collecting the required data in the current webpage based on the current structured parsing result, wherein the code cache library stores data collection codes corresponding to various page source codes and various data collection requirement information.
[0053] Specifically, upon receiving a data collection signal for the current webpage, the system first obtains specific data collection requirements, including but not limited to the target public webpage URL, the desired data fields (such as price, title, and comments), the expected data format, and specific collection constraints (such as maximum waiting time and concurrency limits). Then, the RPA-based intelligent agent is invoked to crawl the current webpage's source code and performs in-depth analysis to determine its current structural features, such as DOM tree topology, tag hierarchy distribution, and key node path patterns. Based on these features, a unique structural fingerprint is generated to identify the page's structure. Further, the system accesses a pre-built code cache library, which stores various historically accumulated page source code samples and their corresponding data collection codes, establishing a mapping index of "structural fingerprint - collection requirements - collection code." The system uses the generated current structural fingerprint combined with the current data collection requirements as a joint query key to match the collection code in the cache library. If a similar or identical structural fingerprint is found and the requirements are consistent, the corresponding current data collection code is directly extracted without rewriting, significantly improving response speed. After obtaining the matching current data collection code, the data acquisition agent loads it into a preset isolated sandbox environment for execution. The running code performs structured parsing of the current webpage source code, accurately locating and extracting webpage data that meets the requirements. It should be noted that the current webpage in this embodiment of the invention originates from a publicly accessible website that allows data scraping. 
This website has no access restrictions, is not a privacy website, and the webpages within it are not privacy webpages. Before acquiring data from the webpages on this website, the identities of the RPA identification agent, the coding agent, and the data acquisition agent are legally verified. None of the agents engage in activities such as cracking the current webpage's front-end encryption algorithm, forging device fingerprints, bypassing CAPTCHAs, or acquiring data at excessive frequency, and thus they do not interfere with the normal operation of the website. Furthermore, the current webpage data collected by each agent does not involve protected works, does not contain personal privacy, and its usage does not constitute a replacement for the original website.
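The structural fingerprint and code cache lookup of this embodiment can be illustrated with the following sketch. The choice of features (the ordered list of tag paths, ignoring text and attributes), the SHA-256 hash, and the `CodeCache` class are assumptions; the sketch also matches only identical fingerprints, whereas the embodiment additionally allows similar fingerprints to match.

```python
import hashlib
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}   # elements with no end tag


class _TagPaths(HTMLParser):
    """Collect the root-to-node tag path of every element, ignoring text."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = []

    def handle_starttag(self, tag, attrs):
        self.paths.append("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def structural_fingerprint(source):
    parser = _TagPaths()
    parser.feed(source)
    parser.close()
    return hashlib.sha256("\n".join(parser.paths).encode("utf-8")).hexdigest()


class CodeCache:
    """Index collection code by (structural fingerprint, required fields)."""

    def __init__(self):
        self._store = {}

    def put(self, source, fields, code):
        self._store[(structural_fingerprint(source), frozenset(fields))] = code

    def get(self, source, fields):
        return self._store.get((structural_fingerprint(source), frozenset(fields)))
```

Two pages that share a template but differ in content produce the same fingerprint, so code generated for one can be reused for the other when the required fields also match.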
[0054] According to another multi-agent collaborative data acquisition method provided by the present invention, compared with the current method of collecting web page data based on static HTML collection tools, the present invention uses the RPA identification agent to capture the web page source code containing dynamically rendered content, which can ensure the integrity and real-time nature of data acquisition and ensure that the collected data is always up-to-date; because the coding agent automatically generates data acquisition code based on the requirement information and web page source code, there is no need to manually write complex acquisition logic, which greatly improves development efficiency, reduces human error, and shortens the data acquisition cycle; the preset isolated sandbox environment provides an independent and safe space for the operation of data acquisition code, preventing malicious code or abnormal operations that may exist during the data acquisition process from affecting the main system, and ensuring the security and stability of the system; the multi-agent collaborative work avoids human intervention, thereby accelerating the data acquisition process and ensuring its stable and accurate execution; and each agent is an independent module with clear functions and interfaces, so that when a problem occurs in a certain agent, it can be debugged and repaired separately without affecting the normal operation of other agents.
[0055] Furthermore, as a specific implementation of the method shown in Figure 1, embodiments of the present invention provide a multi-agent collaborative data acquisition device. As shown in Figure 3, the device includes: a code capture unit 31, a code generation unit 32, and a data acquisition unit 33.
[0056] The code capture unit 31 can be used to respond to the data collection signal of the target public webpage, obtain the data collection requirement information of the target public webpage, use the RPA identification agent to simulate a browser accessing the target public webpage, and capture the webpage source code containing dynamically rendered content during the access process of the target public webpage.
[0057] The code generation unit 32 can be used to generate data collection code containing data collection rules by a coding agent based on the data collection requirement information and the webpage source code. The data collection rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules.
[0058] The data acquisition unit 33 can be used to run the data acquisition code in a preset isolated sandbox environment through a data acquisition agent, perform structured parsing of the webpage source code by running the data acquisition code, and determine and collect the required data in the target public webpage based on the structured parsing results.
[0059] In specific application scenarios, in order to determine and collect the required data from the target public webpage based on the structured parsing results, as shown in Figure 4, the data acquisition unit 33 includes a determination module 331, a data verification module 332, a code regeneration module 333, and a data re-acquisition module 334.
[0060] The determination module 331 can be used to determine structured snapshot data based on the structured parsing results.
[0061] The data verification module 332 can be used to perform data verification on the structured snapshot data based on preset data verification rules by a format verification agent. The preset data verification rules include at least one of data type verification rules, data format verification rules, and data range verification rules.
[0062] The code regeneration module 333 can be used to take the structured snapshot data as the required data if the data verification is qualified, and generate a data verification failure report if the data verification fails. The data verification failure report is then sent to the coding agent, which controls the coding agent to regenerate new data collection code containing new data collection rules based on the data collection requirement information, the webpage source code, and the data verification failure report.
[0063] The data re-acquisition module 334 can be used to run the new data collection code in a preset isolated sandbox environment through the data acquisition agent, perform structured parsing of the webpage source code again by running the new data collection code, and determine and collect the required data in the target public webpage based on the new structured parsing results.
[0064] In specific application scenarios, the device further includes a data storage unit 34 for storing the collected data.
[0065] The data storage unit 34 is used to determine the data collection log of the target public webpage, wherein the data collection log includes task logs and URL logs; determine the data attribute information of the required data, wherein the data attribute information includes at least one of a timestamp field and a business identifier field; determine the target storage location information of the required data based on the data attribute information by a data storage agent, wherein the target storage location information includes a target database instance identifier and a target data table identifier; and write the required data, data collection log, webpage source code, and data collection code into the target data table of the target database based on the target database instance identifier and the target data table identifier.
[0066] In specific application scenarios, in order to generate data acquisition code, the code generation unit 32 includes an identification module 321, a rule construction module 322, and a code generation module 323.
[0067] The identification module 321 can be used to generate code parsing prompts based on the webpage source code, identify the location path, attribute features, and data type of the target field to be collected in the webpage source code based on the code parsing prompts, and generate a field-path-type mapping table based on the location path, attribute features, and data type.
[0068] The rule construction module 322 can be used to generate rule-building prompts based on the field-path-type mapping table, and generate data collection rules based on the rule-building prompts.
[0069] The code generation module 323 can be used to generate code writing prompts based on the field-path-type mapping table and the data collection rules, and generate the data collection code based on the data collection requirements information and the webpage source code according to the code writing prompts.
[0070] In specific application scenarios, in order to perform subsequent web page data collection, the data collection unit 33 can also be used to: respond to the data collection signal of the current web page and obtain the current data collection requirement information; capture the current web page source code of the current web page through the RPA identification agent; determine the current structural features of the current web page source code and generate the current structural fingerprint of the current web page source code based on the current structural features; match the current data collection code in the code cache library based on the current structural fingerprint and the current data collection requirement information; and run the current data collection code in a preset isolated sandbox environment through the data acquisition agent. Running the current data collection code performs structured parsing of the current web page source code, and the required data in the current web page is determined and collected based on the current structured parsing result. The code cache library stores data collection codes corresponding to various page source codes and various data collection requirement information.
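One possible way to realize the structural fingerprint and cache lookup is sketched below. It assumes that the structural features are the sequence of tag names and class attributes, ignoring text content, and that the fingerprint is a SHA-256 digest of that sequence; neither choice is stated in the description. The names `structural_fingerprint` and `lookup_code` are hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class _TagCollector(HTMLParser):
    """Collects tag names and class attributes, ignoring text content."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class") or ""
        self.tags.append(f"{tag}.{'.'.join(sorted(classes.split()))}")


def structural_fingerprint(html: str) -> str:
    """Hash the page structure so that pages differing only in data match."""
    collector = _TagCollector()
    collector.feed(html)
    collector.close()
    return hashlib.sha256("/".join(collector.tags).encode("utf-8")).hexdigest()


def lookup_code(cache: dict, html: str, requirement: str):
    """Return cached collection code for this structure and requirement, or None."""
    return cache.get((structural_fingerprint(html), requirement))


if __name__ == "__main__":
    page_a = '<div class="quote"><span class="price">10.5</span></div>'
    page_b = '<div class="quote"><span class="price">11.2</span></div>'
    cache = {(structural_fingerprint(page_a), "price"): "collection-code-v1"}
    print(lookup_code(cache, page_b, "price"))
```

Because the fingerprint ignores text nodes, a page whose values change but whose layout does not can reuse cached code, while a layout change produces a cache miss and would fall back to generating new code.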
[0071] In specific application scenarios, in order to capture webpage source code, the code capture unit 31 includes a parsing module 311, a generation module 312, a judgment module 313, and a code capture module 314.
[0072] The parsing module 311 can be used to parse the URL address and page metadata of the target public webpage, and determine the rendering engine type and asynchronous loading strategy parameters of the target public webpage based on the parsing results.
[0073] The generation module 312 can be used to generate a browser running instance with dynamic monitoring capabilities based on the rendering engine type and the asynchronous loading strategy parameters.
[0074] The judgment module 313 can be used to control the browser instance to load and monitor the target public webpage, and determine whether the target public webpage has completed dynamic rendering based on the monitoring results. Determining whether the target public webpage has completed dynamic rendering includes: determining whether the page network requests of the target public webpage are in an idle state and whether the DOM tree structure has remained unchanged within a preset time window. If the page network requests are in an idle state and the DOM tree structure has remained unchanged within the preset time window, the target public webpage is determined to have completed dynamic rendering.
[0075] The code capture module 314 can be used to determine the webpage source code, after the target public webpage has completed dynamic rendering, based on the DOM tree structure that has remained unchanged within the preset time window.
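The rendering judgment can be sketched as a pure function over periodic observations of the page. The sketch assumes that the browser instance reports, at each sampling time, the number of pending network requests and a hash of the serialized DOM tree; the `Observation` record and the sampling mechanism are hypothetical and stand in for whatever monitoring interface the browser instance exposes.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One monitoring sample taken from the browser instance (assumed shape)."""
    t: float               # seconds since page load started
    pending_requests: int  # in-flight network requests at time t
    dom_hash: str          # hash of the serialized DOM tree at time t


def rendering_complete(observations, window: float) -> bool:
    """True if the network is idle and the DOM was stable for `window` seconds."""
    if not observations:
        return False
    last = observations[-1]
    # Condition 1: page network requests are in an idle state.
    if last.pending_requests != 0:
        return False
    # The samples must span the whole window, otherwise stability is unproven.
    if last.t - observations[0].t < window:
        return False
    # Condition 2: the DOM tree has not changed within the preset time window.
    recent = [o for o in observations if o.t >= last.t - window]
    return all(o.dom_hash == last.dom_hash for o in recent)


if __name__ == "__main__":
    samples = [
        Observation(0.0, 3, "a"),
        Observation(0.5, 1, "b"),
        Observation(1.0, 0, "c"),
        Observation(1.5, 0, "c"),
        Observation(2.0, 0, "c"),
    ]
    print(rendering_complete(samples, window=1.0))
```

Once the function returns true, the DOM serialized at the last sample would be taken as the webpage source code.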
[0076] In specific application scenarios, in order to verify the required data, the device also includes a data verification unit 35.
[0077] The data verification unit 35 can be used to determine the area image of the required data in the target public webpage; determine the semantic similarity between the data semantics in the area image and the data semantics of the required data; determine the collection accuracy of the required data based on the semantic similarity; and, if the collection accuracy does not meet the requirements, re-collect data from the target public webpage.
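The verification flow can be outlined as follows, under two loudly stated assumptions. First, the text of the area image is assumed to have already been extracted by some recognition step that is not shown. Second, character-level similarity from Python's `difflib` is used only as a simple stand-in for the semantic similarity described above; a real system would likely use a semantic model. The threshold of 0.9 is likewise hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lower-case and collapse whitespace so layout differences are ignored."""
    return " ".join(text.lower().split())


def collection_accuracy(region_text: str, collected_text: str) -> float:
    """Similarity in [0, 1] between on-page text and collected data.

    Stand-in measure: character-level ratio, not true semantic similarity.
    """
    return SequenceMatcher(
        None, normalize(region_text), normalize(collected_text)
    ).ratio()


def needs_recollection(region_text: str, collected_text: str,
                       threshold: float = 0.9) -> bool:
    """True if the accuracy falls below the (assumed) required threshold."""
    return collection_accuracy(region_text, collected_text) < threshold


if __name__ == "__main__":
    print(needs_recollection("Price:  10.50", "price: 10.50"))
    print(needs_recollection("Price: 10.50", "Contact us"))
```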
[0078] It should be noted that other corresponding descriptions of the functional modules involved in the multi-agent collaborative data acquisition device provided in this embodiment of the invention can be found in the corresponding description of the method shown in Figure 1, and will not be repeated here.
[0079] Based on the method shown in Figure 1 above, this embodiment of the invention accordingly also provides a computer-readable storage medium storing a computer program that, when executed by a processor, performs the following steps: in response to a data acquisition signal of a target public webpage, acquiring data acquisition requirement information of the target public webpage; using an RPA identification agent to simulate a browser accessing the target public webpage, and during the access to the target public webpage, capturing the webpage source code containing dynamically rendered content; using an encoding agent to generate data acquisition code containing data acquisition rules based on the data acquisition requirement information and the webpage source code, wherein the data acquisition rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules; and using a data acquisition agent to run the data acquisition code in a preset isolated sandbox environment, performing structured parsing of the webpage source code by running the data acquisition code, and determining and acquiring the required data from the target public webpage based on the structured parsing results.
[0080] Based on the method shown in Figure 1 and the device shown in Figure 3 above, an embodiment of the invention also provides a physical structure diagram of a computer device. As shown in Figure 5, the computer device includes: a processor 41, a memory 42, and a computer program stored in the memory 42 and executable on the processor 41. Both the memory 42 and the processor 41 are mounted on a bus 43. When the processor 41 executes the program, it performs the following steps: in response to a data acquisition signal of a target public webpage, acquiring data acquisition requirement information of the target public webpage; using an RPA identification agent to simulate a browser accessing the target public webpage, and during the access process, capturing the webpage source code containing dynamically rendered content; using an encoding agent to generate data acquisition code containing data acquisition rules based on the data acquisition requirement information and the webpage source code, wherein the data acquisition rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules; and using a data acquisition agent to run the data acquisition code in a preset isolated sandbox environment, performing structured parsing of the webpage source code by running the data acquisition code, and determining and acquiring the required data from the target public webpage based on the structured parsing results.
[0081] Through the technical solution of this invention, the RPA identification agent captures web page source code containing dynamically rendered content, which ensures the integrity and real-time nature of data collection and keeps the collected data up to date. The encoding agent automatically generates data collection code based on the data collection requirement information and the web page source code, eliminating the need to manually write complex collection logic, which improves development efficiency, reduces human error, and shortens the data collection cycle. The preset isolated sandbox environment provides an independent and secure space for the data collection code to run, preventing malicious code or abnormal operations from affecting the main system and ensuring system security and stability. Multi-agent collaboration avoids manual intervention, thereby accelerating the data collection process and ensuring its stable and accurate execution. Each agent is an independent module with clearly defined functions and interfaces; when a problem occurs in one agent, it can be debugged and repaired individually without affecting the normal operation of the other agents.
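The idea of running generated collection code apart from the main system can be sketched with a child process. This is only an illustration under stated assumptions: a separate Python interpreter started in isolated mode (`-I`), an empty temporary working directory, and a timeout. It is not a security boundary on its own; the description does not say how the preset isolated sandbox environment is built, and a production sandbox would typically rely on containers or operating-system level restrictions. The function name `run_in_sandbox` and the convention of passing the page on standard input and returning JSON on standard output are hypothetical.

```python
import json
import subprocess
import sys
import tempfile


def run_in_sandbox(code: str, page_source: str, timeout: float = 5.0) -> dict:
    """Run generated collection code in a separate, time-limited process.

    The page source is passed on stdin; the code is expected to print one
    JSON object on stdout (an assumed convention, not from the source).
    """
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            input=page_source,
            capture_output=True,
            text=True,
            timeout=timeout,   # raises subprocess.TimeoutExpired on overrun
            cwd=workdir,       # empty scratch directory
        )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return json.loads(result.stdout)


if __name__ == "__main__":
    generated_code = (
        "import sys, json, re\n"
        "src = sys.stdin.read()\n"
        "match = re.search(r'([0-9.]+)', src)\n"
        "print(json.dumps({'price': float(match.group(1))}))\n"
    )
    print(run_in_sandbox(generated_code, "<span class='price'>10.5</span>"))
```

A failure inside the child process surfaces as an exception in the caller rather than as a crash of the main system, which is the property the paragraph above attributes to the sandbox.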
[0082] It is obvious to those skilled in the art that the modules or steps of the present invention described above can be implemented using general-purpose computing devices. They can be centralized on a single computing device or distributed across a network of multiple computing devices. Optionally, they can be implemented using program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described can be performed in an order different from that presented herein, or the modules or steps can be fabricated as separate integrated circuit modules, or multiple modules or steps among them can be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any particular combination of hardware and software.
[0083] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A multi-agent collaborative data acquisition method, characterized in that the method includes: in response to a data collection signal of a target public webpage, obtaining data collection requirement information of the target public webpage; simulating, by an RPA identification agent, a browser accessing the target public webpage, and capturing, during the access to the target public webpage, webpage source code containing dynamically rendered content; generating, by an encoding agent, data collection code containing data collection rules based on the data collection requirement information and the webpage source code, wherein the data collection rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules; and running, by a data acquisition agent, the data collection code in a preset isolated sandbox environment, performing structured parsing of the webpage source code by running the data collection code, and determining and collecting the required data in the target public webpage based on the structured parsing results.
2. The method according to claim 1, characterized in that determining and collecting the required data in the target public webpage based on the structured parsing results includes: determining structured snapshot data based on the structured parsing results; performing, by a format verification agent, data verification on the structured snapshot data based on preset data verification rules, wherein the preset data verification rules include at least one of data type verification rules, data format verification rules, and data range verification rules; if the data verification succeeds, using the structured snapshot data as the required data; if the data verification fails, generating a data verification failure report and sending it to the encoding agent, and controlling the encoding agent to regenerate new data collection code containing new data collection rules based on the data collection requirement information, the webpage source code, and the data verification failure report; and running, by the data acquisition agent, the new data collection code in the preset isolated sandbox environment, performing structured parsing of the webpage source code again by running the new data collection code, and determining and collecting the required data in the target public webpage based on the new structured parsing results.
3. The method according to claim 1, characterized in that, after determining and collecting the required data in the target public webpage based on the structured parsing results, the method further includes: determining the data collection log of the target public webpage, wherein the data collection log includes task logs and URL logs; determining the data attribute information of the required data, wherein the data attribute information includes at least one of a timestamp field and a business identifier field; determining, by a data storage agent, the target storage location information of the required data based on the data attribute information, wherein the target storage location information includes a target database instance identifier and a target data table identifier; and writing the required data, the data collection log, the webpage source code, and the data collection code into the target data table of the target database based on the target database instance identifier and the target data table identifier.
4. The method according to claim 1, characterized in that generating, by the encoding agent, the data collection code containing data collection rules based on the data collection requirement information and the webpage source code includes: generating code parsing prompts based on the webpage source code, identifying the location path, attribute features, and data type of the target field to be collected in the webpage source code based on the code parsing prompts, and generating a field-path-type mapping table based on the location path, attribute features, and data type; generating rule construction prompts based on the field-path-type mapping table, and generating the data collection rules based on the rule construction prompts; and generating code writing prompts based on the field-path-type mapping table and the data collection rules, and generating the data collection code based on the data collection requirement information and the webpage source code according to the code writing prompts.
5. The method according to claim 1, characterized in that the method further includes: in response to a data collection signal of a current webpage, obtaining current data collection requirement information, capturing the current webpage source code of the current webpage through the RPA identification agent, determining the current structural features of the current webpage source code, and generating the current structural fingerprint of the current webpage source code based on the current structural features; matching the current data collection code in a code cache library based on the current structural fingerprint and the current data collection requirement information; and running, by the data acquisition agent, the current data collection code in the preset isolated sandbox environment, performing structured parsing of the current webpage source code by running the current data collection code, and determining and collecting the required data in the current webpage based on the current structured parsing result; wherein the code cache library stores data collection codes corresponding to various page source codes and various data collection requirement information.
6. The method according to claim 1, characterized in that simulating, by the RPA identification agent, a browser accessing the target public webpage, and capturing, during the access to the target public webpage, the webpage source code containing dynamically rendered content includes: parsing the URL address and page metadata of the target public webpage, and determining the rendering engine type and asynchronous loading strategy parameters of the target public webpage based on the parsing results; generating a browser instance with dynamic monitoring capabilities based on the rendering engine type and the asynchronous loading strategy parameters; controlling the browser instance to load and monitor the target public webpage, and determining whether the target public webpage has completed dynamic rendering based on the monitoring results, wherein determining whether the target public webpage has completed dynamic rendering includes: determining whether the page network requests of the target public webpage are in an idle state and whether the DOM tree structure has remained unchanged within a preset time window, and, if the page network requests are in an idle state and the DOM tree structure has remained unchanged within the preset time window, determining that the target public webpage has completed dynamic rendering; and, after the target public webpage has completed dynamic rendering, determining the webpage source code based on the DOM tree structure that has remained unchanged within the preset time window.
7. The method according to claim 1, characterized in that, after determining and collecting the required data in the target public webpage based on the structured parsing results, the method further includes: determining the area image of the required data in the target public webpage; determining the semantic similarity between the data semantics in the area image and the data semantics of the required data; determining the collection accuracy of the required data based on the semantic similarity; and, if the collection accuracy does not meet the requirements, re-collecting data from the target public webpage.
8. A multi-agent collaborative data acquisition device, characterized in that the device includes: a code capture unit, used to respond to a data collection signal of a target public webpage, obtain data collection requirement information of the target public webpage, simulate a browser accessing the target public webpage through an RPA identification agent, and capture webpage source code containing dynamically rendered content during the access to the target public webpage; a code generation unit, used to generate, by an encoding agent, data collection code containing data collection rules based on the data collection requirement information and the webpage source code, wherein the data collection rules include at least one of field extraction rules, data format conversion rules, and abnormal data processing rules; and a data acquisition unit, used to run the data collection code in a preset isolated sandbox environment through a data acquisition agent, perform structured parsing of the webpage source code by running the data collection code, and determine and collect the required data in the target public webpage based on the structured parsing results.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 7.
10. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the computer program is executed by the processor, it implements the steps of the method according to any one of claims 1 to 7.