A method and system for automatic crawling, analyzing and structuring of government-enterprise policy data
By building an automated data capture, parsing, and structured data entry system, the problems of low efficiency, poor accuracy, and lack of full-process automation in the acquisition and processing of government and enterprise policy data have been solved. This system enables efficient, accurate, and fully automated processing of government and enterprise policy data, supporting intelligent government and enterprise policy services.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANXI WEIDA ENTERPRISE MANAGEMENT CONSULTING CO LTD
- Filing Date
- 2026-03-15
- Publication Date
- 2026-06-12
Smart Images

Figure CN122197864A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of data mining, natural language processing, and e-government informatization, specifically to a method and system for the automated capture, parsing, and structured storage of government and enterprise policies, and particularly to a technical implementation scheme for the automated collection, semantic parsing, standardized processing, and structured storage of unstructured government policy texts. Background Technology
[0002] In the field of government-enterprise policy services, the timeliness, accuracy, and structure of policy data directly determine the efficiency of enterprise application and the effectiveness of policy implementation, serving as the core data support for precise government-enterprise matching. Currently, existing methods for acquiring and processing policy data mainly rely on manual compilation and simple web scraping, which suffer from numerous core technical deficiencies: 1. Manual compilation is extremely inefficient and costly, and cannot keep up with the high-frequency updates of massive amounts of policies across the country. It is prone to omissions and delays in policy information, making it difficult to meet the real-time needs of government and enterprises for policy services. 2. Simple web crawlers can only obtain basic webpage source code, which is difficult to break through the anti-crawler mechanisms commonly set up by government platforms. The success rate of crawling is low, and they can only obtain raw unstructured webpage data. They cannot achieve deep semantic analysis of policy texts, and the output data is in a messy format and missing core elements. A lot of manpower is needed for secondary sorting and verification. 3. Although large models are increasingly widely used in the field of natural language processing, when directly used for policy text parsing, there are problems such as high model illusion rate, non-standard output format, and mismatch between extracted elements and the needs of government and enterprise policies and business. They lack deep integration with the rules of government and enterprise policies and business, resulting in the inability to directly structure and store the parsing results. 4. There is currently no complete technical solution in the industry that can automate the entire process of capturing, parsing, verifying, and storing government policies at all levels across the country, while simultaneously ensuring anti-crawler compatibility, data parsing accuracy, system scalability, and data credibility. This has become a key bottleneck restricting the development of intelligent government and enterprise policy services.
[0003] Therefore, there is an urgent need to build an automated method and system for capturing, parsing, and structuring policy data for government and enterprises. This system should integrate multiple technologies to achieve automated processing of policy data throughout the entire process, address the core pain points of existing technologies, and provide high-quality data support for intelligent policy services for government and enterprises. Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention provides an automated method and system for capturing, parsing, and structuring policy data for government and enterprises. It solves the technical problems of low efficiency in policy data collection, difficulty in overcoming anti-crawler measures, inaccurate parsing results, difficulty in directly storing unstructured data, and the lack of a fully automated processing solution in existing technologies. This invention achieves fully automated, standardized, and structured processing of policy data from collection to storage, while ensuring high system scalability, high stability, and high data reliability.
[0005] To achieve the above objectives, the present invention is implemented through the following core technical solutions: I. Automated methods for capturing, parsing, and structured storing government and enterprise policies Data source configuration and automated crawling: Maintain a list of policy data source URLs covering all levels of government platforms, including national ministries, provinces, cities, and districts / counties, and configure corresponding crawler tasks; utilize the HeadlessChrome dynamic rendering engine to load dynamic page content generated by JavaScript, and combine a combination of anti-crawler strategies such as request header spoofing, dynamic IP proxy pool rotation, request frequency control, and automatic CAPTCHA recognition to effectively bypass the anti-crawler mechanisms of government platforms and obtain the original policy webpage HTML source code data; simultaneously support two modes: timed polling crawling and incremental crawling based on website update notifications, taking into account both the comprehensiveness and timeliness of policy data.
[0006] Webpage cleaning and precise extraction of core text: Based on a method combining Document Object Model (DOM) tree structure analysis and text density calculation, the crawled HTML source code is parsed; the core content area of the webpage is located through DOM tree structure analysis, and paragraphs with text density exceeding a preset threshold are selected by text density calculation. Noise information such as sidebars, footers, advertisements, and navigation bars are precisely filtered out, and only the policy title, issuing agency, issuance date, and core content of the policy text are retained and extracted to form a plain text policy document, providing a clean data foundation for subsequent analysis.
[0007] Standardized semantic parsing of all elements in the large model: Construct a dedicated prompt word template containing 3-5 different types of policy few-sample parsing examples. Take the cleaned policy plain text as input, concatenate it with the prompt word template, and then pass it into the large model interface. Guide the large model to achieve deep semantic understanding of policy text through few-sample examples, accurately extract core business elements, and output structured data in standardized JSON format. Core fields uniformly cover applicable regions, applicable industries, applicant qualifications, business indicator requirements, intellectual property requirements, subsidy amount / proportion, and application start and end time, ensuring that the parsing results are uniform in format and complete in elements.
[0008] The rule engine performs full-dimensional validation and standardized error correction: It constructs a rule engine using a regular expression library, a national administrative division dictionary, and a set of government and enterprise policy business logic rules. This engine performs full-dimensional logical validation and standardized correction on the JSON-formatted data output by the large model. It completes operations such as unifying the application time format to the YYYY-MM-DD standard format, normalizing the applicable regional administrative division levels, and completing the unit for subsidy amounts. It automatically corrects logical errors such as application times earlier than document issuance times and subsidy ratios exceeding 100%, and marks and warns data missing core fields. Ultimately, it generates standardized, logically error-free structured policy data.
[0009] Automated storage and traceability management of structured data: Standardized structured policy data, verified and corrected by the rule engine, is mapped to a pre-defined table structure in the database; a unique 32-character policy ID is generated for each policy, and a full-text search index is established based on the policy title, applicable industry, and core clauses; a traceability link is generated pointing to the original policy release webpage to ensure data traceability; data is automatically written into the policy database, and each data entry is tagged with multi-dimensional labels indicating data source, update time, and data status. A data version management mechanism is also established, overwriting old versions of data and retaining historical versions when policies are updated, ensuring data integrity and traceability.
[0010] II. An automated system for capturing, parsing, and structured storing government and enterprise policies This system adopts a highly scalable microservice architecture, supports horizontal scaling of server nodes, and can adapt to high-concurrency policy data capture and parsing tasks. The system includes multiple government data sources and seven core functional modules that interact with each other through a service bus: 1- Data Acquisition Module, 2- Web Page Cleaning Module, 3- Large Model Parsing Module, 4- Rule Validation Module, 5- Data Storage Module, 6- Monitoring and Alarm Module, and 7- Storage and Processing Module. Each module is independent yet works in tandem to achieve fully automated processing of policy data.
[0011] Data Acquisition Module: Used to configure crawler tasks for the entire government affairs platform, schedule hardware resources, and capture the source code of target web pages; supports two modes: timed polling crawling and incremental crawling based on website update notifications; configured with anti-crawler adaptation components such as dynamic IP proxy pool, request header spoofing, and automatic CAPTCHA recognition, and integrated with the HeadlessChrome dynamic rendering engine, effectively breaking through the anti-crawler mechanism of the government affairs platform.
[0012] Webpage Cleaning Module: Used to parse HTML source code, it employs a combination of DOM tree structure analysis and text density calculation to accurately filter noisy information on webpages, extract clean policy titles, issuing agencies, publication dates, and core content, and output plain text policy documents.
[0013] Large Model Parsing Module: It has a built-in prompt word template library containing few sample examples, which is used to receive cleaned policy plain text and call the large model interface to perform key information extraction tasks; it adopts the adapter pattern to connect to the interfaces of various large model service providers such as GPT, Wenxin Yiyan, and Tongyi Qianwen, shielding the differences in interface parameters and return formats of different models, realizing seamless integration of large model parsing capabilities, and outputting standardized JSON format structured data.
[0014] Rule Validation Module: Built-in rule engine library, which includes a regular expression library, a national administrative division dictionary, and a set of government and enterprise policy business logic rules. It is used to define data cleaning and validation rules, perform full-dimensional post-processing, standardization correction, and logical error correction on the output results of large models, and generate standardized structured policy data without logical errors.
[0015] Data entry module: Used to write the final verified structured data into the policy database; realizes automatic mapping between data and database table structure, full-text search index creation, and original webpage traceability link generation; adds multi-dimensional tags to the data and establishes a data version management mechanism; new data sources can be flexibly added through configuration files, and can be adapted to various government platforms without modifying the code, greatly improving system adaptability.
[0016] Monitoring and alarm module: Used to monitor the operating status of each module of the system in real time, as well as key indicators such as policy capture success rate, parsing accuracy rate, and task processing speed; set threshold warnings, and push alarm information to system operation and maintenance personnel in real time via SMS, WeChat, etc. when capture failure, parsing accuracy rate is not up to standard, or module operation is abnormal, to ensure the stable operation of the system.
[0017] Storage processing module: includes memory and high-performance processor; memory adopts distributed storage architecture to store crawler configuration information, prompt word template library, rule engine library, structured policy data and system operation logs; processor calls computer programs in memory to execute all steps of the automated crawling, parsing and structured storage method for government and enterprise policies described in this invention, providing core computing support for system operation.
[0018] Beneficial effects 1. Achieve automated collection of policy data across all levels, overcoming anti-crawler bottlenecks: This invention effectively overcomes the anti-crawler mechanisms of government platforms by configuring data sources across all levels of government platforms and combining dynamic rendering engines with request header spoofing, dynamic IP proxy pool rotation, and other combined anti-crawler strategies, significantly improving the success rate of policy webpage crawling; it also supports both timed polling and incremental crawling modes, taking into account the comprehensiveness and timeliness of policy data, and completely solving the technical pain points of low efficiency in manual collection and low success rate of simple crawling.
[0019] 2. Improve the accuracy and standardization of policy analysis and reduce the impact of model illusion: This invention combines the deep semantic understanding capabilities of large models with the deterministic verification capabilities of rule engines. By using a few sample prompt word templates to guide large models to output standardized and complete analysis data, the rule engine then completes full-dimensional standardization correction and logical error correction, which greatly reduces the analysis error caused by model illusion, improves the accuracy and standardization of policy analysis, and ensures that the analysis results can be directly entered into the database for use.
[0020] 3. Achieve fully automated processing of policy data and reduce labor costs: This invention establishes a complete technical chain from government webpage crawling, core text extraction, semantic parsing, verification and error correction to structured data entry, enabling fully automated processing of policy data from unstructured webpage text to standardized structured data. This eliminates the need for manual secondary processing and verification, significantly reducing labor costs and providing unified, accurate, and real-time data support for intelligent policy matching and precise government-enterprise docking.
[0021] 4. High scalability and adaptability of the system to meet the continuous growth of business needs: The system of this invention adopts a microservice architecture, supports horizontal scaling, and can adapt to high-concurrency policy capture and parsing tasks; the large model parsing module adopts the adapter pattern to be compatible with multiple service provider interfaces, and the data entry module supports adding new data sources without code modification, which can adapt to various government platforms without modifying the code, and can effectively adapt to the business needs of frequent updates of government platforms and the continuous growth of policy data volume.
[0022] 5. Establish a data traceability and version management mechanism to enhance data credibility: This invention generates a unique policy ID and original webpage traceability link for each piece of structured policy data, establishes a data version management mechanism, and sets up a monitoring and alarm module to monitor the system operation status in real time, promptly detect and warn of abnormal issues, ensure the traceability, integrity and credibility of policy data, and enhance the value of data use.
[0023] 6. Multi-technology integration enables technological innovation and promotes intelligent government and enterprise policy services: This invention integrates multiple technologies such as web crawling, dynamic rendering, natural language processing, large model application, rule engine, and microservice architecture to achieve technological innovation in government and enterprise policy data processing. It completely solves the core bottlenecks of existing technologies and provides a high-quality data foundation for intelligent services such as intelligent matching, accurate push, and automated application of government and enterprise policies, thereby promoting the digital and intelligent upgrading of the government and enterprise policy service field. Attached Figure Description
[0024] Figure 1 This is a flowchart illustrating the overall processing flow of the method of the present invention; Figure 2 This is a block diagram of the modular architecture of the system of the present invention; Figure 3This is a flowchart detailing the large model parsing and rule engine verification process of this invention.
[0025] The system includes: 1. Multi-source government data sources; 2. Data collection module; 3. Webpage cleaning module; 4. Large model parsing module; 5. Rule verification module; 6. Data storage module; 7. Policy knowledge base; 8. Monitoring and alarm module; 9. Storage and processing module; 10. Prompt word template library; and 11. Rule engine library. Detailed Implementation
[0026] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0027] Example 1 As one aspect of the present invention, please refer to the appendix. Figure 1 and attached Figure 3 This invention provides an automated method for capturing, parsing, and structured storage of government and enterprise policies, applicable to the unified processing of policy data from government service platforms at all levels nationwide. The specific implementation steps are as follows: Data source configuration and automated crawling: Resources from government service platforms at all levels across the country were analyzed, and a list of policy data source URLs covering all levels of government service platforms, including the National Development and Reform Commission, provincial government service websites, and municipal / county government websites, was configured. Corresponding crawling tasks were created. A daily polling mechanism was set up at 8 AM and 8 PM, and website update notification listening was enabled to achieve incremental crawling. The HeadlessChrome dynamic rendering engine was enabled to load dynamic content generated by JavaScript in web pages, a dynamic IP proxy pool was configured to achieve IP address rotation, browser information was disguised in the request headers, a request frequency control strategy of once per second was set, and an automatic CAPTCHA recognition component was integrated. Through the above combined anti-crawler strategies, the anti-crawler mechanisms of government service platforms were bypassed, and the original HTML source code data of policy web pages was stably obtained.
[0028] Webpage cleaning and precise core text extraction: The crawled HTML source code is fed into the webpage cleaning program. The core content area of the webpage is located through DOM tree structure analysis. Combined with text density calculation methods, paragraphs with text density higher than a preset threshold are selected. Noise information such as sidebars, footers, advertisements, and navigation bars are precisely filtered out. Only the policy title, issuing agency, issuance date, and core content of the policy text are extracted to form a plain text policy document, ensuring the cleanliness of subsequent data parsing.
[0029] The large-scale model performs standardized semantic parsing of all elements: It retrieves a special template for government and enterprise policy parsing from a pre-set prompt word template library. This template contains 3-5 sample parsing examples of different types of policies, including subsidies for specialized and innovative enterprises, additional deductions for R&D expenses, and special funds for the integration of informatization and industrialization. The cleaned policy plain text is used as input, concatenated with the prompt word template, and then passed to the large-scale model interface. The large-scale model completes full-element parsing based on deep semantic understanding, and outputs structured data in JSON format, guided by the template, including applicable regions, industry classifications, eligibility conditions, operating indicators, intellectual property requirements, subsidy amounts / proportions, and application start and end times, ensuring that the parsing results are uniform in format and complete in elements.
[0030] The rule engine performs full-dimensional validation and standardized error correction: JSON data output from the large model is input into the rule engine, which then calls its built-in regular expression library, national administrative division dictionary, and government-enterprise policy business logic rule set for full-dimensional validation. Regular expressions are used to uniformly convert various non-standard date formats into the YYYY-MM-DD standard format. The national administrative division dictionary standardizes expressions such as "a certain city, a certain district" and "a certain province, a certain city" into the administrative division hierarchy format of "XX province, XX city, XX district". Standard units such as "ten thousand yuan" and "yuan" are added to subsidy amounts with missing units. Logical errors such as application times earlier than document issuance times and subsidy ratios exceeding 100% are automatically corrected. Data lacking core fields is marked with warnings, ultimately generating standardized, logically error-free structured policy data.
[0031] Automated storage and traceability management of structured data: Standardized structured policy data, after verification and correction, is mapped to a preset table structure in the MySQL policy database, generating a unique 32-character policy ID for each policy; a full-text search index is built for each data entry based on the policy title, applicable industry, and core clauses, while a traceability link pointing to the original policy release webpage is generated; data is automatically written into the policy database, and multi-dimensional tags are added to the data, including data source, update time, and data status. The data source is marked as a specific platform such as the XX Provincial Government Service Website or the XX Municipal Government Website, the update time is the time of completion of crawling and parsing, and the data status is divided into valid / pending verification; a data version management mechanism is established, and if it is a policy update, the old version data is overwritten while historical versions are retained to ensure the integrity, traceability, and credibility of the data.
[0032] Example 2 As one aspect of the present invention, please refer to the appendix. Figure 1 and attached Figure 3This invention provides an automated system for capturing, parsing, and structuring government and enterprise policies, deployed on a cloud server cluster. It employs a microservice architecture and supports horizontal scaling of server nodes based on business needs. The system can handle high-concurrency capture and parsing tasks of massive amounts of government policy data nationwide. The system includes multiple government data sources and seven core functional modules that sequentially interact via a cloud service bus: 2-Data Acquisition Module, 3-Web Page Cleaning Module, 4-Large Model Parsing Module, 5-Rule Validation Module, 6-Data Storage Module, 8-Monitoring and Alarm Module, and 9-Storage Processing Module.
[0033] The data acquisition module consists of a crawler task configuration unit, a dynamic rendering unit, an anti-crawler adaptation unit, and a scheduling unit. The crawler task configuration unit supports visual configuration of crawler tasks and URL lists for all levels of government platforms. The dynamic rendering unit uses HeadlessChrome to achieve dynamic page loading. The anti-crawler adaptation unit integrates a dynamic IP proxy pool, request header spoofing, and automatic CAPTCHA recognition components. The scheduling unit implements task scheduling for timed polling and incremental crawling, stably completing the crawling and aggregation of target webpage source code.
[0034] The webpage cleaning module consists of an HTML parsing unit, a noise filtering unit, and a text extraction unit. By combining DOM tree structure analysis with text density calculation, it accurately filters webpage noise and extracts policy titles, issuing agencies, publication dates, and core content in plain text format, providing a clean data foundation for subsequent parsing.
[0035] Large model parsing module: Built-in 10-prompt word template library, compatible with interfaces of various large model service providers such as GPT, Wenxin Yiyan, and Tongyi Qianwen; It shields the interface parameters and return format differences of different models through the adapter pattern, achieving seamless integration of large model parsing capabilities; It receives cleaned policy plain text, calls the large model interface to complete the extraction of key elements of the policy text, and outputs standardized JSON format structured data.
[0036] Rule validation module: Built-in 11-rule engine library, including regular expression library, national administrative division dictionary, and government and enterprise policy business logic rule set; completes full-dimensional standardization correction, logical error correction and missing field marking of large model parsing data, and generates standardized structured policy data without logical errors.
[0037] The data import module consists of a data mapping unit, an index building unit, a source link generation unit, and a version maintenance unit. It realizes automatic mapping of structured data to database table structure, full-text search index building, original webpage source link generation, and full lifecycle maintenance of data versions. It supports the flexible addition of new government data sources through configuration files, and can be adapted to various government platforms without modifying the code, greatly improving the system's adaptability.
[0038] Monitoring and Alarm Module: Collects real-time operational metrics for each module, including capture success rate, parsing accuracy, task processing speed, and server resource utilization. It sets thresholds for alerts when the capture success rate is below 90% and the parsing accuracy is below 85%. When capture failures, substandard parsing accuracy, module malfunctions, or excessively high server resource utilization occur, alarm information is pushed to system maintenance personnel in real-time via SMS, WeChat, and other means to ensure stable system operation.
[0039] The storage processing module includes a cloud-based distributed storage device and a high-performance processor. The cloud-based distributed storage device stores crawler configuration information, prompt word template library, rule engine library, structured policy data, and system operation logs. The high-performance processor calls the computer program in the storage device to execute all steps of the automated crawling, parsing, and structured storage method for government and enterprise policies as described in Example 1. The system's crawler cluster and API service cluster are deployed separately and can be expanded independently to meet the continuous collection and parsing needs of massive amounts of government policy data nationwide.
[0040] The standardized and structured policy data generated by this system provides services such as policy query, element retrieval, and intelligent matching through API interfaces. It provides high-quality and reliable data support for intelligent services such as intelligent matching, precise push, and automated application of government and enterprise policies, and promotes the digital and intelligent upgrading of the government and enterprise policy service field.
[0041] Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the core technical principles and spirit of the present invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A method for automatically capturing, parsing, and structured storing government and enterprise policies, characterized in that, To achieve fully automated and standardized processing of policy data from collection to storage, the following steps are included: S1: Configure a list of policy data source URLs covering all levels of government platforms at the national, provincial, municipal, and county levels, and configure corresponding crawler tasks. Use the HeadlessChrome dynamic rendering engine to load dynamic page content generated by JavaScript. Combine a combination of anti-crawler strategies, such as request header spoofing, dynamic IP proxy pool rotation, request frequency control, and automatic CAPTCHA recognition, to obtain the original policy webpage HTML source code data. It also supports dual modes of timed polling crawling and incremental crawling based on website update notifications. S2: A method combining Document Object Model (DOM) tree structure analysis and text density calculation is used to parse HTML source code, filter out noisy information such as sidebars, footers, advertisements, and navigation bars, and extract policy titles, issuing agencies, issuance dates, and core content of the policy text to form plain text policy documents; S3: Construct a prompt word template containing 3-5 different types of policy sample parsing examples. Concatenate the cleaned policy plain text with the prompt word template and input it into the large model interface. Guide the large model to output standardized JSON format structured data containing applicable regions, applicable industries, applicant qualifications, business indicator requirements, intellectual property requirements, subsidy amount / proportion, and application start and end time. S4: By building a rule engine through a regular expression library, a national administrative division dictionary, and a set of government and enterprise policy business logic rules, the engine performs full-dimensional logical verification and standardization correction on the JSON format data output by the large model. It unifies the application time to the YYYY-MM-DD standard format, normalizes the administrative division level of the applicable area, and completes the standard unit for the subsidy amount. It automatically corrects logical errors in data where the application time is earlier than the document issuance time or the subsidy ratio exceeds 100%. It marks and warns data that is missing core fields, and generates standardized structured policy data. S5: Maps standardized and structured policy data to a pre-defined table structure in the database, generates a unique 32-character policy ID for each policy, establishes a full-text search index based on the policy title, applicable industry, and core clauses, generates a source link pointing to the original policy release webpage, automatically writes the data into the policy database and adds multi-dimensional tags for data source, update time, and data status, and establishes a data version management mechanism to overwrite old versions of data and retain historical versions when policies are updated.
2. The method for automated capture, parsing, and structured storage of government and enterprise policies according to claim 1, characterized in that, The incremental crawling mode described in step S1 accurately obtains the latest policy data by listening to website update notifications. The request frequency control strategy flexibly configures the access frequency according to the characteristics of the government affairs platform to ensure crawling efficiency and compliance.
3. The method for automated capture, parsing, and structured storage of government and enterprise policies according to claim 1, characterized in that, The few-sample parsing examples described in step S3 cover typical government and enterprise policy types such as subsidies for specialized and innovative enterprises, additional deductions for R&D expenses, and special funds for the integration of informatization and industrialization. During the large model parsing process, the output format is constrained by prompt word templates to ensure that the fields are consistent and there are no missing elements.
4. The method for automated capture, parsing, and structured storage of government and enterprise policies according to claim 1, characterized in that, The administrative division level normalization mentioned in step S4 unifies non-standard expressions such as "a certain city and a certain district" and "a certain province and a certain city" into a three-level administrative division format of "XX province, XX city, XX district". The subsidy amount standard unit includes "yuan" and "ten thousand yuan", and is automatically matched and completed according to the policy scenario.
5. An automated system for capturing, parsing, and structured storage of government and enterprise policies, characterized in that: The system adopts a microservice architecture and supports horizontal scaling. It is used to execute the method described in any one of claims 1-4, including a multi-source government data source (1) and seven core modules that realize data interaction through a service bus in sequence: data acquisition module (2), web page cleaning module (3), large model parsing module (4), rule verification module (5), data storage module (6), monitoring and alarm module (8), and storage processing module (9). Each module is functionally independent and works in synergy to realize the full-process automated processing of policy data.
6. The automated system for capturing, parsing, and structured storage of government and enterprise policies according to claim 5, characterized in that, The data acquisition module consists of a crawler task configuration unit, a dynamic rendering unit, an anti-crawler adaptation unit, and a scheduling unit. The anti-crawler adaptation unit integrates a dynamic IP proxy pool, request header spoofing, and automatic CAPTCHA recognition components. The scheduling unit realizes intelligent scheduling of tasks for timed polling and incremental crawling.
7. The automated system for capturing, parsing, and structured storage of government and enterprise policies according to claim 5, characterized in that, The large model parsing module (4) has a built-in prompt word template library (10) and uses the adapter mode to connect to the interfaces of various large model service providers such as GPT, Wenxin Yiyan, and Tongyi Qianwen, shielding the differences in interface parameters and return formats of different models, and realizing seamless connection of large model parsing capabilities.
8. The automated system for capturing, parsing, and structured storage of government and enterprise policies according to claim 5, characterized in that, The rule verification module (5) has a built-in rule engine library (11), which includes a regular expression library, a national administrative division dictionary, and a set of government and enterprise policy business logic rules to complete the full-dimensional standardization correction, logical error correction, and missing field marking of the large model parsing data.
9. The automated system for capturing, parsing, and structured storage of government and enterprise policies according to claim 5, characterized in that, The data entry module (6) consists of a data mapping unit, an index creation unit, a traceability link generation unit, and a version maintenance unit. It supports the flexible addition of new government data sources through configuration files and can adapt to various government platforms without modifying the system code.
10. The automated system for capturing, parsing, and structured storage of government and enterprise policies according to claim 5, characterized in that, The monitoring and alarm module (8) collects system operation indicators such as capture success rate, parsing accuracy rate, task processing speed, and server resource utilization rate in real time, sets threshold warnings for capture success rate below 90% and parsing accuracy rate below 85%, and pushes abnormal alarm information to operation and maintenance personnel via SMS and WeChat. The storage processing module (9) adopts a cloud-based distributed storage and high-performance processor to realize distributed storage of crawler configuration information, template library, rule library, structured policy data and operation logs. The crawler cluster and API service cluster are deployed separately and expanded independently.