An intelligent structured processing and classification analysis system for grassroots governance events

By combining a large language model with regular expressions in a collaborative parsing model, the invention addresses the problem of structuring multi-source heterogeneous texts of grassroots governance events, enabling efficient event data retrieval and accurate governance decision support, and thereby improving the digitalization and intelligence level of grassroots governance.

CN122285673A · Pending · Publication Date: 2026-06-26 · ZHEJIANG HAISHU TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG HAISHU TECH CO LTD
Filing Date
2026-05-28
Publication Date
2026-06-26


Abstract

This invention provides an intelligent structured processing and classification analysis system for grassroots governance events, including: a data access module configured to connect to grid worker mobile reporting, government service hotline feedback, and monitoring and early warning channels through a multi-source heterogeneous data adaptation interface, collecting raw text data of grassroots governance events from multiple sources. This invention utilizes a core mechanism of collaborative parsing using a large language model and regular expressions, combining the semantic understanding capabilities of the large language model for colloquial and fragmented text with the efficient and accurate characteristics of regular expression matching, thus solving the problems of high inference costs and large response delays associated with single large language model solutions. Simultaneously, it introduces a dedicated dictionary for grassroots governance to automatically correct typos and complete non-standard expressions, mapping dialects, slang, and non-standard time and location formats to standard field values, improving the completeness of structured conversion and data standardization.

Description

Technical Field

[0001] This invention relates to an intelligent structured processing and classification analysis system for grassroots governance events, belonging to the fields of natural language processing and grassroots governance informatization technology.

Background Technology

[0002] Grassroots governance is a core component of social governance, and the efficiency of its incident handling directly impacts the level of governance sophistication. Currently, incidents in grassroots governance come from a wide range of sources, including reports from grid workers' mobile devices, feedback from the 12345 hotline, and community monitoring and early warning systems. The incident data from different channels is unstructured or semi-structured, and the differences in data formats make it difficult for grassroots governance platforms to efficiently retrieve incident data, significantly increasing the difficulty of integration and utilization.

[0003] In existing technologies, the follow-up processing of grassroots governance incidents lacks a standardized event data storage system and intelligent statistical analysis methods. This makes it difficult for grassroots governance personnel to quickly and accurately count the frequency of different types and labels of events within a given time period. They struggle to extract high-frequency problems and event patterns from massive amounts of event data, and governance decisions still rely on experience-based judgment, failing to achieve data-driven targeted governance. The digitalization and intelligence of grassroots governance are low, failing to meet the current development needs of refined governance. Currently, although some government data processing systems attempt to achieve structured data processing, these systems are mostly applicable to single-source government data and are not adapted to the multi-source nature of grassroots governance events.

[0004] Furthermore, grassroots governance event texts are characterized by colloquialism and fragmentation, containing numerous dialectal expressions, typos, and non-standard time and location formats. Existing general-purpose Natural Language Processing (NLP) tools lack the accuracy to extract key elements from such low-quality texts, making it difficult to effectively extract structured information such as event time, location, and involved parties. Regarding automated processing, while pure large language model (LLM) solutions are well-adapted to colloquial expressions, their high inference costs and large response latency make them unsuitable for real-time processing of high-frequency grassroots events. Pure rule-matching solutions, while fast, struggle to cover complex and varied colloquial expressions, resulting in low coverage and insufficient generalization ability. Therefore, achieving high-accuracy and high-completeness automated structured conversion of multi-source heterogeneous text data in the grassroots governance field has become an urgent technical problem. To address this, an intelligent structured processing and classification analysis system for grassroots governance events is proposed.

Summary of the Invention

[0005] In view of this, the present invention provides an intelligent structured processing and classification analysis system for grassroots governance events, in order to solve or alleviate the technical problems in the existing technology of colloquial expressions, format differences and field omissions in multi-source heterogeneous text data in the field of grassroots governance, and the lack of a unified standard data structure, which makes it difficult to retrieve data efficiently, and at least provides a useful alternative.

[0006] The technical solution of the present invention is implemented as follows: a smart structured processing and classification analysis system for grassroots governance events, comprising: a data access module, configured to connect to the reporting channels of grid workers' mobile terminals, government service hotline feedback and monitoring and early warning channels through a multi-source heterogeneous data adaptation interface, collect the original text data of multi-source grassroots governance events, and store the original text data in an unstructured text cache area after removing non-text symbols and modal particles, unifying encoding and preprocessing sentence segmentation;

[0007] The structured processing module is configured to invoke a semantic parsing model that combines regular expressions with a large language model fine-tuned on a grassroots governance corpus. It sequentially performs deterministic field extraction and semantic element extraction on the raw text data in the unstructured text cache. Specifically, it uses regular expressions for deterministic extraction of time, location, and name fields with clear format characteristics. For core event keywords, it uses a few-shot prompt learning template to drive semantic extraction by the large language model, extracting time elements, location elements, involved subject elements, and core event description information. For missing fields, it performs logical completion and consistency checks based on sliding window contextual semantic association rules and a preset grassroots governance element dictionary. Finally, it performs field mapping and format standardization according to a preset standardized data structure model to generate structured event data. This standardized data structure model includes fields for event number, reporting time, reporting source, jurisdiction, detailed address, core description, subject type, subject information, event classification, event tag, and processing status.

[0008] In a further preferred embodiment, the data access module is also configured to: collect text event content from various sources through the multi-source heterogeneous data adaptation interface, clean the original text data to remove duplicate content and special symbols, unify the encoding format, standardize line breaks and spaces, call the large language model to correct typos in the text, split long texts into sentences and segments, and retain the core description segment of the event.

[0009] Further preferably, the structured processing module is configured to: perform semantic similarity matching between the extracted information items and the target fields of the standardized data structure model; call the large language model to perform reasoning and completion based on contextual semantics and the dictionary of grassroots governance elements for unmatched fields; and fill in default values to ensure the structural integrity of the structured event data.

[0010] More preferably, the semantic similarity matching employs a hybrid algorithm combining edit distance and word vector cosine similarity, calculating a weighted composite score S of the text edit distance score S1 and the word vector cosine similarity score S2 with weight α, where α is 0.4. The formula itself is not reproduced in this text; a form consistent with a single weight would be S = α·S1 + (1 − α)·S2. If the weighted composite score S is greater than or equal to the preset threshold θ, the mapping is determined to be successful; otherwise it is determined to be unmatched, where θ is 0.8.

[0011] Further preferably, it also includes a data storage module, configured to persistently store the structured event data to a dedicated data table for grassroots governance according to preset database storage rules, and to establish a composite index for the event number, reporting time, subject information and subject type fields to support efficient retrieval under multiple conditions.

[0012] Further preferably, it also includes an automatic classification module, configured to perform semantic analysis on the core description fields in the structured event data and match them with a preset grassroots governance event classification system to achieve automatic event classification and synchronize the classification results to the event classification field of the structured event data.

[0013] Further preferably, it also includes an information association module, configured to extract the main information from the structured event data, match and associate it with the grassroots personnel information database and / or the jurisdiction enterprise information database, synchronize the association results to the main information field of the structured event data, and update the associated event statistics information of the corresponding database.

[0014] Further preferably, it also includes an intelligent tagging module, configured to extract core features from the core description and event classification fields of the structured event data based on keyword extraction and semantic similarity matching algorithms, and match them with the grassroots governance event tag library, automatically matching tags for events and synchronizing them to the event tag field of the structured event data.

[0015] Further preferred embodiments include a statistical analysis module, configured to perform multi-dimensional event frequency statistics based on the grassroots governance-specific data table and output statistical results according to query conditions.

[0016] The embodiments of the present invention have the following advantages due to the adoption of the above technical solutions:

[0017] First, this invention constructs a core mechanism for collaborative parsing of a large language model and regular expressions. It combines the semantic understanding capabilities of the large language model for colloquial and fragmented text with the efficient and accurate characteristics of regular expression matching, thus solving the problems of high inference costs and large response delays associated with single large language model solutions. At the same time, it introduces a specialized dictionary for grassroots governance to automatically correct typos and complete non-standard expressions, mapping dialects, slang, and non-standard time and location formats to standard field values, thereby improving the completeness of structured conversion and data standardization.

[0018] Second, this invention achieves integrated processing of element extraction, subject association, automatic classification, and intelligent tagging of multi-source heterogeneous event texts, reducing the burden on grassroots staff. Based on a standardized data system and multi-dimensional statistical analysis capabilities, it enables grassroots governance personnel to quickly and accurately grasp the frequency and distribution patterns of different types of events within any time period, automatically extract high-frequency core issues from massive events, and transform governance decisions from relying on experience-based judgment to data-driven targeted governance, thereby improving the digitalization and intelligent precision of grassroots governance.

[0019] The above overview is for illustrative purposes only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the invention will become readily apparent from the accompanying drawings and the following detailed description.

Attached Figure Description

[0020] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0021] Figure 1 is a schematic diagram of the various modules of the present invention.

[0022] Figure 2 is a schematic diagram of the processing flow of the present invention.

[0023] Figure 3 is a schematic diagram of the standardized data structure model of the present invention.

[0024] Figure 4 is a schematic diagram illustrating the association of structured event data with the grassroots personnel information database in this invention.

[0025] Figure 5 is a schematic diagram illustrating the association of structured event data with the grassroots governance event tag library in this invention.

[0026] Attached reference numerals: 1. Data access module; 2. Structured processing module; 3. Data storage module; 4. Automatic classification module; 5. Information association module; 6. Intelligent labeling module; 7. Statistical analysis module.

Detailed Implementation

[0027] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. However, it should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of the invention. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concept of the invention.

[0028] In the description of this invention, it should be noted that when an element is referred to as being "fixed to" or "set on" another element, it can be directly on or indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to or indirectly connected to the other element.

[0029] In the description of this invention, it should be noted that the terms "center," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, or the orientation or positional relationship commonly used when the product of this invention is in use. They are used only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first," "second," and "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified. "Several" means one or more, unless otherwise explicitly specified.

[0030] In the description of this invention, it should also be noted that, unless otherwise explicitly specified and limited, the terms "set," "install," "connect," and "link" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.

[0031] The embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0032] As shown in Figures 1-5, this embodiment of the invention provides an intelligent structured processing and classification analysis system for grassroots governance events, including: a data access module 1, configured to connect to the reporting channels of grid workers' mobile terminals, government service hotline feedback and monitoring and early warning channels through a multi-source heterogeneous data adaptation interface, collect the original text data of multi-source grassroots governance events, and store the original text data in an unstructured text cache area after removing non-text symbols and modal particles, unifying the encoding and preprocessing the sentence segmentation;

[0033] In one embodiment, the structured processing module 2 is configured to invoke a semantic parsing model that combines a large language model fine-tuned from grassroots governance corpus with regular expressions. This model sequentially performs deterministic field extraction and semantic element extraction on the raw text data in the unstructured text cache. Specifically, time, location, and name fields with clear format characteristics are extracted deterministically using regular expressions. Key event keywords are extracted semantically using a few-shot prompting learning template-driven large language model, extracting time elements, location elements, involved subject elements, and core event description information. Missing fields are logically completed and their consistency verified based on sliding window contextual semantic association rules and a preset grassroots governance element dictionary. Finally, field mapping and format standardization are performed according to a preset standardized data structure model to generate structured event data. This standardized data structure model includes fields for event number, reporting time, reporting source, jurisdiction, detailed address, core description, subject type, subject information, event classification, event tag, and processing status.

[0034] The large language model fine-tuned on the grassroots governance corpus uses LoRA (low-rank adaptation). An incremental parameter ΔW=A×B is superimposed on the weights W of the base model, where the rank r=14 and the scaling factor α=16. The model is trained in three stages, with standard samples, noisy samples, and difficult samples in a 6:3:1 ratio. The learning rate starts at 2e-5 and follows a cosine decay schedule, in order to improve the model's accuracy in extracting elements from colloquial texts, misspelled texts, and non-standard format texts from grassroots levels.
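The low-rank update and learning-rate schedule described above can be sketched as follows. This is only an illustration under stated assumptions: the matrix shapes, the zero initialization of B, the α/r scaling (a common LoRA convention that the text does not state), and the helper names `lora_effective_weight` and `cosine_decay_lr` are not taken from the application.

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha=16.0):
    """Return W + (alpha / r) * (A @ B), where r is the shared inner rank.

    The alpha / r scaling is an assumed convention; the text only gives
    delta_W = A x B, r = 14 and alpha = 16.
    """
    r = A.shape[1]
    if B.shape[0] != r:
        raise ValueError("A and B must share the inner rank dimension")
    return W + (alpha / r) * (A @ B)

def cosine_decay_lr(step, total_steps, base_lr=2e-5):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + np.cos(np.pi * progress))

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 32, 14           # illustrative layer size, rank from the text
W = rng.normal(size=(d_out, d_in))    # frozen base weights
A = rng.normal(size=(d_out, r)) * 0.01
B = np.zeros((r, d_in))               # zero init: adapted model starts equal to base
W_eff = lora_effective_weight(W, A, B)
```

Only A and B would be trained, which is d_out·r + r·d_in parameters per adapted layer instead of d_out·d_in.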

[0035] The data access module 1 is also configured to: collect text event content from various sources through a multi-source heterogeneous data adaptation interface, clean the original text data to remove duplicate content and special symbols, unify the encoding format, standardize line breaks and spaces, call a large language model to correct typos in the text, process long texts into sentences and segments, and retain the core description of the event.

[0036] The structured processing module 2 is also configured to: perform semantic similarity matching between the extracted information items and the target fields of the standardized data structure model; call the large language model to perform reasoning and completion based on contextual semantics and the dictionary of grassroots governance elements for unmatched fields; and fill in default values to ensure the structural integrity of the structured event data.

[0037] Semantic similarity matching employs a hybrid algorithm combining edit distance and word vector cosine similarity, calculating a weighted composite score S of the text edit distance score S1 and the word vector cosine similarity score S2 with weight α. The formula itself is not reproduced in this text; a form consistent with a single weight would be S = α·S1 + (1 − α)·S2.

[0038] α is 0.4. If the weighted composite score S is greater than or equal to the preset threshold θ, the mapping is considered successful; otherwise, it is considered unmatched. θ is 0.8.
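A minimal sketch of the hybrid matching decision follows. It assumes the composite takes the form S = α·S1 + (1 − α)·S2 (the formula is not reproduced in the text), that S1 is the edit distance normalized to the range 0 to 1, and that word vectors are supplied by some external embedding model. All function names are hypothetical.

```python
import math

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def edit_score(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return 0.0 if nu == 0 or nv == 0 else dot / (nu * nv)

def composite_score(s1: float, s2: float, alpha: float = 0.4) -> float:
    """Assumed form of the weighted composite: alpha * S1 + (1 - alpha) * S2."""
    return alpha * s1 + (1 - alpha) * s2

def is_match(score: float, theta: float = 0.8) -> bool:
    return score >= theta

# Hypothetical field names and toy vectors, for illustration only.
s1 = edit_score("report time", "reporting time")
s2 = cosine([1.0, 0.0, 1.0], [1.0, 0.0, 1.0])
mapped = is_match(composite_score(s1, s2))
```

With α = 0.4 the word-vector term carries more weight than the surface form, so a field name that is spelled differently but means the same thing can still clear θ = 0.8.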

[0039] In this embodiment, text event content from various sources is collected through an interface. The original text is preprocessed, including noise reduction, deduplication, removal of special symbols, unified encoding, and line break / space normalization. Typos in the text are corrected using a Large Language Model (LLM). Long texts are segmented into sentences and paragraphs, and the core event description is stored in an unstructured text cache. A standardized data structure model is constructed, including core fields such as event number, reporting time, reporting source, jurisdiction, detailed address, core description, subject type, subject information, and processing status, providing a unified standard for the structured transformation of multi-source event data stored in the unstructured text cache. Based on this, high-precision extraction is performed using regular expressions and a Large Language Model (LLM) for fields with clear characteristics such as time, location, and name. This includes deterministic field extraction, such as matching common formats like YYYY-MM-DD and MM-DD for time, and matching formats like neighborhood, community, road, and street for address.
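The deterministic extraction step can be illustrated with a short sketch. The patterns below are assumptions written against English text as a stand-in; the application does not give its actual expressions, and a deployed version would match the Chinese equivalents of the address suffixes. The names `FULL_DATE`, `SHORT_DATE`, `ADDRESS`, and `extract_fields` are hypothetical.

```python
import re

# YYYY-MM-DD, not embedded in a longer digit run
FULL_DATE = re.compile(r"(?<!\d)(\d{4})-(\d{1,2})-(\d{1,2})(?!\d)")
# MM-DD, excluded when it is part of a longer date such as YYYY-MM-DD
SHORT_DATE = re.compile(r"(?<![\d-])(\d{1,2})-(\d{1,2})(?![\d-])")
# One or more capitalized words followed by an address suffix
ADDRESS = re.compile(
    r"((?:[A-Z][A-Za-z]*\s)+(?:Road|Street|Community|Neighborhood))"
)

def extract_fields(text: str) -> dict:
    """Return the deterministic fields found in one event text."""
    return {
        "full_dates": [m.group(0) for m in FULL_DATE.finditer(text)],
        "short_dates": [m.group(0) for m in SHORT_DATE.finditer(text)],
        "addresses": ADDRESS.findall(text),
    }

sample = "Reported 2026-02-22, revisited 02-23 at Haizhou West Road near Taiwan Road."
fields = extract_fields(sample)
```

Fields found this way need no model call, which is where the cost and latency saving over a pure LLM pipeline would come from.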

[0040] Semantic element extraction is then performed. A few-shot prompt learning template is constructed that covers elements such as event type, event label, and core keywords. Using a large language model (LLM) combined with prompt engineering, intent recognition, element extraction and text summarization are performed to transform colloquial descriptions into standardized governance terms. Based on the few-shot prompt learning template, the category, label and keywords of the event are output.
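A few-shot template of this kind might be assembled as below. The example texts, the JSON output layout, the category "resident complaint", and the function name `build_prompt` are all hypothetical; the application does not disclose its template, and no model call is shown.

```python
import json

# Hypothetical worked examples; a real template would use curated governance cases.
FEW_SHOT_EXAMPLES = [
    {
        "text": "Grid worker visited three shops and found one blocked fire exit.",
        "output": {
            "category": "discovering and reporting safety hazards",
            "tags": ["fire hazards"],
            "keywords": ["blocked fire exit", "shop"],
        },
    },
    {
        "text": "Resident says the dancing downstairs is too loud every night.",
        "output": {
            "category": "resident complaint",  # assumed category name
            "tags": ["noise pollution"],
            "keywords": ["dancing", "loud"],
        },
    },
]

def build_prompt(event_text: str, examples=FEW_SHOT_EXAMPLES) -> str:
    """Concatenate an instruction, the worked examples, and the new event."""
    parts = [
        "Extract the category, tags and keywords of the grassroots "
        "governance event. Answer with one JSON object."
    ]
    for ex in examples:
        answer = json.dumps(ex["output"], ensure_ascii=False)
        parts.append(f"Text: {ex['text']}\nOutput: {answer}")
    parts.append(f"Text: {event_text}\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt("The store refused to refund a coat bought yesterday.")
```

Ending the prompt on an open `Output:` line nudges the model to continue in the same JSON shape as the examples, which keeps the response machine-parseable.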

[0041] The semantically extracted information items are then matched with the standardized data structure model fields (event number, reporting time, reporting source, jurisdiction, detailed address, core description, subject type, subject information, event classification, event tag, and processing status) for semantic similarity. The matching is calculated using a hybrid algorithm of edit distance and word vector cosine similarity.

[0042] For events that match no existing type or label, the LLM automatically infers the result, fills it in, and adds it to the corresponding type; other fields without information are filled with default values to ensure the integrity of the standardized structure; all of the above fields are assembled into a standardized object that conforms to the database table structure and written into the grassroots governance-specific data table, completing the transformation from unstructured to structured data and providing a standard data foundation for subsequent classification, association, labeling, and statistics.
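The standardized object with default filling could look roughly like this. The eleven field names follow the standardized data structure model listed in the description; the Python types, the default values ("unknown", "unclassified", "pending"), and the helper `assemble_event` are assumptions.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class StructuredEvent:
    event_number: str
    reporting_time: str = "unknown"
    reporting_source: str = "unknown"
    jurisdiction: str = "unknown"
    detailed_address: str = "unknown"
    core_description: str = ""
    subject_type: str = "unknown"
    subject_information: str = "unknown"
    event_classification: str = "unclassified"
    event_tags: List[str] = field(default_factory=list)
    processing_status: str = "pending"

def assemble_event(event_number: str, extracted: dict) -> StructuredEvent:
    """Keep recognized, non-empty fields; everything else falls back to defaults."""
    known = {
        key: value
        for key, value in extracted.items()
        if key in StructuredEvent.__dataclass_fields__ and value not in (None, "")
    }
    known.pop("event_number", None)  # the caller-supplied number always wins
    return StructuredEvent(event_number=event_number, **known)

event = assemble_event(
    "E001",
    {"reporting_source": "12345 hotline", "jurisdiction": "", "unmapped": "x"},
)
row = asdict(event)  # flat dict, ready to be written to a table
```

Every record then carries the same eleven keys regardless of which channel it came from, which is what makes the later indexing and statistics uniform.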

[0043] In one embodiment, the system further includes a data storage module 3, configured to persistently store structured event data in a dedicated data table for grassroots governance according to preset database storage rules, and to establish composite indexes for event number, reporting time, subject information, and subject type fields to support efficient multi-condition retrieval. It also includes an automatic classification module 4, configured to perform semantic analysis on the core descriptive fields in the structured event data and match them with a preset grassroots governance event classification system to achieve automatic event classification, and synchronize the classification results to the event classification field of the structured event data.
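For the data storage module, a composite index over the four named fields might be declared as follows. SQLite is used only because it ships with Python; the table name, column types, sample rows, and the column order inside the index are assumptions, since the application names the indexed fields but not their order or the database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE governance_event (
        event_number        TEXT PRIMARY KEY,
        reporting_time      TEXT NOT NULL,   -- ISO date, sorts chronologically
        subject_type        TEXT,
        subject_information TEXT,
        core_description    TEXT
    )
    """
)
# Composite index covering the four fields named in the description.
conn.execute(
    """
    CREATE INDEX idx_event_lookup
    ON governance_event (reporting_time, subject_type,
                         subject_information, event_number)
    """
)
conn.executemany(
    "INSERT INTO governance_event VALUES (?, ?, ?, ?, ?)",
    [
        ("E001", "2026-02-23", "merchant", "clothing store", "return dispute"),
        ("E002", "2026-02-24", "individual", "resident A", "noise at night"),
        ("E003", "2026-03-01", "merchant", "shop B", "blocked fire exit"),
    ],
)
# Multi-condition retrieval: merchants reported in February 2026.
rows = conn.execute(
    """
    SELECT event_number FROM governance_event
    WHERE reporting_time BETWEEN ? AND ? AND subject_type = ?
    ORDER BY event_number
    """,
    ("2026-02-01", "2026-02-28", "merchant"),
).fetchall()
```

Putting the range-filtered column first is one reasonable ordering for time-window queries; a workload dominated by lookups on subject would favor a different order.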

[0044] By building a grassroots governance event classification system in the database, which is divided into primary categories (such as daily visits and inspections, discovering and reporting safety hazards, etc.) and secondary subcategories (such as visits to merchants, migrant population, employment assistance, etc.), the system adapts to the actual business needs of grassroots governance. It performs feature extraction and semantic analysis on the core description fields of events in the structured event data, automatically matches them to the corresponding categories in the classification system, and synchronizes the classification results to the "event classification" field of the structured event data.

[0045] In one embodiment, the system further includes an information association module 5, configured to extract the main information from the structured event data, match and associate it with the grassroots personnel information database and / or the jurisdiction enterprise information database, synchronize the association results to the main information field of the structured event data, and update the associated event statistics information of the corresponding database.

[0046] Based on existing grassroots personnel information databases and local enterprise information databases, and according to core identifiers such as personnel name, ID number, and community affiliation, as well as enterprise name and unified social credit code, the identification information of involved personnel and enterprises in structured event data is extracted. This information is then matched and associated with the information in the personnel and enterprise databases. The unique identifiers in the associated databases are then synchronized to the main information fields in the structured event data. At the same time, association statistics are completed (such as the number of events involving a certain person, the types of events associated with a certain enterprise, etc.). The statistical results are updated in real time to the grassroots personnel information database and local enterprise information database.

[0047] In one embodiment, the system also includes an intelligent tagging module 6, configured to extract core features from the core description and event classification fields of structured event data based on keyword extraction and semantic similarity matching algorithms, and match them with the grassroots governance event tag library to automatically match tags for events and synchronize them to the event tag field of the structured event data.

[0048] The built-in grassroots governance event tag library is divided into basic tags and feature tags. Basic tags are associated with the event classification field in the structured event data, while feature tags are set according to the specific content of the event (such as urban management, water leakage repair, fire hazards, noise pollution, etc.). Based on keyword extraction and semantic similarity matching algorithms, core features are extracted from the event description, event classification and other fields of the structured event data, and relevant tags in the tag library are automatically matched for the event. It supports multiple tag annotations for a single event, and the tag results are synchronously stored in the structured event data.
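The keyword half of the tagging step can be sketched as a lookup against a small tag library. The tag names come from the examples in the text; the trigger keywords and the function `match_tags` are hypothetical, and the semantic similarity half is omitted here.

```python
# Hypothetical tag library: tag name -> trigger keywords.
TAG_LIBRARY = {
    "fire hazards": ["fire", "blocked exit", "extinguisher"],
    "noise pollution": ["noise", "loud"],
    "water leakage repair": ["leak", "burst pipe"],
}

def match_tags(description: str, classification: str, library=TAG_LIBRARY):
    """Return every tag with at least one keyword in the combined text."""
    text = f"{description} {classification}".lower()
    return [
        tag
        for tag, keywords in library.items()
        if any(keyword in text for keyword in keywords)
    ]

tags = match_tags(
    "Loud construction noise at night and a leaking pipe in the stairwell",
    "daily visits and inspections",
)
```

Because the function returns a list, a single event can carry several tags, matching the multi-tag support described above.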

[0049] It also includes a statistical analysis module 7, configured to perform multi-dimensional event frequency statistics based on a dedicated data table for grassroots governance, and output statistical results according to query conditions; based on structured event data written into the dedicated data table for grassroots governance, according to the user's statistical query needs (such as specified time period, jurisdiction, event category / tag), it performs quantity statistics and frequency calculation on events that meet the conditions through data statistics, generates event occurrence frequency statistical results, supports multi-dimensional statistical analysis by time, by category, by tag, etc., and the results can be output as structured data display.
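Frequency statistics over a time window might be computed as in this sketch. The event records are plain dictionaries with hypothetical values, dates are assumed to be stored as ISO strings, and the function name `frequency` is illustrative.

```python
from collections import Counter
from datetime import date

def frequency(events, start, end, key="event_classification", jurisdiction=None):
    """Count events per value of `key` inside [start, end], optionally per jurisdiction."""
    counts = Counter()
    for ev in events:
        day = date.fromisoformat(ev["reporting_time"])
        if not (start <= day <= end):
            continue
        if jurisdiction is not None and ev["jurisdiction"] != jurisdiction:
            continue
        value = ev[key]
        # A list-valued field (for example tags) contributes one count per entry.
        counts.update(value if isinstance(value, list) else [value])
    return counts

# Hypothetical records for illustration.
events = [
    {"reporting_time": "2026-02-22", "jurisdiction": "Haizhou",
     "event_classification": "consumer dispute", "event_tags": ["refund"]},
    {"reporting_time": "2026-02-23", "jurisdiction": "Haizhou",
     "event_classification": "consumer dispute", "event_tags": ["refund", "clothing"]},
    {"reporting_time": "2026-03-05", "jurisdiction": "Haizhou",
     "event_classification": "safety hazard", "event_tags": ["fire hazards"]},
]
by_category = frequency(events, date(2026, 2, 1), date(2026, 2, 28))
by_tag = frequency(events, date(2026, 2, 1), date(2026, 2, 28), key="event_tags")
```

`Counter.most_common()` then yields the high-frequency categories or tags directly.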

[0050] Furthermore, this embodiment uses a specific event handling example and the model training process to provide a detailed explanation of the technical solution of the present invention.

[0051] The original input event text comes from a voice transcription record of feedback from the 12345 hotline. The content is as follows: "On February 22, 2026, the caller purchased a coat for 1100 yuan at the 'Luji Clothing Store' located at No. 90 Taiwan Road, 2nd Floor, Block A, Haining Leather City, No. 201 Haizhou West Road, Haizhou Street, Haining City. His wife tried it on at home, but was not satisfied and asked him to return it. The caller returned to the store on February 23 to return the coat and get a refund, but the store refused. We hope relevant departments can send someone to the site to coordinate and resolve the return issue."

[0052] This original text embodies the typical characteristics of grass-roots governance event data: First, the colloquial expressions are prominent, such as informal terms like "hua fei le" (spending), "over there", "wife", etc.; second, there are repeated punctuation marks, such as consecutive full stops "。。"; third, there are miscellaneous typos, such as "hua fei" actually being "spending"; fourth, there is semantic redundancy, such as the description of "asking him to return it" and the subsequent description of the return behavior constituting a repeated expression. If such text is directly fed into a general natural language processing system, it will greatly reduce the accuracy and integrity of element extraction. Therefore, preprocessing, structured parsing, and summary generation need to be carried out in sequence.

[0053] First, perform noise removal on the event text, aiming to eliminate interfering characters and retain the text content valuable for semantics.

[0054] Symbol cleaning: Use regular expressions to only retain English letters (case-insensitive), Arabic numerals, common Chinese characters, and various whitespace characters, and uniformly replace the rest of the special symbols, repeated punctuation, etc. with empty strings and delete them. The regular expression is:

[0055] cleaned_text = re.sub(r'[^a-zA-Z0-9\u4e00-\u9fa5\s]', '', raw_text)

At the same time, colloquial fillers are filtered: a preset filler library serves as the stopword set (stopwords), the text is traversed after word segmentation, and fillers with no actual semantic contribution (such as "over there" and "this") are filtered out. The implementation is:

[0056] stopwords = set(filler_library)  # filler_library: the preset list of filler words

[0057] cleaned_tokens = [word for word in tokenized_text if word not in stopwords]

[0058] After this processing, the repeated full stops "。。" in the original text are standardized to a single full stop, the filler "over there" is deleted, and the colloquial term for "wife" is replaced with the standard term according to the context. The text obtained after noise removal is: "The caller bought a coat at Ruji Clothing Store, No. 90, Taiwan Road, 2nd Floor, Block A, Haining Leather City, 201 Haizhou West Road, Haizhou Sub-district, Haining City on February 22, 2026, spending 1100 yuan, took it home for his wife to try on, but his wife was not satisfied with the clothes and asked him to return it. The caller then took the clothes to the store on February 23 to return the goods and get a refund, but the merchant refused to refund him. It is hoped that relevant departments can send someone to the scene to coordinate and handle the return problem."
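The two noise-removal steps above can be sketched as follows. This is a minimal illustration, not the embodiment's actual code: the filler list, the `tokenize` argument, and the helper `collapse_repeated_punct` are assumptions introduced here. The character class is the one given in the text. Because that class drops every punctuation mark, a pipeline that wants to keep a single full stop would presumably collapse repeats in a separate step, which is what the hypothetical helper shows.

```python
import re

# Hypothetical filler library; the text only states that one is preset.
FILLER_WORDS = {"那个", "这个", "啊", "嗯"}

# Character class from the text: keep ASCII letters, digits, common CJK
# ideographs and whitespace; everything else is removed.
SYMBOL_PATTERN = re.compile(r"[^a-zA-Z0-9\u4e00-\u9fa5\s]")

# Assumption: runs of the same punctuation mark are collapsed to one mark.
REPEATED_PUNCT = re.compile(r"([。！？，.!?,])\1+")


def collapse_repeated_punct(text: str) -> str:
    """Reduce e.g. '。。' to '。' without deleting the mark entirely."""
    return REPEATED_PUNCT.sub(r"\1", text)


def clean_symbols(raw_text: str) -> str:
    """Apply the symbol-cleaning expression from the text."""
    return SYMBOL_PATTERN.sub("", raw_text)


def drop_fillers(tokens, stopwords=FILLER_WORDS):
    """Filter out tokens that appear in the filler (stopword) set."""
    return [token for token in tokens if token not in stopwords]


def remove_noise(raw_text: str, tokenize) -> list:
    """Symbol cleaning followed by filler filtering.

    `tokenize` is any word segmenter; the text does not name one.
    """
    return drop_fillers(tokenize(clean_symbols(raw_text)))


# Whitespace splitting stands in for a real Chinese word segmenter here.
tokens = remove_noise("那个 楼道 堆了 杂物。。 有 隐患！！", str.split)
```

With whitespace splitting as the stand-in segmenter, `tokens` holds the five content words and the filler "那个" is gone.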

[0059] Next, deduplication is performed to identify and eliminate redundant and repetitive semantic segments in the text, making the overall semantics more concise. This embodiment employs a deduplication strategy based on text similarity calculation. First, the text segments to be compared are converted into vector representations using the CountVectorizer method, and then the cosine similarity between each pair of segments is calculated. When the similarity exceeds a preset threshold, it is determined to be a semantically repetitive segment and is deduplicated.

[0060] The specific implementation code is as follows:

[0061] from sklearn.feature_extraction.text import CountVectorizer

[0062] from sklearn.metrics.pairwise import cosine_similarity

[0063] vectors = CountVectorizer().fit_transform([text1, text2])  # one count vector per text

[0064] cosine_sim = cosine_similarity(vectors[0:1], vectors[1:2])[0, 0]  # scalar similarity

[0065] threshold = 0.8

[0066] if cosine_sim >= threshold:

[0067]     cleaned_text = text1  # Remove duplicates, keep only one

[0068] When the similarity exceeds a preset threshold of 0.8, the two statements are considered semantically redundant, and one is retained while the other is removed. Specifically, for the descriptions "let him return it" and "come to the store to return the goods and get a refund," although the wording is different, they both refer to the same return behavior. The similarity value calculated is higher than the threshold, so the redundant expression "let him return it" is simplified.
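As a dependency-free illustration of the same idea, the sketch below computes cosine similarity over character-bigram counts and keeps the first of any pair of segments at or above the 0.8 threshold. The use of character bigrams is an assumption made here: a word-level count vectorizer may treat an unsegmented Chinese sentence as a single token, so some sub-word unit or prior word segmentation is likely needed. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigram_counts(text: str) -> Counter:
    """Count character bigrams, ignoring whitespace (assumed feature unit)."""
    chars = [c for c in text if not c.isspace()]
    return Counter("".join(chars[i:i + 2]) for i in range(len(chars) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def deduplicate(segments, threshold=0.8):
    """Keep a segment only if no earlier kept segment is at or above the threshold."""
    kept, kept_vectors = [], []
    for segment in segments:
        vector = bigram_counts(segment)
        if any(cosine_similarity(vector, other) >= threshold for other in kept_vectors):
            continue  # treated as a repetition of an earlier segment
        kept.append(segment)
        kept_vectors.append(vector)
    return kept


result = deduplicate(["商家拒绝退款", "商家拒绝退款", "希望部门协调"])
```

Surface-count similarity of this kind catches near-verbatim repeats. Paraphrases that share few characters may score well below 0.8, so a semantic embedding may be needed for cases like the one described above.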

[0069] The text after duplicate removal is: "The caller purchased a coat at the Ruji Clothing Store, No. 90, Taiwan Road, 2nd Floor, Block A, Haining Leather City, No. 201, West Haizhou Road, Haizhou Sub-district, Haining City on February 22, 2026, spending 1,100 yuan. He took it home for his wife to try on, but his wife was not satisfied with the clothes. The caller then went to the store on February 23 to request a return and refund, but the merchant refused. He hopes that relevant departments can send someone to the site to coordinate and handle the problem."

[0070] After that, typos are corrected automatically. Because the texts of grassroots governance events often contain homophone and similar-form character errors introduced by pinyin input, handwriting recognition, or transcription of spoken words, this invention uses the contextual semantic understanding of the pre-trained language model for error detection and correction. The text after duplicate removal is input into the model. The model combines consumption contexts such as "purchased a coat" and "1,100 yuan", and automatically corrects "huafei" to "spending", outputting the corrected segment "spending 1,100 yuan", thus restoring the correct description of the amount of money. This preprocessing model is trained on a large Chinese corpus and is specifically optimized for common error patterns in the grassroots governance field. It can repair high-frequency errors such as "loudao" (corridor) and "yinhuan" (hidden danger), providing clean input for subsequent structured extraction.

[0071] After preprocessing, it enters the structured parsing stage. By comprehensively using the deterministic matching of regular expressions and the semantic understanding ability of the pre-trained language model based on BERT, the collaborative extraction of key elements of the event is realized. In time extraction, the regular expression time_pattern = r'\d{4}-\d{2}-\d{2}' is used to accurately match the standardized "year-month-day" format. After extracting multiple dates from the text, according to the semantics of the event, the last-occurring "February 23, 2026" is determined as the actual date of the return behavior.
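A minimal sketch of this time-extraction step is shown below. It assumes the raw dates appear in the common Chinese form "2026年2月22日" (or "2月23日" without a year) and are first normalized to the "year-month-day" format that `time_pattern` matches. The normalization step, the rule of borrowing the year from the previous full date, and the function names are assumptions made for illustration. Choosing the last date follows the example in the text.

```python
import re

# Pattern given in the text for the standardized "year-month-day" format.
TIME_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

# Assumed raw form: "2026年2月22日", or "2月23日" when the year is omitted.
RAW_DATE = re.compile(r"(?:(\d{4})年)?(\d{1,2})月(\d{1,2})日")


def normalize_dates(text: str) -> str:
    """Rewrite raw dates as YYYY-MM-DD, reusing the most recent explicit year."""
    last_year = None

    def to_iso(match):
        nonlocal last_year
        year = match.group(1) or last_year
        if year is None:
            return match.group(0)  # no year available to borrow; leave as-is
        last_year = year
        return f"{year}-{int(match.group(2)):02d}-{int(match.group(3)):02d}"

    return RAW_DATE.sub(to_iso, text)


def extract_event_date(text: str):
    """Return the last standardized date in the text, or None if there is none."""
    dates = TIME_PATTERN.findall(normalize_dates(text))
    return dates[-1] if dates else None


event_date = extract_event_date("来电人于2026年2月22日购买大衣，2月23日到店退货被拒")
```

Here `event_date` is the return date rather than the purchase date. A production system would presumably let the language model confirm which date matches the event semantics instead of always taking the last one.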

[0072] In the extraction of regions and locations, a dual strategy combining regular expressions and named entity recognition (NER) is adopted: Regular expressions are used to capture location phrases with obvious patterns, such as address segments with suffixes or combinations like "sub-district", "community", "road", "number", "block", "building", etc.; Named entity recognition uses the fine-tuned BERT model to perform semantic-level boundary division and type annotation on administrative regions (such as "Haining City", "Haizhou Sub-district") and detailed addresses (such as "No. 201, West Haizhou Road, Block A, Haining Leather City, 2nd Floor, No. 90, Taiwan Road"). Through cross-verification and merging of the results of the two, the region field "Haining / Haizhou Sub-district" and the location field "No. 90, Taiwan Road, 2nd Floor, Block A, Haining Leather City" are obtained.
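The dual strategy can be illustrated with character spans: one set produced by a suffix-based regular expression, the other supplied by a named entity recognizer, and the two merged where they overlap or touch. The sketch below is an assumption-laden illustration. The Chinese suffix list mirrors the suffixes named above, the NER spans are passed in as plain tuples because no model is run here, and merging by span union is only one possible reading of "cross-verification and merging".

```python
import re

# Assumed Chinese suffixes for sub-district, community, road, number, block, floor/building.
ADDRESS_SEGMENT = re.compile(r"[\u4e00-\u9fa50-9A-Za-z]+?(?:街道|社区|路|号|座|楼)")


def regex_spans(text: str):
    """Character spans of address segments that end in a known suffix."""
    return [(m.start(), m.end()) for m in ADDRESS_SEGMENT.finditer(text)]


def merge_spans(spans):
    """Union of spans; overlapping or directly adjacent spans are joined."""
    merged = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def locate(text: str, ner_spans):
    """Merge regex spans with NER spans (hypothetical model output) into phrases."""
    spans = merge_spans(regex_spans(text) + list(ner_spans))
    return [text[start:end] for start, end in spans]


phrases = locate("海洲街道海州西路201号", ner_spans=[(0, 4)])
```

In this example the three regex segments and the one NER span collapse into a single address phrase covering the whole string.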

[0073] In identifying the parties involved, the sequence labeling and classification functions of the BERT model are used to extract the store name "Ruji Clothing Store". For event classification and tag generation, the model automatically performs multi-tag classification based on its understanding of the overall event description, outputting the broad event category "Consumer Complaint" and further refining it into tags such as "Consumer Complaint, Return Dispute, Clothing Sales". The tags maintain semantic complementarity and a reasonable hierarchy, providing standardized data for subsequent statistical analysis and precise retrieval.

[0074] After structured parsing, the core description and overview generation stage begins, aiming to meet information retrieval needs in different scenarios with hierarchical summaries. The preprocessed text is used as input, and an end-to-end generative compression is performed using a BERT pre-trained model to remove redundant details while retaining key elements such as time, location, subject, amount, and request.

[0075] The generated core description is: "On February 22, 2026, a consumer purchased a coat for 1100 yuan at the Ruji Clothing Store located at No. 90 Taiwan Road, 2nd Floor, Block A, Haining Leather City. Because his wife was dissatisfied, he went to the store the next day to return the coat and request a refund, but the store refused. He hopes relevant departments will coordinate a solution." Subsequently, the model is called again to further refine the core description, generating a highly condensed one-sentence summary: "A consumer purchased a 1100 yuan coat at Haining Leather City but was refused a refund the following day."

[0076] The hierarchical generation strategy described above not only preserves event details for in-depth review but also provides concise summaries for list display and quick retrieval, thus meeting the information access needs of different levels in grassroots governance scenarios.

[0077] To achieve higher accuracy and robustness in the above-mentioned structured parsing and text generation process, this embodiment further discloses a special model training scheme for the field of grassroots governance.

[0078] The base model used is Qwen3-32B, which is adapted to the domain through parameter-efficient fine-tuning. The training dataset is constructed based on the real distribution characteristics of grassroots governance texts and consists of a proportional mix of three types of samples, totaling tens of thousands of labeled data points. Standard samples account for 60% of the dataset; these samples are grammatically correct and clearly expressed, with each data point containing an instruction, an input (raw text), and an output (structured JSON).

[0079] For example, if the input is "There are miscellaneous items piled up in the corridor of Building 8, Nanyuan Yili, which poses a safety hazard", the output includes fields such as event category "safety hazard" and event tag "fire hazard, items piled up in the corridor", which are used to establish the basic structured extraction capability of the model.

[0080] Noise samples account for 30% of the dataset. They simulate the informal input style of grid workers' mobile reporting by adding filler words (such as "over there", "ah", "hey"), irregular punctuation (such as repeated exclamation marks, ellipses, and consecutive full stops), and the redundancy of spoken language. An example is "Over there in Building 8, Nanyuan Yili, there are a lot of sundries piled up in the corridor. It's a bit dangerous. It feels like it might catch fire...". These samples enhance the model's robustness to colloquial expressions and symbol noise.

[0081] Difficult samples account for 10% of the dataset. They deliberately introduce typos and non-standard expressions to simulate extremely low-quality texts. An example is "There are a lot of sundries pushed in the building corridor of Building 8, Nanyuan Yili. There is a fire hazard", which contains errors in the characters for "building corridor (楼道)", "piled up (堆了)", and "hidden danger (隐患)". These samples strengthen the model's ability to correct typos automatically and to complete incomplete descriptions semantically. The overall ratio of the dataset is standard samples : noise samples : difficult samples = 6:3:1, a ratio designed to balance basic capability, noise resistance, and error correction.
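The record layout and the 6:3:1 mix can be sketched as below. The JSON field names `event_category` and `event_tags`, the instruction wording, and the total of 10,000 samples are assumptions for illustration; the text only specifies the instruction, input, and output structure, the ratio, and a total in the tens of thousands.

```python
import json

# Ratio stated in the text: standard : noise : difficult = 6 : 3 : 1.
MIX = {"standard": 6, "noise": 3, "difficult": 1}


def split_counts(total: int) -> dict:
    """Split a dataset size according to MIX; any remainder goes to 'standard'."""
    parts = sum(MIX.values())
    counts = {name: total * share // parts for name, share in MIX.items()}
    counts["standard"] += total - sum(counts.values())
    return counts


def make_record(instruction: str, raw_text: str, structured: dict) -> dict:
    """One training sample: instruction, raw input text, structured JSON output."""
    return {
        "instruction": instruction,
        "input": raw_text,
        "output": json.dumps(structured, ensure_ascii=False),
    }


record = make_record(
    "Extract the structured event fields.",  # hypothetical instruction wording
    "There are miscellaneous items piled up in the corridor of Building 8, "
    "Nanyuan Yili, which poses a safety hazard",
    {  # hypothetical field names
        "event_category": "safety hazard",
        "event_tags": ["fire hazard", "items piled up in the corridor"],
    },
)
counts = split_counts(10000)  # illustrative total
```

For the illustrative total, `counts` gives 6,000 standard, 3,000 noise, and 1,000 difficult samples.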

[0082] Fine-tuning uses Low-Rank Adaptation (LoRA) to limit the number of newly added parameters and reduce computational overhead. The LoRA parameters are set to rank r = 14, scaling factor α = 16, and dropout ratio 0.05. The parameter update follows the formula W' = W + ΔW, ΔW = A × B, where W is the original weight, ΔW is the incremental parameter matrix, and A and B are low-rank decomposition matrices. Only the matrices A and B receive gradient updates, which greatly reduces the number of trainable parameters.
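To make the saving concrete, the sketch below counts trainable parameters for one weight matrix and applies the update W' = W + ΔW on a toy example. The 5120 × 5120 layer size is an illustrative assumption, and scaling ΔW by α / r follows the usual LoRA convention rather than anything stated in the text.

```python
def lora_param_counts(d_out: int, d_in: int, r: int):
    """Return (full, lora): d_out*d_in versus r*(d_out + d_in) trainable values."""
    return d_out * d_in, r * (d_out + d_in)


def matmul(a, b):
    """Plain nested-list matrix product, enough for a toy example."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_update(W, A, B, alpha: float, r: int):
    """W' = W + (alpha / r) * (A x B); the alpha / r scaling is an assumption."""
    scale = alpha / r
    delta = matmul(A, B)  # A is d_out x r, B is r x d_in
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


# Illustrative layer size (assumed) with the rank from the text.
full, lora = lora_param_counts(5120, 5120, r=14)

# Toy rank-1 update on a 2 x 2 identity weight.
W_new = lora_update(
    W=[[1.0, 0.0], [0.0, 1.0]],
    A=[[1.0], [2.0]],
    B=[[3.0, 4.0]],
    alpha=2.0,
    r=1,
)
```

For the assumed layer, the low-rank pair holds 143,360 trainable values against 26,214,400 in the full matrix, roughly 0.55%.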

[0083] In the design of the loss function, multi-objective joint loss is used for optimization:

[0084] The generation loss L_gen adopts the standard cross-entropy loss to drive the model to generate high-quality text; the format constraint loss L_format imposes penalties on situations where the output structure does not conform to the predefined JSON format, fields are missing, or there are nesting errors, forcing the model to learn a stable structured output paradigm; the label consistency loss L_consistency is defined as

[0085] 1 - CosineSimilarity(label_pred, label_true), which is used to constrain the semantic direction between the predicted label and the true label to be consistent, avoiding the situation of correct semantics but drifting in wording. The hard sample weighted loss L_hard applies a weighting coefficient w=2 to the loss value for hard samples, causing the model to pay more attention to low-quality samples during training. Training employs a three-stage progressive strategy: Stage 1 uses only standard samples for training, allowing the model to converge to possess basic structured extraction and generation capabilities; Stage 2 loads the optimal weights from Stage 1, introducing noisy samples for further training, learning to ignore colloquial interference while preserving core semantics; Stage 3 adds hard samples to the foundation of Stage 2, using weighted loss for enhanced training, focusing on improving performance in typo correction and incomplete element completion.
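The label consistency term and the hard-sample weighting can be sketched as below. How the individual terms are combined is not stated in the text, so the unweighted sum and the application of w = 2 to the whole per-sample loss are assumptions. Label vectors are assumed to be embeddings of the predicted and true labels.

```python
from math import sqrt

HARD_WEIGHT = 2.0  # w = 2 from the text


def consistency_loss(label_pred, label_true) -> float:
    """1 - cosine similarity between predicted and true label vectors."""
    dot = sum(p * t for p, t in zip(label_pred, label_true))
    norm = sqrt(sum(p * p for p in label_pred)) * sqrt(sum(t * t for t in label_true))
    if norm == 0:
        return 1.0  # treat an all-zero vector as maximally inconsistent (assumption)
    return 1.0 - dot / norm


def sample_loss(l_gen: float, l_format: float, l_consistency: float, is_hard: bool) -> float:
    """Joint loss for one sample; the plain sum of terms is an assumption."""
    total = l_gen + l_format + l_consistency
    return HARD_WEIGHT * total if is_hard else total


aligned = consistency_loss([1.0, 0.0], [2.0, 0.0])      # same direction
orthogonal = consistency_loss([1.0, 0.0], [0.0, 1.0])   # unrelated direction
hard_total = sample_loss(0.5, 0.25, 0.25, is_hard=True)
```

Vectors pointing the same way give a loss near zero regardless of their length, which is the "same semantic direction" constraint described above.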

[0087] The training hyperparameters were set as follows: initial learning rate 2e-5, the learning rate was gradually increased to the initial value using a linear warmup strategy in the first 10% of steps, and then decayed using cosine annealing; batch size was set to 16, gradient accumulation steps were 4, and the equivalent batch size was 64, in order to simulate large batch training under limited GPU memory conditions and maintain training stability and convergence efficiency.
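The schedule and the batch arithmetic can be written out as a short sketch. The peak rate, the 10% warmup share, and the batch figures come from the text. Decaying all the way to zero and starting the warmup at one step's worth of the peak rate are assumptions.

```python
from math import cos, pi

PEAK_LR = 2e-5        # initial learning rate from the text
WARMUP_PERCENT = 10   # linear warmup over the first 10% of steps
BATCH_SIZE = 16
GRAD_ACCUM_STEPS = 4
EFFECTIVE_BATCH = BATCH_SIZE * GRAD_ACCUM_STEPS  # 16 * 4 = 64


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine annealing (assumed to end at zero)."""
    warmup_steps = max(1, total_steps * WARMUP_PERCENT // 100)
    if step < warmup_steps:
        return PEAK_LR * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * PEAK_LR * (1.0 + cos(pi * progress))


# Learning rate at a few points of an illustrative 1,000-step run.
schedule = [learning_rate(s, 1000) for s in (0, 99, 100, 550, 1000)]
```

In the illustrative run the rate rises to 2e-5 by step 99, sits at half that value midway through the decay, and reaches zero at the final step.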

[0088] Through the aforementioned phased training and multi-objective joint optimization, this embodiment achieved the following results: on the held-out test set, the event classification accuracy improved by approximately 13% compared to the BERT-only baseline, and the semantic consistency index of label prediction improved by approximately 17%. The model's performance on noisy and difficult samples showed a clear advantage over general-purpose large models without domain adaptation, and the output stability was significantly improved. Experimental results show that the proposed solution achieves high-accuracy and high-completeness automated structured conversion of multi-source heterogeneous unstructured texts for grassroots governance. It resolves the trade-offs in existing technologies, namely the high inference costs and large response delays of solutions based purely on a large language model, and the low coverage and insufficient generalization ability of solutions based purely on rules, providing accurate data support for targeted analysis and decision-making in grassroots governance events.

[0089] The above are merely specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various variations or substitutions within the technical scope disclosed in the present invention, and these should all be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A system for intelligent structured processing and classification analysis of grassroots governance events, characterized in that the system comprises: a data access module (1), configured to connect to grid worker mobile terminal reporting, government service hotline feedback, and monitoring and early warning channels through a multi-source heterogeneous data adaptation interface, collect the original text data of multi-source grassroots governance events, and store the original text data in an unstructured text cache area after preprocessing that removes non-text symbols and modal particles, unifies the encoding, and performs sentence segmentation; and a structured processing module (2), configured to call a semantic parsing model that combines regular expressions with a large language model fine-tuned on a grassroots governance corpus, to sequentially perform deterministic field extraction and semantic element extraction on the original text data in the unstructured text cache area; wherein the time field, location field, and name field with clear format features are extracted deterministically using regular expressions; the core keywords of the event are extracted semantically using a few-shot prompt learning template driven by the large language model, to extract time elements, location elements, involved subject elements, and core event description information; for missing fields, logical completion and consistency verification are performed based on sliding window context semantic association rules and a preset grassroots governance element dictionary; and field mapping and format standardization are performed according to a preset standardized data structure model to generate structured event data, the standardized data structure model including the fields event number, reporting time, reporting source, jurisdiction, detailed address, core description, subject type, subject information, event classification, event tag, and processing status.

2. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 1, characterized in that the data access module (1) is also configured to: collect text event content from various sources through the multi-source heterogeneous data adaptation interface, clean the original text data to remove duplicate content and special symbols, unify the encoding format, standardize line breaks and spaces, call the large language model to correct typos in the text, split long texts into sentences and segments, and retain the core description segment of the event.

3. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 1, characterized in that the structured processing module (2) is further configured to: perform semantic similarity matching between the extracted information items and the target fields of the standardized data structure model; call the large language model to perform reasoning and completion based on contextual semantics and the dictionary of grassroots governance elements for unmatched fields; and fill in default values to ensure the structural integrity of the structured event data.

4. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 3, characterized in that the semantic similarity matching employs a hybrid algorithm combining edit distance and word vector cosine similarity, calculating a weighted composite score S of the text edit distance score S1 and the word vector cosine similarity score S2 with weighting coefficient α, where α is 0.4; if the weighted composite score S is greater than or equal to the preset threshold θ, the mapping is determined to be successful, otherwise it is determined to be unmatched, where θ is 0.8.

5. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 1, characterized in that it also includes a data storage module (3), configured to persistently store the structured event data to a dedicated data table for grassroots governance according to preset database storage rules, and to establish a composite index for the event number, reporting time, subject information and subject type fields to support efficient retrieval under multiple conditions.

6. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 1, characterized in that it also includes an automatic classification module (4), which is configured to perform semantic analysis on the core description field in the structured event data and match it with the preset grassroots governance event classification system to realize automatic event classification and synchronize the classification results to the event classification field of the structured event data.

7. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 6, characterized in that it also includes an information association module (5), configured to extract the subject information in the structured event data, match and associate it with the grassroots personnel information database and / or the enterprise information database in the jurisdiction, synchronize the association results to the subject information field of the structured event data, and update the associated event statistics information of the corresponding database.

8. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 6, characterized in that it also includes an intelligent tagging module (6), which is configured to extract core features from the core description and event classification fields of the structured event data based on keyword extraction and a semantic similarity matching algorithm, match them with the grassroots governance event tag library, automatically assign tags to events, and synchronize the tags to the event tag field of the structured event data.

9. The intelligent structured processing and classification analysis system for grassroots governance events according to claim 5, characterized in that it also includes a statistical analysis module (7), which is configured to perform multi-dimensional event frequency statistics based on the grassroots governance-specific data table and output the statistical results according to the query conditions.