Fair competition review violation sub-classification matching method based on escape retrieval

By constructing a structured knowledge base of violation rules and using a lightweight large language model for escape retrieval, the method addresses the high model training cost and poor adaptability of existing fair competition review approaches. It provides efficient and accurate multi-dimensional matching against regulations, supports dynamic updates of regulations, and gives an interpretable basis for decisions.

CN122196154A · Pending · Publication Date: 2026-06-12 · TIANJIN UNIVERSITY OF TECHNOLOGY +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TIANJIN UNIVERSITY OF TECHNOLOGY
Filing Date
2026-03-05
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies in fair competition review suffer from high model training and maintenance costs, poor adaptability, difficulty in achieving multi-dimensional correlation and matching between policy texts and multiple regulations, and inability to meet the needs of "one-to-many" regulation comparison and review support.

Method used

An escape-retrieval method is used: a structured knowledge base of violation rules is constructed, and a lightweight large language model serves as the intent escaping engine. Intent escaping combined with hybrid retrieval (BM25 keyword retrieval plus vector matching) matches policy texts against multiple regulations efficiently.

Benefits of technology

It significantly reduces data costs and training expenses, improves review efficiency and accuracy, supports multi-dimensional regulatory correlation matching, provides interpretable decision-making basis, and adapts to dynamic updates of laws and regulations.

✦ Generated by Eureka AI based on patent content.

Figure CN122196154A_ABST

Abstract

The application relates to the technical fields of natural language processing and fair competition review, and in particular to a fair competition review violation subdivision matching method based on escape retrieval. First, a continuously updatable local violation rule knowledge base is constructed from the existing laws and regulations on fair competition review, converting the review standard texts into standardized, retrievable violation rules. Next, a locally deployed lightweight large language model serves as the intent escaping engine: using a preset prompt word template, it converts each sentence to be reviewed into a standardized query expression that closely matches the language and intent of the regulations. A hybrid retrieval strategy then runs semantic vector retrieval and BM25 keyword retrieval in parallel, capturing semantic associations and matching key terms respectively. Finally, a fusion ranking algorithm integrates the two sets of retrieval results and outputs a list of potentially violated regulation articles sorted by relevance. The application can improve the accuracy and flexibility of review matching, and thereby the efficiency and quality of review work.

Description

Technical Field

[0001] This invention relates to the fields of natural language processing and fair competition review, and in particular to a fair competition review violation segmentation and matching method based on escape retrieval.

Background Technology

[0002] With the development of natural language processing technology, deep learning-based text analysis methods are gradually being applied to policy compliance review and intelligent supervision. In existing technologies, automated review of policy texts to determine whether they exclude or restrict competition typically relies on customizing pre-trained language models for specific tasks, directly outputting violation judgments by constructing text classification models.

[0003] However, the above-mentioned model fine-tuning-based technical approach has certain limitations in practical applications: on the one hand, model fine-tuning usually relies on a large number of manually labeled samples, resulting in high training and maintenance costs. Moreover, when relevant laws, regulations, or review standards are adjusted, it is often necessary to retrain the model or update the parameters, limiting the system's adaptability and scalability. On the other hand, this type of method usually uses a single or limited category of violations as the output, making it difficult to represent the correspondence between policy text content and multiple relevant legal provisions, and failing to meet the actual needs of fair competition review practice for "one-to-many" legal comparison and review support.

[0004] In the process of fair competition review, the same policy provision often needs to be comprehensively judged by comparing it with multiple laws and regulations or multiple review dimensions. Relying solely on a single classification result is insufficient to provide sufficient and clear legal basis for manual review or subsequent decision-making. Therefore, how to achieve multi-dimensional correlation and matching between policy texts and legal provisions while reducing model training and maintenance costs has become an urgent technical problem to be solved in the field of automated fair competition review.

[0005] Therefore, this invention proposes a fair competition review violation segmentation and matching method based on escape retrieval to solve the above problems.

Summary of the Invention

[0006] To address the shortcomings of existing technologies, this invention develops a fair competition review violation segmentation and matching method based on escape retrieval. This invention can achieve efficient matching and review support between policy texts and multiple related regulations without relying on large-scale model fine-tuning.

[0007] The technical solution adopted by this invention is a fair competition review violation segmentation and matching method based on escape retrieval, comprising the following steps:
S1. Based on the core norms of China's currently effective fair competition review system, construct a local structured violation rule knowledge base and a violation rule text vector knowledge base.
S2. For each sentence to be reviewed for violations, use the intent escaping engine, built on a locally deployed lightweight large language model, to output the escaped query statement, core keywords, and extended keywords in structured form according to the preset prompt word template.
S3. Convert the escaped query statement into a query vector, match it against the violation rule text vector knowledge base, and recall a vector retrieval candidate set.
S4. Using the core keywords and extended keywords as the query term set, perform sentence-level BM25 retrieval over the structured violation rule knowledge base, compute relevance scores, and recall a keyword retrieval candidate set.
S5. Merge and re-rank the vector retrieval candidate set and the keyword retrieval candidate set to obtain a final candidate list sorted by comprehensive relevance, forming a structured mapping between the sentence under review and the corresponding violation rules.

[0008] S1 is as follows:
S1.1. Crawl the currently effective national normative documents on fair competition review in a targeted manner and extract the prohibitive or restrictive clauses in each field.
S1.2. Segment and extract the clauses using a single violation as the basic unit. If a legal text fully describes a single violation, it is extracted directly as one violation rule; if a legal text contains multiple parallel violations, it is further split by semicolons, enumeration items, or independent semantics, so that each segment corresponds to one violation rule.
S1.3. Generate a unique identifier (ID) for each violation rule and construct metadata comprising regulatory source traceability, a timeliness identifier, the review classification, core keywords, and an initial weight value. The initial weight $\omega_i$ of each violation rule is calculated as
$$\omega_i = \alpha L_i + \beta T_i, \quad i = 1, \dots, N,$$
where $N$ is the number of violation rules; $r_i$ denotes the $i$-th violation rule; $L_i$ is the predefined legal force level factor of the regulatory document corresponding to $r_i$; $T_i$ is the timeliness factor of that regulatory document; and $\alpha$ and $\beta$ are weight coefficients satisfying $\alpha + \beta = 1$. The timeliness factor $T_i$ is obtained from the difference $\Delta t_i$ between the current year $Y_{\mathrm{cur}}$ and the effective year $Y_i$ of the regulation corresponding to the violation rule, modeled with exponential decay:
$$\Delta t_i = Y_{\mathrm{cur}} - Y_i, \qquad T_i = e^{-\lambda \Delta t_i},$$
where $\lambda$ is the time decay coefficient. The violation rule, its metadata, and the original text are integrated into a violation rule knowledge unit and stored in structured form, producing the structured violation rule knowledge base.
S1.4. An open-source pre-trained lightweight large language model is used as the text encoding model. It performs semantic vector encoding on the original text of each violation rule and outputs a dense semantic feature vector $v_i$.
S1.5. Store the dense semantic feature vectors and their corresponding unique identifiers (IDs) in a local FAISS vector database to construct the violation rule text vector knowledge base. FAISS returns the semantic vectors most similar to the query vector together with their IDs, and the IDs are used to trace back the original text and metadata in the structured violation rule knowledge base.
S1.6. Update the structured violation rule knowledge base and the violation rule text vector knowledge base in real time as regulations are added, revised, or invalidated.
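The initial weight in S1.3 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names, the default values of `alpha`, `beta`, and `decay`, and the clamping of negative ages to zero are all assumptions introduced here.

```python
import math


def time_factor(current_year: int, effective_year: int, decay: float) -> float:
    """Exponential decay of a regulation's timeliness with its age in years."""
    # Assumption: a regulation dated in the future is treated as age 0.
    age = max(current_year - effective_year, 0)
    return math.exp(-decay * age)


def initial_weight(
    level_factor: float,
    current_year: int,
    effective_year: int,
    alpha: float = 0.6,   # hypothetical weight of the legal force level factor
    beta: float = 0.4,    # hypothetical weight of the timeliness factor
    decay: float = 0.1,   # hypothetical time decay coefficient
) -> float:
    """Linear combination of legal force level and timeliness, alpha + beta = 1."""
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    return alpha * level_factor + beta * time_factor(current_year, effective_year, decay)


if __name__ == "__main__":
    # A rule from a regulation effective this year vs. one ten years old.
    print(initial_weight(1.0, 2026, 2026))
    print(initial_weight(1.0, 2026, 2016))
```

With equal legal force level factors, an older regulation receives a lower initial weight, which is the behavior the exponential decay is meant to produce.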

[0009] S2 is as follows:
S2.1. For each sentence to be reviewed for violations, a lightweight large language model that is deployed locally and makes no network calls is used as the core model for intent escaping.
S2.2. Construct a compound prompt word template that constrains both the model's internal reasoning path and its final output.
S2.3. The compound prompt word template includes an implicit inference constraint module, which restricts the model to complete the following in sequence without outputting the intermediate reasoning process: 1) identify the subject, object, and mode of the administrative action; 2) identify the type of behavior that excludes or restricts competition; 3) perform semantic alignment and abstract understanding of the explicit statements and the potential regulatory intent in the text.
S2.4. The compound prompt word template also includes an escape text generation constraint module, which constrains the model to generate a standardized escaped query statement that: 1) eliminates rhetoric, colloquialisms, and context-dependent expressions; 2) uses neutral, abstract, and universally applicable policy and legal expressions; 3) retains the core semantics of behavior type, target, and competitive impact.
S2.5. The compound prompt word template also includes a keyword generation constraint module, which constrains the model to generate, in the same pass: core keywords, namely core words or phrases extracted from the sentence to be reviewed that relate to the subject, behavior, measures, and competitive impact; and extended keywords, derived by combining the core keywords with domain terminology, synonyms or near-synonyms, and hierarchical concepts.
S2.6. Format the intent escaping result as a structured JSON object containing only the escaped query statement and the keyword set.

S3 is as follows:
S3.1. Parse the structured JSON object to obtain the escaped query statement $Q_s$.
S3.2. Using the same text encoding model as was used to construct the violation rule text vector knowledge base, encode $Q_s$ to obtain the query vector $q$. From the vector set $V = \{v_1, \dots, v_N\}$ of the vector library, select the $K_{\mathrm{vec}}$ vectors with the highest semantic similarity to $q$ to form the vector retrieval candidate set $C_{\mathrm{vec}}$:
$$C_{\mathrm{vec}} = \operatorname{TopK}_{v_j \in V}\big(\mathrm{sim}(q, v_j),\, K_{\mathrm{vec}}\big), \qquad \mathrm{sim}(q, v_j) = \frac{q \cdot v_j}{\lVert q \rVert\, \lVert v_j \rVert},$$
where $C_{\mathrm{vec}}$ is the vector retrieval candidate set; $\operatorname{TopK}$ selects from the vector set $V$ the $K_{\mathrm{vec}}$ candidate regulatory knowledge items with the highest semantic similarity scores; and $\mathrm{sim}(q, v_j)$ is the semantic relevance score between the query vector $q$ and the vector $v_j$ of a regulatory knowledge item, calculated using cosine similarity.
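The top-K cosine recall of S3.2 can be illustrated with a brute-force search in plain Python. The source performs this lookup with a FAISS vector database; the sketch below is only a stand-in that shows the scoring and selection logic, and the names `cosine` and `top_k_by_cosine` as well as the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query_vec: list[float],
    rule_vectors: dict[str, list[float]],
    k: int,
) -> list[tuple[str, float]]:
    """Return the k (rule_id, similarity) pairs most similar to the query."""
    scored = [(rule_id, cosine(query_vec, vec)) for rule_id, vec in rule_vectors.items()]
    # Sort by similarity (descending), then by ID so ties are deterministic.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


if __name__ == "__main__":
    vectors = {"R1": [1.0, 0.0], "R2": [0.0, 1.0], "R3": [1.0, 1.0]}
    print(top_k_by_cosine([1.0, 0.0], vectors, k=2))
```

The returned IDs play the role described in S1.5: they are the keys used to look up the original rule text and metadata in the structured knowledge base.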

[0010] S4 is as follows:
S4.1. Parse the structured JSON object to obtain the keyword set $W = \{w_1, \dots, w_m\}$, where $m$ is the number of keywords in the keyword set and $w_t$ denotes the $t$-th keyword, $t = 1, \dots, m$.
S4.2. Based on the inverted index, perform BM25 keyword retrieval in the structured violation rule knowledge base and select the top $K_{\mathrm{kw}} = 25$ knowledge units with the highest matching degree to form the keyword retrieval candidate set $C_{\mathrm{kw}}$:
$$C_{\mathrm{kw}} = \operatorname{TopK}_{d_j \in D}\big(\mathrm{Score}(d_j, W),\, K_{\mathrm{kw}}\big),$$
where $D$ is the set of violation rule knowledge units; $d_j$ is the original text of the $j$-th violation rule in $D$; and $\operatorname{TopK}$ sorts the units from highest to lowest BM25 score and selects the top $K_{\mathrm{kw}}$ violation rule knowledge units. $\mathrm{Score}(d_j, W)$ is the query scoring function for keyword retrieval, which measures the degree of keyword matching between the original text $d_j$ and the keyword set $W$:
$$\mathrm{Score}(d_j, W) = \sum_{t=1}^{m} \mathrm{IDF}(w_t) \cdot \frac{f(w_t, d_j)\,(k_1 + 1)}{f(w_t, d_j) + k_1\left(1 - b + b \cdot \frac{|d_j|}{\mathrm{avgdl}}\right)},$$
where $f(w_t, d_j)$ is the frequency of keyword $w_t$ in the original text $d_j$; $|d_j|$ is the text length of the legal entry $d_j$; $\mathrm{avgdl}$ is the average text length of the entries in the structured violation rule knowledge base; and $k_1$ and $b$ are the BM25 tuning parameters that control term frequency saturation and the strength of text length normalization. The inverse document frequency $\mathrm{IDF}(w_t)$ measures how well keyword $w_t$ discriminates within the legal knowledge base:
$$\mathrm{IDF}(w_t) = \ln \frac{N - n(w_t) + 0.5}{n(w_t) + 0.5},$$
where $N$ is the total number of structured legal knowledge entries in the structured violation rule knowledge base (a structured legal knowledge entry is the smallest retrieval unit after hierarchical segmentation and semantic normalization), and $n(w_t)$ is the number of violation rule knowledge entries that contain keyword $w_t$. The entries are ranked from highest to lowest BM25 score and the top $K_{\mathrm{kw}}$ are recalled, forming the keyword retrieval candidate set $C_{\mathrm{kw}}$.
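A compact BM25 scorer helps make the roles of term frequency, length normalization, and inverse document frequency concrete. The sketch below assumes the rule texts are already tokenized (word segmentation of Chinese text is out of scope), uses the classic Robertson form of IDF, and picks `k1 = 1.5` and `b = 0.75` as illustrative defaults; none of these choices is confirmed by the source, and the function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(
    keywords: list[str],
    docs: dict[str, list[str]],
    k1: float = 1.5,   # assumed term-frequency saturation parameter
    b: float = 0.75,   # assumed length-normalization strength
) -> dict[str, float]:
    """Score every tokenized rule text against the keyword set with BM25."""
    if not docs:
        return {}
    n_docs = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: number of rule texts containing each token.
    df: Counter[str] = Counter()
    for tokens in docs.values():
        df.update(set(tokens))
    scores: dict[str, float] = {}
    for rule_id, tokens in docs.items():
        tf = Counter(tokens)
        length_norm = 1.0 - b + b * (len(tokens) / avgdl if avgdl else 0.0)
        score = 0.0
        for word in set(keywords):  # each keyword counted once
            freq = tf[word]
            if freq == 0:
                continue
            # Classic IDF; it can be zero or negative for very common terms.
            idf = math.log((n_docs - df[word] + 0.5) / (df[word] + 0.5))
            score += idf * freq * (k1 + 1.0) / (freq + k1 * length_norm)
        scores[rule_id] = score
    return scores


def top_k_bm25(keywords: list[str], docs: dict[str, list[str]], k: int) -> list[tuple[str, float]]:
    """Rank rule IDs by BM25 score, highest first, and keep the top k."""
    ranked = sorted(bm25_scores(keywords, docs).items(), key=lambda p: (-p[1], p[0]))
    return ranked[:k]


if __name__ == "__main__":
    corpus = {
        "R1": ["market", "access", "deposit"],
        "R2": ["local", "subsidy", "policy"],
        "R3": ["tax", "refund", "policy"],
        "R4": ["bidding", "qualification", "limit"],
    }
    print(top_k_bm25(["market", "access"], corpus, k=2))
```

Because the classic IDF can turn negative for terms that appear in more than half of the entries, a production system might prefer a smoothed variant; that choice is left open here.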

[0011] S5 is as follows:
S5.1. For each violation rule $r$ retrieved in the vector retrieval candidate set $C_{\mathrm{vec}}$ and the keyword retrieval candidate set $C_{\mathrm{kw}}$, apply Min-Max normalization to its retrieval scores to obtain the normalized semantic score $\tilde{s}_{\mathrm{vec}}(r)$ and the normalized keyword score $\tilde{s}_{\mathrm{kw}}(r)$.
S5.2. For violation rules that appear in both $C_{\mathrm{vec}}$ and $C_{\mathrm{kw}}$, as well as those that appear in only one of the two candidate sets, construct a unified hybrid retrieval comprehensive scoring function $S(r)$:
$$S(r) = \theta_{\mathrm{vec}}\, \tilde{s}_{\mathrm{vec}}(r) + \theta_{\mathrm{kw}}\, \tilde{s}_{\mathrm{kw}}(r),$$
where $\theta_{\mathrm{vec}}$ and $\theta_{\mathrm{kw}}$ are configurable weight parameters satisfying $\theta_{\mathrm{vec}} + \theta_{\mathrm{kw}} = 1$.
S5.3. Sort the candidate violation rules in descending order of the hybrid retrieval comprehensive score $S(r)$ and select a preset number of them from the sorted results to form the set of candidate violation rules used to generate the fair competition review result.
S5.4. The structured output includes the set of candidate violation rules corresponding to the text under review, the corresponding regulatory attribute information, and their order as determined by the hybrid retrieval comprehensive score.
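The normalization and weighted fusion of S5 can be sketched as follows. Two points are assumptions made for this illustration rather than statements from the source: a rule missing from one candidate set contributes a normalized score of 0 for that channel, and when all scores in a channel are equal they are normalized to 1.0. The function names and the default weights of 0.5 are likewise hypothetical.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Min-Max normalize scores into [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: identical scores all map to 1.0 to avoid division by zero.
        return {rule_id: 1.0 for rule_id in scores}
    return {rule_id: (s - lo) / (hi - lo) for rule_id, s in scores.items()}


def fuse(
    vec_scores: dict[str, float],
    kw_scores: dict[str, float],
    w_vec: float = 0.5,
    w_kw: float = 0.5,
    top_n: int = 5,
) -> list[tuple[str, float]]:
    """Weighted fusion of normalized vector and keyword scores, best first."""
    if abs(w_vec + w_kw - 1.0) > 1e-9:
        raise ValueError("w_vec and w_kw must sum to 1")
    norm_vec, norm_kw = min_max(vec_scores), min_max(kw_scores)
    fused = {
        # Assumption: a rule absent from one candidate set scores 0 there.
        rule_id: w_vec * norm_vec.get(rule_id, 0.0) + w_kw * norm_kw.get(rule_id, 0.0)
        for rule_id in set(norm_vec) | set(norm_kw)
    }
    return sorted(fused.items(), key=lambda p: (-p[1], p[0]))[:top_n]


if __name__ == "__main__":
    vec = {"R1": 0.9, "R2": 0.5, "R3": 0.1}   # cosine similarities
    kw = {"R2": 8.0, "R4": 2.0}               # BM25 scores
    for rule_id, score in fuse(vec, kw, top_n=4):
        print(rule_id, round(score, 3))
```

In this toy input, R2 ranks first because it is the only rule supported by both retrieval channels, which is the "one-to-many" cross-evidence effect the hybrid design aims for.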

[0012] The effects described in the invention are merely those of the embodiments, and not all the effects of the invention. The above technical solutions have the following advantages or beneficial effects: This invention discloses a fair competition review violation segmentation and matching method based on escape retrieval. By using intent escaping and a hybrid retrieval architecture, it effectively solves the technical bottlenecks of traditional intelligent review in terms of data dependence, model adaptation and dynamic updates of regulations, and greatly improves review efficiency and accuracy.

[0013] This invention abandons the traditional approach of relying on massive amounts of labeled data and model fine-tuning. Instead, it uses zero-shot or few-shot prompts to achieve intent escaping and rule matching, which significantly reduces data costs and training overhead and makes the method easier to engineer. To handle the diverse expressions, implicit semantics, and abstract terminology of policy texts, the invention standardizes and reconstructs the original text through semantic intent escaping: it eliminates differences in colloquial, contextual, and rhetorical expression, extracts a unified and standardized review intent, and improves the semantic alignment between policy clauses and legal provisions, giving subsequent matching a stable and unified semantic foundation.

[0014] This invention employs a hybrid retrieval mechanism that combines semantic vector retrieval and keyword retrieval. It recalls candidate regulatory entries from both semantic similarity and keyword matching, and then performs normalization and weighted fusion scoring to form a comprehensive ranking. The number of matches can be dynamically adjusted according to actual needs to achieve "one-to-many" related retrieval, meeting the actual needs of fair competition review for multi-basis comparison and comprehensive judgment.

[0015] Meanwhile, this invention encapsulates regulations and standards independently in a structured knowledge base whose entries can be flexibly added, deleted, and continuously updated. The model only performs general intent understanding and semantic matching, so it does not need retraining or fine-tuning when regulations are revised. At the architectural level, this addresses the pain point of traditional methods whose models become invalid when regulations are updated, enabling agile adaptation to the dynamic evolution of laws and regulations and improving the long-term effectiveness and maintainability of the system.

[0016] This invention achieves highly adaptable, highly accurate, and highly interpretable violation segmentation matching without relying on task-level model fine-tuning. The output results include matching scores and regulatory traceability, providing an interpretable and traceable decision-making basis for manual review. It significantly improves the intelligence level, comprehensiveness, accuracy, and timeliness of fair competition review, and has strong practical value and promotion prospects.

Attached Figure Description

[0017] The accompanying drawings are provided to further illustrate the invention and form part of the specification. They are used together with the embodiments of the invention to explain the invention and do not constitute a limitation thereof.

[0018] Figure 1 is a schematic diagram of the method flow of the present invention.

Detailed Implementation

[0019] To clearly illustrate the technical features of this solution, the invention will be described in detail below through specific implementation methods and in conjunction with the accompanying drawings.

[0020] Example 1: As shown in Figure 1, a fair competition review violation segmentation and matching method based on escape retrieval includes the following steps:
S1. Based on the core norms of China's currently effective fair competition review system, construct a local structured violation rule knowledge base and a violation rule text vector knowledge base.
S2. For each sentence to be reviewed for violations, use the intent escaping engine, built on a locally deployed lightweight large language model, to output the escaped query statement, core keywords, and extended keywords in structured form according to the preset prompt word template.
S3. Convert the escaped query statement into a query vector, match it against the violation rule text vector knowledge base, and recall a vector retrieval candidate set.
S4. Using the core keywords and extended keywords as the query term set, perform sentence-level BM25 retrieval over the structured violation rule knowledge base, compute relevance scores, and recall a keyword retrieval candidate set.
S5. Merge and re-rank the vector retrieval candidate set and the keyword retrieval candidate set to obtain a final candidate list sorted by comprehensive relevance, forming a structured mapping between the sentence under review and the corresponding violation rules.

[0021] In a specific implementation, S1 is as follows:
S1.1. Crawl the national normative documents on fair competition review that are currently in effect in a targeted manner, and sort out the prohibitions or restrictions in each field stipulated in those documents, including their chapters and articles.
S1.2. The obtained clauses are first segmented as text and then extracted using a single violation rule as the basic unit. If a legal item fully describes a single violation, it is extracted directly and retained as one violation rule; if a legal item contains multiple parallel violations, it is further split by semicolons, enumeration items, or independent semantics, so that each resulting fragment corresponds to one violation rule.
S1.3. Generate a unique identifier (ID) for each extracted violation rule and construct the corresponding metadata, including regulatory source information, a timeliness identifier, the review classification, core keywords, and an initial weight value. The initial weight $\omega_i$ of each violation rule is calculated as
$$\omega_i = \alpha L_i + \beta T_i, \quad i = 1, \dots, N,$$
where $N$ is the number of violation rules; $r_i$ denotes the $i$-th violation rule; $L_i$ is the predefined legal force level factor of the regulatory document corresponding to $r_i$; $T_i$ is the timeliness factor of that regulatory document; and $\alpha$ and $\beta$ are weight coefficients satisfying $\alpha + \beta = 1$. The timeliness factor $T_i$ is obtained from the difference $\Delta t_i$ between the current year $Y_{\mathrm{cur}}$ and the effective year $Y_i$ of the regulation corresponding to the violation rule, modeled with exponential decay:
$$\Delta t_i = Y_{\mathrm{cur}} - Y_i, \qquad T_i = e^{-\lambda \Delta t_i},$$
where the time decay coefficient $\lambda$ controls the rate at which the weight of a regulation decays over time. The extracted violation rules and their metadata are integrated with the original text of the violation rules to form violation rule knowledge units. All knowledge units are stored locally in a structured format to form the structured violation rule knowledge base.
S1.4. The open-source pre-trained Qwen3-4B lightweight large language model is selected as the base model for text encoding. The original text of the violation rule in each structured knowledge unit undergoes unified semantic vector encoding, which outputs a dense semantic feature vector $v_i$ of dimension 2560.
S1.5. Store the dense semantic feature vectors and their corresponding unique identifiers (IDs) in the locally deployed FAISS vector database to establish the violation rule text vector knowledge base. FAISS returns the several semantic vectors most similar to the query vector together with their IDs. These IDs are used to trace back the original text and metadata of the violation rules in the structured violation rule knowledge base.
S1.6. Update the structured violation rule knowledge base and the violation rule text vector knowledge base promptly, based on the violation rule entries in newly added or revised regulatory texts and on the structured knowledge entries corresponding to regulatory texts publicly announced as invalid.
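The segmentation in S1.2 and the ID assignment in S1.3 can be illustrated with a small sketch. It covers only the simplest case named in the text, splitting on semicolons; splitting by enumeration items or by independent semantics would need additional logic. The `RuleUnit` structure, the ID scheme, and the sample clause are hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class RuleUnit:
    """One violation rule knowledge unit (fields are an illustrative subset)."""
    rule_id: str
    text: str
    source: str


def split_clause(clause: str) -> list[str]:
    """Split a clause on ASCII or full-width semicolons, dropping empty parts."""
    parts = re.split(r"[;；]", clause)
    return [part.strip() for part in parts if part.strip()]


def build_units(source: str, clause: str) -> list[RuleUnit]:
    """Turn one clause into one knowledge unit per parallel violation."""
    return [
        # Hypothetical ID scheme: source label plus a two-digit sequence number.
        RuleUnit(rule_id=f"{source}-{index:02d}", text=text, source=source)
        for index, text in enumerate(split_clause(clause), start=1)
    ]


if __name__ == "__main__":
    clause = "restricting the entry of non-local goods; setting discriminatory fees;"
    for unit in build_units("ART8", clause):
        print(unit)
```

Keeping the source label on every unit is what later allows a retrieved ID to be traced back to the regulation it came from.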

[0022] In a specific implementation, S2 is as follows:
S2.1. For each sentence to be reviewed for violations, the Qwen-chat-8B lightweight large language model is selected as the core model for intent escaping. It is deployed locally, adapted and optimized, and its network call function is turned off.
S2.2. Construct a compound prompt word template for calling the lightweight large language model. During model inference, this template constrains both the implicit reasoning method and the final output content of the model.
S2.3. The compound prompt word template includes an implicit reasoning constraint module, which limits the reasoning path of the model when it processes the sentence to be reviewed. The module specifies the sequence of semantic analysis stages in the prompt words, so that the model completes the following stages internally and in order without explicitly outputting intermediate reasoning results: 1) identify the subject of the administrative action, the regulated object, and the mode of behavior; 2) determine the types of behavior excluding or restricting competition that may be involved; 3) perform semantic alignment and abstract understanding of the explicit expression of the text and the potential regulatory intent.
S2.4. The compound prompt word template also includes an escape text generation constraint module, which sets constraint prompt words so that, after completing the implicit reasoning stage, the model generates the escaped text used for semantic vector retrieval. The module specifies the semantic abstraction level and expression norms of the generated content, so that the model outputs an escaped review statement that meets the following constraints: 1) eliminate the rhetorical, colloquial, or context-dependent expressions in the original text to be reviewed; 2) use neutral, abstract, and general legal or policy semantics to describe the administrative behavior or regulatory measure; 3) preserve the core semantic elements of behavior type, target, and competitive impact while standardizing the expression.
S2.5. The compound prompt word template also includes a keyword generation constraint module, which sets constraint prompt words so that, in the same reasoning process, the model generates a set of terms for keyword retrieval based on its implicit reasoning results. The module specifies keyword extraction and expansion rules in the prompts, ensuring that the model output includes a term set containing: 1) core keywords, namely core words or phrases extracted from the original text to be reviewed that reflect the subject, manner, restrictive measures, or competitive impact of the administrative action; 2) extended keywords, namely extended search terms generated from the core keywords in combination with commonly used legal terms, synonyms or near-synonyms, and hierarchical concepts in the field of fair competition review.
S2.6. Format and encapsulate the intent escaping result, and output a structured JSON object that contains the following fields and no intermediate reasoning information or analysis descriptions: 1) the escaped query statement $Q_s$, a standardized query statement used for semantic vector retrieval; 2) the keyword set $W$, which includes a list of core keywords and a list of extended keywords.

[0023] The following is an example of the structured output for the sentence "A certain prefecture-level city stipulates that out-of-town chain supermarkets entering the local market must pay an additional industry access deposit of 2 million yuan, while local supermarkets are not required to pay it":
{
  "Escape query statement": "Administrative bodies impose additional capital payment requirements on non-local businesses during the market access phase, discriminating between local and non-local businesses, thus excluding or restricting market competition.",
  "Keyword set": {
    "Core keywords": ["Out-of-town chain supermarkets", "Local market", "Additional fees", "Entry deposit", "Local supermarkets", "No fees required"],
    "Extended keywords": ["Non-local operators", "Market access", "Additional capital requirements", "Discriminatory treatment", "Exclusionary competition", "Differentiated taxation"]
  }
}
Here is an example of a prompt word template: Transform the original policy text into a standardized, searchable query expression. Do not output any intermediate reasoning process or analysis description; return only the specified JSON format.

[0024] Task requirements: 1. Semantic analysis (executed internally only, no output): a) Identify the administrative actors, regulated entities, and specific behaviors (such as restrictive measures, differential treatment, etc.) in the text; b) Determine whether the behavior falls under the category of excluding or restricting competition (such as market access discrimination, preferential policies, restrictions on business conduct, etc.); c) Align the colloquial and contextual expressions in the text with legal terminology and abstract expressions in the field of fair competition review.

[0025] 2. Generation of escaped query statements (for semantic vector retrieval): a) Remove rhetoric and redundant information, retaining the core semantics (subject, object, method, competitive impact); b) Use neutral, abstract legal / policy language (e.g., "non-local operators" instead of "out-of-town supermarkets," "differential treatment" instead of "locals do not need to pay"); c) Clearly reflect the core intent of "excluding or restricting competition," ensuring alignment with the semantics of legal provisions.

[0026] 3. Keyword set generation (for BM25 search): a) Original keywords: Extract core words / phrases that reflect the subject, manner, restrictive measures, and competitive impact; b) Expanded keywords: Based on the original keywords, supplement synonyms, hierarchical concepts, and commonly used legal expressions in the field of fair competition review (e.g., "market access" expanded to "market access stage" or "entry threshold," "additional payment" expanded to "additional funding requirements" or "differentiated payment"); c) Keywords should cover the entire dimension of "subject-behavior-impact" to avoid omitting core semantics.

[0027] 4. Output format (mandatory; return only JSON, no other content): {"Escaped Query Statement (Qs)": "[Standardized query statement generated according to requirements]", "Keyword Set (W)": {"Core Keywords": ["【Original sentence core word 1】", "【Original sentence core word 2】", ...], "Extended Keywords": ["【Extended Word 1】", "【Extended Word 2】", ...]}}

In a specific implementation, S3 is as follows:
S3.1. Receive the structured JSON object output by step S2 and parse it to obtain the escaped query statement $Q_s$ used for semantic retrieval.
S3.2. Using the same text encoding model as in the construction stage of the violation rule text vector knowledge base, perform semantic vector encoding on the escaped query statement $Q_s$ to obtain the query vector $q$. Based on $q$, the goal is to select, from the vector set $V = \{v_1, \dots, v_N\}$ in the violation rule text vector knowledge base, the $K_{\mathrm{vec}}$ vectors with the highest semantic similarity to $q$. The objective is expressed as
$$C_{\mathrm{vec}} = \operatorname{TopK}_{v_j \in V}\big(\mathrm{sim}(q, v_j),\, K_{\mathrm{vec}}\big), \qquad \mathrm{sim}(q, v_j) = \frac{q \cdot v_j}{\lVert q \rVert\, \lVert v_j \rVert},$$
where $C_{\mathrm{vec}}$ is the vector retrieval candidate set; $\operatorname{TopK}$ selects from the vector set $V$ the $K_{\mathrm{vec}}$ candidate regulatory knowledge items with the highest semantic similarity scores; and $\mathrm{sim}(q, v_j)$ is the semantic relevance score between the query vector $q$ and the vector $v_j$ of a regulatory knowledge item, calculated using cosine similarity.
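Step S3.1 depends on the model actually returning well-formed JSON, so a small validating parser is a natural companion to the prompt template. The sketch below is illustrative only: the key spellings are taken from the output format in the template and may differ in a real deployment, and the function name `parse_escape_output` and the sample payload are hypothetical.

```python
import json

# Key names assumed from the prompt template's output format.
QUERY_KEY = "Escaped Query Statement (Qs)"
KEYWORDS_KEY = "Keyword Set (W)"
CORE_KEY = "Core Keywords"
EXTENDED_KEY = "Extended Keywords"


def parse_escape_output(raw: str) -> tuple[str, list[str]]:
    """Parse the model output into (escaped query, merged keyword list)."""
    obj = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    query = obj.get(QUERY_KEY)
    if not isinstance(query, str) or not query.strip():
        raise ValueError("missing or empty escaped query statement")
    keyword_set = obj.get(KEYWORDS_KEY)
    if not isinstance(keyword_set, dict):
        raise ValueError("missing keyword set")
    core = list(keyword_set.get(CORE_KEY, []))
    extended = list(keyword_set.get(EXTENDED_KEY, []))
    # Merge core and extended keywords, removing duplicates but keeping order.
    merged = list(dict.fromkeys(core + extended))
    return query.strip(), merged


if __name__ == "__main__":
    raw_output = json.dumps({
        QUERY_KEY: "Administrative bodies impose additional capital requirements on non-local operators.",
        KEYWORDS_KEY: {
            CORE_KEY: ["Entry deposit", "Local market"],
            EXTENDED_KEY: ["Market access", "Entry deposit"],
        },
    })
    query, keywords = parse_escape_output(raw_output)
    print(query)
    print(keywords)
```

The query string then feeds the vector retrieval of S3, while the merged keyword list feeds the BM25 retrieval of S4.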

[0028] In a specific implementation, S4 is as follows: S4.1. Receive the structured JSON object output by step S2 and parse it to obtain the keyword set $W = \{w_1, w_2, \dots, w_n\}$ for keyword retrieval, where $n$ is the number of keywords in the keyword set and $w_i$ is the $i$-th keyword, $i = 1, 2, \dots, n$; S4.2. Based on the keyword set $W$, perform BM25 keyword retrieval over the inverted index built on the local regulatory knowledge base, with the goal of selecting from the structured violation criteria knowledge base the $K_w$ violation criterion knowledge units with the highest matching degree to form the keyword retrieval candidate set $C_w$. The objective function is expressed as:

$$C_w = \operatorname{TopK}_{d_j \in D}\, \operatorname{BM25}(W, d_j),$$

where $D$ denotes the set of violation criterion knowledge units; $d_j$ denotes the original text of the $j$-th violation criterion in $D$; $\operatorname{TopK}$ denotes sorting by BM25 score from highest to lowest and selecting the top $K_w$ violation criterion knowledge units; and $\operatorname{BM25}(W, d_j)$ is the query scoring function for keyword retrieval, which measures the degree of keyword matching between the original text $d_j$ and the keyword set $W$ and takes the standard BM25 form:

$$\operatorname{BM25}(W, d_j) = \sum_{i=1}^{n} \operatorname{IDF}(w_i) \cdot \frac{f(w_i, d_j)\,(k_1 + 1)}{f(w_i, d_j) + k_1 \left(1 - b + b \, \frac{|d_j|}{\mathrm{avgdl}}\right)},$$

where $f(w_i, d_j)$ is the frequency of keyword $w_i$ in the original text $d_j$; $|d_j|$ is the text length of entry $d_j$; $\mathrm{avgdl}$ is the average text length of the entries in the structured violation criteria knowledge base; $k_1$ and $b$ are the tuning parameters of the BM25 algorithm, which control term frequency saturation and the strength of text length normalization, respectively; and the inverse document frequency $\operatorname{IDF}(w_i)$ measures the discriminative power of keyword $w_i$ within the knowledge base:

$$\operatorname{IDF}(w_i) = \ln\!\left(\frac{N - n(w_i) + 0.5}{n(w_i) + 0.5} + 1\right),$$

where $N$ is the total number of structured legal knowledge entries in the structured violation criteria knowledge base (a structured legal knowledge entry being the smallest retrieval unit after hierarchical segmentation and semantic normalization), and $n(w_i)$ is the number of entries in the knowledge base that contain keyword $w_i$. The entries are ranked from highest to lowest by BM25 score and the top $K_w$ are recalled, forming the keyword retrieval candidate set $C_w$.
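The BM25 scoring of S4 can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: it assumes the criterion texts are already tokenized into word lists, scans every document instead of using an inverted index, uses the common IDF variant with "+1" inside the logarithm, and picks k1 = 1.5 and b = 0.75 as illustrative defaults. The function names and sample tokens are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(keywords, docs, k1=1.5, b=0.75):
    """Score every document against the keyword set.

    docs maps criterion ID -> list of tokens; returns ID -> BM25 score.
    """
    if not docs:
        return {}
    n_docs = len(docs)
    avgdl = sum(len(tokens) for tokens in docs.values()) / n_docs or 1.0
    terms = set(keywords)  # each distinct keyword contributes once
    # Document frequency: number of entries containing each keyword.
    df = {w: sum(1 for tokens in docs.values() if w in tokens) for w in terms}
    scores = {}
    for cid, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for w in terms:
            f = tf[w]
            if f == 0:
                continue
            idf = math.log((n_docs - df[w] + 0.5) / (df[w] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(tokens) / avgdl
            score += idf * f * (k1 + 1) / (f + k1 * length_norm)
        scores[cid] = score
    return scores


def top_k(scores, k):
    """Return the k highest-scoring (ID, score) pairs, best first."""
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    toy_docs = {
        "R-001": ["market", "access", "restrict"],
        "R-045": ["differential", "treatment", "market"],
        "R-100": ["subsidy"],
    }
    print(top_k(bm25_scores(["market", "access"], toy_docs), k=2))
```

For Chinese regulatory text, a word segmenter would be needed before this step, and the keyword list would be the union of the core and extended keywords produced in S2.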

[0029] In a specific implementation, S5 is as follows: S5.1. For each violation criterion $r$ retrieved in the vector retrieval candidate set $C_s$ and the keyword retrieval candidate set $C_w$, apply Min-Max normalization to its scores to obtain the normalized semantic score $\hat{s}_s(r)$ and the normalized keyword score $\hat{s}_w(r)$; S5.2. For violation criteria that appear in both $C_s$ and $C_w$, as well as those that appear in only one of the two candidate sets, construct a unified hybrid retrieval comprehensive scoring function:

$$S(r) = \alpha \, \hat{s}_s(r) + \beta \, \hat{s}_w(r),$$

where $\alpha$ and $\beta$ are configurable weight parameters satisfying $\alpha + \beta = 1$; S5.3. Sort the candidate violation criteria in descending order of the hybrid retrieval comprehensive score $S(r)$, and select a preset number of violation criteria from the sorted results to form the set of candidate violation criteria used to generate the fair competition review result; S5.4. The structured output includes the set of candidate violation criteria corresponding to the text under review, the corresponding regulatory attribute information, and their order as determined by the hybrid retrieval comprehensive score.
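The fusion step of S5 can be sketched as follows. This is a minimal illustration under stated assumptions: a criterion missing from one candidate set is given a normalized score of 0 for that channel, a candidate set whose scores are all equal is normalized to 1.0, and the weights 0.6 and 0.4 are illustrative values only. The function names and sample scores are hypothetical.

```python
def min_max(scores):
    """Min-Max normalize a dict of ID -> raw score into the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # Assumption: identical scores all map to 1.0.
        return {cid: 1.0 for cid in scores}
    return {cid: (s - lo) / (hi - lo) for cid, s in scores.items()}


def fuse(semantic, keyword, alpha=0.6, beta=0.4, top_n=5):
    """Weighted fusion of the two candidate sets; returns ranked (ID, score)."""
    if abs(alpha + beta - 1.0) > 1e-9:
        raise ValueError("alpha and beta must sum to 1")
    sem_norm, kw_norm = min_max(semantic), min_max(keyword)
    fused = {
        # Assumption: a criterion absent from one channel scores 0 there.
        cid: alpha * sem_norm.get(cid, 0.0) + beta * kw_norm.get(cid, 0.0)
        for cid in set(sem_norm) | set(kw_norm)
    }
    # Sort by score descending; break ties by ID for a stable order.
    ranked = sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    semantic = {"R-001": 0.9, "R-045": 0.5, "R-100": 0.1}  # cosine scores
    keyword = {"R-001": 4.0, "R-200": 2.0}                 # BM25 scores
    for cid, score in fuse(semantic, keyword, top_n=3):
        print(cid, round(score, 3))
```

Normalizing each channel separately before weighting keeps the unbounded BM25 scores from dominating the cosine scores, which are bounded by 1.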

[0030] For a regulation in a certain prefecture-level city that requires out-of-town chain supermarkets to pay an additional industry access deposit of 2 million yuan to enter the local market, while exempting local supermarkets from this requirement, the final structured mapping result is as follows: [ {"Sorting": 1, "Violation Criterion ID": "R-001", "Overall Score": 0.85, "Legal Source Information": {"Full Name of Regulation": "Implementation Measures of the Fair Competition Review Regulations", "Issuing Department": "State Administration for Market Regulation", "Legal Validity Level": "Departmental Rules", "Year of Effectiveness": 2025}, "Review Category": "Market Access Category", "Original Text of Violation Criterion": "Market access licensing management measures that illegally add market entry prohibitions or restrict the qualifications, ownership forms, equity ratios, business scope, business formats, and business models of business entities"}, {"Sorting": 2, "Violation Criterion ID": "R-045", "Overall Score": 0.401, "Legal Source Information": {"Full Name of Regulation": "Detailed Rules for the Implementation of Fair Competition Review", "Issuing Department": "State Administration for Market Regulation, National Development and Reform Commission, Ministry of Finance, Ministry of Commerce, Ministry of Justice", "Legal Validity Level": "Departmental Rules", "Year of Effectiveness": 2021}, "Review Category": "Market Access Category", "Original Text of Violation Criterion": "Without legal, administrative, or State Council provisions, unreasonable differential treatment is imposed on operators of different ownership, regions, or organizational forms, and unequal market access and exit conditions are set."}, … ]

This embodiment is evaluated on a dataset of 100 manually constructed violation sentences. Each sentence is manually labeled with 1 to 5 true violation criteria, forming a gold-standard "sentence to relevant criteria" mapping set that covers four fair competition review scenarios: market access, commodity flow, preferential policies, and business conduct. The following two core indicators are used to evaluate the performance of the method of this invention in terms of candidate recall and ranking accuracy of violation criteria, as shown in Table 1.

Recall (Recall@K): Recall evaluates the "comprehensive recall capability" of the method for the relevant violation criteria of a single violation sentence, that is, the number of relevant violation criteria successfully hit in the top K ranked results divided by the total number of relevant violation criteria labeled for that sentence. The final result is the average over the 100 violation sentences.

[0031] Hit rate (Hit@K): The hit rate evaluates the "bottom-line recall effectiveness" of the method, that is, the proportion of the 100 violation sentences whose top K ranked results contain at least one manually labeled relevant violation criterion.
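The two indicators can be computed as in the following minimal sketch. The function names and the three sample records are hypothetical and are not taken from the 100-sentence dataset; each record pairs a ranked list of retrieved criterion IDs with the set of manually labeled relevant IDs.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the labeled relevant criteria found in the top k results."""
    hits = set(ranked_ids[:k]) & set(relevant_ids)
    return len(hits) / len(relevant_ids)


def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if at least one labeled relevant criterion is in the top k, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0


def evaluate(samples, k):
    """samples is a list of (ranked_ids, relevant_ids); returns (Recall@K, Hit@K)."""
    n = len(samples)
    mean_recall = sum(recall_at_k(r, g, k) for r, g in samples) / n
    hit_rate = sum(hit_at_k(r, g, k) for r, g in samples) / n
    return mean_recall, hit_rate


if __name__ == "__main__":
    samples = [
        (["R-001", "R-045", "R-100"], ["R-001", "R-045"]),  # both found
        (["R-200", "R-300"], ["R-001"]),                    # none found
        (["R-001", "R-300"], ["R-001", "R-045"]),           # one of two found
    ]
    print(evaluate(samples, k=2))
```

Recall@K averages a per-sentence ratio, so a sentence with many labeled criteria can lower it even when Hit@K counts that sentence as a success.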

[0032] Table 1. Performance evaluation

The above results indicate that, by introducing an intent escape mechanism based on a lightweight large language model and combining it with a weighted fusion of semantic vector retrieval and keyword retrieval, this invention achieves efficient candidate recall and reasonable ranking of violation criteria in complex mapping scenarios where one sentence relates to multiple violation criteria. In the experiment, when the top 5 candidates are returned, the hit rate (Hit@5) reached 95% and the recall (Recall@5) reached 85%, which supports the accuracy and stability of the method in fair competition review applications.

[0033] Although the specific embodiments of the invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the invention. Based on the technical solutions of the invention, various modifications or variations that can be made by those skilled in the art without creative effort are still within the scope of protection of the invention.

Claims

1. A fair competition review violation segmentation and matching method based on escape retrieval, characterized in that it includes the following steps: S1. Based on the core norms of the currently effective fair competition review system in China, construct a local structured violation criteria knowledge base and a violation criteria text vector knowledge base; S2. For a sentence to be reviewed for violations, use an intent escape engine built on a locally deployed lightweight large language model to output, in structured form and according to a preset prompt template, the escaped query statement, the core keywords, and the extended keywords; S3. Convert the escaped query statement into a query vector, perform vector matching and recall against the violation criteria text vector knowledge base, and generate a vector retrieval candidate set; S4. Using the core keywords and extended keywords as the query term set, perform sentence-level retrieval based on the BM25 algorithm against the structured violation criteria knowledge base, calculate relevance scores and recall, and generate a keyword retrieval candidate set; S5. Merge and re-rank the vector retrieval candidate set and the keyword retrieval candidate set to obtain a final candidate list sorted by comprehensive relevance, forming a structured mapping result between the sentence to be reviewed and the corresponding violation criteria.

2. The fair competition review violation segmentation and matching method based on escape retrieval according to claim 1, characterized in that S1 is as follows: S1.1. Perform targeted crawling of currently effective national normative documents on fair competition review, and extract the prohibitive or restrictive clauses in each field; S1.2. Using a single violation as the basic unit, segment and extract the clauses: if a legal provision fully describes a single violation, it is directly extracted as one violation criterion; if a legal provision contains multiple parallel violations, it is further segmented by semicolons, enumeration items, or independent semantics, so that each segment corresponds to one violation criterion; S1.3. Generate a unique identifier ID for each violation criterion and construct metadata including regulatory source traceability, timeliness identifier, review category, core keywords, and an initial weight value. The initial weight $\omega_i$ of each violation criterion is calculated as:

$$\omega_i = \lambda_1 L_i + \lambda_2 T_i,$$

where the number of violation criteria is $M$, $i = 1, 2, \dots, M$, and $r_i$ denotes the $i$-th violation criterion; $L_i$ denotes the predefined legal validity level factor of the regulatory document corresponding to $r_i$; $T_i$ denotes the timeliness factor of the regulatory document corresponding to $r_i$; and $\lambda_1$ and $\lambda_2$ are weight coefficients satisfying $\lambda_1 + \lambda_2 = 1$. The timeliness factor $T_i$ is calculated as follows: taking the current year as $y_{\mathrm{now}}$ and the effective year of the regulation corresponding to the violation criterion as $y_i$, the difference between the two gives the time difference $\Delta t_i$, and the timeliness factor is obtained by modeling $\Delta t_i$ with exponential decay:

$$\Delta t_i = y_{\mathrm{now}} - y_i, \qquad T_i = e^{-\gamma \, \Delta t_i},$$

where $\gamma$ denotes the time-decay coefficient. The violation criteria, metadata, and original text are integrated into violation criterion knowledge units and stored in structured form to constitute the structured violation criteria knowledge base; S1.4. Use an open-source pre-trained lightweight large language model as the text encoding model to perform semantic vector encoding on the original text $d_i$ of each violation criterion, outputting a dense semantic feature vector $v_i$; S1.5. Store the dense semantic feature vectors and their corresponding unique identifiers (IDs) in a local FAISS vector database to construct the violation criteria text vector knowledge base. FAISS returns the semantic vectors most similar to the query vector together with their IDs, and the IDs are used to trace back the original text and metadata in the structured violation criteria knowledge base; S1.6. Update the structured violation criteria knowledge base and the violation criteria text vector knowledge base in real time according to the addition, revision, or invalidation of regulations.

3. The fair competition review violation segmentation and matching method based on escape retrieval according to claim 1, characterized in that S2 is as follows: S2.1. For a sentence to be reviewed for violations, use a lightweight large language model that is deployed locally and involves no network calls as the core model for intent escaping; S2.2. Construct a compound prompt template that constrains both the internal reasoning path and the final output content of the model; S2.3. The compound prompt template includes an implicit reasoning constraint module, which restricts the model to complete the following in sequence: 1) identify the subject, object, and mode of the administrative action; 2) identify the type of behavior that excludes or restricts competition; 3) perform semantic alignment and abstract understanding of the explicit statements and the potential regulatory intent in the text; and the model does not output the intermediate reasoning process; S2.4. The compound prompt template also includes an escape text generation constraint module, which constrains the model to generate a standardized escaped query statement that: 1) eliminates rhetoric, colloquialisms, and context-dependent expressions; 2) uses neutral, abstract, and universally applicable policy and legal expressions; 3) retains the core semantics of behavior type, target, and competitive impact; S2.5. The compound prompt template also includes a keyword generation constraint module, which constrains the model to generate at the same time: core keywords, which are the core words/phrases extracted from the sentence to be reviewed, covering the subject, behavior, measures, and competitive impact; and extended keywords, which are derived by combining the core keywords with domain terminology, synonyms/near-synonyms, and hierarchical concepts; S2.6. The intent escaping result is output as a structured JSON object containing only the escaped query statement and the keyword set.

4. The fair competition review violation segmentation and matching method based on escape retrieval according to claim 3, characterized in that S3 is specifically as follows: S3.1. Parse the structured JSON object to obtain the escaped query statement $Q_s$; S3.2. Using the same text encoding model as the one used to construct the violation criteria text vector knowledge base, encode $Q_s$ to obtain the query vector $q$; from the vector set $V$ of the vector knowledge base, select the $K_s$ vectors with the highest semantic similarity to $q$ to form the vector retrieval candidate set $C_s$. The calculation formula is as follows:

$$C_s = \operatorname{TopK}_{v_i \in V}\, \operatorname{sim}(q, v_i),$$

where $C_s$ denotes the vector retrieval candidate set; $\operatorname{TopK}$ denotes selecting from the vector set $V$ the $K_s$ candidate regulatory knowledge items with the highest semantic similarity scores; and $\operatorname{sim}(q, v_i)$ is the semantic relevance score between the query vector $q$ and the regulatory knowledge item vector $v_i$, calculated using cosine similarity.

5. The fair competition review violation segmentation and matching method based on escape retrieval according to claim 3, characterized in that S4 is as follows: S4.1. Parse the structured JSON object to obtain the keyword set $W = \{w_1, w_2, \dots, w_n\}$, where $n$ is the number of keywords in the keyword set and $w_i$ is the $i$-th keyword, $i = 1, 2, \dots, n$; S4.2. Based on the inverted index, perform BM25 keyword retrieval in the structured violation criteria knowledge base, and select the $K_w$ knowledge units with the highest matching degree to form the keyword retrieval candidate set $C_w$. The calculation formula is as follows:

$$C_w = \operatorname{TopK}_{d_j \in D}\, \operatorname{BM25}(W, d_j),$$

where $D$ denotes the set of violation criterion knowledge units; $d_j$ denotes the original text of the $j$-th violation criterion in $D$; $\operatorname{TopK}$ denotes sorting by BM25 score from highest to lowest and selecting the top $K_w$ violation criterion knowledge units; and $\operatorname{BM25}(W, d_j)$ is the query scoring function for keyword retrieval, which measures the degree of keyword matching between the original text $d_j$ and the keyword set $W$ and takes the standard BM25 form:

$$\operatorname{BM25}(W, d_j) = \sum_{i=1}^{n} \operatorname{IDF}(w_i) \cdot \frac{f(w_i, d_j)\,(k_1 + 1)}{f(w_i, d_j) + k_1 \left(1 - b + b \, \frac{|d_j|}{\mathrm{avgdl}}\right)},$$

where $f(w_i, d_j)$ is the frequency of keyword $w_i$ in the original text $d_j$; $|d_j|$ is the text length of entry $d_j$; $\mathrm{avgdl}$ is the average text length of the entries in the structured violation criteria knowledge base; $k_1$ and $b$ are the tuning parameters of the BM25 algorithm, which control term frequency saturation and the strength of text length normalization, respectively; and the inverse document frequency $\operatorname{IDF}(w_i)$ measures the discriminative power of keyword $w_i$ within the knowledge base:

$$\operatorname{IDF}(w_i) = \ln\!\left(\frac{N - n(w_i) + 0.5}{n(w_i) + 0.5} + 1\right),$$

where $N$ is the total number of structured legal knowledge entries in the structured violation criteria knowledge base (a structured legal knowledge entry being the smallest retrieval unit after hierarchical segmentation and semantic normalization), and $n(w_i)$ is the number of entries in the knowledge base that contain keyword $w_i$. The entries are ranked from highest to lowest by BM25 score and the top $K_w$ are recalled, forming the keyword retrieval candidate set $C_w$.

6. The fair competition review violation segmentation and matching method based on escape retrieval according to claim 1, characterized in that S5 is as follows: S5.1. For each violation criterion $r$ retrieved in the vector retrieval candidate set $C_s$ and the keyword retrieval candidate set $C_w$, apply Min-Max normalization to its scores to obtain the normalized semantic score $\hat{s}_s(r)$ and the normalized keyword score $\hat{s}_w(r)$; S5.2. For violation criteria that appear in both $C_s$ and $C_w$, as well as those that appear in only one of the two candidate sets, construct a unified hybrid retrieval comprehensive scoring function:

$$S(r) = \alpha \, \hat{s}_s(r) + \beta \, \hat{s}_w(r),$$

where $\alpha$ and $\beta$ are configurable weight parameters satisfying $\alpha + \beta = 1$; S5.3. Sort the candidate violation criteria in descending order of the hybrid retrieval comprehensive score $S(r)$, and select a preset number of violation criteria from the sorted results to form the set of candidate violation criteria used to generate the fair competition review result; S5.4. The structured output includes the set of candidate violation criteria corresponding to the text under review, the corresponding regulatory attribute information, and their order as determined by the hybrid retrieval comprehensive score.