A technical bid similarity detection method and system based on dual-dimensional semantic recognition
By using a technical bid similarity detection method based on dual-dimensional semantic recognition, the problem of high false positive and false negative rates and low efficiency of traditional detection methods in identifying technical bid similarities is solved. This method achieves accurate and efficient identification of technical bids and reliable evidence preservation, thereby improving the fairness and impartiality of the bidding market.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU GOLDEN SOFTWARE SYST INC
- Filing Date
- 2026-03-18
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional plagiarism detection methods cannot effectively identify plagiarism through synonym substitution, sentence transformation, or semantic rewriting, and lack an understanding of industry-specific contexts, resulting in high false positive rates, low efficiency, and weak evidence chains.
A technical bid similarity detection method based on dual-dimensional semantic recognition is adopted. Multi-format technical bid files are parsed to generate standardized text data. The text is divided into semantic modules based on a preset industry knowledge base, initial weight values are configured, and the character similarity rate and semantic similarity are calculated using sliding hash calculation and a pre-trained semantic model. A visual analysis report is generated and its hash value is stored on the blockchain.
It achieves accurate and efficient identification of similar technical bids, reduces the false negative rate, improves detection efficiency, and ensures the transparency and reliability of the analysis process through weighted risk values and blockchain evidence storage, providing a solid technical guarantee for the bidding market.
Figure CN122242517A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer information processing technology, and in particular to a technical bid similarity detection method and system based on dual-dimensional semantic recognition.
Background Technology
[0002] With the full digitalization and standardization of bidding activities, the independence and originality of technical bids (such as construction organization designs and service plans) have become key indicators for measuring the fairness of bidding. In recent years, regulatory authorities have generally required similarity detection of technical bids to curb bid rigging and collusion, in order to identify the risk of collaborative bidding. Traditional detection methods mostly rely on text similarity calculations, such as character-based continuous matching or statistical similarity algorithms, to determine similarity by comparing whether the character sequences in the bid documents are highly consistent. These methods can detect plagiarism by direct copying and pasting to a certain extent, and because they are simple to implement and have reasonable computational efficiency, they were widely used in public resource trading centers and centralized procurement platforms in the early days.
[0003] However, as bidders upgrade their circumvention methods, traditional detection methods have gradually revealed many limitations. On the one hand, because they rely on character-level comparison alone, they cannot effectively identify plagiarism achieved through covert means such as synonym substitution, sentence transformation, or semantic rewriting, resulting in a high rate of missed detection for similar content that is "different in expression but identical in core solution." On the other hand, due to a lack of understanding of industry-specific contexts, such systems are prone to misjudging standard terminology or normative expressions in technical bids as malicious plagiarism, leading to an increased false positive rate. This not only increases the workload of regulatory review but may also affect the fair treatment of legitimate bids. Furthermore, because the algorithms compare all texts pairwise, efficiency is low when processing large numbers of files, making it difficult to meet the demand for rapid response in real-world scenarios. At the same time, the generated plagiarism reports often provide only simple similarity ratios and fail to distinguish the differences in importance between core innovative modules and general normative content in technical bids, making the evidence chain weak and unconvincing.
Summary of the Invention
[0004] To address the aforementioned technical issues, this application provides a technical bid similarity detection method and system based on dual-dimensional semantic recognition.
[0005] Firstly, this application provides a technical bid similarity detection method based on dual-dimensional semantic recognition, employing the following technical solution:
[0006] A technical bid similarity detection method based on dual-dimensional semantic recognition, the detection method comprising:
[0007] Parse multi-format technical bid files, extract text content, perform standardization and cleaning processes, and generate standardized text data that retains paragraph structure;
[0008] Based on a pre-built industry knowledge base, the standardized text data is divided into multiple semantic modules, and the initial weight values of each semantic module are configured to generate a modular text set with weight labels.
[0009] Based on the initial weight values of each semantic module in the modular text set, corresponding window configuration parameters are generated. According to the window configuration parameters, sliding hash calculation is performed on the text content to locate similar character segments and calculate the character similarity rate.
[0010] The modular text set is processed by extracting text vectors through a pre-trained semantic model, calculating semantic similarity, and combining contextual relevance verification to output semantically similar segments.
[0011] The character similarity rate and semantic similarity are combined, and the similarity level is mapped based on a preset judgment matrix. The weighted risk value is calculated by combining the initial weight values of each semantic module.
[0012] By integrating the similar character segments, semantically similar segments, and weighted risk values, a visual analysis report is generated, and the report's hash value is extracted and written to the blockchain for evidence storage.
[0013] By adopting the above technical solutions, a complete technical chain has been constructed, from multi-format text parsing, intelligent semantic module segmentation, adaptive character and semantic dual-dimensional detection, to final weighted risk quantification and blockchain evidence solidification. This systematically solves the four core pain points of traditional technical bid similarity detection: single dimension, high false positive and false negative rates, low efficiency, and weak evidence chain. Deeply integrating technologies such as deep learning and natural language processing with professional knowledge of bidding and tendering supervision, the detection system has evolved from a simple text comparison tool into an intelligent decision support system with semantic understanding capabilities and expert-level risk judgment wisdom. Ultimately, it achieves accurate, efficient, and conclusive identification and judgment of bid rigging behavior, providing a solid technical guarantee for maintaining the fairness and impartiality of the bidding and tendering market.
[0014] Secondly, this application provides a technical bid similarity detection system based on dual-dimensional semantic recognition, which adopts the following technical solution:
[0015] A technical bid similarity detection system based on dual-dimensional semantic recognition, the detection system comprising:
[0016] The data standardization module is used to parse multi-format technical bid files, extract text content, perform standardization and cleaning processes, and generate standardized text data that retains paragraph structure.
[0017] The weight assignment module is used to divide the standardized text data into multiple semantic modules based on a pre-set industry knowledge base, configure the initial weight value of each semantic module, and generate a modular text set with weight labels.
[0018] The character similarity detection module is used to generate corresponding window configuration parameters based on the initial weight values of each semantic module in the modular text set, perform sliding hash calculation on the text content according to the window configuration parameters, locate similar character segments and calculate the character similarity rate;
[0019] The context verification module is used to extract text vectors from the modular text set through a pre-trained semantic model, calculate semantic similarity, and output semantically similar segments in combination with contextual relevance verification.
[0020] The risk quantification module is used to integrate the character similarity rate and semantic similarity, map the similarity level based on a preset judgment matrix, and calculate the weighted risk value by integrating the initial weight values of each semantic module.
[0021] The visualization report generation module is used to integrate the similar character segments, semantically similar segments, and weighted risk values to generate a visual analysis report;
[0022] The blockchain evidence storage module is used to extract the report hash value and write it to the blockchain for evidence storage.
[0023] Thirdly, this application provides a computer device, which adopts the following technical solution:
[0024] A computer device includes a memory, a processor, and a computer program stored in the memory, the processor executing the computer program to perform the steps of the method as described in the first aspect.
[0025] Fourthly, this application provides a computer-readable storage medium, which adopts the following technical solution:
[0026] A computer-readable storage medium storing a computer program that can be loaded by a processor and executed to perform any of the methods of the first aspect.
[0027] In summary, this application includes at least one of the following beneficial technical effects: By integrating a dual-dimensional identification mechanism of character-level similarity detection and semantic-level deep analysis, the accuracy and efficiency of detecting plagiarism in technical bid documents are significantly improved. On the one hand, based on the sliding hash algorithm, text-level similar segments are quickly located and character similarity rates are calculated, effectively capturing literal plagiarism. On the other hand, by combining a pre-trained semantic model and contextual verification, hidden similar content at the semantic level is accurately identified, avoiding missed detections due to differences in expression. At the same time, modular weight allocation driven by an industry knowledge base, together with weighted risk value calculation, prioritizes the key semantic modules and makes the detection results more relevant. The final visual report and blockchain evidence ensure the transparency, traceability, and tamper resistance of the analysis process, providing an efficient and reliable risk assessment basis for bidding and tendering reviews.
Attached Figure Description
[0028] Figure 1 is a first schematic flowchart of a technical bid similarity detection method based on dual-dimensional semantic recognition, according to one embodiment of this application.
[0029] Figure 2 is a second schematic flowchart of a technical bid similarity detection method based on dual-dimensional semantic recognition, according to one embodiment of this application.
[0030] Figure 3 is a third schematic flowchart of a technical bid similarity detection method based on dual-dimensional semantic recognition, according to one embodiment of this application.
[0031] Figure 4 is a fourth schematic flowchart of a technical bid similarity detection method based on dual-dimensional semantic recognition, according to one embodiment of this application.
[0032] Figure 5 is a fifth schematic flowchart of a technical bid similarity detection method based on dual-dimensional semantic recognition, according to one embodiment of this application.
[0033] Figure 6 is a sixth schematic flowchart of a technical bid similarity detection method based on dual-dimensional semantic recognition, according to one embodiment of this application.
[0034] Figure 7 is a seventh schematic flowchart of a technical bid similarity detection method based on dual-dimensional semantic recognition, according to one embodiment of this application.
Detailed Implementation
[0035] To make the purpose, technical solution, and advantages of this application clearer, the application is described in further detail below with reference to Figures 1-7 and the embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of the application.
[0036] This application discloses a technical bid similarity detection method based on dual-dimensional semantic recognition.
[0037] Referring to Figure 1, a technical bid similarity detection method based on dual-dimensional semantic recognition includes:
[0038] Step S101: Parse the multi-format technical bid files, extract the text content and perform standardized cleaning to generate standardized text data that retains paragraph structure;
[0039] Technical bid documents come from diverse sources, including native electronic documents (such as Word and PDF) and scanned copies (such as scanned versions of paper documents), which is the first processing challenge. The system first identifies the file type through a format parsing engine. For electronic documents, it directly parses the internal structure (such as the XML format of Word), extracting the text while preserving layout information such as font, indentation, and paragraph marks; this information may serve as auxiliary evidence when later judging traces of plagiarism (for example, original formatting carried over by copying). For scanned copies, it uses optical character recognition (OCR) combined with layout restoration algorithms to convert images into editable text and reconstruct document structures such as paragraphs and headings.
[0040] Following this, a standardized text cleaning process is performed to eliminate non-substantive differences and interfering information, so that all texts are compared against a unified benchmark. Headers, footers, page numbers, and content from fixed templates provided by the bidding party are removed automatically, which avoids misjudging these common and inevitably identical parts as plagiarism and keeps the focus on the core content written by the bidder. Special symbols, redundant spaces, and line breaks are filtered, and number and unit formats are standardized (e.g., standardizing "10mm" and "10 millimeters" to "10 millimeters"). This counters a simple disguise tactic in which irrelevant symbols are inserted or unit spellings are changed to circumvent basic detection based on continuous character matching. After this step, the system obtains standardized text data with a clear structure and uniform format, laying a reliable foundation for subsequent module division and comparison.
[0041] Step S102: Based on a pre-built industry knowledge base, the standardized text data is divided into multiple semantic modules, and the initial weight values of each semantic module are configured to generate a modular text set with weight labels.
[0042] Traditional methods treat the entire technical bid as a black box for similarity calculation, failing to distinguish between key innovative points and routine content, which leads to misjudgments and omissions. This step instead uses a pre-built industry knowledge base (containing structural patterns and keyword libraries of typical technical bids in engineering, service, and other fields) to drive a text classification model (usually fine-tuned from pre-trained models such as BERT) that parses the standardized text.
[0043] Specifically, by analyzing the vocabulary, context, and semantics of each paragraph or chapter, it automatically categorizes them into predefined semantic modules, such as: "core technology modules" (e.g., specific construction techniques and service process designs), "general specification modules" (e.g., safety production clauses in national standards and formatted qualification statements), and "project-specific modules" (e.g., customized solutions tailored to the characteristics of this project).
[0044] Next, each semantic module is assigned an initial weight value. This operation embodies the concept of risk quantification, as the weight value directly reflects the importance of the module's content in the similarity determination. For example, setting the weight of the "core technology module" in the high range of 60%-80% means that similarities found here will have a decisive impact on the final risk assessment; while the weight of the "general specifications module" is only 10%-20%, and its contribution to similarity will be significantly diluted. In this way, the system no longer simply calculates the full-text similarity, but generates a "modular text set with weighted labels," allowing subsequent detection to focus on the parts that truly reflect the originality and core value of the bidding proposal.
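The module division and weight labeling described in step S102 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: a keyword count stands in for the fine-tuned classification model, and the names `KNOWLEDGE_BASE`, `INITIAL_WEIGHTS`, `classify`, and `build_modular_text_set` are hypothetical. The core and general weights are picked from inside the ranges quoted above; the project-specific weight and the fallback label are assumptions, since the text does not state them.

```python
from dataclasses import dataclass

# Hypothetical keyword lists standing in for the industry knowledge base.
KNOWLEDGE_BASE = {
    "core_technology": ["construction technique", "process design", "hoisting"],
    "general_specification": ["national standard", "safety production", "qualification"],
    "project_specific": ["this project", "site condition"],
}

# Core and general values lie inside the quoted 60%-80% and 10%-20% ranges;
# the project-specific value is an assumption.
INITIAL_WEIGHTS = {
    "core_technology": 0.70,
    "general_specification": 0.15,
    "project_specific": 0.40,
}


@dataclass
class WeightedModule:
    label: str
    weight: float
    text: str


def classify(paragraph: str) -> str:
    """Pick the module whose keywords occur most often in the paragraph."""
    lowered = paragraph.lower()
    scores = {
        label: sum(lowered.count(keyword) for keyword in keywords)
        for label, keywords in KNOWLEDGE_BASE.items()
    }
    best = max(scores, key=scores.get)
    # Assumption: paragraphs with no keyword hit are treated as general content.
    return best if scores[best] > 0 else "general_specification"


def build_modular_text_set(paragraphs):
    """Attach a module label and its initial weight to every paragraph."""
    modules = []
    for paragraph in paragraphs:
        label = classify(paragraph)
        modules.append(WeightedModule(label, INITIAL_WEIGHTS[label], paragraph))
    return modules
```

In a real deployment the `classify` function would be replaced by the trained classifier; the rest of the pipeline only depends on the `(label, weight, text)` triples.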
[0045] Step S103: Based on the initial weight values of each semantic module in the modular text set, generate corresponding window configuration parameters, perform sliding hash calculation on the text content according to the window configuration parameters, locate similar character segments and calculate the character similarity rate;
[0046] The core logic of this step is to transform the importance (weight) of semantic modules into sensitivity parameters for algorithm execution, thereby optimizing the allocation of detection resources. Sliding window hash matching is an efficient algorithm for finding shared substrings. Its principle is to slide a fixed-length "window" across the text, calculating the hash value (a compact digital fingerprint) of the text within the window each time. By comparing the hash value sequences between different texts, consecutive identical character segments can be quickly discovered.
[0047] In this embodiment, the window size (i.e., length) is not fixed but dynamically generated by mapping the weight values of the modules. This is a key logical link: high-weight modules (such as core technologies, with a weight ≥ 60%) require higher detection sensitivity, so smaller windows (e.g., 50 characters) are configured to capture more subtle copying behavior; low-weight modules (such as general specifications, with a weight ≤ 20%) can tolerate higher similarity, so larger windows (e.g., 200 characters) are used to improve computational efficiency and reduce interference. The system slides the window according to a preset step size, and when the hash values of multiple consecutive windows match, the region is marked as a "suspected identical character fragment".
[0048] To further improve robustness, these fragments will undergo text normalization (i.e., punctuation and spaces will be removed before comparison) to identify plagiarism attempts that evade detection by simply adjusting formatting or replacing punctuation. Finally, the total number of characters in all identical fragments within a single module is counted and divided by the total number of characters in that module to obtain the module's "character similarity rate," which is an objective and quantifiable primary indicator.
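The weight-driven window selection and the character similarity rate of step S103 can be sketched as below. This is a simplified illustration under stated assumptions: the 50 and 200 character windows come from the text, while the 100 character window for mid-range weights, the SHA-1 digest per window (a production system would more likely use a rolling hash), and all function names are hypothetical.

```python
import hashlib
import re


def window_size_for(weight: float) -> int:
    """Map a module weight to a window length (100 is an assumed middle value)."""
    if weight >= 0.60:
        return 50
    if weight <= 0.20:
        return 200
    return 100


def normalize(text: str) -> str:
    """Remove punctuation and whitespace before comparison."""
    return re.sub(r"[\W_]+", "", text)


def _digest(window: str) -> str:
    return hashlib.sha1(window.encode("utf-8")).hexdigest()


def window_hashes(text: str, size: int) -> set:
    """Hash every window of `size` characters in `text`."""
    return {_digest(text[i:i + size]) for i in range(max(len(text) - size + 1, 0))}


def character_similarity_rate(module_a: str, module_b: str, weight: float,
                              step: int = 1) -> float:
    """Share of characters in module_a covered by windows that also occur in module_b."""
    a, b = normalize(module_a), normalize(module_b)
    size = window_size_for(weight)
    if len(a) < size or len(b) < size:
        return 0.0
    hashes_b = window_hashes(b, size)
    covered = [False] * len(a)
    for start in range(0, len(a) - size + 1, step):
        if _digest(a[start:start + size]) in hashes_b:
            covered[start:start + size] = [True] * size
    return sum(covered) / len(a)
```

Marking covered positions instead of summing window lengths avoids counting overlapping windows twice, so the rate stays between 0 and 1.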
[0049] Step S104: Extract text vectors from the modular text set using a pre-trained semantic model, calculate semantic similarity, and combine contextual relevance to verify and output semantically similar segments.
[0050] This involves using a deep language model (such as BERT) that has been pre-trained on massive amounts of corpus and fine-tuned on industry-specific (such as engineering) corpus. This model can convert a piece of text into a vector in a high-dimensional space (also known as an embedding vector), and the geometric features of this vector (such as orientation) encode the deep semantic information of the text.
[0051] Specifically, text fragments from a modular text set are input into the model to obtain their semantic vectors. Then, the cosine similarity of the corresponding modular text vectors between different tender documents is calculated; the closer the value is to 1, the more semantically similar they are. Using this method, the system can identify plagiarism behaviors such as synonym substitution (e.g., "tower crane" vs. "tower-type crane") and sentence transformation (e.g., active vs. passive sentences) where the characters differ but the meaning is the same.
[0052] Furthermore, to avoid misinterpretation, the algorithm incorporates contextual relevance verification. For example, for isolated, semantically similar short sentences (such as "using steel structures"), the system analyzes the semantics of several sentences preceding and following it to determine the sentence's true function and meaning in the original text. Only when its contextual information is also highly relevant is it ultimately classified as a "semantically similar fragment." This mimics the human thought process of understanding language by considering its context, effectively preventing misjudgments caused by coincidental similarities in expression.
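The cosine comparison and the contextual relevance check of step S104 can be sketched as follows. The embedding here is a deliberate placeholder: `toy_embed` counts words and carries no real semantics, whereas the described system would obtain vectors from the fine-tuned language model. The thresholds, the one-sentence context radius, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Placeholder embedding (word counts); a real system would call the semantic model."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantically_similar(sents_a, i, sents_b, j, embed=toy_embed,
                         sent_threshold=0.8, ctx_threshold=0.5, radius=1) -> bool:
    """Flag the pair (i, j) only if the sentences and their neighbours both match."""
    if cosine(embed(sents_a[i]), embed(sents_b[j])) < sent_threshold:
        return False
    # Context = up to `radius` sentences before and after, excluding the sentence itself.
    ctx_a = " ".join(sents_a[max(i - radius, 0):i] + sents_a[i + 1:i + 1 + radius])
    ctx_b = " ".join(sents_b[max(j - radius, 0):j] + sents_b[j + 1:j + 1 + radius])
    return cosine(embed(ctx_a), embed(ctx_b)) >= ctx_threshold
```

With this design, an isolated short sentence such as "using steel structures" is not reported when its surroundings are unrelated, and a sentence with no surrounding context is never flagged on its own.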
[0053] Step S105: Combine character similarity rate and semantic similarity, map similarity level based on preset judgment matrix, and calculate weighted risk value by combining the initial weight values of each semantic module;
[0054] The underlying logic of this step lies in constructing a multi-factor, hierarchical judgment system to arrive at a final conclusion. First, the system inputs the obtained character similarity rate and semantic similarity metrics into a pre-defined judgment matrix. This matrix defines the similarity levels (e.g., high similarity, moderate similarity, low risk) corresponding to different threshold combinations, providing a standardized basis for preliminary judgment.
[0055] However, the real innovation lies in the calculation of the "weighted risk value." It doesn't simply average character and semantic metrics; instead, it introduces a third crucial dimension: the initial weight value of the module. The fusion logic is as follows: multiply the character similarity rate or semantic similarity (or, in a specific conflict scenario, the more sensitive one) by the module weight value. This multiplication essentially performs risk weighting. For example, even if a high character similarity rate is found in the "general specifications module," its low weight will result in a low-level weighted risk value. Conversely, in the "core technology module," even if the semantic similarity is only moderate, its high weight will significantly amplify the weighted risk value of that segment. This mechanism ensures that the final risk assessment accurately reflects the technical importance and potential harm of the similar content, providing a comprehensive and targeted quantitative basis for regulatory decisions.
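The judgment matrix and the weighted risk value of step S105 can be illustrated with a small sketch. The band thresholds (0.3 and 0.6), the contents of `LEVELS`, and the choice of taking the larger of the two indicators as "the more sensitive one" are assumptions; the text does not fix these values.

```python
# Rows: character similarity band; columns: semantic similarity band (low, mid, high).
LEVELS = [
    ["low risk", "moderate similarity", "high similarity"],
    ["moderate similarity", "moderate similarity", "high similarity"],
    ["high similarity", "high similarity", "high similarity"],
]


def band(value: float, mid: float = 0.3, high: float = 0.6) -> int:
    """Map a similarity value in [0, 1] to band 0, 1 or 2 (assumed thresholds)."""
    if value >= high:
        return 2
    if value >= mid:
        return 1
    return 0


def similarity_level(char_rate: float, sem_sim: float) -> str:
    """Look up the similarity level for a pair of indicators."""
    return LEVELS[band(char_rate)][band(sem_sim)]


def weighted_risk(char_rate: float, sem_sim: float, weight: float) -> float:
    """Multiply the stronger indicator by the module's initial weight."""
    return max(char_rate, sem_sim) * weight
```

For example, a general specification module with a character similarity rate of 0.9 and a weight of 0.15 gets a weighted risk of about 0.135, while a core technology module with a semantic similarity of only 0.5 and a weight of 0.70 gets about 0.35, which reproduces the ordering described in the paragraph above.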
[0056] Step S106: Integrate the similar character segments, semantically similar segments, and weighted risk values to generate a visual analysis report, extract the report hash value, and write it to the blockchain for evidence storage.
[0057] The system automatically integrates all analysis results (including the specific locations of identical character fragments, the content of semantically identical fragments, the similarity indicators of each module, and the final weighted risk value) to generate a detailed visual analysis report. This report is usually presented in PDF format and includes a heatmap (visually showing the distribution of similarities), a data comparison table (clearly listing the details of similarities), and a summary of conclusions.
[0058] To further enhance the legal validity of the evidence, the system calculates the cryptographic hash value (a unique and irreversible digital fingerprint) of the report file and writes it to the blockchain along with a timestamp. Due to the decentralized and immutable nature of the blockchain, this notarization process essentially provides the report with an authoritative "birth certificate." Any tampering with the report will result in its hash value not matching the blockchain record. This solidifies the authenticity and creation time of the evidence, enabling the report to be accepted as strong electronic evidence in subsequent administrative reviews or judicial proceedings, effectively solving the ultimate problem of "insufficient evidence" in traditional methods.
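The fingerprinting half of step S106 can be sketched as below. SHA-256 is an assumed choice of cryptographic hash, and the record layout and function names are hypothetical; the actual write to a blockchain depends on the chosen platform and is intentionally left out.

```python
import hashlib
import time


def report_fingerprint(report_bytes: bytes) -> str:
    """Hex SHA-256 digest of the report file contents."""
    return hashlib.sha256(report_bytes).hexdigest()


def build_evidence_record(report_bytes: bytes, timestamp=None) -> dict:
    """Record that would be submitted to the ledger (submission itself is not shown)."""
    return {
        "report_sha256": report_fingerprint(report_bytes),
        "timestamp": int(time.time() if timestamp is None else timestamp),
    }


def verify_report(report_bytes: bytes, record: dict) -> bool:
    """True only if the report still matches the stored fingerprint."""
    return report_fingerprint(report_bytes) == record["report_sha256"]
```

Only the digest and timestamp are recorded, so the report content itself never needs to leave the regulator's storage; a later reviewer recomputes the digest and compares it with the on-chain value.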
[0059] The above implementation constructs a complete technical chain, from multi-format text parsing, intelligent semantic module segmentation, adaptive character and semantic dual-dimensional detection, to final weighted risk quantification and blockchain evidence solidification. This systematically addresses the four core pain points of traditional bid-rigging detection: single-dimensionality, high false positive / false negative rates, low efficiency, and weak evidence chains. By deeply integrating deep learning, natural language processing, and other technologies with professional knowledge of bidding and tendering supervision, the detection system evolves from a simple text comparison tool into an intelligent decision support system with semantic understanding capabilities and expert-level risk assessment wisdom. Ultimately, it achieves accurate, efficient, and conclusive identification and judgment of bid-rigging behavior, providing a solid technical guarantee for maintaining fairness and impartiality in the bidding and tendering market.
[0060] Referring to Figure 2, as one implementation of step S101, the steps of parsing multi-format technical bid files, extracting text content, performing standardization and cleaning processes, and generating standardized text data that retains paragraph structure include:
[0061] Step S201: Obtain multi-format technical bid files, identify file types and call the corresponding parsing engine to extract the original text content containing paragraph position information;
[0062] Digitized technical bid documents exist in various formats, which fall into two main categories: native electronic documents (such as Word and PDF), which contain structural information internally, and scanned image files (such as JPG and PNG files produced by scanning paper documents), whose content exists as pixel-based raster data. Simply copying and pasting the text would lose a large amount of layout information that is potentially valuable for determining the originality of the document.
[0063] Therefore, the logical starting point of this step is accurate file format recognition, typically achieved by reading the file header's signature, which is a reliable identifier. After recognition, the system calls the corresponding parsing engine for targeted processing. For scanned documents, the core challenge is to simultaneously extract text content and layout structure from unstructured images. This requires converting the image into characters using OCR (Optical Character Recognition) technology and coupling it with a layout analysis algorithm. This algorithm can identify the position, size, and relationships of text blocks, thereby determining paragraph divisions, heading levels, etc., and reconstructing the paragraph's position information in the form of coordinates.
[0064] For native electronic documents (such as Word), they are essentially structured documents. Parsing engines can directly extract the text content and related rich attribute information such as font, indentation, and line spacing by deconstructing their internal format (such as parsing an XML node tree).
[0065] It should be noted that the purpose of this step is not to generate a plain text stream, but to generate a raw text content containing paragraph position information, thus preserving crucial raw data for subsequent structured analysis and possible plagiarism comparison.
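The file-header signature check mentioned in step S201 can be sketched as follows. The byte signatures listed are the well-known ones for PDF, ZIP containers (which include .docx), legacy OLE2 Office files, PNG, and JPEG; the engine names and the `ENGINE_FOR` routing table are hypothetical.

```python
# (signature bytes, detected type); checked in order against the start of the file.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip_container"),                    # .docx is a ZIP container
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2_doc"),   # legacy .doc
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]

# Hypothetical routing: native documents go to a structural parser, images to OCR.
ENGINE_FOR = {
    "pdf": "native_parser",
    "zip_container": "native_parser",
    "ole2_doc": "native_parser",
    "png": "ocr_with_layout_analysis",
    "jpeg": "ocr_with_layout_analysis",
}


def detect_file_type(header: bytes) -> str:
    """Identify the file type from its leading bytes."""
    for magic, kind in SIGNATURES:
        if header.startswith(magic):
            return kind
    return "unknown"


def select_engine(header: bytes) -> str:
    """Choose a parsing engine name, or 'unsupported' for unknown formats."""
    return ENGINE_FOR.get(detect_file_type(header), "unsupported")
```

Reading the first eight bytes of a file is enough for every signature in this table, so the check is independent of the file extension, which a bidder can rename freely.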
[0066] Step S202: Delete non-substantive text fragments based on the pre-set cleaning rule base, including headers and footers, page numbers, and fixed template content of the bidding party;
[0067] The core logic of this step is data cleansing, which aims to eliminate "noise" text introduced by document format or the bidding process itself that is irrelevant to the originality of the bid proposal, thereby ensuring that subsequent plagiarism detection focuses on the core content that truly reflects the bidder's intentions and level.
[0068] In this embodiment, a scalable and updatable predefined cleaning rule base is constructed, which includes various intelligent recognition and filtering strategies. For fixed elements such as headers, footers, and page numbers, the rule base may contain a feature vocabulary (e.g., patterns containing keywords such as "number", "page", and "total") and combine it with regular expressions for matching and location.
[0069] Furthermore, the "fixed template content of the bidding party" needs to be processed, and relying solely on keyword matching is prone to mistakenly removing the main text. Therefore, the solution introduces a template fingerprint database. The system can pre-extract recognized fixed template sections provided by the bidding party from the bidding documents or historical bid documents, and generate a digital fingerprint for each by calculating the MD5 hash or a more advanced hash value of its content. During cleaning, the system calculates the hash value of text fragments in the current document and compares it with records in the fingerprint database. Once a match is found, the fragment is determined to be template content rather than the bidder's original work and is removed. This hash comparison method is more accurate and adaptable than static rules, effectively avoiding misjudging standard clauses in the template as plagiarism.
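The template fingerprint database of step S202 can be sketched as below. SHA-256 is used here as one example of the "more advanced hash" mentioned above; collapsing whitespace before hashing and all function names are assumptions for illustration.

```python
import hashlib


def fingerprint(fragment: str) -> str:
    """Hash a fragment after collapsing whitespace so layout noise does not matter."""
    canonical = " ".join(fragment.split())
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_fingerprint_db(template_fragments) -> set:
    """Pre-compute fingerprints of known tenderer-provided template sections."""
    return {fingerprint(fragment) for fragment in template_fragments}


def strip_template(paragraphs, fingerprint_db: set) -> list:
    """Keep only paragraphs whose fingerprint is not in the template database."""
    return [p for p in paragraphs if fingerprint(p) not in fingerprint_db]
```

Exact-hash matching only removes fragments that are identical to a known template section after whitespace collapsing; a template clause that a bidder has edited stays in the text and is compared like any other content.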
[0070] Step S203: Perform format normalization processing on the extracted original text content to unify the expression of numbers and units, and filter out special symbols and redundant spaces;
[0071] This step eliminates superficial differences in text expression so that the subsequent comparison algorithms focus on substantive similarities and differences in content. It is a crucial preprocessing defense against simple plagiarism disguises: plagiarists may try to evade exact string matching by changing how numbers are written (such as "10" and "ten"), swapping unit symbols (such as "mm" and "millimeter"), or adding and deleting redundant spaces and special symbols.
[0072] For this reason, the system embeds a format normalization engine, which is essentially a set of text conversion rules. Unifying the expression of numbers and units is an important part of it. The engine maintains an internal mapping table that covers, for example, rules for bidirectional conversion between Chinese numerals and Arabic numerals, and the standardization of unit symbols into a specified form (such as unifying "cm" and "centimeter" into "centimeter"). Filtering special symbols and redundant spaces relies on another set of text cleaning rules, for example compressing multiple consecutive line breaks into one, converting full-width punctuation marks (such as ",") into half-width punctuation marks (","), and deleting unnecessary spaces.
[0073] This process can be thought of as stripping away all superficial formatting that does not affect the core semantics, leaving only the semantic kernel of the text. After normalization, texts such as "using 10mm steel bars" and "using ten millimeter steel bars" are reduced to exactly the same content, so that character-level comparison directly reveals their similarity.
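A minimal sketch of such a normalization engine is shown below. The two mapping tables are tiny hypothetical samples (a real rule library would be far larger and would also cover Chinese numerals), and the use of Unicode NFKC normalization for full-width to half-width conversion is an assumption about one convenient way to implement that rule.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny mapping tables.
NUMBER_WORDS = {"ten": "10", "twenty": "20"}
UNIT_FORMS = {
    "mm": "millimeter", "millimeters": "millimeter",
    "cm": "centimeter", "centimeters": "centimeter",
}

def normalize(text: str) -> str:
    # NFKC folds full-width characters (e.g. U+FF0C) into their half-width forms.
    text = unicodedata.normalize("NFKC", text).lower()
    # Separate a number glued to a unit: "10mm" -> "10 mm".
    text = re.sub(r"(\d)([a-z])", r"\1 \2", text)
    tokens = []
    for token in text.split():  # split() also collapses redundant whitespace
        token = NUMBER_WORDS.get(token, token)
        token = UNIT_FORMS.get(token, token)
        tokens.append(token)
    return " ".join(tokens)

a = normalize("using 10mm steel bars")
b = normalize("using ten  millimeter steel bars")
```

Both inputs reduce to the same string, so an exact character comparison now treats them as identical.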
[0074] Step S204, retain the mapping relationship between the original paragraph position information and the text content, and generate standardized text data.
[0075] The core logic of this step is to re-associate the text content, after cleaning and normalization, with the original layout structure information extracted in the first step, generating a structured data object that can be used for subsequent semantic analysis. Paragraph position information (such as starting line number, indentation level, and line spacing) is retained because these layout features may themselves carry evidentiary value. For example, if two documents are not only highly similar in text content but also share exactly the same unconventional indentation and special paragraph spacing, this "format-related copying" is stronger circumstantial evidence of direct copy-pasting rather than independent creative work.
[0076] Specifically, this can be achieved by constructing a paragraph position index table. This index table records the "coordinates" of each paragraph in the original document. Then, the system binds the cleaned and standardized text content obtained from steps two and three to this index table using pointers or unique identifiers. The resulting standardized text data is not a simple string, but a text data object rich in metadata and with structured tags. It contains both clean text content that can be used for high-precision comparison and layout clues that may be used to help determine plagiarism, providing a high-quality, high-information-density input foundation for subsequent in-depth analysis such as module division and two-dimensional comparison.
[0077] The above implementation combines various technologies such as file format parsing, hash fingerprinting in information theory, and text normalization in natural language processing to transform original documents from different sources and with chaotic formats into standardized text data with clean content, standard format, and preservation of key structural information. This lays a solid and reliable data foundation for subsequent accurate plagiarism detection algorithms and significantly improves the accuracy and robustness of the entire detection system from the source.
[0078] Referring to Figure 3, as one implementation of step S102, the step of dividing standardized text data into multiple semantic modules based on a pre-built industry knowledge base, configuring initial weight values for each semantic module, and generating a modular text set with weight labels includes:
[0079] Step S301: Obtain standardized text data, call the text classification model of the pre-set industry knowledge base, divide the text content into multiple semantic module types and generate temporary module labels;
[0080] Traditional text similarity detection treats the entire document as a uniform whole, failing to distinguish the essential differences between "standard clauses" and "original solutions," which is one of the root causes of misjudgments and missed judgments. The breakthrough of this step lies in introducing a text classification model (typically fine-tuned based on pre-trained models like BERT) trained on a pre-built industry knowledge base (containing a large number of annotated technical documents from engineering, service, and other fields). This model does not perform simple keyword matching but rather deeply understands the semantics of paragraph text.
[0081] Specifically, the model receives the standardized text data produced by the previous steps, typically split into paragraphs or logical sections. It analyzes the vocabulary, syntax, and deeper semantic features of each paragraph and outputs a probability distribution over semantic module types. For example, a text detailing the "specific construction techniques for foundation pit support" is highly likely to be classified as a "core technology module," while a text stating "we promise to comply with the Safety Production Law" is categorized as a "general specification module." The system assigns a temporary module label to each text fragment based on a preset probability threshold. The label is "temporary" because the classification result can still be verified or adjusted by human experts in subsequent steps, rather than being an unchangeable final judgment. This step essentially reproduces, in digital form, how review experts read technical proposals: identifying key points and distinguishing primary from secondary content.
[0082] Step S302: Map each semantic module type to the preset weight rule library and assign an initial weight value to each semantic module;
[0083] The core of this step is to assign differentiated risk sensitivity coefficients to modules of varying technological importance based on industry experience and regulatory logic. Simply dividing the text into different modules is insufficient; the "weight" of each module in the similarity determination must be clearly defined. To this end, the system embeds a pre-set weight rule base, which defines the mapping relationship between different semantic module types and initial weight values.
[0084] Specifically, this mapping relationship is based on profound industry insights: core technology modules (such as unique construction methods and innovative service processes) are the concentrated embodiment of the originality of bidding proposals, and their high degree of similarity strongly suggests potential collusion. Therefore, they are assigned the highest weight range (e.g., 60%-80%), meaning that similarities found in such modules will have a decisive impact on the final risk value. Conversely, general specification modules (such as statutory safety clauses and standard qualification documents) are highly standardized in content, and their similarities are often not specific. Therefore, they are assigned the lowest weight range (e.g., 10%-20%), and their impact will be significantly diluted.
[0085] Furthermore, for project-specific modules, the rule base may employ more complex dynamic strategies, such as detecting the presence of keywords like "customized" or "specialized" to determine their weight. This process is automated; the system queries the rule base based on the temporary module tags generated in the previous step to assign a quantified initial weight value to each piece of text that reflects its technological uniqueness, providing a precise numerical basis for subsequent weighted calculations.
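The weight lookup can be sketched as follows. The two ranges come from the examples in the text; picking the midpoint of a range, the `project_specific` range, and the keyword handling are illustrative assumptions, since the description does not fix them.

```python
# Ranges for core and general modules follow the examples in the text.
WEIGHT_RULES = {
    "core_technology": (0.60, 0.80),
    "general_specification": (0.10, 0.20),
}
# Assumption: the text gives no range for project-specific modules.
PROJECT_SPECIFIC_RANGE = (0.30, 0.50)
UNIQUENESS_KEYWORDS = ("customized", "specialized")

def initial_weight(module_type: str, text: str) -> float:
    """Map a temporary module label to an initial weight value."""
    if module_type == "project_specific":
        low, high = PROJECT_SPECIFIC_RANGE
        has_keyword = any(k in text.lower() for k in UNIQUENESS_KEYWORDS)
        return high if has_keyword else low
    low, high = WEIGHT_RULES[module_type]
    return round((low + high) / 2, 2)  # assumption: use the midpoint of the range

core = initial_weight("core_technology", "Staged excavation with anchor rods.")
general = initial_weight("general_specification", "We comply with the Safety Production Law.")
custom = initial_weight("project_specific", "A customized monitoring platform.")
```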
[0086] Step S303: Receive external configuration instructions, adjust the initial weight values of the specified semantic module, and generate an adjustment log;
[0087] Integrating the necessary human-computer interaction and audit trail mechanisms into automated decision-making is crucial for balancing the rigid rules of algorithms with the flexible judgment of human experts. It lets the system operate efficiently while adapting to complex and changing real-world scenarios. Although the automatically assigned initial weights are generally applicable, specific projects may have particular needs. For example, in a procurement process that heavily emphasizes after-sales service, experts might deem the automatically assigned weight of the "service plan" module too low.
[0088] Therefore, in this embodiment, the system also includes an open interface that allows authorized users (such as a review expert group) to temporarily adjust the initial weight values of specific modules via external configuration commands (usually through sliders or input boxes on a graphical interface). Behind this interactive function lies rigorous security protection and auditing logic.
[0089] First, the system imposes constraints on adjustment behavior, such as setting a weight cap (e.g., 30%) for general specification modules to prevent accidental over-weighting of non-core content and subsequent misjudgments. More importantly, every adjustment operation, including the adjusted object, original value, new value, operation time, and operator (optional), is automatically captured by the system and generates a structured adjustment log (usually in JSON format). This log forms a complete audit trail, ensuring the transparency and traceability of weight adjustments. It meets compliance requirements and provides clear decision-making basis for potential subsequent reviews, making the system both intelligent and trustworthy.
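A minimal sketch of a constrained adjustment that emits a JSON log entry follows. The field names and the function signature are hypothetical; only the 30% cap for general specification modules and the list of logged items come from the text.

```python
import json
from datetime import datetime, timezone

# Cap from the example in the text; modules without an entry default to 1.0 (assumption).
WEIGHT_CAPS = {"general_specification": 0.30}

def adjust_weight(weights, module_type, new_value, operator=None):
    """Apply a manual weight change and return a JSON adjustment log entry."""
    cap = WEIGHT_CAPS.get(module_type, 1.0)
    if not 0.0 <= new_value <= cap:
        raise ValueError(f"weight for {module_type} must be within [0, {cap}]")
    entry = {
        "module_type": module_type,
        "original_value": weights[module_type],
        "new_value": new_value,
        "operation_time": datetime.now(timezone.utc).isoformat(),
        "operator": operator,  # optional, as in the text
    }
    weights[module_type] = new_value
    return json.dumps(entry)

weights = {"core_technology": 0.70, "general_specification": 0.15}
log_line = adjust_weight(weights, "general_specification", 0.25, operator="expert_group_1")
```

Rejecting an out-of-range value before the log entry is written keeps the audit trail limited to changes that actually took effect.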
[0090] Step S304: Bind the semantic module text content, temporary module tags, and initial weight values to generate a modular text set with weight tags.
[0091] After the first three steps, the system has obtained three key elements: clean text content (semantic module text content), qualitative understanding of the content (temporary module tags), and quantitative importance score based on this understanding (weight value). The core task of this step is to organically and structurally bind these three elements together.
[0092] Specifically, a structured data object (such as a JSON object or a record in a database) is created, containing multiple fields that store the corresponding text fragments, their module type labels, and weight values. Finally, the system arranges all such data objects according to the physical order of the text in the original document (paragraph index), generating a modular collection of text with weight labels.
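One possible shape for such a data object is sketched below. The class and field names are assumptions chosen for illustration; the text only requires that the text fragment, module label, and weight be bound together and ordered by paragraph index.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class SemanticModule:
    paragraph_index: int  # position in the original document
    module_type: str      # temporary module label
    weight: float         # initial (possibly adjusted) weight value
    text: str             # standardized text content

def build_modular_text_set(modules):
    """Order the bound records by their position in the original document."""
    return sorted(modules, key=lambda m: m.paragraph_index)

modules = [
    SemanticModule(7, "general_specification", 0.15,
                   "We promise to comply with the Safety Production Law."),
    SemanticModule(2, "core_technology", 0.70,
                   "Foundation pit support uses staged excavation."),
]
text_set = build_modular_text_set(modules)
serialized = json.dumps([asdict(m) for m in text_set], ensure_ascii=False)
```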
[0093] It should be noted that the output is no longer a simple text stream, but a deeply annotated, machine-readable semantic map. Subsequent detection algorithms (such as sliding window alignment or semantic vector calculation) can adjust the sensitivity or contribution of each segment based on its weight value, thereby achieving focused, risk-weighted, and refined detection.
[0094] In the above implementation, unstructured technical tender documents are transformed into semantic objects that can be understood by machines and quantified for analysis. This makes subsequent similarity detection no longer a simple calculation of character or semantic similarity, but an upgrade to a precise quantitative analysis that focuses on the core of technical originality and is risk-oriented. This fundamentally improves the accuracy and practicality of detection and provides a powerful technical tool for identifying covert bid-rigging behavior.
[0095] Referring to Figure 4, as one implementation of step S103, the step of generating corresponding window configuration parameters based on the initial weight values of each semantic module in the modular text set, performing sliding hash calculation on the text content according to the window configuration parameters, locating similar character segments, and calculating the character similarity rate includes:
[0096] Step S401: Obtain a modular text set with weight labels, and extract the initial weight values and standardized text content of each semantic module;
[0097] Step S402: Dynamically map window configuration parameters according to the initial weight values of each semantic module;
[0098] Traditional character matching uses a fixed-size window for sliding matching, which cannot cope with the uneven importance of technical content, resulting in either insufficient detection of important parts or oversensitivity to common parts.
[0099] The innovation of this solution lies in establishing a dynamic mapping rule, specifically: for high-weight modules (such as "core technology modules" with a weight ≥ 60%), the originality requirements of their content are high, requiring extremely high detection sensitivity to detect subtle traces of plagiarism. Therefore, the system will configure a small window size (such as 50 characters) and a small sliding step size (such as 10 characters) for them. This is equivalent to using a high-powered magnifying glass for examination, which can capture more subtle continuous copying behavior.
[0100] Conversely, for low-weight modules (such as "general specification modules" with a weight of ≤20%), the content itself is highly standardized, and over-detection can easily lead to false alarms. Therefore, the system will configure a large window size (such as 200 characters) and a large step size (such as 50 characters) for them. This is equivalent to using a wide-angle lens to focus only on large, obvious copies, thereby improving computational efficiency and resisting interference.
[0101] It is understandable that through this dynamic mapping from weight values to window configuration parameters, the system generates customized sliding window rules for each module, achieving an optimal balance between detection accuracy and computational efficiency as a whole.
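The mapping from weight to window parameters can be sketched as below. The high-weight and low-weight settings use the example values from the text; the middle band is an assumption, because the description does not say how weights between 20% and 60% are handled.

```python
from typing import NamedTuple

class WindowConfig(NamedTuple):
    window: int  # window size in characters
    step: int    # sliding step in characters

def window_config(weight: float) -> WindowConfig:
    """Map a module's weight value to sliding-window parameters."""
    if weight >= 0.60:
        return WindowConfig(window=50, step=10)    # fine-grained scan for core content
    if weight <= 0.20:
        return WindowConfig(window=200, step=50)   # coarse scan for standard clauses
    return WindowConfig(window=100, step=25)       # assumption: not specified in the text

core_cfg = window_config(0.70)
general_cfg = window_config(0.15)
```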
[0102] Step S403: Perform sliding hash calculation on the text content based on the window configuration parameters, locate the identical character fragments, and record the position information.
[0103] This step adopts an efficient string matching algorithm. Following the customized rules generated in the previous step, it quickly and accurately locates exactly identical character sequences in the text. The sliding hash calculation usually refers to a rolling hash such as the one used in the Rabin-Karp algorithm. The hash function is designed so that, when the window slides, the hash of the next window is derived from the hash of the previous window with a small, constant amount of work, without rehashing every character in the window. This brings the time complexity close to linear in the text length and greatly improves the efficiency of large-scale text comparison.
[0104] In the embodiment of the present application, the system slides according to the window length and step size customized for each module and computes the hash value (a digital fingerprint) of the text in each window in real time. When the hash values of two text fragments from different tender documents are exactly the same for multiple consecutive windows (for example, 3 consecutive windows), the algorithm determines that the two fragments are identical character fragments. This "continuous hit" mechanism avoids misjudgments caused by an accidental single-window match. Once a fragment is located, the system records its position information, including the start and end character indexes of the fragment in the document, the number of the module to which it belongs, and the original text content. This information is used not only for subsequent statistics but is also the key data for generating a visual evidence report and achieving "accurate traceability".
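The sketch below shows a polynomial rolling hash with a consecutive-hit check. It makes several assumptions that the text leaves open: document A is sampled at step offsets while document B is indexed at every offset (so copies need not be aligned to the step), the hash base and modulus are arbitrary choices, and a final string comparison guards against hash collisions.

```python
BASE = 257
MOD = (1 << 61) - 1  # large prime modulus keeps accidental collisions rare

def rolling_hashes(text, window):
    """Yield (start, hash) for every window, updating the hash in O(1) per shift."""
    if window <= 0 or len(text) < window:
        return
    high = pow(BASE, window - 1, MOD)  # weight of the character leaving the window
    h = 0
    for ch in text[:window]:
        h = (h * BASE + ord(ch)) % MOD
    yield 0, h
    for i in range(1, len(text) - window + 1):
        h = (h - ord(text[i - 1]) * high) % MOD            # drop the outgoing character
        h = (h * BASE + ord(text[i + window - 1])) % MOD   # append the incoming character
        yield i, h

def find_identical_fragments(a, b, window, step, min_hits=3):
    """Return (a_start, a_end, b_start, b_end) spans matched by min_hits consecutive windows."""
    a_hash = dict(rolling_hashes(a, window))
    b_hash = dict(rolling_hashes(b, window))
    b_index = {}
    for start, h in b_hash.items():
        b_index.setdefault(h, []).append(start)
    span = (min_hits - 1) * step + window
    hits = []
    for a_start in range(0, len(a) - span + 1, step):
        for b_start in b_index.get(a_hash[a_start], []):
            consecutive = all(
                b_hash.get(b_start + k * step) == a_hash[a_start + k * step]
                for k in range(min_hits)
            )
            # Compare the actual text to rule out hash collisions.
            if consecutive and a[a_start:a_start + span] == b[b_start:b_start + span]:
                hits.append((a_start, a_start + span, b_start, b_start + span))
    return hits

shared = "staged excavation with anchor rods and shotcrete lining"
a = "AAAA " + shared
b = "Our plan: " + shared + " end"
hits = find_identical_fragments(a, b, window=10, step=5)
```

Long copies produce several overlapping spans here; a fuller version would merge them before recording position information.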
[0105] Step S404: Perform normalization processing on the located identical character fragments to filter out non-substantive differences.
[0106] The logic of this step is to see through the surface-format disguises that plagiarists use to avoid detection and to identify substantive content similarities. Some plagiarists do not copy and paste directly but make simple changes to the text, such as adding, deleting, or replacing punctuation marks, adjusting line break positions, or changing how numbers and units are written.
[0107] For this reason, the system introduces normalization processing. According to the preset normalization rule library, "purification" operations are performed on the original text of the located identical character fragments. The rules include, but are not limited to: deleting all punctuation marks and line break characters, converting full-width characters to half-width characters, and unifying numbers and units into standardized forms (such as converting both "10mm" and "十毫米" (ten millimeters) to "10毫米" (10 millimeters)).
[0108] After this processing, the text is restored to its most essential character sequence state. The system then compares and confirms the "cleaned" text again. This process effectively "stitches together" identical segments that are separated by superficial formatting differences but are actually continuous in content, thus exposing the plagiarist's disguise and ensuring that the algorithm captures substantive, rather than formal, character similarities.
[0109] Step S405: Calculate the length ratio of identical character segments within each semantic module and the character similarity rate.
[0110] The judgment of character similarity cannot be limited to the qualitative level of "existence" or "absence," but requires a quantitative scale. Therefore, "length ratio" can be used as this scale, i.e., character similarity rate.
[0111] In this embodiment, the character similarity rate is calculated independently on a module-by-module basis. For a specific module (such as the "core technology module"), the numerator is the total character length of all identical segments confirmed after normalization within that module; the denominator is the total character length of the module as a whole. Dividing the two yields the character similarity rate for that module. It can be understood that the character similarity rate reflects the proportion of text content within that module that is completely identical to other tender documents. Finally, the system outputs a structured dataset (such as key-value pairs) clearly indicating each module and its corresponding character similarity rate.
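The per-module ratio can be sketched as follows. Merging overlapping spans before summing is an assumption added so that characters covered by two matches are not counted twice; the text itself only specifies total identical length divided by module length.

```python
def character_similarity_rate(module_length, spans):
    """Fraction of a module's characters covered by identical fragments.

    spans: (start, end) character offsets within the module, end exclusive.
    """
    if module_length == 0:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(spans):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / module_length

# Hypothetical spans: the first two overlap by 10 characters.
rates = {
    "core_technology": character_similarity_rate(200, [(0, 50), (40, 80), (150, 170)]),
    "general_specification": character_similarity_rate(100, []),
}
```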
[0112] The above implementation achieves a leap from the traditional, broad-based "one-size-fits-all" approach to an intelligent model that emphasizes key aspects and provides precise quantification in character-level similarity detection. It deeply integrates the weighting concept of semantic modules into the core of the character comparison algorithm, dynamically adapting detection sensitivity to content importance. Furthermore, it employs normalization technology to penetrate simple disguises, ultimately outputting a modular similarity density index. This provides a solid, reliable, and quantifiable character-level evidentiary foundation for the entire two-dimensional similarity detection system, significantly improving the accuracy, efficiency, and persuasiveness of the evidence.
[0113] Referring to Figure 5, as one implementation of step S104, the step of extracting text vectors from the modular text set through a pre-trained semantic model, calculating semantic similarity, and combining contextual relevance verification to output semantically similar segments includes:
[0114] Step S501: Obtain a modular text set with weighted labels and extract the standardized text content of each semantic module;
[0115] Step S502: Call the pre-trained semantic model to vectorize the text content and generate a set of semantic vectors;
[0116] In this step, a deep learning model converts human-readable natural language text into numerical representations (i.e., vectors) that machines can compute on, enabling semantic-level comparison.
[0117] It should be noted that the pre-trained semantic model used in the embodiments of this application has been fine-tuned for industry applications. It typically employs a language model based on the Transformer architecture, such as BERT (Bidirectional Encoder Representations from Transformers). Such models are pre-trained on massive general corpora and thereby acquire strong language understanding capabilities. After fine-tuning on a specialized corpus of engineering bidding documents, the model can more accurately capture subtle differences in industry terminology. For example, it can map synonymous professional expressions such as "tower crane" and "tower-type crane" to very close positions in the semantic space.
[0118] Furthermore, vectorization encoding refers to the process by which the model transforms an input text segment (such as a paragraph) into a point in a fixed-length (e.g., 768-dimensional) high-dimensional space, i.e., a semantic vector. The geometric features of this vector (such as direction) encode the deep semantic information of the text. The closer the meanings of the texts, the closer the directions of their corresponding vectors in space. The system performs this operation on a per-segment basis within a module, preserving the index relationship between each vector and its corresponding segment, ultimately generating a set of semantic vectors, thus successfully transforming unstructured text into structured, computable mathematical objects.
[0119] Step S503: Calculate the semantic similarity between text segments based on the semantic vector set. When the semantic similarity exceeds the preset similarity threshold, it is marked as a candidate semantically similar segment.
[0120] This method quantifies the similarity of meaning between different text fragments by measuring the relative positional relationship of semantic vectors in space. The metric used for calculation is usually cosine similarity, which measures the consistency of direction by calculating the cosine of the angle between two vectors. Its value range is between -1 and 1. The closer the value is to 1, the more consistent the directions of the two vectors are, that is, the more similar the semantics of the texts they represent.
[0121] In this embodiment, the "core technology module" requires high originality and uses a strict threshold (e.g., 0.85): only segments with very high semantic similarity are marked, to avoid misjudgments. Conversely, the "general specification module," with its relatively fixed expression, uses a more lenient threshold (e.g., 0.7) to prevent the exclusion of standard expressions. This dynamic threshold mechanism essentially turns the importance information carried by the module weights assigned upstream into a sensitivity parameter for this stage, achieving risk-oriented judgment. The system calculates the cosine similarity of corresponding module text segments between different tender documents and compares it with the dynamic threshold set for that module. Segments exceeding the threshold are marked as candidate semantically similar segments.
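Cosine similarity with a per-module threshold can be sketched as below. The two-dimensional vectors are toy stand-ins for the 768-dimensional model output, and the threshold table simply restates the example values from the text.

```python
import math

# Example thresholds from the text.
THRESHOLDS = {"core_technology": 0.85, "general_specification": 0.70}

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def is_candidate(u, v, module_type):
    """Mark a segment pair as a candidate when its similarity exceeds the module threshold."""
    return cosine_similarity(u, v) > THRESHOLDS[module_type]

# Toy vectors whose cosine similarity is about 0.8.
u, v = [1.0, 0.0], [0.8, 0.6]
sim = cosine_similarity(u, v)
```

With these toy vectors the same pair is a candidate under the general threshold but not under the core threshold, which is the behavior the dynamic threshold is meant to produce.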
[0122] Step S504: Combine the contextual relevance to perform contextual similarity verification on the candidate semantically similar segments to obtain the contextual similarity verification result;
[0123] The core logic of this step is to fundamentally distinguish between superficial semantic coincidence and substantive plagiarism by implementing a contextual consistency comparison, thereby determining whether semantically similar candidate segments in two different tender documents play the same core contextual role in their respective documents.
[0124] First, the contextual embedding of candidate segments is verified within their respective documents. This involves calculating the semantic relevance between the segment and its immediate preceding and following text, ensuring that the segment is an organic component of the semantic flow of its document, rather than an isolated fragment detached from context. This internal coherence check is a necessary prerequisite, filtering out scattered statements that lack logical support within the document itself.
[0125] Next, the system treats two semantically similar fragments from different documents and their respective independent contexts (i.e., the complete semantic unit consisting of the preceding text, the fragment itself, and the subsequent content) as two independent "contextual packages." By calculating the overall semantic similarity of these two "contextual packages," the system determines whether the semantic similarity between the fragments stems from the consistency of their larger technical discourse context. If the two fragments are similar themselves, and the complete contexts enclosing them are also highly similar, it indicates that they were copied or derived from the same, coherent technical discourse unit, constituting substantial similarity. Conversely, if the fragments are similar themselves, but their overall contexts are drastically different in terms of theme, logic, or discourse purpose, this similarity is considered a chance match and excluded.
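The cross-document "contextual package" comparison can be sketched as follows. To keep the example self-contained, a bag-of-words cosine is used as a loudly simplified stand-in for the pre-trained semantic model; the one-paragraph context radius and the 0.6 threshold are also assumptions. The internal coherence check from the preceding paragraph is not shown.

```python
import math
from collections import Counter

def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine; a stand-in for the semantic model, not the patented method."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(c * c for c in ca.values())) * math.sqrt(sum(c * c for c in cb.values()))
    return dot / norm if norm else 0.0

def context_package(paragraphs, index, radius=1):
    """Join the fragment with its preceding and following paragraphs."""
    lo, hi = max(0, index - radius), min(len(paragraphs), index + radius + 1)
    return " ".join(paragraphs[lo:hi])

def verify_context(doc_a, idx_a, doc_b, idx_b, threshold=0.6):
    """Accept a candidate pair only if the surrounding contexts are also similar."""
    sim = bow_cosine(context_package(doc_a, idx_a), context_package(doc_b, idx_b))
    return sim >= threshold

# Hypothetical documents sharing the same middle paragraph.
doc_a = ["excavate the pit in three stages",
         "install anchor rods after each stage",
         "spray shotcrete on the exposed wall"]
doc_b = list(doc_a)  # same technical discourse unit
doc_c = ["our company was founded in 1998",
         "install anchor rods after each stage",
         "we hold a quality management certificate"]
same_context = verify_context(doc_a, 1, doc_b, 1)
chance_match = verify_context(doc_a, 1, doc_c, 1)
```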
[0126] Understandably, true technical plagiarism often involves copying not just an isolated sentence, but also its supporting logical arguments. Therefore, cross-document contextual consistency is a more reliable indicator of the intent and extent of plagiarism.
[0127] Step S505: Based on the context similarity verification results, output a list of semantically similar segments and their corresponding location information.
[0128] Specifically, a record of semantically similar fragments is generated, containing fields such as: starting paragraph index, ending paragraph index, semantic similarity value, and module identifier. A structured list of similar fragments is then output, sorted by module number. The starting and ending paragraph indices precisely indicate the location of the similar content in the original text, ensuring the traceability of the results; the semantic similarity value provides quantitative evidence of the degree of similarity; and the module identifier associates the fragment with a specific module type (such as core technology).
[0129] The above implementation accurately identifies covert plagiarism in technical bids of the "different expressions but the same meaning" kind. By using a deep semantic model fine-tuned for the industry, it no longer depends on the surface form of the text. Through the dual mechanism of dynamic thresholds and context verification, it balances the sensitivity and specificity of identification. The structured, metadata-rich output provides deep, traceable, and convincing semantic-dimension evidence for judging bid-rigging behavior, making up for the inherent shortcomings of traditional character-only comparison methods.
[0130] Referring to Figure 6, as one implementation of step S105, the step of fusing character similarity rate and semantic similarity, mapping similarity levels based on a preset judgment matrix, and calculating a weighted risk value by fusing the initial weight values of each semantic module includes:
[0131] Step S601: Obtain the character similarity rate set, the semantic similarity fragment list, and the initial weight value of each semantic module;
[0132] The character similarity rate set is typically a key-value structure that records the character-level similarity ratio for each module (such as the "core technology module"). It is an objective indicator based on surface text consistency. The semantically similar fragment list contains, for each fragment that the deep semantic model identified as having a different expression but the same meaning, its specific location, its similarity value, and the module to which it belongs. It is a combined qualitative and quantitative indicator of deep semantic consistency. The initial weight value of each module quantifies the relative importance of that module in the overall technical solution.
[0133] Step S602: Map the similarity level using the preset judgment matrix and generate a preliminary risk label;
[0134] The system has one or more preset judgment matrices, which are essentially two-dimensional lookup tables or rule sets. The two input dimensions are the character similarity rate and the semantic similarity.
[0135] In this embodiment, the rules of the judgment matrix are dynamic and depend on the type of module being processed. For core technology modules, due to their high importance, the system uses a highly sensitive judgment matrix (i.e., the first matrix). For example, even if the character similarity rate is not high (e.g., 30%), as long as the semantic similarity reaches a certain threshold (e.g., 0.7), it may be mapped to the "high-risk" level. This aims to capture sophisticated plagiarism behaviors that have undergone careful semantic rewriting.
[0136] Conversely, for the general specification module, a low-sensitivity judgment matrix (i.e., the second matrix) is used, with a higher judgment threshold (such as requiring a character similarity rate of more than 60% and a semantic similarity of more than 0.9), thereby avoiding misjudging the inevitable repetition of standard expressions as a risk.
[0137] Through this matrix mapping process, the system generates a preliminary risk label (such as high, medium, and low levels) for the similarity of each module. This label is a qualitative conclusion after a comprehensive preliminary assessment of evidence from both character and semantic dimensions.
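One way to express the two judgment matrices as ordered rule sets is sketched below. Only the "high" rows reflect the example thresholds in the text; the "medium" rows, the default of "low", and the use of inclusive comparisons are assumptions added to make the lookup complete.

```python
# Each row: (minimum character similarity rate, minimum semantic similarity, label).
# Rows are checked in order; the first row whose minimums are both met wins.
JUDGMENT_MATRICES = {
    "core_technology": [        # first matrix: high sensitivity
        (0.00, 0.70, "high"),
        (0.30, 0.00, "medium"),  # assumption
    ],
    "general_specification": [  # second matrix: low sensitivity
        (0.60, 0.90, "high"),
        (0.60, 0.00, "medium"),  # assumption
    ],
}

def preliminary_risk_label(module_type, char_rate, semantic_sim):
    """Look up the preliminary risk label for one module."""
    for min_char, min_sem, label in JUDGMENT_MATRICES[module_type]:
        if char_rate >= min_char and semantic_sim >= min_sem:
            return label
    return "low"  # assumption: anything below all rows is low risk

core_label = preliminary_risk_label("core_technology", 0.30, 0.70)
general_label = preliminary_risk_label("general_specification", 0.30, 0.70)
```

The same pair of scores yields different labels depending on the module type, which is the point of keeping separate matrices.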
[0138] Step S603: Perform a weighted fusion calculation on the preliminary risk label and the initial weight value of each semantic module to generate a weighted risk value for each module;
[0139] Although the preliminary risk label is intuitive, it remains a qualitative description and does not account for differences in importance between modules. For example, the severity of a "high risk" rating for a general module may be far lower than that of a "medium risk" rating for a core module. This step multiplies the preliminary risk level, which reflects the "severity of plagiarism," by the module weight, which reflects the "importance of the plagiarized content," to obtain a module-level quantitative risk score that reflects the magnitude of potential harm.
[0140] Specifically, the qualitative preliminary risk label is first mapped to a quantitative risk level coefficient (e.g., high risk is mapped to a coefficient of 1.0, medium risk to 0.6, and low risk to 0.2). Then the core multiplication is performed: module-level weighted risk value = risk level coefficient × module weight value.
[0141] The final risk value is thus influenced by both the severity of the plagiarism and the importance of the copied portion. Under this multiplicative model, a medium similarity (risk level coefficient 0.6) in a core module (weight 0.8) yields a risk value of 0.6 × 0.8 = 0.48, significantly higher than the 1.0 × 0.15 = 0.15 produced by a high similarity (risk level coefficient 1.0) in a general module (weight 0.15). This lets risk scoring accurately target similarities in key technologies, achieving genuinely risk-oriented detection.
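The multiplication can be checked with a few lines of code. The coefficient table restates the example values from the text; the function name is hypothetical.

```python
# Example coefficients from the text.
RISK_COEFFICIENTS = {"high": 1.0, "medium": 0.6, "low": 0.2}

def module_weighted_risk(label: str, weight: float) -> float:
    """Module-level weighted risk value = risk level coefficient x module weight."""
    return RISK_COEFFICIENTS[label] * weight

core_risk = module_weighted_risk("medium", 0.80)     # medium similarity, core module
general_risk = module_weighted_risk("high", 0.15)    # high similarity, general module
```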
[0142] Step S604: Aggregate the weighted risk values of each module to generate the final weighted risk value.
[0143] After obtaining the module-level weighted risk value for each module, it is necessary to aggregate them into an overall evaluation. This step integrates the risk contributions of all modules in the entire technical bid to form a single, comprehensive overall risk index, while ensuring that this index is not dominated by a large number of non-core modules.
[0144] In the embodiments of this application, a normalized aggregation algorithm is adopted. For example, the arithmetic mean of the weighted risk values of all modules is calculated, that is, the overall weighted risk value = Σ(module weighted risk value) / total number of modules, in order to measure the "average risk level contributed by each module", thereby avoiding the disproportionate impact on the overall result due to a module being too long or too short.
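The arithmetic-mean aggregation can be sketched as follows. The 0.4 decision threshold in `verdict` is purely an illustrative assumption; the text says a threshold can be set but does not give a value.

```python
def overall_weighted_risk(module_risks):
    """Arithmetic mean of the module-level weighted risk values."""
    if not module_risks:
        return 0.0
    return sum(module_risks) / len(module_risks)

def verdict(overall: float, threshold: float = 0.4) -> str:
    """Map the overall value to a conclusion; the threshold value is an assumption."""
    return "high risk" if overall >= threshold else "normal"

# Hypothetical module-level values: one core module and one general module.
overall = overall_weighted_risk([0.48, 0.15])
conclusion = verdict(overall)
```

Because every module-level value lies between 0 and 1 when the coefficient and the weight do, the mean also stays within that range.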
[0145] Ultimately, the generated overall weighted risk value is a quantitative index between 0 and 1 that clearly and intuitively reflects the level of similarity risk of the entire technical bid. The system can set a threshold for this overall value and automatically give a final judgment conclusion such as "high risk" or "normal". Together with the detailed data of each module, it forms a complete, structured risk report that combines qualitative and quantitative analysis, providing solid and intuitive data support for regulatory decisions.
[0146] In the above implementation, the underlying detection evidence from both character and semantic dimensions is transformed into a weighted risk index that reflects both local risk details and overall risk levels. This solution cleverly balances the severity of plagiarism with the importance of the plagiarized content through a modular judgment strategy and a multiplicative fusion model. This allows the final risk quantification result to accurately focus on truly harmful, core-technology-related similarities, thus providing regulatory authorities with a quantifiable, interpretable, and highly valuable intelligent judgment tool. This fundamentally solves the pain point of traditional methods that only provide a single similarity percentage and cannot support precise regulation.
[0147] Referring to Figure 7, as a further implementation of the technical bid similarity detection method, the detection method also includes:
[0148] Step S701: Obtain a list of semantically similar fragments and their corresponding semantic module type identifiers, and associate them with the initial weight values in the modular text set;
[0149] The list of semantically similar segments is a set of results output after preliminary semantic similarity calculation (e.g., cosine similarity exceeding a certain basic threshold). The semantic module type identifier is assigned during the technical specification structured parsing stage. It uniquely associates the segment with the semantic module to which it belongs (e.g., "core technology module" or "general specification module"). Each module has a pre-set initial weight value, which quantifies the technical importance of the module's content.
[0150] Step S702: Locate the context text range based on the semantic module type identifier, and extract the text content of the preceding and following paragraphs of semantically similar segments;
[0151] The underlying logic of this step lies in introducing "contextual integrity" analysis to overcome the misjudgment of "taking things out of context" that may result from comparing isolated text fragments. Semantic similarity may stem from coincidental common expressions or from intentional copying of specific technical solutions. To distinguish between the two, suspected similar fragments must be placed back into their original text streams for examination.
[0152] Specifically, the system utilizes the paragraph position index established during the document parsing phase to quickly locate the document region where the segment is located based on the "semantic module type identifier." Subsequently, it strategically extracts the text content of its "preceding paragraph" and "following paragraph."
[0153] Understandably, for a passage genuinely suspected of plagiarism, the preceding argument (the technical background and problem statement provided in the preceding paragraph) and the subsequent technical extension (the specific parameters and implementation details proposed in the following paragraph) should also be highly relevant or consistent across the two documents being compared. For example, if both tender documents contain the semantically similar phrase "using prestressed tensioning technology," it might be difficult to determine plagiarism based solely on this sentence. However, if document A discusses "bridge reinforcement" in the preceding paragraph and "crack control" in the following paragraph, while document B discusses "floor slab construction" in the preceding paragraph and "mold installation" in the following paragraph, then although the core sentences are similar, the technical contexts are completely different, greatly reducing the likelihood of plagiarism.
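The context extraction of step S702 can be sketched as follows, assuming that each semantic module is available as an ordered list of paragraphs and that the paragraph position index gives the position of the similar segment within that list. The function name `extract_context` and the sample paragraph contents are hypothetical.

```python
from typing import List, Tuple


def extract_context(paragraphs: List[str], index: int) -> Tuple[str, str]:
    """Return the (preceding, following) paragraphs around paragraphs[index].

    An empty string is returned on a side where the module has no neighbour.
    """
    if not 0 <= index < len(paragraphs):
        raise IndexError("index is outside the module's paragraph range")
    preceding = paragraphs[index - 1] if index > 0 else ""
    following = paragraphs[index + 1] if index + 1 < len(paragraphs) else ""
    return preceding, following


# Hypothetical module contents for the two bids in the example above.
doc_a = ["bridge reinforcement", "using prestressed tensioning technology", "crack control"]
doc_b = ["floor slab construction", "using prestressed tensioning technology", "mold installation"]

context_a = extract_context(doc_a, 1)
context_b = extract_context(doc_b, 1)
print(context_a, context_b)
```

The two bids share the middle sentence, but the extracted contexts differ, which is the signal the later steps quantify.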
[0154] Step S703: Call the pre-trained semantic model to perform vectorization encoding on the semantically similar fragment text, the preceding paragraph text and the following paragraph text respectively, to generate fragment semantic vector, preceding context vector and following context vector.
[0155] In this embodiment, a deep language model (such as BERT) pre-trained on a large-scale corpus and fine-tuned on a domain-specific corpus (such as engineering construction) is used to transform unstructured natural language text into mathematical vectors (i.e., embedding vectors) in a high-dimensional space. This process is not a simple word frequency count, but rather the result of the model comprehensively encoding the grammar, semantics, and even some pragmatic information of the text based on its built-in attention mechanism and other structures.
[0156] Specifically, the segment semantic vector encapsulates the core meaning of the suspected similarity point itself; the preceding context vector represents the semantic field of the background, premise, or problem domain that triggered the technical point; and the subsequent context vector represents the semantic field of the specific measures, results, or next steps derived from the technical point. Through this separate encoding, the system can not only compare the segments themselves (point-to-point), but also precisely analyze the semantic correlation strength between the segment and its context (point-to-field).
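The point-to-field comparison can be illustrated with cosine similarity on small hand-written vectors. This is only a sketch: real vectors would come from the pre-trained semantic model, and averaging the two context similarities in `point_to_field` is an assumption of this example, not a formula given in the application.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def point_to_field(fragment, preceding, following) -> float:
    """Average correlation between a fragment vector and its two context vectors."""
    return (cosine(fragment, preceding) + cosine(fragment, following)) / 2


# Toy 2-dimensional vectors standing in for model embeddings.
fragment_vec = [1.0, 0.0]
preceding_vec = [1.0, 0.0]   # same direction as the fragment
following_vec = [0.0, 1.0]   # orthogonal to the fragment
print(point_to_field(fragment_vec, preceding_vec, following_vec))
```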
[0157] Step S704: Construct a dynamic semantic interference field based on the preceding context vector and the following context vector; calculate the perturbation tolerance of the segment semantic vector in the dynamic semantic interference field through a pre-set semantic perturbation analysis model; and output the semantic stability coefficient.
[0158] The principle behind this step is to simulate a potential plagiarist's "disturbance" behavior, such as paraphrasing and sentence restructuring, of the original text without altering its core semantics. The dynamic semantic interference field is a conceptual model, a semantic space region jointly defined by "preceding context vectors" and "subsequent context vectors." It represents the set of all acceptable expressions functionally equivalent to the original text under specific technical context constraints.
[0159] In this embodiment, the pre-built "semantic perturbation analysis model" (whose training may involve adversarial learning or a large number of rewritten sample pairs) evaluates how far a given "fragment semantic vector" can deviate from its original position (i.e., be rewritten) while still falling within its contextual semantic field. If the fragment vector still matches its preceding and following contexts (i.e., still belongs to the interference field) even after a significant shift (corresponding to a large textual change), the fragment's expression is highly substitutable and has low semantic stability, meaning it can easily be rewritten legitimately. In that case the similarity is more likely to be accidental and its evidentiary value for plagiarism is weaker, so the output semantic stability coefficient is lower. Conversely, if the fragment vector must remain precisely at its current position, and a slight shift would detach it from the context, its expression is highly specific and stable. In this case, similarity is more anomalous, and the output semantic stability coefficient is higher.
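The relationship between perturbation tolerance and the semantic stability coefficient can be illustrated with a deliberately simple, deterministic stand-in. The application describes a trained semantic perturbation analysis model; the sketch below does not reproduce it. It assumes instead that the interference field is the set of vectors whose cosine similarity to the mean of the two context vectors is at least `membership`, and that perturbations are axis-aligned shifts of increasing size. All names and parameter values are hypothetical.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 0.0 if nu == 0.0 or nv == 0.0 else dot / (nu * nv)


def stability_coefficient(fragment, preceding, following,
                          membership=0.8, steps=(0.5, 1.0, 1.5, 2.0)) -> float:
    """Return 1 - tolerance / max_step, where tolerance is the largest shift
    (from the ascending grid `steps`) that keeps every axis-aligned
    perturbation of the fragment inside the assumed interference field."""
    center = [(p + f) / 2 for p, f in zip(preceding, following)]
    tolerance = 0.0
    for step in steps:
        inside = True
        for axis in range(len(fragment)):
            for sign in (-1.0, 1.0):
                shifted = list(fragment)
                shifted[axis] += sign * step
                if cosine(shifted, center) < membership:
                    inside = False
        if not inside:
            break
        tolerance = step
    return 1.0 - tolerance / steps[-1]


preceding_vec, following_vec = [1.0, 0.2], [1.0, -0.2]
generic = stability_coefficient([5.0, 0.0], preceding_vec, following_vec)   # deep inside the field
specific = stability_coefficient([1.0, 0.7], preceding_vec, following_vec)  # near the field's edge
print(generic, specific)
```

In this toy setting the fragment that tolerates every shift receives the lowest coefficient, and the fragment that leaves the field at the smallest shift receives the highest, matching the direction described in the paragraph above.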
[0160] Step S705: Combine the initial weight value and semantic stability coefficient corresponding to the semantic module type identifier to dynamically generate the anti-rewriting semantic similarity judgment threshold.
[0161] Here, the initial weight value corresponding to the semantic module type identifier represents the dimension of technical importance: similarity in core innovation points is far more harmful than similarity in general terms. The semantic stability coefficient represents the dimension of expression uniqueness: if a segment with high stability (i.e., one that is unlikely to coincide by accident or to result from benign rewriting) appears similar, the suspicion of intentional plagiarism is far greater than for a segment with low stability (general expression).
[0162] In this embodiment, the system fuses these two factors (e.g., through weighting or function mapping) to dynamically generate an "anti-rewriting semantic similarity judgment threshold". Specifically, for high-weight, high-stability segments (such as a unique technical solution description), the system generates a lower, more sensitive threshold: because the content is important and the expression is unique, even a semantic similarity that is not extremely high is still noteworthy.
[0163] For low-weight, low-stability segments (such as a routine management description), the system will generate a stricter (higher) judgment threshold because the content itself allows a wide range of expressions and requires a very high similarity to be judged as abnormal. This dynamic mechanism ensures that the judgment criteria match the actual risk level of the segment.
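One possible function mapping for the dynamic threshold is sketched below. The linear form, the bounds 0.70 and 0.95, and the name `dynamic_threshold` are assumptions made for illustration; the application only requires that the threshold fall as weight and stability rise.

```python
def dynamic_threshold(weight: float, stability: float,
                      low: float = 0.70, high: float = 0.95) -> float:
    """Hypothetical mapping: the threshold moves from `high` towards `low`
    as the product of module weight and semantic stability grows."""
    if not (0.0 <= weight <= 1.0 and 0.0 <= stability <= 1.0):
        raise ValueError("weight and stability are expected in [0, 1]")
    return high - (high - low) * weight * stability


unique_solution = dynamic_threshold(weight=0.8, stability=0.9)   # high weight, high stability
routine_text = dynamic_threshold(weight=0.15, stability=0.2)     # low weight, low stability
print(round(unique_solution, 4), round(routine_text, 4))
```

With these assumed bounds, the unique technical description is judged against a threshold of about 0.77, while the routine description needs a similarity of about 0.94 to be flagged.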
[0164] Step S706: Based on the anti-rewriting semantic similarity judgment threshold, re-verify the semantic similarity of semantic similar segments and generate an anti-rewriting semantic similarity judgment result set;
[0165] The system compares the semantic similarity value of each pair of semantically similar segments with its corresponding dynamic threshold. If the similarity exceeds the threshold, it is classified as "high-risk anti-rewriting similarity"; if it is close to the threshold, it may be subject to final arbitration based on its contextual vector (for example, even if the similarity is slightly below the threshold, if the preceding and following contexts are highly consistent, it can still be classified as similar); if it is far below the threshold, it may be downgraded to "low-risk" or "no risk". The resulting anti-rewriting semantic similarity judgment set is no longer just a list of similarity scores, but a classification result with a clear risk level that incorporates multi-dimensional intelligent judgments such as technical importance, expression uniqueness, and contextual relevance, greatly enhancing its accuracy and interpretability.
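The three-way re-verification can be sketched as follows. The margin that defines "close to the threshold", the context consistency bar used for arbitration, and the exact label assigned in each branch are assumptions of this example.

```python
def reverify(similarity: float, threshold: float, context_consistency: float,
             margin: float = 0.03, context_bar: float = 0.9) -> str:
    """Classify one pair of semantically similar segments.

    `margin` and `context_bar` are hypothetical tuning parameters.
    """
    if similarity >= threshold:
        return "high-risk"
    if threshold - similarity <= margin:
        # Near the threshold: arbitrate using how consistent the contexts are.
        return "high-risk" if context_consistency >= context_bar else "low-risk"
    return "no risk"


print(reverify(0.80, 0.77, 0.10))  # above the threshold
print(reverify(0.75, 0.77, 0.95))  # slightly below, contexts highly consistent
print(reverify(0.75, 0.77, 0.50))  # slightly below, contexts differ
print(reverify(0.40, 0.77, 0.95))  # far below the threshold
```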
[0166] Step S707: Associate the character similarity rate set with the anti-rewriting semantic similarity judgment result set, and aggregate them according to the semantic module type identifier to generate risk decision nodes;
[0167] This approach links the character similarity rate set with the anti-rewriting semantic similarity judgment result set, using the "semantic module type identifier" as the aggregation unit. Each module generates an independent "risk decision node." This node encapsulates the module's multi-dimensional risk information: the similarity rate at the character level and the risk judgment result at the semantic level (such as the number and location of high-risk segments), and it is associated with the module's initial weight. This aggregation method transforms a complex technical bid document into a structured risk assessment graph composed of multiple decision nodes. Each node is an independent analysis unit, clearly revealing "which technical part (module)", "what type of similarity (character / semantic) occurred", and "its severity". This provides a granular and comprehensive data structure for subsequent overall risk assessment and resource scheduling.
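A risk decision node and the per-module aggregation can be sketched as a small data structure. The field names, the dictionary layout of the semantic results, and the module identifiers are hypothetical; the sketch also assumes that every module referenced by a semantic result has a configured weight.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RiskDecisionNode:
    module_type: str
    initial_weight: float
    character_similarity_rate: float
    high_risk_segments: List[str] = field(default_factory=list)


def build_nodes(weights: Dict[str, float],
                char_rates: Dict[str, float],
                semantic_results: List[dict]) -> Dict[str, RiskDecisionNode]:
    """Create one node per module, keyed by semantic module type identifier."""
    nodes = {
        module: RiskDecisionNode(module, weight, char_rates.get(module, 0.0))
        for module, weight in weights.items()
    }
    for result in semantic_results:
        if result["level"] == "high-risk":
            nodes[result["module_type"]].high_risk_segments.append(result["location"])
    return nodes


nodes = build_nodes(
    weights={"technical_solution": 0.8, "general_specification": 0.15},
    char_rates={"technical_solution": 0.35, "general_specification": 0.60},
    semantic_results=[
        {"module_type": "technical_solution", "level": "high-risk", "location": "p12"},
        {"module_type": "general_specification", "level": "low-risk", "location": "p40"},
    ],
)
print(nodes["technical_solution"])
```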
[0168] Step S708: For risk decision nodes whose weighted risk value exceeds the preset warning threshold, the distributed computing engine is triggered to perform priority recalculation on the core semantic module.
[0169] Specifically, not every part of the batch detection process for technical bids requires the same amount of computing resources. The "weighted risk value" calculated by the "risk decision node" is a quantitative indicator of the overall risk level. When this value exceeds the "preset warning threshold," it indicates that the current bid pair is highly suspected of being similar and requires further careful verification.
[0170] At this point, instead of blindly performing recalculation on all modules of the entire document, the system intelligently "triggers the distributed computing engine" and instructs it to perform "priority recalculation" on the "core semantic modules" (i.e., high-weight modules, such as the technical solution module). This means that the system will mobilize more computing resources (such as GPU clusters) and employ potentially more complex, accurate, but also more time-consuming algorithms or model parameters to perform a second, in-depth analysis of the most critical parts of the high-risk pairs. This mechanism ensures that, under limited computing resources and time constraints, priority is given to ensuring the accuracy of the analysis of the parts that have the greatest impact on the final judgment and are the most risky, thereby achieving an optimal balance between efficiency and accuracy overall.
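The trigger-and-prioritize logic can be sketched without a real distributed engine: the function below only decides which modules would be submitted for recalculation and in what order. The warning threshold 0.4, the core-weight cutoff 0.5, and the dictionary layout are assumptions of this example.

```python
from typing import Dict, List


def plan_recalculation(nodes: List[Dict], warning_threshold: float = 0.4,
                       core_weight: float = 0.5) -> List[str]:
    """Return the core modules to recompute, highest weight first.

    Nothing is scheduled unless at least one node exceeds the warning threshold.
    """
    triggered = any(node["weighted_risk"] > warning_threshold for node in nodes)
    if not triggered:
        return []
    core = [node for node in nodes if node["weight"] >= core_weight]
    core.sort(key=lambda node: node["weight"], reverse=True)
    return [node["module_type"] for node in core]


bid_pair_nodes = [
    {"module_type": "general_specification", "weight": 0.15, "weighted_risk": 0.15},
    {"module_type": "quality_assurance", "weight": 0.5, "weighted_risk": 0.10},
    {"module_type": "technical_solution", "weight": 0.8, "weighted_risk": 0.48},
]
print(plan_recalculation(bid_pair_nodes))
```

In a deployment, the returned list would be handed to whatever job scheduler the distributed computing engine exposes; that interface is outside the scope of this sketch.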
[0171] Step S709: Integrate all risk decision nodes to construct a structured risk decision tree, and anchor the decision tree hash value and the report hash value to the blockchain evidence storage system.
[0172] Specifically, the risk decision nodes of each module are organized into a hierarchical tree data structure according to the logical structure of the technical bid (such as chapters and module affiliations). This tree not only records the final risk conclusion, but also records the path to reaching that conclusion: from the root node (the entire technical bid) to the branches (major modules), and then to the leaf nodes (the specific risk judgment results and the character and semantic data on which they are based).
[0173] The system then calculates the cryptographic hash value (a unique and irreversible digital fingerprint) of the structured data for the complete risk decision tree and anchors it, along with the hash value of the final visualization report, to the blockchain's evidence storage system. The decentralized, timestamped, and immutable nature of the blockchain ensures that these two hash values and their associated time information are permanently, publicly, and reliably recorded. Any subtle alteration to the original analysis report or decision tree data will result in a mismatch between its hash value and the record stored on the blockchain, thus being immediately detected.
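The hashing and tamper check can be sketched with the Python standard library. The application does not name a hash algorithm or serialization format; SHA-256 over canonical JSON (sorted keys, fixed separators) is an assumption here, and writing the digest to a blockchain is not shown. The tree contents are hypothetical.

```python
import hashlib
import json


def canonical_hash(data) -> str:
    """SHA-256 over a canonical JSON form, so key order does not affect the digest."""
    payload = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def verify(data, anchored_hash: str) -> bool:
    """Recompute the digest and compare it with the value anchored earlier."""
    return canonical_hash(data) == anchored_hash


decision_tree = {
    "node": "technical bid",
    "children": [
        {"node": "technical_solution", "weighted_risk": 0.48, "high_risk_segments": ["p12"]},
        {"node": "general_specification", "weighted_risk": 0.15, "high_risk_segments": []},
    ],
}
anchored = canonical_hash(decision_tree)  # this digest would be written to the chain

tampered = json.loads(json.dumps(decision_tree))  # deep copy via JSON round trip
tampered["children"][0]["weighted_risk"] = 0.10
print(verify(decision_tree, anchored), verify(tampered, anchored))
```

Canonical serialization matters because two logically identical trees with different key order would otherwise hash differently and produce a false tamper alarm.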
[0174] In the above implementation, semantically similar fragments are associated with a modular weighting system, enabling precise focus on risk concerns. By introducing contextual analysis and semantic stability quantification, a "disturbance tolerance" assessment model to resist advanced rewriting plagiarism is constructed, significantly improving the ability to identify intelligent and covert plagiarism. By dynamically generating judgment thresholds by integrating module weights and semantic stability, the judgment criteria become flexible and intelligent, effectively reducing misjudgments and omissions. By aggregating character and semantic evidence into risk decision nodes by module, a clear structured risk assessment graph is formed. Through a risk-based priority recalculation scheduling mechanism, intelligent allocation of computing resources in massive data analysis is achieved. Finally, by constructing a traceable risk decision tree and using blockchain to solidify the entire chain of evidence, a seamless connection is formed from technical detection to judicial evidence.
[0175] The entire solution deeply integrates deep learning, natural language processing, distributed computing, and blockchain evidence storage technology. It not only solves the core technical challenges of "detecting accurately and effectively," but also systematically addresses the engineering and legal practice challenges of "detecting quickly and providing solid evidence." It provides a highly reliable and authoritative automated technical supervision tool for ensuring fairness and impartiality in the bidding and tendering field.
[0176] This application also discloses a technical bid similarity detection system based on dual-dimensional semantic recognition.
[0177] A technical bid similarity detection system based on dual-dimensional semantic recognition, the detection system includes:
[0178] The data standardization module is used to parse multi-format technical specification files, extract text content, perform standardization and cleaning processes, and generate standardized text data that retains paragraph structure.
[0179] The weight assignment module is used to divide standardized text data into multiple semantic modules based on a pre-built industry knowledge base, configure the initial weight value of each semantic module, and generate a modular text set with weight labels.
[0180] The character similarity detection module is used to generate corresponding window configuration parameters based on the initial weight values of each semantic module in the modular text set, perform sliding hash calculation on the text content according to the window configuration parameters, locate similar character segments and calculate the character similarity rate;
[0181] The context verification module is used to extract text vectors from a modular text set through a pre-trained semantic model, calculate semantic similarity, and combine contextual relevance verification to output semantically similar segments.
[0182] The risk quantification module is used to integrate character similarity rate and semantic similarity, map similarity level based on a pre-set judgment matrix, and calculate weighted risk value by integrating the initial weight values of each semantic module.
[0183] The visualization report generation module is used to integrate identical character fragments, semantically identical fragments, and weighted risk values to generate a visualization analysis report;
[0184] The blockchain evidence storage module is used to extract the report hash value and write it to the blockchain for evidence storage.
[0185] The technical bid similarity detection system based on dual-dimensional semantic recognition in this application embodiment can implement any of the above methods, and the specific working process of each module in the system can refer to the corresponding process in the above method embodiments.
[0186] In the several embodiments provided in this application, it should be understood that the provided methods and systems can be implemented in other ways. The system embodiments described above are merely illustrative. For example, the division into modules is merely a logical functional division, and other divisions are possible in actual implementation: multiple modules can be combined or integrated into another system, or some features can be ignored or not executed.
[0187] This application also discloses a computer device.
[0188] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the technical bid similarity detection method based on dual-dimensional semantic recognition as described above.
[0189] This application also discloses a computer-readable storage medium.
[0190] A computer-readable storage medium stores a computer program that can be loaded by a processor to execute any of the technical bid similarity detection methods based on dual-dimensional semantic recognition described above.
[0191] The computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device; the program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.
[0192] The above are all preferred embodiments of this application and are not intended to limit the scope of protection of this application. Any feature disclosed in this specification (including the abstract and drawings) may be replaced by other equivalent or similar features unless specifically stated otherwise. That is, unless specifically stated otherwise, each feature is only one example of a series of equivalent or similar features.
Claims
1. A method for detecting technical similarities based on dual-dimensional semantic recognition, characterized in that, The detection method includes: Parse multi-format technical specification files, extract text content, perform standardization and cleaning processes, and generate standardized text data that retains paragraph structure; Based on a pre-built industry knowledge base, the standardized text data is divided into multiple semantic modules, and the initial weight values of each semantic module are configured to generate a modular text set with weight labels. Based on the initial weight values of each semantic module in the modular text set, corresponding window configuration parameters are generated. According to the window configuration parameters, sliding hash calculation is performed on the text content to locate similar character segments and calculate the character similarity rate. The modular text set is processed by extracting text vectors through a pre-trained semantic model, calculating semantic similarity, and combining contextual relevance verification to output semantically similar segments. The character similarity rate and semantic similarity are combined, and the similarity level is mapped based on a preset judgment matrix. The weighted risk value is calculated by combining the initial weight values of each semantic module. By integrating the identical character fragments, semantically identical fragments, and weighted risk values, a visual analysis report is generated, and the report's hash value is extracted and written to the blockchain for evidence storage.
2. The method for detecting technical similarities based on dual-dimensional semantic recognition according to claim 1, characterized in that, The steps involved in parsing multi-format technical specification files, extracting text content, performing standardization and cleaning processes, and generating standardized text data that retains paragraph structure include: Acquire multi-format technical specification files, identify file types and call the corresponding parsing engine to extract the original text content containing paragraph position information; Non-substantive text fragments, including headers, footers, page numbers, and fixed template content of the bidding party, are deleted based on a pre-set cleaning rule base. The extracted raw text content is normalized to unify the expression of numbers and units, and filter out special symbols and redundant spaces. The mapping relationship between the original paragraph position information and the text content is preserved to generate standardized text data.
3. The method for detecting technical similarities based on dual-dimensional semantic recognition according to claim 1, characterized in that, Based on a pre-built industry knowledge base, the standardized text data is divided into multiple semantic modules, and initial weight values are configured for each semantic module to generate a modular text set with weight labels. The steps include: Obtain standardized text data, call the text classification model of the pre-built industry knowledge base, divide the text content into multiple semantic module types and generate temporary module labels; Based on a pre-defined weight rule library for semantic module type mapping, initial weight values are assigned to each semantic module. Receive external configuration instructions, adjust the initial weight values of the specified semantic modules, and generate adjustment logs; The semantic module text content, temporary module tags, and initial weight values are bound together to generate a modular text collection with weight tags.
4. The method for detecting technical similarities based on dual-dimensional semantic recognition according to claim 1, characterized in that, Based on the initial weight values of each semantic module in the modular text set, corresponding window configuration parameters are generated. The steps of performing sliding hash calculation on the text content according to the window configuration parameters, locating similar character segments, and calculating the character similarity rate include: Obtain a modular text set with weighted labels, and extract the initial weight values and standardized text content of each semantic module; Window configuration parameters are dynamically mapped based on the initial weight values of each semantic module; Based on the window configuration parameters, a sliding hash calculation is performed on the text content to locate identical character fragments and record their position information; Normalize the located identical character fragments to filter out non-substantial differences; The length ratio of identical character segments within each semantic module is statistically analyzed, and the character similarity rate is calculated.
5. The method for detecting technical similarities based on dual-dimensional semantic recognition according to claim 1, characterized in that, The steps of extracting text vectors from the modular text set using a pre-trained semantic model, calculating semantic similarity, and verifying semantically similar segments based on contextual relevance include: Obtain a modular text set with weighted labels and extract the standardized text content of each semantic module; The pre-trained semantic model is invoked to vectorize the text content, generating a set of semantic vectors. The semantic similarity between text segments is calculated based on a set of semantic vectors. When the semantic similarity exceeds a preset similarity threshold, the segments are marked as candidate semantically similar segments. The contextual similarity of the candidate semantically similar segments is checked in conjunction with the contextual relevance to obtain the contextual similarity check result; Based on the context similarity verification results, output a list of semantically similar segments and their corresponding location information.
6. The method for detecting technical similarities based on dual-dimensional semantic recognition according to claim 5, characterized in that, The steps of integrating the character similarity rate and semantic similarity, mapping similarity levels based on a preset judgment matrix, and calculating a weighted risk value by integrating the initial weight values of each semantic module include: Obtain the character similarity rate set, the semantic similarity segment list, and the initial weight value of each semantic module; Based on a pre-set judgment matrix mapping similarity levels, a preliminary risk label is generated. A weighted fusion calculation is performed on the preliminary risk identifier and the initial weight value of each semantic module to generate a weighted risk value for each module; The weighted risk values of each module are aggregated to generate the final weighted risk value.
7. A method for detecting technical similarities based on dual-dimensional semantic recognition according to any one of claims 1 to 6, characterized in that, The detection method further includes: Obtain a list of semantically similar fragments and their corresponding semantic module type identifiers, and associate them with the initial weight values in the modular text set; Based on the semantic module type identifier, locate the context text range and extract the text content of the preceding and following paragraphs of the semantically similar fragments; The pre-trained semantic model is invoked to vectorize the semantically similar text fragments, the preceding paragraph texts, and the following paragraph texts, generating fragment semantic vectors, preceding context vectors, and following context vectors. A dynamic semantic interference field is constructed based on the preceding context vector and the subsequent context vector. The perturbation tolerance of the segment semantic vector in the dynamic semantic interference field is calculated by a pre-set semantic perturbation analysis model, and the semantic stability coefficient is output. By combining the initial weight value corresponding to the semantic module type identifier and the semantic stability coefficient, a threshold for determining semantic similarity against rewriting is dynamically generated. Based on the aforementioned anti-rewriting semantic similarity determination threshold, the semantic similarity of semantically similar segments is re-verified, and an anti-rewriting semantic similarity determination result set is generated; The set of similarity rates of related characters and the set of semantic similarity judgment results against rewriting are aggregated according to the semantic module type to generate risk decision nodes; For risk decision nodes whose weighted risk value exceeds a preset warning threshold, the distributed computing engine is triggered to perform priority recalculation on the core semantic module; A structured risk decision tree is constructed by integrating all risk decision nodes, and the hash value of the decision tree and the hash value of the report are jointly anchored to the blockchain evidence storage system.
8. A technical mark similarity detection system based on dual-dimensional semantic recognition, characterized in that, The detection system includes: The data standardization module is used to parse multi-format technical specification files, extract text content, perform standardization and cleaning processes, and generate standardized text data that retains paragraph structure. The weight assignment module is used to divide the standardized text data into multiple semantic modules based on a pre-set industry knowledge base, configure the initial weight value of each semantic module, and generate a modular text set with weight labels. The character similarity detection module is used to generate corresponding window configuration parameters based on the initial weight values of each semantic module in the modular text set, perform sliding hash calculation on the text content according to the window configuration parameters, locate similar character segments and calculate the character similarity rate; The context verification module is used to extract text vectors from the modular text set through a pre-trained semantic model, calculate semantic similarity, and output semantically similar segments in combination with contextual relevance verification. The risk quantification module is used to integrate the character similarity rate and semantic similarity, map the similarity level based on a preset judgment matrix, and calculate the weighted risk value by integrating the initial weight values of each semantic module. The visualization report generation module is used to integrate the identical character fragments, semantically identical fragments, and weighted risk values to generate a visualization analysis report; The blockchain evidence storage module is used to extract the report hash value and write it to the blockchain for evidence storage.
9. A computer device, characterized in that: The device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the method as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that: It stores a computer program that can be loaded by a processor to execute the method as described in any one of claims 1 to 7.