Water treatment field file processing and classification method based on large language model
File types are classified by detecting file header features and field parsability; a large language model is combined with skew correction and multi-element recognition; a five-layer professional cleaning process is designed; and a seed sentence library is built for fine-tuning the error correction model. This addresses the problems of accurately identifying structured and unstructured files and efficiently classifying professional data in water treatment file processing, enabling efficient, accurate file processing and knowledge accumulation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- PENYAO ENVIRONMENTAL PROTECTION
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing document processing technologies in the water treatment field suffer from insufficient accuracy in distinguishing between structured and unstructured documents, low recognition precision, inaccurate recognition of professional symbols by generalized OCR engines, lack of semantic constraints in data cleaning, and over-generalization of generalized error correction models, which fail to form an iterative knowledge system, resulting in wasted computing resources and loss of professional semantics.
File types are classified by detecting the characteristic byte sequence and field parsability of the file header, lightweight pre-reading is performed, skew correction and multi-element synchronous recognition are combined with a large language model, a five-layer professional cleaning process is designed, a seed sentence library is built for error correction model fine-tuning, and automatic classification and iterative updates are achieved based on semantic alignment mechanism.
It achieves efficient diversion and accurate identification of water treatment documents, improves the accuracy and efficiency of document processing, enhances the adaptability to complex engineering archives, ensures high-precision error correction and compliance verification of professional data, and supports adaptive expansion of the classification system and knowledge accumulation.
Figure CN122240575A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of information retrieval technology, specifically to a method for document processing and classification in the field of water treatment based on a large language model.
Background Technology
[0002] In water treatment engineering practice, a large amount of critical information exists in various file formats, covering design drawings, test reports, water quality monitoring data, process parameter documents, and engineering plans. These documents contain professional data such as equipment codes, standard units of measurement, industry abbreviations, chemical reagent labels, and mathematical expressions, serving as the core basis for process optimization, water quality assessment, and engineering decisions.
[0003] However, existing document processing technologies have the following shortcomings: First, they lack a precise mechanism for distinguishing between structured and unstructured documents, and misuse optical character recognition (OCR) technology for directly parsable structured documents, resulting in wasted computing resources and the introduction of additional errors. Second, general-purpose OCR engines lack sufficient accuracy in recognizing engineering documents with tilted, low-contrast, or water treatment-related symbols, making it difficult to guarantee the quality of content restoration. Third, existing data cleaning solutions lack semantic constraint rules for the terminology system in the water treatment field, making it impossible to accurately extract and standardize professional data such as equipment codes, units of measurement, industry abbreviations, chemical formulas, and mathematical expressions. Fourth, generalized pre-trained models for text correction are too generalized, easily miscorrecting water treatment terms such as COD and SRT as ordinary words, thus damaging the integrity of professional semantics. Fifth, the processing results of existing solutions have failed to form an iterative and reusable professional corpus and knowledge system, hindering the continuous evolution of intelligent applications.
[0004] Therefore, there is an urgent need for an end-to-end, domain-adaptive file processing and classification method to achieve efficient diversion, accurate identification, professional data cleaning, and knowledge asset accumulation of multi-source heterogeneous files in the water treatment field.
[0005] To address this, a document processing and classification method based on a large language model is proposed for the water treatment field.
Summary of the Invention
[0006] The purpose of this invention is to provide a document processing and classification method for the water treatment field based on a large language model. This method achieves structured and unstructured document separation by discriminating document structure, identifies and performs multi-layer professional cleaning on unstructured content, outputs standardized text by combining a domain-based fine-tuning error correction model, and achieves automatic classification and iterative updates of the classification system based on a semantic alignment mechanism.
[0007] To achieve the above objectives, the present invention provides the following technical solution: A document processing and classification method for the water treatment field based on a large language model includes: By detecting the parsability of the header byte sequence and fields, and using the internal data organization of the file as the criterion, the input file is divided into structured files and unstructured files; the structured files are parsed using a parsing library to extract structured data. After performing automatic skew correction preprocessing on unstructured documents, text, tables, formulas, and legends are simultaneously recognized, and the original recognized text is output. The original recognized text undergoes a five-layer ordered professional cleaning process: the first layer retains the original values of equipment codes and national standard numbers; the second layer corrects OCR character confusion errors in measurement units and verifies unit compliance; the third layer performs standardized matching based on a water treatment abbreviation list; the fourth layer performs double verification of chemical formulas through whitelist matching and legality verification; the fifth layer verifies the structural validity of mathematical expressions; the first, third, and fourth layers of compliance verification are performed on structured data; the two data streams are merged and the cleaned data is output. Construct a seed sentence library, inject noise into the seed sentences to generate a training dataset, fine-tune the pre-trained text correction model in the domain, correct the cleaned data, and output standardized text. The system automatically categorizes standardized text, reviews each categorization result, identifies gaps in the coverage of categorization criteria, and adds new subcategories to iteratively update the categorization criteria.
[0008] Preferably, the steps for dividing the input file into structured and unstructured files are as follows: Read the header feature byte sequence of the input file, and determine the encoding format and storage method of the internal data based on the byte sequence; for file formats determined to potentially have structured content, call the corresponding parsing library to perform lightweight pre-reading of the file content, checking whether field separators exist regularly and whether data columns can be completely parsed; if the pre-reading is successful and the field structure is complete, the file is determined to be a structured file, and the corresponding parsing library is called to completely extract the structured data; if the pre-reading fails, the field structure is missing, or the file content is embedded in image form, the file is determined to be an unstructured file; the lightweight pre-reading does not perform complete data loading.
[0009] Preferably, the original recognized text output process includes: For unstructured files, the page tilt angle of the image content is estimated, and rotation correction is performed based on the estimated angle to obtain a page image with normal orientation. For unstructured documents, the page is checked to see if it contains a text layer that can be directly extracted. If a text layer exists, the text content is extracted directly. If the page content is embedded in the form of an image, the page is converted into an image page by page before being input into the recognition engine. The recognition engine uses a unified detection head to simultaneously detect and extract content from text regions, table structures, mathematical formulas, and legend annotations. During the recognition process, contextual semantic constraints are introduced to perform probability correction on candidate characters. At the same time, the spatial position information of each element on the page is recorded. The recognized content and spatial layout information are merged to output the original recognized text.
[0010] Preferably, the original recognized text is subjected to five layers of ordered professional cleaning, specifically including: The first layer identifies strings in the original text that conform to the device number format, monitoring point number format, and national standard number format through regular expression matching rules, marks the matched strings as protected codes and retains the original values; The second layer identifies OCR character confusion of measurement units caused by similar character shapes in the text outside the protected encoding, including confusion between letters and numbers and confusion between ordinary letters and special measurement characters. After correcting them to the corresponding standard unit characters, they are compared with the preset water treatment measurement unit compliance set to obtain the unit compliance verification text. The third layer precisely matches the text output from the second layer with a water treatment-specific abbreviation list, and performs standard capitalization processing on the abbreviations that match the list to obtain the abbreviation-standardized text. The fourth layer involves first performing a precise match between strings suspected of being chemical formulas and a whitelist of commonly used water treatment agents and substances. If a match is found, the string is retained. For strings that do not match the whitelist, the validity of their element symbols and subscript structures is verified through chemical formula syntax parsing. If valid, the string is retained; otherwise, it is marked for manual review, resulting in a chemical formula verification text. The fifth layer, after excluding plain text strings, judges the validity of expression structure for character sequences containing operators, parentheses, and subscripts by parsing the syntax without performing evaluation. If valid, the complete expression is retained; otherwise, it is marked for manual review. 
After the five-layer cleaning is completed, the cleaned raw recognized text data is output.
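The first two layers can be illustrated with a minimal sketch. The regular expressions, the confusion fixes, and the compliant unit set below are hypothetical examples chosen for illustration; the source does not publish its actual rules. The sketch marks protected codes first and then corrects unit confusions only in the text outside those codes.

```python
import re

# Hypothetical formats for protected codes (layer 1); real rules are not given in the source.
PROTECTED_PATTERNS = [
    re.compile(r"\b[A-Z]{1,3}-\d{2,4}[A-Z]?\b"),     # equipment tag such as P-101A
    re.compile(r"\bGB(?:/T)?\s?\d{3,5}-\d{4}\b"),    # national standard number such as GB 18918-2002
]

# Hypothetical glyph-confusion fixes for measurement units (layer 2).
UNIT_FIXES = [
    (re.compile(r"mg/[1lI](?!\w)"), "mg/L"),   # digit 1 or letter l/I read instead of L
    (re.compile(r"\bm3/h\b"), "m³/h"),          # plain 3 read instead of superscript 3
    (re.compile(r"\buS/cm\b"), "µS/cm"),        # letter u read instead of micro sign
]

# Hypothetical compliance set used after correction.
COMPLIANT_UNITS = {"mg/L", "m³/h", "µS/cm", "NTU"}


def fix_units(segment: str) -> str:
    """Apply every unit-confusion fix to an unprotected text segment."""
    for pattern, replacement in UNIT_FIXES:
        segment = pattern.sub(replacement, segment)
    return segment


def clean_layers_1_2(text: str):
    """Return (cleaned_text, protected_codes); protected codes keep their original value."""
    spans = sorted(m.span() for p in PROTECTED_PATTERNS for m in p.finditer(text))
    pieces, protected, pos = [], [], 0
    for start, end in spans:
        if start < pos:          # skip a span that overlaps one already protected
            continue
        pieces.append(fix_units(text[pos:start]))   # layer 2 on unprotected text only
        pieces.append(text[start:end])              # layer 1: copy through unchanged
        protected.append(text[start:end])
        pos = end
    pieces.append(fix_units(text[pos:]))
    return "".join(pieces), protected


def is_compliant_unit(unit: str) -> bool:
    """Layer 2 compliance check against the preset unit set."""
    return unit in COMPLIANT_UNITS


cleaned, codes = clean_layers_1_2(
    "Pump P-101A per GB 18918-2002: COD 35 mg/1, flow 120 m3/h"
)
```

Running the correction only between protected spans is what keeps a later layer from rewriting an equipment code that happens to contain confusable characters.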
[0011] Preferably, the seed sentence library is constructed as follows: authoritative sources such as domain standards, process manuals, equipment documents, and test reports are structurally extracted and cleaned, and high-quality benchmark sentences covering professional terms, typical sentence structures, and standard descriptions are selected and organized; after deduplication, normalization, and compliance verification, a seed sentence library with accurate semantics and a uniform format is formed.
[0012] Preferably, the acquisition of standardized text specifically includes: based on standardized sentences of high-confidence water treatment professional terms in the seed sentence library, according to the OCR character similarity confusion character pair rules, injecting confusion noise into the sentences in the seed sentence library in a targeted manner, using the sentences after noise injection as the model input and the original seed sentences as the standard output, obtaining the text correction training dataset, and performing domain fine-tuning on the text correction pre-trained model; The cleaned data is input into the domain-fine-tuned text correction pre-trained model. The character sequences in the cleaned data are vectorized. The context representation layer performs bidirectional context modeling and outputs terminology boundary markers. Based on the terminology boundary markers and context representation vectors, OCR obfuscated characters in non-terminology regions are corrected, terminology regions are preserved as a whole, and standardized text is output.
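As a rough illustration of how noisy/clean training pairs might be generated from seed sentences, the following sketch injects glyph-confusion noise. The confusion table, the injection rate, and the function names are assumptions made for this example and are not taken from the source.

```python
import random

# Hypothetical OCR glyph-confusion pairs: digits replaced by similar letters,
# and a special measurement character replaced by an ordinary letter.
CONFUSION = {"0": "O", "1": "l", "5": "S", "µ": "u"}


def inject_noise(sentence: str, rate: float, rng: random.Random) -> str:
    """Replace each confusable character with its look-alike with probability `rate`."""
    return "".join(
        CONFUSION[ch] if ch in CONFUSION and rng.random() < rate else ch
        for ch in sentence
    )


def build_pairs(seed_sentences, rate: float = 0.3, seed: int = 0):
    """Build (noisy input, clean target) pairs; sentences left unchanged are skipped."""
    rng = random.Random(seed)
    pairs = []
    for clean in seed_sentences:
        noisy = inject_noise(clean, rate, rng)
        if noisy != clean:
            pairs.append({"input": noisy, "target": clean})
    return pairs


# With rate=1.0 every confusable character is replaced, which makes the effect easy to see.
demo = build_pairs(["COD 150 mg/L"], rate=1.0)
```

In practice a lower rate would likely be used so that the corrector sees mostly clean context around each injected error.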
[0013] Preferably, the step of reviewing the classification results item by item includes: using the construction project document archiving standard as the initial classification system, constructing a structured classification instruction that includes the classification basis, judgment rules, and classification examples, driving the large language model to automatically classify the standardized text item by item, outputting the classification label and classification confidence description for each text item, and obtaining the classification results; The large language model consists of an input representation layer, multiple self-attention layers, and an output prediction layer; The input representation layer vectorizes the character sequence formed by concatenating the structured classification instruction and the standardized text, and encodes the classification criteria description, the judgment rules of each category, typical classification examples, and the text content to be classified into a unified input vector sequence. The input vector sequence retains the positional boundary information of the classification instruction and the text to be classified. The self-attention layers compute the correlation between each position in the input vector sequence and all other positions to obtain, for each position, a context representation vector that integrates global context information. In this process, the keyword vectors in the text to be classified and the vectors of the corresponding category description in the classification instruction form high correlation weights, which enables the model to semantically align the text content features with the category description and obtain a semantically aligned context representation vector. 
The output prediction layer performs category probability distribution calculation on the semantically aligned context representation vector, takes the category entry corresponding to the highest probability as the classification label, and generates classification confidence description based on the concentration of the probability distribution. It outputs the classification label and classification confidence description for each normalized text to obtain the classification result.
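One way to turn "concentration of the probability distribution" into a confidence description is to compare the entropy of the category distribution with its maximum possible value. The sketch below does this for a dictionary of category scores; the entropy-based measure and the 0.6 / 0.3 thresholds are assumptions for illustration, since the source does not specify how concentration is measured.

```python
import math


def classify(scores: dict):
    """Return (label, confidence_description, probabilities) from raw category scores."""
    # Softmax with the maximum subtracted for numerical stability.
    top = max(scores.values())
    exps = {name: math.exp(value - top) for name, value in scores.items()}
    total = sum(exps.values())
    probs = {name: e / total for name, e in exps.items()}

    # The label is the category with the highest probability.
    label = max(probs, key=probs.get)

    # Concentration: 1 minus normalized entropy (1.0 = fully peaked, 0.0 = uniform).
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    concentration = 1.0 - entropy / max_entropy

    # Hypothetical thresholds for the textual confidence description.
    if concentration >= 0.6:
        description = "high"
    elif concentration >= 0.3:
        description = "medium"
    else:
        description = "low"
    return label, description, probs


label, confidence, _ = classify(
    {"design drawings": 5.0, "test reports": 0.0, "monitoring data": 0.0}
)
```

A low-confidence result is a natural candidate for the item-by-item review step, since it may indicate a gap in the classification criteria.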
[0014] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. This invention detects the byte sequence and parsability of file header features, using the internal data organization of the file as the core criterion to automatically classify input files into structured and unstructured files. A lightweight pre-fetch mechanism is employed to avoid resource waste caused by full loading. For structured files, the parsing library is directly called to extract field data; for unstructured files, skew correction and multi-element synchronous recognition are performed to achieve differentiated processing paths. This invention effectively avoids data loss or duplicate parsing caused by misjudgment, improves the accuracy and efficiency of file processing, and enhances the system's adaptability to complex engineering archives.
[0015] 2. This invention, tailored to the characteristics of text in the water treatment field, designs a five-layer ordered cleaning process. It performs hierarchical verification and standardization of equipment codes, units of measurement, professional abbreviations, chemical formulas, and mathematical expressions, and establishes protection mechanisms at key professional information points to prevent erroneous corrections. Through a dual verification mechanism of whitelist matching and grammatical validity, it can effectively identify structural errors caused by OCR misrecognition while preserving legitimate professional expressions. This invention achieves high-precision error correction and compliance verification without disrupting the structure of professional terminology, significantly improving the professional credibility and archivalability of the output text.
[0016] 3. This invention constructs a structured classification instruction system based on a large language model. Through classification criteria, rules, and examples, semantic alignment is driven, creating a highly correlated weight mapping between the features of the text to be classified and the category description, achieving high-precision automatic classification. Simultaneously, each classification result is reviewed, gaps in the classification criteria are identified, and sub-category entries are dynamically added, forming an iterative update mechanism. High-quality data is then used to update a professional corpus. This invention can continuously optimize classification criteria during practical applications, achieving adaptive expansion and knowledge accumulation of the classification system, significantly improving the intelligence level of water treatment engineering file management.
Attached Figure Description
[0017] Figure 1 is a schematic diagram of the document processing and classification method for the water treatment field based on a large language model provided by the present invention; Figure 2 is a schematic diagram of the five-layer ordered professional cleaning logic provided by the present invention; Figure 3 is a schematic diagram of the automatic classification and standard iteration process based on a large language model provided by the present invention.
Detailed Implementation
[0018] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the invention.
[0019] Example 1: Referring to Figure 1, this invention provides a file processing and classification method for the water treatment field based on a large language model. The technical solution is as follows: by detecting the parsability of the file header feature byte sequence and fields, and using the internal data organization form of the file as the discrimination criterion, the input file is divided into structured files and unstructured files; for structured files, a parsing library is called to extract structured data; After performing automatic skew correction preprocessing on unstructured documents, text, tables, formulas, and legends are simultaneously recognized, and the original recognized text is output. The original recognized text undergoes a five-layer ordered professional cleaning process: the first layer retains the original values of equipment codes and national standard numbers; the second layer corrects OCR character confusion errors in measurement units and verifies unit compliance; the third layer performs standardized matching based on a water treatment abbreviation list; the fourth layer performs double verification of chemical formulas through whitelist matching and legality verification; the fifth layer verifies the structural validity of mathematical expressions; the first, third, and fourth layers of compliance verification are performed on structured data; the two data streams are merged and the cleaned data is output. Construct a seed sentence library, inject noise into the seed sentences to generate a training dataset, fine-tune the pre-trained text correction model in the domain, correct the cleaned data, and output standardized text. The system automatically categorizes standardized text, reviews each categorization result, identifies gaps in the coverage of categorization criteria, and adds new subcategories to iteratively update the categorization criteria.
[0020] Furthermore, the input files are divided into structured files and unstructured files, specifically including: Read the header byte sequence of the input file, determine the encoding format and storage method of the internal data of the file based on the byte sequence, and obtain the preliminary result of the file format judgment; For files whose initial format is determined to be potentially structured, the corresponding parsing library is called to perform a lightweight pre-read of the file header data without loading the complete data. The system checks whether field delimiters appear regularly in each row, whether the number of data columns is consistent, and whether the content of each column field can be completely parsed into machine-readable values or strings to obtain the field parsability judgment result. Based on the field parsability judgment result, the data is split: if the pre-read is successful and the field structure is complete, the file is determined to be a structured file, the corresponding parsing library is called to perform complete data loading, extract all field data, and output structured data; if the pre-read fails, the field structure is missing, or the file content is embedded in the form of an image, the file is determined to be an unstructured file and enters the automatic skew correction preprocessing process.
[0021] Specifically, upon receiving an input file, the system performs an internal data organization analysis to determine the subsequent processing path.
[0022] The first stage involves reading the characteristic byte sequence of the file header. Different file formats follow a fixed encoding standard when stored. Data files using commas or tabs as field delimiters, binary project files starting with a specific magic number, and document format files encapsulated with a specific structure all have distinguishable byte sequences. The system determines the encoding format and storage method of the data within the file based on the byte sequence, obtaining an initial file format determination result.
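The first stage can be sketched as a lookup of well-known leading byte signatures. The signatures below are the standard ones for these formats; the function name, the labels, and the fallback rule for delimited text are assumptions made for this illustration, not the patent's actual rule set.

```python
import codecs

# Standard leading byte signatures (magic numbers) and illustrative labels.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),                          # e.g. .xlsx / .docx packages
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2-container"),   # legacy .xls / .doc files
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]


def sniff_format(header: bytes) -> str:
    """Guess the storage format from the first bytes of a file, ignoring its extension."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    # No binary signature matched: test whether the header decodes as UTF-8 text.
    # An incremental decoder tolerates a multi-byte character cut off at the end.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        text = decoder.decode(header, final=False)
    except UnicodeDecodeError:
        return "unknown-binary"
    if "," in text or "\t" in text:
        return "delimited-text-candidate"   # still needs the second-stage pre-read
    return "plain-text"
```

A typical call would pass only the first few hundred bytes, for example `sniff_format(open(path, "rb").read(512))`, so that the file is never fully loaded at this stage.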
[0023] The second stage involves determining the parsability of a field. For files whose initial format assessment indicates they may contain structured content, the corresponding parsing library is invoked to perform a lightweight pre-read of the file header data, without loading the complete data. The lightweight pre-read checks the following three aspects: whether the field separator appears regularly across rows; whether the number of data columns remains consistent across rows; and whether the content of each column can be completely parsed into machine-readable numerical values or strings. If all three checks pass, the field parsability is determined to be parsable.
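The three pre-read checks can be sketched for delimiter-separated text with the standard `csv` module. The row limit, the minimum of two rows, and the "clean string" test are assumptions for this example; a real implementation would use whichever parsing library matches the detected format.

```python
import csv
import io
from itertools import islice


def preread_is_parsable(stream, delimiter: str = ",", max_rows: int = 20) -> bool:
    """Lightweight pre-read: inspect at most `max_rows` rows without loading the whole file."""
    rows = [row for row in islice(csv.reader(stream, delimiter=delimiter), max_rows) if row]
    if len(rows) < 2:
        return False
    width = len(rows[0])
    # Check 1: the delimiter actually occurs (more than one column).
    if width < 2:
        return False
    # Check 2: every row has the same number of columns.
    if any(len(row) != width for row in rows):
        return False
    # Check 3: every field is a clean machine-readable string
    # (no control characters, no Unicode replacement characters).
    for row in rows:
        for field in row:
            if not field.isprintable() or "\ufffd" in field:
                return False
    return True


structured = io.StringIO("id,ph,cod\nP-101,7.2,35\nP-102,6.9,41\n")
ragged = io.StringIO("id,ph,cod\nP-101,7.2\nP-102,6.9,41\n")
prose = io.StringIO("Influent quality report\nNo table here\n")
```

Because `islice` stops after `max_rows`, the cost of the check stays constant regardless of file size, which is the point of a pre-read that does not perform complete data loading.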
[0024] If the field parsability assessment result is parsable, the file is classified as a structured file. The corresponding parsing library is called to perform full data loading, extract all field data, output structured data, and proceed to the subsequent first, third, and fourth layer compliance verification processes. If pre-reading fails, field structure is missing, or file content is embedded in image form, the file is classified as an unstructured file and proceeds to the automatic skew correction preprocessing process.
[0025] This diversion mechanism uses the internal data organization of a file rather than its extension as the basis for judgment. It can effectively handle files whose extensions do not match their actual content, fundamentally avoid performing redundant recognition operations on structured files, reduce computational resource consumption, and prevent the introduction of recognition errors into standardized field content.
[0026] Furthermore, the original recognized text output process includes: performing page tilt angle estimation on the image content of the unstructured file: when the number of text lines on the page meets the reliable estimation condition, the tilt angle is obtained by analyzing the distribution direction of the text lines on the page; when the number of text lines on the page is insufficient, the tilt angle is obtained by using the page border line or drawing boundary line as a direction reference; when neither of the above two methods is available, zero tilt angle is used as the default value, no rotation correction is performed, and a tilt correction not performed mark is added when the original recognized text is output subsequently; rotation correction is performed on the image based on the tilt angle to obtain a page image with normal orientation; For unstructured documents, the system detects whether the page contains a text layer that can be directly extracted. If an extractable text layer exists, the text content is extracted directly. If the page content is embedded in the form of an image, the page is converted into an image page by page, and the image of the page with the correct orientation is input into the recognition engine. The recognition engine uses a unified detection head to simultaneously detect and extract content from text regions, table structures, mathematical formulas, and legend annotations. During the recognition process, contextual semantic constraints are introduced. When candidate characters have similar shapes and ambiguities, the final classification of the candidate characters is determined by probability weighting based on the semantic coherence of the surrounding already recognized characters, thus obtaining the recognition content of each region. At the same time, the row and column coordinates of each text block, table unit, and formula region on the page are recorded to obtain spatial layout information. 
The recognition content of each region is merged with the spatial layout information to output the original recognized text.
[0027] Specifically, for inputs identified as unstructured files, automatic tilt correction preprocessing is performed before multi-element synchronous recognition is performed to output the original recognized text.
[0028] Page tilt angle estimation is performed on the image content in unstructured files. When the number of text lines on the page meets the reliable estimation criteria, the tilt angle is obtained by analyzing the distribution direction of the text lines. When the number of text lines on the page is insufficient, the tilt angle is obtained using the page border line or drawing boundary line as a direction reference. When neither of the above methods is available, zero tilt angle is used as the default value, no rotation correction is performed, and a tilt correction not performed mark is added to the subsequent output original recognized text for manual verification. Rotation correction is then performed on the image based on the tilt angle to obtain a page image with normal orientation.
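The three-way fallback described above can be sketched as a small decision function. It assumes that per-line angles and an optional border angle have already been measured by some upstream image analysis; the threshold of five lines and the use of the median are illustrative assumptions, since the source does not define the "reliable estimation criteria".

```python
from statistics import median
from typing import Optional, Sequence, Tuple

# Hypothetical minimum number of text lines for a reliable estimate.
MIN_TEXT_LINES = 5


def estimate_tilt(
    text_line_angles: Sequence[float],
    border_angle: Optional[float],
) -> Tuple[float, bool]:
    """Return (tilt_angle_degrees, correction_performed)."""
    # Preferred source: direction of the text lines, when there are enough of them.
    if len(text_line_angles) >= MIN_TEXT_LINES:
        return median(text_line_angles), True
    # Fallback: page border or drawing boundary line as the direction reference.
    if border_angle is not None:
        return border_angle, True
    # Neither available: default to zero and flag the page for manual verification.
    return 0.0, False
```

The boolean in the result corresponds to the "tilt correction not performed" mark that is attached to the original recognized text when it is `False`.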
[0029] For unstructured documents, the system detects whether the page contains directly extractable text layers. If extractable text layers exist, the text content is extracted directly without conversion to an image. If the page content is embedded as an image, the page is converted into an image page by page, and the oriented page images are input into the recognition engine.
[0030] Multi-element synchronous recognition: The recognition engine uses a unified detection head to simultaneously detect and extract content from text regions, table structures, mathematical formulas, and legend annotations. During the recognition process, contextual semantic constraints are introduced: when candidate characters have ambiguity due to similar glyphs, the final classification of the candidate character is determined by probability weighting based on the semantic coherence of surrounding already recognized characters, resulting in the recognized content for each region. Simultaneously, the recognition engine records the row and column coordinates of each text block, table cell, and formula region on the page, obtaining spatial layout information. The recognized content of each region is merged with the spatial layout information to output the original recognized text.
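The probability weighting of ambiguous candidates can be illustrated as a simple product of two scores: how well each candidate matches the glyph, and how coherent it is with the surrounding recognized characters. Both score dictionaries below are made-up inputs; how the recognition engine actually computes them is not described in the source.

```python
def pick_candidate(glyph_probs: dict, context_scores: dict):
    """Combine glyph similarity with context coherence and return (best_char, posterior)."""
    # Candidates without a context score get a tiny weight instead of zero.
    weighted = {
        ch: prob * context_scores.get(ch, 1e-6) for ch, prob in glyph_probs.items()
    }
    total = sum(weighted.values())
    posterior = {ch: w / total for ch, w in weighted.items()}
    return max(posterior, key=posterior.get), posterior


# The glyph alone cannot separate letter O from digit 0.
glyph = {"O": 0.52, "0": 0.48}

# Between "C" and "D" (as in COD) a letter is far more coherent.
in_term, _ = pick_candidate(glyph, {"O": 0.9, "0": 0.1})

# Inside a numeric reading such as "1?0 mg/L" a digit is far more coherent.
in_number, _ = pick_candidate(glyph, {"O": 0.05, "0": 0.95})
```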
[0031] Furthermore, referring to Figure 2, the original recognized text is subjected to five layers of ordered professional cleaning in sequence, the first, third, and fourth layers of compliance verification are performed on the structured data, and the two data streams are merged to output cleaned data. Specifically: The first layer of cleaning is performed on the original recognized text: the original recognized text is scanned with regular expression matching rules, and character sequences that conform to the equipment number format, monitoring point number format, and national standard number format are identified. The matched character sequences are marked as protected codes and their original values are retained. No subsequent modifications are performed on them, resulting in text with protected code markings. The second layer of cleaning is performed on the text with protected code markings: in the text outside the protected codes, confusion of measurement unit characters caused by similar character shapes is identified, and the confused characters are corrected to standard unit characters to obtain the unit-corrected text; The unit strings in the unit-corrected text are compared with the preset water treatment measurement unit compliance set. Compliant strings are retained, and non-compliant strings are marked as pending review, yielding the unit compliance verification text. The third layer of cleaning is performed on the unit compliance verification text: the strings in the text are precisely matched against a water treatment-specific abbreviation list, and the abbreviations that match the list are normalized to standard capitalization to obtain the abbreviation-standardized text; The fourth layer of cleaning is performed on the abbreviation-standardized text: strings containing combinations of uppercase letters and number subscripts, or containing element symbol format characters, are identified as suspected chemical formulas. 
They are first precisely matched with the whitelist of commonly used water treatment agents and substances. If a match is found, the formula is retained. If no match is found, the legality of its element symbols and subscript structure is verified through chemical formula syntax parsing. If the structure is legal, it is retained; otherwise, it is marked as pending verification, thus obtaining the chemical formula verification text. The fifth layer of cleaning is performed on the chemical formula verification text: For character sequences that contain mathematical operators and have been excluded as non-chemical formulas by the fourth layer, the structure validity is verified by parsing the syntax without performing evaluation. If valid, the complete expression is retained; if invalid, it is marked as pending review. For character sequences containing Chinese characters, regardless of whether they contain subscripts or superscripts, they are not sent to the fifth layer of verification and are directly passed through to obtain the original recognized text cleaning result. Perform first, third, and fourth layer compliance checks on structured data: sequentially perform protected code recognition and retention, abbreviation standardization matching, chemical formula whitelist matching and legality verification on the content of each field, without performing the second layer OCR character obfuscation correction, in order to protect the integrity of the original fields of the structured data, and obtain the structured data verification results; The original text cleaning results and structured data verification results are merged at the field level and output as cleaned data.
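The fourth and fifth layers can be sketched as two small validators. The whitelist, the element set, and the formula grammar are deliberately tiny assumptions for illustration. For the fifth layer, Python's own expression grammar is used as a stand-in for a mathematical expression grammar (so implicit multiplication such as `2x` would be rejected); `ast.parse` only builds a syntax tree and never evaluates the expression.

```python
import ast
import re

# Hypothetical whitelist of common agents and a reduced element set.
WHITELIST = {"NaClO", "FeCl3", "Al2(SO4)3", "H2O2", "NaOH"}
ELEMENTS = {"H", "C", "N", "O", "Na", "Mg", "Al", "P", "S", "Cl", "K", "Ca", "Mn", "Fe"}

# One token: element with optional subscript, "(", or ")" with optional subscript.
# Subscripts may not start with 0, so "Na0H" (zero read for O) is rejected.
TOKEN = re.compile(r"([A-Z][a-z]?)((?:[1-9]\d*)?)|(\()|(\))((?:[1-9]\d*)?)")


def check_formula(candidate: str) -> str:
    """Layer 4: whitelist match first, then syntax check. Returns 'keep' or 'review'."""
    if candidate in WHITELIST:
        return "keep"
    depth, pos = 0, 0
    while pos < len(candidate):
        match = TOKEN.match(candidate, pos)
        if match is None:
            return "review"
        if match.group(1) is not None:            # element symbol
            if match.group(1) not in ELEMENTS:
                return "review"
        elif match.group(3) is not None:          # opening parenthesis
            depth += 1
        else:                                     # closing parenthesis
            depth -= 1
            if depth < 0:
                return "review"
        pos = match.end()
    return "keep" if candidate and depth == 0 else "review"


def expression_is_well_formed(expr: str) -> bool:
    """Layer 5: structural validity only; the expression is parsed, never evaluated."""
    sides = expr.split("=")
    if len(sides) > 2:
        return False
    try:
        for side in sides:
            ast.parse(side.strip(), mode="eval")
    except (SyntaxError, ValueError):
        return False
    return True
```

For example, `check_formula("CaCO3")` passes the syntax check even though it is not on the whitelist, while `expression_is_well_formed("V*X/(Qw*Xr")` fails because of the unbalanced parenthesis and would be marked as pending review.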
[0032] Furthermore, a seed sentence library is constructed, noise is injected into the seed sentences according to fixed patterns to generate a training dataset, the text correction pre-trained model is domain fine-tuned, the cleaned data is corrected, and normalized text is output. Specifically, the seed sentence library is constructed as follows: authoritative sources such as domain standards, process manuals, equipment documents, and test reports are structurally extracted and cleaned; high-quality benchmark sentences covering professional terms, typical sentence structures, and standard descriptions are selected and organized; and, after deduplication, normalization, and compliance verification, a seed sentence library with accurate semantics and a uniform format is formed.
[0033] Constructing a seed sentence library: On the first run, sentences that have passed all five layers of cleaning in the current batch of cleaning data and have not generated any pending verification marks and whose semantics are complete are selected to construct an initial seed sentence library; on subsequent runs, sentences containing high-confidence water treatment professional terms and whose semantics are complete are extracted from the historically accumulated professional corpus to construct the current batch of seed sentence library. Constructing the training dataset: For each sentence in the seed sentence library, inject obfuscation noise according to the OCR glyph similarity obfuscation character pair rules, including replacing numeric characters that meet the obfuscation conditions with glyph similar letter characters and replacing special measurement characters with glyph similar ordinary letter characters to obtain the noise-injected sentence; use the noise-injected sentence as the model input sample and the corresponding original seed sentence as the standard output sample to construct the training dataset; The ChineseErrorCorrector-3_4B text correction pre-trained model is adopted. For character positions outside the boundaries of professional terms, the probability distribution of candidate characters is calculated based on the context representation vector, and the character with the highest probability is taken as the correction output. The difference between the standard output sample and the correction output is used as the training loss. The model parameters are updated through backpropagation to obtain the domain-fine-tuned text correction pre-trained model. Error correction of cleaned data: The cleaned data is input into the domain-fine-tuned text correction pre-trained model. The input encoding layer vectorizes the character sequences in the cleaned data. 
The context representation layer performs bidirectional context modeling and outputs terminology boundary markers. The output decoding layer corrects OCR obfuscated characters in non-terminology areas and preserves terminology areas as a whole based on terminology boundary markers and context representation vectors, and outputs standardized text. A quality assessment is performed on the standardized text: if the standardized text does not contain any strings indicating pending review, all unit strings belong to the water treatment unit compliance set, and all chemical formula strings pass whitelist matching or chemical formula syntax validity verification, it is deemed to be of acceptable quality and included in the professional corpus; if any condition is not met, it enters the manual review queue and is not automatically included in the professional corpus; the standardized texts of the current batch that meet the quality requirements are updated to the professional corpus for use in the construction of the next batch of seed sentence databases.
[0034] Specifically, the original identified text is subjected to five layers of ordered professional cleaning in sequence, while the structured data is subjected to the first, third and fourth layers of compliance verification. The two data streams are then merged to output the cleaned data.
[0035] The execution order of the five-layer cleaning is designed based on the character composition rules of water treatment professional data and the identification error propagation mechanism. The processing priority of each layer cannot be arbitrarily changed.
[0036] The first layer retains the equipment code and national standard number. It scans the original identification text using regular expression matching rules, identifying character sequences that conform to the equipment number format (a string composed of letter prefixes, hyphens, and numerical sequences, such as a combination of equipment type abbreviation and sequence number), the monitoring point number format, and the national standard number format (a string composed of standard category code and year number). Matching character sequences are marked as protected codes, their original values are retained, and no further modifications are performed, resulting in text containing the protected code markers. The technical reason for prioritizing this layer is that if the second layer's unit correction is executed before this layer, segments in the encoded string that resemble the character shapes of the measurement unit will be incorrectly corrected, leading to the corruption of the encoded string and causing equipment traceability failure.
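The first-layer marking step can be sketched as follows. This is a minimal illustration only: the three regular expressions are hypothetical shapes for equipment numbers, monitoring point numbers, and national standard numbers, and the function name `split_protected` is not taken from the method itself. Real projects would substitute their own numbering conventions.

```python
import re

# Assumed formats for illustration only; not specified by the source text.
PROTECTED = re.compile(
    r"GB(?:/T)?\s?\d+(?:\.\d+)?-\d{4}"   # national standard number, e.g. GB 18918-2002
    r"|\b[A-Z]{1,4}-\d{2,5}\b"            # equipment number, e.g. P-101
    r"|\bMP-[A-Z]\d{2}\b"                 # monitoring point number, e.g. MP-A01
)

def split_protected(text):
    """Return (segment, is_protected) pairs that cover the text in order."""
    segments, pos = [], 0
    for m in PROTECTED.finditer(text):
        if m.start() > pos:
            segments.append((text[pos:m.start()], False))
        segments.append((m.group(), True))   # original value kept verbatim
        pos = m.end()
    if pos < len(text):
        segments.append((text[pos:], False))
    return segments

if __name__ == "__main__":
    for seg, protected in split_protected("Pump P-101 meets GB 18918-2002 at MP-A01."):
        print(repr(seg), protected)
```

Later layers would then modify only the segments whose flag is `False`, which is what keeps a code such as `P-101` from being rewritten by the unit correction.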
[0037] The second layer performs OCR character obfuscation correction and compliance verification for measurement units. In text outside the protected codes, it identifies measurement unit character obfuscation caused by similar character shapes, including confusion between the letter "o" and the number "0", the lowercase letter "l" and the number "1", and the ordinary letter "u" and the special measurement character "μ". The obfuscated characters are corrected to standard unit characters, yielding the unit-corrected text. The unit strings in the unit-corrected text are compared with a preset water treatment measurement unit compliance set. Compliant units are retained, while non-compliant units are marked as pending review, yielding the unit compliance verification text.
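One possible way to realize the second layer is to try the glyph substitutions on a unit string and accept the first variant found in the compliance set. The sketch below assumes a small illustrative compliance set and a hypothetical helper `correct_unit`; the actual set is the text file described in the method, and the search strategy here is an assumption.

```python
from itertools import product

MU = "\u03bc"  # Greek small letter mu

# Illustrative subset of a compliance set (assumption, not the real list).
COMPLIANT_UNITS = {"mg/L", MU + "g/L", "mol/L", "NTU", "MPa"}

# Misread character -> intended character, following the three confusion pairs.
CONFUSABLE = {"0": "o", "1": "l", "u": MU}

def correct_unit(raw):
    """Return (unit, compliant). Non-compliant units are left for review."""
    if raw in COMPLIANT_UNITS:
        return raw, True
    # Each confusable character may stay as-is or be swapped for its look-alike.
    options = [(ch, CONFUSABLE[ch]) if ch in CONFUSABLE else (ch,) for ch in raw]
    for candidate in product(*options):
        unit = "".join(candidate)
        if unit in COMPLIANT_UNITS:
            return unit, True
    return raw, False  # caller marks this string as pending review

if __name__ == "__main__":
    for raw in ["ug/L", "m01/L", "mg/L", "kg/X"]:
        print(raw, "->", correct_unit(raw))
```

Checking candidates against the compliance set, rather than substituting blindly, means a legitimate "u" or "1" is only changed when the result is a known unit.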
[0038] The third layer: Standardized matching of water treatment industry abbreviations. This involves precisely matching the strings in the unit's compliance verification text against a dedicated list of water treatment abbreviations, which covers commonly used industry abbreviations in water treatment processes and engineering. The abbreviations that match the list are then subjected to standard case normalization processing to obtain the standardized abbreviation text.
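A minimal sketch of the third layer is given below. It assumes that matching ignores letter case and that the list stores each abbreviation in its standard capitalization; the entries and the function name `normalize_abbreviations` are illustrative only.

```python
import re

# Illustrative entries in their standard capitalization (assumption).
ABBREVIATIONS = ["COD", "BOD5", "MBR", "pH", "MLSS"]
CANONICAL = {a.casefold(): a for a in ABBREVIATIONS}

# A token is a letter followed by letters or digits.
TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9]*")

def normalize_abbreviations(text):
    """Rewrite whole tokens that match the list to their standard form."""
    def fix(match):
        token = match.group()
        return CANONICAL.get(token.casefold(), token)
    return TOKEN.sub(fix, text)

if __name__ == "__main__":
    print(normalize_abbreviations("cod and Ph of the mbr tank"))
```

Only whole tokens are compared, so a word that merely contains an abbreviation is left untouched.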
[0039] The fourth layer employs a dual verification process: chemical formula whitelist matching and validity verification. Strings containing combinations of uppercase letters and numerical subscripts or element symbol format characters in the standardized abbreviation text are identified as suspected chemical formulas. These are first precisely matched against a whitelist of commonly used water treatment agents and substances; if a match is found, the formula is retained. Suspected chemical formula strings that do not match the whitelist are then verified for the validity of their element symbols and subscript structures through chemical formula syntax parsing. Valid formulas are retained; invalid formulas are marked as pending verification, resulting in the chemical formula verification text.
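The dual check of the fourth layer might look like the following sketch. The whitelist and the element table are small illustrative subsets, the status strings are hypothetical, and the grammar is deliberately simple (element symbols, optional counts, balanced parentheses); it is not a full chemical formula parser.

```python
import re

# Illustrative subsets (assumptions): a real deployment would load full lists.
WHITELIST = {"NaClO", "FeCl3", "Al2(SO4)3", "H2O2"}
ELEMENTS = {"H", "C", "N", "O", "Na", "Mg", "Al", "Si", "P", "S",
            "Cl", "K", "Ca", "Mn", "Fe", "Cu", "Zn"}

# Element with optional count, an opening parenthesis, or a closing one with count.
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")

def is_valid_formula(s):
    """Check element symbols, subscripts, and parenthesis balance."""
    pos, depth, atoms = 0, 0, 0
    while pos < len(s):
        m = TOKEN.match(s, pos)
        if not m:
            return False
        if m.group(1):
            if m.group(1) not in ELEMENTS or m.group(2).startswith("0"):
                return False
            atoms += 1
        elif m.group(3):
            depth += 1
        else:
            depth -= 1
            if depth < 0:
                return False
        pos = m.end()
    return depth == 0 and atoms > 0

def check_formula(s):
    """Whitelist first, then syntax parsing, as in the fourth layer."""
    if s in WHITELIST or is_valid_formula(s):
        return "retained"
    return "pending_verification"

if __name__ == "__main__":
    for f in ["Al2(SO4)3", "KMnO4", "Xy2O", "Fe(OH3"]:
        print(f, check_formula(f))
```

The whitelist lookup comes first because it is cheap and covers the agents seen most often; the parser only handles what the whitelist misses.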
[0040] The fifth layer: Mathematical expression structure validity verification. For character sequences in the chemical formula verification text that simultaneously meet the following two conditions, mathematical expression verification is performed: first, the character sequence contains mathematical operators; second, the character sequence has been excluded as a non-chemical formula by the fourth layer. Character sequences containing Chinese characters, regardless of whether they contain subscripts or superscripts, are not sent to this layer for verification and are directly passed through. For character sequences that meet the conditions, structure validity verification is performed by parsing the syntax without performing evaluation. If valid, the complete expression is retained; if invalid, it is marked as pending review, yielding the original cleaned text result.
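Structure validation without evaluation can be illustrated with Python's `ast` module, which parses an expression into a tree but never executes it. Using Python's own expression grammar as a stand-in for the expression grammar of engineering documents is an assumption of this sketch, as is the choice of permitted node types.

```python
import ast

# Node types allowed in a plain arithmetic expression (assumed policy).
ALLOWED = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Name,
           ast.operator, ast.unaryop, ast.expr_context)

def is_valid_expression(expr):
    """Parse only; the expression is never evaluated."""
    try:
        tree = ast.parse(expr.strip(), mode="eval")
    except (SyntaxError, ValueError):
        return False
    # Reject anything beyond names, constants, and arithmetic operators.
    return all(isinstance(node, ALLOWED) for node in ast.walk(tree))

if __name__ == "__main__":
    for e in ["(Q1 + Q2) / 2", "3 * (a + b", "__import__('os')"]:
        print(e, is_valid_expression(e))
```

Because only the syntax tree is inspected, a malformed or hostile string costs a parse and nothing more.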
[0041] The default method for constructing the water treatment metrology unit compliance set is as follows: It includes metrology unit identifiers that have formed industry consensus in water treatment engineering design specifications, water quality testing standards, and process operation records. It covers the standard symbol expressions of metrology units such as mass concentration, turbidity, temperature, flow rate, and pressure. It is stored in the form of a text file, which allows technical personnel to add new entries before processing new types of engineering documents. The updated compliance set will take effect in the next batch of processing.
[0042] During the initial run, the historically accumulated professional corpus has not yet been established. At this time, sentences that have passed all five layers of cleaning in the current batch of cleaned data and have no pending verification tags, and whose semantics are complete, are selected to construct an initial seed sentence library. After the initial fine-tuning is completed, the initial seed sentence library is incorporated into the professional corpus, and subsequent batches extract seed sentences from the historically accumulated professional corpus.
[0043] Differential processing is performed on structured data: only the first, third, and fourth layers of compliance checks are performed, and the second layer of OCR character obfuscation correction is not performed. The reason for not performing the second layer is that the structured data is obtained by directly reading machine-readable fields through a standard parsing library, and there is no OCR character obfuscation problem. Performing the second layer of correction may misidentify legitimate characters in the fields as obfuscated errors and replace them, thus compromising the integrity of the original data fields.
[0044] The structured data undergoes a series of steps: first-level protected code identification and retention, third-level abbreviation standardization matching, and fourth-level chemical formula whitelist matching and legality verification, to obtain the structured data verification result.
[0045] The original text cleaning results and structured data verification results are merged at the field level and output as cleaned data, which then enters the subsequent domain fine-tuning and error correction process.
[0046] Furthermore, the seed sentence library is constructed as follows: authoritative sources such as domain standards, process manuals, equipment documents, and test reports are structurally extracted and cleaned; high-quality benchmark sentences covering professional terms, typical sentence structures, and standard descriptions are selected and organized; and, after deduplication, normalization, and compliance verification, a seed sentence library with accurate semantics and a uniform format is formed.
[0047] Furthermore, obtaining the standardized text specifically includes: standardizing sentences based on the high-confidence water treatment professional terms in the seed sentence library; injecting obfuscation noise into the sentences in the seed sentence library according to the OCR glyph similarity obfuscation character pair rules; using the noise-injected sentences as the model input and the original seed sentences as the standard output to obtain the text correction training dataset; and performing domain fine-tuning on the text correction pre-trained model. The cleaned data is then input into the domain-fine-tuned text correction pre-trained model: the character sequences in the cleaned data are vectorized; the context representation layer performs bidirectional context modeling and outputs terminology boundary markers; and, based on the terminology boundary markers and context representation vectors, OCR obfuscated characters in non-terminology regions are corrected, terminology regions are preserved as a whole, and the standardized text is output.
[0048] Specifically, during non-first runs, sentences that simultaneously meet the following conditions are extracted from the historically accumulated professional corpus: the sentences contain high-confidence water treatment professional terms that have passed five layers of cleaning and verification; the sentences are semantically complete and not truncated fragments. The selected sentences are then compiled to form the seed sentence library for the current batch.
[0049] For each sentence in the seed sentence library, obfuscation noise is injected according to the OCR glyph similarity obfuscation character pair rules: numerical characters that meet the obfuscation conditions are replaced with glyph-similar letter characters, and special measurement characters are replaced with glyph-similar ordinary letter characters. Specifically, this includes obfuscation between the letter "o" and the number "0", between the lowercase letter "l" and the number "1", between the ordinary letter "u" and the special measurement character "μ", and other glyph obfuscation types that frequently occur in water treatment engineering document recognition, resulting in an injected noise sentence. The injected noise sentences are used as input samples for the model, and the corresponding original seed sentences are used as standard output samples to construct a training dataset. This construction method ensures that the error distribution of the training data matches the systematic character obfuscation patterns generated by water treatment engineering document recognition.
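The construction of training pairs can be sketched as below. The substitution probability `p`, the random seed, and the function names are hypothetical parameters introduced for illustration; the substitution table follows the three confusion pairs named above.

```python
import random

MU = "\u03bc"  # Greek small letter mu

# Clean character -> glyph-similar character that recognition tends to produce.
NOISE_MAP = {"0": "o", "1": "l", MU: "u"}

def inject_noise(sentence, p, rng):
    """Replace each eligible character with probability p."""
    return "".join(
        NOISE_MAP[ch] if ch in NOISE_MAP and rng.random() < p else ch
        for ch in sentence
    )

def build_pairs(seeds, p=0.3, seed=0):
    """Return (noisy input, clean target) pairs for fine-tuning."""
    rng = random.Random(seed)
    return [(inject_noise(s, p, rng), s) for s in seeds]

if __name__ == "__main__":
    for noisy, clean in build_pairs(["Dose 0.1 " + MU + "g/L of PAC"], p=1.0):
        print(noisy, "<-", clean)
```

Since every substitution is one character for one character, the noisy sentence and its seed stay aligned position by position, which is convenient for character-level supervision.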
[0050] The text correction pre-training model vectorizes the character sequence in the noise-injected sentence, mapping each character to a fixed-dimensional numerical vector, resulting in a character vector sequence. This character vector sequence carries the initial semantic representation of each character in the noise-injected sentence.
[0051] Bidirectional context modeling is performed on the character vector sequence: forward modeling passes context information sequentially from the first character to the last character of the sentence, while backward modeling passes context information sequentially from the last character to the first character. The forward and backward context information are fused at each character position to obtain a context representation vector for each character position that incorporates both the preceding and following context information. The context representation layer also performs overall identification of technical term boundaries in the sentence. By performing sequence labeling on the context representation vectors of consecutive character positions, it outputs technical term boundary markers for each character position. These markers are used by the subsequent output decoding layer to distinguish between terminology regions that need protection and non-terminology regions that can be corrected.
[0052] Based on the context representation vector and terminology boundary markers, the following processing is performed: consecutive character positions marked as within the terminology range are preserved as a whole, without performing position-by-position independent correction; for character positions outside the terminology boundaries, the probability distribution of candidate characters is calculated based on the context representation vector, and the character with the highest probability is taken as the corrected output. The difference between the standard output sample and the corrected output is used as the training loss, and the model parameters of the input encoding layer, context representation layer, and output decoding layer are updated through backpropagation. This allows the model to learn the ability to correct OCR confused characters and the ability to protect the indivisibility of water treatment terminology, resulting in a domain-fine-tuned pre-trained text correction model.
[0053] The cleaned data is input into a domain-fine-tuned pre-trained text correction model: the input encoding layer vectorizes the character sequences in the cleaned data to obtain character vector sequences; the context representation layer performs bidirectional context modeling on the character vector sequences to obtain context representation vectors and outputs terminology boundary markers; the output decoding layer preserves the overall character positions within the terminology range based on the terminology boundary markers, and corrects the character positions in non-terminology regions by selecting the character with the highest probability based on the context representation vector, outputting normalized text.
[0054] The standardized text undergoes a quality assessment based on the following conditions: the standardized text does not contain any strings indicating a pending review status; all unit of measurement strings belong to the water treatment unit of measurement compliance set; and all chemical formula strings pass whitelist matching or chemical formula grammatical validity verification. If all conditions are met, the text is deemed quality-qualified and included in the professional corpus; if any condition is not met, it enters the manual review queue and is not automatically included in the professional corpus. The standardized texts that pass quality in the current batch are updated in the professional corpus for use in building the seed sentence database for the next batch, forming an orderly corpus accumulation mechanism between batches.
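The three-condition quality gate can be written compactly. In this sketch the pending marker string, the argument layout, and the return values are assumptions; the formula check is passed in as a callable so that any whitelist or syntax validator can be plugged in.

```python
PENDING_MARK = "[PENDING]"  # hypothetical marker left by the cleaning layers

def quality_gate(text, units, formulas, compliant_units, formula_ok):
    """Route a standardized text to the corpus or to manual review."""
    passed = (
        PENDING_MARK not in text
        and all(u in compliant_units for u in units)
        and all(formula_ok(f) for f in formulas)
    )
    return "professional_corpus" if passed else "manual_review"

if __name__ == "__main__":
    units_ok = {"mg/L", "NTU"}
    known = {"NaClO", "FeCl3"}
    print(quality_gate("Dose NaClO at 5 mg/L", ["mg/L"], ["NaClO"],
                       units_ok, lambda f: f in known))
```

Any single failed condition sends the text to review, so nothing enters the corpus on partial evidence.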
[0055] Furthermore, automatic classification is performed on the standardized text (see Figure 3), the classification results are reviewed item by item to identify gaps in the coverage of the classification standard, and the classification standard is iteratively updated by adding new sub-category entries. Specifically: Performing automatic classification: using the construction project document archiving standard as the initial classification system, a structured classification instruction is constructed, which includes a description of the classification basis, judgment rules for each category, and typical classification examples. The structured classification instruction and the standardized text are input into the large language model, and the classification label and classification confidence description corresponding to each standardized text are output to obtain the classification result. Establishing a classification evaluation benchmark: during the initial deployment, professionals extract representative samples from historical files for each category and manually label them with classification tags, ensuring that each category entry is covered by representative samples. These samples are stored in a structured form as the reference benchmark. Each time a new sub-category entry is added to the classification standard, manually labeled samples of the corresponding sub-category are added to the reference benchmark. Reviewing the classification results item by item: an independent classification assessment instruction is constructed, whose content is independent of the classification instruction; the reference benchmark and classification results are input into the large language model, and the classification results are reviewed item by item according to the classification assessment instruction. 
Error samples with mismatch between classification labels and text content are identified, and the causes of errors are analyzed, including missing classification standard items, ambiguous definition of document type boundaries, and no corresponding items for document types specific to the water treatment industry in the current specifications. A list of error samples and an analysis of the causes of errors are obtained; the classification accuracy rate of the current round is calculated based on the proportion of the number of classification results that match the labels of the reference benchmark to the total number of reviews, and together with the list of error samples and the analysis of the causes of errors, a structured assessment report is output. Iterative update of classification standards: Based on the gaps in the classification standards identified in the structured assessment report, corresponding sub-category entries are added to the existing classification system. After updating the structured classification instructions, a new round of automatic classification and item-by-item review is triggered. The cycle is iterated until the classification accuracy reaches the preset threshold. Synchronous updates to the professional corpus: After the iteration stabilizes, the classification data that has been confirmed to meet the quality requirements through classification evaluation will be synchronously updated to the professional corpus for continuous iterative training of the subsequent text correction pre-training model.
[0056] Furthermore, the large language model includes an input representation layer, multiple self-attention layers, and an output prediction layer; The input representation layer vectorizes the character sequence concatenated from the structured classification instructions and the normalized text. It encodes the classification criteria description, the judgment rules for each category, typical classification examples, and the text content to be classified into a unified input vector sequence. The input vector sequence retains the positional boundary information of the classification instructions and the text to be classified, so that subsequent layers can distinguish between the instruction area and the content area. The multi-layer self-attention layer performs correlation calculation on each position in the input vector sequence with all other positions to obtain a context representation vector that integrates global context information for each position. In this process, the keyword vector in the text to be classified and the vector of the corresponding category description in the classification instruction form a high correlation weight, which enables the model to semantically align the text content features with the classification category description to obtain a semantically aligned context representation vector. The output prediction layer performs category probability distribution calculation on the semantically aligned context representation vector, takes the category entry corresponding to the highest probability as the classification label, and generates classification confidence description based on the concentration of the probability distribution. It outputs the classification label and classification confidence description corresponding to each normalized text as input data for the subsequent review process.
[0057] Specifically, the construction project document archiving standard is used as the initial classification system to construct a structured classification instruction that includes a description of the classification basis, judgment rules for each category, and typical classification examples.
[0058] The large language model consists of an input representation layer, multiple self-attention layers, and an output prediction layer.
[0059] The input representation layer vectorizes the character sequence resulting from the concatenation of structured classification instructions and normalized text. It encodes the classification criteria description, category judgment rules, typical classification examples, and the text content to be classified into a unified input vector sequence. This input vector sequence retains the positional boundary information of the classification instructions and the text to be classified, enabling subsequent layers to distinguish between the instruction region and the content region. This ensures that the model clearly defines the source of the classification rules and the scope of the content to be classified during semantic alignment.
[0060] The multi-layered self-attention layer performs relevance calculations on each position in the input vector sequence with all other positions: it calculates the relevance weight between the vector of each character position in the text to be classified and all other position vectors. Positions with higher weights contribute more to the context representation of the current position, while positions with lower weights contribute less. After multi-layered superposition calculations, a high relevance weight is formed between the keyword vector in the text to be classified and the vector of the corresponding category description in the classification instruction. This enables the model to semantically align the content features of the text to be classified with the classification category description, resulting in a semantically aligned context representation vector.
[0061] The output prediction layer performs category probability distribution calculation on the semantically aligned context representation vector. It calculates the probability of each category being selected for all category entries in the current classification system and takes the category entry with the highest probability as the classification label. At the same time, it generates a classification confidence statement based on the concentration of the probability distribution. The confidence is high if the probability distribution is concentrated in a single category, and low if the probability distribution is dispersed across multiple categories. It outputs the classification label and classification confidence statement for each normalized text, and obtains the classification result, which serves as the input data for the subsequent item-by-item review process.
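How a confidence description could be derived from the concentration of a probability distribution is sketched below. The use of normalized entropy as the concentration measure and the 0.5 threshold are assumptions of this illustration; the text above only states that a concentrated distribution means high confidence and a dispersed one means low confidence.

```python
import math

def classify(scores, threshold=0.5):
    """scores: category -> raw score. Returns (label, probability, confidence)."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}  # stable softmax
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}
    label = max(probs, key=probs.get)
    # Normalized entropy: 0 when all mass is on one category, 1 when uniform.
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    spread = entropy / math.log(len(probs)) if len(probs) > 1 else 0.0
    confidence = "high" if 1.0 - spread >= threshold else "low"
    return label, probs[label], confidence

if __name__ == "__main__":
    print(classify({"contract": 10.0, "drawing": 0.0, "report": 0.0}))
    print(classify({"contract": 0.1, "drawing": 0.0, "report": 0.0}))
```

Low-confidence items are natural candidates for the item-by-item review, since a dispersed distribution often signals a category that the current standard does not cover well.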
[0062] During the initial deployment, professionals specializing in water treatment engineering document archiving extract representative samples from historical files for each category. Each sample is manually labeled with its corresponding category tag, ensuring that the reference benchmark has representative samples covering all entries in the current classification system. The reference benchmark is stored in a structured format containing both the original text content and the corresponding manually labeled category tag. Each time a new sub-category entry is added to the classification standard, manually labeled samples for that sub-category are added to the reference benchmark, so that accuracy calculations remain valid across all classification dimensions.
[0063] An independent classification assessment instruction is constructed, which is independent of the classification instruction in content to avoid the influence of systematic bias in the classification module on the assessment conclusion. The reference benchmark and classification results are input into the large language model, and the classification results are reviewed item by item according to the classification assessment instruction: For each standardized text, its classification label is compared with the manually labeled label of the corresponding category in the reference benchmark to identify erroneous samples where the classification label does not match the text content; the causes of errors are analyzed for erroneous samples, including missing classification standard entries, ambiguous definition of document type boundaries, and the absence of corresponding entries in current regulations for document types specific to the water treatment industry; the classification accuracy rate for the current round is calculated based on the proportion of the number of classification results matching the reference benchmark labels to the total number of reviews; the list of erroneous samples, the analysis of the causes of errors, and the classification accuracy rate are output together as a structured assessment report.
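The accuracy figure and the error list in the assessment report follow directly from comparing predicted labels with benchmark labels. The sketch below assumes both are keyed by a document identifier and uses a hypothetical report layout; cause analysis, which the method delegates to the large language model, is not reproduced here.

```python
def review(results, benchmark):
    """results, benchmark: document id -> category label."""
    reviewed = [doc for doc in results if doc in benchmark]
    errors = [
        {"doc": doc, "predicted": results[doc], "expected": benchmark[doc]}
        for doc in reviewed
        if results[doc] != benchmark[doc]
    ]
    # Accuracy = matching results / total reviewed.
    accuracy = (len(reviewed) - len(errors)) / len(reviewed) if reviewed else 0.0
    return {"accuracy": accuracy, "errors": errors}

if __name__ == "__main__":
    predicted = {"d1": "contract", "d2": "drawing", "d3": "contract"}
    labeled = {"d1": "contract", "d2": "drawing", "d3": "procurement contract"}
    print(review(predicted, labeled))
```

An error such as `d3` above, where the expected label has no counterpart in the predicted label set, is the kind of sample that points to a missing sub-category entry.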
[0064] Based on the classification gaps identified in the structured assessment report, corresponding sub-category entries are added to the existing classification system. For example, if the assessment report shows that water treatment engineering-specific contract documents such as equipment and material procurement contracts, supply contracts, and testing and inspection contracts exist under the bidding contract documents category and cannot be covered by the current standard entries, then the aforementioned sub-category entries are added under the corresponding category. After updating the structured classification instructions, a new round of automatic classification and item-by-item review is triggered, iterating until the classification accuracy reaches a preset threshold.
[0065] After the iteration stabilizes, the classification data that has been confirmed to meet the quality requirements through classification evaluation will be synchronously updated to a professional corpus for continuous iterative training of the subsequent text correction pre-training model, forming a complete data processing closed loop from file processing and classification optimization to model iteration.
[0066] Embodiment 2: This embodiment takes as input a batch of engineering documents from a wastewater treatment plant upgrade project, including scanned PDF files of the process upgrade plan, XLS files of the equipment ledger, and TXT files of the operating parameter records. The document diversion stage, the identification stage, and the five-layer cleaning and compliance verification stage are the same as in Embodiment 1 and cover both structured and unstructured documents. This embodiment is used to verify the complete execution of three mechanisms: construction of the three-level professional terminology dictionary, fine-tuning enhanced by prior boundary markers, and field confidence scoring with conflict resolution.
[0067] After outputting the cleaned data, the method further includes constructing a three-level terminology dictionary for the cleaned data, specifically including: Construct the first-level basic terminology dictionary: For all strings in the cleaned data that have undergone five-level cleaning and are output in a retained state, count the document frequency of each string in all documents of the cleaned data, and include the strings whose document frequency is not lower than a preset frequency threshold into the basic terminology dictionary entries to obtain the basic terminology dictionary; Constructing a second-level strong terminology dictionary: Sentences from the cleaned data and entries from the basic terminology dictionary are input into the named entity recognition model. The named entity recognition model outputs entity boundary markers for each character position and identifies compound terms composed of multiple consecutive combinations of entries from the basic terminology dictionary. For compound terms whose character length exceeds the character length of a single entry in the basic terminology dictionary, they are included as independent entries, thus obtaining a strong terminology dictionary. 
Constructing a third-level water treatment professional corpus: all entries of the basic terminology dictionary and the strong terminology dictionary are merged; term pairs that co-occur in the same sentence with a frequency exceeding a preset co-occurrence threshold are clustered; and the set of terms in each cluster is stored as a semantic category, yielding the water treatment professional corpus. The entries in the basic terminology dictionary and the strong terminology dictionary are added to the water treatment-specific abbreviation list used for the third-layer cleaning; the entries in the strong terminology dictionary are provided to the sequence labeling head of the context representation layer as a prior reference for professional term boundaries; and the water treatment professional corpus is synchronously updated to the professional corpus for use in the construction of seed sentence libraries in subsequent batches.
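The first-level and third-level steps can be sketched with simple counting. In this illustration documents and sentences are assumed to be already reduced to lists of retained terms, clustering is approximated by connected components over the thresholded co-occurrence graph (the text does not name a clustering algorithm), and the function names are hypothetical. The second-level step depends on a named entity recognition model and is not reproduced.

```python
from collections import Counter
from itertools import combinations

def basic_dictionary(docs, min_df):
    """Keep terms whose document frequency is not lower than min_df."""
    df = Counter()
    for terms in docs:
        df.update(set(terms))          # count each term once per document
    return {t for t, n in df.items() if n >= min_df}

def cluster_terms(sentences, min_cooc):
    """Group terms linked by co-occurrence counts exceeding min_cooc."""
    cooc = Counter()
    for terms in sentences:
        for a, b in combinations(sorted(set(terms)), 2):
            cooc[(a, b)] += 1
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), n in cooc.items():
        if n > min_cooc:
            parent[find(a)] = find(b)
    clusters = {}
    for t in list(parent):
        clusters.setdefault(find(t), set()).add(t)
    return sorted(sorted(c) for c in clusters.values())

if __name__ == "__main__":
    sents = [["COD", "BOD5"], ["COD", "BOD5"], ["MBR", "MLSS"],
             ["MBR", "MLSS"], ["COD", "MBR"]]
    print(cluster_terms(sents, 1))
```

Each resulting cluster corresponds to one semantic category in the water treatment professional corpus.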
[0068] Furthermore, providing the entries of the strong terminology dictionary to the sequence labeling head as a prior reference for term boundaries specifically includes: Before domain fine-tuning begins, each entry of the strong terminology dictionary is converted into a character-level boundary marker sequence according to the position of the term in the original seed sentence: the first character position of the entry is marked as term start, the last character position as term end, the intermediate character positions as term interior, and all character positions outside any entry as non-term, thereby obtaining the prior boundary marker sequence. The training objective of the sequence labeling head is set as follows: given a noise-injected sentence as input, the head is to output a boundary marker sequence consistent with the term positions of the original seed sentence, with the prior boundary marker sequence serving as the supervision signal. When the training loss of the sequence labeling head is calculated, the labeling loss weight of each character position marked as term start, term interior, or term end in the prior boundary marker sequence is set higher than that of non-term positions, so that the sequence labeling head preferentially fits the known term boundary patterns; this yields a domain-fine-tuned pre-trained text correction model with prior boundary enhancement. When error correction is performed on the cleaned data, the sequence labeling head directly outputs the term boundary labels at positions where the character sequence matches an entry of the strong terminology dictionary. 
For the unmatched positions, dynamic boundary inference is performed based on the context representation vector to output complete terminology boundary sequence labeling results, which are then used by the output decoding layer to distinguish between protected areas and error-corrected areas.
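The prior boundary marker sequence and the position-weighted labeling loss can be sketched as follows. This is a minimal illustration under stated assumptions: the single-letter labels B, I, E, and O, the weight values, and the representation of model output as per-position probability dictionaries are all hypothetical and are not taken from the description.

```python
import math


def boundary_labels(sentence, entries):
    """Mark each character as B (term start), I (interior), E (term end), or O (non-term)."""
    labels = ["O"] * len(sentence)
    # Longer entries are placed first so a compound term wins over its parts.
    for entry in sorted(entries, key=len, reverse=True):
        if len(entry) < 2:
            continue  # a one-character entry cannot carry distinct start and end marks
        start = sentence.find(entry)
        while start != -1:
            end = start + len(entry)
            if all(label == "O" for label in labels[start:end]):
                labels[start] = "B"
                labels[end - 1] = "E"
                for i in range(start + 1, end - 1):
                    labels[i] = "I"
            start = sentence.find(entry, start + 1)
    return labels


def weighted_labeling_loss(probs, gold, term_weight=2.0, other_weight=1.0):
    """Weighted negative log-likelihood in which B/I/E positions count more than O.

    `probs` holds, per character position, a dict mapping label to predicted
    probability; the weight values are illustrative assumptions.
    """
    total, weight_sum = 0.0, 0.0
    for position_probs, label in zip(probs, gold):
        weight = other_weight if label == "O" else term_weight
        total += -weight * math.log(position_probs[label])
        weight_sum += weight
    return total / weight_sum
```

With `term_weight` greater than `other_weight`, a prediction error at a term position contributes more to the loss than the same error at a non-term position, which is the behavior the paragraph above describes.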
[0069] Furthermore, merging the two data streams to output the cleaned data further includes performing confidence scoring and conflict resolution on fields that carry the same identifier in both streams, specifically including: Calculating field confidence: for each field in the cleaning result of the original recognized text, if the field belongs to the processing target type of at least one of the first to fifth layers, the OCR path confidence is calculated as the ratio of the number of layers that the field passed in the retained state to the total number of cleaning layers applicable to the field; if the field does not belong to the processing target type of any layer, a text integrity check is performed on the field, and the OCR path confidence is recorded as the full score if the check passes and as zero if it fails, thereby obtaining the OCR path confidence value of each field. For each field in the structured data verification result, the direct parsing path confidence is calculated by the same rules from the compliance verification status of the first, third, and fourth layers, thereby obtaining the direct parsing path confidence value of each field. Performing conflict resolution: for a field whose identifier appears in both data paths, the confidence values of the two paths are compared and the field content of the path with the higher confidence value is retained; when the two confidence values are equal, the field content of the direct parsing path is retained preferentially; the retention decision is recorded as field-level conflict resolution metadata comprising the field identifier, the retained path marker, and both confidence values. Outputting cleaned data with metadata: the field content is merged with the field-level conflict resolution metadata and output as cleaned data with metadata; during subsequent seed sentence library construction, fields that are marked as OCR path and carry a conflict resolution record are retained. 
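The confidence ratio and the tie-breaking rule can be expressed compactly. The sketch below assumes that the full score equals 1.0 and that each field arrives as a (content, confidence) pair; the function names and metadata keys are hypothetical.

```python
def path_confidence(layer_passed, integrity_ok=True):
    """Confidence of one field on one path.

    `layer_passed` maps each applicable layer number to True when the field
    passed that layer in the retained state. An empty mapping means that no
    layer applies, so the text integrity check decides the score.
    Assumption: the full score is 1.0.
    """
    if not layer_passed:
        return 1.0 if integrity_ok else 0.0
    return sum(layer_passed.values()) / len(layer_passed)


def resolve_conflict(field_id, ocr, direct):
    """Keep the higher-confidence content; ties go to the direct parsing path.

    `ocr` and `direct` are (content, confidence) tuples for the same field.
    """
    ocr_content, ocr_conf = ocr
    direct_content, direct_conf = direct
    keep_ocr = ocr_conf > direct_conf  # strict comparison, so ties favor direct parsing
    metadata = {
        "field_id": field_id,
        "retained_path": "ocr" if keep_ocr else "direct",
        "ocr_confidence": ocr_conf,
        "direct_confidence": direct_conf,
    }
    return (ocr_content if keep_ocr else direct_content), metadata
```

For example, a field that is subject to four layers and is retained by three of them receives an OCR path confidence of 0.75.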
Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for document processing and classification in the field of water treatment based on a large language model, characterized in that it comprises: dividing input files into structured files and unstructured files by detecting the header byte sequence and the parsability of fields, with the internal data organization of the file as the criterion, and parsing the structured files with a parsing library to extract structured data; performing automatic skew correction preprocessing on the unstructured files, then recognizing text, tables, formulas, and legends simultaneously and outputting the original recognized text; subjecting the original recognized text to five ordered layers of professional cleaning, wherein the first layer retains the original values of equipment codes and national standard numbers, the second layer corrects OCR character confusion errors in measurement units and verifies unit compliance, the third layer performs standardized matching against a water treatment abbreviation list, the fourth layer performs double verification of chemical formulas by whitelist matching and legality checking, and the fifth layer verifies the validity of mathematical expression structures; performing the first-, third-, and fourth-layer compliance checks on the structured data, and merging the two data streams to output cleaned data; constructing a seed sentence library, injecting noise into the seed sentences to generate a training dataset, performing domain fine-tuning on a pre-trained text correction model, correcting the cleaned data with the fine-tuned model, and outputting standardized text; and automatically classifying the standardized text, reviewing each classification result, identifying gaps in the coverage of the classification criteria, and adding new subcategories to iteratively update the classification criteria.
2. The method for document processing and classification in the field of water treatment based on a large language model according to claim 1, characterized in that: Dividing the input files into structured files and unstructured files includes: reading the header byte sequence of the input file and determining the encoding format and storage method of the internal data from the byte sequence; for file formats determined to potentially contain structured content, calling the corresponding parsing library to perform a lightweight pre-read of the file content, checking whether field delimiters occur regularly and whether the data columns can be parsed completely; if the pre-read succeeds and the field structure is complete, determining that the file is a structured file and calling the corresponding parsing library to extract the structured data in full; if the pre-read fails, the field structure is missing, or the file content is embedded as images, determining that the file is an unstructured file; wherein the lightweight pre-read does not load the complete data.
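The header check and the lightweight pre-read can be sketched for one simple case, delimited text versus image-like formats. The signature table, the 4096-byte window, the 20-row limit, and the function names are assumptions chosen for illustration; spreadsheet containers and other structured formats would need their own parsing libraries and are omitted here.

```python
import csv
import io

# Well-known leading byte signatures of formats whose content is image-like.
IMAGE_LIKE_SIGNATURES = {
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"\xff\xd8\xff": "jpeg",
}


def sniff_header(data):
    """Return a format name from the leading bytes, or 'text-candidate'."""
    for signature, kind in IMAGE_LIKE_SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return "text-candidate"


def preread_delimited(data, delimiter=",", limit=4096, max_rows=20):
    """Lightweight pre-read: parse only the first rows and compare column counts."""
    head = data[:limit]
    if len(data) > limit:
        head = head[: head.rfind(b"\n") + 1]  # drop a possibly truncated last row
    try:
        text = head.decode("utf-8")
    except UnicodeDecodeError:
        return False
    widths = []
    try:
        for row in csv.reader(io.StringIO(text), delimiter=delimiter):
            if row:
                widths.append(len(row))
            if len(widths) >= max_rows:
                break
    except csv.Error:
        return False
    # Require at least two rows, at least two columns, and a constant width.
    return len(widths) >= 2 and widths[0] >= 2 and len(set(widths)) == 1


def classify_file(data):
    """Route a file to the structured or the unstructured branch."""
    if sniff_header(data) == "text-candidate" and preread_delimited(data):
        return "structured"
    return "unstructured"
```

Only the first few kilobytes are decoded, which reflects the requirement that the pre-read does not load the complete data.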
3. The method for document processing and classification in the field of water treatment based on a large language model according to claim 1, characterized in that: The process of outputting the original recognized text includes: for an unstructured file, estimating the page tilt angle of the image content and performing rotation correction according to the estimated angle to obtain a correctly oriented page image; checking whether each page of the unstructured file contains a directly extractable text layer, extracting the text content directly if a text layer exists, and, if the page content is embedded as an image, converting the file into images page by page before inputting them into the recognition engine; the recognition engine uses a unified detection head to simultaneously detect and extract content from text regions, table structures, mathematical formulas, and legend annotations; during recognition, contextual semantic constraints are introduced to perform probability correction on candidate characters, and the spatial position of each element on the page is recorded; the recognized content and the spatial layout information are merged to output the original recognized text.
4. The method for document processing and classification in the field of water treatment based on a large language model according to claim 1, characterized in that: The five ordered layers of professional cleaning applied to the original recognized text specifically include: The first layer uses regular expression matching rules to identify strings in the original recognized text that conform to the equipment number format, the monitoring point number format, or the national standard number format, marks the matched strings as protected codes, and retains their original values; The second layer identifies, in the text outside the protected codes, OCR character confusion in measurement units caused by similar glyph shapes, including confusion between letters and digits and confusion between ordinary letters and special measurement characters; after correcting the confused characters to the corresponding standard unit characters, it compares the result with a preset set of compliant water treatment measurement units to obtain the unit-compliance-verified text; The third layer exactly matches the text output by the second layer against the water treatment-specific abbreviation list and converts the matched abbreviations to their standard capitalization to obtain the abbreviation-standardized text; The fourth layer first exactly matches strings suspected of being chemical formulas against a whitelist of commonly used water treatment agents and substances and retains any string that matches; for a string that does not match the whitelist, the validity of its element symbols and subscript structure is verified through chemical formula syntax parsing, and the string is retained if valid and otherwise marked for manual review, yielding the chemical-formula-verified text; 
The fifth layer, after excluding plain text strings, judges the structural validity of character sequences containing operators, parentheses, and subscripts by parsing their syntax without evaluating them; a valid expression is retained in full, and an invalid one is marked for manual review. After the five-layer cleaning is completed, the cleaned original recognized text is output.
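The per-layer checks of this claim can be sketched as five small Python functions. Every pattern, lookup table, and name below is an illustrative assumption rather than the claimed rule set: the regular expressions cover only two code formats, the element set is a small subset, and Python's own expression grammar (via the standard `ast` module) stands in for the expression syntax parser of the fifth layer.

```python
import ast
import re

# Layer 1 (assumed formats): national standard numbers and equipment codes.
PROTECTED = re.compile(r"\bGB(?:/T)?\s?\d+(?:\.\d+)?-\d{4}|\b[A-Z]{1,3}-\d{2,4}[A-Z]?")
# Layer 2 (assumed tables): OCR confusions of units and the compliant unit set.
UNIT_CONFUSIONS = {"mg/1": "mg/L", "mg/I": "mg/L", "rn3/h": "m3/h"}
COMPLIANT_UNITS = {"mg/L", "m3/h", "NTU"}
# Layer 3 (assumed list): abbreviations keyed by lowercase form.
ABBREVIATIONS = {"cod": "COD", "mbr": "MBR", "ph": "pH"}
# Layer 4 (assumed tables): reagent whitelist and a subset of element symbols.
FORMULA_WHITELIST = {"NaClO", "FeCl3", "Al2(SO4)3"}
ELEMENTS = {"H", "C", "N", "O", "Na", "Mg", "Al", "P", "S", "Cl", "K", "Ca", "Mn", "Fe"}
FORMULA_TOKEN = re.compile(r"([A-Z][a-z]?)([1-9]\d*)?|(\()|(\))([1-9]\d*)?")
# Layer 5: node types accepted as plain arithmetic structure.
ARITHMETIC_NODES = (ast.Expression, ast.BinOp, ast.UnaryOp, ast.Name,
                    ast.Constant, ast.Load, ast.operator, ast.unaryop)


def protected_codes(text):
    """Layer 1: return the strings that must keep their original value."""
    return PROTECTED.findall(text)


def fix_unit(token):
    """Layer 2: correct a confused unit, then report whether it is compliant."""
    corrected = UNIT_CONFUSIONS.get(token, token)
    return corrected, corrected in COMPLIANT_UNITS


def normalize_abbreviation(token):
    """Layer 3: apply standard capitalization to listed abbreviations."""
    return ABBREVIATIONS.get(token.lower(), token)


def check_formula(candidate):
    """Layer 4: whitelist match first, then a syntax check of symbols and subscripts."""
    if candidate in FORMULA_WHITELIST:
        return "retained"
    depth, pos = 0, 0
    for match in FORMULA_TOKEN.finditer(candidate):
        if match.start() != pos:
            return "manual_review"  # an unparseable character was skipped
        pos = match.end()
        if match.group(1):
            if match.group(1) not in ELEMENTS:
                return "manual_review"
        elif match.group(3):
            depth += 1
        else:
            depth -= 1
            if depth < 0:
                return "manual_review"
    valid = bool(candidate) and pos == len(candidate) and depth == 0
    return "retained" if valid else "manual_review"


def check_expression(candidate):
    """Layer 5: parse each side of '=' without evaluating anything."""
    for side in candidate.split("="):
        try:
            tree = ast.parse(side.strip(), mode="eval")
        except SyntaxError:
            return "manual_review"
        if not all(isinstance(node, ARITHMETIC_NODES) for node in ast.walk(tree)):
            return "manual_review"
    return "retained"
```

For instance, `check_formula("H2S04")` returns `"manual_review"` because the digit zero cannot start a subscript, which is the kind of letter-digit confusion the fourth layer is meant to surface.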
5. The method for document processing and classification in the field of water treatment based on a large language model according to claim 1, characterized in that: The seed sentence library is constructed as follows: authoritative sources such as domain standards, process manuals, equipment documents, and test reports are structurally extracted and cleaned; high-quality benchmark sentences covering professional terms, typical sentence structures, and standard descriptions are selected and organized; and, after deduplication, normalization, and compliance verification, a seed sentence library with accurate semantics and a uniform format is formed.
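The deduplication, normalization, and compliance verification steps can be sketched as follows. The choice of Unicode NFKC normalization with whitespace collapsing, and the default compliance predicate that merely rejects empty sentences, are assumptions; the claim does not define these operations.

```python
import re
import unicodedata


def normalize_sentence(sentence):
    """Fold full-width and compatibility characters, then collapse whitespace."""
    folded = unicodedata.normalize("NFKC", sentence)
    return re.sub(r"\s+", " ", folded).strip()


def build_seed_library(candidates, is_compliant=lambda s: len(s) > 0):
    """Normalize, verify, and deduplicate candidates while preserving order.

    `is_compliant` is a placeholder for the domain compliance check.
    """
    seen, library = set(), []
    for raw in candidates:
        sentence = normalize_sentence(raw)
        if sentence in seen or not is_compliant(sentence):
            continue
        seen.add(sentence)
        library.append(sentence)
    return library
```

Normalizing before deduplicating ensures that variants differing only in spacing or character width collapse to a single seed sentence.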
6. The method for document processing and classification in the field of water treatment based on a large language model according to claim 1, characterized in that: Obtaining the standardized text specifically includes: taking the sentences in the seed sentence library that contain high-confidence water treatment professional terms as standard sentences; injecting confusion noise into the sentences of the seed sentence library according to confusion pair rules for visually similar OCR characters; using the noisy sentences as the model input and the original seed sentences as the target output to obtain the text correction training dataset; and performing domain fine-tuning on the pre-trained text correction model. The cleaned data is input into the domain-fine-tuned text correction model; the character sequences in the cleaned data are vectorized; the context representation layer performs bidirectional context modeling and outputs term boundary markers; and, based on the term boundary markers and the context representation vectors, OCR-confused characters in non-term regions are corrected while term regions are preserved as a whole, and the standardized text is output.
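Generating (noisy input, clean target) training pairs from seed sentences can be sketched as below. The confusion pair table, the per-character noise rate, and the function names are illustrative assumptions; a real rule set would be derived from observed OCR errors.

```python
import random

# Assumed pairs of visually similar characters that OCR engines tend to swap.
CONFUSION_PAIRS = {"0": "O", "O": "0", "1": "l", "l": "1", "5": "S", "S": "5"}


def inject_noise(sentence, rate, rng):
    """Replace each confusable character with its pair with probability `rate`."""
    noisy = []
    for ch in sentence:
        if ch in CONFUSION_PAIRS and rng.random() < rate:
            noisy.append(CONFUSION_PAIRS[ch])
        else:
            noisy.append(ch)
    return "".join(noisy)


def build_training_pairs(seed_sentences, rate=0.3, seed=0):
    """Return (noisy input, original target) pairs for correction fine-tuning."""
    rng = random.Random(seed)  # fixed seed keeps the dataset reproducible
    return [(inject_noise(s, rate, rng), s) for s in seed_sentences]
```

Because substitution is character for character, the noisy sentence keeps the length of the seed sentence, so character-level boundary labels stay aligned between input and target.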
7. The method for document processing and classification in the field of water treatment based on a large language model according to claim 1, characterized in that: The steps for reviewing each classification result include: taking the construction project document archiving standard as the initial classification system, constructing a structured classification instruction that includes the classification basis, judgment rules, and classification examples, and driving the large language model to automatically classify each standardized text and output a classification label and a classification confidence description for each text, thereby obtaining the classification result; The large language model consists of an input representation layer, multiple self-attention layers, and an output prediction layer; The input representation layer vectorizes the character sequence formed by concatenating the structured classification instruction with the standardized text, encoding the description of the classification criteria, the judgment rules of each category, the typical classification examples, and the text to be classified into a unified input vector sequence that retains the positional boundary between the classification instruction and the text to be classified. The self-attention layers compute the correlation between each position in the input vector sequence and all other positions, yielding for each position a context representation vector that integrates global context information. In this process, the keyword vectors of the text to be classified and the vectors of the corresponding category descriptions in the classification instruction receive high correlation weights, which allows the model to semantically align the features of the text content with the category descriptions and obtain semantically aligned context representation vectors. 
The output prediction layer computes a category probability distribution from the semantically aligned context representation vectors, takes the category with the highest probability as the classification label, and generates the classification confidence description from the concentration of the probability distribution; the classification label and classification confidence description of each standardized text are output to obtain the classification result.
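The final step, deriving a label and a confidence description from the category probability distribution, can be sketched as follows. Measuring concentration as one minus the normalized entropy, and the thresholds 0.6 and 0.3, are assumptions made for illustration; the claim does not define how concentration is computed.

```python
import math


def softmax(logits):
    """Convert raw category scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the peak for stability
    total = sum(exps)
    return [e / total for e in exps]


def classify_with_confidence(logits, categories):
    """Return (label, confidence description, concentration).

    Assumption: concentration is 1 minus the entropy divided by its maximum,
    so a uniform distribution scores 0 and a one-hot distribution scores 1.
    Requires at least two categories.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    concentration = 1.0 - entropy / math.log(len(probs))
    if concentration >= 0.6:
        description = "high"
    elif concentration >= 0.3:
        description = "medium"
    else:
        description = "low"
    return categories[best], description, concentration
```

Texts whose distribution is flat receive a low confidence description, which makes them natural candidates for the review step that looks for gaps in the classification criteria.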