A data processing method and system for large language models
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-19
Figure CN122240839A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a data processing method and system for large language models.
Background Technology
[0002] In recent years, artificial intelligence technologies, represented by large language models, have made groundbreaking progress. Their core capabilities heavily rely on self-supervised pre-training with massive amounts of high-quality text data. Research shows that the scale, quality, and diversity of training data directly determine the model's knowledge depth and generalization ability. With the exponential growth of data volume, traditional data processing workflows based on manual rules are no longer sufficient. How to efficiently select high-quality, diverse training corpora from multi-source heterogeneous data has become a core challenge in the field of data engineering.
[0003] Current mainstream pre-training data processing solutions primarily revolve around HTML text crawled from the web. Through domain filtering, customized HTML parsing, multi-level deduplication, and quality filtering that combines heuristic rules with classifiers, they clean structured or semi-structured text efficiently. However, these solutions have very limited capability for unstructured documents such as PDFs and scanned images. They often perform only simple text extraction and cannot handle complex layouts such as multiple columns, mixed text and images, or embedded tables and formulas, which makes it difficult to bring a large body of high-value academic papers and industry research reports into training. Other enterprise-level data management solutions transform multi-source data into standardized assets through a multi-layer architecture of collection, governance, and service, but they were designed to meet the data-consistency requirements of internal business systems. When applied to deep processing of massive unstructured text, their centralized batch processing suffers from significant cost and efficiency bottlenecks and lacks dynamic resource scheduling for model training tasks. Among publicly available technical solutions, some first perform OCR on multi-source heterogeneous data and then clean it using vocabularies and regular expressions, but they adapt poorly to complex layouts and are prone to information loss. Others focus on multi-level rule filtering and model-assisted cleaning, but they mainly target data that has already been converted to plain text, and their storage layer does not distinguish between the access patterns of data at different stages.
[0004] Therefore, there is an urgent need in this field for a data processing solution that can integrate multi-source heterogeneous data parsing, flexible task scheduling and hierarchical storage to achieve efficient transformation from raw data to high-quality training corpora.
Summary of the Invention
[0005] In view of this, embodiments of the present invention provide a data processing method and system for large language models to solve the following technical problems in the prior art: the lack of systematic parsing capabilities for multi-source heterogeneous data, of flexible task scheduling mechanisms, and of differentiated storage strategies leads to low utilization of high-value unstructured data, limited system throughput, and high storage costs, which in turn restricts the efficient construction of high-quality training corpora.
[0006] One aspect of the present invention provides a data processing method for large language models, the method comprising the following steps:
Raw data files obtained from one or more data sources are received, a unique data identifier is generated for each raw data file, and the raw data file and its corresponding metadata are stored separately based on the data identifier. The raw data file is stored in the file entity storage carrier of the object storage service, and the metadata is stored in the raw layer of the non-relational database. The metadata includes the data identifier, the basic information of the raw data file, and its storage link in the file entity storage carrier.
In response to a parsing operation on multiple metadata records in the raw layer, a parsing task entity is created and published to the first message queue; the parsing task entity is obtained from the first message queue, the data identifier is extracted from it, and the corresponding raw data file is obtained from the file entity storage carrier according to the data identifier; based on the type of the raw data file, semantic information is extracted from structured data using a standardized parser and from unstructured data using layout parsing and multi-region identification; the semantic information is converted into a standard structured format and stored in the parsing layer of the non-relational database.
In response to a filtering operation on the data in the parsing layer, a filtering task entity carrying the filtering strategy configuration is created and published to the second message queue; based on the filtering task entity obtained from the second message queue, the corresponding filtering algorithm sub-modules are called sequentially according to the filtering strategy configuration to process the data in the parsing layer, and the results are stored in the filtering layer of the non-relational database.
In response to a classification operation on the data in the filtering layer, a classification task entity carrying classification model information is created and published to the third message queue; based on the classification task entity obtained from the third message queue, the specified multi-label text classification model is called to perform semantic recognition on the data in the filtering layer and add semantic labels, and the data carrying the semantic labels is stored in the classification layer of the non-relational database.
In response to a data export request, data is retrieved from the classification layer based on the semantic labels according to the specified filtering conditions, and the retrieved data is exported.
[0007] In some embodiments of the present invention, extracting semantic information from unstructured data through layout parsing and multi-region identification includes: converting the raw data file into a document image sequence; inputting the document images in the sequence into a document layout analysis model, which outputs a region candidate set, where each region in the set is represented by a bounding box, a region category label, and a confidence score, and the region categories include at least a title region, a text region, a table region, and a formula region; for title regions and text regions, invoking the OCR engine to recognize the characters and output character sequences; for table regions, invoking the table structure recognition model for parsing and outputting a structured table representation; for formula regions, invoking the formula recognition model for recognition and outputting a symbolic representation; and stitching the recognition results together according to the reading order of each region in the document image to generate a unified structured representation.
[0008] In some embodiments of the present invention, the extraction of semantic information from the structured data using a standardized parser includes: The original data file determined to be of a structured type is used as a structured document; Based on the type of the structured document, the corresponding parsing library is invoked to parse the underlying markup structure of the structured document; The semantic structure information of the structured document is identified and extracted from the parsing results. The semantic structure information includes at least heading levels, paragraph divisions, list structures, and table data. Based on the extracted semantic structure information, the hierarchical content structure of the structured document is reconstructed according to its original logical order, and the output is a standardized structured data representation.
[0009] In some embodiments of the present invention, sequentially invoking the corresponding filtering algorithm sub-modules to process the data in the parsing layer according to the filtering strategy configuration includes: receiving the filtering strategy set by the user through the filtering strategy configurator, where the filtering strategy includes the enabled filtering algorithms, the execution order of each filtering algorithm, and the corresponding algorithm parameters; calling the corresponding filtering algorithm sub-modules sequentially in that execution order to process the data in the parsing layer, and recording the processing results of each filtering step; and associating the data processed by all filtering algorithms with the applied filtering rules and storing it in the filtering layer.
[0010] In some embodiments of the present invention, the filtering algorithm submodule includes a SimHash deduplication submodule, a toxic text filtering submodule, and a quality filtering submodule, wherein: The SimHash deduplication submodule is used to calculate feature vectors and generate fingerprints for texts. For any two texts, the Hamming distance between the fingerprints is calculated. If the Hamming distance is less than a preset threshold, the two texts are determined to be highly similar and deduplication is performed. The toxic text filtering submodule is used to perform toxicity detection on text based on a pre-trained text classification model, and to identify and filter data containing insulting, discriminatory or violent content. The quality filtering submodule is used to calculate the perplexity of text based on the language model. If the perplexity is greater than a preset threshold, the text is determined to be of low quality and is removed.
[0011] In some embodiments of the present invention, the Hamming distance is calculated as follows:
$$D_H(f_A, f_B) = \sum_{i=1}^{n} \mathbb{I}\left(f_A[i] \neq f_B[i]\right)$$
where $f_A$ represents the SimHash fingerprint of text $A$; $f_B$ represents the SimHash fingerprint of text $B$; $n$ indicates the binary length of the fingerprint; $i$ is the binary bit index of the fingerprint; and $\mathbb{I}(\cdot)$ is an indicator function that takes the value 1 when the condition inside the parentheses is true, and 0 otherwise.
The perplexity is calculated as follows:
$$\mathrm{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P\left(w_i \mid w_1, \ldots, w_{i-1}\right)\right)$$
where $W$ denotes a text sequence containing $N$ words; $N$ is the total number of words in text $W$; $i$ is the position index of a word in the text sequence; $w_i$ is the $i$-th word in text $W$; and $P(w_i \mid w_1, \ldots, w_{i-1})$ is the probability with which the language model predicts that the $i$-th word is $w_i$ given the preceding $i-1$ words.
[0012] In some embodiments of the present invention, the first message queue, the second message queue, and the third message queue belong to different topics in the message queue; The method further includes: The message queue is continuously monitored by multiple stateless worker nodes. When a task message is obtained, the corresponding processing module is called to execute the specific business logic according to the task type, and the task status is updated in real time during the execution process. Specifically, the worker node of the first message queue performs a parsing task, the worker node of the second message queue performs a filtering task, and the worker node of the third message queue performs a classification task.
[0013] In some embodiments of the present invention, the data carrying the semantic tags in the classification layer is associated with the file entity storage carrier through a bidirectional interface to support batch export and efficient access of large-scale datasets; The step of retrieving data from the classification layer based on the specified filtering conditions according to the semantic tags includes: retrieving data based on one or more combinations of the semantic tags, time range, or task identifiers.
[0014] A data processing system for large language models includes a processor, a memory, and a computer program or instructions stored in the memory. The processor is used to execute the computer program or instructions, and when the computer program or instructions are executed, the system implements the steps of the method described above.
[0015] A computer-readable storage medium having a computer program or instructions stored thereon, which, when executed by a processor, implement the steps of the method as described in any of the preceding claims.
[0016] The data processing method and system for large language models provided by this invention receive raw data files and generate unique data identifiers, and then separate and store file entities and metadata based on these identifiers. A parsing operation is performed on the raw layer metadata, creating a parsing task and publishing it to a first message queue. After obtaining the file based on the identifier, semantic information is extracted using structured or unstructured parsing depending on the file type. A filtering operation is performed on the parsed layer data, creating a filtering task carrying a filtering strategy and publishing it to a second message queue. The filtering algorithm submodule is then called for processing according to the strategy. A classification operation is performed on the filtered layer data, creating a classification task carrying classification model information and publishing it to a third message queue. A multi-label text classification model is called for semantic recognition and semantic labels are added. Finally, in response to an export request, data is retrieved from the classification layer based on the filtering conditions and exported.
[0017] Furthermore, by using data identifiers throughout the entire process to achieve precise association and traceability between file entities and metadata, and combining differentiated storage strategies of object storage and non-relational databases, storage costs are effectively reduced while ensuring query efficiency. An asynchronous task scheduling architecture based on message queues enables elastic concurrent execution of core operations such as parsing, filtering, and classification, significantly improving system throughput and resource utilization. For layout parsing and multi-region recognition technology of unstructured documents, high-value knowledge from complex layouts is successfully transformed into usable structured corpora, greatly expanding the range of high-quality training data sources and providing solid data support for large language model training.
[0018] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the description, or may be learned by practice of the invention. The objects and other advantages of the invention can be realized and obtained by means of the structures specifically pointed out in the description and drawings.
[0019] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.
Attached Figure Description
[0020] The accompanying drawings, which are provided to further illustrate the invention and form part of this application, are not intended to limit the scope of the invention.
[0021] Figure 1 is a flowchart illustrating a data processing method for large language models according to an embodiment of the present invention.
[0022] Figure 2 is a schematic diagram of the architecture of a data processing system for large language models according to another embodiment of the present invention.
[0023] Figure 3 is a schematic diagram of the hierarchical structure of the data storage module in a data processing system for large language models according to another embodiment of the present invention.
[0024] Figure 4 is a schematic diagram of the internal process of the multi-source heterogeneous data parsing module in a data processing system for large language models according to another embodiment of the present invention.
[0025] Figure 5 is a schematic diagram of the internal flow of the data filtering module in a data processing system for large language models according to another embodiment of the present invention.
[0026] Figure 6 is a schematic diagram of the message queue model of the task scheduling module in a data processing system for large language models according to another embodiment of the present invention.
Detailed Implementation
[0027] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.
[0028] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.
[0029] Existing technologies generally suffer from three main drawbacks: 1) Insufficient ability to parse unstructured documents, resulting in the inability to convert high-value knowledge such as academic papers and industry research reports into usable training corpora, thus limiting the knowledge depth of large language models in professional fields; 2) Rigid system resource scheduling, lacking a dynamic and elastic scheduling mechanism based on task priority and data load, which easily leads to resource contention in multi-user concurrent scenarios, limiting overall throughput and resulting in poor data processing timeliness; 3) Single data storage strategy, failing to adopt differentiated storage according to the form of data in the processing flow, resulting in both low efficiency in accessing massive amounts of data and unreasonable storage costs.
[0030] In view of this, one aspect of the present invention provides a data processing method for large language models. As shown in Figure 1, the method includes the following steps S101 to S105:
Step S101: Receive raw data files obtained from one or more data sources, generate a unique data identifier for each raw data file, and store the raw data file and its corresponding metadata separately based on the data identifier. The raw data file is stored in the file entity storage carrier of the object storage service, and the metadata is stored in the raw layer of the non-relational database; the metadata includes the data identifier, the basic information of the raw data file, and its storage link in the file entity storage carrier.
[0031] Step S102: In response to a parsing operation on multiple metadata records in the raw layer, create a parsing task entity and publish it to the first message queue; obtain the parsing task entity from the first message queue, extract the data identifier from it, and obtain the corresponding raw data file from the file entity storage carrier according to the data identifier; extract semantic information from structured data using a standardized parser and from unstructured data using layout parsing and multi-region identification; convert the semantic information into a standard structured format and store it in the parsing layer of the non-relational database.
[0032] Step S103: In response to a filtering operation on the data in the parsing layer, create a filtering task entity carrying the filtering strategy configuration and publish it to the second message queue; based on the filtering task entity obtained from the second message queue, sequentially call the corresponding filtering algorithm sub-modules according to the filtering strategy configuration to process the data in the parsing layer, and store the results in the filtering layer of the non-relational database.
[0033] Step S104: In response to a classification operation on the data in the filtering layer, create a classification task entity carrying classification model information and publish it to the third message queue; based on the classification task entity obtained from the third message queue, call the specified multi-label text classification model to perform semantic recognition on the data in the filtering layer and add semantic labels, and store the data carrying the semantic labels in the classification layer of the non-relational database.
[0034] Step S105: In response to the data export request, retrieve data from the classification layer based on semantic tags according to the specified filtering conditions, and export the retrieved data.
[0035] In step S101, firstly, raw data files obtained from one or more data sources are received through the upload interface provided by the data transmission module. These raw data files include, but are not limited to, various formats such as PDF documents, Word documents, scanned images, and e-books. During the receiving process, pre-verification is performed on each file, including integrity checks, virus scans, and format validity checks, to ensure the validity and security of the files.
[0036] In some embodiments, for a verified raw data file, a hash value is calculated from the file content and used as the unique data identifier. For example, the ... algorithm is applied to the binary content of the file, so that the same file is recognized as having the same content even if it is uploaded at different times or through different paths, which avoids duplicate storage. The generated data identifier serves as the file's unique identity credential throughout the entire data processing flow and is used for association and traceability in all subsequent processing stages.
[0037] After the identifier is generated, the original data file and its corresponding metadata are stored separately based on this identifier. The binary content of the original data file is stored in the file entity storage medium of the object storage service, while the metadata describing the file is stored in the raw layer of the non-relational database. Specifically, the object storage service stores massive amounts of file entities in a low-cost and highly scalable manner, and the storage path can be organized according to date and data identifier, such as "upload date / data identifier.extension". Meanwhile, the raw layer data tables of the non-relational database record the corresponding metadata information. This metadata includes at least the data identifier, basic information about the original data file, and the storage link of the file in the file entity storage medium. The basic information may include attributes such as filename, file size, and MIME type, while the storage link is used in subsequent steps to quickly locate and retrieve the file entity based on the data identifier.
[0038] Through the above process, step S101 completes the access of the original data file, the generation of a unique identifier, and the separate storage of the file entity and metadata, providing a traceable data foundation for subsequent processing stages such as parsing, filtering, and classification. At the same time, a balance between storage cost and access efficiency is achieved through a differentiated storage strategy.
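Step S101 can be illustrated with a minimal Python sketch. It is not the patented implementation: SHA-256 is an assumption (the hash algorithm is not named in this text), and the `RawMetadata` fields, the `ingest` function, and the two dictionaries standing in for the object storage service and the raw layer are all hypothetical names chosen for illustration.

```python
import hashlib
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath


@dataclass(frozen=True)
class RawMetadata:
    data_id: str       # content-derived data identifier
    filename: str      # basic information: original file name
    size_bytes: int    # basic information: file size
    mime_type: str     # basic information: MIME type
    storage_link: str  # key of the file entity in the object store


def make_data_id(content: bytes) -> str:
    # Assumption: SHA-256. The source does not name the hash algorithm.
    return hashlib.sha256(content).hexdigest()


def ingest(content: bytes, filename: str, mime_type: str, upload_date: date,
           object_store: dict, raw_layer: dict) -> RawMetadata:
    """Store the file entity and its metadata separately, keyed by data_id."""
    data_id = make_data_id(content)
    if data_id in raw_layer:
        # Same content uploaded again: reuse the existing record.
        return raw_layer[data_id]
    extension = PurePosixPath(filename).suffix
    # Path layout "upload date / data identifier.extension" from the text.
    key = f"{upload_date.isoformat()}/{data_id}{extension}"
    object_store[key] = content
    meta = RawMetadata(data_id, filename, len(content), mime_type, key)
    raw_layer[data_id] = meta
    return meta
```

Because the identifier depends only on the file content, uploading the same bytes under a different name or on a different day returns the existing metadata record and writes nothing new to the object store.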
[0039] In step S102, first, in response to a user's parsing operation on multiple metadata records in the raw layer, the system creates a parsing task entity. This parsing task entity contains at least a list of identifiers of the data to be parsed and the corresponding parsing parameter configuration. After the parsing task entity is created, it is published to the first message queue to await scheduling and execution.
[0040] In some embodiments, the system employs an asynchronous task scheduling mechanism based on message queues. The first message queue is dedicated to storing parsing tasks, and multiple stateless worker nodes continuously listen to this queue. When a worker node retrieves a parsing task entity from the first message queue, it extracts the list of data identifiers from that entity and retrieves the corresponding raw data file from the file entity storage carrier based on each data identifier.
[0041] Specifically, the worker node first queries the corresponding metadata record in the raw layer of the non-relational database based on the data identifier, obtains the storage link from it, and then downloads the binary content of the raw data file from the object storage service through the link.
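The lookup chain described above (task entity, then metadata record, then storage link, then file content) can be sketched as follows. This is a simplified single-process illustration using Python's standard `queue.Queue` in place of a real message broker; `ParsingTask`, `run_parsing_worker`, and the dictionary-based stores are hypothetical names, and the worker here drains the queue and returns rather than listening indefinitely.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class ParsingTask:
    task_id: str
    data_ids: list                              # identifiers of data to parse
    params: dict = field(default_factory=dict)  # parsing parameter configuration
    status: str = "pending"


def run_parsing_worker(tasks: queue.Queue, raw_layer: dict,
                       object_store: dict, parse) -> list:
    """Drain the queue; the worker keeps no state between tasks."""
    results = []
    while True:
        try:
            task = tasks.get_nowait()
        except queue.Empty:
            return results
        task.status = "running"
        for data_id in task.data_ids:
            # 1) metadata record in the raw layer -> storage link
            link = raw_layer[data_id]["storage_link"]
            # 2) storage link -> binary content in the object store
            content = object_store[link]
            # 3) hand the content to the (caller-supplied) parser
            results.append((data_id, parse(content, **task.params)))
        task.status = "done"
        tasks.task_done()
```

A usage example with a stand-in parser:

```python
q = queue.Queue()
raw_layer = {"id1": {"storage_link": "2026-03-16/id1.txt"}}
object_store = {"2026-03-16/id1.txt": b"abc"}
q.put(ParsingTask("t1", ["id1"]))
print(run_parsing_worker(q, raw_layer, object_store, lambda c: c.decode()))
```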
[0042] After obtaining the raw data files, the worker nodes perform differentiated parsing processing based on the file type. Specifically, the type of the raw data file is determined: for structured data files, a standardized parser is used to extract semantic information; for unstructured data files, layout parsing and multi-region recognition are used to extract semantic information.
[0043] In some embodiments, the parsing process for unstructured data types includes the following sub-steps S1021 to S1024:
S1021: Convert the raw data file into a sequence of document images. For example, for a PDF file or scanned image, render each page as an independent image.
[0044] S1022: Input each document image in the document image sequence into the document layout analysis model, which outputs a set of candidate regions. Each region in the candidate region set is represented by a bounding box, a region category label, and a confidence score. The region categories include at least title regions, text regions, table regions, and formula regions.
[0045] S1023: For different types of regions, call the corresponding recognition engine for processing: For title and text regions, call the OCR engine for recognition and output character sequences; for table regions, call the table structure recognition model for parsing and output structured table representations; for formula regions, call the formula recognition model for recognition and output symbolic representations.
[0046] S1024: According to the reading order of each region in the document image, all recognition results are stitched together to generate a unified structured representation result.
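Sub-steps S1022 to S1024 amount to a dispatch-then-sort routine, sketched below. The layout model, OCR engine, table recognizer, and formula recognizer are represented by caller-supplied functions, since the text names no specific models. The `min_confidence` cut-off and the top-to-bottom, left-to-right reading order are assumptions made for this sketch; multi-column pages would need a more elaborate ordering.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), origin top-left
    category: str                            # "title", "text", "table", "formula"
    confidence: float


def assemble_page(regions: List[Region],
                  recognizers: Dict[str, Callable[[Region], str]],
                  min_confidence: float = 0.5) -> List[dict]:
    """Route each region to its recognizer and stitch results in reading order."""
    kept = [r for r in regions
            if r.confidence >= min_confidence and r.category in recognizers]
    # Assumed reading order: sort by top edge, then by left edge.
    kept.sort(key=lambda r: (r.bbox[1], r.bbox[0]))
    return [
        {"type": r.category, "bbox": r.bbox,
         "content": recognizers[r.category](r)}
        for r in kept
    ]
```

In this sketch, title and text regions would both map to the same OCR function in `recognizers`, while table and formula regions map to their own recognition functions.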
[0047] In some embodiments, the parsing process for structured data types, which uses a standardized parser to extract semantic information, includes the following sub-steps S1025 to S1028:
S1025: Treat the raw data file that is determined to be of a structured type as a structured document.
[0048] S1026: Based on the specific type of the structured document, call the corresponding parsing library to parse the underlying markup structure of the document. For example, for a Word document, parse its underlying OOXML tree structure; for an EPUB document, parse its DOM tree structure.
[0049] S1027: Identify and extract semantic structure information of structured documents from the parsing results. This semantic structure information includes at least heading levels, paragraph divisions, list structures, and table data.
[0050] S1028: Based on the extracted semantic structure information, reconstruct the hierarchical content structure of the structured document according to its original logical order, and output a standardized structured data representation, such as JSON or Markdown format.
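The reconstruction in S1028 can be sketched as building a tree from a flat, ordered list of extracted elements. The element format (`kind`, `level`, `text`) and the `build_outline` function are hypothetical; a real parser would produce these elements from the OOXML or DOM tree mentioned in S1026.

```python
import json


def build_outline(elements):
    """Nest paragraphs, lists and tables under the nearest preceding heading."""
    root = {"level": 0, "title": None, "content": [], "children": []}
    stack = [root]  # path from the root to the current section
    for el in elements:
        if el["kind"] == "heading":
            # Close sections at the same or a deeper level than this heading.
            while stack[-1]["level"] >= el["level"]:
                stack.pop()
            node = {"level": el["level"], "title": el["text"],
                    "content": [], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        else:
            stack[-1]["content"].append({"kind": el["kind"], "text": el["text"]})
    return root


elements = [
    {"kind": "heading", "level": 1, "text": "Introduction"},
    {"kind": "paragraph", "text": "Opening paragraph."},
    {"kind": "heading", "level": 2, "text": "Scope"},
    {"kind": "list", "text": "- item one"},
    {"kind": "heading", "level": 1, "text": "Methods"},
]
print(json.dumps(build_outline(elements), indent=2))
```

The resulting dictionary serializes directly to JSON, one of the two output formats named in S1028; emitting Markdown instead would be a matter of walking the same tree.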
[0051] After parsing is complete, the worker node stores the obtained standard structured data and its associated information in the parsing layer of the non-relational database. The associated information includes at least the original data identifier, parsing task identifier, and parsing time, for subsequent traceability and management.
[0052] Through the above process, step S102 completes the automated parsing of multi-source heterogeneous raw data, and converts documents of different formats and structures into standardized structured data, providing a data foundation with unified format and complete semantics for subsequent filtering and classification processing. At the same time, through layout parsing and multi-region recognition technology, it effectively solves the problem of information extraction from complex layouts in unstructured documents.
[0053] In step S103, in response to a user's filtering operation on multiple data points in the parsing layer, the system creates a filtering task entity carrying a filtering strategy configuration. This filtering task entity contains at least a list of identifiers for the data to be filtered and a user-defined filtering strategy configuration. After the filtering task entity is created, it is published to a second message queue to await scheduling and execution.
[0054] In some embodiments, the system employs the same asynchronous task scheduling mechanism as the parsing phase. A second message queue is specifically used to store filtering tasks, and multiple stateless worker nodes continuously listen to this queue. When a worker node retrieves a filtering task entity from the second message queue, it extracts the list of data identifiers and the filtering strategy configuration, and then reads the corresponding standard structured data from the parsing layer based on the data identifiers.
[0055] The filtering strategy configuration is generated by the filtering strategy configurator, which allows users to flexibly set the filtering process. Specifically, the user-configured filtering strategy includes the enabled filtering algorithms, the execution order of each filtering algorithm, and the corresponding algorithm parameters. After receiving the filtering task, the worker node sequentially calls the corresponding filtering algorithm submodules to process the data in the parsing layer according to the configured order, and records the processing results of each filtering step, such as the identifier of the data to be removed, the reason for filtering, and related quality indicators. After processing, the data processed by all filtering algorithms is associated with the applied filtering rules and stored in the filtering layer of the non-relational database to ensure the traceability of the filtering process.
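A configurable filtering pipeline of this kind can be sketched in a few lines. The configuration layout (`name`, `enabled`, `params`), the `run_filter_pipeline` function, and the two example filters are hypothetical and only illustrate the mechanism: enabled algorithms run in the configured order, and every removal is logged with the data identifier and the reason. Filters here judge one document at a time, so cross-document steps such as deduplication would need a different interface.

```python
from typing import Callable, Dict, List, Optional, Tuple


def too_short(text: str, min_chars: int) -> Optional[str]:
    # Hypothetical example filter: returns a reason to drop, or None to keep.
    return f"shorter than {min_chars} chars" if len(text) < min_chars else None


def contains_banned(text: str, words: List[str]) -> Optional[str]:
    # Hypothetical example filter based on a word list.
    hits = [w for w in words if w in text.lower()]
    return f"contains '{hits[0]}'" if hits else None


def run_filter_pipeline(docs: Dict[str, str], strategy: List[dict],
                        registry: Dict[str, Callable]) -> Tuple[dict, list]:
    """Apply the enabled filters in configured order and log every removal."""
    kept = dict(docs)
    log = []
    for step in strategy:
        if not step.get("enabled", True):
            continue
        check = registry[step["name"]]
        params = step.get("params", {})
        for data_id in list(kept):
            reason = check(kept[data_id], **params)
            if reason is not None:
                log.append({"data_id": data_id, "filter": step["name"],
                            "reason": reason})
                del kept[data_id]
    return kept, log


registry = {"too_short": too_short, "contains_banned": contains_banned}
strategy = [
    {"name": "too_short", "params": {"min_chars": 5}},
    {"name": "contains_banned", "params": {"words": ["spam"]}},
]
docs = {"a": "hi", "b": "buy spam now", "c": "a clean sentence"}
kept, log = run_filter_pipeline(docs, strategy, registry)
```

Storing `log` alongside `kept` gives the association between surviving data and the applied rules that the paragraph above describes.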
[0056] In some embodiments, the filtering algorithm submodule includes at least a SimHash deduplication submodule, a toxic text filtering submodule, and a quality filtering submodule. The SimHash deduplication submodule is used to identify and remove highly similar duplicate text. Its working principle is as follows: a feature vector is calculated for each text, generating a fixed-length fingerprint. For any two texts, the Hamming distance between the fingerprints is calculated. If the Hamming distance is less than a preset threshold, the two texts are considered highly similar and deduplication is performed. The toxic text filtering submodule performs toxicity detection on the text based on a pre-trained text classification model, identifying and filtering data containing harmful content such as insults, discrimination, or violence. The quality filtering submodule calculates the perplexity of the text based on a language model. If the perplexity is greater than a preset threshold, the text is considered low-quality and is removed.
[0057] In some embodiments, the Hamming distance is calculated as follows: $d_H(h_1, h_2) = \sum_{k=1}^{L} \mathbb{1}\left(h_1[k] \neq h_2[k]\right)$; where $h_1$ denotes the SimHash fingerprint of text $T_1$; $h_2$ denotes the SimHash fingerprint of text $T_2$; $L$ denotes the binary length of the fingerprint; $k$ denotes the binary bit index of the fingerprint; and $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 when the condition in parentheses is true and 0 otherwise. This distance counts the binary bits that differ between the two fingerprints; the smaller the distance, the more similar the texts.
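As a minimal sketch only, the fingerprinting and Hamming-distance comparison might look as follows. Whitespace tokenization, term-frequency weights, MD5 as the per-token hash, the 64-bit fingerprint length, and the threshold of 3 are all assumptions chosen for illustration; the disclosure does not fix any of them.

```python
import hashlib
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Weighted-vote SimHash; bits is assumed to be a multiple of 8, at most 128."""
    votes = [0] * bits
    for token, weight in Counter(text.lower().split()).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for k in range(bits):
            # Each token votes +weight or -weight on every bit position.
            votes[k] += weight if (h >> k) & 1 else -weight
    fingerprint = 0
    for k in range(bits):
        if votes[k] > 0:
            fingerprint |= 1 << k
    return fingerprint


def hamming_distance(h1: int, h2: int) -> int:
    """Number of bit positions at which the two fingerprints differ."""
    return bin(h1 ^ h2).count("1")


def is_near_duplicate(t1: str, t2: str, threshold: int = 3) -> bool:
    """Texts are treated as duplicates when the distance is below the threshold."""
    return hamming_distance(simhash(t1), simhash(t2)) < threshold
```

The XOR in `hamming_distance` sets exactly those bits where the fingerprints disagree, so counting the set bits is equivalent to summing the indicator function over all bit positions.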
[0058] In some embodiments, the perplexity is calculated as follows: $\mathrm{PPL}(T) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)$; where $T$ denotes a text sequence containing $N$ words; $N$ denotes the total number of words in text $T$; $i$ denotes the position index of a word in the text sequence; $w_i$ denotes the $i$-th word in text $T$; and $P(w_i \mid w_1, \dots, w_{i-1})$ denotes the probability that the language model predicts the $i$-th word to be $w_i$ given the preceding $i-1$ words.
[0059] A lower perplexity value indicates that the text conforms more to the laws of natural language and is of higher quality; conversely, text with a high perplexity value often contains abnormal or messy content and should be removed.
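For illustration, and assuming the standard definition of perplexity as the exponential of the average negative log-probability per word, the quality check might be sketched as follows. The per-word probabilities are assumed to be supplied by some language model; the function names and the way the threshold is passed are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp(-(1/N) * sum(log p_i)) over per-word conditional probabilities."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)


def is_low_quality(token_probs: Sequence[float], threshold: float) -> bool:
    """Text is removed when its perplexity exceeds the preset threshold."""
    return perplexity(token_probs) > threshold
```

As a quick sanity check, a model that assigns probability 0.25 to every word yields a perplexity of 4, which matches the intuition of choosing uniformly among four candidates at each position.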
[0060] Through the above process, step S103 completes the multi-dimensional cleaning of the parsing layer data, effectively removing duplicate, harmful and low-quality text. At the same time, the configurable filtering strategy meets the cleaning needs of different application scenarios, providing a high-quality data foundation for subsequent classification and labeling.
[0061] In step S104, in response to the user's classification operation on multiple data points in the filtering layer, the system creates a classification task entity carrying classification model information. This classification task entity includes at least a list of identifiers for the data to be classified and the specified classification model information. After the classification task entity is created, it is published to a third message queue to await scheduling and execution.
[0062] In some embodiments, the system employs the same asynchronous task scheduling mechanism as steps S102 and S103. A third message queue is specifically used to store classification tasks, and multiple stateless worker nodes continuously listen to this queue. When a worker node obtains a classification task entity from the third message queue, it extracts a list of data identifiers and classification model information from it, and reads the corresponding filtered data from the filtering layer based on the data identifiers.
[0063] After acquiring the data to be classified, the worker node invokes a specified multi-label text classification model to perform semantic recognition on the data in the filtering layer. This multi-label text classification model is trained based on a predefined classification system and can predict one or more related semantic categories for the same text. After calculating for each data input, the model outputs the predicted probability of each category. The worker node, based on a preset confidence threshold, uses the category with a probability higher than the threshold as the semantic label for that data. For example, a medical paper may be assigned multiple labels such as "deep learning," "medical imaging," and "diagnostic algorithm," with each label corresponding to a confidence score.
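The thresholding step might be sketched as follows. This is an assumption-laden example: it supposes the model emits one independent logit per category and that a sigmoid converts each logit into a probability, which is a common multi-label setup but is not stated in the disclosure. The function names and the default threshold of 0.5 are hypothetical.

```python
import math
from typing import Dict


def sigmoid(z: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def assign_labels(logits: Dict[str, float],
                  threshold: float = 0.5) -> Dict[str, float]:
    """Keep every category whose probability is higher than the threshold."""
    probs = {label: sigmoid(z) for label, z in logits.items()}
    return {label: p for label, p in probs.items() if p > threshold}
```

Because each category is scored independently, any number of labels (including none) can pass the threshold for the same text, and the retained probability serves as the confidence score of the label.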
[0064] After semantic recognition and label addition are completed, the worker node stores the data carrying the semantic labels in the classification layer of the non-relational database. During storage, the data is also associated with information such as the filter layer data identifier, classification task identifier, and the version of the classification model used, for subsequent traceability and management.
[0065] In some embodiments, the data carrying semantic labels in the classification layer is associated with the file entity storage carrier through a bidirectional interface. This design allows the classified dataset to be stored in a non-relational database for flexible retrieval, or directly exported to an object storage service for batch delivery and efficient access to large-scale datasets, meeting the downstream demand for reading massive amounts of data during large language model training.
[0066] In some embodiments, in response to a user's data export request, the system retrieves data from the classification layer based on specified filtering criteria. These filtering criteria can be flexibly combined according to actual needs; for example, data from a specific domain can be retrieved based on semantic tags, data processed within a specific time period can be retrieved based on a time range, or all data generated from a particular classification task can be retrieved based on a task identifier. The system packages the retrieved data into a standard format, including but not limited to JSONL or CSV formats, and provides it for download. Simultaneously, it generates a statistical report on the dataset's quality, which includes at least information on data volume, label distribution, and average confidence level.
[0067] Through the above process, step S104 completes the automated semantic annotation of the filter layer data, providing a directly usable labeled dataset for model training in different domains or tasks. At the same time, through the design of the association between the classification layer and object storage, it takes into account both the needs of flexible data retrieval and efficient delivery.
[0068] In step S105, in response to the user's data export request, the system retrieves data from the classification layer based on the specified filtering conditions and exports the retrieved data.
[0069] Specifically, when a user initiates a data export request, they need to specify filtering criteria to determine the range of data to be exported. These filtering criteria can be flexibly set according to actual application needs. For example, data in a specific domain or topic can be retrieved based on semantic tags, data processed within a specific time period can be retrieved based on a time range, or all data produced by a classification task can be retrieved based on a task identifier. After receiving the export request, the system searches at the classification layer according to these criteria to locate data records with semantic tags that meet the conditions.
[0070] In some embodiments, the data stored in the classification layer is associated with the file entity storage medium through a bidirectional interface. This design allows the classified dataset to be stored in a structured form in a non-relational database for flexible retrieval, or large-scale datasets can be directly exported to an object storage service for efficient batch access. When users need to export large-scale datasets for downstream model training, the system can directly read the data from the object storage service, avoiding the performance bottleneck of reading data record by record from the database.
[0071] After completing the data retrieval, the system packages the retrieved dataset into a standard format, such as JSONL or CSV, and provides a download link for users. Simultaneously, the system generates a dataset quality statistics report, which includes at least information such as data volume, label distribution, and average confidence score, helping users understand the basic characteristics and quality level of the exported data. For example, users can view the data quantity distribution under different semantic labels, assess the data balance across categories, or judge the reliability of the data annotation based on the average confidence score.
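A minimal sketch of the retrieval, packaging, and reporting steps is given below. The record layout (`labels`, `task_id`, `processed_at` as ISO 8601 strings in one consistent format) and the function names are assumptions made for the example and are not defined in this disclosure.

```python
import json
from collections import Counter
from typing import Iterable, List, Optional


def select(records: Iterable[dict], label: Optional[str] = None,
           task_id: Optional[str] = None, start: Optional[str] = None,
           end: Optional[str] = None) -> List[dict]:
    """Combine optional label, task, and time-range criteria."""
    out = []
    for r in records:
        if label is not None and label not in r["labels"]:
            continue
        if task_id is not None and r["task_id"] != task_id:
            continue
        # Same-format ISO 8601 timestamps compare correctly as strings.
        if start is not None and r["processed_at"] < start:
            continue
        if end is not None and r["processed_at"] > end:
            continue
        out.append(r)
    return out


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def quality_report(records: List[dict]) -> dict:
    """Data volume, label distribution, and average label confidence."""
    label_counts = Counter()
    confidences = []
    for r in records:
        for label, conf in r["labels"].items():
            label_counts[label] += 1
            confidences.append(conf)
    avg = sum(confidences) / len(confidences) if confidences else 0.0
    return {"data_volume": len(records),
            "label_distribution": dict(label_counts),
            "average_confidence": avg}
```

For very large exports, the same line-oriented JSONL output could be written directly to object storage rather than assembled in memory.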
[0072] Through the above process, step S105 completes the operation of retrieving and exporting high-quality labeled data from the classification layer as needed, providing a standard dataset that can be directly used for downstream large language model training or other application scenarios. Thus, the entire data processing flow completes a closed loop from raw data access to final corpus output.
[0073] In another aspect, the present invention also provides a data processing system for large language models, including a processor, a memory, and a computer program or instructions stored in the memory. The processor is configured to execute the computer program or instructions, and when the computer program or instructions are executed, the system implements the steps of any of the methods described above.
[0074] In yet another aspect, the present invention also provides a computer-readable storage medium having a computer program or instructions stored thereon, which, when executed by a processor, implement the steps of any of the methods described above.
[0075] The present invention will now be described with reference to a specific embodiment. This embodiment elaborates the complete technical solution of the data processing method and system for large language models proposed in this invention, specifically including the overall system architecture, the implementation of the multi-source heterogeneous data parsing module, the specific algorithms of the data filtering and classification modules, the task scheduling mechanism, and the hierarchical storage architecture, among other core contents. The specific technical solution of this embodiment is as follows: I. System design. As shown in Figure 2, the overall system architecture comprises six core modules: a data transmission module, a data parsing module, a data filtering module, a data classification module, a data storage module, and a task management module. The modules are delimited using domain-driven design principles and collaborate in a loosely coupled manner through clearly defined domain interfaces and event communication mechanisms.
[0076] 1. Data storage module. The data storage module is the foundation of the system architecture and employs a layered design whose layers correspond to the different stages of data processing, as shown in Figure 3. The raw layer stores user-uploaded raw files using a hybrid storage scheme that combines an object storage service with a non-relational database. Specifically, the UUID of the raw file is first calculated based on its filename. Raw files include formats such as PDF, DOCX, TXT, and images, and are stored in binary form within a specific object storage bucket. Storage paths are generated according to rules based on the upload date, UUID, and file extension, ensuring path uniqueness. Simultaneously, a raw data table is created in the non-relational database to record the metadata of each file, including the data identifier, filename, file size, MIME type, upload time, uploading user, and a link to the corresponding file in object storage. This separate storage design balances the economy of massive file storage with the efficiency of metadata retrieval.
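One possible way to derive the identifier and storage path is sketched below. The use of a name-based UUID (`uuid.uuid5` with the URL namespace), the `raw/` prefix, and the `YYYY/MM/DD` directory layout are assumptions for illustration; the disclosure only states that the UUID is derived from the filename and that the path combines upload date, UUID, and extension.

```python
import uuid
from datetime import date
from pathlib import PurePosixPath


def raw_object_key(filename: str, upload_date: date,
                   prefix: str = "raw") -> str:
    """Build an object-storage key from upload date, UUID, and extension."""
    # Name-based UUID: the same filename always maps to the same identifier.
    file_id = uuid.uuid5(uuid.NAMESPACE_URL, filename)
    ext = PurePosixPath(filename).suffix.lower()
    return f"{prefix}/{upload_date:%Y/%m/%d}/{file_id}{ext}"
```

Note that a filename-derived identifier is deterministic, so two different files uploaded under the same name on the same day would map to the same key unless the name is first made unique.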
[0077] The parsing layer stores the structured data output by the parsing module, using a non-relational database. Its table structure is designed to store the parsed data in standard format, including JSON or Markdown text, while also recording the original data identifier, parsing task ID, parsing time, and version number.
[0078] The filtering layer stores the data processed by the filtering module using a non-relational database. In addition to storing the filtered content, it also records the applied combination of filtering rules, the filtering results, and the associated parsing layer data identifiers. The filtering results may include whether the data was removed and quality score information.
[0079] The classification layer stores the final labeled data using a non-relational database. Its table structure includes data content, one or more classification labels, classification confidence scores, classification model versions, and associated filter layer data identifiers.
[0080] 2. Data transmission module. This module is responsible for data input and output and includes the following interfaces. Data upload interface: receives files uploaded by users via the client. After a file is uploaded, the module calls the storage service to store the file in the object storage service, generates a corresponding metadata record in the raw layer database table, and returns a unique data identifier.
[0081] Data export interface: responds to user requests to export data from the classification layer. It retrieves the relevant data based on combinations of criteria such as labels, time ranges, and task IDs, and packages the retrieved results into standard formats, including JSONL, CSV, or specific compressed file formats, for user download.
[0082] 3. Multi-source heterogeneous data parsing module. The data parsing process within this module is shown in Figure 4. The module transforms multi-source heterogeneous data into unified semi-structured data and contains an intelligent dispatcher and multiple parsing submodules.
[0083] The intelligent dispatcher automatically determines the parsing path based on the file type of the data to be parsed. If the file type is a structured or semi-structured format from which text can be directly extracted, including Word, EPUB, MOBI, etc., it is dispatched to the structured data parsing submodule; if the file type is an unstructured format with complex layout, including PDF, scanned images, etc., it is dispatched to the unstructured data parsing submodule.
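The routing decision might be sketched as a simple extension lookup. The extension sets below and the decision to raise an error for unlisted types are assumptions for the example; the disclosure names Word, EPUB, and MOBI on one side and PDF and scanned images on the other, each followed by "etc.".

```python
from pathlib import PurePosixPath

# Illustrative extension sets; a real dispatcher could also inspect MIME types.
STRUCTURED_EXTS = {".docx", ".epub", ".mobi"}
UNSTRUCTURED_EXTS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}


def dispatch(filename: str) -> str:
    """Return the name of the parsing path for the given file."""
    ext = PurePosixPath(filename).suffix.lower()
    if ext in STRUCTURED_EXTS:
        return "structured"
    if ext in UNSTRUCTURED_EXTS:
        return "unstructured"
    raise ValueError(f"no parsing path configured for {ext!r}")
```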
[0084] The structured data parsing submodule converts documents with explicit structural markers into a unified structured representation. It calls specialized parsing libraries for different document types, such as using python-docx to process Word documents and the epub library to process ebooks, extracting and reconstructing the underlying semantic structure information. This reconstructed semantic structure information includes heading levels, paragraph divisions, list structures, table data, and hyperlinks.
[0085] Taking a Word document as an example, its underlying storage uses the Office Open XML (OOXML) standard. The parser traverses the OOXML tree structure to identify the mapping between styles and content blocks, reconstructs the document's logical hierarchy based on formatting, and finally outputs it in standardized JSON or Markdown format. Specifically, the Word document to be processed is modeled as an ordered sequence of paragraphs $D = (p_1, p_2, \dots, p_n)$, where each paragraph $p_i = (t_i, s_i, a_i)$ is a triplet of text content $t_i$, style type $s_i$, and additional attributes $a_i$. When the style type $s_i$ is a level-$k$ heading style, the paragraph is determined to be a level-$k$ heading and its heading level is defined as $k$. The system maintains a heading count vector $c = (c_1, c_2, \dots, c_K)$ during parsing. When a level-$k$ heading is detected, the update rule $c_k \leftarrow c_k + 1$ is executed and the count $c_j$ is reset to zero for all $j > k$, thereby generating a hierarchical number that uniquely identifies the heading node. A parent-child hierarchy between heading nodes is constructed from the heading level relationships. During full-text parsing, all text content is concatenated into a continuous string $S$ in the original paragraph order, and the offset function $\mathrm{off}(i)$ gives the character start position of each heading paragraph in the full text, which defines the content range $[\mathrm{off}(i), e_i)$ of the heading node, where the end position $e_i$ is the start offset of the nearest following sibling or parent-level heading, or the end of the full text. The offset is calculated as $\mathrm{off}(i) = \sum_{j=1}^{i-1} |t_j|$; where $p_i$ denotes the $i$-th paragraph of the Word document to be processed, represented by the triplet above, and $|t_j|$ denotes the length of the text content of the $j$-th paragraph.
[0086] At the same time, the system relies on the recursive path relationship of the heading numbers to construct a hierarchical content structure, automatically assigning each non-heading paragraph to the node of its nearest heading path. Table content is uniformly converted into structured text with preset tags and embedded at the corresponding position, so that text and table content are managed together while the original semantic order of the document is preserved. Finally, a structured data unit containing the heading text, content text, content type, and content length is generated for each heading node, where the content length is determined from the length of the heading text and the length of the content text. This process completes the parsing and reconstruction of the chapter hierarchy, content boundaries, and semantic structure of the Word document.
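The heading numbering and offset bookkeeping described for Word documents might be sketched as follows. The input is assumed to be a list of `(text, level)` pairs in which `level` is `None` for body paragraphs; mapping Word style names to levels, a maximum of six levels, and concatenation without separators are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Paragraph = Tuple[str, Optional[int]]  # (text, heading level or None)


def number_headings(paragraphs: List[Paragraph],
                    max_level: int = 6) -> List[dict]:
    """Assign hierarchical numbers and character ranges to headings."""
    counts = [0] * max_level   # heading count vector
    offset = 0                 # running character offset in the full text
    headings = []
    for text, level in paragraphs:
        if level is not None:
            counts[level - 1] += 1
            for j in range(level, max_level):
                counts[j] = 0  # reset all deeper levels
            number = ".".join(str(c) for c in counts[:level])
            headings.append({"number": number, "level": level,
                             "title": text, "start": offset})
        offset += len(text)
    total = offset
    for i, h in enumerate(headings):
        # A range ends at the next heading of the same or a higher level.
        h["end"] = total
        for later in headings[i + 1:]:
            if later["level"] <= h["level"]:
                h["end"] = later["start"]
                break
    return headings
```

With this bookkeeping, each heading node covers the text of all its sub-headings, so the ranges nest in the same way as the heading numbers.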
[0087] The unstructured data parsing submodule performs unified parsing and structure reconstruction on documents that lack explicit structural markers. It maps the various types of unstructured data into a standardized document-image representation and then performs layout parsing and multi-region recognition on that representation. Specifically, the input unstructured data set is represented as $X = \{x_1, x_2, \dots, x_n\}$, where $x_i$ denotes the $i$-th unstructured data item, which can be a scanned document, a PDF, or an image. Each item is first converted, through format normalization and rendering, into a document image sequence $(I_1, I_2, \dots, I_m)$, where $I_j$ denotes the document image of the $j$-th page and each image is represented as a pixel matrix $I_j \in \mathbb{R}^{H \times W \times C}$, with $H$ the image height, $W$ the image width, and $C$ the number of channels. The document image is then input into a document layout analysis model for region-level semantic parsing. This model uses a deep neural network to extract features from the image and predict regions, outputting a set of candidate regions $R_j = \{r_{j,1}, r_{j,2}, \dots\}$, where $r_{j,k}$ denotes the $k$-th region of the $j$-th page image. Each region is jointly represented by a bounding box $b$, a region category label $c$, and a confidence score $s$, with bounding box parameters $b = (x, y, w, h)$, where $(x, y)$ are the coordinates of the top-left corner and $w$ and $h$ are the width and height. Region categories include at least text regions, table regions, image regions, formula regions, and header/footer regions. After layout detection, a reading order and hierarchy are constructed from the spatial relationships of the region blocks, specifically from left to right and from top to bottom. For each region determined to be a text region, the OCR engine is invoked to map the image region into a character sequence $(u_1, u_2, \dots, u_q)$, where $u_1$ to $u_q$ are the recognized characters, and the recognition confidence and position information are preserved. For table regions, the table structure recognition model is invoked to parse the row boundaries and merged cells of the table and convert them into a structured table representation $G = (V, E)$, where the node set $V$ represents the cells of the table and the edge set $E$ represents the row and column relationships between cells. For formula regions, a formula recognition model is invoked to convert mathematical expressions in image form into an editable symbolic representation in a chosen representation format. Regions of other types are not recognized or processed. Finally, the results are concatenated in reading order to generate a unified structured representation.
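As one simple reading of "from left to right and from top to bottom", the detected regions might be ordered as in the sketch below. The `Region` class, the row tolerance of 10 pixels, and the row-grouping heuristic are hypothetical; multi-column pages generally need a more elaborate column-aware ordering than this.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    x: float        # left edge of the bounding box
    y: float        # top edge of the bounding box
    w: float        # width
    h: float        # height
    category: str   # e.g. "text", "table", "formula"
    score: float    # detection confidence


def reading_order(regions: List[Region],
                  row_tolerance: float = 10.0) -> List[Region]:
    """Group regions into rows by top edge, then read each row left to right."""
    by_top = sorted(regions, key=lambda r: (r.y, r.x))
    rows: List[List[Region]] = []
    for r in by_top:
        # A region joins the current row if its top edge is close to the
        # top edge of the first region in that row.
        if rows and abs(r.y - rows[-1][0].y) <= row_tolerance:
            rows[-1].append(r)
        else:
            rows.append([r])
    return [r for row in rows for r in sorted(row, key=lambda r: r.x)]
```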
[0088] 4. Data filtering module. As shown in Figure 5, the data filtering module provides configurable and extensible cleaning capabilities and consists of a filtering strategy configurator and multiple filtering algorithm submodules.
[0089] The filtering strategy configurator is used for policy-level management and scheduling of data filtering tasks. It allows users to flexibly configure multiple filtering algorithm sub-modules according to actual data cleaning needs, including enabling or disabling specified filtering algorithms, setting the execution order of each algorithm, and configuring the parameter information corresponding to the algorithm.
[0090] The filtering algorithm submodule library includes at least the SimHash deduplication submodule, the toxic text filtering submodule, and the quality filtering submodule.
[0091] The SimHash deduplication submodule segments each text $T$ into words, calculates its feature vector, and then generates a fingerprint $h \in \{0, 1\}^{L}$ based on the SimHash algorithm, where $L$ is the fingerprint length. For any two texts $T_1$ and $T_2$, the Hamming distance $d_H(h_1, h_2)$ between their fingerprints is calculated. If the Hamming distance is less than the preset threshold $\tau$, the two texts are considered highly similar and duplicates are removed. The Hamming distance is calculated as follows: $d_H(h_1, h_2) = \sum_{k=1}^{L} \mathbb{1}\left(h_1[k] \neq h_2[k]\right)$; where $h_1$ denotes the SimHash fingerprint of text $T_1$; $h_2$ denotes the SimHash fingerprint of text $T_2$; $L$ denotes the binary length of the fingerprint; $k$ denotes the binary bit index of the fingerprint; and $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 when the condition in parentheses is true and 0 otherwise. This distance counts the binary bits that differ between the two fingerprints; the smaller the distance, the more similar the texts.
[0092] The toxic text filtering submodule identifies and filters data containing harmful content such as insults, discrimination, and violence based on a pre-trained text classification model. A pre-trained multi-class classification model predicts the toxicity probability distribution $(p_1, p_2, \dots, p_C)$ of text $T$, where $p_c$ denotes the probability that text $T$ belongs to category $c$ and the probabilities sum to 1, that is, $\sum_{c=1}^{C} p_c = 1$. If category $c_0$ is the non-toxic category, data satisfying $p_{c_0} \geq \theta_{\mathrm{tox}}$ is retained, where $\theta_{\mathrm{tox}}$ is the threshold.
[0093] The quality filtering submodule evaluates text quality based on the perplexity metric of a language model and removes low-quality text. For text $T$, the perplexity $\mathrm{PPL}(T)$ is calculated using the language model LM, where $P(w_i \mid w_1, \dots, w_{i-1})$ is the probability that the language model predicts the $i$-th word to be $w_i$ given the preceding $i-1$ words. If the perplexity is greater than a preset threshold $\theta_{\mathrm{ppl}}$, the text is deemed to be of low quality and is discarded. The perplexity is calculated as follows: $\mathrm{PPL}(T) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\right)$; where $T$ denotes a text sequence containing $N$ words; $N$ denotes the total number of words in text $T$; $i$ denotes the position index of a word in the text sequence; and $w_i$ denotes the $i$-th word in text $T$.
[0094] 5. Data classification module. The data classification module adds semantic labels to the cleaned text data. This module is based on a multi-label text classification model that predicts one or more categories and their confidence levels for the text according to a predefined classification system. It calculates the multi-label probability distribution $(q_1, q_2, \dots, q_M)$ of text $T$, where $q_m \in [0, 1]$ denotes the probability that text $T$ belongs to category $m$. A threshold $\theta$ determines whether the label of each category is assigned, and the multi-label set of the text is finally obtained as $Y(T) = \{m \mid q_m > \theta\}$.
[0095] 6. Task scheduling module. As shown in Figure 6, the task scheduling module is responsible for coordinating the scheduling and execution of the various processing modules in the system.
[0096] The system abstracts each data operation into a task object, which includes information such as task type, task status, a list of input data identifiers, configuration parameters, priority, and creator. All created tasks are published to the corresponding topic in a central message queue, which can be implemented using RabbitMQ or Kafka. Multiple stateless worker nodes are deployed in the system backend. Each worker node continuously listens to the message queue, and when it receives a task message, it calls the corresponding processing module to execute the specific business logic based on the task type, thereby achieving asynchronous task execution and load balancing. During task execution, worker nodes update the task status to the database in real time. The system provides a monitoring dashboard displaying key indicators such as task queue length, node load, and task success rate.
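The task abstraction and a stateless worker loop might be sketched as follows. To keep the example self-contained, Python's in-process `queue.Queue` stands in for RabbitMQ or Kafka, and a plain dictionary stands in for the task status database; the `Task` fields mirror those listed above, while the function names and status strings are hypothetical.

```python
import json
import queue
import uuid
from dataclasses import asdict, dataclass, field
from typing import Callable, Dict


@dataclass
class Task:
    task_type: str                    # "parse", "filter", or "classify"
    data_ids: list                    # input data identifiers
    params: dict = field(default_factory=dict)
    priority: int = 0
    creator: str = "unknown"
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    status: str = "pending"


def publish(q: queue.Queue, task: Task) -> None:
    """Serialize the task entity and put it on the queue."""
    q.put(json.dumps(asdict(task)))


def worker_drain(q: queue.Queue, handlers: Dict[str, Callable[[dict], None]],
                 status_store: Dict[str, str]) -> None:
    """Consume tasks until the queue is empty, recording each status."""
    while True:
        try:
            message = q.get_nowait()
        except queue.Empty:
            return
        task = json.loads(message)
        status_store[task["task_id"]] = "running"
        try:
            # Route by task type; an unknown type is recorded as a failure.
            handlers[task["task_type"]](task)
            status_store[task["task_id"]] = "succeeded"
        except Exception:
            status_store[task["task_id"]] = "failed"
        finally:
            q.task_done()
```

Because the worker keeps no state between messages, any number of such workers can consume from the same queue, which is what allows the load to be balanced across nodes.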
[0097] II. The overall system workflow is as follows: Users upload raw data files in batches through the system's web interface. The system performs security checks on each file, including file integrity verification, virus scanning, and format validity checks. After successful verification, the system calculates a unique hash value for the file's binary content as a data identifier. Subsequently, the file's binary data is uploaded to the object storage service and a storage path is generated. Simultaneously, metadata records are written to the raw layer of the non-relational database. The metadata includes at least the data identifier, filename, file type, file size, upload time, uploading user, and object storage access link, completing the raw data import process.
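A content-derived identifier and the accompanying metadata record might be produced as in the following sketch. SHA-256 is an assumed choice of hash function, and the metadata field names are hypothetical; the disclosure specifies only that a unique hash of the binary content serves as the data identifier and lists the kinds of metadata recorded.

```python
import hashlib
from typing import Iterable


def content_identifier(chunks: Iterable[bytes]) -> str:
    """Hash the file content chunk by chunk so large files need not fit in memory."""
    digest = hashlib.sha256()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def build_metadata(data_id: str, filename: str, file_type: str, size: int,
                   upload_time: str, user: str, object_url: str) -> dict:
    """Assemble the raw-layer metadata record for one uploaded file."""
    return {"data_id": data_id, "filename": filename, "file_type": file_type,
            "file_size": size, "upload_time": upload_time,
            "uploaded_by": user, "object_url": object_url}
```

Hashing the content rather than the name means that identical files receive the same identifier, which also makes exact duplicate uploads easy to detect.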
[0098] Users select a batch of data records to be processed at the raw layer to create a parsing task. The system abstracts the operation into a parsing task entity, which includes information such as task type, task identifier, list of input data identifiers, parsing parameters, priority, and creation time. After the task entity is serialized, it is published to the parsing task topic in the message queue, and the task metadata is persisted in the task management database.
[0099] Distributed worker nodes listen to the message queue, consume parsing task messages, and then read metadata from the raw layer based on the data identifier list, retrieving the original file via an object storage link. Worker nodes execute parsing routing based on file type: documents with directly extractable semantic structure enter the structured parsing pipeline, while unstructured documents such as PDFs or scanned images enter the unstructured parsing pipeline. During parsing, data from different sources is converted into a standardized structured representation, and the parsing results are bound to the original data identifier and parsing task ID, then written to the parsing layer database. The task status is updated upon completion of parsing.
[0100] Users select a batch of structured data from the parsing layer and define filtering strategies through a visual configuration interface, including which filtering algorithms to enable, their execution order, and their parameter configurations. The system generates a filtering task entity based on the configuration, serializes it, publishes it to the filtering task topic in the message queue, and establishes a mapping between the filtering task and the parsed data.
[0101] After a worker node receives a filtering task, it sequentially calls the filtering algorithm submodules according to the strategy configuration. Each filtering stage performs a judgment on the data and records the filtering results, including the identifier of the data to be removed, the reason for filtering, and relevant metrics. The filtered data is written to the filtering layer database, and the applied filtering rule combination and result summary are also saved.
[0102] Users select a cleaned dataset in the filtering layer and initiate a classification task creation request, specifying the multi-label classification model version, classification system, and threshold parameters. The system encapsulates the request as a classification task entity, publishes it to the classification task topic in the message queue, and completes task registration.
[0103] After consuming a classification task, the worker node calls the specified multi-label text classification model to perform semantic prediction on each input text record, outputs the predicted probabilities for multiple categories, and generates a label set and confidence scores based on a preset threshold. The classification results are bound to the filtering layer data identifier and the classification model version and written to the classification layer database. The task status is updated upon completion.
[0104] Users can search the classification layer data using a combination of conditions such as labels, time range, or task identifiers. The search results can be exported as labeled data in a standard format, and a data quality statistical report is generated at the same time to evaluate data distribution and cleaning effectiveness, providing high-quality data support for subsequent model training.
[0105] In summary, the data processing method and system for large language models provided by this invention receive raw data files and generate unique data identifiers, then separate and store file entities and metadata based on these identifiers; perform parsing operations on the raw layer metadata, create parsing tasks and publish them to a first message queue, and extract semantic information by using structured or unstructured parsing based on the file type after obtaining the file according to the identifier; perform filtering operations on the parsed layer data, create filtering tasks carrying filtering strategies and publish them to a second message queue, and call the filtering algorithm submodule for processing according to the strategy; perform classification operations on the filtered layer data, create classification tasks carrying classification model information and publish them to a third message queue, call a multi-label text classification model for semantic recognition and add semantic labels; finally, respond to the export request, retrieve data from the classification layer based on the filtering conditions, and export the data.
[0106] Furthermore, data identifiers are used throughout the entire process to achieve precise association and traceability between file entities and metadata, and the differentiated storage strategy combining object storage and a non-relational database reduces storage costs while maintaining query efficiency. The asynchronous task scheduling architecture based on message queues enables elastic concurrent execution of core operations such as parsing, filtering, and classification, improving system throughput and resource utilization. Through layout parsing and multi-region recognition of unstructured documents, high-value knowledge in complex layouts is transformed into usable structured corpora, expanding the range of high-quality training data sources and providing data support for large language model training.
[0107] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.
[0108] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.
[0109] In this invention, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0110] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A data processing method for large language models, characterized in that the method includes the following steps: receiving raw data files from one or more data sources, generating a unique data identifier for each raw data file, and storing the raw data file and its corresponding metadata separately based on the data identifier, wherein the raw data file is stored in a file entity storage carrier of an object storage service, the metadata is stored in a raw layer of a non-relational database, and the metadata includes the data identifier, basic information of the raw data file, and its storage link in the file entity storage carrier; performing a parsing operation on multiple metadata records in the raw layer, which includes creating a parsing task entity and publishing it to a first message queue, obtaining the parsing task entity from the first message queue, extracting the data identifier from it, obtaining the corresponding raw data file from the file entity storage carrier according to the data identifier, extracting semantic information according to the type of the raw data file, using a standardized parser for structured data and layout parsing and multi-region identification for unstructured data, converting the semantic information into a standard structured format, and storing it in a parsing layer of the non-relational database; performing a filtering operation on the data in the parsing layer, which includes creating a filtering task entity carrying a filtering strategy configuration and publishing it to a second message queue, and, based on the filtering task entity obtained from the second message queue, sequentially calling the corresponding filtering algorithm sub-modules according to the filtering strategy configuration to process the data in the parsing layer and storing the result in a filtering layer of the non-relational database; performing a classification operation on the data in the filtering layer, which includes creating a classification task entity carrying classification model information and publishing it to a third message queue, and, based on the classification task entity obtained from the third message queue, calling the specified multi-label text classification model to perform semantic recognition on the data in the filtering layer and add semantic tags, and storing the data carrying the semantic tags in a classification layer of the non-relational database; and, in response to a data export request, retrieving data from the classification layer based on the semantic tags according to the specified filtering conditions, and exporting the retrieved data.
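By way of non-limiting illustration, the identifier-based separation of file entities and metadata recited in claim 1 may be sketched as follows. The sketch assumes in-memory dictionaries as stand-ins for the object storage service and the raw layer of the non-relational database; the names `InMemoryStores`, `ingest`, and `load_raw_file`, the metadata field names, and the storage link format are hypothetical and are not taken from the claims.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryStores:
    """Hypothetical stand-ins for the two storage back ends."""
    objects: dict = field(default_factory=dict)    # object storage: link -> bytes
    raw_layer: dict = field(default_factory=dict)  # raw layer: data_id -> metadata


def ingest(stores: InMemoryStores, filename: str, payload: bytes, source: str) -> str:
    """Store one raw file and its metadata separately, joined by a data identifier."""
    data_id = uuid.uuid4().hex                      # unique data identifier
    storage_link = f"objects/{data_id}/{filename}"  # assumed link format
    stores.objects[storage_link] = payload          # file entity only
    stores.raw_layer[data_id] = {                   # metadata only
        "data_id": data_id,
        "filename": filename,
        "source": source,
        "size_bytes": len(payload),
        "sha256": hashlib.sha256(payload).hexdigest(),
        "storage_link": storage_link,
    }
    return data_id


def load_raw_file(stores: InMemoryStores, data_id: str) -> bytes:
    """Resolve a data identifier to the file entity through its metadata record."""
    return stores.objects[stores.raw_layer[data_id]["storage_link"]]
```

A parsing worker that receives only the data identifier in a task entity can then call `load_raw_file` to fetch the file, which is the lookup path the claim describes.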
2. The data processing method for large language models according to claim 1, characterized in that extracting semantic information from unstructured data through layout parsing and multi-region identification includes: converting the raw data file into a document image sequence; inputting the document images in the document image sequence into a document layout analysis model, which outputs a region candidate set, wherein each region in the region candidate set is represented by a bounding box, a region category label, and a confidence score, and the region categories include at least a title region, a text region, a table region, and a formula region; for regions determined to be of the title category, invoking an OCR engine to recognize the characters and output a character sequence; for regions determined to be of the text category, invoking the OCR engine to recognize them and output a character sequence; for regions determined to be of the table category, invoking a table structure recognition model to parse them and output a structured table representation; for regions determined to be of the formula category, invoking a formula recognition model to recognize them and output a symbolic representation; and stitching the recognition results together according to the reading order of the regions in the document image to generate a unified structured representation.
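As a non-limiting sketch of the per-category dispatch and stitching recited in claim 2, the following assumes that the layout analysis model has already produced the regions and that each recognizer is supplied as a callable. The names `Region` and `stitch_page` are hypothetical, and the top-to-bottom, left-to-right ordering is a simplifying assumption that would not be sufficient for multi-column pages.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    category: str                    # "title", "text", "table" or "formula"
    confidence: float                # score reported by the layout model


def stitch_page(
    regions: List[Region],
    recognizers: Dict[str, Callable[[Region], str]],
) -> List[dict]:
    """Recognize each region with the model for its category, in reading order."""
    # Assumed reading order: sort by top edge, then by left edge.
    ordered = sorted(regions, key=lambda r: (r.bbox[1], r.bbox[0]))
    return [
        {"type": r.category, "content": recognizers[r.category](r)}
        for r in ordered
    ]
```

In practice the `recognizers` mapping would bind "title" and "text" to an OCR engine, "table" to a table structure recognition model, and "formula" to a formula recognition model.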
3. The data processing method for large language models according to claim 1, characterized in that extracting semantic information from structured data using a standardized parser includes: taking the raw data file determined to be of a structured type as a structured document; invoking, based on the type of the structured document, the corresponding parsing library to parse the underlying markup structure of the structured document; identifying and extracting the semantic structure information of the structured document from the parsing results, wherein the semantic structure information includes at least heading levels, paragraph divisions, list structures, and table data; and, based on the extracted semantic structure information, reconstructing the hierarchical content structure of the structured document according to its original logical order and outputting a standardized structured data representation.
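For illustration only, the extraction of heading levels, paragraphs, and list items from a structured document may be sketched for the HTML case with the Python standard library. The class name `OutlineParser`, the function `parse_html`, and the output field names are hypothetical; table data and nested block elements are omitted for brevity.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect headings, paragraphs and list items in document order."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
    BLOCKS = HEADINGS | {"p", "li"}

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._tag = None   # block element currently being read
        self._buf = []     # text fragments of that element

    def handle_starttag(self, tag, attrs):
        if tag in self.BLOCKS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag != self._tag:
            return
        text = " ".join("".join(self._buf).split())  # normalize whitespace
        if tag in self.HEADINGS:
            self.blocks.append({"type": "heading", "level": int(tag[1]), "text": text})
        elif tag == "li":
            self.blocks.append({"type": "list_item", "text": text})
        else:
            self.blocks.append({"type": "paragraph", "text": text})
        self._tag = None


def parse_html(markup: str) -> list:
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.blocks
```

Because the blocks are appended as the markup is read, the output preserves the original logical order of the document.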
4. The data processing method for large language models according to claim 1, characterized in that sequentially calling the corresponding filtering algorithm sub-modules according to the filtering strategy configuration to process the data in the parsing layer includes: receiving, through a filtering strategy configurator, a filtering strategy set by the user, wherein the filtering strategy includes the enabled filtering algorithms, the execution order of the filtering algorithms, and the corresponding algorithm parameters; calling the corresponding filtering algorithm sub-modules sequentially according to the execution order to process the data in the parsing layer, and recording the processing result of each filtering step; and associating the data that has passed all filtering algorithms with the applied filtering rules and storing it in the filtering layer.
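A minimal sketch of a configurable, ordered filtering chain of the kind recited in claim 4 is given below. It assumes that the strategy is expressed as an ordered list of steps and uses two toy filters as stand-ins for the real sub-modules; `REGISTRY`, `run_filters`, `drop_short`, and `drop_exact_duplicates` are hypothetical names.

```python
from typing import Callable, Dict, List, Tuple


def drop_short(texts: List[str], min_chars: int) -> List[str]:
    """Toy quality filter: keep texts with at least min_chars characters."""
    return [t for t in texts if len(t) >= min_chars]


def drop_exact_duplicates(texts: List[str]) -> List[str]:
    """Toy deduplication filter: keep the first copy of each text."""
    return list(dict.fromkeys(texts))


# Registry of available sub-modules, keyed by the name used in the strategy.
REGISTRY: Dict[str, Callable[..., List[str]]] = {
    "min_length": drop_short,
    "exact_dedup": drop_exact_duplicates,
}


def run_filters(texts: List[str], strategy: List[dict]) -> Tuple[List[str], List[dict]]:
    """Apply the enabled filters in the configured order and log each step."""
    log = []
    for step in strategy:  # list order is the execution order
        params = step.get("params", {})
        before = len(texts)
        texts = REGISTRY[step["name"]](texts, **params)
        log.append({
            "filter": step["name"],
            "params": params,
            "kept": len(texts),
            "removed": before - len(texts),
        })
    return texts, log
```

The returned log can be stored alongside the surviving records so that each record in the filtering layer remains associated with the rules that were applied to it.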
5. The data processing method for large language models according to claim 4, characterized in that the filtering algorithm sub-modules include a SimHash deduplication sub-module, a toxic text filtering sub-module, and a quality filtering sub-module, wherein: the SimHash deduplication sub-module is used to calculate feature vectors and generate fingerprints for texts, and, for any two texts, to calculate the Hamming distance between their fingerprints, wherein if the Hamming distance is less than a preset threshold, the two texts are determined to be highly similar and deduplication is performed; the toxic text filtering sub-module is used to perform toxicity detection on text based on a pre-trained text classification model, and to identify and filter out data containing insulting, discriminatory, or violent content; and the quality filtering sub-module is used to calculate the perplexity of text based on a language model, wherein if the perplexity is greater than a preset threshold, the text is determined to be of low quality and is removed.
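The SimHash deduplication sub-module may, for example, be sketched as follows. This is one common formulation and not necessarily the one used in the embodiments: it assumes whitespace tokenization, unit token weights, a 64-bit fingerprint derived from SHA-256, and a threshold of 3, all of which are illustrative choices. The names `simhash` and `is_near_duplicate` are hypothetical.

```python
import hashlib

BITS = 64  # assumed fingerprint length


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint with unit weight per token."""
    weights = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for k in range(BITS):
            # Each token votes +1 or -1 on every bit position.
            weights[k] += 1 if (token_hash >> k) & 1 else -1
    # A bit is set when the positive votes outnumber the negative ones.
    return sum(1 << k for k in range(BITS) if weights[k] > 0)


def is_near_duplicate(text_a: str, text_b: str, threshold: int = 3) -> bool:
    """Treat two texts as duplicates when their Hamming distance is below threshold."""
    distance = bin(simhash(text_a) ^ simhash(text_b)).count("1")
    return distance < threshold
```

The comparison uses a strict "less than", matching the wording of the claim; the threshold value itself would be tuned on the corpus.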
6. The data processing method for large language models according to claim 5, characterized in that the Hamming distance is calculated as: $D(h_A, h_B) = \sum_{k=1}^{L} \mathbb{I}\left(h_A^{(k)} \neq h_B^{(k)}\right)$; where $h_A$ denotes the SimHash fingerprint of text $A$; $h_B$ denotes the SimHash fingerprint of text $B$; $L$ denotes the binary length of the fingerprint; $k$ denotes the binary bit index of the fingerprint, so that $h_A^{(k)}$ and $h_B^{(k)}$ are the $k$-th bits of the two fingerprints; and $\mathbb{I}(\cdot)$ is an indicator function that takes the value 1 when the condition inside the parentheses is true, and 0 otherwise; and the perplexity is calculated as: $\mathrm{PPL}(W) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \ldots, w_{i-1})\right)$; where $W$ denotes a text sequence containing $N$ words; $N$ denotes the total number of words in text $W$; $i$ denotes the position index of a word in the text sequence; $w_i$ denotes the $i$-th word in text $W$; and $P(w_i \mid w_1, \ldots, w_{i-1})$ denotes the probability, given by the language model, that the $i$-th word is $w_i$ conditioned on the preceding $i-1$ words.
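As a worked illustration of the two quantities in claim 6, the following sketch evaluates the standard bitwise Hamming distance and the standard perplexity from a list of per-word conditional probabilities. It assumes that fingerprints are held as integers and that the language model has already supplied the probabilities; the function names are hypothetical.

```python
import math
from typing import List


def hamming_distance(fp_a: int, fp_b: int, length: int) -> int:
    """Count the bit positions, out of `length`, where two fingerprints differ."""
    return sum(
        1 for k in range(length)
        if ((fp_a >> k) & 1) != ((fp_b >> k) & 1)  # indicator function
    )


def perplexity(conditional_probs: List[float]) -> float:
    """Exponential of the negative mean log-probability of the words."""
    n = len(conditional_probs)
    return math.exp(-sum(math.log(p) for p in conditional_probs) / n)
```

For example, the 4-bit fingerprints 1011 and 0010 differ in two positions, so their Hamming distance is 2, and a four-word text in which every word is predicted with probability 0.25 has a perplexity of 4.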
7. The data processing method for large language models according to claim 1, characterized in that the first message queue, the second message queue, and the third message queue correspond to different topics within the message queue; and the method further includes: continuously monitoring the message queue with multiple stateless worker nodes; and, when a worker node obtains a task message, calling the corresponding processing module to execute the specific business logic according to the task type, and updating the task status in real time during execution; wherein the worker nodes of the first message queue perform parsing tasks, the worker nodes of the second message queue perform filtering tasks, and the worker nodes of the third message queue perform classification tasks.
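The topic-per-stage worker arrangement of claim 7 may be sketched, under simplifying assumptions, with threads and in-process queues. A deployment would use a message queue service and separate worker processes; here `queue.Queue` stands in for each topic and a shared dictionary stands in for the task status store. All names, including the topic names and the status values, are hypothetical.

```python
import queue
import threading

TOPICS = ("parse", "filter", "classify")  # one topic per processing stage


def worker_loop(task_queue, handler, status):
    """Stateless worker: take a task, run the handler, record the task status."""
    while True:
        task = task_queue.get()
        if task is None:  # sentinel used by this sketch to stop a worker
            task_queue.task_done()
            return
        status[task["task_id"]] = "running"
        try:
            handler(task)
            status[task["task_id"]] = "succeeded"
        except Exception:
            status[task["task_id"]] = "failed"
        finally:
            task_queue.task_done()


def start_workers(queues, handlers, status, workers_per_topic=2):
    """Start a pool of workers for every topic and return the thread pools."""
    pools = {}
    for topic in TOPICS:
        pools[topic] = [
            threading.Thread(
                target=worker_loop,
                args=(queues[topic], handlers[topic], status),
                daemon=True,
            )
            for _ in range(workers_per_topic)
        ]
        for thread in pools[topic]:
            thread.start()
    return pools


def stop_workers(queues, pools):
    """Send one sentinel per worker, then wait for all workers to exit."""
    for topic, pool in pools.items():
        for _ in pool:
            queues[topic].put(None)
    for pool in pools.values():
        for thread in pool:
            thread.join()
```

Because a worker keeps no state between tasks, the number of workers per topic can be raised or lowered independently for parsing, filtering, and classification.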
8. The data processing method for large language models according to claim 1, characterized in that the data carrying the semantic tags in the classification layer is associated with the file entity storage carrier through a bidirectional interface, which supports batch export of, and efficient access to, large-scale datasets; and retrieving data from the classification layer based on the semantic tags according to the specified filtering conditions includes: retrieving data based on one of, or a combination of, the semantic tags, a time range, and a task identifier.
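The combined retrieval conditions of claim 8 may be illustrated with a simple in-memory selection. The sketch assumes that each classification layer record is a dictionary with `tags`, `created_at`, and `task_id` fields, that multiple tags are combined with AND semantics, and that the time range is half-open; these choices and the name `select_records` are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Optional


def select_records(
    records: Iterable[dict],
    tags: Optional[List[str]] = None,
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
    task_id: Optional[str] = None,
) -> List[dict]:
    """Return records matching every condition that was supplied."""
    selected = []
    for rec in records:
        if tags and not set(tags) <= set(rec["tags"]):
            continue  # record must carry all requested tags
        if start is not None and rec["created_at"] < start:
            continue
        if end is not None and rec["created_at"] >= end:
            continue  # range is [start, end)
        if task_id is not None and rec["task_id"] != task_id:
            continue
        selected.append(rec)
    return selected
```

Conditions left as `None` are ignored, so the same function covers retrieval by tags alone, by time range alone, by task identifier alone, or by any combination of them.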
9. A data processing system for large language models, comprising a processor, a memory, and a computer program or instructions stored in the memory, characterized in that the processor is configured to execute the computer program or instructions, and when the computer program or instructions are executed, the system implements the steps of the method according to any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program or instructions stored thereon, characterized in that, when the computer program or instructions are executed by a processor, the steps of the method according to any one of claims 1 to 8 are implemented.