Instruction fine-tuning dataset screening method, device and medium

By constructing a label graph for semantic space modeling, the problem of low efficiency in data filtering for instruction fine-tuning in existing technologies is solved, and efficient and accurate dataset filtering is achieved, selecting high-quality and diverse training data.

CN120492639B · Active · Publication Date: 2026-06-19 · SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANGHAI ARTIFICIAL INTELLIGENCE INNOVATION CENT
Filing Date
2025-03-28
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively filter out high-quality and diverse training data during instruction fine-tuning, and existing methods cannot accurately capture the semantics of complex instructions, resulting in low sampling efficiency and insufficient accuracy.

Method used

Semantic space modeling is performed by constructing a label graph. The relationships between labels are used as edge weights to calculate the information gain value of the data subset. The final instruction fine-tuning data subset is selected by maximizing the information gain value, taking into account the transmission of information on the label graph.

Benefits of technology

It significantly improves the sampling efficiency and accuracy of instruction fine-tuning datasets, enabling the selection of high-quality and diverse training data from large-scale datasets, balancing the diversity and quality of data selection.


Abstract

This invention relates to a method, device, and medium for filtering instruction fine-tuning datasets. The method includes: acquiring an original instruction fine-tuning dataset; constructing a label graph using labels as graph nodes and the relationships between labels as edge weights to perform semantic space modeling on the original instruction fine-tuning dataset; wherein the contribution of each data point in the original instruction fine-tuning dataset to the dataset's information content comes from the label of the corresponding data point; considering the transfer of information content on the label graph, calculating the information gain value of a data subset, and using the subset with the largest information gain value as the filtering objective to filter out the final instruction fine-tuning data subset from the original instruction fine-tuning dataset. Compared with existing technologies, this invention comprehensively considers the quality and diversity of the dataset in the semantic space, making dataset filtering more efficient and reliable.

Description

Technical Field

[0001] This invention relates to the field of natural language processing, and in particular to a method, device, and medium for filtering instruction fine-tuning datasets.

Background Technology

[0002] With the rapid development of natural language processing technology, large language models have broad application prospects in the field of natural language understanding. However, in instruction fine-tuning, how to select a high-quality and diverse subset as training data from a large-scale data pool has become an urgent problem.

[0003] Traditional instruction fine-tuning data filtering methods typically rely on predefined data quality evaluation criteria and on heuristic rules to maintain data diversity. However, this rule-based approach fails to comprehensively consider the overall quality and diversity of the dataset, and the heuristic rules are mostly implemented in the embedding space, making it difficult to accurately capture the semantics of complex instructions. Other methods based on submodular functions do define a way to evaluate dataset diversity, but they require calculating the embedding-space similarity between elements in every iteration, resulting in low sampling efficiency on large data pools and failing to meet the demands of efficient data processing.

[0004] A search revealed Chinese invention patent application CN118260429A, which discloses a method for optimizing a large language model fine-tuning dataset. This method scores and labels each sample record in a first sample library based on a first sample scoring model and a first sample labeling model. Then, it clusters all sample records in the first sample library based on the sample labels to obtain multiple first-class label record clusters. Using a preset data distribution index set as a reference, it constructs a fine-tuning dataset based on all the obtained first-class label record clusters and the first sample library. However, the construction of the sample scoring model and sample labeling model cannot incorporate more semantic and syntactic features, resulting in insufficient accuracy and stability of the model.

[0005] Therefore, how to effectively filter out high-quality and diverse instruction data from large datasets while considering the semantic information of the data is a problem that urgently needs to be solved.

Summary of the Invention

[0006] The purpose of this invention is to overcome the shortcomings of the prior art by providing a method, device, and medium for filtering instruction fine-tuning datasets.

[0007] The objective of this invention can be achieved through the following technical solutions:

[0008] According to a first aspect of the present invention, a method for filtering instruction fine-tuning datasets is provided, comprising:

[0009] Obtain the original instruction fine-tuning dataset;

[0010] Using labels as graph nodes and the relationships between labels as edge weights, a label graph is constructed to perform semantic space modeling on the original instruction fine-tuning dataset; wherein, the contribution of each data point in the original instruction fine-tuning dataset to the dataset's information content comes from the label of the corresponding data point;

[0011] Considering the transmission of information on the label graph, the information gain value of the data subset is calculated. The subset with the largest information gain value is selected as the filtering target to filter out the final instruction fine-tuning data subset from the original instruction fine-tuning dataset.

[0012] Preferably, the edge weights between the labels are represented by semantic similarity. When the semantic similarity between the labels is lower than a preset threshold, the edge weights between the labels are reset to 0.

[0013] Preferably, in the original instruction fine-tuning dataset, the i-th data item is represented as:

[0014]

[0015] In the formula: the i-th data item contains M rounds of dialogue data, where the j-th round consists of a query and the response $r_i^j$; $L_i$ denotes the list of labels corresponding to the i-th data item; and $s_i$ denotes the score of the i-th data item.

[0016] Preferably, the information gain value of the data subset is calculated using the following expression:

[0017]

[0018] $E_i = s_i v_i$

[0019] In the formula: $D$ is a data subset; $\Phi$ is the information gain function, which is monotonically increasing with a decreasing rate of increase; $A$ is the transfer matrix, whose element $a_{pq}$ represents the amount of information transferred between the p-th and q-th labels, calculated from the edge weights between the labels; $E_i$ is the information content of the i-th data item; $s_i$ is the score of the i-th data item; and $v_i$ is the label vector of the i-th data item.

[0020] Preferably, the information gain function Φ is a monotonically increasing function with a decreasing rate of increase.

[0021] Preferably, the information gain function Φ is expressed mathematically as follows:

[0022] $\Phi(x) = x^{a},\ 0 < a < 1$

[0023] Alternatively, the information gain function Φ can be expressed mathematically as follows:

[0024] $\Phi(x) = -e^{-x}$

[0025] In the formula: x is the input data.

[0026] Preferably, the label vector $v_i$ of the i-th data item is calculated using the following expression:

[0027]

[0028] In the formula: the k-th component of $v_i$ is the information content corresponding to the k-th word in the i-th data item; $\sigma$ is the activation function; $L_i$ is the list of labels corresponding to the i-th data item; $l_k$ is the label corresponding to the k-th word in the i-th data item; and $K$ is the number of words in the i-th data item.

[0029] Preferably, the element $a_{pq}$ of the transfer matrix $A$ represents the amount of information transferred between the p-th and q-th labels and is calculated from the edge weights between the labels using the following expression:

[0030]

[0031] In the formula: $\alpha$ is the transfer parameter; $\omega_p$ is the self-similarity of the p-th label, which can be set to a constant value of 1.

[0032] According to a second aspect of the present invention, an electronic device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the program to implement any of the methods described above.

[0033] According to a third aspect of the invention, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements any of the methods described herein.

[0034] Compared with the prior art, the present invention has the following beneficial effects:

[0035] (1) This invention constructs a label graph and performs semantic space modeling on the original instruction fine-tuning dataset. It quantifies the quality and diversity of the dataset in the semantic space and maximizes the information gain of the current label graph during each screening. It does not require calculating the similarity of the embedding space between each element in each iteration of screening, which significantly improves the sampling efficiency on large-scale instruction fine-tuning datasets. It obtains the information gain of the dataset in the semantic space, avoiding the problem that the heuristic rules in the embedding space in the prior art cannot accurately capture the semantics of complex instructions, and improves the accuracy of instruction fine-tuning dataset screening.

[0036] (2) Data labels are used to characterize the contribution of each data in the original instruction fine-tuning dataset to the amount of information in the dataset. The relationship between different labels is considered, and the information is defined to be transmitted along the edge of the label graph. This effectively depicts the distribution of information on the label graph and improves the quality and reliability of the instruction fine-tuning dataset screening.

[0037] (3) In this invention, the information gain function is monotonically increasing with a decreasing rate of increase, which effectively balances the diversity and quality of data selection. If the derivative of the information gain function decreases too quickly, the selection focuses on diversity.

Attached Figure Description

[0038] Figure 1 is a flowchart of the method of the present invention;

[0039] Figure 2 is a schematic diagram of the semantic space modeling structure.

Detailed Implementation

[0040] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0041] Example 1

[0042] As shown in Figure 1, this embodiment provides a method for filtering instruction fine-tuning datasets based on maximizing information gain. The method includes:

[0043] S1. Obtain the original instruction fine-tuning dataset;

[0044] The i-th data in the original instruction fine-tuning dataset is represented as:

[0045]

[0046] In the formula: the i-th data item contains M rounds of dialogue data, where the j-th round consists of a query and the response $r_i^j$; $L_i$ denotes the list of labels corresponding to the i-th data item; and $s_i$ denotes the score of the i-th data item, which can simply be set to 1. To make the score proportional to data quality, the IFD score or the DEITA score can be used instead, with DEITA being the preferred option.
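The record structure described above can be sketched as a small Python data class. The class name `DataItem` and its field names are illustrative assumptions rather than names from the patent; only the default score of 1 follows the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DataItem:
    """One instruction fine-tuning sample (hypothetical container)."""
    rounds: List[Tuple[str, str]]  # (query, response) for each of the M dialogue rounds
    labels: List[str]              # label list L_i
    score: float = 1.0             # score s_i; 1.0 is the simple default named in the text

    @property
    def num_rounds(self) -> int:
        """M, the number of dialogue rounds in this item."""
        return len(self.rounds)


# Made-up example item; the score could equally come from IFD or DEITA.
item = DataItem(
    rounds=[("Write a short story about a fox.", "Once upon a time ...")],
    labels=["storytelling", "writing"],
    score=0.9,
)
```

Keeping the score as a plain float lets any quality scorer be plugged in without changing the rest of the pipeline.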

[0047] S2. Construct a label graph using labels as graph nodes and the relationships between labels as edge weights, as shown in Figure 2, to perform semantic space modeling on the original instruction fine-tuning dataset; the contribution of each data point in the original instruction fine-tuning dataset to the information content of the dataset comes from the label of the corresponding data point.

[0048] In this embodiment, the edge weights between labels are represented by semantic similarity. When the semantic similarity between two labels is lower than a preset threshold, the edge weight between them is reset to 0. Specifically, the edge weight $\omega_{pq}$ between label p and label q is:

[0049] $\omega_{pq} = \sigma[\omega(l_p, l_q) \ge T]\,\omega(l_p, l_q)$

[0050] In the formula: $l_p$ and $l_q$ represent the p-th and q-th labels in the label list, respectively; $\omega(l_p, l_q)$ represents the semantic similarity between the p-th and q-th labels, for which similarity functions based on Word2Vec or BERT can be used; $T$ is the semantic similarity threshold, which can be set to 0.6. Consistent with the reset rule above, $\sigma[\cdot]$ equals 1 when the condition holds and 0 otherwise.
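As a rough illustration of the thresholded edge weights, the Python sketch below uses cosine similarity over made-up label embeddings as a stand-in for the semantic similarity function. The embedding values, the function names, and the choice of cosine similarity are assumptions; the text only says that Word2Vec- or BERT-style similarity can be used.

```python
import math
from typing import List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def edge_weights(embeddings: List[List[float]], threshold: float = 0.6) -> List[List[float]]:
    """Off-diagonal weights w[p][q] = sim if sim >= threshold, else 0."""
    n = len(embeddings)
    w = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            if p == q:
                continue  # self-similarity is handled separately in the text
            sim = cosine(embeddings[p], embeddings[q])
            w[p][q] = sim if sim >= threshold else 0.0
    return w


# Toy 2-D label embeddings (invented for this example).
label_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
W = edge_weights(label_vecs, threshold=0.6)
```

Here the first two labels stay connected while the third is isolated, because its similarity to the others falls below the threshold.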

[0051] For the label graph: the information content of the entire dataset is the sum of the information content of each label; the contribution of each data point to the information content comes from the label of that data point, and the value of this information content is positively correlated with the quality of that data point. For example, if the label of a data point is [storytelling, writing] and the score is 0.9, then its contribution to the "storytelling" label is 0.9, and its contribution to the "writing" label is also 0.9, with a total contribution of 1.8; in order to better characterize the distribution of information content on the label graph, the relationship between different labels is considered, and the information content is defined to be transmitted along the edges of the label graph.
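The worked example in this paragraph can be reproduced with a few lines of Python. This is a minimal sketch of the per-label accumulation before any transfer along edges; the function name `label_information` and the input layout are assumptions made for illustration.

```python
from collections import defaultdict


def label_information(items):
    """Add each item's score to every label it carries (no edge transfer yet).

    items: iterable of (labels, score) pairs.
    """
    totals = defaultdict(float)
    for labels, score in items:
        for label in labels:
            totals[label] += score
    return dict(totals)


# The example from the text: labels [storytelling, writing], score 0.9.
info = label_information([(["storytelling", "writing"], 0.9)])
total = sum(info.values())
```

Each of the two labels receives 0.9, and the total contribution of the item is 1.8, matching the figures in the paragraph.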

[0052] For the original instruction fine-tuning dataset $D_P$, given a selection upper limit $N$ and an information gain function $E$, the goal is to select a subset $D_S$ that maximizes $E(D)$:

[0053] $D_S = \arg\max_{D \subseteq D_P,\ |D| \le N} E(D)$

[0054] Information gain function E(D) of the data subset:

[0055]

[0056] $E_i = s_i v_i$

[0057] In the formula: $D$ is a data subset; $\Phi$ is the information gain function, which is monotonically increasing with a decreasing rate of increase; $A$ is the transfer matrix, whose element $a_{pq}$ represents the amount of information transferred between the p-th and q-th labels, calculated from the edge weights between the labels; $E_i$ is the information content of the i-th data item; $s_i$ is the score of the i-th data item; and $v_i$ is the label vector of the i-th data item.
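The expression for E(D) itself is not reproduced in this text, so the Python sketch below shows only one plausible reading of the definitions: sum the per-item information s_i·v_i over the subset, propagate it through the transfer matrix A, apply Φ to each label's total, and add the results. That reduction, the function name, and the use of the square root as Φ (an instance of x^a with a = 0.5) are all assumptions.

```python
import math


def information_gain(subset, A, phi=math.sqrt):
    """Assumed form: E(D) = sum_p phi( [A * sum_i s_i v_i]_p ).

    subset: list of (s_i, v_i) pairs; A: square transfer matrix as a list of rows.
    """
    n = len(A)
    # Sum of E_i = s_i * v_i over the subset, one entry per label.
    raw = [0.0] * n
    for score, v in subset:
        for p in range(n):
            raw[p] += score * v[p]
    # Propagate information along label-graph edges (matrix-vector product).
    propagated = [sum(A[p][q] * raw[q] for q in range(n)) for p in range(n)]
    # Assumed reduction: apply phi per label and add up.
    return sum(phi(x) for x in propagated)


identity = [[1.0, 0.0], [0.0, 1.0]]  # no transfer between the two labels
one = information_gain([(0.9, [1, 1])], identity)
two = information_gain([(0.9, [1, 1]), (0.9, [1, 1])], identity)
```

Under this reading, adding a second identical item raises the gain but by less than the first item did, which is the diminishing-returns behavior that the definition of Φ calls for.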

[0058] Specifically, the element $a_{pq}$ of the transfer matrix $A$ represents the amount of information transferred between the p-th and q-th labels and is calculated from the edge weights between the labels using the following expression:

[0059]

[0060] In the formula: $\alpha$ is the transfer parameter, where $\alpha = 0$ means no transfer and larger values mean more transfer; in this embodiment, the transfer parameter is set to 1 for the best effect. $\omega_p$ is the self-similarity of the p-th label, which is set to a constant value of 1 in this embodiment.
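The exact expression for the transfer matrix elements is not reproduced in this text. The Python sketch below therefore assumes a row-normalized form, a_pp = ω_p / (ω_p + α Σ ω_pk) and a_pq = α ω_pq / (ω_p + α Σ ω_pk) for p ≠ q, chosen only because it matches the stated behavior: α = 0 gives no transfer, and larger α gives more. The formula, function name, and parameter names are assumptions, not the patent's definition.

```python
def transfer_matrix(W, alpha=1.0, self_weight=1.0):
    """Hypothetical row-normalized transfer matrix built from edge weights W.

    W: square matrix of thresholded edge weights with a zero diagonal.
    alpha: transfer parameter; self_weight: self-similarity of each label.
    """
    n = len(W)
    A = [[0.0] * n for _ in range(n)]
    for p in range(n):
        neighbours = sum(W[p][q] for q in range(n) if q != p)
        norm = self_weight + alpha * neighbours
        for q in range(n):
            if p == q:
                A[p][q] = self_weight / norm
            else:
                A[p][q] = alpha * W[p][q] / norm
    return A


W = [[0.0, 0.8], [0.8, 0.0]]
no_transfer = transfer_matrix(W, alpha=0.0)
with_transfer = transfer_matrix(W, alpha=1.0)
```

With α = 0 the result is the identity matrix, so each label keeps all of its own information; with α = 1 a share proportional to the edge weight flows to the neighboring label.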

[0061] Specifically, the label vector $v_i$ of the i-th data item is:

[0062]

[0063] In the formula: the k-th component of $v_i$ is the information content corresponding to the k-th word in the i-th data item; $\sigma$ is the activation function; $L_i$ is the list of labels corresponding to the i-th data item; $l_k$ is the label corresponding to the k-th word in the i-th data item; and $K$ is the number of words in the i-th data item. For example, $v_1 = (1,0,1,0,0,0)$ means that the first data item has two labels, the first and the third.
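The example vector (1,0,1,0,0,0) suggests a binary membership vector over the label set. The Python sketch below builds such a vector under that reading; the six label names and the function name `label_vector` are invented for illustration.

```python
def label_vector(item_labels, vocabulary):
    """Return 1 at every position whose label appears in item_labels, else 0."""
    present = set(item_labels)
    return [1 if label in present else 0 for label in vocabulary]


# Hypothetical ordered label set with six labels.
vocabulary = ["storytelling", "math", "writing", "coding", "translation", "reasoning"]
v1 = label_vector(["storytelling", "writing"], vocabulary)
print(v1)  # [1, 0, 1, 0, 0, 0]
```

The item carries the first and third labels of the hypothetical vocabulary, so its vector matches the example given in the text.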

[0064] S3. Considering the transfer of information on the label graph, calculate the information gain value of the instruction fine-tuning data subset. Using the subset with the largest information gain value as the selection target, filter out the final instruction fine-tuning data subset from the original instruction fine-tuning dataset. The filtered subset can be used to train large models for knowledge question answering, mathematical ability, and logical reasoning.
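The text states the selection objective but not the search procedure. One common way to approximate such a maximization is greedy selection, sketched below in Python: at each step, add the item that yields the largest information gain for the enlarged subset. The greedy strategy, the assumed form of the gain (Φ applied per label after propagation, then summed), and all names here are assumptions rather than the patented procedure.

```python
import math


def gain(total, A, phi):
    """Assumed gain: sum over labels of phi applied to the propagated information."""
    n = len(A)
    return sum(phi(sum(A[p][q] * total[q] for q in range(n))) for p in range(n))


def greedy_select(items, A, budget, phi=math.sqrt):
    """items: list of (score, label_vector). Returns indices in selection order."""
    n = len(A)
    total = [0.0] * n          # running sum of s_i * v_i for the chosen items
    chosen = []
    remaining = set(range(len(items)))
    while remaining and len(chosen) < budget:
        best, best_value = None, None
        for i in sorted(remaining):
            score, v = items[i]
            candidate = [total[p] + score * v[p] for p in range(n)]
            value = gain(candidate, A, phi)
            if best_value is None or value > best_value:
                best, best_value = i, value
        score, v = items[best]
        total = [total[p] + score * v[p] for p in range(n)]
        chosen.append(best)
        remaining.remove(best)
    return chosen


identity = [[1.0, 0.0], [0.0, 1.0]]
items = [(0.9, [1, 0]), (0.8, [1, 0]), (0.5, [0, 1])]
selected = greedy_select(items, identity, budget=2)
print(selected)  # [0, 2]
```

In this toy run the second pick is the lower-scored item from an uncovered label rather than the higher-scored item from an already covered one, which illustrates how a concave Φ trades quality against diversity.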

[0065] This invention models the semantic space by constructing a label graph, taking into account both the quality and diversity of the dataset. This overcomes the shortcomings of existing technologies that only define data quality evaluation criteria or use heuristic rules to maintain diversity, and can more accurately evaluate the dataset as a whole.

[0066] This embodiment was validated on Tulu3. The original data pool contained 939K data points, and 50K training data points were sampled. The base model used for training was Llama3.1-8B.

[0067] In Table 1 below, HE is an abbreviation for HumanEval, AE denotes AlpacaEval v2, MT denotes MT-Bench, Wild denotes WildBench, and Avg is the average of the corresponding benchmark scores after normalization to percentages.

[0068] Table 1

[0069]

[0070]

[0071] This invention achieves the best results on multiple benchmarks. Using only about 5% of the data (50K of the 939K data points), it matches the multi-benchmark average obtained by training on the full data pool. It can select high-quality and diverse training data for instruction fine-tuning tasks and achieves the best results compared with other data selection methods, showing that the proposed data selection method is effective and robust.

[0072] The electronic device of this invention includes a central processing unit (CPU), which can perform various appropriate actions and processes according to computer program instructions stored in read-only memory (ROM) or loaded from a storage unit into random access memory (RAM). The RAM may also store various programs and data required for device operation. The CPU, ROM, and RAM are interconnected via a bus. Input/output (I/O) interfaces are also connected to the bus.

[0073] Multiple components in the device are connected to the I/O interface, including: input units such as keyboards and mice; output units such as various types of displays and speakers; storage units such as disks and optical discs; and communication units such as network interface cards (NICs), modems, and wireless transceivers. The communication unit allows the device to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

[0074] The processing unit executes the various methods and processes described above, such as methods S1 to S3. For example, in some embodiments, methods S1 to S3 may be implemented as computer software programs tangibly contained in a machine-readable medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed on the device via ROM and/or a communication unit. When the computer program is loaded into RAM and executed by the CPU, one or more steps of methods S1 to S3 described above may be performed. Alternatively, in other embodiments, the CPU may be configured to execute methods S1 to S3 by any other suitable means (e.g., by means of firmware).

[0075] The functions described above in this document can be performed at least in part by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: field programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), and so on.

[0076] The program code used to implement the methods of the present invention can be written in any combination of one or more programming languages. This program code can be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing device, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code can be executed entirely on the machine, partially on the machine, as a standalone software package partially on the machine and partially on a remote machine, or entirely on a remote machine or server.

[0077] In the context of this invention, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. Machine-readable media can include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0078] Example 2

[0079] The difference between this embodiment and Embodiment 1 is that the information gain function Φ is a monotonically increasing function with a decreasing rate of increase, in order to balance the diversity and quality of data selection. If the derivative of the information gain function decreases too quickly, the focus is on the diversity of selection.

[0080] Specifically, the information gain function Φ has the following mathematical expression:

[0081] $\Phi(x) = x^{a},\ 0 < a < 1$

[0082] Alternatively, the information gain function Φ can be expressed mathematically as follows:

[0083] $\Phi(x) = -e^{-x}$

[0084] In the formula: x is the input data.
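Both candidate functions are increasing with a shrinking rate of increase, which can be checked numerically. The Python sketch below prints nothing and simply computes successive marginal gains; the exponent a = 0.5, the step size, and the function names are illustrative choices, not values from the patent.

```python
import math


def phi_power(x, a=0.5):
    """Power form x**a with 0 < a < 1 (a = 0.5 chosen only for illustration)."""
    return x ** a


def phi_exp(x):
    """Exponential form -e^(-x)."""
    return -math.exp(-x)


def marginal_gains(phi, step=1.0, n=4):
    """Gain from each successive increment of size `step`, starting at 0."""
    return [phi((k + 1) * step) - phi(k * step) for k in range(n)]


power_gains = marginal_gains(phi_power)
exp_gains = marginal_gains(phi_exp)
```

In both lists every entry is positive and smaller than the one before it; the exponential form shrinks much faster, so under the reasoning in this embodiment it would push the selection more strongly toward diversity.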

[0085] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present invention, and these modifications or substitutions should all be covered within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for filtering instruction fine-tuning datasets, characterized in that it comprises: obtaining the original instruction fine-tuning dataset; constructing a label graph, using labels as graph nodes and the relationships between labels as edge weights, to perform semantic space modeling on the original instruction fine-tuning dataset, wherein the contribution of each data point in the original instruction fine-tuning dataset to the dataset's information content comes from the label corresponding to that data point; considering the transmission of information on the label graph, calculating the information gain value of a data subset, and using the data subset with the largest information gain value as the filtering target to filter out the final instruction fine-tuning data subset from the original instruction fine-tuning dataset; wherein the information gain value of the data subset is calculated using the following expression: In the formula: $D$ is a data subset; $\Phi$ is the information gain function, which is monotonically increasing with a decreasing rate of increase; $A$ is the transfer matrix, whose element $a_{pq}$ represents the amount of information transferred between the p-th and q-th labels, calculated from the edge weights between the labels; $E_i$ is the information content of the i-th data item; $s_i$ is the score of the i-th data item; and $v_i$ is the label vector of the i-th data item.

2. The instruction fine-tuning dataset filtering method according to claim 1, characterized in that the edge weights between the labels are represented by semantic similarity, and when the semantic similarity between two labels is lower than a preset threshold, the edge weight between those labels is reset to 0.

3. The instruction fine-tuning dataset filtering method according to claim 1, characterized in that, in the original instruction fine-tuning dataset, the i-th data item is represented as follows: In the formula: the i-th data item contains M rounds of dialogue data, where the j-th round consists of a query and the response $r_i^j$; $L_i$ denotes the list of labels corresponding to the i-th data item; and $s_i$ denotes the score of the i-th data item.

4. The instruction fine-tuning dataset filtering method according to claim 1, characterized in that the mathematical expression of the information gain function $\Phi$ is: $\Phi(x) = x^{a},\ 0 < a < 1$; or the mathematical expression of the information gain function $\Phi$ is: $\Phi(x) = -e^{-x}$; in the formula: $x$ is the input data.

5. The instruction fine-tuning dataset filtering method according to claim 1, characterized in that the label vector $v_i$ of the i-th data item is calculated using the following expression: In the formula: the k-th component of $v_i$ is the information content corresponding to the k-th word in the i-th data item; $\sigma$ is the activation function; $L_i$ is the list of labels corresponding to the i-th data item; $l_k$ is the label corresponding to the k-th word in the i-th data item; and $K$ is the number of words in the i-th data item.

6. The instruction fine-tuning dataset filtering method according to claim 1, characterized in that the element $a_{pq}$ of the transfer matrix $A$ represents the amount of information transferred between the p-th and q-th labels and is calculated from the edge weights between the labels using the following expression: In the formula: $\alpha$ is the transfer parameter; $\omega_p$ is the self-similarity of the p-th label, which is a constant value.

7. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the program, it implements the method according to any one of claims 1 to 6.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method according to any one of claims 1 to 6.