A code word construction method based on reinforcement learning
By constructing QC-LDPC codewords through reinforcement learning, the problem of complex construction and inability to adapt to channel changes in existing LDPC codes is solved, achieving codeword flexibility and performance improvement, especially in terms of resistance to burst errors.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PEKING UNIV
- Filing Date
- 2022-08-04
- Publication Date
- 2026-06-19
Figure CN115940963B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the fields of communication and machine learning. It uses a machine learning method to construct quasi-cyclic low-density parity-check (QC-LDPC) codes for communication systems, yielding a low-complexity QC-LDPC codeword construction method that also improves the performance of the constructed codewords.
Background Technology
[0002] Low-density parity-check (LDPC) codes were first proposed by Gallager and were re-examined by D. MacKay, R. Neal, and others in 1996. LDPC codes are linear block codes based on sparse parity-check matrices. Their performance approaches the Shannon limit, their encoding and decoding complexity is low, and they are relatively simple to implement in hardware, so they are regarded as good codes with strong error-correction performance. A large body of current research focuses on the construction, encoding, decoding, and applications of LDPC codes.
[0003] Burst errors are runs of errors that occur in succession during data transmission. They often arise from faulty transmission lines, relay malfunctions, or lightning interference. Such errors are correlated: the occurrence of one error affects the likelihood of an error in the next symbol. This invention aims to construct LDPC codewords that improve the resilience of communication systems against burst errors.
[0004] Traditional LDPC construction methods are complex and involve cumbersome steps. Furthermore, they cannot generate optimal codewords tailored to the channel characteristics of different communication systems, and it is unknown whether existing codewords will perform well on new channels.
Summary of the Invention
[0005] To address the problems in the prior art, the present invention provides a QC-LDPC codeword construction method based on reinforcement learning. The method uses reinforcement learning to construct QC-LDPC codewords, incorporating channel information during construction and using channel characteristics to generate optimal codewords suited to different channels. This makes codeword construction simple, easy to understand, flexible, and adaptable to diverse needs. The method can construct not only codewords with strong resistance to burst errors but also codewords with strong resistance to random errors.
[0006] The technical solution of this invention is as follows:
[0007] A codeword construction method based on reinforcement learning, comprising the following steps:
[0008] 1) Randomly initialize the index value Init_H_block of a parity check matrix H used to generate low-density parity check codes, and randomly initialize the neural network; where θ is the parameter of the neural network, which is used to determine the next action a based on the current state s;
[0009] 2) The neural network calculates the output mean matrix μ based on the initial state s0 and Init_H_block;
[0010] 3) Based on the reinforcement learning method, the neural network samples N possible trajectories of the agent from the current state s to the next action a: the action $a_n$ corresponding to the nth trajectory is sampled from the normal distribution $N(\mu, \sigma^2)$; the next state of the nth trajectory is determined by action $a_n$ and the initial state s0; the check matrix $H_n$ corresponding to the nth trajectory is determined by that next state and the matrix Init_H_block; the reward $G_n$ of the nth trajectory is calculated from the check matrix $H_n$; the neural network then calculates the expected return from μ, $a_n$, and $G_n$; where n = 1, 2, 3, ..., N, N is the codeword length, and $\sigma^2$ is the degree of dispersion of the sampled trajectories;
[0011] 4) Take the derivative of the expected return to obtain the gradient $\Delta\theta$, and optimize and update the parameters θ of the neural network using the gradient ascent method; if the parameters θ converge, the neural network calculates the updated output mean matrix μ′ based on s0, action a is sampled from the normal distribution $N(\mu', \sigma^2)$, the next state s1 is determined by action a and the initial state s0, and the parity check matrix H is determined by s1 and Init_H_block and then output;
[0012] 5) If the parameter θ does not converge, repeat steps 2) to 4);
[0013] 6) Use the parity check matrix H output in step 4) as the parity check matrix of the corresponding low-density parity check code LDPC.
[0014] Furthermore, the neural network is a multilayer perceptron neural network $\mathrm{net}_\theta$.
[0015] Furthermore, the neural network is a continuously differentiable function $\pi_\theta(a|s)$.
[0016] Furthermore, the gradient ascent method is used to update the parameters θ so as to maximize the objective function; in the objective function, τ is the trajectory, $p_\theta(\tau)$ is the probability of generating trajectory τ, G(τ) is the reward for trajectory τ, $r_{t+1}$ is the reward resulting from the transition from state $s_t$ at time t to state $s_{t+1}$ at time t+1, and T is the trajectory termination time.
[0017] Furthermore, the expected return is computed over the sampled trajectories, where τ is the trajectory, $p_\theta(\tau)$ is the probability of generating trajectory τ, and G(τ) is the reward for trajectory τ; the expression is written in terms of the action of the nth trajectory at time t, the state of the nth trajectory at time t, the portion of the nth trajectory from time t to time T, and the total reward generated by that portion of the trajectory, where T is the trajectory termination time.
[0018] A server is characterized by comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program including instructions for performing the steps of the methods described above.
[0019] A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the above-described method.
[0020] This invention mainly includes the following:
[0021] 1. Reinforcement learning
[0022] Reinforcement learning is a process in which an intelligent agent learns new policies to maximize rewards through interaction with its environment. It can be regarded as a special type of supervised learning that does not require manual labeling of data: it uses a policy to determine the current "correct" label and updates the policy to find the "optimal" label.
[0023] The basic elements of reinforcement learning are briefly described first.
[0024] (1) State s: a description of the environment, which can be continuous or discrete.
[0025] (2) Action a: a description of the agent's behavior, determined by the policy; it can be continuous or discrete.
[0026] (3) Policy π(a|s): a function that determines the next action a based on the current state s.
[0027] (4) State transition probability p(s1|s,a): the probability that the next state becomes s1 after action a is taken in state s. The interaction between the agent and the environment can therefore be regarded as a Markov decision process.
[0028] (5) Reward r(s1,a,s): the reward given after action a is performed in state s.
[0029] Typically, the interaction between an agent and its environment involves multiple state transitions, and this process can be described by a trajectory.
[0030] $\tau = s_0, a_0, s_1, r_1, a_1, s_2, r_2, a_2, s_3, r_3, a_3, s_4, r_4, \ldots, s_{T-1}, a_{T-1}, s_T, r_T$
[0031] The total reward for this trajectory is The probability of generating this trajectory is
[0032]
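The two formulas referred to in the preceding paragraph are not reproduced in this text. As a hedged reconstruction, assuming the standard undiscounted definitions that match the trajectory notation above (the use of no discount factor and the initial-state distribution $p(s_0)$ are assumptions, not taken from the source), they would read:

```latex
G(\tau) = \sum_{t=0}^{T-1} r_{t+1},
\qquad
p(\tau) = p(s_0) \prod_{t=0}^{T-1} \pi_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)
```

In the single-step setting used later for codeword construction (τ = s0, a0, s1, r1), the sum reduces to the single reward r1.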
[0033] Because both the state and the policy are random to some degree, the interaction trajectory generated each time will differ, and so will the total reward. The goal of reinforcement learning is therefore to learn a policy $\pi_\theta(a|s)$ that maximizes the expected return. The objective function for learning is:
[0034]
[0035] Where θ is the parameter of the policy function.
[0036] There are many methods for learning policies; this invention uses the policy gradient reinforcement learning method. Assume the policy function $\pi_\theta(a|s)$ is a continuously differentiable function. The gradient ascent method is used to optimize the parameters θ so as to maximize the objective function. The derivative of the objective function with respect to the policy parameters is given by the following formula:
[0037]
[0038] To find the expected return of a policy exactly, all trajectories of that policy would have to be enumerated, which is often impractical in high-dimensional action spaces and for long trajectories. The expected return can instead be approximated by sampling: based on the current policy $\pi_\theta(a|s)$, multiple trajectories $\tau_1, \tau_2, \tau_3, \ldots, \tau_N$ are collected through random walks. The expected return of this policy can then be approximated as:
[0039]
[0040] The gradient of this policy can then be written as
[0041]
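As an illustration of the sampled policy gradient, the following minimal sketch assumes a single-step trajectory and a Gaussian policy $a \sim N(\mu, \sigma^2)$ with fixed σ, for which $\nabla_\mu \log N(a; \mu, \sigma^2) = (a - \mu)/\sigma^2$. It estimates the gradient with respect to the mean μ only; the gradient with respect to the network parameters θ would follow by the chain rule through μ = net_θ(s0). The function name `policy_gradient_wrt_mean`, the parameter `reward_fn`, and the toy reward are hypothetical and not taken from the source.

```python
import numpy as np

def policy_gradient_wrt_mean(mu, sigma, reward_fn, num_samples, rng):
    """Monte Carlo (score-function) estimate of d E[G] / d mu for a ~ N(mu, sigma^2)."""
    grad = np.zeros_like(mu, dtype=float)
    for _ in range(num_samples):
        a = rng.normal(mu, sigma)            # one single-step trajectory: sample an action
        g = reward_fn(a)                     # total reward G of that trajectory
        grad += g * (a - mu) / sigma**2      # G * grad_mu log N(a; mu, sigma^2)
    return grad / num_samples                # average over the sampled trajectories

rng = np.random.default_rng(0)
mu = np.zeros(4)
# Toy reward (an assumption for illustration): G = sum of the action entries,
# whose exact gradient with respect to every entry of mu is 1.
grad = policy_gradient_wrt_mean(mu, 1.0, lambda a: float(a.sum()), 20000, rng)
print(np.round(grad, 2))
```

With enough samples the printed estimate should be close to 1 in every entry, which is the exact gradient for this toy reward.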
[0042] 2. A method for designing QC-LDPC codes using reinforcement learning
[0043] The implementation of reinforcement learning mainly consists of two modules: the environment and the agent. All the operations mentioned above are interactions between them. Correspondingly, the implementation of updating LDPC codewords is mainly divided into two modules: the encoder (Constructor) and the decoder (Decoder).
[0044] A QC-LDPC code is described by a parity check matrix composed of zero matrices and cyclic shift matrices. The parity check matrix H is an M×N matrix formed as an m×n array of z×z blocks $P^{b_{i,j}}$, where $b_{i,j}$ is an integer from -1 to z-1. When $b_{i,j}$ equals -1, $P^{-1}$ is the zero matrix 0; when $b_{i,j}$ equals 0, $P^{0}$ is the identity matrix; for other values of $b_{i,j}$, $P^{b_{i,j}}$ is the identity matrix cyclically shifted to the right by $b_{i,j}$ positions. The parity check matrix H can be divided into two parts: the matrix Hs (m×m), corresponding to the information bits, and the matrix Hp (m×(n-m)), corresponding to the parity bits; M = m × z, N = n × z. The index values in matrix H, each incremented by 1, are represented by the matrix H_block.
[0045]
[0046] The positions in H_block whose value is 0 represent the zero matrix $P^{-1}$; constructing the parity check matrix H is therefore the process of determining the H_block matrix.
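The relationship between H_block and the full parity check matrix can be sketched as follows. This is a minimal illustration, assuming the convention stated above (an H_block entry of 0 is a zero block, an entry k ≥ 1 is the identity shifted right by k - 1). The function name `expand_h_block` is hypothetical, and the random H_block is only a placeholder with the block dimensions implied by the 504×1008, z = 63 example given later in the description.

```python
import numpy as np

def expand_h_block(h_block, z):
    """Expand an m x n H_block (0 = zero block, k >= 1 = shift by k - 1) into H."""
    m, n = h_block.shape
    H = np.zeros((m * z, n * z), dtype=np.uint8)
    identity = np.eye(z, dtype=np.uint8)
    for i in range(m):
        for j in range(n):
            shift = int(h_block[i, j]) - 1           # b_ij = H_block entry minus 1
            if shift >= 0:
                # Cyclic right shift: row r gets its 1 in column (r + shift) % z
                H[i*z:(i+1)*z, j*z:(j+1)*z] = np.roll(identity, shift, axis=1)
    return H

rng = np.random.default_rng(0)
z, m, n = 63, 8, 16                                  # 504 = 8 * 63, 1008 = 16 * 63
h_block = rng.integers(0, z + 1, size=(m, n))        # placeholder index matrix
H = expand_h_block(h_block, z)
print(H.shape)   # (504, 1008)
```

Each nonzero H_block entry contributes exactly z ones to H, so the density of H is controlled directly by how many entries of H_block are set to 0.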
[0047] The H_block matrix can be determined as H_block = Init_H_block.*$s_t$. Init_H_block is the initial H_block matrix, an m×n integer matrix drawn from the uniform integer distribution U(0,z); $s_t$ is an m×n binary matrix. The element-wise product sets to 0 every position of Init_H_block at which $s_t$ equals 0, yielding a new matrix H_block. By continually adjusting the distribution of 0s and 1s in $s_t$, the optimal matrix H_block_best is obtained.
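The masking step can be illustrated with a small example. The sizes and the mask below are arbitrary, and reading U(0,z) as the integers 0 to z inclusive is an assumption; the source does not state whether the endpoints are included.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, z = 2, 4, 5                                    # toy sizes for illustration

# Assumption: U(0, z) is read as integers 0..z inclusive.
init_h_block = rng.integers(0, z + 1, size=(m, n))

# A binary state s_t: 1 keeps the entry, 0 turns the block into a zero matrix.
s_t = np.array([[1, 0, 1, 1],
                [0, 1, 1, 0]])

h_block = init_h_block * s_t                         # element-wise product (".*" in the text)
print(init_h_block)
print(h_block)
```

Only the positions where the state is 1 keep their shift index, so the state acts purely as a sparsity pattern over a fixed random index matrix.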
[0048] The adjustment of $s_t$ can be viewed as an interaction process in reinforcement learning, which yields the interaction trajectory:
[0049] $\tau_s$ = s0, a0, s1, r1.
[0050] Here s0 is the initial state, an m×n matrix of all ones. The agent executes action a0, transitioning from state s0 to the next state s1, which is an m×n binary matrix (each element is either 0 or 1). The total reward for this trajectory is then $G(\tau_s)$ = r1.
[0051] Encoder (Constructor): where codewords are constructed and the policy is updated.
[0052] Based on the current state s0 and the current policy, the next state s1 can be determined. This determines H_block (H_block = Init_H_block.*s1) and hence the parity check matrix H1, which completes the construction of the parity check matrix.
[0053] Decoder: where the policy is evaluated and the total reward G is fed back to the encoder.
[0054] A burst deletion (erasure) channel is a channel in which every transmitted codeword suffers one random burst of deletions of length L bits. That is, for a codeword of length N, L consecutive bits at a random position within the codeword are deleted, while the rest are received without error. This channel is denoted $SBE_L$, where L is the number of bits deleted consecutively in the burst deletion channel.
[0055] The bit error rate is E = ldpc_decoder(H, $SBE_L$): the codeword is transmitted over the $SBE_L$ channel and decoded by ldpc_decoder, and the simulation yields the error rate BLER = E. For a fixed error rate E, a larger L means that the burst in the channel lasts longer, that the codeword is more resistant to burst errors, and that the policy used to construct the codeword is better. Therefore L can be used as the reward for the policy, i.e., G = L.
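To make the idea of a burst-length reward concrete, here is a simplified sketch. It swaps in an iterative erasure (peeling) decoder for the soft-decision ldpc_decoder of the source, and it defines the reward as the largest L for which every burst of L consecutive erasures is fully recovered, rather than the L at which a target error rate is reached. Both substitutions are assumptions made to keep the example short, and the names `peel_erasures` and `burst_reward` are hypothetical.

```python
import numpy as np

def peel_erasures(H, erased):
    """Iterative erasure decoding: return True if every erased bit can be resolved.

    A check involving exactly one erased bit determines that bit, so the bit is
    marked as known; this repeats until nothing is erased or no such check remains.
    """
    erased = erased.copy()
    while erased.any():
        unknowns_per_check = H[:, erased].sum(axis=1)
        solvable = np.flatnonzero(unknowns_per_check == 1)
        if solvable.size == 0:
            return False                              # decoder is stuck
        for c in solvable:
            hit = np.flatnonzero(H[c].astype(bool) & erased)
            if hit.size == 1:                         # may already be resolved this pass
                erased[hit[0]] = False
    return True

def burst_reward(H):
    """Largest L such that every burst of L consecutive erasures is recovered."""
    n = H.shape[1]
    best = 0
    for L in range(1, n + 1):
        for start in range(n - L + 1):
            erased = np.zeros(n, dtype=bool)
            erased[start:start + L] = True            # one burst of L deleted bits
            if not peel_erasures(H, erased):
                return best
        best = L
    return best

# Small parity check matrix of the (7,4) Hamming code, used only as a toy input.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]], dtype=np.uint8)
print(burst_reward(H))   # prints 2
```

For a matrix of the size used in the description, checking every burst position exhaustively is far more expensive, which is why a simulation-based estimate of the error rate is the more practical reward.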
[0056] The relevant elements of the improved reinforcement learning are as follows:
[0057] (1) Initial state s0: a matrix of all 1s of size m×n;
[0058] (2) Policy function: Represented by a multilayer perceptron neural network net, which has two hidden layers and a nonlinear output, the output being an m×n matrix μ, which determines the next state s1;
[0059] (3) Action a is a high-dimensional action of m×n, and a is determined by μ;
[0060] (4) The reward G is determined by the Decoder: G = Decoder(H, $SBE_L$).
[0061] The process flow of this invention is as follows:
[0062] (1) Determine the size M1×N1 of the parity-check matrix of the LDPC code and the size z1 of the identity matrix. Randomly initialize Init_H_block and randomly initialize the neural network $\mathrm{net}_\theta$ with parameters θ, which serves as the initial policy $\pi_\theta(a|s) = \mathrm{net}_\theta$.
[0063] (2) In the encoder (Constructor):
[0064] 1. Input the initial state s0 and calculate the output value of the neural network, μ = $\mathrm{net}_\theta$(s0).
[0065] 2. Sample action a from the normal distribution $N(\mu, \sigma^2)$, where $\sigma^2$ represents the degree of dispersion of the sampled data. Where an element of action a is greater than ε, the corresponding position in state s1 is 1; where an element of action a is less than or equal to ε, the corresponding position in state s1 is 0. This gives state s1. The new H_block1, i.e., the index values of the QC-LDPC code, is obtained as the element-wise product of s1 and Init_H_block, which determines the parity check matrix H1.
[0066] (3) In the decoder, the reward r1 of the trajectory is calculated based on the parity matrix H1 obtained in step (2).
[0067] (4) Repeat steps (2) to (3) N times to obtain N trajectories given the policy function.
[0068] (5) From μ, $a_n$, and $G_n$, n = 1, 2, 3, ..., N, calculate the expected return of this policy.
[0069] (6) Take the derivative of the expected return to obtain the gradient $\Delta\theta$, and update the parameters θ using the gradient ascent method.
[0070] (7) If the parameters θ converge, calculate the updated output value μ′ of the neural network, sample action a from the normal distribution $N(\mu', \sigma^2)$, determine the next state s1 from action a and the initial state s0, determine the parity check matrix H from s1 and Init_H_block, and output H.
[0072] (8) If the parameter θ does not converge, repeat steps (2) to (7).
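Steps (1) and (2) of the flow above, one pass through the Constructor, can be sketched as follows. The network has two hidden layers and a nonlinear output as described, but the hidden width, the tanh and sigmoid activations, the weight initialization, and the values of σ and ε are all hypothetical choices; the source does not specify them. Reading U(0,z) as the integers 0 to z inclusive is likewise an assumption.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, z = 2, 4, 5                                     # toy sizes for illustration
hidden = 16                                           # hypothetical hidden-layer width
sigma, eps = 0.2, 0.5                                 # hypothetical sigma and threshold

def init_params(rng, sizes):
    """Random weights and zero biases for a fully connected network."""
    return [(rng.normal(0.0, 0.1, size=(i, o)), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def net(params, s0):
    """Two tanh hidden layers and a sigmoid output, reshaped to the m x n mean matrix."""
    x = s0.reshape(-1).astype(float)
    for W, b in params[:-1]:
        x = np.tanh(x @ W + b)
    W, b = params[-1]
    return (1.0 / (1.0 + np.exp(-(x @ W + b)))).reshape(s0.shape)

# Step (1): random Init_H_block and randomly initialized network parameters.
params = init_params(rng, [m * n, hidden, hidden, m * n])
init_h_block = rng.integers(0, z + 1, size=(m, n))

# Step (2): compute mu, sample the action, threshold it, and mask Init_H_block.
s0 = np.ones((m, n), dtype=int)                       # initial all-ones state
mu = net(params, s0)                                  # mean matrix mu = net_theta(s0)
a = rng.normal(mu, sigma)                             # action a ~ N(mu, sigma^2)
s1 = (a > eps).astype(int)                            # 1 where a > eps, otherwise 0
h_block1 = init_h_block * s1                          # new index matrix H_block1
print(s1)
print(h_block1)
```

Because s0 is always the all-ones matrix, the network input never changes; what training adjusts is how the parameters θ shape the mean matrix μ.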
[0073] The policy gradient algorithm for codeword construction is shown below:
[0074] Input: randomly initialized Init_H_block; randomly initialized neural network parameters θ, i.e., the initial policy is $\mathrm{net}_\theta = \pi_\theta(a|s)$;
[0075] Repeat:
[0076]
[0077]
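The body of the algorithm listing is not reproduced in this text. The following is a minimal sketch of what such a policy gradient loop might look like, under several loudly stated assumptions: the reward is a toy function (agreement with a hidden target mask) instead of a decoding simulation; the policy mean is a sigmoid of θ, a stand-in for the multilayer perceptron that exploits the fact that s0 is constant; a mean-reward baseline is subtracted for variance reduction, which the source does not mention; and a fixed number of iterations replaces the convergence test. All hyperparameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 2, 4
sigma, eps, lr = 0.2, 0.5, 1.0                        # hypothetical hyperparameters
num_samples, num_iters = 1000, 200

# Hypothetical toy reward: number of entries of s1 that agree with a hidden mask.
# In the source the reward comes from decoding simulations on the burst channel.
target = np.array([[1, 0, 1, 1],
                   [0, 1, 0, 0]])

def reward(s1):
    return (s1 == target).sum(axis=(-2, -1))

def policy_mean(theta):
    # Stand-in for net_theta(s0): s0 is constant, so mu depends on theta only.
    return 1.0 / (1.0 + np.exp(-theta))

theta = np.zeros((m, n))
for _ in range(num_iters):
    mu = policy_mean(theta)
    a = rng.normal(mu, sigma, size=(num_samples, m, n))    # N sampled actions
    s1 = (a > eps).astype(int)                             # N next states
    G = reward(s1).astype(float)                           # N trajectory rewards
    advantage = G - G.mean()                               # baseline: not in the source
    grad_mu = (advantage[:, None, None] * (a - mu) / sigma**2).mean(axis=0)
    theta += lr * grad_mu * mu * (1.0 - mu)                # chain rule through the sigmoid

mu_final = policy_mean(theta)
s1_final = (rng.normal(mu_final, sigma) > eps).astype(int) # sampled final state
greedy_mask = (mu_final > eps).astype(int)                 # state implied by the mean
print(int(reward(greedy_mask)), "of", m * n, "entries match the toy target")
```

Replacing `reward` with a decoder-based evaluation and `policy_mean` with a trained network would bring the sketch closer to the flow described above, at a much higher computational cost per iteration.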
[0078] The advantages of this invention are as follows:
[0079] This invention establishes a deep reinforcement learning model and uses channel coding simulation to measure codeword performance, thereby connecting the model to the codewords. This lets the neural network learn autonomously in the direction of optimal codeword performance, so that codeword performance improves gradually with each iteration of the neural network. Furthermore, the codeword construction process incorporates burst channel characteristics, making the trained codewords more targeted and giving them a strong ability to correct burst errors.
Attached Figure Description
[0080] Figure 1 is a flowchart of the method of the present invention.
Detailed Implementation
[0081] The present invention will now be described in further detail with reference to the accompanying drawings. The examples given are only for explaining the present invention and are not intended to limit the scope of the present invention.
[0082] This invention takes as an example the construction of a burst-error-resistant QC-LDPC codeword whose parity check matrix has size 504×1008 and whose cyclic submatrices have size 63×63, with a burst deletion channel (SBE) as the channel condition and soft-decision LDPC decoding. The reward G(τ) is the number r of consecutive errors in the SBE channel at which the bit error rate is 0.01, i.e., G = r.
[0083] Construct a QC-LDPC code with a double diagonal structure, i.e., H has the following structure:
[0084]
[0085] That is, the Hp of this structure is known. The steps of this invention include:
[0086] (1) Initial variables: the neural network $\mathrm{net}_\theta$ with parameters θ, and the number of samples N = 1000.
[0087]
[0088]
[0089] (2) Neural network output matrix μ
[0090]
[0091] Action a0 is sampled from a normal distribution N(μ, 0.2).
[0092]
[0093] When an element of a0 is greater than ε, the corresponding element of s1 is 1; otherwise it is 0. The next state s1 is
[0094]
[0095] The new H_block is
[0096]
[0097] (3) The values in (H_block-1) correspond to the cyclic shift values of the QC-LDPC parity check matrix, so the constructed parity check matrix can be obtained.
[0098]
[0099] (4) Simulation shows that when there are 257 consecutive errors in the burst error channel, the bit error rate is 0.01, and the reward of the strategy trajectory is G(τ1) = 257.
[0100] (5) Steps (2) to (4) yield a trajectory τ1 = s0, a0, s1, r1. Repeating steps (2) to (4) N times yields N trajectories based on this strategy. The probability of each trajectory being generated is...
[0101]
[0102] The expected return for this strategy is then...
[0103]
[0104] Therefore, the gradient of this strategy is...
[0105]
[0106] (6) The parameters of this strategy are updated to θ = θ + Δθ. If θ converges, the check matrix H and state s1 are output. If they do not converge, steps (2) to (5) are repeated.
[0107] After training, s1 converges to
[0108]
[0109] The final parity-check matrix H can then be obtained as follows:
[0110]
[0111] Simulations show that the codeword can correct consecutive errors of t <= 420. If conditions permit, training does not require a specific QC-LDPC codeword structure. Under the same conditions, it may be possible to construct a codeword with stronger resistance to burst errors than H_block1. That is, by using reinforcement learning, it is possible to construct a codeword that can correct at least 420 bits of consecutive errors.
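For context, the reported figure can be compared with a simple upper limit. Treating the burst as erasures, and considering guaranteed recovery rather than a target error rate (both assumptions for this comparison), a set of erased positions is uniquely recoverable only if the corresponding columns of H are linearly independent, so the burst length cannot exceed the rank of H:

```latex
L_{\max} \le \operatorname{rank}(H) \le M = 504,
\qquad
\frac{420}{504} \approx 0.83,
\qquad
\frac{420}{1008} \approx 0.42
```

Under these assumptions, a burst of 420 bits is about 83% of that limit and about 42% of the 1008-bit codeword.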
[0112] This section gives an example of constructing QC-LDPC codewords with strong resistance to burst errors, but the method is not limited to this case. By changing the channel conditions, corresponding codewords can be trained that perform better than existing codewords. Furthermore, since only the code length, code rate, and channel conditions are required, the optimal codeword can be obtained through machine training, which largely resolves the difficulty of constructing LDPC codes of very long length.
[0113] Although specific embodiments of the invention have been disclosed for illustrative purposes to aid in understanding and implementing the invention, those skilled in the art will understand that various substitutions, variations, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the preferred embodiments, and the scope of protection claimed by the invention is defined by the claims.
Claims
1. A codeword construction method based on reinforcement learning, comprising the following steps:
1) Randomly initialize the index value Init_H_block of a parity check matrix H used to generate low-density parity check codes, and randomly initialize the neural network; where θ is the parameter of the neural network, which is used to determine the next action a based on the current state s;
2) The neural network calculates the output mean matrix μ based on the initial state s0 and Init_H_block;
3) Based on the reinforcement learning method, the neural network samples N possible trajectories of the agent from the current state s to the next action a: the action $a_n$ corresponding to the nth trajectory is sampled from the normal distribution $N(\mu, \sigma^2)$; the next state of the nth trajectory is determined by action $a_n$ and the initial state s0; the check matrix $H_n$ corresponding to the nth trajectory is determined by that next state and the matrix Init_H_block; the reward $G_n$ of the nth trajectory is calculated from the check matrix $H_n$; the neural network then calculates the expected return from μ, $a_n$, and $G_n$; where n = 1, 2, 3, ..., N, N is the codeword length, and $\sigma^2$ is the degree of dispersion of the sampled trajectories;
4) Take the derivative of the expected return to obtain the gradient $\Delta\theta$, and optimize and update the parameters θ of the neural network using the gradient ascent method; if the parameters θ converge, the neural network calculates the updated output mean matrix μ′ based on s0, action a is sampled from the normal distribution $N(\mu', \sigma^2)$, the next state s1 is determined by action a and the initial state s0, and the parity check matrix H is determined by s1 and Init_H_block and then output;
5) If the parameter θ does not converge, repeat steps 2) to 4);
6) Use the parity check matrix H output in step 4) as the parity check matrix of the corresponding low-density parity check code LDPC.
2. The method of claim 1, wherein the neural network is a multilayer perceptron neural network $\mathrm{net}_\theta$.
3. The method of claim 2, wherein the neural network is a continuously differentiable function $\pi_\theta(a|s)$.
4. The method according to claim 1 or 2 or 3, characterized in that the parameters θ are updated using the gradient ascent method so as to maximize the objective function; in the objective function, τ is the trajectory, $p_\theta(\tau)$ is the probability of generating trajectory τ, G(τ) is the reward for trajectory τ, $r_{t+1}$ is the reward resulting from the transition from state $s_t$ at time t to state $s_{t+1}$ at time t+1, and T is the trajectory termination time.
5. The method according to claim 1 or 2 or 3, characterized in that the expected return is computed over the sampled trajectories, where τ is the trajectory, $p_\theta(\tau)$ is the probability of generating trajectory τ, and G(τ) is the reward for trajectory τ; the expression is written in terms of the action of the nth trajectory at time t, the state of the nth trajectory at time t, the portion of the nth trajectory from time t to time T, and the total reward generated by that portion of the trajectory, where T is the trajectory termination time.
6. A server, characterized in that it comprises a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program including instructions for performing each step of the method of any one of claims 1 to 5.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.