A control logic wiring method for a fully programmable valve array biochip

By optimizing the control channel wiring using the D3QN architecture in the FPVA biochip, the time delay deviation problem of synchronous valve switching was solved, a more efficient control logic design was achieved, and the accuracy of the measurement results was ensured.

CN116663470B · Active · Publication Date: 2026-06-30 · NORTHWESTERN POLYTECHNICAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NORTHWESTERN POLYTECHNICAL UNIV
Filing Date
2023-05-19
Publication Date
2026-06-30


Abstract

This invention discloses a control logic routing method for a fully programmable valve array (FPVA) biochip, used to automate the channel routing of FPVA control logic. The method aims to minimize both the time delay deviation between synchronized valves and the control channel bus length, and implements control channel routing that accounts for line length matching in the logic architecture. It uses a dueling double deep Q-network (D3QN) as the deep reinforcement learning (DRL) agent to achieve adaptive routing of the channels in the control logic. The key parts of the framework, such as the state space, action space, and reward function, are designed around the line length matching requirements of the control channels, thereby minimizing the time delay deviation of synchronized valves and the control channel bus length in the biochip.

Description

Technical Field

[0001] This invention belongs to the field of computer-aided design technology, and specifically relates to a control logic wiring method for a fully programmable valve array biochip.

Background Technology

[0002] Flow-based microfluidic biochips (FBMBs), also known as lab-on-a-chip, have become an increasingly attractive platform for biochemical experiments over the past two decades. On this microscale platform with fluid volumes in milliliters / nanoliters, a variety of complex biochemical applications in biochemistry and biomedicine can be automated, such as immunoassays and DNA analyses. Compared to traditional biochemical experimental procedures that require manual intervention, FBMBs not only significantly improve experimental efficiency but also minimize human-induced experimental errors, offering advantages such as high efficiency, high precision, and low reagent consumption.

[0003] The basic structure of an FBMB is shown in Figure 4(a): two independent logic layers (i.e., the flow layer and the control layer) are constructed on top of a glass substrate, each with its own microchannel network. The flow layer contains a set of microchannels for transporting fluid samples and reagents, known as flow channels. The control layer contains another set of microchannels, known as control channels, which are connected to an external pressure source via control ports and transmit air pressure. Valves are flexible membranes located at the intersections of flow channels and control channels, made of the elastomer polydimethylsiloxane (PDMS). FBMBs move fluid samples and reagents primarily through precise control of the valves. Complex microfluidic devices, such as mixers, can be constructed using valves as basic units. Furthermore, by programming a given series of control sequences onto the valves, bioassays can be executed automatically according to user-customized assay plans.

[0004] With advancements in manufacturing technology, thousands of valves can now be integrated onto a single chip, achieving a density of up to one million valves per square centimeter. Therefore, fully programmable valve arrays (FPVAs), offering flexibility and reconfigurability, have emerged as a novel type of flow microfluidic device. Figure 4(b) illustrates the basic structure of FPVA, where valves (dark blocks) are regularly deployed along the horizontal and vertical directions of the flow channels (light blocks). Each flow channel intersection is surrounded by four valves. By opening two valves around the intersection and closing the other two, the fluid sample in the flow channel will move in the specified direction according to the formed transport path. In this way, FPVA can quickly construct arbitrary channel structures, thus flexibly transporting fluid to any location on the chip.

[0005] In FBMBs, external air pressure is typically injected into a control port to switch valves. However, for FPVAs, which integrate a large number of valves, it is impractical to allocate a separate control port to each valve because of chip manufacturing costs and the size of the mechanical components. Multiplexers therefore play a crucial role in biochips in overcoming the challenge of controlling a large number of valves. Figure 4(c) shows a complete general-purpose FPVA biochip platform, in which the control channels, multiplexer, core input, and control ports on the right side surrounding the valve array together constitute the control logic. The central valve array contains 116 valves; however, instead of a separate control port for each valve, the control logic uses a multiplexer with only 14 control ports to generate the control patterns for switching all valves. Specifically, the core input provides the pressure source that drives valve opening or closing, and the control pattern specifies which control channels connect the valves to the core input. In general, $2^{n/2}$ valves can be controlled by a multiplexer with only n control ports. The multiplexer-based control logic therefore greatly reduces the number of control ports used in the biochip.
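The port-count relation above can be checked with a short sketch. It assumes that each complementary pair of control ports contributes one address bit, so that n ports address 2^(n/2) valves; the function name `ports_needed` is hypothetical and does not come from the patent.

```python
def ports_needed(num_valves: int) -> int:
    """Smallest even n such that 2 ** (n // 2) >= num_valves.

    Assumption: every address bit costs one complementary pair of ports.
    """
    if num_valves < 1:
        raise ValueError("num_valves must be positive")
    # (num_valves - 1).bit_length() is ceil(log2(num_valves)) for num_valves >= 1.
    address_bits = max(1, (num_valves - 1).bit_length())
    return 2 * address_bits


print(ports_needed(116))  # valve array of Figure 4(c)
print(ports_needed(5))    # five flow valves of Figure 2
```

Under this assumption the sketch gives 14 ports for the 116-valve array and 6 ports for the five-valve example, matching the figures quoted in the text.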

[0006] As mentioned above, FPVA biochips require control logic with a small number of ports to generate control patterns, thereby updating the state of the valves in the flow layer (hereinafter referred to as flow valves) so that users can automatically perform bioassays according to customized assay plans. Figure 2 shows a multi-channel switching control logic for driving five flow valves. The core input and the flow valves are connected by control channels to switch the states of the corresponding flow valves. This logic structure deploys six control ports connected to an external pressure source, x1, x2, x3 and their complements, which drive the states of the control valves located on top of the control channels. Note that only one port of a complementary pair can be set to high pressure (logic value "1"), while the other is set to low pressure (logic value "0"). In this way, the control pattern generated by the ports specifies which control channel is connected to the core input, thereby updating the state of the corresponding flow valve. For example, when the control ports [...] are set to high pressure, the corresponding control pattern [...] connects the second control channel to the core input, so the state of flow valve 2 is updated to match the state of the core input. The remaining control channels are not connected, so the states of their flow valves remain unchanged. On the other hand, the first and third control channels are assigned the same control pattern [...]. Once this control pattern is activated, these two control channels simultaneously switch the states of flow valve 1 and flow valve 3. In other words, the control logic switches flow valves with synchronization requirements by assigning them the same control pattern.

[0007] In recent years, researchers have proposed several architectural design methods for control logic. For example, Q. Wang proposed a switching optimization method based on Hamming distance, which improves the reliability of control logic by adjusting the switching sequence of control valves. Building on this, he further proposed a pressure refresh method based on XOR computation to overcome the pressure deviation problem caused by switching control valves in the control logic. Y. Zhu proposed a control logic architecture that supports multi-channel switching and fault tolerance, further improving the execution efficiency and fault tolerance of the control logic. In addition, S. Liang proposed a combinatorial coding strategy based on Sperner's theorem to further reduce the resource usage of the control logic.

[0008] While some design automation techniques have been proposed to address the architectural design challenges of control logic, shortcomings remain. Because the channels in the control logic are made of PDMS, pressure propagates very slowly from the core input through the control channels to the corresponding valves. For two or more valves that must switch synchronously, it is crucial that the pressure arrives at these valves simultaneously, especially in biochips such as FPVAs that integrate a large number of valves. Otherwise, chip malfunctions and erroneous measurement results can occur. Therefore, for microfluidic components with synchronization requirements, a key issue in the architectural design of the control logic is how to construct an efficient control channel network that achieves synchronous switching of flow valves. However, existing technologies do not consider the line length matching requirements of the channels in the control logic, and thus ignore the actual time delay deviation in pressure transmission to the synchronous valves in the FPVA.

Summary of the Invention

[0009] To overcome the shortcomings of existing technologies, this invention provides a control logic routing method for a fully programmable valve array biochip, used to automate the channel routing of FPVA control logic. The method aims to minimize both the time delay deviation between synchronized valves and the control channel bus length, and implements control channel routing that accounts for line length matching in the logic architecture. It uses a dueling double deep Q-network (D3QN) as the deep reinforcement learning (DRL) agent to achieve adaptive routing of the channels in the control logic. Furthermore, the key parts of the framework, such as the state space, action space, and reward function, are designed around the line length matching requirements of the control channels, thereby minimizing the time delay deviation of synchronized valves and the control channel bus length in the biochip.

[0010] The technical solution adopted by this invention to solve its technical problem includes the following steps:

[0011] Step 1: Define the action space;

[0012] At time step t, the action $a_t$ is used to route the control channel, so the action space consists of all possible routing directions of the control channel, represented as the set {D1, D2, D3, D4}.

[0013] Here D1, D2, D3, and D4 represent the four routing directions of the control channel: north, south, west, and east, respectively. Each routing direction $D_i$ (i = 1, 2, 3, 4) is represented by a combination of two 0-1 variables, w1 and w2;

[0014] After the agent performs an action at each time step, the control path will pass through a grid, and the path length will increase by one unit.

[0015] Step 2: Define the state space;

[0016] The state $s_t$ at time step t consists of four core parts: the position $F_t$ of the flow valve/end point that the current control path must reach, the position $E_t$ of the current necessary point, the position $C_t$ that the path has currently reached, and the path attributes.

[0017] The overall state space is defined as follows:

[0018] $s_t = [F_t, E_t, C_t, (\text{path number}, \text{path length}, \text{matching flag}, \text{matching degree})]$

[0019] Here $F_t = (f_t^h, f_t^v)$, where $f_t^h$ and $f_t^v$ are the x and y coordinates of the flow valve/end point position; $E_t$ and $C_t$ likewise hold the x and y coordinates of the necessary point and of the position reached by the control path. The four path attributes are, in order: the sequence number of the control path; the length of the current control path; a flag indicating whether the control path must meet the length matching requirement; and the matching degree of the control path, i.e., the absolute value of the difference between the current control path length and the matching line length;

[0020] If the control path must meet the length matching requirement, the flag is 1; otherwise, the flag and the matching degree are always equal to 0;

[0021] Step 3: Define the reward function;

[0022] The reward function is constructed based on the following five types of points that the control path may pass through, in order to evaluate the quality of the action in each state: 1) important point (IP), 2) obstacle point (OP), 3) target point (TP), 4) general point (GP), and 5) shared point (SP).

[0023] The overall reward function is expressed as follows:

[0024]

[0025] Here $C_{t+1}$ denotes the position reached by the control path after action $a_t$ is performed at time t; $R_{TP}(C_{t+1})$ and $R_{GP}(C_{t+1})$ are the reward functions used when $C_{t+1}$ is a TP or a GP, respectively; the two remaining reward terms are positive numbers, and the penalty term is a negative constant;

[0026] If $C_{t+1}$ is an IP after action $a_t$ is performed, a positive reward is introduced to encourage the agent to guide the path through the IP. An IP is simply the control valve at the current necessary point of the path, and it is updated as routing proceeds.

[0027] During the routing of a control path, all control valves in the mesh other than the necessary points that must be traversed are considered obstacle points (OPs). The control path should avoid passing through OPs; if $C_{t+1}$ is an OP, the current learning iteration is terminated and a penalty is imposed on the agent.

[0028] When $C_{t+1}$ is a TP, a threshold $T_m$ limiting the maximum time delay deviation, the matching degree, and the matching-path flag are used to calculate the reward. The corresponding reward function $R_{TP}(C_{t+1})$ is as follows:

[0029]

[0030] where the reward terms in $R_{TP}$ are all positive values.

[0031] The principles for calculating the corresponding rewards are as follows: 1) a positive reward guides the control path to the TP to complete channel construction; 2) a further positive reward encourages the agent to construct, as far as possible, channels for synchronous valves that meet the length matching requirement, i.e., to minimize the time delay deviation between synchronous valves.

[0032] When the path position $C_{t+1}$ is a general point (GP), the corresponding function $R_{GP}(C_{t+1})$ is as follows:

[0033]

[0034] where the penalty terms are all negative constants, and $mdis(C_t, E_t)$ and $mdis(C_{t+1}, E_t)$ are the Manhattan distances between the position reached by the control path and the current necessary point before and after the action at time t, respectively;

[0035] The principle for calculating the reward when $C_{t+1}$ is a GP is as follows: 1) if $C_{t+1}$ is closer to the necessary point than $C_t$, the agent receives a smaller penalty; 2) if the distance to the necessary point stays the same or increases, the agent receives one of two larger penalties, depending on the actual situation.

[0036] When $C_{t+1}$ is an SP, i.e., a point already traversed by other control paths, the reward is calculated by adding a positive value to the function $R_{GP}(C_{t+1})$.
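Read together, the five cases of Step 3 suggest a piecewise form for the overall reward. The following is only a sketch under assumptions: the symbol names $\lambda_{IP}$, $\lambda_{SP}$, and $\lambda_{OP}$ are introduced here for illustration and are not the patent's notation, and the internal forms of $R_{TP}$ and $R_{GP}$ are left unspecified.

```latex
R(C_{t+1}) =
\begin{cases}
\lambda_{IP}, & C_{t+1} \text{ is an IP} \\
\lambda_{OP}, & C_{t+1} \text{ is an OP} \\
R_{TP}(C_{t+1}), & C_{t+1} \text{ is a TP} \\
R_{GP}(C_{t+1}), & C_{t+1} \text{ is a GP} \\
R_{GP}(C_{t+1}) + \lambda_{SP}, & C_{t+1} \text{ is an SP}
\end{cases}
\qquad \lambda_{IP} > 0,\ \lambda_{SP} > 0,\ \lambda_{OP} < 0
```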

[0037] Step 4: Build the D3QN architecture:

[0038] Step 4-1: Based on the action space, state space, and reward function described above, construct the D3QN architecture. The constructed D3QN consists of two DNNs with identical structures, referred to as the policy network and the target network, respectively. The policy network guides the agent to select and take actions, while the target network evaluates the quality of the actions taken. Each network consists of three layers, including an input layer, a hidden layer, and an output layer.

[0039] Step 4-2: The action-value function used to calculate the Q-value of a state-action pair $(s_t, a_t)$ is decomposed into a value function V, which depends only on the state, and an advantage function A, which depends on both the action and the state. The network estimates V and A with two separate streams in the last fully connected layer, and finally combines the two to output a single Q-value $Q(s_t, a_t)$ for the state-action pair;

[0040] Step 4-3: For the policy network, the Q-value of taking action $a_t$ in state $s_t$ is calculated as follows:

[0041] $Q(s_t,a_t,\theta_t,\beta_t,\alpha_t)=V(s_t,\theta_t,\beta_t)+A(s_t,a_t,\theta_t,\alpha_t)$ (6)

[0042] where $V(s_t,\theta_t,\beta_t)$ is the output of the value function, $A(s_t,a_t,\theta_t,\alpha_t)$ is the output of the advantage function, $\theta_t$ are the parameters of the policy network, $\beta_t$ are the parameters of the value function, and $\alpha_t$ are the parameters of the advantage function.

[0043] The average value of the advantage function A is used in the calculation, therefore the final Q value is calculated as follows:

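[0044] $Q(s_t,a_t,\theta_t,\beta_t,\alpha_t)=V(s_t,\theta_t,\beta_t)+\left(A(s_t,a_t,\theta_t,\alpha_t)-\frac{1}{|\mathcal{A}|}\sum_{a'\in\mathcal{A}}A(s_t,a',\theta_t,\alpha_t)\right)$

[0045] where $|\mathcal{A}|$ is the total number of actions in the action space $\mathcal{A}$;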
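The mean-advantage combination of Step 4-3 can be illustrated in a few lines. This is a minimal sketch, assuming the value stream has already produced a scalar V and the advantage stream one number per routing direction; the function name `dueling_q_values` and the sample numbers are hypothetical.

```python
def dueling_q_values(value, advantages):
    """Combine V(s) and A(s, a) as Q = V + (A - mean(A))."""
    if not advantages:
        raise ValueError("advantages must not be empty")
    mean_advantage = sum(advantages) / len(advantages)
    return [value + (a - mean_advantage) for a in advantages]


# Hypothetical stream outputs for the four routing directions D1..D4.
q = dueling_q_values(1.0, [2.0, 0.0, -1.0, -1.0])
print(q)
```

Subtracting the mean forces the advantages to average to zero, so the mean of the resulting Q-values equals V and the two streams stay identifiable.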

[0046] Step 4-4: The two DNNs in D3QN separate action selection from action evaluation, thus effectively addressing the overestimation of Q-values. Specifically, when action $a_t$ is chosen in state $s_t$, the policy network first calculates the predicted Q-value of the state-action pair $(s_t, a_t)$, i.e., $Q(s_t,a_t,\theta_t,\beta_t,\alpha_t)$. Then the policy network finds the action $a^{\max}$ corresponding to the maximum Q-value in state $s_{t+1}$, defined as follows:

[0047] $a^{\max}=\arg\max_{a} Q(s_{t+1},a,\theta_t,\beta_t,\alpha_t)$

[0048] The target network uses action $a^{\max}$ and state $s_{t+1}$ to calculate the target Q-value $Y_t$ of the state-action pair $(s_t, a_t)$, defined as follows:

[0049] $Y_t = r_t + \eta\, Q(s_{t+1}, a^{\max}, \theta_t^-)$

[0050] where $\theta_t^-$ denotes the parameters of the target network at time t, $r_t$ is the reward obtained by taking action $a_t$ in state $s_t$, and $\eta\in[0,1]$ is the discount factor used to weigh current against future rewards. $Y_t$ is used to evaluate the quality of taking action $a_t$ in state $s_t$;
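The separation of action selection (policy network) from action evaluation (target network) in Step 4-4 can be sketched as follows. The two lists stand in for the Q-values that each network would output for state $s_{t+1}$; the `done` flag for terminal transitions is an assumption added for completeness and is not described in the text, and all numbers are hypothetical.

```python
def double_dqn_target(reward, next_q_policy, next_q_target, eta, done=False):
    """Target Y_t = r_t + eta * Q_target(s_{t+1}, a_max).

    a_max is chosen by the policy network and evaluated by the target network.
    """
    if done:  # assumption: no bootstrapping after a terminal transition
        return reward
    a_max = max(range(len(next_q_policy)), key=lambda i: next_q_policy[i])
    return reward + eta * next_q_target[a_max]


policy_q = [0.1, 0.8, 0.3, 0.2]   # policy network output for s_{t+1}
target_q = [0.5, 0.25, 0.9, 0.1]  # target network output for s_{t+1}
print(double_dqn_target(1.0, policy_q, target_q, eta=0.5))
```

Note that the target uses 0.25 (the target network's estimate for the action the policy network prefers) rather than 0.9 (the target network's own maximum), which is what limits overestimation.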

[0051] Step 5: Construct a loss function based on the target Q-value and the predicted Q-value, as follows:

[0052]

[0053] Then, the loss function is minimized by using gradient descent, thereby updating the parameters in the policy network.
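As an illustration of Step 5, the sketch below assumes a mean squared error between target and predicted Q-values, which is a common choice for D3QN but is not confirmed by the text. A linear predictor stands in for the policy network so that the gradient can be written out by hand; all names and numbers are hypothetical.

```python
def predict(theta, x):
    """Linear stand-in for the policy network's predicted Q-value."""
    return sum(w * xi for w, xi in zip(theta, x))


def mse_loss(theta, features, targets):
    """Mean squared error between target and predicted Q-values (assumed form)."""
    n = len(targets)
    return sum((y - predict(theta, x)) ** 2 for x, y in zip(features, targets)) / n


def gradient_step(theta, features, targets, lr):
    """One gradient-descent update of theta on the MSE loss."""
    n = len(targets)
    grad = [0.0] * len(theta)
    for x, y in zip(features, targets):
        err = predict(theta, x) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return [w - lr * g for w, g in zip(theta, grad)]


theta = [0.0, 0.0]
features = [[1.0, 0.0], [0.0, 1.0]]
targets = [1.0, 2.0]  # hypothetical target Q-values Y_t
before = mse_loss(theta, features, targets)
theta = gradient_step(theta, features, targets, lr=0.1)
after = mse_loss(theta, features, targets)
print(before, after)
```

In a real D3QN the gradient would be obtained by backpropagation through the policy network; only the policy network's parameters are updated by this step.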

[0054] Preferably, the unit length is 1 mm.

[0055] Preferably, when η is close to 1, the agent pays more attention to long-term rewards; otherwise, the agent pays more attention to immediate rewards.
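The effect of the discount factor can be seen on a small example. This is a sketch with a hypothetical reward sequence (two step penalties followed by a large terminal reward); `discounted_return` is an illustrative helper, not part of the method.

```python
def discounted_return(rewards, eta):
    """Sum of rewards weighted by eta ** k for k = 0, 1, 2, ..."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return sum(r * eta ** k for k, r in enumerate(rewards))


rewards = [-1.0, -1.0, 10.0]  # hypothetical: penalties now, reward at the target
far_sighted = discounted_return(rewards, 0.9)
short_sighted = discounted_return(rewards, 0.1)
print(far_sighted, short_sighted)
```

With η = 0.9 the later reward dominates and the return is positive; with η = 0.1 the immediate penalties dominate and the return is negative.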

[0056] The beneficial effects of this invention are as follows:

[0057] 1. The method of the present invention aims to minimize the time delay deviation between synchronous valves and minimize the length of the control channel bus, thereby generating optimized control logic with good timing behavior and low control design cost.

[0058] 2. The method of the present invention designs key parts such as state space, action space, and reward function in the framework based on the line length matching requirements of the control channel, thereby minimizing the time delay deviation of the synchronization valve in the biochip and the bus length of the control channel.

Description of the Drawings

[0059] Figure 1 shows the design flow of the DRL method for control channel routing in this invention.

[0060] Figure 2 is a schematic diagram of the multi-channel switching control logic for driving five flow valves.

[0061] Figure 3 compares the number of zero-deviation groups achieved by the D3QN architecture and by other architectures.

[0062] Figure 4 shows the basic structures of FBMBs: (a) basic structure of a continuous-flow microfluidic biochip, (b) general structure of an FPVA biochip, and (c) FPVA biochip platform with a multiplexer.

[0063] Figure 5 shows the estimated paths for flow valves 1 and 3 and the actual path for flow valve 1.

Detailed Implementation

[0064] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0065] The purpose of this invention is to design a wiring method for the control logic of FPVA biochip based on deep reinforcement learning (DRL). This method aims to minimize the time delay deviation between synchronous valves and minimize the length of the control channel bus, thereby generating optimized control logic with good timing behavior and low control design cost.

[0066] This invention proposes a DRL-based framework for automating channel routing in FPVA control logic. It primarily focuses on control channel routing that considers line length matching for the logic architecture implementation. Within this framework, a Dueling Double Deep Q-Network (D3QN) is used as the agent in the DRL to achieve adaptive routing of channels in the control logic. The framework's key components, such as the state space, action space, and reward function, are designed based on the line length matching requirements of the control channels, thereby minimizing the time delay deviation of synchronous valves in biochips and the length of the control channel bus.

[0067] A control logic wiring method for a fully programmable valve array biochip includes the following steps:

[0068] 1. Control channel wiring considering line length matching

[0069] In this design phase, to generate a logic architecture for the FPVA biochip that considers line length matching, an adaptive routing method based on DRL is proposed to construct an efficient control channel network. A routing mesh is created to map the specific locations of all valves in the logic forest. The goal is to generate an optimized routing scheme that allows the control logic to connect to all flow valves according to the control functions specified by the logic forest, while minimizing the time delay deviation between synchronous valves and the length of the control channel bus used.

[0070] The design flow of the proposed DRL method is shown in Figure 1. The environment consists of the logic forest on the far left and the routing mesh created from this forest. A DNN is constructed as the DRL agent, which builds the complete control channel network by sequentially planning the paths of the flow valves in the logic forest. Flow valve f1 is taken as an example. At time step t=0, the routing mesh first captures the control path information of flow valve f1 (marked by shaded squares), including the core input position (circle), the flow valve position (triangle), and the control valve positions. The state is then generated from this path information and fed to the agent, which calculates the Q-value of each routing direction. In Figure 1, for example, the routing direction with the largest Q-value (i.e., 0.8) is selected as the first action and passed to the environment. The environment then rewards the agent according to this action, completing one interaction. The proposed method performs a certain number of learning iterations of this process to generate an optimized routing scheme and thus an effective control channel network. Further technical details of the proposed method, including the action space, state space, reward function, and network architecture, are introduced below.
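The action-selection step described above amounts to picking the direction with the largest Q-value. A minimal sketch follows; only the value 0.8 comes from the text, while the other three Q-values and the assignment of 0.8 to "south" are assumptions made for illustration.

```python
def greedy_direction(q_values):
    """Return the routing direction whose Q-value is largest."""
    return max(q_values, key=q_values.get)


# Hypothetical agent output at t = 0; only the maximum (0.8) is from the text.
q_values = {"north": 0.1, "south": 0.8, "west": 0.05, "east": 0.3}
print(greedy_direction(q_values))
```

During training an agent would normally also explore non-greedy directions; the exploration scheme is not described in this section, so it is omitted here.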

[0071] 1) Action space:

[0072] Because the action $a_t$ at time step t is primarily used to route the control channel, the action space should consist of all possible routing directions of the control channel, represented as:

[0073] {D1, D2, D3, D4}

[0074] D1, D2, D3, and D4 represent the four routing directions of the control channel: North, South, West, and East, respectively. Each routing direction Di is represented by a combination of two 0-1 variables, w1 and w2. For example, D1(north) = (0,0), D2(south) = (0,1), D3(west) = (1,0), and D4(east) = (1,1). After the agent executes an action at each time step, the control path will pass through a grid, and the path length will increase by one unit, i.e., 1 mm.

[0075] Through the action space defined above, the control paths of each flow valve can be flexibly constructed during the learning iteration process, and the agent can be guaranteed to have the ability to explore feasible wiring schemes.
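The action space can be written down directly as a lookup table. In this sketch the (w1, w2) codes are those given in the text, whereas the grid offsets are an assumption: "south" is taken to increase the first coordinate because the worked example moves the path from (0, 1) to (1, 1) with action D2, and the other three offsets follow by symmetry. The helper names are hypothetical.

```python
# label: (name, (w1, w2) code from the text, assumed grid offset)
DIRECTIONS = {
    "D1": ("north", (0, 0), (-1, 0)),
    "D2": ("south", (0, 1), (1, 0)),
    "D3": ("west", (1, 0), (0, -1)),
    "D4": ("east", (1, 1), (0, 1)),
}


def decode(w1, w2):
    """Map a pair of 0-1 variables to its direction label."""
    for label, (_, code, _) in DIRECTIONS.items():
        if code == (w1, w2):
            return label
    raise ValueError("w1 and w2 must each be 0 or 1")


def apply_action(position, length, label):
    """Advance the path by one grid cell; the path length grows by one unit."""
    _, _, (d0, d1) = DIRECTIONS[label]
    return (position[0] + d0, position[1] + d1), length + 1


print(decode(0, 1), apply_action((0, 1), 0, "D2"))
```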

[0076] 2) State space:

[0077] To enable the agent to better perceive the relevant information and patterns of change in the interactive environment, and thus make effective routing decisions, the state $s_t$ at time step t should consist of four core parts: the position $F_t$ of the flow valve/end point that the current control path must reach, the position $E_t$ of the current necessary point, the position $C_t$ that the path has currently reached, and the path attributes. Therefore, the overall state space is defined as follows:

[0078] $s_t = [F_t, E_t, C_t, (\text{path number}, \text{path length}, \text{matching flag}, \text{matching degree})]$

[0079] Here $F_t = (f_t^h, f_t^v)$, where $f_t^h$ and $f_t^v$ are the x and y coordinates of the flow valve/end point position; $E_t$ and $C_t$ likewise hold the x and y coordinates of the necessary point and of the position reached by the control path. The four path attributes are, in order: the sequence number of the control path; the length of the current control path; a flag indicating whether the control path must meet the length matching requirement; and the matching degree of the control path, i.e., the absolute value of the difference between the current control path length and the matching line length. To illustrate how the matching line length is calculated, take the synchronized flow valves 1 and 3 in Figure 2 as an example. The control path lengths of these two flow valves are estimated in advance, and the longest estimated path length is selected as the matching line length of the synchronized valve group. The estimated control path length of a flow valve consists of three parts: the Manhattan distance between the core input and the control valve, the Manhattan distance between the control valves, and the Manhattan distance between the control valve and the flow valve. Figure 5 shows the estimated control paths (light solid lines) for the two flow valves, with estimated path lengths of 9 mm and 11 mm, respectively. Therefore, the matching line length for this synchronized valve group is 11 mm. Note that if the control path must meet the length matching requirement, the flag is 1; otherwise, the flag and the matching degree always equal 0.
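The matching line length computation can be sketched as below. The estimate is taken as the sum of Manhattan distances along the chain core input, control valves, flow valve, as the paragraph describes; the function names and the coordinates in the usage lines are hypothetical, and only the estimates 9 and 11 come from the text.

```python
def manhattan(p, q):
    """Manhattan distance between two grid positions."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def estimated_path_length(core_input, control_valves, flow_valve):
    """Sum of Manhattan distances: core input -> control valves -> flow valve."""
    points = [core_input, *control_valves, flow_valve]
    return sum(manhattan(a, b) for a, b in zip(points, points[1:]))


def matching_line_length(estimates):
    """The longest estimate in a synchronized valve group sets the target length."""
    return max(estimates)


def matching_degree(current_length, match_length, needs_matching):
    """|current length - matching line length|, or 0 if matching is not required."""
    return abs(current_length - match_length) if needs_matching else 0


# Hypothetical coordinates, for illustration only.
print(estimated_path_length((0, 0), [(0, 2), (3, 2)], (3, 5)))
# Estimates quoted in the text for flow valves 1 and 3.
print(matching_line_length([9, 11]), matching_degree(0, 11, True))
```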

[0080] The state transition process is illustrated using the actual control path (solid black line) of flow valve 1 in Figure 5, where the path number is marked as "10". As Figure 5 shows, this flow valve is controlled by a specific control pattern [...]. Therefore, throughout the routing process, control path "10" must pass in sequence through the control valves x1 and x2 [...] and then flow valve 1. At time step t = 0, the agent first perceives the initial state s0 and then selects a routing direction from the action space to start routing the control path from the core input. At this time step, the starting point, the necessary point, and the end point of the control path are the position of the core input (0, 1), the position of the first control valve x1 (1, 1), and the position of flow valve f1 (7, 0), respectively. In addition, since the agent has not yet selected a routing action at the initial moment, the path length is 0; the path number is "10", the matching flag is 1, and the matching degree is |0 - 11| = 11. Based on the above, F0 = (7, 0), E0 = (1, 1), C0 = (0, 1), and the path attributes are (10, 0, 1, 11). Therefore, the initial state can be represented as s0 = [(7, 0), (1, 1), (0, 1), (10, 0, 1, 11)]. Since the first grid to the south of the core input is traversed by control path "10", the agent executes action D2 at t = 0 and transitions to state s1 at t = 1. Note that this action causes the first necessary point, i.e., control valve x1, to be traversed by the control path; therefore, E1 in state s1 is updated to the position of the next necessary point, i.e., control valve x2. The position C1 reached by the control path, the control path length, and the matching degree are updated accordingly. Therefore, the state at time step t=1 is represented as s1=[(7,0),(3,1),(1,1),(10,1,1,10)]. Ultimately, control path "10" ends its routing after reaching flow valve 1, and the agent then prepares to route the next path, until all control paths are completed. During this process, all possible states perceived by the agent constitute the entire state space.
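The transition from s0 to s1 in this example can be reproduced with a small sketch. The helper names are hypothetical, the offset (1, 0) for "south" is inferred from the move (0, 1) to (1, 1), and the rule that the necessary point falls back to the end point once all control valves have been passed is an assumption not stated in the text.

```python
def make_state(F, E, C, path_no, length, needs_match, match_length):
    """Build [F_t, E_t, C_t, (path number, length, flag, matching degree)]."""
    degree = abs(length - match_length) if needs_match else 0
    return [F, E, C, (path_no, length, int(needs_match), degree)]


def step(state, delta, necessary_points, match_length):
    """Move one grid cell and update the necessary point and path attributes."""
    F, E, C, (path_no, length, flag, _) = state
    C_next = (C[0] + delta[0], C[1] + delta[1])
    if C_next == E and E in necessary_points:
        idx = necessary_points.index(E)
        # Assumption: after the last necessary point, head for the end point F.
        E = necessary_points[idx + 1] if idx + 1 < len(necessary_points) else F
    return make_state(F, E, C_next, path_no, length + 1, bool(flag), match_length)


SOUTH = (1, 0)                      # inferred from the worked example
necessary = [(1, 1), (3, 1)]        # positions of control valves x1 and x2
s0 = make_state((7, 0), (1, 1), (0, 1), 10, 0, True, 11)
s1 = step(s0, SOUTH, necessary, 11)
print(s0)
print(s1)
```

The two printed states match s0 and s1 as given in the paragraph above.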

[0081] 3) Reward function:

[0082] To guide the agent in learning effective wiring strategies during its interaction with the environment, a reward function is constructed based on five types of points that the control path may traverse to evaluate the quality of actions in each state: 1) Important Point (IP), 2) Obstacle Point (OP), 3) Target Point (TP), 4) General Point (GP), and 5) Shared Point (SP). Correspondingly, the overall reward function is expressed as follows:

[0083]

[0084] Here $C_{t+1}$ denotes the position reached by the control path after action $a_t$ is performed at time t; $R_{TP}(C_{t+1})$ and $R_{GP}(C_{t+1})$ are the reward functions used when $C_{t+1}$ is a TP or a GP, respectively; the two remaining reward terms are positive numbers, and the penalty term is a negative constant. If $C_{t+1}$ is an IP after action $a_t$ is performed, a positive reward is introduced to encourage the agent to guide the path through the IP. An IP is simply the control valve at the current necessary point of the path, and it is updated as routing proceeds.

[0085] During control path routing, all control valves in the mesh other than those that must be traversed are considered obstacle points (OPs). Control paths should avoid passing through OPs during routing. If $C_{t+1}$ is an OP, the current learning iteration is terminated and a penalty is imposed on the agent.

[0086] When $C_{t+1}$ is a TP, control paths that must satisfy length matching may need to be considered. Therefore, a threshold $T_m$ limiting the maximum time delay deviation, the matching degree, and the matching-path flag are used to calculate the reward. The corresponding reward function $R_{TP}(C_{t+1})$ is as follows:

[0087]

[0088] where the reward terms in $R_{TP}$ are all positive values. The principles for calculating the corresponding rewards are as follows: 1) a positive reward guides the control path to the TP to complete channel construction; 2) a further positive reward encourages the agent to construct, as far as possible, channels for synchronous valves that meet the length matching requirement; in other words, to minimize the time delay deviation between synchronous valves.

[0089] When the path position $C_{t+1}$ is a general point (GP), the corresponding function $R_{GP}(C_{t+1})$ is as follows:

[0090]

[0091] where the penalty terms are all negative constants, and $mdis(C_t, E_t)$ and $mdis(C_{t+1}, E_t)$ are the Manhattan distances between the position reached by the control path and the current necessary point before and after the action at time t, respectively. The principle for calculating the reward when $C_{t+1}$ is a GP is as follows: 1) if $C_{t+1}$ is closer to the necessary point than $C_t$, the agent receives a smaller penalty; 2) if the distance to the necessary point stays the same or increases, the agent receives one of two larger penalties, depending on the actual situation. This penalty mechanism not only guides the control path to the necessary point along the shortest distance, but also gives the agent an effective signal for approaching the necessary point.

[0092] Furthermore, when $C_{t+1}$ is an SP, i.e., a point already traversed by other control paths, the reward is calculated by adding a positive value to the function $R_{GP}(C_{t+1})$. This positive value guides the control path to share resources with other paths as much as possible during routing, thereby reducing the bus length used by the final control channels.

[0093] Through the reward function designed above, the agent will learn an effective control channel wiring scheme for all flow valves in the control logic, thereby minimizing the time delay deviation between synchronous valves and the control channel bus length required in the logic architecture.

[0094] 4) D3QN architecture:

[0095] Based on the aforementioned action space, state space, and reward function, an architecture called D3QN is introduced as the agent for DRL training, thereby achieving adaptive wiring of the control channels. This architecture combines the ideas of Dueling DQN and Double DQN and thereby overcomes the individual shortcomings of these two DRL architectures. In this invention, the constructed D3QN consists of two DNNs with identical structures, referred to as the policy network and the target network, respectively. The policy network guides the agent to select and take actions, while the target network evaluates the quality of the actions taken. Each network consists of three layers: an input layer, a hidden layer, and an output layer.

[0096] Because the network architecture of D3QN borrows the idea of Dueling DQN, the action-value function originally used to compute the Q-value of a state-action pair $(s_t, a_t)$ is decomposed into a value function V, which depends only on the state, and an advantage function A, which depends on both the action and the state. This network architecture estimates V and A with two separate streams in the last fully connected layer, and finally merges the two values to output a single state-action Q-value $Q(s_t, a_t)$. Taking the policy network as an example, the Q-value for taking action $a_t$ in state $s_t$ is calculated as follows:

[0097] $Q(s_t, a_t, \theta_t, \beta_t, \alpha_t) = V(s_t, \theta_t, \beta_t) + A(s_t, a_t, \theta_t, \alpha_t)$, (6)

[0098] where $V(s_t, \theta_t, \beta_t)$ is the output of the value function, $A(s_t, a_t, \theta_t, \alpha_t)$ is the output of the advantage function, $\theta_t$ denotes the parameters of the policy network, $\beta_t$ the parameters of the value function, and $\alpha_t$ the parameters of the advantage function.

[0099] However, formula (6) cannot identify the respective contributions of $V(s_t, \theta_t, \beta_t)$ and $A(s_t, a_t, \theta_t, \alpha_t)$ to the final output, which degrades the performance of the neural network. To address this identifiability problem, the average value of the advantage function is introduced into the calculation; the final Q-value is therefore calculated as follows:

[0100]

[0101] Here the average is taken over the total number of actions in the action space. The architecture described above can estimate the value of each action more accurately, thereby improving the learning efficiency and stability of the agent.
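As a minimal sketch of this merging step, the snippet below assumes the usual Dueling DQN form, in which the mean advantage over all actions is subtracted before the value is added, i.e. Q = V + (A − mean of A). The function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
def dueling_q_values(value, advantages):
    """Merge a scalar state value V with per-action advantages A into Q-values.

    Assumed form: Q(a) = V + (A(a) - mean(A)), with the mean taken over all
    actions in the action space.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [value + (a - mean_adv) for a in advantages]


# Four actions, matching the four wiring directions (north, south, west, east).
q_values = dueling_q_values(2.0, [1.0, 0.0, -1.0, 4.0])

# Adding the same constant to every advantage leaves the Q-values unchanged,
# which is why the mean subtraction removes the ambiguity between V and A.
q_shifted = dueling_q_values(2.0, [11.0, 10.0, 9.0, 14.0])
```

With plain addition as in formula (6), the two advantage vectors above would give different Q-values for the same V, so the split between V and A would not be unique.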

[0102] On the other hand, following the idea of Double DQN, the two DNNs in D3QN separate action selection from action evaluation, thus effectively solving the problem of Q-value overestimation. Specifically, when action $a_t$ is selected in state $s_t$, the policy network first calculates the predicted Q-value of the state-action pair $(s_t, a_t)$, i.e., $Q(s_t, a_t, \theta_t, \alpha_t, \beta_t)$. Then, the policy network finds the action $a_{max}$ corresponding to the maximum Q-value in state $s_{t+1}$, which is defined as follows:

[0103]

[0104] Note that once action $a_t$ has been selected in state $s_t$, the state $s_{t+1}$ can be obtained even though the transition from $s_t$ has not yet taken place. The target network then uses action $a_{max}$ and state $s_{t+1}$ to calculate the target Q-value $Y_t$ of the state-action pair $(s_t, a_t)$, which is defined as follows:

[0105]

[0106] where the network parameters in the formula are those of the target network at time t, $r_t$ is the reward obtained by taking action $a_t$ in state $s_t$, $\eta \in [0,1]$ is the discount factor, and $Y_t$ is used to evaluate the quality of taking action $a_t$ in state $s_t$. In general, $\eta$ weighs current rewards against future rewards: when $\eta$ is close to 1, the agent focuses more on long-term rewards; otherwise, the agent focuses more on immediate rewards.
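The separation of action selection from action evaluation can be sketched as follows, assuming the standard Double DQN target: the policy network's Q-values for $s_{t+1}$ choose $a_{max}$, and the target network's Q-value for that same action is discounted by $\eta$ and added to $r_t$. The function name and the input lists are hypothetical; in practice the two lists would come from forward passes of the two networks.

```python
def double_dqn_target(reward, q_policy_next, q_target_next, eta):
    """Return (a_max, Y_t) for one transition.

    q_policy_next: Q-values of the policy network for every action in s_{t+1}
    q_target_next: Q-values of the target network for every action in s_{t+1}
    """
    # Action selection: performed by the policy network.
    a_max = max(range(len(q_policy_next)), key=lambda a: q_policy_next[a])
    # Action evaluation: performed by the target network on the selected action.
    y_t = reward + eta * q_target_next[a_max]
    return a_max, y_t


a_max, y_t = double_dqn_target(
    reward=-0.1,
    q_policy_next=[0.2, 0.9, 0.1, 0.4],
    q_target_next=[1.0, 0.5, 2.0, 0.3],
    eta=0.9,
)
```

In this example the target network's own maximum (2.0, for the third action) is not used; the value 0.5 of the action chosen by the policy network is used instead, which is how the decoupling counteracts overestimation.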

[0107] Furthermore, in order to update the parameters of the policy network, a loss function based on the target Q-value and the predicted Q-value needs to be constructed, as follows:

[0108]

[0109] Then, the loss function is minimized by using gradient descent, thereby updating the parameters in the policy network.
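A minimal sketch of this update is given below. The loss formula is not reproduced in this text, so the mean squared error between target and predicted Q-values is an assumption, as is the linear stand-in for the policy network; all names are hypothetical. The target Q-values are treated as constants during the update.

```python
def predict(theta, features):
    """Linear stand-in for the policy network: one Q-value per feature vector."""
    return [sum(t * x for t, x in zip(theta, f)) for f in features]


def mse_loss(targets, predictions):
    """Mean squared error between target Q-values and predicted Q-values."""
    n = len(targets)
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / n


def gradient_step(theta, features, targets, lr):
    """One gradient-descent step on the MSE loss for the linear predictor."""
    n = len(targets)
    preds = predict(theta, features)
    grad = [0.0] * len(theta)
    for f, y, q in zip(features, targets, preds):
        for j, x in enumerate(f):
            # d/d(theta_j) of (y - q)^2 / n, with y held constant
            grad[j] += -2.0 * (y - q) * x / n
    return [t - lr * g for t, g in zip(theta, grad)]


features = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical encoded state-action pairs
targets = [1.0, 2.0]                  # hypothetical target Q-values Y_t
theta = [0.0, 0.0]

loss_before = mse_loss(targets, predict(theta, features))
theta = gradient_step(theta, features, targets, lr=0.1)
loss_after = mse_loss(targets, predict(theta, features))
```

Repeating the step moves the predicted Q-values toward the targets; only the policy network parameters are changed by it.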

[0110] The method of this invention was tested on a PC with a 2.3 GHz CPU and 64 GB of memory. The effectiveness of the proposed method was verified using six randomly generated test cases. Details of these test cases are shown in Table 1, where #C_s is the chip area occupied by the control channel network, #N_c is the number of control valves, #N_f is the number of flow valves, and #N_g is the number of synchronized valve groups.

[0111] In the DRL-based adaptive control logic routing method, a D3QN network architecture combining the characteristics of Dueling DQN and Double DQN is used as the DRL agent, thus establishing an efficient DRL framework for control logic channel routing. To verify the effectiveness of the D3QN architecture, three other network architectures, namely Dueling DQN, Double DQN, and the Deep Q-Network (DQN), are substituted into the proposed method, and their results are compared. Table 2 shows the comparison results: the total delay deviation, defined as the maximum delay deviation over all synchronous valve groups, is listed in the #T_s column; the channel bus length used in the control logic is listed in the #C_l column; and the "Optimization Rate (%)" column shows the degree of optimization of the D3QN architecture relative to the other three architectures.

[0112] As shown in Table 2, D3QN outperforms the basic DQN architecture, reducing the total delay deviation by 52.5% on average. The Double DQN architecture addresses the Q-value overestimation problem by using two independent deep neural networks (DNNs), thereby improving the learning stability of the DRL agent. Compared to Double DQN, D3QN reduces the total delay deviation by 40.0%–100.0% across all test cases, with an average reduction of 46.7%. Dueling DQN, in turn, improves the learning efficiency of the DRL agent by decomposing the single Q-value calculation in the DNN into a state-value function and an advantage function. Table 2 shows that D3QN still outperforms Dueling DQN across all test cases, with an average reduction of 43.2% in total delay deviation. The D3QN architecture also outperforms the other three architectures in control channel bus length across all test cases; in particular, compared to the DQN architecture, the control channel bus length is reduced by 10.2% on average. These results demonstrate the superior performance of the adopted D3QN architecture, primarily because it simultaneously addresses issues that the other three models may encounter during training, including the overestimation of Q-values and the inability to distinguish differences between actions. Furthermore, the number of zero-deviation groups achieved by the four architectures in each test case was compared, as shown in Figure 3, where the four data points for each test case represent, from left to right, the number of zero-deviation groups achieved by D3QN, Dueling DQN, Double DQN, and DQN, respectively. The figure shows that the D3QN architecture generally achieves more zero-deviation groups than the other three architectures, and that in the first two test cases all synchronous valve groups achieve zero deviation. This result further demonstrates the effectiveness of the D3QN architecture.
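The percentages quoted above can be read as relative reductions with respect to the compared architecture. The text does not spell out the formula behind the "Optimization Rate (%)" column, so the definition below is an assumption, and the input values are purely illustrative rather than taken from Table 2.

```python
def optimization_rate(baseline, d3qn):
    """Relative reduction (in percent) of a metric achieved by D3QN.

    Assumed definition: (baseline - d3qn) / baseline * 100.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - d3qn) / baseline * 100.0


# Illustrative values only: a deviation of 10 units reduced to 6 units,
# and a deviation of 5 units reduced to zero.
partial = optimization_rate(10.0, 6.0)
full = optimization_rate(5.0, 0.0)
```

Under this reading, a 100.0% reduction corresponds to a test case in which D3QN reaches zero total delay deviation while the compared architecture does not.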

[0113] Table 1. Detailed information on test samples

[0114]

[0115] Table 2. Comparison of the delay deviation and channel bus length of D3QN with those of Dueling DQN, Double DQN, and DQN

[0116]

Claims

1. A control logic wiring method for a fully programmable valve array biochip, characterized in that it includes the following steps:

Step 1: Define the action space. At time step t, the action $a_t$ is used to wire the control channel, so the action space consists of all possible wiring directions of the control channel and is expressed as formula (1), where $D_1$, $D_2$, $D_3$ and $D_4$ represent the four wiring directions of the control channel: north, south, west, and east. Each wiring direction $D_i$ (i = 1, 2, 3, 4) is represented by a combination of two 0-1 variables $w_1$ and $w_2$. After the agent performs an action at each time step, the control path passes through one grid cell and the path length increases by one unit.

Step 2: Define the state space. The state $s_t$ at time step t consists of four core parts: the flow valve / end point that the current control path needs to reach, the necessary point, the position currently reached by the path, and the path attributes. The overall state space is defined as formula (2), whose components are: the x and y coordinates of the flow valve / end point position; the x and y coordinates of the necessary point; the x and y coordinates of the position reached by the control path; the sequence number of the control path; the length of the current control path; an identifier indicating whether the control path needs to meet the length matching requirement; and the matching degree of the control path, i.e., the absolute value of the difference between the current control path length and the matching line length. If the control path needs to meet the length matching requirement, the identifier equals 1; otherwise, the identifier equals 0 and the matching degree is always 0.

Step 3: Define the reward function. The reward function is constructed from the following five types of points that the control path may pass through, in order to evaluate the quality of the action in each state: 1) important point IP, 2) obstacle point OP, 3) target point TP, 4) common point GP, and 5) shared point SP. The overall reward function is expressed as formula (3), where $C_{t+1}$ denotes the position reached by the control path after the action is performed at time t, $R_{TP}(C_{t+1})$ and $R_{NP}(C_{t+1})$ are the reward calculation functions for TP and GP, respectively, two of the constants are positive numbers, and one is a negative constant. If the path reaches an IP after the action is performed, a positive reward is introduced to encourage the agent to guide the path through the IP, where the IP is the control valve serving as the necessary point of the path and is updated as wiring proceeds. During the routing of a control path, all control valves in the grid other than the necessary points to be traversed are regarded as obstacle points OP, and the control path should avoid passing through OPs; if $C_{t+1}$ is an OP, the current learning iteration is terminated and a penalty is imposed on the agent.

For the case where $C_{t+1}$ is a TP, the limit $T_m$ on the maximum time delay deviation, the matching degree, and the matching-path identifier are introduced to calculate the reward, and the corresponding reward calculation function $R_{TP}(C_{t+1})$ is expressed as formula (4), where the reward terms are all positive values subject to the ordering given with the formula. The principles for calculating this reward are as follows: 1) a positive reward guides the control path to the TP to complete channel construction; 2) a further positive reward encourages the agent to construct channels for synchronous valves that meet the length matching requirement as far as possible, i.e., to minimize the time delay deviation between synchronous valves.

For the case where the point $C_{t+1}$ reached by the control path is a common point GP, the corresponding calculation function $R_{NP}(C_{t+1})$ is expressed as formula (5), where the penalty terms are all negative constants subject to the ordering given with the formula, and $mdis(C_t, E_t)$ and $mdis(C_{t+1}, E_t)$ are the Manhattan distances between the position reached by the control path and the current necessary point before and after the action is taken at time t. The principles for calculating the reward when $C_{t+1}$ is a GP are as follows: 1) if $C_{t+1}$ is closer to the necessary point than $C_t$, the agent receives a smaller penalty; 2) if the distance from $C_{t+1}$ to the necessary point is unchanged or larger compared with $C_t$, the agent receives one of the larger penalties, depending on the actual situation.

When $C_{t+1}$ is an SP, the reward is calculated by adding a positive value to the calculation function $R_{NP}(C_{t+1})$, where an SP is a point already traversed by another control path.

Step 4: Build the D3QN architecture.

Step 4-1: Based on the action space, state space, and reward function described above, construct the D3QN architecture. The constructed D3QN consists of two DNNs with the same structure, called the policy network and the target network, respectively. The policy network guides the agent to select and take actions, while the target network evaluates the quality of the actions taken. Each network consists of three layers: an input layer, a hidden layer, and an output layer.

Step 4-2: The action-value function used to calculate the Q-value of a state-action pair $(s_t, a_t)$ is decomposed into a value function V, which is related to the state, and an advantage function A, which is related to both the action and the state. This network architecture estimates V and A with two separate streams in the last fully connected layer, and finally combines the two values to output a single state-action Q-value $Q(s_t, a_t)$.

Step 4-3: For the policy network, the Q-value for taking action $a_t$ in state $s_t$ is calculated as follows: $Q(s_t, a_t, \theta_t, \beta_t, \alpha_t) = V(s_t, \theta_t, \beta_t) + A(s_t, a_t, \theta_t, \alpha_t)$ (6), where $V(s_t, \theta_t, \beta_t)$ is the output of the value function, $A(s_t, a_t, \theta_t, \alpha_t)$ is the output of the advantage function, $\theta_t$ denotes the parameters of the policy network, $\beta_t$ the parameters of the value function, and $\alpha_t$ the parameters of the advantage function. The average value of the advantage function A is used in the calculation, so the final Q-value is calculated as formula (7), in which the average is taken over the total number of actions in the action space.

Step 4-4: The two DNNs in D3QN separate action selection from action evaluation, thus effectively solving the overestimation problem of the Q-value. Specifically, when action $a_t$ is selected in state $s_t$, the policy network first calculates the predicted Q-value of the state-action pair $(s_t, a_t)$, i.e., $Q(s_t, a_t, \theta_t, \alpha_t, \beta_t)$. Then, the policy network finds the action $a_{max}$ corresponding to the maximum Q-value in state $s_{t+1}$, defined as formula (8). The target network uses action $a_{max}$ and state $s_{t+1}$ to calculate the target Q-value $Y_t$ of the state-action pair $(s_t, a_t)$, defined as formula (9), where the network parameters in the formula are those of the target network at time t, $r_t$ is the reward obtained by taking action $a_t$ in state $s_t$, $\eta$ is the discount factor, which weighs current rewards against future rewards, and $Y_t$ is used to evaluate the quality of taking action $a_t$ in state $s_t$.

Step 5: Construct a loss function based on the target Q-value and the predicted Q-value, expressed as formula (10). Then, the loss function is minimized by gradient descent, thereby updating the parameters of the policy network.

2. The control logic wiring method for a fully programmable valve array biochip according to claim 1, characterized in that the unit length is 1 mm.

3. The control logic wiring method for a fully programmable valve array biochip according to claim 1, characterized in that, when the discount factor $\eta$ approaches 1, the agent focuses more on long-term rewards; otherwise, the agent focuses more on immediate rewards.