Dynamic scheduling method and system for a container-based network security competition platform
A containerized dynamic scheduling method combined with a two-factor monitoring mechanism addresses the resource waste and performance issues of cybersecurity competition platforms under large-scale concurrent access, aiming for efficient resource management and a stable user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU CITY UNIV OF TECH
- Filing Date
- 2026-04-08
- Publication Date
- 2026-06-19
AI Technical Summary
Existing cybersecurity competition platforms suffer from low resource utilization efficiency under large-scale concurrent access, and centralized startup can easily cause performance issues. The inaccurate instance reclamation mechanism leads to inconsistent user experience and wasted resources.
A container-based dynamic scheduling method is adopted to create target machine instances on demand by receiving access requests in real time. Combined with a two-factor monitoring mechanism (access timestamp and network connection count), instances are accurately reclaimed. Containerd is used to optimize resource management and achieve lazy loading and intelligent reclamation.
It improves resource utilization efficiency, avoids startup storms, ensures user response speed and platform stability, and achieves efficient resource management in high-concurrency scenarios.
Figure CN122247724A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of network information security technology, and in particular to a dynamic scheduling method and system for a containerized network security competition platform.
Background Technology
[0002] A Capture the Flag (CTF) cybersecurity competition platform is critical infrastructure for cybersecurity training and drills. Its core function is to provide participants with a stable and isolated challenge environment, i.e., target machine instances. Existing platforms generally pre-start all target machine instances in a server cluster before a competition or course begins and keep them running throughout. This pre-start-and-persist strategy is intended to ensure immediate response for users.
[0003] However, when faced with large-scale concurrent access, such as university competitions with tens of thousands of participants or parallel teaching across multiple classes, this strategy exposes two major technical bottlenecks. First, resource utilization efficiency is low. The access popularity of competition questions exhibits a clear long-tail distribution, with many target machine instances remaining idle for most of the time, yet still continuously consuming valuable computing resources such as CPU and memory. This directly limits the platform's concurrency capacity under equivalent hardware costs. Second, concentrated startup causes performance issues. Before competitions or during peak access periods, the platform needs to process massive container creation requests in a short time, easily triggering a startup storm. This leads to image pull congestion and instantaneous preemption of node resources, resulting in timeouts or inconsistent user experience on the first visit.
[0004] Furthermore, existing technologies have shortcomings in their instance reclamation mechanisms. Simply relying on a fixed idle period, such as checking only the time of the last HTTP request, cannot accurately reflect the true usage status of an instance. A user may no longer have new page interactions but may still maintain a long-lived connection with the target instance, such as a file download or a web-shell session. Abruptly reclaiming the instance in this situation would interrupt the user's normal operations. Therefore, how to accurately and dynamically manage the entire lifecycle of target instances, maximizing resource utilization while ensuring service availability, is a pressing technical problem in this field.
Summary of the Invention
[0005] To help solve the technical problems existing in the prior art, the present invention provides a dynamic scheduling method and system for a containerized network security competition platform, which can realize on-demand supply and automatic idle reclamation of the target machine environment according to the user's real-time load, aiming to improve the utilization efficiency of computing resources in large-scale competition scenarios and reduce the impact of container cold start on system performance.
[0006] To achieve the above-mentioned objectives, this invention discloses a dynamic scheduling method for a containerized cybersecurity competition platform, comprising the following steps:
The system receives, in real time, a user's access request for a specific question and, based on the access request, queries the running status of the target machine instance associated with both the target tenant identified by the access request and the specific question.
When it is determined that the target machine instance does not exist, a Pod resource serving as the target machine instance is dynamically created based on the template preset for the specific question, and a unique access entry rule is generated synchronously for the Pod resource. The access entry rule is used to establish an access link from the user terminal to the target machine instance.
During the operation of the target machine instance, its last access timestamp is continuously collected and recorded, and the number of active network connections associated with the target machine instance and in the ESTABLISHED state is periodically counted using network probes.
The difference between the current time and the last access timestamp is calculated and compared with a preset idle time threshold in a first comparison; the counted number of active network connections is compared with a preset connection number threshold in a second comparison, wherein the connection number threshold is zero.
When the result of the first comparison is that the difference exceeds the idle time threshold, and the result of the second comparison is that the number of active network connections is equal to the connection number threshold, a delayed recycling task is triggered. After a preset delay time expires, the delayed recycling task performs a re-examination, which includes re-executing the first comparison and the second comparison. When the result of the re-examination is that the newly calculated difference still exceeds the idle time threshold, and the newly counted number of active network connections is still equal to the connection number threshold, the Pod resource is gracefully terminated and cleaned up.
[0007] It is understood that the dynamic scheduling method for a containerized cybersecurity competition platform according to this invention uses actual user access requests as driving events to achieve on-demand allocation and intelligent recycling of resources. When the system frontend receives a user's access request for a specific question in real time, the scheduling process is triggered. The system first determines the corresponding target tenant based on the user's identity or team affiliation contained in the access request. It then identifies the specific question the request points to and queries the current running status of the target machine instance uniquely associated with that "tenant-question" combination. Here, the target machine instance is typically implemented as one or a group of Pod resources running in a container orchestration cluster (such as Kubernetes).
[0008] If the query results indicate that the corresponding target instance is not yet running or does not exist, the system executes lazy loading creation logic. This step is the core of implementing on-demand resource allocation, avoiding the resource waste caused by pre-starting all challenges in the traditional model. The system dynamically creates a new Pod resource based on a template pre-set for that specific challenge, containing information such as container images, environment variables, and resource quotas. Simultaneously with Pod resource creation, to ensure user access is accurately routed to this newly created instance, the system synchronously generates a unique access entry rule, such as an Ingress rule. This rule maps a unique, publicly accessible URL (which may contain specific subdomains or paths) to the service port exposed by the newly created Pod resource, thus successfully establishing an end-to-end access link from the user to the target instance.
[0009] After the target instance is successfully created and put into operation, the system immediately initiates a two-factor parallel monitoring mechanism to accurately measure the instance's true activity level. The first factor is high-level application activity monitoring, where the system continuously collects and updates the timestamp of the instance's last valid access, typically associated with the arrival of an HTTP request. Simultaneously, the second factor is low-level network connection monitoring, where the system periodically counts the number of active TCP network connections associated with the instance and in the ESTABLISHED state using network probes deployed within the instance (e.g., in the form of sidecar containers). Introducing the number of active network connections as the second determining factor is crucial, as it compensates for the shortcomings of relying solely on access timestamps, accurately identifying scenarios where long-lived connections such as file transfers or persistent sessions still exist even without new user page actions.
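The patent does not say how the network probe obtains its count. As a minimal sketch, assuming a Linux probe that shares the Pod's network namespace and reads text in the format of `/proc/net/tcp`, the count can be derived from the `st` column, where the hexadecimal value `01` denotes ESTABLISHED. The function name `countEstablished` and the sample data are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// countEstablished counts rows whose state column ("st", the fourth field)
// is 01, the kernel's code for TCP_ESTABLISHED, in text formatted like
// /proc/net/tcp. This is an assumed probe implementation, not the patent's.
func countEstablished(procNetTCP string) int {
	n := 0
	for i, line := range strings.Split(procNetTCP, "\n") {
		if i == 0 {
			continue // skip the header row
		}
		// Fields: sl, local_address, rem_address, st, ...
		fields := strings.Fields(line)
		if len(fields) > 3 && fields[3] == "01" {
			n++
		}
	}
	return n
}

// sample mimics /proc/net/tcp: one LISTEN (0A) socket and two ESTABLISHED (01).
const sample = `  sl  local_address rem_address   st tx_queue rx_queue
   0: 00000000:1F90 00000000:0000 0A 00000000:00000000
   1: 0A00020F:1F90 0A000202:C350 01 00000000:00000000
   2: 0A00020F:1F90 0A000203:C351 01 00000000:00000000
`

func main() {
	fmt.Println("established connections:", countEstablished(sample))
}
```

A real probe would read the file contents (and `/proc/net/tcp6` for IPv6) on each sampling interval and might filter by local port, so that only connections to the challenge service are counted.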
[0010] Accordingly, during the recycling decision phase, the system matches the two monitoring factors mentioned above with preset thresholds. The system continuously calculates the difference between the current time and the recorded last access timestamp, and compares it with a preset idle time threshold. Simultaneously, the system compares the count of active network connections with a preset connection count threshold (usually zero). Only when both conditions—"idle time exceeds the threshold" and "active connection count is zero"—are met simultaneously, will the system preliminarily determine that the target machine instance has truly entered an idle state and trigger a delayed recycling task.
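The two-factor decision described above reduces to a conjunction of two comparisons. The sketch below is illustrative only: the type `Sample`, the function `isIdle`, and the 30-minute threshold are assumptions, since the patent gives no names and no concrete threshold value.

```go
package main

import (
	"fmt"
	"time"
)

// idleThreshold is an assumed value; the patent only says "preset".
const idleThreshold = 30 * time.Minute

// Sample is one reading of the two monitored factors for an instance.
type Sample struct {
	LastAccess  time.Time // timestamp of the last valid access
	ActiveConns int       // TCP connections in the ESTABLISHED state
}

// isIdle reports whether both conditions hold: the time since the last
// access exceeds the threshold AND there are zero active connections.
func isIdle(s Sample, now time.Time, threshold time.Duration) bool {
	return now.Sub(s.LastAccess) > threshold && s.ActiveConns == 0
}

func main() {
	now := time.Now()
	// Last request 45 minutes ago, but a download is still open:
	// the instance must not be treated as a reclamation candidate.
	s := Sample{LastAccess: now.Add(-45 * time.Minute), ActiveConns: 1}
	fmt.Println("reclaim candidate:", isIdle(s, now, idleThreshold))
}
```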
[0011] This delayed reclamation and re-check mechanism is a crucial element in ensuring system stability. After triggering a delayed reclamation task, the system does not immediately perform cleanup operations but waits for a preset delay time. This debounce design aims to avoid erroneous reclamation caused by brief network fluctuations or extremely short user interruptions. After the delay time expires, the system performs a re-check, completely repeating the aforementioned two-factor decision-making and matching process, recalculating the time difference, and re-counting the number of active connections. Only when the re-check result still meets the conditions of exceeding the idle time limit and having zero active connections does the system finally confirm the reclamation decision and perform graceful termination and cleanup of the Pod resources corresponding to the target machine instance. The graceful termination process ensures that the instance has sufficient time to complete log recording, state saving, and other cleanup tasks before exiting, thus guaranteeing the smoothness and reliability of the entire reclamation process.
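The delayed re-check can be sketched as a function that waits, re-runs the idle test, and terminates only if the instance is still idle. This is a simplified illustration under stated assumptions: `reclaimAfterRecheck` is a hypothetical name, the two callbacks stand in for the real two-factor check and the graceful Pod termination, and a production scheduler would more likely use a cancellable timer than a blocking sleep.

```go
package main

import (
	"fmt"
	"time"
)

// reclaimAfterRecheck waits for the delay, re-runs the idle check, and calls
// terminate only if the instance is still idle. It returns whether the
// instance was reclaimed.
func reclaimAfterRecheck(delay time.Duration, stillIdle func() bool, terminate func()) bool {
	time.Sleep(delay)
	if !stillIdle() {
		return false // activity resumed during the delay; keep the instance
	}
	terminate()
	return true
}

func main() {
	idle := true // stand-in for re-running both comparisons
	reclaimed := reclaimAfterRecheck(50*time.Millisecond,
		func() bool { return idle },
		func() { fmt.Println("gracefully terminating Pod") })
	fmt.Println("reclaimed:", reclaimed)
}
```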
[0012] To implement the above-mentioned dynamic scheduling method, this invention also provides a containerized network security competition platform dynamic scheduling system, including a processor and a memory storing computer-executable instructions. The processor executes the computer-executable instructions, enabling the system to implement the dynamic scheduling method of this invention through several cooperative modules; the several cooperative modules include:
The application service module is configured to handle question management, perform tenant authentication on the access requests, and receive and verify the Flag submitted by the user in response to the specific question.
The scheduling control module is configured to respond to the tenant authentication result of the application service module, interact with the API of the container orchestration cluster deployed in the system, execute the steps in the dynamic scheduling method except for counting the number of active network connections, and synchronously generate the unique access entry rule in the dynamic scheduling method.
The container runtime module is configured to use the containerd native container runtime to receive and execute instructions from the scheduling control module regarding the lifecycle management of the target machine instance.
The security isolation and monitoring module includes a network policy distributor, a resource statistics probe, and a role-based access control (RBAC) permission manager. The network policy distributor is configured to configure network policies for the independent namespace of the target tenant. The resource statistics probe is configured to execute the step of counting active network connections in the dynamic scheduling method. The RBAC permission manager is configured to restrict the permissions of the target tenant to its corresponding independent namespace.
The ingress gateway module includes an Ingress controller and a TLS automation management component; the Ingress controller is configured to dynamically generate Ingress rules containing unique subdomains or unique paths based on the unique identifier of the target tenant and a specific question, and the TLS automation management component is configured to automatically apply for, mount, and renew TLS certificates based on the unique subdomains in the Ingress rules.
[0013] It is understood that the containerized network security competition platform dynamic scheduling system of the present invention aims to implement the above-mentioned dynamic scheduling method. The system consists of a processor and memory, and its core logic is implemented by running computer-executable instructions in the memory. These instructions concretize the system into an organic whole composed of several functionally defined, loosely coupled, and cooperative modules, which jointly execute the aforementioned dynamic scheduling method.
[0014] Specifically, the application service module, as the business logic hub of the system, directly faces the user. It is responsible not only for managing all question information on the platform but also for the crucial task of authenticating user identities. When an access request arrives, the application service module first performs tenant authentication to confirm the user's legitimate identity and their associated tenant (e.g., participating team). During the user's problem-solving process, this module is also responsible for receiving and verifying the Flags submitted by the user in response to specific questions to complete subsequent business processes such as scoring.
[0015] After tenant authentication is completed, the application service module transmits the authentication result and the user's access intent to the scheduling control module. The scheduling control module is the core executor of the entire dynamic scheduling algorithm, acting as a bridge between the upper-layer business logic and the underlying container infrastructure. Upon receiving the instruction, the scheduling control module first queries the target machine instance status. If the instance does not exist, it responds to the authentication result and interacts with the API of the container orchestration cluster (such as Kubernetes) deployed in the system to execute a series of key steps, including Pod resource creation, recycling decisions, and triggering delayed recycling tasks. It is worth noting that this module's responsibilities are precisely defined; it focuses on the macro-management of resource lifecycles, rather than directly participating in specific micro-operations handled by other specialized modules, such as network connection statistics.
[0016] The lifecycle management instructions issued by the scheduling control module are ultimately executed by the container runtime module. To improve startup performance under high concurrency and reduce call chain overhead, this container runtime module is specifically configured to use containerd as the native container runtime. It directly receives instructions from the scheduling control module, such as "pull an image and create a container" or "terminate a container," and efficiently completes these low-level operations. This direct-connection instruction execution mode bypasses the bottleneck of the traditional Docker Daemon and is key to ensuring high-performance system response.
[0017] Meanwhile, the security isolation and monitoring module provides a stable, fair, and observable operating environment for the entire platform. This module integrates multiple sub-components. Among them, the resource statistics probe is specifically responsible for performing the step of counting active network connections, continuously feeding back accurate instance activity data to the scheduling control module as an important basis for reclamation decisions. The network policy distributor dynamically creates and configures a dedicated independent namespace and network policy for each target tenant during instance creation, thereby achieving strict east-west traffic isolation. Correspondingly, the role-based access control (RBAC) permission manager provides defense from the perspective of management permissions, strictly restricting the target tenant's operational permissions to its corresponding independent namespace, preventing any form of unauthorized access or cross-tenant interference.
[0018] Finally, the ingress gateway module manages all external access traffic to all target machine instances. When the scheduling control module creates a new target machine instance, the Ingress controller within this module synchronously and dynamically generates an Ingress rule containing a unique subdomain or path. This not only enables accurate routing for different tenants and different challenges but also supports the simulation of complex real-world network topologies. Following this, the TLS automation management component within this module automatically completes the application, mounting, and renewal of TLS certificates based on the subdomain information in the newly generated Ingress rule. This ensures end-to-end encryption of all access traffic and effectively reduces the operational complexity of large-scale certificate management.
[0019] Through the precise division of labor and seamless collaboration among the above modules, the system proposed in this invention can construct a complete technical closed loop from user request triggering to on-demand resource creation, to precise status monitoring, and finally to intelligent recycling and secure isolation.
[0020] The technical effects of the containerized network security competition platform dynamic scheduling method and system of the present invention include: First, by employing lazy-loading on-demand creation and precise two-factor idle reclamation, the resource occupancy period of target machine instances is highly matched with the actual usage period of users. This effectively reduces the waste of computing resources caused by a large number of instances remaining idle for extended periods, thereby significantly improving the overall concurrency capacity and resource utilization efficiency of the platform under the same hardware scale.
[0021] Secondly, it can effectively avoid the startup storm problem caused by centralized startup in high-concurrency scenarios of traditional platforms. Since the creation of target machine instances is distributed at the moment of each user's first access, the cluster load is naturally smoothed out, avoiding instantaneous resource preemption and request queuing, thereby ensuring the response speed of user access and the overall service quality of the platform.
[0022] Finally, the two-factor recycling logic combined with the delayed re-examination mechanism can achieve accurate judgment of the target machine instance's activity status. By simultaneously considering application layer activity (access timestamp) and network layer connection status (number of active connections), and supplemented by anti-jitter buffering, this invention can effectively avoid interrupting normal user operations (such as long connection sessions) due to misjudgment, ensuring the stability and reliability of the competition or training environment while achieving efficient resource recycling.
Attached Figure Description
[0023] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0024] Figure 1 is a schematic diagram of the dynamic scheduling method according to an embodiment of the present invention; Figure 2 is a schematic diagram of the competition platform system architecture according to an embodiment of the present invention; Figure 3 is a schematic diagram of multi-tenant network isolation traffic control according to an embodiment of the present invention; Figure 4 is a schematic diagram of the user login interface of a competition platform according to an embodiment of the present invention.
Detailed Implementation
[0025] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.
[0026] This embodiment provides a containerized network security competition platform dynamic scheduling system, which aims to achieve efficient and dynamic management of competition target machine instance resources. In some embodiments, the system physically includes a processor and a memory storing computer-executable instructions. By executing the computer-executable instructions, the processor enables the system to logically present a complete architecture composed of several cooperating modules, thereby achieving more efficient execution of dynamic scheduling methods.
[0027] For a more intuitive understanding of the system architecture of this embodiment, refer to Figure 2, which is a schematic diagram of a competition platform system architecture according to an embodiment of the present invention. The dynamic scheduling system can be logically divided into a user interaction layer, an application service layer, a data management layer, a container orchestration layer, a security monitoring layer, and an automated operation and maintenance layer. This system achieves the technical solution of the present invention through several cooperating functional modules, which will be detailed below. Furthermore, Figure 4 is a schematic diagram of the user login interface of a competition platform according to an embodiment of the present invention. Users log in through the interface shown in Figure 4 to enter the platform, initiate a challenge on a specific competition question, and thereby trigger the series of dynamic scheduling processes executed by the system of this invention.
[0028] In one specific embodiment, the aforementioned collaborative modules specifically include: an application service module, a scheduling control module, a container runtime module, a security isolation and monitoring module, and an ingress gateway module.
[0029] The aforementioned application service module serves as the core of the system's business logic processing and directly faces the end user. This application service module is configured to handle the management of all competition questions within the platform, such as the publishing, editing, and removal of questions. When a user initiates an access request through the interface shown in Figure 4, the application service module is also responsible for tenant authentication to verify the user's identity and determine their tenant (e.g., a participating team). After the user solves a question, the application service module is also responsible for receiving and verifying the Flag submitted by the user in response to that specific question, in order to complete subsequent business such as scoring and ranking updates.
[0030] Regarding the aforementioned scheduling control module, it serves as the central hub connecting upper-layer business logic and underlying infrastructure. It is configured to respond to the tenant authentication results of the application service module. Upon receiving an authenticated and legitimate request to access or create a target instance, the scheduling control module immediately interacts with the API of the container orchestration cluster (e.g., a Kubernetes cluster) deployed by the system. By calling this API, the scheduling control module is responsible for executing most of the core logic steps in the dynamic scheduling method, such as querying instance status, creating Pod resources based on templates, executing reclamation decisions, and managing delayed reclamation tasks. Simultaneously, during the Pod resource creation process, the scheduling control module is also responsible for synchronously generating the unique access entry rule in the dynamic scheduling method.
[0031] The aforementioned container runtime module is the direct execution unit for underlying container lifecycle management. To address performance challenges in high-concurrency scenarios, this container runtime module is configured to use containerd as the native container runtime. This configuration enables the container runtime module to receive and execute direct instructions from the scheduling control module regarding the lifecycle management of the target machine instance, such as image pulling, container startup, and shutdown. By employing a native link, the performance overhead of multi-level calls in traditional solutions is avoided, thereby significantly improving the creation speed of the target machine instance and the system's responsiveness.
[0032] Furthermore, to ensure fairness and system stability in a multi-tenant environment, the system in this embodiment also includes the aforementioned security isolation and monitoring module. This module comprises a network policy distributor, a resource statistics probe, and a role-based access control (RBAC) permission manager. The network policy distributor is configured to configure network policies for the target tenant's independent namespace to achieve strict network isolation. The resource statistics probe serves as the data source for reclamation decisions and is configured to execute the step of counting active network connections in the dynamic scheduling method. The RBAC permission manager provides protection at the management permission level and is configured to restrict the target tenant's permissions to its corresponding independent namespace to prevent unauthorized operations.
[0033] Finally, to achieve flexible access control for large-scale, dynamically changing target machine instances, the system in this embodiment also includes an ingress gateway module. This module internally includes an Ingress controller and a TLS automated management component. The Ingress controller is configured to dynamically generate Ingress rules containing unique subdomains or unique paths based on the unique identifiers of the target tenant and specific questions. The TLS automated management component works in conjunction with the Ingress controller; it is configured to automatically complete the application, mounting, and renewal of TLS certificates based on the unique subdomains in the Ingress rules, thereby achieving end-to-end HTTPS encrypted access and eliminating tedious manual certificate maintenance.
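The patent does not specify how the unique subdomain is derived from the tenant and question identifiers. One possible scheme, shown here purely as an assumption, hashes the pair into a short DNS-safe label so that arbitrary identifiers cannot produce an invalid or colliding hostname. The function `ingressHost`, the `q-` prefix, and the base domain `ctf.example.com` are all hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ingressHost derives a per-"tenant-question" hostname. The NUL separator
// keeps ("a", "bc") and ("ab", "c") from hashing to the same value.
func ingressHost(tenantID, questionID, baseDomain string) string {
	sum := sha256.Sum256([]byte(tenantID + "\x00" + questionID))
	// 8 bytes -> 16 hex characters; with the prefix this stays well under
	// the 63-character limit for a single DNS label.
	label := "q-" + hex.EncodeToString(sum[:8])
	return label + "." + baseDomain
}

func main() {
	fmt.Println(ingressHost("team-7", "web-101", "ctf.example.com"))
}
```

The resulting hostname would then be placed in the `host` field of the dynamically generated Ingress rule, and the same name is what the TLS automation component would request a certificate for.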
[0034] Furthermore, this embodiment also provides a dynamic scheduling method for a containerized cybersecurity competition platform, which is executed through the aforementioned dynamic scheduling system. Figure 1 is a schematic flowchart of the method according to an embodiment of the present invention. The method aims to achieve on-demand supply and intelligent recycling of target machine instance resources through a precise set of lifecycle management steps. Specifically, the method of this embodiment includes the following steps:
Step 1: The system receives user access requests for specific questions in real time. These requests contain key information to identify the user and their intent. The system then queries the running status of the target machine instance associated with the target tenant identified by the access request and the specific question. This query aims to determine whether the target machine instance representing the target "tenant-question" combination is already running in the container orchestration cluster.
[0035] In some preferred embodiments, to address the risk of resource contention leading to duplicate creation due to multiple concurrent requests simultaneously detecting the absence of a target machine instance in high-concurrency scenarios, this invention introduces a concurrency safety control mechanism before executing the creation step. Specifically, when it is determined that the target machine instance does not exist, before formally proceeding to step 2 to execute the creation step, this method further includes the following concurrency control steps: The system first generates a unique key based on the identifier of the target tenant and the identifier of the specific question. For example, the tenant ID and the question ID can be concatenated into a string.
[0036] Subsequently, the system uses the unique key to request and acquire a distributed lock from a distributed coordination service (such as etcd or ZooKeeper). Following this, subsequent creation steps are contingent upon successfully acquiring the distributed lock. If the current request successfully acquires the lock, the creation of the target instance continues; if acquisition fails, it indicates that another request is already creating the instance, and the current request waits or is informed that creation is in progress.
[0037] Finally, after the creation step is completed, that is, after the Pod resources and access entry rules have been successfully generated and updated, the system releases the distributed lock to allow other processes to access it.
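The concurrency control steps above can be sketched as follows. To keep the example self-contained, the lock is an in-process stand-in; the patent names etcd or ZooKeeper as the distributed coordination service, and a real deployment would implement the same interface on top of one of them. The names `Locker`, `memLocker`, `lockKey`, and `createOnce`, as well as the key prefix, are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Locker abstracts the distributed lock (assumed interface).
type Locker interface {
	TryLock(key string) bool
	Unlock(key string)
}

// memLocker is an in-memory stand-in for an etcd- or ZooKeeper-backed lock.
type memLocker struct {
	mu   sync.Mutex
	held map[string]bool
}

func newMemLocker() *memLocker { return &memLocker{held: map[string]bool{}} }

func (l *memLocker) TryLock(key string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.held[key] {
		return false
	}
	l.held[key] = true
	return true
}

func (l *memLocker) Unlock(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.held, key)
}

// lockKey concatenates the tenant and question identifiers into a unique key.
func lockKey(tenantID, questionID string) string {
	return "instance-create/" + tenantID + "/" + questionID
}

// createOnce runs create only if the lock for this pair is acquired, and
// releases the lock afterwards. It reports whether this caller did the work.
func createOnce(l Locker, tenantID, questionID string, create func() error) (bool, error) {
	key := lockKey(tenantID, questionID)
	if !l.TryLock(key) {
		return false, nil // another request is already creating the instance
	}
	defer l.Unlock(key)
	return true, create()
}

func main() {
	l := newMemLocker()
	ok, _ := createOnce(l, "team-7", "web-101", func() error {
		// A second request for the same pair arrives while creation is running.
		again, _ := createOnce(l, "team-7", "web-101", func() error { return nil })
		fmt.Println("second request acquired lock:", again)
		return nil
	})
	fmt.Println("first request acquired lock:", ok)
}
```

A caller that fails to acquire the lock would either wait and re-query the instance status or report that creation is in progress, as the text describes.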
[0038] It is understood that by introducing the aforementioned distributed lock mechanism, this invention can ensure that the creation process of any target machine instance with a "tenant-question" combination is atomic and unique at any given time. This mechanism can solve the creation storm problem in high-concurrency environments, avoid resource duplication and waste caused by race conditions, guarantee the consistency of system state, and thus significantly enhance the stability and reliability of the platform under high load conditions.
[0039] Step 2: When it is determined that the target instance does not exist, the system enters the on-demand creation phase to avoid unnecessary resource reservations. This phase dynamically creates the Pod resource as the target instance based on a preset template for the specific question. This preset template encapsulates all the configurations required to start the target instance, including container images, environment variables, and resource requirements. To ensure that the newly created instance can be accessed by users, the system synchronously generates a unique access entry rule for the Pod resource while creating it. In some preferred embodiments, this access entry rule is a dynamically generated Ingress rule, which establishes an access link from the user to the target instance.
[0040] In some preferred embodiments, to improve the efficiency of distributing container images among cluster nodes and to shorten the cold start time of target machine instances, the template preset for the specific question in step 2 is specially designed. Specifically, the template is a container image generated by a layered image building strategy. This strategy includes the following steps. First, the common dependencies, library files, and system environment required by the competition platform are pre-built and solidified into one or more stable base image layers. For example, an environment containing a basic operating system (such as Alpine Linux), commonly used system libraries (such as libc), and a specific language runtime (such as Go 1.22) can be packaged into a base image.
[0041] Subsequently, when constructing the specific question image, a multi-stage construction process is adopted. In subsequent stages, the base image layer is reused, and the business modules, applications, and configuration files related to the specific question are loaded into a new upper-level image layer. In this way, a minimized running image containing the base image layer and the upper-level image layer can be generated, and this image is the preset template.
[0042] To illustrate the layered image building strategy in this embodiment more clearly, an exemplary Dockerfile is provided below as Code Example 1.
Code Example 1: A Layered Dockerfile
[0001] # ---------- Layer 1: build dependencies ----------
[0002] FROM golang:1.22-alpine AS deps
[0003] WORKDIR /src
[0004] COPY go.mod go.sum ./
[0005] RUN go mod download
[0006]
[0007] # ---------- Layer 2: build application ----------
[0008] FROM golang:1.22-alpine AS builder
[0009] WORKDIR /src
[0010] COPY --from=deps /go/pkg /go/pkg
[0011] COPY . .
[0012] RUN CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o /out/ctfd-gin ./main.go
[0013]
[0014] # ---------- Layer 3: minimal runtime image ----------
[0015] FROM alpine:3.20
[0016] WORKDIR /app
[0017] COPY --from=builder /out/ctfd-gin /app/ctfd-gin
[0018] COPY configs/ /app/configs/
[0019] ENTRYPOINT ["/app/ctfd-gin"]
In Code Example 1 above, lines [0002]-[0005] define the first stage (deps), which only downloads the project dependencies; the result is solidified into a separate, infrequently changing dependency image layer. In the second stage (builder), lines [0008]-[0012], the dependency cache downloaded in the previous stage is reused through the COPY --from=deps directive in line [0010], and the final binary application is compiled. Finally, the third stage, lines [0015]-[0019], starts from a minimal base image (alpine:3.20) and copies in only the binary compiled in the second stage and the necessary configuration files, producing a minimal runtime image that contains no compilation tools or source code.
[0043] It is understandable that the aforementioned layered image building strategy achieves several technical effects. First, since common dependencies are embedded in a widely reusable base image layer, when multiple questions share the same basic environment, this base layer only needs to be pulled once on each cluster node, reducing network transmission overhead and storage consumption. Second, when the business logic of a question is updated, only the upper image layer containing the business code needs to be rebuilt. The build can reuse the unchanged lower-level cache, which speeds up image construction. Finally, the minimized runtime image generated by this strategy is smaller, which directly shortens the image pull time for target machine instances during cold starts, giving users lower access latency and a better platform experience.
[0044] Furthermore, to ensure the stable operation of the platform and the fair allocation of resources, and to prevent a single target instance with abnormal behavior or under attack from excessively consuming node resources and affecting other tenants, resource constraint configurations are also included in the templates preset for specific questions in some specific application scenarios. Specifically, the resource constraint configuration pre-sets the CPU and memory request values (requests) and limit values (limits) for each container within the Pod resources to be created based on the template.
[0045] Below is an example Kubernetes Deployment YAML snippet containing the resource constraint configuration, as part of the template, shown as Configuration Example 2.
Configuration Example 2: A Deployment Resource Strategy
[0001] apiVersion: apps/v1
[0002] kind: Deployment
[0003] metadata:
[0004]   name: ctfd-gin
[0005] spec:
[0006]   template:
[0007]     spec:
[0008]       containers:
[0009]         - name: ctfd-gin
[0010]           image: your-registry/ctfd-gin:latest
[0011]           resources:
[0012]             requests:
[0013]               cpu: "100m"
[0014]               memory: "128Mi"
[0015]             limits:
[0016]               cpu: "500m"
[0017]               memory: "512Mi"
In Configuration Example 2 above, the resources field in lines [0011]-[0017] is the resource constraint configuration. The request value `requests` defines the minimum amount of resources guaranteed to the container at scheduling time. As shown in lines [0013]-[0014], `cpu: "100m"` and `memory: "128Mi"` declare to the scheduler of the container orchestration cluster that the container must be scheduled to a node with at least 0.1 CPU cores and 128MiB of available memory. The limit value `limits` defines the maximum amount of CPU and memory that the container is allowed to consume at runtime. As shown in lines [0016]-[0017], `cpu: "500m"` and `memory: "512Mi"` constrain the container's CPU usage to at most 0.5 cores and its memory usage to at most 512MiB at runtime.
[0046] It is understood that by pre-setting the above resource constraint configuration in the template, this invention can achieve a dual purpose of resource governance. On the one hand, the setting of the `requests` value ensures that each target instance can obtain the necessary minimum resources to guarantee its normal operation when it is created, avoiding startup failure or poor performance due to insufficient resources, thereby guaranteeing basic quality of service (QoS). On the other hand, the setting of the `limits` value sets a strict upper limit on the resource consumption of each instance. This can effectively prevent a single instance from seizing node resources without restriction due to program errors, memory leaks, or resource exhaustion attacks, thereby ensuring the stable operation of other target instances on the same node and even the entire cluster.
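To make the units in the resource constraint configuration concrete, the following sketch converts the quantity strings into numbers and checks that each request does not exceed its limit. It handles only the millicore suffix and the binary memory suffixes Ki, Mi, and Gi, which is a deliberate simplification of the full Kubernetes quantity syntax; the function names are illustrative and not part of the original disclosure.

```python
def parse_cpu(quantity):
    # "500m" means 500 millicores; a plain number means whole cores.
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


# Binary suffixes only; other Kubernetes suffixes are not handled here.
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_memory(quantity):
    # Returns the quantity in bytes.
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[:-len(suffix)]) * factor
    return int(quantity)


def requests_within_limits(resources):
    """True when every request value is no larger than its limit value."""
    req, lim = resources["requests"], resources["limits"]
    return (
        parse_cpu(req["cpu"]) <= parse_cpu(lim["cpu"])
        and parse_memory(req["memory"]) <= parse_memory(lim["memory"])
    )
```

A check of this kind could run when a template is registered, so that an inconsistent template is caught before any Pod is created from it.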
[0047] In another specific application scenario, to further optimize the creation efficiency of Pod resources in step 2, this invention specifically defines the creation method. In this embodiment, a scheduling control module from the aforementioned system embodiment is used to execute the step of dynamically creating Pod resources as target machine instances. Furthermore, this creation step specifically includes the following technical details: First, the scheduling control module adopts the native container runtime interface (CRI) based on containerd.
[0048] Accordingly, during creation, the scheduling control module no longer communicates through the traditional, longer call chain of the Docker daemon. Instead, it sends instructions for image pulling and container lifecycle management directly to a containerd instance via the gRPC interface or the native container runtime interface (CRI) to create the Pod resource. The containerd instance here is the core component that directly manages the container lifecycle.
[0049] It is understandable that by adopting the creation method based on the containerd native interface, this invention aims to solve the performance bottleneck problem caused by the use of traditional container management tools such as Docker Daemon in existing technologies. This technical approach shortens the call chain from scheduling decision to container creation by bypassing the daemon process, reducing unnecessary intermediate overhead. This optimization is particularly critical in high-concurrency scenarios, significantly reducing the cold start latency of Pod resources and reducing container creation queuing or failures caused by daemon process overload, thereby ensuring that the platform can still provide stable services with low latency and high throughput even when a large number of users access the platform simultaneously.
[0050] In some embodiments, to achieve strict security isolation in a multi-tenant competition environment to ensure the fairness of the competition and prevent malicious interference, the step of dynamically creating Pod resources as the target machine instance in step 2 further includes the implementation of the following isolation mechanism: First, the system creates an independent namespace for the target tenant in the container orchestration cluster. This namespace provides a logical boundary for all resources of the tenant (including the target machine instance created subsequently).
[0051] Subsequently, the system creates the Pod resource within this independent namespace. Through this operation, target machine instances from different tenants may physically run on the same node, but logically they are clearly divided into their own independent and non-interfering resource domains.
[0052] Building upon this, the system further configures a network policy for this independent namespace to achieve fine-grained traffic access control. Specifically, this network policy, by default, denies all incoming and outgoing east-west traffic. Here, "east-west traffic" refers to network communication between different Pods within the same cluster. Correspondingly, based on the access entry rules, the network policy allows northbound traffic from designated entry gateways implementing the access entry rules to access the Pod resources, using a whitelist approach. "Northbound traffic" refers to traffic entering the cluster from outside the cluster (e.g., a user's browser) to Pods within the cluster.
[0053] To illustrate the multi-tenant network isolation and its traffic control effect in this embodiment more specifically, refer to Figure 3, a schematic diagram of multi-tenant network isolation traffic control according to an embodiment of the present invention. As shown in Figure 3, legitimate access requests from external users (northbound traffic) first reach the Ingress controller, which routes the traffic to the Pod within the namespace of the target tenant (Team A) based on the requested hostname (e.g., host: webA.ctf.local). Meanwhile, the network policy (NetworkPolicy) deployed in the independent namespace of each tenant (Team A, Team B, Team C) strictly restricts lateral communication (east-west traffic) between Pods in different namespaces. For example, any access attempt from a Pod in Team B to a Pod in Team A is rejected by this network policy.
[0054] It is understandable that, through the aforementioned isolation mechanism based on namespaces and network policies, this invention achieves strong security in a multi-tenant environment. This mechanism not only logically delineates resource ownership, but more importantly, it constructs a robust firewall at the network layer, effectively preventing potential security threats such as cross-team target machine instance access, unauthorized data sniffing, and lateral movement attacks. This whitelist strategy of default denial and on-demand permission ensures that only expected and legitimate traffic can reach the target machine instances, thereby providing a fair, stable, and secure competition environment for all participants.
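The default-deny and whitelist policies described above might be generated per tenant namespace as in the sketch below. The manifests use the standard `networking.k8s.io/v1` NetworkPolicy fields, but the policy names, the namespace names, and the choice to select the gateway by its namespace label are assumptions made for illustration.

```python
def default_deny_policy(namespace):
    """Deny all incoming and outgoing traffic for every Pod in the namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        # An empty podSelector selects all Pods; no rules means nothing is allowed.
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_gateway_policy(namespace, gateway_namespace):
    """Whitelist traffic that arrives from the ingress gateway's namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-ingress-gateway", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Ingress"],
            "ingress": [{
                "from": [{
                    "namespaceSelector": {
                        # Assumes the gateway namespace carries this standard label.
                        "matchLabels": {"kubernetes.io/metadata.name": gateway_namespace}
                    }
                }]
            }],
        },
    }
```

For example, `default_deny_policy("team-a")` and `allow_gateway_policy("team-a", "ingress-nginx")` together would leave Pods in `team-a` reachable only through the gateway, where both namespace names are placeholders.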
[0055] In some embodiments, to support realistic simulation of complex web target topologies and provide a clear, isolated access point for each dynamically created target instance, the step of synchronously generating unique access point rules described in step 2 needs further refinement. This step specifically includes: First, the system maintains a preset Ingress routing template in the platform configuration, containing hostname and path variables. For example, a template can be defined as {{.TenantID}}-{{.ChallengeID}}.ctf.example.com, where {{.TenantID}} and {{.ChallengeID}} are the variables in the hostname.
[0056] When it is necessary to generate access entry rules, the system first obtains the unique identifier of the target tenant and the unique identifier of the specific question. The unique identifier here can be the tenant's ID, the abbreviation of the team name, or the unique number of the question, etc.
[0057] Subsequently, the system uses the unique identifier of the target tenant and the unique identifier of the specific question to populate the hostname and / or path variables in the Ingress routing template, respectively. Through this population process, the system dynamically generates an Ingress rule containing a unique subdomain or a unique path for the target instance, and finally applies this rule as the unique access entry rule to the cluster. For example, for a tenant with ID team01 accessing a question with ID web101, the system will generate an Ingress rule with the hostname team01-web101.ctf.example.com pointing to the service address of the target instance based on the aforementioned template.
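The template population step can be sketched as a plain string substitution. The template string and the identifiers team01 and web101 come from the example above; the function names, the service name, and the port are hypothetical, and the rule layout follows the standard Kubernetes Ingress rule structure.

```python
HOST_TEMPLATE = "{{.TenantID}}-{{.ChallengeID}}.ctf.example.com"


def render_host(template, tenant_id, challenge_id):
    """Fill the hostname variables of the Ingress routing template."""
    return (
        template.replace("{{.TenantID}}", tenant_id)
        .replace("{{.ChallengeID}}", challenge_id)
    )


def build_ingress_rule(host, service_name, service_port):
    """Build one Ingress rule that routes the unique host to the instance's service."""
    return {
        "host": host,
        "http": {
            "paths": [{
                "path": "/",
                "pathType": "Prefix",
                "backend": {
                    "service": {"name": service_name, "port": {"number": service_port}}
                },
            }]
        },
    }


host = render_host(HOST_TEMPLATE, "team01", "web101")
rule = build_ingress_rule(host, "team01-web101-svc", 8080)  # service name and port assumed
```

Because the identifiers end up inside a DNS name, a production version would likely also validate them against DNS label rules before substitution.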
[0058] It is understandable that, through the aforementioned template-based dynamic route generation mechanism, this invention can automatically create an isolated and unique access point for each target machine instance with a "tenant-question" combination. This method not only greatly simplifies the entry point configuration work for hundreds or thousands of target machine instances in large-scale competitions, but more importantly, it can flexibly simulate complex web topologies in the real world, including multiple subdomains, multiple service links, and front-end / back-end separation. This high degree of flexibility and automation significantly improves the realism and scalability of competition and training environments, thereby better meeting the needs of enterprise-level attack and defense drills or advanced cybersecurity teaching.
[0059] Furthermore, to provide HTTPS encryption for all dynamically generated access points and ensure data transmission security, in some preferred embodiments of the present invention, after the synchronous generation of unique access point rules, an automated TLS certificate management process is also included. This process specifically includes: first, the system automatically extracts the generated unique subdomain from the unique access point rules, for example, extracting the complete subdomain from the previously generated team01-web101.ctf.example.com.
[0060] The system then triggers an integrated, ACME-compliant client (e.g., cert-manager) that automatically requests and retrieves a matching TLS certificate from a Certificate Authority (CA) (e.g., Let's Encrypt) using the unique subdomain. This request and verification process is fully automated and requires no manual intervention.
[0061] After successfully obtaining the certificate, the system automatically configures and mounts the acquired TLS certificate to the unique access entry rule. This operation typically manifests as configuring a Secret that references the TLS certificate in the Ingress resource object.
[0062] Finally, to ensure the long-term validity of the certificate, the system is also configured to allow the client (e.g., cert-manager) to monitor the validity period of the TLS certificate and automatically trigger the renewal process within a preset period of time (e.g., 30 days) before its expiration, obtain a new certificate and seamlessly replace the old certificate.
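The mounting and renewal steps above can be illustrated with the sketch below. It assumes cert-manager is the ACME client and uses its `cert-manager.io/cluster-issuer` Ingress annotation together with the standard `spec.tls` field; the issuer name, the Secret name, and the `renewal_due` helper are hypothetical, and in practice cert-manager performs the renewal check itself.

```python
import copy
from datetime import datetime, timedelta, timezone


def attach_tls(ingress, host, secret_name, issuer):
    """Return a copy of the Ingress manifest with TLS configured for the host."""
    updated = copy.deepcopy(ingress)
    annotations = updated.setdefault("metadata", {}).setdefault("annotations", {})
    # Asks cert-manager to issue a certificate for the hosts listed under spec.tls.
    annotations["cert-manager.io/cluster-issuer"] = issuer
    # The Secret named here is where the issued certificate is stored.
    updated.setdefault("spec", {})["tls"] = [{"hosts": [host], "secretName": secret_name}]
    return updated


def renewal_due(not_after, now, renew_before=timedelta(days=30)):
    """True once the current time is inside the renewal window before expiry."""
    return now >= not_after - renew_before


expiry = datetime(2026, 9, 1, tzinfo=timezone.utc)
too_early = renewal_due(expiry, datetime(2026, 8, 1, tzinfo=timezone.utc))
in_window = renewal_due(expiry, datetime(2026, 8, 2, tzinfo=timezone.utc))
```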
[0063] It is understandable that by integrating the aforementioned automated TLS certificate management mechanism, this invention can solve the huge operational costs and potential risks associated with manually configuring and maintaining HTTPS certificates for a large number of dynamic subdomains in traditional platforms. This fully automated application, mounting, and renewal process not only ensures end-to-end encryption of all target machine instance access links, improving the overall security of the platform, but also completely eliminates access interruptions caused by human errors such as certificate expiration, configuration errors, or asynchronous updates, thereby significantly enhancing the stability and reliability of the platform under long-term operation.
[0064] Step 3: During the operation of the target machine instance, to accurately assess its actual usage, the system initiates a two-factor monitoring mechanism. On one hand, the system continuously collects and records its last access timestamp, which reflects the instance's activity at the application layer. On the other hand, to more accurately determine the connection status at the network layer, the system also uses network probes to periodically count the number of active network connections associated with the target machine instance that are in the ESTABLISHED state. Here, the "ESTABLISHED" state specifically refers to a TCP connection that has completed the three-way handshake and is in the data transmission phase. This metric can effectively identify scenarios such as long-lived connections, avoiding misjudgments that may occur based solely on access timestamps.
[0065] In some embodiments, the periodic counting of active network connections by a network probe in step 3 is implemented as follows. When the Pod resource is created in step 2, a sidecar container is deployed within it as the network probe. In this structure, the Pod resource also includes a main application container that hosts the business logic of the target machine instance; the sidecar container's lifecycle is bound to that of the main application container, but its process and image are independent of it.
[0066] Since all containers within the same Pod share the same network namespace, the sidecar container can periodically query and count the connections in the TCP ESTABLISHED state within the Pod resource's network namespace, thus obtaining the number of active network connections. For example, the sidecar container can have a built-in lightweight script that periodically executes Linux system commands such as `ss -tn` or `netstat -an`, parses and counts the results, and then reports the statistics to the scheduling control module.
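A lightweight probe script of the kind described above might count connections as in the following sketch. It parses text in the layout printed by `ss -tn`, where the state is the first column and established connections appear as `ESTAB`; the sample output and the function name are illustrative, and `netstat -an` would need a different parser because it prints the state in a later column.

```python
def count_established(ss_output):
    """Count TCP connections in the ESTABLISHED state from `ss -tn` output."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        # `ss` abbreviates the ESTABLISHED state as "ESTAB" in its first column.
        if fields and fields[0] == "ESTAB":
            count += 1
    return count


# Illustrative sample, not captured from a real instance.
SAMPLE = """State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:8080       10.0.0.9:53122
ESTAB      0      0      10.0.0.5:8080       10.0.0.7:40110
CLOSE-WAIT 0      0      10.0.0.5:8080       10.0.0.8:40200
"""

active = count_established(SAMPLE)
```

Inside the sidecar, the text would come from running the command itself, for example with `subprocess.run(["ss", "-tn"], capture_output=True, text=True)`, and the resulting count would be reported to the scheduling control module on each cycle.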
[0067] It is understandable that by employing the aforementioned sidecar container as a network probe, this invention achieves decoupling between monitoring logic and business logic. The purpose of this solution is to allow the active network connection statistics function to be attached to any target instance as a standardized, pluggable component without modifying the target instance's own business code. This technique not only greatly simplifies the creation and maintenance of the challenge environment, but more importantly, it provides a precise and reliable means to identify whether a user is using a long-lived connection (such as WebSocket, file transfer, or interactive shell). This accurate judgment effectively avoids the possibility of incorrectly reclaiming instances still in use by relying solely on the last access timestamp, thus ensuring efficient resource reclamation while guaranteeing the continuity of user operations and the overall stability of the platform.
[0068] Step 4: Based on the monitoring data above, the system periodically executes the recycling decision logic. The system first calculates the difference between the current time and the last access timestamp to obtain the instance's idle time, and then compares this difference with a preset idle time threshold. Simultaneously, the system compares the counted number of active network connections with a preset connection threshold. To ensure that only completely idle instances are recycled, this connection threshold is set to zero in this embodiment.
[0069] Step 5: When the result of the first comparison is that the difference exceeds the idle time threshold, and the result of the second comparison is that the number of active network connections equals the connection number threshold, that is, when both idle conditions are met simultaneously, the system triggers a delayed recycling task. The purpose of introducing a delayed recycling task is to provide a buffer period to prevent the accidental recycling of instances due to brief network jitter or user operation intervals.
[0070] Step 6: After a preset delay period, the delayed recycling task undergoes a re-examination. This re-examination process includes: re-executing the first comparison and the second comparison, that is, re-acquiring the current time, timestamp, and number of active connections, and re-comparing and judging them.
[0071] Step 7: When the results of this re-check show that the newly calculated difference still exceeds the idle time threshold and the newly counted number of active network connections is still equal to the connection threshold, the system finally confirms that the target machine instance is stably in an unused state. At this point, the system performs graceful termination and cleanup on the Pod resource. The graceful termination process first sends a termination signal to the containers within the Pod, giving them a period of time to complete cleanup tasks such as data saving and log writing to disk, and then forcibly stops them, thereby ensuring the smoothness of the resource reclamation process and data consistency.
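Steps 4 through 7 can be summarized in the sketch below, which applies the two comparisons, waits for the delay period, and then repeats both comparisons before deciding to terminate. The function names and the injected `read_state`, `clock`, and `sleep` callables are assumptions introduced so that the logic can be exercised without a real cluster; times are in seconds.

```python
def is_idle(now, last_access, active_connections, idle_threshold, connection_threshold=0):
    """First comparison: idle time exceeds the threshold.
    Second comparison: active connections equal the threshold (zero)."""
    return (
        (now - last_access) > idle_threshold
        and active_connections == connection_threshold
    )


def reclaim_decision(read_state, clock, sleep, idle_threshold, delay):
    """Return "terminate" only if the instance is idle before and after the delay.

    read_state() returns (last_access_timestamp, active_connection_count).
    """
    last_access, connections = read_state()
    if not is_idle(clock(), last_access, connections, idle_threshold):
        return "keep"
    # Delayed recycling task: buffer period against jitter or short pauses.
    sleep(delay)
    # Re-examination: re-acquire time, timestamp, and connection count.
    last_access, connections = read_state()
    if is_idle(clock(), last_access, connections, idle_threshold):
        return "terminate"
    return "keep"
```

When the result is "terminate", the caller would start graceful termination of the Pod resource; a user who reconnects during the delay changes the re-read state and the instance is kept.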
[0072] In summary, the containerized network security competition platform dynamic scheduling method and system of this invention, firstly, by tightly coupling the lifecycle of target machine instances with actual user access requests, achieves on-demand creation and precise reclamation of resources, fundamentally solving the problems of low resource utilization and cost waste caused by instance pre-startup and persistent strategies in existing technologies. Secondly, based on lazy loading distributed creation logic, it effectively distributes high-concurrency access pressure to various time points, avoiding startup storms and performance jitter caused by centralized startup, significantly improving the platform's concurrent carrying capacity and the user's first-time access experience. Finally, by introducing a two-factor reclamation decision based on the last access timestamp and the number of active network connections, supplemented by a delayed review debouncing mechanism, this embodiment can accurately determine the true idle state of instances, avoiding the erroneous reclamation of instances with long-connection sessions, thereby maximizing resource efficiency while effectively ensuring the stability and continuity of the competition or training process.
[0073] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A dynamic scheduling method for a containerized network security competition platform, characterized in that it includes the following steps:
receiving, in real time, a user access request for a specific question, and querying, based on the access request, the running status of the target machine instance associated with the target tenant identified by the access request and with the specific question;
when it is determined from the running status that the target machine instance does not exist, dynamically creating a Pod resource as the target machine instance based on the template preset for the specific question, and synchronously generating a unique access entry rule for the Pod resource, the access entry rule being used to establish an access link from the user terminal to the target machine instance;
during the operation of the target machine instance, continuously collecting and recording its last access timestamp, and periodically counting, using a network probe, the number of active network connections associated with the target machine instance that are in the ESTABLISHED state;
calculating the difference between the current time and the last access timestamp, and performing a first comparison between the difference and a preset idle time threshold; and performing a second comparison between the counted number of active network connections and a preset connection threshold, wherein the connection threshold is zero;
when the result of the first comparison is that the difference exceeds the idle time threshold, and the result of the second comparison is that the number of active network connections is equal to the connection threshold, triggering a delayed recycling task;
after the preset delay time expires, re-checking the delayed recycling task, the re-check including re-executing the first comparison and the second comparison; and
when the result of the re-check is that the newly calculated difference still exceeds the idle time threshold and the newly counted number of active network connections is still equal to the connection threshold, gracefully terminating and cleaning up the Pod resource.
2. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that a scheduling control module is used to execute the step of dynamically creating the Pod resource as the target machine instance, and this creation step specifically includes: the scheduling control module adopts the native container runtime interface (CRI) based on containerd; and the scheduling control module directly sends instructions for image pulling and container lifecycle management to a containerd instance through the gRPC interface or the native container runtime interface (CRI) to execute the creation of the Pod resource.
3. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that, The template for the specific question is set as a container image generated based on a layered image building strategy; The layered image construction strategy includes: The common dependencies, library files and system environment required by the competition platform are pre-built and solidified into one or more stable base image layers; A multi-stage build process is adopted, in which the base image layer is reused in subsequent stages, and the business modules, applications and configuration files related to the specific topic are loaded into a new upper image layer to generate a minimized running image containing the base image layer and the upper image layer, which is the preset template.
4. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that, The specific steps for periodically counting active network connections using network probes include: When the Pod resource is created, a sidecar container is deployed within it as the network probe. The Pod resource also includes a main application container for carrying the business logic of the target machine instance. The sidecar container is independent of the main application container. And through the sidecar container, periodically query and count the number of all connections in the TCP protocol ESTABLISHED state within the Pod resource network namespace to obtain the number of active network connections.
5. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that, When it is determined that the target machine instance does not exist, the process further includes the following steps before performing the creation step: A unique key is generated based on the identifier of the target tenant and the identifier of the specific question; The unique key is used to request and obtain a distributed lock from a distributed coordination service, and the execution of the creation step is premised on successfully obtaining the distributed lock; After the creation step is completed, the distributed lock is released.
6. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that, The step of dynamically creating the Pod resource as the target machine instance further includes: Create a separate namespace for the target tenant within the container orchestration cluster; Create the Pod resource within this independent namespace; Additionally, a network policy (NetworkPolicy) is configured for this independent namespace. This network policy denies all incoming and outgoing east-west traffic by default, and allows northbound traffic from a specified ingress gateway that implements the access ingress rule to access the Pod resources in a whitelist manner based on the access ingress rule.
7. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that, The steps for synchronously generating unique access entry rules specifically include: Maintain a pre-defined Ingress route template that includes hostname and path variables; Obtain the unique identifier of the target tenant and the unique identifier of the specific question; Furthermore, using the unique identifier of the target tenant and the unique identifier of the specific question, the hostname variable and / or path variable in the Ingress routing template are respectively filled in, and an Ingress rule containing a unique subdomain or a unique path is dynamically generated for the target machine instance, and the rule is used as the unique access entry rule.
8. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that, After the synchronous generation of a unique access entry rule, the following is also included: Extract the generated unique subdomain from the unique access entry rule; The integrated, ACME-compliant client is triggered to automatically request and obtain a matching TLS certificate from the Certificate Authority (CA) using the unique subdomain. The obtained TLS certificate will be automatically configured and mounted to the unique access entry rule; Additionally, the client is configured to monitor the validity period of the TLS certificate and automatically trigger the renewal process within a preset time before its expiration.
9. The dynamic scheduling method for a containerized network security competition platform according to claim 1, characterized in that: The templates pre-set for specific questions include resource constraint configurations; The resource constraint configuration is to pre-set the CPU and memory request values (requests) and limit values (limits) for each container within the Pod resource created based on the template; The request value (requests) defines the minimum amount of resources that must be guaranteed during container scheduling, and the limit value (limits) defines the maximum amount of CPU and memory resources that the container is allowed to consume during runtime.
10. A dynamic scheduling system for a containerized network security competition platform, characterized in that it comprises a processor and a memory storing computer-executable instructions, the processor executing the computer-executable instructions to enable the system to implement the dynamic scheduling method according to claim 1 through a plurality of cooperating modules; the cooperating modules comprise: an application service module configured to handle question management, perform tenant authentication on the access requests, and receive and verify the Flag submitted by the user in response to the specific question; a scheduling control module configured to respond to the tenant authentication result of the application service module, interact with the API of the container orchestration cluster deployed in the system, execute the steps of the dynamic scheduling method other than counting the number of active network connections, and synchronously generate the unique access entry rule of the dynamic scheduling method; a container runtime module configured to use the containerd native container runtime to receive and execute instructions from the scheduling control module regarding lifecycle management of the target machine instance; a security isolation and monitoring module comprising a network policy distributor, a resource statistics probe, and a role-based access control (RBAC) permission manager, wherein the network policy distributor is configured to configure network policies for the independent namespace of the target tenant, the resource statistics probe is configured to execute the step of counting the number of active network connections in the dynamic scheduling method, and the RBAC permission manager is configured to restrict the permissions of the target tenant to its corresponding independent namespace; and
an ingress gateway module comprising an Ingress controller and a TLS automation management component, wherein the Ingress controller is configured to dynamically generate Ingress rules containing unique subdomains or unique paths based on the unique identifier of the target tenant and the unique identifier of the specific question, and the TLS automation management component is configured to automatically apply for, mount, and renew TLS certificates based on the unique subdomains in the Ingress rules.
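As an illustration of the security isolation module in claim 10, the following sketch builds the kind of per-tenant manifests that a network policy distributor and an RBAC permission manager might emit. The claim does not specify policy contents, so everything here is an assumption: the function name, the `tenant-<id>` namespace scheme, the `ingress-nginx` gateway namespace, and the chosen verbs and resources. The manifest shapes follow the Kubernetes `networking.k8s.io/v1` and `rbac.authorization.k8s.io/v1` APIs.

```python
def tenant_isolation_manifests(tenant_id, ingress_namespace="ingress-nginx"):
    """Return [NetworkPolicy, Role, RoleBinding] scoped to one tenant namespace."""
    namespace = f"tenant-{tenant_id}"  # assumed naming scheme

    # Admit traffic only from Pods in the same namespace and from the
    # (assumed) ingress gateway namespace; other tenants are blocked.
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "tenant-isolation", "namespace": namespace},
        "spec": {
            "podSelector": {},  # applies to every Pod in the namespace
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [
                {"podSelector": {}},
                {"namespaceSelector": {"matchLabels": {
                    "kubernetes.io/metadata.name": ingress_namespace,
                }}},
            ]}],
        },
    }

    # A Role is namespace-scoped, so its permissions cannot reach other tenants.
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": "tenant-access", "namespace": namespace},
        "rules": [{
            "apiGroups": [""],
            "resources": ["pods", "services"],
            "verbs": ["get", "list", "watch"],
        }],
    }

    role_binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": "tenant-access", "namespace": namespace},
        "subjects": [{
            "kind": "User",
            "name": tenant_id,
            "apiGroup": "rbac.authorization.k8s.io",
        }],
        "roleRef": {
            "kind": "Role",
            "name": "tenant-access",
            "apiGroup": "rbac.authorization.k8s.io",
        },
    }
    return [network_policy, role, role_binding]
```

Binding a namespaced Role rather than a ClusterRole is one way to confine a tenant's permissions to its own namespace, which is the effect the claim attributes to the RBAC permission manager.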