Deep learning-based end-to-end inference method and system, electronic device

CN122242741A · Pending · Publication Date: 2026-06-19 · CHINA ACADEMY OF INFORMATION & COMM


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: CHINA ACADEMY OF INFORMATION & COMM
Filing Date: 2026-03-16
Publication Date: 2026-06-19


IPC: G06N5/04; G06N20/00; G06F9/48

AI Tagging

Application Domain

Program initiation/switching Machine learning


Abstract

This application relates to the field of artificial intelligence technology and discloses an end-to-end inference method based on deep learning, applied to a multimodal intelligent inference service system. The method includes: standardizing the original data to be inferred to obtain standardized data to be inferred; selecting an inference model adapted to the data from a pre-built model library based on the features of the standardized data to be inferred; splitting the inference task corresponding to the data to be inferred into multiple sub-inference tasks and scheduling the sub-inference tasks to heterogeneous computing hardware; controlling the heterogeneous computing hardware to load and execute the inference model to accelerate the inference of the sub-inference tasks and obtain the inference result. This invention achieves low-latency, high-reliability inference for multimodal data, suitable for edge computing, high-concurrency online services, and other scenarios, effectively solving the problems of heavy load and uncontrollable accuracy in traditional inference systems. This application also discloses an end-to-end inference system and electronic device based on deep learning.


Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, for example to a deep learning-based end-to-end inference method, system, and electronic device.

Background Technology

[0002] With the large-scale deployment of artificial intelligence technology in fields such as intelligent transportation, smart healthcare, and high-concurrency online services, multimodal intelligent inference service systems are becoming increasingly widespread. Such a system typically includes a data source access unit, a heterogeneous computing cluster unit, a model management unit, and a result output unit. The raw data to be inferred covers various types of data, including road monitoring images / videos in intelligent transportation scenarios, CT images / voice medical orders in smart healthcare scenarios, and text queries / voice consultations in high-concurrency online service scenarios. The core requirement is efficient, low-latency, and highly reliable inference on multimodal data.

[0003] End-to-end inference methods, as the core supporting technology of multimodal intelligent inference service systems, directly determine the system's inference performance and scenario adaptability.

[0004] In the process of implementing the embodiments of this disclosure, at least the following problems were found in the related art: Traditional end-to-end inference methods are prone to incurring additional overhead when data flows between steps. They fail to fully realize end-to-end collaborative optimization from raw data input to inference result output, making it difficult to maximize inference efficiency while ensuring inference accuracy. They also lack adaptability to the inference requirements of dynamically changing multimodal data.

[0005] There is an urgent need in this field for an end-to-end inference solution that can achieve adaptive data processing, intelligent task scheduling, dynamic acceleration of inference, and reliable results.

[0006] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this application, and therefore may include information that does not constitute prior art known to those skilled in the art.

Summary of the Invention

[0007] To provide a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key / important components or delineate the scope of protection of these embodiments; it serves as a prelude to the detailed description that follows.

[0008] This disclosure provides an end-to-end inference method, system, and electronic device based on deep learning, which optimizes the entire inference process from the original data to be inferred to the final inference result, thereby improving inference efficiency and accuracy.

[0009] In some embodiments, a deep learning-based end-to-end inference method is applied to a multimodal intelligent inference service system, which includes a data source access unit, a heterogeneous computing cluster unit, a model management unit, and a result output unit. The method includes: obtaining the raw data to be inferred through the data source access unit, and standardizing the raw data to be inferred to obtain standardized data to be inferred; based on the features of the standardized data to be inferred, selecting an inference model adapted to the data to be inferred from the pre-set model library of the model management unit; splitting the inference task corresponding to the data to be inferred into multiple sub-inference tasks, and scheduling the sub-inference tasks to the heterogeneous computing hardware of the heterogeneous computing cluster unit; and controlling the heterogeneous computing hardware to load and execute the inference model, accelerating the inference of the sub-inference tasks, obtaining the inference results, and outputting the inference results through the result output unit.

[0010] In some embodiments, a deep learning-based end-to-end inference system, applied to multimodal intelligent inference scenarios, comprises a data preprocessing subsystem and an inference scheduling subsystem. The data preprocessing subsystem is configured to standardize the acquired raw data to be inferred, thereby obtaining standardized data to be inferred. The inference scheduling subsystem, which communicates with the data preprocessing subsystem, is configured to: select an inference model that matches the data to be inferred from the pre-set model library of the model management unit based on the features of the standardized data to be inferred; split the inference task corresponding to the data to be inferred into multiple sub-inference tasks and schedule the sub-inference tasks to the heterogeneous computing hardware of the heterogeneous computing cluster unit; and control the heterogeneous computing hardware of the heterogeneous computing cluster unit to load and execute the inference model, accelerate the inference of the sub-inference tasks, and obtain and output the inference results.

[0011] In some embodiments, an electronic device includes: a processor, a memory, and a computer program stored in the memory, wherein the processor, when executing the computer program, implements the aforementioned deep learning-based end-to-end inference method.

[0012] The end-to-end inference method, system, and electronic device based on deep learning provided in this disclosure can achieve the following technical effects: By standardizing and adapting data, finely splitting tasks, and intelligently scheduling heterogeneous hardware, it fully leverages the performance advantages of hardware such as CPU, GPU, and NPU. Combined with acceleration mechanisms such as early exit and speculative sampling, it can significantly reduce inference latency and improve inference throughput. It supports adaptive processing of multimodal data such as images, voice, and text, and can dynamically select and adapt the inference model according to data characteristics. It is suitable for a variety of inference scenarios and has strong adaptability.

[0013] The above general description and the description below are exemplary and illustrative only and are not intended to limit this application.

Description of the Drawings

[0014] One or more embodiments are illustrated by way of example with reference to the accompanying drawings. These illustrations and drawings do not limit the embodiments. Elements having the same reference numerals in the drawings denote similar elements. The drawings are not drawn to scale. In the drawings:
Figure 1 is a flowchart of the end-to-end inference method provided in an embodiment of this disclosure;
Figure 2 is a flowchart of the standardization process for raw data to be inferred, provided in an embodiment of this disclosure;
Figure 3 is a schematic diagram of the process of scheduling sub-inference tasks to heterogeneous computing hardware, provided in an embodiment of this disclosure;
Figure 4 is a schematic diagram of the process of obtaining inference results through confidence-driven early exit, provided in an embodiment of this disclosure;
Figure 5 is a schematic diagram of the process of accelerating inference using speculative sampling, provided in an embodiment of this disclosure;
Figure 6 is a schematic diagram of the overall structure of the end-to-end inference system provided in an embodiment of this disclosure;
Figure 7 is a schematic diagram of the structure of the end-to-end inference device provided in an embodiment of this disclosure.

Detailed Implementation

[0015] To provide a more detailed understanding of the features and technical content of the embodiments of this disclosure, the implementation of the embodiments of this disclosure will be described in detail below with reference to the accompanying drawings. The accompanying drawings are for illustrative purposes only and are not intended to limit the embodiments of this disclosure. In the following technical description, for ease of explanation, several details are used to provide a full understanding of the disclosed embodiments. However, one or more embodiments may still be implemented without these details. In other cases, well-known structures and devices may be simplified in their depiction to simplify the drawings.

[0016] The terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this disclosure described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion.

[0017] Unless otherwise stated, the term "multiple" means two or more.

[0018] In this embodiment of the disclosure, the character " / " indicates that the objects before and after it are in an "or" relationship. For example, A / B means: A or B.

[0019] The term "and / or" describes an association between objects, indicating that three relationships can exist. For example, A and / or B means: A or B, or A and B.

[0020] The term "correspondence" can refer to an association or binding relationship. The correspondence between A and B means that there is an association or binding relationship between A and B.

[0021] Traditional end-to-end inference methods are prone to incurring additional overhead when data flows between steps. They fail to fully realize end-to-end collaborative optimization from raw data input to inference result output, making it difficult to maximize inference efficiency while ensuring inference accuracy. They also lack adaptability to the inference requirements of dynamically changing multimodal data. There is an urgent need for an end-to-end inference solution that can achieve adaptive data processing, intelligent task scheduling, dynamic acceleration of inference, and reliable results.

[0022] The end-to-end inference method based on deep learning provided in this disclosure fully leverages the performance advantages of hardware such as CPU, GPU, and NPU through data standardization and adaptation, fine-grained task decomposition, and intelligent scheduling of heterogeneous hardware. Combined with acceleration mechanisms such as early exit and speculative sampling, it can significantly reduce inference latency and improve inference throughput. It supports adaptive processing of multimodal data such as images, voice, and text, and can dynamically select and adapt the inference model according to data characteristics. It is suitable for various inference scenarios and has strong adaptability.

[0023] This disclosure provides an end-to-end inference method based on deep learning, applied to a multimodal intelligent inference service system, which includes:
Data source access unit: Supports access to multiple data sources such as cameras, sensors, user terminals, and electronic medical record systems. For example, it can acquire raw data to be inferred such as road monitoring images / videos in intelligent transportation scenarios, CT images / voice medical orders in smart healthcare scenarios, and text queries / voice consultations in high-concurrency online service scenarios.
Heterogeneous computing cluster unit: Can be composed of a CPU (Intel i7-12700H), a GPU (NVIDIA RTX 3090), and an NPU (Huawei Ascend 310), supporting the parallel execution of lightweight and heavy computing tasks.
Model management unit: Deploys a pre-built model library and stores inference models adapted to different scenarios (such as MobileNet, the target detection model for intelligent transportation; ResNet50, the image analysis model for smart healthcare; and GPT-2, the text generation model for online services).
Result output unit: Supports multiple types of result output, including image annotation results, diagnostic reports, and text responses, and can connect to downstream devices such as traffic control platforms, hospital information systems, and online service terminals.

[0024] The overall process is shown in Figure 1. The inference method includes the following steps:
Step S1: Obtain the original data to be inferred through the data source access unit, and standardize the original data to be inferred to obtain standardized data to be inferred.
Step S2: Based on the features of the standardized data to be inferred, select an inference model adapted to the data to be inferred from the pre-set model library of the model management unit.
Step S3: Split the inference task corresponding to the data to be inferred into multiple sub-inference tasks, and schedule the sub-inference tasks to the heterogeneous computing hardware of the heterogeneous computing cluster unit.
Step S4: Control the heterogeneous computing hardware to load and execute the inference model, accelerate the inference of the sub-inference tasks, obtain the inference result, and output the inference result through the result output unit.

[0025] The inference method in this embodiment effectively reduces data flow overhead and computational load by standardizing the original data to be inferred, matching and adapting the inference model on demand, splitting tasks and scheduling them to heterogeneous computing hardware, and loading the model to accelerate inference. It fully leverages the performance advantages of heterogeneous hardware, significantly improves inference efficiency while ensuring inference accuracy, and adapts to the needs of multimodal data inference.

[0026] Referring to Figure 2, standardizing the original data to be inferred to obtain standardized data to be inferred includes the following steps:
Step S11: Identify the data type of the original data to be inferred, where the data type is image data, voice data, or text data.
In this specific embodiment, identifying the data type includes receiving externally input multimodal raw data to be inferred, covering image data, voice data, and text data. The data can be received from external data sources such as databases, sensors, user terminals, and cloud storage through wired networks, wireless networks, or local interfaces, which ensures the continuity and compatibility of data reception.

[0027] The original data to be inferred is accessed through a data source access unit, which may include, for example: Intelligent transportation scenario: RGB images (PNG format) captured by road surveillance cameras, including targets such as vehicles, traffic signs, and pedestrians; Smart healthcare scenario: Medical images generated by CT equipment (DICOM format) and doctor's voice medical orders (WAV format, 16kHz sampling rate); High-concurrency online service scenario: Text queries (TXT format) and voice consultations (MP3 format) sent by user terminals.

[0028] A dual recognition mechanism of "format parsing + feature verification" is adopted to determine the data type. First, the file header magic number and the file extension are parsed for a preliminary determination. For example, a PNG image starts with the magic number 89 50 4E 47, and a WAV voice file starts with the magic number 52 49 46 46; a TXT text file has no magic number, so it is identified by its extension. For binary stream data without a clear format identifier, a lightweight feature extractor with a parameter scale of ≤1 million is called on extracted data segments (e.g., 256×256 pixel blocks for images, 1-second segments for voice, and 50-character segments for text) for feature analysis. Based on the spectral features (voice data), pixel distribution histogram (image data), and character encoding statistics (text data), the extractor outputs the matching probability of each type. When the probability of a type exceeds the preset type recognition threshold (configurable according to the required recognition accuracy, preferably 0.8), the data is determined to be of that type, and the recognition result is passed to the subsequent standardization operation in real time.
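The two-stage check described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `identify_type`, the extension table, and the `probabilities` argument (standing in for the output of the unspecified lightweight feature extractor) are assumptions. Note that 52 49 46 46 is the generic RIFF signature, so a production check would likely also inspect the `WAVE` form type.

```python
from typing import Dict, Optional

# Magic numbers quoted in the text: PNG starts with 89 50 4E 47,
# WAV (a RIFF container) starts with 52 49 46 46.
MAGIC_NUMBERS = {
    bytes.fromhex("89504E47"): "image",
    bytes.fromhex("52494646"): "voice",
}
# Hypothetical extension table; plain text has no magic number,
# so the extension decides.
EXTENSIONS = {".png": "image", ".wav": "voice", ".mp3": "voice", ".txt": "text"}
TYPE_THRESHOLD = 0.8  # preferred type recognition threshold from the text


def identify_type(header: bytes, filename: str = "",
                  probabilities: Optional[Dict[str, float]] = None) -> Optional[str]:
    """Return 'image', 'voice', 'text', or None when nothing is conclusive."""
    # Stage 1: format parsing by magic number, then by file extension.
    for magic, kind in MAGIC_NUMBERS.items():
        if header.startswith(magic):
            return kind
    dot = filename.rfind(".")
    if dot != -1:
        kind = EXTENSIONS.get(filename[dot:].lower())
        if kind is not None:
            return kind
    # Stage 2: feature verification. `probabilities` is an assumed stand-in
    # for the per-type matching probabilities of the feature extractor.
    if probabilities:
        kind, p = max(probabilities.items(), key=lambda kv: kv[1])
        if p > TYPE_THRESHOLD:
            return kind
    return None
```

For an unlabeled binary stream, the caller would pass the extractor's probabilities, and the function returns `None` when no type clears the threshold.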

[0029] Step S12: Based on the identified data type, perform a standardization operation on the original data to be inferred to obtain standardized data to be inferred.

[0030] In this specific embodiment, the targeted standardization process includes: performing adaptive standardization operations on the original data to be inferred based on the identified data type, eliminating differences between data sources and formats, and generating standardized data to be inferred that is directly recognizable by the inference model and has a unified format and consistent dimensions. The specific operations depend on the data type:
If the data type is image data: perform size normalization and / or pixel value normalization. Size normalization uses bilinear interpolation to adjust images of different resolutions to a preset resolution of 224×224 or 448×448. Pixel value normalization maps the original pixel values from the [0, 255] interval to the [0, 1] interval (normalized value = original pixel value / 255). A 3×3 Gaussian filter (standard deviation σ = 0.5) may optionally be applied for noise reduction. The final output is an NCHW format tensor (N is the batch size, C is the number of channels, H is the height, and W is the width); RGB images have 3 channels and grayscale images have 1 channel.
If the data type is voice data: perform resampling and framing, and optionally feature extraction. Resampling uses interpolation to convert data with other sampling rates, such as 8kHz and 24kHz, to 16kHz. Framing uses the Hamming window function, with a frame length of 25ms (corresponding to 400 sampling points at 16kHz) and a frame shift of 10ms (corresponding to 160 sampling points). Feature extraction uses a Mel filter bank to extract 13-dimensional Mel frequency cepstral coefficients (MFCC). The final output is a feature vector sequence with dimensions of [batch size × number of frames × 13].
If the data type is text data: perform word segmentation and vectorization, and optionally encoding conversion. For word segmentation, jieba is used for Chinese and NLTK is used for English; custom dictionary import is supported. Vectorization converts the segmented words into vectors through a pre-trained 300-dimensional Word2Vec model. Out-of-vocabulary words are initialized with 300-dimensional random vectors with a mean of 0 and a variance of 0.01. Encoding conversion can be one-hot or embedding encoding. The sequence is also truncated or padded to a fixed length of 512 tokens. The final output is a numerically encoded sequence with dimensions of [batch size × 512 × 300].
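As a hedged illustration of the arithmetic behind these standardization steps, the sketch below covers pixel value normalization, the frame count implied by a 25ms frame and 10ms shift, and fixed-length padding. At 16kHz, a 25ms frame spans 16000 × 0.025 = 400 samples and a 10ms shift spans 160 samples. The function names and the `"<pad>"` token are assumptions for illustration; resizing, windowing, MFCC extraction, and Word2Vec lookup are omitted.

```python
from typing import List, Sequence


def normalize_pixels(pixels: Sequence[int]) -> List[float]:
    """Map 8-bit pixel values in [0, 255] to [0, 1]."""
    return [p / 255 for p in pixels]


def frame_count(num_samples: int, sample_rate: int = 16000,
                frame_ms: int = 25, shift_ms: int = 10) -> int:
    """Number of complete frames for the given frame length and frame shift."""
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    shift = sample_rate * shift_ms // 1000      # 160 samples at 16 kHz
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // shift


def pad_or_truncate(tokens: List[str], length: int = 512,
                    pad: str = "<pad>") -> List[str]:
    """Truncate or right-pad a token list to a fixed length (pad token assumed)."""
    return tokens[:length] + [pad] * max(0, length - len(tokens))
```

One second of 16kHz audio therefore yields 98 complete frames, so a single utterance of that length would produce a [1 × 98 × 13] MFCC tensor.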

[0031] After standardization is completed, the standardized data to be inferred is transmitted to the subsequent step S2 via an internal data bus with checksum verification to ensure the integrity of data transmission.

[0032] In step S2, the core features of the standardized data to be inferred are first extracted, and then, based on the feature matching relationship, the inference model that is compatible with the data to be inferred is selected from the pre-set model library of the model management unit to avoid wasting computing power and ensure accurate matching between the model and the inference task.

[0033] In one specific embodiment, core features of the standardized data to be inferred that are relevant to inference model adaptation are extracted. The extracted features depend on the data type and also include a general feature for the real-time requirement:
Image data: tensor dimensions (H×W×C), texture complexity (calculated from the gray-level co-occurrence matrix, with values in [0, 1]), and target region proportion (target pixel count / total pixel count).
Voice data: sequence length (number of frames), signal-to-noise ratio (signal power / noise power, in dB), and fundamental frequency (50-500Hz).
Text data: sequence length (number of characters), word frequency statistics (percentage of the top 5 high-frequency words), and semantic complexity (calculated from word vector entropy).
General feature: the maximum allowable inference latency (in milliseconds) preset by the user or system, which is the core reference indicator for model selection.

[0034] In one specific embodiment, the pre-built model library is deployed in local storage or a distributed cache. The model files use common deep learning model formats such as ONNX and include multiple types of inference models for image, speech, and text. The accompanying metadata of each model is stored alongside it. The model types cover lightweight models (MobileNet, CNN-LSTM, BERT-base), mid-range models (ResNet50, Transformer, GPT-2), and heavyweight models (EfficientNet, T5). The metadata includes the data types supported by the model, the range of input dimensions, the measured inference latency on different hardware, the computational power requirement (FLOPs), and the range of adapted features.

[0035] A weighted cosine similarity algorithm is used to calculate the matching degree between data features and model adaptation features. The weights can be set to 60% for real-time requirements, 30% for data complexity, and 10% for data dimensionality, and can be adjusted according to the real-time and data complexity requirements of the actual business scenario. The matching degree is calculated as:
Matching degree = 0.6 × (1 - |Data real-time requirement - Model inference latency| / Data real-time requirement) + 0.3 × Feature similarity + 0.1 × (1 - |Data dimensionality - Model adaptation dimension| / Model adaptation dimension)
Models with a matching degree ≥ the preset model matching threshold (preferably 0.7) and a model inference latency ≤ the data real-time requirement are selected as adapted inference models. If multiple qualified models exist, the model with the smallest inference latency is selected. The matching result includes the model identifier, storage path, and input / output specifications, and is transmitted to the subsequent task splitting and scheduling stages.
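The selection rule above can be sketched in a few lines. This is an illustrative reading of the formula, not the application's code: the `ModelMeta` record and function names are hypothetical, and `feature_similarity` is assumed to be a precomputed similarity score in [0, 1] because the text does not spell out how it is derived.

```python
from dataclasses import dataclass
from typing import List, Optional

MATCH_THRESHOLD = 0.7  # preferred model matching threshold from the text


@dataclass
class ModelMeta:
    """Hypothetical slice of the per-model metadata used for matching."""
    name: str
    latency_ms: float          # measured inference latency on the target hardware
    adapted_dim: float         # model adaptation dimension
    feature_similarity: float  # assumed precomputed similarity in [0, 1]


def matching_degree(required_latency_ms: float, data_dim: float,
                    model: ModelMeta) -> float:
    """Weighted sum from the text: 0.6 latency + 0.3 similarity + 0.1 dimension."""
    latency_term = 1 - abs(required_latency_ms - model.latency_ms) / required_latency_ms
    dim_term = 1 - abs(data_dim - model.adapted_dim) / model.adapted_dim
    return 0.6 * latency_term + 0.3 * model.feature_similarity + 0.1 * dim_term


def select_model(required_latency_ms: float, data_dim: float,
                 models: List[ModelMeta]) -> Optional[ModelMeta]:
    """Keep models meeting both conditions, then pick the lowest latency."""
    qualified = [
        m for m in models
        if m.latency_ms <= required_latency_ms
        and matching_degree(required_latency_ms, data_dim, m) >= MATCH_THRESHOLD
    ]
    return min(qualified, key=lambda m: m.latency_ms, default=None)
```

With a 50ms requirement, a model measured at 40ms with similarity 0.9 and a matching dimension scores 0.6 × 0.8 + 0.3 × 0.9 + 0.1 × 1 = 0.85. Because the latency term uses an absolute difference, a model far faster than required scores lower on that term and can fall below the threshold.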

[0036] Referring to Figure 3, in one embodiment of this disclosure, splitting the inference task corresponding to the data to be inferred into multiple sub-inference tasks and scheduling the sub-inference tasks to the heterogeneous computing hardware of the heterogeneous computing cluster unit includes the following steps:
Step S31: Based on the computation graph structure of the inference model and the computational characteristics of the heterogeneous computing hardware, divide the inference task into lightweight computation sub-inference tasks and heavy computation sub-inference tasks.

[0037] In this specific embodiment, decomposing the inference task includes: loading the ONNX-format computation graph of the adapted inference model selected in step S2; parsing the operator types, computation order, and input-output dependencies in the computation graph; and taking into account the inherent computational characteristics of heterogeneous computing hardware such as CPU, GPU, and NPU. The CPU excels at logical judgments and lightweight computations, with logical judgments taking ≤1ms each and a lightweight computation throughput of 10⁶ operations per second. The GPU excels at floating-point operations, with a throughput of 10¹² operations per second. The NPU excels at low-power heavy computation, with a throughput of 5×10¹¹ operations per second and a power consumption of ≤15W. On this basis, the overall inference task corresponding to the data to be inferred is broken down into multiple sub-inference tasks, divided into lightweight computation sub-inference tasks (data format conversion, simple feature filtering, intermediate result integration; computational power requirement of a single task ≤10⁸ FLOPs) and heavy computation sub-inference tasks (convolution, matrix multiplication, autoregressive generation, deep feature extraction; computational power requirement of a single task ≥10⁹ FLOPs).

[0038] Step S32: Generate a task dependency graph based on the dependency relationship between the lightweight computational sub-inference task and the heavy computational sub-inference task.

[0039] In this specific embodiment, generating the task dependency graph specifically includes: topologically sorting the computation graph of the inference model to identify the data flow dependency and control flow dependency between each sub-inference task. For example, the output of sub-inference task A is the input of sub-inference task B, and sub-inference task D can only be executed after sub-inference task C is completed. Based on the dependency relationship, a task dependency graph represented by a directed acyclic graph is generated to clarify the execution order of the sub-inference tasks. The dependency graph is stored in a local cache to provide a basis for subsequent hardware scheduling.
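A minimal sketch of turning such dependencies into an execution order, using Python's standard `graphlib` module (Python 3.9+). The task names and the extra task E are hypothetical; the function groups tasks into batches whose members have no unmet dependencies and could therefore run in parallel.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set

# Hypothetical sub-inference tasks; each key maps to the tasks it depends on.
DEPENDENCIES: Dict[str, Set[str]] = {
    "B": {"A"},       # the output of A is the input of B (data-flow dependency)
    "D": {"C"},       # D may only run after C completes (control-flow dependency)
    "E": {"B", "D"},  # assumed final integration step
}


def ready_batches(dependencies: Dict[str, Set[str]]) -> List[List[str]]:
    """Group tasks into successive batches that respect the dependency graph."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        batches.append(ready)
        sorter.done(*ready)
    return batches
```

Calling `ready_batches(DEPENDENCIES)` groups A and C first, then B and D, then E, which is the ordering information a hardware scheduler would consume.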

[0040] Step S33: Based on the computation type of the sub-inference task and the resource status information of the heterogeneous computing hardware, schedule the sub-inference task to the heterogeneous computing hardware according to the task dependency graph.

[0041] In this specific embodiment, the scheduling of heterogeneous computing hardware follows the principle of "optimal matching + load balancing," scheduling sub-inference tasks to the corresponding heterogeneous computing hardware. The specific process is as follows:
Resource status acquisition: CPU utilization, GPU memory usage (accuracy ±10MB), and NPU load rate are collected through the hardware monitoring interface with a sampling period of 10ms. The collected data is pushed to the scheduling decision module in real time.
Scheduling decision: Based on the computation type (lightweight / heavy) of the sub-inference task, the task dependency graph, and the hardware resource status, lightweight computation sub-inference tasks are scheduled to the CPU, floating-point-intensive heavy computation sub-inference tasks are scheduled to the GPU, and low-power heavy computation sub-inference tasks are scheduled to the NPU.
Load balancing control: When the real-time load rate of a piece of hardware exceeds the preset load threshold (configurable according to the computing power characteristics of the heterogeneous hardware, preferably 80%), the task migration mechanism is triggered to migrate the 1-3 lowest-priority pending subtasks on that hardware to hardware of the same type whose load rate is below the preset low load threshold (preferably 50%). During migration, data caching ensures that intermediate results are not lost.
Task distribution: The scheduling decision results are distributed to the corresponding hardware through the task queue. Each subtask carries a unique identifier, input data address, execution priority, and a default timeout threshold of 500ms. After receiving the subtask, the hardware returns an acknowledgment signal to confirm successful task distribution.
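The scheduling decision and load balancing rules above might be sketched as follows. All names here (`SubTask`, `Device`, `schedule`, `rebalance`) are hypothetical, and two points are assumptions the text does not fix: "optimal matching" is read as choosing the least-loaded device of the matching type, and a larger `priority` value means higher priority. Load figures are treated as externally monitored and are not recomputed after a move.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

LOAD_THRESHOLD = 0.80      # preferred load threshold from the text
LOW_LOAD_THRESHOLD = 0.50  # preferred low load threshold from the text
TARGET_HARDWARE = {"light": "CPU", "heavy_float": "GPU", "heavy_low_power": "NPU"}


@dataclass
class SubTask:
    task_id: str
    kind: str              # "light", "heavy_float", or "heavy_low_power"
    priority: int          # assumption: larger value = higher priority
    timeout_ms: int = 500  # default timeout threshold from the text


@dataclass
class Device:
    name: str
    hw_type: str           # "CPU", "GPU", or "NPU"
    load: float            # monitored load rate in [0, 1]
    queue: List[SubTask] = field(default_factory=list)


def schedule(task: SubTask, devices: Sequence[Device]) -> Device:
    """Queue the task on the least-loaded device of the matching hardware type."""
    candidates = [d for d in devices if d.hw_type == TARGET_HARDWARE[task.kind]]
    device = min(candidates, key=lambda d: d.load)
    device.queue.append(task)
    return device


def rebalance(devices: Sequence[Device], max_moves: int = 3) -> List[Tuple[str, str, str]]:
    """Move up to `max_moves` lowest-priority queued tasks off overloaded devices."""
    moves = []
    for src in devices:
        if src.load <= LOAD_THRESHOLD:
            continue
        targets = [d for d in devices
                   if d is not src and d.hw_type == src.hw_type
                   and d.load < LOW_LOAD_THRESHOLD]
        if not targets:
            continue  # no same-type device is lightly loaded; leave tasks in place
        dst = min(targets, key=lambda d: d.load)
        for task in sorted(src.queue, key=lambda t: t.priority)[:max_moves]:
            src.queue.remove(task)
            dst.queue.append(task)
            moves.append((task.task_id, src.name, dst.name))
    return moves
```

A fuller version would also consult the task dependency graph before dispatch and carry the input data address and acknowledgment handshake, which are left out here for brevity.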

[0042] Referring to Figure 4, in one embodiment of this disclosure, the heterogeneous computing hardware is controlled to load and execute the inference model, accelerating the inference of the sub-inference tasks to obtain the inference results. Two acceleration mechanisms are provided: a confidence-driven early exit and speculative sampling for sequence generation tasks. The two mechanisms can be executed separately or in combination according to the type of inference task.

[0043] The confidence-driven early exit includes the following steps:
Step S41: During the execution of the sub-inference task, obtain the confidence level of the intermediate inference results.
In this specific embodiment, loading and executing the inference model includes: after the heterogeneous computing hardware receives the sub-inference task, it reads the adapted inference model from the model storage path determined in step S2, completes initialization operations such as model file parsing, memory / video memory allocation, weight loading, and operator compilation, and returns a ready signal after initialization. It then calls the model to perform inference on the received subtask input data, generating intermediate inference results in real time during the computation.

[0044] In this specific embodiment, the confidence level of intermediate inference results is obtained in real time during the execution of the sub-inference task. Specifically, a lightweight confidence estimator with a parameter scale of ≤500,000 is inserted into the key intermediate layers of the inference model (after every two residual blocks in the image model and after every three encoder layers in the speech / text model). The estimator is jointly trained with the main model to fit the mapping relationship between the intermediate layer outputs and the final results. In the classification task, the confidence level is the maximum probability value after softmax of the estimator output; in the regression task, it is the R² fitting coefficient between the estimator output and the true value.

[0045] Step S42: If the confidence level of the intermediate inference result is greater than or equal to the preset confidence level threshold, the subsequent calculation of the current sub-inference task is terminated in advance, and the intermediate inference result is taken as the final inference result of the current sub-inference task.

[0046] In this specific embodiment, when the confidence level meets a preset condition, subsequent calculations are prematurely terminated, and the intermediate result is used as the final result of the subtask. A preset confidence threshold is set (preferably 0.95 for classification tasks and 0.90 for regression tasks; this threshold can be customized and adjusted according to the balance between inference accuracy and inference efficiency). When the confidence level of the intermediate inference result is greater than or equal to this threshold, the execution of subsequent calculation layers of the current sub-inference task is immediately terminated, and the intermediate inference result is marked with an "early exit" flag and used as the final inference result of the current sub-inference task.
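Steps S41 and S42 can be illustrated with the following minimal sketch for a classification task. It assumes the model is represented as a list of callables (`blocks`), that estimators are attached to selected block indices through a dictionary, and that a `final_head` produces logits when no exit occurs; all of these names and the result dictionary layout are hypothetical and chosen only for illustration.

```python
import math


def max_softmax(logits):
    """Return (confidence, predicted index) from raw logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs[best], best


def run_with_early_exit(x, blocks, estimators, final_head, threshold=0.95):
    """Run blocks in order and stop once an estimator is confident enough.

    blocks:     list of callables, each mapping a hidden state to the next one
    estimators: dict {block index: callable(hidden) -> logits}
    final_head: callable(hidden) -> logits, used when no early exit happens
    threshold:  preset confidence threshold (0.95 is the preferred value
                given in the text for classification tasks)
    """
    hidden = x
    for index, block in enumerate(blocks):
        hidden = block(hidden)
        estimator = estimators.get(index)
        if estimator is None:
            continue  # no confidence estimator after this block
        confidence, prediction = max_softmax(estimator(hidden))
        if confidence >= threshold:
            # Skip the remaining blocks and mark the result as an early exit.
            return {"prediction": prediction, "confidence": confidence,
                    "early_exit": True, "exit_block": index}
    confidence, prediction = max_softmax(final_head(hidden))
    return {"prediction": prediction, "confidence": confidence,
            "early_exit": False, "exit_block": None}
```

Raising the threshold trades inference efficiency for accuracy, since fewer intermediate results qualify for an early exit.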

[0047] In another embodiment of this disclosure, if the sub-inference task is a sequence generation task such as text generation or speech synthesis, the draft model and the main model in the inference model are coordinated to perform speculative sampling to accelerate inference during the execution process.

[0048] As shown in Figure 5, in this specific embodiment, the process of accelerating inference using speculative sampling includes the following steps. Step S51: Generate K candidate lexical units based on the current context using the draft model, where K is the preset speculative step size.

[0049] Specifically, the draft model is a lightweight Transformer model obtained through knowledge distillation, with a parameter size of 1 / 5 to 1 / 3 that of the main model. Its distillation process employs a cross-entropy loss (weight 1) and a KL divergence loss at a distillation temperature of T=5 (weight 4) for 100 training rounds. The optimizer used is AdamW (initial learning rate 1e-4, weight decay 1e-5), ensuring that the draft model maintains inference speed while maintaining a high degree of consistency with the output distribution of the main model. The current context is the valid word sequence already generated in the sequence generation task (such as the preceding sentence in text generation, or the preceding audio features in speech synthesis). The preset speculative step size K can be dynamically adjusted according to the performance of heterogeneous hardware. K=3 is preferred in CPU environments, K=8 in GPU environments, and K=5 in NPU environments, with the maximum value of K not exceeding 15 to avoid excessive validation overhead in the main model. The draft model quickly generates K candidate words using an autoregressive approach, with a generation speed ≤1ms / word, ensuring acceleration.
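The weighted distillation objective described above (cross-entropy with weight 1 plus KL divergence at temperature T=5 with weight 4) can be sketched for a single training example as follows. Several details are assumptions not stated in the text: the KL divergence is taken from the main (teacher) distribution to the draft (student) distribution, no T² scaling factor is applied, and the function and parameter names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(draft_logits, main_logits, target_index,
                      temperature=5.0, ce_weight=1.0, kl_weight=4.0):
    """Loss = ce_weight * CE(draft, target) + kl_weight * KL(main_T || draft_T)."""
    # Hard-label cross-entropy on the draft model's ordinary softmax output.
    cross_entropy = -math.log(softmax(draft_logits)[target_index])
    # Soft-label term: compare temperature-softened distributions.
    main_soft = softmax(main_logits, temperature)
    draft_soft = softmax(draft_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(main_soft, draft_soft) if p > 0)
    return ce_weight * cross_entropy + kl_weight * kl
```

When the draft logits equal the main-model logits the KL term vanishes, which matches the stated goal of keeping the draft output distribution consistent with that of the main model.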

[0050] Step S52: Input the K candidate words along with the current context into the main model. The main model calculates the probability distribution of the candidate word positions in parallel and verifies whether the K candidate words are accepted based on the probability distribution output by the main model.

[0051] Specifically, the main model is the adapted inference model selected in step S2 (such as sequence generation models like GPT-2 and T5). After concatenating the K candidate words with the current context to form a complete input sequence, the main model performs a forward propagation to calculate the probability distribution of the corresponding positions of the K candidate words in parallel (time ≤ 5ms), eliminating the need for serial calculation word by word and significantly improving verification efficiency. The verification logic is based on the probability distribution output by the main model: for each candidate word, its probability value in the probability distribution at the corresponding position is extracted. If the probability value is ≥ the preset word verification threshold (preferably 0.5, which can be adjusted according to the sequence generation accuracy requirements) and is the word with the highest probability in the probability distribution at that position, then the candidate word is determined to be accepted; otherwise, it is determined to be rejected.

[0052] Step S53: Accept the candidate words that have been validated consecutively as the output of this inference, and at the position of the first rejected candidate word, the main model regenerates the correct word.

[0053] Specifically, verification is performed sequentially according to the generation order of candidate words (from the 1st to the Kth). If the first m candidate words (m≤K) are accepted consecutively, these m candidate words are directly used as the output of this inference. If m<K, that is, the (m+1)th candidate word is rejected, verification of the subsequent candidate words is immediately terminated, and the main model regenerates the correct word at the (m+1)th position based on the current context (including the first m accepted candidate words). The generation process follows the autoregressive generation logic of the main model to ensure the accuracy and consistency of the words. For example, when K=5, if the first 3 candidate words are accepted and the 4th is rejected, the first 3 candidate words are output, the main model regenerates the 4th word, and the 5th candidate word is discarded.
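The verification rule of step S52 and the prefix acceptance of step S53 can be sketched as below. The sketch assumes that the main model's output for each candidate position is available as a dictionary mapping words to probabilities, and that "regenerating the correct word" means taking the most probable word at the rejected position (greedy decoding); both are illustrative assumptions, and the function name is hypothetical.

```python
def accept_prefix(candidates, main_distributions, threshold=0.5):
    """Accept the longest verified prefix of the draft candidates.

    candidates:         list of K candidate words from the draft model
    main_distributions: list of K dicts {word: probability}, one per candidate
                        position, as output by the main model in one forward pass
    threshold:          preset word verification threshold (0.5 preferred)

    Returns (accepted_words, replacement), where replacement is the word
    regenerated by the main model at the first rejected position, or None
    if all K candidates were accepted.
    """
    accepted = []
    for word, distribution in zip(candidates, main_distributions):
        best_word = max(distribution, key=distribution.get)
        is_top = word == best_word
        is_confident = distribution.get(word, 0.0) >= threshold
        if is_top and is_confident:
            accepted.append(word)
            continue
        # First rejection: stop verifying and discard the remaining candidates.
        # Assumption: greedy choice at this position stands in for regeneration.
        return accepted, best_word
    return accepted, None
```

With K=5 and a rejection at the 4th position, the function returns the first three candidates together with the main model's word for position 4, matching the example in the paragraph above.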

[0054] Step S54: Use the newly generated word sequence as the input context for the next speculative sampling.

[0055] The output sequence of this inference (including consecutively accepted candidate terms + terms regenerated by the main model) constitutes a new context. This new context is fed into the draft model, triggering the next round of speculative sampling. The four steps S51-S54 mentioned above are repeated until a complete sequence that meets the preset length requirement or contains a terminator is generated, thus achieving cyclical acceleration of the sequence generation task. This cyclical mechanism ensures that each round of sampling is based on the latest valid context, guaranteeing the coherence and accuracy of sequence generation, while continuously leveraging the parallel acceleration advantages of speculative sampling.
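The outer loop over steps S51 to S54 can be sketched as follows. The per-round work (drafting K candidates, verifying them, and regenerating at the first rejection) is abstracted into a hypothetical callable `run_round`, so the sketch only shows how each round's output becomes the context of the next round and how the loop stops on a length limit or a terminator. All names are illustrative.

```python
def generate(context, run_round, max_length, eos_token):
    """Repeat speculative sampling rounds until a stop condition is met.

    context:    list of words already generated (the current context)
    run_round:  hypothetical callable(context) -> list of new words for this
                round (accepted candidates plus any word regenerated by the
                main model)
    max_length: preset length requirement for the complete sequence
    eos_token:  terminator that ends generation when produced
    """
    sequence = list(context)
    while len(sequence) < max_length:
        new_words = run_round(sequence)
        if not new_words:
            break  # defensive stop: a round that yields nothing cannot progress
        for word in new_words:
            sequence.append(word)
            if word == eos_token or len(sequence) >= max_length:
                return sequence
        # The extended sequence is the input context of the next round (S54).
    return sequence
```

Because every round appends at least one word, the number of main-model forward passes is at most the number of generated words and is smaller whenever draft candidates are accepted.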

[0056] After all sub-inference tasks are completed, the final inference results of each sub-task are associated, spliced, and feature-fused in reverse order of the task dependency graph. During the integration process, the integrity of data transmission and calculation is verified by check and comparison. Finally, the overall inference result corresponding to the original data to be inferred is generated and transmitted to the subsequent post-processing stage (if a result post-processing subsystem is configured).

[0057] In a preferred embodiment of this application, after obtaining the inference result of step S4, a post-processing step of the inference result can also be performed, namely, error correction and consistency verification of the inference result to obtain an accurate and reliable final inference result, and the final inference result is stored in the cache, which can further improve the reliability of inference and the response efficiency of repeated requests.

[0058] In one embodiment of this disclosure, error correction and consistency verification include two possible implementation methods. Method 1: Match the inference results against a preset rule base, correct any inference results that do not conform to the rules, and obtain the final inference result.

[0059] Specifically, Method 1 corrects errors in the inference results by using a pre-defined scenario-based rule base. First, a scenario-based rule base in JSON format is pre-defined. The content of the rule base is strongly correlated with the type of inference task. For example, text generation tasks include subject-verb-object collocation rules, punctuation usage rules, and semantic logic constraints; image classification tasks include category attribution rules (e.g., "cat" and "dog" belong to the "animal" category) and target feature matching rules; speech recognition tasks include speech semantic coherence rules and common phrase collocation rules. Then, the inference results output by the inference scheduling subsystem are matched and verified against the rule base field by field and dimension by dimension: for text-based inference results, grammatical correctness and semantic rationality are verified; for image-based inference results, category attribution accuracy and target feature consistency are verified; for speech-based inference results, the completeness of speech-to-text conversion and semantic coherence are verified. If anomalies that do not conform to the rule base are detected in the inference result (such as text syntax errors, image category errors, and speech semantic breaks), targeted adjustments are made according to the correction logic in the rule base. For example, a grammatically incorrect sentence is rewritten into its grammatical form (such as "I eat in the morning"), and a "dog" that is misclassified as "plant" is adjusted to "animal". At the same time, a correction log (including the original result, the anomaly type, and the basis for correction) is recorded. The corrected result is the final inference result.
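The category attribution check in this paragraph can be illustrated with a small sketch. The JSON layout (`category_of`), the result fields (`label`, `category`), and the log fields are hypothetical; the text only states that the rule base is stored in JSON format and that a correction log records the original result, the anomaly type, and the basis for correction.

```python
import json

# Hypothetical rule base fragment: each label is attributed to one category.
RULES_JSON = '{"category_of": {"cat": "animal", "dog": "animal"}}'


def correct_category(result, rules):
    """Check a classification result against category attribution rules.

    Returns (final_result, log_entry); log_entry is None when no correction
    was needed or when the rule base has no rule for the label.
    """
    expected = rules["category_of"].get(result["label"])
    if expected is None or expected == result["category"]:
        return result, None
    corrected = dict(result, category=expected)
    log_entry = {
        "original": result,                      # original result
        "anomaly": "category_mismatch",          # anomaly type
        "basis": "category_of[" + result["label"] + "]",  # basis for correction
    }
    return corrected, log_entry


rules = json.loads(RULES_JSON)
final, log = correct_category({"label": "dog", "category": "plant"}, rules)
```

Here `final` carries the category "animal" and `log` records why the change was made, mirroring the "dog" example in the paragraph above.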

[0060] Method 2: Use multiple inference models, or perform multiple inferences, on the same data to be inferred to obtain multiple candidate inference results. Fuse the candidate inference results by voting or weighted merging, and select the candidate inference result with the highest consistency as the final inference result.

[0061] Specifically, Method 2 generates multiple candidate results through multiple models or multiple inferences, and then merges and filters them to obtain the final inference result. Its specific implementation includes the following process:

Candidate result generation: For the same original data to be inferred, if the number of adapted models with a matching degree ≥ the preset model matching threshold (preferably 0.7) in step S2 is ≥ 2, then the multiple adapted inference models are called in parallel to perform inference respectively, or the same adapted model is used for multiple independent inferences (each inference uses different initialization parameters) to generate multiple candidate inference results, ensuring the diversity and coverage of the results.

Fusion method selection: The appropriate fusion strategy is selected based on the inference task type. For classification tasks, voting fusion is preferred, while for regression tasks, weighted fusion is preferred.

Voting fusion: The frequency of occurrence of all candidate inference results is counted. Following the majority voting principle, the candidate result with the highest frequency is selected as the final inference result. If multiple candidate results have the same frequency (e.g., tied for first place), the mean confidence score of each candidate result is further calculated, and the candidate result with the highest mean confidence score is selected.

Weighted fusion: Weights are set based on the validation set accuracy of each inference model (weight = model accuracy / sum of accuracies of all candidate models). The fusion result is calculated using the formula "Final Result = Σ(Candidate Model Result × Model Weight)".

Consistency screening: The consistency of the fused results is checked. If the consistency of multiple candidate results is greater than or equal to the preset consistency threshold for result fusion (e.g., label consistency for classification tasks and error ≤ 5% for regression tasks; this threshold can be flexibly configured according to the inference accuracy requirements), the fusion result is directly output. If the consistency is less than the threshold, the model with the highest accuracy on the validation set is selected to re-execute the inference, and the result obtained from the re-inference is used as the final inference result to ensure the reliability of the result.
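The voting fusion and weighted fusion rules of Method 2 can be sketched as follows. The function names and the representation of candidates (parallel lists of labels, confidences, predictions, and validation accuracies) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def voting_fusion(labels, confidences):
    """Majority vote; ties are broken by the highest mean confidence."""
    counts = Counter(labels)
    top_count = max(counts.values())
    tied = [label for label, count in counts.items() if count == top_count]
    if len(tied) == 1:
        return tied[0]

    def mean_confidence(label):
        scores = [c for l, c in zip(labels, confidences) if l == label]
        return sum(scores) / len(scores)

    return max(tied, key=mean_confidence)


def weighted_fusion(predictions, accuracies):
    """Final Result = sum(prediction * weight), weight = accuracy / sum of accuracies."""
    total_accuracy = sum(accuracies)
    return sum(p * a / total_accuracy for p, a in zip(predictions, accuracies))
```

For instance, three classification candidates labeled "cat", "dog", "cat" fuse to "cat" by majority, while two regression candidates 10.0 and 20.0 from equally accurate models fuse to 15.0.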

[0062] The two methods mentioned above are parallel optional implementation schemes, which can be flexibly selected according to the accuracy requirements and computing resources of the inference scenario: for inference tasks with clear rules and fixed scenarios (such as standardized text classification and fixed category image recognition), method one is preferred to balance efficiency and accuracy; for inference tasks with complex scenarios and ambiguous rules (such as open text generation and multi-category mixed image detection), method two is preferred to improve the robustness of the results.

[0063] In one embodiment of this disclosure, the calibrated final inference result is stored in a local cache (supporting Redis and Memcached) in key-value pairs. For example, a lightweight convolutional neural network is used to extract a 256-dimensional feature vector from the original data to be inferred, which serves as the cache key. The final inference result (including result type, generation time, and confidence level) serves as the cache value. The LRU (Least Recently Used) algorithm is used to manage the cache, with a preset cache capacity limit of 100,000 inference results and a data validity period of 7 days. These parameters can be flexibly customized and adjusted according to the actual concurrency of inference requests and data storage resources. When the cache reaches its capacity limit, the cache item with the earliest last access time is removed. When data exceeds its validity period, it is marked as invalid. Atomic operations are used for cache writing to avoid data conflicts. When a new inference request is received, the feature vector of the requested data is first generated and cosine similarity matched with the cache key. If the similarity is ≥ a preset cache hit similarity threshold (preferably 0.95, which can be dynamically configured), it is determined to be a cache hit, and the cache value is directly returned. If there is no hit, the complete inference process of steps S1-S4 is executed, and the new result is stored in the cache.
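The cache lookup and eviction behaviour described above can be sketched with an in-process structure. This is a simplified stand-in, not the Redis or Memcached deployment mentioned in the text: the class name is hypothetical, feature vectors are plain lists of floats supplied by the caller, matching is a linear scan over all keys, and the 7-day validity period and atomic writes are omitted.

```python
import math
from collections import OrderedDict


class SimilarityCache:
    """LRU cache keyed by feature vectors, with cosine-similarity hits."""

    def __init__(self, capacity=100_000, hit_threshold=0.95):
        self.capacity = capacity
        self.hit_threshold = hit_threshold
        self._entries = OrderedDict()  # key tuple -> result, least recent first

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, vector):
        """Return the cached result of the most similar key, or None on a miss."""
        best_key, best_similarity = None, -1.0
        for key in self._entries:
            similarity = self._cosine(vector, key)
            if similarity > best_similarity:
                best_key, best_similarity = key, similarity
        if best_key is not None and best_similarity >= self.hit_threshold:
            self._entries.move_to_end(best_key)  # refresh last access time
            return self._entries[best_key]
        return None

    def put(self, vector, result):
        """Store a result and evict the least recently used entry if full."""
        key = tuple(vector)
        self._entries[key] = result
        self._entries.move_to_end(key)
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
```

On a miss (`get` returns `None`), the caller would run steps S1-S4 and store the new result with `put`.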

[0064] The final inference results are returned to the user terminal or downstream system in formats such as JSON, XML, text, and images via output interfaces such as HTTP / HTTPS, TCP / UDP, and local files, meeting the output requirements of different application scenarios. For example, traffic sign recognition results in intelligent transportation scenarios are output to traffic control platforms for intelligent traffic light control; medical image diagnosis results in smart healthcare scenarios are output to hospital information systems to assist doctors in formulating treatment plans; and text reply results in high-concurrency online service scenarios are output to online service terminals to respond to user inquiries.

[0065] The end-to-end inference method based on deep learning disclosed in this embodiment first identifies the data type of the original data to be inferred and performs targeted standardization processing to obtain standardized data to be inferred. Then, based on the characteristics of this data, an appropriate inference model is selected from a pre-built model library. Subsequently, the inference task is split into lightweight and heavy sub-inference tasks by combining the computation graph structure of the inference model and the characteristics of heterogeneous computing hardware. A task dependency graph is generated and heterogeneous scheduling is completed according to the computation type of the sub-task and the status of hardware resources. It can accurately schedule to heterogeneous computing hardware such as CPU, GPU, and NPU, give full play to the computing power advantages of each hardware, improve resource utilization, and control the hardware loading model. Through confidence-driven early exit and / or speculative sampling of sequence generation tasks, the inference of sub-tasks is accelerated, reducing the hardware's invalid computation overhead. Finally, the inference results can be corrected for errors and verified for consistency, and the final results can be cached. This realizes end-to-end intelligent inference from the input of the original data to be inferred to the output of the final inference result. Each step is optimized collaboratively, and there is no additional overhead in data flow. This method leverages the computational advantages of different hardware through multimodal data adaptive standardization, precise model matching, fine-grained decomposition of inference tasks, and intelligent scheduling of heterogeneous hardware. Combined with a dual inference acceleration mechanism, it significantly reduces inference latency and increases inference throughput. 
Meanwhile, relying on result calibration and caching mechanisms, it effectively compensates for the accuracy loss during acceleration, ensuring the accuracy and reliability of inference results. It also improves the efficiency of responding to repeated requests and is adaptable to various inference scenarios involving multimodal data such as images, voice, and text. In particular, it meets the stringent requirements of edge computing and high-concurrency online services for inference efficiency and accuracy.

[0066] As shown in Figure 6, Embodiment 2 of this disclosure provides an end-to-end inference system based on deep learning, applied to multimodal intelligent inference scenarios, including a data preprocessing subsystem 1 and an inference scheduling subsystem 2.

[0067] Data preprocessing subsystem 1 is configured to standardize the acquired raw data to be inferred, thereby obtaining standardized data to be inferred.

[0068] The inference scheduling subsystem 2, which is connected to the data preprocessing subsystem 1, is configured to select an appropriate inference model from the pre-set model library of the model management unit based on the features of the standardized data to be inferred, split the inference task into multiple sub-inference tasks and schedule them to the heterogeneous computing hardware of the heterogeneous computing cluster unit, control the heterogeneous computing hardware to load and execute the inference model and accelerate the inference of the sub-inference tasks, obtain the inference results and output them.

[0069] Data from the external data source is processed sequentially by the data receiving module, the data type identification module 101, and the standardization processing module 102 before being transmitted to the inference scheduling subsystem 2.

[0070] In this specific embodiment, the original data to be inferred includes image / video data from intelligent transportation scenarios, diagnostic images / voice medical orders data from smart healthcare scenarios, and text interaction data from high-concurrency online service scenarios. The data receiving module provides multiple interface adaptations such as wired network, wireless network, USB, and PCIe, supports the reception of multimodal original data to be inferred, and has a built-in data buffer queue with a capacity of 10,000 entries to avoid data loss. The received data is pushed to the data type recognition module 101 in real time. The data type identification module 101 has a built-in multimodal data format parsing algorithm and a lightweight feature extractor. It performs type identification of the original data to be inferred, outputs the type determination result and matching probability, and transmits them to the standardization processing module 102. The standardization processing module 102 has built-in three types of standardization processing sub-modules: image, voice, and text. Based on the data type recognition results, it performs standardization operations such as size normalization, pixel value normalization, resampling, frame segmentation, word segmentation, and vectorization, and outputs standardized data to be inferred, which is transmitted to the inference scheduling subsystem 2 through the internal data bus.

[0071] After the data preprocessing subsystem 1 processes the data, it is then processed sequentially by the model selection module 201, the task splitting module 202, the hardware scheduling module 203, and the accelerated inference module 204. If the system does not have a result post-processing subsystem 3 configured, the data can be directly returned to the user terminal or downstream system through an external interface. If a result post-processing subsystem 3 is configured, the data is transmitted to that subsystem for subsequent calibration and caching.

[0072] In this specific embodiment, the model selection module 201 has a built-in data feature extraction algorithm and a weighted cosine similarity matching algorithm. It extracts the core features of the standardized data to be inferred, queries the preset model library of the model management unit, selects the inference model adapted to the data to be inferred, and outputs the model identifier, storage path, and input / output specifications. The task splitting module 202 has a built-in computation graph parser and dependency analyzer. It analyzes the computation graph structure of the adapted inference model and the computation characteristics of the heterogeneous computing hardware, splits the overall inference task into lightweight and heavy sub-inference tasks, generates a task dependency graph based on the dependency relationships between sub-tasks, and outputs a list of sub-tasks and their dependencies. The hardware scheduling module 203 has a built-in hardware resource monitor and scheduling decision unit. It collects the resource status information of the heterogeneous computing hardware (including CPU, GPU, and NPU) of the heterogeneous computing cluster unit in real time, schedules the sub-inference tasks to the corresponding heterogeneous computing hardware based on the computation type of the sub-inference tasks and the task dependency graph, and performs load balancing control. The accelerated inference module 204 has a built-in model loader, an early exit controller, and a speculative sampling controller. It controls the heterogeneous computing hardware of the heterogeneous computing cluster unit to load and execute the adapted inference model, and accelerates the inference of the sub-inference tasks through the early exit and speculative sampling mechanisms. After integrating the inference results of each sub-task, it outputs the overall inference result. The output supports docking with downstream devices such as traffic control platforms, hospital information systems, and online service terminals, adapting to the needs of multimodal intelligent inference scenarios.

[0073] In a preferred embodiment of this disclosure, the system may further include a result post-processing subsystem 3, which is communicatively connected to the inference scheduling subsystem 2. This subsystem is configured to perform error correction and consistency verification on the inference results to obtain the final inference result, and then store the final inference result in a cache. The inference data processed by the inference scheduling subsystem 2 is then processed sequentially by the result calibration module 301 and the cache management module 302 before being output externally.

[0074] In this specific embodiment, the result calibration module 301 has a built-in scenario-based rule base, a multi-model inference executor, and a result fusion unit. It corrects errors and verifies consistency of inference results through rule matching correction or multi-result fusion, and outputs accurate and reliable final inference results.

[0075] In this specific embodiment, the cache management module 302 has a built-in feature vector generator, LRU cache manager and cache hit detector. It generates the feature vector of the original data to be inferred as the cache key, stores the final inference result in the cache and performs cache update and eviction operations, performs cache hit detection on new inference requests, and directly returns the cache result when a hit occurs.

[0076] The deep learning-based end-to-end inference system disclosed in Embodiment 2 consists of a data preprocessing subsystem, an inference scheduling subsystem, and an optional result post-processing subsystem. The data preprocessing subsystem 1 completes the standardization transformation of the original data to be inferred through data receiving, type identification, and standardization processing modules. The inference scheduling subsystem 2, as the core, sequentially realizes the selection of the appropriate model, the splitting and heterogeneous scheduling of inference tasks, the loading of models on hardware, and the acceleration of inference subtasks through model selection, task splitting, hardware scheduling, and accelerated inference modules. The result post-processing subsystem completes the calibration, optimization, and cache storage of inference results through result calibration and cache management modules. All subsystems and internal functional modules are connected through internal data bus communication to form an integrated end-to-end inference execution architecture, which can accurately carry and implement the technical steps of the end-to-end inference method described in Embodiment 1. This system achieves professional and collaborative execution of each stage of the inference process through modular functional division. Each module's function corresponds one-to-one with the inference method steps, ensuring smooth data flow and accurate inference operations. It can fully leverage the computing power advantages of heterogeneous computing hardware to improve inference efficiency, while ensuring the reliability of inference results and the response speed of repeated requests through calibration and caching modules. The system has a streamlined structure, strong scalability, and can adapt to multimodal data inference needs. 
It can be flexibly deployed on various hardware carriers such as edge computing devices, servers, and smart terminals, and is suitable for multiple artificial intelligence inference industrial scenarios such as intelligent transportation, smart healthcare, and industrial inspection.

[0077] As shown in Figure 7, this disclosure also provides an end-to-end inference device based on deep learning, including a processor 700 and a memory 701. Optionally, the device may further include a communication interface 702 and a bus 703. The processor 700, communication interface 702, and memory 701 can communicate with each other via the bus 703. The communication interface 702 can be used for information transmission. The processor 700 can call logical instructions in the memory 701 to execute the end-to-end inference method based on deep learning described in the above embodiments.

[0078] Furthermore, the logic instructions in the aforementioned memory 701 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.

[0079] The memory 701, as a computer-readable storage medium, can be used to store software programs and computer-executable programs, such as program instructions / modules corresponding to the methods in the embodiments of this disclosure. The processor 700 executes functional applications and data processing by running the program instructions / modules stored in the memory 701, that is, it implements the end-to-end inference method based on deep learning in the above embodiments.

[0080] The memory 701 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 701 may include high-speed random access memory and may also include non-volatile memory.

[0081] This disclosure provides an electronic device, including a device body and the aforementioned end-to-end inference device. The end-to-end inference device is mounted on the device body. The mounting relationship described herein is not limited to placement within the product, but also includes mounting connections with other components of the product, including but not limited to physical connections, electrical connections, or signal transmission connections. Those skilled in the art will understand that the end-to-end inference device can be adapted to feasible product bodies to achieve other feasible embodiments.

[0082] The foregoing description and accompanying drawings fully illustrate embodiments of this disclosure to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, procedural, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the order of operation may vary. Parts and features of some embodiments may be included in or replace parts and features of other embodiments. Moreover, the terminology used in this application is for describing embodiments only and is not intended to limit the claims. As used in the description of embodiments and claims, the singular forms "a," "an," and "the" are intended to equally include the plural forms unless the context clearly indicates otherwise. Similarly, the term "and / or" as used in this application means including one or more of the associated listed items and all possible combinations thereof. Additionally, when used in this application, the term "comprise" and its variations "comprises" and / or "comprising" refer to the presence of stated features, integers, steps, operations, elements, and / or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. Without further limitations, an element defined by the phrase "comprises a..." does not exclude the presence of other identical elements in the process, method, or apparatus that includes said element. In this document, each embodiment may focus on its differences from other embodiments, and similar or identical parts between embodiments can be cross-referenced. For methods, products, etc., disclosed in the embodiments, if they correspond to the method section disclosed in the embodiments, reference can be made to the description of the method section for the relevant parts.

[0083] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of the embodiments of this disclosure. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

Claims

1. A deep learning-based end-to-end inference method applied to a multimodal intelligent inference service system, the multimodal intelligent inference service system comprising a data source access unit, a heterogeneous computing cluster unit, a model management unit, and a result output unit, characterized in that the method comprises: obtaining raw data to be inferred through the data source access unit, and standardizing the raw data to be inferred to obtain standardized data to be inferred; selecting, based on the features of the standardized data to be inferred, an inference model adapted to the data to be inferred from the preset model library of the model management unit; splitting the inference task corresponding to the data to be inferred into multiple sub-inference tasks, and scheduling the sub-inference tasks to the heterogeneous computing hardware of the heterogeneous computing cluster unit; and controlling the heterogeneous computing hardware to load and execute the inference model, accelerating the inference of the sub-inference tasks, obtaining the inference results, and outputting the inference results through the result output unit.

2. The end-to-end inference method according to claim 1, characterized in that splitting the inference task corresponding to the data to be inferred into multiple sub-inference tasks and scheduling the sub-inference tasks to the heterogeneous computing hardware of the heterogeneous computing cluster unit comprises: dividing the inference task into a lightweight-computation sub-inference task and a heavy-computation sub-inference task based on the computation graph structure of the inference model and the computational characteristics of the heterogeneous computing hardware; generating a task dependency graph based on the dependencies between the lightweight-computation sub-inference task and the heavy-computation sub-inference task; and scheduling the sub-inference tasks to the heterogeneous computing hardware according to the task dependency graph, based on the computation type of each sub-inference task and resource status information of the heterogeneous computing hardware.
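The scheduling described in claim 2 can be sketched as follows. This is a simplified illustration under stated assumptions: sub-tasks are already labeled "light" or "heavy", the resource status of each device is reduced to a single load counter, and the dependency order comes from the standard library's `graphlib.TopologicalSorter` (Python 3.9+). The names `SubTask`, `schedule`, `cpu0`, and `gpu0` are hypothetical.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SubTask:
    name: str
    kind: str                                  # "light" or "heavy" (assumed labels)
    deps: list = field(default_factory=list)   # names of prerequisite sub-tasks


def schedule(tasks, devices):
    """Return [(task_name, device_name)] in a dependency-respecting order.

    devices maps a device name to {"kinds": supported kinds, "load": int}.
    """
    by_name = {t.name: t for t in tasks}
    graph = {t.name: set(t.deps) for t in tasks}   # task dependency graph
    plan = []
    for name in TopologicalSorter(graph).static_order():
        kind = by_name[name].kind
        candidates = [d for d, info in devices.items() if kind in info["kinds"]]
        if not candidates:
            raise ValueError(f"no device supports kind {kind!r}")
        # Least-loaded device wins; ties are broken by device name.
        target = min(candidates, key=lambda d: (devices[d]["load"], d))
        devices[target]["load"] += 1
        plan.append((name, target))
    return plan


tasks = [
    SubTask("preprocess", "light"),
    SubTask("backbone", "heavy", deps=["preprocess"]),
    SubTask("postprocess", "light", deps=["backbone"]),
]
devices = {
    "cpu0": {"kinds": {"light"}, "load": 0},
    "gpu0": {"kinds": {"heavy"}, "load": 0},
}
plan = schedule(tasks, devices)
```

In this toy chain the light sub-tasks land on `cpu0` and the heavy one on `gpu0`. A production scheduler would also track memory, bandwidth, and completion events rather than a static order.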

3. The end-to-end inference method according to claim 1, characterized in that controlling the heterogeneous computing hardware of the heterogeneous computing cluster unit to load and execute the inference model, accelerating inference of the sub-inference tasks, and obtaining the inference result comprises: during execution of a sub-inference task, obtaining a confidence level of an intermediate inference result, and, if the confidence level of the intermediate inference result is greater than or equal to a preset confidence threshold, terminating the subsequent computation of the current sub-inference task early and taking the intermediate inference result as the final inference result of the current sub-inference task; and/or, when the sub-inference task is a sequence generation task, coordinating, during execution of the sub-inference task, a draft model and a main model in the inference model to perform speculative sampling so as to accelerate inference.
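The confidence-based early termination in claim 3 can be illustrated with a short Python sketch. It assumes the model is split into stages, each with an auxiliary classifier that returns class probabilities, and that confidence is the largest probability; the claim does not fix either choice. `early_exit_inference` and `make_stage` are hypothetical names, and the stages here are stubs that return fixed probabilities.

```python
def early_exit_inference(stages, x, threshold=0.9):
    """Run stages in order and stop once an intermediate result is confident.

    Each stage is a callable returning (hidden_state, class_probabilities).
    Returns (predicted_class, confidence, stages_executed).
    """
    if not stages:
        raise ValueError("at least one stage is required")
    state = x
    for executed, stage in enumerate(stages, start=1):
        state, probs = stage(state)
        confidence = max(probs)
        label = probs.index(confidence)
        if confidence >= threshold:
            break  # skip the remaining stages of this sub-task
    return label, confidence, executed


def make_stage(probs):
    """Stub stage that ignores its input and reports fixed probabilities."""
    return lambda state: (state, probs)


stages = [
    make_stage([0.6, 0.4]),
    make_stage([0.05, 0.95]),
    make_stage([0.01, 0.99]),
]
result = early_exit_inference(stages, x=None, threshold=0.9)
```

With a threshold of 0.9 the second stage already qualifies, so the third stage is never executed. If no stage reaches the threshold, the last stage's result is returned.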

4. The end-to-end inference method according to claim 3, characterized in that coordinating the draft model and the main model in the inference model to perform speculative sampling so as to accelerate inference comprises: generating, by the draft model, K candidate tokens based on the current context, where K is a preset speculative step size; inputting the K candidate tokens, together with the current context, into the main model; computing, by the main model, the probability distributions at the candidate token positions in parallel, and verifying, based on the probability distributions output by the main model, whether each of the K candidate tokens is accepted; accepting the consecutively verified candidate tokens as output of the current inference step, and, at the position of the first rejected candidate token, regenerating the correct token with the main model; and using the newly generated token sequence as the input context for the next round of speculative sampling.
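The loop in claim 4 can be sketched in Python as below. This sketch uses greedy verification (a candidate is accepted only if it equals the main model's own next-token choice), which is a deterministic simplification of the probabilistic acceptance test usually associated with speculative sampling; the claim does not specify the acceptance rule. The main model is called once per position here for clarity, whereas a real system would score all K positions in one batched forward pass. `speculative_step`, `generate`, `draft_next`, and `main_next` are hypothetical names, and both "models" are toy functions.

```python
def speculative_step(draft_next, main_next, context, k):
    """One speculative round; returns the tokens emitted by this round.

    draft_next and main_next map a token list to the next token.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # 1. The draft model proposes k candidate tokens autoregressively.
    candidates = []
    for _ in range(k):
        candidates.append(draft_next(context + candidates))
    # 2. The main model checks each candidate position in order.
    emitted = []
    for i, token in enumerate(candidates):
        expected = main_next(context + candidates[:i])
        if token == expected:
            emitted.append(token)      # accepted candidate
        else:
            emitted.append(expected)   # main model regenerates this position
            break                      # later candidates are discarded
    return emitted


def generate(draft_next, main_next, context, k, max_new_tokens):
    """Repeat speculative rounds, feeding each round's output back as context."""
    out = list(context)
    while len(out) - len(context) < max_new_tokens:
        out += speculative_step(draft_next, main_next, out, k)
    return out[: len(context) + max_new_tokens]


def main_next(tokens):
    """Toy main model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft model: agrees with the main model except after token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


first_round = speculative_step(draft_next, main_next, [1], k=4)
sequence = generate(draft_next, main_next, [1], k=4, max_new_tokens=6)
```

In the first round the draft proposes 2, 3, 0, 1; the first two are accepted and the third is replaced by the main model's 4. The final sequence is identical to what the main model alone would generate, which is the property that makes the speed-up lossless under greedy verification.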

5. The end-to-end inference method according to claim 1, characterized in that standardizing the raw data to be inferred to obtain the standardized data to be inferred comprises: identifying the data type of the raw data to be inferred, where the data type includes image data, speech data, or text data; and performing, based on the identified data type, a standardization operation on the raw data to be inferred to obtain the standardized data to be inferred, wherein: when the data type is image data, size normalization and/or pixel value normalization is performed; when the data type is speech data, resampling, framing, and/or feature extraction is performed; and when the data type is text data, word segmentation, vectorization, and/or encoding conversion is performed.
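A minimal Python sketch of the type-dependent standardization in claim 5 follows. It assumes the data type has already been identified and is passed in as a string, and it implements only one operation per type: nearest-neighbour resizing with pixel scaling for images, framing for speech, and whitespace segmentation with vocabulary lookup for text. All function names, the `kind` argument, and the parameter names are hypothetical.

```python
def standardize_image(pixels, size):
    """Nearest-neighbour resize to size x size, then scale 0..255 to 0..1."""
    h, w = len(pixels), len(pixels[0])
    return [
        [pixels[r * h // size][c * w // size] / 255.0 for c in range(size)]
        for r in range(size)
    ]


def standardize_speech(samples, frame_len, hop):
    """Split a sample list into fixed-length, possibly overlapping frames."""
    return [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, hop)
    ]


def standardize_text(text, vocab):
    """Whitespace segmentation, then map each word to an id (0 = unknown)."""
    return [vocab.get(word, 0) for word in text.lower().split()]


def standardize(kind, raw, **options):
    """Dispatch to the standardization routine for the identified data type."""
    handlers = {
        "image": standardize_image,
        "speech": standardize_speech,
        "text": standardize_text,
    }
    if kind not in handlers:
        raise ValueError(f"unsupported data type: {kind!r}")
    return handlers[kind](raw, **options)


image = standardize("image", [[0, 255], [255, 0]], size=2)
frames = standardize("speech", list(range(6)), frame_len=4, hop=2)
ids = standardize("text", "Hello world foo", vocab={"hello": 1, "world": 2})
```

Resampling, spectral feature extraction, and encoding conversion are omitted here; they would be additional handlers behind the same dispatch.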

6. The end-to-end inference method according to any one of claims 1 to 5, characterized in that the method further comprises: performing error correction and consistency verification on the inference result to obtain a final inference result; and storing the final inference result in a cache.

7. The end-to-end inference method according to claim 6, characterized in that performing error correction and consistency verification on the inference result to obtain the final inference result comprises: matching the inference result against a predefined rule base, and correcting any inference result that does not conform to the rules to obtain the final inference result; or, obtaining multiple candidate inference results for the same data to be inferred by using multiple inference models or by running inference multiple times, applying voting or weighted fusion to the multiple candidate inference results, and selecting the candidate inference result with the highest consistency as the final inference result.
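Both branches of claim 7 can be illustrated with a short Python sketch. It assumes rules are expressed as pairs of a validity check and a correction function, and that "highest consistency" means the largest total vote weight; neither representation is specified in the application. `apply_rules` and `weighted_vote` are hypothetical names.

```python
from collections import defaultdict


def apply_rules(result, rules):
    """Correct a result against a rule base of (is_valid, correct) pairs."""
    for is_valid, correct in rules:
        if not is_valid(result):
            result = correct(result)
    return result


def weighted_vote(candidates, weights=None):
    """Fuse candidate results; returns (winner, share of total weight).

    With weights omitted this is plain majority voting. Ties go to the
    candidate that appeared first.
    """
    if not candidates:
        raise ValueError("at least one candidate result is required")
    if weights is None:
        weights = [1.0] * len(candidates)
    if len(weights) != len(candidates):
        raise ValueError("weights and candidates must have the same length")
    scores = defaultdict(float)
    for result, weight in zip(candidates, weights):
        scores[result] += weight
    winner = max(scores, key=scores.get)
    return winner, scores[winner] / sum(weights)


# Rule base with a single rule: a score must lie in [0, 1], else clamp it.
rules = [(
    lambda r: 0.0 <= r["score"] <= 1.0,
    lambda r: {**r, "score": min(max(r["score"], 0.0), 1.0)},
)]
corrected = apply_rules({"label": "cat", "score": 1.2}, rules)

majority, share = weighted_vote(["cat", "dog", "cat"])
fused, fused_share = weighted_vote(["cat", "dog", "cat"], [0.2, 0.9, 0.3])
```

The unweighted vote picks "cat" with two of three votes, while the weighted vote picks "dog" because the single high-weight model outweighs the two low-weight ones.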

8. A deep learning-based end-to-end inference system, applied to multimodal intelligent inference scenarios, characterized in that the system comprises a data preprocessing subsystem and an inference scheduling subsystem, wherein: the data preprocessing subsystem is configured to standardize acquired raw data to be inferred to obtain standardized data to be inferred; and the inference scheduling subsystem, communicatively connected to the data preprocessing subsystem, is configured to: select, based on features of the standardized data to be inferred, an inference model adapted to the data to be inferred from the preset model library of the model management unit; split the inference task corresponding to the data to be inferred into multiple sub-inference tasks and schedule the sub-inference tasks to the heterogeneous computing hardware of the heterogeneous computing cluster unit; and control the heterogeneous computing hardware of the heterogeneous computing cluster unit to load and execute the inference model so as to accelerate inference of the sub-inference tasks, and obtain and output the inference result.

9. The end-to-end inference system according to claim 8, characterized in that the system further comprises: a post-processing subsystem, communicatively connected to the inference scheduling subsystem, configured to perform error correction and consistency verification on the inference result to obtain a final inference result, and to store the final inference result in a cache.

10. An electronic device, characterized in that it comprises: a processor, a memory, and a computer program stored in the memory, wherein the processor, when executing the computer program, implements the deep learning-based end-to-end inference method according to any one of claims 1 to 7.