Logistics communication speech and dialect real-time talkback and training system and method based on large language model
The logistics communication dialogue system based on a large language model solves the problems of low dialect recognition rate and semantic ambiguity, and realizes efficient logistics communication and safety feedback in complex language environments.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- YUKUAI CHUANGLING INTELLIGENT TECH (NANJING) CO LTD
- Filing Date
- 2026-04-17
- Publication Date
- 2026-06-26
Figure CN122050375B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of vehicle network data processing technology, and in particular to a logistics communication script and dialect real-time intercom and training system and method based on a large language model.
Background Technology
[0002] Against the backdrop of globalized trade and China's massive domestic market circulation, the efficiency of the logistics and transportation industry, as a vital infrastructure supporting the national economy, is directly limited by the accuracy of information exchange. However, the complex geographical and linguistic distribution within China presents significant challenges to logistics operations. Dialects remain widely used in informal operational scenarios such as long-distance transportation and loading / unloading scheduling within the logistics and transportation industry.
[0003] The linguistic environment of the logistics and transportation industry presents a dual complexity. Firstly, there is the diversity of regional dialects, encompassing numerous sub-dialects and major categories such as Northern dialects, Wu dialects, Cantonese, Gan dialects, Xiang dialects, Min dialects, and Hakka. Secondly, there is the industry's long-evolving system of specialized terminology, which is highly specialized and context-dependent.
[0004] Existing automatic speech recognition (ASR) systems are mostly trained on standard Mandarin corpora. Their recognition rate drops significantly when dealing with heavy regional accents or low-resource dialects. Furthermore, the same dialectal words can have different meanings in different logistics scenarios, further exacerbating the recognition difficulties. While generalized large language models (LLMs) perform exceptionally well in text understanding, they often produce ambiguities and even serious comprehension biases for logistics terms that haven't undergone industry alignment. These communication barriers not only reduce loading, unloading, and scheduling efficiency but can also lead to driving safety risks or economic disputes due to misunderstandings of instructions.
Summary of the Invention
[0005] To address the aforementioned technical issues, this application proposes a real-time intercom and training system for logistics communication scripts and dialects based on a large language model, including an in-vehicle intelligent terminal, an in-vehicle communication gateway, and a cloud server.
[0006] The in-vehicle intelligent terminal (HMI) is installed inside the logistics vehicle and is used to collect the driver's voice signal and vehicle status data, perform noise reduction and enhancement processing on the voice signal, and broadcast voice feedback.
[0007] The in-vehicle communication gateway is connected to the in-vehicle intelligent terminal and is used to encapsulate the collected voice signals and vehicle status data and upload them to the cloud server via the MQTT protocol. It also receives and forwards the feedback speech sent by the cloud server.
[0008] The cloud server is equipped with a speech large language model (Speech LLM), an industry knowledge distillation module, a confidence assessment module, a high-fidelity dialect synthesis module, and an interactive training module.
[0009] The aforementioned speech large language model is used to perform multimodal feature extraction, modal information fusion, and semantic reasoning on the received speech signal to generate a serialized semantic parsing result. Specifically, the multimodal feature extraction uses a cross-dialect pre-trained speech encoder to convert the speech signal into a speech feature vector. The modal information fusion uses a multimodal projector to map the speech feature vector to a text embedding space and fuses it with vehicle status data to obtain multimodal features. The semantic reasoning is achieved by using a high-performance open-source large language model (LLM) as the backend of the speech large language model and performing logistics domain adaptive fine-tuning (DAPT) on the LLM to generate the semantic parsing result.
[0010] The industry knowledge distillation module is used to adaptively fine-tune the high-performance open-source large language model in the logistics industry domain through the parameter efficient fine-tuning (PEFT) technology, and introduces a gating bias mechanism inside the high-performance open-source large language model to dynamically adjust the response sensitivity to logistics professional terms.
[0011] The confidence assessment module is used to calculate the normalized confidence of the semantic parsing result. When the normalized confidence level is lower than a preset threshold, a confirmation-clarification dialogue flow is triggered.
[0012] The formula for calculating the normalized confidence level is as follows:
[0013] ;
[0014] ;
[0015] Where x represents the multimodal features and y represents the semantic parsing result;
[0016] The probability distribution output for a high-performance open-source large language model;
[0017] T represents the length of the semantic parsing result, that is, the number of tokens contained in the semantic parsing result;
[0018] This represents the token at position t in the semantic parsing result;
[0019] Y represents the constrained output space, i.e., the preset set of logistics terms.
[0020] This represents any candidate output sequence in the constrained output space Y;
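[0021] When the normalized confidence is greater than or equal to the preset threshold, the semantic parsing result is executed;
[0022] When the normalized confidence is lower than the preset threshold, a confirmation-clarification dialogue flow is triggered and a confirmation request is sent to the driver;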
[0023] The preset threshold;
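As a hedged illustration only, one form of the normalized confidence that is consistent with the definitions listed above is a length-normalized log-likelihood that is then normalized over the constrained output space. The symbol names $S$, $C_{\mathrm{norm}}$, $P$, $y_t$, $y'$, $T'$ and $\tau$ are assumptions introduced here, as is the choice of a softmax-style normalization; the source text does not reproduce the formulas themselves.

```latex
% Length-normalized log-likelihood of the parsing result y given the multimodal features x
S(y \mid x) = \frac{1}{T} \sum_{t=1}^{T} \log P\left(y_t \mid y_{<t},\, x\right)

% Normalization over the constrained output space Y
% (each candidate y' is scored with its own length T')
C_{\mathrm{norm}}(y \mid x) = \frac{\exp S(y \mid x)}{\sum_{y' \in Y} \exp S(y' \mid x)}

% Decision rule against the preset threshold \tau
C_{\mathrm{norm}}(y \mid x) \ge \tau \;\Rightarrow\; \text{execute}, \qquad
C_{\mathrm{norm}}(y \mid x) < \tau \;\Rightarrow\; \text{confirm and clarify}
```

Under this assumed form, dividing by $T$ keeps long parsing results from being penalized merely for their length, and the denominator makes the score comparable across utterances because it always sums to 1 over the candidates in $Y$.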
[0024] The high-fidelity dialect synthesis module is used to synthesize feedback text into feedback speech that matches the driver's dialect features based on neural network timbre conversion technology.
[0025] The interactive training module is used to construct logistics business scenarios, conduct simulated dialogues with drivers, evaluate drivers' professional expression ability in response from multiple dimensions, and generate training reports.
[0026] Preferably, the preset threshold is set to 0.85.
[0027] Furthermore, the in-vehicle intelligent terminal integrates a multi-microphone array, uses beamforming technology to lock onto the driver's voice source, and introduces adaptive echo cancellation and noise suppression algorithms to preprocess the voice signal.
[0028] Furthermore, the vehicle status data includes at least one of the following: vehicle identification number (VIN), geographic location coordinates, load status, engine fault code, and current waybill number. It is obtained from the CAN bus through the in-vehicle communication gateway and provides business context for the speech large language model.
[0029] Furthermore, the gating bias mechanism enhances the recognition of logistics terminology, and its calculation formula is as follows:
[0030] ;
[0031] ;
[0032] Where B is a trainable bias embedding vector. The gating weight matrix is learned during the training of the high-performance open-source large language model.
[0033] G is the gate value;
[0034] B* is the weighted expert bias;
[0035] K is the standard key vector, an inherent feature of the high-performance open-source large language model that is used to understand general semantics;
[0036] K' is the fused key vector, which replaces the original standard key vector K in the attention layer of the high-performance open-source large language model and is used for subsequent attention calculation and semantic decoding;
[0037] The dynamic adjustment coefficient is adaptively adjusted based on the gate value G through a Sigmoid activation function. The gate value G reflects the overall overlap between the current input audio features and the preset logistics terminology database. When a high degree of overlap between the input and the preset logistics terminology database is detected, G tends to be positive and the dynamic adjustment coefficient approaches 0, which makes K' lean toward B* and tilts the high-performance open-source large language model toward the semantics of logistics industry terms.
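As a hedged illustration, the behavior described for the dynamic adjustment coefficient can be written as a convex interpolation between the standard key vector and the weighted expert bias. The symbol $\alpha$ and the interpolation form are assumptions introduced here; how B* is computed from B and the gating weight matrix is not recoverable from the text and is left unspecified.

```latex
% Assumed coefficient: decreases toward 0 as the gate value G becomes large and positive
\alpha = 1 - \sigma(G) = \sigma(-G), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}

% Assumed fusion of the standard key vector K with the weighted expert bias B^{*}
K' = \alpha\, K + (1 - \alpha)\, B^{*}
```

With this form, a strongly positive G gives $\alpha \to 0$ and hence $K' \to B^{*}$, while a strongly negative G gives $\alpha \to 1$ and leaves the general-purpose key vector K essentially unchanged.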
[0038] Furthermore, the speech encoder adopts a speech encoder pre-trained across dialects. Through contrastive learning on multiple types of dialect datasets, it brings together different dialect speech features with the same semantics and pushes apart dialect speech features with different semantics in the latent space, achieving preliminary alignment between dialect acoustic features and Mandarin speech semantics. It is then fine-tuned by supervised fine-tuning through dialect speech-text pairs in the logistics field.
[0039] Furthermore, the multimodal projector is a multilayer perceptron, used to map the dimension of the speech feature vector to the same dimension as the text embedding space of the speech large language model, and to fold consecutive speech frames to match the information density of the text token.
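The folding and projection described above can be sketched as follows. This is a minimal pure-Python illustration, assuming that "folding" means concatenating every `k` consecutive frames (zero-padding the last group) and that the perceptron has one hidden layer with ReLU; the function names, the padding choice, and the layer layout are hypothetical and not taken from the source.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def fold_frames(frames: List[Vector], k: int) -> List[Vector]:
    """Concatenate every k consecutive frames into one vector (zero-pad the tail)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    dim = len(frames[0]) if frames else 0
    folded = []
    for start in range(0, len(frames), k):
        group = frames[start:start + k]
        flat = [value for frame in group for value in frame]
        flat.extend([0.0] * (k * dim - len(flat)))  # pad an incomplete last group
        folded.append(flat)
    return folded


def linear(x: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Dense layer: weight has one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def project(frames: List[Vector], k: int,
            w1: Matrix, b1: Vector, w2: Matrix, b2: Vector) -> List[Vector]:
    """Fold frames, then map each folded vector to the text-embedding dimension."""
    return [linear(relu(linear(f, w1, b1)), w2, b2) for f in fold_frames(frames, k)]
```

Folding by a factor of `k` shortens the sequence by roughly `k` times, which is one way to bring the frame rate of speech features closer to the rate of text tokens.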
[0040] Furthermore, the specific process of semantic reasoning includes:
[0041] The input voice signal is sequentially mapped into a three-layer structure: a recognition intermediate layer, a standardized semantic output layer, and a business operation mapping layer. The recognition intermediate layer is used to identify regional indicator pronouns, logistics professional terms, or industry-specific terms in dialect expressions. The standardized semantic output layer is used to convert dialect expressions into standard business expressions. The business operation mapping layer is used to associate the standard business expressions with preset business operation interfaces.
[0042] Using context-aware reasoning together with the deep cognitive capabilities of the high-performance open-source large language model, the system identifies the driver's underlying business intention and generates serialized, standardized instructions as the semantic parsing result.
[0043] Furthermore, the high-fidelity dialect synthesis module uses a small number of dialect audio clips uploaded by the driver to generate personalized timbres, and drives the personalized timbres to produce dialect broadcasts in combination with the generated feedback text.
[0044] Furthermore, the interactive training module constructs logistics business scenarios including abnormal cargo damage claims, long-distance shuttle dispatch, and dangerous goods declaration; the evaluation dimensions include terminology accuracy, logical fluency, dialect proficiency, and emotional state. The interactive training module enhances the professionalism and cross-regional communication skills of newly hired drivers.
[0045] According to another aspect of this application, a method for real-time dialogue and training of logistics communication scripts and dialects based on a large language model includes the following steps:
[0046] S1: The in-vehicle intelligent terminal collects the driver's voice signals and vehicle status data, and uploads them to the cloud server in real time through the in-vehicle communication gateway;
[0047] The voice signal is also subjected to noise reduction and enhancement processing before being uploaded;
[0048] S2: The speech large language model deployed on the cloud server performs semantic parsing on the voice signal to obtain the semantic parsing result;
[0049] Step S2 specifically includes the following sub-steps:
[0050] S21: The speech signal is converted into a speech feature vector using a speech encoder;
[0051] S22: The speech feature vector is mapped to the text embedding space by a multimodal projector and fused with vehicle status data to generate multimodal features;
[0052] S23: Adaptive fine-tuning of a high-performance open-source large language model for the logistics domain through an industry knowledge distillation module;
[0053] S24: Input the multimodal features into a high-performance open-source large language model that has been adaptively fine-tuned in the logistics field for inference, and generate serialized semantic parsing results;
[0054] S3: The confidence assessment module calculates the normalized confidence score of the semantic parsing result. When the normalized confidence score is higher than or equal to a preset threshold, step S4 is executed; when the normalized confidence score is lower than the preset threshold, step S5 is executed.
[0055] S4: The speech large language model generates feedback text based on the semantic parsing result, the high-fidelity dialect synthesis module synthesizes the feedback text into feedback speech that matches the driver's dialect features, and the feedback speech is sent through the in-vehicle communication gateway to the in-vehicle intelligent terminal for broadcast;
[0056] S5: The confidence assessment module triggers the confirmation-clarification dialogue flow, outputs a question to the driver, receives the driver's follow-up voice input, and returns to step S1 for re-parsing;
[0057] S6: Based on the interactive training module, build logistics business scenarios, conduct simulated dialogues with drivers, evaluate the standardization of driver responses in terms of terminology and communication logic, and generate training reports.
[0058] The beneficial effects of the real-time intercom and training system and method for logistics communication scripts and dialects based on a large language model according to the present invention are as follows: (1) It adopts a speech large language model architecture and, through the cross-dialect pre-trained speech encoder, directly aligns dialect acoustic features with standard language semantics in the latent space and maps heavily accented dialect speech directly to business intent. Even when dialect vocabulary differs greatly from standard language vocabulary, semantic stability can be maintained through the vector space of the logistics domain. (2) It introduces trainable bias embeddings and a gating bias mechanism into the LLM, and at the same time obtains vehicle status data in real time through the in-vehicle communication gateway, which provides business context for LLM inference. (3) It deploys a confidence assessment module on the cloud server to calculate the normalized confidence of each semantic parsing result. When the normalized confidence is lower than the preset threshold, the system does not directly execute the instruction but triggers the confirmation-clarification dialogue flow to ask the driver for confirmation, which helps avoid semantic deviation caused by model hallucination or heavy accents. (4) It not only converts dialect into standard Mandarin, but also broadcasts the standardized feedback text back to the driver in the localized form the driver finds most acceptable, through the high-fidelity dialect synthesis module. (5) It combines the intercom with driver training, using the evaluation capability of the LLM to provide implicit quality feedback on each of the driver's communication records and to dynamically optimize the knowledge injection path for industry terminology.
Attached Figure Description
[0059] Figure 1 is a schematic diagram of the logistics communication dialogue and dialect real-time intercom and training system based on a large language model according to the present invention.
[0060] Figure 2 is a flowchart illustrating the logistics communication dialogue and dialect real-time intercom and training method based on a large language model according to the present invention.
Detailed Implementation
[0061] To provide a further understanding of the purpose, structure, features, and functions of the present invention, detailed descriptions are provided below with reference to specific embodiments.
[0062] As shown in Figure 1, this application provides a real-time intercom and training system for logistics communication scripts and dialects based on a large language model, including an in-vehicle intelligent terminal, an in-vehicle communication gateway, and a cloud server.
[0063] To address the high-noise environment generated by logistics vehicles during operation, such as engine noise, wind noise, and road noise, the in-vehicle intelligent terminal integrates a multi-microphone array and employs beamforming technology to pinpoint the driver's voice source location, generating a masking value in the time-frequency domain to suppress environmental noise. Simultaneously, the in-vehicle intelligent terminal also incorporates adaptive echo cancellation and noise suppression algorithms to effectively eliminate feedback echoes from the in-vehicle speakers and various background noises, thereby extracting clear voice signals in noisy driving environments.
[0064] Preferably, when the in-vehicle intelligent terminal collects the driver's voice signal and vehicle status data, the sampling rate is set to 16,000 Hz to ensure the complete features of the voice signal within an 8kHz bandwidth and to adapt to the ASR model; the quantization depth is set to 16 bit PCM to improve the dynamic range and ensure that dialect prosodic features are not lost; the pickup range is set to 360° uniform coverage to support the collection of voice commands from the passenger or other locations in the vehicle; and the noise reduction scheme adopts WebRTC AEC3 to effectively suppress speaker feedback and ambient background noise.
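The relationship between these capture parameters can be checked with a short calculation. The sketch below is illustrative only; the class name and the assumption of a single (mono) channel after beamforming are hypothetical and not stated in the source.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptureConfig:
    sample_rate_hz: int = 16_000   # sampling rate from the embodiment
    bit_depth: int = 16            # 16-bit PCM from the embodiment
    channels: int = 1              # assumption: mono stream after beamforming

    @property
    def nyquist_hz(self) -> float:
        """Highest representable frequency: half the sampling rate."""
        return self.sample_rate_hz / 2

    @property
    def bytes_per_second(self) -> int:
        """Raw PCM data rate before any compression."""
        return self.sample_rate_hz * (self.bit_depth // 8) * self.channels

    @property
    def dynamic_range_db(self) -> float:
        """Theoretical dynamic range of linear PCM: 20 * log10(2 ** bits)."""
        return 20 * math.log10(2 ** self.bit_depth)


config = CaptureConfig()
print(config.nyquist_hz, config.bytes_per_second, round(config.dynamic_range_db, 1))
```

A 16,000 Hz sampling rate yields the 8 kHz bandwidth mentioned above by the Nyquist criterion, and 16-bit mono PCM at that rate amounts to 32,000 bytes per second of raw audio.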
[0065] The in-vehicle communication gateway is connected to the vehicle's CAN bus and acquires vehicle status data in real time, including at least one of the following: vehicle identification number (VIN), geographic location coordinates, load status, engine fault code, and current waybill number.
[0066] The vehicle-mounted communication gateway encapsulates the collected voice signals and vehicle status data, and then uploads them to the cloud server via the MQTT protocol. Due to its low bandwidth consumption and asynchronous bidirectional communication characteristics, the MQTT protocol is extremely suitable for use on long-distance transportation routes with unstable network conditions. The uploaded message body is in JSON format, and data routing is performed through a predefined topic hierarchy structure.
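A message of this kind could be assembled as in the following sketch, which uses only the Python standard library and stops short of the actual MQTT publish call. The topic layout `logistics/vehicle/<VIN>/<channel>`, all JSON field names, and the example VIN are hypothetical; the source only states that the body is JSON and that routing uses a predefined topic hierarchy.

```python
import base64
import json


def build_topic(vin: str, channel: str) -> str:
    """Hypothetical topic hierarchy: one branch per vehicle, one leaf per data channel."""
    return f"logistics/vehicle/{vin}/{channel}"


def build_payload(vin: str, audio_pcm: bytes, status: dict, timestamp: int) -> str:
    """Wrap raw PCM audio and vehicle status data into a single JSON message body."""
    message = {
        "vin": vin,
        "timestamp": timestamp,
        "audio": {
            "encoding": "pcm_s16le",      # 16-bit PCM, as in the embodiment
            "sample_rate_hz": 16000,
            # JSON cannot carry raw bytes, so the audio is base64-encoded (assumption)
            "data_b64": base64.b64encode(audio_pcm).decode("ascii"),
        },
        "vehicle_status": status,
    }
    return json.dumps(message, ensure_ascii=False)


topic = build_topic("LTEST000000000001", "voice")
payload = build_payload(
    "LTEST000000000001",
    b"\x00\x01\x02\x03",
    {"waybill_no": "WB-001", "load_status": "loaded"},
    1700000000,
)
```

The resulting `topic` and `payload` strings would then be handed to whichever MQTT client the gateway uses; a per-vehicle topic branch lets the cloud server subscribe with wildcards and route messages by vehicle.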
[0067] The cloud server is the core processing unit of the system, and it is equipped with a speech large language model, an industry knowledge distillation module, a confidence assessment module, a high-fidelity dialect synthesis module, and an interactive training module.
[0068] The speech large language model is used to perform multimodal feature extraction, modal information fusion, and semantic reasoning on received speech signals to generate serialized semantic parsing results. Specifically, the multimodal feature extraction uses a cross-dialect pre-trained speech encoder to convert the speech signal into a speech feature vector. The modal information fusion uses a multimodal projector to map the speech feature vector to a text embedding space and fuses it with vehicle status data to obtain multimodal features. The semantic reasoning is achieved by using a high-performance open-source large language model (LLM) as the backend of the speech large language model and performing logistics domain adaptive fine-tuning (DAPT) on the LLM to generate semantic parsing results.
[0069] In one embodiment, the speech encoder employs a pre-trained speech encoder based on the Transformer architecture and performs cross-dialect adaptation in the following manner:
[0070] (a) The speech encoder is continuously pre-trained using a speech-text parallel corpus covering multiple dialect categories, wherein the speech in the parallel corpus is dialect speech and the text is the corresponding standard semantic representation; during the pre-training process, a contrastive learning loss function is used to make different dialect speech with the same semantics close to each other in the feature space output by the encoder, and dialect speech with different semantics far away from each other in the feature space.
[0071] (b) By using cross-modal consistency constraints, dialect speech features and corresponding standard semantic text features are mapped to the same semantic space, thereby achieving latent space alignment between dialect acoustic features and standard semantics;
[0072] (c) Supervised fine-tuning of the pre-trained speech encoder is performed using dialect speech-text parallel corpus labeled in the logistics industry, wherein the text consists of standardized logistics business terms and instructions.
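The contrastive objective in step (a) can be illustrated with an InfoNCE-style loss. The source does not name a specific loss, so InfoNCE, the cosine similarity measure, and the temperature value are assumptions chosen for illustration; the embeddings here stand in for speech encoder outputs.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor: Vector, positive: Vector, negatives: List[Vector],
             temperature: float = 0.1) -> float:
    """InfoNCE loss: -log softmax of the positive pair against all candidates.

    anchor    -- embedding of an utterance in one dialect
    positive  -- embedding of the same meaning spoken in another dialect
    negatives -- embeddings of utterances with different meanings
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - pos_logit


aligned = info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[0.9, 0.1]])
```

Minimizing this loss pulls same-meaning utterances from different dialects together and pushes different-meaning utterances apart, so `aligned` is much smaller than `misaligned` in the example.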
[0073] The industry knowledge distillation module uses parameter-efficient fine-tuning (PEFT) technology to adaptively fine-tune, for the logistics domain, the high-performance open-source large language model on which the speech large language model depends, and introduces a gating bias mechanism inside the high-performance open-source large language model to dynamically adjust the response sensitivity to logistics professional terms.
[0074] This embodiment employs the LoRA technique from parameter-efficient fine-tuning, updating only the low-rank decomposition matrices added to the high-performance open-source large language model, which significantly reduces the need for labeled data. Simultaneously, through text-driven unsupervised adaptation, continuous pre-training is performed using a large corpus of logistics industry texts (freight contracts, industry regulations, dispatch logs), and a text denoising task is introduced to simulate potential typos in speech recognition output, enabling the model to learn domain priors without relying on speech signals.
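The low-rank update at the heart of LoRA can be sketched in a few lines. This is a pure-Python illustration of the standard LoRA formulation, y = W x + (alpha / r) B A x with W frozen; the tiny matrices and the function names are hypothetical, and a real system would rely on a deep-learning framework rather than nested lists.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(x: Vector, w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Vector:
    """Frozen weight W (d_out x d_in) plus trainable low-rank update B (d_out x r) @ A (r x d_in)."""
    rank = len(a)
    base = matvec(w, x)                 # frozen pre-trained path
    update = matvec(b, matvec(a, x))    # trainable low-rank path
    scale = alpha / rank
    return [y + scale * u for y, u in zip(base, update)]


def trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Number of parameters in A and B, versus d_out * d_in for full fine-tuning."""
    return rank * (d_in + d_out)


# B starts at zero in LoRA, so the adapted layer initially matches the frozen layer.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]
b_zero = [[0.0], [0.0]]
print(lora_forward([2.0, 3.0], w, a, b_zero, alpha=2.0))
print(trainable_params(4096, 4096, 8), "trainable vs", 4096 * 4096, "full")
```

For a 4096 x 4096 projection with rank 8, the adapter trains 65,536 parameters instead of about 16.8 million, which is why the approach needs far less labeled data and memory than full fine-tuning.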
[0075] The gating bias mechanism enhances the recognition of logistics terminology, and its calculation formula is as follows:
[0076] ;
[0077] ;
[0078] Where B is a trainable bias embedding vector. The gating weight matrix is learned during the training of the high-performance open-source large language model.
[0079] G is the gate value;
[0080] B* is the weighted expert bias;
[0081] K is the standard key vector, an inherent feature of the high-performance open-source large language model that is used to understand general semantics;
[0082] K' is the fused key vector, which replaces the original standard key vector K in the attention layer of the high-performance open-source large language model and is used for subsequent attention calculation and semantic decoding;
[0083] The dynamic adjustment coefficient is adaptively adjusted based on the gate value G through a Sigmoid activation function. The gate value G reflects the overall overlap between the current input audio features and the preset logistics terminology database. When a high degree of overlap between the input and the preset logistics terminology database is detected, G tends to be positive and the dynamic adjustment coefficient approaches 0, which makes K' lean toward B* and tilts the high-performance open-source large language model toward the semantics of logistics industry terms.
[0084] By employing a gating bias mechanism, a trainable bias embedding vector is assigned to each technical term, and the fusion weights of the key vectors are dynamically adjusted during self-attention computation. When an audio feature is detected to be highly matched with a bias embedding vector in the terminology corpus, the model is forced to tilt towards the standard semantics of that term, without relying on phoneme-to-character conversion, thereby improving the recognition and recall rate of logistics technical terms.
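The fusion behavior described above can be sketched as follows, assuming the dynamic adjustment coefficient equals one minus the Sigmoid of the gate value and that the fused key is a convex combination of K and B*. Both assumptions, and the function names, are illustrative; the gate value and B* are taken as given inputs because the source text does not show how they are computed.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(g: float) -> float:
    return 1.0 / (1.0 + math.exp(-g))


def fuse_key(k: Vector, b_star: Vector, gate: float) -> Vector:
    """Blend the standard key K with the weighted expert bias B*.

    A large positive gate (strong overlap with the terminology database)
    drives the coefficient toward 0, so the result leans toward B*.
    """
    coefficient = 1.0 - sigmoid(gate)   # assumed form of the adjustment coefficient
    return [coefficient * ki + (1.0 - coefficient) * bi for ki, bi in zip(k, b_star)]


general_key = [1.0, 0.0]
term_bias = [0.0, 1.0]
for gate in (-10.0, 0.0, 10.0):
    print(gate, fuse_key(general_key, term_bias, gate))
```

With a strongly negative gate the key stays close to the general-purpose vector, at zero the two are mixed equally, and with a strongly positive gate the key is dominated by the term-specific bias.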
[0085] The confidence assessment module is used to calculate the normalized confidence of the semantic parsing results. When the normalized confidence is lower than a preset threshold, a confirmation-clarification dialogue flow is triggered.
[0086] The formula for calculating the normalized confidence level is as follows:
[0087] ;
[0088] ;
[0089] Where x represents the multimodal features and y represents the semantic parsing result;
[0090] The probability distribution output for a high-performance open-source large language model;
[0091] T represents the length of the semantic parsing result, that is, the number of tokens contained in the semantic parsing result;
[0092] This represents the token at position t in the semantic parsing result;
[0093] Y represents the constrained output space, i.e., the preset set of logistics terms.
[0094] This represents any candidate output sequence in the constrained output space Y;
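[0095] When the normalized confidence is greater than or equal to the preset threshold, the semantic parsing result is executed;
[0096] When the normalized confidence is lower than the preset threshold, a confirmation-clarification dialogue flow is triggered and a confirmation request is sent to the driver;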
[0097] The preset threshold;
[0098] By comparing the normalized confidence score with a preset threshold, the credibility of the semantic parsing results can be judged, which can effectively reduce the error rate of semantic parsing and execution.
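The threshold comparison can be illustrated with a small sketch. It assumes that each candidate in the constrained output space is scored by its mean token log-probability and that the scores are normalized with a softmax; this scoring form, the candidate names, and the example log-probabilities are hypothetical, while the 0.85 threshold comes from the preferred embodiment.

```python
import math
from typing import Dict, List


def length_normalized_score(token_log_probs: List[float]) -> float:
    """Mean log-probability per token, so long outputs are not penalized for length."""
    return sum(token_log_probs) / len(token_log_probs)


def normalized_confidence(candidates: Dict[str, List[float]], chosen: str) -> float:
    """Softmax of the chosen candidate's score over all candidates in the constrained space."""
    scores = {name: length_normalized_score(lp) for name, lp in candidates.items()}
    peak = max(scores.values())  # subtract the maximum for numerical stability
    denominator = sum(math.exp(s - peak) for s in scores.values())
    return math.exp(scores[chosen] - peak) / denominator


def decide(confidence: float, threshold: float = 0.85) -> str:
    """Execute at or above the threshold, otherwise ask the driver to confirm."""
    return "execute" if confidence >= threshold else "clarify"


clear_case = {
    "fee_verification": [-0.05, -0.02, -0.03],
    "information_inquiry": [-3.0, -2.5, -3.5],
    "complaint": [-4.0, -4.5, -3.5],
}
ambiguous_case = {
    "fee_verification": [-1.0, -1.0],
    "information_inquiry": [-1.0, -1.0],
}
print(decide(normalized_confidence(clear_case, "fee_verification")))
print(decide(normalized_confidence(ambiguous_case, "fee_verification")))
```

When one candidate clearly dominates, the confidence exceeds the threshold and the instruction is executed; when two candidates score equally, the confidence is 0.5 and the system falls back to the confirmation-clarification dialogue.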
[0099] The high-fidelity dialect synthesis module uses neural-network-based timbre conversion technology to synthesize feedback text into feedback speech that matches the driver's dialect features.
[0100] The interactive training module is used to build logistics business scenarios, conduct simulated dialogues with drivers, evaluate the drivers' professional expression ability in response from multiple dimensions, and generate training reports.
[0101] In one embodiment, a driver from Sichuan working for a logistics company discovered that the payment received for a delivery was not as expected after completing a transport task. He was driving on the highway and unable to stop to operate his mobile app. The driver activated the voice button on the in-vehicle smart terminal and spoke the instruction directly in Sichuan dialect: "The return trip fee for this vehicle was calculated incorrectly; it's fifty yuan short."
[0102] The in-vehicle intelligent terminal integrates a multi-microphone array, employs beamforming technology to lock onto the driver's voice source, and combines adaptive echo cancellation and noise suppression algorithms to extract clear voice signals in high-noise driving environments. Simultaneously, the in-vehicle communication gateway obtains vehicle status data such as the vehicle's VIN code, current GPS coordinates, and corresponding waybill number from the CAN bus, encapsulates this data along with the voice signal, and uploads it to the cloud server in real time via the MQTT protocol.
[0103] After receiving the audio stream and vehicle status data, the speech large language model on the cloud server first converts the speech signal into speech feature vectors using a cross-dialect pre-trained speech encoder. This speech encoder has been fine-tuned on multiple dialect datasets, including Wu, Cantonese, and Sichuanese, enabling preliminary semantic alignment between dialects and standard language in the latent space. Subsequently, a multimodal projector maps the speech feature vectors to the text embedding space and fuses them with the vehicle status data to generate multimodal features.
[0104] The multi-modal features are input into a high-performance open-source large language model (LLM) fine-tuned for the logistics field for semantic parsing. The parsing process adopts a three-layer mapping structure, and the parsing logic is shown in Table 1:
[0105] Table 1 Three-layer semantic mapping parsing logic table for dialect or term input
[0106] Dialect / terminology input | Recognition intermediate layer | Normalized semantic output | Business operation mapping
"This plate" | Regional demonstrative pronoun | "This shipment / this waybill" | Lock current waybill ID
"Turnaround car" | Logistics terminology | "Return transport vehicle" | Call the resource scheduling interface
"Affiliation fee" | Industry-specific terminology | "Agency management fee / operating qualification fee" | Trigger financial module accounting
[0107] During parsing, context-aware reasoning is also carried out using the business context provided by the vehicle status data. Specifically, "this plate" is first marked as a regional demonstrative pronoun in the recognition intermediate layer; in the normalized semantic output layer it is then converted into "this shipment" or "this waybill" according to the business context (such as the vehicle's current waybill status and GPS coordinates); finally, in the business operation mapping layer, it is associated with the business interface for locking the current waybill ID. Combined with the vehicle status data and the reasoning ability of the large language model, "fifty yuan short" is inferred as the business intention of "fee verification and reissuance", rather than a simple information inquiry or complaint. This mechanism maps vague references in the dialect to specific business objects and converts natural language into business instructions, avoiding problems such as mismatched waybills and misdelivered goods caused by misunderstood demonstrative pronouns. It also lets the system infer needs that the driver has not stated explicitly, improving interaction efficiency and user satisfaction. In addition, because the recognition intermediate layer has dedicated categories for logistics professional terms and industry-specific terms, the normalized semantic output layer converts "affiliation fee" into the standard business expression "agency management fee / operating qualification fee". This eliminates the ambiguity of the same term across contexts, improves the parsing accuracy of industry jargon, and helps avoid fee disputes and dispatching errors caused by misunderstood terms.
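The three-layer structure can be illustrated with a lookup-based sketch. In the described system the mapping is produced by the fine-tuned large language model; the static table, the English stand-ins for dialect terms, the operation identifiers, and the `waybill_id` field below are hypothetical simplifications used only to show how the layers relate.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MappingEntry:
    category: str              # recognition intermediate layer
    standard_expression: str   # normalized semantic output layer
    business_operation: str    # business operation mapping layer


# Rows follow Table 1; keys are English stand-ins for the dialect terms.
TERM_TABLE: Dict[str, MappingEntry] = {
    "this plate": MappingEntry(
        "regional demonstrative pronoun",
        "this shipment / this waybill",
        "lock_current_waybill_id",
    ),
    "turnaround car": MappingEntry(
        "logistics terminology",
        "return transport vehicle",
        "call_resource_scheduling",
    ),
    "affiliation fee": MappingEntry(
        "industry-specific terminology",
        "agency management fee / operating qualification fee",
        "trigger_financial_accounting",
    ),
}


def parse_terms(utterance: str, vehicle_status: dict) -> List[dict]:
    """Return one business instruction per recognized term, in table order."""
    text = utterance.lower()
    instructions = []
    for term, entry in TERM_TABLE.items():
        if term not in text:
            continue
        instruction = {
            "term": term,
            "category": entry.category,
            "standard": entry.standard_expression,
            "operation": entry.business_operation,
        }
        # The demonstrative pronoun is resolved with context from the vehicle status data.
        if entry.business_operation == "lock_current_waybill_id":
            instruction["waybill_id"] = vehicle_status.get("waybill_id")
        instructions.append(instruction)
    return instructions


result = parse_terms(
    "The affiliation fee for this plate is fifty yuan short",
    {"waybill_id": "WB-001"},
)
```

Here the vague reference "this plate" is bound to the waybill currently attached to the vehicle, which is the role the vehicle status data plays in the context-aware reasoning described above.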
[0108] After the semantic parsing result is generated, the confidence assessment module calculates its normalized confidence. In this example the normalized confidence is higher than the preset threshold of 0.85, so the system directly executes the semantic parsing result and automatically queries the financial database. The system then generates natural-language feedback text from the query result to convey the execution result to the driver.
[0109] The high-fidelity dialect synthesis module generates a personalized voice from a small number of dialect audio clips that the driver uploaded on first use, and uses this personalized voice to synthesize the feedback text into dialect speech with a Sichuan accent: "Received, the affiliation fee for this order was indeed undercounted. The shortfall has been paid to you and is expected to arrive within two hours." The synthesized dialect speech is sent through the in-vehicle communication gateway to the in-vehicle intelligent terminal for broadcast, completing the real-time intercom closed loop.
[0110] In another embodiment, a logistics company hired a new batch of long-distance drivers from different provinces. These drivers came from different dialect regions and were unfamiliar with the operational procedures and standardized terminology for transporting hazardous materials. Company management activated an interactive training module in the system backend to create a customized training plan for these new drivers.
[0111] After the interactive training module is activated, it first constructs a logistics business scenario. In this embodiment, the interactive training module simulates a "site inspector" who asks the driver through the vehicle's intelligent terminal: "Sir, you are currently transporting a Class II flammable liquid. Please describe your emergency response procedure." This scenario is generated based on the actual business needs of dangerous goods transportation. The interactive training module autonomously generates contextually coherent dialogue content according to preset scenario templates and business rules.
[0112] The driver responds by voice. The in-vehicle intelligent terminal applies noise reduction and enhancement to the voice signal, which is then uploaded to the cloud server through the in-vehicle communication gateway. The speech large language model performs semantic parsing on the driver's response and generates text. The interactive training module receives the generated text and evaluates the driver's response across multiple dimensions.
[0113] In terms of terminology accuracy, the interactive training module calculates the recall rate of terminology sequences, detecting whether the driver's answers contain standardized terms such as "cut off the main power," "evacuation direction," "fire extinguisher location," and "emergency phone number." In terms of logical fluency, the interactive training module analyzes the driver's answer sequence to assess whether they followed the correct emergency response steps, such as "stop - turn off the engine - evacuate - call the police - wait for rescue." In terms of dialect proficiency, the interactive training module analyzes the proportion of heavy accent features in the driver's answers to determine whether their dialect is so strong that it affects cross-regional communication and understanding. In terms of emotional state, the interactive training module analyzes the driver's emotional stability under simulated stress through emotional features in their speech, such as the presence of anxiety or tension.
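Two of the dimensions above, terminology accuracy and logical fluency, lend themselves to a short Python sketch. This is a simplified illustration under assumptions: substring matching stands in for model-based term detection, and the step-order score is computed as the longest correctly ordered run of expected steps, which is one possible scoring choice rather than the patent's method. The function names and sample answers are hypothetical; the term and step lists are taken from the paragraph above.

```python
REQUIRED_TERMS = [
    "cut off the main power",
    "evacuation direction",
    "fire extinguisher location",
    "emergency phone number",
]
EXPECTED_STEPS = [
    "stop",
    "turn off the engine",
    "evacuate",
    "call the police",
    "wait for rescue",
]


def term_recall(answer, required=REQUIRED_TERMS):
    """Fraction of required standard terms that appear in the answer."""
    text = answer.lower()
    return sum(1 for term in required if term in text) / len(required)


def step_order_score(answer, steps=EXPECTED_STEPS):
    """Fraction of expected steps mentioned in the correct relative order.

    Uses the longest increasing subsequence of the positions at which the
    expected steps occur in the answer.
    """
    text = answer.lower()
    positions = [p for p in (text.find(s) for s in steps) if p >= 0]
    best = []
    for i, pos in enumerate(positions):
        earlier = [best[j] for j in range(i) if positions[j] < pos]
        best.append(1 + max(earlier, default=0))
    return max(best, default=0) / len(steps)


good = ("I stop the truck, turn off the engine, cut off the main power, "
        "evacuate upwind, then call the police and wait for rescue.")
poor = "I call the police, then stop and evacuate."

print(term_recall(good), step_order_score(good))
print(term_recall(poor), step_order_score(poor))
```

In this example the first answer follows the full procedure but names only one of the four standard terms, so the two dimensions diverge, which is the kind of weakness a training report could surface.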
[0114] After training, the interactive training module generates a training report for the driver, including scores for each assessment dimension, analysis of weaknesses, and targeted improvement suggestions. Addressing these weaknesses, the system utilizes spare moments during subsequent actual driving to push targeted knowledge tips via the in-vehicle smart terminal screen. For example, while waiting to load or unload cargo, it pushes standard terminology cards for hazardous materials emergency response, achieving a virtuous cycle of practical training leading to real-world application.
[0115] Through the interactive training described above, newly hired drivers can quickly master the standardized terminology and operating procedures of the logistics industry without affecting their normal work, effectively improving their cross-regional communication skills and professional competence.
[0116] In another embodiment, a logistics vehicle suddenly lost power while driving in a mountainous area. The driver reported to the system by voice: "The vehicle is losing power and can't climb the hill." The noise level in the mountainous environment was extremely high, and the driver's accent was intensified by anxiety. Even so, the in-vehicle intelligent terminal located the driver's voice source using a multi-microphone array and beamforming technology, generated a masking value in the time-frequency domain to suppress the high-decibel roar of the engine, and, combined with adaptive echo cancellation and noise suppression algorithms, extracted a relatively clear voice signal from the noisy environment.
[0117] Meanwhile, the vehicle communication gateway collects vehicle status data from the CAN bus in real time, including critical information such as abnormal intake pressure signals and engine fault code P0299 (insufficient turbocharger / supercharger boost). This vehicle status data, along with the processed voice signal, is encapsulated and uploaded to the cloud server via the MQTT protocol.
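As a rough illustration of the encapsulation step, the Python sketch below packs a processed audio buffer and the vehicle status data into a single JSON payload of the kind that could be published over MQTT. The topic layout, the field names, and the `build_message` function are assumptions made for this example, and no actual MQTT client is invoked; only the fault code P0299 and the abnormal intake pressure signal come from the text.

```python
import base64
import json
import time


def build_message(vehicle_id, audio_bytes, status):
    """Return a (topic, payload) pair ready to hand to an MQTT client."""
    topic = f"logistics/{vehicle_id}/voice"  # hypothetical topic layout
    body = {
        "ts": int(time.time()),                # capture time in seconds
        "vehicle_status": status,              # business context from the CAN bus
        "audio_b64": base64.b64encode(audio_bytes).decode("ascii"),
    }
    return topic, json.dumps(body).encode("utf-8")


topic, payload = build_message(
    "truck-42",
    b"\x00\x01\x02\x03",  # placeholder for the denoised audio frames
    {"fault_codes": ["P0299"], "intake_pressure_abnormal": True},
)
decoded = json.loads(payload)
print(topic)
print(decoded["vehicle_status"])
```

Sending the audio and the status data in one message keeps them aligned in time, so the cloud side does not need to match a fault code to an utterance after the fact.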
[0118] During inference, the speech large language model on the cloud server performs modal fusion between the speech features and the fault code data. Specifically, a cross-dialect pre-trained speech encoder converts the voice signal into speech feature vectors. A multimodal projector then maps these vectors to the text embedding space, where they are concatenated with the vehicle status data to generate multimodal features containing both acoustic and vehicle operating condition information. These multimodal features are then input for inference into the high-performance open-source large language model that has been adaptively fine-tuned for the logistics domain.
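The projection and concatenation described above can be sketched in plain Python. The sketch assumes that the projector folds every `k` consecutive speech frames into one vector and passes it through a two-layer perceptron, and that the vehicle status data has already been embedded as vectors of the same dimension and is placed before the speech tokens. The dimensions, weights, ordering, and function names are hypothetical.

```python
def fold_frames(frames, k):
    """Concatenate every k consecutive frame vectors; drop a trailing remainder."""
    usable = len(frames) - len(frames) % k
    return [sum(frames[i:i + k], []) for i in range(0, usable, k)]


def linear(vec, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


def project(frames, k, w1, b1, w2, b2):
    """Fold frames, then map each folded vector into the text embedding space."""
    return [linear(relu(linear(f, w1, b1)), w2, b2) for f in fold_frames(frames, k)]


def fuse(speech_tokens, status_tokens):
    """Concatenate along the sequence axis (status first is an assumption)."""
    return status_tokens + speech_tokens


# Four speech frames of dimension 2, folded in pairs into two vectors of dimension 4.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
w1, b1 = [[0.1] * 4 for _ in range(3)], [0.0] * 3   # 4 -> 3 hidden units
w2, b2 = [[1.0] * 3 for _ in range(2)], [0.0] * 2   # 3 -> 2 embedding dims
speech_tokens = project(frames, 2, w1, b1, w2, b2)

status_tokens = [[0.5, -0.5]]  # hypothetical embedding of the vehicle status data
fused = fuse(speech_tokens, status_tokens)
print(len(fused), fused)
```

Folding frames shortens the speech sequence, which brings the number of speech positions closer to the number of text tokens that would carry the same content.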
[0119] Although the driver spoke in a mumbled dialect, saying "The car is weak and can't climb hills," the speech large language model combined the utterance with the key information of fault code P0299 and, through context-aware reasoning, determined a "turbocharger system malfunction." The semantic parsing result output by the speech large language model therefore includes both the recognized driver intent and the fault diagnosis conclusion.
[0120] As shown in Figure 2, this application provides a method for real-time dialogue and training of logistics communication scripts and dialects based on a large language model, including the following steps:
[0121] S1: The in-vehicle intelligent terminal collects the driver's voice signals and vehicle status data, and uploads them to the cloud server in real time through the in-vehicle communication gateway;
[0122] The voice signal is also subjected to noise reduction and enhancement processing before being uploaded;
[0123] S2: The speech large language model deployed on the cloud server performs semantic parsing on the voice signal to obtain the semantic parsing result;
[0124] Step S2 specifically includes the following sub-steps:
[0125] S21: The speech signal is converted into a speech feature vector using a speech encoder;
[0126] S22: The speech feature vector is mapped to the text embedding space by a multimodal projector and fused with vehicle status data to generate multimodal features;
[0127] S23: Adaptive fine-tuning of a high-performance open-source large language model for the logistics domain through an industry knowledge distillation module;
[0128] S24: Input the multimodal features into a high-performance open-source large language model that has been adaptively fine-tuned in the logistics field for inference, and generate serialized semantic parsing results;
[0129] S3: The confidence assessment module calculates the normalized confidence score of the semantic parsing result. When the normalized confidence score is higher than or equal to a preset threshold, step S4 is executed; when the normalized confidence score is lower than the preset threshold, step S5 is executed.
[0130] S4: The speech large language model generates feedback text based on the semantic parsing result, synthesizes the feedback text into feedback speech that matches the driver's dialect features through the high-fidelity dialect synthesis module, and sends the feedback speech to the in-vehicle intelligent terminal for broadcast through the in-vehicle communication gateway.
[0131] S5: The confidence assessment module triggers the confirmation-clarification dialogue flow, outputs a question to the driver, receives the driver's secondary voice input, and returns to step S1 for re-parsing;
[0132] S6: Based on the interactive training module, build logistics business scenarios, conduct simulated dialogues with drivers, evaluate the standardization of driver responses in terms of terminology and communication logic, and generate training reports.
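The control flow of steps S1 to S5 can be summarized in a short Python sketch. The callables passed in are placeholders for the collection, parsing, and synthesis components, and the `max_rounds` cap is an assumption added so that the loop terminates; the method as described does not state a limit on clarification rounds.

```python
def handle_turn(get_input, parse, synthesize, threshold=0.85, max_rounds=3):
    """Run one dialogue turn with a confirmation-clarification loop."""
    for _ in range(max_rounds):
        voice, status = get_input()                  # S1: collect and upload
        result, confidence = parse(voice, status)    # S2 and S3: parse and score
        if confidence >= threshold:
            return synthesize(f"Executed: {result}")  # S4: feedback speech
        synthesize(f"Did you mean: {result}?")        # S5: ask, then back to S1
    return None  # no confident parse within the allowed rounds


# Stub components for illustration only.
spoken = []
inputs = iter([("mumbled audio", {}), ("clear audio", {})])


def fake_parse(voice, status):
    return ("reissue fee", 0.9) if voice == "clear audio" else ("reissue fee", 0.6)


def fake_tts(text):
    spoken.append(text)
    return text


outcome = handle_turn(lambda: next(inputs), fake_parse, fake_tts)
print(outcome)
print(spoken)
```

In this run the first parse scores 0.6, so the driver is asked to confirm, and the second input scores 0.9 and is executed.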
[0133] The present invention has been described in the above-described embodiments; however, these embodiments are merely examples for implementing the present invention. It must be noted that the disclosed embodiments do not limit the scope of the present invention. Conversely, any modifications and refinements made without departing from the spirit and scope of the present invention are within the scope of patent protection of the present invention.
[0134] The contents of this invention not described in detail are existing technologies known to those skilled in the art.
Claims
1. A real-time intercom and training system for logistics communication scripts and dialects based on a large language model, characterized in that it comprises an in-vehicle intelligent terminal, an in-vehicle communication gateway, and a cloud server. The in-vehicle intelligent terminal is located inside the logistics vehicle and is used to collect the driver's voice signals and vehicle status data, to perform noise reduction and enhancement processing on the voice signals, and for voice broadcasting. The in-vehicle communication gateway is connected to the in-vehicle intelligent terminal and is used to encapsulate the collected voice signals and vehicle status data and upload them to the cloud server via the MQTT protocol. It also receives and forwards the feedback speech sent by the cloud server. The cloud server is equipped with a speech large language model, an industry knowledge distillation module, a confidence assessment module, a high-fidelity dialect synthesis module, and an interactive training module. The speech large language model is used to perform multimodal feature extraction, modal information fusion, and semantic reasoning on the received voice signal to generate a serialized semantic parsing result. Specifically, the multimodal feature extraction uses a cross-dialect pre-trained speech encoder to convert the voice signal into a speech feature vector. The modal information fusion uses a multimodal projector to map the speech feature vector to a text embedding space and fuses it with the vehicle status data to obtain multimodal features. The semantic reasoning uses a high-performance open-source large language model as the backend of the speech large language model, with the high-performance open-source large language model adaptively fine-tuned for the logistics domain, to generate the semantic parsing result. 
The speech large language model is also used to convert the semantic parsing result into feedback text. The industry knowledge distillation module is used to adaptively fine-tune the high-performance open-source large language model for the logistics domain through parameter-efficient fine-tuning technology, and introduces a gating bias mechanism inside the high-performance open-source large language model. The confidence assessment module is used to calculate the normalized confidence of the semantic parsing result. When the normalized confidence is lower than a preset threshold, a confirmation-clarification dialogue flow is triggered, in which the driver confirms or clarifies the semantic parsing result. The normalized confidence is calculated by two formulas [formulas not reproduced in this text], in which: x represents the multimodal features; y represents the semantic parsing result; the probability distribution is the one output by the high-performance open-source large language model; T represents the length of the semantic parsing result, that is, the number of tokens contained in the semantic parsing result; each token is identified by its position t in the semantic parsing result; Y represents the constrained output space, i.e., the preset set of logistics terms; and the formulas also refer to any candidate output sequence in the constrained output space Y. When the normalized confidence is greater than or equal to the preset threshold, the semantic parsing result is executed; when the normalized confidence is lower than the preset threshold, the confirmation-clarification dialogue flow is triggered and a confirmation request is sent to the driver. The high-fidelity dialect synthesis module is used to synthesize the feedback text into feedback speech that matches the driver's dialect features based on neural network timbre conversion technology. 
The interactive training module is used to construct logistics business scenarios, conduct simulated dialogues with drivers, evaluate the professional expression in the drivers' responses across multiple dimensions, and generate training reports. The logistics business scenarios constructed by the interactive training module include abnormal cargo damage claims, long-distance shuttle dispatch, and dangerous goods declaration; the evaluation dimensions include terminology accuracy, logical fluency, dialect level, and emotional state.
2. The system according to claim 1, characterized in that, The in-vehicle intelligent terminal integrates a multi-microphone array, uses beamforming technology to lock onto the driver's voice source, and introduces adaptive echo cancellation and noise suppression algorithms to preprocess the voice signal.
3. The system according to claim 1, characterized in that, The vehicle status data includes one or more of the following: vehicle identification number, geographic location coordinates, load status, engine fault code, and current waybill number. It is obtained from the CAN bus through the in-vehicle communication gateway to provide business context for the speech large language model.
4. The system according to claim 1, characterized in that, The gating bias mechanism enhances the recognition of logistics terminology and is calculated by formulas [formulas not reproduced in this text], in which: B is a trainable bias embedding vector; the gating weight matrix is learned during the training of the high-performance open-source large language model; G is the gate value; B* is the weighted expert bias; K is the standard key vector, which is a core feature inherent in the high-performance open-source large language model and is used to understand general semantics; K' is the fused key vector, which replaces the original standard key vector K in the attention layer of the high-performance open-source large language model and is used for subsequent attention calculation and semantic decoding; and the dynamic adjustment coefficient is adaptively adjusted based on the gate value G through a Sigmoid activation function. The gate value G reflects the overall overlap between the current input audio features and the preset logistics terminology database. When a high degree of overlap between the input and the preset logistics terminology database is detected, G tends to be positive and the dynamic adjustment coefficient approaches 0, which makes K' lean toward B* and biases the high-performance open-source large language model toward the semantics of logistics industry terms.
5. The system according to claim 1, characterized in that, The speech encoder is a pre-trained speech encoder across dialects. Through contrastive learning on multiple types of dialect datasets, it brings together different dialect speech features with the same semantics and pushes apart dialect speech features with different semantics in the latent space, achieving a preliminary alignment between dialect acoustic features and Mandarin speech semantics. It is then fine-tuned by supervised fine-tuning using dialect speech-text pairs in the logistics field.
6. The system according to claim 1, characterized in that, The multimodal projector is a multilayer perceptron, used to map the dimension of the speech feature vector to the same dimension as the text embedding space of the speech large language model, and to fold consecutive speech frames to match the information density of the text token.
7. The system according to claim 1, characterized in that, The specific process of semantic reasoning includes: The input voice signal is sequentially mapped through a three-layer structure: a recognition intermediate layer, a normalized semantic output layer, and a business operation mapping layer. The recognition intermediate layer is used to identify regional demonstrative pronouns, logistics professional terms, or industry-specific terms in dialect expressions. The normalized semantic output layer is used to convert dialect expressions into standard business expressions. The business operation mapping layer is used to associate the standard business expressions with preset business operation interfaces. Using context-aware reasoning together with the deep cognitive capabilities of the high-performance open-source large language model, the system identifies the driver's potential business intentions and generates serialized, standardized instructions as the semantic parsing result.
8. The system according to claim 1, characterized in that, The high-fidelity dialect synthesis module uses a small number of dialect audio clips uploaded by the driver to generate a personalized voice tone, and drives the personalized voice tone to produce a dialect broadcast in combination with the generated feedback text.
9. A method for real-time dialogue and training of logistics communication scripts and dialects based on a large language model, wherein the method employs the system for real-time dialogue and training of logistics communication scripts and dialects based on a large language model as described in any one of claims 1-8, characterized in that the method includes the following steps: S1: The in-vehicle intelligent terminal collects the driver's voice signals and vehicle status data, and uploads them to the cloud server in real time through the in-vehicle communication gateway; The voice signal is also subjected to noise reduction and enhancement processing before being uploaded; S2: The speech large language model deployed on the cloud server performs semantic parsing on the voice signal to obtain the semantic parsing result; Step S2 specifically includes the following sub-steps: S21: The voice signal is converted into a speech feature vector using a speech encoder; S22: The speech feature vector is mapped to the text embedding space by a multimodal projector and fused with vehicle status data to generate multimodal features; S23: Adaptive fine-tuning of a high-performance open-source large language model for the logistics domain through an industry knowledge distillation module; S24: Input the multimodal features into the high-performance open-source large language model that has been adaptively fine-tuned for the logistics domain for inference, and generate serialized semantic parsing results; S3: The confidence assessment module calculates the normalized confidence score of the semantic parsing result. When the normalized confidence score is higher than or equal to a preset threshold, step S4 is executed; when the normalized confidence score is lower than the preset threshold, step S5 is executed. 
S4: The speech large language model generates feedback text based on the semantic parsing result, synthesizes the feedback text into feedback speech that matches the driver's dialect features through the high-fidelity dialect synthesis module, and sends the feedback speech to the in-vehicle intelligent terminal for broadcast through the in-vehicle communication gateway. S5: The confidence assessment module triggers the confirmation-clarification dialogue flow, outputs a question to the driver, receives the driver's secondary voice input, and returns to step S1 for re-parsing; S6: Based on the interactive training module, build logistics business scenarios, conduct simulated dialogues with drivers, evaluate the standardization of driver responses in terms of terminology and communication logic, and generate training reports.