Offline voice system and method for field crop surveys based on an AR wearable device

By integrating a lightweight offline voice recognition system into AR wearable devices, the problems of network dependence and recognition accuracy in agricultural pest and disease survey systems in remote fields are solved, realizing efficient and automated pest and disease identification and data entry, adapting to multi-dialect environments, and meeting the needs of field surveys.

CN122201280A · Pending · Publication Date: 2026-06-12 · HANGZHOU YINGHE JIATIAN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HANGZHOU YINGHE JIATIAN TECH CO LTD
Filing Date
2026-02-05
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing agricultural pest and disease survey systems rely on network services, making them ineffective in remote fields. They also suffer from poor dialect compatibility, low recognition accuracy, insufficient hardware compatibility, low automation of data entry, and an inability to efficiently recognize agricultural terminology and local dialects.

Method used

A lightweight offline speech recognition system running on AR wearable devices is adopted. It combines offline speech recognition modules built on the Vosk and Kaldi frameworks with adaptive training for the agricultural field. It integrates dynamic grammar management and error tolerance modules, supports multiple dialects and complex environments, and turns natural sentences into structured data input.

Benefits of technology

It achieves high-precision identification of agricultural pests and diseases in environments without network coverage, adapts to multiple local dialects, reduces hardware requirements, improves recognition accuracy and automation, meets the needs of field surveys, has a long battery life, and provides a smooth interactive experience.

✦ Generated by Eureka AI based on patent content.

Figure CN122201280A_ABST

Abstract

The application relates to the field of agricultural information technology, in particular to a field crop investigation offline voice system and method based on an AR wearable device, the system comprising an AR wearable device carrying a mobile operating system, an offline voice recognition engine and a crop investigation application program, the AR wearable device being integrated with a processor, a memory, a microphone, a camera, a display and an AR rendering module, the offline voice recognition engine comprising a lightweight offline voice recognition module, a dynamic grammar management module and an instruction analysis and data generation module. By deploying a lightweight offline voice model trained in terms of agricultural dialects and professional vocabularies, and in combination with dynamic grammar limitation and homophone error tolerance technology, the system realizes high-precision recognition of agricultural professional vocabularies and abbreviations in a completely offline environment.

Description

Technical Field

[0001] This invention relates to the field of agricultural information technology, specifically to an offline voice system and method for field crop surveys based on AR wearable devices.

Background Technology

[0002] Agricultural pest and disease surveys are a fundamental prerequisite for crop pest and disease control, and the accuracy and efficiency of data collection directly affect the scientific nature of control decisions. Traditional pest and disease surveys mainly rely on manual data recording by surveyors, which is cumbersome, inefficient, and prone to errors.

[0003] With the development of speech recognition technology, some survey systems have begun to introduce voice interaction functions. However, existing technologies still have several shortcomings. First, they depend on network support: current mainstream speech recognition solutions, such as iFlytek and Baidu speech recognition, rely on cloud servers for recognition, whereas agricultural survey areas are mostly remote fields with weak or no network coverage, so the system cannot function there. Second, they have poor compatibility with dialects and industry-specific terms: China's main agricultural production areas are widely distributed, and surveyors often communicate in local dialects, yet general speech recognition models have poor support for complex agricultural terms such as "rice leaf roller" and "white-backed planthopper long-winged adult," resulting in low recognition accuracy. They also almost never correctly recognize the abbreviations that surveyors use to work faster, such as shortening "white-backed planthopper long-winged adult" to "white-long." Although offline speech recognition technologies exist, such as those based on Vosk-android and Kaldi, their general models are not optimized for agricultural terminology, user habits, or local dialects, so their recognition accuracy falls short of survey needs. Third, they suffer from insufficient hardware compatibility: some emerging large-scale speech models, such as Whisper, lack hardware support, and FunASR, while powerful, is bulky and resource-intensive, making it difficult to run stably on mobile terminals such as AR glasses; it also suffers from high response latency and excessive power consumption. Fourth, data entry automation is low: existing systems often require step-by-step voice commands and cannot generate and record structured data directly from natural language, so data processing still requires manual intervention.

[0004] Therefore, developing a voice interaction system for pest and disease surveys that can operate offline, is compatible with multiple dialects and agricultural terminology, has low latency, and is highly automated has become an urgent need in the field of agricultural technology.

Summary of the Invention

[0005] The purpose of this invention is to provide an offline voice system and method for field crop surveys based on AR wearable devices, in order to solve the problems mentioned in the background art.

[0006] To solve the above technical problems, the present invention adopts the following technical solution: an offline voice system for field crop surveys based on AR wearable devices, comprising an AR wearable device running a mobile operating system, an offline voice recognition engine, and a crop survey application. The AR wearable device integrates a processor, memory, microphone, camera, display, and AR rendering module. The offline voice recognition engine includes three modules. The lightweight offline speech recognition module, less than 50MB in size, is obtained by adaptively training an offline speech recognition model on speech data from the agricultural field, and performs speech recognition in environments without a network. The dynamic grammar management module maintains and manages multiple scenario-based grammar library files, each containing a limited vocabulary set, and dynamically switches which grammar library file is loaded according to the state of the application. The instruction parsing and data generation module parses the recognition results of the lightweight offline speech recognition module, extracts the target pest and disease entity and quantity attribute parameters, and automatically generates a structured survey record containing the full standard name of the pest or disease, the quantity, a timestamp, and location information.

[0007] The offline speech recognition model is based on the Vosk and Kaldi frameworks and has a size of less than 100MB. The speech data includes agricultural professional terms from various local dialects and background noise from the field environment.

[0008] The dynamic grammar management module also includes a fault-tolerance module for homophones and near-homophones. The fault-tolerance module is used to maintain a mapping table of homophones and near-homophones and to configure multiple equivalent recognition sequences for standard entries in the grammar library file.

[0009] The grammar library files include a login grammar library and an agricultural survey grammar library; the login grammar library contains numbers and basic login command vocabulary; the agricultural survey grammar library contains 500-2000 full names of agricultural pests and diseases, industry-recognized abbreviations, survey action commands, and quantity unit vocabulary.

[0010] The instruction parsing and data generation module supports composite voice instruction structures including [abbreviation of pest / disease] + [quantity] + [unit] and [construction operation] + [location] + [object].
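The composite structure [abbreviation of pest / disease] + [quantity] + [unit] can be illustrated with a minimal Kotlin sketch. The patent does not disclose its parser, so everything here is an assumption made for illustration: the names SurveyRecord and parseCommand, the single abbreviation entry, the unit set, and the space-separated English token format with the quantity given as digits.

```kotlin
// Hypothetical structured record; field names are illustrative, not from the patent.
data class SurveyRecord(
    val pestFullName: String,
    val quantity: Int,
    val unit: String,
    val timestampMillis: Long,
    val location: String
)

// Assumed abbreviation table (abbreviation -> standard full name).
// Only the pair mentioned in the background section is listed here.
val abbreviationToFullName = mapOf(
    "white long" to "white-backed planthopper long-winged adult"
)

// Assumed closed set of quantity units taken from the survey grammar.
val knownUnits = setOf("head", "plant", "ear")

// Parses "<abbreviation> <quantity> <unit>", for example "white long 5 head".
// Returns null when the text does not match the expected structure.
fun parseCommand(text: String, timestampMillis: Long, location: String): SurveyRecord? {
    val tokens = text.trim().lowercase().split(Regex("\\s+"))
    if (tokens.size < 3) return null
    val unit = tokens.last()
    val quantity = tokens[tokens.size - 2].toIntOrNull() ?: return null
    val name = tokens.dropLast(2).joinToString(" ")
    if (unit !in knownUnits || quantity < 0) return null
    val fullName = abbreviationToFullName[name] ?: return null
    return SurveyRecord(fullName, quantity, unit, timestampMillis, location)
}
```

Under these assumptions, parseCommand("white long 5 head", now, "plot A") yields a record whose pestFullName is the standard full name and whose quantity is 5, while an unknown abbreviation or a non-numeric quantity yields null so that the application can ask the user to repeat the command.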

[0011] After the instruction parsing and data generation module generates the survey record, the AR rendering module overlays confirmation information related to the survey record onto the display.

[0012] Furthermore, the mobile operating system is the Android system, and the offline speech recognition engine is integrated into the crop survey application. The core professional vocabulary and operation instructions for agricultural pest and disease surveys total approximately 2,000 words, so the offline speech recognition modules of Vosk-android and Kaldi were selected as the original models and customized for agricultural scenarios. By strictly limiting the recognition range to the fixed agricultural survey domain and combining a lightweight model with strong domain grammar rules, the lightweight offline speech recognition module is kept below 50MB. It can be stored and run for real-time inference on the hardware platforms of mainstream consumer AR glasses, and achieves low-latency response without a dedicated high-performance processor. Domain-adaptive retraining of the original model is the fundamental means of improving recognition accuracy. A dedicated training dataset is constructed through large-scale corpus collection, encompassing speech samples from multiple dialects including Guangxi, Sichuan, Hunan, and Fujian, as well as the full names and officially recognized abbreviations of agricultural professional terms. Based on the Kaldi toolchain, a combination of transfer learning and fine-tuning is used to retrain the basic acoustic and language models. Background noise from the field environment is mixed into the training data as augmentation, enhancing the model's robustness to complex field acoustic environments. The training process employs discriminative training criteria to optimize the objective function, mitigating the impact of model assumption biases and sharpening classification boundaries. After retraining, the customized model has learned the language patterns and pronunciation features of the agricultural field. A domain adversarial training mechanism strengthens the learning of accent-invariant features, giving the model strong adaptability to multiple dialect accents, so that it can recognize agricultural professional terms and abbreviations with high precision in complex field environments. To address the computational limitations of mobile devices, the recognizer is configured accordingly: setMaxAlternatives(1) restricts output to the single best result, and setWords(false) disables redundant word-level information, reducing the computational load on the device while preserving the accuracy of the recognition results and improving the system's operating efficiency under the mobile operating system.

[0013] Furthermore, the dynamic grammar management module addresses the bottleneck of recognizing agricultural terminology and dialect accents by applying several grammar control techniques together, specifically: Grammar restriction mechanism: for the Vosk recognition engine, the default open-vocabulary mode is abandoned, and a strict customized grammar library file is built and loaded instead. The file collects the full set of valid system instructions, the full names of agricultural pests and diseases, officially recognized abbreviations, and standard quantity units, forming a finite, closed vocabulary set. The vocabulary set covers action instructions such as "start", "take a photo", and "record"; full names of pests and diseases such as "Cnaphalocrocis medinalis"; officially recognized abbreviations such as "white long", "white short", "rice leaf roller"; and standardized quantity expressions. Because the search space during recognition is greatly compressed, a large number of irrelevant and easily confused words are filtered out, improving the recognition accuracy and response speed for target words.

[0014] Homophone and near-homophone error tolerance mechanism: during the construction of the grammar library file and the training of the language model, common homophone and near-homophone combinations of core words that are prone to pronunciation confusion are incorporated as equivalent recognition sequences, and a preset homophone and near-homophone mapping table is maintained. For example, for the standard entry "Sogatella furcifera", multiple equivalent recognition sequences such as "Sogatella furcifera", "Sogatella furbei", and "Sogatella furcifera shi" are configured in the grammar library file. When the output of the lightweight offline speech recognition module matches any equivalent sequence, the error tolerance module automatically maps it to the standard entry for output, avoiding character-level recognition errors caused by accent differences and unclear pronunciation.
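The mapping from equivalent recognition sequences to a standard entry can be sketched as a simple lookup table. This is only an illustration: the names equivalentSequences, normalizeEntry, and grammarPhrases are hypothetical, and the table reuses the three example sequences given in the paragraph above rather than the real mapping table, which is not disclosed.

```kotlin
// Assumed mapping table: each recognizable variant -> its standard entry.
// The variants are the illustrative sequences from the paragraph above.
val equivalentSequences = mapOf(
    "Sogatella furcifera" to "Sogatella furcifera",
    "Sogatella furbei" to "Sogatella furcifera",
    "Sogatella furcifera shi" to "Sogatella furcifera"
)

// Returns the standard entry for a recognized sequence, or null if nothing matches.
fun normalizeEntry(recognized: String): String? = equivalentSequences[recognized.trim()]

// Every key of the table would also be listed in the grammar as a recognizable phrase,
// so that the recognizer is allowed to output the variant in the first place.
fun grammarPhrases(): List<String> = equivalentSequences.keys.sorted()
```

Keeping variants and standard entries in one table means the grammar file and the error tolerance step cannot drift apart: a variant that the grammar accepts always has a standard entry to map to.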

[0015] Multi-scenario grammar dynamic switching mechanism: The offline speech recognition engine incorporates an agricultural survey grammar library and a login grammar library. The dynamic grammar management module can, according to whether the current running state of the application is login verification or field survey, call the corresponding grammar library in real time through a dedicated switching interface, realizing the scene-adaptive optimization of grammar rules, and ensuring that the recognition efficiency in different usage scenarios is in the optimal state.
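A minimal sketch of scene-dependent grammar selection follows. It assumes that the grammar is handed to the recognizer as a JSON array of phrases, which is the format Vosk accepts for grammar strings; the Scene enum, the function names, and the short phrase lists are hypothetical stand-ins for the real grammar library files.

```kotlin
// Hypothetical application states that select a grammar library.
enum class Scene { LOGIN, SURVEY }

// Illustrative phrase lists; the real libraries are far larger.
val loginPhrases = listOf(
    "zero", "one", "two", "three", "four", "five",
    "six", "seven", "eight", "nine", "scan QR code to log in"
)
val surveyPhrases = listOf("start survey", "take a photo", "record location", "end survey")

// Encodes phrases as a JSON array of strings, escaping backslashes and quotes.
fun toJsonArray(phrases: List<String>): String =
    phrases.joinToString(prefix = "[", postfix = "]", separator = ",") { phrase ->
        "\"" + phrase.replace("\\", "\\\\").replace("\"", "\\\"") + "\""
    }

// Returns the grammar string that matches the current application state.
fun grammarFor(scene: Scene): String = when (scene) {
    Scene.LOGIN -> toJsonArray(loginPhrases)
    Scene.SURVEY -> toJsonArray(surveyPhrases)
}
```

Using an exhaustive when over an enum makes the compiler flag any newly added scene that has no grammar assigned, which is one way to keep the switching interface and the set of grammar libraries consistent.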

[0016] Furthermore, the offline speech recognition engine uses a speech-driven data entry mechanism. Based on natural language understanding for the agricultural survey scenario, it uses a three-part structured parsing framework of [construction operation] + [location] + [object]. Users do not need to perform step-by-step operations; they only need to speak a natural-language statement that matches field survey habits, such as "Record that the occurrence degree of Cnaphalocrocis medinalis in the paddy field is medium". The offline speech system then completes semantic parsing and target confirmation automatically and maps the key information to standardized structured data for recording.

[0017] Furthermore, the training process of the offline speech recognition model includes the following steps. Step 1: Build a dialect dataset. Audio duration: more than 500 hours of effective speech data are collected in total, so that the data scale meets the requirements of deep model training.

[0018] Dialect types: The model focuses on core dialects from major agricultural regions in China, including Guangxi dialect, Sichuan dialect, Hunan dialect, Fujian dialect, Henan dialect, and Northeastern dialect. To ensure balanced model adaptation across dialects, each dialect's data constitutes at least 12% of the total data.
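The balance requirement can be checked mechanically before training. The sketch below is an assumption about how such a check might look; the function name underRepresented and the use of recorded hours as the measure of share are illustrative choices, not details from the patent.

```kotlin
// Returns the dialects whose share of the total recorded hours is below minShare.
// minShare defaults to the 12% floor stated in the text.
fun underRepresented(
    hoursByDialect: Map<String, Double>,
    minShare: Double = 0.12
): List<String> {
    val total = hoursByDialect.values.sum()
    require(total > 0.0) { "dataset is empty" }
    return hoursByDialect.filter { (_, hours) -> hours / total < minShare }.keys.toList()
}
```

With six dialects, the 12% floor fixes 72% of the data and leaves at most 28% that can be distributed unevenly, so no single dialect can exceed 40% of the dataset.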

[0019] Speaker categories: Over 300 native speakers from the aforementioned dialect regions were recruited to complete the recordings, spanning ages 18-60, with a male-to-female ratio of approximately 1:1. Through a diverse combination of speakers, the acoustic characteristics of different age and gender groups were comprehensively captured, enhancing the model's generalization ability.

[0020] Content Design: The audio transcripts closely match the actual scenarios of agricultural surveys, accurately covering four core dimensions: full names of pests and diseases (e.g., "rice sheath blight", "corn borer", "rice leaf roller"); industry abbreviations (e.g., "white long", "brown planthopper", "brown short"); action instructions (e.g., "start survey", "take a photo", "record location", "end survey"); and composite quantity structures (e.g., "5 found", "10 found", "severity level three").

[0021] Environmental Enhancement: To improve the robustness of the model in complex field environments, environmental noise enhancement was performed on all clean speech data. The noise mixed in all came from real field scenes, including wind noise, birdsong and insect chirping, farm machinery noise, and background human voices. The signal-to-noise ratio (SNR) fluctuated randomly between 0dB and 20dB to simulate speech propagation conditions under different field environments.
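Mixing noise at a chosen signal-to-noise ratio amounts to scaling the noise so that 10·log10(P_speech / P_noise) equals the target value in dB. The sketch below shows that calculation under simplifying assumptions: samples are plain float arrays, the noise clip is looped when it is shorter than the speech, and the SNR is drawn uniformly from the 0 dB to 20 dB range. The function names are hypothetical, and the actual augmentation tooling used by the applicant is not disclosed.

```kotlin
import kotlin.math.pow
import kotlin.math.sqrt
import kotlin.random.Random

// Mean power of a signal: the average of the squared samples.
fun meanPower(x: FloatArray): Double = x.sumOf { it.toDouble() * it } / x.size

// Adds noise to speech so that the speech-to-noise power ratio equals snrDb.
// The noise is repeated cyclically if it is shorter than the speech.
fun mixAtSnr(speech: FloatArray, noise: FloatArray, snrDb: Double): FloatArray {
    require(speech.isNotEmpty() && noise.isNotEmpty()) { "inputs must be non-empty" }
    val tiled = FloatArray(speech.size) { i -> noise[i % noise.size] }
    val noisePower = meanPower(tiled)
    require(noisePower > 0.0) { "noise must not be silent" }
    // Gain g solves P_speech / (g^2 * P_noise) = 10^(snrDb / 10).
    val gain = sqrt(meanPower(speech) / (noisePower * 10.0.pow(snrDb / 10.0)))
    return FloatArray(speech.size) { i -> (speech[i] + gain * tiled[i]).toFloat() }
}

// Draws an SNR uniformly from the 0 dB to 20 dB range given in the text.
fun randomSnrDb(rng: Random): Double = rng.nextDouble(0.0, 20.0)
```

At 0 dB the scaled noise carries the same power as the speech, and at 20 dB it carries one hundredth of it, so the range covers conditions from heavily masked to nearly clean speech.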

[0022] Step 2: Model Architecture and Training. Basic framework: The training framework is built on the open-source Kaldi speech recognition toolkit, relying on its mature acoustic model training chain to ensure the stability and efficiency of the training process.

[0023] Model Structure: The core acoustic model employs a factorized time-delay neural network (TDNN-F). This structure simplifies the model parameter scale through factorization decomposition technology, significantly reducing computational resource consumption while maintaining high recognition accuracy, perfectly adapting to the lightweight deployment requirements of mobile terminals.

[0024] Feature extraction: 40-dimensional Mel frequency cepstral coefficients (MFCC) or 40-dimensional Mel filter bank coefficients (FBANK) are used as basic features, and their first-order and second-order difference features are extracted and appended to form a 120-dimensional feature vector. Online cepstral mean and variance normalization (CMVN) is also applied to reduce interference from different audio acquisition channels and improve feature robustness.
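The step from 40 to 120 dimensions comes from appending first-order and second-order differences to each frame. The sketch below illustrates only that bookkeeping, using a simple two-point difference with clamped edges as a stated simplification; Kaldi computes deltas with a regression window, and the function names here are hypothetical.

```kotlin
// Simple two-point difference: d[t] = (x[t+1] - x[t-1]) / 2, with edge frames clamped.
// This is a simplification of the windowed regression used by speech toolkits.
fun delta(frames: List<FloatArray>): List<FloatArray> =
    frames.indices.map { t ->
        val prev = frames[maxOf(t - 1, 0)]
        val next = frames[minOf(t + 1, frames.lastIndex)]
        FloatArray(prev.size) { d -> (next[d] - prev[d]) / 2f }
    }

// Concatenates static, first-order, and second-order features for each frame.
// For 40-dimensional input frames this produces 120-dimensional output frames.
fun addDeltas(frames: List<FloatArray>): List<FloatArray> {
    val d1 = delta(frames)
    val d2 = delta(d1)
    return frames.indices.map { t -> frames[t] + d1[t] + d2[t] }
}
```

In the output layout, indices 0 to 39 hold the static coefficients, 40 to 79 the first-order differences, and 80 to 119 the second-order differences.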

[0025] Training process: A three-stage training strategy of "pre-training - transfer learning - fine-tuning" is adopted to balance model accuracy and efficiency. The basic GMM-HMM model is pre-trained on large-scale Mandarin data to complete the initial alignment of speech and text, providing reliable initialization parameters for subsequent dialect model training. The dialect dataset constructed in the first step is then used for transfer learning and fine-tuning of the TDNN-F model, so that the model quickly adapts to the acoustic characteristics and agricultural vocabulary of each target dialect. The language model is a text-driven 3-gram model whose vocabulary is strictly limited to the core agricultural vocabulary and instruction set defined by the system, reducing interference from irrelevant words and improving recognition efficiency.

[0026] Step 3: Model Optimization. Structured pruning: To further reduce model size and inference power consumption, structured pruning is performed on the trained TDNN-F acoustic model. An "L1 regularization guided pruning" strategy is adopted, which uses the L1 norm to measure the importance of the weights. The smaller the L1 norm, the lower the contribution of the channel / convolution kernel to the model output, and the more suitable it is to be pruned.

[0027] Specific steps:
① Pruning preparation: Load the trained TDNN-F model weights into memory, traverse each layer of the model (including convolutional layers and fully connected layers), and calculate the L1 norm of each channel weight in each layer;
② Threshold determination: Based on the model accuracy loss threshold, determine the pruning threshold through iterative testing on the validation set. The pruning threshold is set to 0.3 times the average L1 norm of the weights in each layer;
③ Channel pruning: Remove channels in each layer whose L1 norm is less than the pruning threshold, and delete the corresponding weight parameters; for the factorized layers of TDNN-F, prune the dimension of the corresponding factor matrix simultaneously to ensure the integrity of the model structure;
④ Fine-tuning and recovery: Fine-tune the pruned model on a 50-hour speech dataset using the SGD optimizer with an initial learning rate of 1e-5, iterating for 20 rounds to adapt the model to the pruned structure and restore recognition accuracy;
⑤ Validation and evaluation: Validate the performance of the pruned model on the test set, ensuring that the accuracy loss of core agricultural vocabulary recognition is ≤1%, and that the model size is reduced from 55MB to 49MB after pruning.
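Steps ① to ③ can be sketched for a single layer as follows. The sketch assumes that a layer is stored as one weight row per output channel and that "average L1 norm" means the mean of the per-channel L1 norms within that layer; both are interpretations, and the function names are hypothetical.

```kotlin
import kotlin.math.abs

// L1 norm of each channel, where each row holds the weights of one output channel.
fun channelL1Norms(layer: List<FloatArray>): List<Double> =
    layer.map { channel -> channel.sumOf { abs(it.toDouble()) } }

// Indices of channels to keep: those whose L1 norm is at least ratio * mean norm.
// ratio defaults to the 0.3 factor stated in step ②.
fun channelsToKeep(layer: List<FloatArray>, ratio: Double = 0.3): List<Int> {
    val norms = channelL1Norms(layer)
    val threshold = ratio * norms.average()
    return norms.indices.filter { norms[it] >= threshold }
}

// Drops the channels that fall below the threshold.
fun prune(layer: List<FloatArray>, ratio: Double = 0.3): List<FloatArray> =
    channelsToKeep(layer, ratio).map { layer[it] }
```

In a real network the input dimension of the following layer has to shrink by the same channel indices, which is what step ③ refers to when it prunes the corresponding factor matrix dimensions; the list returned by channelsToKeep is what such a step would consume.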

[0028] Dynamic range quantization optimization: To reduce memory usage and computational latency during model inference, dynamic range quantization is performed on the pruned model. The 32-bit floating-point (FP32) weights and activation values in the model are converted into 8-bit integers (INT8). The dynamic range of each tensor (maximum value Max, minimum value Min) is measured, and a linear mapping between FP32 and INT8 is established: INT8 = round((FP32 - Min) / (Max - Min) * 255), which maps each value to an integer code between 0 and 255. During inference, the INT8 values are used directly for calculation, which greatly improves computational efficiency, while the per-tensor dynamic range statistics preserve quantization accuracy.
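The linear mapping in the formula, together with the inverse mapping needed to interpret the codes, can be written out as a short sketch. The function names are hypothetical, and the code works on plain arrays rather than on a real inference runtime. Note that the formula as stated produces codes from 0 to 255, which is the unsigned 8-bit range; a signed INT8 representation would shift the codes by 128.

```kotlin
import kotlin.math.roundToInt

// Maps a value in [min, max] to an integer code in 0..255 using the stated formula.
fun quantize(x: Float, min: Float, max: Float): Int {
    require(max > min) { "max must exceed min" }
    val scaled = (x - min) / (max - min) * 255f
    return scaled.roundToInt().coerceIn(0, 255)
}

// Inverse mapping: recovers an approximate floating-point value from a code.
fun dequantize(q: Int, min: Float, max: Float): Float = min + q / 255f * (max - min)

// Quantizes a whole tensor using its own dynamic range and returns (codes, min, max).
fun quantizeTensor(values: FloatArray): Triple<IntArray, Float, Float> {
    val min = values.minOrNull() ?: error("empty tensor")
    val max = values.maxOrNull() ?: error("empty tensor")
    return Triple(IntArray(values.size) { quantize(values[it], min, max) }, min, max)
}
```

The round-trip error of a single value is bounded by half a quantization step, (Max - Min) / 510, which is why a narrow dynamic range per tensor keeps the accuracy loss small.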

[0029] Specific steps:
① Quantization calibration: Select 10 hours of representative speech data (covering various dialect types and field noise scenarios) as the calibration set, input it into the pruned model, and calculate the dynamic range (Max, Min) of the weights and activation values of each layer;
② Weight quantization: Based on the dynamic range obtained from calibration, quantize all weights of the model from FP32 to INT8, generate a quantized weight file, and store the Max and Min parameters of each layer for inverse quantization during inference;
③ Dynamic quantization of activation values: During the inference stage, the dynamic range of the input activation values is calculated in real time and quantized, without the need to pre-store the activation value quantization parameters, adapting to different speech input scenarios;
④ Inference optimization: Call the INT8 inference interface of TensorFlow Lite on the Android system to achieve efficient inference of the quantized model;
⑤ Performance verification: After quantization, the model memory usage is reduced from 180MB after pruning to 65MB, the inference latency is reduced from 168 milliseconds after pruning to 142 milliseconds, and the core agricultural vocabulary recognition accuracy loss is ≤0.5%, fully meeting the deployment requirements of AR glasses on mobile devices.

[0030] Performance metrics: Validation results on a 50-hour independent test set (not used for model training) show an overall dialect recognition accuracy of > 85%, a recognition accuracy of > 92% for core agricultural vocabulary (including full names and abbreviations), and a final model size of approximately 49MB, fully meeting the deployment requirements of mobile / embedded devices.

[0031] Step 4: Model Validation. Test environment. Simulated environment: A controllable test scenario is constructed in a professional anechoic chamber. Field noise is played back at different signal-to-noise ratios to simulate the full range of acoustic environments, from extremely quiet to noisy, for precisely controlled tests.

[0032] Realistic Environment: Typical field plots in the three major rice-producing areas of Guangxi, Sichuan and Hunan were selected as test sites, covering various natural weather conditions such as sunny, light wind and light rain, to recreate the real agricultural survey operation environment.

[0033] Hardware: It adopts the mainstream Android AR glasses platform on the market, with a core configuration of Qualcomm Snapdragon XR2 processor and 8GB of RAM, which meets the portability needs of field workers.

[0034] Test methods and results: The core of the test focused on the entire process of identifying and processing compound commands of "pests and diseases + quantity". A total of 2,000 valid test commands were collected (such as "3 white rice stem borers" and "10 rice leaf rollers"). The key test indicators and results are as follows: Success rate: defined as the percentage of times the system correctly identifies the entities and quantities of pests and diseases and successfully generates database records. Test results show an overall success rate of 88.5%; a success rate > 95% in a quiet indoor environment; and a success rate > 85% in a noisy field environment, verifying the system's stable processing capability in complex environments.

[0035] Scene switching effect: A dynamic grammar switching mechanism was designed for the two core scenarios of login and survey. In the login scenario, a simplified numeric grammar library is enabled. Tests show that numeric recognition accuracy reaches 98.2% in this scenario, verifying that the dynamic grammar switching strategy is effective and improves recognition accuracy across scenarios.

[0036] Average response time: The measured time from the moment the user finishes speaking a command to the moment the AR interface displays confirmation feedback (such as "Recorded: xxx") is 142 milliseconds. This value is far below the human perception latency threshold of 300 milliseconds, ensuring a smooth and lag-free user interaction experience.

[0037] Power consumption test: In continuous voice survey mode, the AR glasses can support more than 4 hours of continuous work on a full charge, which fully covers the time requirements of a single field operation and meets the battery life requirements of field operations.

[0038] The following code snippet demonstrates the core steps of integrating Vosk, configuring grammars, and parsing results in an Android application.

[0039]

```kotlin
// Sample code: Vosk speech recognition and data entry process.
// IAsrCore, AsrResultHandler, ChineseNumberUtils, start(), numbersRemove and numbersAdd
// are application-specific and defined elsewhere in the project (not shown).
// SpeechService is used exactly as in the source (one-argument constructor and a
// SAMPLE_RATE constant); its import is omitted because the source does not show
// which class provides these members.
import android.content.Context
import org.json.JSONArray
import org.vosk.Model
import org.vosk.Recognizer
import org.vosk.android.StorageService

class AsrVosk(
    private val mContext: Context,
    private val asrResultHandler: AsrResultHandler
) : IAsrCore {
    private val model: Model
    private var speechService: SpeechService

    init {
        // 1. Initialize the Vosk recognizer and load the lightweight model
        val outputPath = StorageService.sync(mContext, "models/vosk-cn-1.0", "vosk-models")
        model = Model(outputPath)
        // 2. Create a recognizer and set the agricultural survey grammar
        speechService = SpeechService(Recognizer(model, SpeechService.SAMPLE_RATE.toFloat()).apply {
            setMaxAlternatives(1)           // Get the best result
            setWords(false)                 // Optimize performance
            setGrammar(getDefaultGrammar()) // Application-specific grammar
        })
    }

    // 3. Dynamic grammar switching method
    override fun switchVoiceRecognizer(recognizerMode: String) {
        when (recognizerMode) {
            "default" -> {
                // Reconfigure for the agricultural survey grammar
                speechService.stop()
                speechService.shutdown()
                speechService = SpeechService(Recognizer(model, SpeechService.SAMPLE_RATE.toFloat()).apply {
                    setGrammar(getDefaultGrammar()) // Loads an agricultural grammar containing 500+ words
                })
                start()
            }
            "login" -> {
                // Switch to the login-specific grammar
                speechService.stop()
                speechService.shutdown()
                speechService = SpeechService(Recognizer(model, SpeechService.SAMPLE_RATE.toFloat()).apply {
                    setGrammar(getLoginGrammar()) // Loads the login grammar containing only numbers
                })
                start()
            }
        }
    }

    // 4. Agricultural survey grammar generation method
    private fun getDefaultGrammar(): String {
        val baseGrammar = arrayListOf(
            "White-backed Flying Master Long-winged Adult",
            "White-backed Flying General with Long Wings",
            "Bai Chang", // Professional abbreviation mapping
            "Brown-winged short-winged adult",
            "Brown Flyer Shortwing",
            "Brown Shorthair", // Professional abbreviation mapping
            "Photograph", "Investigation concluded"
            // ... Approximately 500 core agricultural terms
        )
        val reviseData = arrayListOf(
            "one head", "one plant", "one ear of grain",
            "Two heads", "Two stalks", "Two ears"
            // ... Complete combination of quantity units
        )
        // 5. Dynamic construction and optimization of grammar lists
        val numbersGrammar = arrayListOf<String>()
        for (num in 0..1000) {
            numbersGrammar.add(ChineseNumberUtils.numberToChinese(num.toDouble()))
        }
        // Remove easily confused combinations and add separate versions
        numbersGrammar.removeAll(numbersRemove.toSet())
        numbersGrammar.addAll(numbersAdd)
        val grammarList = numbersGrammar + baseGrammar + reviseData
        return JSONArray(grammarList).toString()
    }

    // 6. Login grammar generation method
    private fun getLoginGrammar(): String {
        val grammarList = arrayListOf(
            "zero", "one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine", "Scan QR code to log in"
        )
        return JSONArray(grammarList).toString()
    }
}
```

Key points of the code: Lightweight model deployment: Using a Vosk Chinese model of approximately 50MB, its size and computational requirements are matched to the hardware resources of mobile AR devices.

[0040] Dynamic grammar management: The switchVoiceRecognizer method responds to scene changes and enables hot switching of grammar libraries to improve recognition accuracy.

[0041] Agricultural terminology processing: The grammar library built by the getDefaultGrammar() method contains full-name-abbreviation mappings and supports natural interaction.

[0042] Quantifier optimization: By combining programmatic generation with manual optimization, we construct an accurate grammar for recognizing quantity units.

[0043] Layered grammar architecture: It adopts a layered structure of numbersGrammar + baseGrammar + reviseData, which is easy to maintain and extend.

[0044] This invention also provides an offline voice method for field crop surveys based on AR wearable devices, applied to the offline voice system, the method comprising the following steps:
S1. The user wears an AR wearable device and launches a crop survey application locally on the AR wearable device, loading a pre-installed offline speech recognition engine; the dynamic grammar management module loads the corresponding grammar library file from local storage according to the current running scenario of the application, so as to limit the vocabulary search range of speech recognition.
S2. The user's voice input is received through the microphone integrated into the AR wearable device.
S3. The lightweight offline speech recognition module recognizes the speech input using the current grammar library to obtain preliminary text; the instruction parsing and data generation module parses the preliminary text and extracts the target pest and disease entities and quantity attribute parameters; based on the parsing results, a structured survey record is automatically generated and saved to the local database.
S4. The AR rendering module overlays operation confirmation information on the display.

[0045] Furthermore, the operating scenarios described in S1 include a user login scenario and a field survey scenario, for which the login grammar library or the agricultural survey grammar library is loaded, respectively.

[0046] Furthermore, during the recognition process in S3, a predefined mapping table of homophones and near-homophones is used to perform error-tolerant processing on the recognition results.

[0047] Furthermore, in S3, when parsing the initial text, the identified abbreviations of pests and diseases are converted into their corresponding standard full names through a predefined mapping relationship, and the standard full names are then used to fill the structured survey record.

[0048] Due to the adoption of the above technical solution, the technical progress achieved by this invention compared to the prior art is as follows: 1. It can run independently offline. By deploying a lightweight offline speech recognition model locally, it can complete speech recognition and data recording without network support, making it suitable for remote fields without network environments.

[0049] 2. High accuracy in recognizing agricultural dialects: Through customized dialect model training and homophone tolerance mechanism, it achieves accurate recognition of dialects in major agricultural areas such as Guangxi dialect and Sichuan dialect, improving the robustness of recognition of various local dialects, with a core agricultural vocabulary recognition accuracy of >92%.

[0050] 3. Excellent hardware compatibility: The 50MB lightweight model is compatible with mainstream consumer-grade AR wearable devices, achieving low latency response without the need for a high-performance processor, with an average latency of 142 milliseconds. It can work continuously for more than 4 hours, meeting the needs of field operations.

[0051] 4. Automated data entry: The system supports completing a structured data record in a single natural-language utterance, realizing "what is said is what is recorded", greatly reducing manual operation, improving survey efficiency and data accuracy, with an overall processing success rate of 85%.

[0052] 5. Dynamic scene adaptation enhances user experience. Through a dynamic switching mechanism across multiple grammar libraries, the system can provide the best recognition effect in different usage scenarios, greatly improving the recognition rate in login scenarios, significantly reducing the probability of misoperation, and providing an efficient and intuitive interactive experience. The system fits the on-site working mode of investigators who "see, read, and take notes" while working, reducing operation steps and cognitive load, and improving interaction efficiency.

Description of the Drawings

[0053] Figure 1 is a schematic diagram of the system architecture used in the embodiment; Figure 2 is a flowchart of the workflow of the offline voice system in the embodiment; Figure 3 is a schematic diagram of the offline speech model training and optimization of the present invention.

Detailed Implementation

[0054] The present invention will be further described in detail below with reference to embodiments. As shown in Figure 1, an offline voice system for field crop surveys based on AR wearable devices includes an AR wearable device with a mobile operating system, an offline voice recognition engine, and a crop survey application. In this embodiment, the AR wearable device is a pair of AR glasses, which integrate a Qualcomm Snapdragon XR2 processor, 8GB of high-speed memory, 1TB of UFS3.1 flash memory, a 48MP camera, a high-sensitivity microphone, and an OLED near-eye display. The mobile operating system is Android 12. The crop survey application is developed with Android Studio and integrates an offline voice recognition engine, a local database module, an AR rendering module, and a GPS positioning module.

[0055] The offline speech recognition engine includes: a lightweight offline speech recognition module, 49MB in size, obtained by adaptively training an offline speech recognition model using speech data from the agricultural field (the training and optimization process is shown in Figure 3), and used for speech recognition in environments without a network connection; a dynamic grammar management module, used to maintain and manage multiple scenario-based grammar library files and to dynamically switch between loading different grammar library files according to the application's state, the grammar library files containing a limited set of vocabulary; and an instruction parsing and data generation module, used to parse the recognition results of the lightweight offline speech recognition module, extract the target pest and disease entity and quantity attribute parameters from the recognition results, and automatically generate a structured survey record containing the full standard name, quantity, timestamp and location information of the pest and disease.

[0056] The dynamic grammar management module also includes a fault-tolerance module for homophones and near-homophones. The fault-tolerance module is used to maintain a mapping table of homophones and near-homophones and to configure multiple equivalent recognition sequences for standard entries in the grammar library file.
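The fault-tolerance lookup can be sketched as a table from recognized variants to standard entries. The sketch below is illustrative only: the table entries are invented English stand-ins for dialect homophones, and the normalize function is a hypothetical helper, not the mapStandardWord() method of the embodiment.

```python
# Hypothetical mapping table: recognized variant -> standard grammar entry.
HOMOPHONE_TABLE = {
    "white lung": "white long",
    "wight long": "white long",
}


def normalize(recognized: str, table: dict[str, str] = HOMOPHONE_TABLE) -> str:
    """Return the standard entry for a recognized phrase, or the phrase itself."""
    key = recognized.strip().lower()
    # Unknown phrases pass through unchanged so that exact matches still work.
    return table.get(key, key)
```

For example, normalize(" White Lung ") yields "white long", while a phrase that is absent from the table, such as "confirm", is returned unchanged.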

[0057] The grammar library files include a login grammar library and an agricultural survey grammar library; the login grammar library contains vocabulary for numbers and basic login commands; the agricultural survey grammar library contains 1200 full names of agricultural pests and diseases, industry-recognized abbreviations, survey action commands, and unit-of-quantity vocabulary.

[0058] The instruction parsing and data generation module supports compound voice instruction structures including [abbreviation of pest / disease] + [quantity] + [unit] and [construction operation] + [location] + [object].
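As an illustration of the [abbreviation of pest / disease] + [quantity] + [unit] structure, the following sketch parses recognized text with a regular expression. The unit list, the Instruction type and the parse_instruction function are assumptions made for this example; the specification does not state how parsing is implemented.

```python
import re
from typing import NamedTuple, Optional


class Instruction(NamedTuple):
    abbreviation: str
    quantity: int
    unit: str


# Hypothetical unit vocabulary; the text only says units are part of the grammar.
UNITS = ("heads", "plants", "leaves")
PATTERN = re.compile(
    r"^\s*(?P<abbr>.+?)\s*,?\s*(?P<qty>\d+)\s*(?P<unit>" + "|".join(UNITS) + r")\s*$",
    re.IGNORECASE,
)


def parse_instruction(text: str) -> Optional[Instruction]:
    """Split '<abbreviation>, <quantity> <unit>' into its three parts."""
    match = PATTERN.match(text)
    if match is None:
        return None  # Text does not follow the compound structure.
    return Instruction(match["abbr"].lower(), int(match["qty"]), match["unit"].lower())
```

With these assumptions, parse_instruction("white long, 5 heads") returns the abbreviation "white long", the quantity 5 and the unit "heads"; text that does not fit the structure returns None.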

[0059] After the instruction parsing and data generation module generates the survey record, the AR rendering module overlays confirmation information related to the survey record onto the display.

[0060] This embodiment takes the Guangxi dialect as an example. As shown in Figure 2, an offline voice method for field crop surveys based on AR wearable devices, applying the aforementioned system, includes the following steps: S1. The user wears the AR wearable device, launches the crop survey application locally on the AR wearable device, and loads the pre-installed offline speech recognition engine; the model is loaded into memory through the loadModel() method and occupies approximately 65MB of memory. The dynamic grammar management module loads the corresponding grammar library file from local storage based on the current running scenario of the application to limit the vocabulary search range for speech recognition. The grammar file is written in JSGF format, and the recognition results are mapped to standard words using the mapStandardWord() method. The system maintains two core grammar libraries: an agricultural survey grammar library and a login grammar library (containing the digits 0-9 and the "confirm" and "cancel" commands). The switchGrammar() method switches between these libraries in real time based on the application status (login / survey), with a switching response time of less than 10 milliseconds.

[0061] S2. The user's voice input is received via the microphone integrated into the AR wearable device. After discovering pests and diseases, the user can directly say a compound command in the local dialect: "White long, 5 heads".

[0062] S3. The lightweight offline speech recognition module recognizes the speech input using the current grammar library to obtain preliminary text; the instruction parsing and data generation module parses the preliminary text and extracts the target pest and disease entities and quantity attribute parameters; based on the parsing results, it automatically generates a structured survey record and saves it to the local database. When the command "white long, 5 heads" is recognized, the system automatically maps "white long" to "white-backed planthopper long-winged adult", extracts the quantity "5", calls the GPS module to obtain the current location, records the system time, and generates a structured record. Optionally, the camera can be called to take photos of the scene using the camera.takePicture() method, and the photo path is stored in association with the record.

[0063] S4. The AR rendering module overlays operation confirmation information on the display.

[0064] The display shows the confirmation message "Recorded: 5 long-winged adult white-backed planthoppers", completing one data record.

[0065] The running scenarios in S1 include a user login scenario and a field survey scenario, and the corresponding login grammar library or agricultural survey grammar library is loaded.
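The scenario-based switching can be sketched as follows. This is a Python analogue written for illustration, not the switchGrammar() method of the embodiment: the GrammarManager class, the to_jsgf helper and the phrase lists are assumptions, and only the JSGF header and rule syntax follow the JSGF format itself.

```python
def to_jsgf(name: str, phrases: list[str]) -> str:
    """Render a flat phrase list as a one-rule JSGF grammar."""
    alternatives = " | ".join(phrases)
    return f"#JSGF V1.0;\ngrammar {name};\npublic <command> = {alternatives};\n"


class GrammarManager:
    """Holds one pre-rendered grammar per application state (hypothetical)."""

    def __init__(self, grammars: dict[str, list[str]]):
        # Render every grammar once so that switching is only a lookup.
        self._jsgf = {state: to_jsgf(state, words) for state, words in grammars.items()}
        self.current_state = None
        self.current_grammar = None

    def switch_grammar(self, state: str) -> str:
        if state not in self._jsgf:
            raise KeyError(f"no grammar for state: {state}")
        self.current_state = state
        self.current_grammar = self._jsgf[state]
        return self.current_grammar


manager = GrammarManager({
    "login": [str(d) for d in range(10)] + ["confirm", "cancel"],
    "survey": ["white long", "heads"],  # Placeholder survey vocabulary.
})
login_grammar = manager.switch_grammar("login")
```

Rendering the grammars ahead of time is one plausible way to keep the switch itself cheap; the specification does not say how its sub-10-millisecond switching time is obtained.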

[0066] In S3, a predefined mapping table of homophones and near-homophones is used to perform error-tolerant processing on the recognition results.

[0067] In S3, when parsing the preliminary text, the recognized abbreviations of pests and diseases are converted into their corresponding standard full names through a predefined mapping relationship, and the standard full names are then used to fill the structured survey records.
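The conversion and record generation can be sketched with a small data class and a local SQLite table. The field names, the table schema, the make_record and save_record helpers, and the use of SQLite are all assumptions for illustration; the specification states only that the record contains the standard full name, quantity, timestamp and location, and that it is saved to a local database.

```python
import sqlite3
from dataclasses import astuple, dataclass
from datetime import datetime, timezone
from typing import Optional

# Hypothetical abbreviation table; the single entry mirrors the example in S3.
FULL_NAMES = {"white long": "white-backed planthopper long-winged adult"}


@dataclass
class SurveyRecord:
    full_name: str
    quantity: int
    timestamp: str
    latitude: float
    longitude: float
    photo_path: Optional[str] = None


def make_record(abbreviation, quantity, location, photo_path=None) -> SurveyRecord:
    """Fill a record with the standard full name, time and position."""
    full_name = FULL_NAMES[abbreviation]  # Raises KeyError for unknown abbreviations.
    timestamp = datetime.now(timezone.utc).isoformat()
    return SurveyRecord(full_name, quantity, timestamp, location[0], location[1], photo_path)


def save_record(conn: sqlite3.Connection, record: SurveyRecord) -> None:
    """Append the record to a local table, creating the table if needed."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS survey (full_name TEXT, quantity INTEGER, "
        "timestamp TEXT, latitude REAL, longitude REAL, photo_path TEXT)"
    )
    conn.execute("INSERT INTO survey VALUES (?, ?, ?, ?, ?, ?)", astuple(record))
    conn.commit()
```

A caller would pass the parsed abbreviation and quantity together with the coordinates obtained from the GPS module, for example make_record("white long", 5, (22.8, 108.3)), where the coordinates are placeholders.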

[0068] The performance of this embodiment is as follows. Recognition accuracy: the overall dialect recognition accuracy was 86.2%, and the core agricultural vocabulary recognition accuracy was 92.4%. Overall processing success rate: 85%. Response time: 142 milliseconds on average. Power consumption: in continuous survey mode, a full charge provides 286 minutes of battery life, which meets the needs of a single field operation.
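As a quick cross-check of the battery figure against the "more than 4 hours" stated among the technical effects, the conversion can be written out; the variable names are illustrative only.

```python
battery_minutes = 286  # Battery life reported for continuous survey mode.

# Convert to whole hours and remaining minutes.
hours, minutes = divmod(battery_minutes, 60)

# Compare with the 4-hour figure given in the technical effects.
meets_four_hours = battery_minutes > 4 * 60
```

The 286 minutes correspond to 4 hours 46 minutes, which is consistent with the stated continuous working time of more than 4 hours.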

[0069] It should be noted that, in the description of this disclosure, unless otherwise expressly specified and limited, the terms "installation," "connection," and "joining" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components. Those skilled in the art can understand the specific meaning of the above terms in this disclosure according to the specific circumstances.

[0070] The present invention has been described in detail above. However, modifications or improvements can be made to it, which will be obvious to those skilled in the art. Therefore, any modifications or improvements that do not depart from the spirit of the present invention are within the scope of protection of the present invention.

Claims

1. An offline voice system for field crop surveys based on AR wearable devices, characterized in that it includes an AR wearable device running a mobile operating system, an offline speech recognition engine, and a crop survey application; the AR wearable device integrates a processor, memory, microphone, camera, display, and AR rendering module; the offline speech recognition engine includes: a lightweight offline speech recognition module, less than 50MB in size, obtained by adaptive training of an offline speech recognition model using speech data from the agricultural field, and used for speech recognition in environments without a network; a dynamic grammar management module, used to maintain and manage multiple scenario-based grammar library files and to dynamically switch between loading different grammar library files according to the state of the application, the grammar library files containing a limited vocabulary set; and an instruction parsing and data generation module, used to parse the recognition results of the lightweight offline speech recognition module, extract the target pest and disease entity and quantity attribute parameters from the recognition results, and automatically generate a structured survey record containing the full standard name, quantity, timestamp and location information of the pest and disease.

2. The offline voice system for field crop surveys based on AR wearable devices according to claim 1, characterized in that the offline speech recognition model is based on the Vosk and Kaldi frameworks and has a file size of less than 100MB, and the speech data includes agricultural vocabulary from various local dialects and background noise from the field environment.

3. The offline voice system for field crop surveys based on AR wearable devices according to claim 1, characterized in that the dynamic grammar management module also includes a fault-tolerance module for homophones and near-homophones, the fault-tolerance module being used to maintain a mapping table of homophones and near-homophones and to configure multiple equivalent recognition sequences for standard entries in the grammar library file.

4. The offline voice system for field crop surveys based on AR wearable devices according to claim 1, characterized in that the grammar library files include a login grammar library and an agricultural survey grammar library; the login grammar library contains numbers and basic login command vocabulary; the agricultural survey grammar library contains 500-2000 full names of agricultural pests and diseases, industry-recognized abbreviations, survey action commands, and quantity unit vocabulary.

5. The offline voice system for field crop surveys based on AR wearable devices according to claim 1, characterized in that the instruction parsing and data generation module supports compound voice instruction structures including [abbreviation of pest / disease] + [quantity] + [unit] and [construction operation] + [location] + [object].

6. The offline voice system for field crop surveys based on AR wearable devices according to claim 1, characterized in that, after the instruction parsing and data generation module generates the survey record, the AR rendering module overlays confirmation information related to the survey record onto the display.

7. An offline voice method for field crop surveys based on AR wearable devices, characterized in that the method, applied to the offline voice system for field crop surveys based on AR wearable devices as described in any one of claims 1 to 6, comprises the following steps: S1. The user wears an AR wearable device and launches a crop survey application locally on the AR wearable device, loading a pre-installed offline speech recognition engine; the dynamic grammar management module loads the corresponding grammar library file from local storage according to the current running scenario of the application, so as to limit the vocabulary search range of speech recognition. S2. The user's voice input is received through the microphone integrated into the AR wearable device. S3. The lightweight offline speech recognition module recognizes the speech input using the current grammar library to obtain preliminary text; the instruction parsing and data generation module parses the preliminary text and extracts the target pest and disease entity and quantity attribute parameters; based on the parsing results, a structured survey record is automatically generated and saved to the local database. S4. The AR rendering module overlays operation confirmation information on the display.

8. The offline voice method for field crop surveys based on AR wearable devices according to claim 7, characterized in that the running scenarios described in S1 include a user login scenario and a field survey scenario, and the corresponding login grammar library or agricultural survey grammar library is loaded.

9. The offline voice method for field crop surveys based on AR wearable devices according to claim 7, characterized in that, in S3, a predefined mapping table of homophones and near-homophones is used to perform error-tolerant processing on the recognition results.

10. The offline voice method for field crop surveys based on AR wearable devices according to claim 7, characterized in that, in S3, when parsing the preliminary text, the recognized abbreviations of pests and diseases are converted into their corresponding standard full names through a predefined mapping relationship, and the standard full names are then used to fill the structured survey record.