20 results about "Speech transcription" patented technology

Transcription (linguistics): the representation of speech or signing in written form. Orthographic transcription: a transcription method that employs the standard spelling system of each target language. Phonetic transcription: the representation of specific speech sounds or sign components.

Speech-to-speech translation

Status: Pending | Publication No.: US20260154515A1 | Tags: Natural language translation; Sound input/output; Speech to speech translation; Speech translation

A speech-to-speech translation method comprises transcribing speech spoken in a source language into transcribed text data in the source language using an on-premises speech recognition model. The transcribed text data is translated into translated text data in a target language using a first on-premises machine translation model. The translated text data is reverse translated into retranslated text data in the source language using a second, different on-premises machine translation model. The transcribed text and the retranslated text are displayed on a screen. The method also involves synthesizing, using an on-premises speech synthesis model, translated speech data in the target language based on the translated text data, and playing back, in response to a user confirmation, translated speech in the target language based on the translated speech data.
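The round-trip check described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: `round_trip`, `speak_if_confirmed`, and the toy dictionary "models" are hypothetical names standing in for the on-premises recognition, translation, and synthesis models.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class RoundTrip:
    transcribed: str    # source-language text from speech recognition
    translated: str     # target-language text from the first MT model
    retranslated: str   # source-language text from the second MT model


def round_trip(audio: bytes,
               recognize: Callable[[bytes], str],
               translate: Callable[[str], str],
               back_translate: Callable[[str], str]) -> RoundTrip:
    """Transcribe, translate, then reverse translate with a different model."""
    text = recognize(audio)
    translated = translate(text)
    return RoundTrip(text, translated, back_translate(translated))


def speak_if_confirmed(result: RoundTrip,
                       synthesize: Callable[[str], bytes],
                       confirmed: bool) -> Optional[bytes]:
    """Synthesize target-language speech, but release it only after confirmation."""
    speech = synthesize(result.translated)
    return speech if confirmed else None


# Toy stand-ins (assumptions for illustration only).
_FORWARD = {"hello": "hej"}
_BACKWARD = {"hej": "hello"}
demo = round_trip(b"raw-audio",
                  recognize=lambda audio: "hello",
                  translate=lambda text: _FORWARD[text],
                  back_translate=lambda text: _BACKWARD[text])
```

Showing `demo.transcribed` next to `demo.retranslated` lets the speaker judge whether the meaning survived translation before any audio is played.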

Owner:MABEL AI AB

A portable sound amplification and speech transcription system

Status: Pending | Publication No.: CN122266383A | Tags: Adaptive network; Speech recognition; Sound sources; Feature extraction speech recognition

The application relates to the technical field of audio signal processing and speech recognition, and discloses a portable sound amplification and speech transcription system which comprises a wind noise sensing and separating module, a beam forming processing module, a dual-path processing module, a speech recognition module, a vocabulary correction module, a power consumption management module and an interactive management module.

Owner:LONGQINGWEI (SHANGHAI) INTELLIGENT TECHNOLOGY CO LTD

Speech processing software graphical user interface for electronic device

Status: Active | Publication No.: CN310018629S | Tags: Graphical user interface; Engineering

1. Name of the product subject to this design: Graphical User Interface for Voice Processing Software of Electronic Device.
2. Intended use of this design: for use in an electronic device.
3. The key design feature of this product is its graphical user interface.
4. The image or photograph that best illustrates the design's key features: the front view.
5. Purpose of the graphical user interface: it is used for speech processing operations and content display, such as speech recognition, speech transcription, and speech content translation.
6. Human-computer interaction method of the graphical user interface: the main view is the main interface of the speech processing software. In the main view, after the user clicks the "Real-time Recording", "File Import", "Simultaneous Interpretation" or "Dialogue Translation" tab at the top of the interface, they are redirected to the corresponding function interface. After the user clicks the "Simultaneous Interpretation" tab, the status change diagram is displayed. The status change diagram is the simultaneous interpretation interface, which displays the simultaneous interpretation records. After the user clicks the "English to Chinese" or "Chinese to English" tab at the top of the interface, the corresponding simultaneous interpretation function is activated.
7. Other situations requiring explanation: in each view, the letter "X" represents areas of text, numbers, or symbols. The characters represented by the letter "X" are replaceable, and the number of letters "X" does not limit the number of characters that can be replaced.

Owner:HANVON CORP

Intelligent video editor for creating non-linear editing timeline

Status: Pending | Publication No.: US20260148754A1 | Tags: Electronic editing digitised analogue information signals; Using detectable carrier information; Personalization; Social media

An automated video editing system facilitates the creation of non-linear editing (NLE) timelines using a single prompt from a user. The system automatically ingests digital media, including video, audio, text, and images, and processes them to generate a proxy version with extracted features such as speech transcription, shot detection, facial recognition, and text recognition. A prompt-driven editing engine interprets user input and generates an edit decision list (EDL) using a large language model, which guides the assembly of an edited video timeline. The system also applies advanced editing features (such as captioning, animated title cards, font and color styling, sound effects, and transitions) based on learned user preferences. Additionally, it enables contextual overlays, chapter cards, and hierarchical timelines, while continuously learning user preferences to personalize editing results. The system may be integrated with social media platforms and existing media libraries for content sourcing, customization, and automated publishing.

Owner:OPEN VIDEO EXPLORATION INC

A Depression Detection Method Based on Instruction Fine-tuning Multimodal Speech-Language Model

Status: Pending | Publication No.: CN122090880A | Tags: Overcoming underutilization; fully excavated; Speech recognition; Speech input; Speech sound

This invention proposes a method for depression detection based on a multimodal speech-language model with instruction fine-tuning. The steps are as follows: First, a multimodal depression instruction dataset is constructed, integrating three types of information—original audio, automatically transcribed text, and emotional descriptions—into structured instruction-response samples. The emotional descriptions are automatically generated using a dedicated large-scale emotional inference model. Next, a low-rank adaptation technique is employed to efficiently fine-tune the parameters of the multimodal speech-language model. Audio features are extracted through an audio encoder and mapped to the text embedding space, then fused with the text and emotional description embeddings in a multimodal manner. The model parameters are optimized using a cross-entropy loss function. Finally, the speech to be detected is input into the fine-tuned model, and the depression recognition result is output. The technical solution provided by this invention enables joint inference based on audio, text, and emotional information, significantly improving the accuracy of depression recognition while offering advantages in parameter efficiency.

Owner:EAST CHINA UNIV OF SCI & TECH

Conference minutes intelligent analysis and position automatic creation system based on speech transcription and semantic understanding

Status: Pending | Publication No.: CN122414149A | Tags: Speech input; Speech sound

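The invention relates to the technical field of artificial intelligence and natural language processing, and discloses a conference minutes intelligent analysis and position automatic creation system based on speech transcription and semantic understanding. The system comprises six core modules linked in sequence: a speech input and transcription layer, a speaker separation layer, a recruitment-domain semantic parsing engine, a structured generation and confidence evaluation layer, a human-machine collaborative review layer, and a position creation execution layer. The invention achieves end-to-end automated processing and removes the disconnect between speech transcription and position creation: it integrates speech transcription, semantic understanding, information extraction, and position creation into one complete workflow, eliminates manual secondary processing, greatly improves the efficiency of information flow in recruitment business, and reduces manual workload.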

Owner:AOYE HUMAN RESOURCES (GUANGDONG) CO LTD

System and method for automatic tagging of images and video in an operative report

Status: Pending | Publication No.: US20260203798A1 | Tags: Operative report; Engineering

Systems and methods for automatic tagging of images and video in surgical streams are described. A plurality of machine learning models, trained on annotated surgical data, are used to extract salient images and video clips from surgical video streams. In addition, speech transcription models process audio streams to generate transcriptions that are then associated with the tagged media. Subsequently, the system synchronizes the multimodal data and generates structured operative records. After synchronization, billing rules are applied to produce accurate billing reports. Applications of the system include improving surgical documentation, reducing administrative burden, enhancing billing accuracy, and accelerating revenue cycles in healthcare environments.

Owner:VAIM TECHNOLOGIES LLC

A low-cost high-accuracy e-commerce goods-carrying video analysis method and system

Status: Pending | Publication No.: CN122269107A | Tags: Realize intelligence; Achieve high efficiency; Biological models; Character and pattern recognition; Algorithm; Documentation

The application discloses a low-cost high-accuracy e-commerce goods-carrying video analysis method and system, and belongs to the technical field of data processing, and comprises the following steps: performing multidimensional availability checking on a to-be-analyzed goods-carrying video, then demultiplexing the video into a video audio stream and a video picture stream; performing streaming voice transcription on the video audio stream to generate a text semantic sequence, performing dynamic analysis on the video picture stream to extract a motion intensity coefficient, dynamically dividing video clips based on the two, and generating picture frame sets corresponding to each semantic segment, and simultaneously performing cross-segment summary aggregation to generate a video overall summary; performing parallel reasoning analysis on the picture frame sets of each segment, the audio text and the overall summary, generating multidimensional text analysis results, structurally processing each dimension analysis result, and integrating and concatenating in time sequence to generate a complete video structured description document, so that low-cost, high-accuracy e-commerce goods-carrying video automatic analysis is realized.

Owner:BEIJING DUODIAN ONLINE SCI & TECH CO LTD

A conference recording processing method and related device

Status: Pending | Publication No.: CN122122661A | Tags: Speech recognition; Conference note; Speech sound

The application provides a conference record processing method and related equipment, which can improve the quality of conference records and reduce the number of erroneous words. The conference record processing method can include: obtaining an original conference record, which is obtained by speech transcription based on target conference audio; obtaining a target text, which is a dialogue text of the original conference record; obtaining a first hot word based on the target text; and correcting part or all of the original conference record based on a target hot word to obtain a target conference record, wherein the target hot word includes the first hot word.
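As a rough illustration of the final correction step, the sketch below replaces near-miss tokens in a transcript with the closest hot word. The abstract does not say how hot words are matched; fuzzy string matching with the standard-library `difflib` module is an assumption made here, and `correct_with_hot_words` is a hypothetical helper name.

```python
import difflib


def correct_with_hot_words(record: str, hot_words: list[str], cutoff: float = 0.8) -> str:
    """Replace each whitespace-separated token that closely resembles a hot word.

    Matching is case-sensitive and ignores punctuation handling; both are
    simplifications for this sketch.
    """
    corrected = []
    for token in record.split():
        match = difflib.get_close_matches(token, hot_words, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else token)
    return " ".join(corrected)


fixed = correct_with_hot_words("the kubernets cluster is ready", ["kubernetes"])
```

A high `cutoff` keeps ordinary words from being overwritten; lowering it corrects more aggressively at the cost of false replacements.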

Owner:BOE TECHNOLOGY GROUP CO LTD

A Speaker Recognition and Emotion Perception Method Based on AR Glasses

Status: Pending | Publication No.: CN122135743A | Tags: Speech analysis; Emotion perception; Emotional perception

This invention discloses a speaker recognition and emotion perception method based on AR glasses. The system combines microphone arrays, cloud computing, and multi-model fusion technology to achieve real-time recognition and display of speaker identity information, speech content, and emotional state. The method first acquires multi-channel speech signals through a microphone array and improves speech quality using preprocessing techniques such as beamforming, noise reduction, speech enhancement, and voice activity detection (VAD). Subsequently, the system runs an ECAPA-TDNN voiceprint recognition model in the cloud to obtain speaker embedding vectors. Through acoustic features such as MFCC, Fbank, and LPCC, and an AAM-Softmax training mechanism, it identifies gender, age, and speaker identity. A CTC end-to-end speech recognition model is used to complete real-time speech transcription. Finally, an emotion recognition model based on Wav2Vec 2.0 and Transformer, combined with MFCC features, emotion embedding, and attention-weighted fusion strategies, achieves high-precision recognition of multiple emotions such as "happy, angry, sad, and calm."

Owner:HUZHOU UNIVERSITY

A speech transcription error correction method based on context memory pool

Status: Pending | Publication No.: CN122266363A | Tags: Semantic analysis; Speech recognition; Semantic vector; Speech sound

The application discloses a speech transcription error correction method based on a context memory pool, and relates to the technical field of speech transcription and natural language processing. The method first preprocesses input audio, generates initial transcription text through a speech recognition model, and calculates confidence. Then, a context memory pool is constructed to store and manage historical transcription text, semantic vectors, timestamps, confidence and proper nouns in a structured manner. The memory pool is dynamically updated during transcription, and a context correction set is constructed based on the context memory pool. A language model is used to generate multiple correction candidate texts, and the best one is selected to correct the transcription text as the final transcription result and written back to the context memory pool. The application can fully utilize historical context information, improve the accuracy and consistency of speech transcription results, and is suitable for conference recording, voice interaction and voice content analysis scenarios.
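One way to picture the memory pool is as a bounded list of structured entries that is consulted when ranking correction candidates. The sketch below is a simplification under stated assumptions: `Entry`, `MemoryPool`, and the proper-noun overlap score are hypothetical, semantic vectors are omitted, and the candidates would come from a language model rather than a hard-coded list.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    text: str
    timestamp: float
    confidence: float
    proper_nouns: set[str] = field(default_factory=set)


class MemoryPool:
    def __init__(self, max_entries: int = 50):
        self.entries: list[Entry] = []
        self.max_entries = max_entries

    def write_back(self, entry: Entry) -> None:
        """Store a finalized transcription and drop the oldest entries."""
        self.entries.append(entry)
        self.entries = self.entries[-self.max_entries:]

    def known_nouns(self) -> set[str]:
        return set().union(*(e.proper_nouns for e in self.entries))

    def pick_best(self, candidates: list[str]) -> str:
        """Prefer the candidate that reuses the most proper nouns seen so far.

        Ties keep the earliest candidate, so the initial transcript should
        be listed first.
        """
        nouns = self.known_nouns()
        return max(candidates, key=lambda c: sum(w in nouns for w in c.split()))


pool = MemoryPool()
pool.write_back(Entry("Alice joined from Hunan", 0.0, 0.95, {"Alice", "Hunan"}))
best = pool.pick_best(["hello alice from who nan", "hello Alice from Hunan"])
```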

Owner:HUNAN UNIV

Multimodal data processing for content query systems and applications

Status: Pending | Publication No.: DE102025152519A1 | Tags: Natural language translation; Video data indexing; Datasheet; Content retrieval

This document describes multimodal data processing for content retrieval systems and applications through various examples. The systems and methods described here can transform different data modalities into a common type of modality. For example, content data representing a video can be separated into audio data, which represents the video's sound (such as speech), and video data, which represents the video's individual frames. The audio data can then be processed using one or more models to generate an initial text that corresponds to a transcription of the speech. Additionally, the video data can be processed to identify specific keyframes that provide important information related to the video. These keyframes can then be processed using one or more models to generate a second text that describes them. The systems and procedures can then combine the text from the different modalities and generate data for storage in one or more databases.

Owner:NVIDIA CORP

Voice command recognition for human-robot communication

Status: Undetermined | Publication No.: DE102025144738A1 | Tags: Natural language understanding; Engineering

Techniques for using verbal commands in human-robot communication are described. The number of tasks the robot can perform is limited to a specific set, while users retain syntactic flexibility in how they phrase a command. The system includes two components: a speech recognizer for speech-to-text conversion and a natural language understanding module that maps the text to a command for the robot. After speech is transcribed into text, a nearest-neighbor classifier can be applied in the high-dimensional space of embedding tokens. Multiple variants of each command are provided in a database of reference embeddings, and the classifier can identify the k nearest reference embedding tokens to determine the command. The text-similarity model allows fast detection solutions to be deployed locally on a robot or other device. Local deployment reduces potential latency caused by a cloud connection, which can be important in many assistant robot applications.
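The nearest-neighbor step can be illustrated with a small sketch. Everything below is an assumption for illustration: the bag-of-words `embed` function stands in for a learned text-embedding model, and the `REFERENCES` table and command labels are invented.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words embedding; a real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Several phrasings per command, as in a database of reference embeddings.
REFERENCES = [
    ("bring me the cup", "FETCH"),
    ("fetch the cup please", "FETCH"),
    ("get the cup", "FETCH"),
    ("stop moving", "STOP"),
    ("halt right now", "STOP"),
    ("please stop", "STOP"),
]


def classify(text: str, k: int = 3) -> str:
    """Return the majority command among the k most similar references."""
    query = embed(text)
    ranked = sorted(REFERENCES,
                    key=lambda ref: cosine(query, embed(ref[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Because the command set is closed, adding a new phrasing only requires appending a reference row; no retraining of a classifier is needed.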

Owner:INTEL CORP

Speech transcription multi-language transcription and switching method

Status: Active | Publication No.: CN116844549B | Tags: quick fix; improve accuracy; Voice communication; Degree of similarity

The present application relates to the technical field of voice communication, and particularly relates to a voice transcription multi-language transcription and switching method. The method comprises the following steps: S1, identifying a language type; S2, selecting a corresponding language engine for translation; S3, calculating a similarity evaluation value to determine whether the voice transcription meets a preset standard; S4, if the standard is met, the process continues; if the standard is not met, synonymous replacement is performed and the translation is re-performed, or the translation mode is adjusted to re-perform the translation; and S5, performing secondary determination on the re-translated result. Compared with the prior art, the present application has the beneficial effect that by calculating the similarity evaluation value and comparing it with the preset standard, the reason why the preset standard is not met can be quickly analyzed, so that the corresponding processing mode can be quickly determined, and the voice transcription process can be quickly adjusted according to the processing mode, thereby improving the accuracy of translation.

Owner:GUANGZHOU BAOLUN ELECTRONICS CO LTD

Data processing, training of a speech synthesis model, speech synthesis method, apparatus, device, readable storage medium and program product

Status: Pending | Publication No.: CN122454953A | Tags: Synthesis methods; Speech synthesis

The application relates to a data processing method, a speech synthesis model training method, a speech synthesis method, a device, a readable storage medium and a program product. The method comprises the following steps: obtaining original audio data; performing speech transcription on the original audio data to obtain original transcription text; identifying paralanguage events in the original audio data based on a multi-modal processing model to generate paralanguage labels, and inserting the paralanguage labels into corresponding positions of the paralanguage events in the original transcription text to obtain target transcription text. The method can improve the robustness of speech synthesis.
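The label insertion step can be pictured as merging two time-stamped sequences. In this sketch the `Word` and `Event` types, the bracketed tag format, and the assumption that both words and events carry start times in seconds are illustrative choices, not details from the application.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio


@dataclass
class Event:
    label: str    # paralanguage event, e.g. "laugh"
    start: float


def insert_labels(words: list[Word], events: list[Event]) -> str:
    """Place each event tag before the first word that starts at or after it."""
    out = []
    pending = sorted(events, key=lambda e: e.start)
    i = 0
    for word in sorted(words, key=lambda w: w.start):
        while i < len(pending) and pending[i].start <= word.start:
            out.append(f"[{pending[i].label}]")
            i += 1
        out.append(word.text)
    # Events after the last word go at the end of the transcript.
    out.extend(f"[{e.label}]" for e in pending[i:])
    return " ".join(out)


target_text = insert_labels(
    [Word("well", 0.0), Word("okay", 1.2)],
    [Event("laugh", 0.8), Event("sigh", 2.0)],
)
```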

Owner:MOORE THREADS TECH CO LTD

Artificial intelligence based natural language interactive data generation method and system

Status: Pending | Publication No.: CN122157652A | Tags: Speech recognition; Algorithm; Adaptive denoising

The application discloses a natural language interactive data generation method and system based on artificial intelligence, and specifically comprises the following steps: receiving a user voice audio stream; performing spectrum transformation and noise suppression on the voice audio stream to generate a denoised mel spectrum feature; performing incremental fine-tuning on a Whisper speech recognition model by using a business scenario voice corpus, and performing speech transcription on the denoised mel spectrum feature to output a transcribed text sequence; performing feature fusion on the transcribed text sequence and the denoised mel spectrum feature to generate a fusion feature vector; generating a set of structured information units based on the fusion feature vector; generating a form filling result according to the set of structured information units and a business scenario identifier; performing PCard conflict discrimination and projection correction in the model training stage to complete model parameter updating; and performing verification on the form filling result to generate standardized form data. The application combines adaptive noise reduction, business fine-tuning and gradient conflict resolution to realize high-precision speech-driven structured data generation.

Owner:YUNHE (JIANGXI) INFORMATION TECHNOLOGY CO LTD

A real-time speech transcription method and system based on multi-modal deep learning

Status: Active | Publication No.: CN121789688B | Tags: Biological models; Speech recognition; Pattern recognition; Environmental noise

The present application relates to the field of multi-modal model data processing, in particular to a real-time speech transcription method and system based on multi-modal deep learning. First, timing alignment of audio and lip video frames is realized through a hardware clock. Second, the audio spectral entropy is extracted as an acoustic aliasing index, and the lip movement displacement rate is extracted with a deep neural network as a lip movement saliency index. Then, based on the cross-modal consistency deviation ratio, dynamic gating weights are generated and the audio and video features are weighted and fused. Finally, the fused features are input into an end-to-end model based on a self-attention mechanism to generate the text. The present application effectively overcomes acoustic interference in multi-speaker overlapping scenes: when audio aliasing is severe, it automatically increases the weight of the visual lip movement feature, which is not affected by environmental noise, greatly improving the robustness and accuracy of transcription in complex conference scenes. At the same time, the pipeline architecture keeps system delay extremely low, meeting real-time interaction requirements.
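To make the gating idea concrete, the sketch below computes a normalized spectral entropy and uses it to weight audio against lip features. The entropy formula is the standard normalized Shannon entropy of a power spectrum. The sigmoid gate in `audio_weight` is a hypothetical stand-in: the abstract does not give the formula for the cross-modal consistency deviation ratio, and `sharpness` is an invented parameter.

```python
import math


def spectral_entropy(power: list[float]) -> float:
    """Normalized Shannon entropy of a power spectrum, in [0, 1].

    Values near 1 indicate a flat, noise-like or heavily overlapped spectrum.
    """
    total = sum(power)
    if total <= 0 or len(power) < 2:
        return 0.0
    probs = [p / total for p in power if p > 0]
    return -sum(p * math.log(p) for p in probs) / math.log(len(power))


def audio_weight(entropy: float, lip_rate: float, sharpness: float = 4.0) -> float:
    """Hypothetical gate: audio weight falls as entropy exceeds lip saliency."""
    return 1.0 / (1.0 + math.exp(sharpness * (entropy - lip_rate)))


def fuse(audio_feat: list[float], video_feat: list[float], w_audio: float) -> list[float]:
    """Convex combination of equally sized audio and video feature vectors."""
    return [w_audio * a + (1.0 - w_audio) * v for a, v in zip(audio_feat, video_feat)]
```

With this gate, a flat spectrum (entropy near 1) pushes most of the weight onto the lip features, which matches the behavior the abstract describes for severe audio aliasing.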

Owner:GUANGZHOU GUANGYOU COMM EQUIP

A multimodal meeting minutes automatic generation method and system

Status: Active | Publication No.: CN121683973B | Tags: Speech recognition; Inference methods; Engineering; Human–computer interaction

The application relates to the technical field of multi-modal data processing, and discloses a multi-modal conference minutes automatic generation method and system, which comprises the following steps: fusing multi-lingual speech transcription and speaker identity alignment information, and constructing a conference speech directed heterogeneous graph; generating a task-dependent simplex based on the conference speech directed heterogeneous graph and an execution semantic anchor point; and performing context-aware minutes structured rendering and intelligent action scheme generation based on the task-dependent simplex. Through the construction of the conference speech directed heterogeneous graph and the task-dependent simplex, the application realizes the automatic generation of multi-modal conference minutes, accurately encodes the correlation between conference tasks, helps the efficient conversion of conference decision-making into actual execution, and improves the execution efficiency of team conference decision-making and the project closed-loop management capability.

Owner:SHAANXI YULIN ENERGY GRP CO LTD

A multi-modal perception driven full-process automated conference management method and management system

Status: Pending | Publication No.: CN122317228A | Tags: Text recognition; Engineering

A multimodal perception-driven, fully automated meeting management method and system, relating to the field of meeting management technology, includes meeting terminal equipment and a meeting management server connected to it. The meeting management server is equipped with virtual machines corresponding to each meeting area and a speech recognition module for extracting speech features based on audio and video data collected by the meeting terminal equipment; an action detection module for acquiring the operation frequency and trajectory features of the USB input device; a projection arbitration module for determining the main speaker virtual machine and outputting it to the projection device; and a data extraction module for synchronizing the projected image to other virtual machines and supporting local storage of the results of capturing, text recognition, and speech transcription based on the synchronized image. This invention solves the problems of easy accidental image switching and rigid collaboration in existing meeting systems through multimodal anti-shake, audiovisual decoupling, and cross-domain routing schemes, realizing non-intrusive secure sharing and seamless flow, and improving the smoothness of the meeting.

Owner:FUJIAN ZHONGKE XINGTAI DATA TECH CO LTD
