Speech recognition method and speech recognition device

The speech recognition method enhances accuracy by dynamically managing speech data accumulation and recognition timing, addressing inefficiencies in conventional systems to improve transcription quality.

JP2026109305A · Pending · Publication Date: 2026-07-01 · SHARP KK

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
SHARP KK
Filing Date
2024-12-19
Publication Date
2026-07-01

AI Technical Summary

Technical Problem

Conventional speech recognition systems often output incorrect speech-to-text conversion results due to inefficiencies in handling speech data accumulation.

Method used

A speech recognition method that includes steps for determining the presence of speech, deciding on data storage termination based on speech presence and duration, and performing recognition only when appropriate data is accumulated.

Benefits of technology

Improves speech-to-text conversion accuracy by ensuring appropriate length of audio data for AI model processing, incorporating temporary speech interruptions, and optimizing data storage timing.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure 2026109305000001_ABST

Abstract

[Problem] To provide a speech recognition method that can improve transcription accuracy (speech recognition accuracy) using an AI model. [Solution] The speech recognition method of the present invention comprises: a speech data storage step of acquiring speech data and storing the acquired speech data as stored speech data; a speech determination step of determining whether or not there is speech in the acquired speech data; a decision step of deciding whether to terminate or continue the storage of the speech data based on the result of the speech determination step and the storage time of the stored speech data; and a speech recognition step of performing speech recognition based on the stored speech data when the storage of the speech data has been terminated and stored speech data exists.

Description

Technical Field

[0001] The present invention relates to a speech recognition method and a speech recognition apparatus.

Background Art

[0002] In a speech recognition system, when performing speech-to-text conversion from speech, an AI model (for example, HMM-DNN method, etc.) for performing speech recognition is used (for example, see Patent Document 1). In such a speech recognition system, a speech section is detected from speech data, and speech recognition processing is performed on the speech data of the speech section using an AI model.

Prior Art Documents

Patent Documents

[0003]

Patent Document 1

Summary of the Invention

Problems to be Solved by the Invention

[0004] However, in a conventional speech recognition system, an incorrect speech-to-text conversion result may be output. The present invention has been made in view of such circumstances, and provides a speech recognition method capable of improving the speech-to-text conversion accuracy (speech recognition accuracy) using an AI model.

Means for Solving the Problems

[0005] The present invention provides a speech recognition method comprising: a speech data storage step of acquiring speech data and storing (recording) the acquired speech data as stored speech data; a speech determination step of determining whether or not there is speech in the acquired speech data; a decision step of determining whether or not to terminate the storage of the speech data or continue the storage of the speech data based on the result of determining whether or not there is speech in the speech determination step and the storage time of the stored speech data; and a speech recognition step of performing speech recognition based on the stored speech data when the storage of the speech data has been terminated and the stored speech data exists. Furthermore, the present invention provides a speech recognition device comprising a control unit having a storage unit, wherein the control unit is configured to store acquired voice data in the storage unit as stored voice data, and to determine whether to terminate or continue storing the voice data based on the result of determining whether or not there is speech in the acquired voice data and the storage time of the stored voice data, and in the case where the storage of voice data has been terminated and stored voice data exists, the control unit is configured to perform speech recognition based on the stored voice data.

Effects of the Invention

[0006] According to the present invention, whether to continue accumulating (recording) the acquired audio data is decided based on the result of determining whether or not there is speech and on the accumulation time of the accumulated audio data, so speech recognition can be performed on audio data of a length appropriate for the AI model. Therefore, the accuracy of transcription using the AI model can be improved.

Brief Description of the Drawings

[0007]
[Figure 1] A flowchart of the speech recognition method in one embodiment of the present invention.
[Figure 2] A flowchart of the speech recognition method in one embodiment of the present invention.
[Figure 3] A block diagram showing the configuration of one embodiment of the speech recognition device.
[Figure 4] An example of an audio waveform when speech is present.

Modes for Carrying Out the Invention

[0008] The speech recognition method of the present invention comprises: a speech data storage step of acquiring speech data and storing the acquired speech data as stored speech data; a speech determination step of determining whether or not there is speech in the acquired speech data; a decision step of determining whether or not to terminate the storage of the speech data or continue the storage of the speech data based on the result of determining whether or not there is speech in the speech determination step and the storage time of the stored speech data; and a speech recognition step of performing speech recognition based on the stored speech data when the storage of the speech data has been terminated and stored speech data exists.

[0009] In the decision step, if the speech determination step determines that there is no speech, the duration of the no-speech state is longer than the first threshold, and the storage time of the stored speech data is longer than the second threshold, the storage of the speech data is preferably terminated. In the decision step, if the speech determination step determines that there is no speech, the duration of the no-speech state is longer than the first threshold, and the storage time of the stored speech data is shorter than the second threshold, the storage of the speech data is preferably continued. In the decision step, if the speech determination step determines that there is no speech and the duration of the no-speech state is shorter than the first threshold, it is preferably determined whether there is stored speech data that was stored before a predetermined time point, and, if such stored speech data exists, the storage of the speech data is preferably continued.

[0010] In the decision step, if the speech determination step determines that there is no speech and the duration of the no-speech state is shorter than the first threshold, it is preferably determined whether there is stored speech data that was stored before a predetermined time point, and, if no such stored speech data exists, the storage of the speech data is preferably terminated. In the decision step, if there is stored speech data that was stored after the predetermined time point, that stored speech data is preferably deleted. In the decision step, if the speech determination step determines that there is speech and the duration of the speech is longer than the third threshold, the storage of the speech data is preferably terminated.

[0011] In the decision step, if the speech determination step determines that there is speech and the duration of the speech is shorter than the third threshold, the storage of the speech data is preferably continued. A cycle including the speech data storage step, the speech determination step, and the decision step is preferably executed repeatedly.
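The preferred rules in [0009] to [0011] can be read as a small decision table. The following Python function is a minimal sketch of that table, not the applicant's implementation. The function name, the parameter names, and the default threshold values (picked from the ranges given later in the description) are assumptions made for illustration. The sketch follows the embodiment in comparing the storage time of the stored data against the third threshold, and it treats a no-speech duration exactly equal to the first threshold as "long", which the text leaves open.

```python
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue storing"
    TERMINATE = "terminate storing"


def decide(has_speech: bool, silence_s: float, stored_s: float,
           has_earlier_data: bool,
           first_s: float = 0.5,    # assumed value within 0.1 to 1.5 s
           second_s: float = 1.0,   # assumed value, "around 1.0 second"
           third_s: float = 30.0    # assumed value within 20 to 40 s
           ) -> Decision:
    """Return the storage decision for one cycle (illustrative only)."""
    if has_speech:
        # Upper limit on stored audio applies even while speech continues.
        if stored_s > third_s:
            return Decision.TERMINATE
        return Decision.CONTINUE
    if silence_s >= first_s:
        # Not a temporary pause: stop only if enough audio has been stored.
        if stored_s > second_s:
            return Decision.TERMINATE
        return Decision.CONTINUE
    # Possibly a temporary pause: keep going only if earlier audio exists.
    if has_earlier_data:
        return Decision.CONTINUE
    return Decision.TERMINATE
```

Keeping the rules in a pure function like this makes each branch easy to check against the flowchart without any audio input.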

[0012] One embodiment of the present invention will be described below with reference to the drawings. The configurations shown in the drawings and the following description are illustrative, and the scope of the present invention is not limited to those shown in the drawings and the following description.

[0013] Figures 1 and 2 are flowcharts of the speech recognition method of this embodiment. Figure 3 is a block diagram of a speech recognition device capable of implementing the speech recognition method of this embodiment. The speech recognition method of this embodiment includes: a speech data storage step of acquiring speech data and storing the acquired speech data as stored speech data; a speech determination step of determining whether or not there is speech in the acquired speech data; a decision step of determining whether or not to terminate the storage of the speech data or continue the storage of the speech data based on the result of determining whether or not there is speech in the speech determination step and the storage time of the stored speech data; and a speech recognition step of performing speech recognition based on the stored speech data when the storage of the speech data has been terminated and stored speech data exists.

[0014] The aforementioned audio data storage step includes, for example, at least one of steps S2 and S4 in the flowcharts shown in Figures 1 and 2. The speech determination step includes, for example, at least one of steps S5, S6, S7, and S13 of the flowchart shown in Figures 1 and 2. The aforementioned decision step includes, for example, at least one of steps S8, S10, S14, S15, S18, and S20 of the flowchart shown in Figures 1 and 2. The aforementioned speech recognition step includes, for example, step S11 of the flowchart shown in Figures 1 and 2.
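As a reading aid for the step numbers listed above, the following Python sketch arranges steps S4 to S22 into one per-cycle method. It is an interpretation of the flow described in this document under stated assumptions, not the claimed implementation: the class name `Segmenter`, its fields, and the default values are hypothetical, the speech/no-speech flag is passed in rather than computed, the "predetermined time" of step S18 is taken to be the start of the current cycle, and the frame length of 0.25 s is chosen only so that the floating-point sums in the example are exact.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Segmenter:
    frame_s: float = 0.25    # audio length per cycle (assumed value)
    first_s: float = 0.5     # first threshold (assumed value)
    second_s: float = 1.0    # second threshold (assumed value)
    third_s: float = 30.0    # third threshold (assumed value)
    buffer: List[object] = field(default_factory=list)
    silence_s: float = 0.0

    def push(self, frame: object, has_speech: bool) -> Optional[List[object]]:
        """Run one cycle; return a finished segment or None."""
        self.buffer.append(frame)                    # S4: store and combine
        stored_s = len(self.buffer) * self.frame_s
        if has_speech:                               # S7
            self.silence_s = 0.0
            if stored_s > self.third_s:              # S8 -> S10
                return self._finish()
            return None                              # S9: keep storing
        self.silence_s += self.frame_s               # S13
        if self.silence_s >= self.first_s:           # S14 -> S15
            if stored_s > self.second_s:             # S15 -> S10
                return self._finish()
            self.buffer.pop()                        # S16: drop this frame
            return None                              # S17
        if len(self.buffer) > 1:                     # S18: earlier data exists
            return None                              # S19: keep storing
        self.buffer.clear()                          # S20, S21
        return None                                  # S22

    def _finish(self) -> List[object]:
        # S10: end storage; the returned segment would go to S11.
        segment, self.buffer = self.buffer, []
        self.silence_s = 0.0  # simplification: restart the silence timer
        return segment


# Frames 1..9: silence, four speech frames, a pause, speech, two silent frames.
flags = [False, True, True, True, True, False, True, False, False]
segmenter = Segmenter()
segments = []
for number, flag in enumerate(flags, start=1):
    done = segmenter.push(number, flag)
    if done is not None:
        segments.append(done)
print(segments)  # [[2, 3, 4, 5, 6, 7, 8, 9]]
```

In this run the leading silent frame is discarded, while the short pause at frame 6 stays inside the segment, which is the behaviour the description attributes to steps S14 and S18.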

[0015] The speech recognition method of this embodiment can be implemented, for example, by a speech recognition device 10 as shown in Figure 3. The speech recognition device 10 of this embodiment includes a control unit 2 having a storage unit 3. The control unit 2 is configured to store acquired speech data in the storage unit 3 as stored speech data, and to decide whether to terminate or continue storing the speech data based on the result of determining whether or not there is speech in the acquired speech data and the storage time of the stored speech data. When the storage of the speech data has been terminated and stored speech data exists, the control unit 2 is configured to perform speech recognition based on the stored speech data. The speech recognition device 10 may be included in an automatic minutes-taking system, a conversation recording system, or a speech-to-text conversion system that uses the speech recognition method. The control unit 2 can include a processor, the storage unit 3, a communication unit 4, and the like. The processor can include at least one of, for example, a CPU, an MPU, a GPU, and the like. The storage unit 3 is a RAM, a storage device, or the like. The communication unit 4 is provided to connect to the Internet, a local area network, or the like. The control unit 2 can be connected to the microphone 5 so that the voice signal output from the microphone 5 can be input. In addition, the control unit 2 can be connected to a user interface such as the display unit 6 so that the recognition result of the speech recognition method of the present embodiment can be output to the user interface.

[0016] Figure 4 is an example of a voice waveform when there is speech. The voice waveform shows the change of the voice signal (for example, the output signal of the microphone 5) on the time axis, and the voice data is time-series data of the voice signal. In Figure 4, the boundaries of the voice data acquired by the control unit 2 in step S4 are indicated by dotted lines, and each time interval is labeled with its number. The number of a time interval is the same as the number of the cycle that processes it. The voice recognition method of the present embodiment will be described for the example shown in Figure 4, mainly with reference to the flowcharts shown in Figures 1 and 2 and the block diagram of the voice recognition device 10 shown in Figure 3. When the flow starts (step S1), the control unit 2 first starts accumulating (recording) voice data (step S2). For example, the control unit 2 stores the voice data acquired in subsequent cycles in the storage unit 3 as accumulated voice data until the accumulation is terminated. The accumulation (recording) of the voice data continues, for example, until it is terminated in step S10, step S20, or the like. The accumulated voice data from the start to the end of the accumulation can be regarded as a single item of data.

[0017] In step S3, the control unit 2 starts a cycle (1) (see Figure 4) for the time interval (1), and in step S4, it acquires audio data for the time interval (1) shown in Figure 4. Since the accumulation of audio data has started, the audio data acquired in step S4 is stored in the storage unit 3 as accumulated audio data. If accumulated audio data is already stored in the storage unit 3, the control unit 2 combines the acquired audio data with the stored accumulated audio data. In step S4, the time interval of the audio data acquired by the control unit 2 is, for example, 0.01 seconds or more and 1.0 second or less, preferably 0.01 seconds or more and 0.05 seconds or less. For example, the control unit 2 may directly acquire the audio data output from the microphone. Alternatively, the control unit 2 may store the audio signal output from the microphone in the storage unit 3 and acquire the audio data for the above time interval from the storage unit 3. Furthermore, the control unit 2 may acquire audio data from the Internet network or a local area network via the communication unit 4, or it may acquire the audio data for the above time interval from audio data already stored in the storage unit 3.
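The per-cycle acquisition in step S4 amounts to cutting a sample stream into fixed-length time intervals. The sketch below shows one way this could look in Python; the function name, the 16 kHz sample rate, and the 0.02 s default are illustrative assumptions, and only the 0.01 to 1.0 second range comes from the paragraph above.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], sample_rate: int,
                      frame_s: float = 0.02) -> List[Sequence[float]]:
    """Cut samples into consecutive frames of frame_s seconds each.

    The last frame may be shorter if the input does not divide evenly.
    """
    if not 0.01 <= frame_s <= 1.0:
        raise ValueError("frame_s outside the 0.01 to 1.0 second range")
    frame_len = max(1, round(sample_rate * frame_s))
    return [samples[i:i + frame_len]
            for i in range(0, len(samples), frame_len)]


# 1000 samples at an assumed 16 kHz give frames of 320 samples (20 ms).
frames = split_into_frames(list(range(1000)), sample_rate=16000)
print([len(f) for f in frames])  # [320, 320, 320, 40]
```

Combining the acquired data with the stored data, as the paragraph describes, then corresponds to appending each frame to a buffer in arrival order.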

[0018] In step S5, the control unit 2 determines whether the sound pressure of the audio data acquired in step S4 is greater than a predetermined value. For example, the control unit 2 determines whether the magnitude (sound pressure) of the audio signal included in the audio data acquired in step S4, relative to the magnitude of the audio signal during a time when there is almost no change in the audio waveform (a time when there is no speech), is greater than a predetermined value. This predetermined value is set to determine whether the audio data contains speech, and can be, for example, the minimum sound pressure of speech. If the control unit 2 determines that the sound pressure of the audio data is less than the predetermined value, the process proceeds to step S13 and the control unit 2 determines that there is no speech. If the control unit 2 determines that the sound pressure of the audio data is greater than the predetermined value, the process proceeds to step S6. Since there is almost no change in the audio waveform in the audio data for time interval (1), the control unit 2 determines in step S13 that there is no speech and proceeds to step S14.
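One plausible way to realise the relative sound-pressure check of step S5 is to compare the root-mean-square (RMS) level of a frame with the RMS level measured during a no-speech period. The document does not say how sound pressure is computed, so RMS, the function names, and the ratio `min_ratio` in this Python sketch are all assumptions.

```python
import math
from typing import Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of a frame (0.0 for an empty frame)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def exceeds_sound_pressure(frame: Sequence[float], noise_rms: float,
                           min_ratio: float = 4.0) -> bool:
    """True if the frame is louder than min_ratio times the no-speech level.

    noise_rms is the level measured while the waveform barely changes;
    min_ratio stands in for the "predetermined value" and is hypothetical.
    """
    return rms(frame) > noise_rms * min_ratio


quiet = [0.01, -0.01] * 80   # stand-in for a no-speech interval
loud = [0.5, -0.5] * 80      # stand-in for an interval with speech
print(exceeds_sound_pressure(quiet, noise_rms=0.01))  # False
print(exceeds_sound_pressure(loud, noise_rms=0.01))   # True
```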

[0019] In step S14, the control unit 2 determines whether the state of no speech has continued for the first threshold or longer. The first threshold is used to determine whether an interruption of speech is a temporary one due to breathing, interjections, thinking, etc. The first threshold is, for example, a value of 0.1 seconds or more and 1.5 seconds or less. If the state of no speech has continued for longer than the first threshold, the control unit 2 determines that it is not a temporary interruption of speech and proceeds to step S15. If the duration of the state of no speech is shorter than the first threshold, the control unit 2 determines that there may be a temporary interruption in speech and proceeds to step S18. Temporary interruptions in speech, such as those for breathing, interjections, and thinking, are important information for transcription by the AI model. Therefore, the control unit 2 performs processes such as steps S14 and S18 so that such temporary interruptions are included in the stored audio data. Furthermore, including temporary interruptions in the stored audio data makes the audio data used for speech recognition relatively long, so that the AI model can perform speech recognition while taking context into consideration. This improves the transcription accuracy of speech recognition using the AI model. In cycle (1), the state of no speech has lasted only a short time, so the process proceeds to step S18.
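Step S14 needs a running measure of how long the no-speech state has lasted. A minimal Python sketch of such a timer follows; the class name, the default threshold, and the choice to count a duration equal to the threshold as "long" are assumptions, since the paragraph only gives a range for the first threshold.

```python
class SilenceTracker:
    """Tracks how long the no-speech state has lasted (illustrative)."""

    def __init__(self, first_threshold_s: float = 0.5) -> None:
        self.first_threshold_s = first_threshold_s  # assumed value
        self.silence_s = 0.0

    def update(self, has_speech: bool, frame_s: float) -> float:
        """Add one cycle's result and return the current silence length."""
        if has_speech:
            self.silence_s = 0.0       # any speech restarts the timer
        else:
            self.silence_s += frame_s
        return self.silence_s

    def is_long_silence(self) -> bool:
        """True when S14 would branch to S15 instead of S18."""
        return self.silence_s >= self.first_threshold_s


tracker = SilenceTracker(first_threshold_s=0.5)
tracker.update(has_speech=False, frame_s=0.25)
print(tracker.is_long_silence())  # False: only 0.25 s of silence so far
tracker.update(has_speech=False, frame_s=0.25)
print(tracker.is_long_silence())  # True: 0.5 s reaches the threshold
```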

[0020] In step S18, the control unit 2 determines whether or not there is stored audio data that was stored before a predetermined time. The predetermined time is, for example, 0.5 seconds before the start of the current cycle. Alternatively, the predetermined time may be the start of the current cycle, or the start of the previous cycle. In cycle (1), there is no stored audio data stored up to the previous cycle, so the process proceeds to step S20 and the storage of audio data ends. Then, in step S21, the control unit 2 deletes the stored audio data that was stored after the predetermined time, and in step S22, the cycle (1) ends. If the predetermined time is the start of the current cycle, the stored audio data stored in cycle (1) is deleted.
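The effect of the "predetermined time" in steps S18, S20, and S21 is easier to see with timestamps. In the Python sketch below, the buffer is modelled as a list of (start time, data) pairs, which is a hypothetical representation; the function names are also illustrative.

```python
from typing import List, Tuple

Frame = Tuple[float, bytes]  # (start time in seconds, audio bytes)


def has_data_before(buffer: List[Frame], cutoff_s: float) -> bool:
    """S18: is any stored frame older than the predetermined time?"""
    return any(start < cutoff_s for start, _ in buffer)


def short_silence_step(buffer: List[Frame],
                       cutoff_s: float) -> Tuple[bool, List[Frame]]:
    """Return (keep_storing, buffer) for the short-silence branch."""
    if has_data_before(buffer, cutoff_s):
        return True, buffer                       # S19: continue storing
    # S20/S21: end storage and delete frames stored at or after the cutoff.
    # Nothing precedes the cutoff here, so the result is an empty buffer.
    kept = [(start, data) for start, data in buffer if start < cutoff_s]
    return False, kept


# Cutoff at the start of the current cycle (1.0 s): earlier audio exists.
print(short_silence_step([(0.9, b"a"), (1.0, b"b")], cutoff_s=1.0)[0])  # True
# Cutoff 0.5 s before the current cycle: the same audio is too recent.
print(short_silence_step([(0.9, b"a"), (1.0, b"b")], cutoff_s=0.5))
```

With the earlier cutoff, audio that began less than 0.5 seconds ago is discarded when a silent frame arrives, so the choice of cutoff controls how much preceding audio is required before a pause is kept.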

[0021] After completing cycle (1) in step S22, the control unit 2 returns to step S2 and begins accumulating audio data. The audio data acquired in the subsequent step S4 is accumulated as separate audio data from the previously stored audio data in the storage unit 3. In step S3, the control unit 2 starts cycle (2) for time interval (2), and in step S4, it acquires audio data for time interval (2) shown in Figure 4. The audio data acquired in step S4 is stored in the storage unit 3 as accumulated audio data. In step S5, the control unit 2 determines whether the sound pressure of the audio data acquired in step S4 is greater than a predetermined value. The audio waveform in time interval (2) of Figure 4 has changed significantly, so the control unit 2 determines that the sound pressure of the audio data acquired in step S4 is greater than a predetermined value and proceeds to step S6.

[0022] In step S6, the control unit 2 determines whether or not the audio data acquired in step S4 contains spoken voice. In step S6, the control unit 2 can use VAD (Voice Activity Detection) to detect spoken voice and determine whether or not spoken voice is included in the audio data. VAD is a process that determines from the audio signal whether or not a speaker is actually speaking. VAD can use a machine learning model to determine whether or not a sound is spoken voice. For example, using VAD, the control unit 2 can determine that the audio data does not contain spoken voice when the sound is noise such as a cough, even though it is produced by a voice. If the control unit 2 determines that the audio data does not contain spoken voice, it determines in step S13 that there is no speech and proceeds to step S14. If the control unit 2 determines that the audio data contains spoken voice, it determines in step S7 that there is speech and proceeds to step S8. The control unit 2 determines that the audio data for time interval (2) of Figure 4 contains spoken voice and proceeds to step S8.
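Steps S5 and S6 form a two-stage gate: the cheap loudness check runs first, and the VAD runs only on frames that pass it. The Python sketch below shows that ordering with both checks passed in as callables. The function name is an assumption, and the two lambdas are toy stand-ins, not a real VAD model or library API.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]


def detect_speech(frame: Frame,
                  is_loud: Callable[[Frame], bool],
                  vad: Callable[[Frame], bool]) -> bool:
    """Return True for "speech" (S7) and False for "no speech" (S13)."""
    if not is_loud(frame):      # S5 fails: go straight to S13, skip the VAD
        return False
    return bool(vad(frame))     # S6 decides between S7 and S13


vad_calls: List[Frame] = []


def toy_vad(frame: Frame) -> bool:
    """Stand-in VAD: records the call and treats positive means as speech."""
    vad_calls.append(frame)
    return sum(frame) / len(frame) > 0


toy_is_loud = lambda frame: max(abs(x) for x in frame) > 0.1

print(detect_speech([0.01, -0.01], toy_is_loud, toy_vad))  # False
print(len(vad_calls))                                      # 0: VAD not run
print(detect_speech([0.5, 0.4], toy_is_loud, toy_vad))     # True
```

Passing the checks as parameters keeps the gate independent of any particular detector, which matters because the document only says that VAD "can use" a machine learning model.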

[0023] In step S8, the control unit 2 determines whether the storage time (recording time) of the stored audio data stored in the storage unit 3 is longer than the third threshold. The third threshold is the upper limit of the storage time of the stored audio data. For example, the third threshold is a value of 20 seconds or more and 40 seconds or less. The third threshold is longer than the second threshold. By making the third threshold relatively long in this way, when performing speech recognition using the AI model, the audio data to be recognized can be made relatively long, and speech recognition can be performed while taking context into consideration. Therefore, the transcription accuracy of speech recognition using the AI model can be improved. However, even if speech is continuing, if the storage time of the stored audio data is too long, a time lag will occur between the speech and its recognition. Therefore, if the storage time of the stored audio data is greater than the third threshold, even if speech is in progress, the process proceeds to step S10, and the control unit 2 terminates the storage of audio data. If the storage time of the stored audio data is less than the third threshold, the process proceeds to steps S9 and S3, and the control unit 2 continues storing the audio data and starts the next cycle. In cycle (2), since the storage time of the stored audio data is short, the process proceeds to steps S9 and S3, and the control unit 2 starts cycle (3) while continuing to store the audio data.

[0024] The control unit 2 performs control processing in the same order as in cycle (2), that is, in the order of steps S3, S4, S5, S6, S7, S8, and S9, for each of the following cycles: cycle (3) for time interval (3), cycle (4) for time interval (4), cycle (5) for time interval (5), and cycle (6) for time interval (6). The control unit 2 acquires audio data for time interval (3) in cycle (3), for time interval (4) in cycle (4), for time interval (5) in cycle (5), and for time interval (6) in cycle (6). When acquiring each audio data, the control unit 2 combines the acquired audio data with the stored audio data already stored. In this way, the control unit 2 accumulates stored audio data.

[0025] After completing cycle (6), the control unit 2 starts cycle (7) for the time interval (7) in step S3, and in step S4 acquires audio data for the time interval (7) and combines the acquired audio data with the stored audio data already in storage. In step S5, the control unit 2 determines whether the sound pressure of the audio data acquired in step S4 is greater than a predetermined value. In the audio data for time interval (7) in Figure 4, there is almost no change in the audio waveform, so in step S13 the control unit 2 determines that there is no speech and proceeds to S14.

[0026] In step S14, the control unit 2 determines whether the state of no speech has continued for the first threshold or longer. In cycle (7), the control unit 2 determines that the state of no speech is short and may be a temporary interruption in speech, and proceeds to step S18. In step S18, the control unit 2 determines whether or not there is stored audio data that was accumulated before a predetermined time. The predetermined time is, for example, the time when the current cycle begins. In cycle (7), since there is accumulated voice data from the previous cycle, the process proceeds to step S19 and ends cycle (7). In this case, the control unit 2 determines that the state of no speech may be a temporary interruption and returns to step S3 to continue accumulating voice data.

[0027] After completing cycle (7) in step S19, the control unit 2 returns to step S3 while continuing to accumulate audio data and starts cycle (8) for the time interval (8). In step S4, it acquires the audio data for the time interval (8) shown in Figure 4 and combines the acquired audio data with the already stored audio data. In step S5, the control unit 2 determines whether the sound pressure of the audio data acquired in step S4 is greater than a predetermined value. The audio waveform in time interval (8) of Figure 4 has changed significantly, and the control unit 2 determines that the sound pressure of the audio data acquired in step S4 is greater than a predetermined value, and proceeds to step S6.

[0028] In step S6, the control unit 2 determines whether or not the audio data acquired in step S4 contains spoken audio. In step S6, the control unit 2 can use VAD (Voice Activity Detection) to detect spoken audio and determine whether or not spoken audio is included in the audio data. As shown in the audio waveform of time interval (8) in Figure 4, the audio data contains spoken audio, so in step S7 the control unit 2 determines that there is spoken audio and proceeds to step S8. In step S8, the control unit 2 determines whether the storage time (recording time) of the stored audio data stored in the storage unit 3 is longer than the third threshold. In cycle (8), since the storage time for the stored audio data is short, the process proceeds to steps S9 and S3, and the control unit 2 starts cycle (9) while continuing to store the audio data. In cycle (7), no speech was detected, but in cycle (8), speech was detected. Therefore, the time interval (7) is thought to be a temporary interruption due to breathing, interjections, thinking, etc. In the speech recognition method of this embodiment, such temporary interruptions can be included in the stored speech data, thereby improving the accuracy of transcription using the AI model.

[0029] The control unit 2 performs control processing in the same order as in cycle (8) for each of the following cycles: cycle (9) for time interval (9), cycle (10) for time interval (10), cycle (11) for time interval (11), cycle (12) for time interval (12), cycle (13) for time interval (13), and cycle (14) for time interval (14). The control unit 2 acquires audio data for time interval (9) in cycle (9), for time interval (10) in cycle (10), for time interval (11) in cycle (11), for time interval (12) in cycle (12), for time interval (13) in cycle (13), and for time interval (14) in cycle (14). In each cycle, the control unit 2 combines the acquired audio data with the stored audio data already stored. In this way, the control unit 2 accumulates the stored audio data, and then starts cycle (15). However, if in step S8 of any of cycles (9) to (14) the control unit 2 determines that the storage time of the stored audio data stored in the storage unit 3 is longer than the third threshold, the control unit 2 determines that the upper limit of the storage time of the stored audio data has been reached and proceeds to step S10.

[0030] In step S10, the control unit 2 finishes accumulating the voice data, and in step S11, it performs voice recognition processing on the accumulated voice data stored in the storage unit 3 using the AI model. If the control unit 2 has stored the AI model, the control unit 2 can perform the voice recognition processing itself. Alternatively, the control unit 2 may transmit the accumulated voice data to a server on the Internet or a local area network via the communication unit 4, have the server perform voice recognition processing, and receive the results via the communication unit 4. Furthermore, the control unit 2 may output the speech recognition results to a user interface such as the display unit 6. Subsequently, the cycle ends in step S12, and the accumulation of the next audio data begins in step S2. The audio data acquired in the following step S4 is stored as separate audio data from the previously stored audio data in the storage unit 3.
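Steps S10 and S11 can be sketched as a single hand-off in which the recognizer is a parameter, so the same code path serves a locally stored model or a remote server. In the Python sketch below, the function name is an assumption and `fake_local_model` is a placeholder that only reports the input length; no real recognition engine or network API is implied.

```python
from typing import Callable, List, Optional, Sequence


def finish_and_recognize(buffer: List[Sequence[float]],
                         recognize: Callable[[List[float]], str]
                         ) -> Optional[str]:
    """End storage (S10) and recognize the stored audio as one item (S11).

    Returns None without calling the recognizer if nothing is stored.
    """
    if not buffer:
        return None
    # Frames stored across cycles are treated as a single piece of audio.
    audio = [sample for frame in buffer for sample in frame]
    buffer.clear()   # the next S2 starts a separate item of stored data
    return recognize(audio)


def fake_local_model(audio: List[float]) -> str:
    """Placeholder recognizer used only to make the example runnable."""
    return f"<{len(audio)} samples recognized>"


stored = [[0.1, 0.2], [0.3]]
print(finish_and_recognize(stored, fake_local_model))  # <3 samples recognized>
print(stored)                                          # []
```

A server-backed variant would pass a different callable that sends the audio over the network and returns the received text, leaving this function unchanged.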

[0031] The following explanation assumes that the storage time (recording time) of the accumulated audio data has not reached the upper limit in cycles (9) to (14). After completing cycle (14), the control unit 2 starts cycle (15) for the time interval (15) in step S3, acquires audio data for the time interval (15) in step S4, and combines the acquired audio data with the stored audio data already in storage. In step S5, the control unit 2 determines whether the sound pressure of the audio data acquired in step S4 is greater than a predetermined value. In the audio data for the time interval (15) in Figure 4, there is almost no change in the audio waveform, so in step S13 the control unit 2 determines that there is no speech and proceeds to S14.

[0032] In step S14, the control unit 2 determines whether the state of no speech has continued for the first threshold or longer. In cycle (15), the control unit 2 determines that the state of no speech is short and may be a temporary interruption in speech, and proceeds to step S18. In step S18, the control unit 2 determines whether or not there is stored audio data that was accumulated before a predetermined time. The predetermined time is, for example, the time when the current cycle begins. In cycle (15), since there is accumulated audio data from the previous cycle, the process proceeds to step S19, ends cycle (15), returns to step S3, and starts cycle (16) while continuing to accumulate audio data.

[0033] The control unit 2 continues to accumulate audio data and performs control processing in the same order as in cycle (15), that is, in the order of steps S3, S4, S5, S13, S14, S18, and S19, for cycle (16) for time interval (16) and cycle (17) for time interval (17). Here, in cycles (16) and (17), the duration of the state of no speech is assumed to be shorter than the first threshold. In cycle (18), it is assumed to be longer than the first threshold.

[0034] After completing cycle (17), the control unit 2 starts cycle (18) for the time interval (18) in step S3, and in step S4 acquires audio data for the time interval (18) and combines the acquired audio data with the stored audio data already in storage. In step S5, the control unit 2 determines whether the sound pressure of the audio data acquired in step S4 is greater than a predetermined value. Since there is almost no change in the audio waveform in the audio data for time interval (18) in Figure 4, the control unit 2 determines in step S13 that there is no speech and proceeds to step S14. In step S14, the control unit 2 determines whether the state of no speech has continued for the first threshold or longer. In cycle (18), the control unit 2 determines that the state of no speech has continued for the first threshold or longer, and proceeds to step S15.

[0035] In step S15, the control unit 2 determines whether the storage time (recording time) of the stored audio data stored in the storage unit 3 is longer than the second threshold. The second threshold is used to determine whether the storage time is sufficient for the AI model to perform highly accurate speech recognition. The second threshold is, for example, a value of 0.5 seconds or more and 10.0 seconds or less, preferably around 1.0 second. The second threshold is shorter than the third threshold. If the storage time of the stored audio data is longer than the second threshold, the control unit 2 determines that the stored audio data has a sufficient storage time for speech recognition and proceeds to step S10. This allows the control unit 2 to terminate the storage of audio data at a suitable timing immediately after the speech is interrupted and perform speech recognition. Furthermore, by setting the second threshold to a relatively long length, the audio data used for speech recognition can be made relatively long when performing speech recognition using the AI model, allowing speech recognition to be performed while taking context into consideration. This improves the transcription accuracy of speech recognition using the AI model. If the storage time of the stored audio data is shorter than the second threshold, the control unit 2 determines in step S15 that the stored audio data does not have a sufficient storage time for speech recognition and proceeds to step S16.

[0036] In step S15 of cycle (18), if the control unit 2 determines that the storage time of the stored audio data is longer than the second threshold and proceeds to step S10, the control unit 2 terminates the storage of audio data and, in step S11, performs speech recognition processing using the AI model on the stored audio data stored in the memory unit 3. The control unit 2 then terminates cycle (18) in step S12 and starts storing the next audio data in step S2. The audio data acquired in the subsequent step S4 is stored as stored audio data separate from the stored audio data previously stored in the memory unit 3.
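The hand-off in steps S10 to S12 and the return to step S2 can be sketched as a small buffer class. The class `SegmentBuffer` and the `recognize` callable are hypothetical; the callable merely stands in for the AI model, whose interface is not described here.

```python
class SegmentBuffer:
    """Sketch of stored audio data that is recognized and then started afresh."""

    def __init__(self, recognize):
        self._recognize = recognize  # hypothetical stand-in for the AI model
        self._chunks = []            # audio acquired in each cycle (step S4)
        self.results = []            # recognition results, one per segment

    def append(self, chunk):
        """Combine newly acquired audio with the stored audio data."""
        self._chunks.append(chunk)

    def finalize(self):
        """Terminate storage, recognize if stored data exists, start a new segment."""
        if self._chunks:
            audio = [sample for chunk in self._chunks for sample in chunk]
            self.results.append(self._recognize(audio))
        # Audio acquired from now on belongs to separate stored audio data.
        self._chunks = []
```

Calling `finalize()` on an empty buffer produces no result, which mirrors the condition that recognition runs only when stored audio data exists.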

[0037] In step S15 of cycle (18), if the control unit 2 determines that the storage time of the stored audio data is shorter than the second threshold and proceeds to step S16, the control unit 2 deletes the audio data acquired in step S4 of the current cycle and terminates cycle (18) in step S17. In step S16, if the audio data acquired in step S4 is itself stored in the memory unit 3 as stored audio data, the control unit 2 deletes that stored audio data. If the audio data acquired in step S4 has been combined with existing stored audio data, the control unit 2 deletes only the audio data acquired in step S4 of the current cycle from the stored audio data. This prevents audio data containing no speech from being kept as stored audio data, making it possible to perform speech recognition efficiently. When cycle (18) ends in step S17, the process returns to step S3 and starts cycle (19) while continuing to accumulate audio data.

[Explanation of Symbols]

[0038]
2: Control Unit
3: Memory Unit
4: Communication Unit
5: Microphone
6: Display Unit
10: Speech Recognition Device

Claims

1. A speech recognition method comprising: a speech data storage step of acquiring speech data and storing the acquired speech data as stored speech data; a speech determination step of determining whether or not there is speech in the acquired speech data; a decision step of determining whether to terminate or to continue the storage of the speech data, based on the result of the speech determination step and the storage time of the stored speech data; and a speech recognition step of performing speech recognition based on the stored speech data when the storage of the speech data has been terminated and the stored speech data exists.

2. The speech recognition method according to claim 1, wherein the decision step terminates the storage of the speech data if it is determined in the speech determination step that there is no speech, the duration of the no-speech state is longer than a first threshold, and the storage time of the stored speech data is longer than a second threshold.

3. The speech recognition method according to claim 1, wherein the decision step continues the storage of the speech data if it is determined in the speech determination step that there is no speech, the duration of the no-speech state is longer than a first threshold, and the storage time of the stored speech data is shorter than a second threshold.

4. The speech recognition method according to claim 1, wherein, if it is determined in the speech determination step that there is no speech and the duration of the no-speech state is shorter than a first threshold, the decision step determines whether there is stored speech data that was stored before a predetermined time and, if such stored speech data exists, continues the storage of the speech data.

5. The speech recognition method according to claim 1, wherein, if it is determined in the speech determination step that there is no speech and the duration of the no-speech state is shorter than a first threshold, the decision step determines whether there is stored speech data that was stored before a predetermined time and, if no such stored speech data exists, terminates the storage of the speech data.

6. The speech recognition method according to claim 5, wherein the decision step deletes any stored speech data that was stored after the predetermined time.

7. The speech recognition method according to claim 1, wherein the decision step terminates the storage of the speech data if it is determined in the speech determination step that there is speech and the duration of the speech state is longer than a third threshold.

8. The speech recognition method according to claim 1, wherein the decision step continues the storage of the speech data if it is determined in the speech determination step that there is speech and the duration of the speech state is shorter than a third threshold.

9. The speech recognition method according to any one of claims 1 to 8, wherein the cycle including the speech data storage step, the speech determination step, and the decision step is repeatedly executed.

10. A speech recognition device comprising a control unit and a memory unit, wherein the control unit is configured to store acquired speech data in the memory unit as stored speech data, to determine whether to terminate or continue the storage of the speech data based on a result of determining whether or not there is speech in the acquired speech data and on the storage time of the stored speech data, and to perform speech recognition based on the stored speech data when the storage of the speech data has been terminated and stored speech data exists.