Information processing device, communication terminal, information processing system, information processing method, and program
The information processing apparatus improves speech-to-text accuracy by using screen data to replace non-general terms with corresponding terms from screen text, addressing the challenge of recognizing technical terms in speech-to-text systems.
Patent Information
- Authority / Receiving Office: WO
- Patent Type: Applications
- Current Assignee / Owner: CANON KK
- Filing Date: 2025-11-27
- Publication Date: 2026-07-02
AI Technical Summary
Existing speech-to-text technologies struggle with accurately recognizing non-general terms such as technical terms and special terms used within organizations, leading to inaccuracies in transcription.
An information processing apparatus that includes an acquisition unit for screen and audio data, a conversion unit for converting data into text, and a generation unit that replaces specific terms in audio text with corresponding terms from screen text, utilizing AI models for improved accuracy.
Enhances the accuracy of speech-to-text transcription by leveraging screen data to correctly identify and replace technical or special terms, ensuring precise transcription even in complex contexts.
Abstract
Description
Information Processing Apparatus, Communication Terminal, Information Processing System, Information Processing Method, and Program
[0001] This disclosure relates to speech-to-text processing of voice data.
[0002] Speech-to-text conversion of voice data is used in various fields, such as transcription of meetings, interviews, lectures, and classes, and subtitle generation for spoken words in video content. Automatic speech-to-text tools that use a computer to create speech-to-text data from voice data are also available. Patent Document 1 discloses a technique for performing speech recognition on input voice data to convert it into text and creating meeting minutes. Patent Document 1 describes selecting the dictionary data to be used for speech recognition from pre-prepared, field-specific speech recognition dictionary data, based on a character string included in the image data projected by a projector.
[0003] In recent years, speech-to-text services that utilize generative AI have also been provided, and in online meeting services conducted between multiple communication terminals via a network, a speech-to-text function and a minutes generation function are provided.
[0004] Patent Document 1: Japanese Patent Application Laid-Open No. 2006-267934
[0005] However, the spoken voice to be subjected to speech-to-text may include non-general terms such as technical terms and special terms used only within an organization. In that case, non-general terms are not correctly speech-recognized, and there is room for improvement in the accuracy of the speech-to-text result.
[0006] This disclosure is an information processing apparatus communicatively connected to a communication terminal via a network, having an acquisition means for acquiring screen data and voice data shared by the communication terminal, a first conversion means for converting the voice data into voice text data, a second conversion means for converting the screen data into screen text data, a generation means for generating speech-to-text data in which a part of the character string included in the voice text data is replaced with the character string included in the screen text data, and an output means for outputting the speech-to-text data generated by the generation means to the communication terminal.
[0007] This disclosure provides an information processing device capable of generating accurately transcribed text data from audio data.
[0008] Further features of the technology of this disclosure will become apparent from the following description of embodiments with reference to the accompanying drawings.
[0009] The drawings show the following:
- A diagram showing an example of the system configuration of an online meeting system.
- A diagram showing the hardware configuration of the meeting server.
- A diagram showing the hardware configuration of the communication terminal.
- A block diagram showing the configuration of the transcription processing unit 100.
- A diagram showing an example of a meeting screen.
- A diagram showing screen data and audio data in chronological order.
- A diagram showing an example of an editing screen.
- A flowchart showing the editing process of meeting data performed by the meeting server.
- A flowchart showing the transcription process performed by the meeting server.
- A diagram showing an example of audio text converted from audio data.
- A diagram showing an example of screen text converted from screen data.
- A flowchart showing the mapping process of S906.
- A diagram showing an example of an editing screen where replacement candidates are displayed in a selectable format.
- A diagram showing an example of a correspondence list.
- An example of displaying transcription data in which a part of the string of audio text has been replaced with screen text.
- A diagram showing variations in the display of transcription data and a GUI for correct/incorrect input.
- A diagram illustrating an example of mapping between user-instructed locations and audio text.
- A diagram illustrating an example of mapping between user-instructed areas and audio text.
- A diagram showing an example of replacement candidates similar to audio text.
- A flowchart showing the transcription process during a meeting in the second embodiment.
- A flowchart showing the mapping process in the second embodiment.
- A diagram showing an example of reading candidates for screen text.
- A diagram showing an example of how transcript data is displayed on a meeting screen.
- A diagram showing a GUI for selecting conversion candidates that include screen text.
- A flowchart of the mapping process in the third embodiment.
- A diagram showing an example of mapping between time-series screen transitions and audio text.
- A block diagram showing the configuration of the transcription processing unit in the fourth embodiment.
- A flowchart of the transcription process in the fourth embodiment.
[0010] The embodiments of this disclosure will be described in detail below with reference to the drawings. Note that the following embodiments are not intended to limit the invention as defined in the claims, and not all combinations of features described in the embodiments are necessarily essential to the solution of the invention.
[0011] In the following embodiments, an online meeting system will be described as an example of an information processing system related to this disclosure. The data subject to transcription processing will be described as meeting data of an online meeting. The meeting data includes screen data and audio data shared among communication terminals used by meeting attendees. This disclosure is not limited to online meeting systems and is applicable to any information processing system in which image data and audio data are shared among multiple communication terminals. For example, it is applicable to information processing systems used for webinars, online events (presentations, classes, training, workshops, game tournaments, etc.), video content distribution, advertising content distribution, etc. Furthermore, this information processing system can also be applied to video distribution on SNS (Social Networking Services), etc.
[0012] <First Embodiment> (System Configuration) Figure 1 is a diagram showing the system configuration of an online meeting system 1, which is an example of an information processing system according to this embodiment. As shown in the figure, the online meeting system 1 consists of a conference server 110 and communication terminals 120 connected to each other via a network 130, forming a server-client system. The conference server 110 functions as the server of the online meeting system 1, and each communication terminal 120 functions as a client.
[0013] The conference server 110 may consist of a single information processing device or may include multiple information processing devices. Figure 1 shows an example where three communication terminals 120 are connected to the network 130, but the number of connected communication terminals 120 is not limited to three and may be any number of one or more. The network 130 includes communication networks such as LAN (Local Area Network), WAN (Wide Area Network), the Internet, and short-range communication such as Bluetooth®. The communication connection method may be wired or wireless.
[0014] The conference server 110 performs conference processing and conference management processing to conduct online conferences among multiple communication terminals 120. In conference processing, the conference server 110 receives audio data, image data, and shared screen data transmitted from multiple communication terminals 120 attending the online conference. Shared screen data will be described later. Based on the received data, the conference server 110 generates conference screen data and conference audio data and transmits them in real time to the communication terminals 120 of the conference attendees. This allows conference screen data and conference audio data to be shared among multiple communication terminals 120. Management processing includes schedule management, conference room management, and processing related to attending and leaving conferences. For example, the conference server 110 performs processing such as setting the start and end dates and times of the conference, setting the conference site to be used as the conference room, and notifying the communication terminals 120 used by attendees of its URL, according to user operations. In the following description, user operations on the conference server 110 are operations performed by the users of each communication terminal 120 and are implemented using the web browser function. Unless otherwise specified, "user" refers to the user of communication terminal 120. Meeting attendees and the person taking the minutes are also users of communication terminal 120.
[0015] Furthermore, the conference server 110 has a transcription processing unit 100 that performs transcription processing. In the transcription processing, the transcription processing unit 100 acquires screen data and audio data shared among multiple communication terminals 120, and converts the acquired audio data into audio-text data. The transcription processing unit 100 also converts the acquired screen data into screen-text data, which is text data. The transcription processing unit 100 then generates transcription data that includes strings contained in the audio-text data and strings contained in the screen-text data. Specifically, the transcription processing unit 100 generates transcription data in which some strings (replaceable strings) contained in the audio-text data are replaced with strings contained in the screen-text data. Alternatively, the transcription processing unit 100 generates transcription data that lists the replaceable strings contained in the audio-text data and screen-text data that are candidates for replacement of the replaceable strings. Details of the transcription processing unit 100 will be described later (Figure 4).
[0016] The communication terminal 120 is an information processing device used by users of the online meeting system 1, and has a web browser function for browsing websites on the internet and running web applications provided by a server. The communication terminal 120 can be composed of, for example, a PC (personal computer), a smartphone, a tablet, other information processing terminals, or a camera equipped with network communication capabilities.
[0017] The communication terminal 120 transmits audio data spoken by attendees during the meeting, image data of attendees, and shared screen data to the conference server 110. The audio data, image data, and shared screen data transmitted from the communication terminal 120 to the conference server 110 are processed by the conference server 110 and become audio data, image data, and shared screen data that are shared with other communication terminals 120. The shared screen data includes data of the screen displaying a selected document file or image file on one of the communication terminals 120. This data is uploaded to the conference server 110 in the form of a compressed video stream, such as H.264 / H.265 or VP8 / VP9, and shared with each communication terminal 120. The shared screen data includes at least one of text and images. The communication terminal 120 also receives audio data, image data, and shared screen data from other communication terminals 120 transmitted from the conference server 110 and outputs them so that the users of those communication terminals 120 can view them. Furthermore, the communication terminal 120 receives the transcript data generated by the conference server 110 and displays it on the display unit on the communication terminal 120. The transcript data will be described later.
[0018] (Hardware Configuration) Figure 2 shows an example of the hardware configuration of an information processing device used as a conference server 110. The conference server 110 includes a CPU 201, ROM 202, RAM 203, storage 204, communication interface (I / F) 205, display unit 206, and input unit 207, etc. These units are connected to each other via a bus 209. Note that the configuration of the conference server 110 is not limited to the example in Figure 2, and additions or omissions may be made as appropriate. For example, the conference server 110 may have a configuration without a display unit 206 and an input unit 207, or it may include a GPU (Graphics Processing Unit), additional memory, etc., in addition to the configuration shown in the figure. Furthermore, the storage 204, communication I / F 205, display unit 206, and input unit 207 of the conference server 110 may be provided integrally with the main body of the conference server 110, or they may be provided separately.
[0019] The CPU 201 is the central processing unit and controls the operation of each component of the conference server 110. The CPU 201 reads programs stored in the ROM 202 or storage 204 and uses the RAM 203 as a work area to execute various processes described in the programs. The ROM 202 is a non-volatile memory area that stores the OS (Operating System), fixed operating parameters and operating programs used by the conference server 110. The RAM 203 is a memory that temporarily stores data and control information and serves as a work area used by the CPU 201 when executing various processes.
[0020] Storage 204 is a storage device and includes a hard disk drive (HDD), solid state drive (SSD), optical disc drive, flash memory, or other storage device. Storage 204 stores computer programs and data for causing the CPU 201 to execute each of the processes described later performed by the conference server 110. Communication interface (I / F) 205 includes a communication control circuit and a communication port and mediates the transmission and reception of data with other devices via the network 130.
[0021] The display unit 206 includes a display control circuit and a display, and displays the display data input from the CPU 201 on the display in a manner that is visible to the user of the conference server 110. The input unit 207 includes, for example, a keyboard, mouse, touch panel, etc., and receives input of information corresponding to the user's operation of the conference server 110 and inputs it to the CPU 201. Note that the display unit 206 and the input unit 207 may be an integrated touch panel display.
[0022] Figure 3 shows an example of the hardware configuration of a communication terminal 120. The communication terminal 120 includes a CPU 301, ROM 302, RAM 303, storage 304, communication I/F 305, display unit 306, input unit 307, camera 308, microphone 309, and speaker 310, etc. These units are connected to each other via a bus 311. The configuration of the communication terminal 120 is not limited to the example in Figure 3, and components may be added or omitted as appropriate. For example, it may include a GPU, additional memory, or other components not shown in the figure. Also, the storage 304, communication I/F 305, display unit 306, input unit 307, camera 308, microphone 309, and speaker 310 may be provided integrally with the main body of the communication terminal 120, or they may be provided separately.
[0023] The CPU 301 of the communication terminal 120 is a central processing unit that controls the operation of each part that makes up the communication terminal 120. The CPU 301 uses the RAM 303 as a work area to execute various processes according to the programs held in the ROM 302 or storage 304. The ROM 302 is a non-volatile memory area that stores the OS, fixed operating parameters, and operating programs used by the communication terminal 120. The RAM 303 is a memory that temporarily stores data and control information and serves as a work area used by the CPU 301 when executing various processes.
[0024] The storage 304 is a storage device and includes an HDD, SSD, optical disc drive, flash memory, or other storage device. The storage 304 stores computer programs and data for causing the CPU 301 to execute each of the processes described later performed by the communication terminal 120. The communication interface 305, display unit 306, and input unit 307 are the same as the communication interface 205, display unit 206, and input unit 207 described above, so their description is omitted.
[0025] The camera 308 has an image sensor such as a CCD or CMOS and a lens, and inputs the captured data (hereinafter referred to as "image data") to the CPU 301. The microphone 309 converts input voice, such as speech by the user, into voice data and inputs it to the CPU 301. The speaker 310 outputs the voice data output from the CPU 301 as voice.
[0026] Note that the hardware configurations shown in Figures 2 and 3 are merely examples, and components may be added or removed as needed.
[0027] In this embodiment, the various functions related to online meetings provided by the conference server 110 of the online meeting system 1 are provided to the communication terminal 120 as a web application. That is, the conference server 110 provides the communication terminal 120 with a UI (user interface) screen for online meetings via the communication terminal 120's web browser. The conference server 110 displays various data to the communication terminal 120 via the UI screen and, while accepting data input from the communication terminal 120, executes processing corresponding to the various functions provided by the conference server 110. Note that the various functions related to online meetings are not limited to a web application and may be implemented by a dedicated application for the communication terminal 120 to execute the processing according to this embodiment. In that case, the communication terminal 120 executes the dedicated application, sending and receiving data with the conference server 110 while executing processing corresponding to the various functions related to online meetings. Furthermore, the functions of this embodiment may be implemented using a dedicated circuit (such as an ASIC), not limited to the form of an application program. In addition, the various hardware components constituting the conference server 110 may be virtual hardware resources on the cloud. In that case, the conference server 110 sends a function execution request to the relevant hardware resource via the communication interface 205 and obtains the processing result via the communication interface 205.
[0028] (Transcription Processing Unit) Figure 4 is a block diagram showing the functional configuration of the transcription processing unit 100 in this embodiment. As shown in Figure 4, the transcription processing unit 100 has an acquisition unit 410, a conversion unit 420, a generation unit 430, and an output unit 440. In the first embodiment, the transcription processing unit 100 is provided in the conference server 110 as shown in Figure 1. The functions of each functional unit shown in Figure 4 are realized when the CPU 201 of the conference server 110 calls a program stored in the ROM 202 or storage 204, and the CPU 201 executes processing according to the program.
[0029] The acquisition unit 410 acquires screen data and audio data shared with the communication terminal 120 during an online meeting. The screen data and audio data are generated by the conference processing performed by the conference server 110. In the following description, the screen data shared between the conference server 110 and one or more communication terminals 120 is referred to simply as screen data, and the shared audio data as audio data.
[0030] The screen data includes data from the shared screen that is transmitted from the online meeting attendees' communication terminals 120 to the conference server 110 via the screen sharing function of the meeting processing. Specifically, at the communication terminal 120 of an attendee who requested the execution of the screen sharing function, data of the display screen, such as a document file or image file, specified by the attendee is transmitted to the conference server 110 as shared screen data. The conference server 110 transmits the received shared screen data to all attendees' communication terminals 120. This allows the shared screen data to be shared. The shared screen data is updated each time the display content is updated at the source communication terminal 120, and the updated display screen data is uploaded to the conference server 110 as shared screen data, transmitted to all attendees' communication terminals 120, and shared. The shared screen data also includes the trajectory of the mouse pointer and drawing trajectory operated on the display screen by the user of the source communication terminal 120.
[0031] The audio data is compiled by the conference server 110 from the audio data of the spoken words input by each participant in the online meeting via their communication terminal 120. Specifically, the conference server 110 acquires the audio data of the spoken words transmitted from each communication terminal 120 to the conference server 110. The conference server 110 performs audio processing, such as noise reduction, as necessary, and transmits it to all participants' communication terminals 120, thereby sharing the audio of the discussions in the meeting as audio data. Screen data and audio data are shared in real time during the meeting.
[0032] In the first embodiment, the conference server 110 records screen data and audio data shared during the conference and saves them to the storage 204. Time information is also added to the screen data and audio data for synchronization purposes.
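The time information added for synchronization could, for example, be used to look up which shared screen was being displayed when a given utterance was spoken. The following is a minimal sketch under stated assumptions: the `ScreenFrame` and `AudioSegment` types, the `screen_at` helper, and timestamps expressed in seconds from the start of the meeting are all hypothetical and are not taken from this disclosure.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScreenFrame:
    start: float  # assumed: seconds from meeting start at which this screen appeared
    text: str     # screen text already extracted from the frame


@dataclass
class AudioSegment:
    start: float  # assumed: seconds from meeting start at which the utterance began
    text: str     # audio text for the utterance


def screen_at(frames: List[ScreenFrame], t: float) -> Optional[ScreenFrame]:
    """Return the frame being shared at time t, or None if sharing had not started.

    frames must be sorted by start time.
    """
    starts = [f.start for f in frames]
    i = bisect_right(starts, t) - 1
    return frames[i] if i >= 0 else None


frames = [ScreenFrame(0.0, "Agenda"), ScreenFrame(60.0, "Q3 roadmap")]
segment = AudioSegment(75.5, "let's look at the roadmap")
shared = screen_at(frames, segment.start)
```

A binary search over sorted start times keeps the lookup cheap even when the shared screen changes many times during a long recording.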
[0033] The conversion unit 420 includes a speech-to-text conversion unit 421 and a screen text conversion unit 422.
[0034] The speech-to-text conversion unit 421 converts the audio data acquired by the acquisition unit 410 into audio text data, which is text data (hereinafter referred to as audio text) (conversion process). The conversion from audio data to audio text can be performed using known speech recognition processing. In speech recognition processing, for example, feature quantities of the audio data are extracted, and phoneme analysis, word combination analysis, grammatical and vocabulary analysis, etc., are performed using acoustic models or language models to predict word sequences that are close to natural language. Alternatively, speech recognition processing may be performed using a speech recognition model constructed by deep learning.
[0035] The speech-to-text conversion unit 421 may be configured to include a speech recognition conversion engine and a speech recognition model within the transcription processing unit 100, or it may utilize a conversion engine 461 and generative AI 462 provided in an external device 460. In the latter case, the speech-to-text conversion unit 421 sends a request to the external device 460 to convert the audio data to text data, and in response, it obtains the converted audio text. At that time, the speech-to-text conversion unit 421 generates a prompt (instruction) to convert the audio data to audio text and inputs it to the generative AI 462. Known technologies for converting audio data to text include, for example, wav2vec and Whisper.
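The choice between a built-in engine and an engine on an external device can be pictured as the conversion unit delegating to an interchangeable engine object. The sketch below is illustrative only: `SpeechEngine`, `StubExternalEngine`, `SpeechToTextUnit`, and the prompt wording are hypothetical names introduced here, and the stub returns canned text instead of performing speech recognition or a network request.

```python
from typing import Protocol


class SpeechEngine(Protocol):
    """Assumed interface shared by a built-in engine and an external device."""

    def transcribe(self, audio: bytes, prompt: str) -> str: ...


class StubExternalEngine:
    """Stand-in for an external device; a real one would send a network request."""

    def transcribe(self, audio: bytes, prompt: str) -> str:
        # No recognition happens here; the stub only reports the input size.
        return f"[{len(audio)} bytes transcribed]"


class SpeechToTextUnit:
    """Builds the prompt and delegates the conversion to the injected engine."""

    def __init__(self, engine: SpeechEngine) -> None:
        self.engine = engine

    def convert(self, audio: bytes) -> str:
        prompt = "Convert the following audio data into text."
        return self.engine.transcribe(audio, prompt)


unit = SpeechToTextUnit(StubExternalEngine())
audio_text = unit.convert(b"\x00\x01\x02")
```

Injecting the engine keeps the rest of the transcription processing unchanged whether recognition runs locally or on an external device.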
[0036] The screen text conversion unit 422 converts the screen data acquired by the acquisition unit 410 into screen text, which is text data (conversion process). As mentioned above, the screen data acquired by the acquisition unit 410 includes data from the display screen of document files and image files displayed on the communication terminal 120 as a shared screen. The target of conversion to screen text may be any of the strings, figures, graphs, illustrations, or other images contained in the screen data.
[0037] If the target of conversion is a string, the screen text conversion unit 422 recognizes the string by performing OCR (Optical Character Recognition) processing on the screen data and outputs it as text data. In the OCR processing, processing utilizing deep learning such as CNN or RNN may be performed. CNN is a convolutional neural network. RNN is a recurrent neural network.
[0038] If the target of conversion is an image, the screen text conversion unit 422 generates descriptive text describing the image, for example, using image captioning technology. In image captioning, for example, features of the image are extracted using a CNN, and the descriptive text is generated using an RNN or Transformer model. The screen text conversion unit 422 may also use a generative AI to generate text from an image. For example, models such as CLIP and BLIP can be used.
[0039] The screen text conversion unit 422 may also accept a user's selection of an area within the screen data and convert the screen data to screen text for the selected area. By limiting the area to be converted from screen data to screen text in this way, processing speed can be increased and confidential information can be protected.
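Restricting conversion to a user-selected area amounts to cropping the frame before text conversion runs, so that pixels outside the area are never passed to the recognizer. A minimal sketch follows, assuming a frame represented as rows of pixel values and an area given as `(x, y, width, height)`; the `crop` and `convert_selected_area` helpers and the injected `ocr` callable are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[Sequence[int]]   # assumed representation: rows of pixel values
Area = Tuple[int, int, int, int]  # assumed layout: (x, y, width, height)


def crop(frame: Frame, area: Area) -> List[List[int]]:
    """Cut the selected area out of the frame."""
    x, y, w, h = area
    return [list(row[x:x + w]) for row in frame[y:y + h]]


def convert_selected_area(
    frame: Frame, area: Area, ocr: Callable[[List[List[int]]], str]
) -> str:
    """Run text conversion only on the selected area.

    Pixels outside the area never reach the ocr callable, which both reduces
    the amount of data processed and keeps unselected content out of the result.
    """
    return ocr(crop(frame, area))


frame = [[c + 10 * r for c in range(8)] for r in range(6)]  # 8x6 dummy frame
region = crop(frame, (2, 1, 3, 2))
```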
[0040] The screen text conversion unit 422 may be configured to include a conversion engine for image captioning, an image recognition model, or a generative AI such as an image captioning model within the transcription processing unit 100. Alternatively, a conversion engine 461 or generative AI 462 provided in an external device 460 may be used. In that case, the screen text conversion unit 422 sends a request to the external device 460 or an external service to convert the screen data to be processed into text data, and receives the converted screen text in response. At that time, the screen text conversion unit 422 generates a prompt (instruction) to convert the screen data into screen text and inputs it to the generative AI.
[0041] The audio text converted by the speech-to-text conversion unit 421 and the screen text converted by the screen text conversion unit 422 are output to the generation unit 430.
[0042] The generation unit 430 generates transcription data based on the audio text converted by the speech-to-text conversion unit 421 and the screen text converted by the screen text conversion unit 422 (generation process). In the present embodiment, the generation unit 430 generates transcription data including a character string included in the audio text and a character string included in the screen text. Specifically, the generation unit 430 generates transcription data by replacing a part of the character string included in the audio text with the screen text. Alternatively, the generation unit 430 generates transcription data in which a character string included in the audio text and the screen text that is a replacement candidate for that character string are written together.
[0043] The generation unit 430 includes a replacement candidate extraction unit 431 and a replacement unit 433.
[0044] The replacement candidate extraction unit 431 extracts, from the character strings included in the audio text, the character string to be replaced (replacement target character string). In addition, the replacement candidate extraction unit 431 extracts from the screen text, as replacement candidates, character strings that are candidates for replacing the replacement target character string.
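This passage does not state the criterion by which replacement targets and candidates are extracted. As one conceivable illustration, the sketch below treats any word outside a small list of general words as a possible replacement target and uses spelling similarity from Python's standard `difflib` module to find candidates among the screen terms. The `extract_candidates` function, the general-word list, the 0.6 cutoff, and the use of English spelling similarity are all assumptions made for illustration.

```python
from difflib import get_close_matches
from typing import Dict, List, Set


def extract_candidates(
    audio_text: str,
    screen_terms: List[str],
    general_words: Set[str],
    cutoff: float = 0.6,
) -> Dict[str, List[str]]:
    """Map each non-general word in the audio text to similar screen terms."""
    candidates: Dict[str, List[str]] = {}
    for word in audio_text.split():
        if word.lower() in general_words:
            continue  # general vocabulary is assumed to be recognized correctly
        matches = get_close_matches(word, screen_terms, n=3, cutoff=cutoff)
        matches = [m for m in matches if m != word]  # identical strings need no replacement
        if matches:
            candidates[word] = matches
    return candidates


screen_terms = ["Kubernetes", "roadmap"]
general = {"we", "will", "deploy", "on", "next", "week"}
found = extract_candidates(
    "we will deploy on cubernetties next week", screen_terms, general
)
```

Returning a list per target leaves room for several candidates to be offered for selection rather than committing to a single replacement.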
[0045] The generation unit 430 generates transcription data in which the replacement target character string of the audio text is replaced with one of the replacement candidate character strings (screen text). First, the generation unit 430 determines, from among the replacement candidates, the character string (screen text) that replaces the replacement target character string. For example, the generation unit 430 receives a selection by the user from a plurality of replacement candidates and determines the candidate (screen text) to replace the replacement target character string.
[0046] The generation unit 430 stores, as a correspondence list (correspondence data) 432, the combination of the replacement target character string (audio text) and the character string (screen text) determined from among the replacement candidates to replace it. The correspondence list 432 will be described later.
[0047] The replacement unit 433 replaces the replacement target character string extracted from the audio text with the character string extracted from the screen text, based on the combinations stored in the correspondence list 432. Thereby, transcription data in which a part of the audio text is replaced with the screen text is generated. That is, transcription data including the character string included in the audio text and the character string included in the screen text is generated.
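The replacement step can be sketched as a lookup-driven substitution over the audio text. In the sketch below, the correspondence list is modeled as a plain dictionary from replacement target string to screen text; this representation and the `apply_correspondence` helper are assumptions for illustration, since the actual structure of the correspondence list 432 is described elsewhere in this disclosure.

```python
import re
from typing import Dict


def apply_correspondence(audio_text: str, correspondence: Dict[str, str]) -> str:
    """Replace each replacement target string with its stored screen text."""
    if not correspondence:
        return audio_text
    # Try longer targets first so a shorter target never pre-empts a longer one.
    targets = sorted(correspondence, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in targets))
    return pattern.sub(lambda m: correspondence[m.group(0)], audio_text)


correspondence_list = {"cubernetties": "Kubernetes", "help chart": "Helm chart"}
result = apply_correspondence(
    "deploy the help chart on cubernetties", correspondence_list
)
```

Doing all substitutions in a single pass avoids one replacement accidentally becoming the target of another.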
[0048] The output unit 440 outputs the transcription data generated by the generation unit 430 so that it can be displayed on the communication terminal 120 (output process). In this way, transcription data including the character string included in the audio text and the character string included in the screen text is generated and displayed on the communication terminal 120. Thereby, even when the audio data includes technical terms or special terms that are difficult to transcribe by general speech recognition processing, the corresponding term can be determined from the screen text. Therefore, transcription data can be generated accurately. Note that the output destination of the transcription data is not limited to the communication terminals 120 of the conference attendees. The output unit 440 may transmit the transcription data generated by the generation unit 430 to the communication terminal of a user who is not attending the conference. Also, the output method is not limited to display on an editing screen, and methods such as publication on a website or SNS, file transfer, email transmission, etc. may be used. Further, so that the minutes creator can perform editing later, the output unit 440 may store the transcription data generated by the generation unit 430 in the storage 204 in an editable state.
[0049] The output unit 440 preferably outputs the transcription data so that, for example, the screen text included in the transcription data is highlighted. Because the screen text in the transcription data is highlighted, it becomes easier for the user to recognize that the term was transcribed using the screen text.
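Highlighting could be sketched as splitting the transcription data into segments flagged by whether they came from the screen text, then wrapping the flagged segments in markup. The `split_for_highlight` helper and the choice of the HTML `<mark>` element are assumptions for illustration. Note that this simplified version flags every occurrence of a screen term, including ones that were recognized correctly without replacement.

```python
import re
from html import escape
from typing import List, Tuple


def split_for_highlight(
    transcript: str, screen_terms: List[str]
) -> List[Tuple[str, bool]]:
    """Split the transcript into (text, is_screen_text) segments."""
    if not screen_terms:
        return [(transcript, False)]
    ordered = sorted(screen_terms, key=len, reverse=True)
    # The capturing group makes re.split keep the matched terms in the result.
    pattern = re.compile("(" + "|".join(re.escape(t) for t in ordered) + ")")
    return [(p, p in screen_terms) for p in pattern.split(transcript) if p]


segments = split_for_highlight("deploy on Kubernetes today", ["Kubernetes"])
markup = "".join(
    f"<mark>{escape(text)}</mark>" if hit else escape(text)
    for text, hit in segments
)
```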
[0050] Furthermore, when the output unit 440 outputs the screen text included in the transcription data to the communication terminal 120, it may accept input from the user of the communication terminal 120 regarding whether the text is correct or incorrect. In this case, it is preferable to display a GUI (Graphical User Interface) for inputting correctness or incorrectness on the communication terminal 120. In this way, the user can judge and input whether the screen text included in the transcription data generated by the transcription processing unit 100 is correct or incorrect, and by reflecting the input results, the transcription processing unit can generate more accurate transcription data.
[0051] Next, the UI (User Interface) screen provided in the online meeting system 1 will be described. While an online meeting is in progress, the meeting server 110 provides a meeting screen to the communication terminals 120 of the users attending the online meeting. After the meeting ends, an editing screen is provided for the user to create and edit the meeting transcript data and create meeting minutes. The editing screen is accessed from the home screen (not shown) provided in the online meeting system 1 by operating the editing icon and is displayed on the communication terminal 120. The meeting screen and editing screen will be described below.
[0052] (Meeting screen) Figure 5 shows an example of a meeting screen 500. As shown in Figure 5, the meeting screen 500 is provided with a menu area 510, a shared area 520, an attendee image area 530, and a transcription data area 540.
[0053] The menu area 510 is provided with buttons for instructing the execution of various functions. In the example shown in Figure 5, there are buttons such as a chat button 511, an attendee button 512, a transcription button 513, a camera button 514, a microphone button 515, a share button 516, and an exit button 517. The chat button 511 is used to start the chat function. The attendee button 512 is used to switch the display / hide of attendee images in the meeting, or to change how they are displayed. The camera button 514 and microphone button 515 are used to switch the camera and microphone of the communication terminal 120 on / off, respectively. The share button 516 is used when specifying document files or image files to display in the shared area 520. The exit button 517 is used when leaving an online meeting.
[0054] The transcription button 513 is operated when the function to transcribe the audio during a meeting is used. The transcription button 513 also allows selection of the meeting minutes creation menu and the recording and meeting minutes creation menu, which are displayed in a pull-down list format. The meeting minutes creation menu provides a function that generates transcription data and automatically generates meeting minutes based on the transcription data. The recording and meeting minutes creation menu provides a function that records the audio and screen of the meeting (audio and video) and automatically generates meeting minutes. The generated transcription data and meeting minutes can be edited by the user on the editing screen (Figure 7). The recorded audio and screen can also be played back on the editing screen. The buttons provided on the meeting screen 500 are not limited to the example in Figure 5, and may be added or deleted as appropriate depending on the functions provided by the online meeting system 1.
[0055] The shared area 520 displays the shared screen. When one of the attendees presses the share button 516 and specifies a file to share, the data of the displayed screen of that file is uploaded to the conference server 110 as shared screen data. The shared screen may also include a mouse pointer operated by the user of the communication terminal 120. The conference server 110 sends the uploaded shared screen data to each communication terminal 120 of the online conference attendees and displays it in the shared area 520 of the conference screen 500. In this embodiment, the shared screen data displayed in the shared area 520 is the screen data that is subject to transcription processing, meeting minutes creation processing, and recording processing.
[0056] The attendee image area 530 displays images of attendees captured by their communication terminals 120, as well as their profile pictures. The attendee button 512 may be used to turn the display of attendee images in the attendee image area 530 on or off, or to rearrange the display order.
[0057] The transcription data area 540 displays the transcript data of the meeting audio when the transcription function is turned on by the transcription button 513. This transcript data is generated by the transcription processing unit 100.
[0058] Figure 6 shows the shared screen displayed in the shared area 520 of the meeting screen 500 and the spoken audio in chronological order. The direction from top to bottom in the figure is the direction in which time progresses. Figure 6 shows that screen 610 is displayed first, and screen 620 is displayed later. Also, while screen 610 is displayed, the audio data 611, 612, and 613 spoken by the online meeting attendees are output to the transcription data area 540 in that order. Similarly, while screen 620 is displayed, the audio data 621 and 622 spoken by the online meeting attendees are output to the transcription data area 540 in that order. The audio data 611, 612, 613, 621, and 622 are converted into text data by the speech-to-text conversion unit 421. However, some strings in the transcript data, such as "ni-ni-ni," "bee-bee-bee," "ji-san-nan-maru," and "batsu-batsu phase," are simply transcriptions of the spoken audio and are therefore incomplete as transcript data. For example, if the spoken audio is in Japanese, it may be transcribed using hiragana or some Romanization, which could lead to misinterpretations of the meaning. In such cases, users can use the editing function to edit the transcript data later or use the transcript data to create meeting minutes.
[0059] (Editing Screen) Figure 7 shows an example of the editing screen 700. As shown in Figure 7, the editing screen 700 is provided with a menu area 710, an editing area 720, and a tool button area 730. The menu area 710 is provided with buttons for instructing the execution of various functions in the editing function. In the example in Figure 7, there is a meeting file button 711, an upload button 712, a back button 713, a play / stop button 714, a forward button 715, etc. The meeting file button 711 is operated when selecting meeting data to be edited and downloading it from the meeting server 110. The upload button 712 is operated when uploading data edited on the editing screen 700 to the meeting server 110. The back button 713 and the forward button 715 are operated when going back / forward the playback position of the downloaded meeting data. The play / stop button 714 is operated when switching between playing and stopping the meeting data.
[0060] The editing area 720 displays the meeting data to be edited. For example, it displays screen data, audio data, and transcript data included in the downloaded meeting data. The tool button area 730 displays buttons for executing functions that can be performed on the meeting data to be edited. For example, it displays a transcript button 731, a minutes creation button 732, a memo button 733, a comment button 734, etc.
[0061] The transcript button 731 is operated when acquiring and editing transcript data of the meeting data to be edited. When the transcript button 731 is operated, the CPU 301 of the communication terminal 120 sends a transcript request to the conference server 110. The transcript request includes the target meeting ID and identification information of the requesting communication terminal 120. In response to the transcript request, the CPU 301 of the communication terminal 120 receives the transcript data generated by the transcript processing of the conference server 110. The CPU 301 displays the received transcript data in the editing area 720 and makes it editable. The editing process for transcript data will be described later.
[0062] The minutes creation button 732 is operated when acquiring and editing meeting minutes data for the meeting data to be edited. The minutes creation function summarizes the transcribed data and outputs it as meeting minutes data in a predetermined format. When the minutes creation button 732 is operated, the CPU 301 of the communication terminal 120 sends a minutes creation request to the conference server 110. The minutes creation request includes the target conference ID and identification information of the requesting communication terminal 120. In response to the minutes creation request, the CPU 301 of the communication terminal 120 receives the minutes data generated by the conference server 110. The CPU 301 displays the received minutes data in the editing area 720 and accepts editing by the user.
[0063] The memo button 733 and the comment button 734 are used when adding memos or comments to the meeting minutes data or transcript data to be edited, or when viewing memos or comments that have been added to the meeting minutes data or transcript data.
[0064] (Editing Meeting Data) Figure 8 is a flowchart showing the processes executed by the meeting server 110 during the editing of meeting data. The processes shown in the flowchart of Figure 8 are described in a program stored in the storage 204. The program is called by the CPU 201 of the meeting server 110, loaded into RAM 203, and executed by the CPU 201. After the meeting server 110 is started up, it begins the processes shown in this flowchart. It is assumed that the meeting to be processed has ended, and that the audio data and screen data from the meeting are stored as meeting data in the storage 204 of the meeting server 110. In the following explanation, the symbol "S" represents a step.
[0065] In S801, the CPU 201 of the conference server 110 waits for an instruction to call editing processing from the communication terminal 120. For example, when the editing icon displayed on the home screen (not shown) of the online conference system 1 is operated by a user of the communication terminal 120, an editing processing request is sent to the conference server 110. When the CPU 201 of the conference server 110 receives the editing processing request, it proceeds to S802. Otherwise, it terminates processing.
[0066] In step S802, the CPU 201 transmits the display data of the editing screen 700 to the communication terminal 120 that requested the editing process. Upon receiving the display data of the editing screen, the communication terminal 120 displays the editing screen 700 on the display unit 306.
[0067] In S803, the CPU 201 of the conference server 110 determines whether or not it has received a request to acquire conference data sent from the communication terminal 120. For example, on the editing screen 700 displayed on the communication terminal 120, the user operates the conference file button 711, specifies the conference ID, and sends a request to acquire conference data for that conference to the conference server 110. The request to acquire conference data includes the conference ID and identification information of the requesting communication terminal 120. If the CPU 201 of the conference server 110 receives the request to acquire conference data, it proceeds to S804. Otherwise, it terminates the process.
[0068] In S804, the CPU 201 retrieves the requested meeting data from the storage 204 in response to a meeting data acquisition request. The meeting data includes screen data and audio data shared during the meeting. This screen data and audio data are a series of time-series data from the start to the end of the meeting, and time information is attached to them.
[0069] In step S805, the CPU 201 transmits the conference data acquired in step S804 to the requesting communication terminal 120. Upon receiving the conference data, the communication terminal 120 displays the first frame of the screen data in the editing area 720 of the editing screen 700. On the editing screen 700, the conference data is played or stopped according to the operation of the play / stop button 714, the back button 713, and the forward button 715.
[0070] In S806, the CPU 201 of the conference server 110 determines whether or not it has received a transcription request sent from the communication terminal 120. For example, if the user operates the transcript button 731 on the editing screen 700 displayed on the communication terminal 120 and specifies a conference ID, a transcription request for the conference data corresponding to that conference ID is sent. The transcription request includes the conference ID and identification information of the requesting communication terminal 120. If the CPU 201 of the conference server 110 has received a transcription request, it proceeds to S807. Otherwise, it terminates processing.
[0071] In S807, the CPU 201 performs transcription processing on the meeting data requested by the transcription request. The transcription processing will be described later.
[0072] In S808, the CPU 201 transmits the transcription data obtained as a result of processing in S807 to the requesting communication terminal 120. Upon receiving the transcription data, the communication terminal 120 displays it in the editing area 720 of the editing screen 700. The display of the transcription data will be described later.
[0073] In S809, the CPU 201 of the conference server 110 determines whether or not it has received a meeting minutes creation request sent from the communication terminal 120. When the user of the communication terminal 120 operates the meeting minutes creation button 732 on the editing screen 700 displayed on the communication terminal 120 and specifies a meeting ID, a meeting minutes creation request for the meeting data corresponding to that meeting ID is sent to the conference server 110. The meeting minutes creation request includes the meeting ID and identification information of the requesting communication terminal 120. If the CPU 201 of the conference server 110 has received a meeting minutes creation request, it proceeds to S810. Otherwise, it terminates the process.
[0074] In S810, the CPU 201 executes a meeting minutes creation process for the meeting data requested by the meeting minutes creation request. As a result of the processing, meeting minutes data is obtained. In addition, in S810, the meeting minutes data is created in a way that may also utilize the strings replaced by the transcription process in S807.
[0075] In S811, the CPU 201 sends the meeting minutes data obtained as a result of processing in S810 to the requesting communication terminal 120. Upon receiving the meeting minutes data, the communication terminal 120 displays it in the editing area 720 of the editing screen 700. When the processing in S811 is completed, this flowchart is terminated. Although not shown in the diagram, the meeting minutes creation process in S810 does not necessarily have to be preceded by the transcription request in S806. Also, the processing when the memo button 733 and the comment button 734 are operated is omitted from the flowchart in Figure 8.
[0076] (Transcription Processing) Figure 9 is a flowchart showing the transcription processing performed by the transcription processing unit 100 of the conference server 110. The processing shown in the flowchart of Figure 9 is described in a program stored in the storage 204. The program is called by the CPU 201 of the conference server 110, loaded into RAM 203, and executed by the CPU 201. The CPU 201 functions as each component included in the transcription processing unit 100 by executing the processing described in the program.
[0077] When the conference server 110 receives a transcription request from the communication terminal 120, it starts the process shown in the flowchart in Figure 9. It is assumed that the conference to be processed has ended, and that its conference data is stored in the conference server 110's storage 204, associated with the conference ID.
[0078] In S901, the transcription processing unit 100 of the conference server 110 acquires the shared data that is the subject of the transcription request. In this embodiment, the shared data is conference data. The conference data includes screen data shared during the conference and audio data from the conference. The CPU 201 of the conference server 110 acquires the conference data corresponding to the conference ID included in the transcription request from the storage 204. The acquisition unit 410 sends the audio data included in the conference data to the speech-to-text conversion unit 421 of the conversion unit 420. The acquisition unit 410 also sends the screen data included in the conference data to the screen-to-text conversion unit 422 of the conversion unit 420.
[0079] In S902, the speech-to-text conversion unit 421 acquires the audio data included in the meeting data.
[0080] In step S903, the speech-to-text conversion unit 421 converts the acquired audio data into audio text.
[0081] Figure 10 shows an example of audio text converted from audio data. The audio data in Figure 10 corresponds to the audio data spoken in the meeting shown in Figure 6. In the example shown in Figure 10, assuming the start of the meeting is "00:00:00", the audio data spoken at time "00:00:10" is converted to the audio text "I have a report about Ni-ni-ni". The audio data spoken at time "00:00:20" is converted to the audio text "This part of this picture". The audio data spoken at time "00:00:45" is converted to the audio text "It's B-B-B". The audio data spoken at time "00:01:30" is converted to the audio text "I have a progress confirmation regarding Grandpa Nanamaru". The audio data spoken at time "00:01:45" is converted to the audio text "The work in the B-B-Faces...". In the example in Figure 10, for illustrative purposes, the audio text is written with the pronunciation directly represented. However, in S903, the audio text may be obtained after being converted to appropriate words. "Converted to appropriate words" means that words with the same or similar pronunciations have been converted to words that fit the context. In the case of Japanese, this includes conversion to kanji. The same principle applies to other languages; in S903, the audio text may be obtained after being converted to appropriate words using the characters used in that language. In that case, the incomplete transcription data shown in Figure 6 is represented as shown for audio data 611 to 622.
[0082] In S904, the screen text conversion unit 422 acquires screen data. The screen data includes the text portion and the image portion.
[0083] In S905, the screen text conversion unit 422 converts the acquired screen data into screen text. For the string portion of the screen data, the screen text conversion unit 422 extracts the string portion, for example, by OCR processing and converts it into text data. For the image portion of the screen data, the screen text conversion unit 422 performs image captioning processing on the extracted image and converts it into text data that describes the image.
[0084] Figure 11 shows an example of screen text converted from screen data. Screen 1 and Screen 2 in Figure 11 correspond to screens 610 and 620 shown in Figure 6, respectively. Screen 610 is displayed on meeting screen 500 at the time "00:00:00", and screen 620 is displayed on meeting screen 500 at the time "00:01:30".
[0085] From the text portion of screen 610 (screen 1) in Figure 6, for example, as shown in Figure 11, the following can be obtained as screen text: "About 2,2,2", "Regarding BBD", "PBP", and "General term C...DDD". Also, from the image portion of screen 610 (screen 1), text indicating the image, such as "Bar graph" and "Image of Mt. Fuji", can be obtained as screen text.
[0086] From the text portion of screen 620 (screen 2) in Figure 6, for example, as shown in Figure 11, the following can be obtained as screen text: "Progress check regarding G-370" and "XX phase". Since screen 620 (screen 2) does not contain an image, there is no screen text converted from an image. The replacement candidates shown in Figure 11 will be described later.
[0087] The conversion unit 420 outputs the converted audio text and screen text to the generation unit 430.
[0088] In S906, the generation unit 430 performs a mapping process between the audio text and the screen text to generate the correspondence list 432. The mapping process will be described later.
[0089] In S907, the generation unit 430 stores the correspondence list 432 generated in S906 in the RAM 203. The correspondence list 432 stores combinations of strings contained in the audio text and screen text that correspond to those strings.
[0090] In S908, the generation unit 430 replaces the strings (strings to be replaced) contained in the audio text with the screen text (replacement candidates) that are associated in the correspondence list 432. In this way, the generation unit 430 completes the transcription data in which the strings contained in the audio text have been replaced with the strings contained in the screen text, and terminates the process. The transcription data generated by the generation unit 430 is processed by the output unit 440 to become data that can be displayed on the communication terminal 120, and is transmitted to the communication terminal 120.
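The replacement in S908 can be sketched as follows. This is a minimal Python illustration, not the implementation of the generation unit 430: the `Utterance` type, the function `apply_correspondence`, and the sample entries are hypothetical, and the sketch assumes that no replacement string itself contains another string to be replaced.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    time: str  # elapsed time from the start of the meeting, "HH:MM:SS"
    text: str  # audio text obtained in S903


def apply_correspondence(utterances, correspondence):
    """Replace each string to be replaced with its associated screen text."""
    result = []
    for u in utterances:
        text = u.text
        for spoken, on_screen in correspondence.items():
            # Plain substring replacement; see the stated assumption above.
            text = text.replace(spoken, on_screen)
        result.append(Utterance(u.time, text))
    return result


audio_text = [
    Utterance("00:00:10", "I have a report about Ni-ni-ni"),
    Utterance("00:00:45", "It's B-B-B"),
]
# Illustrative correspondence list entries (string to be replaced -> screen text).
correspondence_list = {"Ni-ni-ni": "2,2,2", "B-B-B": "BBD"}

for u in apply_correspondence(audio_text, correspondence_list):
    print(u.time, u.text)
# 00:00:10 I have a report about 2,2,2
# 00:00:45 It's BBD
```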
[0091] (Mapping process) Next, the mapping process will be described.
[0092] Figure 12 is a flowchart showing the mapping process executed at S906 in Figure 9. The process shown in the flowchart of Figure 12 is described in a program stored in storage 204. The program is called by the CPU 201 of the conference server 110, loaded into RAM 203, and executed by the CPU 201. The CPU 201 functions as a component of the transcription processing unit 100 by executing the process described in the program.
[0093] When the CPU 201 of the conference server 110 completes the processing in S903 and S905 shown in Figure 9 and obtains the voice text and screen text, it starts the mapping process shown in the flowchart in Figure 12 in S906. In the mapping process, the editing screen displayed on the communication terminal 120 is used as the user interface. The conference server 110 receives selections and instructions from the user on the editing screen from the communication terminal 120, performs processing according to those selections and instructions, transmits the processing results to the communication terminal 120, and displays them on the editing screen. When S906 is started, the editing screen is assumed to display screen data from one of the shared data acquired in S901.
[0094] In S1201, the generation unit 430 of the transcription processing unit 100 of the conference server 110 accepts the selection of the screen to be processed. In the editing screen 700, for example, screen 610 (screen 1) in Figure 6 is selected by the user.
[0095] In S1202, the generation unit 430 obtains screen text corresponding to the selected screen from the screen text converted in S905. The generation unit 430 extracts replacement candidates from the obtained screen text using the replacement candidate extraction unit 431. The replacement candidate extraction unit 431 decomposes the screen text into words and extracts characteristic terms. For example, it removes auxiliary words such as particles and auxiliary verbs from the screen text and extracts nouns, verbs, adjectives, adjectival nouns, adverbs, etc. as characteristic terms. Furthermore, strings not registered in the dictionary database used by the conversion engine may also be extracted as characteristic terms. The extraction of replacement candidates may use an extraction model constructed by deep learning or the like, or an algorithm based on predetermined rules.
[0096] As shown in Figure 11, from the screen text "About 2,2,2" converted from screen 610 (screen 1), "2,2,2" is extracted as a replacement candidate. From the screen text "Regarding BBD", "BBD" is extracted as a replacement candidate. From the screen text "PBP", "PBP" is extracted as a replacement candidate. From the screen text "General term C...DDD", "General term C" and "DDD" are extracted as replacement candidates. In addition, from the screen text "Bar graph", "Bar graph" is extracted as a replacement candidate, and from the screen text "Image of Mt. Fuji", "Mt. Fuji" is extracted as a replacement candidate. Note that these extraction results are just examples and are not limited to these examples.
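For reference, a rule-based extraction that reproduces the examples above can be sketched in Python as follows. The word list `LEADING_WORDS` and the function `extract_candidates` are assumptions made for this illustration; an actual replacement candidate extraction unit 431 would more likely rely on morphological analysis or a learned extraction model, as noted in [0095].

```python
import re

# Hypothetical list of leading function words to strip from a phrase.
LEADING_WORDS = ("about", "regarding", "image of")


def extract_candidates(screen_text):
    """Split screen text into phrases and strip leading function words."""
    candidates = []
    # Treat an ellipsis as a separator between independent phrases.
    for phrase in re.split(r"\.{3,}|…", screen_text):
        phrase = phrase.strip()
        lowered = phrase.lower()
        for word in LEADING_WORDS:
            if lowered.startswith(word + " "):
                phrase = phrase[len(word) + 1:].strip()
                break
        if phrase:
            candidates.append(phrase)
    return candidates


for text in ["About 2,2,2", "Regarding BBD", "PBP",
             "General term C...DDD", "Bar graph", "Image of Mt. Fuji"]:
    print(text, "->", extract_candidates(text))
```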
[0097] In S1203, the generation unit 430 obtains the speech text spoken while the screen selected in S1201 is being displayed, from among the speech text converted in S903.
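The selection in S1203 can be sketched using the time information attached to the screen data and audio data. In the following Python illustration, the function names and the tuple layout are hypothetical; the times and texts are those of Figures 10 and 11, and an utterance is assumed to belong to a screen if it was spoken at or after that screen's display time and before the next screen's display time.

```python
def to_seconds(hms):
    """Convert "HH:MM:SS" to elapsed seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s


def utterances_for_screen(screens, utterances, screen_id):
    """Return the audio texts spoken while the given screen was displayed.

    screens:    list of (screen_id, display_start), sorted by display_start
    utterances: list of (time, text)
    """
    starts = [to_seconds(t) for _, t in screens]
    idx = [sid for sid, _ in screens].index(screen_id)
    begin = starts[idx]
    # The last screen stays displayed until the end of the meeting.
    end = starts[idx + 1] if idx + 1 < len(starts) else float("inf")
    return [text for t, text in utterances if begin <= to_seconds(t) < end]


screens = [(610, "00:00:00"), (620, "00:01:30")]
utterances = [
    ("00:00:10", "I have a report about Ni-ni-ni"),
    ("00:00:20", "This part of this picture"),
    ("00:00:45", "It's B-B-B"),
    ("00:01:30", "I have a progress confirmation regarding Grandpa Nanamaru"),
    ("00:01:45", "The work in the B-B-Faces..."),
]
print(utterances_for_screen(screens, utterances, 610))
```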
[0098] In S1204, the generation unit 430 extracts strings to be replaced from the acquired speech text using the replacement candidate extraction unit 431. The replacement candidate extraction unit 431 decomposes the speech text (or transcription data) into words and extracts characteristic terms. For example, it removes unnecessary words such as particles, auxiliary verbs, and filler words (hesitations) such as "um" from the speech text and extracts nouns, verbs, adjectives, adjectival nouns, adverbs, etc. as characteristic terms. Alternatively, it extracts strings from the transcription data that are not registered in the dictionary database used by the conversion engine as strings to be replaced. The extraction of strings to be replaced may use an extraction model constructed by deep learning or the like, or an algorithm based on predetermined rules.
[0099] For example, among the audio texts shown in Figure 10, the audio texts spoken while the screen 610 (screen 1) selected in S1201 was being displayed are "I have a report about Ni-ni-ni," "This part of this picture," and "It's B-B-B." The audio text "I have a report about Ni-ni-ni" contains the strings "Ni-ni-ni," "about," "report," and "is." The replacement candidate extraction unit 431 removes unnecessary words from these strings. In addition, the audio-text conversion unit 421, conversion engine 461, generation AI 462, etc., convert strings that can be converted into words or terms registered in dictionary data using known methods into words or registered terms. As a result, "Ni-ni-ni" is extracted as the string to be replaced.
[0100] The audio text "This part of this picture" contains the strings "this", "picture", "of", "this", "part", "is", and "right". Of these, the first "this" and the second "this" are extracted as strings to be replaced. The audio text "It's B-B-B" contains the strings "B-B-B" and "is", and of these, "B-B-B" is extracted as the string to be replaced. Note that these extraction results are just examples and are not limited to these examples.
[0101] In S1205, the generation unit 430 displays the replacement candidates extracted by the replacement candidate extraction unit 431 on the editing screen so that they can be selected.
[0102] Figure 13 shows an example of an editing screen 1300 in which replacement candidates are displayed in a selectable format. The editing screen 1300 in Figure 13 is provided with an editing area 720, a replacement candidate display area 1301, and an audio text display area 1303. Since screen 610 (screen 1) is selected as the screen to be processed, screen 610 (screen 1) is displayed in the editing area 720. In addition, the audio text display area 1303 displays the audio text corresponding to the audio data spoken while screen 610 (screen 1) was displayed during the online meeting. A scroll bar 1304 may be provided to allow scrolling of the audio text displayed in the audio text display area 1303. In the replacement candidate display area 1301, the replacement candidates extracted from the screen text in S1202 are displayed in a selectable format. In the replacement candidate display area 1301, all extracted replacement candidates may be displayed, or only some of the replacement candidates may be displayed. When displaying only some of the replacement candidates, for example, those whose pronunciation is similar to the audio text may be kept, and those whose pronunciation is not similar may be omitted. The degree of similarity is determined based on the percentage of agreement between the pronunciation of the screen text and the audio text. A higher percentage of agreement indicates a higher degree of similarity.
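The narrowing of replacement candidates by pronunciation similarity can be sketched as follows. The disclosure does not define how the percentage of agreement is computed; this Python illustration substitutes the ratio returned by the standard-library `difflib.SequenceMatcher` as a stand-in. The candidate readings, the 0.5 threshold, and the function names are likewise assumptions introduced only for this sketch.

```python
from difflib import SequenceMatcher


def agreement(a, b):
    """Return a score from 0 to 1 for how closely two pronunciations agree."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def filter_candidates(spoken, candidate_readings, threshold=0.5):
    """Keep candidates whose assumed reading is similar to the spoken string.

    candidate_readings maps a replacement candidate to a hypothetical reading.
    The result is ordered from most similar to least similar.
    """
    scored = [(agreement(spoken, reading), cand)
              for cand, reading in candidate_readings.items()]
    kept = [(score, cand) for score, cand in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in kept]


# Hypothetical readings for candidates extracted from screen 610 (screen 1).
readings = {"2,2,2": "ni-ni-ni", "BBD": "bee-bee-dee", "Mt. Fuji": "fujisan"}
print(filter_candidates("ni-ni-ni", readings))  # ['2,2,2']
```

A phoneme-level comparison would likely be more faithful to pronunciation than this character-level one; the character-level ratio is used here only to keep the sketch self-contained.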
[0103] In S1206, the generation unit 430 accepts the user's selection of a string to be replaced from among the audio text displayed in the audio text display area 1303 of the editing screen 1300. The generation unit 430 also accepts the user's selection of one of the replacement candidates displayed in the replacement candidate display area 1301. If the user makes a selection, the process proceeds to S1207; otherwise, the process proceeds to S1209.
[0104] In the example shown in Figure 13, the audio text display area 1303 displays audio texts 1311 and 1312, and the audio text 1311, "ni-ni-ni," is selected by the user as the string to be replaced using the mouse pointer 1302 or the like. At this time, the generation unit 430 may underline or shade the selected string to be replaced to distinguish it from the unselected string. In addition, "2,2,2" is selected by the user from the replacement candidate display area 1301 using the mouse pointer 1302 or the like as the screen text to replace the selected string to be replaced, "ni-ni-ni."
[0105] In S1207, the replacement unit 433 replaces the selected string to be replaced with the selected replacement candidate. The generation unit 430 associates the selected string to be replaced with the selected replacement candidate and adds them to the correspondence list 432 held in the RAM 203.
[0106] In S1208, the output unit 440 replaces the selected string to be replaced with the selected replacement candidate and displays the result as transcription data in the audio text display area 1303. Thus, from the audio text 1311 in Figure 13, "I have a report about Ni-ni-ni," the transcription data 1321, "I have a report about 2,2,2," is generated. The generated transcription data may be displayed in the audio text display area 1303 in place of the original audio text 1311, or it may be displayed in a different display area. When processing S1206 to S1208 is completed, the process proceeds to S1209.
[0107] In S1209, the generation unit 430 determines whether the replacement process for all strings to be replaced extracted in S1204 has been completed. If any of the strings to be replaced extracted in S1204 has not yet been replaced with screen text, it determines that the replacement process has not been completed for all strings to be replaced and proceeds to S1210.
[0108] In S1210, the generation unit 430 accepts the user's selection of the next string to be replaced and executes the processes in S1206 to S1208 on that string. If all the strings to be replaced extracted in S1204 have been replaced with screen text, or if the user instructs the end of the replacement process, it is determined that the replacement process has been completed and the process proceeds to S1211.
[0109] In S1211, the generation unit 430 determines whether the processing in S1201 to S1210 has been completed for all screen data included in the meeting data. If the processing is not completed for all screen data included in the meeting data, the process proceeds to S1212.
[0110] In S1212, the generation unit 430 accepts the user's selection of the next screen to be processed and repeats the processing in S1202 to S1210 for that screen. In S1211, if it is determined that processing has been completed for all screen data included in the meeting data, the flowchart in Figure 12 (the process of matching audio text with screen text) is terminated. After that, the process proceeds to S907 in Figure 9.
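The loop structure of S1201 to S1212 can be sketched as follows. In this Python illustration, the user's selection on the editing screen is modeled by a callback `choose`, which is an assumption of the sketch; the function `build_correspondence`, the dictionary keys, and the sample selections are also hypothetical.

```python
def build_correspondence(screens, choose):
    """Build the correspondence list by iterating over screens and strings.

    screens: list of dicts with keys "candidates" (replacement candidates
             from the screen text, S1202) and "targets" (strings to be
             replaced from the audio text, S1204).
    choose:  callable standing in for the user's selection; it returns a
             replacement candidate, or None to leave the string unchanged.
    """
    correspondence = {}
    for screen in screens:                  # S1201, S1211, S1212
        for target in screen["targets"]:    # S1206, S1209, S1210
            selected = choose(target, screen["candidates"])
            if selected is not None:        # S1207: record the pair
                correspondence[target] = selected
    return correspondence


# Simulated user selections; a string with no selection is left as is.
user_choices = {"Ni-ni-ni": "2,2,2", "B-B-B": "BBD"}


def simulated_user(target, candidates):
    choice = user_choices.get(target)
    return choice if choice in candidates else None


screens = [
    {"candidates": ["2,2,2", "BBD", "PBP"], "targets": ["Ni-ni-ni", "B-B-B"]},
    {"candidates": ["G-370", "XX phase"], "targets": ["ji-san-nan-maru"]},
]
print(build_correspondence(screens, simulated_user))
# {'Ni-ni-ni': '2,2,2', 'B-B-B': 'BBD'}
```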
[0111] Figure 14 shows an example of the correspondence list 432 saved in S907. As shown in Figure 14, the string to be replaced extracted from the audio text is associated with one of the replacement candidates extracted from the screen text. In the example in Figure 14, the audio text string to be replaced, "ni-ni-ni," is associated with the screen text "2,2,2." The audio text string to be replaced, "bee-bee-bee," is associated with the screen text "BBD." Similarly, for other screens, the audio text and screen text are associated, added sequentially to the correspondence list 432, and saved.
[0112] The combinations of string to be replaced and replacement screen text stored in the correspondence list 432 can also be used when transcribing other audio data in the same series of meeting data acquired by the acquisition unit 410, that is, audio data spoken while another screen was displayed. Furthermore, they can be used not only for the same meeting data, but also for different meeting data of the same type. The transcription process in that case will be described later.
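As an illustration of how the saved pairs could be reused, the following is a minimal sketch that assumes the correspondence list 432 is held as a simple mapping from the string to be replaced to the screen text. The function name `apply_correspondence_list` and the dictionary layout are hypothetical and are not taken from this disclosure.

```python
import re

# Hypothetical in-memory form of the correspondence list 432:
# string to be replaced (from the audio text) -> screen text.
correspondence_list = {
    "ni-ni-ni": "2,2,2",
    "bee-bee-bee": "BBD",
}


def apply_correspondence_list(audio_text: str, mapping: dict) -> str:
    """Replace every listed string in the audio text with its screen text."""
    if not mapping:
        return audio_text
    # Try longer keys first so a long entry wins over a shorter one
    # that happens to be its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: mapping[m.group(0)], audio_text)


print(apply_correspondence_list("ni-ni-ni ni tsuite gohokoku desu", correspondence_list))
```

Because the mapping is independent of any particular screen, the same call could be applied to audio text spoken while another screen was displayed, or to audio text from another meeting of the same type.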
[0113] As described above, in the online meeting system 1 of the first embodiment, the transcription processing unit 100 acquires meeting data shared by multiple communication terminals 120 in an online meeting. The transcription processing unit 100 also converts screen data included in the meeting data into screen text and audio data included in the meeting data into audio text. The transcription processing unit 100 generates transcription data by replacing some or all of the strings in the audio text that are not correctly transcribed by normal speech recognition with strings included in the screen text. This makes it possible to replace technical terms, company terms, product codes, company system names, and other special terms included in the audio data using information included in the screen data and reflect this in the transcription data, thereby improving the accuracy of transcription data generation. Furthermore, in the processing of the first embodiment, no prior preparation such as dictionary registration of special terms is required.
[0114] In the first embodiment, the transcription processing unit 100 provides the user with an editing screen and displays replacement candidates automatically extracted from the screen data for selection. This allows the user to generate transcription data as intended with simple selection operations. Furthermore, when converting audio data to speech-to-text, speech recognition may not be performed correctly due to accents or ambient noise during speech, resulting in incorrect conversions or no conversion at all. In addition, the user may mispronounce terms.
[0115] Figure 15 shows an example of transcribed data obtained by replacing audio data. As shown in Figure 15, suppose the user intended to pronounce "BBD," but the audio-to-text conversion resulted in "BBBB." In this case, by performing the transcription process shown in the first embodiment, the user can create accurate transcribed data by selecting "BBD" from the screen text extracted as replacement candidates while referring to the screen. Alternatively, if there is no suitable screen text extracted as a replacement candidate, the user may be allowed to input a string.
[0116] <Modification 1> Next, as Modification 1 of the first embodiment, another example of displaying the transcribed data will be described.
[0117] Figure 16 shows variations in the display of transcribed data. Figure 16(a) shows an example where the string "BBD" which has been replaced with screen text in the generated transcribed data, such as "BBD desu," is highlighted in bold and underlined. In this way, highlighting the string that has been replaced with screen text allows the user to recognize that the text on the screen is being used in the transcribed data. The method of highlighting is not limited to bold and underlined; any method that can distinguish it from other strings, such as changing the color or using shading, is acceptable.
[0118] Figure 16(b) shows an example of transcription data that displays both the string to be replaced 1601 contained in the audio text and the screen text that is the replacement candidate 1602 for the string to be replaced 1601. If "beeeeeeee" is extracted as the string to be replaced from the audio text "beeeeeeee", the string to be replaced "beeeeeeee" and the replacement candidate "BBD" selected by the user will be displayed side by side. In the example in Figure 16(b), the replacement candidate "BBD" is shown in parentheses and is shown in bold and underlined, but the display method is not limited to this.
[0119] Furthermore, the generation unit 430 may accept user input regarding the correctness of the screen text (replacement candidate 1602) that is replaced or co-presented in the transcription data. In this case, the generation unit 430 displays a GUI 1603 (Graphical User Interface) for inputting the correctness of the screen text (replacement candidate 1602). The GUI 1603 shown in Figure 16(b) has a check mark and an "x" mark. If the screen text (replacement candidate 1602) is correct, the user points to the check mark with the mouse pointer 1604. If the screen text (replacement candidate 1602) is incorrect, the user points to the "x" mark with the mouse pointer 1604. If the screen text co-presented with the string in the transcription data is confirmed as correct, the string may be replaced with the screen text and displayed as shown in Figure 16(a). Furthermore, if the user inputs that the on-screen text accompanying the string is incorrect, the generation unit 430 may accept a re-selection of replacement candidates or a replacement with arbitrary text. This allows the user to generate transcription data accurately as intended.
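The two display variations and the correctness input can be sketched as follows. This is only an illustrative sketch: the HTML-style bold and underline markup, the mode names, and the functions `render_transcription` and `apply_feedback` are assumptions made for the example, and any other highlighting method could be substituted.

```python
def render_transcription(audio_text: str, target: str, screen_text: str, mode: str) -> str:
    """Render one replacement either in place or side by side.

    mode "replace":      the string to be replaced is swapped for the
                         highlighted screen text (as in Figure 16(a)).
    mode "side_by_side": the original string is kept and the highlighted
                         screen text follows in parentheses (Figure 16(b)).
    """
    highlighted = f"<b><u>{screen_text}</u></b>"  # assumed highlight markup
    if mode == "replace":
        body = highlighted
    elif mode == "side_by_side":
        body = f"{target} ({highlighted})"
    else:
        raise ValueError(f"unknown mode: {mode}")
    return audio_text.replace(target, body, 1)


def apply_feedback(audio_text: str, target: str, screen_text: str, is_correct: bool) -> str:
    """Reflect the user's check mark / x mark input in the transcription."""
    if is_correct:
        # Confirmed as correct: show the replaced form.
        return render_transcription(audio_text, target, screen_text, "replace")
    # Marked incorrect: keep the audio text so another candidate can be chosen.
    return audio_text


print(render_transcription("bee-bee-bee desu", "bee-bee-bee", "BBD", "side_by_side"))
```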
[0120] <Modification 2> In the first embodiment, replacement candidates extracted from the screen text were displayed on the editing screen, and the user's selection was accepted to associate the string to be replaced in the audio text with the replacement candidates and perform the replacement. However, the method of associating the string to be replaced in the audio text with the replacement candidates (screen text) is not limited to this. As Modification 2 of the first embodiment, another example of the association will be described.
[0121] Modification 2 describes an example of processing when the screen data displayed (shared) at the time the replacement string contained in the audio text was spoken contains a location indicated by the user. In this case, the generation unit 430 associates the replacement string with the screen text corresponding to a string or image located close to the indicated location and performs the replacement. This process is particularly suitable when the replacement string is a pronoun such as "this," "that," or "here."
[0122] Figure 17 illustrates an example of the correspondence between a location indicated by the user and the replacement string contained in the audio text. The editing screen 1700 shown in Figure 17 includes a menu area 710, an editing area 720, and an audio text display area 1703, similar to the first embodiment. In the editing screen 1700, the same parts as in the editing screen 700 of the first embodiment are denoted by the same reference numerals.
[0123] In the editing area 720, screen 610 (screen 1) is displayed as the screen to be processed. The mouse pointer 1701 displayed on screen 610 during the meeting is also recorded along with screen 610. In the example in Figure 17, the mouse pointer 1701 is pointing to the string "PBP" on screen 610. In addition, the audio text display area 1703 displays the audio text 1711 corresponding to the audio data spoken while screen 610 (screen 1) was displayed during the online meeting. In this example, it indicates that the audio "Regarding this" was spoken.
[0124] The generation unit 430 extracts the string to be replaced from the audio text 1711. In the example in Figure 17, the demonstrative pronoun "this" is extracted as the string to be replaced from the audio text "Regarding this". In this case, the generation unit 430 extracts the string in association with the utterance time information.
[0125] The generation unit 430 determines whether there is a location indicated by the user on the screen 610 displayed (shared) on the communication terminal 120 when the extracted replacement string "this" is spoken. In determining whether there is a location indicated, the generation unit 430, for example, analyzes the screen data in chronological order and detects and tracks feature images that have specific features such as a mouse pointer. The generation unit 430 determines that the string closest to the location of the detected feature image was indicated when "this" was spoken. The generation unit 430 associates the screen text corresponding to the string closest to the location indicated with the replacement string "this" and replaces or displays it alongside it. In the example in Figure 17, the string "PBP" is indicated by the feature image (mouse pointer 1701). Therefore, the generation unit 430 associates the replacement string "this" with the string "PBP". As a result, the transcription data 1721 "Regarding this (PBP)" is generated. This process is executed, for example, in place of steps S1205 to S1208 in Figure 12.
[0126] Note that while Figure 17 shows an example where the replacement string "this" and the screen text "PBP" associated with this replacement string are displayed together, the display method is not limited to this example. As explained in the first embodiment (Figure 15), the replacement string "this" may be displayed in a state where it has been replaced with the screen text "PBP", or the replaced string may be highlighted as shown in Modification 1 (Figure 16). Alternatively, a GUI 1603 for inputting correctness for the screen text "PBP" associated with the replacement string may be displayed, and the user's input of correctness may be accepted and reflected in the transcription data.
[0127] If the location indicated by the user is not present on screen 610 at the time the replacement string "this" is spoken, the system will associate the string selected by the user from among multiple replacement candidates with the replacement string "this" and perform the replacement, similar to the first embodiment.
[0128] As explained above, if the screen data displayed at the time the string to be replaced in the audio text was spoken contains a location indicated by the user, the generation unit 430 replaces the string to be replaced with the screen text closest to the indicated location. This eliminates the need for the user to select replacement candidates. Without any user intervention, the strings in the audio text can be associated with the screen text indicated at the time of speaking and reflected in the transcription data.
[0129] The above example shows a case where the indicated location at the time the audio text (replacement string) "this" is uttered is a text on the screen (PBP). However, the indicated location is not limited to text; it may also be an image on the screen. For example, suppose the indicated location pointed to by the mouse pointer 1701 at the time the audio text (replacement string) "this" is uttered is graph 1704 or image 1705 on screen 610. In this case, the audio text "this" is replaced with the screen text corresponding to those images. Specifically, if graph 1704 is the indicated location, the screen text corresponding to graph 1704 is "bar graph," as shown in Figure 11. Therefore, the audio text "Regarding this" generates transcription data such as "Regarding this (bar graph)." Also, for example, if image 1705 is the indicated location, the screen text corresponding to image 1705 is "Mt. Fuji," as shown in Figure 11, so the audio text "Regarding this" generates transcription data such as "Regarding this (Mt. Fuji)."
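The selection of the screen text closest to the indicated location can be sketched as a nearest-bounding-box search. The sketch below assumes that each string or image on the screen has already been converted into screen text together with a bounding box, and that the pointer position at the time of utterance is known. The class `ScreenElement`, the coordinates, and the function names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class ScreenElement:
    text: str      # screen text (OCR result or image caption)
    left: float
    top: float
    right: float
    bottom: float


def distance_to_element(x: float, y: float, el: ScreenElement) -> float:
    """Distance from a point to a bounding box (0 if the point is inside)."""
    dx = max(el.left - x, 0.0, x - el.right)
    dy = max(el.top - y, 0.0, y - el.bottom)
    return math.hypot(dx, dy)


def nearest_screen_text(x: float, y: float, elements: list):
    """Return the screen text of the element closest to the pointer."""
    if not elements:
        return None
    return min(elements, key=lambda el: distance_to_element(x, y, el)).text


def annotate(audio_text: str, target: str, screen_text: str) -> str:
    """Display the screen text alongside the string to be replaced."""
    return audio_text.replace(target, f"{target} ({screen_text})", 1)


# Assumed layout of screen 610; the coordinates are made up for the example.
elements = [
    ScreenElement("PBP", 100, 200, 160, 220),
    ScreenElement("bar graph", 300, 100, 500, 300),
    ScreenElement("Mt. Fuji", 300, 350, 500, 500),
]
pointed = nearest_screen_text(150, 225, elements)
print(annotate("Regarding this", "this", pointed))
```

Treating a pointer inside a bounding box as distance zero means the same function covers both the text case and the graph or image case.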
[0130] <Modification 3> In Modification 2, when a specific location in the screen data displayed on the communication terminal 120 is indicated at the time the replacement string contained in the voice text is spoken, the voice text is replaced with the screen text corresponding to the string or image closest to that indicated location. However, the indicated location may also be specified as a range. Modification 3 describes the case where, at the time the replacement string contained in the voice text is spoken, there is an indicated area in the screen data displayed on the communication terminal 120 that has been indicated by the user. In this case, the generation unit 430 replaces the replacement string contained in the voice text with the screen text corresponding to the string or image contained in the indicated area.
[0131] Figure 18 illustrates an example of the correspondence between the instruction area indicated by the user and the replacement string contained in the voice text. The editing screen 1800 shown in Figure 18 is provided with a menu area 710, an editing area 720, and a voice text display area 1803, similar to the first embodiment. In the editing screen 1800, the same parts as in the editing screen 700 of the first embodiment are denoted by the same reference numerals.
[0132] The editing area 720 displays screen 610 (screen 1) as the screen to be processed. In the example in Figure 18, the mouse pointer 1801 indicates a range having a certain extent (hereinafter referred to as the instruction area 1802). The voice text display area 1803 displays the voice text corresponding to the audio data spoken during the online meeting while screen 610 (screen 1) was displayed. In this example, it indicates that the speech "Regarding this" was spoken.
[0133] The generation unit 430 extracts the string to be replaced from the audio text. In doing so, it extracts the string in association with the utterance time information. In the example in Figure 18, the pronoun "this" is extracted as the string to be replaced from the audio text "Regarding this".
[0134] The generation unit 430 determines whether there is a user-indicated area on the screen 610 displayed on the communication terminal 120 at the time the extracted replacement string "this" is spoken. In determining whether there is a user-indicated area, the generation unit 430, for example, performs image analysis of the screen data in chronological order and tracks images that have specific features such as a mouse pointer. If the tracked trajectory has an extent in length, such as an arrow or an underline, or an extent in area, such as a circle, ellipse, or rectangle, that range is identified as a user-indicated area. In the example in Figure 18, it is assumed that a rectangular user-indicated area is pointed to by the mouse pointer 1801 on the screen 610 displayed on the communication terminal 120 at the time the replacement string "this" is spoken. If there is a user-indicated area, the generation unit 430 extracts screen text corresponding to the strings or images contained in the user-indicated area as replacement candidates. In the example shown in Figure 18, the screen text "2,2,2", "BBD", and "PBP", corresponding to the strings "About 2,2,2", "Regarding BBD", and "PBP" contained in the instruction area 1802, are extracted as replacement candidates. The extracted replacement candidates are displayed as selectable options in the replacement candidate display area 1804 of the editing screen 1800. This process is performed, for example, at S1205 in the flowchart shown in Figure 12.
[0135] Subsequently, the process of accepting a selection from the displayed replacement candidates and replacing the string to be replaced with the selected screen text is the same as in S1206 to S1208. Furthermore, the transcription data may be displayed as in Figure 15 of the first embodiment, where the string to be replaced is replaced with the selected screen text, or as shown in Modification 1 (Figure 16), where the replaced string is highlighted. Additionally, a GUI 1603 for inputting whether the screen text "PBP" associated with the string to be replaced is correct may be displayed, and the user's input may be accepted and reflected in the transcription data.
[0136] If, at the time the replacement string "this" is spoken, there is no designated area on the screen 610 displayed on the communication terminal 120 that is specified by the user, the generation unit 430 displays multiple replacement candidates extracted from the screen text, similar to the first embodiment. Then, it associates the replacement candidate selected by the user with the replacement string and performs the replacement.
[0137] As explained above, if the screen data displayed at the time the string to be replaced in the audio text was spoken contains an instruction area indicated by the user, the generation unit 430 extracts replacement candidates from the strings or images contained in the instruction area. This narrows down the replacement candidates to screen text corresponding to the strings or images in the instruction area, making it easier for the user to make a selection.
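Narrowing the replacement candidates to those inside the instruction area amounts to a rectangle-overlap test. The following sketch assumes a rectangular instruction area and per-element bounding boxes. The `Box` type, the coordinates, and `candidates_in_area` are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two rectangles share any area."""
    return (a.left < b.right and b.left < a.right
            and a.top < b.bottom and b.top < a.bottom)


def candidates_in_area(area: Box, elements: list) -> list:
    """Keep only the screen text whose bounding box touches the area."""
    return [text for text, box in elements if overlaps(area, box)]


# Assumed layout of screen 610; the coordinates are made up for the example.
elements = [
    ("2,2,2", Box(40, 60, 120, 80)),
    ("BBD", Box(40, 90, 120, 110)),
    ("PBP", Box(40, 120, 120, 140)),
    ("DDD", Box(40, 200, 120, 220)),
    ("bar graph", Box(300, 100, 500, 300)),
]
instruction_area = Box(30, 50, 200, 150)
print(candidates_in_area(instruction_area, elements))
```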
[0138] In the example above, the determination was made whether or not the instruction area indicated by the user was present on the screen 610 displayed on the communication terminal 120 at the time the replacement string "this" was spoken. However, the system is not limited to this. For example, instead of limiting the determination to the time when the replacement string is spoken, the system may determine whether or not the instruction area indicated by the user was present on the screen 610 displayed on the communication terminal 120 within a predetermined time range from the time of utterance. The length of the predetermined time is arbitrary, but for example, if the replacement string is "this" or "here," the time range may be about 1 to 2 seconds. If the replacement string is "just now," "the previous one," or "that one," the time range may be longer, such as 2 seconds or more but less than 10 seconds.
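The time-range variation can be sketched as a per-word lookup of how far around the utterance time to search for pointer activity. The concrete window values below (2 seconds for "this" and "here", 8 seconds for "just now" and similar expressions) are example values chosen from within the ranges mentioned above. Searching only backward in time for expressions such as "just now" is an additional assumption of this sketch, as are the names `PointerEvent`, `WINDOWS`, and `events_near_utterance`.

```python
from typing import NamedTuple


class PointerEvent(NamedTuple):
    time: float  # seconds from the start of the meeting
    x: float
    y: float


# (seconds before the utterance, seconds after the utterance); assumed values.
WINDOWS = {
    "this": (2.0, 2.0),
    "here": (2.0, 2.0),
    "just now": (8.0, 0.0),
    "the previous one": (8.0, 0.0),
    "that one": (8.0, 0.0),
}
DEFAULT_WINDOW = (0.0, 0.0)  # only the exact utterance time


def events_near_utterance(target: str, utterance_time: float, events: list) -> list:
    """Return pointer events that fall inside the window for this word."""
    before, after = WINDOWS.get(target, DEFAULT_WINDOW)
    start, end = utterance_time - before, utterance_time + after
    return [e for e in events if start <= e.time <= end]


events = [
    PointerEvent(83.0, 10, 10),
    PointerEvent(89.0, 20, 20),
    PointerEvent(91.5, 30, 30),
]
print(events_near_utterance("this", 90.0, events))
```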
[0139] <Modification 4> In the first embodiment and its modified form, if multiple replacement candidates are extracted from the screen text, the generation unit 430 may display the replacement candidates in order of similarity to the replacement string in the audio text.
[0140] Figure 19 shows an example of an editing screen 1900 in which replacement candidates are displayed for selection. The editing screen 1900 in Figure 19 is provided with an editing area 720, a replacement candidate display area 1904, and a voice-to-text display area 1903. Screen 610 (Screen 1) is selected as the screen to be processed and is displayed in the editing area 720. The voice-to-text display area 1903 displays the voice-to-text corresponding to the voice data spoken while Screen 610 (Screen 1) is displayed during an online meeting. In the example in Figure 19, the voice-to-text 1911 "Bee bee bee" is displayed. It is assumed that "Bee bee bee" from this voice-to-text 1911 is extracted as the string to be replaced (S1204 in Figure 12). The screen text and replacement candidates converted from Screen 610 (Screen 1) are "2,2,2", "BBD", "PBP", "General term C", "DDD", "Bar graph", and "Mt. Fuji", as shown in Figure 11.
[0141] In Modification 4, the generation unit 430 selects replacement candidates to display in the replacement candidate display area 1904 based on predetermined criteria. In this Modification 4, the replacement candidates are narrowed down using, as the criterion, the similarity between the string to be replaced contained in the audio text and the screen text that would replace it. For example, the generation unit 430 excludes replacement candidates whose similarity is less than a predetermined value. The narrowed-down replacement candidates are then displayed in the replacement candidate display area 1904 in order of decreasing similarity, making them selectable. The similarity is determined, for example, based on the percentage of matching between the pronunciation of the screen text and the audio text. The method for determining the similarity is arbitrary; a similarity determination model constructed using a deep learning method may be used, or the similarity may be determined by a predetermined algorithm. In the example in Figure 19, for the string to be replaced, "BBD", "PBP", and "DDD" are extracted as screen texts with high similarity and displayed in that order. This process is performed, for example, at S1205 in the flowchart shown in Figure 12.
[0142] The subsequent process, which accepts a selection from the displayed replacement candidates and replaces the string to be replaced with the selected screen text, is the same as in S1206-S1208.
[0143] When the user selects the screen text to be replaced from the replacement candidate display area 1904 using the mouse pointer 1901, the replacement string "bee bee bee" in the audio text 1911 is replaced with the selected screen text "BBD". As a result, the audio text 1911 "It's bee bee bee" is replaced with the transcribed data 1921 "It's BBD".
[0144] Furthermore, if the replacement candidates are narrowed down to one as a result of the similarity-based filtering, the generation unit 430 may directly replace the replacement string with that replacement candidate without displaying it in the replacement candidate display area 1904. Alternatively, the generation unit 430 may determine the screen text with the highest similarity among the replacement candidates and replace the replacement string with that screen text. In this case, the user's selection process for replacement candidates can be omitted. In other words, it becomes possible to automatically associate the audio text with the screen text included in the screen data and reflect it in the transcription data without any user intervention.
[0145] Furthermore, the transcription data may be displayed as in Figure 15 of the first embodiment, where the string to be replaced is replaced with the selected screen text, or as shown in Modification 1 (Figure 16), where the replaced string is highlighted. In addition, a GUI 1603 for inputting whether the screen text "BBD" associated with the string to be replaced is correct may be displayed, and the user's input may be accepted and reflected in the transcription data.
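The similarity-based narrowing, ordering, and automatic replacement can be sketched with a simple string-matching ratio. This sketch uses Python's `difflib.SequenceMatcher` as a stand-in for the similarity determination, which is only one possible predetermined algorithm and is not stated in this disclosure. The readings assigned to each screen text, the threshold of 0.6, and the function names are assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

# Screen text -> assumed reading (pronunciation) used for comparison.
SCREEN_READINGS = {
    "2,2,2": "ninini",
    "BBD": "beebeedee",
    "PBP": "peebeepee",
    "General term C": "general term c",
    "DDD": "deedeedee",
    "Bar graph": "bar graph",
    "Mt. Fuji": "mt fuji",
}


def normalize(text: str) -> str:
    """Lower-case and drop spaces and punctuation before comparing."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def rank_candidates(target: str, readings: dict, threshold: float = 0.6) -> list:
    """Drop candidates below the threshold and sort the rest by similarity."""
    scored = [(similarity(target, reading), text) for text, reading in readings.items()]
    kept = [(score, text) for score, text in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]


def auto_replace(audio_text: str, target: str, readings: dict, threshold: float = 0.6) -> str:
    """Replace the target with the most similar candidate, if any remains."""
    ranked = rank_candidates(target, readings, threshold)
    if not ranked:
        return audio_text
    return audio_text.replace(target, ranked[0], 1)


print(rank_candidates("bee bee bee", SCREEN_READINGS))
print(auto_replace("It's bee bee bee", "bee bee bee", SCREEN_READINGS))
```

With these assumed readings, the ranked list can be shown to the user for selection, or the first entry can be applied directly when user selection is to be omitted.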
[0146] <Second Embodiment> In the first embodiment, an example was shown in which a series of meeting data (screen data and audio data) recorded during the meeting was acquired after the meeting had ended, and transcription data was generated and edited. However, the generation of transcription data may be performed in real time during the meeting. As a second embodiment, a transcription process performed in real time during the meeting will be described.
[0147] In the second embodiment, the configuration of the online meeting system 1, the hardware configuration and functional configuration of the communication terminal 120 and the meeting server 110 are the same as in the first embodiment. Hereinafter, parts similar to those in the first embodiment will be denoted by the same reference numerals as in the first embodiment, and the differences from the first embodiment will be described in detail.
[0148] (Flow of transcription processing during a meeting) Figure 20 is a flowchart showing the flow of transcription processing during a meeting performed by the conference server 110. The processes shown in the flowchart of Figure 20 are written in a program stored in storage 204. The program is called by the CPU 201 of the conference server 110, loaded into RAM 203, and executed by the CPU 201. After the conference server 110 is started up, it begins the processes shown in this flowchart.
[0149] In S2001, if the CPU 201 of the conference server 110 receives an access request from the communication terminal 120 to the online conference room, it proceeds to S2002 and starts conference processing. If conference processing does not start, this flowchart is terminated.
[0150] In S2002, the conference server 110 provides the conference screen 500 shown in Figure 5 to the communication terminals 120 of the online conference participants. During conference processing, the conference server 110 receives image data, shared screen data, and audio data transmitted from multiple communication terminals 120 participating in the online conference. Based on the received data, the conference server 110 generates data to be displayed on the conference screen 500 and conference audio data, and transmits them in real time to the communication terminals 120 of the conference participants. This allows the display data of the conference screen 500 and conference audio data to be shared among the multiple communication terminals 120.
[0151] On the conference screen 500 displayed on the communication terminal 120, when the user turns on the transcription button 513 and selects either the minutes creation menu or the recording and minutes creation menu, a transcription request is sent to the conference server 110.
[0152] In S2003, if the conference server 110 receives a transcription request from the communication terminal 120, it proceeds to S2004 and starts the transcription process. If no transcription request is received, it proceeds to S2006.
[0153] In the transcription process of S2004, the conference server 110 performs the transcription process in the same manner as shown in the flowchart of Figure 9. The shared data acquired in S901 is screen data and audio data shared among the communication terminals 120 of users attending the conference in progress. The screen data is uploaded from one communication terminal 120 to the conference server 110, processed by the conference server 110, and then transmitted and shared to the communication terminals 120 of multiple users attending the conference. The audio data is uploaded from each of the multiple communication terminals 120 attending the conference to the conference server 110, processed by the conference server 110, and then transmitted and shared to the communication terminals 120 of multiple users attending the conference.
[0154] The processing in S902 to S905 is the same as in the first embodiment. That is, in S902, the speech-to-text conversion unit 421 acquires audio data during the meeting. In S903, the speech-to-text conversion unit 421 converts the acquired audio data into speech-to-text. In S904, the screen-to-text conversion unit 422 acquires screen data. The screen data includes text and image portions. In S905, the screen-to-text conversion unit 422 converts the acquired screen data into screen text. The text portion is, for example, extracted using OCR processing and converted into text data. For the image portion, image captioning processing is performed on the extracted image, which is converted into text data that describes the image. If the screen is switched during the meeting, screen data is acquired each time the screen is switched (S904) and converted into screen text (S905). The speech-to-text and the screen text are stored in association with the time information of when they were spoken and when they were displayed, respectively.
[0155] In S906, the generation unit 430 performs a mapping process between the audio text and the screen text to generate the correspondence list 432. The mapping process in the second embodiment will be described later.
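Storing both kinds of text with time information makes it possible to look up which screen text was on display when a given utterance was made. The following sketch assumes that each screen is recorded with the time at which it started to be displayed, in ascending order. The timestamps, the list layout, and the function `screen_text_at` are hypothetical.

```python
from bisect import bisect_right

# Assumed records: display start time (seconds) of each screen, ascending,
# and the screen text converted from that screen.
screen_start_times = [0.0, 60.0]
screen_texts = [
    ["2,2,2", "BBD", "PBP", "General term C", "DDD", "Bar graph", "Mt. Fuji"],
    ["G-370", "shinchoku kakunin", "xx phase"],
]


def screen_text_at(utterance_time: float, start_times: list, texts: list):
    """Return the screen text shown at the given utterance time."""
    # Index of the last screen whose display started at or before the time.
    index = bisect_right(start_times, utterance_time) - 1
    if index < 0:
        return None  # spoken before any screen was shared
    return texts[index]


# An utterance at 00:01:30 (90 seconds) falls on the second screen here.
print(screen_text_at(90.0, screen_start_times, screen_texts))
```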
[0156] In S907, the generation unit 430 stores the correspondence list 432 generated in S906 in the RAM 203.
[0157] In S908, the generation unit 430 replaces the strings included in the audio text that are included in the correspondence list 432 with the corresponding screen text. In this way, the audio data included in the meeting data is replaced with screen text, the transcription data is completed, and the transcription process is terminated. Once the transcription process is complete, the process proceeds to S2005 in Figure 20.
[0158] In S2005, the output unit 440 processes the generated transcription data so that it can be displayed on the communication terminal 120, and transmits it to the communication terminal 120.
[0159] Subsequently, in S2006, the CPU 201 of the conference server 110 determines whether the conference has ended. If not all users have left the conference room, the CPU 201 determines that the conference has not ended, returns to S2002, and continues the conference processing, including the transcription process. In S2003, if the transcription function is switched off, the process proceeds to S2006. In S2006, if all users have left the conference room, the CPU 201 determines that the conference has ended and terminates the flowchart.
[0160] (Mapping process in the second embodiment) Next, the mapping process in the second embodiment (S906 in Figure 9), which is performed in the transcription process of S2004, will be described.
[0161] Figure 21 is a flowchart showing the process executed by the transcription processing unit 100 of the conference server 110 in the mapping process in the second embodiment. The process shown in the flowchart of Figure 21 is described in a program stored in the storage 204. The program is called by the CPU 201 of the conference server 110, loaded into the RAM 203, and executed by the CPU 201. That is, the CPU 201 functions as the transcription processing unit 100 and a component included in the transcription processing unit 100 by executing the process described later in the program.
[0162] When the conference server 110 completes the processes S903 and S905 of the transcription process (Figure 9) and obtains the audio text and screen text, it starts the mapping process shown in the flowchart of Figure 21. In the mapping process in the second embodiment, the conference screen 500 displayed on the communication terminal 120 is used as the user interface. That is, the conference server 110 receives selections and instructions from the user on the conference screen 500 from the communication terminal 120, performs processing according to those selections and instructions, transmits the processing results to the communication terminal 120, and reflects them on the conference screen 500. When the mapping process in Figure 21 is started, the data of the shared screen being shared at that time is displayed as screen data on the conference screen 500. In addition, in the transcription data area 540 of the conference screen 500, if the transcription function is set to ON by the transcription button 513, the transcription data obtained by converting the audio data during the conference into text (transcription) is displayed. This transcription data is generated by the transcription processing unit 100.
[0163] In S2101, the generation unit 430 obtains screen text converted from screen data, and the replacement candidate extraction unit 431 extracts replacement candidates from the screen text. This process is the same as the process in S1202 of the first embodiment. As shown in Figure 22, "2,2,2", "BBD", "PBP", "General term C", "DDD", "Bar graph", "Mt. Fuji", etc. are extracted as replacement candidates from the screen text converted from screen 610 (screen 1). Note that these extraction results are just examples and are not limited to these examples.
[0164] In S2102, the generation unit 430 generates reading candidates for each of the substitution candidates extracted in S2101. A reading candidate is a candidate for how to read the substitution candidate.
[0165] Figure 22 shows examples of possible readings for screen text. As shown in Figure 22, for example, the replacement candidate "2,2,2" extracted from screen 1 is assumed to have the following possible readings: "ninini", "ninoninoni", and "tripleni". Also, the replacement candidate "BBD" is assumed to have the following possible readings: "beebeedee" and "doublebedee". Also, the replacement candidate "PBP" is assumed to have the following possible readings: "peebeepee". Also, the replacement candidate "DDD" is assumed to have the following possible readings: "deedeedee", "tripledee", and "deedee".
[0166] Furthermore, the following reading candidates are generated for the replacement candidate "G-370" extracted from screen 2: "jiisan nanazero", "jiisan byaku nanajuu", and "jiisan nanamaru". Reading candidates are likewise generated for the replacement candidate "shinchoku kakunin". For the replacement candidate "xx phase", the reading candidates "batsu batsu fezu" and "daburu batsu fezu" are generated.
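As a non-limiting illustration, the generation of reading candidates in S2102 could be sketched as follows. The reading table `READINGS`, the prefix table `PREFIX`, and the function `reading_candidates` are hypothetical names introduced only for this sketch; the disclosure does not specify how readings are derived, and a practical system would presumably rely on a pronunciation lexicon or a trained model.

```python
# Hypothetical per-symbol readings (romanized). Illustrative only.
READINGS = {"1": "ichi", "2": "ni", "B": "bee", "D": "dee", "P": "pee"}
# Hypothetical prefixes for a run of identical leading symbols.
PREFIX = {2: "double", 3: "triple"}


def reading_candidates(candidate: str) -> list[str]:
    """Return possible readings for a replacement candidate such as "2,2,2"."""
    # Keep only the symbols that have an entry in the reading table.
    symbols = [ch for ch in candidate.upper() if ch in READINGS]
    if not symbols:
        return []
    units = [READINGS[s] for s in symbols]

    # Reading 1: read every symbol in order ("ni" + "ni" + "ni").
    results = ["".join(units)]

    # Reading 2: for comma-separated items, join the readings with "no".
    if "," in candidate:
        results.append("no".join(units))

    # Reading 3: collapse a leading run of identical symbols ("triple" + "ni").
    run = 1
    while run < len(symbols) and symbols[run] == symbols[0]:
        run += 1
    if run in PREFIX:
        results.append(PREFIX[run] + units[0] + "".join(units[run:]))
    return results


print(reading_candidates("2,2,2"))  # ['ninini', 'ninoninoni', 'tripleni']
```

The rules in this sketch are deliberately small; readings listed in Figure 22 that they do not cover would need additional rules or a lexicon.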
[0167] In S2103, the generation unit 430 acquires the audio text converted in S903, and the replacement candidate extraction unit 431 extracts the strings to be replaced from the acquired audio text. The method for extracting the strings to be replaced is the same as in the first embodiment. The replacement candidate extraction unit 431 decomposes the audio text into words, removes unnecessary words such as particles, auxiliary verbs, and filler words (hesitations), and extracts verbs, adjectives, adjectival nouns, nouns, adverbs, etc. as the strings to be replaced. Alternatively, strings not registered in the dictionary database used by the conversion engine may be extracted as the strings to be replaced from the transcription data. The extraction of the strings to be replaced may use an extraction model constructed by deep learning or the like, or an algorithm based on predetermined rules.
[0168] For example, suppose that at the time "00:01:30", the audio text "Regarding jiisan nanamaru, we need to confirm the progress" is obtained. From this audio text, "jiisan nanamaru" is extracted as the string to be replaced. Note that this extraction result is just an example, and the extraction is not limited to it. Subsequent processing is carried out in the order of utterance.
[0169] In S2104, the generation unit 430 determines whether any of the reading candidates generated in S2102 is similar to the string to be replaced that is being processed. If there is a reading candidate similar to the string to be replaced, the process proceeds to S2105; otherwise, it proceeds to S2106. In determining similarity, the generation unit 430 calculates a degree of similarity based, for example, on the proportion of the reading of the screen text that matches the audio text, and determines from that value whether the two items being compared are similar. The method for determining the degree of similarity is arbitrary; a similarity determination model constructed using a deep learning method may be used, or the degree of similarity may be determined by a predetermined algorithm.
[0170] In S2105, the generation unit 430 uses the replacement unit 433 to replace the string to be replaced with the replacement candidate whose reading candidate is similar to that string. For example, in the example in Figure 22, the reading candidate most similar to the string to be replaced (audio text) "jiisan nanamaru" is "jiisan nanamaru", and the replacement candidate corresponding to that reading candidate is "G-370". In this case, the replacement unit 433 replaces the string to be replaced "jiisan nanamaru" with "G-370".
[0171] In S2106, the generation unit 430 transcribes the string to be replaced using a normal transcription process. This transcription can be done using a normal transcription generation AI or conversion engine. For example, suppose that "kanshite" ("regarding") is extracted as the string to be replaced from the audio text "Regarding the progress of jiisan nanamaru". In this case, it is determined in S2104 that there are no similar reading candidates, and in S2106 the string is converted to "kanshite" using a normal transcription process.
[0172] The generation unit 430 adds the combination of the string to be replaced and the replacement candidate, which was replaced in S2105, to the correspondence list stored in RAM 203. After processing in S2105 or S2106, the process proceeds to S2107.
[0173] In S2107, the output unit 440 displays the audio text processed in S2105 or S2106 as transcription data in the transcription data area 540 of the conference screen 500. After that, the flowchart in Figure 21 is terminated. When the transcription processing unit 100 acquires new audio data, it executes the flowchart in Figure 21. If the screen data has not been updated when new audio data is acquired, the processes in S2101 to S2102 (extraction of replacement candidates from screen text, generation of reading candidates) are skipped. In that case, the processes in S2104 to S2105 can be executed using the already generated reading candidates.
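As a non-limiting illustration, skipping S2101 to S2102 when the screen data has not been updated could be sketched as a small cache. How an update is detected is not specified in the disclosure; comparing a SHA-256 digest of the screen data is an assumption of this sketch, as are the class name `ReadingCache` and the callback `build_readings`.

```python
import hashlib
from typing import Callable


class ReadingCache:
    """Reuse reading candidates while the shared screen stays unchanged."""

    def __init__(self, build_readings: Callable[[bytes], dict[str, list[str]]]):
        # build_readings stands in for S2101 to S2102 (extraction + readings).
        self._build = build_readings
        self._digest: str | None = None
        self._readings: dict[str, list[str]] = {}
        self.rebuild_count = 0  # how many times S2101 to S2102 actually ran

    def get(self, screen_data: bytes) -> dict[str, list[str]]:
        digest = hashlib.sha256(screen_data).hexdigest()
        if digest != self._digest:
            # Screen was updated: regenerate candidates and remember the digest.
            self._readings = self._build(screen_data)
            self._digest = digest
            self.rebuild_count += 1
        # Otherwise the previously generated reading candidates are returned.
        return self._readings
```

With such a cache, each newly acquired piece of audio data calls `get` with the current screen data and proceeds directly to S2103 onward.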
[0174] Figure 23 shows an example of how transcription data is displayed on the conference screen 2300. In Figure 23, parts similar to those in the conference screen 500 shown in Figure 5 are denoted by the same reference numerals as in Figure 5. The text 2301 displayed in the transcription data area 540 is the transcription data obtained when a normal transcription process is performed. In this case, the string to be replaced, "jiisan nanamaru," remains unreplaced, or, even if it is converted, the screen text is not reflected, yielding, for example, "G37maru." On the other hand, when the transcription process of this embodiment is performed, the screen text is reflected and the string is accurately transcribed as "G-370." As a result, transcription data 2302 including the screen text is generated and displayed, such as "Regarding the progress of G-370,..." This transcription data 2302 is displayed, for example, in place of the text 2301 that was displayed in the transcription data area 540.
[0175] As described above, according to the second embodiment, in the transcription process performed in real time during a meeting, the screen data included in the meeting data is converted to screen text, and the audio data included in the meeting data is converted to audio text, similar to the first embodiment. At that time, the strings included in the audio text are replaced with those in the screen text. As a result, the information included in the screen data is reflected in the transcription data, and the accuracy of the transcription data generation can be improved.
[0176] In the second embodiment, the transcription processing unit 100 generates reading candidates for the screen text converted and extracted from the meeting data, and replaces the strings to be replaced in the audio text with the screen text corresponding to the reading candidates that are similar to the audio text. Therefore, it is possible to reflect the screen text in the transcription data without any operation by the user.
[0177] Furthermore, the transcription process in the second embodiment can also be modified according to the various modifications of the first embodiment. Specifically, the display of the transcription data is not limited to the example in Figure 23, and as shown in Figure 16(a), the string that has been replaced with screen text may be highlighted. Also, as shown in Figure 16(b), transcription data may be generated and displayed that shows both the string to be replaced in the audio text and the screen text that replaces it. In addition, the system may accept input from the user regarding the correctness of the replaced or combined screen text. In this case, the transcription processing unit 100 may display a GUI for inputting the correctness of the replaced or combined screen text and accept the user's input through it.
[0178] Furthermore, in the second embodiment described above, the correspondence and replacement between the string to be replaced (audio text) and the replacement candidate (screen text) is determined by the similarity of pronunciation, but the method is not limited to this. As shown in Modification 2, suppose that at the time the string to be replaced contained in the audio text is spoken, there is a location indicated by the user in the screen data displayed on the conference screen. In this case, the generation unit 430 associates the string to be replaced with the screen text corresponding to a string or image that is close to the indicated location and performs the replacement. This process is particularly suitable when the string to be replaced is a pronoun such as "this," "that," or "here." In this case, the audio text is transcribed using screen text that is appropriate to the intent of the user who is speaking.
[0179] Furthermore, if the screen data displayed on the conference screen at the time the string to be replaced in the audio text is spoken contains an instruction area indicated by the user, the generation unit 430 narrows down the replacement candidates from the screen text corresponding to the string or image contained in the instruction area. Then, it may generate reading candidates from the replacement candidates and perform a similarity check. In this case, the audio text is transcribed using screen text that is appropriate to the intent of the user who is speaking. Also, since the target of the similarity check is narrowed down to the screen text within the instruction area, the amount of processing is reduced and the transcription data can be obtained at high speed.
[0180] Furthermore, in the second embodiment, if there are no reading candidates similar to the string to be replaced, normal transcription is performed (S2106). However, there may be cases where there are multiple conversion candidates for the string to be replaced (audio text). In that case, the multiple conversion candidates and the replacement candidates extracted from the screen data may be merged and displayed so that the user can select one of them.
[0181] Figure 24 shows a GUI for selecting conversion candidates for transcription. When multiple conversion candidates exist for the string to be replaced 2401 during normal transcription processing, the generation unit 430 displays the GUI shown in Figure 24. The generation unit 430 of the transcription processing unit 100 displays a candidate group 2402 on the meeting screen, which includes the conversion candidates "G370", "G37maru", and "J370" obtained through normal transcription processing, and the replacement candidate "G-370" obtained from the screen text. The generation unit 430 replaces the string to be replaced 2401 with the candidate selected by the user from the candidate group 2402 using a mouse pointer 2403 or the like. This makes it possible to obtain more accurate transcription data that aligns with the user's intentions. Note that the replacement process, including selection by the user, may be performed during the meeting or on the editing screen after the meeting has ended.
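As a non-limiting illustration, building the candidate group 2402 and applying the user's selection could be sketched as follows. The function names and the use of a list index to represent the selection are assumptions of this sketch; the GUI itself is outside its scope.

```python
def build_candidate_group(conversion_candidates: list[str],
                          screen_candidates: list[str]) -> list[str]:
    """Merge both candidate sources, keeping order and dropping duplicates."""
    seen: set[str] = set()
    group: list[str] = []
    for candidate in [*conversion_candidates, *screen_candidates]:
        if candidate not in seen:
            seen.add(candidate)
            group.append(candidate)
    return group


def apply_selection(text: str, target: str, group: list[str],
                    selected_index: int) -> str:
    """Replace the first occurrence of the target with the selected candidate."""
    return text.replace(target, group[selected_index], 1)


group = build_candidate_group(["G370", "G37maru", "J370"], ["G-370"])
print(group)  # ['G370', 'G37maru', 'J370', 'G-370']
print(apply_selection("Regarding the progress of jiisan nanamaru",
                      "jiisan nanamaru", group, 3))
```

The same two functions could serve both the in-meeting flow and the post-meeting editing screen, since neither depends on when the selection is made.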
[0182] <Third Embodiment> In the first and second embodiments, the transcription process determined the screen text to replace the audio text from the screen data displayed at the time of utterance. However, the screen text to replace the audio text is not necessarily limited to the screen text included in the screen data displayed at the time the audio text was uttered. Screen text from a screen that was not displayed at the time the string to be replaced was uttered may also be used as a replacement candidate for that string.
[0183] (Mapping process in the third embodiment) Figure 25 is a flowchart showing the process executed by the transcription processing unit 100 of the conference server 110 in the mapping process in the third embodiment. The process shown in the flowchart of Figure 25 is described in a program stored in the storage 204. The program is called by the CPU 201 of the conference server 110, loaded into the RAM 203, and executed by the CPU 201. That is, the CPU 201 functions as the transcription processing unit 100 and a component included in the transcription processing unit 100 by executing the process described later in the program.
[0184] When the conference server 110 completes steps S903 and S905 of the transcription process (Figure 9) and obtains the audio text and screen text, it starts the mapping process shown in the flowchart of Figure 25. In the mapping process in the third embodiment, the conference screen 500 displayed on the communication terminal 120 is used as the user interface, similar to the second embodiment.
[0185] Note that in the flowchart of Figure 25, processes similar to those in the mapping process (second embodiment) of Figure 21 are shown with the same step numbers as in Figure 21. The following explanation will focus on the differences from the mapping process (second embodiment) of Figure 21.
[0186] S2101 is the same as the mapping process in Figure 21 (second embodiment). The generation unit 430 obtains the screen text converted in S905, and the replacement candidate extraction unit 431 extracts replacement candidates from the screen text. Then the process proceeds to S2501.
[0187] In S2501, the generation unit 430 generates reading candidates for each of the replacement candidates (screen text) extracted in S2101. The method for generating reading candidates is the same as in S2102 of the second embodiment, but the difference from S2102 is that in S2501, the generation unit 430 stores all the generated reading candidates in RAM 203, associating them with the screen text (replacement candidates). In other words, RAM 203 also stores and accumulates reading candidates for screen text (replacement candidates) extracted from screens that were displayed before the currently displayed screen.
[0188] Next, in S2103, the generation unit 430 acquires the audio text converted in S903, and the replacement candidate extraction unit 431 extracts the string to be replaced from the acquired audio text. This process is the same as S2103 in Figure 21. Next, the process proceeds to S2502.
[0189] In S2502, the generation unit 430 retrieves the correspondence list 432 stored in the RAM 203. In the transcription process performed during a meeting, the correspondence list 432 stores the combinations of the screen text (replacement candidates) included in the screens displayed up to the time S2502 is executed and the audio text (strings to be replaced) of the audio data spoken up to that time.
[0190] In S2503, the generation unit 430 determines whether the replacement string extracted in S2103 is stored in the acquired correspondence list 432. If the replacement string is stored, the process proceeds to S2504.
[0191] In S2504, the generation unit 430 retrieves screen text (replacement candidates) associated with the string to be replaced stored in the correspondence list 432, and the replacement unit 433 replaces the string to be replaced with the retrieved screen text (replacement candidates). This process makes it possible to find screen text (replacement candidates) to replace the string to be replaced from the combination of associated voice text and screen text on screens that were displayed before the screen displayed at the time of utterance. The process then proceeds to S2107.
[0192] In S2503, if the generation unit 430 determines that the replacement string extracted in S2103 is not stored in the acquired correspondence list 432, the process proceeds to S2505.
[0193] In S2505, the generation unit 430 determines whether there are any reading candidates similar to the string to be replaced that is being processed. The difference from S2104 in the second embodiment is that in S2505, all reading candidates stored in RAM 203 are subject to similarity determination. In other words, as described above, RAM 203 also stores and accumulates reading candidates of screen text (replacement candidates) extracted from screens that were displayed before the currently displayed screen. Therefore, it is possible to search for screen text (replacement candidates) that have reading candidates similar to the audio text from the screen displayed at the time of utterance and screens that were displayed before it.
[0194] In S2505, if it is determined that there are similar reading candidates for the string to be replaced, proceed to S2506. Otherwise, proceed to S2106.
[0195] In S2506, the generation unit 430 replaces the string to be replaced with a replacement candidate having a reading candidate similar to the string to be replaced, using the replacement unit 433.
[0196] In S2106, the generation unit 430 performs normal transcription on the string to be replaced. After executing the process in S2504, S2506, or S2106, the process proceeds to S2107.
[0197] In S2107, the output unit 440 displays the replaced audio text as transcription data in the transcription data area 540 of the conference screen. After that, the flowchart in Figure 25 is terminated.
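As a non-limiting illustration, the branch structure of Figure 25 (S2503 to S2506 and S2106) could be sketched as follows. The function `map_string`, the dictionary representation of the correspondence list, the `difflib`-based similarity, and the threshold of 0.8 are assumptions of this sketch; returning the string unchanged merely stands in for normal transcription in S2106.

```python
from difflib import SequenceMatcher


def _similarity(a: str, b: str) -> float:
    """Character-level ratio in [0, 1]; spaces are ignored."""
    return SequenceMatcher(None, a.replace(" ", ""), b.replace(" ", "")).ratio()


def map_string(target: str, correspondence: dict[str, str],
               all_readings: dict[str, list[str]], threshold: float = 0.8) -> str:
    """Return the text to output for one string to be replaced."""
    # S2503 / S2504: reuse a pairing already stored in the correspondence list.
    if target in correspondence:
        return correspondence[target]

    # S2505: compare against reading candidates accumulated from every screen
    # displayed so far, not only the currently displayed one.
    best_candidate, best_score = None, 0.0
    for candidate, readings in all_readings.items():
        for reading in readings:
            score = _similarity(target, reading)
            if score > best_score:
                best_candidate, best_score = candidate, score

    # S2506: replace, and record the new pairing for later utterances.
    if best_candidate is not None and best_score >= threshold:
        correspondence[target] = best_candidate
        return best_candidate

    # S2106 stand-in: no similar reading, so keep the normally transcribed text.
    return target
```

Checking the correspondence list first means that a pairing established once is applied without a further similarity search, even after the screen that supplied the text is no longer displayed.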
[0198] Figure 26 shows an example of the correspondence between the time-series screen transitions in an online meeting and the audio-text. The flow of time is from top to bottom in the figure. At time t1, it is assumed that there is no shared screen data on meeting screen 2601. Subsequently, at time t2, screen 2602 is displayed, and at time t3, screen 2603 is displayed.
[0199] At time t2, screen 2602 is displayed, and the screen text "1,1,1" is obtained from it. The processing in S2501 generates "ichiichiichi", "ichi no ichi no ichi", and "triple ichi" as reading candidates for this screen text, and these are stored in RAM 203. If the audio data "triple ichi ga..." is spoken between time t2 and time t3, while screen 2602 is displayed, the generation unit 430 determines "triple ichi" to be the reading candidate most similar to the corresponding audio text. As a result, the replacement unit 433 replaces the audio text "triple ichi ga..." with the transcription data "1,1,1 ga...". At this time, the generation unit 430 saves the combination of the audio text "triple ichi" and the screen text "1,1,1" in the correspondence list 2610.
[0200] Subsequently, at time t3, screen 2603 is displayed. From screen 2603, the screen text obtained is "2,2,2". As reading candidates for this screen text, processing in S2501 generates "ni nini", "ni noninoni", "ni-ni-ni", "triple ni", etc., and adds them to RAM 203 for storage. Assume that after time t3, when screen 2603 is displayed, the audio data "triple ichi de wa..." is spoken. Processing in S2103 in Figure 25 extracts the replacement string "triple ichi" from the audio data. The correspondence list 2610 obtained in S2502 stores the replacement string "triple ichi". Therefore, S2503 determines YES, and the replacement string "triple ichi" is replaced with the screen text "1,1,1".
[0201] Furthermore, suppose that after the time t3 when screen 2603 is displayed, another audio data, "ichi ichi ichi...", is spoken. In that case, the replacement string "ichi ichi ichi" is extracted from the audio data by the process in S2103 in Figure 25. Since the replacement string "ichi ichi ichi" is not stored in the correspondence list 2610 obtained in S2502, it is determined to be NO in S2503 and proceeds to S2505.
[0202] In S2505, the generation unit 430 checks all reading candidates stored in RAM 203 for similarity and determines whether there are any reading candidates similar to the string to be replaced, "ichiichiichi". Since the reading candidate "ichiichiichi" for "1,1,1" in screen 1 is already stored in RAM 203, S2505 determines YES, and in S2506, the string to be replaced, "ichiichiichi", is replaced with "1,1,1". At that time, the generation unit 430 adds the combination of the audio text "ichiichiichi" and the screen text "1,1,1" to the correspondence list 2610 and obtains the updated correspondence list 2611. In subsequent processing, the strings to be replaced extracted from the audio text, "triple ichi" and "ichiichiichi", are both replaced with the screen text "1,1,1".
[0203] Note that the example in Figure 26 is an example of real-time transcription processing during a meeting, so the string to be replaced is replaced using replacement candidates and reading candidates associated with times before the time of speaking. However, in the editing process performed after the meeting, the screen data and audio data of the entire meeting can be acquired. In that case, the generation unit 430 extracts replacement candidates from all screen text converted from the series of screen data from the start to the end of the meeting acquired by the acquisition unit, and generates reading candidates. In this way, the string to be replaced (audio text) can be replaced using screen text contained in all screen data, not just the screen displayed during the utterance of the string to be replaced, and transcription data can be generated.
[0204] Furthermore, if all screen text included in all screen data is targeted, the number of candidates may increase too much, potentially slowing down processing speed. Therefore, the target screens may be limited to those displayed within a predetermined time range from the time the string to be replaced was uttered. This predetermined time range may be before or after the time the string to be replaced was uttered.
[0205] As described above, according to the third embodiment, the generation unit 430 can generate transcript data of the audio data using screen text converted from screen data displayed on the communication terminal 120 within a predetermined time range from the time the string contained in the audio text was spoken. The predetermined time range from the time of speaking may be a time range before the time of speaking, a time range after the time of speaking, or both. However, in the case of real-time transcription processing, it will be limited to a time range before the time of speaking. For example, suppose a meeting material file with a total of 10 pages is shared as a shared screen sequentially from page 1, and the audio data to be transcribed is spoken when page 5 is being shared. In this case, the generation unit 430 can generate transcript data of the audio data to be transcribed using screen text contained in the screen data of pages 1 to 5 that were displayed before the time of speaking.
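As a non-limiting illustration, limiting the replacement candidates to screens displayed within a predetermined time range of the utterance could be sketched as follows. The `ScreenSnapshot` structure, the use of the time at which each screen started being displayed, and the default range of 300 seconds are assumptions of this sketch; setting `after` to zero corresponds to the real-time case, in which only earlier screens are available.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenSnapshot:
    shown_at: float          # seconds from the start of the meeting (assumed)
    texts: tuple[str, ...]   # screen text extracted from this screen


def candidates_in_window(snapshots: list[ScreenSnapshot], spoken_at: float,
                         before: float = 300.0, after: float = 0.0) -> list[str]:
    """Screen text from screens shown within [spoken_at - before, spoken_at + after]."""
    earliest, latest = spoken_at - before, spoken_at + after
    result: list[str] = []
    for snapshot in snapshots:
        if earliest <= snapshot.shown_at <= latest:
            for text in snapshot.texts:
                if text not in result:  # keep each candidate once, in display order
                    result.append(text)
    return result


snapshots = [
    ScreenSnapshot(0.0, ("1,1,1",)),
    ScreenSnapshot(200.0, ("2,2,2",)),
    ScreenSnapshot(500.0, ("G-370",)),
]
# Real-time case: only screens shown before the utterance at 450 s qualify.
print(candidates_in_window(snapshots, 450.0, before=300.0, after=0.0))    # ['2,2,2']
# Post-meeting editing: screens shown shortly afterwards may also be used.
print(candidates_in_window(snapshots, 450.0, before=300.0, after=100.0))  # ['2,2,2', 'G-370']
```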
[0206] Furthermore, the generation unit 430 may generate transcript data of audio data based on all screen text converted from a series of screen data acquired by the acquisition unit 410. For example, suppose that in an editing process performed after an online meeting, transcription processing is performed on the audio data of that meeting. In this case, regardless of when the audio data was spoken, the transcript data of audio data may be generated using all screen text converted from all screen data displayed in the online meeting as replacement candidates (including reading candidates).
[0207] This means that not only the screen data displayed at the time the audio data is spoken, but also screen text extracted from screen data displayed at other times is reflected in the transcription data. Therefore, the number of vocabulary words that can be used as replacement candidates increases, and the accuracy of the transcription data can be improved.
[0208] Furthermore, the transcription process in the third embodiment can also be modified according to the various modifications of the first embodiment. Specifically, the transcription data may be displayed by replacing the strings to be replaced in the audio text with screen text, or, as shown in Modification 1, the strings that have been replaced with screen text may be highlighted. Alternatively, the transcription data may be displayed with both the strings to be replaced in the audio text and the screen text that replaces them. The system may also accept input from the user regarding the correctness of the screen text that has replaced a string or is displayed alongside it. In this case, the transcription processing unit 100 may display a GUI for inputting the correctness of the displayed screen text.
[0209] Furthermore, in the third embodiment, the correspondence and replacement between the string to be replaced (audio text) and the replacement candidate (screen text) is determined by the similarity of pronunciation, but the method is not limited to this. For example, if the audio data contains pronouns such as "that," "that over there," "there," "over there," or "the one from earlier," the replacement candidate (screen text) may be determined based on screen data that was displayed before the time the string to be replaced contained in the audio text was spoken. In that case, if there are indicated locations or areas on the screen, the candidates may be narrowed down to the screen text corresponding to the indicated location or the screen text contained in the indicated area.
[0210] <Fourth Embodiment> In the third embodiment described above, the screen text that replaces the audio text was determined by referring to a correspondence list 432 generated from the meeting data to be processed. However, the correspondence list 432 to be referred to is not limited to this. The correspondence lists generated for each meeting data may be stored as accumulated data, and this accumulated data may be used for transcription processing of other related meetings.
[0211] Figure 27 is a block diagram showing the functional configuration of the transcription processing unit 100A of the fourth embodiment. The transcription processing unit 100A shown in Figure 27 has an acquisition unit 410, a conversion unit 420, a generation unit 430, and an output unit 440, similar to the first embodiment (Figure 4). The differences from the first embodiment are that the correspondence list generated by the generation unit 430 is stored in the stored data storage unit 2710, and the replacement unit 433A obtains a correspondence list of related meetings from the stored data and uses it for transcription of the audio data. In Figure 27, the same parts as in Figure 4 are denoted by the same reference numerals.
[0212] The stored data storage unit 2710 classifies online meetings conducted by the online meeting system 1 into groups according to the type of meeting, and stores the correspondence lists 432 generated for those meetings by the transcription function as stored data for each group. In the example in Figure 27, the stored data storage unit 2710 holds stored data 2711 for the "monthly meeting" group and stored data 2712 for the "planning meeting" group. The stored data 2711 for the "monthly meeting" group stores the correspondence lists of meetings classified into the same group, such as the January regular meeting, the February regular meeting, the March regular meeting, and so on. Similarly, the stored data 2712 for the "planning meeting" group stores the correspondence lists of meetings classified into the same group, such as the 1st planning meeting, the 2nd planning meeting, the 3rd planning meeting, and so on.
[0213] The replacement unit 433A acquires stored data corresponding to the type of meeting from the stored data storage unit 2710 during the transcription process. The replacement unit 433A replaces the strings to be replaced, extracted from the audio text converted from the meeting data, with screen text (replacement candidates) by referring to the correspondence list generated from the meeting data and the stored data. As a result, if a term was used in a past meeting of the same type, transcription data can be generated by referring to the stored data. Furthermore, by classifying and storing the stored data by type of meeting, terms used in unrelated meetings can be excluded from the replacement candidates, enabling efficient generation of transcription data.
[0214] Figure 28 is a flowchart showing the transcription process executed by the conference server 110 in the fourth embodiment. The processes shown in the flowchart of Figure 28 are described in a program stored in the storage 204. The program is called by the CPU 201 of the conference server 110, loaded into the RAM 203, and executed by the CPU 201. When the transcription process in the post-conference editing process in the first embodiment (S807) or the transcription process during the conference in the second embodiment (S2004) is started, the conference server 110 starts the processes shown in this flowchart. Note that in Figure 28, processes similar to the transcription process in Figure 9 are shown with the same step numbers as in Figure 9. The following explanation will focus on the differences from the transcription process in Figure 9.
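As a non-limiting illustration, storing correspondence lists per meeting group and retrieving them by meeting type could be sketched as follows. The class `StoredDataStore`, its in-memory dictionaries, and the rule that a later-saved meeting overrides an earlier one when the same string appears twice are assumptions of this sketch; the disclosure does not specify the storage format of the stored data storage unit 2710.

```python
from collections import defaultdict


class StoredDataStore:
    """Correspondence lists grouped by meeting type (e.g. "monthly meeting")."""

    def __init__(self) -> None:
        # group name -> {meeting name -> {string to be replaced -> screen text}}
        self._groups: dict[str, dict[str, dict[str, str]]] = defaultdict(dict)

    def save(self, group: str, meeting: str, correspondence: dict[str, str]) -> None:
        """Store a copy of the correspondence list generated for one meeting."""
        self._groups[group][meeting] = dict(correspondence)

    def load_group(self, group: str) -> dict[str, str]:
        """Merge every correspondence list stored for the given group."""
        merged: dict[str, str] = {}
        for correspondence in self._groups.get(group, {}).values():
            merged.update(correspondence)  # later-saved meetings win (assumption)
        return merged


store = StoredDataStore()
store.save("monthly meeting", "January regular meeting", {"jiisan nanamaru": "G-370"})
store.save("planning meeting", "1st planning meeting", {"triple ichi": "1,1,1"})
print(store.load_group("monthly meeting"))  # {'jiisan nanamaru': 'G-370'}
```

Because `load_group` only reads the requested group, terms paired in unrelated meetings never enter the replacement candidates.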
[0215] In S2801, the conference server 110 determines whether or not there is stored data related to the conference currently in progress. If stored data for the same group or related groups is stored in the stored data storage unit 2710, the process proceeds to S2802. If stored data for the same group or related groups is not stored, S2802 is skipped and the process proceeds to S901. The determination in S2801 may be made, for example, by the conference server 110's CPU 201 determining whether or not there is an existing group related to the conference currently in progress based on the conference name, conference attendees, etc., or by accepting a selection of an existing group by the user.
[0216] In S2802, the conference server 110 acquires stored data for the same group or related groups. Then, the process proceeds to S901. The processing in S901 to S908 is the same as in the first and second embodiments. However, in the mapping process in S906, if stored data has been acquired in S2802, the replacement unit 433A refers not only to the correspondence list generated from the conference data being conducted, but also to the stored data acquired in S2802. Then, it maps (replaces) the string to be replaced in the audio text to the screen text (replacement candidate) included in the stored data. If stored data has not been acquired in S2802, the replacement unit 433A refers only to the correspondence list generated from the conference data being conducted and maps (replaces) the string to be replaced in the audio text to the screen text (replacement candidate). When the processing up to S908 is completed, the flowchart in Figure 28 is terminated.
[0217] As described above, according to the fourth embodiment, if a correspondence list 432 generated from past meetings of the same type is stored as accumulated data, the generation unit 430 can refer to that accumulated data and replace the audio text with screen text. This makes it possible to generate transcription data efficiently. For example, even if technical terms, company terms, product codes, company system names, or other special terms are not included in the screen of the current meeting, transcription is possible if a correspondence between screen text and audio text has been made in past meetings.
[0218] Furthermore, the transcription process in the fourth embodiment can also be modified according to the various modifications of the first embodiment. Specifically, the transcription data may be displayed by replacing the strings to be replaced in the audio text with the screen text, or, as shown in Modification 1, the strings that have been replaced with screen text may be highlighted. Alternatively, the transcription data may be displayed with both the strings to be replaced in the audio text and the screen text that replaces them. The system may also accept input from the user regarding the correctness of the screen text that has replaced a string or is displayed alongside it. In this case, the transcription processing unit 100 may display a GUI for inputting the correctness of the displayed screen text.
[0219] Preferred embodiments of this disclosure have been described above with reference to the attached drawings, but this disclosure is not limited to such examples. For example, the configurations and display examples of various screens, examples of audio data, examples of speech recognition results, examples of replacement candidate extraction, and processing procedures of each flowchart are examples and may be modified as appropriate. Furthermore, it is clear that those skilled in the art can conceive of various modifications or alterations within the scope of the disclosed technical idea, and these will naturally also fall within the technical scope of this disclosure.
[0220] <Other Embodiments> This disclosure can also be implemented by supplying a program that implements one or more of the functions of the above-described embodiments to a system or device via a network or storage medium, and by having one or more processors in the computer of that system or device read and execute the program. It can also be implemented by a circuit (e.g., ASIC) that implements one or more functions.
[0221] The above-described embodiments include the following configurations.
[0222] (Configuration 1) An information processing device that is connected to a communication terminal via a network, comprising: an acquisition means for acquiring screen data and audio data shared with the communication terminal; a first conversion means for converting the audio data into audio text data; a second conversion means for converting the screen data into screen text data; a generation means for generating transcription data in which a part of the string contained in the audio text data is replaced with a string contained in the screen text data; and an output means for outputting the transcription data generated by the generation means.
[0223] (Configuration 2) An information processing device that is connected to a communication terminal via a network, comprising: an acquisition means for acquiring screen data and audio data shared with the communication terminal; a first conversion means for converting the audio data into audio text data; a second conversion means for converting the screen data into screen text data; a generation means for generating transcription data that includes a string contained in the audio text data and the screen text data that is a candidate for replacing the string; and an output means for outputting the transcription data generated by the generation means.
[0224] (Configuration 3) The information processing apparatus according to Configuration 1 or Configuration 2, characterized in that the output means outputs the transcription data in a manner that can be displayed on the communication terminal.
[0225] (Configuration 4) The information processing apparatus according to Configuration 3, characterized in that the output means outputs the transcription data in such a way that the screen text data included in the transcription data is highlighted.
[0226] (Configuration 5) The information processing apparatus according to any one of Configurations 1 to 3, further comprising a receiving means for receiving input from a user regarding whether the screen text data included in the transcription data is correct or incorrect.
[0227] (Configuration 6) The information processing apparatus according to any one of Configurations 1 to 5, characterized in that, when there are multiple screen text data that are candidates for replacing the string contained in the audio text data, the generation means generates the transcription data using the screen text data selected by the user from the multiple replacement candidates.
[0228] (Configuration 7) The information processing apparatus according to any one of Configurations 1 to 5, characterized in that, when there are multiple screen text data that are candidates for replacing the string contained in the audio text data, the generation means generates the transcription data using the screen text data selected from the multiple replacement candidates based on a predetermined determination criterion.
[0229] (Configuration 8) The information processing device according to Configuration 7, characterized in that the similarity between the string contained in the audio text data and the screen text data that is a replacement candidate is used as the predetermined determination criterion.
[0230] (Configuration 9) The information processing apparatus according to Configuration 8, characterized in that the similarity is determined based on the readings of the string contained in the audio text data and of the screen text data that is a replacement candidate.
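By way of a hedged example, candidate selection by reading similarity (Configurations 7 to 9) might look like the Python sketch below. The function names `reading` and `select_candidate`, the threshold of 0.6, and the sample strings are hypothetical. The `reading` function is only a placeholder that normalizes spelling; an actual implementation would convert text to its pronunciation (for Japanese, for example, a kana reading). The similarity measure used here, `difflib.SequenceMatcher`, is a stand-in chosen for illustration and is not stated in the disclosure.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def reading(text: str) -> str:
    # Placeholder for a real text-to-pronunciation conversion (assumption):
    # lowercase the text and keep only letters and digits.
    return "".join(ch for ch in text.lower() if ch.isalnum())


def select_candidate(audio_string: str, candidates: List[str],
                     threshold: float = 0.6) -> Optional[str]:
    """Return the candidate whose reading is most similar, or None if none reaches the threshold."""
    best, best_score = None, threshold
    for candidate in candidates:
        score = SequenceMatcher(None, reading(audio_string), reading(candidate)).ratio()
        if score >= best_score:
            best, best_score = candidate, score
    return best


# Screen text extracted from the shared screen (invented examples).
screen_candidates = ["Kubernetes", "Kerberos", "Cuba"]
chosen = select_candidate("cube or netties", screen_candidates)
print(chosen)
```

Returning `None` when no candidate reaches the threshold leaves the recognized string unchanged, which avoids forcing a replacement when nothing on the screen sounds similar.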
[0231] (Configuration 10) The information processing apparatus according to any one of Configurations 1 to 9, characterized in that the transcription data includes screen text data converted from the screen data displayed on the communication terminal at the time the string contained in the audio text data was spoken.
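One possible way to find the screen that was displayed when a string was spoken is to timestamp each screen capture and look up the latest capture at or before the utterance time. The Python sketch below assumes such timestamps exist (seconds from the start of the meeting); the function name `screen_text_at` and the sample data are hypothetical.

```python
from bisect import bisect_right
from typing import List, Tuple


def screen_text_at(spoken_at: float,
                   captures: List[Tuple[float, List[str]]]) -> List[str]:
    """Return the text of the latest screen capture taken at or before spoken_at.

    captures must be sorted by timestamp.
    """
    timestamps = [t for t, _ in captures]
    index = bisect_right(timestamps, spoken_at) - 1
    return captures[index][1] if index >= 0 else []


# (timestamp, screen text strings) pairs, invented for illustration.
captures = [
    (0.0, ["Agenda"]),
    (30.0, ["Q3 roadmap", "Project Argon"]),
    (95.0, ["Budget"]),
]
print(screen_text_at(42.5, captures))
```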
[0232] (Configuration 11) The information processing apparatus according to any one of Configurations 1 to 9, characterized in that the generation means generates the transcription data based on all the screen text data converted from a series of screen data acquired by the acquisition means.
[0233] (Configuration 12) The information processing apparatus according to any one of Configurations 1 to 11, characterized in that, when the screen data displayed on the communication terminal at the time the string contained in the audio text data is spoken has an instruction area indicated by the user, the generation means generates the transcription data including the screen text data extracted from the instruction area.
[0234] (Configuration 13) The information processing apparatus according to Configuration 12, characterized in that the string contained in the audio text data is a pronoun.
[0235] (Configuration 14) The information processing apparatus according to any one of Configurations 1 to 11, characterized in that, when there is a location indicated by the user in the screen data displayed on the communication terminal at the time the string contained in the audio text data is spoken, the generation means generates the transcription data including the screen text data corresponding to the text or image located close to the indicated location.
[0236] (Configuration 15) The information processing apparatus according to Configuration 14, characterized in that the string contained in the audio text data is a pronoun.
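The idea in Configurations 14 and 15, resolving a spoken pronoun to the screen text nearest the location the user indicated, can be sketched in Python as follows. The `TextBox` structure, the function `nearest_text`, the 150-pixel distance limit, and the sample coordinates are assumptions for illustration; the disclosure does not state how closeness is measured, and Euclidean distance between the pointer and the center of each text region is assumed here.

```python
from dataclasses import dataclass
from math import hypot
from typing import List, Optional, Tuple


@dataclass
class TextBox:
    text: str
    x: float  # center x of the recognized text on the screen, in pixels
    y: float  # center y of the recognized text on the screen, in pixels


def nearest_text(pointer: Tuple[float, float], boxes: List[TextBox],
                 max_distance: float = 150.0) -> Optional[str]:
    """Return the text closest to the pointer, or None if nothing is near enough."""
    px, py = pointer
    best = min(boxes, key=lambda b: hypot(b.x - px, b.y - py), default=None)
    if best is None or hypot(best.x - px, best.y - py) > max_distance:
        return None
    return best.text


boxes = [TextBox("Sales forecast", 200, 120), TextBox("Model X-200", 640, 410)]
referent = nearest_text((630, 400), boxes)

# Replace the pronoun only when a nearby screen text was found.
audio_text = "please review this by Friday"
resolved = audio_text.replace("this", referent) if referent else audio_text
print(resolved)
```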
[0237] (Configuration 16) The information processing apparatus according to any one of Configurations 1 to 15, further comprising a receiving means for receiving a selection of a region from the screen data acquired by the acquisition means, wherein the second conversion means converts the screen data into screen text data with respect to the region received by the receiving means.
[0238] (Configuration 17) The information processing apparatus according to any one of Configurations 1 to 16, wherein the generation means generates the transcription data by referring to correspondence data held in a storage means that stores the correspondence data, the correspondence data associating the string contained in the audio text data with the screen text data that replaced the string.
[0239] (Configuration 18) The information processing apparatus according to any one of Configurations 1 to 17, characterized in that the conversion process by the first conversion means and the second conversion means, the generation process by the generation means, and the output process by the output means are executed in real time when the acquisition means acquires the screen data and the audio data.
[0240] (Configuration 19) The information processing apparatus according to any one of Configurations 1 to 17, characterized in that the conversion process by the first conversion means and the second conversion means, the generation process by the generation means, and the output process by the output means are performed after the acquisition means has finished acquiring a series of audio data and screen data.
[0241] (Configuration 20) The information processing apparatus according to any one of Configurations 1 to 19, wherein the screen data includes at least one of text and an image, and the second conversion means converts at least one of the text and the image included in the screen data into screen text data.
[0242] (Configuration 21) A communication terminal connected to an information processing device via a network, comprising: a transmitting means for transmitting screen data and audio data shared in the communication terminal to the information processing device; a receiving means for receiving transcription data output from the information processing device, in which a part of the string contained in the audio text data converted from the audio data is replaced with a string contained in the screen text data converted from the screen data; and a display means for displaying the transcription data.
[0243] (Configuration 22) An information processing system in which a communication terminal and a server are connected via a network, wherein the server comprises: receiving means for receiving screen data and audio data shared with the communication terminal from the communication terminal; first conversion means for converting the audio data into audio text data; second conversion means for converting the screen data into screen text data; generation means for generating transcription data in which a part of the string contained in the audio text data is replaced with a string contained in the screen text data; and output means for outputting the transcription data generated by the generation means to the communication terminal, wherein the communication terminal comprises: communication means for sending and receiving the screen data and audio data shared at the communication terminal with the server; receiving means for receiving the transcription data output from the server; and display means for displaying the transcription data.
[0244] (Configuration 23) An information processing method in an information processing system in which a communication terminal and a server are connected via a network, the method comprising: the server acquiring screen data and audio data shared with the communication terminal; converting the audio data into audio text data; converting the screen data into screen text data; generating transcription data by replacing a part of the string contained in the audio text data with a string contained in the screen text data; and outputting the generated transcription data to the communication terminal.
[0245] (Configuration 24) An information processing method in an information processing system in which a communication terminal and a server are connected via a network, the method comprising, by the communication terminal: transmitting screen data and audio data shared at the communication terminal to the server; receiving transcription data output from the server, in which a part of a string contained in audio text data converted from the audio data is replaced with a string contained in screen text data converted from the screen data; and displaying the transcription data.
[0246] (Configuration 25) A program executed by an information processing device that is connected to a communication terminal via a network, wherein the information processing device is configured to function as: an acquisition means for acquiring screen data and audio data shared with the communication terminal; a first conversion means for converting the audio data into audio text data; a second conversion means for converting the screen data into screen text data; a generation means for generating transcription data in which a part of the string contained in the audio text data is replaced with a string contained in the screen text data; and an output means for outputting the transcription data generated by the generation means to the communication terminal.
[0247] (Configuration 26) A program executed by a communication terminal connected to an information processing device via a network, characterized in that the program causes the communication terminal to function as: a transmitting means for transmitting screen data and audio data shared at the communication terminal to the information processing device; a receiving means for receiving transcription data output from the information processing device, in which a part of a string contained in audio text data converted from the audio data is replaced with a string contained in screen text data converted from the screen data; and a display means for displaying the transcription data.
[0248] (Configuration 27) A communication terminal connected to an information processing device via a network, comprising: a transmitting means for transmitting screen data and audio data shared in the communication terminal to the information processing device; a receiving means for receiving transcription data output from the information processing device; and a display means for displaying the transcription data, wherein first transcription data displayed by the display means when the audio data is transmitted from the transmitting means is different from second transcription data displayed by the display means when the screen data and the audio data are transmitted from the transmitting means.
[0249] This disclosure is not limited to the embodiments described above, and various modifications and alterations are possible without departing from the spirit and scope of this disclosure. Accordingly, the following claims are attached to make the scope of this disclosure public.
[0250] This application claims priority based on Japanese Patent Application No. 2024-232257, filed on 27 December 2024, and all of its contents are incorporated herein by reference.
Claims
1. An information processing device connected to a communication terminal via a network, the information processing device comprising: an acquisition means for acquiring screen data and audio data shared with the communication terminal; a first conversion means for converting the audio data into audio text data; a second conversion means for converting the screen data into screen text data; a generation means for generating transcription data by replacing a portion of the string contained in the audio text data with a string contained in the screen text data; and an output means for outputting the transcription data generated by the generation means.
2. An information processing device connected to a communication terminal via a network, the information processing device comprising: an acquisition means for acquiring screen data and audio data shared with the communication terminal; a first conversion means for converting the audio data into audio text data; a second conversion means for converting the screen data into screen text data; a generation means for generating transcription data that includes a string contained in the audio text data and the screen text data that serves as a replacement candidate for the string; and an output means for outputting the transcription data generated by the generation means.
3. The information processing apparatus according to claim 1 or 2, characterized in that the output means outputs the transcription data in a manner that can be displayed on the communication terminal.
4. The information processing apparatus according to claim 3, characterized in that the output means outputs the transcription data in such a manner that the screen text data included in the transcription data is highlighted.
5. The information processing apparatus according to claim 3, further comprising a receiving means for receiving input from a user regarding the correctness of the screen text data included in the transcription data.
6. The information processing apparatus according to claim 1 or 2, wherein, when there are multiple screen text data that are candidates for replacing the string contained in the audio text data, the generation means generates the transcription data using the screen text data selected by the user from the multiple replacement candidates.
7. The information processing apparatus according to claim 1 or 2, wherein, when there are multiple screen text data that are candidates for replacing the string contained in the audio text data, the generation means generates the transcription data using the screen text data selected from the multiple replacement candidates based on a predetermined determination criterion.
8. The information processing apparatus according to claim 7, characterized in that the similarity between the string contained in the audio text data and the screen text data that is a replacement candidate is used as the predetermined determination criterion.
9. The information processing apparatus according to claim 8, characterized in that the similarity is determined based on the readings of the string contained in the audio text data and of the screen text data that is a replacement candidate.
10. The information processing apparatus according to claim 1 or 2, characterized in that the transcription data includes screen text data converted from the screen data displayed on the communication terminal at the time the string contained in the audio text data was spoken.
11. The information processing apparatus according to claim 1 or 2, characterized in that the generation means generates the transcription data based on all the screen text data converted from the series of screen data acquired by the acquisition means.
12. The information processing apparatus according to claim 1 or 2, wherein, when the screen data displayed on the communication terminal at the time the string contained in the audio text data is spoken contains an instruction area indicated by the user, the generation means generates the transcription data including the screen text data extracted from the instruction area.
13. The information processing apparatus according to claim 12, characterized in that the string contained in the audio text data is a pronoun.
14. The information processing apparatus according to claim 1 or 2, wherein, when there is a location indicated by the user in the screen data displayed on the communication terminal at the time the string contained in the audio text data is spoken, the generation means generates the transcription data including the screen text data corresponding to the text or image located close to the indicated location.
15. The information processing apparatus according to claim 14, characterized in that the string contained in the audio text data is a pronoun.
16. The information processing apparatus according to claim 1 or 2, further comprising a receiving means for receiving a selection of a region from the screen data acquired by the acquisition means, characterized in that the second conversion means converts the screen data into screen text data with respect to the region received by the receiving means.
17. The information processing apparatus according to claim 1 or 2, wherein the generation means generates the transcription data by referring to correspondence data held in a storage means that stores the correspondence data, the correspondence data associating the string contained in the audio text data with the screen text data that replaced the string.
18. The information processing apparatus according to claim 1 or 2, characterized in that the conversion processing by the first conversion means and the second conversion means, the generation processing by the generation means, and the output processing by the output means are performed in real time when the acquisition means acquires the screen data and the audio data.
19. The information processing apparatus according to claim 1 or 2, characterized in that the conversion processing by the first conversion means and the second conversion means, the generation processing by the generation means, and the output processing by the output means are performed after the acquisition means has finished acquiring a series of audio data and screen data.
20. The information processing apparatus according to claim 1 or 2, characterized in that the screen data includes at least one of text and an image, and the second conversion means converts at least one of the text and the image contained in the screen data into screen text data.
21. A communication terminal connected to an information processing device via a network, the communication terminal comprising: a transmission means for transmitting screen data and audio data shared at the communication terminal to the information processing device; a receiving means for receiving transcription data output from the information processing device, in which a portion of the string contained in audio text data converted from the audio data is replaced with a string contained in screen text data converted from the screen data; and a display means for displaying the transcription data.
22. An information processing system in which a communication terminal and a server are connected via a network, wherein the server comprises: a receiving means for receiving, from the communication terminal, screen data and audio data shared with the communication terminal; a first conversion means for converting the audio data into audio text data; a second conversion means for converting the screen data into screen text data; a generation means for generating transcription data by replacing a portion of the string contained in the audio text data with a string contained in the screen text data; and an output means for outputting the transcription data generated by the generation means to the communication terminal, and wherein the communication terminal comprises: a communication means for sending and receiving the screen data and audio data shared at the communication terminal with the server; a receiving means for receiving the transcription data output from the server; and a display means for displaying the transcription data.
23. An information processing method in an information processing system in which a communication terminal and a server are connected via a network, the method comprising, by the server: acquiring screen data and audio data shared with the communication terminal; converting the audio data into audio text data; converting the screen data into screen text data; generating transcription data by replacing a portion of the string contained in the audio text data with a string contained in the screen text data; and outputting the generated transcription data to the communication terminal.
24. An information processing method in an information processing system in which a communication terminal and a server are connected via a network, the method comprising, by the communication terminal: transmitting screen data and audio data shared at the communication terminal to the server; receiving transcription data output from the server, in which a portion of the string contained in audio text data converted from the audio data is replaced with a string contained in screen text data converted from the screen data; and displaying the transcription data.
25. A program executed by an information processing device connected to a communication terminal via a network, the program causing the information processing device to function as: an acquisition means for acquiring screen data and audio data shared with the communication terminal; a first conversion means for converting the audio data into audio text data; a second conversion means for converting the screen data into screen text data; a generation means for generating transcription data by replacing a portion of the string contained in the audio text data with a string contained in the screen text data; and an output means for outputting the transcription data generated by the generation means to the communication terminal.
26. A program executed by a communication terminal connected to an information processing device via a network, the program causing the communication terminal to function as: a transmission means for transmitting screen data and audio data shared at the communication terminal to the information processing device; a receiving means for receiving transcription data output from the information processing device, in which a part of a string contained in audio text data converted from the audio data is replaced with a string contained in screen text data converted from the screen data; and a display means for displaying the transcription data.
27. A communication terminal connected to an information processing device via a network, the communication terminal comprising: a transmission means for transmitting screen data and audio data shared at the communication terminal to the information processing device; a receiving means for receiving transcription data output from the information processing device; and a display means for displaying the transcription data, characterized in that first transcription data displayed by the display means when the audio data is transmitted from the transmission means is different from second transcription data displayed by the display means when the screen data and the audio data are transmitted from the transmission means.