Interactive AI (artificial intelligent) agent for video or audio communication with a website or a web application

An AI agent processes user gestures or speech to generate software code for website interaction, addressing the need for typing-free engagement and improving accessibility and user experience.

WO2026139857A1 · PCT designated stage · Publication Date: 2026-07-02 · GLOBAL OPTIC HUB LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
GLOBAL OPTIC HUB LTD
Filing Date
2025-12-22
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing technologies require typing or mouse interactions for user engagement with websites or web applications, limiting accessibility for individuals who cannot type and hindering a more user-friendly interface.

Method used

An AI agent that analyzes user interaction inputs, such as gesturing or talking, determines the next step, and generates software code to execute actions on the website or application, eliminating the need for typing.

Benefits of technology

Enables seamless interaction with websites or applications without typing, enhancing accessibility and providing a more user-friendly interface, akin to human interaction.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure IB2025063335_02072026_PF_FP_ABST

Abstract

A computer implemented method; the method comprises: while in a video or audio communication session with a website or a web application, receiving end-user interaction input; analyzing the end-user interaction input for determining a next step in the session; and performing the next step; wherein the next step comprises navigation in the website or the web application, redirection in the website or web application, content update of the website or web application or content refresh on the website or web application.

Description

INTERACTIVE AI (ARTIFICIAL INTELLIGENT) AGENT FOR VIDEO OR AUDIO COMMUNICATION WITH A WEBSITE OR A WEB APPLICATION

FIELD OF THE INVENTION

The present disclosure relates to the World Wide Web in general, and to automating web sessions in particular.

BACKGROUND OF THE INVENTION

Simulation of human-like conversations with users is typically performed via chatbots. Chatbots use artificial intelligence (AI) and natural language processing (NLP) to interpret user inputs, respond appropriately, and carry out tasks based on commands or questions. They are widely used in customer service, marketing, healthcare, and many other industries to provide information, automate responses, and assist users efficiently.

SUMMARY OF THE INVENTION

The term computing device refers herein to a device that includes a processing unit. Examples of such devices are a personal computer, a laptop, a server, a virtual reality (VR) headset, smart glasses, augmented reality glasses, a mixed reality headset, mixed reality glasses, a wearable device, a tablet, a television, a set-top box, a smartwatch, a cellular device and an IoT (Internet of Things) device.

The term multimedia content refers herein to audio, video, an image, text or a combination thereof.

The term present refers herein to displaying or playing multimedia content.

The term end-user refers herein to a user of a website or a web application.

The term user refers herein to the operator of the website or the web application.

The term end-user interaction input refers herein to measured data sensed by sensors that track the behavior of the user while interacting with an application or a website. The sensed input includes multimedia content or software code associated with the interaction of the end-user with the application. Such interaction includes talking or gesturing. Examples of such sensors are a camera, a microphone or any other sensor; the input may also include a screen capture.

In some embodiments the interaction of the user also includes keyboard or mouse input.

The term next step refers herein to the response to the end-user interaction. The next step is presented to the user. In some embodiments the response is presented as a virtual human being.

The term LLM refers herein to Large Language Model.

Some aspects of the present invention relate to a non-transitory computer-readable medium comprising instructions which, when executed by at least one processor, cause the processor to perform the method of the present invention.

Embodiments of the invention provide an interactive AI (artificial intelligent) agent for video or audio communication with a website or a web application.

Such an AI agent enables the user to interact with the website or the application by communicating with a virtual human being and without the need to use a keyboard or mouse. The agent analyzes the end-user interaction input and decides on the next step in the communication between the end-user and the website. The decision is based on the end-user interaction input and the data associated with the website or web application. The AI agent generates code to be executed on the website or the web application for performing the next step. The code includes any combination of content awareness, content update, content presentation, interaction with the end user and navigation. In some cases the software code already exists and the AI agent generates instructions for operating the software code.

One technical problem dealt with by the present disclosure is how to eliminate the need for typing when interacting with an application or with a website. Eliminating typing may enable the use of the website or the web application while performing another activity such as driving. Eliminating typing makes the site or the application accessible to blind people.
Eliminating typing also enables a more user-friendly interface and gives the user the feeling of interacting with a human being.

One technical solution is, while in a video or audio communication session with a website or a web application: identifying user interaction, the user interaction comprising gesturing or talking; receiving an end-user interaction input from an end user computing device; analyzing the end-user interaction input for determining a next step in the session; playing the next step to the end user; transforming the next step into a software code, the software code implementing the next step in the site or the application; transmitting the software code to the user computing device; and executing the software code on the end user computing device; wherein the executing of the software code manipulates the application or the website.

One other technical problem dealt with by the present disclosure is how to provide seamless integration of the agent with any website or web application.

One technical solution is providing embedded code for the website or application.

One other technical solution is, while in a video or audio communication session of an end user with a website or a web application: identifying, by at least one sensor, an interaction of the end user with an end user computing device; as a result of the identifying, capturing by the at least one sensor end-user interaction input; feeding a first LLM (Large Language Model) with a first input, the first input comprising the end-user interaction input and an instruction associated with a response of the first LLM; receiving the response from the first LLM; feeding a second LLM (Large Language Model) with a second input, the second input comprising the response of the first LLM, a software code of the website or web application and a second instruction for generating a second software code for implementing the response in the session; receiving the second code from the second LLM; transmitting the second code to the computing device of the end user; and executing the second code, wherein the executing of the second code manipulates the application or the website.

One exemplary embodiment of the disclosed subject matter is a computer implemented method; the method comprises: while in a video or audio communication session with a website or a web application, identifying user interaction; receiving an end-user interaction input from an end user computing device; analyzing said end-user interaction input for determining a next step in said session; transforming said next step into a software code, said software code implementing said next step in said site or said application; or identifying a second software code implementing said next step; transmitting said software to said user computing device; or transforming an instruction for executing said second software code and executing said software code or said second software code on said end user computing device; wherein said executing said software code manipulating said application or said website.

According to some embodiments, transforming the next step into software code utilizes application-software-code or site-software-code. According to some embodiments the method further comprises identifying user interaction and capturing the user interaction input in the end-user computing device as a result of the identifying. According to some embodiments the end-user interaction input comprises multimedia content, wherein the multimedia content comprises one member selected from a group consisting of audio or video, image, screen capture and text, and wherein the end-user interaction input comprises website or application software code, said website or application software code being captured as a result of said user interaction.
According to some embodiments the method further comprises transforming the response into audio and playing the response in the end user computing device. According to some embodiments the instruction comprises an instruction for providing the response as a human being.

One other exemplary embodiment of the disclosed subject matter is a computer implemented method; the method comprises: while in a video or audio communication session of an end user with a website or a web application, receiving, from an end user computing device, an end-user interaction input; feeding a first LLM (Large Language Model) with an at least one member selected from a group consisted of: said end-user interaction input, data extracted from a data repository associated with said website or said web application and an instruction associated with a response of said first LLM; receiving said response from said first LLM; transmitting said response to said end user computing device for playing said next step to said end user; and feeding a second LLM (Large Language Model) or said first LLM with one member selected from a group consisting of: said response of said first LLM, a software code of said website or a web application and a second instruction for generating a second software code for implementing said response in said session; for manipulating the website or web application in accordance with said response.

According to some embodiments, the interaction comprises talking or gesturing. According to some embodiments the method further comprises transforming the response into audio and playing the response in the end user computing device. According to some embodiments the instruction comprises an instruction for providing the response as a human being. According to some embodiments the end-user interaction input is captured in the computing device as a result of identifying user interaction, and the end-user interaction input comprises multimedia content capturing interaction of the end-user with the application or with the website, wherein the multimedia content comprises one member selected from a group consisting of audio or video, image, screen capture and text.

One other exemplary embodiment of the disclosed subject matter is at least one non-transitory computer-readable storage medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: while in a video or audio communication session with a website or a web application, identifying user interaction, the user interaction comprising gesturing or talking; receiving an end-user interaction input from an end user computing device; analyzing the end-user interaction input for determining a next step in the session; playing the next step to the end user; transforming the next step into a software code, the software code implementing the next step in the site or the application; transmitting the software to the user computing device; and executing the software code on the end user computing device; wherein the executing of the software code manipulates the application or the website.

Embodiments of the invention may be implemented as a computer process (method), a computing system, or as an article of manufacture, such as a computer program product or a non-transitory computer-readable media. The computer program product may be a computer storage media readable by a computer system and encoding a computer program of instructions for executing a computer process on the computer and network devices.
The computer program product may also be a propagated signal on a carrier readable by a computing system and encoding a computer program of instructions for executing a computer process.

THE BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

The present disclosed subject matter will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which corresponding or like numerals or characters indicate corresponding or like components. Unless indicated otherwise, the drawings provide exemplary embodiments or aspects of the disclosure and do not limit the scope of the disclosure. In the drawings:

Fig. 1 shows a block diagram of an environment for interactive AI (artificial intelligent) agent for video or audio communication with a website or a web application, in accordance with some exemplary embodiments of the subject matter;

Fig. 2 shows a flowchart diagram of a method for interactive AI (artificial intelligent) agent for video or audio communication with a website or a web application, in accordance with some exemplary embodiments of the subject matter; and

Fig. 3 shows a flowchart diagram of a scenario of interactive AI (artificial intelligent) agent for video or audio communication with a website or a web application, in accordance with some exemplary embodiments of the subject matter.

DETAILED DESCRIPTION

[0001] Fig. 1 shows a block diagram of an environment for interactive AI (artificial intelligent) agent for video or audio communication with a website or a web application, in accordance with some exemplary embodiments of the subject matter.

Environment 100 includes website or application server 101, an end-user computing device 102, system server 103 and database 104.

The end-user computing device 102 is the computing device from which the user interacts with the website or the web application.

System server 103 is the server of the AI agent for video or audio interaction with a website or a web application.

Website or application server 101 is the server of the website or the application.

Website or application server 101 includes an agent activation module 1011, an input collector and dispatcher module 1012 and a code execution module 1013.

System server 103 includes a first LLM (Large Language Model) 1031, a second LLM 1032, an STT (speech to text) module 1033, a TTS (text to speech) module 1034 and a handler module 1035.

The STT module 1033 is configured for converting the speech of the end user into text for interacting with the first LLM 1031. An example of such an STT is the DEEPGRAM application.

The first LLM 1031 is a Large Language Model. An example of such an LLM is ChatGPT. The first LLM 1031 receives the text from the STT module 1033, the end-user interaction input from the input collector and dispatcher module 1012 and instructions describing the expected behavior and the expected goal. The first LLM 1031 is configured for outputting a response in accordance with the received input. The response is output as text. The response is contextually relevant to the end-user interaction input and is a human-like response. The response includes the next step of the interaction.

The TTS (text to speech) module 1034 is configured for converting the text output of the first LLM 1031 into speech. The speech is sent to the computing device of the end user to be played to the end user. An example of text to speech is the ELEVEN LABS application.

The second LLM 1032 is a Large Language Model. An example of such an LLM is ChatGPT with various configurations. The second LLM 1032 receives the output of the first LLM 1031, instructions for writing code for the website or application, the code of the site or the application and the end-user interaction input. The second LLM 1032 is configured for generating code in accordance with the received input. The generated code is executed on the website or the web application and operates the next step on the website or web application. In some cases the software code already exists and the AI agent receives instructions to operate the existing code.

The handler module 1035 is configured for executing the first LLM 1031, the second LLM 1032, the STT module 1033 and the TTS module 1034 and for communicating with the end user device.

The agent activation module 1011, the input collector and dispatcher module 1012 and the code execution module 1013 are downloaded from the system server 103 to the computing device of the end-user 102 upon starting a session between the end-user and the website or the web application.

The agent activation module 1011, the input collector and dispatcher module 1012 and the code execution module 1013 are executed on the computing device of the end-user 102 during the session between the end-user and the website or the web application.

The agent activation module 1011 is configured for opening an audio / video chat area and for handling the interaction between the agent and the end user upon starting a session with the end user.

The input collector and dispatcher module 1012 is configured for identifying an interaction of the end-user with the website or application and for retrieving the end-user interaction input. The input collector and dispatcher module 1012 monitors the output of the sensors in the environment of the user for identifying the interaction.
The input collector and dispatcher module 1012 further monitors the changes in the code of the site or application for identifying the interaction. The monitored sensors include a camera, a microphone, a keyboard and a mouse. The input collector and dispatcher module 1012 collects the data from the sensors and from the screen upon identifying an interaction. The collected data includes screen captures and output from the camera and / or the microphone. The input collector and dispatcher module 1012 is further configured for transmitting data to the first LLM 1031.

The code execution module 1013 is configured for receiving code, or instructions for operating a certain code, from the second LLM 1032 and for executing the code on the application or the website. The code implements the next step by performing navigation, redirection, content update or content refresh on the website or application. The content update or refresh includes playing multimedia content.

Fig. 2 shows a block diagram of a method of an AI (artificial intelligent) agent for video or audio communication with a website or a web application, in accordance with some exemplary embodiments of the disclosed subject matter.

At block 200 the end-user enters the website or the application. As a result, the application or the website plays a welcoming message in addition to displaying the home page to the end-user.

At block 205 the agent at the end user computing device identifies interaction of the end-user with the website or with the web application. The interaction includes talking, typing, moving a mouse, hovering and performing gestures.

At block 210 the agent at the end user computing device captures the end-user interaction input and transmits the input to the system server.

At block 215 the agent at the system server receives the end-user interaction input. In one example the application is a transportation application and the user asks how to reach a certain location by vehicle.

If the interaction input includes audio, the agent transforms the audio into a text message utilizing speech to text methods such as DEEPGRAM.

At block 217 the agent at the system server feeds the LLM with parameters. Such parameters include the end-user interaction input, data extracted from a data repository associated with the website or application and instructions to the LLM application and / or the application. The instructions specify how to behave, what the goal of the interaction is and what the next step should address.

At block 220, the LLM analyzes the input data and provides a response. The response identifies the next step in human language.

In the example of the transportation application the response is “The best way is to drive via route 1, however I will show you other alternatives.”

At block 225 the agent at the server of the system transforms the text into a voice message and transmits the voice message to the end user computing device for playing the response to the user.

At block 230 the agent at the server feeds another LLM with: the response of the first LLM, the code of the website or application, data from the data repository, user interaction input and instructions for generating code. In the example of the transportation application the input to the second LLM includes the text message “The best way is to drive via route 1, however I will show you other alternatives”. In this example the code of the transportation application is extracted from the application and a map is extracted from a database.

At block 235 the second LLM generates the code. The code is configured for manipulating the website or web application. The manipulation causes presentation of multimedia content on the computing device of the end user or navigating the website or web application or presenting a new page or interacting with said end user.
In some embodiments the code is further configured for running the code of the website or application.

At block 240 the agent on the server transmits the code to the computing device of the end user for executing the code on the website or application. In the example of the transportation application the code causes the transportation application to switch to another page and to present the alternate routes.

It should be noted that the first LLM and the second LLM may be the same LLM.

Fig. 3 shows a flowchart diagram of a scenario of an AI (artificial intelligent) agent for video or audio communication with a website or a web application, in accordance with some exemplary embodiments of the disclosed subject matter. The scenario illustrates a session between a user and an AI agent that simulates a human being in an optical department. The AI agent is introduced to the user as a person with the name David.

The scenario is between the AI agent and a user. The AI agent issues voice messages which are played to the user by the website, and the user talks into the microphone of his computing device.

Referring now to the drawing:

At block 300 the end-user with the name Adam visits the online optical department of a website. He is greeted by David, the AI agent.

At block 305 the AI agent says: “Hi there! I’m David, your virtual optician. Welcome to our Optical Department. I’m here to assist you in finding the perfect eyewear today. Can you tell me if you’re looking for contact lenses, sunglasses, or prescription glasses?”

The AI agent also displays visual icons for contact lenses, sunglasses, and prescription glasses, making it easy for Adam to choose.

At block 310 the end-user says: “I’m here to look for prescription glasses for everyday use”.

The AI agent says: “Great choice, Adam! Now, before we dive into styles, I want to make sure you understand the different frame types. Let me show you a few options, so you know exactly what I mean”.

At block 315 the end-user and the AI agent interact to find the frame that fits the end-user. The AI agent displays images of each frame type with a brief animated explanation showing the differences.

The AI agent explains the differences between the frames: “As you can see here, the lens of a full rim frame is completely surrounded by the frame. It offers a classic and durable look. The lens of a half rim frame is only partially covered by the frame, making it lighter and giving a more minimalist feel, and the lenses of a rimless frame are held together without a frame, making it the lightest and most discreet style. Which one catches your eye? Would you prefer the full coverage of a full-rim, the lightweight feel of half-rim, or the sleek look of rimless?”

The end-user says: “I think I’d like to try the half-rim frames. They look sturdy”.

The AI agent displays on the website various brands of full rim frames and explains each brand: “The Ray-Ban Classic is timeless and iconic, known for its durability and style; the Oakley Airframe is sleek and sporty, designed for comfort during long wear; and the Thomas Rhodes is a premium sports eyewear choice for everyday use. Would you like to try any of these on virtually?”

The end-user says: “Yes, I’d love to try the Ray-Ban Classic. Let’s see how it looks on me.”

The AI agent says: “Perfect! Here’s the virtual try-on feature. Align your face with the camera, and I’ll show you a live preview of the Ray-Ban Classic.”

At block 320 the AI agent activates the virtual try-on, which enables the end-user to rotate his head and see how the frame fits from different angles.

The AI agent says: “How does this feel?”

The end-user says: “Yes, let’s go with it”.

Block 325 illustrates the interactive prescription entry support. At block 330 the AI agent fills in the prescription for the end-user.

First the AI agent says: “Excellent choice! Now, we just need to enter your prescription details.”

The end-user responds: “I have my prescription with me, but I’m not sure how to enter it correctly. Can you help me type it in?”

The AI agent says: “Please go ahead and read your prescription to me, and I’ll enter it. If you’re unsure about any part, let me know, and I can guide you through it.” The AI agent presents a page with the prescription form on the site. The end-user reads the prescription aloud. The AI agent fills in the details in accordance with the data received from the end-user and presents the prescription to the end-user. The details are displayed on the form.

The AI agent says: “Does everything look correct? If so, we can move on to finalizing your lenses.”

The end-user says: “Yes, it looks good. But just to be safe, can I upload a photo of my prescription for you to double-check?”

The AI agent responds: “Absolutely! Go ahead and upload it here, and I’ll verify the details for you.”

The end-user uploads the photo of the prescription. The system analyzes the photo and compares the information in the photo with the information that was entered by the end-user.

When the prescription is verified the AI agent says: “Everything looks great. Would you like to proceed with checkout, or do you have any last questions?” The end-user replies: “No, everything looks perfect. Let’s proceed to check out.”

Block 330 illustrates the final order summary. At block 320 the AI agent says: “Here’s a complete summary of your order:

  • Frame: Oakley half-rim frame.
  • Lens: Blue Light Filtering Single Vision, customized to your prescription
  • Shipping: Express (2-day delivery)”

The AI agent displays the summary of the order and says: “Your order is now being processed. You should receive your new glasses within 2 days. I’ll keep you updated on the delivery status via email.”

The end-user thanks the agent and the session is terminated.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and / or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and / or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof.

It should be noted that, in some alternative implementations, the functions noted in the block of a figure may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
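
The server-side flow of blocks 215 to 240 in Fig. 2 can be illustrated with a minimal sketch. It is not the implementation of the disclosure: the names InteractionInput, AgentTurn and run_agent_turn, the prompt wording and the stub functions standing in for the STT module, the two LLMs and the TTS module are all hypothetical and chosen only for illustration. A real deployment would replace the stubs with calls to actual speech and language services.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InteractionInput:
    """End-user interaction input captured on the end-user device (block 210)."""
    audio: Optional[bytes] = None   # microphone capture, if any
    text: Optional[str] = None      # typed text, if any


@dataclass
class AgentTurn:
    response_text: str  # next step in human language (block 220)
    speech: bytes       # voice message sent to the end-user device (block 225)
    code: str           # code to execute on the website or application (block 235)


def run_agent_turn(
    user_input: InteractionInput,
    site_code: str,
    repository_data: str,
    stt: Callable[[bytes], str],
    first_llm: Callable[[str], str],
    second_llm: Callable[[str], str],
    tts: Callable[[str], bytes],
) -> AgentTurn:
    # Block 215: if the input includes audio, convert it to text.
    if user_input.audio is not None:
        utterance = stt(user_input.audio)
    else:
        utterance = user_input.text or ""

    # Block 217: feed the first LLM with instructions, repository data and the input.
    first_prompt = (
        "Instruction: answer like a helpful human and state the next step.\n"
        f"Repository data: {repository_data}\n"
        f"End-user input: {utterance}"
    )
    response_text = first_llm(first_prompt)  # block 220

    # Block 225: turn the textual response into a voice message.
    speech = tts(response_text)

    # Block 230: feed the second LLM with the response, the site code and the data.
    second_prompt = (
        "Instruction: write code that implements the response on the site.\n"
        f"Response: {response_text}\n"
        f"Site code: {site_code}\n"
        f"Repository data: {repository_data}\n"
        f"End-user input: {utterance}"
    )
    code = second_llm(second_prompt)  # block 235

    # Block 240: the caller transmits `speech` and `code` to the end-user device.
    return AgentTurn(response_text, speech, code)


# Hypothetical stubs used only to make the sketch runnable without external services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_first_llm(prompt: str) -> str:
    return "The best way is to drive via route 1, however I will show you other alternatives."


def fake_second_llm(prompt: str) -> str:
    return "showPage('alternate-routes')"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


turn = run_agent_turn(
    InteractionInput(audio=b"How do I get to the airport by car?"),
    site_code="function showPage(name) { /* ... */ }",
    repository_data="map of the area",
    stt=fake_stt,
    first_llm=fake_first_llm,
    second_llm=fake_second_llm,
    tts=fake_tts,
)
print(turn.response_text)
print(turn.code)
```

Passing the four services as plain callables keeps the sketch independent of any particular vendor, and it also reflects the note that the first LLM and the second LLM may be the same LLM: the same callable can be supplied for both parameters.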

Claims

What is claimed is:

1. A computer implemented method; the method comprises:
while in a video or audio communication session with a website or a web application
identifying user interaction;
receiving an end-user interaction input from an end user computing device;
analyzing said end-user interaction input for determining a next step in said session;
transforming said next step into a software code, said software code implementing said next step in said site or said application; or identifying a second software code implementing said next step;
transmitting said software to said user computing device; or transforming an instruction for executing said second software code and executing said software code or said second software code on said end user computing device; wherein said executing said software code manipulating said application or said website.

2. The method of claim 1, wherein said transforming said next step into software code utilizing application-software-code or site-software-code.

3. The method of claim 1, further comprising identifying user interaction and capturing said user interaction input in said end-user computing device as a result of said identifying.

4. The method of claim 3, wherein said end-user interaction input comprises multimedia content, wherein said multimedia content comprises one member selected from a group consisting of audio or video, image, screen capture and text wherein said end-user interaction input comprises website or application software code, wherein said website or application software code being captured as a result of said user interaction.

5. The method of claim 1, further comprising transforming said response into audio and playing said response in said end user computing device.

6. The method of claim 1, wherein said instruction comprises instruction for providing said response as a human being.

7. A computer implemented method; the method comprises:
while in a video or audio communication session of an end user with a website or a web application,
receiving, from an end user computing device, an end-user interaction input;
feeding a first LLM (Large Language Model) with an at least one member selected from a group consisted of: said end-user interaction input, data extracted from a data repository associated with said website or said web application and an instruction associated with a response of said first LLM;
receiving said response from said first LLM;
transmitting said response to said end user computing device for playing said next step to said end user; and
feeding a second LLM (Large Language Model) or said first LLM with one member selected from a group consisting of: said response of said first LLM, a software code of said website or a web application and a second instruction for generating a second software code for implementing said response in said session; for manipulating the website or web application in accordance with said response.

8. The method of claim 7, wherein said interaction comprises talking or gesturing.

9. The method of claim 7, further comprising transforming said response into audio and playing said response in said end user computing device.

10. The method of claim 7, wherein said instruction comprises instruction for providing said response as a human being.

11. The method of claim 7, wherein said end-user interaction input being captured in said computing device as a result of identifying user interaction and wherein said end-user interaction input comprises multimedia content capturing interaction of said end-user with said application or with said website, wherein said multimedia content comprises one member selected from a group consisting of audio or video, image, screen capture and text.

12. At least one non-transitory computer-readable storage medium having stored thereon instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: while in a video or audio communication session with a website or a web application: identifying user interaction; receiving an end-user interaction input from an end user computing device; analyzing said end-user interaction input for determining a next step in said session; playing said next step to said end user; transforming said next step into a software code, said software code implementing said next step in said site or said application; transmitting said software code to said user computing device; and executing said software code on said end user computing device; wherein said executing said software code manipulating said application or said website.