Large model-based trip planning method and device, agent and storage medium

A large-model-based itinerary planning method combines a spoken video with interactive screen elements to address the complexity of itinerary planning and the tendency to overlook travel needs, making planning more efficient and accurate and improving the user experience.

CN120509663B · Active · Publication Date: 2026-06-19 · BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD
Filing Date: 2025-05-16
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

When users travel for leisure or business, the itinerary planning process is complex and important travel needs are easily overlooked, resulting in low planning efficiency and difficulty in meeting actual needs.

Method used

In the large-model-based itinerary planning method, the target object's demand information is obtained, and a spoken video and screen elements that match the demand information are displayed. A virtual object in the video indicates the screen elements through specified actions, and the target object settles its extended itinerary demands by interacting with those elements. The large model then processes the extended itinerary demand information and the conversation information to generate itinerary planning information.

Benefits of technology

It improves the accuracy and efficiency of identifying travel demands, avoids missing important needs, provides an immersive interactive environment, and enhances the user experience.



Abstract

This disclosure provides a method, apparatus, intelligent agent, and storage medium for trip planning based on a large model, relating to the field of artificial intelligence technology, particularly deep learning, large models, big data, and smart e-commerce. The trip planning method based on a large model includes: acquiring the demand information of a target object; displaying a spoken video and visual elements matching the demand information, wherein the spoken video shows a virtual object indicating the visual elements through specified actions while speaking, and the visual elements represent extended trip demand information that differs semantically from the demand information; and processing at least one of the extended trip demand information and conversation information using a large model to obtain trip planning information, wherein the conversation information is data that the target object inputs in response to the spoken video, and the extended trip demand information is determined based on interaction behavior with the visual elements.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, particularly to the fields of deep learning, large models, big data, and smart e-commerce.

Background Technology

[0002] With the rapid development of internet technology, users can input relevant text queries via smartphones and other smart devices. Related internet platforms can then use large-scale models to perform semantic understanding of the user-input text and output feedback information.

Summary of the Invention

[0003] This disclosure provides a method, apparatus, intelligent agent, electronic device, and storage medium for route planning based on a large model.

[0004] According to one aspect of this disclosure, a large-scale model-based itinerary planning method is provided, comprising: obtaining the demand information of the target object;

[0005] A spoken video and visual elements that match the demand information are displayed. The spoken video shows a virtual object indicating the visual elements through specified actions while speaking. The visual elements represent extended trip demand information that differs semantically from the demand information.

[0006] The trip planning information is obtained by processing at least one of the extended trip demand information and the conversation information using a large model. The conversation information is data that the target object inputs in response to the spoken video, and the extended trip demand information is determined based on interaction behavior with the screen elements.

[0007] According to another aspect of this disclosure, a large-scale model-based trip planning device is provided, comprising: an acquisition module for acquiring demand information of a target object; a display module for displaying a spoken video and screen elements matching the demand information, wherein the spoken video shows a virtual object indicating the screen elements through specified actions while speaking, and the screen elements represent extended trip demand information that differs semantically from the demand information; and a trip planning information acquisition module for processing at least one of the extended trip demand information and session information using the large-scale model to obtain trip planning information, wherein the session information is data that the target object inputs in response to the spoken video, and the extended trip demand information is determined based on interaction behavior with the screen elements.

[0008] According to another aspect of this disclosure, an artificial intelligence agent is provided, comprising: an input module for receiving input information; a processing module for determining a target task based on the input information received by the input module, determining a large model based on the target task, and executing a route planning method based on a large model provided in the embodiments of this disclosure by calling the large model to obtain output information; and an output module for outputting the output information obtained by the processing module.
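The three-module agent described above can be illustrated with a minimal Python sketch. The class name `Agent`, the task names, and the keyword-based task detection are all hypothetical placeholders; the disclosure does not specify how the target task or the large model is chosen.

```python
from typing import Callable, Dict

# A "large model" is represented here as any callable from text to text (assumption).
Model = Callable[[str], str]


class Agent:
    """Hypothetical agent with input, processing, and output modules."""

    def __init__(self, models: Dict[str, Model]) -> None:
        self.models = models  # maps a task name to the model that handles it

    def receive(self, raw: str) -> str:
        """Input module: accept the input information."""
        return raw.strip()

    def process(self, text: str) -> str:
        """Processing module: pick a task, pick a model for it, call the model."""
        lowered = text.lower()
        # Placeholder task detection; a real system would not rely on keywords.
        task = "trip_planning" if ("trip" in lowered or "travel" in lowered) else "chat"
        return self.models[task](text)

    def emit(self, output: str) -> str:
        """Output module: hand the result back to the caller."""
        return output

    def run(self, raw: str) -> str:
        return self.emit(self.process(self.receive(raw)))
```

In use, the caller would register one callable per task and pass user input to `run`; swapping in a different task detector only touches `process`.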

[0009] According to another aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform a large-model-based route planning method provided in embodiments of this disclosure.

[0010] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause a computer to execute a large-model-based route planning method provided in embodiments of this disclosure.

[0011] According to another aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the large-model-based route planning method provided in embodiments of this disclosure.

[0012] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0013] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0014] Figure 1 schematically shows an exemplary system architecture to which large-model-based travel planning methods and apparatus can be applied according to embodiments of the present disclosure;

[0015] Figure 2 schematically shows a flowchart of a large-model-based route planning method according to an embodiment of the present disclosure;

[0016] Figure 3 illustrates an application scenario of the large-model-based route planning method provided according to embodiments of the present disclosure;

[0017] Figure 4 is a schematic diagram of a service system for performing a large-model-based route planning method provided according to embodiments of the present disclosure;

[0018] Figure 5 illustrates a system architecture suitable for implementing the large-model-based route planning method provided according to embodiments of the present disclosure;

[0019] Figure 6 schematically shows a block diagram of a large-model-based route planning apparatus according to an embodiment of the present disclosure;

[0020] Figure 7 is a schematic diagram of the structure of an artificial intelligence agent according to embodiments of the present disclosure; and

[0021] Figure 8 shows a schematic block diagram of an example electronic device 800 for implementing a model-based route planning method according to embodiments of the present disclosure.

Detailed Implementation

[0022] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0023] In the technical solution disclosed herein, the acquisition, storage, and application of user personal information comply with the provisions of relevant laws and regulations, necessary confidentiality measures have been taken, and there is no violation of public order and good morals.

[0024] The inventors discovered that before undertaking travel activities such as tourism and business trips, users often search for information on attractions, accommodations, and transportation options on relevant internet platforms to plan their itineraries. However, the process of itinerary planning is usually quite complex and prone to overlooking important travel needs, resulting in low planning efficiency and difficulty in meeting actual travel requirements.

[0025] Embodiments of this disclosure provide a method, apparatus, intelligent agent, electronic device, and storage medium for trip planning based on a large model. The method includes: acquiring demand information of a target object; displaying a spoken video and visual elements matching the demand information, wherein the spoken video shows a virtual object indicating the visual elements through specified actions while speaking, and the visual elements represent extended trip demand information that semantically differs from the demand information; and processing at least one of the extended trip demand information and session information using a large model to obtain trip planning information, wherein the session information is data that the target object inputs in response to the spoken video, and the extended trip demand information is determined based on interaction behavior with the visual elements.

[0026] According to embodiments of this disclosure, displaying a spoken video that matches the demand information lets the target object, by watching the virtual object's specified actions while it speaks, naturally attend to the displayed screen elements, which represent extended trip demand information that semantically differs from the demand information. While watching the spoken video, the target object is thus prompted by the virtual object's immersive presentation to quickly settle the extended trip demand information by interacting with the screen elements, and may also be prompted to express further trip planning demands by inputting conversation information. This improves the accuracy and efficiency of eliciting the target object's trip demands, keeps important trip demands from being overlooked, improves the accuracy and timeliness of trip planning, and improves the user experience by providing an immersive interactive environment.

[0027] Figure 1 schematically shows an exemplary system architecture to which large-model-based travel planning methods and apparatus can be applied according to embodiments of the present disclosure.

[0028] It is important to note that Figure 1 shows merely one example of a system architecture to which embodiments of this disclosure can be applied, to help those skilled in the art understand the technical content of this disclosure; it does not imply that embodiments of this disclosure cannot be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture for applying the large-model-based trip planning method and apparatus may include a terminal device, but the terminal device may implement the large-model-based trip planning method and apparatus provided by embodiments of this disclosure without interacting with a server.

[0029] As shown in Figure 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 serves as a medium for providing a communication link between the terminal devices 101, 102, and 103 and the server 105. The network 104 may include various connection types, such as wired and / or wireless communication links, etc.

[0030] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as knowledge reading applications, web browser applications, search applications, instant messaging tools, email clients, and / or social platform software, etc. (for example only).

[0031] Terminal devices 101, 102, and 103 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.

[0032] Server 105 can be a server that provides various services, such as a backend management server that supports the content browsed by users using terminal devices 101, 102, and 103 (for example only). The backend management server can analyze and process data such as received user requests, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal devices.

[0033] Server 105 can be a cloud server, also known as a cloud computing server or cloud host. It is a host product in the cloud computing service system, which solves the shortcomings of traditional physical hosts and VPS services ("Virtual Private Server", or simply "VPS"), such as high management difficulty and weak business scalability. Server 105 can also be a server for a distributed system or a server combined with blockchain.

[0034] It should be noted that the large-model-based travel planning method provided in this disclosure embodiment can generally be executed by terminal devices 101, 102, or 103. Correspondingly, the large-model-based travel planning device provided in this disclosure embodiment can also be disposed in terminal devices 101, 102, or 103.

[0035] Alternatively, the large-model-based travel planning method provided in this embodiment can generally be executed by server 105. Correspondingly, the large-model-based travel planning device provided in this embodiment can generally be located in server 105. The large-model-based travel planning method provided in this embodiment can also be executed by a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105. Correspondingly, the large-model-based travel planning device provided in this embodiment can also be located in a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105.

[0036] For example, after a user inputs their requirements, terminal devices 101, 102, and 103 can obtain the requirements and send them to server 105. Server 105 then displays voice-over video and visual elements that match the requirements; and calls a large model to process at least one of the session information and extended travel requirements input by the user through any one of terminal devices 101, 102, and 103 to obtain travel planning information. Alternatively, a server or server cluster capable of communicating with terminal devices 101, 102, 103, and / or server 105 can display voice-over video and visual elements that match the requirements; and call a large model to process at least one of the session information and extended travel requirements input by the user to obtain travel planning information.

[0037] It should be understood that the number of terminal devices, networks, and servers shown in Figure 1 is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0038] Figure 2 schematically shows a flowchart of a large-model-based route planning method according to an embodiment of the present disclosure.

[0039] As shown in Figure 2, the method includes operations S210~S230.

[0040] In operation S210, the demand information of the target object is obtained.

[0041] In operation S220, a spoken video and screen elements that match the demand information are displayed.

[0042] In operation S230, at least one of the extended trip demand information and session information is processed using a large model to obtain trip planning information.
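Operations S210 to S230 can be sketched as three small Python functions. This is only an illustration under stated assumptions: the function names, the `PlanningSession` container, the prompt layout, and the representation of the large model as a plain callable are all hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PlanningSession:
    demand_info: str                                          # result of S210
    screen_elements: List[str] = field(default_factory=list)  # candidates shown in S220
    video_id: Optional[str] = None                            # spoken video shown in S220


def acquire_demand(raw_input: str) -> str:
    """S210: obtain the demand information (here, trimmed text input)."""
    return raw_input.strip()


def display_video_and_elements(
    demand_info: str,
    pick_video: Callable[[str], str],
    pick_elements: Callable[[str], List[str]],
) -> PlanningSession:
    """S220: choose a spoken video and screen elements matching the demand."""
    return PlanningSession(demand_info, pick_elements(demand_info), pick_video(demand_info))


def plan_trip(
    session: PlanningSession,
    large_model: Callable[[str], str],
    selected: Optional[str] = None,
    conversation: Optional[str] = None,
) -> str:
    """S230: process at least one of the extended demand and the conversation input."""
    if selected is None and conversation is None:
        raise ValueError("need an extended demand, conversation input, or both")
    parts = [f"Demand: {session.demand_info}"]
    if selected is not None:
        parts.append(f"Extended demand: {selected}")
    if conversation is not None:
        parts.append(f"Conversation: {conversation}")
    return large_model("\n".join(parts))
```

Passing the video picker, element picker, and model in as callables keeps the sketch independent of any particular video library or model service.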

[0043] According to embodiments of this disclosure, demand information can represent the target object's demand for travel planning, and demand information can include any demand information related to travel planning, such as travel destination and mode of transportation.

[0044] It should be noted that while requirement information can be determined based on the text input by the target object, it is not limited to this. Requirement information can also be determined through other interactive behaviors of the target object, such as voice or images input by the target object. Alternatively, it can be determined based on the target object's gesture interaction to identify corresponding requirement information options.

[0045] According to embodiments of this disclosure, the spoken video shows a virtual object indicating screen elements through specified actions while speaking. The spoken video can be a video in which the virtual object verbally explains content related to the demand information. For example, a spoken video could be a video in which a virtual object explains information such as attractions and featured videos of destination city A. The virtual object in the spoken video can make its explanation more natural by performing actions while speaking, and the specified actions performed by the virtual object can indicate the screen elements displayed on the playback screen of the spoken video.

[0046] It should be noted that virtual objects can be fictional characters such as animated figures, but are not limited to these. Virtual objects can also be real objects obtained with the authorization of the relevant real person. The embodiments of this disclosure do not limit the specific type of virtual object.

[0047] According to embodiments of this disclosure, the spoken video can be determined from a preset video library, or the spoken video can be generated by driving a spoken video script corresponding to the demand information. Embodiments of this disclosure do not limit the specific method of determining the spoken video, as long as it can represent the virtual object explaining the travel information related to the demand information.

[0048] According to embodiments of this disclosure, the screen elements represent extended trip demand information that differs semantically from the demand information. Extended trip demand information can be, for example, information missing from the demand information. For instance, if the demand information is "travel to Province A," the extended trip demand information could include "1-day trip" or "2-3 day trip," supplying the missing trip duration. It is not limited to this, however; extended trip demand information can also narrow the semantic scope of the demand information. For example, if the demand information is "travel to Province A," the extended trip demand information could include destinations within Province A, such as "City A" or "Town B."
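The two kinds of extended trip demand information in this paragraph (filling a missing attribute, and narrowing an existing one) can be sketched in Python as follows. The lookup tables `SLOT_OPTIONS` and `SUB_DESTINATIONS` and the dictionary form of the demand are hypothetical; the disclosure does not say how candidates are derived.

```python
from typing import Dict, List

# Hypothetical options offered when an attribute is absent from the demand.
SLOT_OPTIONS: Dict[str, List[str]] = {
    "duration": ["1-day trip", "2-3 day trip"],
}

# Hypothetical table used to narrow a broad destination.
SUB_DESTINATIONS: Dict[str, List[str]] = {
    "Province A": ["City A", "Town B"],
}


def candidate_extended_demands(demand: Dict[str, str]) -> List[str]:
    """Return candidate extended demands for a parsed demand such as
    {"destination": "Province A"}."""
    candidates: List[str] = []
    # Case 1: supply attributes that the demand does not mention at all.
    for slot, options in SLOT_OPTIONS.items():
        if slot not in demand:
            candidates.extend(options)
    # Case 2: narrow the semantic scope of an attribute that is present.
    candidates.extend(SUB_DESTINATIONS.get(demand.get("destination", ""), []))
    return candidates
```

Each returned string would then back one screen element shown alongside the spoken video.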

[0049] It should be noted that the embodiments of this disclosure do not limit the specific semantic type of the extended trip demand information, as long as it differs semantically from the demand information. Screen elements can take any form that can be displayed on the screen, such as text or icons; the embodiments of this disclosure do not limit the display method of screen elements.

[0050] In one embodiment, a spoken video can be played on a display interface such as a screen. While explaining the extended travel demand information represented by the screen elements, the virtual object can indicate multiple screen elements by pointing a finger toward the bottom of the screen. This allows the target object, while watching the spoken video, to naturally focus on the extended travel demand information represented by the screen elements through the virtual object's specified actions, and thus to notice potential travel demands promptly while remaining immersed in the video.

[0051] According to embodiments of this disclosure, the session information is data that the target object inputs in response to the spoken video, and the extended trip requirement information is determined based on interaction behavior with the screen elements.

[0052] In one embodiment, a target object can determine its required extended travel itinerary information by performing interactive actions on screen elements. For example, the target object can perform a click interaction on a target screen element representing a "2 to 3-day trip" to determine its required extended travel itinerary information.

[0053] In one embodiment, the target user can also input conversational information that is the same as or different from the extended travel demand information represented by the screen elements, based on the narrated content and screen elements of the virtual object while watching the narrated video, so that the target user can input actual travel demand through flexible interactive methods.

[0054] According to embodiments of this disclosure, processing at least one of extended travel demand information and conversation information using a large model can be understood as using the large model to perform travel planning based on at least one of the extended travel demand information selected by the target object's interactive behavior and the input conversation information. The resulting travel planning information can match the target object's travel demand intention. The travel demand intention may include at least one of a first travel demand intention represented by the extended travel demand information and a second travel demand intention represented by the conversation information, as well as part or all of the demand intention represented by the demand information. This leverages the powerful semantic understanding capabilities of the large model to output travel planning information that accurately matches the target object's travel demand, thereby improving the efficiency of the target object's travel planning.

[0055] It should be noted that the itinerary planning information can be represented by any type of data such as text, images, maps, and videos. The embodiments disclosed herein do not limit the specific data types included in the itinerary planning information.

[0056] In one embodiment, the trip planning information can be structured text information. For example, structured text information can include multiple types of data based on structured hierarchical relationships, such as text theme, order of travel destinations, description of transportation methods, and ticket purchase links. This allows the target audience to prepare for their trip by browsing the text information and interacting with the links within it, thereby improving the user experience.
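One way to picture such structured text is a small hierarchy serialized to JSON, sketched below in Python. The field names (`theme`, `legs`, `transport`, `ticket_url`) are illustrative assumptions, and the URL in the usage note is a placeholder, not a real booking link.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Leg:
    destination: str   # one stop, in visiting order
    transport: str     # description of how to get there
    ticket_url: str    # ticket purchase link (placeholder values only)


@dataclass
class Itinerary:
    theme: str                                    # text theme of the plan
    legs: List[Leg] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the hierarchy so a client can render text and links."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

For example, `Itinerary("2 to 3-day trip in Province A", [Leg("City A", "high-speed rail", "https://example.com/tickets")]).to_json()` yields nested JSON in which the order of `legs` carries the order of travel destinations.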

[0057] It should be understood that the display screen for displaying spoken video and visual elements can be a display device of any type of terminal device such as a mobile phone, tablet computer, or smart glasses. The embodiments of this disclosure do not limit the type of display device for displaying spoken video and visual elements.

[0058] According to embodiments of this disclosure, extended trip demand information represents at least one of the following demand attributes: trip duration attribute, trip budget attribute, waypoint demand attribute, mode of transportation demand attribute, and traveler attribute.

[0059] The trip duration attribute can represent the trip duration of the target object. Extended trip demand information representing the trip duration attribute can be such as "1-day trip" or "2 to 3-day trip".

[0060] The trip budget attribute can represent the budgeted amount for travel expenses. Extended trip demand information representing the trip budget attribute can be, for example, "1000 to 2000 yuan" or "10,000 to 20,000 yuan". It should be noted that the trip budget attribute can represent the specific amount of money spent on the trip plan, or it can represent the spending range that the target individual can afford, or it can represent the travel spending level of the target individual. For example, extended trip demand information could be "luxury travel" or "economic travel". The embodiments of this disclosure do not limit the monetary attribute represented by the extended trip demand information.

[0061] The waypoint requirement attribute represents the cities, villages, scenic spots, and other locations that the target traveler needs to pass through or stop at during their trip. It should be understood that a waypoint represents a location that the target traveler needs to reach, which can be any of the starting point, intermediate point, or ending point in the itinerary planning information.

[0062] The transportation demand attribute can represent any type of transportation, such as high-speed rail, airplane, car, or subway.

[0063] Traveler attributes can represent the number of people traveling, or they can represent other traveler-related attributes such as the relationship between the traveler and the target. For example, extended travel demand information representing traveler attributes could be "family trip" or "traveling with parents."

[0064] It should be noted that extended travel demand information can also represent other types of demand attributes, such as whether to bring a pet, food preference attributes, etc. The embodiments of this disclosure do not limit the specific types of demand attributes represented by extended travel demand information.

[0065] In one embodiment, extended itinerary requirement information can represent multiple types of requirement attributes. For example, extended itinerary requirement information could be "family trip for 3 people" or "luxury trip for 2 people." The extended itinerary requirement information represented by screen elements can also represent one or more types of requirement attributes.
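The demand attributes listed above, and the fact that one screen element may carry several of them, can be modeled with a short Python sketch. The names `DemandAttribute` and `ScreenElement` are hypothetical and only illustrate one possible data structure.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class DemandAttribute(Enum):
    DURATION = "trip duration"
    BUDGET = "trip budget"
    WAYPOINT = "waypoint"
    TRANSPORT = "mode of transportation"
    TRAVELER = "traveler"


@dataclass(frozen=True)
class ScreenElement:
    label: str                              # text shown on screen
    attributes: FrozenSet[DemandAttribute]  # one or more attributes it expresses


# "luxury trip for 2 people" expresses both a budget level and a traveler count.
luxury_for_two = ScreenElement(
    "luxury trip for 2 people",
    frozenset({DemandAttribute.BUDGET, DemandAttribute.TRAVELER}),
)

# "1-day trip" expresses only a duration.
one_day = ScreenElement("1-day trip", frozenset({DemandAttribute.DURATION}))
```

Because the disclosure leaves the attribute list open (pets, food preferences, and so on), an enum like this would need to be extensible in practice.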

[0066] To facilitate explanation of the methods provided in the embodiments of this disclosure, the following embodiments represent the extended travel demand information characterized by screen elements as candidate extended travel demand information, and the extended travel demand information determined by the target object's interaction behavior with the screen elements as target extended travel demand information. It should be understood that, when the target object selects and confirms the target screen element, the target extended travel demand information represented by the target screen element can be processed using a large model to perform travel planning for the target object and output travel planning information.

[0067] According to embodiments of this disclosure, interactive behaviors for screen elements include at least one of the following: voice interaction behavior, gesture interaction behavior, eye-tracking interaction behavior, and touch operation behavior.

[0068] Voice interaction behavior can represent the voice audio data output by a target object while watching a spoken video. This voice audio data can be used to represent the selection of a target screen element. For example, voice interaction behavior can be the voice audio data input by the target object representing "a family trip for three people".

[0069] In one embodiment, a speech recognition algorithm is used to process the speech audio data used for voice interaction. The resulting speech recognition information, such as speech recognition text and speech recognition vectors, can be used to determine the target screen elements required by the target object from the screen elements, and then determine the target extended journey requirement information for the large model.
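As a rough illustration of the step from speech recognition text to a target screen element, the Python sketch below uses fuzzy string matching from the standard library. Fuzzy matching is a stand-in chosen for this example, not the technique named in the disclosure; the function name and the 0.6 cutoff are assumptions, and the speech recognizer itself is out of scope.

```python
from difflib import get_close_matches
from typing import List, Optional


def match_screen_element(
    recognized_text: str, labels: List[str], cutoff: float = 0.6
) -> Optional[str]:
    """Return the screen element label closest to the recognized speech,
    or None when nothing is similar enough."""
    by_lower = {label.lower(): label for label in labels}
    hits = get_close_matches(
        recognized_text.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[hits[0]] if hits else None
```

A production system would more likely compare the speech recognition vectors mentioned above; string similarity is used here only because it is self-contained.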

[0070] Touch operation behavior can be understood as the interactive behavior of a target object selecting target screen elements by operating on touchable components such as a touchable display screen or physical buttons on a smart terminal. Touch operation behavior can include any touch-related operation behavior such as double-click operation, single-click operation, and selection operation. The embodiments of this disclosure do not limit the specific operation type of touch operation behavior.

[0071] In one embodiment, the target object can perform a click operation on a touchable display screen on the target screen element representing "luxury travel", which the virtual object indicates with its finger, thereby determining the target extended itinerary requirement information representing the luxury travel requirement attribute.

[0072] Gesture interaction can be generated without touching the touchable components of the smart terminal. For example, the front-facing camera of the smart terminal device can detect the finger movement trajectory of the target object to select and determine the target screen element that the target object needs to select.

[0073] Eye-tracking interaction can be understood as an interactive operation determined by detecting the position of the target object's eyeballs. For example, a front-facing camera can be used to detect the eye movement trajectory of the target object facing the display screen to determine the target screen element that the target object needs to select.

[0074] In one embodiment, the target screen element selected and confirmed by the target object can be determined by detecting any one or more of the target object's voice interaction behavior, gesture interaction behavior, eye-tracking interaction behavior, and touch operation behavior. While watching the spoken video, the target object is guided by the virtual object's specified actions, which indicate the displayed screen elements, and can then interact with the elements representing the extended travel information it needs. This achieves an immersive "watch and select" interactive experience, improving the efficiency with which the target object selects travel-related information and enhancing the accuracy of the generated travel planning information.
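The four interaction behaviors can all be reduced to "which screen element did the target object pick", as in the Python sketch below. Everything here is an assumption made for illustration: pointer-like inputs (touch, gesture, gaze) are reduced to a screen coordinate and hit-tested against element bounding boxes, and voice input is matched by a simple substring check.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ElementBox:
    label: str
    x: float
    y: float
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


@dataclass(frozen=True)
class InteractionEvent:
    kind: str          # "touch", "gesture", "gaze", or "voice"
    px: float = 0.0    # screen coordinate for pointer-like events
    py: float = 0.0
    text: str = ""     # recognized speech for voice events


def resolve_target(event: InteractionEvent, boxes: List[ElementBox]) -> Optional[str]:
    """Map an interaction event to the label of the selected screen element."""
    if event.kind == "voice":
        spoken = event.text.strip().lower()
        for box in boxes:
            if box.label.lower() in spoken:
                return box.label
        return None
    if event.kind in {"touch", "gesture", "gaze"}:
        for box in boxes:
            if box.contains(event.px, event.py):
                return box.label
        return None
    raise ValueError(f"unsupported interaction kind: {event.kind}")
```

A `None` result corresponds to the case where no element was selected, in which the conversation input alone can be passed to the large model.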

[0075] In one embodiment, if no interactive behavior between the target object and the screen elements is detected while the target object watches the spoken video, it can be understood that the extended travel requirement information represented by the screen elements does not meet the target object's travel needs. The target object can also input text representing its travel needs as conversation information, based on the content narrated by the virtual object or the extended travel requirement information prompted by the screen elements. The spoken video and the screen-element prompts thus guide the target object to consider its actual travel needs more accurately and quickly, so that its further travel intentions can be explored in depth. By processing the conversation information with a large model to understand the target object's needs, travel planning information that accurately meets its actual requirements can be obtained.

[0076] In one embodiment, the large model can perform trip planning based on the conversation information and the extended travel demand information indicated by the target object's interactive behavior. Under the combined prompting of the virtual object's verbal explanation and the screen elements indicated by the specified actions, the potential travel needs of the target object can be mined from its diverse interactive behaviors, so that trip planning information is generated accurately and promptly to meet the target object's travel planning needs.

[0077] It should be noted that the acquisition of data involved in any embodiment of this disclosure, including but not limited to session information, extended travel request information, and request information, was carried out under the condition of obtaining authorization from the relevant users or organizations, and the purpose of acquiring the data was clearly stated before acquisition as pushing travel planning information to the target audience. At the same time, necessary confidentiality measures were taken for the acquired data to avoid information leakage, complying with relevant laws and regulations and not violating public order and good morals.

[0078] According to embodiments of this disclosure, the designated action includes at least one of the following actions of the virtual object: eye movements, head movements, and gestures.

[0079] In one embodiment, eye movements represent the direction or location indicated by the virtual object's eyes. For example, eye movements could indicate that the virtual object's eyes are looking at elements displayed below the spoken video. By using eye movements to indicate elements displayed in the spoken video, the virtual object can naturally guide the target audience to focus on the elements during the spoken explanation. This allows the target audience to naturally and smoothly focus on available extended travel needs information while immersing themselves in the virtual object's explanation, enabling them to quickly determine the target extended travel needs representing their travel requirements through interactive behavior. This improves the accuracy and efficiency of the travel planning information output by the large model.

[0080] In one embodiment, head movements can be understood as the actions of a virtual object's head turning or moving. These head movements can be used to indicate displayed screen elements. For example, a head-down gesture can be used to indicate screen elements below a spoken video, allowing the target audience to promptly focus on available extended travel needs information. This provides an immersive interactive experience and improves the efficiency and accuracy of exploring potential needs information and generating travel planning information.

[0081] In one embodiment, a gesture can be understood as a hand action by which the virtual object indicates screen elements. For example, a gesture can be a single hand sliding to the left to point to the display area of screen elements on the left side of the spoken video. Such a large range of motion prompts the target object to pay attention to the available extended travel needs information, thereby providing an immersive interactive experience and improving both the efficiency of exploring potential needs information and the efficiency and accuracy of generating travel planning information.

[0082] In one embodiment, the specified action may include at least two of the following: eye movements, head movements, and gestures of the virtual object. This can be combined with the diverse actions displayed by the virtual object during the verbal expression process to enhance the prompting of the target audience to pay attention to the available extended travel demand information in a timely manner, thereby providing an immersive interactive experience and improving the efficiency of exploring potential demand information and the efficiency and accuracy of generating travel planning information.

[0083] Figure 3 illustrates an application scenario of the large-model-based trip planning method provided according to embodiments of the present disclosure.

[0084] As shown in Figure 3, after the user inputs demand information, the display interface 300 of the terminal device shows a voice-over video and visual elements matching the demand information. The voice-over video can represent a virtual object narrating tourism content about City A. During the narration, the virtual object can use specified gestures to indicate the first visual element 311 and the second visual element 312 displayed on the left side of the display interface 300. The first visual element 311 and the second visual element 312 represent the extended travel needs information "In-depth Experience" and "Attraction Check-in", respectively. The user can interact with the first visual element 311 to determine the selected target extended travel needs information, and can also enter actual travel needs as conversation information in the input box 331 on the display interface 300. Furthermore, the display interface 300 can display an attraction video window 321 that plays video content introducing attractions in City A, prompting the user, while immersed in the voice-over video, to determine target extended travel needs information through interactive actions and to enter conversation information through the input box 331. The trained large model deployed on the server can then perform trip planning by processing the conversation information, the target extended travel needs information, and the demand information, obtaining trip planning information that meets the user's actual travel needs.

[0085] In one embodiment, the visual elements are obtained by using a large language model to detect travel requirements based on demand information and the historical preference attributes of the target object.

[0086] The historical preference attributes can be determined based on the target object's historical interaction behavior. For example, they can be determined based on interactions such as liking, forwarding, and commenting during historical time periods. In one example, they can be determined based on resources that the target object consumed in the past and was satisfied with.

[0087] By utilizing large language models to process demand information and historical preference attributes, these models can comprehensively and accurately understand the preferences and needs of the target audience. This allows the extended travel demand information corresponding to the output visual elements to accurately represent the target audience's potential travel needs. In this way, by displaying visual elements representing extended travel demand information, the potential needs of the target audience can be accurately uncovered. The target audience can then conveniently and quickly provide feedback on their potential travel needs through interactive behaviors with these visual elements. The large model, by understanding the target audience's extended travel demand information and overall demand information, can obtain more accurate travel planning information to meet the target audience's actual needs.

[0088] In one embodiment, the semantic similarity of screen elements can be obtained by performing semantic similarity detection between the demand information and preset extended travel demand information. Preset extended travel demand information whose semantic similarity meets the predicted similarity interval threshold is selected as the extended travel demand information to be displayed. This allows for the relatively quick identification of extended travel demand information that has semantic differences from the demand information, but not excessively large ones. This enables the rapid and timely display of corresponding screen elements for the target audience to choose from, improving the efficiency of travel planning information generation while saving computational costs.
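
The interval-based selection described above can be sketched as follows. This is only an illustration under stated assumptions: the bag-of-words cosine similarity is a toy stand-in for whatever semantic similarity model an implementation would use, and the function names, the preset list, and the interval bounds 0.2 and 0.8 are hypothetical, not values given in this disclosure.

```python
import math
from collections import Counter
from typing import Callable, List


def bag_of_words_cosine(a: str, b: str) -> float:
    """Toy similarity: cosine over word counts (stand-in for a semantic model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_extended_demands(
    demand: str,
    presets: List[str],
    low: float = 0.2,   # hypothetical lower bound: drop unrelated presets
    high: float = 0.8,  # hypothetical upper bound: drop near-duplicates of the demand
    similarity: Callable[[str, str], float] = bag_of_words_cosine,
) -> List[str]:
    """Keep presets that differ from the demand, but not excessively."""
    return [p for p in presets if low <= similarity(demand, p) <= high]


presets = ["city a tour", "city a luxury tour 5 days", "mountain hiking gear"]
selected = select_extended_demands("city a tour", presets)
```

With these toy inputs, the identical preset is rejected as too similar, the unrelated one as too different, and only the related but distinct preset remains as a candidate screen element.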

[0089] In one embodiment, the demand information input by the target object is "tourism in City A". During playback of the spoken video, multiple first screen elements representing similar extended travel needs are displayed, for example two first screen elements representing "1-day tour" and "2-3 day tour". During playback, the virtual object can indicate the position of the first screen elements on the display interface through gestures and eye movements. After the target object performs a touch operation on the first screen element representing "2-3 day tour", "2-3 day tour" can be determined to be the target extended travel need information that meets the target object's travel requirements. A large language model then processes the demand information "tourism in City A", the currently determined target extended travel need information "2-3 day tour", and the target object's historical preference attributes, and determines that the extended travel need information represented by the new second screen elements is "budget 2000-3000 yuan" and "budget 5000-8000 yuan", respectively. The target object can interact with the second screen elements to determine new target extended travel need information.

[0090] It should be understood that multiple rounds of visual elements can be displayed in succession during playback of the spoken video. Through the extended travel demand information represented by these rounds of visual elements, the target object can explore its potential travel needs while immersed in watching the spoken video. By using a large model to process the demand information together with the target extended travel demand information selected over the multiple rounds, travel planning information can be generated efficiently and accurately.
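
The multi-round display and selection of visual elements can be sketched as a simple loop. This is a minimal sketch, assuming two hypothetical callbacks: `propose` stands in for the large language model that derives the next round of elements from the demand information and the selections so far, and `choose` stands in for the target object's interaction. Neither name comes from this disclosure.

```python
from typing import Callable, List, Optional


def run_rounds(
    demand: str,
    propose: Callable[[str, List[str]], List[str]],
    choose: Callable[[List[str]], Optional[str]],
    max_rounds: int = 3,
) -> List[str]:
    """Collect one selected extended demand per round until nothing is left."""
    selected: List[str] = []
    for _ in range(max_rounds):
        options = propose(demand, selected)
        if not options:
            break  # the model has no further elements to display
        choice = choose(options)
        if choice is None:
            break  # no interaction detected in this round
        selected.append(choice)
    return selected


# Hypothetical stand-in for the model, reusing the labels from the example above.
ROUND_OPTIONS = [
    ["1-day tour", "2-3 day tour"],
    ["budget 2000-3000 yuan", "budget 5000-8000 yuan"],
]


def propose(demand: str, selected: List[str]) -> List[str]:
    return ROUND_OPTIONS[len(selected)] if len(selected) < len(ROUND_OPTIONS) else []


picks = run_rounds("tourism in City A", propose, choose=lambda options: options[-1])
```

The accumulated `picks` list, together with the original demand information, is what would be handed to the large model for planning.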

[0091] In another embodiment, a large model can be used to process the demand information, the conversation information input by the target object while watching the spoken video, and the multiple rounds of extended travel demand information to obtain travel planning information. Providing the target object with diverse interaction methods in this way helps uncover potential travel needs, improving the accuracy and timeliness of travel demand exploration and enhancing travel planning efficiency.

[0092] According to embodiments of this disclosure, displaying spoken video and visual elements that match the demand information may include: displaying the visual elements in the target area indicated by the specified action in the spoken video.

[0093] In one example, the spoken video clip associated with a specified action explains the extended demands represented by the on-screen elements. For instance, during the spoken video playback, a virtual object performs a specified gesture to indicate on-screen elements representing "1-day tour" or "2-3 day tour" in a target area of the screen. While performing this gesture, the virtual object can explain the following: "City A has many attractions that are scattered. You can choose a 1-day tour to experience attraction A in depth, or choose a 2-3 day tour to experience several other attractions around attraction A."

[0094] By having the virtual object perform specified actions in the spoken video to indicate on-screen elements, and explain the extended travel needs represented by these elements while performing the actions, the target object's immersion in the spoken video can be further enhanced. This effectively prompts the target object and stimulates it, subconsciously, to provide through interactive behavior the extended travel needs information and conversation information that meet its actual travel requirements. This improves the efficiency and accuracy of uncovering potential travel needs, thereby improving how well the travel planning information matches the target object's actual travel needs.

[0095] In one embodiment, the voice-over video is determined based on the following operations: using a large model to edit the script based on the demand information to obtain the voice-over script; and using the voice-over text fragments and action script data to drive virtual objects to obtain the voice-over video.

[0096] For example, the voice-over script includes voice-over text fragments that match the demand information and action script data representing the specified actions. The action script data can drive the virtual object to perform the specified actions to indicate the target area in which the screen elements are displayed, and the voice-over text fragments can drive the virtual object to explain the screen elements indicated by the specified actions. This improves the naturalness of the virtual object's expression in the voice-over video and stimulates the target object to raise travel demands subconsciously, so that the large model can output more accurate travel planning information based on a full understanding of the conversation information and the extended travel demand information that represent potential actual needs.
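
One possible in-memory shape for such a voice-over script is sketched below. The class names, field names, and the action labels "eye", "head", and "gesture" are illustrative assumptions. The disclosure only states that the script contains voice-over text fragments and action script data, and that the specified action is at least one of eye movements, head movements, and gestures.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

ALLOWED_ACTIONS = {"eye", "head", "gesture"}  # hypothetical labels for the three action types


@dataclass
class ActionCue:
    """Action script data: which actions point at which screen elements, and where."""
    actions: List[str]
    target_area: str
    element_labels: List[str]

    def __post_init__(self) -> None:
        unknown = set(self.actions) - ALLOWED_ACTIONS
        if not self.actions or unknown:
            raise ValueError(f"need at least one known action, got {self.actions}")


@dataclass
class ScriptSegment:
    """One voice-over text fragment, optionally paired with an action cue."""
    text: str
    cue: Optional[ActionCue] = None


def elements_to_display(script: List[ScriptSegment]) -> List[Tuple[int, str, str]]:
    """Return (segment index, target area, label) for every element a segment points at."""
    shown = []
    for index, segment in enumerate(script):
        if segment.cue is not None:
            for label in segment.cue.element_labels:
                shown.append((index, segment.cue.target_area, label))
    return shown


script = [
    ScriptSegment("City A has many attractions that are scattered."),
    ScriptSegment(
        "You can choose a 1-day tour or a 2-3 day tour.",
        ActionCue(["gesture", "eye"], "left", ["1-day tour", "2-3 day tour"]),
    ),
]
display_plan = elements_to_display(script)
```

Keeping the cue attached to the text fragment it accompanies makes it straightforward for a downstream driver to synchronize the explanation with the pointing action.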

[0097] In one embodiment, a multimodal large model can be used to understand the spoken text fragments and action script data in the spoken script, driving virtual objects to operate through the voice attributes, facial expression attributes, and action attributes in the spoken script. This allows the virtual objects in the spoken video to explain the required information while simultaneously prompting the extended travel requirement information represented by the on-screen elements through specified actions, facial expressions, and spoken content. This enables dynamic collaboration between the graphical user interface (GUI) and the spoken video during the travel planning process for the target audience, enhancing the immersive experience of the target audience's interaction. Consequently, the conversation information and extended travel requirement information determined based on the interaction behavior can more accurately and efficiently represent the potential travel needs of the target audience. Thus, the large model can be used to process the conversation information and extended travel requirement information representing the potential travel needs of the target audience to accurately plan the trip and improve the interactive experience of the target audience.

[0098] In one embodiment, the large-model-based trip planning method can further use the large model to perform semantic conflict detection on the currently acquired session information and target extended trip demand information of the target object. If the detection result indicates a semantic or logical conflict among multiple pieces of session information, among multiple pieces of target extended trip demand information, or between session information and target extended trip demand information, a spoken text segment representing the detection result can be generated. A spoken video segment can then be generated in which the virtual object explains this spoken text segment, prompting the target object to promptly modify and confirm the entered session information and target extended trip demand information. This enables the target object to identify and correct semantic conflicts in the travel demand information it has provided. Thus, the semantic detection capability of the large model assists the target object in checking its actual travel needs, improving the efficiency of travel demand exploration and the logical coherence and accuracy of the generated trip planning information.

[0099] According to embodiments of this disclosure, the large-model-based itinerary planning method may further include: displaying itinerary planning information during the playback of a spoken video.

[0100] For example, during the playback of a spoken video, the current travel planning information output by the large model can be displayed on the screen as scrolling content. This is not limiting; the travel planning information can also be displayed in a floating window over the spoken video, allowing the target object to understand the currently planned itinerary content in a timely and clear manner. At the same time, with the travel planning information output by the large model on display, the target object can determine new travel needs by interacting with screen elements or inputting conversation information. The large model can then process the new conversation information and target extended travel needs information to update the current travel planning information, enabling timely modification and correction of the travel planning information and improving the target object's satisfaction with the result.

[0101] According to embodiments of this disclosure, the large-model-based trip planning method may further include: displaying trip planning information when the voice-over video playback terminates.

[0102] In one example, the travel planning information output by the large model can be displayed after the spoken video is played. This allows the target audience to modify and confirm their actual travel needs based on the prompts from the spoken video and visual elements, thereby improving the accuracy of the displayed travel planning information.

[0103] In one example, the currently generated itinerary planning information can also be displayed during the pause in the spoken video playback. This allows the target audience to carefully read the currently generated itinerary planning information to deeply consider their actual needs. By inputting session information and interacting with subsequently generated screen elements, they can control the large model to process at least one of the subsequently input session information and the target extended itinerary requirement information, so as to achieve a more accurate update of the current itinerary plan information and obtain itinerary planning information that meets the actual needs of the target audience.

[0104] Figure 4 shows a schematic diagram of a service system for performing the large-model-based trip planning method provided according to embodiments of the present disclosure.

[0105] As shown in Figure 4, the service system includes a smart terminal, a server, and a strategy planning end. The smart terminal can be the device used by the target object to watch the voice-over video and perform interactive actions based on it. The strategy planning end can deploy a large model for generating travel planning information. The target object initiates a planning request to the server by operating the smart terminal. The planning request carries the target object's demand information, "City A family travel guide". The server uses a large language model to perform intent understanding on the demand information and obtain an intent understanding result. The intent understanding result can represent the target object's basic travel needs: traveling to City A, requesting a travel guide, and traveling as a family. Based on the intent understanding result, extended travel requirement information that differs semantically from the intent understanding result is determined, as well as a voice-over video that matches the intent understanding result. By returning the voice-over video and the visual elements representing the extended travel requirement information to the smart terminal, an immersive interactive experience is provided to the target object.

[0106] The target audience can select from continuously updated visual elements while watching the narrated video to determine their extended travel needs and obtain session information representing their travel demands. This session information and extended travel demand information are then transmitted to the server as the target audience's travel demand information. The server can integrate this demand and travel demand information to perform a data supplementation process for trip planning. A planning strategy is requested by sending a message containing both demand and travel demand information to the strategy planning end.

[0107] The strategy planning end processes demand and travel demand information by calling a trained large model to perform trip planning and obtain trip planning information. The strategy planning end returns the trip planning information to the server, and the server sends a page containing the trip planning information to the smart terminal, so that the target user can browse the trip planning information on the page to meet their actual travel planning needs.
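
The message flow between the server and the strategy planning end can be sketched as follows. This is an illustrative outline only: the class names, the prompt format, and the dictionary used as the returned page are assumptions, and the large model is represented by an injected callable rather than any real model API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PlanningMessage:
    """Message sent from the server to the strategy planning end."""
    demand_info: str
    trip_demand_info: List[str]  # extended demand information plus session information


class StrategyPlanningEnd:
    def __init__(self, large_model: Callable[[str], str]) -> None:
        self.large_model = large_model  # stand-in for the trained large model

    def plan(self, message: PlanningMessage) -> str:
        prompt = (
            f"Demand: {message.demand_info}\n"
            f"Additional needs: {'; '.join(message.trip_demand_info)}"
        )
        return self.large_model(prompt)


class Server:
    def __init__(self, planner: StrategyPlanningEnd) -> None:
        self.planner = planner

    def handle_planning(
        self, demand_info: str, extended_info: List[str], session_info: List[str]
    ) -> Dict[str, str]:
        # Integrate all demand information into one message, then request a plan.
        message = PlanningMessage(demand_info, [*extended_info, *session_info])
        return {"page": self.planner.plan(message)}


def fake_model(prompt: str) -> str:
    """Placeholder model that echoes its prompt, so the flow can be run end to end."""
    return "PLAN<" + prompt + ">"


server = Server(StrategyPlanningEnd(fake_model))
page = server.handle_planning(
    "City A family travel guide", ["2-3 day tour"], ["keep the budget low"]
)["page"]
```

In a deployment, the smart terminal would render the returned page; here the page is simply a string so that the flow stays self-contained.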

[0108] According to embodiments of this disclosure, the large-model-based itinerary planning method further includes: displaying extended video clips.

[0109] According to embodiments of this disclosure, displaying extended video clips may include inserting extended video clips during the playback of spoken video, or may include displaying extended video clips after the playback of spoken video has ended.

[0110] According to embodiments of this disclosure, extended video clips represent virtual objects providing verbal explanations of content related to extended travel needs or conversation information. For example, the verbal explanations provided by virtual objects in extended video clips represent attraction-related data such as attraction introductions and transportation methods related to conversation information or target extended travel needs.

[0111] For example, in the extended video clip, the virtual object can verbally explain at least one of the needs represented by the target extended travel demand information and the conversation information determined by the target object. The verbal explanation can prompt the target object whether the conversation information and target extended travel demand information input by the interactive behavior actually meet the target object's needs. This enables timely feedback on the interactive behavior that represents the target object's travel needs, improves the immersive experience of the target object in thinking about potential travel needs, and improves the accuracy and generation efficiency of the travel planning information output by the large model.

[0112] In one embodiment, the extended video segment can be a pre-generated preset video segment, which can be generated by driving a virtual object to perform narration through a preset video segment script.

[0113] In one embodiment, extended video clips are determined from a preset video clip set based on at least one of extended travel demand information and conversation information. For example, extended video clips matching the target's current travel demand can be obtained by matching at least one of the target's extended travel demand information and conversation information with the respective clip themes of the preset video clip library.
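
Matching against a preset clip library can be sketched with a simple keyword overlap score. This is a hedged illustration: the clip identifiers, the theme keywords, and the overlap scoring are all hypothetical, and a real system might match clip themes with a semantic model instead.

```python
from typing import Dict, List, Optional, Set


def match_clip(library: Dict[str, Set[str]], inputs: List[str]) -> Optional[str]:
    """Return the clip whose theme keywords overlap most with the inputs, if any."""
    words = set(" ".join(inputs).lower().split())
    best_id: Optional[str] = None
    best_score = 0
    for clip_id, theme_keywords in library.items():
        score = len(theme_keywords & words)
        if score > best_score:  # strict comparison keeps the first clip on ties
            best_id, best_score = clip_id, score
    return best_id


# Hypothetical preset clip library: clip id -> theme keywords.
clip_library = {
    "clip_family": {"family", "children"},
    "clip_luxury": {"luxury", "five-star"},
}

clip = match_clip(clip_library, ["family trip for 3 people"])
```

Returning `None` when nothing overlaps leaves room for the alternative described in this disclosure, where a script is generated online instead of reusing a preset clip.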

[0114] In one example, a script for the extended video clip can also be output by using a large model to process at least one of the following: demand information, conversation information, and target extended travel demand information. By processing the extended video clip script with a multimodal large model that has video generation capabilities, the virtual object can be driven online to verbally explain the target extended travel demand information currently selected by the target object. This allows the target object to understand the relevant travel information in real time, consider whether to modify the target extended demand information, or raise deeper travel demands through other interactive behaviors such as inputting conversation information. This provides an immersive interactive experience, improves the efficiency of exploring the target object's potential travel needs, and improves how well the travel planning information matches the target object's actual travel needs.

[0115] According to embodiments of this disclosure, the extended video clips can be used as part of the spoken video, thereby updating the currently displayed spoken video by displaying the extended video clips. During the display of the extended video clips, corresponding visual elements are shown to prompt the target audience to continue providing subsequent extended travel needs and conversation information in an immersive interactive environment while watching the updated spoken video. This allows for continuous exploration of the target audience's potential travel needs by displaying extended video clips. Furthermore, based on the target audience's interactive behavior and conversation information input behavior towards the visual elements displayed in the extended video clips, travel needs information that more accurately matches the target audience's actual travel needs can be obtained, enabling the large model to output more accurate travel planning information that matches the target audience's travel needs.

[0116] According to embodiments of this disclosure, the extended video segment is determined based on the following operations: scripting using a large model to obtain an extended segment script based on at least one of extended trip requirement information and session information; and driving virtual objects based on the extended segment script to obtain the extended video segment.

[0117] In one embodiment, a large model can be used to process demand information, target extended travel demand information selected by the target object, and input session information. This allows the large model to fully understand the target object's current actual travel needs and edit extended segment scripts to control the virtual object to provide narration. This allows for continuous exploration of the target object's potential travel needs through extended video segments, improving the match between travel planning information and the target object's actual needs.

[0118] In one embodiment, the demand information could be "tour of city A", the conversation information could be "visit several famous attractions in city A while keeping the budget to a minimum", and the target extended demand information could be "family trip for 3 people". By using a large model to process the demand information "tour of city A", the conversation information "visit several famous attractions in city A while keeping the budget to a minimum", and the target extended demand information "family trip for 3 people", an extended fragment script is obtained that can explain the diverse travel needs currently input by the target object. This script then drives the virtual object to provide verbal explanations based on the extended fragment script, thereby continuously exploring the target object's potential travel needs.

[0119] In one embodiment, the extended segment script may further include extended specified action script data that indicates the screen elements displayed during the playback of the extended video segment. Thus, the extended segment script can drive virtual objects to perform specified actions to indicate the screen elements displayed during the playback of the extended video segment, thereby continuously providing an immersive interactive experience. This allows the target object to continuously consider and respond to potential travel needs, improving the accuracy and timeliness of the itinerary planning information output by the large model.

[0120] According to embodiments of this disclosure, the extended video segment represents that at least two of the demand information, the extended trip requirement information, and the first session information are semantically contradictory, and the first session information is determined based on a first request from the target object. For example, the first request may be generated based on the interactive behavior of the target object inputting session information, and the first request may carry the first session information.

[0121] Semantic contradictions can be understood as semantic conflicts among the travel needs input by the target object through interaction. For example, the demand information is "tour of city A", the first conversation information is "budget of 2000 yuan", and the target extended demand information is "5-day luxury tour". In this case, it is difficult for the itinerary planning information output by the large model to satisfy both "budget of 2000 yuan" and "5-day luxury tour" at the same time; therefore, there is a semantic contradiction between "budget of 2000 yuan" and "5-day luxury tour".

[0122] For example, a trained large model can be used to process the demand information, the target extended trip demand information, and the first session information to obtain a semantic detection result, which indicates whether there is a semantic contradiction between at least two of them. If the semantic detection result indicates a semantic contradiction, a large language model can generate an extended script fragment based on the semantic detection result. This extended script fragment can then drive a virtual object to obtain an extended video fragment that explains the semantic contradiction.
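
The detection step can be illustrated with a deliberately simple rule-based check applied to the budget example. This is not the large-model detection described in this disclosure: the regular expressions, the `MIN_DAILY_COST` table, and its figures are hypothetical stand-ins that only show the kind of result such a step would produce.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical minimum daily spend in yuan per travel style (illustrative figures only).
MIN_DAILY_COST = {"luxury": 2000, "economy": 200}


def parse_budget(items: List[str]) -> Optional[int]:
    """Find a budget such as 'budget of 2000 yuan' or 'budget 20,000 yuan'."""
    for item in items:
        match = re.search(r"budget\D*([\d,]+)\s*yuan", item, re.IGNORECASE)
        if match:
            return int(match.group(1).replace(",", ""))
    return None


def parse_trip(items: List[str]) -> Optional[Tuple[int, str]]:
    """Find a trip description such as '5-day luxury tour'."""
    for item in items:
        match = re.search(r"(\d+)-day\s+(\w+)\s+tour", item, re.IGNORECASE)
        if match:
            return int(match.group(1)), match.group(2).lower()
    return None


def detect_conflict(items: List[str]) -> Optional[str]:
    """Return a text describing the contradiction, or None if none is found."""
    budget, trip = parse_budget(items), parse_trip(items)
    if budget is None or trip is None:
        return None
    days, style = trip
    needed = days * MIN_DAILY_COST.get(style, 0)
    if budget < needed:
        return (
            f"A {days}-day {style} tour is hard to fit into a budget of "
            f"{budget} yuan (about {needed} yuan assumed necessary)."
        )
    return None


conflict = detect_conflict(["tour of city A", "budget of 2000 yuan", "5-day luxury tour"])
```

The returned text plays the role of the semantic detection result from which an extended script fragment would be generated.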

[0123] In one embodiment, the large-model-based trip planning method further includes: in response to a second request from the target object, updating at least one of the demand information, the extended trip demand information, and the first session information according to the second request to obtain target trip demand information.

[0124] Through the virtual object's narration in the extended video clip, the target object is prompted that the multiple travel needs currently input conflict semantically, so that it is difficult for the trip planning information to satisfy all of them at once. This helps the target object generate a second request through subsequent interaction with on-screen elements or by inputting session information. The second request carries second session information and new target extended trip need information to update at least one of the current need information, extended trip need information, and first session information, thereby correcting the semantic conflicts among the multiple pieces of travel need information and improving trip planning efficiency.

[0125] For example, the first session information "budget 2,000 yuan" can be updated by the second session information "budget 20,000 yuan" carried in the second request, so that the target travel demand information includes "5-day luxury tour", "City A tour" and "budget 20,000 yuan". This allows the large model to generate travel planning information that meets the diverse travel needs of the target object and does not have semantic conflicts by processing the target travel demand information including "5-day luxury tour", "City A tour" and "budget 20,000 yuan".
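
The update in this example can be sketched as a partial overwrite of the current demand state. The class and field names below are illustrative assumptions; the disclosure only requires that at least one of the three kinds of information is updated according to the second request.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class TripDemand:
    demand_info: str
    extended_info: str
    session_info: str


@dataclass(frozen=True)
class SecondRequest:
    """Any field left as None keeps the current value."""
    demand_info: Optional[str] = None
    extended_info: Optional[str] = None
    session_info: Optional[str] = None


def apply_second_request(current: TripDemand, request: SecondRequest) -> TripDemand:
    """Return target trip demand information with the requested fields replaced."""
    updates = {key: value for key, value in vars(request).items() if value is not None}
    return replace(current, **updates)


current = TripDemand("City A tour", "5-day luxury tour", "budget 2,000 yuan")
target = apply_second_request(current, SecondRequest(session_info="budget 20,000 yuan"))
```

Because the dataclasses are frozen, the earlier state stays available, which can be convenient if the target object issues several second requests in a row.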

[0126] In one embodiment, the second request can be determined based on the target object's interaction with screen elements. For example, during the playback of an extended video clip, if the target object interacts with a second screen element displayed during the playback period of the extended video clip, a second request carrying extended travel requirement information represented by the second screen element can be determined. This allows the target object to select screen elements to generate extended travel requirement information representing new requirements during further expansion of the already selected extended travel requirement information with the virtual object, thereby improving the naturalness and fluency of the target object's request submission and increasing the efficiency of the large model's output of travel planning information.

[0127] In one embodiment, the second request can be determined based on second session information input by the target object. The target object can propose new travel needs according to its thought process by inputting session information, thereby providing diverse interaction methods to improve the efficiency of exploring the target object's travel needs and improving the efficiency of trip planning.

[0128] It should be noted that the target object may generate multiple first requests and second requests during the process of watching the voice-over video. The embodiments of this disclosure do not limit the number of first requests or second requests.

[0129] According to embodiments of this disclosure, the large-model-based itinerary planning method may further include: using a large language model to perform demand detection on a target object based on at least one of first extended information and conversation information to obtain updated second extended information; and determining the screen elements representing the second extended information.

[0130] According to embodiments of this disclosure, the extended trip demand information includes first extended information and second extended information.

[0131] In one example, the first and second extended information may have semantic conflicts. By using a large language model to process at least one of the first extended information and the conversational information to detect the actual travel needs of the target object, the obtained updated second extended information can represent the target object's corrected demand intention. Thus, by displaying screen elements representing the second extended information, the target object can be guided to accurately select extended travel demand information by targeting the updated screen elements. New screen elements are continuously displayed to help users explore their travel needs. Furthermore, the target object can continue to correct the travel needs represented by the already determined extended travel demand information and conversational information through input operations or interactive behaviors on screen elements, thereby improving the accuracy of the travel planning information output by the large model.

[0132] In one example, there is a semantic hierarchy between the first and second extended information. For instance, the second extended information can represent the travel needs at the next semantic level of the first extended information. For example, the first extended information could be "luxury travel," and the second extended information could be "budget of 10,000 to 20,000 yuan" or "budget of 20,000 to 50,000 yuan." By utilizing a large model to process at least one of the first extended information and conversational information to detect the target object's needs, the second extended information at the next semantic level can be generated based on the target object's current feedback on travel needs. This prompts the target object to explore its needs more accurately, improving the accuracy of its travel need intention feedback, and thus enhancing the efficiency and accuracy of the travel planning information generated by the large model.
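The semantic hierarchy in this example can be sketched as a lookup from first extended information to the second extended information at the next level. In the disclosure the second extended information is generated by a large model; the static `NEXT_LEVEL` table below is an assumed stand-in used only to show the relationship.

```python
# Hypothetical hierarchy: each key is first extended information, each value
# lists second extended information at the next semantic level.
NEXT_LEVEL = {
    "luxury travel": [
        "budget of 10,000 to 20,000 yuan",
        "budget of 20,000 to 50,000 yuan",
    ],
}


def next_level_elements(selected, hierarchy=NEXT_LEVEL):
    """Return the screen-element labels to show after `selected` is chosen."""
    return list(hierarchy.get(selected, []))


labels = next_level_elements("luxury travel")
```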

[0133] In one embodiment, while displaying the visual elements used for the second extended information, the virtual object can also be driven to adjust its spoken content in real time within the extended video clip to explain the reason for updating the visual elements. For example, if the target object changes its budget requirement from "1,000 yuan" to "10,000 yuan," the virtual object will adjust its spoken content in real time to "Based on your budget, we recommend a hotel with a better accommodation experience."

[0134] According to embodiments of this disclosure, processing at least one of extended trip demand information and session information using a large model includes: processing demand information, first extended information, and second extended information using a large model.

[0135] By utilizing large models to process demand information, first extended information, and second extended information with semantic hierarchical relationships, accurate travel planning information can be generated efficiently and effectively, improving travel planning efficiency, while accurately exploring the travel needs of the target audience.

[0136] According to embodiments of this disclosure, the large-model-based itinerary planning method further includes: using the large model to perform semantic understanding of demand information to obtain basic itinerary planning information; and displaying the basic itinerary planning information.

[0137] According to embodiments of this disclosure, by utilizing a large model to perform semantic understanding of the target object's input needs, basic travel planning information matching the needs information is generated and displayed. This allows the target object to gain a preliminary understanding of the initial travel planning strategy, helps the target object to provide subsequent interactive behaviors for screen elements based on the planned basic travel planning information, and provides conversational information to represent travel needs, thereby improving the efficiency of exploring the target object's travel intentions and thus improving the efficiency of the large model in generating travel planning information.

[0138] Figure 5 schematically illustrates a system architecture suitable for implementing the large-model-based trip planning method provided according to embodiments of the present disclosure.

[0139] As shown in Figure 5, the system architecture includes an intent detection module 501, a demand exploration module 502, and a trip planning module 503. The intent detection module 501 includes an intent detection component, a prompt engineering component, and an intelligent retrieval component. The intent detection component can detect the intent of the demand information "I want to travel in city A for 3 days with a budget of 3000 yuan" and obtain the intent detection result. The prompt engineering component determines the corresponding prompt information based on the intent detection result and sends the prompt information to the trip planning module 503. The intelligent retrieval component can retrieve data on attractions and transportation methods related to city A based on the intent detection result. Attraction data can include opening hours, reservation links, and other attraction-related information. Transportation data can include real-time traffic congestion and transportation types. The attraction data, transportation data, and other travel demand-related information retrieved by the intelligent retrieval component can also be sent to the trip planning module 503.

[0140] The demand exploration module 502 includes a loading component, n screen element components, and a demand integration component. The loading component identifies the spoken video and the n screen element components that match the demand information and loads the spoken video into the display interface. During playback of the spoken video, the displayed screen elements are dynamically adjusted according to the display order of the n screen element components in the display strategy, proceeding from the first screen element component to the nth screen element component. The target user can interact with the screen elements or input conversation information to transmit travel demand information to the demand exploration module 502 while watching the spoken video. The demand integration component acquires the travel demand information and uses a large language model to perform semantic contradiction detection on the multiple items of travel demand information. Based on the semantic contradiction detection results, it generates new screen elements to prompt the target user to modify their travel demand information by interacting with the screen elements or inputting corrected conversation information. The demand integration component then sends the target user's target trip demand information to the trip planning module 503.
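One way to picture the display strategy is as an ordered list of screen elements, each with a playback window. The sketch below assumes such time windows (`start_s`, `end_s`); the disclosure specifies only a display order, so the timing fields, labels, and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScreenElement:
    label: str
    start_s: float  # playback time at which the element appears
    end_s: float    # playback time at which it is removed


def visible_elements(strategy, playback_s):
    """Return labels of elements shown at `playback_s`, in display order."""
    ordered = sorted(strategy, key=lambda e: e.start_s)
    return [e.label for e in ordered if e.start_s <= playback_s < e.end_s]


strategy = [
    ScreenElement("budget range", 12.0, 20.0),
    ScreenElement("trip duration", 3.0, 12.0),
    ScreenElement("travel companions", 15.0, 25.0),
]
early = visible_elements(strategy, 5.0)
later = visible_elements(strategy, 16.0)
```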

[0141] The trip planning module 503 includes a trip planning component and a trip planning information display component. The trip planning component calls a large model to process prompts, travel demand-related information, and target trip demand information, enabling it to use the large model to plan trips based on the target user's intentions and obtain trip planning information. The trip planning information display component generates a trip planning display page based on the trip planning information, allowing the target user to browse detailed trip planning strategies and information.
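A minimal sketch of the trip planning component follows, assuming the three inputs are assembled into a single text request before the model is called. The request layout, the `echo_model` placeholder, and the sample attraction data are hypothetical; a real deployment would call an actual large model instead.

```python
from typing import Callable, Dict, List


def plan_trip(prompt: str, related_info: Dict[str, str],
              target_demands: List[str], model: Callable[[str], str]) -> str:
    """Trip planning component: assemble one request and call the model."""
    lines = [prompt, "Constraints:"]
    lines += [f"- {d}" for d in target_demands]
    lines.append("Reference data:")
    lines += [f"- {k}: {v}" for k, v in related_info.items()]
    return model("\n".join(lines))


def echo_model(request: str) -> str:
    """Placeholder for a real large-model call; it only reports request size."""
    return f"PLAN based on {len(request.splitlines())} request lines"


plan = plan_trip(
    "Generate a 3-day itinerary for city A.",
    {"Museum X opening hours": "09:00-17:00"},  # hypothetical retrieved data
    ["budget 3000 yuan"],
    echo_model,
)
```

Passing the model as a callable keeps the assembly logic testable without any network access.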

[0142] In one embodiment, a service system for executing the large-model-based itinerary planning method of the present disclosure includes an interaction module, a task execution module, a GUI integration module, and a data management module.

[0143] The interaction module includes a multimodal driving engine built on a large multimodal model and an intent parser built on a large language model.

[0144] The multimodal driving engine integrates Natural Language Processing (NLP), Text-to-Speech (TTS), and motion capture algorithms into a large multimodal model. This enables the virtual object to be driven by facial expressions, actions, and spoken audio data, achieving real-time synchronization of facial expressions, actions, and speech in spoken videos. For example, when the target object inputs the request "quality tour," the virtual object can be driven to nod in confirmation and then perform a gesture pointing to the target area corresponding to the screen element.

[0145] The intent parser leverages the semantic analysis capabilities of a large language model to identify the target audience's needs, including actual travel requirements such as the number of days, budget, and itinerary preferences contained in the conversation. It then generates prompts suitable for controlling the large model to generate itinerary planning information. For example, a prompt might be "Generate a 3-day itinerary for city A."
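To make the input and output of the intent parser concrete, the sketch below extracts the city, number of days, and budget from the example demand sentence and builds the example prompt. Regular expressions stand in for the large language model's semantic analysis, and the function and field names are assumptions for illustration only.

```python
import re


def parse_intent(utterance):
    """Regex stand-in for LLM-based intent parsing of a travel request."""
    city = re.search(r"\bcity ([A-Za-z]+)", utterance, re.IGNORECASE)
    days = re.search(r"(\d+)\s*days?", utterance)
    budget = re.search(r"budget of (\d+)", utterance)
    return {
        "city": city.group(1) if city else None,
        "days": int(days.group(1)) if days else None,
        "budget_yuan": int(budget.group(1)) if budget else None,
    }


def build_prompt(intent):
    """Turn a parsed intent into a prompt for the planning model."""
    return f"Generate a {intent['days']}-day itinerary for city {intent['city']}."


intent = parse_intent(
    "I want to travel in city A for 3 days with a budget of 3000 yuan")
prompt = build_prompt(intent)
```

The sketch does not handle missing fields in `build_prompt`; a fuller version would ask the target object for the missing attribute instead.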

[0146] The task execution engine performs the trip planning process by calling large-scale models. For example, it can intelligently retrieve relevant data such as attraction opening hours and traffic congestion by combining target trip requirements with real-time data related to the trip. Simultaneously, it uses large-scale models to process real-time data and target trip requirements to generate personalized trip planning information for the target individual. This trip planning information may include, for example, multiple recommended attractions, as well as optimal travel routes and transportation options between these attractions.

[0147] The task execution engine can also detect semantic contradictions in conversation information and target extended itinerary requirements in real time by calling a large language model. It can also perform semantic contradiction detection on the itinerary planning information generated by the large model to obtain planning contradiction detection results. Planning contradiction detection results indicate time conflicts in the itinerary planning information, such as attractions being closed, or budget overruns due to expenses in the itinerary planning information. The itinerary planning process of the large model can be optimized based on reinforcement learning mechanisms to improve the accuracy of the itinerary planning information.
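The planning contradiction checks named above (closed attractions, time conflicts, and budget overruns) can be sketched with plain rules over a single-day plan. The `Visit` structure, the sample attractions, and the issue messages are hypothetical, and the disclosure performs this detection with a large language model rather than with fixed rules.

```python
from dataclasses import dataclass
from datetime import time


@dataclass
class Visit:
    attraction: str
    start: time
    end: time
    cost_yuan: int


def check_plan(visits, opening_hours, budget_yuan):
    """Return a list of detected issues for a single-day plan."""
    issues = []
    for v in visits:
        opens, closes = opening_hours[v.attraction]
        if v.start < opens or v.end > closes:
            issues.append(f"{v.attraction}: visit falls outside opening hours")
    ordered = sorted(visits, key=lambda v: v.start)
    for a, b in zip(ordered, ordered[1:]):
        if b.start < a.end:
            issues.append(f"{a.attraction} overlaps with {b.attraction}")
    total = sum(v.cost_yuan for v in visits)
    if total > budget_yuan:
        issues.append(f"budget exceeded by {total - budget_yuan} yuan")
    return issues


visits = [
    Visit("Museum", time(9, 0), time(11, 0), 100),
    Visit("Tower", time(10, 30), time(12, 0), 200),
    Visit("Night market", time(18, 0), time(20, 0), 50),
]
hours = {
    "Museum": (time(9, 0), time(17, 0)),
    "Tower": (time(9, 0), time(18, 0)),
    "Night market": (time(19, 0), time(23, 0)),
}
issues = check_plan(visits, hours, budget_yuan=300)
```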

[0148] The GUI integration module synchronously renders the spoken video and the on-screen elements, dynamically overlaying on-screen elements that provide interactive options during playback of the spoken video, so that the target object has an immersive interactive experience of "watching and selecting".

[0149] The data management module is used to analyze the interactive behavior of the target object and determine its preferences in order to optimize the trip planning capabilities of the large model.

[0150] Figure 6 schematically shows a block diagram of a large-model-based trip planning apparatus according to an embodiment of the present disclosure.

[0151] As shown in Figure 6, the large-model-based trip planning device 600 includes: an acquisition module 610, a display module 620, and a trip planning information acquisition module 630.

[0152] The acquisition module 610 is used to acquire the requirement information of the target object.

[0153] The display module 620 is used to display spoken video and screen elements that match the demand information. The spoken video represents the virtual object instructing the screen elements through specified actions during the spoken process. The screen elements represent extended travel demand information that has semantic differences from the demand information.

[0154] The itinerary planning information acquisition module 630 is used to process at least one of extended itinerary demand information and conversation information using a large model to obtain itinerary planning information. The conversation information is the input data of the target object for the voice-over video, and the extended itinerary demand information is determined based on the interaction behavior of the screen elements.

[0155] According to an embodiment of this disclosure, the display module 620 includes a first display submodule.

[0156] The first display submodule is used to display the screen elements in the target area indicated by the specified action in the spoken video. The spoken video clip related to the specified action explains the extended trip demand information represented by the screen elements.

[0157] According to embodiments of this disclosure, the voice-over video is determined based on the following operations: using a large model to edit the script based on the demand information to obtain a voice-over script, the voice-over script including a voice-over text fragment that matches the demand information and action script data representing a specified action; driving a virtual object based on the voice-over text fragment and action script data to obtain the voice-over video.
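The voice-over script described here pairs spoken text fragments with action script data. One possible representation, given as a sketch, is shown below: each fragment carries timed action cues, and the fragments are flattened into a timeline that a virtual-object driver could consume. The class names, the fixed `seconds_per_fragment` duration, and the `target_area` field are assumptions, not details from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionCue:
    at_s: float       # offset into the spoken text fragment, in seconds
    action: str       # e.g. "gesture", "head movement", "eye movement"
    target_area: str  # hypothetical name of the on-screen region indicated


@dataclass
class ScriptFragment:
    text: str
    cues: List[ActionCue] = field(default_factory=list)


def to_driver_timeline(fragments, seconds_per_fragment=5.0):
    """Flatten fragments into (time, kind, payload) events sorted by time."""
    timeline = []
    for index, fragment in enumerate(fragments):
        base = index * seconds_per_fragment
        timeline.append((base, "speak", fragment.text))
        for cue in sorted(fragment.cues, key=lambda c: c.at_s):
            timeline.append((base + cue.at_s, cue.action, cue.target_area))
    return sorted(timeline, key=lambda event: event[0])


fragments = [
    ScriptFragment("Welcome to City A."),
    ScriptFragment("You can choose a budget range here.",
                   [ActionCue(1.5, "gesture", "right panel")]),
]
timeline = to_driver_timeline(fragments)
```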

[0158] According to embodiments of this disclosure, the trip planning device 600 further includes an extended video clip display module.

[0159] The extended video clip display module is used to display extended video clips, which represent virtual objects providing verbal explanations of content related to extended trip requirements or conversation information.

[0160] According to embodiments of this disclosure, the extended video segment is determined based on the following operations: scripting using a large model to obtain an extended segment script based on at least one of extended trip requirement information and session information; and driving virtual objects based on the extended segment script to obtain the extended video segment.

[0161] According to embodiments of this disclosure, extended video segments are determined from a preset set of video segments based on at least one of extended trip demand information and session information.

[0162] According to embodiments of this disclosure, the extended video segment represents that at least two of the demand information, the extended trip demand information, and the first session information are semantically contradictory, and the first session information is determined based on a first request from the target object; the trip planning device 600 also includes a first acquisition module.

[0163] The first acquisition module is used to respond to a second request from the target object by updating at least one of the demand information, the extended trip demand information, and the first session information according to the second request, thereby obtaining target trip demand information. The trip planning information is determined by processing the target trip demand information using a large model.

[0164] According to embodiments of this disclosure, the second request is determined based on at least one of the following methods: determined based on interactive behavior of screen elements; determined based on second session information input by the target object.

[0165] According to embodiments of this disclosure, the trip planning device 600 further includes a second acquisition module and a first determining module.

[0166] The second acquisition module is used to perform demand detection on the target object based on at least one of the first extended information and the conversation information using a large language model, and to obtain updated second extended information. The extended trip demand information includes the first extended information and the second extended information.

[0167] The first determining module is used to determine the screen elements that represent the second extended information.

[0168] According to embodiments of this disclosure, the trip planning information acquisition module 630 includes a first processing submodule.

[0169] The first processing submodule is used to process the requirement information, the first extended information, and the second extended information using the large model.

[0170] According to embodiments of this disclosure, the designated action includes at least one of the following actions of the virtual object: eye movements, head movements, and gestures.

[0171] According to embodiments of this disclosure, the screen elements are obtained by using a large language model to detect travel requirements based on demand information and the historical preference attributes of the target object.

[0172] According to embodiments of this disclosure, interactive behaviors for screen elements include at least one of the following: voice interaction behavior, gesture interaction behavior, eye-tracking interaction behavior, and touch operation behavior.

[0173] According to embodiments of this disclosure, the itinerary planning information acquisition module 630 further includes: a first display submodule and a second display submodule.

[0174] The first display submodule is used to display itinerary planning information during playback of the voice-over video.

[0175] The second display submodule is used to display itinerary planning information when the voice-over video playback ends.

[0176] According to embodiments of this disclosure, the trip planning device 600 further includes: a third acquisition module and a basic trip planning information display module.

[0177] The third acquisition module is used to perform semantic understanding of the demand information using a large model to obtain basic travel planning information.

[0178] The basic itinerary planning information display module is used to display basic itinerary planning information.

[0179] According to embodiments of this disclosure, extended trip demand information represents at least one of the following demand attributes: trip duration attribute, trip budget attribute, waypoint demand attribute, mode of transportation demand attribute, and traveler attribute.
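The five demand attributes can be modeled as an enumeration, which makes it easy to see which attributes the target object has not yet specified. The sketch below is illustrative only; the enum member names and the `missing_attributes` helper are assumptions rather than part of the disclosure.

```python
from enum import Enum


class DemandAttribute(Enum):
    TRIP_DURATION = "trip duration"
    TRIP_BUDGET = "trip budget"
    WAYPOINT = "waypoint"
    TRANSPORT_MODE = "mode of transportation"
    TRAVELER = "traveler"


def missing_attributes(selected):
    """Return attributes with no selected extended demand, in enum order."""
    chosen = {attribute for attribute, _ in selected}
    return [a for a in DemandAttribute if a not in chosen]


selected = [
    (DemandAttribute.TRIP_BUDGET, "budget 20,000 yuan"),
    (DemandAttribute.TRIP_DURATION, "5 days"),
]
remaining = missing_attributes(selected)
```

A system could use such a list to decide which screen elements to display next.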

[0180] Figure 7 shows a schematic block diagram of an artificial intelligence agent according to an embodiment of the present disclosure.

[0181] In embodiments of this disclosure, as shown in Figure 7, the AI agent 700 may include an input module 710, a processing module 720, and an output module 730.

[0182] Input module 710 is used to receive input information.

[0183] The processing module 720 is used to determine the target task based on the input information received by the input module, determine the large model based on the target task, and execute the model-based travel planning method provided in the embodiments of this disclosure to output information by calling the large model.

[0184] Output module 730 is used to output the output information obtained by the processing module.
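The division of work among the three modules can be sketched as a small class in which each method plays the role of one module. The keyword-based task selection and the lambda "models" are hypothetical placeholders; the disclosure does not specify how the target task or the large model is chosen.

```python
class TripAgent:
    """Toy agent whose methods mirror modules 710, 720, and 730."""

    def __init__(self, models):
        self.models = models  # maps a task name to a callable model

    def receive(self, raw):
        """Input module 710: normalize the incoming information."""
        return raw.strip()

    def process(self, text):
        """Processing module 720: pick a task, pick a model, call it."""
        lowered = text.lower()
        is_trip = "trip" in lowered or "travel" in lowered
        task = "trip_planning" if is_trip else "chat"
        return self.models[task](text)

    def emit(self, result):
        """Output module 730: wrap the result for output."""
        return {"output": result}

    def run(self, raw):
        return self.emit(self.process(self.receive(raw)))


agent = TripAgent({
    "trip_planning": lambda text: f"itinerary for: {text}",
    "chat": lambda text: "chat reply",
})
answer = agent.run("  Plan a trip to city A  ")
```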

[0185] According to embodiments of this disclosure, the input module 710 is responsible for receiving or sensing information such as queries, requests, instructions, signals, or data from the outside world (e.g., users or the external environment), and converting it into a format that the AI agent 700 can understand and process. The input module 710 is the primary link for the AI agent 700 to interact with the outside world, enabling the AI agent 700 to efficiently and accurately obtain necessary "sensory" information from the outside world and respond to this information.

[0186] In the example, input module 710 can input the aforementioned requirement information, extended trip requirement information, and session information, etc.

[0187] In the example, processing module 720 is the core support for the AI agent 700's ability to handle complex tasks. Processing module 720 can execute the model-based travel planning method described above.

[0188] In the example, the performance of the processing module 720 is closely related to the large model on which the AI agent 700 is based. To fully leverage the capabilities of the large model, the internal structure of the processing module 720 can be designed to be highly configurable and scalable to handle various types of tasks and requirements in real-world scenarios.

[0189] In the example, after the AI agent 700 obtains demand information, extended travel demand information, and conversation information, the processing module 720 can use a large model to process the demand information, extended travel demand information, and conversation information to obtain travel planning information, and then pass the travel planning information to the output module 730.

[0190] Understandably, although large language models have excellent language understanding and generation capabilities, their ability to solve tasks, like that of humans, is limited without the aid of tools. Once the AI agent 700 is given the ability to invoke tools, it can, for example, use a calculator to complete mathematical calculations, use Python to perform data analysis, and use a search engine to look up weather forecasts.
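Tool invocation of this kind is often implemented as a registry that maps tool names to functions, with the model emitting a structured call that the agent dispatches. The sketch below assumes a call format of `{"tool": name, "args": [...]}` and a four-operation calculator; both are hypothetical and are not taken from the disclosure.

```python
import operator

TOOLS = {}


def tool(name):
    """Decorator that registers a function under a tool name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register


@tool("calculator")
def calculator(a, op, b):
    """Apply one of four arithmetic operators to two numbers."""
    ops = {"+": operator.add, "-": operator.sub,
           "*": operator.mul, "/": operator.truediv}
    return ops[op](a, b)


def invoke(call):
    """Dispatch a tool call of the form {"tool": name, "args": [...]}."""
    fn = TOOLS.get(call["tool"])
    if fn is None:
        raise KeyError(f"unknown tool: {call['tool']}")
    return fn(*call["args"])


# Example: split a 3000 yuan budget evenly across 3 days.
per_day = invoke({"tool": "calculator", "args": [3000, "/", 3]})
```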

[0191] In the example, output module 730 can output the trip planning information described above.

[0192] The AI agent 700 according to embodiments of this disclosure can simply and effectively improve the level of intelligence, as well as enhance flexibility and versatility.

[0193] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0194] According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described above.

[0195] According to embodiments of the present disclosure, a non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to perform the method described above.

[0196] According to an embodiment of this disclosure, a computer program product includes a computer program that, when executed by a processor, implements the method described above.

[0197] Figure 8 shows a schematic block diagram of an example electronic device 800 for implementing a model-based travel planning method according to embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0198] As shown in Figure 8, device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 802 or a computer program loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may also store various programs and data required for the operation of device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via bus 804. Input/output (I/O) interface 805 is also connected to bus 804.

[0199] Multiple components in device 800 are connected to I/O interface 805, including: input unit 806, such as a keyboard, mouse, etc.; output unit 807, such as various types of monitors, speakers, etc.; storage unit 808, such as a disk, optical disk, etc.; and communication unit 809, such as a network card, modem, wireless transceiver, etc. Communication unit 809 allows device 800 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

[0200] The computing unit 801 can be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as the model-based travel planning method. For example, in some embodiments, the model-based travel planning method can be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program can be loaded and/or installed on device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the model-based travel planning method described above can be performed. Alternatively, in other embodiments, the computing unit 801 can be configured to perform the model-based travel planning method by any other suitable means (e.g., by means of firmware).

[0201] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0202] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0203] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0204] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0205] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with implementations of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0206] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, distributed system servers, or servers incorporating blockchain technology.

[0207] It should be understood that steps can be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this disclosure is achieved; no limitation is imposed herein.

[0208] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A trip planning method based on a large model, comprising:
obtaining demand information of a target object;
displaying a spoken video and visual elements that match the demand information, wherein the spoken video represents a virtual object indicating the visual elements through a specified action during the spoken presentation, the visual elements represent extended trip demand information that differs semantically from the demand information, and the extended trip demand information represents at least one of the following demand attributes: a trip duration attribute, a trip budget attribute, a waypoint demand attribute, a means-of-transportation demand attribute, and a traveler attribute;
performing, using a large language model, demand detection on the target object based on at least one of first extended information and conversation information, to obtain second extended information that has a semantic hierarchical relationship with the first extended information, wherein the extended trip demand information includes the first extended information and the second extended information, and the second extended information represents the trip demand at the next semantic level of the first extended information;
displaying extended video clips, wherein the extended video clips represent the virtual object giving a spoken explanation of content related to the second extended information, and displaying, during the display of the extended video clips, visual elements corresponding to the second extended information;
in the event of a semantic contradiction between at least two of the extended video clips representing the demand information, the extended trip demand information selected by the target object, and first conversation information, updating, in response to a second request from the target object, at least one of the demand information, the extended trip demand information selected by the target object, and the first conversation information according to the second request, to obtain target trip demand information, wherein the first conversation information is determined based on a first request from the target object, the first request is determined based on an interactive behavior of the target object inputting conversation information, and the second request is determined in at least one of the following ways: based on an interactive behavior with respect to the visual elements; or based on second conversation information input by the target object; and
processing the target trip demand information using the large model to obtain trip planning information.
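
As an illustration only, the contradiction handling recited in claim 1 can be sketched as follows. The sketch assumes that each of the three information sources is reduced to a flat mapping from attribute name to value, and that two sources contradict each other when they give different values for the same attribute. In the claimed method the contradiction is semantic and would be judged by a model; plain value comparison is a stand-in. The names `find_contradictions`, `apply_second_request`, and all sample values are hypothetical.

```python
from itertools import combinations


def find_contradictions(sources):
    """Return {attribute: set of source names} for attributes whose values
    differ between at least two sources.

    Assumption: value inequality stands in for a semantic contradiction.
    """
    conflicts = {}
    for (name_a, a), (name_b, b) in combinations(sources.items(), 2):
        for attr in a.keys() & b.keys():
            if a[attr] != b[attr]:
                conflicts.setdefault(attr, set()).update({name_a, name_b})
    return conflicts


def apply_second_request(sources, second_request):
    """Merge the sources into target trip demand information.

    Conflicting attributes take the value given in the second request.
    A conflicting attribute that the second request does not mention is
    left out of the result.
    """
    conflicts = find_contradictions(sources)
    target = {}
    for source in sources.values():
        for attr, value in source.items():
            if attr not in conflicts:
                target[attr] = value
    for attr in conflicts:
        if attr in second_request:
            target[attr] = second_request[attr]
    return target


# Hypothetical sample data for the three sources named in claim 1.
sources = {
    "demand": {"destination": "Chengdu", "duration_days": 3},
    "selected_extended": {"budget": 5000, "duration_days": 5},
    "first_conversation": {"transport": "train"},
}
second_request = {"duration_days": 5}
target = apply_second_request(sources, second_request)
```

The merged `target` mapping is what would then be handed to the large model to obtain the trip planning information.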

2. The method according to claim 1, wherein displaying the spoken video and the visual elements that match the demand information comprises: displaying the visual element in a target area indicated by the specified action in the spoken video, wherein a video clip related to the specified action is used to explain the extended trip demand information represented by the visual element.

3. The method according to claim 1 or 2, wherein the spoken video is determined based on the following operations: editing a script based on the demand information using the large model to obtain a spoken script, wherein the spoken script includes a spoken text fragment that matches the demand information and action script data that represents the specified action; and driving the virtual object based on the spoken text fragment and the action script data to obtain the spoken video.
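
One possible shape for the script of claim 3 is sketched below, purely as an assumption. Each segment pairs a spoken text fragment with optional action script data, and the segments are flattened into instructions for whatever component drives the virtual object. A fixed text template stands in for the large model that edits the script. `ActionCue`, `ScriptSegment`, `build_script`, `to_driver_plan`, and the `slot_N` area names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ActionCue:
    """Action script data: what the virtual object does and where it points."""
    action: str        # e.g. "hand_gesture", "head_movement", "eye_movement"
    target_area: str   # screen area indicated by the action
    element: str       # visual element displayed in that area


@dataclass
class ScriptSegment:
    """A spoken text fragment, optionally paired with an action cue."""
    text: str
    cue: Optional[ActionCue] = None


def build_script(demand: str, extended_attributes: List[str]) -> List[ScriptSegment]:
    """Template stand-in for script editing by the large model (assumption)."""
    segments = [ScriptSegment(text=f"Let's plan your trip: {demand}.")]
    for i, attr in enumerate(extended_attributes):
        cue = ActionCue("hand_gesture", f"slot_{i}", attr)
        segments.append(ScriptSegment(text=f"You can also tell me your {attr}.", cue=cue))
    return segments


def to_driver_plan(segments: List[ScriptSegment]) -> List[Tuple[str, ...]]:
    """Flatten the script into ordered "say" and "act" instructions."""
    plan: List[Tuple[str, ...]] = []
    for seg in segments:
        plan.append(("say", seg.text))
        if seg.cue is not None:
            plan.append(("act", seg.cue.action, seg.cue.target_area, seg.cue.element))
    return plan


script = build_script("3 days in Chengdu", ["budget", "transport"])
plan = to_driver_plan(script)
```

Keeping the action cue next to the text it accompanies makes it straightforward to show the visual element in the indicated area while the matching sentence is spoken.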

4. The method according to claim 1, wherein the extended video clip is determined based on the following operations: editing a script using the large model, based on at least one of the extended trip demand information and the conversation information, to obtain an extended clip script; and driving the virtual object based on the extended clip script to obtain the extended video clip.

5. The method according to claim 1, wherein the extended video clip is determined from a preset set of video clips based on at least one of the extended trip demand information and the conversation information.
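
The selection from a preset set in claim 5 could, for example, be approximated by tag overlap. The sketch below assumes each preset clip carries a set of descriptive tags and picks the clip sharing the most terms with the extended demand or conversation terms. The claim does not specify a matching rule, so this scoring, the function name `select_preset_clip`, and the clip identifiers are all hypothetical.

```python
def select_preset_clip(presets, query_terms):
    """Return the id of the preset clip whose tags overlap most with the
    query terms, or None when no clip shares any term.

    presets: mapping of clip id -> iterable of tag strings (assumed layout).
    """
    query = {term.lower() for term in query_terms}
    best_id, best_score = None, 0
    for clip_id, tags in presets.items():
        score = len(query & {tag.lower() for tag in tags})
        if score > best_score:
            best_id, best_score = clip_id, score
    return best_id


# Hypothetical preset library.
presets = {
    "clip_budget": {"budget", "price"},
    "clip_transport": {"train", "flight", "transport"},
}
chosen = select_preset_clip(presets, ["Train", "window seat"])
```

Returning `None` when nothing matches leaves room for a fallback such as generating the clip from a script, as in claim 4.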

6. The method according to claim 1, wherein the specified action includes at least one of the following actions of the virtual object: an eye movement, a head movement, and a hand gesture.

7. The method according to claim 1, wherein the visual elements are obtained by using a large language model to perform trip demand detection based on the demand information and historical preference attributes of the target object.

8. The method according to claim 1, wherein the interactive behavior with respect to the visual elements includes at least one of the following: a voice interaction behavior, a gesture interaction behavior, an eye-tracking interaction behavior, and a touch operation behavior.

9. The method according to claim 1, further comprising: displaying the trip planning information during playback of the spoken video; or displaying the trip planning information when playback of the spoken video ends.

10. The method according to claim 1, further comprising: performing semantic understanding of the demand information using the large model to obtain basic trip planning information; and displaying the basic trip planning information.

11. A trip planning device based on a large model, comprising:
an acquisition module configured to obtain demand information of a target object;
a display module configured to display a spoken video and visual elements that match the demand information, wherein the spoken video represents a virtual object indicating the visual elements through a specified action during the spoken presentation, the visual elements represent extended trip demand information that differs semantically from the demand information, and the extended trip demand information represents at least one of the following demand attributes: a trip duration attribute, a trip budget attribute, a waypoint demand attribute, a means-of-transportation demand attribute, and a traveler attribute;
a trip planning information acquisition module configured to process at least one of the extended trip demand information and conversation information using a large model to obtain trip planning information, wherein the conversation information is input data of the target object for the spoken video, and the extended trip demand information is determined based on an interactive behavior with respect to the visual elements;
a second obtaining module configured to perform, using a large language model, demand detection on the target object based on at least one of first extended information and the conversation information, to obtain second extended information that has a semantic hierarchical relationship with the first extended information, wherein the extended trip demand information includes the first extended information and the second extended information, and the second extended information represents the trip demand at the next semantic level of the first extended information; and
an extended video clip display module configured to display extended video clips, wherein the extended video clips represent the virtual object giving a spoken explanation of content related to the second extended information, and visual elements corresponding to the second extended information are displayed during the display of the extended video clips;
wherein the trip planning information acquisition module includes: a first processing submodule configured to process, using the large model, the demand information and the extended trip demand information determined based on the interactive behavior with respect to the visual elements, to obtain the trip planning information; and
wherein the device is further configured to: in the event of a semantic contradiction between at least two of the extended video clips representing the demand information, the extended trip demand information selected by the target object, and first conversation information, update, in response to a second request from the target object, at least one of the demand information, the extended trip demand information selected by the target object, and the first conversation information according to the second request, to obtain target trip demand information, wherein the first conversation information is determined based on a first request from the target object, the first request is determined based on an interactive behavior of the target object inputting conversation information, and the second request is determined in at least one of the following ways: based on an interactive behavior with respect to the visual elements; or based on second conversation information input by the target object; and process the target trip demand information using the large model to obtain the trip planning information.

12. An artificial intelligence agent, comprising: an input module configured to receive input information; a processing module configured to determine a target task based on the input information received by the input module, determine a large model based on the target task, and obtain output information by calling the large model to execute the method of any one of claims 1 to 10; and an output module configured to output the output information obtained by the processing module.
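
The three-module structure of claim 12 can be pictured with the following minimal sketch. It assumes the processing module keeps a registry that maps a task name to a callable model and determines the task with a simple keyword check. Both the keyword rule and the stub models are placeholders for illustration, and the names `Agent`, `determine_task`, and the task labels are hypothetical.

```python
from typing import Callable, Dict


class Agent:
    """Input, processing, and output modules modeled as methods (assumption)."""

    TRIP_KEYWORDS = ("trip", "travel", "itinerary")

    def __init__(self, models: Dict[str, Callable[[str], str]]):
        # Registry: task name -> model callable. Real large models would be
        # called here; plain functions stand in for them.
        self.models = models

    def receive(self, raw: str) -> str:
        """Input module: accept and normalize the input information."""
        return raw.strip()

    def determine_task(self, text: str) -> str:
        """Processing step 1: derive the target task from the input."""
        lowered = text.lower()
        if any(word in lowered for word in self.TRIP_KEYWORDS):
            return "trip_planning"
        return "chat"

    def process(self, text: str) -> str:
        """Processing steps 2 and 3: pick the model for the task and call it."""
        model = self.models[self.determine_task(text)]
        return model(text)

    def output(self, result: str) -> str:
        """Output module: hand the output information to the caller."""
        return result

    def run(self, raw: str) -> str:
        return self.output(self.process(self.receive(raw)))


agent = Agent({
    "trip_planning": lambda text: "PLAN:" + text,
    "chat": lambda text: "CHAT:" + text,
})
reply = agent.run("  Plan a trip to Chengdu ")
```

Selecting the model through a registry keeps the task decision separate from the model call, so additional tasks only require a new registry entry.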

13. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 10.

14. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 10.

15. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1 to 10.