Question-answering method, apparatus, device, and medium based on community multimodal information
By constructing a question-and-answer system for community multimodal information, collecting and parsing multimodal data, establishing a hybrid indexing system, identifying target roles, and generating differentiated responses, the system solves the problem of misunderstandings caused by the dispersion of cross-modal information and role differences, and achieves efficient and accurate question-and-answer and self-optimization.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Application (China)
- Current Assignee / Owner
- NANCHANG UNIV
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing community question-and-answer systems cannot understand cross-modal semantic information in a unified way and lack contextual and role-based interpretations. This causes information asymmetry between roles, makes it difficult to form a unified understanding, prevents knowledge conflicts from being detected in real time, and leaves no way to use feedback to continuously improve the quality of question answering and decision-making.
We construct a question-answering method based on community multimodal information that collects the full multimodal data of a community, builds a knowledge base and a material index library, performs retrieval and adaptation through a hybrid index system, identifies the target role and generates differentiated responses, resolves conflicts using multiple agents together with a role semantic space mapping mechanism, and supports semantic queries with visual source tracing.
It achieves unified semantic parsing of cross-modal information, improves the accuracy and efficiency of knowledge retrieval, ensures that different roles receive understandable answers, reduces misunderstandings, and supports the system's self-learning and optimization.
Abstract figure: CN122240813A_ABST
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence technology, and in particular to a question-answering method, apparatus, device, and medium based on community multimodal information.
Background Art
[0002] With the acceleration of urbanization and the advancement of digital transformation in community governance, urban communities, as the core carriers of residents' lives, are experiencing a continuous increase in management complexity and information interaction needs.
[0003] Current community renewal and management systems have the following main problems: 1. Cross-modal semantic information is scattered across heterogeneous formats and cannot be understood in a unified way.
[0004] 2. Existing community Q&A systems can only retrieve answers and lack contextualized and role-based explanations.
[0005] Existing community management question-and-answer systems rely on keyword searches or FAQ (Frequently Asked Questions) sets. They lack the ability to understand multimodal data, answer questions in the context of the community's current situation, and generate different interpretations for different roles. Residents, property management, and planners may receive the same answers, but these answers may have different meanings for them, easily leading to misunderstandings.
[0006] 3. Semantic shift in role-based expression.
[0007] When the same knowledge point is restated for different audiences (such as residents and experts), the conversion can easily introduce hallucinations or inaccurate wording, leading to misunderstandings.
[0008] 4. Information asymmetry among multiple roles makes it difficult to form a unified understanding.
[0009] With existing community Q&A systems, residents cannot understand why plans are made the way they are, property management cannot understand what residents are truly concerned about, planners cannot quickly summarize resident feedback, and decision-makers cannot see the overall trend of information. What is missing is a comprehensive system that automatically converts between semantic levels so that each role receives a version it can understand.
[0010] 5. Potential knowledge conflicts in dynamic environments.
[0011] Traditional systems often fail to detect changes, such as whether residents' pain points have shifted or whether trends are emerging in community public opinion. As a result, planning and interpretations drift away from the actual situation, and the quality of recommendations is low. There is a timeliness conflict between real-time feedback and pre-stored knowledge bases.
[0012] 6. Inability to leverage feedback to continuously improve the quality of question answering and decision-making.
[0013] Most systems only output answers without assessing residents' acceptance, analyzing the sources of misunderstandings, judging the effectiveness of explanations, or summarizing long-term feedback for planning optimization. They cannot form a self-learning loop and are unable to improve themselves as the community changes.
Summary of the Invention
[0014] In view of the above, it is necessary to provide a question-answering method, apparatus, device and medium based on community multimodal information, which can provide accurate and role- and context-appropriate responses to user questions based on community multimodal data.
[0015] A question-answering method based on community multimodal information includes: collecting the full multimodal data of a target community, and constructing, based on the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization; establishing a hybrid indexing system based on the knowledge base and the material index library; when a target question is received from a target user, identifying the target user's target role based on the target question and generating a target response strategy based on the target role; and, based on the target response strategy, invoking multiple agents to perform retrieval in the hybrid indexing system to obtain retrieval results, and adapting and adjusting the retrieval results based on a role semantic space mapping mechanism and a knowledge conflict resolution mechanism to obtain target response data including answer data and an evidence chain.
[0016] A question-answering device based on community multimodal information includes: a construction unit, configured to collect the full multimodal data of the target community and to construct, based on the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization; a building unit, configured to establish a hybrid indexing system based on the knowledge base and the material index library; a generation unit, configured to, when a target question initiated by a target user is received, identify the target user's target role based on the target question and generate a target response strategy based on the target role; and a response unit, configured to invoke multiple agents to perform retrieval in the hybrid indexing system based on the target response strategy to obtain retrieval results, and to adapt and adjust the retrieval results based on the role semantic space mapping mechanism and the knowledge conflict resolution mechanism to obtain target response data including answer data and an evidence chain.
[0017] A computer device, comprising: a memory for storing at least one instruction; and a processor for executing the instructions stored in the memory to implement the question-answering method based on community multimodal information.
[0018] A computer-readable storage medium storing at least one instruction, which is executed by a processor in a computer device to implement the question-answering method based on community multimodal information.
[0019] As can be seen from the above technical solutions, the invention constructs, from the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization, which solves the problems of scattered cross-modal information and the inability to understand it collaboratively. A hybrid index system established on the knowledge base and the material index library improves the accuracy and efficiency of knowledge retrieval and supports bidirectional verification between semantic queries and visual tracing. Identifying the target user's target role from the target question, and generating a target response strategy from that role, provides the basis for differentiated answers and solves the problem of misunderstandings caused by role differences. Invoking multiple agents to retrieve results from the hybrid index system according to the target response strategy, and adapting and adjusting those results with the role semantic space mapping mechanism and the knowledge conflict resolution mechanism, yields target response data including answer data and evidence chains. This ensures that different roles obtain understandable answers, improves answer accuracy, and reduces misleading information.
Description of the Drawings
[0020] Figure 1 is a flowchart of a preferred embodiment of the question-answering method based on community multimodal information according to the present invention.
[0021] Figure 2 is a functional block diagram of a preferred embodiment of the question-answering device based on community multimodal information according to the present invention.
[0022] Figure 3 is a schematic structural diagram of a computer device that implements the question-answering method based on community multimodal information according to a preferred embodiment of the present invention.
Detailed Description of Embodiments
[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
[0024] Figure 1 is a flowchart of a preferred embodiment of the question-answering method based on community multimodal information according to the present invention. Depending on requirements, the order of the steps in this flowchart can be changed, and some steps can be omitted.
[0025] The question-answering method based on community multimodal information is applied to one or more computer devices. A computer device is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.
[0026] The computer device can be any electronic product that can interact with the user, such as a personal computer, tablet computer, smartphone, personal digital assistant (PDA), game console, interactive network television (IPTV), smart wearable device, etc.
[0027] The computer device may also include network equipment and/or user equipment. The network equipment includes, but is not limited to, a single network server, a server group consisting of multiple network servers, or a cloud based on cloud computing consisting of a large number of hosts or network servers.
[0028] The server can be a standalone server or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms.
[0029] Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
[0030] Foundational technologies for artificial intelligence generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. AI software technologies mainly encompass computer vision, robotics, biometrics, speech processing, natural language processing, and machine learning/deep learning.
[0031] The network in which the computer device is located includes, but is not limited to, the Internet, wide area network, metropolitan area network, local area network, and virtual private network (VPN).
[0032] S10: Collect the full multimodal data of the target community, and construct, based on the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization.
[0033] In this embodiment, the full multimodal data may include, but is not limited to: guidance texts, community regulations, planning drawings (CAD (Computer-Aided Design) drawings, renderings), current status photos or videos, meeting minutes, resident chat records, questionnaire feedback, and structured facility data (such as elevator numbers and green space area statistics).
[0034] In this embodiment, after the full multimodal data of the target community is collected, the method further includes: extracting text data from the full multimodal data, removing redundant symbols, standardizing terminology (for example, mapping variants such as "greening rate" and "green space rate" to a single standard term), correcting typos, and performing word segmentation and part-of-speech tagging with natural language processing (NLP) tools; and extracting image or video data from the full multimodal data, processing the image data in turn to unify resolution, convert formats (e.g., from PNG to JPG), and remove blurred or duplicate material, and processing the video data with frame extraction to preserve key images.
[0035] In this embodiment, constructing, based on the full multimodal data, the knowledge base for question-answering understanding and reasoning and the material index library for evidence citation and visualization includes: invoking a large language model to perform semantic parsing on the full multimodal data to obtain conceptual semantics, functional areas, and indicator items, and converting them into a specified format to obtain first data; extracting the text information in the image data of the full multimodal data using optical character recognition (OCR); locating the key areas of the image data with a bounding box detection algorithm and segmenting the image data into image fragments according to semantic logic; extracting the spatial topological feature vector of each image fragment with a visual encoder and generating a visual semantic label for each image fragment; extracting structured knowledge and visual index information from the first data, the text information, the key areas, the image fragments, the spatial topological feature vectors, and the visual semantic labels; and writing the structured knowledge into the knowledge base and the visual index information into the material index library.
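A minimal sketch of the text-cleaning step follows, assuming English-language records. The synonym table `TERM_MAP`, the typo table `TYPO_MAP`, and the function `normalize_text` are hypothetical: the patent does not specify which standard term is chosen or how typos are detected, and word segmentation, part-of-speech tagging, and the image/video steps are not shown.

```python
import re

# Hypothetical synonym table: every variant maps to one standard term.
TERM_MAP = {
    "green space rate": "greening rate",
    "green coverage rate": "greening rate",
}

# Hypothetical typo table; a real pipeline would use a spelling model.
TYPO_MAP = {"elevater": "elevator"}


def normalize_text(raw: str) -> str:
    """Clean one text record before segmentation and part-of-speech tagging."""
    text = raw.lower()
    # Remove redundant symbols: collapse repeated punctuation such as "!!!" to "!".
    text = re.sub(r"([!?.,~*#])\1+", r"\1", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    # Correct known typos on word boundaries.
    for wrong, right in TYPO_MAP.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    # Map terminology variants to the standard term.
    for variant, standard in TERM_MAP.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", standard, text)
    return text


print(normalize_text("Green space rate   must be >= 30%!!!"))  # -> greening rate must be >= 30%!
```

Applying the term map after typo correction means a misspelled variant can still be standardized once it is fixed.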
[0036] The conceptual semantics mentioned above may include community public service facilities, plot ratio indicators, etc.
[0037] The functional areas may include activity areas for the elderly, fire exits, etc.
[0038] The indicators may include plot ratio, green space ratio, building density, number of parking spaces, etc.
[0039] The specified format may include JSON format, and the fields in the JSON format may include knowledge entities, relationships, source files, confidence scores, etc.
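A sketch of one "first data" record in JSON follows, using only the fields named above (knowledge entities, relationships, source file, confidence score). The class `KnowledgeRecord`, its field names, the relationship triple layout, and the example file name are illustrative assumptions, not the patent's schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class KnowledgeRecord:
    entities: list        # knowledge entities, e.g. facilities or indicators
    relationships: list   # [subject, relation, object] triples
    source_file: str      # file the knowledge was parsed from
    confidence: float     # parser confidence score in [0, 1]


def to_first_data(record: KnowledgeRecord) -> str:
    """Serialize one parsed record as JSON 'first data'."""
    if not 0.0 <= record.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    return json.dumps(asdict(record), ensure_ascii=False, sort_keys=True)


# Hypothetical example record.
record = KnowledgeRecord(
    entities=["elderly activity area", "greening rate"],
    relationships=[["elderly activity area", "located_in", "block A"]],
    source_file="planning_guidance.pdf",
    confidence=0.92,
)
print(to_first_data(record))
```

`ensure_ascii=False` keeps Chinese entity names readable in the stored JSON.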
[0040] The image data may include drawings, photographs, etc.
[0041] The key areas may include building outlines, green areas, etc.
[0042] The process of segmenting the image data into image fragments according to semantic logic may include: dividing the master plan into sub-blocks such as residential clusters, commercial facilities, and municipal facilities.
[0043] The visual encoder may include ResNet (Residual Neural Network), ViT (Vision Transformer), etc.
[0044] The spatial topological feature vector may include shape, adjacency relationship, area ratio, coordinate position, etc.
[0045] The visual semantic tags may include: tile type (such as road, green space, building), coordinate range (such as X1-Y1 to X2-Y2), image embedding vector, original document page number and paragraph index, etc.
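The spatial topological features and visual semantic labels above can be illustrated with axis-aligned bounding boxes. This sketch assumes each fragment is a rectangle `(x1, y1, x2, y2)` in drawing coordinates and treats two fragments as adjacent when their boxes touch or overlap (corners included). `Fragment`, `touches`, and `spatial_features` are hypothetical names, and a real visual encoder (ResNet, ViT) would supply the shape and embedding features that are not computed here.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    tile_type: str   # visual semantic label, e.g. "road", "green space", "building"
    box: tuple       # coordinate range (x1, y1, x2, y2) in drawing units
    page: int        # original document page number
    paragraph: int   # original paragraph index


def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)


def touches(a, b):
    """True if two boxes share an edge or corner, or overlap."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2


def spatial_features(frags, plan_box):
    """Per fragment: center coordinate, area ratio to the whole plan, adjacent fragments."""
    plan_area = area(plan_box)
    feats = []
    for i, f in enumerate(frags):
        x1, y1, x2, y2 = f.box
        feats.append({
            "type": f.tile_type,
            "center": ((x1 + x2) / 2, (y1 + y2) / 2),
            "area_ratio": area(f.box) / plan_area,
            "adjacent": [j for j, g in enumerate(frags) if j != i and touches(f.box, g.box)],
        })
    return feats


plan = (0, 0, 100, 100)
frags = [
    Fragment("green space", (0, 0, 40, 50), page=3, paragraph=2),
    Fragment("building", (40, 0, 100, 50), page=3, paragraph=2),
    Fragment("road", (0, 60, 100, 70), page=3, paragraph=4),
]
for feat in spatial_features(frags, plan):
    print(feat)
```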
[0046] The structured knowledge may include textual semantic entities, indicator items, relationships, and JSON format data.
[0047] The visual index information may include image segmentation, visual labels, spatial feature vectors, and original file tracing information.
[0048] Through the above embodiments, the barriers between multimodal data can be broken down and text, images, and structured data can be semantically parsed in a unified way, solving the problem in traditional systems that cross-modal information is scattered and cannot be understood collaboratively. The generated structured knowledge base and visual index library provide standardized data for subsequent knowledge matching and answer generation, ensuring that the information is both searchable and usable for reasoning. OCR, visual encoding, and related techniques enable fine-grained parsing of image information, turning visual materials such as planning drawings into computable feature data and thereby improving the utilization of spatial information.
[0049] S11: Establish a hybrid indexing system based on the knowledge base and the material index library.
[0050] In this embodiment, establishing the hybrid indexing system based on the knowledge base and the material index library includes: vectorizing the data in the knowledge base to obtain text and semantic vectors, which includes converting the text-based knowledge in the knowledge base into high-dimensional semantic vectors and introducing a logic encoder to convert the structured indicators in the knowledge base into operator-based high-dimensional vectors; vectorizing the data in the material index library to obtain visual vectors, which includes obtaining the visual labels and original file traceability information of each image fragment from the material index library, generating the image embedding vector of each image fragment with a visual encoder and associating each image embedding vector with its visual labels and traceability information, obtaining the spatial entities and the local spatial feature vector of each spatial entity from the material index library, constructing a spatial feature matrix, and establishing a mapping between the visual vectors and the semantic labels; constructing, from the text and semantic vectors and the visual vectors, a hybrid index system comprising a semantic vector index, a visual vector index, and a spatial topology index; and updating the hybrid index system in real time as new multimodal data is added.
[0051] For example, large models or specialized embedding models (such as Sentence-BERT (Sentence Bidirectional Encoder Representations from Transformers) or ERNIE (Enhanced Representation through kNowledge IntEgration)) can be used to convert textual knowledge such as clauses and semantic entities in the knowledge base into high-dimensional semantic vectors (such as 768-dimensional or 1024-dimensional vectors).
[0052] For example, the operator-based high-dimensional vector may include: extracting the structured indicator "greening rate not less than 30%" into a triple {object: green space, operator: ≥, value: 0.3}, and mapping it into a vector representation containing operator features.
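The operator-based encoding of a structured indicator can be sketched as a one-hot object slot, a one-hot operator slot, and the numeric threshold. The operator vocabulary, the slot order, and the helpers `encode_constraint` and `satisfies` are assumptions for illustration; the patent's logic encoder is not specified.

```python
import operator

# Hypothetical operator vocabulary; the slot order is an assumption.
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       ">=": operator.ge, ">": operator.gt}
OP_ORDER = list(OPS)


def encode_constraint(obj, op, value, objects):
    """Encode {object, operator, value} as one-hot object + one-hot operator + value."""
    if op not in OPS:
        raise ValueError(f"unknown operator {op!r}")
    obj_part = [1.0 if o == obj else 0.0 for o in objects]
    op_part = [1.0 if o == op else 0.0 for o in OP_ORDER]
    return obj_part + op_part + [float(value)]


def satisfies(measured, op, value):
    """Check a measured indicator against the encoded constraint."""
    return OPS[op](measured, value)


objects = ["green space", "building", "road"]
# "Greening rate not less than 30%" -> {object: green space, operator: >=, value: 0.3}
print(encode_constraint("green space", ">=", 0.3, objects))
print(satisfies(0.28, ">=", 0.3))
```

Keeping the operator as an explicit feature lets "≥ 30%" and "≤ 30%" map to different vectors, which a plain text embedding may not separate well.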
[0053] For example, the visual labels can be tile type, coordinate location, etc., and the original file tracing information can be file name, page number, paragraph index, etc. The spatial entities can include green space, buildings, roads, etc., and the local spatial feature vectors can include geometric shape, center coordinates, adjacency topology, etc.
[0054] For example, the semantic vector index can be used for text-based knowledge similarity retrieval, the visual vector index can be used for image fragment matching, and the spatial topology index can be used for spatial relationship querying.
[0055] Furthermore, vector databases can be used to optimize retrieval performance, supporting millisecond-level query responses.
[0056] The hybrid indexing system also supports incremental writing, which can receive new data from the feedback collection module (such as new feedback from residents and updated guidance documents), and update the knowledge base, index, and spatial feature matrix in real time to ensure data timeliness.
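A toy in-memory stand-in for the three sub-indexes, supporting incremental writes and Top-N cosine retrieval, is sketched below. The class `HybridIndex` is hypothetical; the embeddings are assumed to come from an encoder such as Sentence-BERT or a visual encoder, and a production system would use a vector database rather than a linear scan.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class HybridIndex:
    """Toy stand-in for the semantic, visual, and spatial topology indexes."""

    def __init__(self):
        self.semantic = {}   # doc_id -> text/semantic embedding
        self.visual = {}     # fragment_id -> image embedding
        self.spatial = {}    # entity_id -> set of adjacent entity ids

    def add_text(self, doc_id, vec):
        self.semantic[doc_id] = vec   # incremental write: inserts new or overwrites updated

    def add_fragment(self, frag_id, vec):
        self.visual[frag_id] = vec

    def add_adjacency(self, a, b):
        self.spatial.setdefault(a, set()).add(b)
        self.spatial.setdefault(b, set()).add(a)

    @staticmethod
    def _top_n(store, query, n):
        ranked = sorted(store.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
        return [key for key, _ in ranked[:n]]

    def search_text(self, query, n=3):
        return self._top_n(self.semantic, query, n)

    def search_visual(self, query, n=3):
        return self._top_n(self.visual, query, n)

    def neighbours(self, entity_id):
        return sorted(self.spatial.get(entity_id, set()))


idx = HybridIndex()
idx.add_text("clause_greening", [1.0, 0.0, 0.0])
idx.add_text("clause_parking", [0.0, 1.0, 0.0])
idx.add_fragment("plan_p3_green_1", [0.8, 0.2])
idx.add_adjacency("green_1", "building_3")
print(idx.search_text([0.9, 0.1, 0.0], n=1))
print(idx.neighbours("green_1"))
```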
[0057] Through the above embodiments, a unified vector representation of multimodal knowledge can be achieved, providing a foundation for cross-modal knowledge matching and solving the problem of difficulty in associating text and images in traditional systems; the combination of vector index and spatial topology index improves the accuracy and efficiency of knowledge retrieval and supports bidirectional verification of semantic query and visual tracing; the dynamic update mechanism ensures that the knowledge base and material index can receive new data in real time, solving the pain points of knowledge solidification and poor timeliness in traditional systems.
[0058] S12: When a target question initiated by a target user is received, identify the target user's target role based on the target question, and generate a target response strategy based on the target role.
[0059] In this embodiment, multiple input channels are supported, including text questions via the app, voice-to-text questions, and image uploads with text descriptions (such as uploading photos to ask "Can charging stations be added to this area?"). Support for multiple input channels and role-based verification improves system adaptability and recognition accuracy.
[0060] In this embodiment, identifying the target user's target role based on the target question and generating the target response strategy based on the target role includes: obtaining the target user's login identity tag and determining a candidate role from it; extracting the key entities and constraints in the target question and verifying the candidate role against them; when the candidate role passes verification, determining the candidate role as the target role; identifying the type of the target question with an intent recognition model; retrieving the priority rules configured for that question type; and generating the target response strategy based on the target role, the target question type, and the priority rules.
[0061] For example, the user login identity tags may include residents, property management personnel, planning and design personnel, street office managers, etc.; the key entities may include parking fees, senior activity areas, west gate access control, etc.; and the constraints may include planning requirements, newly added facilities, etc. If the question includes "plot ratio calculation standard", the user is likely a planner or designer; if it includes "garbage collection time", the user is likely a resident.
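Role verification can be sketched as a keyword check of the question against cues for each role, using the examples above. The cue table and `identify_role` are hypothetical, and falling back to the best-matching role when verification fails is an assumption: the text only states what happens when the candidate role passes.

```python
# Hypothetical keyword cues per role, partly taken from the examples in the text.
ROLE_CUES = {
    "planner": ["plot ratio calculation standard", "planning requirement"],
    "resident": ["garbage collection time", "parking fee"],
    "property_manager": ["repair work order", "complaint handling"],
    "manager": ["public opinion trend"],
}


def identify_role(login_tag, question):
    """Verify the candidate role from the login tag against cues in the question.

    Verification passes when no role's cues match, or when the candidate scores at
    least as high as the best-matching role. Falling back to the best-matching role
    on failure is an assumption; the text only defines the passing case."""
    q = question.lower()
    scores = {role: sum(cue in q for cue in cues) for role, cues in ROLE_CUES.items()}
    best = max(scores, key=scores.get)
    if scores[best] == 0 or scores.get(login_tag, 0) >= scores[best]:
        return login_tag
    return best


print(identify_role("resident", "When is the garbage collection time?"))
print(identify_role("resident", "What is the plot ratio calculation standard for Block B?"))
```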
[0062] The intent recognition model may include BERT (Bidirectional Encoder Representations from Transformers), etc.
[0063] The target question types may include simple retrieval questions (such as "What is the standard for community parking fees?"), complex planning and reasoning questions (such as "Does the addition of an activity area for the elderly meet planning requirements?"), and feedback and suggestion questions (such as "It is recommended to add access control at the west gate of the community").
[0064] The priority rules can include: urgent requests (such as "elevator malfunction repair") > inquiries (such as "demolition compensation standards") > planning suggestions (such as "adding fitness facilities") > general inquiries (such as "community office hours"). Furthermore, the priority rules can be adjusted based on the complexity of the issue and the role's weight; for example, a manager's query for "community public opinion trends" has a higher priority than a resident's general inquiry.
[0065] The target response strategy may include the following: pre-setting processing paths based on role type (such as resident, property management, planner, manager), including response granularity (common or professional), output content (such as text description, data report, visualization chart), and feedback collection focus (such as satisfaction, execution difficulty, matching degree).
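The priority rules and the per-role strategy can be combined as below. The numeric priorities follow the ordering in [0064], while the role weights, the per-role preset table, the question-type labels, and the names `response_strategy` and `RequestQueue` are hypothetical placeholders.

```python
import heapq

# Lower value = served first; the order follows the example rules in [0064].
BASE_PRIORITY = {"urgent": 0, "inquiry": 1, "planning_suggestion": 2, "general": 3}
# Hypothetical role weights that promote a request within its class.
ROLE_WEIGHT = {"manager": 0.5, "planner": 0.2, "property_manager": 0.2, "resident": 0.0}
# Hypothetical per-role presets: (granularity, output content, feedback focus).
ROLE_PRESET = {
    "resident": ("common", ["text description"], "satisfaction"),
    "property_manager": ("common", ["text description", "data report"], "execution difficulty"),
    "planner": ("professional", ["text description", "visualization chart"], "matching degree"),
    "manager": ("professional", ["data report", "visualization chart"], "matching degree"),
}


def response_strategy(role, question_type, request_class):
    """Build a target response strategy from role, question type, and priority rules."""
    granularity, outputs, focus = ROLE_PRESET[role]
    return {
        "role": role,
        "question_type": question_type,
        "granularity": granularity,
        "outputs": outputs,
        "feedback_focus": focus,
        "priority": BASE_PRIORITY[request_class] - ROLE_WEIGHT.get(role, 0.0),
    }


class RequestQueue:
    """Min-heap of questions ordered by strategy priority, then arrival order."""

    def __init__(self):
        self._heap = []
        self._seq = 0

    def push(self, strategy, question):
        heapq.heappush(self._heap, (strategy["priority"], self._seq, question))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = RequestQueue()
queue.push(response_strategy("resident", "simple_retrieval", "general"), "Community office hours?")
queue.push(response_strategy("manager", "simple_retrieval", "general"), "Community public opinion trends?")
queue.push(response_strategy("resident", "simple_retrieval", "urgent"), "Elevator malfunction repair")
print([queue.pop() for _ in range(3)])
```

The arrival counter breaks ties so that equal-priority requests are served first-come, first-served.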
[0066] The above embodiments can accurately identify user question types and roles, providing a basis for generating differentiated answers and solving the misunderstanding problem caused by the "one-size-fits-all" answers of traditional systems; the priority determination mechanism can ensure that urgent requests and important queries are responded to quickly, improving the user experience.
[0067] S13: Based on the target response strategy, invoke multiple agents to perform retrieval in the hybrid index system to obtain retrieval results, and adapt and adjust the retrieval results based on the role semantic space mapping mechanism and the knowledge conflict resolution mechanism to obtain target response data including answer data and an evidence chain.
[0068] In this embodiment, this step includes the following. Retrieval-Augmented Generation (RAG) is used to obtain candidate texts by similarity matching of the target question against the knowledge base and candidate images by similarity matching of the target question against the material index library, and the candidate texts and candidate images are combined into the retrieval result. When the target question is a high-complexity question, a cross-modal attention mechanism is introduced that uses the text constraint vector in the target question as a query operator to focus on the relevant regions of the spatial feature matrix, thereby verifying the consistency between the candidate texts and the candidate images, and each agent in the multi-agent group is invoked to generate response data. A learnable role adaptation matrix containing mapping parameters for different roles is constructed, and the original semantic features of each agent's response data are obtained. These features are projected into the semantic space of the target role by the role adaptation matrix according to the following formula:
$$H_{\text{target}} = \sigma\left(W_{\text{role}} \cdot H_{\text{pro}} + b\right)$$
where $H_{\text{target}}$ denotes the target semantic features of each agent in the semantic space of the target role, $\sigma$ denotes a nonlinear activation function, $W_{\text{role}}$ denotes the role adaptation matrix, $H_{\text{pro}}$ denotes the original semantic features of each agent, and $b$ denotes the bias term. The target semantic features of each agent are then adapted and adjusted according to the target role to obtain diversified output data for each agent; conflicting data among the agents' diversified output data is identified; the confidence level of each agent is obtained, and the conflicting data is eliminated based on these confidence levels; and the optimal evidence chain and answer data are generated from the conflict-free data as the target response data. When the target question is a low-complexity question, a lightweight machine learning model in the multi-agent system (such as logistic regression combined with a rule base) is invoked to generate the target response data. The system dynamically adjusts the mapping parameters for different roles based on audience preference data. For example, if residents frequently report that "plot ratio" (floor area ratio) is difficult to understand, the weight of this term is reduced and a more colloquial alternative, "building crowding level", is added; the frequently used term "community greening" is also added to the resident role terminology database. By adjusting the mapping parameters in this way, the system evolves on demand, continuously improving question-answering accuracy and its suitability for different roles.
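The projection H_target = σ(W_role · H_pro + b) can be sketched for each agent's feature vector as follows. The patent does not name the activation function; this sketch assumes a logistic sigmoid, and the per-role matrices in `ROLE_PARAMS` are placeholder numbers, not learned parameters. The agent names follow the examples in [0071].

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def project_to_role(h_pro, w_role, b):
    """Compute H_target = sigmoid(W_role . H_pro + b) for one agent's features."""
    if len(w_role) != len(b) or any(len(row) != len(h_pro) for row in w_role):
        raise ValueError("W_role must be len(b) x len(h_pro)")
    return [sigmoid(sum(w * h for w, h in zip(row, h_pro)) + bias)
            for row, bias in zip(w_role, b)]


# One (W_role, b) pair per role; these numbers are placeholders, not learned values.
ROLE_PARAMS = {
    "resident": ([[0.5, -0.5], [0.0, 1.0]], [0.0, 0.0]),
    "planner": ([[1.0, 0.0], [0.0, 1.0]], [0.1, 0.1]),
}


def project_agents(agent_features, role):
    """Project every agent's original features into the target role's semantic space."""
    w_role, b = ROLE_PARAMS[role]
    return {agent: project_to_role(h, w_role, b) for agent, h in agent_features.items()}


features = {"clause_parsing_agent": [2.0, 2.0], "rag_retrieval_agent": [0.0, 0.0]}
print(project_agents(features, "resident"))
```

In a trained system `W_role` and `b` would be learned per role and then adjusted from audience preference data, as the paragraph above describes.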
[0069] For example, similarity matching of the target question against the knowledge base and the material index library can retrieve the Top-N related knowledge nodes.
[0070] For example, the text constraint vector "greening rate ≥ 35%" in the target question can be used as a query operator to focus on the relevant regions of the spatial feature matrix, in order to verify whether the proportion of green area in the planning drawing meets the requirement.
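Cross-modal attention is sketched here as scaled dot-product attention in which the text constraint vector is the query and the rows of the spatial feature matrix are the keys. Treating regions whose weight exceeds the uniform weight 1/n as "focused" and summing their area ratios is a simplification chosen for illustration; `cross_modal_attention`, `check_ratio`, and the one-hot key layout are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query, keys, values):
    """Scaled dot-product attention: the text constraint vector (query) attends over
    the rows of the spatial feature matrix (keys); returns the weights and the
    weighted sum of the value rows."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    attended = [sum(w * row[j] for w, row in zip(weights, values))
                for j in range(len(values[0]))]
    return weights, attended


def check_ratio(query, keys, area_ratios, threshold):
    """Sum the area ratios of regions attended above uniform weight; compare to threshold."""
    weights, _ = cross_modal_attention(query, keys, [[r] for r in area_ratios])
    focus = [i for i, w in enumerate(weights) if w > 1.0 / len(weights)]
    share = sum(area_ratios[i] for i in focus)
    return share, share >= threshold


# Keys: hypothetical one-hot tile types [green, building, road] per region.
keys = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]
area_ratios = [0.20, 0.50, 0.18]
query = [1.0, 0.0, 0.0]   # encodes "greening rate >= 35%" as a focus on green regions
print(check_ratio(query, keys, area_ratios, threshold=0.35))
```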
[0071] For example, when each agent in the multi-agent group generates response data, parallel collaboration among the agents can be triggered, including RAG retrieval agent (responsible for knowledge retrieval), clause parsing agent (responsible for clause interpretation), planning and design assistance agent (responsible for professional solution analysis), image generation agent (responsible for visualization chart output), and knowledge conflict resolution agent (responsible for multi-source information consistency verification), thereby generating different response data.
[0072] For example, the role adaptation matrix may include exclusive mapping parameters for four types of roles: residents, property management, planners, and managers.
[0073] For example, when the target semantic features of each agent are adapted according to the target role, the resident version reduces the proportion of professional terms (for example, explaining "plot ratio" colloquially as how crowded the buildings are) and adds everyday cases; the planner version retains professional parameters and supplements the calculation basis; and the manager version focuses on core conclusions and data summaries.
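Role-specific wording, together with the feedback-driven terminology update described in [0068], can be sketched as a per-role term table. The class `RoleTerminology`, its complaint threshold, and literal string replacement are assumptions; the patent adjusts learned mapping parameters rather than performing plain substitution.

```python
from collections import Counter


class RoleTerminology:
    """Per-role term table that can grow from 'difficult terminology' feedback."""

    def __init__(self, terms, threshold=3):
        self.terms = dict(terms)      # professional term -> colloquial wording
        self.threshold = threshold    # assumed number of reports before a rewrite is added
        self.reports = Counter()

    def report_difficult(self, term, alternative):
        """Count one complaint; once the threshold is reached, register the alternative."""
        self.reports[term] += 1
        if self.reports[term] >= self.threshold:
            self.terms.setdefault(term, alternative)

    def adapt(self, text):
        """Replace professional terms with the role's colloquial wording."""
        for term, plain in self.terms.items():
            text = text.replace(term, plain)
        return text


resident = RoleTerminology({"plot ratio": "building crowding level (plot ratio)"})
print(resident.adapt("The plot ratio of Block B is 2.5."))
for _ in range(3):
    resident.report_difficult("building density", "share of land covered by buildings")
print(resident.adapt("The building density here is 28%."))
```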
[0074] For example, the diverse output data may include: natural language responses (such as text), knowledge citations (such as excerpts from guidance documents and source annotations), visual evidence (such as partial screenshots and annotations of planning maps and data charts), and actionable suggestions (such as requiring property management to check elevator malfunctions within 3 working days).
[0075] For example, consistency checks can be performed on the diverse output data of each agent. If conflicts exist (such as inconsistencies between guidance text and on-site feedback), the optimal chain of evidence is selected based on the confidence weight model (such as on-site feedback with high confidence taking precedence over guidance text). Combining visual evidence with actionable suggestions improves the credibility and practicality of the answers and reduces information misunderstandings.
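Confidence-based conflict elimination can be sketched as: group agent outputs by the claim they make, keep the highest-confidence value for each claim, and record the discarded alternatives. The output dictionary layout, the example claims, and the functions `resolve_conflicts` and `evidence_chain` are hypothetical; the patent's confidence weight model is not specified.

```python
from collections import defaultdict


def resolve_conflicts(outputs):
    """Keep, for each claim, the value asserted by the highest-confidence agent output;
    return the kept outputs and the discarded conflicting ones."""
    by_claim = defaultdict(list)
    for out in outputs:
        by_claim[out["claim"]].append(out)
    kept, discarded = {}, {}
    for claim, items in by_claim.items():
        best = max(items, key=lambda o: o["confidence"])
        kept[claim] = best
        losers = [o for o in items if o["value"] != best["value"]]
        if losers:
            discarded[claim] = losers
    return kept, discarded


def evidence_chain(kept):
    """List (claim, value, source) tuples for the surviving evidence."""
    return [(claim, o["value"], o["source"]) for claim, o in sorted(kept.items())]


# Hypothetical agent outputs: guidance text vs. higher-confidence on-site feedback.
outputs = [
    {"agent": "clause_parsing", "claim": "west_gate_access_control", "value": "installed",
     "confidence": 0.6, "source": "guidance text"},
    {"agent": "rag_retrieval", "claim": "west_gate_access_control", "value": "not working",
     "confidence": 0.9, "source": "on-site resident feedback"},
    {"agent": "rag_retrieval", "claim": "greening_rate", "value": "32%",
     "confidence": 0.8, "source": "planning drawing p.3"},
]
kept, discarded = resolve_conflicts(outputs)
print(evidence_chain(kept))
```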
[0076] In the above embodiments, the combination of multi-agent collaboration and cross-modal attention mechanism improves the accuracy of answering complex questions and enables two-way verification of text and images; semantic mapping solves the problem of mismatch between professional knowledge and audience comprehension ability, ensuring that different roles can obtain answers that they can understand and use.
[0077] In this embodiment, after the target response data including the answer data and the evidence chain is obtained, the method further includes: collecting feedback data on the target response data through different channels; converting the feedback data of the target response data into computable feedback feature vectors; performing semantic clustering on the feedback feature vectors to obtain feedback types; generating status monitoring data based on the feedback data of the target response data; and generating a feedback summary report based on each feedback type and the status monitoring data.
[0078] For example, feedback data collected from different channels may include: (1) Instant feedback: After the answer is output, a pop-up window prompts the user to rate (1-5 stars), select the feedback type (satisfied, difficult terminology, inaccurate information, suggestions for supplementation), and fill in a text comment, etc.; (2) Indirect feedback: Collect user behavior data (such as whether they click to view visual evidence, the duration of time spent on the answer page, and whether they ask the same question again); (3) Batch feedback: Regularly collect residents' discussion content (such as community forums, owner group chat records), property management results (such as complaint handling progress, planning adjustment implementation status), and public opinion data (community-related information captured by third-party public opinion platforms).
[0079] Multi-channel feedback collection can comprehensively capture user needs and problems, overcoming the limitations of relying solely on proactive ratings and having a single feedback dimension.
[0080] For example, the feedback feature vector may include a satisfaction score, a feedback type label, semantic keywords (such as "plot ratio" flagged as a difficult term), a timestamp, and the user role.
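One way to turn the fields listed in [0080] into a computable vector is to concatenate a scaled rating, the feedback age, one-hot labels, and keyword indicators. The encoding, field names, and label spellings below are assumptions, not a format given in the text.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List

# Label sets follow the examples in the text; the encoding itself is hypothetical.
FEEDBACK_TYPES = ["satisfied", "difficult_terminology", "inaccurate_information", "supplement_suggestion"]
ROLES = ["resident", "property_management", "planner", "manager"]


@dataclass
class Feedback:
    rating: int          # 1-5 stars
    feedback_type: str
    keywords: List[str]
    timestamp: datetime
    role: str


def one_hot(value: str, vocab: List[str]) -> List[float]:
    return [1.0 if v == value else 0.0 for v in vocab]


def to_vector(fb: Feedback, keyword_vocab: List[str], now: datetime) -> List[float]:
    score = (fb.rating - 1) / 4  # map 1-5 stars onto 0.0-1.0
    age_days = (now - fb.timestamp).total_seconds() / 86400
    keywords = [1.0 if k in fb.keywords else 0.0 for k in keyword_vocab]
    return ([score, age_days]
            + one_hot(fb.feedback_type, FEEDBACK_TYPES)
            + one_hot(fb.role, ROLES)
            + keywords)


fb = Feedback(2, "difficult_terminology", ["plot ratio"], datetime(2026, 1, 8), "resident")
vec = to_vector(fb, ["plot ratio", "charging pile"], now=datetime(2026, 1, 10))
```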
[0081] Feedback feature vectorization can provide standardized input for subsequent dynamic learning, supporting the system's self-evolution.
[0082] For example, the K-means clustering algorithm can be used to perform semantic clustering on the feedback feature vectors to summarize high-frequency demands (such as "adding charging piles"), conflicting opinions (such as "some residents support building parking lots, while others oppose it"), and potential problems (such as "the guidance is not clear").
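The clustering in [0082] names K-means. The sketch below is plain Lloyd's algorithm with a deterministic farthest-first initialisation (chosen here for reproducibility; the text does not specify an initialisation), run on toy 2-D vectors that stand in for feedback embeddings.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]


def _dist2(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _mean(points: List[Point]) -> List[float]:
    return [sum(coords) / len(points) for coords in zip(*points)]


def kmeans(points: List[Point], k: int, iters: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Lloyd's algorithm with deterministic farthest-first initialisation."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(_dist2(p, c) for c in centroids))
        centroids.append(list(farthest))

    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: _dist2(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = _mean(members)
    return labels, centroids


# Toy 2-D feedback embeddings: three near the origin, three far away.
feedback_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centroids = kmeans(feedback_vectors, k=2)
```

Each resulting cluster would then be inspected (or labelled by a model) to name it as a high-frequency demand, a conflicting opinion, or a potential problem.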
[0083] For example, the system can monitor in real time the trend of Q&A satisfaction (such as triggering an alert if the satisfaction score stays below 3 for 3 consecutive days), changes in public opinion (such as a sudden surge in attention to a controversial topic in a certain plan), and knowledge gaps (such as multiple users asking the same question without receiving an accurate answer), and generate a feedback summary report that includes the top 5 most frequent feedback items, a list of conflicting opinions, knowledge gap annotations, and execution result statistics.
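The example alert rule in [0083] (satisfaction below 3 for 3 consecutive days) can be checked over daily average ratings. This sketch assumes daily means are the monitored quantity and that a day without ratings breaks the streak; both are assumptions.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Iterable, List, Optional, Tuple


def daily_means(ratings: Iterable[Tuple[date, float]]) -> Dict[date, float]:
    """Average the star ratings received on each calendar day."""
    buckets: Dict[date, List[float]] = defaultdict(list)
    for day, score in ratings:
        buckets[day].append(score)
    return {day: sum(v) / len(v) for day, v in buckets.items()}


def satisfaction_alert(means: Dict[date, float], threshold: float = 3.0, run_length: int = 3) -> bool:
    """True if `run_length` consecutive calendar days all have a mean below `threshold`.

    A day with no ratings breaks the run (an assumption; the text does not say).
    """
    run = 0
    prev: Optional[date] = None
    for day in sorted(means):
        if means[day] < threshold:
            consecutive = prev is not None and day - prev == timedelta(days=1)
            run = run + 1 if (consecutive and run > 0) else 1
        else:
            run = 0
        prev = day
        if run >= run_length:
            return True
    return False
```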
[0084] Semantic clustering and state monitoring enable the precise extraction of feedback information, which can help to quickly locate high-frequency demands and knowledge gaps, providing a clear direction for optimization.
[0085] In this embodiment, after the feedback summary report is generated based on each feedback type and the status monitoring data, the method further includes: retrieving all generated feedback summary reports at preset time intervals; for each item of target feedback data in the retrieved reports, obtaining its source authority weight, feedback frequency, and time decay factor; obtaining a first weight corresponding to the source authority weight, a second weight corresponding to the feedback frequency, and a third weight corresponding to the time decay factor; computing the target confidence level of the target feedback data by weighting its source authority weight, feedback frequency, and time decay factor with the first, second, and third weights, respectively; taking target feedback data whose target confidence level is greater than a confidence threshold as candidate feedback data; calculating the semantic consistency between the candidate feedback data and the knowledge base to detect whether they conflict; calculating the confidence level of the knowledge base; and, when a conflict is detected between the candidate feedback data and the knowledge base, calculating the difference between the confidence level of the knowledge base and the target confidence level of the candidate feedback data. When the difference is less than a preset threshold, administrator decision data is pushed; when the difference is greater than or equal to the preset threshold, the vector representation of the corresponding node in the knowledge base and the material index library are incrementally updated according to the candidate feedback data. In this process, the answer templates for each role are also optimized at the preset time intervals.
For example, the resident version of the answer adds a "simple explanation + on-site case" module, while the planner version adds a "data source + calculation process" module to meet the needs of different roles.
[0086] For example, the confidence level can be calculated as follows:

$$C = \alpha \cdot P_{\text{source}} + \beta \cdot F_{\text{freq}} - \gamma \cdot \Delta T$$

where $C$ is the confidence level; $\alpha$, $\beta$, and $\gamma$ are the first, second, and third weights, respectively; $P_{\text{source}}$ is the source authority weight (e.g., 0.6 for individual resident feedback, 0.8 for property management feedback, 1.0 for planning expert feedback, and 0.7 for batch public opinion); $F_{\text{freq}}$ is the feedback frequency (1.0 for a single feedback item and 2.0 for a cluster of more than 10 feedback items); and $\Delta T$ is the time decay factor (1.0 for feedback within the last 7 days, 0.5 for feedback from 1 to 3 months ago, and 0.2 for feedback older than 3 months).
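The formula in [0086] translates directly into code. The α, β, γ values are hypothetical, and the source leaves some ranges open (2 to 10 feedback items, feedback aged 8 to 30 days), so the bucket boundaries below are labelled assumptions. The function follows the formula as written, including the subtraction of γ·ΔT.

```python
from typing import Dict

# Source weights are the example values from [0086]; ALPHA, BETA, GAMMA are hypothetical.
P_SOURCE: Dict[str, float] = {
    "resident": 0.6,
    "property_management": 0.8,
    "planning_expert": 1.0,
    "batch_public_opinion": 0.7,
}
ALPHA, BETA, GAMMA = 0.5, 0.3, 0.2


def feedback_frequency(count: int) -> float:
    # Text: 1.0 for a single item, 2.0 for a cluster of more than 10.
    # Treating 2-10 items as 1.0 is an assumption.
    return 2.0 if count > 10 else 1.0


def time_decay(age_days: float) -> float:
    # Text: 1.0 (last 7 days), 0.5 (1-3 months), 0.2 (over 3 months).
    # Using 0.5 for 8-90 days and 30-day months are assumptions.
    if age_days <= 7:
        return 1.0
    if age_days <= 90:
        return 0.5
    return 0.2


def confidence(source: str, count: int, age_days: float,
               alpha: float = ALPHA, beta: float = BETA, gamma: float = GAMMA) -> float:
    """C = alpha * P_source + beta * F_freq - gamma * dT, as written in [0086]."""
    return (alpha * P_SOURCE[source]
            + beta * feedback_frequency(count)
            - gamma * time_decay(age_days))
```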
[0087] For example, if the candidate feedback data is "the actual green space ratio is 28%", while the knowledge base states "the actual green space ratio is 38%", then a conflict exists between the candidate feedback data and the knowledge base. The preset threshold can be 0.3. If the difference between the confidence level of the knowledge base and the confidence level of the new feedback is <0.3, an administrator decision is pushed (e.g., "Please check the green space ratio statistics"). If the confidence level of the new feedback is much higher than the confidence level of the knowledge base (i.e., the difference is ≥0.3), then the knowledge base is automatically updated (e.g., the green space ratio is corrected to 28%, and on-site measured photos are added as evidence).
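The routing in [0085] and [0087] can be summarised as a small decision function. The candidate threshold of 0.5 is hypothetical; 0.3 is the example value from [0087]. Conflict detection is passed in as a flag, since the semantic-consistency computation itself is not specified here.

```python
def handle_feedback(c_feedback: float, c_kb: float, conflict: bool,
                    confidence_threshold: float = 0.5, diff_threshold: float = 0.3) -> str:
    """Route one feedback item following [0085] and [0087].

    confidence_threshold is a hypothetical value; 0.3 is the example from [0087].
    The difference is taken as feedback minus knowledge base, matching the
    reading in [0087] that a much higher feedback confidence triggers an update.
    """
    if c_feedback <= confidence_threshold:
        return "not_candidate"          # filtered out before conflict detection
    if not conflict:
        return "no_conflict"            # not specified in the text; assumed no action
    if c_feedback - c_kb < diff_threshold:
        return "push_admin_decision"    # e.g. "Please check the green space ratio statistics"
    return "update_knowledge_base"      # incremental update of node vectors and material index
```

In the green space example, a high-confidence on-site measurement of 28% against a lower-confidence knowledge-base entry of 38% would fall into the last branch.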
[0088] This embodiment evaluates feedback accurately through confidence level calculation, avoids erroneous updates triggered by a single feedback item, and improves the reliability of knowledge updates; the closed-loop evolution mechanism reduces manual maintenance costs and enables the system to adjust dynamically as the community changes, addressing the limitation that traditional systems remain static once deployed.
[0089] As can be seen from the above technical solutions, this invention can construct a knowledge base for question-and-answer understanding and reasoning, and a material index library for evidence citation and visualization based on full-volume multimodal data, solving the problems of scattered cross-modal information and inability to achieve collaborative understanding. A hybrid index system is established based on the knowledge base and material index library, improving the accuracy and efficiency of knowledge retrieval and supporting bidirectional verification of semantic queries and visual tracing. The target user's target role is identified based on the target question, and a target response strategy is generated based on the target role, providing a basis for generating differentiated answers and solving the problem of misunderstandings caused by role differences. Based on the target response strategy, multiple agents are invoked to retrieve results in the hybrid index system, and the retrieval results are adapted and adjusted based on the role semantic space mapping mechanism and knowledge conflict resolution mechanism to obtain target response data including answer data and evidence chains, ensuring that different roles can obtain understandable answers, improving answer accuracy, and reducing misleading information.
[0090] Figure 2 is a functional block diagram of a preferred embodiment of the question-answering device based on community multimodal information of the present invention. The question-answering device 11 based on community multimodal information includes a construction unit 110, an establishment unit 111, a generation unit 112, and a response unit 113. A module / unit referred to in this invention is a series of computer program segments that are stored in memory and can be executed by a processor to perform a fixed function. In this embodiment, the functions of each module / unit are described in detail below.
[0091] The construction unit 110 is used to collect the full multimodal data of the target community and to construct, based on the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization. The establishment unit 111 is used to establish a hybrid indexing system based on the knowledge base and the material index library. The generation unit 112 is used to, when a target question initiated by the target user is received, identify the target user's target role based on the target question and generate a target response strategy based on the target role. The response unit 113 is used to invoke multiple agents to perform retrieval in the hybrid indexing system based on the target response strategy to obtain retrieval results, and to adapt and adjust the retrieval results based on the role semantic space mapping mechanism and the knowledge conflict resolution mechanism to obtain target response data including answer data and evidence chains.
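The division into units 110 to 113 can be pictured as a composition of four callables. This is a hypothetical wiring for illustration only; `QADevice` and its method names do not come from the patent, and each callable would wrap the corresponding method step.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class QADevice:
    """Hypothetical wiring of units 110-113; each unit is injected as a callable."""
    construct: Callable[[Any], Dict[str, Any]]                     # unit 110: data -> knowledge base + material index library
    establish: Callable[[Dict[str, Any]], Any]                     # unit 111: stores -> hybrid index
    generate: Callable[[str, str], Dict[str, Any]]                 # unit 112: (user, question) -> response strategy
    respond: Callable[[Any, Dict[str, Any], str], Dict[str, Any]]  # unit 113: retrieval + adaptation
    index: Any = field(default=None, init=False)

    def setup(self, community_data: Any) -> None:
        self.index = self.establish(self.construct(community_data))

    def ask(self, user_id: str, question: str) -> Dict[str, Any]:
        if self.index is None:
            raise RuntimeError("call setup() before ask()")
        strategy = self.generate(user_id, question)
        return self.respond(self.index, strategy, question)


# Usage with trivial stand-in units.
device = QADevice(
    construct=lambda data: {"knowledge_base": data, "material_index": []},
    establish=lambda stores: stores,
    generate=lambda user, q: {"role": "resident"},
    respond=lambda idx, strategy, q: {"answer": q, "role": strategy["role"],
                                      "evidence": idx["knowledge_base"]},
)
device.setup(["guidance document excerpt"])
result = device.ask("user-1", "When will the elevator be fixed?")
```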
[0092] Because units 110 to 113 carry out the steps of the method described above, the question-answering device achieves the same technical effects as those set out in paragraph [0089].
[0093] Figure 3 is a schematic structural diagram of a computer device that implements the question-answering method based on community multimodal information according to a preferred embodiment of the present invention.
[0094] The computer device 1 may include a memory 12, a processor 13, and a bus (the arrow in the figure represents the bus), and may also include a computer program stored in the memory 12 and executable on the processor 13, such as a question-and-answer program based on community multimodal information.
[0095] Those skilled in the art will understand that the schematic diagram is merely an example of computer device 1 and does not constitute a limitation on computer device 1. Computer device 1 may have either a bus topology or a star topology. Computer device 1 may also include more or fewer hardware or software components than shown in the diagram, or a different arrangement of components. For example, computer device 1 may also include input / output devices, network access devices, etc.
[0096] It should be noted that the computer device 1 described is merely an example. Other existing or future electronic products that are adaptable to this invention should also be included within the scope of protection of this invention and are incorporated herein by reference.
[0097] The memory 12 includes at least one type of readable storage medium, such as flash memory, portable hard drive, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 12 can be an internal storage unit of the computer device 1, such as a hard disk of the computer device 1. In other embodiments, the memory 12 can be an external storage device of the computer device 1, such as a plug-in portable hard drive, smart media card (SMC), secure digital card (SD), flash card, etc., equipped on the computer device 1. Furthermore, the memory 12 can include both internal and external storage units of the computer device 1. The memory 12 can be used not only to store application software and various types of data installed on the computer device 1, such as the code of a question-and-answer program based on community multimodal information, but also to temporarily store data that has been output or will be output.
[0098] In some embodiments, the processor 13 may be composed of integrated circuits, such as a single packaged integrated circuit or multiple integrated circuits packaged with the same or different functions, including combinations of one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and various control chips. The processor 13 is the control unit of the computer device 1, connecting various components of the computer device 1 via various interfaces and lines. It executes programs or modules stored in the memory 12 (e.g., executing a question-and-answer program based on community multimodal information) and calls data stored in the memory 12 to perform various functions of the computer device 1 and process data.
[0099] The processor 13 executes the operating system of the computer device 1 and the various installed applications. By executing the applications, the processor 13 implements the steps in the embodiments of the question-answering method based on community multimodal information described above, for example, the steps shown in Figure 1.
[0100] For example, the computer program may be divided into one or more modules / units, which are stored in the memory 12 and executed by the processor 13 to complete the present invention. The one or more modules / units may be a series of computer-readable instruction segments capable of performing specific functions, which describe the execution process of the computer program in the computer device 1. For example, the computer program may be divided into a construction unit 110, an establishment unit 111, a generation unit 112, and a response unit 113.
[0101] The integrated unit implemented as a software functional module described above can be stored in a computer-readable storage medium. This software functional module, stored in a storage medium, includes several instructions to cause a computer device (which may be a personal computer, a computer device, or a network device, etc.) or processor to execute portions of the question-answering method based on community multimodal information described in the various embodiments of this invention.
[0102] If the modules / units integrated in the computer device 1 are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present invention can also be implemented by a computer program instructing related hardware devices. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above.
[0103] The computer program includes computer program code, which may be in the form of source code, object code, executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, recording media, USB flash drive, portable hard drive, magnetic disk, optical disk, computer memory, read-only memory (ROM), random access memory, etc.
[0104] Furthermore, the computer-readable storage medium may primarily include a stored program area and a stored data area, wherein the stored program area may store the operating system, an application program required for at least one function, etc.; and the stored data area may store data created based on the use of blockchain nodes, etc.
[0105] The blockchain referred to in this invention is a novel application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Essentially, a blockchain is a decentralized database, a chain of data blocks linked together using cryptographic methods. Each data block contains information about a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
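The hash linking described in [0105] can be shown in a few lines: each block stores the hash of its predecessor, so altering an earlier block breaks verification of the next one. This sketch covers only that linking; it has no consensus mechanism or peer-to-peer layer, and the block layout is hypothetical.

```python
import hashlib
import json
from typing import Any, Dict, List

Block = Dict[str, Any]


def block_hash(block: Block) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Block], transactions: List[str]) -> None:
    prev_hash = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"index": len(chain), "prev_hash": prev_hash, "transactions": transactions})


def verify(chain: List[Block]) -> bool:
    """Each block must store the hash of the block before it."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))


chain: List[Block] = []
append_block(chain, ["feedback report 2026-01"])
append_block(chain, ["knowledge base update: green space ratio"])
append_block(chain, ["answer template revision"])
```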
[0106] The bus can be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is represented by only one straight line in Figure 3, but this does not mean that there is only one bus or one type of bus. The bus is configured to enable communication between the memory 12, the at least one processor 13, and other components.
[0107] Although not shown, the computer device 1 may also include a power supply (such as a battery) to power various components. Preferably, the power supply can be logically connected to the at least one processor 13 through a power management device, thereby enabling functions such as charging management, discharging management, and power consumption management. The power supply may also include one or more DC or AC power supplies, recharging devices, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components. The computer device 1 may also include various sensors, Bluetooth modules, Wi-Fi modules, etc., which will not be described in detail here.
[0108] Furthermore, the computer device 1 may also include a network interface. Optionally, the network interface may include a wired interface and / or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which is typically used to establish communication connections between the computer device 1 and other computer devices.
[0109] Optionally, the computer device 1 may further include a user interface, which may be a display, an input unit (such as a keyboard), and optionally, a standard wired interface or a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, or an OLED (Organic Light-Emitting Diode) touchscreen, etc. The display may also be appropriately referred to as a screen or display unit, used to display information processed in the computer device 1 and to display a visual user interface.
[0110] It should be understood that the embodiments described are for illustrative purposes only, and the scope of the patent application is not limited by this structure.
[0111] It will be understood by those skilled in the art that the structure shown in Figure 3 does not constitute a limitation on the computer device 1, which may include fewer or more components than shown, combine certain components, or have a different arrangement of components.
[0112] With reference to Figure 1, the memory 12 in the computer device 1 stores multiple instructions for implementing the question-answering method based on community multimodal information, and the processor 13 can execute the multiple instructions to: collect full multimodal data from the target community, and construct, based on the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization; establish a hybrid indexing system based on the knowledge base and the material index library; when a target question is received from a target user, identify the target user's target role based on the target question and generate a target response strategy based on the target role; and invoke, based on the target response strategy, multiple agents to perform retrieval in the hybrid indexing system to obtain retrieval results, and adapt and adjust the retrieval results based on the role semantic space mapping mechanism and the knowledge conflict resolution mechanism to obtain target response data including answer data and evidence chains.
[0113] For the specific implementation of the above instructions by the processor 13, refer to the description of the relevant steps in the embodiment corresponding to Figure 1; details are not repeated here.
[0114] It should be noted that all the data involved in this case was legally obtained.
[0115] If any AI models, software tools, or components not belonging to this company appear in the embodiments of this invention, they are merely illustrative examples and do not represent actual use. All user personal information involved in the embodiments of this invention has been obtained through legal and compliant means, either with the knowledge and consent of the persons concerned or with full authorization from all parties. The collection, storage, use, processing, transmission, provision, and disclosure of the information, data, and signals involved all comply with relevant laws and regulations and do not violate public order and good morals.
[0116] In the several embodiments provided by this invention, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and other division methods may be used in actual implementation.
[0117] This invention can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This invention can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0118] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0119] Furthermore, the functional modules in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or in the form of hardware plus software functional modules.
[0120] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the present invention can be implemented in other specific forms without departing from the spirit or essential characteristics of the present invention.
[0121] Therefore, the embodiments should be considered exemplary and non-limiting in all respects, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be embraced within the invention. No reference signs in the claims should be construed as limiting the scope of the claims.
[0122] Furthermore, it is clear that the word "comprising" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices described in this invention can also be implemented by a single unit or device through software or hardware. Terms such as "first," "second," etc., are used to indicate names and do not indicate any specific order.
[0123] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.
Claims
1. A question-answering method based on community multimodal information, characterized in that the question-answering method based on community multimodal information includes: collecting full multimodal data from a target community, and constructing, based on the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization; establishing a hybrid indexing system based on the knowledge base and the material index library; when a target question is received from a target user, identifying a target role of the target user based on the target question, and generating a target response strategy based on the target role; and invoking, based on the target response strategy, multiple agents to perform retrieval in the hybrid indexing system to obtain retrieval results, and adapting and adjusting the retrieval results based on a role semantic space mapping mechanism and a knowledge conflict resolution mechanism to obtain target response data including answer data and evidence chains.
2. The question-answering method based on community multimodal information according to claim 1, characterized in that constructing, based on the full multimodal data, the knowledge base for question-answering understanding and reasoning and the material index library for evidence citation and visualization includes: invoking a large language model to perform semantic parsing on the full multimodal data to obtain conceptual semantics, functional regions, and indicator items, and converting the conceptual semantics, functional regions, and indicator items into a specified format to obtain first data; extracting text information from the image data in the full multimodal data using optical character recognition, locating key regions of the image data using a bounding-box detection algorithm, and segmenting the image data into image fragments according to semantic logic; extracting a spatial topological feature vector of each image fragment using a visual encoder, and generating a visual semantic label for each image fragment; extracting structured knowledge and visual index information from the first data, the text information, the key regions, the image fragments, the spatial topological feature vectors, and the visual semantic labels; and writing the structured knowledge into the knowledge base and the visual index information into the material index library.
3. The question-answering method based on community multimodal information according to claim 2, characterized in that establishing the hybrid indexing system based on the knowledge base and the material index library includes: vectorizing the data in the knowledge base into text and semantic vectors, including converting text-based knowledge in the knowledge base into high-dimensional semantic vectors and introducing a logic encoder to convert the structured indicators in the knowledge base into operator-based high-dimensional vectors; performing visual vectorization on the data in the material index library to obtain visual vectors, including obtaining the visual tag and original-file traceability information of each image fragment from the material index library, generating an image embedding vector for each image fragment through a visual encoder, and associating each image embedding vector with its visual tag and original-file traceability information, as well as obtaining spatial entities and the local spatial feature vector of each spatial entity from the material index library, constructing a spatial feature matrix, and establishing a mapping association between the visual vectors and semantic tags; constructing, based on the text and semantic vectors and the visual vectors, a hybrid index system including a semantic vector index, a visual vector index, and a spatial topology index; and updating the hybrid index system in real time based on newly added multimodal data.
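A minimal in-memory picture of the three indexes in claim 3, assuming brute-force cosine search and an adjacency map for spatial topology. Class names, keys, and metadata fields (including the `source` traceability string) are hypothetical; upserting an entry stands in for the real-time update step.

```python
import math
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class VectorIndex:
    """Brute-force cosine index; a production system would likely use an ANN library."""
    entries: Dict[str, Tuple[Vector, Dict[str, Any]]] = field(default_factory=dict)

    def upsert(self, key: str, vector: Vector, meta: Dict[str, Any]) -> None:
        self.entries[key] = (vector, meta)  # insert or overwrite, so new data is searchable at once

    def search(self, query: Vector, top_k: int = 3) -> List[Tuple[float, str, Dict[str, Any]]]:
        scored = [(cosine(query, vec), key, meta) for key, (vec, meta) in self.entries.items()]
        scored.sort(key=lambda t: t[0], reverse=True)
        return scored[:top_k]


@dataclass
class HybridIndex:
    semantic: VectorIndex = field(default_factory=VectorIndex)
    visual: VectorIndex = field(default_factory=VectorIndex)
    spatial: Dict[str, Set[str]] = field(default_factory=dict)  # entity -> adjacent entities

    def link(self, a: str, b: str) -> None:
        """Record an undirected adjacency between two spatial entities."""
        self.spatial.setdefault(a, set()).add(b)
        self.spatial.setdefault(b, set()).add(a)

    def neighbors(self, entity: str) -> Set[str]:
        return self.spatial.get(entity, set())


index = HybridIndex()
index.semantic.upsert("kb:green_space", [1.0, 0.0, 0.0], {"text": "green space ratio 38%"})
index.semantic.upsert("kb:parking", [0.0, 1.0, 0.0], {"text": "parking spaces 120"})
index.visual.upsert("img:plan_p3", [0.2, 0.1, 0.9],
                    {"tag": "green belt", "source": "site_plan.pdf, page 3"})
index.link("building_5", "green_belt_A")
```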
4. The question-answering method based on community multimodal information according to claim 1, characterized in that identifying the target user's target role based on the target question and generating the target response strategy based on the target role includes: obtaining the user login identity tag of the target user, and determining a candidate role based on the user login identity tag; extracting the key entities and constraints in the target question, and verifying the candidate role based on the key entities and constraints; when the candidate role passes the verification, determining the candidate role as the target role; identifying the target question type using an intent recognition model; retrieving priority rules configured according to question type; and generating the target response strategy based on the target role, the target question type, and the priority rules.
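The role identification and strategy generation steps of claim 4 can be sketched as follows. Every table here is hypothetical: the login-tag mapping, the verification rule (some entities may be queried only by certain roles), the keyword classifier standing in for the intent recognition model, and the priority rules. Entity extraction is assumed to happen upstream, so the entities are passed in as a set.

```python
from typing import Dict, List, Optional, Set

# Hypothetical mapping from login identity tags to candidate roles.
LOGIN_TAG_TO_ROLE: Dict[str, str] = {
    "owner": "resident",
    "tenant": "resident",
    "pm_staff": "property_management",
    "planner": "planner",
    "official": "manager",
}

# Hypothetical verification rule: some entities may only be queried by certain roles.
RESTRICTED_ENTITIES: Dict[str, Set[str]] = {
    "work_order": {"property_management", "manager"},
    "approval_record": {"manager"},
}

# Hypothetical priority rules per question type (which evidence to rank first).
PRIORITY_RULES: Dict[str, List[str]] = {
    "complaint": ["on_site_feedback", "property_records", "guidance_text"],
    "policy": ["guidance_text", "planning_documents", "on_site_feedback"],
}


def verify_role(role: str, entities: Set[str]) -> bool:
    """Pass unless the question mentions an entity this role may not query."""
    return all(role in RESTRICTED_ENTITIES[e] for e in entities if e in RESTRICTED_ENTITIES)


def classify_intent(question: str) -> str:
    """Keyword stand-in for the intent recognition model."""
    complaint_words = ("broken", "malfunction", "complain", "noise")
    return "complaint" if any(w in question.lower() for w in complaint_words) else "policy"


def build_strategy(login_tag: str, question: str, entities: Set[str]) -> Optional[Dict[str, object]]:
    candidate = LOGIN_TAG_TO_ROLE.get(login_tag)
    if candidate is None or not verify_role(candidate, entities):
        return None  # the text does not say what happens when verification fails
    question_type = classify_intent(question)
    return {
        "role": candidate,
        "question_type": question_type,
        "priority": PRIORITY_RULES[question_type],
    }
```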
5. The question-answering method based on community multimodal information as described in claim 3, characterized in that, Based on the target response strategy, multiple agents are invoked to perform searches in the hybrid indexing system to obtain search results. These search results are then adapted and adjusted based on a role semantic space mapping mechanism and a knowledge conflict resolution mechanism to obtain target response data including solution data and evidence chains. A retrieval enhancement generation technique is employed, which uses the target question to perform similarity matching in the knowledge base to obtain candidate text, and uses the target question to perform similarity matching in the material index to obtain candidate images. The candidate text and the candidate images are then combined to obtain the retrieval result. When the target question is a high-complexity question, a cross-modal attention mechanism is introduced to use the text constraint vector in the target question as a query operator to focus on relevant regions in the spatial feature matrix, thereby verifying the consistency between the candidate text and the candidate image. When the target problem is a high-complexity problem, each agent in the multi-agent group is invoked to generate response data. 
constructing a learnable role adaptation matrix containing mapping parameters for the different roles; obtaining the original semantic features of the response data generated by each agent; projecting, according to the role adaptation matrix, the original semantic features of each agent into the semantic space of the target role using the formula $H_{\mathrm{target}} = \sigma(W_{\mathrm{role}} \cdot H_{\mathrm{pro}} + b)$, where $H_{\mathrm{target}}$ denotes the target semantic features of each agent in the semantic space of the target role, $\sigma$ denotes an activation function, $b$ denotes a bias term, $W_{\mathrm{role}}$ denotes the role adaptation matrix, and $H_{\mathrm{pro}}$ denotes the original semantic features of each agent; adapting and adjusting the target semantic features of each agent based on the target role to obtain diversified output data for each agent; identifying conflicting data in the diversified output data of the agents; obtaining the confidence level of each agent and eliminating the conflicting data based on those confidence levels; generating the optimal evidence chain and answer data from the conflict-free data as the target response data; and, when the target question is a low-complexity question, invoking the lightweight machine learning model in the multi-agent system to generate the target response data; wherein the mapping parameters of the different roles are dynamically adjusted based on audience preference data.
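The retrieval and consistency-check steps of claim 5 can be sketched with NumPy as follows. This assumes, without support in the claim, that the question, knowledge-base passages, and material-library images have already been embedded into one shared vector space, and that the cross-modal attention is ordinary scaled dot-product attention with a single query. The function names and the 0.7 consistency cut-off are hypothetical.

```python
import numpy as np

def cosine_top_k(query: np.ndarray, matrix: np.ndarray, k: int):
    """Indices and cosine scores of the k rows of `matrix` most similar to `query`."""
    q = query / np.linalg.norm(query)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    idx = np.argsort(-scores)[:k]
    return idx, scores[idx]

def retrieve(q_vec, text_vecs, image_vecs, k=2):
    """Similarity matching in both stores; the combined candidates form the retrieval result."""
    t_idx, t_scores = cosine_top_k(q_vec, text_vecs, k)
    i_idx, i_scores = cosine_top_k(q_vec, image_vecs, k)
    return {"text": list(zip(t_idx.tolist(), t_scores.tolist())),
            "image": list(zip(i_idx.tolist(), i_scores.tolist()))}

def cross_modal_consistency(text_constraint, spatial_features):
    """Attend over image regions using the text constraint vector as the query.

    spatial_features: (num_regions, d) region features of one candidate image.
    Returns the cosine between the attended image feature and the text vector,
    together with the attention weights over regions.
    """
    d = spatial_features.shape[1]
    logits = spatial_features @ text_constraint / np.sqrt(d)
    weights = np.exp(logits - logits.max())  # numerically stable softmax
    weights /= weights.sum()
    attended = weights @ spatial_features
    score = attended @ text_constraint / (
        np.linalg.norm(attended) * np.linalg.norm(text_constraint))
    return float(score), weights

def is_consistent(text_constraint, spatial_features, threshold=0.7):
    # Hypothetical cut-off; the claim does not specify how consistency is judged.
    score, _ = cross_modal_consistency(text_constraint, spatial_features)
    return score >= threshold
```

Using a single query vector keeps the check cheap: the attention weights show which image regions the text constraint focuses on, and the final cosine indicates whether those regions actually agree with the text.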
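The role semantic space mapping $H_{\mathrm{target}} = \sigma(W_{\mathrm{role}} \cdot H_{\mathrm{pro}} + b)$ and a confidence-based conflict elimination step can be sketched as below. The choice of `tanh` for $\sigma$, the random initialisation, the `RoleAdapter` class, and the "keep the most confident agent's value per slot" rule are all assumptions; the claim states only that conflicts are eliminated based on agent confidence and that the parameters are learnable.

```python
import numpy as np

class RoleAdapter:
    """Per-role projection H_target = sigma(W_role @ H_pro + b).

    Parameters are randomly initialised here; in the claimed method they are
    learnable and adjusted from audience preference data (not shown).
    """

    def __init__(self, roles, d_in, d_out, seed=0):
        rng = np.random.default_rng(seed)
        self.params = {
            role: (rng.normal(0.0, 0.1, size=(d_out, d_in)), np.zeros(d_out))
            for role in roles
        }

    def project(self, role, h_pro):
        W_role, b = self.params[role]
        return np.tanh(W_role @ h_pro + b)  # sigma assumed to be tanh

def resolve_conflicts(agent_outputs, confidences):
    """For each slot, keep the value asserted by the most confident agent.

    agent_outputs: {agent_name: {slot: value}}; confidences: {agent_name: float}.
    Returns (resolved, evidence), where evidence maps each slot to its supporting agent.
    """
    resolved, evidence = {}, {}
    for agent, claims in agent_outputs.items():
        for slot, value in claims.items():
            if slot not in resolved or confidences[agent] > confidences[evidence[slot]]:
                resolved[slot] = value
                evidence[slot] = agent
    return resolved, evidence
```

The `evidence` mapping is a simple stand-in for an evidence chain: it records which agent supplied each surviving statement, so the answer can be traced back to its source.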
6. The question-answering method based on community multimodal information as described in claim 1, characterized in that, after obtaining the target response data including the answer data and the evidence chain, the method further includes: collecting feedback data on the target response data through different channels; converting the feedback data into computable feedback feature vectors; performing semantic clustering on the feedback feature vectors to obtain feedback types; generating status monitoring data based on the feedback data; and generating a feedback summary report based on the feedback types and the status monitoring data.
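The feedback pipeline of claim 6 can be sketched with NumPy. This version substitutes bag-of-words vectors and plain k-means for the unspecified feature encoding and semantic clustering (a real system would more likely use sentence embeddings), and it assumes the "status monitoring data" is simply a count of feedback per channel. The function names and record layout are hypothetical.

```python
from collections import Counter
import numpy as np

def vectorize(texts):
    """Bag-of-words vectors over the corpus vocabulary, L2-normalised per row."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    X = np.zeros((len(texts), len(vocab)))
    for row, t in enumerate(texts):
        for tok in t.lower().split():
            X[row, index[tok]] += 1.0
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, 1e-12)

def kmeans(X, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        dist = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(dist))])
    C = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels

def feedback_summary(feedback, k=2):
    """feedback: list of {"channel": str, "text": str} records."""
    texts = [f["text"] for f in feedback]
    labels = kmeans(vectorize(texts), k)
    types = {}
    for text, label in zip(texts, labels.tolist()):
        types.setdefault(label, []).append(text)
    status = {"total": len(feedback),
              "by_channel": dict(Counter(f["channel"] for f in feedback))}
    return {"feedback_types": types, "status_monitoring": status}
```

With feedback such as "elevator repair too slow" and "answer wording unclear" arriving from an app, a messaging channel, and a hotline, the sketch groups the two repair complaints and the two wording complaints into separate feedback types and reports the per-channel counts alongside them.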
7. The question-answering method based on community multimodal information as described in claim 6, characterized in that, after generating the feedback summary report based on the feedback types and the status monitoring data, the method further includes: retrieving, at preset time intervals, all feedback summary reports generated so far; for each item of target feedback data in the retrieved reports, obtaining its source authority weight, feedback frequency, and time decay factor; obtaining a first weight for the source authority weight, a second weight for the feedback frequency, and a third weight for the time decay factor; computing the target confidence level of the target feedback data as the weighted combination of its source authority weight, feedback frequency, and time decay factor using the first, second, and third weights; taking the target feedback data whose target confidence level exceeds a confidence threshold as candidate feedback data; calculating the semantic consistency between the candidate feedback data and the knowledge base to detect whether the two conflict; calculating the confidence level of the knowledge base; and, when a conflict is detected between the candidate feedback data and the knowledge base, calculating the difference between the confidence level of the knowledge base and the target confidence level of the candidate feedback data, then either pushing administrator decision data when the difference is less than a preset threshold, or, when the difference is greater than or equal to the preset threshold, incrementally updating the vector representations of the corresponding nodes in the knowledge base, and the material index library, according to the candidate feedback data;
wherein the answer template for each role is also optimized at preset time intervals.
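The confidence weighting and conflict handling in claim 7 can be sketched in Python. The claim fixes neither the form of the time decay factor nor the values of the three weights or thresholds, nor the direction of the subtraction. This sketch therefore assumes a 30-day half-life decay, a weighted sum with hypothetical weights 0.5, 0.3, and 0.2, and a difference computed as feedback confidence minus knowledge-base confidence. Conflict detection is passed in as a boolean rather than computed.

```python
from dataclasses import dataclass

@dataclass
class FeedbackItem:
    text: str
    authority: float  # source authority weight, assumed in [0, 1]
    frequency: float  # feedback frequency, assumed normalised to [0, 1]
    age_days: float   # age of the feedback, used for the time decay factor

def time_decay(age_days, half_life_days=30.0):
    # Exponential half-life decay; the claim names the factor but not its form.
    return 0.5 ** (age_days / half_life_days)

def target_confidence(item, w1=0.5, w2=0.3, w3=0.2):
    # Weighted sum using hypothetical first, second, and third weights.
    return (w1 * item.authority
            + w2 * item.frequency
            + w3 * time_decay(item.age_days))

def review(item, kb_confidence, conflicts_with_kb,
           conf_threshold=0.6, diff_threshold=0.1):
    """Return the action taken for one item of target feedback data."""
    c = target_confidence(item)
    if c <= conf_threshold:
        return "not_a_candidate"
    if not conflicts_with_kb:
        return "no_conflict"
    # Assumed direction: feedback confidence minus knowledge-base confidence.
    diff = c - kb_confidence
    if diff < diff_threshold:
        return "push_admin_decision"
    return "incremental_update"
```

Under these assumptions, recent feedback from an authoritative, frequently reporting source (confidence about 0.89) overrides a weakly supported knowledge-base entry (confidence 0.5), but is routed to an administrator when the knowledge base is nearly as confident (0.85).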
8. A question-answering device based on community multimodal information, characterized in that the device includes: a construction unit, configured to collect the full multimodal data of a target community and to construct, based on the full multimodal data, a knowledge base for question-answering understanding and reasoning and a material index library for evidence citation and visualization; a building unit, configured to establish a hybrid indexing system based on the knowledge base and the material index library; a generation unit, configured to, upon receiving a target question initiated by a target user, identify the target role of the target user based on the target question and generate a target response strategy based on the target role; and a response unit, configured to invoke multiple agents to search in the hybrid indexing system based on the target response strategy to obtain retrieval results, and to adapt and adjust the retrieval results based on the role semantic space mapping mechanism and the knowledge conflict resolution mechanism to obtain target response data including answer data and an evidence chain.
9. A computer device, characterized in that, The computer device includes: A memory for storing at least one instruction; and a processor for executing the instructions stored in the memory to implement the question-answering method based on community multimodal information as described in any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that: The computer-readable storage medium stores at least one instruction, which is executed by a processor in a computer device to implement the question-answering method based on community multimodal information as described in any one of claims 1 to 7.