Mobile application deceptive design detection method based on large model and intelligent agent
By constructing a multimodal dataset and using intelligent agents for dynamic behavior verification, the method addresses the difficulty of detecting deceptive designs in mobile applications that are highly covert or hidden behind dynamic code loading, achieving accurate identification and automated detection of deceptive designs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 武汉金银湖实验室 (Wuhan Jinyinhu Laboratory)
- Filing Date
- 2026-02-10
- Publication Date
- 2026-06-12
[Figure: CN122197010A abstract drawing]
Abstract
Description
Technical Field
[0001] This application belongs to the fields of cybersecurity and artificial intelligence, and more specifically relates to a method for detecting deceptive designs in mobile applications based on large models and intelligent agents.
Background Technology
[0002] With the rapid development of mobile internet technology, mobile applications (Apps) have become the main carrier of information interaction. However, in pursuit of commercial interests, some application developers use "deceptive design patterns" to manipulate user behavior. This is a malicious design technique that uses carefully designed user interface (UI) visual presentation or non-intuitive interaction logic to induce users to make involuntary decisions (such as hidden subscriptions, forced ad redirects, privacy theft, etc.).
[0003] Common deceptive design patterns include, but are not limited to, visual prominence, disguised ads, tiny buttons, easy entry but difficult exit, forced renewals, emotional blackmail, and false urgency. Compared to traditional web environments, mobile applications have limited screen sizes, diverse interaction methods (touch, long press, swipe), and closed application ecosystems, making such deceptive behaviors more covert and difficult to detect.
[0004] Current mobile application security testing methods rely primarily on static code analysis, such as automated scanning and data-flow analysis tools for Android APK packages. These traditional tools mainly check the code for sensitive API calls or known malicious patterns through rule-based or static feature matching. However, modern mobile applications often employ dynamic code loading, cloud-delivered logic, and strong code obfuscation. This "black box" characteristic makes it difficult for traditional static methods to recover the true business logic. Even if some code fragments can be scanned, the fragmented logic cannot be connected to reconstruct the complete interaction path, resulting in a very high false negative rate for deceptive logic hidden in the cloud.
[0005] To overcome the shortcomings of traditional code analysis methods, researchers have proposed using computer vision technology to directly detect anomalies in screenshots of application runtime interfaces. Specifically, this method treats the mobile application interface as a visual object, using object detection or optical character recognition (OCR) to extract the interface's geometric features and textual semantics, attempting to identify deceptive elements from visual appearance. This, to some extent, compensates for the problem of invisible code logic. However, current methods using a single visual model for deception detection still suffer from two key drawbacks: a lack of semantic understanding and a lack of dynamic logic.
[0006] Furthermore, while general-purpose multimodal large models have some ability to understand images and text, they lack domain-specific knowledge of mobile UI design guidelines and of the fraudulent tactics used by black-market industries. Using a general-purpose model directly for detection can easily produce numerous false alarms or missed detections, because the model fails to understand the functional meaning of specific UI components or overlooks covert fraud techniques that are common in the industry.
[0007] Therefore, there is an urgent need for a targeted detection method that can integrate high-level semantic understanding capabilities with dynamic behavior verification mechanisms to achieve accurate identification of various types of highly concealed deceptive designs.
Summary of the Invention
[0008] To address the shortcomings of existing technologies, the purpose of this application is to integrate high-level semantic understanding capabilities with dynamic behavior verification mechanisms to achieve accurate identification of various types of highly concealed deceptive designs.
[0009] To achieve the above objectives, in a first aspect, this application provides a method for detecting deceptive designs in mobile applications based on large models and intelligent agents, comprising:
constructing a multimodal dataset for deceptive design in mobile applications, the multimodal dataset including at least visual interference samples, semantic inducement samples, and logical obstacle samples;
based on the multimodal dataset, performing instruction fine-tuning on a multimodal large model to obtain a domain-adaptive fraud semantic recognition model;
inputting a screenshot of the interface of the application to be detected into the fraud semantic recognition model to obtain all suspected deceptive design candidate points as static semantic analysis results, and generating a verification task for each candidate point;
receiving and executing the verification task through the intelligent agent, and comparing the actual interface state after the operation with the expected response in the verification task to obtain a dynamic behavior verification result that confirms fraud or corrects a false alarm;
integrating the static semantic analysis results and the dynamic behavior verification results to generate a comprehensive detection report that includes the fraud type, location, and dynamic verification evidence chain.
[0010] Optionally, constructing the multimodal dataset for deceptive design in mobile applications specifically includes: collecting screenshots of real application interfaces that cover various deceptive design types; and labeling each screenshot to form labeled data comprising the screenshot, a fraud type description, and the coordinates of the malicious component, thereby obtaining the multimodal dataset.
[0011] Optionally, the method further includes: performing sample augmentation on the semantic-trap deceptive designs in the multimodal dataset that are implemented through a combination of text and images, so that the model learns complex fraud logic involving the combination of text and images.
[0012] Optionally, the training process of the fraud semantic recognition model includes: The multimodal dataset is used as the training set, and the open-source multimodal large model is trained using instruction fine-tuning technology until the training objective is achieved to obtain a trained fraud semantic recognition model. The fraud semantic recognition model takes a UI screenshot as input and outputs a structured description of a list of fraudulent elements, including a type, component description, coordinates, and reason field.
[0013] Optionally, obtaining the dynamic behavior verification result specifically includes: receiving the verification task, simulating user operations on the mobile device, clicking the coordinates of the candidate point, and capturing the state change information of the interface on the mobile device, where the state change information includes a screenshot of the new interface and the activity name of the current application; comparing the state change information with the verification task; if the verification task expects a normal state but the state change information shows an abnormal state, determining that the current candidate point has a deceptive design; and if the verification task expects an abnormal state but the state change information shows a normal state, determining that the current candidate point is a false alarm and removing it.
[0014] Optionally, determining that the current candidate point has a deceptive design includes: if the verification task expects the interface to close, but the actual interface changes to an ad download page or a payment page, determining that a semantic conflict exists between the expectation and the reality, and judging the current candidate point to be a deceptive design.
[0015] Optionally, determining that the current candidate point is a false alarm and removing it includes: when the verification task expects malicious behavior, but the actual interface state change information shows a normal function entry with no malicious jump, the intelligent agent returns the comparison result, and the current candidate point is marked as a false alarm and deleted.
[0016] In a second aspect, this application provides a mobile application deceptive design detection system based on large models and intelligent agents, comprising:
a dataset construction module, used to construct a multimodal dataset for deceptive design in mobile applications, the multimodal dataset including at least visual interference samples, semantic inducement samples, and logical obstacle samples;
a model fine-tuning module, used to perform instruction fine-tuning on the multimodal large model based on the multimodal dataset to obtain a domain-adaptive fraud semantic recognition model;
a static analysis module, used to input the screenshot of the interface of the application to be detected into the fraud semantic recognition model, obtain all suspected deceptive design candidate points as static semantic analysis results, and generate a verification task for each candidate point;
a dynamic verification module, used to receive and execute the verification task through the intelligent agent, compare the actual interface state after the operation with the expected response in the verification task, and obtain the dynamic behavior verification result that confirms fraud or corrects a false alarm;
an integrated output module, used to integrate the static semantic analysis results and the dynamic behavior verification results to generate a comprehensive detection report that includes the fraud type, location, and dynamic verification evidence chain.
[0017] In a third aspect, this application provides an electronic device, comprising: at least one memory for storing a program; and at least one processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to execute the method described in the first aspect or any possible implementation thereof.
[0018] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program that, when run on a processor, causes the processor to perform the method described in the first aspect or any possible implementation thereof.
[0019] In a fifth aspect, this application provides a computer program product that, when run on a processor, causes the processor to perform the method described in the first aspect or any possible implementation thereof.
[0020] It is understood that the beneficial effects of the second to fifth aspects mentioned above can be found in the relevant descriptions in the first aspect mentioned above, and will not be repeated here.
[0021] Overall, the technical solutions conceived in this application have the following beneficial effects compared with the prior art: (1) By integrating domain-fine-tuned multimodal semantic perception with agent-based dynamic interactive verification, this application overcomes the limitation that traditional static methods detect only a single type of deceptive design. It greatly expands detection coverage for complex semantic and logical deceptive designs, and its closed-loop verification mechanism significantly improves the accuracy and credibility of detection. By combining high-level semantic understanding with dynamic behavior verification, it achieves accurate identification of multiple types of highly concealed deceptive designs and automates the entire process of mobile application deceptive design detection.
[0022] (2) This application abandons the traditional visual detection scheme that relies solely on geometric features, and instead adopts a domain-tuned multimodal large model. By utilizing the powerful semantic understanding capabilities of the large model, it can not only detect visual anomalies, but also understand fraud types that rely on textual semantics and complex logic, such as false sense of urgency and misleading copywriting, thus breaking through the perception bottleneck of existing technologies.
[0023] (3) This application introduces a dynamic interactive agent mechanism. Through a closed-loop process of static identification followed by dynamic verification, the hypotheses of the static model are checked against the actual consequences of operating the interface. This mechanism can eliminate false alarms that "look like fraud but function normally" and confirm hidden fraud that "looks normal but behaves abnormally", so that the detection results have a factual basis.
[0024] (4) The intelligent agent of this application can simulate the operation logic of human users, automatically traverse suspected fraud nodes, and complete the entire process from discovery to verification without human intervention.
Attached Figure Description
[0025] Figure 1 is the first flowchart of the mobile application deceptive design detection method based on large models and intelligent agents provided in the embodiments of this application;
Figure 2 is the second flowchart of the mobile application deceptive design detection method based on large models and intelligent agents provided in the embodiments of this application;
Figure 3 is a schematic diagram illustrating the construction and classification of a multimodal dataset according to an embodiment of this application;
Figure 4 is a flowchart illustrating the dynamic verification process in an embodiment of this application;
Figure 5 is a schematic diagram of the structure of the mobile application deceptive design detection system based on large models and intelligent agents provided in the embodiments of this application;
Figure 6 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0026] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0027] In this article, the term "and / or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent three cases: A exists alone, A and B exist simultaneously, and B exists alone. The symbol " / " in this article indicates that the related objects are in an "or" relationship; for example, A / B means A or B.
[0028] The terms "first" and "second," etc., used in the specification and claims herein are used to distinguish different objects, not to describe a specific order of objects. For example, "first response message" and "second response message," etc., are used to distinguish different response messages, not to describe a specific order of response messages.
[0029] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or design. Specifically, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a specific manner.
[0030] In the description of the embodiments of this application, unless otherwise stated, "multiple" means two or more, for example, multiple processing units means two or more processing units, multiple elements means two or more elements, etc.
[0031] The embodiments of this application are described below with reference to the accompanying drawings.
[0032] Referring to Figure 1, this application provides a method for detecting deceptive design in mobile applications based on large models and intelligent agents, including:
S101. Construct a multimodal dataset for deceptive design in mobile applications; the multimodal dataset includes at least visual interference samples, semantic inducement samples, and logical obstacle samples;
S102. Based on the multimodal dataset, perform instruction fine-tuning on the multimodal large model to obtain a domain-adaptive fraud semantic recognition model;
S103. Input the screenshot of the interface of the application to be detected into the fraud semantic recognition model to obtain all suspected deceptive design candidate points as static semantic analysis results, and generate a verification task for each candidate point;
S104. Receive and execute the verification task through the intelligent agent, compare the actual interface state after the operation with the expected response in the verification task, and obtain the dynamic behavior verification result that confirms fraud or corrects a false alarm;
S105. Integrate the static semantic analysis results and the dynamic behavior verification results to generate a comprehensive detection report carrying the fraud type, location, and dynamic verification evidence chain.
[0033] Specifically, step S101 involves collecting screenshots of real mobile application interfaces from various app stores and constructing a multimodal instruction fine-tuning dataset organized by the mechanism of the deceptive design. This dataset includes traditional visual interference samples, such as those with visual asymmetry, color dilution, and hidden close entrances. More importantly, it incorporates two types of samples that are difficult to process using traditional computer vision methods:
Semantic inducement samples, including fake countdown copy, double-negative manipulative copy (such as "I don't want to save money" as the rejection option), and scare-tactic marketing copy. For example, "Only 10 minutes left for the discount" creates a false sense of urgency, and a "cruel rejection" button combines copy and button design to apply emotional blackmail;
Logical obstacle samples, including multi-level unsubscribe processes and hidden default checkboxes.
[0034] Each data point is labeled in a structured format, including three elements: screenshot, fraud type description, and coordinates of malicious components. This gives the data visual, spatial, and semantic information, laying the foundation for subsequent training of models that can understand complex image and text fraud logic.
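The three-element annotation described above can be pictured as a small record type. The following is a minimal sketch, not the applicant's actual schema: the class name `AnnotatedSample`, the field names, the category identifiers, and the pixel-box convention `(x1, y1, x2, y2)` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Tuple

# The three sample classes named in the text; the identifier spellings are assumed.
CATEGORIES = {"visual_interference", "semantic_inducement", "logical_obstacle"}


@dataclass(frozen=True)
class AnnotatedSample:
    """One labeled data point: screenshot, fraud type description, component box."""
    screenshot_path: str
    category: str
    fraud_type_description: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in screenshot pixels (assumed)

    def __post_init__(self):
        # Reject labels outside the three known sample classes.
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")
        # Reject boxes whose corners are negative or out of order.
        x1, y1, x2, y2 = self.box
        if not (0 <= x1 < x2 and 0 <= y1 < y2):
            raise ValueError(f"malformed box: {self.box}")


sample = AnnotatedSample(
    screenshot_path="screens/popup_001.png",  # hypothetical path
    category="semantic_inducement",
    fraud_type_description="False urgency: 'Only 10 minutes left for the discount'",
    box=(120, 860, 600, 940),
)
```

Validating at construction time keeps malformed boxes out of the training set, which matters because the coordinates are later reused as click targets.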
[0035] Secondly, in step S102, the dedicated dataset constructed in step S101 is used to fine-tune an open-source general-purpose multimodal model. Based on the visual layout, element styles, text content, and spatial relationships in the UI screenshot, the model comprehensively infers whether the interface has malicious design intentions to induce users to click, mislead users' decisions, or prevent users from exiting. After fine-tuning, the model is transformed into a domain-specific fraud semantic recognition model. Its core capability is: after receiving a UI screenshot, it can output a structured list of suspected fraudulent elements. Each entry in the list includes its type, component description, coordinates, and the reasoning based on the analysis of the image and text content.
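As an illustration of how a labeled screenshot might be turned into an instruction fine-tuning example, the sketch below pairs a fixed instruction with a JSON-encoded list of fraudulent elements. The source does not specify a record format; the keys `image`, `instruction`, and `response`, the prompt wording, and the helper name `to_instruction_record` are assumptions.

```python
import json
from typing import Dict, List

# Hypothetical prompt; the source does not give the actual instruction text.
INSTRUCTION = (
    "You are a mobile UI security reviewer. List every suspected deceptive "
    "design element in the screenshot as a JSON array of objects with the "
    "keys type, component, coordinates, reason."
)


def to_instruction_record(screenshot_path: str, elements: List[Dict]) -> Dict[str, str]:
    """Build one (image, instruction, response) training record."""
    return {
        "image": screenshot_path,
        "instruction": INSTRUCTION,
        # The target output is the structured list the model should learn to emit.
        "response": json.dumps(elements, ensure_ascii=False),
    }


record = to_instruction_record(
    "screens/popup_001.png",  # hypothetical path
    [{
        "type": "Induced Click",
        "component": "Red Button",
        "coordinates": [120, 860, 600, 940],
        "reason": "Copy implies a reward for clicking; the close button is very small",
    }],
)
```

Serializing the response as JSON is one way to make the fine-tuned model's output machine-parseable for the later verification steps.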
[0036] Further, static detection and analysis are performed in step S103. The running interface of the mobile application to be detected (which can be obtained through automated screenshots) is input into the fine-tuning model obtained in step S102. Based on its learned domain knowledge, the model performs a comprehensive semantic scan of the interface, identifies all possible deceptive design candidates, and constitutes the static semantic analysis results.
[0037] For example, the model can not only identify ads disguised as close buttons, but also understand that misleading copy such as "I don't need a discount, buy at full price" actually corresponds to the user's desired "reject" function. For each identified candidate point, the model further generates an executable verification task. This task is a clear text instruction that can be understood by subsequent automated agents. Its core is to describe the expected result that should occur after operating on the candidate point, consistent with its surface semantics. For example, for a suspected fake close button, the generated task is: "Click the X button at coordinates [x,y], the current pop-up interface should close."
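The verification-task generation described above can be sketched as a mapping from a candidate's surface semantics to an expected response. This is a simplified illustration under stated assumptions: in the described method the model itself generates the task text, whereas here a lookup table (`EXPECTED_BY_SEMANTICS`) stands in for it, and the names `Candidate`, `VerificationTask`, and `make_task` are hypothetical. Tapping the center of the bounding box is also an assumption.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed mapping from a component's surface meaning to what should happen on click.
EXPECTED_BY_SEMANTICS = {
    "close": "the current pop-up interface should close",
    "reject": "the offer should be declined and the pop-up should close",
}


@dataclass(frozen=True)
class Candidate:
    component: str                   # e.g. "X button"
    box: Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
    surface_semantics: str           # what the component appears to do


@dataclass(frozen=True)
class VerificationTask:
    tap_point: Tuple[int, int]
    expected_response: str
    instruction: str


def make_task(c: Candidate) -> VerificationTask:
    """Create a plain-text task the agent can execute and later check."""
    x1, y1, x2, y2 = c.box
    x, y = (x1 + x2) // 2, (y1 + y2) // 2  # tap the center of the component
    expected = EXPECTED_BY_SEMANTICS[c.surface_semantics]
    instruction = f"Click the {c.component} at coordinates [{x},{y}], {expected}."
    return VerificationTask((x, y), expected, instruction)


task = make_task(Candidate("X button", (900, 40, 940, 80), "close"))
print(task.instruction)
```

Keeping the expected response as a separate field, rather than only inside the instruction text, lets the later comparison step read it without re-parsing the sentence.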
[0038] Step S104 is the core of dynamic interaction and behavior verification, aiming to address the shortcomings of purely static analysis in verifying the true functionality of components. This is achieved by introducing an automated intelligent agent capable of parsing verification tasks, mapping screen coordinates, and generating operation commands (such as clicks and swipes). The agent runs in a controlled sandbox environment, receiving verification tasks from S103 and accurately simulating user finger actions such as clicks on real devices or emulators using ADB (Android Debug Bridge) or similar automated testing frameworks.
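A minimal sketch of the coordinate mapping and click simulation follows. It relies on the standard `adb shell input tap X Y` command; the helper names (`scale_point`, `tap_command`, `tap`), the assumption that the model saw a resized screenshot, and the example device serial are illustrative and not taken from the source.

```python
import subprocess
from typing import List, Optional, Tuple

Point = Tuple[int, int]
Size = Tuple[int, int]


def scale_point(point: Point, shot_size: Size, device_size: Size) -> Point:
    """Map a point from screenshot pixels to device pixels.

    Needed only if the screenshot given to the model was resized (an assumption).
    """
    sx = device_size[0] / shot_size[0]
    sy = device_size[1] / shot_size[1]
    return round(point[0] * sx), round(point[1] * sy)


def tap_command(point: Point, serial: Optional[str] = None) -> List[str]:
    """Build the adb argument list for a single tap."""
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]  # target one specific device or emulator
    return cmd + ["shell", "input", "tap", str(point[0]), str(point[1])]


def tap(point: Point, serial: Optional[str] = None) -> None:
    """Execute the tap; requires adb on PATH and a connected device."""
    subprocess.run(tap_command(point, serial), check=True, timeout=10)


device_point = scale_point((540, 960), (1080, 1920), (1440, 2560))
print(tap_command(device_point, "emulator-5554"))
```

Separating command construction from execution makes the agent's actions easy to log as part of the evidence chain and to test without a device attached.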
[0039] After the operation is completed, the agent immediately captures the system's state changes, including new interface screenshots, the current activity stack, and other information. Subsequently, the system automatically compares and analyzes the actual interface state against the expected response described in the verification task. Scenario A (Confirming Fraud): If the verification task expects the pop-up to close, but clicking it redirects to an ad download page, payment page, or the pop-up remains open, a semantic conflict arises. This conflict constitutes factual evidence of deceptive design.
[0040] Scenario B (Correcting False Alarms): If the static model considers a component suspicious (such as a prominent button), but the Agent clicks it and finds that it is indeed just a normal function entry point and does not trigger any malicious redirection, this result will be fed back to the system, marking the candidate point as a false alarm and removing it from the final result. The output of this step is the empirically verified result of dynamic behavior.
[0041] Finally, in step S105, the static semantic analysis results from S103 are correlated and integrated with the dynamic behavior verification results from S104. For candidate points confirmed as fraudulent by dynamic verification, the report records their fraud type and precise location in the screenshot and, crucially, includes the dynamic verification evidence chain, such as: "A cruel rejection button was detected (coordinates: [x,y]), clicking it did not close the pop-up window, and it actually redirected to the app store download page." Points judged as false alarms are excluded. The resulting structured comprehensive detection report achieves accurate identification of multiple types of highly concealed deceptive designs and automates the entire process of mobile application deceptive design detection.
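The integration step can be sketched as a join between static candidates and dynamic verdicts. The dictionary keys, the verdict label `confirmed_fraud`, and the function name `build_report` are assumptions for illustration. The sketch also assumes that candidates without a dynamic verdict are left out of the report, which the source does not state.

```python
from typing import Dict, List


def build_report(app_id: str, candidates: List[Dict], verdicts: Dict[str, Dict]) -> Dict:
    """Merge static candidates with dynamic verdicts into a structured report."""
    findings = []
    for c in candidates:
        v = verdicts.get(c["id"])
        # Keep only dynamically confirmed fraud; false alarms are excluded.
        if v is None or v["verdict"] != "confirmed_fraud":
            continue
        findings.append({
            "fraud_type": c["type"],
            "location": c["coordinates"],
            "evidence_chain": {
                "expected": v["expected"],
                "actual": v["actual"],
                "screenshots": [v["before_png"], v["after_png"]],
                "activity": v["activity"],
            },
        })
    return {"app": app_id, "findings": findings}


candidates = [
    {"id": "c1", "type": "Fake close button", "coordinates": [900, 40, 940, 80]},
    {"id": "c2", "type": "Prominent button", "coordinates": [100, 500, 980, 620]},
]
verdicts = {
    "c1": {
        "verdict": "confirmed_fraud",
        "expected": "pop-up closes",
        "actual": "redirected to app store download page",
        "before_png": "c1_before.png",   # hypothetical file names
        "after_png": "c1_after.png",
        "activity": "com.example.store/.DownloadActivity",
    },
    "c2": {"verdict": "false_alarm"},
}
report = build_report("com.example.app", candidates, verdicts)
```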
[0042] Referring to Figure 2, which is a complete flowchart of an embodiment of this application, the process includes three phases:
Phase 1: Fine-tuning phase. S1. Construct a domain-specific multimodal dataset containing three elements (data / type / scenario); select a general-purpose open-source multimodal base model. S2. Train the multimodal large model.
Phase 2: Static detection phase. Take screenshots of the running application under test; deploy the model; input the screenshots into the model for semantic recognition. S3. Static semantic analysis; output: list of suspected attack points; generate a verification task (expected behavior and operation coordinates).
Phase 3: Dynamic verification phase. S4. The intelligent agent receives the tasks; ADB simulated operation; action execution: click / swipe; capture the new state; state observation: new screenshot / interface; compare differences. Semantic conflict (expected to close but redirects): Scenario A, confirming fraud. Functionally normal (no malicious behavior): Scenario B, correcting false alarms. S5. Output a comprehensive detection report, including attack type / coordinates / chain of evidence.
[0043] Optionally, constructing the multimodal dataset for deceptive design in mobile applications specifically includes: collecting screenshots of real application interfaces that cover various deceptive design types; and labeling each screenshot to form labeled data comprising the screenshot, a fraud type description, and the coordinates of the malicious component, thereby obtaining the multimodal dataset.
[0044] Sample augmentation is performed on the semantic-trap deceptive designs in the multimodal dataset that are implemented through a combination of text and images, so that the model learns complex fraud logic involving the combination of text and images.
[0045] Specifically, this embodiment constructs a domain-specific multimodal instruction fine-tuning dataset for deceptive design in mobile applications. Addressing the issue that general-purpose large models (such as GPT-4V) lack UI security domain knowledge and have low sensitivity to specific fraud schemes, this embodiment constructs a high-quality dataset containing rich semantic types.
[0046] Data source: Screenshots of real apps containing various deceptive designs.
[0047] Annotation dimensions: Unlike traditional object detection which only annotates "location", the annotation of this dataset includes three elements: <screenshot, fraud type description, and coordinates of malicious components>.
[0048] Semantic Enhancement: This enhancement specifically targets "semantic traps" that traditional computer vision (CV) struggles to identify. For example, it collects samples containing phrases like "only 10 minutes left for the discount" (false sense of urgency) and "cruel rejection" (emotional blackmail copy). The goal is to enable the model to learn complex fraud logic combining text and images, thus covering more types of detection in subsequent steps.
[0049] Referring to Figure 3, which is a schematic diagram illustrating the construction and classification of a multimodal dataset according to an embodiment of this application, the diagram includes: input the raw data; data annotation, with the core annotation dimensions being screenshot, fraud type description, and coordinates of the malicious component; classification into visual interference, semantic inducement, and logical obstacles; output the multimodal instruction fine-tuning dataset.
[0050] Optionally, the training process of the fraud semantic recognition model includes: The multimodal dataset is used as the training set, and the open-source multimodal large model is trained using instruction fine-tuning technology until the training objective is achieved to obtain a trained fraud semantic recognition model. The fraud semantic recognition model takes a UI screenshot as input and outputs a structured description of a list of fraudulent elements, including a type, component description, coordinates, and reason field.
[0051] Specifically, this embodiment implements domain-adaptive fine-tuning of a multimodal large model. A general-purpose open-source multimodal large model is selected as the base, and it is trained with instruction fine-tuning on the dedicated dataset constructed in S1.
[0052] Training objective: To enable the model to go beyond simply describing the visuals, and instead, to infer, like a security expert, whether there is a design intent to induce users to click or prevent them from exiting, based on the visual layout and text content of the interface.
[0053] Output definition: The fine-tuned model can take a UI screenshot as input and output a structured description containing a list of suspected fraudulent elements, for example: {Type: "Induced Click", Component: "Red Button", Coordinates: [x1,y1,x2,y2], Reason: "The copy implies that clicking can get a red envelope, but there is a very small close button around it"}.
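The structured output described above is easier to consume downstream if it is validated on arrival. The sketch below assumes the fine-tuned model is prompted to emit a strict JSON array with lowercase keys `type`, `component`, `coordinates`, and `reason`; the example in the text uses an informal notation, so the JSON encoding and the names `FraudElement` and `parse_model_output` are assumptions.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

REQUIRED = ("type", "component", "coordinates", "reason")


@dataclass(frozen=True)
class FraudElement:
    type: str
    component: str
    coordinates: Tuple[int, ...]  # expected to hold (x1, y1, x2, y2)
    reason: str


def parse_model_output(raw: str) -> List[FraudElement]:
    """Parse and validate the model's JSON list of suspected fraudulent elements."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of elements")
    parsed = []
    for item in items:
        missing = [k for k in REQUIRED if k not in item]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        coords = tuple(int(v) for v in item["coordinates"])
        if len(coords) != 4:
            raise ValueError("coordinates must be [x1, y1, x2, y2]")
        parsed.append(FraudElement(item["type"], item["component"], coords, item["reason"]))
    return parsed


raw_output = json.dumps([{
    "type": "Induced Click",
    "component": "Red Button",
    "coordinates": [120, 860, 600, 940],
    "reason": "The copy implies that clicking can get a red envelope, "
              "but there is a very small close button around it",
}])
elements = parse_model_output(raw_output)
```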
[0054] Optionally, obtaining the dynamic behavior verification result specifically includes: receiving the verification task, simulating user operations on the mobile device, clicking the coordinates of the candidate point, and capturing the state change information of the interface on the mobile device, where the state change information includes a screenshot of the new interface and the activity name of the current application; comparing the state change information with the verification task; if the verification task expects a normal state but the state change information shows an abnormal state, determining that the current candidate point has a deceptive design; and if the verification task expects an abnormal state but the state change information shows a normal state, determining that the current candidate point is a false alarm and removing it.
[0055] Furthermore, determining that the current candidate point has a deceptive design includes: if the verification task expects the interface to close, but the actual interface changes to an ad download page or a payment page, determining that a semantic conflict exists between the expectation and the reality, and judging the current candidate point to be a deceptive design.
[0056] Furthermore, determining that the current candidate point is a false alarm and removing it includes: when the verification task expects malicious behavior, but the actual interface state change information shows a normal function entry with no malicious jump, the intelligent agent returns the comparison result, and the current candidate point is marked as a false alarm and deleted.
[0057] Specifically, this embodiment aims to address the limitation of static detection in verifying actual functionality. It introduces an automated intelligent agent: Action execution: The Agent receives the verification task generated by S3 and simulates a human hand clicking the corresponding coordinates on the device through ADB (Android Debug Bridge) or other automated testing interfaces.
[0058] State observation: The agent listens for changes in the system interface, captures screenshots of the new interface after an operation, and the name of the current application's Activity.
[0059] Comparison of differences: Scenario A (Confirmed Fraud): If the static model expects "Close," but the agent finds that clicking it redirects to an "ad download page" or a "payment page," a semantic conflict arises. This conflict constitutes factual evidence of deceptive design.
[0060] Scenario B (Correcting False Alarms): If the static model considers a button to look like a misleading advertisement, but the Agent clicks it and finds that it is indeed just a normal function entry point and no malicious redirection occurs, the Agent reports this result back to the system, and the system marks the candidate point as a "false alarm" and removes it.
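The state-observation step in the list above can be sketched with two standard adb commands: `adb exec-out screencap -p` for the screenshot and `adb shell dumpsys activity activities` for the activity stack. The exact `dumpsys` output format differs between Android versions, so the regular expression below, which looks for an `mResumedActivity` or `topResumedActivity` line, is an assumption that would need checking against the target devices. The function name `resumed_activity` and the sample text are hypothetical.

```python
import re
from typing import Optional

# Standard adb invocations; running them requires adb and a connected device.
SCREENCAP_CMD = ["adb", "exec-out", "screencap", "-p"]
DUMPSYS_CMD = ["adb", "shell", "dumpsys", "activity", "activities"]

# Assumed line shape: "mResumedActivity: ActivityRecord{<hash> u0 <pkg>/<class> t12}"
_RESUMED = re.compile(
    r"(?:mResumedActivity|topResumedActivity)\s*[:=]\s*"
    r"ActivityRecord\{\S+\s+\S+\s+([\w.]+)/([\w.$]+)"
)


def resumed_activity(dumpsys_text: str) -> Optional[str]:
    """Extract 'package/activity' of the foreground Activity, or None if absent."""
    m = _RESUMED.search(dumpsys_text)
    return f"{m.group(1)}/{m.group(2)}" if m else None


# Hypothetical excerpt of dumpsys output captured after a click.
sample_text = """
  Stack #1:
    mResumedActivity: ActivityRecord{1a2b3c u0 com.example.shop/.PayActivity t12}
"""
print(resumed_activity(sample_text))
```

An Activity name such as a payment or download screen appearing after a "close" click is the kind of observation that would feed the Scenario A comparison.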
[0061] Referring to Figure 4, which is a flowchart illustrating the dynamic verification process in an embodiment of this application, the process includes:
Start verification;
Receive the verification task;
Perform dynamic interaction: the intelligent agent simulates click operations;
Wait for the interface to render stably;
Capture the post-operation state: obtain a screenshot of the new interface and information about the current Activity;
State prediction generation: generate the expected behavior based on the verification task (e.g., the pop-up should close after clicking);
Risk Prior (RP): based on static analysis, generate a risk prediction value for the current candidate point (RP=1 indicates suspected fraud, RP=0 indicates expected normal);
Consistency determination: compare the expected function (FE_expected) with the actual function (FE_actual); if they match, the functional equivalence (FE) is marked as 1; if they are inconsistent, the functional equivalence (FE) is marked as 0;
Joint judgment: a comprehensive decision based on Risk Prior (RP) and Functional Equivalence (FE):
a. Case 1 (Normal Behavior): if RP=0 and FE=1 (expected to be normal, and actually normal), it is judged as normal behavior;
b. Case 2 (False Alarm Removal): if RP=1 and FE=1 (expected fraud, but actually normal), it is judged as a false alarm and removed from the results;
c. Case 3 (Confirmed Fraud): if RP=1 and FE=0 (expected fraud, and actually abnormal), it is judged as confirmed fraud;
d. Case 4 (High-Confidence Fraud): if RP=0 and FE=0 (expected to be normal, but actually abnormal), it is judged as confirmed fraud (high confidence), which is a typical "semantic conflict" fraud;
Update the detection results;
Record the chain of evidence: save the expected behavior, the actual behavior, screenshots before and after the operation, and Activity information as evidence.
[0062] The dynamic interaction and behavior verification process is explained in detail below using a specific fraud detection example. The agent first parses and executes a verification task, such as simulating a click on specific coordinates on the interface. After the operation is complete, the system waits for the interface rendering to stabilize, then captures a new screenshot of the interface and the current application Activity information as the actual state. At the same time, the system generates an explicit expected behavior based on the verification task, such as "the pop-up should close after clicking." The judgment process introduces two core criteria: first, the risk prior (RP), derived from the initial judgment of the component by the static semantic analysis model (RP=1 indicates that the model initially judges it as "suspected fraud," and RP=0 indicates an expected "normal function"); second, the functional equivalence (FE), determined by comparing whether the expected function (FE_expected) and the actually observed function (FE_actual) are consistent (FE=1 if consistent, FE=0 if inconsistent). The final fraud determination is made jointly by RP and FE: (A) when RP=0 and FE=1, the component behavior is determined to be normal; (B) when RP=1 and FE=1, the static model's judgment is determined to be a false alarm, and the candidate point is removed; (C) when RP=1 and FE=0, the component is confirmed to have fraudulent behavior; (D) when RP=0 and FE=0, the component is determined to have high-confidence "semantic conflict" fraud (i.e., the component's surface semantics suggest a normal function, but its actual behavior is abnormal). After the determination is completed, the system updates the detection results and fully records the expected behavior, the actual behavior, the screenshots before and after the operation, and the Activity information to form a traceable chain of evidence.
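The joint judgment over RP and FE described above can be sketched as a four-entry lookup table. This is a minimal illustration, not the applicant's implementation: the names `Verdict`, `functional_equivalence`, and `joint_judgment` are hypothetical, and comparing the expected and actual functions by simple string equality is an assumption, since the application does not specify how consistency is measured.

```python
from enum import Enum


class Verdict(Enum):
    NORMAL = "normal behavior"
    FALSE_ALARM = "false alarm (removed)"
    CONFIRMED_FRAUD = "confirmed fraud"
    HIGH_CONFIDENCE_FRAUD = "high-confidence semantic-conflict fraud"


def functional_equivalence(fe_expected: str, fe_actual: str) -> int:
    """Return FE=1 if the expected and observed functions agree, else FE=0.

    Assumption: both functions are already reduced to comparable labels
    (e.g. "close_popup"); the source does not say how they are compared.
    """
    return 1 if fe_expected == fe_actual else 0


# Decision table keyed by (RP, FE), following cases (A) to (D) in the text.
_DECISIONS = {
    (0, 1): Verdict.NORMAL,
    (1, 1): Verdict.FALSE_ALARM,
    (1, 0): Verdict.CONFIRMED_FRAUD,
    (0, 0): Verdict.HIGH_CONFIDENCE_FRAUD,
}


def joint_judgment(rp: int, fe: int) -> Verdict:
    """Combine the risk prior (RP) and functional equivalence (FE)."""
    if (rp, fe) not in _DECISIONS:
        raise ValueError("RP and FE must each be 0 or 1")
    return _DECISIONS[(rp, fe)]


if __name__ == "__main__":
    # A "close" button that opens an ad page although static analysis
    # expected a normal function: RP=0, FE=0.
    fe = functional_equivalence("close_popup", "open_ad_page")
    print(joint_judgment(0, fe).value)
```

Keeping the four cases in a table rather than nested conditionals makes it easy to check that every (RP, FE) combination is covered exactly once.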
[0063] Referring to Figure 5, this application provides a mobile application deceptive design detection system based on large models and intelligent agents, comprising: a dataset construction module 510, used to construct a multimodal dataset for deceptive design in mobile applications, the multimodal dataset including at least visual interference samples, semantic inducement samples, and logical obstacle samples; a model fine-tuning module 520, used to fine-tune the multimodal large model based on the multimodal dataset to obtain a domain-adaptive fraud semantic recognition model; a static analysis module 530, used to input a screenshot of the interface of the application to be detected into the fraud semantic recognition model, obtain all suspected deceptive design candidate points as static semantic analysis results, and generate a verification task for each candidate point; a dynamic verification module 540, used to receive and execute the verification task through the intelligent agent, compare the actual interface state after the operation with the expected response in the verification task, and obtain a dynamic behavior verification result that confirms fraud or corrects a false alarm; and an integrated output module 550, used to integrate the static semantic analysis results and the dynamic behavior verification results to generate a comprehensive detection report carrying the fraud type, location, and dynamic verification evidence chain.
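As an illustration of what the integrated output module might do, the sketch below merges static candidates with dynamic verification results into report entries carrying fraud type, location, and evidence. All names (`Candidate`, `DynamicResult`, `build_report`, the verdict strings) are hypothetical. Keeping candidates that were never dynamically verified, marked as "unverified", is also an assumption; the application does not say how such candidates are handled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One suspected point from static semantic analysis (hypothetical shape)."""
    candidate_id: str
    fraud_type: str
    location: tuple[int, int]


@dataclass
class DynamicResult:
    """Outcome of dynamic verification for one candidate (hypothetical shape)."""
    candidate_id: str
    verdict: str  # assumed values: "confirmed_fraud" or "false_alarm"
    evidence: dict


def build_report(candidates: list[Candidate],
                 results: list[DynamicResult]) -> list[dict]:
    """Merge static and dynamic results, dropping corrected false alarms."""
    by_id = {r.candidate_id: r for r in results}
    report = []
    for c in candidates:
        result = by_id.get(c.candidate_id)
        if result is not None and result.verdict == "false_alarm":
            continue  # false alarms are removed from the final report
        report.append({
            "fraud_type": c.fraud_type,
            "location": c.location,
            # Assumption: unverified candidates are kept but flagged.
            "status": result.verdict if result else "unverified",
            "evidence": result.evidence if result else {},
        })
    return report
```

The entries are plain dictionaries, so a report built this way could be serialized with the standard `json` module if the evidence values are themselves serializable.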
[0064] Optionally, the dataset construction module is specifically used for: collecting screenshots of real application interfaces covering various deceptive design types; and labeling each screenshot to form labeled data comprising the screenshot, a fraud type description, and the coordinates of the malicious components, thereby obtaining the multimodal dataset.
[0065] Optionally, the dataset construction module further includes a sample augmentation submodule, used for performing sample augmentation on semantic-trap deceptive designs that are implemented through a combination of text and images in the multimodal dataset, so that the model can learn complex fraud logic involving combined text and images.
[0066] Optionally, the training process of the fraud semantic recognition model includes: using the multimodal dataset as the training set and training an open-source multimodal large model with instruction fine-tuning until the training objective is reached, to obtain the trained fraud semantic recognition model. The fraud semantic recognition model takes a UI screenshot as input and outputs a structured list of fraudulent elements, each of which includes type, component description, coordinates, and reason fields.
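The structured output described above could be consumed as follows. This is a sketch under stated assumptions: the application names the four fields but not the serialization, so the JSON encoding, the key names (`type`, `component`, `coordinates`, `reason`), and the bounding-box convention `[left, top, right, bottom]` are all hypothetical, as are `FraudElement` and `parse_model_output`.

```python
import json
from dataclasses import dataclass

# Assumed key names for the four fields listed in the text.
REQUIRED_FIELDS = {"type", "component", "coordinates", "reason"}


@dataclass(frozen=True)
class FraudElement:
    type: str
    component: str
    coordinates: tuple[int, int, int, int]  # assumed: left, top, right, bottom
    reason: str

    def tap_point(self) -> tuple[int, int]:
        """Center of the bounding box, usable as a click target."""
        left, top, right, bottom = self.coordinates
        return ((left + right) // 2, (top + bottom) // 2)


def parse_model_output(raw: str) -> list[FraudElement]:
    """Validate and convert the model's JSON list into typed records."""
    items = json.loads(raw)
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of fraudulent elements")
    elements = []
    for item in items:
        missing = REQUIRED_FIELDS - set(item)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
        coords = tuple(int(v) for v in item["coordinates"])
        if len(coords) != 4:
            raise ValueError("coordinates must be [left, top, right, bottom]")
        elements.append(
            FraudElement(item["type"], item["component"], coords, item["reason"])
        )
    return elements
```

Validating the fields before use matters here because a fine-tuned model can emit malformed or incomplete structures, and a verification task built from bad coordinates would click the wrong place.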
[0067] Optionally, the process of obtaining the dynamic behavior verification result specifically includes: receiving the verification task, simulating user operations on the mobile device, clicking the coordinates of the candidate point, and capturing the state change information of the interface on the mobile device, where the state change information includes a screenshot of the new interface and the Activity name of the current application; comparing the state change information with the verification task; if the verification task expects a normal state but the state change information shows an abnormal state, determining that the current candidate point has a deceptive design; and if the verification task expects an abnormal state but the state change information shows a normal state, determining that the current candidate point is a false alarm and removing it.
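The click, wait, capture, and compare sequence above can be sketched against an abstract device interface. Everything named here is hypothetical: `Device` stands in for whatever automation layer drives the phone, `classify_state` stands in for the component that labels the captured state as "normal" or "abnormal", and detecting a stable render by polling until two consecutive screenshots are identical is an assumption. The "consistent" outcome covers the two combinations this paragraph does not address.

```python
import time
from dataclasses import dataclass
from typing import Callable, Protocol


class Device(Protocol):
    """Hypothetical device-control interface; not a real library API."""
    def tap(self, x: int, y: int) -> None: ...
    def screenshot(self) -> bytes: ...
    def current_activity(self) -> str: ...


@dataclass
class VerificationTask:
    x: int
    y: int
    expected_state: str      # assumed labels: "normal" or "abnormal"
    expected_behavior: str   # e.g. "the pop-up should close after clicking"


@dataclass
class Evidence:
    expected_behavior: str
    before: bytes
    after: bytes
    activity: str
    actual_state: str
    verdict: str


def wait_until_stable(device: Device, interval: float, max_polls: int = 10) -> bytes:
    """Poll until two consecutive screenshots match (assumed stability test)."""
    previous = device.screenshot()
    for _ in range(max_polls):
        time.sleep(interval)
        current = device.screenshot()
        if current == previous:
            return current
        previous = current
    return previous  # give up after max_polls and use the last frame


def verify(device: Device, task: VerificationTask,
           classify_state: Callable[[bytes, str], str],
           interval: float = 0.5) -> Evidence:
    """Execute one verification task and record the evidence chain."""
    before = device.screenshot()
    device.tap(task.x, task.y)
    after = wait_until_stable(device, interval)
    activity = device.current_activity()
    actual = classify_state(after, activity)
    if task.expected_state == "normal" and actual == "abnormal":
        verdict = "deceptive_design"
    elif task.expected_state == "abnormal" and actual == "normal":
        verdict = "false_alarm"
    else:
        verdict = "consistent"  # not covered by this paragraph; assumption
    return Evidence(task.expected_behavior, before, after, activity, actual, verdict)
```

Passing the device and the state classifier in as parameters keeps the comparison logic testable without a physical phone or emulator.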
[0068] Optionally, determining that the current candidate point has a deceptive design includes: if the verification task expects the interface to close, but the actual interface jumps to an ad download page or a payment page, determining that a semantic conflict exists between the expectation and the actual result, and judging the current candidate point to be a deceptive design.
[0069] Optionally, determining that the current candidate point is a false alarm and removing it includes: when the verification task expects malicious behavior, but the actual interface state change information indicates a normal function entry with no malicious redirection, the intelligent agent reports the comparison result back, and the current candidate point is marked as a false alarm and deleted.
[0070] Referring to Figure 6, based on the methods in the above embodiments, this application provides an electronic device that may include: a processor 610, a communications interface 620, a memory 630, and a communication bus 640. The processor 610, communications interface 620, and memory 630 communicate with each other via the communication bus 640. The processor 610 can call logical instructions in the memory 630 to execute the methods in the above embodiments.
[0071] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
[0072] Based on the methods in the above embodiments, this application provides a computer-readable storage medium storing a computer program that, when run on a processor, causes the processor to execute the methods in the above embodiments.
[0073] Based on the methods in the above embodiments, this application provides a computer program product that, when run on a processor, causes the processor to execute the methods in the above embodiments.
[0074] It is understood that the processor in the embodiments of this application can be a central processing unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. A general-purpose processor can be a microprocessor or any conventional processor.
[0075] The method steps in this application embodiment can be implemented in hardware or by a processor executing software instructions. The software instructions can consist of corresponding software modules, which can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, hard disks, portable hard disks, CD-ROMs, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, enabling the processor to read information from and write information to the storage medium. Of course, the storage medium can also be a component of the processor. The processor and the storage medium can reside in an ASIC.
[0076] In the above embodiments, implementation can be achieved entirely or partially through software, hardware, firmware, or any combination thereof. When implemented using software, it can be implemented entirely or partially as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium. The computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).
[0077] It is understood that the various numerical designations used in the embodiments of this application are merely for the convenience of description and are not intended to limit the scope of the embodiments of this application.
[0078] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this application should be included within the scope of protection of this application.
Claims
1. A method for detecting deceptive design in mobile applications based on large models and intelligent agents, characterized in that it comprises: constructing a multimodal dataset for deceptive design in mobile applications, the multimodal dataset including at least visual interference samples, semantically induced samples, and logically obstructive samples; performing instruction fine-tuning on a multimodal large model based on the multimodal dataset to obtain a domain-adaptive fraud semantic recognition model; inputting a screenshot of the interface of the application to be detected into the fraud semantic recognition model to obtain all suspected deceptive design candidate points as static semantic analysis results, and generating a verification task for each candidate point; receiving and executing the verification task through the intelligent agent, and comparing the actual interface state after the operation with the expected response in the verification task to obtain a dynamic behavior verification result that confirms fraud or corrects a false alarm; and integrating the static semantic analysis results and the dynamic behavior verification results to generate a comprehensive detection report that includes the fraud type, location, and dynamic verification evidence chain.
2. The method for detecting deceptive design in mobile applications based on large models and intelligent agents according to claim 1, characterized in that constructing the multimodal dataset for deceptive design in mobile applications specifically includes: collecting screenshots of real application interfaces covering various deceptive design types; and labeling each screenshot to form labeled data comprising the screenshot, a fraud type description, and the coordinates of the malicious components, thereby obtaining the multimodal dataset.
3. The method for detecting deceptive design in mobile applications based on large models and intelligent agents according to claim 2, characterized in that it further includes: performing sample augmentation on semantic-trap deceptive designs that are implemented through a combination of text and images in the multimodal dataset, so that the model can learn complex fraud logic involving combined text and images.
4. The method for detecting deceptive design in mobile applications based on large models and intelligent agents according to claim 1, characterized in that the training process of the fraud semantic recognition model includes: using the multimodal dataset as the training set and training an open-source multimodal large model with instruction fine-tuning until the training objective is reached, to obtain the trained fraud semantic recognition model; wherein the fraud semantic recognition model takes a UI screenshot as input and outputs a structured list of fraudulent elements, each of which includes type, component description, coordinates, and reason fields.
5. The method for detecting deceptive design in mobile applications based on large models and intelligent agents according to claim 1, characterized in that the process of obtaining the dynamic behavior verification result specifically includes: receiving the verification task, simulating user operations on the mobile device, clicking the coordinates of the candidate point, and capturing the state change information of the interface on the mobile device, where the state change information includes a screenshot of the new interface and the Activity name of the current application; comparing the state change information with the verification task; if the verification task expects a normal state but the state change information shows an abnormal state, determining that the current candidate point has a deceptive design; and if the verification task expects an abnormal state but the state change information shows a normal state, determining that the current candidate point is a false alarm and removing it.
6. The method for detecting deceptive design in mobile applications based on large models and intelligent agents according to claim 5, characterized in that determining that the current candidate point has a deceptive design includes: if the verification task expects the interface to close, but the actual interface jumps to an ad download page or a payment page, determining that a semantic conflict exists between the expectation and the actual result, and judging the current candidate point to be a deceptive design.
7. The method for detecting deceptive design in mobile applications based on large models and intelligent agents according to claim 5, characterized in that determining that the current candidate point is a false alarm and removing it includes: when the verification task expects malicious behavior, but the actual interface state change information indicates a normal function entry with no malicious redirection, the intelligent agent reports the comparison result back, and the current candidate point is marked as a false alarm and deleted.
8. A mobile application deceptive design detection system based on large models and intelligent agents, characterized in that it comprises: a dataset construction module, used to construct a multimodal dataset for deceptive design in mobile applications, the multimodal dataset including at least visual interference samples, semantically induced samples, and logically obstructive samples; a model fine-tuning module, used to fine-tune the multimodal large model based on the multimodal dataset to obtain a domain-adaptive fraud semantic recognition model; a static analysis module, used to input a screenshot of the interface of the application to be detected into the fraud semantic recognition model, obtain all suspected deceptive design candidate points as static semantic analysis results, and generate a verification task for each candidate point; a dynamic verification module, used to receive and execute the verification task through the intelligent agent, compare the actual interface state after the operation with the expected response in the verification task, and obtain a dynamic behavior verification result that confirms fraud or corrects a false alarm; and an integrated output module, used to integrate the static semantic analysis results and the dynamic behavior verification results to generate a comprehensive detection report that includes the fraud type, location, and dynamic verification evidence chain.
9. An electronic device, characterized in that it comprises: at least one memory for storing a computer program; and at least one processor configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor performs the method as described in any one of claims 1-7.
10. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is run on a processor, it causes the processor to perform the method as described in any one of claims 1-7.