Merchandise card instance segmentation method

By recognizing product card instances using a pre-trained visual language model and generating automated operation instructions, the problem of insufficient universality in product card segmentation in existing technologies is solved. This enables accurate segmentation and operation across devices, improving the efficiency and accuracy of product card recognition and operation.

CN122244887A · Pending · Publication Date: 2026-06-19 · GUANGZHOU PINWEI SOFTWARE CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GUANGZHOU PINWEI SOFTWARE CO LTD
Filing Date: 2026-04-21
Publication Date: 2026-06-19

Patent Timeline

21 Apr 2026: Application
19 Jun 2026: Publication (CN122244887A)

IPC: G06V30/413; G06V10/26; G06V10/75; G06V10/74; G06V10/764; G06V10/774

AI Tagging

Application Domain

Character and pattern recognition



AI Technical Summary

Technical Problem

Existing automation tools lack versatility and intelligence in product card instance segmentation, cannot adapt to various layouts, and cannot achieve fine segmentation at the instance level, resulting in low efficiency in product card recognition and operation under complex layouts.

Method used

A pre-trained visual language model is used to identify product card instances. By analyzing visual repetition pattern features and layout structure similarity parameters, combined with sliding window dynamic scanning and row and column position encoding strategies, automated operation instructions are generated to achieve accurate segmentation and operation across devices.

Benefits of technology

It significantly improves the ability to generalize the processing of diverse product lists, solves the problem that traditional tools cannot universally handle complex layouts, achieves high-precision product card instance segmentation and automated operation, and improves the accuracy and efficiency of operation.

✦ Generated by Eureka AI based on patent content.



Abstract

This application provides a method for segmenting product card instances. It achieves accurate identification and segmentation of product card instances through a pre-trained visual language model, outputs positioning information based on the screenshot coordinate system, and generates physical operation commands by combining device parameters. The layout structure similarity analysis and adaptive segmentation strategy design solve the problem of universal adaptation to diverse layouts such as waterfall and grid layouts; the convex hull algorithm and coordinate linear scaling mechanism ensure positioning accuracy across devices; and image enhancement and confidence feedback optimization during the training phase continuously improve the model's robustness. Therefore, this method achieves high-precision, universal product card instance segmentation and automated operation, significantly improving the efficiency and accuracy of verification in e-commerce scenarios.


Description

Technical Field

[0001] This application relates to the field of artificial intelligence recognition, and in particular to a method for segmenting product card instances.

Background Technology

[0002] Currently, while some feasible solutions for page element recognition use automated tools, they lack versatility and cannot be adapted to a wide range of application scenarios and diverse recognition tasks. Furthermore, current recognition methods struggle to uncover the deeper semantics of each page element and its distribution.

[0003] Specifically, in a scenario involving product card recognition on a single page, each product card contains the product's attributes, sales information, and a visual representation of the product. Some feasible automated tools cannot adapt to product lists with various layouts and cannot achieve fine-grained instance-level segmentation to distinguish individual product cards within complex layouts. Therefore, the root cause lies in the insufficient versatility and intelligence of automated tools in recognizing and segmenting product card instances. Thus, a product card instance segmentation method is needed to address these issues.

Summary of the Invention

[0004] The purpose of this application is to at least address one of the aforementioned technical deficiencies, particularly the lack of versatility and intelligence in the segmentation of product card instances by existing automated tools.

[0005] Firstly, this application provides a method for segmenting product card instances, the method comprising: obtaining the target image corresponding to the target page; inputting the target image into a pre-trained visual language model to identify product card instances and output position information and size information based on the target image screenshot coordinate system; and generating automated operation instructions based on the position information and the size information, and executing click operations, information acquisition operations, and analysis operations on the product card instances according to the automated operation instructions.

[0006] As an optional implementation, the step of inputting the target image into a pre-trained visual language model to identify product card instances and output position and size information based on the target image screenshot coordinate system includes: inputting the target image into the pre-trained visual language model to analyze the visual repetition pattern features of each product card instance and extract layout structure similarity parameters; when the layout structure similarity parameter of each of the product card instances is greater than a first threshold, marking the corresponding product card instances as the same product list structure type; and selecting the corresponding instance segmentation strategy according to the product list structure type, and outputting, in sequence, the position information and size information of each product card instance based on the target image screenshot coordinate system. The instance segmentation strategy includes: for a waterfall layout, using a sliding window dynamic scanning mechanism to extract each product card instance block by block in the vertical direction; and for a grid layout, using a row and column position encoding mechanism to identify card boundaries at fixed grid intervals and separate each product card instance.

[0007] As an optional implementation, the calculation of the layout structure similarity parameter includes: extracting the edge orientation histogram features of each of the product card instances; calculating the feature cosine similarity between adjacent product card instances; and, when the feature cosine similarity of a preset number of consecutive product card instances exceeds a repetition threshold, determining that the relevant product card instances form a repetition pattern.

[0008] As an optional implementation, the step of generating automated operation instructions based on the position information and the size information includes: obtaining the vertex coordinates of the outline polygon of each product card instance based on the screenshot coordinate system; performing linear scaling on the vertex coordinates of the outline polygon based on the current device display parameters to obtain physical operation position information; and generating automated operation instructions based on the physical operation position information. The generation of the vertex coordinates of the outline polygon includes: detecting the edge feature points of each of the product card instances; and connecting the edge feature points by the convex hull algorithm to form a closed polygon, and recording the horizontal and vertical coordinate sequences of the vertices of the closed polygon in the screenshot coordinate system.
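The outline polygon step can be illustrated with a minimal sketch. It assumes the edge feature points are already available as (x, y) pairs in screenshot coordinates, uses Andrew's monotone chain as one possible convex hull algorithm (the application does not name a specific variant), and treats the screenshot size and device display size as the scaling parameters. The function names `convex_hull` and `scale_polygon` are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); the sign gives the turn direction."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Andrew's monotone chain. Returns the hull vertices as a closed polygon
    (first vertex not repeated); collinear points are dropped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def scale_polygon(vertices: List[Point],
                  shot_size: Tuple[int, int],
                  device_size: Tuple[int, int]) -> List[Point]:
    """Linearly scale screenshot coordinates to device coordinates
    (assumed parameters: screenshot and device width/height in pixels)."""
    shot_w, shot_h = shot_size
    dev_w, dev_h = device_size
    return [(x * dev_w / shot_w, y * dev_h / shot_h) for x, y in vertices]


if __name__ == "__main__":
    edge_points = [(0, 0), (100, 0), (100, 50), (0, 50), (50, 25), (50, 0)]
    hull = convex_hull(edge_points)
    print(hull)
    print(scale_polygon(hull, (100, 50), (200, 150)))
```

Interior and collinear points such as (50, 25) and (50, 0) are discarded, so only the corner vertices of the card outline are scaled and passed on.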

[0009] As an optional implementation, the training process of the visual language model includes: obtaining e-commerce training images to construct a training dataset; for each of the e-commerce training images, labeling the coordinate range and semantic identifier of the elements associated with the product display unit; and adopting a multimodal Transformer architecture to jointly optimize the loss functions of the visual feature extraction layer and the semantic understanding layer. Additionally, random noise and simulated display parameter fluctuations are added during training for image augmentation.
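As a rough illustration of the augmentation step only, the sketch below adds Gaussian noise and a random brightness and contrast shift to a grayscale image stored as nested lists. Modeling "display parameter fluctuations" as brightness and contrast jitter is an assumption, as are the function name `augment` and all default magnitudes.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale image, values in [0, 255]


def augment(image: Image, rng: random.Random,
            noise_sigma: float = 5.0,
            brightness_jitter: float = 0.1,
            contrast_jitter: float = 0.1) -> Image:
    """Return a noisy copy of `image` with a random brightness/contrast shift.

    One gain and one bias are drawn per image (simulated display fluctuation);
    Gaussian noise is drawn per pixel. All magnitudes are illustrative."""
    gain = 1.0 + rng.uniform(-contrast_jitter, contrast_jitter)
    bias = 255.0 * rng.uniform(-brightness_jitter, brightness_jitter)
    out: Image = []
    for row in image:
        new_row = []
        for value in row:
            shifted = gain * (value - 127.5) + 127.5 + bias
            noisy = shifted + rng.gauss(0.0, noise_sigma)
            new_row.append(min(255.0, max(0.0, noisy)))  # clamp to valid range
        out.append(new_row)
    return out


if __name__ == "__main__":
    sample = [[float(10 * (x + y)) for x in range(4)] for y in range(3)]
    for row in augment(sample, random.Random(0)):
        print([round(v, 1) for v in row])
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible for a given seed.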

[0010] As an optional implementation, before obtaining the vertex coordinates of the outline polygon of each of the product card instances based on the screenshot coordinate system, the method further includes: verifying the confidence of the output of the visual language model; and, when the confidence is less than a preset threshold, locally magnifying the relevant area of each such product card instance, re-acquiring the target image, and re-executing the recognition operation. In addition, the page context features of each product card instance whose confidence is less than the preset threshold are recorded to optimize the training dataset.
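The confidence check can be sketched as a small retry loop. The `Detection` type, the `reacquire` callback (standing in for "magnify the region, re-acquire the image, re-run recognition"), the retry limit, and the threshold value are all hypothetical; the application does not specify them.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in screenshot coordinates


@dataclass
class Detection:
    box: Box
    confidence: float


def verify_detections(detections: List[Detection],
                      reacquire: Callable[[Box, float], Detection],
                      threshold: float = 0.8,
                      zoom: float = 2.0,
                      max_retries: int = 1) -> Tuple[List[Detection], List[Detection]]:
    """Re-run recognition on low-confidence detections.

    Returns (final detections, low-confidence originals). The second list is
    what would be recorded to extend the training dataset."""
    final: List[Detection] = []
    hard_cases: List[Detection] = []
    for det in detections:
        current = det
        if current.confidence < threshold:
            hard_cases.append(det)
        retries = 0
        while current.confidence < threshold and retries < max_retries:
            # Assumed contract: the callback returns a detection that is
            # already mapped back to the original screenshot coordinates.
            current = reacquire(current.box, zoom)
            retries += 1
        final.append(current)
    return final, hard_cases


if __name__ == "__main__":
    def fake_reacquire(box: Box, zoom: float) -> Detection:
        return Detection(box, 0.95)  # stand-in for the real model call

    dets = [Detection((0, 0, 10, 10), 0.9), Detection((20, 0, 10, 10), 0.5)]
    print(verify_detections(dets, fake_reacquire))
```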

[0011] Secondly, this application provides a product card instance segmentation device, the device comprising: an acquisition module, used to acquire the target image corresponding to the target page; and a processing module, used to input the target image into a pre-trained visual language model, identify product card instances, and output position and size information based on the target image screenshot coordinate system. The processing module is further configured to generate automated operation instructions based on the position information and the size information, and to execute click operations, information acquisition operations, and analysis operations on the product card instances according to the automated operation instructions.

[0012] As an optional implementation, the processing module inputs the target image into a pre-trained visual language model, identifies product card instances, and outputs position and size information based on the target image screenshot coordinate system in a specific manner, including: The target image is input into a pre-trained visual language model to analyze the visual repetition pattern features of each product card instance and extract layout structure similarity parameters. When the layout structure similarity parameter of each of the product card instances is greater than the first threshold, the corresponding product card instances are marked as the same product list structure type. Select the corresponding instance segmentation strategy according to the product list structure type, and output the position information and size information of each product card instance based on the target image screenshot coordinate system in sequence; The instance segmentation strategy includes: using a sliding window dynamic scanning mechanism for the waterfall layout to extract each product card instance by dividing it into blocks in the vertical direction; and using a row and column position encoding mechanism for the grid layout to identify card boundaries by fixed grid intervals and divide each product card instance.

[0013] Thirdly, this application provides a computer device including one or more processors and a memory storing computer-readable instructions that, when executed by the one or more processors, perform the steps of the method described in the first aspect.

[0014] Fourthly, this application provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method described in the first aspect.

[0015] As can be seen from the above technical solutions, the embodiments of this application have the following advantages: Based on any of the above embodiments, the product card instances in the target image are first identified using a visual language model, which outputs their position and size information in the screenshot coordinate system, laying the spatial positioning foundation for automated operations. During the identification process, by analyzing the visual repetition pattern features of the product cards and calculating layout structure similarity parameters, the system adaptively distinguishes between waterfall and grid layout types, and employs sliding window dynamic scanning and row / column position encoding strategies respectively to achieve accurate segmentation of instances under different layouts, significantly improving the generalization capability for diverse product lists. In the coordinate transformation stage, the convex hull algorithm generates the vertex coordinates of the card outline polygon, and linear scaling is performed based on device display parameters, ensuring cross-resolution adaptability of the physical operation position information and solving the positioning deviation problem caused by device differences in traditional methods. During training, joint optimization under the multimodal Transformer architecture, together with image augmentation strategies, strengthens the model's understanding of the association between product semantics and visual features, while confidence verification and local re-identification mechanisms continuously optimize the model's performance in edge cases through dynamic feedback. Finally, the generated automated operation instructions can directly drive the tool to perform tasks such as clicking and information retrieval, forming a closed loop from identification to operation.
This method, with high-precision instance segmentation as its core, combined with layout adaptation, cross-device coordinate adaptation, and continuous model optimization mechanisms, completely solves the problem that traditional tools cannot universally handle complex layouts, providing an efficient and reliable technical solution for the automated review of massive amounts of product cards.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart illustrating a product card instance segmentation method provided in one embodiment of this application; Figure 2 is an internal structural diagram of a computer device provided in an embodiment of this application.

Detailed Implementation

[0018] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0019] Currently, while some feasible solutions for page element recognition use automated tools, they lack versatility and cannot be adapted to a wide range of application scenarios and diverse recognition tasks. Furthermore, current recognition methods struggle to uncover the deeper semantics of each page element and its distribution.

[0020] For example, a platform may contain a massive number of product card instances. During major promotional periods, operations teams configure numerous promotional tags, prices, and special display styles. Operations staff must manually verify the completeness of each product card's display and the accuracy of promotional information, resulting in a huge workload and low efficiency. Some feasible implementation methods (based on XPath / CSS) lack versatility and cannot be universally applied to product lists with different layouts (waterfall, grid). More importantly, these methods cannot achieve fine-grained instance-level segmentation from page screenshots (visual information), making it difficult to accurately distinguish individual product cards and hindering large-scale automated verification of product cards.

[0021] Specifically, in the scenario of identifying operational placements on a page, the design of these placements is usually quite complex, containing multiple clickable sub-elements. However, some feasible automation tools cannot adapt to all types and architectures of operational placement designs, and struggle to identify the design logic behind the placements and the nesting of their internal sub-elements. Similarly, in the scenario of identifying product cards on a page, product cards contain the corresponding product's attributes, sales information, and a visual display of the product. Some feasible automation tools cannot adapt to product lists with various layouts and cannot achieve fine-grained instance-level segmentation to distinguish individual product cards within complex layouts. Furthermore, in the scenario of identifying filter or sorting bars, traditionally, manual testing is required. While some feasible automation tools can improve automation levels, they still cannot adapt to various styles of filter bars and cannot predict the associated interactive behaviors of filter bar clicks, resulting in a poor user experience. All of these scenarios suffer from problems due to the insufficient versatility and intelligence of automated tools in segmenting product card instances. Therefore, a product card instance segmentation method based on a visual language model is needed to address these issues.

[0022] In summary, the technical concept of this application lies in first identifying product card instances in a target image using a visual language model, which outputs their position and size information in the screenshot coordinate system, thus laying the spatial positioning foundation for automated operations. During the identification process, by analyzing the visual repetition pattern features of the product cards and calculating layout structure similarity parameters, the application adaptively distinguishes between waterfall and grid layout types, and employs sliding window dynamic scanning and row / column position encoding strategies respectively to achieve accurate segmentation of instances under different layouts, significantly improving the generalization capability for diverse product lists. In the coordinate transformation stage, the convex hull algorithm generates the vertex coordinates of the card outline polygon, and linear scaling is performed based on device display parameters, ensuring cross-resolution adaptability of the physical operation position information and solving the positioning deviation problem caused by device differences in traditional methods. During training, joint optimization under the multimodal Transformer architecture, together with image augmentation strategies, strengthens the model's understanding of the association between product semantics and visual features, while confidence verification and local re-identification mechanisms continuously optimize the model's performance in edge cases through dynamic feedback. Finally, the generated automated operation instructions can directly drive the tool to perform tasks such as clicking and information retrieval, forming a closed loop from identification to operation.
This method, with high-precision instance segmentation as its core, combined with layout adaptation, cross-device coordinate adaptation, and continuous model optimization mechanisms, completely solves the problem that traditional tools cannot universally handle complex layouts, providing an efficient and reliable technical solution for the automated review of massive amounts of product cards.

[0023] The methods provided in this application will be described in detail below based on the corresponding implementation methods in some practical application scenarios.

[0024] The product card instance segmentation architecture provided in this application is a product card instance segmentation method based on a visual language model, which may specifically include: Get the target image corresponding to the target page; The target image is input into a pre-trained visual language model, which performs a recognition task based on the target page elements and outputs position and size information based on the target image screenshot coordinate system. Based on the current device display parameters, the position information and the size information are linearly scaled to obtain the physical operation position information; Automated operation instructions are generated based on physical operation location information, and the target task is executed according to the automated operation instructions.

[0025] Specifically, this application provides a method for segmenting product card instances based on a visual language model. The specific operation process may include the following. First, a complete screenshot of the target page is obtained through the system interface of the terminal device. The screenshot image is then input into a pre-trained visual language model, which employs a multimodal Transformer architecture and achieves semantic understanding of page elements by jointly optimizing the loss functions of the visual feature extraction layer and the semantic understanding layer. The model outputs the position and size information of the target element in the screenshot coordinate system, including the horizontal and vertical coordinates of the element's center point, as well as its width and height values. Next, the screenshot coordinates are linearly scaled according to the ratio between the current device's screen width and height parameters and the pre-stored baseline resolution parameters: the horizontal coordinates are adjusted by the ratio of the screen width to the baseline width, and the vertical coordinates are adjusted by the ratio of the screen height to the baseline height. Finally, based on the scaled physical operation position information, a sequence of touch commands executable by an automated testing tool is generated.
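The linear scaling described here amounts to multiplying horizontal values by screen width over baseline width and vertical values by screen height over baseline height. A minimal sketch follows; the `Element` type, the function name, and the example resolutions are illustrative only.

```python
from typing import NamedTuple


class Element(NamedTuple):
    cx: float      # center x
    cy: float      # center y
    width: float
    height: float


def to_device(element: Element,
              base_w: int, base_h: int,
              screen_w: int, screen_h: int) -> Element:
    """Scale a center/size box from the baseline (screenshot) resolution
    to the current device resolution."""
    return Element(
        cx=element.cx * screen_w / base_w,
        cy=element.cy * screen_h / base_h,
        width=element.width * screen_w / base_w,
        height=element.height * screen_h / base_h,
    )


if __name__ == "__main__":
    card = Element(cx=540, cy=960, width=300, height=400)
    # Baseline 1080x1920, device 720x1280 (example values).
    print(to_device(card, 1080, 1920, 720, 1280))
```

The scaled center point is what a click instruction would target; width and height are scaled with the same factors so the operable area stays consistent.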

[0026] This implementation uses a pre-trained visual language model to perform semantic understanding of the target page screenshot, directly outputting the coordinates and size information of the target elements. This replaces traditional automated tools that rely on code-based positioning, solving the problems of high maintenance costs and poor versatility caused by dynamic changes in page structure. By using linear scaling, the screenshot coordinate system is converted into the device's physical coordinate system, achieving accurate operation adaptation across resolution devices. The final generated automated operation instructions form a "recognition-positioning-operation" closed loop, significantly improving the automation level and operation accuracy of product card instance segmentation, and providing a general infrastructure for scenarios such as nested element recognition and product card segmentation.

[0027] On this basis, Figure 1 is a flowchart illustrating a product card instance segmentation method provided in one embodiment of this application. As shown in Figure 1, this application provides a method for segmenting product card instances, the method comprising: S101. Obtain the target image corresponding to the target page. In this application, the target object may include product card instances, and may also include the main body of the operation position and its internal nested sub-elements, filter bar controls and their internal operable buttons, decorative objects, and other elements.

[0028] S102. Input the target image into a pre-trained visual language model to identify product card instances and output position and size information based on the target image screenshot coordinate system; Taking a scenario where the target object includes the main body of the operation position and its internal nested sub-elements, product card instances, filter bar controls and internal operable buttons as an example, the target object recognition task in a specific application scenario can include target page element recognition, nested element recognition or filter bar element recognition. The target page element identification is used to determine the interface rendering, business logic, and page layout of each target page element in the target page; the nested element identification is used to determine the interface rendering, business logic, and element jump association of each target page element with jump function in the target page; and the filter bar element identification is used to determine the interface rendering, business logic, and expected behavior association of each target page element with filter or sort function in the target page. If a first page element with an interface rendering error is identified, the area percentage of each first page element with rendering error is calculated. When the percentage of the error area exceeds a preset ratio, the operable area of each first page element is optimized, and the coordinate information of a preset number of first page elements is retained according to the priority order of element semantic attributes. If a second page element with abnormal business logic is identified, the code segment associated with the second page element is obtained from the document object model, the abnormal situation of the second page element is further identified through the code language model, and the preset fault handling process is called to process the second page element. 
If a third page element with a behavior prediction task to be performed is identified, the expected behavior flow and expected execution result are generated according to the preset behavior prediction model. In the process of determining the execution effect of the target task based on the page state change image, the execution effect of the target task is determined based on the expected behavior flow and the expected execution result.

[0029] Specifically, for handling UI rendering anomalies, the area percentage of the rendered element with an anomaly can be calculated. When the percentage exceeds a preset ratio, an optimization algorithm is activated to filter operable areas, prioritizing the retention of high-value element coordinates. For handling business logic anomalies, a multimodal verification mechanism can be implemented. This involves extracting the code segment associated with the target element from the document object model, inputting the code into the code analysis model to detect anomaly patterns, and invoking a preset repair procedure based on the output anomaly identifier. For handling behavior prediction anomalies, the type of interaction behavior is predicted based on a visual language model. When the actual effect does not match the prediction, the differences in page features before and after the interaction are extracted to correct the prediction model parameters.
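The rendering anomaly branch can be sketched as follows. Several points are assumptions not fixed by the text: the area percentage is taken relative to the whole page, overlaps between elements are ignored, `priority` is a numeric semantic rank where larger means more important, and only the faulty elements are filtered. The names `PageElement`, `error_area_ratio`, and `retain_by_priority` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageElement:
    name: str
    box: Tuple[int, int, int, int]  # x, y, width, height
    priority: int                   # assumed: larger means more important
    render_error: bool = False


def error_area_ratio(elements: List[PageElement], page_w: int, page_h: int) -> float:
    """Share of the page area covered by elements with a rendering error
    (overlaps between elements are not deduplicated in this sketch)."""
    faulty_area = sum(e.box[2] * e.box[3] for e in elements if e.render_error)
    return faulty_area / (page_w * page_h)


def retain_by_priority(elements: List[PageElement], page_w: int, page_h: int,
                       max_ratio: float = 0.2, keep: int = 2) -> List[PageElement]:
    """If the error area exceeds `max_ratio`, keep only the `keep` faulty
    elements with the highest semantic priority; otherwise keep them all."""
    faulty = [e for e in elements if e.render_error]
    if error_area_ratio(elements, page_w, page_h) <= max_ratio:
        return faulty
    return sorted(faulty, key=lambda e: e.priority, reverse=True)[:keep]


if __name__ == "__main__":
    page = [
        PageElement("A", (0, 0, 50, 20), priority=1, render_error=True),
        PageElement("B", (0, 20, 50, 20), priority=5, render_error=True),
        PageElement("C", (0, 40, 50, 20), priority=3, render_error=True),
        PageElement("D", (0, 60, 50, 20), priority=9),
    ]
    print([e.name for e in retain_by_priority(page, 100, 100)])
```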

[0030] For UI rendering anomalies, visual conflict areas are identified through area quantization, and operation targets are optimized by combining semantic priority to avoid accidental touches. For business logic anomalies, code language models are called to analyze underlying logic defects and locate the root cause of the anomaly. For behavior prediction tasks, interaction results are predicted and verification strategies are customized to form a full-link processing mechanism for abnormal scenarios, which greatly improves the operational robustness of complex pages.

[0031] As an optional implementation, the step of inputting the target image into a pre-trained visual language model to identify product card instances and output position and size information based on the target image screenshot coordinate system includes: The target image is input into a pre-trained visual language model to analyze the visual repetition pattern features of each product card instance and extract layout structure similarity parameters. The calculation of the layout structure similarity parameter includes: Extract the edge orientation histogram features of each of the product card instances; Calculate the feature cosine similarity between adjacent instances of the product card; When the feature cosine similarity of a preset number of consecutive product card instances exceeds a repetition threshold, the relevant product card instances are determined to be in a repetition pattern.

[0032] By extracting the edge orientation histogram features of product card instances and calculating the feature cosine similarity between adjacent cards, a repetition pattern is identified when the similarity of multiple consecutive cards exceeds a threshold. This process, based on the local consistency features of visual structure, accurately captures the repetition patterns of the product list, avoiding missegmentation caused by minor layout differences, thereby improving the robustness and segmentation accuracy of product card instances in dynamic layouts such as waterfall layouts.
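A minimal sketch of this repetition check is given below. It assumes each card crop is a small grayscale image stored as nested lists, builds the edge orientation histogram from central-difference gradients weighted by gradient magnitude, and reads "a preset number of consecutive product card instances" as a run of at least `min_cards` cards in which every adjacent pair exceeds the threshold. The bin count, the thresholds, and the function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Image = Sequence[Sequence[float]]


def edge_orientation_histogram(img: Image, bins: int = 8) -> List[float]:
    """Histogram of unsigned gradient orientations in [0, pi), weighted by
    gradient magnitude. Border pixels are skipped."""
    hist = [0.0] * bins
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0.0:
                continue
            angle = math.atan2(gy, gx) % math.pi
            hist[min(int(angle / math.pi * bins), bins - 1)] += mag
    return hist


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(sum(v * v for v in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(p * q for p, q in zip(a, b)) / (norm_a * norm_b)


def repeating_runs(features: List[List[float]], threshold: float = 0.9,
                   min_cards: int = 3) -> List[Tuple[int, int]]:
    """Inclusive index ranges of consecutive cards whose adjacent-pair
    similarity stays above `threshold` for at least `min_cards` cards."""
    runs: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(features) + 1):
        broken = (i == len(features)
                  or cosine_similarity(features[i - 1], features[i]) <= threshold)
        if broken:
            if i - start >= min_cards:
                runs.append((start, i - 1))
            start = i
    return runs


if __name__ == "__main__":
    x_ramp = [[10.0 * x for x in range(5)] for _ in range(5)]  # vertical edges
    y_ramp = [[10.0 * y for _ in range(5)] for y in range(5)]  # horizontal edges
    feats = [edge_orientation_histogram(c) for c in (x_ramp, x_ramp, x_ramp, y_ramp)]
    print(repeating_runs(feats))
```

In the example, the first three cards share the same edge structure and form one run, while the fourth card has a different dominant orientation and breaks the pattern.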

[0033] When the layout structure similarity parameter of each of the product card instances is greater than the first threshold, the corresponding product card instances are marked as the same product list structure type. Select the corresponding instance segmentation strategy according to the product list structure type, and output the position information and size information of each product card instance based on the target image screenshot coordinate system in sequence; The instance segmentation strategy includes: using a sliding window dynamic scanning mechanism for the waterfall layout to extract each product card instance by dividing it into blocks in the vertical direction; and using a row and column position encoding mechanism for the grid layout to identify card boundaries by fixed grid intervals and divide each product card instance.

[0034] This implementation analyzes the visual repetition pattern features of product card instances and extracts layout structure similarity parameters. Cards with similar layouts are categorized into the same product list structure type, and then an appropriate instance segmentation strategy is selected based on the type. For waterfall layouts, a sliding window dynamic scanning mechanism is used to extract cards in blocks along the vertical direction; for grid layouts, a row and column position encoding mechanism is used to identify boundaries at fixed grid intervals. This design adaptively distinguishes different layout patterns through visual semantic understanding, achieving generalized processing of diverse product lists without requiring manual rule customization, significantly improving the generalization ability and efficiency of instance segmentation.
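The two strategies can be illustrated with a simplified sketch. For the grid case it assumes the grid geometry (origin, cell size, gaps, row and column counts) is already known and encodes each card by its (row, column) index. For the waterfall case it assumes a per-row flag marking background rows within one column and slides a fixed-height window downward to find card boundaries; a truly dynamic window size is not modeled. All names and parameters are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height


def grid_cards(origin: Tuple[int, int], cell: Tuple[int, int],
               gap: Tuple[int, int], rows: int, cols: int) -> Dict[Tuple[int, int], Box]:
    """Row/column position encoding: map (row, col) to a card box laid out
    at fixed grid intervals."""
    ox, oy = origin
    cw, ch = cell
    gx, gy = gap
    return {
        (r, c): (ox + c * (cw + gx), oy + r * (ch + gy), cw, ch)
        for r in range(rows) for c in range(cols)
    }


def waterfall_blocks(row_is_gap: Sequence[bool], window: int = 2) -> List[Tuple[int, int]]:
    """Sliding-window scan down one column. A card ends where `window`
    consecutive rows are all background. Returns inclusive (top, bottom) rows."""
    blocks: List[Tuple[int, int]] = []
    n = len(row_is_gap)
    start = None  # top row of the card currently being scanned
    y = 0
    while y < n:
        if start is None:
            if not row_is_gap[y]:
                start = y
            y += 1
        elif y + window <= n and all(row_is_gap[y:y + window]):
            blocks.append((start, y - 1))
            start = None
            y += window
        else:
            y += 1
    if start is not None:
        end = n - 1
        while row_is_gap[end]:  # trim trailing background rows
            end -= 1
        blocks.append((start, end))
    return blocks


if __name__ == "__main__":
    print(grid_cards((10, 20), (100, 150), (10, 10), rows=2, cols=2)[(1, 1)])
    profile = [False, False, False, True, True, False, False, True, False, False]
    print(waterfall_blocks(profile))
```

A single background row inside a card (index 7 in the example profile) is shorter than the window and therefore does not split the card, which is one way to tolerate thin separators inside card content.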

[0035] S103. Generate automated operation instructions based on the location information and the size information, and execute the click operation, information acquisition operation and analysis operation of the product card instance according to the automated operation instructions.

[0036] As an optional implementation, generating automated operation instructions based on the physical operation location information includes: for the target page elements, determining the interactive behavior instructions of the target automation tool from the physical operation location information, and executing the information acquisition or analysis process of the target task according to those instructions. Furthermore, after an interactive behavior instruction is executed, a page state change image is acquired in real time, and the execution effect of the target task is determined from that image.

[0037] In the process of generating automated operation instructions, the target task type is first selected based on business needs. The target page element recognition task detects the element rendering state and layout logic; the nested element recognition task constructs hierarchical coordinates for sub-elements within the operational area; and the filter bar recognition task combines behavior prediction to generate interactive intent identifiers, enabling subsequent prediction and operation. Then, the instructions are converted, mapping the physical coordinates to an automated tool instruction set. For example, for a click operation, a touch event sequence containing coordinate position, press duration, and release duration is generated; for a swipe operation, trajectory data of the start and end points is generated. Additionally, a state verification process is included, capturing images of page state changes after instruction execution. For example, a feature comparison engine can be used to extract structured features from the image and calculate similarity with a pre-stored template. When the similarity is below a set threshold, an anomaly marking process is triggered.
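
The instruction conversion and state verification steps above might be sketched as follows. The `TouchEvent` structure, the function names, and the default durations are hypothetical and chosen only for illustration; they do not correspond to the instruction set of any particular automation tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class TouchEvent:
    action: str          # "press", "move" or "release"
    x: int
    y: int
    duration_ms: int = 0


def click_events(x: int, y: int, press_ms: int = 80, release_ms: int = 20) -> List[TouchEvent]:
    """Click: coordinate position plus press duration and release duration."""
    return [TouchEvent("press", x, y, press_ms), TouchEvent("release", x, y, release_ms)]


def swipe_events(start: Tuple[int, int], end: Tuple[int, int],
                 duration_ms: int = 300) -> List[TouchEvent]:
    """Swipe: trajectory data given by a start point and an end point."""
    return [
        TouchEvent("press", *start),
        TouchEvent("move", *end, duration_ms),
        TouchEvent("release", *end),
    ]


def needs_anomaly_mark(similarity: float, threshold: float) -> bool:
    """State verification: flag the step when the page image is too unlike the template."""
    return similarity < threshold
```

Here `similarity` is assumed to come from whatever feature comparison is applied to the page state change image; the sketch shows only the threshold decision.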

[0038] Physical coordinates are transformed into executable instructions for automated tools, driving them to perform precise operations; real-time verification of operation effects is achieved by capturing and comparing page state images after the operation; and operation processes are customized according to different task types to form a complete closed loop of operation-verification-feedback, significantly improving the accuracy of automated testing and the reliability of functional verification.

[0039] Specifically, generating automated operation instructions based on the position information and the size information may include: obtaining the vertex coordinates of the outline polygon of each product card instance based on the screenshot coordinate system; linearly scaling the vertex coordinates of the outline polygon according to the current device display parameters to obtain the physical operation position information; and generating automated operation instructions based on the physical operation position information. Generating the vertex coordinates of the outline polygon includes: detecting the edge feature points of each product card instance; connecting the edge feature points with a convex hull algorithm to form a closed polygon; and recording the horizontal and vertical coordinate sequences of the vertices of the closed polygon in the screenshot coordinate system.
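
As an illustration of the convex hull step, the following sketch uses Andrew's monotone chain algorithm. The application does not name a specific convex hull algorithm, so this choice is an assumption, and the edge feature points are assumed to be available already as integer pixel coordinates.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (x, y) in the screenshot coordinate system


def _cross(o: Point, a: Point, b: Point) -> int:
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Return the hull vertices of a point set (collinear points are dropped)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: List[Point] = []
    for p in pts:
        while len(lower) >= 2 and _cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and _cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


def coordinate_sequences(polygon: List[Point]) -> Tuple[List[int], List[int]]:
    """Split polygon vertices into horizontal and vertical coordinate sequences."""
    return [p[0] for p in polygon], [p[1] for p in polygon]
```

For a rectangular card with extra feature points inside it or along its edges, only the four corners remain as vertices of the closed polygon.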

[0040] The linear scaling process includes: obtaining the screen width and screen height parameters of the current device and calculating the screen aspect ratio; reading the pre-stored reference width and reference height parameters and calculating the reference aspect ratio; adjusting the horizontal coordinates by the ratio of the screen width parameter to the reference width parameter; and adjusting the vertical coordinates by the ratio of the screen height parameter to the reference height parameter. Furthermore, when the difference between the screen aspect ratio and the reference aspect ratio exceeds a preset threshold, an edge pixel filling strategy is adopted to bring the image aspect ratio in line with the reference aspect ratio.

[0041] Specifically, this application can perform the following operations during linear scaling: Obtain the actual screen width and height parameters of the terminal device, and calculate the device aspect ratio. Simultaneously, read the pre-stored reference width and height parameters, and calculate the reference aspect ratio. When the difference between the device aspect ratio and the reference aspect ratio exceeds a preset threshold, an edge pixel filling strategy is triggered: edge bands are added to both sides of the screenshot image to ensure the processed image aspect ratio matches the reference ratio. Subsequently, coordinate mapping is performed: the horizontal coordinate is scaled according to the ratio of the device width to the reference width, and the vertical coordinate is scaled according to the ratio of the device height to the reference height. This process can be implemented using a dedicated coordinate transformation engine with a built-in aspect ratio verification function, automatically calling the edge filling module to maintain the original shape of the elements.
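
A minimal sketch of the aspect ratio check and the coordinate mapping follows. The function names, the default threshold of 0.05, and the decision to return the band size as a number of pixels per side are assumptions for illustration. The text does not specify how the added bands interact with the subsequent mapping, so the two steps are shown as separate functions.

```python
from typing import Tuple


def edge_padding(width: int, height: int, ref_w: int, ref_h: int,
                 threshold: float = 0.05) -> Tuple[int, int]:
    """Return (pad_x, pad_y), the band added on each side so that the padded
    image matches the reference aspect ratio; (0, 0) when within the threshold."""
    aspect = width / height
    ref_aspect = ref_w / ref_h
    if abs(aspect - ref_aspect) <= threshold:
        return (0, 0)
    if aspect < ref_aspect:
        # Image is too narrow: add bands on the left and right.
        target_w = height * ref_aspect
        return (round((target_w - width) / 2), 0)
    # Image is too wide: add bands on the top and bottom.
    target_h = width / ref_aspect
    return (0, round((target_h - height) / 2))


def scale_point(x: float, y: float, device_w: int, device_h: int,
                ref_w: int, ref_h: int) -> Tuple[float, float]:
    """Scale x by device width / reference width and y by device height / reference height."""
    return (x * device_w / ref_w, y * device_h / ref_h)
```

For example, with a 1080 x 1920 reference, a point at (540, 960) maps to (720, 1280) on a 1440 x 2560 device, while a 1080 x 2400 screenshot would receive 135-pixel bands on its left and right sides.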

[0042] It should be noted that the scaling ratio is calculated from the reference width and reference height parameters, and its value therefore depends on the actual application scenario. The implementation described above is only one feasible example. The same calculation can also be applied to local mapping, and those skilled in the art can work out the relevant implementation details.

[0043] By dynamically calculating the aspect ratio between the device and the reference resolution, a linear mapping of coordinates is achieved, ensuring the accuracy of the operation position on devices with different resolutions. For scenarios where the device aspect ratio is distorted, an edge filling strategy is adopted to force the original aspect ratio of elements to be maintained, avoiding click position offset caused by screen deformation, thereby improving the robustness of cross-device adaptation and operational reliability.

[0044] This implementation detects edge feature points of product card instances, uses a convex hull algorithm to generate vertex coordinates of a closed polygon, and records its horizontal and vertical coordinate sequences based on the screenshot coordinate system. Subsequently, the vertex coordinates are linearly scaled according to device display parameters to obtain physical operation position information, ultimately generating automated operation instructions. This design accurately reconstructs the card outline through geometric modeling and dynamically adapts coordinates based on device parameters, ensuring the positioning accuracy of automated operation instructions on devices with different resolutions, effectively supporting stable execution in cross-platform scenarios.

[0045] This application acquires the target image of the target page and inputs it into a pre-trained visual language model for product card instance recognition. It outputs position and size information based on the screenshot coordinate system, providing a precise spatial positioning foundation for subsequent automated operations. Based on this position and size information, automated operation instructions are generated, directly driving automated tools to perform operations such as clicking and information retrieval, achieving a closed loop from visual recognition to physical operation. Therefore, this solution, through the instance-level recognition capability of the visual language model, solves the problem of traditional methods' inability to accurately locate individual product cards. Simultaneously, by leveraging the linkage between coordinate information and automated instructions, it significantly improves the accuracy and execution efficiency of product card operations, providing reliable technical support for large-scale product verification tasks.

[0046] As an optional implementation, the training process of the visual language model includes: obtaining e-commerce training images to construct a training dataset; for each e-commerce training image, labeling the coordinate range and semantic identifiers of three types of elements, namely the main body of the operation position and its nested sub-elements, product card instances, and filter bar controls and their internal operable buttons; and adopting a multimodal converter architecture to jointly optimize the loss function of the visual feature extraction layer and the semantic understanding layer. Additionally, random noise and simulated display parameter fluctuations are added during training to augment the images.

[0047] Training the visual language model begins with data construction: historical screenshots of e-commerce pages are collected and labeled. Depending on the actual application scenario, the labels can cover one or more of three types of elements: operational areas (including sub-element boundaries), product cards (instance segmentation masks), and filter controls (including behavioral labels). Other types of elements, such as decorative elements, can also be included. A visual encoder then extracts local features, a semantic decoder generates element description text, and a cross-modal fusion layer aligns the visual and semantic vector spaces. An anti-interference mechanism is also applied during training: the images are augmented by adding random noise, by simulating resolution fluctuations, or by image processing methods such as random cropping. A hard sample optimization strategy is employed to improve boundary recognition.
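
The image augmentation mentioned above could look roughly like the following sketch. It assumes grayscale images stored as nested lists of integers in the range 0 to 255, Gaussian noise as the form of random noise, and a nearest-neighbour downsample and upsample as the simulation of resolution fluctuation. None of these choices are specified in this application.

```python
import random
from typing import List

Image = List[List[int]]  # grayscale pixel rows, values in 0..255


def add_noise(img: Image, sigma: float, rng: random.Random) -> Image:
    """Add zero-mean Gaussian noise to every pixel and clamp to 0..255."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in img
    ]


def simulate_resolution_fluctuation(img: Image, factor: int) -> Image:
    """Mimic a lower-resolution display: keep one pixel per factor x factor
    block and repeat it, so the output has the same size as the input."""
    height, width = len(img), len(img[0])
    return [
        [img[(y // factor) * factor][(x // factor) * factor] for x in range(width)]
        for y in range(height)
    ]
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible between training runs.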

[0048] In this implementation, the model is trained using an e-commerce-specific dataset, enabling the visual language model to accurately understand the visual semantic features of elements such as promotional operation slots, product cards, and filter bars. A multimodal architecture is used to jointly optimize visual and semantic representation capabilities, improving the model's accuracy in recognizing element boundaries and types. Noise injection and parameter fluctuation simulation enhance the model's anti-interference capabilities, ensuring stable recognition performance in dynamically rendered pages.

[0049] In the application scenario of this application, the training process of the visual language model includes: Obtain e-commerce training images to construct a training dataset; For each of the e-commerce training images, the coordinate range and semantic identifier of the associated elements of the product display unit are labeled; A multimodal converter architecture is adopted to jointly optimize the loss function of the visual feature extraction layer and the semantic understanding layer; Additionally, random noise and simulated display parameter fluctuations are added during training to enhance the image.

[0050] This implementation constructs a high-quality training dataset by labeling the coordinate range and semantic identifiers of product display units in e-commerce training images. A multimodal converter architecture is employed to jointly optimize the loss functions of the visual feature extraction layer and the semantic understanding layer, enhancing the model's understanding of the visual and semantic associations of product cards. Random noise and image enhancement strategies simulating display parameter fluctuations are introduced during training to improve the model's robustness to real-world interference. As a result, the model can more accurately identify product card instances with complex styles, providing strong generalization capabilities for segmentation tasks.

[0051] The training dataset of the visual language model can also be optimized iteratively. For example, before linearly scaling the position and size information according to the current device display parameters to obtain the physical operation position information, the method further includes: verifying the confidence level of the output of the visual language model; when the confidence score is less than the preset threshold, triggering the screenshot re-capture process to re-acquire the target image and re-execute the recognition operation; and recording the page context features of elements whose confidence scores are less than the preset threshold in order to optimize the training dataset.

[0052] Furthermore, a confidence check is performed before coordinate transformation. First, the confidence score output by the visual language model is assessed. When an element's score is below a set threshold, it is considered a low-confidence recognition. A local resampling process is then initiated for the low-confidence elements. For example, an extended region image can be cropped based on the predicted coordinates and input into a lightweight model for secondary recognition. A data optimization mechanism can also be designed to record the page context features of low-confidence elements (including the distribution of surrounding elements and color contrast) to generate a hard sample dataset for model iteration.
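
The confidence check and the extended region crop can be sketched as follows. The `Detection` structure (top-left corner, size, and score), the function names, and the fixed pixel margin are hypothetical; the secondary recognition model itself is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Detection:
    x: int             # left edge in screenshot pixels
    y: int             # top edge in screenshot pixels
    w: int
    h: int
    confidence: float


def split_by_confidence(dets: List[Detection],
                        threshold: float) -> Tuple[List[Detection], List[Detection]]:
    """Separate detections that pass the check from low-confidence ones."""
    accepted = [d for d in dets if d.confidence >= threshold]
    low = [d for d in dets if d.confidence < threshold]
    return accepted, low


def extended_region(det: Detection, margin: int,
                    img_w: int, img_h: int) -> Tuple[int, int, int, int]:
    """Grow the predicted box by a margin, clipped to the image, as (x, y, w, h)."""
    left = max(0, det.x - margin)
    top = max(0, det.y - margin)
    right = min(img_w, det.x + det.w + margin)
    bottom = min(img_h, det.y + det.h + margin)
    return (left, top, right - left, bottom - top)
```

The low-confidence list serves two purposes in the described flow: each entry is cropped and recognized again, and each entry is recorded together with its page context as a hard sample.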

[0053] The confidence verification mechanism filters out low-reliability recognition results to prevent incorrect coordinates or element recognition results from being transmitted to the operation stage; the re-recognition process is triggered for low-confidence scenarios to improve the stability of the output results; the features of pages that fail to be recognized are recorded for model iteration and optimization, forming a data closed loop and continuously enhancing the generalization ability and scene adaptability of the visual language model.

[0054] Accordingly, corresponding to the application scenario of this application, before obtaining the vertex coordinates of the outline polygon of each product card instance based on the screenshot coordinate system, the method further includes: The confidence level of the output of the visual language model is verified. When the confidence score is less than the preset threshold, the relevant area of each product card instance is locally magnified, the target image is re-acquired, and the recognition operation is re-executed. In addition, the page context features of each product card instance whose confidence value is less than a preset threshold are recorded to optimize the training dataset.

[0055] This implementation verifies the confidence level of the visual language model's recognition results before outputting the vertex coordinates of the outline polygon. If the confidence level is below a threshold, the product card region is magnified and re-recognized, while the page context features of the low-confidence cards are recorded to optimize the training dataset. This mechanism, through dynamic verification and feedback iteration, promptly corrects recognition biases, continuously improves the model's segmentation accuracy in edge cases, and ensures the reliability of automated operations.

[0056] Based on a practical application scenario, the method provided in this application can accurately identify the product list on an e-commerce page, segment each product card instance, and output precise physical coordinates for automated clicking. It is implemented through a corresponding system architecture, which can include an image acquisition module, a Vision-LLM recognition module, a coordinate adaptation module, and an instruction output module.

[0057] Phase 1: Vision-LLM recognition and screenshot coordinate output. The Vision-LLM recognition module receives screenshots of the NOVA page. By understanding the visual repetition patterns and semantic objects of Vipshop's product cards, it recognizes the product card set and performs instance segmentation of the individual product cards.

[0058] The module outputs the center coordinates and the width and height of each product card, expressed in pixels of the original screenshot.

[0059] Phase 2: Screen coordinate system restoration and adaptation. The coordinate adaptation module obtains the actual resolution of the current device and linearly scales the screenshot coordinates output by Vision-LLM, obtaining physical coordinates and width and height information that can be used directly on the current device.

[0060] Phase 3: Output of automated operation instructions. The instruction output module uses the restored physical coordinates to generate operation instructions that drive automation tools such as Appium to accurately click the corresponding product cards, and provides the context information of the operation area for subsequent operations.
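
The data flow through the three phases can be summarized in a short sketch. The function names and the dictionary format of the tap instruction are illustrative assumptions; the sketch deliberately stops before any call into a real automation tool such as Appium, whose API is not reproduced here.

```python
from typing import Dict, Tuple

CardBox = Tuple[float, float, float, float]  # (center_x, center_y, width, height)


def restore_to_device(card: CardBox, shot_size: Tuple[int, int],
                      device_size: Tuple[int, int]) -> CardBox:
    """Phase 2: scale a card from screenshot pixels to device coordinates."""
    sx = device_size[0] / shot_size[0]
    sy = device_size[1] / shot_size[1]
    cx, cy, w, h = card
    return (cx * sx, cy * sy, w * sx, h * sy)


def center_to_corner(card: CardBox) -> CardBox:
    """Convert (center_x, center_y, w, h) to (left, top, w, h) as operation area context."""
    cx, cy, w, h = card
    return (cx - w / 2, cy - h / 2, w, h)


def tap_instruction(card: CardBox) -> Dict[str, object]:
    """Phase 3: describe a tap at the card center, with the card area as context."""
    cx, cy, _, _ = card
    return {"action": "tap", "x": round(cx), "y": round(cy),
            "area": center_to_corner(card)}
```

A Phase 1 output of `(270, 480, 200, 300)` on a 540 x 960 screenshot would, under these assumptions, become a tap at (540, 960) on a 1080 x 1920 device.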

[0061] This achieves high-precision instance-level segmentation: individual product cards within a product set are separated and positioned precisely, which improves the granularity and accuracy of inspection. The method is highly versatile and requires no code maintenance. Through Vision-LLM's understanding of visual semantics, it recognizes different product display layouts universally, eliminating the need for customized code for specific lists. It also works across devices and resolutions: the coordinate restoration and adaptation mechanism handles the conversion between screenshot pixel coordinates and device physical coordinates, improving the tool's operational accuracy on different devices. Finally, it relieves operations personnel of manual review: by providing the automation tool with the precise center position of each product card, it replaces their repetitive review of massive numbers of product cards.

[0062] Secondly, this application provides a product card instance segmentation device, the device comprising: The acquisition module is used to acquire the target image corresponding to the target page; The processing module is used to input the target image into a pre-trained visual language model, identify product card instances, and output position and size information based on the target image screenshot coordinate system. The processing module is further configured to generate automated operation instructions based on the location information and the size information, and execute click operations, information acquisition operations and analysis operations of the product card instance according to the automated operation instructions.

[0063] This implementation acquires the target image of the target page and inputs it into a pre-trained visual language model for product card instance recognition. It outputs position and size information based on the screenshot coordinate system, providing a precise spatial positioning foundation for subsequent automated operations. Based on this position and size information, automated operation instructions are generated, directly driving automated tools to perform operations such as clicking and information retrieval, achieving a closed loop from visual recognition to physical operation. Therefore, this solution, through the instance-level recognition capability of the visual language model, solves the problem of traditional methods' inability to accurately locate individual product cards. Simultaneously, by utilizing the linkage between coordinate information and automated instructions, it significantly improves the accuracy and execution efficiency of product card operations, providing reliable technical support for large-scale product verification tasks.

[0064] As an optional implementation, the processing module inputs the target image into a pre-trained visual language model, identifies product card instances, and outputs position and size information based on the target image screenshot coordinate system in a specific manner, including: The target image is input into a pre-trained visual language model to analyze the visual repetition pattern features of each product card instance and extract layout structure similarity parameters. When the layout structure similarity parameter of each of the product card instances is greater than the first threshold, the corresponding product card instances are marked as the same product list structure type. Select the corresponding instance segmentation strategy according to the product list structure type, and output the position information and size information of each product card instance based on the target image screenshot coordinate system in sequence; The instance segmentation strategy includes: using a sliding window dynamic scanning mechanism for the waterfall layout to extract each product card instance by dividing it into blocks in the vertical direction; and using a row and column position encoding mechanism for the grid layout to identify card boundaries by fixed grid intervals and divide each product card instance.

[0065] This implementation analyzes the visual repetition pattern features of product card instances and extracts layout structure similarity parameters. Cards with similar layouts are categorized into the same product list structure type, and then an appropriate instance segmentation strategy is selected based on the type. For waterfall layouts, a sliding window dynamic scanning mechanism is used to extract cards in blocks along the vertical direction; for grid layouts, a row and column position encoding mechanism is used to identify boundaries at fixed grid intervals. This design adaptively distinguishes different layout patterns through visual semantic understanding, achieving generalized processing of diverse product lists without requiring manual rule customization, significantly improving the generalization ability and efficiency of instance segmentation.

[0066] As an optional implementation, the specific method by which the processing module calculates the layout structure similarity parameters includes: Extract the edge orientation histogram features of each of the product card instances; Calculate the feature cosine similarity between adjacent instances of the product card; When the feature cosine similarity of a preset number of consecutive product card instances exceeds a repetition threshold, the relevant product card instances are determined to be in a repetition pattern.

[0067] This implementation extracts the edge orientation histogram features of product card instances and calculates the feature cosine similarity between adjacent cards. When the similarity of multiple consecutive cards exceeds a threshold, it is determined to be a repetition pattern. This process, based on the local consistency features of visual structure, accurately captures the repetition patterns of the product list, avoiding missegmentation caused by minor layout differences, thereby improving the robustness and segmentation accuracy of product card instances in dynamic layouts such as waterfall layouts.
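
The similarity calculation above can be sketched as follows. Several details are assumptions rather than statements of this application: card crops are taken to be grayscale nested lists, gradients are computed by central differences, orientations are unsigned and weighted by gradient magnitude, and the "preset number of consecutive instances" is read as a run of adjacent pairs whose similarity exceeds the repetition threshold.

```python
import math
from typing import List, Sequence


def edge_orientation_histogram(img: Sequence[Sequence[int]], bins: int = 8) -> List[float]:
    """Histogram of gradient orientations in [0, pi), weighted by gradient magnitude."""
    hist = [0.0] * bins
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi  # fold opposite directions together
            index = min(bins - 1, int(angle / math.pi * bins))
            hist[index] += magnitude
    return hist


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors; 0.0 if either is all zeros."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_repetition_pattern(hists: Sequence[Sequence[float]],
                          repetition_threshold: float, min_consecutive: int) -> bool:
    """True when min_consecutive adjacent pairs in a row exceed the threshold."""
    run = 0
    for prev, cur in zip(hists, hists[1:]):
        if cosine_similarity(prev, cur) > repetition_threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 0
    return False
```

Comparing only adjacent cards keeps the check local, which fits the description of tolerating minor layout differences elsewhere in the list.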

[0068] As an optional implementation, the specific method by which the processing module generates automated operation instructions based on the position information and the size information includes: Obtain the vertex coordinates of the outline polygon of each product card instance based on the screenshot coordinate system; Based on the current device display parameters, linear scaling is performed on the vertex coordinates of the outline polygon to obtain the physical operation position information; Automated operation instructions are generated based on the physical operation location information; The generation of the vertex coordinates of the contour polygon includes: Detect the edge feature points of each of the product card instances; The edge feature points are connected by the convex hull algorithm to form a closed polygon, and the horizontal and vertical coordinate sequences of the vertices of the closed polygon in the screenshot coordinate system are recorded.

[0069] This implementation detects edge feature points of product card instances, uses a convex hull algorithm to generate vertex coordinates of a closed polygon, and records its horizontal and vertical coordinate sequences based on the screenshot coordinate system. Subsequently, the vertex coordinates are linearly scaled according to device display parameters to obtain physical operation position information, ultimately generating automated operation instructions. This design accurately reconstructs the card outline through geometric modeling and dynamically adapts coordinates based on device parameters, ensuring the positioning accuracy of automated operation instructions on devices with different resolutions, effectively supporting stable execution in cross-platform scenarios.

[0070] As an optional implementation, the training process of the visual language model by the processing module specifically includes: Obtain e-commerce training images to construct a training dataset; For each of the e-commerce training images, the coordinate range and semantic identifier of the associated elements of the product display unit are labeled; A multimodal converter architecture is adopted to jointly optimize the loss function of the visual feature extraction layer and the semantic understanding layer; Additionally, random noise and simulated display parameter fluctuations are added during training to enhance the image.

[0071] This implementation constructs a high-quality training dataset by labeling the coordinate range and semantic identifiers of product display units in e-commerce training images. A multimodal converter architecture is employed to jointly optimize the loss functions of the visual feature extraction layer and the semantic understanding layer, enhancing the model's understanding of the visual and semantic associations of product cards. Random noise and image enhancement strategies simulating display parameter fluctuations are introduced during training to improve the model's robustness to real-world interference. As a result, the model can more accurately identify product card instances with complex styles, providing strong generalization capabilities for segmentation tasks.

[0072] As an optional implementation, the processing module is further configured to perform the following before obtaining the vertex coordinates of the outline polygon of each product card instance based on the screenshot coordinate system: The confidence level of the output of the visual language model is verified. When the confidence score is less than the preset threshold, the relevant area of each product card instance is locally magnified, the target image is re-acquired, and the recognition operation is re-executed. In addition, the page context features of each product card instance whose confidence value is less than a preset threshold are recorded to optimize the training dataset.

[0073] This implementation verifies the confidence level of the visual language model's recognition results before outputting the vertex coordinates of the outline polygon. If the confidence level is below a threshold, the product card region is magnified and re-recognized, while the page context features of the low-confidence cards are recorded to optimize the training dataset. This mechanism, through dynamic verification and feedback iteration, promptly corrects recognition biases, continuously improves the model's segmentation accuracy in edge cases, and ensures the reliability of automated operations.

[0074] It should be noted that the division of the various modules in the above device is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these modules can be implemented entirely in software via processing element calls; they can be fully implemented in hardware; or some modules can be implemented by processing element calls to software, while others are implemented in hardware. For example, a processing module can be a separate processing element, or it can be integrated into a chip within the device. Alternatively, it can be stored as program code in the device's memory, and its functions can be called and executed by a processing element. The implementation of other modules is similar. Moreover, these modules can be fully or partially integrated together, or they can be implemented independently. The processing element here can be an integrated circuit with signal processing capabilities. During implementation, each step of the above method or each of the above modules can be completed through integrated logic circuits in the hardware of the processor element or through software instructions.

[0075] By way of illustration, Figure 2 is a schematic diagram of the internal structure of a computer device 300 provided in an embodiment of this application. The computer device 300 can be provided as a server. Referring to Figure 2, the computer device 300 includes a processing component 302, which further includes one or more processors, and memory resources represented by memory 301 for storing instructions, such as application programs, that can be executed by the processing component 302. The application programs stored in memory 301 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 302 is configured to execute instructions to perform the methods of any of the embodiments described above.

[0076] The computer device 300 may also include a power supply component 303 configured to perform power management of the computer device 300, a wired or wireless network interface 304 configured to connect the computer device 300 to a network, and an input/output (I/O) interface 305. The computer device 300 may operate on an operating system stored in memory 301, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.

[0077] Those skilled in the art will understand that the structure shown in Figure 2 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0078] This application provides a storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the method provided in any embodiment.

[0079] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0080] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments. The various embodiments can be combined as needed, and the same or similar parts can be referred to each other.

[0081] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method for segmenting product card instances, characterized in that it comprises the following steps: acquiring a target image corresponding to a target page; inputting the target image into a pre-trained visual language model to identify product card instances and output position information and size information based on the screenshot coordinate system of the target image; and generating an automated operation instruction based on the position information and the size information, and executing a click operation, an information acquisition operation, and an analysis operation on the product card instances according to the automated operation instruction.

2. The method according to claim 1, characterized in that the step of inputting the target image into a pre-trained visual language model to identify product card instances and output position information and size information based on the screenshot coordinate system of the target image comprises: inputting the target image into the pre-trained visual language model to analyze the visual repetition pattern features of each product card instance and extract layout structure similarity parameters; when the layout structure similarity parameter of each of the product card instances is greater than a first threshold, marking the corresponding product card instances as belonging to the same product list structure type; and selecting the corresponding instance segmentation strategy according to the product list structure type, and outputting, in sequence, the position information and size information of each product card instance based on the screenshot coordinate system of the target image; wherein the instance segmentation strategy comprises: for a waterfall layout, using a sliding-window dynamic scanning mechanism to extract each product card instance by dividing the layout into blocks in the vertical direction; and for a grid layout, using a row-and-column position encoding mechanism to identify card boundaries at fixed grid intervals and separate each product card instance.
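For illustration only, the two segmentation strategies named in claim 2 might be sketched as below. This is a simplified reading, not the claimed implementation: the function names, the `gap` parameter, and the per-row `row_has_content` profile are all hypothetical, and the plain top-to-bottom scan merely stands in for the sliding-window mechanism, whose details the claim does not specify.

```python
def split_grid(origin, cell_size, gap, rows, cols):
    """Grid layout: derive card boxes from fixed grid intervals.

    Each card is encoded by its (row, col) position; all values are in
    screenshot coordinates. Parameter names are illustrative assumptions.
    """
    x0, y0 = origin
    cell_w, cell_h = cell_size
    gap_x, gap_y = gap
    cards = []
    for r in range(rows):
        for c in range(cols):
            cards.append({
                "row": r,
                "col": c,
                "x": x0 + c * (cell_w + gap_x),
                "y": y0 + r * (cell_h + gap_y),
                "w": cell_w,
                "h": cell_h,
            })
    return cards


def split_waterfall_column(row_has_content, min_height=1):
    """Waterfall layout: scan one column top to bottom.

    row_has_content[y] is True when pixel row y contains card content
    (an assumed, precomputed profile). Each run of content rows becomes
    one block, returned as (top_y, height).
    """
    blocks = []
    start = None
    for y, filled in enumerate(row_has_content):
        if filled and start is None:
            start = y  # a new block begins
        elif not filled and start is not None:
            if y - start >= min_height:
                blocks.append((start, y - start))
            start = None
    # Close a block that runs to the bottom edge of the column.
    if start is not None and len(row_has_content) - start >= min_height:
        blocks.append((start, len(row_has_content) - start))
    return blocks
```

In this sketch a waterfall page would be processed one column at a time, since cards in different columns have independent heights, whereas a grid page needs only the origin, cell size, and spacing.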

3. The method according to claim 2, characterized in that the calculation of the layout structure similarity parameter comprises: extracting the edge orientation histogram features of each of the product card instances; calculating the cosine similarity between the features of adjacent product card instances; and when the feature cosine similarity of a preset number of consecutive product card instances exceeds a repetition threshold, determining that the relevant product card instances form a repetition pattern.
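As a minimal sketch of the calculation described in claim 3, assuming each card is available as a 2D list of grayscale intensities: the bin count, the central-difference gradient, the magnitude weighting, and the default thresholds are illustrative choices that the claim does not fix.

```python
import math


def edge_orientation_histogram(gray, bins=8):
    """Magnitude-weighted histogram of gradient orientations in [0, pi)."""
    hist = [0.0] * bins
    height, width = len(gray), len(gray[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = gray[y][x + 1] - gray[y][x - 1]  # horizontal gradient
            gy = gray[y + 1][x] - gray[y - 1][x]  # vertical gradient
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            angle = math.atan2(gy, gx) % math.pi
            index = min(int(angle / math.pi * bins), bins - 1)
            hist[index] += magnitude
    return hist


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_repetition_pattern(histograms, repetition_threshold=0.9, min_consecutive=3):
    """True when at least min_consecutive consecutive cards are each similar
    to their neighbour above the repetition threshold."""
    run = 1
    for prev, cur in zip(histograms, histograms[1:]):
        if cosine_similarity(prev, cur) > repetition_threshold:
            run += 1
            if run >= min_consecutive:
                return True
        else:
            run = 1
    return False
```

Comparing only adjacent cards keeps the check linear in the number of cards, which matches the claim's wording about consecutive instances.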

4. The method according to claim 2, characterized in that generating the automated operation instruction based on the position information and the size information comprises: obtaining the vertex coordinates of the outline polygon of each product card instance based on the screenshot coordinate system; linearly scaling the vertex coordinates of the outline polygon according to the display parameters of the current device to obtain physical operation position information; and generating the automated operation instruction based on the physical operation position information; wherein generating the vertex coordinates of the outline polygon comprises: detecting the edge feature points of each of the product card instances; and connecting the edge feature points by a convex hull algorithm to form a closed polygon, and recording the sequence of horizontal and vertical coordinates of the vertices of the closed polygon in the screenshot coordinate system.
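The steps of claim 4 could be sketched roughly as follows. The claim names only "the convex hull algorithm"; Andrew's monotone chain is used here as one common choice. The interpretation of "display parameters" as a screenshot size and a device size, the instruction format, and the choice of the vertex average as the click point are assumptions for illustration.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices as a closed
    polygon, without repeating the first vertex."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def scale_to_device(vertices, screenshot_size, device_size):
    """Linearly scale screenshot coordinates to device coordinates."""
    sx = device_size[0] / screenshot_size[0]
    sy = device_size[1] / screenshot_size[1]
    return [(x * sx, y * sy) for x, y in vertices]


def click_instruction(vertices):
    """Hypothetical instruction format: click the vertex average, which
    lies inside the polygon because the polygon is convex."""
    cx = sum(x for x, _ in vertices) / len(vertices)
    cy = sum(y for _, y in vertices) / len(vertices)
    return {"action": "click", "x": round(cx), "y": round(cy)}


# Example: edge feature points of one card in a 200x400 screenshot,
# replayed on a 100x200 device.
edge_points = [(10, 10), (110, 10), (110, 210), (10, 210), (60, 100), (50, 10)]
hull = convex_hull(edge_points)
scaled = scale_to_device(hull, (200, 400), (100, 200))
instruction = click_instruction(scaled)
```

Scaling the polygon before deriving the click point keeps the instruction in device coordinates, so the same hull can be replayed on devices with different resolutions.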

5. The method according to any one of claims 1-4, characterized in that the training process of the visual language model comprises: obtaining e-commerce training images to construct a training dataset; for each of the e-commerce training images, labeling the coordinate range and semantic identifier of the elements associated with each product display unit; adopting a multimodal Transformer architecture to jointly optimize the loss functions of the visual feature extraction layer and the semantic understanding layer; and adding random noise and simulated display parameter fluctuations during training as image augmentation.
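The augmentation step in claim 5 is not specified in detail. A minimal sketch, assuming grayscale images stored as 2D lists, Gaussian pixel noise as the "random noise", and a single random brightness gain as a stand-in for "display parameter fluctuations", might look like this; the function name and default values are hypothetical.

```python
import random


def augment(gray, noise_sigma=5.0, brightness_range=(0.8, 1.2), rng=None):
    """Return a noisy, brightness-shifted copy of a grayscale image.

    gray is a 2D list of intensities in 0..255. One gain is drawn per
    image to imitate a display brightness change; Gaussian noise is then
    added per pixel and the result is clamped back to 0..255.
    """
    rng = rng or random.Random()
    gain = rng.uniform(*brightness_range)
    augmented = []
    for row in gray:
        augmented.append([
            min(255, max(0, round(pixel * gain + rng.gauss(0.0, noise_sigma))))
            for pixel in row
        ])
    return augmented
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which is convenient when comparing training runs.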

6. The method according to claim 5, characterized in that, before obtaining the vertex coordinates of the outline polygon of each of the product card instances based on the screenshot coordinate system, the method further comprises: verifying the confidence of the output of the visual language model; when the confidence is less than a preset threshold, locally magnifying the relevant area of each such product card instance, re-acquiring the target image, and re-executing the recognition operation; and recording the page context features of each product card instance whose confidence is less than the preset threshold in order to optimize the training dataset.
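The verification loop of claim 6 might be sketched as below. The detection record format, the `redetect` callback (assumed to magnify a region, re-acquire the image, re-run recognition, and return a detection in screenshot coordinates), the default threshold, and the retry limit are all hypothetical; the claim itself does not bound the number of retries.

```python
def verify_and_refine(detections, redetect, threshold=0.8, zoom=2.0, max_rounds=2):
    """Accept confident detections and retry the rest on a magnified view.

    detections: list of {"box": (x, y, w, h), "confidence": float}.
    redetect(box, zoom): assumed callback returning a new detection dict.
    Returns (accepted, hard_cases); hard_cases collects every
    low-confidence detection so it can be reviewed for the training set.
    """
    accepted = []
    hard_cases = []
    for detection in detections:
        rounds = 0
        while detection["confidence"] < threshold and rounds < max_rounds:
            hard_cases.append(detection)  # record before retrying
            detection = redetect(detection["box"], zoom)
            rounds += 1
        if detection["confidence"] >= threshold:
            accepted.append(detection)
    return accepted, hard_cases
```

Capping the retries avoids looping forever on a region the model cannot resolve; such regions still end up in `hard_cases` for later labeling.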

7. A product card instance segmentation device, characterized in that the device comprises: an acquisition module configured to acquire a target image corresponding to a target page; and a processing module configured to input the target image into a pre-trained visual language model, identify product card instances, and output position information and size information based on the screenshot coordinate system of the target image; wherein the processing module is further configured to generate an automated operation instruction based on the position information and the size information, and to execute a click operation, an information acquisition operation, and an analysis operation on the product card instances according to the automated operation instruction.

8. The device according to claim 7, characterized in that the processing module inputs the target image into the pre-trained visual language model, identifies product card instances, and outputs position information and size information based on the screenshot coordinate system of the target image specifically by: inputting the target image into the pre-trained visual language model to analyze the visual repetition pattern features of each product card instance and extract layout structure similarity parameters; when the layout structure similarity parameter of each of the product card instances is greater than a first threshold, marking the corresponding product card instances as belonging to the same product list structure type; and selecting the corresponding instance segmentation strategy according to the product list structure type, and outputting, in sequence, the position information and size information of each product card instance based on the screenshot coordinate system of the target image; wherein the instance segmentation strategy comprises: for a waterfall layout, using a sliding-window dynamic scanning mechanism to extract each product card instance by dividing the layout into blocks in the vertical direction; and for a grid layout, using a row-and-column position encoding mechanism to identify card boundaries at fixed grid intervals and separate each product card instance.

9. A computer device, characterized in that the computer device comprises one or more processors and a memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1-6.

10. A storage medium, characterized in that the storage medium stores computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1-6.