Advertisement content sensitive word real-time identification method based on large model technology

By combining a large model with particle swarm optimization, the method reverse-locates and backfills semantic gaps in dynamically generated ads. This addresses the missed detections that occur because existing technologies cannot identify trigger-word emptying, and enables efficient compliance review of dynamic ads.

CN122242501A · Pending · Publication Date: 2026-06-19 · SHANGHAI JUGAO DEYE CULTURE DEVELOPMENT CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SHANGHAI JUGAO DEYE CULTURE DEVELOPMENT CO LTD
Filing Date: 2026-03-23
Publication Date: 2026-06-19


Abstract

This invention discloses a method for real-time identification of sensitive words in advertising content based on large model technology, relating to the field of advertising semantic recognition. The method comprises: acquiring the final displayed text, a set of trigger keywords, and a set of target page semantic items; calculating an empty slot candidate degree for each segment of the final displayed text from its context recovery probability and multi-dimensional vector similarity; aggregating high-candidate positions into empty slot segments and calculating segment strength; using a large model to generate an initial candidate semantic set from the context and set information; establishing a multi-dimensional fitness function after cleaning and expansion; obtaining the optimal semantic vector through particle swarm optimization; extracting the corresponding optimal backfill trigger semantic; constructing backfill text for review and inputting it into the large model's sensitivity classifier; calculating the difference in the probability distribution of sensitive categories before and after backfilling to obtain the empty slot sensitivity gain; and fusing segment strength, sensitivity gain, and the absolute risk probability after backfilling into a final risk score, from which identification labels and handling actions are output.

Description

Technical Field

[0001] This invention relates to the field of advertising semantic recognition technology, specifically to a method for real-time recognition of sensitive words in advertising content based on large model technology.

Background Technology

[0002] In digital marketing and search engine advertising, real-time identification of sensitive words in advertising content based on large language models is widely used to keep advertising content secure and compliant. Traditional advertising risk control and review mechanisms rest on a basic assumption: if an advertisement carries sensitive or illegal risk, its sensitive semantics will leave explicit lexical traces in the final advertisement text shown to the audience. With the evolution of dynamic generation and matching technologies on advertising platforms, this assumption no longer holds in certain advanced advertising scenarios. Modern search advertising makes extensive use of dynamic keyword insertion ads and dynamic search ads, both of which rely heavily on dynamic generation. In these scenarios, the final displayed text is not statically submitted by the advertiser but is the product of a complex dynamic process. On the one hand, in dynamic keyword insertion ads, the text the platform inserts into the ad creative is not the user's original search term but the ad group keyword that triggered the ad display. Because search engines generally use approximate matching that covers synonyms, paraphrases, and implied search intent, and because a fallback mechanism reverts to a preset default word when the trigger word is too long or misspelled, the actual user intent that triggers the ad is often more specific and more sensitive than the keywords in the final inserted text. On the other hand, in dynamic search ads, the title and landing page content are generated entirely dynamically from the website structure. When generating the final title, the platform heavily edits the extracted webpage text to align it with the user's broad search terms and the platform's ad integrity policy. In summary, the final displayed text of dynamic search ads has passed through a complex semantic compression process involving trigger word selection, approximate matching expansion, default word reversion, webpage title extraction, and editing.

[0003] This complex dynamic generation process gives rise to a highly insidious phenomenon: the original sensitive trigger semantics that truly lead an ad to enter the display chain are no longer retained as complete words in the final ad text. Instead, they are replaced by shorter, more generalized, safer, or more editorially compliant surface words, resulting in trigger word emptying. Based on this, existing ad review methods based on large language models reveal serious applicability flaws. Existing technologies mechanically treat the final displayed surface text as the sole semantic source and review object. When key high-risk sensitive trigger words have been replaced or removed at the data level by the upstream dynamic generation mechanism, the input text received by the large language model only contains cleaned, safe surface words, leading to a false compliance judgment. Therefore, the urgent technical problem to be solved in this field is that in dynamically generated advertising scenarios, the upstream trigger words that truly carry the risk of violations are lost in the final advertising text. This makes it impossible for existing real-time sensitive word identification methods that rely solely on the surface word form to perceive and recover the missing semantics, thus creating a systemic risk of missed detection for implicitly sensitive advertisements that use the above mechanism to evade review. The industry urgently needs a new real-time identification method that can reverse-locate semantic gaps in the surface text that lacks key trigger words and accurately reconstruct the upstream missing semantics in order to overcome the underlying limitations of traditional risk control systems.

Summary of the Invention

[0004] To address the shortcomings of existing technologies, this invention proposes a real-time identification method for sensitive words in advertising content based on large model technology. This method solves the problem that in dynamic word insertion or dynamic search advertising scenarios, the final displayed text has a phenomenon of empty slots for trigger words, which causes the original trigger semantics carrying the risk of violations to be generalized or replaced. As a result, existing identification methods that rely solely on the surface word form cannot perceive and recover the missing semantics, and systematically miss detection of implicitly sensitive advertisements.

[0005] To achieve the above objectives, the present invention provides the following technical solution: obtain the final displayed text, the trigger keyword set, and the target page semantic item set; calculate the empty slot candidate degree from the context recovery probability of each final displayed text segment and from the vector similarity between that segment and both the trigger keyword set and the target page semantic item set; compare the empty slot candidate degree with the empty slot segment threshold, aggregate the qualifying positions into empty slot segments, extract the context window, and calculate the segment strength; input the context window, the trigger keyword set, and the target page semantic item set into the large model to generate an initial candidate semantic set, then clean and expand it to obtain an expanded candidate set; establish a fitness function that integrates vector similarity to the context window, the trigger keyword set, the target page semantic item set, and the sensitive semantic prior library, and use the particle swarm optimization algorithm to find the optimal semantic vector; extract from the expanded candidate set the candidate phrase whose semantic vector is closest to the optimal semantic vector as the optimal backfill trigger semantic, and substitute it for the empty slot segment in the final displayed text to construct the backfill text for review; input the final displayed text and the backfill text for review into the large model's sensitivity classifier and calculate the difference in the probability distributions of sensitive categories to obtain the empty slot sensitivity gain; calculate the final risk score by fusing the segment strength, the empty slot sensitivity gain, and the sensitive category probability distribution of the backfill text for review; and, based on the final risk score, output the identification label and the advertising handling action.

[0006] Compared with existing technologies, it has the following advantages: This solution proposes a real-time sensitive word identification method for advertising content based on large-scale model technology, significantly improving the rigor and security of review in dynamically generated advertising scenarios. This invention fundamentally breaks through the limitations of traditional risk identification methods that rely solely on static scanning of surface-level advertising vocabulary, effectively solving the problem of missed detection of hidden risks in dynamic word insertion and dynamic search advertising scenarios due to the generalization, compression, or whitewashing of upstream trigger semantics. By deeply mining the synergistic relationship between the final displayed text and the original trigger keywords and target page semantics, this solution can accurately locate empty slots in the text with insufficient semantic capacity and use them as a breakthrough point for reverse deduction. The technical approach of actively searching for semantically missing nodes from surface-compliant text and restoring the original intent enables the regulatory system to see through advertising disguises, greatly improving the identification accuracy of various probing bypass strategies and ensuring the full-link controllability of advertising content during dynamic evolution.

[0007] This invention constructs a closed-loop architecture that combines candidate generation using a large language model with controlled optimization using heuristic algorithms. Compared to the conventional approach of treating the large language model as a black box that directly outputs judgment conclusions, this solution confines the model to generating candidate material and introduces a particle swarm optimization algorithm for rigorous fitness filtering in a multi-dimensional vector space. This ensures that the backfilled semantics not only fit the context but also logically reproduce the real triggering background. A discrete mapping mechanism suppresses, at the technical level, the hallucinations or random outputs that the large language model might otherwise produce, supporting stable engineering implementation. At the same time, by calculating the difference in sensitivity probability distributions before and after backfilling to obtain the sensitivity gain, this solution captures implicit risks quantitatively and gives the review conclusions a mathematical basis. Furthermore, by constructing an isolated backfill data stream in the review memory, this invention completes deep probing without tampering with the advertiser's original materials, providing the internet advertising industry with a real-time identification solution that balances business compliance and technical insight.

Attached Figure Description

[0008] Figure 1 is a schematic diagram of the method flow of the present invention.

Detailed Implementation

[0009] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0010] Referring to Figure 1, this application provides a method for real-time identification of sensitive words in advertising content based on large model technology. The method specifically includes the following steps. Step 1: calculate the empty slot candidate degree based on the context recovery probability of each final displayed text segment and the vector similarity between that segment and both the trigger keyword set and the target page semantic item set. This includes the following sub-steps. Obtain the final displayed text, the trigger keyword set, and the target page semantic item set. Specifically, the system first obtains, from the data input channel, the final displayed text of the ad object under review, the trigger keyword set associated with the ad object, and the target page semantic item set. It then standardizes the final displayed text, for example by converting full-width characters to half-width characters, converting English letters to lowercase, removing redundant spaces, and clearing residual tags, and performs word segmentation to obtain multiple final displayed text segments.
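The standardization described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names are hypothetical, Unicode NFKC normalization is assumed as the full-width to half-width conversion, a simple regular expression stands in for tag clearing, and whitespace splitting stands in for a real word segmenter.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")   # residual markup such as <b> or </span>
SPACE_RE = re.compile(r"\s+")     # any run of whitespace


def normalize_display_text(text: str) -> str:
    """Standardize the final displayed text before segmentation."""
    text = unicodedata.normalize("NFKC", text)   # full-width -> half-width
    text = TAG_RE.sub(" ", text)                 # clear residual tags
    text = text.lower()                          # unify letters to lowercase
    return SPACE_RE.sub(" ", text).strip()       # remove redundant spaces


def segment(text: str) -> list[str]:
    """Whitespace tokenizer used here as a stand-in for a word segmenter."""
    return normalize_display_text(text).split()
```

For Chinese ad text, the whitespace split would be replaced by a dedicated segmenter; the normalization order shown here is one reasonable choice among several.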

[0011] The logarithm of the context recovery probability of the final displayed text segmentation is extracted as the conditional semantic contribution value; Specifically, for each segmented word in the final displayed text, the system constructs a context environment after removing the segment, calculates the context recovery probability of the segment in the current context using a pre-trained language model, and extracts the logarithm of the context recovery probability as the conditional semantic contribution value. The core of this step is that if a segment highly matches its context, the conditional semantic contribution value is large; if the segment only serves a generalization and filling function and cannot fully carry the specific semantics that should be present at that position, the conditional semantic contribution value is small.
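A minimal sketch of this step is given below. The source does not name the pre-trained language model or its interface, so the `recovery_prob` callback, the `[MASK]` placeholder, and the `eps` floor are all assumptions made for illustration; any masked language model that returns the probability of the original segment at the removed position could fill that role.

```python
import math
from typing import Callable, Sequence

MASK = "[MASK]"  # hypothetical placeholder marking the removed segment

# recovery_prob(masked_tokens, position, original_token) -> probability in [0, 1]
RecoveryProb = Callable[[Sequence[str], int, str], float]


def conditional_contributions(
    tokens: Sequence[str],
    recovery_prob: RecoveryProb,
    eps: float = 1e-12,
) -> list[float]:
    """Return log P(token | context without token) for every position."""
    values = []
    for i, tok in enumerate(tokens):
        masked = list(tokens)
        masked[i] = MASK                      # context with the segment removed
        p = recovery_prob(masked, i, tok)     # context recovery probability
        values.append(math.log(max(p, eps)))  # floor avoids log(0)
    return values
```

A segment that the context predicts well yields a value close to zero, while a generic filler segment yields a strongly negative value.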

[0012] The keyword semantic decoupling degree is obtained by subtracting from one the maximum vector cosine similarity between the final displayed text segment and each trigger keyword in the trigger keyword set. The page semantic gap degree is obtained by subtracting from one the maximum vector cosine similarity between the final displayed text segment and each target page semantic item in the target page semantic item set. Specifically, the system uses a semantic encoder to convert each final displayed text segment into a segment semantic vector and each trigger keyword into a keyword semantic vector. It calculates the cosine similarity between the segment semantic vector and each keyword semantic vector, takes the maximum, and subtracts that maximum from one to obtain the keyword semantic decoupling degree. This degree reflects how far the current text deviates from the ad's trigger semantics: the greater the deviation, the higher the value. Similarly, the system converts each item in the target page semantic item set into a target page semantic vector and subtracts from one the maximum cosine similarity between the segment semantic vector and the target page semantic vectors to obtain the page semantic gap degree.
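Both degrees share the same form, one minus a maximum cosine similarity, so a single helper can illustrate them. The sketch below assumes the minuend is one and that vectors are plain lists of floats produced by some semantic encoder (not shown); the function names are hypothetical. The zero returned for an empty reference set follows the boundary rule stated in paragraph [0015].

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; defined as 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gap_degree(segment_vec: Vector, reference_vecs: Sequence[Vector]) -> float:
    """1 - max cosine similarity to the reference set (keywords or page items)."""
    if not reference_vecs:
        return 0.0  # empty keyword set or failed page crawl
    return 1.0 - max(cosine(segment_vec, ref) for ref in reference_vecs)
```

Calling `gap_degree` with keyword vectors gives the keyword semantic decoupling degree, and calling it with target page vectors gives the page semantic gap degree.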

[0013] The conditional semantic contribution value is input into the preset Sigmoid smoothing function to obtain the smoothed mapping value; the smoothed mapping value is subtracted from one to obtain the smoothed mapping difference; and the smoothed mapping difference is multiplied by the conditional semantic weight parameter to obtain the basic empty slot risk value. The first risk superposition value is obtained by multiplying the keyword semantic decoupling degree by the decoupling degree weight parameter. The second risk superposition value is obtained by multiplying the page semantic gap degree by the gap degree weight parameter. The empty slot candidate degree is obtained by adding the basic empty slot risk value, the first risk superposition value, and the second risk superposition value.

[0014] Specifically, after calculating the values of the three dimensions above, the system inputs the conditional semantic contribution value into a preset Sigmoid smoothing function to obtain a smoothed mapping value. The system then subtracts this smoothed mapping value from one to obtain the smoothed mapping difference, and multiplies this difference by the conditional semantic weight parameter to obtain the basic empty slot risk value. The system multiplies the keyword semantic decoupling degree by the decoupling degree weight parameter to obtain the first risk superposition value, and multiplies the page semantic gap degree by the gap degree weight parameter to obtain the second risk superposition value. Finally, the system adds the basic empty slot risk value, the first risk superposition value, and the second risk superposition value to obtain the empty slot candidate degree of the final displayed text segment. Through this multi-dimensional fusion, the step quantifies the likelihood of semantic missingness at each position in the text, providing the data foundation for subsequent interval aggregation and reverse inference by the large model.

[0015] It should be noted that the empty slots in this scheme do not refer to the absolute absence of characters, but rather to an abnormal situation where a position that should carry a more specific triggering semantic is occupied by a text fragment with weaker semantics. When performing the above decoupling degree calculation, there are boundary cases where data is missing. If the current ad object has no explicit keywords, resulting in an empty set of triggering keywords, the system directly assigns the above keyword semantic decoupling degree a value of zero. Similarly, if page crawling fails, resulting in an empty set of semantic items for the target page, the system directly assigns the above page semantic gap degree a value of zero, thus ensuring the robustness of the algorithm. Furthermore, in the fusion calculation of empty slot candidate degree, the sum of the conditional semantic weight parameter, the decoupling degree weight parameter, and the gap degree weight parameter must be strictly equal to one.

[0016] In one specific embodiment, the conditional semantic weight parameter is initially set to 0.4, the decoupling degree weight parameter is initially set to 0.3, and the gap degree weight parameter is initially set to 0.3. The basis for adopting this proportional allocation is that the insufficient semantic capacity of the final displayed text itself is the primary manifestation of the gap phenomenon. Therefore, the proportion of the conditional semantic weight parameter is slightly higher, while the triggering keywords and target page semantics are respectively used as external auxiliary references in the calculation of the final score.
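The fusion of the three components can be sketched as below, using the embodiment weights 0.4, 0.3, and 0.3. Two points are assumptions rather than statements from the source: the "preset Sigmoid smoothing function" is taken to be the standard logistic function with no scale or shift, and the smoothed value is subtracted from one. The function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def slot_candidate_degree(
    contribution: float,        # log of the context recovery probability
    keyword_decoupling: float,  # 1 - max cosine similarity to trigger keywords
    page_gap: float,            # 1 - max cosine similarity to page items
    w_cond: float = 0.4,
    w_decouple: float = 0.3,
    w_gap: float = 0.3,
) -> float:
    """Weighted sum of the basic slot risk and the two superposition values."""
    if abs(w_cond + w_decouple + w_gap - 1.0) > 1e-9:
        raise ValueError("the three weight parameters must sum to one")
    base_risk = w_cond * (1.0 - sigmoid(contribution))
    return base_risk + w_decouple * keyword_decoupling + w_gap * page_gap
```

Because the contribution value is a log-probability and therefore never positive, the unscaled logistic maps it into (0, 0.5], so the basic risk term stays between half and all of its weight. A deployment would likely tune the sigmoid's scale and offset to spread this range.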

[0017] Step 2: Compare the candidate degree of empty slots with the threshold of empty slot segments, aggregate to generate empty slot segments, extract the context window and calculate the segment strength, which specifically includes the following steps: The positions of the final displayed text words with a slot candidate degree greater than the slot segment threshold are extracted as high candidate positions; Specifically, the system iterates through all the empty slot candidate degrees of the final displayed text segment output in step one, comparing each empty slot candidate degree with a pre-set empty slot segment threshold. When the empty slot candidate degree of the final displayed text segment is greater than the pre-set empty slot segment threshold, the system extracts the position of the final displayed text segment as a high candidate position. Through this operation, the system can accurately screen out basic coordinate nodes with a high suspicion of semantic missingness from discrete scores, establishing a clear starting point for subsequent aggregation of continuous intervals.

[0018] It should be noted that the threshold for empty slots is set to define whether a text position constitutes a high-probability anomaly.

[0019] In one embodiment, the empty slot segment threshold can be determined from the distribution of candidate degrees in historical positive samples. For example, the 25th percentile of the empty slot candidate degrees at the abnormal positions of positive samples in the training set can be taken as the initial value, and the final empty slot segment threshold can be obtained by fine-tuning it on the validation set with the goal of minimizing the false negative rate.
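The initial value mentioned in this embodiment can be computed with the standard library, as in the sketch below. The function name is hypothetical, the "inclusive" interpolation method is an arbitrary choice, and the subsequent fine-tuning on a validation set is not shown.

```python
import statistics
from typing import Sequence


def initial_slot_threshold(positive_scores: Sequence[float]) -> float:
    """25th percentile of candidate degrees at known abnormal positions."""
    if len(positive_scores) < 2:
        raise ValueError("at least two positive-sample scores are required")
    # quantiles(..., n=4) returns the three quartile cut points; index 0 is Q1.
    return statistics.quantiles(positive_scores, n=4, method="inclusive")[0]
```

Starting from a low percentile keeps most known positive positions above the threshold, which matches the stated goal of minimizing false negatives.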

[0020] Connect high candidate positions whose position indices are adjacent or nearly adjacent into continuous text intervals to generate empty slot segments. Specifically, the system determines coherence from the position indices of the high candidate positions. If two high candidate positions are directly adjacent, or if they are separated by exactly one intervening position, the system determines that they belong to the same semantic missing window and joins them into a continuous text interval, generating an empty slot segment. Here, a separation of one position means that at most one segment below the empty slot segment threshold, such as a conjunction, particle, punctuation mark, or template placeholder, is allowed between two high candidate positions. This splicing mechanism, which tolerates small gaps, ensures that the generated empty slot segments are complete phrases rather than fragments, making the subsequent semantic backfilling by the large model more stable and coherent.
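The thresholding and gap-tolerant splicing can be sketched as a single pass over the scores. This is an illustrative reading of the paragraph, with hypothetical names; segments are represented as inclusive `(start, end)` index pairs, and `max_gap=1` encodes the tolerance of one below-threshold segment between two high candidate positions.

```python
from typing import Sequence


def aggregate_slot_segments(
    scores: Sequence[float],
    threshold: float,
    max_gap: int = 1,
) -> list[tuple[int, int]]:
    """Join high candidate positions into inclusive (start, end) intervals."""
    high_positions = [i for i, s in enumerate(scores) if s > threshold]
    segments: list[tuple[int, int]] = []
    for pos in high_positions:
        # Number of below-threshold positions between this one and the
        # end of the most recent segment.
        if segments and pos - segments[-1][1] - 1 <= max_gap:
            segments[-1] = (segments[-1][0], pos)   # extend current segment
        else:
            segments.append((pos, pos))             # start a new segment
    return segments
```

With scores `[0.9, 0.8, 0.1, 0.7, 0.1, 0.1, 0.9]` and a threshold of 0.5, positions 0, 1, and 3 merge across the single low-scoring position 2, while position 6 starts a separate segment because two low-scoring positions precede it.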

[0021] Retain empty slot segments whose number of final displayed text segments lies within the preset lower and upper limits, and split empty slot segments whose number of final displayed text segments exceeds the preset upper limit. Specifically, the system counts the number of final displayed text segments contained in each empty slot segment and compares this number with the preset lower and upper limits. The system retains empty slot segments whose count is within the limits and splits empty slot segments whose count exceeds the upper limit. The splitting operation can be based on, for example, local peaks of the empty slot candidate degree within the empty slot segment. This constraint prevents the subsequent semantic backfilling by the large model from becoming uncontrollable because an empty slot segment is too long.

[0022] It should be noted that the upper and lower limits of the preset word segmentation range are based on the fact that search ad text is usually short, and a small number of words are enough to cover a generalized phrase or a replaced trigger phrase.

[0023] In one embodiment, the preset lower limit for the number of word segments is one, and the preset upper limit for the number of word segments is six. Furthermore, if, after the above filtering and aggregation, the system does not identify any compliant empty slot segments (i.e., the total number of empty slot segments is zero), the system determines that the final displayed text does not exhibit obvious trigger word emptying phenomena. Therefore, it skips the subsequent large-scale model semantic backtracking steps and directly uses the original final displayed text to perform the sensitive identification process, thereby ensuring the algorithm's execution efficiency under normal text conditions.
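A sketch of the retain-or-split rule with the embodiment limits (one and six) follows. The source only says that splitting "can be based on" local peaks of the candidate degree; cutting an over-long segment at its lowest-scoring interior position, so that each part keeps its own peak, is one possible interpretation and is an assumption here. The function name is hypothetical.

```python
from typing import Sequence


def enforce_length_limits(
    segments: Sequence[tuple[int, int]],
    scores: Sequence[float],
    min_len: int = 1,
    max_len: int = 6,
) -> list[tuple[int, int]]:
    """Keep segments within [min_len, max_len]; split those that are too long."""
    result: list[tuple[int, int]] = []
    stack = list(reversed(segments))          # process in original order
    while stack:
        start, end = stack.pop()
        length = end - start + 1
        if length > max_len:
            # Cut before the weakest position after `start` (assumed rule).
            cut = min(range(start + 1, end + 1), key=lambda k: scores[k])
            stack.append((cut, end))          # right part, handled second
            stack.append((start, cut - 1))    # left part, handled first
        elif length >= min_len:
            result.append((start, end))
    return result
```

If the returned list is empty, the text shows no obvious trigger word emptying, and the caller would skip backfilling and classify the original final displayed text directly, as described in paragraph [0023].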

[0024] Extract a preset number of final displayed text segments before and after the empty slot segment as the context window. Specifically, for each compliant empty slot segment, the system extracts a preset number of final displayed text segments on each side as the context window, following the original order of the final displayed text. If a sentence boundary, a field boundary, or insufficient text length is encountered during extraction, the system takes whatever text is actually available as the context window. The context window gives the large model the background it needs to understand the semantic environment before and after the empty slot segment.

[0025] It should be noted that the preset number is generally five to eight words. In one embodiment, the preset number of words extracted before and after the empty slot segment is set to five words of the final displayed text. The reason for setting it to five words is that search advertising text is generally short, and extracting five words can ensure that the large model obtains sufficient contextual semantics while keeping irrelevant noise information within a reasonable range.
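Context window extraction reduces to two bounded slices, as in the sketch below with the embodiment value of five segments per side. The function name is hypothetical, and only the insufficient-length case is handled; stopping at sentence or field boundaries would require boundary markers that the source does not specify.

```python
from typing import Sequence


def context_window(
    tokens: Sequence[str],
    segment: tuple[int, int],
    size: int = 5,
) -> tuple[list[str], list[str]]:
    """Return up to `size` segments before and after an inclusive slot segment."""
    start, end = segment
    before = list(tokens[max(0, start - size):start])  # clipped at text start
    after = list(tokens[end + 1:end + 1 + size])       # clipped at text end
    return before, after
```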

[0026] The average value of the empty slot candidate scores of all final displayed text segments within the empty slot segment is used as the segment strength.

[0027] Specifically, the system accumulates the slot candidate scores of all final displayed text segments contained within a single slot segment, and divides the accumulated sum by the total number of final displayed text segments within the slot segment. The average slot candidate score of all final displayed text segments within the slot segment is then calculated as the segment strength. The purpose of calculating segment strength is to integrate discrete position-level scores into a unified phrase-level score, thereby objectively representing the overall semantic loss severity of the slot segment. Segment strength is then used as a crucial weighting factor in the final risk score fusion calculation.
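As a small worked example, segment strength is simply the arithmetic mean of the candidate degrees inside the segment. The function name and the inclusive `(start, end)` representation are illustrative assumptions.

```python
from typing import Sequence


def segment_strength(scores: Sequence[float], segment: tuple[int, int]) -> float:
    """Mean empty slot candidate degree over an inclusive (start, end) segment."""
    start, end = segment
    window = scores[start:end + 1]
    return sum(window) / len(window)
```

For candidate degrees `[0.2, 0.6, 0.8, 0.1]` and the segment `(1, 2)`, the strength is (0.6 + 0.8) / 2 = 0.7.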

[0028] Step 3: Input the context window, the set of triggering keywords, and the set of semantic items of the target page into the large model to generate an initial candidate semantic set, and clean it to obtain an expanded candidate set; A fitness function is established that integrates the context window, the set of triggering keywords, the set of semantic items on the target page, and the similarity of vectors from the prior library of sensitive semantics. The optimal semantic vector is obtained by using the particle swarm optimization algorithm. The candidate phrase whose semantic vector is closest to the optimal semantic vector is extracted from the expanded candidate set as the optimal backfill trigger semantic. The specific steps include: Using a large model, candidate phrases that meet the preset length limit and do not belong to the preset risk category name are generated based on the context window, the set of trigger keywords, and the set of semantic items on the target page. These candidate phrases are then used to form the initial candidate semantic set. Remove candidate phrases from the initial candidate semantic set that have completely duplicated content, and remove candidate phrases from the initial candidate semantic set that are grammatically incompatible with the context window; Based on the set of triggering keywords and the set of semantic items on the target page, synonym expansion is performed on the candidate phrases in the initial candidate semantic set, and candidate phrases exceeding the preset expansion length limit are truncated to obtain the expanded candidate set.

[0029] Specifically, for the context window of the empty slot segment extracted in step two, the system inputs the context window, the set of trigger keywords, and the set of semantic items for the target page as prompt information into the large model, instructing the large model to infer the original trigger semantic most likely to be replaced. The system requires that the candidate phrases generated by the large model must meet a preset length limit and not belong to a preset risk category name, and then the generated multiple candidate phrases together constitute the initial candidate semantic set. Subsequently, the system performs a cleaning operation on the initial candidate semantic set, removing candidate phrases with completely duplicate content, and using syntax detection rules to remove candidate phrases in the initial candidate semantic set that are grammatically incompatible with the context window. To enhance the coverage of the candidate space, the system performs a synonym expansion operation on the candidate phrases in the initial candidate semantic set based on the set of trigger keywords and the set of semantic items for the target page, and truncates candidate phrases that exceed the preset expansion length limit, finally obtaining an expanded candidate set.

[0030] It should be noted that the preset length limit and preset extended length limit are used to standardize the specifications of the phrases output by the large model. For example, the preset length limit can be set to no more than eight words. The preset risk category name is used to prevent the large model from directly generating instructive or explanatory sentences that violate regulations. The core technology of this step is that the large model is only used to generate a basic candidate material space, rather than directly giving the final conclusion, thereby effectively reducing the openness and uncontrollability inherent in generative artificial intelligence and ensuring the stability of the risk control system.
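The cleaning and expansion of the candidate set might look like the sketch below. Everything here is illustrative: the function names are hypothetical, the synonym table is assumed to be derived elsewhere from the trigger keywords and page items, the eight-word limit follows the example in the text, and the grammatical compatibility check is omitted because the source does not describe its rules.

```python
from typing import Mapping, Sequence


def clean_candidates(
    candidates: Sequence[str],
    risk_category_names: Sequence[str],
    max_words: int = 8,
) -> list[str]:
    """Drop duplicates, risk category names, and over-length phrases."""
    blocked = {name.lower() for name in risk_category_names}
    seen: set[str] = set()
    cleaned: list[str] = []
    for phrase in candidates:
        norm = " ".join(phrase.lower().split())
        if not norm or norm in seen or norm in blocked:
            continue
        if len(norm.split()) > max_words:
            continue
        seen.add(norm)
        cleaned.append(norm)
    return cleaned


def expand_candidates(
    phrases: Sequence[str],
    synonyms: Mapping[str, Sequence[str]],
    max_words: int = 8,
) -> list[str]:
    """Add single-word synonym variants, truncating to the expansion limit."""
    expanded: list[str] = []
    for phrase in phrases:
        words = phrase.split()
        variants = [words]
        for i, word in enumerate(words):
            for alt in synonyms.get(word, []):
                variants.append(words[:i] + alt.split() + words[i + 1:])
        for variant in variants:
            text = " ".join(variant[:max_words])   # truncate over-long phrases
            if text not in expanded:
                expanded.append(text)
    return expanded
```

Keeping the large model's output as a closed, cleaned list is what lets the later optimization step select from known phrases instead of accepting free-form text.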

[0031] If the large model fails to generate an initial candidate semantic set, a degenerate backfilling step is performed, including: converting the context window into a context window semantic vector; converting each trigger keyword in the trigger keyword set into a trigger keyword semantic vector, calculating the vector cosine similarity between each trigger keyword semantic vector and the context window semantic vector, and extracting the trigger keyword corresponding to the maximum vector cosine similarity as the optimal backfill trigger semantic; if the trigger keyword set contains zero trigger keywords, converting each target page semantic item in the target page semantic item set into a target page semantic item semantic vector, calculating the vector cosine similarity between each target page semantic item semantic vector and the context window semantic vector, and extracting the target page semantic item corresponding to the maximum vector cosine similarity as the optimal backfill trigger semantic; if the trigger keyword set contains zero trigger keywords and the target page semantic item set contains zero target page semantic items, it is determined that the empty slot segment has no backfill trigger semantic, and the construction of backfill text for review is stopped.

[0032] Specifically, when calling the large model service, the system may encounter timeouts or rejections. If the large model fails to generate an initial candidate semantic set, the system automatically performs a degradation backfilling step to ensure the process is not interrupted. The system first uses a semantic encoder to convert the context window into a context window semantic vector, and then converts each trigger keyword in the trigger keyword set into a trigger keyword semantic vector. The system calculates the vector cosine similarity between each trigger keyword semantic vector and the context window semantic vector, extracting the trigger keyword with the highest vector cosine similarity as the optimal backfilling trigger semantic. If the trigger keyword set contains zero trigger keywords, the system converts each target page semantic item in the target page semantic item set into a target page semantic item semantic vector, calculates the vector cosine similarity between each target page semantic item semantic vector and the context window semantic vector, and extracts the target page semantic item with the highest vector cosine similarity as the optimal backfilling trigger semantic.

[0033] It should be noted that this degradation mechanism is a design feature introduced to improve system robustness. In extreme exceptional circumstances, if the set of triggering keywords contains zero triggering keywords and the set of target page semantic items contains zero target page semantic items, the system directly determines that the empty slot segment has no backfill triggering semantics and immediately stops subsequent operations of constructing backfill text for review for the current empty slot segment. Through strict hierarchical degradation logic, it can be ensured that even in the event that the external large model service completely fails, the system can still extract the safest and context-appropriate backfill semantics based on internal similarity calculations.
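
The hierarchical degradation logic described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `fallback_backfill`, the `encode` callable standing in for the semantic encoder, and the convention of returning `None` when no backfill trigger semantic exists are all assumptions made for illustration.

```python
import math
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Vector cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fallback_backfill(
    context_window: str,
    trigger_keywords: Sequence[str],
    page_items: Sequence[str],
    encode: Callable[[str], Vector],  # hypothetical semantic encoder
) -> Optional[str]:
    """Pick a backfill trigger semantic when the large model returns nothing.

    Order of preference: trigger keywords first, then target page semantic
    items. Returns None when both sets are empty, which signals the caller
    to stop constructing backfill text for this empty slot segment.
    """
    ctx_vec = encode(context_window)
    for pool in (trigger_keywords, page_items):
        if pool:
            return max(pool, key=lambda item: cosine(encode(item), ctx_vec))
    return None
```

In this sketch the target page semantic items are consulted only when the trigger keyword set is empty, which mirrors the order of the degradation levels in the text.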

[0034] After converting each item in the trigger keyword set and the target page semantic item set into semantic vectors, the mean of each vector is calculated to obtain the aggregated semantic vectors of the trigger keyword set and the target page semantic item set. The context window is converted into a context window semantic vector. Each item in the sensitive semantic prior library is converted into a semantic vector of each item in the sensitive semantic prior library. The first fitness term is obtained by multiplying the vector cosine similarity between the particle position and the context window semantic vector by the first fitness weight parameter. The second fitness term is obtained by multiplying the vector cosine similarity between the particle position and the aggregated semantic vector of the trigger keyword set by the second fitness weight parameter. The third fitness term is obtained by multiplying the vector cosine similarity between the particle position and the aggregated semantic vector of the target page semantic item set by the third fitness weight parameter. The maximum value of the vector cosine similarity between the particle position and each semantic vector in the sensitive semantic prior library is calculated as the risk proximity, and the risk proximity is multiplied by the fourth fitness weight parameter to obtain the fourth fitness term. The first, second, third, and fourth fitness terms are added together to form the fitness function.

[0035] Specifically, to achieve precise and controlled optimization over the expanded candidate set, the system introduces a particle swarm optimization algorithm. First, the system converts each text element in the trigger keyword set and in the target page semantic item set into a semantic vector. It then averages the semantic vectors of each set to obtain the aggregated semantic vector of the trigger keyword set and the aggregated semantic vector of the target page semantic item set. The system further converts the context window into a context window semantic vector and converts each text element in the system's preset sensitive semantic prior library into a sensitive semantic prior library semantic vector. In constructing the fitness function, the system multiplies the vector cosine similarity between the particle position (the coordinates being optimized) and the context window semantic vector by the first fitness weight parameter to obtain the first fitness term. The system multiplies the vector cosine similarity between the particle position and the aggregated semantic vector of the trigger keyword set by the second fitness weight parameter to obtain the second fitness term. The system multiplies the vector cosine similarity between the particle position and the aggregated semantic vector of the target page semantic item set by the third fitness weight parameter to obtain the third fitness term. The system then calculates the vector cosine similarity between the particle position and each semantic vector in the sensitive semantic prior library and takes the maximum value as the risk proximity. The risk proximity is multiplied by the fourth fitness weight parameter to obtain the fourth fitness term. Finally, the system adds the first, second, third, and fourth fitness terms to form the fitness function.

[0036] It should be noted that the sensitive semantic prior library is used to measure the potential violation risk of candidate triggering semantics. The data source for the sensitive semantic prior library can be, for example, sensitive phrases manually confirmed from historical violating advertisements or violating semantic phrases defined by platform rules. The first, second, third, and fourth fitness weight parameters together determine the optimization direction of the particle swarm optimization algorithm. The actual meaning of this fitness function is that the higher the fitness score of a particle position, the more it satisfies the four constraints of being highly consistent with the context of the empty slot segment, highly consistent with the original triggering semantics of the advertisement, consistent with the landing semantics of the target page, and having a certain proximity to known risky phrases.

[0037] In addition, the sum of the first fitness weight parameter, the second fitness weight parameter, the third fitness weight parameter, and the fourth fitness weight parameter must be strictly equal to one.

[0038] In one specific embodiment, the first fitness weight parameter is initially set to 0.4, and the second, third, and fourth fitness weight parameters are each initially set to 0.2. This allocation gives the greatest weight to contextual coherence while retaining a risk proximity term, so that the backfilled semantic reads fluently in context and still exposes potential risk. Additionally, as a boundary rule, if the trigger keyword set is empty, the system assigns a value of zero to the second fitness term; if the target page semantic item set is empty, the system assigns a value of zero to the third fitness term.
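
The four-term fitness function and its boundary rule can be sketched as follows. This is an illustrative sketch only: the names `fitness` and `mean_vector` are hypothetical, vectors are plain Python lists, and treating an empty sensitive semantic prior library as a risk proximity of zero is an assumption that the text does not state.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Vector cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors: Sequence[Vector]) -> List[float]:
    """Element-wise mean, used as the aggregated semantic vector of a set."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fitness(
    position: Vector,
    ctx_vec: Vector,
    keyword_vecs: Sequence[Vector],
    page_vecs: Sequence[Vector],
    prior_vecs: Sequence[Vector],
    weights: Sequence[float] = (0.4, 0.2, 0.2, 0.2),  # values from [0038]
) -> float:
    """Weighted sum of the four fitness terms for one particle position."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("fitness weights must sum to one")
    w1, w2, w3, w4 = weights
    term1 = w1 * cosine(position, ctx_vec)
    # Boundary rule: an empty set contributes a zero term.
    term2 = w2 * cosine(position, mean_vector(keyword_vecs)) if keyword_vecs else 0.0
    term3 = w3 * cosine(position, mean_vector(page_vecs)) if page_vecs else 0.0
    # Risk proximity: maximum similarity to any prior-library vector.
    # Assumption: an empty prior library yields a proximity of 0.0.
    risk_proximity = max((cosine(position, p) for p in prior_vecs), default=0.0)
    term4 = w4 * risk_proximity
    return term1 + term2 + term3 + term4
```

For example, a position identical to the context vector, to the only trigger keyword vector, and to one prior-library vector, with an empty target page set, scores 0.4 + 0.2 + 0 + 0.2 = 0.8.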

[0039] Each candidate phrase in the expanded candidate set is converted into a candidate phrase semantic vector; the vector cosine similarity between each candidate phrase semantic vector and the optimal semantic vector is calculated; the candidate phrase corresponding to the maximum vector cosine similarity is extracted as the optimal backfill trigger semantic.

[0040] Specifically, before performing particle swarm optimization, the system converts each candidate phrase in the expanded candidate set into an initial candidate semantic vector. These initial candidate semantic vectors are then used as the initial positions and initial individual optimal positions of the particles in the particle swarm optimization algorithm. Simultaneously, the initial velocities of the particles are initialized to small random vectors with a mean of zero. During the iterative optimization process, the system calculates the fitness function score for each particle's position and records the individual historical optimal position of each particle and the global optimal position of the entire particle swarm. Subsequently, the particles continuously update their velocity and position vectors based on their individual historical optimal positions and the global optimal position. The formulas for updating the particle's velocity and position vectors are as follows:

$$v_{m,i}^{t+1} = \omega\, v_{m,i}^{t} + c_1 r_1 \left(p_{m,i}^{t} - x_{m,i}^{t}\right) + c_2 r_2 \left(g_{m}^{t} - x_{m,i}^{t}\right)$$

$$x_{m,i}^{t+1} = x_{m,i}^{t} + v_{m,i}^{t+1}$$

where $t$ represents the current iteration round; $x_{m,i}^{t}$ represents the position vector of the $i$-th particle of the $m$-th empty slot segment in the $t$-th iteration, i.e. the semantic vector represented by the particle; $v_{m,i}^{t}$ represents the corresponding particle velocity vector; $\omega$ represents the inertia weight, used to control the tendency of a particle to maintain its previous state of motion; $c_1$ and $c_2$ represent the learning factors, i.e. the step size weights that control the particle's approach towards its individual historical best position and the global best position, respectively; $r_1$ and $r_2$ represent random numbers uniformly distributed between zero and one, used to increase the randomness of the optimization process in order to escape local optima; $p_{m,i}^{t}$ represents the individual historical best position, i.e. the position with the highest fitness score that the particle has experienced from initialization to the current iteration; and $g_{m}^{t}$ represents the global optimal position, i.e. the position with the highest fitness score found by the entire particle swarm for the current empty slot segment.

[0041] Specifically, the core technology of this update formula lies in the fact that, within a multi-dimensional continuous semantic vector space, particles continuously adjust their flight speed and direction by combining their own historical optimization experience with the shared experience of the entire candidate group, ultimately converging towards the optimal semantic vector region that simultaneously satisfies contextual coherence, trigger semantic consistency, target page consistency, and approximation of sensitive risks. This process continues until a preset number of iterations is reached, at which point the particle swarm optimization algorithm converges in the continuous vector space and outputs a theoretically optimal semantic vector. To safely reconstruct the optimal semantic vector in the continuous vector space into discrete natural language text, the system converts each candidate phrase in the expanded candidate set into a candidate phrase semantic vector. The system then calculates the vector cosine similarity between each candidate phrase semantic vector and the optimal semantic vector output by the particle swarm optimization algorithm. Finally, the system extracts the candidate phrase corresponding to the maximum vector cosine similarity as the optimal backfill trigger semantic.

[0042] It should be noted that during the iterative optimization process of the particle swarm optimization algorithm, parameters such as the inertia weight and the learning factors can take conventional values in this field. For example, the inertia weight can be set to 0.7, the two learning factors can both be set to 1.5, and the number of iterations can be set to fifteen. This step avoids treating the large model as a black box that outputs the final judgment, and it also avoids directly decoding unstable continuous vectors into text. Instead, it selects the existing phrase closest to the theoretically optimal vector from the known and controlled set of expanded candidate vectors as the final result. This discrete mapping mechanism improves the stability of the engineering implementation, avoids the risk of the large model hallucinating or emitting uncontrollable garbled text, and improves the safety of the text used in subsequent review stages.
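
A compact sketch of the optimization and the discrete mapping step is given below, using the example values of 0.7 for the inertia weight, 1.5 for both learning factors, and fifteen iterations. It assumes the standard particle swarm velocity and position update. The function names, the standard deviation of 0.01 for the initial velocities, the per-dimension random draws, and the fixed seed are all assumptions for illustration rather than details taken from the text.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Vector cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pso_optimal_vector(
    initial_positions: Sequence[Vector],   # candidate phrase semantic vectors
    fitness: Callable[[Vector], float],
    inertia: float = 0.7,
    c1: float = 1.5,
    c2: float = 1.5,
    iterations: int = 15,
    seed: int = 0,                          # assumption: fixed for repeatability
) -> List[float]:
    """Return the best position found by a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(initial_positions[0])
    x = [list(p) for p in initial_positions]
    # Small zero-mean random initial velocities (std 0.01 is an assumption).
    v = [[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in x]
    pbest = [list(p) for p in x]
    pbest_score = [fitness(p) for p in x]
    best_index = max(range(len(x)), key=pbest_score.__getitem__)
    gbest, gbest_score = list(pbest[best_index]), pbest_score[best_index]

    for _ in range(iterations):
        for i in range(len(x)):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (inertia * v[i][d]
                           + c1 * r1 * (pbest[i][d] - x[i][d])
                           + c2 * r2 * (gbest[d] - x[i][d]))
                x[i][d] += v[i][d]
            score = fitness(x[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = list(x[i]), score
                if score > gbest_score:
                    gbest, gbest_score = list(x[i]), score
    return gbest


def nearest_candidate(
    candidates: Sequence[str],
    candidate_vecs: Sequence[Vector],
    optimal_vec: Vector,
) -> str:
    """Map the continuous optimum back to the closest existing candidate phrase."""
    best = max(range(len(candidates)),
               key=lambda i: cosine(candidate_vecs[i], optimal_vec))
    return candidates[best]
```

Because `nearest_candidate` can only return a phrase that already exists in the expanded candidate set, the text passed to later review stages is never decoded from a continuous vector.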

[0043] Step 4: Replace the empty slot segments in the final displayed text to construct backfill text for review, input the final displayed text and the backfill text for review into the large model's sensitivity classifier, and calculate the difference in the probability distributions of sensitive categories to obtain the empty slot sensitivity gain. This specifically includes the following steps: the final displayed text word segments contained in each empty slot segment are replaced with the optimal backfill trigger semantic; the adjacent final displayed text word segments are re-concatenated around the replacement, maintaining the original field order and punctuation boundaries of the final displayed text, to generate the backfill text for review; the backfill text for review is used as temporary data only for sensitive risk calculation, and is blocked from being written into the rendering data stream of the final displayed text.

[0044] Specifically, the system performs semantic restoration at the text level for each empty slot segment located within the final displayed text. The system removes the final displayed text word segments contained in the empty slot segment and fills the corresponding position with the optimal backfill trigger semantic obtained in Step 3. After the replacement, the system re-concatenates the final displayed text word segments adjacent to the replacement position, strictly maintaining the original field order and punctuation boundaries of the final displayed text, thereby generating a structurally complete backfill text for internal review.

[0045] It should be noted that the backfill text for review is intermediate test data proactively constructed by the system to probe latent risks. At the level of engineering implementation and data flow architecture, the system restricts the backfill text for review to temporary data held in the system's review memory and used only for sensitive risk calculation, and blocks it at the underlying data transmission link from being written into the rendering data stream of the final displayed text. This data isolation mechanism prevents the system from altering the advertiser's actual ad placement materials during review calculations, balancing the technical requirement of deep semantic risk identification with the legal compliance requirement that advertising materials must not be tampered with.
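
As a rough illustration of the replacement and re-concatenation step, the sketch below builds the backfill text for review as a new string and leaves the original word segments untouched. The function name `build_review_text`, the half-open `(start, end)` index pair describing the empty slot segment, and the `joiner` parameter are assumptions made for this example.

```python
from typing import Sequence, Tuple


def build_review_text(
    segments: Sequence[str],      # word segments of the final displayed text
    slot: Tuple[int, int],        # half-open index range of the empty slot segment
    backfill: str,                # optimal backfill trigger semantic
    joiner: str = "",             # "" suits text whose segments carry their own spacing
) -> str:
    """Return backfill text for review without modifying `segments`."""
    start, end = slot
    if not (0 <= start < end <= len(segments)):
        raise ValueError("slot indices are outside the segment list")
    # Segments before and after the slot keep their original order, so field
    # order and punctuation boundaries are preserved.
    review_segments = list(segments[:start]) + [backfill] + list(segments[end:])
    return joiner.join(review_segments)


if __name__ == "__main__":
    displayed = ("Apply ", "for ", "this ", "today", ", ", "approved fast")
    print(build_review_text(displayed, (2, 3), "a cash loan "))
```

Returning a fresh string rather than editing the input in place is one simple way to keep the review copy separate from the data that is rendered to users.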

[0046] Obtain the probability distribution of sensitive categories in the backfill text for review and the probability distribution of sensitive categories in the final displayed text output by the large model sensitive classifier; extract the maximum probability value in the probability distribution of sensitive categories in the backfill text for review as the maximum sensitivity probability of the backfill text; extract the maximum probability value in the probability distribution of sensitive categories in the final displayed text as the maximum sensitivity probability of the final displayed text; subtract the maximum sensitivity probability of the final displayed text from the maximum sensitivity probability of the backfill text to obtain the empty slot sensitivity gain.

[0047] Specifically, the system inputs the previously acquired, unmodified final displayed text and the newly constructed backfill text for review into the large model's sensitivity classifier. The classifier predicts a probability for each of multiple preset sensitive categories, yielding the sensitive category probability distribution of the backfill text for review and that of the final displayed text. Subsequently, the system extracts the maximum probability value from the backfill text's sensitive category probability distribution as the maximum sensitivity probability of the backfill text, and extracts the maximum probability value from the final displayed text's sensitive category probability distribution as the maximum sensitivity probability of the final displayed text. The system subtracts the maximum sensitivity probability of the final displayed text from the maximum sensitivity probability of the backfill text to obtain the empty slot sensitivity gain.

[0048] It should be noted that the system's preset sensitive categories can cover the platform's high-risk classifications. The empty slot sensitivity gain quantifies how much the maximum sensitivity probability of the entire ad text increases once the semantically vague empty slot segments in the text are restored to the most likely trigger semantics of the ad itself. A large empty slot sensitivity gain strongly suggests that the ad has used a circumvention technique, such as leaving trigger words as empty slots, to deliberately weaken the sensitivity of the surface text and evade machine review. This before-and-after difference calculation overcomes the limitation of traditional risk control systems, which can only examine the literal meaning of the text.
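
The gain calculation reduces to a difference of two maxima, as the following sketch shows. The function name and the category labels in the usage note are hypothetical; the classifier outputs are assumed to arrive as mappings from category name to probability.

```python
from typing import Mapping, Tuple


def slot_sensitivity_gain(
    display_probs: Mapping[str, float],    # classifier output for the final displayed text
    backfill_probs: Mapping[str, float],   # classifier output for the backfill text for review
) -> Tuple[float, float]:
    """Return (empty slot sensitivity gain, maximum sensitivity probability of the backfill text)."""
    max_display = max(display_probs.values())
    max_backfill = max(backfill_probs.values())
    return max_backfill - max_display, max_backfill
```

With illustrative distributions `{"finance": 0.10, "medical": 0.05}` for the displayed text and `{"finance": 0.85, "medical": 0.05}` for the backfill text, the gain is 0.85 - 0.10 = 0.75. Note that the two maxima may come from different categories, since each is taken over its own distribution.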

[0049] Step 5: Calculate the final risk score by fusing segment strength, slot sensitivity gain, and the probability distribution of sensitive categories in the backfilled text used for review. Based on the final risk score, output the identification label and advertising action, specifically including the following steps: Extract the maximum segment strength of all empty slot segments as the global maximum segment strength; extract the maximum empty slot sensitivity gain of all empty slot segments as the global maximum empty slot sensitivity gain; extract the maximum value of the maximum sensitivity probability of all backfilled texts used for review as the global maximum backfill sensitivity probability. Specifically, the system iterates through all data items within the entire ad text calculated in the preceding steps, filters out the maximum value of the segment strength from all empty slot segments, and establishes it as the global maximum segment strength. The system also filters out the maximum value of the empty slot sensitivity gain from all empty slot segments and establishes it as the global maximum empty slot sensitivity gain. Simultaneously, the system aggregates all backfilled text used for review, extracts the maximum value of the maximum sensitivity probability among the backfilled text, and establishes it as the global maximum backfill sensitivity probability. The logic of extreme value extraction is that the system does not need to focus on the safe and normal parts of the ad text, but directly captures the most severe semantic deficiencies, the most abrupt risk gains, and the worst absolute sensitivity in the entire ad, using these as the risk assessment benchmark for determining the compliance of the entire ad.

[0050] The first risk fusion term is obtained by multiplying the global maximum segment strength by the empty slot strength contribution weight; the second risk fusion term is obtained by multiplying the global maximum empty slot sensitivity gain by the risk gain contribution weight; and the third risk fusion term is obtained by multiplying the global maximum backfill sensitivity probability by the absolute risk contribution weight. The final risk score is obtained by adding the first, second, and third risk fusion terms. Specifically, the system multiplies the three extreme values (the global maximum segment strength, the global maximum empty slot sensitivity gain, and the global maximum backfill sensitivity probability) by the preset empty slot strength contribution weight, risk gain contribution weight, and absolute risk contribution weight, respectively, to calculate the first, second, and third risk fusion terms independently. Subsequently, the system adds these three risk fusion terms together to obtain the final risk score used for adjudication.

[0051] It should be noted that the sum of the empty slot strength contribution weight, the risk gain contribution weight, and the absolute risk contribution weight must be strictly equal to one.

[0052] In one specific embodiment, the empty slot strength contribution weight is initially set to 0.25, the risk gain contribution weight is initially set to 0.35, and the absolute risk contribution weight is initially set to 0.40. The significance of this allocation is that the final risk score must consider three dimensions: whether there are obvious empty slots on the text surface, whether the risk increases significantly after backfilling, and whether an actual sensitive level is reached after backfilling. The mere existence of empty slots is not a sufficient condition for blocking ad placement. The factor that determines the risk of violation is the sensitivity exposed after the empty slots are restored and backfilled; therefore, the absolute risk contribution weight and the risk gain contribution weight are set relatively higher.
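
The extreme value extraction and weighted fusion can be sketched as below, using the example weights of 0.25, 0.35, and 0.40. The `SlotResult` container and the choice to return 0.0 when an ad has no empty slot segments are assumptions for illustration and are not specified in the text.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SlotResult:
    """Per-segment quantities computed in the earlier steps (hypothetical container)."""
    segment_strength: float
    sensitivity_gain: float
    backfill_max_prob: float


def final_risk_score(
    slots: Sequence[SlotResult],
    weights: Sequence[float] = (0.25, 0.35, 0.40),  # values from [0052]
) -> float:
    """Fuse the three global maxima into one final risk score."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("fusion weights must sum to one")
    if not slots:
        return 0.0  # assumption: no empty slot segment means no hidden-slot risk
    w_strength, w_gain, w_absolute = weights
    # Each maximum is taken independently, so the three values may come
    # from different empty slot segments.
    max_strength = max(s.segment_strength for s in slots)
    max_gain = max(s.sensitivity_gain for s in slots)
    max_prob = max(s.backfill_max_prob for s in slots)
    return w_strength * max_strength + w_gain * max_gain + w_absolute * max_prob
```

For two segments with values (0.6, 0.2, 0.5) and (0.4, 0.7, 0.9), the global maxima are 0.6, 0.7, and 0.9, giving 0.25 × 0.6 + 0.35 × 0.7 + 0.40 × 0.9 = 0.755.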

[0053] When the final risk score is greater than or equal to the preset risk threshold, an identification tag indicating the existence of hidden sensitive risks is generated, and an action to block the delivery of advertisements is output; when the final risk score is less than the preset risk threshold, an identification tag indicating the absence of hidden sensitive risks is generated, and an action to allow the delivery of advertisements is output.

[0054] Specifically, the system compares the calculated final risk score with a preset risk threshold in the risk control engine. When the final risk score is greater than or equal to the preset risk threshold, the system determines that the advertisement is attempting to evade review by using empty slots for trigger words. It then generates an identification tag indicating a hidden sensitive risk and sends an ad-blocking action to the ad delivery system, directly intercepting the ad material from entering the external display chain. Conversely, when the final risk score is less than the preset risk threshold, the system determines that the ad content is compliant or that the empty slots do not involve high-risk sensitive semantics. It then generates an identification tag indicating no hidden sensitive risk and sends an ad-allowing action, allowing the ad to enter the normal display and distribution process.

[0055] It should be noted that the preset risk threshold is used to convert the continuous final risk score into a definite blocking or allowing action. The preset risk threshold can be set according to the upper limit of the false positive rate allowed by the actual advertising business. In one embodiment, the preset risk threshold is selected on a historical verification dataset as the value at which the risk recall rate is highest while the false positive rate does not exceed the system's preset upper limit. This step closes the loop from the discovery of potential risks, through reverse semantic reconstruction, to the final automated risk adjudication, enabling the system to counter review-evasion strategies in real time without human intervention.
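
One possible way to pick the threshold and apply it is sketched below. The exhaustive search over observed scores, the tie-breaking rule (the lowest qualifying threshold wins), and the label strings are assumptions for illustration; the text only states the selection criterion and the two outcomes.

```python
from typing import Optional, Sequence, Tuple


def select_threshold(
    scores: Sequence[float],   # final risk scores on a historical verification set
    labels: Sequence[bool],    # True = manually confirmed violating ad
    max_fpr: float,            # allowed upper limit of the false positive rate
) -> Optional[float]:
    """Return the threshold with the highest recall whose FPR stays within max_fpr."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    best: Optional[Tuple[float, float]] = None  # (recall, threshold)
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        fpr = fp / negatives if negatives else 0.0
        recall = tp / positives if positives else 0.0
        if fpr <= max_fpr and (best is None or recall > best[0]):
            best = (recall, t)
    return best[1] if best else None


def decide(score: float, threshold: float) -> Tuple[str, str]:
    """Map a final risk score to (identification tag, ad handling action)."""
    if score >= threshold:
        return ("hidden_sensitive_risk", "block")
    return ("no_hidden_sensitive_risk", "allow")
```

Scores equal to the threshold are blocked, matching the "greater than or equal to" rule in the text.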

[0056] The above embodiments are only used to illustrate the technical methods of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical methods of the present invention without departing from the spirit and scope of the technical methods of the present invention.

Claims

1. A method for real-time identification of sensitive words in advertising content based on large model technology, characterized in that it comprises: obtaining the final displayed text, the trigger keyword set, and the target page semantic item set; calculating the empty slot candidate degree based on the context recovery probability of each final displayed text word segment and the vector similarity between the word segment and the trigger keyword set and the target page semantic item set; comparing the empty slot candidate degree with the empty slot segment threshold, aggregating to generate empty slot segments, extracting the context window, and calculating the segment strength; inputting the context window, the trigger keyword set, and the target page semantic item set into the large model to generate an initial candidate semantic set, and cleaning it to obtain an expanded candidate set; establishing a fitness function that integrates the vector similarities with the context window, the trigger keyword set, the target page semantic item set, and the sensitive semantic prior library; using the particle swarm optimization algorithm to find the optimal semantic vector; extracting, from the expanded candidate set, the candidate phrase whose semantic vector is closest to the optimal semantic vector as the optimal backfill trigger semantic; replacing the empty slot segments in the final displayed text to construct the backfill text for review; inputting the final displayed text and the backfill text for review into the large model's sensitivity classifier, and calculating the difference in the probability distributions of sensitive categories to obtain the empty slot sensitivity gain; calculating the final risk score by fusing the segment strength, the empty slot sensitivity gain, and the sensitive category probability distribution of the backfill text for review; and outputting the identification label and the advertising action based on the final risk score.

2. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that calculating the empty slot candidate degree based on the context recovery probability of the final displayed text word segment and the vector similarity between the word segment and the trigger keyword set and the target page semantic item set includes: extracting the logarithm of the context recovery probability of the final displayed text word segment as the conditional semantic contribution value; converting the final displayed text word segment into a word segment semantic vector; converting each trigger keyword in the trigger keyword set into a trigger keyword semantic vector; subtracting, from the value one, the maximum vector cosine similarity between the word segment semantic vector and the trigger keyword semantic vectors to obtain the keyword semantic decoupling degree; converting each target page semantic item in the target page semantic item set into a target page semantic item semantic vector; subtracting, from the value one, the maximum vector cosine similarity between the word segment semantic vector and the target page semantic item semantic vectors to obtain the page semantic gap degree; inputting the conditional semantic contribution value into the preset Sigmoid smoothing function to obtain the smoothed mapping value; subtracting the smoothed mapping value from the value one to obtain the smoothed mapping difference; multiplying the smoothed mapping difference by the conditional semantic weight parameter to obtain the basic empty slot risk value; multiplying the keyword semantic decoupling degree by the decoupling degree weight parameter to obtain the first risk superposition value; multiplying the page semantic gap degree by the gap degree weight parameter to obtain the second risk superposition value; and adding the basic empty slot risk value, the first risk superposition value, and the second risk superposition value to obtain the empty slot candidate degree.

3. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that comparing the empty slot candidate degree with the empty slot segment threshold, aggregating to generate empty slot segments, extracting the context window, and calculating the segment strength includes: extracting the positions of the final displayed text word segments whose empty slot candidate degree is greater than the empty slot segment threshold as high candidate positions; connecting high candidate positions with adjacent or nearby position indices into continuous text intervals to generate empty slot segments; retaining empty slot segments whose number of final displayed text word segments lies within the preset lower and upper limits, and splitting empty slot segments whose number of final displayed text word segments exceeds the preset upper limit; extracting a preset number of final displayed text word segments before and after the empty slot segment as the context window; and taking the average of the empty slot candidate degrees of all final displayed text word segments within the empty slot segment as the segment strength.

4. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that, The context window, the set of triggering keywords, and the set of semantic items for the target page are input into the large model to generate an initial candidate semantic set. After cleaning, an expanded candidate set is obtained, including: Using a large model, candidate phrases that meet the preset length limit and do not belong to the preset risk category name are generated based on the context window, the set of trigger keywords, and the set of semantic items on the target page. These candidate phrases are then used to form the initial candidate semantic set. Remove candidate phrases from the initial candidate semantic set that have completely duplicated content, and remove candidate phrases from the initial candidate semantic set that are grammatically incompatible with the context window; Based on the set of triggering keywords and the set of semantic items on the target page, synonym expansion is performed on the candidate phrases in the initial candidate semantic set, and candidate phrases exceeding the preset expansion length limit are truncated to obtain the expanded candidate set.

5. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that, If the large model fails to generate an initial set of candidate semantics, a degenerate backfilling step is performed, including: Convert the context window into a context window semantic vector; Each trigger keyword in the trigger keyword set is converted into a trigger keyword semantic vector. The vector cosine similarity between each trigger keyword semantic vector and the context window semantic vector is calculated. The trigger keyword with the maximum vector cosine similarity is extracted as the optimal backfill trigger semantic. If the set of triggering keywords contains zero triggering keywords, then each target page semantic item in the target page semantic item set is converted into a target page semantic item semantic vector. The vector cosine similarity between each target page semantic item semantic vector and the context window semantic vector is calculated, and the target page semantic item corresponding to the maximum vector cosine similarity is extracted as the optimal backfill triggering semantic. If the number of trigger keywords in the trigger keyword set is zero, and the number of target page semantic items in the target page semantic item set is zero, then it is determined that the empty slot segment has no backfill trigger semantics, and the construction of backfill text for review is stopped.

6. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that establishing a fitness function that integrates the vector similarities with the context window, the trigger keyword set, the target page semantic item set, and the sensitive semantic prior library includes: converting each item in the trigger keyword set and the target page semantic item set into a semantic vector, and calculating the mean of the vectors of each set to obtain the aggregated semantic vector of the trigger keyword set and the aggregated semantic vector of the target page semantic item set; converting the context window into a context window semantic vector; converting each item in the sensitive semantic prior library into a sensitive semantic prior library semantic vector; multiplying the vector cosine similarity between the particle position and the context window semantic vector by the first fitness weight parameter to obtain the first fitness term; multiplying the vector cosine similarity between the particle position and the aggregated semantic vector of the trigger keyword set by the second fitness weight parameter to obtain the second fitness term; multiplying the vector cosine similarity between the particle position and the aggregated semantic vector of the target page semantic item set by the third fitness weight parameter to obtain the third fitness term; taking the maximum vector cosine similarity between the particle position and the semantic vectors in the sensitive semantic prior library as the risk proximity, and multiplying the risk proximity by the fourth fitness weight parameter to obtain the fourth fitness term; and adding the first fitness term, the second fitness term, the third fitness term, and the fourth fitness term to form the fitness function.

7. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that extracting, from the expanded candidate set, the candidate phrase whose semantic vector is closest to the optimal semantic vector as the optimal backfill trigger semantic includes: converting each candidate phrase in the expanded candidate set into a candidate phrase semantic vector; calculating the vector cosine similarity between each candidate phrase semantic vector and the optimal semantic vector; and extracting the candidate phrase with the maximum vector cosine similarity as the optimal backfill trigger semantic.
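
Claim 7 maps the continuous optimum found by particle swarm optimization back to a concrete phrase. A minimal sketch of that nearest-phrase lookup is given below; the candidate vectors are assumed to be precomputed, and `nearest_candidate` is a hypothetical name.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_candidate(
    optimal_vec: Vector,
    candidates: Sequence[Tuple[str, Vector]],  # (candidate phrase, semantic vector) pairs
) -> str:
    """Return the candidate phrase whose vector is most similar to the optimal semantic vector."""
    if not candidates:
        raise ValueError("expanded candidate set is empty")
    # max() keeps the first phrase on ties, so the result is deterministic.
    best_phrase, _ = max(candidates, key=lambda item: cosine(item[1], optimal_vec))
    return best_phrase
```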

8. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that replacing the empty slot segment in the final displayed text to construct the backfill text for review includes: replacing the segment of the final displayed text that contains the empty slot with the optimal backfill trigger semantic; re-segmenting the adjacent final displayed text while maintaining the original field order and punctuation boundaries of the final displayed text, to generate the backfill text for review; and using the backfill text for review as temporary data solely for sensitive risk calculation, so that it is never written into the rendering data stream of the final displayed text.
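
The replacement step of claim 8 can be illustrated on a token list. The sketch assumes the final displayed text is already segmented into tokens that carry their own whitespace and punctuation, and that the empty slot segment is given as a half-open token range; both representations and the name `build_review_text` are assumptions for illustration only.

```python
from typing import List, Sequence


def build_review_text(
    tokens: Sequence[str],  # segmented final displayed text, in original field order
    slot_start: int,        # index of the first token of the empty slot segment
    slot_end: int,          # index one past the last token of the segment
    backfill: str,          # optimal backfill trigger semantic
) -> str:
    """Return a temporary review string with the slot segment replaced by the backfill phrase."""
    if not 0 <= slot_start < slot_end <= len(tokens):
        raise ValueError("slot range must be a non-empty span inside tokens")
    # Build a new list so the caller's tokens (the rendering data) are never modified.
    patched: List[str] = list(tokens[:slot_start]) + [backfill] + list(tokens[slot_end:])
    return "".join(patched)
```

Because the function returns a new string and leaves `tokens` untouched, the review text stays separate from whatever is rendered to the user.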

9. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 1, characterized in that obtaining the empty slot sensitivity gain by calculating the difference between the sensitive category probability distributions includes: obtaining, from the output of the sensitive classifier of the large model, the sensitive category probability distribution of the backfill text for review and the sensitive category probability distribution of the final displayed text; extracting the maximum probability value from the sensitive category probability distribution of the backfill text for review as the maximum sensitivity probability of the backfill text; extracting the maximum probability value from the sensitive category probability distribution of the final displayed text as the maximum sensitivity probability of the final displayed text; and subtracting the maximum sensitivity probability of the final displayed text from the maximum sensitivity probability of the backfill text to obtain the empty slot sensitivity gain.
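
The gain in claim 9 is the difference between two maxima, so the two maxima may come from different sensitive categories. A minimal sketch, assuming the classifier output is available as a mapping from category name to probability (the category names in the usage note are invented for illustration):

```python
from typing import Mapping


def slot_sensitivity_gain(
    backfill_probs: Mapping[str, float],  # sensitive category -> probability, backfill text for review
    display_probs: Mapping[str, float],   # sensitive category -> probability, final displayed text
) -> float:
    """Maximum sensitivity probability after backfilling minus the maximum before backfilling."""
    if not backfill_probs or not display_probs:
        raise ValueError("both probability distributions must be non-empty")
    return max(backfill_probs.values()) - max(display_probs.values())
```

For example, with `{"gambling": 0.7, "medical": 0.2}` after backfilling and `{"gambling": 0.1, "medical": 0.3}` before, the gain is approximately 0.4 (0.7 minus 0.3).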

10. The real-time identification method for sensitive words in advertising content based on large model technology according to claim 9, characterized in that calculating the final risk score by fusing the segment strength, the empty slot sensitivity gain, and the sensitive category probability distribution of the backfill text for review, and outputting the identification tag and the advertisement handling action based on the final risk score, includes: extracting the maximum segment strength over all empty slot segments as the global maximum segment strength; extracting the maximum empty slot sensitivity gain over all empty slot segments as the global maximum empty slot sensitivity gain; extracting the maximum of the maximum sensitivity probabilities of all backfill texts for review as the global maximum backfill sensitivity probability; multiplying the global maximum segment strength by the empty slot strength contribution weight to obtain the first risk fusion term; multiplying the global maximum empty slot sensitivity gain by the risk gain contribution weight to obtain the second risk fusion term; multiplying the global maximum backfill sensitivity probability by the absolute risk contribution weight to obtain the third risk fusion term; adding the first risk fusion term, the second risk fusion term, and the third risk fusion term to obtain the final risk score; when the final risk score is greater than or equal to the preset risk threshold, generating an identification tag indicating that a hidden sensitive risk exists and outputting an action to block delivery of the advertisement; and when the final risk score is less than the preset risk threshold, generating an identification tag indicating that no hidden sensitive risk exists and outputting an action to allow delivery of the advertisement.
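
The fusion and thresholding in claim 10 can be sketched as below. Each of the three global maxima is taken independently over all empty slot segments, so they need not come from the same segment. The `SlotResult` container, the tag and action strings, and the weights and threshold used in any call are hypothetical; the claims do not fix their values.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SlotResult:
    strength: float           # segment strength of one empty slot segment
    gain: float               # empty slot sensitivity gain for that segment
    backfill_max_prob: float  # maximum sensitivity probability of its backfill text


def final_risk(
    slots: Sequence[SlotResult],
    w_strength: float,  # empty slot strength contribution weight
    w_gain: float,      # risk gain contribution weight
    w_abs: float,       # absolute risk contribution weight
    threshold: float,   # preset risk threshold
) -> Tuple[float, str, str]:
    """Return (final risk score, identification tag, handling action)."""
    if not slots:
        raise ValueError("at least one empty slot segment is required")
    score = (
        w_strength * max(s.strength for s in slots)
        + w_gain * max(s.gain for s in slots)
        + w_abs * max(s.backfill_max_prob for s in slots)
    )
    # A score equal to the threshold is treated as risky, matching "greater than or equal to".
    if score >= threshold:
        return score, "hidden_sensitive_risk", "block"
    return score, "no_hidden_sensitive_risk", "allow"
```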