Entity word heat calculation method and device, equipment and storage medium
By acquiring search click logs and entity word webpage data, computing popularity scores from them, and combining multi-source data correction with time series analysis, the method addresses changes in entity word popularity in time-sensitive scenarios, identifies users' primary needs, and achieves a reasonable ranking of entity words.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- TERMINUSBEIJING TECH CO LTD
- Filing Date
- 2023-03-10
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies cannot effectively address the issue of changes in entity word popularity in time-sensitive scenarios, and cannot identify users' primary needs, resulting in unreasonable sorting of entity words with the same name and type.
By acquiring entity IDs, impressions, clicks, display locations, and field information from search click log datasets and entity word webpage datasets, we calculate entity word click popularity, search popularity, and initial popularity. We then combine weights to calculate entity word popularity and use multi-source data correction and time series analysis to identify hot entity words.
It enables accurate identification of users' main needs in time-sensitive scenarios, solves the problem of partial ordering of entity words with the same name and type, and ensures the reasonable ranking of hot entity words in search and recommendation.
Figure CN116304271B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of data processing, and in particular to information retrieval technology, specifically to a method, apparatus, device, and storage medium for calculating entity word popularity.
Background Technology
[0002] In search and recommendation scenarios, a ranking mechanism is typically needed to express the priority of entity words, both within the same category and overall. Entity word popularity reflects the degree of user attention to an entity word in search and recommendation scenarios and directly influences the selection of encyclopedia entries. A well-founded popularity score helps rank entities with the same name or the same type reasonably (the partial-order problem) and improves priority-based ranking in time-sensitive scenarios, including for trending entities. As a result, in search queries and entity disambiguation scenarios that lack context or involve similar entity words, the more popular entity words are prioritized.
[0003] Currently, entity word popularity is determined from page views or from PageRank over a knowledge graph. While this approach handles most static popularity ranking problems, it cannot track changes in entity word popularity in time-sensitive scenarios, nor can it identify the user's primary need. For example, when a user searches for "Zhang San", in most scenarios the user is looking for "actor Zhang San"; if the currently trending entity word is "host Zhang San", real-time popularity is needed to determine whether "host Zhang San" is now the primary need.
Summary of the Invention
[0004] This disclosure provides a method, apparatus, device, and storage medium for calculating entity word popularity.
[0005] According to a first aspect of this disclosure, a method for calculating entity word popularity is provided. The method includes:
[0006] Obtain the entity ID, impressions, clicks, display location, and field information for each entity word with the same entity ID from the search click log dataset and the entity word webpage dataset;
[0007] Calculate the target entity word's click popularity, search popularity, and initial popularity;
[0008] The entity word popularity of the target entity word is calculated based on the entity word click popularity, the entity word search popularity, and the entity word initial popularity.
[0009] In addition to the aspects and any possible implementations described above, a further implementation is provided in which the calculation of the entity word click popularity of the target entity word includes:
[0010] Calculate the average click position for each entity word based on its display position and click volume;
[0011] Calculate the click score of the target entity word based on the average click position of the target entity word;
[0012] Calculate the average click score for all entity words based on the average click location of all entity words;
[0013] When the target entity word is an entity word with impressions and clicks, the entity word click popularity is calculated based on the target entity word's clicks, impressions, and click score, the average click score of all entity words, and the minimum clicks and impressions among the entity words in the preset sorting; wherein, the entity words in the preset sorting are the top n entity words obtained by sorting all entity words in descending order according to their impressions, and n is a positive integer greater than or equal to 1;
[0014] When the target entity word is an entity word with impressions but no clicks, the entity word click popularity is calculated by standardizing the impressions of the target entity word and the largest impression among all entity words without clicks.
[0015] In addition to the aspects and any possible implementations described above, a further implementation is provided in which the calculation of the entity word search popularity of the target entity word includes:
[0016] Based on the impressions of each entity word and the total number of entity words, calculate the average and standard deviation of the impressions of all entity words.
[0017] Calculate the standardized value by applying a standard normal distribution to the total average and standard deviation of all entity word impressions;
[0018] The search popularity of the target entity word is calculated based on the standardized value, the impressions of the target entity word, and the total average impressions of all entity words.
[0019] In addition to the aspects and any possible implementations described above, a further implementation is provided in which the calculation of the entity word search popularity of the target entity word includes:
[0020] Based on the impressions and clicks of the target entity word, the standardized value, the impressions of the target entity word, and the total average impressions of all entity words are weighted.
[0021] The search popularity of the target entity word is calculated based on the weighted standardized value, the impressions of the target entity word, and the total average impressions of all entity words.
[0022] In addition to the aspects and any possible implementations described above, a further implementation is provided in which the calculation of the initial popularity of the target entity word includes:
[0023] Based on the field information of the target entity words, static feature scores and dynamic feature scores are obtained. The static feature scores are calculated based on the number of duplicate words in the abstract, the number and length of words, the number of duplicate words in the main text, the number and length of words, the number of directories, the number of page attributes, the number of tags, the number of references, and the number of internal links in the field information. The dynamic feature scores are calculated based on the page views, number of edits, the most recent edit time, and the number of user likes in the field information.
[0024] The initial popularity of the target entity word is calculated based on the static feature score and the dynamic feature score.
[0025] In addition to the aspects described above and any possible implementations, a further implementation is provided, wherein before obtaining the entity ID, impressions, clicks, display position, and field information corresponding to each entity word with the same entity ID in the search click log dataset and the entity word webpage dataset, the method further includes:
[0026] Obtain data sources, including search click log datasets and entity word webpage datasets;
[0027] Extract the entity ID, impressions, clicks, and display location corresponding to each entity word in the search click log dataset, and the entity ID and field information corresponding to each entity word in the entity word webpage dataset;
[0028] Match the entity words in the search click log dataset with the entity word webpage dataset to obtain the entity ID, impressions, clicks, display position, and field information corresponding to each entity word with the same entity ID.
[0029] In addition to the aspects described above and any possible implementation, a further implementation is provided in which the data source further includes a search results dataset corresponding to the entity words, a user search session control log dataset, and a news article dataset, for verifying each entity word with a consistent entity ID.
[0030] According to a second aspect of this disclosure, an entity word popularity calculation device is provided. The device includes:
[0031] The acquisition module is used to obtain the entity ID, impressions, clicks, display location, and field information corresponding to each entity word with the same entity ID in the search click log dataset and entity word webpage dataset;
[0032] The calculation module is used to calculate the click popularity, search popularity, and initial popularity of the target entity word.
[0033] The calculation module is also used to calculate the entity word popularity of the target entity word based on the entity word click popularity, the entity word search popularity, and the entity word initial popularity.
[0034] According to a third aspect of this disclosure, an electronic device is provided. The electronic device includes a memory and a processor, wherein the memory stores a computer program, and the processor executes the program to implement the method described above.
[0035] According to a fourth aspect of this disclosure, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the method described above.
[0036] This application provides a method, apparatus, device, and storage medium for calculating entity word popularity. It can obtain the entity ID, impressions, clicks, display position, and field information corresponding to each entity word with the same entity ID in the search click log dataset and entity word webpage dataset. Then, it calculates the entity word click popularity, entity word search popularity, and initial popularity of the target entity word, and then calculates the entity word popularity of the target entity word. This can solve the problem of changes in entity word popularity in time-sensitive scenarios and identify the user's main needs.
[0037] It should be understood that the Summary of the Invention is not intended to identify key or essential features of the embodiments of this disclosure, nor to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.
Attached Figure Description
[0038] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. The drawings are provided for a better understanding of the invention and are not intended to limit the scope of this disclosure. In the drawings, the same or similar reference numerals denote the same or similar elements, wherein:
[0039] Figure 1 shows a flowchart of an entity word popularity calculation method according to an embodiment of the present disclosure;
[0040] Figure 2 shows a block diagram of an entity word popularity calculation apparatus according to an embodiment of the present disclosure;
[0041] Figure 3 shows a block diagram of an exemplary electronic device capable of implementing embodiments of the present disclosure.
Detailed Implementation
[0042] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.
[0043] Furthermore, the term "and / or" in this document merely describes the relationship between related objects and indicates that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this document generally indicates that the preceding and following related objects have an "or" relationship.
[0044] This disclosure can solve the problem of changes in the popularity of entity words in time-sensitive scenarios and identify the main user needs.
[0045] Figure 1 shows a flowchart of an entity word popularity calculation method 100 according to an embodiment of the present disclosure.
[0046] In box 110, obtain the entity ID, impressions, clicks, display location, and field information for each entity word with the same entity ID in the search click log dataset and the entity word webpage dataset.
[0047] In some embodiments, the search click log dataset can be a dataset constructed by crawling search click log data within a preset time window. The entity word webpage dataset can be a dataset constructed by crawling entity word webpage data from a specified website.
[0048] The preset time can be set according to the user's actual needs, such as real-time crawling, periodic crawling, or crawling within a specified time period. The specified website can be specified according to the user's actual needs, such as a specific encyclopedia website or a specific search engine.
[0049] In some embodiments, the search click log dataset includes the entity ID, impressions (pv), clicks (click), and display position (pos) for each entity word. The entity word webpage dataset includes the entity ID, entity word name, and field information for each entity word.
[0050] The display location can be the position corresponding to the number of impressions and clicks of the entity word. The field information includes the definition name information, which is used to clearly identify the object referred to by the entity word entry name, such as the information of the profession, position, identity and other attributes of the person referred to, as well as the information of the obvious time-sensitive attributes such as "young singer" or "youth".
[0051] In box 120, calculate the entity word click popularity, entity word search popularity, and entity word initial popularity.
[0052] In some embodiments, based on the entity ID, pageviews (pv), clicks (click), position (pos), and field information corresponding to each entity word with the same entity ID, the entity word click popularity, entity word search popularity, and initial entity word popularity of the target entity word are calculated respectively.
[0053] In box 130, the entity word popularity of the target entity word is calculated based on the entity word click popularity, entity word search popularity, and entity word initial popularity.
[0054] In some embodiments, the popularity of a target entity word can be calculated using the following formula:
[0055] hscore=1.0*search_hot_score+alpha*init_hot_score+0.1*click_hot_score
[0056] Wherein, hscore represents the entity word popularity, search_hot_score represents the entity word search popularity, init_hot_score represents the initial popularity of the entity word, alpha represents the proportion of the initial popularity of the entity word (0~1, default 0.85), and click_hot_score represents the entity word click popularity.
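The weighted combination in [0055] can be sketched in a few lines of Python. This is only an illustration of the stated formula: the function name `entity_hot_score` and the range check on alpha are assumptions added here, while the coefficients 1.0 and 0.1 and the default alpha of 0.85 come from the text above.

```python
def entity_hot_score(search_hot_score: float,
                     init_hot_score: float,
                     click_hot_score: float,
                     alpha: float = 0.85) -> float:
    """Combine the three component scores as in paragraph [0055].

    alpha is the proportion given to the initial popularity (0 to 1,
    default 0.85 per the text). The range check is an added safeguard,
    not something the source specifies.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 * search_hot_score
            + alpha * init_hot_score
            + 0.1 * click_hot_score)


if __name__ == "__main__":
    # Illustrative input values only; they are not taken from the source.
    print(entity_hot_score(10.0, 20.0, 30.0))
```

Because the click popularity is weighted at only 0.1, the search popularity and the initial popularity dominate the combined score under these coefficients.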
[0057] According to the embodiments of this disclosure, the following technical effects are achieved:
[0058] The above method can obtain the entity ID, impressions, clicks, display position, and field information corresponding to each entity word with the same entity ID in the search click log dataset and entity word webpage dataset. Then, it can calculate the entity word click popularity, entity word search popularity, and initial popularity of the target entity word, and then calculate the entity word popularity of the target entity word. This can solve the problem of changes in entity word popularity in time-sensitive scenarios and identify the user's main needs. That is, it can solve the problem of partial order of entity words with the same name and entity words of the same type. At the same time, it can solve the problem of popularity ranking of suddenly popular entity words in time-sensitive hot spot scenarios.
[0059] In some embodiments, before obtaining the entity ID, impressions, clicks, display location, and field information corresponding to each entity word with the same entity ID in the search click log dataset and entity word webpage dataset, the above method further includes:
[0060] Obtain data sources, including search click log datasets and entity word webpage datasets;
[0061] Extract the entity ID, impressions, clicks, and display location for each entity word in the search click log dataset, and the entity ID and field information for each entity word in the entity word webpage dataset;
[0062] Match the entity words in the search click log dataset with the entity word webpage dataset to obtain the entity ID, impressions, clicks, display location, and field information corresponding to each entity word with the same entity ID.
[0063] In some embodiments, after acquiring the data source, the data source can be preprocessed. For example, based on the entity word links in the search click log dataset, the entity ID, impressions (pv), clicks (click), and display position (pos) corresponding to each entity word can be extracted. As another example, the entity word webpage dataset can be parsed to extract the entity ID, entity word name, and field information corresponding to each entity word.
[0064] In some embodiments, based on the entity ID corresponding to each entity word, the entity words in the search click log dataset and the entity word webpage dataset are matched, and each entity word with the same entity ID is filtered out to obtain the entity ID, pageviews (PV), clicks (clicks), position (pos), and field information corresponding to each entity word with the same entity ID.
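The matching step described in [0062] and [0064] amounts to an inner join of the two datasets on entity ID. A minimal sketch follows, assuming each dataset has already been preprocessed into a dictionary keyed by entity ID; the function name `match_by_entity_id` and the record layout (`pv`, `click`, `pos`, `name`, `fields`) are hypothetical choices for illustration.

```python
from typing import Any, Dict


def match_by_entity_id(click_log: Dict[str, Dict[str, Any]],
                       webpages: Dict[str, Dict[str, Any]]
                       ) -> Dict[str, Dict[str, Any]]:
    """Keep only entity IDs present in both datasets and merge their records."""
    matched: Dict[str, Dict[str, Any]] = {}
    # Intersection of the key sets: entity words with the same entity ID.
    for entity_id in click_log.keys() & webpages.keys():
        log_row = click_log[entity_id]
        page_row = webpages[entity_id]
        matched[entity_id] = {
            "pv": log_row["pv"],          # impressions
            "click": log_row["click"],    # clicks
            "pos": log_row["pos"],        # display position
            "name": page_row["name"],     # entity word name
            "fields": page_row["fields"], # field information
        }
    return matched


if __name__ == "__main__":
    # Hypothetical sample records.
    log = {"e1": {"pv": 120, "click": 30, "pos": 45},
           "e2": {"pv": 15, "click": 0, "pos": 0}}
    pages = {"e1": {"name": "Zhang San", "fields": {"definition": "actor"}},
             "e3": {"name": "Li Si", "fields": {"definition": "singer"}}}
    print(match_by_entity_id(log, pages))
```

Entity words that appear in only one dataset (here `e2` and `e3`) are dropped, which mirrors the filtering described above.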
[0065] According to embodiments of this disclosure, a specific method is provided for obtaining each entity word with the same entity ID in the search click log dataset and the entity word webpage dataset, thereby improving the accuracy of calculating the entity word popularity of the target entity word.
[0066] In some embodiments, the data source may further include a search results dataset corresponding to the entity word, a user search session control log dataset, and a news article dataset, used to verify each entity word with the same entity ID.
[0067] In some embodiments, the search results dataset corresponding to entity words can be a dataset constructed by crawling search results from a search engine using entity words as query terms. The user search session control log dataset can be a dataset constructed by crawling, in real time, the attributes and configuration information required for user search sessions. The news article dataset can be a dataset constructed by crawling news articles in real time.
[0068] In some embodiments, the data source can be preprocessed after it has been acquired.
[0069] For example, based on the search results containing entity word links, the entity IDs corresponding to the results can be extracted and converted into their own entity IDs through the mapping relationship.
[0070] For example, you can parse user search session control log datasets and, within the same session log data, obtain data where co-occurring entity words are used as query terms and have been displayed but not clicked, and data where modifiers plus entity words are used as query terms and have been displayed and clicked.
[0071] For example, news article datasets can be parsed, entity words can be extracted and disambiguated and linked to entity word IDs, the word frequency of entity words can be counted, historical entity word frequency sequences can be merged, and time-series features can be constructed.
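The counting and merging part of [0071] can be sketched as follows. The sketch assumes that entity extraction, disambiguation, and linking have already produced `(date, entity_id)` pairs; the function name `build_frequency_series`, the ISO date strings, and the per-day granularity are illustrative assumptions rather than details given in the source.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_frequency_series(mentions: Iterable[Tuple[str, str]],
                           history: Dict[str, Dict[str, int]]
                           ) -> Dict[str, List[Tuple[str, int]]]:
    """Count entity mentions per day and merge them with historical counts.

    mentions: (date, entity_id) pairs from newly crawled articles.
    history:  entity_id -> {date: count} from earlier runs.
    Returns entity_id -> list of (date, count) sorted by date.
    """
    merged: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    # Start from the historical frequency sequences.
    for entity_id, series in history.items():
        for date, count in series.items():
            merged[entity_id][date] += count
    # Add the new word-frequency counts.
    for date, entity_id in mentions:
        merged[entity_id][date] += 1
    # ISO dates sort chronologically as plain strings.
    return {eid: sorted(series.items()) for eid, series in merged.items()}


if __name__ == "__main__":
    new_mentions = [("2023-03-09", "e1"), ("2023-03-09", "e1"),
                    ("2023-03-10", "e1"), ("2023-03-10", "e2")]
    old = {"e1": {"2023-03-08": 1}}
    print(build_frequency_series(new_mentions, old))
```

The resulting per-entity sequences are the kind of input a time-series analysis could consume; which analysis is applied is not specified in the text.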
[0072] In some embodiments, to improve the accuracy of calculating the popularity of target entity words, multi-source data is acquired, namely, search click log datasets, entity word webpage datasets, entity word corresponding search result datasets, user search session control log datasets, and news article datasets. The impressions, clicks, display positions, and field information corresponding to each entity word with the same entity ID are corrected in order to improve the accuracy of calculating the popularity of target entity words.
[0073] According to embodiments of this disclosure, by acquiring multi-source data, the accuracy of calculating the popularity of target entity words can be further improved.
[0074] In some embodiments, after the entity word popularity of the target entity word has been calculated, the score may fail to keep up with rapid changes in popularity in real-time hot topic scenarios. In that case, hot entity words can be identified in two ways. First, the user search session control log dataset is analyzed statistically to determine whether the same entity word has been clicked within the same session. Second, entity IDs obtained through entity word recognition and disambiguation in articles are used to construct time-series features, and time-series analysis is used to discover trending entity words. Both checks verify whether an entity word is trending in a real-time hot topic scenario; the entity words are then re-ranked, which addresses popularity ranking in such scenarios and further improves the accuracy of the calculated entity word popularity.
[0075] In some embodiments, after calculating the entity popularity of the target entity term, the primary demand for the entity term under the alias can be selected by voting after aligning the entity terms from various search engine results. For example, if "Old Beijing" is used as a query term, the entity terms include "cultural symbols of Beijing" - Old Beijing, and "American male basketball player" - alias of LeBron James. Based on entity popularity, "LeBron James" has a higher popularity score than "Old Beijing". To address this issue, a vote can be conducted using the results of various search engines to select "cultural symbols of Beijing" as the primary demand for "Old Beijing". This resolves the problem of insufficient primary demand for aliases and further improves the accuracy of calculating the entity popularity of the target entity term.
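The voting idea in [0075] can be illustrated with a simple majority vote. The sketch assumes that each search engine's results for the alias query have already been aligned to a single top entity ID; the function name `vote_primary_entity`, the entity ID strings, and the tie-breaking behavior are hypothetical, since the source does not describe them.

```python
from collections import Counter
from typing import List, Optional


def vote_primary_entity(engine_top_entities: List[str]) -> Optional[str]:
    """Return the entity ID chosen by the most search engines.

    engine_top_entities holds one aligned entity ID per search engine.
    Ties are resolved in favor of the entity seen first, because
    Counter.most_common keeps first-encountered order for equal counts.
    """
    if not engine_top_entities:
        return None
    winner, _votes = Counter(engine_top_entities).most_common(1)[0]
    return winner


if __name__ == "__main__":
    # Hypothetical aligned results from three engines for the query "Old Beijing".
    votes = ["old_beijing_culture", "lebron_james", "old_beijing_culture"]
    print(vote_primary_entity(votes))
```

With this kind of vote, the alias query can resolve to the entity most engines agree on even when another entity has a higher popularity score.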
[0076] In some embodiments, the above calculation of the entity word click popularity of the target entity word includes:
[0077] Calculate the average click position for each entity word based on its display position and click volume;
[0078] Calculate the click score of the target entity word based on the average click position of the target entity word;
[0079] Calculate the average click score for all entity words based on the average click location of all entity words;
[0080] When the target entity word is an entity word with impressions and clicks, the entity word click popularity is calculated based on the target entity word's clicks, impressions, and click score, the average click score of all entity words, and the minimum clicks and impressions among the entity words in the preset sorting. Among these, the entity words in the preset sorting are the top n entity words obtained by sorting all entity words in descending order of their impressions, where n is a positive integer greater than or equal to 1.
[0081] When the target entity word is an entity word with impressions but no clicks, the entity word click popularity is calculated by standardizing the impressions of the target entity word and the largest impression among all entity words without clicks.
[0082] In some embodiments, each entity word includes a target entity word, and the average click location for each entity word can be calculated using the following formula:
[0083] click_avg_pos = pos / click
[0084] Here, click_avg_pos represents the average click position for each entity word, pos represents the display position, and click represents the number of clicks.
[0085] In some embodiments, when calculating the click popularity of a target entity word, noise reduction can be performed by filtering out entity words that are below a preset impression threshold in advance, so as to improve the accuracy of calculating the click popularity of the target entity word.
[0086] In some embodiments, the click popularity of a target entity word can be calculated using the Bayesian average method.
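For readers unfamiliar with the technique named in [0086], the general form of a Bayesian average is sketched below. This is the generic textbook form only; the exact click popularity formula of this disclosure is not reproduced in this text and may differ. All parameter names (`item_mean`, `item_count`, `prior_mean`, `prior_weight`) are assumptions introduced for illustration.

```python
def bayesian_average(item_mean: float,
                     item_count: float,
                     prior_mean: float,
                     prior_weight: float) -> float:
    """Generic Bayesian average: shrink an item's mean toward a prior mean.

    Items with few observations stay close to prior_mean; items with many
    observations approach their own item_mean. This is NOT the source's
    formula, only the general technique it names.
    """
    total = prior_weight + item_count
    if total == 0:
        return prior_mean
    return (prior_weight * prior_mean + item_count * item_mean) / total


if __name__ == "__main__":
    # Illustrative values: a score of 100 backed by only 2 observations
    # is pulled toward the prior mean of 60.
    print(bayesian_average(100.0, 2, 60.0, 8))
```

The practical effect is that an entity word with very few clicks cannot obtain an extreme score purely by chance.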
[0087] In some embodiments, the click score of the target entity word can be calculated using the following formula:
[0088] click_pos_score=100-(click_avg_pos-1)*10
[0089] Here, click_pos_score represents the click score of the target entity word, and click_avg_pos represents the average click position of the target entity word.
[0090] In some embodiments, the average click score for all entity words can be calculated using the following formula:
[0091] avg_score=100-(avg_pos-1)*10
[0092] Here, avg_score represents the average click score for all entity words, and avg_pos represents the average click position for all entity words.
[0093] In some embodiments, the click score for each entity word can be calculated based on the process of calculating the click score for the target entity word, and then the ranking score avg_score can be initialized based on the average click position of each entity word. The initial ranking score can be set according to the user's actual needs.
[0094] For example, you can set the click score for the entity word that ranks first on average to be 100, the click score for the entity word that ranks second to be 90, and so on, with the click score for the entity word that ranks tenth to be 10.
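The position-to-score mapping can be checked with a short Python sketch. It follows the linear form of the avg_score formula in [0091] and the example values in [0094]. One assumption is made loudly here: `pos` in [0083] is interpreted as the sum of the display positions of the clicked results, which the source does not state explicitly; the function names are likewise illustrative.

```python
def click_avg_pos(pos_sum: float, click: int) -> float:
    """Average click position, following click_avg_pos = pos / click.

    Assumption: pos_sum is the sum of display positions over all clicks.
    """
    if click <= 0:
        raise ValueError("click must be positive")
    return pos_sum / click


def click_pos_score(avg_pos: float) -> float:
    """Linear score: position 1 maps to 100, and each further position costs 10."""
    return 100 - (avg_pos - 1) * 10


if __name__ == "__main__":
    for position in (1, 2, 10):
        print(position, click_pos_score(position))
```

Under this mapping an average position of 1 scores 100, position 2 scores 90, and position 10 scores 10, matching the example above.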
[0095] In some embodiments, all entity IDs corresponding to entity words are sorted in descending order of impressions. The click count and impression count of the nth entity ID are obtained. Then, the minimum impression count (topN_min_pv) and the minimum click count (topN_min_click) among the top N are selected. Here, n is a positive integer greater than or equal to 1, and n can be set according to the user's actual needs, such as setting n to 10000.
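The selection of topN_min_pv and topN_min_click can be sketched as follows. The sketch assumes the data is a dictionary from entity ID to an `(impressions, clicks)` pair; the function name `top_n_minimums` and the sample values are hypothetical.

```python
from typing import Dict, Tuple


def top_n_minimums(entities: Dict[str, Tuple[int, int]],
                   n: int) -> Tuple[int, int]:
    """Return (topN_min_pv, topN_min_click) among the n most-shown entities.

    entities maps entity ID -> (impressions, clicks).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    if not entities:
        raise ValueError("entities must not be empty")
    # Sort by impressions in descending order and keep the top n.
    top = sorted(entities.values(), key=lambda pc: pc[0], reverse=True)[:n]
    top_n_min_pv = min(pv for pv, _ in top)
    top_n_min_click = min(click for _, click in top)
    return top_n_min_pv, top_n_min_click


if __name__ == "__main__":
    sample = {"a": (100, 5), "b": (80, 9), "c": (50, 1), "d": (10, 0)}
    print(top_n_minimums(sample, 3))
```

Note that the minimum click count is taken over the whole top-n set, so it need not belong to the entity with the fewest impressions.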
[0096] In some embodiments, for target entity words with both impressions and clicks, the entity word click popularity can be calculated using the following formula:
[0097]
[0098] Here, click_hot_score represents the click popularity of the target entity word, pv represents the number of impressions of the target entity word, click represents the number of clicks of the target entity word, click_pos_score represents the click score of the target entity word, topN_min_click represents the minimum number of clicks among the top N, topN_min_pv represents the minimum number of impressions among the top N, and avg_score represents the average click score of all entity words.
[0099] In some embodiments, for target entity words with impressions but no clicks, the click popularity of the target entity word can be calculated by standardizing based on the impressions using the following formula:
[0100]
[0101] Here, click_hot_score represents the click popularity of the target entity word, pv represents the number of impressions of the target entity word, and max_no_click_show represents the maximum number of impressions among entities with no clicks.
[0102] In some embodiments, the above calculation of the entity word search popularity of the target entity word includes:
[0103] Based on the impressions of each entity word and the total number of entity words, calculate the average and standard deviation of the impressions of all entity words.
[0104] Calculate the standardized value by applying a standard normal distribution to the total average and standard deviation of all entity word impressions;
[0105] The search popularity of the target entity word is calculated based on the standardized value, the impressions of the target entity word, and the total average impressions of all entity words.
[0106] In some embodiments, the total number of impressions for all entity words is divided by the total number of entity words to obtain the average value (pv_mean), and the standard deviation (pv_std) of the impressions of all entity words is calculated as well.
[0107] In some embodiments, the impression count corresponding to each entity word is converted into a standard normal distribution, as shown below:
[0108] norm_pv = (pv - pv_mean) / pv_std
[0109] Where norm_pv represents the standardized value, pv_mean represents the total average of all entity word impressions, and pv_std represents the standard deviation of all entity word impressions.
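The standardization in [0108] is a z-score over impressions and can be sketched with Python's standard library. Two points are assumptions here: the population standard deviation is used (the source does not say whether a population or sample estimate is intended), and a zero standard deviation is mapped to 0.0 to avoid division by zero. The function name is illustrative.

```python
import statistics
from typing import Dict


def standardize_impressions(pv_by_entity: Dict[str, int]) -> Dict[str, float]:
    """Compute norm_pv = (pv - pv_mean) / pv_std for every entity word."""
    values = list(pv_by_entity.values())
    pv_mean = statistics.fmean(values)
    # Assumption: population standard deviation.
    pv_std = statistics.pstdev(values)
    if pv_std == 0:
        # All impressions are equal; every standardized value is zero.
        return {entity_id: 0.0 for entity_id in pv_by_entity}
    return {entity_id: (pv - pv_mean) / pv_std
            for entity_id, pv in pv_by_entity.items()}


if __name__ == "__main__":
    # Hypothetical impression counts.
    print(standardize_impressions({"a": 10, "b": 20, "c": 30}))
```

Entity words with above-average impressions receive a positive norm_pv and those below average receive a negative one.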
[0110] In some embodiments, the search popularity of a target entity word can be calculated using the following formula:
[0111]
[0112] Wherein, search_hot_score represents the search popularity of the target entity word, norm_pv represents the standardized value, pv_mean represents the total average of all entity word impressions, and pv represents the impressions of the target entity word.
[0113] According to embodiments of this disclosure, the search popularity and click popularity of target entity words are reflected through search engine ranking results and user search attention, etc., which can characterize the degree of attention paid to a certain entity word, that is, represent the popularity information of the entity word. The more attention or search clicks there are, the higher the popularity of the entity word. The above method provides a specific way to calculate the search popularity and click popularity of target entity words, further improving the accuracy of calculating the popularity of target entity words.
[0114] In some embodiments, the above calculation of the entity word search popularity of the target entity word includes:
[0115] Based on the impressions and clicks of the target entity word, the standardized value and the impressions of the target entity word and the total average impressions of all entity words are weighted.
[0116] The search popularity of the target entity word is calculated based on the weighted standardized value, the impressions of the target entity word, and the total average impressions of all entity words.
[0117] In some embodiments, the weighting processing includes up-weighting or down-weighting.
[0118] In some embodiments, the search popularity of the target entity word is calculated from the standardized value corresponding to its impressions, which reflects the impression segment the entity word falls into and the range of its search popularity score. Depending on the impression segment and the score range, up-weighting or down-weighting can be applied. For example, if the impression volume is high but the search popularity score is low, the weight needs to be increased; conversely, if the impression volume is low but the search popularity score is relatively high, the weight needs to be decreased. The choice between up-weighting and down-weighting, and the corresponding adjustment of the weight values, are determined based on the impression volume and click volume.
[0119] According to embodiments of this disclosure, through the weighting processing, the search popularity of the target entity word can be adjusted by choosing between up-weighting and down-weighting and by adjusting the corresponding weight values, thereby meeting different real-time user needs.
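The up-weighting and down-weighting rule described above can be sketched as follows. This is a minimal illustration only: the function name, the impression and score thresholds, and the two multiplicative factors are hypothetical placeholders that do not appear in this disclosure, which states only that the direction and size of the adjustment depend on the impression volume and click volume.

```python
def adjust_search_weight(impressions, search_score,
                         high_impressions=10_000, low_impressions=100,
                         low_score=0.2, high_score=0.8,
                         boost=1.2, damp=0.8):
    """Return a multiplicative weight for the standardized value.

    Every threshold and factor here is a hypothetical placeholder.
    """
    if impressions >= high_impressions and search_score <= low_score:
        return boost   # many impressions but a low score: up-weight
    if impressions <= low_impressions and search_score >= high_score:
        return damp    # few impressions but a high score: down-weight
    return 1.0         # otherwise leave the value unchanged


if __name__ == "__main__":
    print(adjust_search_weight(50_000, 0.1))
    print(adjust_search_weight(10, 0.9))
    print(adjust_search_weight(500, 0.5))
```

In practice the thresholds would be chosen per impression segment, as the paragraphs above describe.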
[0120] In some embodiments, weighting can also be applied to entity words that share the same entity name. For example, when a user searches for "Zhang San" as a query term, the primary demand would ordinarily be the actor "Zhang San"; however, if the currently trending entity is "Zhang San" (the host), then "Zhang San" (the host) becomes the hot entity word. In this case, appropriate weighting needs to be performed based on the real-time search click log dataset and the real-time user search session control log dataset, to ensure that the current primary demand, "Zhang San" (the host), is ranked first.
[0121] In some embodiments, the calculation of the initial popularity of the target entity word includes:
[0122] Based on the field information of the target entity word, a static feature score and a dynamic feature score are obtained. The static feature score is calculated from the following items in the field information: the number of deduplicated characters, the number of characters, and the length of the abstract; the number of deduplicated characters, the number of characters, and the length of the main text; the number of directories; the number of page attributes; the number of tags; the number of references; and the number of internal links. The dynamic feature score is calculated from the page views, the number of edits, the most recent edit time, and the number of user likes in the field information.
[0123] The initial popularity of the target entity word is calculated based on static and dynamic feature scores.
[0124] In some embodiments, the entity words currently displayed in the search engine account for only approximately 30% of all entity words; the remaining entity words have no search popularity or click popularity. These entity words nevertheless need a cold-start popularity to characterize their ranking position among entity words with the same name and type (i.e., an entity word popularity). Therefore, the initial popularity of an entity word is calculated from the static and dynamic features of the entity word's page within the site.
[0125] In some embodiments, static features are characterized as website page text features, such as page richness and quality. These features are calculated based on features such as the richness of entity words in the abstract, the quality of the cover image, the richness of the table of contents and the main text, the number and authority of references, and the number of internal link entity words, etc., to obtain a static score page_static_score.
[0126] Specifically, the static feature score can be based on 11 static features: the number of deduplicated (unique) characters in the abstract (ab_uniq_zh), the number of characters in the abstract (ab_zh_count), and the length of the abstract (ab_len); the number of deduplicated (unique) characters in the main text (para_uniq_zh), the number of characters in the main text (para_zh_count), and the length of the main text (para_len); the number of directories (directory_count); the number of key-value attributes (info_boxes_count); the number of tags (tag_count); the number of references (ref_count); and the number of internal links (link_count). Each of the 11 features has a corresponding weight, for example: ab_uniq_zh, 0.5; ab_zh_count, 0.25; ab_len, 0.25; para_uniq_zh, 0.5; para_zh_count, 0.25; para_len, 0.25; directory_count, 1.0; info_boxes_count, 1.0; tag_count, 1.0; ref_count, 1.0; link_count, 1.0. Finally, the value of each static feature dimension is standardized based on a standard normal distribution, multiplied by its weight, and the weighted values are summed to obtain the static feature score.
[0127] The weights of each feature in the static features can be set according to the user's actual needs.
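The static feature score calculation described above can be sketched as follows. This is a minimal sketch under stated assumptions: the feature names and example weights are those listed above, while the function names, the dictionary-based input layout, the use of the population standard deviation, and the handling of a zero standard deviation are illustrative choices not specified in this disclosure.

```python
from statistics import mean, pstdev

# Example weights as listed in the description; adjustable per actual needs.
STATIC_WEIGHTS = {
    "ab_uniq_zh": 0.5, "ab_zh_count": 0.25, "ab_len": 0.25,
    "para_uniq_zh": 0.5, "para_zh_count": 0.25, "para_len": 0.25,
    "directory_count": 1.0, "info_boxes_count": 1.0,
    "tag_count": 1.0, "ref_count": 1.0, "link_count": 1.0,
}


def zscores(values):
    """Standardize values as (x - mean) / std.

    Assumption: population standard deviation; all zeros when std is 0.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def static_scores(pages, weights=STATIC_WEIGHTS):
    """pages: list of dicts mapping feature name -> raw value.

    Returns one page_static_score per page, in input order.
    """
    scores = [0.0] * len(pages)
    for name, weight in weights.items():
        normed = zscores([page[name] for page in pages])
        for i, z in enumerate(normed):
            scores[i] += weight * z
    return scores


if __name__ == "__main__":
    sparse_page = {name: 1 for name in STATIC_WEIGHTS}
    rich_page = {name: 3 for name in STATIC_WEIGHTS}
    print(static_scores([sparse_page, rich_page]))
```

Passing a different `weights` dictionary corresponds to setting the feature weights according to actual needs.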
[0128] In some embodiments, the dynamic features represent user attention. They are calculated from page views (visit_count), edit count (edit_count), last edit time (diff_day), user likes and shares, etc., to obtain a dynamic feature score (page_dynamic_score).
[0129] Specifically, the page views (visit_count), edit count (edit_count), last edit time (diff_day), and user likes and shares are all weighted at 1.0. The value of each dynamic feature dimension is standardized based on a standard normal distribution, multiplied by its weight, and the weighted values are summed to obtain the dynamic feature score.
[0130] The weights of each feature in the dynamic features can be set according to the user's actual needs.
[0131] It should be noted that the value of each static or dynamic feature dimension, after standardization based on the standard normal distribution, can be expressed as norm(X) = (X - μ) / σ, where μ and σ are the mean and standard deviation of that feature dimension; examples are norm(edit_count), norm(visit_count), and norm(diff_day). Based on this, the dynamic feature score is page_dynamic_score = norm(edit_count) + norm(visit_count) + norm(diff_day).
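The dynamic feature score formula above can be sketched as follows. The sketch follows the formula as written, so it sums only edit_count, visit_count, and diff_day, each with weight 1.0. The function names, the input layout, and the use of the population standard deviation are assumptions made for illustration. The description does not state whether diff_day is inverted before summation, so it is summed unchanged here.

```python
from statistics import mean, pstdev


def norm(values):
    """Z-score standardization: (x - mean) / std, zeros if std is 0."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]


def dynamic_scores(pages):
    """pages: list of dicts with edit_count, visit_count, diff_day.

    Returns page_dynamic_score for each page, in input order.
    """
    edit = norm([p["edit_count"] for p in pages])
    visit = norm([p["visit_count"] for p in pages])
    diff = norm([p["diff_day"] for p in pages])
    return [e + v + d for e, v, d in zip(edit, visit, diff)]


if __name__ == "__main__":
    pages = [
        {"edit_count": 1, "visit_count": 10, "diff_day": 5},
        {"edit_count": 3, "visit_count": 30, "diff_day": 5},
    ]
    print(dynamic_scores(pages))
```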
[0132] In some embodiments, the initial popularity of the target entity word can be calculated using the following formula:
[0133] init_hot_score = page_static_score + page_dynamic_score
[0134] Here, init_hot_score represents the initial popularity of the target entity word, page_static_score represents the static feature score, and page_dynamic_score represents the dynamic feature score.
[0135] According to embodiments of this disclosure, in a cold-start scenario, the richness and authority of an entity word's page on the site, together with the number of user views and edits, can serve as a proxy for the initial ranking relationship among entity words. By calculating the initial popularity of the target entity word as described above, an initial ranking can be produced for entity words that have no search data, further improving the accuracy of the calculated popularity of the target entity word.
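As an illustration of the cold-start use case, the following sketch applies init_hot_score = page_static_score + page_dynamic_score to rank same-name entity words that have no search or click data. The function rank_cold_start, the tuple layout, and the example entity IDs and scores are hypothetical and serve only to show the ordering step.

```python
def init_hot_score(page_static_score, page_dynamic_score):
    """Initial popularity as the sum of the static and dynamic scores."""
    return page_static_score + page_dynamic_score


def rank_cold_start(candidates):
    """candidates: list of (entity_id, static_score, dynamic_score).

    Returns entity IDs ordered from highest to lowest initial popularity.
    """
    scored = [(init_hot_score(s, d), eid) for eid, s, d in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [eid for _, eid in scored]


if __name__ == "__main__":
    same_name = [("e1", 0.2, -0.5), ("e2", 1.0, 0.3), ("e3", -0.1, 0.4)]
    print(rank_cold_start(same_name))
```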
[0136] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this disclosure is not limited to the described order of actions, because according to this disclosure, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily essential to this disclosure.
[0137] The above is an introduction to the method embodiments. The following describes the solution described in this disclosure further through device embodiments.
[0138] Figure 2 shows a block diagram of an entity word popularity calculation device 200 according to an embodiment of the present disclosure. As shown in Figure 2, the device 200 includes:
[0139] An acquisition module 210, used to acquire the entity ID, impressions, clicks, display position, and field information corresponding to each entity word with the same entity ID in the search click log dataset and entity word webpage dataset.
[0140] A calculation module 220, used to calculate the target entity word's click popularity, search popularity, and initial popularity.
[0141] The calculation module 220 is also used to calculate the entity word popularity of the target entity word based on the entity word click popularity, entity word search popularity, and entity word initial popularity.
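The division of work between the two modules can be sketched as follows. This is an illustrative structure only: the class names, the dictionary-based dataset layout, and the EntityRecord type are hypothetical. Combining the three component scores as a weighted sum with the default weights shown is an assumption, since this passage states only that the entity word popularity is calculated based on the three values.

```python
from dataclasses import dataclass


@dataclass
class EntityRecord:
    entity_id: str
    impressions: int
    clicks: int
    positions: list
    fields: dict


class AcquisitionModule:
    """Sketch of module 210: joins the two datasets on entity ID."""

    def acquire(self, click_log, webpages):
        # click_log: {entity_id: {"impressions", "clicks", "positions"}}
        # webpages:  {entity_id: field-information dict}
        records = []
        for eid in sorted(click_log.keys() & webpages.keys()):
            row = click_log[eid]
            records.append(EntityRecord(eid, row["impressions"],
                                        row["clicks"], row["positions"],
                                        webpages[eid]))
        return records


class CalculationModule:
    """Sketch of module 220: combines the three component scores."""

    def __init__(self, w_click=1.0, w_search=1.0, w_init=1.0):
        # Hypothetical weights; the description does not fix their values.
        self.w_click, self.w_search, self.w_init = w_click, w_search, w_init

    def entity_popularity(self, click_hot, search_hot, init_hot):
        return (self.w_click * click_hot
                + self.w_search * search_hot
                + self.w_init * init_hot)


if __name__ == "__main__":
    log = {"b": {"impressions": 100, "clicks": 7, "positions": [1, 2]}}
    pages = {"b": {"tag_count": 4}, "c": {"tag_count": 1}}
    print(AcquisitionModule().acquire(log, pages))
    print(CalculationModule().entity_popularity(1.0, 2.0, 3.0))
```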
[0142] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the described module can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.
[0143] The acquisition, storage, and application of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0144] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.
[0145] Figure 3 shows a schematic block diagram of an electronic device 300 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.
[0146] Electronic device 300 includes a computing unit 301, which can perform various appropriate actions and processes according to a computer program stored in ROM 302 or a computer program loaded into RAM 303 from storage unit 308. RAM 303 can also store various programs and data required for the operation of electronic device 300. The computing unit 301, ROM 302, and RAM 303 are interconnected via bus 304. I/O interface 305 is also connected to bus 304.
[0147] Multiple components in electronic device 300 are connected to I/O interface 305, including: input unit 306, such as a keyboard, mouse, etc.; output unit 307, such as various types of displays, speakers, etc.; storage unit 308, such as a disk, optical disk, etc.; and communication unit 309, such as a network card, modem, wireless transceiver, etc. Communication unit 309 allows electronic device 300 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.
[0148] The computing unit 301 can be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 301 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 301 performs the various methods and processes described above, such as method 100. For example, in some embodiments, method 100 may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 308. In some embodiments, part or all of the computer program may be loaded and/or installed on the electronic device 300 via ROM 302 and/or communication unit 309. When the computer program is loaded into RAM 303 and executed by the computing unit 301, one or more steps of method 100 described above may be performed. Alternatively, in other embodiments, the computing unit 301 may be configured to perform method 100 by any other suitable means (e.g., by means of firmware).
[0149] Various embodiments of the systems and techniques described above herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.
[0150] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.
[0151] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0152] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user; and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
[0153] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.
[0154] Computer systems can include clients and servers. Clients and servers are generally located far apart and typically interact via communication networks. Client-server relationships are created by computer programs running on the respective computers and having a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.
[0155] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed in this respect.
[0156] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.
Claims
1. A method for calculating entity word popularity, characterized in that it comprises: obtaining the entity ID, impressions, clicks, display location, and field information for each entity word with the same entity ID from the search click log dataset and the entity word webpage dataset; calculating the target entity word's click popularity, search popularity, and initial popularity, wherein the calculation of the target entity word's click popularity includes: calculating the average click position of each entity word based on its corresponding display position and click volume; calculating the click score of the target entity word based on its average click position; calculating the average click score of all entity words based on their average click positions; when the target entity word has both impressions and clicks, calculating the entity word's click popularity based on its click volume, impressions, and click score, the average click score of all entity words, and the minimum click volume and impressions among the entity words in the preset ranking, wherein the entity words in the preset ranking are the top n entity words obtained by sorting all entity words in descending order of impression volume, where n is a positive integer greater than or equal to 1; and when the target entity word has impressions but no clicks, calculating the entity word's click popularity by standardizing the target entity word's impression volume and the maximum impression volume among all entity words without clicks; and calculating the entity word popularity of the target entity word based on the entity word click popularity, the entity word search popularity, and the entity word initial popularity.
2. The method according to claim 1, characterized in that, The calculation of the search popularity of the target entity word includes: Based on the impressions of each entity word and the total number of entity words, calculate the average and standard deviation of the impressions of all entity words. Calculate the standardized value by applying a standard normal distribution to the total average and standard deviation of all entity word impressions; The search popularity of the target entity word is calculated based on the standardized value, the display volume of the target entity word, and the total average display volume of all entity words.
3. The method according to claim 2, characterized in that, The calculation of the search popularity of the target entity word includes: Based on the impressions and clicks of the target entity word, the standardized value, the impressions of the target entity word, and the total average impressions of all entity words are weighted. The search popularity of the target entity word is calculated based on the weighted standardized value, the display volume of the target entity word, and the total average display volume of all entity words.
4. The method according to claim 1, characterized in that, The calculation of the initial popularity of the target entity word includes: Based on the field information of the target entity words, static feature scores and dynamic feature scores are obtained. The static feature scores are calculated based on the number of duplicate words in the abstract, the number and length of words, the number of duplicate words in the main text, the number and length of words, the number of directories, the number of page attributes, the number of tags, the number of references, and the number of internal links in the field information. The dynamic feature scores are calculated based on the page views, number of edits, the most recent edit time, and the number of user likes in the field information. The initial popularity of the target entity word is calculated based on the static feature score and the dynamic feature score.
5. The method according to claim 1, characterized in that, Before obtaining the entity ID, impressions, clicks, display location, and field information corresponding to each entity word with the same entity ID in the search click log dataset and entity word webpage dataset, the method further includes: Obtain data sources, including search click log datasets and entity word webpage datasets; Extract the entity ID, impressions, clicks, and display location corresponding to each entity word in the search click log dataset, and the entity ID and field information corresponding to each entity word in the entity word webpage dataset; Match the entity words in the search click log dataset with the entity word webpage dataset to obtain the entity ID, impressions, clicks, display position, and field information corresponding to each entity word with the same entity ID.
6. The method according to claim 5, characterized in that, The data source also includes a dataset of search results corresponding to entity words, a dataset of user search session control logs, and a dataset of news articles, which are used to verify each entity word with the same entity ID.
7. A device for calculating the popularity of entity words, characterized in that it comprises: an acquisition module, used to obtain the entity ID, impressions, clicks, display location, and field information corresponding to each entity word with the same entity ID in the search click log dataset and entity word webpage dataset; and a calculation module, used to calculate the click popularity, search popularity, and initial popularity of the target entity word; wherein the calculation module is specifically used to: calculate the average click position of each entity word based on the display position and click volume corresponding to each entity word; calculate the click score of the target entity word based on the average click position of the target entity word; calculate the average click score of all entity words based on their average click positions; when the target entity word has both impressions and clicks, calculate the entity word click popularity based on the target entity word's clicks, impressions, and click score, the average click score of all entity words, and the minimum clicks and impressions among the entity words in the preset ranking, wherein the entity words in the preset ranking are the top n entity words obtained by sorting all entity words in descending order of impressions, where n is a positive integer greater than or equal to 1; and when the target entity word has impressions but no clicks, calculate the entity word click popularity by standardizing the target entity word's impressions and the maximum impressions among all entity words without clicks; and wherein the calculation module is also used to calculate the entity word popularity of the target entity word based on the entity word click popularity, the entity word search popularity, and the entity word initial popularity.
8. An electronic device, characterized in that it comprises: at least one processor; and a memory that is communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor to enable the at least one processor to perform the method described in any one of claims 1-6.
9. A non-transitory computer-readable storage medium storing computer instructions, characterized in that, The computer instructions are used to cause the computer to perform the method according to any one of claims 1-6.