Method and apparatus for extracting page theme
A page theme extraction technology, applied in the computer field, that addresses the problems of page theme drift, theme words that do not accurately reflect the page theme, and results that fail to meet user needs. The effect is to reduce such deviation and better meet user needs.
Examples
Embodiment 1
[0071] Figure 1 is a flow chart of the method for extracting page themes provided by Embodiment 1 of the present invention. As shown in Figure 1, the method may include the following steps:
[0072] Step 101: Obtain the candidate paragraphs in the page that express the theme of the page.
[0073] In this step, the candidate paragraphs that express the theme of the page are those paragraphs that may reflect the theme of the page. They may include, but are not limited to, at least one of the following paragraphs:
[0074] The page title paragraph with the label title, the page title row with the label realtitle, the navigation paragraph with the label mypos, and the front link with the label preanchor.
[0075] For example, for the page http://www.22zw.cn/XH/91H53969KX/, the four paragraphs above are obtained as follows:
[0076] The title paragraph of the page with the label title reads: The latest chapter of Fights Break the Sky.
[0077]...
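Step 101 can be illustrated with a minimal sketch. It assumes the page has already been parsed into (label, text) pairs; that input shape, the function name `get_candidate_paragraphs`, and the choice to keep the first non-empty paragraph per label are assumptions made for illustration, not details given in the text. Only the four label names come from the description.

```python
# Labels named in the description for paragraphs that may express the page theme.
CANDIDATE_LABELS = ("title", "realtitle", "mypos", "preanchor")


def get_candidate_paragraphs(labelled_paragraphs):
    """Return {label: text} for the candidate labels present on the page.

    `labelled_paragraphs` is a hypothetical input: a list of (label, text)
    pairs produced by whatever page parser is in use.
    """
    candidates = {}
    for label, text in labelled_paragraphs:
        text = text.strip()
        # Assumption: keep the first non-empty paragraph seen for each label.
        if label in CANDIDATE_LABELS and text and label not in candidates:
            candidates[label] = text
    return candidates


page = [
    ("title", "The latest chapter of Fights Break the Sky"),
    ("body", "Some unrelated body text"),
    ("mypos", ""),
]
candidates = get_candidate_paragraphs(page)
print(candidates)
```

Paragraphs with other labels, and candidate labels whose text is empty, are simply skipped, which matches the "at least one of the following paragraphs" wording.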
Embodiment 2
[0104] Figure 2 is a flow chart of the method for calculating the confidence of each paragraph provided by Embodiment 2 of the present invention. As shown in Figure 2, the method may include the following steps:
[0105] Step 201: Perform word segmentation processing on each paragraph.
[0106] Preferably, the words obtained after word segmentation are also filtered against a preset stop word list. The stop word list contains words that appear frequently in web pages, including but not limited to adverbs, function words, modal particles, particles, and pronouns. These words usually carry little meaning.
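Step 201 and the optional stop word filtering might look like the sketch below. The segmenter is a stand-in that splits English text on non-word characters; the description does not name a segmentation tool, and a real system would use a proper word segmenter for the page language. The stop word list and the function names are likewise illustrative assumptions.

```python
import re

# Illustrative stop word list; the description only says it holds frequent,
# low-information words (adverbs, function words, particles, pronouns, ...).
STOP_WORDS = {"the", "of", "a", "an", "and", "it", "very"}


def segment(paragraph):
    """Stand-in segmenter: lower-case and split on non-word characters."""
    return [w for w in re.split(r"\W+", paragraph.lower()) if w]


def filter_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word that appears in the stop word list."""
    return [w for w in words if w not in stop_words]


words = filter_stop_words(segment("The latest chapter of Fights Break the Sky"))
print(words)
```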
[0107] Step 202: Calculate the confidence of each word obtained after word segmentation according to the formula $D_{ij} = \alpha \cdot S_{ij} + \beta \cdot P_{ij}$.
[0108] Here, $D_{ij}$ is the confidence of the $j$th word obtained after word segmentation of the $i$th paragraph, and $S_{ij}$ is the frequency of occurrence of the $j$th word in all paragraphs obtained after t...
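The weighted sum in Step 202 can be sketched as follows. Two things are assumptions here: the frequency term is taken to be the word's relative frequency over all segmented paragraphs, and the second term P is passed in by the caller, because the part of the description that defines it is cut off. The weights default to 0.5 each only for illustration; the text gives no values for α and β.

```python
from collections import Counter


def word_confidences(paragraphs, p_scores, alpha=0.5, beta=0.5):
    """Compute D = alpha * S + beta * P for each word of each paragraph.

    paragraphs: list of word lists (paragraph i -> its segmented words).
    p_scores:   list of dicts giving the P term per word; supplied by the
                caller because its definition is truncated in the source.
    S is ASSUMED to be the word's relative frequency over all paragraphs.
    """
    counts = Counter(w for words in paragraphs for w in words)
    total = sum(counts.values())
    result = []
    for words, p in zip(paragraphs, p_scores):
        scores = {}
        for w in set(words):
            s = counts[w] / total
            # Words without a supplied P term contribute 0 for that term.
            scores[w] = alpha * s + beta * p.get(w, 0.0)
        result.append(scores)
    return result


paragraphs = [["fights", "break", "sky"], ["fights", "chapter"]]
p_scores = [{"fights": 1.0}, {"fights": 0.5}]
D = word_confidences(paragraphs, p_scores)
print(D)
```

Keeping P as an input means the same function still applies once the missing definition is known.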
Embodiment 3
[0118] Figure 3 is a flow chart of the method for extracting page keywords provided by Embodiment 3 of the present invention. As shown in Figure 3, the method may include the following steps:
[0119] Step 301: Perform word segmentation processing on the maintitle determined in Embodiment 1.
[0120] If only one maintitle is determined for the page, the process of Embodiment 3 is executed for that maintitle only; if multiple maintitles are determined for the page, the process of Embodiment 3 is executed separately for each maintitle.
[0121] Step 302: Perform part-of-speech tagging on each word obtained after word segmentation.
[0122] Step 303: Filter each word obtained after word segmentation based on a preset stop word list.
[0123] This step filters out, from the words obtained after word segmentation, the words contained in the stop word list. The stop word list includes words that appear frequ...
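Steps 301 to 303, applied separately to each maintitle, could be sketched as below. The whitespace segmenter, the toy part-of-speech lexicon `TOY_POS`, the tag names, and the stop word list are all hypothetical stand-ins; the description does not specify a segmenter, a tagger, or a tag set.

```python
# Toy part-of-speech lexicon, purely illustrative; a real system would use a tagger.
TOY_POS = {"latest": "ADJ", "chapter": "NOUN", "of": "ADP", "the": "DET", "sky": "NOUN"}
STOP_WORDS = {"the", "of"}


def process_maintitle(maintitle):
    """Steps 301-303 for one maintitle: segment, tag, then filter stop words."""
    words = maintitle.lower().split()                          # Step 301 (stand-in)
    tagged = [(w, TOY_POS.get(w, "UNK")) for w in words]       # Step 302
    return [(w, t) for w, t in tagged if w not in STOP_WORDS]  # Step 303


def process_page(maintitles):
    """Run the same steps separately for each maintitle of the page."""
    return [process_maintitle(t) for t in maintitles]


result = process_page(["The latest chapter of the sky"])
print(result)
```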