Method and apparatus for extracting page theme
A page theme extraction technology, applied in the computer field, that addresses problems such as page theme drift, page theme words that do not accurately reflect the page theme, and results that do not accurately meet user needs. It reduces such deviations and better meets user needs.
Example Embodiment
[0070] Embodiment one
[0071] Figure 1 is a flowchart of the method for extracting page themes provided in the first embodiment of the present invention. As shown in Figure 1, the method may include the following steps:
[0072] Step 101: Obtain candidate paragraphs in the page that express the theme of the page.
[0073] In this step, the candidate paragraphs are those paragraphs in the page that may reflect the theme of the page. They may include, but are not limited to, at least one of the following:
[0074] The page title paragraph labeled title, the page title line labeled realtitle, the navigation paragraph labeled mypos, and the inbound-link ("front chain") anchor text labeled preanchor.
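Step 101 can be sketched as follows, assuming the page has already been parsed into a mapping from label to paragraph text. The function name `get_candidate_paragraphs`, the dict-based page representation, and the sample values are hypothetical; the description itself only names the four labels.

```python
from typing import Dict

# Labels named in the description. The description says the candidate
# paragraphs are not limited to these four.
CANDIDATE_LABELS = ("title", "realtitle", "mypos", "preanchor")


def get_candidate_paragraphs(parsed_page: Dict[str, str]) -> Dict[str, str]:
    """Return the non-empty paragraphs whose label marks them as theme candidates.

    `parsed_page` is an assumed representation: label -> paragraph text.
    """
    return {
        label: parsed_page[label].strip()
        for label in CANDIDATE_LABELS
        if parsed_page.get(label, "").strip()
    }


# Hypothetical parsed page with placeholder text.
page = {
    "title": "Example page title ",
    "realtitle": "",               # empty paragraphs are skipped
    "mypos": "Home > Novels",
    "body": "Unrelated body text",  # not a candidate label
}
candidates = get_candidate_paragraphs(page)
```

Only paragraphs that are both labeled as candidates and non-empty are kept, so a page missing some of the four labels still yields a usable candidate set.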
[0075] For example, for the page http://www.22zw.cn/XH/91H53969KX/, the four paragraphs above are obtained as follows:
[0076] The page title paragraph labeled title, whose content is: The latest chapter of Dou Breaking the Sky.
[0077] The...
Example Embodiment
[0103] Embodiment two
[0104] Figure 2 is a flowchart of the method for calculating the confidence of each paragraph provided in the second embodiment of the present invention. As shown in Figure 2, the method may include the following steps:
[0105] Step 201: Perform word segmentation processing on each paragraph.
[0106] Preferably, the words obtained after word segmentation may also be filtered against a preset stop word list. The stop word list contains words that appear frequently in web pages but carry little meaning, including but not limited to adverbs, function words, modal particles, auxiliary words, and pronouns.
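The segmentation and optional filtering of Step 201 might look like the sketch below. The whitespace-based `segment` function and the contents of `STOP_WORDS` are stand-in assumptions: the description does not name a segmenter, and text without spaces between words (such as Chinese) would need a real word segmentation tool.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical stop word list; a real list would be prepared in advance.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "of", "a", "an", "is", "very", "it"})


def segment(paragraph: str) -> List[str]:
    """Stand-in segmenter: lowercase the text and split on whitespace."""
    return paragraph.lower().split()


def filter_stop_words(
    words: Iterable[str], stop_words: FrozenSet[str] = STOP_WORDS
) -> List[str]:
    """Drop every word that appears in the stop word list, keeping order."""
    return [w for w in words if w not in stop_words]


words = filter_stop_words(segment("The latest chapter of the novel"))
```

Filtering before scoring keeps low-information words from receiving a confidence value at all.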
[0107] Step 202: Calculate the confidence of each word obtained after word segmentation according to the formula $D_{ij} = \alpha \cdot S_{ij} + \beta \cdot P_{ij}$.
[0108] Here, $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, and $S_{ij}$ is the frequency of occurrence of the j-th word i...
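As an illustration of the weighted sum in Step 202, the sketch below computes $D_{ij}$ for the words of a single paragraph. Because the definitions of $S_{ij}$ and $P_{ij}$ are cut off in the text above, both are treated here as precomputed inputs; the function name `word_confidences` and the numeric values of the weights and scores are assumptions chosen only to make the arithmetic easy to follow.

```python
from typing import Dict


def word_confidences(
    s: Dict[str, float], p: Dict[str, float], alpha: float, beta: float
) -> Dict[str, float]:
    """Compute D_ij = alpha * S_ij + beta * P_ij for each word j of one paragraph i.

    `s` and `p` map each word to its S_ij and P_ij value. Both are assumed
    to be computed elsewhere; a word missing from `p` raises KeyError.
    """
    return {word: alpha * s[word] + beta * p[word] for word in s}


# Placeholder inputs, not taken from the description.
s_values = {"chapter": 0.5, "novel": 0.25}
p_values = {"chapter": 1.0, "novel": 0.0}
d_values = word_confidences(s_values, p_values, alpha=0.5, beta=0.5)
```

With these placeholder inputs, "chapter" scores 0.5 * 0.5 + 0.5 * 1.0 = 0.75 and "novel" scores 0.5 * 0.25 + 0.5 * 0.0 = 0.125.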
Example Embodiment
[0117] Embodiment three
[0118] Figure 3 is a flowchart of the method for extracting the theme words of a page provided in the third embodiment of the present invention. As shown in Figure 3, the method may include the following steps:
[0119] Step 301: Perform word segmentation processing on the maintitle determined in the first embodiment.
[0120] If only one maintitle has been determined for the page, the process of the third embodiment is executed for that maintitle alone; if multiple maintitles have been determined, the process is executed for each of them.
[0121] Step 302: Perform part-of-speech tagging on each word obtained after word segmentation processing.
[0122] Step 303: Filter each word obtained after word segmentation based on a preset stop word list.
[0123] This step removes the words contained in the stop word list from the words obtained after word segmentation. Among them, the stop word li...
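Steps 301 to 303 can be sketched as a small pipeline that is run once per maintitle. Everything concrete here is an assumption for illustration: the whitespace segmenter, the toy `POS_LEXICON` used in place of a trained part-of-speech tagger, the tag names, and the stop word list. The description does not specify any of these.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical toy lexicon standing in for a real part-of-speech tagger.
POS_LEXICON: Dict[str, str] = {
    "latest": "ADJ",
    "chapter": "NOUN",
    "novel": "NOUN",
    "the": "DET",
    "of": "ADP",
}
# Hypothetical stop word list.
STOP_WORDS: FrozenSet[str] = frozenset({"the", "of"})

TaggedWord = Tuple[str, str]


def segment(text: str) -> List[str]:
    """Step 301 stand-in: lowercase and split on whitespace."""
    return text.lower().split()


def tag(words: List[str]) -> List[TaggedWord]:
    """Step 302 stand-in: look each word up in the toy lexicon."""
    return [(w, POS_LEXICON.get(w, "UNK")) for w in words]


def process_maintitle(maintitle: str) -> List[TaggedWord]:
    """Run Steps 301-303 on a single maintitle."""
    tagged = tag(segment(maintitle))
    # Step 303: drop words found in the stop word list.
    return [(w, pos) for w, pos in tagged if w not in STOP_WORDS]


def process_maintitles(maintitles: List[str]) -> List[List[TaggedWord]]:
    """Apply the same process to each maintitle determined for the page."""
    return [process_maintitle(t) for t in maintitles]


results = process_maintitles(["The latest chapter", "Chapter of the novel"])
```

Tagging before filtering follows the step order given above, so each surviving word keeps the part-of-speech tag it was assigned in context.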