Method and apparatus for extracting page theme

Page theme extraction technology, applied in the computer field, addresses the problems that the extracted page theme can drift from the actual theme, that page subject words fail to reflect the page theme accurately, and that user needs are therefore not met accurately. Its effect is to reduce that deviation and better meet user needs.

Active Publication Date: 2012-10-17
BEIJING BAIDU NETCOM SCI & TECH CO LTD

AI Technical Summary

Problems solved by technology

However, the title of a page may contain multiple paragraphs, some of which are irrelevant to the page theme, so the extracted page theme deviates from the actual one.
The application may not be able t



Examples


Example Embodiment

[0070] Embodiment one

[0071] Figure 1 is a flowchart of the method for extracting a page theme provided in the first embodiment of the present invention. As shown in Figure 1, the method can include the following steps:

[0072] Step 101: Obtain candidate paragraphs in the page that express the theme of the page.

[0073] In this step, the candidate paragraphs are those paragraphs of the page that may reflect the page theme. They may include, but are not limited to, at least one of the following:

[0074] The page title paragraph labeled title, the page title line labeled realtitle, the navigation paragraph labeled mypos, and the front link labeled preanchor.

[0075] For example, for the page http://www.22zw.cn/XH/91H53969KX/, the four paragraphs obtained are:

[0076] The page title paragraph labeled title, whose content is: The latest chapter of Dou Breaking the Sky.

[0077] The...
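
As a rough illustration of step 101, the sketch below gathers candidate paragraphs by label. It assumes the page has already been parsed into a mapping from label to text; that mapping (`page_fields`), the `CandidateParagraph` type, and `collect_candidates` are hypothetical names, and the excerpt does not say how the labels are derived from the raw page.

```python
from dataclasses import dataclass

# Labels named in the text for paragraphs that may express the page theme.
CANDIDATE_LABELS = ("title", "realtitle", "mypos", "preanchor")


@dataclass(frozen=True)
class CandidateParagraph:
    label: str  # one of CANDIDATE_LABELS
    text: str   # paragraph content


def collect_candidates(page_fields):
    """Return the candidate paragraphs found in a label -> text mapping.

    `page_fields` is a hypothetical, already-parsed view of the page.
    """
    candidates = []
    for label in CANDIDATE_LABELS:
        text = (page_fields.get(label) or "").strip()
        if text:  # skip labels that are missing or empty on this page
            candidates.append(CandidateParagraph(label, text))
    return candidates
```

A page that only carries a title paragraph would yield a single candidate, which matches the "at least one of the following" wording.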

Example Embodiment

[0103] Embodiment two

[0104] Figure 2 is a flowchart of the method for calculating the confidence of each paragraph provided in the second embodiment of the present invention. As shown in Figure 2, the method can include the following steps:

[0105] Step 201: Perform word segmentation processing on each paragraph.

[0106] Preferably, the words obtained after word segmentation are also filtered against a preset stop word list. The stop word list contains words that appear frequently in web pages, including but not limited to adverbs, function words, modal particles, auxiliary words, and pronouns. These words usually carry little meaning.
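
Segmentation followed by stop word filtering might look like the minimal sketch below. The stop word list and the whitespace/punctuation-based `segment` function are stand-ins chosen only to keep the example self-contained; Chinese pages would need a real word segmenter, and the actual stop word list is not given in the text.

```python
import re

# Hypothetical stop word list; the text only says it holds frequent,
# low-information words such as function words and pronouns.
STOP_WORDS = frozenset({"the", "of", "a", "an", "it", "is", "very"})


def segment(paragraph):
    """Toy word segmentation: lowercase and split on non-word characters."""
    return [w for w in re.split(r"\W+", paragraph.lower()) if w]


def filter_stop_words(words, stop_words=STOP_WORDS):
    """Drop every word that appears in the stop word list."""
    return [w for w in words if w not in stop_words]
```

For instance, `filter_stop_words(segment("The latest chapter of the novel"))` keeps only `latest`, `chapter`, and `novel`.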

[0107] Step 202: Calculate the confidence of each word obtained after word segmentation according to the formula $D_{ij} = \alpha \cdot S_{ij} + \beta \cdot P_{ij}$.

[0108] Here, $D_{ij}$ is the confidence of the j-th word obtained after word segmentation of the i-th paragraph, and $S_{ij}$ is the frequency of occurrence of the j-th word i...
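
The weighted sum D_ij = α*S_ij + β*P_ij in step 202 can be sketched as follows. The definition of P_ij is cut off in this excerpt, so the code treats it as an opaque per-word input, and the default weights of 0.5 are placeholders rather than values taken from the patent.

```python
def word_confidence(s_ij, p_ij, alpha=0.5, beta=0.5):
    """Compute D_ij = alpha * S_ij + beta * P_ij for a single word.

    s_ij: frequency term for word j of paragraph i
    p_ij: second term of the formula (its meaning is not given in the excerpt)
    alpha, beta: weights; the defaults are illustrative assumptions
    """
    return alpha * s_ij + beta * p_ij


def paragraph_word_confidences(s_values, p_values, alpha=0.5, beta=0.5):
    """Apply the formula to every word of one paragraph."""
    if len(s_values) != len(p_values):
        raise ValueError("s_values and p_values must have the same length")
    return [word_confidence(s, p, alpha, beta)
            for s, p in zip(s_values, p_values)]
```

With α = 0.5 and β = 0.25, a word with S = 2 and P = 4 scores 0.5 * 2 + 0.25 * 4 = 2.0.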

Example Embodiment

[0117] Embodiment three

[0118] Figure 3 is a flowchart of the method for extracting the subject words of a page provided by the third embodiment of the present invention. As shown in Figure 3, the method may include the following steps:

[0119] Step 301: Perform word segmentation processing on the maintitle determined in the first embodiment.

[0120] If only one maintitle was determined for the page, the process of the third embodiment is executed for that maintitle alone; if several were determined, the process is executed once for each maintitle.

[0121] Step 302: Perform part-of-speech tagging on each word obtained after word segmentation processing.

[0122] Step 303: Filter each word obtained after word segmentation based on a preset stop word list.

[0123] This step removes, from the words obtained after word segmentation, those contained in the stop word list. Among them, the stop word li...
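
Steps 301 to 303, run once per maintitle, can be sketched as below. The part-of-speech lexicon, the default tag, the stop word list, and the whitespace segmentation are all toy assumptions; the excerpt does not name the segmenter or tagger actually used.

```python
# Hypothetical toy lexicon mapping words to part-of-speech tags.
TOY_POS_LEXICON = {"latest": "ADJ", "chapter": "NOUN", "of": "ADP", "the": "DET"}
# Hypothetical stop word list.
STOP_WORDS = frozenset({"the", "of"})


def tag_words(words, lexicon=TOY_POS_LEXICON, default_tag="NOUN"):
    """Attach a part-of-speech tag to each word (unknown words get default_tag)."""
    return [(w, lexicon.get(w, default_tag)) for w in words]


def process_maintitles(maintitles):
    """Run segmentation, tagging, and stop word filtering for each maintitle."""
    results = []
    for title in maintitles:
        words = title.lower().split()                               # step 301
        tagged = tag_words(words)                                   # step 302
        kept = [(w, t) for w, t in tagged if w not in STOP_WORDS]   # step 303
        results.append(kept)
    return results
```

The function returns one list of tagged words per maintitle, so a page with several maintitles yields several independent results.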



Abstract

The invention provides a method and an apparatus for extracting a page theme. The method comprises: A. acquiring candidate paragraphs that express the page theme; B. if any candidate paragraph can be split further, splitting that candidate paragraph into paragraphs; otherwise proceeding directly to step C; C. calculating the confidence of each paragraph obtained after step B; and D. taking each paragraph whose confidence meets a preset confidence requirement as a paragraph of the page theme. With the method and the apparatus, the page theme can be determined more accurately, and the deviation between the extracted page theme and the actual page theme can be reduced.
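
Steps A to D can be read as a small pipeline, sketched here under stated assumptions. The `split` and `confidence` callables are hypothetical plug-in points, and "meets a preset confidence requirement" is interpreted as "score is at least a threshold", which the abstract does not specify.

```python
def extract_theme_paragraphs(candidates, split, confidence, threshold):
    """Sketch of steps A-D of the abstract.

    candidates: candidate paragraphs (step A, assumed already collected)
    split:      callable returning the sub-paragraphs of a paragraph; it
                returns a one-element list when no further split applies
    confidence: callable that scores one paragraph
    threshold:  preset requirement, assumed here to mean score >= threshold
    """
    # Step B: split every candidate that can be split further.
    paragraphs = []
    for candidate in candidates:
        parts = [p.strip() for p in split(candidate) if p.strip()]
        paragraphs.extend(parts or [candidate])
    # Step C: score each resulting paragraph.
    scored = [(p, confidence(p)) for p in paragraphs]
    # Step D: keep the paragraphs that meet the requirement.
    return [p for p, score in scored if score >= threshold]
```

As a toy usage, splitting on `" - "` and scoring by word count with a threshold of 3 would keep "Latest chapter of the novel" from the title "Latest chapter of the novel - Site" and drop the short site-name segment.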

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method and device for extracting a page theme.

Background

[0002] Many tasks, such as ranking in page search and determining page subject words, depend on obtaining the page theme. For example, in page search ranking, the higher the correlation between the page theme and the query, the higher the page ranks; likewise, page keywords are usually extracted from the page theme.

[0003] Currently, it is common to simply use the entire title paragraph (title) of the page as the page theme. However, the title of a page may contain multiple paragraphs, some of which are irrelevant to the page theme, so the extracted page theme deviates from the actual one. As a result, applications may fail to meet user needs accurately in page search ranking, and the determined page keywords may ...


Application Information

IPC(8): G06F17/30
Inventor: 刘海浪
Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD