Method and device for extracting webpage title and information processing system

A web page title extraction method, applied in the field of information processing, addresses the low accuracy and recall of web page search and improves both.

Status: Inactive. Publication date: 2012-11-07
SHENZHEN SHI JI GUANG SU INFORMATION TECH
2 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

[0004] The embodiment of the present invention provides a method for extracting webpage titles, aiming to sol

Examples

Embodiment 1

[0025] Figure 1 shows the implementation flow of the web page title extraction method provided by Embodiment 1 of the present invention. The process is described in detail as follows:

[0026] In step S101, the text in the title tag and the auxiliary tag in the source file of the web page is extracted.

[0027] In this embodiment, a text parser parses the title tags and auxiliary (meta) tags in the web page source file and extracts the text they contain. For example, when the web page source file is an HTML (HyperText Markup Language) source file, an HTML text parser extracts the text "economic center" from the title tag and the text "politics, economy, technology, culture" from the meta tag.
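Step S101 can be illustrated with a minimal sketch based on Python's standard `html.parser` module. The patent does not name a parser, so the class `TitleMetaParser`, the helper `extract_title_and_meta`, the sample markup, and the choice to read only `keywords` and `description` meta tags are all assumptions made for illustration. Only the two extracted strings come from the example in paragraph [0027].

```python
from html.parser import HTMLParser


class TitleMetaParser(HTMLParser):
    """Collects the text of <title> and the content attribute of <meta> tags."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title_text = ""
        self.meta_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            # Assumption: only keywords/description meta tags carry title-like text.
            if (attr_map.get("name") or "").lower() in ("keywords", "description"):
                content = attr_map.get("content")
                if content:
                    self.meta_texts.append(content.strip())

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_title_and_meta(html_source):
    """Return (title tag text, list of meta tag texts) for an HTML string."""
    parser = TitleMetaParser()
    parser.feed(html_source)
    parser.close()
    return parser.title_text.strip(), parser.meta_texts


# Hypothetical markup; the attribute name="keywords" is an assumption.
SAMPLE_PAGE = """<html><head>
<title>economic center</title>
<meta name="keywords" content="politics, economy, technology, culture">
</head><body><p>Body text</p></body></html>"""

title_text, meta_texts = extract_title_and_meta(SAMPLE_PAGE)
print(title_text)
print(meta_texts)
```

The same two strings would then be passed to the later steps, which compare them against the text blocks of the page.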

[0028] As an embodiment of the present invention, the method can also use a Document Object Model (DOM) tree in place of the web page source file, and subseq...

Embodiment 2

[0047] Figure 2 shows the specific process, provided by Embodiment 2 of the present invention, for calculating the probability that each text block in the web page source file is the web page title, based on the extracted feature points of the text block and the text in the title tag and auxiliary tags:

[0048] In step S201, based on the extracted feature points of the text block and the text in the title tag and auxiliary tags, a decision model obtained through offline training yields the probability value that relates each feature point of the text block to the web page title.

[0049] In this embodiment, feature points are extracted from collected web page samples through offline training and stored in a feature point database. A decision model is trained on the feature points in the database, and the decision model then determines the probability value of each feature point relat...
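The offline training idea can be sketched as follows. The visible text does not say what kind of decision model is used or which feature points exist, so this sketch substitutes a simple counting estimate of P(block is the title | feature point present) with add-one smoothing. The feature names, the training samples, and the averaging rule in `score_block` are hypothetical and are not taken from the patent.

```python
from collections import Counter


def train_feature_probabilities(samples):
    """Estimate P(block is the title | feature point is present) by counting.

    `samples` is an iterable of (feature_points, is_title) pairs, where
    feature_points is a set of hypothetical feature names.
    """
    seen = Counter()
    seen_as_title = Counter()
    for feature_points, is_title in samples:
        for feature in feature_points:
            seen[feature] += 1
            if is_title:
                seen_as_title[feature] += 1
    # Add-one (Laplace) smoothing keeps estimates away from exactly 0 or 1.
    return {f: (seen_as_title[f] + 1) / (seen[f] + 2) for f in seen}


def score_block(feature_points, model, default=0.5):
    """Average the per-feature probabilities (an assumed combination rule)."""
    if not feature_points:
        return default
    return sum(model.get(f, default) for f in feature_points) / len(feature_points)


# Hypothetical labelled text blocks standing in for the feature point database.
TRAINING_SAMPLES = [
    ({"overlaps_title_tag", "in_heading"}, True),
    ({"overlaps_title_tag", "in_heading"}, True),
    ({"overlaps_title_tag"}, False),
    ({"in_navigation"}, False),
    ({"in_navigation"}, False),
]

model = train_feature_probabilities(TRAINING_SAMPLES)
headline_score = score_block({"overlaps_title_tag", "in_heading"}, model)
navigation_score = score_block({"in_navigation"}, model)
print(headline_score > navigation_score)
```

A real system would likely use a richer model and many more samples. The sketch only shows how per-feature probabilities learned offline can later be looked up when a new text block is scored.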

Specific example

[0054] To better illustrate the web page title extraction method, Figure 3 shows a specific example of the method provided by Embodiment 3 of the present invention. The steps of the example are as follows:

[0055] 1. Enter the URL (Uniform Resource Locator, the web page address) http://news.qq.com/a/20101120/000780.htm to obtain the HTML source file of the web page;

[0056] 2. Extract the text in the title tag in the source file: "The State Council issued 16 measures to stabilize the overall level of consumer prices News Tencent Network";

[0057] 3. Extract the text in the meta tag in the source file: "The State Council issued 16 measures to stabilize the overall level of consumer prices and prices";

[0058] 4. Divide the continuous text nodes in the source file into multiple independent text blocks, for example: "Tencent.com Homepage", "Website Navigation", "Mailbox", "The State Council issued 16 measures to stabilize the o...
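Step 4 above, splitting text nodes into independent text blocks, can be sketched with the standard `html.parser` module. The patent does not define exactly where one block ends and the next begins, so this sketch assumes that each non-empty text node outside `script`, `style`, and `title` elements forms one block. The class `TextBlockParser`, the helper `split_text_blocks`, and the sample markup are hypothetical. The three block texts are taken from the example.

```python
from html.parser import HTMLParser


class TextBlockParser(HTMLParser):
    """Turns each non-empty text node outside skipped elements into a block."""

    # Assumption: text inside these elements is not a candidate text block.
    SKIPPED = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace so layout-only nodes become empty.
        text = " ".join(data.split())
        if text and self._skip_depth == 0:
            self.blocks.append(text)


def split_text_blocks(html_source):
    """Return the list of text blocks found in an HTML string."""
    parser = TextBlockParser()
    parser.feed(html_source)
    parser.close()
    return parser.blocks


# Hypothetical markup wrapped around block texts quoted in the example.
SAMPLE_BODY = """<html><head><title>Page title</title>
<script>var x = 1;</script></head>
<body>
<a href="#">Tencent.com Homepage</a>
<a href="#">Website Navigation</a>
<a href="#">Mailbox</a>
</body></html>"""

blocks = split_text_blocks(SAMPLE_BODY)
print(blocks)
```

Each block produced this way would then have its feature points extracted and be scored against the title tag and meta tag text.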

Abstract

The invention applies to the field of information processing and provides a method and a device for extracting a web page title, and an information processing system. The method includes: extracting the text of the title tag and the meta tag in a web page source file; extracting feature points of each text block of the web page source file; calculating, for each text block, the probability that it is the title, according to the extracted feature points of the text block and the text of the title tag and the meta tag; and extracting the text block with the highest probability value as the title. The method effectively filters out text that web page designers pile into the title tag and meta tag but that is irrelevant to the page content, extracts the text that best describes the page's subject or main idea as the web page title, and improves the accuracy and recall of web page search.

Description

Technical field

[0001] The invention belongs to the field of information processing, and in particular relates to a method and a device for extracting a web page title, and to an information processing system.

Background

[0002] A web page title is a sentence that expresses the subject content or central idea of the web page text. With the development of network technology, web page title extraction is used more and more widely. For example, it is required in web page search tasks such as web page preview and web page fingerprint calculation.

[0003] Existing methods for extracting a web page title mainly take the text in the title tag and the auxiliary (meta) tag of the web page source file directly as the web page title. However, as the weight of web page titles in web page search relevance calculation increases, more and more website designers add some u...

Application Information

IPC(8): G06F17/30
Inventors: 杨巍, 张立明
Owner: SHENZHEN SHI JI GUANG SU INFORMATION TECH