Method and device for extracting information from a webpage
A technology for extracting information from web pages using information attributes, applied in fields such as instruments, computing, and electrical digital data processing. It addresses the problem that accurate information cannot be obtained from web pages in unstructured formats.
Examples
Embodiment 1
[0048] Embodiment 1. Here, the information may specifically be news information. Candidate news attributes are obtained from the webpage and placed into the corresponding attribute candidate sets. See Figure 2; the method includes:
[0049] Step 201: Read a webpage containing news, and convert the read webpage into a Document Object Model (DOM) tree structure.
[0050] In this embodiment of the present invention, the WebBrowser component available to the C# language in Microsoft Visual Studio 2005 can be used to convert the read webpage into a DOM tree structure.
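Step 201 can be illustrated with a minimal sketch. It is written in Python with the standard-library `html.parser` module rather than the C# WebBrowser component named above, and it builds only a simplified tree. The names `Node`, `DomBuilder`, `build_dom` and `walk` are hypothetical and do not come from the patent.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One node of the simplified DOM tree (hypothetical structure)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag                   # tag (label) information
        self.attrs = dict(attrs or [])   # e.g. class or style attributes
        self.parent = parent
        self.children = []
        self.text = ""                   # text directly inside this node


class DomBuilder(HTMLParser):
    """Builds a tree of Node objects while the parser streams over the HTML."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self._stack[-1])
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray closing tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        self._stack[-1].text += data.strip()


def build_dom(html):
    """Parse an HTML string and return the root of the simplified tree."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def walk(node, depth=0):
    """Yield (depth, node) pairs in document order."""
    yield depth, node
    for child in node.children:
        yield from walk(child, depth + 1)
```

Calling `build_dom(page_source)` and iterating over `walk(root)` visits the nodes in document order, which is one possible starting point for collecting candidate attributes. A real page would need a more tolerant parser than this sketch.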
[0051] Each node of the DOM tree structure is an attribute of the news information; that is, it may be a title, release time, author, comment link, source, text, subject, related news or picture. The attribute corresponding to each node includes sub-attributes. For example, if a node is a title, the node contains the font information, tag information, position information, text information, etc. of the t...
Embodiment 2
[0099] Embodiment 2 is a preferred embodiment of the present invention, which can extract more comprehensive information according to the positional relationship between every two types of information attributes in the webpage, so as to obtain more accurate information.
[0100] Of course, an embodiment of the present invention can also rely on the positional relationship between only two particular types of information attributes: it extracts the information attribute combinations that satisfy that positional relationship and outputs the extracted combinations as the information.
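The two-attribute case in [0100] can be sketched as a filter over candidate pairs. The text shown here does not define the positional relationship, so this sketch assumes one purely for illustration: the second attribute must appear after the first within a small number of nodes in document order. `Candidate`, `follows_within`, `extract_pairs` and the `max_gap` parameter are hypothetical names, not terms from the patent.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Candidate:
    kind: str    # attribute type, e.g. "title" or "release_time"
    text: str    # text content of the DOM node
    order: int   # index of the node in document order (assumed position measure)


def follows_within(first, second, max_gap):
    """Assumed relationship: `second` comes after `first`, at most `max_gap` nodes later."""
    return 0 < second.order - first.order <= max_gap


def extract_pairs(firsts, seconds, max_gap=3):
    """Return every (first, second) combination that satisfies the assumed relationship."""
    return [(a, b) for a, b in product(firsts, seconds)
            if follows_within(a, b, max_gap)]
```

Given a candidate set of titles and a candidate set of release times, `extract_pairs(titles, times)` keeps only the combinations whose positions are compatible. Other relationships, such as one based on rendered coordinates, could replace `follows_within` without changing the rest.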
[0101] Based on the above method for extracting information from webpages, a device for extracting information from webpages can be constructed. See Figure 6; the device includes an acquisition unit 100, a determination unit 200 and an extraction unit 300.
[0102] The acquisition unit 100 is configured to search for each type of information attribute in the webpage, and obtain a candida...