A method and device for capturing effective webpage content
A web content crawling method and device, applicable to website content management, network data retrieval, digital data information retrieval and related fields. It addresses problems such as long connection times, the inability to extract text information, and slow speed.
Embodiment Construction
[0040] The specific embodiments of the present invention will be described in detail below. It should be noted that the embodiments described here are for illustration only, and are not intended to limit the present invention.
[0041] The present invention starts from the overall structure of the webpage from which effective content is to be extracted, and examines the position information, the unique result information, and the label information of the various text entities in the webpage, so that webpage text entities can be extracted automatically. A webpage file conforms to the tree structure of the HTML DOM (Document Object Model). A webpage with effective content, such as a news webpage, contains many types of tags, which are generally divided logically into page function tags, advertisement tags, and news content tags. Webpage information extraction needs to extract the effective content from the webpage, such as the content under the news content tags. The function...
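The idea of examining where each text entity sits in the DOM tree can be illustrated with a minimal sketch. It is not the method claimed by the invention: it only records, for every non-empty text node, the path of tags leading to it, using Python's standard `html.parser` module. The names `TextEntityCollector`, `collect_text_entities`, and `VOID_TAGS` are hypothetical and introduced here for illustration only.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are kept off the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TextEntityCollector(HTMLParser):
    """Hypothetical helper: pairs each text node with its tag path in the DOM tree."""

    def __init__(self):
        super().__init__()
        self.path = []      # stack of currently open tags, root first
        self.entities = []  # (tag path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; stray closing tags are ignored.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        # Skip whitespace-only nodes and anything inside script or style.
        if text and not {"script", "style"} & set(self.path):
            self.entities.append(("/".join(self.path), text))


def collect_text_entities(html):
    """Return (tag path, text) pairs for all visible text nodes in the HTML string."""
    parser = TextEntityCollector()
    parser.feed(html)
    parser.close()
    return parser.entities


if __name__ == "__main__":
    sample = ("<html><body><ul><li>Home</li></ul>"
              "<div><h1>Headline</h1><p>Body text</p></div>"
              "<script>var x = 1;</script></body></html>")
    for tag_path, text in collect_text_entities(sample):
        print(tag_path, "|", text)
```

A classifier that separates page function tags, advertisement tags, and news content tags would then operate on these paths and texts; how that classification is done is not shown here, because the source passage is truncated before describing it.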