A method and system for accurately extracting web page content
A technology for precisely extracting web page content, applied in the Internet field, that addresses time-consuming adjustment and difficult system maintenance, reducing maintenance costs and improving development efficiency
Active Publication Date: 2018-09-28
翁杰
AI Technical Summary
Problems solved by technology
Locating and extracting data with regular expressions performs well, but it makes the system as a whole very hard to maintain
Accurately locating the data on a page usually requires extracting the page content in segments, so the regular-expression code grows very large and becomes difficult to maintain
If the tags of the page being collected change even slightly, the corresponding regular expressions must be readjusted, and the adjustment is very time-consuming
[0004] In a web page's JavaScript, the tag selectors provided by the jQuery framework make it very convenient to filter tags on the page, but they can only be used in client-side scripts running in the browser
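The maintenance problem described above can be illustrated with a small sketch. It is not taken from the patent: the two page fragments, the pattern, and the helper names (ImgSrcCollector, img_sources) are hypothetical, and Python's standard html.parser module stands in for any tag-aware extractor. The second fragment differs from the first only in attribute order, which is enough to defeat a position-sensitive regular expression while tag-based extraction is unaffected.

```python
import re
from html.parser import HTMLParser

# Hypothetical page fragments: the second differs only in attribute order.
OLD_PAGE = '<div class="a"><img src="cat.png" alt="cat"></div>'
NEW_PAGE = '<div class="a"><img alt="cat" src="cat.png"></div>'

# A position-sensitive pattern of the kind the background section describes.
IMG_SRC = re.compile(r'<img src="([^"]+)"')


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag and attribute names in lower case.
        if tag == "img":
            src = dict(attrs).get("src")
            if src is not None:
                self.sources.append(src)


def img_sources(html):
    """Return the src values of all <img> tags found in the given HTML."""
    collector = ImgSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


regex_old = IMG_SRC.findall(OLD_PAGE)   # the pattern matches the old markup
regex_new = IMG_SRC.findall(NEW_PAGE)   # the same pattern finds nothing here
parsed_old = img_sources(OLD_PAGE)
parsed_new = img_sources(NEW_PAGE)      # tag-based extraction is unaffected
```

The regular expression would have to be rewritten for the new markup, whereas the tag-based version keeps working unchanged.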
Examples
Embodiment 1
[0068] When the user enters the command "INPUT, SELECT, TEXTAREA, IMG", it means: find all INPUT, SELECT, TEXTAREA, and IMG tags in the page.
[0069] Through lexical analysis, the system forms the complete, unoptimized expression tree shown in Figure 3.
[0070] In the grammar table, the block is defined as follows:
[0071] ::= ','
[0072] |
[0073] That is, the block consists of one or more items separated by commas (",").
[0074] Before optimization, the nodes presented in the syntax tree are nested, which can be expressed as follows:
[0079] The nested format is not conducive to syntactic analysis, so the nodes need to be optimized into the following expression tree, which is easier to analyze:
[0080] INPUT,SE...
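The optimization described in Embodiment 1, turning a nested comma-separated tree into a flat one, can be sketched as follows. The exact shape of the unoptimized tree is not visible in this excerpt, so the right-recursive nesting used here is an assumption, and ListNode, parse_nested, and flatten are hypothetical names rather than the patent's own.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ListNode:
    """Nested node for a rule of the assumed form: item, or item ',' rest."""
    item: str
    rest: Optional["ListNode"] = None


def parse_nested(expression: str) -> ListNode:
    """Build the nested (unoptimized) tree from a comma-separated expression."""
    items = [part.strip() for part in expression.split(",") if part.strip()]
    if not items:
        raise ValueError("expression contains no selectors")
    node = None
    # Build from the right so that the first item ends up at the root.
    for item in reversed(items):
        node = ListNode(item, node)
    return node


def flatten(node: Optional[ListNode]) -> List[str]:
    """Collapse the nested tree into a flat list that is easy to iterate over."""
    flat = []
    while node is not None:
        flat.append(node.item)
        node = node.rest
    return flat


tree = parse_nested("INPUT, SELECT, TEXTAREA, IMG")
flat = flatten(tree)  # ['INPUT', 'SELECT', 'TEXTAREA', 'IMG']
```

After flattening, later stages can treat each entry as an independent selector instead of walking one nesting level per comma.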
Embodiment 2
[0097] When the user enters the expression "DIV.a>IMG.b[alt]", it means: first find the DIV tags with style (class) name a in the page, then select from their child elements the IMG tags that have style name b and contain an alt attribute.
[0098] Through lexical analysis, the system forms the complete, unoptimized expression tree shown in Figure 5.
[0099] In the grammar table, the block is defined as follows:
[0100] ::=
[0101] |
[0102] ::='>'|'+'|'~'|
[0103] That is, the block consists of one or more layers separated by layer separators, of which there are three: ">", "+", and "~". In jQuery syntax, one or more blank characters also act as a separator. In that case, it is necessary to check whether the node contains blank characters, keeping them if they exist and removing them otherwise.
[0104] The above nodes can be optimized as shown in the schematic diagram of Figure 6, the opti...
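The layered structure described in Embodiment 2 can be sketched as a small parser that splits an expression into layers and records the separator in front of each one. This is an illustration under stated assumptions, not the patent's grammar table: Layer, parse_layers, and the three regular expressions are hypothetical, only tag names, .class filters, and bare [attr] filters are handled, and a run of blanks with no explicit separator is recorded as " ".

```python
import re
from dataclasses import dataclass, field
from typing import List

# One layer: an optional tag name, then any number of .class and [attr] filters.
COMPOUND = re.compile(r"([A-Za-z][\w-]*|\*)?((?:\.[\w-]+|\[[\w-]+\])*)")
FILTER = re.compile(r"\.([\w-]+)|\[([\w-]+)\]")
# A separator is '>', '+' or '~' (optionally padded with blanks), or blanks alone.
SEPARATOR = re.compile(r"\s*([>+~])\s*|\s+")


@dataclass
class Layer:
    separator: str  # separator in front of this layer; "" for the first layer
    tag: str
    classes: List[str] = field(default_factory=list)
    attributes: List[str] = field(default_factory=list)


def parse_layers(expression: str) -> List[Layer]:
    """Split a selector expression into layers joined by separators."""
    text = expression.strip()
    layers: List[Layer] = []
    pos, separator = 0, ""
    while True:
        match = COMPOUND.match(text, pos)
        if match is None or match.end() == pos:
            raise ValueError(f"expected a selector at position {pos}")
        tag = (match.group(1) or "*").upper()
        classes, attributes = [], []
        for cls, attr in FILTER.findall(match.group(2)):
            if cls:
                classes.append(cls)
            else:
                attributes.append(attr)
        layers.append(Layer(separator, tag, classes, attributes))
        pos = match.end()
        if pos == len(text):
            return layers
        sep = SEPARATOR.match(text, pos)
        if sep is None:
            raise ValueError(f"unexpected character at position {pos}")
        # Blanks around an explicit separator are dropped; blanks on their
        # own are kept as the separator " ".
        separator = sep.group(1) or " "
        pos = sep.end()


layers = parse_layers("DIV.a>IMG.b[alt]")
```

For "DIV.a>IMG.b[alt]" this yields two layers: DIV with class a, then IMG with class b and attribute alt, joined by ">".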
Abstract
The embodiment of the invention discloses a method for accurately extracting web page content. The method comprises the following steps: the web page content corresponding to a URL (uniform resource locator) is obtained, and its source code is parsed into a DOM (document object model) tree; a screening expression entered by the user is read; a syntactic analyzer loads a grammar table and parses the screening expression into an expression tree composed of multiple words; and semantic analysis of those words yields a screening condition set. The screening condition set is composed of multiple screening condition objects, each of which consists of one tag extraction method and multiple tag screening methods. The embodiment of the invention further discloses a system for accurately extracting web page content. In the method and system, syntactic analysis recombines the screening expression into an optimized expression tree, and semantic analysis turns it into a set of screening condition objects, so that DOM tree nodes can be located and quickly screened. This improves development efficiency and reduces maintenance cost.
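The last stage of the pipeline in the abstract, applying a set of screening condition objects to a DOM tree, can be sketched as follows. Everything here is an assumption made for illustration: Node, TreeBuilder, ScreeningCondition, run, and the helper factories are hypothetical names, the HTML string stands in for a page fetched from a URL, and the condition set for "DIV.a>IMG.b[alt]" is written by hand rather than produced by syntactic and semantic analysis.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Callable, Dict, Iterator, List, Optional

# Elements that never have children, so they must not become the current node.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


@dataclass(eq=False)
class Node:
    tag: str
    attrs: Dict[str, str]
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)

    def descendants(self) -> Iterator["Node"]:
        for child in self.children:
            yield child
            yield from child.descendants()


class TreeBuilder(HTMLParser):
    """Builds a small DOM-like tree (text nodes are ignored in this sketch)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", {})
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, {k: v or "" for k, v in attrs}, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


@dataclass
class ScreeningCondition:
    """One tag-extraction method plus any number of tag-screening methods."""
    extract: Callable[[Node], List[Node]]
    screens: List[Callable[[Node], bool]] = field(default_factory=list)

    def apply(self, start: Node) -> List[Node]:
        return [n for n in self.extract(start) if all(s(n) for s in self.screens)]


def descendants_by_tag(name):
    return lambda start: [n for n in start.descendants() if n.tag == name]

def children_by_tag(name):
    return lambda start: [n for n in start.children if n.tag == name]

def has_class(name):
    return lambda node: name in node.attrs.get("class", "").split()

def has_attr(name):
    return lambda node: name in node.attrs


def run(conditions: List[ScreeningCondition], root: Node) -> List[Node]:
    """Apply each condition to the nodes kept by the previous one."""
    current = [root]
    for condition in conditions:
        found, seen = [], set()
        for start in current:
            for node in condition.apply(start):
                if id(node) not in seen:
                    seen.add(id(node))
                    found.append(node)
        current = found
    return current


PAGE = """
<div class="a"><img class="b" alt="logo" src="1.png"><img class="b" src="2.png">
<p><img class="b" alt="nested" src="3.png"></p></div>
<div class="c"><img class="b" alt="other" src="4.png"></div>
"""
builder = TreeBuilder()
builder.feed(PAGE)
builder.close()

# Hand-built condition set corresponding to "DIV.a>IMG.b[alt]".
conditions = [
    ScreeningCondition(descendants_by_tag("div"), [has_class("a")]),
    ScreeningCondition(children_by_tag("img"), [has_class("b"), has_attr("alt")]),
]
sources = [node.attrs["src"] for node in run(conditions, builder.root)]
```

Keeping extraction and screening as separate callables means a change to the expression only changes which condition objects are built, while the tree-walking code stays the same.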
Description
Technical field
[0001] The present invention relates to the technical field of the Internet, in particular to a method and system for accurately extracting web page content.
Background technique
[0002] Web page data collection is a technology similar to a search engine robot: it collects articles and data from the Internet and stores them in a database to fill a website with content. Data collection is of great help in enriching a website's content and increasing its traffic.
[0003] However, most data collection methods in the prior art use regular expressions to locate and extract data. This approach performs well, but it makes the whole system hard to maintain. To accurately locate the data on a page, it is usually necessary to extract the page content in segments, so the regular-expression code becomes very large and difficult to maintain. If the label of the pag...