Method and system for extracting webpage content

A webpage content extraction technology, applied in the field of webpage information processing, that addresses slow extraction, reliance on a single technique, and low extraction accuracy, and thereby improves accuracy and yields accurate extraction results.

Status: Inactive
Publication Date: 2019-07-30

GUANGZHOU WANLONG SECURITIES CONSULTING CO LTD

3 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

Existing methods in this field each rely on a single technique, extract content relatively slowly, and achieve relatively low accuracy, so they struggle to meet application requirements.

Embodiment Construction

[0052] Referring to Figure 1, the present invention provides a method for extracting webpage content, comprising the following steps:

[0053] S1. Perform content extraction on the webpage based on regular expression matching. If the extraction is judged successful, execute step S4; otherwise, continue to step S2;

[0054] S2. Perform content extraction on the webpage based on the CSS style. If the extraction is judged successful, execute step S4; otherwise, continue to step S3;

[0055] S3. Perform content extraction on the webpage based on XPath matching;

[0056] S4. Output the extraction result.
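
Steps S1 to S4 can be sketched as an ordered fallback chain. The following is a minimal illustration in Python using only the standard library, not the patent's actual implementation. Several things here are assumptions: "successful" is taken to mean a non-empty result, "CSS style" extraction is interpreted as selecting the first element with a given class, the XPath stage uses the limited XPath subset of `xml.etree.ElementTree` (which needs well-formed markup), and all function names, patterns, and the `STAGES` list are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Callable, List, Optional, Tuple

Extractor = Callable[[str], Optional[str]]

# Elements that never get a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def by_regex(pattern: str) -> Extractor:
    """S1 (assumed form): first capture group of a regular expression."""
    compiled = re.compile(pattern, re.DOTALL)

    def extract(page: str) -> Optional[str]:
        match = compiled.search(page)
        return match.group(1).strip() if match else None

    return extract


class _ClassText(HTMLParser):
    """Collects the text of the first element that carries a given class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.depth = 0  # > 0 while inside the target element
        self.finished = False
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.finished or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            self.finished = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def by_css_class(css_class: str) -> Extractor:
    """S2 (assumed form): text of the first element with the given class."""

    def extract(page: str) -> Optional[str]:
        parser = _ClassText(css_class)
        parser.feed(page)
        parser.close()
        return "".join(parser.chunks).strip() or None

    return extract


def by_xpath(path: str) -> Extractor:
    """S3 (assumed form): ElementTree's XPath subset on well-formed markup."""

    def extract(page: str) -> Optional[str]:
        try:
            node = ET.fromstring(page).find(path)
        except ET.ParseError:
            return None
        if node is None:
            return None
        return "".join(node.itertext()).strip() or None

    return extract


def extract_content(
    page: str, stages: List[Tuple[str, Extractor]]
) -> Optional[Tuple[str, str]]:
    """Try each stage in order; return (stage_name, text) for the first hit (S4)."""
    for name, stage in stages:
        text = stage(page)
        if text:  # assumption: "successful" means a non-empty result
            return name, text
    return None


# Hypothetical stage configuration for a page whose title is the target content.
STAGES = [
    ("regex", by_regex(r"<title>(.*?)</title>")),
    ("css", by_css_class("headline")),
    ("xpath", by_xpath(".//article/h1")),
]
```

Calling `extract_content(page, STAGES)` returns the name of the stage that succeeded together with the extracted text, or `None` if every stage fails. Keeping the stages in a list makes the order explicit, and each later stage runs only when the earlier ones return nothing.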

[0057] The method first extracts the webpage content using regular expressions. If that extraction is unsuccessful, it extracts the content based on the CSS style, and if that also fails, it extracts the content based on XPath matching. Accord...

Abstract

The invention discloses a method and system for extracting webpage content. The method comprises the following steps: S1, performing content extraction on a webpage based on regular expression matching, executing step S4 when the extraction is judged successful, and otherwise continuing to step S2; S2, performing content extraction on the webpage based on the CSS style, executing step S4 when the extraction is judged successful, and otherwise continuing to step S3; S3, performing content extraction on the webpage based on XPath matching; S4, outputting the extraction result. By applying regular expressions, the CSS style, and XPath in sequence, the invention extracts webpage content as quickly as possible, and combining the three extraction modes greatly improves the accuracy of the extracted content, so an effective and accurate extraction result can be provided. The method and system can be widely applied in the field of webpage information processing.

Description

Technical field

[0001] The invention relates to the field of computer applications and information extraction, in particular to a method and system for extracting webpage content.

Background technique

[0002] Glossary:

[0003] CSS Style: Cascading Style Sheets, a computer language used to describe the presentation of documents written in HTML (an application of Standard Generalized Markup Language) or XML (a subset of Standard Generalized Markup Language);

[0004] XPath: a language for finding information in XML documents, used to determine the position of a particular part of an XML document. Based on the tree structure of XML, XPath provides the ability to find nodes in the data structure tree.

[0005] General text mining analysis involves webpage content extraction. The content of a webpage is the basic information element in a text and the basis for a correct understanding of the text. Web page content extraction is an important basic tool in machine lear...

Application Information

Patent Type & Authority: Applications (China)

IPC(8): G06F16/335, G06F16/9535, G06F16/903

CPC: G06F16/335, G06F16/9535, G06F16/90344

Inventor: 吴远辉

Owner: GUANGZHOU WANLONG SECURITIES CONSULTING CO LTD
