Method and system for extracting webpage content

A method and system for extracting web page content, applied in the field of web page content extraction. It addresses the problems of slow extraction, reliance on a single technical means, and low extraction accuracy, and achieves more accurate extraction results.

Inactive Publication Date: 2019-07-30
GUANGZHOU WANLONG SECURITIES CONSULTING CO LTD
3 Cites, 2 Cited by

AI Technical Summary

Problems solved by technology

Existing methods in this field each rely on a single technique, extract content relatively slowly, and have relatively low accuracy, so they struggle to meet application requirements.

Method used

Regular expression matching, CSS style matching, and XPath matching are applied to the webpage in sequence: each later technique is tried only when the previous one fails to extract the content, and the extraction result is then output.


Embodiment Construction

[0052] Referring to Figure 1, the present invention provides a method for extracting webpage content, comprising the following steps:

[0053] S1. Perform content extraction on the webpage based on regular expression matching. If the extraction is judged successful, execute step S4; otherwise, continue to step S2.

[0054] S2. Perform content extraction on the webpage based on the CSS style. If the extraction is judged successful, execute step S4; otherwise, continue to step S3.

[0055] S3. Perform content extraction on the webpage based on XPath matching.

[0056] S4. Output the extraction result.
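
The S1 to S4 flow can be sketched as a simple fallback chain. The following Python example is only an illustration under stated assumptions, not the patent's implementation: the regular expression, the CSS class name `content-body`, the XPath `.//div[@id='content']`, and the rule that "successful" means "non-empty text" are all hypothetical choices, since the source does not specify them. The CSS step is approximated by matching a class name with the standard library's `html.parser`, and the XPath step uses the limited XPath subset of `xml.etree.ElementTree`, which requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from typing import Optional

# Tags that never have a closing tag; ignored when tracking nesting depth.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


def _normalize(text: str) -> Optional[str]:
    """Collapse whitespace; return None for empty text (treated as failure)."""
    cleaned = " ".join(text.split())
    return cleaned or None


def extract_by_regex(html: str) -> Optional[str]:
    """S1: regex matching (hypothetical pattern: contents of <article>)."""
    match = re.search(r"<article[^>]*>(.*?)</article>", html, re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    return _normalize(re.sub(r"<[^>]+>", " ", match.group(1)))


class _ClassTextCollector(HTMLParser):
    """Collects text inside the first element that carries a given CSS class."""

    def __init__(self, css_class: str) -> None:
        super().__init__()
        self.css_class = css_class
        self.depth = 0      # nesting depth inside the matched element
        self.done = False   # True once the matched element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif self.css_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1
            self.done = self.depth == 0

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_by_css(html: str, css_class: str = "content-body") -> Optional[str]:
    """S2: CSS-style matching (hypothetical class name)."""
    collector = _ClassTextCollector(css_class)
    collector.feed(html)
    return _normalize("".join(collector.chunks))


def extract_by_xpath(html: str, path: str = ".//div[@id='content']") -> Optional[str]:
    """S3: XPath matching (hypothetical path; needs well-formed markup)."""
    try:
        root = ET.fromstring(html)
    except ET.ParseError:
        return None
    node = root.find(path)
    if node is None:
        return None
    return _normalize("".join(node.itertext()))


def extract_content(html: str) -> Optional[str]:
    """Try S1, S2, S3 in order and return the first successful result (S4)."""
    for extractor in (extract_by_regex, extract_by_css, extract_by_xpath):
        result = extractor(html)
        if result is not None:
            return result
    return None
```

Calling `extract_content(page)` on a page string returns the text found by the first technique that succeeds, or `None` when all three fail. Ordering the extractors from cheapest to most structured mirrors the sequence described in the steps above.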

[0057] This method first extracts the webpage content using regular expressions. If that extraction is unsuccessful, it extracts the content based on the CSS style, and if that also fails, it extracts the content based on XPath matching. Accord...


Abstract

The invention discloses a method and system for extracting webpage content. The method comprises the following steps: S1, performing content extraction on a webpage based on regular expression matching, executing step S4 when the extraction is judged successful, and otherwise continuing to step S2; S2, performing content extraction on the webpage based on the CSS style, executing step S4 when the extraction is judged successful, and otherwise continuing to step S3; S3, performing content extraction on the webpage based on XPath matching; S4, outputting the extraction result. By applying the regular expression, the CSS style, and XPath in sequence, the invention extracts webpage content quickly, and combining the three extraction modes greatly improves the accuracy of the extracted content, so that an effective and accurate extraction result can be provided. The method and system can be widely applied in the field of webpage information processing.

Description

Technical field

[0001] The invention relates to the field of computer applications and information extraction, in particular to a method and system for extracting web page content.

Background

[0002] Glossary:

[0003] CSS style: Cascading Style Sheets, a computer language used to describe the presentation of documents written in HTML (an application of Standard Generalized Markup Language) or XML (a subset of Standard Generalized Markup Language);

[0004] XPath: a language for finding information in XML documents, used to determine the position of a particular part of an XML document. Based on the tree structure of XML, XPath provides the ability to find nodes in the data structure tree.

[0005] General text mining analysis involves web page content extraction. The content of a web page is the basic information element in a text and the basis for a correct understanding of the text. Web page content extraction is an important basic tool in machine lear...
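
To make the glossary entry on XPath concrete, here is a small Python sketch of locating nodes in a document tree by path. The sample document, element names, and `id` values are invented for illustration, and `xml.etree.ElementTree` supports only a subset of XPath, so this shows the idea rather than the full language.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document.
DOC = """
<html>
  <body>
    <div id="nav"><p>Home</p></div>
    <div id="content">
      <h1>Title</h1>
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </body>
</html>
""".strip()

root = ET.fromstring(DOC)

# ".//p" selects every <p> element anywhere below the root.
all_paragraphs = [p.text for p in root.findall(".//p")]

# A predicate narrows the search to <p> children of the div with id="content".
content_paragraphs = [p.text for p in root.findall(".//div[@id='content']/p")]

# A path from the root walks the tree one level at a time.
title = root.findtext("./body/div[@id='content']/h1")
```

Each path expression describes a position in the tree, which is how XPath determines where a particular part of the document lies.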


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/335, G06F16/9535, G06F16/903
CPC: G06F16/335, G06F16/9535, G06F16/90344
Inventor: 吴远辉
Owner: GUANGZHOU WANLONG SECURITIES CONSULTING CO LTD