
Method, system, device and storage medium for extracting the text of policy web pages

A webpage text extraction technology, applied in the fields of website content management, network data retrieval, instruments, etc., that addresses problems such as poor extraction results and differences in webpage content layout.

Active Publication Date: 2021-04-20
SHANDONG EVAYINFO TECH CO LTD
14 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

However, policy web pages have a different source code structure, and their content layout also differs considerably from that of ordinary news web pages.
Therefore, existing web page content analysis methods cannot accurately locate the text of a policy web page, and the extraction results are poor.



Examples


Embodiment 1

[0029] This embodiment provides a method for extracting policy webpage text.

[0030] As shown in Figure 1, the method for extracting the text of policy web pages includes:

[0031] S101: Obtain the HTML source code of the policy web page;

[0032] S102: Obtain the location of the text of the webpage according to the HTML source code of the policy webpage;

[0033] S103: Obtain the HTML source code corresponding to the text according to the location of the webpage text, and output the HTML source code corresponding to the text.
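
Steps S102 and S103 can be illustrated with a short Python sketch. The patent's actual rules for locating the text are not shown in this excerpt, so the sketch swaps in a simple stand-in heuristic: it picks the container element whose own text is longest. The names `TextLocator`, `locate_text`, `extract_text_html` and the `CANDIDATE_TAGS` set are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumption: tags treated as possible containers of the main text.
CANDIDATE_TAGS = {"div", "td", "article", "section"}
SKIPPED_TAGS = {"script", "style"}


class TextLocator(HTMLParser):
    """Stand-in heuristic: text is credited to its nearest enclosing candidate
    element, and the candidate with the most credited text wins."""

    def __init__(self, source: str):
        super().__init__(convert_charrefs=True)
        self.source = source
        # Absolute offset of each line start, used to convert getpos().
        self.line_starts = [0] + [i + 1 for i, ch in enumerate(source) if ch == "\n"]
        self.stack = []        # open candidates as [tag, start_offset, text_length]
        self.skip_depth = 0    # > 0 while inside <script> or <style>
        self.best = (0, 0, 0)  # (text_length, start_offset, end_offset)
        self.feed(source)
        self.close()

    def _offset(self) -> int:
        line, column = self.getpos()
        return self.line_starts[line - 1] + column

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in CANDIDATE_TAGS:
            self.stack.append([tag, self._offset(), 0])

    def handle_data(self, data):
        if self.stack and self.skip_depth == 0:
            self.stack[-1][2] += len(data.strip())

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if tag not in CANDIDATE_TAGS or all(entry[0] != tag for entry in self.stack):
            return
        # Pop until the matching open tag, discarding unclosed inner candidates.
        while self.stack:
            open_tag, start, text_length = self.stack.pop()
            if open_tag == tag:
                end = self.source.index(">", self._offset()) + 1
                if text_length > self.best[0]:
                    self.best = (text_length, start, end)
                break


def locate_text(html: str) -> tuple:
    """S102 (sketch): return (start, end) offsets of the text container."""
    _, start, end = TextLocator(html).best
    return start, end


def extract_text_html(html: str) -> str:
    """S103 (sketch): return the HTML source at the located position."""
    start, end = locate_text(html)
    return html[start:end]
```

If no candidate element contains any text, the sketch returns an empty string. A real implementation would replace the heuristic with rules derived from the page organization structure of policy web pages, as the abstract describes.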

[0034] As one or more embodiments, in S101 the HTML source code of the policy web page includes, but is not limited to, text, pictures, attachment download links, and the like.

[0035] The HTML source code of the web page is obtained by accessing its URL: the URL of the web page is first escaped, and the escaped URL is then accessed.
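
The escape-then-access step in [0035] might look like the following minimal Python sketch, which uses only the standard library. The function names `escape_url` and `fetch_html`, the `User-Agent` header, the timeout, and the UTF-8 fallback are assumptions made for illustration; the patent does not specify them.

```python
from urllib.parse import quote, urlsplit, urlunsplit
from urllib.request import Request, urlopen


def escape_url(url: str) -> str:
    """Percent-encode unsafe characters in the path and query of a URL.

    '%' is kept as a safe character so that a URL which is already escaped
    is not escaped a second time.
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Access the escaped URL and return the decoded HTML source (S101)."""
    request = Request(escape_url(url), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Assumption: fall back to UTF-8 when the server declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Escaping matters for policy sites because paths often contain Chinese characters or spaces, which `urlopen` would otherwise reject or mis-send.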

[0036] As one or more embodiments, after the step of obtaining the...

Embodiment 2

[0066] This embodiment provides a system for extracting policy webpage text.

[0067] The policy web page text extraction system includes:

[0068] A source code acquisition module configured to: acquire the HTML source code of the policy web page;

[0069] A webpage text location acquisition module configured to: obtain the location of the webpage text according to the HTML source code of the policy webpage;

[0070] An output module configured to: obtain the HTML source code corresponding to the text according to the location of the web page text, and output the HTML source code corresponding to the text.

[0071] It should be noted that the source code acquisition module, the web page text location acquisition module and the output module correspond to steps S101 to S103 in Embodiment 1, and the examples and application scenarios implemented by the above-mentioned modules and the corresponding steps are the sa...
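
One way to picture the three-module system is as a small object that wires two pluggable modules into an output step. The sketch below is only an illustration under that assumption: the class name `PolicyTextExtractionSystem`, its field names, and the stub modules in the usage lines are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class PolicyTextExtractionSystem:
    # Source code acquisition module (S101): URL -> HTML source.
    acquire_source: Callable[[str], str]
    # Webpage text location acquisition module (S102): HTML -> (start, end).
    locate_text: Callable[[str], Tuple[int, int]]

    def output(self, url: str) -> str:
        """Output module (S103): return the HTML source of the located text."""
        html = self.acquire_source(url)
        start, end = self.locate_text(html)
        return html[start:end]


# Usage with stub modules (no network access, trivial locating rule).
pages = {"http://example.com/p": "<body><div>Policy text</div></body>"}
system = PolicyTextExtractionSystem(
    acquire_source=pages.__getitem__,
    locate_text=lambda html: (html.index("<div>"), html.index("</div>") + len("</div>")),
)
print(system.output("http://example.com/p"))  # <div>Policy text</div>
```

Passing the modules in as callables keeps each one replaceable, which mirrors the one-to-one correspondence between modules and steps stated in [0071].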

Embodiment 3

[0075] This embodiment also provides an electronic device, including one or more processors, one or more memories, and one or more computer programs, wherein the processor is connected to the memory and the one or more computer programs are stored in the memory. When the electronic device is running, the processor executes the one or more computer programs stored in the memory, so that the electronic device performs the method described in Embodiment 1 above.

[0076] It should be understood that in this embodiment the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processo...


Abstract

The invention discloses a method, system, device and storage medium for extracting the text of policy web pages, including: obtaining the HTML source code of a policy web page; obtaining the location of the web page text according to the HTML source code; and obtaining the HTML source code corresponding to the text according to that location and outputting it. By analyzing the rules governing the page organization structure of policy web pages, a relationship between the page organization structure and the location of the web page text is constructed and the text content is obtained, so that policy web page text is extracted quickly and effectively. Implementing the invention allows policy web page text to be acquired quickly and efficiently, greatly improves work efficiency, and saves the company labor costs. Verification also shows that the invention achieves a relatively high accuracy rate.

Description

Technical field

[0001] The present application relates to the technical field of webpage text extraction, in particular to a method, system, device and storage medium for policy webpage text extraction.

Background technique

[0002] The statements in this section merely mention the background art related to this application, and do not necessarily constitute the prior art.

[0003] Nowadays, a large number of notices, announcements, and policies are announced through web pages. The existing web content analysis systems mainly focus on news and other article web pages, and most of them locate the main content through the HTML source code structure. However, the policy webpage has a different webpage source code structure, and the content layout of the webpage is also quite different from that of ordinary news webpages. Therefore, the existing web page content analysis method cannot accurately locate the position of the text of the policy web page, and the extraction effect i...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/9535, G06F16/955, G06F16/958
CPC: G06F16/9535, G06F16/9566, G06F16/986
Inventor: 李钊, 卢凤, 陈通, 王瑞霜, 胡传会, 魏静
Owner: SHANDONG EVAYINFO TECH CO LTD