Method, system, device and storage medium for extracting the text of policy web pages
A webpage text extraction technology, applied in fields such as website content management and network data retrieval, that addresses problems such as poor extraction results caused by differences in content layout between web pages.
Examples
Embodiment 1
[0029] This embodiment provides a method for extracting the text of policy web pages.
[0030] As shown in Figure 1, the method for extracting the text of policy web pages includes:
[0031] S101: Obtain the HTML source code of the policy web page;
[0032] S102: Obtain the location of the text of the webpage according to the HTML source code of the policy webpage;
[0033] S103: Obtain the HTML source code corresponding to the text according to the location of the webpage text, and output the HTML source code corresponding to the text.
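Steps S102 and S103 can be illustrated with a minimal sketch. This excerpt does not disclose how the location of the text is determined, so the heuristic below is purely an assumption for illustration and is not the method of this embodiment: it picks the `<div>` that directly holds the most non-link text and returns that element's HTML source. The names `locate_main_text` and `extract_main_html` are hypothetical, and the HTML from S101 is passed in as a string.

```python
from html.parser import HTMLParser


class _DivTextLocator(HTMLParser):
    """Finds the <div> holding the most non-link text (illustrative heuristic only)."""

    def __init__(self, html):
        super().__init__(convert_charrefs=True)
        self._html = html
        # Index of the first character of every line, used to convert
        # getpos() (line, column) into an absolute string index.
        self._line_starts = [0] + [i + 1 for i, ch in enumerate(html) if ch == "\n"]
        self._stack = []       # open <div> elements as [start_index, own_text_chars]
        self._link_depth = 0   # > 0 while inside <a>...</a>
        self._skip = False     # True while inside <script> or <style>
        self.best = None       # (score, start_index, end_index)
        self.feed(html)
        self.close()

    def _index(self):
        line, col = self.getpos()
        return self._line_starts[line - 1] + col

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self._stack.append([self._index(), 0])
        elif tag == "a":
            self._link_depth += 1
        elif tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag == "a" and self._link_depth:
            self._link_depth -= 1
        elif tag in ("script", "style"):
            self._skip = False
        elif tag == "div" and self._stack:
            start, score = self._stack.pop()
            # The end tag starts at the current position; include it up to '>'.
            end = self._html.index(">", self._index()) + 1
            if score > 0 and (self.best is None or score > self.best[0]):
                self.best = (score, start, end)

    def handle_data(self, data):
        # Count only visible, non-link text, credited to the innermost open <div>.
        if self._stack and not self._link_depth and not self._skip:
            self._stack[-1][1] += len(data.strip())


def locate_main_text(html):
    """S102 (sketch): return (start, end) indices of the chosen <div>, or None."""
    best = _DivTextLocator(html).best
    return (best[1], best[2]) if best else None


def extract_main_html(html):
    """S103 (sketch): return the HTML source at the located position, or ''."""
    span = locate_main_text(html)
    return html[span[0]:span[1]] if span else ""
```

Because the returned value is a slice of the original source, any pictures and attachment links nested inside the chosen element are preserved along with the text. A production extractor would likely need a more robust scoring rule than this character count.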
[0034] In one or more embodiments, in S101 (obtaining the HTML source code of the policy web page), the HTML source code of the policy web page includes, but is not limited to, text, pictures, and attachment download links.
[0035] The HTML source code of the web page is obtained by accessing its URL, which includes escaping the URL of the web page and then accessing the escaped URL.
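The escape-then-access step can be sketched as follows, using only the Python standard library. The function names `escape_url` and `fetch_html`, the choice of characters left unescaped, the User-Agent header, and the UTF-8 fallback are all assumptions made for illustration; the source does not specify them.

```python
from urllib.parse import quote, urlsplit, urlunsplit
from urllib.request import Request, urlopen


def escape_url(url: str) -> str:
    """Percent-encode unsafe characters in the path and query of a URL.

    '%' is kept as a safe character so that an already escaped URL is not
    escaped twice. The host name is left untouched (no IDNA handling).
    """
    parts = urlsplit(url)
    path = quote(parts.path, safe="/%")
    query = quote(parts.query, safe="=&%")
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Access the escaped URL and return the page's HTML source as text."""
    request = Request(escape_url(url), headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Use the charset declared in the HTTP headers; assume UTF-8 otherwise.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Escaping matters for policy sites whose URLs contain spaces or non-ASCII characters (for example Chinese path segments), which `urlopen` would otherwise reject.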
[0036] As one or more embodiments, after the step of obtaining the...
Embodiment 2
[0066] This embodiment provides a system for extracting the text of policy web pages.
[0067] The policy web page text extraction system includes:
[0068] a source code acquisition module configured to acquire the HTML source code of the policy web page;
[0069] a web page text location acquisition module configured to obtain the location of the web page text according to the HTML source code of the policy web page;
[0070] an output module configured to obtain the HTML source code corresponding to the text according to the location of the web page text, and to output the HTML source code corresponding to the text.
[0071] It should be noted that the above source code acquisition module, web page text location acquisition module, and output module correspond to steps S101 to S103 in Embodiment 1, and the examples and application scenarios implemented by the above modules and the corresponding steps are the sa...
Embodiment 3
[0075] This embodiment also provides an electronic device, including one or more processors, one or more memories, and one or more computer programs, wherein the processor is connected to the memory and the one or more computer programs are stored in the memory. When the electronic device is running, the processor executes the one or more computer programs stored in the memory, so that the electronic device executes the method described in Embodiment 1 above.
[0076] It should be understood that in this embodiment, the processor can be a central processing unit (CPU), or it can be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processo...
