Information extraction method and terminal for news webpages

An information extraction and webpage technology, applied in network data retrieval, network data indexing, and other database retrieval. It addresses the problems that reduce the efficiency and accuracy of information extraction from news webpages, and improves both the efficiency and the accuracy of extraction.

Pending Publication Date: 2022-04-12
XIAMEN MEIYA PICO INFORMATION

AI Technical Summary

Problems solved by technology

[0003] At present, the required content attributes are usually extracted by analyzing the elements of each website individually. This approach takes a lot of manpower and time, and the analysis and extraction must be redone whenever a website is revised, which greatly reduces the efficiency and accuracy of information extraction from news webpages.



Examples


Embodiment 1

[0092] Referring to Figures 1 and 3-5, the information extraction method for a news webpage of this embodiment comprises:

[0093] S1. Obtain the HTML source code of the news webpage and perform a first preprocessing on it to obtain the first-preprocessed HTML source code, including:

[0094] S11. Obtain the HTML source code of the news web page;

[0095] In another optional implementation, a link to the news webpage is obtained instead;

[0096] If a link to the news webpage is obtained, the HTML source code corresponding to the link is downloaded automatically;

[0097] S12. Obtain preset keywords, the source code of the first preset tag and the source code of the first preset sub-tag;

[0098] The preset keywords include "disclaimer" and "advertisement service";

[0099] S13. Screen the HTML source code for text-level tags or text containing the preset keywords to obtain the screened HTML source code;

[0...
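Steps S11 to S13 can be sketched roughly as follows. This is a minimal illustration in Python, not the patented implementation: the names `obtain_html`, `screen_keywords`, and `TEXT_LEVEL_TAGS` are hypothetical, the list of text-level tags is an assumption, and "screening" is read here as removing the matching elements. Only leaf elements (those with no nested tags) are handled.

```python
import re
from typing import Optional, Sequence
from urllib.request import urlopen

# Keywords named in paragraph [0098].
PRESET_KEYWORDS = ("disclaimer", "advertisement service")
# Assumption: which tags count as "text-level" is not given in the source.
TEXT_LEVEL_TAGS = ("p", "span", "a", "li", "em", "strong")

# Matches a text-level element that contains only text (no nested tags).
_LEAF = re.compile(
    r"<(%s)\b[^>]*>([^<]*)</\1\s*>" % "|".join(TEXT_LEVEL_TAGS),
    re.IGNORECASE,
)


def obtain_html(source: Optional[str] = None, link: Optional[str] = None,
                timeout: float = 10.0) -> str:
    """S11: return the given source, or download it when only a link is given."""
    if source is not None:
        return source
    if link is None:
        raise ValueError("either source or link is required")
    with urlopen(link, timeout=timeout) as resp:
        raw = resp.read()
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


def screen_keywords(html: str,
                    keywords: Sequence[str] = PRESET_KEYWORDS) -> str:
    """S13 (assumed reading): drop leaf elements whose text contains a keyword."""
    lowered = [k.lower() for k in keywords]

    def drop_if_matched(match: "re.Match[str]") -> str:
        text = match.group(2).lower()
        return "" if any(k in text for k in lowered) else match.group(0)

    return _LEAF.sub(drop_if_matched, html)
```

A regular expression keeps the sketch dependency-free; a real page with nested or malformed markup would need a proper HTML parser instead.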

Embodiment 2

[0150] Referring to Figure 2, an information extraction terminal for a news webpage comprises a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, each step of the news webpage information extraction method of Embodiment 1 is implemented.

[0151] In summary, the information extraction method and terminal for a news webpage provided by the present invention obtain the HTML source code of the news webpage and perform a first preprocessing on it to obtain the first-preprocessed HTML source code. Information is then extracted from the first-preprocessed HTML source code according to the preset XPATH rule to obtain the webpage title, the published title, the published author and the published time. This realizes automatic information extraction, and can quickly and accurately extract the title of the webpage in the news webpage, Publishing th...
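The XPATH-based extraction of the four fields can be illustrated with a small sketch. The rules below are hypothetical, since the preset XPATH rule itself is not given here, and `extract_fields`, `PRESET_RULES`, and the class names `author` and `publish-time` are assumptions. The sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which requires well-formed markup; real news pages would likely need a lenient HTML parser.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical rules: one XPath expression per field named in the text.
PRESET_RULES = {
    "page_title": ".//head/title",
    "published_title": ".//h1",
    "published_author": ".//*[@class='author']",
    "published_time": ".//*[@class='publish-time']",
}


def extract_fields(source: str,
                   rules: Dict[str, str] = PRESET_RULES) -> Dict[str, Optional[str]]:
    """Apply each rule and return the stripped text of the first match."""
    root = ET.fromstring(source)  # assumes well-formed (XHTML-like) input
    fields: Dict[str, Optional[str]] = {}
    for name, path in rules.items():
        node = root.find(path)
        fields[name] = "".join(node.itertext()).strip() if node is not None else None
    return fields


# Made-up sample page for illustration only.
SAMPLE_PAGE = """<html><head><title>Example News - Site</title></head>
<body><h1>Example News</h1>
<span class="author">Jane Doe</span>
<span class="publish-time">2022-04-12 08:00</span>
<p>Body text.</p></body></html>"""

fields = extract_fields(SAMPLE_PAGE)
```

Keeping the rules in a dictionary means they can be changed per site without touching the extraction function.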


Abstract

The invention discloses a news webpage information extraction method and a terminal. The method comprises: obtaining the HTML (Hypertext Markup Language) source code of a news webpage and performing a first preprocessing on it to obtain the first-preprocessed HTML source code; extracting information from the first-preprocessed HTML source code according to a preset XPATH rule to obtain the webpage title, the published title, the published author and the published time; performing a second preprocessing on the first-preprocessed HTML source code to obtain the second-preprocessed HTML source code; and extracting information from the second-preprocessed HTML source code using a text density formula and a symbol density formula to obtain the published content. Automatic extraction of news webpage information is thereby realized. The first preprocessing filters useless information out of the HTML source code, which improves extraction efficiency and the user experience. The density-based step further improves the accuracy of extracting the published content, thereby improving the efficiency and accuracy of extracting information from news webpages.

Description

Technical field

[0001] The invention relates to the technical field of data collection, in particular to an information extraction method and terminal for news webpages.

Background

[0002] With the rapid development of the Internet, hundreds of millions of pieces of information are disseminated online every day, and the Internet, especially its news sites, has become an important way to follow social hotspots and current events around the world. The huge amount of information on news sites makes it difficult to find the information one cares about accurately and quickly. Different news sites also have different page structures and layouts, and often contain useless commercial information such as advertisements that interferes with reading. Extracting the needed information from webpages has therefore become an issue of public concern.

[0003] At present, it is usually targeted to extract and analyze the required c...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F16/951
Inventor: 林彬陈强李火泉徐晓文
Owner: XIAMEN MEIYA PICO INFORMATION