News key information extraction method and system

A key information extraction technology, applied to news key information extraction methods and systems. It addresses problems with existing methods, such as lack of generality, poor real-time performance, failure to meet extraction requirements, and over-complication of simple problems, and achieves low resource consumption, high accuracy, and strong practicability and robustness.

Inactive Publication Date: 2016-10-12
CHINA INTERNET NETWORK INFORMATION CENTER

AI Technical Summary

Problems solved by technology

[0015] To sum up, the methods mentioned above are either outdated or inefficient, or they make simple problems complicated ...

Examples

Embodiment Construction

[0063] The present invention proposes a method for extracting key news information, named newsExtractor. The method comprises four modules that extract the title, time, source, and body text of a news web page. The overall process is shown in Figure 3.

[0064] 1. Preprocessing

[0065] Preprocessing mainly removes noise that is obviously not body content and special HTML character entities, simplifies the HTML tags, and reduces the workload of later processing. The method uses the third-party open-source tool Jsoup (http://jsoup.org/) to assist with this step. Preprocessing covers the following aspects:

[0066] 1) Remove useless tag pairs. The source code of a web page is very cluttered and contains many scripting-language tag pairs and user-interaction tag pairs. We first remove these tag pairs, which obviously do not contain body content. The ...
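The tag-pair removal described in [0066] can be sketched as follows. This is a minimal illustration and not the patented implementation: it uses Python's standard `re` module in place of Jsoup, and the tag names in `NOISE_TAGS` and the function name `strip_noise` are assumptions, since the extracted text does not preserve which tags the method actually removes.

```python
import re

# Assumption: illustrative tag names only. The source text does not
# preserve the actual list of tags removed by the method.
NOISE_TAGS = ("script", "style", "form")


def strip_noise(html: str) -> str:
    """Remove whole tag pairs (opening tag, content, closing tag)
    that cannot contain news body text."""
    for tag in NOISE_TAGS:
        # Non-greedy match from <tag ...> to the nearest </tag>.
        pattern = rf"<{tag}\b[^>]*>.*?</{tag}\s*>"
        html = re.sub(pattern, "", html, flags=re.DOTALL | re.IGNORECASE)
    return html


sample = (
    "<p>Hello</p>"
    "<script type='text/javascript'>var a = 1;</script>"
    "<FORM><input></FORM>"
    "<p>World</p>"
)
cleaned = strip_noise(sample)
```

A regular expression is enough for a sketch, but it does not handle nested or malformed tags; a real HTML parser such as Jsoup is the more robust choice for production pages.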

Abstract

The invention discloses a method and system for extracting key information from news. Preprocessing converts a web page into a set of line numbers and text. Because the sentence with the most words is very likely to occur in the news body, the start and end points of the body are found by searching from the middle of the body outward toward both ends, thereby extracting the body. The title is extracted with a longest common substring algorithm. A regular expression is built to extract the time, with line numbers as an auxiliary check. The source is extracted according to its format features, again with line numbers as an aid. The extracted title, time, source, and body are then written to a local file in that order, separated by line breaks. The system achieves relatively high accuracy, does not depend on any particular web page template, and is practical and robust. The method also has low complexity, produces accurate results, and consumes few resources.
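The body and title steps in the abstract can be sketched as below. This is an illustration under stated assumptions, not the patented procedure: the longest line (by character count) stands in for "the sentence with the most words", the stopping rule based on a line-number gap (`max_gap`) is hypothetical because the abstract does not state one, and the abstract does not say which two strings the longest common substring is computed over (the usage example assumes a page title string and a candidate line). The names `extract_body` and `longest_common_substring` are also illustrative.

```python
from typing import List, Tuple

Line = Tuple[int, str]  # (line number in the preprocessed page, text)


def extract_body(lines: List[Line], max_gap: int = 2) -> List[Line]:
    """Grow the body outward from the longest line.

    Assumption: a neighbouring line belongs to the body when its line
    number is within max_gap of the previously accepted line.
    """
    if not lines:
        return []
    # Seed: index of the longest line, assumed to lie inside the body.
    seed = max(range(len(lines)), key=lambda i: len(lines[i][1]))
    start = end = seed
    while start > 0 and lines[start][0] - lines[start - 1][0] <= max_gap:
        start -= 1
    while end < len(lines) - 1 and lines[end + 1][0] - lines[end][0] <= max_gap:
        end += 1
    return lines[start:end + 1]


def longest_common_substring(a: str, b: str) -> str:
    """Dynamic-programming longest common substring of a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


page = [
    (3, "Home | Sports"),
    (10, "City opens new library"),
    (20, "The first paragraph of the story goes here."),
    (21, "The second paragraph is the longest line in this sample page by far."),
    (22, "Third paragraph."),
    (40, "Copyright notice"),
]
body = extract_body(page)
title = longest_common_substring(
    "City opens new library - Example News", "City opens new library"
)
```

In this sample the navigation, headline, and footer lines are excluded from the body because their line numbers are far from the seed line, and the title is the part shared by the two strings.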

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, and in particular to a method and system for extracting key news information.

Background art

[0002] News, as a major source of information for people, has developed from a single paper medium into a coexistence of multiple media dominated by Internet media. Because information exchange on the Internet is not limited by space, updates quickly, and costs little, the Internet has become the most powerful tool for news dissemination.

[0003] However, current news web pages contain no shortage of irrelevant advertisements and links (collectively referred to as noise), which interfere with the user's reading experience, as shown in Figure 1.

[0004] Secondly, although several large news portals currently push news through their own mobile or PC apps, users still obtain news information mostly through search...

Application Information

IPC(8): G06F17/30
CPC: G06F16/986
Inventor: 李晓东, 向菁菁, 耿光刚
Owner: CHINA INTERNET NETWORK INFORMATION CENTER