Universal forum text extraction method

A text extraction method for general forum web pages. It addresses the inability of existing approaches to extract useful information efficiently and universally, so that forum information can be better utilized.

Inactive Publication Date: 2017-10-10
NORTHEASTERN UNIV
Cites: 7; cited by: 2

AI Technical Summary

Problems solved by technology

[0004] The present invention provides a general forum text extraction method to solve the problem that the pr


Embodiment Construction

[0037] To make the technical scheme of the present invention clearer, its specific implementation is described in detail below with reference to the flow chart of the general forum text extraction method shown in Figure 1.

[0038] The general forum text extraction method of the present invention comprises the following steps:

[0039] a. Crawl data: crawl all the information of the website, that is, fetch the complete HTML code of the website, detect the encoding of the web page, and uniformly convert it to UTF-8 for subsequent processing;

[0040] b. Clean data: working from the UTF-8 encoded data, use BeautifulSoup to parse the HTML tag types and obtain the DOM tree of the web page, as shown in Figure 2; extract the title information and the content of the div tags that contain publication time information; filter out useless information; then classify the extracted information and generate a list;

[0041] c. Format information: ...
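
Steps a and b can be illustrated with a minimal Python sketch. It is not the patented implementation, and several parts are assumptions made for illustration: the candidate encoding list, the timestamp pattern `TIME_RE`, the helper names `decode_page`, `ForumPageParser` and `extract_title_and_posts`, and the heuristic of keeping the outermost div whose text contains exactly one timestamp. To stay dependency-free it uses the standard library `html.parser` in place of BeautifulSoup, and a try-in-order decode in place of a real encoding detector.

```python
import re
from html.parser import HTMLParser

# Assumed candidates, tried in order. A real crawler would more likely use an
# encoding detector or the charset declared in the HTTP headers.
CANDIDATE_ENCODINGS = ("utf-8", "gb18030")

# Assumed timestamp shape, e.g. "2017-10-10 08:30"; real forums vary widely.
TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{2}")


def decode_page(raw: bytes) -> str:
    """Step a: turn raw page bytes into text; .encode('utf-8') then yields UTF-8."""
    for enc in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")  # last resort


class ForumPageParser(HTMLParser):
    """Step b: collect the <title> text and the divs that carry one timestamp."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.records = []      # text of each selected div, in page order
        self._in_title = False
        self._skip = 0         # depth inside <script>/<style>
        self._stack = []       # per open div: (text chunks, candidate records)

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1
        elif tag == "div":
            self._stack.append(([], []))

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "div" and self._stack:
            chunks, pending = self._stack.pop()
            text = " ".join(chunks)
            hits = len(TIME_RE.findall(text))
            if hits == 1:
                pending = [text]   # this div supersedes any nested candidate
            elif hits == 0:
                pending = []       # no time information: treated as noise
            if self._stack:
                self._stack[-1][0].extend(chunks)
                self._stack[-1][1].extend(pending)
            else:
                self.records.extend(pending)

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip:
            return
        if self._in_title:
            self.title += text
        elif self._stack:
            self._stack[-1][0].append(text)


def extract_title_and_posts(raw: bytes):
    parser = ForumPageParser()
    parser.feed(decode_page(raw))
    parser.close()
    while parser._stack:           # tolerate unclosed <div> tags
        parser.handle_endtag("div")
    return parser.title, parser.records
```

A div with several timestamps is treated as a container and only passes its nested candidates upward, so a page holding a single post would yield its outermost div, navigation text included. That is one reason a real implementation would still need the noise-filtering part of step b.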


Abstract

The invention relates to a universal forum text extraction method. The method comprises the following steps: the complete HTML code of a website is extracted, the encoding of the web page is detected, and the page is uniformly converted to UTF-8; the HTML tag types are parsed to obtain the DOM tree of the web page, the title information and the content of div tags containing publication time information are extracted, and, after useless information is filtered out, the extracted information is classified to generate a list; the length of the list is calculated, and the information is classified using time as a marker and output in a formatted form. The extraction method is highly general, can be applied to most forums, and can accurately extract the data fields for main posts, replies, titles and posting times and output them in a formatted form, so that forum information is better utilized.
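
The final step, classifying the list with time as a marker and formatting the output, is only summarized here, so the following Python sketch is a guess at one possible shape rather than the method itself. The function name `format_thread`, the timestamp pattern, the output field names, and the rule that the first dated item is the main post and the rest are replies are all assumptions made for illustration.

```python
import json
import re

# Assumed timestamp shape, e.g. "2017-10-10 08:30"; real forums vary widely.
TIME_RE = re.compile(r"\d{4}-\d{1,2}-\d{1,2}\s+\d{1,2}:\d{2}")


def format_thread(title, records):
    """Split each record at its timestamp and label it as main post or reply."""
    posts = []
    for text in records:
        match = TIME_RE.search(text)
        if match is None:
            continue  # no time marker: treated as noise and dropped
        rest = text[:match.start()] + " " + text[match.end():]
        posts.append({"time": match.group(0), "content": " ".join(rest.split())})
    return {
        "title": title,
        "count": len(posts),                       # length of the cleaned list
        "main_post": posts[0] if posts else None,  # assumption: first dated item
        "replies": posts[1:],
    }
```

Calling `json.dumps(format_thread(title, records), ensure_ascii=False, indent=2)` would give one formatted output. JSON is used only because it is convenient, not because the abstract names an output format.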

Description

Technical field

[0001] The invention relates to the technical field of network information processing, in particular to a general forum text extraction method.

Background technique

[0002] With the rapid development of the Internet, the amount of data on forum web pages has grown steadily, gathering human knowledge and reflecting social hotspots. Effectively mining the valuable information in forum web pages makes fuller use of that information and improves the usefulness of web page data. While forums contain a lot of valuable information, they also contain a lot of noise, and because different forums structure their pages differently, it is difficult to find a general method for extracting useful information from them.

[0003] If a crawling algorithm is designed for a certain type of website according to its specific tags and attributes, efficient and universal extraction cannot be achieved. The current general news website crawling alg...

Application Information

IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 张杰, 李永立, 管智慧, 赖裕妮
Owner NORTHEASTERN UNIV