Webpage label position-based text formatting and cleaning method

A web page tag and text technology, applied to text database indexing, unstructured text data retrieval, and special data processing applications, with the effect of improving the accuracy of web page text parsing.

Pending Publication Date: 2020-08-28
安徽慧医信息科技有限公司

AI Technical Summary

Problems solved by technology

[0003] Existing webpage text analysis techniques require additional customized cleaning rules to be written for each webpage. When applied to webpages with a lot of content, text extraction is inefficient and some content is lost. When tags (especially tables) are parsed, their content can be transposed, so the result differs considerably from the original webpage format.



Embodiment Construction

[0033] The preferred embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings, so that the advantages and features of the present invention can be more easily understood by those skilled in the art, so as to define the protection scope of the present invention more clearly.

[0034] Referring to Figure 1, the embodiment of the present invention includes:

[0035] A text formatting and cleaning method based on the position of the web page label, comprising the following steps:

[0036] S1: traverse all the tags of the entire webpage, and record each tag name, tag position, text content, and document number in the original table; the specific steps include:

[0037] S101: read the network address (URL, uniform resource locator) of the original webpage, which can be either the link address of the website or a downloaded HTML file, and convert the content of the webpage into a tree-structured document object;

[0038] S102: T...
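
As a rough illustration of steps S1 and S101, the sketch below walks every tag of an HTML string and records one row per tag. It is not the patented implementation: the class name `TagRecorder`, the function `build_original_table`, the row fields, and the use of a running tag index as the "position" are all assumptions made for this example. It also uses Python's event-based `html.parser` and a stack of open tags instead of a full tree-structured document object, and it takes the HTML as a string rather than fetching a URL.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagRecorder(HTMLParser):
    """Hypothetical helper: collects one row per tag for the 'original table'."""

    def __init__(self, doc_id):
        super().__init__()
        self.doc_id = doc_id
        self.rows = []    # assumed layout: doc_id, tag, position, text
        self._open = []   # rows of tags that are currently open

    def handle_starttag(self, tag, attrs):
        row = {"doc_id": self.doc_id, "tag": tag,
               "position": len(self.rows), "text": ""}
        self.rows.append(row)
        if tag not in VOID_TAGS:
            self._open.append(row)

    def handle_endtag(self, tag):
        # Close the most recent matching open tag, tolerating sloppy HTML.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._open:
            row = self._open[-1]   # text belongs to the innermost open tag
            row["text"] = (row["text"] + " " + text).strip()


def build_original_table(html, doc_id):
    """Return the list of tag rows for one document."""
    parser = TagRecorder(doc_id)
    parser.feed(html)
    parser.close()
    return parser.rows
```

Each row keeps the document number alongside the tag position, which is the information later steps would need in order to match rows back to the same document.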


Abstract

The invention discloses a text formatting and cleaning method based on the position of webpage tags, which comprises the following steps: S1, traversing all tags of the whole webpage, and recording each tag name, the position of each tag, the text content, and a document number in an original table; S2, if the webpage contains table data, dynamically traversing the table data in the webpage by rows and columns, and extracting key information to obtain a replacement table; S3, matching the original table with the replacement table by document number, deleting or updating the data at table positions in the original table that relate to the replacement table, and outputting the cleaned data to a cleaning table; and S4, if the webpage contains no table data, extracting the text corresponding to the DIV elements in the webpage, extracting the document number and the text content into a data frame, persisting them to the original table, and merging and inserting similar documents into the cleaning table through grouping and aggregation to complete the cleaning of the webpage text. The invention can improve the accuracy of webpage text analysis.
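
The matching in S3 and the grouping in S4 can be sketched with plain Python data structures. The abstract does not say which keys or rules are used, so everything specific here is an assumption for illustration: rows are dictionaries with `doc_id`, `tag`, `position`, and `text` fields; a replacement row is matched on `(doc_id, position)`; table-related tags in a document that has replacements are deleted unless they are replaced; and the function names `apply_replacements` and `merge_by_document` are hypothetical.

```python
from collections import defaultdict

# Assumed set of tags that count as "table positions" in the original table.
TABLE_TAGS = {"table", "thead", "tbody", "tr", "th", "td"}


def apply_replacements(original, replacement):
    """S3 sketch: update or delete table rows using the replacement table."""
    by_key = {(r["doc_id"], r["position"]): r["text"] for r in replacement}
    docs_with_tables = {r["doc_id"] for r in replacement}
    cleaned = []
    for row in original:
        key = (row["doc_id"], row["position"])
        if key in by_key:
            # Update: keep the row but use the formatted replacement text.
            cleaned.append({**row, "text": by_key[key]})
        elif row["doc_id"] in docs_with_tables and row["tag"] in TABLE_TAGS:
            # Delete: the raw cell text is superseded by the replacement.
            continue
        else:
            cleaned.append(row)
    return cleaned


def merge_by_document(rows):
    """S4 sketch: group rows by document number and join their text."""
    groups = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["position"]):
        if row["text"]:
            groups[row["doc_id"]].append(row["text"])
    return {doc_id: "\n".join(parts) for doc_id, parts in groups.items()}
```

Joining text in position order is one simple way to keep the cleaned output in the same order as the original page.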

Description

Technical field

[0001] The invention relates to the technical field of web page text extraction, in particular to a text formatting and cleaning method based on the position of web page tags.

Background technique

[0002] Web pages are written in the HTML markup language, and each tag represents a particular format. For example, the table tag is a table (similar to a table in Excel), tr is a row in the table, and td is a cell in the row. Web page text analysis, also called web page text content extraction, is an extraction technique that removes the tags from the original web page while retaining the original format. The extracted and cleaned text can be used in scenarios such as information display, data integration, data persistence, and data analysis. The technique mainly draws on text processing techniques such as crawlers, regular expressions, and rule extraction.

[0003] Existing webpage text analysis technology needs ...
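
To make the table, tr, and td relationship from [0002] concrete, here is a minimal sketch that reads a table row by row and cell by cell so that the row and column layout is preserved. It relies only on Python's standard `html.parser`; the class `TableGrid` and the function `table_to_rows` are names invented for this example, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableGrid(HTMLParser):
    """Hypothetical helper: collects table cells as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])          # each tr starts a new row
        elif tag in ("td", "th") and self.rows:
            self._cell = []               # each td/th starts a new cell

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join(piece for piece in self._cell if piece)
            self.rows[-1].append(text)
            self._cell = None


def table_to_rows(html):
    """Return the table content as rows of cell text."""
    parser = TableGrid()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Keeping cells grouped under their own row avoids the transposition problem that arises when all cell text is flattened into one sequence.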


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/31, G06F16/215
CPC: G06F16/215, G06F16/313
Inventor: 沈亮曾华凌毛磊
Owner: 安徽慧医信息科技有限公司