Webpage content processing method and apparatus

A webpage content processing technology, applied in the field of data processing, that addresses problems such as low versatility, reduced data availability, and results that are hard to sort and optimize. Its effects are to optimize existing processing technology, enrich the description information of webpage content, and meet the demand for personalized processing.

Active Publication Date: 2017-02-22
BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD

AI Technical Summary

Problems solved by technology

[0004] The main defect of the prior art is that the HTML specification is very permissive, and the pages of many websites contain, to some degree, structures that do not conform to it. In this case, relying on the HTML structure alone causes many errors, which reduces the accuracy of the structured data. At the same time, a tree data structure is relatively complex to store and use and is not conducive to sorting and optimization, which further reduces data availability. In addition, existing page structuring methods can only handle web pages of certain styles, so their versatility is low.

Examples

Embodiment 1

[0029] Figure 1 is a flow chart of a method for processing webpage content provided by Embodiment 1 of the present invention. The method of this embodiment can be executed by a device for processing webpage content. The device can be implemented in hardware and/or software and can generally be integrated in a server. The method of this embodiment specifically includes:

[0030] 110. Read the text data of the HTML structure corresponding to the webpage to be processed.

[0031] In the technical solution of this embodiment, the text content of the webpage to be processed is processed to finally generate title text pairs, so that text content must be read first. Since a webpage is composed of HTML-structured hypertext, this embodiment defines the text content of the webpage to be processed as HTML-structured text data.

[0032] Wherein, those skilled in the art may...

Embodiment 2

[0044] Figure 2a is a flow chart of a method for processing webpage content provided by Embodiment 2 of the present invention. This embodiment is optimized on the basis of the above embodiments. In this embodiment, converting each paragraph in the paragraph list into a title text pair, according to the content with title attributes in each paragraph, is specifically optimized as: extracting a paragraph included in the paragraph list as a target paragraph; identifying the content with title attributes included in the target paragraph as a title; using the content in the target paragraph other than the title as the paragraph text; and forming the title text pair for the target paragraph by taking the title and the paragraph text as independent wholes.
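
The conversion of a target paragraph into a title text pair can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Fragment` representation, the `TitleTextPair` class, and the function names are hypothetical, and how a fragment is judged to have "title attributes" is not specified in this excerpt, so the flag is simply supplied by the caller.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical representation: a paragraph is a list of (text, has_title_attribute)
# fragments. How the flag is derived is an assumption left to the caller.
Fragment = Tuple[str, bool]


@dataclass
class TitleTextPair:
    title: str  # content with title attributes, kept as one independent whole
    text: str   # remaining content of the paragraph, kept as one independent whole


def paragraph_to_pair(target_paragraph: List[Fragment]) -> TitleTextPair:
    """Split one target paragraph into its title and its paragraph text."""
    title_parts = [t for t, is_title in target_paragraph if is_title]
    text_parts = [t for t, is_title in target_paragraph if not is_title]
    return TitleTextPair(title=" ".join(title_parts), text=" ".join(text_parts))


def convert_paragraph_list(paragraphs: List[List[Fragment]]) -> List[TitleTextPair]:
    """Take each paragraph in turn as the target paragraph and convert it."""
    return [paragraph_to_pair(p) for p in paragraphs]
```

A paragraph with no title fragment yields a pair with an empty title, and a paragraph that is only a title yields a pair with empty text.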

[0045] Correspondingly, the method in this embodiment specifically includes:

[0046] 210. Read HTML structure text data corresponding to the webpage to be processed.

[0047] 220. Using a paragraph as a unit, perform structural divisi...

Embodiment 3

[0062] Figure 3a is a flow chart of a method for processing webpage content provided by Embodiment 3 of the present invention. This embodiment is optimized on the basis of the above embodiments. In this embodiment, after each paragraph in the paragraph list is converted into a title text pair according to the content with title attributes in each paragraph, the method preferably further includes: if neither of two adjacent title text pairs includes paragraph text, and the title in the preceding title text pair includes only numbers, merging the two adjacent title text pairs to generate a new title text pair;

[0063] In addition, after converting each paragraph in the paragraph list into a title text pair according to the content with title attributes in each paragraph, the method preferably further includes: if two adjacent title text pairs, the previous paragr...
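
The first merging rule of this embodiment (two adjacent pairs without paragraph text, the earlier one titled only with numbers) can be sketched as below. The tuple representation, the function names, and the way the two titles are joined (number, a space, then the next title) are assumptions for illustration; the excerpt does not say how the new title is composed. "Only includes numbers" is interpreted here as a title consisting solely of digits. The second rule is cut off in this excerpt and is not sketched.

```python
import re
from typing import List, Tuple

Pair = Tuple[str, str]  # hypothetical representation: (title, paragraph text)


def _is_numeric_title(title: str) -> bool:
    # Assumption: "only includes numbers" means the title is digits only.
    return re.fullmatch(r"\d+", title.strip()) is not None


def merge_numeric_titles(pairs: List[Pair]) -> List[Pair]:
    """Merge a text-less, number-only title pair with the text-less pair after it."""
    merged: List[Pair] = []
    i = 0
    while i < len(pairs):
        title, text = pairs[i]
        if i + 1 < len(pairs):
            next_title, next_text = pairs[i + 1]
            if not text and not next_text and _is_numeric_title(title):
                # Assumed merge: join the number and the following title.
                merged.append((f"{title.strip()} {next_title}".strip(), ""))
                i += 2
                continue
        merged.append((title, text))
        i += 1
    return merged
```

For example, `("1", "")` followed by `("Introduction", "")` becomes `("1 Introduction", "")`. As a simplification, a merged pair is not re-checked against the pair that follows it.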

Abstract

Embodiments of the present invention disclose a webpage content processing method and apparatus. The method comprises: reading HTML structure text data corresponding to a to-be-processed webpage; performing structure division on the HTML structure text data, with the paragraph as the unit, to generate a paragraph list; and, according to content with a title attribute in each paragraph in the paragraph list, converting each paragraph in the paragraph list into a title text pair. With this technical scheme, the titles and paragraph texts in each paragraph of the webpage text are identified and organized into title text pairs, and the identified titles can then be used to further describe the webpage content. This enriches the description information of the webpage content, greatly simplifies webpage analysis, optimizes existing webpage content processing technology, and satisfies the growing demand for personalized and convenient webpage content processing.
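
The three steps in the abstract (read the HTML text, divide it into a flat paragraph list, convert each paragraph into a title text pair) might look like the sketch below, built on Python's standard `html.parser`. The choice of tags is an assumption, not taken from the patent: `BLOCK_TAGS` stands in for paragraph boundaries and `TITLE_TAGS` stands in for "content with a title attribute". The class and function names are likewise hypothetical.

```python
from html.parser import HTMLParser

# Assumed paragraph boundaries and assumed markers of "title attribute" content.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
TITLE_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6", "b", "strong"}


class ParagraphSplitter(HTMLParser):
    """Flattens HTML into a paragraph list; each paragraph is a list of (is_title, text) runs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._current = []
        self._title_depth = 0

    def _flush(self):
        # Close the paragraph being built, if it has any content.
        if self._current:
            self.paragraphs.append(self._current)
            self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS or tag == "br":
            self._flush()
        if tag in TITLE_TAGS:
            self._title_depth += 1

    def handle_endtag(self, tag):
        # Guard against stray closing tags in non-conforming pages.
        if tag in TITLE_TAGS and self._title_depth > 0:
            self._title_depth -= 1
        if tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        text = " ".join(data.split())  # normalize whitespace
        if text:
            self._current.append((self._title_depth > 0, text))

    def close(self):
        super().close()
        self._flush()


def to_title_text_pairs(html):
    """Return a flat list of (title, paragraph text) tuples for the given HTML text."""
    splitter = ParagraphSplitter()
    splitter.feed(html)
    splitter.close()
    pairs = []
    for runs in splitter.paragraphs:
        title = " ".join(t for is_title, t in runs if is_title)
        text = " ".join(t for is_title, t in runs if not is_title)
        pairs.append((title, text))
    return pairs
```

Because the splitter reacts to individual tags rather than building a tree, an unclosed `<p>` still ends at the next block tag, and the result is a flat list rather than a tree structure.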

Description

Technical field

[0001] The embodiments of the present invention relate to data processing technologies, and in particular, to a method and device for processing web page content.

Background technique

[0002] With the development of network information technology, webpage information such as websites, forums, and blogs is getting larger and larger, and technologies such as search engines, content analysis, and public opinion analysis are all for analyzing and processing such information. Therefore, how to analyze and structurally process massive website pages has become an important problem that people need to solve urgently.

[0003] The existing webpage structural processing method only starts from the HTML (HyperText Markup Language) structure of the webpage, arranges the text information in layers, and finally produces and stores the results in the form of a tree data structure.

[0004] The main defect of the prior art is that the specification of HTML is very free, an...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9577
Inventors: 邵睿, 徐国强, 尹存祥, 骆彬, 钟辉强, 沈剑平
Owner: BAIDU ONLINE NETWORK TECH (BEIJING) CO LTD