Method and apparatus for page resource structuring

A page resource structuring technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problems of limited information and unstructured content, and achieves wide-ranging, comprehensive content coverage.

Inactive Publication Date: 2016-10-05
上海世纪出版股份有限公司

AI Technical Summary

Problems solved by technology

[0005] The present invention aims to provide a method and device for structuring page resources, to solve the problems of limited information and unstructured content in traditional data collection.

Examples

Embodiment Construction

[0030] The present invention is described in detail below with reference to the accompanying drawings and in combination with embodiments. An embodiment is described first (see Figure 1); it includes the following steps:

[0031] a. Fetch the content of the webpage and obtain the HTML file corresponding to the webpage;

[0032] b. Define a Schema file to standardize the XML result document generated after structuring;

[0033] c. Create a label mapping file that maps HTML labels, text attributes, and paragraph attributes to the labels defined by the Schema;

[0034] d. Perform content identification according to the mapping relationship and generate the corresponding structured document, at which point the page resource structuring procedure ends.
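Steps a to d can be illustrated with a minimal Python sketch. It is not the patent's implementation: the mapping table `LABEL_MAP`, the element names (`document`, `title`, `heading`, `paragraph`), and the helper names `MappingParser` and `structure_page` are all hypothetical. Step a is stubbed with a literal HTML string rather than a network fetch, the mapping uses HTML tags only (not text or paragraph attributes), and step b (validation against a Schema file) is omitted because the Python standard library has no XML Schema validator.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical label mapping (step c): HTML tag -> element name assumed to be
# defined in the Schema. Neither the tags nor the names come from the patent.
LABEL_MAP = {"h1": "title", "h2": "heading", "p": "paragraph"}


class MappingParser(HTMLParser):
    """Collects (mapped_label, text) fragments from mapped HTML tags (step d)."""

    def __init__(self):
        super().__init__()
        self.fragments = []   # (label, text) pairs in document order
        self._label = None    # mapped label of the currently open tag, if any
        self._buffer = []     # text pieces seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Only start a fragment when no mapped tag is already open.
        if tag in LABEL_MAP and self._label is None:
            self._label = LABEL_MAP[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._label is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Close the fragment only on the end tag matching the open label.
        if self._label is not None and LABEL_MAP.get(tag) == self._label:
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if text:
                self.fragments.append((self._label, text))
            self._label = None


def structure_page(html_text):
    """Return an XML string built from the mapped fragments of html_text."""
    parser = MappingParser()
    parser.feed(html_text)
    parser.close()
    root = ET.Element("document")
    for label, text in parser.fragments:
        ET.SubElement(root, label).text = text
    return ET.tostring(root, encoding="unicode")


# Step a stand-in: a literal page instead of a fetched one.
SAMPLE_HTML = (
    "<html><body><h1>News</h1><div>skip</div>"
    "<p>Hello <b>world</b>.</p></body></html>"
)
print(structure_page(SAMPLE_HTML))
```

Keeping the mapping in a separate table mirrors the idea of a label mapping file: the recognition code stays fixed while the mapping can be changed per site. Content in unmapped tags, such as the `div` above, is simply dropped in this sketch.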

[0035] As a further optimization of the above technical solution, step a also includes:

[0036] a1. After fetching the content of the webpage, define the address range...

Abstract

The invention provides a method and an apparatus for structuring page resources. The method comprises: creating a web page content capturing module and acquiring the HTML file corresponding to a web page; defining a Schema file to standardize the XML result document generated after structuring; establishing a label mapping file and, according to HTML labels, text properties, and paragraph attributes, building a mapping to the labels defined by the Schema; and performing content identification according to the mapping relationship and generating a corresponding structured document, thereby completing the structuring of the page resource. Conventional web page data acquisition generally covers only web page metadata. By comparison, the method and apparatus can acquire both web page metadata and effective content quickly, intelligently, and accurately, and can fragment and structure the acquired content; the content covered is also more comprehensive and extensive than with the conventional method.

Description

Technical field

[0001] The present invention relates to the field of digital content processing, and in particular to a method and device for structuring page resources.

Background

[0002] When publishing houses build reusable resource libraries, they often need to store already published, finished content. Most of that content does not conform to the new storage format specifications, which makes the finished content difficult to store.

[0003] In traditional technologies, webpage data collection generally covers only webpage metadata, and handling the specific content often requires manual intervention. Compared with traditional processing methods, the method and device can collect webpage metadata and effective content quickly, intelligently, and accurately, and can fragment and structure the collected content, and the content involved is more comprehensive than traditional method...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 施宏俊, 周建宝, 胡大卫, 贾立群, 段学俭, 周怡, 刘懿, 吴弃疾, 翁志轩, 何勇, 杨文华, 谢冬华, 朱丹瑾, 陈力勇, 易英华, 张少杰, 程艳
Owner: 上海世纪出版股份有限公司