Method and device for extracting page information

A technique involving page information and positioning information, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses the low efficiency of extracting page information and achieves efficient and accurate extraction.

Active Publication Date: 2016-05-18
ALIBABA (CHINA) CO LTD

AI Technical Summary

Problems solved by technology

[0004] However, a large number of new webpages are generated on the Internet every day, and it is difficult to extract page information from these newly added webpages using the original preset rules. The traditional method therefore requires the preset rules to be modified frequently, so the efficiency of extracting page information is low.


Examples

Embodiment 1

[0070] Referring to Figure 1A, this embodiment of the present invention provides a method for extracting page information.

[0071] In this embodiment of the present invention, before the page information of the webpage to be processed is extracted, an algorithm library must be preset offline and the transcoding configuration information of the webpage to be processed must be configured. The transcoding configuration information includes the positioning information and data structure type of each service block of the webpage to be processed. The preset algorithm library contains data structure types and their corresponding recognition algorithms. As shown in Figure 1B, the specific process of this offline configuration operation includes:

[0072] S1: Obtain a DOM (Document Object Model) tree of the webpage to be processed, and divide the DOM tree of the webpage to be processed according to business type to obtain each bus...
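
The two offline artifacts described in [0071] can be illustrated with a minimal Python sketch. The source does not specify the format of the positioning information or the set of data structure types, so everything named here is an assumption made for illustration: `ServiceBlockConfig`, the use of an element id as the locator, the type names "list" and "text", the example URL, and the two toy recognition functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical signature: a recognition algorithm takes the source text of
# one service block and returns the extracted items.
RecognitionAlgorithm = Callable[[str], List[str]]


@dataclass(frozen=True)
class ServiceBlockConfig:
    """One entry of the transcoding configuration (names are assumptions)."""
    name: str
    locator: str              # positioning information; assumed to be an element id
    data_structure_type: str  # key into the preset algorithm library


def recognize_list(block_source: str) -> List[str]:
    # Toy algorithm: treat every non-empty line as one list item.
    return [line.strip() for line in block_source.splitlines() if line.strip()]


def recognize_text(block_source: str) -> List[str]:
    # Toy algorithm: collapse all whitespace into a single paragraph.
    return [" ".join(block_source.split())]


# Preset algorithm library: data structure type -> recognition algorithm.
ALGORITHM_LIBRARY: Dict[str, RecognitionAlgorithm] = {
    "list": recognize_list,
    "text": recognize_text,
}

# Transcoding configuration, keyed by webpage address (example URL is made up).
TRANSCODING_CONFIG: Dict[str, List[ServiceBlockConfig]] = {
    "http://example.com/news": [
        ServiceBlockConfig("headlines", "top-news", "list"),
        ServiceBlockConfig("body", "article", "text"),
    ],
}


def algorithm_for(block: ServiceBlockConfig) -> RecognitionAlgorithm:
    """Look up the recognition algorithm by the block's data structure type."""
    return ALGORITHM_LIBRARY[block.data_structure_type]
```

Under these assumptions, supporting a newly added webpage only requires a new entry in `TRANSCODING_CONFIG`; the algorithm library is reused as long as the page's blocks use known data structure types.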

Embodiment 2

[0124] Referring to Figure 2, this embodiment of the present invention provides a device for extracting page information. The device is configured to execute the method for extracting page information provided in Embodiment 1. The device includes:

[0125] The first obtaining module 201 is used to obtain the source code and the Document Object Model (DOM) tree of the webpage to be processed, and to obtain the transcoding configuration information of the webpage to be processed from the server. The transcoding configuration information includes the positioning information and data structure type of each service block of the webpage to be processed;

[0126] When a user browses a webpage through a terminal, the terminal sends a webpage acquisition request to the above-mentioned device. The request carries the address of the webpage and the terminal identifier. The above-mentioned device receives the webpage acquisition request sent ...


Abstract

The invention provides a method and a device for extracting page information. The method comprises the following steps: the source code and DOM tree of a webpage to be processed are obtained; transcoding configuration information of the webpage is obtained from a server, the transcoding configuration information comprising the positioning information and data structure type of each business block of the webpage; the DOM node corresponding to each business block is obtained from the DOM tree according to its positioning information; a recognition algorithm corresponding to the business block is obtained from a preset algorithm library according to the data structure type of the business block; and the page information of the business block is extracted from the webpage source code corresponding to the DOM node, according to that recognition algorithm. The method and the device extract the page information of a webpage efficiently and accurately based on the preset algorithm library and the transcoding configuration information of the webpage, and page information can also be extracted successfully from newly added webpages.
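
The steps in the abstract can be sketched end to end in Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the page is a small well-formed XHTML snippet so that the standard-library `xml.etree.ElementTree` parser can stand in for a real HTML DOM parser, the positioning information is assumed to be an element id, the configuration is an in-memory list rather than a server response, and `extract_list`, `extract_text`, and `extract_page_information` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Assumed sample page; real HTML is often not well-formed XML.
PAGE_SOURCE = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<ul id="top-news"><li>First headline</li><li>Second headline</li></ul>
<div id="article"><p>Body   text.</p></div>
</body></html>"""


def extract_list(node):
    # Hypothetical recognition algorithm for list-like blocks.
    return [(li.text or "").strip() for li in node.iter("li")]


def extract_text(node):
    # Hypothetical recognition algorithm for free-text blocks.
    return [" ".join("".join(node.itertext()).split())]


# Preset algorithm library: data structure type -> recognition algorithm.
ALGORITHMS = {"list": extract_list, "text": extract_text}

# Transcoding configuration (stands in for the copy fetched from the server).
CONFIG = [
    {"locator": "top-news", "type": "list"},
    {"locator": "article", "type": "text"},
]


def extract_page_information(source, config, algorithms):
    dom = ET.fromstring(source)  # build the DOM tree from the source code
    result = {}
    for block in config:
        # Locate the DOM node of the block via its positioning information.
        node = dom.find(f".//*[@id='{block['locator']}']")
        if node is None:
            continue  # block not present on this page
        # Select the recognition algorithm by data structure type.
        recognize = algorithms[block["type"]]
        result[block["locator"]] = recognize(node)
    return result


page_info = extract_page_information(PAGE_SOURCE, CONFIG, ALGORITHMS)
```

In this sketch the navigation block is ignored because it has no entry in the configuration, which mirrors the idea that only configured business blocks contribute page information.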

Description

Technical field

[0001] The present invention relates to the field of Internet and terminal technology, and in particular to a method and device for extracting page information.

Background art

[0002] At present, most webpages developed by websites are only suitable for display on terminals with large screens, such as personal computers. However, with the development of science and technology, terminals with screens of different sizes, such as tablet computers and smartphones, have appeared. To enable these terminals to display webpages normally, it is necessary to extract from the webpages the page information suitable for display on these terminals.

[0003] Currently, traditional methods extract page information based on preset rules. For example, the preset rule may be a preset keyword. When page information is extracted from a webpage, the page information in the webpage is traversed according to the preset keyword, ...

Application Information

IPC(8): G06F17/30
CPC: G06F16/9577
Inventor: 梁捷蔡明唐俊开
Owner: ALIBABA (CHINA) CO LTD