Method and apparatus for extracting information from a website

A website information extraction technology, applied in the field of data processing, that addresses three problems: the increased difficulty of extracting information, the difficulty of obtaining high-value information, and the large amount of interference information.

Active Publication Date: 2017-04-12
ADVANCED NEW TECH CO LTD

AI Technical Summary

Problems solved by technology

The providers of some sites add interference information (such as interference nodes or a large number of advertisements) to the web pages in the site, which increases the difficulty of extracting information from the website.
As a result, data crawled directly from the site contains too much interference information for practical analysis, and it is difficult to obtain effective, high-value information from it.


Examples

Embodiment Construction

[0021] Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

[0022] Those skilled in the art will appreciate that the present application can be implemented as a system, a method, or a computer program product. Accordingly, the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, microcode, etc.), or as a combination of hardware and software. Called a "circuit", "module" or "...

Abstract

The application relates to a method and a device for extracting information from a website. Specifically, one embodiment of the application provides a method for extracting information from a website. The method comprises the steps of: obtaining from the website a set of URL (Uniform Resource Locator) links that have the same depth; parsing the code of the web pages to which the links in the set point, thereby obtaining a tag tree for each of those web pages; overlaying the tag trees of those web pages to construct a grid tree; and classifying the tag nodes within the grid nodes of the grid tree according to classification rules, thereby extracting data from the grid tree.
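The steps named in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the patent's implementation: URL depth is taken to be the number of path segments, a grid node is modeled as the list of tag nodes found at the same position in each page, and the single classification rule used here (text that differs between pages is data, identical text is template or interference) is hypothetical, since the abstract does not state the actual rules. All function and class names are invented for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urlparse

VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def url_depth(url):
    """Assumed definition of depth: number of non-empty path segments."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def group_by_depth(urls):
    """Group links so that each set contains URLs of the same depth."""
    groups = defaultdict(list)
    for url in urls:
        groups[url_depth(url)].append(url)
    return dict(groups)


class Node:
    """One tag node of a page's tag tree."""
    def __init__(self, tag):
        self.tag = tag
        self.text = ""
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree under a synthetic 'root' node."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:  # void elements never get children
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; never pop the root.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data.strip()


def parse_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def overlay(trees):
    """Overlay tag trees position by position into a grid tree.

    Each grid node holds one cell per page (None where a page has no node
    at that position) plus the grid nodes of its children.
    """
    grid = {"cells": trees, "children": []}
    width = max((len(t.children) for t in trees if t is not None), default=0)
    for i in range(width):
        column = [t.children[i] if t is not None and i < len(t.children) else None
                  for t in trees]
        grid["children"].append(overlay(column))
    return grid


def varying_text(grid, path=""):
    """Hypothetical rule: keep grid nodes whose text differs across pages."""
    cells = grid["cells"]
    first = next(c for c in cells if c is not None)
    here = path + "/" + first.tag
    texts = [c.text if c is not None else None for c in cells]
    found = [(here, texts)] if any(texts) and len(set(texts)) > 1 else []
    for child in grid["children"]:
        found.extend(varying_text(child, here))
    return found
```

With two pages that share an advertisement block but differ in a heading, `varying_text(overlay([tree_a, tree_b]))` reports only the heading, because the advertisement text is identical in both cells of its grid node. A positional overlay like this is fragile when pages differ in structure; a real system would need a more tolerant alignment.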

Description

Technical field

[0001] This application relates to data processing, and in particular to a method and device for extracting information from a website.

Background

[0002] With the development of computer technology and data communication technology, the amount of data on the Internet keeps growing, and a site can include various kinds of data such as text, pictures, audio, and video. The core data in a site is often surrounded by a lot of less important information (for example, advertisements). In addition, for some reason, the providers of some sites also add interference information (such as interference nodes or a large number of advertisements) to the web pages in the site, which further increases the difficulty of extracting information from the website. In this environment, the data crawled directly from the site has too much interference information for actual analysis, and it...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
CPC: G06F16/955
Inventor: 刘照星
Owner: ADVANCED NEW TECH CO LTD