Data extraction method and system for mass URLs

A data extraction technology for massive URLs, applied in the fields of network data retrieval, electronic digital data processing, and other database retrieval. It addresses the low efficiency and heavy resource and memory consumption of existing methods, with the effects of reducing occupied space and resolving the memory and resource consumption problems.

Inactive Publication Date: 2017-04-19
PHICOMM (SHANGHAI) CO LTD

AI Technical Summary

Problems solved by technology

As data volumes grow massively, prior-art data extraction methods consume large amounts of resources and memory and run inefficiently, so they cannot support efficient analysis of massive data.



Embodiment Construction

[0030] To illustrate the embodiments of the present invention and the technical solutions of the prior art more clearly, specific implementations of the present invention are described below with reference to the accompanying drawings. The drawings described below show only some embodiments of the present invention; those skilled in the art can derive other drawings and other implementations from them.

[0031] To keep the drawings concise, each drawing shows only schematically the parts related to the present invention and does not represent the actual structure of the product. In addition, where several components in a drawing have the same structure or function, only one of them is drawn or labeled. Herein, "a" means not only "only one" but also "more than one".

[0032] Such as f...


Abstract

The invention discloses a data extraction method for massive URLs. The method comprises the following steps: S10, collecting each piece of text data into a local file pool using a distributed web server framework; S20, uploading the overall text data accumulated in the local file pool to HDFS, the Hadoop cloud distributed file system; and S40, extracting the keywords of the URLs from the overall text data in the cloud distributed file system in a distributed manner using Hive, the Hadoop data warehouse tool. In a big data application scenario, each piece of text data is first collected into the local file pool, the overall text data is then uploaded to the cloud distributed file system, and Hive performs the extraction as a distributed computation. The method offers high efficiency and low resource consumption.
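The three steps in the abstract can be sketched on a single machine to show how the data flows. This is only an illustration under stated assumptions, not the patented implementation: a local directory stands in for the file pool, a merged local file stands in for the upload to HDFS, and plain Python stands in for the distributed Hive job. The abstract does not say how a "keyword" is defined, so the sketch assumes it is the value of a query parameter named `q`. All function names, file names, and URLs are hypothetical.

```python
from collections import Counter
from pathlib import Path
from tempfile import TemporaryDirectory
from urllib.parse import parse_qs, urlsplit


def collect_to_pool(pool_dir: Path, server_name: str, urls: list[str]) -> Path:
    """S10 (sketch): one web server writes its collected URLs into the local file pool."""
    path = pool_dir / f"{server_name}.log"
    path.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return path


def merge_pool(pool_dir: Path, merged: Path) -> Path:
    """S20 (stand-in): accumulate every pool file into one overall file.

    The abstract uploads this overall data to HDFS; here it stays on local disk.
    """
    with merged.open("w", encoding="utf-8") as out:
        for part in sorted(pool_dir.glob("*.log")):
            out.write(part.read_text(encoding="utf-8"))
    return merged


def extract_keywords(merged: Path, key: str = "q") -> Counter:
    """S40 (stand-in): count the values of one query parameter across all URLs.

    Assumption: the keyword is the value of the `key` query parameter.
    """
    counts: Counter = Counter()
    with merged.open(encoding="utf-8") as lines:
        for line in lines:
            url = line.strip()
            if not url:
                continue
            for value in parse_qs(urlsplit(url).query).get(key, []):
                counts[value.lower()] += 1
    return counts


def run_demo() -> Counter:
    """Run the three stand-in steps on a few made-up URLs from two servers."""
    with TemporaryDirectory() as tmp:
        pool = Path(tmp) / "pool"
        pool.mkdir()
        collect_to_pool(pool, "web01", [
            "http://example.com/search?q=router",
            "http://example.com/item?id=7",
        ])
        collect_to_pool(pool, "web02", [
            "http://example.com/search?q=Router&page=2",
            "http://example.com/search?q=wifi+extender",
        ])
        merged = merge_pool(pool, Path(tmp) / "all_urls.txt")
        return extract_keywords(merged)


if __name__ == "__main__":
    for keyword, count in run_demo().most_common():
        print(keyword, count)
```

In an actual Hive deployment, the per-URL parsing step would more likely be written as a query over a table stored in HDFS, for example with Hive's built-in `parse_url` function, so that the work is spread across the cluster rather than run in one process.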

Description

Technical field

[0001] The invention belongs to the technical field of data extraction, and in particular relates to a data extraction method and system for massive URLs.

Background art

[0002] With the rapid development of the Internet, analyzing the patterns and personal habits of users as they use network resources (that is, user behavior analysis) makes it possible to extract and understand users' interests. On the one hand, content can be personalized and pushed to users, providing more proactive and intelligent services for website visitors. On the other hand, discovering users' interests and preferences from their different behaviors makes it possible to optimize the organizational relationship between pages and improve the system architecture of the website, thereby reducing the burden on users of finding information, making operations easier, and saving time and energy.

[0003] At present, when analyzing user behavior, since large websites generally have a large number of online users, the ...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/182, G06F16/1744, G06F16/215, G06F16/955
Inventor: 欧阳涛
Owner: PHICOMM (SHANGHAI) CO LTD