84 results about "Web page categorization" patented technology

Web page information extraction system and method

The invention relates to a system and method for extracting web page information. The system comprises a template generation module, a web page homogenization module, an automatic tagging module, a wrapper file generation module and an online extraction module. The template generation module selects web pages to be automatically tagged from a web page collection and classifies them according to training web pages tagged by a user, so as to generate a web page template for each category. The web page homogenization module screens out the differences between each web page to be automatically tagged and the web page template of the same category. The automatic tagging module analyzes the training web pages of the corresponding category to generate a first wrapper file, then automatically tags the selected web pages according to the first wrapper file, producing new training web pages. The wrapper file generation module analyzes all the training web pages and generates a second wrapper file. The online extraction module applies the second wrapper file to extract information from the unselected web pages in the collection. The invention allows multiple templates to be generated for heterogeneous web pages, and supports extracting multiple records from a web page and multiple attributes of each record.
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI
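
To illustrate the online extraction step described above, the following is a minimal sketch of applying a wrapper to a page to pull out several records and several attributes per record. The abstract does not specify the wrapper file format, so the `wrapper` dictionary of regular expressions, the `extract` function and the sample HTML are all hypothetical stand-ins, not the patented method.

```python
import re

# Hypothetical wrapper: one pattern that delimits a record, plus one
# pattern per attribute. The real wrapper file format is not given
# in the abstract; this structure is an assumption for illustration.
wrapper = {
    "record": re.compile(r'<li class="item">(.*?)</li>', re.S),
    "attributes": {
        "title": re.compile(r"<h3>(.*?)</h3>", re.S),
        "price": re.compile(r'<span class="price">(.*?)</span>', re.S),
    },
}


def extract(html, wrapper):
    """Return one dict per record found in html, keyed by attribute name."""
    records = []
    for block in wrapper["record"].findall(html):
        record = {}
        for name, pattern in wrapper["attributes"].items():
            match = pattern.search(block)
            # Missing attributes are recorded as None rather than skipped.
            record[name] = match.group(1).strip() if match else None
        records.append(record)
    return records


page = """
<ul>
  <li class="item"><h3>Blue kettle</h3><span class="price">19.90</span></li>
  <li class="item"><h3>Red toaster</h3><span class="price">24.50</span></li>
</ul>
"""
records = extract(page, wrapper)
```

In this sketch one wrapper serves one page category; handling heterogeneous pages would mean keeping one such wrapper per template.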

System for automatic classification analysis for website based on website content

The invention discloses a system for automatic classification analysis of websites based on website content. The system comprises a capture module, a website text content analysis module, a word segmentation module, a feature training and extraction module and a website classification module. The feature training and extraction module calculates the importance degree, distinction degree and feature keyword weight of every candidate feature word, sorts the candidate feature words by feature keyword weight, and selects the feature words with the largest weights. The normalized feature keyword weights of the selected feature words are used as weightings, and a website classification vector template is created from the set of selected feature words and their weightings. The website classification module generates a feature space vector from the set of selected feature words and the weightings obtained by the feature training and extraction module, and identifies the category of a website by calculating the similarity between that vector and the feature space vector of the website. The system addresses the problem of disorganized network information and allows users to search for and locate information conveniently and accurately.
Owner: NANJING HUGEDATA NETWORK TECH
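
The similarity step described above can be sketched as follows. The abstract does not name the similarity measure or say how the website vector is built, so this sketch assumes cosine similarity and a website vector of word count times template weight; `classify_site`, `templates` and the sample weights are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_site(tokens, templates):
    """Pick the category whose weighted template is most similar to the site."""
    counts = Counter(tokens)
    best_category, best_score = None, -1.0
    for category, weights in templates.items():
        words = sorted(weights)
        template_vec = [weights[w] for w in words]
        # Assumption: the site vector is word count scaled by template weight.
        site_vec = [counts[w] * weights[w] for w in words]
        score = cosine(template_vec, site_vec)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score


# Hypothetical normalized feature keyword weights per category.
templates = {
    "sports": {"match": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"stock": 0.6, "market": 0.3, "bank": 0.1},
}
tokens = "the team won the match and the score was close".split()
category, score = classify_site(tokens, templates)
```

Words outside a category's selected feature set are ignored, which keeps each comparison restricted to the template's feature space.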

Web page text classification algorithm research based on web page link analysis and support vector machine

The invention discloses a web page text classification algorithm based on web page link analysis and a support vector machine, and relates to the technical field of web page classification. The method comprises the following steps: 1, a large number of web pages are divided into a training set and a test set; 2, the web pages (in both the training set and the test set) are preprocessed; 3, the word frequencies of the feature words in each web page in the training set are calculated; 4, the weights of the feature words in each web page in the training set are calculated; 5, feature vectors of each class in the test set are calculated; 6, text feature vectors of each web page in the training set are calculated; 7, the minimum similarity value is determined as the threshold value; 8, the number of feature words is reduced as far as possible; 9, text feature vectors of the web pages in the test set are classified; 10, the similarity between the classified web pages and the feature vectors is calculated and tested at the same time. The method combines a vector space model with a support vector machine, and has the advantages of short classification time, high recall rate, low memory requirement and high learning rate.
Owner: HUNAN UNIV
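
Steps 3 and 4 above (word frequencies, then feature word weights) can be sketched as below. The abstract does not state the weighting formula, so this sketch assumes plain TF-IDF; the function name `tf_idf` and the sample pages are hypothetical, and the support vector machine stage is not shown.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    weighted = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights = {}
        for word, count in counts.items():
            tf = count / total                      # step 3: word frequency
            idf = math.log(n_docs / doc_freq[word])
            weights[word] = tf * idf                # step 4: feature word weight
        weighted.append(weights)
    return weighted


# Hypothetical preprocessed training pages (already tokenized).
training_pages = [
    "football match football team".split(),
    "stock market stock bank".split(),
    "football stock news".split(),
]
weights = tf_idf(training_pages)
```

The resulting per-page weight dictionaries could serve as the text feature vectors that a later classifier consumes.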

Method and device for web page classification

The invention discloses a method and device for web page classification. The method comprises the following steps: a characteristic word classifier is established according to a web page address sample set, wherein the sample set comprises a plurality of sample web page addresses and the web page types corresponding to them; a preset quantity of web page addresses is acquired, and the web page type of each address is determined by the characteristic word classifier; redundancy elimination is applied to the web page addresses whose types have been determined, yielding structure character strings, which are the web page address structures; the web page address structures and their corresponding web page types are stored; and, during classification, the web page address of a to-be-classified page is acquired, the corresponding web page address structure is obtained by applying redundancy elimination to that address, and the web page type of the to-be-classified page is looked up in the store according to the web page address structure. The method allows web page classification to be performed rapidly and efficiently.
Owner: ZTE CORP
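
The store-and-lookup flow described above can be sketched as follows. The abstract does not define what redundancy elimination removes, so this sketch assumes it drops the query string and replaces runs of digits in the path with a placeholder; `url_structure`, `remember`, `lookup` and the example URLs are hypothetical, and the page types are supplied by hand rather than by a characteristic word classifier.

```python
import re
from urllib.parse import urlsplit


def url_structure(url):
    """Reduce a URL to a structure string (assumed redundancy elimination)."""
    parts = urlsplit(url)
    # Assumption: digit runs (dates, article ids) are the redundant part.
    path = re.sub(r"\d+", "{n}", parts.path)
    # The query string and fragment are dropped entirely.
    return parts.netloc.lower() + path


structure_to_type = {}


def remember(url, page_type):
    """Store the page type under the URL's structure string."""
    structure_to_type[url_structure(url)] = page_type


def lookup(url):
    """Return the stored page type for a URL with a known structure, else None."""
    return structure_to_type.get(url_structure(url))


# In the patent these types come from the classifier; here they are given.
remember("http://news.example.com/sports/2014/0312/1001.html", "sports")
remember("http://shop.example.com/item/55812?ref=home", "shopping")

result = lookup("http://news.example.com/sports/2015/1120/20488.html")
```

Classification at lookup time is a single dictionary access, which is consistent with the speed claim in the abstract.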