85 results about "Web page categorization" patented technology

Web page information extraction system and method

The invention relates to a system and method for extracting web page information. The system comprises a template generation module, a web page homogenization module, an automatic tagging module, a wrapper file generation module and an online extraction module. The template generation module selects the web pages to be automatically tagged from a web page collection and classifies them according to training web pages tagged by a user, so as to generate a web page template for each category. The web page homogenization module screens out the differences between each web page to be automatically tagged and the web page template of the same category. The automatic tagging module analyzes the training web pages corresponding to the category to generate a first wrapper file, then automatically tags the web pages according to the first wrapper file to generate new training web pages. The wrapper file generation module analyzes all the training web pages and generates a second wrapper file. The online extraction module applies the second wrapper file to extract information from the unselected web pages in the collection. The invention ensures that multiple templates corresponding to inhomogeneous web pages can be generated, and that multiple records in a web page, and multiple attributes of each record, can be extracted.
Owner:INST OF COMPUTING TECH CHINESE ACAD OF SCI

System for automatic classification analysis for website based on website content

The invention discloses a system for automatic classification analysis of websites based on website content. The system comprises a capture module, a website text content analysis module, a word segmentation module, a feature training and extraction module and a website classification module. The feature training and extraction module calculates the importance degree, distinction degree and feature keyword weight of every candidate feature word, sorts the candidate feature words by feature keyword weight, and selects the feature words with the largest weights. The normalized feature keyword weights of the selected feature words are used as weightings, and a website classification vector template is created from the selected feature words and their weights. The website classification module generates a feature space vector from the selected feature words and the weightings obtained by the feature training and extraction module, and identifies the classification of a website by calculating the similarity between that feature space vector and the feature space vector of the website. The system addresses the problem of disorganized network information and lets users locate information conveniently and accurately.
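The similarity step described above can be sketched as follows. This is a minimal illustration, not the patented system: cosine similarity is assumed as the similarity measure, raw term counts are assumed for the website vector, and the names `normalize`, `cosine` and `classify_site` are hypothetical.

```python
import math
from collections import Counter


def normalize(weights):
    """Scale feature keyword weights so they sum to 1 (assumed normalization)."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()} if total else {}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify_site(tokens, templates):
    """Return (category, similarity) for the best-matching class template.

    tokens: segmented words of the website text.
    templates: category -> {feature word: normalized weight}.
    """
    # The website vector is built over the union of all selected feature words.
    vocabulary = set().union(*templates.values()) if templates else set()
    counts = Counter(t for t in tokens if t in vocabulary)
    site_vector = dict(counts)
    scores = {cat: cosine(site_vector, tmpl) for cat, tmpl in templates.items()}
    if not scores:
        return None, 0.0
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Building the website vector over the union of all feature words matters here: restricting it to one template's words at a time would hide the evidence for competing categories.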
Owner:NANJING HUGEDATA NETWORK TECH

Web page classification method based on training set

The invention relates to an automatic web page classification method based on a training set. The classification process combines feature selection, feature weight determination, text vector comparison and similar methods. Automatic classification based on a classification system assigns a document to be classified to the corresponding category according to a category model established beforehand, namely the training set. With the development of multimedia technology, web page content has become richer: it comprises not only text but also structural information and other forms of information such as sound, figures and images. However, because text-based web pages still make up the larger proportion, classification based on web page text still takes precedence. The method has reliable theoretical support, good extensibility and accuracy, and is easy to connect to operator-related application interfaces.
Owner:NANJING UNIV OF POSTS & TELECOMM

Webpage classification technology based on vertical search and focused crawler

The invention provides a method for identifying webpage classifications based on vertical search and a focused crawler. The method comprises two parts: a webpage source code acquisition method and a webpage content analysis method. The webpage content analysis method is the key part and itself comprises two main parts: extraction of the structured information of the webpage and the crawling strategy of the focused crawler. First, a URL is selected from a navigation site URL list and the source file of that URL is acquired. Then all the classified URLs of the navigation site are identified and acquired by the webpage content analysis method, which first extracts the webpage's structured information, then fetches URLs using a directional breadth-first search strategy based on webpage content features, and finally stores the fetched URLs and the corresponding website classifications in a list named Category.
Owner:苏州锐创通信有限责任公司

Content-based web page classification method and system

The invention discloses a content-based web page classification method, which comprises the following steps: acquiring, by user equipment, a characteristic keyword in the uniform resource locator (URL) of a web page to be accessed by a user, and querying a local URL characteristic library according to that keyword to acquire the corresponding web page classification information; and, when the user equipment finds no corresponding web page classification information in the URL characteristic library, further acquiring the content of the web page to be accessed and querying a local web page template library according to that content to acquire the corresponding web page classification information. The invention also correspondingly discloses a content-based web page classification system. With the method and system, classification at web page granularity can be realized, classification accuracy and real-time performance are improved, and labor cost is reduced.
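The two-stage lookup described above (URL keywords first, page content only on a miss) can be sketched as follows. The library contents, the keyword-overlap matching against templates, and the names `URL_FEATURES`, `TEMPLATES`, `url_keywords` and `classify` are all assumptions made for illustration; the abstract does not specify how either library is matched.

```python
import re
from urllib.parse import urlparse

# Hypothetical local libraries; real ones would be far larger.
URL_FEATURES = {"news": "news", "shop": "shopping", "mail": "email"}
TEMPLATES = {
    "shopping": {"cart", "checkout"},
    "news": {"headline", "reporter"},
}


def url_keywords(url):
    """Split the host and path of a URL into lowercase alphanumeric tokens."""
    parsed = urlparse(url)
    text = (parsed.netloc + parsed.path).lower()
    return [t for t in re.split(r"[^a-z0-9]+", text) if t]


def classify(url, fetch_content):
    """Stage 1: URL feature library. Stage 2: content template library.

    fetch_content is a caller-supplied function url -> page text, so the
    page is only downloaded when the URL lookup finds nothing.
    """
    for keyword in url_keywords(url):
        if keyword in URL_FEATURES:
            return URL_FEATURES[keyword]
    words = set(re.findall(r"[a-z]+", fetch_content(url).lower()))
    best = max(TEMPLATES, key=lambda cat: len(TEMPLATES[cat] & words))
    return best if TEMPLATES[best] & words else None
```

Passing the fetcher in as a function keeps the cheap URL stage free of network access, which is the source of the real-time benefit the abstract claims.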
Owner:BEIJINGNETENTSEC

Intelligent web page classifier based on user behaviors

An intelligent web page classification device based on user behavior, operating as follows. (1) An initial classification sample group is input in the background for training, so as to obtain the clustering center of each classification in the feature space. (2) A URL input by the user is received, the corresponding pages are fetched and analyzed in the background, and the texts with index value in the page are output. In addition, a feature set is extracted from the user-input content and the web page content; the feature space of the initial classification sample group is then modified by feedback, and the feature weight factors in the vector space are adjusted. (3) The user-selected classification device automatically classifies the texts output in the previous step and outputs the results. When the user executes a search, the classification device automatically determines the classification of each result and is adjusted gradually; the more searches the user executes, the more accurate the classification becomes, which helps different users reduce the size of the search result set and locate the necessary information more accurately.
Owner:SHANGHAI XINSHENG ELECTRONICS TECH

Apparatus, method and computer-accessible medium for explaining classifications of documents

The disclosure concerns the classification of collections of items such as words, called “document classification,” and more specifically explaining the classification of a document such as a web page or website. This can include an exemplary procedure, system and/or computer-accessible medium to find explanations, as well as a framework to assess the procedure's performance. An explanation is defined as a set of words (more generally, terms) such that removing the words in this set from the document changes the predicted class from the class of interest. The exemplary procedure, system and/or computer-accessible medium can include a classification of web pages as containing adult content, e.g., to allow advertising on safe web pages only. The explanations can be concise and document-specific, and provide insight into the reasons for the classification decisions, into the workings of the classification models, and into the business application itself. Other exemplary aspects describe how explaining documents' classifications can assist in improving data quality and model performance.
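The definition of an explanation given above (a set of words whose removal changes the predicted class) can be illustrated with a small brute-force search. This is a sketch under stated assumptions: the exhaustive search over small subsets, the toy linear classifier, and the names `explain`, `toy_predict` and `TOY_WEIGHTS` are hypothetical and are not the procedure of the disclosure.

```python
from itertools import combinations


def explain(words, predict, max_size=3):
    """Return a smallest set of words whose removal flips predict(), or None.

    words: list of tokens in the document.
    predict: any function mapping a token list to a class label.
    max_size: largest explanation size to try (keeps the search tractable).
    """
    original = predict(words)
    vocabulary = sorted(set(words))
    for size in range(1, max_size + 1):
        for subset in combinations(vocabulary, size):
            kept = [w for w in words if w not in subset]
            if predict(kept) != original:
                return set(subset)
    return None


# Toy classifier with made-up weights, used only to exercise explain().
TOY_WEIGHTS = {"casino": 2.0, "poker": 1.5, "bonus": 0.5, "weather": -1.0}


def toy_predict(words):
    score = sum(TOY_WEIGHTS.get(w, 0.0) for w in words)
    return "restricted" if score > 1.0 else "safe"
```

Because the search treats `predict` as a black box, the same function works with any classifier, at the cost of exponential growth in the number of subsets as `max_size` increases.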
Owner:NEW YORK UNIV

Webpage classification method for semi-supervised multi-view learning

The invention relates to the technical field of the Internet, in particular to a webpage classification method for semi-supervised multi-view learning, which comprises the following steps: obtaining data from webpages and establishing a training set; training a classifier with the marked training set; encoding the marked training set and the unmarked training set with the trained classifier to obtain sample features; performing density clustering on the sample features to obtain a clustering result; and classifying the samples of the unmarked training set according to the clustering result. According to the scheme, the marked training set is used for training the classifier; orthogonal constraints and adversarial similarity constraints are added on the basis of an existing multi-view classification method; density clustering marking is carried out on all data in the training set with the trained classifier; finally, the accuracy of the classifier is verified, and the classification performance of the classifier can be improved through multiple iterations of the process.
Owner:GUANGDONG UNIV OF PETROCHEMICAL TECH

Method and system for searching pictures in network

The invention relates to a method for searching for pictures in a network, comprising the steps of: determining the major classification of a query word according to a preset word class library; searching for each picture relevant to the query word; obtaining, according to a preset website class library, a classification weight of the website hosting each picture with respect to the major classification; obtaining, according to a preset webpage class library, a description weight of the webpage containing each picture with respect to the major classification; and extracting the pictures whose comprehensive relevance, calculated from the classification weight and the description weight, exceeds a threshold. The invention also discloses a system for searching for pictures in a network. The invention addresses the weak relevance between retrieved pictures and the query word in the prior art, and the resulting poor user experience, by returning pictures closely relevant to the query word.
Owner:SHENZHEN TENCENT COMP SYST CO LTD

Web page text classification algorithm research based on web page link analysis and support vector machine

The invention discloses a web page text classification algorithm based on web page link analysis and a support vector machine and relates to the technical field of web page classification. The method includes the following steps: 1, a large number of web pages are divided into a training set and a test set; 2, the web pages (in both the training set and the test set) are preprocessed; 3, the word frequencies of feature words in each web page in the training set are calculated; 4, the weights of the feature words in each web page in the training set are calculated; 5, feature vectors of each class in the test set are calculated; 6, text feature vectors of each web page in the training set are calculated; 7, the minimum similarity value is determined as the threshold value; 8, the number of feature words is reduced as far as possible; 9, text feature vectors of the web pages in the test set are classified; 10, the similarity between the classified web pages and the feature vectors is calculated and tested at the same time. The method combines a vector space model with a support vector machine and has the advantages of short classification time, high recall rate, low memory requirement and high learning rate.
Owner:HUNAN UNIV

A web page classification method based on deep learning with the fusion of text and structural features

Inactive | CN108984706A | Classification is comprehensive and effective | Improve accuracy | Special data processing applications | Web page categorization | Short-term memory
The invention provides a web page classification method based on deep learning that fuses text and structural features. First, the HTML (HyperText Markup Language) document of the web page is obtained by a crawler, key text information such as the title, meta tags and hyperlinks is extracted, and the text vocabulary is converted into vectors (word2vec) to represent the text features. Then the HTML tags are traversed and transformed into vectors to represent the structural features of the web page. Finally, the vectors are input into a long short-term memory (LSTM) network, and the heterogeneous text features and structural features are fused in the trained model through the neural network for classification. The method combines distinguishing features to represent web pages more comprehensively and improves classification accuracy.
Owner:ZHEJIANG UNIV

Student browsed webpage classification method

The invention discloses a student browsed webpage classification method based on N-Gram and a naive Bayesian classifier. The method comprises the specific implementation steps that first, URL description information is crawled from a navigation website, four classification corpora are constructed, corpus texts are expressed in the forms of uni-gram and bi-gram, TF-IDF is used as a weight of text characteristics, and a naive Bayesian classification algorithm is used to construct the classifier; and URLs in student browsed records are segmented according to set rules, URL categories are determined through matching of the classifier and a URL category base, and if the URL categories determined through the classifier conform to set confidence, the URL categories are added into the URL category base. Through the method, the URLs in the student browsed records are effectively classified, and therefore the webpage recognition rate and the classification accuracy rate are increased.
Owner:HUAIYIN INSTITUTE OF TECHNOLOGY

Method and device for web page classification

The invention discloses a method and device for web page classification. The method comprises the steps of: establishing a characteristic word classifier according to a web page address sample set, wherein the sample set comprises a plurality of sample web page addresses and the web page types corresponding to them; acquiring a preset quantity of web page addresses and determining the web page type of each with the characteristic word classifier; applying redundancy elimination to the web page addresses whose types have been determined to obtain structure character strings, i.e. web page address structures; storing the web page address structures and the corresponding web page types; and, at classification time, acquiring the web page address of a to-be-classified page, applying redundancy elimination to it to obtain the corresponding web page address structure, and looking up the web page type in the store according to that structure. The method allows web pages to be classified rapidly and efficiently.
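The store-and-lookup flow above can be sketched as follows. The abstract does not say what redundancy elimination removes, so this sketch assumes one plausible reading: the scheme, query string and fragment are dropped and every run of digits is replaced by a placeholder. The names `url_structure`, `learn` and `lookup` are hypothetical.

```python
import re
from urllib.parse import urlparse


def url_structure(url):
    """Reduce a URL to a structure string (assumed redundancy elimination).

    Drops the scheme, query and fragment, lowercases the host, and replaces
    each run of digits in the path with the placeholder "{n}".
    """
    parsed = urlparse(url)
    path = re.sub(r"\d+", "{n}", parsed.path)
    return parsed.netloc.lower() + path


def learn(store, url, page_type):
    """Record the page type determined for a URL under its structure string."""
    store[url_structure(url)] = page_type


def lookup(store, url):
    """Classify a new URL by its structure string; None if never seen."""
    return store.get(url_structure(url))
```

Once a structure has been stored, any later URL with the same shape is classified by a single dictionary lookup, without running the characteristic word classifier again.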
Owner:ZTE CORP

Method for classifying search term, device and search engine system

The invention provides a query word classification method that includes: acquiring query words input by a user; recording the web pages that the user clicks for a query word; acquiring the classification information of those web pages; and determining the query word classification result according to classification parameters, which include the web page classification information. By using the classifications of the web pages corresponding to a query word, the invention can determine the classification of the query word, improving the efficiency of query word classification and saving resources. The invention further provides an apparatus for query word classification and a search engine system comprising the apparatus.
Owner:BEIJING SOGOU TECHNOLOGY DEVELOPMENT CO LTD

Web page training method and device, and search intention identifying method and device

A search intention identifying method. The method includes: at a device having one or more processors and memory, obtaining a to-be-identified query character string, and obtaining a history web page set corresponding to the query character string, the history web page set comprising web pages historically clicked for the query character string; obtaining a predetermined web page categorization model; obtaining a category of each web page in the history web page set according to the web page categorization model; collecting statistics on the number of web pages in each category in the history web page set, and performing calculation according to the number of web pages in each category and the total number of web pages in the history web page set to obtain an intention distribution of the query character string; and obtaining an intention identification result of the query character string according to the intention distribution.
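The counting step above reduces to a per-category share of the clicked pages. The sketch below assumes the simplest form of the calculation (category count divided by total) and a hypothetical decision rule that accepts the top category only above a threshold; `intent_distribution`, `identify_intent` and the `categorize` callback are illustrative names.

```python
from collections import Counter


def intent_distribution(history_pages, categorize):
    """Share of historically clicked pages falling in each category.

    categorize stands in for the predetermined web page categorization model.
    """
    counts = Counter(categorize(page) for page in history_pages)
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()} if total else {}


def identify_intent(distribution, threshold=0.5):
    """Pick the dominant category if its share reaches the assumed threshold."""
    if not distribution:
        return None
    category, share = max(distribution.items(), key=lambda item: item[1])
    return category if share >= threshold else None
```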
Owner:TENCENT TECH (SHENZHEN) CO LTD

Web page classifying method and apparatus

The invention provides a web page classifying method and apparatus. The method comprises the steps of: analyzing a plurality of web page elements from a to-be-predicted web page; predicting candidate web page classifications to which the to-be-predicted web page belongs according to the web page elements respectively; and comparing the candidate web page classifications predicted by the web page elements respectively, and determining final web page classification of the to-be-predicted web page. According to the method, a full-automatic classifying process is realized, the manual operation is not required, the web page classifying efficiency is greatly improved, especially massive web pages of the whole network and web pages newly generated in the internet can be quickly and effectively classified, and the web page classifying timeliness is ensured.
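The per-element prediction and comparison described above can be sketched with a simple majority vote. The vote is an assumption: the abstract only says the candidate classifications are compared. The name `classify_page` and the shape of the `predictors` mapping are hypothetical.

```python
from collections import Counter


def classify_page(elements, predictors):
    """Combine per-element candidate classifications by majority vote.

    elements: element name -> extracted value (e.g. title, URL, body text).
    predictors: element name -> function returning a category or None.
    """
    votes = Counter()
    for name, value in elements.items():
        predictor = predictors.get(name)
        if predictor is None:
            continue  # no model for this element
        label = predictor(value)
        if label is not None:
            votes[label] += 1
    return votes.most_common(1)[0][0] if votes else None
```

A weighted vote, where for example the title counts for more than the body, would be a natural variation on the same structure.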
Owner:北京鸿享技术服务有限公司

Webpage classification method and device

The invention discloses a webpage classification method and device. The method includes: establishing virtual hierarchical URLs (uniform resource locators) according to the records in an existing URL class library, and predicting the class of each hierarchical URL; when webpages need to be classified, searching the URL class library according to the URLs of the webpages to be classified; if no matching URL is found, searching the URL class library according to the higher-level URLs of those URLs; and when a matching URL is found, determining the classes of the webpages to be classified according to the predicted class of the found URL. The method and device improve the efficiency and success rate of webpage classification.
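The fallback search through higher-level URLs can be sketched as follows. The sketch assumes the URL class library already holds predicted classes for the hierarchical URLs and that "higher-level" means dropping trailing path segments; the key format (host plus path, no scheme) and the names `parent_urls` and `classify` are hypothetical.

```python
from urllib.parse import urlparse


def parent_urls(url):
    """Yield the URL's key, then each higher-level key up to the bare host."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    segments = [s for s in parsed.path.split("/") if s]
    for depth in range(len(segments), -1, -1):
        yield "/".join([host] + segments[:depth])


def classify(url, library):
    """Return the class of the most specific matching URL level, or None."""
    for key in parent_urls(url):
        if key in library:
            return library[key]
    return None
```

Searching from the most specific level upward means a page inherits the class of its nearest classified ancestor, so a section-level entry overrides a site-level one.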
Owner:CHINA MOBILE SUZHOU SOFTWARE TECH CO LTD +2

Data processing method and system

The invention provides a data processing method. The method comprises the following steps that a web page is collected from a preset data source; a web page category to which the collected web page belongs is determined; the web page categorization is based on different objects described by the web page included in the preset data source; a wrapper corresponding to the web page category is adopted to extract valid information from the collected web page; the wrapper is generated according to attributes of the objects described by the web page corresponding to the web page category; the extracted valid information is converted into a preset standard format and stored. According to the data processing method, redundant network data can be effectively processed into data required by people, and the use value of the network data is improved.
Owner:GUOXIN YOUE DATA CO LTD

Webpage type recognition method and webpage type recognition device

The invention relates to a webpage type recognition method and a webpage type recognition device. The method includes the steps of: receiving the webpage address of a webpage to be tested; analyzing the webpage address to obtain its constituent parts; judging whether a constituent part of the address matches a webpage classification rule; if it matches, classifying the webpage to be tested according to the webpage classification rule to obtain its webpage type; and if it does not match, sending the webpage address to a webpage classification device for classification to obtain the webpage type. With the method and device, the webpage type can be predicted using only the webpage address, with high prediction speed and good real-time performance.
Owner:深圳市雅阅科技有限公司

Web page classification method, web page classification device and network equipment

The invention provides a web page classification method, a web page classification device and network equipment. The method comprises the following steps: extracting information of different classification weight levels from the source file of a web page; performing word segmentation on the information of each classification weight level to acquire the segmented words of each level; and classifying the web page using the segmented words of each level, in order of classification weight level from high to low. According to the technical scheme, the web page is classified preferentially using the information with the higher classification weight level, exploiting the fact that more important information in a web page has greater influence on the classification result. This helps reduce the influence of invalid information in the web page on classification and thereby improves classification accuracy.
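The high-to-low ordering above can be sketched as a loop that stops at the first level producing a match. The level order, the keyword sets, the regular-expression tokenizer (a stand-in for real word segmentation) and the names `LEVELS`, `KEYWORDS` and `classify_by_level` are all assumptions made for illustration.

```python
import re

# Hypothetical weight levels, highest first, and hypothetical category keywords.
LEVELS = ["title", "meta", "body"]
KEYWORDS = {
    "sports": {"football", "match"},
    "finance": {"stock", "bank"},
}


def segment(text):
    """Stand-in for word segmentation: lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify_by_level(fields):
    """Try each weight level from high to low; stop at the first hit.

    fields: level name -> text extracted from the page source.
    Returns (category, level used) or (None, None).
    """
    for level in LEVELS:
        words = segment(fields.get(level, ""))
        scores = {cat: len(keys & words) for cat, keys in KEYWORDS.items()}
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            return best, level
    return None, None
```

Stopping at the first level with a match is what keeps low-weight body text from overriding a clear signal in the title or meta information.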
Owner:BEIJING XINWANG RUIJIE NETWORK TECH CO LTD

Webpage classification method and webpage classification system

The invention provides a webpage classification method and system. In the method, a device for obtaining webpages to be classified receives a domain name input by a user and, based on the domain name, obtains the URL (Uniform Resource Locator) of a webpage whose breadcrumb is to be crawled; a breadcrumb crawling device crawls the breadcrumb of the webpage based on the URL; and a webpage classifier classifies the webpage based on the crawled breadcrumb. Compared with the prior art, the method and system extract the breadcrumb from the webpage based on the domain name and classify the webpage accordingly, which effectively improves webpage classification accuracy.
Owner:BEIJING DEEPZERO TECH CO LTD

Method for auditing webpage based on cloud semantic database

The invention provides a method for auditing webpages based on a cloud semantic database, mainly applied in fields such as online information security and online behavior management. The invention uses cloud technology and semantic analysis technology to construct the cloud semantic database; semantic analysis and word frequency statistics are carried out on the webpage content that a user accesses, as captured by an online behavior management system, and the content is matched against the cloud semantic database to obtain the webpage classification; and the cloud semantic database is revised according to the feedback from audit results.
Owner:莱克斯科技(北京)有限公司

Online semantic excavation system of Chinese polysemic words and based on uniform resource locator (URL)

Inactive | CN103488741A | Effectively obtain online semantic classification results | Web data indexing | Special data processing applications | Web page categorization | Classification methods
The invention discloses an online semantic excavation system of Chinese polysemic words and based on a uniform resource locator (URL). The system utilizes a webpage classification method based on the URL and can conduct semantic excavation on the Chinese polysemic words online. The process includes first constructing a URL classifier through an online URL classification catalogue; then classifying searching results (including webpage URLs and abstracts) of the polysemic words returned by a search engine by means of the URL classifier to obtain initial semantic classification results of the polysemic words; finally clustering the initial semantic classification results according to the webpage abstracts to obtain semantic excavation results of the polysemic words. The semantic excavation system has ideal accuracy and recall rate and is highly applicable to semantic excavation of network popular words.
Owner:EAST CHINA NORMAL UNIV

Website navigation path analysis

A method and a system for website navigation path analysis are disclosed. The system comprises a processor and a non-transitory computer-readable storage medium coupled to the processor. The processor executes a plurality of modules/subsystems stored in the storage medium: a data categorizing module for categorizing the web pages of the website into one or more groups based on domain knowledge and functional similarities between the web pages; a score assigning module for calculating an index score, casual score, baseline score and engagement score for web page elements and categorized web page pairs; a statistical model to be trained with the scores and weights calculated by the score assigning module; and an analyzing module for determining, based on the trained statistical model, which web pages and transitions correspond to engagement and decision making.
Owner:IQUANTI INC

Interest point searching method and system for mobile internet

The invention discloses an interest point searching method and system for a mobile internet. The method comprises the following steps of: acquiring HTTP logs of users accessed to the mobile internet from a DPI system; extracting a user search URL according to a searching characteristic library; determining user search interest points by adopting a keyword classification method; if successful, ending the process; and if not, determining the user search interest points by adopting a webpage classification method. According to the interest point searching method and system for the mobile internet disclosed by the invention, starting with a basic network, two analysis methods are combined; mobile internet search interest points of user can be analyzed more accurately; the accuracy rate and the coverage rate for analysis of the mobile internet search interest points can be increased; and powerful data support is provided for product marketing and improving the user experience.
Owner:E-SURFING DIGITAL LIFE TECH CO LTD

Method to search transactional web pages

A method of performing transactional web page searches is disclosed. The method includes examining a plurality of web pages, identifying transactional features within a set of the plurality of web pages, and classifying the set of web pages as transactional. The method proceeds with annotating and indexing the transactional web pages, and, in response to a user-designated transactional query, providing only the set of web pages that have been classified as transactional. The identifying transactional features comprises checking for the existence of positive patterns and verifying the absence of negative patterns with respect to a set of contents within each of the plurality of web pages and comprises identifying transactional actions to be performed and identifying transactional objects of the transactional actions to be performed. The annotating and indexing the transactional features comprises annotating and indexing transactional actions and transactional objects.
Owner:IBM CORP

Web page classification method based on von Mises-Fisher probability model

The invention discloses a web page classification method based on a von Mises-Fisher probability model, belonging to the technical fields of the Internet and machine learning. The method comprises the following steps: first, performing data preprocessing, feature extraction and feature screening on the training samples and building the model; then substituting the feature vector of the web page to be classified into the model to obtain the final classification. The method applies two-norm normalization to the obtained feature vectors, which prepares them for von Mises-Fisher modeling while eliminating the influence of text length on the feature vector. The von Mises-Fisher probability model is used to model the text feature vectors, and the model is applied to the field of natural language processing for the first time.
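For reference, the two-norm normalization and the standard von Mises-Fisher density can be written as below. The per-class parameters $\mu_c$ (unit mean direction) and $\kappa_c \ge 0$ (concentration), the class prior $\pi_c$ and the maximum-posterior decision rule are assumptions about how such a classifier is usually set up; the abstract does not give the patent's exact formulation. $I_\nu$ denotes the modified Bessel function of the first kind and $d$ the feature dimension.

```latex
\hat{x} = \frac{x}{\lVert x \rVert_2}, \qquad
f(\hat{x} \mid \mu_c, \kappa_c) = C_d(\kappa_c)\,
  \exp\!\left(\kappa_c\, \mu_c^{\top} \hat{x}\right), \qquad
C_d(\kappa) = \frac{\kappa^{d/2-1}}{(2\pi)^{d/2}\, I_{d/2-1}(\kappa)}

\hat{c} = \arg\max_{c} \left[ \log \pi_c + \log C_d(\kappa_c)
  + \kappa_c\, \mu_c^{\top} \hat{x} \right]
```

Because $\hat{x}$ has unit length, two documents with the same word proportions but different lengths map to the same point on the sphere, which is how the normalization removes the effect of text length.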
Owner:BEIJING UNIV OF POSTS & TELECOMM

Webpage classification dictionary generation method and apparatus

Embodiments of the invention disclose a webpage classification dictionary generation method and apparatus. The method comprises the steps of determining a sample uniform resource locator (URL) corresponding to a webpage classification sample of each category according to a predetermined webpage classification standard, and obtaining sample webpage contents corresponding to each sample URL; extracting sample text information in each piece of the sample webpage contents, performing word segmentation processing on the sample text information, and obtaining a corresponding sample word from the sample text information; and screening out a reverse word frequency value corresponding to the sample word from corresponding relationships between a plurality of pre-stored learning words and the reverse word frequency value, wherein the reverse word frequency value is a value determined according to the occurrence frequency of each learning word in corresponding learning text information; and storing the sample word and a weight value determined according to the corresponding reverse word frequency value in a webpage classification dictionary. Therefore, the webpage classification dictionary with higher accuracy is generated.
Owner:NEW H3C TECH CO LTD