277 results about "Tf–idf" patented technology

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.
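The weighting described above can be sketched in a few lines. This is a minimal illustration, assuming tf is the term count divided by document length and idf is log(N / df); other variants exist, and the function name `tf_idf` and the toy corpus are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
w = tf_idf(docs)
# "the" occurs in every document, so its idf (and weight) is 0.0;
# "mat" occurs in one document only and outranks "cat" in the first one.
```

Under this variant a word that appears in every document is weighted to zero, which is the offsetting effect the definition describes.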

Realization method and system for electronic medical record post-structuring and auxiliary diagnosis

Inactive | CN106383853A | Good effect | Special data processing applications | Data set | Jaro–Winkler distance

The invention relates to a realization method and system for electronic medical record post-structuring and auxiliary diagnosis. Several distance measures are combined: the string edit distance is the minimum number of replacement, insertion and deletion operations required to convert one character string into another; the Jaro-Winkler distance measures the similarity between two character strings and is used for duplicate record detection; and the geometric mean of a Chinese character distance and a Chinese character input method is adopted as a comprehensive similarity measure between feature texts. Feature ranking uses the TF-IDF method to assess the importance of feature terms relative to documents in a file set or corpus: the importance of a feature term is directly proportional to its frequency in the document and inversely proportional to the number of documents in the corpus that contain it. According to the generated feature terms, files are converted into the format required for PU learning over a positive example data set and an unlabelled data set, and through PU learning the system automatically recommends related diagnoses for clinical medical personnel to refer to.
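The string edit distance mentioned in this abstract can be illustrated with the standard dynamic-programming (Levenshtein) computation. This is a generic sketch, not the patent's implementation; the function name `edit_distance` is chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of replacements, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]
```

Because Python strings are sequences of code points, the same function works unchanged on Chinese text.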

Owner:刘勇

Natural language processing-based multi-language analysis method and device

Active | CN108197109A | Quality improvement | Enable multilingual analysis | Character and pattern recognition | Natural language data processing | Algorithm | Model selection

The invention discloses a natural language processing-based multi-language analysis method and device. The method comprises the following steps: determining the language category of the input natural language text information through a trained language detection model; obtaining computer-recognizable word embedding representations of the corresponding words through a trained word vector model, and extracting keywords from the obtained word embedding representations by means of TF-IDF; calculating an article vector and a category vector for each preset category according to the keywords and keyword weights, and calculating the similarity between the article and each preset category so as to determine the text classification result of the natural language text information; and inputting the word embedding representations of the natural language text information into a trained text emotion analysis model that runs a convolutional neural network and a bidirectional gated recurrent unit in parallel, and calculating a final emotion tendency value. The method and device solve the problem that traditional multi-language analysis methods require domain knowledge of the related linguistics and a great deal of manual work.

Owner:北京百分点科技集团股份有限公司

Patent literature similarity measurement method based on ontology

Inactive | CN107247780A | Improve comprehensiveness | Increase depth | Semantic analysis | Special data processing applications | Information processing | Patent classification

The invention relates to a patent literature similarity measurement method based on ontology, in the technical field of ontology-oriented natural language information processing. The method comprises the following steps: extracting a core technical scheme according to the structural features, position features and keyword features of patent literature; constructing a model of the relation between thematic terms of patent classes; constructing a field dictionary according to that model, then segmenting terms and removing stop terms for the core technical scheme; extracting keywords and their weights, using TF-IDF combined with the relation between thematic terms as the initial TextRank term weight; training a FastText model and generating term vectors; and calculating an EMD distance from the keywords, term weights and term vectors to obtain a semantic distance. Compared with the prior art, the method solves the problem that similarity is low because the structural features, field features, term relation features and approximate semantic expression of patent literature are not fully considered.

Owner:BEIJING INSTITUTE OF TECHNOLOGY

Telecommunication field package recommending method based on intelligent customer service robot interaction

Inactive | CN102760128A | Meet needs | Improve recommendation efficiency | Special data processing applications | Personalization | Robotic systems

The invention provides a telecommunication field package recommending method based on intelligent customer service robot interaction. The method comprises the following steps: a. acquiring a user interest model; and b. recommending a package service that satisfies the user's individual demands by adopting a decision tree algorithm. The invention further provides a recommending engine device for an intelligent customer service robot system. Compared with existing recommending methods, the method is based on a scene interaction model during recommendation and carries out similarity calculation on labels, combining semantic similarity with traditional TF-IDF (Term Frequency-Inverse Document Frequency) similarity. Representing resource and user models with labels better reflects the characteristics of users and resources and improves the recommending quality.

Owner:EAST CHINA NORMAL UNIV

Network hot event detection method based on text classification and clustering analysis

Active | CN104239436A | Improve efficiency | Improve accuracy | Special data processing applications | Text database clustering/classification | Feature extraction | Text categorization

The invention discloses a network hot event detection method based on text classification and clustering analysis. The method solves the problem that the efficiency and accuracy rate of the existing network hot event detection method based on clustering analysis need to be improved. The method comprises the steps that feature words are respectively selected for various classes of files through feature extraction and feature selection by utilizing a training corpus; each training text and test text are represented as vectors in all of the feature spaces by utilizing a vector space model method, and the weight of each dimension of the vectors is determined by utilizing a TF-IDF (term frequency-inverse document frequency) method, and then each test text is classified; the classified test texts in different classes are respectively subjected to clustering analysis, so the hot cluster of each class is obtained, the feature word representing the hot event is obtained through further analysis, and then the word property and other aspects of each feature word are analyzed; the description of each hot event is generated by utilizing relevant language knowledge and necessary linguistic organization. With the network hot event detection method based on text classification and clustering analysis, the detection efficiency and accuracy rate of hot events can be effectively improved.

Owner:NANJING UNIV OF POSTS & TELECOMM

Text information similarity matching method and device, computer device and storage medium

Inactive | CN108628825A | High precision | Improve efficiency | Natural language data processing | Special data processing applications | Cosine similarity | Algorithm

The invention provides a text information similarity matching method and device on the basis of TF-IDF. The method includes the steps of acquiring text information; performing word segmentation on the text information to obtain word segments w1, w2, ..., wn-1 and wn; using a CBOW model to calculate word vectors V(w1), V(w2), ..., V(wn-1) and V(wn) of the word segments respectively; using the TF-IDF algorithm to calculate TF-IDF values k1, k2, ..., kn-1 and kn of the word segments respectively; obtaining a sentence vector V from the products of the word vectors of the word segments and the corresponding TF-IDF values; calculating the cosine similarity between the sentence vector V and the sentence vectors of pre-stored statements, and determining the pre-stored statement with the maximum cosine similarity. Through this process, the pre-stored statement most similar to the text information can be found, and the accuracy of problem recognition can be improved in robot dialogue, information classification and the like, which improves dialogue or classification efficiency. A computer device and a storage medium are also provided.
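The matching step can be sketched as follows. The abstract says only that the sentence vector is obtained from the products of word vectors and TF-IDF values; summing those products is an assumption made here, and the function names and toy vectors are hypothetical.

```python
import math

def sentence_vector(word_vectors, tfidf_values):
    """Sum of k_i * V(w_i) over the word segments (assumed combination rule)."""
    dim = len(word_vectors[0])
    v = [0.0] * dim
    for vec, k in zip(word_vectors, tfidf_values):
        for i, x in enumerate(vec):
            v[i] += k * x
    return v

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_match(query_vec, stored):
    """Key of the pre-stored sentence vector most similar to query_vec."""
    return max(stored, key=lambda name: cosine(query_vec, stored[name]))

# Toy 2-dimensional "word vectors" with TF-IDF values 2.0 and 0.5.
V = sentence_vector([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.5])
stored = {"greeting": [1.0, 0.0], "farewell": [0.0, 1.0]}
match = best_match(V, stored)
```

In practice the word vectors would come from a trained CBOW model rather than being written by hand.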

Owner:PING AN TECH (SHENZHEN) CO LTD

Automatic microblog text abstracting method based on unsupervised key bigram extraction

Active | CN104216875A | Quality improvement | Improve efficiency | Special data processing applications | Microblogging | Mutual information

The invention discloses an automatic microblog text abstracting method based on unsupervised key bigram extraction. The method comprises the steps of preprocessing a microblog; standardizing bigrams; extracting key bigrams based on a mixture of TF-IDF (term frequency-inverse document frequency), TextRank and LDA (latent Dirichlet allocation); ranking sentences based on intersection similarity and a mutual information strategy; extracting abstract sentences based on a similarity threshold; and generating the abstract by reasonably combining the abstract sentences. The method uses the bigram as the minimum vocabulary unit. Because a bigram carries richer text information than a single word, sentences selected by key bigrams have higher noise immunity and accuracy than sentences selected by keyword extraction. When the abstract sentences are extracted, the similarity threshold is introduced to control redundancy, so the abstract has a higher recall rate. The abstract generated by the method is accurate, concise and comprehensive; the efficiency and quality with which a user acquires knowledge are clearly improved, and the user's time is greatly saved.

Owner:INST OF AUTOMATION CHINESE ACAD OF SCI

Filtering method for spam based on supporting vector machine

Inactive | CN101106539A | Solve the problem of unequal cost of misjudgment | Increase the weight value | Office automation | Data switching networks | Support vector machine | Relevant information

The invention discloses a junk mail filtering method based on a support vector machine (SVM). The steps are as follows: 1) parse the mail and extract the information relevant to title, text and character set; 2) perform word segmentation on the extracted text content; 3) count word frequencies in the mail and use the TF-IDF formula to map the mail text to a vector; 4) use LibSVM to train on the mail samples and obtain a support vector machine model; 5) use the support vector machine model to classify new mail and obtain the probability that it is junk mail; 6) adjust a threshold to keep the rate at which normal mail is misjudged as junk mail (the false positive rate) low, and finally judge whether the mail is junk mail. The invention exploits the support vector machine's high single-model classification accuracy and improves the correctness of junk mail filtering according to text features and activity features, while also effectively addressing the unequal misjudgment costs in junk mail filtering.
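Step 6 adjusts a threshold to keep the false positive rate low, but the abstract does not say how the threshold is chosen. One simple possibility, assuming a held-out set of normal mail already scored by the classifier, is to take the smallest threshold whose false positive rate stays within a target; `pick_threshold`, `is_junk` and the sample scores are hypothetical.

```python
def pick_threshold(normal_scores, max_fpr):
    """Smallest threshold t such that the share of normal mail scoring
    above t does not exceed max_fpr (assumed selection rule)."""
    n = len(normal_scores)
    for t in sorted(set(normal_scores)):
        fpr = sum(s > t for s in normal_scores) / n
        if fpr <= max_fpr:
            return t
    raise ValueError("normal_scores must not be empty")

def is_junk(score, threshold):
    """Flag a mail as junk only when its junk probability exceeds the threshold."""
    return score > threshold

# Junk probabilities the classifier assigned to four normal mails.
normal_scores = [0.1, 0.2, 0.3, 0.9]
t = pick_threshold(normal_scores, max_fpr=0.25)
```

Raising the threshold trades missed junk mail for fewer misjudged normal mails, which reflects the unequal misjudgment costs the abstract refers to.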

Owner:ZHEJIANG UNIV

Text key word extracting method

Inactive | CN101067808A | High precision | Improve performance | Special data processing applications | Tf–idf | Extraction methods

This invention relates to an improved TF-IDF method for extracting text keywords. It extracts the keywords of a single text by a text frequency modification method, which increases the accuracy of single-text keyword extraction, and it extracts the keywords common to a field from a set of texts of the same kind by a word frequency modification method or a comparison selection method.

Owner:SHANGHAI UNIV

Method for filtering Chinese junk mail based on Logistic regression

Inactive | CN101227435A | Few adjustment parameters | Improve classification effect | Office automation | Data switching networks | Feature vector | Relevant information

The invention discloses a method for filtering Chinese junk e-mail based on logistic regression. The method comprises the following steps: first, parsing the e-mails and extracting the title, main body and attachment-related information; second, performing word segmentation on the extracted text information; third, counting the word frequencies of entries in the e-mails, calculating word weights with the TF-IDF formula, and representing each e-mail as a weighted feature vector; fourth, using the LIBLINEAR tool kit to train on the e-mail samples and obtain a logistic regression model; fifth, using the logistic regression model to classify new e-mails and obtain the probability that each is junk e-mail. The logistic regression model is simple, has few parameters, and achieves high classification accuracy on data sets where both the number of texts and the number of features are large. The accuracy and efficiency of junk e-mail filtering are improved through dimension reduction and an improved feature value calculation method, and the problem of choosing model training parameters in junk e-mail filtering is effectively solved.

Owner:ZHEJIANG UNIV

Topic feature text keyword extraction method

Inactive | CN108763213A | Reduce the influence of human subjective factors | Reduce workload | Natural language data processing | Special data processing applications | Part of speech | Algorithm

The invention discloses a topic feature text keyword extraction method. Through the method, text keyword extraction results better than those of a traditional TF-IDF method can be obtained. According to the technical scheme, at a training stage, word segmentation, stop word removal, part-of-speech filtering and other preprocessing are performed on a training text, statistical analysis is performed on the inverse document frequency of words, a topic model method is used to learn a topic probability matrix of the words, normalization is performed, the topic distribution entropy of the words is calculated from the topic probability matrix, global weights of the words are calculated by combining the inverse document frequency and the topic distribution entropy, and the global weight results are output to a test stage. After a test text is preprocessed, statistical analysis is performed on the normalized term frequency of words in the test text, the normalized term frequency is combined with the global weights obtained at the training stage, comprehensive scores of the words are calculated and ordered, and the words with the highest scores are used as the automatic keyword extraction results for the current test text.
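The topic distribution entropy in the training stage can be illustrated with a plain Shannon entropy over one word's normalized topic probabilities. The abstract does not give the rule for combining entropy with inverse document frequency, so this sketch stops at the entropy; the function name and inputs are hypothetical.

```python
import math

def topic_entropy(topic_scores):
    """Shannon entropy (natural log) of one word's topic distribution.

    topic_scores is one row of the topic probability matrix; it is
    normalized here so that the values sum to 1.
    """
    total = sum(topic_scores)
    probs = [s / total for s in topic_scores if s > 0]
    return -sum(p * math.log(p) for p in probs)

# A word spread evenly over four topics has the maximum entropy, log(4);
# a word tied to a single topic has entropy zero.
even = topic_entropy([1, 1, 1, 1])
focused = topic_entropy([5, 0, 0, 0])
```

A low entropy marks a word concentrated in a few topics, which is the kind of word a keyword extractor would plausibly favour.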

Owner:10TH RES INST OF CETC

Method and system for recommending text files

Active | CN103207899A | High similarity | Recommended results are accurate | Special data processing applications | Feature vector | Text file

The invention discloses a method and a system for recommending text files. The method includes: determining a term set of a current text file and then determining a TF (term frequency) value or a TF-IDF (term frequency-inverse document frequency) value of each term in the term set; determining an implied subject feature vector of the current text file; computing the similarity between the implied subject feature vector of the current text file and the implied subject feature vector of each candidate text file; and selecting and recommending those candidate text files whose similarity to the current text file meets preset screening conditions. Because similarity among text files is computed from the implied subject feature vectors, the recommendations produced by the system are more accurate.

Owner:新浪技术(中国)有限公司

Document similarity calculating method and similar document whole-network retrieval tracking method

Inactive | CN106095737A | Accurate processing of similarity | Accurate analysis and statistics | Semantic analysis | Text database indexing | Cosine similarity | Two-vector

The invention relates to a document similarity calculating method and a similar document whole-network retrieval tracking method. The technical scheme is characterized in that the document similarity calculating method includes: S01, performing word segmentation on an original document and a target document to acquire respective word segmentation sets; S02, performing preprocessing and feature weighting: utilizing TF-IDF technology to calculate weight of each word segmentation, extracting core key words, utilizing Word2vec to dig correlation degree among different word segmentation in the documents, and performing semantic analysis on each document; S03, adopting a vector space model and a cosine similarity algorithm: utilizing a cosine value of an included angle of two vectors in vector space to evaluate similarity of the documents, wherein the cosine value is between 0 and 1, and the greater the cosine value is, the higher the similarity of the documents is. The document similarity calculating method and the similar document whole-network retrieval tracking method are suitable for news information redistribution tracking and transmissibility statistics.

Owner:HANGZHOU FANEWS TECH

Text semantic analysis method

Pending | CN109271626A | Overcome deficiencies | Improve accuracy | Semantic analysis | Special data processing applications | Object structure | Document similarity

A text semantic analysis method and system can realize semantic analysis of text data based on the lexical level and the sentence level. For semantic analysis at the lexical level, the invention first adopts an improved word segmentation algorithm to solve the problem that English words are segmented only by spaces. Second, based on the word segmentation, TF-IDF modeling is performed to obtain weight values. The text is then vectorized by weighting and summing the weight values and the word vectors trained by Word2Vec, and finally the document similarity is computed. Because the invention considers both the contribution of the vocabulary to the document content and its semantic status when calculating document similarity, the result has higher accuracy and provides a good foundation for subsequent text clustering. For sentence-level semantic analysis, the invention extracts subject-predicate-object structures based on text segmentation, part-of-speech tagging, syntactic analysis and dependency relations. The invention extracts subject-predicate-object structures from a wide range of sentence types and implements a noun expansion function, which is more consistent with manual extraction results.

Owner:BEIJING UNIV OF TECH

Unsupervised neural based hybrid model for sentiment analysis of web/mobile application using public data sources

Inactive | US20190197105A1 | Semantic analysis | Special data processing applications | Documentation | Social web

Machine training for determining sentiments in social network communications. A text document is extracted from a web site and tokenized into tokens. The tokens are input to a word to vector conversion model to generate word vectors. A term frequency inverse document frequency (TF-IDF) algorithm converts the word vectors to sentence vectors. A randomly selected subset of the sentence vectors is tagged and used to train a classifier. The classifier takes a sentence vector and predicts a sentiment associated with the sentence vector. The predicted sentiments associated with the individual sentence vectors may be combined to generate a sentiment associated with the text document.

Owner:IBM CORP

Method and device for generating text characteristic vectors based on TF-IGM, method and device for classifying texts

Active | CN104750844A | Improve performance | Overcome the problem of large deviation in weight calculation | Special data processing applications | Feature vector | Data set

The invention discloses a method and a device for generating text characteristic vectors based on TF-IGM, as well as a method and a device for classifying texts. The concentration of each characteristic word's distribution across the different text classes is calculated with an inverted gravitational moment (IGM) model, and the weights of the characteristic words are calculated from it. The resulting weights reflect the importance of the characteristic words to the text classes more realistically, which increases the performance of text classifiers. The device for generating the text characteristic vectors based on TF-IGM has several options that can be tuned according to text classification performance tests, so that it adapts to text data sets with different characteristics. Experiments on public English and Chinese corpora show that the TF-IGM method is superior to existing methods such as TF-IDF and TF-RF, and that it is particularly suited to multi-class text classification with more than two classes.
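The abstract does not spell out the IGM formula. The sketch below follows one formulation found in the TF-IGM literature, in which a term's per-class frequencies are sorted in descending order and the largest is divided by the rank-weighted sum; treat this formula, the coefficient `lam` and its default value as assumptions rather than as the patent's definition.

```python
def igm(class_freqs):
    """Inverse gravity moment of a term's per-class frequencies.

    Assumed formula: f_1 / sum(f_r * r), with frequencies sorted in
    descending order and r the 1-based rank.
    """
    ranked = sorted(class_freqs, reverse=True)
    moment = sum(f * r for r, f in enumerate(ranked, start=1))
    return ranked[0] / moment if moment else 0.0

def tf_igm(tf, class_freqs, lam=7.0):
    """Assumed weight tf * (1 + lam * igm); lam is a tunable coefficient."""
    return tf * (1.0 + lam * igm(class_freqs))

concentrated = igm([10, 0, 0])  # term occurs in one class only
uniform = igm([5, 5, 5])        # term spread evenly over three classes
```

Under this formulation a term confined to one class reaches the maximum value of 1, while an evenly spread term scores lower, which matches the abstract's description of measuring how concentrated a word is across classes.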

Owner:CENT SOUTH UNIV

Court similar case recommendation model based on word vectors and word frequencies

Pending | CN110597949A | The similarity calculation results are good | Avoid Natural Disadvantages | Text database querying | Special data processing applications | Recommendation model | Computational model

The invention discloses a court similar case recommendation model based on word vectors and word frequencies, namely a TF-W2V similarity calculation model. Judgment documents are divided into five case types: criminal, civil, execution, compensation and administrative. To process, store and query the judgment documents, the model extracts the key information from a submitted judgment and uses a Word2Vec + TF-IDF text similarity algorithm to find the most similar judgment of the same type in the document data, reporting the similarity and recommending that judgment. Based on word frequencies and word vectors, the method integrates the keywords and the word meaning information of the texts and accurately calculates the similarity of two texts. Applied to court judgments, the experimental results show that the method is simple to apply, requires no labelled training set, can be applied to texts in different fields, takes a moderate amount of computation time, gives more accurate results than a traditional method, is closer to expert evaluation results, and can calculate the similarity of court texts accurately and effectively.

Owner:HUBEI UNIV OF TECH

LDA fusion model and multilayer clustering-based news topic detection method

Inactive | CN107423337A | Good impetus | Improve clustering quality | Semantic analysis | Special data processing applications | Data dredging | Time complexity

The invention belongs to the field of data mining, natural language processing and information retrieval, and provides a news topic detection method. For the defect of a TF-IDF-based vector space algorithm in semantics and the defects of time complexity and accuracy of textual level clustering, feature extraction, representation modeling, similarity calculation and quick and accurate text clustering methods for a large amount of news texts are improved. The LDA fusion model and multilayer clustering-based news topic detection method comprises the following steps of 1: building a similarity model by using a vector space model (VSM); 2: finally obtaining accurate parameter settings; 3: organically fusing two text models; 4: judging whether a topic is a new topic or not; 5: calculating the similarity until all documents are clustered; and 6: adding an ISP&AH clustering algorithm of AHC based on the step 5. The method is mainly applied to the design and manufacturing occasions.

Owner:TIANJIN UNIV

Feature bag image retrieval method based on Hash binary code

Active | CN105469096A | Improve scalability | Improve use value | Character and pattern recognition | Special data processing applications | Bag of features | Image retrieval

The invention discloses a feature bag image retrieval method based on a Hash binary code. The method comprises steps that, a vision term list is established; tf-idf (term frequency-inverse document frequency index) weight quantification of vision terms is carried out; vision term characteristic quantification of an image is carried out; an inverted index is established; a feature binary code projection direction is learned; feature binary code quantification is carried out; candidate image sets are retrieved. According to the method, the index is established for an image database, rapid image retrieval is realized, and retrieval efficiency is improved, moreover, through a binary code learning method having the similarity retention capability, the binary code is learned from spatial distance similarity and meaning distance similarity as signature, and image retrieval accuracy is improved. The feature bag image retrieval technology based on the Hash binary code has properties of high efficiency and accuracy, and relatively high use values are realized.
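The inverted index step can be pictured with a small sketch. It assumes each image has already been quantized into a list of visual term ids, and it covers only candidate retrieval, not the tf-idf weighting or the Hash binary codes; the function names and sample data are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(image_terms):
    """Map each visual term id to the set of images that contain it."""
    index = defaultdict(set)
    for image_id, terms in image_terms.items():
        for term in terms:
            index[term].add(image_id)
    return index

def candidate_images(index, query_terms):
    """Images sharing at least one visual term with the query."""
    found = set()
    for term in query_terms:
        found |= index.get(term, set())
    return found

image_terms = {"img_a": [1, 2], "img_b": [2, 3], "img_c": [4]}
index = build_inverted_index(image_terms)
hits = candidate_images(index, [2])
```

Only the images listed under the query's visual terms need to be scored, which is where the speed-up over scanning the whole database comes from.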

Owner:NANJING UNIV
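The tf-idf weighting and inverted-index steps in the abstract above can be sketched as follows. This is an illustrative sketch, not the patented procedure: it assumes each image has already been quantized into a list of visual-word ids, the function names are hypothetical, and the binary-code signature stage is left out.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(images):
    """images: dict image_id -> list of visual-word ids (already quantized)."""
    n = len(images)
    df = Counter()
    for words in images.values():
        df.update(set(words))  # document frequency of each visual word
    index = defaultdict(list)  # visual word -> [(image_id, tf-idf weight)]
    for image_id, words in images.items():
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n / df[word])
            index[word].append((image_id, tf * idf))
    return index


def candidates(index, query_words):
    """Rank images that share at least one visual word with the query."""
    scores = defaultdict(float)
    for word in set(query_words):
        for image_id, weight in index.get(word, []):
            scores[image_id] += weight
    return sorted(scores, key=scores.get, reverse=True)
```

Only images that share a visual word with the query are touched, which is the reason an inverted index keeps retrieval fast on a large database.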

Chinese short text sentiment classification method based on fields

Inactive | CN105069021A | Improve accuracy | Special data processing applications | Text database clustering/classification | Sentence segmentation | Data set

The present invention discloses a field-based Chinese short text sentiment classification method, which includes: data preprocessing of a short text, including sentence segmentation, word segmentation, stop word filtering, and field division; construction of a field-oriented sentiment dictionary; extraction and matching of sentiment paths, extraction and polarity discrimination of candidates, and TF-IDF weight calculation of sentiment words, using the field-oriented sentiment dictionary and a corpus as the data set; sentiment feature extraction from the short text; and corpus training or discrimination of unknown sentiment types by a random forest algorithm. Experiments show that the scheme provided by the present invention has a high accuracy rate.

Owner:GUANGDONG UNIV OF PETROCHEMICAL TECH

TF-IDF feature extraction based short text classification method

Active | CN106528642A | Enhanced TF-IDF features | Improve performance | Special data processing applications | Semantic tool creation | Feature vector | Feature extraction

The invention discloses a TF-IDF feature extraction based short text classification method. According to the method, short texts are merged into a long text so as to enhance the TF-IDF features of the short texts; dimension reduction is performed so as to generate a feature word list and a feature word dictionary; while the feature word list is established, a compensation mechanism is set up for classes with relatively weak features, and the text feature vector weights are enhanced. No other word banks or word vector dictionaries have to be constructed or trained, so the algorithm performance can be improved while preserving the feature expression of the texts. The method can be widely applied to the field of data processing.

Owner:广东广业开元科技有限公司
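One way to read the merging step in the abstract above is that all short texts carrying the same class label are concatenated into one long pseudo-document before TF-IDF is computed. The sketch below assumes that reading; the per-class grouping and the function name `class_tfidf` are assumptions, and the dimension reduction and compensation steps are not shown.

```python
import math
from collections import Counter, defaultdict


def class_tfidf(labelled_texts):
    """labelled_texts: list of (label, tokens). Returns label -> {term: tf-idf}."""
    merged = defaultdict(list)
    for label, tokens in labelled_texts:
        merged[label].extend(tokens)  # concatenate the short texts of one class
    n = len(merged)
    df = Counter()
    for tokens in merged.values():
        df.update(set(tokens))  # number of merged documents containing the term
    result = {}
    for label, tokens in merged.items():
        counts = Counter(tokens)
        result[label] = {
            term: (count / len(tokens)) * math.log(n / df[term])
            for term, count in counts.items()
        }
    return result
```

Terms that occur in every merged document receive a weight of zero, so the terms that remain are those that separate one class from another.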

Problem similarity calculation method based on a plurality of features

Active | CN109344236A | Increase diversity | Improve generalization ability | Digital data information retrieval | Semantic analysis | Semantic gap | Semantic feature

The invention discloses a problem similarity calculation method based on a plurality of features, which includes the following steps. An input new question sentence is compared with the stored historical questions and their corresponding answers, and the similarities between the new question and the historical questions are calculated based on character features, semantic features of words, semantic features of sentences, implied topic features of sentences, and semantic features of answers. The final similarity is computed from the products of the above five similarities and their corresponding weights, which are trained by a linear regression method. The invention adopts a plurality of features to increase the diversity of sample attributes and improve the generalization ability of the model. At the same time, the soft cosine distance is used to fuse TF-IDF with edit distance, word semantics and other information, which overcomes the semantic gap between words and improves the accuracy of similarity calculation.

Owner:JINAN UNIVERSITY
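The soft cosine measure mentioned in the abstract above generalizes cosine similarity by letting different terms count as partially similar. A minimal sketch of the standard soft cosine formula follows; the term-similarity function `sim` is supplied by the caller and is an assumption here (it could be derived from edit distance or word embeddings), and nothing below reproduces the patent's exact fusion.

```python
import math


def soft_cosine(a, b, sim):
    """Soft cosine similarity of two weighted term vectors.

    a, b: dicts term -> weight (for example TF-IDF weights).
    sim: function (term1, term2) -> similarity in [0, 1], with sim(t, t) == 1.
    """
    def form(x, y):
        # Bilinear form: sum over all term pairs of sim * weight * weight.
        return sum(
            wx * wy * sim(tx, ty)
            for tx, wx in x.items()
            for ty, wy in y.items()
        )

    denom = math.sqrt(form(a, a)) * math.sqrt(form(b, b))
    return form(a, b) / denom if denom else 0.0
```

With a `sim` that returns 1 only for identical terms, the function reduces to ordinary cosine similarity; a softer `sim` lets two questions with no words in common still score above zero.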

Microblog data analysis based hot news prediction method and system

Active | CN105224608A | Solve the problem of early prediction | Improve practical ability | Web data indexing | Special data processing applications | Early prediction | Statistical analysis

The present invention discloses a microblog data analysis based hot news prediction method and system. The method comprises: acquiring news reports from mainstream news sites and the microblog user responses triggered by those reports; carrying out word segmentation and word frequency statistics on the microblog text, calculating the TF-IDF value of each word, and converting the values into a microblog topic described in a vector space; classifying the microblog topics, counting each quantitative index that describes the microblog topics, and calculating each hotness index of the news; and using a multivariate linear regression algorithm to learn from sample data, establishing a hot news prediction model, and determining whether later news can become hot news. The system comprises a data acquisition module, a text analysis processing module, a data statistical analysis module and a hot news prediction module. The method and system comprehensively analyze the trend of media-reported news in microblog topics to predict whether the news will become hot news in public sentiment, so that the problem of early prediction of hot news can be well solved.

Owner:SOUTH CHINA UNIV OF TECH

Content based similarity detection

Active | US20150154497A1 | Digital computer details | Machine learning | Random projection | Tf–idf

Content Based Similarity Detection. A computer implemented method includes computing a hash of each word in a collection of books to produce a numerical integer token using a reduced representation, and computing an Inverse Document Frequency (IDF) vector comprising the number of books the token appears in, for every token in the collection of books. The method also includes creating a token occurrence count vector for each book in the collection and normalizing the token occurrence count vector using the IDF vector to create a Term Frequency-Inverse Document Frequency (TF-IDF) vector. Further, the method includes reducing each TF-IDF vector by using random projections to obtain a final signature representing each book in the collection and, using a trained machine learning algorithm, determining whether each of the list of candidate books is similar to the target book.

Owner:RAKUTEN KOBO
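The pipeline in the abstract above (hash words to integer tokens, weight counts by IDF, shrink the vector with random projections) can be sketched with the standard library alone. The details below are assumptions for illustration: CRC32 as the hash, 4096 buckets, Gaussian projection directions and a sign-based 16-value signature are not stated in the source, and the trained similarity classifier is not included.

```python
import math
import random
import zlib
from collections import Counter

BUCKETS = 1 << 12  # assumed size of the reduced token space


def tokens(text):
    """Hash each word to an integer token in a reduced range."""
    return [zlib.crc32(w.encode("utf-8")) % BUCKETS for w in text.lower().split()]


def signatures(books, dims=16, seed=0):
    """books: dict title -> text. Returns title -> list of +1/-1 values."""
    tokenized = {title: tokens(text) for title, text in books.items()}
    df = Counter()
    for toks in tokenized.values():
        df.update(set(toks))  # number of books each token appears in
    n = len(books)
    rng = random.Random(seed)
    projections = {}  # token -> its row of the random projection matrix
    result = {}
    for title, toks in tokenized.items():
        acc = [0.0] * dims
        for tok, count in Counter(toks).items():
            weight = count * math.log(n / df[tok])  # TF-IDF weight
            if tok not in projections:
                projections[tok] = [rng.gauss(0.0, 1.0) for _ in range(dims)]
            for d, r in enumerate(projections[tok]):
                acc[d] += weight * r
        result[title] = [1 if x >= 0 else -1 for x in acc]
    return result
```

Books with the same token distribution receive the same signature, and similar books tend to agree in most positions, so signatures can be compared cheaply before any heavier model is applied.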

Public opinion hot word finding method based on keyword weighting algorithm

Inactive | CN107153658A | Improve accuracy | Guaranteed real-time | Web data indexing | Natural language data processing | Pattern recognition | Algorithm

The invention discloses a hot word finding method, and particularly relates to a public opinion hot word finding method based on a keyword weighting algorithm. A Chinese word segmentation tool performs preliminary word segmentation and part-of-speech tagging on massive public opinion information. An IDF table, a filter word table and a part-of-speech weight table are then combined, and a candidate word popularity value is calculated according to a weighted TF-IDF algorithm. The calculation does not rely on word frequency alone; it also takes into account the information carried by a word's part of speech, position and the like, which provides a reliable basis for hot word recognition. In addition, the method takes into account that public opinion in the we-media era has distinct topics and themes, so corpus processing is conducted mainly per public opinion topic, which addresses the efficiency of hot word recognition under massive public opinion information. Finally, the IDF table is updated dynamically and incrementally, which keeps the inverse document frequency of words up to date and improves the accuracy of hot word recognition.

Owner:CHANGZHOU PUSHI INFORMATION TECH +1
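The weighted TF-IDF score and the incrementally updated IDF table described in the abstract above might look roughly like the sketch below. Everything specific is an assumption: the part-of-speech weights, the smoothed IDF formula and the names `IdfTable` and `popularity` are hypothetical, and position weighting is omitted.

```python
import math
from collections import Counter

# Hypothetical part-of-speech weights; the source gives no actual values.
POS_WEIGHT = {"n": 1.0, "v": 0.6, "adj": 0.4}
DEFAULT_POS_WEIGHT = 0.2


class IdfTable:
    """Document-frequency table that can be updated one document at a time."""

    def __init__(self):
        self.doc_count = 0
        self.df = Counter()

    def update(self, doc_terms):
        self.doc_count += 1
        self.df.update(set(doc_terms))

    def idf(self, term):
        # Smoothed so that unseen terms still get a finite value.
        return math.log((1 + self.doc_count) / (1 + self.df[term])) + 1.0


def popularity(tagged_terms, idf_table, filter_words=frozenset()):
    """tagged_terms: list of (term, pos) pairs from a segmenter and tagger."""
    kept = [(t, p) for t, p in tagged_terms if t not in filter_words]
    counts = Counter(t for t, _ in kept)
    pos_of = dict(kept)
    total = len(kept)
    return {
        term: (count / total)
        * idf_table.idf(term)
        * POS_WEIGHT.get(pos_of[term], DEFAULT_POS_WEIGHT)
        for term, count in counts.items()
    }
```

Because `IdfTable.update` only increments counters, new documents can be folded in as they arrive without recomputing the table from scratch.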

Method and device for spam filtering based on short text

Active | CN103441924A | Easy to compare | Reduce the possibility | Office automation | Data switching networks | Hash function | Tf–idf

The invention discloses a method for spam filtering based on a short text. The method for spam filtering based on the short text comprises the following steps that word segmentation is conducted on the text of each email and word segmentation results are obtained; sequencing is conducted on the word segmentation results through the TF-IDF technology, so that a word segmentation list is obtained; an email fingerprint of each email is calculated according to the word segmentation results; clustering processing is conducted on the emails according to the email fingerprints, and a clustering result is obtained; spam filtering is conducted according to the clustering result. The invention further discloses a device for spam filtering based on the short text. By the adoption of the method and device for spam filtering based on the short text, word segmentation and TF-IDF technology sequencing can be conducted on the texts of the emails, and noise filtering is achieved; according to the length of the text of each email, the email fingerprint of each email is calculated through one or more BKDR hash functions, and the function of the word segmentation result can be effectively enhanced; clustering processing can be conducted on the emails through similarity comparison of the fingerprints by means of normalization processing, and therefore spam filtering is achieved.

Owner:LUNKR TECH GUANGZHOU CO LTD
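The fingerprinting step in the abstract above relies on BKDR hash functions, a simple multiplicative string hash. The sketch below shows one common form of that hash (multiplier 131, 31-bit mask) and a hypothetical way to turn TF-IDF-ranked words into a fingerprint; the `top_k` cut-off, the sorting of the selected words and the exact-match grouping are assumptions, not the patent's procedure.

```python
from collections import defaultdict


def bkdr_hash(text, seed=131):
    """BKDR string hash over UTF-8 bytes, kept within 31 bits."""
    h = 0
    for byte in text.encode("utf-8"):
        h = (h * seed + byte) & 0x7FFFFFFF
    return h


def fingerprint(ranked_words, top_k=5):
    """ranked_words: words of one email, already sorted by TF-IDF weight."""
    # Sorting the selected words makes the fingerprint order-independent.
    return bkdr_hash("".join(sorted(ranked_words[:top_k])))


def group_by_fingerprint(emails):
    """emails: dict email_id -> ranked word list. Returns fingerprint -> ids."""
    groups = defaultdict(list)
    for email_id, words in emails.items():
        groups[fingerprint(words)].append(email_id)
    return dict(groups)
```

Large groups of emails sharing one fingerprint are natural spam candidates, since bulk mail tends to repeat the same high-weight words.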

Text clustering multi-document automatic abstracting method and system for improving word vector model

Pending | CN110413986A | Get efficiently | Express semantic coherence | Character and pattern recognition | Neural architectures | Scale model | Problem of time

The invention discloses a text clustering multi-document automatic abstracting method and a system for improving a word vector model. The CBOW of the Hierarchical Softmax belongs to the field of large-scale model training. Based on the method, a TensorFlow deep learning framework is introduced into word vector model training; the problem of time efficiency of a large-scale training set is solved through streaming processing calculation; TF-IDF is introduced first during sentence vector representation, then the semantic similarity of a semantic unit to be extracted is calculated, weighting parameters are set for comprehensive consideration, and a semantic weighted sentence vector is generated. The beneficial effects are as follows: the advantages and disadvantages of semantics, deep learning and machine learning are comprehensively considered; density clustering and convolutional neural network algorithms are applied; and the degree of intelligence is high. According to the method, statements highly relevant to the central content can be quickly extracted to serve as the abstract of the text, and various machine learning algorithms are applied to automatic text abstracting to achieve a better abstract effect. The method is possibly the main future research direction in the field. In addition, the system according to the invention supplies a tool for automatic extraction of a document abstract based on the method.

Owner:上海晏鼠计算机技术股份有限公司

Method of classifying web text information sentiments

Inactive | CN106202372A | Accurate discovery | Special data processing applications | Text database clustering/classification | Classification rule | Document preparation

The invention discloses a method of classifying web text information sentiments, comprising the following steps: 1, judging whether a document is news; if yes, extracting only the title for sentiment classification, and if not, carrying out sentiment classification on the whole document; 2, preprocessing the document to be classified; 3, classifying the document according to text length: for a document longer than 140 characters, calculating feature weights by using TF-IDF (term frequency-inverse document frequency) and carrying out classification by using a trained logistic regression classifier; and for a document of 140 characters or fewer, carrying out classification by using manual sentiment classification rules. Compared with the prior art, the method has the advantages that a technical route combining a machine learning classifier with classification features formulated by field experts is constructed according to the different features of long and short texts, and that related reactionary information, sensitive information and negative information in online public opinion can be found in a timely manner.

Owner:CHINA ELECTRONICS TECH CYBER SECURITY CO LTD
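The routing logic of the abstract above can be sketched as a small dispatcher. It assumes that the rule-based branch handles texts of 140 characters or fewer and the trained classifier handles longer ones; the function name `route` and the two callables passed in are placeholders, not components named in the source.

```python
def route(document, title, is_news, long_classifier, rule_classifier, limit=140):
    """Pick the text to classify, then dispatch on its length.

    long_classifier: callable for long texts (for example TF-IDF features
        fed to a trained logistic regression model); assumed to exist.
    rule_classifier: callable applying hand-written sentiment rules.
    limit: assumed length boundary in characters.
    """
    text = title if is_news else document  # news is judged by its title only
    if len(text) > limit:
        return long_classifier(text)
    return rule_classifier(text)
```

Keeping the two classifiers behind plain callables means either branch can be swapped out without touching the dispatch logic.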

Personalized search method for Web service recommendation

Active | CN102819575A | Improve accuracy | Improve relevance | Special data processing applications | Personalized search | Web service

The invention discloses a personalized search method for Web service recommendation. The personalized search method comprises the following steps: 1, preprocessing a WSDL (Web Services Description Language) file, i.e., forming a bag of words through two preprocessing steps of removing stop words and extracting stems; 2, extracting user interest, i.e., calculating the weight of each word in the bag of words by using an improved TF-IDF (Term Frequency-Inverse Document Frequency) formula, and multiplying it by a time decay factor of the word to obtain a new weight; selecting the top k words by descending weight as the interest words of a user and, together with the corresponding weight of each word, forming a k-dimensional user interest vector; 3, calculating interest similarity, i.e., setting a similarity threshold and selecting the users whose interest similarity exceeds the threshold as neighbor users of a target user; and 4, ordering service search results, i.e., calculating a recommended predicted value of each service according to the similarity of the neighbor users and the frequency with which those users select the service, and arranging the search results in descending order of the recommended predicted value, thereby obtaining the personalized search result.

Owner:十方健康管理(江苏)有限公司 +1
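Step 2 of the abstract above multiplies each word's TF-IDF weight by a time decay factor and keeps the top k words as the user interest vector. The sketch below assumes an exponential half-life decay, which the source does not specify; the function name, the 30-day half-life and the input layout are all illustrative.

```python
def interest_vector(term_weights, days_since_use, k=3, half_life=30.0):
    """Build a top-k interest vector from decayed TF-IDF weights.

    term_weights: term -> TF-IDF weight.
    days_since_use: term -> days since the user last used the term.
    half_life: assumed number of days after which a weight halves.
    """
    decayed = {
        term: weight * 0.5 ** (days_since_use.get(term, 0.0) / half_life)
        for term, weight in term_weights.items()
    }
    top = sorted(decayed.items(), key=lambda kv: kv[1], reverse=True)[:k]
    return dict(top)
```

A term used long ago can fall out of the vector even if its raw TF-IDF weight is the highest, which is the point of the decay factor.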

Method and system for advertisement recommendation based microblog

Active | CN103617230A | Real-time | Good effect | Marketing | Special data processing applications | Feature vector | Data dredging

The invention belongs to the field of data mining and provides a method and system for advertisement recommendation based on microblogs. The method comprises the following steps: microblog data are read; the microblog data are initialized and a microblog text lexical item set is obtained; stop words are deleted from the microblog text lexical item set and a microblog text original feature lexical item set is obtained; the microblog text original feature lexical item set is mapped against a feature lexical item dictionary, whether each lexical item in the original feature set exists in the dictionary is judged, and the tf-idf values of the lexical items that appear are calculated and serve as their feature values; whether each lexical item of the dictionary exists in the original feature set is judged, and the feature values of the lexical items that do not appear are set to zero; the feature vectors formed from the calculated feature values are automatically classified into predefined classes; and advertisements are recommended to a user according to the automatic classification result. The advertisements recommended by the method and system are accurate and effective.

Owner:SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI
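The dictionary-mapping step in the abstract above produces a fixed-length feature vector: one slot per dictionary term, holding the tf-idf value if the term occurs in the microblog and zero otherwise. A minimal sketch, assuming the tf-idf values have already been computed elsewhere and using hypothetical names:

```python
def feature_vector(doc_terms, dictionary, tfidf):
    """Map one microblog onto the feature dictionary.

    doc_terms: terms of the microblog after stop-word removal.
    dictionary: ordered list of feature terms (fixes the vector layout).
    tfidf: term -> precomputed tf-idf value for this microblog.
    """
    present = set(doc_terms)
    return [tfidf.get(term, 0.0) if term in present else 0.0 for term in dictionary]
```

Terms outside the dictionary are ignored, so every microblog yields a vector of the same length that a downstream classifier can consume.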


© 2025 PatSnap. All rights reserved.