277 results about "Tf–idf" patented technology

In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling. The tf–idf value increases proportionally to the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which helps to adjust for the fact that some words appear more frequently in general. tf–idf is one of the most popular term-weighting schemes today; 83% of text-based recommender systems in digital libraries use tf–idf.
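
The definition above can be illustrated with a minimal sketch. It assumes raw counts for term frequency and `log(N / df)` for inverse document frequency, which is only one of several common variants. The function name `tf_idf` and the toy corpus are illustrative and not taken from any of the patents listed here.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = raw count and idf = log(N / df); other variants exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
scores = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its weight is zero, while "mat" occurs in only one document and receives the highest weight in the first document.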

Realization method and system for electronic medical record post-structuring and auxiliary diagnosis

Inactive | CN106383853A | Good effect | Special data processing applications | Data set | Jaro–Winkler distance
The invention relates to a realization method and system for electronic medical record post-structuring and auxiliary diagnosis. Several distance measures are combined: the string edit distance is the minimum number of replacement, insertion and deletion operations required to convert one character string into another; the Jaro-Winkler distance measures the similarity between two character strings and is used for duplicate record detection; and the geometric mean of a Chinese character distance and a Chinese character input method is adopted as a comprehensive similarity measure between characteristic texts. Characteristic terms are ranked using the TF-IDF method, which assesses the importance of a characteristic term relative to a document in a file set or corpus: the importance is directly proportional to the term's frequency in the document and inversely proportional to the number of documents in the corpus that contain it. According to the generated characteristic terms, the files are converted into the file format used for PU learning with a positive example data set and an unlabelled data set, and through PU learning the system automatically recommends related diagnoses for clinical medical personnel to consult.
Owner:刘勇
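
The string edit distance described in this abstract can be sketched with the standard dynamic-programming (Levenshtein) formulation. This is a generic textbook version, not the patent's own implementation, and the function name `edit_distance` is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of replacements, insertions and deletions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` is 3: two replacements and one insertion.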

Natural language processing-based multi-language analysis method and device

The invention discloses a natural language processing-based multi-language analysis method and device. The method comprises the following steps: determining the language category of the input natural language text information through a trained language detection model; obtaining computer-recognizable word embedding representations of the corresponding words through a trained word vector model, and extracting keywords from the obtained word embedding representations by means of TF-IDF; calculating an article vector and a category vector for each preset category according to the keywords and keyword weights, and calculating the similarity between the article and each preset category so as to determine the text classification result of the natural language text information; and inputting the word embedding representations of the natural language text information into a trained text sentiment analysis model that runs a convolutional neural network and a bidirectional gated recurrent unit in parallel, and obtaining a final sentiment tendency value through calculation. The method and device solve the problem that traditional multi-language analysis methods require domain knowledge of the relevant linguistics and a great deal of manual work.
Owner:北京百分点科技集团股份有限公司

Patent literature similarity measurement method based on ontology

The invention relates to a patent literature similarity measurement method based on ontology, and relates to the technical field of natural language information processing for the ontology. The method comprises the following steps: extracting a core technical scheme according to the structural features, the position features and the keyword features of patent literature; constructing a model of the relations between the thematic terms of patent classes; constructing a field dictionary according to that model, then segmenting the core technical scheme into terms and removing stop terms; extracting keywords and weights by combining the relations between the thematic terms with TF-IDF to form the initial TextRank term weights; training a FastText model and generating term vectors; and calculating an EMD (earth mover's distance) from the keywords, term weights and term vectors to obtain a semantic distance. Compared with the prior art, the method solves the problem that similarity measurement is poor because the structural features, the field features, the term relation features and the semantically similar expressions of patent literature are not fully considered.
Owner:BEIJING INSTITUTE OF TECHNOLOGY

Network hot event detection method based on text classification and clustering analysis

The invention discloses a network hot event detection method based on text classification and clustering analysis. The method solves the problem that the efficiency and accuracy rate of the existing network hot event detection method based on clustering analysis need to be improved. The method comprises the steps that feature words are respectively selected for various classes of files through feature extraction and feature selection by utilizing a training corpus; each training text and test text are represented as vectors in all of the feature spaces by utilizing a vector space model method, and the weight of each dimension of the vectors is determined by utilizing a TF-IDF (term frequency-inverse document frequency) method, and then each test text is classified; the classified test texts in different classes are respectively subjected to clustering analysis, so the hot cluster of each class is obtained, the feature word representing the hot event is obtained through further analysis, and then the word property and other aspects of each feature word are analyzed; the description of each hot event is generated by utilizing relevant language knowledge and necessary linguistic organization. With the network hot event detection method based on text classification and clustering analysis, the detection efficiency and accuracy rate of hot events can be effectively improved.
Owner:NANJING UNIV OF POSTS & TELECOMM

Automatic microblog text abstracting method based on unsupervised key bigram extraction

The invention discloses an automatic microblog text abstracting method based on unsupervised key bigram extraction. The method comprises the steps of preprocessing the microblogs; standardizing the bigrams; extracting key bigrams based on a mixture of TF-IDF (term frequency-inverse document frequency), TextRank and LDA (latent Dirichlet allocation); ranking sentences based on intersection similarity and a mutual information strategy; extracting abstract sentences based on a similarity threshold; and generating the abstract by combining the abstract sentences appropriately. The method uses the bigram as the minimum vocabulary unit; because a bigram carries richer text information than a single word, sentence selection based on key bigrams is more robust to noise and more accurate than selection based on keyword extraction. In addition, a similarity threshold is introduced to control redundancy when the abstract sentences are extracted, so that the abstract achieves a higher recall rate. The abstract generated by the method is accurate, concise and comprehensive; it noticeably improves the efficiency and quality with which a user acquires knowledge and saves the user considerable time.
Owner:INST OF AUTOMATION CHINESE ACAD OF SCI

Method for filtering Chinese junk mail based on Logistic regression

The invention discloses a method for filtering Chinese junk e-mail based on logistic regression. The method comprises the following steps: first, parsing the e-mails and extracting the e-mail titles, e-mail bodies and attachment-related information; second, segmenting the extracted text information into words; third, counting the frequencies of the terms in the e-mails, calculating word weights using TF-IDF, and representing each e-mail as a weighted feature vector; fourth, using the LIBLINEAR toolkit to train on the e-mail samples to obtain a logistic regression model; fifth, using the logistic regression model to classify new e-mails and obtain the probability that each is junk e-mail. The logistic regression model has the advantages of a simple model, few parameters, and high classification accuracy on data sets in which both the number of texts and the number of features are large. The accuracy and efficiency of junk e-mail filtering are improved through dimension reduction and an improved feature value calculation method, and the problem of choosing model training parameters in junk e-mail filtering is effectively solved.
Owner:ZHEJIANG UNIV

Topic feature text keyword extraction method

The invention discloses a topic feature text keyword extraction method. Through the method, text keyword extraction results better than those of a traditional TF-IDF method can be obtained. According to the technical scheme, at a training stage, word segmentation, stop word removal, part-of-speech filtering and other preprocessing are performed on a training text, and statistical analysis is performed on the inverse document frequency of words; meanwhile a topic model method is utilized to learn a topic probability matrix of the words, normalization processing is performed, the topic distribution entropy of the words is calculated according to the topic probability matrix, global weights of the words are calculated by combining the inverse document frequency and the topic distribution entropy, and the global weight calculation results are output to a test stage. After a test text is preprocessed, statistical analysis is performed on the normalized term frequency of the words in the test text, the normalized term frequency is combined with the global weight calculation results obtained at the training stage, comprehensive scores of the words are calculated and ordered, and the words with the highest scores are used as the automatic keyword extraction results for the current test text.
Owner:10TH RES INST OF CETC
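
The two-stage scoring in this abstract can be sketched as follows. The abstract does not give the formula that combines inverse document frequency with topic distribution entropy, so `global_weight` below is a hypothetical choice (IDF divided by one plus the entropy, so that words concentrated in few topics score higher). All function names and toy values are illustrative assumptions.

```python
import math
from collections import Counter

def topic_entropy(topic_probs):
    """Shannon entropy (in nats) of a word's normalized topic distribution."""
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

def global_weight(idf, entropy):
    # Hypothetical combination, NOT the patent's formula:
    # a lower topic entropy (more topic-specific word) gives a higher weight.
    return idf / (1.0 + entropy)

def rank_keywords(tokens, idf, topic_probs, top_n=3):
    """Score = normalized term frequency * global weight; return the top_n words."""
    counts = Counter(tokens)
    total = len(tokens)
    scores = {
        word: (count / total) * global_weight(idf[word], topic_entropy(topic_probs[word]))
        for word, count in counts.items()
        if word in idf and word in topic_probs  # skip words unseen in training
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy training-stage outputs (made-up values).
idf = {"radar": 2.0, "signal": 1.0, "system": 1.0}
topic_probs = {"radar": [0.9, 0.1], "signal": [0.5, 0.5], "system": [0.5, 0.5]}
keywords = rank_keywords(["radar", "signal", "radar", "system"], idf, topic_probs, top_n=1)
```

Here "radar" is both frequent in the test text and concentrated in one topic, so it ranks first.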

Court similar case recommendation model based on word vectors and word frequencies

Pending | CN110597949A | The similarity calculation results are good | Avoid Natural Disadvantages | Text database querying | Special data processing applications | Recommendation model | Computational model
The invention discloses a court similar case recommendation model based on word vectors and word frequencies, namely a TF-W2V similarity calculation model. The judgment documents are divided into five case types: criminal affairs, civil affairs, execution, compensation and administrative affairs. In order to process, store and query the judgment documents, the model extracts the key information from a submitted judgment and finds the judgment with the highest similarity among judgments of the same type in the document data by adopting a Word2Vec + TF-IDF text similarity algorithm, then reports the similarity and recommends the judgment. Based on word frequencies and word vectors, the method integrates the keywords and the word meaning information of the texts and accurately calculates the similarity of two texts. The method is applied to court judgments for similarity calculation, and the experimental results show that it is simple to apply, does not require a labelled training set, can be applied to texts in different fields, takes a moderate amount of time to compute, gives more accurate results than a traditional method, is closer to expert evaluation results, and can calculate the similarity of court texts accurately and effectively.
Owner:HUBEI UNIV OF TECH
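
One common way to combine word vectors with TF-IDF, and a plausible reading of the "Word2Vec + TF-IDF" similarity above, is a TF-IDF-weighted average of word vectors compared by cosine similarity. The sketch below assumes that reading; the abstract does not spell out the exact combination. The two-dimensional embeddings and the weights are hand-made toy values, not trained Word2Vec output.

```python
import math

def weighted_doc_vector(tokens, embeddings, tfidf):
    """Average the word vectors of tokens, weighting each by its TF-IDF value."""
    dim = len(next(iter(embeddings.values())))
    vec = [0.0] * dim
    total = 0.0
    for tok in tokens:
        if tok in embeddings and tok in tfidf:
            w = tfidf[tok]
            total += w
            for i, x in enumerate(embeddings[tok]):
                vec[i] += w * x
    return [v / total for v in vec] if total else vec

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy embeddings and TF-IDF weights (hypothetical values).
embeddings = {
    "theft": [1.0, 0.0], "robbery": [0.9, 0.1],
    "contract": [0.0, 1.0], "dispute": [0.1, 0.9],
}
tfidf = {"theft": 2.0, "robbery": 1.8, "contract": 1.5, "dispute": 1.2}

case_a = weighted_doc_vector(["theft", "robbery"], embeddings, tfidf)
case_b = weighted_doc_vector(["robbery", "theft", "theft"], embeddings, tfidf)
case_c = weighted_doc_vector(["contract", "dispute"], embeddings, tfidf)
```

With these toy values, `cosine(case_a, case_b)` is much larger than `cosine(case_a, case_c)`, so the second criminal case would be recommended ahead of the civil one.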

Microblog data analysis based hot news prediction method and system

The present invention discloses a microblog data analysis based hot news prediction method and system. The method comprises: acquiring news reports from mainstream news sites and the microblog user responses that the news reports trigger; carrying out word segmentation and word frequency statistics on each microblog text, calculating the TF-IDF value of each word, and converting the text into a microblog topic described in a vector space; classifying the microblog topics, counting each quantitative index that describes the microblog topics, and calculating each hotness index of the news; and using a multivariate linear regression algorithm to learn from sample data, establishing a hot news prediction model, and determining whether a subsequent news item will become hot news. The system comprises a data acquisition module, a text analysis processing module, a data statistical analysis module and a hot news prediction module. According to the method and system disclosed by the present invention, the trend of media-reported news in microblog topics is comprehensively analyzed to predict whether the news will become hot news in public sentiment, so that the problem of early prediction of hot news can be well solved.
Owner:SOUTH CHINA UNIV OF TECH

Public opinion hot word finding method based on keyword weighting algorithm

The invention discloses a hot word finding method, and particularly relates to a public opinion hot word finding method based on a keyword weighting algorithm. According to the method, a Chinese word segmentation tool is used to perform preliminary word segmentation and part-of-speech tagging on massive public opinion information; an IDF table, a filter word table and a part-of-speech weight table are then combined, and a candidate word popularity value is calculated according to a weighted TF-IDF algorithm. The calculation does not rely on word frequency alone but takes full account of the information carried by a word's part of speech, position and so on, which provides a reliable basis for hot word recognition. In addition, the method takes full account of the fact that public opinion in the we-media era has distinct topics and themes: corpus processing is conducted mainly by public opinion topic, which addresses the efficiency of hot word recognition under massive public opinion information. Finally, the IDF table is updated dynamically and incrementally, which keeps the inverse document frequencies current and improves the accuracy of hot word recognition.
Owner:CHANGZHOU PUSHI INFORMATION TECH +1

Method and device for spam filtering based on short text

The invention discloses a method for spam filtering based on short text. The method comprises the following steps: word segmentation is conducted on the text of each email to obtain word segmentation results; the word segmentation results are ranked using TF-IDF to obtain a word segmentation list; an email fingerprint is calculated for each email according to the word segmentation results; clustering is conducted on the emails according to the email fingerprints to obtain a clustering result; and spam filtering is conducted according to the clustering result. The invention further discloses a device for spam filtering based on short text. With the method and device, the email texts are segmented and ranked by TF-IDF, which filters out noise; depending on the length of each email's text, its fingerprint is calculated through one or more BKDR hash functions, which makes effective use of the word segmentation results; and after normalization the emails are clustered by comparing the similarity of their fingerprints, which achieves spam filtering.
Owner:LUNKR TECH GUANGZHOU CO LTD
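
The fingerprinting step can be sketched with the well-known BKDR string hash (multiply by a seed such as 31 or 131, then add the character code). How the patent chooses the number of hash functions from the text length, and how fingerprints are compared during clustering, is not stated in the abstract. The `email_fingerprint` helper, its `top_k` parameter and the fixed seed pair below are therefore illustrative assumptions.

```python
def bkdr_hash(text: str, seed: int = 131) -> int:
    """BKDR string hash, truncated to 31 bits."""
    h = 0
    for ch in text:
        h = (h * seed + ord(ch)) & 0x7FFFFFFF
    return h

def email_fingerprint(ranked_terms, top_k=3, seeds=(31, 131)):
    """Hash the top_k TF-IDF-ranked terms once per seed (hypothetical scheme)."""
    key = "|".join(ranked_terms[:top_k])
    return tuple(bkdr_hash(key, seed) for seed in seeds)

# Two spam variants that differ only in a low-ranked term share a fingerprint.
fp1 = email_fingerprint(["free", "prize", "click", "now"])
fp2 = email_fingerprint(["free", "prize", "click", "today"])
```

Because only the highest-ranked terms are hashed, low-weight noise words do not change the fingerprint, which matches the noise-filtering role the abstract gives to the TF-IDF ranking.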

Text clustering multi-document automatic abstracting method and system for improving word vector model

The invention discloses a text clustering multi-document automatic abstracting method and system for improving a word vector model. CBOW with Hierarchical Softmax belongs to the field of large-scale model training. Based on this method, the TensorFlow deep learning framework is introduced into word vector model training, and the time-efficiency problem of large-scale training sets is solved through stream processing. For sentence vector representation, TF-IDF is introduced first, then the semantic similarity of each semantic unit to be extracted is calculated, and weighting parameters are set for comprehensive consideration to generate a semantically weighted sentence vector. The beneficial effects are as follows: the advantages and disadvantages of semantics, deep learning and machine learning are comprehensively considered; density clustering and convolutional neural network algorithms are applied; the degree of intelligence is high; and sentences highly relevant to the central content can be quickly extracted to serve as the abstract of the text. Applying various machine learning algorithms to automatic text abstracting achieves a better abstract effect, and the method is possibly a main future research direction in the field. In addition, the system according to the invention supplies a tool for automatic extraction of a document abstract based on the method.
Owner:上海晏鼠计算机技术股份有限公司

Personalized search method for Web service recommendation

The invention discloses a personalized search method for Web service recommendation. The personalized search method comprises the following steps: 1, preprocessing a WSDL (Web Services Description Language) file, i.e., forming a bag of words through two preprocessing steps of removing stop words and extracting stems; 2, extracting user interest, i.e., calculating the weight of each word in the bag of words by using an improved TF-IDF (Term Frequency-Inverse Document Frequency) formula and multiplying it by a time decay factor of the word to obtain a new weight, then selecting the top k words in descending order of weight as the interest words of a user and, together with the corresponding weight of each word, forming a k-dimension user interest vector; 3, calculating interest similarity, i.e., setting a similarity threshold and selecting the users whose interest similarity exceeds the threshold as neighbor users of a target user; and 4, ordering service search results, i.e., calculating a recommended predicted value for each service according to the similarity of the neighbor users and the frequency with which those users select the service, and arranging the search results in descending order of the recommended predicted value, thereby obtaining the personalized search result.
Owner:十方健康管理(江苏)有限公司 +1
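
Step 2 of this abstract (weight times time decay, then top k) can be sketched as below. The abstract specifies neither the improved TF-IDF formula nor the form of the time decay factor, so this sketch takes precomputed weights as input and assumes an exponential decay `exp(-decay_rate * age_in_days)`. The names `interest_vector`, `decay_rate` and `last_used_days` are hypothetical.

```python
import math

def interest_vector(weights, last_used_days, k=3, decay_rate=0.1):
    """Return the top-k (word, decayed weight) pairs, largest weight first.

    Assumes an exponential time decay; the patent's actual factor is not given.
    """
    decayed = {
        word: weight * math.exp(-decay_rate * last_used_days.get(word, 0.0))
        for word, weight in weights.items()
    }
    return sorted(decayed.items(), key=lambda item: item[1], reverse=True)[:k]

# Toy TF-IDF weights and days since each word last appeared in the user's history.
weights = {"weather": 1.0, "map": 0.8, "stock": 0.9}
last_used_days = {"weather": 30, "map": 0, "stock": 1}
top_two = interest_vector(weights, last_used_days, k=2)
```

Although "weather" has the largest raw weight, it was last used 30 days ago, so after decay the interest vector keeps "stock" and "map" instead.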

Method and system for advertisement recommendation based on microblog

The invention belongs to the field of data mining and provides a method and system for advertisement recommendation based on microblog. The method comprises the following steps: microblog data are read; the microblog data are initialized and a microblog text lexical item set is obtained; stop words are deleted from the microblog text lexical item set and a microblog text original feature lexical item set is obtained; the microblog text original feature lexical item set is mapped against a feature lexical item dictionary, it is judged whether the lexical items in the original feature lexical item set exist in the dictionary, and the tf-idf values of the lexical items that appear are calculated and serve as their feature values; it is judged whether the lexical items of the dictionary exist in the original feature lexical item set, and the feature values of the lexical items that do not appear are set to zero; the feature vectors formed from the calculated feature values are automatically classified into predefined classes; and advertisements are recommended to a user according to the automatic classification result. The advertisements recommended by the method and system are accurate and the effect is good.
Owner:SHENZHEN INST OF ADVANCED TECH CHINESE ACAD OF SCI
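
The dictionary-mapping step in this abstract can be sketched as a fixed-length feature vector: one slot per dictionary term, holding a tf-idf value when the term occurs in the microblog and zero otherwise. The sketch assumes tf is the term's share of the microblog's tokens and that idf values are precomputed. The function name and the toy dictionary are illustrative.

```python
from collections import Counter

def feature_vector(tokens, dictionary, idf):
    """One value per dictionary term: tf * idf if the term occurs, else 0.0."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty texts
    return [
        (counts[term] / total) * idf[term] if counts[term] else 0.0
        for term in dictionary
    ]

# Toy feature dictionary with precomputed idf values (made-up numbers).
dictionary = ["phone", "travel", "food"]
idf = {"phone": 1.5, "travel": 2.0, "food": 1.0}
vector = feature_vector(["new", "phone", "phone", "sale"], dictionary, idf)
```

Tokens that are not in the dictionary ("new", "sale") contribute no dimension of their own, and the resulting vector can be passed to any classifier trained on the predefined advertisement classes.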