A method for assessing the semantic similarity of documents is proposed. It is based on latent semantic analysis (LSA), the dynamics of change in the singular values of the term-document matrix, and automatic determination of a range of rank values. Semantic similarity assessment is considered as applied to identifying duplication and contradictions in databases and data stores.
A brief review is provided of the approaches used to assess the semantic similarity of documents and to identify duplication and contradictions in databases. Numerical examples are given of assessing semantic dependencies between document terms for the purpose of identifying duplication and contradictions in databases and data stores. In these examples, the degree of correspondence between the compared documents is calculated as the resulting characteristic.
Comparative estimates are given of the accuracy with which the degree of correspondence λ between documents is calculated by the main methods: the cosine similarity measure, the vector space model, the Spearman rank correlation coefficient, and the statistical measure tf-idf (term frequency, inverse document frequency).
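As an illustration of two of the baseline measures named above, the following minimal sketch computes tf-idf vectors and their cosine similarity for tokenized documents. The function names, the particular tf-idf weighting (relative term frequency times the natural logarithm of the inverse document frequency), and the toy documents are assumptions made for the example and are not taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, tf-idf vectors) for a list of non-empty tokenized documents.

    Assumed weighting: tf = count / document length, idf = ln(N / df).
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy data (hypothetical): two duplicate records and one unrelated record.
docs = [
    "customer record stored in database".split(),
    "customer record stored in database".split(),
    "fuzzy inference rule base".split(),
]
vocab, vecs = tfidf_vectors(docs)
sim_dup = cosine(vecs[0], vecs[1])   # duplicates share every term
sim_diff = cosine(vecs[0], vecs[2])  # no terms in common
```

In this sketch a pair of exact duplicates scores close to 1 and a pair with no shared terms scores 0; documents that express the same meaning with different words would also score low, which is the limitation that latent semantic analysis is intended to address.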
It is shown that the proposed latent semantic analysis method with automatic detection of a range of rank values removes the dependence of latent semantic analysis results on the chosen rank.
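The idea of evaluating similarity over a range of ranks rather than a single chosen rank can be sketched as follows. The abstract does not state the rule by which the range is derived from the singular values, so the criterion used here (ranks whose cumulative squared singular values cover between 50% and 90% of the total) is a hypothetical stand-in, as are the function names and the toy term-document matrix.

```python
import numpy as np

def lsa_similarity(A, i, j, k):
    """Cosine similarity of documents i and j in a rank-k LSA space.

    A is a term-document matrix (rows: terms, columns: documents).
    """
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Document coordinates in the latent space: columns of diag(s_k) @ Vt_k.
    D = s[:k, None] * Vt[:k, :]
    a, b = D[:, i], D[:, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Treat numerically zero vectors as having no similarity.
    return float(a @ b / denom) if denom > 1e-12 else 0.0

def rank_range(s, low=0.5, high=0.9):
    """Hypothetical rule (not from the paper): ranks at which the cumulative
    share of squared singular values first reaches `low` and `high`."""
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k_min = int(np.searchsorted(energy, low)) + 1
    k_max = min(int(np.searchsorted(energy, high)) + 1, len(s))
    return k_min, k_max

def lsa_similarity_over_range(A, i, j, low=0.5, high=0.9):
    """Average the LSA similarity of documents i and j over the rank range."""
    s = np.linalg.svd(A, compute_uv=False)
    k_min, k_max = rank_range(s, low, high)
    sims = [lsa_similarity(A, i, j, k) for k in range(k_min, k_max + 1)]
    return (k_min, k_max), float(np.mean(sims))

# Toy matrix (hypothetical): documents 0 and 1 are duplicates, document 2
# uses entirely different terms.
A = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
])
ranks, sim_dup = lsa_similarity_over_range(A, 0, 1)
_, sim_diff = lsa_similarity_over_range(A, 0, 2)
```

Averaging over the range is only one possible way to combine the per-rank results; the point of the sketch is that the reported similarity no longer hinges on a single manually selected rank.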
We propose an approach to the automatic categorization of text documents based on the joint application of latent semantic analysis (LSA) and the Mamdani fuzzy inference algorithm. LSA is used for the semantic analysis of information in electronic document management systems: it identifies semantic relationships between document terms and yields the degree of correspondence between the compared vectors. A rule base is proposed for the Mamdani fuzzy inference algorithm that implements automatic categorization of documents over a set of given topics. Based on the results of latent semantic analysis, it enables automated monitoring of the distribution of documents that are not relevant to the specified topics or that are similar to several thematic categories.
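A minimal sketch of Mamdani-style inference applied to LSA similarity scores is given below. The abstract does not reproduce the proposed rule base, so the three rules, the triangular membership functions, the 0.6 decision threshold, and all names in the code are hypothetical and serve only to show the mechanics: min implication, max aggregation, and centroid defuzzification.

```python
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b.

    a == b or b == c gives a left or right shoulder.
    """
    x = np.asarray(x, dtype=float)
    left = (x - a) / (b - a) if b > a else np.ones_like(x)
    right = (c - x) / (c - b) if c > b else np.ones_like(x)
    return np.clip(np.minimum(left, right), 0.0, 1.0)

# Hypothetical fuzzy sets on [0, 1], shared by the input and the output.
SETS = {"low": (0.0, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "high": (0.5, 1.0, 1.0)}
# Hypothetical rule base: IF similarity is X THEN relevance is Y.
RULES = [("low", "low"), ("medium", "medium"), ("high", "high")]

def mamdani_relevance(similarity, n_points=1001):
    """Map an LSA similarity in [0, 1] to a crisp topic relevance in [0, 1]."""
    y = np.linspace(0.0, 1.0, n_points)
    aggregated = np.zeros_like(y)
    for in_set, out_set in RULES:
        strength = float(tri(similarity, *SETS[in_set]))          # rule firing strength
        clipped = np.minimum(strength, tri(y, *SETS[out_set]))    # min implication
        aggregated = np.maximum(aggregated, clipped)              # max aggregation
    total = aggregated.sum()
    # Centroid defuzzification on the sampled output universe.
    return float((y * aggregated).sum() / total) if total > 0 else 0.5

def categorize(similarities, threshold=0.6):
    """similarities: dict mapping topic name to LSA similarity.

    Returns a label and the topics whose relevance reaches the threshold.
    """
    matched = [t for t, s in similarities.items() if mamdani_relevance(s) >= threshold]
    if not matched:
        return "off-topic", matched
    if len(matched) > 1:
        return "multi-topic", matched
    return "single-topic", matched

label, topics = categorize({"finance": 0.9, "law": 0.1})
```

The "off-topic" and "multi-topic" labels correspond to the two monitored cases mentioned in the abstract: documents that match none of the specified topics and documents that are similar to several categories at once.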