Ranking text search results according to the purity of documents, calculated by clustering the document set.
Ranking algorithm based on purity of documents
This article describes a technique for ranking text search results according to the purity of documents, which is calculated by clustering the document set.
1. Cluster the text document set into subsets using a clustering algorithm such as Latent Semantic Analysis/Indexing (LSA/LSI) or Latent Dirichlet Allocation (LDA).
2. Represent every document as a vector in a vector space. In this vector space, each cluster is regarded as a basis vector of the space. The projection of a document's vector onto a basis vector is the score of that document in the corresponding cluster.
3. Store the document vectors in the search index as document metadata.
4. The user inputs search keywords. These search keywords are regarded as a document, and their representing vector is evaluated in the same way.
5. Calculate the inner product of the search-keyword vector and the vector of every document in the index. Present the search results in descending order of this inner product value, so that documents with large inner product values are listed at the top of the search results.
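The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation described in this article: the table TERM_CLUSTER_WEIGHTS is a hypothetical stand-in for the output of an LSA/LSI or LDA model, the function names are invented for this example, and normalizing each vector so its components sum to 1 is an assumption made here so that documents concentrated in a single cluster score higher.

```python
from collections import Counter

# Hypothetical term weights per cluster, standing in for a trained
# LSA/LSI or LDA model (assumption: a real system would learn these).
TERM_CLUSTER_WEIGHTS = {
    "python": (0.9, 0.1),
    "code": (0.8, 0.2),
    "snake": (0.1, 0.9),
    "zoo": (0.0, 1.0),
}
NUM_CLUSTERS = 2


def to_cluster_vector(text):
    """Step 2: project a text onto the cluster axes (one score per cluster)."""
    counts = Counter(text.lower().split())
    vec = [0.0] * NUM_CLUSTERS
    for term, n in counts.items():
        weights = TERM_CLUSTER_WEIGHTS.get(term)
        if weights is None:
            continue  # terms unknown to the model contribute nothing
        for k in range(NUM_CLUSTERS):
            vec[k] += n * weights[k]
    total = sum(vec)
    # Assumption: normalize so the components sum to 1.
    return [v / total for v in vec] if total else vec


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def build_index(docs):
    """Step 3: store each document's vector alongside its id."""
    return {doc_id: to_cluster_vector(text) for doc_id, text in docs.items()}


def search(index, query):
    """Steps 4 and 5: treat the query as a document, rank by inner product."""
    q = to_cluster_vector(query)
    scored = [(inner_product(q, vec), doc_id) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # largest score first
    return [(doc_id, score) for score, doc_id in scored]


docs = {"a": "python code", "b": "snake zoo", "c": "python snake"}
index = build_index(docs)
for doc_id, score in search(index, "python code"):
    print(doc_id, round(score, 3))
```

In this toy data, document "a" lies almost entirely in the first cluster, "b" in the second, and "c" is split evenly between them, so a query that falls in the first cluster ranks the concentrated document "a" ahead of the mixed document "c".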