Page 01 of 3
A M E T H O D F O R A L P H A B E T I C O R D E R I N G O F P R E F I X - B A S E D
S E A R C H R E S U L T S
Author: Sergei Suslov
A prefix-based search engine (SE) must support queries that match only the beginning of a word (e.g., prefix) and not necessarily the entire word. For example, for the movie "Catching Out," the search engine could match any of the following potential user submitted queries:
a) "c"
b) "ca"
c) "cat"
d) "catc"
e) "catch"
f) "catchi"
g) "catchin"
h) "catching"
As a result of this requirement, the SE must devise the most optimal way for:
1. Usage of space
2. Usage of time when a query is submitted. In other words, return the response to the user in the fastest way possible.
It is challenging to optimize the space and timing usage when the SE has to manage:
1. A very large document set (search index)
2. A very large results set for a vague query such as "ca" which will return many (1,000 or more) results.
The basic problem is that gathering all of the data at once for calculating the ordering of results at query time for thousands of results is very inefficient and time consuming. Also, storing all of the data needed for calculating relevance rank in the same place as where the index is stored is also very inefficient because the space used up by the index will grow to be very large and index traversal at query time will be very slow.
Typically, search engines will store all the data required for calculating alphabetic ordering in one place in the search index. Thus, at query time, the search engine will retrieve all the data points for a document at once and calculate the ordering at one time. P a g e | 1 1701 John F. Kennedy Blvd., Philadelphia, PA 19103
Page 02 of 3
The two principal drawbacks of this conventional solution are:
1. Time inefficiency: attaching all of the data points to a potential result at one point is less efficient than dividing the alphabetic ranking process into multiple parts where only the data required for each part is attached to the potential result (document). This is especially the case for very large results set spanning thousands of documents.
2. Space inefficiency: storing all of the data required for alphabetic ranking in the same part of the search index as what is used for traversing the words in the index looking for matches with the query is very inefficient. This part of the search index will balloon in size, and the matching of the query to the words in the documents will take much longer.
The proposed method addresses both the time inefficiency and the space inefficiency of conventional solutions by:
1. Dividing the storage required data for alphabetic rank into different sections of the search index
2. At query time, calculating the alphabetic rank in multiple phases, thereby returning the results to the user as fast as possible.
The search index may be divided into 2 parts:
1. Prefix Index: The part which has a listing of words and word prefixes from all the docu...