Text Search Query Syntax Details
The information on this page assumes the reader is familiar with the use of a boolean and matching query syntax. Such syntaxes are fairly common within specialized full-text search engines. This summary generally describes the operators, indicators, and modifiers which are available when forming queries. In addition to this overview, some examples may offer additional insights.
| Syntax | Explanation |
|---|---|
| * | Asterisk; the multiple character wildcard indicator. When used within or at the end of a word, the asterisk indicates zero or more of any character may be matched. A leading (or lone) asterisk is invalid. See additional information in "Regarding Lexicographic Analysis and Wildcards" below. |
| ? | Question mark; the single character wildcard indicator. When used within or at the end of a word, the question mark indicates any single character may be matched. A leading (or lone) question mark is invalid. See additional information in "Regarding Lexicographic Analysis and Wildcards" below. |
| AND | The AND logical operator. Words or phrases may be joined together with the operator AND, instructing the engine that the documents on either side of the AND should be merged together into a result containing only documents common to each side. AND is the default operator; in other words, a space between two words outside of a phrase is interpreted to mean AND. Queries should not begin or end with an AND. The AND operator is case-insensitive; to search for the literal word and outside of a phrase, surround it in double quotes. |
| OR | The OR logical operator. Words or phrases may be joined together with the operator OR, instructing the engine that the documents on either side of the OR should be merged together into a result containing the documents found on either side. Queries should not begin or end with an OR. The OR operator is case-insensitive; to search for the literal word or outside of a phrase, surround it in double quotes. |
| NOT | The NOT logical operator. Words or phrases may be joined together with the operator NOT, instructing the engine that the documents on the left side of the NOT should be compared with the documents on the right side and merged into a result containing only documents unique to the left side. Queries should not begin or end with a NOT. The NOT operator is case-insensitive; to search for the literal word not outside of a phrase, surround it in double quotes. |
| : | Colon; the field restriction indicator. The word before a colon is interpreted as the syntax code for a field; see the list of available fields for details. The word, phrase, or expression following the colon is restricted to being searched in the field named before the colon. For example, the query TI:"computer network" restricts searching for the phrase computer network to the Title field. When no field restriction is applied, the ALL field (which includes all searchable information) is used. Field restrictions are only valid when using expert search (e.g., on the Expert tab of the Search form); this syntax is invalid when the entry methodology enables field restrictions based on the text box being used. |
| ^ | Caret; the relative weight modifier. A word may have a caret followed by a whole number appended to it (e.g., word^5) to indicate that the word should be "weighted" more heavily when results are scored for sorting. The higher the value, the more the word is considered "more important" than other words in the query. |
| " " | Double quotes; the phrase indicator. Instead of a single word, a phrase may be included in a query by surrounding two or more words with double quotes (e.g., "intellectual property"). The words inside the quotes should not contain any other operators, indicators, or modifiers; no modifiers or indicators are recognized inside a phrase. (While it is not recommended, if you must embed a double quote inside a phrase, use \".) When locating each word in a phrase, stemming rules are applied if appropriate. The desired proximity of the words in the phrase to one another may be controlled using the phrase proximity modifier (~). The desired relative weight of the phrase may be established by appending its modifier as well (^). |
| ~ | Tilde; the fuzzy search or phrase proximity modifier. A word may have a tilde optionally followed by a numeric value between 0 and 1 appended to it (e.g., word~ or word~0.8) to indicate a "fuzzy" search. Such a search implements a form of the Levenshtein distance algorithm to determine variations that are allowed when resolving the word lookup. Values closer to 1 require a closer match (a shorter allowed distance). The default value is 0.5. Additionally, a phrase may have a tilde followed by a whole number appended to it (e.g., "intellectual property"~5). In this case, the value indicates the number of words allowed between the terms in the phrase. |
| ( ) | Parentheses; the order of operations indicator. Words or phrases that are coupled together using AND, OR, or NOT may be further grouped with parentheses to force a particular order of evaluation. When in doubt, use parentheses to clarify the intent of the logic. |
| [ TO ] | Brackets; the inclusive range indicator. It is possible to query for a range of words using this special operator. For example, the query [Aida TO Carmen] selects all documents containing words between and including "Aida" through "Carmen." The "TO" separator is case insensitive. This indicator is particularly useful when restricting a result to a date range; see the information regarding searching date fields. |
| { TO } | Braces; the exclusive range indicator. It is possible to query for a range of words using this special operator. For example, the query {Aida TO Carmen} selects all documents containing words between, but not including, "Aida" and "Carmen." The "TO" separator is case insensitive. |
| \ | Backslash; the literal modifier. It is sometimes desirable to include characters which are used as part of the query syntax in the words themselves. This is accomplished by preceding the special indicator or operator character with a backslash. The special characters which require escaping are: * ? : ^ " ~ ( ) [ ] { } \ |
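
When queries are assembled by a program rather than typed by hand, the backslash rule can be applied mechanically. The following is a minimal sketch, assuming the list of special characters given in the table is complete; the helper name `escape_term` is hypothetical and not part of the search engine.

```python
# Characters the syntax table lists as requiring a backslash escape.
SPECIAL_CHARS = set('*?:^"~()[]{}\\')


def escape_term(text: str) -> str:
    """Prefix every special query character in `text` with a backslash.

    Hypothetical helper for illustration; not part of the search engine.
    """
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


print(escape_term("C:drive"))  # C\:drive
print(escape_term("(x)"))      # \(x\)
```

Escaping does not protect the words AND, OR, and NOT; as the table notes, those are searched literally by surrounding them in double quotes.
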
After a query is analyzed and determined valid, the system then orders the way it will be executed based on precedence rules. Somewhat simplified, the precedence rules are:
Parentheses always override the default precedence rules. Note, however, that the precedence rules still apply inside a parenthesized portion of the query if needed.
For example, the query:
network AND computer OR TI:neural NOT machine
is interpreted in the following manner:
((network AND computer) OR ((TI:neural) NOT machine))
This contrasts with systems which use a right-to-left rule.
In such systems, the query would be interpreted as
(network AND (computer OR (TI:(neural NOT machine))))
which has quite a different meaning and is not the way this system works.
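
One way to see why the grouping matters is to treat each term as a set of document IDs, with AND as intersection, OR as union, and NOT as set difference. The sketch below uses invented document IDs, and models the field restriction simply as separate hypothetical sets for the Title field; it illustrates the two groupings and is not how the engine is implemented.

```python
# Hypothetical posting sets: the document IDs matching each term.
network = {1, 2, 3, 4}
computer = {2, 3, 5}
ti_neural = {4, 5, 6}   # "neural" restricted to the Title field
machine = {5, 7}        # "machine" in the default ALL field
ti_machine = {5}        # "machine" restricted to the Title field

# ((network AND computer) OR ((TI:neural) NOT machine))
this_system = (network & computer) | (ti_neural - machine)

# (network AND (computer OR (TI:(neural NOT machine))))
right_to_left = network & (computer | (ti_neural - ti_machine))

print(sorted(this_system))    # [2, 3, 4, 6]
print(sorted(right_to_left))  # [2, 3, 4]
```

With these sample sets, document 6 is returned only under the first grouping, which is why adding explicit parentheses is the safest way to express intent.
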
When a word is resolved into a list of documents during the search process, this is accomplished by trying to find the word and its common morphological variants within the search indexes. This process is commonly referred to as "stemming;" generally, the Library's search engine implements this function automatically.
Under certain circumstances, automatic stemming is disabled. This is true when searching fields that contain names (such as the "People" fields in the patent collections as well as the "Author" field in the NPL collections). In addition, stemming is disabled when a search word contains wildcards.
The Library's engine uses an inflectional stemmer based on the algorithm pioneered by Krovetz at UMASS in 1993 (see this reference for some detail).
Regarding Lexicographic Analysis and Wildcards
When a word is entered as part of a query, it is first lexicographically analyzed. This analysis may cause the word to be interpreted as multiple words. For example, the word e-mail is interpreted as the result of a merge of the words "e," "mail," "email," and "e-mail." (This merge may also include automatic stemming results, such as "mailed" or "mailing," as well as appropriate stopword suppression.) Such behavior is very useful in retrieving relevant results for most words.
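
The e-mail example can be sketched as a small expansion function. This is only an illustration of the idea: the function name `expand_hyphenated` is hypothetical, the engine's actual analyzer rules are not documented on this page, and stemming and stopword handling are left out.

```python
def expand_hyphenated(word: str) -> list[str]:
    """Return the variants a hyphenated word might be expanded into.

    Hypothetical helper: it yields each part, the parts joined together,
    and the original word, mirroring the e-mail example in the text.
    """
    parts = [p for p in word.split("-") if p]
    if len(parts) < 2:
        return [word]
    variants = parts + ["".join(parts), word]
    # Drop duplicates while preserving order.
    return list(dict.fromkeys(variants))


print(expand_hyphenated("e-mail"))  # ['e', 'mail', 'email', 'e-mail']
```
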
When a word contains a "wildcard" (that is, an asterisk and/or a question mark), automatic stemming is disabled for the word and the word is interpreted literally. This means there is little or no lexicographical analysis. For example, the word e-mail* will search only for words which begin with the literal e-mail (including the hyphen). Such a word lookup is unlikely to return results in most circumstances.
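
The literal interpretation of wildcard words can be illustrated by translating a wildcard term into a regular expression. This is a sketch under stated assumptions: the function name `wildcard_to_regex` is hypothetical, case handling is ignored, and the engine itself may resolve wildcards quite differently.

```python
import re


def wildcard_to_regex(term: str) -> re.Pattern:
    """Translate a wildcard term into a regular expression (illustrative only)."""
    if not term or term[0] in "*?":
        raise ValueError("a leading (or lone) wildcard is invalid")
    pieces = []
    for ch in term:
        if ch == "*":
            pieces.append(".*")           # zero or more of any character
        elif ch == "?":
            pieces.append(".")            # exactly one character
        else:
            pieces.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(pieces))


pattern = wildcard_to_regex("e-mail*")
print(bool(pattern.fullmatch("e-mails")))  # True
print(bool(pattern.fullmatch("email")))    # False
```

Because the hyphen is matched literally, a word indexed as "email" is not found by e-mail*, which is the behavior described above.
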
Remember, word stemming and stopword suppression are disabled in fields where entire literal words may carry important meaning. Examples of fields where all words are important are those containing people or company names. The lexicographic analysis for these fields may be adjusted as well to accommodate their more precise nature.