Distributed Digital Library of
Chant Manuscript Images
A list of online images and metadata edited by the NEUMES Project.
 
QUERY HELP

§ Overview
§ The Result Set
§ Filters
§ About Lucene
§ Synonyms
§ Stop Words
§ Accent Marks
§ Uppercase or Lowercase
§ Excluded Data
§ Boolean operators
   • OR
   • AND
   • NOT
   • plus sign ('+')
   • minus sign ('-')
§ Use of Parentheses
§ Term Modifiers
   • Wildcard Searches
   • 'Fuzzy' Searches
   • Proximity Searches
   • Boosting Term-Relevance
§ Escaping Special Characters
Credits:
Authors: Louis W. G. Barton and Debra Lacoste.
This page includes materials adapted from the Lucene documentation on 'Query Syntax'.

Overview

The Distributed Digital Library of Chant Manuscript Images is an annotated index of images on the Web, maintained by the NEUMES Project [see, NEUMES website]. This index contains metadata about the manuscripts shown in the images, including: type of neume notation; date and place of origin; basic information about the chants; and so on. The images themselves are located on many servers across the Web. The index includes hyperlinks to the images at the locations where they are stored.

The user-interface of this index has two main parts: (1) the top part has controls for the Lucene full-text search engine; (2) the bottom part has filters that generate an SQL query against the relational database containing the metadata. Remark: database filters and a Lucene query may be combined in one search, or they may be used separately. When the 'Query' box is blank, then no Lucene query is included.

Metadata search.
Figure 1. Top part of the 'Search' page.

The following two sections discuss Result Sets and the filters; the remainder of this Help page is about how to use the Lucene search engine. For a quick start, you can read just the sections on Boolean operators and Parentheses.


The Result Set

The results of a search are called a "result set". It is a list of online chant-manuscript images, whose metadata in this index match your search criteria. Each "record" in a result set includes a short summary of the index's metadata about this image.

The main purpose of a result set is to provide hyperlinks to source images. Click on an image hyperlink to open the image in a new window. [Remark: you might need to resize the new window in order to view the full image.] Manuscript images in this "distributed digital library" are stored on many different servers across the Web, but some of the images are located on the Scribe server of the NEUMES Project.

If the website hosting the image has a metadata page about the image, then a hyperlink to this is included in the result-set record. If one or more NEUMES/NeumesXML transcriptions exist for the chants shown in the image, then each transcription is listed in a separate block below the result-set record; click a transcription hyperlink to see its "diplomatic facsimile" visualization in a new window. Each result-set record also has a "Detail record" hyperlink: click on it to view this index's metadata about the image, which includes a small excerpt from the source image.


Filters

The bottom part of the Search screen has several control-options that may be used separately or together with a Lucene query. These options are largely self-explanatory, but a few remarks may be helpful.

  • Discipline and Notation selectors: as the user-interface currently operates, if you select a Notation, then you must also select the corresponding Discipline (otherwise the Notation selection doesn't work).
  • Date: enter year numbers only, and the first number must be less than or equal to the second number.
  • Collation: if you sort the results by Relevance, then please see the additional remarks under Boosting Term-Relevance, below.
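Remark: the Date rule above can be pictured with a short Python sketch. The function name `parse_year_range` and the wording of the error messages are hypothetical; this is not the code the index uses, only an illustration of "year numbers only, first less than or equal to second".

```python
def parse_year_range(first, second):
    """Check the two Date fields and return them as a pair of integers."""
    first, second = first.strip(), second.strip()
    # Year numbers only: reject anything that is not made of decimal digits.
    if not (first.isdecimal() and second.isdecimal()):
        raise ValueError("enter year numbers only")
    start, end = int(first), int(second)
    # The first number must be less than or equal to the second number.
    if start > end:
        raise ValueError("the first year must not be later than the second")
    return start, end
```

For instance, `parse_year_range("900", "1100")` is accepted, whereas `parse_year_range("1100", "900")` and `parse_year_range("ca. 900", "1100")` are rejected.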

The rest of this Help page is just about how to formulate Lucene queries.


About Lucene

Lucene is a highly customizable, open-source, Java-based, full-text indexing and search engine. Several of the query features discussed here are customizations of Lucene made by the NEUMES Project. The queries you write can be simple or quite complex, depending on your needs and your level of skill.


Synonyms

There is a check-box to the right of the 'Query' box where you can select or de-select 'Use synonyms'. The index maintains a list of synonym words, for example: "circa"; "ca"; "approximately"; "around"; etc. When you select 'Use synonyms', then the search engine will match any synonym word from your query with any synonym word in the metadata.

When 'Use synonyms' is selected, you can defeat synonym matching for particular words in your query by putting the word(s) inside quotation marks: "...".
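Remark: the following Python sketch illustrates the idea of synonym matching and of defeating it with quotation marks. It is an assumption-laden model, not the index's implementation: the name `expand_query` is hypothetical, and the single synonym group contains only the example words listed above.

```python
import re

# Hypothetical synonym group, built from the example words in this section.
SYNONYM_GROUPS = [{"circa", "ca", "approximately", "around"}]

def expand_query(query, use_synonyms=True):
    """Return one set per query term: the words that this term may match."""
    expanded = []
    # Each token is either a "quoted phrase" or a bare word.
    for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', query):
        if quoted:
            # Quotation marks defeat synonym matching for this term.
            expanded.append({quoted.lower()})
            continue
        word = bare.lower()
        group = None
        if use_synonyms:
            group = next((g for g in SYNONYM_GROUPS if word in g), None)
        expanded.append(set(group) if group else {word})
    return expanded
```

With 'Use synonyms' selected, the term `circa` in `circa 1050` expands to the whole group; written as `"circa" 1050`, it matches only itself.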

To view the current list of synonym words for this index, click here.


Stop Words

Commonly occurring words (normally called "stop words") are automatically excluded from Lucene queries. These are words that generally are not significant, or that probably do not differentiate one set of metadata from another. Examples of stop words are "the" and "is".

You can override the default behavior by placing the word(s) inside quotation marks: "...".
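Remark: as a rough illustration, the Python sketch below drops stop words from a query unless they are inside quotation marks. The name `significant_terms` is hypothetical, and the stop-word list holds only the two examples given above; the real list for this index is longer.

```python
import re

# Only the two example stop words named in this section (an assumption).
STOP_WORDS = {"the", "is"}

def significant_terms(query):
    """Return the query terms that remain after stop words are removed."""
    terms = []
    for quoted, bare in re.findall(r'"([^"]+)"|(\S+)', query):
        if quoted:
            # Quoted words are kept even if they are stop words.
            terms.append(quoted.lower())
        elif bare.lower() not in STOP_WORDS:
            terms.append(bare.lower())
    return terms
```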

To view the current list of stop words for this index, click here.


Accent Marks

Accented characters (i.e., letters of the Latin alphabet with diacritical marks) are treated as equivalent to un-accented characters in Lucene queries. This is mainly to make searching easier, especially for English-speaking users, who usually do not have accented characters on their computer keyboards. Examples of accented characters are à and ü.

You can override this default behavior by placing the word(s) inside quotation marks: "...".
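Remark: one common way to treat accented and un-accented letters as equivalent is to decompose each character and discard the combining marks. The Python sketch below shows that technique using the standard `unicodedata` module. It is an assumption that this resembles what the index does; the index uses its own replacement list, which may differ for some characters.

```python
import unicodedata

def fold_accents(text):
    """Remove diacritical marks by canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a non-zero combining class; drop them.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Under this sketch, `fold_accents("à")` gives `"a"` and `fold_accents("ü")` gives `"u"`.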

To view the current list of accented-character replacements for this index, click here.


Uppercase or Lowercase

In order to simplify searches, and for consistent results, queries and index metadata are automatically converted to lowercase letters during word-matching operations.

Remark: This rule does not apply to Boolean operators, which must be typed in uppercase letters only.


Excluded Data

Some data of this index are excluded from searching by Lucene query. That is to say, you will not find results by searching on these data. Excluded are the record numbers which the index uses internally, the URL (or, Web address) of source images, data fields that have "Yes/No" values, and so forth.


Boolean operators

Boolean operators allow search terms to be combined to make complex queries. The Boolean operators permitted are:

  • OR    (a [space] character acts the same as 'OR')
  • AND
  • NOT
  • +    (the plus-sign)
  • -    (the minus-sign)

You must type the Boolean operators in ALL CAPITAL letters.


OR

The OR operator is the default "conjunction" operator. This means that if there is no Boolean operator between two terms, the OR operator is used. The OR operator finds records where either or both terms exist in the metadata. Two vertical bar characters ("||") or a [space] can be used instead of the word "OR".

To search for image metadata that contain "Haec dies" or "Easter", use the query:

"Haec dies" Easter

or,

"Haec dies" OR Easter

Remark: Enclose multiple words in quotation marks (e.g.: "Haec dies") if they must appear together in the metadata (viz., to prevent matching on either word individually).

Remark: The "OR" operator must be typed in ALL CAPITAL letters.


AND

The AND operator matches records where both terms exist anywhere in the metadata. Two ampersand characters ("&&") may be used instead of the word "AND".

To search for records that contain "Advent" and "first Sunday", use the query:

Advent AND "first Sunday"

Remark: The "AND" operator must be typed in ALL CAPITAL letters.


NOT

The NOT operator excludes results that contain the term after NOT. An exclamation mark ('!') may be used in place of the word "NOT".

To search for records that contain "Haec dies" but not "France" use the query:

"Haec dies" NOT France

Remark: The NOT operator cannot be used with just one term. For example, the following search will return no results:

NOT "Haec dies"

Remark: The "NOT" operator must be typed in ALL CAPITAL letters.


+

The plus sign ('+') or "required" operator requires that the term after the '+' character must exist somewhere in the metadata of a result.

To search for image metadata that must contain "gradual" and may contain "Haec dies", use the query:

+gradual "Haec dies"

-

The minus sign ('-') or "prohibit" operator excludes results that contain the term after the '-' character.

To search for records that contain "Haec dies" but not "Dextera domini" use the query:

"Haec dies" -"Dextera domini"

Use of Parentheses

Use parentheses, "(...)", to group clauses in the same way these are used in mathematical expressions. Parentheses override the normal order in which terms are evaluated. This can be quite useful in controlling or limiting the result set.

Example: to search for "St Gall" and either "Haec dies" or "Easter", you can type:

("Haec dies" OR Easter) AND "St Gall"

The parentheses ensure that "St Gall" must exist in the metadata, plus either "Haec dies" or "Easter".

Remark: There is no priority given to any of the operators (AND, OR, NOT); the search engine treats each in succession. Therefore, very different results are shown by the queries "Easter OR (France AND Germany)" and "Easter OR France AND Germany", for example.
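Remark: to see why the two queries differ, the Python sketch below models the statement above: operators are applied strictly in succession, left to right, and only parentheses change the grouping. The function `evaluate` is hypothetical and is a model of that rule only, not of Lucene's actual query parser. A record is represented simply as a collection of terms, and NOT is read as "and not".

```python
import re

def evaluate(query, record_terms):
    """Evaluate a Boolean query left to right, with no operator priority."""
    tokens = re.findall(r'\(|\)|"[^"]+"|[^\s()]+', query)
    record = {t.lower() for t in record_terms}
    pos = 0

    def operand():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = expression()
            pos += 1  # skip the closing ")"
            return value
        return token.strip('"').lower() in record

    def expression():
        nonlocal pos
        value = operand()
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] in ("AND", "OR", "NOT"):
                op = tokens[pos]
                pos += 1
            else:
                op = "OR"  # a space acts the same as OR
            right = operand()
            if op == "AND":
                value = value and right
            elif op == "OR":
                value = value or right
            else:
                value = value and not right
        return value

    return expression()
```

For a record containing "Easter" and "France" but not "Germany", the query `Easter OR (France AND Germany)` matches, because "Easter" alone is enough. The query `Easter OR France AND Germany` does not match, because it is read as `(Easter OR France) AND Germany`.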


Term Modifiers

For more complex queries, term modifiers provide additional flexibility and power.


Wildcard Searches

You can perform both single and multiple character wildcard searches.

For a single character wildcard search, use a question mark ('?') for any one character, and the search engine will return records where there is a match with all characters other than the one indicated.

For example, to search for "antiphoner" (English) or "antiphonar" (German) you can use the query:

antiphon?r

Multiple character wildcard searches look for 0 or more characters. Use an asterisk ('*') at the position where more or different characters may occur. For example, to search for gradual, graduals, graduale, gradualia, and so on, you can use the query:

gradual*

You can also use multiple wildcard searches in the middle of a term. For example,

h*c

will return records with both "Haec" and "Hec".

Remark: An asterisk ('*') or question mark ('?') is not permitted as the first character of a query.

Remark: Unlike most other types of Lucene queries, wildcard queries are case-sensitive.
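Remark: the wildcard rules can be illustrated by translating a wildcard term into a regular expression. The Python sketch below is only a model of the rules stated in this section; the names `wildcard_to_regex` and `matches` are hypothetical, and Lucene does not necessarily implement wildcards this way. The comparison is case-sensitive, as the preceding remark says, so the examples use lowercase words.

```python
import re

def wildcard_to_regex(pattern):
    """Translate '?' (exactly one character) and '*' (zero or more)."""
    if pattern[:1] in ("*", "?"):
        raise ValueError("a wildcard is not permitted as the first character")
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def matches(pattern, word):
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None
```

Here `matches("antiphon?r", "antiphoner")` and `matches("antiphon?r", "antiphonar")` are both true, while `matches("antiphon?r", "antiphonr")` is false because '?' stands for exactly one character.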


'Fuzzy' Searches

The Lucene engine also supports so-called "fuzzy" searches, based on the Levenshtein Distance or Edit Distance algorithm. To do a fuzzy search, use a tilde character ('~') at the end of a term. For example, to search for a term similar in spelling to "roam" use the fuzzy search:

roam~

This search will find terms like "foam" and "roams."

Remark: Unlike most other types of Lucene queries, "fuzzy" queries are case-sensitive.
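Remark: the Levenshtein (edit) distance between two words is the smallest number of single-character insertions, deletions, and substitutions needed to turn one word into the other. The Python sketch below computes it with the standard dynamic-programming method. How Lucene turns this distance into a match threshold is not covered here, and the function name `levenshtein` is only illustrative.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]
```

Both "foam" and "roams" are at distance 1 from "roam" (one substitution and one insertion, respectively), which is why a fuzzy search on `roam~` can find them.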


Proximity Searches

Lucene supports finding words that are within a specific distance of each other. To do a proximity search, use the tilde ('~') symbol and a number (the distance) at the end of a phrase enclosed in quotation marks. For example, to search for "Aquitanian" and "one" (meaning, for one drypoint line) within ten words of each other in a single record, use the query:

"Aquitanian one"~10

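Remark: the idea of a proximity search can be sketched in Python as a check on word positions. This is a simplification under stated assumptions: the function `within_distance` is hypothetical, words are split on whitespace only, and distance is counted as the difference between word positions. Lucene's own way of counting distance is more detailed.

```python
def within_distance(text, first, second, distance):
    """True if the two words occur within `distance` positions of each other."""
    words = text.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first.lower()]
    second_positions = [i for i, w in enumerate(words) if w == second.lower()]
    return any(
        abs(i - j) <= distance
        for i in first_positions
        for j in second_positions
    )
```

In the made-up text "Aquitanian neumes on one drypoint line", the words "Aquitanian" and "one" are three positions apart, so they satisfy a distance of 10 but not a distance of 2.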
Boosting Term-Relevance

Remark: as of 31 December 2006, this feature is not working correctly; it is under development.

In the bottom part of the Search screen, you can choose "Relevance" as the Collation selection (viz., "Sort results by ..."). This will display the result set sorted by relevance score. You can control the relevance-scoring to some degree by "boosting" particular term(s) in your query.

You can specify the level of importance given to particular terms of your query by "term boosting". To boost the relevance of a term, use the caret character ('^') with an optional boost-factor (which must be an integer number) at the end of the term. The higher the boost-factor, the more relevant the term will be.

For example, if you are searching for

gradual France

and you want the term "gradual" to be more relevant, you can boost it using the '^' character along with the boost-factor right after the term. Thus, you would type:

gradual^4 France

This will make results with the term "gradual" be ranked higher in their relevance scores; in other words, they will appear closer to the top of the result set.

You can also boost whole phrases, for example:

"Haec dies"^4 France

The default boost-factor is 1. Although the boost-factor must be positive, it can be less than 1 (e.g., 0.2).


Escaping Special Characters

In order to include characters in your queries which the search programme utilizes in the query syntax, use "\" before the character. The current list of special characters includes:

+ - && || ! ( ) { } [ ] ^ " ~ * ? : \

For example, to search for "(olim40)" use the query:

\(olim40\)
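Remark: if you build queries in a program, the escaping rule can be applied automatically. The Python sketch below puts a backslash before every character from the list above. The function name `escape_query_text` is hypothetical. As a conservative assumption, every '&' and '|' is escaped individually, even though only the doubled forems '&&' and '||' are listed as special.

```python
# Characters taken from the list of special characters in this section.
SPECIAL_CHARACTERS = set('+-&|!(){}[]^"~*?:\\')

def escape_query_text(text):
    """Put a backslash before each character used by the query syntax."""
    return "".join(
        "\\" + ch if ch in SPECIAL_CHARACTERS else ch
        for ch in text
    )
```

Applied to the text `(olim40)`, this produces `\(olim40\)`, the same query shown in the example above.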



goto: The NEUMES Project homepage Neumed and Ekphonetic Universal Manuscript Encoding Standard  

Copyright © 2006-2007, The University of Oxford. Contains software or other intellectual property copyright © 2003-2005, Louis W. G. Barton; copyright © 2002-2003, The President and Fellows of Harvard College; and/or copyright © 1995-2001, Louis W. G. Barton.