The Neume Notation Project

Prof. Louis W. G. Barton

St. Anne's College
University of Oxford



Context-Free Grammar for the Data Representation

§ Overview
§ Compatibility with SGML and XML
§ Assertion of semantic equivalence across notational families
§ Motivation for a Context-Free Grammar
§ The Proposed CFG


§ Overview

To a computer scientist, a representation has two principal aspects: the semantics (or, meaning) encoded by the representation; and the possible inferences (or, logic) that these semantics entail. This essay focuses on the semantics of a specific data representation, but, in passing, some mention is made of 'hooks' in the data representation that will bear upon inferences. (I have given substantial thought to the inference problem for this representation and hope that a subsequent essay can address that problem directly.)

The particular branch of semantics we are concerned with is usually called denotational semantics. One symbol is assigned in the representation to each distinct semantic notion in the domain of interest. (Obviously, designing a semantic data representation requires classifying the semantic notions in the domain.) During processing of a data stream in the representation, valuation functions interpret the semantics; from the perspective of a processing program, the semantics of the data are equivalent to the valuation functions.

The semantics of a data representation deal both with the meanings of individual elements of the representation, and the meanings of interrelationships between elements in a group. Much of my discussion relies on the reader's understanding of how semantic data representations are normally handled in software engineering, and so I shall devote some space to clarifying the relevant concepts.

A well-devised and standardized data representation for neume notation would likely also be of great benefit in reducing data incompatibilities between various second-order data representations. The task of encoding neumed documents is huge and, under current technology, quite labor-intensive. It would be very desirable if the encoded documents of one research group were reusable by other groups. Realistically, it is only through such interoperability of data that important questions involving systematic comparison of manuscripts can be answered.

Designing a data representation for permanence and interoperability is especially difficult in the case of neumed documents. Some of the problems include the extent of variants in notational styles, uncertainties about identification of neume forms in certain source documents, vagueness about the height of neumes in early documents, and conflicting use requirements.

In computer science there is a classic trade-off between optimizing for space and optimizing for speed. Ironically, my proposed data representation optimizes both. More significantly, it also tends to reduce the complexity of programs that use the data.

Principle #3:
Eliminate context-dependency as much as practically feasible.

For comparative analysis of documents (and perhaps other uses) one needs to extract snippets of data from encoded documents. It would be computationally most efficient to be able to do this without reference to the context of the data, except for the 'header' information at the beginning of the file. An important design principle is to minimize the extent to which correct interpretation relies on interpreting the data leading up to that point and maintaining state information about context.

This strategy differs dramatically from the mode of operation of programs that display HTML documents. They interpret documents sequentially from the beginning and are not expected to correctly interpret snippets. State variables, such as font size and font color, remain in effect until they are turned off; it is not possible to correctly render a section of the document without reference to state variables that might have been set hundreds of characters earlier in the data stream. When interpreting a section of a neumed data stream that might be thousands of characters long, it would be undesirable to have to 'rewind' to the beginning of the data and interpret everything up to the section that is of interest.

By avoiding the use of 'global' state variables in the data representation, I have kept the context dependencies localized to a few characters in the immediate neighborhood of the affected character. Complexity is the nemesis of computer programs; reducing context dependency in the data representation results in significantly less complexity in programs that use the data representation.

§ Compatibility with SGML and XML

One of the most useful abstractions to come out of Xerox PARC (Palo Alto Research Center) in the 1970s is the Model-View-Controller (MVC) framework. The MVC framework evolved from early work at PARC on the Smalltalk programming language, and has been summarized clearly by Burbeck [1], Reenskaug [2], and Gamma et al. [3]. The basic idea of MVC is to introduce strict separation between the following three areas of software:
  1. the data representation (the model), which contains the 'raw data' that will be interpreted or manipulated by programs;
  2. the presentation layer (the view), which consists of the programs that interpret the raw data for display, analysis, etc. (note that many views are possible from a single model); and
  3. the controller, which consists of the programs to control the flow of execution, dispatching of procedures, and interaction with the user.
The salient benefits of MVC include the following:
  • a significant reduction in the complexity of computer software designs;
  • flexibility and extensibility in the modes of presentation of the data; and
  • isolation of the data from changes in the program logic or in the mode of presentation.
What does MVC have to do with a data representation for medieval neumes? Basically, I want to make three points in this regard:
  1. the design of the data representation (the model) should be independent of any particular use of those data, whether current or contemplated for the future;
  2. if the view of a particular use-scenario does not require all the detail that is in the model, then that view should simply filter out the unneeded detail when reading the model; and
  3. the interest that some rightly have in embedding neume data in SGML (Standard Generalized Markup Language [ISO 8879]), or in its XML (Extensible Markup Language) subset, does not conflict with my proposed Unicode data representation.
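
The second point, that a view filters out detail it does not need, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `NeumeRecord` (with fields `form`, `cf`, and `height`), the two view functions, and the sample values are hypothetical and are not part of the proposed data representation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class NeumeRecord:
    """One element of the model (hypothetical layout, for illustration only)."""
    form: str                # mnemonic for the neume form
    cf: Optional[float]      # certainty factor, or None if none was specified
    height: Optional[str]    # height information, or None if absent


def form_only_view(model: List[NeumeRecord]) -> List[str]:
    """A view that needs only the neume forms; all other detail is ignored."""
    return [record.form for record in model]


def confident_view(model: List[NeumeRecord], threshold: float = 0.2) -> List[str]:
    """A second view of the same model: forms whose CF exceeds the threshold."""
    return [r.form for r in model if r.cf is not None and r.cf > threshold]


# The model holds full detail and is never modified by either view.
model = [
    NeumeRecord("virga", 1.0, "DYU"),
    NeumeRecord("punctum", 0.1, None),
    NeumeRecord("clivis", None, "DYD"),
]

print(form_only_view(model))
print(confident_view(model))
```

Both views read the same model; neither requires the model to be redesigned or trimmed for its particular use.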

I take note of a comment in the latter report that, "Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures" [ibid., §3.1]. This distinction between text and markup is largely consistent with the Model-View-Controller framework, in that the "text" (which in our case includes neume characters) constitutes the model and is conceptually separable from markup information (which belongs to the view).

For example, a data stream might read as follows:

<?xml version="1.0" encoding='UTF-16/Private/Neume/1.0'?>
<MS Source="Oxford::Bodleian::Selden Supra 27">

. . . followed by a neumed data stream . . .


Phrase                       Condition
"definitely is"              CF = 1.0
"definitely is not"          CF = -1.0
"is"                         CF > 0.2
"is not"                     CF < -0.2
"could be"                   CF >= -0.2
"might not be"               CF <= 0.2
"is known"                   CF > 0.2 or CF < -0.2
"is not known"               CF <= 0.2 and CF >= -0.2
"is completely uncertain"    CF = 0.0

Table 1. The semantics of Certainty Factors.
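
The conditions in Table 1 translate directly into predicates. The following Python sketch lists every phrase whose condition holds for a given CF. The function name `describe` is hypothetical, and the check that a CF lies between -1.0 and 1.0 is an assumption inferred from the two "definitely" rows of the table.

```python
from typing import List


def describe(cf: float) -> List[str]:
    """Return every phrase from Table 1 whose condition holds for cf."""
    # Assumption: CFs range over [-1.0, 1.0], as the "definitely" rows suggest.
    if not -1.0 <= cf <= 1.0:
        raise ValueError("CF must lie in [-1.0, 1.0]")
    rules = [
        ("definitely is", cf == 1.0),
        ("definitely is not", cf == -1.0),
        ("is", cf > 0.2),
        ("is not", cf < -0.2),
        ("could be", cf >= -0.2),
        ("might not be", cf <= 0.2),
        ("is known", cf > 0.2 or cf < -0.2),
        ("is not known", -0.2 <= cf <= 0.2),
        ("is completely uncertain", cf == 0.0),
    ]
    return [phrase for phrase, holds in rules if holds]


print(describe(1.0))
print(describe(-0.5))
print(describe(0.0))
```

Note that the phrases overlap by design: a CF of 1.0 satisfies "definitely is", "is", "could be", and "is known" all at once.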

If an inference program produces numerical CFs of finer granularity than I have provided for, then the final values must be rounded off before being written. My experience with CF-enabled reasoning programs is that some post-computational rounding of CFs normally does little damage to the meaning of the results. Also, if a particular inference program uses a different scale of values for its CFs (such as 0 to 1,000), the CFs can be re-scaled as needed.
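
Re-scaling and rounding can be sketched as follows. Two assumptions are made that the text does not fix: the foreign scale is mapped linearly so that its lowest value corresponds to -1.0 and its highest to 1.0, and the granularity of the stored CF is taken to be 0.1 purely for illustration. The function names `rescale` and `quantize` are hypothetical.

```python
def rescale(value: float, lo: float = 0.0, hi: float = 1000.0) -> float:
    """Map value linearly from [lo, hi] onto [-1.0, 1.0] (assumed mapping)."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    if not lo <= value <= hi:
        raise ValueError("value lies outside the source scale")
    return 2.0 * (value - lo) / (hi - lo) - 1.0


def quantize(cf: float, step: float = 0.1) -> float:
    """Round cf to the nearest multiple of step (the step size is assumed)."""
    # The outer round() only removes floating-point noise such as 0.30000000000000004.
    return round(round(cf / step) * step, 10)


# A CF of 617 on a 0-to-1,000 scale, re-scaled and then rounded for writing.
print(quantize(rescale(617)))
```

If the foreign scale's zero instead means "completely uncertain" rather than "definitely is not", a different mapping would be needed; the sketch does not settle that question.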

One can think about CFs in terms of mnemonics, such as 'LU' for "a little uncertain," 'SU' for "somewhat uncertain," etc. The details of a system of mnemonics have yet to be worked out, but that is irrelevant to this discussion.

The need to reason under uncertainty is the motivation for my including in the data representation a battery of three-valued inference operators (a few of which are original to me). There is a tacit, fourth truth value in this data representation, namely that no CF was specified. Perhaps elsewhere I can treat the entailments of uncertain data for automated reasoning.
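
As a rough illustration of three truth values plus the tacit fourth, the sketch below classifies a CF using the 0.2 thresholds of Table 1 and treats a missing CF as its own value. The conjunction shown is Kleene's strong three-valued AND, used here purely as a familiar stand-in; it is not claimed to be one of the operators of the data representation. Folding "unspecified" into "unknown" before combining is likewise an assumption.

```python
from typing import Optional

TRUE, FALSE, UNKNOWN, UNSPECIFIED = "true", "false", "unknown", "unspecified"


def truth_value(cf: Optional[float]) -> str:
    """Classify a CF into one of three truth values, or UNSPECIFIED if absent."""
    if cf is None:
        return UNSPECIFIED      # the tacit fourth value: no CF was given
    if cf > 0.2:
        return TRUE
    if cf < -0.2:
        return FALSE
    return UNKNOWN


def kleene_and(a: str, b: str) -> str:
    """Kleene strong conjunction; UNSPECIFIED is treated as UNKNOWN (assumption)."""
    a = UNKNOWN if a == UNSPECIFIED else a
    b = UNKNOWN if b == UNSPECIFIED else b
    if FALSE in (a, b):
        return FALSE
    if a == TRUE and b == TRUE:
        return TRUE
    return UNKNOWN


print(kleene_and(truth_value(0.9), truth_value(0.1)))
print(kleene_and(truth_value(-0.6), truth_value(None)))
```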

§ Assertion of semantic equivalence across notational families

Figure 1. Projection of notational families on semantic code space.

A projection is simply a mapping from the neume forms in a particular notational family to particular Unicode characters in the code space. The notion of semantic code space simply means that the mapping is done to tokens that stand for the semantic value of a neume; for example, a virga in every notational family maps to the same semantic token in the code space. Since one notational family can evidence a particular neume form that is not evidenced in another family, the mapping is not what computer scientists call an onto projection (i.e., not every member of the code space has a corresponding member in the domain of the projection). The semantic code space is taken to be the union of all particular projections. This can be described more formally as follows:
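
Before the formal definitions, the idea can be made concrete with a small Python sketch. Everything named here is hypothetical: the family labels `family_A` and `family_B`, the choice of neume forms, and the specific Private Use Area code points are assumptions for illustration only.

```python
from typing import Dict, Set

# Hypothetical semantic tokens in the Private Use Area.
VIRGA, PUNCTUM, QUILISMA = "\uE001", "\uE002", "\uE003"

# One projection per notational family: neume form -> semantic token.
projections: Dict[str, Dict[str, str]] = {
    "family_A": {"virga": VIRGA, "punctum": PUNCTUM},
    "family_B": {"virga": VIRGA, "punctum": PUNCTUM, "quilisma": QUILISMA},
}


def code_space(projs: Dict[str, Dict[str, str]]) -> Set[str]:
    """The semantic code space: the union of the images of all projections."""
    space: Set[str] = set()
    for mapping in projs.values():
        space |= set(mapping.values())
    return space


def is_onto(mapping: Dict[str, str], space: Set[str]) -> bool:
    """True if every token in the code space is reached by this projection."""
    return set(mapping.values()) == space


space = code_space(projections)
# A virga maps to the same token regardless of family.
print(projections["family_A"]["virga"] == projections["family_B"]["virga"])
# family_A lacks the quilisma, so its projection is not onto the code space.
print(is_onto(projections["family_A"], space))
```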

Definition of graphical neumes

Definition of a graphical neume: A graphical symbol of neume notation.
Definition of a chant melody: An ordered set of neume symbols that constitutes an instance of a chant melody.
Definition of a manuscript: A manuscript, consisting of an ordered set of chant instances.

Definition of a notational family

Definition of the neume set in a manuscript: A taxonomically-ordered, minimal set of neume symbols in manuscript Ms, where '≡' is an equivalence relation defined in a 'reasonable' sense, and the enumeration can contain empty elements ε.

Definition of the neume set in a notational family: The union of neume symbol sets across manuscripts of the same notational 'family', such that, for any two manuscripts in one family, the taxonomic enumeration of neume symbols is identical, except for empty elements.

Definition of the semantic equivalence relation

Definition of equivalence relation on neume symbols: An equivalence relation equating neume symbols across notational families, except for empty elements; based on 'reasonable' semantic synonymy.

Definition of semantic neume forms

Definition of a neume form: A Unicode character in the Private Use Area, representing an abstract neume-form type.

Definition of mapping function: A function that does semantic mapping from a neume symbol to a particular Unicode character representation.

Definition of semantic projection: A semantic projection of a notational family on the Unicode space, guaranteeing complete mapping.

Definition of equivalence projection: Composite projection function, with constraint on the semantic equivalence of notational families.

Definition of code space for semantic neumes: The Unicode space for semantic neumes.

Definition of the language of abstract neume forms: The language of abstract neume forms, that is, the set of denumerable 'sentences' (strings of Unicode characters) in the notational grammar.
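
One possible way to write these definitions symbolically is sketched below. It is a reading of the verbal descriptions only: every symbol (G, C, N, F, μ, π, S, Σ, L, and the relation ∼) is assumed here for illustration and should not be taken as the notation of the definitions themselves. In the fourth line, the non-equivalence condition is meant to apply to non-empty elements only.

```latex
\begin{align*}
g &\in G
  && \text{a graphical neume symbol} \\
C &= (g_1, \dots, g_n), \quad g_i \in G
  && \text{a chant instance} \\
Ms &= (C_1, \dots, C_m)
  && \text{a manuscript} \\
N_{Ms} &= (s_1, \dots, s_k), \quad s_i \in G \cup \{\varepsilon\}, \quad s_i \not\equiv s_j \ (i \neq j)
  && \text{neume set of manuscript } Ms \\
N_F &= \bigcup_{Ms \in F} N_{Ms}
  && \text{neume set of notational family } F \\
g &\sim g'
  && \text{semantic synonymy across families} \\
u &\in U \subseteq \{\text{U+E000}, \dots, \text{U+F8FF}\}
  && \text{an abstract neume form} \\
\mu &: G \to U
  && \text{semantic mapping function} \\
\pi_F &: N_F \setminus \{\varepsilon\} \to U, \quad \pi_F(g) = \mu(g)
  && \text{projection of family } F \text{ (total)} \\
g \sim g' &\implies \pi_F(g) = \pi_{F'}(g')
  && \text{equivalence constraint} \\
S &= \bigcup_{F} \pi_F\bigl(N_F \setminus \{\varepsilon\}\bigr)
  && \text{semantic code space} \\
L &\subseteq \Sigma^{*}, \quad S \subseteq \Sigma
  && \text{language of abstract neume forms}
\end{align*}
```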


§ Motivation for a Context-Free Grammar

A context-free grammar (CFG) is a set of rules that describes in a clear and deterministic way the set of all possible 'sentences' in a language, such that any valid sentence has a single interpretation. In this case, we are concerned with the language of abstract neume forms. A 'sentence' in this language is a series of Latin syllables and their associated neumes that, together, make up a chant instance. A formal grammar is needed to guide the formulation of programs that create chant encodings, and of programs that interpret those encodings. In particular, the CFG is needed for use by a program that verifies that a given chant document conforms to the rules for a well-formed sentence in the language. At this level, information about the visual layout of a chant is not included in the grammar, as layout information is part of the presentation layer.

Note: In the ensuing discussions I sometimes, for convenience, refer to a particular Unicode character by a mnemonic. For example, I may write 'CF' (the Certainty Factor character) instead of its Unicode value, which might be U+EE00 in hexadecimal [which equates to the binary pattern 1110 1110 0000 0000]. Or, I might write 'NULL' where I mean the Unicode character whose value is U+0000. I use this mnemonic notation only for convenience of presentation. Thus, 'CF' should not be confused with the two-character series "CF", and 'NULL' should not be confused with the four-character series "NULL". Significantly, alternative mnemonics could be devised as aides-mémoire for languages other than English; such substitution would in no way impinge on the semantics of the Unicode characters themselves.
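
The distinction between a mnemonic and the character it stands for can be shown with a small Python sketch. The value U+EE00 for 'CF' is only the tentative example given above, and the French labels in `MNEMONICS_FR` are hypothetical; the table and function names are likewise assumptions.

```python
# Mnemonic -> single Unicode character. U+EE00 is the tentative example value.
MNEMONICS = {"CF": "\uEE00", "NULL": "\u0000"}

# Hypothetical alternative labels for French readers; same characters underneath.
MNEMONICS_FR = {"FC": "\uEE00", "NUL": "\u0000"}


def expand(mnemonic: str, table=MNEMONICS) -> str:
    """Return the one Unicode character that a mnemonic stands for."""
    return table[mnemonic]


cf_char = expand("CF")
print(len(cf_char))                    # one character, not the two letters "CF"
print(format(ord(cf_char), "016b"))    # the binary pattern of the code point
print(expand("FC", MNEMONICS_FR) == cf_char)
```

Swapping the mnemonic table changes only what a human reads; the encoded data are unaffected.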

§ The Proposed CFG

Below is given the proposed context-free grammar for the language of abstract neume forms.
  • Terms in the CFG are denoted in angle brackets, "< >". The term on the left-hand side is the one being defined. To the right of the double-colon, "::", is the definition. Two or more terms written in series means that the terms must appear in the order specified (ordered AND). Where two or more definitions are possible, they are listed on consecutive lines; only one of the definitions can be taken at a time (exclusive OR).
  • Annotations are given as subscripts, shown here in square brackets after the term they annotate. The subscript "opt" means that the term is optional in the definition. The subscript "one or more" means that the term can appear one or more times in series. Other subscripts are explanations for mnemonics; for example, "BOT[baseline of text]" means that the mnemonic "BOT" stands for "baseline of text".
  • The star operator '*' means that there can be one or more of these terms in series. For example, <syllable> :: <non-vowel*> means that a syllable can be one or more non-vowels.
  • Primitives of the language (or, terminals) are shown as sets within braces, "{ }"; they are shown either as their Unicode values (such as U+E001) or by a mnemonic (which stands for a Unicode character, but is easier to read). A set is understood to be a selection list, where a single member of the set is valid in this position. A set can have a single member.
  • The second definition of the term <neume string> is an example of a recursive definition. This is different from the iterative definition of <chant>; substrings of neumes are considered to also be strings of neumes, but a sub-part of a chant is not considered to be a chant. The degenerate case of a chant is a single vowel.
  • Equivalence functions are denoted with parentheses, "( )". For example, "upper-case(<vowel>)" means the mapping of a <vowel> to its upper-case equivalent.
<chant> :: <token>[one or more]
<token> :: <syllable> <neume string>[opt]
<syllable> :: <non-vowel*>[opt] <vowel> <non-vowel>[opt; one or more]
<non-vowel> :: {}
<vowel> :: {'a', 'e', 'i', 'o', 'u', 'y', 'æ', 'œ'}
<neume string> :: <neume specifier>
                  <neume specifier> <neume string>
<neume specifier> :: <neume form> <certainty factor>[opt] <color>[opt] <height specifier>[opt]
<neume form> :: {U+E000, ..., U+F8FF}
<certainty factor> :: {CF} <numeric>
<numeric> :: {NUM} <number> {NULL} {NULL}
<height specifier> :: <relative height>[opt] <absolute height>[opt]
<relative height> :: <baseline> <delta> <certainty factor>[opt]
                     <baseline> <staff line> <certainty factor>[opt]
<baseline> :: {BOT[baseline of text], LSL[lowest staff line], PRE[previous neume], OUT[inserted out-of-line], UNK[unknown]}
<delta> :: {DYP[up a lot], DYU[up a little], DYZ[no change], DYD[down a little], DYN[down a lot]}
<staff line> :: {1st line, 1st space, 2nd line, 2nd space, ... 6th line, below staff, above staff}
<absolute height> :: <baseline> <top of neume> <bottom of neume>[opt]
<top of neume> :: {TON[top of neume]} <numeric> <certainty factor>
<bottom of neume> :: {BON[bottom of neume]} <numeric> <certainty factor>
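
A verification program of the kind this grammar is meant to guide could begin along the following lines. This Python sketch recognizes only a subset of the grammar: <chant>, <token>, <syllable>, <neume string>, and a <neume specifier> reduced to <neume form> with an optional <certainty factor>; color and height are omitted. Several details are assumptions not fixed by the grammar as given: the code points chosen for CF and NUM, the treatment of <number> as a run of decimal digits, signs, and points, the reading of <non-vowel> as any other lower-case Latin letter, and the exclusion of the control characters from the <neume form> range.

```python
import re

VOWELS = "aeiouyæœ"
NON_VOWELS = "b-df-hj-np-tv-xz"          # assumed reading of <non-vowel>

# Assumed code points for the control characters; NULL is U+0000.
CF, NUM, NULL = "\uEE00", "\uEE01", "\u0000"
RESERVED = {CF, NUM}

# <syllable> :: optional non-vowels, one vowel, optional non-vowels.
SYLLABLE = re.compile(rf"[{NON_VOWELS}]*[{VOWELS}][{NON_VOWELS}]*")


def is_neume_form(ch: str) -> bool:
    """<neume form>: a Private Use Area character that is not a control code."""
    return "\uE000" <= ch <= "\uF8FF" and ch not in RESERVED


def parse_numeric(s: str, i: int) -> int:
    """<numeric> :: {NUM} <number> {NULL} {NULL}. Return the next index, or -1."""
    if i >= len(s) or s[i] != NUM:
        return -1
    j = i + 1
    while j < len(s) and s[j] in "0123456789+-.":
        j += 1
    if j == i + 1 or s[j:j + 2] != NULL * 2:
        return -1
    return j + 2


def parse_specifier(s: str, i: int) -> int:
    """<neume form> followed by an optional <certainty factor>."""
    if i >= len(s) or not is_neume_form(s[i]):
        return -1
    i += 1
    if i < len(s) and s[i] == CF:
        return parse_numeric(s, i + 1)
    return i


def is_chant(s: str) -> bool:
    """True if s is one or more tokens, each a syllable plus optional neumes."""
    i, tokens = 0, 0
    while i < len(s):
        match = SYLLABLE.match(s, i)
        if match is None:
            return False
        i, tokens = match.end(), tokens + 1
        while i < len(s) and is_neume_form(s[i]):   # optional <neume string>
            i = parse_specifier(s, i)
            if i < 0:
                return False
    return tokens >= 1


neume = "\uE001"                                   # hypothetical neume form
print(is_chant("a"))                               # the degenerate chant
print(is_chant("a" + neume + "ve" + neume + neume))
print(is_chant(neume + "a"))                       # neumes without a syllable
```

Because each <neume specifier> carries its own optional annotations, the recognizer needs no state beyond its current position in the stream.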

End Notes:
[1] Steve Burbeck, "Applications Programming in Smalltalk-80(™): How to use Model-View-Controller (MVC)," (1992).
[2] Trygve Reenskaug, Working with Objects, (Greenwich, CT: Manning, 1996).
[3] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides, Design Patterns: Elements of Reusable Object-Oriented Software, (Reading, MA: Addison-Wesley, 1995).


Revision: 4 June 2000.
Copyright (c) 2000, Louis W. G. Barton