Codex Suprasliensis


Maintained by: David J. Birnbaum (djbpitt@gmail.com) [Creative Commons BY-NC-SA 3.0 Unported License] Last modified: 2012-06-09T19:43:32+0000


Editorial principles

Input files

The ASCII CCMH files are used as input, and they are converted programmatically to Unicode and delivered in Word format. The only XML-like markup introduced at this stage is that superscript letters are tagged as <sup> (but see below concerning an error in the end tag). A sample looks like:

1 008v 05 шедъ о̑устрои͑ сꙙ съ ѫ͑жиками свои͑ми.
1 008v 06 и͑ пришъдъ мѫчениѥ приїмеши въ к<sup>о<\sup>-
1 008v 07 манѣхъ. нъ не бои͑ сꙙ о͑тъ мѫкъ а͑зъ б<sup>о<\sup>
1 008v 08 ѥ͑смъ съ тобоѭ̑. и͑ не и͑матъ тебе врѣді-

XML up-conversion through plain-text search-and-replace

The files are XMLified with the following plain-text search-and-replace operations:

  1. The OCR output encodes superscript letters with the <sup> tag, but erroneously writes the end tag with a backslash instead of a forward slash (<\sup> instead of </sup>). We correct this with a global search-and-replace.
  2. We strip all leading white-space characters by matching ^\s+ and replacing it with an empty string. Because \s also matches line breaks, this also strips out blank lines.
  3. We programmatically match all remaining manuscript lines (^([123]) (\d+)(.) (\d{2}) (.*)$) and add markup by replacing them with <line text="\1" folio="\2" side="\3" line="\4">\5</line>. Input like
    1 008r 18 Мѣсꙙца марта въ е͆ день͗. мѫчениѥ ст͆ааго василиска⁛
    is thus converted to:
    <line text="1" folio="008" side="r" line="18">Мѣсꙙца марта въ е͆ день͗. мѫчениѥ ст͆ааго василиска⁛</line>
  4. We manually tag the first line as <title> and wrap the entire file in a <text> root element.
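The first three steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tooling actually used: the function name upconvert is hypothetical, the regular expressions are copied from the steps above, and multiline matching is assumed so that ^ and $ anchor at each manuscript line.

```python
import re

# Pattern from step 3: text number, folio digits, side letter,
# two-digit line number, then the line content.
LINE_RE = re.compile(r"^([123]) (\d+)(.) (\d{2}) (.*)$", re.MULTILINE)


def upconvert(raw):
    """Apply steps 1-3 to the plain text of one file (hypothetical helper)."""
    # Step 1: repair the malformed end tag (<\sup> becomes </sup>).
    fixed = raw.replace("<\\sup>", "</sup>")
    # Step 2: strip leading white space; \s also matches line breaks,
    # so blank lines disappear as a side effect.
    fixed = re.sub(r"^\s+", "", fixed, flags=re.MULTILINE)
    # Step 3: wrap every manuscript line in a <line> element.
    return LINE_RE.sub(
        r'<line text="\1" folio="\2" side="\3" line="\4">\5</line>', fixed
    )
```

Step 4 (tagging the title and adding the <text> root element) is manual and is therefore not part of the sketch.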

This yields a well-formed XML file that conforms to the following Relax NG schema:

start =
    element text {
        element title { text },
        element line {
            attribute text { xsd:int },
            attribute folio { xsd:int },
            attribute side { "r" | "v" },
            attribute line { xsd:int },
            mixed {
                element sup { text }*
            }
        }+
    }
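
As a quick sanity check that does not require a Relax NG validator, the constraints of the schema can be mirrored by hand with the Python standard library. The sketch below is not a substitute for real validation; the function name check_structure is hypothetical, and the integer test is deliberately stricter than xsd:int (it accepts plain ASCII digits only).

```python
import xml.etree.ElementTree as ET


def check_structure(xml_text):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    root = ET.fromstring(xml_text)
    if root.tag != "text":
        problems.append(f"root is <{root.tag}>, expected <text>")
    children = list(root)
    # The schema asks for one text-only <title>, then one or more <line>.
    if not children or children[0].tag != "title" or len(children[0]) > 0:
        problems.append("first child must be a text-only <title>")
    lines = children[1:]
    if not lines:
        problems.append("at least one <line> is required")
    for position, line in enumerate(lines, start=1):
        if line.tag != "line":
            problems.append(f"element {position} after <title> is <{line.tag}>")
            continue
        if set(line.attrib) != {"text", "folio", "side", "line"}:
            problems.append(f"line {position}: unexpected or missing attributes")
        for name in ("text", "folio", "line"):
            value = line.get(name, "")
            # Stricter than xsd:int, which also allows a sign and white space.
            if not (value.isascii() and value.isdigit()):
                problems.append(f"line {position}: {name}={value!r} is not an integer")
        if line.get("side") not in ("r", "v"):
            problems.append(f"line {position}: side must be 'r' or 'v'")
        for child in line:
            # Only text-only <sup> children are allowed inside <line>.
            if child.tag != "sup" or len(child) > 0:
                problems.append(f"line {position}: unexpected <{child.tag}> content")
    return problems
```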

Character inventory

Diacritic analysis

Smooth breathing, apostrophe, and paerok are not clearly distinguished in the Word files. They are, however, in complementary distribution (except for occasional errors in the CCMH source), and we distinguish them as follows. The examples are all taken from line-final position in folio 9r. See the table below for mappings and additional notes.

Smooth breathing: ll. 4, 20, 21

All encoded as U+0357 COMBINING RIGHT HALF RING ABOVE. We replace those globally with U+0486 COMBINING CYRILLIC PSILI PNEUMATA.

Rough breathing: ll. 6, 9, 11

All encoded as U+0351 COMBINING LEFT HALF RING ABOVE. We replace those globally with U+0485 COMBINING CYRILLIC DASIA PNEUMATA.

Paerok: ll. 7, 29

The instance on l. 7 is encoded as U+02BC MODIFIER LETTER APOSTROPHE and the one on l. 29 as U+035B COMBINING ZIGZAG ABOVE. Both follow a consonant letter.

Apostrophe: ll. 13, 23

Also encoded as U+02BC MODIFIER LETTER APOSTROPHE, but after a vowel. When U+02BC follows a vowel letter, we replace it with U+0313 COMBINING COMMA ABOVE (a non-spacing character).

The preceding situations are distinct, either because the character in the Word file is generally distinct or because it is distinct in position (after a consonant or after a vowel). This permits us to automate the replacements. Although the physical position of paerok varies in the manuscript (whether between two characters or above the first of the pair), we standardize in all cases on U+2E2F VERTICAL TILDE (a spacing character). This decision is governed by our desire to privilege the semantics of the text over the appearance, and from that perspective the paerok is conceptually between two characters even when, from a graphic perspective, it might be rendered slightly over the first of them.
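
The four replacements described above could be automated along the following lines. This is a sketch under assumptions rather than the project's actual code: the VOWELS and CONSONANTS strings are illustrative and probably incomplete, the function name normalize_diacritics is hypothetical, and U+02BC is assumed to follow its base letter directly, with no other diacritic in between.

```python
import re

# Assumption: illustrative letter inventories, not the project's own lists.
VOWELS = "аеиіоуѹꙋыꙑѣюꙗѥѧѩѫѭъьѡѵꙙ"
CONSONANTS = "бвгджзꙁѕклмнпрстфхцчшщѳѯѱ"

# Context-free replacements: the two breathings and the zigzag paerok.
SIMPLE = str.maketrans({
    "\u0357": "\u0486",  # RIGHT HALF RING ABOVE -> CYRILLIC PSILI PNEUMATA
    "\u0351": "\u0485",  # LEFT HALF RING ABOVE -> CYRILLIC DASIA PNEUMATA
    "\u035B": "\u2E2F",  # ZIGZAG ABOVE -> VERTICAL TILDE (paerok)
})

# U+02BC is disambiguated by the letter that immediately precedes it.
AFTER_VOWEL = re.compile(f"(?<=[{VOWELS}{VOWELS.upper()}])\u02BC")
AFTER_CONSONANT = re.compile(f"(?<=[{CONSONANTS}{CONSONANTS.upper()}])\u02BC")


def normalize_diacritics(text):
    """Hypothetical helper applying the replacements in one pass each."""
    text = text.translate(SIMPLE)
    text = AFTER_VOWEL.sub("\u0313", text)        # apostrophe: COMMA ABOVE
    return AFTER_CONSONANT.sub("\u2E2F", text)    # paerok: VERTICAL TILDE
```

An instance of U+02BC that follows neither kind of letter (for example, after a space) is left untouched by this sketch.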

Table of errors and corrections

Error value | Error raw | Error name | Replacement value | Replacement raw | Replacement name | Note
Alphabetic
U+0437 | з | CYRILLIC SMALL LETTER ZE | U+A641 | ꙁ | CYRILLIC SMALL LETTER ZEMLYA |
U+0479 | ѹ | CYRILLIC SMALL LETTER UK | U+A64B | ꙋ | CYRILLIC SMALL LETTER MONOGRAPH UK | U+0479 is deprecated because of ambiguity. Digraphic uk should be represented as two characters in sequence; monographic uk should be represented as U+A64B.
U+A647 | ꙇ | CYRILLIC SMALL LETTER IOTA | U+0456 | і | CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I | U+A647 is used only for transliteration from Glagolitic to Cyrillic.
Punctuation
U+002E | . | FULL STOP | U+00B7 | · | MIDDLE DOT |
U+2022 | • | BULLET | U+00B7 | · | MIDDLE DOT |
U+205B | ⁛ | FOUR DOT MARK | U+2058 | ⁘ | FOUR DOT PUNCTUATION |
Diacritic
U+02BC | иʼ | MODIFIER LETTER APOSTROPHE | U+0313 | и̓ | COMBINING COMMA ABOVE | Replaced by U+0313 COMBINING COMMA ABOVE (a non-spacing character) when it follows a vowel and by U+2E2F VERTICAL TILDE (a spacing character) when it follows a consonant.
 | | | U+2E2F | иⸯ | VERTICAL TILDE |
U+0311 | и̑ | COMBINING INVERTED BREVE | U+0484 | и҄ | COMBINING CYRILLIC PALATALIZATION | We replace U+0311 (COMBINING INVERTED BREVE) automatically by U+0484 (COMBINING CYRILLIC PALATALIZATION) when it follows a consonant and we leave it alone, as a representation of Cyrillic kamora, when it follows a vowel. It does not occur in other positions. There are a few genuine ambiguities that will need to be fixed manually, e.g., благꙑнѫ̑ (11v1), where it should be a palatalization hook on the н. This reflects a conceptual error in CCMH, which uses a caret (^) for both palatalization of a preceding consonant and kamora over a following vowel, thus creating an ambiguity about whether a particular instance of caret associates to the left or to the right.
Retained without change
U+0346 | и͆ | COMBINING BRIDGE ABOVE | U+A66F | и꙯ | COMBINING CYRILLIC VZMET | U+0346 is documented in Unicode as an addition for IPA.
U+0351 | и͑ | COMBINING LEFT HALF RING ABOVE | U+0485 | и҅ | COMBINING CYRILLIC DASIA PNEUMATA | U+0351 and U+0357 are documented in Unicode as additions for the Uralic phonetic alphabet.
U+0357 | и͗ | COMBINING RIGHT HALF RING ABOVE | U+0486 | и҆ | COMBINING CYRILLIC PSILI PNEUMATA |
U+035B | и͛ | COMBINING ZIGZAG ABOVE | U+2E2F | иⸯ | VERTICAL TILDE |
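
The alphabetic, punctuation, and inverted-breve rows of the table lend themselves to a small mapping plus one context-sensitive rule. The sketch below is illustrative only: the CONSONANTS string is an assumed, probably incomplete inventory, the function name apply_table is hypothetical, and the input is assumed to be text content only (no markup containing full stops).

```python
import re

# One-to-one replacements taken from the Alphabetic and Punctuation rows.
UNCONDITIONAL = str.maketrans({
    "\u0437": "\uA641",  # ZE -> ZEMLYA
    "\u0479": "\uA64B",  # UK -> MONOGRAPH UK
    "\uA647": "\u0456",  # IOTA -> BYELORUSSIAN-UKRAINIAN I
    "\u002E": "\u00B7",  # FULL STOP -> MIDDLE DOT
    "\u2022": "\u00B7",  # BULLET -> MIDDLE DOT
    "\u205B": "\u2058",  # FOUR DOT MARK -> FOUR DOT PUNCTUATION
})

# Assumption: illustrative consonant inventory, not the project's own list.
CONSONANTS = "бвгджзꙁѕклмнпрстфхцчшщѳѯѱ"

# U+0311 becomes the palatalization hook only directly after a consonant;
# after a vowel it is kept as kamora.
PALATALIZATION = re.compile(f"(?<=[{CONSONANTS}{CONSONANTS.upper()}])\u0311")


def apply_table(text):
    """Hypothetical helper applying the automatic replacements."""
    text = text.translate(UNCONDITIONAL)
    return PALATALIZATION.sub("\u0484", text)
```

Cases like благꙑнѫ̑ (11v1), where the inverted breve sits on a vowel but belongs to the preceding consonant, pass through unchanged and still need manual correction.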

To add

U+0343 COMBINING GREEK KORONIS, with a preceding space, is used for smooth breathing before an upper-case letter. There is no corresponding Cyrillic glyph, and the Greek one is not yet in our fonts. It is temporarily replaced by U+02BC MODIFIER LETTER APOSTROPHE, keeping the preceding space.

PROIEL

For PROIEL tagging we strip all diacritics except paerok and the palatalization hook.
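
One way to approximate this stripping is to drop every Unicode non-spacing mark except the palatalization hook. This is a sketch under assumptions, not the project's procedure: it treats "diacritic" as Unicode general category Mn (which also covers titlo), it relies on paerok having already been normalized to the spacing character U+2E2F, and the function name strip_for_proiel is hypothetical.

```python
import unicodedata

# Combining marks to keep: U+0484 COMBINING CYRILLIC PALATALIZATION.
KEEP = {"\u0484"}


def strip_for_proiel(text):
    """Drop non-spacing marks (category Mn) except those in KEEP.

    Paerok as U+2E2F VERTICAL TILDE is a spacing character, so it
    survives without needing an explicit exception.
    """
    return "".join(
        ch for ch in text
        if ch in KEEP or unicodedata.category(ch) != "Mn"
    )
```

The text is not decomposed first, so precomposed letters are left as they are, and spacing marks such as U+02BC are not touched.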