Maintained by: David J. Birnbaum (djbpitt@gmail.com) Last modified: 2012-06-09T19:43:32+0000
The ASCII CCMH
files are used as input, and they are converted programmatically to Unicode and
delivered in Word format. The only XML-like markup introduced at this stage is that
superscript letters are tagged as <sup>
(but see below concerning an
error in the end tag). A sample looks like:
1 008v 05 шедъ о̑устрои͑ сꙙ съ ѫ͑жиками свои͑ми. 1 008v 06 и͑ пришъдъ мѫчениѥ приїмеши въ к<sup>о<\sup>- 1 008v 07 манѣхъ. нъ не бои͑ сꙙ о͑тъ мѫкъ а͑зъ б<sup>о<\sup> 1 008v 08 ѥ͑смъ съ тобоѭ̑. и͑ не и͑матъ тебе врѣді-
The files are XMLified with the following plain-text search-and-replace operations:
<sup>
tag,
but erroneously writes the end tag with a backslash instead of a forward slash
(<\sup>
instead of </sup>
). We correct
this with a global search-and-replace.^\s+
and replacing it with an empty string. This also strips out
blank lines.^([123]) (\d+)(.)
(\d{2}) (.*)$
) and add markup by replacing them with <line
text="\1" folio="\2" side="\3" line="\4">\5</line>
. Input like
1 008r 18 Мѣсꙙца марта въ е͆ день͗. мѫчениѥ ст͆ааго василиска⁛is thus converted to:
<line text="1" folio="008" side="r" line="18">Мѣсꙙца марта въ е͆ день͗. мѫчениѥ ст͆ааго василиска⁛</line>
<title>
and wrap the entire
file in a <text>
root element.This yields a well-formed XML file that conforms to the following Relax NG schema:
start = element text { element title { text }, element line { attribute text { xsd:int }, attribute folio { xsd:int }, attribute side { "r" | "v" }, attribute line { xsd:int }, mixed { element sup { text }* } }+ }
Smooth breathing, apostrophe, and paerok are not clearly distinguished in the Word files. They are, however, in complementary distribution (except for occasional errors in the CCMH source), and we distinguish them as follows. The examples are all taken from line-final position in folio 9r. See the table below for mappings and additional notes.
All encoded as U+0357 COMBINING RIGHT HALF RING ABOVE. We replace those globally with U+0486 COMBINING CYRILLIC PSILI PNEUMATA.
All encoded as U+0351 COMBINING LEFT HALF RING ABOVE. We replace those globally with U+0485 COMBINING CYRILLIC DASIA PNEUMATA.
The instance on l. 7 is encoded as U+02BC MODIFIER LETTER APOSTROPHE and the one on l. 29 as a U+035B COMBINING ZIGZAG ABOVE. Both follow a consonant letter.
Also encoded as U+02BC MODIFIER LETTER APOSTROPHE, but after a vowel. We replace U+02BC MODIFIER LETTER APOSTROPHE when it follows a vowel letter with U+0313 COMBINING COMMA ABOVE (a non-spacing character).
The preceding situations are distinct, either because the character in the Word final is generally distinct or because it is distinct in position (after consonant or after vowel). This permits us to automate the replacements. Although the physical position of paerok varies in the manuscript (whether between two characters or above the first of the pair), we standardize in all cases on U+2E2F VERTICAL TILDE (a spacing character). This decision is governed by our desire to privilege the semantics of the text over the appearance, and from that perspective the paerok is conceptually between two characters even when, from a graphic perspective, it might be rendered slightly over the first of them.
Error in the transcription | Replaced by | Note | ||||
---|---|---|---|---|---|---|
Value | Raw | Name | Value | Raw | Name | |
Alphabetic | ||||||
U+0437; | з | CYRILLIC SMALL LETTER ZE | U+A641 | ꙁ | CYRILLIC SMALL LETTER ZEMLYA | |
U+0479 | ѹ | CYRILLIC SMALL LETTER UK | U+A64B; | ꙋ | CYRILLIC SMALL LETTER MONOGRAPH UK | U+0479 is deprecated because of ambiguity. Digraphic uk should be represented as two characters in sequence; monographic uk should be represented as U+A64B. |
U+A647 | ꙇ | CYRILLIC SMALL LETTER IOTA | U+0456; | і | CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I | U+A647 is used only for transliteration from Glagolitic to Cyrillic. |
Punctuation | ||||||
U+002E; | . | FULL STOP | U+00B7 | · | MIDDLE DOT | |
U+2022; | • | BULLET | U+00B7 | · | MIDDLE DOT | |
U+205B | ⁛ | FOUR DOT MARK | U+2058 | ⁘ | FOUR DOT PUNCTUATION | |
Diacritic | ||||||
U+02BC | иʼ | MODIFIER LETTER APOSTROPHE | U+0313 | и̓ | COMBINING COMMA ABOVE | Replaced by U+0313 COMBINING COMMA ABOVE (a non-spacing character) when it follows a vowel and by U+2E2F VERTICAL TILDE (a spacing character) when it follows a consonant. |
U+2E2F | иⸯ | VERTICAL TILDE | ||||
U+0311 | и̑ | COMBINING INVERTED BREVE | U+0484 | и҄ | COMBINING CYRILLIC PALATALIZATION | We replace U+0311 (COMBINING INVERTED BREVE) automatically by U+0484 (COMBINING CYRILLIC PALATALIZATION) when it follows a consonant and we leave it alone, as a representation of Cyrillic kamora, when it follows a vowel. It does not occur in other positions. There are a few genuine ambiguities that will need to be fixed manually, e.g., благꙑнѫ̑ (11v1), where it should be a palatalization hook on the н. This reflects a conceptual error in CCMH, which uses a caret (^) for both palatalization of a preceding consonant and kamora over a following vowel, thus creating an ambiguity about whether a particular instance of caret associates to the left or to the right. |
Retained without change | ||||||
U+0346 | и͆ | COMBINING BRIDGE ABOVE | U+A66F | и꙯ | COMBINING CYRILLIC VZMET | U+0346 is documented in Unicode as an addition for IPA. |
U+0351 | и͑ | COMBINING LEFT HALF RING ABOVE | U+0485 | и҅ | COMBINING CYRILLIC DASIA PNEUMATA | U+0351 and U+0357 are documented in Unicode as additions for the Uralic phonetic alphabet. |
U+0357 | и͗ | COMBINING RIGHT HALF RING ABOVE | U+0486 | и҆ | COMBINING CYRILLIC PSILI PNEUMATA | |
U+035B | и͛ | COMBINING ZIGZAG ABOVE | U+2E2F | иⸯ | VERTICAL TILDE |
U+0343 COMBINING GREEK KORONIS, with a preceding space, is used for smooth breathing before upper-case letter. There is no corresponding Cyrillic glyph and the Greek one isn't yet in our fonts. Temporarily replaced by U+02BC MODIFIER LETTER APOSTROPHE, keeping the preceding space.
For PROIEL tagging we strip all diacritics except paerok and the palatalization hook.