Description
Adding <s> and <w> to the current ELTeC schema in such a way as to make import of linguistic analyses easier has some implications for current content models. Here's my summary of the relevant issues to consider.
<w> elements can only appear within an <s>. At level 2, the elements p, head, and l should therefore change to permit as content a sequence of <s> elements, intertwingled with empty elements (gap, milestone, pb, ref).
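To make the proposed content model concrete, here is a minimal sketch of what a level-2 paragraph might look like under it. The text and the `n` value are invented for illustration; only the nesting (`<w>` inside `<s>`, with `<s>` and empty elements as the sole children of `<p>`) reflects the proposal.

```xml
<!-- Hypothetical level-2 paragraph: direct children are <s> or empty elements only -->
<p>
  <s><w>It</w> <w>was</w> <w>late</w></s>
  <!-- an empty element sitting between two sentences -->
  <pb n="12"/>
  <s><w>Nobody</w> <w>spoke</w></s>
</p>
```

The same pattern would presumably apply to head and l.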
This leaves unclear what to do with the other sub-paragraph elements (bibl, corr, date, emph, foreign, hi, label, measure, name, note, term, title).
Some of these (bibl, date, measure, term) are really only used or needed in the header. The schema should add this constraint.
That leaves corr, emph, foreign, hi, label, name, title. The most natural (TEI-like) thing to do would be to change their content models to permit <w> elements. This would mean that <w> elements can now be found at two levels in the hierarchy, which may upset some software. It also implies that <w> elements must be properly contained within one of these elements, i.e. the start or end of such an element cannot fall in the middle of a word; this should not be an issue except possibly for <corr>. An alternative might be to use a Trojan-horse-style notation, but that risks making downstream processing considerably more complicated.
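A rough sketch of the two alternatives, using `<title>` as the example. The sentence is invented, and the use of paired `<milestone>` elements with `unit` and `type` values as the Trojan-horse markers is only an assumption for illustration; the proposal does not fix any particular notation.

```xml
<p>
  <!-- Option A: <w> nested inside a phrase-level element, itself inside <s>.
       <w> now occurs both as a child and as a grandchild of <s>. -->
  <s><w>She</w> <w>read</w> <title><w>Middlemarch</w></title> <w>twice</w></s>

  <!-- Option B (Trojan-horse style, hypothetical notation): paired empty markers
       replace the container, so every <w> stays a direct child of <s>. -->
  <s><w>She</w> <w>read</w> <milestone unit="title" type="start"/><w>Middlemarch</w><milestone unit="title" type="end"/> <w>twice</w></s>
</p>
```

In option B a consumer has to pair up the start and end markers itself to recover the extent of the title, which is the extra downstream complexity mentioned above.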
(I note in passing that <name> might well be used to mark the result of named entity recognition.)
Currently <quote> is allowed all over the place and may contain just words, unwrapped in <p> or <l>. I think that should probably be disallowed.
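For illustration, a sketch of the distinction, with invented content. The second form assumes that a tokenized `<quote>` would wrap its words in `<p>` (or `<l>`) and `<s>` like any other text; that is my reading of the suggestion rather than a settled rule.

```xml
<!-- Currently permitted, proposed to be disallowed: bare words directly in <quote> -->
<quote>All is well</quote>

<!-- Proposed: words are reached only through <p> (or <l>) and <s> -->
<quote>
  <p><s><w>All</w> <w>is</w> <w>well</w></s></p>
</quote>
```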