Specifying regions of text
From WhyNotWiki
Requirements specification
Maybe it's not a "problem" so much as a requirement...
I need to be able to refer to regions in a text document, and these regions may overlap.
XML forbids overlapping tags such as <a>1<b>2</a>3</b>, but I wouldn't necessarily be using XML. Even if I were, I could work around that limitation if I had to -- for example, by building each region out of smaller non-overlapping regions that are "concatenated".
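One way the "concatenating" workaround might look, as a rough sketch: cut the text at every region boundary so that no piece overlaps another, and remember which regions cover each piece. The function names and the (start, end) character-offset representation are assumptions made for illustration, not part of any existing tool.

```python
def split_into_segments(regions):
    """Split possibly overlapping regions into non-overlapping segments.

    `regions` maps a region name to a (start, end) pair of character
    offsets, end exclusive (an assumed representation). Returns a list of
    (start, end, names) tuples, where `names` lists the regions that
    cover that segment.
    """
    # Every region boundary is a potential segment boundary.
    cuts = sorted({pos for span in regions.values() for pos in span})
    segments = []
    for start, end in zip(cuts, cuts[1:]):
        names = sorted(
            name for name, (s, e) in regions.items() if s <= start and end <= e
        )
        if names:  # skip gaps that no region covers
            segments.append((start, end, names))
    return segments


def region_text(text, segments, name):
    """Rebuild one region by concatenating the segments that belong to it."""
    return "".join(text[s:e] for s, e, names in segments if name in names)


text = "123"
regions = {"a": (0, 2), "b": (1, 3)}  # the <a>1<b>2</a>3</b> case
segments = split_into_segments(regions)
```

Here the segments come out as "1" (only a), "2" (both a and b) and "3" (only b), none of which overlap, so they could be written as ordinary nested-free markup and still be reassembled into the original regions.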
How specific?
- At the very least, I should be able to specify regions at the sentence level.
- It would probably be nice to be able to refer to regions at the word level too -- regions within a sentence, or spanning sentences but not ending on a sentence boundary.
- I would probably use it very rarely, but while I'm at it, maybe I should allow regions to be specified down to the individual character level (?)
Embedded or overlaid?
Perhaps both of these should be possible...
"Embedded" means that the markup indicating regions is included in the source document itself.
- This would probably be done by having some "start" and "stop" indicator for every region.
- This would only be possible if you had access to the source document -- for example, if you were the author.
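A minimal sketch of how embedded "start" and "stop" indicators might be read. The marker syntax `[[start:name]]` / `[[stop:name]]` is invented here purely for illustration; any unambiguous delimiter would do. Because each region is identified by name rather than by nesting, overlapping regions are not a problem.

```python
import re

# Hypothetical marker syntax: [[start:name]] ... [[stop:name]]
MARKER = re.compile(r"\[\[(start|stop):(\w+)\]\]")


def parse_embedded(source):
    """Strip the markers and return (plain_text, regions).

    `regions` maps each region name to (start, end) character offsets
    into the plain text, end exclusive.
    """
    pieces = []    # chunks of text with the markers removed
    length = 0     # length of the plain text collected so far
    open_at = {}   # region name -> offset where it was opened
    regions = {}
    pos = 0
    for match in MARKER.finditer(source):
        chunk = source[pos:match.start()]
        pieces.append(chunk)
        length += len(chunk)
        kind, name = match.groups()
        if kind == "start":
            open_at[name] = length
        else:
            # A stop marker with no matching start raises KeyError.
            regions[name] = (open_at.pop(name), length)
        pos = match.end()
    pieces.append(source[pos:])
    return "".join(pieces), regions


source = "[[start:a]]1[[start:b]]2[[stop:a]]3[[stop:b]]"
plain, regions = parse_embedded(source)
```

For the overlapping example this yields the plain text "123" with region a covering offsets 0-2 and region b covering offsets 1-3.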
"Overlaid" means that the region specifications live in a separate file but apply to some source document -- that is, a region specification is "overlaid" on top of the source document.
- Start and end points would probably be indicated by specifying a certain number of words or characters after the start of the document, or by supplying a context regular expression.
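The three ways of indicating start and end points could be sketched roughly as follows. The `kind` names ("char", "word", "regex") and the function itself are assumptions for illustration; words are taken to be runs of non-whitespace characters, which is only one possible definition.

```python
import re


def resolve_region(text, kind, start, end):
    """Turn an overlaid region spec into (start, end) character offsets.

    kind == "char":  start/end are character offsets, end exclusive.
    kind == "word":  start/end are 0-based word indexes, end exclusive.
    kind == "regex": start/end are context patterns; the region runs from
                     the first match of `start` to the end of the first
                     match of `end` found at or after that point.
    """
    if kind == "char":
        return start, end
    if kind == "word":
        words = list(re.finditer(r"\S+", text))
        return words[start].start(), words[end - 1].end()
    if kind == "regex":
        first = re.search(start, text)
        if first is None:
            raise ValueError(f"start pattern not found: {start!r}")
        last = re.compile(end).search(text, first.start())
        if last is None:
            raise ValueError(f"end pattern not found: {end!r}")
        return first.start(), last.end()
    raise ValueError(f"unknown kind: {kind!r}")


text = "The quick brown fox jumps over the lazy dog."
by_char = resolve_region(text, "char", 4, 9)
by_word = resolve_region(text, "word", 1, 3)
by_regex = resolve_region(text, "regex", r"quick", r"fox")
```

With this sample sentence, the character spec selects "quick", the word spec selects "quick brown", and the regex spec selects "quick brown fox".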
The main problem with "overlaid" specifications is that if the source document changes, it may break the region specification: the stored offsets or context patterns may no longer point at the intended text.
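To make the breakage concrete, here is a small sketch of one possible safeguard: storing a snapshot of the expected text alongside the offsets, so that a stale specification can at least be detected and sometimes re-anchored. The `expected` field and both helper functions are assumptions, not features of any existing format.

```python
def check_region(text, spec):
    """Return True if the offsets in `spec` still select the expected text."""
    return text[spec["start"]:spec["end"]] == spec["expected"]


def relocate(text, spec):
    """Try to re-anchor a broken spec by searching for its expected text.

    Returns new (start, end) offsets only when the expected text occurs
    exactly once; otherwise returns None, since the match is ambiguous
    or the text is gone.
    """
    expected = spec["expected"]
    if text.count(expected) != 1:
        return None
    start = text.find(expected)
    return start, start + len(expected)


original = "The quick brown fox."
revised = "The very quick brown fox."  # a word was inserted before the region
spec = {"start": 4, "end": 9, "expected": "quick"}

ok_before = check_region(original, spec)
ok_after = check_region(revised, spec)
new_offsets = relocate(revised, spec)
```

After the edit, offsets 4-9 select "very " instead of "quick", so the check fails; searching for the snapshot finds the word again at offsets 9-14.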
Existing technologies?
For instance, would XPath allow what I'm trying to do?
