Specifying regions of text

From WhyNotWiki

Requirements specification

Maybe it's not a "problem" so much as a requirement...

I need to be able to refer to regions of a text document, and these regions may overlap.

XML forbids overlapping tags such as <a>1<b>2</a>3</b>. But I wouldn't necessarily be using XML, and even if I were, I could work around that limitation if I had to: for example, by representing each overlapping region as a "concatenation" of non-overlapping pieces.
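The "concatenating" workaround could be sketched roughly as follows. This is only an illustration under my own assumptions: regions are given as half-open character offsets, and the names `split_into_segments` and `region_text` are made up for this example. The idea is to cut the text at every region boundary, so that each resulting segment is non-overlapping and simply lists the regions it belongs to.

```python
def split_into_segments(regions):
    """Cut overlapping regions into non-overlapping segments.

    regions maps a region name to a half-open (start, end) character range.
    Returns a list of (start, end, names) tuples, where names lists every
    region that fully covers that segment.
    """
    # Every region start or end is a potential cut point.
    boundaries = sorted({p for start, end in regions.values() for p in (start, end)})
    segments = []
    for seg_start, seg_end in zip(boundaries, boundaries[1:]):
        names = sorted(
            name
            for name, (start, end) in regions.items()
            if start <= seg_start and seg_end <= end
        )
        if names:  # skip gaps that belong to no region
            segments.append((seg_start, seg_end, names))
    return segments


def region_text(text, segments, name):
    """Rebuild one region by concatenating the segments that carry its name."""
    return "".join(text[s:e] for s, e, names in segments if name in names)


# The <a>1<b>2</a>3</b> example: "a" covers "12" and "b" covers "23".
text = "123"
regions = {"a": (0, 2), "b": (1, 3)}
segments = split_into_segments(regions)
```

Each segment could then be written out as a properly nested element that names its regions, and a region is recovered by joining all segments that mention it.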

How specific?

  • At the very least, I should be able to specify regions at the sentence level.
  • It would probably be nice to be able to refer to regions at the word level as well: regions within a sentence, or regions that span sentences but do not end on a sentence boundary.
  • I would probably use it very rarely, but while I'm at it, maybe I should also allow regions to be specified down to the individual character.

Embedded or overlaid?

Perhaps both of these should be possible...

"Embedded" means that the markup indicating the regions is included in the source document itself.

  • This would probably be done by having some "start" and "stop" indicator for every region.
  • This would only be possible if you had access to the source document -- for example, if you were the author.
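A minimal sketch of the embedded approach, assuming a made-up marker syntax of `[[start:NAME]]` and `[[stop:NAME]]` (nothing here is an existing standard; the syntax and the function name `extract_regions` are hypothetical). Because each region has its own named start and stop indicator, overlapping regions are not a problem.

```python
import re

# Hypothetical marker syntax: [[start:NAME]] ... [[stop:NAME]]
MARKER = re.compile(r"\[\[(start|stop):(\w+)\]\]")


def extract_regions(marked_up):
    """Strip the markers and return (plain_text, regions).

    regions maps each region name to a half-open (start, end) character
    range in the plain text. A stop marker without a matching start
    marker raises KeyError.
    """
    plain_parts = []
    open_at = {}   # region name -> plain-text offset where it started
    regions = {}
    pos = 0        # current position in the marked-up text
    offset = 0     # current position in the plain text
    for match in MARKER.finditer(marked_up):
        chunk = marked_up[pos:match.start()]
        plain_parts.append(chunk)
        offset += len(chunk)
        kind, name = match.groups()
        if kind == "start":
            open_at[name] = offset
        else:
            regions[name] = (open_at.pop(name), offset)
        pos = match.end()
    plain_parts.append(marked_up[pos:])
    return "".join(plain_parts), regions


marked_up = "[[start:a]]1[[start:b]]2[[stop:a]]3[[stop:b]]"
plain, regions = extract_regions(marked_up)
```

Here `plain` is the document with the markers removed, and `regions` holds the overlapping ranges for "a" and "b".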

"Overlaid" means that the region specifications live in a separate file but apply to some source document; that is, a region specification is "overlaid" on top of the source document.

  • Start and end points would probably be indicated by specifying a certain number of words or characters after the start of the document, or by supplying a context regular expression.
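The two ways of locating start and end points might look something like this. It is only a sketch under my own assumptions: "words" are taken to be runs of non-whitespace characters, the context regular expression marks the region with its first capturing group, and the function names `word_span` and `context_span` are invented for this example.

```python
import re


def word_span(text, first_word, word_count):
    """Locate a region by counting words from the start of the document.

    first_word is a zero-based word index. Returns a half-open (start, end)
    character range, or None if the document has too few words.
    """
    words = list(re.finditer(r"\S+", text))
    selected = words[first_word:first_word + word_count]
    if not selected or len(selected) < word_count:
        return None
    return selected[0].start(), selected[-1].end()


def context_span(text, pattern):
    """Locate a region with a context regular expression.

    The first capturing group of pattern marks the region itself; the rest
    of the pattern is surrounding context. Returns None if nothing matches.
    """
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.span(1)


source = "The quick brown fox. It jumps over the lazy dog."
start, end = word_span(source, 1, 2)                 # the 2nd and 3rd words
ctx = context_span(source, r"It (jumps over) the")   # region found by context
```

Both functions resolve to plain character ranges, so sentence-level, word-level and character-level regions could all end up in the same internal form.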

The main problem with "overlaid" specifications is that if the source document changes, it may break the region specification.
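To make the breakage concrete, here is a small sketch. It assumes, purely for illustration, that the region specification stores the expected text next to its character offsets so that a stale specification can at least be detected; the field names and the function `region_still_valid` are hypothetical.

```python
def region_still_valid(text, spec):
    """Return True if the offsets in spec still select the expected text."""
    return text[spec["start"]:spec["end"]] == spec["expected"]


original = "The quick brown fox."
spec = {"start": 4, "end": 15, "expected": "quick brown"}

# Editing the source document shifts every offset that follows the edit.
edited = "Look! " + original

ok_before = region_still_valid(original, spec)   # offsets still match
ok_after = region_still_valid(edited, spec)      # offsets are now stale

# One possible fallback (an assumption, not a complete solution): search
# for the expected text again. This fails if the text itself was changed
# and may pick the wrong occurrence if the text appears more than once.
relocated = edited.find(spec["expected"])
```

A context regular expression is less fragile than raw offsets in this respect, since it survives edits elsewhere in the document as long as the surrounding context is left alone.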



Existing technologies?

For instance, would XPath allow what I'm trying to do?
