[ic] How would you search, store, and display documents

Peter peter at pajamian.dhs.org
Sat Aug 13 02:15:24 UTC 2011


> In my research of lucene I ran across this post on someone contemplating
> exactly my issue
> 
> http://robrohan.com/2007/02/09/do-you-save-html-in-your-relational-database/
> 
> In there he proposes a pretty nifty idea - Here it is in summary:
> 
> ====================================================
> One solution I am kicking around is trying to write / find some sort of
> text style markup language that is stored separate from the text data
> (This has to exist somewhere, probably an old school Unix format, 
> but I am not even sure where to start looking). I am thinking it could 
> work something like this:
> 
> The stylesheet, in its most basic form, would be a type and 
> position-length pair. So for the text:
> 
> This <b>is</b> <i>example</i> text, <b>man</b>.
> 
> A parser would sniff out the tags, and make a stylesheet that could look
> like:
> 
> (sheet (bold (5,2), (22,3)), (italic (8,7)) )
> ====================================================
> 
> I read through the comments and the only valid issue someone had about it
> was regarding editing and resyncing the logistics. However, my simple
> solution to that is to delete and resubmit all of this "positional
> logistics" each time, thereby not needing to "adjust positions".
> 
> Not that I can build this kind of thing myself, but I think it would not be
> that complicated. In fact, instead of supporting code standards, why not
> just store the tag verbatim, so in this person's example it would be more
> like:
> 
> 5|<b>|,7|</b>|,8|<i>|,15|</i>|
> 
> Could this not be stored in a single field, then applied via a regex on
> output?
> 
> My target dataset would be something like the body of a blog post. Anything
> interactive would be built by IC on the page itself as the environment.
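
To make the quoted scheme concrete, here is a rough Python sketch of what
it would involve. The names split_markup and apply_markup are made up for
illustration, and the regex is a naive assumption about what counts as a
tag (it ignores entities, comments, and a ">" inside attribute values).
The offsets are 0-based positions in the plain text, matching the quoted
example.

```python
import re

# Naive tag matcher: an assumption for this sketch, not a real HTML parser.
TAG = re.compile(r"<[^>]+>")

def split_markup(html):
    """Split html into (plain_text, marks); marks is a list of (offset, tag)."""
    parts, marks = [], []
    pos = 0    # current offset in the plain text
    last = 0   # end of the previous tag in the html string
    for m in TAG.finditer(html):
        chunk = html[last:m.start()]
        parts.append(chunk)
        pos += len(chunk)
        marks.append((pos, m.group()))
        last = m.end()
    parts.append(html[last:])
    return "".join(parts), marks

def apply_markup(text, marks):
    """Re-insert the stored tags into text at their recorded offsets."""
    out, last = [], 0
    for offset, tag in marks:
        out.append(text[last:offset])
        out.append(tag)
        last = offset
    out.append(text[last:])
    return "".join(out)

text, marks = split_markup("This <b>is</b> <i>example</i> text, <b>man</b>.")
# marks can be serialized into one field as offset|tag| pairs
field = ",".join(f"{offset}|{tag}|" for offset, tag in marks)
```

Any edit to the text invalidates every later offset, so the marks would
have to be regenerated from a marked-up copy on each save.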

This seems like far too much work, and too much re-inventing the wheel, to
be worthwhile.  If you just store the HTML, it is trivially easy to get a
plain-text document from it by parsing out the tags (there are certainly
modules that do this for you).  IMO that is much preferable to maintaining
position data in a separate document, where you have to hope you can keep
it in sync with the text and then cobble it all back together again to
display the formatted version.
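
As an illustration of how little code the tag-stripping side needs, here
is a minimal sketch in Python using only the standard library's
html.parser module. The names TextExtractor and html_to_text are
hypothetical, and the sketch assumes a simple body such as a blog post
(it would also keep the contents of any script or style elements).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document, dropping every tag."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(html):
    """Return the plain-text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

plain = html_to_text("This <b>is</b> <i>example</i> text, <b>man</b>.")
```

The stored HTML stays the single source of truth, and the plain text is
derived on demand, so there is nothing to keep in sync.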

As for searchability, there are search engines (such as swish-e) that
can parse and index the files right from the HTML version.  You just set
up an indexing run from a cron job and run the search through the engine.


Peter



More information about the interchange-users mailing list