Separating markup from text

Tim Johnson · Post by **Tim Johnson** » Fri Nov 20, 2009 3:51 pm

REBOL has a function/refinement called load/markup that parses a string, URL or file
into alternating text and tags. I would like to be able to do that in newLISP.
I've tried xml-parse with no luck, although xml-parse does a wonderful job of parsing
individual markup tags into a data structure.

Did I miss something about xml-parse, or is there another way?
thanks
tim

cormullion · Post by **cormullion** » Fri Nov 20, 2009 6:42 pm

I have this in one of my files somewhere - it might be markdown.lsp ...

Code:

(define (tokenize-html xhtml)
; return list of tag/text portions of xhtml text
  (letn (
       (tag-match [text]((?s:<!(-- .*? -- \s*)+>)|
(?s:<\?.*?\?>)|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>])*>))*>))*>))*>))*>))*>))[/text]) ; yeah, well...
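      (str xhtml)
      (tag-start nil) ; keep the match offset local instead of global
      (tokens '())
      )
 ; 9 = extended syntax (8) + case-insensitive (1), so <DIV> matches as well
 (while (set 'tag-start (find tag-match str 9))
    ; text in front of the tag, if any
    (if (> tag-start 0)
       (push (list 'text (slice str 0 tag-start)) tokens -1))
    (push (list 'tag $0) tokens -1)
    ; carry on with whatever follows the tag
    (set 'str (slice str (+ tag-start (length $0)))))
 ; leftovers: skip the token when nothing follows the last tag
  (if (> (length str) 0)
     (push (list 'text str) tokens -1))
  tokens)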
)

(set 'tokens (tokenize-html (get-url {http://newlispfanclub.alh.net/forum/viewtopic.php?f=16&t=3386})))

I have no idea whether it works or not, but I know it struggles with some stuff such as JavaScript embedded in script elements.

Tim Johnson · Post by **Tim Johnson** » Fri Nov 20, 2009 7:26 pm

I just did a test and it seems to work fine. I included a simple JavaScript function inside a <script></script> element.
This is great. I'd recommend this as a native function. Between that and xml-parse there would be a powerful tool.

BTW: this is the beginning of a project for me: decomposing HTML text into a data structure that allows
modification in a pseudo-DOM fashion, such as loading records into forms, setting form actions, etc. I have such
functionality in REBOL and Python and need the same for newLISP.
Thanks very much cormullion, you've saved me a bunch of time.

cheers
tim (a boob when it comes to regex)

cormullion · Post by **cormullion** » Fri Nov 20, 2009 7:49 pm

Cool - it's a start!

I think it fails on this page because there's a greater-than sign in the JavaScript code, and the script block starts with <![CDATA[ (less-than, bang, bracket, CDATA). It's possible that the regexes could be tweaked, but I wonder whether that would be the start of a never-ending job. Your big problem might be not with this kind of valid HTML but with invalid HTML...

- I hate regexes more than you! :)
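
The failure mode described above can be reproduced on a small scale. This is only a sketch: simple-tokenize is a hypothetical helper, and its one-line pattern is a cut-down stand-in for the nested regex in tokenize-html, chosen so the problem is easy to see.

```newlisp
; Hypothetical helper, not from the thread: split on a deliberately simple
; tag pattern so the failure mode is easy to see.
(define (simple-tokenize str , tokens start)
  (set 'tokens '())
  ; option 1 makes the match case-insensitive
  (while (set 'start (find {<[a-z/!$][^<>]*>} str 1))
    (if (> start 0)
        (push (list 'text (slice str 0 start)) tokens -1))
    (push (list 'tag $0) tokens -1)
    (set 'str (slice str (+ start (length $0)))))
  (if (> (length str) 0)
      (push (list 'text str) tokens -1))
  tokens)

(println (simple-tokenize "<script>if(a<b&&c>d){}</script>"))
; the comparison a<b&&c>d comes back as a bogus tag token "<b&&c>"
```

Any pattern that treats "<" plus a letter as the start of a tag will misread script code like this, which is why tweaking the regex tends to turn into an open-ended job.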

Tim Johnson · Post by **Tim Johnson** » Fri Nov 20, 2009 9:14 pm

One could use a brute-force, iterate-on-every-character approach that would overcome this problem by consolidating
a tag, identifying its type and ignoring certain '>' and '<' characters. It would be a performance hit, but the data
structure could be stored with 'save and only rebuilt on an mtime check when the source document has changed. Because I
do work for those who like to push the limits and play with ideas, no doubt I'm going to end up using the "brute force"
method, but you've given me a starting point.
thanks again
tim
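
A rough sketch of what that brute-force idea might look like in newLISP follows. None of it comes from the thread: tag-end, scan-html, cached-tokens and the html-tokens symbol are all hypothetical names, and the scanner assumes only a few special cases (comments, CDATA sections, quoted attribute values and script bodies). The cache helper also assumes the HTML file exists.

```newlisp
; All names here (tag-end, scan-html, cached-tokens) are hypothetical.

; Index just past the ">" that closes the tag at the start of src,
; ignoring any ">" inside a quoted attribute value.
(define (tag-end src , pos q ch done)
  (set 'pos 0)
  (while (and (< pos (length src)) (not done))
    (set 'ch (slice src pos 1))
    (cond
      (q (if (= ch q) (set 'q nil)))            ; inside quotes: wait for the closing quote
      ((or (= ch "\"") (= ch "'")) (set 'q ch)) ; opening quote
      ((= ch ">") (set 'done true)))
    (set 'pos (+ pos 1)))
  pos)

; Consume src from the front, one typed token at a time.
(define (scan-html src , tokens n kind raw)
  (set 'tokens '())
  (while (> (length src) 0)
    (cond
      (raw                                  ; script body: plain text up to </script
        (set 'n (or (find "</script" src 1) (length src)) 'kind 'text 'raw nil))
      ((starts-with src "<!--")
        (set 'n (find "-->" src) 'kind 'comment)
        (set 'n (if n (+ n 3) (length src))))
      ((starts-with src "<![CDATA[")
        (set 'n (find "]]>" src) 'kind 'cdata)
        (set 'n (if n (+ n 3) (length src))))
      ((find {^<[a-z/!?]} src 1)            ; any other tag
        (set 'n (tag-end src) 'kind 'tag
             'raw (find {^<script[\s>]} src 1))) ; non-nil when a script element opens
      (true                                 ; text up to the next "<"
        (set 'n (find "<" (slice src 1)) 'kind 'text)
        (set 'n (if n (+ n 1) (length src)))))
    (if (> n 0) (push (list kind (slice src 0 n)) tokens -1))
    (set 'src (slice src n)))
  tokens)

; Rebuild only when the HTML file is newer than the saved token list.
(define (cached-tokens html-file cache-file)
  (if (and (file? cache-file)
           (>= (file-info cache-file 6) (file-info html-file 6))) ; 6 = modification time
      (load cache-file)                     ; restores the html-tokens symbol
      (begin
        (set 'html-tokens (scan-html (read-file html-file)))
        (save cache-file 'html-tokens)))
  html-tokens)

(println (scan-html {<p title="a>b">x</p><script>if(a<b&&c>d)go()</script>}))
```

In this sketch the ">" inside the title attribute stays part of the p tag, and the whole script body comes back as a single text token. Invalid HTML, style elements and a literal "</script" inside a JavaScript string are not handled.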

newlispfanclub.alh.net
