Rebol has a function/refinement called load/markup that parses a string, URL or file
into alternating text and tags. I would like to be able to do that in newLISP.
I've tried xml-parse with no luck, although xml-parse does a wonderful job of parsing
individual markup tags into a data structure.
Did I miss something about xml-parse, or is there another way?
thanks
tim
Separating markup from text
-
- Posts: 253
- Joined: Thu Oct 07, 2004 7:21 pm
- Location: Palmer Alaska USA
Programmer since 1987. Unix environment.
-
- Posts: 2038
- Joined: Tue Nov 29, 2005 8:28 pm
- Location: latitude 50N longitude 3W
- Contact:
Re: Separating markup from text
I have this in one of my files somewhere - it might be markdown.lsp ...
I have no idea whether it works or not, but I know it struggles with some things, such as JavaScript embedded in script elements.
Code: Select all
(define (tokenize-html xhtml)
  ; return list of tag/text portions of xhtml text
  (letn (
      (tag-match [text]((?s:<!(-- .*? -- \s*)+>)|
(?s:<\?.*?\?>)|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>]|
(?:<[a-z/!$](?:[^<>])*>))*>))*>))*>))*>))*>))[/text]) ; yeah, well...
      (str xhtml)
      (tag-start nil) ; kept local so it does not leak into MAIN
      (tokens '())
      )
    ; option 8 is PCRE extended mode: whitespace and newlines in the pattern are ignored
    (while (set 'tag-start (find tag-match str 8))
      ; text in front of the tag, if any
      (if (> tag-start 0)
        (push (list 'text (slice str 0 tag-start)) tokens -1))
      (push (list 'tag $0) tokens -1)
      ; drop everything up to and including the tag just matched
      (set 'str (slice str (+ tag-start (length $0)))))
    ; leftovers: whatever remains after the last tag (skipped when empty)
    (if (> (length str) 0)
      (push (list 'text str) tokens -1))
    tokens)
)
(set 'tokens (tokenize-html (get-url {http://newlispfanclub.alh.net/forum/viewtopic.php?f=16&t=3386})))
Re: Separating markup from text
I just did a test and it seems to work fine. I included a simple JavaScript function between <script></script> tags.
This is great. I recommend this as a native function. Between that and xml-parse there would be a powerful tool.
BTW, this is the beginning of a project for me: decomposing HTML text into a data structure that allows
modification in a pseudo-DOM fashion, such as loading records into forms, setting form actions, etc. I have such
functionality with Rebol and Python and need the same for newLISP.
Thanks very much cormullion, you've saved me a bunch of time.
cheers
tim (a boob when it comes to regex)
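For what it's worth, here is a rough sketch of what "setting form actions" could look like on top of the (tag ...) / (text ...) list that tokenize-html returns. The name set-form-action is made up for illustration, and the sketch assumes the action attribute is already present and double-quoted; anything else is left untouched.

```newlisp
; Sketch only: set-form-action is a hypothetical helper, not part of newLISP
; or markdown.lsp. It expects tokens shaped like (tag "...") or (text "...").
(define (set-form-action tokens url)
  (map (fn (tok)
         (let (kind (first tok) body (last tok))
           ; option 1 = caseless regex, so <FORM ...> is matched too
           (if (and (= kind 'tag) (starts-with body {<form\b} 1))
             ; rewrite an existing double-quoted action attribute
             (list 'tag (replace {action="[^"]*"} body
                                 (string {action="} url {"}) 0))
             tok)))
       tokens))

; usage: returns a new token list, the original list is not modified
(set-form-action '((tag {<form action="old.cgi">}) (text "hi")) "new.cgi")
```

Joining the token strings back together afterwards would give the modified page.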
Re: Separating markup from text
Cool - it's a start!
I think it fails on this page because there's a greater-than sign in the JavaScript code, and the script starts with less-than, bang, bracket, CDATA (<![CDATA[). It's possible that the regexes could be tweaked, but I wonder whether that would be the start of a never-ending job. Your big problem might be not with this kind of valid HTML but with invalid HTML...
- I hate regexes more than you! :)
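One possible tweak, sketched here under assumptions and not tested against real pages: cut out whole script elements first, so that any < or > inside the JavaScript never reaches the tag regex. The name split-scripts and the 'html / 'script labels are invented for this example; the 'html pieces could then be passed to tokenize-html one at a time.

```newlisp
; Sketch only: split-scripts is a hypothetical helper, not part of markdown.lsp.
; Returns a list of (html "...") and (script "...") pieces in document order.
(define (split-scripts str)
  (let (pieces '() start nil)
    ; option 5 = caseless (1) + dot matches newline (4)
    (while (set 'start (find {<script\b.*?</script\s*>} str 5))
      ; markup in front of the script element, if any
      (if (> start 0)
        (push (list 'html (slice str 0 start)) pieces -1))
      ; the whole element, opening tag through closing tag, as one piece
      (push (list 'script $0) pieces -1)
      (set 'str (slice str (+ start (length $0)))))
    ; whatever follows the last script element
    (if (!= str "")
      (push (list 'html str) pieces -1))
    pieces))

; usage
(split-scripts "a<script>if (x > 1) {}</script>b")
```

This still breaks if the JavaScript itself contains the literal text </script> inside a string, so it only moves the problem.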
Re: Separating markup from text
One could use a brute-force, iterate-on-every-character approach that would overcome this problem by consolidating
a tag, identifying its type and ignoring certain '>'s and '<'s. It would be a performance hit, but the data structure could
be stored with 'save and only rebuilt, based on an mtime check, when the source document has changed. Because I do work for
those who like to push the limits and play with ideas, no doubt I'm going to end up using the "brute force" method, but
you've given me a starting point.
thanks again
tim
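A minimal sketch of that brute-force idea, with assumptions stated up front: scan-markup and cached-tokens are made-up names, the scanner only ignores a '>' that sits inside a quoted attribute value, and it does not yet identify tag types (comments, CDATA and script bodies would need their own states). The cache part uses save, load and the modification time from file-info; the global symbol doc-tokens is also invented here.

```newlisp
; Sketch only: scan-markup, cached-tokens and doc-tokens are hypothetical names.
(define (scan-markup str)
  (let (tokens '() buf "" in-tag nil quote-char nil)
    (dolist (c (explode str))
      (cond
        ; inside a quoted attribute value: copy everything, even < and >
        (quote-char
          (set 'buf (append buf c))
          (if (= c quote-char) (set 'quote-char nil)))
        ; inside a tag: watch for an opening quote or the closing >
        (in-tag
          (set 'buf (append buf c))
          (cond
            ((or (= c "\"") (= c "'")) (set 'quote-char c))
            ((= c ">")
              (push (list 'tag buf) tokens -1)
              (set 'buf "" 'in-tag nil))))
        ; in text: a < ends the text run and starts a tag
        ((= c "<")
          (if (!= buf "") (push (list 'text buf) tokens -1))
          (set 'buf "<" 'in-tag true))
        ; ordinary text character
        (true (set 'buf (append buf c)))))
    ; leftovers, including an unterminated tag at the end of input
    (if (!= buf "")
      (push (list (if in-tag 'tag 'text) buf) tokens -1))
    tokens))

; Rebuild the token list only when the cache file is missing or not newer
; than the source (index 6 of file-info is the modification time).
(define (cached-tokens src cache)
  (if (or (not (file? cache))
          (<= (file-info cache 6) (file-info src 6)))
    (begin
      (set 'doc-tokens (scan-markup (read-file src)))
      (save cache 'doc-tokens))
    (load cache))
  doc-tokens)

(scan-markup {<a title="x>y">hi</a>})
; → ((tag "<a title=\"x>y\">") (text "hi") (tag "</a>"))
```

Appending one character at a time is slow on large pages, which is where the cache would earn its keep.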