>Has anyone developed code to clean up what passes for HTML from a Word document? I have a few apps with users pasting documents from Word - either directly or indirectly - and it makes a mess and sometimes breaks my web application pages.
If I remember correctly, it was either Ted Roche or Rick Schummer who demonstrated such a tool at one of the conferences, but mostly in passing - the emphasis was on what was done with the cleaned-up document afterwards. I think it was at the last Whilfest, in 2003. I've browsed through my conference downloads but couldn't find it, so I'm not sure my memory is quite OK :).
Generally, you could parse the text for tags... and then untag the thing.
Here's my untag function:
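Procedure untag(c)
* Returns whatever sits between the first tag in c and its closing tag.
Local lcTag, lcCloseTag
* First tag, delimiters included (flag 4), e.g. <p class=MsoNormal>
lcTag=Strextract(c, "<", ">",1,4)
* Tag name only; ">" is a delimiter too, so a bare <p> gives "p", not "p>"
lcCloseTag="</"+Getwordnum(lcTag, 1, "<> ")+">"
* 1+2 = case-insensitive, and a missing closing tag is tolerated
Return Strextract(c, lcTag, lcCloseTag,1,1+2)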
You can just chop the text into paragraphs (looking for pairs of matching P or H tags and anything between them), then work your way inside each paragraph, stripping some tags as you go.
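A rough sketch of that paragraph-by-paragraph approach, in case it helps. The names WordHtmlToText and StripTags are made up for this example, it only looks at P tags (H tags would need the same treatment), and it assumes paragraphs are not nested and that every "<" in the text starts a tag. Treat it as a starting point, not a finished cleaner.

```foxpro
* Hypothetical sample input; real Word output is far messier.
Local lcHtml
lcHtml = "<p class=MsoNormal><span style='color:red'>Hello</span> world</p>" + ;
         "<p>Second <b>one</b></p>"
? WordHtmlToText(lcHtml)

* Walks the text one <p ...>...</p> pair at a time and returns the
* stripped paragraphs, one per line.
Function WordHtmlToText(tcHtml)
    Local lnI, lnCount, lcBody, lcResult
    lcResult = ""
    * Counts every "<p", so <pre> and the like are counted too;
    * they are filtered out inside the loop.
    lnCount = Occurs("<p", Lower(tcHtml))
    For lnI = 1 To lnCount
        * Text between the lnI-th "<p" and the next "</p>" (flag 1 = ignore case)
        lcBody = Strextract(tcHtml, "<p", "</p>", lnI, 1)
        * A real P tag continues with ">" or whitespace right after "<p"
        If Left(lcBody, 1) $ "> " + Chr(9) + Chr(13) + Chr(10)
            * Drop the rest of the opening tag (attributes and ">")
            lcBody = Substr(lcBody, At(">", lcBody) + 1)
            lcResult = lcResult + Alltrim(StripTags(lcBody)) + Chr(13) + Chr(10)
        Endif
    Endfor
    Return lcResult
Endfunc

* Removes every <...> sequence from tcText and keeps the text in between.
Function StripTags(tcText)
    Local lcText, lcTag
    lcText = tcText
    * Flag 4 returns the tag with its delimiters; empty when no tag is left
    lcTag = Strextract(lcText, "<", ">", 1, 4)
    Do While Not Empty(lcTag)
        lcText = Strtran(lcText, lcTag, "")
        lcTag = Strextract(lcText, "<", ">", 1, 4)
    Enddo
    Return lcText
Endfunc
```

This throws away all markup inside a paragraph. If you want to keep some tags (B, I and so on), StripTags is the place to check the tag name before removing it. Note also that the contents of Word's conditional comment and XML blocks would survive as plain text, so those probably need to be cut out before this step.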