TqrWebParser-D1 by Andrew Pam.
===2005.04.08

1. Decode all character encodings into Unicode UTF-8.
2. Extract document title, if any.
3. If the source document is MIME type text/html, text/sgml or text/xml:
   3a. Remove all comment strings between "<!--" and "-->" (including the tags).
   3b. Remove all text between [...] and [...], and [...] tags, including the tags themselves.
   3c. Remove all remaining tags (everything between "<" and ">" characters).
   3d. Decode all HTML/XML entities (e.g. "&amp;" or "&mdash;") into Unicode.
4. Replace all runs of successive whitespace characters (including line breaks) with a single space.
5. Strip off any leading and trailing whitespace.

Now we count characters in the resulting transformed document.

Hope that helps,
        Andrew
-- 
mailto:xanni@xanadu.net                   Andrew Pam
http://www.xanadu.com.au/                 Chief Scientist, Xanadu
http://www.glasswings.com.au/             Partner, Glass Wings
http://www.sericyb.com.au/                Manager, Serious Cybernetics
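
For illustration only, steps 1 to 5 above might be sketched in Python roughly as follows. This is not the TqrWebParser-D1 code itself. The function names, the `encoding` parameter and the regular expressions are assumptions made for the sketch, and because the tag names in step 3b did not survive in this copy, `strip_elements` is a hypothetical parameter whose default ("script", "style") is a guess rather than something the steps state.

```python
import html
import re

# MIME types that step 3 treats as markup.
MARKUP_TYPES = {"text/html", "text/sgml", "text/xml"}


def transform(raw, mime_type, encoding="utf-8", strip_elements=("script", "style")):
    """Return (title, text) after applying steps 1-5 to the raw bytes.

    `strip_elements` is hypothetical: the tag names for step 3b are
    missing from the description, so the default is only a guess.
    """
    # Step 1: decode the source bytes into Unicode text.
    text = raw.decode(encoding, errors="replace")

    title = None
    if mime_type in MARKUP_TYPES:
        # Step 2: extract the document title, if any.
        match = re.search(r"<title[^>]*>(.*?)</title\s*>", text, re.I | re.S)
        if match:
            title = " ".join(html.unescape(match.group(1)).split())

        # Step 3a: remove comments, including the delimiters.
        text = re.sub(r"<!--.*?-->", "", text, flags=re.S)

        # Step 3b: remove whole elements, tags and content (assumed names).
        for name in strip_elements:
            pattern = rf"<{re.escape(name)}\b[^>]*>.*?</{re.escape(name)}\s*>"
            text = re.sub(pattern, "", text, flags=re.I | re.S)

        # Step 3c: remove all remaining tags.
        text = re.sub(r"<[^>]*>", "", text)

        # Step 3d: decode entities such as &amp; into Unicode characters.
        text = html.unescape(text)

    # Step 4: collapse each run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)

    # Step 5: strip leading and trailing whitespace.
    return title, text.strip()


def count_characters(raw, mime_type, **kwargs):
    """Count characters in the transformed document."""
    return len(transform(raw, mime_type, **kwargs)[1])
```

Note that the sketch follows the steps literally, so the title text stays in the counted document unless "title" is added to `strip_elements`. Regular expressions are also a simplification: a ">" inside an attribute value, for example, would end a tag early under step 3c as written.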