I ended up answering my own question. It was pretty simple after all: I found all the characters in the field data with a value of 0x80 (decimal 128) or higher, then converted them to HTML numeric character codes by prefixing the decimal value with '&#' and suffixing it with ';'. So
146 -> &#146; (which displays as ’)
Seems to have worked well enough for my purpose although I don't know for sure that this accounts for all possibilities.
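For anyone who lands here with the same problem, here is a minimal sketch of that conversion. The function names `high_bytes_to_entities` and `byte_to_entity` are made up for this example, and it assumes the field data is in a single-byte encoding (it would mangle UTF-8 text, where high bytes are parts of multi-byte characters).

```php
<?php
// Hypothetical helper names; a sketch of the approach described above.
// Assumes single-byte input (not UTF-8).

// Callback: turn one matched byte into a decimal numeric character code.
function byte_to_entity($match) {
    return '&#' . ord($match[0]) . ';';
}

// Replace every byte from 0x80 to 0xFF with '&#' . decimal value . ';'.
function high_bytes_to_entities($text) {
    return preg_replace_callback('/[\x80-\xFF]/', 'byte_to_entity', $text);
}

echo high_bytes_to_entities("It\x92s"), "\n"; // prints It&#146;s
```

One caveat: codes 128 to 159 only display as curly quotes, ™ and so on because browsers treat them as Windows-1252. If the D4 data really is Windows-1252 (a guess based on the bytes involved, not something I verified), converting the text with something like `iconv('Windows-1252', 'UTF-8', $text)` before export may be a more robust alternative, though I have not tried that with node_import.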
+++++++++++++++++++++++++++++++++++++++++++++
I'm porting CCK content from a redesigned D4 site to D6 by extracting the data in a PHP script, generating CSV files, then using node_import to create the new nodes. It's working out pretty well so far, but I hit the following snag on one of the content types. There are hex-encoded characters in the D4 content, and somewhere in the import process the text gets truncated at the first appearance of one of them, e.g. 0x99, 0x93, 0x94.
I'm thinking the easiest way to handle it would be to preprocess the data in my PHP script, converting these characters to what they really should be anyway, i.e. HTML special characters such as the following:
0x99 -> ™
0x93 -> “
0x94 -> ”
Anyone know of an existing utility to do this for me? Otherwise I can code up a conversion method to accomplish it.
Marty