[drupal-devel] [bug] replace binary strings in node.module with hex encodings

Fri Jun 3 12:03:02 UTC 2005

Issue status update for http://drupal.org/node/24025

 Project:      Drupal
 Version:      4.6.0
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  danielc
 Updated by:   Shot
 Status:       patch

It looks like drupal-devel is not gated both ways, so I’ll add my 2¢
here as well:

> Any modern OS should support fallback fonts for missing characters.
> It's not my problem if you decide to torture yourself with vi and
> friends.

Just to clarify – vim works perfectly in a UTF-8 environment. If one
sets up a proper locale (pl_PL.UTF-8 in my case, en_US.UTF-8 in the
original poster's, perhaps) and properly configures the terminal,
everything simply works and the characters show up without a problem.

If an outsider vote counts, -1 on converting the strings to hex values.

Cheers,
-- Shot (Piotr Szotkowski)


Previous comments:
------------------------------------------------------------------------

May 31, 2005 - 22:53 : danielc

Attachment: http://drupal.org/files/issues/node.module.nobinary.diff (864 bytes)

drupal/modules/node.module contains three binary strings.  The use of
binary strings causes problems in text editors incapable of handling
such strings.

For example, I have modified this file for internal use, but when I
save the file, the binary strings get converted to question marks. 
While someone could argue that I should get a better text editor, this
issue exists for many users, not just me.  Better yet, the solution is
simple...

PHP allows representing binary characters via hex encoding.  That's
what this patch does.
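
As an illustration only (in Python rather than PHP, and not taken from the
patch itself), a hex-escaped literal and the same text typed as a UTF-8
character denote identical bytes. The byte sequence is the one quoted
further down this thread for line 217; the variable names are made up for
the sketch.

```python
# The three bytes quoted in this thread for line 217, written as hex escapes.
escaped = b"\xe3\x80\x82"

# The same text typed as a literal character (U+3002, ideographic full stop)
# and encoded as UTF-8.
literal = "。".encode("utf-8")

# Both spellings denote the same byte string.
assert escaped == literal
print(len(escaped))  # 3 bytes for one character
```

Either spelling yields the same bytes at run time; the two differ only in
how the source file reads in an editor.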

Thanks.

------------------------------------------------------------------------

June 1, 2005 - 00:23 : kbahey

+1 for this.

They never show up correctly for me (Linux, vim).

------------------------------------------------------------------------

June 1, 2005 - 01:19 : Steven

All Drupal code is UTF-8 encoded. Locale.inc contains many more such
strings and there are some in search.module too. Keeping them as
plain-text is vital to keeping the code readable, using hex escapes
reduces editability.

-1 on this.

------------------------------------------------------------------------

June 1, 2005 - 01:25 : Steven

PS: If your editor converts them to question marks, it means it doesn't
support UTF-8 properly and only handles your local ANSI codepage. You
will run into many more issues. I use Notepad2 on Windows and it works
like a charm.
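
A minimal sketch of how the question marks can arise, assuming the editor
re-saves the file in a single-byte codepage and substitutes "?" for
characters that codepage lacks. Windows-1252 stands in for the "local ANSI
codepage" here; the actual editor and its behaviour are assumptions.

```python
text = "。"  # U+3002, stored in the file as the UTF-8 bytes e3 80 82

# Hypothetical editor behaviour: save using a single-byte codepage and
# substitute "?" for anything that codepage cannot represent.
saved = text.encode("cp1252", errors="replace")

# The original three bytes are gone; only a question mark remains.
assert saved == b"?"
```

Once saved this way, the original character cannot be recovered from the
file.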

------------------------------------------------------------------------

June 1, 2005 - 02:49 : danielc

Steven wrote:
> Keeping them as plain-text is vital to keeping the code readable,
> using hex escapes reduces editability.

Right now, when I view these characters with readers that don't handle
them well, such as Mozilla looking at the web interface of the CVS
repository (go to
http://cvs.drupal.org/viewcvs/drupal/drupal/modules/node.module?annotate=1.493
then look at line 217), I see an a with a tilde, a Euro symbol, and a
comma.  I hardly consider that "readable."
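
The three characters described above are consistent with the UTF-8 bytes
being decoded as a single-byte Western encoding. A small sketch in Python,
assuming Windows-1252 (which browsers commonly substitute when a page is
labelled ISO-8859-1; strict ISO-8859-1 has no Euro sign):

```python
# UTF-8 bytes quoted in this thread for line 217.
raw = b"\xe3\x80\x82"

# Decoded as intended: a single character, U+3002.
as_utf8 = raw.decode("utf-8")

# Decoded one byte at a time as Windows-1252: three unrelated characters.
as_cp1252 = raw.decode("cp1252")

assert as_utf8 == "\u3002"
assert as_cp1252 == "\u00e3\u20ac\u201a"  # a-tilde, Euro sign, low-9 quote
assert len(as_cp1252) == 3
```

The third character is a low quotation mark (U+201A), which looks like a
comma in most fonts.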

When viewing the file in vi, the binary data already shows up as its
hex representation (for example, line 217 has "\xe3\x80\x82").  So,
changing them to an escaped/encoded string makes the string look
exactly the same.  The only difference is that my patch represents each
character using four ASCII characters.  Now everyone can easily read and
edit the values.

------------------------------------------------------------------------

June 1, 2005 - 03:53 : Steven

This is the fault of the viewcvs code, which assumes ISO-8859-1 encoding;
it has nothing to do with using UTF-8. Those mails look fine in my email
client (Thunderbird).

> The only difference is that my patch represents each character using
> four ASCII characters. Now everyone can easily read and edit the
> values.

I disagree. If I want to edit the characters now, I simply type them in
using whatever input method is appropriate. The literal bytes don't mean
anything.

If I want to hexencode a piece of UTF-8 text, I have to view the text I
typed as literal bytes somehow (so I need a hex editor?) and enter the
values in the code. If I later want to figure out what that text really
says, I have to paste the hex values again in the hex editor, save to a
text file and open it as UTF-8. This is a waste of time.
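
For reference, the round trip described above (text to hex escapes and
back) can be sketched in a few lines of Python. The helper names are
hypothetical, and the second helper only handles strings made up entirely
of \xNN escapes.

```python
def to_hex_escapes(text: str) -> str:
    """Return the UTF-8 bytes of text written as \\xNN escapes."""
    return "".join(f"\\x{byte:02x}" for byte in text.encode("utf-8"))


def from_hex_escapes(escaped: str) -> str:
    """Inverse of to_hex_escapes: parse \\xNN escapes and decode as UTF-8."""
    hex_digits = escaped.replace("\\x", "")
    return bytes.fromhex(hex_digits).decode("utf-8")


# Round trip for the character discussed in this thread (U+3002).
encoded = to_hex_escapes("。")
assert encoded == "\\xe3\\x80\\x82"
assert from_hex_escapes(encoded) == "。"
```

Either direction still needs an extra tool or script; the escaped form
cannot be read or typed directly.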

This is like saying all code should be hardwrapped at 80 characters,
because well, that's what ancient terminals use. Sorry, I don't buy it.
My computer has no problems using and displaying Unicode. There are tons
of freeware Unicode fonts around and as far as I know, most Unix tools
should handle it fine. As far as non-displayable characters go, there
is an excellent fallback font which represents them with a small box
with the Unicode codepoint in it. This is much more useful than the
literal bytes as it actually means something.

Any modern OS should support fallback fonts for missing characters.
It's not my problem if you decide to torture yourself with vi and
friends.