"Bush hid the facts"

Started by Darren Dirt, December 13, 2006, 11:15:36 PM


Mr. Analog

Quote from: Shayne on December 14, 2006, 11:17:29 PM
As I read through this, I wonder to myself, "How does this really affect me?"  I am not so sure it does.  Viva la notepad!

LOL, my thoughts exactly.
By Grabthar's Hammer

Darren Dirt

Thanks to a couple of uber-geeks, I now know a whole bunch more than I ever realized there was to know about UTF/Unicode... And within a week I will forget it all, as it really doesn't amount to a hill of beans in my daily life ;D

Wonder if this "debate" would have gotten all fired up in the first place if the thread title wasn't so darned controversial? ;)
_____________________

Strive for progress. Not perfection.
_____________________

Thorin

Another example is given here: http://weblogs.asp.net/cumpsd/archive/2004/02/27/81098.aspx

Quote
Someone showed me a weird text file today. It was a bat file with 'copy MeYou.bak MeYou.txt'. When you ran it, it would work. But when you opened it in Notepad, there was nothing.

So we decided to look a bit into this and here is something we came up with to 'create' invisible text:

Open notepad and enter:
' abc.bak abc.txt'

(That is: space abc dot bak space abc dot txt, no line break, without the quotes)

It doesn't work with every string; just follow along with this example and use that exact string.

Save your file. Notepad picks default ANSI as encoding.

Open your file again; Notepad seems to open it in Unicode encoding by default.

Your text is now invisible.
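The quoted trick can be reproduced outside Notepad. The 16 ASCII bytes of " abc.bak abc.txt" happen to pair up into 8 UTF-16 code units that all land in the CJK ideograph range. A minimal Python sketch (an illustration, not the blog's code):

```python
# The 16 ASCII bytes of the example string, as Notepad saves them (ANSI).
data = " abc.bak abc.txt".encode("ascii")

# Re-read the same bytes as UTF-16 little-endian, the way Notepad
# mis-guesses on reopen: every byte pair becomes one code unit.
decoded = data.decode("utf-16-le")

print(len(data), len(decoded))  # 16 bytes become 8 characters

# All 8 characters fall in the CJK Unified Ideographs block
# (U+4E00..U+9FFF); without a suitable CJK font, old Notepad
# rendered them as nothing, so the text looked "invisible".
print(all(0x4E00 <= ord(c) <= 0x9FFF for c in decoded))  # True
```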

This is discussed and dissected quite handily by Raymond Chen, who explains how the word "Hello" is represented by different numbers of bytes in the different encodings.  He also explains what causes the problem described at the beginning of this thread:

Quote
The encodings that do not have special prefixes and which are still supported by Notepad are the traditional ANSI encoding (i.e., "plain ASCII") and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and does some statistical analysis to come up with a guess.
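Chen's "Hello" example is easy to reproduce; here is a quick sketch of those byte counts (my illustration, not his code):

```python
word = "Hello"

print(len(word.encode("ascii")))      # 5 bytes in ANSI/ASCII
print(len(word.encode("utf-16-le")))  # 10 bytes in UTF-16 LE, no BOM
print(len(word.encode("utf-8-sig")))  # 8 bytes: 3-byte UTF-8 BOM + 5 bytes

# In UTF-16 LE, each ASCII letter becomes a byte pair with a zero
# high byte -- one reason zero-free ASCII files are ambiguous:
print(word.encode("utf-16-le").hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00
```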

IsTextUnicode is a Windows API that has been available since NT 3.5.
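The precise statistics IsTextUnicode uses aren't spelled out in this thread, but a toy version of the same idea -- decode the bytes as UTF-16 LE and ask whether the result looks statistically plausible -- might look like this (a sketch, not the actual Windows code):

```python
def looks_like_utf16le(data: bytes) -> bool:
    """Toy statistical check, loosely in the spirit of IsTextUnicode.

    NOT the real algorithm: we decode the bytes as UTF-16 LE and ask
    whether most resulting code units fall in ranges a real document
    could plausibly contain.
    """
    if len(data) < 2 or len(data) % 2 != 0:
        return False
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    plausible = sum(
        1 for u in units
        if 0x20 <= u < 0x7F            # printable ASCII
        or u in (0x09, 0x0A, 0x0D)     # tab, LF, CR
        or 0x4E00 <= u <= 0x9FFF       # CJK ideographs
    )
    return plausible / len(units) > 0.9

# The thread's example strings pass the check -- exactly the kind of
# false positive any statistical guesser suffers from:
print(looks_like_utf16le(b"Bush hid the facts"))  # True
print(looks_like_utf16le(b" abc.bak abc.txt"))    # True
```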

As for special characters marking a text file as having a particular encoding (using something called a BOM, or byte order mark): it turns out this is a standard technique that is supported not just by Notepad but also by emacs (and probably Vi, although I was unable to find any sources stating so).
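The BOM prefixes in question are just a few fixed bytes at the start of the file; Python's codecs module ships the same constants, which makes them easy to inspect:

```python
import codecs

# The standard BOM byte sequences that editors recognize at the
# start of a file:
print(codecs.BOM_UTF8.hex())      # efbbbf
print(codecs.BOM_UTF16_LE.hex())  # fffe
print(codecs.BOM_UTF16_BE.hex())  # feff

# A "Unicode" (UTF-16 LE) file as Notepad writes it: BOM, then text.
data = codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")
print(data.hex(" "))  # ff fe 48 00 69 00
```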

Ultimately I agree that trying to determine the encoding of a text file that has no special marker indicating the encoding, and thereby getting a false positive for little-endian Unicode, is a bad design decision.  However, I'm sure that when you rip apart IsTextUnicode, you'll find it was probably written quite well, with quite a bit of thought given to how to determine whether text is Unicode or not; I doubt the actual code was written by a code monkey as you define it.  After all, this is a core function included in Windows NT 3.5 and beyond, and there aren't too many OS programmers posting in forums demanding solutions with examples immediately.

I'll say this again: Including code to determine if a text file with no encoding indicator is encoded as ASCII or little-endian Unicode is a bad design decision; instead, a text file with no encoding indicator should simply always be treated as encoded as ASCII.

Quote from: Shayne on December 14, 2006, 11:17:29 PM
As I read through this, i wonder to myself, "How does this really affect me?"  I am not so sure it does.  Viva la notepad!

It probably has no direct bearing on your day-to-day work.  Still, it's nice to understand how text files are marked to indicate they're in an encoding other than ASCII, don't you think?  It's especially helpful when you need to start working with text files in software used in parts of the world that don't use English.
Prayin' for a 20!

gcc thorin.c -pedantic -o Thorin
compile successful

Tom

IMO, there is no GOOD way of auto-detecting a text file's encoding; trying to is just silly. If the system default isn't the right one, let the user change it from a menu.

Found a useful tidbit via a Google define: search:
Quote
The Unicode character U+FEFF, or its non-character mirror image U+FFFE, used to indicate the byte order of a text stream. The presence of a BOM can be a strong clue that a file is encoded in Unicode.
That's not a _bad_ idea; 0xFE and 0xFF are not printable ASCII, AFAIK, and should make a good prefix. Though I still don't know why an editor would treat a file that isn't prefixed by such a character as Unicode. Using just _any_ Unicode char (as that NT function seems to do) is a bad idea, since a number of Unicode chars can look identical to a string of valid ASCII/EBCDIC/SJIS/whatever.
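That observation can be turned into a small sniffer: check for a known BOM prefix, and if none is present fall back to ANSI/ASCII rather than guessing. A sketch of the policy argued for in this thread, not any editor's actual code:

```python
import codecs

def sniff_encoding(data: bytes) -> str:
    """Return a best-guess encoding name based only on a BOM prefix.

    No BOM means no guessing: fall back to plain ASCII/ANSI, which is
    the behaviour this thread argues Notepad should have used.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return "ascii"  # no marker: do NOT run statistics, just default

print(sniff_encoding(b"Bush hid the facts"))  # ascii -- no false positive
print(sniff_encoding(codecs.BOM_UTF16_LE + "Hi".encode("utf-16-le")))  # utf-16-le
```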

Quote
I doubt the actual code was written by a code monkey as you define it
You'd be surprised at who gets hired by some companies. Then again, maybe you wouldn't, considering who ran that one place you guys always talk about. Also, a good programmer can be forced into monkey mode depending on circumstances: you do what you're told and just want to get it done ASAP.

Quote
I'll say this again: Including code to determine if a text file with no encoding indicator is encoded as ASCII or little-endian Unicode is a bad design decision; instead, a text file with no encoding indicator should simply always be treated as encoded as ASCII.
I totally agree.

Though allowing both little- and big-endian forms of the same text file seems excessive. Force it to one and you never have to worry about endianness. I think sometimes those big standards organizations overdo/overthink things.
<Zapata Prime> I smell Stanley... And he smells good!!!

Thorin

Further reading indicated that IsTextUnicode looks at the entire text file to determine whether it has a high percentage of byte pairs that, taken together, could only be Unicode characters.  Basically, it's doing a statistical analysis.  Using this to automatically pick a specific encoding is a bad idea, but what if you popped up a dialog box asking for a choice and used IsTextUnicode to suggest the most likely encoding to the user?

And no, Darren, after first reading the title I wasn't going to look at this thread at all.  It wasn't until Lazy posted the quote from hoaxslayer that I started wondering what caused this particular quirk in Notepad, and it was the posting of my initial findings that led to this big discussion about encoding and how it's recognized.
Prayin' for a 20!

gcc thorin.c -pedantic -o Thorin
compile successful