(!) "Bush hid the facts"

Started by Darren Dirt, December 13, 2006, 11:15:36 PM


Darren Dirt

1. With Windows XP (only, apparently) open up Notepad.

2. Type the following, and save this brand new file (the filename doesn't matter):
Bush hid the facts

3. Close Notepad. Open that file you just saved.

4. Say out loud: WTF!?
_____________________

Strive for progress. Not perfection.
_____________________

Lazybones

http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html

Quote
In fact, even a line of text such as "hhhh hhh hhh hhhhh" will elicit the same results.

Since I first published this article, a few readers have pointed out that some character strings that fit the "4,3,3,5" pattern do not generate the error. For example, the phrase "Bush hid the truth" is displayed normally. However, conspiracy theorists should not take this as aiding their argument. "Fred led the brats", "brad ate the trees" and other strings also escape the error.

Thus, any hint of political conspiracy fades into oblivion and is replaced by a rather mundane programming bug. It seems probable that a certain combination and/or frequency of letters in the character string cause Notepad to misinterpret the encoding of the file when it is re-opened. If the file is originally saved as "Unicode" rather than "ANSI" the text displays correctly. Older versions of Notepad such as those that came with Windows 95, 98 or ME do not include Unicode support so the error does not occur.

So, nothing weird here at all...except perhaps for the fact that someone, somewhere had nothing better to do than turn a simple software glitch into another lame conspiracy theory. :)
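To see what the article is getting at, here's a rough little C sketch (my own illustration, not anything from Notepad's actual source) that takes the same eighteen bytes and prints them first as one-byte ANSI characters and then as two-byte little-endian UTF-16 code units; the second reading is what turns the sentence into a row of CJK characters:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* The bytes Notepad writes when the file is saved as ANSI. */
    const unsigned char text[] = "Bush hid the facts";
    size_t len = strlen((const char *)text);
    size_t i;

    /* Read as one-byte ANSI characters: you get back what you typed. */
    printf("As ANSI: %s\n", (const char *)text);

    /* Read as two-byte little-endian UTF-16 code units: each pair of
       letters becomes one code point, e.g. "Bu" becomes U+7542. */
    printf("As UTF-16LE:");
    for (i = 0; i + 1 < len; i += 2)
        printf(" U+%04X", (unsigned)(text[i] | (text[i + 1] << 8)));
    printf("\n");
    return 0;
}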

Darren Dirt

#2
"Older versions of Notepad such as those that came with Windows 95, 98 or ME do not include Unicode support so the error does not occur."

Ah, so it was a "regression bug", as they say -- adding feature X makes perfectly-working feature Y falldowngoboom. ::)


I knew it was some kinda stupid M$ bug, since taking an existing .TXT file and putting those characters in (and presumably any string matching that special pattern) results in a normal, perfectly readable file (and doing the exact same "test" but with a "." added at the end of the sentence resulted in a normal file as well).

Just thought it was funny. :)

_____________________

Strive for progress. Not perfection.
_____________________

Lazybones

It would have been funnier if it wasn't wrapped in a conspiracy-theory / political context, like more than half your posts are.

You do submit some interesting things, but I am really getting tired of the tinfoil-hat stuff.

Thorin

That made me wonder what exactly causes Notepad to think it's Unicode, though.  The minimum I could boil it down to was "hhhh hhh h" (and " hhh hhh h", "hh h hhh h", "aaaa aaa a", " aaa aaa a", "hh h hhh h").  Each two characters represents a single Unicode character, right?  So "hh" "hh" " h" "hh" " h".

Adding a single character (like a period) makes the byte count odd, so it no longer fits the pattern of valid Unicode, but adding two characters (like two periods) keeps it even, and so Notepad still thinks it's Unicode text.

If you save the file from Notepad as Unicode and look at the extra characters it puts in (you need a hex viewer for this), you'll see that two characters get added to the beginning (together they form one valid Unicode character that's not a valid ANSI character).  Looks like the programmer decided to take the first four characters and check whether at least one pair of them would form a valid Unicode character but an invalid ANSI code ("aa" and "aa" would be two valid Unicode characters, but not valid ANSI codes; " a" and "aa" has one invalid ANSI code; "aa" and " a" has one valid ANSI code; " a" and " a" has two valid ANSI codes, so Notepad treats it as ANSI).

So to sum up, it looks like the programmer hacked Unicode capability into the text file Notepad creates by simply adding two characters to the beginning that are valid Unicode but not valid ANSI code, then, when opening a file, taking the first four characters of the text to see if they construct at least one invalid ANSI code.  There's some kind of minimum-length and required-space check in there, too...  Looks to me like they took a simple idea (put a Unicode character at the beginning as a hack to mark the file as a Unicode file) and botched it into a convoluted (yet ultimately confusable) algorithm.
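If anyone wants to look at those leading bytes themselves, something like this little dumper works (a quick sketch, nothing Notepad-specific); on a file saved from Notepad as Unicode you should see FF FE at the front, which lines up with the two extra characters I described above:

#include <stdio.h>

/* Print the first few bytes of a file in hex so the marker at the
   front of a Unicode-saved file (FF FE) is visible. */
int main(int argc, char *argv[])
{
    FILE *fp;
    unsigned char buf[16];
    size_t n, i;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    fp = fopen(argv[1], "rb");
    if (fp == NULL) {
        perror(argv[1]);
        return 1;
    }
    n = fread(buf, 1, sizeof buf, fp);
    for (i = 0; i < n; i++)
        printf("%02X ", (unsigned)buf[i]);
    printf("\n");
    fclose(fp);
    return 0;
}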
Prayin' for a 20!

gcc thorin.c -pedantic -o Thorin
compile successful

Tom

#5
Quote
but not valid ANSI codes
That's just stupid. In ANSI/ASCII, aka the standard 1-byte encoding normally used in simple editors, " " is a valid ANSI/ASCII character and "a" is a valid ANSI/ASCII character. I think it's MS's fault for going with UCS16/UTF16. Everyone else uses UTF8 for the simple fact that it's 100% backwards compatible with ASCII. Sure, some characters end up getting encoded as up to 4 or 6 bytes, but who here writes in watoose? Or extended Hiragana/kanji? (Sure, there are tens of thousands of characters, but no one in their right mind uses all of them.) Standard English, and other languages like it, will not be much larger in UTF8 than in ASCII. Whereas in UCS16, all files are twice the size \o/
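To put some numbers on the variable-length thing, here's a quick sketch (the sample characters are just my own picks): plain ASCII stays at 1 byte in UTF8, accented Latin takes 2, a typical kanji takes 3, and anything outside the Basic Multilingual Plane takes 4:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Each string below is one character, hand-encoded in UTF-8. */
    const char *ascii  = "a";                  /* U+0061                    */
    const char *latin  = "\xC3\xA9";           /* U+00E9, e with acute      */
    const char *kanji  = "\xE6\xBC\xA2";       /* U+6F22, the Han character */
    const char *astral = "\xF0\x9F\x98\x80";   /* U+1F600, outside the BMP  */

    printf("ascii : %u byte(s)\n", (unsigned)strlen(ascii));
    printf("latin : %u byte(s)\n", (unsigned)strlen(latin));
    printf("kanji : %u byte(s)\n", (unsigned)strlen(kanji));
    printf("astral: %u byte(s)\n", (unsigned)strlen(astral));
    return 0;
}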

Quote
then, when opening a file, taking the first four characters of the text to see if they construct at least one invalid ANSI code.
Again, ANSI/ASCII codes are 8 bits / 1 byte. Or are you mixing up ANSI and Unicode as the same thing?
<Zapata Prime> I smell Stanley... And he smells good!!!

Ustauk

Might I suggest the inclusion of a conspiracy subforum under the General forum, with access available to only those who want it?  This would clear the threads from the general forum, and keep them from the prying eyes of the government :)

Mr. Analog

Quote from: Ustauk on December 14, 2006, 10:43:15 AM
Might I suggest the inclusion of a conspiracy subforum under the General forum, with access available to only those who want it?  This would clear the threads from the general forum, and keep them from the prying eyes of the government :)

I rather enjoy the challenge of critical thinking when these posts are made, no matter how much I disagree with them. I think burying them in a subforum would be pointless.
By Grabthar's Hammer

Shayne

Not to mention I was able to end one topic by posting a picture :)  So the weapon is available to all those who need it.

Thorin

Quote from: Tom on December 14, 2006, 10:32:26 AM
Quote
but not valid ANSI codes
That's just stupid. In ANSI/ASCII, aka the standard 1-byte encoding normally used in simple editors, " " is a valid ANSI/ASCII character and "a" is a valid ANSI/ASCII character.

At no point did I say it was a smart solution.

Quote from: Tom on December 14, 2006, 10:32:26 AM
Quote
then, when opening a file, taking the first four characters of the text to see if they construct at least one invalid ANSI code.
ANSI/ASCII codes are 8 bits / 1 byte. Or are you mixing up ANSI and Unicode as the same thing?

I'm not mixing ANSI and Unicode up.  If " a" is read as Unicode, it can be converted to the valid ANSI code "a".  If "aa" is read as Unicode, it cannot be converted to a valid ANSI code (what do you do with the extra A?).  From what I could determine in the short time I spent looking at it, this appears to be the mainstay of the algorithm used in Notepad to determine if a .txt file is in Unicode or ANSI (the two most common formats Notepad can save .txt files in).

Quote from: Tom on December 14, 2006, 10:32:26 AM
I think it's MS's fault for going with UCS16/UTF16. Everyone else uses UTF8 for the simple fact that it's 100% backwards compatible with ASCII.

Everyone else?  Care to state who that entails, what your boundary for the grouping is, and how big the MS grouping is versus the "everyone else" grouping?

Quote from: Tom on December 14, 2006, 10:32:26 AM
I think it's MS's fault for going with UCS16/UTF16. Everyone else uses UTF8 for the simple fact that it's 100% backwards compatible with ASCII. Sure, some characters end up getting encoded as up to 4 or 6 bytes, but who here writes in watoose? Or extended Hiragana/kanji? (Sure, there are tens of thousands of characters, but no one in their right mind uses all of them.) Standard English, and other languages like it, will not be much larger in UTF8 than in ASCII. Whereas in UCS16, all files are twice the size \o/

So because you write in standard English, we should forget about all the other languages that all the other people in the world speak?  Just how many bytes does it take to store extended kanji or traditional Chinese characters in UTF8?  Keep in mind that 20% of the Earth's population lives in one country that does not use English, or even a similar language, as its main form of communication.

A much better argument would have been that file size doesn't matter anymore for text files, so using UTF8 wouldn't be a problem these days even for those who use languages that take a lot of space to store electronically.  Of course, the counter-argument to that is to ask how long ago Unicode (UTF16) was adopted by MS.
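To put rough numbers on that trade-off, here's a quick sketch (the two sample sentences are mine, and every character here sits inside the Basic Multilingual Plane, so UTF16 costs exactly 2 bytes per character): English text doubles in size under UTF16, while a short Chinese sentence is actually smaller in UTF16 than in UTF8.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *english = "Bush hid the facts";        /* 18 ASCII characters */
    const char *chinese = "\xE4\xBD\xA0\xE5\xA5\xBD"
                          "\xE4\xB8\x96\xE7\x95\x8C";  /* 4 CJK characters    */

    /* UTF-8 size is the raw byte count; UTF-16 size is 2 bytes per
       character because none of these are outside the BMP. */
    printf("English: %u bytes in UTF-8, %u bytes in UTF-16\n",
           (unsigned)strlen(english), 18u * 2u);
    printf("Chinese: %u bytes in UTF-8, %u bytes in UTF-16\n",
           (unsigned)strlen(chinese), 4u * 2u);
    return 0;
}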
Prayin' for a 20!

gcc thorin.c -pedantic -o Thorin
compile successful

Tom

Quote
Care to state who that entails
Anyone but MS? Being backwards compatible with ASCII, UTF8 is the smarter route before Unicode fully takes off.

Quote
Just how many bytes does it take to store extended kanji or traditional Chinese characters in UTF8?
Anywhere from 1 to 4, I believe.

I have UTF8 set as my default charset on my entire system, and everything works: old ASCII stuff, and websites that display Unicode (since they use UTF8, or possibly the browser is able to convert from UCS* to UTF8).


QuoteIf " a" is read as Unicode, it can be converted to the valid ANSI code "a".
Only if you're a moron. a space is a space.

QuoteIf "aa" is read as Unicode, it cannot be converted to a valid ANSI code (what do you do with the extra A?)
Its either a valid UCS/UTF code, or it isn't. if "aa" happens to be an invalid encoding for _anything_, its obviously not unicode.

Using UTF8 by default could have solved this before it became an "issue". Totally ASCII compatible. And since most OSs have support for multiple charsets, there's no problem with supporting kjs, UCS16, UTF8, or plain ASCII.
<Zapata Prime> I smell Stanley... And he smells good!!!

Lazybones

Also, Windows didn't become the dominant OS by supporting only one language. In fact, almost every major version of Windows has added some form of language-support improvement. Vista's language support goes right down to the core; apparently there is no such thing as a language-specific version anymore, and you can load the language you want as a module regardless of the country you purchased Vista in.

Thorin

#12
Quote from: Tom on December 14, 2006, 03:49:15 PM
Quote
Care to state who that entails
Anyone but MS? Being backwards compatible with ASCII, UTF8 is the smarter route before Unicode fully takes off.

See, you don't really know who else might use anything but UTF8, so you make a sweeping, grandiose statement.  Oh, but wait,

Quote from: Tom on December 14, 2006, 03:49:15 PM
I have UTF8 set as my default charset on my entire system

Ah ha!  You're using it, so everyone else should, too!

Quote from: Tom on December 14, 2006, 03:49:15 PM
Quote
If " a" is read as Unicode, it can be converted to the valid ANSI code "a".
Only if you're a moron. A space is a space.

Yeah, that's it, only a moron would try to invent a system to figure out the charset of something when there is absolutely no indication of what the charset should be!  Step back for a minute now (I know my comments thus far have been rather inflammatory, but you started with the name-calling), and consider what you would do with a file that contains 10 bytes and has no indicator of what charset to use.  Is it ASCII?  Is it ANSI?  Is it UTF8?  Is it Unicode?

Now, the way the programmer who added Unicode to Notepad seems to have figured it out is that he'll add a Unicode character to the beginning that is then ignored - this means he needs to read the first two bytes to see if they form a valid Unicode character that cannot be represented as an ANSI character.  In a Unicode character, if the first 8 bits are set to zero, the second 8 bits represent a letter that happens to have the same value in the ANSI character set.  If the first 8 bits are not all set to zero, the first 8 bits and second 8 bits of the two-byte Unicode character represent something that is outside the ANSI character set and therefore cannot be converted to ANSI directly.
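In code, the rule I'm describing would come out to something like this (my own sketch to illustrate the idea, not the actual Notepad source): a two-byte Unicode character only maps straight onto a one-byte character when its high byte is zero.

#include <stdio.h>

/* Returns 1 when a 16-bit Unicode code unit has its high byte set to
   zero, i.e. when it can be represented directly as a single byte. */
static int fits_in_one_byte(unsigned int code_unit)
{
    return (code_unit & 0xFF00u) == 0;
}

int main(void)
{
    printf("U+0061 ('a')           -> %s\n",
           fits_in_one_byte(0x0061u) ? "one byte" : "no one-byte form");
    printf("U+7542 (the \"Bu\" pair) -> %s\n",
           fits_in_one_byte(0x7542u) ? "one byte" : "no one-byte form");
    return 0;
}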

Probably would've been a better idea to add some kind of charset indicator to the file, but then it wouldn't be a straight text file anymore.  And Notepad was originally intended to make straight text files using the ANSI character set.
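For comparison, here's roughly what the marker-at-the-front approach looks like on the writing side; this sketch mimics what saving from Notepad as "Unicode" produces (FF FE followed by two bytes per character), and the output filename is just a placeholder:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *text = "Bush hid the facts";
    size_t i, len = strlen(text);
    FILE *fp = fopen("unicode.txt", "wb");  /* placeholder filename */

    if (fp == NULL) {
        perror("unicode.txt");
        return 1;
    }

    /* The two marker bytes at the front: FF FE, which together form a
       character that is valid Unicode but has no one-byte ANSI form. */
    fputc(0xFF, fp);
    fputc(0xFE, fp);

    /* Each character becomes two bytes: the ASCII value, then a zero
       high byte (little-endian UTF-16). */
    for (i = 0; i < len; i++) {
        fputc((unsigned char)text[i], fp);
        fputc(0x00, fp);
    }

    fclose(fp);
    return 0;
}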

Quote from: Tom on December 14, 2006, 03:49:15 PM
Quote
If "aa" is read as Unicode, it cannot be converted to a valid ANSI code (what do you do with the extra A?)
It's either a valid UCS/UTF code or it isn't. If "aa" happens to be an invalid encoding for _anything_, it's obviously not Unicode.

As discussed above, if you don't know the encoding of a set of bytes you have no idea whether the two bytes that represent "aa" are supposed to be read as Unicode or not and therefore have no idea whether it's a valid single Unicode character or was supposed to be read as two separate ANSI characters.

Quote from: Tom on December 14, 2006, 03:49:15 PM
Using UTF8 by default could have solved this before it became an "issue". Totally ASCII compatible. And since most OSs have support for multiple charsets, there's no problem with supporting kjs, UCS16, UTF8, or plain ASCII.

For the record, it's the whole holier-than-thou attitude in your previous post that made me reply, and I see by this last sentence that it's still here.  It wasn't Microsoft who created or even first picked Unicode; Unicode came out of work from Xerox and Apple, and was first adopted as a standard by Apple as part of their TrueType.

One thing I will agree with, though: if the computer industry as a whole (including Apple, Microsoft, Sun, Oracle, HP, IBM, Sybase, Berkeley University, and others) had seen before 1988 that in 1992 something better than Unicode would be invented on a placemat at a diner, they would not have picked Unicode.  The problem, of course, was that UTF8 simply was not available when all the big players were looking for an internationalization standard they could use; Unicode was, and had been in development all through the early 80s.  Most of the big players even had full-time employees spending their time on nothing but Unicode meetings and work.

*edit: Changed "UTF16" to "Unicode" in my post - they're not the same.
Prayin' for a 20!

gcc thorin.c -pedantic -o Thorin
compile successful

Tom

#13
Quote
Now, the way the programmer who added Unicode to Notepad seems to have figured it out is that he'll add a Unicode character to the beginning that is then ignored
Wow, that's rather stupid. You wouldn't add just any Unicode character if you were to do that; you'd add a specific control char of some kind, not something that could ever be mistaken for valid ASCII.

Quote
See, you don't really know who else might use anything but UTF8, so you make a sweeping, grandiose statement.
Unix. That's pretty much everyone else, besides the embedded OSs that have all but evaporated lately.

Quote
Ah ha!  You're using it, so everyone else should, too!
Woo. Of course I'd assume that. ::) You missed the point. I was saying it works for me(tm).

Quote
I know my comments thus far have been rather inflammatory, but you started with the name-calling
I didn't call anyone a moron. I insinuated that one _might_ be a moron if they thought " a" could be converted to "a". " a" (space, a) in Unicode happens to be char 8289 in dec and 2061 in hex; that specific char is a special control char, so very possibly it could be used as an identifier. BUT "Bu" is not; it happens to be defined as a "Unicode Han Character" (kanji, I assume). Using "any" valid Unicode char as a control char in the manner suggested is rather dumb. A Unicode char is made from 2 consecutive bytes, totally unrelated to any ASCII or other code point. Just about any two-byte combination is a valid Unicode char.
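If anyone wants to check those numbers, here's a quick sketch that combines each byte pair in both byte orders (the thread never says which order is being assumed, so I'm showing both):

#include <stdio.h>

/* Combine two bytes into a 16-bit code unit, first byte high and then
   first byte low, to see which code points the pairs land on. */
static void show(const char *label, unsigned char first, unsigned char second)
{
    unsigned int first_high = ((unsigned int)first << 8) | second;
    unsigned int first_low  = ((unsigned int)second << 8) | first;
    printf("%s  first-byte-high: U+%04X (%u)   first-byte-low: U+%04X (%u)\n",
           label, first_high, first_high, first_low, first_low);
}

int main(void)
{
    show("\" a\"", ' ', 'a');  /* first-byte-high gives 0x2061 = 8289      */
    show("\"Bu\"", 'B', 'u');  /* both readings land on Han ideographs     */
    return 0;
}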

Quote
they would not have picked Unicode.
I'm sorry, Notepad has had Unicode support since 1988, or 1992? Besides, UTF8 is a "Unicode" variant, fully spec'd and supported by the Unicode Consortium; otherwise it could not use the UTF designation.

Quote
Probably would've been a better idea to add some kind of charset indicator to the file, but then it wouldn't be a straight text file anymore.
Using that detection method, it's already not a text file, unless it used some valid combination of "printable" ASCII characters, which makes it pretty pointless to use, as it'd cause all sorts of false positives.

Quote
As discussed above, if you don't know the encoding of a set of bytes you have no idea whether the two bytes that represent "aa" are supposed to be read as Unicode or not and therefore have no idea whether it's a valid single Unicode character or was supposed to be read as two separate ANSI characters.
It's a text file. It's very likely going to be in the same encoding the OS is using by default. If it isn't, the user is probably going to know that; if not, it's of no use to them anyway, since even if it were displayed properly they couldn't read it.

Quote
For the record, it's the whole holier-than-thou attitude in your previous post that made me reply, and I see by this last sentence that it's still here.
Maybe it's the stupid detection method it uses. I rather dislike stupid code, and the monkeys that write it. (defn: code monkey: one that does not think.) No offense to any of the programmers here; I am quite certain none of you are code monkeys.

edit: re: code monkeys.
I see way too many of them on mailing lists, working for companies, asking only for answers NOW!!!11111, examples, and someone to do the work for them.
<Zapata Prime> I smell Stanley... And he smells good!!!

Shayne

#14
As I read through this, I wonder to myself, "How does this really affect me?"  I am not so sure it does.  Viva la Notepad!