1. With Windows XP (only, apparently), open up Notepad.
2. Type the following and save this brand-new file (the filename doesn't matter):
Bush hid the facts
3. Close Notepad. Open that file you just saved.
4. Say out loud: WTF!?
http://www.hoax-slayer.com/bush-hid-the-facts-notepad.html
Quote
In fact, even a line of text such as "hhhh hhh hhh hhhhh" will elicit the same results.
Since I first published this article, a few readers have pointed out that some character strings that fit the "4,3,3,5" pattern do not generate the error. For example, the phrase "Bush hid the truth" is displayed normally. However, conspiracy theorists should not take this as aiding their argument. "Fred led the brats", "brad ate the trees" and other strings also escape the error.
Thus, any hint of political conspiracy fades into oblivion and is replaced by a rather mundane programming bug. It seems probable that a certain combination and/or frequency of letters in the character string cause Notepad to misinterpret the encoding of the file when it is re-opened. If the file is originally saved as "Unicode" rather than "ANSI" the text displays correctly. Older versions of Notepad such as those that came with Windows 95, 98 or ME do not include Unicode support so the error does not occur.
So, nothing weird here at all... except perhaps for the fact that someone, somewhere had nothing better to do than turn a simple software glitch into another lame conspiracy theory. :)
"Older versions of Notepad such as those that came with Windows 95, 98 or ME do not include Unicode support so the error does not occur."
Ah, so it was a "regression," as they say -- adding feature X makes perfectly-working feature Y falldowngoboom. ::)
I knew it was some kinda stupid M$ bug, since taking an existing .TXT file and putting those characters in (and presumably any of that special pattern) results in a normal, perfectly readable file (and doing the exact same "test" but including a "." at the end of the sentence resulted in a normal file as well).
Just thought it was funny. :)
It would have been funnier if it weren't wrapped in a conspiracy-theory/political context, like more than half your posts are.
You do submit some interesting things, but I am really getting tired of the tinfoil-hat stuff.
That made me wonder what exactly causes Notepad to think it's Unicode, though. The minimum I could boil it down to was "hhhh hhh h" (and " hhh hhh h", "hh h hhh h", "aaaa aaa a", " aaa aaa a"). Each two characters represents a single Unicode character, right? So "hh" "hh" " h" "hh" " h".
Adding a single character (like a period) makes the byte count odd, so it no longer fits as valid Unicode, but adding two characters (like two periods) keeps it even, and so Notepad still thinks it's Unicode text.
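Since we're dissecting it anyway, here's a quick sketch of that pairing in action (Python used purely as a byte-twiddling convenience; the file on disk is just these raw bytes, nothing more):

Code:
# How 18 bytes of ANSI text become 9 little-endian UTF-16 code units.
raw = b"Bush hid the facts"        # what Notepad wrote out as ANSI
print(len(raw))                    # 18 -- an even number, so it *could* be UTF-16

# Read the same bytes the way Notepad guesses on re-open:
print(raw.decode("utf-16-le"))     # nine CJK-looking characters

# An odd byte count can't be UTF-16 at all:
raw2 = b"Bush hid the facts."      # 19 bytes -- the trailing period "fixes" it
print(len(raw2) % 2)               # 1 -> not a whole number of 16-bit units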
If you save the file from Notepad as Unicode and look at the extra characters it puts in (you need a hex viewer for this), you'll see that two bytes get added to the beginning (together they form one valid Unicode character that's not a valid ANSI character). It looks like the programmer decided to take the first four characters and check whether at least one of them would form a valid Unicode code but an invalid ANSI code ("aa" and "aa" would be two valid Unicode characters, but not valid ANSI codes; " a" and "aa" has one invalid ANSI code; "aa" and " a" has one valid ANSI code; " a" and " a" has two valid ANSI codes, so Notepad treats it as ANSI).
So to sum up, it looks like the programmer hacked Unicode capability into the text files Notepad creates by simply adding two bytes to the beginning that are valid Unicode but not a valid ANSI code, and then, when opening a file, taking the first four characters of the text to see if they construct at least one invalid ANSI code. There's some kind of minimum-length and required-space logic, too... Looks to me like they took a simple idea (put a Unicode character at the beginning as a hack to mark the file as a Unicode file) and botched it into a convoluted (yet ultimately confusable) algorithm.
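If you want to see those two marker bytes without digging out a hex viewer, something like this shows them (a sketch; Python's codecs module just happens to ship the constant):

Code:
import codecs

# The "two bytes" Notepad prepends when saving as Unicode are the
# UTF-16LE byte-order mark: bytes FF FE, i.e. the character U+FEFF.
print(codecs.BOM_UTF16_LE.hex())   # 'fffe'

# A file saved as "Unicode" starts with those bytes, then two bytes per char:
data = codecs.BOM_UTF16_LE + "Bush hid the facts".encode("utf-16-le")
print(data[:6].hex(" "))           # 'ff fe 42 00 75 00'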
Quote
but not valid ANSI codes
That's just stupid. In ANSI/ASCII, aka the standard 1-byte encoding normally used in simple editors, " " is a valid ANSI/ASCII character and "a" is a valid ANSI/ASCII character. I think it's MS's fault for going with UCS-2/UTF-16. Everyone else uses UTF8 for the simple fact that it's 100% backwards compatible with ASCII. Sure, some characters end up getting encoded as up to 4 or 6 bytes, but who here writes in watoose? Or extended Hiragana/kanji? (Sure, there are tens of thousands of characters, but no one in their right mind uses all of them.) Standard English and other languages like it will not be much larger in UTF8 than in ASCII. Whereas in UCS-2, all files are twice the size \o/
Quote
then, when opening a file, taking the first four characters of the text to see if they construct at least one invalid ANSI code.
Again, ANSI/ASCII codes are 8 bits/1 byte. Or are you mixing up ANSI and Unicode as the same thing?
Might I suggest the inclusion of a conspiracy subforum under the General forum, with access available to only those who want it? This would clear the threads from the general forum, and keep them from the prying eyes of the government :)
Quote from: Ustauk on December 14, 2006, 10:43:15 AM
Might I suggest the inclusion of a conspiracy subforum under the General forum, with access available to only those who want it? This would clear the threads from the general forum, and keep them from the prying eyes of the government :)
I rather enjoy the challenge of critical thinking when these posts are made, no matter how much I disagree with them. Burying them in a subforum would, I think, be pointless.
Not to mention I was able to end one topic by posting a picture :) So the weapon is available to all those who need it.
Quote from: Tom on December 14, 2006, 10:32:26 AM
Quote
but not valid ANSI codes
That's just stupid. In ANSI/ASCII, aka the standard 1-byte encoding normally used in simple editors, " " is a valid ANSI/ASCII character and "a" is a valid ANSI/ASCII character.
At no point did I say it was a smart solution.
Quote from: Tom on December 14, 2006, 10:32:26 AM
Quote
then, when opening a file, taking the first four characters of the text to see if they construct at least one invalid ANSI code.
ANSI/ASCII codes are 8 bits/1 byte. Or are you mixing up ANSI and Unicode as the same thing?
I'm not mixing up ANSI and Unicode. If " a" is read as Unicode, it can be converted to the valid ANSI code "a". If "aa" is read as Unicode, it cannot be converted to a valid ANSI code (what do you do with the extra A?). From what I could determine in the short time I spent looking at it, this appears to be the mainstay of the algorithm Notepad uses to determine whether a .txt file is in Unicode or ANSI (the two most common formats Notepad can save .txt files in).
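To make the " a" versus "aa" point concrete, here's a tiny sketch (using Latin-1 as a stand-in for a single-byte ANSI codepage, since "ANSI" proper depends on your system locale):

Code:
# Can this 16-bit Unicode value survive a round trip to a one-byte code?
for unit in (0x0061, 0x6161):      # 'a' with a zero high byte vs. "aa" packed
    ch = chr(unit)
    try:
        ch.encode("latin-1")       # stand-in for a single-byte ANSI codepage
        print(f"U+{unit:04X} fits in one byte ({ch!r})")
    except UnicodeEncodeError:
        print(f"U+{unit:04X} has no single-byte equivalent")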
Quote from: Tom on December 14, 2006, 10:32:26 AM
I think it's MS's fault for going with UCS-2/UTF-16. Everyone else uses UTF8 for the simple fact that it's 100% backwards compatible with ASCII.
Everyone else? Care to state who that entails, what your boundary for the grouping is, and how big the MS grouping versus the "everyone else" grouping is?
Quote from: Tom on December 14, 2006, 10:32:26 AM
I think it's MS's fault for going with UCS-2/UTF-16. Everyone else uses UTF8 for the simple fact that it's 100% backwards compatible with ASCII. Sure, some characters end up getting encoded as up to 4 or 6 bytes, but who here writes in watoose? Or extended Hiragana/kanji? (Sure, there are tens of thousands of characters, but no one in their right mind uses all of them.) Standard English and other languages like it will not be much larger in UTF8 than in ASCII. Whereas in UCS-2, all files are twice the size \o/
So because you write in standard English, we should forget about all the other languages that all the other people in the world speak? Just how many bytes does it take to store extended kanji or traditional Chinese characters in UTF8? Keep in mind that 20% of the Earth's population lives in one country that uses neither English nor even a similar language as its main form of communication.
A much better argument would have been that file size doesn't matter for text files anymore, so using UTF8 wouldn't be a problem these days, even for those who use languages that take a lot of space to store electronically. Of course, the counter-argument to that is: how long ago was Unicode (UTF16) adopted by MS?
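For anyone curious about the actual numbers, a quick sanity check (the Chinese string is just a hypothetical four-character sample):

Code:
english = "Bush hid the facts"
chinese = "中文文本"                      # hypothetical sample text

print(len(english.encode("utf-8")))      # 18 -- identical to plain ASCII
print(len(english.encode("utf-16-le")))  # 36 -- exactly double
print(len(chinese.encode("utf-8")))      # 12 -- 3 bytes per character
print(len(chinese.encode("utf-16-le")))  # 8  -- 2 bytes per character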
Quote
Care to state who that entails
Anyone but MS? Being backwards compatible with ASCII, UTF8 is the smarter route before Unicode fully takes off.
Quote
Just how many bytes does it take to store extended kanji or traditional Chinese characters in UTF8?
Anywhere from 1 to 4, I believe.
I have UTF8 set as my default charset on my entire system, and everything works: old ASCII stuff, and websites that display Unicode (since they use UTF8, or possibly the browser is able to convert from UCS* to UTF8).
Quote
If " a" is read as Unicode, it can be converted to the valid ANSI code "a".
Only if you're a moron. A space is a space.
Quote
If "aa" is read as Unicode, it cannot be converted to a valid ANSI code (what do you do with the extra A?)
It's either a valid UCS/UTF code, or it isn't. If "aa" happens to be an invalid encoding for _anything_, it's obviously not Unicode.
Using UTF8 by default could have solved this before it became an "issue". Totally ASCII compatible. And since most OSes have support for multiple charsets, there's no problem with supporting SJIS, UCS-2, UTF8, or plain ASCII.
Also, Windows didn't become the dominant OS by supporting only one language. In fact, almost every major version of Windows added some form of language-support improvement. VISTA's language support goes right down to the core: apparently there is no such thing as a language-specific version anymore; you can load the language you want as a module regardless of the country you purchased VISTA in.
Quote from: Tom on December 14, 2006, 03:49:15 PM
Quote
Care to state who that entails
Anyone but MS? Being backwards compatible with ASCII, UTF8 is the smarter route before Unicode fully takes off.
See, you don't really know who else might use anything but UTF8 so you make a sweeping, grandiose statement. Oh, but wait,
Quote from: Tom on December 14, 2006, 03:49:15 PM
I have UTF8 set as my default charset on my entire system
Ah ha! You're using it, so everyone else should, too!
Quote from: Tom on December 14, 2006, 03:49:15 PM
Quote
If " a" is read as Unicode, it can be converted to the valid ANSI code "a".
Only if you're a moron. A space is a space.
Yeah, that's it, only a moron would try and invent a system to figure out the charset of something when there is absolutely no indication of what the charset should be! Step back for a minute now (I know my comments thus far have been rather inflammatory, but you started with the name-calling), and consider what you would do with a file that contains 10 bytes and has no indicator of what charset to use. Is it ASCII? Is it ANSI? Is it UTF8? Is it Unicode?
Now, the way the programmer who added Unicode to Notepad seems to have figured it out is that he'll add a Unicode character to the beginning that is then ignored - this means he needs to read the first two bytes to see if they form a valid Unicode character that cannot be represented as an ANSI character. In a two-byte Unicode character, if the first 8 bits are all zero, the second 8 bits represent a letter that happens to have the same value in the ANSI character set. If the first 8 bits are not all zero, the two bytes together represent something that is outside the ANSI character set and therefore cannot be converted to ANSI directly.
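Here's a toy version of the rule I'm describing -- emphatically a guess at the behaviour, not Notepad's actual code:

Code:
# Hypothesized check: treat the file as UTF-16LE if either of the first
# two little-endian 16-bit units has a non-zero high byte.
def looks_like_utf16le(first4):
    if len(first4) < 4:
        return False
    units = [first4[0] | (first4[1] << 8), first4[2] | (first4[3] << 8)]
    return any(u > 0xFF for u in units)   # high byte set -> not plain ANSI

print(looks_like_utf16le(b"Bush"))            # True:  'Bu' = 0x7542
print(looks_like_utf16le(b"B\x00u\x00"))      # False: high bytes are zero...
print(looks_like_utf16le(b"\xff\xfeB\x00"))   # ...but the prepended marker
                                              # (0xFEFF) trips the rule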
Probably would've been a better idea to add some kind of charset indicator to the file, but then it wouldn't be a straight text file anymore. And Notepad was originally intended to make straight text files using the ANSI character set.
Quote from: Tom on December 14, 2006, 03:49:15 PM
Quote
If "aa" is read as Unicode, it cannot be converted to a valid ANSI code (what do you do with the extra A?)
It's either a valid UCS/UTF code, or it isn't. If "aa" happens to be an invalid encoding for _anything_, it's obviously not Unicode.
As discussed above, if you don't know the encoding of a set of bytes, you have no idea whether the two bytes that represent "aa" are supposed to be read as Unicode or not, and therefore no idea whether it's a valid single Unicode character or was supposed to be read as two separate ANSI characters.
Quote from: Tom on December 14, 2006, 03:49:15 PM
Using UTF8 by default could have solved this before it became an "issue". Totally ASCII compatible. And since most OSes have support for multiple charsets, there's no problem with supporting SJIS, UCS-2, UTF8, or plain ASCII.
For the record, it's the whole holier-than-thou attitude in your previous post that made me reply, and I see by this last sentence that it's still here. It wasn't Microsoft who created or even first picked Unicode; Unicode came out of work at Xerox and Apple, and was first adopted as a standard by Apple as part of their TrueType font technology (http://unicode.org/history/tenyears.html).
One thing I will agree with, though: if the computer industry as a whole (including Apple, Microsoft, Sun, Oracle, HP, IBM, Sybase, UC Berkeley, and others (http://www.unicode.org/consortium/memblogo.html#inst)) had seen before 1988 that in 1992 (http://en.wikipedia.org/wiki/UTF-8) something better than Unicode would be invented on a placemat at a diner, they would not have picked Unicode. The problem, of course, was that UTF8 simply was not available when all the big players were looking for an internationalization standard they could use; Unicode was, and had been in development all through the early 80s. Most of the big players even had full-time employees spending their time on nothing but Unicode meetings and work.
*edit: Changed "UTF16" to "Unicode" in my post - they're not the same.
Quote
Now, the way the programmer who added Unicode to Notepad seems to have figured it out is that he'll add a Unicode character to the beginning that is then ignored
Wow, that's rather stupid. You wouldn't add just any Unicode character if you were to do that; you'd add a specific control char of some kind - not something that could ever be mistaken for valid ASCII.
Quote
See, you don't really know who else might use anything but UTF8 so you make a sweeping, grandiose statement.
Unix. That's pretty much everyone else, besides the embedded OSes that have all but evaporated lately.
Quote
Ah ha! You're using it, so everyone else should, too!
Woo. Of course I'd assume that ::) You missed the point. I was saying, it works for me(tm).
Quote
I know my comments thus far have been rather inflammatory, but you started with the name-calling)
I didn't call anyone a moron. I insinuated that one _might_ be a moron if they thought " a" could be converted to "a". " a" (space, a) in Unicode happens to be char 8289 in dec, 2061 in hex; that specific char is a special control char, so very possibly it could be used as an identifier. BUT "Bu" is not; it happens to be defined as a "Unicode Han Character" (kanji, I assume). Using "any" valid Unicode char as a control char in the manner suggested is rather dumb. A Unicode char is made from 2 consecutive bytes, totally unrelated to any ASCII or other code point. Just about any two-byte combination is a valid Unicode char.
Quote
they would not have picked Unicode.
I'm sorry, Notepad has had Unicode support since 1988, or 1992? Besides, UTF8 is a "Unicode" variant, fully spec'd and supported by the Unicode Consortium; otherwise it could not use the UTF designation.
Quote
Probably would've been a better idea to add some kind of charset indicator to the file, but then it wouldn't be a straight text file anymore.
Using that detection method, it's already not a text file, unless it used some valid combination of "printable" ASCII characters, which makes it pretty pointless to use, as it'd cause all sorts of false positives.
Quote
As discussed above, if you don't know the encoding of a set of bytes, you have no idea whether the two bytes that represent "aa" are supposed to be read as Unicode or not, and therefore no idea whether it's a valid single Unicode character or was supposed to be read as two separate ANSI characters.
It's a text file. It's very likely going to be in the same encoding the OS uses by default. If it isn't, the user is probably going to know that; and if not, it's of no use to them anyway, since even if it were displayed properly they couldn't read it.
Quote
For the record, it's the whole holier-than-thou attitude in your previous post that made me reply, and I see by this last sentence that it's still here.
Maybe it's the stupid detection method it uses. I rather dislike stupid code, and the monkeys that write it. (defn: code monkey: one that does not think) No offense to any of the programmers here; I am quite certain none of you are code monkeys.
edit: re: code monkeys.
I see way too many of them on mailing lists, working for companies, only asking for answers NOW!!!11111, for examples, and for someone to do the work for them.
As I read through this, I wonder to myself, "How does this really affect me?" I am not so sure it does. Viva la Notepad!
Quote from: Shayne on December 14, 2006, 11:17:29 PM
As I read through this, I wonder to myself, "How does this really affect me?" I am not so sure it does. Viva la Notepad!
LOL, my thoughts exactly.
Thanks to a couple of uber-geeks, I now know a whole bunch more than I ever realized there was to know about UTF/Unicode... And within a week I will forget it all as it really don't amount to a hillabeans in my daily life ;D
Wonder if this "debate" would have gotten all fired up in the first place if the thread title weren't so darned controversial? ;)
Another example is given here: http://weblogs.asp.net/cumpsd/archive/2004/02/27/81098.aspx
Quote
Someone showed me a weird text file today. It was a bat file with 'copy MeYou.bak MeYou.txt'. When you ran it, it would work. But when you opened it in Notepad, there was nothing.
So we decided to look a bit into this and here is something we came up with to 'create' invisible text:
Open notepad and enter:
' abc.bak abc.txt'
(That is: space abc dot bak space abc dot txt, no line break, without the quotes)
It doesn't work with every string, just follow us on this example and use that one.
Save your file. Notepad picks default ANSI as encoding.
Open your file, Notepad seems to open by default in Unicode encoding.
Your text is now invisible.
This is discussed and dissected quite handily by Raymond Chen, who explains how the word "Hello" is represented by different numbers of bytes in the different encodings. He also explains what causes the problem described at the beginning of this thread:
Quote
The encodings that do not have special prefixes and which are still supported by Notepad are the traditional ANSI encoding (i.e., "plain ASCII") and the Unicode (little-endian) encoding with no BOM. When faced with a file that lacks a special prefix, Notepad is forced to guess which of those two encodings the file actually uses. The function that does this work is IsTextUnicode, which studies a chunk of bytes and does some statistical analysis to come up with a guess.
IsTextUnicode (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_81np.asp) is a Windows API that has been available since NT 3.5.
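On a Windows box you can poke at the function directly; a minimal sketch (using Python's ctypes to load Advapi32, where IsTextUnicode lives; per the docs linked above, passing NULL for the result mask runs all of its tests):

Code:
import ctypes   # Windows-only sketch

data = b"Bush hid the facts"
guess = ctypes.windll.Advapi32.IsTextUnicode(data, len(data), None)
print(bool(guess))   # True on XP-era systems -- the whole bug in one call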
As for special characters marking a text file as having a particular encoding (using something called a BOM): it turns out this is a standard technique that is supported not just by Notepad but also by emacs (and probably Vi, although I was unable to find any sources stating so).
Ultimately, I agree that trying to determine the encoding of a text file that has no special control marker indicating the encoding, and thereby getting a false positive for little-endian Unicode, is a bad design decision. However, I'm sure that if you rip apart IsTextUnicode, you'll find it was probably written quite well, with quite a bit of thought given to how to determine whether text is Unicode; I doubt the actual code was written by a code monkey as you define it. After all, this is a core function included in Windows NT 3.5 and beyond, and there aren't too many OS programmers posting in forums demanding solutions with examples immediately.
I'll say this again: including code to determine whether a text file with no encoding indicator is encoded as ASCII or little-endian Unicode is a bad design decision; instead, a text file with no encoding indicator should simply always be treated as ASCII.
Quote from: Shayne on December 14, 2006, 11:17:29 PM
As I read through this, I wonder to myself, "How does this really affect me?" I am not so sure it does. Viva la Notepad!
It probably has no direct bearing on your day-to-day work. Still, it's nice to understand how text files are marked to indicate they're in an encoding other than ASCII, don't you think? Especially helpful when you need to start working with text files in software used in parts of the world that don't use English.
IMO, there is no GOOD way of auto-detecting a text file's encoding. To do so is just silly. If the system default isn't the right one, let the user change it from some menu.
Found a useful tidbit via a Google define: search:
Quote
The Unicode character U+FEFF, or its non-character mirror image, U+FFFE, used to indicate the byte order of a text stream. The presence of a BOM can be a strong clue that a file is encoded in Unicode.
That's not a _bad_ idea; 0xFE and 0xFF are not printable ASCII afaik, so they should make a good prefix. Though I still don't know why an editor would see a file that isn't prefixed by such a character as Unicode. Using just _any_ Unicode char (as that NT function seems to do) is a bad idea, as any number of Unicode chars can look identical to a string of valid ASCII/EBCDIC/SJIS/whatever.
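A minimal BOM sniffer along those lines might look like this -- just a sketch of the prefix idea, not what Notepad does when the prefix is missing (that's where IsTextUnicode comes in):

Code:
import codecs

def sniff_bom(data):
    # Return an encoding name if the data starts with a known BOM.
    for bom, name in [(codecs.BOM_UTF8, "utf-8-sig"),      # EF BB BF
                      (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
                      (codecs.BOM_UTF16_BE, "utf-16-be")]: # FE FF
        if data.startswith(bom):
            return name
    return None                               # no BOM -> you have to guess

print(sniff_bom(b"\xff\xfeB\x00u\x00"))       # 'utf-16-le'
print(sniff_bom(b"Bush hid the facts"))       # None -- the problem case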
Quote
I doubt the actual code was written by a code monkey as you define it
You'd be surprised at who gets hired by some companies. Then again, maybe you wouldn't, considering who ran that one place you guys always talk about. Also, a good programmer can be forced into monkey-mode depending on circumstances: you just do what you're told and just want to get it done asap.
Quote
I'll say this again: including code to determine whether a text file with no encoding indicator is encoded as ASCII or little-endian Unicode is a bad design decision; instead, a text file with no encoding indicator should simply always be treated as ASCII.
I totally agree.
Though allowing both little- and big-endian forms of the same text file seems excessive. Force it to one, and you never have to worry about the endianness. I think sometimes those big standards organizations overdo/overthink things.
Further reading indicated that IsTextUnicode looks at the entire text file to determine whether it has a high percentage of bytes that, taken together, could only be Unicode characters. Basically, it's doing a statistical analysis. Using this to automatically pick a specific encoding is a bad idea, but what if you popped up a dialog box asking for a choice and used IsTextUnicode to suggest the most likely encoding to the user?
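Something in that spirit could be as dumb as the following -- emphatically not IsTextUnicode, just the same kind of evidence-weighing, and only good enough to pre-select a default in such a dialog:

Code:
# Toy statistical guesser: mostly-CJK code units and no ASCII-range units
# is (weak!) evidence of UTF-16LE; anything else defaults to ANSI.
def suggest_encoding(data):
    if len(data) < 2 or len(data) % 2:
        return "ansi"                       # odd length can't be UTF-16
    units = [data[i] | (data[i + 1] << 8) for i in range(0, len(data), 2)]
    ascii_like = sum(1 for u in units if u < 0x80)
    cjk_like = sum(1 for u in units if 0x3000 <= u <= 0x9FFF)
    return "utf-16-le" if cjk_like > len(units) // 2 and not ascii_like else "ansi"

print(suggest_encoding(b"Bush hid the facts"))    # 'utf-16-le' -- the false positive
print(suggest_encoding(b"Bush hid the facts."))   # 'ansi' -- odd byte count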
And no, Darren, after first reading the title I wasn't going to look at this thread at all. It wasn't until Lazy posted the quote from hoaxslayer that I started wondering what caused this particular quirk in Notepad, and it was the posting of my initial findings that led to this big discussion about encoding and how it's recognized.