With most attack vectors, like buffer overflows, CSRF, and SQL injection, I (and many developers who aren't security domain experts) am acutely aware of best practices and mitigations. I know exactly where they can appear and how to prevent them. (Well, not literally, but it's a simple web search away.)
With Unicode, I really don't know what to do. What strings need to be sanitized (or validated)? File names? URLs? How do I sanitize them without causing agony for most of the world? Are there other Unicode attack vectors?
For starters, there's the Unicode Technical Report (UTR) #36 [0], "Unicode Security Considerations". It covers this particular issue in section 2.5 [1].
The general answer is to not allow U+202E (e.g., ignore it, remove it, but do not render it) in any identifiers, filenames, domain-name labels, usernames, and other security-sensitive strings when it appears between characters of scripts that are not bi-directional and which flow in the same direction. More generally, do not allow mixing of scripts in certain contexts -- for example, do not allow mixed scripts in a single domain-name label. Some of these rules need to be implemented by, e.g., DNS registries, and may need to be tailored to their specific needs (e.g., you might find that .kr has to allow mixing of ASCII letters and Hangul characters because it's common in South Korea to add "ing" to names to make brands out of them).
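To make that concrete, here is a minimal sketch (mine, in Python, and far cruder than what UTR #36 actually specifies) of rejecting the directional override/embedding/isolate controls in a security-sensitive string such as a username or filename:

    # Bidi controls that can visually reorder surrounding text. The plain marks
    # (U+200E LRM, U+200F RLM, U+061C ALM) are deliberately not on this list.
    BIDI_OVERRIDES = {
        "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",   # LRE, RLE, PDF, LRO, RLO
        "\u2066", "\u2067", "\u2068", "\u2069",             # LRI, RLI, FSI, PDI
    }

    def reject_bidi_overrides(value: str) -> str:
        """Raise if a filename/username/label contains directional overrides."""
        bad = BIDI_OVERRIDES.intersection(value)
        if bad:
            raise ValueError("disallowed bidi control(s): "
                             + ", ".join(f"U+{ord(c):04X}" for c in sorted(bad)))
        return value

    reject_bidi_overrides("annexe.doc")              # fine
    # reject_bidi_overrides("file\u202Efdp.exe")     # raises ValueError

A real implementation would also apply the mixed-script rules mentioned above, not just a blocklist.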
It should be noted that the problems of U+202E and such are not inherent to all Bidi_Control characters: the Right-to-Left Mark, Left-to-Right Mark and Arabic Letter Mark are all safe (and useful), though UTR #36 suggests going far enough (e.g. “use only LTR or RTL, and if RTL, first and last characters must be strong”) that they wouldn’t be useful. It’s specific to the characters that override the default behaviour, which means the Embedding, Isolate and Override characters.
The Right-To-Left Override (U+202E) does not obscure nor overwrite existing characters. It is an instruction to switch the text direction to right-to-left like for Arabic and Hebrew.
In the article it is shown how this can be misused to show executable files as a different file type:
file[U+202e]fdp.exe shown as fileexe.pdf
    ~~~~~~~~^^^^^^^
       |       |--- read this right-to-left, i.e. exe.pdf
       |----------- this is the Unicode character Right-To-Left Override
It certainly obscures and overwrites. When you replace "fdp.exe" by "exe.pdf", you're obscuring and overwriting the leading "f" by "e", then the "d" by "x" and so on.
It doesn't overwrite or obscure, it switches rendering direction. E.g. \u202E\u0041\u0042\u0043 starts outputting from the right: \u0041 from the right, then \u0042, then \u0043. That's CBA, although in normal English you'd render that as ABC.
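If you want to see that locally, a quick Python check (the visual CBA effect of course depends on the terminal being bidi-aware):

    s = "\u202EABC"
    print(list(s))   # logical (stored) order is unchanged: ['\u202e', 'A', 'B', 'C']
    print(s)         # a bidi-aware terminal draws the run right-to-left, i.e. CBA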
The functions/codepoints are there for reasons. If you want to research the reasons and decide that text processing doesn't need them, that's one thing. But to simply dismiss it off-hand with the argument "if you have to remove it for security it shouldn't exist in the first place" is just ridiculous. It's like saying "if you have to properly escape parameters in SQL to avoid injection attacks, SQL shouldn't exist in the first place!"
I am continually baffled about why the HN comment sections are such a hotbed of anti-unicode sentiment. Perhaps most of y'all don't have to write software that deals with non-English text in any significant way.
If you just didn't have buffers you wouldn't have to avoid buffer overflow attacks! Come on.
It feels like the whole “everything is Unicode now” movement was misguided to me.
Some things are meant to be simple, and Unicode is massively complex. True, these are rare-ish corner cases, but every programmer knows it’s only ever a matter of time until you hit any possible corner case.
I’m not saying, screw people whose languages don’t fit into ASCII (I’m one of them btw), but I really don’t want to know the vagaries of Unicode representation to inspect a file name.
Maybe Unicode needs a “safe mode” where only outright characters are allowed? Or maybe the unavoidable complexity is a design flaw, for some uses anyway.
I feel like we’re close to someone discovering Unicode is Turing-complete...
The problem in this case was that we depend on convention (filenames and extensions) for security.
Operating system should already know if a file is executable and desktop should be using this information when presenting files instead of conventional names.
Yes it would mean that when you download a file you have to set the executable attribute before you can run it. It's already almost exactly like that on Linux. That's a good thing.
> Operating system should already know if a file is executable and desktop should be using this information when presenting files instead of conventional names.
That already happens. If you set the "Details" view in Windows, it will display one file per row with various columns showing file metadata, including the file type. So for the example in the post, you'd see "[Word icon] annexe.doc - [the date] - Application" as opposed to "[Word icon] annexe.doc - [the date] - Microsoft Word 97 - 2003 Document".
The main advantage is that the UI for opening documents (vim file) is different from the UI for running executables (./file) which is different from the UI for running executables as administrator (sudo ./file).
This makes it impossible to accidentally execute a file that looks like a document.
As a desktop user, it makes me physically ill every time I have to do that. I'd honestly pay for someone to make auto-executing the default on most common Linux distros. Like, in 20+ years of using computers, I don't remember one time where I downloaded something and did not want to run it afterwards.
Fortunately for the rest of us, common Linux distro maintainers have a better understanding of security than you appear to.
If I download a pdf, I don't want it to run. I probably want to open it, but if I've gone through the trouble of downloading it onto the filesystem as opposed to just opening it in the web browser, there may be other things I want to do with it. Maybe I want to print it, or maybe I want to break it up into component parts. Or maybe I want to transfer it to a sandboxed environment before I open it because I don't trust it.
If you don't want to take an active role in maintaining security on your computer, then by all means use a commercial operating system. Pay a license and let Microsoft and Apple make those decisions for you. They have competent people who can program reasonable default settings for the average user. If you're running Linux or an open source operating system, you've already decided to take these responsibilities into your own hands.
> If you don't want to take an active role in maintaining security on your computer, then by all means use a commercial operating system. Pay a license and let Microsoft and Apple make those decisions for you. They have competent people who can program reasonable default settings for the average user. If you're running Linux or an open source operating system, you've already decided to take these responsibilities into your own hands.
That is a beyond ridiculous take. The last Windows OS that did not make me mad was Win98, and every time I have to use Mac machines I want to gouge my eyeballs out. I want an OS that lets me shoot myself in the foot as much as I want.
> If you're running Linux or an open source operating system, you've already decided to take these responsibilities into your own hands.
no, I just want to use the fastest possible OS for my use case on my hardware, which is Linux.
> Though I don't really recommend doing this since you should not be executing downloaded binaries often enough for this to be worth doing.
I literally do this multiple times per day. What in hell do you folk do with your computers?
Thanks for the script! There are a few tens of thousands of files in my downloads folder; I hope this isn't going to pollute the inotify fds too much...
I usually install programs through the distro package manager. Out of curiosity, what are you doing that requires downloading so many executables from the web?
>requires downloading so many executables from the web?
Hopefully not leaking company information to clever attackers.
I work with a number of clients in high security environments where you have to get permissions to perform operations like execution (or even setting chmod x on a file). It really does limit people running random things and causing destruction.
When you seriously download and run multiple executables a day I wouldn't call that a common use-case. At least 99.9% of my downloaded files are not executables. Those are usually pulled via the package manager or git, so basically I only download an AppImage if both the repos and flatpak can't install it.
I have no idea in what situation I'd need to load so many executables via the browser.
when I do, my OS has no trouble opening them in my image or text viewer. The issue is when I download something that should be executable, and I have to painstakingly open a terminal, and
cd ~/Téléchargements; chmod +x <try to remember the first letter><hit tab 15 times><miss it><go check inside the firefox download manager><restart><end up typing the whole name>
This seems like more of a UI issue than anything. Could be fixed by simply presenting the user with a prompt asking whether they want to allow this file to execute as a program if it's missing the permissions.
Consider writing a bash script that makes a subdirectory with the current date if it doesn't exist, downloads the URL given as a command-line argument into that directory, changes the permissions and runs it (a rough sketch follows at the end of this comment).
Then your workflow could be:
- keep terminal open
- Alt+tab to firefox, browse, copy link to executable file
- Alt+tab to terminal, ./handle.sh <paste the link> ENTER
- Alt+tab to firefox again
- rinse and repeat
There are many benefits - you don't get thousands of files in one directory (which can lead to performance degradation and even random errors), there's little risk that you run the wrong file, and you don't waste as much time.
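Not that it needs to be bash; here is a rough Python sketch of the same handle.sh idea (the ~/Downloads location and the decision to immediately run whatever was fetched are assumptions, and whether that is wise is exactly what the rest of the thread argues about):

    #!/usr/bin/env python3
    """Download a URL into a dated subdirectory, mark it executable, run it."""
    import stat
    import subprocess
    import sys
    import urllib.request
    from datetime import date
    from pathlib import Path
    from urllib.parse import urlparse

    def handle(url: str) -> None:
        target_dir = Path.home() / "Downloads" / date.today().isoformat()
        target_dir.mkdir(parents=True, exist_ok=True)        # create dated subdir if missing
        name = Path(urlparse(url).path).name or "download"   # filename taken from the URL path
        target = target_dir / name
        urllib.request.urlretrieve(url, str(target))          # download
        target.chmod(target.stat().st_mode | stat.S_IXUSR)    # chmod +x for the owner
        subprocess.run([str(target)], check=True)             # and run it

    if __name__ == "__main__":
        handle(sys.argv[1])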
the idea of having to go in those damned menus one more time just to launch an app sends shivers down my spine. I already barely tolerate windows and macOS's prompts when downloading executables.
> Maybe Unicode needs a “safe mode” where only outright characters are allowed?
What is safe for me may not be safe for you. I need RLM characters in my terminal emulator [1], you probably don't. And who knows, maybe RLM characters could prove to be an attack vector, by e.g. displaying filenames or other text backwards.
Every Unicode character and code point exists because somebody, somewhere, needs it.
> Every Unicode character and code point exists because somebody, somewhere, needs it.
Not quite. That and because some twits are willing to cave in and add it, because adding crap to Unicode gives them a sense of purpose which makes them blind to the harm they are causing.
You can definitely argue about whether Unicode should support long-dead or undeciphered languages, but those aren't the characters that cause problems in practice, or at least not more than any random emoji would and those have been embraced with fervor.
Egyptologists would disagree re your first there: it is D53 in the Gardiner sign list, used to spell words including 'urinate', 'poison', 'husband' and 'in the presence of' !
That isn't true either; many Unicode characters exist solely because they were part of some other encoding standard that Unicode preserved in its entirety. For example, all of ASCII is preserved in that way. That's also why 囍, a character which does not have any linguistic use, is in there - it was included in an earlier Chinese encoding standard. Nobody thinks it was a good idea to separate U+2126 OHM SIGN from U+03A9 GREEK CAPITAL LETTER OMEGA either.
Is that true? It seems like screen readers would benefit from knowing the difference between Ohms and Omega, particularly if the surrounding text is in Greek.
> It seems like screen readers would benefit from knowing the difference between Ohms and Omega, particularly if the surrounding text is in Greek.
Sure, just like they benefit from knowing the difference between 'V' and volts; 'J' and Joules; 'K' and Kelvins; 'A' and Amperes; π the letter, π the circle constant, and π the exotic particle; 'm', meters, and mass; 's', seconds, and position; 'g', grams, and gravity...
囍 is a really bad example of Unicode bloat since it's actually used very frequently.
The real issue are the thousands of characters that appeared incidentally in some ancient text, either as typo or as a weird interpretation of a common character, which then ended up in the Kangxi dictionary, and then subsequently imported en-masse into Unicode.
Example, 𠒇 - the only known use (in non-ancient times) of this character was being the official name of 雷莊𠒇, winner of the 2017 Miss Hong Kong Pageant. According to her, she intended to write 雷莊兒 when she applied for her official documents, but somehow the officials interpreted it as 𠒇, which is really an archaic form of 兒 (at best). Reportedly she's changing her name back to 雷莊兒
Thousands of such characters exist, if you look at the page where 𠒇 is supposed to originate, more than half of this is obsolete -- https://www.kangxizidian.com/v1/?page=125#gv
(So yeah, you're correct in essence but picking on 囍 as an example probably doesn't really get your point across...)
You've completely missed my point. Unicode is only intended to represent writing systems. It doesn't matter whether they're ancient, modern, common, or rare, but they're supposed to be writing systems.
囍 is used frequently, but it is not part of any writing system and does not convey any linguistic message. The opposite is true for 𠒇 - it is not used frequently, but it is part of a writing system and is used to convey linguistic messages.
The reason the point about 𠒇 holds is that it really isn't a character. I didn't claim it was "rare", I said it was a typo or a misinterpretation of an actual character (for the case of 𠒇 it is a misrepresented form of 兒.)
To presume a character is "real" merely because it exists in the Kangxi dictionary is as valid reasoning as presuming a character is "real" because it exists in "some other encoding standard" that you've been dismissive about. It's just that Kangxi is the de-facto Han character encoding scheme before computer encodings were invented. (Unihan even contains all details about the radical, stroke and even page number where it was sourced from)
A lot of those characters appeared once in some ancient text, and took the Kangxi form due to transcriptions from scribes across the centuries, but we actually have no evidence that they are "real" (at any point in time). Some of these characters are known alternative forms of common characters, or are only known to appear in some ancient text before Han characters were standardized. Some are plain typographical errors. It's like a 3 year old child learning to write "ABC", which looks a bit weird, and then the unicode committee assigned 3 code points to them.
> Some of these characters are known alternative forms of common characters, or are only known to appear in some ancient text before Han characters were standardized.
Those are entirely valid for Unicode. Han unification in Unicode is already considered a mistake. That's why "unified" code points now also have explicit, higher-numbered 'equivalent' code points that unambiguously refer to a particular graphical form. The graphical form is the whole point of Unicode.
The character 囍 means "double happiness" literally, and is often used in wedding ceremonies among other places here in China. It is also included in the Xinhua dictionary, the semi-official dictionary in mainland China.
> The character 囍 means "double happiness" literally, and is often used in wedding ceremonies among other places here in China.
It's used by being hung on the wall, like a painting[1]. Like I said, it has no linguistic use, and thus it is not part of a writing system, which puts it outside the stated scope of Unicode. It is the exact equivalent, for weddings, of the upside-down 福 character that is hung for New Year's, or the wreath that Americans hang for Christmas.
But it does not correspond to anything in any language; there is no Chinese sentence whose spelling would include 囍. Note that the 新华 dictionary entry says "Double 喜. Generally used at happy occasions such as weddings.", and there are zero examples of the character being used. ( https://zidian.aies.cn/NDQ4MA==.htm )
The wording of the 新华 entry is almost identical with the beginning of the 汉语大词典 entry, and it's instructive to quote the rest of the entry:
> Character used at happy occasions. Commonly called "double 喜"[2]. Generally used at occasions such as weddings. Often cut from red paper (or gold leaf), or written on red paper, [then] pasted onto a door, window, or wall, in order to indicate a happy occasion.
[1] Actually, a much closer comparison would be the traditional magical talismans that use elements from the writing system in a freeform way to express various desired goals. https://en.wikipedia.org/wiki/Fulu
[2] This dictionary is nice enough to make the fact completely explicit that 双喜 is the name of the character rather than a definition. 汉语大词典 doesn't even attempt to provide a definition for this character.
> it is not part of a writing system, which puts it outside the stated scope of Unicode.
Literally all emojis are not part of a writing system nor have proper linguistic use. Who on earth writes emoji by hand, or speaks them? Unicode broadened its scope long ago; whether that was good or bad, nobody cares now. And besides that, a lot of symbols are also culture dependent. If you have never lived there, you just have no idea what they actually mean.
> Literally all emojis are not part of a writing system nor have proper linguistic use.
Yes, that's true, and they've been a constant source of problems for Unicode ever since the decision was made to let them in. They stand in violation of Unicode's declared principles and purpose.
Note that 囍 is not considered an emoji by the Unicode standards, though that's what it is in fact.
> Who on earth writes emoji by hand, or speaks them?
But there is an example of this - the name of https://en.wikipedia.org/wiki/I_Heart_Huckabees [wikipedia title uses the word "heart", but the actual title uses the symbol] was frequently spoken aloud.
I think the problems with emoji are more that they sit outside the Basic Multilingual Plane and that they are colorful. The 'utf8' type in SQL databases assumed we would never use the supplementary planes, and then we did, which is the cause of the former problem. Some Chinese characters also suffer from the same plane problem. And a lot of old systems just can't draw a symbol in another color, because they were designed before emoji existed.
I think I agree. Trying to encode all of human written language in one standard is just too huge a scope, and then you shove all that complexity into every layer of computing except just being in the kind of documents that need that level of complexity.
It's definitely an extreme view[0], and I'm open to being argued out of it, but I really think we should use a different and much simpler encoding for the vast majority of things even if it doesn't perfectly reproduce the written text of the language. Writing systems encoded the language in a way fit for the medium of paper, and maybe we need to consider computers a different medium and adapt our language to them differently too.
I don't think you do anything tbh. It's not really like those other situations where there's a straightforward, static answer.
The problem is honestly just that unicode is used to convey things like "this is a PDF" or "this is going to execute code" and it's attacker controlled. It's a terrible UX that's been abused for as long as paths have existed.
So I guess hope that operating systems will find a better UX for "this executes" and that they'll stop "a process executed" from being a game over situation for users.
I think you're right about paths, but aside from that we have website URLs, usernames and similar identifiers. People look at those and assume that they can compare them. If I see `google.com` I assume it consists of certain characters that represent a specific institution. Are we doomed to live with ASCII for the foreseeable future, or is there some form of reduced set of Unicode that we can use without these types of impersonation attacks?
> and more recently display that instead of the raw url
Technically, punycode is the raw URL, and old enough browsers will display all non-ASCII domains as punycode. It was only for a short while that browsers naively decoded punycode without restrictions.
There are certainly things that can be done; UTR #36 covers some of them.
Besides refusing to render U+202E when it is between characters of scripts that aren't bi-directional and flow in the same direction, the UI could display the string in ways that make it clear (e.g., different color, maybe add a warning tooltip, maybe add a dialog around an operation that could be dangerous, or maybe refuse to perform dangerous operations).
Another option for filenames would be to recognize that the basename and extensions are distinct parts and the extension should always be displayed after the basename no matter what is in the basename. Of course, this adds lots of complexity so ¯\_(ツ)_/¯
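As a rough illustration of that last idea, a Python sketch (mine; the bracketed "[.ext]" presentation is just one arbitrary way a UI could pin the real extension to the end and make format characters visible):

    import unicodedata

    def display_name(filename: str) -> str:
        """Split on the last dot so the real extension is always rendered last,
        and escape format characters (category Cf, which includes U+202E)."""
        def escape(part: str) -> str:
            return "".join(
                c if unicodedata.category(c) != "Cf" else f"\\u{ord(c):04x}"
                for c in part
            )
        base, dot, ext = filename.rpartition(".")
        return f"{escape(base)} [.{escape(ext)}]" if dot else escape(filename)

    print(display_name("file\u202Efdp.exe"))   # -> file\u202efdp [.exe]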
There are two broad types of attack vectors, those that target a program, and those that target a person.
Most Unicode attack vectors seem to target people; they are ways to obscure from a person the true intention of what something is. As developers, it's our job to try to mitigate this through how we display the data, warn about risks and block particular attacks. So it's about working out how to prevent a user from making a mistake.
It seems to me that in this case Unicode obviously became “a thing” after file extensions had already been established. Clearly, if it had been the other way around, the file extension would have no meaning; maybe that's what we need to move towards: store that metadata elsewhere on the file system.
Fundamentally, Unicode from an “untrusted” origin shouldn't be trusted in an executable context.
We don't need to store the metadata differently. The on-disk format is still unambiguous: the part after the last dot is the extension.
It's only the graphical display that is ambiguous, because the last dot is not necessarily the rightmost dot.
This is easily solved with a UI change, e.g. move the file extension from the "file name" column to a separately rendered "file extension" column.
One could even translate the file extension into something more user-friendly (e.g. .exe -> Application), making it a "file type" column. Which is of course what Windows Explorer has been doing for decades already. Of course, that results in everyone turning the classic file extension back on...
I think one of the troubles here is the difference between a GUI and a command line type app. Very true, on a GUI you can hide the extension and display the file type in a column, which is what Windows and macOS both increasingly do. However on the command line it's still fundamentally part of the filename; to fix that is a larger job, and my suggestion is that it should not be part of the filename, but metadata on the file/filesystem.
The other problem with hiding the extension is that you then have "file name" collisions, two files will appear to have the same name but only differentiate by the "type". I think that's wrong.
Apply the Unicode security guidelines for identifiers (TR39) to names, simple as that.
Identifiers such as variable names, filenames, Login names, domain names, emails should be identifiable.
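A first-pass filter along those lines, sketched in Python (illustrative only; the full TR39 profile also restricts the allowed repertoire and forbids mixed scripts, which needs script data such as ICU's spoof-checking API, since the stdlib's unicodedata module doesn't expose the Script property):

    import unicodedata

    # Reject controls, format characters, surrogates, private use and unassigned code points.
    DISALLOWED_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn"}

    def plausible_identifier(name: str) -> bool:
        name = unicodedata.normalize("NFC", name)
        return all(unicodedata.category(ch) not in DISALLOWED_CATEGORIES for ch in name)

    plausible_identifier("annexe.doc")         # True
    plausible_identifier("anne\u202Exe.doc")   # False (U+202E is category Cf)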
> With unicode, I really don't know what to do. What strings need to be sanitized (or validated)?
Just yesterday I filed a bug on the KDE terminal emulator [1] for stripping away too many characters. I'm sure that the fine devs had no idea which characters are safe to leave in and which are not. I certainly have no idea myself.
Please correct me if I’m overreacting but I feel like Unicode in most cases is just a giant security flaw and the only reason we tolerate it is because it does tremendous good making things more equitable for other languages and cultures.
Unicode in URLs and usernames and filenames is just so easy to trick people with.
> making things more equitable for other languages and cultures
Well, yes. People often hate how complicated Unicode is, but they also tend to forget that even ASCII was not sufficient for writing English in a proper way. I am not by any means old, but even I still remember the period on the web when non-Unicode encodings were relatively common and how problematic it was.
And, with the assumption that by “other cultures and languages” you mean non-English speaking regions: tolerate is the wrong word. Most people are not native English speakers. If we went down the route of deciding who is tolerating whom, nations using Chinese Hanzi and its derivatives would probably be the first to claim they tolerate all the others, by speakers' headcount alone.
Moreover, deciding what is/can be executed is not so obvious in my opinion. PDFs can have JavaScript embedded, SVG too. Flash games? Python scripts on computers with python installed? It is not just the names, but the very knowledge of what can be considered “executable.”
I bought emoji domains for some of my projects. They work perfectly everywhere I've tested them in the domain side of the equation. In the mailbox name I've had less success.
ASCII is enough to write English, or, at least, it used to be, back when ASCII was a multi-byte encoding for Latin! I know, it sounds crazy, but it used to be a thing to use over-strike using back-space. E.g., ë was e BS ", ø was o BS /, etc. It was a thing back in the days of typewriters. I believe this is the reason that upper-case letters in Spanish are allowed to be written without accents even when they would otherwise be needed: accented upper-case letters were hard to produce with typewriters and ASCII fonts.
> ASCII is enough to write English, or, at least, it used to be back when ASCII was a multi-byte encoding for Latin!
I know of both the over-strike method for traditional typewriters and extended ASCII. But I have no knowledge of an ASCII (extended or not) supporting every sign commonly used in English. Granted, it is not as significant as other languages missing whole letters, but signs like ”,“,—,–,’ are often replaced with simpler ASCII alternatives (",') or combinations of them (---,--).
Over-strike, from what I remember, wasn't implemented in any widespread way with standalone ASCII, although I could simply be unaware of it. It then gives us a few more characters, rarely used in English outside loanwords (née, naïve, façade).
Skipping diacritics on uppercase letters on displays with too little space happens with the Polish language too, so I am aware of the practice.
In any case, yes, overstrike is just an approximation, but believe it or not, ASCII was designed to make it possible. And yes, it kinda sucks, and yes, it doesn't really make ASCII a multi-byte encoding, not exactly. But it's a funny thought that ASCII kinda almost was a multi-byte encoding. One can imagine ASCII evolving to treat BS<mark> as not unlike Unicode combining codepoints to make it possible to get proper diacritics even on capital letters, but ASCII would still have been a dead end, naturally.
The comment you're replying to just said ASCII without qualification. By definition this is not multibyte, or even whole byte (assuming by byte you mean octet). ASCII is a 7 bit standard, anything else is some other encoding.
Slow down and enjoy some humor once in a while my friend.
I'm quite certain that ASCII was designed to make overstrike feasible. Overstrike just wasn't explicitly part of the standard. The quip about it being a "multi-byte encoding for Latin" was a joke made funnier (to me, and perhaps only to me) by having a kernel of truth in it.
Writing in a language the reader doesn't understand is ... not so fine.
Writing in a way to give the appearance of one message but the machine-recognised existence of another is ... wrong, malicious, and harmful.
At root the issue is that encodings and graphical presentations aren't the same thing. 7-bit ASCII is limited and constrained, but as a universally understood encoding those specific characteristics are useful benefits. Yes, it means that representations are limited. But that's the essential trade-off for a lack of ambiguity.
And even within ASCII, there are homoglyphs or near-homoglyphs: {0O,1lI, 5S} being the most frequently encountered. Kerning and ligatures may present others, as with {m, rn}. In historical documents, distinguishing {ſ, f}.
And, yes, of course, email addresses in local languages, why the heck not? People in the world want to use their language! And yeah, even Hebrew, Tibetan, or Mongolian (in vertical script).
That comment of yours reads to me like a lazy post by a unilingual or uniscriptal cultural imperialist who really does not care about other languages.
Your attitude makes me rather angry, because Unicode is such an amazing achievement for the world, and your comment just comes off totally ignorant of that. It should be clear that combining the world's languages into a common standard is really, really difficult and will inevitably create something that is much more complex than your beloved ASCII. And it should also be clear that in such a complex standard, you cannot (a) solve all problems the first time you try, (b) you cannot try a second time, because no-one will adopt yet another such standard.
So just read those Unicode documents in order to understand. Those people really try and there are security consideration documents, and they are extended all the time. And take those new security warnings for what they are: they are problems with a complex system. It's expected. So when they get known, react calmly and figure out whether you need to fix anything.
There doesn’t seem to be any way to fix this issue though? Unless you mean breaking the standard such as forcing special behaviour for this particular character such that it simply cannot be used in a filename. But that would defeat the purpose of Unicode, it would turn, at the very least, into Duocode.
Hum... Most standards allow that kind of special case. For filenames, POSIX doesn't, but your viewer can render it in any way (and it already renders many dangerous characters - ASCII had them too - in non-obvious ways).
This wouldn’t be a standard allowing a special case, it would be forcing all current and future machines to implement a special case. What if one vendor decides not to care? Then there would be a de facto split, it’s likely why the unicode committee doesn’t directly fix the issue.
>Please correct me if I’m overreacting but I feel like Unicode in most cases is just a giant security flaw
well ok, it seems to me that it is more likely that in most cases Unicode probably has no effect on security one way or another, but there are a few edge cases where it does. Here is one case where using a particular character in a filename led to problems 8 years ago in an operating system historically known for not having the greatest security.
We can also see here a clear example of the edge-case theory: almost every character in every language supported by Windows would not have caused any sort of problem when used in a filename. But there are a few characters that would.
This is probably an example of things programmers think they know about Unicode or languages: we go around thinking that each Unicode code point just represents a character in a language, but some encodings have ways to represent little weird behaviors of particular languages, for example interlinear annotation characters ( https://www.unicode.org/charts/nameslist/n_FFF0.html ), and it is generally weird behaviors that end up being security hazards (not sure if there is any security hazard in interlinear annotation characters, but I wouldn't be surprised).
Unicode certainly has been a source of security vulnerabilities.
Are you overreacting? Unless you were proposing we rip it out, no, you're not.
Unicode isn't going anywhere. That's because the scripts that human languages use aren't going anywhere. And people do mix scripts, too.
There are answers though. First, Unicode has been a learning experience, and we're still all learning. Second, one of the outputs of all that learning that has happened so far is UTR #36 (https://www.unicode.org/reports/tr36/), which does cover a lot of these things.
ASCII is part of Unicode, so for Unicode it's also «a limited set. Yes, errors are made and occur. They're reasonably easy to code defensively against».
> Unicode ... vastly expands the attack interface.
Are you aware of the maximum number of Unicode codepoints which can exist, offhand?
And how that compares to the number of 7-bit ASCII characters?
And how many special cases would have to be considered?
Scale matters.
The risk:reward ratio of 7-bit ASCII is low and manageable. The expressive capability is high. No, it's not a perfect representation for all languages. It is, however, a sufficient one where common understanding is necessary.
The low number of ASCII symbols forces developers to reuse them for all purposes. The number of security holes with ASCII characters is order(s?) of magnitude larger than with Unicode. The problem of apple vs appIe (all Latin) vs аррІе (all Cyrillic) vs elppa (all Latin in reverse) is well known.
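The quickest way to convince yourself that those strings really are different underneath is to look at their code points (the three spellings below are the ones from this comment):

    import unicodedata

    for word in ["apple", "appIe", "аррІе"]:
        print(word, [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in word])

The third one comes back as CYRILLIC SMALL LETTER A, CYRILLIC SMALL LETTER ER, and so on, even though it renders just like the Latin one.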
> Please correct me if I’m overreacting but I feel like Unicode in most cases is just a giant security flaw and the only reason we tolerate it is because it does tremendous good making things more equitable for other languages and cultures.
Well, that's a lot like saying "software, in general, is just a giant security flaw we tolerate because it does useful things". It's kinda the point aint it?
In the 21st century, even Americans aren't able to write their own names and place names in pure ASCII anymore. So Unicode is there out of necessity, not as an add-on.
All technologies have intended / desirable and unintended / undesirable effects.
The more powerful and flexible a technology, the more likely there are unintended / undesirable effects.
Many of those unintended / undesirable effects are themselves not obviously apparent, not immediately manifest, or both. All of which makes risk assessment all the more difficult.
Software, in that sense, is inherently a security flaw, as it virtually always manifests unintended and/or undesirable effects.
This isn't necessarily an argument to ban all software, though the Butlerian Jihad are taking notes. It is an argument, however, for acknowledging risks, raising awareness of them, and taking reasonable steps to guard against them.
I agree with you. TR36 is outright terrifying. I might feel more comfortable if the recommendations were already baked into foundational libraries like libicu, but they aren't. It's as if I'm wading into a world with as many footguns and silent killers as C's undefined behavior rules. In C we've spent decades developing tools to help detect and mitigate the problems, and they are still regularly exploited. Unicode only has widely-available mitigation tools for a handful of the easiest problems to solve.
C's flaws led to the development of several competing programming languages. Various flaws in cryptography led to the development of algorithms that achieve the same result but have far fewer implementation gotchas. IMO, the time has come for the development of a truly strict variant of Unicode that still supports the primary objective, but learns from Unicode's mistakes.
That's why Unicode published the security guidelines and mechanisms to avoid such attacks. In 2004 already.
The problem is that nobody cared. Browsers invented punycode instead of following TR39, email ditto. But OK, at least something. Java did it, cperl did it, Rust did it.
The only reason we "tolerate" unicode in, say, usernames, is because it does things like let people with non-ascii names have usernames with their names, right... but this is kind of a big reason!
Given the ways that English works, one of which is the removal of all sorts of deliberate and tedious indications of what one is talking about that are usually found in the Germanic languages: when one writes in English and uses the phrase 'other languages' with no specific language under discussion, it should be understood that 'other languages' refers to languages other than English.
The parent is making the point that the language the grandparent is referring to is irrelevant to the language of an arbitrary individual included in the word "we".
It was just rhetorical. As the sibling comment alluded to, it can come off as quite anglocentric. There is some historical relevance to this point - much of the technology we use today was created with such biases baked in. I apologize if this came off as snarky.
If I’m writing industrial automation, server-side applications, video rendering codecs, why do I need to make it much more complex with Unicode? I just need and want pure 1 byte ASCII.
The grandparent's assertion was in the general case, not specific cases. However, even for your specific cases, it is possible - even likely - to at some point handle user-facing text (i.e. text that is reasonable to localize), at the very least in the form of error messages. Also, the complexity of Unicode is typically already well implemented by default in most environments, and the real issue only comes at the intersection of user-facing text and functional strings (e.g. URLs and filenames), which is a problem that Unicode is another tool to exploit, but not a problem that Unicode invented. We have been trying to detangle the dual nature of URLs/filenames for a long time.
However, tangential to the grandparent's topic, I do imagine there are very specialized cases where Unicode is pretty objectively unnecessary.
You can have security issues with unescaped ASCII too, so you probably want to fix the tools and educate people about other languages and cultures (or, even if you just educate them to be capitalists, they would understand that selling their iDevice or CoolApp to make cash requires it to work with non-English text, so Unicode means more money, not "we tolerate your language because we are too nice").
There are certainly contexts in which Unicode unambiguously and demonstrably leads to security weaknesses and issues. See generally homoglyph attacks.
At the heart of the lie and damage is the existence of a message which appears to say one thing but in fact says something different. It's the very limited nature of 7-bit ASCII, 128 characters in total, which provides its utility here. Yes, this means that texts in other languages must be represented by transliterations and approximations. That's ... simply a necessary trade-off.
We see this in other domains, in which for the purposes of reducing ambiguity and emphasizing clarity standardisation is adopted.
Internationally, air traffic control communications occur in English, and aircraft navigation uses feet (altitude) and nautical miles (distance) as units.
Through the early 20th century, the language of diplomacy was French. The language of much scientific discourse, particularly in physics, was German. And for the Catholic Church, Latin was abandoned for mass only in the 1960s.
Trading and maritime cultures tend to create pidgin languages --- common amongst participants, but foreign to all, as distinguished from a creole, an amalgam language with native speakers.
A key problem with computers is that the encodings used to create visual glyphs and the glyphs themselves are two distinct entities, and there can be a tremendous amount of ambiguity and confusion over similarly-appearing characters. Or, in many cases, glyphs cannot be represented at all.
Where the full expressive value of language is required --- within texts, in descriptive fields, and in local or native contexts, I'm ... mostly ... open to Unicode (though it can still present problems).
Where what is foremost in functionality is broad and universal understanding, selecting a small, standardised and widely-recognised character set has tremendous value, and no amount of emotive shaming changes that fact.
As an example, OpenStreetMap generally represents local place names in local languages and character sets. This may preserve respect for, or the integrity of, the local culture. As a user of the map, however, not knowing that language or character set, it is utterly useless to me. Or, quite frankly, to anyone not specifically literate in that language and writing system.
It's worth considering that the character set and language in question are themselves adoptions and impositions: English was brought into Britain by invaders, the alphabet used is itself Roman, based on Greek and originally Phoenician glyphs. English has adopted or incorporated terms from a huge set of other languages (rendering its own internal consistency ... low ... and making it confusing to learn).
International communications and signage, at airports, on roadways, in public buildings, on electronic devices, aims at small message sets and consistent, widely-recognised symbols, shapes, fonts, and colours. That is a context in which the freedoms of unfettered Unicode adoption are in fact hazardous.
(Yes, many of those symbols now have Unicode code points. It is the symbol set and glyph set which is constrained in public usage.)
And the simple fact is that a widely recognised encoding system will most often reflect on some power structure or hierarchy, as that's how these encodings become known --- English, Roman Alphabet, French, German, Latin, etc. Small minor powers tend not to find their writing systems widely adopted (yes, there are exceptions: use of Greek within the Roman empire, Hindu numbering systems). Again, exceptions.
This is why file names in applications shouldn't be considered "just a sequence of bytes forming any name". It should be a tiny subset of valid strings for some definition of "valid". Even if file systems or shells allow anything in a file name, you don't need applications to allow everything. A system SaveAs dialog, or a graphical file explorer should just refuse to open or rename a file to contain U+202E or even something remotely nonobvious. Power users always have a backdoor (open a shell and rename the file) while "most users" should be presented with simple/sane/safe behavior.
Realistically you don't even want to allow a good chunk of ASCII in filenames. Actually the subset that you would like to allow is small, namely lower- and uppercase letters, numbers, ., - and _. That's what, 65 out of 127? Maybe a few more at a stretch, but even some of those are questionable, with some history, like ~.
How would you implement caching, or looking up the decryption key for some directory in a lookup table, if `Documents` and `documents` must both resolve to the same entry? Some kind of normalization would be needed, right? Then you need to introduce encoding, and [Unicode?] normalization. Shivers
> How would you implement caching, or looking up the decryption key for some directory in a lookup table, if `Documents` and `documents` must both resolve to the same entry
Calculating a hash or equality for a string always uses some kind of comparer logic. Being able to use a "raw" comparer would be one special case of that. In C#/Windows, you'd use
var cache = new Dictionary<string, Cached>(StringComparer.InvariantCultureIgnoreCase);
, or similar. Such a comparer correctly treats "documents" and "Documents" as equal and gives them the same hash code. You might think that this is more complex because of the case insensitivity, but it's only slightly so.
E.g. if you instead wanted case sensitivity and naively reached for a culture-sensitive comparer here:
var cache = new Dictionary<string, Cached>(StringComparer.CurrentCulture);
, then you'd actually be in MORE trouble, because now locale-specific collation comes into play. So e.g. under a German culture, weiß.txt and weiss.txt can compare equal and thus also hash the same (despite being two different files). A working Linux lookup table with case sensitivity would look pretty similar to the first one, just with the raw ordinal comparer:
var cache = new Dictionary<string, Cached>(StringComparer.Ordinal);
What if we only allowed the following in filenames: [a-z0-9._-]?
Now there will be no 'Documents', only 'documents'.
The comment I replied to already restricted us to ASCII letters, numbers, _, ., and -. I questioned why both upper and lower case letters should be allowed.
I remember back in the day before websites started focusing on input sanitization you could get away with really messing up website designs with that Unicode character. Long time ago I made a group on Steam with just the Unicode RTL character, years before they started banning the character. Then a decade later, some Russians found it and started using the group as their clan tag to make their character info appear backwards in Counter Strike GO - https://www.youtube.com/watch?v=TZCw_cYIY_g
This is almost moot because Windows hides file extensions by default and has done so since like Windows 95. So the user has to enable showing file extensions for known file types.
You might as well name the file document.doc.exe
If you are a pro user, you will notice the extra extension that shouldn't be there;
if you are a basic user, you don't care about extensions and try to run everything anyway.
I've written an API fuzzer that (among other things) tests specifically for things like this: https://github.com/Endava/cats (with a lot more Unicode special chars). People often forget how big the Unicode standard is and that a lot of control characters are used by different processors for specific, not-very-obvious logic.
I'm curious what the best way is to protect end users or yourself from this, if it still exists. I wonder how trivial it would be for either antivirus software or a user-written program to check files for this?
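A user-written check can be pretty trivial; here is a minimal sketch (the ~/Downloads path is just an assumption, point it wherever you like):

    from pathlib import Path

    # The two directional override characters (LRO and RLO)
    SUSPICIOUS = {"\u202D", "\u202E"}

    for path in Path.home().joinpath("Downloads").rglob("*"):
        if SUSPICIOUS.intersection(path.name):
            escaped = path.name.encode("unicode_escape").decode("ascii")
            print(f"suspicious name: {escaped} (in {path.parent})")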
Back in the web chat days, this is how we used to clone other people's nicknames, haha. If I'm not mistaken you'd press Alt + 0160 on the numpad to insert an invisible char.