I don't understand; I don't feel like character combination using the zero width...

gopalv · on March 17, 2015

That's effectively the only way to write several Indian languages with Unicode - ZWJ + ZWNJ and a decent font which supports all the ligatures.

The recent Bullshit Sans font is a clear example of how ligatures work. And for Malayalam there are consonant patterns which do not fit the Sanskrit model, which makes it rather odd to write half-consonants which are full syllables (വ്യഞ്ജനം +് zwj).

And my name is written with one of those (ഗോപാൽ) & I can't say I'm mad about it because I found Unicode to be elegant in another way.

Somewhere in the early 2000s, I was amazed to find out that the Unicode layouts for Malayalam as bytes in UTF-8 were sortable as-is.

As a programmer, I found that link between encoding and sort order fascinating, as it meant that I had to do nothing to handle Malayalam text in my programs - the collation order is implicit, provided everyone reads the ZWJ and ZWNJ in their sorting orders.
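A minimal sketch of the byte-order property, in Python. The sample words are arbitrary strings picked for illustration, not taken from any real word list. It only shows that sorting by raw UTF-8 bytes gives the same order as sorting by code point; whether that order is the "right" Malayalam collation is the claim above, not something this check proves.

```python
# Arbitrary Malayalam sample strings (an assumption for illustration),
# written as escapes so the exact code points are visible.
words = [
    "\u0d2e\u0d32",
    "\u0d15\u0d1f\u0d32",
    "\u0d06\u0d28",
    "\u0d17\u0d4b\u0d2a\u0d3e\u0d7d",
]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Sorting on the encoded bytes compares raw UTF-8 byte sequences instead.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# UTF-8 is designed so that these two orders agree.
same_order = by_code_point == by_utf8_bytes
print(same_order)
```

So a plain byte-wise sort (memcmp, or a database that compares raw bytes) sees the same order as a code point sort, with no decoding step.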

cosarara97 · on March 17, 2015

It's like typing ` + o to get "ò", isn't it? You can argue that ò is actually an o with that tilde, while that character is not ত + ্ + an invisible joining character, but that's an input method thing, and there is a ৎ character after all.

jamie_ca · on March 17, 2015

They're on the same key on my keyboard, but ` is a grave, ~ is a tilde.

zhemao · on March 17, 2015

But ò has its own codepoint, whereas the ending character in Aditya's name does not.
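For anyone who wants to check this kind of thing themselves, here is a rough sketch using Python's standard unicodedata module. It assumes the character in question is the ৎ mentioned upthread (U+09CE); that is my reading of the thread, not something the article confirms.

```python
import unicodedata

o_grave = "\u00f2"    # ò as a single precomposed code point
khanda_ta = "\u09ce"  # ৎ, assumed here to be the character under discussion

# NFD splits the precomposed letter into a base letter plus a combining mark.
decomposed = unicodedata.normalize("NFD", o_grave)
decomposed_names = [unicodedata.name(ch) for ch in decomposed]

# Look up the name of the Bengali code point.
khanda_ta_name = unicodedata.name(khanda_ta)

print(decomposed_names)
print(khanda_ta_name)
```

The same two calls (normalize and name) work for any other character you want to inspect.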

Manishearth · on March 18, 2015

I agree.

Most Devanagari glyphs don't have their own codepoint. Marathi/Hindi/Sanskrit (which use Devanagari) have a bunch of basic consonants and some vowels (which can appear independently or as modifiers). All the glyphs are logically formed by mixing the two, so the glyph for "foo" would be the consonant for "f"[1] plus the vowel modifier for "oo". When typing this in Unicode, you would do the same: type फ then ू, creating फू.

It gets interesting when we get to consonant clusters. As mentioned in [1], the consonants have a schwa by default, so the consonant for s followed by the consonant for k with a vowel modifier for the "y" sound would not be "sky", but instead something like "səky" (suh-ky).

So how do we write these? We can do this in two ways. One way is to use the no-vowel modifier, which looks like a straight tail[3] (on स, the consonant for "s", or "sə", the tail looks like स्), and follow that by the other consonant. So "sky" would be स् कै [2]. This method is rarely used, and the straight-tail is only used when you want to end a word with a consonant[4].

The more common way of writing multiple consonants is to use glyphs known as consonant clusters or conjuncts[5]. For "sky", we get स्कै, which is a partial glyph for स stuck to the glyph for कै. For most clusters you can take the known "partial" form of the first glyph and stick it to the full form of the second glyph, but there are tons of exceptions, e.g. द+ध=द्ध, ह+म=ह्म (the second character was broken), and whatnot. See http://en.wikipedia.org/wiki/Devanagari#Biconsonantal_conjun... if you want a full table.

There aren't individual Unicode codepoints for this, not even codepoints for the straight-tail form of the consonants. I typed स्कै as स्+कै which was itself typed as स + ् + क + ै. This isn't an irregular occurrence either; consonant clusters (with a vowel modifier!) are pretty common in these languages[6].
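To make that concrete, here is a small Python sketch that spells out the four code points behind that one glyph. The variable names are mine; only the standard unicodedata module is used.

```python
import unicodedata

# The cluster as typed: consonant, no-vowel sign, consonant, vowel sign.
cluster = "\u0938\u094d\u0915\u0948"

# List each code point with its official Unicode name.
parts = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in cluster]
for code_point, name in parts:
    print(code_point, name)

# NFC normalization has no precomposed form to fall back on here,
# so the string stays as four separate code points.
nfc = unicodedata.normalize("NFC", cluster)
stays_decomposed = nfc == cluster and len(nfc) == 4
print(stays_decomposed)
```

Turning those four code points into a single joined glyph is entirely the job of the font and the text shaping engine.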

I personally don't see anything wrong with having to use combining characters to get a single glyph. If it's possible and logical for it to be broken up that way, it's fine. With this trick, it's possible to represent Devanagari as a 128-codepoint block (http://en.wikipedia.org/wiki/Devanagari_%28Unicode_block%29), including a lot of the archaic stuff. It's either that, or you make a character for every combined glyph there is, which is a lot[7]. One could argue that things like o-umlaut get their own codepoint while स्क doesn't, but o-umlaut is one of maybe 10 such characters for a given European language, whereas स्क is one of around 700 (and that number is being conservative with the "usefulness" of the glyph, see [7]).
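A quick sanity check of the 128 figure, as a Python sketch. The range U+0900 to U+097F is the main Devanagari block; the constant and function names are made up for this example.

```python
# Bounds of the main Devanagari block (inclusive).
DEVANAGARI_START = 0x0900
DEVANAGARI_END = 0x097F

# Number of code point slots in the block.
block_size = DEVANAGARI_END - DEVANAGARI_START + 1


def in_devanagari_block(text: str) -> bool:
    """Return True if every code point of text lies in the main block."""
    return all(DEVANAGARI_START <= ord(ch) <= DEVANAGARI_END for ch in text)


# The two-consonant cluster from above: consonant, no-vowel sign, consonant.
cluster = "\u0938\u094d\u0915"

print(block_size)
print(in_devanagari_block(cluster))
```

Every piece of the conjunct comes out of that one small block; none of the 700-odd combined forms needs a slot of its own.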

The article is never quite clear about which glyph Aditya finds lacking for his name (sadly, I don't know Bengali so I can't figure it out), but from the comments it seems like it's something which can be inputted in Unicode, just not as a single character. That's okay, I guess. And if it's not showing up properly, that's a fault of the font. (And if it's hard to input, a fault of the keyboard).

It becomes a Unicode problem when:

- There is no way to input the glyph as a set of unicode code points, or
- The input method for the glyph as a set of unicode code points can also mean and look like something else given a context (fonts can only implement one, so it's not fair to blame them)

[1]: well, fə, since the consonants are schwa'd by default. Pronounced "fuh" (ish)

[2]: the space is intentional here so that I can type this without it becoming something else, but in written form you wouldn't have the space. Also it's not exactly "sky", but close enough.

[3]: called paimodi ("broken foot") in Marathi

[4]: which is pretty rare in these languages. In some cases however, words that end with consonant-vowel combinations do get pronounced as if they end with a consonant (http://en.wikipedia.org/wiki/Schwa_deletion_in_Indo-Aryan_la...), but they're still written as if they ended with a vowel (this is true for my own name too, the schwa at the end is dropped). Generally words ending with consonants are only seen in onomatopoeia and whatnot.

[5]: called jodakshar ("joined word") in Marathi

[6]: Almost as common as having two side by side consonants in English. We like vowels, so it's a bit less common, but still common.

[7]: technically infinite, though that table of 700-odd biconsonantal conjuncts would contain all the common ones (assuming we still have the vowel modifying diacritics as separate codepoints), and adding a few triconsonantal conjuncts would represent almost all of what you need to write Marathi/Hindi/Sanskrit words. It wouldn't let you represent all sounds though, unless you still have the ् modifier, in which case why not just use that in the first place?