Bengali is the seventh most widely spoken language in the world, with more native speakers than Russian. That's hardly a "niche language".
And, as pg has pointed out, making the computer industry English-centric and America-centric is greatly limiting us. Intelligence and will to power are equally distributed across the globe. But institutional barriers to success are not. This is a classic example of an institutional barrier, and it makes all of us poorer.
It's also the national language of Bangladesh. Once a language has become the official language of a recognized country it can hardly be considered a "niche" language.
Right, but some languages are insanely complex to implement. It might be a better idea to teach English to people around the globe rather than cater to every individual need (which will still leave people unable to communicate across languages).
I'm not saying other languages should go away -- but the world would also benefit from having a "universal" language, which is more or less English at this point (Mandarin is spoken by more people but is rarely spoken outside of Asia). If we want to maximize intelligence and will to power, it would be best if like-minded individuals could communicate regardless of where they're from or what their native language is.
It's one thing to say English will be the international language of trade and commerce. It's another thing altogether to say we won't bother making character sets that represent other languages, because it's too hard.
Since when has "too hard" ever stopped an engineer? That should be catnip for us!
Unicode is notoriously tricky to implement; there have been bugs where users could take over others' accounts by choosing a different Unicode representation of the same string.
I'd much rather have something that works and is simple than something that tries to make it easy for everybody and is broken.
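A minimal Python sketch of the kind of bug being described: two strings that render identically can have different code point sequences, so a uniqueness check that skips normalization can be bypassed.

    import unicodedata

    # Two visually identical usernames: one uses precomposed é (U+00E9),
    # the other uses e + combining acute accent (U+0301).
    a = "jos\u00e9"
    b = "jose\u0301"
    print(a, b)    # both render as "josé"
    print(a == b)  # False -- a naive uniqueness check treats them as distinct
    # Normalizing (here to NFC) before comparing closes this particular hole.
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True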
> some languages are insanely complex to implement
I don't understand this. Is there more to implementing a language than creating glyphs for its character set? I wouldn't think the linguistic complexity would matter at all, only the number of glyphs in the 'alphabet' or similar.
Languages aren't all just simple alphabets like English's. Some languages use ligatures to combine characters. In English, ligatures like 'fi' and 'ffl' can be applied almost automatically and are optional, but other languages have stronger and more important rules.
As a simple example, in German the ligature ß is not a simple ligature for 'ss' but a combination of two earlier ligatures: long s with round s ("ſs") and long s with round z ("ſʒ"). Various spelling reforms have simplified the orthography, but "Maßen" and "Massen" are still different words.
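Both points are easy to see from Python's standard library: the 'fi' ligature has its own code point that compatibility normalization unpacks, and case folding maps ß to "ss", which is exactly why "Maßen" vs "Massen" needs care in case-insensitive comparisons.

    import unicodedata

    # The 'fi' ligature (U+FB01) is a distinct code point; NFKC
    # normalization decomposes it back to plain 'f' + 'i'.
    print(unicodedata.normalize("NFKC", "\ufb01"))    # 'fi'

    # Case folding maps 'ß' to 'ss', so a case-insensitive comparison
    # conflates the two distinct words mentioned above.
    print("Maßen".casefold() == "Massen".casefold())  # True
    print("Maßen".upper())                            # 'MASSEN' -- uppercasing is lossy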
Quoting from a Wikipedia page: "Urdu (one of the main languages of South Asia), which uses a calligraphic version of the Arabic-based Nasta`liq script, requires a great number of ligatures in digital typography. InPage, a widely used desktop publishing tool for Urdu, uses Nasta`liq fonts with over 20,000 ligatures".
Then there are rules for presentation. "Complex text layout ... refers to the typesetting of writing systems in which the shape or positioning of a grapheme depends on its relation to other graphemes." - http://en.wikipedia.org/wiki/Complex_text_layout . Cursive English is the closest we have to complex text layout; while there are "cursive" fonts where each character is drawn in cursive, the letters don't merge. Now imagine a language where smooth connections and fancy curlicues in the "right" places were essential for being seen as erudite, and where "right" depended on 5 years of learning.
Yes, if the way the characters look depends on the other characters in the word, including some that are nowhere near being neighbours. Especially if there are weird and complex rules about how this changes depending on the kind of word being written.
There are a number of writing systems that are evil like that, including several Indic ones.
The code that handles this complicated process is often called a shaper. Choosing and combining the correct glyphs involves a complicated dance between the shaper and the font(s), possibly including large tables (and code!) in the font itself on top of what the shaper does.
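Arabic shows the shaping problem with nothing but Python's standard library: one letter has four positional forms, kept in Unicode only as legacy "presentation forms" that all normalize back to the same base character. A modern shaper computes the right form at render time instead.

    import unicodedata

    # The Arabic letter BEH (U+0628) is shaped differently when isolated,
    # initial, medial, or final in a word; each legacy presentation form
    # normalizes back to the single base letter.
    for cp in range(0xFE8F, 0xFE93):
        ch = chr(cp)
        base = unicodedata.normalize("NFKC", ch)
        print(f"U+{cp:04X} {unicodedata.name(ch)} -> base U+{ord(base):04X}")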
It really depends on the language, and it's not totally about the glyphs. Text entry is a huge challenge for languages like Mandarin, where glyphs can have multiple pronunciations and meanings depending on context. Consider that Mandarin (which shares many, but not all, glyphs with Japanese kanji) has upwards of 20,000 different glyphs, and that other languages have a similar level of complexity, and it becomes hard to find an encoding standard capable of handling all of that complexity and variance.
What constitutes a "glyph" isn't even consistent - in some languages a glyph is a syllable, in some (like English) it's less than a syllable, and in yet others a single glyph can be an entire word.
In a language like Japanese, multiple glyphs are often combined to create new composite glyphs with different meanings. For example, the word for "forest" (森) is a single glyph composed of three "tree" glyphs (木), but has an unrelated pronunciation.
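Concretely, each member of that family is its own code point, not a sequence of "tree" characters:

    # 木 (tree), 林 (grove: two trees), 森 (forest: three trees) are each
    # encoded as one independent code point with its own reading.
    for ch, gloss in [("木", "tree"), ("林", "grove"), ("森", "forest")]:
        print(f"U+{ord(ch):04X} {ch} {gloss}")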
How do you handle text entry across these differences? It may seem like a pedantic question, but it makes sense to define the characters in the way they will be written; otherwise text entry becomes so complex that you need an interpreter to convert from the entry scheme into the Unicode representation. I think this is the problem the Unicode Consortium is grappling with, and it's not an easy problem. I don't claim to have the answers here, but I do recognize the complexity.
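To make the text-entry problem concrete, here is a toy sketch of how a pinyin input method maps keystrokes to candidate characters. The table and function names are purely illustrative; real IMEs rank tens of thousands of candidates using frequency and context models.

    # Hypothetical, minimal pinyin lookup: the user types a Latin syllable
    # and picks the intended character from the candidates. One syllable
    # maps to many characters, so entry can't be one-keystroke-one-character.
    PINYIN_TABLE = {
        "ma": ["妈", "马", "吗", "骂"],  # mother, horse, question particle, scold
        "shu": ["书", "树", "数"],       # book, tree, number
    }

    def candidates(syllable: str) -> list[str]:
        """Return the possible characters for a typed syllable."""
        return PINYIN_TABLE.get(syllable, [])

    print(candidates("ma"))  # the UI presents these for the user to choose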
User interface isn't the problem, though - bitwise representation is the problem. How do we represent all the valid characters in Unicode? Data entry is an entirely separate issue (as is display).
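And on the bitwise side, UTF-8 is the usual answer: a variable-length encoding in which ASCII stays one byte and other scripts take two to four.

    # Code points encode to 1-4 bytes in UTF-8 depending on their value.
    for ch in ["A", "ß", "ह", "森", "𝄞"]:
        b = ch.encode("utf-8")
        print(f"U+{ord(ch):04X} ({ch}): {len(b)} byte(s): {b.hex(' ')}")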
Hypothetically you could construct a language where the glyphs are easy to generate procedurally on the fly by people who are fluent in that language, but whose full space of possible glyphs is staggeringly massive.
Suppose a language with tens or hundreds of thousands of "base" glyphs, but with a unique variant of each glyph depending on what is to the left and right of it. With that alone, for N base glyphs, each glyph could have N^2 contextual variants (one per left/right pair), giving N^3 possible forms in total.
I don't know if that sort of language exists. I don't see any reason why it couldn't though.
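A back-of-the-envelope sketch of that hypothetical (all numbers invented for illustration): rather than assigning a code point to every contextual form, you would encode only the base glyph and derive the variant procedurally at render time.

    # Hypothetical: N base glyphs, a distinct variant per (left, right)
    # neighbour pair -> N**2 variants per glyph, N**3 renderable forms in
    # all. Far too many to enumerate as code points, so the variant is
    # computed rather than stored.
    def variant_id(left: int, base: int, right: int, n: int) -> int:
        """Map (left, base, right) to a unique renderable-form index."""
        return (base * n + left) * n + right

    N = 100_000                     # "hundreds of thousands" of base glyphs
    print(f"{N**3:,} total forms")  # 1,000,000,000,000,000
    print(variant_id(left=3, base=7, right=9, n=N))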