The boom in smartphone use has fuelled a long-simmering controversy in the IT community over adopting Unicode as the standard for character encoding in Myanmar, instead of Zawgyi.
By GRIFFIN HOTCHKISS | FRONTIER
A passionate Michael Suantak was speaking in the Phandeeyar tech hub’s downtown Yangon office.
“They should not cheat the people; they should not cheat the future,” Mr Suantak said. “At the moment they will be very popular…but when the people realise [they have been cheated], after five or ten years, they will be very angry. So we cannot compromise,” he said.
You might be forgiven for thinking that he is discussing the national ceasefire, constitutional reform, or any other of the myriad challenges that await Myanmar’s new government when it takes power on April 1. But Mr Suantak is talking about computer fonts. Well, not fonts, exactly – about character encoding in Myanmar languages, and the grand battle for the minds of the country’s new smartphone-only generation of internet users.
Mr Suantak is the author of BIT font, one of the early solutions for Myanmar text encoding in modern operating systems. To understand the gravity of his comments, we need to understand character encoding and how it applies to languages such as Burmese.
What follows is a quick primer for the non-Burmese speaker on Unicode and Zawgyi, a history of the dispute over Myanmar fonts, and how the resolution (or non-resolution) of these issues will affect the future of Myanmar’s languages in silico.
Burmese character encoding 101
These words were written on my keyboard and saved to a small file on my computer. The file, like everything else on a computer, is nothing but a series of numbers. In a text file such as this article, each letter, punctuation mark, space, and paragraph break gets its own unique number, called a code point, so that when another computer reads the file, it has enough information to reproduce the sequence of letters exactly as I typed them. The conversion from written letters to a long string of numbers is called encoding.
Going the opposite way, from a long string of numbers to images of letters for printing, is called decoding. The process of encoding and decoding only works if both computers agree on the same letters corresponding to the same numbers. That is to say, there needs to be an encoding scheme so that all text is handled in the same way by all computers. In other words, character encoding must be standardised. For most of the world’s writing systems, the Unicode standard has been adopted so that alphabetisation, sorting, and encoding remain consistent across all operating systems and applications.
For English and any language that uses the familiar Latin alphabet, following the Unicode standard is a relatively simple task of assigning each letter in the alphabet to a unique code point. But for many languages with more complicated writing systems, choosing the best coding scheme has been difficult. Burmese, in particular, involves many modifications to a single written character.
Usually these modifications appear above, below, or to the left of the base consonant character. The challenge for encoding, then, is finding a way to assign a unique code point to each of the component parts so that when combined together, a computer can render the desired character.
For example, the word “myo” is a single complex character made up of five simple character elements. “Ma” is the base consonant, “ya-yit” adds a “ya” sound to become “mya”, “loungji-tin” and “tachaun-ngin” combined modify the vowel sound to become a tight “myoh”, and the final “auka-myit” signifies a creaky tone at the end of the word. For encoding, each simple element must be assigned a code point.
Additionally, each element in a character might change how another element should be rendered. In this example, the “ya-yit” must be cut off slightly so as not to cover up the “loungji-tin”, and the “auka-myit” must be placed further to the right to make space for the “tachaun-ngin”. Unicode handles all this by relying on an intelligent rendering engine – each element has one and only one code point, but the rendering engine modifies the shape and width of the element automatically depending on which other elements are present.
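For readers who want to see the five elements as a computer stores them, here is a short Python sketch. It assumes that the example word is the Burmese word for “town”, and the pairing of the article's element names with Unicode's official character names in the comments is this sketch's own reading, not something the standard spells out in those terms.

```python
import unicodedata

# Assumption: "myo" here is the Burmese word for "town".
MYO = (
    "\u1019"  # base consonant "ma"
    "\u103C"  # "ya-yit" (Unicode calls it medial RA)
    "\u102D"  # "loungji-tin" (vowel sign I)
    "\u102F"  # "tachaun-ngin" (vowel sign U)
    "\u1037"  # "auka-myit" (dot below)
)

# One stored code point per element, in a single fixed order.
names = [unicodedata.name(ch) for ch in MYO]
for ch, name in zip(MYO, names):
    print(f"U+{ord(ch):04X}  {name}")

print(len(MYO))  # 5
```

Nothing in the stored sequence says how wide the “ya-yit” should be or where the “auka-myit” sits. Those decisions are left to the font and the rendering engine at display time.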
The birth of Burmese fonts
In the early days of Burmese fonts, getting a computer to display all the possible shapings for a character was difficult, because Windows did not support intelligent rendering of fonts. Ko Ngwe Tun, the author of one of the first Burmese fonts, Myazedi, devised a workaround still used by Zawgyi to this day: he mapped each individual variation of a character element to its own unique code point. To write our example word, myo, users of Myazedi had to find the correct “ya-yit” manually from eight possible variations, and the correct “auka-myit” from three possible variations.
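The burden on the typist can be made concrete by multiplying out the counts quoted above. The sketch below is purely illustrative: the variant labels are hypothetical placeholders, not actual Myazedi or Zawgyi code points.

```python
from itertools import product

# Hypothetical labels for the shape variants described in the text.
ya_yit_variants = [f"ya-yit#{i}" for i in range(1, 9)]        # eight shapes
auka_myit_variants = [f"auka-myit#{i}" for i in range(1, 4)]  # three shapes

# Every pairing the typist could pick for this one word.
candidates = list(product(ya_yit_variants, auka_myit_variants))
print(len(candidates))  # 24
```

With one code point per shape, the choice among those pairings falls to the person at the keyboard rather than to the software.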
Myazedi, BIT, and later Zawgyi, circumvented the rendering problem by taking over code points that were reserved for Myanmar’s ethnic languages. Not only does the re-mapping prevent future ethnic language support, it also results in a typing system that can be confusing and inefficient, even for experienced users.
In Zawgyi, there are six different ways to write the word “myo” that render a superficially “correct” character, and many more if you allow for “incorrect” variations that would look strange but still be intelligible to a reader. A computer, however, sees these variations as completely different words. Modern Unicode, by contrast, has only one code point per element, and will render properly only if the elements are encoded in the correct sequence, meaning that for each word there is one and only one encoding.
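Why this matters to software can be shown with a toy model. The sketch below is not real Zawgyi data: the variant labels and the `CANONICAL` table are hypothetical, and a real converter has to deal with far more cases, including the order in which elements are typed.

```python
# Toy model: each shape variant is a distinct stored value.
# Labels are hypothetical placeholders, not real Zawgyi code points.
CANONICAL = {
    "ya-yit/clipped": "ya-yit",
    "ya-yit/wide": "ya-yit",
    "auka-myit/left": "auka-myit",
    "auka-myit/right": "auka-myit",
}


def normalise(elements):
    """Map every shape variant back to its single underlying element."""
    return tuple(CANONICAL.get(e, e) for e in elements)


# Two spellings a reader would accept as the same word.
spelling_a = ("ma", "ya-yit/clipped", "loungji-tin", "tachaun-ngin", "auka-myit/right")
spelling_b = ("ma", "ya-yit/wide", "loungji-tin", "tachaun-ngin", "auka-myit/left")

print(spelling_a == spelling_b)                        # False
print(normalise(spelling_a) == normalise(spelling_b))  # True
```

A search box, a spell-checker or a sort routine that compares the raw values sees two unrelated words. Only after an extra normalising step, which someone must write and maintain, do they match.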
“We knew that it was going to be just a temporary solution, because eventually Microsoft and others would support a standard,” said Ko Ngwe Tun. “Once the standard was developed, we informed our customers that we could no longer support [them].”
Non-standardisation was not the only problem for Myazedi, though. It was also very expensive. A user licence was US$100, and a developer licence – needed for content producers such as online media – was $1,000. In 2002, this price was well beyond the means of most companies. Just like any other piece of expensive software, there was an incentive for piracy.
The rise of Zawgyi
The Zawgyi-One font was released in Mandalay as freeware in 2006, and it bore a striking resemblance to Myazedi – the first version even contained some of Ko Ngwe Tun’s copyright messages intact and unnoticed before release. That did little to dissuade people from downloading it. Soon many of the largest software companies in Myanmar were using or were planning to use Zawgyi instead of Myazedi.
Ko Ngwe Tun’s company, Solveware Solution, published a legal notice threatening to sue any company using a pirated Myazedi font, as well as the developers of Zawgyi. This angered many in the software community (especially the implicated companies), and in response Zawgyi was modified – the change brought Zawgyi even further from the Unicode standard – to make it harder to prove intellectual property theft.
Meanwhile, internationalisation efforts continued for Unicode. The release of Windows XP Service Pack 2 brought support for complex scripts, which made it possible for Windows to render a Unicode-compliant Burmese font such as Myanmar1 (released in 2005).
Getting a Unicode font to work with Windows, however, still required a bit of technical knowledge and some configuration. Ravi Chhabra, a Unicode researcher and the author of the first Zawgyi/Unicode detection engine, says this was the first advantage Zawgyi had over Unicode.
The second, he says, was the adoption of the internet as a source of information. Starting with the monk-led protests in 2007 called the Saffron Revolution and continuing with Cyclone Nargis in 2008, an increasing number of people in Myanmar began to realise the power of the internet as a medium, and the demand for Burmese content rose dramatically.
“The reason for Zawgyi’s huge success is planet.com.mm,” said Ko Ravi. “It was the web portal for news, and they used Zawgyi. When people went [to the site], there was a link to download the font, because font embedding didn’t work back then,” he said.
“The third thing is blogging. Nyi Linn Sat, one of the proto-bloggers, used Zawgyi and wrote instructions on how to use Zawgyi to set up blogs. People loved it, and they started blogging with Zawgyi.”
Ko Ngwe Tun eventually backed down from his legal threats, but the damage was done. Galvanised by the dispute with Solveware Solution, the developers of Zawgyi continued promoting their own product online as the best font for Burmese.
Subsequent Zawgyi releases made the font easier to install for an average user, and much harder to migrate away from. In some cases it forced the user’s whole system to default to Zawgyi, either by injecting it into the default Microsoft Arial font, or by installing as an Internet Explorer plugin with no option for uninstall.
“If you ask me,” said Ko Ravi, “these things did not happen in good faith. If they did, we wouldn’t be where we are today.”
Where are we?
Unicode for Myanmar languages has been refined and updated continuously since those early days. Complex characters and intelligent rendering are built into Unicode, and the standard has been endorsed by Google, Apple, and Facebook as the future of Myanmar language support. Many of Myanmar’s ethnic languages, such as Shan, Mon, Kayah, and Karen, are also supported within the Unicode Myanmar codespace.
If you access Facebook to find a post written in Burmese, however, it will almost certainly be written in Zawgyi. Some news media websites offer Unicode as an option, but most do not. Huawei and Samsung, the two most popular smartphone brands in Myanmar, are motivated only by capturing the largest market share, which means they support Zawgyi out of the box.
More Myanmar people are going online for the first time, and when they open their new smartphone, they are unknowingly being inducted into Zawgyi’s massive userbase. Zawgyi is the font of the layman and that’s the crux of the problem.
The network effect is the only thing keeping Zawgyi alive. Switching to Unicode is a risk. Content producers such as media websites risk losing readers, phone makers risk losing customers, and ordinary folks risk simply alienating themselves from their friends. Zawgyi will have to die eventually – Unicode is the de facto standard worldwide, and for good reason.
Until that happens, however, all of the digital content produced by Myanmar’s rapidly growing population of internet users will be flawed. Searching, ordering, and manipulating content written in Zawgyi is a nightmare for developers, who must account for all of the redundancies of Zawgyi’s inefficient coding scheme, and who cannot make use of any software built for Unicode-compliant content.
“In the future, there will be machine translation, there will be optical character recognition, there will be text-to-voice, and all [the Zawgyi content] will not be usable.”
Mr Suantak believes this will result in a lot of orphaned information, but he remains optimistic for those who choose to migrate their content. “BBC uses a fully Unicode-compliant font, and people still follow the BBC. If the information is good enough, the people will change to get the information.”
The great migration to Unicode might seem like an arduous undertaking, but it’s less challenging than some of the other obstacles Myanmar will face in its transformation. Ko Thura Hlaing, an avid supporter of Unicode and author of a Zawgyi/Unicode conversion script, believes that a clear directive from the government is all that is needed to push people to change.
“They also had this [problem] in Cambodia, and the government stood up and declared ‘If you want to sell or distribute in this country, you must use Unicode’. And the issue was solved,” Ko Thura Hlaing said.
Hopefully, it’ll be just as easy in Myanmar.