Two Wrongs

Unicode Strings in Ada 2012

Ada has a fun history of character support. The language's designers were reasonably quick to jump on the Unicode train, but despite that, there is almost no material on the web on how to deal with anything other than Latin-1 (ISO-8859-1) characters in Ada. So here's what I know:

General Ground Rules

These rules apply no matter which language you are using. In other words, they are not specific to Ada, and you should already know them, but they bear repeating.

Internally in your application, you should not have to care about encodings. At all.

In your application, you should have The String Type that represents text of any flavour, be it English, Navajo or Chinese. It's really simple. What encoding does The String Type use? Who cares. Well, I care, of course, but it doesn't matter for my code.

Ideally, The String Type should be an abstraction that supports various operations you might expect to perform on written text, such as

  • concatenation (smushing two strings together),
  • search (and replace),
  • converting it to uppercase (trickier than you think!),
  • truncating it after x characters, and
  • splitting it up into lines.

I say "ideally" because these operations are not strictly required: you only need The String Type to support the ones you actually want to use.

A word of caution: in many languages, The String Type is not actually the type called String. Examples where this confusion occurs include Python 2, Haskell and Ada. In these languages, the type called String is not The String Type.

So when do encodings matter? When you want text to exit your application. Maybe you write it to a file, or you send it over the internet, or you print it to the user. This is when encodings matter, because the receiver will expect a certain pattern of bits, so you need to put out the right pattern of bits.

To put text data out of your application, you take a value of The String Type, you specify an encoding (which is essentially something that tells you how to convert a text value to bits) and you write out the result of running the text value through the encoding.

To get text data inside your application, you do the opposite: take some data source, take an encoding, and decode the data into a value of The String Type.

Does this sound familiar? This is essentially how you deal with every data type ever. A Python dict is never "encoded" as long as it stays inside the Python application. It's just a dict. Only when you want to put it out on the internet do you encode it to e.g. JSON, which is a bit pattern representing a dict, but it is not a dict itself.

Same thing with text: UTF-8 data is a bit pattern representing text, but it is not text itself.
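In Ada 2012 terms (the types involved are explained in the next section), this boundary rule can be sketched with the standard Ada.Strings.UTF_Encoding packages; the procedure name here is made up:

```ada
--  Inside the application, text is just text (Wide_Wide_String).
--  Only at the boundary is it encoded to, or decoded from, a bit
--  pattern (a plain String holding UTF-8 bytes).
--  Compile with -gnatW8 so GNAT reads this source as UTF-8.
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Boundary_Demo is
   package UTF renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   Text  : constant Wide_Wide_String := "héllo, wörld";

   --  Leaving the application: text -> UTF-8 bytes.
   Bytes : constant String := UTF.Encode (Text);

   --  Entering the application: UTF-8 bytes -> text.
   Again : constant Wide_Wide_String := UTF.Decode (Bytes);
begin
   pragma Assert (Again = Text);
   null;
end Boundary_Demo;
```

Note how the encoded form is an ordinary String of bytes: outside the boundary it is a bit pattern representing text, not text itself.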

In Ada 2012 and Ada 2005

As I said, the history of character support in Ada has been a funny one. But! If we're working in Ada 2005 or newer, we don't need to worry about the history of it. To achieve modern character support in your Ada programs, you should know that

  1. Ada source files support full Unicode in both identifiers (variable names) and string literals. However, you may need to tell your compiler which encoding the source code files are in. To tell GNAT your files are encoded using UTF-8, you pass the -gnatW8 flag.
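    If you use GNAT, an alternative to the flag is to record the encoding in the source file itself with the GNAT-specific Wide_Character_Encoding pragma; a sketch, with made-up names:

    ```ada
    --  GNAT-specific: equivalent to passing -gnatW8 for this file.
    pragma Wide_Character_Encoding (UTF8);

    with Ada.Wide_Wide_Text_IO;

    procedure Hej is
       --  Both the identifier and the literal use non-Latin-1 text.
       Hälsning : constant Wide_Wide_String := "你好";
    begin
       Ada.Wide_Wide_Text_IO.Put_Line (Hälsning);
    end Hej;
    ```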

  2. You want to store characters in variables of type Wide_Wide_Character. This is part of the language so there is nothing special to import. Wide_Wide_Character has full Unicode support and as such can store any character you might want.

    This comes with a caveat that applies to all the following types, though. Technically, it doesn't actually store a character. It stores a Unicode code point, which may or may not be a character. This is to be expected, though, because it's pretty much the only reasonable thing to do.
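    As a sketch (procedure name made up), a Wide_Wide_Character holds a code point directly, and the 'Pos attribute gives you its numeric value:

    ```ada
    procedure Code_Point_Demo is
       --  U+6C34; needs a UTF-8-aware setup such as -gnatW8.
       Water : constant Wide_Wide_Character := '水';
    begin
       --  'Pos yields the Unicode code point number.
       pragma Assert (Wide_Wide_Character'Pos (Water) = 16#6C34#);
       null;
    end Code_Point_Demo;
    ```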

  3. If you have an array of characters, the type for that is Wide_Wide_String. It is also part of the language, so no imports required. However, note that this is still a low-level fixed-size array, which means it cannot reliably support operations such as "convert to upper case", which may change the length of the string. (It does support such operations, but their results may not necessarily be what you expect for some languages.)

    It also carries over the caveat from Wide_Wide_Character: a single index in this string may not actually be a full character; it can be, for example, a combining mark that attaches to the base character before it.

    String literals in Ada are also automatically converted to this type, so you can write Hello : Wide_Wide_String := "你好,世界";.
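    To make the caveat concrete: 'Length counts code points, not user-perceived characters. A sketch (procedure name made up):

    ```ada
    procedure Length_Demo is
       Hello    : constant Wide_Wide_String := "你好,世界";
       --  "é" written as 'e' plus combining acute (U+0301):
       Combined : constant Wide_Wide_String :=
         'e' & Wide_Wide_Character'Val (16#0301#);
    begin
       pragma Assert (Hello'Length = 5);     --  five code points
       pragma Assert (Combined'Length = 2);  --  two indices, one "character"
       null;
    end Length_Demo;
    ```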

  4. If you want a dynamic string, you'll have to import Ada.Strings.Wide_Wide_Unbounded, which has a type Unbounded_Wide_Wide_String which is the closest you'll get to The String Type in standard Ada 2005 and Ada 2012.

    The Ada.Strings.Wide_Wide_Unbounded library is pretty much a copy of the Ada.Strings.Unbounded library, except it deals with Wide_Wide_Characters instead.

    While Unbounded_Wide_Wide_String will store any Unicode character you throw at it, and it does support some basic string operations, it does not support all operations you may want it to. For example, converting a string to uppercase is done on a "codepoint-by-codepoint" basis, which is even more wrong than if it were done "character by character". However, I can't fault Ada for this because almost every language gets this wrong anyway. It is a hard problem.
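    A sketch of the basic operations (procedure name made up):

    ```ada
    with Ada.Strings.Wide_Wide_Unbounded;

    procedure Unbounded_Demo is
       use Ada.Strings.Wide_Wide_Unbounded;

       Greeting : Unbounded_Wide_Wide_String :=
         To_Unbounded_Wide_Wide_String ("你好");
    begin
       Append (Greeting, ",世界");  --  grows dynamically
       pragma Assert (Length (Greeting) = 5);

       --  Convert back to a fixed array when an API demands one.
       pragma Assert (To_Wide_Wide_String (Greeting) = "你好,世界");
       null;
    end Unbounded_Demo;
    ```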

  5. For input/output, the Ada.Wide_Wide_Text_IO package looks pretty much like Ada.Text_IO, except it reads and writes values of type Wide_Wide_String. There is also Ada.Wide_Wide_Text_IO.Wide_Wide_Unbounded_IO, which does input/output directly with unbounded strings.

    However, if you want other people to read or write the text data you're outputting, you may want to specify an encoding to be used outside your application. Since input/output is somewhat platform-dependent, how to do this is not strictly mandated by the Ada standard. The Open and Create procedures are required to accept a Form string which specifies platform-specific instructions; what that string looks like depends on the platform and compiler.

    For writing UTF-8 using GNAT, you specify the string "ENCODING=UTF8,WCEM=8".
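    Putting that together, a sketch of writing UTF-8 text to a file with GNAT (procedure and file names made up):

    ```ada
    with Ada.Wide_Wide_Text_IO;

    procedure Write_Demo is
       use Ada.Wide_Wide_Text_IO;
       F : File_Type;
    begin
       --  The Form string is GNAT-specific, as noted above.
       Create (F, Out_File, "hello.txt", Form => "ENCODING=UTF8,WCEM=8");
       Put_Line (F, "你好,世界");
       Close (F);
    end Write_Demo;
    ```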

  6. If you're used to converting values to strings with the Image attribute, you might want to know that there is a Wide_Wide_Image attribute that does the same thing, except it can handle Unicode values.
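    For example (procedure name made up):

    ```ada
    with Ada.Wide_Wide_Text_IO;

    procedure Image_Demo is
       S : constant Wide_Wide_String := Integer'Wide_Wide_Image (42);
    begin
       --  Like 'Image, non-negative numbers get a leading space.
       pragma Assert (S = " 42");
       Ada.Wide_Wide_Text_IO.Put_Line (S);
    end Image_Demo;
    ```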

So that's the deal in standard Ada 2005 and standard Ada 2012. It's not as sunny in earlier versions, but I'll quickly go through the important details anyway.

In Ada 95

For a long time, it was believed that Unicode could get by with 16 bits to represent the characters for all languages of the world. Originally, "Unicode" was defined as "16-bit characters". History showed this was a bad idea, but it was believed to be true for long enough that many systems are stuck with 16-bit characters; both Java and Windows, for example, deal in 16-bit characters.

Aaand so does Ada 95. The types are Wide_Character, Wide_String and Unbounded_Wide_String. Read the advice for Ada 2012, replace every occurrence of "Wide_Wide" with simply "Wide", and you'll be good. Except, of course, that you're limited to 16-bit code units, with surrogate pairs and everything that comes with them.

Another difference you'll find is that while Ada 95 allows 16-bit code points in strings, it does not allow them in identifiers. So variable and function names and such are still limited to 8-bit ISO-8859-1 ("Latin-1").

...In Ada 83

Originally, Ada 83 only supported ASCII, i.e. 7 bit codepoints. This is what you should expect if you're using Ada 83. It's also worth knowing that Ada 83 does not have the RAII-style controlled types that were introduced in Ada 95, so you cannot have "unbounded" strings in Ada 83. Only fixed-size strings are available.

Oh, and many compilers sneakily switched over to an 8-bit encoding during the lifetime of Ada 83, so if you desperately need it, check if this is the case with yours. If so, you're probably dealing with ISO-8859-1, also known as "latin-1".