A.4.11 String Encoding
1/3
{AI05-0137-2} Facilities for encoding, decoding, and converting strings in various
character encoding schemes are provided by packages Strings.UTF_Encoding,
Strings.UTF_Encoding.Conversions, Strings.UTF_Encoding.Strings,
Strings.UTF_Encoding.Wide_Strings, and Strings.UTF_Encoding.Wide_Wide_Strings.
Static Semantics
2/3
{AI05-0137-2} The encoding library packages have the following declarations:
3/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding is
   pragma Pure (UTF_Encoding);
4/3
   -- Declarations common to the string encoding packages
   type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
5/3
   subtype UTF_String is String;
6/3
   subtype UTF_8_String is String;
7/3
   subtype UTF_16_Wide_String is Wide_String;
8/3
   Encoding_Error : exception;
9/3
   BOM_8 : constant UTF_8_String :=
             Character'Val(16#EF#) &
             Character'Val(16#BB#) &
             Character'Val(16#BF#);
10/3
   BOM_16BE : constant UTF_String :=
                Character'Val(16#FE#) &
                Character'Val(16#FF#);
11/3
   BOM_16LE : constant UTF_String :=
                Character'Val(16#FF#) &
                Character'Val(16#FE#);
12/3
   BOM_16 : constant UTF_16_Wide_String :=
              (1 => Wide_Character'Val(16#FEFF#));
13/3
   function Encoding (Item    : UTF_String;
                      Default : Encoding_Scheme := UTF_8)
      return Encoding_Scheme;
14/3
end Ada.Strings.UTF_Encoding;
15/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Conversions is
   pragma Pure (Conversions);
16/3
   -- Conversions between various encoding schemes
   function Convert (Item          : UTF_String;
                     Input_Scheme  : Encoding_Scheme;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_String;
17/3
   function Convert (Item         : UTF_String;
                     Input_Scheme : Encoding_Scheme;
                     Output_BOM   : Boolean := False)
      return UTF_16_Wide_String;
18/3
   function Convert (Item       : UTF_8_String;
                     Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
19/3
   function Convert (Item          : UTF_16_Wide_String;
                     Output_Scheme : Encoding_Scheme;
                     Output_BOM    : Boolean := False)
      return UTF_String;
20/3
   function Convert (Item       : UTF_16_Wide_String;
                     Output_BOM : Boolean := False)
      return UTF_8_String;
21/3
end Ada.Strings.UTF_Encoding.Conversions;
22/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Strings is
   pragma Pure (Strings);
23/3
   -- Encoding / decoding between String and various encoding schemes
   function Encode (Item          : String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False)
      return UTF_String;
24/3
   function Encode (Item       : String;
                    Output_BOM : Boolean := False)
      return UTF_8_String;
25/3
   function Encode (Item       : String;
                    Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
26/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme)
      return String;
27/3
   function Decode (Item : UTF_8_String)
      return String;
28/3
   function Decode (Item : UTF_16_Wide_String)
      return String;
29/3
end Ada.Strings.UTF_Encoding.Strings;
30/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Wide_Strings is
   pragma Pure (Wide_Strings);
31/3
   -- Encoding / decoding between Wide_String and various encoding schemes
   function Encode (Item          : Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False)
      return UTF_String;
32/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean := False)
      return UTF_8_String;
33/3
   function Encode (Item       : Wide_String;
                    Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
34/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme)
      return Wide_String;
35/3
   function Decode (Item : UTF_8_String)
      return Wide_String;
36/3
   function Decode (Item : UTF_16_Wide_String)
      return Wide_String;
37/3
end Ada.Strings.UTF_Encoding.Wide_Strings;
38/3
{AI05-0137-2}
package Ada.Strings.UTF_Encoding.Wide_Wide_Strings is
   pragma Pure (Wide_Wide_Strings);
39/3
   -- Encoding / decoding between Wide_Wide_String and various encoding schemes
   function Encode (Item          : Wide_Wide_String;
                    Output_Scheme : Encoding_Scheme;
                    Output_BOM    : Boolean := False)
      return UTF_String;
40/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean := False)
      return UTF_8_String;
41/3
   function Encode (Item       : Wide_Wide_String;
                    Output_BOM : Boolean := False)
      return UTF_16_Wide_String;
42/3
   function Decode (Item         : UTF_String;
                    Input_Scheme : Encoding_Scheme)
      return Wide_Wide_String;
43/3
   function Decode (Item : UTF_8_String)
      return Wide_Wide_String;
44/3
   function Decode (Item : UTF_16_Wide_String)
      return Wide_Wide_String;
45/3
end Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
46/3
{AI05-0137-2} {AI05-0262-1} The type Encoding_Scheme defines encoding schemes. UTF_8 corresponds
to the UTF-8 encoding scheme defined by Annex D of ISO/IEC 10646. UTF_16BE
corresponds to the UTF-16 encoding scheme defined by Annex C of ISO/IEC
10646 in 8-bit, big-endian order; and UTF_16LE corresponds to the UTF-16
encoding scheme in 8-bit, little-endian order.
47/3
{AI05-0137-2} The subtype UTF_String is used to represent a String of 8-bit values
containing a sequence of values encoded in one of three ways (UTF-8,
UTF-16BE, or UTF-16LE). The subtype UTF_8_String is used to represent
a String of 8-bit values containing a sequence of values encoded in UTF-8.
The subtype UTF_16_Wide_String is used to represent a Wide_String of
16-bit values containing a sequence of values encoded in UTF-16.
48/3
{AI05-0137-2} {AI05-0262-1} The BOM_8, BOM_16BE, BOM_16LE, and BOM_16 constants correspond to values
used at the start of a string to indicate the encoding.
49/3
{AI05-0262-1} {AI05-0269-1} Each of the Encode functions takes a String, Wide_String, or Wide_Wide_String
Item parameter that is assumed to be an array of unencoded characters.
Each of the Convert functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String
Item parameter that is assumed to contain characters whose position values
correspond to a valid encoding sequence according to the encoding scheme
required by the function or specified by its Input_Scheme parameter.
50/3
{AI05-0137-2} {AI05-0262-1} {AI05-0269-1} Each of the Convert and Encode functions returns a UTF_String, UTF_8_String,
or UTF_16_Wide_String value whose characters have position values that correspond
to the encoding of the Item parameter according to the encoding scheme
required by the function or specified by its Output_Scheme parameter.
For UTF_8, no overlong encoding is returned. A BOM is included at the
start of the returned string if the Output_BOM parameter is set to True.
The lower bound of the returned string is 1.
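The Output_BOM and lower-bound rules can be illustrated with a minimal sketch. The procedure name Output_BOM_Demo and the renaming Enc are hypothetical names chosen for this illustration; only the library declarations above are taken from this subclause.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;                      use Ada.Text_IO;

procedure Output_BOM_Demo is
   --  Enc is a local renaming used only in this sketch.
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   Source : constant String := "xxabc";

   --  The slice Source (3 .. 5) has lower bound 3; each result starts at 1.
   Plain    : constant UTF_8_String := Enc.Encode (Source (3 .. 5));
   With_BOM : constant UTF_8_String :=
     Enc.Encode (Source (3 .. 5), Output_BOM => True);
begin
   --  Both results are expected to have a lower bound of 1.
   Put_Line (Boolean'Image (Plain'First = 1 and With_BOM'First = 1));

   --  With Output_BOM => True the result is BOM_8 followed by the encoding.
   Put_Line (Boolean'Image (With_BOM = BOM_8 & Plain));

   --  Three BOM bytes plus the three bytes of "abc".
   Put_Line (Integer'Image (With_BOM'Length));
end Output_BOM_Demo;
```

The result subtype of the declaration selects between the UTF_8_String and UTF_16_Wide_String overloads of Encode, since both accept the same parameter profile.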
51/3
{AI05-0137-2} {AI05-0262-1} Each of the Decode functions takes a UTF_String, UTF_8_String, or UTF_16_Wide_String
Item parameter that is assumed to contain characters whose position
values correspond to a valid encoding sequence according to the encoding
scheme required by the function or specified by its Input_Scheme parameter,
and returns the corresponding String, Wide_String, or Wide_Wide_String
value. The lower bound of the returned string is 1.
52/3
{AI05-0137-2} {AI05-0262-1} For each of the Convert and Decode functions, an initial BOM in the input
that matches the expected encoding scheme is ignored, and a different
initial BOM causes Encoding_Error to be propagated.
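The following sketch shows both outcomes of the initial-BOM rule for a UTF-8 Decode. The procedure name Input_BOM_Demo and the renaming Enc are hypothetical; the behavior shown is what the rule above requires, not a description of any particular implementation.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;                      use Ada.Text_IO;

procedure Input_BOM_Demo is
   --  Enc is a local renaming used only in this sketch.
   package Enc renames Ada.Strings.UTF_Encoding.Strings;
begin
   --  A UTF-8 BOM in front of UTF-8 input matches the expected scheme
   --  and is ignored, so only "abc" is expected to be printed.
   Put_Line (Enc.Decode (BOM_8 & "abc"));

   --  A UTF-16BE BOM does not match the UTF-8 scheme expected by this
   --  overload of Decode, so Encoding_Error is expected.
   begin
      Put_Line (Enc.Decode (BOM_16BE & "abc"));
   exception
      when Encoding_Error =>
         Put_Line ("Encoding_Error: initial BOM does not match UTF-8");
   end;
end Input_BOM_Demo;
```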
53/3
{AI05-0137-2} The exception Encoding_Error is also propagated in the following situations:
54/3
By a Decode function when a UTF encoded string
contains an invalid encoding sequence.
55/3
By a Decode function when the expected encoding
is UTF-16BE or UTF-16LE and the input string has an odd length.
56/3
{AI05-0262-1} By a Decode function yielding a String when the decoding of a sequence
results in a code point whose value exceeds 16#FF#.
57/3
By a Decode function yielding a Wide_String when
the decoding of a sequence results in a code point whose value exceeds
16#FFFF#.
58/3
{AI05-0262-1} By an Encode function taking a Wide_String as input when an invalid character
appears in the input. In particular, the characters whose position is
in the range 16#D800# .. 16#DFFF# are invalid because they conflict with
UTF-16 surrogate encodings, and the characters whose position is 16#FFFE#
or 16#FFFF# are also invalid because they conflict with BOM codes.
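Two of the situations listed above can be sketched as follows: a valid UTF-8 sequence whose code point does not fit in a String, and an invalid UTF-8 sequence. The procedure name Encoding_Error_Demo, the helper Raises, and the renaming Enc are hypothetical names introduced for this illustration.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;                      use Ada.Text_IO;

procedure Encoding_Error_Demo is
   --  Enc is a local renaming used only in this sketch.
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  U+20AC in UTF-8 is E2 82 AC: a valid sequence, but the code point
   --  exceeds 16#FF# and so cannot be represented in a String.
   Euro : constant UTF_8_String :=
     Character'Val (16#E2#) & Character'Val (16#82#) & Character'Val (16#AC#);

   --  A lone continuation byte is not a valid UTF-8 encoding sequence.
   Bad : constant UTF_8_String := (1 => Character'Val (16#80#));

   --  Hypothetical helper: True if decoding Item to a String propagates
   --  Encoding_Error.
   function Raises (Item : UTF_8_String) return Boolean is
   begin
      declare
         Result : constant String := Enc.Decode (Item);
         pragma Unreferenced (Result);
      begin
         return False;
      end;
   exception
      when Encoding_Error =>
         return True;
   end Raises;
begin
   Put_Line ("Code point above 16#FF#: " & Boolean'Image (Raises (Euro)));
   Put_Line ("Invalid sequence:        " & Boolean'Image (Raises (Bad)));
   Put_Line ("Plain ASCII:             " & Boolean'Image (Raises ("abc")));
end Encoding_Error_Demo;
```

Under the rules above, the first two calls are expected to report True and the last False. Decoding the same Euro sequence with Strings.UTF_Encoding.Wide_Strings.Decode would not fall under the 16#FF# rule, because the code point fits in a Wide_String.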
59/3
{AI05-0137-2}
function Encoding (Item    : UTF_String;
                   Default : Encoding_Scheme := UTF_8)
   return Encoding_Scheme;
60/3
{AI05-0137-2} {AI05-0269-1} Inspects a UTF_String value to determine whether it starts with a BOM
for UTF-8, UTF-16BE, or UTF-16LE. If so, returns the scheme corresponding
to the BOM; otherwise, returns the value of Default.
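A minimal sketch of calling Encoding on inputs with and without a BOM follows. The procedure name Detect_Scheme and the sample constants are hypothetical and exist only for this illustration.

```ada
with Ada.Strings.UTF_Encoding; use Ada.Strings.UTF_Encoding;
with Ada.Text_IO;              use Ada.Text_IO;

procedure Detect_Scheme is
   --  Sample inputs built from the BOM constants declared in UTF_Encoding.
   With_BOM_8    : constant UTF_String := BOM_8 & "abc";
   With_BOM_16LE : constant UTF_String :=
     BOM_16LE & 'a' & Character'Val (0);
   No_BOM        : constant UTF_String := "abc";
begin
   --  A leading BOM selects the corresponding scheme.
   Put_Line (Encoding_Scheme'Image (Encoding (With_BOM_8)));     --  UTF_8
   Put_Line (Encoding_Scheme'Image (Encoding (With_BOM_16LE)));  --  UTF_16LE

   --  Without a BOM, the value of Default is returned.
   Put_Line (Encoding_Scheme'Image (Encoding (No_BOM)));         --  UTF_8
   Put_Line
     (Encoding_Scheme'Image (Encoding (No_BOM, Default => UTF_16BE)));
   --  UTF_16BE
end Detect_Scheme;
```

Note that Encoding only inspects the start of Item for a BOM; it does not validate the remainder of the string.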
61/3
{AI05-0137-2}
function Convert (Item          : UTF_String;
                  Input_Scheme  : Encoding_Scheme;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False)
   return UTF_String;
62/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE
as specified by Input_Scheme) encoded in one of these three schemes as
specified by Output_Scheme.
63/3
{AI05-0137-2}
function Convert (Item         : UTF_String;
                  Input_Scheme : Encoding_Scheme;
                  Output_BOM   : Boolean := False)
   return UTF_16_Wide_String;
64/3
Returns the value of Item (originally encoded in UTF-8, UTF-16LE, or UTF-16BE
as specified by Input_Scheme) encoded in UTF-16.
65/3
{AI05-0137-2}
function Convert (Item       : UTF_8_String;
                  Output_BOM : Boolean := False)
   return UTF_16_Wide_String;
66/3
Returns the value of Item (originally encoded in UTF-8) encoded in UTF-16.
67/3
{AI05-0137-2}
function Convert (Item          : UTF_16_Wide_String;
                  Output_Scheme : Encoding_Scheme;
                  Output_BOM    : Boolean := False)
   return UTF_String;
68/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8,
UTF-16LE, or UTF-16BE as specified by Output_Scheme.
69/3
{AI05-0137-2}
function Convert (Item       : UTF_16_Wide_String;
                  Output_BOM : Boolean := False)
   return UTF_8_String;
70/3
Returns the value of Item (originally encoded in UTF-16) encoded in UTF-8.
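As an illustration of the Convert functions, the sketch below converts a two-byte UTF-8 sequence to UTF-16 in a Wide_String and to UTF-16BE in a String. The procedure name Convert_Demo and the renaming Conv are hypothetical names for this example.

```ada
with Ada.Strings.UTF_Encoding;             use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;
with Ada.Text_IO;                          use Ada.Text_IO;

procedure Convert_Demo is
   --  Conv is a local renaming used only in this sketch.
   package Conv renames Ada.Strings.UTF_Encoding.Conversions;

   --  U+00E9 in UTF-8 is the two bytes C3 A9.
   E_Acute_8 : constant UTF_8_String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  UTF-8 to UTF-16 held in a Wide_String: one 16-bit code unit.
   As_Wide : constant UTF_16_Wide_String := Conv.Convert (E_Acute_8);

   --  UTF-8 to UTF-16BE held in a String of bytes, with a leading BOM:
   --  FE FF 00 E9.
   As_BE : constant UTF_String :=
     Conv.Convert (E_Acute_8,
                   Input_Scheme  => UTF_8,
                   Output_Scheme => UTF_16BE,
                   Output_BOM    => True);
begin
   Put_Line ("UTF-16 code units:" & Integer'Image (As_Wide'Length));
   Put_Line ("First code unit:  " &
             Integer'Image (Wide_Character'Pos (As_Wide (1))));
   Put_Line ("UTF-16BE bytes:   " & Integer'Image (As_BE'Length));
   Put_Line ("Starts with BOM:   " &
             Boolean'Image (As_BE (1 .. 2) = BOM_16BE));
end Convert_Demo;
```

The single UTF-16 code unit is expected to have position 16#E9# (233), and the UTF-16BE result is expected to be four bytes long because Output_BOM is True.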
71/3
{AI05-0137-2}
function Encode (Item          : String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False)
   return UTF_String;
72/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
73/3
{AI05-0137-2}
function Encode (Item       : String;
                 Output_BOM : Boolean := False)
   return UTF_8_String;
74/3
Returns the value of Item encoded in UTF-8.
75/3
{AI05-0137-2}
function Encode (Item       : String;
                 Output_BOM : Boolean := False)
   return UTF_16_Wide_String;
76/3
Returns the value of Item encoded in UTF-16.
77/3
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme)
   return String;
78/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Input_Scheme.
79/3
{AI05-0137-2}
function Decode (Item : UTF_8_String)
   return String;
80/3
Returns the result of decoding Item, which is encoded in UTF-8.
81/3
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String)
   return String;
82/3
Returns the result of decoding Item, which is encoded in UTF-16.
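A round trip through the String versions of Encode and Decode might look like the following sketch. The procedure name Latin_1_Round_Trip and the renaming Enc are hypothetical; the character is written with Character'Val so that the example does not depend on the encoding of the source file.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;                      use Ada.Text_IO;

procedure Latin_1_Round_Trip is
   --  Enc is a local renaming used only in this sketch.
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  "caf" followed by the Latin-1 character at position 16#E9#.
   Original : constant String := "caf" & Character'Val (16#E9#);

   --  The UTF-8 encoding needs two bytes (C3 A9) for position 16#E9#.
   Encoded : constant UTF_8_String := Enc.Encode (Original);

   --  Decoding the UTF-8 form yields the unencoded characters again.
   Decoded : constant String := Enc.Decode (Encoded);
begin
   Put_Line ("Original length:" & Integer'Image (Original'Length));
   Put_Line ("Encoded length: " & Integer'Image (Encoded'Length));
   Put_Line ("Round trip equal: " & Boolean'Image (Decoded = Original));
end Latin_1_Round_Trip;
```

The encoded form is expected to be one byte longer than the original (5 versus 4), which is a reminder that the length of a UTF_8_String counts bytes rather than characters.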
83/3
{AI05-0137-2}
function Encode (Item          : Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False)
   return UTF_String;
84/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
85/3
{AI05-0137-2}
function Encode (Item       : Wide_String;
                 Output_BOM : Boolean := False)
   return UTF_8_String;
86/3
Returns the value of Item encoded in UTF-8.
87/3
{AI05-0137-2}
function Encode (Item       : Wide_String;
                 Output_BOM : Boolean := False)
   return UTF_16_Wide_String;
88/3
Returns the value of Item encoded in UTF-16.
89/3
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme)
   return Wide_String;
90/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Input_Scheme.
91/3
{AI05-0137-2}
function Decode (Item : UTF_8_String)
   return Wide_String;
92/3
Returns the result of decoding Item, which is encoded in UTF-8.
93/3
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String)
   return Wide_String;
94/3
Returns the result of decoding Item, which is encoded in UTF-16.
95/3
{AI05-0137-2}
function Encode (Item          : Wide_Wide_String;
                 Output_Scheme : Encoding_Scheme;
                 Output_BOM    : Boolean := False)
   return UTF_String;
96/3
{AI05-0262-1} Returns the value of Item encoded in UTF-8, UTF-16LE, or UTF-16BE as
specified by Output_Scheme.
97/3
{AI05-0137-2}
function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean := False)
   return UTF_8_String;
98/3
Returns the value of Item encoded in UTF-8.
99/3
{AI05-0137-2}
function Encode (Item       : Wide_Wide_String;
                 Output_BOM : Boolean := False)
   return UTF_16_Wide_String;
100/3
Returns the value of Item encoded in UTF-16.
101/3
{AI05-0137-2}
function Decode (Item         : UTF_String;
                 Input_Scheme : Encoding_Scheme)
   return Wide_Wide_String;
102/3
Returns the result of decoding Item, which is encoded in UTF-8, UTF-16LE,
or UTF-16BE as specified by Input_Scheme.
103/3
{AI05-0137-2}
function Decode (Item : UTF_8_String)
   return Wide_Wide_String;
104/3
Returns the result of decoding Item, which is encoded in UTF-8.
105/3
{AI05-0137-2}
function Decode (Item : UTF_16_Wide_String)
   return Wide_Wide_String;
106/3
Returns the result of decoding Item, which is encoded in UTF-16.
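The Wide_Wide_String versions cover code points outside the 16-bit range. The sketch below encodes one such code point in UTF-8 and UTF-16 and decodes it again. The procedure name Supplementary_Demo and the renaming WW are hypothetical names for this illustration.

```ada
with Ada.Strings.UTF_Encoding;                   use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Text_IO;                                use Ada.Text_IO;

procedure Supplementary_Demo is
   --  WW is a local renaming used only in this sketch.
   package WW renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  A single character at code position 16#1F600#, beyond 16#FFFF#.
   Item : constant Wide_Wide_String :=
     (1 => Wide_Wide_Character'Val (16#1F600#));

   --  UTF-8 needs four bytes for this code point.
   As_8 : constant UTF_8_String := WW.Encode (Item);

   --  UTF-16 needs a surrogate pair: 16#D83D# followed by 16#DE00#.
   As_16 : constant UTF_16_Wide_String := WW.Encode (Item);

   --  Decoding the UTF-16 form yields the single original character.
   Back : constant Wide_Wide_String := WW.Decode (As_16);
begin
   Put_Line ("UTF-8 bytes:      " & Integer'Image (As_8'Length));
   Put_Line ("UTF-16 code units:" & Integer'Image (As_16'Length));
   Put_Line ("High surrogate:   " &
             Integer'Image (Wide_Character'Pos (As_16 (1))));
   Put_Line ("Round trip equal:  " & Boolean'Image (Back = Item));
end Supplementary_Demo;
```

The same UTF-16 value could not be decoded with the Wide_String version of Decode, because the resulting code point exceeds 16#FFFF#.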
Implementation Advice
107/3
{AI05-0137-2} If an implementation supports other encoding schemes, another similar
child of Ada.Strings should be defined.
107.a.1/3
Implementation Advice: If an implementation
supports other string encoding schemes, a child of Ada.Strings similar
to UTF_Encoding should be defined.
108/3
18 {AI05-0137-2} A BOM (Byte-Order Mark, code position 16#FEFF#) can be included in a
file or other entity to indicate the encoding; it is skipped when decoding.
Typically, only the first line of a file or other entity contains a BOM.
When decoding, the Encoding function can be called on the first line
to determine the encoding; this encoding will then be used in subsequent
calls to Decode to convert all of the lines to an internal format.
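The usage described in this note can be sketched as follows. The two line constants stand in for hypothetical file content that has already been split into lines; the procedure name Decode_Lines and the renaming Enc are likewise assumptions made for this illustration.

```ada
with Ada.Strings.UTF_Encoding;         use Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Strings;
with Ada.Text_IO;                      use Ada.Text_IO;

procedure Decode_Lines is
   --  Enc is a local renaming used only in this sketch.
   package Enc renames Ada.Strings.UTF_Encoding.Strings;

   --  Hypothetical file content: only the first line carries a BOM.
   Line_1 : constant UTF_String := BOM_8 & "first";
   Line_2 : constant UTF_String := "second";

   --  Determine the encoding once, from the first line.
   Scheme : constant Encoding_Scheme := Encoding (Line_1);
begin
   Put_Line (Encoding_Scheme'Image (Scheme));

   --  The BOM on the first line matches Scheme and is skipped by Decode.
   Put_Line (Enc.Decode (Line_1, Scheme));

   --  Later lines have no BOM and are decoded with the same scheme.
   Put_Line (Enc.Decode (Line_2, Scheme));
end Decode_Lines;
```

If the first line had no BOM, Encoding would return its Default parameter, so a caller that expects something other than UTF-8 would pass that scheme as Default.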
Extensions to Ada 2005
108.a/3
{AI05-0137-2} The packages Strings.UTF_Encoding, Strings.UTF_Encoding.Conversions,
Strings.UTF_Encoding.Strings, Strings.UTF_Encoding.Wide_Strings, and
Strings.UTF_Encoding.Wide_Wide_Strings are new.
Ada 2005 and 2012 Editions sponsored in part by Ada-Europe