C++ does not guarantee ASCII encoding of ASCII characters

Many programmers assume the C++ standard guarantees that ASCII characters are ASCII-encoded. After all, every program they have written that relies on this assumption has behaved as expected.

Such programmers may have implemented a function to classify whether a character is a decimal digit like so:

c++

bool is_digit(char c) {
    return '0' <= c && c <= '9';
}

Fortunately, the standard guarantees that the decimal digit sequence (0, 1, …, 9) satisfies the following property [1]:

“Each character in the sequence, except for the first character, has a value greater than that of the previous character.”

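As a result, the function is_digit is correct on every system with a standards-conforming C++ compiler. Strictly speaking, the range check also needs the digits to be contiguous, and the standard provides that too: the value of each digit after 0 is exactly one greater than the value of the previous digit. In other words, is_digit is portably correct.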
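The same guarantee is what makes the familiar `c - '0'` idiom portable. Here is a minimal sketch, assuming a hypothetical helper named digit_value that is not part of the functions above:

```cpp
#include <optional>

// Hypothetical helper: returns the numeric value of a decimal digit
// character, or std::nullopt if c is not a decimal digit.
std::optional<int> digit_value(char c) {
    if ('0' <= c && c <= '9') {
        // Portable: the standard requires '0'..'9' to have consecutive,
        // increasing values, so the difference is the digit's value.
        return c - '0';
    }
    return std::nullopt;
}
```

Returning std::optional (C++17) is only one way to signal "not a digit"; a sentinel value or an exception would work just as well.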

If is_digit is portably correct, it would be natural to expect the analogous functions is_lowercase_letter and is_uppercase_letter, which check whether a character is a lowercase letter or an uppercase letter, respectively, to be portably correct as well:

c++

bool is_lowercase_letter(char c) {
    return 'a' <= c && c <= 'z';
}

bool is_uppercase_letter(char c) {
    return 'A' <= c && c <= 'Z';
}

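Unfortunately, is_lowercase_letter and is_uppercase_letter are not portably correct. On EBCDIC systems, for example, the letters of each case are encoded in several non-contiguous blocks, so some non-letter characters have values between those of 'a' and 'z'. The standard does not guarantee that the property holds for the sequence of lowercase letters (a, b, …, z) or the sequence of uppercase letters (A, B, …, Z). The standard guarantees that the property holds only for the following sequence [1]: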

0123456789_{}[]#()<>%:;.?*+-/^&|~!=,\"

Note the decimal digit sequence is a subsequence of that sequence.

However, most systems use character encodings in which the property also holds for the lowercase and uppercase letter sequences. Such encodings include those compatible with ISO/IEC 646 and Unicode. On such systems, the functions above behave the way most programmers intend, even though the C++ standard does not guarantee that behavior.
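If you do decide to rely on range checks for letters, one option is to state the assumption in the code so that the build fails on a system where it is false. The following is a sketch only; is_contiguous and letters_are_contiguous are hypothetical names introduced for illustration, and the code needs C++14 or later for the loop inside a constexpr function:

```cpp
// Hypothetical helper: true if every character in the null-terminated
// string s has a value exactly one greater than the character before it.
constexpr bool is_contiguous(const char* s) {
    if (*s == '\0') return true;  // an empty string is trivially contiguous
    for (; s[1] != '\0'; ++s) {
        if (s[1] - s[0] != 1) return false;
    }
    return true;
}

// Evaluated at compile time against the encoding used for ordinary
// character and string literals.
constexpr bool letters_are_contiguous =
    is_contiguous("abcdefghijklmnopqrstuvwxyz") &&
    is_contiguous("ABCDEFGHIJKLMNOPQRSTUVWXYZ");

// Stops the build on an encoding where 'a' <= c && c <= 'z' would also
// accept characters that are not letters.
static_assert(letters_are_contiguous,
              "range checks on letters assume a contiguous alphabet");
```

Note that the check covers the encoding the compiler uses for literals; it says nothing about the encoding of text the program reads at run time.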

While you will probably never have to program a machine where the property doesn’t hold for the lowercase and uppercase letter sequences, I still believe you should care until it becomes impractical to do so. For character classification and conversion, I suggest using the standard localization library defined in the header <locale>. For converting between variables of type std::string and int, I suggest using the standard string library’s numeric conversion functions, such as std::stoi and std::to_string.
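As a rough illustration of those suggestions, the sketch below wraps the classification and conversion templates from <locale> and round-trips an int through std::string. The wrapper names (is_lower_portable, is_upper_portable, to_upper_portable, round_trip) are hypothetical, and the choice of std::locale::classic() (the "C" locale) is an assumption; a different locale may classify additional characters as letters:

```cpp
#include <locale>
#include <string>

// Hypothetical wrappers around the <locale> function templates.
// Classification is delegated to the locale's ctype facet instead of
// relying on the numeric ordering of character values.
bool is_lower_portable(char c) {
    return std::islower(c, std::locale::classic());
}

bool is_upper_portable(char c) {
    return std::isupper(c, std::locale::classic());
}

char to_upper_portable(char c) {
    return std::toupper(c, std::locale::classic());
}

// Converts an int to a std::string and back using the standard string
// library's numeric conversion functions.
int round_trip(int n) {
    return std::stoi(std::to_string(n));
}
```

Keep in mind that std::stoi throws std::invalid_argument when no conversion can be performed and std::out_of_range when the value does not fit in an int, so callers handling untrusted input should be prepared to catch both.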

Footnotes


  1. For more details about character sets and encodings in C++, visit this reference page.