Viewing internal representation of strings in Perl

This article explains how to debug encoding problems that arise when Perl 5.8.x misinterprets the byte sequence stored as the internal representation of a string variable. Such problems typically show up as invalid characters being printed or stored to a database. The official Perl utf8 documentation explains how Perl handles string values behind the scenes, and experts advise disregarding the internal representation and simply trusting Perl to do the right thing (see also these recommendations on dealing with Unicode in Perl). That advice is sound as early prevention, but it falls short when you have to diagnose real-life problems.

Here is what you need to know:

  • The value of every string variable is internally stored by Perl in a data structure with two fields: a sequence of bytes (called the "internal representation"), and a flag (called the "utf8 flag") indicating how the internal representation should be interpreted by functions that deal with characters (such as print, substr, and length).
  • If the flag is set, the internal representation is interpreted as a UTF-8 encoded string. If the flag is not set, the same byte sequence is interpreted as an ISO-8859-1 encoded (Latin1) string.
  • Problems arise when the flag for some reason does not match the actual encoding used in the internal representation.
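
The effect of the flag on character functions can be seen with length. The following is a minimal sketch, not part of the original sample: it assumes that Encode::decode is an acceptable way to obtain a flagged string, and the variable names are only illustrative. Both variables hold the same two bytes internally, yet they are counted differently.

```perl
use strict;
use warnings;
use Encode ();

# Two bytes, flag off: interpreted as two Latin-1 characters.
my $bytes = "\xC3\xA4";

# Same two bytes internally, flag on: interpreted as one character (U+00E4).
my $chars = Encode::decode('UTF-8', $bytes);

printf "flag=%d length=%d\n", utf8::is_utf8($bytes) ? 1 : 0, length $bytes;
printf "flag=%d length=%d\n", utf8::is_utf8($chars) ? 1 : 0, length $chars;
```

The first line reports a length of 2 and the second a length of 1, although the stored bytes are identical.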

The official documentation explains how to determine the flag's current value: utf8::is_utf8($string) or, if you want to remain compatible with Perl 5.8.0, Encode::is_utf8($string). But it does not reveal how to view the internal representation. There are several ways that do not work, such as Data::Dumper and a straightforward unpack('C*', $string). The solution is a combination of use bytes and unpack, as illustrated by the following sample code:

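use strict;
use warnings;
use Encode;

# "\xe4" is the single Latin-1 byte for "ä". Using an escape instead of a
# literal keeps the result independent of the source file's encoding.
my $hash = { key1 => "\xe4", key2 => "\xe4" };

# Convert the internal representation from latin1 to utf8.
# This also turns on the utf8 flag on the value:
utf8::upgrade($hash->{key2});

print "str1: " . hexdump($hash->{key1}) . "\n";
print "str2: " . hexdump($hash->{key2}) . "\n";

# The two strings still compare equal: eq compares characters, not bytes.
if ($hash->{key1} eq $hash->{key2})
{
    print "str1 eq str2\n";
}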

# Note that the output of Data::Dumper is not helpful:
use Data::Dumper;
print Dumper($hash);

# For a given string parameter, returns a string which shows
# whether the utf8 flag is enabled and a byte-by-byte view
# of the internal representation.
#
sub hexdump
{
    my $str = shift;
    my $flag = Encode::is_utf8($str) ? 1 : 0;
    use bytes; # this tells unpack to deal with raw bytes
    my @internal_rep_bytes = unpack('C*', $str);
    return
        $flag
        . '('
        . join(' ', map { sprintf("%02x", $_) } @internal_rep_bytes)
        . ')';
}

The Unicode character U+00E4 "latin small letter a with diaeresis", also known as "a umlaut" (ä), is encoded as 0xE4 in ISO-8859-1 and as 0xC3 0xA4 in UTF-8. Note that Data::Dumper displays the Unicode code point "\x{e4}" (which coincides with the ISO-8859-1 byte value) rather than the internal byte representation. The program's output is shown below:

str1: 0(e4)
str2: 1(c3 a4)
str1 eq str2
$VAR1 = {
          'key1' => 'ä',
          'key2' => "\x{e4}"
        };
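
As a cross-check, the core module Devel::Peek can print Perl's own view of a scalar. This is a sketch added for illustration and is not part of the original sample; the exact output format differs between Perl versions, so it is not reproduced here.

```perl
use strict;
use warnings;
use Devel::Peek ();

my $str = "\xe4";
utf8::upgrade($str);    # internal bytes become c3 a4, utf8 flag is set

# Dump() writes to STDERR. Look for "UTF8" in the FLAGS line and at the
# PV line, which typically shows the raw bytes as octal escapes.
Devel::Peek::Dump($str);
```

Compared with the hexdump helper, this needs no custom code, but the output is more verbose and harder to embed in log messages.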

Once you determine that the utf8 flag is set incorrectly (mismatching the internal representation) your next questions are surely going to be:

  1. Why is the flag set the wrong way?
  2. How do I fix the flag - without also modifying the bytes of the internal representation?

Start with the first question by examining your code (and very likely also the code of the modules it uses). The second question essentially asks for a "quick and dirty" workaround, given that the regular functions provided by the utf8 and Encode modules alter the internal representation as their intended effect. Nothing is impossible in Perl, however:

Unset the utf8 flag:
{ use bytes; $str = pack('C*', unpack('C*', $str)); }
Set the utf8 flag:
$str = pack("U0C*", unpack("C*", $str));
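
As an alternative to the pack idioms, and not the approach described above, the Encode module ships two functions that its documentation labels as internal: Encode::_utf8_on and Encode::_utf8_off. They flip the flag in place and leave the bytes alone. The sketch below assumes they are acceptable for debugging use only; raw_hex is a hypothetical helper introduced here to confirm that the bytes stay unchanged.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: hex view of the internal bytes of a string.
sub raw_hex {
    my ($str) = @_;
    use bytes;    # make unpack look at the raw bytes
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $str;
}

my $str    = "\xC3\xA4";      # UTF-8 bytes for U+00E4, flag off
my $before = raw_hex($str);

Encode::_utf8_on($str);       # set the flag, bytes untouched
my $on_len = length $str;     # now counted as one character
my $on_hex = raw_hex($str);

Encode::_utf8_off($str);      # clear the flag again
my $off_len = length $str;    # counted as two characters again

print "before: $before, flagged: $on_hex (length $on_len), unflagged length: $off_len\n";
```

Setting the flag on bytes that are not valid UTF-8 produces a malformed string, so check the bytes first.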

Checking that the internal representation and the value of the utf8 flag are synchronized is the basic first step in troubleshooting encoding problems in Perl. It ensures the consistency of data manipulated in memory. New pitfalls lie where the data has to leave the process, being subject to character encoding conversions in input/output operations (e.g. communicating with a database). However, that's another can of worms...
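
For the boundary itself, a minimal sketch of the usual discipline is to encode explicitly on the way out and decode explicitly on the way in. It assumes UTF-8 is the external encoding, which is an assumption made for this example and not something the text above prescribes.

```perl
use strict;
use warnings;
use Encode ();

my $text = "\x{e4}";                               # one character, U+00E4

# Leaving the process: turn characters into a defined byte encoding.
my $out_bytes = Encode::encode('UTF-8', $text);    # bytes c3 a4, flag off

# Entering the process: turn the received bytes back into characters.
my $back = Encode::decode('UTF-8', $out_bytes);

printf "sent %d bytes, got back %d character(s)\n",
    length($out_bytes), length($back);
```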
