In Unicode

See primary documentation in context for UTF8-C8.

UTF-8 Clean-8 is an encoder/decoder that mostly works like the regular UTF-8 one. However, upon encountering a byte sequence that either does not decode as valid UTF-8, or that would not round-trip due to normalization, it uses NFG synthetics to keep track of the original bytes involved. This means that encoding back to UTF-8 Clean-8 recreates the bytes as they originally existed. Each synthetic contains four codepoints:

  • The codepoint 0x10FFFD (which is a private use codepoint)

  • The codepoint 'x'

  • The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)

  • The lower 4 bits of the non-decodable byte as a hex char (0..9A..F)

Under normal UTF-8 encoding, this means the unrepresentable bytes will come out as something like ?xFF, where ? stands for the private use codepoint 0x10FFFD.
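The four codepoints listed above can be worked out by hand for any single byte. The following is a minimal sketch of that arithmetic only; `c8-codepoints` is a hypothetical helper written for illustration, not a Raku built-in, and it does not call the UTF8-C8 decoder itself:

```raku
# Hypothetical helper (not part of Raku): compute the four codepoints
# described above for one non-decodable byte.
sub c8-codepoints(Int $byte where 0..255) {
    my $hi = ($byte +> 4).base(16);     # upper 4 bits as a hex char
    my $lo = ($byte +& 0x0F).base(16);  # lower 4 bits as a hex char
    (0x10FFFD, 'x'.ord, $hi.ord, $lo.ord)
}

say c8-codepoints(0xFE);
# OUTPUT: «(1114109 120 70 69)␤»
```

Here 1114109 is 0x10FFFD, 120 is 'x', and 70 and 69 are the characters 'F' and 'E'.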

UTF-8 Clean-8 is used in places where MoarVM receives strings from the environment, command line arguments, and filesystem queries; for instance when decoding buffers:

say Buf.new(ord('A'), 0xFE, ord('Z')).decode('utf8-c8');
#  OUTPUT: «A􏿽xFEZ␤»

You can see how the two initial codepoints used by UTF8-C8 show up in the output above, right before the 'FE'. You can use this type of encoding to read files with unknown encoding:

my $test-file = "/tmp/test";
given open($test-file, :w, :bin) {
  .write: Buf.new(ord('A'), 0xFA, ord('B'), 0xFB, 0xFC, ord('C'), 0xFD);
  .close;
}

# Decode the file as UTF8-C8, then encode back to get the original bytes
say slurp($test-file, enc => 'utf8-c8').encode('utf8-c8').list;
# OUTPUT: «(65 250 66 251 252 67 253)␤»

Reading a file with this encoding and then encoding the result back to UTF8-C8 gives you back the original bytes; this would not have been possible with the default UTF-8 encoding.
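The contrast with strict UTF-8 can be sketched without touching the filesystem. This example assumes Rakudo on MoarVM, reuses the byte values from the file example, and uses variable names chosen only for illustration:

```raku
# Assumption: Rakudo on MoarVM, where 'utf8-c8' is available.
my $bytes = Buf.new(ord('A'), 0xFA, ord('B'), 0xFB, 0xFC, ord('C'), 0xFD);

# Strict UTF-8 rejects the invalid bytes; `try` turns the exception into Nil.
my $strict = try $bytes.decode('utf8');
say $strict.defined;
# OUTPUT: «False␤»

# UTF8-C8 decodes the same bytes and encodes them back again.
my $round-trip = $bytes.decode('utf8-c8').encode('utf8-c8');
say $round-trip.list.join(' ') eq $bytes.list.join(' ');
```

Given the round-trip guarantee described above, the last line should print True.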

Note that this encoding is not yet supported in the JVM implementation of Rakudo.