In Unicode
See primary documentation in context for UTF8-C8.
UTF-8 Clean-8 is an encoder/decoder that, in the normal case, works just like the UTF-8 one. However, upon encountering a byte sequence that either will not decode as valid UTF-8, or that would not round-trip due to normalization, it uses NFG synthetics to keep track of the original bytes involved. This means that encoding back with UTF-8 Clean-8 can recreate the bytes exactly as they originally existed. Each synthetic contains four codepoints:
The codepoint 0x10FFFD (which is a private use codepoint)
The codepoint 'x'
The upper 4 bits of the non-decodable byte as a hex char (0..9A..F)
The lower 4 bits of the non-decodable byte as a hex char (0..9A..F)
Under normal UTF-8 encoding, this means the unrepresentable characters will come out as something like ?xFF.
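The four-codepoint scheme described above can be modeled outside of Raku. The following is a minimal Python sketch of how one non-decodable byte maps to its synthetic stand-in; the function name `synthetic_for` is a hypothetical illustration, not MoarVM's actual implementation:

```python
# Illustrative model of the UTF8-C8 synthetic scheme (NOT MoarVM's code).
# A byte that cannot be decoded as UTF-8 is represented by four codepoints:
# U+10FFFD (a private-use codepoint), 'x', and the byte's two hex nibbles.

def synthetic_for(byte: int) -> str:
    """Return the four-codepoint stand-in for one non-decodable byte."""
    hexdigits = "0123456789ABCDEF"
    hi = hexdigits[byte >> 4]    # upper 4 bits as a hex char
    lo = hexdigits[byte & 0xF]   # lower 4 bits as a hex char
    return "\U0010FFFD" + "x" + hi + lo

# For byte 0xFF the visible part of the synthetic reads "xFF", which is
# why a re-encoded string displays as something like ?xFF.
print(synthetic_for(0xFF))
```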
UTF-8 Clean-8 is used in places where MoarVM receives strings from the environment, command line arguments, and filesystem queries; for instance when decoding buffers:
say Buf.new(ord('A'), 0xFE, ord('Z')).decode('utf8-c8'); # OUTPUT: «A􏿽xFEZ␤»
You can see how the two initial codepoints used by UTF8-C8 show up in the output right before the 'FE'. You can use this type of encoding to read files with unknown encoding:
my $test-file = "/tmp/test";
given open($test-file, :w, :bin) {
    .write: Buf.new(ord('A'), 0xFA, ord('B'), 0xFB, 0xFC, ord('C'), 0xFD);
    .close;
}
say slurp($test-file, enc => 'utf8-c8');
# OUTPUT: «(65 250 66 251 252 67 253)␤»
Reading with this type of encoding and then encoding back to UTF8-C8 gives you back the original bytes; this would not have been possible with the default UTF-8 encoding.
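The round-trip property can be sketched end to end. Below is a simplified Python model of the decode/encode cycle, assuming the four-codepoint synthetic layout described earlier; the helper names `c8_decode` and `c8_encode` are hypothetical and this is not how MoarVM implements it (for one thing, real UTF8-C8 also handles normalization round-trip issues, which this sketch ignores):

```python
# Simplified model of the UTF8-C8 round trip (NOT MoarVM's implementation).

def c8_decode(data: bytes) -> str:
    """Decode valid UTF-8 where possible; replace each bad byte
    with the four-codepoint synthetic U+10FFFD, 'x', hi, lo."""
    out = []
    i = 0
    while i < len(data):
        for j in (4, 3, 2, 1):           # try longest slice first
            if i + j > len(data):
                continue
            try:
                out.append(data[i:i + j].decode("utf-8"))
                i += j
                break
            except UnicodeDecodeError:
                continue
        else:                            # no slice decoded: synthesize
            out.append("\U0010FFFDx" + format(data[i], "02X"))
            i += 1
    return "".join(out)

def c8_encode(text: str) -> bytes:
    """Turn synthetics back into their original bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\U0010FFFD" and text[i + 1] == "x":
            out.append(int(text[i + 2:i + 4], 16))  # recover the raw byte
            i += 4
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)

raw = bytes([0x41, 0xFA, 0x42])          # 'A', bad byte, 'B'
assert c8_encode(c8_decode(raw)) == raw  # original bytes are preserved
```

A plain `raw.decode("utf-8")` would instead raise an error (or, with a replacement character, lose the 0xFA byte), which is exactly what UTF8-C8 avoids.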
Please note that this encoding is currently not supported in the JVM implementation of Rakudo.