| Bytes | Lang | Time | Link |
|---|---|---|---|
| 129 | Bash | 250420T214212Z | Explorer |
| 83 | Charcoal | 250424T091905Z | Neil |
| 54 | x86-64 machine code | 250422T140443Z | m90 |
| 139 | Python 3.8 | 250422T155852Z | CrSb0001 |
| 123 | C (gcc) | 250422T010335Z | l4m2 |
| 74 | 05AB1E | 250422T095113Z | Kevin Cr |
| 143 | C (gcc) | 250421T075343Z | ceilingc |
Bash, ~~131~~ 129 bytes
u()(printf %X${1:9} $((v=0x${1:3:3}+128,(v&7228)*3+v+985096)))
c()(printf ED%XEDB${1:5} $((v=0x${1:1:4},(v&771|v/4&7228)+40832)))
Zsh (no C_PRECEDENCES), 122 bytes
u()(printf %X${1:9} $[v=0x${1:3:3}+128,v&7228*3+v+985096])
c()(printf ED%XEDB${1:5} $[v=0x${1:1:4},v&771|v>>2&7228+40832])
This entry was submitted by the author of this challenge and serves as an example.
Numbers are processed in big endian. (It doesn't make sense to use little endian for this language.)
u converts a CESU-8 sequence to UTF-8. The input, passed as argument $1, is formatted in 12 hex digits without the 0x prefix (like EDAFABEDB387).
c converts a UTF-8 sequence to CESU-8. The input, passed as argument $1, is formatted in 8 hex digits without the 0x prefix (like F48AB387).
The Zsh version assumes the C_PRECEDENCES option is off (unsetopt C_PRECEDENCES) before calling the functions.
A slightly deobfuscated version:
cesu8_to_utf8 () (
printf "%X${1:9}" \
$((v = (0x${1:3:3}) + 0x80, \
(v & 0x1C3C) * 3 + v + 0xF0808))
)
utf8_to_cesu8 () (
printf "ED%XEDB${1:5}" \
$((v = (0x${1:1:4}), \
((v & 0x303) | ((v / 4) & 0x1C3C)) + 0x9F80))
)
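For reference, the same conversion can be written out directly from the surrogate-pair definitions. This is an illustrative Python sketch (not the author's code), using the same hexadecimal I/O convention and the example pair from above:

```python
def cesu8_to_utf8(h):
    b = bytes.fromhex(h)
    # decode the two 3-byte surrogate encodings
    hi = (b[0] & 15) << 12 | (b[1] & 63) << 6 | (b[2] & 63)
    lo = (b[3] & 15) << 12 | (b[4] & 63) << 6 | (b[5] & 63)
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    # re-encode the code point as 4-byte UTF-8
    return bytes((0xF0 | cp >> 18, 0x80 | cp >> 12 & 63,
                  0x80 | cp >> 6 & 63, 0x80 | cp & 63)).hex().upper()

def utf8_to_cesu8(h):
    b = bytes.fromhex(h)
    cp = (b[0] & 7) << 18 | (b[1] & 63) << 12 | (b[2] & 63) << 6 | (b[3] & 63)
    hi = 0xD800 + (cp - 0x10000 >> 10)   # high surrogate
    lo = 0xDC00 + (cp & 1023)            # low surrogate
    out = []
    for s in hi, lo:                     # 3-byte encoding of each surrogate
        out += [0xE0 | s >> 12, 0x80 | s >> 6 & 63, 0x80 | s & 63]
    return bytes(out).hex().upper()

assert cesu8_to_utf8("EDAFABEDB387") == "F48AB387"
assert utf8_to_cesu8("F48AB387") == "EDAFABEDB387"
```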
Charcoal, ~~107~~ ~~103~~ 83 bytes
≔⍘⁺⍘↧θ¹⁶⎇÷Lθχ׳³X⁴¦¹⁶±×⁹X⁴χ⁴η↥⍘⍘⭆⎇÷Lθχ”C↘⁼B⁷HÀ-↙y&↥””E⌈∨→ln‴=R⸿⊕À⊖G)<℅”∨Σι§η⌕βι⁴¦¹⁶
Try it online! Link is to verbose version of code. I/O is a hexadecimal string (big-endian). Auto-detects the conversion based on the length of the string. Explanation:
≔⍘⁺⍘↧θ¹⁶⎇÷Lθχ׳³X⁴¦¹⁶±×⁹X⁴χ⁴η
Convert the input from hexadecimal, bias it by either adding 2³²+2³⁷ or subtracting 2²⁰+2²³, then convert it to base 4.
↥⍘⍘⭆⎇÷Lθχ”C↘⁼B⁷HÀ-↙y&↥””E⌈∨→ln‴=R⸿⊕À⊖G)<℅”∨Σι§η⌕βι⁴¦¹⁶
Extract the appropriate base 4 digits into one of two compressed template strings, then convert the result from base 4 into base 16.
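Base 4 is convenient here because every byte is exactly four base-4 digits, and a UTF-8 continuation byte `0b10xxxxxx` always begins with digit 2, so its six payload bits are its last three base-4 digits. A small Python illustration of that property (my own, not the Charcoal code):

```python
def base4(n):
    # most-significant-first base-4 digits of n
    d = []
    while n:
        d.append(n & 3)
        n >>= 2
    return d[::-1]

# 0xB3 = 0b10_110011: leading digit 2 marks a continuation byte,
# and the remaining digits 3, 0, 3 carry the six payload bits.
assert base4(0xB3) == [2, 3, 0, 3]
```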
Previous 103-byte answer:
≔⪪⎇‹θ240”←″LK�⁺TRI⁴⁵””#⌈“«h⁸⎇¦↘4∕U▶³∨Gº▶DB>” η≔⁰ζFE⊟ηX²Iι≔⁺×ζι﹪Nιζ≧⁺I⊟ηζFE⊟ηX²Iι«⊞υ⁺⁺¹²⁸I⊟η﹪ζι≧÷ιζ»I⮌υS
Try it online! Link is to verbose version of code. I/O is a list of decimal bytes. Auto-detects the conversion based on the first byte. Explanation: Similar to my previous 107-byte answer below but doesn't process the last byte since it's always the same.
Previous 107-byte answer:
≔⪪⎇‹θ240”←″ET²G³|⌈⁼₂a≡””#⌈∨⊘↘yL×↔FY▶.πY⦃✂F¦‹$?ZI” η≔⁰ζFE⊟ηX²Iι≔⁺×ζι﹪Nιζ≧⁺I⊟ηζFE⊟ηX²Iι«⊞υ⁺⁺¹²⁸I⊟η﹪ζι≧÷ιζ»I⮌υ
Try it online! Link is to verbose version of code. I/O is a list of decimal bytes. Auto-detects the conversion based on the first byte. Explanation:
≔⪪⎇‹θ240”←″ET²G³|⌈⁼₂a≡””#⌈∨⊘↘yL×↔FY▶.πY⦃✂F¦‹$?ZI” η
Select the operations to perform from two compressed strings based on the first byte. The uncompressed string data contains a) a list of 128-offset output byte offsets b) a string of output byte bit counts c) the offset between the two encodings d) a string of input byte bit counts.
≔⁰ζ
Start with a zero code point.
FE⊟ηX²Iι≔⁺×ζι﹪Nιζ
For each of the input bytes, extract the appropriate number of bits and update the running total. (For UTF, the final running total is the code point.)
≧⁺I⊟ηζ
Offset the value to map between the encodings. (For CESU to UTF, this means that the value now has the right code point.)
FE⊟ηX²Iι«⊞υ⁺⁺¹²⁸I⊟η﹪ζι≧÷ιζ»
For each output byte, extract the desired number of bits, and add on the desired offset.
I⮌υ
The above loop extracts the bits in reverse order, so correct this for the final output.
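The decode/re-encode loop structure described above can be sketched in Python. The bit counts and offsets below are my own illustration for the 4-byte UTF-8 case; in the Charcoal answer they come from the compressed string data:

```python
def decode(bytes_, bits):
    v = 0
    for b, k in zip(bytes_, bits):      # take the low k bits of each byte
        v = v << k | b & (1 << k) - 1   # and accumulate the running total
    return v

def encode(v, bits, offsets):
    out = []
    for k, o in zip(bits, offsets):     # extract low bits first...
        out.append(128 + o + (v & (1 << k) - 1))
        v >>= k
    return out[::-1]                    # ...so reverse for the final output

cp = decode([0xF4, 0x8A, 0xB3, 0x87], [3, 6, 6, 6])        # code point
assert cp == 0x10ACC7
assert encode(cp, [6, 6, 6, 3], [0, 0, 0, 112]) == [0xF4, 0x8A, 0xB3, 0x87]
```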
I haven't looked at any of the other answers to see what I/O formats they use, so this may not be the best format.
x86-64 machine code, ~~58~~ 54 bytes
A8 F9 B8 3F 3F 3F 07 48 B9 3F 0F 00 3F BF ED 00 00 BE 80 B0 ED 80 BA 00 00 DF 1F 72 09 48 91 BE 80 80 80 F0 F7 DA C4 E2 C2 F5 F8 01 D7 C4 E2 C3 F5 C1 48 09 F0 C3
This uses big-endian integers.
Following the standard calling convention for Unix-like systems (from the System V AMD64 ABI), this takes a 64-bit integer in RDI and returns a 64-bit integer in RAX.
The CESU-8 to UTF-8 function starts at the beginning, and the UTF-8 to CESU-8 function starts after one byte.
In assembly:
cesu8_to_utf8:
.byte 0xA8 # This combines with the following instruction to form
# 'test al, 0xF9', which sets CF to 0 among other things.
utf8_to_cesu8:
stc # Set CF to 1.
mov eax, 0x073F3F3F # Mask of variable bits in UTF-8.
mov rcx, 0xEDBF3F000F3F # Mask of variable bits in CESU-8, and also
# the first eight 1 bits and one 0 bit.
mov esi, 0x80EDB080 # Fixed 1 bits in CESU-8 except the first eight.
# (With those removed, this fits in 32 bits.)
mov edx, 0x1FDF0000 # Adjustment value to add to the codepoint
# to add eight 1 bits and subtract 2¹⁶.
jc s # Jump if CF=1 to skip some code.
xchg rax, rcx # Swap the masks.
mov esi, 0xF0808080 # Fixed 1 bits in UTF-8.
neg edx # Negate the adjustment value in EDX.
s: pext rdi, rdi, rax # Extract the variable bits (maybe extra) from RDI.
add edi, edx # Add the adjustment value to the variable bits.
pdep rax, rdi, rcx # Distribute the bits using the other mask (into RAX).
or rax, rsi # Combine the result with the fixed 1 bits.
ret # Return.
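The pext/pdep pipeline can be checked with a small Python model (my own emulation of the BMI2 instructions, using the masks and adjustment value from the listing):

```python
# Software models of the x86 BMI2 instructions PEXT (parallel bit
# extract) and PDEP (parallel bit deposit).
def pext(x, mask):
    out, bit = 0, 0
    while mask:
        low = mask & -mask          # isolate lowest set bit of mask
        if x & low:
            out |= 1 << bit
        bit += 1
        mask &= mask - 1            # clear that mask bit
    return out

def pdep(x, mask):
    out, bit = 0, 0
    while mask:
        low = mask & -mask
        if (x >> bit) & 1:
            out |= low
        bit += 1
        mask &= mask - 1
    return out

M32 = (1 << 32) - 1

def cesu8_to_utf8(n):                       # CF = 0 path: masks swapped
    v = pext(n, 0xEDBF3F000F3F)             # extract CESU-8 variable bits
    v = (v + (-0x1FDF0000 & M32)) & M32     # negated adjustment, 32-bit add
    return pdep(v, 0x073F3F3F) | 0xF0808080

def utf8_to_cesu8(n):                       # CF = 1 path
    v = (pext(n, 0x073F3F3F) + 0x1FDF0000) & M32
    return pdep(v, 0xEDBF3F000F3F) | 0x80EDB080

assert cesu8_to_utf8(0xEDAFABEDB387) == 0xF48AB387
assert utf8_to_cesu8(0xF48AB387) == 0xEDAFABEDB387
```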
Python 3.8, ~~155~~ 139 bytes
-16 bytes thanks to @ceilingcat
Big-endian input, used with hexadecimal input/output.
The code for both the big-endian and little-endian conversions is a golfed version of the OP's code.
c=lambda s:hex((t:=(s>>12)-0xED9F80EDB)+(t&29605888)*3|s&3903|0xF0808080)
u=lambda s:hex(0xE9DD7EEDB080+(s&3903)+(s+(s&3158016)*3>>12<<22))
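Using the example pair from the Bash answer (Python 3.8+ is needed for the walrus operator), the two lambdas round-trip the sample values:

```python
c = lambda s: hex((t := (s >> 12) - 0xED9F80EDB) + (t & 29605888) * 3 | s & 3903 | 0xF0808080)
u = lambda s: hex(0xE9DD7EEDB080 + (s & 3903) + (s + (s & 3158016) * 3 >> 12 << 22))

assert c(0xEDAFABEDB387) == hex(0xF48AB387)
assert u(0xF48AB387) == hex(0xEDAFABEDB387)
```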
Python 3.8, ~~168~~ 146 bytes (little endian input)
-22 bytes thanks to @ceilingcat
a,b=0x80B0ED809FED,16143<<16
c=lambda s:hex(((s:=s-a)<<4|s>>10&3161863)|s>>16&b|0x808080F0)
u=lambda s:hex(((s&b)<<16|(s&3847)<<10|s>>4&197376)+a)
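Little-endian input here means the byte sequence is packed into an integer with its first byte least significant. A small helper (my own, for illustration) shows how these inputs relate to the big-endian values used elsewhere in the thread:

```python
def pack_le(hexstr):
    # first byte of the sequence becomes the least-significant byte
    return int.from_bytes(bytes.fromhex(hexstr), "little")

assert pack_le("F48AB387") == 0x87B38AF4
assert pack_le("EDAFABEDB387") == 0x87B3EDABAFED
```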
C (gcc), 123 bytes
long f(long n){long i=1<<20,j=0,p,q;for(;i|='x```'*2u,p=i++^'@@@'^n,j|='@UP@'*3u,q=j++^0xeda040124040,p&&q-n;);return p^q;}
-5 bytes thanks to @ceilingcat
-1 byte thanks to @Explorer09
Big-endian input. A brute-force solution.
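The brute-force idea can be restated in Python (my own sketch, not a translation of the C): enumerate the supplementary code points, build both encodings, and return the counterpart of whichever one matches the input.

```python
def brute(n):
    for cp in range(0x10000, 0x110000):
        utf = int.from_bytes(chr(cp).encode(), "big")
        s = cp - 0x10000
        pair = chr(0xD800 + (s >> 10)) + chr(0xDC00 + (s & 1023))
        # surrogatepass lets us UTF-8-encode the lone surrogates
        cesu = int.from_bytes(pair.encode("utf-8", "surrogatepass"), "big")
        if n == utf:
            return cesu
        if n == cesu:
            return utf

assert brute(0xF48AB387) == 0xEDAFABEDB387
```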
x86asm, 55 bytes
00000000: b900 0010 0031 f681 c9c0 c0c0 f08d 91c0 .....1..........
00000010: bfbf ffff c148 b8c0 f0ff c0a0 ed00 0048 .....H.........H
00000020: 09c6 488d 86c0 bfed bf48 ffc6 39fa 7406 ..H......H..9.t.
00000030: 4839 f875 d292 c3 H9.u...
use64
f:
mov ecx, 0x100000
xor esi, esi
.a: or ecx, 0xF0C0C0C0
lea edx, [rcx-0x404040]
inc ecx
mov rax, 0xEDA0C0FFF0C0
or rsi, rax
lea rax, [rsi-0x40124040]
inc rsi
cmp edx, edi
jz .b
cmp rax, rdi
jnz .a
xchg eax, edx
.b: ret
Same logic as the machine-code answer. I didn't expect assembly to still be shorter when pext and pdep exist.
05AB1E, 74 (39+35) bytes
UTF-8 to CESU-8 (39 bytes):
bć¦¦¦šε¦¦}JŽMaS£ćC<bš2ô€0˜••₅вb+C
I/O as a list of integers. (Could be -2 bytes if I/O as a list of binary-strings is allowed.)
Try it online (with header/footer to convert from/to hexadecimal).
CESU-8 to UTF-8 (35 bytes):
bεŽ3
Nè.$}JŽIuS£ćC>bšJŽE"S£Tìć7bìšC
I/O as a list of integers. (Could be -2 bytes if I/O as a list of binary-strings is allowed.)
Try it online (with header/footer to convert from/to hexadecimal).
Explanation:
b # Convert the (implicit) list of integers to a list of binary-strings
ć # Extract the head
¦¦¦ # Remove the first three 1s of this binary-string
š # Prepend it back to the list
ε # Map over each binary-string:
¦¦ # Remove the first "10" of the binary-string
}J # After the map: join everything together
ŽMa # Push compressed integer 5646
S # Convert it to a list of digits: [5,6,4,6]
£ # Split the string into parts of those sizes
ć # Extract the head
C # Convert from a binary-string to an integer
< # Decrease it by 1
b # Convert it from an integer back to a binary-string
š # Prepend it back to the list
2ô # Split the list of four into two pairs
€0 # Place a 0 before each pair in the list:
# [0,[wwww,uuuuzz],0,[yyyy,xxxxxx]]
˜ # Flatten it
•• # Push compressed integer 256212984493808
₅в # Convert it to base-255 as list: [237,160,128,237,176,128]
b # Convert each to binary:
# [11101101,10100000,10000000,11101101,10110000,10000000]
+ # Add the values in the two lists together
C # Convert all values from binary-strings to integers
# (after which the result is output implicitly)
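The UTF-8 to CESU-8 steps above can be mirrored in Python (my own restatement, with the compressed constants written out):

```python
def utf8_to_cesu8(bs):
    bits = [f"{b:08b}" for b in bs]
    bits[0] = bits[0][3:]                       # drop first three 1s of head
    payload = "".join(s[2:] for s in bits)      # drop "10" from each, join
    parts, i = [], 0
    for n in 5, 6, 4, 6:                        # sizes from ŽMa = 5646
        parts.append(payload[i:i + n]); i += n
    parts[0] = format(int(parts[0], 2) - 1, "b")        # decrease head by 1
    vals = [0, parts[0], parts[1], 0, parts[2], parts[3]]
    fixed = [0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]        # surrogate skeleton
    return [f + int(str(v), 2) for f, v in zip(fixed, vals)]

assert utf8_to_cesu8([0xF4, 0x8A, 0xB3, 0x87]) == \
    [0xED, 0xAF, 0xAB, 0xED, 0xB3, 0x87]
```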
b # Convert the (implicit) list of integers to a list of binary-strings
ε # Map over each binary-string:
Ž3\n # Push compressed integer 842
Nè # Get the digit at the modular map-index
.$ # Remove that many leading bits from the binary-string
}J # After the map: join everything together
ŽIu # Push compressed integer 4646
S # Convert it to a list of digits: [4,6,4,6]
£ # Split the string into parts of those sizes
ćC>bš # Similar to before, but increase by 1 instead
JŽE"S£ # Join and split it into parts of sizes [3,6,6,6]
Tì # Prepend "10" before each
ć # Extract head
7bì # Prepend "111" (binary of 7)
š # Prepend it back to the list
C # Convert all values from binary-strings to integers
# (after which the result is output implicitly)
See this 05AB1E tip of mine (sections How to compress large integers? and How to compress integer lists?) to understand why ŽMa is 5646; •• is 256212984493808; ••₅в is [237,160,128,237,176,128]; Ž3\n is 842; ŽIu is 4646; and ŽE" is 3666.
C (gcc), big endian input, 143 bytes
#define x 4096-0xED9F80EDB
#define c(s)(((s+(s&'00\0')*3>>12)<<22)+0xE9DD7EEDB080+(s&3903))
#define u(s)(s/x+(s/x&29605888)*3|s&3903|'x@@@'*2L)
C (gcc), little endian input, ~~148~~ 147 bytes
long x=0x80B0ED809FED;
#define u(s)((s-x<<4|s-x>>10)&'0?\7'|s-x>>16&16143<<16|'@@@x'*2)
#define c(s)((s<<6&16143L<<22|s&3847)*1024+(s>>4&197376)+x)
Both are golfed versions of the reference implementation.