| Bytes | Lang | Time | Link |
|---|---|---|---|
| 129 | Bash | 250420T214212Z | Explorer |
| 83 | Charcoal | 250424T091905Z | Neil |
| 54 | x86-64 machine code | 250422T140443Z | m90 |
| 139 | Python 3.8 | 250422T155852Z | CrSb0001 |
| 123 | C (gcc) | 250422T010335Z | l4m2 |
| 74 | 05AB1E | 250422T095113Z | Kevin Cr |
| 143 | C (gcc) | 250421T075343Z | ceilingc |
Bash, ~~131~~ 129 bytes
u()(printf %X${1:9} $((v=0x${1:3:3}+128,(v&7228)*3+v+985096)))
c()(printf ED%XEDB${1:5} $((v=0x${1:1:4},(v&771|v/4&7228)+40832)))
Zsh (no C_PRECEDENCES), 122 bytes
u()(printf %X${1:9} $[v=0x${1:3:3}+128,v&7228*3+v+985096])
c()(printf ED%XEDB${1:5} $[v=0x${1:1:4},v&771|v>>2&7228+40832])
This entry was submitted by the author of this challenge and serves as an example.
Numbers are processed in big endian. (It doesn't make sense to use little endian for this language.)
u converts a CESU-8 sequence to UTF-8. The input, passed as argument $1, is formatted in 12 hex digits without the 0x prefix (like EDAFABEDB387).
c converts a UTF-8 sequence to CESU-8. The input, passed as argument $1, is formatted in 8 hex digits without the 0x prefix (like F48AB387).
The Zsh version assumes the C_PRECEDENCES option is off (unsetopt C_PRECEDENCES) before calling the functions.
A slightly deobfuscated version:
cesu8_to_utf8 () (
printf "%X${1:9}" \
$((v = (0x${1:3:3}) + 0x80, \
(v & 0x1C3C) * 3 + v + 0xF0808))
)
utf8_to_cesu8 () (
printf "ED%XEDB${1:5}" \
$((v = (0x${1:1:4}), \
((v & 0x303) | ((v / 4) & 0x1C3C)) + 0x9F80))
)
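For reference, the same conversion can be written out directly from the surrogate-pair definitions. This is an illustrative Python sketch (not the author's code), using the same hexadecimal I/O convention and the example pair from above:

```python
def cesu8_to_utf8(h):
    b = bytes.fromhex(h)
    # decode the two 3-byte surrogate encodings
    hi = (b[0] & 15) << 12 | (b[1] & 63) << 6 | (b[2] & 63)
    lo = (b[3] & 15) << 12 | (b[4] & 63) << 6 | (b[5] & 63)
    cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
    # re-encode the code point as 4-byte UTF-8
    return bytes((0xF0 | cp >> 18, 0x80 | cp >> 12 & 63,
                  0x80 | cp >> 6 & 63, 0x80 | cp & 63)).hex().upper()

def utf8_to_cesu8(h):
    b = bytes.fromhex(h)
    cp = (b[0] & 7) << 18 | (b[1] & 63) << 12 | (b[2] & 63) << 6 | (b[3] & 63)
    hi = 0xD800 + (cp - 0x10000 >> 10)   # high surrogate
    lo = 0xDC00 + (cp & 1023)            # low surrogate
    out = []
    for s in hi, lo:                     # 3-byte encoding of each surrogate
        out += [0xE0 | s >> 12, 0x80 | s >> 6 & 63, 0x80 | s & 63]
    return bytes(out).hex().upper()

assert cesu8_to_utf8("EDAFABEDB387") == "F48AB387"
assert utf8_to_cesu8("F48AB387") == "EDAFABEDB387"
```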
Charcoal, ~~107~~ ~~103~~ 83 bytes
≔⍘⁺⍘↧θ¹⁶⎇÷Lθχ׳³X⁴¦¹⁶±×⁹X⁴χ⁴η↥⍘⍘⭆⎇÷Lθχ”C↘⁼B⁷HÀ-↙y&↥””E⌈∨→ln‴=R⸿⊕À⊖G)<℅”∨Σι§η⌕βι⁴¦¹⁶
Try it online! Link is to verbose version of code. I/O is a hexadecimal string (big-endian). Auto-detects the conversion based on the length of the string. Explanation:
≔⍘⁺⍘↧θ¹⁶⎇÷Lθχ׳³X⁴¦¹⁶±×⁹X⁴χ⁴η
Convert the input from hexadecimal, bias it by either adding 2³²+2³⁷ or subtracting 2²⁰+2²³, then convert it to base 4.
↥⍘⍘⭆⎇÷Lθχ”C↘⁼B⁷HÀ-↙y&↥””E⌈∨→ln‴=R⸿⊕À⊖G)<℅”∨Σι§η⌕βι⁴¦¹⁶
Extract the appropriate base 4 digits into one of two compressed template strings, then convert the result from base 4 into base 16.
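Base 4 is convenient here because every byte is exactly four base-4 digits, and a UTF-8 continuation byte `0b10xxxxxx` always begins with digit 2, so its six payload bits are its last three base-4 digits. A small Python illustration of that property (my own, not the Charcoal code):

```python
def base4(n):
    # most-significant-first base-4 digits of n
    d = []
    while n:
        d.append(n & 3)
        n >>= 2
    return d[::-1]

# 0xB3 = 0b10_110011: leading digit 2 marks a continuation byte,
# and the remaining digits 3, 0, 3 carry the six payload bits.
assert base4(0xB3) == [2, 3, 0, 3]
```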
Previous 103-byte answer:
≔⪪⎇‹θ240”←″LK�⁺TRI⁴⁵””#⌈“«h⁸⎇¦↘4∕U▶³∨Gº▶DB>” η≔⁰ζFE⊟ηX²Iι≔⁺×ζι﹪Nιζ≧⁺I⊟ηζFE⊟ηX²Iι«⊞υ⁺⁺¹²⁸I⊟η﹪ζι≧÷ιζ»I⮌υS
Try it online! Link is to verbose version of code. I/O is a list of decimal bytes. Auto-detects the conversion based on the first byte. Explanation: Similar to my previous 107-byte answer below but doesn't process the last byte since it's always the same.
Previous 107-byte answer:
≔⪪⎇‹θ240”←″ET²G³|⌈⁼₂a≡””#⌈∨⊘↘yL×↔FY▶.πY⦃✂F¦‹$?ZI” η≔⁰ζFE⊟ηX²Iι≔⁺×ζι﹪Nιζ≧⁺I⊟ηζFE⊟ηX²Iι«⊞υ⁺⁺¹²⁸I⊟η﹪ζι≧÷ιζ»I⮌υ
Try it online! Link is to verbose version of code. I/O is a list of decimal bytes. Auto-detects the conversion based on the first byte. Explanation:
≔⪪⎇‹θ240”←″ET²G³|⌈⁼₂a≡””#⌈∨⊘↘yL×↔FY▶.πY⦃✂F¦‹$?ZI” η
Select the operations to perform from two compressed strings based on the first byte. The uncompressed string data contains a) a list of 128-offset output byte offsets b) a string of output byte bit counts c) the offset between the two encodings d) a string of input byte bit counts.
≔⁰ζ
Start with a zero code point.
FE⊟ηX²Iι≔⁺×ζι﹪Nιζ
For each of the input bytes, extract the appropriate number of bits and update the running total. (For UTF, the final running total is the code point.)
≧⁺I⊟ηζ
Offset the value to map between the encodings. (For CESU to UTF, this means that the value now has the right code point.)
FE⊟ηX²Iι«⊞υ⁺⁺¹²⁸I⊟η﹪ζι≧÷ιζ»
For each output byte, extract the desired number of bits, and add on the desired offset.
I⮌υ
The above loop extracts the bits in reverse order, so correct this for the final output.
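The decode/re-encode loop structure described above can be sketched in Python. The bit counts and offsets below are my own illustration for the 4-byte UTF-8 case; in the Charcoal answer they come from the compressed string data:

```python
def decode(bytes_, bits):
    v = 0
    for b, k in zip(bytes_, bits):      # take the low k bits of each byte
        v = v << k | b & (1 << k) - 1   # and accumulate the running total
    return v

def encode(v, bits, offsets):
    out = []
    for k, o in zip(bits, offsets):     # extract low bits first...
        out.append(128 + o + (v & (1 << k) - 1))
        v >>= k
    return out[::-1]                    # ...so reverse for the final output

cp = decode([0xF4, 0x8A, 0xB3, 0x87], [3, 6, 6, 6])        # code point
assert cp == 0x10ACC7
assert encode(cp, [6, 6, 6, 3], [0, 0, 0, 112]) == [0xF4, 0x8A, 0xB3, 0x87]
```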
I haven't looked at any of the other answers to see what I/O formats they use, so this may not be the best format.
x86-64 machine code, ~~58~~ 54 bytes
A8 F9 B8 3F 3F 3F 07 48 B9 3F 0F 00 3F BF ED 00 00 BE 80 B0 ED 80 BA 00 00 DF 1F 72 09 48 91 BE 80 80 80 F0 F7 DA C4 E2 C2 F5 F8 01 D7 C4 E2 C3 F5 C1 48 09 F0 C3
This uses big-endian integers.
Following the standard calling convention for Unix-like systems (from the System V AMD64 ABI), this takes a 64-bit integer in RDI and returns a 64-bit integer in RAX.
The CESU-8 to UTF-8 function starts at the beginning, and the UTF-8 to CESU-8 function starts after one byte.
In assembly:
cesu8_to_utf8:
.byte 0xA8 # This combines with the following instruction to form
# 'test al, 0xF9', which sets CF to 0 among other things.
utf8_to_cesu8:
stc # Set CF to 1.
mov eax, 0x073F3F3F # Mask of variable bits in UTF-8.
mov rcx, 0xEDBF3F000F3F # Mask of variable bits in CESU-8, and also
# the first eight 1 bits and one 0 bit.
mov esi, 0x80EDB080 # Fixed 1 bits in CESU-8 except the first eight.
# (With those removed, this fits in 32 bits.)
mov edx, 0x1FDF0000 # Adjustment value to add to the codepoint
# to add eight 1 bits and subtract 2¹⁶.
jc s # Jump if CF=1 to skip some code.
xchg rax, rcx # Swap the masks.
mov esi, 0xF0808080 # Fixed 1 bits in UTF-8.
neg edx # Negate the adjustment value in EDX.
s: pext rdi, rdi, rax # Extract the variable bits (maybe extra) from RDI.
add edi, edx # Add the adjustment value to the variable bits.
pdep rax, rdi, rcx # Distribute the bits using the other mask (into RAX).
or rax, rsi # Combine the result with the fixed 1 bits.
ret # Return.
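The pext/pdep pipeline can be checked with a small Python model (my own emulation of the BMI2 instructions, using the masks and adjustment value from the listing):

```python
# Software models of the x86 BMI2 instructions PEXT (parallel bit
# extract) and PDEP (parallel bit deposit).
def pext(x, mask):
    out, bit = 0, 0
    while mask:
        low = mask & -mask          # isolate lowest set bit of mask
        if x & low:
            out |= 1 << bit
        bit += 1
        mask &= mask - 1            # clear that mask bit
    return out

def pdep(x, mask):
    out, bit = 0, 0
    while mask:
        low = mask & -mask
        if (x >> bit) & 1:
            out |= low
        bit += 1
        mask &= mask - 1
    return out

M32 = (1 << 32) - 1

def cesu8_to_utf8(n):                       # CF = 0 path: masks swapped
    v = pext(n, 0xEDBF3F000F3F)             # extract CESU-8 variable bits
    v = (v + (-0x1FDF0000 & M32)) & M32     # negated adjustment, 32-bit add
    return pdep(v, 0x073F3F3F) | 0xF0808080

def utf8_to_cesu8(n):                       # CF = 1 path
    v = (pext(n, 0x073F3F3F) + 0x1FDF0000) & M32
    return pdep(v, 0xEDBF3F000F3F) | 0x80EDB080

assert cesu8_to_utf8(0xEDAFABEDB387) == 0xF48AB387
assert utf8_to_cesu8(0xF48AB387) == 0xEDAFABEDB387
```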
Python 3.8, ~~155~~ 139 bytes
-16 bytes thanks to @ceilingcat
Big-endian input, used with hexadecimal input/output.
The code for both the big-endian and little-endian conversions is a golfed version of the OP's code.
c=lambda s:hex((t:=(s>>12)-0xED9F80EDB)+(t&29605888)*3|s&3903|0xF0808080)
u=lambda s:hex(0xE9DD7EEDB080+(s&3903)+(s+(s&3158016)*3>>12<<22))
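Using the example pair from the Bash answer (Python 3.8+ is needed for the walrus operator), the two lambdas round-trip the sample values:

```python
c = lambda s: hex((t := (s >> 12) - 0xED9F80EDB) + (t & 29605888) * 3 | s & 3903 | 0xF0808080)
u = lambda s: hex(0xE9DD7EEDB080 + (s & 3903) + (s + (s & 3158016) * 3 >> 12 << 22))

assert c(0xEDAFABEDB387) == hex(0xF48AB387)
assert u(0xF48AB387) == hex(0xEDAFABEDB387)
```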
Python 3.8, ~~168~~ 146 bytes (little endian input)
-22 bytes thanks to @ceilingcat
a,b=0x80B0ED809FED,16143<<16
c=lambda s:hex(((s:=s-a)<<4|s>>10&3161863)|s>>16&b|0x808080F0)
u=lambda s:hex(((s&b)<<16|(s&3847)<<10|s>>4&197376)+a)
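Little-endian input here means the byte sequence is packed into an integer with its first byte least significant. A small helper (my own, for illustration) shows how these inputs relate to the big-endian values used elsewhere in the thread:

```python
def pack_le(hexstr):
    # first byte of the sequence becomes the least-significant byte
    return int.from_bytes(bytes.fromhex(hexstr), "little")

assert pack_le("F48AB387") == 0x87B38AF4
assert pack_le("EDAFABEDB387") == 0x87B3EDABAFED
```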
C (gcc), 123 bytes
long f(long n){long i=1<<20,j=0,p,q;for(;i|='x```'*2u,p=i++^'@@@'^n,j|='@UP@'*3u,q=j++^0xeda040124040,p&&q-n;);return p^q;}
-5 bytes thanks to @ceilingcat
-1 byte thanks to @Explorer09
Big-endian input. A brute-force solution.
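The brute-force idea can be restated in Python (my own sketch, not a translation of the C): enumerate the supplementary code points, build both encodings, and return the counterpart of whichever one matches the input.

```python
def brute(n):
    for cp in range(0x10000, 0x110000):
        utf = int.from_bytes(chr(cp).encode(), "big")
        s = cp - 0x10000
        pair = chr(0xD800 + (s >> 10)) + chr(0xDC00 + (s & 1023))
        # surrogatepass lets us UTF-8-encode the lone surrogates
        cesu = int.from_bytes(pair.encode("utf-8", "surrogatepass"), "big")
        if n == utf:
            return cesu
        if n == cesu:
            return utf

assert brute(0xF48AB387) == 0xEDAFABEDB387
```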
x86asm, 55 bytes
00000000: b900 0010 0031 f681 c9c0 c0c0 f08d 91c0 .....1..........
00000010: bfbf ffff c148 b8c0 f0ff c0a0 ed00 0048 .....H.........H
00000020: 09c6 488d 86c0 bfed bf48 ffc6 39fa 7406 ..H......H..9.t.
00000030: 4839 f875 d292 c3 H9.u...
use64
f:
mov ecx, 0x100000
xor esi, esi
.a: or ecx, 0xF0C0C0C0
lea edx, [rcx-0x404040]
inc ecx
mov rax, 0xEDA0C0FFF0C0
or rsi, rax
lea rax, [rsi-0x40124040]
inc rsi
cmp edx, edi
jz .b
cmp rax, rdi
jnz .a
xchg eax, edx
.b: ret
Same logic as the machine-code answer. I didn't expect assembly to still be shorter when pext and pdep exist.
05AB1E, 74 (39+35) bytes
UTF-8 to CESU-8 (39 bytes):
bć¦¦¦šε¦¦}JŽMaS£ćC<bš2ô€0˜••₅вb+C
I/O as a list of integers. (Could be -2 bytes if I/O as a list of binary-strings is allowed.)
Try it online (with header/footer to convert from/to hexadecimal).
CESU-8 to UTF-8 (35 bytes):
bεŽ3
Nè.$}JŽIuS£ćC>bšJŽE"S£Tìć7bìšC
I/O as a list of integers. (Could be -2 bytes if I/O as a list of binary-strings is allowed.)
Try it online (with header/footer to convert from/to hexadecimal).
Explanation:
b # Convert the (implicit) list of integers to a list of binary-strings
ć # Extract the head
¦¦¦ # Remove the first three 1s of this binary-string
š # Prepend it back to the list
ε # Map over each binary-string:
¦¦ # Remove the first "10" of the binary-string
}J # After the map: join everything together
ŽMa # Push compressed integer 5646
S # Convert it to a list of digits: [5,6,4,6]
£ # Split the string into parts of those sizes
ć # Extract the head
C # Convert from a binary-string to an integer
< # Decrease it by 1
b # Convert it from an integer back to a binary-string
š # Prepend it back to the list
2ô # Split the list of four into two pairs
€0 # Place a 0 before each pair in the list:
# [0,[wwww,uuuuzz],0,[yyyy,xxxxxx]]
˜ # Flatten it
•• # Push compressed integer 256212984493808
₅в # Convert it to base-255 as list: [237,160,128,237,176,128]
b # Convert each to binary:
# [11101101,10100000,10000000,11101101,10110000,10000000]
+ # Add the values in the two lists together
C # Convert all values from binary-strings to integers
# (after which the result is output implicitly)
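The UTF-8 to CESU-8 steps above can be mirrored in Python (my own restatement, with the compressed constants written out):

```python
def utf8_to_cesu8(bs):
    bits = [f"{b:08b}" for b in bs]
    bits[0] = bits[0][3:]                       # drop first three 1s of head
    payload = "".join(s[2:] for s in bits)      # drop "10" from each, join
    parts, i = [], 0
    for n in 5, 6, 4, 6:                        # sizes from ŽMa = 5646
        parts.append(payload[i:i + n]); i += n
    parts[0] = format(int(parts[0], 2) - 1, "b")        # decrease head by 1
    vals = [0, parts[0], parts[1], 0, parts[2], parts[3]]
    fixed = [0xED, 0xA0, 0x80, 0xED, 0xB0, 0x80]        # surrogate skeleton
    return [f + int(str(v), 2) for f, v in zip(fixed, vals)]

assert utf8_to_cesu8([0xF4, 0x8A, 0xB3, 0x87]) == \
    [0xED, 0xAF, 0xAB, 0xED, 0xB3, 0x87]
```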
b # Convert the (implicit) list of integers to a list of binary-strings
ε # Map over each binary-string:
Ž3\n # Push compressed integer 842
Nè # Get the digit at the modular map-index
.$ # Remove that many leading bits from the binary-string
}J # After the map: join everything together
ŽIu # Push compressed integer 4646
S # Convert it to a list of digits: [4,6,4,6]
£ # Split the string into parts of those sizes
ćC>bš # Similar to before, but increase by 1 instead
JŽE"S£ # Join and split it into parts of sizes [3,6,6,6]
Tì # Prepend "10" before each
ć # Extract head
7bì # Prepend "111" (binary of 7)
š # Prepend it back to the list
C # Convert all values from binary-strings to integers
# (after which the result is output implicitly)
See this 05AB1E tip of mine (sections How to compress large integers? and How to compress integer lists?) to understand why ŽMa is 5646; •• is 256212984493808; ••₅в is [237,160,128,237,176,128]; Ž3\n is 842; ŽIu is 4646; and ŽE" is 3666.
C (gcc), big endian input, 143 bytes
#define x 4096-0xED9F80EDB
#define c(s)(((s+(s&'00\0')*3>>12)<<22)+0xE9DD7EEDB080+(s&3903))
#define u(s)(s/x+(s/x&29605888)*3|s&3903|'x@@@'*2L)
C (gcc), little endian input, ~~148~~ 147 bytes
long x=0x80B0ED809FED;
#define u(s)((s-x<<4|s-x>>10)&'0?\7'|s-x>>16&16143<<16|'@@@x'*2)
#define c(s)((s<<6&16143L<<22|s&3847)*1024+(s>>4&197376)+x)
Both are golfed versions of the reference implementation.