| Bytes | Lang | Time | Link |
|---|---|---|---|
| 449 | GNU sed E | 240831T145818Z | 鳴神裁四点一号 |
| 200 | Charcoal | 240831T210441Z | Neil |
| 911 | Python3 | 240830T173518Z | Ajax1234 |
GNU sed -E, 461 457 449 bytes
s/^/;0,q/
:q
s/q/qq/g
t3
:3
s/(,\w+)(,\w+,\w+)(.*)/\1,q\3;q\2/
tq
s/(.*)(;(q+)z)/\3\1/
:4
s/^(q+)\>(.*);((\w+)(,\1(,\w+)|(,\w+),\1))\>/\1\2%\3;\4\6\7/
t4
s/^(q+)\>(.*);((\w+),\1)\>/\1\2%\3;\4z/
t4
s/^q+//
y/%/;/
t3
s/;(q+,q+,\w+)/<\1/g
s/(.*);(q+,q+)/\2\1/
:g
s/^((q+),(q+))\>(.*);(\3(,\w+(,\w+)?))/\1;\2\6\4<\5/
tg
s/^q+,q+//
y/</;/
t3
s/,(a+,)/%\1/g
s/(,q+),a/\1%a/g
s/(.*)(%a+)/q\2&/
:s
s/^q(%a+)\>(.*)\1\>/q\1\2,q/
ts
s/^q%(a+)/;q,\1/
s/%/,/g
tq
Usage
One-line input. Unary state and unary alphabet.
Input := Rule*
Rule := ";" State RewriteList
State := "q"+
RewriteList := ( "," ( State | Alphabet) )+ | Empty
Alphabet:= "a"+
Empty := "z"
One letter state "q" considered to be initial state of input; the output additionally uses "0" to represent initial.
For example if you have the language
S -> AbB | C
C -> b | c
B -> AA | AC
A -> a | ε
you need to translate into this:
;q,qq,aa,qqq;q,qqqq;qqqq,aa;qqqq,aaa;qqq,qq,qq;qqq,qq,qqqq;qq,a;qqz
States S, A, B, C are translated into q, qq, qqq, qqqq respectively; similar for alphabets a, b, c.
Ungolfed 461 bytes answer
I tried with in order of step 1, 3, 4, 5, 2.
Reading hint: I named alphabet labels taken from the qwerty keyboard;
# step 1
s/^/;0,q/
# rename every state; new state q can be introduced
:q
# s/\<q/qq/g # readable alternative
s/q/qq/g
t3
# step 3
:3
s/(,\w+)(,\w+,\w+)(.*)/\1,q\3;q\2/
tq
:r
# step 4
# choose a nullable state
s/(.*)(;(q+)z)/\3\1/
Tw
# add nullable state rules
:4
s/^(q+)\>(.*);((\w+)(,\1(,\w+)|(,\w+),\1))\>/\1\2%\3;\4\6\7/
t4
s/^(q+)\>(.*);((\w+),\1)\>/\1\2%\3;\4z/
t4
# step 4 done for that state
s/q+//
y/%/;/
tr
# step 5
# protect every A -> BC
:w
s/;(q+,q+,\w+)/<\1/g
# choose one A -> B
s/(.*);(q+,q+)/\2\1/
# no need to unprotect A -> BC
# if A -> B and B -> ... add A -> ...
:g
s/^((q+),(q+))\>(.*);(\3(,\w+(,\w+)?))/\1;\2\6\4<\5/
tg
# remove A -> B
s/^q+,q+//
# end of step 5 or just removed A -> B
y/</;/
tw
# step 2
# mark every alphabet a in A -> Ba and A -> aB
s/,(a+,)/%\1/g
s/(,q+),(a+)/\1%\2/g
# choose one alphabet
s/(.*)(%a+)/q\2&/
# now rewrite rules
:s
s/^q(%a+)\>(.*)\1\>/q\1\2,q/
ts
s/^q%(a+)/;q,\1/
# still having any?
s/%/,/g
tq # to be able to get more states
# COMPLETED!
# some rules are duplicated but whatever
Charcoal, 200 bytes
UMθEιEλ∨νLθ⊞θ§θ⁰§≔θ⁰⟦⟦⊖Lθ⟧⟧W⌊EΦEθΦκ∧›Lμ¹‹⌊μ⁰κ⌊Eκ⌊μ«UMθEκEμ⎇⁼ξιLθξ⊞θ⟦⟦ι⟧⟧»W⌊ΦEθ⌊Φκ›Lμ²κ«⊞θ⟦⮌E²⊟ι⟧⊞ι⊖Lθ»WΦ⁻Eθλυ⊙§θκ¬⁻μυFι⊞υκUMθΦιλF№υ⁰⊞§θ⁰⟦⟧FθFΦι⁼Lκ²FΦ²№υ§κλ⊞ιΦκ⁻νλW⌊ΣEθΦκ∧⁼Lμ¹›⌊μ⁰UMθ⁺⁻κ⟦ι⟧⎇№κι§θ⌊ι⟦⟧⭆¹θ
Attempt This Online! Link is to verbose version of code. Takes input as an array of arrays of rules where index 0 represents the start symbol and rules contain positive integers for nonterminals and negative integers for terminals. Explanation:
UMθEιEλ∨νLθ⊞θ§θ⁰§≔θ⁰⟦⟦⊖Lθ⟧⟧
Renumber the start symbol and add a production from 0 to the new symbol.
W⌊EΦEθΦκ∧›Lμ¹‹⌊μ⁰κ⌊Eκ⌊μ
While rules exist that contain terminals but aren't of normal form, ...
«UMθEκEμ⎇⁼ξιLθξ⊞θ⟦⟦ι⟧⟧»
... add a new nonterminal with a production to the terminal and rename all uses of that terminal to the new nonterminal.
W⌊ΦEθ⌊Φκ›Lμ²κ
While a rule exists with more than two nonterminals...
«⊞θ⟦⮌E²⊟ι⟧⊞ι⊖Lθ»
... add a new nonterminal for the last two nonterminals in that rule and update the rule to use that nonterminal instead.
WΦ⁻Eθλυ⊙§θκ¬⁻μυFι⊞υκ
Determine which nonterminals are nullable.
UMθΦιλ
Remove all empty productions.
F№υ⁰⊞§θ⁰⟦⟧
If the start symbol is nullable then add in an empty production.
FθFΦι⁼Lκ²FΦ²№υ§κλ⊞ιΦκ⁻νλ
For each rule that contains a nullable nonterminal, add a rule with that nonterminal removed.
W⌊ΣEθΦκ∧⁼Lμ¹›⌊μ⁰
While a unit rule exists...
UMθ⁺⁻κ⟦ι⟧⎇№κι§θ⌊ι⟦⟧
... update all of the productions that include it to include the rules for its nonterminal instead.
⭆¹θ
Output the final list of productions.
The given example
S -> AbB | C
A -> a | ε
B -> AA | AC
C -> b | c
is represented by the following input array:
[
[[1, -2, 2], [3]],
[[-1], []],
[[1, 1], [1, 3]],
[[-2], [-3]]]
]
The resulting normal form is
[
[[1, 6], [-3], [-2], [5, 2], [-2]],
[[-1]],
[[1, 1], [1, 3], [-1], [-3], [-2]],
[[-3], [-2]],
[[1, 6], [-3], [-2], [5, 2], [-2]],
[[-2]],
[[5, 2], [-2]]
]
which can be interpreted as:
S -> AF | c | b | EB | b
A -> a
B -> AA | AC | a | c | b
C -> c | b
D -> AF | c | b | EB | b
E -> b
F -> EB | b
Python3, 911 bytes
import string as S
from itertools import*
L=str.islower
H=lambda x:x.replace('S','T')
V=product
def P(T,G):
q,s=[b for a,b in G if a==T],[]
for i in q:
if''==i:return 1
for j in map(''.join,V(*[[t]if L(t)else[b for a,b in G if a==t]for t in i])):
if j==''or not any({*k}=={*j}for k in s):q+=[j];s+=[j]
return 0
def f(g):
g=[(H(a),H(b))for a,b in g]
s,d=[*{*S.ascii_uppercase}-{a for a,_ in g}-{'S'}],{}
G,N=[],[]
for a,b in g:
if all(L(i)for i in b):G+=[(a,b)]
else:
n=''
for i in b:
if L(i):G+=[(t:=(d.get(i)or s.pop()),i)];d[i]=t;i=t
n+=i
N+=[(a,n)]
for a,b in N:
if len(b)<3:G+=[(a,b)];continue
G+=[(a,b[0]+(i:=s.pop()))];N+=[(i,b[1:])]
K=[]
G={*G}
for a,b in G:
if''!=b:K+=[(a,''.join(j))for j in V(*[[t]+['']*P(t,G)for t in b])]
F,T=[],[*{*G,*K}]
for a,b in T:
if len(b)==1 and not L(b):T+=[(a,B)for A,B in T if A==b]
else:F+=[(a,b)]
return [('S','T')]+F