g | x | w | all

Bytes	Lang	Time	Link
449	GNU sed E	240831T145818Z	鳴神裁四点一号
200	Charcoal	240831T210441Z	Neil
911	Python3	240830T173518Z	Ajax1234

GNU sed `-E`, 461 457 449 bytes

s/^/;0,q/
:q
s/q/qq/g
t3
:3
s/(,\w+)(,\w+,\w+)(.*)/\1,q\3;q\2/
tq
s/(.*)(;(q+)z)/\3\1/
:4
s/^(q+)\>(.*);((\w+)(,\1(,\w+)|(,\w+),\1))\>/\1\2%\3;\4\6\7/
t4
s/^(q+)\>(.*);((\w+),\1)\>/\1\2%\3;\4z/
t4
s/^q+//
y/%/;/
t3
s/;(q+,q+,\w+)/<\1/g
s/(.*);(q+,q+)/\2\1/
:g
s/^((q+),(q+))\>(.*);(\3(,\w+(,\w+)?))/\1;\2\6\4<\5/
tg
s/^q+,q+//
y/</;/
t3
s/,(a+,)/%\1/g
s/(,q+),a/\1%a/g
s/(.*)(%a+)/q\2&/
:s
s/^q(%a+)\>(.*)\1\>/q\1\2,q/
ts
s/^q%(a+)/;q,\1/
s/%/,/g
tq

Try it online!

Usage

One-line input. Unary state and unary alphabet.

Input := Rule*
Rule := ";" State RewriteList
State := "q"+
RewriteList := ( "," ( State | Alphabet) )+ | Empty
Alphabet:= "a"+
Empty := "z"

One letter state "q" considered to be initial state of input; the output additionally uses "0" to represent initial.

For example if you have the language

S -> AbB | C
C -> b | c
B -> AA | AC
A -> a | ε

you need to translate into this:

;q,qq,aa,qqq;q,qqqq;qqqq,aa;qqqq,aaa;qqq,qq,qq;qqq,qq,qqqq;qq,a;qqz

States S, A, B, C are translated into q, qq, qqq, qqqq respectively; similar for alphabets a, b, c.

Ungolfed 461 bytes answer

I tried with in order of step 1, 3, 4, 5, 2.

Reading hint: I named alphabet labels taken from the qwerty keyboard;

# step 1
s/^/;0,q/

# rename every state; new state q can be introduced
:q
# s/\<q/qq/g # readable alternative
s/q/qq/g
t3

# step 3
:3
s/(,\w+)(,\w+,\w+)(.*)/\1,q\3;q\2/
tq

:r
# step 4
# choose a nullable state
s/(.*)(;(q+)z)/\3\1/
Tw

# add nullable state rules
:4
s/^(q+)\>(.*);((\w+)(,\1(,\w+)|(,\w+),\1))\>/\1\2%\3;\4\6\7/
t4
s/^(q+)\>(.*);((\w+),\1)\>/\1\2%\3;\4z/
t4

# step 4 done for that state
s/q+//
y/%/;/
tr

# step 5
# protect every A -> BC
:w
s/;(q+,q+,\w+)/<\1/g

# choose one A -> B
s/(.*);(q+,q+)/\2\1/

# no need to unprotect A -> BC
# if A -> B and B -> ... add A -> ...
:g
s/^((q+),(q+))\>(.*);(\3(,\w+(,\w+)?))/\1;\2\6\4<\5/
tg

# remove A -> B
s/^q+,q+//

# end of step 5 or just removed A -> B
y/</;/
tw

# step 2
# mark every alphabet a in A -> Ba and A -> aB
s/,(a+,)/%\1/g
s/(,q+),(a+)/\1%\2/g

# choose one alphabet
s/(.*)(%a+)/q\2&/

# now rewrite rules
:s
s/^q(%a+)\>(.*)\1\>/q\1\2,q/
ts
s/^q%(a+)/;q,\1/

# still having any?
s/%/,/g
tq # to be able to get more states

# COMPLETED!
# some rules are duplicated but whatever

Try it online!

Charcoal, 200 bytes

ＵＭθＥιＥλ∨νＬθ⊞θ§θ⁰§≔θ⁰⟦⟦⊖Ｌθ⟧⟧Ｗ⌊ＥΦＥθΦκ∧›Ｌμ¹‹⌊μ⁰κ⌊Ｅκ⌊μ«ＵＭθＥκＥμ⎇⁼ξιＬθξ⊞θ⟦⟦ι⟧⟧»Ｗ⌊ΦＥθ⌊Φκ›Ｌμ²κ«⊞θ⟦⮌Ｅ²⊟ι⟧⊞ι⊖Ｌθ»ＷΦ⁻Ｅθλυ⊙§θκ¬⁻μυＦι⊞υκＵＭθΦιλＦ№υ⁰⊞§θ⁰⟦⟧ＦθＦΦι⁼Ｌκ²ＦΦ²№υ§κλ⊞ιΦκ⁻νλＷ⌊ΣＥθΦκ∧⁼Ｌμ¹›⌊μ⁰ＵＭθ⁺⁻κ⟦ι⟧⎇№κι§θ⌊ι⟦⟧⭆¹θ

Attempt This Online! Link is to verbose version of code. Takes input as an array of arrays of rules where index 0 represents the start symbol and rules contain positive integers for nonterminals and negative integers for terminals. Explanation:

ＵＭθＥιＥλ∨νＬθ⊞θ§θ⁰§≔θ⁰⟦⟦⊖Ｌθ⟧⟧

Renumber the start symbol and add a production from 0 to the new symbol.

Ｗ⌊ＥΦＥθΦκ∧›Ｌμ¹‹⌊μ⁰κ⌊Ｅκ⌊μ

While rules exist that contain terminals but aren't of normal form, ...

«ＵＭθＥκＥμ⎇⁼ξιＬθξ⊞θ⟦⟦ι⟧⟧»

... add a new nonterminal with a production to the terminal and rename all uses of that terminal to the new nonterminal.

Ｗ⌊ΦＥθ⌊Φκ›Ｌμ²κ

While a rule exists with more than two nonterminals...

«⊞θ⟦⮌Ｅ²⊟ι⟧⊞ι⊖Ｌθ»

... add a new nonterminal for the last two nonterminals in that rule and update the rule to use that nonterminal instead.

ＷΦ⁻Ｅθλυ⊙§θκ¬⁻μυＦι⊞υκ

Determine which nonterminals are nullable.

ＵＭθΦιλ

Remove all empty productions.

Ｆ№υ⁰⊞§θ⁰⟦⟧

If the start symbol is nullable then add in an empty production.

ＦθＦΦι⁼Ｌκ²ＦΦ²№υ§κλ⊞ιΦκ⁻νλ

For each rule that contains a nullable nonterminal, add a rule with that nonterminal removed.

Ｗ⌊ΣＥθΦκ∧⁼Ｌμ¹›⌊μ⁰

While a unit rule exists...

ＵＭθ⁺⁻κ⟦ι⟧⎇№κι§θ⌊ι⟦⟧

... update all of the productions that include it to include the rules for its nonterminal instead.

⭆¹θ

Output the final list of productions.

The given example

S -> AbB | C
A -> a | ε
B -> AA | AC
C -> b | c

is represented by the following input array:

[
 [[1, -2, 2], [3]],
 [[-1], []],
 [[1, 1], [1, 3]],
 [[-2], [-3]]]
]

The resulting normal form is

[
 [[1, 6], [-3], [-2], [5, 2], [-2]],
 [[-1]],
 [[1, 1], [1, 3], [-1], [-3], [-2]],
 [[-3], [-2]],
 [[1, 6], [-3], [-2], [5, 2], [-2]],
 [[-2]],
 [[5, 2], [-2]]
]

which can be interpreted as:

S -> AF | c | b | EB | b
A -> a
B -> AA | AC | a | c | b
C -> c | b
D -> AF | c | b | EB | b
E -> b
F -> EB | b

Python3, 911 bytes

import string as S
from itertools import*
L=str.islower
H=lambda x:x.replace('S','T')
V=product
def P(T,G):
 q,s=[b for a,b in G if a==T],[]
 for i in q:
  if''==i:return 1
  for j in map(''.join,V(*[[t]if L(t)else[b for a,b in G if a==t]for t in i])):
   if j==''or not any({*k}=={*j}for k in s):q+=[j];s+=[j]
 return 0
def f(g):
 g=[(H(a),H(b))for a,b in g]
 s,d=[*{*S.ascii_uppercase}-{a for a,_ in g}-{'S'}],{}
 G,N=[],[]
 for a,b in g:
  if all(L(i)for i in b):G+=[(a,b)]
  else:
   n=''
   for i in b:
    if L(i):G+=[(t:=(d.get(i)or s.pop()),i)];d[i]=t;i=t
    n+=i
   N+=[(a,n)]
 for a,b in N:
  if len(b)<3:G+=[(a,b)];continue
  G+=[(a,b[0]+(i:=s.pop()))];N+=[(i,b[1:])]
 K=[]
 G={*G}
 for a,b in G:
  if''!=b:K+=[(a,''.join(j))for j in V(*[[t]+['']*P(t,G)for t in b])]
 F,T=[],[*{*G,*K}]
 for a,b in T:
  if len(b)==1 and not L(b):T+=[(a,B)for A,B in T if A==b]
  else:F+=[(a,b)]
 return [('S','T')]+F

Try it online!

GNU sed -E, 461 457 449 bytes

Usage

Ungolfed 461 bytes answer

Charcoal, 200 bytes

Python3, 911 bytes

GNU sed `-E`, 461 457 449 bytes