| Bytes | Lang | Time | Link |
|---|---|---|---|
| 163 | JavaScript V8 | 251011T110513Z | Arnauld |
| 102 | Charcoal | 251013T213633Z | Neil |
| 84 | 05AB1E | 251013T093958Z | Kevin Cr |
| 148 | Retina | 251011T125836Z | Neil |
JavaScript (V8), 163 bytes
Expects a string and prints each match on a single space-separated line, using the field order described in the challenge.
s=>[1,k=-1].map(r=>[...S=s].map((c,i)=>S.slice(i).replace(/^ATG(...)*?T(AA|AG|GA)/,s=>print(k,k+~-s.length*r,r*i%3+r,s),k+=r,s=q[q.search(c)^1]+s),s=q="CGAT",k++))
Commented
s =>                            // s = input string
[1, k = -1]                     // k = ORF starting position, initialized to -1
.map(r =>                       // for r = 1 and r = -1:
  [...S = s]                    //   S = copy of s
  .map((c, i) =>                //   for each character c at index i in S:
    S.slice(i)                  //     take the substring of S starting at i
    .replace(                   //     search for:
      /^ATG(...)*?T(AA|AG|GA)/, //       a valid ORF at the start of this substring
      s =>                      //       if found, store it in s and
      print(                    //       print:
        k,                      //         starting position k
        k + ~-s.length * r,     //         ending position: k + (s.length - 1) * r
        r * i % 3 + r,          //         frame: (r * i) % 3 + r
        s                       //         matched sequence s
      ),                        //       end of print()
      k += r,                   //       update the starting position
      s =                       //       build the reverse sequence in s
        q[                      //       by prepending the complement base
          q.search(c) ^ 1       //       using c and the lookup string q
        ] + s                   //
    ),                          //     end of replace()
    s = q = "CGAT",             //     set the lookup string q and reset s for the
                                //     reverse sequence (*)
    k++                         //     increment k
  )                             //   end of inner map()
)                               // end of outer map()
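(*) This is safe because the lookup string starts with C, which ends up right after the reverse sequence. No stop codon (TAA, TAG, TGA) contains a C, so the extra characters cannot accidentally complete one.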
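The complement lookup in the commented code works because complementary bases sit at adjacent indices of the lookup string. The trick can be sketched on its own as follows; the helper names `complement` and `reverseComplement` are mine, not part of the answer, and `indexOf` stands in for the golfed `search`.

```javascript
// Lookup string: complementary bases sit at index pairs (0,1) and (2,3).
const q = "CGAT";

// XOR with 1 flips the lowest bit of the index: 0<->1 (C<->G), 2<->3 (A<->T).
const complement = c => q[q.indexOf(c) ^ 1];

// Prepending each complemented base reverses the sequence while building it.
const reverseComplement = dna =>
  [...dna].reduce((acc, c) => complement(c) + acc, "");

console.log(reverseComplement("ATGC")); // GCAT
```

Unlike the golfed version, this sketch starts from an empty accumulator, so nothing is appended after the reversed sequence.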
Charcoal, 102 bytes
F²«≔⭆⮌θ§ACGT⌕TGCκθF⌕AθATG«≔⪪✂θκLθ¹¦³η≔⌊ΦE⁺T⪪ײAAG²⊕⌕ηλλζ¿ζ⟦⪫⟦⎇ι⊕κ⁻Lθκ⎇ι⁺κ׳ζ⊕⁻Lθ⁺κ׳ζ⁺§-+ι⊕﹪κ³⪫…ηζω⟧;
Try it online! Link is to verbose version of code. Explanation:
F²«
Loop twice, once for the reverse complement.
≔⭆⮌θ§ACGT⌕TGCκθ
Take the reverse complement of the string. (So on the second pass, we recover the original input.)
F⌕AθATG«
Loop over all occurrences of ATG in the string.
≔⪪✂θκLθ¹¦³η
Split the suffix of the string into groups of three.
≔⌊ΦE⁺T⪪ײAAG²⊕⌕ηλλζ
Search the groups of three for each of TAA, TGA and TAG, and take the lowest codon count, if any.
¿ζ⟦⪫⟦⎇ι⊕κ⁻Lθκ⎇ι⁺κ׳ζ⊕⁻Lθ⁺κ׳ζ⁺§-+ι⊕﹪κ³⪫…ηζω⟧;
If a match was found then output the details of the match.
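For readers who do not read Charcoal, the same steps can be sketched in JavaScript. This is an ungolfed illustration, not a translation of the answer: the names `revComp`, `STOPS` and `findOrfs` are hypothetical, the output is returned as `[start, end, frame, sequence]` arrays rather than printed, and the forward strand is processed first (the Charcoal code handles the reverse complement first).

```javascript
// Reverse complement via a plain lookup table.
const revComp = dna =>
  [...dna].reverse().map(c => ({ A: "T", C: "G", G: "C", T: "A" })[c]).join("");

const STOPS = ["TAA", "TAG", "TGA"];

// Returns [start, end, frame, sequence] for every ORF on both strands.
function findOrfs(dna) {
  const n = dna.length;
  const out = [];
  for (const [strand, sign] of [[dna, 1], [revComp(dna), -1]]) {
    // Visit every occurrence of the start codon.
    for (let k = strand.indexOf("ATG"); k >= 0; k = strand.indexOf("ATG", k + 1)) {
      // Split the suffix starting at k into complete codons.
      const codons = strand.slice(k).match(/.{3}/g);
      // Index of the first in-frame stop codon, or -1 if there is none.
      const stop = codons.findIndex(codon => STOPS.includes(codon));
      if (stop < 0) continue;
      const len = 3 * (stop + 1);
      // 1-based positions expressed in the coordinates of the original input.
      const start = sign > 0 ? k + 1 : n - k;
      const end = sign > 0 ? k + len : n - k - len + 1;
      out.push([start, end, sign * (k % 3 + 1), strand.slice(k, k + len)]);
    }
  }
  return out;
}

console.log(findOrfs("ATGAAATAA")); // one ORF: start 1, end 9, frame 1
```

On the reverse strand the start position is larger than the end position, which matches the position arithmetic used by the other answers on this page.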
05AB1E, 84 bytes
Â.•∍–•u‡‚εDŒʒg3Öy…ATGÅ?y.•5ʒŒΩœ •u3ôÅ¿àP}©kDU>X®€g+‚NiIgα>}`X3%Ni(<ë>}®)ø.γн}€н}í€`
Try it online or verify all test cases.
If we're allowed to output a pair of quartets where the regular and reversed DNA-ORFs are separated, the trailing 4 bytes can be removed:
Try it online or verify all test cases.
Explanation:
Â # Bifurcate the (implicit) input; aka, Duplicate & Reverse copy
.•∍–• # Push compressed string "acgt"
u # Uppercase it
Â # Bifurcate it as well
‡ # Transliterate ("A" to "T"; "C" to "G"; "G" to "C"; "T" to "A")
‚ # Pair the two strings together
ε # Map over this pair:
D # Duplicate the current string
Œ # Pop and push its substrings
ʒ # Filter this list by:
g # Check that its length
3Ö # is divisible by 3
y Å? # Check that it starts with
…ATG # string "ATG"
y Å¿à # Check that it ends with any of these three:
.•5ʒŒΩœ • # Push compressed string "taatagtga"
u # Uppercase it
3ô # Split into triplets: ["TAA","TAG","TGA"]
P # Take the product of the stack to verify all are truthy
}© # After the filter(s): store it in variable `®` (without popping)
k # Get the 0-based index of each in the full string
DU # Store a copy of these indices in variable `X`
> # +1 to make it a 1-based index
X # Push 0-based indices `X` again
® # Push the list of substrings `®` again
€g # Get the length of each substring
+ # Add it to the 0-based indices
‚ # Pair these two lists together
Ni # If the 0-based map-index is 1 (aka, it's the reversed string):
Ig # Push the input-length
α # Take its absolute difference with each value
> # Increase each value by 1
}` # After the if-statement: push both lists to the stack again
X # Push 0-based indices `X` again
3% # Modulo-3 each
Ni # If the map-index is 1 (aka, it's the reversed string)
( # Negate each
< # Then decrease each by 1
ë # Else (it's the regular string):
> # Increase each by 1 instead
} # Close the if-else statement
® # Push the list of substrings `®` again
) # Wrap all four lists on the stack into a list
ø # Zip/transpose; swapping rows/columns
.γ # Group each quartet by:
н # Its first value, the 1-based index
}€н # After the group-by: keep the first quartet of each group
} # After the map:
€` # Flatten the pair of lists one level down
í # (without changing its order)
# (after which the result is output implicitly)
See this 05AB1E tip of mine (section How to compress strings not part of the dictionary?) to understand why .•∍–• is "acgt" and .•5ʒŒΩœ • is "taatagtga".
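The filter-based search described in the explanation can be approximated in JavaScript as below. It is only a sketch under a few assumptions: `orfsBySubstrings` and `reversePositions` are hypothetical names, the start index is tracked directly instead of being looked up afterwards, and stepping the length by 3 replaces the explicit divisibility check.

```javascript
const STOP_CODONS = ["TAA", "TAG", "TGA"];

// Brute-force ORF search on one strand: test substrings, as the filter does.
function orfsBySubstrings(strand) {
  const found = new Map(); // 0-based start -> shortest qualifying substring
  for (let i = 0; i < strand.length; i++) {
    // Lengths grow, so the first hit for a given start is the shortest one.
    for (let j = i + 3; j <= strand.length && !found.has(i); j += 3) {
      const sub = strand.slice(i, j);
      if (sub.startsWith("ATG") && STOP_CODONS.some(s => sub.endsWith(s))) {
        found.set(i, sub);
      }
    }
  }
  return [...found]; // [[start, substring], ...]
}

// Map a 0-based match on the reversed strand back to 1-based input positions,
// mirroring the "absolute difference with the input length, plus 1" step.
const reversePositions = (n, index, length) =>
  [Math.abs(n - (index + 1)) + 1, Math.abs(n - (index + length)) + 1];

console.log(orfsBySubstrings("ATGATGTAATAA")); // starts 0 and 3
console.log(reversePositions(9, 0, 9));        // start 9, end 1
```

Keeping only the first hit per start index plays the role of the group-by step, which retains the shortest candidate for each starting position.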
Retina, 148 bytes
$
¶$`
O$^`\G.
T`ACGT`Ro`^.+
Lv$`(?<=(¶(.*)))?(?<=(.*)(...)*)ATG(...)*?T(AA|AG|GA)(?=((.*)¶))?
$.($1$8$#8*$&); $.($7$2$#2*$&); $#2*+$#8*-$.($3_); $&
Try it online! Link includes test cases. Explanation:
$
¶$`
Duplicate the input line.
O$^`\G.
T`ACGT`Ro`^.+
Reverse the first line and switch A with T and C with G.
Lv$`(?<=(¶(.*)))?(?<=(.*)(...)*)ATG(...)*?T(AA|AG|GA)(?=((.*)¶))?
Match all overlapping sequences, plus also capture the suffix or prefix respectively depending on whether this is the first or second line. $.1 = length of prefix + 1; $.2 = length of prefix; $#2 = 1 on the second line, 0 on the first; $.3 = frame - 1; $.7 = length of suffix + 1; $.8 = length of suffix; $#8 = 1 on the first line, 0 on the second.
$.($1$8$#8*$&); $.($7$2$#2*$&); $#2*+$#8*-$.($3_); $&
Output the positions of the first and last character, the frame and the sequence. If the match is on the first line, the start position is the length of the suffix plus the length of the match, otherwise it is one more than the length of the prefix. Similarly, if the match is on the second line, the end position is the length of the prefix plus the length of the match, otherwise it is one more than the length of the suffix.
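Retina's `v` modifier reports overlapping matches directly. In a regex flavour without it, a similar effect can be had by wrapping the pattern in a lookahead, as in this JavaScript sketch. The names `ORF` and `overlappingOrfs` are hypothetical, and the sketch only covers one strand, returning `[start, frame, sequence]` with a 1-based start.

```javascript
// Wrapping the ORF pattern in a lookahead makes every match zero-width,
// so matches that overlap or nest are all reported.
const ORF = /(?=(ATG(?:...)*?T(?:AA|AG|GA)))/g;

// The frame is derived from the length of the prefix modulo 3, plus 1.
const overlappingOrfs = strand =>
  [...strand.matchAll(ORF)].map(m => [m.index + 1, m.index % 3 + 1, m[1]]);

// Two ORFs: one starting at position 1 and a nested one at position 4.
console.log(overlappingOrfs("ATGATGTAATAA"));
```

The lazy quantifier `*?` ensures each match stops at the first in-frame stop codon, as in the answer's pattern.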