Matching C Block Comments With Regular Expressions
2024-01-19
An interesting problem was presented during my Compilers Class yesterday: write a regular expression that matches all valid C-style block comments but no other strings. To avoid extra escaping, we'll consider an alternative comment syntax with /
replaced with l
and *
replaced with x
: (lx this is a comment xl
matches but lx this isn't xlxl
doesn't).
My first attempt's logic was to match a comment beginning (lx
) followed by anything except xl
any number of times and finally xl
. Inverting xl
gives the regex [^x]|x[^l]
, for the complete regex lx([^x]|x[^l])*xl
. Unfortunately, this fails to match lxxxl
(or /***/
in C syntax).
The most obvious fix for this is lx([^x]|x[^l])*x+l
, which (I believe) does correctly match all valid C comments. However, it also matches the string lxxxlxl
(C syntax /***/*/
), which is not a C comment on its own. This is ultimately due to the structure of the regex, with the (x[^l])
term consuming the character after the x
.
Doing this correctly is non-obvious, and is better thought about as a state machine / nondeterministic finite automata. Thinking about it like this, we first need to match lx
, and from there can match any character but x
while remaining in the same state. On an x
, we transition to a new state, from which x
remains in the same state, l
finishes the comment (and any following characters should be cause the string to be rejected), and anything else moves back to the previous state.
To turn this into a regex, we can start by removing the second state. The transition from first to third is now lx[^x]*x
(just the transition in, followed by zero or more transitions back to itself, and the transition out) and from third to third is x|[^xl][^x]*x
(because either the new or old can be taken). Now, this can be repeated any number of times between the two forward transitions to obtain the regex lx[^x]*x(x|[^xl][^x]*x)*l
, which correctly matches comments.
UPDATE: Reducing about the third state instead of the second gives lx([^x]|x+[^xl])*x+l
after a bit of simplification, which is far more intuitive to me, more similar to my initial guess.