Here's an interesting one for people, I hope.
I'm using a regexp to match (and substitute out) quote-text-quote from the start of a string (in Perl), that in trivial use is:
$text=~s/^"(.*?)"//
(I actually use s||| instead of s///, here, for reasons related to nearby bits of code, but it's the same.)
i.e. from the start of the string, match a quote, a group of as few characters as possible, then another quote. Replace with nothing. The group found between the quotes is used elsewhere after the (valid) search and replace.
It needs to be an ungreedy *? because there are (usually) later quotes to find and a standard * would gobble up far too much. The excess could be further matched/replaced to stitch it back into the original, but that'd be wasteful.
Some of the quoted values are something like:
"Quoted text with \"escaped quotes\" within"
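To make the problem concrete, here is a small sketch of what the trivial pattern does with such a value. The sample string (including the trailing " and the rest") and the variable names are made up for illustration; only the substitution itself is the one quoted above.

```perl
use strict;
use warnings;

# Hypothetical sample: one quoted field with escaped quotes, then more data.
# Single-quoted Perl string, so each \" below is a literal backslash + quote.
my $text = '"Quoted text with \"escaped quotes\" within" and the rest';

my $captured;
if ($text =~ s/^"(.*?)"//) {
    $captured = $1;    # whatever sat between the opening quote and the first later quote
}

# The ungreedy group stops at the first escaped quote, backslash included.
print "captured: [$captured]\n";
print "left:     [$text]\n";
```

The captured group ends at the backslash of the first \" and the rest of the field is left behind in $text, which is exactly the case the plain *? cannot handle.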
Right now, I have no use for the escaped quotes, so I just pretreat $text to s/\\"//g, i.e. nuke them all before I start, but maybe I want them in the future. Or whoever else might pick up my data in the future.
As I just realised, by inserting [^\\] after the (grouping) and before the proper close-quote I perpetually strip the last character off, and it cannot find "".
If I put it at the end of the group (or give it a (grouping2) to reconcatenate it back onto the end of where (grouping) is stored), it would save the character but still would not validly find "" whenever that pops up.
So, maybe a negative look-behind, then. (?<!\\) inserted before the true quote matcher. Except now I'm worrying how to deal with the (hypothetical) situation of an escaped-\ as the true last character in the true quote. Or indeed any number of escaped-\s (an even number of raw \s), which would mean that only an odd number of \s before a quote should disqualify it from ending the pattern, and {m,n} isn't that versatile. It looks like I'd need to zero-width-pre-not-match (\\\\)*\\ or similar, i.e. \\ times 0..n plus another \. Which (if it works as I think it should, haven't tested yet) would be slightly unwieldy to put into the (?<!...)" bit and hardly good for future readability.
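A rough sketch of the single-backslash look-behind, to show where it holds up and where it falls over. The helper name take_quoted_lookbehind and the three sample strings are invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: capture a leading quoted field, refusing to close
# on any quote that is directly preceded by a backslash.
sub take_quoted_lookbehind {
    my ($text) = @_;
    return $text =~ /^"(.*?)(?<!\\)"/ ? $1 : undef;
}

# Escaped quotes inside the field are skipped as intended.
my $ok = take_quoted_lookbehind('"say \"hi\" now" rest');

# The empty field "" is still found, unlike with the [^\\] approach.
my $empty = take_quoted_lookbehind('"" rest');

# The field really ends with an escaped backslash (two raw backslashes),
# so its closing quote is wrongly rejected and the match runs on into the
# next field.
my $bad = take_quoted_lookbehind('"ends in a backslash \\\\" then "more"');

print "ok:    [$ok]\n";
print "empty: [$empty]\n";
print "bad:   [$bad]\n";
```

So the simple look-behind copes with "" and with escaped quotes, but it overruns as soon as a field ends in an escaped backslash. Two caveats on the fancier version, as far as I know: Perl does not accept an unbounded-length pattern such as (\\\\)*\\ inside (?<!...), and even if it did, zero repetitions of the pair would reduce it to a plain \ again, so on its own it would not count odd versus even.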
Anyone have any better ideas?
I'm almost tempted to slice out each character in turn (or, slightly quicker, in ungreedy blocks ending positively with [\\"]). When a block matches m/\\$/, I then slice out the next character on a free pass (whether \, " or whatever, doesn't matter); only if it matches m/"$/ do I consider it finished (or unmatchable, which might mean restoring all the sliced-off partial matches just to show what didn't match the original intention).
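For what it's worth, that slicing idea can be sketched with \G and the /gc flags, so nothing has to be restored when the field turns out to be unmatchable. This is only a sketch under my own assumptions: the helper name take_quoted_scan and the sample strings are hypothetical, and it works on a copy of the text rather than substituting in place.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (field, remainder), or an empty list when the
# text does not start with a properly closed quoted field.
sub take_quoted_scan {
    my ($text) = @_;
    return unless $text =~ /\G"/gc;            # must open with a quote
    my $field = '';
    while (1) {
        if ($text =~ /\G([^\\"]+)/gc) {        # block of ordinary characters
            $field .= $1;
        }
        elsif ($text =~ /\G(\\.)/gcs) {        # backslash plus one free-pass character
            $field .= $1;
        }
        elsif ($text =~ /\G"/gc) {             # unescaped quote: finished
            return ($field, substr($text, pos($text)));
        }
        else {
            return;                            # input ran out: unmatchable
        }
    }
}

my ($f1, $r1) = take_quoted_scan('"ends in a backslash \\\\" then "more"');
my ($f2, $r2) = take_quoted_scan('"" rest');
my @none      = take_quoted_scan('"unterminated \"');

# The same logic folded into one substitution (assuming it suits the real data).
my $text = '"a \"b\" c\\\\" tail';
my $got  = $text =~ s/^"((?:[^\\"]|\\.)*)"//s ? $1 : undef;
```

Because every backslash consumes the character after it, odd and even runs of backslashes sort themselves out without any counting, and the one-line substitution at the end is the same loop written as an alternation: either a character that is neither \ nor ", or a \ followed by anything.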
There's also switching to greedy-grouping and coming back in from the /end$/ so long as that matches in the exact right way, but that sounds even more awkward (and prone to enbuggability).
But I'm sure there's a regexp (or, ideally, perlre) guru out there who has a perfectly wonderful alternative in mind.
(Until then, I'm prematching for \ escapes of all kinds I expect to encounter, including \u unicode, to nuke or replace with the best appropriate de-escaped and/or raw replacement version according to a separate lookup table, until it doesn't m/\\/ at all, then using the original matcher in perfect safety!)