I quoted the section from the Python module here. [1] If you do not specify mult...

burntsushi · on March 20, 2024

The docs do not say what you're saying. Your phrasing is completely different, and the part where "if ^/$ are in the pattern then the haystack is treated as a single line" is completely made up. As far as I can tell, that's your rationalization for how to make sense of this behavior. But it is not a story supported by the actual regex engine docs. The actual docs say, "^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string." The docs do not say, "the string is treated as a single line when ^/$ are used in the pattern." That's your phrasing, not anyone else's. That's your story, not theirs.

I still have not seen anything from you that makes sense of the behavior that `cat$` does not match `cat\n\n`. Like, I realize you've tried to explain it. But your explanation does not make sense. That's because the behavior is strange.

The only actual way to explain the behavior of $ is what the `re` docs say: it either matches at the end of the string or just before a `\n` that appears at the end of the string. That's it.

danbruc · on March 20, 2024

You are right, it is my wording, I replaced end of string or before newline as the last character with end of line because that is what this means. You could also write that into the documentation but then you would have to also explain what end of line means. And I will grant you that I might be wrong, that the behavior is only accidentally identical to matching the end of a line but that the true reason for it is different.

cat$, the $ matches the end of the line, the second \n, cat is not directly before that. I guess you want the regex engine to first treat the input as a multi-line input, extract cat\n as the first line, and then have cat$ match successfully in that single line? What about cat$ and dog$ and cat\ndog\n.

burntsushi · on March 21, 2024

> I guess you want the regex engine

Ignoring compatibility concerns, I would want the regex engine to behave the same way RE2, Go's regexp package and Rust's regex engine behave. I remember specifically considering Cox's decision ~10 years ago when writing the initial implementation of the regex crate. I thought Perl's (and Python's) behavior on this point was whacky then and I still think it's whacky now. So I followed RE2's semantics.

The OP is right to be surprised by this. And folks will continue to be surprised by it for eternity because it's an extremely subtle corner case that doesn't have a consistent story explaining its behavior. (I know you have proffered one, but I don't find it consistent in the context of a general purpose regex engine that searches arbitrary strings and not just lines.)

Of course, compatibility is a trump card here. I've acknowledged that. Changing this behavior now would be too hard. The best you could probably do is some kind of migration, where you provide the more "sensible" behavior behind an opt-in flag. And then maybe Python 4 enables it by default. But it's a lot of churn, and while people will continue to be confounded by this so long as the behavior exists, it probably isn't a Huge & Common Deal In Practice. So it may not be worth fixing. But if you're starting from scratch? Yes, please don't implement $ this way. It should match the end of the string when 'm' is disabled and the end of any line (including end of string and possibly being Unicode aware, depending on how much you care about that) when 'm' is enabled.

IshKebab · on March 21, 2024

Dunno if you noticed who you are debating with here... :-D