The Pains of Path Parsing (fpcomplete.com)
45 points by lukastyrychtr on April 30, 2021 | hide | past | favorite | 18 comments


About 17 years ago, I had to solve the same problem, but since I used regexes, I had two problems at the end of it - memory usage and performance.

   /* RFC2396 : Appendix B
                As described in Section 4.3, the generic URI syntax is not sufficient
                to disambiguate the components of some forms of URI.  Since the
                "greedy algorithm" described in that section is identical to the
                disambiguation method used by POSIX regular expressions, it is
                natural and commonplace to use a regular expression for parsing
                ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
                12            3  4          5       6  7        8 9
                
                (Modified to support mailto: syntax as well)
        */
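For the curious, the Appendix B regex really does work as a one-shot splitter. A minimal sketch in Python (the group numbering follows the RFC's own annotation; the example URL is my own):

```python
import re

# The RFC 2396 Appendix B regex, verbatim. Groups 2, 4, 5, 7, 9 are
# scheme, authority, path, query, and fragment respectively.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def split_uri(uri):
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)

print(split_uri("http://example.com/path?q=1#frag"))
# -> ('http', 'example.com', '/path', 'q=1', 'frag')
```

Note that because every top-level subexpression is optional, this pattern matches *any* string, including garbage; it splits, it doesn't validate.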

But at least it wasn't the problem I started with.


Regular Expressions: Now You Have Two Problems

https://blog.codinghorror.com/regular-expressions-now-you-ha...


That is a nightmare of a regex, IETF be damned. Splitting regexes up into smaller chunks helps readability, support, memory usage -and- performance. I suspect that example was provided as a definition and not actually intended to be implemented, although there are better forms for representing construction, like BNF.


How can you split this up into chunks? I would love "composable regex" but I'm not aware of such a thing in the wild.


TL;DR: Naive string splitting on reserved/fragment characters followed by expected string comparisons in a search tree for known tokens (e.g., http, https, ftp, rtsp, mail).

It is a tradeoff between what you are asking the regex engine to do and what you can code in a small filter. A regex engine is a general-purpose tokenizer, and while regex engines have been optimized over the decades since they were proposed, one will never be as fast as a parser that uses knowledge of the grammar. You can tokenize and parse simultaneously, which IMHO is a more future-proof strategy from a support standpoint.
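The split-then-check approach in the TL;DR can be sketched in a few lines. This is a hedged illustration, not a spec-complete parser; the scheme list and function names are mine:

```python
# Illustrative scheme set; a real implementation might use a search tree.
KNOWN_SCHEMES = {"http", "https", "ftp", "rtsp", "mailto"}

def split_url(url):
    # Strip fragment and query first: each can appear at most once,
    # and everything after '#' is opaque to the rest of the parse.
    url, _, fragment = url.partition('#')
    url, _, query = url.partition('?')
    scheme, sep, rest = url.partition(':')
    if not sep or scheme.lower() not in KNOWN_SCHEMES:
        scheme, rest = '', url   # no recognized scheme: treat it all as path
    authority = ''
    if rest.startswith('//'):
        authority, _, rest = rest[2:].partition('/')
        rest = '/' + rest if rest else ''
    return scheme, authority, rest, query, fragment

print(split_url("http://example.com/a/b?x=1#top"))
# -> ('http', 'example.com', '/a/b', 'x=1', 'top')
```

No backtracking, no optional subexpressions: each character is examined once, and unknown schemes fall through to a cheap default.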

I suspect the person who wrote Appendix B was favoring compactness: https://tools.ietf.org/html/rfc2396#appendix-B

I've encountered regexes this nasty in the wild, and while they may be compact, try going back and debugging this in 6 months, a year, two years, etc. It's even harder if you didn't write it. There could be a page of comments explaining it (like the entire RFC). Or the programmer could have been a bit more "naive", which in many cases pays off because it forces more verbosity, which leaves clues if the comments suck.

Putting aside the mental gymnastics needed to identify whether these subexpressions collide, in terms of performance you can immediately see that almost the entire regex is optional: every level-1 subexpression is optional except /([^?#]*)/, including the final match-all .*. Regexes aren't evil, but huge ones like this, with multiple rewinds, compound subexpressions, and optional subexpressions, can severely degrade performance.

Yes, I'm ranting, because I've had to fix the consequences of opaque code like this in the wild. :)


Ah, the answer is to just use less regex :)

I do sometimes build regexes by concatenating strings, or using the "verbose" mode in Python which lets you add free whitespace and comments.
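For instance, the same Appendix B pattern becomes much more readable in Python's verbose mode, where whitespace is ignored and comments are allowed inside the pattern (the group names here are my own, not the RFC's):

```python
import re

# RFC 2396 Appendix B, rewritten with re.VERBOSE and named groups.
# Inside character classes, '#' and whitespace stay literal even in
# verbose mode, so only the standalone '#' needs escaping.
URI_RE = re.compile(r"""
    ^
    (?:(?P<scheme>[^:/?#]+):)?      # optional scheme, e.g. "https"
    (?://(?P<authority>[^/?#]*))?   # optional //authority
    (?P<path>[^?#]*)                # path (possibly empty)
    (?:\?(?P<query>[^#]*))?        # optional ?query
    (?:\#(?P<fragment>.*))?        # optional #fragment
""", re.VERBOSE)

m = URI_RE.match("https://example.com/a?b=c#d")
print(m.group('scheme'), m.group('path'), m.group('fragment'))
# -> https /a d
```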

Maybe my mind has just been poisoned by too much regex, but I didn't think the example was that bad...


Once you accept that urls don't even specify the character encoding, you realize it's impossible in general.


Aren't urls quite specifically an ASCII subset? As soon as you stray outside that range any sane program is going to url-encode/idna-encode depending on where in the url the character is.


They are _encoded_ as ASCII. But they very explicitly support arbitrary octets. The interpretation of those octets, and the timing of decode is where the fun starts.

Quick quiz: How many hierarchy elements are present? http://example.com/one%2Ftwo/three

If your Apache config has a proxy for `/one`, does that url match?

What is the value of the utf8 param? http://example.com/frosty?utf8=%26%C7

application/x-www-form-urlencoded historically didn't specify encoding either. So POST doesn't save you.
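The %2F quiz above is easy to demonstrate: the answer depends entirely on whether you percent-decode before or after splitting on '/'. A quick sketch with Python's urllib.parse (which, notably, does not decode the path for you):

```python
from urllib.parse import urlsplit, unquote

url = "http://example.com/one%2Ftwo/three"

raw_path = urlsplit(url).path           # '/one%2Ftwo/three' -- still encoded
print(raw_path.split('/'))              # ['', 'one%2Ftwo', 'three']  -> 2 segments
print(unquote(raw_path).split('/'))     # ['', 'one', 'two', 'three'] -> 3 segments
```

Split first and you get two hierarchy elements, one of which contains a literal slash; decode first and you get three. Proxies and application code that disagree on the order are exactly where the Apache question bites.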

It is bad, and just gets worse. HTML/HTTP are just a bit too old to benefit from the genius of Ken Thompson. They started absolutely ignorant of character encoding. Berners-Lee was a physicist who just wanted to share some cat memes.

https://tools.ietf.org/html/rfc1738#section-2.2 https://blogs.warwick.ac.uk/kieranshaw/entry/utf-8_internati...


non-ASCII URLs mainly serve to make it easier for scammers to steal from senior citizens.


Lol, no. Germany is a real place. So is China. Bad luck for us that urls are just a bit too old to be assumed utf8.


Plaintext files do not either.


And handling plaintext files on the internet reliably is similarly broken.


Great article; in particular I hadn't thought about empty path components too closely and how websites usually just omit them when you try to go to e.g. https://github.com/nodejs//node


> Whether to include trailing slashes in URLs has been an old argument on the internet. Personally, because I consider the parsing-into-segments concept to be central to path parsing, I prefer excluding the trailing slash. And in fact, Yesod's default (and, at least for now, routetype-rs's default) is to treat such a URL as non-canonical and redirect away from it. I felt even more strongly about that when I realized lots of frameworks have special handling for "final segments with filename extensions." For example, /blog/bananas/ is good with a trailing slash, but /images/bananas.png should not have a trailing slash.

So, this is not an argument in which people can really have an opinion: the URLs have fundamentally different semantics and behaviors with respect to relative paths. If you are talking about "the subresource banana under the resource blog" then you must use /blog/bananas, and the trailing slash is incorrect; in such a case, if you were to have a relative link to "apples" it would bring you to /blog/apples. In contrast, if you are wanting some kind of "default" resource--say, the moral equivalent of an "index.html" as is implemented in many web servers (but has nothing to do with the actual information model of the web)--as a subresource of the subresource bananas, then you must use /blog/bananas/; in such a case, if you were to have relative link to apples, it would bring you to /blog/bananas/apples and to link to /blog/apples you'd have to use ../apples.
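The relative-resolution semantics described above are observable directly, since they are standardized (RFC 3986 section 5) and implemented client-side. A quick check with urllib.parse, using the example paths from this comment:

```python
from urllib.parse import urljoin

# Without a trailing slash, "bananas" is a leaf; siblings resolve beside it.
print(urljoin("http://example.com/blog/bananas", "apples"))
# -> http://example.com/blog/apples

# With a trailing slash, "bananas/" is a container; children resolve inside it.
print(urljoin("http://example.com/blog/bananas/", "apples"))
# -> http://example.com/blog/bananas/apples

# From inside the container, reaching a sibling of it requires "..".
print(urljoin("http://example.com/blog/bananas/", "../apples"))
# -> http://example.com/blog/apples
```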

FWIW, I absolutely consider "for the narrow question of blog posts, is a blog post a file or a folder?" to be an interesting argument for which one might have a different opinion than someone else for a reasonable reason, but generalizing it to the concept of URL routing itself is wrong: if you believe a blog post is semantically a folder--which is very reasonable, as a blog post might "contain" a number of media attachments--and the post itself is part of that folder, it would simply be wrong to elide the trailing slash, and web frameworks or content management systems that return the representation of a folder from "inside" that folder without a trailing slash deserve a special circle of www hell :/. My hope is that this author is actually just expressing an opinion on the semantics of a blog post, not some general notion about URLs, but it is certainly written as the latter and it seems like the software they work on is general purpose.

As an example, I personally find the usage on GitHub to not just be "incorrect" but "flagrantly ridiculous": it has decided to make no opinion of whether a trailing slash has any semantic meaning or not, and so relative paths essentially make no sense in the context of their website. Is the landing page of my repository "inside" the folder of my repository, or is my repository itself a resource of sorts that happens to also contains subresources? In the former case, the landing page of other repositories in my organization are siblings of my repository's landing page, and the other information about my repository is a subresource of said landing page; while, in the latter case, the landing page of other repositories in my organization is the aunt/uncle of my repository's landing page, and other information about my repository is a sibling of my repository's landing page. Only one of these is supposed to be true!


Trailing slash behavior depends on the web server and how it is configured. There is no one way to interpret them.


While correct, that is both begging the question and ignoring the evidence: the behavior of relative URLs is both standardized and controlled by the client; HTTP servers have no concept of such, and the behavior is not even HTTP-specific.


Regarding blog posts, resources, and trailing slashes: we have base tags.



