The first should be allowed, but I can see why the latter are excluded. If the purpose of this validator is to run against user-submitted URLs, then you wouldn't want to let users specify URLs they have no authority over. Imagine that the program takes the user's input, fetches the item, and then stores it somewhere accessible to them. If the program allowed URLs for internal subnets, a malicious user could attempt to acquire internal assets!
I'd argue that it should fail, for any hostname/address validation other than an academic exercise.
It's valid, but the only people who ever use dword IP addresses are pentesters and their less scrupulous analogues. I have no issue disappointing them.
There are valid dotless (and dotted) octal and hex representations too. They should also fail, I think.
It is in octal actually, and it isn't overly portable.
It should also start with a 0 to be "correct," where "correct" means the libc inet_aton() function will accept it and convert it to an IP address. The octal form is technically nonstandard for URIs and is more an artifact of the libc the software you're using is sitting on.
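For the curious, all of these exotic forms are just the 32-bit value of the address written in a different base. A minimal Python sketch (function name hypothetical, using the stdlib `ipaddress` module rather than libc's inet_aton) of how dword, octal, and hex forms collapse to the same dotted quad:

```python
import ipaddress

def normalize_dword(s: str) -> str:
    # A dotless "dword" like 2130706433 is just the 32-bit integer
    # value of the address; octal (017700000001) and hex (0x7f000001)
    # are the same number written in other bases.
    if s.lower().startswith("0x"):
        n = int(s, 16)
    elif s.startswith("0") and len(s) > 1:
        n = int(s, 8)
    else:
        n = int(s, 10)
    return str(ipaddress.IPv4Address(n))

print(normalize_dword("2130706433"))    # dword decimal -> 127.0.0.1
print(normalize_dword("0x7f000001"))    # hex           -> 127.0.0.1
print(normalize_dword("017700000001"))  # octal         -> 127.0.0.1
```

All three spellings reach localhost in a glibc-backed client, which is exactly why validators that only block the string "127.0.0.1" are bypassable.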
A huge pet peeve of mine are sites that reject the new TLDs in emails or URLs. I use my hotfresh.pizza email address pretty often and you'd be surprised how many sites reject that with "Please enter a valid email address." Infuriating.
Same issue with our hi.fi. Facebook's JavaScript SDK wouldn't write cookies to it, instead rejecting it as too short and invalid. Had to open a ticket and the oversight was corrected, with lots of embarrassment and apologizing.
To be fair, an old RFC referenced by an RFC that is referenced by the current thing specifies that a segment of a domain name can't start with a number. Obviously, that isn't true in the real world.
So it would conflict with Intranet sites? Bad idea in my opinion. I already use lots of one-word Intranet sites at work. If http://pizza were public, you'd have to have some way to distinguish between the local site and the external one. That also seems like something you could then spoof.
Nice to have this comparison in one place, but I think it also serves as a good illustration of the limits of regex usefulness. It feels like this would be better implemented in the language of your choice than as an impenetrable string of characters.
That can get ugly too. I have a function that takes in a (possibly malformed) URL and returns a key-value array of its parts ... seems straightforward, but it's about 300 lines, and this site includes some tests that I haven't bothered to concoct yet, so mine probably isn't even complete.
URIs are one of those things that make you go, "Oh! That should be easy!", and then a week later you're walking around looking for puppies to kick.
This is partially because every single regex in the example is terribly written. Regexes _can_ look good and understandable with the x modifier and lots of spaces and comments.
You wouldn't write a normal function in one line with no comments, why do it with regex?
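In Python, for instance, the x modifier is `re.VERBOSE`, which lets a pattern be laid out and commented like ordinary code. A toy scheme/host/port matcher (deliberately simplified, nowhere near RFC-grade) to show the style:

```python
import re

# Deliberately simplified: just scheme://host[:port], with comments.
URL_RE = re.compile(r"""
    ^
    (?P<scheme> https? )          # scheme: http or https only
    ://
    (?P<host>   [A-Za-z0-9.-]+ )  # hostname, crudely approximated
    (?: : (?P<port> \d{1,5} ) )?  # optional numeric port
    /?                            # optional trailing slash
    $
""", re.VERBOSE)

m = URL_RE.match("https://example.com:8080")
if m:
    print(m.group("scheme"), m.group("host"), m.group("port"))
```

Whitespace and `#` comments are ignored in this mode (except inside character classes), so the pattern can be reviewed and diffed like any other function.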
In all seriousness, this is why I don't like the IETF's documents. They write in a verbose way and then don't even provide a reference implementation, in this case, a reference regex that would have done away with lots of dispute and ambiguity. It is my opinion that, in practice, your specification is inherently broken if you can't provide a reference implementation.
(No, BNF doesn't count as reference "implementation". Who uses BNF in their programs to validate strings anyway?)
So, because of politics, a URL is a context-full grammar.
And regexps cannot parse anything other than context-free stuff.
To be honest, I doubt anything useful is context-free.
I doubt, therefore, that anything useful can be parsed with a regexp... except floats, integers, and other basic types that are useful for building a context-full grammar.
This RFC should therefore separate the standard into context-free stuff (the rules) and a config file for the political/commercial stuff (context that changes the meaning of the atoms being parsed, for illogical reasons).
The problem is that politics is fucking hard to normalize; we have no BSML yet.
A minor clarification for those who might be learning from HN: regular expressions aren't equivalent to context-free grammars, they are even more limited than context-free grammars. Things which involve arbitrary-depth nesting or recursion, like HTML, can't be parsed by regular expressions, but that is exactly what context-free grammars are for. Where regular expressions correspond to finite state automata (essentially directed graphs where nodes are states and edges are input-driven transitions between states), context-free grammars correspond to pushdown automata (with a stack, this is the recursion secret sauce). Lots of interesting things are context-free. For example, any good programming language grammar should be context-free, to allow sane and efficient parsing. I welcome further corrections.
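A toy illustration of the stack's role: matching arbitrarily nested parentheses is the classic thing a (true) regular expression cannot do, while a counter, which is just a degenerate stack, handles it in a few lines:

```python
# A regular expression has no way to count arbitrary nesting depth,
# but the "pushdown" in pushdown automaton (here reduced to a depth
# counter, since there is only one bracket type) handles it trivially.
def balanced(s: str) -> bool:
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:      # closed more than we opened
                return False
    return depth == 0          # everything opened was closed

print(balanced("((a)(b))"))  # True
print(balanced("((a)"))      # False
```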
In other words regular expressions are equivalent to regular grammars [1]. (Except "modern regular expressions" support some construct that make them match some non-regular grammars. [2])
Please. Please Please Please. With Sugar on Top. Don't do this with a regex.
I've written a lot of web-facing software that accepts URLs from the untrusted masses and ultimately makes requests to them if they are "valid." The lesson I've learned is simple: regexes are terrible for this task, because there are a ton of things to check and lots of normalization to do. Instead, do this as a function:
I've evolved mine over the years, and my use case is semi-specific: Given a string, validate it as a fully qualified HTTP/HTTPS URL that doesn't have credentials and isn't trying to point my software toward the internal network or localhost. It looks like this:
- Use system/framework library to create Uri object from source string. All your checks will be consulting this object's properties, not looking at the source string
- Is scheme HTTP/HTTPS? If no, stop
- Did they supply user:pass@ in URL? If so, stop, and yell at them for putting usernames and passwords into a random site on the Internet.
- If hostname is an IP address, normalize it to dotted decimal quad IPv4 or IPv6 (no octal obfuscation for you!), and test against private IP space ranges or loopback. If private or loopback, stop
- If hostname is an actual hostname, normalize it + de-punycode it, and check for localhost aliases. If local, stop (you can also do a DNS lookup and make sure someone isn't trying to return private/local IPs to bypass your checks)
At this point, you have a syntactically valid, fully qualified URL pointing to a public facing web property accessed via HTTP or HTTPS.
You don't have to worry about TLDs or the like. At this point, you can do additional DNS checks, check the domain against lists of bad actors, whatever else you want to do. You can try to be smart and do things like "if the supplied URL wasn't fully qualified, prepend http:// and try validation again" to avoid user error. Pretty flexible.
This is more rigorous than a simple regex and way way way easier for another developer to read and understand what is going on.
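A minimal Python sketch of that checklist (function name hypothetical; a real version also needs the punycode handling and DNS re-resolution checks described above):

```python
import ipaddress
from typing import Optional
from urllib.parse import urlsplit

def validate_public_http_url(raw: str) -> Optional[str]:
    """Return the URL if it passes the checks above, else None."""
    parts = urlsplit(raw)  # consult the parsed object, not the raw string
    # 1. Scheme must be http or https.
    if parts.scheme not in ("http", "https"):
        return None
    # 2. No embedded user:pass@ credentials.
    if parts.username or parts.password:
        return None
    host = parts.hostname
    if not host:
        return None
    # 3. IP-literal hosts: reject private or loopback space.
    try:
        addr = ipaddress.ip_address(host)
        if addr.is_private or addr.is_loopback:
            return None
    except ValueError:
        # 4. Real hostnames: reject obvious localhost aliases.
        if host.lower() in ("localhost", "localhost.localdomain"):
            return None
    return parts.geturl()

print(validate_public_http_url("https://example.com/x"))   # passes
print(validate_public_http_url("http://127.0.0.1/admin"))  # rejected
```

Each rejected case is an early return, so the function reads top to bottom exactly like the checklist.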
> - Did they supply user:pass@ in URL? If so, stop, and yell at them for putting usernames and passwords into a random site on the Internet.
FWIW, browsers don't send the user:pass in the URL - they automatically marshal it into the Authorization: Basic header. Obviously sending these over HTTP is still dumb, but they're not (typically) logged in plain-text in browser history/server request logs.
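Concretely, the header value is just the base64 of "user:pass" (using the qa:tester credentials mentioned in this thread as the example):

```python
import base64

# What a browser sends instead of leaving user:pass in the request URL:
token = base64.b64encode(b"qa:tester").decode("ascii")
print("Authorization: Basic " + token)
```

Which is also why Basic auth over plain HTTP is only obfuscation, not protection: the encoding is trivially reversible.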
Ahh. Very true. Let me be a little more clear. I work on public web apps where there is literally an HTML text input for someone to submit a URL that we should audit. I'm not talking about people typing URLs directly into a browser.
At least once a month I get someone giving me a URL with embedded credentials to a dev or staging environment for an Alexa top 50K site. Things like https://qa:tester@dev.major-ecomm-site.com/blah. It's pretty terrible, but makes for an interesting sales call :-)
Chrome/Webkit browsers will go to `bar.com` but Firefox will end up at `example.com` for the same URL! There's a fantastic book called "The Tangled Web" that has lots of examples of these pitfalls.
Agreed. It's "funny" how common 1000+ character uncommented one-liner regular expressions are, while in any other programming language even something like `if (flag) return;` would get rejected as dangerous, unreadable, uncommented, a brace-style violation, and possibly even too long, if indentation and the length of the flag name push it past 80 characters.
Is this because people think of a regex as an atomic blackbox instead of a program or function that can be read and modified? For example, when a regex is incorrect, they don't say "This regex is incorrect. How can it be corrected?" like they would for a program or function, but "This is the incorrect regex. What is the correct regex?"
Free-spacing mode (where you can add whitespace, newlines and comments to a regex) would at least help a little. (Although even then I think a function with early returns is often more appropriate. Possibly using very small regular expressions for some of the individual checks.)
But I rarely even see free-spacing modes mentioned. Maybe because most programmers using regex actually prefer an atomic blackbox?
Most programmers I've met are not in fact regex "literate". They may be able to write one, but understanding what one does, or even modifying it? Well... So yeah, black box.
I'm extremely impressed you thought of checking for private IP addresses - everyone seems to forget that. But as you've described it, there's a time-of-check/time-of-use vulnerability: an attacker could set a really low TTL on the A record and swap it out with a private address between your check and the time you actually hit the URL. You really have to hook into the HTTP client for that check (in Perl, I recommend LWPx::ParanoidAgent; if using libcurl, you can use a CURLOPT_OPENSOCKETFUNCTION callback).
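Whatever the client, the shape of the fix is the same: resolve once, validate that result, and force the connection to that exact address. A Python sketch of the check half (helper name hypothetical; `gethostbyname` is IPv4-only, kept here for brevity):

```python
import ipaddress
import socket

def resolve_and_check(host: str) -> str:
    """Resolve host exactly once and refuse non-public results.

    The returned IP (not the hostname) must then be what the HTTP
    client actually connects to, e.g. via a socket-open callback,
    or the low-TTL record swap still works."""
    ip = socket.gethostbyname(host)  # single resolution, reused below
    addr = ipaddress.ip_address(ip)
    if addr.is_private or addr.is_loopback or addr.is_link_local:
        raise ValueError(f"{host} resolves to non-public space ({ip})")
    return ip
```

The key point is that the address handed to `connect()` is the one that was checked, closing the gap between time-of-check and time-of-use.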
> This requirement is a willful violation of RFC 5322, which defines a syntax for e-mail addresses that is simultaneously too strict (before the "@" character), too vague (after the "@" character), and too lax (allowing comments, whitespace characters, and quoted strings in manners unfamiliar to most users) to be of practical use here.
A lot of these regexes miss the whole point of validation. The key problem that needs to be solved here is to find if a URL is syntactically correct. A lot of them focus on the semantics as well. Thus hardcoding the TLDs is not needed at all, so http://microsoft.com and http://microsoft.foobar should both be valid (I could have the latter in my /etc/hosts). Also, that's what the DNS is used for.
That's exactly what I was thinking when I looked at the tests, and I didn't see anything in that regard. If they really want the real deal, they'd have to support that too.
However, as others have already pointed out, it's better not to use a regex for this but a proper library for your language, which will bail out as soon as it hits something invalid. With the hope, of course, that the library for your language already supports that.
Oops. I disagree that it should be rejecting RFC 1918 addresses anyway, because this makes it less useful in an intranet context, where you want those to work.
There's also an apples-to-oranges comparison going on here. The Gruber pattern is not for validation, but for detecting url-like-things in text, which is why it excludes a whole bunch of punctuation chars from appearing at the end - when I say 'google.com.' in text, I mean 'google.com'.
Edited to add: I misremembered suggesting trailing punctuation exclusion to Gruber, what we discussed was xxx.xxx/xxx as an alternate pattern, catching protocol-less shortened urls in tweets.
In many cases you don't want "http://foo.bar?q=Spaces should be encoded" to pass. If, for example, you want to turn URLs in comments into links, then a space should just end the URL right there. Otherwise you end up turning whole paragraphs into links.
While I admire the effort that people put into this (and the similar effort of email address validation), what I really want to see is a comparison based on "good enough" validation vs. performance. I think a reasonably low rate of false positives is a reasonable trade-off for fast validation.
I just don't understand why so many people like to reinvent the wheel all the time. In PHP, there's the filter_var() family of functions with plenty of filter types, like FILTER_VALIDATE_URL, or you could use parse_url() if you need to add further constraints to validated URLs, like forbidding localhost, etc. IMHO complex regular expressions should be avoided, as they make debugging a PITA and are usually a performance bottleneck, too.
Terrific. Sr. Perini's regex has served me well. That being said, if you solve a problem with a list of thirteen regexes, you then have fourteen problems.
http://./ is a perfectly valid and very short URL, even though it is highly unlikely anyone (InterNIC? ICANN? Verisign?) would issue an address record for @. On a related note, has anyone seen ccTLDs do something like this (e.g., http://io./, https://co.uk./foo)?