In search of the perfect URL validation regex (mathiasbynens.be)
97 points by edward on Aug 6, 2015 | hide | past | favorite | 77 comments


Why are these in the `should fail' section?

    http://www.foo.bar./
    http://10.1.1.1
    http://10.1.1.254
    http://10.1.1.0
    http://10.1.1.255
The top one is just a fully qualified domain name, and the rest could be valid host addresses depending on your subnet size.


The first should be allowed, but I can see why the latter are excluded. If the purpose of this validator is to run against user-submitted URLs, then you wouldn't want to allow users to specify ones over which they don't have authority. Imagine that the program takes the user's input, fetches the item, and then stores it somewhere accessible to them. If the program allowed URLs for internal subnets, a malicious user could attempt to acquire internal assets!


You can point DNS at internal hosts too!


Good point, obviously the better way to mitigate is via proper firewalls so that the application servers don't access arbitrary internal hosts.


> http://3628126748

This shouldn't fail. It resolves to an IP owned by The Coca Cola Corp (not an internal IP)


I'd argue that it should fail, for any hostname/address validation other than an academic exercise.

It's valid, but the only people who ever use dword IP addresses are pentesters and their less scrupulous analogues. I have no issue disappointing them.

There are valid dotless (and dotted) octal and hex representations too. They should also fail, I think.


It's valid, therefore it should be marked as such by the regex.


Not so. From TFA:

    I also don’t want to allow every possible technically
    valid URL — quite the opposite.
And dword and octal IP URLs are definitely in the realm of technically valid but practically not.


I'm confused by how that works. I've never seen IP addresses expressed in decimal form before (which is what I'm assuming that is).


An IP address is just a number. Here's a great post on it: http://superuser.com/a/486936


It is in octal actually, and it isn't overly portable.

It should also start with a 0 to be "correct". In this case, "correct" means the libc inet_aton() function will accept it and convert it to an IP address. The octal form is technically nonstandard for URIs and is more an artifact of the libc the software you're using is sitting on.


That looks like decimal to me. I didn't convert it to test, but I see a couple 8s. :)

Regardless, there are valid dotless decimal, octal, and hex representations for every IP address. Library support is variable.
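
As a small illustration of the point above, Python's standard `ipaddress` module maps between the dotted-quad and raw 32-bit integer ("dword") forms of an IPv4 address (the octal/hex string variants are a separate, libc-dependent matter):

```python
# Sketch: converting between dword and dotted-quad IPv4 forms with the
# stdlib. The dword value is the one discussed in this thread.
import ipaddress

print(ipaddress.ip_address(3628126748))       # its dotted-quad form
print(int(ipaddress.ip_address("10.1.1.1")))  # and a quad back to an int
```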


Regarding the IPs, the article's use case is listed as a 'public URL shortener', and 10.0.0.0/8 is not in the public address range by definition.


Sure, but it seems like you should in those cases just parse the URL and then reject black-listed host names.


A huge pet peeve of mine is sites that reject the new TLDs in emails or URLs. I use my hotfresh.pizza email address pretty often, and you'd be surprised how many sites reject it with "Please enter a valid email address." Infuriating.


You'd be surprised how many sites reject my domain, 3e.org. Sometimes it's because it begins with a number, sometimes it's because it's too short.


I've had Gogo's inflight wifi not let me sign up because my email account at gmail started with a number.


Same issue with our hi.fi. Facebook's JavaScript SDK wouldn't write cookies to it, instead rejecting it as too short and invalid. Had to open a ticket and the oversight was corrected, with lots of embarrassment and apologizing.


To be fair, an old RFC referenced by an RFC that is referenced by the current thing specifies that a segment of a domain name can't start with a number. Obviously, that isn't true in the real world.


So, TLDs are selling (technically) invalid domains?


And every major implementation supports them.

Well, I trudged through the RFCs in ~2010, so it's possible that it's been updated since then.


There's also a proposal in front of ICANN to allow dotless domains, eg http://pizza


So to conflict with Intranet sites? Bad idea in my opinion. I already use lots of one word Intranet sites at work. If http://pizza were public you'd have to have some way to distinguish between the local site and the external. That also seems like something you could then spoof.


There's no money in intranet addresses.


Ha, and HN is on board, apparently!


This is why I haven't been able to update my Yahoo account in several years. I keep sending them bugs about it, but no one has fixed it yet.


Sometimes this isn't the website; it's actually the third-party API they're using for users/signups/mailings/whatever :)


Little off topic but what inspired that domain?


Simply my love for pizza, especially the hot and fresh variety. When I discovered the .pizza TLD, my eyes lit up.


Nice to have this comparison in one place, but I think it also serves as a good illustration of the limits of regex usefulness. It feels like this would be better implemented in the language of your choice instead of as an impenetrable string of characters.


That can get ugly too. I have a function that takes in a (possibly malformed) URL and returns a key-value array of its parts ... seems straightforward, but it's about 300 lines, and this site includes some tests that I haven't bothered to concoct yet, so mine probably isn't even complete.

URIs are one of those things that make you go, "Oh! That should be easy!", and then a week later you're walking around looking for puppies to kick.


This is partially because every single regex in the example is terribly written. Regexes _can_ look good and understandable with the x modifier and lots of spaces and comments.

You wouldn't write a normal function in one line with no comments; why do it with a regex?


Obligatory "Conway's GoL in one line of APL" http://catpad.net/michael/apl/


There are some other options for regex syntax:

https://github.com/VerbalExpressions/JSVerbalExpressions


Why should URLs like http://3628126748 fail?

For example, http://1249711460 resolves just fine in Chrome.


Also appears to resolve in curl.


What is this witchcraft?


It's a 32-bit IP address in decimal (sorry, base ten) form, IIRC.


I figured it out too: it's the base-256 IPv4 quad converted to a single number.

74.125.21.100 => 100x1 + 21x256 + 125x256^2 + 74x256^3 => 1249711460
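
The arithmetic above can be sketched in a few lines of Python: the four octets are base-256 digits, so the dword form is just positional notation.

```python
# Sketch: convert a dotted-quad IPv4 address to its dword (single
# integer) form by treating the octets as base-256 digits.
def dotted_to_dword(dotted):
    octets = [int(o) for o in dotted.split(".")]
    return sum(o * 256 ** i for i, o in enumerate(reversed(octets)))

print(dotted_to_dword("74.125.21.100"))  # → 1249711460, matching the comment
```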


Yep, I first learned this trick in my teens when a dude was using 3564020356 as an address.

Certainly a witch nest. :]


Unsigned integer form! /ducks


And now you have 404 problems.

In all seriousness, this is why I don't like the IETF's documents. They write in a verbose way and then don't even provide a reference implementation, in this case, a reference regex that would have done away with lots of dispute and ambiguity. It is my opinion that, in practice, your specification is inherently broken if you can't provide a reference implementation.

(No, BNF doesn't count as reference "implementation". Who uses BNF in their programs to validate strings anyway?)


There is more than grammar in a URL: there is politics, like gov.uk effectively being a TLD.

And that is why IETF RFCs are so verbose: because there is politics in them. That is why you can't provide regexps.

Because of politics, URLs are not context free. https://www.cs.rochester.edu/~nelson/courses/csc_173/grammar...

So a URL, because of politics, is a context-sensitive grammar.

And regexps cannot parse anything beyond context-free stuff.

To be honest, I doubt anything useful is context free.

I doubt, therefore, that anything useful can be parsed with a regexp... except floats, integers, and other basic types that are useful for building a context-sensitive grammar. For that, RFCs should separate their standards into the context-free parts (the rules) and a config file for the political/commercial parts (context that changes the meaning of the atoms being parsed, for illogical reasons).

The problem is that politics is fucking hard to normalize; we have no BSML yet.


A minor clarification for those who might be learning from HN: regular expressions aren't equivalent to context-free grammars, they are even more limited than context-free grammars. Things which involve arbitrary-depth nesting or recursion, like HTML, can't be parsed by regular expressions, but that is exactly what context-free grammars are for. Where regular expressions correspond to finite state automata (essentially directed graphs where nodes are states and edges are input-driven transitions between states), context-free grammars correspond to pushdown automata (with a stack, this is the recursion secret sauce). Lots of interesting things are context-free. For example, any good programming language grammar should be context-free, to allow sane and efficient parsing. I welcome further corrections.
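
A toy example of the distinction above: a counter (standing in for the pushdown automaton's stack) recognizes arbitrarily deep nesting, while any single regular expression can only handle nesting up to some hard-coded depth. This is a sketch for illustration, not a full parser.

```python
# Sketch: nesting depth is where regular expressions give out.
import re

def balanced(s):
    """Check balanced parentheses with a depth counter (the 'stack')."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

# A regex can only encode a fixed maximum depth (here, 3):
fixed_depth = re.compile(r"\((\((\(\))?\))?\)")

print(balanced("((()))"))                        # handles any depth
print(bool(fixed_depth.fullmatch("(((())))")))   # depth 4 exceeds the pattern
```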


In other words, regular expressions are equivalent to regular grammars [1]. (Except that "modern regular expressions" support some constructs that let them match certain non-regular languages. [2])

[1] https://en.wikipedia.org/wiki/Regular_grammar

[2] https://en.wikipedia.org/wiki/Regular_expression#Patterns_fo...


Could you write a blog post on this? I want to learn more!


Please. Please Please Please. With Sugar on Top. Don't do this with a regex.

I've written a lot of web-facing software that accepts URLs from the untrusted masses and ultimately makes requests to them if they are "valid." The lesson I've learned is simple: regexes are terrible for this task, because there are a ton of things to check and lots of normalization to do. Instead, do this as a function.

I've evolved mine over the years, and my use case is semi-specific: Given a string, validate it as a fully qualified HTTP/HTTPS URL that doesn't have credentials and isn't trying to point my software toward the internal network or localhost. It looks like this:

- Use system/framework library to create Uri object from source string. All your checks will be consulting this object's properties, not looking at the source string

- Is scheme HTTP/HTTPS? If no, stop

- Did they supply user:pass@ in URL? If so, stop, and yell at them for putting usernames and passwords into a random site on the Internet.

- If hostname is an IP address, normalize it to dotted decimal quad IPv4 or IPv6 (no octal obfuscation for you!), and test against private IP space ranges or loopback. If private or loopback, stop

- If hostname is an actual hostname, normalize it + de-puny-code it, and check for localhost aliases. If local, stop (you can also do a DNS lookup and make sure someone isn't trying to return private/local IPs to bypass your checks)

At this point, you have a syntactically valid, fully qualified URL pointing to a public facing web property accessed via HTTP or HTTPS.

You don't have to worry about TLDs or the like. At this point, you can do additional DNS checks, check the domain against lists of bad actors, whatever else you want to do. You can try to be smart and do things like "if the supplied URL wasn't fully qualified, prepend http:// and try validation again" to avoid user error. Pretty flexible.

This is more rigorous than a simple regex and way way way easier for another developer to read and understand what is going on.
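
A minimal sketch of that checklist, assuming Python's stdlib (this is not the commenter's actual code, and the local-alias list and function name are illustrative; dword/octal IP forms would need normalization before the IP-literal check):

```python
# Sketch: validate a string as a public-facing HTTP/HTTPS URL without
# credentials, rejecting private/loopback targets.
import ipaddress
from urllib.parse import urlsplit

LOCAL_ALIASES = {"localhost", "localhost.localdomain"}  # illustrative only

def is_safe_public_url(raw):
    parts = urlsplit(raw)
    if parts.scheme not in ("http", "https"):
        return False                       # wrong scheme
    if parts.username or parts.password:
        return False                       # embedded credentials
    host = parts.hostname
    if not host:
        return False
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        # A hostname, not an IP literal: check localhost aliases.
        return host.rstrip(".") not in LOCAL_ALIASES
    # An IP literal: reject private, loopback, and link-local space.
    return not (ip.is_private or ip.is_loopback or ip.is_link_local)

print(is_safe_public_url("https://example.com/page"))       # True
print(is_safe_public_url("http://user:pass@example.com/"))  # False
print(is_safe_public_url("http://10.1.1.1/"))               # False
```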


> - Did they supply user:pass@ in URL? If so, stop, and yell at them for putting usernames and passwords into a random site on the Internet.

FWIW, browsers don't send the user:pass in the URL - they automatically marshal it into the Authorization: Basic header. Obviously sending these over HTTP is still dumb, but they're not (typically) logged in plain-text in browser history/server request logs.


Ahh. Very true. Let me be a little more clear. I work on public web apps where there is literally an HTML text input for someone to submit a URL that we should audit. I'm not talking about people typing URLs directly into a browser.

At least once a month I get someone giving me a URL with embedded credentials to a dev or staging environment for an Alexa top 50K site. Things like https://qa:tester@dev.major-ecomm-site.com/blah. It's pretty terrible, but makes for an interesting sales call :-)


It can affect what domain you end up at though, for example:

http://foo:bar.com\@example.com

Chrome/Webkit browsers will go to `bar.com` but Firefox will end up at `example.com` for the same URL! There's a fantastic book called "The Tangled Web" that has lots of examples of these pitfalls.


Agreed. It's "funny" how common 1000+ character uncommented one-liner regular expressions are, while in any other programming language even something like `if (flag) return;` would get rejected for being dangerous, unreadable, uncommented, a brace-style violation, and possibly even too long if indentation and the length of the flag name push it past just 80 characters.

Is this because people think of a regex as an atomic black box instead of a program or function that can be read and modified? For example, when a regex is incorrect, they don't say "This regex is incorrect. How can it be corrected?" like they would for a program or function, but "This is the incorrect regex. Which one is the correct regex?"

Free-spacing mode (where you can add whitespace, newlines and comments to a regex) would at least help a little. (Although even then I think a function with early returns is often more appropriate, possibly using very small regular expressions for some of the individual checks.) But I rarely even see free-spacing modes mentioned. Maybe because most programmers using regex actually prefer an atomic black box?
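
As a small sketch of free-spacing mode, Python's `re.VERBOSE` lets a pattern carry whitespace and comments like any other code (the host/port pattern below is a deliberately simplified example, not a real validator):

```python
# Sketch: a commented, free-spacing regex via re.VERBOSE.
import re

SIMPLE_HOSTPORT = re.compile(r"""
    ^
    (?P<host> [a-z0-9-]+ (?: \. [a-z0-9-]+ )* )   # dotted labels
    (?: : (?P<port> \d{1,5} ) )?                  # optional port
    $
""", re.VERBOSE | re.IGNORECASE)

m = SIMPLE_HOSTPORT.match("example.com:8080")
print(m.group("host"), m.group("port"))  # example.com 8080
```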


Most programmers I've met are not in fact regex "literate". They may be able to write one, but understanding what one does, or even modifying it? Well... So yeah, black box.


I'm extremely impressed you thought of checking for private IP addresses - everyone seems to forget that. But as you've described it, there's a time-of-check/time-of-use vulnerability: an attacker could set a really low TTL on the A record and swap it out with a private address between your check and the time you actually hit the URL. You really have to hook into the HTTP client for that check (in Perl, I recommend LWPx::ParanoidAgent; if using libcurl, you can use a CURLOPT_OPENSOCKETFUNCTION callback).
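
One way to close that gap (a sketch under the assumption that you can hand your HTTP client a literal IP): resolve the name once, vet every returned address, and connect to the vetted literal rather than re-resolving by name. The function name here is illustrative.

```python
# Sketch: resolve once, vet all addresses, connect to the vetted IP.
import ipaddress
import socket

def resolve_and_vet(host, port):
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    addrs = [info[4][0] for info in infos]
    for addr in addrs:
        ip = ipaddress.ip_address(addr)
        if ip.is_private or ip.is_loopback or ip.is_link_local:
            raise ValueError(f"{host} resolves to non-public {addr}")
    return addrs[0]  # connect to this literal IP, not to the hostname
```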


Using a regex to validate a URL is only marginally less stupid than using a regex to validate an e-mail address.

Validating URLs is also something you shouldn't have to implement yourself.


Validating email addresses with a regex is fine if you follow the HTML5 spec for email addresses: http://www.w3.org/TR/html5/forms.html#valid-e-mail-address

> This requirement is a willful violation of RFC 5322, which defines a syntax for e-mail addresses that is simultaneously too strict (before the "@" character), too vague (after the "@" character), and too lax (allowing comments, whitespace characters, and quoted strings in manners unfamiliar to most users) to be of practical use here.


A lot of these regexes miss the whole point of validation. The key problem that needs to be solved here is to find if a URL is syntactically correct. A lot of them focus on the semantics as well. Thus hardcoding the TLDs is not needed at all, so http://microsoft.com and http://microsoft.foobar should both be valid (I could have the latter in my /etc/hosts). Also, that's what the DNS is used for.
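
The purely syntactic check described above can be sketched with the stdlib; both hostnames pass, because whether they resolve is DNS's job (`looks_like_http_url` is an illustrative name, not a real API):

```python
# Sketch: syntax-only, TLD-agnostic URL check.
from urllib.parse import urlsplit

def looks_like_http_url(raw):
    parts = urlsplit(raw)
    return parts.scheme in ("http", "https") and bool(parts.hostname)

print(looks_like_http_url("http://microsoft.com"))     # True
print(looks_like_http_url("http://microsoft.foobar"))  # True
print(looks_like_http_url("not a url"))                # False
```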



That's exactly what I was thinking when I looked at the tests, and I didn't see anything addressing it. If they really want the real deal, they'd have to support that too.

However, as others have already pointed out, it's better not to use a regex for this, but a proper library for your language, which would bail out as soon as it hits something invalid. With the hope that the library for the language already supports that.


The one marked as winning, by @diegoperini, has problems.

The way it defines passwords is wrong, so you can trick it into accepting almost anything by putting the domain somewhere else:

    re_weburl.test("http://127.0.0.1/")
    => false
    re_weburl.test("http://127.0.0.1/@example.com")
    => true
    re_weburl.test("http://999.999.999.999.999/@example.com")
    => true
Oops. I disagree that it should reject RFC 1918 addresses anyway, because that makes it less useful in an intranet context, where you want those to work.

There's also an apples-to-oranges comparison going on here. The Gruber pattern is not for validation, but for detecting url-like-things in text, which is why it excludes a whole bunch of punctuation chars from appearing at the end - when I say 'google.com.' in text, I mean 'google.com'.

Edited to add: I misremembered suggesting trailing punctuation exclusion to Gruber, what we discussed was xxx.xxx/xxx as an alternate pattern, catching protocol-less shortened urls in tweets.


What about new TLDs? URLs with them should also be checked. I can see that some regexes contain a list of TLDs, which is already disqualifying.


Why would you ever use a regex to do this, apart from experimenting?

This would be much easier to do with a context free grammar or any decent parsing library.


In many cases you don't want "http://foo.bar?q=Spaces should be encoded" to pass. If, for example, you want to turn URLs in comments into links, then a space should end the URL right there. Otherwise you end up turning whole paragraphs into links.
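
For that link-detection use case (as opposed to validation), stopping the match at whitespace is enough to keep it from swallowing the rest of the paragraph. A minimal sketch:

```python
# Sketch: detect URL-like tokens in text, ending each match at whitespace.
import re

LINKISH = re.compile(r"https?://\S+")
text = "see http://foo.bar?q=Spaces should be encoded"
print(LINKISH.findall(text))  # ['http://foo.bar?q=Spaces'] — ends at the space
```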


While I admire the effort that people put into this (and the similar effort of email address validation), what I really want to see is a comparison based on "good enough" validation vs. performance. I think a reasonably low rate of false positives is a reasonable trade-off for fast validation.


I'm using a modified version of this for IRI/URI validation:

https://github.com/nisavid/spruce-iri/blob/master/spruce/iri...


I just don't understand why so many people like to reinvent the wheel all the time. In PHP, there's the filter_var() family of functions with plenty of filter types, like FILTER_VALIDATE_URL, or you could use parse_url() if you need to add further constraints to validated URLs, like forbidding localhost, etc. IMHO complex regular expressions should be avoided, as they make debugging a PITA and are usually a performance bottleneck too.


Terrific. Sr. Perini's regex has served me well. That being said, if you solve a problem with a list of thirteen regexes, you then have fourteen problems.


Here's my easy (but not totally passing) version using the DOM: https://gist.github.com/javan/6aaebfeb5fe415498028


Maybe gwern can build a neural net for this too. :)


Great matrix


Eyeballing those charts it looks like @scottgonzales scores the best, followed by @cowboy.


Looks like @diegoperini has a perfect score. (Site is wide so if your screen is narrower, you won't see his)


You are correct, I missed that one.


@diegoperini seems to be the clear winner with green in each cell.


http://./ is a perfectly valid and very short URL, even though it is highly unlikely anyone (InterNIC? ICANN? Verisign?) would issue an address record for @. On a related note, has anyone seen ccTLDs do something like this (e.g., http://io./, https://co.uk./foo)?


The Vatican advertises MX records on va.:

    $ dig va MX
    ...
    ;; ANSWER SECTION:
    va.			3599	IN	MX	10 mx12.vatican.va.
    va.			3599	IN	MX	10 mx11.vatican.va.
    va.			3599	IN	MX	100 raphaelmx3.posta.va.


Cool! Anyone want to try emailing pope@va and seeing what comes of it?


> On a related note, has anyone seen ccTLDs do something like this (e.g., http://io./, https://co.uk./foo)?

Yes, I wondered about this myself, http://dk. redirects to www.dk-hostmaster.dk.

It would be interesting to receive mail from root@com or whatever :)



