[development] URL encoding

Scott Reynen scott at makedatamakesense.com
Fri Mar 12 01:59:57 UTC 2010


On Mar 11, 2010, at 11:10 AM, nitin gupta wrote:

> I am using the following to solve the problem, any ideas to improve  
> it in terms of efficiency or otherwise are welcome:
>
> function encodeurl($url) {
>   $reserved = array(
>     ":" => '!%3A!ui',
>     "/" => '!%2F!ui',
>     "?" => '!%3F!ui',
>     "#" => '!%23!ui',
>     "[" => '!%5B!ui',
>     "]" => '!%5D!ui',
>     "@" => '!%40!ui',
>     "!" => '!%21!ui',
>     "$" => '!%24!ui',
>     "&" => '!%26!ui',
>     "'" => '!%27!ui',
>     "(" => '!%28!ui',
>     ")" => '!%29!ui',
>     "*" => '!%2A!ui',
>     "+" => '!%2B!ui',
>     "," => '!%2C!ui',
>     ";" => '!%3B!ui',
>     "=" => '!%3D!ui',
>   );
>
>   $url = rawurlencode(rawurldecode($url));
>   $url = preg_replace(array_values($reserved),  
> array_keys($reserved), $url);
>   return $url;
> }

There's an old quote [1] that seems somewhat apt here:

> Some people, when confronted with a problem, think "I know, I'll
> use regular expressions."  Now they have two problems.

It doesn't fit perfectly, since your patterns only match literal
strings and could just as well be a single str_replace() call, but you
are adding problems rather than removing them.  You should really scrap
this whole thing and take a few steps back rather than adding more to
it; the fundamental approach is flawed and will break URLs.

rawurlencode and rawurldecode are meant to be used on fragments of  
URLs, not whole URLs.  It's impossible to properly encode an entire  
URL without first breaking it up into component parts, because the  
different parts require different encoding.  For example, "/" should  
be encoded in a query string, but not in a path.  Treating it the same  
everywhere is why you're having the problem with delimiters being  
encoded.  The preg_replace() only hides this problem, while  
introducing new problems (not encoding things that should be encoded);  
it's not a solution.
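
To make "breaking it up into component parts" concrete, here is a
rough sketch of what I mean.  It is only an illustration under some
assumptions, not a drop-in fix: encode_url_by_parts() is a name I made
up, it only handles absolute URLs with a scheme and host, it drops any
user:password part, and it treats "+" in the query as a literal plus
sign (so form-encoded input would need different handling).

```php
<?php
// Hypothetical helper: re-encode a URL one component at a time.
function encode_url_by_parts($url) {
  $parts = parse_url($url);
  if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
    return false; // not something this sketch can take apart
  }

  $out = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $out .= ':' . $parts['port'];
  }

  if (isset($parts['path'])) {
    // "/" is a delimiter in the path: split on it, then encode each segment.
    $segments = explode('/', $parts['path']);
    foreach ($segments as $i => $segment) {
      $segments[$i] = rawurlencode(rawurldecode($segment));
    }
    $out .= implode('/', $segments);
  }

  if (isset($parts['query'])) {
    // "&" and "=" are delimiters in the query: split on them BEFORE
    // decoding, so an encoded "%26" inside a value stays encoded.
    $pairs = explode('&', $parts['query']);
    foreach ($pairs as $i => $pair) {
      $kv = explode('=', $pair, 2);
      foreach ($kv as $j => $piece) {
        $kv[$j] = rawurlencode(rawurldecode($piece));
      }
      $pairs[$i] = implode('=', $kv);
    }
    $out .= '?' . implode('&', $pairs);
  }

  if (isset($parts['fragment'])) {
    $out .= '#' . rawurlencode(rawurldecode($parts['fragment']));
  }

  return $out;
}
```

The point is the order of operations: split on the delimiters first,
then decode and re-encode each piece, so a delimiter and an encoded
copy of the same character never get confused with each other.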

To illustrate the problem, consider this URL:

http://www.google.com/search?q=%22a%26b%22

That's a Google search for the phrase "a&b".  Your function turns that  
into this:

http://www.google.com/search?q=%22a&b%22

That's a Google search for just "a (the rest becomes a separate
parameter named b"), which returns completely different results.
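
If you want to see the difference without going to Google, you can
parse both query strings with PHP's own parse_url() and parse_str().
This is just a quick check, assuming magic quotes are off; the
variable names are mine.

```php
<?php
$original = 'http://www.google.com/search?q=%22a%26b%22';
$mangled  = 'http://www.google.com/search?q=%22a&b%22';

// Pull out the query string of each URL and decode it into an array.
parse_str(parse_url($original, PHP_URL_QUERY), $before);
parse_str(parse_url($mangled, PHP_URL_QUERY), $after);

// $before has a single parameter: q is the full phrase "a&b" with quotes.
// $after has two parameters: q is cut short at the "&", and the
// remainder has become a second, unrelated parameter.
var_dump($before['q'] === '"a&b"');
var_dump($after['q'] === '"a');
var_dump(count($before), count($after));
```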

Backing up, you apparently have input that looks like this:

http://example.com/path with spaces/

That's not a valid URL, so it needs to be fixed somewhere.  Ideally it
would be fixed at the source, but if that's not an option, you can fix
this specific problem simply with str_replace(' ', '%20', $url).  That
won't break anything else because spaces aren't URL delimiters.  I'm
guessing your input has more complex problems with invalid URLs, as
your attempted solution is broader in scope.  It's hard to say what
you should do without knowing more about the input.  What does the raw
XML look like?
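
For the spaces-only case, the fix amounts to the following.  The
wrapper name fix_spaces() is just something I made up for the example;
the body is the str_replace() call mentioned above.

```php
<?php
// Hypothetical wrapper around the one-line fix: encode spaces, touch nothing else.
function fix_spaces($url) {
  return str_replace(' ', '%20', $url);
}

// The invalid input becomes a valid URL.
echo fix_spaces('http://example.com/path with spaces/'), "\n";

// A URL that is already properly encoded passes through unchanged.
echo fix_spaces('http://www.google.com/search?q=%22a%26b%22'), "\n";
```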

[1] http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html

--
Scott Reynen
MakeDataMakeSense.com



