[development] URL encoding

nitin gupta nitingupta.iitg at gmail.com
Fri Mar 12 03:57:50 UTC 2010


Now only '%25' is lost from the original URL, though that is still not acceptable.
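
To make that concrete, here is a minimal sketch that runs the encode_url() function quoted below against three inputs. The example.com URLs are made up for illustration, and reading "lost" as "a literal '%' is never encoded to '%25'" is an assumption about the remark above, not something the thread states outright.

```php
<?php
// encode_url() as quoted later in this message; behaviour is unchanged,
// only the array layout is condensed.
function encode_url($url) {
  $reserved = array(
    ":" => '!%3A!ui', "/" => '!%2F!ui', "?" => '!%3F!ui', "#" => '!%23!ui',
    "[" => '!%5B!ui', "]" => '!%5D!ui', "@" => '!%40!ui', "!" => '!%21!ui',
    "$" => '!%24!ui', "&" => '!%26!ui', "'" => '!%27!ui', "(" => '!%28!ui',
    ")" => '!%29!ui', "*" => '!%2A!ui', "+" => '!%2B!ui', "," => '!%2C!ui',
    ";" => '!%3B!ui', "=" => '!%3D!ui',
  );

  $url = rawurlencode($url);                  // encodes everything, '%' included
  $url = preg_replace(array_values($reserved), array_keys($reserved), $url);
  $url = preg_replace('!%25!ui', '%', $url);  // turns every '%25' back into '%'
  return $url;
}

// Spaces are encoded and the delimiters survive.
echo encode_url('http://example.com/path with spaces/'), "\n";
// http://example.com/path%20with%20spaces/

// Escapes already present in the input are preserved.
echo encode_url('http://www.google.com/search?q=%22a%26b%22'), "\n";
// http://www.google.com/search?q=%22a%26b%22

// A literal '%' is never encoded as '%25' (hypothetical input).
echo encode_url('http://example.com/50% off'), "\n";
// http://example.com/50%%20off
```

The last case is the one that still fails: the result contains a bare '%', so it is not a valid URL.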

On 3/12/10, nitin gupta <nitingupta.iitg at gmail.com> wrote:
> Scott,
>
> I was just about to post this, but you caught me. Yes, there is a problem
> with the solution proposed above, and you caught it correctly: it will not
> encode what should be encoded.
>
> You are right that such URLs must be corrected at the source, but that is
> no excuse if my module fails to work properly
> (http://drupal.org/node/731798). As mentioned in my first post, I know
> rawurlencode is meant for components of URLs, but I have no choice, as I
> cannot assume which characters will or will not be present in the URL.
>
> Incidentally, while working on my other module (Facebook-style Links,
> http://drupal.org/project/facebook_link), I found the same bug in
> Facebook. Try sharing http://www.google.com/search?q=%22a%26b%22 on
> Facebook and you will see that they are doing the same. ;) I haven't
> reported this yet, though.
>
> I changed the function to this, let me know your views:
>
> function encode_url($url) {
>   $reserved = array(
>     ":" => '!%3A!ui',
>     "/" => '!%2F!ui',
>     "?" => '!%3F!ui',
>     "#" => '!%23!ui',
>     "[" => '!%5B!ui',
>     "]" => '!%5D!ui',
>     "@" => '!%40!ui',
>     "!" => '!%21!ui',
>     "$" => '!%24!ui',
>     "&" => '!%26!ui',
>     "'" => '!%27!ui',
>     "(" => '!%28!ui',
>     ")" => '!%29!ui',
>     "*" => '!%2A!ui',
>     "+" => '!%2B!ui',
>     "," => '!%2C!ui',
>     ";" => '!%3B!ui',
>     "=" => '!%3D!ui',
>   );
>
>   $url = rawurlencode($url);
>   $url = preg_replace(array_values($reserved), array_keys($reserved), $url);
>   $url = preg_replace('!%25!ui', '%', $url);
>   return ($url);
> }
>
>
> I am still testing, so let me know if any case fails for the above function.
>
> --
> Regards,
> Nitin Kumar Gupta
> http://publicmind.in/blog/
>
>
> On Fri, Mar 12, 2010 at 7:29 AM, Scott Reynen
> <scott at makedatamakesense.com> wrote:
>
>> On Mar 11, 2010, at 11:10 AM, nitin gupta wrote:
>>
>>> I am using the following to solve the problem; any ideas to improve it in
>>> terms of efficiency or otherwise are welcome:
>>>
>>> function encodeurl($url) {
>>>  $reserved = array(
>>>    ":" => '!%3A!ui',
>>>    "/" => '!%2F!ui',
>>>    "?" => '!%3F!ui',
>>>    "#" => '!%23!ui',
>>>    "[" => '!%5B!ui',
>>>    "]" => '!%5D!ui',
>>>    "@" => '!%40!ui',
>>>    "!" => '!%21!ui',
>>>    "$" => '!%24!ui',
>>>    "&" => '!%26!ui',
>>>    "'" => '!%27!ui',
>>>    "(" => '!%28!ui',
>>>    ")" => '!%29!ui',
>>>    "*" => '!%2A!ui',
>>>    "+" => '!%2B!ui',
>>>    "," => '!%2C!ui',
>>>    ";" => '!%3B!ui',
>>>    "=" => '!%3D!ui',
>>>  );
>>>
>>>  $url = rawurlencode(rawurldecode($url));
>>>  $url = preg_replace(array_values($reserved), array_keys($reserved), $url);
>>>  return $url;
>>> }
>>>
>>
>> There's an old quote [1] that seems somewhat apt here:
>>
>>> Some people, when confronted with a problem, think "I know, I'll use
>>> regular expressions."  Now they have two problems.
>>>
>>
>> That's not entirely apt, as your regular expression might as well be done
>> with str_replace(), but you are adding problems rather than removing them.
>> You should really scrap this whole thing and take a few steps back rather
>> than adding more to it; this will break URLs due to flaws in the
>> fundamental approach.
>>
>> rawurlencode and rawurldecode are meant to be used on fragments of URLs,
>> not whole URLs.  It's impossible to properly encode an entire URL without
>> first breaking it up into component parts, because the different parts
>> require different encoding.  For example, "/" should be encoded in a query
>> string, but not in a path.  Treating it the same everywhere is why you're
>> having the problem with delimiters being encoded.  The preg_replace() only
>> hides this problem, while introducing new problems (not encoding things
>> that should be encoded); it's not a solution.
>>
>> To illustrate the problem, consider this URL:
>>
>> http://www.google.com/search?q=%22a%26b%22
>>
>> That's a Google search for the phrase "a&b".  Your function turns that into
>> this:
>>
>> http://www.google.com/search?q=%22a&b%22
>>
>> That's a Google search for "a, which returns completely different results.
>>
>> Backing up, you apparently have input that looks like this:
>>
>>
>> http://example.com/path with spaces/
>>
>> That's not a valid URL, so it needs to be fixed somewhere.  Ideally it
>> would be fixed at the source, but if that's not an option, you can fix this
>> specific problem simply with str_replace(' ', '%20', $url);  That won't
>> break anything else because spaces aren't URL delimiters.  I'm guessing your
>> input has more complex problems with invalid URLs, as your attempted
>> solution is broader in scope.  It's hard to say what you should do without
>> knowing more about the input.  What does the raw XML look like?
>>
>> [1]
>> http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
>>
>> --
>> Scott Reynen
>> MakeDataMakeSense.com
>>
>>
>>
>

--
Regards,
Nitin Kumar Gupta
http://publicmind.in/blog/

