[development] URL encoding
nitin gupta
nitingupta.iitg at gmail.com
Fri Mar 12 03:57:50 UTC 2010
Now only '%25' is lost from the original URL, though that is still not acceptable.
On 3/12/10, nitin gupta <nitingupta.iitg at gmail.com> wrote:
> Scott,
>
> I was just about to post this, but you caught me. Yes, there is a problem
> with the solution proposed above, and you caught it correctly: it will not
> encode what should be encoded.
>
> You are right that such URLs should be corrected at the source, but that
> is no excuse if my module fails to work properly
> (http://drupal.org/node/731798). As mentioned in my first post, I know
> rawurlencode() is meant for components of URLs, but I have no choice, as I
> cannot assume which characters will or will not be present in the URL.
>
> Coincidentally, while working on my other module, Facebook-style Links
> (http://drupal.org/project/facebook_link), I found the same bug in
> Facebook. Try sharing http://www.google.com/search?q=%22a%26b%22 on
> Facebook and you will see that they do the same thing. ;) I haven't
> reported it yet, though.
>
> I changed the function to this; let me know what you think:
>
> function encode_url($url) {
>   $reserved = array(
>     ":" => '!%3A!ui',
>     "/" => '!%2F!ui',
>     "?" => '!%3F!ui',
>     "#" => '!%23!ui',
>     "[" => '!%5B!ui',
>     "]" => '!%5D!ui',
>     "@" => '!%40!ui',
>     "!" => '!%21!ui',
>     "$" => '!%24!ui',
>     "&" => '!%26!ui',
>     "'" => '!%27!ui',
>     "(" => '!%28!ui',
>     ")" => '!%29!ui',
>     "*" => '!%2A!ui',
>     "+" => '!%2B!ui',
>     "," => '!%2C!ui',
>     ";" => '!%3B!ui',
>     "=" => '!%3D!ui',
>   );
>
>   $url = rawurlencode($url);
>   $url = preg_replace(array_values($reserved), array_keys($reserved), $url);
>   $url = preg_replace('!%25!ui', '%', $url);
>   return $url;
> }
>
>
> I am still testing, so let me know if any case fails for the function above.
>
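One case that does appear to fail: a literal "%" in the input that is not part of an escape. Below is a minimal sketch to check this. It uses a compact stand-in for encode_url() (the name encode_url_compact is made up here, and str_replace() is swapped in for preg_replace(), which should behave the same because rawurlencode() always emits uppercase hex).

```php
<?php
// Compact stand-in for the quoted encode_url(); the name is hypothetical.
function encode_url_compact($url) {
  $reserved = str_split(':/?#[]@!$&\'()*+,;=');
  $encoded = array_map('rawurlencode', $reserved);
  // Encode everything, then put the reserved characters back.
  $url = str_replace($encoded, $reserved, rawurlencode($url));
  // Undo the double encoding of escapes that were already present.
  return str_replace('%25', '%', $url);
}

// Existing escapes survive the round trip.
echo encode_url_compact('http://www.google.com/search?q=%22a%26b%22'), "\n";

// A literal percent sign is left bare, which is not a valid escape.
echo encode_url_compact('http://example.com/50% off'), "\n";
```

The second call prints http://example.com/50%%20off, so the function cannot tell a literal "%" from the start of an existing escape.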
> --
> Regards,
> Nitin Kumar Gupta
> http://publicmind.in/blog/
>
>
> On Fri, Mar 12, 2010 at 7:29 AM, Scott Reynen
> <scott at makedatamakesense.com> wrote:
>
>> On Mar 11, 2010, at 11:10 AM, nitin gupta wrote:
>>
>>> I am using the following to solve the problem; any ideas to improve it,
>>> in terms of efficiency or otherwise, are welcome:
>>>
>>> function encodeurl($url) {
>>>   $reserved = array(
>>>     ":" => '!%3A!ui',
>>>     "/" => '!%2F!ui',
>>>     "?" => '!%3F!ui',
>>>     "#" => '!%23!ui',
>>>     "[" => '!%5B!ui',
>>>     "]" => '!%5D!ui',
>>>     "@" => '!%40!ui',
>>>     "!" => '!%21!ui',
>>>     "$" => '!%24!ui',
>>>     "&" => '!%26!ui',
>>>     "'" => '!%27!ui',
>>>     "(" => '!%28!ui',
>>>     ")" => '!%29!ui',
>>>     "*" => '!%2A!ui',
>>>     "+" => '!%2B!ui',
>>>     "," => '!%2C!ui',
>>>     ";" => '!%3B!ui',
>>>     "=" => '!%3D!ui',
>>>   );
>>>
>>>   $url = rawurlencode(rawurldecode($url));
>>>   $url = preg_replace(array_values($reserved), array_keys($reserved), $url);
>>>   return $url;
>>> }
>>>
>>
>> There's an old quote [1] that seems somewhat apt here:
>>
>>> Some people, when confronted with a problem, think "I know, I'll use
>>> regular expressions." Now they have two problems.
>>>
>>
>> That's not entirely apt, as your regular expression might as well be done
>> with str_replace(), but you are adding problems rather than removing
>> them. You should really scrap this whole thing and take a few steps back
>> rather than adding more to it; this will break URLs due to flaws in the
>> fundamental approach.
>>
>> rawurlencode() and rawurldecode() are meant to be used on components of
>> URLs, not whole URLs. It's impossible to properly encode an entire URL
>> without first breaking it up into component parts, because the different
>> parts require different encoding. For example, "/" should be encoded in a
>> query string, but not in a path. Treating it the same everywhere is why
>> you're having the problem with delimiters being encoded. The
>> preg_replace() only hides this problem, while introducing new problems
>> (not encoding things that should be encoded); it's not a solution.
>>
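To make the component-by-component idea concrete, here is a rough sketch, not a finished implementation. The function name encode_url_by_component is hypothetical. It assumes an http(s)-style URL with a host and no user:pass@ section, it assumes the query string uses the usual key=value&key=value form encoding (so "+" stays a space), and it normalizes each piece by decoding and re-encoding it, which also escapes characters like ";" or "," in the path.

```php
<?php
// Hypothetical helper (not from this thread): encode each URL part by its own rules.
function encode_url_by_component($url) {
  $parts = parse_url($url);
  if ($parts === false || !isset($parts['scheme'], $parts['host']) || isset($parts['user'])) {
    return $url; // Not something this sketch understands; leave it alone.
  }

  $out = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $out .= ':' . $parts['port'];
  }

  if (isset($parts['path'])) {
    // "/" separates path segments, so split first and encode each segment.
    $segments = explode('/', $parts['path']);
    foreach ($segments as $i => $segment) {
      $segments[$i] = rawurlencode(rawurldecode($segment));
    }
    $out .= implode('/', $segments);
  }

  if (isset($parts['query'])) {
    // "&" and "=" are the delimiters here; everything else, "/" included, is data.
    $pairs = explode('&', $parts['query']);
    foreach ($pairs as $i => $pair) {
      $pieces = explode('=', $pair, 2);
      foreach ($pieces as $j => $piece) {
        $pieces[$j] = urlencode(urldecode($piece));
      }
      $pairs[$i] = implode('=', $pieces);
    }
    $out .= '?' . implode('&', $pairs);
  }

  if (isset($parts['fragment'])) {
    $out .= '#' . rawurlencode(rawurldecode($parts['fragment']));
  }
  return $out;
}

echo encode_url_by_component('http://example.com/path with spaces/?x=1/2'), "\n";
echo encode_url_by_component('http://www.google.com/search?q=%22a%26b%22'), "\n";
```

The first call keeps "/" in the path but escapes it inside the query value; the second leaves the already-escaped Google URL unchanged.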
>> To illustrate the problem, consider this URL:
>>
>> http://www.google.com/search?q=%22a%26b%22
>>
>> That's a Google search for the phrase "a&b". Your function turns that
>> into this:
>>
>> http://www.google.com/search?q=%22a&b%22
>>
>> That's a Google search for "a, because the now-unescaped & ends the q
>> parameter. It returns completely different results.
>>
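This can be reproduced with a short script. The sketch below uses a compact stand-in for the quoted encodeurl() (the name encodeurl_compact is made up, and str_replace() replaces preg_replace(), which should be equivalent here since rawurlencode() only emits uppercase hex). It then uses parse_str() to show what a server would read from q.

```php
<?php
// Compact stand-in for the quoted encodeurl(); the name is hypothetical.
function encodeurl_compact($url) {
  $reserved = str_split(':/?#[]@!$&\'()*+,;=');
  $encoded = array_map('rawurlencode', $reserved);
  // Decode, encode everything, then put the reserved characters back.
  return str_replace($encoded, $reserved, rawurlencode(rawurldecode($url)));
}

$before = 'http://www.google.com/search?q=%22a%26b%22';
$after = encodeurl_compact($before);
echo $after, "\n";

// Compare the value of q before and after the rewrite.
parse_str(parse_url($before, PHP_URL_QUERY), $q_before);
parse_str(parse_url($after, PHP_URL_QUERY), $q_after);
echo $q_before['q'], "\n";
echo $q_after['q'], "\n";
```

The value of q goes from "a&b" (with both quotes) to just "a, which is the change in meaning described above.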
>> Backing up, you apparently have input that looks like this:
>>
>> http://example.com/path with spaces/
>>
>> That's not a valid URL, so it needs to be fixed somewhere. Ideally it
>> would be fixed at the source, but if that's not an option, you can fix
>> this specific problem simply with str_replace(' ', '%20', $url). That
>> won't break anything else because spaces aren't URL delimiters. I'm
>> guessing your input has more complex problems with invalid URLs, as your
>> attempted solution is broader in scope. It's hard to say what you should
>> do without knowing more about the input. What does the raw XML look like?
>>
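For reference, the space-only fix looks like this in practice. This is only a small sketch using the two example URLs from this thread; it touches nothing except literal spaces.

```php
<?php
// Replace only literal spaces; delimiters and existing escapes are untouched.
$url = 'http://example.com/path with spaces/';
$fixed = str_replace(' ', '%20', $url);
echo $fixed, "\n";

// A URL that is already valid passes through unchanged.
$google = 'http://www.google.com/search?q=%22a%26b%22';
$same = str_replace(' ', '%20', $google);
var_dump($same === $google);
```

It prints http://example.com/path%20with%20spaces/ followed by bool(true).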
>> [1]
>> http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-two-problems.html
>>
>> --
>> Scott Reynen
>> MakeDataMakeSense.com
>>
>>
>>
>
--
Regards,
Nitin Kumar Gupta
http://publicmind.in/blog/