Hi Scott,

I don't think we contradict each other here; we differ only in point of view. I am saying that we should not encode a character that is already allowed. If the URL is already present somewhere as (http://example.com/hj/hj), there is no need to encode it, and if it is present as (http://example.com%2f/test), there is no need to decode it. If you get such a URL, just do not touch it, because it contains no invalid characters.

@URL dictionary: Are you kidding?? I was obviously referring to the same RFC.

I would like you to think for a moment and tell me what you gain by breaking the URL into components, encoding them, and then joining them again. Consider this problem statement: you are given a URL, extracted from the HTML source of a webpage, and you need to access it using drupal_http_request().

I am, of course, interested in improving what I currently have in hand.

"Fire me all you can, but cast me into a solid and beautiful pot"

--
Regards,
Nitin Kumar Gupta
http://publicmind.in/blog/

On Sun, Mar 14, 2010 at 10:50 AM, Scott Reynen <scott@makedatamakesense.com> wrote:
On Mar 13, 2010, at 8:20 PM, nitin gupta wrote:
I completely agree with what you and Scott are trying to say. But I am not looking to create a URL, just to sanitize one by removing disallowed characters, i.e. what a browser would do when a user enters a URL. Consider that I parse the following URL from XML:
Do you think I should encode the '/' in the query part i.e. [test/com]??
Technically, yes, but that's beside the point. Regardless of how strictly you choose to apply URL encoding, you should be applying it to specific URL parts, not full URLs.
I don't think we need to. (Nor will Firefox, if you enter this URL in the
address bar).
You're right that encoding the slash character isn't particularly important in the query. In a path segment, however, the difference between encoded and unencoded slashes is very significant; http://example.com/a/b/c is different than http://example.com/a%2fb/c. And a slash definitely shouldn't be encoded where it's used as a delimiter between URL components. This is actually a good example of why encoding must be applied to individual URL components, not the full URL.
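
To make the path-segment point concrete, here is a rough sketch in Python (not PHP, and not tied to drupal_http_request()). The helper name build_url is hypothetical and exists only for illustration; urllib.parse.quote is the standard library call.

```python
from urllib.parse import quote

def build_url(base, segments):
    # Hypothetical helper: encode each path segment on its own.
    # safe="" tells quote() to escape "/" as well, so a slash that is
    # data inside a segment cannot be mistaken for a delimiter.
    encoded = [quote(segment, safe="") for segment in segments]
    return base.rstrip("/") + "/" + "/".join(encoded)

# Three segments: "a", "b", "c"
three = build_url("http://example.com", ["a", "b", "c"])

# Two segments: "a/b" and "c" (the slash is data, not a delimiter)
two = build_url("http://example.com", ["a/b", "c"])

# Encoding the joined path instead cannot tell the two cases apart,
# because quote() treats "/" as safe by default.
joined = quote("a/b/c")
```

Here `three` comes out as http://example.com/a/b/c and `two` as http://example.com/a%2Fb/c, while `joined` is left as a/b/c whichever segmentation was intended.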
If a URL contains characters which are allowed in the URL dictionary, will
we ever need to encode those characters? No.
What is the URL dictionary? Here's one of the relevant RFCs on URLs:
http://www.ietf.org/rfc/rfc3986.txt
Selected quotes:
"A percent-encoding mechanism is used to represent a data octet _in_a_component_"
"the conflicting data must be percent-encoded _before_the_URI_is_formed_"
Emphasis added to, well, emphasize that encoding applies to component parts.
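
For the case of a URL that already exists (say, one pulled out of a page), a component-wise cleanup might look roughly like the Python sketch below. The function name sanitize_url and the two "safe" sets are assumptions made for this example, not anything defined by the RFC text quoted above or by Drupal. "%" is treated as safe so that existing escapes such as %2f are left alone, which also means a stray "%" that is not part of an escape is not repaired.

```python
from urllib.parse import urlsplit, urlunsplit, quote

# Assumed safe sets: characters left untouched in each component.
# "%" is included so existing percent-escapes are not double-encoded.
PATH_SAFE = "/%:@!$&'()*+,;="
QUERY_SAFE = PATH_SAFE + "?"

def sanitize_url(url):
    # Split first, so the delimiters between components are never encoded.
    parts = urlsplit(url)
    path = quote(parts.path, safe=PATH_SAFE)
    query = quote(parts.query, safe=QUERY_SAFE)
    fragment = quote(parts.fragment, safe=QUERY_SAFE)
    # Rejoin the individually encoded components.
    return urlunsplit((parts.scheme, parts.netloc, path, query, fragment))

untouched = sanitize_url("http://example.com/hj/hj")
kept_escape = sanitize_url("http://example.com/a%2fb/c")
fixed_space = sanitize_url("http://example.com/a b?q=test/com")
```

Under these assumptions the first two URLs come back unchanged, and only the space in the third is encoded, giving http://example.com/a%20b?q=test/com. The scheme and host are passed through as-is in this sketch.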
--
Scott Reynen
MakeDataMakeSense.com