Hi,
I am not creating the URL, but I parse the URL from XML content and I want to urlencode it. Also, I am looking to preserve all the reserved characters in the URL and not just '?' or '&'.
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
On Thu, Mar 11, 2010 at 6:59 PM, Don <donald@fane.com> wrote:
If you use l() or url(), the query section is passed separately, so the '?' and '&' are preserved. The query attributes are passed in an array as name/value pairs. That's also the real Drupal way to make a link.
-Don-
On 3/11/2010 8:23 AM, nitin gupta wrote:
Hi,
I am trying to convert a URL to its encoded equivalent. i.e. http://example.com/path with spaces/ to http://example.com/path%20with%20spaces/.
The problem I am facing is that rawurlencode() and urlencode() encode every character and show no mercy to reserved characters such as '?', '#' and '=', which makes the URL invalid after encoding. I understand that this behavior is expected, since these functions are meant to encode data (such as a query value) that is appended to a URL, but it isn't helping in the current situation.
I thought of the following procedure to achieve the above:
1. Run rawurldecode() on the input URL, so that we do not accidentally encode twice if all or part of the URL is already encoded.
2. Then run rawurlencode(), and then use preg_replace() to turn the encoded reserved characters back into their literal form.
Though it works, I am not sure if it is the correct way to handle this. Any help through ideas or code will be highly appreciated, as this will be useful in the Feeds Image Grabber <http://drupal.org/project/feeds_imagegrabber> and Facebook-style Links <http://drupal.org/project/facebook_link> modules.
Also, off the subject: shouldn't drupal_http_request() take care of such encoding?
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
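To make the complaint above concrete, here is a minimal sketch of what the two built-in functions do to the example URL from this message. The variable name is only illustrative; nothing beyond core PHP's rawurlencode() and urlencode() is assumed.

```php
<?php
// Example input taken from the message above.
$url = 'http://example.com/path with spaces/';

// rawurlencode() treats the whole string as data, so ':' and '/' are escaped too.
echo rawurlencode($url), "\n";
// http%3A%2F%2Fexample.com%2Fpath%20with%20spaces%2F

// urlencode() behaves the same way, except that a space becomes '+'.
echo urlencode($url), "\n";
// http%3A%2F%2Fexample.com%2Fpath+with+spaces%2F
```

Neither result is usable as a URL any more, which is the behaviour the question is about.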
nitin gupta wrote:
Hi,
I am not creating the URL, but I parse the URL from XML content and I want to urlencode it. Also, I am looking to preserve all the reserved characters in the URL and not just '?' or '&'.
http://www.google.com/search?q=url+encoding+site%3Aapi.drupal.org leads you to two API functions, url() [1] and drupal_urlencode() [2]. Are these what you need?
[1] http://api.drupal.org/api/function/url
[2] http://api.drupal.org/api/function/drupal_urlencode
-- Earnie -- http://www.give-me-an-offer.com
Thanks Earnie, but that is not what I am looking for. It is essentially similar to urlencode or rawurlencode. I am looking for a function that works like the one at the bottom of this page:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
For example:
http://example.com/q=12 &12grt key=value;<>+,.()!*
is converted to
http://example.com/q=12%20&12grt%20key=value;%3C%3E+,.()!*
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
Hi! I'll look at that and give you some feedback about this function!
-- Regards, João Gustavo Taveira, Systems Analyst II, Peopleware Tecnologia
"Technology starts with people."
Quality | Creativity | Responsibility | Ethics | Commitment
Phone: (21) 8208-4688 / (21) 8160-1302
So you want to replace the spaces, but not any of the non-alpha characters? Sounds more like a preg_replace() action. That way you could just create an array of patterns you want to replace, rather than doing a real urlencode.
-Don Pickerel-
On 11/03/10 14:17, nitin gupta wrote:
Thanks Earnie, but that is not what I am looking for. It is essentially similar to urlencode or rawurlencode. I am looking for a function that does the work like the one present at the bottom of the page here:
http://www.blooberry.com/indexdot/html/topics/urlencoding.htm
If I understand correctly, you have full URIs that you want to check (encode) before using them out there. Why not use parse_url() (http://php.net/manual/en/function.parse-url.php) to get the component parts and then pass them into url() (http://api.drupal.org/api/function/url/6) or l() to get the sanitised, correct version?
ekes
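As a rough illustration of the first half of this suggestion, the sketch below shows how parse_url() splits a full URL into parts. The input URL is made up for the example, and the hand-off to Drupal's url() or l() is not shown because it needs a bootstrapped Drupal site.

```php
<?php
// Illustrative input with a space in both the path and the query.
$parts = parse_url('http://example.com/path with spaces/?q=a b#top');

// parse_url() only splits the string; it does not validate or encode anything,
// so each part still has to be encoded according to the rules for that part.
foreach ($parts as $name => $value) {
  echo $name, ': ', $value, "\n";
}
```

Once the parts are separate, the path and the query can each be handled with the encoding rules that apply to them.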
This doesn't seem to work:
url('example.com/with spaces', array('absolute' => TRUE, 'external' => TRUE));
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
For one, the path passed in
url('example.com/with spaces', array('absolute' => TRUE, 'external' => TRUE));
isn't an absolute URL. You should pass it the http:// as well. Setting 'absolute' => TRUE only makes internal URLs come out with the full http:// prefix.
----- Adam A. Gregory, Drupal Developer & Consultant
Web: AdamAGregory.com Twitter: twitter.com/adamgregory Phone: 910.808.1717 Cell: 919.306.6138
I am sorry, my mistake. But this also doesn't seem to work:
url('http://example.com/with spaces', array('external' => TRUE));
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
I just tested the following and it spit it out correctly:
url("http://test.com/test dfsdaf dsfsadf", array('external' => TRUE, 'absolute' => TRUE));
What are the actual characters being used? In your initial example you had some random characters like ;<>+,.()!* which may be the reason it's not working.
----- Adam A. Gregory, Drupal Developer & Consultant
I am currently just testing with spaces. It's strange that the above wasn't converted to
http://test.com/test%20dfsdaf%20dsfsadf
on my machine. I am using D6.15 with PHP 5.3.
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
nitin gupta wrote:
I am currently just testing with spaces. It's strange that the above wasn't converted to:
http://test.com/test%20dfsdaf%20dsfsadf
on my machine. I am using D6.15 with PHP 5.3.
Are you aware of http://drupal.org/node/360605 ?
-- Earnie -- http://www.give-me-an-offer.com
@Earnie: I am not sure if this is affecting it, but I will check it out.
I am using the following to solve the problem; any ideas to improve it, in terms of efficiency or otherwise, are welcome:
function encodeurl($url) {
  // Reserved character => regex matching its percent-encoded form.
  $reserved = array(
    ":" => '!%3A!ui', "/" => '!%2F!ui', "?" => '!%3F!ui',
    "#" => '!%23!ui', "[" => '!%5B!ui', "]" => '!%5D!ui',
    "@" => '!%40!ui', "!" => '!%21!ui', "$" => '!%24!ui',
    "&" => '!%26!ui', "'" => '!%27!ui', "(" => '!%28!ui',
    ")" => '!%29!ui', "*" => '!%2A!ui', "+" => '!%2B!ui',
    "," => '!%2C!ui', ";" => '!%3B!ui', "=" => '!%3D!ui',
  );
  // Decode first so that already-encoded input is not encoded twice.
  $url = rawurlencode(rawurldecode($url));
  // Turn the encoded reserved characters back into their literal form.
  $url = preg_replace(array_values($reserved), array_keys($reserved), $url);
  return $url;
}
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
There's an old quote [1] that seems somewhat apt here:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
That's not entirely apt, as your regular expression might as well be done with str_replace(), but you are adding problems rather than removing them. You should really scrap this whole thing and take a few steps back rather than adding more to it; this will break URLs due to flaws in the fundamental approach.
rawurlencode and rawurldecode are meant to be used on fragments of URLs, not whole URLs. It's impossible to properly encode an entire URL without first breaking it up into component parts, because the different parts require different encoding. For example, "/" should be encoded in a query string, but not in a path. Treating it the same everywhere is why you're having the problem with delimiters being encoded. The preg_replace() only hides this problem, while introducing new problems (not encoding things that should be encoded); it's not a solution.
To illustrate the problem, consider this URL:
http://www.google.com/search?q=%22a%26b%22
That's a Google search for the phrase "a&b". Your function turns that into this:
http://www.google.com/search?q=%22a&b%22
That's a Google search for "a, which returns completely different results.
Backing up, you apparently have input that looks like this:
http://example.com/path with spaces/
That's not a valid URL, so it needs to be fixed somewhere. Ideally it would be fixed at the source, but if that's not an option, you can fix this specific problem simply with
str_replace(' ', '%20', $url);
That won't break anything else because spaces aren't URL delimiters. I'm guessing your input has more complex problems with invalid URLs, as your attempted solution is broader in scope. It's hard to say what you should do without knowing more about the input. What does the raw XML look like?
[1] http://www.codinghorror.com/blog/2008/06/regular-expressions-now-you-have-tw...
-- Scott Reynen MakeDataMakeSense.com
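The failure described in this message can be reproduced with a short script. The function below is a condensed stand-in for the encodeurl() posted earlier in the thread: the name encodeurl_demo() and the way the pattern list is built are assumptions made for brevity, while the decode, encode and replace steps follow the posted code.

```php
<?php
// Condensed stand-in for the function posted earlier (hypothetical name).
function encodeurl_demo($url) {
  // The same 18 reserved characters, in the same order as in the posted array.
  $chars = str_split(':/?#[]@!$&\'()*+,;=');
  $patterns = array();
  foreach ($chars as $char) {
    // e.g. '&' gives the pattern '!%26!i'.
    $patterns[] = '!' . rawurlencode($char) . '!i';
  }
  $url = rawurlencode(rawurldecode($url));
  return preg_replace($patterns, $chars, $url);
}

// The encoded '&' inside the search phrase comes back as a literal delimiter.
echo encodeurl_demo('http://www.google.com/search?q=%22a%26b%22'), "\n";
// http://www.google.com/search?q=%22a&b%22

// The narrow fix suggested above only touches spaces.
echo str_replace(' ', '%20', 'http://example.com/path with spaces/'), "\n";
// http://example.com/path%20with%20spaces/
```

The first result shows data turning into syntax; the second leaves every delimiter alone.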
Scott,
I was just going to post this, but you caught me. Yes, there is a problem with the solution proposed above, and you caught it right: it will not encode what should be encoded.
You are right that such URLs must be corrected at the source, but that doesn't count as an excuse if my module fails to work properly (http://drupal.org/node/731798). As mentioned in my first post, I know rawurlencode() is meant for components of URLs, but I have no choice, as I cannot assume which literals will be in the URL and which will not.
Incidentally, while working on my other module (Facebook-style Links <http://drupal.org/project/facebook_link>) I found the same bug in Facebook. Try sharing http://www.google.com/search?q=%22a%26b%22 on Facebook and you will see that they are doing the same. ;) I haven't reported this yet, though.
I changed the function to this; let me know your views:
function encode_url($url) {
  // Reserved character => regex matching its percent-encoded form.
  $reserved = array(
    ":" => '!%3A!ui', "/" => '!%2F!ui', "?" => '!%3F!ui',
    "#" => '!%23!ui', "[" => '!%5B!ui', "]" => '!%5D!ui',
    "@" => '!%40!ui', "!" => '!%21!ui', "$" => '!%24!ui',
    "&" => '!%26!ui', "'" => '!%27!ui', "(" => '!%28!ui',
    ")" => '!%29!ui', "*" => '!%2A!ui', "+" => '!%2B!ui',
    "," => '!%2C!ui', ";" => '!%3B!ui', "=" => '!%3D!ui',
  );
  // No rawurldecode() this time: existing escapes become %25xx here.
  $url = rawurlencode($url);
  $url = preg_replace(array_values($reserved), array_keys($reserved), $url);
  // Undo the encoding of '%' so the original escapes are restored.
  $url = preg_replace('!%25!ui', '%', $url);
  return ($url);
}
I am still testing, so let me know if some case fails for the above function.
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
Now just '%25' is lost from the original URL, though that is still not acceptable.
-- Sent from my mobile device
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
There's an old quote [1] that seems somewhat apt here:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
+1
Scott is completely correct. If you want this to be sane to any degree, you'll need to parse the URL into its components before trying to escape anything. Once you know which parts map to the components found in parse_url(), you can apply the appropriate escape rules (such as urlencode()) to them individually. From *there* you should build up the final, escaped URL in the form
$scheme . '://' . $host . '/' . $path . '?' . $query
CM Lubinski http://cmlubinski.info
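The parse, escape and rebuild sequence described here could look roughly like the sketch below. The helper encode_url_by_component() is hypothetical (it is not a PHP or Drupal function), it ignores user and password parts, and it assumes that every path segment and query name or value should be run through rawurldecode() and then rawurlencode(). One known consequence of that assumption is that a literal '+' in a query comes out as %2B.

```php
<?php
// Hypothetical helper: re-encode an absolute URL one component at a time.
function encode_url_by_component($url) {
  $parts = parse_url($url);
  if ($parts === FALSE || !isset($parts['scheme'], $parts['host'])) {
    return FALSE; // Not an absolute URL this sketch can handle.
  }

  // Decode, then encode, so already-escaped input is not escaped twice.
  $encode = function ($piece) {
    return rawurlencode(rawurldecode($piece));
  };

  $out = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $out .= ':' . $parts['port'];
  }

  // Path: '/' separates segments, so each segment is encoded on its own.
  if (isset($parts['path'])) {
    $out .= implode('/', array_map($encode, explode('/', $parts['path'])));
  }

  // Query: '&' separates pairs; the first '=' separates name from value.
  if (isset($parts['query'])) {
    $pairs = array();
    foreach (explode('&', $parts['query']) as $pair) {
      $pairs[] = implode('=', array_map($encode, explode('=', $pair, 2)));
    }
    $out .= '?' . implode('&', $pairs);
  }

  if (isset($parts['fragment'])) {
    $out .= '#' . $encode($parts['fragment']);
  }
  return $out;
}

echo encode_url_by_component('http://example.com/path with spaces/'), "\n";
// http://example.com/path%20with%20spaces/
echo encode_url_by_component('http://www.google.com/search?q=%22a%26b%22'), "\n";
// http://www.google.com/search?q=%22a%26b%22
```

Because delimiters are consumed while splitting, they are never passed to rawurlencode(), and the encoded '&' in the Google example stays data.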
Hi,
I completely agree with what you and Scott are trying to say. But I am not looking to create a URL, just to sanitize it by escaping disallowed characters, i.e. what a browser would do when a user types a URL into the address bar. Consider that I parse the following URL from XML:
http://example.com?test/com
Do you think I should encode the '/' in the query part, i.e. [test/com]? I don't think we need to. (Nor will Firefox, if you enter this URL in the address bar.)
If a URL contains characters which are allowed in the URL dictionary, will we ever need to encode those characters? No. My point is that the only characters we need to encode are those which are disallowed. Of course, if we encounter something like this:
http://example.com?test%3Fcom
we must not decode it either, i.e. we must maintain the integrity of the URL while checking its validity.
This is how I am thinking about it; please let me know of a case which proves otherwise.
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
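For reference, the approach argued for in this message (escape only what is not allowed, touch nothing else) can be sketched in a few lines. The helper name escape_disallowed() is hypothetical, and the allowed set is an assumption: the unreserved and reserved characters of RFC 3986 plus '%'. A stray '%' that is not part of an escape is therefore left as it is, and the sketch cannot tell whether a literal '&' or '/' in the input was meant as data.

```php
<?php
// Hypothetical helper: percent-encode only the bytes outside the allowed set.
function escape_disallowed($url) {
  return preg_replace_callback(
    // Anything that is not unreserved, reserved, or '%'.
    '/[^A-Za-z0-9\-._~:\/?#\[\]@!$&\'()*+,;=%]/',
    function ($m) {
      return rawurlencode($m[0]);
    },
    $url
  );
}

echo escape_disallowed('http://example.com/q=12 &12grt key=value;<>+,.()!*'), "\n";
// http://example.com/q=12%20&12grt%20key=value;%3C%3E+,.()!*
echo escape_disallowed('http://example.com?test%3Fcom'), "\n";
// http://example.com?test%3Fcom
```

The pattern has no 'u' modifier, so a multi-byte UTF-8 character is escaped byte by byte.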
On Mar 13, 2010, at 8:20 PM, nitin gupta wrote:
I completely agree to what you and Scott are trying to say. But, I am not looking to create an URL, just to sanitize it to remove disallowed character, i.e. what a browser would do while accessing a URL when a user inputs an URL. Consider, I parse the following URL from XML:
http://example.com?test/com
Do you think I should encode the '/' in the query part i.e. [test/com]??
Technically, yes, but that's beside the point. Regardless of how strictly you choose to apply URL encoding, you should be applying it to specific URL parts, not full URLs.
I don't think we need to. (Nor will Firefox, if you enter this URL in the address bar).
You're right that encoding the slash character isn't particularly important in the query. In a path segment, however, the difference between encoded and unencoded slashes is very significant; http://example.com/a/b/c is different than http://example.com/a%2fb/c. And a slash definitely shouldn't be encoded where it's used as a delimiter between URL components. This is actually a good example of why encoding must be applied to individual URL components, not the full URL.
If a URL contains characters which are allowed in the URL dictionary, will we ever need to encode those characters? No.
What is the URL dictionary? Here's one of the relevant RFCs on URLs:
http://www.ietf.org/rfc/rfc3986.txt
Selected quotes:
"A percent-encoding mechanism is used to represent a data octet _in_a_component_"
"the conflicting data must be percent-encoded _before_the_URI_is_formed_"
Emphasis added to, well, emphasize that encoding applies to component parts.
-- Scott Reynen MakeDataMakeSense.com
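The point about slashes in a path can be checked with a small sketch. The helper path_segments() is hypothetical; it assumes a consumer that splits the path on literal '/' first and decodes each segment afterwards.

```php
<?php
// Hypothetical helper: split on literal '/' first, then decode each segment.
function path_segments($path) {
  return array_map('rawurldecode', explode('/', ltrim($path, '/')));
}

print_r(path_segments('/a/b/c'));    // three segments: a, b, c
print_r(path_segments('/a%2fb/c'));  // two segments: a/b, c
```

Decoding the whole path before splitting would erase that difference, which is why the encoded and unencoded forms cannot be treated as interchangeable.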
Hi Scott,
I don't think we contradict each other here; we differ in point of view. I am saying that we should not encode what is an allowed character. If the URL is already present somewhere in a form like (http://example.com/hj/hj), there is no need to encode it, and if it is present in a form like (http://example.com%2f/test), there is no need to decode it. What should you do if you get such a URL? Just do not touch it, because it contains no invalid character.
@URL dictionary: Are you kidding?? I was obviously referring to the same RFC.
I would like you to think for a moment and tell me what you gain by breaking the URL into components, encoding them, and then joining them again.
Consider this problem statement: you are given a URL, which is extracted from the source HTML of a webpage, and you need to access it using drupal_http_request().
I am, of course, interested in improving what I currently have in hand.
"Fire me all you can, but cast me into a solid and beautiful pot"
-- Regards, Nitin Kumar Gupta http://publicmind.in/blog/
participants (8): Adam Gregory, CM Lubinski, Don, Earnie Boyd, ekes, João Gustavo Taveira, nitin gupta, Scott Reynen