[drupal-devel] [bug] Problems with using relative path names

chx drupal-devel at drupal.org
Mon Aug 22 14:30:25 UTC 2005


Issue status update for 
http://drupal.org/node/13148
Post a follow up: 
http://drupal.org/project/comments/add/13148

 Project:      Drupal
-Version:      4.6.0
+Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  kbahey
 Updated by:   chx
-Status:       patch (code needs work)
+Status:       patch (code needs review)
 Attachment:   http://drupal.org/files/issues/base_kill.patch (8.25 KB)

Reworked.




chx



Previous comments:
------------------------------------------------------------------------

Fri, 19 Nov 2004 03:29:34 +0000 : kbahey


Looking at my site's logs, there seem to be several problems that are
caused by Drupal's use of relative path names.



If Drupal causes all the site's urls to be absolute, then none of this
would be an issue.


A. Search Engine Crawlers
Getting lots of 404s on things like: linux/index.html/robots.txt


Where 'linux' is an alias to a taxonomy, and 'index.html' is an alias
to a node within that taxonomy.


Another example, is recursing unnecessarily. I see 404s on things like:
/linux/index.html/linux/index.html


Where 'linux' is a path alias for a taxonomy term, and 'index.html' is
an alias to the main node within it.


This does not seem to happen when Google crawls my sites, but Yahoo's
Slurp suffers from this problem, and keeps recursing. MSNBot also
suffers from this.


Another crawler/harvester called Blinkx/DFS-Fetch keeps adding the .css
file to the relative path, getting a 404 on things like:
/linux/themes/xtemplate/pushbutton/logo.gif


And Fast Search Engine also attempts to access:
/linux/contact/tracker/tracker/user/password


The same goes for grub.org, another crawler.


B. Google Cache / Archive Way Back Machine
Pages in Google cache and archive.org Way Back Machine suffer from a
similar problem: the .css files cannot be found, and hence rendering of
the pages is not correct.


Examples:


Compare this: http://www.drupal.org/node/4647
To this:
http://www.google.ca/search?q=cache:www.drupal.org/node/view/4647


Notice the following:



* How there is no formatting at all, because of the lack of a .css file
* The httpd log on Drupal will show errors for:
linux/themes/pushbutton/style.css and linux/misc/drupal.css

Also see:
http://web.archive.org/web/20031016184902/http://www.drupal.org/


C. Proxy Caches:
When someone is browsing my site from behind a proxy cache, the web
site is hit with a rapid succession of requests, and many of them are
for bogus pages.


Examples:



2004/11/17 - 17:47 404 error: linux/user/1 not found.
2004/11/17 - 17:47 404 error: linux/feedback not found.
2004/11/17 - 17:47 404 error: linux/tracker not found.
2004/11/17 - 17:47 404 error: linux/sitemap not found.
2004/11/17 - 17:47 404 error: linux/search not found.
2004/11/17 - 17:47 404 error: linux/misc not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/linux not found.
2004/11/17 - 17:47 404 error: linux/technology not found.
2004/11/17 - 17:47 404 error: linux/writings not found.
2004/11/17 - 17:47 404 error: linux/family not found.
And also:



2004/11/17 - 07:23 404 error: history/user/1 not found.
2004/11/17 - 07:23 404 error: history/tracker not found.
2004/11/17 - 07:23 404 error: history/feedback not found.
2004/11/17 - 07:23 404 error: history/sitemap not found.
2004/11/17 - 07:23 404 error: history/search not found.
2004/11/17 - 07:23 404 error: history/misc not found.
2004/11/17 - 07:23 404 error: history/technology not found.
2004/11/17 - 07:23 404 error: history/science not found.
2004/11/17 - 07:22 404 error: history/history not found.
2004/11/17 - 07:22 404 error: history/writings not found.
2004/11/17 - 07:22 404 error: history/family not found.
As you can tell, history and linux are aliases to taxonomy terms, and
so are misc, technology, writings, family, etc. The user agent is
appending the taxonomy term alias to the URL and forming a new URL.
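
The failure mode in these logs can be sketched as follows. This is only
an illustration, not code from Drupal: resolve_without_base() is a
hypothetical name, and the sketch assumes a user agent that ignores the
<base> tag and resolves each relative href against the directory of the
page it is currently on.

```php
<?php
// Hypothetical helper: mimics a user agent that ignores <base href>.
function resolve_without_base(string $page_path, string $href): string {
  // Links that start with "/" do not depend on the page's location.
  if ($href !== '' && $href[0] === '/') {
    return $href;
  }
  // Otherwise the href is appended to the "directory" of the current page.
  $pos = strrpos($page_path, '/');
  $dir = ($pos === false) ? '/' : substr($page_path, 0, $pos + 1);
  return $dir . $href;
}

// A page aliased as linux/index.html that links to "user/1":
echo resolve_without_base('/linux/index.html', 'user/1'), "\n";  // /linux/user/1
// The same link written with a leading slash:
echo resolve_without_base('/linux/index.html', '/user/1'), "\n"; // /user/1
```

The first result matches the "linux/user/1 not found" entries above; the
second shows why a leading slash removes the ambiguity.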



D. Regular Browsing:
There is even at least one extreme case where the following URL was
accessed (the result was 404 of course)


/book/view/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/...
(the "themes/xtemplate/pushbutton/" segment is repeated many more times)
.../themes/xtemplate/pushbutton/logo.gif
It seems it was a normal user, because the user agent is: "Mozilla/4.0
(compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"


Proposed Solution:
As a proposed solution, all URLs in Drupal can be made into absolute
path names. This can be done by the following:



* The variable $base_url in the conf.php file is broken down into two
components:

* $base_host (the 'http://whatever-host.example.com' part WITHOUT the
trailing slash)
* $base_path (the '/path-to-drupal' part, WITH the leading slash. If
this is the DocumentRoot, then it is just a '/' character)

* $base_url is now $base_host concatenated with $base_path
* A simple filter can be written to prefix every href="path" with the
$base_path variable, so it becomes "/path"
* This option can be turned on and off for a site. The default is to
have it off so current behavior is maintained.
* A similar scheme applies for style sheets as well.
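
The split described in the list above can be sketched roughly like
this. It is an assumption-laden illustration, not a patch:
split_base_url() and prefix_href() are hypothetical names, and the
example URL is made up. Only PHP's built-in parse_url() is relied on.

```php
<?php
// Hypothetical helper: splits a configured base URL into host and path.
function split_base_url(string $base_url): array {
  $parts = parse_url($base_url);
  // 'http://whatever-host.example.com', without a trailing slash.
  $base_host = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $base_host .= ':' . $parts['port'];
  }
  // '/path-to-drupal' with a leading slash, or '/' for the DocumentRoot.
  $base_path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
  if ($base_path === '') {
    $base_path = '/';
  }
  return [$base_host, $base_path];
}

// Hypothetical filter step: makes a relative href start with $base_path.
function prefix_href(string $base_path, string $href): string {
  return rtrim($base_path, '/') . '/' . ltrim($href, '/');
}

[$base_host, $base_path] = split_base_url('http://whatever-host.example.com/path-to-drupal');
echo prefix_href($base_path, 'node/1'), "\n"; // /path-to-drupal/node/1
```

For a site in the DocumentRoot, $base_path is '/' and the same call
yields "/node/1".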

So, did I miss something obvious? Am I seriously off the mark?


Your thoughts!




------------------------------------------------------------------------

Sat, 20 Nov 2004 04:10:17 +0000 : chrisada

I am getting similar 404 errors, mainly from an RSS feed link that looks
like /blog/blog/feed and many manual links that are relative to the
Drupal root.


It was not a problem before Drupal 4.5, so I think there might not be a
need to change all URIs to absolute. I can't see where the problem is
coming from though.




------------------------------------------------------------------------

Sat, 20 Nov 2004 04:35:06 +0000 : kbahey

I am pretty sure that these problems were happening for at least the
past 10 months (ever since I moved to Drupal in January 2004).


The main issue here is that crawlers and other user agents get confused
by the relative path names. 


Using absolute paths will definitely solve this.  However, is this the
only solution? 


I am looking for a discussion of this.




------------------------------------------------------------------------

Sat, 20 Nov 2004 09:55:21 +0000 : Goba

No absolute paths please. Having the path start with '/' solves all the
mentioned problems, and is not absolute, it is relative to the domain.
Sadly some crawlers and even the Google Cache do not obey the base
href. I have reported this cache problem in April to Google, and they
promised they will keep it in mind... Hehe...


What we need is to have the printed relative path values relative to
the domain name, and not relative to the Drupal installation path.


Note that this issue will appear on the drupal devel mailing list if
someone finally provides a patch we can talk about :)




------------------------------------------------------------------------

Sat, 20 Nov 2004 10:19:30 +0000 : Dries

Goba is right.  We need paths relative to the domain name to fix this
'problem'.




------------------------------------------------------------------------

Sat, 20 Nov 2004 20:19:32 +0000 : kbahey

Sorry for not making myself clear.


When I said absolute, I meant that they start with just a /. I did NOT
mean that they start with http://host.example.com. That would be a very
bad idea.


In any case, what do people think about the proposed solution (breaking
down $base_url into two parts?)


Also, does this address the style sheets as well, or is more needed?




------------------------------------------------------------------------

Sun, 21 Nov 2004 16:24:27 +0000 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch.txt (471 bytes)

I have implemented what Goba suggested.




------------------------------------------------------------------------

Sun, 21 Nov 2004 16:43:56 +0000 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_0.txt (825 bytes)

Maybe this one is faster?




------------------------------------------------------------------------

Sun, 21 Nov 2004 17:34:37 +0000 : kbahey

Man! You are fast!


I tried the second version. It works fine for things that are not
inside the node body, I mean they have  a / in front of them, as we
want it to be.


Three comments/issues:


- If there is a URL that is already "/" representing the home page, it
gets set to "//".  Perhaps it should check for that case?


- URLs in nodes that do not start with / do not get changed to have a /
prepended to them. Do we need a filter for this?


- Do we need to do something for the style sheets in the page header? I
mean the "misc/drupal.css" and "themes/themename/style.css"?


Thanks




------------------------------------------------------------------------

Mon, 22 Nov 2004 00:28:01 +0000 : kbahey

Hi chx


Here is a fix for the case where you have a url that is just "/".


In your patch, instead of:



<?php
  $base = $parts['path'] . '/' ; 
?>




Replace that by:



<?php
  $base = ( $path == '/' ? $base : $parts['path'] . '/' ); 
?>






------------------------------------------------------------------------

Sun, 28 Nov 2004 04:08:56 +0000 : kbahey

Did this patch make it into CVS yet?


If there are any objections to it, can someone please explain what they
are?


Thanks




------------------------------------------------------------------------

Sun, 28 Nov 2004 09:33:01 +0000 : Dries

Shouldn't your changes be included in the patch?


Also, it's better to cache $base rather than $parts.


Lastly, if this patch makes it to HEAD, we should probably remove some
'base url' cruft from the themes.




------------------------------------------------------------------------

Sun, 28 Nov 2004 18:54:02 +0000 : kbahey

Attachment: http://drupal.org/files/issues/x.diff (1 KB)

Here is the patch including my fix.


I am asking chx to comment on caching $base instead of $parts.


Will this make it faster?




------------------------------------------------------------------------

Sun, 28 Nov 2004 19:26:59 +0000 : chx

Hm. $base = ( $path == '/' ? $base : $parts['path'] . '/' ); this
depends on $path, which is a parameter. Thus I fail to see how we could
cache $base. I'd correct this code, however, to $base = ( $path == '/' ?
'' : $parts['path'] . '/' ); 'cos I think $base is not defined before,
but this is not a problem, PHP will be happy to replace NULL with
NULL...


Maybe instead of all parts, only $parts['path'] is enough to be cached,
yes, but the performance and memory usage difference -- I guess -- would
not be noticeable...
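
A rough sketch of the idea discussed here, caching only the path part
and special-casing "/", might look like the following. It is not the
attached patch: root_relative_url() is a hypothetical name, and the base
URL is passed as a parameter purely to keep the example self-contained.

```php
<?php
// Hypothetical sketch, not the patch attached to this issue.
function root_relative_url(string $base_url, string $path): string {
  // Cache only the path component of each base URL between calls.
  static $base_paths = [];
  if (!isset($base_paths[$base_url])) {
    $parts = parse_url($base_url);
    $base_paths[$base_url] = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
  }
  // A bare "/" must not become "//", so no prefix is added in that case.
  $base = ($path == '/' ? '' : $base_paths[$base_url] . '/');
  return $base . $path;
}

echo root_relative_url('http://example.com/drupal', 'node/1'), "\n"; // /drupal/node/1
echo root_relative_url('http://example.com', 'node/1'), "\n";        // /node/1
echo root_relative_url('http://example.com', '/'), "\n";             // /
```

Since $base depends on $path, only the parsed path is kept in the static
variable; $base itself is recomputed on every call.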




------------------------------------------------------------------------

Sun, 28 Nov 2004 23:06:39 +0000 : kbahey

Attachment: http://drupal.org/files/issues/common-inc-patch.txt (1 KB)

OK.


I put in chx's suggested change.


This patch can go in CVS then, to rid us of the problems with paths not
beginning with slash.


This is not an ultimate solution still. We need to address the problem
with .css files. Although the header contains a:


<base href="http://example.com" />


it does not seem that major search engines and archiving sites obey it
anyway.




------------------------------------------------------------------------

Thu, 02 Dec 2004 20:31:27 +0000 : Dries

Your coding style needs work.  Also, I'm not going to commit this unless
the themes get fixed up: we'd end up with invalid URLs all over the
place.  Lastly, I wonder how portable the themes will be when Drupal is
run from within a subdirectory.




------------------------------------------------------------------------

Thu, 02 Dec 2004 21:15:19 +0000 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_1.txt (849 bytes)

Well, my patch worked from a subdirectory very well; in fact, I have not
tested it from the root dir. And I think that it adheres to coding
standards. So I resubmit it with the root path fix. However, my Drupal
work is focused on i18n these days, and I was never into theming so it
won't be me who fixes those.




------------------------------------------------------------------------

Thu, 02 Dec 2004 22:07:08 +0000 : kbahey

I have tested the previous patch (including my fix) with drupal
installed in the DocumentRoot of the server.


So, in effect, it is tested with both Drupal in / and Drupal in a
subdirectory.


This change keeps crawlers and other user agents from getting
confused.


While it is true that there is no fix for the .css files in the HTML
head section yet, this fix deals with a major part of the problem, and
rids us of a major pain. Check your web server's logs some time to see
what I mean.


Someone who is familiar with the themes can contribute a patch later. 


This patch and the future fix for themes are not mutually exclusive, so
let it go in CVS.




------------------------------------------------------------------------

Thu, 09 Dec 2004 15:31:57 +0000 : Goba

Please commit this into Drupal core, this fix is badly needed.




------------------------------------------------------------------------

Mon, 17 Jan 2005 13:00:12 +0000 : chx

Attachment: http://drupal.org/files/issues/base_url_kill.patch (4.34 KB)

Well, as no one has stepped in to fix this problem, I have tried to fix
the themes also. themes.inc, xtemplate.engine and the bluemarine
template are patched besides common.inc.


Of course, more templates could follow, but first I'd like to see your
opinions.




------------------------------------------------------------------------

Mon, 17 Jan 2005 13:09:10 +0000 : Goba

I don't think that removing <base> from the themes is a good idea.
Prefixing the files with $parts['path'] should be encouraged though; it
would fix the Google cache problem and still keep the HTML size low.
Keeping <base> would also help those who save the file find the
originating site more easily, since clicking on a non-page-local link
would lead to the online version.




------------------------------------------------------------------------

Mon, 17 Jan 2005 16:03:02 +0000 : Steven

Definitely -1 on removing the <base> tag or using absolute or
root-relative URLs. This tag has been around for ages, and it is the
only way to make clean URLs work without bloating the code. FYI,
"base" is (first?) mentioned in Berners-Lee's HTML 1.0 draft [1].
That's June 1993.


As the amount of clean URL-using sites grows, the crawlers will have to
be updated. Perhaps we could prevent crawlers from going too insane by
404ing for URLs with more than say 10 components? That would prevent
the really crappy ones from hammering your site.
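
The component limit suggested above could be sketched as a simple check
like this. The function name and the default limit of 10 are
assumptions taken from the wording of the comment, not existing Drupal
code.

```php
<?php
// Hypothetical check: does a request path have more than $max components?
function has_too_many_components(string $path, int $max = 10): bool {
  // Split on "/" and ignore empty pieces from leading or doubled slashes.
  $components = array_filter(explode('/', $path), function ($c) {
    return $c !== '';
  });
  return count($components) > $max;
}

var_dump(has_too_many_components('node/1')); // bool(false)
var_dump(has_too_many_components(
  str_repeat('themes/xtemplate/pushbutton/', 5) . 'logo.gif'
)); // bool(true)
```

A caller would answer such requests with a 404 before doing any further
work.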


I'm all for making the <base> tag easier to handle for the user (say,
by including a filter to allow simple anchor links to work as most
people expect them to), but we should keep Drupal-generated URLs clean
and completely relative.
[1] http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt




------------------------------------------------------------------------

Mon, 17 Jan 2005 16:57:09 +0000 : kbahey

The problem with css is this: The @import argument does not start with a
/.


This is simple to fix.


We keep the "base" as it is today, but add the new variable: $base
before it.


So for a site where Drupal is installed in the DocumentRoot, all that
will change is that /misc/drupal.css and /themes/themename/style.css
will be preceded by a slash. For sites that use another path, that path
will be prepended to the css file name.


How about that?




------------------------------------------------------------------------

Mon, 17 Jan 2005 17:38:10 +0000 : Steven

What exactly is the problem with the @import? As far as I know:


- url() in stylesheets is interpreted relative to the base of the
stylesheet, not the source document.
- However, if the styles are inside an HTML document, through a style
tag or style attribute, then the stylesheet's location is the same as
the HTML document.
- Thus, the stylesheet's base is the same as the base of the HTML
document (which can be altered through the <base> tag).


I just don't see why it is necessary. As far as I know, the only
browser that has had problems resolving CSS urls properly was Netscape
4, which does not support @import at all, and which Drupal does not
support either, because of its CSS usage.




------------------------------------------------------------------------

Mon, 17 Jan 2005 18:35:07 +0000 : kbahey

The problem for stylesheets is as follows. I think it mainly affects
crawlers and Google's cache.


Say you have an installation of Drupal in DocumentRoot. You then use url
aliases, and put slashes in them.


For example, you use news/general/2004-12-15.html for a node.


That node still has misc/drupal.css and themes/pushbutton/style.css in
the head section of the document. Crawlers get fooled by that and try
to look for /news/general/misc/drupal.css and
/news/general/themes/pushbutton/style.css, which don't exist.


So, just prepending the new $base variable (in chx's patch) before the
stylesheet @import argument would fix this issue. Assuming you are in
DocumentRoot, then /misc and /themes would be used instead of just misc
and themes.
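
As a minimal sketch of that change: stylesheet_import() below is a
hypothetical helper, and $base is assumed to be the path prefix with a
trailing slash ('/' for DocumentRoot, or something like '/drupal/' for
a subdirectory install).

```php
<?php
// Hypothetical helper: builds an @import rule with a root-relative path.
function stylesheet_import(string $base, string $file): string {
  // Strip any leading slash from the file so the result has exactly one.
  return '@import "' . $base . ltrim($file, '/') . '";';
}

echo stylesheet_import('/', 'misc/drupal.css'), "\n";
// @import "/misc/drupal.css";
echo stylesheet_import('/drupal/', 'themes/pushbutton/style.css'), "\n";
// @import "/drupal/themes/pushbutton/style.css";
```

With the leading slash in place, a page aliased as
news/general/2004-12-15.html still points at /misc/drupal.css.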


It would still be compliant with standards, be relative to the web
site, and not ambiguous to anyone, be they crawler or browser.


I hope it is clearer now. 


I think chx can change the patch to use the $base instead of $base_url
everywhere, so as to avoid the host/domain name in the urls.




------------------------------------------------------------------------

Mon, 17 Jan 2005 21:36:32 +0000 : Steven

But typical crawlers don't even pay attention to stylesheets, hence it
wouldn't have much use for them. I just don't see why we should adjust
to rare cases of buggy software. Reading out a base URL from an HTML
document is dead easy, and on top of that it doesn't add more
complexity as without the base tag, the document's URL is already an
implicit base which has to be parsed anyway.


I did not like it when we altered the <link> tag to accommodate buggy
RSS readers and I certainly don't like it now, as this is even rarer.
In both cases, it is not Drupal which is at fault.




------------------------------------------------------------------------

Mon, 17 Jan 2005 21:48:45 +0000 : kbahey

Steven


While I agree with most of what you said, the 404s show up in the logs
enough to be a bother.


Perhaps the original design of Drupal did not foresee that people will
use url aliases to mimic directory/file hierarchies. Whether this was
intended or not, it is the way many use Drupal today.


It does not matter where the bug is (Drupal or the external world), as
long as we can stop it ourselves, by adjusting our end of it.


The fix is simple enough and does not break standards (if implemented
as described with a leading / before the css).




------------------------------------------------------------------------

Mon, 17 Jan 2005 21:54:30 +0000 : Steven

It does not break standards, but it does bloat the code in an ugly way.
Why not send an e-mail to the owners of the crawlers and tell them to
implement a standard that is nearly 10 years old [2] (RFC 1808)?


Note that Google Cache now seems to correctly interpret base URLs [3]
and even adds a <base> tag of its own.


By the way, this problem has nothing to do with people using URL
aliases or not, as for a browser the regular nested paths that Drupal
uses (e.g. "node/1") are no different from aliases mimicking files
(e.g. "foo/bar.html").


[2] http://www.faqs.org/rfcs/rfc1808.html
[3]
http://www.google.be/search?q=cache%3Awww.drupal.org&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:en-US:official




------------------------------------------------------------------------

Mon, 17 Jan 2005 22:34:25 +0000 : Goba

Steven, part of the problem is that Google cache does add a base href
even if there is a base href in the document. E.g. it adds a <BASE
HREF="http://drupal.org/node/13733"> on the cached Plone comparison
page. Since HTML does not allow more than one base tag [4] to be
present, it is up to the browsers to use the first or the last base
value, or any of the base values on the page for that matter, as the
used base. So even pages displayed from the Google cache will be buggy
if a full path relative to the domain root is not specified, due to
this problem.
[4]
http://www.w3.org/TR/1999/REC-html401-19991224/sgml/dtd.html#head.content




------------------------------------------------------------------------

Tue, 18 Jan 2005 07:14:21 +0000 : chx

Attachment: http://drupal.org/files/issues/base_url_kill_0.patch (4.75 KB)

This one does not use the whole base_url, only the path part of it. HTML
bloat is kept to a minimum.




------------------------------------------------------------------------

Tue, 01 Feb 2005 23:33:58 +0000 : clairem

Please please can this be done?


It's a good idea in itself, but if using fully-qualified paths means we
can get rid of the BASE HREF, then page anchors will work without having
the overhead of a filter. That'd be a huge bonus for those creating
larger nodes, or who just want to be able to put a "skip navigation"
link in their theme without having to abandon Xtemplate or PHPtemplate




------------------------------------------------------------------------

Tue, 01 Feb 2005 23:37:23 +0000 : Goba

Well, speaking of skip navigation links, phptemplate and xtemplate
should expose the REQUEST_URI to the templates, so when a link to an
anchor on the same page is needed, the link can be formatted with the
complete request URI in mind.
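
One way such a link could be built, assuming the request URI is made
available to the template, is sketched below. same_page_anchor() is a
hypothetical name, not an existing phptemplate or xtemplate function.

```php
<?php
// Hypothetical helper: builds an href to an anchor on the current page,
// so the link still works when a <base href> is present.
function same_page_anchor(string $request_uri, string $fragment): string {
  // Drop any fragment already present on the request URI.
  $uri = explode('#', $request_uri, 2)[0];
  // Escape the result for safe use inside an HTML attribute.
  return htmlspecialchars($uri . '#' . $fragment, ENT_QUOTES);
}

echo same_page_anchor('/node/13148', 'content'), "\n"; // /node/13148#content
```

A "skip navigation" link would then use this value as its href instead
of a bare "#content", which would otherwise resolve against the base.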




------------------------------------------------------------------------

Wed, 02 Feb 2005 23:40:36 +0000 : clairem

"phptemplate and xtemplate should expose the REQUEST_URI to the
templates"

Should, but don't :(


If BASE HREF isn't removed, surely it wouldn't be a big job to
implement this tweak?




------------------------------------------------------------------------

Thu, 17 Feb 2005 14:50:29 +0000 : kbahey

This patch is badly needed. The lack of a leading / in many paths is
causing lots of problems.




------------------------------------------------------------------------

Sat, 12 Mar 2005 20:32:26 +0000 : kbahey

Can this patch be applied for 4.6? It is really badly needed.




------------------------------------------------------------------------

Tue, 22 Mar 2005 20:50:55 +0000 : Dries

I don't see why this is badly needed.  We generate perfectly valid URLs
which are supposed to be short and crispy.  This patch has some
advantages though, yet it is unclear which patch to go with.




------------------------------------------------------------------------

Tue, 22 Mar 2005 21:00:45 +0000 : chx

The second patch is better.




------------------------------------------------------------------------

Tue, 22 Mar 2005 21:46:27 +0000 : grohk

Forgive me for saying so,  but since the way Drupal is generating
hyperlinks is completely valid, why are you suggesting Drupal should
move away from an accepted standard when the problem lies with the
search engines?


At the very least, this needs to be optional -- which it appears to be
-- I hate the 404s too, but I hate to hear that a change in Drupal is
needed to fix a Google problem.




------------------------------------------------------------------------

Tue, 22 Mar 2005 22:03:51 +0000 : jhriggs

I have to agree with the last comment from grohk.




------------------------------------------------------------------------

Tue, 22 Mar 2005 22:39:33 +0000 : mathias

I also agree with the two previous Drupaleers, but I wouldn't mind
enabling a 'quirks mode' via my conf file to stop the flood of 404
messages.




------------------------------------------------------------------------

Wed, 23 Mar 2005 00:37:52 +0000 : kbahey

I really can't fathom why some of us cannot deal with the realities
out there in the world.


These problems are not because Drupal is broken. It is because crawlers
are. We cannot just bury our collective heads in the sand and say that
we are standards compliant and forget about what is out there. 


As an analogy, people who design themes or write CSS have to deal with
the ugliness of Microsoft Internet Explorer and its intentional going
against standards. You cannot tell a client or your boss that you are
not modifying a theme that works perfectly on Konqueror and Firefox
because MS IE is broken.


Similarly, we cannot ignore that crawlers from major search engine
companies are broken or confused, and keep recursing through sites using
Drupal causing countless errors in the logs. We cannot tell our users to
ask Google and Yahoo et al to fix their software.


Remember that we are not breaking any standards by implementing this
patch. All we are doing is putting the entire path out (from the first
/ down) and thus eliminating ambiguity for everyone.


Sorry if I am a bit blunt in this post, but I am tired of what may be
seen as isolationist thinking.


I do not mind if this is implemented in an advanced mode or via a
settings.php thing. All I care about is getting it fixed somehow.




------------------------------------------------------------------------

Fri, 29 Apr 2005 23:29:41 +0000 : kbahey

Circular log errors reported here too http://drupal.org/node/9499




------------------------------------------------------------------------

Sat, 30 Apr 2005 19:35:50 +0000 : shane

I agree with KBAHEY.  Burying our heads in the sand and saying "it ain't
our problem" is not going to fix the issue.  I despise companies that
break standards - and I applaud Drupal for working hard to keep within
those confines.  But the reality is money grubbing, lazy ass
programmers exist the world over, and the consequence is things like
MSIE breaking everything wantonly and intentionally, Google, Yahoo, et
al implementing poor bot code, etc...  


I believe this desperately needs to get fixed.  Ever since I started
hand writing HTML code in 1992 I have always ensured that my URL paths
are absolute to the base html document root (eg, preceded with "/" and
the full path).  It avoids confusion, problems, or issues.  It seems odd
that the debate over this would rage as it has in this thread.  


...and I don't get the "bloat" discussion.  How is this bloating
things?  Are we talking a few dozen extra characters?  I hope I'm
missing something more obvious and insidious than that!?


I've been a loyal Drupaler for ages now, and I love it.  But this new
problem is causing me a lot of grief, I see frequent munging of the
URLs, and it worries me; particularly when I see that there are end
users getting 404s.  They don't give a rats ba-tu-tey that Drupal is
"standards compliant" ... they just know they got an error when they
supposedly did exactly what they should have, click on a URL.  That
reflects poorly on the site owner and ultimately on the software
itself.


Please reconsider this issue, and let a patch go into core to fix it. 
It doesn't make sense to let it rage on as an issue that is causing
lots of people obvious grief.  I'm betting it's a bigger issue than
most admins think - most don't spend the anal-retentive time that I
and others do grubbing through our logs, trying to ensure a "perfect"
surfing experience for our end-users...




------------------------------------------------------------------------

Sun, 01 May 2005 22:09:58 +0000 : clydefrog

This is truly not a problem with Drupal, but it may be reasonable to
change Drupal's behavior anyway. 


The base href tag has been in the W3C standards since /1997/ [5].
Failing to observe this tag isn't about being slow on the uptake (as
with MSIE and CSS2). It's about deliberately breaking existing
compatibility.


Has anyone contacted Yahoo, MSN, etc. and told them of this problem? If
and when they fix their crawlers, we need to be able to turn off this
kludge to discourage other more marginal crawlers to observe the
standards.
[5] http://www.w3.org/TR/REC-html32#base




------------------------------------------------------------------------

Sun, 01 May 2005 22:11:04 +0000 : clydefrog

That should be "encourage other more marginal crawlers to observe the
standards."




------------------------------------------------------------------------

Fri, 06 May 2005 19:00:40 +0000 : kbahey

Here are examples from drupal.org itself:


As you can see, if the paths started with a slash, none of this would
have happened.


warning	page not found	06/05/2005 - 10:36	drupal-sites/themes/bluebeach/style.css not found.
warning	page not found	06/05/2005 - 10:36	drupal-sites/themes/bluebeach/print.css not found.
warning	page not found	06/05/2005 - 10:36	drupal-sites/misc/drupal.css not found.
warning	page not found	06/05/2005 - 10:33	about/tracker not found.
warning	page not found	06/05/2005 - 10:33	about/support not found.
warning	page not found	06/05/2005 - 10:33	about/project not found.
warning	page not found	06/05/2005 - 10:33	about/services not found.
warning	page not found	06/05/2005 - 10:33	about/handbook not found.
warning	page not found	06/05/2005 - 10:33	about/features not found.
warning	page not found	06/05/2005 - 10:33	about/forum not found.
warning	page not found	06/05/2005 - 10:33	about/drupal-sites not found.
warning	page not found	06/05/2005 - 10:33	about/druplicon not found.
warning	page not found	06/05/2005 - 10:33	about/download not found.
warning	page not found	06/05/2005 - 10:33	about/donate not found.
warning	page not found	06/05/2005 - 10:33	about/documentation-writers-guide not found.
warning	page not found	06/05/2005 - 10:33	about/contributors-guide not found.
warning	page not found	06/05/2005 - 10:33	about/cvs not found.
warning	page not found	06/05/2005 - 10:33	about/contact not found.
warning	page not found	06/05/2005 - 10:33	about/contribute not found.
warning	page not found	06/05/2005 - 10:33	about/aggregator not found.
warning	page not found	06/05/2005 - 10:33	about/cases not found.
warning	page not found	06/05/2005 - 10:33	about/about not found.
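

The pattern in these log lines can be reproduced with ordinary URL
resolution. The following is only a sketch, written in Python rather
than Drupal's PHP, using the standard library's urljoin. The page URL
and the link values are assumptions chosen to match the log entries,
not taken from drupal.org's actual markup. It compares a client that
honors the base href with one that ignores it and resolves links
against the page's own URL.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration only.
BASE_HREF = "http://drupal.org/"               # what <base href> would declare
PAGE_URL = "http://drupal.org/about/contact"   # page being crawled (assumed)


def resolve(link, honor_base=True):
    """Resolve a link as a client would, with or without <base href>."""
    reference = BASE_HREF if honor_base else PAGE_URL
    return urljoin(reference, link)


relative_link = "tracker"   # no leading slash: depends on the base
rooted_link = "/tracker"    # leading slash: depends only on the host

print(resolve(relative_link, honor_base=True))    # client honors <base>
print(resolve(relative_link, honor_base=False))   # client ignores <base>
print(resolve(rooted_link, honor_base=False))     # slash-prefixed link
```

Under these assumptions, only the second call yields the
"about/tracker" path seen in the log. The slash-prefixed link resolves
to the same place whether or not the client looks at the base href.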




------------------------------------------------------------------------

Fri, 06 May 2005 19:14:23 +0000 : grohk

Just because some of us disagree with this solution to the perceived
problem does not mean we are not fond of reality, it just means we have
a different way of seeing this issue.  If we cannot use accepted
standards in Drupal, then what good is it to adhere to them in the
first place?


Does this patch really affect the experience of end users of Drupal? 
Unless I am missing something, normal people never see these errors. 
Google is a search engine, it is not a user.  In my experience, users
remain unaware of this "problem".  But changing Drupal to adhere to
preferences of a broken crawler is not going to encourage anyone to fix
their poorly implemented software either.


As someone who appreciates the elegance of Drupal and uses it just as
much as anyone, all this is fine with me -- as long as it is optional. 
But I don't think appeasement is the answer to "fixing" this problem,
because with this option enabled there is no impetus for the crawler
programmers to fix anything.


And for the record, Google has been caching my pages correctly with CSS
for months now and it has not entered into a loop either.




------------------------------------------------------------------------

Fri, 06 May 2005 19:37:37 +0000 : kbahey

"If we cannot use accepted standards in Drupal, then what good is it
to adhere to them in the first place?"

By using paths beginning with slashes, we are not breaking any
standards that I know of.


"Does this patch really affect the experience of end users of Drupal?
Unless I am missing something, normal people never see these errors."

The clutter in the logs is very annoying, and makes it harder for
the site admin to find the info he needs. It also consumes bandwidth.


"Google is a search engine, it is not a user."

The user here is the site admin, not the end user.


"But changing Drupal to adhere to preferences of a broken crawler is
not going to encourage anyone to fix their poorly implemented software
either ... But I don't think appeasement is the answer to 'fixing' this
problem, because with this option enabled there is no impetus for the
crawler programmers to fix anything."

By the same token, we can ignore MS IE's broken CSS handling and a
bunch of other things, and claim that they should fix themselves.
Meanwhile 80% of users are facing these issues. 


That is not the way to look at things. If we can implement something
that does not break standards but spares many of us the grief that this
causes, then why not?


A solution that allows this to be turned on or off, via an option or a
settings.php flag, would make everyone happy.




------------------------------------------------------------------------

Sun, 08 May 2005 14:52:40 +0000 : killes at www.drop.org

The patch apparently hasn't found much favour, setting to "active".


I suggest getting hold of the IP ranges the broken crawlers use and
blocking them in the .htaccess file we ship with Drupal.


Long live open web standards!




------------------------------------------------------------------------

Sun, 08 May 2005 18:08:12 +0000 : kbahey

It is really sad that most of us do not see a problem here, or brush it
off as someone else's problem.


Standards are only valid if everyone follows them. The reality is that
some do not, and depending on the market presence and strength of
those in violation of the standard, they range from insignificant to
something that has to be dealt with.


If web designers ignored Microsoft Internet Explorer, with its blatant
aloofness to standards, unintentional or otherwise, they would be out of
business. 80% of people are still using MS IE. This is exactly the same
issue.


To see the magnitude of the problem, run the following SQL against your
site:


> select hostname from watchdog where type = 'httpd' and message like
'%.css%';


> select hostname, count(*) cnt from watchdog where type = 'httpd' and
message like '%.css%' group by hostname order by cnt asc;


The first shows 5383 rows, the latter shows 1886 rows.
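

To make the difference between the two row counts concrete, here is a
small sketch in Python using an in-memory SQLite database. The table
layout is a simplified assumption containing only the three columns
the queries mention, and the rows are invented sample data, not real
log entries. The first query returns one row per failed .css request;
the second returns one row per distinct host.

```python
import sqlite3

# Simplified stand-in for the watchdog table (assumed schema, sample data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE watchdog (hostname TEXT, type TEXT, message TEXT)")
rows = [
    ("10.0.0.1", "httpd", "linux/themes/x/style.css not found."),
    ("10.0.0.1", "httpd", "linux/misc/drupal.css not found."),
    ("10.0.0.2", "httpd", "about/misc/drupal.css not found."),
    ("10.0.0.3", "httpd", "about/tracker not found."),  # not a .css request
]
conn.executemany("INSERT INTO watchdog VALUES (?, ?, ?)", rows)

# First query: every failed .css request, one row each.
hits = conn.execute(
    "SELECT hostname FROM watchdog "
    "WHERE type = 'httpd' AND message LIKE '%.css%'"
).fetchall()

# Second query: the same requests grouped per host, smallest count first.
per_host = conn.execute(
    "SELECT hostname, COUNT(*) AS cnt FROM watchdog "
    "WHERE type = 'httpd' AND message LIKE '%.css%' "
    "GROUP BY hostname ORDER BY cnt ASC"
).fetchall()

print(len(hits), "failed .css requests")
print(len(per_host), "distinct hosts")
```

Read this way, the first figure above counts requests and the second
counts the hosts that made them.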


As I said we will not be breaking any standards by qualifying our URLs
and making them unambiguous to everyone, starting with the /.




------------------------------------------------------------------------

Sun, 08 May 2005 23:24:50 +0000 : clydefrog

kbahey, have you contacted any of these crawlers to tell them their
software is broken?




------------------------------------------------------------------------

Mon, 09 May 2005 01:30:18 +0000 : kbahey

No. I haven't.


There are 1866 unique IPs over a period of 4 months. Even if we
assume that these are in subnets, and say 10 per subnet, this means I
have to contact 186 separate organizations/individuals, which is a
great effort. Even if I assume that there is skew, and that there are 20
organizations/individuals, it is still a great effort, and how many of
those will respond, let alone fix their crawlers?


The question is: what is within our control and influence and what is
not. This is like writing CSS for MS IE and for other standard
conforming browsers. Or like defensive driving in an area where there
are many rogue drivers. You cannot say that you are conforming to CSS
standards and to hell with the rest of the world, and you will not deal
with them at all. Nor can you say that you are within your lane, at the
set speed and keeping your distance, and will cross a green light while
a drunk person is crossing the intersection.


We had to deal with comment spam (Google's nofollow, various modules to
deal with it, or turning off anonymous comments, and moderating them),
and referer spam (hide the statistics pages from view, or disabling
statistics altogether). Didn't we? It is a rough world out there, and
if the others are unethical or criminals or just don't play by the
rules, we still have to deal with them. How is this any different?


Seeing this as a purely external issue and not dealing with it in the
software we control is unrealistic. Remember that the fix does not
break any standards. We will still be standard compliant with it, so
the standards slant of it is not convincing.


Sorry, I just see it this way, and none of the counter arguments so far
is convincing to me.




------------------------------------------------------------------------

Mon, 09 May 2005 02:04:28 +0000 : clydefrog

This is not at all like MSIE. IE is a majority browser, so designing
sites that don't work with it loses users. This, on the other hand, is
a small minority of crawlers making a nuisance of themselves.


The drunk driver analogy is a little bit closer, but this isn't a life
or death situation. You've convinced me that you want this feature, but
you haven't convinced me that I want this feature. If this is committed,
*please* make it optional!




------------------------------------------------------------------------

Wed, 18 May 2005 00:14:56 +0000 : jayCampbell

This issue interferes with BlogLines' ability to autodetect feeds.
Several desktop clients have the same problem. While I agree that
ideally this should be an optional change, losing 10% of your RSS
readership is serious stuff. Thank you for the interim hide-saving,
chx.




------------------------------------------------------------------------

Wed, 18 May 2005 03:18:56 +0000 : kbahey

chx


Was the patch updated for 4.6? I have applied only the common.inc.


But since I am now using phptemplate based themes (pushbutton, and soon
a custom one), can you please give some instructions on what to change
(e.g. putting path_to_theme() in some places?)


Thanks in advance.




------------------------------------------------------------------------

Wed, 01 Jun 2005 21:47:16 +0000 : njvack

I've tested this with 4.6 -- one of the hunks didn't want to apply to
common.inc; I think it was a line count off-by-one change or something,
though -- I made the change by hand and it's working great.


This patch hasn't found widespread acceptance... is it breaking things
for some people?


This feels like a better method of handling links than using BASE HREF,
as far as I can tell. The way I see it is: Google is pretty good at
indexing web pages. Not perfect, but I'm willing to believe they
understand HTML better than I do. BASE HREF isn't a new part of the
standard. At all. If Google doesn't support the way Drupal is trying to
use BASE HREF, my money is that Google is right, and Drupal is using the
tag in an unintended way.


But anyhow, it's happy under 4.6 for me, so far.




------------------------------------------------------------------------

Sun, 19 Jun 2005 18:56:20 +0000 : mjr

I just read through this thread because I have been bitten by the
problem: I need URLs relative to the server document root
for several reasons.


First, my server lives in a LAN behind a firewall. Since it runs
several web apps, Drupal is installed in a subdirectory of my document
root. It is accessible from the outside solely via https and from
inside the LAN under a different name, via https as well as using
normal http (the latter is necessary to get cron.php working correctly,
isn't it?)


Secondly, I run a clone of this site as a testing platform on my laptop,
which obviously has its own URL, and I want to be able to sync it with as
little hassle as possible. IMHO this is very important - especially if
you set up sites using many modules you will need to test your setup
somewhere else before going onto the production system.


Thirdly, it prevents me from using http(s)://localhost/drupal46 or IP
addresses to access the site.


Fourthly, I or some remote user will have trouble mirroring the site or
parts of it into static pages (that's the crawler problem again).


Such a situation has been discussed elsewhere in this forum multiple
times, and the commonly accepted proposal was to use a relative path as
a base URL, in my setting '/drupal46' (if I remember correctly, it is
even in Drupal's manual, isn't it?), although this will obviously
break the HTML standard. Now comes the real problem: relative paths in
<base> are interpreted as intended by this hack by most relevant browsers:
Mozilla et al., Opera, MSIE 5.x, Safari, even exotic ones like lynx and
links. But a few adhere strictly to the standard in this respect:
besides a few other exotic ones like w3m and amaya, MSIE 6.x also
belongs to this group. Which means that my site is not accessible by 2/3
of the world. Really ugly.


I am aware of using a multisite setup as a workaround, but seriously,
why use a workaround which costs me administrative effort if a clean
and standard conforming solution is in sight, namely avoiding the use
of the <base> tag in favour of using paths relative to the document root of
my server or virtual host?


So I would strongly vote for modifying Drupal to drop the <base> tag.
Although it has been part of HTML since before HTML 1.0 was defined
(always as an optional tag, BTW), it causes more trouble than it
solves.


best regards


Michael




------------------------------------------------------------------------

Mon, 22 Aug 2005 13:36:39 +0000 : Uwe Hermann

What's the status of this issue?




------------------------------------------------------------------------

Mon, 22 Aug 2005 13:53:53 +0000 : Goba

Now that we have /patch (code needs work)/, this is a perfect issue to
put into that status. I would vote for the change provided by the
latest patch in this issue, although I have not tested the patch
myself. At least at weblabor.hu and drupal.hu, we run with a custom
url() function which just does what this patch is about to do (but
since these Drupal setups are in the root folder, we just prepend a
slash to all path values).
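

For readers who want to picture that kind of helper, here is a minimal
sketch of the idea in Python (the sites mentioned run PHP, and this is
not their code). The function name mirrors Drupal's url(), but the
base_path parameter and its handling are assumptions added to also
cover a subdirectory install such as the '/drupal46' case described
earlier in the thread.

```python
def url(path, base_path="/"):
    """Return a root-relative URL for an internal path.

    base_path is a hypothetical setting: "/" for a site installed in the
    web root, or something like "/drupal46" for a subdirectory install.
    """
    # Normalize the prefix so it starts and ends with exactly one slash.
    prefix = "/" + base_path.strip("/")
    if prefix != "/":
        prefix += "/"
    # Drop any leading slash on the path to avoid a doubled separator.
    return prefix + path.lstrip("/")


print(url("about/tracker"))        # root install
print(url("node/1", "/drupal46"))  # subdirectory install
```

Because every generated link then starts with a slash, its meaning no
longer depends on whether a client honors the base href.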


The patch [6] needs to be updated to latest CVS, and since Dries said [7]
he is willing to fix this problem, it should be committed after a
review.
[6] http://drupal.org/node/13148#comment-19433
[7] http://drupal.org/node/13148#comment-16088






