[drupal-devel] [bug] Problems with using relative path names

kbahey drupal-devel at drupal.org
Wed Mar 23 00:37:59 UTC 2005

Issue status update for http://drupal.org/node/13148

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  kbahey
 Updated by:   kbahey
 Status:       patch

I really can't fathom why some of us cannot deal with the realities
out there in the world.
These problems are not because Drupal is broken. It is because crawlers
are. We cannot just bury our collective heads in the sand and say that
we are standards compliant and forget about what is out there. 
As an analogy, people who design themes or write CSS have to deal with
the ugliness of Microsoft Internet Explorer and its intentional going
against standards. You cannot tell a client or your boss that you are
not modifying a theme that works perfectly on Konqueror and Firefox
because MS IE is broken.
Similarly, we cannot ignore that crawlers from major search engine
companies are broken or confused, and keep recursing through sites using
Drupal causing countless errors in the logs. We cannot tell our users to
ask Google and Yahoo et al to fix their software.
Remember that we are not breaking any standards by implementing this
patch. All we are doing is putting the entire path out (from the first
/ down) and thus eliminating ambiguity for everyone.
Sorry if I am a bit blunt in this post, but I am tired of what may be
seen as isolationist thinking.
I do not mind if this is implemented in an advanced mode or via a
settings.php thing. All I care about is getting it fixed somehow.


Previous comments:

November 18, 2004 - 22:29 : kbahey

Looking at my site's logs, there seem to be several problems that are
caused by Drupal's use of relative path names.

If Drupal causes all the site's urls to be absolute, then none of this
would be an issue.
A. Search Engine Crawlers
Getting lots of 404s on things like: linux/index.html/robots.txt
Where 'linux' is an alias to a taxonomy, and 'index.html' is an alias
to a node within that taxonomy.
Another example, is recursing unnecessarily. I see 404s on things like:
Where 'linux' is a path alias for a taxonomy term, and 'index.html' is
an alias to the main node within it.
This does not seem to happen when Google crawls my sites, but Yahoo's
Slurp suffers from this problem, and keeps recursing. MSNBot also
suffers from this.
Another crawler/harvester called Blinkx/DFS-Fetch keeps adding the .css
file to the relative path, getting a 404 on things like:
And Fast Search Engine also attempts to access:
The same goes for grub.org, another crawler.
B. Google Cache / Archive Way Back Machine
Pages in Google cache and archive.org Way Back Machine suffer from a
similar problem: the .css files cannot be found, and hence rendering of
the pages is not correct.
Compare this: http://www.drupal.org/node/4647
To this:
Notice the following:

How there is no formatting at all, because of the lack of a .css file
The httpd log on Drupal will show errors for:
linux/themes/pushbutton/style.css and linux/misc/drupal.css
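
To make the mechanism concrete, here is a small Python sketch (not Drupal code) of how a user agent that ignores the base tag would resolve the two relative stylesheet paths. The host example.com, the page URL and the base value are assumptions chosen to match the log entries above; urljoin applies the standard relative-reference rules.

```python
from urllib.parse import urljoin

# Assumed page URL: 'linux' is the taxonomy alias, 'index.html' the node alias.
page_url = "http://example.com/linux/index.html"
# Assumed value of the <base href> emitted in the page head.
base_href = "http://example.com/"

stylesheets = ["themes/pushbutton/style.css", "misc/drupal.css"]


def resolve_all(base, hrefs):
    """Resolve each relative href against the given base URL."""
    return [urljoin(base, href) for href in hrefs]


# A compliant agent resolves against <base href>.
with_base = resolve_all(base_href, stylesheets)
# An agent that ignores <base> falls back to the page's own URL.
without_base = resolve_all(page_url, stylesheets)

for good, bad in zip(with_base, without_base):
    print(f"{good}  vs  {bad}")
```

The second column reproduces the linux/themes/... and linux/misc/... paths that show up as 404s.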

Also see:
C. Proxy Caches:
When someone is browsing my site from behind a proxy cache, the web
site is hit with a rapid succession of requests, and many of them are just
for bogus pages.

2004/11/17 - 17:47 404 error: linux/user/1 not found.
2004/11/17 - 17:47 404 error: linux/feedback not found.
2004/11/17 - 17:47 404 error: linux/tracker not found.
2004/11/17 - 17:47 404 error: linux/sitemap not found.
2004/11/17 - 17:47 404 error: linux/search not found.
2004/11/17 - 17:47 404 error: linux/misc not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/linux not found.
2004/11/17 - 17:47 404 error: linux/technology not found.
2004/11/17 - 17:47 404 error: linux/writings not found.
2004/11/17 - 17:47 404 error: linux/family not found.

And also:

2004/11/17 - 07:23 404 error: history/user/1 not found.
2004/11/17 - 07:23 404 error: history/tracker not found.
2004/11/17 - 07:23 404 error: history/feedback not found.
2004/11/17 - 07:23 404 error: history/sitemap not found.
2004/11/17 - 07:23 404 error: history/search not found.
2004/11/17 - 07:23 404 error: history/misc not found.
2004/11/17 - 07:23 404 error: history/technology not found.
2004/11/17 - 07:23 404 error: history/science not found.
2004/11/17 - 07:22 404 error: history/history not found.
2004/11/17 - 07:22 404 error: history/writings not found.
2004/11/17 - 07:22 404 error: history/family not found.

As you can tell, history and linux are aliases to taxonomy terms, and
so are misc, technology, writings, family, ...etc. The user agent is
appending the taxonomy term alias to the url and forming a new URL.
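
One way to spot this pattern in a larger log is to count the first path component of each 404 entry. The Python sketch below assumes the line format seen in the excerpts above; the regular expression and the helper name spurious_prefixes are illustrative, not part of Drupal.

```python
import re
from collections import Counter

# A few entries copied from the log excerpts above.
log_lines = [
    "2004/11/17 - 17:47 404 error: linux/user/1 not found.",
    "2004/11/17 - 17:47 404 error: linux/tracker not found.",
    "2004/11/17 - 07:23 404 error: history/user/1 not found.",
    "2004/11/17 - 07:22 404 error: history/family not found.",
]

# Pattern assumed from the excerpt: "... 404 error: <path> not found."
LINE_RE = re.compile(r"404 error: (?P<path>\S+) not found\.$")


def spurious_prefixes(lines):
    """Count the first path component of every 404 entry."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            first_component = match.group("path").split("/", 1)[0]
            counts[first_component] += 1
    return counts


prefix_counts = spurious_prefixes(log_lines)
print(prefix_counts.most_common())
```

A prefix that dominates the counts and is also a known alias (here linux and history) points at a user agent resolving links against the page URL.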

D. Regular Browsing:
There is even at least one extreme case where the following URL was
accessed (the result was 404 of course)

It seems it was a normal user, because the user agent is: "Mozilla/4.0
(compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"
Proposed Solution:
As a proposed solution, all URLs in Drupal can be made into absolute
path names. This can be done by the following:

The variable $base_url in the conf.php file is broken down into two

$base_host (the 'http://whatever-host.example.com' part WITHOUT the
trailing slash)
$base_path (the '/path-to-drupal' part, WITH the leading slash. If this
is the DocumentRoot, then it is just a '/' character)

$base_url is now $base_host concatenated with $base_path
A simple filter can be written to precede every href="path" with the
$base_path variable, so it becomes "/path"
This option can be turned on and off for a site. The default is to have
it off so current behavior is maintained.
A similar scheme applies for style sheets as well.
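
The split and the prefixing step described above can be sketched in a few lines of Python. This is only an illustration of the idea under stated assumptions: the function names split_base_url and make_root_relative are hypothetical, and the rule for which hrefs to leave alone is a guess at what such a filter would need.

```python
from urllib.parse import urlsplit


def split_base_url(base_url):
    """Split a base URL into (base_host, base_path) as the proposal describes."""
    parts = urlsplit(base_url)
    base_host = f"{parts.scheme}://{parts.netloc}"  # no trailing slash
    base_path = parts.path.rstrip("/") or "/"       # "/" when in DocumentRoot
    return base_host, base_path


def make_root_relative(href, base_path):
    """Prefix a site-relative href with base_path; leave other hrefs untouched."""
    already_anchored = href.startswith(("/", "#", "mailto:")) or "://" in href
    if already_anchored:
        return href
    return base_path.rstrip("/") + "/" + href


host, path = split_base_url("http://example.com/drupal")
print(host, path, make_root_relative("node/1", path))
```

With Drupal in the DocumentRoot the same call yields a base_path of "/" and links such as /node/1.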

So, did I miss something obvious? Am I seriously off the mark?
Your thoughts!


November 19, 2004 - 23:10 : chrisada

I am getting similar 404 errors, mainly from rss feed link that looks
like /blog/blog/feed and many manual links that are relative to drupal
It was not a problem before Drupal 4.5, so I think there might not be a
need to change all URIs to absolute. I can't see where the problem is
coming from though.


November 19, 2004 - 23:35 : kbahey

I am pretty sure that these problems were happening for at least the
past 10 months (ever since I moved to Drupal in January 2004).
The main issue here is that crawlers and other user agents get confused
by the relative path names. 
Using absolute paths will definitely solve this.  However, is this the
only solution? 
I am looking for a discussion of this.


November 20, 2004 - 04:55 : Goba

No absolute paths please. Having the path start with '/' solves all the
mentioned problems, and is not absolute, it is relative to the domain.
Sadly some crawlers and even the Google Cache do not obey the base
href. I have reported this cache problem in April to Google, and they
promised they will keep it in mind... Hehe...
What we need is to have the printed relative path values relative to
the domain name, and not relative to the Drupal installation path.
Note that this issue will appear on the drupal devel mailing list if
someone finally provides a patch we can talk about :)


November 20, 2004 - 05:19 : Dries

Goba is right.  We need paths relative to the domain name to fix this


November 20, 2004 - 15:19 : kbahey

Sorry for not making myself clear.
When I said absolute, I meant that they start with just a /. I did NOT
mean that they start with http://host.example.com. That would be a very
bad idea.
In any case, what do people think about the proposed solution (breaking
down $base_url into two parts?)
Also, does this address the style sheets as well, or more is needed?


November 21, 2004 - 11:24 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch.txt (471 bytes)

I have implemented what Goba suggested.


November 21, 2004 - 11:43 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_0.txt (825 bytes)

Maybe this one is faster?


November 21, 2004 - 12:34 : kbahey

Man! You are fast!
I tried the second version. It works fine for things that are not
inside the node body, I mean they have  a / in front of them, as we
want it to be.
Three comments/issues:
- If there is a URL that is already "/" representing the home page, it
gets set to "//".  Perhaps it should check for that case?
- URLs in nodes that do not start with / do not get changed to have a /
prepended to them. Do we need a filter for this?
- Do we need to do something for the style sheets in the page header? I
mean the "misc/drupal.css" and "themes/themename/style.css"?


November 21, 2004 - 19:28 : kbahey

Hi chx
Here is a fix for the case where you have a url that is just "/".
In your patch, instead of:

$base = $parts['path'] . '/' ;

Replace that by:

$base = ( $path == '/' ? $base : $parts['path'] . '/' );


November 27, 2004 - 23:08 : kbahey

Did this patch make it into CVS yet?
If there are any objections to it, can someone please explain what they


November 28, 2004 - 04:33 : Dries

Shouldn't your changes be included in the patch?
Also, it's better to cache $base rather than $parts.
Lastly, if this patch makes it to HEAD, we should probably remove some
'base url' cruft from the themes.


November 28, 2004 - 13:54 : kbahey

Attachment: http://drupal.org/files/issues/x.diff (1 KB)

Here is the patch including my fix.
I am asking chx to comment on caching $base instead of $parts.
Will this make it faster?


November 28, 2004 - 14:26 : chx

Hm. $base = ( $path == '/' ? $base : $parts['path'] . '/' ); this
depends on path which is a parameter. Thus I fail to see how could we
cache $base. I'd correct this code however $base = ( $path == '/' ? ''
: $parts['path'] . '/' ); 'cos I think $base is not defined before, but
this is not a problem, PHP will be happy to replace NULL with NULL...
Maybe instead of all parts, only $parts['path'] is enough to be cached,
yes, but the performance and memory usage difference -- I guess -- would
not be noticeable...
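
For readers without the patch at hand, the "//" edge case can be reproduced with a Python rendering of the two expressions quoted in this thread. The surrounding patch is not shown here, so treating the final link as prefix plus path is an assumption, and the names naive_prefix and base_prefix are hypothetical.

```python
from urllib.parse import urlsplit


def naive_prefix(base_url, path):
    """Earlier form: always the path part of the base URL plus a slash."""
    return urlsplit(base_url).path + "/"


def base_prefix(base_url, path):
    """Corrected form: use an empty prefix when the link is just '/'."""
    parts_path = urlsplit(base_url).path  # stands in for $parts['path']
    return "" if path == "/" else parts_path + "/"


site = "http://example.com"  # assumed: Drupal in the DocumentRoot
for link in ["/", "node/1"]:
    print(repr(naive_prefix(site, link) + link),
          repr(base_prefix(site, link) + link))
```

Under these assumptions only the "/" link differs between the two forms.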


November 28, 2004 - 18:06 : kbahey

Attachment: http://drupal.org/files/issues/common-inc-patch.txt (1 KB)

I put in chx suggested change.
This patch can go in CVS then, to rid us of the problems with paths not
beginning with slash.
This is not an ultimate solution still. We need to address the problem
with .css files. Although the header contains a:
<base href="http://example.com" />
it does not seem that major search engines and archiving sites obey it


December 2, 2004 - 15:31 : Dries

Your coding style needs work.  Also, I'm not going to commit this unless
the themes get fixed up: we'd end up with invalid URLs all over the
place.  Lastly, I wonder how portable the themes will be when Drupal is
run from within a subdirectory.


December 2, 2004 - 16:15 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_1.txt (849 bytes)

Well, my patch worked from a subdirectory very well; in fact, I have not
tested it from the root dir. And I think that it adheres to coding
standards. So I resubmit it with the root path fix. However, my Drupal
work is focused on i18n these days, and I was never into themeing so it
won't be me who fixes those.


December 2, 2004 - 17:07 : kbahey

I have tested the previous patch (including my fix) with drupal
installed in the DocumentRoot of the server.
So, in effect, it is tested with both Drupal in / and Drupal in a
This change fixes the problem for the crawlers and other browsers from
getting confused.
While it is true that there is no fix for the .css files in the HTML
head section yet, this fix deals with a major part of the problem, and
rids us of a major pain. Check your web server's logs some time to see
what I mean.
Someone who is familiar with the themes can contribute a patch later. 
This patch and the future fix for themes are not mutually exclusive, so
let it go in CVS.


December 9, 2004 - 10:31 : Goba

Please commit this into Drupal core, this fix is badly needed.


January 17, 2005 - 08:00 : chx

Attachment: http://drupal.org/files/issues/base_url_kill.patch (4.34 KB)

Well, as no one has stepped in to fix this problem, I have tried to fix
the themes also. themes.inc, xtemplate.engine and the bluemarine
template are patched besides common.inc.
Of course, more templates could follow, but first I'd like to see your


January 17, 2005 - 08:09 : Goba

I don't think that removing <base> from the themes is a good idea, using
$parts['path'] should be encouraged though before the files, which would
fix the google cache problem, and would still keep the HTML size low. It
would also help those, who save the file to find the originating site
easier, since clicking on a non-pagelocal link would lead to the online


January 17, 2005 - 11:03 : Steven

Definitely -1 on removing the <base> tag or using absolute or
root-relative URLs. This tag has been around for ages, and it is the
only way to make clean URLs work without bloating in the code. FYI,
"base" is (first?) mentioned in Berners-Lee's HTML 1.0 draft [1].
That's June 1993.
As the amount of clean URL-using sites grows, the crawlers will have to
be updated. Perhaps we could prevent crawlers from going too insane by
404ing for URLs with more than say 10 components? That would prevent
the really crappy ones from hammering your site.
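
The depth check suggested here is easy to express. The Python sketch below is only an illustration: the function name too_many_components is hypothetical, and the limit of 10 is the figure floated above, not a tested value.

```python
def too_many_components(path, limit=10):
    """Return True when a request path has more than `limit` non-empty components."""
    components = [part for part in path.split("/") if part]
    return len(components) > limit


# A runaway crawler path built by repeating one alias.
runaway = "/".join(["linux"] * 11)
print(too_many_components("linux/index.html"), too_many_components(runaway))
```

A request handler could answer 404 as soon as this returns True, before doing any alias lookup.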
I'm all for making the <base> tag easier to handle for the user (say,
by including a filter to allow simple anchor links to work as most
people expect them to), but we should keep Drupal-generated URLs clean
and completely relative.
[1] http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt


January 17, 2005 - 11:57 : kbahey

The problem with css is this: The @import argument does not start with a
This is simple to fix.
We keep the "base" as it is today, but add the new variable: $base
before it.
So for a site where Drupal is installed in the DocumentRoot, all that
will change is that /misc/drupal.css and /themes/themename/style.css
will be preceded by a slash. For sites that use another path, that path
will be prepended to the css file name.
How about that?


January 17, 2005 - 12:38 : Steven

What exactly is the problem with the @import? As far as I know:
- url() in stylesheets is interpreted relative to the base of the
stylesheet, not the source document.
- However, if the styles are inside an HTML document, through a style
tag or style attribute, then the stylesheet's location is the same as
the HTML document.
- Thus, the stylesheet's base is the same as the base of the HTML
document (which can be altered through the <base> tag).
I just don't see why it is necessary. As far as I know, the only
browser that has had problems resolving CSS urls properly was Netscape
4, which does not support @import at all, and which Drupal does not
support either, because of its CSS usage.


January 17, 2005 - 13:35 : kbahey

The problem for stylesheets is as follows. I think it mainly affects
crawlers and Google's cache.
Say you have an installation of Drupal in DocumentRoot. You then use url
aliases, and put slashes in them.
For example, you use news/general/2004-12-15.html for a node.
That node still has misc/drupal.css and themes/pushbutton/style.css in
the head section of the document. Crawlers get fooled by that and try
to look for /news/general/misc/drupal.css and
/news/general/themes/pushbutton/style.css, which don't exist.
So, just prepending the new $base variable (in chx's patch) before the
stylesheet @import argument would fix this issue. Assuming you are in
DocumentRoot, then /misc and /themes would be used instead of just misc
and themes.
It would still be compliant with standards, be relative to the web
site, and not ambiguous to anyone, be they crawler or browser.
I hope it is clearer now. 
I think chx can change the patch to use the $base instead of $base_url
everywhere, so as to avoid the host/domain name in the urls.


January 17, 2005 - 16:36 : Steven

But typical crawlers don't even pay attention to stylesheets, hence it
wouldn't have much use for them. I just don't see why we should adjust
to rare cases of buggy software. Reading out a base URL from an HTML
document is dead easy, and on top of that it doesn't add more
complexity as without the base tag, the document's URL is already an
implicit base which has to be parsed anyway.
I did not like it when we altered the <link> tag to accommodate buggy
RSS readers and I certainly don't like it now, as this is even rarer.
In both cases, it is not Drupal which is at fault.


January 17, 2005 - 16:48 : kbahey

While I agree with most of what you said, the 404s show up in the logs
enough to be a bother.
Perhaps the original design of Drupal did not foresee that people will
use url aliases to mimic directory/file hierarchies. Whether this was
intended or not, it is the way many use Drupal today.
It does not matter where the bug is (Drupal or the external world), as
long as we can stop it ourselves, by adjusting our end of it.
The fix is simple enough and does not break standards (if implemented
as described with a leading / before the css).


January 17, 2005 - 16:54 : Steven

It does not break standards, but it does bloat the code in an ugly way.
Why not send an e-mail to the owners of the crawlers and tell them to
implement a standard that is nearly 10 years old [2] (RFC 1808)?
Note that Google Cache now seems to correctly interpret base URLs [3]
and even adds a <base> tag of its own.
By the way, this problem has nothing to do with people using URL
aliases or not, as for a browser the regular nested paths that Drupal
uses (e.g. "node/1" is no different from aliases mimicking files
[2] http://www.faqs.org/rfcs/rfc1808.html


January 17, 2005 - 17:34 : Goba

Steven, part of the problem is that Google cache does add a base href
even if there is a base href in the document. Eg adds a <BASE
HREF="http://drupal.org/node/13733"> on the plone comparison page
cached. Now, since HTML does not allow more than one base tag [4]
to be present, it is up to the browsers, to use the first or the last
base value, or any of the base values on the page for that matter as
the used base. So even pages displayed from the google cache will be
buggy if a full relative path to the domain root is not specified, due
to this problem.


January 18, 2005 - 02:14 : chx

Attachment: http://drupal.org/files/issues/base_url_kill_0.patch (4.75 KB)

This one does not use the whole base_url only the path part of it. HTML
bloat is kept at minimal.


February 1, 2005 - 18:33 : clairem

Please please can this be done?
It's a good idea in itself, but if using fully-qualified paths means we
can get rid of the BASE HREF, then page anchors will work without having
the overhead of a filter. That'd be a huge bonus for those creating
larger nodes, or who just want to be able to put a "skip navigation"
link in their theme without having to abandon Xtemplate or PHPtemplate


February 1, 2005 - 18:37 : Goba

Well, speaking of skip navigation links, phptemplate and xtemplate
should expose the REQUEST_URI to the templates, so when a link to an
anchor on the same page is needed, the link can be formatted with the
complete request URI in mind.
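
A short Python sketch of why this matters: with a base tag in effect, a bare fragment link resolves against the base rather than the current page. The base value, the request URI and the helper page_anchor are assumptions for illustration, not template-engine API.

```python
from urllib.parse import urljoin

base_href = "http://example.com/"  # assumed <base href> value
request_uri = "/node/12"           # assumed current page (REQUEST_URI)

# With <base> in effect, a bare fragment points at the base URL, not this page.
bare_anchor = urljoin(base_href, "#content")


def page_anchor(request_uri, fragment):
    """Build a same-page anchor link from the request URI."""
    return f"{request_uri}#{fragment}"


fixed_anchor = urljoin(base_href, page_anchor(request_uri, "content"))

print(bare_anchor)
print(fixed_anchor)
```

The first link lands on the front page; the second stays on the page the visitor is reading.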


February 2, 2005 - 18:40 : clairem

"phptemplate and xtemplate should expose the REQUEST_URI to the templates"
Should, but don't :(
If BASE HREF isn't removed, surely it wouldn't be a big job to
implement this tweak?


February 17, 2005 - 09:50 : kbahey

This patch is badly needed. The lack of a leading / in many paths is
causing lots of problems.


March 12, 2005 - 15:32 : kbahey

Can this patch be applied for 4.6? It is really badly needed.


March 22, 2005 - 15:50 : Dries

I don't see why this is badly needed.  We generate perfectly valid URLs
which are supposed to be short and crispy.  This patch has some
advantages though, yet it is unclear which patch to go with.


March 22, 2005 - 16:00 : chx

The second patch is better.


March 22, 2005 - 16:46 : grohk

Forgive me for saying so,  but since the way Drupal is generating
hyperlinks is completely valid, why are you suggesting Drupal should
move away from an accepted standard when the problem lies with the
search engines?
At the very least, this needs to be optional -- which it appears to be
-- I hate the 404s too, but I hate to hear that a change in Drupal is
needed to fix a Google problem.


March 22, 2005 - 17:03 : jhriggs

I have to agree with the last comment from grohk.


March 22, 2005 - 17:39 : mathias

I also agree with the two previous Drupaleers, but I wouldn't mind
enabling a 'quirks mode' via my conf file to stop the flood of 404
