[drupal-devel] [bug] Problems with using relative path names

kbahey drupal-devel at drupal.org
Sat Mar 12 20:32:35 UTC 2005


Issue status update for http://drupal.org/node/13148

 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  kbahey
 Updated by:   kbahey
 Status:       patch

Can this patch be applied for 4.6? It is really badly needed.


kbahey



Previous comments:
------------------------------------------------------------------------

November 18, 2004 - 22:29 : kbahey


Looking at my site's logs, there seem to be several problems that are
caused by Drupal's use of relative path names.

If Drupal made all of the site's URLs absolute, none of this would be
an issue.

A. Search Engine Crawlers
Getting lots of 404s on things like: linux/index.html/robots.txt
Where 'linux' is an alias to a taxonomy, and 'index.html' is an alias
to a node within that taxonomy.
Another example is unnecessary recursion. I see 404s on things like:
/linux/index.html/linux/index.html
Where 'linux' is a path alias for a taxonomy term, and 'index.html' is
an alias to the main node within it.
This does not seem to happen when Google crawls my sites, but Yahoo's
Slurp suffers from this problem, and keeps recursing. MSNBot also
suffers from this.
Another crawler/harvester called Blinkx/DFS-Fetch keeps adding the .css
file to the relative path, getting a 404 on things like:
/linux/themes/xtemplate/pushbutton/logo.gif
And Fast Search Engine also attempts to access:
/linux/contact/tracker/tracker/user/password
The same goes for grub.org, another crawler.

B. Google Cache / Archive Way Back Machine
Pages in Google cache and archive.org Way Back Machine suffer from a
similar problem: the .css files cannot be found, and hence rendering of
the pages is not correct.
Examples:
Compare this: http://www.drupal.org/node/4647
To this:
http://www.google.ca/search?q=cache:www.drupal.org/node/view/4647
Notice the following:

- There is no formatting at all, because of the lack of a .css file.
- The httpd log on Drupal will show errors for:
  linux/themes/pushbutton/style.css and linux/misc/drupal.css

Also see:
http://web.archive.org/web/20031016184902/http://www.drupal.org/

C. Proxy Caches:
When someone is browsing my site from behind a proxy cache, the web
site is hit with a rapid succession of requests, and many of them are
just for bogus pages.
Examples:

2004/11/17 - 17:47 404 error: linux/user/1 not found.
2004/11/17 - 17:47 404 error: linux/feedback not found.
2004/11/17 - 17:47 404 error: linux/tracker not found.
2004/11/17 - 17:47 404 error: linux/sitemap not found.
2004/11/17 - 17:47 404 error: linux/search not found.
2004/11/17 - 17:47 404 error: linux/misc not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/linux not found.
2004/11/17 - 17:47 404 error: linux/technology not found.
2004/11/17 - 17:47 404 error: linux/writings not found.
2004/11/17 - 17:47 404 error: linux/family not found.

And also:

2004/11/17 - 07:23 404 error: history/user/1 not found.
2004/11/17 - 07:23 404 error: history/tracker not found.
2004/11/17 - 07:23 404 error: history/feedback not found.
2004/11/17 - 07:23 404 error: history/sitemap not found.
2004/11/17 - 07:23 404 error: history/search not found.
2004/11/17 - 07:23 404 error: history/misc not found.
2004/11/17 - 07:23 404 error: history/technology not found.
2004/11/17 - 07:23 404 error: history/science not found.
2004/11/17 - 07:22 404 error: history/history not found.
2004/11/17 - 07:22 404 error: history/writings not found.
2004/11/17 - 07:22 404 error: history/family not found.

As you can tell, history and linux are aliases to taxonomy terms, and
so are misc, technology, writings, family, etc. The user agent is
appending the taxonomy term alias to the URL and forming a new URL.

D. Regular Browsing:
There is even at least one extreme case where the following URL was
accessed (the result was 404 of course)
/book/view/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/t
hemes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtempl
ate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/logo.gif

It seems it was a normal user, because the user agent is: "Mozilla/4.0
(compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"

Proposed Solution:
As a proposed solution, all URLs in Drupal can be made into absolute
path names. This can be done by the following:

- The variable $base_url in the conf.php file is broken down into two
  components:
  - $base_host (the 'http://whatever-host.example.com' part WITHOUT the
    trailing slash)
  - $base_path (the '/path-to-drupal' part, WITH the leading slash. If
    this is the DocumentRoot, then it is just a '/' character)
- $base_url is now $base_host concatenated with $base_path.
- A simple filter can be written to precede every href="path" with the
  $base_path variable, so it becomes "/path".
- This option can be turned on and off for a site. The default is to have
  it off so current behavior is maintained.
- A similar scheme applies for style sheets as well.
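
The split described above can be sketched roughly as follows. This is
only an illustration that assumes PHP's built-in parse_url(); the
function name split_base_url() is hypothetical and is not part of Drupal
or of any patch attached to this issue.

```php
<?php
// Hypothetical helper: split a configured base URL into the two parts
// described above. Not taken from Drupal or from any attached patch.
function split_base_url($base_url) {
  $parts = parse_url($base_url);

  // 'http://whatever-host.example.com', without a trailing slash.
  $base_host = $parts['scheme'] . '://' . $parts['host'];
  if (isset($parts['port'])) {
    $base_host .= ':' . $parts['port'];
  }

  // '/path-to-drupal', with a leading slash; just '/' for the DocumentRoot.
  $base_path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';
  if ($base_path == '') {
    $base_path = '/';
  }

  return array($base_host, $base_path);
}

list($base_host, $base_path) =
  split_base_url('http://whatever-host.example.com/path-to-drupal');
echo $base_host, "\n";  // http://whatever-host.example.com
echo $base_path, "\n";  // /path-to-drupal
```

A filter as proposed would then only need $base_path when rewriting
href values, so the host name never ends up in the generated HTML.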

So, did I miss something obvious? Am I seriously off the mark?
Your thoughts!


------------------------------------------------------------------------

November 19, 2004 - 23:10 : chrisada

I am getting similar 404 errors, mainly from RSS feed links that look
like /blog/blog/feed and many manual links that are relative to the
Drupal root.
It was not a problem before Drupal 4.5, so I think there might not be a
need to change all URIs to absolute. I can't see where the problem is
coming from though.


------------------------------------------------------------------------

November 19, 2004 - 23:35 : kbahey

I am pretty sure that these problems were happening for at least the
past 10 months (ever since I moved to Drupal in January 2004).
The main issue here is that crawlers and other user agents get confused
by the relative path names. 
Using absolute paths will definitely solve this.  However, is this the
only solution? 
I am looking for a discussion of this.


------------------------------------------------------------------------

November 20, 2004 - 04:55 : Goba

No absolute paths please. Having the path start with '/' solves all the
mentioned problems, and is not absolute, it is relative to the domain.
Sadly some crawlers and even the Google Cache do not obey the base
href. I have reported this cache problem in April to Google, and they
promised they will keep it in mind... Hehe...
What we need is to have the printed relative path values relative to
the domain name, and not relative to the Drupal installation path.
Note that this issue will appear on the drupal devel mailing list if
someone finally provides a patch we can talk about :)


------------------------------------------------------------------------

November 20, 2004 - 05:19 : Dries

Goba is right.  We need paths relative to the domain name to fix this
'problem'.


------------------------------------------------------------------------

November 20, 2004 - 15:19 : kbahey

Sorry for not making myself clear.
When I said absolute, I meant that they start with just a /. I did NOT
mean that they start with http://host.example.com. That would be a very
bad idea.
In any case, what do people think about the proposed solution (breaking
down $base_url into two parts?)
Also, does this address the style sheets as well, or is more needed?


------------------------------------------------------------------------

November 21, 2004 - 11:24 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch.txt (471 bytes)

I have implemented what Goba suggested.


------------------------------------------------------------------------

November 21, 2004 - 11:43 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_0.txt (825 bytes)

Maybe this one is faster?


------------------------------------------------------------------------

November 21, 2004 - 12:34 : kbahey

Man! You are fast!
I tried the second version. It works fine for things that are not
inside the node body; I mean they have a / in front of them, as we
want it to be.
Three comments/issues:
- If there is a URL that is already "/" representing the home page, it
gets set to "//".  Perhaps it should check for that case?
- URLs in nodes that do not start with / do not get changed to have a /
prepended to them. Do we need a filter for this?
- Do we need to do something for the style sheets in the page header? I
mean the "misc/drupal.css" and "themes/themename/style.css"?
Thanks


------------------------------------------------------------------------

November 21, 2004 - 19:28 : kbahey

Hi chx
Here is a fix for the case where you have a url that is just "/".
In your patch, instead of:

<?php
$base = $parts['path'] . '/' ;
?>


Replace that by:

<?php
$base = ( $path == '/' ? $base : $parts['path'] . '/' );
?>




------------------------------------------------------------------------

November 27, 2004 - 23:08 : kbahey

Did this patch make it into CVS yet?
If there are any objections to it, can someone please explain what they
are?
Thanks


------------------------------------------------------------------------

November 28, 2004 - 04:33 : Dries

Shouldn't your changes be included in the patch?
Also, it's better to cache $base rather than $parts.
Lastly, if this patch makes it to HEAD, we should probably remove some
'base url' cruft from the themes.


------------------------------------------------------------------------

November 28, 2004 - 13:54 : kbahey

Attachment: http://drupal.org/files/issues/x.diff (1 KB)

Here is the patch including my fix.
I am asking chx to comment on caching $base instead of $parts.
Will this make it faster?


------------------------------------------------------------------------

November 28, 2004 - 14:26 : chx

Hm. $base = ( $path == '/' ? $base : $parts['path'] . '/' ); this
depends on path which is a parameter. Thus I fail to see how could we
cache $base. I'd correct this code however $base = ( $path == '/' ? ''
: $parts['path'] . '/' ); 'cos I think $base is not defined before, but
this is not a problem, PHP will be happy to replace NULL with NULL...
Maybe instead of all parts, only $parts['path'] is enough to be cached,
yes, but the performance and memory usage difference -- I guess -- would
not be noticeable...
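
As a rough illustration of the corrected expression above: the attached
patches are not reproduced in this thread, so the function below is an
assumption about how the prefix might be applied, and the name
root_relative_url() is hypothetical.

```php
<?php
// Hypothetical illustration only; the real change lives in the attached
// patches, which are not shown in this thread.
function root_relative_url($base_url, $path) {
  $parts = parse_url($base_url);
  // Path part of the installation; '' when Drupal sits in the DocumentRoot.
  $install_path = isset($parts['path']) ? rtrim($parts['path'], '/') : '';

  // A path that is already "/" must not become "//".
  $base = ($path == '/' ? '' : $install_path . '/');
  return $base . $path;
}

echo root_relative_url('http://example.com/drupal', 'node/1'), "\n";  // /drupal/node/1
echo root_relative_url('http://example.com', 'node/1'), "\n";         // /node/1
echo root_relative_url('http://example.com', '/'), "\n";              // /
```

Since the result depends on $path, only the installation path can be
computed once and reused, which matches the point made above about
caching.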


------------------------------------------------------------------------

November 28, 2004 - 18:06 : kbahey

Attachment: http://drupal.org/files/issues/common-inc-patch.txt (1 KB)

OK.
I put in chx's suggested change.
This patch can go in CVS then, to rid us of the problems with paths not
beginning with slash.
This is not an ultimate solution still. We need to address the problem
with .css files. Although the header contains a:
<base href="http://example.com" />
it does not seem that major search engines and archiving sites obey it
anyway.


------------------------------------------------------------------------

December 2, 2004 - 15:31 : Dries

Your coding style needs work.  Also, I'm not going to commit this unless
the themes get fixed up: we'd end up with invalid URLs all over the
place.  Lastly, I wonder how portable the themes will be when Drupal is
run from within a subdirectory.


------------------------------------------------------------------------

December 2, 2004 - 16:15 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_1.txt (849 bytes)

Well, my patch worked from a subdirectory very well; in fact, I have not
tested it from the root dir. And I think that it adheres to coding
standards. So I resubmit it with the root path fix. However, my Drupal
work is focused on i18n these days, and I was never into theming, so it
won't be me who fixes those.


------------------------------------------------------------------------

December 2, 2004 - 17:07 : kbahey

I have tested the previous patch (including my fix) with drupal
installed in the DocumentRoot of the server.
So, in effect, it is tested with both Drupal in / and Drupal in a
subdirectory.
This change keeps crawlers and other browsers from getting confused.
While it is true that there is no fix for the .css files in the HTML
head section yet, this fix deals with a major part of the problem, and
rids us of a major pain. Check your web server's logs some time to see
what I mean.
Someone who is familiar with the themes can contribute a patch later. 
This patch and the future fix for themes are not mutually exclusive, so
let it go in CVS.


------------------------------------------------------------------------

December 9, 2004 - 10:31 : Goba

Please commit this into Drupal core, this fix is badly needed.


------------------------------------------------------------------------

January 17, 2005 - 08:00 : chx

Attachment: http://drupal.org/files/issues/base_url_kill.patch (4.34 KB)

Well, as no one has stepped in to fix this problem, I have tried to fix
the themes also. themes.inc, xtemplate.engine and the bluemarine
template are patched besides common.inc.
Of course, more templates could follow, but first I'd like to see your
opinions.


------------------------------------------------------------------------

January 17, 2005 - 08:09 : Goba

I don't think that removing <base> from the themes is a good idea, using
$parts['path'] should be encouraged though before the files, which would
fix the google cache problem, and would still keep the HTML size low. It
would also help those, who save the file to find the originating site
easier, since clicking on a non-pagelocal link would lead to the online
version.


------------------------------------------------------------------------

January 17, 2005 - 11:03 : Steven

Definitely -1 on removing the <base> tag or using absolute or
root-relative URLs. This tag has been around for ages, and it is the
only way to make clean URLs work without bloating in the code. FYI,
"base" is (first?) mentioned in Berners-Lee's HTML 1.0 draft [1].
That's June 1993.
As the amount of clean URL-using sites grows, the crawlers will have to
be updated. Perhaps we could prevent crawlers from going too insane by
404ing for URLs with more than say 10 components? That would prevent
the really crappy ones from hammering your site.
I'm all for making the <base> tag easier to handle for the user (say,
by including a filter to allow simple anchor links to work as most
people expect them to), but we should keep Drupal-generated URLs clean
and completely relative.
[1] http://www.w3.org/MarkUp/draft-ietf-iiir-html-01.txt
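
The component limit suggested above could look roughly like this. It is
a sketch only: the function name has_too_many_components() is
hypothetical, and the limit of 10 is simply the figure mentioned above.

```php
<?php
// Hypothetical check, not part of Drupal: count the non-empty components
// of a request path and report whether the count exceeds a limit.
function has_too_many_components($request_path, $limit = 10) {
  $components = array_filter(explode('/', $request_path), 'strlen');
  return count($components) > $limit;
}

// 4 components: under the limit.
var_dump(has_too_many_components('/linux/index.html/linux/index.html'));
// 12 components: over the limit.
var_dump(has_too_many_components(str_repeat('/themes/xtemplate/pushbutton', 4)));
```

A caller would answer with a 404 as soon as the check returns TRUE,
before doing any alias lookup.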


------------------------------------------------------------------------

January 17, 2005 - 11:57 : kbahey

The problem with css is this: The @import argument does not start with a
/.
This is simple to fix.
We keep the "base" as it is today, but add the new variable: $base
before it.
So for a site where Drupal is installed in the DocumentRoot, all that
will change is that misc/drupal.css and themes/themename/style.css
will be preceded by a slash. For sites that use another path, that path
will be prepended to the css file name.
How about that?


------------------------------------------------------------------------

January 17, 2005 - 12:38 : Steven

What exactly is the problem with the @import? As far as I know:
- url() in stylesheets is interpreted relative to the base of the
stylesheet, not the source document.
- However, if the styles are inside an HTML document, through a style
tag or style attribute, then the stylesheet's location is the same as
the HTML document.
- Thus, the stylesheet's base is the same as the base of the HTML
document (which can be altered through the <base> tag).
I just don't see why it is necessary. As far as I know, the only
browser that has had problems resolving CSS urls properly was Netscape
4, which does not support @import at all, and which Drupal does not
support either, because of its CSS usage.


------------------------------------------------------------------------

January 17, 2005 - 13:35 : kbahey

The problem for stylesheets is as follows. I think it mainly affects
crawlers and Google's cache.
Say you have an installation of Drupal in DocumentRoot. You then use URL
aliases, and put slashes in them.
For example, you use news/general/2004-12-15.html for a node.
That node still has misc/drupal.css and themes/pushbutton/style.css in
the head section of the document. Crawlers get fooled by that and try
to look for /news/general/misc/drupal.css and
/news/general/themes/pushbutton/style.css, which don't exist.
So, just prepending the new $base variable (in chx's patch) before the
stylesheet @import argument would fix this issue. Assuming you are in
DocumentRoot, then /misc and /themes would be used instead of just misc
and themes.
It would still be compliant with standards, be relative to the web
site, and not be ambiguous to anyone, be they crawler or browser.
I hope it is clearer now. 
I think chx can change the patch to use the $base instead of $base_url
everywhere, so as to avoid the host/domain name in the urls.
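
To make the example above concrete, here is a simplified sketch of what
a user agent ends up doing when it ignores the <base> tag and resolves a
reference against the page's own path. The function name
resolve_ignoring_base() is hypothetical, and the sketch handles only
plain references (no "..", no query strings, no full URLs).

```php
<?php
// Simplified resolution against the page's own path, as done by a user
// agent that ignores <base>. $page_path is assumed to start with '/'.
function resolve_ignoring_base($page_path, $reference) {
  if (substr($reference, 0, 1) == '/') {
    // Root-relative reference: the page's location no longer matters.
    return $reference;
  }
  // Otherwise replace everything after the last slash of the page path.
  $directory = substr($page_path, 0, strrpos($page_path, '/') + 1);
  return $directory . $reference;
}

$page = '/news/general/2004-12-15.html';
echo resolve_ignoring_base($page, 'misc/drupal.css'), "\n";   // /news/general/misc/drupal.css
echo resolve_ignoring_base($page, '/misc/drupal.css'), "\n";  // /misc/drupal.css
```

With the leading slash, the result is the same whether or not the user
agent honours the base href.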


------------------------------------------------------------------------

January 17, 2005 - 16:36 : Steven

But typical crawlers don't even pay attention to stylesheets, hence it
wouldn't have much use for them. I just don't see why we should adjust
to rare cases of buggy software. Reading out a base URL from an HTML
document is dead easy, and on top of that it doesn't add more
complexity as without the base tag, the document's URL is already an
implicit base which has to be parsed anyway.
I did not like it when we altered the <link> tag to accommodate buggy
RSS readers and I certainly don't like it now, as this is even rarer.
In both cases, it is not Drupal which is at fault.


------------------------------------------------------------------------

January 17, 2005 - 16:48 : kbahey

Steven
While I agree with most of what you said, the 404s show up in the logs
enough to be a bother.
Perhaps the original design of Drupal did not foresee that people will
use url aliases to mimic directory/file hierarchies. Whether this was
intended or not, it is the way many use Drupal today.
It does not matter where the bug is (Drupal or the external world), as
long as we can stop it ourselves, by adjusting our end of it.
The fix is simple enough and does not break standards (if implemented
as described with a leading / before the css).


------------------------------------------------------------------------

January 17, 2005 - 16:54 : Steven

It does not break standards, but it does bloat the code in an ugly way.
Why not send an e-mail to the owners of the crawlers and tell them to
implement a standard that is nearly 10 years old [2] (RFC 1808)?
Note that Google Cache now seems to correctly interpret base URLs [3]
and even adds a <base> tag of its own.
By the way, this problem has nothing to do with people using URL
aliases or not, as for a browser the regular nested paths that Drupal
uses (e.g. "node/1") are no different from aliases mimicking files
(e.g. "foo/bar.html").
[2] http://www.faqs.org/rfcs/rfc1808.html
[3]
http://www.google.be/search?q=cache%3Awww.drupal.org&sourceid=mozilla-search&start=0&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:en-US:official


------------------------------------------------------------------------

January 17, 2005 - 17:34 : Goba

Steven, part of the problem is that Google cache does add a base href
even if there is a base href in the document. E.g. it adds a <BASE
HREF="http://drupal.org/node/13733"> on the cached Plone comparison
page. Since HTML does not allow more than one base tag [4] to be
present, it is up to the browsers to use the first or the last base
value, or any of the base values on the page for that matter, as the
used base. So even pages displayed from the Google cache will be buggy
if a full relative path to the domain root is not specified, due to
this problem.
[4]
http://www.w3.org/TR/1999/REC-html401-19991224/sgml/dtd.html#head.content


------------------------------------------------------------------------

January 18, 2005 - 02:14 : chx

Attachment: http://drupal.org/files/issues/base_url_kill_0.patch (4.75 KB)

This one does not use the whole base_url, only the path part of it. HTML
bloat is kept to a minimum.


------------------------------------------------------------------------

February 1, 2005 - 18:33 : clairem

Please please can this be done?
It's a good idea in itself, but if using fully-qualified paths means we
can get rid of the BASE HREF, then page anchors will work without having
the overhead of a filter. That'd be a huge bonus for those creating
larger nodes, or who just want to be able to put a "skip navigation"
link in their theme without having to abandon Xtemplate or PHPtemplate


------------------------------------------------------------------------

February 1, 2005 - 18:37 : Goba

Well, speaking of skip navigation links, phptemplate and xtemplate
should expose the REQUEST_URI to the templates, so when a link to an
anchor on the same page is needed, the link can be formatted with the
complete request URI in mind.
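
The reason this matters: with a <base href> in the page, a bare
"#anchor" link is resolved against the base and not against the current
page. A minimal sketch of the idea follows; same_page_anchor() is a
hypothetical helper, not an existing phptemplate or xtemplate function.

```php
<?php
// Hypothetical helper: build a same-page anchor link from the request URI
// so that it still points at the current page when <base href> is set.
function same_page_anchor($request_uri, $fragment) {
  return htmlspecialchars($request_uri) . '#' . rawurlencode($fragment);
}

// In a real request the first argument would be $_SERVER['REQUEST_URI'].
$href = same_page_anchor('/drupal/node/4647', 'content');
echo '<a href="' . $href . '">skip navigation</a>', "\n";
// <a href="/drupal/node/4647#content">skip navigation</a>
```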


------------------------------------------------------------------------

February 2, 2005 - 18:40 : clairem

"phptemplate and xtemplate should expose the REQUEST_URI to the templates"
Should, but don't :(
If BASE HREF isn't removed, surely it wouldn't be a big job to
implement this tweak?


------------------------------------------------------------------------

February 17, 2005 - 09:50 : kbahey

This patch is badly needed. The lack of a leading / in many paths is
causing lots of problems.




