[drupal-devel] [bug] Problems with using relative path names
Goba
drupal-devel at drupal.org
Mon Jan 17 13:09:12 UTC 2005
Project: Drupal
Version: cvs
Component: base system
Category: bug reports
Priority: normal
Assigned to: Anonymous
Reported by: kbahey
Updated by: Goba
Status: patch
I don't think that removing from the themes is a good idea, using
$parts['path'] should be encouraged though before the files, which
would fix the google cache problem, and would still keep the HTML size
low. It would also help those, who save the file to find the
originating site easier, since clicking on a non-pagelocal link would
lead to the online version.
Goba
Previous comments:
------------------------------------------------------------------------
November 19, 2004 - 04:29 : kbahey
Looking at my site's logs, there seem to be several problems that are
caused by Drupal's use of relative path names.
If Drupal causes all the site's urls to be absolute, then none of this
would be an issue.
A. Search Engine Crawlers
Getting lots of 404s on things like: linux/index.html/robots.txt
Where 'linux' is an alias to a taxonomy, and 'index.html' is an alias
to a node within that taxonomy.
Another example, is recursing unnecessarily. I see 404s on things like:
/linux/index.html/linux/index.html
Where 'linux' is a path alias for a taxonomy term, and 'index.html' is
an alias to the main node within it.
This does not seem to happen when Google crawls my sites, but Yahoo's
Slurp suffers from this problem, and keeps recursing. MSNBot also
suffers from this.
Another crawler/harvester called Blinkx/DFS-Fetch keeps adding the .css
file to the relative path, getting a 404 on things like:
/linux/themes/xtemplate/pushbutton/logo.gif
And Fast Search Engine also attempts to access:
/linux/contact/tracker/tracker/user/password
The same goes for grub.org, another crawler.
B. Google Cache / Archive Way Back Machine
Pages in Google cache and archive.org Way Back Machine suffer form a
similar problem: the .css files cannot be found, and hence rendering of
the pages is not correct.
Examples:
Compare this: http://www.drupal.org/node/4647
To this:
http://www.google.ca/search?q=cache:www.drupal.org/node/view/4647
Notice the following:
How there is no formatting at all, because of the lack of a .css file
The httpd log on Drupal will show errors for:
linux/themes/pushbutton/style.css and linux/misc/drupal.css
Also see:
http://web.archive.org/web/20031016184902/http://www.drupal.org/
C. Proxy Caches:
When someone is browsing my site from behind a proxy cache, the web
site is hit with a rapid succession of requests, and many of it is just
for bogus pages.
Examples:
2004/11/17 - 17:47 404 error: linux/user/1 not found.
2004/11/17 - 17:47 404 error: linux/feedback not found.
2004/11/17 - 17:47 404 error: linux/tracker not found.
2004/11/17 - 17:47 404 error: linux/sitemap not found.
2004/11/17 - 17:47 404 error: linux/search not found.
2004/11/17 - 17:47 404 error: linux/misc not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/linux not found.
2004/11/17 - 17:47 404 error: linux/technology not found.
2004/11/17 - 17:47 404 error: linux/writings not found.
2004/11/17 - 17:47 404 error: linux/family not found.
And also:
2004/11/17 - 07:23 404 error: history/user/1 not found.
2004/11/17 - 07:23 404 error: history/tracker not found.
2004/11/17 - 07:23 404 error: history/feedback not found.
2004/11/17 - 07:23 404 error: history/sitemap not found.
2004/11/17 - 07:23 404 error: history/search not found.
2004/11/17 - 07:23 404 error: history/misc not found.
2004/11/17 - 07:23 404 error: history/technology not found.
2004/11/17 - 07:23 404 error: history/science not found.
2004/11/17 - 07:22 404 error: history/history not found.
2004/11/17 - 07:22 404 error: history/writings not found.
2004/11/17 - 07:22 404 error: history/family not found.
As you can tell, history and linux are aliases to taxonomy terms, and
so is misc, technology, writings, family, ...etc. The user agent is
appending the taxonomy term alias to the url and forming a new URL.
D. Regular Browsing:
There is even at least one extreme case where the following URL was
accessed (the result was 404 of course)
/book/view/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/logo.gif
It seems it was a normal user, because the user agent is: "Mozilla/4.0
(compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"
Proposed Solution:
As a proposed solution, all URLs in Drupal can be made into absolute
path names. This can be done by the following:
The variable $base_url in the conf.php file is broken down into two
components:
$base_host (the 'http://whatever-host.example.com' part WITHOUT the
trailing slash)
$base_path (the '/path-to-drupal' part, WITH the leading slash. If this
is the DocumentRoot, then it is just a '/' character)
$base_url is now $base_host concatenated with $base_path
A simple filter can be written to preceed every href="path" with the
$base_path variable, so it becomes "/path"
This option can be turned on and off for a site. The default is to have
it off so current behavior is maintained.
A similar scheme applies for style sheets as well.
So, did I miss something obvious? Am I seriously off the mark?
Your thoughts!
------------------------------------------------------------------------
November 20, 2004 - 05:10 : chrisada
I am getting similar 404 errors, mainly from rss feed link that looks
like /blog/blog/feed and many manual links that are relative to drupal
root.
It was not a problem before Drupal 4.5, so I think there might not be a
need to change all URIs to absolute. I can't see where the problem is
coming from though.
------------------------------------------------------------------------
November 20, 2004 - 05:35 : kbahey
I am pretty sure that these problems were happening for at least the
past 10 months (ever since I moved to Drupal in January 2004).
The main issue here is that crawlers and other user agents get confused
by the relative path names.
Using absolute paths will definitely solve this. However, is this the
only solution?
I am looking for a discussion of this.
------------------------------------------------------------------------
November 20, 2004 - 10:55 : Goba
No absolute paths please. Having the path start with '/' solves all the
mentioned problems, and is not absolute, it is relative to the domain.
Sadly some crawlers and even the Google Cache does not obey to the base
href. I have reported this cache problem in April to Google, and they
promised they will keep it in mind... Hehe...
What we need is to have the printed relative path values relative to
the domain name, and not relative to the Drupal installation path.
Note that this issue will appear on the drupal devel mailing list if
someone finally provides a patch we can talk about :)
------------------------------------------------------------------------
November 20, 2004 - 11:19 : Dries
Goba is right. We need paths relative to the domain name to fix this
'problem'.
------------------------------------------------------------------------
November 20, 2004 - 21:19 : kbahey
Sorry for not making my self clear.
When I said absolute, I meant that they start with just a /. I did NOT
mean that they start with http://host.example.com. That would be a very
bad idea.
In any case, what do people think about the proposed solution (breaking
down $base_url into two parts?)
Also, does this address the style sheets as well, or more is needed?
------------------------------------------------------------------------
November 21, 2004 - 17:24 : chx
Attachment: http://drupal.org/files/issues/common_inc_patch.txt (471 bytes)
I have implemented what Goba suggested.
------------------------------------------------------------------------
November 21, 2004 - 17:43 : chx
Attachment: http://drupal.org/files/issues/common_inc_patch_0.txt (825 bytes)
Maybe this one is faster?
------------------------------------------------------------------------
November 21, 2004 - 18:34 : kbahey
Man! You are fast!
I tried the second version. It works fine for things that are not
inside the node body, I mean they have a / in front of them, as we
want it to be.
Two comments/issues:
- If there is a URL that is already "/" representing the home page, it
gets set to "//". Perhaps it should check for that case?
- URLs in nodes that do not start with / do not get changed to have a /
prepended to them. Do we need a filter for this?
- Do we need to do something for the style sheets in the page header? I
mean the "misc/drupal.css" and "themes/themename/style.css"?
Thanks
------------------------------------------------------------------------
November 22, 2004 - 01:28 : kbahey
Hi chx
Here is a fix for the case where you have a url that is just "/".
In your patch, instead of:
[?php $base = $parts['path'] . '/' ; ?]
Replace that by:
[?php $base = ( $path == '/' ? $base : $parts['path'] . '/' ); ?]
------------------------------------------------------------------------
November 28, 2004 - 05:08 : kbahey
Did this patch make it into CVS yet?
If there are any objections to it, can someone please explain what they
are?
Thanks
------------------------------------------------------------------------
November 28, 2004 - 10:33 : Dries
Shouldn't your changes be included in the patch?
Also, it's better to cache $base rather than $parts.
Lastly, it this patch makes it to HEAD, we should probably remove some
'base url' cruft from the themes.
------------------------------------------------------------------------
November 28, 2004 - 19:54 : kbahey
Attachment: http://drupal.org/files/issues/x.diff (1 KB)
Here is the patch including my fix.
I am asking chx to comment on caching $base instead of $parts.
Will this make it faster?
------------------------------------------------------------------------
November 28, 2004 - 20:26 : chx
Hm. $base = ( $path == '/' ? $base : $parts['path'] . '/' ); this
depends on path which is a parameter. Thus I fail to see how could we
cache $base. I'd correct this code however $base = ( $path == '/' ? ''
: $parts['path'] . '/' ); 'cos I think $base is not defined before, but
this is not a problem, PHP will be happy to replace NULL with NULL...
Maybe instead of all parts, only $parts['path'] is enough to be cached,
yes, but the performance and memory usage difference -- I guess -- would
not be noticable...
------------------------------------------------------------------------
November 29, 2004 - 00:06 : kbahey
Attachment: http://drupal.org/files/issues/common-inc-patch.txt (1 KB)
OK.
I put in chx suggested change.
This patch can go in CVS then, to rid us of the problems with paths not
beginning with slash.
This is not an ultimate solution still. We need to address the problem
with .css files. Although the header contains a:
it does not seem that major search engines and archiving sites obey it
anyway.
------------------------------------------------------------------------
December 2, 2004 - 21:31 : Dries
Your coding style needs work. Also, I'm not going to commit this unless
the themes get fixed up: we'd end up with invalid URLs all over the
place. Lastly, I wonder how portable the themes will be when Drupal is
run from within a subdirectory.
------------------------------------------------------------------------
December 2, 2004 - 22:15 : chx
Attachment: http://drupal.org/files/issues/common_inc_patch_1.txt (849 bytes)
Well, my patch worked from a subdirectory very well, as fact, I have not
tested it from the root dir. And I think that it adheres to coding
standards. So I resubmit it with the root path fix. However, my Drupal
work is focused on i18n these days, and I was never into themeing so it
won't be me who fixes those.
------------------------------------------------------------------------
December 2, 2004 - 23:07 : kbahey
I have tested the previous patch (including my fix) with drupal
installed in the DocumentRoot of the server.
So, in effect, it is tested with both Drupal in / and Drupal in a
subdirectory.
This change fixes the problem for the crawlers and other browsers from
getting confused.
While it is true that there is no fix for the .css files in the HTML
head section yet, this fix deals with a major part of the problem, and
rids us of a major pain. Check your web server's logs some time to see
what I mean.
Someone who is familiar with the themes can contribute a patch later.
This patch and the future fix for themes are not mutually exclusive, so
let it go in CVS.
------------------------------------------------------------------------
December 9, 2004 - 16:31 : Goba
Please commit this into Drupal core, this fix is badly needed.
------------------------------------------------------------------------
January 17, 2005 - 14:00 : chx
Attachment: http://drupal.org/files/issues/base_url_kill.patch (4.34 KB)
Well as noone have stepped in to fix this problem, I have tried to fix
the themes also. themes.inc , xtemplate.engine and the bluemarine
template is patched besides common.inc.
Of course, more templates could follow, but first I'd like to see your
opinions.
--
View: http://drupal.org/node/13148
Edit: http://drupal.org/project/comments/add/13148
More information about the drupal-devel
mailing list