[drupal-devel] [bug] Problems with using relative path names

Goba drupal-devel at drupal.org
Mon Jan 17 13:09:12 UTC 2005


 Project:      Drupal
 Version:      cvs
 Component:    base system
 Category:     bug reports
 Priority:     normal
 Assigned to:  Anonymous
 Reported by:  kbahey
 Updated by:   Goba
 Status:       patch

I don't think that removing the base href from the themes is a good
idea. Prefixing file paths with $parts['path'] should be encouraged,
though: it would fix the Google cache problem and still keep the HTML
size low. Keeping the base href would also help people who save a page
to find the originating site more easily, since clicking on a
non-page-local link would lead to the online version.

Goba



Previous comments:
------------------------------------------------------------------------

November 19, 2004 - 04:29 : kbahey


Looking at my site's logs, there seem to be several problems that are
caused by Drupal's use of relative path names.


If Drupal causes all the site's urls to be absolute, then none of this
would be an issue.

A. Search Engine Crawlers

Getting lots of 404s on things like: linux/index.html/robots.txt

Where 'linux' is an alias to a taxonomy, and 'index.html' is an alias
to a node within that taxonomy.

Another example is unnecessary recursion. I see 404s on things like:
/linux/index.html/linux/index.html

Where 'linux' is a path alias for a taxonomy term, and 'index.html' is
an alias to the main node within it.

This does not seem to happen when Google crawls my sites, but Yahoo's
Slurp suffers from this problem, and keeps recursing. MSNBot also
suffers from this.

Another crawler/harvester called Blinkx/DFS-Fetch keeps adding the .css
file to the relative path, getting a 404 on things like:
/linux/themes/xtemplate/pushbutton/logo.gif

And Fast Search Engine also attempts to access:
/linux/contact/tracker/tracker/user/password

The same goes for grub.org, another crawler.

B. Google Cache / Archive Way Back Machine

Pages in Google cache and archive.org Way Back Machine suffer from a
similar problem: the .css files cannot be found, and hence rendering of
the pages is not correct.

Examples:

Compare this: http://www.drupal.org/node/4647
To this:
http://www.google.ca/search?q=cache:www.drupal.org/node/view/4647

Notice the following:

- There is no formatting at all, because of the lack of a .css file.

- The httpd log on Drupal will show errors for:
  linux/themes/pushbutton/style.css and linux/misc/drupal.css


Also see:
http://web.archive.org/web/20031016184902/http://www.drupal.org/

C. Proxy Caches:

When someone is browsing my site from behind a proxy cache, the web
site is hit with a rapid succession of requests, and many of them are
just for bogus pages.

Examples:

2004/11/17 - 17:47 404 error: linux/user/1 not found.
2004/11/17 - 17:47 404 error: linux/feedback not found.
2004/11/17 - 17:47 404 error: linux/tracker not found.
2004/11/17 - 17:47 404 error: linux/sitemap not found.
2004/11/17 - 17:47 404 error: linux/search not found.
2004/11/17 - 17:47 404 error: linux/misc not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/programming not found.
2004/11/17 - 17:47 404 error: linux/linux not found.
2004/11/17 - 17:47 404 error: linux/technology not found.
2004/11/17 - 17:47 404 error: linux/writings not found.
2004/11/17 - 17:47 404 error: linux/family not found.

And also:

2004/11/17 - 07:23 404 error: history/user/1 not found.
2004/11/17 - 07:23 404 error: history/tracker not found.
2004/11/17 - 07:23 404 error: history/feedback not found.
2004/11/17 - 07:23 404 error: history/sitemap not found.
2004/11/17 - 07:23 404 error: history/search not found.
2004/11/17 - 07:23 404 error: history/misc not found.
2004/11/17 - 07:23 404 error: history/technology not found.
2004/11/17 - 07:23 404 error: history/science not found.
2004/11/17 - 07:22 404 error: history/history not found.
2004/11/17 - 07:22 404 error: history/writings not found.
2004/11/17 - 07:22 404 error: history/family not found.


As you can tell, history and linux are aliases to taxonomy terms, and
so are misc, technology, writings, family, etc. The user agent is
appending the taxonomy term alias to the URL and forming a new URL.
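
The pattern in these log entries can be reproduced with a minimal
sketch, assuming the client ignores the base href and resolves each
relative href against the path of the page it is currently on. The
function name resolve_without_base() is hypothetical, and the sample
paths are modeled on the log lines above; real crawlers and proxies may
apply different rules.

```php
<?php
// Hypothetical helper: resolve an href the way a client might when it
// ignores the base href and uses the page's own path as the base.
function resolve_without_base(string $page_path, string $href): string {
  // Hrefs starting with '/' do not depend on the page's location.
  if ($href !== '' && $href[0] === '/') {
    return $href;
  }
  // Otherwise drop the last segment of the page path and append the href.
  $dir = substr($page_path, 0, strrpos($page_path, '/') + 1);
  return $dir . $href;
}

// A page under the 'linux' alias that links to 'user/1':
echo resolve_without_base('/linux/index.html', 'user/1'), "\n";   // /linux/user/1
// The same link written with a leading slash:
echo resolve_without_base('/linux/index.html', '/user/1'), "\n";  // /user/1
```

The first call yields the kind of bogus path seen in the 404 entries,
while the second one is unaffected by where the page lives.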

D. Regular Browsing:

There is even at least one extreme case where the following URL was
accessed (the result was 404 of course)

/book/view/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/t
hemes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/themes/xtempl
ate/pushbutton/themes/xtemplate/pushbutton/themes/xtemplate/pushbutton/logo.gif


It seems it was a normal user, because the user agent is: "Mozilla/4.0
(compatible; MSIE 5.5; Windows 98; Win 9x 4.90)"

Proposed Solution:

As a proposed solution, all URLs in Drupal can be made into absolute
path names. This can be done by the following:



The variable $base_url in the conf.php file is broken down into two
components:


$base_host (the 'http://whatever-host.example.com' part WITHOUT the
trailing slash)
$base_path (the '/path-to-drupal' part, WITH the leading slash. If this
is the DocumentRoot, then it is just a '/' character)


$base_url is now $base_host concatenated with $base_path

A simple filter can be written to precede every href="path" with the
$base_path variable, so it becomes "/path"

This option can be turned on and off for a site. The default is to have
it off so current behavior is maintained.

A similar scheme applies for style sheets as well.
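
The proposal above could be sketched roughly as follows. This is only
an illustration under stated assumptions, not an existing Drupal API:
the function prefix_hrefs() and its regular expression are
hypothetical, and the $base_url value is an example. Because
$base_path has no trailing slash when Drupal lives in a subdirectory,
the sketch adds the separating slash inside the filter.

```php
<?php
// Assumed example value; a real site would set this in conf.php.
$base_url = 'http://whatever-host.example.com/path-to-drupal';

$parts = parse_url($base_url);
// Host part, WITHOUT the trailing slash.
$base_host = $parts['scheme'] . '://' . $parts['host'];
// Path part, WITH the leading slash; just '/' for the DocumentRoot.
$base_path = $parts['path'] ?? '/';

// Hypothetical filter: prefix every href that is not already
// root-relative, fragment-only, or carrying a scheme such as http:.
function prefix_hrefs(string $html, string $base_path): string {
  $prefix = rtrim($base_path, '/') . '/';
  return preg_replace_callback(
    '~href="(?![a-z][a-z0-9+.-]*:|/|#)([^"]+)"~i',
    function (array $m) use ($prefix): string {
      return 'href="' . $prefix . $m[1] . '"';
    },
    $html
  );
}

echo prefix_hrefs('<a href="node/1">one</a> <a href="/node/2">two</a>', $base_path), "\n";
// <a href="/path-to-drupal/node/1">one</a> <a href="/node/2">two</a>
```

A regular expression over HTML is fragile (single-quoted attributes,
for instance, are not handled here), so this only shows the shape of
the idea.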


So, did I miss something obvious? Am I seriously off the mark?

Your thoughts!


------------------------------------------------------------------------

November 20, 2004 - 05:10 : chrisada

I am getting similar 404 errors, mainly from an RSS feed link that
looks like /blog/blog/feed and from many manual links that are relative
to the Drupal root.

It was not a problem before Drupal 4.5, so I think there might not be a
need to change all URIs to absolute. I can't see where the problem is
coming from though.

------------------------------------------------------------------------

November 20, 2004 - 05:35 : kbahey

I am pretty sure that these problems were happening for at least the
past 10 months (ever since I moved to Drupal in January 2004).

The main issue here is that crawlers and other user agents get confused
by the relative path names. 

Using absolute paths will definitely solve this.  However, is this the
only solution? 

I am looking for a discussion of this.

------------------------------------------------------------------------

November 20, 2004 - 10:55 : Goba

No absolute paths please. Having the path start with '/' solves all the
mentioned problems, and is not absolute, it is relative to the domain.
Sadly, some crawlers and even the Google Cache do not obey the base
href. I reported this cache problem to Google in April, and they
promised they would keep it in mind... Hehe...

What we need is to have the printed relative path values relative to
the domain name, and not relative to the Drupal installation path.

Note that this issue will appear on the drupal devel mailing list if
someone finally provides a patch we can talk about :)

------------------------------------------------------------------------

November 20, 2004 - 11:19 : Dries

Goba is right.  We need paths relative to the domain name to fix this
'problem'.

------------------------------------------------------------------------

November 20, 2004 - 21:19 : kbahey

Sorry for not making myself clear.

When I said absolute, I meant that they start with just a /. I did NOT
mean that they start with http://host.example.com. That would be a very
bad idea.

In any case, what do people think about the proposed solution (breaking
down $base_url into two parts?)

Also, does this address the style sheets as well, or is more needed?

------------------------------------------------------------------------

November 21, 2004 - 17:24 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch.txt (471 bytes)

I have implemented what Goba suggested.

------------------------------------------------------------------------

November 21, 2004 - 17:43 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_0.txt (825 bytes)

Maybe this one is faster?

------------------------------------------------------------------------

November 21, 2004 - 18:34 : kbahey

Man! You are fast!

I tried the second version. It works fine for things that are not
inside the node body, I mean they have  a / in front of them, as we
want it to be.

Three comments/issues:

- If there is a URL that is already "/" representing the home page, it
gets set to "//".  Perhaps it should check for that case?

- URLs in nodes that do not start with / do not get changed to have a /
prepended to them. Do we need a filter for this?

- Do we need to do something for the style sheets in the page header? I
mean the "misc/drupal.css" and "themes/themename/style.css"?

Thanks 


------------------------------------------------------------------------

November 22, 2004 - 01:28 : kbahey

Hi chx

Here is a fix for the case where you have a url that is just "/".

In your patch, instead of:

<?php  $base = $parts['path'] . '/' ; ?>

use:

<?php  $base = ( $path == '/' ? $base : $parts['path'] . '/' ); ?>



------------------------------------------------------------------------

November 28, 2004 - 05:08 : kbahey

Did this patch make it into CVS yet?

If there are any objections to it, can someone please explain what they
are?

Thanks

------------------------------------------------------------------------

November 28, 2004 - 10:33 : Dries

Shouldn't your changes be included in the patch?

Also, it's better to cache $base rather than $parts.

Lastly, if this patch makes it to HEAD, we should probably remove some
'base url' cruft from the themes.

------------------------------------------------------------------------

November 28, 2004 - 19:54 : kbahey

Attachment: http://drupal.org/files/issues/x.diff (1 KB)

Here is the patch including my fix.

I am asking chx to comment on caching $base instead of $parts.

Will this make it faster?

------------------------------------------------------------------------

November 28, 2004 - 20:26 : chx

Hm. $base = ( $path == '/' ? $base : $parts['path'] . '/' ); depends
on $path, which is a parameter, so I fail to see how we could cache
$base. I would correct this code, however, to $base = ( $path == '/' ?
'' : $parts['path'] . '/' ); because I think $base is not defined
beforehand. This is not a real problem, though: PHP will be happy to
replace NULL with NULL...

Maybe, instead of all of $parts, caching only $parts['path'] is enough,
yes, but the difference in performance and memory usage would, I guess,
not be noticeable...
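
To make the point about caching concrete, here is a minimal sketch. It
is not the code from any of the attached patches: the function name
root_relative_url() and passing $base_url as an argument are
assumptions made so the example is self-contained. Only the path
component of $base_url is cached, since $base itself depends on $path.

```php
<?php
// Hypothetical helper, not the code from the attached patches.
function root_relative_url(string $base_url, string $path): string {
  // Cache only the path component of each base URL that is seen.
  static $base_paths = [];
  if (!isset($base_paths[$base_url])) {
    $parts = parse_url($base_url);
    // Empty string when Drupal is installed in the DocumentRoot.
    $base_paths[$base_url] = $parts['path'] ?? '';
  }
  // $base depends on $path, so it is recomputed on every call.
  $base = ($path == '/' ? '' : $base_paths[$base_url] . '/');
  return $base . $path;
}

echo root_relative_url('http://example.com/drupal', 'node/1'), "\n";  // /drupal/node/1
echo root_relative_url('http://example.com', 'node/1'), "\n";         // /node/1
echo root_relative_url('http://example.com/drupal', '/'), "\n";       // /
```

The last call shows the special case discussed above: a path that is
already "/" is left alone instead of becoming "//".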

------------------------------------------------------------------------

November 29, 2004 - 00:06 : kbahey

Attachment: http://drupal.org/files/issues/common-inc-patch.txt (1 KB)

OK.

I put in chx's suggested change.

This patch can go in CVS then, to rid us of the problems with paths not
beginning with slash.

This is still not the ultimate solution. We need to address the problem
with .css files. Although the header contains a base href, it does not
seem that major search engines and archiving sites obey it anyway.

------------------------------------------------------------------------

December 2, 2004 - 21:31 : Dries

Your coding style needs work.  Also, I'm not going to commit this unless
the themes get fixed up: we'd end up with invalid URLs all over the
place.  Lastly, I wonder how portable the themes will be when Drupal is
run from within a subdirectory.

------------------------------------------------------------------------

December 2, 2004 - 22:15 : chx

Attachment: http://drupal.org/files/issues/common_inc_patch_1.txt (849 bytes)

Well, my patch worked very well from a subdirectory; in fact, I have
not tested it from the root dir. And I think that it adheres to the
coding standards. So I am resubmitting it with the root path fix.
However, my Drupal work is focused on i18n these days, and I was never
into theming, so it won't be me who fixes those.

------------------------------------------------------------------------

December 2, 2004 - 23:07 : kbahey

I have tested the previous patch (including my fix) with drupal
installed in the DocumentRoot of the server.

So, in effect, it is tested with both Drupal in / and Drupal in a
subdirectory.

This change keeps crawlers and other user agents from getting
confused.

While it is true that there is no fix for the .css files in the HTML
head section yet, this fix deals with a major part of the problem, and
rids us of a major pain. Check your web server's logs some time to see
what I mean.

Someone who is familiar with the themes can contribute a patch later. 

This patch and the future fix for themes are not mutually exclusive, so
let it go in CVS.

------------------------------------------------------------------------

December 9, 2004 - 16:31 : Goba

Please commit this into Drupal core, this fix is badly needed.

------------------------------------------------------------------------

January 17, 2005 - 14:00 : chx

Attachment: http://drupal.org/files/issues/base_url_kill.patch (4.34 KB)

Well, as no one has stepped in to fix this problem, I have tried to fix
the themes as well. Besides common.inc, themes.inc, xtemplate.engine and
the bluemarine template are patched.

Of course, more templates could follow, but first I'd like to see your
opinions.

-- 
View: http://drupal.org/node/13148
Edit: http://drupal.org/project/comments/add/13148




