mysterious search issue
I think i've just figured out a problem with a site I'm working on and wanted some wisdom from ths list. The site is http://community.telecentre.org/ (not that it matters). The problem: The problem was that the search module's database stopped getting updated (i.e., new material wasn't showing up in searches). I looked at the search_dataset table and discovered that the biggest nid (i.e., sid) was from a node that was published about 9 months ago (hmm, seems like most folks don't believe a site search anyway?). The diagnosis: So, I ran some debugging and discovered that the sql in node_update_index (the one that tells search whether there are any new nodes to spider) was returning no rows all the time, even though there was lots of new content. After struggling with the logic in the SQL, I think I figured out that the problem was a single node which had gotten a date of May 2007 in the created field. I don't think that's normally a problem, but the node_update_shutdown function (which is invoked in case search gets aborted because it runs out of time) saves the system variables node_cron_last and node_cron_last_nid as the current node's created and nid values. Conclusion: I think what happened was that the search indexer got aborted while processing a node with a future date. That inserted a future value into node_cron_last, which means that nodes don't get spidered again until that date. Question: (multiple choice to make it easy ...) 1. is this a problem with the node_update_shutdown logic (or the point in node_update_index when the last_change global gets set for it)? 2. Or is it a bug in the aggregator2 module that creates nodes with 'created' set in the future? 3. Or have i misdiagnosed the problem? 4. All of the above ... Comments: I've heard of other mysterious search indexing failures like this. It took me quite a while to figure out what was going on - the logic in what nodes get spidered is pretty complex. Does anyone have any handy tools for such search problem diagnosis? Sounds like a useful addition to the devel module or as a separate one. Something that can explain how many and which nodes will be spidered by the next cron perhaps ... -- Alan Dixon, Web Developer http://alan.g.dixon.googlepages.com/
Alan, nice report. Is there an issue in the issue queue for this? It seems a simple matter to add a clause to the query to ignore future nodes, and this isn't the first time I've heard of the problem. I'd be happy to work on this if nobody already is. Point me to an existing issue or one that you create. Thanks. Robert Douglass Alan Dixon wrote:
I think i've just figured out a problem with a site I'm working on and wanted some wisdom from ths list. The site is http://community.telecentre.org/ (not that it matters).
The problem:
The problem was that the search module's database stopped getting updated (i.e., new material wasn't showing up in searches). I looked at the search_dataset table and discovered that the biggest nid (i.e., sid) was from a node that was published about 9 months ago (hmm, seems like most folks don't believe a site search anyway?).
The diagnosis:
So, I ran some debugging and discovered that the sql in node_update_index (the one that tells search whether there are any new nodes to spider) was returning no rows all the time, even though there was lots of new content. After struggling with the logic in the SQL, I think I figured out that the problem was a single node which had gotten a date of May 2007 in the created field. I don't think that's normally a problem, but the node_update_shutdown function (which is invoked in case search gets aborted because it runs out of time) saves the system variables node_cron_last and node_cron_last_nid as the current node's created and nid values.
Conclusion: I think what happened was that the search indexer got aborted while processing a node with a future date. That inserted a future value into node_cron_last, which means that nodes don't get spidered again until that date.
Question: (multiple choice to make it easy ...)
1. is this a problem with the node_update_shutdown logic (or the point in node_update_index when the last_change global gets set for it)?
2. Or is it a bug in the aggregator2 module that creates nodes with 'created' set in the future?
3. Or have i misdiagnosed the problem?
4. All of the above ...
Comments:
I've heard of other mysterious search indexing failures like this. It took me quite a while to figure out what was going on - the logic in what nodes get spidered is pretty complex. Does anyone have any handy tools for such search problem diagnosis? Sounds like a useful addition to the devel module or as a separate one. Something that can explain how many and which nodes will be spidered by the next cron perhaps ...
Robert Douglass writes:
Alan,
nice report. Is there an issue in the issue queue for this? [snip] Point me to
an existing issue or one
that you create.
Issue here: http://drupal.org/node/57106 fwiw I've not yet found a node in the future on my site that would be culpable, although I'll try to check that properly (rather than checking node_cron_last_nid and visiting the node) to rule it out. Last it got to was me contacting AJK to provide a db dump, happy to do that or try stuff my end instead if guidance is forthcoming. Nat
Ah, thanks, that's it. Moving this discussion there. - Alan On 4/3/07, Nathaniel Catchpole <ncatchpole@brookehouse.ac.uk> wrote:
Robert Douglass writes:
Alan,
nice report. Is there an issue in the issue queue for this? [snip] Point me to
an existing issue or one
that you create.
Issue here: http://drupal.org/node/57106
fwiw I've not yet found a node in the future on my site that would be culpable, although I'll try to check that properly (rather than checking node_cron_last_nid and visiting the node) to rule it out. Last it got to was me contacting AJK to provide a db dump, happy to do that or try stuff my end instead if guidance is forthcoming.
Nat
-- Alan Dixon, Web Developer http://alan.g.dixon.googlepages.com/
Robert: Yes, I'll be happy to make an issue - I was hesitating because I wasn't sure where the issue was (search? node? aggregator2?). I've just gone and looked at it again and I realise it's slightly more complicated. The sql I referred to in node_update_index is this: $result = db_query_range('SELECT GREATEST(IF(c.last_comment_timestamp IS NULL, 0, c.last_comment_timestamp), n.changed) as last_change, n.nid FROM {node} n LEFT JOIN {node_comment_statistics} c ON n.nid = c.nid WHERE n.status = 1 AND ((GREATEST(n.changed, c.last_comment_timestamp) = %d AND n.nid > %d) OR (n.changed > %d OR c.last_comment_timestamp > %d)) ORDER BY GREATEST(n.changed, c.last_comment_timestamp) ASC, n.nid ASC', $last, $last_nid, $last, $last, $last, 0, $limit); so I was wrong in my reference to the node's created value, since it's the changed value that matters. And indeed the changed value was okay - it was the last_comment_timestamp which was set to next May. And that was because the comments module initializes the last_comment_timestamp in the node_comment_statistics as the node's created value (hehe, so i was sort of right ..). In other words, another option is: 5. A bug in the comments module on line 290 - when a new node is created, the last_comment_timestamp of the node_comment_statistics should be initialized to the current date, or the lesser of the current date and the node's created date. On 4/3/07, Robert Douglass <rob@robshouse.net> wrote:
Alan,
nice report. Is there an issue in the issue queue for this? It seems a simple matter to add a clause to the query to ignore future nodes, and this isn't the first time I've heard of the problem. I'd be happy to work on this if nobody already is. Point me to an existing issue or one that you create. Thanks.
Robert Douglass
Alan Dixon wrote:
I think i've just figured out a problem with a site I'm working on and wanted some wisdom from ths list. The site is http://community.telecentre.org/ (not that it matters).
The problem:
The problem was that the search module's database stopped getting updated (i.e., new material wasn't showing up in searches). I looked at the search_dataset table and discovered that the biggest nid (i.e., sid) was from a node that was published about 9 months ago (hmm, seems like most folks don't believe a site search anyway?).
The diagnosis:
So, I ran some debugging and discovered that the sql in node_update_index (the one that tells search whether there are any new nodes to spider) was returning no rows all the time, even though there was lots of new content. After struggling with the logic in the SQL, I think I figured out that the problem was a single node which had gotten a date of May 2007 in the created field. I don't think that's normally a problem, but the node_update_shutdown function (which is invoked in case search gets aborted because it runs out of time) saves the system variables node_cron_last and node_cron_last_nid as the current node's created and nid values.
Conclusion: I think what happened was that the search indexer got aborted while processing a node with a future date. That inserted a future value into node_cron_last, which means that nodes don't get spidered again until that date.
Question: (multiple choice to make it easy ...)
1. is this a problem with the node_update_shutdown logic (or the point in node_update_index when the last_change global gets set for it)?
2. Or is it a bug in the aggregator2 module that creates nodes with 'created' set in the future?
3. Or have i misdiagnosed the problem?
4. All of the above ...
Comments:
I've heard of other mysterious search indexing failures like this. It took me quite a while to figure out what was going on - the logic in what nodes get spidered is pretty complex. Does anyone have any handy tools for such search problem diagnosis? Sounds like a useful addition to the devel module or as a separate one. Something that can explain how many and which nodes will be spidered by the next cron perhaps ...
-- Alan Dixon, Web Developer http://alan.g.dixon.googlepages.com/
Hi Alan, Is there any chance it's to do with this: http://drupal.org/node/90128 ? Hope this helps! Mark On 3 Apr 2007, at 15:58, Alan Dixon wrote:
I think i've just figured out a problem with a site I'm working on and wanted some wisdom from ths list. The site is http://community.telecentre.org/ (not that it matters).
The problem:
The problem was that the search module's database stopped getting updated (i.e., new material wasn't showing up in searches). I looked at the search_dataset table and discovered that the biggest nid (i.e., sid) was from a node that was published about 9 months ago (hmm, seems like most folks don't believe a site search anyway?).
The diagnosis:
So, I ran some debugging and discovered that the sql in node_update_index (the one that tells search whether there are any new nodes to spider) was returning no rows all the time, even though there was lots of new content. After struggling with the logic in the SQL, I think I figured out that the problem was a single node which had gotten a date of May 2007 in the created field. I don't think that's normally a problem, but the node_update_shutdown function (which is invoked in case search gets aborted because it runs out of time) saves the system variables node_cron_last and node_cron_last_nid as the current node's created and nid values.
Conclusion: I think what happened was that the search indexer got aborted while processing a node with a future date. That inserted a future value into node_cron_last, which means that nodes don't get spidered again until that date.
Question: (multiple choice to make it easy ...)
1. is this a problem with the node_update_shutdown logic (or the point in node_update_index when the last_change global gets set for it)?
2. Or is it a bug in the aggregator2 module that creates nodes with 'created' set in the future?
3. Or have i misdiagnosed the problem?
4. All of the above ...
Comments:
I've heard of other mysterious search indexing failures like this. It took me quite a while to figure out what was going on - the logic in what nodes get spidered is pretty complex. Does anyone have any handy tools for such search problem diagnosis? Sounds like a useful addition to the devel module or as a separate one. Something that can explain how many and which nodes will be spidered by the next cron perhaps ...
-- Alan Dixon, Web Developer http://alan.g.dixon.googlepages.com/
I've run into a similar problem -- not sure if it is the same problem -- I just never filed an issue. I have a problem with comments getting dates in the future and the logic for node_cron_last and node_cron_last_nid breaks when this happens. So I wrote the following custom cron hook to fix this: function mojo_cron() { $max_sid = db_result(db_query('SELECT MAX(sid) FROM {search_index}')); $max_nid = db_result(db_query('SELECT MAX(nid) FROM {node}')); if ($max_sid + 1000 < $max_nid) { watchdog('mojo', t('search indexing is behind (%sid of %nid)', array('%sid' => $max_sid, '%nid' => $max_nid)), WATCHDOG_NOTICE); } $last = variable_get('node_cron_last', 0); if (!$last || $last > time()) { $max_created = db_result(db_query('SELECT MAX(created) FROM {node} n INNER JOIN {search_index} s ON s.sid=n.nid')); variable_set('node_cron_last', $max_created); watchdog('mojo', t('fixed cron last time'), WATCHDOG_NOTICE); } } Doug Green 904-583-3342 www.douggreenconsulting.com Bringing Ideas to Life with Software Artistry and Invention... Providing open source software political solutions -----Original Message----- From: development-bounces@drupal.org [mailto:development-bounces@drupal.org] On Behalf Of Mark Leicester Sent: Tuesday, April 03, 2007 11:12 AM To: development@drupal.org Subject: Re: [development] mysterious search issue Hi Alan, Is there any chance it's to do with this: http://drupal.org/node/90128 ? Hope this helps! Mark On 3 Apr 2007, at 15:58, Alan Dixon wrote:
I think i've just figured out a problem with a site I'm working on and wanted some wisdom from ths list. The site is http://community.telecentre.org/ (not that it matters).
The problem:
The problem was that the search module's database stopped getting updated (i.e., new material wasn't showing up in searches). I looked at the search_dataset table and discovered that the biggest nid (i.e., sid) was from a node that was published about 9 months ago (hmm, seems like most folks don't believe a site search anyway?).
The diagnosis:
So, I ran some debugging and discovered that the sql in node_update_index (the one that tells search whether there are any new nodes to spider) was returning no rows all the time, even though there was lots of new content. After struggling with the logic in the SQL, I think I figured out that the problem was a single node which had gotten a date of May 2007 in the created field. I don't think that's normally a problem, but the node_update_shutdown function (which is invoked in case search gets aborted because it runs out of time) saves the system variables node_cron_last and node_cron_last_nid as the current node's created and nid values.
Conclusion: I think what happened was that the search indexer got aborted while processing a node with a future date. That inserted a future value into node_cron_last, which means that nodes don't get spidered again until that date.
Question: (multiple choice to make it easy ...)
1. is this a problem with the node_update_shutdown logic (or the point in node_update_index when the last_change global gets set for it)?
2. Or is it a bug in the aggregator2 module that creates nodes with 'created' set in the future?
3. Or have i misdiagnosed the problem?
4. All of the above ...
Comments:
I've heard of other mysterious search indexing failures like this. It took me quite a while to figure out what was going on - the logic in what nodes get spidered is pretty complex. Does anyone have any handy tools for such search problem diagnosis? Sounds like a useful addition to the devel module or as a separate one. Something that can explain how many and which nodes will be spidered by the next cron perhaps ...
-- Alan Dixon, Web Developer http://alan.g.dixon.googlepages.com/
participants (5)
-
Alan Dixon -
Doug Green -
Mark Leicester -
Nathaniel Catchpole -
Robert Douglass