[development] mysterious search issue

Alan Dixon alan.g.dixon at gmail.com
Tue Apr 3 14:58:43 UTC 2007


I think I've just figured out a problem with a site I'm working on and
wanted some wisdom from this list. The site is
http://community.telecentre.org/ (not that it matters).

The problem:

The problem was that the search module's database stopped getting
updated (i.e., new material wasn't showing up in searches). I looked
at the search_dataset table and discovered that the biggest nid (i.e.,
sid) was from a node that was published about 9 months ago (hmm, it seems
like most folks don't trust a site's search anyway?).
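
For anyone who wants to run the same check, this is roughly what I did
(a sketch using the Drupal 5 API; it assumes search_dataset stores node
entries with sid = nid and type = 'node'):

  // Find the newest node the indexer has actually stored, and when it was created.
  $newest_nid = db_result(db_query("SELECT MAX(sid) FROM {search_dataset} WHERE type = 'node'"));
  $created = db_result(db_query('SELECT created FROM {node} WHERE nid = %d', $newest_nid));
  print 'Newest indexed nid: ' . $newest_nid . ', created ' . format_date($created) . "\n";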

The diagnosis:

So, I ran some debugging and discovered that the SQL in
node_update_index (the query that tells search whether there are any new
nodes to spider) was consistently returning no rows, even though there
was lots of new content. After struggling with the logic in the SQL, I
think I figured out that the problem was a single node which had
gotten a date of May 2007 in its 'created' field. That isn't normally
a problem by itself, but the node_update_shutdown function (which is
invoked in case indexing gets aborted because it runs out of time) saves
the system variables node_cron_last and node_cron_last_nid from the
current node's created and nid values.
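
If I'm reading node.module right, the interaction goes roughly like this
(a paraphrase rather than a quote of the Drupal 5 source, so treat the
details as approximate):

  // node_update_index() walks nodes newer than the stored watermark and keeps
  // the globals up to date as it goes; the shutdown handler then persists
  // wherever it got to, so an aborted run can resume on the next cron.
  function node_update_shutdown() {
    global $last_change, $last_nid;
    if ($last_change && $last_nid) {
      // If the node being processed when the run was cut short has a *future*
      // 'created' timestamp, that future timestamp becomes the new watermark.
      variable_set('node_cron_last', $last_change);
      variable_set('node_cron_last_nid', $last_nid);
    }
  }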

Conclusion: I think what happened was that the search indexer got
aborted while processing a node with a future date. That inserted a
future value into node_cron_last, which means that nodes don't get
spidered again until that date.
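
If that's right, one way to unstick a site in this state (a guess on my
part, not something I've seen documented) would be to fix the bogus
'created' date on the offending node and then push the watermark back so
the indexer starts scanning again on the next cron:

  // Reset the indexing watermark; the next cron run will re-scan from the
  // beginning, which is slow on a big site but harmless as far as I can tell.
  variable_set('node_cron_last', 0);
  variable_set('node_cron_last_nid', 0);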

Question: (multiple choice to make it easy ...)

1. Is this a problem with the node_update_shutdown logic (or with the point
in node_update_index at which the last_change global gets set for it)?

2. Or is it a bug in the aggregator2 module that creates nodes with
'created' set in the future?

3. Or have I misdiagnosed the problem?

4. All of the above ...

Comments:

I've heard of other mysterious search-indexing failures like this. It
took me quite a while to figure out what was going on - the logic
governing which nodes get spidered is pretty complex. Does anyone have
any handy tools for diagnosing this kind of search problem? It sounds
like a useful addition to the devel module, or a separate module of its
own. Something that can report how many and which nodes will be spidered
by the next cron run, perhaps ...
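
As a starting point, even something as crude as this would have saved me
a lot of time (a rough sketch; the real node_update_index query is more
involved and, I believe, also considers comment timestamps, so take the
count as approximate):

  // Report the current watermark and how many published nodes are newer than it,
  // i.e. roughly how much work the next indexing run should find.
  $last = variable_get('node_cron_last', 0);
  $count = db_result(db_query('SELECT COUNT(*) FROM {node} n WHERE n.status = 1
    AND (n.created > %d OR n.changed > %d)', $last, $last));
  drupal_set_message($count . ' nodes are waiting for the indexer; node_cron_last is currently ' . format_date($last));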



-- 
Alan Dixon, Web Developer
http://alan.g.dixon.googlepages.com/

