[drupal-devel] [feature] Improve search link ranking

Steven drupal-devel at drupal.org
Sat Apr 23 09:09:23 UTC 2005


Issue status update for http://drupal.org/node/21152

 Project:      Drupal
 Version:      cvs
 Component:    search.module
 Category:     feature requests
 Priority:     normal
 Assigned to:  Steven
 Reported by:  Steven
 Updated by:   Steven
 Status:       patch
 Attachment:   http://drupal.org/files/issues/search.link.ranking.patch (2.87 KB)

Existing system
Search.module currently tracks links to nodes, and indexes the words in
the link's text as part of the target node. Thus, if someone links to
the magic quotes FAQ with <a>more info about magic quotes</a> then the
target node will get a higher ranking for a query containing any of the
words "more info about magic quotes". However, there are two problems
with this:


1) Most people don't provide a good caption for their link ("click
here" or even just the URL). 


2) Sometimes there are a bunch of names for a particular subject and
the user used a less common name (e.g. teaser, excerpt, summary).


This means that the link hint of "this page is important" often does
not come into play simply because the search query doesn't match any
linked keywords.


The search query goes like this:
SELECT i.sid, i.type, SUM(i.score / t.total) AS score FROM search_index
i INNER JOIN search_total t ON i.word = t.word WHERE (...keyword
matching...) GROUP BY i.sid, i.type ORDER BY score DESC


Thus it selects all matching keywords for the query, groups them by
source (a sid/type pair, e.g. "node" + nid) and sums the (relative)
scores per source.


For example when searching for "Drupal 4.6.0" one might get:


#wordsidfromsidtypescoretotalsum (per sid/type)#1drupal1NULLnode6106/10
+ 2/5 + 1/10 =
1.1#24601NULLnode25#3drupal13node110#4drupal2NULLnode3102/10 = 0.2
#3 is a link from node 3 to node 1 with the word "drupal". The other
rows represent words that appeared literally in the result node.


Note that these are the rows before grouping and summation. MySQL will
only return two rows, where the first row combines #1,#2,#3 (node 1
with score 1.1) and the second row is only for #4 (node 2 with score
0.2).


Changes
This patch improves the results by assigning a de-facto score boost to
nodes which are linked to. Only the node itself still has to match one
of the given keywords, not the link texts. The node receives a boost
proportional to how many times it has been linked to.


It fits well within the existing search.module idea of trying to ensure
the top results are relevant rather than focusing on making all results
relevant. It does not obsolete the old mechanism, but complements it.


Implementation-wise, this is a bit tricky. 


Issues
The new, generic link boost should however only be applied once per
item, not once per keyword. And if an item does not match any keywords,
we don't want it to show up in the results (even if it has links
pointing to it).


There are two approaches to this:


- Store the link boost among the regular words with an empty word as a
special marker. When selecting the results, we assume the empty word is
implicitly part of the search query, but we need to only keep those
sid/type pairs that have at least one non-empty word in them. The only
way I can think of is to require that each indexed item has a matching
row (with a 0 score if there are no links). Then we can do a COUNT(*)
per item and require that each sid/type pair has at least two matching
rows (one link boost + one real word). This would of course be quite
wasteful as there are nowhere near as much links as there are nodes.


SELECT i.sid, i.type, COUNT(*) AS count, SUM(i.score / t.total) AS
score FROM search_index i INNER JOIN search_total t ON i.word = t.word
WHERE (i.word = '') OR (...keyword matching...) GROUP BY i.sid, i.type
ORDER BY score DESC HAVING count > 1


- Store the link boost in a separate table to join on. As far as I know
it's not possible to do one join per GROUP BY item, so I'd have to use a
MAX() statement to reduce the link boost which is fetched per keyword to
one score per sid/type pair. If we don't want an entry in search_links
per item, then we have to use an if() statement to see if the join
succeeded (0.02 is the score boost per link):


SELECT i.type, i.sid, (SUM(i.score/t.count) + MAX(if(j.sid IS NULL, 0,
j.score))*0.02) AS score FROM search_index i INNER JOIN search_total t
ON i.word = t.word LEFT JOIN search_links j ON i.sid = j.sid AND i.type
= j.type WHERE (...keyword matching..) GROUP BY i.type, i.sid ORDER BY
score DESC


In this patch I implemented a hybrid option: the link boosts are stored
in the search_index table with an empty word, like in option 1. But like
option 2, we use a join (on search_index itself) to fetch the boost and
we don't store link counts for items without links.


SELECT i.type, i.sid, (SUM(i.score/t.count) + MAX(if(j.word IS NULL, 0,
j.score))*0.02) AS score FROM search_index i INNER JOIN search_total t
ON i.word = t.word LEFT JOIN search_index j ON i.sid = j.sid AND j.word
= '' WHERE (...keyword matching...) GROUP BY i.type, i.sid ORDER BY
score DESC


The reason I used the same table is because the links table would
otherwise simply have the same DB structure as the search_index table.
The word column is keyed, so restricting to j.word = "" should be fast.


While this patch works, I'm not too sure about the implementation. I'm
not an SQL guru, so feedback is certainly needed on all aspects.


Results
I can't really try it out on Drupal.org because it requires
re-indexing, but I do have some clues that it would provide an
improvement: of all the linked keywords in the search index now, 1977
are for the word "http", which almost certainly means a literal URL.
Thus these records are almost never used, unless someone actually
searched for "http", even though they tell us something useful (the
linked nodes were interesting).




Steven




More information about the drupal-devel mailing list