[drupal-devel] [feature] Advanced search features

Steven drupal-devel at drupal.org
Thu Aug 4 02:34:43 UTC 2005


Issue status update for 
http://drupal.org/node/28159
Post a follow up: 
http://drupal.org/project/comments/add/28159

 Project:      Drupal
 Version:      cvs
 Component:    search.module
 Category:     feature requests
 Priority:     normal
 Assigned to:  Steven
 Reported by:  Steven
 Updated by:   Steven
 Status:       patch (code needs review)
 Attachment:   http://drupal.org/files/issues/search_3.patch (37.05 KB)

Sorry, the patch was malformed because WinCVS wrapped those really long
preg classes :P. Fixed patch attached.




Steven



Previous comments:
------------------------------------------------------------------------

Thu, 04 Aug 2005 00:46:49 +0000 : Steven

Attachment: http://drupal.org/files/issues/search_2.patch (37.05 KB)

Here's my promised search patch. It's not 100% commit ready yet, but
it's time to solicit some feedback and get this tested ;). Note that
this patch requires a db update, which will wipe the search index. You
will then need to call cron.php enough times for the site to be indexed
completely again. This could take a while for large databases, but you
can control the throttle and see the progress at admin/settings/search.


*Features*



* AND keyword matching by default ('all of the words'), instead of OR
('any of the words').
* OR support through keyword1 OR keyword2 OR ...
* Phrase searching through "quoted strings".
* Negative matching through -"minus prefix" -word.
* Restrict search by taxonomy or node type(s) using taxonomy:1,2 and
type:blog,page.

The options are built into the keyword string through a Google-like
syntax, but there is an expandable "advanced settings" form below the
search box which acts as a 'query builder':

This example will result in the following search string (of course not
a practical example):
test type:forum,story category:1 "tinky winky" OR "dipsy" -"uh oh"
"teletubby bye bye"
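
To make the syntax concrete, here is a rough sketch of how such a search
string could be split into its parts. It is written in Python for
illustration only and is not the code from the patch; the names TOKEN
and parse_search and the shape of the returned dictionary are
assumptions of mine.

```python
import re

# Hypothetical tokenizer: an optional minus sign followed by either a
# "quoted phrase" or a run of non-space characters.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')


def parse_search(query):
    """Split a search string into options, AND terms, OR groups and negatives."""
    parsed = {"options": {}, "and": [], "or": [], "not": []}
    join_next = False    # True right after an OR operator
    last_in_and = None   # None: no positive term seen yet
    for match in TOKEN.finditer(query):
        negative = match.group(1) == "-"
        phrase, word = match.group(2), match.group(3)
        if phrase is None and not negative:
            if word == "OR":
                join_next = True
                continue
            if ":" in word:
                # Options such as type:forum,story (assumed format).
                key, _, value = word.partition(":")
                parsed["options"][key] = value.split(",")
                continue
        text = phrase if phrase is not None else word
        if negative:
            parsed["not"].append(text)
        elif join_next and last_in_and is not None:
            if last_in_and:
                # Move the previous AND term into a new OR group.
                parsed["or"].append([parsed["and"].pop()])
            parsed["or"][-1].append(text)
            last_in_and = False
        else:
            parsed["and"].append(text)
            last_in_and = True
        join_next = False
    return parsed


example = parse_search(
    'test type:forum,story category:1 "tinky winky" OR "dipsy" '
    '-"uh oh" "teletubby bye bye"'
)
```

For the example string, this sketch files "test" and "teletubby bye bye"
as AND terms, groups "tinky winky" with "dipsy" as one OR group, and
records "uh oh" as a negative.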


On a different note, I removed the wildcard matching. An important
reason is that there were significant performance problems with leading
wildcards. Such queries were not able to use any indices, and the
resulting full-table scan took a long time. Even Google does not have
intra-word wildcards; theirs can only be used as placeholders for
entire words in phrases.


Trailing wildcards on the other hand are usually used to accommodate
grammatical variations on a word. But wildcards are not really the
best tool for this, as they put a burden on the user. If you need this
feature, you should instead tie in an algorithm like the Porter Stemmer
through the search_preprocess hook.
That way you can reduce related words to a single common root (e.g.
"walker" "walking" "walked" to "walk"). The search system will then
index and search on the reduced words. You will even benefit from a
reduced database size because there are fewer unique words.


Because such algorithms are very language specific, I didn't build in
any. But it should be trivial to make a Porter Stemmer module for
Drupal search, which can be used on English sites.
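
As a rough illustration of the idea (in Python, and deliberately NOT the
Porter Stemmer), a preprocessing step could strip a few common English
suffixes before words are indexed or searched. The names naive_stem and
preprocess and the suffix list are assumptions for this sketch; a real
module would hook into search_preprocess and use a proper stemming
algorithm.

```python
# Toy suffix list; the real Porter algorithm has many more rules.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip one common suffix, keeping at least three leading letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Stand-in for a preprocess step: stem every space-separated word."""
    return " ".join(naive_stem(word) for word in text.split())


stemmed = preprocess("walker walking walked")
unique_before = len(set("walker walking walked".split()))
unique_after = len(set(stemmed.split()))
```

Here the three variants collapse into the single root "walk", so the
index stores one unique word instead of three.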


*Database*
To implement the above searches, I added a 'search_dataset' table that
is independent of the keyword index. Each dataset row contains the
entire contents of the indexed item, but filtered, cleaned up and
reduced to space-separated tokens (words, numbers, dates, ...). This
table is used to resolve the exact conditions, which means the keyword
index is not as essential anymore. Because searches are AND by default,
the OR method of search_index acts as an initial filter to eliminate the
majority of items immediately. That subset is then further reduced
through the search_dataset table. All of this means that the
search_index table can now be indexed at a much higher minimum word
length (e.g. 5), which means a reduced database size. Even with the new
dataset table, the net database size shrinks slightly.
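
The two-stage lookup can be sketched in memory like this. This is only
an illustration in Python with dictionaries standing in for the
search_index and search_dataset tables; the function names, the
fallback when no query word is long enough to be indexed, and the
simple whitespace tokenizing are all assumptions of the sketch.

```python
MIN_WORD_LENGTH = 5  # example minimum indexed word length from the text


def build(items):
    """Build a keyword index and a dataset of space-separated tokens."""
    index, dataset = {}, {}
    for item_id, text in items.items():
        tokens = text.lower().split()
        # Pad with spaces so whole words and phrases can be matched exactly.
        dataset[item_id] = " " + " ".join(tokens) + " "
        for token in tokens:
            if len(token) >= MIN_WORD_LENGTH:
                index.setdefault(token, set()).add(item_id)
    return index, dataset


def search(index, dataset, terms):
    """AND search: OR prefilter on the index, exact check on the dataset."""
    terms = [" ".join(term.lower().split()) for term in terms]
    words = {w for t in terms for w in t.split() if len(w) >= MIN_WORD_LENGTH}
    if words:
        # Stage 1: any item containing at least one indexed word.
        candidates = set().union(*(index.get(w, set()) for w in words))
    else:
        # Assumed fallback: nothing indexable, so check every item.
        candidates = set(dataset)
    # Stage 2: every word or phrase must appear in the dataset row.
    return sorted(
        item_id for item_id in candidates
        if all(f" {term} " in dataset[item_id] for term in terms)
    )


index, dataset = build({
    1: "Drupal theme development guide",
    2: "Drupal module development",
    3: "theme park",
})
both = search(index, dataset, ["drupal", "theme"])
phrase = search(index, dataset, ["module development"])
short = search(index, dataset, ["park"])
```

Because the second stage does the exact matching, short words such as
"park" can still be found even though they never enter the index.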


I also implemented the searching as two selects into temporary tables.
This allows me to avoid doing a costly counting query for the pager and
a range-limited query for the actual results. I added support for
temporary tables to database.(my|pg)sql. The db api itself takes a
normal SELECT and a table name, and turns it into an appropriate
platform specific temporary table query (CREATE TABLE ... AS, CREATE
TABLE ... SELECT).
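
The rewriting step could look roughly like the following. It is a
Python sketch of the idea rather than the code in database.mysql or
database.pgsql, and the function name, the platform labels and the
example query are made up for illustration.

```python
def temporary_table_query(select_sql, table, platform):
    """Turn a plain SELECT into a platform-specific temporary table query."""
    if not select_sql.lstrip().upper().startswith("SELECT"):
        raise ValueError("expected a SELECT statement")
    if platform == "mysql":
        # MySQL form: CREATE TABLE ... SELECT
        return f"CREATE TEMPORARY TABLE {table} {select_sql}"
    if platform == "pgsql":
        # PostgreSQL form: CREATE TABLE ... AS
        return f"CREATE TEMPORARY TABLE {table} AS {select_sql}"
    raise ValueError(f"unsupported platform: {platform}")


# Hypothetical query and table names, for illustration only.
select = "SELECT sid, SUM(score) AS score FROM search_index GROUP BY sid"
mysql_sql = temporary_table_query(select, "temp_search", "mysql")
pgsql_sql = temporary_table_query(select, "temp_search", "pgsql")
```

Once the results sit in the temporary table, the pager total and the
current page of results can both be read from it without running the
expensive search query again.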


I still need to do detailed benchmarking, but at least for the same
queries as before, this patch should be faster. Of course, pre-patch,
all searches were OR, not AND, so a direct comparison needs to take
this into account (the pre-patch query "drupal theme development" is
now "drupal OR theme OR development").


One feature request that I did not do is date based searching (before
X/X/X, after X/X/X), mostly because we don't have a good date widget
yet. I've been toying with making a simple in-page JS date picker, but
it's not done yet and I think the patch is good enough already. Date
restrictions can be added on later without any problems.




------------------------------------------------------------------------

Thu, 04 Aug 2005 01:27:29 +0000 : Steven

Oh and in case this wasn't clear, the syntax of putting extra conditions
into the search keywords ("type:blog") means that each search result
page can be linked to directly. They all have clean URLs:
search/node/type:blog+keyword for example.
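
As a small illustration, building such a path from a keyword string
might look like this in Python. The helper name search_path is
hypothetical, and the choice to leave ":" and "," unescaped is an
assumption made so the output matches the example path above; Drupal's
own URL encoding may differ.

```python
from urllib.parse import quote_plus


def search_path(keys, module="node"):
    """Build a linkable search path from a keyword string."""
    # Spaces become "+"; ":" and "," are kept readable (assumption).
    return f"search/{module}/" + quote_plus(keys, safe=":,")


path = search_path("type:blog keyword")
```

For the keyword string "type:blog keyword" this yields
search/node/type:blog+keyword.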






