[drupal-devel] [feature] Advanced search features

killes drupal-devel at drupal.org
Thu Aug 4 14:29:33 UTC 2005


Issue status update for 
http://drupal.org/node/28159
Post a follow up: 
http://drupal.org/project/comments/add/28159

 Project:      Drupal
 Version:      cvs
 Component:    search.module
 Category:     feature requests
 Priority:     normal
 Assigned to:  Steven
 Reported by:  Steven
 Updated by:   killes at www.drop.org
-Status:       patch (ready to be committed)
+Status:       patch (code needs review)

Oops, that comment should have been for another issue.




killes at www.drop.org



Previous comments:
------------------------------------------------------------------------

Thu, 04 Aug 2005 00:46:49 +0000 : Steven

Attachment: http://drupal.org/files/issues/search_2.patch (37.05 KB)

Here's my promised search patch. It's not 100% commit ready yet, but
it's time to solicit some feedback and get this tested ;). Note that
this patch requires a db update, which will wipe the search index. You
will then need to call cron.php enough times for the site to be indexed
completely again. This could take a while for large databases, but you
can control the throttle and see the progress at admin/settings/search.


*Features*



* AND keyword matching by default ('all of the words'), instead of OR
('any of the words').
* OR support through keyword1 OR keyword2 OR ...
* Phrase searching through "quoted strings".
* Negative matching through -"minus prefix" -word.
* Restrict search by taxonomy or node type(s) using taxonomy:1,2 and
type:blog,page.

The options are built into the keyword string through a Google-like
syntax, but there is an expandable "advanced settings" form below the
search box which acts as a 'query builder':

This example will result in the following search string (of course not
a practical example):
test type:forum,story category:1 "tinky winky" OR "dipsy" -"uh oh"
"teletubby bye bye"

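To make the syntax concrete, here is a minimal Python sketch of how such
a keyword string could be split into its parts. It is not the parser
from the patch (which is PHP); parse_query, FILTER_KEYS and the result
keys are names made up for illustration, and both taxonomy: and
category: are accepted only because both spellings appear above.

```python
import re

# Hypothetical tokenizer: an optional minus, then a "quoted phrase" or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')

# Assumed filter prefixes; the post mentions taxonomy:, category: and type:.
FILTER_KEYS = {"type", "taxonomy", "category"}


def parse_query(query):
    """Split a search string into required, alternative, excluded and filter parts."""
    groups = []        # each positive term starts a group; OR extends the last one
    excluded = []
    filters = {}
    pending_or = False

    for match in TOKEN.finditer(query):
        negative = match.group(1) == "-"
        quoted = match.group(2) is not None
        text = match.group(2) if quoted else match.group(3)

        if not quoted and not negative:
            if text == "OR":
                pending_or = True
                continue
            key, sep, values = text.partition(":")
            if sep and key in FILTER_KEYS:
                filters[key] = values.split(",")
                continue

        if negative:
            excluded.append(text)
        elif pending_or and groups:
            groups[-1].append(text)
        else:
            groups.append([text])
        pending_or = False

    return {
        "all": [g[0] for g in groups if len(g) == 1],
        "any": [g for g in groups if len(g) > 1],
        "none": excluded,
        "filters": filters,
    }


example = ('test type:forum,story category:1 "tinky winky" OR "dipsy" '
           '-"uh oh" "teletubby bye bye"')
parsed = parse_query(example)
```

For the example string, "test" and "teletubby bye bye" end up as required
terms, "tinky winky" and "dipsy" form one OR group, "uh oh" is excluded,
and type/category become filters.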

On a different note, I removed the wildcard matching. An important
reason is that there were significant performance problems with leading
wildcards. Such queries were not able to use any indices, and the
resulting full-table scan took a long time. Even Google does not have
intra-word wildcards; theirs can only be used as placeholders for
entire words in phrases.


Trailing wildcards on the other hand are usually used to accommodate
grammatical variations on a word. But wildcards are not really the
best tool for this as this puts a burden on the user. If you need this
feature, you should instead tie in an algorithm like the Porter Stemmer
through the search_preprocess hook.
That way you can reduce related words to a single common root (e.g.
"walker" "walking" "walked" to "walk"). The search system will then
index and search on the reduced words. You will even benefit from a
reduced database size because there are fewer unique words.


Because such algorithms are very language specific, I didn't build in
any. But it should be trivial to make a Porter Stemmer module for
Drupal search, which can be used on English sites.
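
As a rough illustration of the preprocessing idea, the Python sketch
below reduces words to a common root before indexing. It is deliberately
a naive suffix stripper and not the Porter algorithm, and naive_stem and
preprocess are made-up names standing in for whatever a module would
register through the search_preprocess hook.

```python
# Assumed suffix list for the demo; a real stemmer uses far more careful rules.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase the text and stem every whitespace-separated word."""
    return " ".join(naive_stem(word) for word in text.lower().split())


reduced = preprocess("walker walking walked")
```

Since the same function would run on both the indexed text and the
search keywords, a search for "walking" would then also match items that
only contain "walked".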


*Database*
To implement the above searches, I added a 'search_dataset' table that
is independent of the keyword index. Each dataset row contains the
entire contents of the indexed item, but filtered, cleaned up and
reduced to space-separated tokens (words, numbers, dates, ...). This
table is used to resolve the exact conditions, which means the keyword
index is not as essential anymore. Because searches are AND by default,
the OR method of search_index acts as an initial filter to eliminate the
majority of items immediately. That subset is then further reduced
through the search_dataset table. All of this means that the
search_index table can now be indexed at a much higher minimum word
length (e.g. 5), which means a reduced database size. Even with the new
dataset table, the net database size shrinks slightly.
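
The two-stage lookup described above can be sketched in Python with
plain dictionaries standing in for the two tables. This is only an
illustration of the idea, not the patch's SQL: build, search and
MIN_WORD_LENGTH are hypothetical names, and falling back to every item
when no keyword is long enough to be indexed is my assumption.

```python
MIN_WORD_LENGTH = 5  # example minimum word length taken from the text above


def build(items):
    """Return (index, dataset) for a dict of item id -> raw text."""
    index = {}    # word -> set of item ids, long words only
    dataset = {}  # item id -> space-separated tokens, padded with spaces
    for item_id, text in items.items():
        tokens = text.lower().split()
        dataset[item_id] = " " + " ".join(tokens) + " "
        for token in tokens:
            if len(token) >= MIN_WORD_LENGTH:
                index.setdefault(token, set()).add(item_id)
    return index, dataset


def search(index, dataset, terms):
    """Return ids of items containing every term (a word or a phrase)."""
    terms = [term.lower() for term in terms]
    indexed = [w for term in terms for w in term.split()
               if len(w) >= MIN_WORD_LENGTH]
    if indexed:
        # Stage 1: OR match on the keyword index discards most items.
        candidates = set()
        for word in indexed:
            candidates |= index.get(word, set())
    else:
        # Assumption: with no indexable keyword, check every item.
        candidates = set(dataset)
    # Stage 2: exact AND / phrase check against the dataset text.
    return sorted(item_id for item_id in candidates
                  if all(f" {term} " in dataset[item_id] for term in terms))


index, dataset = build({
    1: "Drupal theme development guide",
    2: "Drupal module development",
    3: "theme park",
})
both = search(index, dataset, ["drupal", "theme"])
phrase = search(index, dataset, ["theme development"])
```

Padding each dataset row with spaces lets a simple substring test match
whole tokens and phrases only.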


I also implemented the searching as two selects into temporary tables.
This allows me to avoid doing a costly counting query for the pager and
a range-limited query for the actual results. I added support for
temporary tables to database.(my|pg)sql. The db api itself takes a
normal SELECT and a table name, and turns it into an appropriate
platform specific temporary table query (CREATE TABLE ... AS, CREATE
TABLE ... SELECT).
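
A minimal Python sketch of the temporary-table approach follows. It uses
SQLite purely as a stand-in, since the patch itself targets MySQL and
PostgreSQL; query_temporary, the node table and temp_search_results are
names invented for this demo, not the functions added to the db api.

```python
import sqlite3


def query_temporary(conn, select_sql, table):
    """Run a normal SELECT and store its rows in a temporary table.

    The table name is interpolated directly, so it must come from
    trusted code, never from user input.
    """
    conn.execute(f"CREATE TEMPORARY TABLE {table} AS {select_sql}")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE node (nid INTEGER PRIMARY KEY, title TEXT)")
conn.executemany("INSERT INTO node (title) VALUES (?)",
                 [(f"post {i}",) for i in range(25)])

# The expensive search query runs once...
query_temporary(conn,
                "SELECT nid, title FROM node WHERE title LIKE 'post%'",
                "temp_search_results")

# ...and both the pager total and the visible page read the stored rows.
total = conn.execute(
    "SELECT COUNT(*) FROM temp_search_results").fetchone()[0]
page = [row[0] for row in conn.execute(
    "SELECT nid FROM temp_search_results ORDER BY nid LIMIT 10 OFFSET 10")]
```

The count and the range query both hit the small temporary table, so the
matching conditions are evaluated only once per search.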


I still need to do detailed benchmarking, but at least for the same
queries as before, this patch should be faster. Of course, pre-patch,
all searches were OR, not AND, so a direct comparison needs to take
this into account (the pre-patch query "drupal theme development" is
now "drupal OR theme OR development").


One feature request that I did not do is date based searching (before
X/X/X, after X/X/X), mostly because we don't have a good date widget
yet. I've been toying with making a simple in-page JS date picker, but
it's not done yet and I think the patch is good enough already. Date
restrictions can be added on later without any problems.




------------------------------------------------------------------------

Thu, 04 Aug 2005 01:27:29 +0000 : Steven

Oh and in case this wasn't clear, the syntax of putting extra conditions
into the search keywords ("type:blog") means that each search result
page can be linked to directly. They all have clean URLs:
search/node/type:blog+keyword for example.
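
As a small illustration of why this makes results linkable, the Python
sketch below turns a keyword string into such a path. search_path is a
made-up helper, and the exact escaping Drupal applies is an assumption
here; only the search/node/type:blog+keyword shape comes from the
example above.

```python
from urllib.parse import quote_plus, unquote_plus


def search_path(keys, module="node"):
    """Build a search path; spaces become '+', while ':' and ',' stay readable."""
    return f"search/{module}/" + quote_plus(keys, safe=":,")


path = search_path("type:blog keyword")
# Decoding the last path segment gives back the original keyword string.
keys = unquote_plus(path.split("/", 2)[2])
```

Because the conditions live in the keyword string itself, no extra query
parameters are needed to reproduce a result page.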




------------------------------------------------------------------------

Thu, 04 Aug 2005 02:34:38 +0000 : Steven

Attachment: http://drupal.org/files/issues/search_3.patch (37.05 KB)

Sorry, the patch was malformed because wincvs wrapped those really long
preg classes :P. Fixed patch attached.




------------------------------------------------------------------------

Thu, 04 Aug 2005 13:28:14 +0000 : stevryn

This looks great, can't wait till it's fully ready. I tried trip_search,
but couldn't get it to work, and the regular search definitely needed
some better features! I would like to test it, but I have no idea how
to apply a patch. Can you give me simple, for a Unix dummy,
instructions on how to go about it?


Tx
T




------------------------------------------------------------------------

Thu, 04 Aug 2005 14:23:08 +0000 : webchick

> Can you give me simple, for a Unix dummy, instructions on how to go
> about it?


I can help you there, I think. Follow step 2 here if you don't already
have a CVS version of Drupal up and running (you can't use this patch
against 4.6.2, for example): http://www.planetsoc.com/node/164


Then, switch to your Drupal CVS root directory, for example:


cd ~/drupal-cvs


Use wget to retrieve a copy of the most recent patch (in this case,
search_3.patch):


wget http://drupal.org/files/issues/search_3.patch


Execute the following command to apply the patch to your Drupal
installation:


patch -p0 -u < search_3.patch


This will patch all the files with the updated search.


Then go through the normal steps you would go through to get a new
Drupal system up and running. Step 3 of the aforementioned link has
some info on how to get a table prefix going if you want to keep this
test version separate from your "normal" Drupal installation.


My problem is I've done all of this, but am still getting strange
errors (even on a "normal" unpatched version of the search), so I need
to figure out if I have a problem on my end or what's going on.




------------------------------------------------------------------------

Thu, 04 Aug 2005 14:27:52 +0000 : killes at www.drop.org

@Jeremy: Sorting by two fields does not seem to work.


@Moshe: This code does not rely on the fact that wid is an
auto_increment field in any way. Just some concerns did.






