[drupal-devel] [feature] Advanced search features

matt_paz drupal-devel at drupal.org
Fri Aug 5 21:57:48 UTC 2005


Issue status update for 
http://drupal.org/node/28159
Post a follow up: 
http://drupal.org/project/comments/add/28159

 Project:      Drupal
 Version:      cvs
 Component:    search.module
 Category:     feature requests
 Priority:     normal
 Assigned to:  Steven
 Reported by:  Steven
 Updated by:   matt_paz
 Status:       patch (code needs review)

I have the code running at http://connect.educause.edu


"100% of the site has been indexed."
Minimum word length to index: 4


Nonetheless, I'm getting mixed results.


If I search for 'Hawkins' I get no results ... even though I know (I
think) it should be indexing the word hawkins from the body of this
content ...
http://connect.educause.edu/blog/mpasiewicz/317


Any ideas?




matt_paz



Previous comments:
------------------------------------------------------------------------

Thu, 04 Aug 2005 00:46:49 +0000 : Steven

Attachment: http://drupal.org/files/issues/search_2.patch (37.05 KB)

Here's my promised search patch. It's not 100% commit ready yet, but
it's time to solicit some feedback and get this tested ;). Note that
this patch requires a db update, which will wipe the search index. You
will then need to call cron.php enough times for the site to be indexed
completely again. This could take a while for large databases, but you
can control the throttle and see the progress at admin/settings/search.


*Features*



* AND keyword matching by default ('all of the words'), instead of OR
('any of the words').
* OR support through keyword1 OR keyword2 OR ...
* Phrase searching through "quoted strings".
* Negative matching through -"minus prefix" -word.
* Restrict search by taxonomy or node type(s) using taxonomy:1,2 and
type:blog,page.
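
As an illustration only (this is not code from the patch), the syntax
listed above could be split into its parts roughly as follows. The
function name parse_search_query and the returned structure are
hypothetical, and the sketch assumes that any unquoted word containing
a colon is an option such as type: or taxonomy:.

```python
import re

# An optional minus prefix, then either a "quoted phrase" or a bare word.
TOKEN = re.compile(r'(-?)(?:"([^"]*)"|(\S+))')

def parse_search_query(query):
    """Split a keyword string into options, AND/OR groups and negatives."""
    options = {}       # e.g. {"type": ["blog", "page"]}
    groups = []        # AND across groups, OR within one group
    negatives = []     # terms that must not appear
    join_next = False  # True right after an OR token
    for minus, phrase, word in TOKEN.findall(query):
        term = phrase or word
        if not term:
            continue
        if word == "OR" and not minus:
            # OR only makes sense if there is a previous term to join to.
            join_next = bool(groups)
            continue
        if word and ":" in word and not minus:
            name, _, values = word.partition(":")
            options[name] = values.split(",")
        elif minus:
            negatives.append(term)
        elif join_next:
            groups[-1].append(term)
        else:
            groups.append([term])
        join_next = False
    return options, groups, negatives

options, groups, negatives = parse_search_query(
    'drupal type:blog,page taxonomy:1,2 "quoted strings" OR theme -word'
)
```

Here the keyword drupal must match, and either the phrase "quoted
strings" or the word theme must match, while word must not appear.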

The options are built into the keyword string through a Google-like
syntax, but there is an expandable "advanced settings" form below the
search box which acts as a 'query builder':

This example will result in the following search string (of course not
a practical example):
test type:forum,story category:1 "tinky winky" OR "dipsy" -"uh oh"
"teletubby bye bye"


On a different note, I removed the wildcard matching. An important
reason is that there were significant performance problems with leading
wildcards. Such queries were not able to use any indices, and the
resulting full-table scan took a long time. Even Google does not have
intra-word wildcards; theirs can only be used as placeholders for
entire words in phrases.


Trailing wildcards on the other hand are usually used to accommodate
grammatical variations on a word. But, wildcards are not really the
best tool for this as this puts a burden on the user. If you need this
feature, you should instead tie in an algorithm like the Porter Stemmer
through the search_preprocess hook.
That way you can reduce related words to a single common root (e.g.
"walker" "walking" "walked" to "walk"). The search system will then
index and search on the reduced words. You will even benefit from a
reduced database size because there are fewer unique words.


Because such algorithms are very language specific, I didn't build in
any. But it should be trivial to make a Porter Stemmer module for
Drupal search, which can be used on English sites.
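
To make the preprocessing idea concrete, here is a toy sketch. It is
NOT the Porter Stemmer and not Drupal code: naive_stem only strips a
few common English suffixes, and both function names are made up for
illustration. The point is that the same reduction is applied to text
before indexing and to keywords before searching.

```python
def naive_stem(word):
    """Strip a few common English suffixes (toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "er", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase the text and reduce every space-separated token to its stem."""
    return " ".join(naive_stem(token) for token in text.lower().split())

indexed_text = preprocess("Walker walking walked")
search_keys = preprocess("walks")
```

With this reduction, all three word forms in the indexed text and the
search keyword end up as the same token, "walk".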


*Database*
To implement the above searches, I added a 'search_dataset' table that
is independent of the keyword index. Each dataset row contains the
entire contents of the indexed item, but filtered, cleaned up and
reduced to space-separated tokens (words, numbers, dates, ...). This
table is used to resolve the exact conditions, which means the keyword
index is not as essential anymore. Because searches are AND by default,
the OR method of search_index acts as an initial filter to eliminate the
majority of items immediately. That subset is then further reduced
through the search_dataset table. All of this means that the
search_index table can now be indexed at a much higher minimum word
length (e.g. 5), which means a reduced database size. Even with the new
dataset table, the net database size shrinks slightly.
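
The two-stage lookup described above can be sketched in memory like
this. It is a simplified model under stated assumptions, not the SQL
from the patch: a dict stands in for search_index, a dict of
space-separated strings stands in for search_dataset, and the names
build_index, search and MIN_WORD_LENGTH are hypothetical.

```python
MIN_WORD_LENGTH = 5  # assumed minimum word length for the keyword index

def build_index(items):
    """Map each sufficiently long word to the set of item ids containing it."""
    index = {}
    for item_id, text in items.items():
        for word in text.split():
            if len(word) >= MIN_WORD_LENGTH:
                index.setdefault(word, set()).add(item_id)
    return index

def search(items, index, keywords):
    """AND-match keywords: coarse OR filter on the index, exact check after."""
    # Stage 1: union (OR) of index hits for the words long enough to be indexed.
    candidates = set()
    for word in keywords:
        if len(word) >= MIN_WORD_LENGTH:
            candidates |= index.get(word, set())
    # Stage 2: every keyword must occur as a whole token in the dataset text.
    return sorted(
        item_id
        for item_id in candidates
        if all(f" {word} " in f" {items[item_id]} " for word in keywords)
    )

items = {
    1: "drupal theme development",
    2: "drupal search patch",
    3: "theme for a blog",
}
index = build_index(items)
both = search(items, index, ["drupal", "theme"])   # only item 1 has both
mixed = search(items, index, ["theme", "for"])     # "for" is checked in stage 2
short = search(items, index, ["for", "a"])         # no indexed word at all
```

In this model a short word such as "for" can still be matched, as long
as at least one other keyword is long enough to be in the index; a
query made up only of short words finds nothing.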


I also implemented the searching as two selects into temporary tables.
This allows me to avoid doing a costly counting query for the pager and
a range-limited query for the actual results. I added support for
temporary tables to database.(my|pg)sql. The db api itself takes a
normal SELECT and a table name, and turns it into an appropriate
platform specific temporary table query (CREATE TABLE ... AS, CREATE
TABLE ... SELECT).
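
A rough sketch of that transformation, for illustration only. The
function name temporary_table_query is hypothetical, and the mapping
of the two statement forms to MySQL and PostgreSQL reflects my
understanding of those databases rather than the code in the patch.

```python
import re

def temporary_table_query(select_sql, table, platform):
    """Turn a plain SELECT into a platform-specific temporary-table statement."""
    if not re.match(r"\s*SELECT\b", select_sql, re.IGNORECASE):
        raise ValueError("expected a SELECT statement")
    body = select_sql.strip()
    if platform == "mysql":
        # MySQL form: CREATE TEMPORARY TABLE name SELECT ...
        return f"CREATE TEMPORARY TABLE {table} {body}"
    if platform == "pgsql":
        # PostgreSQL form: CREATE TEMPORARY TABLE name AS SELECT ...
        return f"CREATE TEMPORARY TABLE {table} AS {body}"
    raise ValueError(f"unsupported platform: {platform}")

query = "SELECT sid, score FROM search_index WHERE word = 'drupal'"
mysql_sql = temporary_table_query(query, "temp_search", "mysql")
pgsql_sql = temporary_table_query(query, "temp_search", "pgsql")
```

Once the results sit in a temporary table, the pager can count its
rows and fetch one page from it without re-running the expensive
search query.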


I still need to do detailed benchmarking, but at least for the same
queries as before, this patch should be faster. Of course, pre-patch,
all searches were OR, not AND, so a direct comparison needs to take
this into account (the pre-patch query "drupal theme development" is
now "drupal OR theme OR development").


One feature request that I did not do is date based searching (before
X/X/X, after X/X/X), mostly because we don't have a good date widget
yet. I've been toying with making a simple in-page JS date picker, but
it's not done yet and I think the patch is good enough already. Date
restrictions can be added on later without any problems.




------------------------------------------------------------------------

Thu, 04 Aug 2005 01:27:29 +0000 : Steven

Oh and in case this wasn't clear, the syntax of putting extra conditions
into the search keywords ("type:blog") means that each search result
page can be linked to directly. They all have clean URLs:
search/node/type:blog+keyword for example.




------------------------------------------------------------------------

Thu, 04 Aug 2005 02:34:38 +0000 : Steven

Attachment: http://drupal.org/files/issues/search_3.patch (37.05 KB)

Sorry, the patch was malformed because wincvs wrapped those really long
preg classes :P. Fixed patch attached.




------------------------------------------------------------------------

Thu, 04 Aug 2005 13:28:14 +0000 : stevryn

This looks great, can't wait till it's fully ready. I tried trip_search,
but couldn't get it to work, and the regular search definitely needed
some better features! I would like to test it, but I have no idea how
to apply a patch. Can you give me simple, for a Unix dummy,
instructions on how to go about it?


Tx
T




------------------------------------------------------------------------

Thu, 04 Aug 2005 14:23:08 +0000 : webchick

> Can you give me simple, for a Unix dummy, instructions on how to go
about it?


I can help you there, I think. Follow step 2 here if you don't already
have a CVS version of Drupal up and running (you can't use this patch
against 4.6.2, for example): http://www.planetsoc.com/node/164


Then, switch to your Drupal CVS root directory, for example:


cd ~/drupal-cvs


Use wget to retrieve a copy of the most recent patch (in this case,
search_3.patch):


wget http://drupal.org/files/issues/search_3.patch


Execute the following command to apply the patch to your Drupal
installation:


patch -p0 -u < search_3.patch


This will patch all the files with the updated search.


Then go through the normal steps you would go through to get a new
Drupal system up and running. Step 3 of the aforementioned link has
some info on how to get a table prefix going if you want to keep this
test version separate from your "normal" Drupal installation.


My problem is I've done all of this, but am still getting strange
errors (even on a "normal" unpatched version of the search), so I need
to figure out if I have a problem on my end or what's going on.




------------------------------------------------------------------------

Thu, 04 Aug 2005 14:27:52 +0000 : killes at www.drop.org

@Jeremy: Sorting by two fields does not seem to work.


@Moshe: This code does not rely on the fact that wid is an
auto_increment field in any way. Just some concerns did.




------------------------------------------------------------------------

Thu, 04 Aug 2005 14:29:28 +0000 : killes at www.drop.org

Oops, that comment should have been for another issue.




------------------------------------------------------------------------

Thu, 04 Aug 2005 15:49:36 +0000 : matt_paz

It would be nice to allow the ability to select which vocabularies and
node types are (or aren't) available in the advanced search.  Or to be
able to turn them off altogether.  It would also be nice to be able to
display the total node count for each type/category in parens.




------------------------------------------------------------------------

Thu, 04 Aug 2005 15:50:11 +0000 : matt_paz

Nice addition!  It seems to be working great.  It would be nice to allow
the ability to select which vocabularies and node types are (or aren't)
available in the advanced search.  Or to be able to turn them off
altogether.  It would also be nice to be able to display the total node
count for each type/category in parens.




------------------------------------------------------------------------

Thu, 04 Aug 2005 17:20:48 +0000 : stevryn

Thanks webchick! I have it working now. Great work Steven, my live site
is 4.6.1, I haven't wanted to take the great leap and update, last time
I did it was *not* pretty. I assume this will work with that version
once it's completed and submitted? Seriously the search functionality
needs this sort of advanced features!! Tx for all your work!




------------------------------------------------------------------------

Fri, 05 Aug 2005 06:19:51 +0000 : Kobus

Hi!


Like the previous replier to this post, I can't apply patches myself,
but simply because I don't do it frequently enough, and forgot how to
do it, but I will catch up with this and test if I get a chance. In
principle this is a great patch! Definite +1.


I have another feature (not sure it belongs to this thread, so if not,
I apologize) that I would love to see in the search module. That is to
provide hooks so that you can create a customized search form, for
example, I need the following three field sets in a search form for a
property website:



* Price range "Start price" -> "End price", using some MIN() and MAX()
functions if the user selects the wrong way around. (dropdowns with
certain ranges).
* Area where the property is supposed to be located in (taxonomy
category).
* Features that the property MUST have, for example, your requirement
would be "TWO BATHROOMS" or "FOUR BEDROOMS". (Search through
checkboxes, radios and text fields that were defined in the
"property.module" file and database structure.)

(This module is written by an amateur (me), that's why not contributed,
but, should there be interest in it, I will contribute it.)


These extra search forms would be way different for each different
application, so I don't expect the search module to be able to actually
do this, but at least to provide functionality that a coder can write
such an extension in his module, in other words, the queries that will
define the search, and the form that the user will see, should be
definable in the module, and called instead of the default search form
if required.


Will this be possible at all?


Regards,


Kobus




------------------------------------------------------------------------

Fri, 05 Aug 2005 06:42:59 +0000 : lgarfiel

@Kobus: Actually you can do that now.  See the Location module for an
example of a fully custom search function.




------------------------------------------------------------------------

Fri, 05 Aug 2005 06:59:25 +0000 : Kobus

Hi!


Thanks lgarfiel. I have downloaded the module and will play around with
it over the weekend.


Regards,


Kobus




------------------------------------------------------------------------

Fri, 05 Aug 2005 10:11:35 +0000 : Bèr Kessels

* My update failed. You had/have a problem in update.inc in the
updates/callback array. I hand-modified the database and it works fine
now.
* I am not happy with the 5 character default limit. It took me quite
some time to find out why my words were not indexed. But that should
not hold back this patch. More something to have a closer look at in the
future. I am thinking of intelligent auto-blacklisting or so.
* When I do an advanced search, the results are printed in the box:
very good! But my advanced form is empty. That is not good usability,
imo. That form should represent what I was searching for.
* I still do not like the way results are returned by default: "weblink
- Bèr Kessels - 05/08/2005 - 14:20 - 0 comments" is far too much data.
And I even have "do not show data for foo" set in the theme settings!
Can we please rethink the *default* styles for search results? Without
all the CSS bloat, and without all the details? Why not go for a default
teaser view?
$node->title = $title; // the name, subject or title of the element
$node->teaser = $content; // the nicely highlighted data of the search result body
That way it will be consistent with the rest of the site, save a *lot* of
code and be more themeable too.


* "Your search yielded no results" should be a set_message. Yes, we
discussed that it should be in the place where people look for the
results, but I tried it, and it is just as visible above the search as
below! Really. (see screenie)


Overall, I think introducing a complex search is very nice. And I think
this is a great step forward.
But I would much rather see this implemented with a MUCH easier API as
well as MUCH easier hooks. Preprocessing as a single hook. What /is/
preprocessing? Form, index, etc. in one hook? Why put preprocess in a
single hook, but the others in one huge nodeapi-like construction? What
do they do, these nodeapi things? Why do I need them and when do I need
them?
I say this because I spent hours and hours with trial-and-error
methods to get some form of advanced flexinode search going. The
current system is just too hard to grok for an average developer like
me.


I know these things are easy to say, but very hard to implement. But as
long as we cannot allow searching for the obvious data people see (I am
sure I read that that search guru has winamp as favorite, why can't I
find his username when I search for winamp) I think we should be
careful with extending the search into advanced search.


Can we not first think of a general solution, one that will fix ALL
Drupal's search problems and then dive into advanced searching, that
will make it even more complex?


Steven: I still think you did a marvelous job!




------------------------------------------------------------------------

Fri, 05 Aug 2005 13:25:45 +0000 : Steven

"* When I do an advanced search, the results are printed in the box:
very good! But my advanced form is empty. That is not good usability,
imo. That form should represent what I was searching for.

"
I debated over how to implement this, and I went for the current method
of not showing anything in the advanced form.


The first reason is that parsing a query back into the advanced options
is not generally possible. For example, you could have more than one
phrase in your query, yet the query builder only accommodates one for
simplicity. It is a tool, not a complete equivalent to the keyword
syntax.


Secondly, there is a conflict between what is in the keywords box and
what is in the advanced form. For example, I might build an advanced
query, and then start modifying the 'baked in' keyword version. At that
point we need to decide which takes precedence, and arguments can be
found for both sides. At first I tried to reconcile this by only
respecting the advanced controls if you pressed "advanced search", but
then there is ambiguity of which button to default to if the user
submits the form by pressing enter.


I also thought of making it so that the query builder always adds to
the keywords (ignoring duplicates), but that means you can't loosen a
query except by manually removing parts of the keywords box /and/
manually unchecking the options in the advanced form.


After trying out the various options, I found that the usability of the
current method is the clearest and results in the fewest
inconsistencies perceived as 'buggy behaviour'.


"What /is/ preprocessing? Why put preprocess in a single hook, but the
others in one huge nodeapi-like construction?

"
Preprocessing is what it says: a transformation applied to text before
it is inserted in and matched against the search index. You can use it
to add in stemming algorithms, soundex as well as word-splitting for
non-spaced languages like Chinese and Japanese.


As for why to put it in a separate hook: for performance reasons.


"I am not happy with the 5 character default limit. It took me quite
some time to find out why my words were not indexed. But that should
not hold back this patch. More something to have a closer look at in the
future. I am thinking of intelligent auto-blacklisting or so.

"
About the limit: it only has any effect if none of the words in your
search query is 5 letters or longer. As soon as one word in the query
is 5 letters, all results can be found. Judging from drupal.org
searches, this was a correct trade-off to make. If you introduce a
stemming algorithm through the preprocessor, you can probably afford to
set the size limit smaller. But until then, imo 5 is a good default.


About intelligent blacklisting: this is not really possible. The goal
of blacklisting is to keep the index size down. But to find out which
words to blacklist, you need to include them in some sort of index.
Chicken and egg.


"Can we not first think of a general solution, one that will fix ALL
Drupal's search problems and then dive into advanced searching, that
will make it even more complex?

"
I await your proposals. But until people start getting involved in the
architecture rather than just bombarding me with feature-requests and
saying that 'search sucks', I can only do things my way.


In fact, the biggest complexity hurdle right now is simply that search
is optional. This means that all search operations should go through
search.module. Modules with 100% custom searches on the other hand can
simply put a tab on "search/module" and do everything their way. But
that means that that tab will be there even if search.module is
disabled and that these modules cannot take advantage of the various
utility functions provided by search.module.


For example, searching profile fields for user search is perfectly
possible even without this patch, though it would require that the user
search be moved into profile.module. A nodeapi-like system where
profile.module can add to user.module's search would only result in
huge joins and queries being added to other queries en-masse. The code
would be even more complex than it is now.


As far as the hook_search ops, I'm sorry but I think they are quite
clear. I specifically added comments to node_search to explain each op
when this would not be done normally (e.g. everyone is assumed to know
that nodeapi('validate') is used to validate extra node fields):


'name': return the name of this search
'search': perform a search
(new) 'form': add form items to the standard 'enter your keywords'
search
(new) 'post': process a search form and inject the parameters into the
keyword string (needed because of the clean URLs)


And only for index-using modules:
'reset': reset the indexing progress
'status': return status of the indexing progress (% completed)




------------------------------------------------------------------------

Fri, 05 Aug 2005 15:19:57 +0000 : Bèr Kessels

Thanks. So let's get the focus back on the patch then, and move my
proposal for a new search architecture to a new thread:
http://drupal.org/node/28275


I agree about the complexity of re-filling the search box. But still,
it somehow feels odd that it's not filled anymore. It's a usability
no-no to do that, though your rationale for not filling it makes more
sense. Maybe we just need to rethink the interface in general then?
Does anyone see any other options?
Putting the advanced search in an "advanced" tab could be an option,
but that might be a no-no because we hide it away then.






