[development] Duplicated modules

Ivan Sergio Borgonovo mail at webthatworks.it
Thu Mar 12 21:07:59 UTC 2009


On Thu, 12 Mar 2009 20:00:50 +0100 (CET)
"Karoly Negyesi" <karoly at negyesi.net> wrote:

> > Maybe we could let people (other than the maintainers) to
> > "categorise" the project.

> Mail, mails, mailing, notify, notifies, notification, subscribe,
> subscriptions, subscribes and I am sure we will have a few more
> variations... 

test_drupal=# select * from to_tsvector('pg_catalog.english', 'Mail,
mails, mailing, notify, notifies, notification, subscribe,
subscriptions, subscribes');
                           to_tsvector
------------------------------------------------------------------
 'mail':1,2,3 'notif':6 'notifi':4,5 'subscrib':7,9 'subscript':8

This could be one solution: stemming alone already folds most of these
variants together. This dictionary is not configured to use synonyms,
but that is just a matter of configuration in PostgreSQL.
There should be ready-made solutions for MySQL as well.
After all, if you're making taxonomies searchable with queries like

term AND term
(term OR term) AND (term OR term)

you're going to have to solve this problem anyway.
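
A rough sketch of what that configuration might look like, assuming
PostgreSQL 8.3 or later, which ships a "synonym" dictionary template.
The names tag_synonyms and tags and the contents of the synonym file are
made up for illustration. A synonym dictionary returns its replacement
word as-is, without stemming it, so the file maps directly to the stems
shown in the output above.

```sql
-- Hypothetical file $SHAREDIR/tsearch_data/tag_synonyms.syn, one pair per line:
--   notification   notifi
--   subscription   subscrib
--   subscriptions  subscrib

-- Dictionary built on the stock "synonym" template; reads tag_synonyms.syn.
CREATE TEXT SEARCH DICTIONARY tag_synonyms (
    TEMPLATE = synonym,
    SYNONYMS = tag_synonyms
);

-- Work on a copy so the built-in english configuration stays untouched.
CREATE TEXT SEARCH CONFIGURATION tags (COPY = pg_catalog.english);

-- Try the synonym dictionary first, then fall back to the English stemmer.
ALTER TEXT SEARCH CONFIGURATION tags
    ALTER MAPPING FOR asciiword
    WITH tag_synonyms, english_stem;

-- (term OR term) AND (term OR term) against a project's tags.
SELECT to_tsvector('tags', 'mailing, notification')
       @@ to_tsquery('tags', '(mail | subscribe) & (notify | subscriptions)');
```

With the synonym file above, "notification" and "notify" should end up
as the same lexeme, so the second half of the query is expected to match
where the plain english configuration would not.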

Another solution could be to restrict terms to a selection.
Collect terms from the community for let's say one week, filter
them, post-edit the result and use it.

Auto-completion may help avoid duplicates.

Only the most "voted" terms might be shown.
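
Auto-completion limited to well-voted terms could be sketched as a
single query. The table project_term_votes and its columns are
hypothetical, and the threshold of 3 votes is arbitrary.

```sql
-- Hypothetical table: one row per (user, project, term) vote.
CREATE TABLE project_term_votes (
    uid  integer NOT NULL,
    nid  integer NOT NULL,
    term varchar(64) NOT NULL,
    PRIMARY KEY (uid, nid, term)
);

-- Suggestions for the prefix "mai": only terms with at least 3 votes,
-- most voted first.
SELECT term, count(*) AS votes
FROM project_term_votes
WHERE term LIKE 'mai%'
GROUP BY term
HAVING count(*) >= 3
ORDER BY votes DESC, term
LIMIT 10;
```

Nothing here is PostgreSQL-specific, so the same query should work on
MySQL as well.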

If you expect high load, the project is a success; even then, the load
of collecting terms should be negligible compared to everything else
drupal.org has to handle.
If you don't expect high load, there is no need to worry about the
performance of a dictionary.

To add tags, people will have to log in, so the load of filtering
synonyms should be tolerable.

If spammers use similar techniques they should have a good ROI.

Now we have ~40 categories. Reach 400 and the categorisation system
will work much better. BIC [1] should have ~6000 leaves.

I could do some research on the tools available for MySQL.
drupal.org recently moved to Solr, and Solr has support for synonyms.

[1] http://www.bic.org.uk/

-- 
Ivan Sergio Borgonovo
http://www.webthatworks.it


