First, create the text for the index: extract the text from the files using any mechanism (Tika, a Unix shell command, etc.). A filter that keeps only unique words and drops short ones (around 3 characters) will reduce the amount of text. The second step is to attach the extracted text to the node (in a CCK field or directly in the body) so that it gets indexed by the Drupal search index. After that you can use Solr, Sphinx or some other external index storage. I think a custom solution is better than a lot of installed modules.
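
A rough sketch of the word filter in Python, only to show the idea. It assumes the text was already extracted from the file by whatever tool you choose. The function name and the cutoff of 4 characters are my own placeholders, not something from Drupal or Tika, so adjust them to your needs.

```python
import re


def reduce_text_for_index(text, min_length=4):
    """Keep each word once, in first-seen order, dropping short words.

    min_length is an assumed cutoff: words with fewer characters are removed.
    """
    seen = set()
    kept = []
    # Lowercase first so "Drupal" and "drupal" count as the same word.
    for word in re.findall(r"\w+", text.lower()):
        if len(word) >= min_length and word not in seen:
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


if __name__ == "__main__":
    sample = "The quick fox and the quick dog jumped"
    print(reduce_text_for_index(sample))
```

Keep in mind that storing each word only once throws away how often a word appears in the file, so ranking by relevance may get less precise. That is the price for the smaller field.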
Liviu.
On Fri, Dec 9, 2011 at 10:28 AM, Florian Auer lists@floeschie.org wrote:
Hi Liviu!
On Thursday, 8 December 2011, 20:06:06, Liviu Nicolicioiu wrote:
What do you think about using the Unix pdf2txt tool and putting the output in a CCK field?
I need to parse Microsoft Office and other file formats, too. Furthermore, we want to use the Solr search engine for performance reasons. The Search API integrates both Tika and Solr very well (if you know how to do it ;]).
-- Cheers!
Florian
[ Drupal support list | http://lists.drupal.org/ ]