Hey guys,
I'm trying to make documents searchable using Search API [1], Search API Attachments [2], Search API DB [3] and a local Tika [4] installation. I want to save the index data in the database for now; a Solr server is ready for later integration.
I downloaded the Tika 1.0 runnable JAR file and successfully parsed a PDF and a DOC file from the command line using "java -jar /path/to/tika-1.0.jar --text" as user www-data.
I can index regular nodes and File nodes, but it doesn't parse the file's contents when I execute cron or trigger indexing manually. To me it seems as if Tika is never executed at all...
The author of the Search API Attachments module says the module is "based on Apache Solr attachments", but that module is not marked as required in the info file. So I'm assuming that by "based on" he means "I borrowed some code"...
Is there someone who successfully got Drupal 7, Search API and Tika working together?
Any hints appreciated!
On Wednesday, 7 December 2011, 20:19:58, Florian Auer wrote:
I can index regular nodes and File nodes, but it doesn't parse the file's contents when I execute cron or trigger indexing manually. To me it seems as if Tika is never executed at all...
Forgot to mention that my settings for the Search API Attachments module reflect Tika's installation paths:
Tika directory path: /opt/tika/
Tika jar file: tika-app-1.0.jar
www-data@localhost$ pwd
/opt/tika
www-data@localhost$ ls -l
total 23524
-rw-r--r-- 1 root root 24056779 Nov 8 07:39 tika-app-1.0.jar
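A quick way to rule out permission problems is to repeat the check as the web server user. The following is only a sketch: the helper name check_jar_readable is made up for illustration, and the use of sudo and the sample file path are assumptions about the local setup. The PATH seen by PHP may also differ from the one in an interactive shell, so it can help to confirm that java is found at all.

```shell
# Hypothetical helper: return 0 if the given jar exists and is readable
# by the user running this shell.
check_jar_readable() {
  [ -f "$1" ] && [ -r "$1" ]
}

# Example usage (assumes sudo is available and the paths from the settings above):
#   sudo -u www-data sh -c 'command -v java'
#   sudo -u www-data java -jar /opt/tika/tika-app-1.0.jar --text /path/to/sample.pdf
```

If the helper fails for the configured jar, or java is not found for www-data, Tika cannot be started from Drupal no matter how the module is configured.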
I got this working by following the README, I believe. Do you have apache-tika-0.9-src.zip and tika-app-0.9.jar in sites/all/libraries/tika? If you have full control over your (assumed Linux) server, you can always run an strace on apache and grep for tika. That and xdebug with breakpoints in the appropriate lines in search_api_attachments/includes/callback_attachments_settings.inc (i.e. the shell_exec invocation) should get you there.
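The strace suggestion could look roughly like the sketch below. The filter function name is hypothetical, and the process name apache2 is an assumption for a Debian-style Apache install; adjust it to whatever runs PHP on the server.

```shell
# Hypothetical filter: keep only execve lines that mention Tika
# from strace output arriving on stdin.
filter_tika_execs() {
  grep 'execve(' | grep -i 'tika'
}

# Example usage (as root; strace writes to stderr, hence 2>&1):
#   strace -f -e trace=execve -p "$(pgrep -o apache2)" 2>&1 | filter_tika_execs
```

If nothing shows up while re-indexing, the shell_exec call is probably never reached. Note that indexing triggered by cron from the command line runs outside the Apache process and would not appear in this trace.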
Ted
On Wednesday, 7 December 2011, 20:19:58, Florian Auer wrote:
Is there someone who successfully got Drupal 7, Search API and Tika working together?
Finally I got it working. The documentation is somewhat incomplete, so here's what I did to get Drupal 7, Search API and Tika running on Debian Squeeze:
== 1. Download Tika source archive ==
Go to [1], copy the link URL of the archive on your favourite mirror and download it using wget:
$ wget [URL]
== 2. Extract Tika source archive to /opt ==
$ cd /opt
# unzip /path/to/apache-tika-X.Y-src.zip
== 3. Install maven2 package ==
# apt-get install maven2
== 4. Compile Tika using Maven ==
$ cd tika-X.Y
# MAVEN_OPTS=-Xmx256m mvn clean install
(This might take a while…)
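Once the build has finished, the runnable jar has to be located so its directory and file name can be entered in the module settings. A minimal sketch, assuming the build leaves a tika-app-X.Y.jar somewhere below the source tree (typically in the tika-app module's target directory, which is an assumption about the Tika source layout and not something stated above); the function name is made up:

```shell
# Hypothetical helper: print the first tika-app jar found below the
# given directory, skipping any sources jar.
find_tika_app_jar() {
  find "$1" -type f -name 'tika-app-*.jar' ! -name '*-sources.jar' | head -n 1
}

# Example usage:
#   jar=$(find_tika_app_jar /opt/tika-X.Y)
#   java -jar "$jar" --text /path/to/sample.pdf
```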
== 5. Download and enable the required Drupal modules ==
drush dl search_api search_api_attachments search_api_db
drush en search_api search_api_attachments search_api_db
== 6. Configure Drupal to use Tika ==
- Log in to the Drupal admin backend
- Open the Search API settings
- Create a new server (Database)
- Create a new index or use an existing one
- In your index settings, switch to the "Workflow" tab
- In the "Data alterations" area, enable "File attachments"
- Go to the "Fields" tab
- Enable the "File content" field for indexing
== 7. Edit Search API attachment module ==
Note: This is only needed if you use version 7.x-1.0; it should already be fixed in newer versions (see patch 3048482a89a1a587feab78f2d5ea92c4b5642898 on [2])
- Go to the module's directory (if you used drush, this should be DRUPAL_HOME/sites/all/modules/search_api_attachments)
- Open file include/callback_attachments_settings.inc in your favourite editor
- Replace all occurrences of "entity_type" with "item_type" (see issue on [3])
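The replacement in step 7 can also be done with sed instead of an editor. This is just a sketch: the function name is made up, and it blindly replaces every occurrence in the file passed to it, so it is worth diffing against the backup afterwards.

```shell
# Hypothetical helper: replace every "entity_type" with "item_type" in the
# given file, keeping the original next to it as FILE.bak.
replace_entity_type() {
  sed -i.bak 's/entity_type/item_type/g' "$1"
}

# Example usage (run inside the module's directory):
#   replace_entity_type include/callback_attachments_settings.inc
#   diff include/callback_attachments_settings.inc.bak include/callback_attachments_settings.inc
```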
== 8. Verify Tika is working and called by Drupal ==
- Open file include/callback_attachments_settings.inc again
- Add the following PHP code at the end of the file, right before the last return statement (line 141-ish)
syslog(LOG_INFO, 'Calling Tika: ' . $cmd);
- Save and close the file
- Tail your syslog (# tail -f /var/log/syslog)
- Go to the Search API settings in the Drupal backend
- Re-index your site
- You should see messages showing the Tika command and the file being indexed
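On a busy machine the syslog is noisy, so it can help to show only the commands logged by the syslog() line from step 8. A small sketch with a made-up function name; it relies only on the 'Calling Tika: ' prefix used above:

```shell
# Hypothetical filter: print only the Tika command from log lines
# that contain the "Calling Tika: " prefix.
tika_calls() {
  sed -n 's/.*Calling Tika: //p'
}

# Example usage (as root):
#   tail -f /var/log/syslog | tika_calls
```

Copying one of the printed commands and running it by hand as www-data is a simple way to see the error message if Tika fails.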
This is rather quick'n'dirty documentation, but I don't have time for more, and the git repo for Search API Attachments isn't working properly, so I cannot create patches right now. If you have any questions, let me know!
What do you think about using the Unix pdf2txt tool and putting the result in a CCK field?
-- Regards,
Florian
[1] http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
[2] http://drupalcode.org/project/search_api_attachments.git/patch/3048482a89a1a...
[3] http://drupal.org/node/1253824
--
[ Drupal support list | http://lists.drupal.org/ ]
Hi Liviu!
On Thursday, 8 December 2011, 20:06:06, Liviu Nicolicioiu wrote:
What do you think if you use pdf2txt unix and put in a cck field?
I need to parse Microsoft Office and other file formats, too. Furthermore, we want to use a Solr search engine for performance reasons. The Search API integrates both Tika and Solr very well (if you know how to do it ;]).
First, you create the text for the index: extract the text from the files using any mechanism (Tika, a Unix shell command, etc.). Filtering for unique words and small words (3 characters) will reduce the text content. The second step is to attach the extracted text to the node so that it is indexed by the Drupal search index (in a CCK field or directly in the body). After that you can use Solr, Sphinx or another external index storage. I think a custom solution is better than a lot of installed modules.
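The word filter described above could be sketched as a small shell pipeline. The function name and the minimum-length parameter are assumptions for illustration; the post does not say whether 3-character words themselves are kept, so the threshold is left configurable.

```shell
# Hypothetical filter: read text on stdin and print each distinct word once,
# lowercased, keeping only words with at least MIN characters (default 3).
unique_words() {
  tr -cs '[:alnum:]' '\n' \
    | tr '[:upper:]' '[:lower:]' \
    | awk -v min="${1:-3}" 'length($0) >= min && !seen[$0]++'
}

# Example usage (any extractor that writes plain text to stdout will do):
#   java -jar /opt/tika/tika-app-1.0.jar --text /path/to/file.doc | unique_words 4
```

Dropping duplicates and short words keeps the stored text small, but it also discards word order, so phrase searches on the resulting field would no longer work.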
Liviu.