How to parse files using Drupal 7, Search API and Tika?

Florian Auer

7 Dec 2011

7:19 p.m.

Hey guys,

I'm trying to make documents searchable using Search API [1], Search API Attachments [2], Search API DB [3] and a local Tika [4] installation. I want to save the index data in the database for now; a Solr server is ready for later integration.

I downloaded the Tika 1.0 runnable JAR file and successfully parsed a PDF and a DOC file from the command line using "java -jar /path/to/tika-1.0.jar --text" as user www-data.

I can index regular nodes and File nodes, but it doesn't parse the file's contents when I execute cron or trigger indexing manually. To me it seems as if Tika is never executed at all...

The author of the Search API Attachments module says the module is "based on Apache Solr attachments", but it's not marked as required in the info file. So I'm assuming that by "based on" he means "I borrowed some code"...

Is there someone who successfully got Drupal 7, Search API and Tika working together?

Any hints appreciated!

-- Cheers, Florian

[1] http://drupal.org/project/search_api
[2] http://drupal.org/project/search_api_attachments
[3] http://drupal.org/project/search_api_db
[4] http://tika.apache.org/download.html

System: Debian Squeeze 64 bit

Florian Auer

7 Dec 2011

7:25 p.m.

On Wednesday, 7 December 2011, at 20:19:58, Florian Auer wrote:

I can index regular nodes and File nodes, but it doesn't parse the file's contents when I execute cron or trigger indexing manually. To me it seems as if Tika is never executed at all...

Forgot to mention that my settings for the Search API attachments modules reflect Tika's installation paths:

Tika directory path: /opt/tika/
Tika jar file: tika-app-1.0.jar

www-data@localhost$ pwd
/opt/tika
www-data@localhost$ ls -l
total 23524
-rw-r--r-- 1 root root 24056779 Nov 8 07:39 tika-app-1.0.jar
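One way to rule out path or permission problems is to confirm that the directory and jar name from the module settings join to a file the web server user can read. A minimal sketch, assuming the paths shown above; the function name tika_jar_path is made up for illustration:

```shell
# Hypothetical helper: join the "Tika directory path" and "Tika jar file"
# settings and confirm that the result is a readable regular file.
tika_jar_path() {
    dir=${1%/}            # drop one trailing slash: /opt/tika/ -> /opt/tika
    jar="$dir/$2"
    if [ -f "$jar" ] && [ -r "$jar" ]; then
        printf '%s\n' "$jar"
    else
        printf 'not readable: %s\n' "$jar" >&2
        return 1
    fi
}

# Example, run as the web server user (the PDF path is a placeholder):
#   jar=$(tika_jar_path /opt/tika/ tika-app-1.0.jar) &&
#       java -jar "$jar" --text /path/to/file.pdf
```

If the helper prints the path but the java call fails, the problem is more likely in the Java setup than in the configured paths.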

-- Regards, Florian

Ted

7:31 p.m.

I got this working by following the README, I believe. Do you have apache-tika-0.9-src.zip and tika-app-0.9.jar in sites/all/libraries/tika? If you have full control over your (assumed Linux) server, you can always run an strace on apache and grep for tika. That and xdebug with breakpoints in the appropriate lines in search_api_attachments/includes/callback_attachments_settings.inc (i.e. the shell_exec invocation) should get you there.

Ted
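The strace suggestion above could look roughly like the following sketch. It assumes the Apache processes are named apache2 (the Debian default) and that it is run as root; the helper names only_tika and trace_apache_for_tika are hypothetical.

```shell
# Hypothetical filter: keep only execve lines that mention Tika.
only_tika() {
    grep -i --line-buffered 'execve.*tika'
}

# Attach to every running Apache process, follow child processes (-f),
# trace only program executions, and print longer argument strings (-s 256).
# strace writes its trace to stderr, hence the 2>&1 before the filter.
# Trigger indexing in Drupal while this runs; stop with Ctrl-C.
trace_apache_for_tika() {
    # Unquoted on purpose: expands to "-p PID -p PID ...".
    strace -f -s 256 -e trace=execve $(pgrep apache2 | sed 's/^/-p /') 2>&1 | only_tika
}
```

If nothing shows up while a file is being indexed, the java command is probably never started by Apache. Indexing started from a shell (for example via drush) would not appear here, because it does not run inside the Apache processes.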

Florian Auer

8 Dec 2011

4:15 p.m.

On Wednesday, 7 December 2011, at 20:19:58, Florian Auer wrote:

Is there someone who successfully got Drupal 7, Search API and Tika working together?

Finally I got it working. The documentation is somewhat incomplete, so here's what I did to get Drupal 7, Search API and Tika running on Debian Squeeze:

== 1. Download Tika source archive ==

Go to [1], copy the link URL to the archive on your favourite mirror and download it using wget:

$ wget [URL]

== 2. Extract Tika source archive to /opt ==

$ cd /opt
# unzip /path/to/apache-tika-X.Y-src.zip

== 3. Install maven2 package ==

# apt-get install maven2

== 4. Compile Tika using Maven ==

$ cd tika-X.Y
# MAVEN_OPTS=-Xmx256m mvn clean install

(This might take a while…)
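After the build finishes, the runnable jar has to be located so that its directory and file name can be entered in the module settings. In Tika's Maven layout it normally ends up under tika-app/target/, but treat that location as an assumption and search for it instead. A minimal sketch; the helper name find_tika_app_jar is hypothetical:

```shell
# Hypothetical helper: print the path of the first tika-app jar found
# below the given directory (sorted, so the result is deterministic).
find_tika_app_jar() {
    find "$1" -type f -name 'tika-app-*.jar' | sort | head -n 1
}

# Example:
#   find_tika_app_jar /opt
```

The directory part of the result goes into "Tika directory path" and the file name into "Tika jar file".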

== 5. Download and enable the required Drupal modules ==

drush dl search_api search_api_attachments search_api_db
drush en search_api search_api_attachments search_api_db

== 6. Configure Drupal to use Tika ==

- Log in to the Drupal admin backend
- Open the Search API settings
- Create a new server (Database)
- Create a new index or use an existing one
- In your index settings, switch to the "Workflow" tab
- In the "Data alterations" area, enable "File attachments"
- Go to the "Fields" tab
- Enable the "File content" field for indexing

== 7. Edit Search API attachment module ==

Note: This is only needed if you use version 7.x-1.0; it should already be fixed in newer versions (see patch 3048482a89a1a587feab78f2d5ea92c4b5642898 on [2]).

- Go to the module's directory (if you used drush, this should be DRUPAL_HOME/sites/all/modules/search_api_attachments)
- Open the file include/callback_attachments_settings.inc in your favourite editor
- Replace all occurrences of "entity_type" with "item_type" (see the issue on [3])

== 8. Verify Tika is working and called by Drupal ==

- Open the file include/callback_attachments_settings.inc again
- Add the following PHP code at the end of the file, right before the last return statement (around line 141):

syslog(LOG_INFO, 'Calling Tika: ' . $cmd);

- Save and close the file
- Tail your syslog (# tail -f /var/log/syslog)
- Go to the Search API settings in the Drupal backend
- Re-index your site
- You should see messages showing the Tika command and the file being indexed
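Once the syslog line from step 8 is in place, the logged command can be pulled out of the log and re-run by hand as www-data to see Tika's own output or errors. A minimal sketch, assuming the 'Calling Tika: ' prefix used in the PHP line above and the Debian syslog location; last_tika_cmd is a hypothetical name:

```shell
# Hypothetical helper: print the most recent command logged with the
# "Calling Tika: " prefix in the given log file.
last_tika_cmd() {
    grep -F 'Calling Tika: ' "$1" | tail -n 1 | sed 's/^.*Calling Tika: //'
}

# Example:
#   last_tika_cmd /var/log/syslog
```

Copying the printed command into a www-data shell shows whether Tika itself fails or whether the module simply discards the output.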

This is rather quick'n'dirty documentation, but I don't have time for more, and the git repo for Search API Attachments isn't working properly, so I cannot create patches right now. If you have any questions, let me know!

-- Regards, Florian

[1] http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
[2] http://drupalcode.org/project/search_api_attachments.git/patch/3048482a89a1a...
[3] http://drupal.org/node/1253824

Florian Auer

4:17 p.m.

New subject: How to parse files using Drupal 7, Search API and Tika? [solved]

On Thursday, 8 December 2011, at 17:15:16, Florian Auer wrote:

Finally I got it working.

Sorry, forgot to mark this thread as "solved".

-- Regards, Florian

Liviu Nicolicioiu

7:06 p.m.

What do you think about using the Unix pdf2txt tool and putting the extracted text into a CCK field?

-- Regards, Liviu Nicolicioiu

Florian Auer

9 Dec 2011

8:28 a.m.

Hi Liviu!

On Thursday, 8 December 2011, at 20:06:06, Liviu Nicolicioiu wrote:

What do you think if you use pdf2txt unix and put in a cck field?

I need to parse Microsoft Office and other file formats, too. Furthermore, we want to use a Solr search engine for performance reasons. The Search API integrates both Tika and Solr very well (if you know how to do it ;]).

-- Cheers! Florian

Liviu Nicolicioiu

9:46 a.m.

First, create the text for the index: extract the text from the files using any mechanism (Tika, a Unix shell command, etc.). Applying a unique-words filter and dropping short words (3 characters) will reduce the text content. Second, attach the extracted text to the node (in a CCK field or directly in the body) so that it is indexed by the Drupal search index. After that you can use Solr, Sphinx or another external index storage. I think a custom solution is better than a lot of installed modules.

Liviu.
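The text-reduction step described above (unique words only, short words dropped) could be sketched as a small shell filter. This is only an illustration under stated assumptions: "small word" is read as three characters or fewer, words are lowercased before de-duplication, only ASCII letters and digits count as word characters, and the function name unique_long_words is made up.

```shell
# Hypothetical filter: read text on stdin, print each distinct word
# longer than three characters once, lowercased and sorted.
unique_long_words() {
    LC_ALL=C tr -cs '[:alnum:]' '\n' |        # one word per line
        LC_ALL=C tr '[:upper:]' '[:lower:]' | # normalise case
        awk 'length($0) > 3' |                # drop words of 3 chars or fewer
        LC_ALL=C sort -u                      # keep each word once
}

# Example (the jar and PDF paths are placeholders):
#   java -jar /opt/tika/tika-app-1.0.jar --text file.pdf | unique_long_words
```

Note that the result loses word order and frequency, so it only suits an index that matches on single keywords.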

[ Drupal support list | http://lists.drupal.org/ ]
