[support] How to parse files using Drupal 7, Search API and Tika?

Florian Auer lists at floeschie.org
Thu Dec 8 16:15:16 UTC 2011


On Wednesday, 7 December 2011, 20:19:58, Florian Auer wrote:
> Is there someone who successfully got Drupal 7, Search API and Tika working together?

I finally got it working. The documentation is somewhat incomplete, so here's what I did to get Drupal 7, Search API and Tika running on Debian Squeeze:


== 1. Download Tika source archive ==
Go to [1], copy the URL of the archive on your preferred mirror, and download it with wget:

$ wget [URL]


== 2. Extract Tika source archive to /opt ==
$ cd /opt
# unzip /path/to/apache-tika-X.Y-src.zip


== 3. Install maven2 package ==
# apt-get install maven2


== 4. Compile Tika using Maven ==
$ cd tika-X.Y
# MAVEN_OPTS=-Xmx256m mvn clean install

(This might take a while…)
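
To check that the build produced a usable jar before touching Drupal, something like the following could be used. It is only a sketch: the helper names tika_jar and tika_smoke_test are made up for illustration, and the jar location tika-app/target/tika-app-X.Y.jar is an assumption about where the Maven build puts the standalone application, so adjust it if your build differs.

```shell
# Hypothetical helper: print the path where the standalone Tika jar is
# expected after the build (assumed layout: tika-app/target/tika-app-X.Y.jar).
#   $1 = Tika source directory, $2 = Tika version
tika_jar() {
    printf '%s/tika-app/target/tika-app-%s.jar\n' "$1" "$2"
}

# Hypothetical helper: extract plain text from one document using the jar.
#   $1 = Tika source directory, $2 = Tika version, $3 = document to parse
tika_smoke_test() {
    smoke_jar=$(tika_jar "$1" "$2")
    if [ ! -f "$smoke_jar" ]; then
        echo "jar not found: $smoke_jar" >&2
        return 1
    fi
    # --text asks the Tika application to print the plain text content
    java -jar "$smoke_jar" --text "$3"
}
```

Usage, with X.Y replaced by your Tika version: tika_smoke_test /opt/tika-X.Y X.Y /path/to/some.pdf. If text from the document is printed, Tika itself works and any remaining problem is on the Drupal side.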


== 5. Download and enable the required Drupal modules ==
$ drush dl search_api search_api_attachments search_api_db
$ drush en search_api search_api_attachments search_api_db


== 6. Configure Drupal to use Tika ==
- Log in to the Drupal admin backend
- Open the Search API settings
- Create a new server (Database)
- Create a new index or use an existing one
- In your index settings, switch to the "Workflow" tab
- In the "Data alterations" area, enable "File attachments"
- Go to the "Fields" tab
- Enable the "File content" field for indexing


== 7. Edit Search API attachments module ==
Note: This is only needed if you use version 7.x-1.0; it should already be fixed in newer versions (see patch 3048482a89a1a587feab78f2d5ea92c4b5642898 at [2]).

- Go to the module's directory (if you used drush, this should be DRUPAL_HOME/sites/all/modules/search_api_attachments)
- Open the file include/callback_attachments_settings.inc in your favourite editor
- Replace all occurrences of "entity_type" with "item_type" (see the issue at [3])
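
If you prefer not to do the replacement by hand, it can probably be done with sed. This is a sketch, not something taken from the module's documentation: rename_entity_type is a hypothetical helper name, and it blindly rewrites every occurrence in the file, so check the result afterwards.

```shell
# Hypothetical helper: replace every "entity_type" with "item_type" in the
# given file. The untouched original is kept next to it as FILE.bak.
#   $1 = file to edit
rename_entity_type() {
    sed -i.bak 's/entity_type/item_type/g' "$1"
}
```

Usage, from the module's directory: rename_entity_type include/callback_attachments_settings.inc. To undo the change, move the .bak file back over the edited one.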


== 8. Verify Tika is working and called by Drupal ==
- Open the file include/callback_attachments_settings.inc again
- Add the following PHP code near the end of the file, right before the last return statement (around line 141):

  syslog(LOG_INFO, 'Calling Tika: ' . $cmd);

- Save and close the file
- Tail your syslog (# tail -f /var/log/syslog)
- Go to the Search API settings in the Drupal backend
- Re-index your site
- You should see messages showing the Tika command and the file being indexed
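
On a busy machine the syslog scrolls quickly, so it may help to filter for the message added above. A small sketch: filter_tika_calls is a hypothetical helper name, and the match string assumes the syslog() line was added exactly as shown in this step.

```shell
# Hypothetical helper: keep only the lines written by the syslog() call
# from this step; reads log lines on stdin, prints matches on stdout.
filter_tika_calls() {
    grep -F 'Calling Tika: '
}
```

Usage, as a user who may read the syslog: tail -f /var/log/syslog | filter_tika_calls. If nothing appears while re-indexing, Drupal is most likely not reaching the Tika call at all.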


This is rather quick-and-dirty documentation, but I don't have time for more, and the git repo for Search API attachments isn't working properly, so I cannot create patches right now. If you have any questions, let me know!

-- 
Regards,

Florian

[1] http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
[2] http://drupalcode.org/project/search_api_attachments.git/patch/3048482a89a1a587feab78f2d5ea92c4b5642898
[3] http://drupal.org/node/1253824

