[support] How to parse files using Drupal 7, Search API and Tika?

Liviu Nicolicioiu liviu.nicolicioiu at epoint.ro
Thu Dec 8 19:06:06 UTC 2011


What do you think about running the Unix pdf2txt tool on the files and storing the extracted text in a CCK field?
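
To make the idea a bit more concrete, here is a rough sketch. It assumes that pdftotext from the poppler-utils package is the tool available on the box (pdf2txt may well refer to a different program), and the directory in the example call is made up:

```shell
#!/bin/sh
# Sketch: extract plain text from every PDF in a directory.
# Assumption: pdftotext (poppler-utils) is installed; all paths are hypothetical.

# Map "dir/report.pdf" to "dir/report.txt".
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert each *.pdf in the given directory, writing the .txt next to it.
extract_all() {
    dir=$1
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue    # no PDFs: the glob stayed literal
        pdftotext -layout "$pdf" "$(txt_name "$pdf")" || echo "failed: $pdf" >&2
    done
}

# Example call (hypothetical path):
# extract_all /var/www/drupal/sites/default/files
```

Getting the resulting text into a field on the node would still need a custom module or an import step; that part is not shown here.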

On Thu, Dec 8, 2011 at 6:15 PM, Florian Auer <lists at floeschie.org> wrote:
> On Wednesday, 7 December 2011, 20:19:58, Florian Auer wrote:
>> Is there someone who successfully got Drupal 7, Search API and Tika working together?
>
> I finally got it working. The documentation is somewhat incomplete, so here's what I did to get Drupal 7, Search API and Tika running on Debian Squeeze:
>
>
> == 1. Download Tika source archive ==
> Go to [1], copy the URL of the archive from your favourite mirror, and download it with wget:
>
> $ wget [URL]
>
>
> == 2. Extract Tika source archive to /opt ==
> $ cd /opt
> # unzip /path/to/apache-tika-X.Y-src.zip
>
>
> == 3. Install maven2 package ==
> # apt-get install maven2
>
>
> == 4. Compile Tika using Maven ==
> $ cd tika-X.Y
> # MAVEN_OPTS=-Xmx256m mvn clean install
>
> (This might take a while…)
>
>
> == 5. Download and enable the required Drupal modules ==
> drush dl search_api search_api_attachments search_api_db
> drush en search_api search_api_attachments search_api_db
>
>
> == 6. Configure Drupal to use Tika ==
> - Log in to the Drupal admin backend
> - Open the Search API settings
> - Create a new server (Database)
> - Create a new index or use an existing one
> - In your index settings, switch to the "Workflow" tab
> - In the "Data alterations" area, enable "File attachments"
> - Go to the "Fields" tab
> - Enable the "File content" field for indexing
>
>
> == 7. Edit Search API attachment module ==
> Note: This is only needed if you use version 7.x-1.0; it should already be fixed in newer versions (see patch 3048482a89a1a587feab78f2d5ea92c4b5642898 on [2])
>
> - Go to the module's directory (if you used drush, this should be DRUPAL_HOME/sites/all/modules/search_api_attachments)
> - Open file include/callback_attachments_settings.inc in your favourite editor
> - Replace all occurrences of "entity_type" with "item_type" (see issue on [3])
>
>
> == 8. Verify Tika is working and called by Drupal ==
> - Open file include/callback_attachments_settings.inc again
> - Add the following PHP code at the end of the file, right before the last return statement (around line 141)
>
>  syslog(LOG_INFO, 'Calling Tika: ' . $cmd);
>
> - Save and close the file
> - Tail your syslog (# tail -f /var/log/syslog)
> - Go to the Search API settings in the Drupal backend
> - Re-index your site
> - You should see messages showing the Tika command and the file being indexed
>
>
> This is rather quick'n'dirty documentation, but I don't have time for more, and the git repo for Search API attachments isn't working properly, so I cannot create patches right now. If you have any questions, let me know!
>
> --
> Regards,
>
> Florian
>
> [1] http://www.apache.org/dyn/closer.cgi/tika/apache-tika-1.0-src.zip
> [2] http://drupalcode.org/project/search_api_attachments.git/patch/3048482a89a1a587feab78f2d5ea92c4b5642898
> [3] http://drupal.org/node/1253824
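
Regarding step 8 in the quoted message: before editing the module, Tika can probably also be checked on its own from the shell. A minimal sketch, assuming the Maven build leaves the runnable jar under tika-app/target/ and that the sources were unpacked to /opt/tika-X.Y as in the steps above (X.Y is the same version placeholder, TIKA_JAR and tika_text are names I made up):

```shell
#!/bin/sh
# Sketch: run the freshly built Tika on a single file, outside of Drupal.
# Assumption: the jar location below matches your build; adjust X.Y and the path.
TIKA_JAR=${TIKA_JAR:-/opt/tika-X.Y/tika-app/target/tika-app-X.Y.jar}

# Print the plain-text content Tika extracts from the given file.
# Returns non-zero if java or the jar cannot be found.
tika_text() {
    command -v java >/dev/null 2>&1 || { echo "java not found" >&2; return 1; }
    [ -f "$TIKA_JAR" ] || { echo "jar not found: $TIKA_JAR" >&2; return 1; }
    java -jar "$TIKA_JAR" --text "$1"
}

# Example call (hypothetical file):
# tika_text /path/to/some-document.pdf | head
```

If this prints readable text but nothing shows up in syslog during re-indexing, the problem is more likely in the module configuration than in Tika itself.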



-- 
regards, mit freundlichen Grüßen, cu stima,
Liviu Nicolicioiu

______________________________

epoint - consulting + development

Vacarescu 7 300182 Timisoara Romania

email: liviu.nicolicioiu at epoint.ro
skype: nicolicioiu.liviu
mobile: +40 / 729/ 063 679
fax: +40 / 256 / 407 147
www.epoint.ro

"reliable solutions. delivered."
______________________________

