[development] Mass import of HTML?

Sat Jan 21 03:35:27 UTC 2006

Worth asking here, as I've found no good solutions yet, and I'm sure
others have faced similar issues and solved them in various ways.

Despite CS shipping with node_import, and a few people posting on the
Drupal forums about scripts they've coded up at various times to
import old HTML sites, there doesn't seem to be a good, mature way to
import an existing HTML-based site, other than the old cut-and-paste
approach.

Ideally, given either a set of HTML files or a set of URLs, a script
should be able to import each page's HTML and make a node.  Stripping
out things like headers and footers would be good.
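
To make the idea concrete, here is a minimal sketch of the extraction
half only, in Python using just the standard library. It pulls the
<title> text and the <body> markup out of a page and drops any element
whose id is "header" or "footer". Those two ids, and the names
PageExtractor and extract_page, are assumptions for illustration; a
real site will mark its headers and footers differently, and the
depth counting assumes reasonably well-formed markup. It does not
create Drupal nodes.

```python
import html
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class PageExtractor(HTMLParser):
    """Collect <title> text and <body> markup, skipping elements by id.

    Hypothetical helper: the default skip ids ("header", "footer") are an
    assumption about how the old site marks its page furniture.
    """

    def __init__(self, skip_ids=("header", "footer")):
        super().__init__()  # convert_charrefs=True: text arrives unescaped
        self.skip_ids = set(skip_ids)
        self.title = ""
        self.body = []
        self.in_title = False
        self.in_body = False
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, self_closing=False)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, self_closing=True)

    def _open(self, tag, attrs, self_closing):
        if tag == "title":
            self.in_title = True
        elif tag == "body":
            self.in_body = True
        elif self.in_body:
            childless = self_closing or tag in VOID
            if self.skip_depth:
                if not childless:
                    self.skip_depth += 1
            elif dict(attrs).get("id") in self.skip_ids:
                if not childless:
                    self.skip_depth = 1
            else:
                # Keep the tag exactly as it was written in the source.
                self.body.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
        elif tag == "body":
            self.in_body = False
        elif self.in_body and tag not in VOID:
            if self.skip_depth:
                self.skip_depth -= 1
            else:
                self.body.append(f"</{tag}>")

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif self.in_body and not self.skip_depth:
            # Text was unescaped by the parser, so escape it again on output.
            self.body.append(html.escape(data, quote=False))


def extract_page(markup, skip_ids=("header", "footer")):
    """Return (title, body_html) for one page of HTML."""
    parser = PageExtractor(skip_ids)
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return parser.title.strip(), "".join(parser.body).strip()
```

Calling extract_page() on the text of each file (or each fetched URL)
gives a title and a cleaned body per page, which is roughly what a
node needs. Comments, inline scripts and unclosed tags are not handled
carefully here, so treat it as a starting point.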

Problems:

node_import: requires compiling the pages into a CSV.  How do you
convert existing pages into a CSV?  Script?

import_export (despite the 4.5 label on Drupal, it was made 4.6
friendly): same problem... also does XML...

import_html: http://coders.co.nz/drupal_development/?q=taxonomy/term/4
Specific to the job, but Dan hasn't released the code fully yet... in
an email, he mentioned it was missing bits.

wgHTML: an ugly hack that wraps HTML Drupalishly, but isn't a real import

Other scripts: haven't found a good example yet.  Got one?
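
On the node_import "Script?" question in the list above, one possible
shape for the CSV step is sketched below in Python. It takes
(title, body) pairs and writes one row per page. The column names
"title" and "body" and the function names are assumptions for
illustration, not something node_import is known to require; the
columns would still have to be matched to node fields at import time.

```python
import csv
import io


def pages_to_csv(pages, out):
    """Write (title, body) pairs to the open text file `out`, one row per page.

    The header names below are hypothetical placeholders.
    """
    # Quote every field so commas, quotes and newlines in the HTML survive.
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(["title", "body"])
    for title, body in pages:
        writer.writerow([title, body])


def pages_to_csv_text(pages):
    """Convenience wrapper that returns the CSV as a single string."""
    buf = io.StringIO()
    pages_to_csv(pages, buf)
    return buf.getvalue()
```

When writing to a real file, open it with newline="" as the csv
module documentation recommends, so line endings inside the body
field are left alone.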

Thanks for any pointers... and I'll report back and post a nice howto
for use in the handbook.

obDevel: I just noticed a bug I filed a year ago finally got closed...
br tags are no longer standard on labels.  http://drupal.org/node/15609
Yeah!  Only took a whole year to get closed.