[drupal-devel] [feature] export book as xml for formatting

Sun Jun 5 11:00:30 UTC 2005

Issue status update for http://drupal.org/node/1482

 Project:      Drupal
 Version:      cvs
 Component:    book.module
 Category:     feature requests
 Priority:     normal
 Assigned to:  puregin
 Reported by:  BenEng
 Updated by:   Dries
 Status:       patch

I made some changes (URL scheme) and committed this patch to HEAD. 
Please update your tree before making more changes.

- Can you update the issues affected by this commit?

- I wonder why the release info (md5-sum) isn't stored as XML
attributes so it can be parsed/extracted easily.  Probably
DocBook-specific.

- Firefox did not recognize the XML document as being XML.  I think we
might have to send the proper headers:
 drupal_set_header('Content-Type: text/xml; charset=utf-8');.  Haven't
checked yet.
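
To illustrate the header point above outside of Drupal: a minimal sketch,
written in Python rather than the module's PHP, of a handler that labels
its response as XML.  The function name xml_export_app and the sample
payload are made up for illustration; in book.module the equivalent would
be the drupal_set_header() call quoted above.

```python
# Hypothetical WSGI handler; names and payload are illustrative only.

def xml_export_app(environ, start_response):
    """Return a small XML document with an explicit XML content type."""
    body = '<?xml version="1.0" encoding="utf-8"?>\n<book id="node-1"/>\n'
    payload = body.encode("utf-8")
    headers = [
        # Declare the payload as XML so the client does not have to guess.
        ("Content-Type", "text/xml; charset=utf-8"),
        ("Content-Length", str(len(payload))),
    ]
    start_response("200 OK", headers)
    return [payload]
```

The charset in the header should agree with the encoding named in the XML
declaration, which is why both say utf-8 here.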

Great work Djun.

Dries

Previous comments:
------------------------------------------------------------------------

April 5, 2003 - 23:38 : BenEng

Export an entire collaborative book to xml (docbook) so that it can be
formatted using xslt (e.g., to rtf or pdf) and printed.

It would also be nice to be able to perform the reverse. That is to
import a collaborative book from an xml (docbook) file that is
uploaded.

------------------------------------------------------------------------

June 3, 2004 - 02:10 : moshe weitzman

can someone suggest an xml schema for this? i think we need a general
xml schema for nodes. after that, this becomes a simple matter of
nesting node elements (I think)

------------------------------------------------------------------------

February 1, 2005 - 11:42 : Teto

Hi,

Is there any news about such a feature?
All I've found about a docbook schema is here:
http://docbook.sourceforge.net/projects/schema/
It seems there isn't much in the docbook cvs about that.

Teto.

------------------------------------------------------------------------

May 14, 2005 - 11:11 : puregin

Here's a list of the Book publishing DTDs I know about:

Name:  ISO 12083:1998//DTD Book//EN - this includes ISO 12093:1993//DTD
       Mathematics//EN
Notes: Committee standard - very general.  Used e.g. by University of
       California Press.
Ref:   www.xmlxperts.com/bookdtd.htm [1]

Name:  DocBook
Notes: Applications - widely used by computer book publishers, e.g.
       O'Reilly.  Good support.
Ref:   docbook.org [2]

Name:  TEI/TEI-Lite
Notes: Applications - scholarly/historical/literary documents
Ref:   www.tei-c.org [3]

Name:  MIL-STD-38784 (CALS)
Notes: Applications - Military/Govt/Enterprise publishing
Ref:   http://xml.coverpages.org/mil-std-38784-a1-dtd.txt [4]

I'd highly recommend DocBook as a useable, technically focused, XML DTD
with strong toolset support.

[1] http://www.xmlxperts.com/bookdtd.htm
[2] http://docbook.org
[3] http://www.tei-c.org/
[4] http://xml.coverpages.org/mil-std-38784-a1-dtd.txt

------------------------------------------------------------------------

May 18, 2005 - 10:37 : puregin

I'd suggest we start with something very simple.  

The patch which I submitted for http://drupal.org/node/1898 wraps each
node in <div> tags, with a level, and a node id attribute, for printer
friendly output.

We can't rely in general on the contents of a node being XHTML, even if
we force output through an XHTML validator such as tidy.  So our best
bet is to encode the entire contents of a node as CDATA.  This gives us
hierarchy, and encapsulated contents (of any kind - later this could
also be other kinds of data or markup)
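
The nesting-plus-CDATA idea described above can be sketched as follows.
This is a Python illustration, not the patch's PHP; the attribute names
"level" and "nid", the dict layout of a node, and the function names are
assumptions made for the example.  The one subtle point is that a body
containing "]]>" must be split across two CDATA sections.

```python
import xml.etree.ElementTree as ET

def cdata(text):
    """Wrap text in CDATA, splitting any ']]>' so it cannot end the section."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"

def export_node(node, level=1):
    """Serialize {'nid', 'body', 'children'} as nested <div> elements."""
    parts = ['<div level="%d" nid="%d">' % (level, node["nid"])]
    parts.append(cdata(node["body"]))  # body is opaque: never parsed as markup
    for child in node["children"]:
        parts.append(export_node(child, level + 1))
    parts.append("</div>")
    return "".join(parts)

# A body that is neither well-formed HTML nor safe to embed as-is.
book = {
    "nid": 1,
    "body": "<p>unclosed & ]]> tricky",
    "children": [{"nid": 2, "body": "plain text", "children": []}],
}
xml_text = export_node(book)
root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
```

After parsing, root.text holds the original body unchanged, which is the
property an export/import round trip depends on.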

This output will be valid XML, with a pretty simple DTD.  It is easy to
take such a file and write simple XSLT based scripts on the client
side to explode this file into a directory tree of HTML, a single HTML
file, or many other formats.  

Importing is trickier.  It's relatively easy to import an exported
file, and update the nodes of the book according to the hierarchy
defined by the sectional <div> elements.   Importing needs to take care
of structure which has changed - child nodes added, deleted, or moved.  
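
One possible way to detect the structural changes mentioned above, as a
rough Python sketch.  It assumes every node in the imported file carries a
stable node id, and that the hierarchy has been flattened into a mapping
from node id to parent id; both the mapping and the function name
diff_structure are hypothetical.

```python
def diff_structure(old_parents, new_parents):
    """Compare two {nid: parent_nid} maps; the root's parent is None."""
    old_ids, new_ids = set(old_parents), set(new_parents)
    return {
        # Present only in the imported file.
        "added": sorted(new_ids - old_ids),
        # Present only in the existing book.
        "deleted": sorted(old_ids - new_ids),
        # Present in both, but attached to a different parent.
        "moved": sorted(
            nid for nid in old_ids & new_ids
            if old_parents[nid] != new_parents[nid]
        ),
    }

existing = {1: None, 2: 1, 3: 1, 4: 2}
imported = {1: None, 2: 1, 4: 1, 5: 2}
changes = diff_structure(existing, imported)
```

Newly written pages would have no node id yet, so a real importer would
need a separate rule for them; that case is not covered here.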

It would also be nice to have some client-side scripts to import other
formats into this nested sectional <div> based format - for example, to
take a directory tree of HTML fragments, and make this into an
importable file.

------------------------------------------------------------------------

June 1, 2005 - 12:33 : puregin

Attachment: http://drupal.org/files/issues/xml-export.patch (14.06 KB)

This patch enables export of books as XML documents.

The XML is DocBook "at the level of structure", but
node contents are wrapped as CDATA, since we
can't be sure that the contents are valid XML.

Several other bugs/feature requests are also
addressed with this patch:

 - Fixes bugs

http://drupal.org/node/1898
http://drupal.org/node/1482
http://drupal.org/node/8049
http://drupal.org/node/1899

Should go a long way towards implementing feature request
http://drupal.org/node/2062

It should also be easy to extend this to produce OPML,
for example.

 - Adds about 170 lines, of which more than 100 are comments
 - Added doxygen comments
 - Made doxygen comment format consistent; fixed minor grammatical
   slips
 - A proper Doctype and more informative HTML element is generated
   for printer-friendly HTML output.
 - Refactored book_print() to use book_recurse().
 - Refactored book_recurse().  Applies 'visitor' callback functions to
   nodes during weight/title order tree-traversal.  The parameterized
   visitor callbacks can be used to generate different kinds of output.
   There are many other kinds of operations on books which can be
   implemented by writing a pre-node/post-node pair of callback
   functions: word-count/statistics gathering, comparison, copying,
   search and replace...
 - Introduced book_export() which uses book_recurse() to generate
   DocBook-like XML to export book contents in a structured form.
   An md5 hash is computed for each node to help import code to
   decide if a node needs to be updated or not.
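
The pre-node/post-node visitor arrangement described in the list above can
be sketched roughly like this.  It is a Python illustration only: the
function walk_book, the node dict layout, and the two visitor pairs are
made up for the example and are not the actual signature of
book_recurse().

```python
def walk_book(node, depth, visit_pre, visit_post):
    """Depth-first walk; children are visited in (weight, title) order."""
    output = visit_pre(node, depth)
    ordered = sorted(node["children"], key=lambda n: (n["weight"], n["title"]))
    for child in ordered:
        output += walk_book(child, depth + 1, visit_pre, visit_post)
    return output + visit_post(node, depth)

# Visitor pair 1: an indented plain-text outline.
def outline_pre(node, depth):
    return "  " * depth + node["title"] + "\n"

def outline_post(node, depth):
    return ""

# Visitor pair 2: nested XML-like sections.
def xml_pre(node, depth):
    return '<section id="node-%d">' % node["nid"]

def xml_post(node, depth):
    return "</section>"

def page(nid, title, weight, children=()):
    return {"nid": nid, "title": title, "weight": weight,
            "children": list(children)}

book = page(1, "Handbook", 0, [
    page(2, "Zebra", 0), page(3, "Apple", 0), page(4, "Intro", -1),
])
outline = walk_book(book, 0, outline_pre, outline_post)
sections = walk_book(book, 0, xml_pre, xml_post)
```

The traversal is written once; only the callback pair changes between
output formats, which is what would make an OPML export a small addition.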

------------------------------------------------------------------------

June 3, 2005 - 10:29 : puregin

Attachment: http://drupal.org/files/issues/xml-export-01.patch (14.19 KB)

This updated patch adds "weight" metadata, which I forgot to capture in
the previous patch. I'm not sure how much other  metadata I should
include.

------------------------------------------------------------------------

June 3, 2005 - 11:06 : puregin

Attachment: http://drupal.org/files/issues/explode2dir.php (6.64 KB)

The attached command-line PHP script may be useful in testing the XML
export patch supplied.

Assuming your local version of PHP is built with CLI support and XML
parser support, you should be able to run the script against an XML
export file generated by the book module with my patch.  After you have
installed the patch, you can select a book page, click on the 'export
XML' link, and save the result as a file, say 'test.xml'.   Then run
the script.  On my system this looks like this:

% ./explode2dir.php test.xml

This will produce output that looks something like this:

./explode2dir.php test.xml
md5: 9e8ca98c6a8be35c21f31f7937608acc
weight: 1
md5: 11a1956a1592feac37abee6b469e62c8
weight: 0
md5: ed4c91279d3bed28b56899b75ccaa9aa
weight: 0

It will generate a directory hierarchy, with one directory per book
node. Each directory contains a file containing the node contents and a
file 'nid' containing the metadata.  You can check, for example, that
the md5 signature of the contents matches the md5 signature recorded with
the metadata.
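
The md5 check mentioned above could be done along these lines.  This is a
Python sketch under assumptions: the metadata is taken to be "key: value"
lines like the script output shown earlier, and the hash is taken to cover
the raw bytes of the contents file.  Neither detail is confirmed by the
patch, and the function names are hypothetical.

```python
import hashlib

def parse_metadata(text):
    """Parse 'key: value' lines into a dict, ignoring other lines."""
    meta = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta

def content_matches(content, metadata_text):
    """True if the md5 of content (bytes) equals the recorded md5 value."""
    recorded = parse_metadata(metadata_text).get("md5")
    actual = hashlib.md5(content).hexdigest()
    return recorded is not None and recorded.lower() == actual

# Simulate one exported node and its metadata.
content = b"Some node body\n"
metadata = "md5: %s\nweight: 0\n" % hashlib.md5(content).hexdigest()
unchanged = content_matches(content, metadata)
edited = content_matches(content + b"edited", metadata)
```

An importer could use the same comparison to skip nodes whose contents
have not changed since export.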

Djun

------------------------------------------------------------------------

June 3, 2005 - 20:46 : Dries

I like the approach taken in this patch!  Let's tidy up the menu
structure.  I suggest changing

book/export   (docbook)
book/print    (plain-text)

to

export/docbook
export/text

If we add OPML-support, it would then become:

export/opml

Similarly, I suggest renaming 'export XML' to 'export DocBook XML' (or
something).

------------------------------------------------------------------------

June 3, 2005 - 20:53 : Dries

Haven't tried it but how does DocBook handle CDATA?  Does it come out
OK?  Read: does it make sense to do it this way?

------------------------------------------------------------------------

June 4, 2005 - 01:04 : puregin

Dries,

    The output XML which I've implemented is only 'DocBook-like', not
true DocBook.  I've been dealing with structure at the top level (book,
chapter, section).  I've hidden away the actual content inside a CDATA
section.  The XML is really intended to provide a container for export.

   At this point I'm trying to focus on a relatively simple way to do
an export/import round trip of the current content format (text/'loose'
HTML).  Most people would probably not edit this using an XML editor.  

   I think I can generate a tar/gzip archive of the directory structure
I described (output of explode2dir.php) directly on the server, either
by calling an external pipeline, or by using the tar/gzip PEAR
extension.  So the XML format I described would be useful primarily as
a means to do the /import/, unless we can think of a better way to import
such a directory structure (perhaps via the node_import module?)

    How the CDATA section is handled depends on the application.  Most
XML editors will display this as CDATA, allow the user to edit as
CDATA, and to perform edit operations such as cut/paste to convert the
CDATA to other XML elements.   DocBook formatting applications could do
various things - ignore the CDATA; format as 'preformatted', e.g.,
source listing; or try to do something clever, like attempting to parse
and convert the CDATA into real DocBook before proceeding.

    To generate true DocBook, we would have to: 

* emit a document type declaration
* decide if we want to export complete documents (i.e., top level
elements such as books, articles, set) or document fragments.

The real difficulty is dealing with the /content/: to convert this into
DocBook,  we'd have to attempt to map (possibly not well formed) text
and/or HTML into well-formed DocBook XML.  This would require guessing
in many cases, since text/HTML doesn't directly encode the author's
intent.   A problem with this would be that the content might not be
returned exactly as exported in an (export/import) 'round trip' .

If we want to support true DocBook, it would probably be better to do
this via an input filter, similar to the PHP input filter - definitely
worth pursuing, but perhaps a separate issue?

I will rewrite the patch to make sure that the exported XML validates
as a DocBook fragment, and punt off to people who actually want to deal
with real DocBook the problem of embedding and converting content. These
folks would probably not be so interested in round-trip import/export,
until we have native DocBook nodes, at which point many of these issues
vanish (I hope :)

Regards, Djun

------------------------------------------------------------------------

June 4, 2005 - 02:01 : Amazon

As an active member of the documentation team I would greatly appreciate
any ability to export and import content to and from the documentation
handbooks.  Drupal is slow for editing, and has some usability issues
that I will be following up on.

Please consider accepting this as an incremental step towards assisting
the documentation team and other editors.

Kieran

------------------------------------------------------------------------

June 4, 2005 - 09:28 : Dries

If the goal is to generate books, DocBook-export is key.  However, if
the output is only DocBook-like, it is only going to be used by a
handful of people.  I think the code comments should mention that it is
only DocBook-like.

If the goal is to import/export books, OPML might be the better choice.
I think book syndication (publish/subscribe) is going to be the more
popular use.

Either way, let's clean up the URL scheme and extend the book_help()
function a bit (if not already).