[drupal-devel] [bug] Strange markup in book render functions

Tue Jun 7 00:17:12 UTC 2005

Issue status update for http://drupal.org/node/1898

 Project:      Drupal
 Version:      cvs
 Component:    book.module
 Category:     bug reports
 Priority:     critical
 Assigned to:  puregin
 Reported by:  kika
 Updated by:   puregin
 Status:       patch
 Attachment:   http://drupal.org/files/issues/xml-export-05.patch (23.89 KB)

Sorry, ignore the last patch - I goofed with escaping a string.  The new
patchfile is attached -- Djun

puregin

Previous comments:
------------------------------------------------------------------------

June 9, 2003 - 15:06 : kika

Snippet from book_render():

$output .= "<dt>". l($node->title, "node/view/$node->nid")
."</dt><dd>". book_body($node) ."<br /><br /></dd>";

Is using definition list here is really correct? This should be
replaced with appropiate div's or even better, funnel through theme()

Same goes for the function book_print():

 $output .= "$node->title";

  if ($node->body) {
    $output .= "<ul>". book_body($node) ."</ul>";
  }

Why wrap body inside <ul>?

------------------------------------------------------------------------

June 10, 2003 - 17:31 : al

Yeah, it's all nasty. I patched this ages ago, but no one seemed
interested. The patch now won't apply due to all the changes since
then. 

I'm on this - expect a new patch to hit my sandbox shortly.

------------------------------------------------------------------------

June 10, 2003 - 23:03 : ax

please remove the superfluous <ul>s and <li> [1], too.

to solve the problem of invalid nested <hx> tags, maybe we should
change &lt;h$depth&gt; to &lt;h1 class="book-h$depth"&gt;? this way, we
could use <h2> - <h6> in book pages. alternatively, we should only use
<h5> and <h6> in book pages - we shouldn't have nestings deeper than 4
in printed books, then.

additionally, i think it would be a good idea to add id's to all book
headings so they can be jumped to via fragment identifiers
(url#fragment-identifier). i often wish this would be possible when
referencing the last paragraph of a long page of documentation ...
these id's should be of the form <hx id="foo" name="foo"> [2], with foo
probably being the node-id of the single book page.

adding id's to headings would also be a good idea for headings /
sections in single book nodes. for example, it would be nice if i could
link to list point 10 in the drupal installation instructions [3] via
something like url#10. these id's had to be added manually. the
question is: how can these id's be named to be unique within a
/printed/ book? probably <node_id>_<section_id> ...

[1] [ [drupal-devel] Drupal handbook formatting and heading tags |
http://lists.drupal.org/pipermail/drupal-devel/2003-June/025860.html ]
[2] [ XHTML Spec, HTML Compatibility Guidelines, Fragment Identifiers |
http://www.w3.org/TR/xhtml1/#C_8 ]
[3] http://drupal.org/node/view/260

------------------------------------------------------------------------

June 15, 2003 - 07:23 : al

Fixed in latest CVS.
Uses all of Ax's suggestions:
 - <h1 class="depthX">
 - Fixed to use CSS rather than tables.
 - Changes link at top to standard "breadcrumb" style, rather than
hierarchy in tree view.
 - Removes odd and superfluous <ul> tags and the like.
 - Makes printed docs have proper <html> and <body> wrapping, etc.

------------------------------------------------------------------------

May 5, 2005 - 00:20 : puregin

I'm re-opening this, because the problem persists in CVS.

1) The current CVS version (1.288.2.4) of  book module still wraps
$node->body in an <ul> element in book_print_recurse().  This produces
non-conformant HTML.

I'm not sure why this was done in the first place - I'm guessing
someone wanted to insulate from having bad markup in a book page throw
off the enclosing page.

Getting rid of the offending UL tags is a one-line solution, but what
is the right thing to do here?

I'd like to see something very structural - wrapping the /entire/
output (including the generated sectional header) of the
book_print_recurse() wrapped in a <div> element.  This will properly
nest true structural sections, making the structure accessible to
applications via DOM.  It will also make it easier to use XSLT, for
example,  to transform into various XML based  export formats, for
example.  Any reason not to do so?

Also...  when we get the 'printer-friendly' version of a page, the
subtree rooted at that page is returned, and the root page is marked up
with a class="book-h1".   Is there any reason why we don't compute the
depth of the root page, and use that to generate the "in-place" header?
  In other words, if a section is "book-h3" relative to the top-level
book, shouldn't it be "book-h3" whenever it is viewed?

In this case, we could apply the same principal to generating the <div>
tags around each section.

------------------------------------------------------------------------

May 5, 2005 - 15:58 : puregin

Attachment: http://drupal.org/files/issues/book_printer_friendly_div.patch (1.57 KB)

I am attaching a patch which fixes improper printer-friendly output, as
discussed.

The patch fixes book_print and book_print_recurse() to insert a <div>
start tag before emitting the header, and to close this  after
subsections have been recursively generated.

The <div> tag has an id attribute of the form id="sect-123", where the
node's nid is 123, and a class attribute of form class="sect/n/" where
the depth of the section is /n/.

I'm marking this issue as critical, since the present HTML output plays
havoc with any attempts at sane PDF generation.

I will address the issue of where printer-friendly subsections should
be 'rooted' in a follow-up.

Djun

------------------------------------------------------------------------

May 6, 2005 - 01:53 : Dries

Let's not abbreviate 'section' to 'sect'.  Also, we use '-' to separate
words in CSS names.  

- If the section ID uses the node ID, maybe we should use 'node-'
instead of 'sect-'?

- If the class uses depth, maybe we should use 'depth-' instead of
'sect'?

The same is true for your other patch where you suggest to use 'n'. 
Better to make that 'node-'.  (Feel free to merge both patches.)

Looks like the book.module generates a _lot_ of CSS IDs/classes. If the
introduction of these new IDs/classes allows us to remove some of the
existing IDs/classes, please do.

------------------------------------------------------------------------

May 6, 2005 - 21:26 : Steven

Here's my idea.. why not use regular h tags for the book structure, and
renumber the header tags in the content, starting one level after the
current book depth?

Here's a function which does this:

<?php
function fix_up_headers($text, $level = 1) {
  // Find all header tags that are used (if any)
  if (preg_match_all('/<h[0-9]+/i', $text, $tags) == 0) {
    return $text;
  }
  // Discard duplicates and sort them by number
  $tags = array_unique(array_map('strtolower', $tags[0]));
  natsort($tags);
  // Renumber them and replace them
  $i = $level;
  foreach ($tags as $tag) {
    $from[] = '@'. preg_quote($tag) .'(?![0-9])@i';
    $from[] = '@'. preg_quote(str_replace('<', '</', $tag))
.'(?![0-9])@i';
    $to[] = '<h'. $i;
    $to[] = '</h'. $i;
    $i++;
  }
  $text = preg_replace($from, $to, $text);
  // Change level 7 headers and higher to divs.
  return preg_replace(array('!<h([7-9]|[1-9][0-9]+)!i',
'!</h([7-9]|[1-9][0-9]+)!i'), array('<div class="header-\1"', '</div'),
$text);
}
?>

That way you can use header tags safely in the content (which is useful
in the context of a single page) without messing up the book output.

------------------------------------------------------------------------

May 7, 2005 - 00:50 : puregin

Attachment: http://drupal.org/files/issues/02_book_printer_friendly_div.patch (1.78 KB)

The attached patch changes the div id to the form 'node-123'.  I was
concerned that a document fragment might be created with both a book
section and a non-book section from the same node, leading to
non-unique id's, but at this point it seems like an unlikely scenario.

I've changed 'class=sect1' to 'class=section-1'.  I was trying to be
lazy and use the DocBook section name convention to make XSLT
conversion a little easier, but it's really not saving any effort, in
the end, and this is clearer.

I think 'section' rather than 'depth' is appropriate - it describes
what the role of the div element is.

I've closed the other issue regarding id's in header elements, because
I've replaced the whole 'class="book-hx"' business in headers with
'class="book-header"'.  Since the level can now be selected by context
with respect to the enclosing div, separate classes for headers are not
required.  That is, we can style with CSS like the following:

.section-1 h1.book-header {
    font-weight: bold;
    font-size: 2.2em;
}
.section-2 h1.book-header {
    font-size: 2em;
}
.section-1 p  {
    text-align: justified;
}
.section-3 p  {
    margin-left: 3em;
    margin-right: 3em;
}

Note that the section-2 h1 will be bold, since the section-2 div is
nested within the section-1 div.  Also, all paragraphs regardless of
section will be justified; and section-3 paragraphs will be indented
left and right.

Because we can style /everything/ by section, not just headers, and
because the styling specifications respect the sectional hierarchy,
it's possible to create very elegant and sophisticated book pages with
very little CSS.  Even I can do it! :)  I'm definitely *not* a
designer, but I've played around a bit with this - have a look at
http://www.puregin.org/book/print/129 for an example.

I will attach patches for misc/print.css and misc/drupal.css
separately.  I don't understand why, but the present version (cvs) of
these files has all of the h1.book-hx selectors in drupal.css - the
printer-friendly page actually looks at print.css.  Am I missing
something?

For the record, this approach seems to have been suggested by clairem
in http://drupal.org/node/8049 where I have a little rant about
headings in book pages.  Twice :)

Djun

------------------------------------------------------------------------

May 7, 2005 - 01:10 : puregin

Steven - that's a neat idea, and very clever, clean code!  But it still
doesn't catch problems such as, for example,  an h5 inside of an h1
section with no intervening h2..h4 headers, or?

I'm inclined to be a bit suspicious of the approach, though - I rather
think it's 'doing magic behind the user's back'.   If people are going
to insist on putting headings inside of pages, it should be possible. 
And it would be odd for users to discover that what they typed isn't
what they get.

------------------------------------------------------------------------

May 7, 2005 - 01:32 : puregin

Attachment: http://drupal.org/files/issues/print_css_div.patch (1.36 KB)

Here's the patch for print.css.  

There's a couple of commented-out lines - these show how to make
section headings 'run in' as part of the following paragraph (or other
inline element).  I am assuming that print.css is supposed to be
something of an example that people would use to experiment with? 
Please feel free to improve this in any way - as I said, I'm far from
being a graphic designer, and I'm also not really sure what the
intention of this file is.

------------------------------------------------------------------------

May 7, 2005 - 11:40 : Steven

Puregin: nope, the code correctly handles missing h numbers. If the book
page only uses h2, then that will be the first new level. If it uses h2
and h4, and it is inside a 3rd level book page for example (whose title
would be h3), they they get renumbered to respectively h4 and h5 (no
missing number).

>From a semantic point of view it is crazy not to use h# for book
titles.

------------------------------------------------------------------------

May 7, 2005 - 11:44 : Steven

A bit more verbose: I think the overwhelming majority of users browse a
book on the site itself. Within a single page, there is a h1 tag for
the page title. So it makes sense for a writer to use h2, h3, etc
inside a single page. This is semantically valid HTML.

On the other hand, the print output places the content in a whole new
context. I would expect the book module to take whatever steps
necessary 'behind the user's back' to ensure it displays correctly.

------------------------------------------------------------------------

May 7, 2005 - 13:43 : puregin

Sorry, I replied to the last post on the list - I'll include the post
here for the record, and expand a little.

Steven, I think that the issue you're addressing is actually a
different one than I am trying to fix in my patch, which addresses how
sections are wrapped. 

My patch would still allow users to write headers in their book pages -
it would then be up to post-processing  - or something like your
approach - to deal with how embedded headers are dealt with.

Perhaps we should take up your suggestion in another of the book.module
issues dealing with the sequences of headers?  How about
http://drupal.org/node/8049,  or http://drupal.org/node/4118?

With respect to your suggestion - I understand now that you're
proposing to apply the re-writing only to the printer-friendly output. 
 I'm conflicted about this idea - on the one hand, the output which it
produces is clearly an improvement over the present output.  On the
other hand, it changes what the user types to make it 'right' - I guess
this is in the long tradition of HTML being interpreted very forgivingly
by web-browsers and other applications.  On the balance, I support your
idea - I guess it's in the spirit of 'make things easier for the user'.

With respect to the actual code - I don't think it will handle the case
where h2, h3, and  h4 are all used in the text, but the sequence h2, h4
occurs.  To put it another way, the problem of checking whether headers
are in a legal sequence  is context sensitive, while a
regular-expression search and replace (even if the replacement sequence
remaps the targets) is context-free.  When I say legal sequence, I mean
legal in terms of book hierarchy, not in terms of the XHTML definition,
which doesn't address hierarchy, because headers are titles, not
sectional elements.  SGML/XML publishing DTDs define such things via
inclusion/exclusion on true sectional elements)

The case I described is, admittedly, a relatively infrequent
occurrence, and the general solution would be way to difficult to deal
with - it would involve constructing and manipulating a DOM, or
something along those lines. Ugh!

Djun

------------------------------------------------------------------------

May 18, 2005 - 10:24 : cel4145

The Drupal doc team is very interested in being able to import/export
book content. I'm sure that others would be interested in it, too. Djun
has indicated that this patch is a good step toward having that
functionality

So +1 if the developers can help him to get this patch so that it can
be implemented.

------------------------------------------------------------------------

June 1, 2005 - 03:37 : puregin

This patch is subsumed in my patch for issue
http://drupal.org/project/comments/add/1482, which provides for XML
export of books.

Note that the patch for print.css in this issue still applies with the
new patch.

------------------------------------------------------------------------

June 6, 2005 - 02:30 : puregin

This issue has been fixed with the application of patch from
http://drupal.org/node/1482.

However, I'm not sure if the print.css file patch has been commited.  

Dries?

Also, as I mentioned, someone with real page/graphic designer's
expertise should probably look over this and tweak/replace with better
layout.

------------------------------------------------------------------------

June 6, 2005 - 13:30 : Dries

Attachment: http://drupal.org/files/issues/opml.png (39.94 KB)

I think this version generates invalid OPML and invalid Docbook code. 
The Docbook does not seems to validate as XML because sometimes a
book-tag is closed with a section-tag.  In OPML's case, the
section-classes look weird (see screenshot).  I'm guessing something is
wrong with the computation of the depth.

Are the class-sections part of the OPML standard?  I'm having a hard
time guessing what they are about or why one wants to use these.

A couple of code comments:

1. Don't use tabs.  Even in absence of tabs, the spacing isn't always
100% consistent -- sometimes two spaces are used.

2. The use of " (double quotes) and ' (single quotes) is sometimes
confusing.  Only use single quotes when it shortens the code.  Often
your code can be shortened by using double quotes (and escaping double
quotes in the embedded string).  Example:  $output .= '<div
id="node-'.$node->nid. '" class="section-'.$depth.'">'."\n";

3. Write $depth == 2, not 2 == $depth.

------------------------------------------------------------------------

June 6, 2005 - 17:07 : puregin

Attachment: http://drupal.org/files/issues/xml-export-04.patch (23.89 KB)

I am now supplying the correct number of parameters to the '_post_node'
visitor definitions, so the depth is now passed correctly.  This fixes
the problems with mismatched tags and strange section depths.  

The DocBook output for top-level nodes validates against Docbook XML
V4.1.2 (using onsgmls).  Level 2 and deeper nodes generate Chapters and
sections as document fragments and so won't validate per-se.  I've
checked that they are well-formed, and that they validate on a manual
embedding.

OPML now validates against the 1.0 DTD (though there seem to be some
problems with the DTD itself)

I now encode the "id" attribute as string in type attribute of outline
for OPML, in order to store this metadata in an existing attribute. The
use case I have in mind is someone exporting the book structure
(hierarchy) via OPML, editing it and re-importing to modify the
structure of the book.  I have removed the "class" metadata.

Checked for tabs/whitespace - I hope this is now OK.

Now using ($depth == 2) rather than (2 == $depth).

I've replace single quotes with double where code is simplified by
this.