[drupal-devel] [task] Put revisions in their own table
killes
drupal-devel at drupal.org
Wed Jan 19 23:33:34 UTC 2005
Project: Drupal
Version: cvs
Component: node system
Category: tasks
Priority: normal
Assigned to: killes at www.drop.org
Reported by: killes at www.drop.org
Updated by: killes at www.drop.org
Status: patch
Attachment: http://drupal.org/files/issues/revisions_3.patch (29.93 KB)
I've implemented the aforementioned solution. This makes the patch much
smaller. The patch now also removes taxonomy_node_has_term() which
wasn't used anywhere. I'd really apprciate if some people could test
drive the patch. It will be another huge improvement for 4.6.
killes at www.drop.org
Previous comments:
------------------------------------------------------------------------
May 5, 2004 - 17:25 : killes at www.drop.org
Currently all node revisions are stored in a serialized field in
node.table and retrieved for _each_ page view although they are rarely
needed. However, we have agreed that serializing data is bad and that
we should try to keep the memory foot print pf Drupal small.
Therefore I propose to create a separate revisions table which would be
in principle identical to the node table, only that it could have
several old copies of the same node. Extra data added by other modules
could be added in a serialized field unless we find a better solution.
------------------------------------------------------------------------
May 5, 2004 - 18:06 : jhriggs
I too think the serialized approach is less than desirable, but here's
an alternative. This would likely take some considerable rework in
core and contrib, but the following is how we handle similar types of
situations in our databases at work. It is more elegant that a
separate table, and avoids the (almost exact) duplication of a table.
Instead of separate tables, keep all revisions of nodes in the node
table as follows:
* add field: active (0/1 or Y/N)
* add field: revision
* every revision of a node is stored in the node table; however, only
one revision can be active at any given time
* nid can no longer be unique -- primary/unique key becomes (nid,
active)
* any time a node is loaded, updated (without revision), etc., the
active version is used.
Thoughts?
------------------------------------------------------------------------
May 5, 2004 - 18:57 : killes at www.drop.org
I am not opposed to your scheme, but I want to stress the following:
* Duplicating a table's structure is not bad (IMHO) as long as the
content is different.
* having two tables will allow us to have a rather small node table.
This is (maybe) a performance gain.
------------------------------------------------------------------------
May 5, 2004 - 19:37 : jhriggs
I don't necessarily think that duplicating a table's structure is _bad_.
It just seems to be wasteful and a pain to maintain. (Every change to
the node table is made twice...easy to do, but also easy to miss
perhaps.)
As for performance, as long as nid and the active indicator are
indexed, there shouldn't be any performance loss. Also, archiving an
old version when making a new revision will be much simpler: just
change the active indicator rather than copying an entire node to
another table (and ensuring everything gets copied...again a potential
maintainance issue).
To be honest, I would just like to see the serialized data go away,
regarless of what approach is taken.
------------------------------------------------------------------------
July 30, 2004 - 20:49 : Nick Nassar
Attachment: http://drupal.org/files/issues/Drupal-Improved_Revision_Schema_07-30-2004.patch.gz (10.47 KB)
I'm interested in using Drupal for a large scale wiki-type project. In
order to do this, I need revisions to be in their own table.
Attached is a patch to do just that. Most of the changes are pretty
self explanatory. Spreading out node data across two tables meant that
I had to add database functions to do locking/transactions. Without
this, race conditions in which the database becomes corrupted are
possible.
------------------------------------------------------------------------
July 30, 2004 - 20:54 : Nick Nassar
Oh yeah... The patch is a diff against Drupal CVS
------------------------------------------------------------------------
July 31, 2004 - 01:00 : Anonymous
Gerhard speaking.
Nick, thanks a lot for your nice patch! It saves me a great deal of
labour. I looked through it and immediately liked it. You not only put
the old revisions into a new table but also the current one. Do you
have an estimate how much more expensive the additional join is?
Besides a few minor coding style issues I found a major one: Just a few
hours before you uploaded your patch JonBob's node access patch hit
core. That means your patch won't apply anymore as all the queries you
change have been changed. Can I bug you to update your patch?
------------------------------------------------------------------------
July 31, 2004 - 02:11 : Anonymous
Also I think that your upgrade path loses existing revisions.
------------------------------------------------------------------------
July 31, 2004 - 03:39 : drumm
I think this is the proper way to do things. No columns are duplicated,
there is no serialized data, and only the fields that are logically
revised are stored. Nothing jumped out at me as a way to have my node
modules be able to keep a table of revisions of additional fields. I'm
guessing this could be done within the confines of _insert and
_update.
Assuming the upgrade path works and modules can extend it I give it a
+1.
------------------------------------------------------------------------
July 31, 2004 - 15:40 : Nick Nassar
It figures that just as I finish a big patch, another patch comes along
and breaks it. Oh well, it should be a pretty easy to fix. I'll work on
it.
Fixing the upgrade path to keep revisions should be fairly painless.
I found another issue that needs to be fixed before this patch gets
merged. There format of a node needs to be stored for each revision.
Otherwise, for modules that store a format for the nodes, such as page
and book, if you write one revision in PHP and the next in HTML, the
PHP revision will be displayed as HTML. This is part of a larger issue
of how node modules should store revisions of additional fields. I
think each module that wants to do this should create another table
with (nid, revid) as the primary key. Just as when they want to add
fields to a node they create another table with nid as the primary
key.
As far as performance goes, for sites that make heavy use of revisions,
an extra join on primary keys is going to be a lot faster than grabbing
all of the revisions from that database everytime. We would need to run
benchmarks to determine is the overall difference in speed is for an
average site is a gain or a loss. I'm guessing it's very minor either
way.
------------------------------------------------------------------------
August 23, 2004 - 14:55 : Nick Nassar
Attachment: http://drupal.org/files/issues/Drupal-Improved_Revision_Schema_08-23-2004.patch.gz (10.92 KB)
Here's an updated patch against CVS that puts revisions in their own
table, provides an upgrade path, and fixes the format related bugs in
the last patch.
Hopefully, this can make it into CVS as soon as the freeze is over.
------------------------------------------------------------------------
August 23, 2004 - 15:10 : moshe weitzman
Interesting patch ... drumm's question is still outstanding. how do
modules store revisions of their fields? Are they expected to manage
this on their own? Thats not how it works today.
As an aside, i am seeing profile_ fields in my node.revisions column.
One could argue that those need not be saved. They pertain to the node
author, not to the node itself.
------------------------------------------------------------------------
August 23, 2004 - 17:14 : Nick Nassar
Having modules be responsible for storing revisions of their own fields
is a side-effect of storing revision data in tables. There's really no
way around it. However, revisions generally don't make sense for node
types that don't have PHP/HTML content, such as polls. I think it's
going to be a pretty rare scenario for a new node type to want another
field to change per-revision, so it's a pretty good trade-off.
Storing fields that shouldn't be part of revisions, such as the
profile_ fields, is a side-effect of storing revisions as serialized
objects. Applying this patch will free up that wasted space. :)
------------------------------------------------------------------------
August 23, 2004 - 18:20 : Anonymous
There should be a hook that let's the module choose whether it supports
history. This way a module author can prevent the user from doing
something that may break his module or just cause undefined behavior.
If the module doesn't support history then don't let the user/admin
choose to add history to nodes of that type.
Craig
------------------------------------------------------------------------
August 23, 2004 - 20:23 : Nick Nassar
I agree, there should be an API change to make specifying support for
revisions easier. In the interests of keeping patches small and keeping
to one change per patch, I think the API change should be a separate
issue.
A sort of ad-hoc API to decide whether or not a module supports
revisions by default already exists. Instead of having a hook, modules
set the default value of the "Create new revision" field in the edit
form. The admin can change this option in
admin/node/configure/defaults. This patch doesn't change that.
Revisions are broken for node types that have their own database
structure, like polls, even when storing them as serialized objects.
This patch doesn't change that, either.
------------------------------------------------------------------------
October 26, 2004 - 03:35 : moshe weitzman
I'm guessing that someone is going to have to demonstrate that this
patch performs as well as current drupal before it gets comitted. i
think this patch is a few benchmarks from being comitted.
------------------------------------------------------------------------
October 27, 2004 - 02:04 : Nick Nassar
Attachment: http://drupal.org/files/issues/Drupal-Improved_Revision_Schema_10-26-2004.patch.gz (11 KB)
I ran some really unscientific benchmarks, and it looks like this patch
has a negligible affect on performance.
I used apache bench and the database from theregular.org, which doesn't
contain any revisions (worst case scenario for this patch) and contains
several hundred nodes. Both the patched and unpatched versions hovered
between 2.36 and 2.38 requests per second.
The command I used was:
ab -n50 -C 'PHPSESSID=b01a9f92880ef215b0ed6f1314a5eba2'
http://192.168.0.100/
An updated patch that should apply to CVS is attached.
------------------------------------------------------------------------
October 27, 2004 - 02:05 : Nick Nassar
I ran some really unscientific benchmarks, and it looks like this patch
has a negligible affect on performance.
I used apache bench and the database from theregular.org, which doesn't
contain any revisions (worst case scenario for this patch) and contains
several hundred nodes. Both the patched and unpatched versions hovered
between 2.36 and 2.38 requests per second.
The command I used was:
ab -n50 -C 'PHPSESSID=b01a9f92880ef215b0ed6f1314a5eba2'
http://192.168.0.100/
An updated patch that should apply to CVS is attached.
------------------------------------------------------------------------
October 27, 2004 - 02:05 : Nick Nassar
Attachment: http://drupal.org/files/issues/Drupal-Improved_Revision_Schema_10-26-2004.patch_0.gz (11 KB)
I ran some really unscientific benchmarks, and it looks like this patch
has a negligible affect on performance.
I used apache bench and the database from theregular.org, which doesn't
contain any revisions (worst case scenario for this patch) and contains
several hundred nodes. Both the patched and unpatched versions hovered
between 2.36 and 2.38 requests per second.
The command I used was:
ab -n50 -C 'PHPSESSID=b01a9f92880ef215b0ed6f1314a5eba2'
http://192.168.0.100/
An updated patch that should apply to CVS is attached.
------------------------------------------------------------------------
November 15, 2004 - 06:05 : elias1884
please overthink the revision system default workflow as well. don't
look at the revision system as an isolated system but as a part of the
whole workflow system!
if you combine revisions with the moderation queue the most logic
default workflow would be like that:
auth user creates node (revision #0)
admin approves the node (status = 1, moderation = 0)
=> node publicly available
auth user finds typo and changes node (revision #1, status = 0,
moderation = 1)
-------
what happens at that point at the moment is, that the node is not
accessible anymore at all until the new revision is approved by admin.
of course the new revision should not go online until reviewed and
approved, this is absolutely correct, but there is no reason to not
take the old revision offline, since it was already approved and should
therefore be online until the new revision is approved. it is not
practical if a node disappears only because the author corrected a
typo.
-------
admin approves the node (status = 1, moderation = 0)
eventhough I first thought a plain boolean active field would not be
capable of providing that functionality if finally came to the
conclusion, that it can. The only thing to do is to not set that bit,
when a new revision is created, but when it is approved (in case
moderation is activated under default workflow). Every revision should
have its own moderation, status and active field and on approval they
are set like this (status=1, moderation=0, active=Y).
When you wanna rollback to an old revision, you can chose between all
revisions that already have the moderation bit set back to 0 again and
the published to 1. There should be an extra permission for rollback!
another concern that I have about the default workflow is, that users
can't see the content, they have just created, when moderation is
enabled. Eventhough, there is a big fat "submission accepted" presented
after submissions, unexperienced users tend to question the information
those stupid tincans give them, if they can't find their content
afterwards. Many users are really lazy bastards and they don't even
read the status messages. The best feedback about whether his story was
submitted successfully or not of course is, if he can find the story
somewhere on the site, maybe with a status message on top of it,
mentioning, that the content is currently not publicly available since
it has not been approved yet. there should be a my content section
under my account, like somebody is trying to do with the workspace
module I guess.
so my suggestion is to make (status=0, moderation=1) still available
for the creator under a my content section somewhere!
------------------------------------------------------------------------
November 24, 2004 - 05:21 : Nick Nassar
I agree. The current workflow for moderation queues and revisions needs
to change, but this patch isn't the place for it. The patch is already
too big, and it only does the backend stuff.
Instead of adding more to this patch and making it take even longer to
get into core, would you mind creating a new issue for your UI
suggestions, so the those changes can be added as a separate patch?
Thanks,
Nick
------------------------------------------------------------------------
December 11, 2004 - 13:26 : Dries
This patch is _much_ needed so I'd love to see someone revive it. In
order for this patch to be accepted, the following needs to be done:
Update this patch to CVS HEAD.
Rename revid to vid.
Rename node_rev to node_revisions.
Rename node_rev.changed to node_revisions.timestamp.
Rename $rnode to $revision.
Fix the coding style to match Drupal's: proper spacing, single quotes
where possible, proper variable names.
Benchmark this patch with a large database with enough revisions. I'd
be happy to benchmark this on my local copy of the drupal.org database.
The book.log field should probably move to the node_revisions table.
This can be done in a separate patch.
Investigate whether transactions are well-supported.
------------------------------------------------------------------------
December 13, 2004 - 01:25 : killes at www.drop.org
Attachment: http://drupal.org/files/issues/Drupal-Improved_Revision_Schema_10-26-2004-revisited.patch.gz (11.02 KB)
I've worked a bit on the patch (coding style issues as mentioned by
Dries). One thing I noticed is that the patch uses REPLACE. IIRC this
needs to be chagned to "UPDATE, if fail INSERT" for pgsql
compatibility.
Nick, are you still interested in working on that patch? I'd like to
know how it works on your site and work on getting it into core.
------------------------------------------------------------------------
December 13, 2004 - 13:33 : Dries
Gerhard: your patch does not apply.
------------------------------------------------------------------------
December 13, 2004 - 18:10 : killes at www.drop.org
Yes, I know, that was the same version as I mailed to you earlier.
------------------------------------------------------------------------
December 13, 2004 - 22:02 : killes at www.drop.org
Attachment: http://drupal.org/files/issues/revisions.patch (52.96 KB)
Ok, upüdated the patch to cvs.
------------------------------------------------------------------------
December 14, 2004 - 09:58 : Dries
Some more comments:
db_begin_transaction() and db_end_transaction() do not belong in
database.inc, but in database.mysql.inc and database.pgsql.inc
respectively.
The node module calls node_revisionsision_list() which is not defined.
(Fxed that on my local copy.)
Do db_begin_transaction() and db_end_transaction() deprecate Jeremy's
table locking patch?
The upgrade path assigns the wrong user ID to each revision.
The upgrade path assigns the wrong date to each revision (that or a
node's revision page shows the wrong usernames/dates).
The coding style needs a bit of work, but we can worry about that
later.
------------------------------------------------------------------------
December 14, 2004 - 18:34 : Nick Nassar
If you need any help getting those things fixed, just let me know.
------------------------------------------------------------------------
December 14, 2004 - 18:50 : Nick Nassar
How this relates to Jeremy's node locking patch:
There was lots of discussion, and node locking was decided against
because from an end user point of view you never want a node to be
locked. He's now advocating for a much simpler patch that warns users
if their changes will overwrite someone elses. That patch still has a
race condition, which might be fixed using db_begin_transaction().
http://drupal.org/node/6025
------------------------------------------------------------------------
December 14, 2004 - 23:26 : killes at www.drop.org
Attachment: http://drupal.org/files/issues/revisions_0.patch (55.96 KB)
Here is an updated patch that tries to address Dries concerns.
------------------------------------------------------------------------
December 15, 2004 - 09:32 : Dries
Attachment: http://drupal.org/files/issues/revisions-bug.png (76.06 KB)
It didn't fix the aforementioned bugs. See attached screenshot.
------------------------------------------------------------------------
January 6, 2005 - 21:15 : killes at www.drop.org
Attachment: http://drupal.org/files/issues/revisions_1.patch (51.77 KB)
Ok, here is a new version. Dries and myself worked hart at it, so please
have a look.
what is still missing
- database upgrades for the core modules with an own table
- contrib modules need an upgrade too.
- do we need nid and vid in both the node and the node_revisions
table?
- the amount of sql queries means a good stress testing for large
databases.
------------------------------------------------------------------------
January 19, 2005 - 22:43 : killes at www.drop.org
Attachment: http://drupal.org/files/issues/revisions_2.patch (49.49 KB)
Here is an updated patch. We discussed to keep the current title in node
module and also in the revisiosn table. This is content duplication but
will save many joins as many queries only need the title of a node.
Discussion is welcome.
--
View: http://drupal.org/node/7582
Edit: http://drupal.org/project/comments/add/7582
More information about the drupal-devel
mailing list