[development] Solving the dev->staging->live problem

Mon Aug 11 21:09:44 UTC 2008

On Monday 11 August 2008 14:38:27 Victor Kane wrote:
> Your points are interesting, but I think there may be a lot to what
> Greg Dunlap is recommending in terms of thinking in Drupal logic terms
> and not database terms.
>
> The image I have in my mind is that the database is kind of a
> two-dimensional projection of three dimensions; that is, there may be
> many hidden relationships in the process side of things that have to
> be taken into account for deployment, especially since the database is
> usually considered in purely MySQL terms, that is, with fully
> transparent relationships between tables.
>
> But given concrete client driven circumstances, of course, in a given
> instance with a given set of priorities, I can easily see how what you
> are saying could make sense.
>
> Victor

At this point then, I suspect we're talking past each other. First, are you 
referring only to the direct-database sync point that I made, or is that also 
in reference to the drupal-object exporting, git-managed system? If it's the 
latter as well as the former, then we've _really_ diverged.

Just because I'm suggesting that there are direct-database syncing solutions 
possible does NOT mean that I'm not thinking in Drupal logic terms. Let me try 
a different, potentially clearer way:

In your approach, it seems to me that the goal is to use existing drupal API 
functions - let's use nodes, so node_load() - to fully build a node object. 
That object is then exported to code, at which point we can do things with 
vcs, etc. During a deployment, we parse in that code and utilize the API 
counterpart to our data-getter function, so node_save(), to push the 
information into the database. Before node_save() is actually run, however,  
we can use something like a GUID system to change primary keys as necessary so 
that we've got the node from server A going into the right associated node on 
server B.
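
A minimal sketch of that round trip, written in Python rather than Drupal's PHP 
purely for brevity. Everything in it is a hypothetical stand-in: export_node() 
and import_node() play the roles of the node_load()-then-export and 
parse-then-node_save() steps, plain dicts stand in for each server's node 
table, and the GUID lookup tables are an assumed design, not an existing 
Drupal API.

```python
import json


def export_node(node, guid_by_nid):
    """Serialize a loaded node to text, swapping the local nid for its GUID."""
    payload = {key: value for key, value in node.items() if key != "nid"}
    payload["guid"] = guid_by_nid[node["nid"]]
    return json.dumps(payload, sort_keys=True)


def import_node(text, nid_by_guid, storage):
    """Parse exported text and save it under the receiving server's own nid."""
    payload = json.loads(text)
    guid = payload.pop("guid")
    nid = nid_by_guid.get(guid)
    if nid is None:
        # Unknown GUID: allocate the next free local nid (stand-in for an insert).
        nid = max(storage, default=0) + 1
        nid_by_guid[guid] = nid
    storage[nid] = dict(payload, nid=nid)
    return nid


# Server A knows the node as nid 12; server B knows the same item as nid 7.
server_a_guids = {12: "guid-about-page"}
server_b_nids = {"guid-about-page": 7}
server_b_nodes = {7: {"nid": 7, "title": "About (old)"}}

exported = export_node({"nid": 12, "title": "About us"}, server_a_guids)
import_node(exported, server_b_nids, server_b_nodes)
```

The exported text never carries server A's primary key, so it can sit in 
version control and be replayed against any server that has its own GUID 
mapping.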

As I said, I can see arguments for that, particularly if it utilizes a git 
backend. It differs from what I'm proposing in only a few ways.

What I'm suggesting stems from Greg's original point about a GUID for every 
'thing'. 'Things' being, to use your words:

On Monday 11 August 2008 05:33:49 Victor Kane wrote:
> ...
> all entities, configurations, including exported views, content types,
> panels, hopefully menus and module configurations and
> exported variables
> ...

Because I agree completely that the _only_ way to arrive at a solution that 
works for drupal is to think in terms of the 'things' (I'm going to say 
'items' from here on out) it makes - not to try to just grab bits of data from 
here and there and hope it all lines up on the other end. I've always thought 
that, and am pretty sure I always will.

My proposal is for a deploy API that would let modules define what these items 
are, and then define a set of behaviors for managing the deployment of those 
items across a variety of different circumstances. For our concrete example, I 
suspect the _first_ thing I'd do in the synchro handler for nodes is to call 
node_load(), then follow it up with some internal logic that...I dunno, there 
are a lot of ways we could go from there. It could dynamically construct a 
delta from the last sync time with the remote server; it could just fire over 
the whole node object. If extension modules need to do something that the node 
module's deploy handler couldn't work out, no problem - those modules just 
need to register their interest in deploy transactions related to the GUID for 
that particular item. On the receiving end, the node module's deploy API 
implementation knows what to expect coming through the pipe and handles it 
accordingly - maybe through node_save(), maybe not.
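
One possible shape for such a deploy API, sketched in Python rather than PHP 
and under heavy assumptions: the DeployRegistry class and all of its method 
names are invented for illustration and do not exist in Drupal. It only shows 
the two registrations described above: a module declaring how its item type is 
sent and received, and an extension module registering interest in one GUID.

```python
class DeployRegistry:
    """Hypothetical registry; none of these names exist in Drupal."""

    def __init__(self):
        self.handlers = {}   # item type -> (send, receive) callables
        self.interest = {}   # GUID -> callbacks from extension modules

    def register_item_type(self, item_type, send, receive):
        self.handlers[item_type] = (send, receive)

    def register_interest(self, guid, callback):
        self.interest.setdefault(guid, []).append(callback)

    def send(self, item_type, guid, item):
        send, _receive = self.handlers[item_type]
        payload = send(item)
        # Let every module interested in this GUID amend the payload.
        for callback in self.interest.get(guid, []):
            payload = callback(payload)
        return payload

    def receive(self, item_type, payload):
        _send, receive = self.handlers[item_type]
        return receive(payload)


registry = DeployRegistry()
received = []  # stand-in for whatever the receiving site does with a payload

registry.register_item_type(
    "node",
    send=lambda node: {"title": node["title"]},
    receive=received.append,
)
registry.register_interest("guid-1", lambda payload: dict(payload, alias="about"))

payload = registry.send("node", "guid-1", {"nid": 12, "title": "About us"})
registry.receive("node", payload)
```

Whether the send side builds a delta or ships the whole object is hidden 
inside the handler, so the registry itself stays indifferent to that choice.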

The advantage here, as I see it, is the potential to drill down _very_ quickly 
on exactly what should be checked during a given changeset. Very quickly as 
in, potentially, a single query. I can't picture the schema for the deploy 
items table, so it may take more, but it could be as simple as a single SELECT 
query that grabs all the items that have been changed/created since the last 
deploy txn and that this particular deploy txn is interested in (again, a dev 
<=> qa txn != staging <=> live txn). From there it's a simple question of 
iterating through modules that have something to say about how each of the 
items is deployed.
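
A rough sketch of what that single query might look like, using Python's 
sqlite3 module for a self-contained demo. The schema is entirely assumed, 
since no deploy items table exists yet: deploy_items, deploy_txn_interest, and 
their columns are hypothetical names chosen for this example.

```python
import sqlite3

# Hypothetical schema: one row per deployable item, plus a table recording
# which item types each kind of deploy transaction cares about.
SCHEMA = """
CREATE TABLE deploy_items (guid TEXT PRIMARY KEY, item_type TEXT, changed INTEGER);
CREATE TABLE deploy_txn_interest (txn_type TEXT, item_type TEXT);
"""


def changed_items(conn, txn_type, last_txn_time):
    """Items changed since the last txn that this kind of txn is interested in."""
    query = """
        SELECT i.guid, i.item_type
        FROM deploy_items AS i
        JOIN deploy_txn_interest AS t ON t.item_type = i.item_type
        WHERE t.txn_type = ? AND i.changed > ?
        ORDER BY i.guid
    """
    return conn.execute(query, (txn_type, last_txn_time)).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.executemany("INSERT INTO deploy_items VALUES (?, ?, ?)", [
    ("guid-node-1", "node", 150),
    ("guid-node-2", "node", 50),
    ("guid-view-1", "view", 200),
])
conn.executemany("INSERT INTO deploy_txn_interest VALUES (?, ?)", [
    ("staging-live", "node"),
    ("dev-qa", "node"),
    ("dev-qa", "view"),
])

for guid, item_type in changed_items(conn, "dev-qa", 100):
    print(guid, item_type)
```

Each returned row would then be handed to whichever modules have something to 
say about that item type.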

As I said, I can see ways that a git-driven system can probably provide 
similar speed when it comes to drilling down to what items need to be 
considered in a given txn; also, a git-driven system has the added benefit of 
being able to, even when your local system is offline, still provide the entire 
version history on demand for _each server_ you've ever connected with on that 
project. Well, assuming your remote git branches are up to date.
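
For a sense of how a git-style backend could narrow down the items to 
consider, here is a small Python sketch. git_blob_id() reproduces the way git 
names a blob (SHA-1 over a "blob <size>\0" header plus the content); 
changed_guids() and the GUID-keyed dicts of exported text are hypothetical 
helpers for this example, not part of git or Drupal.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object ID git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def changed_guids(previous, current):
    """GUIDs that are new, or whose exported text hashes differently."""
    return sorted(
        guid
        for guid, text in current.items()
        if guid not in previous
        or git_blob_id(previous[guid].encode()) != git_blob_id(text.encode())
    )


last_deploy = {"guid-a": "title: About", "guid-b": "title: Contact"}
working_copy = {
    "guid-a": "title: About",
    "guid-b": "title: Contact us",
    "guid-c": "title: News",
}

print(changed_guids(last_deploy, working_copy))
```

Because identical content always hashes to the same ID, unchanged items drop 
out of consideration without any field-by-field comparison.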

As far as I can tell, this is really the kind of thing you're talking about 
when you say:
> ...the database is kind of a two-dimensional projection of three dimensions
> that is, there may be many hidden relationships in the process side of
> things that have to be taken into account for deployment...

I can think of two very different ways of interpreting that metaphor, both of 
which are applicable to the topic at hand. I'm hoping, though, that this 
explanation finally does make clear that I'm _not_ thinking along the lines of 
'how do we make an sqldump better?', but instead about methods for making 
deployment a process that's as smart about drupal data as drupal itself is.

s

>
> On Mon, Aug 11, 2008 at 3:51 PM, Sam Boyer <drupal at samboyer.org> wrote:
> > On Monday 11 August 2008 05:33:49 Victor Kane wrote:
> >> The serialization and unserialization of data is included in my
> >> approach to the problem for the purposes of the independent
> >> transmission of nodes from one system to another, as in the case of
> >> one Drupal site availing itself of a node.save service on another
> >> Drupal site.
> >>
> >> It also has the purpose of guaranteeing insofar as is possible a text
> >> version of all entities, configurations, including exported views,
> >> content types, panels, hopefully menus and module configurations and
> >> exported variables, for the purposes of continued version control and
> >> hence deployment also (serialization to text, unserialization to
> >> deployment objective).
> >>
> >> Here of course, serialization and unserialization is not meant in the
> >> php function sense, and could include marshaling and unmarshaling to
> >> and from XML, and is a cross-language concept.
> >>
> >> Victor Kane
> >> http://awebfactory.com.ar
> >
> > So my initial reaction was that this was actually a disadvantage - it
> > seemed to introduce an extra layer of unnecessary complexity, as it
> > requires pulling the data out of the db, coming up with a new storage
> > format, then transferring that format and reintegrating it into another
> > db. The backup and project-level revisioning control implications are
> > interesting - but that's a wholly different axis from the crux of the
> > deployment paradigm, where there's _one_ version.
> >
> > However, on further reflection, I can see there being some strong
> > arguments in either direction. Your approach, Victor, makes me drift back
> > to the recent vcs thread on this list, as I can't imagine such a system
> > being feasible and really scalable without the use of something like
> > (gasp!) git. Two basic reasons for that:
> >
> > 1. Data integrity assurance: there's nothing like a SHA1 hash to ensure
> > that nothing gets corrupted in all the combining/recombining of data
> > through and around various servers. And then there's the whole
> > content-addressable filesystem bit - I'd conjecture that it would make
> > git exceptionally proficient at handling either database dumps or tons of
> > data structured from 'export' functionality, whichever the case may be. I
> > imagine Kathleen might be able to speak more to that.
> >
> > 2. Security and logic (and speed): if run on a git backend, I'd speculate
> > that we could use project-specific ssh keys to encrypt all the
> > synchronizations (although that obviously brings up a host of other
> > potential requirements/issues). On the logic end, we could build on top of git's
> > system for organizing sets of commits to allow for different 'types' of
> > syncs (i.e., you're working with different datasets when doing dev <=> qa
> > vs. when you're working with live <=> staging). As for speed...well, I'll
> > just call that a hunch =)
> >
> > This approach would require a _lot_ of coding, though. The more
> > immediate-term solution that still makes the most sense to me is one
> > where we let modules work straight from the data as it exists in the db,
> > and define some logic for handling synchronizations that the deploy API
> > can manage. But if all the systems are to be implemented...well, then it
> > probably means a pluggable Deploy API that allows for different
> > subsystems to handle different segments of the overall process.
> >
> > Sam