Database / SQL future thoughts
Folks who care about the direction and performance of the Drupal database (and developers just generally interested in database backends for applications) should read this blog posting by Bob Cringley -- and more importantly, the comments written by some folks who know way more about databases than Mr. Cringely: http://www.cringely.com/2009/05/the-sequel-dilemma/ David Strauss and Larry Garfield, as our two current DB experts (IMHO, of course) -- I'd especially like you to digest the more pertinent comments and think about how we can simplify our database access, and hence optimize our performance. I'm about to go off an research Hadoop a bit, but my gut feeling is the comments which say SSD and proper, simple use of SQL and good data paths will be the right answer for the next decade or so are probably on the right mark. ..chris
One comment here.. the field API storage is pluggable.
Karoly Negyesi a ecrit le 04/05/2009 21:22:
One comment here.. the field API storage is pluggable.
... but some work is still needed for this feature to be really usable : http://drupal.org/node/443422 We need people to jump in, here :-) Yves
On Monday 04 May 2009 3:22:51 pm Yves Chedemois wrote:
Karoly Negyesi a ecrit le 04/05/2009 21:22:
One comment here.. the field API storage is pluggable.
... but some work is still needed for this feature to be really usable : http://drupal.org/node/443422
We need people to jump in, here :-)
Yves
I'd go a step farther and say that the field system backend is not pluggable in any useful sense until/unless it becomes at *least* per-object-type. Per-field is the ideal, but object type is the absolute minimum to call it "pluggable" with a straight face. -- Larry Garfield larry@garfieldtech.com
On Tue, May 5, 2009 at 3:11 AM, Larry Garfield <larry@garfieldtech.com> wrote:
On Monday 04 May 2009 3:22:51 pm Yves Chedemois wrote:
Karoly Negyesi a ecrit le 04/05/2009 21:22:
One comment here.. the field API storage is pluggable.
... but some work is still needed for this feature to be really usable : http://drupal.org/node/443422
We need people to jump in, here :-)
Yves
I'd go a step farther and say that the field system backend is not pluggable in any useful sense until/unless it becomes at *least* per-object-type. Per-field is the ideal, but object type is the absolute minimum to call it "pluggable" with a straight face.
-- Larry Garfield larry@garfieldtech.com
I think we're comparing pears to apples here. This is not a question of SQL vs. not-SQL, but more a question of database architecture – relational vs. document oriented. I think that relational databases will still have their uses with data that are inherently relational and carefully structured. That is not the way the winds are currently blowing with CCK and fields in core where the database schema is mutable and constantly evolving, so here the document databases could of help, but so could materialized views (not related to Views the module) or sharding. It all depends on usage and the amount of time you're willing to spend optimising and caching. I don't think that swapping the persistence layer out is going to make anything inherently more scalable, and it is not the first thing I'd do. Look at Facebook – to my knowledge, they're still using MySQL combined with memcache and oodles of RAM. Being able to scale to their size and no further would be good enough for my usage ;) -- Regards, Mikkel Høgh
On Tue, May 5, 2009 at 11:10 AM, Mikkel Høgh <m@ooh.dk> wrote:
On Tue, May 5, 2009 at 3:11 AM, Larry Garfield <larry@garfieldtech.com> wrote:
On Monday 04 May 2009 3:22:51 pm Yves Chedemois wrote:
Karoly Negyesi a ecrit le 04/05/2009 21:22:
One comment here.. the field API storage is pluggable.
... but some work is still needed for this feature to be really usable : http://drupal.org/node/443422
We need people to jump in, here :-)
Yves
I'd go a step farther and say that the field system backend is not pluggable in any useful sense until/unless it becomes at *least* per-object-type. Per-field is the ideal, but object type is the absolute minimum to call it "pluggable" with a straight face.
-- Larry Garfield larry@garfieldtech.com
I think we're comparing pears to apples here. This is not a question of SQL vs. not-SQL, but more a question of database architecture – relational vs. document oriented.
I think that relational databases will still have their uses with data that are inherently relational and carefully structured.
That is not the way the winds are currently blowing with CCK and fields in core where the database schema is mutable and constantly evolving, so here the document databases could of help, but so could materialized views (not related to Views the module) or sharding.
It all depends on usage and the amount of time you're willing to spend optimising and caching. I don't think that swapping the persistence layer out is going to make anything inherently more scalable, and it is not the first thing I'd do.
Look at Facebook – to my knowledge, they're still using MySQL combined with memcache and oodles of RAM. Being able to scale to their size and no further would be good enough for my usage ;)
Facebook has access to more hardware than the average Drupal developer and is also using Cassandra http://incubator.apache.org/cassandra/ Drupal doesn't really need a relational DB and actually doesn't use the relational features properly (for example, the way tags are stored is not efficient). That's one of the reasons why it is slow and doesn't scale very well. But going for another storage system would be better if implemented as a fork, IMO. -- Bertrand Mansion Mamasam
This is some interesting feedback. I'd love to hear more of your thoughts so I can understand it better. On Tue, May 5, 2009 at 7:51 AM, Bertrand Mansion <drupal@mamasam.net> wrote:
Drupal doesn't really need a relational DB and actually doesn't use the relational features properly (for example, the way tags are stored is not efficient). That's one of the reasons why it is slow and doesn't scale very well.
These are some pretty sweeping claims. Can you expand on them? 1. Tags are stored inefficiently (I can't think of a way to store them that is better for every use) 2. Drupal doesn't scale very well (I've seen enough claims otherwise - is there a particular problem you can point to?)
But going for another storage system would be better if implemented as a fork, IMO.
Various people just rewrote the entire DB API, so it is possible to make massive API changes within a release cycle. Why do you feel a fork is necessary? Thanks, Greg -- Greg Knaddison http://knaddison.com | 303-800-5623 | http://growingventuresolutions.com
Hi Greg, On Tue, May 5, 2009 at 4:18 PM, Greg Knaddison <Greg@growingventuresolutions.com> wrote:
This is some interesting feedback. I'd love to hear more of your thoughts so I can understand it better.
On Tue, May 5, 2009 at 7:51 AM, Bertrand Mansion <drupal@mamasam.net> wrote:
Drupal doesn't really need a relational DB and actually doesn't use the relational features properly (for example, the way tags are stored is not efficient). That's one of the reasons why it is slow and doesn't scale very well.
These are some pretty sweeping claims. Can you expand on them?
These are not claims, but only my opinion, based on my experiences with Drupal.
1. Tags are stored inefficiently (I can't think of a way to store them that is better for every use)
What do you mean by "every use" ? If you are interested in reading about tags and SQL, here are some pointers: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html http://laughingmeme.org/2005/04/07/in-lieu-of-the-promised-article-on-tags-a... The way Drupal stores tags hierarchy is also inefficient.
2. Drupal doesn't scale very well (I've seen enough claims otherwise - is there a particular problem you can point to?)
Lots of problems for me, but that's off topic. Sessions, cache, sql queries, table structures, file storage, architecture and callback hell, etc. It's currently the best CMS for PHP I know, but still, it is sluggish (yes, this is subjective).
But going for another storage system would be better if implemented as a fork, IMO.
Various people just rewrote the entire DB API, so it is possible to make massive API changes within a release cycle. Why do you feel a fork is necessary?
Moving to something like CouchDB as was suggested in the thread or some other datastores (someone mentioned Hadoop but I think he meant BigTable, Dynamo or Tokyo Cabinet...) would need more than just rewriting the DB API in my opinion. That's why I can only imagine this as a fork. That would be my reply to the first post. -- Bertrand Mansion Mamasam
On May 5, 2009, at 10:07 AM, Bertrand Mansion wrote:
1. Tags are stored inefficiently (I can't think of a way to store them that is better for every use)
What do you mean by "every use" ?
If you are interested in reading about tags and SQL, here are some pointers: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html http://laughingmeme.org/2005/04/07/in-lieu-of-the-promised-article-on-tags-a...
When Greg says, "Every use" he means that Drupal allows tags and taxonomy terms in general to be used for a lot of metadata related purposes. It's possible to optimize the storage and retrieval mechanism for specific use cases (like user-specific flickr-style tags, or hierarchical categorization), but the optimizations for those use cases are at odds with each other. What works for one will make the others punishingly inefficient. Thus, Drupal currently uses a 'best-compromise' schema that allows it to capture as much information as possible (hierarchy, weight, association, etc.) and relies on caching at a later point to smooth out hot spots. There may well be further improvements that can be eked out, and there may be opportunities for optimization that have been missed -- and there may even be a case to be made for splitting taxonomy into real "tags" and "hierarchical category" so that the system can be better optimized. But I'm not sure that you're really clear on how Drupal actually works under the hood; the article you pointed to explicitly described Drupal's tag storage schema and outlined its advantages. If you go back and read the article, it's the "Toxi Solution." --Jeff Eaton
On Tue, May 5, 2009 at 5:40 PM, Jeff Eaton <jeff@viapositiva.net> wrote:
On May 5, 2009, at 10:07 AM, Bertrand Mansion wrote:
1. Tags are stored inefficiently (I can't think of a way to store them
that is better for every use)
What do you mean by "every use" ?
If you are interested in reading about tags and SQL, here are some pointers: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html http://laughingmeme.org/2005/04/07/in-lieu-of-the-promised-article-on-tags-a...
When Greg says, "Every use" he means that Drupal allows tags and taxonomy terms in general to be used for a lot of metadata related purposes. It's possible to optimize the storage and retrieval mechanism for specific use cases (like user-specific flickr-style tags, or hierarchical categorization), but the optimizations for those use cases are at odds with each other. What works for one will make the others punishingly inefficient. Thus, Drupal currently uses a 'best-compromise' schema that allows it to capture as much information as possible (hierarchy, weight, association, etc.) and relies on caching at a later point to smooth out hot spots. There may well be further improvements that can be eked out, and there may be opportunities for optimization that have been missed -- and there may even be a case to be made for splitting taxonomy into real "tags" and "hierarchical category" so that the system can be better optimized. But I'm not sure that you're really clear on how Drupal actually works under the hood; the article you pointed to explicitly described Drupal's tag storage schema and outlined its advantages. If you go back and read the article, it's the "Toxi Solution."
Well, I think I know everything there is to know about Drupal. I have been developing modules for it for 4 years now and deployed a dozen of websites for customers, some quite large... I think it is a good opportunity now, with the emergence of these new databases, to think about what we have been doing for years, and how. Instead of being arrogant and underestimating others, you should start by asking yourself if there really isn't any other way to better manage tags (and cache, and sessions, and hierarchies, and callbacks, and file storage, etc). -- Bertrand Mansion Mamasam
Instead of being arrogant and underestimating others, you should start by asking yourself if there really isn't any other way to better manage tags (and cache, and sessions, and hierarchies, and callbacks, and file storage, etc).
cache and sessions are in memcached for any big site. The module is well maintained there is ongoing work about hierarchies as catch told you. dunno whats your problem with callbacks. file storage, there is ongoing work to support stream wrappers and imo then you can write your own. Where do you want to help today?
Bertrand Mansion wrote:
Instead of being arrogant and underestimating others, you should start by asking yourself if there really isn't any other way to better manage tags
People are already asking these questions - I linked to the patches a couple of hours ago. Similarly people are working on the other things you mentioned - but you haven't referenced any of that work in your e-mails to this list, so it's impossible to tell if you're familiar with it or not:
cache, and sessions http://drupal.org/project/memcache (cache and sessions) http://drupal.org/project/cacherouter http://groups.drupal.org/handlers
, and hierarchies menu system rewrite for Drupal 6 and the tree parsing patch for D7. Also http://drupal.org/project/leftandright and http://drupal.org/project/lineage in contrib
and file storage, etc).
http://drupal.org/project/cdn So if you know everything there is to know about all of these efforts (and others which I didn't mention) - presumably you could explain how you evaluated and rejected them already? Nat
Bertrand Mansion wrote:
I think I know everything there is to know about Drupal. Wow, in another two years I can say this? I don't think so. I'd be willing to bet there is at least one thing even Dries doesn't know about Drupal.
Instead of being arrogant... Hmm...
Nancy E. Wichmann, PMP Injustice anywhere is a threat to justice everywhere. - Martin L. King, Jr. No virus found in this outgoing message. Checked by AVG - http://www.avg.com Version: 8.0.176 / Virus Database: 270.12.18/2096 - Release Date: 5/4/2009 5:51 PM
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Bertrand Mansion schrieb:
Well, I think I know everything there is to know about Drupal. I have been developing modules for it for 4 years now and deployed a
Even after using Drupal for twice that time I'd never make such a claim.
dozen of websites for customers, some quite large... I think it is a good opportunity now, with the emergence of these new databases, to think about what we have been doing for years, and how.
Instead of being arrogant and underestimating others,
Kettle, black?
you should start by asking yourself if there really isn't any other way to better manage tags (and cache, and sessions, and hierarchies, and callbacks, and file storage, etc).
I think you should crawl back under your rock. Cheers, Gerhard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAkoAkxcACgkQfg6TFvELooT3bACdEdgTlpGsptgxlj03KMWXIyUX ba0Anjn+mquHkxTXnZI+P6/w2EMsuUK2 =Ukcq -----END PGP SIGNATURE-----
On May 5, 2009, at 1:54 PM, Bertrand Mansion wrote:
Well, I think I know everything there is to know about Drupal. I have been developing modules for it for 4 years now and deployed a dozen of websites for customers, some quite large... I think it is a good opportunity now, with the emergence of these new databases, to think about what we have been doing for years, and how.
Instead of being arrogant and underestimating others, you should start by asking yourself if there really isn't any other way to better manage tags (and cache, and sessions, and hierarchies, and callbacks, and file storage, etc).
You're quite right, Bertrand, and I apologize for the snarkiness of my comment. Since that article you pointed to described Drupal's tagging system *without* suggesting it was a fundamentally flawed approach, I assumed you were not familiar with the Taxonomy system's internals. This is not a matter of arrogance but of misinterpreting your statement. As I'm sure you know from being on the devel list, there is an unending stream of "Drupal Should Do X Like Y, And Here's A Blog Post To Prove It" comments that are not necessarily rooted in familiarity with the way the system already works. Greg's statement, though, stands: Taxonomy as it presently stands is a generalized metadata system, and the optimizations discussed in the first two parts of the article you linked to are not possible without building an entirely different set of specialized systems. The third model, explained in the article that you linked to, is what Drupal uses currently. A number of other developers have suggested that other approaches might be good -- rather than tying ourselves to a relational model, we should consider treating nodes as cached objects, for example. Doing so would probably yield some great improvements for the specific use cases we optimize the storage mechanism for. I could be wrong, but at present the use of a traditional SQL backend is still our best bet for a generalized system that allows users to design their schemas and their views in an ad-hoc fashion without writing code. Is there any way that something other than SQL could leverage multiple loosely connected systems like CCK, Taxonomy, and Views without crippling performance in other areas? That's not a rhetorical question; I'm curious and would like to know if I'm overlooking some fundamental issues. --Jeff Eaton
On Tue, May 5, 2009 at 9:30 PM, Jeff Eaton <jeff@viapositiva.net> wrote:
On May 5, 2009, at 1:54 PM, Bertrand Mansion wrote:
Well, I think I know everything there is to know about Drupal. I have been developing modules for it for 4 years now and deployed a dozen of websites for customers, some quite large... I think it is a good opportunity now, with the emergence of these new databases, to think about what we have been doing for years, and how.
Instead of being arrogant and underestimating others, you should start by asking yourself if there really isn't any other way to better manage tags (and cache, and sessions, and hierarchies, and callbacks, and file storage, etc).
You're quite right, Bertrand, and I apologize for the snarkiness of my comment. Since that article you pointed to described Drupal's tagging system *without* suggesting it was a fundamentally flawed approach, I assumed you were not familiar with the Taxonomy system's internals. This is not a matter of arrogance but of misinterpreting your statement. As I'm sure you know from being on the devel list, there is an unending stream of "Drupal Should Do X Like Y, And Here's A Blog Post To Prove It" comments that are not necessarily rooted in familiarity with the way the system already works.
Greg's statement, though, stands: Taxonomy as it presently stands is a generalized metadata system, and the optimizations discussed in the first two parts of the article you linked to are not possible without building an entirely different set of specialized systems. The third model, explained in the article that you linked to, is what Drupal uses currently.
A number of other developers have suggested that other approaches might be good -- rather than tying ourselves to a relational model, we should consider treating nodes as cached objects, for example. Doing so would probably yield some great improvements for the specific use cases we optimize the storage mechanism for. I could be wrong, but at present the use of a traditional SQL backend is still our best bet for a generalized system that allows users to design their schemas and their views in an ad-hoc fashion without writing code.
Is there any way that something other than SQL could leverage multiple loosely connected systems like CCK, Taxonomy, and Views without crippling performance in other areas? That's not a rhetorical question; I'm curious and would like to know if I'm overlooking some fundamental issues.
This is exactly what I am investigating for another project where I use Tokyo Tyrant with PHP. I don't have figures yet nor concrete solutions, but I find it very interesting and challenging to try to think differently. In an application I wrote, I chose to manage my tags differently (only one table with lots of redundancy but fast), and it worked well. In another application, I also tried another way to deal with child/parent relations (not the celko's way but using LIKE, depths and paths) and it also worked well, was easier to manage and faster. With these new databases, at first, I found it very difficult to forget everything about relational DBs and ORM like solutions where a table looks like an object. I am almost sure that Drupal, like any other CMS, could take advantage from systems like CouchDB or Tokyo Cabinet (or others). Take for example the 'node' and 'node_revision' tables, in such DBs they wouldn't need to be separate entities. CouchDB has versioning. Tokyo Cabinet can compress your data on the fly so you can store many versions of your node without having to worry about relations. For CCK, it wouldn't even be needed because Tokyo Cabinet tables, like CouchDB's, can have arbitrary number of fields. It is your application which decides which fields are required, not your database. I find it quite exciting, you should see for yourself. PS: I'll release a PHP class that talks with Tokyo Tyrant soon, probably in PEAR. -- Bertrand Mansion Mamasam
On May 5, 2009, at 3:12 PM, Bertrand Mansion wrote:
For CCK, it wouldn't even be needed because Tokyo Cabinet tables, like CouchDB's, can have arbitrary number of fields. It is your application which decides which fields are required, not your database.
Right. That's the easy part, though -- Drupal is a generalized web CMS and the problem is not "storing buckets of data" but "storing them in a way that is flexible enough for ad-hoc queries to be done efficiently using tools like Views." I think it's fair to say that CCK is useful, but without Views and its level of "smart" flexibility, Drupal would not be the success that it is today. Am I mistaken in thinking that the loss of efficient ad-hoc queries is one of the tradeoffs inherent in moving to the 'bucket of data' model? --Jeff
On Tue, May 5, 2009 at 10:18 PM, Jeff Eaton <jeff@viapositiva.net> wrote:
On May 5, 2009, at 3:12 PM, Bertrand Mansion wrote:
For CCK, it wouldn't even be needed because Tokyo Cabinet tables, like CouchDB's, can have arbitrary number of fields. It is your application which decides which fields are required, not your database.
Right. That's the easy part, though -- Drupal is a generalized web CMS and the problem is not "storing buckets of data" but "storing them in a way that is flexible enough for ad-hoc queries to be done efficiently using tools like Views." I think it's fair to say that CCK is useful, but without Views and its level of "smart" flexibility, Drupal would not be the success that it is today.
Am I mistaken in thinking that the loss of efficient ad-hoc queries is one of the tradeoffs inherent in moving to the 'bucket of data' model?
Actually, "views" are inside the database: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views For example, the Views module would create views in the database directly, instead of storing SQL queries in a table and executing them later. That's more logical. -- Bertrand Mansion Mamasam
Bertrand Mansion wrote:
Actually, "views" are inside the database: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views For example, the Views module would create views in the database directly, instead of storing SQL queries in a table and executing them later. That's more logical.
That's only more logical if you don't actually know what Views does. Views SQL is not generated ahead of time and then just run on demand, it is created as needed based upon the data available. Quite an amount of that data can change at runtime.
On Tue, May 5, 2009 at 10:55 PM, Earl Miles <merlin@logrus.com> wrote:
Bertrand Mansion wrote:
Actually, "views" are inside the database: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views For example, the Views module would create views in the database directly, instead of storing SQL queries in a table and executing them later. That's more logical.
That's only more logical if you don't actually know what Views does. Views SQL is not generated ahead of time and then just run on demand, it is created as needed based upon the data available. Quite an amount of that data can change at runtime.
Yes, it is more logical to have views inside the database, because it is a database feature, than to store chunks of SQL and serialized PHP in a database table... I guess you haven't thought about what you posted before you posted it. Obviously, it is an habit for people to be insulting here. -- Bertrand Mansion Mamasam
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Bertrand Mansion schrieb:
On Tue, May 5, 2009 at 10:55 PM, Earl Miles <merlin@logrus.com> wrote:
Bertrand Mansion wrote:
Actually, "views" are inside the database: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views For example, the Views module would create views in the database directly, instead of storing SQL queries in a table and executing them later. That's more logical. That's only more logical if you don't actually know what Views does. Views SQL is not generated ahead of time and then just run on demand, it is created as needed based upon the data available. Quite an amount of that data can change at runtime.
Yes, it is more logical to have views inside the database, because it is a database feature, than to store chunks of SQL and serialized PHP in a database table... I guess you haven't thought about what you posted before you posted it.
Obviously, it is an habit for people to be insulting here.
I so wish I was the list admin, I'd kick you. Since I am not, I'll just filter you. Good riddance, Gerhard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAkoAtT4ACgkQfg6TFvELooT0sgCcC0cLjRAJ1cFv8vLnRmz+7xFC 9EkAn1hz0wHOJWX4zp5ftq47rZ0a85hi =b29V -----END PGP SIGNATURE-----
Bertrand Mansion wrote:
On Tue, May 5, 2009 at 10:55 PM, Earl Miles <merlin@logrus.com> wrote:
Bertrand Mansion wrote:
Actually, "views" are inside the database: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views For example, the Views module would create views in the database directly, instead of storing SQL queries in a table and executing them later. That's more logical. That's only more logical if you don't actually know what Views does. Views SQL is not generated ahead of time and then just run on demand, it is created as needed based upon the data available. Quite an amount of that data can change at runtime.
Yes, it is more logical to have views inside the database, because it is a database feature, than to store chunks of SQL and serialized PHP in a database table... I guess you haven't thought about what you posted before you posted it.
Let's see. As the author of Views, I would say I am the supreme authority on what it does. Your response indicates you believe I am wrong, and that by stating you are NOT the supreme authority on what the Views module does, and in fact appear to now understand what Views really does, I am being insulting. Please let me quote: You:
...instead of storing SQL queries...
Me:
...Views SQL is not generated ahead of time and then just run on demand, it is created as needed...
You:
...than to store chunks of SQL...in a database table... ...I guess you haven't thought about what you posted before you posted it...
Perhaps you misunderstand what 'not generated ahead of time' means, or 'created as needed'. It certainly does NOT mean 'stores chunks of SQL' in the database. Please tell me how I can construe this as ANYTHING other than "You don't actually know what Views does." If I am being insulting with this, it is only because you are presuming. When your presumptions are corrected, you get offended. Please, 1) stop presuming, 2) stop being offended. And above all, this did not belong in my personal mailbox. For that, I do feel free to be offensive: If you feel I have offended you on a public list, suck it.
On May 5, 2009, at 4:49 PM, Bertrand Mansion wrote:
On Tue, May 5, 2009 at 10:55 PM, Earl Miles <merlin@logrus.com> wrote:
Bertrand Mansion wrote:
Actually, "views" are inside the database: http://wiki.apache.org/couchdb/Introduction_to_CouchDB_views For example, the Views module would create views in the database directly, instead of storing SQL queries in a table and executing them later. That's more logical.
That's only more logical if you don't actually know what Views does. Views SQL is not generated ahead of time and then just run on demand, it is created as needed based upon the data available. Quite an amount of that data can change at runtime.
Yes, it is more logical to have views inside the database, because it is a database feature, than to store chunks of SQL and serialized PHP in a database table... I guess you haven't thought about what you posted before you posted it.
I'm not sure you understood what Earl was saying. Views does not store serialized PHP, nor does it store serialized SQL. Views stores descriptive metadata based on high-level Drupal entities (nodes, comments, users, etc), and at runtime builds an appropriate query. Oftentimes they are constructed based on contextual information available only at runtime (the current date, the ID of the currently logged in user, the page that the user is currently on). A stored view in a database is useful, but not at all the same thing. Some plugins for Views take advantage of the ability to modify the query at runtime based on those conditions, so even stored procedures would be an uncomfortable fit. Perhaps Views could *build* stored procs and save them to the database? This is what I meant about understanding how Drupal -- and some of its most-used subsystems -- work. I mean no offense, and it's not about dismissing your suggestions about more efficient storage mechanisms. But details matter in these areas. Understanding the current schemas, and understanding how they are used by existing tools like Views, matters. Let's have these discussions! But also let's make sure that we know what we're talking about, and we're accurate. --Jeff
Jeff Eaton wrote:
I'm not sure you understood what Earl was saying. Views does not store serialized PHP, nor does it store serialized SQL. What exactly is serialized PHP? Does he mean serialized objects/variables because views DOES store those (in views_object_cache) but I suppose it is irrelevant anyway. The real meat of the topic you mention below.
Views stores descriptive metadata based on high-level Drupal entities (nodes, comments, users, etc), and at runtime builds an appropriate query. Oftentimes they are constructed based on contextual information available only at runtime (the current date, the ID of the currently logged in user, the page that the user is currently on). A stored view in a database is useful, but not at all the same thing. Some plugins for Views take advantage of the ability to modify the query at runtime based on those conditions, so even stored procedures would be an uncomfortable fit. Perhaps Views could *build* stored procs and save them to the database? There would have to be some significant performance improvements to justify pushing these views into the database in this developers opinion. To do so effectively views would have to actually monitor database schema changes (either through a DBTNG callback or another observable method) and change the stored database VIEWS so that they remain valid queries. This is opposed to the way it conveniently and intelligently "ignores" non conforming views object data presently. It would also prohibit the dynamic adjustment of views at runtime to a certain extent. I DO like how sucha solution would push caching upstream where it can be argued it belongs but I don't know if its worth it with the mysql query cache, etc. Let's have these discussions! But also let's make sure that we know what we're talking about, and we're accurate. Thank you Jeff for seeing the issue past the person who poorly raised it. There is possible value in part of his argument but, is it enough to justify the opportunity cost of working on it and the sacrificed functionality we'd be asking of the views module? Views generation isn't much of a lag on my sites because they don't have millions of nodes, etc. So for me the answer is no, Earls solution to date fits my use case very well, but perhaps I'm an outlier.
There would have to be some significant performance improvements to justify pushing these views into the database in this developers opinion.
If my understanding is correct then that's what David Strauss' materialized views do.
Karoly Negyesi wrote:
There would have to be some significant performance improvements to justify pushing these views into the database in this developers opinion.
If my understanding is correct then that's what David Strauss' materialized views do.
Eh, not really. They're a sort of manual optimization and re-indexing using data duplication. Ideally the database should be able to do that internally on its own. Most don't. MVs are not moving the hard work of Views.module to the database. They're moving the hard work of the database to cached user code. If MySQL had built-in materialized views, then we could push the hard work on to the database and it would do that sort of pre-generation itself. Sadly it doesn't, so its "views" are really just syntax nicety. --Larry Garfield
On Wed, May 6, 2009 at 8:17 PM, larry@garfieldtech.com <larry@garfieldtech.com> wrote:
If MySQL had built-in materialized views, then we could push the hard work on to the database and it would do that sort of pre-generation itself. Sadly it doesn't, so its "views" are really just syntax nicety.
Well, I don't know about MySQL, but in PostgreSQL there is actually a slight performance gain from using database views, in that it shaves off almost all the processing and query plan generation, so most of the steps besides executing the actual query are skipped when using a view. That is a double-edge sword, however, since the query plan generated for one set of parameters might not be all that suitable given another set of parameters, so it'll have to be thought through carefully. When you have a view, generating a materialised view can be achieved with a bit of clever PL/pgSQL code. Sadly, I don't currently have any Drupal sites under my care that would warrant spending the time to do such things, but I find that there can be a lot of improvements by just going "closer to the metal". Taking the query from Views.module, optimising it, and moving it into a standard hook_menu page callback will also shave a bit of load off both the frontend and the database server on heavily visited pages :) -- Kind regards, Mikkel Høgh
Quoting Mikkel Høgh <m@ooh.dk>:
On Wed, May 6, 2009 at 8:17 PM, larry@garfieldtech.com <larry@garfieldtech.com> wrote:
If MySQL had built-in materialized views, then we could push the hard work on to the database and it would do that sort of pre-generation itself. Sadly it doesn't, so its "views" are really just syntax nicety.
Well, I don't know about MySQL, but in PostgreSQL there is actually a slight performance gain from using database views, in that it shaves off almost all the processing and query plan generation, so most of the steps besides executing the actual query are skipped when using a view.
You may also gain efficiency in the code which would reduce the processing time of PHP if a database view defines the common joins of the tables. -- Earnie -- http://r-feed.com/ -- http://for-my-kids.com/ -- http://www.4offer.biz/ -- http://give-me-an-offer.com/
At first, the descriptions of CouchDB and Tokyo Cabinets sounded suspiciously like multivalued databases, such as Reality and UniVerse (both circa 1970). CouchDB is a RESTful web service package, which has a storage engine that uses MVCC (Postgres, Oracle, InnoDB), and b-tree indexes (used by most RDBMSs), with sequence ids (supported by most RDBMSs). So, really comparing CouchDB to MySQL or Postgres is to compare apples and baked tarts made with apples. The Tokyo Cabinet SourceForge project says: "Tokyo Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. There is neither concept of data tables nor data types. Records are organized in hash table, B+ tree, or fixed-length array. As for database of hash table, each key must be unique within a database, so it is impossible to store two or more records with a key overlaps." A blog writer says: "Tokyo Cabinet is a simple, small, fast key/value store. Similar to DBM, it’s a very basic database. If you combine it with Tokyo Tyrant, it becomes a very capable, scalable network database (like mysql or couchdb)." So, is CouchDB's storage engine any more performant in a flexible CMS situation like Drupal than SQL storage engines? That remains to be seen. There's no obvious reason why it should be, since it does pretty much everything an RDBMS does (indexing, ACID compliance, etc.). CouchDB does provide some extras which could be useful, such as the built-in views already mentioned. And its document orientation might be great for large files (images, documents, etc.). Would the RESTful API be an advantage over SQL? And Tokyo Cabinets -- hmmm. Serialized data versus structured data and database wide unique keys. DBM primitives? Again, it remains to be seen whether it's faster than an SQL engine when combined with something like Tokyo Tyrant to provide the needed concurrency capabilities. ..chris
On Wed, May 6, 2009 at 12:02 AM, Chris Johnson <cxjohnson@gmail.com> wrote:
CouchDB is a RESTful web service package, which has a storage engine that uses MVCC (Postgres, Oracle, InnoDB), and b-tree indexes (used by most RDBMSs), with sequence ids (supported by most RDBMSs). So, really comparing CouchDB to MySQL or Postgres is to compare apples and baked tarts made with apples.
I'd say that's an awesome analogy. If you're not looking for baked goods, but perhaps apple sauce, raw apples might be preferable… On the other hand, if you intent to consume it with whipped cream anyhow, why not just grab the finished tart while you're at it…
And Tokyo Cabinets -- hmmm. Serialized data versus structured data and database wide unique keys. DBM primitives? Again, it remains to be seen whether it's faster than an SQL engine when combined with something like Tokyo Tyrant to provide the needed concurrency capabilities.
Yeah, unless someone is going to actually try to come up with some real numbers for what it would mean for our use case, this is debate probably going to be a precious waste of bytes ;)
I'm definitely a database noob so I'm looking for correction -- but isn't the main advantage of CouchDB / Tokyo Cabinet etc. that it's much simpler to spread them horizontally across multiple machines in a similar fashion to memcache? From what I understand, the main problem with MySQL / other traditional databases is that the complexity of your architecture increases dramatically as you go scale beyond one server as you have to deal with all sorts of problems with data replication whereas with CouchDB, it's much more straightforward to spread the data around. Is that correct? Kyle Research Assistant Entrepreneurship and Technology Center @ BYU kyle.mathews2000.com/blog On Tue, May 5, 2009 at 4:29 PM, Mikkel Høgh <m@ooh.dk> wrote:
On Wed, May 6, 2009 at 12:02 AM, Chris Johnson <cxjohnson@gmail.com> wrote:
CouchDB is a RESTful web service package, which has a storage engine that uses MVCC (Postgres, Oracle, InnoDB), and b-tree indexes (used by most RDBMSs), with sequence ids (supported by most RDBMSs). So, really comparing CouchDB to MySQL or Postgres is to compare apples and baked tarts made with apples.
I'd say that's an awesome analogy. If you're not looking for baked goods, but perhaps apple sauce, raw apples might be preferable…
On the other hand, if you intent to consume it with whipped cream anyhow, why not just grab the finished tart while you're at it…
And Tokyo Cabinets -- hmmm. Serialized data versus structured data and database wide unique keys. DBM primitives? Again, it remains to be seen whether it's faster than an SQL engine when combined with something like Tokyo Tyrant to provide the needed concurrency capabilities.
Yeah, unless someone is going to actually try to come up with some real numbers for what it would mean for our use case, this is debate probably going to be a precious waste of bytes ;)
On Tue, 2009-05-05 at 16:57 -0600, Kyle Mathews wrote:
I'm definitely a database noob so I'm looking for correction -- but isn't the main advantage of CouchDB / Tokyo Cabinet etc. that it's much simpler to spread them horizontally across multiple machines in a similar fashion to memcache? no. their difference from the sql model is their advantage, for some use cases. there is no panacea, if somebody tries to sell you that, remember the snake oils of old.
From what I understand, the main problem with MySQL / other traditional databases is that the complexity of your architecture increases dramatically as you go scale beyond one server as you have to deal with all sorts of problems with data replication whereas with CouchDB, it's much more straightforward to spread the data around. Is that correct? All database systems, which support updates, have the same difficulties regarding replication - they stem from the fact that the updates across replicas are non-deterministic and you need to synchronise, which is a tough job, each solution has its cons and has pros.
Why people use key/value stores (plug your own) instead of sql? Well sql is not a panacea either. SQL is good to use where your data roughly fits the relational (sorry) model. Take the example with trees. Parent-child relations. If you store the tree directly, in some kind of a tree store, traversing it will obviously be fast. Faster, than simulating it using n SQL queries. You could roll some stored procedures, the way Celko advocates, but you will end up with log n queries on the inside, rather than one. Then again. If you can actually map your tree to a number scheme, and have the required queries defined on those numbers, then you could probably do it using only one query, probably quite fast. The key is to make the db do most of the work. The problem is this doesn't work for most kinds of tree or graph data :( Some mappings either don't exist or are yet to be discovered.
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Bertrand Mansion schrieb:
On Tue, May 5, 2009 at 4:18 PM, Greg Knaddison <Greg@growingventuresolutions.com> wrote:
This is some interesting feedback. I'd love to hear more of your thoughts so I can understand it better.
On Tue, May 5, 2009 at 7:51 AM, Bertrand Mansion <drupal@mamasam.net> wrote:
Drupal doesn't really need a relational DB and actually doesn't use the relational features properly (for example, the way tags are stored is not efficient). That's one of the reasons why it is slow and doesn't scale very well. These are some pretty sweeping claims. Can you expand on them?
These are not claims, but only my opinion, based on my experiences with Drupal.
With which types of sites did you make these experiences?
1. Tags are stored inefficiently (I can't think of a way to store them that is better for every use)
What do you mean by "every use" ?
He probably means "every possible use-case".
If you are interested in reading about tags and SQL, here are some pointers: http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html http://laughingmeme.org/2005/04/07/in-lieu-of-the-promised-article-on-tags-a...
The way Drupal stores tags hierarchy is also inefficient.
We just agree to some terminology. In the Drupalverse, we have 1) a term, which is part of a vocabulary. The vocabulary can be hierarchical or not, and allow some other special cases. 2) a tag, this is a special case of a term, namely a term from a vocabulary that has the "free tagging" bit switched on. This also means the vocabulary has no hierarchy. Drupal 6 is in particular problematic if you have a) a lot of nodes b) which have several terms each (doesn't matter which type) and c) have many revisions. In Drupal 5 terms where not revisioned, in D6 they are and this causes mySQL to sift through far more entries than it had to before. drupal.org's handbooks and projects are use cases for this.
2. Drupal doesn't scale very well (I've seen enough claims otherwise - is there a particular problem you can point to?)
Lots of problems for me, but that's off topic.
No, we are on the development list where can discuss such topics.
Sessions, cache, sql queries, table structures, file storage, architecture and callback hell, etc.
You should probably open new threads for each topic in order no not clutter this one too much.
It's currently the best CMS for PHP I know, but still, it is sluggish (yes, this is subjective).
Very.
But going for another storage system would be better if implemented as a fork, IMO. Various people just rewrote the entire DB API, so it is possible to make massive API changes within a release cycle. Why do you feel a fork is necessary?
Moving to something like CouchDB as was suggested in the thread or
Somebody (David?) brought up the topic of CouchDB, but I am not aware of a serious development effort.
some other datastores (someone mentioned Hadoop but I think he meant BigTable, Dynamo or Tokyo Cabinet...) would need more than just rewriting the DB API in my opinion. That's why I can only imagine this as a fork. That would be my reply to the first post.
I don't think that a Drupal fork would be viable. Cheers, Gerhard -----BEGIN PGP SIGNATURE----- Version: GnuPG v1.4.9 (GNU/Linux) iEYEARECAAYFAkoAZIEACgkQfg6TFvELooQ/uwCfcQT8jlJzYQLgtHfB7XQha5KC racAoIrjNmAuejqz1D5m6w5cbNpZuJ0u =w+4e -----END PGP SIGNATURE-----
There are plenty of places in Drupal where we can optimize the database before leaving SQL behind. Search can already be factored out to highly optimised systems like Apache Solr, but we still support SQL based search in core (and it was made a lot faster for Drupal 6 than in Drupal 5). So while I think pluggable field storage will be great, it's not going to get us out of optimising for MySQL any time soon. There's also a big difference between indexing content where the primary record is still in the database, and swapping out the database entirely. On taxonomy specifically theres two issues with our current storage: queries joining on the term_node table (as already pointed out by Killes) - if you have conditions on one table and order by on another, then MySQL can't use indexes efficiently. The term_node table might disappear completely in Drupal 7 if vocabulary associations with nodes move to the field API - http://drupal.org/node/412518 - field API storage has it's own performance issues, but we need to deal with those anyway. Or we might end up having both field tables and term_node. The other problem with taxonomy storage is term hierarchy. http://drupal.org/node/344019 is being worked on (storing hierarchies with nested sets keyed by rational numbers to save recursively walking up and down trees) so we could do queries on the taxonomy tree efficiently. There's also contrib modules for implementations of nested sets and materialized path (for Drupal 5 / 6). Both issues desperately need more eyes on them though, and sadly the discussion so far suggests that the people writing of taxonomy's storage haven't even bothered to look for them. Nat
On Mon, 4 May 2009 14:11:07 -0500 Chris Johnson <cxjohnson@gmail.com> wrote:
Folks who care about the direction and performance of the Drupal database (and developers just generally interested in database backends for applications) should read this blog posting by Bob Cringley -- and more importantly, the comments written by some folks who know way more about databases than Mr. Cringely:
David Strauss and Larry Garfield, as our two current DB experts (IMHO, of course) -- I'd especially like you to digest the more pertinent comments and think about how we can simplify our database access, and hence optimize our performance. I'm about to go off an research Hadoop a bit, but my gut feeling is the comments which say SSD and proper, simple use of SQL and good data paths will be the right answer for the next decade or so are probably on the right mark.
I'd say it depends on the application (and no... drupal is not *one* application). If you're google you really don't care to lose some data and you don't care about transactions (most of the times). Most of what drupal does don't need transactions even if once you've several thousands of custom node types that are valuable and not just crap on the net you'd be a bit disappointed if the node-kernel remains orphan of the extra data just because something went wrong halfway... Furthermore... don't expect that anything that is not SQL and works better will be easy to understand... unless once more your data is crap and your application is simple (concurrency, coherency...). At the language level I won't bet we are going to see any revolution, so it will still be a good bet to have a good SQL abstraction layer, hopefully that will support some more coherency/concurrency features. Visual Basic, PHP, MySQL etc... were/are popular because they were easy and cheap and mostly they dealt with crap (nothing you will kill a self-proclaimed programmer for[1]). What alternative to SQL are we going to see that will overcome the cost of learning something new and offer anything better? For a long time MySQL was just a SQL interface to the filesystem... that should be a starting point to understand the real value and importance of SQL and not confusing it with the RDBMS "debate". Transaction, concurrency, coherency and building good system design aren't easy. A SQL interface is going to stay there for longer (linq anyone?). So part of the future is Hbase, part is not... and there are money in both camps. I'd say anyway that generally there are more money in the business that care about transactions and coherency (unless it is politics). BTW Amazon does have to care much more about transactions than Google. No surprise they aren't using MySQL. [1] F: hey Silvio, we lost all our invoices record S: never mind we just deleted the false accounting crime -- Ivan Sergio Borgonovo http://www.webthatworks.it
participants (18)
-
Bertrand Mansion -
Chris Johnson -
Earl Miles -
Earnie Boyd -
Gerhard Killesreiter -
Greg Knaddison -
Ivan Sergio Borgonovo -
Jeff Eaton -
Karoly Negyesi -
Kyle Mathews -
Larry Garfield -
larry@garfieldtech.com -
Michael Favia -
Mikkel Høgh -
Nancy Wichmann -
Nathaniel Catchpole -
Vladimir Zlatanov -
Yves Chedemois