[drupal-devel] Request for input on Quiz module data structure [long]
Hi, folks. Hope that this is the right place for this. Apologies in advance for the length, but I wanted to make sure everyone's on the same page before asking for feedback. == BACKGROUND == As most know, I worked over the summer on the Quiz module. Thanks to a generous injection of funding from Ejovi Nuwere at SecurityLab.net, I've been able to both complete most of the work that was not done by the other student assigned to this project during SoC, and am also able to now focus on adding additional features to the module. The quiz module was developed around Robert Douglass's specification (which I've temporarily mirrored here: http://drupaldev.snarkles.net/node/2). In a nutshell, this means rather than creating a simple module that could do multiple choice questions, instead creating a quiz *framework* which could potentially support almost limitless question types, from traditional multiple choice/fill in the blank, to really innovative ideas such as "reading" a question aloud to gage reading comprehension and so on. == HOW IT CURRENTLY WORKS == The current schema can be viewed here: http://cvs.drupal.org/viewcvs/drupal/contributions/modules/quiz/quiz.mysql?r... For the more visually inclined, I've uploaded an E-R Diagram here: http://www.webchick.net/soc/quiz.module/quiz-schema-old.png This is based off the original proposal, where serialized arrays (/me ducks in advance) are used to store answer, feedback and user response data. NOTE: Skip to "THE PROBLEM" if you don't care about how this stuff is stored currently :P A typical result array (stored in quiz_results.results) might look like: Array ( [10] => Array ( [input] => Array ( [0] => 0 [1] => 2 [2] => 3 ) [feedback] => Array ( [0] => Correct! [1] => Correct! [2] => Correct! ) [score] => 1 ) [6] => Array ( [input] => Array ( [0] => 2 ) [feedback] => Array ( [0] => Nope, fooled you! ) [score] => 0 ) ... ) The array keys (10 and 6) are question node IDs... in other words, this is how the user responded to both question 10 and question 6. Over the course of the quiz, this array keeps growing, with one array of results per question. "input" is an array of the answer(s) that the user selected on that question. The first question was a multiple choice question, so the user plugged in three answers (answers 1, 3, and 4 -- remember that arrays start at 0 index). The second question was only a single-choice question, and the user selected answer 3. "feedback" is an array of the feedback to the answer(s) the user selected for input. As you can see, the user answered correctly on all three answers on the first question, but did not answer correctly on the second question. Finally, "score" is either a 1 (indicating the answer was correct) or a 0 (indicating the answer was not correct). == THE PROBLEM == The question was raised to me: "If there's a million questions, and a million students, and we want to find out for a given question, how many student got it right how do we do that?" Right now, the answer is something along the lines of this: ----------------------------------------------------------------------- function get_question_results($quiz_nid, $question_nid) { // Array to store question results $question_results = array( 'right' => 0, 'wrong' => 0 ); // Grab a list of the results array for each quiz attempt $quiz_results = array(); $results = db_query("SELECT r.results FROM {quiz_results} r WHERE r.quiz_nid = %d", $quiz_nid); while ($result = db_fetch_object($results)) { $quiz_results[] = unserialize($result->results); } // Now, loop through each result and look for the given question foreach ($quiz_results as $key => $value) { if ($value[$question_nid]) { // If question is found check if it was answered correctly or not if ($value[$question_nid]['score'] == 1) { $question_results['right']++; } else { $question_results['wrong']++; } } } return $question_results; } // Call the function and output results $results = get_question_results(12, 10); echo $results['right'] ." students answered this question right, " . $results['wrong'] ." answered it wrong."; ----------------------------------------------------------------------- This is quite processor and RAM-intensive, since it has to load ALL of the results for each attempt on a given quiz first, then parse through those to find individual question attempts, and then finally retrieve the answers from each and evaluate whether or not they were correct. And this same basic logic has to be completed for each potential query someone might want to throw at it.. Such as, "What is the average score of this user among all exams?" In short, it doesn't look like this solution is going to be very scalable in the long-term for the kind of ad-hoc querying that educators are likely going to really want to be able to do. == THE SOLUTION? == I've created an alternate, normalized data schema and wanted to get your feedback on it, if you get a chance. http://webchick.net/soc/quiz.module/newschema.sql E-R Diagram: http://webchick.net/soc/quiz.module/quiz-schema-new.sql Changes: - quiz/quiz_questions tables stay the same - multichoice turns into 'quiz_question' containing only a primary key and a "properties" field to take the place of "multiple_answers." I envision this field as being more flexible than multiple_answers, as it could potentially store a plethora of options for very diverse array question types, perhaps in the form of key/value pairs or something (multichoice=1;email_results=1; etc.) - answers are now broken out into a 'quiz_question_answer' table, which links back to quiz_question with separate fields for answer, feedback, and points. The answer field itself could store a wide variety of options, such as "Answer" for a multichoice question, "St. Paul|Saint Paul" for a fill-in-the-blank question, and so on. Since answer display is handled in each question type module in the _form and _render_question hooks, we can still have the advantage of storing data independently from the actual quiz module. Feedback, too, can be stored in whatever way makes sense for the answer, and points can indicate whether a question is incorrect (0 points) or can also be used for 'weighting' questions (answer 1 is 20 points, answer 2 is 50 points, etc.) - Results are now stored in two different tables: quiz_result for the overall properties (start time/end time, etc. -- essentially the same as the current table) and quiz_question_results which provides the textual representation of what that user put for his/her answer on each question attempt. This could be stored in a similar manner to the answers table... for multiple answer questions, it could be 23,63,3 to indicate the answer ids selected, for fill-in-the-blank it could have 'Minneapolis', etc. Each question type's _evaluate_question hook would take the data from the answer field and check it against the given question's answers. - Fields representing times and dates are switched to DATETIME instead of INTEGER (for UNIX timestamp) in order to help with date range calculation And if you made it through that, a hearty congratulations. :P Any thoughts?
Hi Angie The rule of thumb is to start normalized, ideally with a 3NF, then only de-normalize for a good reason (mainly performance). I think this case is a very legitimate example where normalization is warranted. By using fine tuned queries the volume of data that is retrieved is much less than getting it all in and sifting through it in PHP. This leaves the DATETIME instead of a Unix timestamp issue. There must have been a good reason for doing it this way back when Drupal started. Maybe it was interoperability between PostgreSQL and MySQL, maybe it was something else. However, I have found cases where this is really annoying, for example, when you want to do date arithmetic, intervals, ..etc. in the SQL engine and not retrieve the data from the database and do the processing in PHP.
Wow!! Thanks very much for all of the input so far! To address a couple things: "number_of_questions" can be queried and therefore should not be stored IMO if you're going for full normalization." Not quite. The reason for number_of_questions is to put a cap on the number of questions actually shown in the quiz. Let's say I have a question bank of 200 questions. 40 of those are about topic X, but I only want a 25 question quiz about topic X, not a 40 question one. I want questions #5, #14, and #27 to show up in every single quiz, but the rest can just be random (this is determined by the question_status field in quiz_questions). That's what that field is for. If there's a better way to do this that I'm missing though, please feel free to indulge me. :) "I think "takes" could be have a more descriptive name and should be bigger than tinyint." Takes is "number of takes a user is allowed to have on this quiz." Values range from 0 (Unlimited) to 10. I could change this to an integer field, but I can't envision a situation where someone would want > 255 takes allowed and wouldn't just call that "Unlimited." But if you have a use case for this, please by all means let me know! My guess is that you're right and the field indeed needs to be renamed so it is more clear that it's describing the allowed number of takes, and not a calculation of the number of takes that the quiz has actually had. (maybe allowed_takes?) "I think you meant .png on that last link." You are exactly right. :) New version of the ERD is at http://webchick.net/soc/quiz.module/quiz-schema-new.png, sorry. "The rule of thumb is to start normalized, ideally with a 3NF, then only de-normalize for a good reason (mainly performance)." Indeed. I'm normally a "normalization nazi" (hehe ;)) but I was not familiar with Drupal coming into this project and thought maybe serialized arrays were "the Drupal way" of handling this type of scenario, since it was part of the spec. "This leaves the DATETIME instead of a Unix timestamp issue. There must have been a good reason for doing it this way back when Drupal started. Maybe it was interoperability between PostgreSQL and MySQL, maybe it was something else." Yeah, does anyone know why this is? I tried Googling "DATETIME vs. timestamp" and couldn't find anything in support of using timestamps (though admittedly, I didn't look very long). SQL is littered with date comparison and manipulation functions (where PHP has very few) so it seems to make the most sense to leverage this power on the database end. "I am very interested in understanding the quiz module and what you are doing with it. Is the module you are working on the same one that is in CVS?" Yes, it's currently in CVS at http://cvs.drupal.org/viewcvs/drupal/contributions/modules/quiz/ and you can download a daily tarball (updated daily) at http://drupal.org/files/projects/quiz-cvs.tar.gz. I haven't widely publicized this yet because since the data model is likely about to radically change, I don't want too many people locked into the "old, inflexible way" of doing things. But the basic functionality is there. "It seems to me that you are making the problem harder than it needs to be. I wonder why you are not using the power of SQL to do the sorting, comparing and counting at the database side." The problem is in the way the results are stored currently (as a serialized array). Here is an example of the contents of a "results" field in the database: a:2:{i:10;a:3:{s:5:"input";a:3:{i:0;i:0;i:1;i:2;i:... ...which as you might guess is basically impossible to query properly (unless I'm missing something really obvious). The only way that I know of to get at this data is to call PHP's unserialize() function on the field to bring it into an array that can then be played with inside PHP. I'm hoping that by changing the schema to the new way, I can indeed use a one-liner SQL query to retrieve any kind of ad-hoc questions someone might have about the data. That's why I wanted to run the idea past some folks who've been doing this a bit longer than I have, to make sure I'm not overlooking anything. :) Thanks again so much!! -Angie
Angie Byron wrote:
This is based off the original proposal, where serialized arrays (/me ducks in advance) are used to store answer, feedback and user response data.
Rapid prototyping =)
== THE SOLUTION? == I've created an alternate, normalized data schema and wanted to get your feedback on it, if you get a chance. http://webchick.net/soc/quiz.module/newschema.sql E-R Diagram: http://webchick.net/soc/quiz.module/quiz-schema-new.sql
E-R Diagram link not found.
- multichoice turns into 'quiz_question' containing only a primary key and a "properties" field to take the place of "multiple_answers." I envision this field as being more flexible than multiple_answers, as it could potentially store a plethora of options for very diverse array question types, perhaps in the form of key/value pairs or something (multichoice=1;email_results=1; etc.)
Not normalized, but probably ok for properties. Unless you get requests to do queries such as "Show me all the mulitple choice questions where the results get emailed".
- answers are now broken out into a 'quiz_question_answer' table, which links back to quiz_question with separate fields for answer, feedback, and points. The answer field itself could store a wide variety of options, such as "Answer" for a multichoice question, "St. Paul|Saint Paul" for a fill-in-the-blank question, and so on.
Also not normalized. If you're going to have each question type do its own logic for splitting the "Answer" field to get the multiple options, it seems like you could just make special tables for that question type that reflect its needs, and actually make normalized tables. I think the whole point is that every question type needs to save the data differently, right?
- Results are now stored in two different tables: quiz_result for the overall properties (start time/end time, etc. -- essentially the same as the current table) and quiz_question_results which provides the textual representation of what that user put for his/her answer on each question attempt. This could be stored in a similar manner to the answers table... for multiple answer questions, it could be 23,63,3 to indicate the answer ids selected,
Once again, if you're going to do the work and get away from custom data types that need extra processing, storing a comma separated list of ids isn't what you want. Making a user_id|quiz_id|question_id|answer table with the first three columns being the primary key would be the normalized way. If you're going to store a list of answer ids in a column, however, use serialize(). It's not that serialize is evil; it is definitely better than making your own string parsing routines where you have to remember whether to split on comma or pipe, and worry about whether someone might use the pipe in the answer. In all, these look like really good improvements on the original spec. Good work! -Robert
Now that I'm looking at the ER diagram of the SOLUTION, I think you're ok for the time being, as long as all of the logic for querying the quiz_question_answer and quiz_question_results tables are delegated to the question types. The reason for this is; quiz_question_answer and quiz_question_results are really supposed to be handled polymorphically. Each question type is supposed to know how to ask the question, how to receive the answer, and how to calculate points based on it. If the question type were an accent_reduction type that played and audio clip of a native speaker saying a phrase, and the answer were the digital recording of the user repeating it back, and the points were awarded on how well certain accent indicators were matched, then these text fields are clearly inadequate. So whether or not this db schema will serve as the starting point for a really robust, polymorphic quiz module depends on whether or not the polymorphism is enforced at the code level. -Robert
Thanks a lot for your reply, Robert! I should clarify that my goal is not necessarily complete and total normalization of the schema (as that would require at least 1 additional database table for each question type, which would quickly become cumbersome and create dependencies), but to strike a balance between intense processing/RAM usage being required for "easy" queries that everyone is going to want to make (what was the user's average score on these quizzes?) but at the same time allowing the quiz module to retain the "polymorphism" which is the real strength of the current system.
Now that I'm looking at the ER diagram of the SOLUTION, I think you're ok for the time being, as long as all of the logic for querying the quiz_question_answer and quiz_question_results tables are delegated to the question types.
Yep, that is the plan. When I want to display a question to the screen, I call its "render_question" hook from quiz.module, like: module_invoke($question->type, 'render_question', $question->nid); In your example of an audio-type question, if the question was, "The cow goes..?", the quiz_question_answer.answer field would probably contain something like "files/sounds/moo.mp3" Inside the audio_question.module, in its render_question hook, it would know to take the data extracted from the answer field, translate it into an HTML statement to display a link to the file (or embed it in a Flash player or whatever), and display that to the screen rather than simply displaying the file paths and making little radio buttons/checkboxes with answers next to them like multichoice does. Similarly, each question type keeps track of its own way of determining whether a submitted answer is correct or not. This is done in the evaluate_question hook which is called from quiz.module, like: module_invoke($question->type, 'evaluate_question', $question->nid); This will do all the question type's internal logic of parsing the answer field to determine if it is correct, and return the amount of points a user earned on that question (currently, this is either 0 or 1, but by adding the 'points' field I should be able to make weighted questions, which has been a common request). So these two hooks together should handle querying quiz_question_answer (as well as storing in quiz_question_results in the expected format). I will maybe need to add a third hook to handle displaying result information to the screen (render_result?) so that quiz.module remains blissfully unaware of how each question handles this.
If you're going to store a list of answer ids in a column, however, use serialize(). It's not that serialize is evil; it is definitely better than making your own string parsing routines where you have to remember whether to split on comma or pipe, and worry about whether someone might use the pipe in the answer.
Ah, that totally makes sense!! That would be a much better way to handle the "properties" field, as well as the answer field on those questions that could conceivably have more than one correct answer (for example, for fill in the blank questions you might want to catch "saint paul" as well as "st. paul" as well as "st paul"). Remind me next time I am struggling with something for a long time to just post here and get some help. :P You guys are awesome, thank you. -Angie
participants (3)
-
Angie Byron -
Khalid B -
Robert Douglass