Hi folks, I'm in the early stages of building a set of scripts that parse out info about hits to a site's RSS feed. The goal is to transmogrify what is now only viewable as thousands of anonymous hits to node/feed into a list of total hits to node/feed by IP address. My hope is that this will be the foundation for a module/feature that gives me a ballpark idea of how many RSS subscribers I actually have.
Anyone working on something similar? Or should I consider building this as a new feature to submit for review in the statistics.module? Finally, the statistics.module dropped a lot of functionality that would have allowed this anyway after the port to 4.6 -- was there a good reason for this?
Best, Nick Lewis http://nicklewis.smartcampaigns.com
Nick
This is a good idea.
However, instead of parsing the logs (which can be expensive), how about intercepting the logging as it happens and if it is node/feed, then write an entry to a special log. That log would have the IP address, time of last access and cumulative number of accesses for that IP address?
Or maybe all we need is additional indices on the existing table (although an index on the URL would be expensive).
This could be part of statistics.module of course.
On 1/16/06, Nick Lewis <nick@smartcampaigns.com> wrote:
Hi folks, I'm in the early stages of building a set of scripts that parse out info about hits to a site's RSS feed. The goal is to transmogrify what is now only viewable as thousands of anonymous hits to node/feed into a list of total hits to node/feed by IP address. My hope is that this will be the foundation for a module/feature that gives me a ballpark idea of how many RSS subscribers I actually have.
Anyone working on something similar? Or should I consider building this as a new feature to submit for review in the statistics.module? Finally, the statistics.module dropped a lot of functionality that would have allowed this anyway after the port to 4.6 -- was there a good reason for this?
Best, Nick Lewis http://nicklewis.smartcampaigns.com
However, instead of parsing the logs (which can be expensive), how about intercepting the logging as it happens and if it is node/feed,
Note that he would have to check node/feed AND rss.xml.
In Drupal 4.6:
* node/feed was the right URL
* rss.xml was the alias.
In Drupal 4.7-upgraded-from-4.6:
* node/feed is the alias.
* rss.xml is the right URL.
In Drupal 4.7, no upgrade:
* node/feed doesn't exist.
* rss.xml is the right URL.
-- Morbus Iff ( you are nothing without your robot car, NOTHING! ) Culture: http://www.disobey.com/ and http://www.gamegrene.com/ O'Reilly Author, Weblog, Cook: http://www.oreillynet.com/pub/au/779 icq: 2927491 / aim: akaMorbus / yahoo: morbus_iff / jabber.org: morbus
Morbus Iff wrote:
However, instead of parsing the logs (which can be expensive), how about intercepting the logging as it happens and if it is node/feed,
Morbus, I wasn't aware of this change, but I reckin' it makes a lot of sense. So for feeds that are currently written like:
--taxonomy/term/1/0/feed
would the new version be:
--taxonomy/term/1/0/rss.xml ?
I'll of course be writing this for 4.7 and not 4.6.
Best, Nick Lewis
Note that he would have to check node/feed AND rss.xml.
In Drupal 4.6:
* node/feed was the right URL
* rss.xml was the alias.
In Drupal 4.7-upgraded-from-4.6:
* node/feed is the alias.
* rss.xml is the right URL.
In Drupal 4.7, no upgrade:
* node/feed doesn't exist.
* rss.xml is the right URL.
Morbus, I wasn't aware of this change, but I reckin' it makes a lot of sense. So for feeds that are currently written like: --taxonomy/term/1/0/feed
would the new version be: --taxonomy/term/1/0/rss.xml
No. *Only* node/feed has changed. All the other URLs have not changed.
-- Morbus Iff
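The path rules Morbus describes (the site-wide feed is node/feed or rss.xml depending on version and upgrade history, while other feeds keep their /feed suffix) could be checked roughly as follows. This is a sketch in Python for illustration only, not Drupal code; the function name `is_feed_path` is hypothetical.

```python
def is_feed_path(path: str) -> bool:
    """Return True if an internal path looks like an RSS feed request.

    Assumption: paths are given without a query string, e.g. 'node/feed'.
    """
    normalized = path.strip("/").lower()
    # Site-wide feed: 'node/feed' (4.6) or 'rss.xml' (4.7); either may be an alias.
    if normalized in ("node/feed", "rss.xml"):
        return True
    # Other feeds, such as taxonomy/term/1/0/feed, keep their '/feed' suffix.
    return normalized.endswith("/feed")
```

Checking both site-wide names means the same test works on a 4.6 site, an upgraded 4.7 site, and a fresh 4.7 install.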
On Monday 16 January 2006 11:52, Khalid B wrote:
Nick
This is a good idea.
However, instead of parsing the logs (which can be expensive), how about intercepting the logging as it happens and if it is node/feed, then write an entry to a special log. That log would have the IP address, time of last access and cumulative number of accesses for that IP address?
Also, which feeds they are retrieving, for the case where there are multiple feeds.
Or maybe all we need is additional indices on the existing table (although an index on the URL would be expensive).
This could be part of statistics.module of course.
-- Jason Flatt http://www.oadae.net/ jason@oadae.net
Khalid B wrote:
Nick
This is a good idea.
However, instead of parsing the logs (which can be expensive), how about intercepting the logging as it happens and if it is node/feed, then write an entry to a special log.
I think this is a great idea. Maybe, instead of even intercepting, we could altogether avoid adding any overhead to the access log by updating the info on cron, or when the user accesses the page for RSS logs? So the flow would be:
1. Code executes when the script is accessed by a user or by cron.
2. Code gets the timestamp of the last update (from an undecided place, as of now) and filters through all page hits with a path like .../feed.
3. For each new IP address, create a new record; for existing IP addresses, update the cumulative number of accesses.
Then from that data, we could also create statistics on term feeds, blog-specific feeds, etc.
That log would have the IP address, time of last access and cumulative number of accesses for that IP address?
Or maybe all we need is additional indices on the existing table (although an index on the URL would be expensive).
The access logs already take up enough resources as it is; I think we should consider using a database table.
This could be part of statistics.module of course.
And I was sort of thinking this would be better in the core statistics module as well. We already have enough extended tracking.modules -- I think the world probably doesn't need another. Though, I would like to extend the display of data to take advantage of modules like graphstat, but that is another story.
On 1/16/06, Nick Lewis <nick@smartcampaigns.com> wrote:
Hi folks, I'm in the early stages of building a set of scripts that parse out info about hits to a site's RSS feed. The goal is to transmogrify what is now only viewable as thousands of anonymous hits to node/feed into a list of total hits to node/feed by IP address. My hope is that this will be the foundation for a module/feature that gives me a ballpark idea of how many RSS subscribers I actually have.
Anyone working on something similar? Or should I consider building this as a new feature to submit for review in the statistics.module? Finally, the statistics.module dropped a lot of functionality that would have allowed this anyway after the port to 4.6 -- was there a good reason for this?
Best, Nick Lewis http://nicklewis.smartcampaigns.com
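The cron-driven flow Nick outlines in this reply (read only the access-log rows newer than the last recorded scan, then total feed hits per IP) might look something like the sketch below. It uses Python's sqlite3 purely for illustration; the table name `accesslog` and its columns `path`, `hostname` and `timestamp` are assumptions, not a statement about the real Drupal schema.

```python
import sqlite3

def scan_new_feed_hits(conn, last_scan):
    """Count feed hits per IP among rows newer than last_scan.

    Returns (counts_by_ip, newest_timestamp); the caller would store
    newest_timestamp somewhere and pass it in on the next cron run.
    """
    rows = conn.execute(
        "SELECT hostname, COUNT(*), MAX(timestamp) FROM accesslog "
        "WHERE timestamp > ? AND (path LIKE '%/feed' OR path = 'rss.xml') "
        "GROUP BY hostname",
        (last_scan,),
    ).fetchall()
    counts = {hostname: hits for hostname, hits, _ in rows}
    # If nothing new arrived, keep the previous scan marker.
    newest = max([last_scan] + [last_seen for _, _, last_seen in rows])
    return counts, newest
```

Whether the timestamp filter avoids reading the whole table depends on whether that column is indexed.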
By intercepting, I mean at the logging level: before we log, if the URL is an RSS feed, then we update (or insert) a row in the new RSS log, increment the count, and update the last access time. So, this will only happen when an RSS feed is accessed.
1. Check that the URL accessed is a feed (maybe add the hook to node_feed(), taxonomy_*_feed()).
2. Check if a row exists for the IP address.
3. If it does, then update the last access time and increment the access count.
4. If there is no row, then insert a new one.
Now that I think of it, IP address can be deceiving (e.g. someone accessing stuff from office and from home will be counted as 2). Even using the session would not solve all cases. Crawlers do not do sessions.
As far as cron is concerned, I think it will be a resource guzzler for large sites, since it will have to do full table scans on the table.
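The update-or-insert steps Khalid lists can be sketched as below, again in Python with sqlite3 for illustration rather than as Drupal code. The table `feed_log` and the helper names are hypothetical.

```python
import sqlite3

def open_feed_log():
    """Create an in-memory stand-in for the proposed RSS log table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE feed_log (ip TEXT PRIMARY KEY, hits INTEGER, last_access INTEGER)"
    )
    return conn

def record_feed_hit(conn, ip, now):
    """Increment the counter for this IP, inserting a row on first sight."""
    updated = conn.execute(
        "UPDATE feed_log SET hits = hits + 1, last_access = ? WHERE ip = ?",
        (now, ip),
    ).rowcount
    if updated == 0:
        # No existing row for this IP, so start its count at 1.
        conn.execute(
            "INSERT INTO feed_log (ip, hits, last_access) VALUES (?, 1, ?)",
            (ip, now),
        )
    conn.commit()
```

Because the work happens only on feed requests and touches one row by primary key, it avoids the periodic table scan that the cron approach needs.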
xstatistics appears to try this already in the summary report. On Jan 16, 2006, at 1:04 PM, Nick Lewis wrote:
Hi folks, I'm in the early stages of building a set of scripts that parse out info about hits to a site's RSS feed. The goal is to transmogrify what is now only viewable as thousands of anonymous hits to node/feed into a list of total hits to node/feed by IP address. My hope is that this will be the foundation for a module/feature that gives me a ballpark idea of how many RSS subscribers I actually have.
Anyone working on something similar? Or should I consider building this as a new feature to submit for review in the statistics.module? Finally, the statistics.module dropped a lot of functionality that would have allowed this anyway after the port to 4.6 -- was there a good reason for this?
Best, Nick Lewis http://nicklewis.smartcampaigns.com
Nick Lewis wrote:
Hi folks, I'm in the early stages of building a set of scripts that parse out info about hits to a site's RSS feed. The goal is to transmogrify what is now only viewable as thousands of anonymous hits to node/feed into a list of total hits to node/feed by IP address. My hope is that this will be the foundation for a module/feature that gives me a ballpark idea of how many RSS subscribers I actually have.
How about amending the feed's link URLs?
eg node/5 becomes feedtracker/{sessionid}/node/5
A module can log and 302-redirect "clicks". This rather depends on aggregators being trackable by session ID, which I have no knowledge of at all.
But ferchrissakes *please* don't go anywhere near IP addresses. The number of things which I've had to clean up in my life because a site happens to have lots of (eg) AOL users [1] is turning me grey before my time.
jh
[1] Before replying with something regarding RSS and AOL, read that again and notice I said 'eg' as well.
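The link rewriting John proposes could be sketched like this in Python (for illustration only, not Drupal code). Following a later message in this thread, the token is an MD5 hash derived from the session ID rather than the session ID itself; the salt and all function names are assumptions.

```python
import hashlib

def tracking_token(session_id: str, salt: str) -> str:
    """Derive an opaque token so the raw session ID never appears in a URL."""
    return hashlib.md5((salt + session_id).encode("utf-8")).hexdigest()

def tracked_link(original_path: str, token: str) -> str:
    """Rewrite e.g. 'node/5' as 'feedtracker/<token>/node/5'."""
    return f"feedtracker/{token}/{original_path.lstrip('/')}"

def parse_tracked_link(path: str):
    """Return (token, original_path), or None if this is not a tracked link."""
    parts = path.strip("/").split("/", 2)
    if len(parts) != 3 or parts[0] != "feedtracker":
        return None
    return parts[1], parts[2]
```

A handler for the feedtracker path would log the token together with the original path, then issue the 302 redirect to that path.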
John Handelaar wrote:
Nick Lewis wrote:
Hi folks, I'm in the early stages of building a set of scripts that parse out info about hits to a site's RSS feed. The goal is to transmogrify what is now only viewable as thousands of anonymous hits to node/feed into a list of total hits to node/feed by IP address. My hope is that this will be the foundation for a module/feature that gives me a ballpark idea of how many RSS subscribers I actually have.
How about amending the feed's link URLs?
eg node/5 becomes feedtracker/{sessionid}/node/5
Holy jumpin Jesus that is a good idea!!!!!! One problem.... I have no idea how to implement it.... I could probably figure it out with some research.
John, could you be more specific as to why AOL IP addresses have made your life miserable? If it just skews the results, blocking a bunch of subscribers to one IP I think I and others would be able to *live with that*.
Khalid, or anyone else, what's the better approach in your opinion? Note that John's idea kind of rules out the possibility of this being part of a core module. Also note that by approaching it as an extension to statistics, we could use the same scripting to tackle the much-missed feature of being able to track page hits by hostname. However, for this purpose, tracking the actual reading patterns of RSS subscribers is more useful than merely tracking how many times they've accessed the feeds.
Also, regarding your notes on IP being imperfect -- well, yes, but I nevertheless think that it at least gives a picture of your RSS readers, albeit a distorted one. I still think something is better than nothing in this case. Also, would not selecting * from [access table] WHERE timestamp > [recorded timestamp of last table scan] on the cron run be the least expensive way possible? Bear in mind that I'm relatively new to programming MySQL.
Will now check out xtracker, do some research into John's idea, and do some testing....
Best, Nick Lewis
A module can log and 302-redirect "clicks". This rather depends on aggregators being trackable by session ID, which I have no knowledge of at all.
But ferchrissakes *please* don't go anywhere near IP addresses. The number of things which I've had to clean up in my life because a site happens to have lots of (eg) AOL users [1] is turning me grey before my time.
jh
[1] Before replying with something regarding RSS and AOL, read that again and notice I said 'eg' as well.
Nick Lewis wrote:
John, could you be more specific as to why AOL IP addresses have made your life miserable? If it just skews the results, blocking a bunch of subscribers to one IP I think I and others would be able to *live with that*.
One AOL user, requesting one page and then 20 other 'hits' associated with it, can appear as either
a) One IP, or
b) 21 IPs, or
c) any number in between
Repeat for each request.
Same problem with load-balanced proxy servers (as used, inter alia, by the entire UK academic network - or at least those universities which run everyone through the joint cache servers).
IP-based *anything* - Just Say No. It's *spectacularly* inaccurate and, frankly, an amateur's mistake.
jh
John Handelaar wrote:
IP-based *anything* - Just Say No. It's *spectacularly* inaccurate and, frankly, an amateur's mistake.
Firstly, sorry about all my messages to the list regarding this idea, I'll keep them to a minimum from here on.
I disagree with the above assumption (though I acknowledge that it is correct -- in *some* ways, and in *some* situations) on the basis of personal experience. This may seem like a strange example, but I once ended up finding out that a cute girl who I'd assumed was out of my league was interested in me thanks to the amateur's mistake of tracking visitors by IP in Drupal 4.5. Let me explain (I think this is a good example of how we should be thinking about our users' needs when it comes to traffic analysis):
I mentioned to her that I had posted a certain essay called "The Renaissance of the Commons" on my blog, and told her to google my name and the title to find it. The search popped up on my referrers log, and I marked down the IP associated with that search (I did have a crush on her). Later, I checked her IP's history, and found out that she was apparently a lot more interested in what I was writing than I would have thought. For the next week, I noticed her return twice to four times a day -- and it suddenly occurred to me that maybe I should ask her out on a date. The end result was me being one satisfied Drupal user.
Note that I was able to do this solely on the basis of tracking hits by IP. Beyond that, I also was able to track a few "high value" visitors like my now future boss. True, this wouldn't have worked had they been AOL users, or part of a UK academic network. However, as far as I know, IPs plus user-agent info is as good as we're going to get from users who don't register at a site. This isn't a bulletproof analysis tool that I'd try to sell to high-profile executives at Time Warner's international marketing team, or any clients who were impressed by phrases like "enterprise-level solution". Rather, it's just a practical analysis of data that is already collected, and aimed at a user base of amateurs, or small publications.
Your earlier suggestion, I think, is more appropriate for RSS than what I initially proposed. That said, there is great value in giving users at least *something* to track individual users who are not logged in. At the moment, we mostly have to fly blind when it comes to tracking visitors. Just because it doesn't always offer scientifically valid data doesn't mean our users couldn't make use of it. Most users probably do have unique IPs.
Instead, I might develop a tool to help people tag certain IP addresses with names. For example, when Bob leaves a comment, there is a record in the access log containing his IP. I could provide a link titled "track commenter's IP address". Or, when someone searches for "Nick Lewis Drupal", I might find it useful to track the IP that was referred by that search. They might be a potential customer. Clicking the "track IP" link would provide a form where I attach a meaningful name to that IP address.
From that I could probably get a sense of Bob's reading habits. That kind of data for multiple users would be useful for me as a way to gauge my success as a writer, and a sense of whether my stock was rising or falling in the eyes of my core readers. Moreover, as a standalone module, we could use other data to weed out wrongly flagged IP addresses (typical stuff like an array of screen resolutions, operating systems, browsers, etc.).
Really, what I want -- and I know many others want this as well -- is something to help me *see* individual readers. I think it is really what the future of the web is all about: remaking the actual into the virtual.
Best, Nick Lewis
On 1/16/06, Nick Lewis <nick@smartcampaigns.com> wrote:
John Handelaar wrote:
IP-based *anything* - Just Say No. It's *spectacularly* inaccurate and, frankly, an amateur's mistake. <snip> I disagree with the above assumption (though I acknowledge that it is correct -- in *some* ways, and in *some* situations) on the basis of personal experience.
If you're OK using the IP tracking, xstatistics.module already tracks how many users request your feed at least once in the data in your accesslog (as Adam Knight pointed out). No extra overhead, no new coding, it's done. It does use IP, so it could present problems, but for most situations it is a good generalization of what is going on. John's solution of using the sessionID confuses me (can you expand a bit more) but my understanding of it is that it either presents a privacy problem or would be confusing to the user or both. Regards, Greg
Greg Knaddison wrote:
On 1/16/06, Nick Lewis <nick@smartcampaigns.com> wrote:
John Handelaar wrote:
IP-based *anything* - Just Say No. It's *spectacularly* inaccurate and, frankly, an amateur's mistake.
<snip>
I disagree with the above assumption (though I acknowledge that it is correct -- in *some* ways, and in *some* situations) on the basis of personal experience.
It does use IP, so it could present problems, but for most situations it is a good generalization of what is going on.
Again, not if your tracked users are behind balanced proxies. There are entire countries which fit that description, and other surprisingly-large places like the UK which are heavily affected. So if (for example) you're in the UK, it's just BROKEN for the 30% of *everybody* who's on AOL, and another 20%-ish on ja.net, and (let's be generous) no more than one in twenty others. 55% isn't *some*, ffs. And by NO definition would the remaining 45% count as "most situations". I'm taking a maximal estimate there of JaNet usage, but those numbers don't get any prettier if you reduce that number to zero. Honestly, I'm a little surprised one can be *in* the analytics business and not know this stuff.
John's solution of using the sessionID confuses me (can you expand a bit more) but my understanding of it is that it either presents a privacy problem or would be confusing to the user or both.
"Solution" is pushing it. "Wild suggestion out of left field" is closer :) Certainly it's not confusing to end users, since it's transparent, and there are no privacy issues connected to values derived from session IDs [1] which don't already exist in the fact that Drupal uses sessions all over the place in the method prescribed by the authors of the PHP language. It goes like this: 1) Module alters the link element in the RSS feed on a per-user basis. Links are amended to force clickthroughs (and referred links) through that module's handler. The new link contains an ID [1] and the original destination. [2] 2) When someone clicks on one of those links, the module logs the click and "passes through" to the original destination. 3) If you want to collect IPs as well, you can use the relational database we all have access to to group them by SID: SELECT DISTINCT remote_ip FROM linktracker WHERE... I mean, if you're going to log IPs, you need context. Otherwise you end up with either i) too many IPs per session and no trail, or ii) a metric assload of people hiding behind only one IP who look like one person if you ignore the context of the session. You avoid this by basing your primary ID for tracking on the session which generated the feed. IP info is secondary, and you may even get the bonus of it being useful sometimes. jh [1] You can't use actual session IDs for security reasons, but you can use something derived from them, like an MD5 hash [2] This has caching implications which would need to be addressed.
On Tuesday 17 January 2006 05:12, Greg Knaddison wrote:
If you're OK using the IP tracking, xstatistics.module already tracks how many users request your feed at least once in the data in your accesslog (as Adam Knight pointed out). No extra overhead, no new coding, it's done.
On top of this: I have long-standing plans to track the client strings, at least for xstatistics (in a private table). That could be combined with these stats I query from xstatistics. The plan is as follows:
count()
filter by IP
filter only those that return within a time interval (within 2 days, e.g.)
subtract a set of known strings that are not feed readers (such as googlebot)
That is the amount of subscribers.
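One possible reading of the plan above, sketched in Python for illustration: group feed hits by IP, drop client strings known not to be feed readers, and count the IPs that came back within the interval. The input shape, the deny-list contents and the rule "two consecutive hits no more than two days apart" are all assumptions; the message does not spell them out.

```python
from collections import defaultdict

# Hypothetical deny-list of client-string fragments that are not feed readers.
KNOWN_NON_READERS = ("googlebot",)

def estimate_subscribers(hits, window_seconds=2 * 24 * 3600):
    """Estimate subscribers from (ip, timestamp, user_agent) feed hits."""
    times_by_ip = defaultdict(list)
    for ip, timestamp, user_agent in hits:
        agent = user_agent.lower()
        if any(fragment in agent for fragment in KNOWN_NON_READERS):
            continue  # skip crawlers and other known non-readers
        times_by_ip[ip].append(timestamp)

    subscribers = 0
    for times in times_by_ip.values():
        times.sort()
        # Count the IP if any two consecutive hits fall within the window.
        if any(b - a <= window_seconds for a, b in zip(times, times[1:])):
            subscribers += 1
    return subscribers
```

An IP that fetched the feed only once, or that returned after a long gap, is not counted under this reading.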
It does use IP, so it could present problems, but for most situations it is a good generalization of what is going on.
I found that when you have > 20 readers, this number resembles what netstat tells me, with a variation of 5-10%. The higher the number, the smaller the variation.
John's solution of using the sessionID confuses me (can you expand a bit more) but my understanding of it is that it either presents a privacy problem or would be confusing to the user or both.
John's ideas are cool. Yet IMO way OTT. They make me think of that ad where you see a Maserati motorbike glued to a baby's three-wheel bike. Making the feed stats killer, cool, dynamic, IP-aware, algorithmically filtered stats, while having the same ol' stats in the rest of Drupal, is like that bike.
-- PGP ber@webschuur.com http://www.webschuur.com/sites/webschuur.com/files/ber_webschuur.asc PGP berkessels@gmx.net http://www.webschuur.com/sites/webschuur.com/files/ber_gmx.asc
John/Nick
The debate on IP addresses is a valid one. My own personal view is that using IPs as a unique identifier is acceptable in some situations, but totally wrong in others.
An example is where ecommerce forms were using the IP address to create a form id, and if the id is not the same from a POST, an error is displayed. A user behind one of those proxy pools (remember: AOL, other ISPs, and entire countries are behind those) could not proceed with a cart checkout because of that.
However, in statistics, all we are saying is how many unique IPs we get hits from. It would not matter much if one user is seen as three IP addresses, since that is the nature of IP addresses. Crawlers do come from several IP addresses for the same search engine.
Using sessions would be a great idea, except that some visitors do not have cookies (e.g. crawlers), so this is not a fool proof way of doing it.
Presently, there is no bulletproof way of identifying a visitor that would work for all visitors (humans, bots, etc.), and we have to live with that limitation.
Khalid B wrote:
Using sessions would be a great idea, except that some visitors do not have cookies (e.g. crawlers), so this is not a fool proof way of doing it.
...this is why Drupal uses sessionIDs in querystrings as fallback. jh
On 1/17/06 12:46 PM, John Handelaar wrote:
Khalid B wrote:
Using sessions would be a great idea, except that some visitors do not have cookies (e.g. crawlers), so this is not a fool proof way of doing it.
...this is why Drupal uses sessionIDs in querystrings as fallback.
Only if you configure PHP to do so. HEAD's default settings.php has the following:
ini_set('session.use_only_cookies', 1);
-- James Walker :: http://walkah.net/ :: xmpp:walkah@walkah.net
On Tuesday 17 January 2006 01:01, John Handelaar wrote:
IP-based *anything* - Just Say No. It's *spectacularly* inaccurate and, frankly, an amateur's mistake.
Are you going to rewrite Drupal's core statistics? It's what we do now! awstats/netstat do this too. Even though in advanced cases it is not perfect, it is the best middle road we can come up with now.
Let us please tackle one issue at a time. Then, once someone finds enough time and courage to rewrite Drupal's core statistics to cope with dynamic IPs (based on algorithms on paths and usernames and so on), we can consider this issue too. IMO for now it's way OTT. Let us focus on getting some stats about feed readers first.
-- Bèr Kessels
On Monday 16 January 2006 20:04, Nick Lewis wrote:
Hi folks, I'm in the early stages of building a set of scripts that parse out info about hits to a site's RSS feed. The goal is to transmogrify what is now only viewable as thousands of anonymous hits to node/feed into a list of total hits to node/feed by IP address. My hope is that this will be the foundation for a module/feature that gives me a ballpark idea of how many RSS subscribers I actually have.
Anyone working on something similar?
Please have a look at xstatistics. I think this can very well go in as an improvement to my current "algorithm" :p It works, but only for large sites. Small sites get skewed stats because of bots hitting the feeds.
I think rewriting the feeds (with session IDs) is too big a task. And a big overkill, if all we want is to generate some stats on the feeds. We are not going to replace awstats and the like anytime soon anyway.
And about that, we should really really really (times 100) try to get a new column in the core stats table to track the client, after 4.7. It is all that is missing to make the stats table a 1-1 copy of the Apache logs.
Bèr
(and no, let us please not start discussing here whether or not Drupal should replace awstats and the like)
participants (9)
- Adam Knight
- Bèr Kessels
- Greg Knaddison
- James Walker
- Jason Flatt
- John Handelaar
- Khalid B
- Morbus Iff
- Nick Lewis