Wednesday, 18 January 2006

Automatic categories & Technorati tags 1: principles






This 2-parter describes one possible way to include categories for your blog posts automatically with Technorati tags, if you're using a blogging platform (most notoriously Blogger) which doesn't support categories. (Previously I explained how to implement a manual categories method which is quick and easy for users but involves much more ongoing work for the blogger, though CoLT makes that a lot easier.)

Clearly many of us have been thinking along similar lines lately, e.g. see my fellow Web Corante Hub contributor John Tropea's recent post. A popular semi-automatic method has been to use Del.icio.us (see e.g. John of Freshblog's post), but I was too lazy to go and tag all my old posts on Delicious, so I never tried that out myself. This new method, which I've been trying out since I discovered that Technorati have introduced much more powerful tag searching, should involve much less work going forward (always a good thing in my book), and it can even pick up your old tagged posts, but it still takes some time and thought to set up.

If you're curious, you can see the new system in action on this test blog (opens in a new window). It looks much like the manual system at the right hand side of this page, but peek under the skirt (as some would say!) and it's quite different. Why am I not using it on my main blog yet? Because I want to see how it goes in terms of speed and reliability and accuracy from day to day and different times of the day, and I don't want to risk slowing down my blog for my readers or confusing them with error messages by having it there until I'm completely happy it will work smoothly.

I'll outline the basic principles behind this method with some of its pros and cons, and then provide a practical step by step guide on how to implement it on your own blog. (I'll probably split the howto out into a separate post for length reasons, as there are some twists involving how Technorati seem to have changed things on their site recently, which can hide certain things you need - I'll explain how to find them.)

Concept

This system depends on searches of Technorati's tag pages and then conversion of the tag search feed to Javascript which is inserted into the blog template to display the search results, i.e. show a list of posts, under your chosen category headings on your own blog (rather than opening a categories page on Technorati or on Del.icio.us). So you can categorise even your old posts with this method if you've tagged them appropriately in the past (including with meblogging tags), without any re-editing at all, and a post can easily be filed in more than one category, again just by tagging it appropriately.
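To make the feed-to-Javascript idea concrete, here is a minimal sketch in Python of what a converter roughly does. It is not the converter I used, and the function name, the sample feed and the assumption of a plain RSS 2.0 layout (channel/item/title/link) are all made up for illustration.

```python
import json
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample feed, standing in for a tag search feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
<title>Tag search</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Tom &amp; Jerry</title><link>http://example.com/2</link></item>
</channel></rss>"""


def feed_to_javascript(rss_xml: str, max_items: int = 20) -> str:
    """Return JavaScript source that writes the feed items as an HTML list."""
    root = ET.fromstring(rss_xml)
    lines = ['document.write("<ul>");']
    # Assumes RSS 2.0: items live under channel, each with a title and a link.
    for item in root.findall("./channel/item")[:max_items]:
        title = item.findtext("title", default="(untitled)")
        link = item.findtext("link", default="#")
        html = '<li><a href="{}">{}</a></li>'.format(
            escape(link, quote=True), escape(title)
        )
        # json.dumps produces a safely quoted JavaScript string literal.
        lines.append("document.write({});".format(json.dumps(html)))
    lines.append('document.write("</ul>");')
    return "\n".join(lines)


if __name__ == "__main__":
    print(feed_to_javascript(SAMPLE_FEED))
```

The blog template would then pull in that generated script under each category heading, one script per category.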

Tagging and tag searching

You need to spend time thinking carefully about what categories you're going to have and which tags a post must carry to be listed in each category. If you've not been tagging with consistent keywords in the past, you also need to work out how to construct your search so as to pick up previous posts - which means combing through your old posts, checking what tags you used for them and trying out various search combinations. And of course you need to be consistent in tagging future posts that you intend to file in a particular category, though that'll be a heck of a lot easier than figuring out how to rope in all your old posts. You won't be surprised to know I've been working on this on and off for a couple of weeks, even though I've only got blog posts going back about a year.

Also note that because this system uses Technorati tag searches, someone else's posts from another blog could get listed in your categories if they've used exactly the same tag combo you've set your search up for (that's why meblogging tags are a good thing, and this system relies on them - and also assumes others won't use your meblogging tags, not much anyway!).

Reliance on Technorati

If a post isn't on Technorati's tag pages, e.g. because you didn't tag it with the right tag or Technorati's tag pages haven't clocked it for whatever reason (which is a common problem for many blogs), it won't get categorised automatically. (Obviously, new posts won't be listed in your categories until spidered by Technorati, which should only take hours or less these days.) So you still need to keep an eye on Technorati's tag pages periodically (e.g. do a tag search) to check your new posts have been properly picked up on there, and sort it out with Technorati support if not (or just add the missed posts manually, see below) - which is counter to the "automatic" aim, but at least that's all the ongoing work you'll have to do, and it's easier than manually hardcoding in every single one of your new posts. If Technorati could only solve the issues that stop them adding all tagged posts to the right tag pages consistently without fail, and beef up their systems so that users will never again see the dreaded "too many searches try again later" message instead of the desired search results, this method would be so much easier, and indeed it would become my own personal favourite.

The Technorati dependence also means that if Technorati is temporarily out of action, or slow, or not producing any search results because of too many other searches, that will affect the speed of loading of your blog and the display of categories lists on your blog (e.g. if the Technorati search or feed is playing up). If Technorati's down, who knows if that might even stop your blog pages from displaying altogether (as happened with blogs or posts linking to MP3s which incorporated Del.icio.us's Playtagger, when Delicious's servers went down) - though I think Javascript errors popping up, see below, are more likely than your pages not loading at all.

The main problem with using Technorati so far as I can see, apart from Technorati not always showing my tagged posts on their tag pages, is that sometimes the searches are too slow, or, even worse, about 1 in every 4 or 5 searches on Technorati (in my experience) just produces no results but a "try again later" because the server's too slow to respond or too busy with other searches - which means that instead of a nice neat list in the sidebar, the reader sees unfriendly Javascript errors. Though refreshing/reloading the page usually sorts it (and hence I changed the heading in the test blog sidebar to suggest it), I'm obviously reluctant to put my readers through that experience; people unfamiliar with an error like that might well leave my blog instead of refreshing. (Incidentally, I think cracking scalability, maintaining the ability to cope with ever-increasing numbers of searches, is going to be a critical issue for all search engines generally, not just Technorati - I've even started seeing "can't find server" type results on Google searches, though less frequently for Google searches, which are after all their bread and butter, than when trying to view Blogspot blogs, which I'm guessing Google host on less heavy duty servers.)

As the category list simply reflects the posts shown on your Technorati tag search results page, the Javascript will show only the 20 most recent posts for a category because that's the maximum number of posts Technorati currently list on their results page in Internet Explorer (but there's an odd discrepancy with Firefox, where it's only 10, which I'll go into in my later howto post). However, I've done a manual tweak for each category that has more than 20 posts by providing a hardcoded "...more" link to open Technorati's page 2 for that tag search in a new window - but only for those categories, as for ones with fewer posts that format of link would just give zero results on Technorati and a horrid screen for users. (It would be possible to write a script to count the results and produce a "...more" link to the correct page for that category on Technorati only where there are more than 20 results, but it's more thinking and trouble than it's worth for me, so I haven't - if I had lots more categories it would be more useful, as it is it's just easier to add the "more" link in manually when my remaining few "small" categories hit 20 posts.)
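For anyone who does fancy scripting the "...more" link, the logic might look something like this Python sketch. It's only an illustration: the function name is hypothetical, how you obtain the total number of results is left open, and the page 2 URL is passed in by the caller because its exact format isn't covered here.

```python
from html import escape
from typing import Optional

PAGE_SIZE = 20  # maximum posts listed per results page, as described above


def more_link(total_results: int, page_two_url: str,
              page_size: int = PAGE_SIZE) -> Optional[str]:
    """Return a '...more' link only when the category overflows one page."""
    if total_results <= page_size:
        # Everything fits on page 1, so a page 2 link would show zero results.
        return None
    # target="_blank" opens the link in a new window.
    return '<a href="{}" target="_blank">...more</a>'.format(
        escape(page_two_url, quote=True)
    )
```

With a placeholder address, `more_link(35, "http://example.com/page2")` gives a link, while `more_link(12, "http://example.com/page2")` gives `None`, so small categories get no link at all.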

RSS to Javascript conversion: a feed is essential

For displaying category lists on your own blog (as opposed to just having a category heading linking to a page on Technorati listing those posts, which is possible if that's as far as you want to go), I used a free feed to Javascript converter (there are lots around). Which means this system only works for searches with feeds, like Technorati tag searches (http://www.technorati.com/tag/whatever). You can't use plain Technorati search results (http://www.technorati.com/search/whatever) as Technorati offer no feeds for those, unless you create a watchlist, which is extra trouble I couldn't be bothered to go to (also, standard searches as opposed to tag searches just aren't as precise if you're trying to limit it to just posts from your blog).

More unfortunately, this system won't work for tag searches which include "user=yourTechnoratiusername", e.g. http://www.technorati.com/tag/BBC?user=improbulus (a feature pointed out by John Tropea in December, whose possibilities have also been spotted by e.g. Unrest Cure). This is because, sadly, Technorati don't yet provide feeds for those kinds of searches (or for "from=blogURL" searches such as http://www.technorati.com/tag/Technorati?from=consumingexperience.blogspot.com). If and when Technorati introduce feeds for tag searches using user= or from=, that would be ideal because you wouldn't get other people's posts popping up in your categories even if they used your meblogging tag, as you'd use "user=yourusername" or "from=yourURL" instead of the meblogging tag. (Dear Technorati... pretty please?) You can still use these kinds of searches, if you like, to provide a link to a separate categories page on Technorati's site - you just can't use them to list out your category posts on your own blog, not with simple RSS to Javascript conversion anyway.
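The different search URLs mentioned above can be put together with a small helper, sketched here in Python. The function names are hypothetical, and percent-encoding the tag is my assumption for anything beyond a simple one-word tag; the user= and from= forms simply follow the example URLs above.

```python
from typing import Optional
from urllib.parse import quote, urlencode

TAG_BASE = "http://www.technorati.com/tag/"


def tag_search_url(tag: str, user: Optional[str] = None,
                   from_blog: Optional[str] = None) -> str:
    """Build a tag search URL, optionally narrowed by user= or from=."""
    # Assumption: unusual characters in the tag are percent-encoded.
    url = TAG_BASE + quote(tag, safe="")
    params = {}
    if user:
        params["user"] = user
    if from_blog:
        params["from"] = from_blog
    if params:
        url += "?" + urlencode(params)
    return url


def can_list_on_blog(user: Optional[str] = None,
                     from_blog: Optional[str] = None) -> bool:
    """Per the text above, only plain tag searches come with a feed."""
    return not (user or from_blog)


if __name__ == "__main__":
    print(tag_search_url("BBC", user="improbulus"))
```

So a search narrowed with user= or from= is fine for a link out to a categories page on Technorati, but `can_list_on_blog` returns False for it, as there's no feed to convert.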

Again, because this system depends on a feed to Javascript converter, your blog pages will load more slowly than usual because of waiting for the scripts to do their thing (remember it will be fetching the search results for each category separately, and the more categories you have the longer it will take). You can see that the pages on my test blog (new window) open at a rather more leisurely pace than on my normal blog (though hopefully not unusably so) - the sidebar appears only gradually, although at least it doesn't stop the main text of the post from being fully readable from the start, even with a long post.

Furthermore, again this also means that if the site you use for the conversion is slow, or down completely, then your categories list or even your whole blog could be affected, e.g. if the converter somehow doesn't pick up the Technorati feed. One major downside of this system is that it relies on two separate services, Technorati and the feed converter both, so if even one of them is up the spout then your blog could be scuppered.
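One way to soften that dependence, at least in principle, is to fall back to a friendly message whenever either service misbehaves, rather than letting errors reach the reader. The Python sketch below shows just the idea; the names are hypothetical, the fetching and rendering steps are passed in as functions, and it is not how the free converter I used behaves.

```python
from typing import Callable

FALLBACK_HTML = (
    "<p>Category list unavailable right now - please refresh the page.</p>"
)


def render_category(fetch_feed: Callable[[], str],
                    render: Callable[[str], str]) -> str:
    """Render a category list, or a friendly message if anything fails."""
    try:
        # Either step can fail: the search may time out or return an error
        # page, and the conversion may choke on an unexpected response.
        return render(fetch_feed())
    except Exception:
        # Deliberately broad for this sketch: any failure shows the fallback.
        return FALLBACK_HTML
```

The point of the design is that the two weak links are wrapped in one place, so a slow or busy search shows the reader a hint to refresh instead of a raw error.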

Combine with manual categories?

However a major benefit is that you can combine this method with manual listings of posts, so you have a lot of control. I've done that in my test blog (new window), which still makes use of my show/hide method.

So for some categories like the Technorati one, I display my most popular or personal fave posts first by manually hardcoding them, then I insert the Javascript for that category (which lists out the most recent posts for that category, in reverse chronological order), then (again manually) I hardcode the list of posts which should be in that category but which Technorati wouldn't pick up. (Yes, this takes some work, comparing Technorati's tag pages for your combo search against your universe of posts to see what's been missed out and then adding them in by hand.) You may see some duplication of posts - e.g. because they weren't on Technorati's tag pages when I was checking it, so I hardcoded them in, but now they're appearing... still, better twice than never.
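If the duplicates bother you, the three lists could in principle be merged by a script that skips any post already shown. This Python sketch is an assumption about how that might look, not something running on my test blog; posts are represented as hypothetical (title, URL) pairs and matched on URL.

```python
from typing import Iterable, List, Tuple

Post = Tuple[str, str]  # (title, url)


def merge_category(favourites: Iterable[Post],
                   automatic: Iterable[Post],
                   missed: Iterable[Post]) -> List[Post]:
    """Combine the three lists in display order, dropping repeated URLs."""
    seen = set()
    merged = []
    for title, url in [*favourites, *automatic, *missed]:
        key = url.rstrip("/")  # treat a trailing slash as the same post
        if key in seen:
            continue
        seen.add(key)
        merged.append((title, url))
    return merged
```

The order is kept as described above: hand-picked favourites first, then the automatic list, then the posts added by hand because the tag pages missed them, with a post that later turns up in the automatic list shown only once.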

Why Technorati?

Now some of you may wonder why you need to use Technorati for this - why not e.g. rival blogosphere search engine Icerocket? Icerocket have their own tag pages, called "blog topics", with tag and author combo searching and a feed for every conceivable search result.

But - they only introduced tag pages around mid-2005, so they haven't got as many of my posts on their tag pages as Technorati do, despite my (unsuccessful) attempt to get more tagged posts on Icerocket; their multiple word tag searching and AND/OR tag combo searching doesn't work as expected, from my tests; plus, on the can't beat 'em join 'em front (that's me being all dry and sarky), Icerocket have also started having problems crawling/indexing my blog (all my posts for the second half of December 2005 are missing from Icerocket for example) - and the situation is worse with Icerocket than with Technorati. With Technorati, while a number of my posts are missing from their tag pages, at least they're still there somewhere on their database, because you can find them via a simple search on Technorati if not a tag search - whereas with Icerocket, my posts just haven't been indexed on there at all; they're not there, period. So personally I'm sticking with Technorati.

If your blog is newish and you're sure all your tagged posts are on Icerocket, feel free to use them instead if you want to (their searches do seem to be faster at the moment); but I'm only going to provide a detailed howto on Technorati in the second part of this tutorial.

So those are the principles behind this automatic categories system. In part 2 I'll outline the practical steps involved in implementing this system, and go into more detail on how to address some of the cons I described above, with some advice and tips on the different stages.

