Clustering of Data: Data Analysis and Mining


27 Feb  
This edition of TechBlog is about tools that can help you group your data. As you may have noticed, this article is filed under a new category, ‘Artificial Intelligence’. In this category you can expect posts related to Artificial Intelligence (AI) and Artificial Neural Networks (ANN). Some of them may require background knowledge in statistical learning, but most will be aimed at beginners.
 
Coming back to the topic: data mining is one of the major topics studied under machine learning. To analyse data properly, we need to organise it into groups. There are many ways of doing this. Popular ones include unsupervised learning algorithms from neural networks (such as self-organising maps and ART), which we will cover in upcoming editions.
 
In this article, we will study how to organise data using the K-means algorithm.
 
 
K-means Method
 
Hierarchical clustering is a method in which each item starts out as its own small group, and the groups are then repeatedly joined together to form a hierarchy. However, this is very resource intensive. The K-means method is a cheaper alternative. The only requirement is that you decide in advance how many groups you want to generate. The best way to explain the algorithm is with an example.
 
Consider the following diagram, where we have five data entities (A to E).
 
 
[Figure: the K-means method]
 
(Figure adapted from Segaran, T., Programming Collective Intelligence, p. 43)
 
Let’s assume that we need to organise them into two groups. The algorithm starts by placing two points at random positions in the space (denoted by the dark circles). Each entity then finds the point closest to itself. Thus, A and B select one point, while C, D and E select the other.
 
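Next, each of the two points moves to the average position of the entities that selected it (that is, one common central point for A and B, and another for C, D and E), as shown in block 3. ‘C’ now finds that it is actually closer to the central point of A and B, so it migrates to that group and new central points are computed. The process repeats until no entity changes group.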
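
The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the code from the book: the function name kmeans, the seed parameter, the use of Euclidean distance and the coordinates chosen for A to E are all assumptions made for the example.

```python
import random


def kmeans(points, k, iterations=100, seed=None):
    """Group 2-D points into k clusters; returns (centroids, assignments)."""
    rng = random.Random(seed)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Step 1: place k centroids at random positions inside the data's bounding box.
    centroids = [(rng.uniform(min(xs), max(xs)), rng.uniform(min(ys), max(ys)))
                 for _ in range(k)]
    assignments = None
    for _ in range(iterations):
        # Step 2: every point picks its nearest centroid (squared Euclidean distance).
        new_assignments = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # Stop as soon as no point changes group.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the average position of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # a centroid that attracted no points stays where it is
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids, assignments


# Hypothetical coordinates for the five entities A, B, C, D and E.
names = ["A", "B", "C", "D", "E"]
points = [(1.0, 1.0), (1.5, 2.0), (3.0, 3.0), (8.0, 8.0), (9.0, 9.5)]
centroids, groups = kmeans(points, k=2, seed=0)
for name, group in zip(names, groups):
    print(name, "-> group", group)
```

Because the starting positions are random, different seeds can give different groupings, and a centroid that starts far from every entity may end up with no members at all.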
 
 
Implementing the algorithm
 
You can refer to this page on O’Reilly to find the basic code needed for this tutorial. It places k centroids at random positions in the data set. Download and install the Python Imaging Library (PIL), which adds image processing capabilities to the Python interpreter. Then open a text editor and add the details of your favourite sites (in the format required by the code; see the O’Reilly page). You may use a parser to generate this file.
 
Now add the following lines to your main Python file:
 
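import clusters  # the code from the O'Reilly page, assumed to be saved as clusters.py
# Read the site names, the words and the data matrix from the file
sites, words, data = clusters.readfile('data.txt')
# Scale the data down to 2-D coordinates, then draw the sites as an image
points = clusters.scaledown(data)
clusters.draw2d(points, sites, jpeg='sites.jpg')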

 

Then run the program.

If you open the generated image file, you can see that the sites are grouped together as shown below:

 

[Figure: data clustering using the K-means algorithm]

 

For the diagram above, I picked some random blogs and used the same algorithm to group them.

Here the algorithm has automatically grouped the sites based on a similarity score, which it calculated using the Pearson distance between the sites (see the code for details). Thus we discovered relationships between the sites that were previously unknown!
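
To give a feel for that distance measure, here is a rough sketch of a Pearson distance, taken here as one minus the Pearson correlation of two vectors. The function name pearson_distance, the example word counts and the choice to return 1.0 when a vector has no variation are assumptions for illustration; the code on the O’Reilly page may differ in its details.

```python
from math import sqrt


def pearson_distance(v1, v2):
    """Return 1 - Pearson correlation: near 0 for similar vectors, up to 2 for opposite ones."""
    n = len(v1)
    mean1 = sum(v1) / n
    mean2 = sum(v2) / n
    # Deviations of every value from its vector's mean.
    d1 = [x - mean1 for x in v1]
    d2 = [y - mean2 for y in v2]
    numerator = sum(a * b for a, b in zip(d1, d2))
    denominator = sqrt(sum(a * a for a in d1) * sum(b * b for b in d2))
    if denominator == 0:
        # A constant vector has no defined correlation; treated as uncorrelated (assumption).
        return 1.0
    return 1.0 - numerator / denominator


# Hypothetical word counts for three sites over the same four words.
site_a = [10, 0, 3, 7]
site_b = [20, 1, 6, 15]
site_c = [0, 12, 9, 1]
print(pearson_distance(site_a, site_b))  # small: the sites use words in similar proportions
print(pearson_distance(site_a, site_c))  # larger: the sites use words very differently
```

Because it compares how the counts vary rather than their absolute size, this measure does not penalise a site simply for having more text than another.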
