An offline internet: HTTrack


22 Nov  

 

Imagine you are a webmaster and you want to see how the pages of your website are being indexed by a search engine. Or you want to take a backup of all the pages of your website as a search engine sees them. This edition of techblog is all about a tool that lets you do this: HTTrack!

 

HTTrack is an open source tool. You can install it on both GNU/Linux and Windows. I have tried it on both platforms and found it extremely stable and reliable. Note that on Linux you need to download WebHTTrack, and on Windows you need WinHTTrack.

 

HTTrack startup 

 

The Linux version gives you more options (especially command-line ones!), but first we will see how it works on Windows. You can download an executable binary from the HTTrack website and install it on your machine.

When you run the program, it asks for the name of your website, its URL and the directory where you want to store the downloaded files.

 

HTTrack - project setup

 

The tool asks its users to use it fairly. You should not use it to drain a website's bandwidth or to copy copyrighted content. It lets you set the browser ID (the name that the server sees as the browser viewing the page), set a proxy, extract links, and so on.

It also lets you choose the MIME (Multipurpose Internet Mail Extensions) types that you want to download to your local system, and it can be used as a spider (for indexing). If you set flow control and limits, you can make the best use of your system resources. This is recommended if you are using a low-performance system or a 'pay as you go' broadband service (say, one with download limits).

 

options in WinHTTrack
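
The same kind of limits can also be set from the command line. The sketch below only prints a command rather than running it. The URL, the output directory, the limit values and the browser ID "MyBackupBot/1.0" are all placeholders chosen for illustration. The options used are -c (number of connections), -A (maximum transfer rate in bytes per second), -M (maximum overall size in bytes) and -F (browser ID), as I understand them from the HTTrack manual.

```shell
#!/bin/sh
# Sketch: print (not run) an httrack command line with conservative limits.
# All values below are illustrative placeholders, not recommendations.
polite_httrack_cmd() {
    url=$1
    outdir=$2
    # -c2          at most 2 simultaneous connections
    # -A25000      cap the transfer rate at about 25 KB/s
    # -M100000000  stop after roughly 100 MB in total
    # -F "..."     browser ID (user-agent) sent to the server
    printf 'httrack %s -O %s -c2 -A25000 -M100000000 -F "%s"\n' \
        "$url" "$outdir" "MyBackupBot/1.0"
}

polite_httrack_cmd http://sitename.com ./sitename-mirror
```

Copy the printed line into a terminal once you have adjusted the values to suit your own connection.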

 

Linux

Though you can download .deb and .rpm packages from various websites, I suggest building the tool from source. You can use the following commands to build it (you may need to edit the version number, and make sure that the downloaded tarball is in your current directory; you can check this by issuing 'pwd' and using 'ls' to list the files).

 

   tar xvfz httrack-*.tar.gz   # edit the version
   cd httrack-*/               # edit the name if needed; the trailing / matches only the directory
   ./configure --prefix=/usr/local
   make
   # only the install step needs root
   make install

 

Now you may create a desktop launcher with the following as its path:

/usr/local/bin/httrack
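
One possible way to create such a launcher from a script is sketched below. It assumes a desktop environment that reads freedesktop-style .desktop files. The file name httrack.desktop, the Comment text and the target directory are my own choices. Terminal=true is set because httrack is a command-line program.

```shell
#!/bin/sh
# Sketch: write a minimal desktop launcher for the httrack binary.
# The file name and the destination directory are assumptions.
write_httrack_launcher() {
    dest_dir=$1
    mkdir -p "$dest_dir" || return 1
    cat > "$dest_dir/httrack.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=HTTrack
Comment=Website copier
Exec=/usr/local/bin/httrack
Terminal=true
EOF
}

# Example: install the launcher for the current user only.
# write_httrack_launcher "$HOME/.local/share/applications"
```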

 

Command line Tools

 

You can fetch just the files at the given URL, without following further links, by issuing this:

httrack --get http://sitename.com
 

If you want to get a particular file and pipe it to the ‘stdout’, you may issue:

 

httrack --quiet --get http://sitename.com/ -O tmpget -V "cat \$0" | grep -iE "TITLE"
rm -rf tmpget

 

The following command mirrors the site and builds a searchable index of its pages, much as a search engine spider would:

 

httrack sitename.com -%I

 

If you want to download all HTML files under a particular folder, you can use the following as a filter:

 

+www.sitename.com/specificfolder/*.html
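
On the command line, a filter like this is passed as an extra argument after the URL. The sketch below prints such a command instead of running it. The host, folder and output directory are placeholders. The filter is quoted so that the shell does not expand the * before httrack sees it.

```shell
#!/bin/sh
# Sketch: print (not run) an httrack command that adds a folder-specific
# .html filter. Host, folder and output directory are placeholders.
folder_filter_cmd() {
    host=$1
    folder=$2
    outdir=$3
    # Quoting the filter keeps the shell from expanding the * itself.
    printf 'httrack "http://%s/%s/" -O "%s" "+%s/%s/*.html"\n' \
        "$host" "$folder" "$outdir" "$host" "$folder"
}

folder_filter_cmd www.sitename.com specificfolder ./mirror
```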

 

Another interesting thing is that the tool can download parts of a website that search engines do not index! You can do this by telling it to ignore the site's robots.txt file (but this is not recommended).

 

Drawbacks

The tool has a few drawbacks:

  • Flash-based websites are not completely supported.
  • CGI-based redirects are not handled properly.
  • If a website links many JavaScript files from its HTML or has many JSP pages, the tool may crash or stop indexing midway.
  • If the HTML is malformed, the tool sometimes fails to parse it.
  • SOCKS proxies are not supported.

 

But these drawbacks are largely outweighed by the tool's unique features. I recommend that you download it from the HTTrack website and try it.


Comments (1)

 

  1. kurtdriver says:

    Hi, just a brief note, you need not run make as root. So,
    ./configure
    make
    su
    **********
    make install
