Imagine you are a webmaster and you want to see how the pages of your website are getting indexed by a search engine. Or you want to take a backup of all the pages of your website as a search engine sees them. This edition of techblog is all about a solution that allows you to do this: HTTrack!
HTTrack is an open source solution. You can install it on GNU/Linux and on Windows. I have tried it on both platforms and found it to be extremely stable and reliable. Note that on Linux you need to download WebHTTrack, and on Windows you need WinHTTrack.
The Linux version gives you more options (especially many command-line ones!). First, though, we will see how this is done on Windows. You can download an executable binary from their website and install it on your machine.
Once you run the solution it will ask you to provide the details concerning the name of your website, URL and the directory where you want to store the downloaded files.
The solution asks its users to use the product fairly. You should not use it to download a website in order to steal its bandwidth or to copy copyrighted content. The solution allows you to set the browser ID (the name that will appear as the browser used to view the webpage, from the server's point of view), set a proxy, extract links, etc.
It also allows you to choose the MIME (Multipurpose Internet Mail Extensions) types that you want to download to your local system, and it can be used as a spider (indexing). If you set flow control and limits, the solution will use your system resources sparingly. This is recommended if you are using a low-performance system or a broadband service on a 'pay as you go' basis (say, if you have download limits).
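The same settings are also available from the command line. Here is a minimal sketch, assuming the long option names printed by httrack's built-in help (check 'httrack --help' on your version). The URL, the output directory, the numbers and the browser ID string are placeholders chosen for illustration, not recommended values. The function only prints the command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: every value below is a placeholder, adjust it to your own site.
polite_mirror() {
    # $1 = site URL, $2 = output directory
    #   --max-rate    maximum transfer rate, in bytes per second
    #   --sockets     number of simultaneous connections
    #   --max-size    stop after this many bytes overall
    #   --user-agent  the browser ID sent to the server
    set -- httrack "$1" -O "$2" \
        --max-rate=25000 \
        --sockets=2 \
        --max-size=50000000 \
        --user-agent "Mozilla/5.0 (compatible; my-backup)"

    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"                  # really run httrack
    else
        printf '%s\n' "$*"    # dry run: just show the command
    fi
}

polite_mirror "http://sitename.com/" "./sitename-mirror"
```

Run it once as a dry run to review the command, then again with DRY_RUN=0 once the limits suit your connection.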
Linux
Though you can download .deb and .rpm packages from different websites, I suggest you build the solution from source. You can use the following commands to build it (you may need to edit the version number, and make sure that the downloaded tarball is in your current directory; you can check this by issuing 'pwd' and using 'ls' to list the files):
tar xvfz httrack-*.tar.gz   # edit the version
cd httrack-*/               # edit the name properly
./configure --prefix=/usr/local
# as root
make && make install
Now you may create a launcher (desktop launcher) with the following as its path:
/usr/local/bin/httrack
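If your desktop follows the freedesktop.org launcher format, the launcher can be sketched as a small .desktop file. This is only an illustration: the file name, the Name and Comment values and the target directory are assumptions, and Terminal=true is set because httrack is a command line program.

```shell
#!/bin/sh
# Sketch: write a desktop launcher pointing at the installed binary.
write_launcher() {
    # $1 = directory that should receive the launcher file
    mkdir -p "$1" || return 1
    cat > "$1/httrack.desktop" <<'EOF'
[Desktop Entry]
Type=Application
Name=HTTrack
Comment=Website copier
Exec=/usr/local/bin/httrack
Terminal=true
EOF
}

# Per-user launcher directory used by many desktops (an assumption, check yours).
write_launcher "$HOME/.local/share/applications"
```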
Command line Tools
You can download the file at a given URL (without following the links in it) just by issuing this:
httrack --get http://sitename.com
If you want to get a particular file and pipe it to 'stdout', you may issue:
httrack --quiet --get http://sitename.com/ -O tmpget -V "cat \$0" | grep -iE "TITLE"
rm -rf tmpget
The following command mirrors the site and builds a searchable index of it, the way a search engine spider would:
httrack sitename.com -%I
If you want to download all HTML files placed under a particular folder, you can use the following as a filter:
+www.sitename.com/specificfolder/*.html
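On the command line the filter goes after the URL, as one more argument. A minimal sketch follows, with the start URL and the output directory as placeholders. The filter is quoted so that the shell does not try to expand the * itself, and the function only prints the command unless you set DRY_RUN=0.

```shell
#!/bin/sh
# Sketch only: URL, directory and filter are placeholders.
mirror_with_filter() {
    # $1 = start URL, $2 = output directory, $3 = filter
    set -- httrack "$1" -O "$2" "$3"

    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"                  # really run httrack
    else
        printf '%s\n' "$*"    # dry run: just show the command
    fi
}

mirror_with_filter "http://www.sitename.com/specificfolder/" \
    "./specificfolder-mirror" \
    "+www.sitename.com/specificfolder/*.html"
```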
Another interesting thing is that the solution can download parts of a website that are not indexed by search engines! You can do this by telling it to ignore the site's robots.txt file (but this method is not recommended).
Drawbacks
The solution has a few drawbacks:
- Flash-based websites are not completely supported
- CGI-based redirects are not handled properly
- Another main issue is that if a website has many JavaScript files linked in its HTML, or has many JSP pages, the solution may crash or stop indexing in the middle.
- If the HTML code is not well formed, the solution sometimes fails to parse it.
- SOCKS is not supported
But these drawbacks are largely outweighed by the unique features of the solution. I recommend that you download it from their website and try the software.

Hi, just a brief note, you need not run make as root. So,
./configure
make
su
**********
make install