How to Mirror a Website?

Most modern websites are built with dynamic technologies such as ASP, CGI, and PHP. Sites like these can change their pages constantly, and they often generate URLs that are used only once. This makes mirroring an entire website tricky.

Moreover, a mirrored copy of a web page will not have the dynamic behavior of the original. If a website displays the current weather in Cleveland, mirroring its pages will only get you a frozen “snapshot” of the weather on that particular day.

That said, though, there are tools available that will help you mirror basic websites that don’t have these problems. The best-known of these is GNU wget, a free, open-source tool that can easily fetch an entire website with a single command. wget is not the friendliest tool in the world, but boy does it work!

Mirror a Website on Linux

You almost certainly have wget already. Try wget --help at the command line. If you get an error message, install wget with your Linux distribution’s package manager. Or fetch it from the official wget page and compile your own copy from the source.
Once you have wget installed correctly, the command line to mirror a website is:

wget -m -k -K -E http://url/of/web/site

See man wget or wget --help | more for a detailed explanation of each option.
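
As a reading aid, here is the same command spelled with wget’s long option names, wrapped in a small shell function. The function name mirror_site is made up for this sketch, and the comments are only a short summary of each flag, so check man wget for the full behavior on your version.

```shell
# mirror_site is a hypothetical helper name, not part of wget.
mirror_site() {
    url="$1"   # address of the site to mirror

    # --mirror            same as -m: recursive download with timestamping
    # --convert-links     same as -k: rewrite links so the copy works locally
    # --backup-converted  same as -K: keep a .orig copy of each converted file
    # --adjust-extension  same as -E: save HTML pages with an .html suffix
    #                     (older wget releases call this --html-extension)
    wget --mirror --convert-links --backup-converted --adjust-extension "$url"
}

# Example call: mirror_site http://url/of/web/site
```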

If this command seems to run forever, there may be parts of the site that generate an infinite series of different URLs. You can combat this in many ways, the simplest being to use the -l option to specify how many links “away” from the home page wget should travel. For instance, -l 3 will refuse to download pages more than three clicks away from the home page. You’ll have to experiment with different values for -l. Consult man wget for additional workarounds.
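
A minimal sketch of a depth-limited mirror follows. The function name mirror_with_depth and the default of 3 are assumptions for illustration. The -l option is placed after -m because -m by itself sets the depth to unlimited.

```shell
# mirror_with_depth is a hypothetical helper name, not part of wget.
mirror_with_depth() {
    url="$1"          # address of the site to mirror
    depth="${2:-3}"   # how many links away from the start page to follow;
                      # 3 is only a starting guess, adjust by experiment

    wget -m -l "$depth" -k -K -E "$url"
}

# Example call: mirror_with_depth http://url/of/web/site 3
```

If the download still never finishes, try a smaller depth before reaching for other workarounds.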

Please do note that some web servers may be set up to “punish” users who download too much, too fast. If you’re not careful, using tools like wget could get your IP address banned from the site. You can reduce this risk by using the -w option to specify a delay, in seconds, between page downloads. Usually, this will prevent the web server from viewing your behavior as unacceptable. But your mileage may vary!
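
Here is a rough sketch of a mirror that pauses between downloads. The function name polite_mirror and the two-second default are assumptions, not recommendations from any particular site, so pick a delay that suits the server you are fetching from.

```shell
# polite_mirror is a hypothetical helper name, not part of wget.
polite_mirror() {
    url="$1"          # address of the site to mirror
    delay="${2:-2}"   # seconds to wait between downloads; 2 is an arbitrary default

    wget -m -k -K -E -w "$delay" "$url"
}

# Example call: polite_mirror http://url/of/web/site 5
```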

Mirror a Website on Windows

If you are running Windows, here is how to install and use wget:

  1. Install wget for Windows
  2. Add the wget bin directory to your system’s PATH environment variable,
    so you can run it easily from the command line
  3. Run cmd.exe to bring up a command prompt (Windows button, type cmd.exe, Enter)
  4. By default, the command prompt will open in your user directory,
    which is where wget will save the downloaded files
  5. Run your wget commands
  6. Type “start .” (without the quotes) at the command prompt to
    open Windows Explorer and see your downloaded files

Final Thoughts: Mirror a Website

Publicly mirroring someone else’s website without their permission is, in most cases, a violation of copyright law. Don’t do that.

If you have received their permission, it’s easy to offer your mirror to the world. Just use the wget command to download the site to a directory inside your own website’s space. This is much easier if you have command line access to your own web server, so that you can run wget there directly. But you can also upload the mirrored site to your server by dragging and dropping it into your usual file transfer program after wget is finished.
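
One way to send the download straight into a chosen directory is wget’s -P option, which sets the directory that files are saved under. The sketch below assumes a helper named mirror_into, and the destination path in the example call is a placeholder, not a real location on your server.

```shell
# mirror_into is a hypothetical helper name, not part of wget.
mirror_into() {
    url="$1"    # address of the site to mirror
    dest="$2"   # directory to save into, e.g. a folder inside your web space

    wget -m -k -K -E -P "$dest" "$url"
}

# Example call (placeholder path): mirror_into http://url/of/web/site /path/to/your/webspace/mirror
```

wget creates a subdirectory named after the mirrored host inside the destination, so the mirrored pages end up one level below the path you pass in.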

If you do offer a mirror of another site, make sure you link to the original and explain to users that this is a mirror and not the original. Also, be sure to keep your mirror up to date. And once again, get the original site’s permission first!

Ryan Jacob: Ryan Jacob has 9 years of rich experience in Integrated Marketing Communications and Server Management. He has led teams of professionals in his career and built the online and offline reputations of organizations.