There are many ways to save individual web pages and entire web sites for offline viewing. The tools below cover Linux, Windows, and Mac OS X, though not every tool runs on every platform. If you are looking for a way to take screenshots of web pages instead, try this page.
Saving Web Pages for Offline Viewing with Firefox
Firefox has an extension called Scrapbook that saves web pages for offline viewing. Scrapbook also lets you edit the saved pages, so you can add notes, highlighting, inline annotations, and more. It is an excellent tool for research.
Saving Web Sites for Offline Viewing with Firefox and Spiderzilla
Spiderzilla is a Firefox extension that downloads entire web sites using an embedded version of HTTrack. It looks like you can still download Spiderzilla, but the extension may no longer be maintained. Still worth checking out.
Saving Web Sites with HTTrack
HTTrack is a classic tool for downloading entire web sites, or parts of web sites. Think carefully before you run it against someone else's site: mirroring a large site uses up a lot of their bandwidth. If you only need individual pages, use the Scrapbook Firefox extension, described above, instead.
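HTTrack also has a command-line version. Here is a minimal sketch of a command-line mirror, where the URL, output directory, and filter are placeholders you would replace with a site you have permission to copy:
# placeholders: swap in your own URL, output directory, and filter
httrack "http://www.example.com/" -O ./example-mirror "+*.example.com/*" -v
See man httrack for the full list of options and URL filters.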
Saving Web Pages with Lynx in the Terminal
Tip: To install Lynx on Ubuntu/Debian, type sudo apt-get install lynx. If you want to install Lynx on Windows, I recommend using Cygwin. I'm not sure if Lynx comes with Mac OS X, but if it isn't on your Mac you can get the Mac version here.
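If you aren't sure whether Lynx is already installed, you can check from the terminal:
which lynx
If that prints a path, Lynx is already on your system.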
Lynx is a text-based web browser. I previously wrote a Lynx tutorial that shows how to extract text from web pages. You can also use Lynx to capture just the text of multiple web pages at once. It's a bit messy, though, and I don't recommend it unless you specifically need plain-text copies of many pages. Here's how it works:
First make a test directory:
mkdir lynx_testing
Navigate into that directory:
cd ./lynx_testing
Start the crawl. Don't do this on other people's large web sites, because it can use up a lot of their bandwidth.
lynx -crawl -traversal "http://www.[yoursite].com"
You will then end up with a directory full of text files with a .dat file extension.
Tip: You can change the .dat file extensions to .txt with the following command (make sure you are in the right directory first):
rename -v 's/\.dat$/\.txt/' *.dat
Or remove the file extensions altogether with the following command:
rename -v 's/\.dat$//' *.dat
More about the rename command here
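If your version of the rename command doesn't accept Perl-style expressions, a plain shell loop (a small sketch, run from the same directory) does the same job:
for f in *.dat; do mv "$f" "${f%.dat}.txt"; done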
Assuming that you are leaving the .dat file extensions for now, this is a list of files and what they contain:
traverse.dat — This file contains a list of URLs that were spidered.
traverse2.dat — This file contains a list of the URLs along with their HTML page titles, listed in the order they were encountered.
lnk00000###.dat — Each spidered web page is saved in a numbered file with the HTML title and URL at the top. Lynx is a text browser, so these files contain only the text content of the pages; the HTML is stripped out. I've had trouble opening these files from Nautilus, but you can easily open them in the terminal with commands like gedit lnk00000001.dat or vim lnk00000001.dat.
Tip: There is more information on the files created with -traversal here
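To get a quick look at the page files without opening them one at a time, you can print the first few lines of each one:
head -n 5 lnk*.dat
head labels each file with its name, so you can skim the titles and URLs at a glance.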
If you want to combine all the pages of text into one file for searching with a visual text editor like gedit, SciTE, or Notepad, you can use the cat command like this:
cat * > MyFile.txt
That will create a file called MyFile.txt that contains all of the text from the files in the current directory.
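If you'd rather skip the traverse*.dat bookkeeping files and combine only the page text, narrow the wildcard (the output filename here is just an example):
cat lnk*.dat > pages.txt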
You can also search all of the files at once with grep. Navigate to the directory with the files that you want to search and type something like:
grep -i "your search terms" *
The -i will make it a case-insensitive search. For more information on grep, type man grep in the terminal.
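If you also want to see where each match occurs, add -n to print line numbers along with the matching file names, for example:
grep -in "your search terms" *.dat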
GNU Wget
Wget will be covered in an upcoming post.
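In the meantime, here is a rough sketch of a recursive download with Wget, where the URL is a placeholder for a site you have permission to mirror:
wget --mirror --convert-links --page-requisites --no-parent "http://www.example.com/"
See man wget for what each of those options does.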
Summary
For saving individual web pages, I recommend the Scrapbook Firefox extension. For downloading and saving entire web sites, I recommend HTTrack (but avoid using it on other people's large sites). Wget is great for selectively grabbing files from a web page or site. If you know of other good tools for saving web pages for offline viewing, leave a comment below.