With a growing web site, it becomes almost impossible to manually uncover all broken links. For WordPress blogs, you can install link checking plugins to automate the process. But, these plugins are resource intensive, and some web hosting companies (e.g., WPEngine) ban them outright. Alternatively, you may use web-based link checkers, such as Google Webmaster Tools and W3C. Generally, these tools lack the advanced features, for example, the use of regular expressions to filter URLs submitted for link checking.
I ran each tool on this very blog "Linux Commando" which, to date, has 149 posts and 693 comments.
linkchecker runs on both the command line and the GUI. To install the command line version on Debian/Ubuntu systems:
$ sudo apt-get install linkchecker
Link checking often results in too much output for the user to sift through. A best practice is to run an initial exploratory test to identify potential issues, and to gather information for constraining future tests. I ran the following command as an exploratory test against this blog. The output messages are streamed to both the screen and an output file named errors.csv. The output lines are in the semicolon-separated CSV format.
$ linkchecker -ocsv http://linuxcommando.blogspot.com/ | tee errors.csv
- By default, 10 threads are generated to process the URLs in parallel. The exploratory test resulted in many timeouts during connection attempts. To avoid timeouts, I limit subsequent runs to only 5 threads (-t5), and increase the timeout threshold from 60 to 90 seconds(--timeout=90).
- The exploratory test output was cluttered with warning messages such as access denied by robots.txt. For actual runs, we added the parameter --no-warnings to write only error messages.
- This blog contains monthly archive pages, e.g., 2014_06_01_archive.html, which link to all actual content pages posted during the month. To avoid duplicating effort to check the content pages, I specified the parameter --no-follow-url=archive\.html to skip archive pages. If needed, you can specify more than one such parameter.
- Embedded in the website are some external links which do not require link checking. For example, links to google.com. I can use the --ignore-url=google\.com parameter to specify a regular expression to filter them out. Note that, if needed, you can specify multiple occurrences of the parameter.
The revised command is as follows:
$ linkchecker -t5 --timeout=90 --no-warnings --no-follow-url=archive\.html --ignore-url=google\.com --ignore-url=blogger\.com -ocsv http://linuxcommando.blogspot.com/ | tee errors.csv
To visually inspect the output CSV file, open it using a spreadsheet program. Each link error is listed on a separate line, with the first 2 columns being the offending URLs and their parent URLs respectively.
Note that a bad URL can be reported multiple times in the file, often non-consecutively. One such URL is http://doncbex.myopenid.com/(highlighted in red). To make easier the inspection and analysis of the broken URLs, sort the lines by the first, i.e. URL, column.
A closer examination revealed that many broken URLs were not URLs I inserted in my website (including the red ones). So, where do they come from? To solve the mystery, I looked up their parent URLs. Lo and behold, those broken links were actually URL identifiers of the comment authors. Over time, some of those URLs had become obsolete. Because they were genuine comments, and provided value, I decided to keep them.
linkchecker did find 5 true broken links that needed fixing.
If you prefer not to use the command line interface, linkchecker has a front-end which you can install like this:
$ sudo apt-get install linkchecker-gui
Not all parameters are available on the front-end for you to directly modify. If a parameter is not on the GUI, such as skip warning messages, you need to edit the linkchecker configuration file. This is inconvenient, and a potential source of human error. Another missing feature is that you cannot suspend operation once the link checking is in progress.