How to verify a website migration - testing a sitemap

by Rob O'Leary on 04 Jan 2023

I migrated my website from Jekyll to Eleventy (11ty) recently. I wanted to preserve the same URLs for the vast majority of my webpages.

Since Jekyll and 11ty have different schemes for creating URLs out of file paths, I had to do a bit of work to get 11ty to replicate what Jekyll does. I wanted to test that I did this successfully. Here is how I did that.

How can you test that the URLs of a website have not changed?

I was producing a sitemap for my website. A sitemap is kind of like a public directory of your website. It is an XML file that lists the URLs for a website. It is used to tell search engines which pages they can crawl.

If you don’t have a sitemap, you could use a crawler to discover the URLs of your website. You can use a command-line tool such as wget to behave like a web crawler, or find a dedicated crawler tool. Alternatively, some SEO tools will crawl your website and check its links. You may be able to generate a list of links from the report. However, it is a challenge to find a free SEO tool that does this and does not have some signup shenanigans.
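If you go the wget route, a rough sketch might look like the following. It assumes GNU wget is installed. The function names `crawl_site` and `urls_from_log` and the log file name `crawl.log` are made up for illustration. The idea is to run wget in spider mode, so that it follows links without saving pages, and then pull the unique URLs out of its log.

```shell
#!/bin/sh

# Crawl a site without saving pages; wget records what it visits in a log file.
# Assumes GNU wget. The default log file name is a placeholder.
crawl_site() {
    wget --spider --recursive --no-verbose --output-file="${2:-crawl.log}" "$1"
}

# Print the unique URLs mentioned in a wget log, one per line.
urls_from_log() {
    grep -Eo 'https?://[^ ]+' "$1" | sort -u
}

# Example (this hits the network):
#   crawl_site https://www.example.com/
#   urls_from_log crawl.log > urls.txt
```

The log will also mention assets such as images and stylesheets, so you may want to filter the list down to pages afterwards.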

Here is an example of a sitemap:

XML

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>https://www.example.com/blog/markdown-with-components.html</loc>
		<lastmod>2022-12-23T22:00:47+00:00</lastmod>
	</url>
	<url>
		<loc>https://www.example.com/blog/typescript-interfaces.html</loc>
		<lastmod>2022-12-23T22:00:47+00:00</lastmod>
	</url>
</urlset>

The loc field contains the URL of each webpage of the website.

There are 2 ways you can use this to test if your new website and old website have the same URLs.

You can compare the sitemaps of both versions of your website. You could do this manually using a visual diff tool, or programmatically by checking that the loc fields of both files contain the same URLs. I do it manually in the next section.

You could wait until you have published the new version of your site, and then execute HTTP requests on the URLs sourced from the old sitemap to see they are active webpages.

Manual visual verification

You can use a file diff tool to compare the files side by side. I used Meld on Linux to do this. If you are a Windows user, WinMerge is a good, open-source visual diff tool.

[Image: visual comparison of sitemap files in the Meld visual diff tool]

The order of the url nodes does not have to be the same in both files. Meld is smart enough to scan both files to see if there is a match. Also, you need not worry about whitespace.

One interesting thing that this revealed to me was that the filenames of a few of my blog posts had blank spaces in them. This was not intentional! I had not noticed because when Jekyll creates a URL for a file, it replaces a blank space with a hyphen.

11ty, on the other hand, puts %20 in the URL wherever the filename has a blank space. %20 is the URL-encoded equivalent of a blank space. So, I needed to replace the blank spaces in the filenames myself to get the same outcome.
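One way to do that in bulk is a small shell loop. This is only a sketch: the function name `hyphenate_filenames` and the `posts` directory are placeholders, and it assumes that swapping each space for a single hyphen is what you want.

```shell
#!/bin/sh

# Rename every file in a directory whose name contains a space,
# replacing each space with a hyphen. Only the top level is handled.
hyphenate_filenames() {
    dir="$1"
    for path in "$dir"/*' '*; do
        [ -e "$path" ] || continue       # the glob matched nothing
        name=$(basename "$path")
        new_name=$(printf '%s' "$name" | tr ' ' '-')
        mv -n "$path" "$dir/$new_name"   # -n: do not overwrite an existing file
    done
}

# Example: hyphenate_filenames posts
```

The `-n` flag means that a file is left alone if the hyphenated name is already taken, so nothing gets overwritten by accident.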

Aside from that, all of the URLs looked good. So, I did not need to do more on this front.

If my website were a lot bigger, or if this had been a messier process, I would have needed to be more methodical. I would have written a script to generate a list of the divergent URLs. Some diff tools have shell integrations that can be used to create a list of differences on the command line if you need to go down that route.
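A script along those lines could be quite small. The sketch below assumes that each loc element sits on a single line, as in the example sitemap above, and pulls the URLs out with grep and sed rather than a real XML parser. The names `sitemap_urls`, `missing_urls`, `old-sitemap.xml` and `new-sitemap.xml` are placeholders.

```shell
#!/bin/sh

# Print the URLs from a sitemap, sorted, one per line.
# Assumes each <loc>...</loc> element is on a single line.
sitemap_urls() {
    grep -o '<loc>[^<]*</loc>' "$1" | sed -e 's/<loc>//' -e 's/<\/loc>//' | sort
}

# Print the URLs that are in the old sitemap but not in the new one.
missing_urls() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    sitemap_urls "$1" > "$old_list"
    sitemap_urls "$2" > "$new_list"
    comm -23 "$old_list" "$new_list"   # keep lines unique to the first file
    rm -f "$old_list" "$new_list"
}

# Example: missing_urls old-sitemap.xml new-sitemap.xml
```

Swapping the two arguments lists the URLs that only exist in the new sitemap.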

Testing every link from the sitemap xml file

Since the visual inspection turned up nothing else, I was happy to publish the 11ty version of my website. Before I hit publish, I wrote a bash script to test that the live website had all of the URLs I expected. A sanity check, if you will!

Here is the script, which I named sitetest.

Bash

#!/bin/bash

function _help() {
    echo "Description: Test the links in a sitemap XML file to see if they are active webpages. It produces a CSV file with the URL and HTTP status code of each link. By default, it will write to a file named 'output.csv'."
    echo ""
    echo "Usage: sitetest [sitemap file] [output file (optional)]"
}

function _test(){
	echo "Testing your website now"

	infile=$1
	outfile="output.csv"

	if [[ -n "$2" ]]; then
		outfile="$2"
	fi

	# remove the output file if it already exists
	rm -f "$outfile"

	output=$(xmllint --xpath "//*[local-name()='loc']/text()" "$infile")

	errors=0
	counter=0

	echo "URL,HTTP Status Code" >> "$outfile"

	for link in $output; do
		echo -n "."

		code=$(curl -o /dev/null --silent --head --write-out '%{http_code}\n' "$link")
		echo "$link,$code" >> "$outfile"

		((counter+=1))

		if [ "$code" != "200" ]; then
			((errors+=1))
		fi

		# print a comma after every tenth request as a progress marker
		if [[ $((counter % 10)) == 0 ]]; then
			echo -n ","
		fi

		# pause between requests so that we do not hammer the server
		sleep 1s
	done

	printf "\nLinks: %d" "$counter"
	printf "\nErrors: %d\n" "$errors"
}

case "$#" in
    0)
        _help
        ;;
    1)
        _test "$1"
        ;;
    2)
        _test "$1" "$2"
        ;;
    *)
        # too many arguments: show the usage and signal an error
        _help
        exit 1
        ;;
esac

To give a brief background, there are 2 command-line tools I use to do the heavy lifting:

xmllint: xmllint is an XML tool. It parses XML files, and can be used to extract fields from the XML document.
curl: curl is a tool to transfer data from or to a server, using one of the supported protocols such as HTTP. We can use curl to issue HTTP requests to see if the webpage exists.

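Both of these tools are widely available on Unix-like systems such as Linux. curl is usually installed by default. On some distributions, xmllint comes in a separate package (libxml2-utils on Debian and Ubuntu, for example).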
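If you want a script to fail early when a tool is missing, a check along these lines could go at the top. It is a sketch only; the function name `require_tools` is hypothetical and is not part of sitetest.

```shell
#!/bin/sh

# Return 0 if every named command is on the PATH; otherwise report the
# missing ones on stderr and return 1.
require_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" > /dev/null 2>&1; then
            echo "Missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: require_tools xmllint curl || exit 1
```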

I won’t step through the entire script, but it is worth pointing out how I use the 2 tools mentioned above. It is probably most illuminating to show the results for a short excerpt of the sitemap.xml of my website. Don’t spam my website please!

XML

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
	<url>
		<loc>https://www.roboleary.net/quotes/2017/07/30/quotes.html</loc>
		<lastmod>2017-07-30T00:00:00+01:00</lastmod>
	</url>
	<url>
		<loc>https://www.roboleary.net/java/2017/11/24/java-dates.html</loc>
		<lastmod>2017-11-24T00:00:00+00:00</lastmod>
	</url>
	<url>
		<loc>https://www.roboleary.net/programming/2018/04/25/java-enums.html</loc>
		<lastmod>2018-04-25T00:00:00+01:00</lastmod>
	</url>
</urlset>

With xmllint, you can use an XPath expression to select parts of an XML document. If you are not familiar with XPath, it is similar to using the JavaScript Document.querySelectorAll function to extract parts of an HTML document. I use an XPath expression to retrieve all of the loc nodes and extract their text content. The output from xmllint can be added to an array and used later to process the links.

Here is the specific command on my example sitemap.xml:

terminal

xmllint --xpath "//*[local-name()='loc']/text()" sitemap.xml
https://www.roboleary.net/quotes/2017/07/30/quotes.html
https://www.roboleary.net/java/2017/11/24/java-dates.html
https://www.roboleary.net/programming/2018/04/25/java-enums.html
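The script loops over this output with `for link in $output`, which relies on the shell splitting the text on whitespace. A slightly more defensive variation, sketched below, reads the output one line at a time. The function name `each_link` is made up, and the `echo` stands in for the real work.

```shell
#!/bin/sh

# Read URLs from standard input, one per line, skipping blank lines.
# Prints each URL prefixed with its position in the list.
each_link() {
    count=0
    # the "|| [ -n ... ]" part handles a final line with no trailing newline
    while IFS= read -r link || [ -n "$link" ]; do
        [ -n "$link" ] || continue
        count=$((count + 1))
        echo "$count $link"
    done
}

# Example:
#   xmllint --xpath "//*[local-name()='loc']/text()" sitemap.xml | each_link
```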

With curl, I issue an HTTP request to each of the links received from xmllint. There is a nice trick with curl that lets you write out just the HTTP status code; it is discussed in this Stack Overflow question.

terminal

curl -o /dev/null --silent --head --write-out '%{http_code}\n' https://www.roboleary.net/quotes/2017/07/30/quotes.html
404

This HTTP response returns a status code of 404. Therefore, this URL is no longer a live webpage.

That’s the core of the script.
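The script counts any status other than 200 as an error. After a migration you may also see redirects, which are not necessarily broken links. The sketch below groups status codes into broad classes so that they can be reported separately. The function name `classify_status` and the labels are hypothetical; they are not part of sitetest.

```shell
#!/bin/sh

# Map an HTTP status code to a rough category.
classify_status() {
    case "$1" in
        2??) echo "ok" ;;
        3??) echo "redirect" ;;
        4??|5??) echo "broken" ;;
        000) echo "no-response" ;;   # curl writes 000 when it gets no HTTP response
        *) echo "unknown" ;;
    esac
}

# Example: classify_status "$code"
```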

To run the script, you must provide a sitemap file as the first parameter. Optionally, you can name the output file by providing a second parameter. If the second parameter is not provided, the output file will be named output.csv, and will be written to the directory that the script is run in.

terminal

./sitetest sitemap.xml
Testing your website now
...
Links: 3
Errors: 1

When you run the script, you will see a series of dots indicating each HTTP request. Every tenth HTTP request is indicated by a comma.

I included a sleep 1s command in the script, so that it waits one second after each request. When you are sending a server requests in succession, you do not want to do it too quickly. Some servers limit the rate at which you can make requests and may ban your IP address for a fixed time. Generally, it is good etiquette not to hit a server with quickfire requests.

The output file looks like this:

CSV

URL,HTTP Status Code
https://www.roboleary.net/quotes/2017/07/30/quotes.html,404
https://www.roboleary.net/java/2017/11/24/java-dates.html,200
https://www.roboleary.net/programming/2018/04/25/java-enums.html,200

You can open the CSV file in a spreadsheet application such as LibreOffice Calc to explore the data.

[Image: viewing the output.csv file in LibreOffice Calc]
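If you would rather stay on the command line, the failing rows can be pulled out with awk. This sketch assumes the two-column layout shown above and that none of the URLs contain a comma. The function name `failed_urls` is a placeholder.

```shell
#!/bin/sh

# Print the URLs from the CSV whose status code is not 200.
# Skips the header row; assumes no commas inside the URLs.
failed_urls() {
    awk -F, 'NR > 1 && $2 != "200" { print $1 }' "$1"
}

# Example: failed_urls output.csv
```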

Final words

There isn’t always a clear process for migrating a website or data. It can be useful to write a simple tool to test your actions. By using a visual diff tool and writing a script, I was able to migrate my website with confidence. I had assurance that I did not break links to my website, and can maintain the SEO juice for my webpages!

I hope that this helps you if you are doing a similar thing.

Thanks for reading!
