Screaming Frog has added many enhancements that allow you to schedule crawls and export XML site maps, which we can import into HREFLang Builder to set up Auto Updates. It is not yet a perfect solution, but it is a good way to build or augment your source files. Before you get started, there are a few things to consider and some setup suggestions to follow.
Single Crawl or Multiple Crawls
Once you set up the format for your country and language files in HREFLang Builder it does not matter if the source files are in a single file or one for each country.
Which option you choose depends on how your site is set up, how much memory your computer has, and how you want to organize your data.
If you are using a .com with country/language folders, it is easier to just let Screaming Frog run across all the versions and build a single file. However, for larger sites, ccTLDs, or a computer without a lot of memory, it makes more sense to create individual files for each country.
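If you go with a single crawl of a .com that uses country/language folders, the combined URL list can still be split per version afterwards. The following is a minimal Python sketch, assuming every localized URL starts with its country/language folder; the example.com URLs and the function name are hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlparse


def group_by_locale_folder(urls):
    """Group URLs by their first path segment, e.g. 'en-us' or 'de-de'."""
    groups = defaultdict(list)
    for url in urls:
        segments = [s for s in urlparse(url).path.split("/") if s]
        # URLs with no folder at all (such as the home page) go under "root".
        key = segments[0].lower() if segments else "root"
        groups[key].append(url)
    return dict(groups)


# Hypothetical URLs standing in for a single combined crawl export.
sample = [
    "https://www.example.com/en-us/products/",
    "https://www.example.com/de-de/produkte/",
    "https://www.example.com/en-us/contact/",
    "https://www.example.com/",
]

grouped = group_by_locale_folder(sample)
for locale, members in grouped.items():
    print(locale, len(members))
```

Pages that do not sit under a country/language folder will end up in a group named after whatever their first folder is, so review the group names before using them.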
Pro Tip: We set up ours with individual country folders. This allows us to export not only an XML site map but also the master file of URLs to use for our diagnostic work. See the example below.
Setup Folder Structures
Create a folder for each country/language version of the site.
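If you have many versions, the folders can be created in one pass rather than by hand. This is only a sketch: the locale codes and the function name are placeholders, and the demo writes to a temporary directory so it does not touch a real project folder.

```python
import tempfile
from pathlib import Path

# Hypothetical locale codes; replace with the versions your site actually has.
LOCALES = ["en-us", "en-gb", "de-de", "fr-fr"]


def create_locale_folders(base_dir, locales):
    """Create one sub-folder per country/language version and return the paths."""
    created = []
    for locale in locales:
        folder = Path(base_dir) / locale
        # exist_ok=True makes the function safe to re-run.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Demo in a temporary directory; point base_dir at your own crawl folder instead.
with tempfile.TemporaryDirectory() as tmp:
    folders = create_locale_folders(tmp, LOCALES)
    folder_names = sorted(p.name for p in folders)
    all_exist = all(p.is_dir() for p in folders)
    print(folder_names)
```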
Configuring Screaming Frog Crawls
Screaming Frog has many spider settings, and we suggest you read their help guides completely to understand the full power. You can exclude directories or parameters. You can also set unique User Agents, crawl depth, and speed limits.
The following are some of the quick settings that we use to get the cleanest output.
You can use all of your normal settings. There are a few that you should make sure are set:
Crawl Canonicals – Sometimes sites have tracking or feature parameters in the URL but set a canonical to the root page. We want to ensure that we are submitting only the final canonical version of the URL, so this setting minimizes the number of non-canonical versions collected.
Note: We typically turn off all of the crawl settings in the basic screen since we do not need any of these for XML site map creation.
The Advanced tab is where you will have the most settings to configure, so ensure the following are checked. These force Screaming Frog to keep to valid pages.
- Always Follow Redirects
- Always Follow Canonical
- Respect No Index
- Respect Canonical
Configuring XML Site Maps
Click the Sitemaps tab in the header ribbon bar and open the top. We suggest that you leave all the options unchecked except for 2XX, which will add only valid URLs.
My wish list item is that they also let us choose “Indexable” URLs, which would act as a double check.
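Until such an option exists, a rough double check can be run on the exported file itself. The sketch below reads a standard XML site map and lists any URLs that still carry query parameters, which often indicates a tracking or feature variant slipped through. It does not test indexability; the sample XML and function names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return every <loc> value found in a <urlset> site map."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def urls_with_parameters(urls):
    """Return the URLs that contain a query string."""
    return [u for u in urls if urlparse(u).query]


# Hypothetical export; in practice, read the text from your saved XML file.
sample_xml = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://www.example.com/de-de/</loc></url>"
    "<url><loc>https://www.example.com/de-de/produkte/?utm_source=news</loc></url>"
    "</urlset>"
)

urls = sitemap_urls(sample_xml)
flagged = urls_with_parameters(urls)
print(len(urls), "URLs,", len(flagged), "with parameters")
```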
Saving your Configuration
To use the scheduler, it is important that you save your configuration. You can set up a master job for XML site map creation or unique filters for each country. Once you have created config files, you can call them from the scheduler.
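If you prefer to drive the crawls from your own script instead of the built-in scheduler, saved config files can also be passed on the command line. The sketch below only builds the commands and does not run them. The executable name and the flags (--crawl, --headless, --config, --output-folder, --create-sitemap) reflect our understanding of Screaming Frog's command line options and should be checked against the user guide for your version; the URLs, folder names, and config file names are hypothetical.

```python
from pathlib import Path

# Assumed executable name (Linux/macOS style); the Windows name differs.
SF_EXECUTABLE = "screamingfrogseospider"


def build_crawl_command(start_url, config_file, output_folder):
    """Return the argument list for one headless crawl that writes an XML site map."""
    return [
        SF_EXECUTABLE,
        "--crawl", start_url,
        "--headless",
        "--config", str(config_file),
        "--output-folder", str(output_folder),
        "--create-sitemap",
    ]


# Hypothetical jobs: one start URL per country/language version.
JOBS = {
    "de-de": "https://www.example.com/de-de/",
    "fr-fr": "https://www.example.com/fr-fr/",
}

commands = {
    locale: build_crawl_command(
        url,
        Path("configs") / f"{locale}.seospiderconfig",  # assumed config file name
        Path("crawls") / locale,  # one output folder per version
    )
    for locale, url in JOBS.items()
}

for locale, cmd in commands.items():
    print(locale, " ".join(cmd))
```

Each list can then be handed to subprocess.run(cmd, check=True) from a cron job or Windows Task Scheduler once you have confirmed the flags on your installation.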
Save your XML site map files with unique names that identify the country and/or language each file represents. Once you have your first set of files saved in Dropbox, you can follow this process to add XML Site Maps from Dropbox into HREFLang Builder.