Leveraging OnCrawl to Eliminate URL Source Gaps

One of the biggest challenges we encounter when creating and managing HREFLang for clients is ensuring that all of the pages are mapped to their alternates. This is a two-step process: first importing all of the URLs for the various sites, then matching them to their alternate versions. We have written extensively on the second step and on why we had to build fourteen different methods to match pages.

Today we will review the bigger challenge: getting all of the valid URLs on the site. For many sites with limited XML sitemaps this is a major challenge, and one they are often not aware of. It is interesting how many site owners and SEOs assume their CMS-created XML sitemaps contain all of the URLs on the site. Unfortunately, because of UX tools, AJAX, or JavaScript frameworks, it is not always possible to discover every URL by crawling, which is the very reason complete XML sitemaps are critical. There are also CMSs whose sitemap functionality can only process a single language version, which creates a major challenge for sites that target Switzerland or Belgium in multiple languages.

A fundamental part of Google’s requirements for using hreflang elements is to ensure “each language version lists itself as well as all other language versions.” If you do not match all of the alternates, you end up with a large number of errors in GSC’s International Targeting report.
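The requirement quoted above can be sketched as a small check. This is only an illustration, assuming the hreflang annotations have already been collected into a dictionary that maps each page URL to its `{hreflang code: alternate URL}` entries; the function name, the input shape, and the example.com URLs are hypothetical and are not how GSC or HREFLang Builder work internally.

```python
from typing import Dict, List, Tuple

# Hypothetical input shape: page URL -> {hreflang code: alternate URL}
Annotations = Dict[str, Dict[str, str]]


def find_hreflang_errors(annotations: Annotations) -> List[Tuple[str, str, str]]:
    """Return (page, problem, related_url) tuples for incomplete hreflang sets."""
    errors = []
    for page, alternates in annotations.items():
        # Each language version must list itself.
        if page not in alternates.values():
            errors.append((page, "missing self-reference", page))
        for alt_url in alternates.values():
            if alt_url == page:
                continue
            # Each alternate must list this page in return.
            if page not in annotations.get(alt_url, {}).values():
                errors.append((page, "no return tag", alt_url))
    return errors


example = {
    "https://example.com/de-ch/": {
        "de-ch": "https://example.com/de-ch/",
        "fr-ch": "https://example.com/fr-ch/",
    },
    "https://example.com/fr-ch/": {
        # The de-ch alternate is not listed here, so the pair is not reciprocal.
        "fr-ch": "https://example.com/fr-ch/",
    },
}

errors = find_hreflang_errors(example)
for page, problem, related in errors:
    print(f"{page}: {problem} ({related})")
```

In this sample the German page points at the French page, but the French page does not point back, so the check reports one "no return tag" problem for the pair.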

With the exception of the first error, the rest indicate that Google has found URLs mapped in one XML sitemap that are not mapped back by their alternates in another.

We can replicate this error pattern in HREFLang Builder with our “Missing URL Report.” This report identifies missing pages for each market. Red cells indicate URLs that have no equivalent in that market, and green cells indicate that a URL exists. In this specific case, the client told us the sites should be mirror images of each other, which tells us that their source files are missing a significant number of pages.
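The idea behind such a matrix can be sketched in a few lines. This is a simplified illustration and not the HREFLang Builder implementation: it assumes every market uses the same path below its market prefix, so pages can be matched on that shared path, which is only the simplest of the matching cases. The function name and sample data are hypothetical.

```python
from typing import Dict, Set


def missing_url_matrix(paths_by_market: Dict[str, Set[str]]) -> Dict[str, Dict[str, bool]]:
    """Map each page path to {market: True if that market has the page}."""
    all_paths = set().union(*paths_by_market.values())
    return {
        path: {market: path in paths for market, paths in paths_by_market.items()}
        for path in sorted(all_paths)
    }


# Hypothetical source data: market code -> paths found in that market's source file
paths_by_market = {
    "de-ch": {"/products/", "/contact/"},
    "fr-ch": {"/products/"},
}

matrix = missing_url_matrix(paths_by_market)
for path, row in matrix.items():
    # "ok" corresponds to a green cell, "MISSING" to a red cell
    cells = ", ".join(f"{m}={'ok' if present else 'MISSING'}" for m, present in row.items())
    print(f"{path}: {cells}")
```

If the sites are meant to mirror each other, every `MISSING` cell points to a URL that is absent from that market's source file, or to a page that has not been localized.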

When they start using HREFLang Builder, most sites use their CMS-generated XML sitemap as the source file. In theory this should be the most complete set of URLs, as it comes directly from the source. Unfortunately, in my experience, the average enterprise CMS-generated XML sitemap often contains fewer than 75% of the total valid URLs. We define a valid URL as one that returns a 200 status code AND is indexable, meaning that it is not blocked by a robots directive and the URL matches its canonical tag.
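That definition of a valid URL translates directly into a small predicate. The sketch below assumes crawl results are already available as simple records; the `CrawledPage` fields are hypothetical names for illustration and do not reflect any particular crawler's export format. Treating a page with no canonical tag as self-canonical is an assumption added here, not part of the definition above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrawledPage:
    # Hypothetical record; field names are illustrative only.
    url: str
    status_code: int
    blocked_by_robots: bool
    canonical: Optional[str] = None


def is_valid_url(page: CrawledPage) -> bool:
    """Valid = returns 200 AND indexable (not robots-blocked, URL matches canonical)."""
    if page.status_code != 200:
        return False
    if page.blocked_by_robots:
        return False
    # Assumption: a page without a canonical tag is treated as self-canonical.
    return page.canonical is None or page.canonical == page.url


pages = [
    CrawledPage("https://example.com/fr-be/", 200, False, "https://example.com/fr-be/"),
    CrawledPage("https://example.com/fr-be/old/", 301, False),
    CrawledPage("https://example.com/fr-be/cart/", 200, True, "https://example.com/fr-be/cart/"),
    CrawledPage("https://example.com/fr-be/?ref=nav", 200, False, "https://example.com/fr-be/"),
]

valid = [p.url for p in pages if is_valid_url(p)]
print(valid)
```

Only the first sample page passes: the others fail on status code, robots blocking, and a canonical mismatch respectively.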

As I noted earlier, unless their DevOps or SEO team has compared the XML sitemap to a complete site crawl, they may not know that there are discrepancies between what can be found on a crawl and what is submitted to search engines via XML sitemaps.

This very challenge is why we have teamed up with OnCrawl, for two reasons. The first was a no-brainer: many of our clients were already using OnCrawl for their SEO diagnostics, so why should we add to their server overhead by crawling their site to get a list of URLs, or to check the validity of URLs, when that has already been done as part of the SEO team’s ongoing diagnostics?

The second was working with clients to troubleshoot the gaps between XML and crawl counts for URLs that are identified by our missing URL matrix. OnCrawl makes this type of diagnostic painless with their “Structure vs. Site Maps” reports. These reports quickly and clearly identify the gaps between what was found in their crawl and what was extracted from the XML sitemaps.

In the first example, OnCrawl found that only 6.5% of the URLs overlapped between the XML sitemap and the crawl, with significantly more pages found during the crawl than were reported in the XML sitemap.

In the report below, OnCrawl detected a similar problem for another site, where 78.4% of the URLs found during the diagnostic crawl were not represented in the XML sitemaps.
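The kind of comparison behind these figures is plain set arithmetic. The sketch below is not OnCrawl's calculation; it assumes the overlap is measured against all URLs found in either source and the crawl-only share against the crawled URLs, and it uses made-up sample data rather than the client figures above.

```python
from typing import Dict, Set


def sitemap_coverage(sitemap_urls: Set[str], crawled_urls: Set[str]) -> Dict[str, float]:
    """Compare sitemap URLs with crawled URLs and return percentages."""
    in_both = sitemap_urls & crawled_urls
    in_either = sitemap_urls | crawled_urls
    crawl_only = crawled_urls - sitemap_urls
    return {
        # Assumed denominator: every URL seen in either source
        "overlap_pct": 100 * len(in_both) / len(in_either) if in_either else 0.0,
        # Assumed denominator: URLs found during the crawl
        "crawl_only_pct": 100 * len(crawl_only) / len(crawled_urls) if crawled_urls else 0.0,
    }


# Made-up sample data
sitemap_urls = {"https://example.com/a/", "https://example.com/b/"}
crawled_urls = {
    "https://example.com/b/",
    "https://example.com/c/",
    "https://example.com/d/",
    "https://example.com/e/",
}

report = sitemap_coverage(sitemap_urls, crawled_urls)
print(report)
```

With this sample, one of five distinct URLs appears in both sources (20% overlap), and three of the four crawled URLs are absent from the sitemap (75%).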

To close this gap, we worked with the innovations team at OnCrawl to integrate their “valid URL” report through their API. Enabling it is as simple as adding the API key to HREFLang Builder.

The ability to automatically import the CMS-generated XML sitemaps together with the validated URLs from OnCrawl into HREFLang Builder ensures that the largest possible number of URLs is mapped and submitted to Google in the XML sitemaps.
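Conceptually, combining the two sources is a union of the sitemap's URLs with the crawler's validated list. The sketch below is an illustration only: it does not call the OnCrawl API or reproduce HREFLang Builder's import, and it assumes the validated URLs have already been exported to a plain list. The function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable, List

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_locs(xml_text: str) -> List[str]:
    """Extract every <loc> value from a <urlset> sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def merge_sources(sitemap_xml: str, validated_urls: Iterable[str]) -> List[str]:
    """Union the sitemap URLs with crawl-validated URLs, de-duplicated and sorted."""
    return sorted(set(sitemap_locs(sitemap_xml)) | set(validated_urls))


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/fr-ch/</loc></url>
</urlset>"""

# Hypothetical list of URLs that a crawl has already validated
validated = ["https://example.com/fr-ch/", "https://example.com/fr-ch/contact/"]

merged = merge_sources(SAMPLE_SITEMAP, validated)
print(merged)
```

The contact page is missing from the sample sitemap but present in the validated list, so it ends up in the merged source set.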

In addition to reducing hreflang errors and improving overall revenue, some clients have adapted this combined solution in creative ways. They share the missing URL report with their localization teams to find missing or non-localized pages in various markets. Another new use comes from clients with limited CMS XML sitemaps, who now use HREFLang Builder’s XML as a more accurate source of global URLs for their site search applications.

The integration of OnCrawl’s API makes HREFLang Builder the most automated and powerful solution for companies of all sizes to easily manage their HREFLang projects.