Does your CMS Create Correct XML Site Maps?

When we set up HREFLang Builder, the client sends us, or we build, a country and language matrix listing all the domains and their respective country or language targeting. This client gave us the following information:

  • Use the country locator as the primary list of country and language variations
  • Each country uses a ccTLD
  • With the exception of the US and Mexico, all the country sites were in the native language only
  • All countries have XML index files for the source of URLs.
  • All countries have a similar number of products and pages

Just these five bullets from the client make this sound like a pretty easy project to set up, but when we imported all of the XML sitemap index files into HREFLang Builder, we got significant country and language errors for nearly every market.

From the country locator, we clicked the entry for Sweden, which indicated that the home page was mysite.se. The image above shows the header of the e-commerce site, with the Swedish flag and the text in Swedish and no other language variation offered. Clicking into the site, we noticed that the URL structure changed to mysite.se/sv-se/home, which seemed strange because the country locator was using the ccTLD without any country or language folders. We adjusted HREFLang Builder’s regex to accommodate the ccTLD plus language/country folder structure and reimported the XML index files. For Sweden, the system identified country/language pairs for 6 different variations, as shown below. Ironically, there was no Sweden/Swedish entry.

https://www.mysite.se/fr-fr/
https://www.mysite.se/de-de/
https://www.mysite.se/it-it/
https://www.mysite.se/en-gb/
https://www.mysite.se/nl-nl/
https://www.mysite.se/es-es/

HREFLang Builder makes it easy to identify the source of the URLs for the patterns above, and they came from the CMS-generated XML site map. This means that there is, for example, a French/France version of the Swedish ccTLD website that was not listed. When we visited various product pages on each of the country/language variations of the Swedish site, they all had a canonical pointing to the respective country site. For example, the home page for mysite.se/en-gb/ has a canonical to mysite.co.uk.
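The kind of check described above can be sketched in a few lines of Python. This is only an illustration, not HREFLang Builder's actual regex or logic: the `ALLOWED_LOCALES` mapping, the function names, and the assumption that the locale is always the first folder in the path are all hypothetical.

```python
import re
from urllib.parse import urlparse

# Hypothetical mapping: which locale folders are legitimate on each ccTLD.
ALLOWED_LOCALES = {"se": {"sv-se"}}

# Matches a leading "/xx-yy/" or "/xx-yy" folder such as "/fr-fr/".
LOCALE_FOLDER = re.compile(r"^/([a-z]{2})-([a-z]{2})(?:/|$)", re.IGNORECASE)


def locale_from_url(url):
    """Return (language, country) from the first path folder, or None."""
    match = LOCALE_FOLDER.match(urlparse(url).path)
    if match is None:
        return None
    return match.group(1).lower(), match.group(2).lower()


def cc_tld(url):
    """Return the last label of the hostname, e.g. 'se' or 'uk'."""
    host = urlparse(url).hostname or ""
    return host.rsplit(".", 1)[-1]


def is_rogue(url, allowed):
    """True when the URL has a locale folder that is not allowed on its ccTLD."""
    locale = locale_from_url(url)
    if locale is None:
        return False  # no locale folder at all, nothing to compare
    return "-".join(locale) not in allowed.get(cc_tld(url), set())


sitemap_urls = [
    "https://www.mysite.se/fr-fr/",
    "https://www.mysite.se/en-gb/",
    "https://www.mysite.se/sv-se/home",
]
rogue = [u for u in sitemap_urls if is_rogue(u, ALLOWED_LOCALES)]
```

With the sample list, `rogue` holds the `fr-fr` and `en-gb` URLs, while the `sv-se` URL passes.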

We then moved on to the second issue. Remember, the client told us that the US and Mexico were the only markets with dual languages, English and Spanish. In HREFLang Builder there was only one entry each for the US and Mexico. We assumed the system did not understand the language designator.

The US Spanish link on the country selector took us to mysite.com/es-us/home, indicating that it was the US site in Spanish. We clicked a product page and the structure changed to mysite.com/es-us/product1/?lang=es-US, which had both a folder and a parameter. A quick view source to check for a canonical tag showed the canonical pointing to mysite.com/product1/, effectively removing both the language/country folder and the parameter. Even worse, the Spanish site was now in English.
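The manual view-source check is easy to automate. The sketch below uses only Python's standard library to pull the canonical out of a page's HTML and compare it with the URL that was requested. The function names, the `https://` scheme, and the normalization rule (ignore host case and trailing slash, keep the query string) are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class CanonicalParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> tag seen."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr = dict(attrs)
        if "canonical" in (attr.get("rel") or "").lower().split():
            self.canonical = attr.get("href")


def find_canonical(html):
    """Return the canonical URL declared in the HTML, or None."""
    parser = CanonicalParser()
    parser.feed(html)
    return parser.canonical


def normalize(url):
    """Compare on host, path without trailing slash, and query string."""
    parts = urlparse(url)
    return (parts.netloc.lower(), parts.path.rstrip("/"), parts.query)


def canonical_matches(page_url, html):
    """True only when the page declares itself as its own canonical."""
    canonical = find_canonical(html)
    return canonical is not None and normalize(canonical) == normalize(page_url)


page = '<html><head><link rel="canonical" href="https://mysite.com/product1/"></head></html>'
self_canonical = canonical_matches("https://mysite.com/es-us/product1/?lang=es-US", page)
```

Here `self_canonical` is `False`, because the canonical drops both the `es-us` folder and the `lang` parameter.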

On a second pass at importing, we set all the filters to remove the incorrect language variations and crawled the few sites that either had no XML site maps or had no local language pages in their XML. We then set the system to run a URL quality check. This feature pings each URL to make sure it is indexable and flags any robots blocks, redirects, or canonical differences.

In the screen capture below, you can see first the inconsistent number of URLs across markets, indicating the site maps are not complete. Second, for Germany, 214 URLs were identified as non-indexable, including 187 that had a 301 redirect to another page and 20 that redirected to a page where the canonical was different.
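To make the categories concrete, here is a rough sketch of how one fetched URL could be sorted into buckets like the ones above. It is not how HREFLang Builder implements its quality check. The `FetchResult` fields, the bucket names, and the order of the rules are assumptions, and the fetching itself is left out so the logic can be read on its own.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class FetchResult:
    url: str                  # URL as listed in the XML site map
    status: int               # status code of the first response
    final_url: str            # URL reached after following any redirects
    canonical: Optional[str]  # canonical declared on the final page, if any
    robots_blocked: bool      # blocked by robots.txt or a noindex directive


def classify(result):
    """Return a hypothetical bucket name for a single fetched URL."""
    if result.robots_blocked:
        return "robots_blocked"
    if 300 <= result.status < 400:
        # Redirected: also check whether the destination points elsewhere.
        if result.canonical and result.canonical != result.final_url:
            return "redirect_then_canonical_mismatch"
        return "redirect"
    if result.status != 200:
        return "error"
    if result.canonical and result.canonical != result.url:
        return "canonical_mismatch"
    return "indexable"


def summarize(results):
    """Count how many URLs fall into each bucket."""
    return Counter(classify(r) for r in results)
```

Only URLs that land in the `indexable` bucket belong in an XML site map.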

Problem #1 – No local market URLs in XML sitemap

The biggest problem of all was that the XML site map for the Swedish site did not contain any entries for the Swedish-language pages. They were non-existent, meaning the only way for search engines to find those URLs was to crawl the site. This was the case for every market: none of the URLs for the local language versions were included in the XML sitemaps.

Problem #2 – Additional market URLs with different canonicals

On top of the missing local URLs, the XML site map for each site contained those 6 additional language/market versions of URLs, all of which had a canonical tag pointing to a totally different top-level domain. For the Sweden site alone, this meant over 60,000 URLs were submitted to Google that then told Googlebot to go to another page. In total, across all the variations, over 1 million URLs were being submitted that had a canonical to another website.
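A quick way to spot both problems in your own site maps is to count the URLs by their first folder. The sketch below parses a standard sitemap `urlset` with Python's standard library. The sample XML and the function name are made up for illustration, and it assumes the locale is the first folder in the path.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from urllib.parse import urlparse

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def count_by_first_folder(sitemap_xml):
    """Count <loc> entries in a urlset by the first folder of their path."""
    root = ET.fromstring(sitemap_xml)
    counts = Counter()
    for loc in root.findall("sm:url/sm:loc", NS):
        path = urlparse((loc.text or "").strip()).path
        first = path.strip("/").split("/")[0] or "(root)"
        counts[first] += 1
    return counts


# Made-up sample resembling the Swedish site map described above.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.mysite.se/fr-fr/product1/</loc></url>
  <url><loc>https://www.mysite.se/fr-fr/product2/</loc></url>
  <url><loc>https://www.mysite.se/de-de/product1/</loc></url>
</urlset>"""

counts = count_by_first_folder(sample)
```

For the sample, `counts["fr-fr"]` is 2 and `counts["sv-se"]` is 0: the foreign variations are present while the local one is missing.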

Problem #3 – Language URLs Removed by canonical

As noted with the US Spanish language site, and the same was true of the Mexico English site, the localized version, while in the XML site map, had a canonical to the local market version. This means that the US Spanish version is very unlikely to be indexed. It was also strange that, somewhere along the line, the CMS was appending a country/language parameter to the URL.

Problem #4 – Non 200 indexable URLs in site maps

During the quality check shown above, it was clear the CMS was not using any rules to ensure that only clean, indexable URLs returning a 200 status were being added to the site maps. A total of 15% of all URLs in the XML were invalid, wasting significant crawl resources and potentially preventing new and core product pages from being indexed or revisited.

Problem #5 – Web and SEO team unaware of issues

I have to say this problem was the biggest and most shocking of all: no one knew this was happening, nor did anyone notice that the local market versions of the URLs were not being generated by the CMS. Checking the XML site maps is one of the first items on any Technical SEO checklist. I can only assume that someone just checked the box that there were site maps when the country site was deployed, and that no one is monitoring the errors in GSC, where these are clearly shown.

On top of not knowing they had the language issues, and that the CMS was not excluding problem URLs, the team was not sure who would fix it. This is a mainstream CMS that should not allow this type of problem to be created but that is a rant for a different article.

Problem #6 – The false "Google will sort this out" assumption

After we reviewed the issues with the SEO and Dev teams, neither saw the urgency in fixing the problem. Both felt that Google would follow the canonical and index the correct page. The irony in that statement was that they had engaged us because many of the local sites were not ranking in Google, and one SEO had noted that not many pages in a few markets were being indexed. Sometimes you have to stop and think about how the parts work and what might be the root cause. Had they taken a few minutes to do any basic diagnostics, they would have found the problems.

Check your XML Site Maps for Accuracy

In about 90 percent of the projects we work on, there are significant errors in the XML site maps for the sites we are onboarding. If you have not looked at yours and analyzed them after an update or in the last few months, it is worth the time to go and check them out. You may be surprised by how bad a shape they are in.

HREFLang Builder Advantages

While they were working with the CMS vendor to try to fix the problem, there were a few things we were able to leverage to start correcting the bigger problems.

  • HREFLang Builder was able to block the rogue URLs from being imported into the system, ensuring only the correct variation of the page was included.
  • HREFLang Builder was able to test the pages for indexability and canonical issues, remove those with problems, and add the correct canonical version of the URL.
  • For those missing versions of the site, we used the Screaming Frog scheduler function to crawl those sites and create an XML site map that we were able to import from Dropbox into the system.
  • Using our automated loader and cross-domain hosting, the client was able to remove their incorrect CMS-generated XML site maps and replace them with 100% correct and aligned hreflang XML site maps that not only solved their hreflang problem but also ensured Google has access to all the URLs for each market.
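For readers who have not seen one, an hreflang XML site map lists every market's URL for a page and repeats the full set of alternates, including the page itself, under each entry. The sketch below builds one with Python's standard library. It is a simplified illustration, not HREFLang Builder's output: the function name and the example URLs for the UK and Swedish product pages are hypothetical.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SM)          # sitemap elements get no prefix
ET.register_namespace("xhtml", XHTML)  # alternates use the xhtml: prefix


def build_hreflang_sitemap(page_groups):
    """Build a urlset from a list of {hreflang_code: url} dicts.

    Each dict describes one page across markets. Every URL in the group
    gets its own <url> entry listing all alternates, itself included.
    """
    urlset = ET.Element(f"{{{SM}}}urlset")
    for alternates in page_groups:
        for url in alternates.values():
            entry = ET.SubElement(urlset, f"{{{SM}}}url")
            ET.SubElement(entry, f"{{{SM}}}loc").text = url
            for code, alt_url in alternates.items():
                ET.SubElement(
                    entry,
                    f"{{{XHTML}}}link",
                    {"rel": "alternate", "hreflang": code, "href": alt_url},
                )
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical URLs: one product page in two markets.
groups = [{
    "sv-se": "https://www.mysite.se/sv-se/product1/",
    "en-gb": "https://www.mysite.co.uk/en-gb/product1/",
}]
sitemap_xml = build_hreflang_sitemap(groups)
```

Because every entry repeats the whole group, the alternates stay reciprocal, which is what keeps the site maps aligned across markets.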