We use DeepCrawl’s API to extract URLs from your scheduled SEO diagnostic crawls. The API uses a two-step process.
The first step asks DeepCrawl to generate its “Indexable URL Report” for each site that you have set up. This brilliant filtering system removes from the export any URLs that are blocked by robots directives or carry canonical tags, even though they could otherwise return a 200 status. This minimizes the number of pages with errors that are added to HREFLang Builder.
The second step downloads these reports into our system and, once all of them are downloaded, triggers the mapping and XML generation process.
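The two steps above can be sketched roughly as follows. This is only an illustration of the control flow, not our actual integration code: `request_report`, `download_report`, and `on_all_downloaded` are hypothetical callables standing in for the real DeepCrawl API calls and for the mapping and XML generation step.

```python
from typing import Callable, Dict, Iterable, List


def collect_indexable_urls(
    site_ids: Iterable[str],
    request_report: Callable[[str], str],           # hypothetical: returns a report handle
    download_report: Callable[[str], List[str]],    # hypothetical: returns the URLs in a report
    on_all_downloaded: Callable[[Dict[str, List[str]]], None],  # hypothetical: mapping + XML step
) -> Dict[str, List[str]]:
    """Run the two-step flow for every site, then trigger processing once."""
    site_ids = list(site_ids)

    # Step 1: ask for an indexable-URL report for each site and keep the handles.
    report_handles = {site: request_report(site) for site in site_ids}

    # Step 2: download each generated report.
    urls_by_site = {
        site: download_report(handle) for site, handle in report_handles.items()
    }

    # Mapping and XML generation start only after every report has been downloaded.
    on_all_downloaded(urls_by_site)
    return urls_by_site
```

The point of the sketch is the ordering: processing is triggered once, after the last download, rather than once per site.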
The biggest benefits of using the DeepCrawl API integration are that we do not need to validate your URLs, which reduces the requests against the site, and that you get a dynamic source of URLs that is frequently updated by your diagnostic crawls.
Even if we are using XML sitemaps or another source for URLs, you can augment that with data you already have in DeepCrawl. As we have shown, it is difficult to get a complete set of URLs unless we use multiple sources.
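As a minimal sketch of what augmenting one source with another can look like, the function below merges several URL lists and drops exact duplicates. The function name is hypothetical, and it makes a simplifying assumption: URLs are compared as plain strings, with no normalization of trailing slashes, case, or query parameters.

```python
from typing import Iterable, List


def merge_url_sources(*sources: Iterable[str]) -> List[str]:
    """Combine URL lists, skipping blanks and exact duplicates, in first-seen order."""
    seen = set()
    merged = []
    for source in sources:
        for url in source:
            url = url.strip()
            if url and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged


# Example: sitemap URLs first, then URLs found only in the crawl.
sitemap_urls = ["https://example.com/", "https://example.com/en/"]
crawl_urls = ["https://example.com/en/", "https://example.com/de/"]
all_urls = merge_url_sources(sitemap_urls, crawl_urls)
```

Here `all_urls` contains three entries: the two sitemap URLs followed by the German page that appeared only in the crawl.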
There are a few requirements for using the DeepCrawl API as your primary or incremental source for URLs:
- You must have a current DeepCrawl account; this is not included in our costs
- You must have scheduled crawls set up to crawl the site(s)
- You must give us access to the account via the API key (we do not need login access)
The integration process is pretty straightforward, but as you may expect when integrating third-party applications, there are some potential issues. The following are provided for your consideration:
- Your DeepCrawl account limits and budget – if you plan to use the crawl results as your primary source of URLs for HREFLang Builder, you need to ensure you have sufficient credits in your DeepCrawl account to allow for a full crawl of the site(s) at the update intervals you want. We have had a couple of clients whose caps on the number of URLs crawled were lower than the number of URLs on the site.
- Our system requests the most recent crawl – if you run out of credits or stop the crawl, we may not get a current list of URLs. We plan to add an alert to the dashboard that compares the dates and flags any sites whose reports are not current.
- We do not prompt crawls of your site via the API – especially if you are using DeepCrawl as your URL source, you should use its scheduler functionality to set up crawls at appropriate intervals. We suggest starting with a weekly crawl.
- Set crawl restrictions in DeepCrawl – we take what is presented to us without exception, so if you want or need any crawl restrictions, you can set them in Phase 2 of our DeepCrawl project setup workflow.
- DeepCrawl error management – with any crawler or dynamic tool, errors can happen that impact your DeepCrawl results. DeepCrawl has an excellent help guide on how to fix website crawl errors; for any additional questions, please consult your DeepCrawl customer support representative.
- Indexable URL Report creation and exporting – the time it takes to generate each report depends on the number of URLs in the crawl and the number of indexable URLs. If your crawl has completed when we request the report, it is typically generated in a few minutes. If the callback fails, we will try again in 1 hour, then 12 hours and 24 hours later. If it still fails after this, we will rebuild the report with the most recent source we have and alert you.
- Generating updates in HREFLang Builder – your final consideration is when to generate updates. If you run your crawls on the weekend, then you should set HREFLang Builder to update weekly on Monday or Tuesday.
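Two of the considerations above, the retry schedule for a failed report callback and the check for reports that are not current, can be sketched as follows. This is an illustrative sketch under stated assumptions, not our production logic: the function names are hypothetical, the retry offsets are assumed to be measured from the first failure, and the 7-day staleness threshold is an assumption based on the suggested weekly crawl.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumption: each retry offset is measured from the time of the first failure.
RETRY_OFFSETS = [timedelta(hours=1), timedelta(hours=12), timedelta(hours=24)]


def next_retry(first_failure: datetime, retries_done: int) -> Optional[datetime]:
    """Return when the next retry is due, or None once all retries are used up."""
    if retries_done < len(RETRY_OFFSETS):
        return first_failure + RETRY_OFFSETS[retries_done]
    # Retries exhausted: fall back to the most recent source and alert the user.
    return None


def is_stale(
    last_crawl_finished: datetime,
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # assumption: matches a weekly crawl
) -> bool:
    """Flag a report whose underlying crawl is older than the allowed age."""
    return now - last_crawl_finished > max_age
```

For a weekend crawl followed by a Monday or Tuesday update, a check like `is_stale` would pass; it would start flagging the site only if a scheduled crawl was skipped, for example because credits ran out.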
If getting a clean and complete source of URLs has been a challenge for you, or you want to augment what you have, follow these instructions to Set Up DeepCrawl API Auto Updates.