Is AI Crawling Your Website? How to Check and Why It Matters
As I mentioned in a recent LinkedIn post, the end of 2024 saw the biggest drop in Google’s market share in almost a decade. Not only does this demonstrate clear decentralisation in search, it also emphasises the importance of having your website content successfully crawled and indexed by alternative search engines and AI tools. With a growing array of these tools gaining rapid traction – SearchGPT, ChatGPT and Perplexity AI among them – it is vital that any website aiming to drive significant search traffic is accessible to these emerging platforms and their associated bots.

How Do I Know If AI Is Crawling My Site?
One of the main ways to determine whether your website content is being crawled by search engine and AI bots is to review your website log files. There are two methods you can use: you can either take a manual approach and review the logs yourself, or you can use a log file analyser tool such as the one offered by Screaming Frog. I will go into more detail on both below – but if you aren’t comfortable accessing or analysing log files, please get in touch. The experts at Varn are here to help!
Method 1: Manually Review Log Files
If you have access to your website’s cPanel, your log files should be easy to locate. Once you have downloaded the relevant log file(s), you can simply search (CTRL+F) within them for individual bot user-agent names. As an example, say I am reviewing Varn’s website log files to check that our site is being crawled by OpenAI, so that our content can be picked up and presented to users carrying out relevant searches in ChatGPT. When searching a test log file for the user-agent ‘ChatGPT’, I can see that it appears multiple times; one instance has been captured in the snapshot below:
This snippet from our log file tells us that OpenAI’s ChatGPT bot has successfully crawled content on the Varn website. But that’s not all. We can also see that this specific crawl took place on 15th January, and that the content being accessed was one of the Varn blog posts on optimising LinkedIn pages. Access was successful, according to the 200 HTTP status code, and the absence of a referrer suggests this was a direct crawl initiated by the bot. With this data, we can systematically review additional log file entries to verify whether other content types on the Varn site have been accessed by OpenAI’s bot. Ensuring comprehensive crawl coverage is essential for making relevant content available for potential inclusion in future ChatGPT responses.
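If you would rather script this check than search by hand, the same fields can be pulled out programmatically. Below is a minimal Python sketch, assuming your server writes the widely used combined log format (IP, timestamp, request, status, bytes, referrer, user agent); the file name access.log is hypothetical, and the ‘ChatGPT’ substring matches OpenAI’s ChatGPT-User agent string.

```python
import re

# Combined log format:
# IP - - [timestamp] "METHOD path HTTP/x.x" status bytes "referrer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

def find_bot_hits(log_path, bot_substring):
    """Yield parsed entries whose user-agent field contains bot_substring."""
    with open(log_path, encoding="utf-8", errors="replace") as log_file:
        for line in log_file:
            match = LOG_PATTERN.match(line)
            if match and bot_substring.lower() in match.group("user_agent").lower():
                yield match.groupdict()

# Hypothetical log file name; adjust the path and substring to suit
for hit in find_bot_hits("access.log", "ChatGPT"):
    referrer = hit["referrer"] if hit["referrer"] not in ("", "-") else "(no referrer)"
    print(hit["timestamp"], hit["path"], hit["status"], referrer)
```

Each printed row mirrors what we read off the snapshot above: when the crawl happened, which URL was requested, whether it succeeded, and whether a referrer was present.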
Method 2: Use a Log File Analyser
As previously touched upon, you can also review your website log files using a log file analyser – in this example, we will use the one we rely on regularly, provided by Screaming Frog.
Screaming Frog’s Log File Analyser is very intuitive. After uploading your log file, you will be presented with a number of tables and graphs, including a summary of URLs logged, response codes, URL events and more. The information we are looking for can be found under the ‘User Agents’ tab – navigate to this tab and you will see a list of all user agents that have accessed content on your website. You can then either search for the user agent you need, or order the list alphabetically. Below is how the OpenAI ChatGPT user agent is displayed when analysing our test log file within Screaming Frog; you’ll notice that the user-agent string is exactly the same as the one we saw in the previous example, when manually reviewing the log file.
Now that we have again confirmed that OpenAI and ChatGPT are able to access the Varn site, we can take a closer look at each of the URLs crawled within the time period covered by this test data. Not only can we see the individual URLs crawled, we can also see the timestamp of each crawl, the remote host IP, the HTTP response code and much more, confirming that our content is crawlable (and is actively being crawled) by ChatGPT.
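If you wanted to reproduce this per-URL view outside of Screaming Frog, the parsed log entries can be grouped by URL. Here is a minimal sketch that reuses the hypothetical find_bot_hits helper from the earlier example:

```python
from collections import defaultdict

def urls_crawled_by(log_path, bot_substring):
    """Group a bot's hits by URL, recording each crawl's timestamp, IP and status."""
    crawls = defaultdict(list)
    # find_bot_hits is the hypothetical helper defined in the earlier sketch
    for hit in find_bot_hits(log_path, bot_substring):
        crawls[hit["path"]].append((hit["timestamp"], hit["ip"], hit["status"]))
    return crawls

for path, events in urls_crawled_by("access.log", "ChatGPT").items():
    print(path)
    for timestamp, ip, status in events:
        print(f"  {timestamp}  {ip}  HTTP {status}")
```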
So, What’s Next?
Once we have confirmed that key bots can access and crawl our most important content, we can repeat either of these methods to check additional bots – and not just AI bots. This process works for a wide variety of user agents, including search engine bots (Googlebot, Bingbot, YandexBot, Baiduspider, DuckDuckBot etc.), AI bots (OpenAI’s GPTBot and ChatGPT-User, PerplexityBot and so on), social media bots and even specialised bots (AhrefsBot and SemrushBot, for example). A quick way to script this multi-bot check is sketched below.
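A simple substring tally over the raw log lines – essentially an automated CTRL+F – is enough for this. The substrings below are illustrative rather than exhaustive, and access.log is again a hypothetical file name:

```python
from collections import Counter

# User-agent substrings to tally; illustrative, not exhaustive
BOT_SUBSTRINGS = ["Googlebot", "bingbot", "GPTBot", "ChatGPT-User",
                  "PerplexityBot", "AhrefsBot", "SemrushBot"]

def tally_bot_hits(log_path):
    """Count raw log lines mentioning each bot substring."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log_file:
        for line in log_file:
            for bot in BOT_SUBSTRINGS:
                if bot.lower() in line.lower():
                    counts[bot] += 1
    return counts

for bot, count in tally_bot_hits("access.log").most_common():
    print(f"{bot}: {count} hits")
```

A tally like this makes it immediately obvious if an important bot is missing from your logs entirely – a strong hint that something (such as a robots.txt rule or firewall setting) is blocking it.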
As the search landscape continues to evolve, it is more important than ever to ensure that your website content is accessible to search engines and AI bots alike, rather than focusing all of your efforts on Google. By regularly checking your log files (or using a log file analyser) as detailed above, you can easily determine which bots are crawling your website and which content they are accessing. This is a crucial first step in optimising for the wider search landscape, and is key to understanding any restrictions you might have in place on your content, as well as potential opportunities. It can even give you insights into the types of content bots review most often, so that you can adapt and optimise your content strategy accordingly.
For more information on how to ensure your content is optimised for AI and search engines alike, take a look at our recent post on Answer Engine Optimisation (AEO). You can also check back regularly for the latest search innovation news – or get in touch with us if you would like to be added to our innovation newsletter recipient list. We would love to hear from you, and potentially discuss how Varn could help your website reach a wider audience.