Surely Google determines "fresh, relevant" content according to whatever has recently been published, which this doesn't change. If anything, doesn't Google consider sites with a long history of content with tons of inbound links as more authoritative and therefore higher-ranked?
This baffles me. It baffles me why this would be successful SEO -- and assuming that it actually isn't, it baffles me why CNET thinks it would be.
The theory I've heard is related to 'crawl budget'. Google is only going to devote a finite amount of time to indexing your site. If the number of articles on your site exceeds that time, some portion of your site won't be indexed. So by 'pruning' undesirable pages, you might boost attention on the articles you want indexed. No clue how this ends up working in practice.
Google's suggestion isn't to delete pages, but maybe to mark some pages with a "noindex" header.
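For what it's worth, the header in question is X-Robots-Tag. A minimal sketch of how a site might attach it to old articles instead of deleting them; the function name and the ten-year cutoff are made up for illustration and are not anything CNET or Google has described:

```python
from datetime import date


def robots_headers(published: date, today: date, max_age_days: int = 3650) -> dict:
    """Return extra response headers for an article.

    Hypothetical policy: articles older than max_age_days stay online
    but ask search engines not to index them.
    """
    if (today - published).days > max_age_days:
        # "X-Robots-Tag: noindex" is the HTTP-header form of the
        # <meta name="robots" content="noindex"> tag.
        return {"X-Robots-Tag": "noindex"}
    return {}


print(robots_headers(date(2005, 1, 1), date(2023, 8, 1)))
print(robots_headers(date(2022, 1, 1), date(2023, 8, 1)))
```

The page keeps working for readers and inbound links; only the indexing hint changes.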
Google crawls the entire page, not just the subset of text that you, a human, recognize as the unchanged article.
It’s easy to change millions of pages once a week with on-load CMS features like content recommendations. Visit an old article and look at the related articles, most read, read this next, etc widgets around the page. They’ll be showing current content, which changes frequently even if the old article text itself does not.
I'm pretty sure Google is smart enough to recognize the main content of a page, and ignore things like widgets and navigation. That's Search Engine 101.
It’s possible they examined the server logs for requests from GoogleBot and found it wasting time on old content (this was not mentioned in the article but would be a very telling data point beyond just “engagement metrics”).
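A rough sketch of what that log check could look like, assuming the common "combined" access log format and bucketing by the first path segment. The sample lines, the regex, and the function name are all illustrative assumptions, not CNET's setup:

```python
import re
from collections import Counter

# Matches the request, status, size, referer and user-agent parts of a
# combined-format access log line (assumed format).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)

SAMPLE_LOG = [
    '66.249.66.1 - - [10/Aug/2023:10:00:00 +0000] "GET /2009/03/old-review HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Aug/2023:10:00:05 +0000] "GET /2009/07/old-news HTTP/1.1" 200 4096 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Aug/2023:10:00:09 +0000] "GET /2023/08/new-review HTTP/1.1" 200 6144 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Aug/2023:10:00:11 +0000] "GET /2023/08/new-review HTTP/1.1" 200 6144 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]


def googlebot_hits_by_section(lines):
    """Count requests whose user agent claims to be Googlebot, per top-level path segment."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        # The user-agent string can be spoofed; a real audit would also
        # verify the client IP. That step is skipped in this sketch.
        if match and "Googlebot" in match.group("agent"):
            section = match.group("path").strip("/").split("/")[0] or "(root)"
            hits[section] += 1
    return hits


print(googlebot_hits_by_section(SAMPLE_LOG))
```

If most of the crawler's requests land in the oldest sections, that would be the "wasting time on old content" signal.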
There’s some methodology to trying to direct Google crawls to certain sections of the site first - but typically Google already has a lot of your URLs indexed and it’s just refreshing from that list.
It doesn't have to fetch every article (statistical sampling can give confidence intervals), and it doesn't have to fetch the full article: doing a "HEAD /" instead of a "GET /" will save on bandwidth, and throwing in ETag / If-Modified-Since / whatever headers can get the status of an article (200 versus 304 response) without bothering with the full fetch.
If the content is literally the same, the crawler should be able to use If-Modified-Since, right? It still has to make an HTTP request, but not parse or index anything.
This is not correct. It's up to the server, controlled by the application, to send that or other headers, similar to sending a <title> tag. The headers take priority and, similar to what another person said, the crawler will do a HEAD request first and not bother with a GET request for the content.
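To make the server side concrete, here is a simplified sketch of the decision an application has to implement before a crawler can get a 304 at all. The function name is hypothetical, and ETag comparison is reduced to an exact string match (no weak-validator handling):

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(request_headers, etag, last_modified):
    """Return (status, headers) for a GET or HEAD on a resource.

    last_modified must be a UTC-aware datetime. Simplified sketch of
    HTTP conditional request handling.
    """
    headers = {
        "ETag": etag,
        "Last-Modified": format_datetime(last_modified, usegmt=True),
    }

    # If-None-Match, when present, takes precedence over If-Modified-Since.
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        tags = [tag.strip() for tag in if_none_match.split(",")]
        return (304 if etag in tags or "*" in tags else 200), headers

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable date: ignore the header
        if since is not None:
            if since.tzinfo is None:
                since = since.replace(tzinfo=timezone.utc)
            # HTTP dates have one-second resolution.
            if last_modified.replace(microsecond=0) <= since:
                return 304, headers

    return 200, headers


article_time = datetime(2013, 1, 1, tzinfo=timezone.utc)
print(conditional_response({"If-None-Match": '"abc"'}, '"abc"', article_time)[0])
```

If the application never emits ETag or Last-Modified, or bumps them whenever a sidebar widget changes, the crawler has nothing reliable to condition on.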
> The theory I've heard is related to 'crawl budget'. Google is only going to devote a finite amount of time to indexing your site.
Once a site has been indexed once, should it really be crawled again? Perhaps Google should search for RSS/Atom feeds on sites and poll those regularly for updates: that way they don't waste time doing a full site scrape multiple times.
Old(er) articles, once crawled, don't really have to be babysat. If Google wants to double-check that an already-crawled site hasn't changed too much, they can do a statistical sampling of random links on it using ETag / If-Modified-Since / whatever.
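A toy version of that sampling idea, to show how few requests it needs. The has_changed callback is a stand-in for a conditional request (200 versus 304) and is purely hypothetical; the interval is a Wilson score interval, which is one reasonable choice among several:

```python
import math
import random


def estimate_changed_fraction(urls, has_changed, sample_size, seed=None, z=1.96):
    """Estimate the share of changed pages from a random sample.

    Returns (point_estimate, lower, upper), where lower/upper form a
    Wilson score interval (z=1.96 is roughly 95% confidence).
    """
    urls = list(urls)
    if not urls:
        raise ValueError("no URLs to sample")
    rng = random.Random(seed)
    sample = rng.sample(urls, min(sample_size, len(urls)))
    n = len(sample)

    changed = sum(1 for url in sample if has_changed(url))
    p = changed / n

    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Fake site: every tenth article "changed" since the last crawl.
site = [f"/article/{i}" for i in range(10_000)]
changed_check = lambda url: int(url.rsplit("/", 1)[1]) % 10 == 0

print(estimate_changed_fraction(site, changed_check, sample_size=300, seed=1))
```

A few hundred conditional requests are enough to tell "almost nothing changed" from "a lot changed" on a site with millions of pages.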
The SiteMap, which was invented by Google and designed to give information to crawlers, already includes last-updated info.
No need to invent a new system based on RSS/Atom, there is already an actually existing and in-use system based on SiteMap.
So, what you suggest is already happening -- or at least, the system is already there for it to happen. It's possible Google does not trust the last modified info given by site owners enough, or for other reasons does not use your suggested approach, I can't say.
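For reference, this is roughly what consuming that last-updated info looks like. The sample sitemap and the function name are invented for illustration, and the sketch assumes every lastmod value starts with a full YYYY-MM-DD date:

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/2009/old-review</loc>
    <lastmod>2009-03-14</lastmod>
  </url>
  <url>
    <loc>https://example.com/2023/new-review</loc>
    <lastmod>2023-08-01T09:30:00+00:00</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""


def urls_modified_since(sitemap_xml, cutoff):
    """Return the <loc> values whose <lastmod> is on or after cutoff."""
    root = ET.fromstring(sitemap_xml)
    fresh = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        if loc is None or lastmod is None:
            continue  # lastmod is optional; skip entries without it
        # Assumption: the value begins with YYYY-MM-DD.
        if date.fromisoformat(lastmod.strip()[:10]) >= cutoff:
            fresh.append(loc.strip())
    return fresh


print(urls_modified_since(SAMPLE_SITEMAP, date(2023, 1, 1)))
```

Whether a crawler should believe the dates it finds there is a separate question, as noted above.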
I can imagine a malicious actor changing an SEO-friendly page to something spammy and not SEO-friendly. Since ETag and Last-Modified are returned by the server, they can be manipulated.
Even if that rule were true, why wouldn't everything in, say, the top NNN internet sites get an exemption? It is the Internet's most hit content, why would it not be exhaustively indexed?
Alternatively, other than ads, what is changing on a CNN article from 10 years ago? Why would that still be getting daily scans?
Probably bad technology detecting a change. Things like current news showing up beneath the article, which changes whenever a new article is added. I've seen this happen on quite a few large websites. It might be technologically easier to drop old articles than to spend the time fixing whatever they use to determine if a page has changed. You would think a site like CNET wouldn't have to deal with something like that, but sometimes these sites that have been around for a long time have some seriously outdated tech.
That's a good point about the static nature of some pages. Is there any way to tell a crawler to crawl this page, but after this date don't crawl again, but keep anything you previously crawled?
Google is paying Wikipedia through "Wikimedia Enterprise." If Wikipedia weren't able to sucker people into thinking that they're poverty-stricken, Google would probably prop it up like they do Firefox.
If I were establishing a "crawl budget", it would be adjusted by value. If you're consistently serving up hits as I crawl, I'll keep crawling. If it's a hundred pages that will basically never be a first page result, maybe not.
Wikipedia has a long tail of low-value content, but even the low-value content tends to be among the highest value for its given focus. e.g., I don't know how many people search "Danish trade monopoly in Iceland", and the Wikipedia article on it isn't fantastic, but it's a pretty good start[0]. Good enough to serve up as the main snippet on Google.
Purely speculating, Wikipedia has a huge number of inbound links (likely many more than CNet or even than more popular sites) which crawler allocation might be proportionate to. Even if it only crawled pages that had a specific link from an external site, that would be enough for Google to get pretty good coverage of Wikipedia.
It could be better to opt those articles out of the crawler. Unless that's more effort. If articles included the year and month in the URL prefix, I would disallow /201* instead.
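A quick way to sanity-check a rule like that before shipping it. The robots.txt content and example paths are hypothetical. Note that robots.txt rules are prefix matches, so "/201" covers the same paths as "/201*"; as far as I know Python's standard-library parser only does plain prefix matching, so the sketch uses the form without the wildcard:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a site whose article URLs start with the year.
ROBOTS_TXT = """\
User-agent: *
Disallow: /201
"""


def is_crawlable(path, agent="Googlebot"):
    """Check a path against ROBOTS_TXT using prefix matching."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, "https://example.com" + path)


for path in ("/2014/05/some-article", "/2023/01/new-article", "/about"):
    print(path, is_crawlable(path))
```

One caveat with a bare prefix: it would also block unrelated paths that happen to start with "/201", so a stricter URL scheme would be safer.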
In a major site redesign a couple years ago, we dropped 3/4 of our old URLs, and saw a big improvement in SEO metrics.
I know it doesn’t make sense and that Google says it is not necessary. But it clearly worked for us.
I think a fundamental truth about Google Search is that no one understands how it actually works anymore, including Google. They announce search algorithm updates with specific goals… and then silently roll out tweaks, more updates, etc. when the predicted effect doesn’t show up.
I think the idea that Google is in control and all the SEOs are just guessing, is wrong. I think it’s become a complex enough ML system that now all anyone can do is observe and adjust, including Google.
I have noticed some articles (and not just "Best XXX of 202Y" articles) that seem to always update their "Updated on" date which Google unhelpfully picks up and shows in search results leading me to think the page is much more recent than it is.
> It baffles me why this would be successful SEO -- and assuming that it actually isn't, it baffles me why CNET thinks it would be.
If the content deleted is garbage, why wouldn't it help? No clue on CNET's overall quality, but I don't have a favorable image of it. Just had a look at their main page and that did not do it any favors.
Perhaps sites with a small ratio of new:total content would be downranked --- but I really don't think that makes sense because that's going to be the case for any long-established site.