While I was gone, everyone was buzzing about Matt Cutts's blog post, titled Indexing Timeline. I wanted to consolidate pieces of his post as well as his later comments in one place, so that those affected by the Big Daddy crawl/index issues wouldn't have to wade through hundreds of comments to see everything. I have intentionally left out some bits that weren't germane to the issue, that just repeated things said elsewhere, or that were obviously part of the original post and so easy to find. Feel free to go to the original post to see everything. So here, in no rational order, is my summary/compilation of Matt's remarks.
The sites that fit the "no pages in Bigdaddy" criteria were sites where our algorithms had very low trust in the inlinks or the outlinks of that site. Examples that might cause that include excessive reciprocal links, linking to spammy neighborhoods on the web, or link buying/selling. The Bigdaddy update is independent of our supplemental results, so when Bigdaddy didn't select pages from a site, that would expose more supplemental results for a site.
We didn't index pages from sites with less trusted links, and we responded and started indexing more pages from those sites pretty quickly.
I'd think about the quality of your links if you'd prefer to have more pages crawled. As these indexing changes have rolled out, we've been improving how we handle reciprocal link exchanges and link buying/selling.
It only has six links to the entire domain. With that few links, I can believe that out toward the edge of the crawl, we would index fewer pages.
Some folks that were doing a lot of reciprocal links might see less crawling. If your site has very few links where you'd be on the fringe of the crawl, then it's relatively normal that changes in the crawl may change how much of your site we crawl. And if you've got an affiliate site, it makes sense to think about the amount of value-add that your site provides; you want to provide a reason why users would prefer your site.
It's by design in Bigdaddy that we crawl somewhat more than we index. If you index everything that you crawl, you never know what you might be missing by crawling a little more, for example. I see at least one indexed post from your forum, so the fact that we've been visiting those pages is a good indicator that we're aware of those pages, and they may be incorporated in the index in the future. With Bigdaddy, it's expected behavior that we'll crawl some more pages than we index. That's done so that we can improve our crawling and indexing over time, and it doesn't mean that we don't like your site.
...an example of one of those sites that might have been crawled more before because of link exchanges. I picked five at random and they were all just traded links. Google is less likely to give those links as much weight now. That's the simple explanation for why we don't crawl you as deeply, in my opinion.
It's true that if you had N backlinks and some fraction of those are considered lower quality, we'd crawl your site less than if all N were fantastic. Hope that makes sense. Light crawling can also mean "we just didn't see many links to your domain" as well though.
It's not that reciprocal links are automatically bad. It's more that many reciprocal links exist for the wrong reasons.
One thing I want to be clear about is that Bigdaddy isn't especially intended to do differently on spam; it's just an infrastructure upgrade to our crawling, and as we get better at judging link quality, our crawl changes as a natural consequence of that.
The other thing is that I certainly don't want to imply that everyone who is still seeing fewer pages crawled was somehow getting spam or lower-quality links. I just wrote up the five cases that I analyzed in more depth. With a large change in our crawling infrastructure, it is to be expected that some sites will see more or less crawling.
In fact, I just got out of an hour-long joint meeting with crawl/index. Jim, we talked about your site, the one where you said "I'm trying to maintain a straight ship in a dirty segment." There are absolutely no penalties at all on your site; it's just a matter of PageRank in your case. You've got good links right now, and several hundred pages crawled, but a few more good links like you've got now would help some more.
Off-topic links wouldn't cause a penalty by themselves. Now if the off-topic links are spammy, that could cause a problem. But if a hardware company links to a software package, that's often a good link even though some people might think of the link as off-topic.