I had the pleasure of sitting down with Matt Cutts, head of Google’s webspam team, for over a half hour last week at Pubcon.
Stephan Spencer: I am with Matt Cutts here. I am Stephan Spencer, Founder and President of Netconcepts. Matt is Google engineer extraordinaire, head of the Webspam team at Google.
Matt Cutts: [laughing] Having a good time at Google, absolutely.
Stephan Spencer: Yeah. I have some questions here that I would like to ask you, Matt. Let us start with the first one: When one’s articles or product info is syndicated, is it better to have the syndicated copies linked to the original article on the author’s site, or is it just as good if it links to the home page of the author?
Matt Cutts: I would recommend linking to the original article on the author’s site. The reason is: imagine if you have written a good article and it is so nice that you have decided to syndicate it out. Well, there is a slight chance that the syndicated article could get a few links as well, and could get some PageRank. And so, whenever Googlebot or Google’s crawl and indexing system sees two copies of that article, a lot of the time it helps to know which one came first; which one has higher PageRank.
So if the syndicated article has a link to the original source of that article, then it is pretty much guaranteed the original home of that article will always have the higher PageRank, compared to all the syndicated copies. And that just makes it that much easier for us to do duplicate content detection and say: “You know what, this is the original article; this is the good one, so go with that.”
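As an illustration of the advice above (not something stated in the interview), a publisher could check that each syndicated copy links back to the original article rather than only to the home page. This is a minimal sketch using Python's standard `html.parser`; the names `HrefCollector` and `links_to_original` and the example.com URLs are hypothetical.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag in a document."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def links_to_original(syndicated_html: str, original_url: str) -> bool:
    """True if the syndicated copy links to the original article URL."""
    collector = HrefCollector()
    collector.feed(syndicated_html)
    # Ignore a trailing slash so /article and /article/ compare equal.
    target = original_url.rstrip("/")
    return any(href.rstrip("/") == target for href in collector.hrefs)
```

A copy that links only to the author's home page would fail this check, which is the situation Matt recommends avoiding.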
Stephan Spencer: OK great. Thank you.
The way of detecting supplemental pages through site:abc.com and the three asterisks minus some gobbledygook no longer works – that was a loophole which was closed shortly after SMX Advanced and after I mentioned it in my session. Now that it no longer works, is there another way to identify supplemental pages? Is there some sort of way to gauge the health of your site in terms of: “this is main index worthy” versus “nah, this is supplemental”?
Matt Cutts: I think there are one or two sort of undocumented ways, but we do not really talk about them. We are not on a quest to close down every single one that we know of. It is more like: whenever that happens, it is a bug to have our supplemental index treated very differently from the main index.
So we took away the “Supplemental Result” label, because we did not consider it as useful for regular users – and regular users were the ones who were using it. Any feature on Google’s search results page has to justify itself in terms of click-through or the number of pixels that are used versus the bang for the buck.
And the feedback we were getting from users was that they did not know what it was and did not really care. The supplemental results, which started out as sometimes being a little out of date, have gotten fresher and fresher and fresher. And at least at one data center – hopefully at more in the future – we are already doing those queries on the supplemental result or the supplemental index, for every single query, 100 percent of the time.
So it used to be the case that some small percentage of the time, we would say: oh, this is an arcane query – let’s go and we will do this query even on the supplemental index. And now we are moving to a world where we are basically doing that 100 percent of the time.
As the supplemental results became more and more like the main index, we said: this tag or label is not as useful as it used to be. So, even though there are probably a few ways to do it and we are not actively working to shut those down, we are not actively encouraging people and giving them tips on how to monitor that.
Stephan Spencer: OK.
Next question: what is the status on Google reading textual content within Flash .swf files? Are there improvements to come?
Matt Cutts: It is a good question. I think that we do a pretty good job of reading textual content. Now, stuff within Flash is binary and you can define it in terms of characters and strokes – so you can have things that look like normal text – but that are completely weird and are not really normal text. So it can be difficult to pull the text out of a Flash file. I think we do pretty well.
It used to be the case that we had our own, home-brew code to pull the text out of Flash, but I think that we have moved to the search engine SDK tool that Adobe/Macromedia offers. So, my hunch is that most of the search engines will standardize on using that search engine SDK tool to pull out the text. The easiest way to know whether you have textual content that can be read in a Flash file, is that you could always use that tool yourself and verify as well.
Stephan Spencer: Great tip.
All right, next question: Macromedia Adobe has the search engine SDK tool, which we have talked about now, but it has not been updated in some time, so is there still usefulness in this tool, as it continues to get older and older, in predicting what .SWF textual content can be read by the Googlebot spider? You guys evolve quite quickly and if the SDK is not keeping up, it kind of loses its utility.
Matt Cutts: Yeah. It is interesting to see Adobe have, in some cases, a renewed emphasis on Flash recently. They recently cut their prices on some multimedia Flash-type servers.
My general answer is that, probably, we will continue to rely on the search engine SDK tool. If you, as a webmaster, feel strongly that Adobe should do more and better, then I would say you could contact those guys and say: “Hey, Adobe, I wish you would continue to update that.” or “I hope you will continue to do iterations.”
My hunch is that we will essentially standardize on that SDK tool and hopefully that will create some incentives for Adobe to keep updating it, and make sure that it is as fresh as possible.
Stephan Spencer: Great.
Next question: will Google utilize the acquired “Riya” technology that determines similarity of image content through analysis of things like color, shape and texture, to assist in identifying Black Hat optimization? (Just to be clear, I don’t think Google bought Riya.) An example of where this would be useful is: if there is a background image behind links that match the color of the image, and make the links appear hidden.
Matt Cutts: It is kind of funny; I am not sure. I do not think we have “Riya” – I think we have “Neven Vision”, but your question still stands, and it is a good one: whether we will use that sort of technology to help with things like black hat text hiding.
The short answer is: we think that relatively simple heuristics, as far as color-matching, work pretty well. Of course, you cannot go with the exact color, because people will monkey around in the RGB space a little bit and try to look a little different in the RGB space – but in perceptual space there is not much difference. However, in practice, the vast majority of hidden text colors are pretty similar.
I certainly have seen some spam where the image was blue and noisy, with blue text which did not stand out, so users did not notice it very much – but that sort of thing is relatively rare. If somebody is willing to put in the effort to effectively hide text in a very busy or interesting image, then they could almost put in the same amount of effort and just make good content.
I think we are, certainly, open to employing those advanced techniques to things like: what is the dominant color of an image, or things like that – but, in practice, it seems like most people have not tried to exploit those particular holes that much.
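The color-matching heuristic described above can be sketched roughly as follows. This is only an illustration and not Google's method: the function names, the weighted RGB distance (a cheap approximation of perceptual difference) and the threshold of 30 are all assumptions.

```python
def parse_hex(color: str) -> tuple[int, int, int]:
    """Convert '#rrggbb' into an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def color_distance(a: str, b: str) -> float:
    """Weighted RGB distance; the weights are a rough stand-in for perception."""
    r1, g1, b1 = parse_hex(a)
    r2, g2, b2 = parse_hex(b)
    return (2 * (r1 - r2) ** 2 + 4 * (g1 - g2) ** 2 + 3 * (b1 - b2) ** 2) ** 0.5


def looks_hidden(text_color: str, background_color: str, threshold: float = 30.0) -> bool:
    """Flag text whose color is nearly the same as its background (hypothetical threshold)."""
    return color_distance(text_color, background_color) < threshold
```

With these assumptions, `looks_hidden("#ffffff", "#fefefe")` flags near-white text on a white background even though the RGB values are not identical, which is the "monkeying around in RGB space" case. It would not catch text hidden in a busy image, which needs the dominant-color analysis mentioned above.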
Stephan Spencer: OK.
Next question: are social bookmark links given less weight than other back links – given how easy these services are to manipulate?
Matt Cutts: Typically, our policy is: a link is a link, is a link; wherever that link’s worth is, that is the worth that we give it. Some people ask about links from DMOZ, links from .edu or links from .gov, and they say: “Isn’t there some sort of boost? Isn’t a link better if it comes from a .edu?” The short answer is: no, it is not. It is just that .edu links tend to have higher PageRank, because more people link to .edu’s or .gov’s.
To the best of my knowledge, I do not think we have anything that says social bookmark links are given less weight. Certainly, some sites like del.icio.us and other people, may choose to put individual “nofollows” in and they may choose to take actions to try to prevent spam, but we do not typically say anything like: social bookmarking by itself – give less weight.
Stephan Spencer: OK. So, I guess, a follow on to that would be: a .edu and .gov link, and so forth, has, typically, a more pristine link neighborhood, so it is not just about the PageRank, right? The link neighborhood comes into play.
Matt Cutts: That is a little bit of a “secret sauce” question, so I am not going to go into how much we do trust that sort of stuff.
Stephan Spencer: OK. I am going to slap my wrist now. Ouch, ouch!
Matt Cutts: [laughing]
But, certainly, there are all of the things that make for the good qualities of a link from a .edu or a .gov site. It is not that we hard-code and say: .edu or .gov links are good. It is that .edu links tend to be a little better on average; they tend to have a little higher PageRank, and they do have this sort of characteristic that we would trust a little more. There is nothing in the algorithm itself, though, that says: oh, .edu – give that link more weight.
Stephan Spencer: Yes. Which is what I would expect that SEOs would have already realized.
Matt Cutts: Well, you would be surprised how many are like: “Oh, I have to get .edu links because they are better.” You can have a useless .edu link just like you can have a great .com link.
Stephan Spencer: Yeah. And for those of you who do not believe that, just do a search for “buy viagra” and look at all the .edus that come up, or “viagra site:edu”.
Matt Cutts: [laughing]
Stephan Spencer: Pretty sad.
Next question: given the ever-broadening definition of doorway pages in Google’s Guidelines, would a poorly done site map page now be considered to be a doorway page? A page that is just a list of links with no real hierarchy, very keyword-rich because there are full product names and category names and so forth.
Matt Cutts: Typically, we try to be relatively aware and relatively careful about that, because it is very natural to say: take a list of all of my pages and export that, then turn them into clickable links and now I have a site map.
In fact, if you made a sitemap file, or sitemap file ‘proper’, you would end up with something that you could submit directly to Google. At first glance, that might look keyword-rich or that might look like a doorway page, but we try to be relatively savvy.
A good example is About.com. They have had site maps for a long time. They had even named it “SpiderBites”, which, at first glance looked like: “Hello! You are going for the Google Spider or something” – but whenever you dug into it, it was radically clear that what they were doing, was just normal site map behavior. It was not that they were trying to do any malicious work.
I think our own page algorithms for scoring content do a pretty good job of looking past keyword stuffing, and things like that, anyway. It is also the case that we try to be pretty savvy about that. That said, I think you have got a question you will ask later about how many links exactly you can get on a page? So, we may go into it in more depth then.
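For readers who want the sitemap file "proper" rather than an HTML list of links, here is a minimal sketch that exports a list of page URLs in the XML format defined by the Sitemap protocol (sitemaps.org). The function name `build_sitemap` and the example.com URLs are assumptions; only the required `<urlset>`, `<url>` and `<loc>` elements are emitted.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """Return a Sitemap-protocol XML document listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as '&' in the URL text.
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

Calling `build_sitemap(["http://www.example.com/", "http://www.example.com/products"])` returns a string that can be saved as `sitemap.xml` and submitted through the search engine's webmaster tools.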
Stephan Spencer: OK.
Next question: what is excessive in the length of a keyword-rich URL? We have seen clients use keyword URLs that have 10 to 15 words strung together with hyphens; or blogs – we have seen them even longer there. A typical WordPress blog will use the title of the post as the post slug, unless you defined something different and you can just go on and on and on. Can you give any guidelines or recommendations in that regard?
Matt Cutts: Certainly. If you can make your title four or five words long, that is pretty natural. If you have got three, four or five words in your URL, that can be perfectly normal. As it gets a little longer, then it starts to look a little worse. Now, our algorithms typically will just weight those words less and just not give you as much credit.
The thing to be aware of is, ask yourself: “How does this look to a regular user?” – because if, at any time, somebody comes to your page or, maybe, a competitor does a search and finds 15 words all strung together like variants of the same word, then that does look like spam, and they often will send a spam report. Then somebody will go and check that out.
So, I would not make it a big habit of having tons and tons of words stuffed in there, because there are plenty of places on a page, where you can have relevant words and have them be helpful to users – and not have it come across as keyword stuffing.
Stephan Spencer: So, would something like 10 words be a bit much then?
Matt Cutts: It is a little abnormal. I know that when I hit something like that – even a blog post – with 10 words, I raise my eyebrows a little bit and, maybe, read with a little more skepticism. So, if just a regular savvy user has that sort of reaction, then you can imagine how that might look to some competitors and others.
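One way to act on this advice is to cap the number of words when generating a post slug instead of using the full title. The sketch below is hypothetical (it is not how WordPress builds slugs): the name `make_slug` and the default cap of five words, taken from Matt's "three, four or five words", are assumptions.

```python
import re


def make_slug(title: str, max_words: int = 5) -> str:
    """Lowercase the title, keep alphanumeric words, and cap the word count."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words[:max_words])
```

For example, `make_slug("Ten Things You Should Know About Keyword Rich URLs Today")` yields `ten-things-you-should-know`. Simple truncation can cut off the most descriptive words, so a hand-written slug is often the better choice.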
Stephan Spencer: Yes.
Do you think we are moving towards algorithmic search results having substantially more human validation and/or intervention? There are projects such as Search Wikia that seem to be going down that path. What do you think? What does Google think about this?
Matt Cutts: It is a really interesting topic, because when Google started, we had just a few hundred people and the Web was so very large. We had to process tons and tons of pages and tons and tons of languages. We had to have the most capable, robust approach we could.
The only thing that would really work well at that time was algorithms, because computers do not get tired, they can work 24/7, they do not exhibit any bias by themselves. Of course, an algorithm could somehow have some bias baked in when the human wrote it, but the computer itself is perfectly logical when it executes that algorithm.
So, for the longest time, Google pursued that as its first and foremost strategy – to the point where some people think that Google is nothing but algorithms and there is no room for any humans at all. In fact, we tried to be relatively clear that, if someone reports an off-topic spam that is redirecting to porn – everybody wants that gone except for the porn spammer. So, we are ready to take manual action on that.
Going forward, I think it is really interesting to think about the role of humans in search. I have done a post on my blog about that. I think that, if you can use humans but in a scalable and robust way – that is really the key. If you had to have a person construct all the search results for one search, there are so many search results and the long tail is so long, there is no scalable way you could do it.
But, for example, let us suppose you could have some humans figure out a scalable way to find spam, or a scalable way to say whether individual sites are good or bad, then those are the sort of things where it could be on the order where humans could genuinely help you.
I am glad that Wikia exists and that they are going to try this approach that puts a little more emphasis on people, because I think we need to let 1,000 flowers bloom and let lots of different search engines with lots of different philosophies try those ideas. And I think Google is willing to be pragmatic and embrace any approach that might work.
Stephan Spencer: OK.
Initially, it was stated that “nofollowed” links would be followed and crawled, but PageRank would not be passed. But you have recently stated that “nofollow” links are not even used for discovery. First of all, let us confirm: is that the case that they are not even used for discovery, and, if so, why the change?
Matt Cutts: It is interesting. Whenever we talked about it originally, we said PageRank would not be passed, and the messaging that I tried to do was that it would not even be followed and it would not even be crawled. It turned out there was a really weird situation, where, if you had totally unique anchor text that nobody else had, we would not follow that link – but if we had found the page from some other source, we still had this anchor text lying around and we were willing to associate it with that page.
Personally, I think that is almost a bug, because if you ever sign a blog post with a comment and you have some really weird anchor text, then when you search for that text and you find the blog post, your natural conclusion is that these “nofollowed” links do contribute something – whether it is PageRank, anchor text or some sort of vote. Then you immediately get back to people trying to spam blogs and trying to spam all those places that have “nofollowed” links.
I almost view it as, for a short time it was almost like a bug – that some anchor text, in some very strange situations, could flow. We have fixed that.
There was an example, where someone had done “dallas auto repair warranties” and another query, where they thought that “nofollow” had actually passed either anchor text or PageRank. My suggestion would be that people should repeat those experiments, because I do not think that those experiments will hold true now.
In fact, if you look at the Wikipedia pages for “Nofollow” (at http://en.wikipedia.org/wiki/Nofollow), they say – in “reference number eight”, if I remember correctly – something about how these links may still be used in some limited circumstances for this or for that. At least for Google, we have taken a very clear stance that those links are not even used for discovery; they are not used for PageRank; they are not used for anchor text in any way. Anybody can go and do various experiments to verify that.
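Anyone repeating such experiments first needs to know which links on a page actually carry `rel="nofollow"`. The following is a minimal sketch using Python's standard `html.parser`; the names `LinkAudit` and `audit_links` are hypothetical, and the code only classifies links. It says nothing about how any search engine treats them.

```python
from html.parser import HTMLParser


class LinkAudit(HTMLParser):
    """Collect hrefs, split by whether rel contains 'nofollow'."""

    def __init__(self) -> None:
        super().__init__()
        self.followed: list[str] = []
        self.nofollowed: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        href = attr.get("href")
        if not href:
            return
        # rel is a space-separated token list, e.g. rel="external nofollow".
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def audit_links(html: str) -> tuple[list[str], list[str]]:
    """Return (followed hrefs, nofollowed hrefs) for an HTML document."""
    parser = LinkAudit()
    parser.feed(html)
    return parser.followed, parser.nofollowed
```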
Stephan Spencer: Great.
How concerned are you over the tactic of aggressive link buying to competitors’ sites in order to take down competitors? How long, do you think, it will be until competitors start taking each others’ sites out in Google with aggressive link buying?
Matt Cutts: I do not think a smart competitor will even try that second one, because they would be more likely to help the other site than to hurt it. The thing is, we are very aware that site ‘A’ could buy links to site ‘B’, and then spam-report site ‘B’ and try to frame site ‘B’. So we try very hard, in all of our spam techniques, to make it so that one site cannot sabotage another site.
If you will notice, we do not say that it is impossible. The reason we do not say that it is impossible is, if you remember sex.com a few years ago, somebody – if I remember correctly – sent a fax and claimed to be the site owner and grabbed the ownership of sex.com and kept it for a few years, until they were forced by a court to relinquish it.
There is always the ‘far out’, possible case where somebody could do identity theft and grab your domain and hurt your domain that way. So we do not say it is impossible for a competitor to hurt another competitor, but we do try very hard. In fact, you have noticed that, with link buying in particular, we have been concentrating in the last couple of months more on the link selling aspect of that.
The odds that someone can come to us and say: “Oh! Someone hacked my site and sold links on my site for four months, and I had no idea! And, oh, yeah, I did bill it in Google Checkout, but they hacked my Google Checkout account, too! And I am being framed! It is a conspiracy!” – the odds of someone plausibly being able to make that argument are a lot lower. We do try very hard to prevent someone from hurting somebody else, and we are very mindful of that.
Stephan Spencer: Great.
Earlier in the year, Yahoo introduced the <div class="robots-nocontent"> as a way to isolate parts of a page. I have not heard much since then, or if that is still viable, but, in any event, it was received with mixed emotions. Is this something, though, that you guys have given any thought to?
Matt Cutts: We definitely have. Personally, I think it is kind of interesting, because it gives more flexibility to site owners to sculpt how they want to flow PageRank or to change how the page should be indexed. I am always a fan of giving people more flexibility and more tools.
The downside of that – which immediately becomes apparent when I talk to other Googlers and whenever you think about it for a while – is that it is another feature that has to be supported. And I like to joke that the half-life of code at Google is about six months. You could write some code, come back six months later, and about half of it would be on some new infrastructure or be stale and so on.
We are constantly working on improving our infrastructure and our architecture. To have another feature to support, it has to be something that is compelling, that a lot of people use. So, what we did is we said: “OK. Let’s wait four or six weeks, and see how many people on the web are really using this particular feature.”
I made a deal with another Googler and said: “OK. If a lot of people use it, then maybe we will be more likely to support it.” If I remember correctly, fewer than 500 domains had used this tag at all. And in the grand scheme of things, where there are literally hundreds of millions of domains and tens of millions of very active domains, it is not the case that 500 sites is a very large amount.
My guess is, we would be more likely to spend our resources on other stuff, at least right now. We are open to the idea, but we have not heard a lot of people really, really asking for it.
Stephan Spencer: OK.
Google recommends having “no more than 100 links per page, for good usability” – and it is good usability. Pages with a much larger number of links may be considered to be edging into doorway page status. So the guideline, for our listeners, is: “Create pages with good usability, intended for end users and not for search bots.”
However, DHTML allows people to create really great, usable pages with far larger amounts of links on the page, and allows those links to be crawlable. Users could click the “category” link to expand menus for links to sub-pages, for instance. Could we assume that, if the page is nicely usable, it might be OK to do far more, perhaps, than the 100 links per page guideline? What is the new cut-off number, or a new guideline, in this age of DHTML?
Matt Cutts: I would recommend that people run experiments, because, if you have 5,000 links on a page, the odds that we would flow PageRank through those is kind of low. We might say at some point: that is just way too many thousands of links. And at some point, your fan-out is so high that the individual PageRank going out on each one of those links is pretty tiny.
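The fan-out point can be made concrete with the simplified PageRank model from the original published paper, in which a page divides the rank it passes on equally among its outbound links. This is a textbook sketch, not a description of Google's production system; the function name and the damping factor of 0.85 (the value suggested in that paper) are assumptions here.

```python
def pagerank_per_link(page_rank: float, outbound_links: int, damping: float = 0.85) -> float:
    """Share of a page's PageRank passed along each outbound link (simplified model)."""
    if outbound_links <= 0:
        raise ValueError("outbound_links must be positive")
    return damping * page_rank / outbound_links
```

Under this model, a page with 5,000 links passes one fiftieth as much along each link as the same page would with 100 links.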
I will give you a little bit of background – and I encourage people to run experiments and find what works for them. The reason for the 100 links per page guideline is that we used to crawl only about the first 101 kilobytes of a page. If somebody had a lot more than a hundred links, then it was a little more likely that the page would get truncated at around 100 kilobytes and some of the links would not be followed or would not be counted. Nowadays, I forget exactly how much we crawl and index and save, but I think we are willing to save at least half a megabyte from each page.
So, if you look at the guidelines, we have two sets of guidelines on one page. We have: quality guidelines which are essentially spam and how to avoid spam; and we have technical guidelines. The technical guidelines are more like best practices. So, the 100 links is more like a ‘best practice’ suggestion, because if you keep it under 100, you are guaranteed never to get truncated.
So, certainly, I do think it is possible to have more links, especially with DHTML – that was once an issue. But people should always bear in mind that they can pull in a regular user off the street and have them take a look at it. If you have got so many links, and they are of such a spammy nature, that it looks spammy to that regular person, then you want to think about breaking it down. There are a lot of ways you can break it down: you can go by category; you can go chronologically; you can have different topics. If it feels like you have got too many, you can definitely break it into a lot of sub-categories.
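To see why a size limit translates into a link-count guideline, the sketch below counts how many links fall inside the first N bytes of a page. It is a rough, regex-based illustration and not a real crawler: the function name is hypothetical, and the default limit simply reuses the historical 101-kilobyte figure mentioned above.

```python
import re

# Matches the start of an anchor tag that carries an href attribute.
LINK_PATTERN = re.compile(rb"<a\s[^>]*href=", re.IGNORECASE)


def links_within_limit(html: bytes, limit_bytes: int = 101 * 1024) -> tuple[int, int]:
    """Return (links inside the first limit_bytes, links in the whole page)."""
    total = len(LINK_PATTERN.findall(html))
    kept = len(LINK_PATTERN.findall(html[:limit_bytes]))
    return kept, total
```

If the two numbers differ, some links sit beyond the cut-off and would have been missed by a crawler that truncated the page at that size.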
Stephan Spencer: Right.
Next question, also regarding that 100 links per page recommendation: Is the higher number of links on say, category pages or sub-category pages, more permissible than on product or static content pages because the latter would appear more spammy?
Matt Cutts: I think you would want to apply the common sense approach. So, let us talk about a newspaper, for example. A newspaper might have written thousands of articles, and so, to have all of that linked to from one page would be probably a bit much – even for users.
Suppose the newspaper decides to break it down chronologically. They will have: “all the articles we wrote in 2007”. Then you click on that, and maybe that is still like 2,000 links, which is a little high.
So then they might break it down to: “all the stories we wrote in January, 2007; February, 2007; March, 2007”. You go through and, suppose there are 120 or 200 links – that is more than our 100 link guideline that we give on the technical side, but the user who has gotten there, really understands why. They would say: “Oh, well, I wanted this story from March 2000. I clicked on 2000 and I clicked on March. I knew the story was on March 14, and here is my story.”
There definitely can be situations like that, where you have a larger number of links on the sub-pages, the sub-categories or the categories, but because that is the most logical way to break it down, it can make perfect sense for users and, therefore, perfect sense for search engines.
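The chronological breakdown in the newspaper example amounts to grouping articles by year and month so that each archive page lists only one month's stories. Here is a minimal sketch under that assumption; the function name and the (URL, publication date) input shape are hypothetical.

```python
from collections import defaultdict
from datetime import date


def group_by_month(articles: list[tuple[str, date]]) -> dict[tuple[int, int], list[str]]:
    """Map (year, month) to the article URLs published in that month."""
    archive: dict[tuple[int, int], list[str]] = defaultdict(list)
    for url, published in articles:
        archive[(published.year, published.month)].append(url)
    return dict(archive)
```

Each key then becomes one archive page, and the length of its list shows whether that page is still carrying more links than feels reasonable for a reader.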
Stephan Spencer: Good suggestion.
Will the reasons for a site’s PageRank reduction ever be disclosed to webmasters in Webmaster Central?
Matt Cutts: I am open to that. It is kind of funny because first and foremost, we have to care about fixing any problems we see, trying to make sure that we have the most clean index that we can. So, malware is a good example of that. First and foremost, we did not want to return malware to users. So, we started out just by removing sites that had malware even if they were hacked.
We tried to take the hacked sites out for a short period of time, but we did not have the resources to contact all of those people and to work one-on-one to help them get the malware removed from their site. Then, over time, we got better about messaging. We would show that a site was removed for malware, and then we had a process that was a partnership with StopBadware where it would take up to 10 days.
At the time, people were like: “Ten days for my site to get back! That causes me a lot of stress and a lot of pain!” – but compared to the stress and pain of a user who got malware from your site, we have to balance those.
We have continued to iterate. We have gotten better and better. So now, you can more or less get your site malware re-reviewed in about 24 hours, and we have just recently started to show messages to webmasters in our message center in Webmaster Central to say: “Yes, your site has some malware.” If I remember correctly, we even show a few example URLs to say: “Yes, here is where to look to find the malware.”
That shows this gradual progression where, first and foremost, we have to take care of spam, the viruses, the malware or the trojan. Then, over time, we polish off those rough edges and we try to provide better messaging and better alerts to help the webmasters as well. I could certainly imagine that over time, we could tell a webmaster: “Yes, we uncovered links that looked like they were certainly sold, so that played a factor in Google losing a little more trust in your website.” I am certainly open to doing that.
You also have to think about whether a site can be pulled toward white hat or not. Clearly, if somebody is a malicious spammer and they are just trying to do awful, awful things, you do not want to give them a heads-up that they have been caught. So, if we have seen someone that we think is deliberately abusive and really spammy and really savvy and they know what they are doing, then they might not expect to get a heads-up in our “Webmaster Console”.
But if someone is a relatively new webmaster, a small Mom-and-Pop business, maybe we think they did not know any better, then it is a little more likely that we might try to give them some message to say: “This is an issue. It is a violation of our guidelines. It is a violation of every search engine’s guidelines. Here is where you can read more about it. If you can correct this issue, then here is where to go and request a reconsideration.”
Stephan Spencer: Right, because if somebody is a white hat and has a history of being a white hat, certainly they deserve to be given a heads-up, whereas you do not want to define that line for a black hat spammer.
Matt Cutts: Yes. You do not want to clue in the bad guys but you want all the people who are on the fence or who are right towards the white hat edge, you want to keep pulling them into that white hat direction.
Stephan Spencer: Yes.
My last question here: what RSS feeds do you subscribe to?
Matt Cutts: [laughs] A better question is what RSS feeds I do not subscribe to.
Stephan Spencer: Perhaps you can just supply us with your OPML file?
Matt Cutts: [laughs] You know, I have thought about that. At various times, I have done screenshots so that people could get a sense of the sort of things I read. It is funny because I have it broken down into general search, white hat, and black hat. I try to keep the black hat folder closed so people do not feel bad. You know: “Oh, no! I am in Matt’s black hat folder!” Although there are a few people in there.
Stephan Spencer: Some would feel good. [laughter]
Matt Cutts: Yes, maybe they would be honored, who knows, but I do not want to give them the glory in that case. But, yes, certainly sites like Search Engine Land, Google Blogoscoped, Google Operating System, you know. Those are fantastic to just get first line news. Then, there are things like Search Engine Journal and all those sort of guys where you can get a lot more follow-on news or thoughtful commentary afterwards.
There are a lot of really good feeds that I read. I read about 70 or maybe even 100 in the search space. A few that I read are not search – there are only five or 10, but XKCD is a Web comic that is really pretty funny, that is very Web savvy. I found a feed on Flickr for their “Photos of the Day” which is just a nice way to start your day.
There is a neat site called One Sentence and the idea is that you have to tell an entire story in one sentence. You know, it is very compelling stuff. So, it is about 10 sentences a day, about 10 posts, and that is really a fun site as well.
Stephan Spencer: Cool. All right. Well, thanks very much for your time, Matt.
Matt Cutts: Yes, good talking to you.