Stephan Spencer's Scatterings

The Scattered Wisdom of a scientist turned web marketing virtuoso

Interview with Google's Matt Cutts at Pubcon

I had the pleasure of sitting down with Matt Cutts, head of Google's webspam team, for over a half hour last week at Pubcon. I invite you to download the audio recording (31 minutes, MP3) and peruse the transcript, which follows below...

Stephan Spencer: I am with Matt Cutts here. I am Stephan Spencer, Founder and President of Netconcepts. Matt is Google engineer extraordinaire, head of the Webspam team at Google.

Matt Cutts: [laughing] Having a good time at Google, absolutely.

Stephan Spencer: Yeah. I have some questions here that I would like to ask you, Matt. Let us start with the first one: When one's articles or product info is syndicated, is it better to have the syndicated copies linked to the original article on the author's site, or is it just as good if it links to the home page of the author?

Matt Cutts: I would recommend linking to the original article on the author's site. The reason is: imagine if you have written a good article and it is so nice that you have decided to syndicate it out. Well, there is a slight chance that the syndicated article could get a few links as well, and could get some PageRank. And so, whenever Googlebot or Google's crawl and indexing system sees two copies of that article, a lot of the time it helps to know which one came first; which one has higher PageRank.

So if the syndicated article has a link to the original source of that article, then it is pretty much guaranteed the original home of that article will always have the higher PageRank, compared to all the syndicated copies. And that just makes it that much easier for us to do duplicate content detection and say: "You know what, this is the original article; this is the good one, so go with that."

Stephan Spencer: OK great. Thank you.

The way of detecting supplemental pages through site:abc.com and the three asterisks minus some gobbledygook no longer works - that was a loophole which was closed shortly after SMX Advanced and after I mentioned it in my session. Now that it no longer works, is there another way to identify supplemental pages? Is there some sort of way to gauge the health of your site in terms of: "this is main index worthy" versus "nah, this is supplemental"?

Matt Cutts: I think there are one or two sort of undocumented ways, but we do not really talk about them. We are not on a quest to close down every single one that we know of. It is more like: whenever that happens, it is a bug to have our supplemental index treated very differently from the main index.

So we took away the "Supplemental Result" label, because we did not consider it as useful for regular users - and regular users were the ones who were using it. Any feature on Google search result page has to justify itself in terms of click-through or the number of pixels that are used versus the bang for the buck.

And the feedback we were getting from users was that they did not know what it was and did not really care. The supplemental results, which started out as sometimes being a little out of date, have gotten fresher and fresher and fresher. And at least at one data center - hopefully at more in the future - we are already doing those queries on the supplemental index for every single query, 100 percent of the time.

So it used to be the case that some small percentage of the time, we would say: oh, this is an arcane query - let's go and we will do this query even on the supplemental index. And now we are moving to a world where we are basically doing that 100 percent of the time.

As the supplemental results became more and more like the main index, we said: this tag or label is not as useful as it used to be. So, even though there are probably a few ways to do it and we are not actively working to shut those down, we are not actively encouraging people and giving them tips on how to monitor that.

Stephan Spencer: OK.

Next question: what is the status on Google reading textual content within Flash .swf files? Are there improvements to come?

Matt Cutts: It is a good question. I think that we do a pretty good job of reading textual content. Now, stuff within Flash is binary and you can define it in terms of characters and strokes - so you can have things that look like normal text but that are completely weird and are not really normal text. So it can be difficult to pull the text out of a Flash file. I think we do pretty well.

It used to be the case that we had our own, home-brew code to pull the text out of Flash, but I think that we have moved to the search engine SDK tool that Adobe/Macromedia offers. So, my hunch is that most of the search engines will standardize on using that search engine SDK tool to pull out the text. The easiest way to know whether you have textual content that can be read in a Flash file, is that you could always use that tool yourself and verify as well.

Stephan Spencer: Great tip.

All right, next question: Adobe/Macromedia has the search engine SDK tool, which we have talked about now, but it has not been updated in some time. So is there still usefulness in this tool, as it continues to get older and older, in predicting what .swf textual content can be read by the Googlebot spider? You guys evolve quite quickly and if the SDK is not keeping up, it kind of loses its utility.

Matt Cutts: Yeah. It is interesting to see Adobe have, in some cases, a renewed emphasis on Flash recently. They recently cut their prices on some multimedia Flash-type servers.

My general answer is that, probably, we will continue to rely on the search engine SDK tool. If you, as a webmaster, feel strongly that Adobe should do more and better, then I would say you could contact those guys and say: "Hey, Adobe, I wish you would continue to update that." or "I hope you will continue to do iterations."

My hunch is that we will essentially standardize on that SDK tool and hopefully that will create some incentives for Adobe to keep updating it, and make sure that it is as fresh as possible.

Stephan Spencer: Great.

Next question: will Google utilize the acquired "Riya" technology that determines similarity of image content through analysis of things like color, shape and texture, to assist in identifying Black Hat optimization? (Just to be clear, I don't think Google bought Riya.) An example of where this would be useful is: if there is a background image behind links that match the color of the image, and make the links appear hidden.

Matt Cutts: It is kind of funny; I am not sure. I do not think we have "Riya" - I think we have "Neven Vision", but your question still stands, and it is a good one: whether we will use that sort of technology to help with things like black hat text hiding.

The short answer is: we think that relatively simple heuristics, as far as color-matching, work pretty well. Of course, you can not go with the exact color, because people will monkey around in the RGB space a little bit and try to look a little different in the RGB space - but in perceptual space there is not much difference. However, in practice, the vast majority of hidden text colors are pretty similar.
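To make the point about RGB space versus perceptual space concrete, here is a minimal sketch of a color-matching check. It is not Google's heuristic, which the interview does not describe; the function names, the "redmean" distance approximation, and the threshold of 30 are all assumptions chosen for illustration. The idea is only that an exact-match test misses near-identical colors, while a distance threshold catches them.

```python
import math


def parse_hex(color):
    """Convert a '#rrggbb' string to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def redmean_distance(c1, c2):
    """Approximate perceptual distance between two RGB colors.

    Uses the low-cost 'redmean' weighting, a rough stand-in for a
    real perceptual color space (an assumption for this sketch).
    """
    r1, g1, b1 = c1
    r2, g2, b2 = c2
    rmean = (r1 + r2) / 2
    dr, dg, db = r1 - r2, g1 - g2, b1 - b2
    return math.sqrt(
        (2 + rmean / 256) * dr * dr
        + 4 * dg * dg
        + (2 + (255 - rmean) / 256) * db * db
    )


def looks_hidden(text_color, background_color, threshold=30.0):
    """Flag text whose color is nearly indistinguishable from its background.

    The threshold is hypothetical; a real system would tune it.
    """
    distance = redmean_distance(parse_hex(text_color), parse_hex(background_color))
    return distance < threshold
```

With this sketch, looks_hidden("#ffffff", "#fefefe") flags off-white text on a white background even though the two RGB values are not equal, which is the case an exact comparison would miss.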

I certainly have seen some spam where the image was blue and noisy, with blue text which did not stand out, so users did not notice it very much - but that sort of thing is relatively rare. If somebody is willing to put in the effort to effectively hide text in a very busy or interesting image, then with almost the same amount of effort they could just make good content.

I think we are, certainly, open to employing those advanced techniques to things like: what is the dominant color of an image, or things like that - but, in practice, it seems like most people have not tried to exploit those particular holes that much.

Stephan Spencer: OK.

Next question: are social bookmark links given less weight than other back links - given how easy these services are to manipulate?

Matt Cutts: Typically, our policy is: a link is a link, is a link; wherever that link's worth is, that is the worth that we give it. Some people ask about links from DMOZ, links from .edu or links from .gov, and they say: "Isn't there some sort of boost? Isn't a link better if it comes from a .edu?" The short answer is: no, it is not. It is just .edu links tend to have higher PageRank, because more people link to .edu's or .gov's.

To the best of my knowledge, I do not think we have anything that says social bookmark links are given less weight. Certainly, some sites like del.icio.us and other people, may choose to put individual "nofollows" in and they may choose to take actions to try to prevent spam, but we do not typically say anything like: social bookmarking by itself - give less weight.

Stephan Spencer: OK. So, I guess, a follow on to that would be: a .edu and .gov link, and so forth, has, typically, a more pristine link neighborhood, so it is not just about the PageRank, right? The link neighborhood comes into play.

Matt Cutts: That is a little bit of a "secret sauce" question, so I am not going to go into how much we do trust that sort of stuff.

Stephan Spencer: OK. I am going to slap my wrist now. Ouch, ouch!

Matt Cutts: [laughing]

But, certainly, a link from a .edu or a .gov site can have all of those good qualities. It is not that we hard-code and say: .edu or .gov links are good. It is that, among good links, .edu links tend to be a little better on average; they tend to have a little higher PageRank, and they do have this sort of characteristic that we would trust a little more. There is nothing in the algorithm itself, though, that says: oh, .edu - give that link more weight.

Stephan Spencer: Yes. Which is what I would expect that SEOs would have already realized.

Matt Cutts: Well, you would be surprised how many are like: "Oh, I have to get .edu links because they are better." You can have a useless .edu link just like you can have a great .com link.

Stephan Spencer: Yeah. And for those of you who do not believe that, just do a search for "buy viagra" and look at all the .edus that come up, or "viagra site:edu".

Matt Cutts: [laughing]

Stephan Spencer: Pretty sad.

Next question: given the ever-broadening definition of doorway pages in Google's Guidelines, would a poorly done site map page now be considered to be a doorway page? A page that is just a list of links with no real hierarchy, very keyword-rich because there are full product names and category names and so forth.

Matt Cutts: Typically, we try to be relatively aware and relatively careful about that, because it is very natural to say: take a list of all of my pages and export that, then turn them into clickable links and now I have a site map.

In fact, if you made a sitemap file, or sitemap file 'proper', you would end up with something that you could submit directly to Google. At first glance, that might look keyword-rich or that might look like a doorway page, but we try to be relatively savvy.

A good example is About.com. They have had site maps for a long time. They had even named it "SpiderBites", which, at first glance looked like: "Hello! You are going for the Google Spider or something" - but whenever you dug into it, it was radically clear that what they were doing, was just normal site map behavior. It was not that they were trying to do any malicious work.

I think our own page algorithms for scoring content do a pretty good job of looking past keyword stuff, and things like that, anyway. It is also the case, that we try to be pretty savvy about that. That said, I think you have got a question you will ask later about how many links exactly you can get on a page? So, we may go into it in more depth then.

Stephan Spencer: OK.

Next question: what is excessive in the length of a keyword-rich URL? We have seen clients use keyword URLs that have 10 to 15 words strung together with hyphens; or blogs - we have seen them even longer there. A typical WordPress blog will use the title of the post as the post slug, unless you defined something different and you can just go on and on and on. Can you give any guidelines or recommendations in that regard?

Matt Cutts: Certainly. If you can make your title four or five words long, that is pretty natural. If you have got three, four or five words in your URL, that can be perfectly normal. As it gets a little longer, then it starts to look a little worse. Now, our algorithms typically will just weight those words less and just not give you as much credit.

The thing to be aware of is, ask yourself: "How does this look to a regular user?" - because if, at any time, somebody comes to your page or, maybe, a competitor does a search and finds 15 words all strung together like variants of the same word, then that does look like spam, and they often will send a spam report. Then somebody will go and check that out.

So, I would not make it a big habit of having tons and tons of words stuffed in there, because there are plenty of places on a page, where you can have relevant words and have them be helpful to users - and not have it come across as keyword stuffing.

Stephan Spencer: So, would something like 10 words be a bit much then?

Matt Cutts: It is a little abnormal. I know that when I hit something like that - even a blog post - with 10 words, I raise my eyebrows a little bit and, maybe, read with a little more skepticism. So, if just a regular savvy user has that sort of reaction, then you can imagine how that might look to some competitors and others.
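As a rough way to audit your own URLs against the numbers mentioned above, the following sketch counts hyphen-separated words in the last path segment of a URL. The function names and the rating labels are assumptions for illustration; the cut-offs (up to five words normal, ten words abnormal) simply echo the figures from this conversation and are not a documented Google rule.

```python
from urllib.parse import urlparse


def slug_word_count(url):
    """Count hyphen-separated words in the last path segment of a URL."""
    path = urlparse(url).path.rstrip("/")
    slug = path.rsplit("/", 1)[-1]
    # Drop a file extension such as ".html" if one is present.
    if "." in slug:
        slug = slug.rsplit(".", 1)[0]
    return len([word for word in slug.split("-") if word])


def slug_rating(url):
    """Label a URL slug using the rough word counts from the interview."""
    count = slug_word_count(url)
    if count <= 5:
        return "normal"
    if count < 10:
        return "getting long"
    return "abnormal"
```

Running slug_rating over an exported list of post URLs is a quick way to spot WordPress slugs that were generated from very long titles.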

Stephan Spencer: Yes.

Do you think we are moving towards algorithmic search results having substantially more human validation and/or intervention? There are projects such as Search Wikia that seem to be going down that path. What do you think? What does Google think about this?

Matt Cutts: It is a really interesting topic, because when Google started, we had just a few hundred people and the Web was so very large. We had to process tons and tons of pages and tons and tons of languages. We had to have the most capable, robust approach we could.

The only thing that would really work well at that time was algorithms, because computers do not get tired, they can work 24/7, they do not exhibit any bias by themselves. Of course, an algorithm could somehow have some bias baked in when the human wrote it, but the computer itself is perfectly logical when it executes that algorithm.

So, for the longest time, Google pursued that as its first and foremost strategy - to the point where some people think that Google is nothing but algorithms and there is no room for any humans at all. In fact, we tried to be relatively clear that, if someone reports an off-topic spam that is redirecting to porn - everybody wants that gone except for the porn spammer. So, we are ready to take manual action on that.

Going forward, I think it is really interesting to think about the role of humans in search. I have done a post on my blog about that. I think that, if you can use humans but in a scalable and robust way - that is really the key. If you had to have a person construct all the search results for one search, there are so many search results and the long tail is so long, there is no scalable way you could do it.

But, for example, let us suppose you could have some humans figure out a scalable way to find spam, or a scalable way to say whether individual sites are good or bad, then those are the sort of things where it could be on the order where humans could genuinely help you.

I am glad that Wikia exists and that they are going to try this approach that puts a little more emphasis on people, because I think we need to let 1,000 flowers bloom and let lots of different search engines with lots of different philosophies try those ideas. And I think Google is willing to be pragmatic and embrace any approach that might work.

Stephan Spencer: OK.

Initially, it was stated that "nofollowed" links would be followed and crawled, but PageRank would not be passed. But you have recently stated that "nofollow" links are not even used for discovery. First of all, let us confirm: is that the case that they are not even used for discovery, and, if so, why the change?

Matt Cutts: It is interesting. Whenever we talked about it originally, we said PageRank would not be passed, and the messaging that I tried to do was that it would not even be followed and it would not even be crawled. It turned out there was a really weird situation, where, if you had totally unique anchor text that nobody else had, we would not follow that link - but if we had found the page from some other source, we still had this anchor text lying around and we were willing to associate it with that page.

Personally, I think that is almost a bug, because if you ever sign a blog post with a comment and you have some really weird anchor text, then when you search for that text and you find the blog post, your natural conclusion is that these "nofollowed" links do contribute something - whether it is PageRank, anchor text or some sort of vote. Then you immediately get back to people trying to spam blogs and trying to spam all those places that have "nofollowed" links.

I almost view it as, for a short time it was almost like a bug - that some anchor text, in some very strange situations, could flow. We have fixed that.

There was an example, where someone had done "dallas auto repair warranties" and another query, where they thought that "nofollow" had actually passed either anchor text or PageRank. My suggestion would be that people should repeat those experiments, because I do not think that those experiments will hold true now.

In fact, if you look at the Wikipedia page for "Nofollow" (at http://en.wikipedia.org/wiki/Nofollow), it says - in "reference number eight", if I remember correctly - something about how these links may still be used in some limited circumstances for this or for that. At least for Google, we have taken a very clear stance that those links are not even used for discovery; they are not used for PageRank; they are not used for anchor text in any way. Anybody can go and do various experiments to verify that.
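For anyone setting up the kind of experiment suggested here, a first step is knowing which links on a page actually carry rel="nofollow". The sketch below uses Python's standard html.parser to split a page's links into two lists. The class and function names are hypothetical, and this only inspects markup; it says nothing about how any search engine treats the links.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values, separating rel="nofollow" links from the rest."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        if not href:
            return
        # rel can hold several space-separated tokens, e.g. "external nofollow".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed, nofollowed) href lists for an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollowed
```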

Stephan Spencer: Great.

How concerned are you over the tactic of aggressive link buying to competitors' sites in order to take down competitors? How long, do you think, it will be until competitors start taking each others' sites out in Google with aggressive link buying?

Matt Cutts: I do not think a smart competitor will even try that second one, because they would be more likely to help the other site than to hurt it. The thing is, we are very aware that site 'A' could buy links to site 'B', and then spam-report site 'B' and try to frame site 'B'. So we try very hard, in all of our spam techniques, to make it so that one site cannot sabotage another site.

If you will notice, we do not say that it is impossible. The reason we do not say that it is impossible is, if you remember sex.com a few years ago, somebody - if I remember correctly - sent a fax and claimed to be the site owner and grabbed the ownership of sex.com and kept it for a few years, until they were forced by a court to relinquish it.

There is always the 'far out', possible case where somebody could do identity theft and grab your domain and hurt your domain that way. So we do not say it is impossible for a competitor to hurt another competitor, but we do try very hard. In fact, you have noticed that, with link buying in particular, we have been concentrating in the last couple of months more on the link selling aspect of that.

The odds that someone can come to us and say: "Oh! Someone hacked my site and sold links on my site for four months, and I had no idea! And, oh, yeah, I did bill it in Google Checkout, but they hacked my Google Checkout account, too! And I am being framed! It is a conspiracy!" - the odds of someone plausibly being able to make that argument are a lot lower. We do try very hard to prevent someone from hurting somebody else, and we are very mindful of that.

Stephan Spencer: Great.

Earlier in the year, Yahoo introduced the <div class="robots-nocontent"> as a way to isolate parts of a page. I have not heard much since then, or if that is still viable, but, in any event, it was received with mixed emotions. Is this something, though, that you guys have given any thought to?

Matt Cutts: We definitely have. Personally, I think it is kind of interesting, because it gives more flexibility to site owners to sculpt how they want to flow PageRank or to change how the page should be indexed. I am always a fan of giving people more flexibility and more tools.

The downside of that - which immediately becomes apparent when I talk to other Googlers and whenever you think about it for a while - is that it is another feature that has to be supported. And I like to joke that the half-life of code at Google is about six months. You could write some code, come back six months later, and about half of it would be on some new infrastructure or be stale and so on.

We are constantly working on improving our infrastructure and our architecture. To have another feature to support, it has to be something that is compelling, that a lot of people use. So, what we did is we said: "OK. Let's wait four or six weeks, and see how many people on the web are really using this particular feature."

I made a deal with another Googler and said: "OK. If a lot of people use it, then maybe we will be more likely to support it." If I remember correctly, fewer than 500 domains had used this tag at all. And in the grand scheme of things, where there are literally hundreds of millions of domains and tens of millions of very active domains, 500 sites is not a very large number.

My guess is, we would be more likely to spend our resources on other stuff, at least right now. We are open to the idea, but we have not heard a lot of people really, really asking for it.

Stephan Spencer: OK.

Google recommends having "no more than 100 links per page, for good usability" - and it is good usability. Pages with a much larger number of links may be considered to be edging into doorway page status. So the guideline, for our listeners, is: "Create pages with good usability, intended for end users and not for search bots."

However, DHTML allows people to create really great, usable pages with far larger amounts of links on the page, and allows those links to be crawlable. Users could click the "category" link to expand menus for links to sub-pages, for instance. Could we assume that, if the page is nicely usable, it might be OK to do far more, perhaps, than the 100 links per page guideline? What is the new cut-off number, or a new guideline, in this age of DHTML?

Matt Cutts: I would recommend that people run experiments, because, if you have 5,000 links on a page, the odds that we would flow PageRank through those is kind of low. We might say at some point: that is just way too many thousands of links. And at some point, your fan-out is so high that the individual PageRank going out on each one of those links is pretty tiny.
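The fan-out point can be illustrated with the arithmetic from the originally published PageRank formula, in which a page divides the rank it passes on evenly among its outgoing links. This is a sketch under that assumption only; the damping factor of 0.85 comes from the published paper, the function name is made up for this example, and Google's production system is not described in this interview.

```python
def pagerank_share_per_link(page_rank, outlink_count, damping=0.85):
    """Rank passed along each outgoing link under the classic PageRank model.

    Assumes the textbook formula: each link receives an equal share of
    damping * page_rank. Real-world behavior may differ.
    """
    if outlink_count <= 0:
        raise ValueError("outlink_count must be positive")
    return damping * page_rank / outlink_count
```

Under this model, a page with 5,000 links passes one fiftieth as much rank per link as the same page with 100 links, which is why each individual share becomes "pretty tiny".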

I will give you a little bit of background - and I encourage people to run experiments and find what works for them. The reason for the 100 links per page guideline is that we used to crawl only about the first 101 kilobytes of a page. If somebody had a lot more than a hundred links, then it was a little more likely that, after we truncated the page at about 100 kilobytes, some of the links would not be followed or would not be counted. Nowadays, I forget exactly how much we crawl and index and save, but I think we are willing to save at least half a megabyte from each page.
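To see how a size cut-off would affect a particular page, one rough experiment is to count how many anchor tags begin before a given byte offset. The sketch below does that with a simple regular expression. The function name is hypothetical, the default limit of 101 KB only mirrors the historical figure mentioned above, and a regex match on "<a " is a crude approximation of real HTML parsing.

```python
import re

# Matches the start of an anchor tag; a simplification of real HTML parsing.
ANCHOR_START = re.compile(rb"<a\s", re.IGNORECASE)


def links_within_limit(html_bytes, limit_bytes=101 * 1024):
    """Return (kept, total): anchors starting before limit_bytes, and all anchors."""
    total = 0
    kept = 0
    for match in ANCHOR_START.finditer(html_bytes):
        total += 1
        if match.start() < limit_bytes:
            kept += 1
    return kept, total
```

Calling links_within_limit(page_bytes, 512 * 1024) would apply the "half a megabyte" figure instead; either number is only as reliable as the recollection quoted here.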

So, if you look at the guidelines, we have two sets of guidelines on one page. We have quality guidelines, which are essentially about spam and how to avoid spam; and we have technical guidelines. The technical guidelines are more like best practices. So, the 100 links is more like a 'best practice' suggestion, because if you keep it under 100, you are guaranteed never to get truncated.

So, certainly, I do think it is possible to have more links, especially with DHTML - that was once an issue. But people should always bear in mind to pull in a regular user off the street and have them take a look at it. If you have got so many links, and they are of such a spammy nature, that it looks spammy to that regular person, then you want to think about breaking it down. There are a lot of ways you can break it down: you can go by category; you can go chronologically; you can have different topics. If it feels like you have got too many, you can definitely break it into a lot of sub-categories.

Stephan Spencer: Right.

Next question, also regarding that 100 links per page recommendation: Is a higher number of links on, say, category pages or sub-category pages more permissible than on product or static content pages, because the latter would appear more spammy?

Matt Cutts: I think you would want to apply the common sense approach. So, let us talk about a newspaper, for example. A newspaper might have written thousands of articles, and so, to have all of that linked to from one page would be probably a bit much - even for users.

Suppose the newspaper decides to break it down chronologically. They will have: "all the articles we wrote in 2007". Then you click on that, and maybe that is still like 2,000 links, which is a little high.

So then they might break it down to: "all the stories we wrote in January, 2007; February, 2007; March, 2007". You go through and, suppose there are 120 or 200 links - that is more than our 100 link guideline that we give on the technical side, but the user who has gotten there really understands why. They would say: "Oh, well, I wanted this story from March 2000. I clicked on 2000 and I clicked on March. I knew the story was on March 14, and here is my story."

There definitely can be situations like that, where you have a larger number of links on the sub-pages, the sub-categories or the categories, but because that is the most logical way to break it down, it can make perfect sense for users and, therefore, perfect sense for search engines.

Stephan Spencer: Good suggestion.

Will the reasons for a site's PageRank reduction ever be disclosed to webmasters in Webmaster Central?

Matt Cutts: I am open to that. It is kind of funny because first and foremost, we have to care about fixing any problems we see, trying to make sure that we have the cleanest index that we can. So, malware is a good example of that. First and foremost, we did not want to return malware to users. So, we started out just by removing sites that had malware even if they were hacked.

We tried to take the hacked sites out for a short period of time, but we did not have the resources to contact all of those people and to work one-on-one to help them get the malware removed from their site. Then, over time, we got better about messaging. We would show that a site was removed for malware, and then we had a process that was a partnership with StopBadware where it would take up to 10 days.

At the time, people were like: "Ten days for my site to get back! That causes me a lot of stress and a lot of pain!" - but compared to the stress and pain of a user who got malware from your site, we have to balance those.

We have continued to iterate. We have gotten better and better. So now, you can more or less get your site malware re-reviewed in about 24 hours, and we have just recently started to show messages to webmasters in our message center in Webmaster Central to say: "Yes, your site has some malware." If I remember correctly, we even show a few example URLs to say: "Yes, here is where to look to find the malware."

That shows this gradual progression where, first and foremost, we have to take care of spam, the viruses, the malware or the trojan. Then, over time, we polish off those rough edges and we try to provide better messaging and better alerts to help the webmasters as well. I could certainly imagine that over time, we could tell a webmaster: "Yes, we uncovered links that looked like they were certainly sold, so that played a factor in Google losing a little more trust in your website." I am certainly open to doing that.

You also have to think about whether a site can be pulled toward white hat or not. Clearly, if somebody is a malicious spammer and they are just trying to do awful, awful things, you do not want to give them a heads-up that they have been caught. So, if we have seen someone that we think is deliberately abusive and really spammy and really savvy and they know what they are doing, then they might not expect to get a heads-up in our "Webmaster Console".

But if someone is a relatively new webmaster, a small Mom-and-Pop business, maybe we think they did not know any better, then it is a little more likely that we might try to give them some message to say: "This is an issue. It is a violation of our guidelines. It is a violation of every search engine's guidelines. Here is where you can read more about it. If you can correct this issue, then here is where to go and request a reconsideration."

Stephan Spencer: Right, because if somebody is a white hat and has a history of being a white hat, certainly they deserve to be given a heads-up, whereas you do not want to define that line for a black hat spammer.

Matt Cutts: Yes. You do not want to clue in the bad guys but you want all the people who are on the fence or who are right towards the white hat edge, you want to keep pulling them into that white hat direction.

Stephan Spencer: Yes.

My last question here: what RSS feeds do you subscribe to?

Matt Cutts: [laughs] A better question is what RSS feeds I do not subscribe to.

Stephan Spencer: Perhaps you can just supply us with your OPML file?

Matt Cutts: [laughs] You know, I have thought about that. At various times, I have done screenshots so that people could get a sense of the sort of things I read. It is funny because I have it broken down into general search, white hat, and black hat. I try to keep the black hat folder closed so people do not feel bad. You know: "Oh, no! I am in Matt's black hat folder!" Although there are a few people in there.

Stephan Spencer: Some would feel good. [laughter]

Matt Cutts: Yes, maybe they would be honored, who knows, but I do not want to give them the glory in that case. But, yes, certainly sites like Search Engine Land, Google Blogoscoped, Google Operating System, you know. Those are fantastic to just get first line news. Then, there are things like Search Engine Journal and all those sort of guys where you can get a lot more follow-on news or thoughtful commentary afterwards.

There are a lot of really good feeds that I read. I read about 70 or maybe even 100 in the search space. There are only five or 10 that I read that are not search, but XKCD is a Web comic that is really pretty funny and very Web savvy. I found a feed on Flickr for their "Photos of the Day", which is just a nice way to start your day.

There is a neat site called One Sentence and the idea is that you have to tell an entire story in one sentence. You know, it is very compelling stuff. So, it is about 10 sentences a day, about 10 posts, and that is really a fun site as well.

Stephan Spencer: Cool. All right. Well, thanks very much for your time, Matt.

Matt Cutts: Yes, good talking to you.

Posted by Stephan Spencer on 12/17/2007 | Permalink

Filed under: Search Engines google, matt cutts, podcasts, seo

Tips, props, and new revelations from Matt Cutts of Google

Google engineer Matt Cutts presented at WordCamp 2007 last weekend. (Session notes are available from Stephanie Booth, Lisa Barone, and P Havens.) Matt's session was recorded... when the video is posted, I'll let you all know. Matt will hopefully be posting his PowerPoint to his blog, if he gets approval from the PR department.

Matt Cutts gave me props twice in his presentation, and even called me out to the audience (even pronouncing my name right! Thanks Matt!):

  1. He recommended my WordPress plugin SEO Title Tag
  2. He recommended attendees read my blog post series about blog optimization

In this News.com blog post that I authored this week, I reported that, according to Matt, underscores in URLs are now or soon to be treated as word separators by Google. That's a departure from their previous stance and offers an indirect clue that Google does give weight to keywords in URLs (which we at Netconcepts already knew from empirical evidence). TypePad and Movable Type bloggers can rejoice at this news, since the majority of them run blogs with a URL structure using underscores (including Lisa Barone), and up to now that was detrimental to their Google rankings. I would like to see Six Apart stop truncating keywords in URLs (by restricting the length of the URL to something like 15 or 17 characters by default), because that is still detrimental even after this underscore "fix" from Google.

A few other highlights from Matt's talk:

  • Dynamic URLs are treated the same as static URLs, as long as you keep the number of parameters to a minimum.
  • Directory depth doesn't matter to Google.
  • File extension doesn't matter to Google, unless it's .exe.
  • Google's status as a domain registrar is inconsequential to them accessing other registrars' domain data.
  • Google won't admit blogs into Google News that have only one author.

UPDATE: The video of Matt Cutts' presentation at WordCamp is live now. Check it out.

Posted by Stephan Spencer on 07/26/2007 | Permalink

Comments (2)| Comments RSS | Filed under: Search Engines google, matt cutts, seo, wordcamp            

The practicalities of buying and selling links

If you are an SEO and you are not aware of Matt Cutts' strong opposition to buying links, you must have been living under a rock. However, my hunch is that most businesspeople (at least those who don't live and breathe SEO) are unaware of Google's tough stance -- and of the risks!

My SEO How-To article in the January/February issue of Practical Ecommerce was meant to give ecommerce business professionals a (hopefully) balanced view of the risks and the opportunities of link buying. Before you have a go at link buying or selling you might want to give it a read.

What's tricky, even for seasoned SEOs, is figuring out if a site that's selling links has been "made" (that is, spotted by Google) and had its voting power taken away -- particularly if you aren't already advertising with the site. You can glean valuable clues by sizing up the existing advertisers and past advertisers (perusing previous versions of the site in The Wayback Machine). You can't tell by the PageRank score, or from the link: SERPs -- that would be too easy, and Google doesn't want to be that easy.

Of course, just because Google is talking tough about link buying/selling, the tactic isn't going to go away any time soon. It is a tactic that works. At least for as long as you stay under the radar!

And if you aren't convinced how well it works on Google, have a look at my SEO Report Card of Freshpair.com in the current issue of Practical Ecommerce, where I critique some of the backlinks purchased by aggressive link buyer Freshpair.com. It is always fun to reverse engineer an aggressive link buying campaign and this one was no exception. Hopefully I won't get too much hate mail from Freshpair for airing this in public! :-)

Posted by Stephan Spencer on 01/31/2007 | Permalink

Comments (0)| Comments RSS | Filed under: Search Engines google, link buying, link selling, matt cutts, pagerank            

See who's cutting off link flow (e.g. PageRank) using nofollow

Matt Cutts from Google last week posted a handy tip on his new blog about how, in Firefox, to emphasize links that have the rel=nofollow attribute, which negates the vote that the web page is making by linking.

Sometimes people will say they have given you a reciprocal link, but in actuality they have stuck a nofollow attribute onto the link so that it doesn't actually count.

You can expose such sneakiness in Firefox using Matt's handy tip. It involves creating a user defined Cascading Style Sheet (CSS) that overrides a website's own CSS.

Of course the more prevalent use of the nofollow attribute is to discount links that you have not posted on your site yourself and therefore cannot vouch for, such as in your guest book, or in your blog comments, or on your discussion forum.

When setting this up, you may want to try out a variation of Matt's solution, submitted by a commenter, that doesn't modify the original styles but instead adds a blinking exclamation point next to the nofollow'ed link.
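
If you would rather check a page from a script than restyle your browser, here is a minimal sketch in Python that lists the links on a page carrying rel=nofollow. This is my own illustration, not Matt's tip or the commenter's variation: the class name NofollowFinder, the helper find_nofollow_links, and the sample HTML are all made up for the example, and it relies only on the standard library's html.parser module.

```python
from html.parser import HTMLParser


class NofollowFinder(HTMLParser):
    """Collects the href of every <a> tag whose rel attribute includes nofollow."""

    def __init__(self):
        super().__init__()
        self.nofollowed = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="external nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollowed.append(attr_map.get("href"))


def find_nofollow_links(html):
    """Return the hrefs of all nofollow'ed links found in an HTML string."""
    finder = NofollowFinder()
    finder.feed(html)
    return finder.nofollowed


# Hypothetical sample page: one ordinary link and one nofollow'ed link.
sample = (
    '<p><a href="http://example.com/a">counts</a> '
    '<a rel="external nofollow" href="http://example.com/b">does not count</a></p>'
)
print(find_nofollow_links(sample))
```

Feed it the HTML of the page that supposedly links back to you; if your URL shows up in the returned list, the "reciprocal" link isn't passing a vote.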

Posted by Stephan Spencer on 08/31/2005 | Permalink

Comments (1)| Comments RSS | Filed under: Search Engines, Web Design css, firefox, google, link flow, link gain, matt cutts, nofollow, nofollow attribute, pagerank            

Buying links - Google's perspective

Following on from yesterday's post on link buying and how it's a legitimate practice in many circumstances...

I found a blog comment posted just a few days ago by Google engineer Matt Cutts (yes, I've been blogging a lot about him lately... honestly, I'm not a groupie!). Matt chimed in on a lively debate happening on Tim O'Reilly's blog about the controversy surrounding the selling of link ads on the O'Reilly Network. Matt had this to say:

As others have noted, if you're going to sell text links that pass reputation/PageRank, the way to do it is to add rel=nofollow to those links.

Tim points out that these links have been sold for over two years. That's true. I've known about these O'Reilly links since at least 9/3/2003, and parts of perl.com, xml.com, etc. have not been trusted in terms of linkage for months and months. Remember that just because a site shows up for a "link:" command on Google does not mean that it passes PageRank, reputation, or anchortext.

Google's view on this is quite close to Phil Ringnalda's. Selling links muddies the quality of the web and makes it harder for many search engines (not just Google) to return relevant results. The rel=nofollow attribute is the correct answer: any site can sell links, but a search engine will be able to tell that the source site is not vouching for the destination page.

So here's Google coming out and admitting that they decreased the voting power of O'Reilly sites like perl.com and xml.com and downgraded the reputation value of some of their outbound links. And if you don't want your site to suffer the same fate, you'd better tag your link ads with rel=nofollow so they don't gain any PageRank. How do you like them eggs!

To me, that doesn't seem quite fair to website owners. They work hard to build a content-rich destination site with a good PageRank score. Google is diminishing their earning ability by insisting they cut off the flow of PageRank with a nofollow, thus decreasing the value of the link ads to the advertiser and ultimately the revenue likely to be realized from that advertiser. Granted, you don't buy links merely for PageRank, but of course it figures into the equation.

The problem lies in which link ads to vouch for. If I were the advertising manager for DailyItem.com, I certainly would not vouch for the advertiser of "Discount Vacations", as the link points to a "doorway page" operated by Orbitz that links to a whole pile of other doorway pages (tsk tsk! Google warns against using doorway pages); on the other hand, I would vouch for the "Dancewear" advertiser, since that's the company's name and the link points to the home page of their ecommerce site.

Google, please give the website owner the option of vouching for some of their advertisers without demoting their site. A black-or-white approach just isn't practical here. Signed, a devoted Google fan.

Posted by Stephan Spencer on 08/29/2005 | Permalink

Comments (4)| Comments RSS | Filed under: Search Engines google, link buying, matt cutts, nofollow, oreilly, oreilly network, pagerank, selling link ads, text links            

Googlebot, parameters and dynamic sites

I previously mentioned that Matt Cutts from Google gave some advice to webmasters of dynamic (database driven) web sites.

For one thing, Matt advised that if you have a dynamic web site, you should minimize the number of parameters in the URL. You’re very safe if you have fewer than 2 parameters. Keep the values of those parameters to fewer than 5 digits. And don’t name a parameter id. That's because Google is suspicious of that parameter being a session ID or something other than a key field. Even if it's the only parameter in your URLs, try not to use it. Particularly if that variable's value is long (like 5 digits or more). sid would be a bad choice too because it could stand for session ID as much as it could stand for a key field like story ID. It doesn't mean that your pages won't be indexed if you use this parameter name; it just means those pages would be at a greater risk of not being included. You should be fine though if your pages are all already in Google.
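
To make those rules of thumb concrete, here is a rough sketch in Python that checks a URL against them. The thresholds come straight from the advice above (fewer than 2 parameters, values shorter than 5 digits, no parameter named id or sid); the function name url_warnings, the constant names, and the example.com URLs are my own inventions for illustration, and this is in no way how Google itself evaluates URLs.

```python
from urllib.parse import urlsplit, parse_qsl

RISKY_NAMES = {"id", "sid"}   # parameter names that may look like session IDs
MAX_SAFE_PARAMS = 1           # "fewer than 2 parameters"
MAX_SAFE_VALUE_LEN = 4        # "fewer than 5 digits"


def url_warnings(url):
    """Return a list of human-readable warnings for one URL."""
    params = parse_qsl(urlsplit(url).query, keep_blank_values=True)
    warnings = []
    if len(params) > MAX_SAFE_PARAMS:
        warnings.append(f"{len(params)} parameters (fewer than 2 is safest)")
    for name, value in params:
        if name.lower() in RISKY_NAMES:
            warnings.append(f"parameter named '{name}' may look like a session ID")
        if len(value) > MAX_SAFE_VALUE_LEN:
            warnings.append(f"value of '{name}' is {len(value)} characters long")
    return warnings


# Hypothetical URLs, for illustration only.
for url in (
    "http://www.example.com/story.php?story=42",
    "http://www.example.com/story.php?id=123456",
):
    print(url, url_warnings(url))
```

A URL that comes back with an empty list sits inside the "very safe" zone described above; anything else is a candidate for cleanup, not a guarantee of trouble.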

Matt also mentioned something that should be a bit alarming to anyone with a dynamic site. Googlebot sometimes tries variations of URLs by dropping parameters. Meaning that Googlebot may experiment with removing name-value pairs from the query string portion of your URLs (i.e. the part of the URL that follows after the question mark) and seeing if the pages still load. I understand the reason for this to be that if these variant pages still show the same content as the page at the original URL, it gives Googlebot an indication that the omitted parameters are superfluous in the query string. So for example, a URL such as this:

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?xsrc=&PID=36575&S=NY&T=&MTG=PR

might be shortened by Googlebot to:

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?xsrc=&PID=36575&S=NY&T=

and

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?xsrc=&PID=36575

and

www.bigyellow.com/cgi-bin/php/cities/unitedstates/mtg_detail.php?S=NY&T=&MTG=PR

etc.

Then these URL variations would get spidered and compared with each other. I've heard of big websites getting hit by this, and of it causing them big problems. Don't get all worried about Googlebot doing this to your site if you're not a big and important site. Matt stated that Google only does this deep level analysis on big, quality sites. Anyone been subjected to this? And if so, what damage or inconvenience did it inflict on you?
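
If you want to see which variations your own site might be asked for, here is a small Python sketch that enumerates every URL you get by dropping one or more name-value pairs from the query string. To be clear, this is only my guess at the shape of the experiment; Matt didn't say which combinations Googlebot actually tries, and the function name dropped_parameter_variants is made up for the example. I've assumed an http:// prefix on the sample URL and left out the variant with no parameters at all.

```python
from itertools import combinations
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def dropped_parameter_variants(url):
    """Return every URL obtainable by removing some, but not all, query parameters."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    variants = []
    # Keep progressively fewer parameters: first drop one, then two, and so on.
    for keep_count in range(len(params) - 1, 0, -1):
        for kept in combinations(params, keep_count):
            query = urlencode(list(kept))
            variants.append(urlunsplit(parts._replace(query=query)))
    return variants


original = (
    "http://www.bigyellow.com/cgi-bin/php/cities/unitedstates/"
    "mtg_detail.php?xsrc=&PID=36575&S=NY&T=&MTG=PR"
)
for variant in dropped_parameter_variants(original):
    print(variant)
```

You could then request each variant from your own server and compare the responses; any variant that returns the same content as the original URL is one where the dropped parameters look superfluous.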

Posted by Stephan Spencer on 08/27/2005 | Permalink

Comments (0)| Comments RSS | Filed under: Search Engines complex urls, dynamic web site, google, googlebot, matt cutts, parameters, url variations, urls            

Coverage of SES San Jose: Search Engine Q&A On Links

I'm a bit behind on my conference session blogging. Waaay too many parties going on; doesn't leave much time for blogging. The Google Dance last night. Yahoo! party at Great America the night before. And tonight I've got another party to go to. Yesterday I spoke on RSS. I'll post a recap on that session later.

I just attended "Search Engine Q&A On Links", which was great. Lots of useful advice from Google and Yahoo! about linking (nobody seemed to want to ask poor Ask Jeeves any questions). It was funny how obviously diametrically opposed the engines were to the immediately prior session on "Buying and Selling Links". It's hard to reconcile the two different sets of advice. Matt in the hallway before this session was adamant: "Don't buy links!"

Anyways, without any further ado, here's the session recap:

Kaushal Kurapati from Ask Jeeves:
Be cautious of: reciprocal links and purchasing links
Avoid: link farms, cloaking pages, invisible or hidden links that trick the crawler
Become an authority on a subject
Focus on your business and content. Rest will follow. [I say: "yeah, right..."]
Teoma uses subject specific popularity: garner respect in your industry, subject-specific text based links can be understood. (hubs and authorities model)

Tim Mayer from Yahoo!:
Here's some important news!! Yahoo! has just launched a brand new service: Site Explorer from Yahoo! Search. Stop scraping the Yahoo site for backlink results and use Site Explorer instead. Access via an API is offered too. And you can export as a CSV file.
Yahoo has 19.2 billion web objects in its index. Over 20 billion objects, when you include the audio and video.
Plans to use community to improve search quality. Social search = within a trusted network, where someone within your network vouches for a site.
Create natural linking strategies. When things start to look unnatural is when you'll start getting into trouble. We look at intent (linking to plasma TVs, diamonds, and Viagra all on the same page) and extent (i.e. what looks normal. Having everything on the page as links or 200 links on the page is too much!)
Yahoo! offers a much more comprehensive sample of backlinks than Google, but not a complete set of backlinks. New system (Site Explorer) will be reasonably comprehensive, in his opinion the most comprehensive out there.
It's unnatural to link to sitemap-1 sitemap-2 sitemap-3 sitemap-4 sitemap-5. If you are doing this, you're headed in the wrong direction.

Matt Cutts from Google:
Good links are earned links, links that are based on editorial discretion.
Create services that are really useful, e.g. newsletters, an article a day, syndicate through RSS (attribute my article and give me a link). Start a blog.
Matt launched his blog today: mattcutts.com
Think outside the box.
Only SEOs and librarians do backlink searches. Historically we decided to dedicate a subset of our servers to backlinks. Only a sampling of backlinks would be displayed but only for a threshold of PageRank 4 or higher pages. A suggestion was made to show backlinks for lower PageRank pages too. We liked that idea so we now show a random sampling of backlinks, including low PageRank scoring pages too. We show twice as many backlinks as shown before, but still it's only a sampling of the backlinks.
In graph theory terms, a clique (where every node in the graph links to every other node) is very unnatural. So don't link to every single node in your network of sites; it'll get flagged.
For dynamic sites, you're very safe if you have fewer than 2 parameters; keep the values of those parameters to fewer than 5 digits, and don't name a parameter "id". Googlebot sometimes tries variations of URLs by dropping parameters, but we only do that deep level analysis on big, quality sites.
Another good approach that AlltheWeb came up with: the spider would always go 1 dynamic page deep from a static page.
Search engines only grab the first 100k or 200k or 500k of a page, so be careful loading up a huge page with a lot of links.
PageRank isn't as important as SOME people make it out to be. BUT it's NOT like "PageRank? Oh yeah let's shuffle that one under the rug! That was sooo 4 years ago!"
"BO" = backlink obsession
We export PageRank only once every 3 months or so.

Technorati tag: Search Engine Strategies