This past week Matt Cutts, who heads the Webspam team at Google, was interviewed by Eric Enge of Stone Temple Consulting. The long and detailed interview transcript can be found here and is a tremendous resource. For the average website owner, here are some of the key things to take away from it.
The number of pages crawled is roughly proportional to your PageRank.
Google crawls a website according to the PageRank of each page. PageRank is the score Google assigns a page, based largely on the links pointing to it, as a measure of how important the page is. The front page of your website usually has the highest PageRank on the site, so it is almost always indexed. When the Google bot that indexes and analyzes your site sees a link from your highly ranked page to another page, it follows that link. When it reaches that page, which has a lower PageRank, it loses some interest, and it keeps losing interest as it travels deeper into your site. When the PageRank drops too low, it stops.
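
To get a feel for why the front page ends up with the highest score and deeper pages with lower ones, here is a minimal sketch of the textbook PageRank calculation on a tiny, made-up site. The page names, the damping factor of 0.85, and the iteration count are all assumptions for illustration; this is the published simplified algorithm, not a description of what Google actually runs today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page name to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start with the score spread evenly over all pages.
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score everywhere.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page site: every page links back to "home".
site = {
    "home": ["products", "about"],
    "products": ["home", "widget"],
    "about": ["home"],
    "widget": ["home"],
}

ranks = pagerank(site)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page:10s} {score:.3f}")
```

In this toy site the home page collects links from every other page and comes out on top, while "widget", which is only reachable through "products", scores lowest.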

Complex navigation structures are hard to index and should be avoided.
Your site navigation needs to be simple and shallow. By shallow, I mean that every page on the site should sit fewer than three layers of navigation deep. If a user has to click through seven links to find what they are looking for, both the user experience and your SEO will suffer. The Google bot can only do as much as an average user. It doesn’t have super powers, so keep it simple.
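
One rough way to check how shallow your own navigation is: write down which pages link to which, then count the fewest clicks needed to reach each page from the front page. The sketch below does that with a breadth-first search. The page names and the two-click limit are assumptions made up for the example, not figures from the interview.

```python
from collections import deque


def click_depths(links, start):
    """Return the fewest clicks needed to reach each page from `start`."""
    depths = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                # First time we see a page is also the shortest route to it.
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


def pages_deeper_than(links, start, max_depth):
    """List the pages that need more than `max_depth` clicks to reach."""
    depths = click_depths(links, start)
    return sorted(page for page, depth in depths.items() if depth > max_depth)


# Hypothetical site map: page -> pages it links to.
site_map = {
    "home": ["products", "blog"],
    "products": ["widgets"],
    "widgets": ["blue-widgets"],
    "blue-widgets": ["blue-widget-specs"],
    "blog": ["archive"],
}

depths = click_depths(site_map, "home")
too_deep = pages_deeper_than(site_map, "home", max_depth=2)
print(too_deep)
```

Any page that shows up in the list is a candidate for a more direct link from the front page or from a top-level section.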
Duplicate content is still bad, but not as bad as we thought.
It is widely believed that duplicate content is not just a waste but a cancerous tumor eating away at your search engine ranking. This isn’t exactly true. Search engines expect some duplicate content, and when they see pages that are similar or outright copies, they merge them together. As long as the amount of copied content is not excessive, your site won’t be penalized. You will still be wasting the Google bot’s time and your PageRank, so avoid it, but some overlap is not the end of the world.
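
If you want a rough idea of how much two of your own pages overlap, one common approach is to compare their overlapping word sequences ("shingles") and compute the Jaccard similarity. This is only a sketch for auditing your own content, with made-up page text and an assumed shingle size of three words; the interview does not say how Google detects duplicates.

```python
def shingles(text, size=3):
    """Return the set of overlapping `size`-word sequences in `text`."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b, size=3):
    """Jaccard similarity of two texts: 0.0 is no overlap, 1.0 is identical."""
    a = shingles(text_a, size)
    b = shingles(text_b, size)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical page text: two near-identical product blurbs and one unrelated page.
page_a = "blue widgets ship free on all orders over fifty dollars"
page_b = "blue widgets ship free on all orders over forty dollars"
page_c = "read our guide to choosing a reliable web hosting company"

near_duplicate_score = similarity(page_a, page_b)
unrelated_score = similarity(page_a, page_c)
print(f"a vs b: {near_duplicate_score:.2f}")
print(f"a vs c: {unrelated_score:.2f}")
```

Pairs of pages that score close to 1.0 are the ones worth merging or rewriting; a little overlap between otherwise different pages is nothing to worry about.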
PDF files and other non-web-native formats are indexed.
Common file formats like PDF, Flash, and Word documents are indexed by the search engines. These items are harder to index and won’t be considered as valuable as a normal .html document, but it is good to know that the data in them does not go to waste.

