Website Construction Guidelines

Avoid frame-based construction
Frame-based construction used to be very popular and allowed for quicker coding. However, search engines have historically had difficulty reading the content in a framed page. More recently, Google and Inktomi-based search engines have been able to navigate and index framed sites, but it is still considered good practice to avoid this construction technique where possible.
Avoid URL variables
Session variables (id=): Google has admitted that its crawlers try to stay away from URLs with session variables.
Dynamic URLs (?=): Website URLs that pass variables are difficult for search engines to index. Google, AltaVista, and Inktomi-based search engines are able to index some of these URLs. Passing a single variable is typically okay; passing more than one becomes progressively more difficult for the search engines. Also note that search engines can index a dynamic URL if it is linked from a static URL, but they have difficulty following a dynamic URL link from a dynamic URL page. The sketch below gives a quick way to audit your own URLs.
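As a rough self-audit against these two rules, the following sketch parses a URL's query string, flags a session variable, and reports how many variables are passed. The crawl_risk helper and the sample URLs are illustrative only, not part of any standard tool.

```python
# Rough crawler-friendliness check for a URL's query string. The crawl_risk
# helper and the sample URLs are illustrative only.
from urllib.parse import urlparse, parse_qs

def crawl_risk(url):
    params = parse_qs(urlparse(url).query)
    if "id" in params:
        return "session variable (id=) - crawlers may avoid this URL"
    if len(params) > 1:
        return f"{len(params)} variables passed - progressively harder to index"
    if len(params) == 1:
        return "single variable - typically okay"
    return "no query variables - static URL"

print(crawl_risk("http://www.yourdomain.com/product.asp?cat=7&color=red&id=123"))
print(crawl_risk("http://www.yourdomain.com/product.asp?cat=7"))
print(crawl_risk("http://www.yourdomain.com/products/widgets.html"))
```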
Case-sensitive URLs
Domain names are not case sensitive, but URLs (path and file name) can be. In the server world, Unix servers are case sensitive in URLs by default, while Windows servers are not. So it is not surprising that Google's and Yahoo's indexes are case sensitive while MSN's is not. Therefore, be sure your URL naming convention is consistent. If a path or file name exists in multiple versions (upper-case and lower-case URLs), Google and Yahoo may treat them as separate URLs and therefore as duplicate content; MSN indexes upper- and lower-case versions as a single URL.
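One quick way to test how your server treats case is to fetch a URL and a lower-cased variant of it and compare the responses: if both serve the same page, the variant spellings are a duplicate-content risk in Google and Yahoo. The URL in this sketch is a placeholder for a path on your own site.

```python
# Compare a URL against its lower-cased variant to see whether the server
# treats them as the same resource. The URL is a placeholder.
import urllib.error
import urllib.request

def fetch(url):
    """Return (status, body) for a URL, treating HTTP errors as responses."""
    try:
        with urllib.request.urlopen(url) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, b""

original = "http://www.yourdomain.com/Products/Widget.html"
lowered = original.lower()

if fetch(original) == fetch(lowered):
    print("Both spellings serve the same page - link to one spelling consistently")
else:
    print("The server is case sensitive - the variant spellings are different URLs")
```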
Avoid graphic/Flash-intensive pages
Search engines do not read graphics or, with few exceptions, Flash. Websites constructed in 100% Flash typically have very poor rankings. Consider an HTML version of the same website so the search engines can crawl it.
Employ dual navigation to boost PageRank
Most websites are constructed with "fully meshed" or cross-linked navigation schemes, in which visitors can navigate from almost any page to almost any other page. This scheme produces a relatively flat PageRank value across the entire site. Hierarchical navigation schemes can produce a much stronger home page PageRank. By linking all pages to the home page, and not to other pages on the same level, home page PageRank can be made approximately 3x higher, although interior pages end up with lower PageRank values. Consider a "fully meshed" navigation scheme for human visitors, built with a scripting language the search engines cannot read, together with an HTML hierarchical navigation for the search engines. The sketch below illustrates the difference.
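To illustrate the claim above, here is a minimal sketch of PageRank computed by power iteration for a hypothetical five-page site, once with fully meshed navigation and once with hierarchical (hub-and-spoke) navigation. The five-page size and the 0.85 damping factor are illustrative assumptions, so the exact ratio will vary with site size, but the home page's share rises sharply under the hierarchical scheme.

```python
# Minimal PageRank sketch (power iteration). The five-page site and 0.85 damping
# factor are illustrative assumptions, not Google's actual parameters.

def pagerank(links, d=0.85, iters=100):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {p: (1 - d) / len(pages) for p in pages}
        for page, outs in links.items():
            for target in outs:
                new[target] += d * rank[page] / len(outs)
        rank = new
    return rank

pages = ["home", "a", "b", "c", "d"]
meshed = {p: [q for q in pages if q != p] for p in pages}     # every page links to every other page
hierarchical = {"home": ["a", "b", "c", "d"],                 # home links down to each interior page...
                "a": ["home"], "b": ["home"],
                "c": ["home"], "d": ["home"]}                 # ...interior pages link only back home

print("meshed home page PageRank:      ", round(pagerank(meshed)["home"], 3))
print("hierarchical home page PageRank:", round(pagerank(hierarchical)["home"], 3))
```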
Avoid reciprocal link pages with more than 100 outbound links
GoogleGuy has made a personal comment that it is good practice to limit the number of outbound links on a reciprocal link page to 100. It is often recommended that links pages not exceed 50 outbound links.
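To keep an existing links page within that range, a quick count of its anchor tags is enough. The sketch below is a rough check; the URL is a placeholder for your own links page, and the count includes every anchor tag, internal navigation links included.

```python
# Rough outbound-link count for a links page. The URL is a placeholder, and the
# count includes every <a href> tag on the page (internal navigation links too).
from html.parser import HTMLParser
import urllib.request

class LinkCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" for name, value in attrs):
            self.count += 1

with urllib.request.urlopen("http://www.yourdomain.com/links.html") as resp:
    page = resp.read().decode("utf-8", errors="replace")

counter = LinkCounter()
counter.feed(page)
print(counter.count, "links found (guideline: no more than 100, ideally 50 or fewer)")
```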
Content pages should have a minimum of 300 words in the body tag
All major search engines examine the "body" content of a website page in order to determine its general intent. For a search engine to reasonably determine the page intent, there must be at least 300 words to analyze (or at least 150 if short words of three characters or fewer are not counted).
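As a rough self-check against this guideline, the sketch below counts the words inside a page's body tag, both with and without short words. The regex-based tag stripping and the sample HTML string are simplifications for illustration.

```python
# Rough word count for the <body> of a page. The tag stripping is deliberately
# simple and the sample HTML string is a stand-in for your own page source.
import re

def body_word_count(html, ignore_short=False):
    match = re.search(r"<body[^>]*>(.*?)</body>", html, re.DOTALL | re.IGNORECASE)
    body = match.group(1) if match else html
    text = re.sub(r"<[^>]+>", " ", body)                  # strip remaining tags
    words = re.findall(r"[A-Za-z0-9']+", text)
    if ignore_short:
        words = [w for w in words if len(w) > 3]          # drop words of 3 characters or fewer
    return len(words)

html = "<html><body><h1>Life Insurance</h1><p>Term life insurance quotes and advice.</p></body></html>"
total = body_word_count(html)
long_words = body_word_count(html, ignore_short=True)
meets = total >= 300 or long_words >= 150
print(f"{total} words total, {long_words} longer than three characters -",
      "meets the guideline" if meets else "below the guideline")
```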
Interior links should have keywords included
The hypertext of internal links should include keywords. Instead of simply saying "home", incorporate a keyword (e.g., "life insurance - home").
Internal links to the home page should point to the domain
Home page links found throughout a website should link back to the root domain name, not to redirect pages. Example: do not create a home page link like http://www.yourdomain.com/index.asp or a relative link like <a href="index.html">. Rather, create an absolute link like http://www.yourdomain.com/.
Use If-Modified-Since (IMS) code
IMS lets your web server tell Googlebot whether a page has changed since the last time it was fetched. If the page hasn't changed, Googlebot can re-use the content from the last time it fetched that page, which lets the bot download fewer pages and save bandwidth. Check whether your server is configured to support If-Modified-Since. It is an easy win for static pages, and sometimes pages with parameters can benefit from IMS as well. The sketch below shows one way to check.
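One way to check IMS support is to fetch a page, then request it again with the Last-Modified date the server returned and look for a 304 Not Modified response. The sketch below does this with Python's standard library; the URL is a placeholder for a static page on your own site.

```python
# Check whether a server honors If-Modified-Since: re-request a page with the
# Last-Modified date it returned and look for a 304 response. The URL is a placeholder.
import urllib.error
import urllib.request

url = "http://www.yourdomain.com/index.html"
first = urllib.request.urlopen(url)
last_modified = first.headers.get("Last-Modified")

if last_modified is None:
    print("No Last-Modified header - the server is unlikely to support IMS for this page")
else:
    request = urllib.request.Request(url, headers={"If-Modified-Since": last_modified})
    try:
        urllib.request.urlopen(request)
        print("Full content returned again - If-Modified-Since is not being honored")
    except urllib.error.HTTPError as err:
        if err.code == 304:
            print("304 Not Modified - If-Modified-Since is supported")
        else:
            raise
```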
HTML navigation
Make sure that every page can be reached with a text browser like Lynx. That is the best way to make sure a spider can follow links to all website pages. Google still recommends the use of a site map to facilitate crawling.
Avoid hidden text / links
Google has stated several times that it now has an automatic filter that detects text in the same or near-same color as the background, as well as very small images. When Google detects this condition, the website is automatically penalized for a period of time.
Links from the same class C IP address
There is very strong evidence that Google employs filters that detect mirror sites with the same or near-same content, and external links coming from the same class C IP address. It is believed that only one link from a given class C IP address will count toward rankings. The sketch below shows a quick way to check whether two linking domains share a class C block.
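A class C block corresponds to the first three octets of an IPv4 address (a /24 network), so checking whether two linking domains share one only requires a DNS lookup. The domain names in the sketch below are placeholders.

```python
# Check whether two domains resolve into the same class C (/24) block. The
# domain names are placeholders for sites that link to you.
import socket

def class_c(domain):
    """Return the first three octets of the domain's IPv4 address."""
    ip = socket.gethostbyname(domain)
    return ".".join(ip.split(".")[:3])

a, b = "www.example.com", "www.example.org"
if class_c(a) == class_c(b):
    print(a, "and", b, "share a class C block - their links may count as one")
else:
    print(a, "and", b, "are on different class C blocks")
```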
Avoid duplicate content
It is commonly believed that Google is able to detect duplicate content at the site level. Google's guidelines state: "Don't create multiple pages, subdomains, or domains with substantially duplicate content." Pages with near-duplicate content will likely be penalized. Yahoo has an active editorial staff with methods of detecting duplicate or near-duplicate content; we believe Yahoo can detect a condition where the majority of a paragraph is the same, contributing to a negative editorial review and a ranking demotion.

In Google's patent 6,658,423, titled "Detecting duplicate and near-duplicate files" (filed January 24, 2001; awarded December 2, 2003), the inventors describe a process that Google may be employing. The patent covers a method of extracting parts (e.g., words, terms, numbers) from a web page and fingerprinting those parts: "... if two documents have any one fingerprint in common, they are considered to be 'near-duplicates'." Web pages that share fingerprints are said to belong to a cluster: "... in response to a search query, if two candidate result documents belong to the same cluster and if the two candidate result documents match the query equally well (e.g., have the same title and/or snippet) if both appear in the same group of results (e.g., first page), only the one deemed more likely to be relevant (e.g., by virtue of a high PageRank, being more recent, etc.) is returned." The patent also speaks of a pre-filter, which may imply that this process is performed prior to web page ranking.

Although there is no conclusive proof, there is mounting evidence to support the hypothesis that Google has implemented such a detection method, and observations are consistent with the process outlined in the patent: new websites are taking longer to rank (90-120 days), rankings appear to contain less near-duplicate content, and several webmasters report domain-level demotion for sites presumed to contain near-duplicate content.

The question remains how much text needs to be identical to be considered near-duplicate. The answer may depend on the amount of content on a page. The patent mentions a practical minimum of 50 words; pages with fewer words are not processed, and pages with more could be split into as many as 8 parts. However, the patent gives no guidance on whether the parts are sequential, random, or encompass the entire page. A simplified sketch of the fingerprinting idea follows below.

To detect duplicate content, try a tool like www.copyscape.com. It shows a list of web pages that have duplicate content; when you click on one, it highlights the content that is the same, and it also lets you see the Whois information for the "copycat" website.
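The patent leaves the details open, but the general idea can be sketched as shingling: break the text into overlapping word sequences, hash each one, and treat pages that share a fingerprint as near-duplicates. The 8-word shingle length and the hash function below are assumptions made for illustration, not the patent's actual parameters.

```python
# Simplified near-duplicate detection by shingling. The 8-word shingle length
# and MD5 hashing are illustrative assumptions, not the patent's actual method.
import hashlib

def fingerprints(text, shingle_len=8):
    """Hash every run of shingle_len consecutive words in the text."""
    words = text.lower().split()
    shingles = (" ".join(words[i:i + shingle_len])
                for i in range(max(1, len(words) - shingle_len + 1)))
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in shingles}

page_a = "Term life insurance protects your family with a fixed premium for a set number of years."
page_b = "Our term life insurance protects your family with a fixed premium for a set number of years too."

shared = fingerprints(page_a) & fingerprints(page_b)
print("near-duplicates" if shared else "distinct", "-", len(shared), "shared fingerprints")
```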
Avoid scripting redirects, 302 redirects, and meta redirects
Most search engines are able to detect this code and often consider its use to be spam. If you own several domain names, make sure they use either a 301 redirect or an alias; this is configured at the hosting and registrar level. Refrain from employing 302 redirects, and never use a meta-refresh or JavaScript redirect. The sketch below checks which status code a domain returns.
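To verify what kind of redirect a secondary domain issues, request it without following redirects and inspect the status code: 301 is the permanent redirect recommended here, while 302 (or 307) should be avoided. The domain in the sketch below is a placeholder.

```python
# Report the redirect status a domain returns, without following the redirect.
# The domain name is a placeholder for a secondary domain you own.
import http.client
from urllib.parse import urlparse

def redirect_status(url):
    parts = urlparse(url)
    conn = http.client.HTTPConnection(parts.netloc)
    conn.request("HEAD", parts.path or "/")
    resp = conn.getresponse()
    return resp.status, resp.getheader("Location")

status, location = redirect_status("http://yourdomain.net/")
print(status, "->", location)
if status == 301:
    print("301 permanent redirect - OK")
elif status in (302, 307):
    print("Temporary redirect - reconfigure as a 301")
else:
    print("No redirect returned")
```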
Avoid Non-Relevant Linking
Google may be implementing a technology called TSPR (Topic-Sensitive PageRank). The concept was championed by Taher H. Haveliwala while at Stanford University (his paper is at http://www-cs-students.stanford.edu/~taherh/papers/topic-sensitive-pagerank-tkde.pdf). Taher was later hired by Google, where it is believed his work was expanded.

The theory of topic sensitivity starts with the assumption that there are authority websites for a specific subject or keyword phrase. These authority websites link out to other websites, and those websites link to still more websites. Every time there is an outbound link, a component of that link has its origins in the authority websites. The topic sensitivity passed through a link is defined by where the upstream websites got their links.

It may be easier to think of topic sensitivity as a "bloodline". Some dogs are pure breeds, but most dogs are mutts, a combination of bloodlines. Mutts may have blood from a few or many different dog types, just as websites have links from different website categories. If a dog has a lot of golden retriever in its bloodline, then presumably that dog would be better at hunting than most other dogs. Likewise, if the links to a website are primarily health related, then that website will rank better for health-related keyword phrases.
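Haveliwala's paper describes this as PageRank computed with a topic-biased teleport vector: random jumps land only on pages in a topic seed set, so rank flows outward from topic authorities. The sketch below is a toy version of that computation; the four-page link graph, the single "health-authority" seed page, and the 0.85 damping factor are assumptions made purely for illustration.

```python
# Toy Topic-Sensitive PageRank: power iteration with a teleport vector biased
# toward a topic seed set. The link graph, seed set, and damping factor are
# illustrative assumptions, not values from the paper or from Google.

def topic_pagerank(links, topic_pages, d=0.85, iters=100):
    pages = list(links)
    teleport = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    rank = dict(teleport)
    for _ in range(iters):
        new = {p: (1 - d) * teleport[p] for p in pages}
        for page, outs in links.items():
            for target in outs:
                new[target] += d * rank[page] / len(outs)
        rank = new
    return rank

links = {"health-authority": ["clinic", "finance-blog"],
         "clinic": ["health-authority"],
         "finance-blog": ["bank"],
         "bank": ["finance-blog"]}

ranks = topic_pagerank(links, topic_pages={"health-authority"})
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page:16s} {score:.3f}")
```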