Some of the older search engines cannot read content from frames. They crawl only the frameset page instead of all the web pages it loads, so framed pages are ignored by the spider. A "NOFRAMES" block (content that frames-capable browsers ignore) can be inserted into the HTML of the frameset page, and spiders are able to read the information inside it. Without it, search engines see only the frameset. Moreover, if there are no links to the other web pages inside the NOFRAMES block, the search engines cannot crawl past the frameset, and all the content-rich pages controlled by the frameset go ignored.
Hence, it is always advisable to build web pages without frames, as frames can easily make your website invisible to search engines.
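As a sketch of the NOFRAMES remedy (the page names are hypothetical), a frameset page can carry a NOFRAMES block with plain links, so that spiders which ignore frames can still reach the inner pages:

```html
<html>
<head><title>Example Frameset</title></head>
<frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="content.html">
  <noframes>
    <body>
      <!-- Read by spiders and frames-incapable browsers; file names are hypothetical -->
      <p>This site uses frames. You can visit the
         <a href="menu.html">menu</a> and the
         <a href="content.html">main content</a> directly.</p>
    </body>
  </noframes>
</frameset>
</html>
```

The links inside the NOFRAMES block are what let a spider crawl past the frameset to the content pages.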
Making frames visible to Search Engines
We discussed earlier the prominence of frames-based websites. Many amateur web designers do not understand the drastic effect frames can have on search engine visibility. Such confusion is compounded by the fact that some search engines, such as AltaVista, are actually frames-capable: the AltaVista spider can crawl through frames and index all the pages of a website. However, this is true of only a few search engines.
The best solution, as stated above, is to avoid frames altogether. If you still decide to use frames, another remedy is JavaScript. JavaScript can be added anywhere in a page and is visible to search engines, and it can carry spiders through to your other pages even when they do not recognize frames. With a little trial and error, you can make your framed site accessible to both types of search engine.
We discussed the ROBOTS tag in brief earlier. Let us understand this tag in a little more detail. Sometimes we rank well on one engine for a particular keyphrase and assume that all search
engines will like our pages, and that we will therefore rank well for that keyphrase on a number of engines. Unfortunately, this is rarely the case. All the major search engines differ somewhat, so what gets you ranked high on one engine may actually lower your ranking on another.
It is for this reason that some people like to optimize pages for each particular search engine. Usually these pages are only slightly different, but that slight difference can make all the difference when it comes to ranking high. However, because a search engine spider crawls through a site indexing every page it can find, it might come across your engine-specific optimized pages. Because they are very similar, the spider may think you are spamming it and will do one of two things: ban your site altogether, or severely punish you in the form of lower rankings.
The solution in this case is to stop specific search engine spiders from indexing some of your web pages. This is done using a robots.txt file, which resides on your webspace. A robots.txt file is a vital part of any webmaster's defence against getting banned or penalized by the search engines when he or she designs different pages for different engines.
The robots.txt file is just a simple text file, as the file extension suggests. It is created using a simple text editor like Notepad or WordPad; complicated word processors such as Microsoft
Word may add formatting that breaks the file. You insert certain directives in this text file to make it work. This is how it is done:
User-Agent: (Spider Name)
Disallow: (File Name)
The User-Agent is the name of the search engine's spider, and Disallow names the file that you don't want that spider to index.
You have to start a new batch of directives for each engine, but if you want to list multiple Disallow files, you can list them one under another. For example -
User-Agent: Slurp (Inktomi's spider)
Disallow: xyz-gg.html
Disallow: xyz-al.html
Disallow: xxyyzz-gg.html
Disallow: xxyyzz-al.html
The above code disallows Inktomi's spider from two pages optimized for Google (gg) and two pages optimized for AltaVista (al). If Inktomi were allowed to spider these pages as well as the pages made specifically for Inktomi, you might run the risk of being banned or penalized. Hence, it is always a good idea to use a robots.txt file.
The robots.txt file resides on your webspace, but where on your webspace? The root directory! If you upload the file to a sub-directory, it will not work. If you want to disallow all engines from indexing a file, you simply use the * character where the engine's name would usually be. Beware, however, that the * character won't work on the Disallow line. Here are the names of a few of the big engines:
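One way to check a robots.txt file before uploading it is to simulate how a compliant spider would read it. As a sketch, Python's standard urllib.robotparser module can do this; the rules below are a hypothetical file modelled on the Slurp example above:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Inktomi's spider (Slurp) from two
# engine-specific pages, and block every spider from one private file.
rules = """\
User-agent: Slurp
Disallow: /xyz-gg.html
Disallow: /xyz-al.html

User-agent: *
Disallow: /private.html
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Slurp obeys its own block; Googlebot falls through to the * block.
print(rp.can_fetch("Slurp", "/xyz-gg.html"))       # False
print(rp.can_fetch("Googlebot", "/xyz-gg.html"))   # True
print(rp.can_fetch("Googlebot", "/private.html"))  # False
```

Running a check like this catches the "simple mistake" problem described below before any spider ever sees the file.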
Excite - ArchitextSpider
AltaVista - Scooter
Lycos - Lycos_Spider_(T-Rex)
Google - Googlebot
Alltheweb - FAST-WebCrawler
Be sure to check over the file before uploading it: a simple mistake could mean your pages are indexed by engines you don't want indexing them, or, even worse, that none of your pages gets indexed at all. Another advantage of the robots.txt file is that, by checking your server logs for requests for it, you can see which spiders or agents have accessed your web pages. This gives you the host names and agent names of the spiders, and even very small search engines get recorded. Thus, you know which search engines are likely to list your website.
Most Search Engines scan and index all of the text in a web page. However, some Search Engines ignore certain text known as Stop Words, which is explained below. Apart from this, almost all Search Engines ignore spam. Stop words are common words that are ignored by search engines at the time of searching a key phrase. This is done in order to save space on their server, and also to accelerate the search process.
When a search is conducted, the engine excludes the stop words from the search query, replacing each of them with a marker, a symbol substituted for the stop word. The intention is to save space. This way, the search engines are able to store more web pages in the space saved while retaining the relevancy of the search query.
Besides, omitting a few words also speeds up the search process. For instance, if a query consists of three words, the search engine would generally make three runs, one for each word, and display the listings. However, if one of the words can be omitted without making a difference to the search results, it is excluded from the query, and the search process consequently becomes faster.
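A minimal sketch of this filtering step (the stop-word list here is a small hypothetical sample, not any engine's actual list):

```python
# Hypothetical sample of stop words; real engines maintain larger lists.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "is"}

def filter_query(query):
    """Drop stop words so the engine makes one run per remaining word."""
    return [word for word in query.lower().split() if word not in STOP_WORDS]

print(filter_query("the history of the Internet"))  # ['history', 'internet']
```

A five-word query reduces to two index lookups, which is exactly the speed-up described above.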
Search engines are unable to view graphics or distinguish text that might be contained within them. For this reason, most engines will read the content of the image ALT tags to determine the purpose of a graphic. By taking the time to craft relevant, yet keyword rich ALT tags for the images on your web site, you increase the keyword density of your site. Although many search engines read and index the text contained within ALT tags, it's important NOT to go overboard in using these tags as part of your SEO campaign. Most engines will not give this text any more weight than the text within the body of your site.
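As a hedged illustration (the file name and wording are hypothetical), a relevant, descriptive ALT tag might look like this:

```html
<!-- Descriptive, keyword-relevant ALT text explaining what the image shows -->
<img src="handmade-oak-table.jpg"
     alt="Handmade oak dining table with six matching chairs"
     width="300" height="200">
```

The text describes the graphic for spiders and for visitors with images turned off, without stuffing in unrelated keywords.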
Invisible text is content on a web site that is coded in a manner that makes it invisible to human visitors, but readable by search engine spiders. This is done in order to artificially inflate the keyword density of a web site without affecting the visual appearance of it. Hidden text is a recognized spam tactic and nearly all of the major search engines recognize and penalize sites that use this tactic.
Tiny text is the technique of placing text on a page in a very small font size. Pages that are predominantly heavy in tiny text may be dismissed as spam, or the tiny text may simply not be indexed. As a general guideline, avoid pages where the font size is predominantly smaller than normal, and make sure you are not spamming the engine by repeating keyword after keyword in a very small font. If your tiny text is a copyright notice at the very bottom of the page, or your contact information, that's fine.
Almost all search engines serve many different countries. They do list content from other countries, but most of what is listed is US- or UK-dominated content.
With this in mind, most popular Search Engines have started deploying regional editions that serve only a specific country. For instance, Google has an Indian edition 'google.co.in' that caters to the Indian audience. Given below are some of the types of Search Engine Regional Editions.
Regional Interface is nothing but a translated version of the main Search Engine. Many Search Engines have interfaces in different languages such as French, German, Spanish, Japanese etc. However, the only difference between these regional interfaces and the main version of the Search Engine is that the language used on the interface is not English. In other words, if you search using a keyword on both the interfaces, the listings are exactly the same. Regional Interfaces are aimed at an audience that does not understand English.
Human Categorization, as the name suggests, is categorization of websites by human beings. Search Engine employees categorize different websites into regional listings. Websites that are more relevant to a specific country are listed in that edition of the Search Engine. Hence, for a French edition a search would mainly list documents from France. This eliminates the problem mentioned above. The only caveat being that the whole process is manual. Directories such as Yahoo, LookSmart, and Open Directory make use of this process.
Domain Filtering automatically segregates websites from different countries into their respective regional editions. This segregation is done on the basis of domain names. For instance a website from Australia would generally have a domain .au. The Domain filtering mechanism looks at the domains of all websites and creates a country specific edition listing. Some Search
Engines also have region specific editions which contain listings from the whole of that region. As an example: A French edition of Google may also return German or Spanish websites in some cases. Domain Filtering has a drawback though.
This mechanism can only filter websites by domain name, and hence a .com domain is always assumed to belong to a United States website. This is obviously not true, as many websites
from other countries also use .com domains. Domain crawling is probably the best solution for maintaining both a main site and a regional version. With domain crawling, the regional listing is far more comprehensive than under the other mechanisms explained above, and some pages, although regional, may be listed in the main listing as well.
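A rough sketch of domain filtering and its .com blind spot (the country table is a small hypothetical sample, not any engine's actual mapping):

```python
# Hypothetical sample mapping of country-code TLDs to regional editions.
COUNTRY_BY_TLD = {"au": "Australia", "fr": "France", "de": "Germany",
                  "in": "India", "uk": "United Kingdom"}

def regional_edition(hostname):
    """Guess a site's regional edition from its top-level domain."""
    tld = hostname.rsplit(".", 1)[-1].lower()
    # A .com domain reveals nothing about the country -- the drawback noted above.
    return COUNTRY_BY_TLD.get(tld, "unknown (generic TLD)")

print(regional_edition("example.com.au"))  # Australia
print(regional_edition("example.com"))     # unknown (generic TLD)
```

The second call shows why pure domain filtering misclassifies the many non-US sites that use .com.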
A couple of years ago, spamming may have worked wonders for your website. However, with sophisticated algorithms now deployed by all popular search engines, spamming can only backfire. Today's algorithms can easily detect spam and will not only ignore your website but may also ban it.
Besides, instead of spending considerable time and effort on spamming you can always follow other proven strategies and have a higher rank with most search engines. Spamming can also easily irritate readers. Think about it - if your homepage has unnecessary repetitions of a particular keyword, it is bound to frustrate a reader. Consequently your site, instead of being content rich, would be junk rich. This can have nothing but a negative impact on your business.
Search engine cloaking is a technique used by webmasters to enable them to get an advantage over other websites. It works on the idea that one page is delivered to the various search engine spiders and robots, while the real page is delivered to real people. In other words, browsers such as Netscape and MSIE are served one page, and spiders visiting the same address are served a different page.
The page the spider sees is a bare-bones HTML page optimized for the search engines. It won't look pretty, but it will be configured exactly the way the engines want it in order to rank high. These 'ghost pages' are never actually seen by any real person, except of course the webmasters who created them. When real people visit a site using cloaking, the cloaking technology (usually based on Perl/CGI) sends them the real page, which looks good and is just a regular HTML page.
The cloaking technology can tell the difference between a human and a spider because it knows the spiders' IP addresses, and no two IP addresses are the same. When a visitor arrives at a site using cloaking, the script compares the visitor's IP address against its list of search engine IPs. If there is a match, the script knows that a search engine is visiting and sends out the bare-bones HTML page set up for nothing but high rankings.
There are two types of cloaking. The first is called User Agent Cloaking and the second is called IP Based Cloaking. IP based cloaking is the best method as IP addresses are very hard to fake, so your competition won't be able to pretend to be any of the search engines in order to steal your code.
User Agent Cloaking is similar to IP cloaking, except that the cloaking script compares the User-Agent string sent when a page is requested against its list of search engine names, and then serves the appropriate page.
The problem with User Agent cloaking is that agent names can easily be faked. Search engines can beat cloakers with a simple anti-spam trick: fake their own agent name and pretend to be a normal person using Internet Explorer or Netscape. The cloaking software will then serve the spider the non-optimized page, and your search engine rankings will suffer.
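A bare sketch of User Agent cloaking, illustrating both the mechanism and the weakness just described (the spider-name list is a small illustrative sample, and the file names are hypothetical; a real script would rely on IP lists instead):

```python
# Substrings identifying known spiders (illustrative, not exhaustive).
SPIDER_NAMES = ("googlebot", "scooter", "slurp", "architextspider")

def page_to_serve(user_agent):
    """Serve the optimized page to recognized spiders, the real page otherwise."""
    ua = user_agent.lower()
    if any(name in ua for name in SPIDER_NAMES):
        return "optimized.html"
    return "index.html"

print(page_to_serve("Googlebot/2.1"))                       # optimized.html
# A spider faking a browser's User-Agent slips straight through:
print(page_to_serve("Mozilla/4.0 (compatible; MSIE 6.0)"))  # index.html
```

The second call is exactly the failure mode described above: an engine announcing itself as a browser receives the non-optimized page.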
To sum up, search engine cloaking is not as effective as it used to be, because the search engines are increasingly aware of the different cloaking techniques being used by webmasters and are gradually introducing more sophisticated technology to combat them. It may also be considered unethical by search engines if not used properly.
By: Ken Mathie
Credit: www.goarticles.com