Developing an internet application that automatically fetches web page content seems trivial enough. As humans, we do it all the time: type in a URL, then read and record the results. But for a computer to perform the same task is quite involved.
Position Research has been involved in building and maintaining many customized web spiders.
- Search Engine News ranking results
- Search Engine Web ranking results
- Backlink, anchor text, title, and IP spidering
- Web page PageRank spidering
- Duplicate and near-duplicate content spidering
In total, Position Research dedicates 12 web servers to performing in excess of 50,000 robotic queries daily.
The process of developing and managing a web spider involves four basic components:
- Search module calibration
- Error condition trapping
- Web page analysis
- Results reporting
Search Module Calibration
If search engines never changed their page code, calibration would not be necessary. However, code changes are a common occurrence. For this reason, live search engine results must be compared to robotically retrieved results to assure ranking accuracy.
Position Research performs calibration tests every business day to assure accurate web ranking results.
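A daily calibration check of this kind can be sketched as a comparison between a hand-verified live ranking and the ranking the robot parsed. This is a minimal illustration, not Position Research's actual tooling; the function name and list-based ranking format are assumptions.

```python
def calibration_drift(live_ranking, robot_ranking):
    """Compare a hand-checked live ranking against the robot's parsed ranking.

    Returns the list of positions (1-based) where the two disagree; a
    non-empty result signals that the search engine's page code may have
    changed and the parser needs recalibration.
    """
    drift = []
    for position, (live, robot) in enumerate(
            zip(live_ranking, robot_ranking), start=1):
        if live != robot:
            drift.append(position)
    return drift
```

An empty drift list means the robot's parser still agrees with what a human sees on the live results page.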
Error Condition Trapping
In the case of a news ranking application, matching a headline title to news results may be required several times within a day, and a specific query may only last a few days. Human calibration is not practical in these circumstances. Rather, a robust error-trapping system is required.
There are two kinds of error trapping:
- When something goes wrong
- When something does not go right
The first kind is used to trap process errors. If a page does not resolve properly because a server is busy, the error trap sends the robot into a loop: it pauses, then tries again.
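The pause-and-retry loop described above can be sketched as a generic wrapper. This is an illustrative sketch, not Position Research's implementation; the function name and parameters are assumptions.

```python
import time

def retry_on_error(operation, max_attempts=5, pause_seconds=10):
    """Run operation(); if it raises (e.g. a busy server), pause and retry.

    Gives up and re-raises the error after max_attempts failures.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            time.sleep(pause_seconds)  # pause, then loop and try again
```

In practice the `operation` would be the page fetch itself, and the pause keeps the robot from hammering an already busy server.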
The second kind is a little trickier and is used to trap logic errors. In many cases, a process works correctly, but the information on a web page may be incorrect. For instance, MSN may deliver a Search Engine Results Page (SERP) that reports the proper number of web results, but the page body is empty. Without logic detection, a robot would report NO results. The page resolved correctly, but the information was missing.
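A logic-error trap of this kind can be sketched as a consistency check between what the SERP claims and what the robot actually parsed. The names `LogicError` and `check_serp` are hypothetical, chosen only for this illustration.

```python
class LogicError(Exception):
    """Raised when a page resolved correctly but its content is inconsistent."""

def check_serp(result_count_claimed, parsed_results):
    """Return parsed results only when they agree with the page's own claim.

    The process error trap would never fire here -- the page loaded fine --
    but a SERP that claims hits while its result list is empty is a logic
    error, so the robot should flag it instead of reporting no results.
    """
    if result_count_claimed > 0 and not parsed_results:
        raise LogicError(
            "SERP claims %d results but the page body is empty"
            % result_count_claimed)
    return parsed_results
```

Raising an explicit error lets the spider re-queue the query rather than record a misleading "no results" entry.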
Although search engines regularly crawl the internet without much regard for website owners' bandwidth, they don't like it when the "shoe is on the other foot". All the major search engines limit traffic. When traffic exceeds their limits, search engines temporarily ban the originating IP.
Position Research virtualizes its robot servers, which allows queries to be sent to search engines across multiple IPs. In this way, Position Research can moderate the query rate from each address and avoid IP banning.
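Rotating queries across multiple source IPs while pacing each one can be sketched as a simple round-robin scheduler. This is a minimal illustration under assumed names (`IpRotator`, `min_interval_seconds`), not a description of Position Research's actual system.

```python
import itertools
import time

class IpRotator:
    """Round-robin a pool of source addresses, spacing queries per IP so
    that no single address exceeds a search engine's traffic limit."""

    def __init__(self, addresses, min_interval_seconds):
        self._cycle = itertools.cycle(addresses)
        self._min_interval = min_interval_seconds
        self._last_used = {}  # address -> timestamp of its last query

    def next_address(self):
        """Return the next source address, sleeping first if it was used
        too recently to stay under the per-IP rate limit."""
        address = next(self._cycle)
        now = time.monotonic()
        wait = self._last_used.get(address, float("-inf")) + self._min_interval - now
        if wait > 0:
            time.sleep(wait)  # pace this address under the limit
        self._last_used[address] = time.monotonic()
        return address
```

Spreading queries across the pool multiplies the sustainable overall query rate while keeping each individual IP below the ban threshold.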