Building a web crawler
In October’s issue I showed how to develop an HTML Container class. This month, we will use that class to develop a general-purpose Web Crawler class. The HTML Container project, including a VB.NET version, can be downloaded from the VSJ web site.
Before getting started, you will need to add a reference to the HTML Container class (WebWagon.dll) to your project. From the menu, choose Project|Add Reference. Click the Projects tab and then the Browse button. Navigate to the location of WebWagon.dll and click OK.
A Web Crawler (sometimes referred to as a spider or robot) is a process that visits a number of web pages programmatically, usually to extract some sort of information. For example, the popular search engine Google has a robot called Googlebot that sooner or later visits most publicly accessible pages on the Web in order to index the words on each page. We are going to develop a general-purpose class that can be used as a basis for writing any type of robot. This class wi...