I would like a web crawler written in C#.
Requirements:
- given a specific list of websites, crawl the entire content of each site
- need some strategy that lets the crawling scale to a decent level on a single machine; I would expect the crawler to use multithreading or asynchronous I/O to reach processing speeds of at least 5 pages/second
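To make the throughput requirement concrete, here is a minimal sketch of one possible approach: asynchronous fetches throttled by a `SemaphoreSlim`. It is only an illustration, not a prescribed design. The names `BoundedFetcher`, `FetchAllAsync`, and `maxConcurrency` are hypothetical, and the fetch function is passed in as a parameter so the throttling logic can be tried without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs many fetches concurrently, but never more than
// maxConcurrency at once.
public static class BoundedFetcher
{
    public static async Task<IReadOnlyList<string>> FetchAllAsync(
        IEnumerable<Uri> urls,
        Func<Uri, Task<string>> fetch,   // assumed to return the page body
        int maxConcurrency)
    {
        using var gate = new SemaphoreSlim(maxConcurrency);

        // Start one task per URL; each waits for a free slot before fetching.
        var tasks = urls.Select(async url =>
        {
            await gate.WaitAsync();
            try
            {
                return await fetch(url);
            }
            finally
            {
                gate.Release();   // free the slot even if the fetch throws
            }
        }).ToList();

        // Results come back in the same order as the input URLs.
        return await Task.WhenAll(tasks);
    }
}
```

With a real client this might be called as `await BoundedFetcher.FetchAllAsync(urls, u => http.GetStringAsync(u), 8)`, where `http` is a shared `HttpClient`. The limit of 8 is an arbitrary example value, not a measured figure.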
Some other questions:
- do you have a strategy for handling dynamic pages? For example, crawling a site in which most of the content is hidden behind a form?
- what is your approach for making sure the crawler doesn't revisit the same pages? Do you check the URL, the page content itself, or both?
- what is your approach for handling site revisits?
- it would be nice to have the app decompress pages that are returned compressed (e.g. gzip).
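To show what I mean by the duplicate-detection and compression questions, here is a rough sketch under stated assumptions. It is not a required design. `VisitTracker`, `TryMarkUrl`, `TryMarkContent`, and `CreateClient` are hypothetical names. The URL check drops the fragment, the content check hashes the body with SHA-256, and the client relies on the standard `HttpClientHandler.AutomaticDecompression` setting for gzip and deflate responses.

```csharp
using System;
using System.Collections.Concurrent;
using System.Net;
using System.Net.Http;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper that remembers which URLs and page bodies were seen.
public sealed class VisitTracker
{
    private readonly ConcurrentDictionary<string, byte> _urls =
        new ConcurrentDictionary<string, byte>();
    private readonly ConcurrentDictionary<string, byte> _hashes =
        new ConcurrentDictionary<string, byte>();

    // Returns true the first time a URL is seen. Expects an absolute URI.
    // The fragment (#...) is dropped; Uri already lowercases scheme and host.
    public bool TryMarkUrl(Uri url)
    {
        string key = url.GetLeftPart(UriPartial.Query);
        return _urls.TryAdd(key, 0);
    }

    // Returns true the first time this exact body text is seen, so the same
    // page reached through two different URLs can be detected.
    public bool TryMarkContent(string body)
    {
        using var sha = SHA256.Create();
        byte[] digest = sha.ComputeHash(Encoding.UTF8.GetBytes(body));
        return _hashes.TryAdd(Convert.ToBase64String(digest), 0);
    }

    // Builds a client that transparently decompresses gzip/deflate responses.
    public static HttpClient CreateClient()
    {
        var handler = new HttpClientHandler
        {
            AutomaticDecompression =
                DecompressionMethods.GZip | DecompressionMethods.Deflate
        };
        return new HttpClient(handler);
    }
}
```

An exact hash only catches byte-identical pages; pages that differ by a timestamp or an ad would need a fuzzier comparison, which is one of the things I would like bidders to comment on.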
I'm not looking for any add-on indexing or search functionality. Just a high-quality crawler, with clean modular code that I can integrate into my other applications. Your output will be the working source code and a reasonably small amount of availability for questions (obviously, the better commented and cleaner the code, the less need for questions).
I don't have a huge amount of time, so it's likely that the winning bidder will be someone who has already built similar crawlers and has therefore already thought about the key issues, etc.
Please bid the real price you'd like to charge (ignore the range I've selected below; it's clearly too low).