Advice on Programming Strategy for Scraper

SpoonTech

Hi,

I am looking to write an internet scraper, and have considered the following languages:
Python
C++
Java

The scraper will need to:
- Retrieve HTML code from a page
- Select a link, name and description from a section of the page
- Ask for user confirmation (non-GUI for now, maybe a GUI later) to process the link
- If the user confirms, pass the link to a Linux program that can be run by calling
Code:
linkprocessor -l http://link.goes.here/
- If possible, I would like to capture the text that would normally be shown in the terminal and display it in the terminal session running this application, and also check whether the link was processed successfully.
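
For illustration, the confirm-then-run steps in the list above could be sketched in Python roughly as follows. This is only a sketch under assumptions: the helper names (`confirm`, `run_and_echo`, `process_link`) are made up for the example, `linkprocessor` is assumed to be on the PATH, and it is assumed (not confirmed anywhere in this thread) that it signals success with exit status 0.

```python
import subprocess
import sys


def confirm(prompt, ask=input):
    """Return True only if the user answers y or yes (case-insensitive)."""
    return ask(f"{prompt} [y/N] ").strip().lower() in ("y", "yes")


def run_and_echo(command, out=sys.stdout):
    """Run command, echo each output line as it arrives, and
    return (exit_code, captured_text)."""
    captured = []
    with subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the same stream
        text=True,
    ) as proc:
        for line in proc.stdout:
            out.write(line)        # show it in our own terminal session
            captured.append(line)  # keep a copy for later inspection
    # Leaving the with-block waits for the process, so returncode is set.
    return proc.returncode, "".join(captured)


def process_link(link, program="linkprocessor"):
    """Ask for confirmation, then hand the link to the external program."""
    if not confirm(f"Process {link}?"):
        return False
    exit_code, _ = run_and_echo([program, "-l", link])
    return exit_code == 0  # assumption: exit status 0 means success
```

Passing the command as a list (rather than one shell string) means the link is never interpreted by a shell, which matters when the URL comes from a scraped page.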

I have looked at the libraries JSoup (Java), BeautifulSoup (Python) and curl (C++), and have seen no complications there. I am quite new to programming anything that interacts with the terminal.

The app will need to run on Linux and Mac (it can be compiled separately for each, I don't mind that).

Out of interest, what language and library would you guys recommend for this? Please back up your selection with a reason or two.

Thanks
 
Language and libraries aside, you'll run into the problem, as with many scrapers, that your IP address will eventually be blocked (if not within the first few minutes of scraping).

Personally I would write this in PHP/cURL and use PHPQuery to handle the DOM and get all the links from the page. It would query a MySQL database to see what it should crawl next (a list of domains you specify to harvest the links from). Then in a different table I'd log the links found on that particular domain.
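
To make the two-table idea concrete, here is a rough sketch. Note the substitutions: it uses Python with the built-in `sqlite3` module as a stand-in for the PHP/MySQL setup described above, and the table names, column names and helper functions are all invented for the example.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create the two tables: domains to crawl, and links found on them."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS domains (
            id INTEGER PRIMARY KEY,
            url TEXT UNIQUE NOT NULL,
            last_crawled TEXT          -- NULL until the first crawl
        );
        CREATE TABLE IF NOT EXISTS links (
            id INTEGER PRIMARY KEY,
            domain_id INTEGER NOT NULL REFERENCES domains(id),
            url TEXT NOT NULL,
            UNIQUE (domain_id, url)    -- log each link once per domain
        );
    """)
    return db


def next_domain(db):
    """Return (id, url) of the never-crawled or least recently crawled
    domain, or None if the domains table is empty."""
    return db.execute(
        "SELECT id, url FROM domains "
        "ORDER BY last_crawled IS NOT NULL, last_crawled, id LIMIT 1"
    ).fetchone()


def log_links(db, domain_id, urls):
    """Store the links found on a domain and mark the domain as crawled."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT OR IGNORE INTO links (domain_id, url) VALUES (?, ?)",
            [(domain_id, u) for u in urls],
        )
        db.execute(
            "UPDATE domains SET last_crawled = datetime('now') WHERE id = ?",
            (domain_id,),
        )
```

Each scheduled run would call `next_domain`, scrape that one domain, and pass whatever it found to `log_links`.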

That should satisfy your "scrape first level only" and you can write a GUI to handle the rest and use a cron job to call the PHP scraper every few minutes (cron itself can't schedule at intervals shorter than a minute).

I'd pop this baby on a shared host like Hostgator, and run ifconfig via PHP to determine all of the IPs on the server itself (not just the IP allocated to me on the shared account). I'd take that list of IPs and give cURL the option to route through a specific IP at random.

With roughly 30 IP addresses on my Hostgator account I've scraped over 1 million links from Amazon like this without being blocked.
 
I've written a couple of these in Python with Mechanize to handle parsing and browsing. Alternatively, you can use requests or urllib (built-in) with BeautifulSoup or HTMLParser (built-in, event-based :sick:).
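
As a rough idea of what the built-in route looks like, here is a minimal sketch using only `urllib` and `HTMLParser` from the standard library. The class and function names are invented for the example, and it only collects every link and its text; picking out a specific section of a page, or a description, depends on that page's markup and is not attempted here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect (absolute_url, link_text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None  # set while we are inside an <a href=...>
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url):
    """Parse an HTML string and return the collected links."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def fetch_html(url, timeout=10):
    """Download a page and decode it, falling back to UTF-8."""
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Usage would be along the lines of `extract_links(fetch_html(url), url)`. The event-based style is why the class has to track state by hand; BeautifulSoup hides that bookkeeping.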

Since your biggest time sink is network I/O, the processing speed gained from C++ or Java is not worth giving up the ages of dev time saved by using an "easier" language like Python.

As AcidRaZor pointed out, rate limiting could be a problem, although it is rather scarcely implemented in my experience (big companies like Amazon aside :D). If speed isn't too much of an issue, a simple (and random) delay between requests might be the simplest solution. Besides the rate limiting, JavaScript is a notorious problem: most libs simply can't handle it. This is not much of a problem if the links and content that you're after aren't added to the DOM by JavaScript.
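
The random delay could look something like the sketch below. The function name and the 2 to 6 second range are arbitrary choices for the example, not recommended values; `fetch` is whatever function you use to download a page.

```python
import random
import time


def fetch_politely(urls, fetch, min_delay=2.0, max_delay=6.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between
    consecutive requests. Returns a dict mapping url -> fetch result."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results
```

The `sleep` parameter is only there so the pause can be swapped out when testing.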
 
Personally I've always found this more trouble than it's worth. :)

Depends on the end goal and your experience in writing these, I guess.

From his description I very much doubt it's only going to be links he's going to scrape. Either that, or it's some website with hundreds of download links, or even just a silly little project to keep his mind occupied (since he mentions the output would be in the command prompt and not logged anywhere).

Reckon walter and I covered most of what he will encounter when doing this.
 
Will the scraper be for personal use only? If not, I'm willing to be a tester for the app if you need one.
 