Friday, July 18, 2008

how to scrape dynamic web pages

If you need to collect dynamic HTML data from websites without programming, your options are pretty limited. Excel has a web query tool, and there are a few third-party tools. Otherwise, you always had to write a bunch of regular expression searches and match rules. It's a major pain and not very flexible.
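For comparison, here is roughly what the hand-written regular expression approach looks like. This is only a minimal sketch in Python: the HTML snippet, the class names, and the `extract_items` helper are all made up for illustration, not taken from any real site.

```python
import re

# Hypothetical HTML standing in for a fetched page; the markup is invented.
SAMPLE_HTML = """
<div class="item"><span class="name">Widget</span><span class="price">$9.99</span></div>
<div class="item"><span class="name">Gadget</span><span class="price">$19.50</span></div>
"""

# One "match rule": a name span immediately followed by a price span.
ITEM_RULE = re.compile(
    r'<span class="name">(?P<name>[^<]+)</span>\s*'
    r'<span class="price">\$(?P<price>[\d.]+)</span>'
)


def extract_items(html):
    """Return a list of (name, price) tuples found by ITEM_RULE."""
    return [
        (match.group("name"), float(match.group("price")))
        for match in ITEM_RULE.finditer(html)
    ]


if __name__ == "__main__":
    for name, price in extract_items(SAMPLE_HTML):
        print(name, price)
```

The rule is tied to the exact markup, so if the site renames a class or reorders the spans, it silently matches nothing. That is the inflexibility I mean.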

I found this amazing new startup web service. It's truly awesome. The web service lets a non-programmer enter some sample URLs. The site fetches the HTML pages and analyzes their common data and layout. Then the user gets an easy-to-use web tool to select which data elements to extract. It's as simple as clicking on the fields that you want. No programming, no regular expressions, just point and click.

Effectively, you can convert any dynamic website into an XML data source. Next, you can take that XML data source and transform it into RSS feeds, Google Maps, and lots of other formats. Mashups made SUPER simple.
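To give a feel for the XML-to-RSS step, here is a small sketch using Python's standard `xml.etree.ElementTree`. The input layout (`records`, `record`, `title`, `url`, `summary`) and the `xml_to_rss` function are assumptions for illustration; the service's real XML output may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical extracted data; the element names are assumed, not the service's real schema.
SAMPLE_XML = """<records>
  <record><title>First listing</title><url>http://example.com/1</url><summary>Two bedrooms</summary></record>
  <record><title>Second listing</title><url>http://example.com/2</url><summary>One bedroom</summary></record>
</records>"""


def xml_to_rss(xml_text, feed_title, feed_link):
    """Build a minimal RSS 2.0 document with one <item> per <record>."""
    source = ET.fromstring(xml_text)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "Feed generated from extracted data"

    for record in source.findall("record"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = record.findtext("title")
        ET.SubElement(item, "link").text = record.findtext("url")
        ET.SubElement(item, "description").text = record.findtext("summary")

    # encoding="unicode" returns a str instead of bytes.
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    print(xml_to_rss(SAMPLE_XML, "Listings", "http://example.com/"))
```

Once the data is in a predictable XML shape, each output format is just another small mapping like this one.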

On the other hand, it's all black magic. I have no idea how it works. It takes a few tries to find good sample pages and selection blocks. I wanted to extract 10 fields. For a while, it worked on one page but not another. After a few hours, I randomly found a set that seems to work well, and I consistently got all the data fields. I don't really know why. The explanations are quite sparse and the documentation is terrible.

A startup with real technical promise.