Friday, October 29, 2004

HTML Parsing/Screen Scraping in .NET

In an e-mail conversation with Pascal Naber the topic of HTML "screen scraping" came up. I dabbled with this a few months ago when writing a command-line utility to alleviate the pain of manually publishing content to our intranet. So far I have considered the following three options for HTML screen scraping in .NET. I have looked at the first two in detail and will test-drive the third on the next occasion:
  • ASP.NET web page parsing: The ASP.NET web service infrastructure supports WSDL extensions that parse the result of HTTP-GET requests using regular expressions. The WSDL has to be written and/or edited by hand and can then be compiled into a web service proxy with the wsdl.exe tool. I honestly cannot recommend this approach. Regular expressions on their own are already a nightmare to decipher. Having to XML-escape them to get a valid XML attribute is a solid recipe for no sleep at all. Apart from the occasional MSDN page, documentation is absent. The few articles that exist on this feature seem to rehash the oversimplified example on MSDN.

  • SgmlReader: Provides an XmlReader interface over arbitrary SGML documents and has native support for HTML DTDs. The most elegant solution I have seen so far:
    1. It allows for processing of HTML streams - it is not necessary to load complete documents into memory.
    2. It is easy to layer an XPathNavigator on top to extract content.
    Disadvantages are that the original HTML is not preserved and that there seem to be few people who support (or are able to support) the code. When I encountered problems with complex embedded JavaScript blocks, I decided to look for alternatives.

  • .NET HTML Agility Pack: Reads HTML into a custom object model and is able to convert HTML to XML. It seems to be the most pragmatic approach because of its ease of use and its ability to deal with invalid HTML (tag soup).
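
To make the idea behind the last two options a bit more concrete, here is a minimal sketch of "read tag soup, get an XML-like tree, query it with a path expression". It is written in Python with only the standard library so that it runs as-is. It is not the API of SgmlReader or the HTML Agility Pack, and the names SoupToXml, html_to_tree, SAMPLE and the "document" wrapper element are made up for illustration. It also assumes a very simple recovery strategy (stray end tags are ignored, unclosed elements stay open until the end), so on nastier input the nesting may differ from what a browser or either library would produce.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta"}


class SoupToXml(HTMLParser):
    """Hypothetical helper: builds an ElementTree from lenient parser events."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("document")  # illustrative wrapper element
        self.stack = [self.root]            # currently open elements

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased; valueless attributes become "".
        element = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID_TAGS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        current = self.stack[-1]
        if len(current):
            # Text that follows a child element belongs in that child's tail.
            last = current[-1]
            last.tail = (last.tail or "") + data
        else:
            current.text = (current.text or "") + data


def html_to_tree(html: str) -> ET.Element:
    """Parse possibly invalid HTML and return the root of an XML-like tree."""
    parser = SoupToXml()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.root


# Mixed-case tags, a stray </div>, a void <br>, and elements left open at the end.
SAMPLE = """<HTML><BODY>
<H1>Links</h1>
<p>See <a href="/docs">the docs</A> and <a href="/faq">the FAQ</a>.<br></p>
</div>
<ul><li><a href="/home">Home</a>
"""

tree = html_to_tree(SAMPLE)

# ElementTree supports a small XPath subset, enough to pick out the links.
for link in tree.findall(".//a[@href]"):
    print(link.get("href"), "->", link.text)

# The tree serializes as well-formed XML, which is the "HTML to XML" step.
xml_text = ET.tostring(tree, encoding="unicode")
```

The design point is the same one that makes the reader-based and object-model approaches attractive: once the markup has been coerced into a tree, extraction becomes a path query instead of a hand-tuned regular expression.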
