Saturday, October 30, 2004

Some .NET Regular Expressions for HTML parsing

Despite the observations in my previous post, my intranet publishing solution is based on a combination of the ASP.NET approach to obtain web content and regular expressions coded in C#. Here are some regular expressions that worked well for me to get specific meta and input tags regardless of letter case and attribute quoting style (single, double or no quotes):
  • Content attribute of meta tag with name content-type:
    (?insx)
    <meta\s([^>]*?\s*)?
      content\s*=\s*
      ( '(?<Result>[^']*)'
      | "(?<Result>[^"]*)"
      | (?<Result>[^\s>]*)
      )
      [^>]*>
    (?<=\sname\s*=\s*['"]?content-type['"]?[^>]*>)

  • Value of input element with name input-name:
    (?insx)
    <input\s([^>]*?\s*)?
      value\s*=\s*
      ( '(?<Result>[^']*)'
      | "(?<Result>[^"]*)"
      | (?<Result>[^\s>]*)
      )
      [^>]*>
    (?<=\sname\s*=\s*['"]?input-name['"]?[^>]*>)
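
To show how such a pattern might be used from C#, here is a minimal sketch that wraps the first expression in a helper. The class name MetaTagScraper and the method GetContentType are hypothetical names chosen for illustration; only the pattern itself comes from this post. In a verbatim string each double quote in the pattern has to be doubled.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class; the names are illustrative only.
public static class MetaTagScraper
{
    // The meta-tag pattern from the post, unchanged apart from the doubled
    // quotes required by a C# verbatim string. The inline (?insx) options
    // stand for IgnoreCase, ExplicitCapture, Singleline and
    // IgnorePatternWhitespace, so the layout whitespace is not matched.
    private static readonly Regex ContentTypeMeta = new Regex(@"(?insx)
        <meta\s([^>]*?\s*)?
          content\s*=\s*
          ( '(?<Result>[^']*)'
          | ""(?<Result>[^""]*)""
          | (?<Result>[^\s>]*)
          )
          [^>]*>
        (?<=\sname\s*=\s*['""]?content-type['""]?[^>]*>)");

    // Returns the content attribute of the first matching meta tag,
    // or null when the HTML contains no such tag.
    public static string GetContentType(string html)
    {
        Match match = ContentTypeMeta.Match(html);
        return match.Success ? match.Groups["Result"].Value : null;
    }
}
```

Calling MetaTagScraper.GetContentType("<META NAME=Content-Type CONTENT=text/html>") should return text/html. The three alternatives all capture into the same Result group, so the caller does not need to know which quoting style the page used.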

Friday, October 29, 2004

HTML Parsing/Screen Scraping in .NET

In an e-mail conversation with Pascal Naber the topic of HTML "screen scraping" came up. I dabbled a bit with this a few months ago when writing a command line utility to alleviate the pain of manually publishing content to our intranet. So far I have considered the following solutions for HTML screen scraping in .NET. I have looked at the first two in detail, and I will test-drive the third at the next opportunity:
  • ASP.NET web page parsing: The ASP.NET web service infrastructure supports WSDL extensions that parse the result of HTTP-GET requests using regular expressions. The WSDL has to be written and/or edited by hand and can be compiled into a web service proxy using the wsdl.exe tool. I honestly cannot recommend this approach. Regular expressions on their own are already a nightmare to decipher. Having to XML-escape them to get a valid XML attribute is a solid recipe for no sleep at all. Apart from the occasional MSDN page there is no documentation. The few articles that exist on this feature seem to rehash the oversimplified example on MSDN.

  • SgmlReader: Provides an XmlReader interface over arbitrary SGML documents and has native support for HTML DTDs. The most elegant solution I have seen so far:
    1. It allows for processing of HTML streams - it is not necessary to load complete documents into memory.
    2. It is easy to layer an XPathNavigator on top to extract content.
    The disadvantages are that the original HTML is not preserved and that there seem to be few people who support (or are able to support) the code. When I encountered problems with complex embedded JavaScript blocks I decided to look for alternatives.

  • .NET HTML Agility Pack: Reads HTML into a custom object model and is able to convert HTML to XML. It seems to be the most pragmatic approach because of its ease of use and its ability to deal with invalid HTML (tag soup).
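
The point about layering an XPathNavigator on top of an XmlReader can be sketched as follows. This is only an illustration under stated assumptions: it uses a plain XmlReader over well-formed markup as a stand-in, since SgmlReader is a separate download, and the class name XPathScraper and its methods are hypothetical. With SgmlReader, the reader passed to Select would be the SgmlReader instance instead.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.XPath;

// Hypothetical helper class; the names are illustrative only.
public static class XPathScraper
{
    // Evaluates an XPath expression over any XmlReader and returns the
    // string value of each selected node.
    public static List<string> Select(XmlReader reader, string xpath)
    {
        // XPathDocument reads the whole input from the reader up front.
        XPathNavigator navigator = new XPathDocument(reader).CreateNavigator();
        List<string> values = new List<string>();
        foreach (XPathNavigator node in navigator.Select(xpath))
        {
            values.Add(node.Value);
        }
        return values;
    }

    // Stand-in for the SgmlReader case: parses markup that is already
    // well-formed XML with the built-in XmlReader.
    public static List<string> SelectFromMarkup(string wellFormedMarkup, string xpath)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(wellFormedMarkup)))
        {
            return Select(reader, xpath);
        }
    }
}
```

Note that XPathDocument buffers the complete document, so this sketch trades the streaming benefit mentioned in point 1 for the convenience of XPath queries.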

Thursday, October 21, 2004

Visual Studio and C# Multi-line Build Event Editing

I sometimes use build events to invoke tools such as xsd.exe and wsdl.exe for code generation purposes. Typically this requires a build event script with multiple lines. An annoying 'feature' of Visual Studio 2005 is that if you press the Enter key to create a new line in a build script, the OK button is activated and the dialog is closed. Ctrl+Enter seems to do the trick, however.

Wednesday, October 20, 2004

Debugging With... VS2005 and TestDriven .NET

I'm rewriting some messaging code from scratch using a strict test-first approach to test-drive some XP practices and a number of (fairly) new technologies:
My first impressions are pretty positive. Everything works together nicely so far. One of my favourite features is the "Test with... Debugging" option to debug code from an arbitrary test within VS.NET. That is, once I got it to work.

I initially created a project in Visual C# Express and subsequently upgraded to VS.NET 2005 Beta 1 with Tech Refresh. When I then tried to debug the project I got the following error:
One or more projects in the solution do not contain user code and cannot be debugged with "Just my code" setting enabled. Make sure that all projects in your solution are configured to be built in Debug mode.

To suppress this message from appearing in the future, disable 'Warn if no user code on launch' in the debugger options page. To prevent the debugger running in 'Just My Code' mode, turn off 'Enable Just My Code' setting in the debugger options page.
Turning off both 'Just My Code' flags in the debugger options page did get rid of the warning, but my breakpoints were never hit. The important clue in the warning turned out to be the sentence "Make sure that all projects in your solution are configured to be built in Debug mode". The advanced build settings (project properties|Build Tab|Output group|Advanced button) specify the debug info to be generated during builds. Somehow this setting was set to None, and changing it to Full solved all my debugging problems.