Saturday, October 30, 2004

Some .NET Regular Expressions for HTML parsing

Despite the observations in my previous post, my intranet publishing solution combines the ASP.NET approach to obtaining web content with regular expressions coded in C# to parse it. Here are some regular expressions that worked well for me for extracting attributes of specific meta and input tags regardless of letter case and attribute quoting style (single quotes, double quotes, or no quotes):
  • Content attribute of meta tag with name content-type:
    (?insx)
    <meta\s([^>]*?\s*)?
      content\s*=\s*
      ( '(?<Result>[^']*)'
      | "(?<Result>[^"]*)"
      | (?<Result>[^\s>]*)
      )
      [^>]*>
    (?<=\sname\s*=\s*['"]?content-type['"]?[^>]*>)

  • Value of input element with name input-name:
    (?insx)
    <input\s([^>]*?\s*)?
      value\s*=\s*
      ( '(?<Result>[^']*)'
      | "(?<Result>[^"]*)"
      | (?<Result>[^\s>]*)
      )
      [^>]*>
    (?<=\sname\s*=\s*['"]?input-name['"]?[^>]*>)
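
A minimal sketch of how patterns like these could be applied from C# follows. The class name TagAttributeDemo and the helpers BuildPattern and Extract are hypothetical names introduced for illustration, and the sample HTML strings are made up. The sketch assumes the tag and attribute arguments are plain identifiers, and it treats content-type and input-name as placeholders for whatever name you are looking for. The inline options (?insx) turn on case-insensitive matching (i), explicit capture so that only named groups capture (n), single-line mode (s), and ignoring of whitespace inside the pattern (x). Because .NET allows the same group name in several alternatives, whichever quoting style matches ends up in the Result group.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TagAttributeDemo
{
    // Hypothetical helper: fills the pattern from the list above with a tag
    // name, the attribute to capture, and the required value of the name
    // attribute. tag and attribute are assumed to be plain identifiers.
    public static string BuildPattern(string tag, string attribute, string name)
    {
        return @"(?insx)
            <" + tag + @"\s([^>]*?\s*)?
              " + attribute + @"\s*=\s*
              ( '(?<Result>[^']*)'
              | ""(?<Result>[^""]*)""
              | (?<Result>[^\s>]*)
              )
              [^>]*>
            (?<=\sname\s*=\s*['""]?" + Regex.Escape(name) + @"['""]?[^>]*>)";
    }

    // Returns the captured attribute value of the first matching tag,
    // or null when no tag with the requested name attribute is found.
    public static string Extract(string html, string tag, string attribute, string name)
    {
        Match m = Regex.Match(html, BuildPattern(tag, attribute, name));
        return m.Success ? m.Groups["Result"].Value : null;
    }

    public static void Main()
    {
        // Single-quoted value, upper-case tag and attribute names.
        string meta = "<META content='text/html; charset=utf-8' NAME=\"Content-Type\">";
        Console.WriteLine(Extract(meta, "meta", "content", "content-type"));
        // prints: text/html; charset=utf-8

        // Unquoted attribute values.
        string input = "<input type=hidden name=input-name value=42>";
        Console.WriteLine(Extract(input, "input", "value", "input-name"));
        // prints: 42
    }
}
```

The trailing lookbehind is what ties the captured attribute to the right tag: after the whole tag has been matched up to its closing >, it checks that the same tag also contains the expected name attribute, so the name may appear before or after the attribute being captured.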
