Muddy clouds: Some .NET Regular Expressions for HTML parsing

Saturday, October 30, 2004

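Despite the observations in my previous post, my intranet publishing solution combines the ASP.NET approach for fetching web content with regular expressions written in C#. Each pattern below starts with the inline options (?insx): ignore case, explicit capture (only named groups capture), single-line mode, and ignore pattern whitespace, so the spaces inside the patterns are there for readability only. The extracted value ends up in the named group Result. Here are some regular expressions that worked well for me for extracting the values of specific meta and input tags, regardless of letter case and attribute quoting style (single, double, or no quotes):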

Content attribute of meta tag with name content-type:
(?insx) <meta\s([^>]*?\s*)? content\s*=\s* ( '(?<Result>[^']*)' | "(?<Result>[^"]*)" | (?<Result>[^\s>]*) ) [^>]*> (?<=\sname\s*=\s*['"]?content-type['"]?[^>]*>)
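
As a rough illustration of how the pattern above might be used from C#, here is a minimal sketch. The class name MetaContentDemo, the method Find, and the sample tags are assumptions made up for this example, not part of the original solution. The pattern is the one above, written as a verbatim string, so each double quote is doubled.

```csharp
using System;
using System.Text.RegularExpressions;

static class MetaContentDemo
{
    // The pattern from the post as a C# verbatim string (each " is doubled).
    // Line breaks are harmless because the x option ignores pattern whitespace.
    const string Pattern = @"(?insx)
        <meta\s([^>]*?\s*)?
        content\s*=\s*
        ( '(?<Result>[^']*)' | ""(?<Result>[^""]*)"" | (?<Result>[^\s>]*) )
        [^>]*>
        (?<=\sname\s*=\s*['""]?content-type['""]?[^>]*>)";

    // Hypothetical helper: returns the content value of the first
    // matching meta tag, or null when no tag matches.
    public static string Find(string html)
    {
        Match m = Regex.Match(html, Pattern);
        return m.Success ? m.Groups["Result"].Value : null;
    }

    static void Main()
    {
        // Made-up sample tags covering the three quoting styles,
        // plus one tag with a different name that should not match.
        string[] samples =
        {
            "<META NAME=\"Content-Type\" CONTENT=\"text/html; charset=utf-8\">",
            "<meta content='text/plain' name='content-type'>",
            "<meta name=content-type content=text/xml>",
            "<meta name=\"description\" content=\"ignored\">"
        };
        foreach (string s in samples)
            Console.WriteLine(Find(s) ?? "(no match)");
    }
}
```

Because the name check is a lookbehind placed after the closing >, the name attribute may appear before or after content within the tag. Duplicate group names and variable-length lookbehind are both .NET regex features, so the pattern may not port unchanged to other regex engines.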

Value of input element with name input-name:
(?insx) <input\s([^>]*?\s*)? value\s*=\s* ( '(?<Result>[^']*)' | "(?<Result>[^"]*)" | (?<Result>[^\s>]*) ) [^>]*> (?<=\sname\s*=\s*['"]?input-name['"]?[^>]*>)
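
The meta and input patterns differ only in the tag name, the attribute to extract, and the expected name value, so they could be generated from one template. The following is a sketch under that assumption. NamedTagValues, Build, and FindAll are hypothetical names introduced for illustration, and the sample form is invented. Regex.Escape is applied to the inserted parts so that they are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NamedTagValues
{
    // Hypothetical helper: builds the pattern above for a given tag,
    // target attribute, and value of the name attribute.
    public static Regex Build(string tag, string attribute, string name)
    {
        string pattern =
            @"(?insx) <" + Regex.Escape(tag) + @"\s([^>]*?\s*)? "
            + Regex.Escape(attribute) + @"\s*=\s*
            ( '(?<Result>[^']*)' | ""(?<Result>[^""]*)"" | (?<Result>[^\s>]*) )
            [^>]*>
            (?<=\sname\s*=\s*['""]?" + Regex.Escape(name) + @"['""]?[^>]*>)";
        return new Regex(pattern);
    }

    // Collects the Result group of every matching tag in document order.
    public static List<string> FindAll(string html, string tag,
                                       string attribute, string name)
    {
        var values = new List<string>();
        foreach (Match m in Build(tag, attribute, name).Matches(html))
            values.Add(m.Groups["Result"].Value);
        return values;
    }

    static void Main()
    {
        // Invented sample markup: two inputs named input-name, one not.
        string html =
            "<form>" +
            "<INPUT TYPE=hidden NAME='input-name' VALUE=\"first\">" +
            "<input name=other value=skipped>" +
            "<input value=second name=input-name>" +
            "</form>";
        foreach (string v in FindAll(html, "input", "value", "input-name"))
            Console.WriteLine(v);
    }
}
```

Calling Build("meta", "content", "content-type") should yield a pattern equivalent to the first one in this post.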
