I wrote the following code to make working with regular expressions easier in C#. The function takes two parameters, Pattern and Text. Pattern is the regular expression to match, and Text is the target string that you want to scrape.
The actual code:
|
using System.Collections;
using System.Text.RegularExpressions;

namespace Scraper
{
    static class ScrapeText
    {
        // Returns every match of Pattern found in Text, in order of appearance.
        static public ArrayList Scrape(string Pattern, string Text)
        {
            ArrayList MatchedValues = new ArrayList();
            Regex r = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

            // Walk the matches one at a time until no further match is found.
            for (Match m = r.Match(Text); m.Success; m = m.NextMatch())
            {
                // Groups[0] is the entire matched text, not a capture group.
                string foundVal = m.Groups[0].ToString();
                MatchedValues.Add(foundVal);
            }
            return MatchedValues;
        }
    }
}
|
Using ScrapeText:
|
ArrayList ScrapedList = ScrapeText.Scrape(@"(?s)<li>.*?</li>", "<ul><li>a</li><li>b</li><li>c</li></ul>");
|
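In this example, I pass "(?s)<li>.*?</li>" as the regular expression. The lazy quantifier ".*?" makes each match the shortest run of text that begins with "<li>" and ends with "</li>" in the HTML source, and "(?s)" lets "." match newlines, so list items that span several lines are matched too.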
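To see why the lazy ".*?" matters, here is a small sketch that counts matches for a given pattern. The class and method names (QuantifierDemo, CountMatches) are made up for this illustration and are not part of ScrapeText.

```csharp
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: counts how many times pattern matches in text.
    public static int CountMatches(string pattern, string text)
    {
        return Regex.Matches(text, pattern, RegexOptions.IgnoreCase).Count;
    }
}
```

With the dummy HTML "<ul><li>a</li><li>b</li><li>c</li></ul>", the lazy pattern @"(?s)<li>.*?</li>" gives a count of 3, one per list item. The greedy pattern @"(?s)<li>.*</li>" gives a count of 1, because ".*" runs all the way to the last "</li>" and swallows the items in between.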
For the text to search, I pass a dummy HTML snippet:
<ul>
<li>a</li>
<li>b</li>
<li>c</li>
</ul>
The output is an ArrayList containing one item for each match. In this example it would be an ArrayList containing the values "<li>a</li>", "<li>b</li>", and "<li>c</li>", because Groups[0] holds the whole match, tags included. Once I have this ArrayList I can scrub and transform the values however I want. In this example I might strip out the HTML list elements to isolate the alphabetical values.
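One way to do that stripping is to let the regular expression do it with a capture group. This is only a sketch: the class and method names (ListItemExample, InnerValues) are hypothetical, it returns a List<string> rather than an ArrayList, and it assumes the list items contain plain text with no nested tags.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ListItemExample
{
    // Hypothetical helper: returns the text inside each <li>...</li> pair.
    public static List<string> InnerValues(string html)
    {
        var values = new List<string>();

        // The parentheses form capture group 1, which holds only the text
        // between the tags. Groups[0] would still be the whole match.
        foreach (Match m in Regex.Matches(html, @"(?s)<li>(.*?)</li>", RegexOptions.IgnoreCase))
        {
            values.Add(m.Groups[1].Value.Trim());
        }
        return values;
    }
}
```

Calling ListItemExample.InnerValues("<ul><li>a</li><li>b</li><li>c</li></ul>") returns a list holding "a", "b", and "c".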