Ian Beckett

    RegEx Text Scrape C# Function

    I wrote the following function to make working with regular expressions easier in C#. It takes two parameters, Pattern and Text: Pattern is the regular expression to match, and Text is the target string that you want to scrape.

    The actual code:

    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text;
    using System.Collections;
    using System.Text.RegularExpressions;

    namespace Scraper
    {
        static class ScrapeText
        {
            static public ArrayList Scrape(string Pattern, string Text)
            {
                ArrayList MatchedValues = new ArrayList();

                // Case-insensitive matching; the pattern is compiled on every call.
                Regex r = new Regex(Pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);

                // Walk each successive match and keep its full text (group 0).
                for (Match m = r.Match(Text); m.Success; m = m.NextMatch())
                {
                    MatchedValues.Add(m.Value);
                }

                return MatchedValues;
            }
        }
    }



    Using ScrapeText:

    ArrayList ScrapedList = ScrapeText.Scrape(@"(?s)<li>.*?</li>", "<ul><li>a</li><li>b</li><li>c</li></ul>");

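    In this example, I pass "(?s)<li>.*?</li>" as the regular expression. It matches each shortest run of text that begins with "<li>" and ends with "</li>" in the HTML source. The (?s) prefix lets "." match newlines as well, so a list item that spans several lines is still matched.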
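
    To see why the "?" after ".*" matters in this pattern, here is a small sketch that compares it with the greedy form. The names QuantifierDemo and AllMatches are made up for illustration and are not part of ScrapeText, and the sketch uses Regex.Matches with a List<string> rather than the Match/NextMatch loop above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class QuantifierDemo
{
    // Hypothetical helper: returns the full text of every match, in order.
    public static List<string> AllMatches(string pattern, string text)
    {
        var values = new List<string>();
        foreach (Match m in Regex.Matches(text, pattern))
        {
            values.Add(m.Value);
        }
        return values;
    }

    static void Main()
    {
        string html = "<ul><li>a</li><li>b</li><li>c</li></ul>";

        // Lazy ".*?" stops at the first "</li>" it can reach.
        List<string> lazy = AllMatches(@"(?s)<li>.*?</li>", html);
        Console.WriteLine(lazy.Count);   // 3
        Console.WriteLine(lazy[0]);      // <li>a</li>

        // Greedy ".*" runs on to the last "</li>" in the text.
        List<string> greedy = AllMatches(@"(?s)<li>.*</li>", html);
        Console.WriteLine(greedy.Count); // 1
        Console.WriteLine(greedy[0]);    // <li>a</li><li>b</li><li>c</li>
    }
}
```

    With the greedy form the whole list comes back as a single match, which is why the lazy form is the one to use when each item should be returned separately.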

    For the text to search, I pass a dummy HTML snippet: 
     <ul>
      <li>a</li>
      <li>b</li>
      <li>c</li>
     </ul>

    The output is an ArrayList containing one item for each match. In this example it would contain the values "<li>a</li>", "<li>b</li>", and "<li>c</li>". Once I have this ArrayList I can scrub and transform the values however I want. Here I might strip out the HTML list elements to isolate the alphabetical values.
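
    One way to isolate the values, sketched under the assumption that each list item holds plain text with no nested tags: wrap the inner part of the pattern in a capture group and read Groups[1] instead of the whole match. ListItemText and InnerValues are hypothetical names, and the sketch returns a List<string> in place of an ArrayList.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ListItemText
{
    // Hypothetical helper: returns the text between each <li> and </li>.
    public static List<string> InnerValues(string html)
    {
        var values = new List<string>();

        // Group 0 is the whole match; group 1 is only what sits between the tags.
        foreach (Match m in Regex.Matches(html, @"(?s)<li>(.*?)</li>", RegexOptions.IgnoreCase))
        {
            values.Add(m.Groups[1].Value.Trim());
        }
        return values;
    }

    static void Main()
    {
        List<string> values = InnerValues("<ul><li>a</li><li>b</li><li>c</li></ul>");
        Console.WriteLine(string.Join(",", values)); // a,b,c
    }
}
```

    Doing the capture in the same pass avoids a second round of string replacement on every matched item.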


    Posted by ibeckett on Thursday, July 16, 2009 5:46 PM

    Comments

    Johnny us

    Wednesday, December 16, 2009 5:57 AM

    CORRECTION: The output is not "a", "b", and "c". It's "<li>a</li>", "<li>b</li>", and "<li>c</li>".