By now I hope you are familiar with the web scraping technique; if you want to recall it, just have a look at Part I of this article, which you can find here.

In the previous article, I scraped all the headings of my posts from the index page of this website and displayed them in a GridView control in ASP.NET Web Forms. Now let's move forward and explore web scraping a bit more.

Today in this article I will show you how to pick a specific piece of content inside a website, which can be a paragraph, a number, or any other particular piece of information.

For example, I have a Quick Intro section on the main page, as shown in the picture. Now I want to pick whatever is written inside this widget/section.

Note: I am not going to re-write the project creation steps. To read them, please have a look at Part I of this article.

  1. Using any web browser, inspect the desired element. In my case I am inspecting the Quick Intro element on my web page.
  2. Now we can pick the element using its class name, which in my case is et_pb_widget widget_aboutmewidget.
  3. Let's code now. Create a function like the one below and call it from wherever you want, e.g. on the click of a button or on page load; it's up to you.
    void getData()
    {
    }
    
  4. Now let's fill this function with our logic. Make sure you have added the HtmlAgilityPack reference; if you don't know how to do that, just read Part I of this article.
  5. So our getData() function will have the following logic implemented inside it:
    var html = new HtmlDocument();
    html.LoadHtml(new WebClient().DownloadString("http://bilalamjad.azurewebsites.net"));
    var root = html.DocumentNode;
    var quick_intro = root.Descendants().Where(n => n.GetAttributeValue("class", "").Equals("et_pb_widget widget_aboutmewidget"));
    Response.Write(quick_intro.Single().InnerText);
    
  6. In the above code, first of all we declare a new instance of the HtmlDocument class, which is a member of HtmlAgilityPack, the package we added through the NuGet Package Manager.
  7. Then we load the HTML code of the website, which in our case is my website; you can use your own.
  8. Then I have defined the root, the starting point where the scraping is actually going to begin.
  9. Now I have declared a variable named quick_intro, which is going to pick the element whose class name is et_pb_widget widget_aboutmewidget, the class name of my Quick Intro section.
  10. Finally, Response.Write writes whatever is written inside that element. For this I have treated quick_intro as a single element and picked its text using the InnerText property. If there are multiple matching elements, you can simply enclose your code in a foreach statement and iterate through each one.
  11. Now let's get more precise: what if I want to pick only my email ID from this paragraph? That's pretty simple. I am going to use a regular expression here. Let's assign our information to a variable and then process it:
    var intro_text = quick_intro.Single().InnerText;

    const string MatchEmailPattern =
        @"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
        + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
        + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
        + @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
    Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    MatchCollection matches = rx.Matches(intro_text);

    foreach (Match match in matches)
    {
        Response.Write(match.Value);
    }


    Compiling all the code together (make sure your code-behind file has using directives for System.Net, System.Linq, System.Text.RegularExpressions, and HtmlAgilityPack), our getData() function will be:

    void getData()
    {
        var html = new HtmlDocument();
        html.LoadHtml(new WebClient().DownloadString("http://bilalamjad.azurewebsites.net"));
        var root = html.DocumentNode;
        var quick_intro = root.Descendants().Where(n => n.GetAttributeValue("class", "").Equals("et_pb_widget widget_aboutmewidget"));
        var intro_text = quick_intro.Single().InnerText;

        const string MatchEmailPattern =
            @"(([\w-]+\.)+[\w-]+|([a-zA-Z]{1}|[\w-]{2,}))@"
            + @"((([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\."
            + @"([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])\.([0-1]?[0-9]{1,2}|25[0-5]|2[0-4][0-9])){1}|"
            + @"([a-zA-Z]+[\w-]+\.)+[a-zA-Z]{2,4})";
        Regex rx = new Regex(MatchEmailPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

        MatchCollection matches = rx.Matches(intro_text);

        foreach (Match match in matches)
        {
            Response.Write(match.Value);
        }
    }


That's all. Today we got a little more advanced and explored scraping a bit further. In my next article I will show you how to get parent nodes and their child nodes, which is very useful if you want to scrape lists from websites.