Introduction

Web scraping, also known as screen scraping, web data extraction, or web harvesting, is a technique for pulling data from other websites and processing it according to our own needs. The fetched data is usually stored in a database, although other options, such as saving it to a Microsoft Excel file, are also available.

Now let's see how to scrape a web page using ASP.NET Web Forms.

  1. Open Visual Studio and create a blank ASP.NET Web Forms project.
  2. Right-click References and select the Manage NuGet Packages option.
  3. Add HTML Agility Pack through the NuGet package manager.
  4. Now add a web form to your project by right-clicking the project name -> Add New Item -> Web Form.
  5. Add a GridView to your page using the following markup:
    <asp:GridView runat="server" ID="test" />
    
  6. Now let's scrape. I am using my own website, http://www.bilalamjad.net, to demonstrate.
  7. I want to display all the elements inside <h2> tags, which are the headings/titles of the blog posts/articles on the main page.
  8. Add a list inside your class:
    List<string> data = new List<string>();
    
  9. Now create a function getData():
    void getData()
    {
    }
    
  10. Now let's write the body of getData(). For scraping we need an instance of HtmlWeb, a class from the HTML Agility Pack package we added in step 3.
  11. The code is as follows:
    void getData()
    {
        // Download the page and parse it into an HTML document.
        var getHtmlWeb = new HtmlWeb();
        var document = getHtmlWeb.Load("http://bilalamjad.azurewebsites.net");

        // Select every <h2> node on the page.
        var aTags = document.DocumentNode.SelectNodes("//h2");
        int counter = 1;

        if (aTags != null)
        {
            foreach (var aTag in aTags)
            {
                // Number each heading and decode HTML entities to plain text.
                string d = counter + ". " + aTag.InnerHtml;
                string textOnly = HttpUtility.HtmlDecode(d);

                data.Add(textOnly);
                counter++;
            }
        }

        // Bind the collected headings to the GridView.
        test.DataSource = data;
        test.DataBind();
    }
    
  12. The code is mostly self-explanatory, but let's go over a few terms.
    HtmlWeb: we create an HtmlWeb object so that we can call its Load() method with the web page we want to scrape; Load() downloads that page and returns its HTML document (the actual HTML code) for us to work with.
    Nodes: nodes are the HTML tags we want to parse/scrape. They can be any HTML or custom tag; in my case I am selecting <h2> tags to get all the headings inside <h2></h2>.
    InnerHTML: InnerHtml returns the HTML code inside those <h2> tags, but if I need text only I can decode it with HttpUtility.HtmlDecode(string). The second sketch after this list shows the difference.
  13. Finally, add the results to your list through List.Add().
  14. Now bind your GridView to the list to see the results. (A sketch of the complete code-behind wiring, including where getData() is called, follows this list.)
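For reference, here is one way all the pieces above can sit together in the code-behind file. This is only a minimal sketch, not the exact code of the original page: it assumes the web form from step 4 is called Default.aspx with a code-behind class named _Default, so adjust the names to match your own project.

    using System;
    using System.Collections.Generic;
    using System.Web;            // HttpUtility.HtmlDecode
    using System.Web.UI;
    using HtmlAgilityPack;       // HtmlWeb, added via NuGet in step 3

    public partial class _Default : Page
    {
        // The list from step 8 that backs the GridView.
        List<string> data = new List<string>();

        protected void Page_Load(object sender, EventArgs e)
        {
            // Scrape only on the first request; skipping postbacks avoids
            // re-downloading the remote page every time a control posts back.
            if (!IsPostBack)
            {
                getData();
            }
        }

        void getData()
        {
            // The body of getData() from step 11 goes here unchanged.
        }
    }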
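And on the InnerHTML point from step 12: the throwaway snippet below (using the same usings as the sketch above, with a heading string invented purely for demonstration rather than anything scraped from the site) shows what each property gives you.

    // A quick illustration of InnerHtml vs. InnerText vs. HtmlDecode.
    void showHeadingText()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<h2><a href=\"/post-1\">Scraping &amp; Parsing</a></h2>");

        HtmlNode h2 = doc.DocumentNode.SelectSingleNode("//h2");

        // InnerHtml keeps the nested markup:
        //   <a href="/post-1">Scraping &amp; Parsing</a>
        string html = h2.InnerHtml;

        // InnerText drops the tags but leaves entities like &amp; encoded:
        //   Scraping &amp; Parsing
        string text = h2.InnerText;

        // HttpUtility.HtmlDecode turns the entities back into readable text:
        //   Scraping & Parsing
        string plain = HttpUtility.HtmlDecode(text);
    }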

So that was some very basic web scraping; we will explore it further in Part Two of this post.

That's all. Happy coding 🙂