Visual Basic .NET Articles

KB: Web Scraping - Extract all links from a web page using VB.NET

SATURDAY, APRIL 10, 2010

This article explains another web scraping technique that is central to search engine crawling: extracting all links from a given URL. The procedure is quite simple, especially with regular expressions. The HTML content of the URL is downloaded as a string, and all occurrences of hyperlinks are extracted from it.

Link extraction is a major function of any search engine crawler. The contents of a given URL are first downloaded and indexed, then parsed further to get the list of hyperlinks, images, etc. Whatever the programming language or web technology behind a page, hyperlinks are written with the anchor element (<a href="...">...</a>), so they can be found by searching the HTML content with regular expressions.
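As a rough illustration of the idea, the sketch below runs a regular expression over a small hard-coded HTML string. The sample markup, the module name LinkPatternDemo and the function FindAnchors are made up for this example. The pattern is one reasonable choice rather than the only one, and a regular expression can miss links in unusual or malformed markup.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkPatternDemo

    ' Returns "url|text" pairs for every <a ... href="..."> element found.
    ' Group 1 captures the href value, group 2 the text between the tags.
    Public Function FindAnchors(ByVal html As String) As List(Of String)
        Dim results As New List(Of String)
        Dim pattern As String = "<a\s[^>]*?href=""([^""]*)""[^>]*>(.*?)</a>"
        Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline
        For Each m As Match In Regex.Matches(html, pattern, options)
            results.Add(m.Groups(1).Value & "|" & m.Groups(2).Value)
        Next
        Return results
    End Function

    Sub Main()
        ' Hypothetical sample markup with one relative and one absolute link
        Dim html As String = "<p><a href=""/about.aspx"">About</a> and " & _
                             "<A class=""ext"" HREF=""http://example.com/"">Example</A></p>"
        For Each item As String In FindAnchors(html)
            Console.WriteLine(item)
        Next
    End Sub

End Module
```

The \s after <a keeps the pattern from matching other tags that begin with the letter a, and the IgnoreCase option lets it accept upper-case tags and attributes.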

Anchor elements are used for different purposes on a page: calling JavaScript functions, linking to named anchors (<a name="..." />) on the same page, and ordinary hyperlinking to internal and external pages. Of these, only the ordinary hyperlinks should be kept. The second part of the task is to build complete URLs from partial or relative URLs such as /about.aspx or ../../help.htm, relative to the requested base address.
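The filtering step can be sketched as a small predicate. The module name LinkFilterDemo, the function IsPageLink and the sample values are hypothetical and only serve to show which href values would be kept or dropped under the rules described above.

```vbnet
Imports System

Module LinkFilterDemo

    ' Returns True only for href values that lead to another page.
    Public Function IsPageLink(ByVal href As String) As Boolean
        Dim value As String = href.Trim()
        If value.Length = 0 Then Return False
        ' In-page anchors such as #top
        If value.StartsWith("#") Then Return False
        ' Script calls and e-mail links, compared without regard to case
        If value.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase) Then Return False
        If value.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase) Then Return False
        Return True
    End Function

    Sub Main()
        ' Made-up href values covering each case
        Dim samples() As String = {"#top", "javascript:void(0)", "MAILTO:someone@example.com", "/about.aspx", "../../help.htm"}
        For Each s As String In samples
            Console.WriteLine(s & " -> " & IsPageLink(s).ToString())
        Next
    End Sub

End Module
```

Only the last two sample values pass the filter. They are relative URLs, so they still have to be mapped to the base address before use.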

The Visual Basic .NET code snippet given below has two functions: one to extract the hyperlinks from a given URL, and a second supporting function to build proper URLs. The URL extraction method ignores all JavaScript links, email links and anchor links.

    Imports System.Data
    Imports System.Net
    Imports System.Text.RegularExpressions

    Public Function ExtractLinks(ByVal url As String) As DataTable
        Dim dt As New DataTable
        dt.Columns.Add("LinkText")
        dt.Columns.Add("LinkUrl")

        'Download the page as a string and release the WebClient afterwards
        Dim html As String
        Using wc As New WebClient
            html = wc.DownloadString(url)
        End Using

        'The \s after <a avoids matching other tags such as <abbr> or <area>.
        'IgnoreCase accepts <A HREF=...>, Singleline lets a link span several lines.
        Dim links As MatchCollection = Regex.Matches(html, _
            "<a\s[^>]*?href=""(.*?)""[^>]*>(.*?)</a>", _
            RegexOptions.IgnoreCase Or RegexOptions.Singleline)

        For Each match As Match In links
            Dim dr As DataRow = dt.NewRow
            Dim matchUrl As String = match.Groups(1).Value
            'Ignore all anchor links
            If matchUrl.StartsWith("#") Then
                Continue For
            End If
            'Ignore all javascript calls
            If matchUrl.ToLower.StartsWith("javascript:") Then
                Continue For
            End If
            'Ignore all email links
            If matchUrl.ToLower.StartsWith("mailto:") Then
                Continue For
            End If
            'For internal links, build the url mapped to the base address
            If Not matchUrl.StartsWith("http://", StringComparison.OrdinalIgnoreCase) _
               AndAlso Not matchUrl.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
                matchUrl = MapUrl(url, matchUrl)
            End If
            'Add the link data to datatable
            dr("LinkUrl") = matchUrl
            dr("LinkText") = match.Groups(2).Value
            dt.Rows.Add(dr)
        Next

        Return dt
    End Function
 
    Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String

        Dim u As New System.Uri(baseAddress)

        If relativePath = "./" Then
            relativePath = "/"
        End If

        If relativePath.StartsWith("/") Then
            'Root-relative path: append it to the scheme and host
            Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath
        Else
            'AbsolutePath never includes the query string
            Dim basePath As String = u.AbsolutePath.TrimEnd("/"c)
            ' If the baseAddress contains a file name, like ..../Something.aspx
            ' Trim off the file name
            If basePath.Substring(basePath.LastIndexOf("/"c) + 1).Contains(".") Then
                basePath = basePath.Substring(0, basePath.LastIndexOf("/"c))
            End If
            baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + basePath

            'If the relativePath contains ../ then
            ' adjust the baseAddress accordingly,
            ' without stepping above the host name

            While relativePath.StartsWith("../")
                relativePath = relativePath.Substring(3)
                If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//") + 2 Then
                    baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/")).TrimEnd("/"c)
                End If
            End While

            Return baseAddress + "/" + relativePath
        End If

    End Function
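
As an alternative to the hand-written MapUrl function, the System.Uri class can combine a base address with a relative path itself. The sketch below is not part of the original snippet; the module name UriResolveDemo, the function ResolveUrl and the example.com addresses are assumptions made for illustration. The two approaches differ in one respect: Uri always treats the last path segment of a base address without a trailing slash as a file, while MapUrl does so only when the segment contains a dot.

```vbnet
Imports System

Module UriResolveDemo

    ' Resolves relativePath against baseAddress using System.Uri.
    Public Function ResolveUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
        Dim baseUri As New Uri(baseAddress)
        ' The Uri(Uri, String) constructor applies the relative reference to the base
        Dim resolved As New Uri(baseUri, relativePath)
        Return resolved.AbsoluteUri
    End Function

    Sub Main()
        ' Hypothetical page address used as the base
        Dim page As String = "http://www.example.com/docs/guide/index.aspx"
        Console.WriteLine(ResolveUrl(page, "/about.aspx"))     ' http://www.example.com/about.aspx
        Console.WriteLine(ResolveUrl(page, "../../help.htm"))  ' http://www.example.com/help.htm
        Console.WriteLine(ResolveUrl(page, "images/logo.png")) ' http://www.example.com/docs/guide/images/logo.png
    End Sub

End Module
```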

The above code snippet is only a simple example of link extraction from web pages. It can be expanded to extract email addresses, FTP addresses, and other special addresses such as Skype usernames.
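
One possible extension, sketched below, collects email addresses from mailto: links instead of discarding them. The module name MailtoDemo, the function ExtractMailtoAddresses and the sample markup are hypothetical. The pattern only looks at href attributes written with double quotes, so addresses that appear as plain text on the page are not picked up.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MailtoDemo

    ' Collects the address part of every href="mailto:..." attribute.
    Public Function ExtractMailtoAddresses(ByVal html As String) As List(Of String)
        Dim addresses As New List(Of String)
        ' Group 1 stops at the closing quote or at "?" (start of subject/body parameters)
        Dim pattern As String = "href=""mailto:([^""?]+)"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Dim address As String = m.Groups(1).Value.Trim()
            ' Skip addresses that were already collected
            If Not addresses.Contains(address) Then
                addresses.Add(address)
            End If
        Next
        Return addresses
    End Function

    Sub Main()
        ' Made-up markup with two mail links
        Dim html As String = "<a href=""mailto:info@example.com?subject=Hi"">Mail us</a> " & _
                             "<a href=""MAILTO:sales@example.com"">Sales</a>"
        For Each address As String In ExtractMailtoAddresses(html)
            Console.WriteLine(address)
        Next
    End Sub

End Module
```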
