

    KB: Web Scraping - Extract all links from a web page using VB.NET

    SATURDAY, APRIL 10, 2010

    This article explains another web scraping technique, one that is central to search engine crawling: extracting all links from a given URL. The procedure is quite simple, especially with regular expressions. The HTML content of the URL is downloaded as a string, and all occurrences of hyperlinks are extracted from it.

    Link extraction is a core function of any search engine crawler. The contents of a given URL are first downloaded and indexed, then parsed further to get the list of hyperlinks, images, and so on. Whatever programming language or web technology produced the page, hyperlinks are marked up with the anchor element (<a href="...">...</a>), which can be located in the HTML with a regular expression.
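
    The matching step can be tried on an in-memory string before any page is downloaded. The sketch below is illustrative only: the sample markup, the module name AnchorRegexDemo and the exact pattern are assumptions made for this example, and the pattern only recognises href values written in double quotes.

    ```vbnet
    Imports System
    Imports System.Text.RegularExpressions

    Module AnchorRegexDemo
        Sub Main()
            ' Hypothetical sample markup, used only for illustration.
            Dim html As String = "<p><a href=""/about.aspx"">About</a> | "
            html &= "<A class=""ext"" HREF=""http://example.com/"">Example</A></p>"

            ' Group 1 captures the href value, group 2 captures the link text.
            ' \s after <a keeps tags such as <abbr> from matching.
            Dim pattern As String = "<a\s[^>]*?href=""([^""]*)""[^>]*>(.*?)</a>"
            ' IgnoreCase accepts <A HREF=...>; Singleline lets the link text span lines.
            Dim options As RegexOptions = RegexOptions.IgnoreCase Or RegexOptions.Singleline

            For Each m As Match In Regex.Matches(html, pattern, options)
                Console.WriteLine(m.Groups(1).Value & " -> " & m.Groups(2).Value)
            Next
        End Sub
    End Module
    ```

    For the sample string this prints "/about.aspx -> About" and "http://example.com/ -> Example". Regular expressions are adequate for simple, well-formed markup; pages with unquoted or single-quoted attributes would need a looser pattern or a real HTML parser.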

    Anchor elements serve different purposes on a page: calling JavaScript functions, linking to named anchors (<a name="..." />) on the same page, and ordinary hyperlinking to internal and external pages. Of these, only the ordinary hyperlinks should be kept. The second part of the task is to build complete URLs from partial or relative URLs such as /about.aspx or ../../help.htm, resolved against the requested base address.

    The Visual Basic .NET code snippet given below has two functions: one to extract the hyperlinks from a given URL, and a second, supporting function to build complete URLs. The extraction function ignores all JavaScript links, email links and anchor links.

    Imports System
    Imports System.Data
    Imports System.Net
    Imports System.Text.RegularExpressions

    Public Module LinkExtractor

        Public Function ExtractLinks(ByVal url As String) As DataTable
            Dim dt As New DataTable
            dt.Columns.Add("LinkText")
            dt.Columns.Add("LinkUrl")

            'Download the page as a string; Using disposes the WebClient afterwards
            Dim html As String
            Using wc As New WebClient
                html = wc.DownloadString(url)
            End Using

            'Group 1 is the href value, group 2 is the link text.
            '\s after <a skips other tags that begin with "a"; Singleline lets a link span several lines.
            Dim pattern As String = "<a\s[^>]*?href=""([^""]*)""[^>]*>(.*?)</a>"
            Dim links As MatchCollection = Regex.Matches(html, pattern, RegexOptions.IgnoreCase Or RegexOptions.Singleline)

            For Each match As Match In links
                Dim matchUrl As String = match.Groups(1).Value.Trim()
                'Ignore empty links and all anchor links
                If matchUrl.Length = 0 OrElse matchUrl.StartsWith("#") Then
                    Continue For
                End If
                'Ignore all javascript calls
                If matchUrl.StartsWith("javascript:", StringComparison.OrdinalIgnoreCase) Then
                    Continue For
                End If
                'Ignore all email links
                If matchUrl.StartsWith("mailto:", StringComparison.OrdinalIgnoreCase) Then
                    Continue For
                End If
                'For internal links, build the url mapped to the base address
                If Not matchUrl.StartsWith("http://", StringComparison.OrdinalIgnoreCase) AndAlso Not matchUrl.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
                    matchUrl = MapUrl(url, matchUrl)
                End If
                'Add the link data to datatable
                Dim dr As DataRow = dt.NewRow
                dr("LinkUrl") = matchUrl
                dr("LinkText") = match.Groups(2).Value
                dt.Rows.Add(dr)
            Next

            Return dt
        End Function

        Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String

            Dim u As New Uri(baseAddress)
            Dim root As String = u.Scheme + Uri.SchemeDelimiter + u.Authority

            'Protocol-relative links (//host/path) only need the scheme
            If relativePath.StartsWith("//") Then
                Return u.Scheme + ":" + relativePath
            End If

            'Links relative to the site root
            If relativePath.StartsWith("/") Then
                Return root + relativePath
            End If

            'AbsolutePath never includes the query string
            Dim path As String = u.AbsolutePath.TrimEnd("/"c)
            'If the baseAddress ends in a file name, like ..../Something.aspx
            'Trim off the file name
            If path.Substring(path.LastIndexOf("/"c) + 1).Contains(".") Then
                path = path.Substring(0, path.LastIndexOf("/"c))
            End If
            baseAddress = root + path

            'A leading ./ refers to the current directory, so drop it
            While relativePath.StartsWith("./")
                relativePath = relativePath.Substring(2)
            End While

            'For each leading ../ move the baseAddress up one directory,
            'but never above the site root
            While relativePath.StartsWith("../")
                relativePath = relativePath.Substring(3)
                If baseAddress.LastIndexOf("/") > baseAddress.IndexOf("//") + 1 Then
                    baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/"))
                End If
            End While

            Return baseAddress + "/" + relativePath
        End Function

    End Module
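
    As an alternative to the hand-written MapUrl, the System.Uri class can resolve a relative reference against a base address by itself. The sketch below is not the method used in this article; the helper name ResolveUrl, the module name and the example.com address are hypothetical and chosen only for illustration.

    ```vbnet
    Imports System

    Module UriResolveDemo
        ' Hypothetical helper: resolves href against the page address using System.Uri.
        ' Returns Nothing when the two values cannot be combined into a valid URI.
        Function ResolveUrl(ByVal baseAddress As String, ByVal href As String) As String
            Dim baseUri As New Uri(baseAddress)
            Dim resolved As Uri = Nothing
            If Uri.TryCreate(baseUri, href, resolved) Then
                Return resolved.AbsoluteUri
            End If
            Return Nothing
        End Function

        Sub Main()
            ' Example page address, assumed for this sketch.
            Dim page As String = "http://example.com/docs/guide/index.aspx"
            Console.WriteLine(ResolveUrl(page, "/about.aspx"))    ' http://example.com/about.aspx
            Console.WriteLine(ResolveUrl(page, "../../help.htm")) ' http://example.com/help.htm
            Console.WriteLine(ResolveUrl(page, "faq.htm"))        ' http://example.com/docs/guide/faq.htm
        End Sub
    End Module
    ```

    Unlike MapUrl, this approach does not guess whether the last path segment is a file by looking for a dot; it follows the standard rules for resolving relative references, so a base address such as /docs/page without an extension is still treated as a file inside /docs/.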

    The above code snippet is only a simple example of link extraction from web pages. It can be expanded to extract email addresses, FTP addresses, and other special addresses such as Skype usernames.
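
    As one possible extension, email addresses can be collected from the mailto: links that the extraction function skips. This is only a sketch under stated assumptions: the function name ExtractMailtoAddresses, the module name and the sample markup are hypothetical, and the pattern only handles double-quoted href values.

    ```vbnet
    Imports System
    Imports System.Collections.Generic
    Imports System.Text.RegularExpressions

    Module MailtoDemo
        ' Hypothetical helper: collects the addresses of mailto: links in an HTML string.
        Function ExtractMailtoAddresses(ByVal html As String) As List(Of String)
            Dim result As New List(Of String)
            ' Capture everything after "mailto:" up to the closing quote
            ' or a "?" that starts a query such as ?subject=...
            Dim pattern As String = "href=""mailto:([^""?]+)"
            For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
                Dim address As String = m.Groups(1).Value.Trim()
                ' Keep each address only once
                If Not result.Contains(address) Then
                    result.Add(address)
                End If
            Next
            Return result
        End Function

        Sub Main()
            ' Sample markup, assumed for this sketch.
            Dim html As String = "<a href=""mailto:info@example.com?subject=Hi"">Mail us</a>"
            For Each address As String In ExtractMailtoAddresses(html)
                Console.WriteLine(address)
            Next
        End Sub
    End Module
    ```

    For the sample markup this prints info@example.com. Addresses that appear as plain text rather than inside a mailto: link would need a separate pattern.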
