This article explains another web scraping technique that is central to search engine crawling: extracting all the links from a given URL. The procedure is quite simple, especially with regular expressions. The HTML content of the URL is downloaded as a string, and all occurrences of hyperlinks are extracted from it.
Link extraction is a major function of any search engine crawler. The contents of a given URL are first downloaded and indexed, then parsed further to get the list of hyperlinks, images, etc. Whatever the programming language or web technology behind a page, hyperlinks are written as anchor elements (<a href="...">...</a>), which can be found in the HTML with a regular expression.
Anchor elements are used for different purposes on a page: calling JavaScript functions, linking to named anchors (<a name="..." />) on the same page, and ordinary hyperlinking to internal and external pages. Of these, only the ordinary hyperlinks should be kept. The second part of the task is to build complete URLs from partial or relative URLs such as /about.aspx or ../../help.htm, resolved against the requested base address.
The Visual Basic .NET code snippet below has two functions: one to extract the hyperlinks from a given URL, and a supporting function to build complete URLs. The extraction method ignores all JavaScript links, email links and anchor links.
'These Imports belong at the top of the source file
Imports System
Imports System.Data
Imports System.Net
Imports System.Text.RegularExpressions

Public Function ExtractLinks(ByVal url As String) As DataTable
    Dim dt As New DataTable
    dt.Columns.Add("LinkText")
    dt.Columns.Add("LinkUrl")

    'Download the page as a string; Using disposes the WebClient afterwards
    Dim html As String
    Using wc As New WebClient
        html = wc.DownloadString(url)
    End Using

    'Group 1 captures the href value, group 2 the link text.
    'Only double-quoted href values are matched.
    Dim links As MatchCollection = Regex.Matches(html, "<a\s[^>]*?href\s*=\s*""([^""]*)""[^>]*>(.*?)</a>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)

    For Each match As Match In links
        Dim matchUrl As String = match.Groups(1).Value.Trim()

        'Ignore all anchor links
        If matchUrl.StartsWith("#") Then
            Continue For
        End If

        'Ignore all javascript calls
        If matchUrl.ToLower.StartsWith("javascript:") Then
            Continue For
        End If

        'Ignore all email links
        If matchUrl.ToLower.StartsWith("mailto:") Then
            Continue For
        End If

        'For internal links, build the url mapped to the base address
        If Not matchUrl.StartsWith("http://", StringComparison.OrdinalIgnoreCase) AndAlso Not matchUrl.StartsWith("https://", StringComparison.OrdinalIgnoreCase) Then
            matchUrl = MapUrl(url, matchUrl)
        End If

        'Add the link data to datatable
        Dim dr As DataRow = dt.NewRow
        dr("LinkUrl") = matchUrl
        dr("LinkText") = match.Groups(2).Value
        dt.Rows.Add(dr)
    Next

    Return dt
End Function
Public Function MapUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
    Dim u As New System.Uri(baseAddress)

    'A leading "./" refers to the current directory, so strip it
    If relativePath.StartsWith("./") Then
        relativePath = relativePath.Substring(2)
    End If

    If relativePath.StartsWith("/") Then
        'Root-relative path: append it to the scheme and host
        Return u.Scheme + Uri.SchemeDelimiter + u.Authority + relativePath
    Else
        'AbsolutePath already excludes the query string
        Dim pathAndQuery As String = u.AbsolutePath.TrimEnd("/"c)

        'If the baseAddress contains a file name, like ..../Something.aspx
        'trim off the file name
        Dim lastSegment As String = pathAndQuery.Substring(pathAndQuery.LastIndexOf("/"c) + 1)
        If lastSegment.Contains(".") Then
            pathAndQuery = pathAndQuery.Substring(0, pathAndQuery.LastIndexOf("/"c))
        End If
        baseAddress = u.Scheme + Uri.SchemeDelimiter + u.Authority + pathAndQuery

        'If the relativePath contains ../ then
        'adjust the baseAddress accordingly
        While relativePath.StartsWith("../")
            relativePath = relativePath.Substring(3)
            'Go up one directory, but never cut into the "//" that follows the scheme
            If baseAddress.LastIndexOf("/"c) > baseAddress.IndexOf("//") + 1 Then
                baseAddress = baseAddress.Substring(0, baseAddress.LastIndexOf("/"c)).TrimEnd("/"c)
            End If
        End While

        Return baseAddress + "/" + relativePath
    End If
End Function
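The path handling in MapUrl can also be delegated to the .NET Framework, because the System.Uri class has a constructor that resolves a relative string against a base Uri. The sketch below is an alternative, not the method used above; the name ResolveUrl and the example.com addresses are made up for illustration. It follows the standard URL resolution rules, so a base address such as http://www.example.com/docs (no trailing slash) is treated as a file named docs, whereas MapUrl treats it as a directory.

```vbnet
Imports System

Module UriResolutionSketch
    'Hypothetical helper: resolves a relative link against the address
    'of the page it was found on, using the Uri(Uri, String) constructor
    Public Function ResolveUrl(ByVal baseAddress As String, ByVal relativePath As String) As String
        Dim baseUri As New Uri(baseAddress)
        Dim resolved As New Uri(baseUri, relativePath)
        Return resolved.AbsoluteUri
    End Function

    Sub Main()
        Dim page As String = "http://www.example.com/products/list.aspx"
        'Root-relative link
        Console.WriteLine(ResolveUrl(page, "/about.aspx"))
        'Link that climbs one directory
        Console.WriteLine(ResolveUrl(page, "../help.htm"))
        'Link in the same directory as the page
        Console.WriteLine(ResolveUrl(page, "detail.aspx"))
    End Sub
End Module
```

With these inputs the three lines printed should be http://www.example.com/about.aspx, http://www.example.com/help.htm and http://www.example.com/products/detail.aspx.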
The above code snippet is only a simple example of link extraction from web pages. It can be expanded to extract email addresses, FTP addresses, and other special addresses such as Skype usernames.
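As one possible expansion, email addresses can be collected from the mailto: links that the extraction method skips. The sketch below is only an illustration: the name ExtractEmailLinks and the sample HTML are assumptions, it works on an HTML string that has already been downloaded, and it matches only double-quoted href values. It does not split mailto: links that list several recipients.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module MailtoSketch
    'Hypothetical helper: returns the distinct addresses of all mailto: links
    Public Function ExtractEmailLinks(ByVal html As String) As List(Of String)
        Dim emails As New List(Of String)
        'Group 1 captures the address; it stops at the closing quote
        'or at "?" so that parameters such as ?subject= are left out
        Dim pattern As String = "<a\s[^>]*?href\s*=\s*""mailto:([^""?]+)"
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            Dim address As String = m.Groups(1).Value.Trim()
            If Not emails.Contains(address) Then
                emails.Add(address)
            End If
        Next
        Return emails
    End Function

    Sub Main()
        Dim html As String = "<a href=""mailto:info@example.com?subject=Hi"">Mail us</a> <a href=""#top"">Top</a>"
        For Each address As String In ExtractEmailLinks(html)
            Console.WriteLine(address)
        Next
    End Sub
End Module
```

For the sample HTML this prints info@example.com; the #top anchor is ignored because it is not a mailto: link.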