(DotNet) NET (2014)

My web scraper with asynchronous web requests and visual proxy availability detection.

On this page I describe a small part of one of my large desktop applications. The application as a whole is designed for web scraping, but it has one interesting small part, which is described below.

Most sites limit the number of requests per IP address and block scraping from a single address, so a web scraper needs some way to change its IP address. One workable method is to change the address manually once it has been blocked, and there are several ways to do that. One option is AdvOR (first screenshot below); another good choice is a VPN (second screenshot below).



But my application takes another route. There are many sites that publish proxy lists (for example https://free-proxy-list.net/), and this fragment of my program is a visual checker that shows how a site behaves through one proxy or another from such a list.
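Before opening the visual checker, a proxy from such a list could also be pre-filtered without any UI. The following is only a minimal sketch, not code from the application: the module name ProxyProbe, the function IsProxyAlive, the test URL and the timeout are all assumptions made for illustration. It uses the same HttpWebRequest and WebProxy classes that appear later on this page.

```vbnet
Imports System
Imports System.Net

Module ProxyProbe
    ' Hypothetical helper: returns True if testUrl answers with HTTP 200
    ' when requested through the proxy given as "http://IP:PORT/".
    Function IsProxyAlive(ByVal proxyUrl As String, ByVal testUrl As String, ByVal timeoutMs As Integer) As Boolean
        Try
            Dim request = CType(WebRequest.Create(testUrl), HttpWebRequest)
            request.Proxy = New WebProxy(New Uri(proxyUrl))
            request.Timeout = timeoutMs
            request.Method = "GET"
            Using response = CType(request.GetResponse(), HttpWebResponse)
                Return response.StatusCode = HttpStatusCode.OK
            End Using
        Catch ex As Exception
            ' Timeouts, refused connections, 407 responses and malformed
            ' proxy URLs all count as "proxy not usable".
            Return False
        End Try
    End Function

    Sub Main()
        ' 203.0.113.10 is a documentation-only placeholder address.
        Console.WriteLine(IsProxyAlive("http://203.0.113.10:8080/", "http://example.com/", 5000))
    End Sub
End Module
```

A check like this only says that the proxy answered; it cannot tell whether the target site returned a captcha or a block page, which is exactly what the visual check is for.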



The proxy address and port are entered in this field of my application; after that, press the "check proxy" button on the right.



You can select a proxy to check from the application's existing list, or add a new one at any time.



After that, a separate window with a browser pops up. The proxy may not work in your environment; in that case you can press "delete proxy from list".



Otherwise the proxy works well, and you can add it to your current list.



My visual checker (actually a web browser) has some other features, such as adding a new site to the list of sites used to check proxy availability.



After you have refreshed the current proxy list, you can set in the application how many proxies to use for web scraping. This is the common pattern for small add-ons in my web scraper; this one is a visual checker that tests a big external proxy list in my current environment.

Below I describe some code fragments of this add-on. The last three controls on the first form of the application (all inside Toolstrip4) are named ToolStripLabel6, ProxyIP and CheckProxyButton.



The code that handles the events of these three controls is shown in the screenshot below.



The application as a whole is very large; lines 369 to 396 create the ProxyChecker class with the proxy's IP:PORT as a parameter and refresh the combobox list (in case a proxy has been added to the ProxyTabs table).


...
 372:          Dim ProxyChecker1 As New ProxyChecker("Http://" & OnlyIpPort & "/")
 373:          AddHandler ProxyChecker1.RefreshProxyList, AddressOf ProxyIP_Refresh
 374:          ProxyChecker1.Go()
...
 391:              ProxyIP.Items.Add(One.URL.ToLower.Replace("http://", "").Replace("/", "") & " (" & One.CrDate.ToString("dd.MM.yyyy HH:mm:ss", System.Globalization.CultureInfo.InvariantCulture) & ")")
...

The code above only sets up the environment for running the ProxyChecker class. The code of the class itself is shown below.



This is the most important class in this part of my application, because it contains all the handlers that process the events of the popup form VisualSiteCheckerForm, which contains the web browser.


   1:  Public Class ProxyChecker
   2:      Inherits Wcf_Client
   3:   
   4:      Public Event RefreshProxyList()
   5:   
   6:      Public Property Checked As Boolean
   7:      Public Property Full_ProxyURL As String
   8:      Public Property ResponseEncode As Wcf_Client.PostRequestEncode
   9:      Public Property VisualSiteCheckerForm As Global.Freelancer.VisualSiteChecker1
  10:   
  11:      Public Sub New(ByVal IpAddr As String)
  12:          Full_ProxyURL = IpAddr
  13:          ResponseEncode = Wcf_Client.PostRequestEncode.UTF8
  14:      End Sub
  15:   
  16:      Public Sub Go()
  17:          VisualSiteCheckerForm = New Global.Freelancer.VisualSiteChecker1
  18:          VisualSiteCheckerForm.IsPageCorrectCallBack = AddressOf IsPageOK
  19:          AddHandler VisualSiteCheckerForm.GetHTML, AddressOf ReadHTMLSync
  20:          AddHandler VisualSiteCheckerForm.GetHTMLAsync, AddressOf ReadHTMLASync
  21:          VisualSiteCheckerForm.DeleteProxy = AddressOf DelProxy
  22:          VisualSiteCheckerForm.Title = Full_ProxyURL.ToLower.Replace("http://", "").Replace("/", "")
  23:          VisualSiteCheckerForm.Show()
  24:      End Sub
  25:   
  26:      Public Sub DelProxy()
  27:          Dim db1 = New ParserDBDataContext
  28:          Dim CurProxy = (From X In db1.ProxyTabs Select X Where X.URL = Full_ProxyURL).ToList
  29:          If CurProxy.Count > 0 Then
  30:              db1.ProxyTabs.DeleteOnSubmit(CurProxy(0))
  31:              db1.SubmitChanges()
  32:              RaiseEvent RefreshProxyList()
  33:          End If
  34:      End Sub
  35:   
  36:      Public Sub IsPageOK(ByVal yesno As Boolean)
  37:          Checked = yesno
  38:          If yesno Then
  39:              Dim db1 As New ParserDBDataContext
  40:   
  41:              db1.ProxyTabs.InsertOrUpdateTable(Function(e) e.URL = Full_ProxyURL,
  42:                                                New ProxyTab With {.CrDate = Now, .URL = Full_ProxyURL},
  43:                                                Sub(e) e.CrDate = Now)
  44:              LoadForm.ProxyIP_Refresh()
  45:          End If
  46:      End Sub
  47:   
  48:      Public Sub ReadHTMLSync(ByVal URL As String, ByRef HTML As String)
  49:          HTML = GetRequestStrAsync(URL, ResponseEncode, Full_ProxyURL)
  50:      End Sub
  51:   
  52:   
  53:      Private WithEvents backgroundWorker1 As System.ComponentModel.BackgroundWorker
  54:   
  55:      'starts on the main thread
  56:      Public Sub ReadHTMLASync(ByVal URL As String)
  57:          backgroundWorker1 = New System.ComponentModel.BackgroundWorker
  58:          backgroundWorker1.RunWorkerAsync(URL)
  59:      End Sub
  60:   
  61:      'this part runs on another thread
  62:      Private Sub BackgroundWorker1_DoWork(ByVal sender As System.Object, ByVal e As System.ComponentModel.DoWorkEventArgs) Handles backgroundWorker1.DoWork
  63:          HTML1 = GetRequestStrAsync(e.Argument, ResponseEncode, Full_ProxyURL)
  64:      End Sub
  65:   
  66:      Dim HTML1 As String
  67:   
  68:      'finishes back on the main thread
  69:      Private Sub BackgroundWorker1_RunWorkerCompleted(ByVal sender As System.Object, ByVal e As System.ComponentModel.RunWorkerCompletedEventArgs) Handles backgroundWorker1.RunWorkerCompleted
  70:          VisualSiteCheckerForm.ShowAsyncHtmlResult.Invoke(HTML1)
  71:      End Sub
  72:   
  73:  End Class
  74:   

But to understand how ProxyChecker works, we first need to look at the form VisualSiteCheckerForm with the web browser. The form contains several controls; their names can be read from the screenshot below.



The code of the form VisualSiteCheckerForm is shown below. It uses TheArtOfDev.HtmlRenderer (https://github.com/ArthurHub/HTML-Renderer) to display HTML and maintains the TestURLs table, which stores the URLs used to check a proxy.

Public Class VisualSiteChecker1
    Public Event GetHTML(ByVal URL As String, ByRef HTML As String)
    Public Event GetHTMLAsync(ByVal URL As String)

    Delegate Function GetRequestStrDelegate(ByVal RequestEncoding As Wcf_Client.PostRequestEncode, ByVal URL As String, ByVal ResponseEncoding As Wcf_Client.PostRequestEncode, ByVal Full_ProxyURL As String) As String
    Delegate Sub IsCorrect(ByVal yes_no As Boolean)
    Delegate Sub DelProxy()
    Delegate Sub ShowHtmlResult(ByVal Html As String)

    Public Property Title As String
    Public Property HTML As String = ""
    Public Property IsPageCorrectCallBack As IsCorrect
    Public Property DeleteProxy As DelProxy
    Public Property ShowAsyncHtmlResult As ShowHtmlResult
    Property HtmlPanel As TheArtOfDev.HtmlRenderer.WinForms.HtmlPanel
    Dim db1 As ParserDBDataContext

    Private Sub VisualSiteChecker_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        Me.Text &= " " & Title
        HtmlPanel = New TheArtOfDev.HtmlRenderer.WinForms.HtmlPanel
        HtmlPanel.Dock = DockStyle.Fill
        ToolStripContainer1.ContentPanel.Controls.Add(HtmlPanel)
        ShowAsyncHtmlResult = AddressOf ShowHtmlHandler
        NavigateURL_refresh()
    End Sub

    Private Sub GoButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles GoButton.Click
        If NavigateURL.Text <> "" Then
            HtmlPanel.Refresh()
            HtmlPanel.Text = ""
            Try
                RaiseEvent GetHTML(NavigateURL.Text, HTML)
                LenHtml.Text = Len(HTML).ToString & " chars"
                HtmlPanel.Text = HTML
            Catch ex As Exception
                HtmlPanel.Text = ex.Message
                IsPageCorrectCallBack.Invoke(False)
            End Try
        End If
    End Sub

    Private Sub GoAsyncButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles GoAsyncButton.Click
        If NavigateURL.Text <> "" Then
            HtmlPanel.Refresh()
            HtmlPanel.Text = ""
            Try
                RaiseEvent GetHTMLAsync(NavigateURL.Text)
            Catch ex As Exception
                HtmlPanel.Text = ex.Message
                IsPageCorrectCallBack.Invoke(False)
            End Try
        End If
    End Sub

    Sub ShowHtmlHandler(ByVal Html As String)
        HtmlPanel.Text = Html
        HtmlPanel.Refresh()
    End Sub

    Private Sub OkButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles OkButton.Click
        IsPageCorrectCallBack.Invoke(True)
        Me.Close()
    End Sub

    Private Sub DelButton_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles DelButton.Click
        DeleteProxy.Invoke()
        Me.Close()
    End Sub

    Sub NavigateURL_refresh()
        NavigateURL.Items.Clear()
        db1.GetContext(True)
        Dim X = (From Z In db1.TestURLs Select Z Order By Z.i).ToList
        For Each One As TestURL In X
            NavigateURL.Items.Add(One.URL)
        Next
    End Sub

    Private Sub DeleteURL_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles DeleteURL.Click
        Dim X = (From Z In db1.TestURLs Select Z Where Z.URL = NavigateURL.Text).ToList
        If X.Count > 0 Then
            db1.TestURLs.DeleteOnSubmit(X(0))
            db1.SubmitChanges()
        End If
        NavigateURL_refresh()
    End Sub

    Private Sub AddUrl_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles AddUrl.Click
        db1.TestURLs.InsertOnSubmit(New TestURL With {.URL = NavigateURL.Text})
        db1.SubmitChanges()
        NavigateURL_refresh()
    End Sub
End Class

This code also invokes four external delegates and raises two events. The synchronous GetHTML event makes little practical sense, however: the whole application (including the main form) freezes for a long time while it waits for the proxy server to answer. It is really only a test method; the only event that is actually usable in this application is GetHTMLAsync. The diagram below lists the external connections of the form VisualSiteCheckerForm: four delegates and two events (plus one input field, the form title).



Now let us return to the ProxyChecker class above. It contains only the handlers that process the events of the form VisualSiteCheckerForm, but the connection is not a simple direct one: all the code of ProxyChecker runs on the same thread as the form, except lines 62-63, which run on another thread.

To understand the application better, we need to look deeper into ProxyChecker. It inherits from my big old class Wcf_Client (a web service client), which I wrote before 2010. I am not publishing all of that code here; we will look only at the main fragments of Wcf_Client that relate to this application.
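The thread hand-off described here can be reproduced in isolation. The sketch below is not the application's code: it is a console stand-in in which the download is replaced by a placeholder string, and the names WorkerSketch and done are assumptions. It shows the same BackgroundWorker pattern, but passes the result through e.Result instead of a shared field such as HTML1.

```vbnet
Imports System
Imports System.ComponentModel
Imports System.Threading

Module WorkerSketch
    Sub Main()
        Dim done As New ManualResetEvent(False)
        Dim worker As New BackgroundWorker()

        AddHandler worker.DoWork,
            Sub(sender As Object, e As DoWorkEventArgs)
                ' Runs on a thread-pool thread, like lines 62-63 of ProxyChecker.
                Dim url = CStr(e.Argument)
                ' Placeholder for the real download through the proxy.
                e.Result = "<html>fetched " & url & "</html>"
            End Sub

        AddHandler worker.RunWorkerCompleted,
            Sub(sender As Object, e As RunWorkerCompletedEventArgs)
                ' In a WinForms application this handler is raised on the UI thread;
                ' in this console sketch there is no UI thread to return to.
                If e.Error IsNot Nothing Then
                    Console.WriteLine("Failed: " & e.Error.Message)
                Else
                    Console.WriteLine(CStr(e.Result))
                End If
                done.Set()
            End Sub

        worker.RunWorkerAsync("http://example.com/")
        done.WaitOne() ' keep the console process alive until the worker finishes
    End Sub
End Module
```

Using e.Result keeps each request's data with its own worker, and e.Error exposes any exception thrown inside DoWork, so the completion handler can report a failed proxy instead of showing stale HTML.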



So, this is the code of that class that relates to this application. In my experience this is about the most difficult kind of asynchronous code there is, but it works!


    Public allDone As Threading.ManualResetEvent
    Dim BUFFER_SIZE As Integer = 1000000

    Public Overridable Function GetRequestStrAsync(ByVal URL As String, Optional ByVal ResponseEncoding As PostRequestEncode = PostRequestEncode.ASCII, Optional ByVal Full_ProxyURL As String = "") As String

        allDone = New Threading.ManualResetEvent(False)

        ' Observed error here: System.NotSupportedException, "The URI prefix is not recognized."
        Dim Request As Net.HttpWebRequest = Net.HttpWebRequest.Create(URL)
        Request.UserAgent = UserAgent
        Request.Method = "GET"
        If Full_ProxyURL <> "" Then
            Dim MyProxy As New Net.WebProxy
            MyProxy.Address = New Uri(Full_ProxyURL)
            Request.Proxy = MyProxy
        End If
        Dim RS As RequestState = New RequestState(BUFFER_SIZE, ResponseEncoding)
        ' Put the request into the state so it can be passed around.
        RS.Request = Request

        ' Issue the async request.
        Dim r As IAsyncResult = CType(Request.BeginGetResponse(
                New AsyncCallback(AddressOf RespCallback), RS), IAsyncResult)

        ' Block the calling thread until one of the callbacks sets the
        ' ManualResetEvent, i.e. until the response is read or an error occurs.
        allDone.WaitOne()

        Return RS.ErrorMessage & RS.StringBuilder.ToString
    End Function

    Sub RespCallback(ByVal ar As IAsyncResult)
        ' Get the RequestState object from the async result.
        Dim rs As RequestState = CType(ar.AsyncState, RequestState)
        Try
            ' Get the HttpWebRequest from RequestState.
            Dim req As Net.HttpWebRequest = rs.Request

            ' Call EndGetResponse, which returns the HttpWebResponse object
            ' that came from the request issued above.
            Dim resp As Net.HttpWebResponse = CType(req.EndGetResponse(ar), Net.HttpWebResponse)

            ' Start reading data from the response stream.
            ' Observed error here: "The remote server returned an error: (407) Proxy Authentication Required."
            Dim ResponseStream As IO.Stream = resp.GetResponseStream()

            ' Store the response stream in RequestState to read
            ' the stream asynchronously.
            rs.ResponseStream = ResponseStream

            ' Pass rs.BufferRead to BeginRead. Read data into rs.BufferRead.
            Dim iarRead As IAsyncResult =
               ResponseStream.BeginRead(rs.BufferRead, 0, BUFFER_SIZE,
               New AsyncCallback(AddressOf ReadCallBack), rs)
        Catch ex As Exception
            rs.ErrorMessage = ex.Message
            allDone.Set()
        End Try

    End Sub

    Sub ReadCallBack(ByVal asyncResult As IAsyncResult)
        ' Get the RequestState object from the AsyncResult.
        Dim rs As RequestState = CType(asyncResult.AsyncState, RequestState)

        ' Retrieve the ResponseStream that was set in RespCallback.
        Dim responseStream As IO.Stream = rs.ResponseStream

        ' Complete the pending read; read is the number of bytes received.
        Dim read As Integer
        Try
            read = responseStream.EndRead(asyncResult)
        Catch ex As Exception
            ' Record the error and release the waiting caller; otherwise
            ' allDone.WaitOne() in GetRequestStrAsync would block forever.
            rs.ErrorMessage = ex.Message
            allDone.Set()
            Return
        End Try

        If read > 0 Then
            ' Prepare a Char array buffer for converting to Unicode.
            Dim charBuffer(rs.BufferRead.Count) As Char

            ' Convert byte stream to Char array and then String.
            ' len contains the number of characters converted to Unicode.
            Dim len As Integer = _
              rs.StreamDecode.GetChars(rs.BufferRead, 0, read, charBuffer, 0)
            Dim str As String = New String(charBuffer, 0, len)

            ' Append the recently read data to the StringBuilder
            ' object contained in RequestState.
            rs.StringBuilder.Append(str)

            ' Continue reading data until responseStream.EndRead
            ' returns 0 (end of stream).
            Dim ar As IAsyncResult = _
               responseStream.BeginRead(rs.BufferRead, 0, BUFFER_SIZE, _
               New AsyncCallback(AddressOf ReadCallBack), rs)
        Else

            ' Close down the response stream.
            responseStream.Close()

            ' Set the ManualResetEvent so the waiting thread can continue.
            allDone.Set()
        End If

        Return
    End Sub

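The RequestState class used by these callbacks is not shown on this page. Judging only by the members the code touches (Request, ResponseStream, BufferRead, StreamDecode, StringBuilder, ErrorMessage and a two-argument constructor), it might look roughly like the sketch below. Everything here is a guess made for illustration, including the stand-in PostRequestEncode enum, which lists only the two values that appear in the article.

```vbnet
Imports System
Imports System.IO
Imports System.Net

' Hypothetical stand-in for Wcf_Client.PostRequestEncode.
Public Enum PostRequestEncode
    ASCII
    UTF8
End Enum

' Assumed shape of the per-request state object; not the author's code.
Public Class RequestState
    Public Property Request As HttpWebRequest
    Public Property ResponseStream As Stream
    Public Property BufferRead As Byte()
    Public Property StreamDecode As System.Text.Decoder
    Public Property StringBuilder As New System.Text.StringBuilder()
    Public Property ErrorMessage As String = ""

    Public Sub New(ByVal bufferSize As Integer, ByVal responseEncoding As PostRequestEncode)
        BufferRead = New Byte(bufferSize - 1) {}
        Dim enc As System.Text.Encoding =
            If(responseEncoding = PostRequestEncode.UTF8, System.Text.Encoding.UTF8, System.Text.Encoding.ASCII)
        ' A Decoder (rather than Encoding.GetString) keeps partial characters
        ' between calls, which matters when a read ends in the middle of one.
        StreamDecode = enc.GetDecoder()
    End Sub
End Class

Module RequestStateDemo
    Sub Main()
        Dim rs As New RequestState(16, PostRequestEncode.UTF8)
        Dim bytes = System.Text.Encoding.UTF8.GetBytes("é") ' two bytes in UTF-8
        Dim chars(15) As Char
        ' Feed the two bytes in separate calls, as two network reads might.
        Dim n1 = rs.StreamDecode.GetChars(bytes, 0, 1, chars, 0)
        Dim n2 = rs.StreamDecode.GetChars(bytes, 1, 1, chars, n1)
        rs.StringBuilder.Append(chars, 0, n1 + n2)
        Console.WriteLine(rs.StringBuilder.ToString())
    End Sub
End Module
```

Keeping the decoder inside the state object is what allows ReadCallBack to be called many times for one response without corrupting multi-byte characters at chunk boundaries.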

Finally, I show the last small fragment of this application, ProxyReader (which also inherits from Wcf_Client). As you may remember, good proxy IP:PORT pairs are collected in the ProxyTabs table. This small class (in reality it is big; it contains authentication and a lot of other behavior) reads ProxyTabs and lets the next request go through another good proxy.


Public Class ProxyReader
    Inherits Wcf_Client

    Property db1 As ParserDBDataContext
    Property LastProxyCount As Integer
    Property CurrentIndex As Integer
    Property LastProxy As System.Collections.Generic.List(Of ProxyTab)
    Property ReadingErrorCount As Integer

    Public Sub New(ByVal _LastProxyCount As Integer)
        LastProxyCount = _LastProxyCount
        db1 = New ParserDBDataContext
        ' Take the most recently confirmed proxies.
        LastProxy = (From X In db1.ProxyTabs Select X Order By X.CrDate Descending Take _LastProxyCount).ToList
    End Sub

    ...

    Function GetRequestStrThruProxy(ByVal URL As String, Optional ByVal ResponseEncoding As PostRequestEncode = PostRequestEncode.UTF8) As String
        Dim HTML As String
        Try
StartRead:
            HTML = GetRequestStrAsync(URL, ResponseEncoding, LastProxy(CurrentIndex).URL)
            Return HTML
        Catch ex As Exception
            ' Switch to the next proxy and retry until every proxy has failed once.
            ReadingErrorCount += 1
            If ReadingErrorCount < LastProxy.Count Then
                GetNextProxy()
                GoTo StartRead
            Else
                Return Nothing
            End If
        End Try

    End Function

    Sub GetNextProxy()
        ' Wrap around after the last element; comparing against Count itself
        ' would let CurrentIndex run one past the end of the list.
        If CurrentIndex < LastProxy.Count - 1 Then
            CurrentIndex += 1
        Else
            CurrentIndex = 0
        End If
    End Sub

End Class
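The wrap-around in GetNextProxy can also be written with the Mod operator. The sketch below is a standalone illustration with a plain list of strings instead of the ProxyTabs table; the class name ProxyRotator and the sample addresses are assumptions and are not part of the application.

```vbnet
Imports System
Imports System.Collections.Generic

' Hypothetical round-robin helper over a fixed list of proxy URLs.
Public Class ProxyRotator
    Private ReadOnly _proxies As List(Of String)
    Private _index As Integer = 0

    Public Sub New(ByVal proxies As IEnumerable(Of String))
        _proxies = New List(Of String)(proxies)
        If _proxies.Count = 0 Then Throw New ArgumentException("At least one proxy is required.")
    End Sub

    Public ReadOnly Property Current As String
        Get
            Return _proxies(_index)
        End Get
    End Property

    ' Advance to the next proxy; after the last one, start again from the first.
    Public Sub MoveNext()
        _index = (_index + 1) Mod _proxies.Count
    End Sub
End Class

Module RotatorDemo
    Sub Main()
        Dim rotator As New ProxyRotator({"http://10.0.0.1:8080/", "http://10.0.0.2:3128/"})
        For i = 1 To 3
            Console.WriteLine(rotator.Current)
            rotator.MoveNext()
        Next
        ' Prints the first address, the second, then the first again.
    End Sub
End Module
```

With Mod the index can never leave the range 0 to Count - 1, so no separate bounds check is needed.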

That's it! On this page you have seen fragments of the source code of my real application!





Link to this page: http://www.vb-net.com/AsyncWebRequest/index.htm