Overview
Being able to save a Web page in MHT format is useful. Outlook offers the same thing: if you have an email that you would like to keep, you can do a Save As and select MHT as the output format, which puts everything into one file. The same goes for a Web page. A regular save in your browser leaves you with a lot of files in a newly created directory: the images and other files needed to display the HTML page. Saving the page in MHT format gives you everything in a single file.
A simple scenario
I am often asked to build robot applications that run many processes in the background. Some of them need to retrieve specific HTML pages from a specific Web site. We can save those pages as MHT files and store them in specific directories, where related applications use them for their own purposes.
The best approach I have found so far is to use the CDO.Message class. In its simplest form, we call the CreateMHTMLBody() method to retrieve the page, get a reference to its stream object and save that stream to a file.
Retrieving the page
In this article, I use an MHTML class that needs two namespaces imported, as follows:
Imports CDO
Imports ADODB

Public Class MHTML
    Public cMHTMLUrl As String = ""
    Public cSaveFile As String = ""

    ' CDO message object
    Private oMessage As CDO.Message = New CDO.Message
End Class
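The GetMHTML() method, added inside the class, retrieves the page and saves it to the file:

' Get the MHTML file
Public Function GetMHTML() As Boolean
    Dim loStream As ADODB.Stream

    Try
        ' Go get the page
        oMessage.CreateMHTMLBody(cMHTMLUrl, CDO.CdoMHTMLFlags.cdoSuppressNone, "", "")
        loStream = oMessage.GetStream()

        ' Save to file
        loStream.SaveToFile(cSaveFile, SaveOptionsEnum.adSaveCreateOverWrite)
        loStream.Close()
    Catch loError As Exception
        ' Report the failure so that the caller's check on the return value works
        Return False
    End Try

    Return True
End Function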
The class is then used like this:

Dim loMHTML As MHTML = New MHTML()
loMHTML.cMHTMLUrl = "http://www.universalthread.com"

' Save the file into a directory
loMHTML.cSaveFile = "d:\test.mht"

' Call the GetMHTML() method to retrieve the HTML page and save it into cSaveFile
If Not loMHTML.GetMHTML() Then
    MessageBox.Show("Unable to get the page")
End If
To retrieve a specific page instead, set the URL accordingly:

loMHTML.cMHTMLUrl = "http://www.universalthread.com/Default.aspx"
Retrieving a page that requires a cookie
Some pages require a cookie. Typically, a user would have entered a username and a password to reach such a page. Here, however, everything is done from a robot application, so nobody is at the keyboard to open the login page and enter the username and the password.
So, we need a mechanism to support that. First, we need to know the name of the cookie that the site uses. Then, we call the login page before calling the page we want to retrieve. Finally, once the login is done, we associate the cookie with our CDO.Message object so that it is passed along when we retrieve the page.
The first thing to do is to add two more properties at the class level: one for the URL of the login page and one for the name of the cookie we need to work with:
Public cCookie As String = ""
Public cLoginUrl As String = ""
Then, we need a way to collect the HTML form variables required for the authentication, usually the username and the password. So, we define a collection at the class level and a method that adds the required form variables to it:
' Collection to hold the form fields to be used when there is a login page
Private oFormField As Collection = New Collection

' Add a form field for the post that can be used to do the login
' expC1 Name
' expC2 Value
Public Function AddFormField(ByVal tcName As String, ByVal tcValue As String) As Boolean
    Dim loFormField(2) As Object
    loFormField(1) = tcName
    loFormField(2) = tcValue
    oFormField.Add(loFormField)
    Return True
End Function
The form fields are then added like this:

loMHTML.AddFormField("Username", "MyUsername")
loMHTML.AddFormField("Password", "MyPassword")
The login itself is the most complicated part. After a lot of R & D, I built the Login() method as follows:
' Do the login
Public Function Login() As Boolean
    Dim lcCookie As String = ""
    Dim lcPostData As String = ""
    Dim llSuccess As Boolean = False
    Dim lnCounter As Integer = 0
    Dim loASCIIEncoding As System.Text.Encoding = New System.Text.ASCIIEncoding
    Dim loByte() As Byte
    Dim loCookieCollection As System.Net.CookieCollection
    Dim loCookieContainer As System.Net.CookieContainer = New System.Net.CookieContainer
    Dim loFormField(2) As Object
    Dim loStreamWebRequest As IO.Stream
    Dim loWebRequest As System.Net.HttpWebRequest
    Dim loWebResponse As System.Net.HttpWebResponse

    ' Get the post data
    For Each loFormField In oFormField
        If lcPostData.Length > 0 Then
            lcPostData = lcPostData & "&"
        End If
        lcPostData = lcPostData & CStr(loFormField(1)) & "=" & CStr(loFormField(2))
    Next
    loByte = loASCIIEncoding.GetBytes(lcPostData)

    ' Prepare Web request; the cookie container is attached up front
    ' so that it is in place before any data is sent
    loWebRequest = CType(System.Net.WebRequest.Create(cLoginUrl), System.Net.HttpWebRequest)
    loWebRequest.Method = "POST"
    loWebRequest.ContentType = "application/x-www-form-urlencoded"
    loWebRequest.ContentLength = loByte.Length
    loWebRequest.CookieContainer = loCookieContainer

    ' Send the data
    loStreamWebRequest = loWebRequest.GetRequestStream()
    loStreamWebRequest.Write(loByte, 0, loByte.Length)
    loStreamWebRequest.Close()

    ' Get the cookie
    loWebResponse = CType(loWebRequest.GetResponse(), System.Net.HttpWebResponse)
    loCookieCollection = loCookieContainer.GetCookies(loWebRequest.RequestUri)
    loWebResponse.Close()
    For lnCounter = 0 To loCookieCollection.Count - 1
        lcCookie = loCookieCollection.Item(lnCounter).Name
        If lcCookie = cCookie Then
            lcCookie = loCookieCollection.Item(lnCounter).Value

            ' Define the cookie
            oMessage.Configuration.Fields.Item(CDO.CdoConfiguration.cdoHTTPCookies).Value = _
                cCookie + "=" + lcCookie
            oMessage.Configuration.Fields.Update()
            llSuccess = True
            Exit For
        End If
    Next

    ' True only when the expected cookie was found
    Return llSuccess
End Function
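Login() concatenates the form field names and values as they are, so a value containing a character such as & or = would corrupt the post data. The following is a minimal sketch of one way to guard against that, assuming each name and value is escaped with Uri.EscapeDataString before being joined. The PostDataSketch module and the BuildPostData helper are hypothetical names used for illustration; they are not part of the class described in this article.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text

Public Module PostDataSketch

    ' Hypothetical helper (not part of the MHTML class): builds an
    ' application/x-www-form-urlencoded body from name/value pairs,
    ' escaping each name and value so that reserved characters survive the POST.
    Public Function BuildPostData(ByVal toFields As IEnumerable(Of KeyValuePair(Of String, String))) As String
        Dim loBuilder As New StringBuilder()

        For Each loField As KeyValuePair(Of String, String) In toFields
            If loBuilder.Length > 0 Then
                loBuilder.Append("&")
            End If
            loBuilder.Append(Uri.EscapeDataString(loField.Key))
            loBuilder.Append("=")
            loBuilder.Append(Uri.EscapeDataString(loField.Value))
        Next

        Return loBuilder.ToString()
    End Function

    Public Sub Main()
        Dim loFields As New List(Of KeyValuePair(Of String, String))
        loFields.Add(New KeyValuePair(Of String, String)("Username", "MyUsername"))
        loFields.Add(New KeyValuePair(Of String, String)("Password", "p&ss=word"))

        ' Prints: Username=MyUsername&Password=p%26ss%3Dword
        Console.WriteLine(BuildPostData(loFields))
    End Sub

End Module
```

Inside Login(), the same escaping could be applied to each name and value inside the For Each loop, leaving the rest of the method unchanged.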
So, when a login is required to access a specific page, we first call the login page and then the HTML page whose content we want to retrieve:
Dim loMHTML As MHTML = New MHTML()
loMHTML.AddFormField("Username", "MyUsername")
loMHTML.AddFormField("Password", "MyPassword")
loMHTML.cLoginUrl = "http://www.mywebsite.com/Login.aspx"
loMHTML.cCookie = "TheCookieContainingWhatIsNeededToAccessThePage"
If Not loMHTML.Login() Then
    MessageBox.Show("Unable to log in")
    Exit Sub
End If

' The URL to be returned as MHTML
loMHTML.cMHTMLUrl = "http://www.mywebsite.com/Account.aspx"

' Save the file into a directory
loMHTML.cSaveFile = "d:\test.mht"

' Save to file
If Not loMHTML.GetMHTML() Then
    MessageBox.Show("Unable to get the page")
    Exit Sub
End If
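The Login() method forwards a single named cookie. Some sites may rely on more than one cookie, in which case the whole set could be forwarded. The sketch below shows how CookieContainer.GetCookieHeader() returns every cookie that applies to a URL as one "name=value; name=value" string. The CookieHeaderSketch module, the GetAllCookies helper and the cookie names are hypothetical. Whether the cdoHTTPCookies field accepts several cookies in that format is an assumption to verify against the target site.

```vbnet
Imports System
Imports System.Net

Public Module CookieHeaderSketch

    ' Hypothetical helper (not part of the MHTML class): returns every
    ' cookie held by the container that applies to tcUrl, formatted as a
    ' single cookie header string.
    Public Function GetAllCookies(ByVal toContainer As CookieContainer, ByVal tcUrl As String) As String
        Return toContainer.GetCookieHeader(New Uri(tcUrl))
    End Function

    Public Sub Main()
        Dim loContainer As New CookieContainer()
        Dim loUri As New Uri("http://www.mywebsite.com/")

        ' Stand-ins for the cookies that a real login response would set
        loContainer.Add(loUri, New Cookie("SessionId", "abc123"))
        loContainer.Add(loUri, New Cookie("Theme", "dark"))

        ' Shows the combined header for the page to retrieve
        Console.WriteLine(GetAllCookies(loContainer, "http://www.mywebsite.com/Account.aspx"))
    End Sub

End Module
```

In Login(), the returned string would take the place of the single cCookie + "=" + lcCookie value, under the assumption stated above.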
More on enhancing this class
This class can easily be enhanced to support basic authentication for HTML pages that require it.
We first add two properties to store the username and the password, defined as follows:
Public cPassword As String = ""
Public cUsername As String = ""
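Then, GetMHTML() passes them to CreateMHTMLBody():

' Get the MHTML file
Public Function GetMHTML() As Boolean
    Dim loStream As ADODB.Stream

    Try
        ' Go get the page
        oMessage.CreateMHTMLBody(cMHTMLUrl, _
            CDO.CdoMHTMLFlags.cdoSuppressNone, cUsername, cPassword)
        loStream = oMessage.GetStream()

        ' Save to file
        loStream.SaveToFile(cSaveFile, SaveOptionsEnum.adSaveCreateOverWrite)
        loStream.Close()
    Catch loError As Exception
        ' Report the failure so that the caller's check on the return value works
        Return False
    End Try

    Return True
End Function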
Conclusion
As you can see, this is pretty straightforward. As I said, the cookie approach is the most complicated part of all this. But once it is done, you do not have to worry about it, as it is encapsulated in the class. I was still surprised, however, to see that not many developers have used the CDO.Message class like this, that is, to retrieve an HTML page that requires a cookie.