Using adapter pattern to parse HTML with C# and AgilityPack

Recently I faced a business requirement: extract information from some HTML pages and display it in a local application.

The main problem was that the results came back as HTML, and I needed to transform them into a C# object in order to manage the information in my application.

An adapter seemed a good fit for this purpose, so I started implementing the pattern in C#.

Adapter Pattern

There is tons of documentation about the adapter pattern in C#, so I won't bore you with concepts we already know; I'll only share an image from the MSDN site:

[Image: adapter pattern diagram]

Starting from different sources, such as HTML pages, I need an adapter that gives me an object I know. I will have HTML pages with different structures and therefore different adapters, but they should all give me the same object.

Now that these concepts are settled, I can proceed with the contract definitions.

Contracts

I define the properties of the common object with a specific interface:

// Public so that the public Company.Merge(ICompany) method shown later compiles.
public interface ICompany
{
    string Name { get; }
    string VatNumber { get; }
    string Email { get; }
    string TaxPayerCode { get; }
}

These are the properties that my adapter will have to fill in after parsing the HTML. The second contract I need is the adapter contract:

using System.Threading.Tasks;

public interface ICompanyAdapter
{
    // Loads and parses the page for the given search key.
    Task FindAsync(string key);
}

Now I can implement my adapter, which deals with parsing the HTML code; Html Agility Pack helps me with this work:


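using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading.Tasks;
using HtmlAgilityPack;

public class CompanyAdapter : ICompany, ICompanyAdapter
{
    private const string Uri = "http://urladdress";

    // HttpClient is meant to be reused, so a single instance is shared.
    private static readonly HttpClient _httpClient = new HttpClient();
    private List<HtmlNode> _nodes = new List<HtmlNode>();

    // Each property is evaluated on access, reading from the parsed nodes.
    public string Name => ExtractData(_nodes, "company");
    public string VatNumber => ExtractData(_nodes, "vat number");
    public string Email => ExtractData(_nodes, "email");
    public string TaxPayerCode => ExtractData(_nodes, "tax payer code");

    public async Task FindAsync(string key)
    {
        var html = await Load(Uri + "/searchCompanies", key);
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        Parse(doc);
    }

    // Load is not shown in the original listing. Assumption: the search
    // endpoint accepts the key as a query-string parameter named "key".
    private static Task<string> Load(string url, string key)
    {
        return _httpClient.GetStringAsync(url + "?key=" + System.Uri.EscapeDataString(key));
    }

    private void Parse(HtmlDocument doc)
    {
        // SelectSingleNode returns null when the page has no <body>.
        var body = doc.DocumentNode.SelectSingleNode("//body");
        _nodes = body?.Descendants("div").ToList() ?? new List<HtmlNode>();
    }

    private string ExtractData(List<HtmlNode> nodes, string tag)
    {
        foreach (var node in nodes)
        {
            var p = node.Descendants("p").ToList();

            // some custom logic to extract data
        }

        return "";
    }
}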

The class implements the two interfaces defined above, parses the HTML code and stores the nodes in a private field.

The public properties of the company interface then use an ExtractData method to lazily retrieve the information from the list of nodes.
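The custom logic inside ExtractData depends entirely on the markup of each page, which is not shown here. As a minimal sketch, assume every div holds a label paragraph followed by a value paragraph, for example `<div><p>Email</p><p>info@example.com</p></div>`. Both that layout and the helper name LabelValueExtractor are assumptions made for illustration only:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical helper, not part of the original adapter.
public static class LabelValueExtractor
{
    // Assumed layout: <div><p>label</p><p>value</p></div>.
    // Returns the value whose label matches (case-insensitive), or "" if none does.
    public static string ExtractData(IEnumerable<HtmlNode> nodes, string label)
    {
        foreach (var node in nodes)
        {
            var p = node.Descendants("p").ToList();
            if (p.Count < 2)
            {
                continue; // not a label/value pair
            }

            // Decode HTML entities and trim whitespace before comparing.
            var text = HtmlEntity.DeEntitize(p[0].InnerText).Trim();
            if (string.Equals(text, label, StringComparison.OrdinalIgnoreCase))
            {
                return HtmlEntity.DeEntitize(p[1].InnerText).Trim();
            }
        }

        return "";
    }
}
```

A real page will almost certainly need different selectors, but the shape stays the same: find the node that carries the label, then return the text next to it.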

Merge results

I can have many of these adapters to call from a service, and at this stage every adapter returns the data as defined in the contracts.

So I need a merge strategy to combine the results of the adapters into a single object.

I have a Company class that implements the ICompany interface, with a Merge method that deals with this work:


public class Company : ICompany
{
    // Private setters are needed so that Merge can assign the values.
    public string Name { get; private set; }
    public string VatNumber { get; private set; }
    public string Email { get; private set; }
    public string TaxPayerCode { get; private set; }

    // Fills only the properties that are still empty: the first non-empty value wins.
    public void Merge(ICompany company)
    {
        Name = string.IsNullOrEmpty(Name) ? company.Name : Name;
        VatNumber = string.IsNullOrEmpty(VatNumber) ? company.VatNumber : VatNumber;
        Email = string.IsNullOrEmpty(Email) ? company.Email : Email;
        TaxPayerCode = string.IsNullOrEmpty(TaxPayerCode) ? company.TaxPayerCode : TaxPayerCode;
    }
}
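To make the precedence of this strategy concrete, here is a small self-contained sketch. StubCompany and MergeDemo are hypothetical names that stand in for real adapter results, and the setters on Company are private so that Merge can assign them:

```csharp
public interface ICompany
{
    string Name { get; }
    string VatNumber { get; }
    string Email { get; }
    string TaxPayerCode { get; }
}

// Hypothetical stand-in for the result of an adapter.
public class StubCompany : ICompany
{
    public string Name { get; set; }
    public string VatNumber { get; set; }
    public string Email { get; set; }
    public string TaxPayerCode { get; set; }
}

public class Company : ICompany
{
    public string Name { get; private set; }
    public string VatNumber { get; private set; }
    public string Email { get; private set; }
    public string TaxPayerCode { get; private set; }

    // A property is overwritten only while it is still null or empty.
    public void Merge(ICompany company)
    {
        Name = string.IsNullOrEmpty(Name) ? company.Name : Name;
        VatNumber = string.IsNullOrEmpty(VatNumber) ? company.VatNumber : VatNumber;
        Email = string.IsNullOrEmpty(Email) ? company.Email : Email;
        TaxPayerCode = string.IsNullOrEmpty(TaxPayerCode) ? company.TaxPayerCode : TaxPayerCode;
    }
}

public static class MergeDemo
{
    public static Company Run()
    {
        var first = new StubCompany { Name = "Acme", Email = "" };
        var second = new StubCompany { Name = "Acme Ltd", Email = "info@acme.example", VatNumber = "123" };

        var company = new Company();
        company.Merge(first);  // sets Name; Email stays empty
        company.Merge(second); // fills Email and VatNumber; Name is kept
        return company;
    }
}
```

The order of the Merge calls therefore matters: the adapter merged first has priority, and later adapters only fill the gaps it left.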

Now the last step is to invoke the adapters in the service:


public class CompanyService
{
    public async Task<Company> FindAsync(string key)
    {
        var adapter1 = new CompanyAdapter1();
        var adapter2 = new CompanyAdapter2();
        await adapter1.FindAsync(key);
        await adapter2.FindAsync(key);

        var company = new Company();
        company.Merge(adapter1);
        company.Merge(adapter2);

        return company;
    }
}

I can call as many adapters as I want and merge the results with a specific strategy.
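One possible way to handle an arbitrary number of adapters is to pass them to the service as a list and run the lookups concurrently. This is only a sketch under assumptions: ICompanySource (an interface that combines the two contracts) and FixedSource (a stub used in place of a real adapter) are hypothetical names that do not appear in the code above.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public interface ICompany
{
    string Name { get; }
    string VatNumber { get; }
    string Email { get; }
    string TaxPayerCode { get; }
}

public interface ICompanyAdapter
{
    Task FindAsync(string key);
}

// Hypothetical: one interface for objects that are both adapter and result.
public interface ICompanySource : ICompany, ICompanyAdapter { }

// Hypothetical stub that returns fixed values instead of parsing a page.
public class FixedSource : ICompanySource
{
    public FixedSource(string name, string email) { Name = name; Email = email; }
    public string Name { get; }
    public string VatNumber => null;
    public string Email { get; }
    public string TaxPayerCode => null;
    public Task FindAsync(string key) => Task.CompletedTask;
}

public class Company : ICompany
{
    public string Name { get; private set; }
    public string VatNumber { get; private set; }
    public string Email { get; private set; }
    public string TaxPayerCode { get; private set; }

    public void Merge(ICompany company)
    {
        Name = string.IsNullOrEmpty(Name) ? company.Name : Name;
        VatNumber = string.IsNullOrEmpty(VatNumber) ? company.VatNumber : VatNumber;
        Email = string.IsNullOrEmpty(Email) ? company.Email : Email;
        TaxPayerCode = string.IsNullOrEmpty(TaxPayerCode) ? company.TaxPayerCode : TaxPayerCode;
    }
}

public class CompanyService
{
    private readonly IReadOnlyList<ICompanySource> _sources;

    public CompanyService(IEnumerable<ICompanySource> sources)
    {
        _sources = sources.ToList();
    }

    public async Task<Company> FindAsync(string key)
    {
        // Start every lookup, then wait for all of them to finish.
        await Task.WhenAll(_sources.Select(s => s.FindAsync(key)));

        // Merge in list order, so earlier sources take precedence.
        var company = new Company();
        foreach (var source in _sources)
        {
            company.Merge(source);
        }

        return company;
    }
}
```

With this shape, adding a new page only means writing another adapter and appending it to the list; the position in the list decides its priority in the merge.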

This is possible thanks to the common contracts for the adapters, which return results in a form known to the caller.

 
