
How to Parse Invalid XHTML with HTMLAgilityPack C#

Learn to use HTMLAgilityPack C# for parsing invalid XHTML. Basic steps include installing via NuGet, loading malformed HTML with LoadHtml, querying via XPath or LINQ, and handling parse errors with code examples for web scraping.


How do I use the HTML Agility Pack in C# to parse an invalid XHTML document? What are the basic steps and code examples for implementing HTML parsing with this library?

HTMLAgilityPack in C# handles invalid XHTML like a champ, automatically fixing unclosed tags, bad nesting, and other real-world messes that choke strict XML parsers. Start by installing the NuGet package, load your document with HtmlDocument.LoadHtml(), and query nodes using XPath or LINQ—no more crashes on sloppy HTML. This makes HTML Agility Pack C# the go-to for web scraping or data extraction from imperfect sources.




What is HTMLAgilityPack and Why Use It for Invalid XHTML?

Ever scraped a webpage only to hit a wall because some tag forgot to close? That’s invalid XHTML—or just plain messy HTML from the wild web. HTMLAgilityPack steps in as a tolerant C# parser, built on the .NET framework to read, manipulate, and fix broken markup without throwing exceptions.

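Unlike XmlDocument, which rejects anything that is not well-formed XML, HTMLAgilityPack is tolerant in the way browsers are: it builds a usable tree from input such as <p>Text<div>Nested wrong</div></p> instead of failing. It does not implement the HTML5 parsing algorithm, though, so the tree it builds can differ from what a browser would produce. The package has well over a million downloads on NuGet, and the official GitHub repo shows why: it’s battle-tested for scraping forums, emails, or legacy sites.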

Why pick it? Speed, simplicity, and LINQ/XPath support. You get a navigable DOM tree fast. But heads up: it’s not a full browser engine, so no JavaScript execution. Perfect for static parsing, though.
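
To make the contrast with XmlDocument concrete, here is a minimal sketch that feeds the same broken markup to both parsers. The sample string and the class name are made up for illustration; the point is only that the strict parser throws while HAP returns a tree.

```csharp
using System;
using System.Xml;
using HtmlAgilityPack;

class StrictVsTolerant
{
    static void Main()
    {
        // Hypothetical sample: <p> and <br> are never closed, so this is not well-formed XML.
        string markup = "<html><body><p>First line<br>Second line</body></html>";

        try
        {
            var xml = new XmlDocument();
            xml.LoadXml(markup); // strict parser: throws on the mismatched end tag
            Console.WriteLine("XmlDocument parsed it");
        }
        catch (XmlException ex)
        {
            Console.WriteLine("XmlDocument failed: " + ex.Message);
        }

        var doc = new HtmlDocument();
        doc.LoadHtml(markup); // tolerant parser: no exception
        bool foundParagraph = doc.DocumentNode.SelectSingleNode("//p") != null;
        Console.WriteLine("HAP found <p>: " + foundParagraph);
    }
}
```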


Installing HTMLAgilityPack C# via NuGet

Getting HTMLAgilityPack C# up and running takes seconds in Visual Studio. Fire up the Package Manager Console (Tools > NuGet Package Manager > Package Manager Console) and run:

Install-Package HtmlAgilityPack

Or via .NET CLI:

dotnet add package HtmlAgilityPack

This pulls the latest stable version (around 1.11.x as of now). For .NET Framework or Core, it works seamlessly. Add the using statement at the top of your file:

csharp
using HtmlAgilityPack;

Done. No native dependencies, pure managed code. If you’re in a MAUI or Blazor project, it slots right in too.

Test it quick: Create a console app, load some junk HTML, and see it parse without a hitch. That’s the beauty—no config hell.


Basic Steps to Parse Invalid XHTML with HTML Agility Pack C#

Parsing invalid XHTML with HTML Agility Pack boils down to three steps: load, query, extract. Here’s the skeleton for any project.

  1. Instantiate HtmlDocument: var doc = new HtmlDocument();
  2. Load your content: doc.LoadHtml(invalidXhtmlString); (or from file/URL).
  3. Navigate: Use doc.DocumentNode.SelectNodes("//div") or LINQ.

Full starter example parsing a broken snippet:

csharp
using HtmlAgilityPack;
using System;
using System.Linq;

class Program
{
    static void Main()
    {
        string invalidXhtml = @"<html><body><p>Unclosed <b>tag here</body></html>"; // Malformed!
        var doc = new HtmlDocument();
        doc.LoadHtml(invalidXhtml);

        var boldNodes = doc.DocumentNode.Descendants("b");
        foreach (var node in boldNodes)
        {
            Console.WriteLine(node.InnerText); // Outputs: "tag here"
        }
    }
}

See? It fixed the unclosed <b> on the fly. The official site walks through this flow—load first, tweak options if needed, then dive in.

This handles nesting horrors browsers shrug off. Run it; you’ll smile at how painless it is.


Loading Documents: From String, File, or Web

Flexibility is key. Load from wherever your invalid XHTML lives.

From string (most common for APIs or clipboard):

csharp
var doc = new HtmlDocument();
doc.LoadHtml("<div><img src='missing'></div>"); // <img> is never closed; HAP accepts it as a void element

From file:

csharp
doc.Load("path/to/messy.html"); // Auto-detects encoding

From web (with HtmlWeb for full pages):

csharp
var web = new HtmlWeb();
var doc = web.Load("https://example.com");

Pro tip: Pair with HttpClient for modern async scraping, as shown in this ScrapingBee guide:

csharp
using System.Net.Http;

var client = new HttpClient();
string html = await client.GetStringAsync("https://news.ycombinator.com");
var doc = new HtmlDocument();
doc.LoadHtml(html);

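Encoding issues? Pass the encoding explicitly, as in doc.Load(path, Encoding.UTF8), or set doc.OptionDefaultStreamEncoding = Encoding.UTF8; beforehand. (OptionReadEncoding is a bool that controls whether HAP reads the charset declared in the markup; it does not take an Encoding.) Files with BOM? It sniffs them smartly.

What if the source is massive? HAP accepts a stream via Load(Stream stream), which saves you from building one giant string first. It still builds the whole DOM in memory, so this is not incremental parsing.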
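
One way to pin the encoding while loading from a stream is the Load(Stream, Encoding) overload. The sketch below uses a MemoryStream as a stand-in for a file or network stream; the markup and class name are invented for the example.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

class StreamLoadDemo
{
    static void Main()
    {
        // Stand-in for a real file or response stream (hypothetical sample data).
        byte[] bytes = Encoding.UTF8.GetBytes("<div><p>Café au lait</p>");

        using (var stream = new MemoryStream(bytes))
        {
            var doc = new HtmlDocument();
            doc.Load(stream, Encoding.UTF8); // explicit encoding, nothing left to guess

            // The <div> is never closed, yet the <p> inside it is still reachable.
            var paragraph = doc.DocumentNode.SelectSingleNode("//p");
            Console.WriteLine(paragraph?.InnerText ?? "(no paragraph found)");
        }
    }
}
```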


Querying and Extracting Data: XPath, LINQ Examples

Once loaded, grab what you need. XPath for precision, LINQ for .NET fans.

XPath basics (like CSS selectors, but powerful):

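csharp
// Requires: using System.Linq;
// SelectNodes returns null when nothing matches, so fall back to an empty sequence.
var links = doc.DocumentNode.SelectNodes("//a[@href]")
    ?.Select(n => n.GetAttributeValue("href", ""))
    ?? Enumerable.Empty<string>();
foreach (string link in links)
{
    Console.WriteLine(link);
}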

Targets all links. Need class-specific? //div[@class='story']//a.
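
A caveat on class tests: @class='story' compares the whole attribute string, so an element with class="story featured" would not match. A common XPath 1.0 workaround pads the attribute with spaces and looks for the class as a whole token. The sketch below assumes made-up markup and class names.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class XPathClassDemo
{
    static void Main()
    {
        // Hypothetical sample: the first div carries two classes.
        string html = "<div class='story featured'><a href='/a'>A</a></div>" +
                      "<div class='sidebar'><a href='/b'>B</a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad with spaces so 'story' matches as a token and not inside 'storybook'.
        string xpath =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' story ')]//a[@href]";
        var nodes = doc.DocumentNode.SelectNodes(xpath);

        // SelectNodes returns null on no match (unless OptionEmptyCollection is set).
        var hrefs = nodes == null
            ? Enumerable.Empty<string>()
            : nodes.Select(a => a.GetAttributeValue("href", ""));

        foreach (string href in hrefs)
        {
            Console.WriteLine(href);
        }
    }
}
```

Only the link inside the first div is selected here; the sidebar link is skipped.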

LINQ style (cleaner for filtering):

csharp
var titles = doc.DocumentNode.Descendants("span")
 .Where(n => n.HasClass("titleline"))
 .Select(n => n.InnerText.Trim())
 .ToList();

Attributes? node.GetAttributeValue("id", "default") returns fallback if missing. Inner/outer text: InnerText strips tags, OuterHtml keeps them.

Mix 'em: SelectSingleNode("//title")?.InnerText for the page title (the ?. guards against the null you get when no node matches). From html-agility-pack.net examples, this scales to tables or forms effortlessly.

Stuck on a selector? Debug with doc.DocumentNode.WriteTo() to dump the fixed tree.
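
As a rough illustration of that debugging loop, the sketch below dumps the repaired tree, guards a lookup that finds nothing, and decodes entities that InnerText leaves as written. The broken sample markup is invented.

```csharp
using System;
using HtmlAgilityPack;

class InspectTreeDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><p>Fish &amp; chips <b>today"); // made-up broken sample

        // Dump the tree as HAP rebuilt it; handy when a selector matches nothing.
        Console.WriteLine(doc.DocumentNode.OuterHtml);

        // SelectSingleNode returns null on no match, so guard before reading InnerText.
        string title = doc.DocumentNode.SelectSingleNode("//title")?.InnerText ?? "(no title)";
        Console.WriteLine(title);

        // InnerText keeps entities such as &amp; as written; DeEntitize decodes them.
        string raw = doc.DocumentNode.SelectSingleNode("//p")?.InnerText ?? "";
        Console.WriteLine(HtmlEntity.DeEntitize(raw));
    }
}
```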


Handling Parse Errors and Configuration Options

HAP forgives, but doesn’t hide issues. Check doc.ParseErrors post-load:

csharp
doc.LoadHtml(badHtml);
foreach (HtmlParseError error in doc.ParseErrors)
{
 Console.WriteLine($"Line {error.Line}: {error.Reason}");
}

Tune behavior before loading:

csharp
doc.OptionFixNestedTags = true; // Partially repair nesting errors in <li>, <tr>, <th>, <td>
doc.OptionAutoCloseOnEnd = true; // Close unclosed nodes at the end of the document rather than in place
doc.OptionEmptyCollection = true; // SelectNodes returns an empty collection instead of null
doc.LoadHtml(invalidXhtml);

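From GitHub issues, OptionFixNestedTags saves headaches with badly nested list items and table cells. The well-known <option> quirk, where some versions treat <option> as an empty element and leave its text outside the node, is a separate matter: the usual workaround is HtmlNode.ElementsFlags.Remove("option") before loading. Void elements like <input /> need no special handling.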

Manual tweaks? doc.OptionOutputAsXml = false; (the default) serializes the tree as HTML. Flip it to true before saving when you need XML-style output from the repaired document.
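
Here is a small sketch of two of these options working together, under the assumption that you want null-free queries and XML-style output. The sample markup and class name are hypothetical, and the exact serialized text depends on the HAP version, so the example prints it instead of promising a particular result.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

class OptionsDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.OptionEmptyCollection = true; // SelectNodes yields an empty collection, not null
        doc.OptionOutputAsXml = true;     // serialize with XML-style rules
        doc.LoadHtml("<html><body><p>One<br>Two</body></html>");

        // No <table> exists, but enumerating or counting is safe with OptionEmptyCollection.
        var tables = doc.DocumentNode.SelectNodes("//table");
        Console.WriteLine("tables: " + tables.Count);

        // Save writes the repaired document to any TextWriter.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            Console.WriteLine(writer.ToString());
        }
    }
}
```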

Real talk: parse errors are diagnostics, not exceptions. The document still loads, so log them and move on.


Advanced Tips and Real-World Examples

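Scale up: Remove scripts/styles first, then parse the rest. Descendants("script") returns a plain IEnumerable<HtmlNode> with no Remove() of its own, so snapshot it with ToList() and call Remove() on each node.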
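
A minimal sketch of that cleanup step follows. Descendants() is a lazy sequence over the live tree, so the nodes are collected into a list before any of them is removed. The sample markup is made up.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class StripScriptsDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><script>var x = 1;</script><style>p { color: red; }</style><p>Keep me</p></div>");

        // Snapshot first: removing nodes while enumerating the live tree is unsafe.
        var unwanted = doc.DocumentNode
            .Descendants()
            .Where(n => n.Name == "script" || n.Name == "style")
            .ToList();

        foreach (var node in unwanted)
        {
            node.Remove(); // detaches the node from its parent
        }

        Console.WriteLine(doc.DocumentNode.InnerText);
    }
}
```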

Hacker News scraper (inspired by ScrapingBee):

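csharp
// Requires: using System.Linq;
var web = new HtmlWeb();
var doc = web.Load("https://news.ycombinator.com");
// Story rows carry the "athing" class; the title link sits inside span.titleline.
// The site's markup can change, so treat these selectors as a starting point.
var rows = doc.DocumentNode.SelectNodes("//tr[contains(@class,'athing')]");
var stories = (rows ?? Enumerable.Empty<HtmlNode>())
    .Select(tr => tr.SelectSingleNode(".//span[@class='titleline']/a"))
    .Where(a => a != null)
    .Select(a => new { Title = a.InnerText, Link = a.GetAttributeValue("href", "") })
    .ToList();

foreach (var story in stories) Console.WriteLine($"{story.Title}: {story.Link}");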

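Async? HtmlWeb offers LoadFromWebAsync, or fetch with HttpClient and hand the string to LoadHtml; reserve Task.Run for CPU-heavy parsing. Multi-thread? HtmlDocument is not documented as thread-safe, so give each thread its own instance rather than sharing one.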
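
A conservative pattern for parsing several pages at once is to give every task its own HtmlDocument so nothing is shared between threads. The sketch below assumes the HTML strings are already in memory; the helper name CountLinks and the sample pages are hypothetical.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using HtmlAgilityPack;

class ParallelParseDemo
{
    // Each call builds and discards its own document, so no state is shared.
    static int CountLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.Descendants("a").Count();
    }

    static async Task Main()
    {
        // Made-up pages with unclosed tags, standing in for downloaded HTML.
        string[] pages =
        {
            "<p><a href='/1'>one</a>",
            "<ul><li><a href='/2'>two</a><li><a href='/3'>three</a>"
        };

        // Parse on the thread pool, one task per page.
        int[] counts = await Task.WhenAll(
            pages.Select(page => Task.Run(() => CountLinks(page))));

        for (int i = 0; i < pages.Length; i++)
        {
            Console.WriteLine($"page {i}: {counts[i]} link(s)");
        }
    }
}
```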

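Edge case from StackOverflow: HAP may rewrite markup it considers malformed when you serialize, for example by self-closing void tags. Expect the output to differ from the input, and embrace it.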

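Pro move: Serialize back with doc.DocumentNode.OuterHtml. HAP never runs JavaScript, so for pages that build their content with JS, chain it with a JavaScript-capable tool (AngleSharp with its scripting extension, or a headless browser) and pass the rendered HTML to LoadHtml.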


Sources

  1. HTML Agility Pack GitHub — Official repo with tolerant parsing options and code examples: https://github.com/zzzprojects/html-agility-pack
  2. HTML Agility Pack Documentation — Core loading and querying methods for C#: https://html-agility-pack.net/
  3. ScrapingBee HTML Agility Pack Guide — Web scraping examples with HttpClient integration: https://www.scrapingbee.com/blog/html-agility-pack/
  4. StackOverflow: HTML Agility Pack Malforms Code — Discussion on auto-fixing invalid elements: https://stackoverflow.com/questions/16404871/html-agility-pack-c-malforms-my-code
  5. StackOverflow: Fix Ill-Formed HTML — Parse error handling and manual fixes: https://stackoverflow.com/questions/22661640/how-to-fix-ill-formed-html-with-html-agility-pack

Conclusion

Mastering HTMLAgilityPack C# means handling invalid XHTML with little effort: install, load with options, query via XPath/LINQ, and log whatever parse errors remain. This library turns scraping nightmares into routine wins. Grab the NuGet, tweak a sample, and you’re extracting data in minutes. For deeper dives, hit the GitHub repo. Your C# projects just got more resilient.

Authors
Verified by moderation
NeuroAnswers
Moderation