C#•16mo ago

✅ Parsing a Link from an HTML file with HTMLAgilityPack

Hi, I'm having a bit of trouble parsing an HTML file to extract a link. I'm using HTMLAgilityPack to do this as it seemed simple enough for what I wanted. In the latest variable I use SelectNodes and provide the XPath to the link, which I found using inspect element. However, the selection returns null and the program throws an error when writing to the Console. Any tips?

using System;
using HtmlAgilityPack;
class Program
{

    static void Main(string[] args)
    {
        // Use HAP to fetch html from web.
        var link = "https://www.abs.gov.au/statistics/labour/employment-and-unemployment/labour-force-australia/dec-2023";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(link);
        var latest = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"block-views-block-topic-releases-listing-topic-latest-release-block\"]/div/div/div/div/a").ToString();
        Console.WriteLine(latest);
    }
}

Console Error

Unhandled exception. System.NullReferenceException: Object reference not set to an instance of an object.
   at Program.Main(String[] args) in /home/antonio/interview_macrobond/Program.cs:line 12

27 Replies

canton7•16mo ago

I opened that link, View Source, ctrl-f for "block-views-block-topic-releases-listing-topic-latest-release-block", and there are no hits. I even did the same from developer tools (which includes HTML generated by JS), and there are no hits there either.

FatTonyOP•16mo ago

Oh I am very regarded....

Pobiega•16mo ago

And remember that HAP (and AngleSharp too) don't actually run any JavaScript

FatTonyOP•16mo ago

Let me check if that was the issue, thanks

Pobiega•16mo ago

so if that data is loaded via JS, it won't work
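
For reference, the NullReferenceException in the original post comes from calling .ToString() on the null that SelectNodes returns when no node matches the XPath. A minimal sketch of guarding against that is below. The inline HTML, the ids "present" and "absent", and the helper name CountMatches are made up for illustration; they are not taken from the ABS page.

```csharp
using System;
using HtmlAgilityPack;

class NullGuardSketch
{
    // Hypothetical helper: returns how many nodes match the XPath,
    // treating the null that SelectNodes gives for "no match" as zero.
    public static int CountMatches(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Inline HTML stands in for a fetched page so the sketch runs offline.
        var html = "<div id=\"present\"><a href=\"/next\">Next</a></div>";

        // The id "absent" is not in the markup, so this match count is zero
        // instead of a crash.
        if (CountMatches(html, "//*[@id=\"absent\"]//a") == 0)
        {
            Console.WriteLine("No nodes matched; check the XPath against the raw HTML.");
        }

        Console.WriteLine(CountMatches(html, "//*[@id=\"present\"]//a"));
    }
}
```

If the count is zero for an id that is visible in the browser, that is a hint the element is either on a different page or added by JavaScript after load.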

FatTonyOP•16mo ago

That's fine, I think. All I need is the link for the next page, and then I do the same after. I need to parse through a couple of HTML pages and get a download link afterwards

FatTonyOP•16mo ago

https://www.abs.gov.au/statistics/labour/employment-and-unemployment/labour-force-australia this is the correct link

FatTonyOP•16mo ago

using System;
using HtmlAgilityPack;
class Program
{

    static void Main(string[] args)
    {
        // Use HAP to fetch html from web.
        var link = "https://www.abs.gov.au/statistics/labour/employment-and-unemployment/labour-force-australia";
        HtmlWeb web = new HtmlWeb();
        var htmlDoc = web.Load(link);
        var latest = htmlDoc.DocumentNode.SelectNodes("//*[@id=\"block-views-block-topic-releases-listing-topic-latest-release-block\"]/div/div/div/div/a");
        latest.ToList().ForEach(i=>Console.WriteLine(i.InnerText));
    }
}

Ok, so this finds the node, but it prints the text inside the link instead of the link, how can I extract the link?

Pobiega•16mo ago

the link itself is inside the href attribute of the tag, no? InnerText is the stuff between the opening and closing tag, i.e. <a href="meep">InnerText</a>

FatTonyOP•16mo ago

Ah ok, so how do I extract the href?

Pobiega•16mo ago

iirc there is a way to access attributes on the tag. Check what props/methods are available on i.

canton7•16mo ago

The documentation's a bit shit, isn't it? I'd just F12 on i, see what's available

Pobiega•16mo ago

ye exactly that. or just let intellisense autocomplete i.

FatTonyOP•16mo ago

I'm running on vim 🙃 I got a couple Properties i'm gonna try printing

Pobiega•16mo ago

No LSP? I'm sure I've seen intellisense in vim before. Also, unrelated, but HAP has not aged super well. Most people prefer AngleSharp these days.

FatTonyOP•16mo ago

Yh, I'm using LSP but tbh omnisharp is not fantastic in Linux

canton7•16mo ago

If vim can't show you all of the properties/methods on a type, you really need to be using something else (or configure it better)

Pobiega•16mo ago

You can 100% do this with either lib though, so it's not an issue really. Just thought I should throw that out there. iirc AS is quite a bit faster too.
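
For comparison, a rough sketch of the same kind of lookup with AngleSharp, which uses CSS selectors rather than XPath. This assumes the AngleSharp NuGet package is installed; the inline HTML, the id "latest", and the helper name ExtractHrefs are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Html.Parser;

class AngleSharpSketch
{
    // Hypothetical helper: collects the href of every element matched
    // by the CSS selector, skipping elements without one.
    public static List<string> ExtractHrefs(string html, string cssSelector)
    {
        var parser = new HtmlParser();
        var document = parser.ParseDocument(html);

        var hrefs = new List<string>();
        foreach (var element in document.QuerySelectorAll(cssSelector))
        {
            // GetAttribute returns null when the attribute is missing.
            var href = element.GetAttribute("href");
            if (!string.IsNullOrEmpty(href))
            {
                hrefs.Add(href);
            }
        }
        return hrefs;
    }

    static void Main()
    {
        // Inline HTML stands in for a fetched page so the sketch runs offline.
        var html = "<div id=\"latest\"><a href=\"/stats/dec-2023\">Latest release</a></div>";
        foreach (var href in ExtractHrefs(html, "#latest a"))
        {
            Console.WriteLine(href);
        }
    }
}
```

Like HAP, this only parses the markup it is given and does not run scripts on the page.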

FatTonyOP•16mo ago

Ok ok, this is for a Job Interview exercise and I was getting a bit stuck. I just need something that works by tonight and tomorrow if I can make it better, then I'll spend some time improving my solution. Thanks for the tip 🙂

Pobiega•16mo ago

not saying you should change, just wanted to add my 2 cents

canton7•16mo ago

Yeah, getting an attribute of an HTML element is one of the very very basic things any HTML library will let you do

FatTonyOP•16mo ago

I can, I do need to configure my lsp a lil better, true. Just haven't gotten around to it yet 😆

Pobiega•16mo ago

var href = link.Attributes["href"].Value; says google

FatTonyOP•16mo ago

What's that link?

Pobiega•16mo ago

link is your i. It's the HTML tag/element.
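
Putting the pieces of this thread together, a sketch of reading the href instead of InnerText with HAP might look like the following. GetAttributeValue is used here as a null-safe alternative to Attributes["href"].Value, and since an href is often relative, the sketch resolves it against a base address. The inline HTML, the id "latest", the example.com base address, and the helper name ExtractLinks are all placeholders, not the real ABS markup.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

class HrefSketch
{
    // Hypothetical helper: returns the absolute URL of every link
    // matched by the XPath, or an empty list when nothing matches.
    public static List<string> ExtractLinks(string html, string xpath, Uri baseUri)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var links = new List<string>();
        var nodes = doc.DocumentNode.SelectNodes(xpath);
        if (nodes == null)
        {
            return links; // SelectNodes returns null when there is no match
        }

        foreach (var node in nodes)
        {
            // Falls back to "" when the tag has no href attribute.
            var href = node.GetAttributeValue("href", "");
            if (href.Length == 0)
            {
                continue;
            }

            // Resolve relative hrefs such as "/stats/dec-2023" against the page address.
            links.Add(new Uri(baseUri, href).AbsoluteUri);
        }
        return links;
    }

    static void Main()
    {
        // Inline HTML stands in for a fetched page so the sketch runs offline.
        var html = "<div id=\"latest\"><a href=\"/stats/dec-2023\">Latest release</a></div>";
        var baseUri = new Uri("https://example.com/stats/");

        foreach (var url in ExtractLinks(html, "//*[@id=\"latest\"]//a", baseUri))
        {
            Console.WriteLine(url);
        }
    }
}
```

With a page fetched through HtmlWeb.Load, the same loop would apply to htmlDoc.DocumentNode, and the resolved URL could be passed to the next Load call to follow the link.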

FatTonyOP•16mo ago

Ah amazing! I'll try that 🙂 @Pobiega @canton7 ❤️ got it!!!! thanks so much!

canton7•16mo ago

Cool, glad to hear!
