How can I efficiently parse HTML with Java?

I do a lot of HTML parsing in my line of work. Up until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.

Now I want to separate the two tasks.

I want to use a lightweight HTML parser, because HtmlUnit takes a long time to first load a page, then get the source, and then parse it.

I want to know which HTML parser can parse HTML efficiently. I need:

  1. Speed
  2. Ease of locating any HtmlElement by its “id”, “name”, or “tag type”.

It is fine if the parser doesn’t clean up dirty HTML; I don’t need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.

3 Answers