Extract Text from HTML using NekoHTML and Dom4j

Consider the search results page at http://www.cdw.com/shop/search/results.aspx?wclss=C3&enkwrd=laptop&searchscope=ALL.
To keep the example simple, let's extract only the total number of search results and the title of each item. NekoHTML parses the page into a W3C DOM, dom4j's DOMReader converts that into a dom4j document, and XPath expressions then locate each element.
Here is the code:

package com.asc.dyutiman.html;

import java.io.IOException;
import java.util.List;

import org.cyberneko.html.parsers.DOMParser;
import org.dom4j.Node;
import org.dom4j.io.DOMReader;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class Parse {
	
	public static void main(String[] args){
		
		String url = "http://www.cdw.com/shop/search/results.aspx?wclss=C3&enkwrd=laptop&searchscope=ALL";
		try {
			DOMParser parser = new DOMParser();
			parser.parse(url);
			
			Document document = parser.getDocument();
			DOMReader reader = new DOMReader();
			org.dom4j.Document doc = reader.read(document);
			
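			// The third B element inside the results label holds the total number of matches.
			Node totalResultNode = doc.selectSingleNode("//SPAN[@id='lblShowingResultsTop']/B[3]");
			
			// Each search result is wrapped in a DIV with class "searchrow".
			@SuppressWarnings("unchecked")
			List<Node> itemList = doc.selectNodes("//DIV[@class = 'searchrow']");

			// Guard against a missing node in case the page layout has changed.
			String total = (totalResultNode != null) ? totalResultNode.getText() : "unknown";
			System.out.println("Showing " + itemList.size() + " out of " + total);
			for(Node itemNode : itemList){
				Node itemTitle = itemNode.selectSingleNode("DIV[@class = 'searchrow-description']/A");
				if(itemTitle != null){
					System.out.println(itemTitle.getText());
				}
			}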
		} catch (SAXException e) {
			System.out.println(e.getMessage());
		} catch (IOException e) {
			System.out.println(e.getMessage());
		}
	}
}

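Remember to write HTML tag names in uppercase in your XPath expressions (SPAN, DIV, A): NekoHTML reports element names in uppercase by default, and XPath name matching is case-sensitive. Attribute names such as id and class stay lowercase, as in the code above.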
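
To see why the case of the tag name matters, here is a minimal sketch that uses only the JDK's built-in XML and XPath classes rather than NekoHTML and dom4j. The class name `XPathCaseDemo`, the `count` helper, and the sample markup are made up for illustration; the sample simply mimics a tree whose element names are uppercase.

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class XPathCaseDemo {

    // Hypothetical sample: a tiny well-formed tree with uppercase element names.
    static final String SAMPLE =
            "<HTML><BODY>"
            + "<DIV class=\"searchrow\"><A>First item</A></DIV>"
            + "<DIV class=\"searchrow\"><A>Second item</A></DIV>"
            + "</BODY></HTML>";

    // Returns how many nodes the XPath expression selects in the given XML.
    static int count(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
            return nodes.getLength();
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Uppercase name test matches both DIV elements: prints 2
        System.out.println(count(SAMPLE, "//DIV[@class='searchrow']"));
        // Lowercase name test matches nothing: prints 0
        System.out.println(count(SAMPLE, "//div[@class='searchrow']"));
    }
}
```

If an XPath expression unexpectedly returns nothing, checking the case of the element names is a cheap first step.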


3 Responses to Extract Text from HTML using NekoHTML and Dom4j

  1. odeng says:

    Hello Dyutiman,
    I am trying to extract text from a web page, but I get an authentication error even though the username and password are correct. Why do you think that is?
    Here is the relevant piece of source code:
    public class Parse {

        public static void main(String[] args){

            String url = "http://192.168.254.120:89/database.htm";
            URL url1 = null;
            HttpURLConnection connection = null;
            try {
                url1 = new URL(url);
                connection = (HttpURLConnection) url1.openConnection();
                String userPassword = "admin:gtadmin";
                String encoding = new sun.misc.BASE64Encoder().encode(userPassword.getBytes());
                connection.setRequestProperty("Authorization", "Basic " + encoding);
                System.out.println(connection.getResponseCode());
                DOMParser parser = new DOMParser();
                parser.parse(url);

                Document document = parser.getDocument();
                DOMReader reader = new DOMReader();
                org.dom4j.Document doc = reader.read(document);

                List<Node> itemList = doc.selectNodes("//TD[3]");

                for(int i=0; i<itemList.size(); i++){
                    if(!itemList.get(i).getText().equals("")){
                        System.out.println(itemList.get(i).getText());
                    }
                }
            } catch (SAXException e) {
                System.out.println(e.getMessage());
            } catch (IOException e) {
                System.out.println(e.getMessage());
            }
        }
    }

    output:
    200
    Server returned HTTP response code: 401 for URL: http://192.168.254.120:89/database.htm

    I hope you understand the problem, thanks
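
A likely cause of the 401 in the comment above: the Authorization header is set on `connection`, which is why the first line of output is 200, but `parser.parse(url)` opens a second request of its own that carries no credentials. The sketch below only shows how the header value can be built with `java.util.Base64` (available since Java 8) instead of the internal `sun.misc.BASE64Encoder`. The class name `BasicAuthHeader` and the method `basicAuth` are made up for illustration, and the parser hand-off described in the comments is an untested assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class BasicAuthHeader {

    // Builds the value of an HTTP Basic "Authorization" header for the given credentials.
    static String basicAuth(String user, String password) {
        String credentials = user + ":" + password;
        String encoded = Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
        return "Basic " + encoded;
    }

    public static void main(String[] args) {
        // Credentials taken from the comment above.
        String header = basicAuth("admin", "gtadmin");
        System.out.println(header);

        // Assumed next steps, not exercised here:
        //   connection.setRequestProperty("Authorization", header);
        //   parser.parse(new org.xml.sax.InputSource(connection.getInputStream()));
        // so that the parser reads from the authenticated connection
        // instead of fetching the URL a second time.
    }
}
```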

  2. odeng says:

    dear Dyutiman,
    Your post really helped me, I thank you very much. :)
