How To Create a Web Crawler and Data Miner

A web crawler is an internet bot that systematically browses the World Wide Web. It is often called a web spider. The best-known web crawler is Googlebot. A web crawler starts with a list of URLs to visit (the seeds). It then identifies all the hyperlinks in each page it fetches and adds them to the list of URLs to visit. In this article, I will show you how to create a web crawler. There are many ways to create one, and one of them is to use Apache Nutch.
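To make the seed-and-follow loop described above concrete, here is a minimal Python sketch. It is not how Apache Nutch is implemented. The `crawl` function, the `fetch` callable, and the in-memory `PAGES` dictionary are all assumptions made for illustration, with the dictionary standing in for real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, max_pages=100):
    """Breadth-first crawl: visit the seeds, then every newly found link.

    `fetch` is a hypothetical callable that takes a URL and returns the
    page HTML as a string, or None if the page cannot be retrieved.
    """
    frontier = deque(seeds)  # URLs waiting to be visited
    seen = set(seeds)        # URLs already queued, so none is queued twice
    visited = []             # URLs successfully fetched, in visit order

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return visited


# A tiny in-memory "web" (hypothetical pages) replaces real HTTP requests.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

order = crawl(["http://example.com/"], PAGES.get)
print(order)
```

The `seen` set is what keeps the crawler from looping forever when pages link back to each other, as the three sample pages do.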

Apache Nutch is a scalable and very robust tool for web crawling. It can be integrated with the Python programming language. You can use it to crawl your own data for better indexing. If you understand Apache Nutch well, you can build your own search engine, like Google.

Apache Nutch can run on a single machine as well as in a distributed environment such as Apache Hadoop. It’s written in Java. Apache Nutch can also be integrated easily with Apache Solr (a search platform that can be used for searching any type of data, including web pages), so we can pass all the pages crawled and indexed by Apache Nutch to Apache Solr.

Set Up Your Web Crawler

To start using Apache Nutch, we first need to install it together with its dependencies.

The components you need are:

  1. Apache Nutch
  2. HBase
  3. Ant
  4. JDK

In this tutorial, we will use Apache Nutch 2.2.1. These are the steps for installing and configuring it:

1. Download Apache Nutch

2. Extract it by using this command # tar -zxvf apache-nutch-2.2.1-src.tar.gz

3. Download Apache HBase

4. Extract it by using this command # tar -zxvf Hbase.x.x.tar.gz

5. Configure HBase. Go to <Your HBase home>/conf, open hbase-site.xml, and modify it as shown in the image below

[Image: hbase-site.xml configuration]

6. Specify the Gora backend in nutch-site.xml (you can find it in $NUTCH_HOME/conf)

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>

7. Ensure that the gora-hbase dependency is available in ivy.xml by adding the following configuration

<dependency org="org.apache.gora" name="gora-hbase" rev="0.2" conf="*->default" />

8. Make sure HBaseStore is set as the default data store by putting the following setting into gora.properties

gora.properties:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

9. Go to the Apache Nutch home directory and run the following command

ant runtime

10. When the build finishes, Apache Nutch will have created its runtime directory under the Nutch home directory.

11. Make sure HBase is working properly by going to the HBase home directory and running the following command

./bin/hbase shell

If everything goes well, you will see output like this:

HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version: 0.90.4, r1001068, Fri Sep 24 13:55:42 PDT 2010
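Steps 6 to 8 above set the same store class in two different files, so a typo in either one is easy to miss. As a rough sanity check, the Python sketch below parses a nutch-site.xml fragment and a gora.properties line and confirms that both name HBaseStore. It is not part of Nutch or Gora. The helper names `read_site_properties` and `read_java_properties` are hypothetical, the sample strings simply mirror the snippets from the steps, and the .properties parser handles only plain key=value lines.

```python
import xml.etree.ElementTree as ET


def read_site_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        # Strip whitespace so "<name> x </name>" and "<name>x</name>" match.
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def read_java_properties(text):
    """Very small .properties reader: key=value lines, # or ! comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


# Sample contents copied from the steps above (assumed file contents).
NUTCH_SITE = """
<configuration>
  <property>
    <name>storage.data.store.class</name>
    <value>org.apache.gora.hbase.store.HBaseStore</value>
    <description>Default class for storing data</description>
  </property>
</configuration>
"""
GORA_PROPERTIES = "gora.datastore.default=org.apache.gora.hbase.store.HBaseStore\n"

site = read_site_properties(NUTCH_SITE)
gora = read_java_properties(GORA_PROPERTIES)

# Both files should point at the same Gora store class.
stores_match = site.get("storage.data.store.class") == gora.get("gora.datastore.default")
print("Store classes match:", stores_match)
```

To check a real installation, replace the two sample strings with the contents of your own conf/nutch-site.xml and conf/gora.properties files. A malformed XML file, such as one with a misspelled closing tag, makes `ET.fromstring` raise a `ParseError`, which is itself a useful signal.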

Start Crawling Your First Website Using Apache Nutch

After finishing the installation steps for Apache Nutch, you can start crawling by following these steps

1. Set your agent name as the value of the http.agent.name property in nutch-site.xml by adding the following configuration

<configuration>
<property>
<name>http.agent.name</name>
<value>My Private Spider Bot</value>
</property>
</configuration>

2. Go to the local directory of Apache Nutch, which is located under <your Apache Nutch home>/runtime, and create a directory called urls inside it

3. Create seed.txt inside the urls directory and put the URLs you want to crawl first in it, one per line. For example:

http://etechnologytips.com

4. Now you can start the crawl. Make sure HBase is running, then start Apache Nutch with the following commands

cd <Respective directory of Apache Nutch>/runtime
bin/crawl urls/seed.txt TestCrawl
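Steps 2 and 3 above (creating the urls directory and seed.txt) can also be scripted, which helps when the seed list is long. The following Python sketch assumes a hypothetical helper named `write_seed_file`, and it writes into a temporary directory so that it runs without a Nutch installation. It keeps only well-formed http or https URLs and writes them one per line.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def write_seed_file(base_dir, urls):
    """Create <base_dir>/urls/seed.txt with one valid http(s) URL per line."""
    valid = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        # Keep only absolute http/https URLs that include a host name.
        if parts.scheme in ("http", "https") and parts.netloc:
            valid.append(url)

    seed_dir = Path(base_dir) / "urls"
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / "seed.txt"
    seed_file.write_text("\n".join(valid) + "\n", encoding="utf-8")
    return seed_file


# Demo in a throwaway directory; the second entry is deliberately invalid.
with tempfile.TemporaryDirectory() as tmp:
    path = write_seed_file(tmp, ["http://etechnologytips.com", "not a url"])
    seed_lines = path.read_text(encoding="utf-8").splitlines()

print(seed_lines)
```

In a real setup you would pass the directory from step 2 as `base_dir` instead of a temporary directory.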

If you get errors when starting Apache Nutch, check for common errors.

  • Trover

    During compilation, I get this error: “[FAILED ] org.hsqldb#hsqldb;2.2.8!hsqldb.jar:…”, “Impossible to resolve dependencies:…”. My OS is Ubuntu 14.04. Any ideas? Thanks.

    • James Howard

      Make sure the dependencies are set correctly.

      Try deleting the entire .ivy directory and re-running ant.
