Crawl and search using nutch, a tutorial for beginners
This simple tutorial covers nutch 0.9 and above (i.e., at the time of writing, 1.0 and 1.1-dev) running in a Unix environment.
1 Downloading nutch and Java
1.1 Nutch
Choose your preferred mirror here: http://www.apache.org/mirrors/. After choosing a mirror, all Apache projects will appear in a list; scroll to nutch and select the version you prefer, in either .zip or .tar.gz format.
1.2 Java
You will need Java 5+ for nutch 0.9 or Java 6+ if you intend to use nutch 1.0+. Java is available for download here: http://www.java.com/en/download/
2 Uncompress nutch
2.1 Choose a location
Suppose nutch-1.0.tar.gz has been downloaded. Move the package to a directory where there is sufficient disk space for nutch and the crawled files. The amount of space needed depends on the scale of the website(s) to be crawled. To review the disk space available on the host, run:
df -h
2.2 Uncompress the package
Move the .tar.gz package to where nutch will be kept, then run:
tar -zxvf nutch-1.0.tar.gz
All nutch files will then be extracted to:
nutch-1.0/
2.3 Download the sample crawl script
Now download the script “crawl.sh” attached to this article and place it in:
nutch-1.0/crawl.sh
3 Prepare Java
3.1 Which version?
The nutch crawler requires Java. Assuming Java is installed and the path to its executables has been added to your environment, check the version of Java by running:
java -version
It returns something like “java version "1.5.xxxxx"” if you have Java 5, “java version "1.6.xxxxx"” if you have Java 6, and so on. Make sure you have Java 6+ if nutch 1.0, which has Solr support, is your choice. This may be a problem for users of Mac OS X 10.5 (32-bit), which currently ships only Java 5; the solution is to upgrade to Mac OS X 10.6 (Snow Leopard).
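As a rough sketch, the first line of the version output can be matched against a shell pattern to decide which nutch release it supports. The version string below is a hard-coded sample; on a real machine you would capture it with `ver="$(java -version 2>&1 | head -1)"`:

```shell
# sample version string; in practice: ver="$(java -version 2>&1 | head -1)"
ver='java version "1.6.0_21"'
case "$ver" in
  *\"1.[6-9]*) echo "Java 6+: fine for nutch 1.0" ;;   # matches "1.6 through "1.9
  *\"1.5*)     echo "Java 5: nutch 0.9 only" ;;
  *)           echo "unrecognised version string" ;;
esac
```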
The following command may also provide some help for setting up JAVA_HOME later. It will return a path to the Java executable:
which java
3.2 Setting JAVA_HOME
Nutch requires the environment variable JAVA_HOME to be set. JAVA_HOME is set at Line 3 of crawl.sh; use the path returned by the command above, and make sure the java executable can be found at
$JAVA_HOME/bin/java
Omit “/bin/java” when setting JAVA_HOME, and do not leave a trailing slash.
If you have multiple versions of Java installed, you need to ensure the correct version of Java you intend to use can be found using JAVA_HOME. For example, if you want to use Java 6, which is available at:
/usr/jre1.6.0/bin/java
then set JAVA_HOME to
JAVA_HOME=/usr/jre1.6.0
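The stripping rule can be sketched with shell parameter expansion; the path below is the example from above, so substitute whatever `which java` printed on your machine:

```shell
# substitute the path that `which java` printed on your machine
JAVA_BIN=/usr/jre1.6.0/bin/java
# strip the trailing /bin/java, leaving no trailing slash
export JAVA_HOME="${JAVA_BIN%/bin/java}"
echo "$JAVA_HOME"    # prints /usr/jre1.6.0
```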
4 Set up nutch
4.1 Seed list
Now that you are warmed up, let's get down to business. First you need to specify a seed list containing the URLs nutch will start crawling from. Assume we are going to crawl the following websites:
- www.xing.net.au
- *.xing.net.au (which may include blog.xing.net.au, langshare.xing.net.au etc)
In your nutch directory, create a new directory called:
urls
then create a plain text file in this directory; you may give it any name you like, for example:
list
Now add the URLs to list, one URL per line:
http://www.xing.net.au
http://www.xymphony.net
Note the above list does not contain URLs such as http://blog.xing.net.au or http://langshare.xing.net.au even though we would like to crawl them. This is because these URLs are referenced by pages under the http://www.xing.net.au domain. If you are not confident that the home pages of your sub-domains are referenced by the main site, then add them all to the seed list. Duplicated seeds do not cause repeated crawling. You may also want to store seeds for different domains in different seed files. As long as these files are in the “urls” directory and this directory is passed to the crawler, all URLs in all files under this directory will be crawled.
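The steps above can be done in two commands, run from the nutch directory (the here-document simply writes the two example seeds into urls/list):

```shell
# run from the nutch directory: create the seed directory and a seed file
mkdir -p urls
cat > urls/list <<'EOF'
http://www.xing.net.au
http://www.xymphony.net
EOF
cat urls/list   # show the seeds, one URL per line
```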
4.2 Set up URL filters
The next important step is to set up a URL filter, which uses regular expressions to control which pages discovered from the seed URLs will be crawled. The default URL filter used by nutch is located at:
nutch-1.0/conf/crawl-urlfilter.txt
An example of this filter is attached to this article. It normally contains the following (lines starting with a hash ‘#’ are comments):
# skip URLs that use these schemes - file:, ftp:, and mailto:
-^(file|ftp|mailto):
# skip multimedia and other types we can't yet parse. This also applies to those with cache-busting dummy parameters
-\.(gif|jpg|png|ico|css|eps|zip|mpg|gz|rpm|tgz|mov|exe|bmp|js|mp3)[\?\w]*$
# exclude these directories
-/temp/
-/tmp/
#skip these sub-domains
-grace.xymphony.net
# only accept pages under these domains
+.xymphony.net
+www.xing.net.au
# skip everything else
-.
Note that the order of the filter rules affects how they are applied: nutch checks them from top to bottom. The acceptable domains are therefore added near the end of the list, directly above the line that skips everything else.
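You can get a rough feel for a rule with grep before handing it to nutch. Note that grep's POSIX regexes differ from the Java regexes nutch actually uses, so the `[\?\w]` class is approximated below as `[?A-Za-z0-9_]`, and the alternation is shortened for the example:

```shell
# adapted multimedia-skip pattern; [?A-Za-z0-9_]* stands in for Java's [\?\w]*
pat='\.(gif|jpg|png|css|js|mp3)[?A-Za-z0-9_]*$'
echo 'http://www.xing.net.au/logo.png'   | grep -Eq "$pat" && echo 'skipped'
echo 'http://www.xing.net.au/about.html' | grep -Eq "$pat" || echo 'accepted'
```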
4.3 Nutch site configuration
Depending on the performance of the web server and the speed of your connection, you may wish to customise nutch so that it neither puts too much strain on the web server nor uses up your network bandwidth. The site configuration file can be found at:
nutch-1.0/conf/nutch-site.xml
A typical site configuration file is attached to this article. Thanks to the brilliant people who created nutch, the default configuration file nutch-default.xml has lots of helpful comments that make the job a lot easier. Copy whatever property you would like to change into nutch-site.xml and change its value there. Note this configuration file is case-sensitive, so validate the file before nutch loads it.
Note there is a property called:
fetcher.threads.per.host
This property controls the number of threads nutch may use against a single host while crawling it. Raising it can significantly reduce crawling time by opening concurrent connections to a host and fetching its pages in parallel. The default configuration sets this property to 1, meaning only one thread is allowed per host. The attached crawl.sh launches 16 threads, so if you are crawling only one host, the other 15 threads will sit idle. Make sure this property is set in the site configuration.
Any value appearing in nutch-site.xml will override that in nutch-default.xml.
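A minimal nutch-site.xml sketch overriding that one property might look like the following; 16 matches the thread count crawl.sh uses, but pick a value the web server can actually tolerate:

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>16</value>
  </property>
</configuration>
```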
4.4 Launch the crawler
After all the pain of setting up the crawler, you are now ready to launch it. JAVA_HOME is set, as mentioned above, in the shell script crawl.sh, which is attached to this article. Make sure this file is placed in your nutch directory and is executable there. The file contains the following:
#!/bin/bash
# set JAVA_HOME here; java must be found at JAVA_HOME/bin/java (omit /bin/java, no trailing slash)
export JAVA_HOME=/usr

# prepare the crawl directory
cd "$(dirname "$0")"
if [ -d crawl/new ]; then
    echo Preparing crawl directory...
    rm -rf crawl/new/
fi

# start the crawler and time it
time bin/nutch crawl urls -dir crawl/new -depth 8 -topN 50000 -threads 16 || exit

# if the index was built successfully, move the new index into ./crawl and remove the old one
if [ -d crawl/new/index ]; then
    echo Cleaning up...
    rm -rf crawl/crawldb/
    rm -rf crawl/index/
    rm -rf crawl/indexes/
    rm -rf crawl/linkdb/
    rm -rf crawl/segments/
    mv -f crawl/new/* crawl/
    rmdir crawl/new
    echo
    echo Crawling completed.
else
    echo Crawling FAILED
fi
The arguments used to start the crawler:
bin/nutch crawl urls -dir crawl/new -depth 8 -topN 50000 -threads 16
are explained below:
- -dir crawl/new: write all crawled files and the index to the directory ./crawl/new
- -depth 8: the maximum depth of links crawled is 8 (8 levels down from the seed URLs)
- -topN 50000: the maximum number of links/pages that can be crawled at each depth
- -threads 16: issue 16 threads simultaneously (see the note on the property fetcher.threads.per.host in 4.3 above)
Go to the nutch directory and execute crawl.sh. The time needed for each crawl depends heavily on the scale of the site you wish to crawl, the speed of your network connection, and how powerful your web server and nutch host are. The script crawl.sh shows the total elapsed time at the end of the process.
5 Review the index with Luke
Luke is a handy tool that can be used to view indexes in Lucene format. It is available at http://code.google.com/p/luke/downloads/list. Grab one of the executable JAR files, for example lukeall-1.0.1.jar, save it to your nutch directory and run:
java -jar lukeall-1.0.1.jar
When prompted to load the index, enter the path to the crawled files and press OK. You will then see all the documents fetched and indexed by nutch. You can use Luke to review frequent terms, preview searches, and edit or delete crawled documents.