GitHub - mishra14/Grumper-SearchEngine: Java based distributed web crawler, indexer, ranker and Search Engine


This repo is for a Java based Search Engine - Grumper!

The project involves building a crawler, indexer, page ranker, search engine and front-end UI.

The full names of all the project members
Aayushi Dwivedi
Ankit Mishra
Anwesha Das
Deepti Panuganti

A description of all features implemented -

Web crawler - We have implemented a distributed, incremental web crawler that follows the robots exclusion protocol. It includes a web UI that lets users monitor the status of every active crawler node and add seed URLs at any time.
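As an illustration of the robots exclusion check mentioned above, here is a minimal sketch. It is not the project's crawler code. The class name RobotsRules and the agent name "grumper" are hypothetical, and the sketch makes simplifying assumptions: only User-agent and Disallow lines are read, matching is by plain path prefix, and rules for "*" are applied together with rules for the named agent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the Grumper source.
class RobotsRules {
    private final List<String> disallowed = new ArrayList<>();

    /** Parses robots.txt text, keeping Disallow rules from groups naming the agent or "*". */
    static RobotsRules parse(String robotsTxt, String agent) {
        RobotsRules rules = new RobotsRules();
        boolean applies = false;       // does the current group apply to us?
        boolean inAgentLines = false;  // true while reading consecutive User-agent lines
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceAll("#.*", "").trim(); // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean match = value.equals("*") || value.equalsIgnoreCase(agent);
                // Several User-agent lines in a row form one group.
                applies = inAgentLines ? (applies || match) : match;
                inAgentLines = true;
            } else {
                inAgentLines = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.disallowed.add(value);
                }
            }
        }
        return rules;
    }

    /** Returns false when the path starts with any collected Disallow prefix. */
    boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: *\nDisallow: /private/\n";
        RobotsRules rules = RobotsRules.parse(robots, "grumper");
        System.out.println(rules.isAllowed("/index.html"));
        System.out.println(rules.isAllowed("/private/notes.html"));
    }
}
```

A crawler would typically fetch and parse robots.txt once per host and consult isAllowed before queueing each URL.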

Indexer - We have implemented an EMR-based, MapReduce-style indexer that creates an inverted index over the crawled document corpus for unigrams, bigrams, trigrams and metadata.
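To make the idea of an n-gram inverted index concrete, here is a small in-memory sketch. It is an assumption-laden stand-in for the EMR jobs, not the project's indexer: the class NGramIndex, the tokenisation rule (lowercase, split on non-alphanumeric characters) and the posting format (document id mapped to occurrence count) are all illustrative choices.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch; the real indexer runs as MapReduce jobs on EMR.
class NGramIndex {
    /** Splits text into lowercase word tokens and returns every n-gram of size n. */
    static List<String> ngrams(String text, int n) {
        List<String> words = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) words.add(t);
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= words.size(); i++) {
            out.add(String.join(" ", words.subList(i, i + n)));
        }
        return out;
    }

    /** Builds term -> (docId -> count). The outer loop plays the map phase, the merge the reduce phase. */
    static Map<String, Map<String, Integer>> build(Map<String, String> docs, int n) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            for (String gram : ngrams(doc.getValue(), n)) {
                index.computeIfAbsent(gram, k -> new TreeMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new TreeMap<>();
        docs.put("doc1", "to be or not to be");
        docs.put("doc2", "not to worry");
        System.out.println(build(docs, 2).get("not to"));
    }
}
```

Calling build with n set to 1, 2 or 3 corresponds to the unigram, bigram and trigram jobs respectively.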

PageRank - We have implemented an EMR-based, MapReduce-style PageRank system that runs its iterations automatically.

Search Engine - We have implemented a servlet-based web search engine that lets users submit queries and shows paginated results, with page previews on hover.
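Pagination of a ranked result list can be sketched as follows. This is only an illustration: the class Paginator, the 1-based page numbering and the page size are assumptions, not details taken from the search engine servlet.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not taken from the Grumper source.
class Paginator {
    /** Returns the results on the given 1-based page, or an empty list when out of range. */
    static <T> List<T> page(List<T> results, int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize < 1) return Collections.emptyList();
        long from = (long) (pageNumber - 1) * pageSize; // long avoids int overflow
        if (from >= results.size()) return Collections.emptyList();
        int to = (int) Math.min(from + pageSize, results.size());
        return results.subList((int) from, to);
    }

    /** Number of pages needed to show total results, pageSize per page (pageSize > 0). */
    static int pageCount(int total, int pageSize) {
        return (total + pageSize - 1) / pageSize;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println(page(urls, 2, 2));
        System.out.println(pageCount(urls.size(), 2));
    }
}
```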

Advanced Features -

Web Crawler - We have implemented message digesting: the content of each web document is hashed with SHA-1. Further, we have split our document store between DynamoDB and S3 to allow url -> hash and hash -> document lookups.
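The digest and the two lookups described above can be sketched like this. It is a minimal illustration under stated assumptions, not the project's storage code: the class ContentStoreSketch is hypothetical, and two in-memory maps stand in for the url -> hash and hash -> document stores. The only library call is the standard java.security.MessageDigest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; the real stores are DynamoDB and S3, not in-memory maps.
class ContentStoreSketch {
    private final Map<String, String> urlToHash = new HashMap<>();
    private final Map<String, String> hashToDocument = new HashMap<>();

    /** Returns the SHA-1 digest of the content as a 40-character lowercase hex string. */
    static String sha1Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] hash = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 not available", e);
        }
    }

    /** Records the page; returns true only if this content has not been stored before. */
    boolean put(String url, String content) {
        String hash = sha1Hex(content);
        urlToHash.put(url, hash);
        return hashToDocument.putIfAbsent(hash, content) == null;
    }

    /** Resolves url -> hash -> document, or null when the URL is unknown. */
    String get(String url) {
        String hash = urlToHash.get(url);
        return hash == null ? null : hashToDocument.get(hash);
    }

    public static void main(String[] args) {
        ContentStoreSketch store = new ContentStoreSketch();
        System.out.println(store.put("http://a.example/", "<html>same</html>"));
        System.out.println(store.put("http://b.example/", "<html>same</html>"));
    }
}
```

Keying documents by their digest means two URLs with identical content share one stored copy, which is one plausible reason for splitting the lookup in two.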

Indexer - We have implemented metadata indexing in our indexer.

PageRank - We have implemented an iterative PageRank algorithm that stores its results in DynamoDB so that each iteration can use the output of the previous one.
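The iteration scheme can be illustrated with a small in-memory sketch. This is not the project's EMR job. It assumes the common non-normalised update rank(p) = (1 - d) + d * sum(rank(q) / outdegree(q)) over pages q linking to p, a damping factor d of 0.85, and that every linked page also appears as a key in the link map. The ranks map passed from one call to the next plays the role of the table that holds results between iterations.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; class name, damping factor and update rule are assumptions.
class PageRankSketch {
    static final double DAMPING = 0.85;

    /** One iteration: each page shares its current rank equally among its outlinks. */
    static Map<String, Double> iterate(Map<String, List<String>> links, Map<String, Double> ranks) {
        Map<String, Double> next = new HashMap<>();
        for (String page : links.keySet()) {
            next.put(page, 1 - DAMPING);
        }
        for (Map.Entry<String, List<String>> e : links.entrySet()) {
            List<String> out = e.getValue();
            if (out.isEmpty()) continue; // dangling pages pass on nothing in this sketch
            double share = DAMPING * ranks.get(e.getKey()) / out.size();
            for (String target : out) {
                next.merge(target, share, Double::sum);
            }
        }
        return next;
    }

    /** Starts every page at rank 1.0 and feeds each iteration's output into the next. */
    static Map<String, Double> run(Map<String, List<String>> links, int iterations) {
        Map<String, Double> ranks = new HashMap<>();
        for (String page : links.keySet()) ranks.put(page, 1.0);
        for (int i = 0; i < iterations; i++) {
            ranks = iterate(links, ranks);
        }
        return ranks;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("a", Arrays.asList("b", "c"));
        links.put("b", Arrays.asList("c"));
        links.put("c", Arrays.asList("a"));
        System.out.println(run(links, 20));
    }
}
```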

Search Engine - We have designed our front end to allow users to see the preview of the page for any search result.

A list of source files included -

Web Crawler - The code for the crawler can be found on the branch anwesha in the edu.upenn.cis455.project.crawler package.

PageRank - The PageRank code is available on the branch pagerankfinal in the edu.upenn.cis455.pagerank package.

Search Engine - The code for the search engine can be found on the indexerdb branch in the edu.upenn.cis455.project.searchengine package.

All the database accessors and related classes are in the edu.upenn.cis455.project.storage package on the respective branch.

Detailed instructions on how to install and run the project.

Web Crawler - The crawler can be executed by copying the master.war and worker.war servlets into the jetty/webapps directory. It needs an AWS credentials file in the same folder, in the same format as the default credentials file generated by the AWS SDK. The worker WAR must be configured with the correct ip:port of the master servlet. The setup can be controlled through the master servlet UI at ip:port/master/status, where ip:port is the IP address and port of the master.

Indexer - There are four indexer jobs: unigram, bigram, trigram and metadata. You can find the most up-to-date code on the indexerdb branch.
The following steps are required to run an indexer:
1. Create a table in DynamoDB (the name must be Unigram, Bigram, Trigram or Metadata, matching the job).
2. Create the jar for the respective indexer job using the .jardesc files.
3. Copy the jar to an S3 bucket.
4. Create a cluster and provide the location of this jar in the Custom Jar field. Add the input bucket name as one of the parameters and also specify an output bucket name (data is not written to S3).
5. Run the step.
The indexed data will be written to the table you created in step 1.

PageRank - PageRank is run through an EMR controller, PageRankEmrController.java, available in the package edu.upenn.cis455.project.emrcontroller on the branch emrfinal. The PageRank module can be executed by running the PageRankEmrController jar on an EC2 node. It automatically creates the DynamoDB table, adjusts its capacity, creates the cluster, merges the input data, executes the EMR job, terminates the cluster and reduces the capacity of the DynamoDB table.

Search Engine - The search engine can be executed by copying searchengine.war to the jetty/webapps folder along with the AWS credentials file. The HTML pages (grumper.html and results.html), along with the CSS stylesheets and images, need to be inside the jetty/webapps/root directory. To open the search engine in a browser, use the URL http://52.90.111.118:8080/results.html [No longer Active]