Commit a300029

init MapReduce based on mrjob
1 parent 591d076 commit a300029

2 files changed: +48 -3 lines changed

README.md

+9 -3
@@ -2,14 +2,16 @@
 
 > ...because it's magic
 
-The basic idea of **mrlin** is to enable **M**ap **R**educe processing of **Lin**ked Data - hence the name. In the following I'm going to show you first how to use HBase to store Linked Data with RDF and then how to use Hadoop to execute MapReduce jobs.
+The basic idea of **mrlin** is to enable **M**ap **R**educe processing of **Lin**ked Data - hence the name. In the following I'm going to show you first how to use HBase to store Linked Data with RDF, and then how to use Hadoop to run MapReduce jobs.
 
 ## Background
 
 ### Dependencies
 
 * You'll need [Apache HBase](http://hbase.apache.org/) first. I downloaded [`hbase-0.94.2.tar.gz`](http://ftp.heanet.ie/mirrors/www.apache.org/dist/hbase/stable/hbase-0.94.2.tar.gz) and followed the [quickstart](http://hbase.apache.org/book/quickstart.html) up to section 1.2.3. to set it up.
-* The mrlin Python scripts depend on [Happybase](https://github.com/wbolster/happybase). See also the [docs](http://happybase.readthedocs.org/en/latest/index.html) for further details.
+* The mrlin Python scripts depend on:
+    * [Happybase](https://github.com/wbolster/happybase) to manage HBase; see also the [docs](http://happybase.readthedocs.org/en/latest/index.html) for further details.
+    * [mrjob](https://github.com/Yelp/mrjob) to run MapReduce jobs; see also the [docs](http://packages.python.org/mrjob/) for further details.
 
 ### Representing RDF triples in HBase
 Learn about how mrlin represents [RDF triples in HBase](https://github.com/mhausenblas/mrlin/wiki/RDF-in-HBase).
@@ -69,14 +71,18 @@ To reset the HBase table (and remove all triples from it), use the [`mrlin utils
     (hb)michau@~/Documents/dev/mrlin$ python mrlin_utils.py clear
 
 ### Query
-In order to query the mrlin datastore, use the [`mrlin query`](https://raw.github.com/mhausenblas/mrlin/master/mrlin_query.py) script:
+In order to query the mrlin datastore in HBase, use the [`mrlin query`](https://raw.github.com/mhausenblas/mrlin/master/mrlin_query.py) script:
 
     (hb)michau@~/Documents/dev/mrlin$ python mrlin_query.py Tribes
     2012-10-30T04:01:22 Scanning table rdf with filter ValueFilter(=,'substring:Tribes')
     2012-10-30T04:01:22 Key: http://dbpedia.org/resource/Galway - Value: {'O:148': 'u\'"City of the Tribes"\'', 'O:66': 'u\'"City of the Tribes"\'', ...}
     2012-10-30T04:01:22 ============
     2012-10-30T04:01:22 Query took me 0.01 seconds.
 
+### Running MapReduce jobs
+
+*TBD*
+
 ## License
 
 All artifacts in this repository are licensed under [Apache 2.0](http://www.apache.org/licenses/LICENSE-2.0.html) Software License.
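
The log output in the Query section of the README shows a scan of the `rdf` table with the filter `ValueFilter(=,'substring:Tribes')`. A minimal sketch of how such a scan could be issued through Happybase follows. The function names, the `localhost` host, and the default Thrift setup are assumptions made for illustration; this is not necessarily how `mrlin_query.py` is written.

```python
def build_value_filter(substring):
    """Build an HBase filter string that matches cells whose value
    contains `substring` (same shape as the filter in the README log)."""
    # Assumption: `substring` contains no single quotes; no escaping is done.
    return "ValueFilter(=,'substring:{0}')".format(substring)


def scan_for(substring, host="localhost", table_name="rdf"):
    """Yield (row key, cell dict) pairs for rows matching `substring`.

    Hypothetical helper: needs a running HBase Thrift server on `host`.
    """
    import happybase  # third-party; imported lazily so the module loads without it

    connection = happybase.Connection(host)
    table = connection.table(table_name)
    for key, data in table.scan(filter=build_value_filter(substring)):
        yield key, data


tribes_filter = build_value_filter("Tribes")
```

Calling `scan_for("Tribes")` against a populated table would then iterate over the matching rows, much like the `Key: ... - Value: ...` lines in the log.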

mrlin_mr.py

+39
@@ -0,0 +1,39 @@
+#!/usr/bin/env python
+"""
+mrlin - map/reduce
+
+Runs Hadoop (MapReduce) jobs on mrlin tables.
+
+Usage: python mrlin_mr.py param
+Examples:
+       python mrlin_mr.py abc
+
+Copyright (c) 2012 The Apache Software Foundation, Licensed under the Apache License, Version 2.0.
+
+@author: Michael Hausenblas, http://mhausenblas.info/#i
+@since: 2012-10-31
+@status: init
+"""
+from mrjob.job import MRJob
+
+class MREntityTypeCounter(MRJob):
+    """Calculates the types of entities (rdf:type) in a mrlin table."""
+    # def __init__(self, *args, **kwargs):
+    #     super(MREntityTypeCounter, self).__init__(*args, **kwargs)
+
+    def get_etypes(self, key, line):
+        for word in line.split():
+            yield word, 1
+
+    def sum_etypes(self, word, occurrences):
+        yield word, sum(occurrences)
+
+    def steps(self):
+        return [self.mr(self.get_etypes, self.sum_etypes)]
+
+
+#############
+# Main script
+
+if __name__ == '__main__':
+    MREntityTypeCounter.run()
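
As committed, `get_etypes` emits a count of 1 for every whitespace-separated token of an input line and `sum_etypes` adds those counts up per token; nothing in this initial version reads `rdf:type` values from the HBase table yet. To see what the job computes without installing mrjob or Hadoop, here is a pure-Python sketch of the same map, shuffle, and reduce flow. The `run_local` helper and the sample input tokens are made up for illustration.

```python
from collections import defaultdict


def get_etypes(_key, line):
    """Mapper: emit (token, 1) for each whitespace-separated token."""
    for word in line.split():
        yield word, 1


def sum_etypes(word, occurrences):
    """Reducer: emit the token together with its total count."""
    yield word, sum(occurrences)


def run_local(lines):
    """Simulate map, shuffle, and reduce in memory (no Hadoop, no mrjob)."""
    grouped = defaultdict(list)
    # Map phase plus shuffle: collect all counts emitted for the same token.
    for line in lines:
        for word, count in get_etypes(None, line):
            grouped[word].append(count)
    # Reduce phase: one reducer call per distinct token.
    result = {}
    for word in sorted(grouped):
        for key, total in sum_etypes(word, grouped[word]):
            result[key] = total
    return result


# Hypothetical input lines; real input would come from the mrlin table.
counts = run_local(["foaf:Person dbo:City", "foaf:Person"])
```

Here `counts` maps each token to the number of times it occurred. In current mrjob releases the equivalent step is declared with `MRStep(mapper=..., reducer=...)` from `mrjob.step` rather than `self.mr(...)`.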
