forked from apache/accumulo
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
410 lines (310 loc) · 18.1 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
******************************************************************************
0. Introduction
Apache Accumulo is a sorted, distributed key/value store based on Google's
BigTable design. It is built on top of Apache Hadoop, Zookeeper, and Thrift. It
features a few novel improvements on the BigTable design in the form of
cell-level access labels and a server-side programming mechanism that can modify
key/value pairs at various points in the data management process.
******************************************************************************
1. Building
In the normal tarball or RPM release of accumulo, everything is built and
ready to go on x86 GNU/Linux: there is no build step.
However, if you only have source code, or you wish to make changes, you need to
have maven configured to get Accumulo prerequisites from repositories. See
the pom.xml file for the necessary components.
You can build an Accumulo binary distribution, which is created in the
assemble/target directory, using the following command. Note that maven 3
is required starting with Accumulo v1.5.0. By default, Accumulo compiles
against Hadoop 2.2.0, but these artifacts should be compatible with Apache
Hadoop 1.2.x or Apache Hadoop 2.2.x releases.
mvn package -P assemble
By default, Accumulo compiles against Apache Hadoop 2.2.0. To compile against
a different Hadoop 2-compatible version, specify the profile and version,
e.g. "-Dhadoop.version=0.23.5".
To compile against Apache Hadoop 1.2.1, or a different version that is compatible
with Hadoop 1, specify hadoop.profile and hadoop.version on the command line,
e.g. "-Dhadoop.profile=1 -Dhadoop.version=0.20.205.0" or
"-Dhadoop.profile=1 -Dhadoop.version=1.1.0".
If you are running on another Unix-like operating system (OSX, etc) then
you may wish to build the native libraries. They are not strictly necessary
but having them available suppresses a runtime warning and enables Accumulo
to run faster. You can execute the following script to automatically unpack
and install the native library. Be sure to have a JDK and a C++ compiler
installed with the JAVA_HOME environment variable set.
$ ./bin/build_native_library.sh
Alternatively, you can manually unpack the accumulo-native tarball in the
$ACCUMULO_HOME/lib directory. Change to the accumulo-native directory in
the current directory and issue `make`. Then, copy the resulting 'libaccumulo'
library into the $ACCUMULO_HOME/lib/native/map.
$ mkdir -p $ACCUMULO_HOME/lib/native/map
$ cp libaccumulo.* $ACCUMULO_HOME/lib/native/map
If you want to build the debian release, use the command "mvn package -Pdeb" to
generate the .deb files in the target/ directory. Please follow the steps at
https://cwiki.apache.org/BIGTOP/how-to-install-hadoop-distribution-from-bigtop.html
to add bigtop to your debian sources list. This will make it substantially
easier to install.
Building Documentation
Use the following command to build the User Manual (docs/target/accumulo_user_manual.pdf)
and the configuration HTML page (docs/target/config.html)
mvn package -P docs -DskipTests
******************************************************************************
2. Deployment
Copy the accumulo tar file produced by mvn package from the assemble/target/
directory to the desired destination, then untar it (e.g.
tar xzf accumulo-1.6.0-bin.tar.gz).
If you are using the RPM, install the RPM on every machine that will run
accumulo.
Another option is to package Accumulo directly to a working directory. For example,
mvn package -DskipTests -DDEV_ACCUMULO_HOME=/var/tmp
The above command would create a directory with a name similar to
/var/tmp/accumulo-1.6.0-dev/accumulo-1.6.0/, containing all the contents
that are normally contained in accumulo-1.6.0-bin.tar.gz, but already unpacked.
If the DEV_ACCUMULO_HOME parameter is not specified, this directory would
normally be created in assemble/target, but that is subject to deletion by
the 'mvn clean' command. Specifying an external directory would not be subject
to 'mvn clean'. When executed more than once, newer files overwrite older files,
and files a user adds (such as configuration files in conf/) will be left alone.
If HDFS and Zookeeper are running, you can run Accumulo directly from this
working directory. See the 'Running Apache Accumulo' section later in this document.
You can avoid specifying the working directory each time you compile by adding
a profile to maven's settings.xml file. Below is an example of $HOME/.m2/settings.xml
<settings xmlns="http://maven.apache.org/SETTINGS/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd">
<profiles>
<profile>
<id>inject-accumulo-home</id>
<properties>
<DEV_ACCUMULO_HOME>/var/tmp</DEV_ACCUMULO_HOME>
</properties>
</profile>
</profiles>
<activeProfiles>
<activeProfile>inject-accumulo-home</activeProfile>
</activeProfiles>
</settings>
******************************************************************************
3. Upgrading from 1.4 to 1.5
This happens automatically the first time Accumulo 1.5 is started.
* Stop the 1.4 instance.
* Configure 1.5 to use the hdfs directory, walog directories, and zookeepers
that 1.4 was using.
* Copy other 1.4 configuration options as needed.
* Start Accumulo 1.5.
******************************************************************************
4. Configuring
Apache Accumulo has two prerequisites, hadoop and zookeeper. Zookeeper must be
at least version 3.3.0. Both of these must be installed and configured. Some
versions of Zookeeper may only allow 10 connections from one computer by default.
On a single-host install, this number is a little too low. Add the following to
the $ZOOKEEPER_HOME/conf/zoo.cfg file:
maxClientCnxns=100
Ensure you (or the some special hadoop user account) have accounts on all of
the machines in the cluster and that hadoop and accumulo install files can be
found in the same location on every machine in the cluster. You will need to
have password-less ssh set up as described in the hadoop documentation.
You will need to have hadoop installed and configured on your system. Accumulo
1.6.0 has been tested with hadoop version 1.0.4. To avoid data loss,
you must enable HDFS durable sync. How you enable this depends on your version
of Hadoop. Please consult the table below for information regarding your version.
If you need to set the coniguration, please be sure to restart HDFS. See
ACCUMULO-623 and ACCUMULO-1637 for more information.
The following releases of Apache Hadoop require special configuration to ensure
that data is not inadvertently lost; however, in all releases of Apache Hadoop,
`dfs.durable.sync` and `dfs.support.append` should *not* be configured as `false`.
VERSION NAME=VALUE
0.20.205.0 - dfs.support.append=true
1.0.x - dfs.support.append=true
Additionally, it is strongly recommended that you enable 'dfs.datanode.synconclose'
in your hdfs-site.xml configuration file to ensure that, in the face of unexpected
power loss to a datanode, files are wholly synced to disk.
Additionally, it is strongly recommended that you enable 'dfs.datanode.synconclose'
(only available in Apache Hadoop >=1.1.1 or >=0.23) in your hdfs-site.xml configuration
file to ensure that, in the face of unexpected power loss to a datanode, files are
wholly synced to disk.
The example accumulo configuration files are placed in directories based on the
memory footprint for the accumulo processes. If you are using native libraries
for you tablet server in-memory map, then you can use the files in "native-standalone".
If you get warnings about not being able to load the native libraries, you can
use the configuration files in "standalone".
For testing on a single computer, use a fairly small configuration:
$ cp conf/examples/512MB/native-standalone/* conf
Please note that the footprints are for only the Accumulo system processes, so
ample space should be left for other processes like hadoop, zookeeper, and the
accumulo client code. These directories must be at the same location on every
node in the cluster.
If you are configuring a larger cluster you will need to create the configuration
files yourself and propogate the changes to the $ACCUMULO_CONF_DIR directories:
Create a "slaves" file in $ACCUMULO_CONF_DIR/. This is a list of machines
where tablet servers and loggers will run.
Create a "masters" file in $ACCUMULO_CONF_DIR/. This is a list of
machines where the master server will run.
Create conf/accumulo-env.sh following the template of
example/3GB/native-standalone/accumulo-env.sh.
However you create your configuration files, you will need to set
JAVA_HOME, HADOOP_HOME, and ZOOKEEPER_HOME in conf/accumulo-env.sh
Note that zookeeper client jar files must be installed on every machine, but
the server should not be run on every machine.
Create the $ACCUMULO_LOG_DIR on every machine in the slaves file.
* Note that you will be specifying the Java heap space in accumulo-env.sh.
You should make sure that the total heap space used for the accumulo tserver,
logger and the hadoop datanode and tasktracker is less than the available
memory on each slave node in the cluster. On large clusters, it is recommended
that the accumulo master, hadoop namenode, secondary namenode, and hadoop
jobtracker all be run on separate machines to allow them to use more heap
space. If you are running these on the same machine on a small cluster, make
sure their heap space settings fit within the available memory. The zookeeper
instances are also time sensitive and should be on machines that will not be
heavily loaded, or over-subscribed for memory.
Edit conf/accumulo-site.xml. You must set the zookeeper servers in this
file (instance.zookeeper.host). Look at docs/config.html to see what
additional variables you can modify and what the defaults are.
It is advisable to change the instance secret (instance.secret) to some new
value. Also ensure that the accumulo-site.xml file is not readable by other
users on the machine.
Synchronize your accumulo conf directory across the cluster. As a precaution
against mis-configured systems, servers using different configuration files
will not communicate with the rest of the cluster.
Accumulo requires the hadoop "commons-io" java package. This is normally
distributed with hadoop. However, it was not distributed with hadoop-0.20.
If your hadoop distribution does not provide this package, you will need
to obtain it and put the commons-io jar file in $ACCUMULO_HOME/lib. See the
pom.xml file for version information.
******************************************************************************
5. Running Apache Accumulo
Make sure hadoop is configured on all of the machines in the cluster, including
access to a shared hdfs instance. Make sure hdfs is running.
Make sure zookeeper is configured and running on at least one machine in the
cluster.
Run "bin/accumulo init" to create the hdfs directory structure
(hdfs:///accumulo/*) and initial zookeeper settings. This will also allow you
to also configure the initial root password. Only do this once.
Start accumulo using the bin/start-all.sh script.
Use the "bin/accumulo shell -u <username>" command to run an accumulo shell
interpreter. Within this interpreter, run "createtable <tablename>" to create
a table, and run "table <tablename>" followed by "scan" to scan a table.
In the example below a table is created, data is inserted, and the table is
scanned.
$ ./bin/accumulo shell -u root
Enter current password for 'root'@'accumulo': ******
Shell - Apache Accumulo Interactive Shell
-
- version: 1.5.0
- instance name: accumulo
- instance id: f5947fe6-081e-41a8-9877-43730c4dfc6f
-
- type 'help' for a list of available commands
-
root@ac> createtable foo
root@ac foo> insert row1 colf1 colq1 val1
root@ac foo> insert row1 colf1 colq2 val2
root@ac foo> scan
row1 colf1:colq1 [] val1
row1 colf1:colq2 [] val2
The example below start the shell, switches to table foo, and scans for a
certain column.
$ ./bin/accumulo shell -u root
Enter current password for 'root'@'accumulo': ******
Shell - Apache Accumulo Interactive Shell
-
- version: 1.5.0
- instance name: accumulo
- instance id: f5947fe6-081e-41a8-9877-43730c4dfc6f
-
- type 'help' for a list of available commands
-
root@ac> table foo
root@ac foo> scan -c colf1:colq2
row1 colf1:colq2 [] val2
If you are running on top of hdfs with kerberos enabled, then you need to do
some extra work. First, create an Accumulo principal
kadmin.local -q "addprinc -randkey accumulo/<host.domain.name>"
where <host.domain.name> is replaced by a fully qualified domain name. Export
the principals to a keytab file. It is safer to create a unique keytab file for each
server, but you can also glob them if you wish.
kadmin.local -q "xst -k accumulo.keytab -glob accumulo*"
Place this file in $ACCUMULO_CONF_DIR for every host. It should be owned by
the accumulo user and chmodded to 400. Add the following to the accumulo-env.sh
kinit -kt $ACCUMULO_HOME/conf/accumulo.keytab accumulo/`hostname -f`
In the accumulo-site.xml file on each node, add settings for general.kerberos.keytab
and general.kerberos.principal, where the keytab setting is the absolute path
to the keytab file ($ACCUMULO_HOME is valid to use) and principal is set to
accumulo/_HOST@<REALM>, where REALM is set to your kerberos realm. You may use
_HOST in lieu of your individual host names.
<property>
<name>general.kerberos.keytab</name>
<value>$ACCUMULO_CONF_DIR/accumulo.keytab</value>
</property>
<property>
<name>general.kerberos.principal</name>
<value>accumulo/_HOST@MYREALM</value>
</property>
You can then start up Accumulo as you would with the accumulo user, and it will
automatically handle the kerberos keys needed to access hdfs.
Please Note: You may have issues initializing Accumulo while running kerberos HDFS.
You can resolve this by temporarily granting the accumulo user write access to the
hdfs root directory, running init, and then revoking write permission in the root
directory (be sure to maintain access to the /accumulo directory).
******************************************************************************
6. Monitoring Apache Accumulo
You can point your browser to the master host, on port 50095 to see the status
of accumulo across the cluster. You can even do this with the text-based
browser "links":
$ links http://localhost:50095
From this GUI, you can ensure that tablets are assigned, tables are online,
tablet servers are up. You can monitor query and ingest rates across the
cluster.
******************************************************************************
7. Stopping Apache Accumulo
Do not kill the tabletservers or run bin/tdown.sh unless absolutely necessary.
Recovery from a catastrophic loss of servers can take a long time. To shutdown
cleanly, run "bin/stop-all.sh" and the master will orchestrate the shutdown of
all the tablet servers. Shutdown waits for all writes to finish, so it may
take some time for particular configurations.
******************************************************************************
8. Logging
DEBUG and above are logged to the logs/ dir. To modify this behavior change
the scripts in conf/. To change the logging dir, set ACCUMULO_LOG_DIR in
conf/accumulo-env.sh. Stdout and stderr of each accumulo process is
redirected to the log dir.
******************************************************************************
9. API
The public accumulo API is composed of :
* everything under org.apache.accumulo.core.client, excluding impl packages
* Key, Mutation, Value, and Range in org.apache.accumulo.core.data.
* org.apache.accumulo.server.mini
To get started using accumulo review the example and the javadoc for the
packages and classes mentioned above.
******************************************************************************
10. Performance Tuning
Apache Accumulo has exposed several configuration properties that can be
changed. These properties and configuration management are described in detail
in docs/config.html. While the default value is usually optimal, there are
cases where a change can increase query and ingest performance.
Before changing a property from its default in a production system, you should
develop a good understanding of the property and consider creating a test to
prove the increased performance.
******************************************************************************
11. Export Control
This distribution includes cryptographic software. The country in which you
currently reside may have restrictions on the import, possession, use, and/or
re-export to another country, of encryption software. BEFORE using any
encryption software, please check your country's laws, regulations and
policies concerning the import, possession, or use, and re-export of encryption
software, to see if this is permitted. See <http://www.wassenaar.org/> for more
information.
The U.S. Government Department of Commerce, Bureau of Industry and Security
(BIS), has classified this software as Export Commodity Control Number (ECCN)
5D002.C.1, which includes information security software using or performing
cryptographic functions with asymmetric algorithms. The form and manner of this
Apache Software Foundation distribution makes it eligible for export under the
License Exception ENC Technology Software Unrestricted (TSU) exception (see the
BIS Export Administration Regulations, Section 740.13) for both object code and
source code.
The following provides more details on the included cryptographic software: ...
Apache Accumulo uses the built-in java cryptography libraries in it's RFile
encryption implementation. See
http://www.oracle.com/us/products/export/export-regulations-345813.html
for more details for on Java's cryptography features. Apache Accumulo also uses
the bouncycastle library for some crypographic technology as well. See
http://www.bouncycastle.org/wiki/display/JA1/Frequently+Asked+Questions for
more details on bouncycastle's cryptography features.
******************************************************************************