PlinyCompute is a platform for high-performance distributed tool and library development written in C++. It can be deployed in two different cluster modes: standalone or distributed. This documents the API for v0.5.0, released on June 13, 2018. Visit the official web site for more details: http://plinycompute.rice.edu.
This research was developed with funding from the Defense Advanced Research Projects Agency (DARPA) MUSE award #FA8750-14-2-0270.
Disclaimer: The views, opinions, and/or findings contained in this research are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.
PlinyCompute has been compiled, built, and tested on Ubuntu 16.04.4 LTS. For the system to work properly, make sure the following requirements are satisfied.
These environment variables have to be set:

* `PDB_HOME`, e.g. `~/plinycompute`
* `PDB_INSTALL`, e.g. `/tmp/pdb_install`

Note: the values of these variables are arbitrary (and they do not have to match), but make sure that you have proper permissions on the remote machines to create folders and write to files. In this example, PlinyCompute is installed in `/home/ubuntu/plinycompute` on the manager node, and in `/tmp/pdb_install` on the worker nodes.

```bash
$ export PDB_HOME=/home/ubuntu/plinycompute
$ export PDB_INSTALL=/tmp/pdb_install
```
PlinyCompute running in a distributed cluster uses key-based authentication. Thus, a private key has to be generated and placed in `$PDB_HOME/conf`. The file name of the key is arbitrary; by default it is `pdb-key.pem`. Make sure it has the following permissions:

```bash
$ chmod 400 $PDB_HOME/conf/pdb-key.pem
```
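One possible way to create such a key (a sketch assuming OpenSSH's `ssh-keygen` is installed, and an `ubuntu` user on the workers; neither is mandated by PlinyCompute):

```bash
# Generate an RSA key pair; the file name pdb-key.pem matches the default above
ssh-keygen -t rsa -b 2048 -f pdb-key.pem -N "" -q
# Restrict permissions as required
chmod 400 pdb-key.pem
# Then move pdb-key.pem into $PDB_HOME/conf and copy the public key to each
# worker node, e.g.: ssh-copy-id -i pdb-key.pem.pub ubuntu@<worker_ip>
ls -l pdb-key.pem pdb-key.pem.pub
```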
In addition to Python (version 2.7.12 or greater), the table below lists the libraries and packages required by PlinyCompute. If any of them is missing on your system, run the following script to install them:

```bash
$ $PDB_HOME/scripts/internal/setupDependencies.py
```
Library/Software | Packages |
---|---|
Cmake | cmake 3.5.1 |
Clang | clang version 3.8.0-2ubuntu4 |
Snappy | libsnappy1v5, libsnappy-dev |
GSL | libgsl-dev |
Boost | libboost-dev, libboost-program-options-dev, libboost-filesystem-dev, libboost-system-dev |
Bison | bison |
Flex | flex |
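A quick way to see which of these tools are already on your `PATH` (a convenience sketch, not part of the PlinyCompute scripts):

```bash
# Report which required build tools are installed
for tool in cmake clang bison flex python; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "found: $tool"
  else
    echo "missing: $tool"
  fi
done
```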
This command will download PlinyCompute in a directory named `plinycompute`. Make sure you are in that directory by typing:

```bash
$ cd plinycompute
```

On a Linux machine the prompt should look similar to:

```bash
ubuntu:~/plinycompute$
```

The following command compiles and builds the executables `pdb-manager` and `pdb-worker` in the `bin` folder; these are invoked by different scripts as described in the sections below.

```bash
$ make pdb-main
```

The following command builds the executable `./bin/TestKMeans`. This example will be used in step 3 in the next section, once the cluster is launched.

```bash
$ make TestKMeans
```

Notes:
* `shared-libraries` builds common shared libraries that are used by the different applications (build this first).

Make targets for different applications:
Target | Description | Source Code |
---|---|---|
shared-libraries | Common shared libraries used by these applications | Shared Libraries |
build-ml-tests | Machine learning executables and their dependencies | Gaussian Mixture Model, Latent Dirichlet Allocation, K-Means |
build-la-tests | Linear algebra executables and their dependencies | Linear Algebra |
build-tpch-tests | TPCH executables and their dependencies | TPCH Benchmark |
build-tests | Unit tests executables | Unit Tests |
build-integration-tests | Integration tests executables | Integration Tests |
Depending on the target you want to build, issue the following command:

```bash
$ make -j <number-of-jobs> <target>
```

replacing:

* `<number-of-jobs>` with a number (this allows executing multiple recipes in parallel)
* `<target>` with one target from the table above

For example, the following command compiles and builds the executables and shared libraries for the machine learning application in the `bin` folder:

```bash
$ make -j 4 build-ml-tests
```
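A common convention (not specific to PlinyCompute) is to match the job count to the number of CPU cores, which on GNU/Linux can be queried with `nproc`:

```bash
# One make job per CPU core; build-ml-tests is one of the targets listed above
JOBS=$(nproc)
echo "would run: make -j $JOBS build-ml-tests"
```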
For developers who want to add new functionality to PlinyCompute, here are examples of how to build and run unit and integration tests.
PlinyCompute can be launched in two modes: standalone or distributed.
In the above output, `pdb-manager` is the manager process running on localhost and listening on port 8108. The two `pdb-worker` processes correspond to one worker node (each worker node runs a front-end and a back-end process), listening on port 8109 and connected to the manager process on port 8108.

Run the `TestKMeans` example built in step 4 in the previous section (see KMeans for more details):

```bash
$ ./bin/TestKMeans Y Y localhost Y 3 3 0.00001 applications/TestKMeans/kmeans_data
```

As the program is executed, the output will be displayed on the screen. Congratulations, you have successfully installed, deployed, and tested PlinyCompute!

Although running PlinyCompute on one machine (e.g. a laptop) is ideal for becoming familiar with the system and testing some of its functionality, PlinyCompute's high-performance properties are best suited for processing large data loads in a real distributed cluster, whether on Amazon AWS, on-premise, or another cloud provider. To accomplish this, follow these steps:
This command downloads PlinyCompute in a folder named `plinycompute`. Make sure you are in that directory. On a Linux machine the prompt should look similar to:

```bash
ubuntu:~/plinycompute$
```

Edit the `$PDB_HOME/conf/serverlist` file, and add the public IP addresses of the worker nodes (machines) in the cluster, one IP address per line. Below is a partial listing of such a file (replace the IPs with your own; the ones shown here are fictitious):

```bash
192.168.1.1
192.168.1.2
192.168.1.3
```
In the above example, the cluster will include one manager node (where PlinyCompute was cloned), and three worker nodes.
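Before installing, it can help to sanity-check the file's format. The sketch below (a hypothetical helper, not part of PlinyCompute) writes a sample `serverlist` with the fictitious IPs from above and flags any line that is not a dotted-quad address:

```bash
# Build a sample serverlist (replace these fictitious IPs with your own)
mkdir -p conf
printf '192.168.1.1\n192.168.1.2\n192.168.1.3\n' > conf/serverlist
# Flag any line that does not look like an IPv4 address
while read -r ip; do
  if [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]]; then
    echo "ok: $ip"
  else
    echo "bad: $ip"
  fi
done < conf/serverlist
```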
This will generate two executables in the directory `$PDB_HOME/bin`:

```bash
pdb-manager
pdb-worker
```

The following script installs PlinyCompute on the worker nodes, in the location given by the `$PDB_INSTALL` environment variable. Note: the script's argument is the pem file required to connect to the machines in the cluster.

```bash
$ $PDB_HOME/scripts/install.sh conf/pdb-key.pem
```

After completion, the output should look similar to the one below (partial display shown here for clarity):

```bash
Results of script install.sh:
*** Failed results (0/3) ***

*** Successful results (3/3) ***
Worker node with IP: 192.168.1.1 successfully installed.
Worker node with IP: 192.168.1.2 successfully installed.
Worker node with IP: 192.168.1.3 successfully installed.
```
At this point all executable programs and scripts are properly installed on the worker nodes!
Where the following arguments are required:

* `<cluster_type>` should be `distributed`
* `<manager_node_ip>` the public IP address of the manager node
* `<pem_file>` the pem file that allows connecting to the machines in the cluster

The last two arguments are optional:

* `<num_threads>` the number of CPU cores on each worker node that PlinyCompute will use
* `<shared_memory>` the amount of RAM on each worker node (in megabytes) that PlinyCompute will use

In the following example, the public IP address of the manager node is `192.168.1.0` and the pem file is `conf/pdb-key.pem`; by default the cluster is launched with 1 thread and 2GB of memory.

```bash
$ $PDB_HOME/scripts/startCluster.sh distributed 192.168.1.0 conf/pdb-key.pem
```
If you want to launch the cluster with more threads and memory, provide the fourth and fifth arguments. In the following example, a cluster is launched with 4 threads and 4GB of memory. For more information about tuning PlinyCompute, visit the system configuration page.

```bash
$ $PDB_HOME/scripts/startCluster.sh distributed 192.168.1.0 conf/pdb-key.pem 4 4096
```
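When choosing values for `<num_threads>` and `<shared_memory>`, one rough rule of thumb (an assumption here, not an official recommendation) is to use all cores and leave the OS some RAM headroom:

```bash
# Query core count and total RAM (GNU/Linux: nproc and /proc/meminfo)
THREADS=$(nproc)
MEM_MB=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 ))
# Arbitrary choice: give PlinyCompute half of physical memory
SHARED_MEM=$(( MEM_MB / 2 ))
echo "suggested: startCluster.sh distributed <manager_node_ip> conf/pdb-key.pem $THREADS $SHARED_MEM"
```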
After completion, the output should look similar to the one below (partial display shown here for clarity):

```bash
Results of script startCluster.sh:
*** Failed results (0/3) ***

*** Successful results (3/3) ***
Worker node with IP: 192.168.1.1 successfully started.
Worker node with IP: 192.168.1.2 successfully started.
Worker node with IP: 192.168.1.3 successfully started.
```
To run the `TestKMeans` example in this cluster, issue the following command. Note that although the executable and arguments are the same as the ones in the previous section, because the program is sent to a distributed PlinyCompute deployment, the computations are distributed to the three worker nodes for execution.

```bash
$ ./bin/TestKMeans Y Y localhost Y 3 3 0.00001 applications/TestKMeans/kmeans_data
```

To stop a running cluster of PlinyCompute, issue the following command if you are running a distributed cluster:

```bash
$ $PDB_HOME/scripts/stopCluster.sh distributed conf/pdb-key.pem
```
Conversely, use this one if you are running a standalone cluster: ```bash $ $PDB_HOME/scripts/stopCluster.sh standalone ```
To remove data in a PlinyCompute cluster, execute the following script. Note that the value of the first argument is `distributed`, meaning this will clean data in a real cluster. Warning: this removes all PlinyCompute stored data and catalog metadata from the entire cluster; use it carefully.

```bash
$ $PDB_HOME/scripts/cleanup.sh distributed conf/pdb-key.pem
```
If you are running a standalone cluster, run the following script:

```bash
$ $PDB_HOME/scripts/cleanup.sh standalone
```
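Because `cleanup.sh` irreversibly deletes data, a small confirmation wrapper (hypothetical, not part of PlinyCompute) can guard against accidental runs:

```bash
# Only proceed when the caller explicitly passes "yes"
confirm_cleanup() {
  if [ "$1" = "yes" ]; then
    echo "proceeding: \$PDB_HOME/scripts/cleanup.sh distributed conf/pdb-key.pem"
    # "$PDB_HOME/scripts/cleanup.sh" distributed conf/pdb-key.pem
  else
    echo "aborted"
  fi
}
confirm_cleanup no   # prints: aborted
```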
To upgrade the PlinyCompute binaries on a running distributed cluster, stop it, run the upgrade script, and start it again:

```bash
$ $PDB_HOME/scripts/stopCluster.sh distributed conf/pdb-key.pem
$ $PDB_HOME/scripts/internal/upgrade.sh conf/pdb-key.pem
$ $PDB_HOME/scripts/startCluster.sh distributed 192.168.1.0 conf/pdb-key.pem
```