A platform for high-performance distributed tool and library development written in C++. It can be deployed in two different cluster modes: standalone or distributed. API for v0.5.0, released on June 13, 2018.
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros
pdb::QuerySchedulerServer Class Reference

#include <QuerySchedulerServer.h>

+ Inheritance diagram for pdb::QuerySchedulerServer:
+ Collaboration diagram for pdb::QuerySchedulerServer:

Public Member Functions

 QuerySchedulerServer (PDBLoggerPtr logger, ConfigurationPtr conf, std::shared_ptr< StatisticsDB > statisticsDB, bool pseudoClusterMode=false, double partitionToCoreRatio=0.75)
 
 QuerySchedulerServer (int port, PDBLoggerPtr logger, ConfigurationPtr conf, std::shared_ptr< StatisticsDB > statisticsDB, bool pseudoClusterMode=false, double partitionToCoreRatio=0.75)
 
 ~QuerySchedulerServer ()
 
void registerHandlers (PDBServer &forMe) override
 
void collectStats ()
 
void cleanup () override
 
StatisticsPtr getStats ()
 
std::string getNextJobId ()
 
- Public Member Functions inherited from pdb::ServerFunctionality
template<class Functionality >
Functionality & getFunctionality ()
 
void recordServer (PDBServer &recordMe)
 
PDBWorkerPtr getWorker ()
 
PDBLoggerPtr getLogger ()
 

Protected Member Functions

void initialize ()
 
void initializeForServerMode ()
 
void initializeForPseudoClusterMode ()
 
void scheduleStages (std::vector< Handle< AbstractJobStage >> &stagesToSchedule, std::shared_ptr< ShuffleInfo > shuffleInfo)
 
void prepareAndScheduleStage (Handle< AbstractJobStage > &stage, unsigned long node, int &counter, PDBBuzzerPtr &callerBuzzer)
 
template<typename T >
bool scheduleStage (unsigned long node, Handle< T > &stage, PDBCommunicatorPtr communicator)
 
PDBCommunicatorPtr getCommunicatorToNode (int port, std::string &ip)
 
Handle< TupleSetJobStagegetStageToSend (unsigned long index, Handle< TupleSetJobStage > &stage)
 
Handle< AggregationJobStagegetStageToSend (unsigned long index, Handle< AggregationJobStage > &stage)
 
Handle
< BroadcastJoinBuildHTJobStage
getStageToSend (unsigned long index, Handle< BroadcastJoinBuildHTJobStage > &stage)
 
Handle
< HashPartitionedJoinBuildHTJobStage
getStageToSend (unsigned long index, Handle< HashPartitionedJoinBuildHTJobStage > &stage)
 
void collectStatsForNode (int node, int &counter, PDBBuzzerPtr &callerBuzzer)
 
void updateStats (Handle< SetIdentifier > setToUpdateStats)
 
pair< bool, basic_string< char > > executeComputation (Handle< ExecuteComputation > &request, PDBCommunicatorPtr &sendUsingMe)
 
pair< bool, basic_string< char > > registerReplica (Handle< RegisterReplica > &request, PDBCommunicatorPtr &sendUsingMe)
 
void extractPipelineStages (int &jobStageId, vector< Handle< AbstractJobStage >> &jobStages, vector< Handle< SetIdentifier >> &intermediateSets)
 
void createIntermediateSets (DistributedStorageManagerClient &dsmClient, vector< Handle< SetIdentifier >> &intermediateSets)
 
void removeIntermediateSets (DistributedStorageManagerClient &dsmClient)
 
void removeUnusedIntermediateSets (DistributedStorageManagerClient &dsmClient, vector< Handle< SetIdentifier >> &intermediateSets)
 
void requestStatistics (PDBCommunicatorPtr &communicator, bool &success, string &errMsg) const
 

Protected Attributes

std::vector
< StandardResourceInfoPtr > * 
standardResources
 
int port
 
std::vector< Handle
< SetIdentifier > > 
interGlobalSets
 
PDBLoggerPtr logger
 
ConfigurationPtr conf
 
std::shared_ptr< StatisticsDBstatisticsDB
 
bool pseudoClusterMode
 
pthread_mutex_t connection_mutex
 
SequenceID seqId
 
std::string jobId
 
double partitionToCoreRatio
 
std::shared_ptr
< PhysicalOptimizer
physicalOptimizerPtr
 
StatisticsPtr statsForOptimization
 
std::shared_ptr< ShuffleInfoshuffleInfo
 

Detailed Description

This class is working on the Manager node to schedule JobStages dynamically from TCAP logical plan So far following JobStages are supported:

TupleSetJobStageAggregationJobStageBroadcastJoinBuildHTJobStage – HashPartitionJoinBuildHTJobStage

Once the QuerySchedulerServer receives a request (A TCAP program) in the form of an

See Also
pdb::ExecuteComputation object it parses and analyzes the TCAP program as a DAG based on a cost model using a greedy algorithm. The goal is to minimize intermediate data. The scheduling is dynamic and lazy, and only one the JobStages scheduled last time were executed, it will schedule later stages, to maximize the information needed. The JobStages will be dispatched to all workers for execution.

Definition at line 62 of file QuerySchedulerServer.h.

Constructor & Destructor Documentation

pdb::QuerySchedulerServer::QuerySchedulerServer ( PDBLoggerPtr  logger,
ConfigurationPtr  conf,
std::shared_ptr< StatisticsDB statisticsDB,
bool  pseudoClusterMode = false,
double  partitionToCoreRatio = 0.75 
)

Constructor for the case when we assume that the port where we are running the server is 8108

Parameters
loggeran instance of the PDBLogger
conf
pseudoClusterModeindicator whether we are running in the pseudo cluster mode or not
partitionToCoreRatiothe ratio between the number of partitions on a node and the number of cores

Definition at line 43 of file QuerySchedulerServer.cc.

pdb::QuerySchedulerServer::QuerySchedulerServer ( int  port,
PDBLoggerPtr  logger,
ConfigurationPtr  conf,
std::shared_ptr< StatisticsDB statisticsDB,
bool  pseudoClusterMode = false,
double  partitionToCoreRatio = 0.75 
)

Constructor for the case when where we specify the port number of the server

Parameters
portis the port on which the server that contains this functionality is running
loggeran instance of the PDBLogger
conf
pseudoClusterModeindicator whether we are running in the pseudo cluster mode or not
partitionToCoreRatiothe ratio between the number of partitions on a node and the number of cores

Definition at line 61 of file QuerySchedulerServer.cc.

pdb::QuerySchedulerServer::~QuerySchedulerServer ( )

Destructor used to destroy the QuerySchedulerServer

Definition at line 39 of file QuerySchedulerServer.cc.

Member Function Documentation

void pdb::QuerySchedulerServer::cleanup ( )
overridevirtual

Cleans up the query scheduler server so it can be used for the next computation

Reimplemented from pdb::ServerFunctionality.

Definition at line 79 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::collectStats ( )

Collects the statistics about the sets from each node and updates them [statsForOptimization]

Definition at line 452 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::collectStatsForNode ( int  node,
int &  counter,
PDBBuzzerPtr callerBuzzer 
)
protected

Collects the stats for one node

Parameters
nodethe node we are collecting the stats for
counterthe counter that is increased when we are finishing updating the stats
callerBuzzerthe buzzer that signals that we are finished

Definition at line 491 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::createIntermediateSets ( DistributedStorageManagerClient dsmClient,
vector< Handle< SetIdentifier >> &  intermediateSets 
)
protected

Given a vector of SetIdentifiers this method issues their creation

Parameters
dsmClientan instance of the DistributedStorageManagerClient that needs to create the sets
intermediateSetsthe vector of intermediate sets

Definition at line 803 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

pair< bool, basic_string< char > > pdb::QuerySchedulerServer::executeComputation ( Handle< ExecuteComputation > &  request,
PDBCommunicatorPtr sendUsingMe 
)
protected

This method executes a PDB computation given by the ExecuteComputation object, that was sent by a client

Parameters
requestthe object that describes the computation
sendUsingMean instance of the PDBCommunicator that points to the client

do the physical planning

create intermediate sets

schedule this job stages

Definition at line 633 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::extractPipelineStages ( int &  jobStageId,
vector< Handle< AbstractJobStage >> &  jobStages,
vector< Handle< SetIdentifier >> &  intermediateSets 
)
protected

This method finds the best source operator using a heuristic, then uses this operator to extract a sequence of of pipelinable stages.

Parameters
jobStageIdthe of last executed job stage
jobStagesa vector where we want to store the sequence of jobStages
intermediateSetsa vector where we want to store the information about the intermediate sets that need to be generated

Definition at line 829 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

PDBCommunicatorPtr pdb::QuerySchedulerServer::getCommunicatorToNode ( int  port,
std::string &  ip 
)
protected

Connects to the node with the provided ip and port and returns an instance of the communicator

See Also
pdb::PDBCommunicator
Parameters
portthe port of the node
ipthe ip address of the node
Returns
returns null_ptr if it fails, the communicator to the node if it succeeds

Definition at line 272 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

std::string pdb::QuerySchedulerServer::getNextJobId ( )
inline

Returns the id of the job we are about to run Has the following format Job_Year_Month_Day_Hour-Minute_Second_{Next number from the seqId sequence generator}

Returns
the id of the next job, a string

Definition at line 127 of file QuerySchedulerServer.h.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

Handle< TupleSetJobStage > pdb::QuerySchedulerServer::getStageToSend ( unsigned long  index,
Handle< TupleSetJobStage > &  stage 
)
protected

Makes a deep copy of the TupleSetJobStage, fills in additional information about the node we are sending it and returns it

Parameters
stagean instance of the TupleSetJobStage
Returns
the copy with additional information

Definition at line 342 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

Handle< AggregationJobStage > pdb::QuerySchedulerServer::getStageToSend ( unsigned long  index,
Handle< AggregationJobStage > &  stage 
)
protected

Makes a deep copy of the AggregationJobStage, fills in additional information about the node we are sending it and returns it

Parameters
stagean instance of the AggregationJobStage
Returns
the copy with additional information

Definition at line 387 of file QuerySchedulerServer.cc.

Handle< BroadcastJoinBuildHTJobStage > pdb::QuerySchedulerServer::getStageToSend ( unsigned long  index,
Handle< BroadcastJoinBuildHTJobStage > &  stage 
)
protected

Makes a deep copy of the stage provided and detaches it from the logical plan, by setting it to null and fill in additional information about the node we are sending it to

Parameters
stagean instance of the BroadcastJoinBuildHTJobStage
Returns
the copy with additional information

Definition at line 411 of file QuerySchedulerServer.cc.

Handle< HashPartitionedJoinBuildHTJobStage > pdb::QuerySchedulerServer::getStageToSend ( unsigned long  index,
Handle< HashPartitionedJoinBuildHTJobStage > &  stage 
)
protected

Makes a deep copy of the stage provided and detaches it from the logical plan, by setting it to null and fill in additional information about the node we are sending it to

Parameters
stagean instance of the HashPartitionedJoinBuildHTJobStage
Returns
the copy with additional information

Definition at line 427 of file QuerySchedulerServer.cc.

StatisticsPtr pdb::QuerySchedulerServer::getStats ( )

Returns the statistics that are being used for optimization

Returns
the statistics

Definition at line 158 of file QuerySchedulerServer.cc.

void pdb::QuerySchedulerServer::initialize ( )
protected

Initializes the QuerySchedulerServer more specifically the

See Also
pdb::QuerySchedulerServer::standardResources by fetching the necessary information from the resource manager

Definition at line 92 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::initializeForPseudoClusterMode ( )
protected

TODO Ask Jia what the difference is between a resource and a node add a proper description

Definition at line 107 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::initializeForServerMode ( )
protected

TODO Ask Jia what the difference is between a resource and a node add a proper description

Definition at line 131 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::prepareAndScheduleStage ( Handle< AbstractJobStage > &  stage,
unsigned long  node,
int &  counter,
PDBBuzzerPtr callerBuzzer 
)
protected

This method takes in an

See Also
pdb::AbstractJobStage infers its subtype, opens up a communicator to the specified node and schedules the stage at it.
Parameters
stagethe stage we want to send
nodethe node we want to send the stage to
countera reference to the counter that needs to be increased once the execution of the stage is complete
callerBuzzerthe buzzer we use to notify the calling thread that we are done executing

Definition at line 200 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::registerHandlers ( PDBServer forMe)
overridevirtual

This method is from the serverFunctionality interface... It register the query scheduler handlers to the provided server

Implements pdb::ServerFunctionality.

Definition at line 573 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

pair< bool, basic_string< char > > pdb::QuerySchedulerServer::registerReplica ( Handle< RegisterReplica > &  request,
PDBCommunicatorPtr sendUsingMe 
)
protected

This method registers a replica with statisticsDB per client request

Parameters
requestthe object that describes the computation
sendUsingMean instance of the PDBCommunicator that points to the client

Definition at line 595 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::removeIntermediateSets ( DistributedStorageManagerClient dsmClient)
protected

This method removes all the intermediate sets that we needed to continue our execution They are kept in the interGlobalSets vector

Parameters
dsmClientan instance of the DistributedStorageManagerClient that needs to remove the sets

Definition at line 779 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::removeUnusedIntermediateSets ( DistributedStorageManagerClient dsmClient,
vector< Handle< SetIdentifier >> &  intermediateSets 
)
protected

Given a vector of SetIdentifiers this method issues their removal Sets that are going to be used later in the execution are not removed, they are cleaned up with the method

See Also
QuerySchedulerServer::removeIntermediateSets
Parameters
dsmClientdsmClient an instance of the DistributedStorageManagerClient that needs to remove the sets
intermediateSetsthe vector of intermediate sets

Definition at line 746 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::requestStatistics ( PDBCommunicatorPtr communicator,
bool &  success,
string &  errMsg 
) const
protected

This method takes in a communicator to a node and issues a request for statistics about the stored sets

Parameters
communicatorthe communicator to the node
successa reference a boolean that will set to true if the request succeeds, false otherwise
errMsgthe error message that is gonna be set, if an error occurs

Definition at line 550 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

template<typename T >
bool pdb::QuerySchedulerServer::scheduleStage ( unsigned long  node,
Handle< T > &  stage,
PDBCommunicatorPtr  communicator 
)
protected

This method schedules a pipeline stage given the index of a specified node and a communicator to that node.

Parameters
nodeis the index of the node we want to schedule the stage
stageis the pipeline stage we want to schedule
communicatoris the communicator to a node we are going to use for that

Definition at line 302 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::scheduleStages ( std::vector< Handle< AbstractJobStage >> &  stagesToSchedule,
std::shared_ptr< ShuffleInfo shuffleInfo 
)
protected

This method is used to schedule dynamic pipeline stages It must be invoked after initialize() and before cleanup()

Parameters
stagesToScheduleis a vector of all the stages we want to schedule
shuffleInfois the shuffle information for job stages that needs repartitioning

Definition at line 163 of file QuerySchedulerServer.cc.

+ Here is the call graph for this function:

+ Here is the caller graph for this function:

void pdb::QuerySchedulerServer::updateStats ( Handle< SetIdentifier setToUpdateStats)
protected

Updates the optimization stats for a given set

Parameters
setToUpdateStatsthe set we are updating (contains also info about the pages and size)

Definition at line 555 of file QuerySchedulerServer.cc.

+ Here is the caller graph for this function:

Member Data Documentation

ConfigurationPtr pdb::QuerySchedulerServer::conf
protected

The configuration of the node provided by the constructor

Definition at line 339 of file QuerySchedulerServer.h.

pthread_mutex_t pdb::QuerySchedulerServer::connection_mutex
protected

This is used to synchronize the communicator More specifically the part where we are creating them and connecting them to a remote node with the connectToInternetServer method

Definition at line 356 of file QuerySchedulerServer.h.

std::vector<Handle<SetIdentifier> > pdb::QuerySchedulerServer::interGlobalSets
protected

Set identifiers for shuffle set, we need to create and remove them at scheduler, so that they exist at any node when any other node needs to write to it

Definition at line 329 of file QuerySchedulerServer.h.

std::string pdb::QuerySchedulerServer::jobId
protected

The id of the current job. Used to identify the job and the database for it's results

Definition at line 367 of file QuerySchedulerServer.h.

PDBLoggerPtr pdb::QuerySchedulerServer::logger
protected

An instance of the PDBLogger set in the constructor

Definition at line 334 of file QuerySchedulerServer.h.

double pdb::QuerySchedulerServer::partitionToCoreRatio
protected

Used to calculate the the number of partitions on a node, based on the number cpu cores on that node more specifically partitionToCoreRatio = numPartitionsOnThisNode/numCores

Definition at line 373 of file QuerySchedulerServer.h.

std::shared_ptr<PhysicalOptimizer> pdb::QuerySchedulerServer::physicalOptimizerPtr
protected

An instance of the PhyscialAnalyzer. We use it to do the dynamic planning. More specifically to find the source that is we should start from using heuristics and to generate pipeline stages starting from this source

Definition at line 380 of file QuerySchedulerServer.h.

int pdb::QuerySchedulerServer::port
protected

The port through which we access the functionalities on this node (port the PDBServer listens to)

Definition at line 323 of file QuerySchedulerServer.h.

bool pdb::QuerySchedulerServer::pseudoClusterMode
protected

True if we are running PDB in pseudo cluster mode false otherwise

Definition at line 349 of file QuerySchedulerServer.h.

SequenceID pdb::QuerySchedulerServer::seqId
protected

Used to generate a unique sequential ID getNextSequenceID is thread safe to call

Definition at line 362 of file QuerySchedulerServer.h.

std::shared_ptr<ShuffleInfo> pdb::QuerySchedulerServer::shuffleInfo
protected

Wraps shuffle information for job stages that needs repartitioning data

Definition at line 394 of file QuerySchedulerServer.h.

std::vector<StandardResourceInfoPtr>* pdb::QuerySchedulerServer::standardResources
protected

A vector containing the information about the resources of each node

Definition at line 318 of file QuerySchedulerServer.h.

std::shared_ptr<StatisticsDB> pdb::QuerySchedulerServer::statisticsDB
protected

A pointer to StatisticsDB that manages various statistics

Definition at line 344 of file QuerySchedulerServer.h.

StatisticsPtr pdb::QuerySchedulerServer::statsForOptimization
protected

Contains the information about every set on every node, more specifically :

  1. the number of pages
  2. the size of the pages
  3. and the total number of bytes the set has

Definition at line 389 of file QuerySchedulerServer.h.


The documentation for this class was generated from the following files: