This document addresses the requirements discussed with reference to creating tables on a subset of data nodes.
Here is a proposed design that takes care of a) a data node naming scheme, b) the ability to create a table (distributed/replicated) on a subset of nodes, and c) node addition/deletion (Base Design).
2) Catalog Changes & Start-up:
The proposed design adds 4 new catalog tables; the diagram of the proposed tables is as follows:
In order to let users specify the data node configuration at initdb time, the users will be required to write inserts like these in, say, the system_views.sql file:
INSERT INTO data_nodes VALUES(1, 'data_node_one');
INSERT INTO data_nodes VALUES(2, 'data_node_two');
On startup the coordinators will create a linked list of all the data nodes by reading this table. The node list will no longer be a static array.
Each node will be told its name through postgresql.conf.
The first column of the data_nodes table will contain a unique number, starting from say 1 and incrementing by say 1 for each successive insertion. These numbers need not form any defined sequence; e.g. they can be 10, 12, 18 for a three-data-node cluster. These numbers will be used to identify the 1st, 2nd and 3rd data nodes. The data nodes will be added to the linked list in the order of the first column of the data_nodes table: the 1st entry in the linked list will be considered the first data node, the 2nd entry the second data node, and so on. The hash/modulo functions would still generate a number; however, this number would now tell us the position of the data node in the linked list. Hence if the modulo says 2, we would simply pick the 3rd entry in the linked list as the target data node.
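The position-based lookup above can be sketched as follows. This is an illustrative model, not Postgres-XC code; the table contents and node names are hypothetical, and the coordinator's linked list is modeled as an ordered Python list.

```python
# Rows of the data_nodes catalog table: (id, name).  The ids need not be
# consecutive -- only their relative order matters.
data_nodes = [(10, 'data_node_one'), (12, 'data_node_two'), (18, 'data_node_three')]

# On startup the coordinator builds its node list ordered by the first column.
node_list = [name for _, name in sorted(data_nodes)]

def target_node(dist_value):
    """The modulo result is a position in the node list, not a node id."""
    return node_list[dist_value % len(node_list)]

# If the modulo comes out 2, we pick the 3rd entry in the list.
assert target_node(5) == 'data_node_three'   # 5 % 3 == 2
```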
3) Planned addition of a data node:
Consider now planned addition of a data node. Assume we had 2 data nodes and that now we need to add a third data node. Assume also that we want this added node to be the 3rd data node.
The user would do the necessary insert in the data_nodes table and would then issue a command:
This command will add a 3rd node to the linked list of data nodes at all coordinators. Once a tuple relocation mechanism is in place (in future), we will ask the cluster to relocate the data across all the nodes. Since there will now be a 3rd entry in the linked list, any row whose modulo comes out to be 2 will automatically get relocated to the just-added 3rd data node. This command will require all current transactions to finish before it can start tuple relocation, and while it is in action we might have to ask the clients to wait until it finishes. These are the issues we will consider when we implement the relocation mechanism.
Addition of multiple nodes at a time will be supported; however, mixed addition and removal might not be supported.
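The relocation triggered by node addition can be sketched numerically. This is an illustrative model with hypothetical distribution values: growing the cluster from 2 to 3 nodes means any row whose modulo against the new node count is 2 lands on the new 3rd node (and, as a side effect of modulo redistribution, some rows also shuffle between the surviving nodes).

```python
values = range(10)                    # hypothetical distribution-column values
before = {v: v % 2 for v in values}   # owner position with 2 data nodes
after  = {v: v % 3 for v in values}   # owner position with 3 data nodes

# Rows that relocate to the just-added 3rd node (position 2).
new_node_rows = [v for v in values if after[v] == 2]
assert new_node_rows == [2, 5, 8]

# Modulo redistribution also moves some rows between the old nodes,
# e.g. value 3 moves from position 1 to position 0.
assert before[3] == 1 and after[3] == 0
```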
4) Planned removal of a data node:
Consider now planned removal of a data node. Assume we had 3 data nodes and that now we need to remove the second data node.
The user will do the necessary row deletion from the data_nodes table and would then issue the command:
We will simply delete the 2nd node of the linked list of data nodes. Once a tuple relocation mechanism is in place (in future), we will ask the cluster to relocate the data across all the nodes. Since the 2nd entry in the linked list will no longer be there, any row whose modulo comes out to be 1 will automatically get relocated to the previously 3rd (now 2nd) data node.
There can be an option for the case where the user does not need the data stored on the node being removed.
Removal of multiple nodes at a time will be supported; however, mixed addition and removal might not be supported.
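The removal case can be sketched the same way as addition. This is an illustrative model with hypothetical node names: deleting the 2nd entry from the ordered node list means position 1 now refers to the previously 3rd node, so rows whose modulo is 1 relocate there.

```python
# Hypothetical three-node cluster, ordered as in the coordinator's list.
node_list = ['data_node_one', 'data_node_two', 'data_node_three']

# Planned removal of the 2nd data node.
node_list.remove('data_node_two')

# A row whose modulo comes out 1 now targets the previously 3rd node.
assert node_list[1 % len(node_list)] == 'data_node_three'
assert node_list == ['data_node_one', 'data_node_three']
```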
Initially, however, we will not support addition/removal of data nodes after initialization.
5) Impact on Pooler:
The coordinator and the pooler always maintain a synchronized list of data nodes, so the data node numbers passed between coordinator and pooler will work as-is; the only difference is that these numbers will now indicate the position of the node in the linked list.
Addition and removal of data nodes will be communicated to the pooler by the coordinator through additional pooler commands.
6) Addition and Removal of coordinators:
The procedure will be mostly similar to that for data nodes, except that the catalog (rather than data) will be provided to the newly added coordinator, and the table to use in this case will be coordinators rather than data_nodes. Additionally, addition and removal of coordinators will be communicated to the pooler too.
The command to use in this case, after doing inserts/deletes, will be
7) Impact on GTM:
8) Subclusters and creating tables on them
In the catalog table pgxc_class, we will add another column (NULLs allowed) that will store the ID of the subcluster the table belongs to. A NULL ID will mean the table is not stored on any subcluster, but rather on all data nodes.
The data structure that each coordinator uses to store data nodes will be augmented with a list of subclusters. Each subcluster will have an array of bytes, one per data node: if a data node belongs to the subcluster, the byte at that index will contain a 1, otherwise a 0. (This could be a bit vector too.) Each subcluster will also store the total number of data nodes it contains.
Assume we have 4 data nodes in a cluster and a subcluster containing the 1st and 3rd data nodes. Assume further that we create a table in this subcluster, distributed by modulo. The DDL will be directed to all the data nodes and coordinators. However, during inserts, if the modulo of the value in the distribution column comes out to be 1, the row will be directed to the 3rd data node, since the data structure will have a 0 for the 2nd data node in this particular subcluster.
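The membership array and the modulo-to-global-position mapping described above can be sketched as follows. This is an illustrative model, not the actual coordinator data structure; the function name and values are hypothetical.

```python
# 4 data nodes; the hypothetical subcluster holds the 1st and 3rd.
membership = [1, 0, 1, 0]       # one byte per data node (could be a bit vector)
n_in_group = sum(membership)    # total data nodes in the subcluster

def target_position(dist_value):
    """Map a modulo over the subcluster size to a global node position."""
    want = dist_value % n_in_group          # 0-based position within the group
    seen = -1
    for pos, member in enumerate(membership):
        if member:
            seen += 1
            if seen == want:
                return pos                  # global position, as used with the pooler

# Modulo 1 selects the 2nd member of the group, i.e. the 3rd data node,
# because the 2nd data node has a 0 in this subcluster.
assert target_position(1) == 2
assert target_position(0) == 0
```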
The indexes for communicating with the pooler will however stay as is.
In order to create a new sub cluster and data nodes in that sub cluster the user will be required to do inserts similar to these
Assume we have 4 data nodes with IDs 10,20,30 & 40
INSERT INTO data_node_group VALUES(1, 'my_group');
INSERT INTO nodes_in_group VALUES(1, 1, 10);
INSERT INTO nodes_in_group VALUES(2, 1, 30);
This would reload the data from the tables into the coordinator data structure.
In order to create a table in a certain sub cluster we will use options in the CREATE TABLE statement specifying
WITH subcluster = 'my_group'
in the options.
If we decide that users need easier, higher-level syntax support, we might add support for syntax like this:
CREATE NODE GROUP node_group_name ON NODES(<node_list>)
<node_list> := ALL | <node_seg> | <node_seg>, <node_list>
<node_seg>  := node_no | node_no_1 TO node_no_2
constraints: node_no_1 < node_no_2
E.g.
CREATE NODE GROUP node_group_1 ON NODES(2, 5 TO 15, 18)
CREATE NODE GROUP node_group_1 ON NODES(ALL)
CREATE NODE GROUP node_group_1 ON NODES(5)
(This syntax has been taken from slides created by Li Kun of ISCAS)
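Expanding the <node_list> grammar above into a concrete set of node numbers could be sketched as follows. The function name is illustrative, not part of the proposal, and the sketch assumes whitespace-tolerant, case-insensitive input.

```python
def expand_node_list(spec, all_nodes):
    """Expand a <node_list> like '2, 5 TO 15, 18' or 'ALL' into a node set."""
    spec = spec.strip()
    if spec.upper() == 'ALL':
        return set(all_nodes)
    nodes = set()
    for seg in spec.split(','):
        parts = seg.upper().split('TO')
        if len(parts) == 2:                      # node_no_1 TO node_no_2
            lo, hi = int(parts[0]), int(parts[1])
            if lo >= hi:
                raise ValueError('node_no_1 must be < node_no_2')
            nodes.update(range(lo, hi + 1))
        else:                                    # single node_no
            nodes.add(int(seg))
    return nodes

assert expand_node_list('2, 5 TO 8, 18', range(1, 21)) == {2, 5, 6, 7, 8, 18}
assert expand_node_list('ALL', [1, 2, 3]) == {1, 2, 3}
```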
Adding nodes to a subcluster and removing nodes from a subcluster will eventually behave exactly as described in sections 3 and 4 above. Initially, however, we will not support addition or removal of nodes from a subcluster.
Dropping a subcluster can be supported; it is equivalent to removing all data nodes from the subcluster and then deleting the subcluster itself.