stanford-mast / pocket
Elastic ephemeral storage
r = pocket.put_buffer(p, src, size, dst_filename, jobid)
The current Python API requires that src be a Python string; passing in a bytes object raises an error.
Is there a way to change the API/C++ dispatcher so that I can pass in a bytes object?
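Until the dispatcher accepts bytes directly, one possible workaround (a sketch, assuming the string payload is passed through byte-for-byte) is to decode the bytes with latin-1, which maps every byte to the code point of the same value, before calling put_buffer:

```python
def bytes_to_putbuffer_str(src):
    """Convert a bytes payload to the str the current put_buffer API expects.

    latin-1 maps every byte 0x00-0xFF to the code point of the same value,
    so src.decode('latin-1').encode('latin-1') == src always holds and no
    byte values are lost or altered in the round trip.
    """
    if isinstance(src, bytes):
        return src.decode('latin-1')
    return src

# Hypothetical usage (pocket, p, dst_filename, jobid as in the snippet above):
# payload = b'\x00\x01binary data'
# src = bytes_to_putbuffer_str(payload)
# r = pocket.put_buffer(p, src, len(payload), dst_filename, jobid)
```

A proper fix would register a bytes converter on the C++/Boost.Python side; the decode shim just avoids touching the dispatcher.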
I created the VPC using create_pocket_vpc.sh and set up the cluster as the README describes. When I run kops validate cluster, I get the following error:
Using cluster from kubectl context: pocketcluster.k8s.local
Validating cluster pocketcluster.k8s.local
unexpected error during validation: error listing nodes: Get https://internal-api-pocketcluster-k8s-loc-404v42-1135504292.us-west-2.elb.amazonaws.com/api/v1/nodes: dial tcp 10.1.65.233:443: connect: no route to host
I have waited for half an hour, and all the EC2 machines show as up and running in the dashboard.
Hello:
I have encountered some trouble setting up Pocket. Based on the errors I am receiving, it seems that libpocket.so is not being built correctly, though I could be wrong about that.
After running build.sh in the Client directory, I copy pocket.py, libpocket.so, and libcppcrail.so to the Controller directory and attempt to execute the controller via python3 controller.py. This produces the following error:
Traceback (most recent call last):
File "controller.py", line 8, in <module>
import pocket
File "/home/ubuntu/pocket/controller/pocket.py", line 11, in <module>
import libpocket
ImportError: /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.58.0: undefined symbol: PyClass_Type
Similarly, if I attempt to run the Lambda function that gets set up in the Microbenchmark directory, I get an error about Lambda being unable to find the file "libboost_python-py27.so.1.58.0":
Unable to import module 'latency': libboost_python-py27.so.1.58.0: cannot open shared object file: No such file or directory
If I then include "libboost_python-py27.so.1.58.0" in the Lambda deployment package, I get the following error:
START RequestId: a898a6e6-89bb-4fe9-afd6-794ea8842ba1 Version: $LATEST
Unable to import module 'latency': libboost_python-py27.so.1.58.0: undefined symbol: PyClass_Type
END RequestId: a898a6e6-89bb-4fe9-afd6-794ea8842ba1
REPORT RequestId: a898a6e6-89bb-4fe9-afd6-794ea8842ba1 Duration: 0.64 ms Billed Duration: 100 ms Memory Size: 3008 MB Max Memory Used: 57 MB Init Duration: 19.35 ms
I am running this on an Ubuntu 16.04.6 LTS EC2 VM. I have Python 3.5.6, Python 2.7.12, and Python 3.6.8 installed. The Boost version installed is "1.58.0.1ubuntu1." I can provide any additional details if necessary.
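The undefined symbol is the real clue here: PyClass_Type is part of the Python 2 C API (old-style classes) and does not exist in Python 3, so a libpocket.so linked against libboost_python-py27 can never be imported from python3. A small sketch for checking what the extension links against (the ldd text below is illustrative, not real output from this build):

```python
import re

def boost_python_libs(ldd_output):
    """Extract the Boost.Python shared-library names from `ldd` output.

    A module linked against a -py27 Boost.Python can only be imported from
    Python 2; this helper makes such a mismatch visible at a glance.
    """
    # Match tokens that start a whitespace-delimited field, so the library
    # name is reported once rather than again inside its resolved path.
    return re.findall(r'(?:^|\s)(libboost_python\S*\.so\S*)', ldd_output)

# Illustrative `ldd libpocket.so` output (not captured from a real run):
sample = """
    linux-vdso.so.1 =>  (0x00007ffd4bd00000)
    libboost_python-py27.so.1.58.0 => /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.58.0 (0x00007f1a00000000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1900000000)
"""
print(boost_python_libs(sample))  # -> ['libboost_python-py27.so.1.58.0']
```

If ldd shows a -py27 Boost.Python, the likely fix is rebuilding the client against a Python 3 variant of Boost.Python and running the controller with the matching interpreter.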
When I'm deploying Crail on the master (datanode), it always shows:
$ ~/crail/bin/crail datanode
20/03/12 10:42:02 INFO crail: crail.version 3101
20/03/12 10:42:02 INFO crail: crail.directorydepth 16
20/03/12 10:42:02 INFO crail: crail.tokenexpiration 10
20/03/12 10:42:02 INFO crail: crail.blocksize 1024000
20/03/12 10:42:02 INFO crail: crail.cachelimit 10240000
20/03/12 10:42:02 INFO crail: crail.cachepath /dev/hugepages/cache
20/03/12 10:42:02 INFO crail: crail.user crail
20/03/12 10:42:02 INFO crail: crail.shadowreplication 1
20/03/12 10:42:02 INFO crail: crail.debug false
20/03/12 10:42:02 INFO crail: crail.statistics true
20/03/12 10:42:02 INFO crail: crail.rpctimeout 1000
20/03/12 10:42:02 INFO crail: crail.datatimeout 1000
20/03/12 10:42:02 INFO crail: crail.buffersize 1024000
20/03/12 10:42:02 INFO crail: crail.slicesize 512000
20/03/12 10:42:02 INFO crail: crail.singleton true
20/03/12 10:42:02 INFO crail: crail.regionsize 102400000
20/03/12 10:42:02 INFO crail: crail.directoryrecord 512
20/03/12 10:42:02 INFO crail: crail.directoryrandomize true
20/03/12 10:42:02 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
20/03/12 10:42:02 INFO crail: crail.locationmap
20/03/12 10:42:02 INFO crail: crail.namenode.address crail://10.1.0.10:9060
20/03/12 10:42:02 INFO crail: crail.namenode.blockselection roundrobin
20/03/12 10:42:02 INFO crail: crail.namenode.fileblocks 16
20/03/12 10:42:02 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
20/03/12 10:42:02 INFO crail: crail.namenode.log
20/03/12 10:42:02 INFO crail: crail.namenode.replayregion false
20/03/12 10:42:02 INFO crail: crail.storage.types org.apache.crail.storage.tcp.TcpStorageTier
20/03/12 10:42:02 INFO crail: crail.storage.classes 1
20/03/12 10:42:02 INFO crail: crail.storage.rootclass 0
20/03/12 10:42:02 INFO crail: crail.storage.keepalive 2
20/03/12 10:42:02 INFO crail: crail.client.blockcache.enable false
20/03/12 10:42:02 INFO narpc: new NaRPC server group v1.0, queueDepth 16, messageSize 2048000, nodealy false, cores 1
20/03/12 10:42:02 INFO crail: crail.storage.tcp.interface eth0
20/03/12 10:42:02 INFO crail: crail.storage.tcp.port 50020
20/03/12 10:42:02 INFO crail: crail.storage.tcp.storagelimit 10240000
20/03/12 10:42:02 INFO crail: crail.storage.tcp.allocationsize 102400000
20/03/12 10:42:02 INFO crail: crail.storage.tcp.datapath /dev/hugepages/data
20/03/12 10:42:02 INFO crail: crail.storage.tcp.queuedepth 16
20/03/12 10:42:02 INFO crail: crail.storage.tcp.cores 1
20/03/12 10:42:02 INFO crail: crail.storage.tcp.nodelay false
20/03/12 10:42:02 INFO crail: crail.storage.tcp.populatemmap false
20/03/12 10:42:02 INFO crail: running TCP storage server, address /10.1.94.143:50020
20/03/12 10:42:02 INFO narpc: new NaRPC server group v1.0, queueDepth 32, messageSize 512, nodealy true
20/03/12 10:42:02 INFO crail: crail.namenode.tcp.queueDepth 32
20/03/12 10:42:02 INFO crail: crail.namenode.tcp.messageSize 512
20/03/12 10:42:02 INFO crail: crail.namenode.tcp.cores 1
20/03/12 10:42:02 INFO crail: connected to namenode(s) /10.1.0.10:9060
Exception in thread "main" java.lang.Exception: Error returned in the RPC type: ERROR: Data node not registered
at org.apache.crail.storage.StorageRpcClient.getDataNode(StorageRpcClient.java:75)
at org.apache.crail.storage.StorageServer.main(StorageServer.java:177)
and namenode also shows
20/03/12 10:42:02 INFO crail: A new connection arrives from : /10.1.94.143:16698
20/03/12 10:42:02 INFO crail: new connection from /10.1.94.143:16698
20/03/12 10:42:02 INFO narpc: adding new channel to selector, from /10.1.94.143:16698
Datanode no longer registered
How can I fix this problem?
Hello.
When I execute the ./add_ip_routes.sh
script, I get the following errors:
ssh: Could not resolve hostname ip-XX-X-XXX-XX.us-west-2.compute.internal: Name or service not known
ssh: Could not resolve hostname ip-XX-X-XXX-XX.us-west-2.compute.internal: Name or service not known
If I manually execute the commands, replacing the hostname recovered with kubectl get nodes --show-labels | grep metadata | awk '{print $1}' with just the numerical IP address (e.g., ssh -t [email protected] "sudo ip route add default via 10.1.0.1 dev eth1 tab 2"), the commands execute, though I get a new error.
For the first command, sudo ip route add default via 10.1.0.1 dev eth1 tab 2, I get the following error: RTNETLINK answers: Network is unreachable.
The Lambda functions are also not able to connect to the namenode server. When attempting to connect, I obtain the following:
START RequestId: ... Version: $LATEST
Attempting to connect...
Connecting to metadata server failed!
put buffer failed: tmp-0: Exception
Traceback (most recent call last):
File "/var/task/latency.py", line 67, in lambda_handler
pocket_write_buffer(p, jobid, iter, text, datasize)
File "/var/task/latency.py", line 33, in pocket_write_buffer
raise Exception("put buffer failed: "+ dst_filename)
Exception: put buffer failed: tmp-0
END RequestId: ...
REPORT RequestId: ... Duration: 1.47 ms Billed Duration: 100 ms Memory Size: 3008 MB Max Memory Used: 28 MB
I figure the two errors are related. I'm just not sure how to proceed. As far as I can tell, I've followed the setup instructions exactly as they're written. Do you have any idea what might be going wrong? Just pointing me in the direction of what to look at to address these issues would be helpful.
I'm trying to deploy Pocket on AWS following deploy/README.md, and I successfully installed the Kubernetes cluster on AWS. However, when I run $> python patch_cluster.py
, the following error occurs:
Traceback (most recent call last):
File "patch_cluster.py", line 213, in <module>
main()
File "patch_cluster.py", line 205, in main
add_lambda_security_group_ingress_rule()
File "patch_cluster.py", line 189, in add_lambda_security_group_ingress_rule
pocket_lax_groupid = re.search(pattern, out).group().strip('\"')
AttributeError: 'NoneType' object has no attribute 'group'
The function add_lambda_security_group_ingress_rule
tries to find the group ID for a security group
whose name matches the pattern *pocket-kube-relax*
. I checked the security groups in my account and there is no such security group. Do we need to create one ourselves and configure it for the Lambda functions later on?
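For what it's worth, the crash itself comes from calling .group() on the None that re.search returns when nothing matches. A defensive sketch (the function name mirrors the traceback; the pattern and the aws CLI output below are hypothetical):

```python
import re

def find_security_group_id(describe_output, pattern=r'sg-[0-9a-f]+'):
    """Return the first security-group ID found in aws CLI output, or None.

    Guarding the re.search result avoids the AttributeError from the
    patch_cluster.py traceback when no *pocket-kube-relax* group exists yet.
    """
    match = re.search(pattern, describe_output)
    if match is None:
        return None
    return match.group().strip('"')

# Hypothetical `aws ec2 describe-security-groups` output fragments:
print(find_security_group_id('"GroupId": "sg-0abc123def"'))  # -> sg-0abc123def
print(find_security_group_id('no matching group here'))      # -> None
```

With a guard like this, patch_cluster.py could print a clear "security group not found, create it first" message instead of crashing.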
Hi! I'm trying to deploy pocket following the instructions outlined here, but I'm running into a couple of issues:
The first is kops validate cluster
, which returns EOF: unexpected error during validation: error listing nodes: Get https://internal-api-pocketcluster-k8s-loc-404v42-1700611634.us-east-1.elb.amazonaws.com/api/v1/nodes: EOF
I've tried waiting a long time (1 hr), and the output remains the same. I used the default settings from pocketcluster.template.yaml, except that I use the same AWS subnet for both the public and private subnets and removed the NAT configuration from the egress of the private subnet (since I don't set up a NAT). Could this be the reason for the issue? I'm not sure the deploy README provides enough detail about how the subnets and NAT need to be configured on AWS.
Hi,
I tried to run jobs using DRAM storage servers, which succeeded at reading and writing. But when I switched to NVMe servers, Kubernetes successfully launched a new NVMe container; however, the NVMe container fails to connect to the controller's listening port.
Some logs below:
-------------------------- REGISTER JOB -------------------------------- 1579750512.3086464
received hints test1-122643 0 1 0 0
connected to 10.1.0.10:9070
connected to 10.1.48.29:50030
generate weightmask for test1-122643 1 72.07207207207207 0
jobid test1-122643 is throughput-bound
KUBERNETES: launch 1 extra nodes, wait for them to come up and assing proper weights [0.009009009009009009]
KUBERNETES: launch flash datanode........
controller.py:249: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
job = yaml.load(f)
Job created. status='{'active': None,
'completion_time': None,
'conditions': None,
'failed': None,
'start_time': None,
'succeeded': None}'
Wait for flash datanode to start...
At some point, the NVMe datanode is launched, and the metadata server reports its capacity:
cpu net DRAM_GB Flash_GB blacklisted reserved
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0 -1.0 0.0 0.0
Capacity usage for Tier 0 : 753662 free blocks out of 753664 ( 0.0002653702445652174 % )
Capacity usage for Tier 1 : 1703936 free blocks out of 1703936 ( 0.0 % )
Datanode usage:
cpu net blacklisted
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0
**********
cpu net DRAM_GB Flash_GB blacklisted reserved
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0 -1.0 0.0 0.0
Capacity usage for Tier 0 : 753662 free blocks out of 753664 ( 0.0002653702445652174 % )
Capacity usage for Tier 1 : 1703936 free blocks out of 1703936 ( 0.0 % )
**********
cpu net DRAM_GB Flash_GB blacklisted reserved
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0 -1.0 0.0 0.0
Capacity usage for Tier 0 : 753662 free blocks out of 753664 ( 0.0002653702445652174 % )
Capacity usage for Tier 1 : 1703936 free blocks out of 1703936 ( 0.0 % )
Datanode usage:
cpu net blacklisted
datanodeip_port
10.1.48.29:50030 2.0 0.0 0.0
And if you check the pods, you can see that the container is indeed running:
(base) ubuntu@ip-10-1-47-178:~/pocket/deploy$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pocket-datanode-dram-job0-h4drj 1/1 Running 0 9m16s 10.1.48.29 ip-10-1-48-29.us-west-2.compute.internal <none> <none>
pocket-datanode-nvme-job0-d9cml 1/1 Running 0 7m13s 10.1.2.3 ip-10-1-74-206.us-west-2.compute.internal <none> <none>
pocket-namenode-deployment-845648cb89-sg82q 1/1 Running 0 12m 10.1.0.10 ip-10-1-43-131.us-west-2.compute.internal <none> <none>
The code I'm using for registering a job is:
>>> import pocket
>>> jobid = "test1"
>>> namenode_ip = "10.1.0.10"
>>> jobid = pocket.register_job(jobid, capacityGB=1, latency_sensitive=0)
I'm also wondering whether the controller lacks support for NVMe/SSD nodes.
Thanks for your time!
Hi,
I'm trying to set up Pocket with the directions in the README.md
file in the deploy/
folder, but I seem to be running into the issue where kops validate cluster
is hanging (even after waiting for up to an hour).
I first ran ./create_pocket_vpc.sh
from a VM and then created a VM with an IP of 10.1.47.178
inside of the pocket-kube-private
subnet (I later also tried manually creating a VPC as well, but this didn't work either).
Then, I followed the instructions up to the point of ./setup_cluster.sh
. If I run kops validate cluster
, I get the following error:
unexpected error during validation: error listing nodes: Get https://internal-api-pocketcluster-k8s-loc-404v42-977648002.us-west-2.elb.amazonaws.com/api/v1/nodes: EOF
(Something similar happens if I try to run kubectl get nodes
as well).
If I run kops validate cluster -v 10
, I seem to get that error while requesting the following:
I1104 01:15:16.994615 31176 request_logger.go:45] AWS request: ec2/DescribeAvailabilityZones
Not exactly sure why this is happening, since I'm using an access key id and secret id that has full permissions for my account. Any ideas what could be causing the problem?
Thanks!
I don't have any security group with that name. What should I do to run patch_cluster.py?
Hello,
I tried to follow the deploy README to setup a Pocket instance. I used the create_pocket_vpc.sh
script to create the relevant VPCs, subnets, etc.
The README says to make sure the VM you launch everything from is in the private subnet. However, when I create it in the private subnet, I can't SSH into it. I looked into changing the routing table, but because of the NAT settings, it doesn't look like I can add internet gateways. Is there an easy way to fix this?
To try to get around this problem, I tried creating the VM in the public subnet. This allowed me to SSH in; however, when I reached the stage where you connect to the controller, it was unable to connect.
Thanks,
Shannon
I've got pocket working with the configuration given in README.md in the deploy/ folder.
However, I would like to be able to access Pocket from a separate EKS cluster (so in a completely different VPC with its own subnets/security groups/etc.). As of now, I'm able to access Pocket from an AWS Lambda, but I'm unable to access it from any of the worker nodes in the EKS cluster. I've already tried setting up a peering connection, but that doesn't seem to help.
Hi everyone,
Can anyone share a document describing how to install Pocket on a local machine?
Thanks and Regards
Abhisek Panda