stanford-mast / pocket
Elastic ephemeral storage
r = pocket.put_buffer(p, src, size, dst_filename, jobid)
The current Python API requires that src be a Python string; passing in a bytes object raises an error.
Is there a way to change the API/C++ dispatcher so that I can pass in a bytes object?
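Until the dispatcher accepts bytes directly, one possible workaround (a sketch, assuming the string payload is passed through byte-for-byte) is to decode the bytes with latin-1, which maps every byte to the code point of the same value, before calling put_buffer:

```python
def bytes_to_putbuffer_str(src):
    """Convert a bytes payload to the str the current put_buffer API expects.

    latin-1 maps every byte 0x00-0xFF to the code point of the same value,
    so src.decode('latin-1').encode('latin-1') == src always holds and no
    byte values are lost or altered in the round trip.
    """
    if isinstance(src, bytes):
        return src.decode('latin-1')
    return src

# Hypothetical usage (pocket, p, dst_filename, jobid as in the snippet above):
# payload = b'\x00\x01binary data'
# src = bytes_to_putbuffer_str(payload)
# r = pocket.put_buffer(p, src, len(payload), dst_filename, jobid)
```

A proper fix would register a bytes converter on the C++/Boost.Python side; the decode shim just avoids touching the dispatcher.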
I created the VPC using create_pocket_vpc.sh and set up the cluster as the README describes. When I run kops validate cluster, I get the following error:
Using cluster from kubectl context: pocketcluster.k8s.local
Validating cluster pocketcluster.k8s.local
unexpected error during validation: error listing nodes: Get https://internal-api-pocketcluster-k8s-loc-404v42-1135504292.us-west-2.elb.amazonaws.com/api/v1/nodes: dial tcp 10.1.65.233:443: connect: no route to host
I have waited for half an hour, and all the EC2 machines show as up and running in the dashboard.
Hello:
I have encountered some trouble setting up Pocket. Based on the errors I am receiving, it seems that libpocket.so is not being built correctly, though I could be wrong about that.
After running build.sh in the Client directory, I copy pocket.py, libpocket.so, and libcppcrail.so to the Controller directory and attempt to execute the controller via python3 controller.py. This produces the following error:
Traceback (most recent call last):
File "controller.py", line 8, in <module>
import pocket
File "/home/ubuntu/pocket/controller/pocket.py", line 11, in <module>
import libpocket
ImportError: /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.58.0: undefined symbol: PyClass_Type
Similarly, if I attempt to run the Lambda function that gets set up in the Microbenchmark directory, I get an error about Lambda being unable to find the file "libboost_python-py27.so.1.58.0":
Unable to import module 'latency': libboost_python-py27.so.1.58.0: cannot open shared object file: No such file or directory
If I then include "libboost_python-py27.so.1.58.0" in the Lambda deployment package, I get the following error:
START RequestId: a898a6e6-89bb-4fe9-afd6-794ea8842ba1 Version: $LATEST
Unable to import module 'latency': libboost_python-py27.so.1.58.0: undefined symbol: PyClass_Type
END RequestId: a898a6e6-89bb-4fe9-afd6-794ea8842ba1
REPORT RequestId: a898a6e6-89bb-4fe9-afd6-794ea8842ba1 Duration: 0.64 ms Billed Duration: 100 ms Memory Size: 3008 MB Max Memory Used: 57 MB Init Duration: 19.35 ms
I am running this on an Ubuntu 16.04.6 LTS EC2 VM. I have Python 3.5.6, Python 2.7.12, and Python 3.6.8 installed. The Boost version installed is "1.58.0.1ubuntu1." I can provide any additional details if necessary.
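The undefined symbol is the real clue here: PyClass_Type is part of the Python 2 C API (old-style classes) and does not exist in Python 3, so a libpocket.so linked against libboost_python-py27 can never be imported from python3. A small sketch for checking what the extension links against (the ldd text below is illustrative, not real output from this build):

```python
import re

def boost_python_libs(ldd_output):
    """Extract the Boost.Python shared-library names from `ldd` output.

    A module linked against a -py27 Boost.Python can only be imported from
    Python 2; this helper makes such a mismatch visible at a glance.
    """
    # Match tokens that start a whitespace-delimited field, so the library
    # name is reported once rather than again inside its resolved path.
    return re.findall(r'(?:^|\s)(libboost_python\S*\.so\S*)', ldd_output)

# Illustrative `ldd libpocket.so` output (not captured from a real run):
sample = """
    linux-vdso.so.1 =>  (0x00007ffd4bd00000)
    libboost_python-py27.so.1.58.0 => /usr/lib/x86_64-linux-gnu/libboost_python-py27.so.1.58.0 (0x00007f1a00000000)
    libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f1900000000)
"""
print(boost_python_libs(sample))  # -> ['libboost_python-py27.so.1.58.0']
```

If ldd shows a -py27 Boost.Python, the likely fix is rebuilding the client against a Python 3 variant of Boost.Python and running the controller with the matching interpreter.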
When I'm deploying Crail on the master (datanode), it always shows:
$ ~/crail/bin/crail datanode
20/03/12 10:42:02 INFO crail: crail.version 3101
20/03/12 10:42:02 INFO crail: crail.directorydepth 16
20/03/12 10:42:02 INFO crail: crail.tokenexpiration 10
20/03/12 10:42:02 INFO crail: crail.blocksize 1024000
20/03/12 10:42:02 INFO crail: crail.cachelimit 10240000
20/03/12 10:42:02 INFO crail: crail.cachepath /dev/hugepages/cache
20/03/12 10:42:02 INFO crail: crail.user crail
20/03/12 10:42:02 INFO crail: crail.shadowreplication 1
20/03/12 10:42:02 INFO crail: crail.debug false
20/03/12 10:42:02 INFO crail: crail.statistics true
20/03/12 10:42:02 INFO crail: crail.rpctimeout 1000
20/03/12 10:42:02 INFO crail: crail.datatimeout 1000
20/03/12 10:42:02 INFO crail: crail.buffersize 1024000
20/03/12 10:42:02 INFO crail: crail.slicesize 512000
20/03/12 10:42:02 INFO crail: crail.singleton true
20/03/12 10:42:02 INFO crail: crail.regionsize 102400000
20/03/12 10:42:02 INFO crail: crail.directoryrecord 512
20/03/12 10:42:02 INFO crail: crail.directoryrandomize true
20/03/12 10:42:02 INFO crail: crail.cacheimpl org.apache.crail.memory.MappedBufferCache
20/03/12 10:42:02 INFO crail: crail.locationmap
20/03/12 10:42:02 INFO crail: crail.namenode.address crail://10.1.0.10:9060
20/03/12 10:42:02 INFO crail: crail.namenode.blockselection roundrobin
20/03/12 10:42:02 INFO crail: crail.namenode.fileblocks 16
20/03/12 10:42:02 INFO crail: crail.namenode.rpctype org.apache.crail.namenode.rpc.tcp.TcpNameNode
20/03/12 10:42:02 INFO crail: crail.namenode.log
20/03/12 10:42:02 INFO crail: crail.namenode.replayregion false
20/03/12 10:42:02 INFO crail: crail.storage.types org.apache.crail.storage.tcp.TcpStorageTier
20/03/12 10:42:02 INFO crail: crail.storage.classes 1
20/03/12 10:42:02 INFO crail: crail.storage.rootclass 0
20/03/12 10:42:02 INFO crail: crail.storage.keepalive 2
20/03/12 10:42:02 INFO crail: crail.client.blockcache.enable false
20/03/12 10:42:02 INFO narpc: new NaRPC server group v1.0, queueDepth 16, messageSize 2048000, nodealy false, cores 1
20/03/12 10:42:02 INFO crail: crail.storage.tcp.interface eth0
20/03/12 10:42:02 INFO crail: crail.storage.tcp.port 50020
20/03/12 10:42:02 INFO crail: crail.storage.tcp.storagelimit 10240000
20/03/12 10:42:02 INFO crail: crail.storage.tcp.allocationsize 102400000
20/03/12 10:42:02 INFO crail: crail.storage.tcp.datapath /dev/hugepages/data
20/03/12 10:42:02 INFO crail: crail.storage.tcp.queuedepth 16
20/03/12 10:42:02 INFO crail: crail.storage.tcp.cores 1
20/03/12 10:42:02 INFO crail: crail.storage.tcp.nodelay false
20/03/12 10:42:02 INFO crail: crail.storage.tcp.populatemmap false
20/03/12 10:42:02 INFO crail: running TCP storage server, address /10.1.94.143:50020
20/03/12 10:42:02 INFO narpc: new NaRPC server group v1.0, queueDepth 32, messageSize 512, nodealy true
20/03/12 10:42:02 INFO crail: crail.namenode.tcp.queueDepth 32
20/03/12 10:42:02 INFO crail: crail.namenode.tcp.messageSize 512
20/03/12 10:42:02 INFO crail: crail.namenode.tcp.cores 1
20/03/12 10:42:02 INFO crail: connected to namenode(s) /10.1.0.10:9060
Exception in thread "main" java.lang.Exception: Error returned in the RPC type: ERROR: Data node not registered
at org.apache.crail.storage.StorageRpcClient.getDataNode(StorageRpcClient.java:75)
at org.apache.crail.storage.StorageServer.main(StorageServer.java:177)
and namenode also shows
20/03/12 10:42:02 INFO crail: A new connection arrives from : /10.1.94.143:16698
20/03/12 10:42:02 INFO crail: new connection from /10.1.94.143:16698
20/03/12 10:42:02 INFO narpc: adding new channel to selector, from /10.1.94.143:16698
Datanode no longer registered
How can I fix this problem?
Hello.
When I execute the ./add_ip_routes.sh
script, I get the following errors:
ssh: Could not resolve hostname ip-XX-X-XXX-XX.us-west-2.compute.internal: Name or service not known
ssh: Could not resolve hostname ip-XX-X-XXX-XX.us-west-2.compute.internal: Name or service not known
If I manually execute the commands, replacing the hostname recovered with kubectl get nodes --show-labels | grep metadata | awk '{print $1}' with just the numerical IP address (e.g., ssh -t [email protected] "sudo ip route add default via 10.1.0.1 dev eth1 tab 2"), the commands execute, though I get a new error.
For the first command, sudo ip route add default via 10.1.0.1 dev eth1 tab 2, I get the following error: RTNETLINK answers: Network is unreachable.
The Lambda functions are also not able to connect to the namenode server. When attempting to connect, I obtain the following:
START RequestId: ... Version: $LATEST
Attempting to connect...
Connecting to metadata server failed!
put buffer failed: tmp-0: Exception
Traceback (most recent call last):
File "/var/task/latency.py", line 67, in lambda_handler
pocket_write_buffer(p, jobid, iter, text, datasize)
File "/var/task/latency.py", line 33, in pocket_write_buffer
raise Exception("put buffer failed: "+ dst_filename)
Exception: put buffer failed: tmp-0
END RequestId: ...
REPORT RequestId: ... Duration: 1.47 ms Billed Duration: 100 ms Memory Size: 3008 MB Max Memory Used: 28 MB
I figure the two errors are related. I'm just not sure how to proceed. As far as I can tell, I've followed the setup instructions exactly as they're written. Do you have any idea what might be going wrong? Just pointing me in the direction of what to look at to address these issues would be helpful.
I'm trying to deploy Pocket on AWS following deploy/README.md, and I successfully installed the Kubernetes cluster on AWS. However, when I run $> python patch_cluster.py
, the following error occurs:
Traceback (most recent call last):
File "patch_cluster.py", line 213, in <module>
main()
File "patch_cluster.py", line 205, in main
add_lambda_security_group_ingress_rule()
File "patch_cluster.py", line 189, in add_lambda_security_group_ingress_rule
pocket_lax_groupid = re.search(pattern, out).group().strip('\"')
AttributeError: 'NoneType' object has no attribute 'group'
The function add_lambda_security_group_ingress_rule
tries to find the group ID for a security group
whose name matches the pattern *pocket-kube-relax*
. I checked the security groups in my account and there is no such security group. Do we need to create one ourselves and configure it for the Lambda functions later on?
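For what it's worth, the crash itself comes from calling .group() on the None that re.search returns when nothing matches. A defensive sketch (the function name mirrors the traceback; the pattern and the aws CLI output below are hypothetical):

```python
import re

def find_security_group_id(describe_output, pattern=r'sg-[0-9a-f]+'):
    """Return the first security-group ID found in aws CLI output, or None.

    Guarding the re.search result avoids the AttributeError from the
    patch_cluster.py traceback when no *pocket-kube-relax* group exists yet.
    """
    match = re.search(pattern, describe_output)
    if match is None:
        return None
    return match.group().strip('"')

# Hypothetical `aws ec2 describe-security-groups` output fragments:
print(find_security_group_id('"GroupId": "sg-0abc123def"'))  # -> sg-0abc123def
print(find_security_group_id('no matching group here'))      # -> None
```

With a guard like this, patch_cluster.py could print a clear "security group not found, create it first" message instead of crashing.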
Hi! I'm trying to deploy pocket following the instructions outlined here, but I'm running into a couple of issues:
The first is kops validate cluster
, which returns EOF: unexpected error during validation: error listing nodes: Get https://internal-api-pocketcluster-k8s-loc-404v42-1700611634.us-east-1.elb.amazonaws.com/api/v1/nodes: EOF
I've tried waiting a long time (1 hr), and the output remains the same. I used the default settings from pocketcluster.template.yaml, except that I use the same AWS subnet for both the public and private subnets and removed the NAT configuration from the egress of the private subnet (since I don't set up a NAT). Could this be the reason for the issue? I'm not sure the deploy README provides enough detail about how the subnets and NAT need to be configured on AWS.
Hi,
I tried to run jobs using DRAM storage servers, which succeeded at reading and writing. But when I switched to NVMe servers, Kubernetes successfully launched a new NVMe container; however, the NVMe container fails to connect to the controller's listening port.
Some logs below:
-------------------------- REGISTER JOB -------------------------------- 1579750512.3086464
received hints test1-122643 0 1 0 0
connected to 10.1.0.10:9070
connected to 10.1.48.29:50030
generate weightmask for test1-122643 1 72.07207207207207 0
jobid test1-122643 is throughput-bound
KUBERNETES: launch 1 extra nodes, wait for them to come up and assing proper weights [0.009009009009009009]
KUBERNETES: launch flash datanode........
controller.py:249: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
job = yaml.load(f)
Job created. status='{'active': None,
'completion_time': None,
'conditions': None,
'failed': None,
'start_time': None,
'succeeded': None}'
Wait for flash datanode to start...
At some point, the NVMe datanode is launched, and the metadata server reports its capacity:
cpu net DRAM_GB Flash_GB blacklisted reserved
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0 -1.0 0.0 0.0
Capacity usage for Tier 0 : 753662 free blocks out of 753664 ( 0.0002653702445652174 % )
Capacity usage for Tier 1 : 1703936 free blocks out of 1703936 ( 0.0 % )
Datanode usage:
cpu net blacklisted
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0
**********
cpu net DRAM_GB Flash_GB blacklisted reserved
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0 -1.0 0.0 0.0
Capacity usage for Tier 0 : 753662 free blocks out of 753664 ( 0.0002653702445652174 % )
Capacity usage for Tier 1 : 1703936 free blocks out of 1703936 ( 0.0 % )
**********
cpu net DRAM_GB Flash_GB blacklisted reserved
datanodeip_port
10.1.48.29:50030 0.0 0.0 0.0 -1.0 0.0 0.0
Capacity usage for Tier 0 : 753662 free blocks out of 753664 ( 0.0002653702445652174 % )
Capacity usage for Tier 1 : 1703936 free blocks out of 1703936 ( 0.0 % )
Datanode usage:
cpu net blacklisted
datanodeip_port
10.1.48.29:50030 2.0 0.0 0.0
And if you check the pods, you can see that the container is indeed running:
(base) ubuntu@ip-10-1-47-178:~/pocket/deploy$ kubectl get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pocket-datanode-dram-job0-h4drj 1/1 Running 0 9m16s 10.1.48.29 ip-10-1-48-29.us-west-2.compute.internal <none> <none>
pocket-datanode-nvme-job0-d9cml 1/1 Running 0 7m13s 10.1.2.3 ip-10-1-74-206.us-west-2.compute.internal <none> <none>
pocket-namenode-deployment-845648cb89-sg82q 1/1 Running 0 12m 10.1.0.10 ip-10-1-43-131.us-west-2.compute.internal <none> <none>
The code I'm using for registering a job is:
>>> import pocket
>>> jobid = "test1"
>>> namenode_ip = "10.1.0.10"
>>> jobid = pocket.register_job(jobid, capacityGB=1, latency_sensitive=0)
I'm also wondering whether the controller lacks support for NVMe/SSD nodes.
Thanks for your time!
Hi,
I'm trying to set up Pocket with the directions in the README.md
file in the deploy/
folder, but I seem to be running into the issue where kops validate cluster
is hanging (even after waiting for up to an hour).
I first ran ./create_pocket_vpc.sh
from a VM and then created a VM with an IP of 10.1.47.178
inside of the pocket-kube-private
subnet (I later also tried manually creating a VPC as well, but this didn't work either).
Then, I followed the instructions up to the point of ./setup_cluster.sh
. If I run kops validate cluster
, I get the following error:
unexpected error during validation: error listing nodes: Get https://internal-api-pocketcluster-k8s-loc-404v42-977648002.us-west-2.elb.amazonaws.com/api/v1/nodes: EOF
(Something similar happens if I try to run kubectl get nodes
as well).
If I run kops validate cluster -v 10
, I seem to get that error while requesting the following:
I1104 01:15:16.994615 31176 request_logger.go:45] AWS request: ec2/DescribeAvailabilityZones
Not exactly sure why this is happening, since I'm using an access key id and secret id that has full permissions for my account. Any ideas what could be causing the problem?
Thanks!
I don't have any security group with that name. What should I do to run patch_cluster.py?
Hello,
I tried to follow the deploy README to setup a Pocket instance. I used the create_pocket_vpc.sh
script to create the relevant VPCs, subnets, etc.
The README says to make sure the VM you launch everything from is in the private subnet. However, when I create it in the private subnet, I can't SSH into it. I looked into changing the routing table, but because of the NAT settings, it doesn't look like I can add internet gateways. Is there an easy way to fix this?
To try to get around this problem, I tried creating the VM in the public subnet. This allowed me to SSH in; however, when I reached the stage where you connect to the controller, it was unable to connect.
Thanks,
Shannon
I've got pocket working with the configuration given in README.md in the deploy/ folder.
However, I would like to be able to access Pocket from a separate EKS cluster (so in a completely different VPC with its own subnets/security groups/etc.). As of now, I'm able to access Pocket from an AWS Lambda, but I'm unable to access it from any of the worker nodes in the EKS cluster. I've already tried setting up a peering connection, but that doesn't seem to help.
Hi everyone,
Can anyone share a document describing how to install Pocket on a local machine?
Thanks and Regards
Abhisek Panda