
dynamometer's People

Contributors

csgregorian, fengnanli, mockitoguy, pizzaz93, shipkit-org, sunchao, xkrogen


dynamometer's Issues

Support Hadoop 3.0

There are a few portions of the codebase that do not yet play well with Hadoop 3.0.

dynamometer-infra test failed

The TestDynamometerInfra test failed; it may need to be revised.

> Configure project :
  Building version '0.1.4' (value loaded from 'version.properties' file).

> Task :dynamometer-infra:test

com.linkedin.dynamometer.TestDynamometerInfra > classMethod FAILED
    java.net.UnknownHostException at TestDynamometerInfra.java:146

3 tests completed, 1 failed

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':dynamometer-infra:test'.
> There were failing tests. See the report at: file:///home/hxh/hadoop/dynamometer-0.1.3/dynamometer-infra/build/reports/tests/test/index.html

* Try:
Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

* Get more help at https://help.gradle.org

BUILD FAILED in 46s
45 actionable tasks: 7 executed, 38 up-to-date

Bump Hadoop version in test dependency up to 2.7.5

Currently the test fails by default (if no system properties are specified) because 2.7.4 can no longer be downloaded from the Apache mirror. Ideally #10 should be used to solve this, but in the meantime, bump the version up to 2.7.5.

Add the ability to trigger block reports on simulated DataNodes

Due to the way blocks are injected into the simulated DataNodes in Dynamometer, sometimes it is possible for the DataNode to send its initial block report completely empty. This is because the blocks are injected after the DataNode is started, due to the setup of a MiniDFSCluster. If this happens, another block report will not be sent until the block report interval has passed, which can be a very long time. This can result in (a) test timeouts (b) long setup times.

We can add the ability for the ApplicationMaster to monitor which nodes have not reported full block reports, and trigger block reports on those nodes.
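A rough sketch of what triggering full block reports could look like, assuming Hadoop 2.7+ where ClientDatanodeProtocol.triggerBlockReport exists (HDFS-7278); the getDatanodeProxy helper below is a placeholder, not an existing Dynamometer or Hadoop method:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.hdfs.client.BlockReportOptions;
import org.apache.hadoop.hdfs.protocol.ClientDatanodeProtocol;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.DatanodeReportType;

/** Sketch: ask every live DataNode for an immediate full block report. */
public class BlockReportTrigger {

  public static void triggerFullReports(DFSClient client, Configuration conf) throws IOException {
    for (DatanodeInfo dn : client.datanodeReport(DatanodeReportType.LIVE)) {
      ClientDatanodeProtocol dnProxy = getDatanodeProxy(dn, conf);
      // triggerBlockReport (HDFS-7278); incremental=false forces a full report.
      dnProxy.triggerBlockReport(new BlockReportOptions.Factory().setIncremental(false).build());
    }
  }

  // Placeholder: real code would build the per-DataNode proxy (e.g. via the DFSUtil helpers).
  private static ClientDatanodeProtocol getDatanodeProxy(DatanodeInfo dn, Configuration conf) {
    throw new UnsupportedOperationException("proxy creation omitted from this sketch");
  }
}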

Add _real_ unit testing

Though the project has some tests, it is essentially just one monolithic integration test. We need more fine-grained unit tests to be able to more easily diagnose issues as they arise, and to increase the possibility of catching breakages.

Add configurability for NameNode readiness criteria

When waiting for the Dynamometer infra application to be "ready" to use, the Client and ApplicationMaster wait for the following criteria to be met:

  1. Number of live DataNodes is above a threshold
  2. Number of missing blocks is under a threshold
  3. Number of underreplicated blocks is under a threshold

These percentage-based thresholds are currently hard-coded, but it would be useful if they were configurable to give a user more control over how strict the readiness condition is.
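A minimal sketch of what configurable thresholds might look like; the property names and default values below are illustrative only, not the ones Dynamometer actually hard-codes:

import org.apache.hadoop.conf.Configuration;

/** Sketch: pull the readiness thresholds from configuration instead of constants. */
class ReadinessThresholds {
  static final String LIVE_DN_FRACTION = "dyno.readiness.live-datanode-fraction";        // assumed name
  static final String MISSING_BLOCK_FRACTION = "dyno.readiness.missing-block-fraction";  // assumed name
  static final String UNDER_REP_FRACTION = "dyno.readiness.underreplicated-fraction";    // assumed name

  final float liveDatanodeFraction;
  final float missingBlockFraction;
  final float underReplicatedFraction;

  ReadinessThresholds(Configuration conf) {
    liveDatanodeFraction = conf.getFloat(LIVE_DN_FRACTION, 0.99f);         // illustrative default
    missingBlockFraction = conf.getFloat(MISSING_BLOCK_FRACTION, 0.0001f); // illustrative default
    underReplicatedFraction = conf.getFloat(UNDER_REP_FRACTION, 0.01f);    // illustrative default
  }
}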

Some questions about the result of WorkLoad

[mr@redhat143 dynamometer-fat-0.1.5]$ ./bin/start-workload.sh -Dauditreplay.input-path=hdfs:///dyno/audit_input_logs/ -Dauditreplay.output-path=hdfs:///dyno/audit_output_logs/ -Dauditreplay.log-start-time.ms=1554247070151 -Dauditreplay.num-threads=1 -nn_uri hdfs://redhat142:9000/ -start_time_offset 1m -mapper_class_name AuditReplayMapper
2019-04-03 08:08:56,771 INFO com.linkedin.dynamometer.workloadgenerator.WorkloadDriver: The workload will start at 1554250196743 ms (2019/04/03 08:09:56 CST)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/ZDH/parcels/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/mr/dynamometer-fat-0.1.5/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/mr/dynamometer-fat-0.1.5/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/mr/dynamometer-fat-0.1.5/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2019-04-03 08:09:04,118 INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl: Timeline service address: http://redhat143:8188/ws/v1/timeline/
2019-04-03 08:09:06,712 INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat: Total input paths to process : 1
2019-04-03 08:09:07,156 INFO org.apache.hadoop.mapreduce.JobSubmitter: number of splits:1
2019-04-03 08:09:07,591 INFO org.apache.hadoop.mapreduce.JobSubmitter: Submitting tokens for job: job_1554243591539_0010
2019-04-03 08:09:07,799 INFO org.apache.hadoop.conf.Configuration: found resource resource-types.xml at file:/etc/zdh/yarn/conf.zdh.yarn/resource-types.xml
2019-04-03 08:09:08,703 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl: Submitted application application_1554243591539_0010
2019-04-03 08:09:08,819 INFO org.apache.hadoop.mapreduce.Job: The url to track the job: http://redhat142:8088/proxy/application_1554243591539_0010/
2019-04-03 08:09:08,820 INFO org.apache.hadoop.mapreduce.Job: Running job: job_1554243591539_0010
2019-04-03 08:09:28,561 INFO org.apache.hadoop.mapreduce.Job: Job job_1554243591539_0010 running in uber mode : false
2019-04-03 08:09:28,564 INFO org.apache.hadoop.mapreduce.Job: map 0% reduce 0%

[mr@redhat143 ~]$ hdfs dfs -ls /dyno/audit_output_logs
Found 2 items
-rw-r----- 3 mr users 0 2019-04-03 16:49 /dyno/audit_output_logs/_SUCCESS
-rw-r----- 3 mr users 295 2019-04-03 16:49 /dyno/audit_output_logs/part-r-00000
[mr@redhat143 ~]$ hdfs dfs -cat /dyno/audit_output_logs/part-r-00000
mr,READ,OPEN,-1,-3812586358584700876
mr,WRITE,CREATE,-1,-3089982714344429856
mr,WRITE,DELETE,67108863,357792779
mr,WRITE,MKDIRS,8796093022207,943151732492469
mr,WRITE,RENAME,70368744177663,521322769738249
mr,WRITE,SETPERMISSION,-1,-6855717651319934733
mr,WRITE,SETREPLICATION,16777215,161654944

I have some questions:

  1. Is part-r-00000 the result of WorkLoad?
  2. What does the result mean? And why are there negative numbers?

I'm looking forward to your reply. Thanks!

Should be able to specify latest version of a branch for tests

The Hadoop tarball to use during testing is specified and downloaded from an Apache mirror. Generally only the latest version of each branch is available, but right now, versions must be fully specified (e.g. 2.7.4), so they will go out of date as new maintenance releases come into existence. It should be possible to specify something like 2.7.* instead.

Hadoop 3.0 NN drops replicas silently

In Hadoop 3.0/CDH5.7 and above,
HDFS-9260 (Improve the performance and GC friendliness of NameNode startup and full block reports) changed the internal representation of block replicas, as well as the block report processing logic in NameNode.

After HDFS-9260, the NN expects block replicas to be reported in ascending order of block id. If a block id arrives out of order, the NN discards it silently. Because the simulated DataNode in Dynamometer uses a hash map to store block replicas, the replicas are not reported in order. The Dynamometer cluster then sees missing blocks gradually increase several minutes after the NN starts.

I suggest changing SimulatedBPStorage.blockMap to a TreeMap sorted by block id, and will supply a patch for the proposed change.

Credit: @fangyurao for identifying the issue and helping verify the fix.
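A minimal sketch of the idea, keying a sorted map by block ID so iteration (and therefore block report generation) happens in ascending order; the real SimulatedBPStorage keys its map on the Block object, so this is a simplification for illustration only:

import java.util.Map;
import java.util.TreeMap;

/** Sketch: a sorted replica map, as the post-HDFS-9260 NameNode expects reports in
 *  ascending block-ID order. Replica details are omitted. */
class SortedReplicaMapSketch {
  // TreeMap iterates keys in ascending order; HashMap gives no ordering guarantee.
  // (A ConcurrentSkipListMap would work too if concurrent access is required.)
  private final Map<Long, Object> blockMap = new TreeMap<>();

  void addReplica(long blockId, Object replicaInfo) {
    blockMap.put(blockId, replicaInfo);
  }

  Iterable<Long> blockIdsInReportOrder() {
    return blockMap.keySet(); // ascending block-ID order
  }
}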

gradle build failed

Failure to compile with the release dynamometer-0.1.0 package.

root@hadoop:/home/hxh/hadoop/dynamometer-0.1.0# gradle build

Parallel execution with configuration on demand is an incubating feature.

FAILURE: Build failed with an exception.

  • Where:
    Build file '/home/hxh/hadoop/dynamometer-0.1.0/build.gradle' line: 18

  • What went wrong:
    Plugin [id: 'org.shipkit.java', version: '2.1.3'] was not found in any of the following sources:

  • Gradle Core Plugins (plugin is not in 'org.gradle' namespace)
  • Plugin Repositories (could not resolve plugin artifact 'org.shipkit.java:org.shipkit.java.gradle.plugin:2.1.3')
    Searched in the following repositories:
    Gradle Central Plugin Repository
  • Try:
    Run with --stacktrace option to get the stack trace. Run with --info or --debug option to get more log output. Run with --scan to get full insights.

  • Get more help at https://help.gradle.org

BUILD FAILED in 32s

Improvements to audit replay accuracy

There are two glaring issues with the accuracy of audit replay:

  • deletes are currently specified as non-recursive. If a delete was originally submitted as non-recursive but performed on a non-empty directory, it would have failed, and not appeared in the audit log in the first place. Since all deletes in the audit log succeeded, we should simply specify all of our deletes as recursive (see the sketch after this list).
  • Large listStatus calls are amplified. When a large listing is broken up into several separate RPCs, each one is logged in the audit log. However, we currently perform a full listing for each entry in the audit log, so e.g. if a large listing produced 5 RPCs, in the Dynamometer replay we would perform 25 RPCs, since for each entry in the audit log we would do a full listing and produce an additional 5 RPCs.
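As referenced above, a minimal sketch of the first fix, since every delete in the audit log already succeeded and can safely be replayed recursively:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch of the recursive-delete fix. */
class RecursiveDeleteReplay {
  static void replayDelete(FileSystem fs, String src) throws IOException {
    // recursive=true: a non-recursive delete of a non-empty directory would have failed
    // and never shown up in the audit log in the first place.
    fs.delete(new Path(src), true);
  }
}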

Support build environment overrides.

We can provide a hook in the root build.gradle for the build environment to have specific overrides. This allows the project to be built in more environments and adapt more gracefully to company-specific tooling.

Gradle should pass relevant system properties into tests

The TestDynamometerInfra test relies on some system properties to be able to change behaviors in the test. Gradle needs to be explicitly told to pass system properties, so attempting to specify these when running the tests via Gradle does not work.

Remove DataNodeLayoutVersionFetcher

This class was necessary for the old method of setting up DataNodes which involved laying out fake blocks on disk; we had to know the layout version in that case. Now that SimulatedMultiStorageFSDataset is used instead, this logic is no longer necessary.

Option parsing bug for start-component.sh

This might be related to #52.
In https://github.com/linkedin/dynamometer/blob/master/dynamometer-infra/src/main/java/com/linkedin/dynamometer/Client.java#L327 the client tries to detect whether a help option was entered, but internally it uses a GnuParser (deprecated since commons-cli 1.3), and during the flatten process it pulls the -h substring out of the -hadoop_binary_path option and thinks the user is asking for help information.
Note this only happens when you put -hadoop_binary_path as the first option, since other options will make the flatten end early.
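A minimal sketch of one way to sidestep the flattening behavior: scan the raw arguments for an explicit help flag before handing them to the strict parser. The class and method names here are illustrative, not Dynamometer's actual code:

import org.apache.commons.cli.HelpFormatter;
import org.apache.commons.cli.Options;

/** Sketch: check for an explicit help flag before the strict (required-option) parse,
 *  so "-hadoop_binary_path" is never mistaken for "-h". */
class HelpFirst {
  static boolean printHelpIfRequested(String[] args, Options fullOptions) {
    for (String arg : args) {
      // Only exact matches count; no flattening or substring guessing.
      if (arg.equals("-h") || arg.equals("-help") || arg.equals("--help")) {
        new HelpFormatter().printHelp("start-dynamometer-cluster.sh", fullOptions);
        return true;
      }
    }
    return false;
  }
}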

getEZForPath command is not supported

When running the replay with audit logs, warnings like the following come up:

18/11/06 01:23:36 WARN audit.AuditReplayThread: Unsupported/invalid command: AuditReplayCommand(absoluteTimestamp=1539650537967, ugi=xxx, command=getEZForPath, src=/certain/path, dest=null, sourceIP=x.x.x.x

Blockgen job fails to clean up failed reduce attempts

The block generation job has custom output logic to allow each reducer to output to multiple block files.

When speculative execution is enabled, this can result in two copies of the same block file being generated (one of which may be incomplete). This can be worked around by setting mapreduce.reduce.speculative = false.

When a reducer attempt fails, the partial output files will not be cleaned up. I'm not aware of an easy workaround for this beyond manually cleaning up the files after the job completes.

We should have each reducer use a staging directory and only move the output files when it completes.
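A rough sketch of the staging-directory idea, with an illustrative path layout (this is not Dynamometer's actual block generation code):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Sketch: each reduce attempt writes its block files under a per-attempt staging
 *  directory and promotes them only on success; failed or speculative attempts leave
 *  nothing in the final output directory. */
class StagedBlockFileWriter {
  private final FileSystem fs;
  private final Path staging;  // e.g. <output>/_staging/<task-attempt-id>
  private final Path finalDir; // e.g. <output>

  StagedBlockFileWriter(FileSystem fs, Path staging, Path finalDir) {
    this.fs = fs;
    this.staging = staging;
    this.finalDir = finalDir;
  }

  /** Where the reducer writes a block file while the attempt is in flight. */
  Path stagingFile(String name) {
    return new Path(staging, name);
  }

  /** Called once the attempt has successfully written all of its block files. */
  void commit() throws IOException {
    for (FileStatus stat : fs.listStatus(staging)) {
      fs.rename(stat.getPath(), new Path(finalDir, stat.getPath().getName()));
    }
    fs.delete(staging, true); // remove the now-empty staging directory
  }

  /** Called when the attempt fails or is killed; partial files never reach finalDir. */
  void abort() throws IOException {
    fs.delete(staging, true);
  }
}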

some questions about the auditlog

There is a description like this in the README document:
The audit trace replay accepts one input file per mapper, and currently supports two input formats, configurable via the auditreplay.command-parser.class configuration.
Where do we need to configure the auditreplay.command-parser.class?

When using this format you must also specify auditreplay.log-start-time.ms
And how should we specify auditreplay.log-start-time.ms?

dynamometer job container failed when running the start-dynamometer script

Hi, I want to run Dynamometer using the dynamometer script, but after the YARN app is submitted it always fails inside the container for reasons I don't know. Does anyone have any suggestions?

2022-06-15 18:36:40,223 DEBUG retry.RetryInvocationHandler: Exception while invoking call #30 ClientNamenodeProtocolTranslatorPB.getBlockLocations over null. Not retrying because try once and fail.
org.apache.hadoop.ipc.RemoteException(java.io.FileNotFoundException): File does not exist: /user/kjh/.dynamometer/application_1655275704983_0013/nn_info.prop
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:86)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1990)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:768)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:442)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)

at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1562)
at org.apache.hadoop.ipc.Client.call(Client.java:1508)
at org.apache.hadoop.ipc.Client.call(Client.java:1405)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:118)
at com.sun.proxy.$Proxy9.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getBlockLocations(ClientNamenodeProtocolTranslatorPB.java:327)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy10.getBlockLocations(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient.callGetBlockLocations(DFSClient.java:869)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:858)
at org.apache.hadoop.hdfs.DFSClient.getLocatedBlocks(DFSClient.java:847)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:1015)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:322)
at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:318)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:330)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:906)
at org.apache.hadoop.tools.dynamometer.DynoInfraUtils.waitForAndGetNameNodeProperties(DynoInfraUtils.java:236)
at org.apache.hadoop.tools.dynamometer.Client.lambda$monitorInfraApplication$3(Client.java:902)
at java.lang.Thread.run(Thread.java:748)

Below is the Dynamometer run command:
./dynamometer-infra/bin/start-dynamometer-cluster.sh -hadoop_binary_path /home/kjh/Downloads/hadoop-3.2.2.tar.gz -conf_path /usr/local/hadoop/etc/hadoop -fs_image_dir hdfs:///dyno/fsimage -block_list_path hdfs:///dyno/blocks

Workload replay threads should share a single DelayQueue

Currently, within a single workload replay mapper, numerous threads are started to replay commands. Commands are partitioned by their source path and then directed to the thread corresponding to their partition. This results in issues with, for example, skewed paths, where a single thread gets backed up and commands end up being executed far later than initially intended. Instead, we should simply let all of the threads share a single DelayQueue to spread the load among all of them.

This has the disadvantage that some operations occurring on the same source path may occur out of order, but this has not proved to be an issue in our experience.
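A minimal sketch of the shared-queue approach; ReplayCommand here is a stand-in for the real audit replay command type:

import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

/** Sketch: all replay threads poll one shared DelayQueue, so a skewed source path can no
 *  longer back up a single thread. */
class SharedQueueReplay {
  static class ReplayCommand implements Delayed {
    final long absoluteTimestampMs; // when the command should be replayed
    final String command;

    ReplayCommand(long absoluteTimestampMs, String command) {
      this.absoluteTimestampMs = absoluteTimestampMs;
      this.command = command;
    }

    @Override
    public long getDelay(TimeUnit unit) {
      return unit.convert(absoluteTimestampMs - System.currentTimeMillis(), TimeUnit.MILLISECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.MILLISECONDS), other.getDelay(TimeUnit.MILLISECONDS));
    }
  }

  // One queue shared by every replay thread in the mapper.
  final DelayQueue<ReplayCommand> queue = new DelayQueue<>();

  Runnable worker() {
    return () -> {
      try {
        while (true) {
          ReplayCommand cmd = queue.take(); // blocks until the command's replay time arrives
          // replay(cmd) would issue the corresponding RPC against the NameNode
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    };
  }
}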

Dynamometer is incompatible with Hadoop 2.7.2 but this is not documented anywhere

I got a failure when trying to launch the dyno cluster.

[root@ftp0 hadoop]# start-dynamometer-cluster.sh -hadoop_binary_path hadoop-2.7.2.tar.gz -conf_path /root/hadoop/hadoop-2.7.2/etc/hadoop/conf -fs_image_dir hdfs:///dyno/fsimage -block_list_path hdfs:///dyno/blocks1

console log :
19/07/25 11:47:43 INFO dynamometer.Client: Running Client
19/07/25 11:47:43 INFO client.RMProxy: Connecting to ResourceManager at ftp0/192.168.103.159:8032
19/07/25 11:47:43 INFO dynamometer.Client: Got Cluster metric info from ASM, numNodeManagers=3
19/07/25 11:47:43 INFO dynamometer.Client: Queue info, queueName=default, queueCurrentCapacity=0.0, queueMaxCapacity=1.0, queueApplicationCount=0, queueChildQueueCount=0
19/07/25 11:47:43 INFO dynamometer.Client: Max mem capabililty of resources in this cluster 9000
19/07/25 11:47:43 INFO dynamometer.Client: Max virtual cores capabililty of resources in this cluster 50
19/07/25 11:47:43 INFO dynamometer.Client: Set the environment for the application master
19/07/25 11:47:43 INFO dynamometer.Client: Using resource FS_IMAGE directly from current location: hdfs://ftp0:9000/dyno/fsimage/fsimage_0000000000000108883
19/07/25 11:47:43 INFO dynamometer.Client: Using resource FS_IMAGE_MD5 directly from current location: hdfs://ftp0:9000/dyno/fsimage/fsimage_0000000000000108883.md5
19/07/25 11:47:43 INFO dynamometer.Client: Using resource VERSION directly from current location: hdfs:/dyno/fsimage/VERSION
19/07/25 11:47:43 INFO dynamometer.Client: Uploading resource CONF_ZIP from [/root/hadoop/hadoop-2.7.2/etc/hadoop/conf] to hdfs://ftp0:9000/user/root/.dynamometer/application_1564026451259_0001/conf.zip
19/07/25 11:47:44 INFO dynamometer.Client: Uploading resource START_SCRIPT from [file:/tmp/hadoop-unjar5145675343523534600/start-component.sh] to hdfs://ftp0:9000/user/root/.dynamometer/application_1564026451259_0001/start-component.sh
19/07/25 11:47:44 INFO dynamometer.Client: Uploading resource HADOOP_BINARY from [hadoop-2.7.2.tar.gz] to hdfs://ftp0:9000/user/root/.dynamometer/application_1564026451259_0001/hadoop-2.7.2.tar.gz
19/07/25 11:47:44 INFO dynamometer.Client: Uploading resource DYNO_DEPS from [/root/dynamometer/build/distributions/dynamometer-0.1.7/bin/../lib/dynamometer-infra-0.1.7.jar] to hdfs://ftp0:9000/user/root/.dynamometer/application_1564026451259_0001/dependencies.zip
19/07/25 11:47:44 INFO dynamometer.Client: Completed setting up app master command: [$JAVA_HOME/bin/java, -Xmx1741m, com.linkedin.dynamometer.ApplicationMaster, --datanode_memory_mb 2048, --datanode_vcores 1, --datanodes_per_cluster 1, --datanode_launch_delay 0s, --namenode_memory_mb 2048, --namenode_vcores 1, --namenode_metrics_period 60, 1><LOG_DIR>/stdout, 2><LOG_DIR>/stderr]
19/07/25 11:47:44 INFO dynamometer.Client: Submitting application to RM
19/07/25 11:47:44 INFO impl.YarnClientImpl: Submitted application application_1564026451259_0001
19/07/25 11:47:45 INFO dynamometer.Client: Track the application at: http://ftp0:8088/proxy/application_1564026451259_0001/
19/07/25 11:47:45 INFO dynamometer.Client: Kill the application using: yarn application -kill application_1564026451259_0001
19/07/25 11:48:00 INFO dynamometer.Client: NameNode can be reached via HDFS at: hdfs://ftp1:9002/
19/07/25 11:48:00 INFO dynamometer.Client: NameNode web UI available at: http://ftp1:50077/
19/07/25 11:48:00 INFO dynamometer.Client: NameNode can be tracked at: http://ftp1:8042/node/containerlogs/container_1564026451259_0001_01_000002/root/
19/07/25 11:48:00 INFO dynamometer.Client: Waiting for NameNode to finish starting up...
19/07/25 11:48:07 INFO dynamometer.Client: Infra app exited unexpectedly. YarnState=FINISHED. Exiting from client.
19/07/25 11:48:07 INFO dynamometer.Client: Attempting to clean up remaining running applications.
19/07/25 11:48:07 ERROR dynamometer.Client: Application failed to complete successfully

After that, I went to look at the container log under Hadoop.
[root@ftp0 container_1564026451259_0001_01_000001]# pwd
/root/hadoop/hadoop-2.7.2/logs/userlogs/application_1564026451259_0001/container_1564026451259_0001_01_000001
[root@ftp0 container_1564026451259_0001_01_000001]# ls
stderr stdout

stdout is empty !!

stderr :

19/07/25 11:47:51 INFO dynamometer.ApplicationMaster: Setting up container launch context for containerid=container_1564026451259_0001_01_000002, isNameNode=true
19/07/25 11:47:51 INFO dynamometer.ApplicationMaster: Completed setting up command for namenode: [./start-component.sh, namenode, hdfs://ftp0:9000/user/root/.dynamometer/application_1564026451259_0001, 1><LOG_DIR>/stdout, 2><LOG_DIR>/stderr]
19/07/25 11:47:51 INFO dynamometer.ApplicationMaster: Starting NAMENODE; track at: http://ftp1:8042/node/containerlogs/container_1564026451259_0001_01_000002/root/
19/07/25 11:47:51 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1564026451259_0001_01_000002
19/07/25 11:47:51 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ftp1:34334
19/07/25 11:47:51 INFO dynamometer.ApplicationMaster: NameNode container started at ID container_1564026451259_0001_01_000002
19/07/25 11:48:00 INFO dynamometer.ApplicationMaster: NameNode information: {NM_HTTP_PORT=8042, NN_HOSTNAME=ftp1, NN_HTTP_PORT=50077, NN_SERVICERPC_PORT=9022, NN_RPC_PORT=9002, CONTAINER_ID=container_1564026451259_0001_01_000002}
19/07/25 11:48:00 INFO dynamometer.ApplicationMaster: NameNode can be reached at: hdfs://ftp1:9002/
19/07/25 11:48:00 INFO dynamometer.ApplicationMaster: Waiting for NameNode to finish starting up...
19/07/25 11:48:05 INFO dynamometer.ApplicationMaster: Got response from RM for container ask, completedCnt=1
19/07/25 11:48:05 INFO dynamometer.ApplicationMaster: Got container status for NAMENODE: containerID=container_1564026451259_0001_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch.
Container id: container_1564026451259_0001_01_000002
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Container exited with a non-zer...
19/07/25 11:48:05 INFO dynamometer.ApplicationMaster: NameNode container completed; marking application as done
19/07/25 11:48:06 INFO dynamometer.ApplicationMaster: NameNode has started!
19/07/25 11:48:06 INFO dynamometer.ApplicationMaster: Looking for block listing files in hdfs:/dyno/blocks1
19/07/25 11:48:06 INFO dynamometer.ApplicationMaster: Requesting 2 DataNode containers with 2048MB memory, 1 vcores,
19/07/25 11:48:06 INFO dynamometer.ApplicationMaster: Finished requesting datanode containers
19/07/25 11:48:06 INFO dynamometer.ApplicationMaster: Application completed. Stopping running containers
19/07/25 11:48:06 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ftp1:34334
19/07/25 11:48:07 INFO dynamometer.ApplicationMaster: Application completed. Signalling finish to RM
19/07/25 11:48:07 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/07/25 11:48:07 INFO dynamometer.ApplicationMaster: Application Master failed. exiting
19/07/25 11:48:07 INFO impl.AMRMClientAsyncImpl: Interrupted while waiting for queue
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2048)
at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442)
at org.apache.hadoop.yarn.client.api.async.impl.AMRMClientAsyncImpl$CallbackHandlerThread.run(AMRMClientAsyncImpl.java:287)

Thanks in advance!

Add the ability to specify configurations which apply only to the workload job

The Client launches two jobs, the infrastructure job and the workload job, which right now pull from the same configuration. It can be useful to be able to have configurations which apply to only one of the two. Provide a command line option to specify configurations which apply only to the workload job, and not the infrastructure job, so that specific overrides can be applied.
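A minimal sketch of how the Client could apply workload-only overrides, assuming they arrive as key=value pairs from a hypothetical "-workload_config" option:

import java.util.Map;
import org.apache.hadoop.conf.Configuration;

/** Sketch: build a workload-only Configuration by copying the base config and applying
 *  overrides on top; the infra job keeps using the untouched base config. */
class WorkloadConfs {
  static Configuration workloadConf(Configuration base, Map<String, String> overrides) {
    Configuration conf = new Configuration(base); // copy constructor leaves `base` unchanged
    for (Map.Entry<String, String> e : overrides.entrySet()) {
      conf.set(e.getKey(), e.getValue()); // e.g. values parsed from the assumed option
    }
    return conf;
  }
}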

Use fully in-memory DataNodes in the same JVM

Currently each DataNode is launched as a separate process/JVM, and we fool it into thinking it has all of its necessary blocks by creating the files as 0-length. It would be much more efficient to launch all of the DataNodes in the same JVM using MiniDFSCluster, and to use SimulatedFSDataset to store the block metadata only in-memory, saving us from having to create millions of sparse files on disk.
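A rough sketch, assuming the MiniDFSCluster and SimulatedFSDataset classes from the HDFS test artifacts are on the classpath:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.SimulatedFSDataset;

/** Sketch: run all DataNodes in one JVM with purely in-memory block metadata. */
class InJvmDataNodes {
  static MiniDFSCluster start(Configuration conf, int numDataNodes) throws IOException {
    // Swap the on-disk FsDataset implementation for the in-memory simulated one, so no
    // sparse block files are ever created on local disk.
    SimulatedFSDataset.setFactory(conf);
    return new MiniDFSCluster.Builder(conf).numDataNodes(numDataNodes).build();
  }
}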

Add in the ability to override the NameNode name and edits directory

Currently, the NameNode's name and edits dirs are stored within the NodeManager local storage. Given the potentially performance-critical nature of these storages (writing edit logs to disk can have a significant performance impact), it can be desirable to point these at, e.g., a dedicated disk. Provide a way to override the default.
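A minimal sketch of one way to wire the override through; the dyno.* property names are illustrative, not ones Dynamometer currently defines:

import org.apache.hadoop.conf.Configuration;

/** Sketch: allow the user to redirect the NameNode metadata dirs to a dedicated disk. */
class NameDirOverrides {
  static void apply(Configuration conf) {
    String nameDir = conf.get("dyno.namenode.name-dir-override");   // assumed property name
    String editsDir = conf.get("dyno.namenode.edits-dir-override"); // assumed property name
    if (nameDir != null) {
      conf.set("dfs.namenode.name.dir", nameDir);   // e.g. file:///fast-disk/dyno/name
    }
    if (editsDir != null) {
      conf.set("dfs.namenode.edits.dir", editsDir); // e.g. file:///fast-disk/dyno/edits
    }
  }
}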

Zombie simulated datanode

After running start-dynamometer-cluster.sh and replaying the prod audit log for some time, some simulated DataNodes (containers) lost connection to the RM, and when the YARN application is killed these containers keep running and continue sending their blocks to the NameNode.
In this case, since the DataNodes have gone through some changes during the replay while the NameNode started from a fresh fsimage, the errors below will show up on the WebHDFS page after the NameNode starts up.

Safe mode is ON. The reported blocks 1526116 needs additional 395902425 blocks to reach the threshold 0.9990 of total blocks 397826363. The number of live datanodes 3 has reached the minimum number 0. Name node detected blocks with generation stamps in future. This means that Name node metadata is inconsistent. This can happen if Name node metadata files have been manually replaced. Exiting safe mode will cause loss of 7141 byte(s). Please restart name node with right metadata or use "hdfs dfsadmin -safemode forceExit" if you are certain that the NameNode was started with the correct FsImage and edit logs. If you encountered this during a rollback, it is safe to exit with -safemode forceExit.

And when checking the Datanode tab on the WebHDFS page, a list of a couple of DataNodes shows up.

start-dynamometer-cluster.sh can't start NameNode

start-dynamometer-cluster.sh command:
./start-dynamometer-cluster.sh --hadoop_binary_path hadoop-2.7.2.tar.gz --conf_path /opt/hadoop/wz/dynamome --conf_path /opt/hadoop/wz/dynamometer/bin/conf/ --fs_image_dir hdfs:///dyno/fsimage --block_list_path

Check the NameNode's startup log on the AM node:
2019-01-08 16:32:38,311 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: registered UNIX signal handlers for [TERM, HUP, INT]
2019-01-08 16:32:38,315 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: createNameNode [-D, fs.defaultFS=hdfs://host-xx-xx:9002, -D, dfs.namenode.rpc-address=host-xx-xx:9002, -D, dfs.namenode.servicerpc-address=host-xx-xx:9022, -D, dfs.namenode.http-address=host-xx-xx:50077, -D, dfs.namenode.https-address=host-xx-xx:0, -D, dfs.namenode.name.dir=file:///opt/huawei/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1546852874867_0024/container_1546852874867_0024_01_000002/dyno-node/name-data, -D, dfs.namenode.edits.dir=file:///opt/huawei/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1546852874867_0024/container_1546852874867_0024_01_000002/dyno-node/name-data, -D, dfs.namenode.checkpoint.dir=file:///opt/huawei/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1546852874867_0024/container_1546852874867_0024_01_000002/dyno-node/checkpoint, -D, dfs.namenode.safemode.threshold-pct=0.0f, -D, dfs.permissions.enabled=true, -D, dfs.cluster.administrators="", -D, dfs.block.replicator.classname=com.linkedin.dynamometer.BlockPlacementPolicyAlwaysSatisfied, -D, hadoop.security.impersonation.provider.class=com.linkedin.dynamometer.AllowAllImpersonationProvider, -D, hadoop.tmp.dir=/opt/huawei/hadoop/tmp/nm-local-dir/usercache/hadoop/appcache/application_1546852874867_0024/container_1546852874867_0024_01_000002/dyno-node, -D, hadoop.security.authentication=simple, -D, hadoop.security.authorization=false, -D, dfs.http.policy=HTTP_ONLY, -D, dfs.client.read.shortcircuit=false]
2019-01-08 16:32:38,318 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at host-xx-xx
************************************************************/

In start-component.sh , line 277:
${HADOOP_HOME}/sbin/hadoop-daemon.sh start namenode $namenodeConfigs $NN_ADDITIONAL_ARGS;

It seems that the NameNode can't recognize the parameters ($namenodeConfigs).
The $namenodeConfigs looks like this:

read -r -d '' namenodeConfigs <<EOF
-D fs.defaultFS=hdfs://${nnHostname}:${nnRpcPort}
-D dfs.namenode.rpc-address=${nnHostname}:${nnRpcPort}
-D dfs.namenode.servicerpc-address=${nnHostname}:${nnServiceRpcPort}
-D dfs.namenode.http-address=${nnHostname}:${nnHttpPort}
-D dfs.namenode.https-address=${nnHostname}:0
-D dfs.namenode.name.dir=file://${nameDir}
-D dfs.namenode.edits.dir=file://${editsDir}

Do I have a usage error for start-dynamometer-cluster.sh?

Audit workload replay should proxy to use real permissions

Given that permission checking can be a fairly heavy operation, it is not ideal that Dynamometer currently disables permissions to let the workload replay user execute all operations. Instead, this job should proxy as the user who initially performed the operation.
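A minimal sketch of proxying as the audit-log user; the replay user would also need to be whitelisted via the hadoop.proxyuser.* settings on the Dyno NameNode:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.security.UserGroupInformation;

/** Sketch: execute a replayed operation as a proxy user for the UGI recorded in the audit
 *  log, so the NameNode enforces the original caller's permissions. */
class ProxyUserReplay {
  static void replayGetFileInfo(String auditUser, Configuration conf, String src) throws Exception {
    UserGroupInformation proxyUgi =
        UserGroupInformation.createProxyUser(auditUser, UserGroupInformation.getLoginUser());
    proxyUgi.doAs((PrivilegedExceptionAction<Void>) () -> {
      // Create the FileSystem inside doAs so the RPCs carry the proxied user's identity.
      FileSystem fs = FileSystem.get(conf);
      fs.getFileStatus(new Path(src)); // example: replaying a getfileinfo audit entry
      return null;
    });
  }
}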

start-dynamometer-cluster.sh doesn't print help message as expected

I got this:

sunchao@HOST:~/dynamometer$ bin/start-dynamometer-cluster.sh -help
18/07/30 21:54:29 INFO dynamometer.Client: Initializing Client
18/07/30 21:54:29 FATAL dynamometer.Client: Error running Client
org.apache.commons.cli.MissingOptionException: Missing required option: [-hadoop_binary_path Location of Hadoop binary to be deployed (archive). One of this or hadoop_version is required., -hadoop_version Version of Hadoop (like '2.7.4' or '3.0.0-beta1') for which to download a binary. If this is specified, a Hadoop tarball will be downloaded from an Apache mirror. By default the Berkeley OCF mirror is used; specify dyno.apache-mirror as a configuration or system property to change which mirror is used. The tarball will be downloaded to the working directory. One of this or hadoop_binary_path is required.]
        at org.apache.commons.cli.Parser.checkRequiredOptions(Parser.java:299)
        at org.apache.commons.cli.Parser.parse(Parser.java:231)
        at org.apache.commons.cli.Parser.parse(Parser.java:85)
        at com.linkedin.dynamometer.Client.init(Client.java:323)
        at com.linkedin.dynamometer.Client.run(Client.java:228)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
        at com.linkedin.dynamometer.Client.main(Client.java:220)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)

while trying to get the help message from the script. It seems I need to pass an argument for hadoop_binary_path, but is it really necessary for the -help option?

upload-fsimage.sh should check whether VERSION exists before uploading?

upload-fsimage.sh failed at the end because VERSION file already exists in the target HDFS dir:

sunchao@HOSTNAME:~/dynamometer$ bin/upload-fsimage.sh XXX hdfs:///app/dyno/fsimage /home/sunchao/fsimage
Using fsimage: fsimage_XXX
Creating temporary XML fsimage file at /tmp/tmp.IUfFlwXXdk/fsimage_XXX.xml
Created temporary XML fsimage file
Uploading /home/sunchao/fsimage/current/fsimage_XXX...
Uploading /tmp/tmp.IUfFlwXXdk/fsimage_XXX.xml...
Uploading /home/sunchao/fsimage/current/fsimage_XXX.md5...
Uploading /home/sunchao/fsimage/current/VERSION...
copyFromLocal: `hdfs:///app/dyno/fsimage/VERSION': File exists
Error while uploading /home/sunchao/fsimage/current/VERSION; exiting

Perhaps we can check whether VERSION exists before converting to XML? Or maybe skip it if it already exists? This is only a minor issue though, as the fsimage file is already uploaded.

start-dynamometer-cluster.sh fails to start the NameNode

start-dynamometer-cluster.sh command:
./bin/start-dynamometer-cluster.sh -conf_path /root/dynamometer/dynamometer0.1.7/myconf -fs_image_dir hdfs:///dyno/fsimage -block_list_path hdfs:///dyno/blocks -hadoop_binary_path /root/dynamometer/dynamometer0.1.7/hadoop-2.8.3.tar.gz

the console error info:
19/07/18 11:23:37 INFO impl.YarnClientImpl: Submitted application application_1563419715675_0002
19/07/18 11:23:38 INFO dynamometer.Client: Track the application at: http://centos-node1:8088/proxy/application_1563419715675_0002/
19/07/18 11:23:38 INFO dynamometer.Client: Kill the application using: yarn application -kill application_1563419715675_0002
19/07/18 11:23:58 INFO dynamometer.Client: NameNode can be reached via HDFS at: hdfs://centos-node2:9002/
19/07/18 11:23:58 INFO dynamometer.Client: NameNode web UI available at: http://centos-node2:50077/
19/07/18 11:23:58 INFO dynamometer.Client: NameNode can be tracked at: http://centos-node2:8042/node/containerlogs/container_1563419715675_0002_01_000002/root/
19/07/18 11:23:58 INFO dynamometer.Client: Waiting for NameNode to finish starting up...
19/07/18 11:24:09 INFO dynamometer.Client: Infra app exited unexpectedly. YarnState=FINISHED. Exiting from client.
19/07/18 11:24:09 INFO dynamometer.Client: Attempting to clean up remaining running applications.
19/07/18 11:24:09 ERROR dynamometer.Client: Application failed to complete successfully

Then I used "$?" to check the exit status of the last command:
[root@centos-node1 dynamometer0.1.7]# echo $?
2

Inspired by the previous question, I checked the application's stderr:
cat: metricsTailPIDFile: No such file or directory
./start-component.sh: line 299: 2207 Terminated sleep 1

and checked the application's stdout:
starting namenode, logging to /export/server/hadoop-2.8.3/logs/userlogs/application_1563419715675_0002/container_1563419715675_0002_01_000002/hadoop-root-namenode-centos-node2.out
log4j:WARN No appenders could be found for logger (org.apache.hadoop.util.Shell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Going to attempt to upload metrics to: hdfs://centos-node1:9000/user/root/.dynamometer/application_1563419715675_0002/namenode_metrics
Unable to upload metrics to HDFS
Started namenode at pid 2124
Waiting for parent process (PID: 2005) OR namenode process to exit
Cleaning up namenode at pid 2124
Deleting any remaining files

Do I have something wrong in the configuration?

Continue requesting block reports until all DataNodes have reported

Currently within waitForNameNodeReadiness, while waiting for the NameNode to have received enough block reports to be ready for use, the AppMaster will poll the NameNode to discover which DataNodes haven't sent block reports yet and trigger reports on those DataNodes. This can help when a DataNode sent its initial block report before all of its blocks were injected, in which case a better report wouldn't be sent until the block report interval expired (which can be very long). Right now it stops as soon as the block thresholds are met, but it would be better if it continued to do this even after the thresholds are met, until all DataNodes have actually reported.

Dynamometer does not support negative block id's

Dynamometer does NOT support negative block id's, which results in blocks with negative id's never being reported by any simulated DataNode.

A change has been made to XMLParser.java in our branch of Dynamometer so that negative block id's are also dealt with.

Due to the change made above, we have to change SimulatedMultiStorageFSDataset.java as well.

In Dynamometer, each DataNode has more than one SimulatedStorage to manage, and the following Map is maintained by each SimulatedStorage in a simulated DataNode. Moreover, a SimulatedStorage could be involved in multiple blockpools.

Map<blockpool id, Map<block, block information>>

To access a given block (associated with a blockpool id) on a simulated DataNode, we have to
(i) determine which SimulatedStorage this given block belongs to according to its block id, and then
(ii) use the associated blockpool id to retrieve the Map<block, block information> corresponding to the block to be accessed.

The SimulatedStorages managed by a DataNode are arranged in an ArrayList, and each SimulatedStorage in the list is accessed by a non-negative index upper-bounded by the size of that ArrayList, exclusive. To determine which SimulatedStorage a given block belongs to, the original Dynamometer simply uses (block id % number of simulated storages) as the index into the ArrayList mentioned above. Hence, once we have a negative block id, an ArrayIndexOutOfBoundsException is triggered. Some changes have been made in SimulatedMultiStorageFSDataset.java so that negative block ids are properly taken care of.
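A minimal sketch of an index computation that tolerates negative block IDs:

/** Sketch: compute a storage index that is valid even for negative block IDs. */
class StorageIndex {
  static int indexFor(long blockId, int numStorages) {
    // A plain (blockId % numStorages) is negative for negative IDs and would index
    // outside the ArrayList; Math.floorMod always yields a value in [0, numStorages).
    return (int) Math.floorMod(blockId, (long) numStorages);
  }
}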

Executing test com.linkedin.dynamometer.TestDynamometerInfra stuck

I tried to build the latest Dynamometer.
When it runs TestDynamometerInfra, it seems to get stuck at this point.
Does anything need to be pre-installed for the build?

kevin@kevin-pc:~/git/dynamometer(master)$ ./gradlew build
Parallel execution is an incubating feature.

Configure project :
Building version '0.1.7' (value loaded from 'version.properties' file).

Task :dynamometer-workload:compileJava
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.

Task :dynamometer-blockgen:javadoc
/home/kevin/git/dynamometer/dynamometer-blockgen/src/main/java/com/linkedin/dynamometer/blockgenerator/XMLParserMapper.java:26: warning - Tag @link: reference not found: org.apache.hadoop.hdfs.tools.offlineImageViewer.OfflineImageViewer
1 warning

Task :dynamometer-workload:javadoc
/home/kevin/git/dynamometer/dynamometer-workload/src/main/java/com/linkedin/dynamometer/workloadgenerator/TimedInputFormat.java:29: warning - WorkloadDriver#START_TIMESTAMP_MS (referenced by @value tag) is an unknown reference.
1 warning
<============-> 94% EXECUTING [6m 27s]
IDLE
IDLE
:dynamometer-infra:test > 2 tests completed
:dynamometer-infra:test > Executing test com.linkedin.dynamometer.TestDynamometerInfra
IDLE
IDLE

Support being run via Azkaban

Right now only running via command line is supported, but we should also support being run via a workflow scheduler. We can start with Azkaban. Some of the code added should be applicable to other schedulers as well.

Executing Dynamometer workload

Hi,

I collected all prerequisites (fsimage, audit log) and prepared a local environment (accompanying HDFS, separate YARN manager) according to the Dynamometer README, and tried to start the workload scripts. Tried Hadoop versions: 2.7.4 and 2.8.4.

${DYN_HOME}/bin/upload-fsimage.sh 0894 ${HDFS_PATH}/fsimage \
    ${BASE_DIR}/fsimage-${HADOOP_VERSION} 

fsimage - passed

${DYN_HOME}/bin/generate-block-lists.sh \
    -fsimage_input_path ${HDFS_PATH}/fsimage/fsimage_0000000000000000894.xml \
    -block_image_output_dir ${HDFS_PATH}/blocks \
    -num_reducers 10 -num_datanodes 3 

generate-block-lists - passed

${DYN_HOME}/bin/start-dynamometer-cluster.sh "" \
    -hadoop_binary_path file://${BASE_DIR}/hadoop-${HADOOP_VERSION}.tar.gz \
    -conf_path file://${BASE_DIR}/conf.zip \
    -fs_image_dir ${HDFS_PATH}/fsimage \
    -block_list_path ${HDFS_PATH}/blocks

start-dynamometer-cluster: looks like it's working, according to the output:
...
19/10/18 03:56:56 INFO dynamometer.Client: NameNode has started!
19/10/18 03:56:56 INFO dynamometer.Client: Waiting for 2 DataNodes to register with the NameNode...
19/10/18 03:57:02 INFO dynamometer.Client: Number of live DataNodes = 2.00; above threshold of 2.00; done waiting after 6017 ms.
19/10/18 03:57:02 INFO dynamometer.Client: Waiting for MissingBlocks to fall below 0.010199999...
19/10/18 03:57:02 INFO dynamometer.Client: Number of missing blocks: 102.00
19/10/18 04:00:03 INFO dynamometer.Client: Number of missing blocks = 0.00; below threshold of 0.01; done waiting after 180082 ms.
19/10/18 04:00:03 INFO dynamometer.Client: Waiting for UnderReplicatedBlocks to fall below 1.02...
19/10/18 04:00:03 INFO dynamometer.Client: Number of under replicated blocks: 102.00

${DYN_HOME}/bin/start-workload.sh \
    -Dauditreplay.log-start-time.ms=1000 \
    -Dauditreplay.input-path=file://${BASE_DIR}/audit_logs-${HADOOP_VERSION} \
    -Dauditreplay.output-path=${RESULTS_DIR} \
    -Dauditreplay.num-threads=1 \
    -nn_uri hdfs://$HOSTNAME:9000/ \
    -start_time_offset 1m \
    -mapper_class_name AuditReplayMapper

start-workload: it started and never finished; over a couple of hours it kept repeating 'map > map':
19/10/18 04:07:53 INFO workloadgenerator.WorkloadDriver: The workload will start at 1571396933516 ms (2019/10/18 04:08:53 PDT)
19/10/18 04:07:54 INFO Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
19/10/18 04:07:54 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
19/10/18 04:07:55 INFO input.FileInputFormat: Total input files to process : 1
19/10/18 04:07:55 INFO mapreduce.JobSubmitter: number of splits:1
19/10/18 04:07:55 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_local579807884_0001
19/10/18 04:07:55 INFO mapreduce.Job: The url to track the job: http://localhost:8080/
19/10/18 04:07:55 INFO mapreduce.Job: Running job: job_local579807884_0001
19/10/18 04:07:55 INFO mapred.LocalJobRunner: OutputCommitter set in config null
19/10/18 04:07:55 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/10/18 04:07:55 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/10/18 04:07:55 INFO mapred.LocalJobRunner: OutputCommitter is org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
19/10/18 04:07:55 INFO mapred.LocalJobRunner: Waiting for map tasks
19/10/18 04:07:55 INFO mapred.LocalJobRunner: Starting task: attempt_local579807884_0001_m_000000_0
19/10/18 04:07:55 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
19/10/18 04:07:55 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
19/10/18 04:07:55 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
19/10/18 04:07:55 INFO mapred.MapTask: Processing split: file:/home/rscherba/ws/hadoop/dynamometer-test/audit_logs-2.8.4/hdfs-audit.log:0+251649
19/10/18 04:07:55 INFO mapred.MapTask: (EQUATOR) 0 kvi 26214396(104857584)
19/10/18 04:07:55 INFO mapred.MapTask: mapreduce.task.io.sort.mb: 100
19/10/18 04:07:55 INFO mapred.MapTask: soft limit at 83886080
19/10/18 04:07:55 INFO mapred.MapTask: bufstart = 0; bufvoid = 104857600
19/10/18 04:07:55 INFO mapred.MapTask: kvstart = 26214396; length = 6553600
19/10/18 04:07:55 INFO mapred.MapTask: Map output collector class = org.apache.hadoop.mapred.MapTask$MapOutputBuffer
19/10/18 04:07:55 INFO audit.AuditReplayMapper: Starting 1 threads
19/10/18 04:07:55 INFO audit.AuditReplayThread: Start timestamp: 1571396933516
19/10/18 04:07:55 INFO audit.AuditReplayThread: Sleeping for 57526 ms
19/10/18 04:07:56 INFO mapreduce.Job: Job job_local579807884_0001 running in uber mode : false
19/10/18 04:07:56 INFO mapreduce.Job: map 0% reduce 0%
19/10/18 04:08:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:13:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:18:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:23:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:28:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:33:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:38:07 INFO mapred.LocalJobRunner: map > map
19/10/18 04:43:07 INFO mapred.LocalJobRunner: map > map
...

How long should the Dynamometer workload run? How can the run-script arguments affect the test run? How can I check in the logs whether there is something wrong in the configuration?

Miss ALL blocks

To xkrogen,
Good afternoon! The NameNode misses all blocks and no DataNode is registered when doing a Manual Workload Launch. These commands were used:
1.Execute the Block Generation Job:
./generate-block-lists.sh -fsimage_input_path hdfs://cluster/user/qa/dyno/fsimage/fsimage_0000000000282000135.xml -block_image_output_dir hdfs://cluster/user/qa/dyno/blocks -num_reducers 1 -num_datanodes 1

2.Manual Workload Launch:
./start-dynamometer-cluster.sh --hadoop_binary_path hadoop-2.7.3-1.2.7.tar.gz --conf_path /home/hdfs/Dynamometer/dynamometer-0.1.0-SNAPSHOT/bin/hadoop --fs_image_dir hdfs://cluster/user/qa/dyno/fsimage --block_list_path hdfs://cluster/user/qa/dyno/blocks

Fix usage of DFSClient (follow-on to #1)

#1 used direct access to DFSClient to perform more accurate listing operations. To access the DFSClient from within DistributedFileSystem, a utility was added in the o.a.h.hdfs package to access the package-private dfs field. In the hadoopRuntime (default) configuration, the o.a.h package is excluded, so though this works fine in the bundled integration test, it fails when run from the generated zip.

I noticed that DistributedFileSystem exports a public getClient() method which we can use instead. It's marked @VisibleForTesting, but it is still less hacky than using a workaround to access the package-private field.
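A minimal sketch of the proposed usage:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DFSClient;
import org.apache.hadoop.hdfs.DistributedFileSystem;

/** Sketch: obtain the DFSClient through the public accessor instead of reflecting on the
 *  package-private "dfs" field. */
class DfsClientAccess {
  static DFSClient clientFor(FileSystem fs) {
    // getClient() is public on DistributedFileSystem (annotated @VisibleForTesting).
    return ((DistributedFileSystem) fs).getClient();
  }
}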

Provide a way to configure the view ACLs for logs

Currently the default ACLs for viewing container logs are used, meaning only the launching user can view them. We can simply piggyback off of the MapReduce configuration for the same purpose, mapreduce.job.acl-view-job.
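A rough sketch of how the view ACL could be plumbed into the application's launch context, reusing the MapReduce property name mentioned above:

import java.util.Collections;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationAccessType;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;

/** Sketch: apply the MapReduce-style view ACL to the YARN application so other users can
 *  read the container logs. */
class LogViewAcls {
  static void applyViewAcl(Configuration conf, ContainerLaunchContext amContainer) {
    // Same "users groups" format MapReduce uses; a value of " " keeps viewing restricted
    // to the submitting user (plus cluster admins).
    String viewAcl = conf.get("mapreduce.job.acl-view-job", " ");
    Map<ApplicationAccessType, String> acls =
        Collections.singletonMap(ApplicationAccessType.VIEW_APP, viewAcl);
    amContainer.setApplicationACLs(acls);
  }
}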

Error when running start-workload.sh

In the README it says:

./bin/start-workload.sh
    -Dauditreplay.input-path hdfs:///dyno/audit_logs/
    -Dauditreplay.num-threads 50
    -nn_uri hdfs://namenode_address:port/
    -start_time_offset 5m
    -mapper_class_name AuditReplayMapper

However, it seems both -Dauditreplay.input-path and -Dauditreplay.num-threads are not valid options. Are the only valid options nn_uri, start_time_offset, start_timestamp_ms, and mapper_class_name?

Add the ability to specify resources from the unpacked JAR

The Dynamometer job makes use of resources which are within its JAR, such as the start-component.sh script. If the JAR is first unpacked and individual files are added to the classpath, as with the hadoop jar command, this currently works fine. However, it does not work properly if the JAR is not unpacked, since the file does not actually exist anywhere on disk (it is within an archive). We should support this to properly run from a normal JAR.
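A minimal sketch of one way to support this: read the resource from the classpath (which works whether or not the JAR is unpacked) and materialize it to a temporary file before uploading. The class and resource names are illustrative:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Sketch: extract a bundled resource (e.g. start-component.sh) to a temp file so it can
 *  be uploaded even when the JAR has not been unpacked on disk. */
class BundledResource {
  static Path extractToTempFile(String resourceName) throws IOException {
    try (InputStream in = BundledResource.class.getClassLoader().getResourceAsStream(resourceName)) {
      if (in == null) {
        throw new IOException("Resource not found on classpath: " + resourceName);
      }
      Path tmp = Files.createTempFile("dyno-", "-" + resourceName);
      Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
      return tmp;
    }
  }
}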
