paddlepaddle / paddle-ce-latest-kpis

Paddle Continuous Evaluation, keep updating.

Python 66.04% Shell 22.80% Dockerfile 0.01% Batchfile 11.11% Xonsh 0.03%

paddle-ce-latest-kpis's Introduction

Paddle Continuous Evaluation Baselines

Howtos

Add New Evaluation Task

Using the mnist task as a reference, the CE framework requires the following files:

run.xsh, a script that starts the evaluation run
- this can be any shell script; put #!/bin/bash or #!/bin/xonsh at the top, depending on whether it is written in bash or xonsh
continuous_evaluation.py, which declares all the KPIs this task tracks
latest_kpis directory, which contains all the baseline files

PR and Add to Service

Open a PR against the fast branch and run the ce-kpi-fast-test build on TeamCity.
If it passes, open a PR from fast to the master branch.

Add new KPI to track

See the interface in kpi.py; there are two basic KPIs:

LessWorseKpi
GreaterWorseKpi
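
As a rough illustration of the two directions, the sketch below compares a new value against a baseline. It is not the code in kpi.py: the function names relative_diff and kpi_passes, the relative-difference formula, and the sample numbers are assumptions made for this example. It reads LessWorseKpi as "a smaller value is worse" (accuracy, for instance) and GreaterWorseKpi as "a greater value is worse" (cost or duration).

```python
import math


def relative_diff(current: float, baseline: float) -> float:
    """Signed relative change of current against baseline (assumed formula)."""
    if baseline == 0:
        return math.nan
    return (current - baseline) / abs(baseline)


def kpi_passes(current: float, baseline: float, threshold: float,
               greater_is_worse: bool) -> bool:
    """Return True when the value has not regressed by more than threshold."""
    diff = relative_diff(current, baseline)
    if math.isnan(diff):
        return False  # an undefined ratio is treated as a failure in this sketch
    regression = diff if greater_is_worse else -diff
    return regression <= threshold


# "Greater is worse": a duration that grows by 10% fails a 6% threshold.
print(kpi_passes(42.0 * 1.10, 42.0, 0.06, greater_is_worse=True))
# "Less is worse": an accuracy that drops by 0.1% passes a 0.5% threshold.
print(kpi_passes(0.984 * 0.999, 0.984, 0.005, greater_is_worse=False))
```

In this sketch an improvement always passes; whether the real framework also flags large improvements is not stated here.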

paddle-ce-latest-kpis's Issues

where is memory.txt

Hi, there,

In this file ce_models/seq2seq/get_gpu_data.py line 29
with open('memory.txt', 'r') as f:

Where is memory.txt? I can't find it anywhere.

Performance differences between AWS and intranet machines

Diff reference:

https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/74/files

AWS GPU models:

Tesla V100-SXM2-16GB
Tesla V100-SXM2-16GB
Tesla V100-SXM2-16GB
Tesla V100-SXM2-16GB

Others

ip-172-31-23-234
    description: Computer
    width: 64 bits
    capabilities: vsyscall32
  *-core
       description: Motherboard
       physical id: 0
     *-memory
          description: System memory
          physical id: 0
          size: 240GiB
     *-cpu
          product: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
          vendor: Intel Corp.
          physical id: 1
          bus info: cpu@0
          size: 2699MHz
          capacity: 3GHz
          width: 64 bits
          capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp x86-64 constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single retpoline kaiser fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt cpufreq
     *-pci
          description: Host bridge
          product: 440FX - 82441FX PMC [Natoma]
          vendor: Intel Corporation
          physical id: 100
          bus info: pci@0000:00:00.0
          version: 02
          width: 32 bits
          clock: 33MHz
        *-isa
             description: ISA bridge
             product: 82371SB PIIX3 ISA [Natoma/Triton II]
             vendor: Intel Corporation
             physical id: 1
             bus info: pci@0000:00:01.0
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: isa bus_master
             configuration: latency=0
        *-ide
             description: IDE interface
             product: 82371SB PIIX3 IDE [Natoma/Triton II]
             vendor: Intel Corporation
             physical id: 1.1
             bus info: pci@0000:00:01.1
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: ide bus_master
             configuration: driver=ata_piix latency=64
             resources: irq:0 ioport:1f0(size=8) ioport:3f6 ioport:170(size=8) ioport:376 ioport:c100(size=16)
        *-bridge UNCLAIMED
             description: Bridge
             product: 82371AB/EB/MB PIIX4 ACPI
             vendor: Intel Corporation
             physical id: 1.3
             bus info: pci@0000:00:01.3
             version: 01
             width: 32 bits
             clock: 33MHz
             capabilities: bridge bus_master
             configuration: latency=0
        *-display:0 UNCLAIMED
             description: VGA compatible controller
             product: GD 5446
             vendor: Cirrus Logic
             physical id: 2
             bus info: pci@0000:00:02.0
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: vga_controller bus_master
             configuration: latency=0
             resources: memory:80000000-81ffffff memory:8f004000-8f004fff
        *-network
             description: Ethernet interface
             physical id: 3
             bus info: pci@0000:00:03.0
             logical name: ens3
             version: 00
             serial: 06:64:f5:87:78:b8
             width: 32 bits
             clock: 33MHz
             capabilities: bus_master cap_list ethernet physical
             configuration: broadcast=yes driver=ena driverversion=1.3.0K ip=172.31.23.234 latency=0 link=yes multicast=yes
             resources: irq:0 memory:8f000000-8f003fff
        *-display:1
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1b
             bus info: pci@0000:00:1b.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:400-3ff irq:252 memory:8a000000-8affffff memory:4000000000-43ffffffff memory:82000000-83ffffff
        *-display:2
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1c
             bus info: pci@0000:00:1c.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:440-43f irq:253 memory:8b000000-8bffffff memory:4400000000-47ffffffff memory:84000000-85ffffff
        *-display:3
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1d
             bus info: pci@0000:00:1d.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:480-47f irq:254 memory:8c000000-8cffffff memory:4800000000-4bffffffff memory:86000000-87ffffff
        *-display:4
             description: 3D controller
             product: NVIDIA Corporation
             vendor: NVIDIA Corporation
             physical id: 1e
             bus info: pci@0000:00:1e.0
             version: a1
             width: 64 bits
             clock: 33MHz
             capabilities: bus_master cap_list
             configuration: driver=nvidia latency=248
             resources: iomemory:4c0-4bf irq:255 memory:8d000000-8dffffff memory:4c00000000-4fffffffff memory:88000000-89ffffff
        *-generic
             description: Unassigned class
             product: Xen Platform Device
             vendor: XenSource, Inc.
             physical id: 1f
             bus info: pci@0000:00:1f.0
             version: 01
             width: 32 bits
             clock: 33MHz
             capabilities: bus_master
             configuration: driver=xen-platform-pci latency=0
             resources: irq:47 ioport:c000(size=256) memory:8e000000-8effffff
  *-network:0
       description: Ethernet interface
       physical id: 1
       logical name: veth1678770
       serial: 72:b6:14:e8:37:ef
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:1
       description: Ethernet interface
       physical id: 2
       logical name: vethe2a2a5e
       serial: 5a:7a:fe:1a:a0:22
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:2
       description: Ethernet interface
       physical id: 3
       logical name: vetha97198a
       serial: 12:9e:41:ec:6b:1e
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:3
       description: Ethernet interface
       physical id: 4
       logical name: vethf7fcba7
       serial: 1a:c9:0f:cd:60:42
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:4
       description: Ethernet interface
       physical id: 5
       logical name: veth437a440
       serial: ce:8b:df:ab:d2:77
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s
  *-network:5
       description: Ethernet interface
       physical id: 6
       logical name: veth67fe6d0
       serial: 52:f5:1d:0a:f1:59
       size: 10Gbit/s
       capabilities: ethernet physical
       configuration: autonegotiation=off broadcast=yes driver=veth driverversion=1.0 duplex=full link=yes multicast=yes port=twisted pair speed=10Gbit/s

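Remove randomness from all models

The CE models were moved over gradually from the PaddlePaddle models repo. There are currently 12 of them,
in 3 categories:

NLP	seq2seq, lstm, language_model, transformer, sequence_tagging_for_ner, text_classification
Image	mnist, image_classification, resnet50, vgg16, object_detection
Multi-machine	vgg16_aws_dist

CE has detected that some model metrics still show randomness, for example the duration and memory metrics.
In addition, the acc (accuracy) of some models starts fluctuating again every few days. Current overall status: http://18.222.34.7/
For example: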

#41
#42
http://18.222.34.7/commit/draw_scalar?task=mnist
http://18.222.34.7/commit/draw_scalar?task=image_classification
http://18.222.34.7:8080/viewLog.html?tab=buildLog&buildTypeId=Paddle_ContinuousEvaluation&buildId=828

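Each model area needs an owner, and each owner should determine the thresholds for the unstable metrics of their models.
Within the next two weeks, all threshold alarms caused by unstable metrics should be eliminated.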

vgg16 model randomly hits "Segmentation fault"

In the CE framework, vgg16 has hit a seg fault twice.

First job: http://18.222.34.7:8080/viewLog.html?buildId=1383&buildTypeId=Paddle_ContinuousEvaluation&tab=buildLog

Second job: http://180.76.57.222:8111/viewLog.html?buildId=118&buildTypeId=PaddleCe_CEBuild&tab=buildLog&_focus=7990

:19][Step 1/1] Pass: 0, Loss: 4.501836, Train Accuray: 0.000000
[17:55:19][Step 1/1] 
[17:55:19][Step 1/1] 
[17:55:19][Step 1/1] Total examples: 3040, total time: 68.43846, 44.41947 examples/sed
[17:55:19][Step 1/1] 
[17:55:19][Step 1/1] *** Aborted at 1531245319 (unix time) try "date -d @1531245319" if you are using GNU date ***
[17:55:19][Step 1/1] PC: @                0x0 (unknown)
[17:55:19][Step 1/1] *** SIGSEGV (@0x58) received by PID 4890 (TID 0x7fbc3a8c7700) from PID 88; stack trace: ***
[17:55:19][Step 1/1]     @     0x7fbcc2fe37e0 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc32f650c PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32ff37d PyEval_EvalCodeEx
[17:55:19][Step 1/1]     @     0x7fbcc3276905 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc3244d33 PyObject_Call
[17:55:19][Step 1/1]     @     0x7fbcc32fa0a2 PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32fce9e PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32fce9e PyEval_EvalFrameEx
[17:55:19][Step 1/1]     @     0x7fbcc32ff37d PyEval_EvalCodeEx
[17:55:19][Step 1/1]     @     0x7fbcc3276830 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc3244d33 PyObject_Call
[17:55:19][Step 1/1]     @     0x7fbcc325374d (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc3244d33 PyObject_Call
[17:55:19][Step 1/1]     @     0x7fbcc32f5897 PyEval_CallObjectWithKeywords
[17:55:19][Step 1/1]     @     0x7fbcc3341f32 (unknown)
[17:55:19][Step 1/1]     @     0x7fbcc2fdbaa1 start_thread
[17:55:19][Step 1/1]     @     0x7fbcc269dbcd clone
[17:55:19][Step 1/1]     @                0x0 (unknown)
[17:55:19][Step 1/1] ./run.xsh: line 14:  4890 Segmentation fault      FLAGS_benchmark=true FLAGS_fraction_of_gpu_memory_to_use=0.0 python model.py --device=GPU --batch_size=32 --data_set=flowers --iterations=100 --gpu_id=$cudaid
[17:55:20][Step 1/1] 4887

Both crashes happened in the final test (inference) stage:

        if args.with_test:
            pass_test_acc = test(exe)
        break

Model code:
https://github.com/PaddlePaddle/paddle-ce-latest-kpis/blob/master/vgg16/model.py

Setting the KPI thresholds used for CE monitoring of models repo models

Add a launch script for the model, .run.sh

#!/bin/bash
rm -rf *_factor.txt
model_file='model.py'
python $model_file --batch_size 128 --pass_num 5 --device CPU

Add a threshold file for the model, .continuous_evaluation.py

import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi, DurationKpi, AccKpi

train_cost_kpi = CostKpi('train_cost', 0.02, actived=True)
test_acc_kpi = AccKpi('test_acc', 0.005, actived=True)
train_duration_kpi = DurationKpi('train_duration', 0.06, actived=True)
train_acc_kpi = AccKpi('train_acc', 0.005, actived=True)

tracking_kpis = [
    train_acc_kpi,
    train_cost_kpi,
    test_acc_kpi,
    train_duration_kpi,
]

Add baseline KPI data for the model in a hidden folder, .latest_kpis,
which holds the baseline data for each KPI:
test_acc_factor.txt train_acc_factor.txt train_cost_factor.txt train_duration_factor.txt

See the CE mnist model for reference:
https://github.com/PaddlePaddle/paddle-ce-latest-kpis/tree/master/mnist

Where can I import "kpi" from?

Hi there,

from kpi import AccKpi
from kpi import CostKpi
from kpi import DurationKpi

Where can I find kpi?
It can't be imported by Python:
ModuleNotFoundError: No module named 'kpi'
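
A possible answer, going by the other threads on this page: kpi.py lives in the PaddlePaddle/continuous_evaluation repo, and the CE scripts add its directory to sys.path through the ceroot environment variable. The sketch below mimics that setup. The temporary directory and the stub kpi.py with a placeholder AccKpi class are assumptions for the demo, standing in for a real checkout.

```python
import os
import sys
import tempfile

# Stand-in for a checkout of the continuous_evaluation repo (assumption).
ce_root = tempfile.mkdtemp()
with open(os.path.join(ce_root, "kpi.py"), "w") as f:
    f.write("class AccKpi(object):\n    pass\n")  # stub, not the real class

# The CE scripts read the checkout location from the 'ceroot' variable.
os.environ["ceroot"] = ce_root
sys.path.append(os.environ["ceroot"])

from kpi import AccKpi  # resolves now that the ceroot directory is on sys.path

print(AccKpi.__module__)
```

With a real checkout, pointing ceroot at the directory that contains kpi.py before running the script should have the same effect.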

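lstm single-GPU pass duration is not stable

In the text_classification model (whose randomness has already been removed), the lstm_pass_duration metric also exceeds its threshold [monitoring of this KPI has been turned off in a PR, 0024dac28f494cf86f9027edb372364b247cf2f3]

http://ce.paddlepaddle.org:8080/viewLog.html?buildId=646&buildTypeId=PaddleCe_CEBuild

This is similar to what happens in the lstm model: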

http://ce.paddlepaddle.org:8080/viewLog.html?buildId=609&buildTypeId=PaddleCe_CEBuild

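Clean up the logs of each task

Tasks currently output too many logs. The TeamCity log in particular is too messy, which makes it hard to find where an error occurred.

We can delete part of the output or lower the print frequency.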

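Connecting models repo models to the CE monitoring framework

To support the models release planned for mid-August, and the continuous monitoring of models repo models afterwards (https://github.com/PaddlePaddle/models),
we plan to adapt the models repo models so that they can be connected to the CE monitoring framework.

plan

Models to connect and their owners (to be completed by August 8) [model adaptation spec]: #106
Two PRs need to be merged.
Vision contact: 青青
1. mnist 郭超容 [done]
2. object_detection 一帆 [done, monitoring metrics]
3. image_classification 青青 [merged, monitoring metrics]
4. ocr_recognition 豪爽 [merged, thresholds to be set]
5. icnet 豪爽 [merged, thresholds to be set]

NLP contact: 毅冰
1. seq2seq 青晟 [done]
2. language_model 超容 [done, monitoring metrics]
3. transformer 郭晟 [merged, monitoring metrics]
4. sequence_tagging_for_ner 毅冰 [done, monitoring job metrics]
5. text_classification 毅冰 [done, monitoring job metrics]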

How

Model changes

Example PRs:
mnist: https://github.com/PaddlePaddle/models/pull/1080/files
seq2seq: PaddlePaddle/models#1104
Taking mnist as an example:

1. Changes to the model in the models repo

mnist code reference: PaddlePaddle/models#1080

Output the KPI metrics in model.py.
[For the specific KPIs to output, see the corresponding model in the CE models repo, https://github.com/PaddlePaddle/paddle-ce-latest-kpis, and look at that model's KPI file, continuous_evaluation.py.]
The fields must be separated by tabs:

        print ("kpis\ttrain_acc\t%f" % train_avg_acc)
        print ("kpis\ttrain_cost\t%f" % train_avg_loss)
        print ("kpis\ttest_acc\t%f" % test_avg_acc)
        print ("kpis\ttrain_duration\t%f" % (pass_end - pass_start))

Add a _ce.py file, which parses the KPI metrics from the log output by model.py.

# This file is only used for the continuous evaluation test!

import os
import sys
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi, DurationKpi, AccKpi

# NOTE: kpi.py should be shared with the models repo in some way!

train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
test_acc_kpi = AccKpi('test_acc', 0.005, 0, actived=True)
train_duration_kpi = DurationKpi('train_duration', 0.06, 0, actived=True)
train_acc_kpi = AccKpi('train_acc', 0.005, 0, actived=True)

tracking_kpis = [
    train_acc_kpi,
    train_cost_kpi,
    test_acc_kpi,
    train_duration_kpi,
]

def parse_log(log):
    '''
    This method should be implemented by model developers.

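    Suggested format: each KPI line in the log has three tab-separated
    fields (the literal prefix "kpis", the KPI name, and the value), for example:

    "
    kpis\ttrain_cost\t1.0
    kpis\ttest_cost\t1.0
    kpis\ttrain_cost\t1.0
    kpis\ttrain_cost\t1.0
    kpis\ttrain_acc\t1.2
    "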
    '''
    #kpi_map = {}
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print (fs)
        if len(fs) == 3 and fs[0] == 'kpis':
            print ("-----%s" % fs)
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            #kpi_map[kpi_name] = kpi_value
            yield kpi_name, kpi_value
    #return kpi_map


def log_to_ce(log):
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi

    for (kpi_name, kpi_value) in parse_log(log):
        print (kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()


if __name__ == '__main__':
    log = sys.stdin.read()
    print ("*****")
    print (log)
    print ("****")
    log_to_ce(log)
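
To check the log format without a CE checkout, the parsing step can be exercised on its own. The sketch below is a standalone variant of parse_log that returns a list and drops the debug prints. The function name parse_kpi_lines and the sample log are made up for illustration.

```python
def parse_kpi_lines(log):
    """Collect (name, value) pairs from lines shaped like 'kpis\\t<name>\\t<value>'."""
    records = []
    for line in log.split('\n'):
        fields = line.strip().split('\t')
        if len(fields) == 3 and fields[0] == 'kpis':
            records.append((fields[1], float(fields[2])))
    return records


sample_log = "\n".join([
    "Pass: 0, Loss: 0.5",              # ordinary training output is ignored
    "kpis\ttrain_acc\t0.993373",
    "kpis    train_cost    0.021950",  # spaces instead of tabs: skipped
    "kpis\ttest_acc\t0.984175",
])

for name, value in parse_kpi_lines(sample_log):
    print(name, value)
```

The space-separated line is skipped, which is why the kpis lines printed by model.py need real tab characters.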

Add a launch script, .run_ce.sh (make it executable):

#!/bin/bash
# This file is only used for continuous evaluation.

model_file='model.py'
python $model_file --batch_size 128 --pass_num 5 --device CPU | python _ce.py

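2. Changes to the model in the CE models repo

Once step 1 above has been merged, set the thresholds in step 2.
mnist model code reference: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/tree/master/model_mnist

In the root of this repo (paddle-ce-latest-kpis), add a directory for the model whose name starts with the prefix 'model_'. The directories in the two repos correspond as follows: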

 #ls models_repo/fluid/mnist/   <---->    model_mnist/

In that directory:

Add a launch script, run.xsh (make it executable)

#!/bin/bash

./.run_ce.sh

Add the corresponding KPI baseline data
in the latest_kpis folder

test_acc_factor.txt  train_acc_factor.txt  train_cost_factor.txt  train_duration_factor.txt

How to test locally

In the model directory:
Copy kpi.py from https://github.com/PaddlePaddle/continuous_evaluation/blob/develop/kpi.py
Copy config.py from https://github.com/PaddlePaddle/continuous_evaluation/blob/develop/config.py
Run sh .run_ce.sh
Output like the following indicates success:

['kpis', 'train_acc', '0.993373']
('train_acc', 0.993373)
['kpis', 'train_cost', '0.021950']
('train_cost', 0.02195)
['kpis', 'test_acc', '0.984175']
('test_acc', 0.984175)
['kpis', 'train_duration', '42.244845']
('train_duration', 42.244845)

Example:
models repo mnist CE monitoring job:
http://ce.paddlepaddle.org:8080/viewLog.html?buildId=668&buildTypeId=PaddleCe_ModelsRepo

See also: How CE supports monitoring of models repo models #105

Add distributed resnet50 model to ce

Speedup comparison across GPU types

benchmark file: benchmark/fluid/resnet.py
dataset: flowers
batch size: 64

GPU type	2GPU	4GPU	8GPU
K40	1.78	2.82
P40	1.61	2.05	2.67
V100	1.38	1.93

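image_classification model: acc metric cannot be stabilized

Shuffle has already been removed from this model and a seed has been set, but the acc metric still fluctuates by about 0.05 from run to run.

http://ce.paddlepaddle.org/commit/draw_scalar?task=image_classification

This KPI has been closed in CE: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/pull/69/files

青青, please investigate and re-open this metric.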

Model stability analysis

Environment: Tesla V100, cuda version: 384.81

We ran each model ten times and collected the following data.
resnet50:

kpi	min	max	mean	median	std	(std/mean)*100%
cifar10_128_gpu_memory	1566	1596	1589.6	1596	10	0.6%
cifar10_128_train_acc	0.966	0.998	0.985	0.990	0.011	1.1%
cifar10_128_train_speed	346.3	406.1	373.0	370.7	19.7	5.2%
flowers_64_gpu_memory	10680	12848	12139	12459	779.9	6.4%
flowers_64_train_speed	70.3	76.3	74.1	74.9	2.14	2.8%

resnet30

kpi	min	max	mean	median	std	(std/mean)*100%
train_cost	2.32	2.69	2.51	2.51	0.096	3.8%
train_duration	10.24	10.65	10.366	10.347	0.101	0.9%

mnist

kpi	min	max	mean	median	std	(std/mean)*100%
test_acc	0.9858	0.9890	0.9877	0.9876	0.0010	0.1%
train_acc	0.9919	0.9933	0.9929	0.9929	0.0003	0.03%
train_duration	37.6037	38.7788	38.2673	38.5541	0.4592	1.2%
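
The last column of these tables is the coefficient of variation. A minimal sketch of how such a row could be computed from repeated runs is shown below. The ten duration samples are invented for illustration and are not the measurements behind the tables, and the original analysis does not say whether it used the sample or the population standard deviation (the sketch uses the sample one).

```python
import statistics


def summarize(samples):
    """Return the per-KPI statistics used in the stability tables."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)  # sample standard deviation (assumption)
    return {
        "min": min(samples),
        "max": max(samples),
        "mean": mean,
        "median": statistics.median(samples),
        "std": std,
        "std/mean %": std / mean * 100,
    }


# Hypothetical train_duration values from ten runs (not real data).
runs = [37.6, 38.1, 38.5, 38.6, 38.7, 38.2, 37.9, 38.6, 38.4, 38.8]
for key, value in summarize(runs).items():
    print("%-11s %.4f" % (key, value))
```

A KPI whose std/mean stays well below its diff threshold is a reasonable candidate for monitoring; one that does not will tend to raise false alarms.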

How CE supports monitoring of models repo models

Monitoring for the models repo has been set up in CE:
http://ce.paddlepaddle.org:8080/viewType.html?buildTypeId=PaddleCe_ModelsRepo

Configuration:

git clone https://github.com/PaddlePaddle/models.git models_repo # fetch the models repo models
git clone https://github.com/PaddlePaddle/paddle-ce-latest-kpis.git tasks # fetch the CE models

export specific_tasks='model_mnist';
# list of models repo models to monitor; mnist corresponds to model_mnist in the CE repo
array=(${specific_tasks//,/ });
alias cp='cp';
for task in ${array[@]}  # for each model to monitor
do
    echo $task;
    cp -rf models_repo/fluid/${task:6}/. tasks/${task}/;  # copy over the models repo files (including hidden files such as .run_ce.sh, _ce.py, etc.)
done;
./main.xsh
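
For readers less familiar with bash parameter expansion: ${task:6} drops the first six characters, i.e. the 'model_' prefix, so model_mnist maps to models_repo/fluid/mnist. A rough Python equivalent of the copy loop is sketched below. The function name sync_tasks is made up for this example, and dirs_exist_ok requires Python 3.8 or newer.

```python
import os
import shutil

PREFIX = "model_"  # CE task directories for models repo models use this prefix


def sync_tasks(specific_tasks, models_root, tasks_root):
    """Copy each model directory (hidden files included) into its CE task directory."""
    copied = []
    for task in specific_tasks.split(","):
        source = os.path.join(models_root, "fluid", task[len(PREFIX):])
        target = os.path.join(tasks_root, task)
        # Merges into an existing task directory, like `cp -rf src/. dst/`.
        shutil.copytree(source, target, dirs_exist_ok=True)
        copied.append((source, target))
    return copied
```

A call such as sync_tasks("model_mnist", "models_repo", "tasks") would mirror the shell loop for the single task configured above.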

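A model in the models repo is configured the same way as the corresponding model in the CE models repo. Once the models repo version runs stably, the copy in the CE models repo will be taken offline.
In this way, monitoring for the 10 models that need to be released will be migrated step by step.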

Models active in CE

Baseline models

model	kpi	active	diff threshold
resnet50	cifar10_128_train_acc	yes	0.03
	cifar10_128_train_speed	yes	0.03
	cifar10_128_gpu_memory	yes	0.05
	flowers_64_train_speed	yes	0.05
	flowers_64_gpu_memory	yes	0.05
lstm	imdb_32_train_speed	yes	0.03
	imdb_32_gpu_memory	yes	0.05
vgg16	cifar10_128_train_speed	yes	0.02
	cifar10_128_gpu_memory	yes	0.05
	flowers_32_train_speed	yes	0.02
	flowers_32_gpu_memory	yes	0.05

Business models

model	kpi	active	diff threshold
text_classification	lstm_pass_duration	yes	0.02
language_model	imikolov_20_pass_duration	yes	0.02
sequence_tagging_for_ner	pass_duration	yes	0.02
object_detection	train_cost	yes	0.02
	train_speed	yes	0.02

CE model owners

CE repo models: https://github.com/PaddlePaddle/paddle-ce-latest-kpis

Model owners (from now on, any issue with a model is owned by its model owner):

青晟: seq2seq
于洋: language_model
郭晟: transformer
毅冰: sequence_tagging_for_ner
春伟: text_classification
志宏: lstm
超容: mnist
豪爽: vgg16
白一帆: object_detection
青青: image_classification
卫科: resnet50
佳宜: resnet50_net_CPU
成舵: resnet50_net_GPU

yibing and 青青 will each confirm the models and test scenarios that still need to be added:
NLP:
Vision:

mnist training duration exceeds threshold

http://18.222.34.7:8080/viewLog.html?buildId=748&buildTypeId=Paddle_ContinuousEvaluation&tab=buildChangesDiv

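Adaptation spec for models repo models pending release

Migrate everything to the new API interfaces, e.g. replace every parallelDo with ParallelExecutor.
Each model supports CPU, single-GPU, and multi-GPU runs.
Each model follows the same layout: one file for the model, one for training, one for inference.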

A PR between commits 72ce4d5 and a376efb is problematic

Reverted PaddlePaddle/Paddle#12103 on the ce branch of the paddlepaddle repo: PaddlePaddle/Paddle@7af363d

Debugging shows that performance recovers after the revert:
http://18.222.34.7:8080/viewLog.html?buildId=1462&buildTypeId=Paddle_ContinuousEvaluationDebug&tab=buildLog

Add distribute robust cases into paddle-ce-latest-kpis

vgg16_aws_dist task diff value is nan

error logs:

failed, diff ratio: nan larger than 0.01.

Where can I import commands from?

Hi, guys,

I noticed that the file ce_models/resnet50_net_GPU/train.py has this on line 12:
import commands

but I can't find where to import commands from.
ModuleNotFoundError: No module named 'commands'
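
For context, commands is a Python 2 standard-library module that was removed in Python 3, where subprocess.getoutput and subprocess.getstatusoutput provide the same helpers. A small compatibility shim along these lines lets a script run on both. The echo call is only a placeholder command, and whether train.py needs other Python 3 changes is not covered here.

```python
try:
    import commands  # Python 2 only
except ImportError:
    import subprocess as commands  # Python 3: same getoutput/getstatusoutput helpers

status, output = commands.getstatusoutput("echo hello")
print(status, output)
```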

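Add multi-GPU support to vgg16

After some CE models switched to the new interface, some thresholds need fine-tuning

After the CE models switched from fluid.layers.get_places() to the new interface fluid.layers.device.get_places():
#77

The multi-GPU run time of some models became longer:
http://ce.paddlepaddle.org/commit/draw_scalar?task=language_model

http://ce.paddlepaddle.org/commit/draw_scalar?task=sequence_tagging_for_ner

Of these, language_model exceeds its threshold and needs fine-tuning:
#96

lstm memory KPI exceeds threshold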

http://18.222.34.7:8080/viewLog.html?buildId=742&buildTypeId=Paddle_ContinuousEvaluation

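vgg16_aws_dist model log cleanup

The vgg16_aws_dist model outputs a lot of logs. We suggest changing some of them to debug level and trimming others (the logs are currently at info level).
For example:

These could be changed to debug level

Lines like the following could be merged into one line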

http://18.222.34.7:8080/viewLog.html?buildId=759&buildTypeId=Paddle_ContinuousEvaluation&tab=buildLog&_focus=3278#_state=3278