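To support the mid-August models release, and to keep monitoring the models in the models repo (https://github.com/PaddlePaddle/models) afterwards, we plan to adapt those models so they can be plugged into the CE monitoring framework.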
Plan
Models to integrate and their owners (to be completed by August 8). Model adaptation spec: #106
Two PRs need to be merged.
Vision. Point of contact: 青青
1. mnist 郭超容 [done]
2. object_detection 一帆 [done, monitoring metrics]
3. image_classification 青青 [merged, monitoring metrics]
4. ocr_recognition 豪爽 [merged, thresholds to be set]
5. icnet 豪爽 [merged, thresholds to be set]
NLP. Point of contact: 毅冰
1. seq2seq 青晟 [done]
2. language_model 超容 [done, monitoring metrics]
3. transformer 郭晟 [merged, monitoring metrics]
4. sequence_tagging_for_ner 毅冰 [done, monitoring job metrics]
5. text_classification 毅冰 [done, monitoring job metrics]
How
Model changes
Example PRs:
mnist: https://github.com/PaddlePaddle/models/pull/1080/files
seq2seq: PaddlePaddle/models#1104
Taking mnist as an example:
1. Changes to the model in the models repo
mnist code reference: PaddlePaddle/models#1080
- Print the KPI metrics in model.py.
[For which KPIs to print, refer to the matching model in the CE models repo: find that model's KPI file, continuous_evaluation.py, in https://github.com/PaddlePaddle/paddle-ce-latest-kpis]
Each line has three tab-separated fields: the literal kpis, the KPI name, and the value.
print("kpis\ttrain_acc\t%f" % train_avg_acc)
print("kpis\ttrain_cost\t%f" % train_avg_loss)
print("kpis\ttest_acc\t%f" % test_avg_acc)
print("kpis\ttrain_duration\t%f" % (pass_end - pass_start))
- Add a _ce.py file that parses the KPI metrics from the log printed by model.py.
# This file is only used for continuous evaluation test!
import os
import sys

# kpi.py comes from the CE framework; $ceroot points at its directory.
sys.path.append(os.environ['ceroot'])
from kpi import CostKpi, DurationKpi, AccKpi

# NOTE kpi.py should be shared in models in some way!
train_cost_kpi = CostKpi('train_cost', 0.02, 0, actived=True)
test_acc_kpi = AccKpi('test_acc', 0.005, 0, actived=True)
train_duration_kpi = DurationKpi('train_duration', 0.06, 0, actived=True)
train_acc_kpi = AccKpi('train_acc', 0.005, 0, actived=True)

tracking_kpis = [
    train_acc_kpi,
    train_cost_kpi,
    test_acc_kpi,
    train_duration_kpi,
]


def parse_log(log):
    '''
    This method should be implemented by model developers.

    Suggestion: each KPI line in the log should hold three tab-separated
    fields: the literal "kpis", the KPI name, and the value. For example:
    "
    kpis\ttrain_cost\t1.0
    kpis\ttest_acc\t1.0
    kpis\ttrain_acc\t1.2
    "
    '''
    for line in log.split('\n'):
        fs = line.strip().split('\t')
        print(fs)
        # Only lines of the form kpis<TAB>name<TAB>value are KPI records.
        if len(fs) == 3 and fs[0] == 'kpis':
            print("-----%s" % fs)
            kpi_name = fs[1]
            kpi_value = float(fs[2])
            yield kpi_name, kpi_value


def log_to_ce(log):
    # Index the tracked KPIs by name so parsed records can be routed to them.
    kpi_tracker = {}
    for kpi in tracking_kpis:
        kpi_tracker[kpi.name] = kpi

    for (kpi_name, kpi_value) in parse_log(log):
        print(kpi_name, kpi_value)
        kpi_tracker[kpi_name].add_record(kpi_value)
        kpi_tracker[kpi_name].persist()


if __name__ == '__main__':
    # The training log is piped in on stdin by .run_ce.sh.
    log = sys.stdin.read()
    print("*****")
    print(log)
    print("****")
    log_to_ce(log)
- Add a launch script, .run_ce.sh (make it executable):
#!/bin/bash
# This file is only used for continuous evaluation.
model_file='model.py'
python $model_file --batch_size 128 --pass_num 5 --device CPU | python _ce.py
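The log format that model.py and _ce.py agree on can be tried without the CE framework installed. The following is a minimal sketch, not the actual _ce.py: `format_kpi` and `parse_kpis` are hypothetical helper names introduced here for illustration, and the sample values are taken from the local test output in this issue.

```python
def format_kpi(name, value):
    """Build one KPI log line: 'kpis', the KPI name, and the value, tab-separated."""
    return "kpis\t%s\t%f" % (name, value)


def parse_kpis(log):
    """Yield (name, value) for every well-formed KPI line; skip everything else."""
    for line in log.split("\n"):
        fields = line.strip().split("\t")
        if len(fields) == 3 and fields[0] == "kpis":
            yield fields[1], float(fields[2])


if __name__ == "__main__":
    # Mix KPI lines with ordinary training output, as a real log would.
    log = "\n".join([
        "pass 0, some unrelated training output",
        format_kpi("train_acc", 0.993373),
        format_kpi("train_cost", 0.021950),
    ])
    print(list(parse_kpis(log)))
```

Lines that are separated by spaces instead of tabs are silently skipped by the parser, so a missing KPI in the CE job is often a sign that the print statement in model.py does not use `\t`.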
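2. Changes to the model in the CE models repo
Once the change from step 1 is merged, set the thresholds in this step.
mnist model code reference: https://github.com/PaddlePaddle/paddle-ce-latest-kpis/tree/master/model_mnist
In the root of that repo (paddle-ce-latest-kpis), add a directory for the model. The directory name starts with the 'model_' prefix. The folders in the two repos correspond as follows:
#ls models_repo/fluid/mnist/ <----> model_mnist/
In that directory:
- Add a launch script, run.xsh (executable)
- Add the matching KPI baseline data:
latest_kpis folder
test_acc_factor.txt train_acc_factor.txt train_cost_factor.txt train_duration_factor.txt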
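Before opening the PR against paddle-ce-latest-kpis, the directory layout described above can be checked with a small script. This is a sketch under assumptions: `missing_ce_files` and `EXPECTED_KPIS` are hypothetical names, not part of the CE framework, and the check only covers the files listed in this issue (run.xsh and the four *_factor.txt files of the mnist example).

```python
import os

# KPI names of the mnist example; other models track a different set.
EXPECTED_KPIS = ["test_acc", "train_acc", "train_cost", "train_duration"]


def missing_ce_files(model_dir, kpi_names=EXPECTED_KPIS):
    """Return a list of layout problems in a model_* directory (empty if none)."""
    problems = []
    dir_name = os.path.basename(os.path.normpath(model_dir))
    if not dir_name.startswith("model_"):
        problems.append("directory name lacks the 'model_' prefix")

    run_script = os.path.join(model_dir, "run.xsh")
    if not os.path.isfile(run_script):
        problems.append("run.xsh is missing")
    elif not os.access(run_script, os.X_OK):
        problems.append("run.xsh is not executable")

    for name in kpi_names:
        factor = os.path.join(model_dir, "latest_kpis", name + "_factor.txt")
        if not os.path.isfile(factor):
            problems.append("latest_kpis/%s_factor.txt is missing" % name)
    return problems


if __name__ == "__main__":
    for problem in missing_ce_files("model_mnist"):
        print(problem)
```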
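Local testing
In the model directory:
Copy kpi.py from https://github.com/PaddlePaddle/continuous_evaluation/blob/develop/kpi.py
Copy config.py from https://github.com/PaddlePaddle/continuous_evaluation/blob/develop/config.py
Run sh .run_ce.sh (_ce.py reads the ceroot environment variable, so it must be set, for example to the directory that holds the copied kpi.py)
Output like the following indicates success: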
['kpis', 'train_acc', '0.993373']
('train_acc', 0.993373)
['kpis', 'train_cost', '0.021950']
('train_cost', 0.02195)
['kpis', 'test_acc', '0.984175']
('test_acc', 0.984175)
['kpis', 'train_duration', '42.244845']
('train_duration', 42.244845)
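To get a feel for what a threshold such as 0.02 for train_cost means, the sketch below compares a new value against a baseline using a relative difference. This is an assumption made for illustration only: the real comparison lives in kpi.py and may differ (for example, it may only flag changes in the worse direction). `within_threshold` and the baseline value 0.0220 are hypothetical.

```python
def within_threshold(current, baseline, diff_thre):
    """True if the relative change |current - baseline| / |baseline| is at most diff_thre."""
    if baseline == 0:
        # A relative change is undefined for a zero baseline; require an exact match.
        return current == 0
    return abs(current - baseline) / abs(baseline) <= diff_thre


if __name__ == "__main__":
    # train_cost from the local run above, checked against a made-up baseline.
    print(within_threshold(0.02195, 0.0220, 0.02))
```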
Example:
CE monitoring job for mnist in the models repo:
http://ce.paddlepaddle.org:8080/viewLog.html?buildId=668&buildTypeId=PaddleCe_ModelsRepo
See also: How CE supports monitoring of models repo models #105