Computer Vision

TK1: MNIST -09

There are several well-known open-source deep learning frameworks: Caffe, Theano, and Torch. Caffe's dependency stack is:
* Caffe has several dependencies:
-CUDA is required for GPU mode. Caffe requires the CUDA nvcc compiler to compile its GPU code and CUDA driver for GPU operation.
-BLAS via ATLAS, MKL, or OpenBLAS. Caffe requires BLAS as the backend of its matrix and vector computations. There are several implementations of this library: ATLAS, Intel MKL or OpenBLAS.
-Boost, version >= 1.55
-protobuf, glog, gflags, hdf5
* Optional dependencies:
-OpenCV >= 2.4 including 3.0
-IO libraries: lmdb, leveldb (note: leveldb requires snappy)
-cuDNN for GPU acceleration (v5). For best performance, Caffe can be accelerated by NVIDIA cuDNN.
* Pycaffe and Matcaffe interfaces have their own natural needs.
-For Python Caffe: Python 2.7 or Python 3.3+, numpy (>= 1.7), and boost.python (provided by boost).
The main requirements are numpy and boost.python; pandas is useful too and needed for some examples.
-For MATLAB Caffe: MATLAB with the mex compiler. Install MATLAB, and make sure that its mex is in your $PATH.

1 - MNIST

Caffe ships with two demos, MNIST and CIFAR-10; MNIST in particular is considered the "hello world" of Caffe programming.
MNIST is a well-known database of handwritten digits, reportedly written by American high school students, and is maintained by Yann LeCun.
MNIST was originally used for recognizing handwritten digits on checks; today it is the standard entry-level dataset for deep learning.
The classic model for MNIST is LeNet, one of the earliest CNN models.

The usual first "hello world" with Caffe is recognizing the MNIST handwritten digits, in three main steps:
prepare the data, edit the configuration, and use the trained model.
REF: http://yann.lecun.com/exdb/mnist/
http://caffe.berkeleyvision.org/gathered/examples/mnist.html

1.1 Preparing the data

* Download the data:
$ ./data/mnist/get_mnist.sh

When it finishes, four files appear under data/mnist/:
train-images-idx3-ubyte: training set images (9912422 bytes)
train-labels-idx1-ubyte: training set labels (28881 bytes)
t10k-images-idx3-ubyte: test set images (1648877 bytes)
t10k-labels-idx1-ubyte: test set labels (4542 bytes)

MNIST has 60,000 training samples and 10,000 test samples; each sample is a 28*28 grayscale image of a handwritten digit 0-9, so there are 10 classes.
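As a quick sanity check, the raw IDX files can be read directly with numpy; a minimal sketch, assuming the 16-byte image header and 8-byte label header layout described on the MNIST page linked below:

import numpy as np

def load_mnist_images(path):
    # IDX image file: 16-byte big-endian header (magic, count, rows, cols), then raw pixels
    with open(path, 'rb') as f:
        magic, n, rows, cols = np.frombuffer(f.read(16), dtype='>i4')
        return np.frombuffer(f.read(), dtype=np.uint8).reshape(n, rows, cols)

def load_mnist_labels(path):
    # IDX label file: 8-byte big-endian header (magic, count), then one byte per label
    with open(path, 'rb') as f:
        magic, n = np.frombuffer(f.read(8), dtype='>i4')
        return np.frombuffer(f.read(), dtype=np.uint8)

images = load_mnist_images('data/mnist/train-images-idx3-ubyte')
labels = load_mnist_labels('data/mnist/train-labels-idx1-ubyte')
print images.shape, labels[:10]   # (60000, 28, 28) and the first ten digits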

* Convert the data:
The raw files cannot be used by Caffe directly; they must first be converted to LMDB:
$ ./examples/mnist/create_mnist.sh

After the conversion succeeds, two directories appear under examples/mnist/: mnist_train_lmdb and mnist_test_lmdb. The data.mdb and lock.mdb files inside them are the data we will actually run on.
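To peek at what the conversion produced, one record can be read back with the lmdb Python bindings and Caffe's Datum message; a small sketch, assuming pycaffe and the lmdb package are installed and that it is run from the Caffe root:

import lmdb
import numpy as np
from caffe.proto import caffe_pb2

env = lmdb.open('examples/mnist/mnist_train_lmdb', readonly=True)
with env.begin() as txn:
    for key, value in txn.cursor():
        datum = caffe_pb2.Datum()
        datum.ParseFromString(value)   # each record is one serialized Datum
        img = np.frombuffer(datum.data, dtype=np.uint8)
        img = img.reshape(datum.channels, datum.height, datum.width)
        print 'key:', key, 'label:', datum.label, 'shape:', img.shape
        break   # just look at the first record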

1.2 About protobuf

Protocol Buffers (protobuf for short) is a lightweight, efficient structured-data format open-sourced by Google, used to serialize structured data. What protobuf does, XML can also do: persist the contents of some data structure in a defined format, for storage, wire protocols, and similar uses.
Google rolled its own format instead of reusing XML partly for compactness and speed, and partly because protobuf comes with good multi-language support (C++, Java, Python) and a code-generation mechanism.

The structure of a Protocol Buffers interface definition is simple, along the lines of:
package caffe; // define the namespace
message helloworld // define a message type
{ // define the fields
required int32 xx = 1; // required value
optional int32 xx = 2; // optional value
repeated xx xx = 3; // repeatable value
enum xx { // define an enum type
xx = 1;
}
}

For example, suppose the language is C++ and module A sends large volumes of order data to module B over a socket.
First write a proto file, say Order.proto, and add a message named "Order" to it:
message Order
{
required int32 time = 1;
required int32 userid = 2;
required float price = 3;
optional string desc = 4;
}

Then compile Order.proto with the protobuf compiler:
$ protoc -I=. --cpp_out=. ./Order.proto
protoc generates the files Order.pb.cc and Order.pb.h.

On the sending side (module A), the order can be serialized like this:
$ vi Sender.cpp
Order order;
order.set_time(XXXX);
order.set_userid(123);
order.set_price(100.0f);
order.set_desc("a test order");
string sOrder;
order.SerializeToString(&sOrder);
// send the serialized string out through your socket library

On the receiving side (module B), parse it like this:
$ vi Receiver.cpp
string sOrder;
// receive the data from the network into a string sOrder
Order order;
if(order.ParseFromString(sOrder)) // parse the string
{
cout << "userid:" << order.userid() << endl << "desc:" << order.desc() << endl;
}
else
cerr << "parse error!" << endl;

Finally compile everything:
$ g++ Sender.cpp -o Sender Order.pb.cc -I /usr/local/protobuf/include -L /usr/local/protobuf/lib -lprotobuf -pthread

Then test it.

1.3 About caffe.proto

The file caffe.proto defines, using Google protobuf, all of the structured data Caffe needs.
The same directory also holds caffe.pb.h and caffe.pb.cc, which are generated from caffe.proto.

$ vi src/caffe/proto/caffe.proto
syntax = "proto2"; // proto2 is the default
package caffe; // everything is wrapped in the caffe package, accessed via "using namespace caffe;" or caffe::**

Note: .proto files define structure; .prototxt files hold configuration.

1.4 Caffe architecture

Caffe defines a network layer by layer, from the input all the way up to the final output.
Caffe can be understood at four levels: Blob, Layer, Net, and Solver. They relate as follows:
* Blob
A blob is the data unit that flows through the whole framework. It stores all data in the network (values and derivatives) and allocates memory on the CPU and GPU on demand. As Caffe's basic data structure, a blob is usually a 4-D array, Batch x Channel x Height x Width; the element at coordinate (n, k, h, w) sits at physical offset ((n * K + k) * H + h) * W + w (checked in the sketch below). Blobs hold the network's activations and parameters, plus the corresponding gradients (activation residuals and dW, db). A blob exposes a family of similar accessors: cpu_data, gpu_data, cpu_diff, gpu_diff, mutable_cpu_data, mutable_gpu_data, mutable_cpu_diff, mutable_gpu_diff, for data stored on the CPU and GPU respectively. The *_data members store activations and W, b; the *_diff members store residuals and dW, db. The mutable and non-mutable pointers of a pair point at the same memory; the non-mutable one is read-only, the mutable one writable.
* Layer
Layers are the various kinds of neural-network layers, which compose into a network. An image or sample is read in by a data layer and then propagated layer by layer. Apart from the somewhat special data layers, most layers implement four functions: LayerSetUp, Reshape, Forward, and Backward. LayerSetUp initializes the layer, allocates space, and fills in initial values; Reshape adapts the dimensions of the input; Forward is the forward pass and Backward the backward pass.
So how does data travel between layers? Every layer has one or more bottom and top blobs holding its input and output. bottom[0]->cpu_data() is the input activations; top stores the output; cpu_diff() stores the activation residuals; the gpu variants live on the GPU. A layer with several inputs or outputs also has bottom[1], bottom[2], and so on. Each layer's parameters live in this->blobs_: by convention this->blobs_[0] holds W and this->blobs_[1] holds b; this->blobs_[0]->cpu_data() is the value of W and this->blobs_[0]->cpu_diff() its gradient dW, and likewise for b and db. Swap cpu for gpu for the GPU copies, and add mutable to get a writable pointer.
Every layer that can run on the GPU has a .cpp file and a .cu file with the same name; the computations in the .cu file mostly call CUDA kernels.
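The blob offset formula above is easy to verify with a few lines of Python; a toy sketch (all names are illustrative only):

import numpy as np

# Toy blob with N=2 images, K=3 channels, H=4, W=5, i.e. Batch x Channel x Height x Width
N, K, H, W = 2, 3, 4, 5
blob = np.arange(N * K * H * W).reshape(N, K, H, W)

def offset(n, k, h, w):
    # physical position of element (n, k, h, w) in the flat memory layout
    return ((n * K + k) * H + h) * W + w

# the flat index computed by the formula matches numpy's row-major layout
assert blob[1, 2, 3, 4] == blob.ravel()[offset(1, 2, 3, 4)]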
* Net
A Net is the stack of Layers defined by train_val.prototxt. It first initializes each layer, then repeatedly calls Update: each update runs one full forward pass and one backward pass, then applies the gradients computed by each layer. Note that in Backward each layer only computes dW and db; W and b themselves are updated together at the end, in the Net's Update. When training a model in Caffe there are normally two Nets, one for train and one for test.
* Solver
The Solver trains the Net according to the parameters in solver.prototxt. It first initializes a TrainNet and a TestNet, and then its Step function iterates the network. Two steps alternate: ComputeUpdateValue computes the iteration-related quantities (the learning rate, adding in weight decay, and so on), and the Net's Update function applies them to the whole network.

The overall flow is:
The Solver initializes the Net from the parameters in solver.prototxt. The Net then instantiates the Layers described by train_val.prototxt, feeding data blobs into each Layer; a Layer processes the incoming data and returns the computed blobs, which the Net passes on to the next Layer. Each Solver step increments the iteration counter and adjusts quantities such as the learning rate and weight decay.
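From Python the same flow can be driven in a few lines; a minimal sketch, assuming pycaffe is built, the script runs from the Caffe root, and the MNIST solver file from later sections exists:

import caffe

caffe.set_device(0)
caffe.set_mode_gpu()   # match solver_mode: GPU in the solver file
solver = caffe.SGDSolver('examples/mnist/lenet_solver.prototxt')
solver.step(100)       # 100 iterations of forward / backward / update
print solver.net.blobs['loss'].data   # current training loss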

* A simple Net example:
name: "logistic-regression"
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
data_param {
source: "your-source"
batch_size: your-size
}
}
layer {
name: "ip"
type: "InnerProduct"
bottom: "data"
top: "ip"
inner_product_param {
num_output: 2
}
}
layer {
name: "loss"
type: "SoftmaxWithLoss"
bottom: "ip"
bottom: "label"
top: "loss"
}
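Saved as, say, logreg.prototxt (a file name assumed here) with source pointing at a real LMDB, this net can be loaded and inspected from Python:

import caffe

net = caffe.Net('logreg.prototxt', caffe.TEST)   # load the net in TEST phase
print net.blobs.keys()                 # blobs flowing through: data, label, ip, loss
print net.params['ip'][0].data.shape   # weights of the InnerProduct layer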

1.5 LeNet

MNIST handwritten digit recognition is well served by the LeNet network. The LeNet model structure: (figure omitted)

Caffe, however, uses a LeNet variant that replaces the original activation with the Rectified Linear Unit (ReLU); its structure is defined in lenet_train_test.prototxt. This modified LeNet contains two convolution layers, two pooling layers, and two fully connected layers, with the final layer doing the classification: (figure omitted)

Two files need to be configured:
lenet_train_test.prototxt for the training network;
lenet_solver.prototxt for the solver.

1.6 Configuring the training network

$ vi $CAFFE_ROOT/examples/mnist/lenet_train_test.prototxt

1.6.0 Give the network a name
name: "LeNet" # the network's name

1.6.1 Data Layer
Define the TRAIN data layer.
layer {
name: "mnist" # name of this layer
type: "Data" # layer type; besides Data there are MemoryData, HDF5Data, HDF5Output, ImageData, etc.
include {
phase: TRAIN # this layer is used only in the TRAIN phase
}
transform_param {
scale: 0.00390625 # normalization factor: 1/256 maps pixels into [0,1)
} # preprocessing, e.g. mean subtraction, resizing, random crops, mirroring
data_param {
source: "mnist_train_lmdb" # path to the training data; required
backend: LMDB # the default backend is LEVELDB
batch_size: 64 # batch size; the data layer reads 64 LMDB records at a time
}
top: "data" # this layer produces a data blob
top: "label" # this layer produces a label blob
}
This layer is named mnist, of type Data, sourced from LMDB, with batch size 64 and scale factor 1/256 = 0.00390625 so pixel values are normalized into [0,1). It produces two blobs: data and label.
This configures the training input.

1.6.2 Data Layer
Define the TEST data layer.
layer {
name: "mnist"
type: "Data"
top: "data"
top: "label"
include {
phase: TEST # this layer is used only in the TEST phase
}
transform_param {
scale: 0.00390625
}
data_param {
source: "examples/mnist/mnist_test_lmdb" # path to the test data
batch_size: 100
backend: LMDB
}
}
This configures the test input.

1.6.3 Convolution Layer -1
Define convolution layer 1.
layer {
name: "conv1" # this layer is conv1, convolution layer 1
type: "Convolution" # layer type: convolution
param { lr_mult: 1 } # weight learning-rate multiplier; 1 means 1x the global rate (base_lr: 0.01 in lenet_solver.prototxt)
param { lr_mult: 2 } # bias learning-rate multiplier; 2 means 2x the global rate
convolution_param {
num_output: 20 # 20 output feature maps, each of size (data_size - kernel_size)/stride + 1 per side
kernel_size: 5 # 5*5 convolution kernel
stride: 1 # the kernel moves in steps of 1
weight_filler {
type: "xavier" # the xavier initializer scales the weights automatically from the number of input and output neurons
}
bias_filler {
type: "constant" # constant initializer; the default value is 0
}
}
bottom: "data" # input: the data blob produced by the data layer
top: "conv1" # output: the conv1 blob
}
This layer:
parameters (20,1,5,5) and (20,),
input (64,1,28,28),
convolution output (64,20,24,24),
data shape change: 64*1*28*28 -> 64*20*24*24.
Computation: (28 - 5)/1 + 1 = 24, as in the helper sketch below.
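The spatial sizes quoted throughout this section all come from the same formula; a small helper to reproduce them:

def out_size(in_size, kernel, stride=1, pad=0):
    # output spatial size of a convolution or pooling window
    return (in_size + 2 * pad - kernel) // stride + 1

print out_size(28, 5)            # conv1: 24
print out_size(24, 2, stride=2)  # pool1: 12
print out_size(12, 5)            # conv2: 8
print out_size(8, 2, stride=2)   # pool2: 4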

1.6.4 Pooling Layer -1
Define pooling layer 1.
layer {
name: "pool1"
type: "Pooling"
pooling_param {
pool: MAX # max pooling
kernel_size: 2 # 2*2 pooling window
stride: 2 # the window moves in steps of 2, i.e. non-overlapping
}
bottom: "conv1" # input: the conv1 blob from the conv1 layer
top: "pool1" # output: the pool1 blob
}
This layer:
output (64,20,12,12); it has no weights or biases.
Data shape change: 64*20*24*24 -> 64*20*12*12.
Computation: (24 - 2)/2 + 1 = 12, as in the helper sketch above.

1.6.5 Conv2 Layer
Next come the second convolution layer (num_output=50, kernel 5, stride 1) and the second pooling layer.
Define convolution layer 2.
layer {
name: "conv2"
type: "Convolution"
param { lr_mult: 1 }
param { lr_mult: 2 }
convolution_param {
num_output: 50
kernel_size: 5
stride: 1
weight_filler { type: "xavier" }
bias_filler { type: "constant" }
}
bottom: "pool1"
top: "conv2"
}
This layer:
output (64,50,8,8).
Data shape change: 64*20*12*12 -> 64*50*8*8.

1.6.6 Pool2 Layer
Define pooling layer 2.
layer {
name: "pool2"
type: "Pooling"
bottom: "conv2"
top: "pool2"
pooling_param {
pool: MAX
kernel_size: 2
stride: 2
}
}
This layer:
output (64,50,4,4).
Data shape change: 64*50*8*8 -> 64*50*4*4.

1.6.7 Fully Connected Layer -1
Define fully connected layer 1.
Caffe calls fully connected layers InnerProduct (IP) layers.
layer {
name: "ip1"
type: "InnerProduct" # layer type: fully connected
param { lr_mult: 1 }
param { lr_mult: 2 }
inner_product_param {
num_output: 500 # 500 output channels
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
bottom: "pool2"
top: "ip1"
}
This layer:
parameters (500,800) and (500,),
output (64,500,1,1).
Data shape change: 64*50*4*4 -> 64*500*1*1.
The fully connected layer flattens C*H*W into a 1-D feature vector: 50*4*4 = 800 inputs -> 500 = 500*1*1 outputs.
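Numerically the IP layer is just a matrix multiply plus bias; a toy numpy sketch of the shapes involved (random values stand in for real activations and weights):

import numpy as np

batch = 64
x = np.random.rand(batch, 50, 4, 4)   # pool2 output
x = x.reshape(batch, -1)              # flatten C*H*W: (64, 800)
W = np.random.rand(500, 800)          # ip1 weights
b = np.random.rand(500)               # ip1 biases
y = x.dot(W.T) + b                    # (64, 500), i.e. 64*500*1*1
print y.shape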

1.6.8 ReLU Layer
Define the ReLU1 layer, i.e. the nonlinearity.
layer {
name: "relu1"
type: "ReLU" # ReLU, the Rectified Linear Unit, an activation function playing the same role as sigmoid
bottom: "ip1"
top: "ip1" # bottom and top are the same blob, to save memory
# optionally relu_param { negative_slope: ... } sets the negative-half slope for leaky ReLU
}
This layer:
output stays (64,500).
Data shape change: 64*500*1*1 -> 64*500*1*1.
Since ReLU is an element-wise operation, we can do in-place operations to save some memory. This is achieved by simply giving the same name to the bottom and top blobs. Of course, do NOT use duplicated blob names for other layer types!
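In numpy terms the in-place trick looks like this; a toy illustration, not Caffe code:

import numpy as np

x = np.random.randn(64, 500)   # activations from ip1
np.maximum(x, 0, out=x)        # ReLU written back into the same buffer, no extra copy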

1.6.9 InnerProduct Layer -2
Define fully connected layer 2.
layer {
name: "ip2"
type: "InnerProduct"
param { lr_mult: 1 }
param { lr_mult: 2 }
inner_product_param {
num_output: 10 # 10 outputs, one per handwritten digit 0-9
weight_filler {
type: "xavier"
}
bias_filler {
type: "constant"
}
}
bottom: "ip1"
top: "ip2"
}
This layer:
parameters (10,500) and (10,),
output (64,10).
Data shape change: 64*500*1*1 -> 64*10*1*1.

1.6.10 Loss Layer
Define the loss layer.
layer {
name: "loss"
type: "SoftmaxWithLoss" # multi-class loss via softmax regression
bottom: "ip2"
bottom: "label" # the label blob produced by the data layer is finally consumed here; this layer has no top in the train net
}
This layer takes the 64*10 predictions plus the 64 labels and reduces them to a single scalar loss value.
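What SoftmaxWithLoss computes can be spelled out in a few lines of numpy; a toy sketch with random scores and labels:

import numpy as np

scores = np.random.randn(64, 10)            # ip2 output
labels = np.random.randint(0, 10, size=64)  # ground-truth digits

# softmax over each row, with the usual max-subtraction for numerical stability
e = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = e / e.sum(axis=1, keepdims=True)

# average negative log-likelihood of the true classes: one scalar
loss = -np.log(probs[np.arange(64), labels]).mean()
print loss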

1.6.11 Accuracy Layer
The accuracy layer. Caffe's LeNet defines an accuracy layer that computes accuracy only during TEST; it has just name, type, bottom, top, and include { phase: TEST }, and exists to report the test results.
layer {
name: "accuracy"
type: "Accuracy"
bottom: "ip2"
bottom: "label"
top: "accuracy"
include {
phase: TEST
}
}

1.6.12 The whole pipeline
(figure omitted)

REF: https://www.cnblogs.com/xiaopanlyu/p/5793280.html

1.6.13 Additional Notes: Writing Layer Rules
By default, that is without layer rules, a layer is always included in the network.
Layer definitions can include rules for whether and when they are included in the network definition, like:
layer {
// …layer definition…
include: { phase: TRAIN }
}
This is a rule that controls layer inclusion in the network based on the current network state.

1.7 Configuring the solver

With the net prototxt done, we still need a solver; the solver defines how the model's parameters are updated and optimized. For example, the last line,
solver_mode: GPU,
can be changed from GPU to CPU to train on the CPU. GPU versus CPU hardly matters here: the MNIST dataset is so small that the difference is far less pronounced than on ImageNet.

$ vi ./examples/mnist/lenet_solver.prototxt

# specify the training and test nets
net: "examples/mnist/lenet_train_test.prototxt" # path to the network definition file

# test_iter specifies how many forward passes the test should carry out. In the case of MNIST, with a test batch size of 100, 100 test iterations cover the full 10,000 testing images.
test_iter: 100 # number of test iterations; batch size is 100, so 100*100 covers the test set

# Carry out testing every 500 training iterations.
test_interval: 500 # run one test pass every 500 training iterations

# The base learning rate, momentum and the weight decay of the network.
base_lr: 0.01 # base learning rate
momentum: 0.9 # momentum
weight_decay: 0.0005 # weight decay

# learning-rate policy
lr_policy: "inv" # the inv policy returns base_lr * (1 + gamma * iter) ^ (-power)
gamma: 0.0001
power: 0.75

# display results every 100 iterations
display: 100

# maximum number of iterations
max_iter: 10000

# snapshot intermediate results
snapshot: 5000 # save the parameters every 5000 iterations
snapshot_prefix: "examples/mnist/lenet" # model prefix; without it snapshots are named iter_N.caffemodel, with it lenet_iter_N.caffemodel

# solver mode: CPU or GPU
solver_mode: GPU

The supported solver types are:
Stochastic Gradient Descent (type: "SGD"),
AdaDelta (type: "AdaDelta"),
Adaptive Gradient (type: "AdaGrad"),
Adam (type: "Adam"),
Nesterov's Accelerated Gradient (type: "Nesterov") and
RMSprop (type: "RMSProp")
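For reference, the default SGD solver's update with momentum and weight decay amounts to the following; a toy numpy sketch of one parameter update, consistent with the base_lr, momentum, and weight_decay values above:

import numpy as np

lr, momentum, weight_decay = 0.01, 0.9, 0.0005
W = np.random.randn(500, 800)   # some parameter blob
dW = np.random.randn(500, 800)  # its gradient from the backward pass
v = np.zeros_like(W)            # persistent momentum buffer

# one SGD step: weight decay folds into the gradient, momentum smooths it
v = momentum * v - lr * (dW + weight_decay * W)
W += v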

1.8 Training the model

With the network-definition prototxt and the solver prototxt in place, training is simple:
$ cd ~/caffe
$ ./examples/mnist/train_lenet.sh

The script reads the two configuration files, lenet_solver.prototxt and lenet_train_test.prototxt,
... ...
creates each layer in turn: data layer - convolution (conv1) - pooling (pool1) - fully connected (ip1) - nonlinearity - loss layer
... ...
sets up the backward network
... ...
prints the test network and its creation process
... ...
starts optimizing the parameters
... ...
and enters the iterative optimization loop
I1203 solver.cpp:204] Iteration 100, lr = 0.00992565
I1203 solver.cpp:66] Iteration 100, loss = 0.26044
I1203 solver.cpp:84] Testing net
I1203 solver.cpp:111] Test score #0: 0.9785
I1203 solver.cpp:111] Test score #1: 0.0606671
#0 is the accuracy,
#1 is the loss.

For each training iteration, lr is the learning rate of that iteration and loss is the training loss. For the output of the testing phase, score 0 is the accuracy and score 1 is the testing loss. And after a few minutes, you are done!

The whole run takes a while, about 20 minutes, and the final model accuracy is above 0.989.
The trained model is stored as a binary protobuf file at:
./examples/mnist/lenet_iter_10000.caffemodel

Training is now complete; the model is ready for later use.
REF: http://caffe.berkeleyvision.org/gathered/examples/mnist.html

1.9 Using the model

To recognize handwritten digits with the model, someone has published a Python script that classifies digit images:
$ vi end_to_end_digit_recognition.py

# manual input image requirement: white background, black digit
# system input image requirement: black background, white digit

# load settings
caffe_root = "/home/ubuntu/sdcard//caffe-for-cudnn-v2.5.48/"
model_weights = caffe_root + "examples/mnist/lenet_iter_10000.caffemodel"
model_def = caffe_root + "examples/mnist/lenet.prototxt"
image_path = caffe_root + "data/mnist/sample_digit_1.jpeg"

# set up Python environment: numpy for numerical routines, and matplotlib for plotting
import numpy as np
import scipy
import scipy.misc
import os.path
import time
# import matplotlib.pyplot as plt
from PIL import Image
import sys
sys.path.insert(0, caffe_root + 'python')
import caffe

# caffe.set_mode_cpu()
caffe.set_device(0)
caffe.set_mode_gpu()

# set up a network according to the model definition
net = caffe.Net(model_def,     # defines the structure of the model
                model_weights, # contains the trained weights
                caffe.TEST)    # use test mode (e.g., don't perform dropout)

exist_img_time = 0
while True:
    try:
        new_img_time = time.ctime(os.path.getmtime(image_path))
        if new_img_time != exist_img_time:

            # read the image and convert it to grayscale
            image = Image.open(image_path, 'r')
            image = image.convert('L')  # makes it grayscale
            image = np.asarray(image.getdata(), dtype=np.float64).reshape((image.size[1], image.size[0]))

            # resize the image to the expected input size
            image = scipy.misc.imresize(image, [28, 28])
            # the network expects a black background and a white digit
            inputs = 255 - image

            # reshape the input to the expected shape
            inputs = inputs.reshape([1, 28, 28])

            # point the input data blob at the test image
            net.blobs['data'].data[...] = inputs

            # forward pass through the network
            start = time.time()
            net.forward()
            end = time.time()
            output_prob = net.blobs['ip2'].data[0]  # the output scores for the first image in the batch

            print 'predicted class is:', output_prob.argmax()

            duration = end - start
            print duration, 's'
            exist_img_time = new_img_time
    except IndexError:
        pass
    except IOError:
        pass
    except SyntaxError:
        pass

Test it:
$ python ./end_to_end_digit_recognition.py

I1228 17:22:30.683063 19537 net.cpp:283] Network initialization done.
I1228 17:22:30.698748 19537 net.cpp:761] Ignoring source layer mnist
I1228 17:22:30.702311 19537 net.cpp:761] Ignoring source layer loss
predicted class is: 3
0.0378859043121 s
The digit is recognized as "3".

TK1: OpenCV -07

5.3
Check the Jetson TK1 L4T version:
$ head -n 1 /etc/nv_tegra_release
-- # R21 (release), REVISION: 3.0,

Check whether the OS is 32- or 64-bit (it is 32-bit, of course):
$ getconf LONG_BIT
Alternatively run "uname -a": if the output contains x86_64 the system is 64-bit, otherwise 32-bit.

# check the OpenCV version
$ pkg-config --modversion opencv
REF: http://blog.csdn.net/zyazky/article/details/52388756

# set the USB 3.0 port to run as USB 3.0; usb_port_owner_info=2 indicates USB 3.0
$ sudo sed -i 's/usb_port_owner_info=0/usb_port_owner_info=2/' /boot/extlinux/extlinux.conf

# Disable USB autosuspend
$ sudo sed -i '$s/$/ usbcore.autosuspend=-1/' /boot/extlinux/extlinux.conf
// This enables USB 3.0. The default is USB 2.0; /boot/extlinux/extlinux.conf must be modified to enable USB 3.0.

// Two scripts are installed in /usr/local/bin. To conserve power, by default the Jetson suspends power to the USB ports when they are not in use.
In a desktop environment, this can lead to issues with devices such as cameras and webcams.
The first script disables USB autosuspend.
REF: http://www.cnphp6.com/detail/32448

1. Camera
USB 3.0 at 5 Gbps: the full-sized USB port (J1C2 connector) has enough bandwidth to carry uncompressed 1080p video streams.
USB 2.0 at 480 Mbps is the slowest of the possible camera interfaces and usually supports only up to 720p at 30 fps.

1.1 First, is the USB 3.0 port enabled?
$ lsusb
Bus 002 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 001 Device 002: ID 8087:07dc Intel Corp.
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
All hubs are 2.0 root hubs; the USB 3.0 port is not enabled.

1.2 Enable USB 3.0
Enabling support for USB 3.0 on the full-sized USB port is easy; only one parameter in /boot/extlinux/extlinux.conf has to change:
change usb_port_owner_info=0 to usb_port_owner_info=2,
and reboot.
or,
$ sudo sed -i 's/usb_port_owner_info=0/usb_port_owner_info=2/' /boot/extlinux/extlinux.conf
usb_port_owner_info=2 indicates USB 3.0; the default is USB 2.0.

1.3 Disable autosuspend
To conserve power, by default the Jetson suspends power to the USB ports when they are not in use. In a desktop environment, this can lead to issues with devices such as cameras and webcams. Some USB devices and cameras have problems on the Jetson TK1 because the L4T 19.2 OS automatically suspends inactive USB ports to save power.
So you might need to disable USB auto-suspend mode.
You can disable it temporarily, until the next reboot:
$ sudo bash -c 'echo -1 > /sys/module/usbcore/parameters/autosuspend'
or,
to apply it automatically on every boot, edit your '/etc/rc.local' script and add this near the bottom of the file but before the "exit" line:
# Disable USB auto-suspend, since it disconnects some devices such as webcams on Jetson TK1.
echo -1 > /sys/module/usbcore/parameters/autosuspend
or,
# Disable USB autosuspend:
$ sudo sed -i '$s/$/ usbcore.autosuspend=-1/' /boot/extlinux/extlinux.conf

3. Configuration
Based on a Logitech C920 and tegra-ubuntu 3.10.40-gc017b03:
$ lsusb

The C920 shows up as 046d:082d; explore its USB information:
$ lsusb -d 046d:082d -v | less
Which modules are loaded?
$ lsmod

Load the camera driver so the camera can be accessed through V4L2 (for example a USB camera module such as the OV5640):
$ sudo modprobe tegra_camera
-- modprobe: ERROR: could not insert 'tegra_camera': Device or resource busy
Load the UVC module explicitly instead:
$ sudo modprobe uvcvideo

$ ls /dev/vi*
$ cheese --device=/dev/video?
JPEG parameter struct mismatch: library thinks size is 432, caller expects 488
Root cause: the libjpeg running on the board is version 8.x while the caller was built against 6.2, hence the error.
The jpeg error might be because it finds a wrong version of the jpeg library. There is one in the Ubuntu rootfs and one in the NVIDIA binaries; you may want to move the NVIDIA version away temporarily and test again.
That is correct. The problem is in /usr/lib/arm-linux-gnueabihf/libjpeg.so.8.0.2: GStreamer expects a particular version of the library at this path, but some packages replace it with their own. The developers have promised to deal with the problem.
As a workaround, you can try replacing this library with /usr/lib/arm-linux-gnueabihf/tegra/libjpeg.so:
$ cd /usr/lib/arm-linux-gnueabihf
$ ls -al libjpeg*
lrwxrwxrwx 1 root root 16 Dec 20 2013 libjpeg.so -> libjpeg.so.8.0.2
lrwxrwxrwx 1 root root 16 Dec 20 2013 libjpeg.so.8 -> libjpeg.so.8.0.2
-rw-r--r-- 1 root root 157720 Dec 20 2013 libjpeg.so.8.0.2
$ ls -al tegra/libjp*
-rwxrwxr-x 1 root root 305028 Dec 17 16:03 tegra/libjpeg.so
so,
$ sudo ln -sf tegra/libjpeg.so ./libjpeg.so
$ cheese
OKAY
# to revert if ever needed:
$ sudo ln -sf libjpeg.so.8.0.2 ./libjpeg.so

Now give it a test run:
$ export DISPLAY=:0
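Before moving on to the compiled samples, a few lines of Python with the OpenCV bindings are enough to confirm that the camera delivers frames; a minimal sketch, assuming the webcam is /dev/video0 and the cv2 module is installed:

import cv2

cap = cv2.VideoCapture(0)   # device index 0 maps to /dev/video0
ok, frame = cap.read()      # grab a single frame
print ok, frame.shape if ok else None
cap.release()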

4. Do the OpenCV samples run?
First:
# Test a simple OpenCV program. Creates a graphical window, hence you should plug a HDMI monitor in or use a remote viewer such as X Tunneling or VNC or TeamViewer on your desktop.
cd ~/opencv-2.4.10/samples/cpp
g++ edge.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -o edge
./edge
Second:
# If you have a USB webcam plugged in to your board, then test one of the live camera programs.
g++ laplace.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -o laplace
./laplace
Third:
# Test a GPU accelerated OpenCV sample.
cd ../gpu
g++ houghlines.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -o houghlines
./houghlines ../cpp/logo_in_clutter.png

CPU Time : 217.342 ms
CPU Found : 39
GPU Time : 138.108 ms
GPU Found : 199

5. Do your own programs run?
First:
g++ opencv_stream.cpp -I/usr/include/opencv -L/usr/lib -lopencv_calib3d -lopencv_contrib -lopencv_core -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_highgui -lopencv_imgproc -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_ts -lopencv_video -lopencv_videostab -lopencv_esm_panorama -lopencv_facedetect -lopencv_imuvstab -lopencv_tegra -lopencv_vstab -L/usr/local/cuda/lib -lcufft -lnpps -lnppi -lnppc -lcudart -lrt -lpthread -lm -ldl -o camera_stream

/usr/bin/ld: cannot find -lopencv_esm_panorama
/usr/bin/ld: cannot find -lopencv_facedetect
/usr/bin/ld: cannot find -lopencv_imuvstab
/usr/bin/ld: cannot find -lopencv_tegra
/usr/bin/ld: cannot find -lopencv_vstab
Note: not all of the libraries listed above are required to build these samples, but they are included in case of any modifications to the code.
so,
g++ opencv_stream.cpp -I/usr/include/opencv -L/usr/lib -lopencv_calib3d -lopencv_contrib -lopencv_core -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_highgui -lopencv_imgproc -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_ts -lopencv_video -lopencv_videostab -L/usr/local/cuda/lib -lcufft -lnpps -lnppi -lnppc -lcudart -lrt -lpthread -lm -ldl -o camera_stream
$ ./camera_stream

Second:
$ g++ opencv_canny.cpp -I/usr/local/include/opencv -L/usr/local/lib/ -lopencv_contrib -lopencv_core -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_highgui -lopencv_imgproc -lopencv_legacy -lopencv_ml -lopencv_nonfree -lopencv_objdetect -lopencv_ocl -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -L/usr/local/cuda/lib -lcufft -lnpps -lnppi -lnppc -lcudart -lrt -lpthread -lm -ldl -o camera_canny
$ ./camera_canny

Conclusion
Use OpenCV4Tegra only if the performance is really required, and check in the OpenCV4Tegra documentation whether the functions you are using are among the optimized ones.

6. GPU-accelerated OpenCV body detection
(Full Body Detection) Build the OpenCV HOG (Histogram of Oriented Gradients) sample person detector program:
cd opencv-2.4.10/samples/gpu
g++ hog.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_nonfree -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -o hog

./hog --video 768x576.avi
You can run the HOG demo on a pre-recorded video of people walking around. The HOG demo displays a graphical output, hence you should plug in an HDMI monitor or use a remote viewer such as X tunneling, VNC, or TeamViewer on your desktop in order to see the output.
Full Body Detection
./hog --camera /dev/video0
Note: This looks for whole bodies and assumes they are small, so you need to stand at least 5 m away from the camera if you want it to detect you!

7. GPU-accelerated OpenCV face detection
$ cd gpu
g++ cascadeclassifier.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -o cascadeclassifier
$ ./cascadeclassifier --camera 0

The result averages about 7 fps, short of real-time. The remaining work is parallel programming that combines CUDA and OpenCV effectively to reach real-time face detection.

8. Optical flow
Optical flow computation;
the input can come from the camera, or alternatively from a video file.
g++ optical.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_video -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_calib3d -o optical
$ ./optical

Optical flow tracking:
g++ follow.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_video -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_videostab -lopencv_calib3d -o follow
$ ./follow

Appendix: required hacks
Patching OpenCV
OpenCV 2.4.9 and CUDA 6.5 are not compatible out of the box; a few modifications to OpenCV are required.
Edit the file /modules/gpu/src/nvidia/core/NCVPixelOperations.hpp in the OpenCV source tree. Remove the keyword "static" from lines 51 to 58, 61 to 68 and 119 to 133. For example:
Before:
template<> static inline __host__ __device__ Ncv8u _pixMaxVal() {return UCHAR_MAX;}
template<> static inline __host__ __device__ Ncv16u _pixMaxVal() {return USHRT_MAX;}
After:
template<> inline __host__ __device__ Ncv8u _pixMaxVal() {return UCHAR_MAX;}
template<> inline __host__ __device__ Ncv16u _pixMaxVal() {return USHRT_MAX;}

Patching libv4l
This version of OpenCV supports switching resolutions using the libv4l library, but with a limitation: it does not support cameras with resolutions above 5 megapixels. Modifying the libv4l2 library is required to add that support.
Follow these steps to obtain the libv4l2 source, modify and build it, and install it:
$ apt-get source v4l-utils
$ sudo apt-get build-dep v4l-utils
$ cd v4l-utils-/
Edit the file lib/libv4l2/libv4l2-priv.h, changing the line
"#define V4L2_FRAME_BUF_SIZE (4096 * 4096)"
to:
"#define V4L2_FRAME_BUF_SIZE (3 * 4096 * 4096)"
$ dpkg-buildpackage -rfakeroot -uc -b
$ cd ..
$ sudo dpkg -i libv4l-0__.deb
Now OpenCV will support higher resolutions as well.

TK1: Caffe setup -05

5 - cuDNN installation
5.1 Install the old cuDNN 6.5, version 2.0 (recommended)
5.2 Uninstall the old cuDNN
5.3 Install the new cuDNN 7.0, version 4.0 (not recommended)
5.4 Return to the old cuDNN 6.5 v2 to match CUDA 6.5
6 - Caffe installation
6.1 Prepare the Caffe environment
6.2 Download
6.3 Config
6.4 Build
6.5 CUDNN_STATUS_NOT_INITIALIZED error in runtest
6.6 LMDB_MAP_SIZE error
6.7 Benchmarking
7 - Python and/or MATLAB
7.1 The Ubuntu way (ok)
7.2 The pip way (not tested)
7.3 Building the Python interface
7.4 Environment variables

5 - cuDNN installation

To install Caffe, cuDNN must be installed first.

For the 32-bit ARM TK1, the workable combination is CUDA 6.5 + cuDNN 2.0 + Caffe 0.13; the 1.0 master branch on the BVLC site does not work.
For a 32-bit mini PC, you can try CUDA 6.5 + cuDNN 4.0 + Caffe 1.0; CUDA 6.5 is the only choice that supports 32-bit.

5.1 Install the old cuDNN 6.5, version 2.0 (recommended)

Download V2:
https://developer.nvidia.com/rdp/cudnn-archive
$ tar -zxvf cudnn-6.5-linux-ARMv7-V2.tgz
$ cd cudnn-6.5-linux-ARMv7-V2

Copy the files:
$ sudo cp cudnn.h /usr/local/cuda/include
$ sudo cp libcudnn* /usr/local/cuda/lib

Reload the libraries:
$ sudo ldconfig -v

make then failed on Caffe, so I tried the newer cuDNN 4.0. The likely reason: this Caffe checkout is recent and needs a newer cuDNN, since several function names changed after the R1/V2 versions.

5.2 Uninstall the old cuDNN

$ sudo rm /usr/local/cuda/include/cudnn.h
$ sudo rm /usr/local/cuda/lib/libcudnn*

5.3 Install the new cuDNN 7.0, version 4.0 (not recommended)

Download: https://developer.nvidia.com/rdp/cudnn-archive
The newer 4.0 release is cuDNN v4 Library for L4T (ARMv7): cudnn-7.0-Linux-ARMv7-v4.0-prod.tgz
$ tar -zxvf cudnn-7.0-linux-ARMv7-v4.0-prod.tgz

This produces a cuda folder:
$ cd ~/Downloads/cuda/

Copy the cuDNN files:
$ sudo cp include/cudnn.h /usr/local/cuda/include/
$ sudo cp lib/libcudnn* /usr/local/cuda/lib/

Reload the libraries:
$ sudo ldconfig -v

make runtest did not pass on the 32-bit ARM TK1, presumably because the GitHub Caffe checkout is too new.

5.4 Return to the old cuDNN 6.5 v2 to match CUDA 6.5

$ sudo rm /usr/local/cuda/include/cudnn.h
$ sudo rm /usr/local/cuda/lib/libcudnn*
# cd into the cuDNN 6.5 v2 directory
$ sudo cp cudnn.h /usr/local/cuda/include
$ sudo cp libcudnn* /usr/local/cuda/lib
$ sudo ldconfig -v

6 - Caffe installation

6.1 Prepare the Caffe environment

$ sudo add-apt-repository universe
$ sudo apt-get update

$ sudo apt-get install libprotobuf-dev protobuf-compiler
$ sudo apt-get install cmake libleveldb-dev libsnappy-dev
$ sudo apt-get install libatlas-base-dev libhdf5-serial-dev libgflags-dev

Building the Python interface's scipy dependency requires a Fortran compiler (gfortran); without it you get an error, so install it up front:
$ sudo apt-get install gfortran

Installing the following may require a proxy:
$ sudo apt-get install libgoogle-glog-dev liblmdb-dev

The boost step depends on your situation. Some report problems with the boost version and suggest dropping to 1.55, because the Caffe site gives $ sudo apt-get install --no-install-recommends libboost-all-dev, deliberately adding --no-install-recommends, and the installation page explicitly requires Boost >= 1.55.
$ sudo apt-get install libboost-dev libboost-thread-dev libboost-system-dev

The common install above, however, defaults to 1.54 on Ubuntu 14.04. So if it is installed, remove it first, then install 1.55.
Check:
$ dpkg -S /usr/include/boost/version.hpp
-- 1.54
Remove it:
$ sudo apt-get autoremove libboost1.54-dev (beware: this autoremove uninstalls a hundred-odd packages and frees about 980 MB of space; heavy-handed)
Then install 1.55:
$ sudo apt-get install libboost1.55-all-dev libboost-thread1.55-dev libboost-system1.55-dev libboost-filesystem1.55-dev

The compiler step also depends on your situation. The system ships with GCC 4.8; if Caffe hits a compiler error you can drop to 4.7 at any time:
$ sudo apt-get install gcc-4.7 g++-4.7 cpp-4.7
$ cd /usr/bin
$ sudo rm gcc g++ cpp
$ sudo ln -s gcc-4.7 gcc
$ sudo ln -s g++-4.7 g++
$ sudo ln -s cpp-4.7 cpp

Grant access rights:
$ sudo usermod -a -G video $USER

all fine!

6.2 Download

# Git clone Caffe
$ git clone https://github.com/BVLC/caffe.git
$ cd caffe
$ git checkout dev
$ cp Makefile.config.example Makefile.config

6.3 Config

$ vi Makefile.config
Enable USE_CUDNN := 1

Check the CUDA compute capability:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery
-- 3.2
In Caffe's Makefile.config, disable the *_60 and *_61 entries of CUDA_ARCH; for the TK1 (compute capability 3.2) something like:
CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
             -gencode arch=compute_20,code=sm_21 \
             -gencode arch=compute_30,code=sm_30 \
             -gencode arch=compute_35,code=sm_35 \
             -gencode arch=compute_50,code=sm_50 \
             -gencode arch=compute_50,code=compute_50

If make reports Error: nvcc fatal : Unsupported gpu architecture 'compute_60':
$ vi Makefile.config
# CUDA architecture setting: going with all of them.
# For CUDA < 6.0, comment the *_50 through *_61 lines for compatibility.
# For CUDA < 8.0, comment the *_60 and *_61 lines for compatibility.
CUDA_ARCH := -gencode arch=compute_20,code=sm_20 \
             -gencode arch=compute_20,code=sm_21 \
             -gencode arch=compute_30,code=sm_30 \
             -gencode arch=compute_35,code=sm_35 \
             -gencode arch=compute_50,code=sm_50 \
             -gencode arch=compute_52,code=sm_52
# -gencode arch=compute_60,code=sm_60
# -gencode arch=compute_61,code=sm_61
# -gencode arch=compute_61,code=compute_61

6.4 Build

(Whenever something breaks, go through this sequence again: make clean -> make all -j4 -> make test -j4 -> make runtest -j4)
$ make clean (ok)
$ make all -j4 (20 min, ok)
$ make test -j4 (5 min, ok)
$ make runtest -j4 (10 min)

Almost all problems show up in runtest:
Q1
A common mistake is running the tests with sudo; runtest must not be run with sudo.
Q2
Error:
g++: internal compiler error: Killed (program cc1plus)
Fix: reboot. Cause unknown (some say memory pressure).

6.5 CUDNN_STATUS_NOT_INITIALIZED error in runtest

Note: Randomizing tests' orders with a seed of 86430 .
F1222 00:00:55.561334 15196 cudnn_softmax_layer.cpp:15] Check failed: status == CUDNN_STATUS_SUCCESS (1 vs. 0) CUDNN_STATUS_NOT_INITIALIZED
*** Check failure stack trace: ***

This indicates a cuDNN/CUDA error. It is typically seen when the CUDA/driver version and the cuDNN version are mismatched (e.g. cuDNN 5.0.5 with CUDA 7.0); it is almost certainly a setup problem, not a bug in Caffe.
Q1
Some suggest downgrading the compiler:
$ cd /usr/bin
$ sudo rm gcc g++ cpp
$ sudo ln -s gcc-4.7 gcc
$ sudo ln -s g++-4.7 g++
$ sudo ln -s cpp-4.7 cpp
Q2
Others blame the boost version and recommend dropping to 1.55, because the Caffe site gives $ sudo apt-get install --no-install-recommends libboost-all-dev
with --no-install-recommends added deliberately, and the installation page requires Boost >= 1.55, whereas the command above installs 1.54 by default. So uninstall and reinstall with: $ sudo apt-get install libboost1.55-all-dev
Verify:
$ dpkg -S /usr/include/boost/version.hpp
-- 1.54
Uninstall:
$ sudo apt-get autoremove libboost1.54-dev (probably not the real cause; this autoremove removes a hundred-odd packages and frees 980 MB of space, heavy-handed)
Install:
$ sudo apt-get install libboost1.55-all-dev
Minor hiccups:
cannot find -l boost_system boost_filesystem boost_thread
$ apt-cache search libboost | grep 1.55
$ sudo apt-get install libboost-system1.55-dev
$ sudo apt-get install libboost-filesystem1.55-dev
$ sudo apt-get install libboost-thread1.55-dev
The CUDNN_STATUS_NOT_INITIALIZED error persisted regardless.

Q3: the actual cause and fix
When running $ git checkout dev there was already a warning that the branch does not exist; I did not notice it.
The dev branch is gone; even upstream it no longer exists.

So what was cloned is the master branch, which is far too new; the dev branch that worked on the TK1 has disappeared.

The current master branch of Caffe requires at least cuDNN 5.0 and CUDA 7.0,
while we are on cuDNN 2.0 + CUDA 6.5:

cuDNN v5.1 has different versions for CUDA 7.5 and CUDA 8.0
cuDNN v5 has different versions for CUDA 7.5 and CUDA 8.0
cuDNN v4 and v3 both require CUDA 7.0
cuDNN v2 and v1 both require CUDA 6.5

A Caffe that works with cuDNN 2.x can be cloned from:
https://github.com/RadekSimkanic/caffe-for-cudnn-v2.5.48
There is also a Caffe said to work with cuDNN v2 on the Jetson TK1: https://github.com/platotek/caffetk1
though it is not necessarily reliable.

$ make runtest -j4
Major revision number: 3
Minor revision number: 2
Name: GK20A
Total global memory: 1980252160
Total shared memory per block: 49152
Total registers per block: 32768
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 852000
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 1
Kernel execution timeout: No
Unified virtual addressing: Yes

There is still a small LMDB_MAP_SIZE error, but it is easy to fix.

6.6 LMDB_MAP_SIZE error

FAILED:
F1222 07:03:16.822439 16826 db_lmdb.hpp:14] Check failed: mdb_status == 0 (-30792 vs. 0) MDB_MAP_FULL: Environment mapsize limit reached

The default map size is too large for a 32-bit system; shrink the 1 TB constant.
This issue arises because the Jetson is a 32-bit (ARM) device and the constant LMDB_MAP_SIZE in src/caffe/util/db.cpp is too big for it to understand. Unfortunately master has a really large value for LMDB_MAP_SIZE, which confuses the little 32-bit ARM processor on the Jetson, eventually leading to Caffe tests failing with errors like MDB_MAP_FULL: Environment mapsize limit reached.

$ vi src/caffe/util/db_lmdb.cpp
const size_t LMDB_MAP_SIZE = 1099511627776; // 1 TB
Change it to 2^29 (536870912).

$ vi ./examples/mnist/convert_mnist_data.cpp
Adjust the value from 1099511627776 to 536870912 there as well.

$ make runtest -j 4
… … …
[==========] 1702 tests from 251 test cases ran. (5165779 ms total)
[ PASSED ] 1702 tests.
YOU HAVE 2 DISABLED TESTS

OKAY !

6.7 Benchmarking

Finally, run Caffe's benchmarking code to measure performance and compare CPU and GPU efficiency:

* The CPU run takes roughly 600 seconds:
$ build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt
… …
I1222 09:11:54.935829 19824 caffe.cpp:366] Average Forward pass: 5738.58 ms.
I1222 09:11:54.935860 19824 caffe.cpp:368] Average Backward pass: 5506.83 ms.
I1222 09:11:54.935890 19824 caffe.cpp:370] Average Forward-Backward: 11246.2 ms.
I1222 09:11:54.935921 19824 caffe.cpp:372] Total Time: 562310 ms.
I1222 09:11:54.935952 19824 caffe.cpp:373] *** Benchmark ends ***
ok.
These results are the summation of 10 iterations, so the per-image recognition time on the Average Forward pass is the listed result divided by 10; e.g. 227.156 ms would be ~23 ms per image recognition.

* The GPU run takes roughly 30 seconds:
$ build/tools/caffe time --model=models/bvlc_alexnet/deploy.prototxt --gpu=0
… …
I1222 09:16:02.577358 19857 caffe.cpp:366] Average Forward pass: 278.286 ms.
I1222 09:16:02.577504 19857 caffe.cpp:368] Average Backward pass: 318.795 ms.
I1222 09:16:02.577637 19857 caffe.cpp:370] Average Forward-Backward: 599.67 ms.
I1222 09:16:02.577800 19857 caffe.cpp:372] Total Time: 29983.5 ms.
I1222 09:16:02.577951 19857 caffe.cpp:373] *** Benchmark ends ***
ok.
It's running 50 iterations of the recognition pipeline, each one analyzing 10 different crops of the input image, so look at the 'Average Forward pass' time and divide by 10 to get the timing per recognition result.

After this the demos can be tried. Caffe ships two demos, MNIST and CIFAR-10; MNIST in particular is the "hello world" of Caffe programming.

7 - Python and/or MATLAB

Only after Caffe itself is built can the pycaffe interface be compiled.

Caffe has Python, C++, and shell interfaces; Python is especially convenient with Caffe, and the examples document the interface.
The installation steps are: Python dependency packages, MATLAB, and the MATLAB engine for Python.

Even with the default Python present, python-dev still needs installing. Why?
Linux distributions usually split a library's headers and pkg-config data into a separate xxx-dev(el) package. With Python, some use cases need python-dev, for example:
installing an out-of-repo Python library that contains C/C++ files calling the Python API and needing compilation;
building your own program that links against libpythonXX.(a|so).

Using Python on Ubuntu, you often need to install additional Python libraries.
The two usual routes are pip install and Ubuntu's own apt-get install. The differences?
* pip installs from PyPI; apt-get installs from the Ubuntu repositories.
For Python packages, PyPI carries far more than Ubuntu, and for a given package PyPI offers more versions to download. pip-installed packages can also be confined to the current project.
* apt-get installs system-wide packages, fully integrated into the system.
* apt-get and pip may name the same Python package differently: with apt-get, a Python 2 package is usually named python-

The file ~/caffe/python/requirements.txt lists the required dependencies:
Cython>=0.19.2
numpy>=1.7.1
scipy>=0.13.2
scikit-image>=0.9.3
matplotlib>=1.3.1
ipython>=3.0.0
h5py>=2.2.0
leveldb>=0.191
networkx>=1.8.1
nose>=1.3.0
pandas>=0.12.0
python-dateutil>=1.4,<2
protobuf>=2.5.0
python-gflags>=2.0
pyyaml>=3.10
Pillow>=2.3.0
six>=1.1.0

7.1 The Ubuntu way (ok)

Install the build dependencies:
$ sudo apt-get install build-essential

Caffe's Python interface needs numpy:
$ sudo apt-get install python-numpy

Install scipy:
$ sudo apt-get install python-scipy

boost:
$ sudo apt-get install libboost-python1.55-dev
//$ sudo apt-get install libboost-python-dev -X

$ sudo apt-get install python-protobuf

$ sudo apt-get install python-skimage

7.2 The pip way (not tested)

Install with pip: // Use pip to install numpy and scipy instead for newer versions.
$ for req in $(cat requirements.txt); do pip install $req; done
(installing Google's protobuf in the middle may require a proxy)
Some people online advise against the requirements file and prefer installing the packages one by one.

As a one-liner:
$ sudo pip install cython numpy scipy scikit-image matplotlib ipython h5py leveldb networkx nose pandas python-dateutil protobuf python-gflags pyyaml pillow six

$ sudo pip install scikit-image
generated an error: Exception IndexError: list index out of range

Also note that in Caffe's Makefile.config there are these lines:
PYTHON_INCLUDE := /usr/include/python2.7 <-- correct
/usr/lib/python2.7/dist-packages/numpy/core/include <-- doesn't exist
so, try:
$ pip install -U scikit-image
-U is --upgrade, i.e. upgrade to the latest version if already installed.

The official advice is to install the Anaconda distribution: download the .sh file from the Anaconda site, run it, and add its bin directory to the environment variables. Anaconda is independent of the system Python libraries and provides most of the scientific Python libraries Caffe needs. Note that when running Caffe you may hit "cannot find libxxx.so" errors even though locate libxxx.so shows them installed inside Anaconda. The first instinct is to add $your_anaconda_path/lib to LD_LIBRARY_PATH under /etc/ld.so.conf.d/, but doing so may leave you unable to reach the desktop after logging out!!! The (suspected) reason is that some libraries in anaconda/lib conflict with the system's own. The correct approach: to avoid pushing anaconda/lib into the system library path at boot, add the library path in your own ~/.bashrc instead, e.g. append two lines at the end:
# add library path
LD_LIBRARY_PATH=your_anaconda_path/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH
It takes effect in any newly opened terminal, and after a reboot lightdm loads fine and the desktop comes up.

$ vi ~/.bashrc
export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH
sudo ldconfig

7.3 Building the Python interface

Build the Python wrapper:
$ cd ~/caffe
$ make pycaffe -j4 -- ends with ALL TESTS PASSED, ok
$ make pytest -j4

Test it in Python:
$ cd caffe-folder/python
$ python
>>> import caffe
If there is no error, the Caffe installation is fully complete.

7.4 Environment variables

Add caffe/python to the PYTHONPATH variable, so the Caffe Python interface can be called without first cd-ing into caffe/python:
$ vi ~/.bashrc
#set caffe PYTHONPATH
export PYTHONPATH=/path/to/caffe/python:$PYTHONPATH

TK1: OpenCV setup -03

3 - CUDA installation

Before OpenCV, make sure CUDA is OK.

You have two options for developing CUDA applications for the Jetson TK1:
native compilation and cross-compilation.
Native compilation means building directly on the target board; for the TK1 that means compiling the code on the TK1 itself.
Cross-compilation, the approach most embedded work takes, means compiling on a desktop machine and then deploying the result to run on the target board.

For TK1 development, native compilation is recommended.

1) Native compilation (compiling code onboard the Jetson TK1) is generally the easiest option, but takes longer to compile; 2) cross-compilation is typically more complex to configure and debug, but for large projects it will be noticeably faster at compiling.
The CUDA Toolkit currently only supports cross-compilation from an Ubuntu 12.04 or 14.04 Linux desktop.
In comparison, native compilation happens onboard the Jetson device and thus is the same no matter which OS or desktop you have.

So don't download the wrong package: the Toolkit for L4T, not the Toolkit for Ubuntu.

Installing the CUDA Toolkit onto the device enables native CUDA development; the alternative is cross-compilation (compiling code on an x86 desktop in a special way so it can execute on the Jetson TK1 target device).
(Make sure you download the Toolkit for L4T and not the Toolkit for Ubuntu, since the latter is for cross-compilation instead of native compilation.)

3.1 Install the CUDA 6.5 Toolkit for L4T

Download:
http://developer.download.nvidia.com/embedded/L4T/r21_Release_v3.0/cuda-repo-l4t-r21.3-6-5-prod_6.5-42_armhf.deb
Install:
$ cd ~/Downloads
# Install the CUDA repo metadata that you downloaded manually for L4T
$ sudo dpkg -i ./cuda-repo-l4t-r21.3-6-5-prod_6.5-42_armhf.deb
Update apt-get:
# Download & install the actual CUDA Toolkit including the OpenGL toolkit from NVIDIA. (It only downloads around 15MB)
$ sudo apt-get update

3.2 Install the toolkit
# Install "cuda-toolkit-6-5", etc.
$ sudo apt-get install cuda-toolkit-6-5

# Allow the current user to access the GPU
# Add yourself to the "video" group to allow access to the GPU
$ sudo usermod -a -G video $USER

Add the 32-bit CUDA paths to your .bashrc login script,
and start using it in your current console:

$ echo "" >> ~/.bashrc
$ echo "# Add CUDA bin & library paths:" >> ~/.bashrc
$ echo "export PATH=/usr/local/cuda/bin:$PATH" >> ~/.bashrc
$ echo "export LD_LIBRARY_PATH=/usr/local/cuda/lib:${LD_LIBRARY_PATH}" >> ~/.bashrc
$ source ~/.bashrc
(optionally) $ sudo reboot

3.3 Verify CUDA capability

Check that the build environment is installed:
$ cd /usr/local/cuda
ok
$ nvcc -V
-- nvcc: NVIDIA (R) Cuda compiler driver
-- Cuda Release 6.5 V6.5.35

3.3.1 Install the CUDA samples via apt (good choice)
Installing & running the CUDA samples (optional):
$ sudo apt-get install cuda-samples-6-5

Then build and run deviceQuery:
$ cd /usr/local/cuda
$ sudo chmod o+w samples/ -R
$ cd samples/1_Utilities/deviceQuery
$ make

Check the CUDA compute capability:
$ ../../bin/armv7l/linux/release/gnueabihf/deviceQuery
-- Detected 1 CUDA Capable device(s)
-- CUDA Driver Version / Runtime Version 6.5 / 6.5
-- CUDA Capability Major/Minor version number: 3.2
-- Result = PASS
The last few lines should report CUDA version 6.5 and Result = PASS.
If your CUDA compute capability is below 3.0, skip the related parts.
or,

3.3.2 Install the CUDA samples from source
If you think you will write your own CUDA code or you want to see what CUDA can do,
then follow this section to build & run all of the CUDA samples.

Install writeable copies of the CUDA samples to your device's home directory (it will create a "NVIDIA_CUDA-6.5_Samples" folder):
$ cuda-install-samples-6.5.sh /home/ubuntu
Build the CUDA samples (takes around 15 minutes on Jetson TK1):
$ cd ~/NVIDIA_CUDA-6.5_Samples
$ make

Then test CUDA by running some of the samples:
1_Utilities/deviceQuery/deviceQuery
1_Utilities/bandwidthTest/bandwidthTest
cd 0_Simple/matrixMul
./matrixMulCUBLAS
cd ../..
cd 0_Simple/simpleTexture
./simpleTexture
cd ../..
cd 3_Imaging/convolutionSeparable
./convolutionSeparable
cd ../..
cd 3_Imaging/convolutionTexture
./convolutionTexture
cd ../..

3.4 Remote testing:

Note:
Many of the CUDA samples use OpenGL GLX and open graphical windows.

If you are running these programs through an SSH remote terminal, you can remotely display the windows on your desktop by typing “export DISPLAY=:0” and then executing the program.
(This will only work if you are using a Linux/Unix machine or you run an X server such as the free “Xming” for Windows).

eg:
$ export DISPLAY=:0
$ cd ~/NVIDIA_CUDA-6.5_Samples/2_Graphics/simpleGL
./simpleGL
$ cd ~/NVIDIA_CUDA-6.5_Samples/3_Imaging/bicubicTexture
./bicubicTexture
$ cd ~/NVIDIA_CUDA-6.5_Samples/3_Imaging/bilateralFilter
./bilateralFilter

If you hit
#error -- unsupported GNU version! gcc 4.9 and up are not supported!
switch to an older compiler (4.8 here):
$ cd /usr/bin
$ sudo rm gcc g++ cpp
$ sudo ln -s gcc-4.8 gcc
$ sudo ln -s g++-4.8 g++
$ sudo ln -s cpp-4.8 cpp

3.5
Note:
the Optical Flow sample (HSOpticalFlow) and the 3D stereo sample (stereoDisparity) take roughly 1 minute each to execute, since they compare results with CPU code.

Some of the CUDA samples use other libraries such as OpenMP or MPI or OpenGL.
If you want to compile those samples then you’ll need to install these toolkits like this:
(to be added)

4 - OpenCV installation

The Tegra platform provides GPU acceleration for OpenCV.
Besides OpenCV's GPU module, a whole series of commonly used OpenCV modules, such as the core module, can use GPU acceleration.

GPU acceleration presupposes that CUDA is already installed
and that OpenCV is compiled for the device with nvcc.

There are two ways to install and configure OpenCV on the Jetson:

Method 1: use the official JetPack installer, which installs the latest OpenCV4Tegra. This requires a PC running Ubuntu 14.04 LTS x64 and putting the Jetson TK1 into recovery mode.
Using NVIDIA's package directly is simple and adds CPU NEON optimizations; the drawback is that the version is not very new (around 2.4), so code written against opencv-3.0.0 may need changes.
Method 2: build from source,
strictly in the order CUDA -> OpenCV4Tegra -> OpenCV, never reversed. This gets you the latest OpenCV code; the drawback is a more involved process.

Recommended: the manual source build.

=== Manual source build ===
Installing OpenCV consists of two parts, OpenCV4Tegra and the OpenCV sources:
*0. prerequisites
*1. OpenCV4Tegra installation
*2. OpenCV installation

4.0 Prerequisites
Enable the Universe repository:
$ sudo add-apt-repository universe
$ sudo apt-get update

Install a few required libraries:
# Some general development libraries: the basic g++ compiler and cmake
$ sudo apt-get install build-essential make cmake cmake-curses-gui g++

# libav video input/output development libraries
$ sudo apt-get install libavformat-dev libavutil-dev libswscale-dev

# Video4Linux camera development libraries
$ sudo apt-get install libv4l-dev

# Eigen3 math development libraries
$ sudo apt-get install libeigen3-dev

# OpenGL development libraries (to allow creating graphical windows; not all of OpenGL)
$ sudo apt-get install libglew1.6-dev

# GTK development libraries (to allow creating graphical windows)
$ sudo apt-get install libgtk2.0-dev

4.1 Installing OpenCV4Tegra

4.1.1 Download: libopencv4tegra-repo_l4t-r21_2.4.10.1_armhf.deb
(mind the version) Download the OpenCV deb package.
Only 21.2 is available, not 21.3:
https://developer.nvidia.com/embedded/downloads (find what you need there)

Or go directly to:
http://developer.download.nvidia.com/embedded/OpenCV/L4T_21.2/libopencv4tegra-repo_l4t-r21_2.4.10.1_armhf.deb

4.1.2 Install the OpenCV optimization package:
$ sudo dpkg -i libopencv4tegra-repo_l4t-r21_2.4.10.1_armhf.deb
$ sudo apt-get update

4.1.3 Install OpenCV4Tegra:
$ sudo apt-get install libopencv4tegra
$ sudo apt-get install libopencv4tegra-dev
This may keep failing, usually a firewall problem; re-run apt-get update repeatedly and install all pending updates first.

4.1.4 Verify that OpenCV installed:
cd /usr/lib or cd /usr/include
and check whether the OpenCV libraries or headers are present.

4.2 Building OpenCV from source

4.2.1 Download OpenCV 2.4.10
Download from: http://opencv.org/
Or download directly on the TK1:
$ wget http://downloads.sourceforge.net/project/opencvlibrary/opencv-unix/2.4.10/opencv-2.4.10.zip

OR, from the command line you can run this on the device:
$ wget https://github.com/Itseez/opencv/archive/2.4.10.zip

4.2.2 Configure OpenCV 2.4.10
$ cd Downloads
$ unzip opencv-2.4.10.zip
$ cd opencv-2.4.10/
$ mkdir build
$ cd build
$ cmake -D WITH_CUDA=ON -D CUDA_ARCH_BIN="3.2" -D CUDA_ARCH_PTX="" -D BUILD_TESTS=OFF -D BUILD_PERF_TESTS=OFF -D WITH_OPENGL=ON -D WITH_QT=ON ..
or,
$ cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_TBB=ON -D BUILD_NEW_PYTHON_SUPPORT=ON -D WITH_V4L=ON -D INSTALL_C_EXAMPLES=ON -D INSTALL_PYTHON_EXAMPLES=ON -D BUILD_EXAMPLES=ON -D WITH_QT=ON -D WITH_OPENGL=ON -D ENABLE_FAST_MATH=1 -D CUDA_FAST_MATH=1 -D WITH_CUBLAS=1 ..
Note: without -D WITH_QT=ON, -D WITH_OPENGL=ON has no effect,
so if you enable OpenGL, set up the Qt environment first.
The trailing .. means the CMake project file is in the parent directory; if the system cannot find it, replace .. with the path containing OpenCV's CMakeLists.txt.
The value 3.2 depends on your GPU's compute capability; it might be 3.2, or 3.5, or higher.
... ...
-- Configuring done
-- Generating done
Once this step finishes with the configuration-OK marker, the checks have passed and you can compile.

4.2.3 Install OpenCV 2.4.10
Build and install the OpenCV library, copying it to "/usr/local/include" and "/usr/local/lib":
$ sudo make -j4 install
About 20 minutes.
If no errors appear at the end,
OpenCV is installed successfully.
$ sudo ldconfig

4.2.4 Uninstall 2.4.10
If you want to install another version of OpenCV, such as 2.4.13:
first remove the files the old install left behind:
$ sudo rm -r /usr/local/include/opencv2 /usr/local/include/opencv /usr/include/opencv2 /usr/include/opencv /usr/local/share/opencv /usr/local/share/OpenCV /usr/share/opencv /usr/share/OpenCV /usr/local/bin/opencv* /usr/local/lib/libopencv*
then proceed as before:
$ cd opencv-2.4.13
$ mkdir build
$ cd build
$ cmake -D WITH_CUDA=ON -D CUDA_ARCH_BIN="3.2" -D CUDA_ARCH_PTX="" -D BUILD_TESTS=OFF -D WITH_OPENGL=ON -D WITH_QT=ON -D BUILD_PERF_TESTS=OFF ..
$ sudo make -j4 install
$ sudo ldconfig

4.2.7 Configure environment variables
Finally, make sure your system searches the "/usr/local/lib" folder for libraries:
$ echo "" >> ~/.bashrc
$ echo "# Use OpenCV and other custom-built libraries." >> ~/.bashrc
$ echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib/" >> ~/.bashrc
$ source ~/.bashrc

OpenCV is now configured, and the GPU-accelerated OpenCV functions can be used on the TK1.

4.2.8 Run a few OpenCV samples

First:
# Test a simple OpenCV program. Creates a graphical window, hence you should plug an HDMI monitor in or use a remote viewer such as X tunneling, VNC, or TeamViewer on your desktop.
$ cd ~/opencv-2.4.10/samples/cpp
$ g++ edge.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -o edge
$ ./edge

Second:
# If you have a USB webcam plugged into your board, test one of the live camera programs.
$ g++ laplace.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -o laplace
$ ./laplace

Third:
# Test a GPU accelerated OpenCV sample.
$ cd ../gpu
$ g++ houghlines.cpp -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_calib3d -lopencv_contrib -lopencv_features2d -lopencv_flann -lopencv_gpu -lopencv_legacy -lopencv_ml -lopencv_objdetect -lopencv_photo -lopencv_stitching -lopencv_superres -lopencv_video -lopencv_videostab -o houghlines
$ ./houghlines ../cpp/logo_in_clutter.png
REF:http://blog.csdn.net/zyazky/article/details/52388605

Q1
If you hit this error:
In file included from /home/ubuntu/opencv-2.4.10/modules/core/src/opengl_interop.cpp:52:0:
/usr/local/cuda/include/cuda_gl_interop.h:64:2: error: #error Please include the appropriate gl headers before including cuda_gl_interop.h
#error Please include the appropriate gl headers before including cuda_gl_interop.h
^
make[2]: *** [modules/core/CMakeFiles/opencv_core.dir/src/opengl_interop.cpp.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [modules/core/CMakeFiles/opencv_core.dir/all] Error 2
then:
$ sudo vi /usr/local/cuda/include/cuda_gl_interop.h
comment out the following lines:
#if defined(__arm__) || defined(__aarch64__)
//#ifndef GL_VERSION
//#error Please include the appropriate gl headers before including cuda_gl_interop.h
//#endif
//#else
REFs:
https://github.com/opencv/opencv/issues/5205
https://devtalk.nvidia.com/default/topic/1007290/building-opencv-with-opengl-support-/
Q2
IF error:
/home/ubuntu/Downloads/opencv-2.4.10/build/modules/ghgui/src/window_QT.cpp:3111:12: error: ‘GL_PERSPECTIVE_CORRECTION_HINT’ was not declared in this scope
then:
$ vi /home/ubuntu/Downloads/opencv-2.4.10/modules/highgui/src/window_QT.cpp
add :
#define GL_PERSPECTIVE_CORRECTION_HINT 0x0C50

Appendix: notes on the OpenCV GPU module (from the OpenCV documentation)
The OpenCV GPU module is a set of classes and functions to utilize GPU computational capabilities. It is implemented using NVIDIA* CUDA* Runtime API and supports only NVIDIA GPUs. The OpenCV GPU module includes utility functions, low-level vision primitives, and high-level algorithms. The utility functions and low-level primitives provide a powerful infrastructure for developing fast vision algorithms taking advantage of GPU, whereas the high-level functionality includes some state-of-the-art algorithms (such as stereo correspondence, face and people detectors, and others) ready to be used by the application developers.
The GPU module is designed as a host-level API. This means that if you have pre-compiled OpenCV GPU binaries, you are not required to have the CUDA Toolkit installed or write any extra code to make use of the GPU.
The OpenCV GPU module is designed for ease of use and does not require any knowledge of CUDA. Though, such a knowledge will certainly be useful to handle non-trivial cases or achieve the highest performance. It is helpful to understand the cost of various operations, what the GPU does, what the preferred data formats are, and so on.
The GPU module is an effective instrument for quick implementation of GPU-accelerated computer vision algorithms. However, if your algorithm involves many simple operations, then, for the best possible performance, you may still need to write your own kernels to avoid extra write and read operations on the intermediate results.

To enable CUDA support, configure OpenCV using CMake with WITH_CUDA=ON . When the flag is set and if CUDA is installed, the full-featured OpenCV GPU module is built.
Otherwise, the module is still built but at runtime all functions from the module throw Exception with CV_GpuNotSupported error code, except for gpu::getCudaEnabledDeviceCount(). The latter function returns zero GPU count in this case.
Building OpenCV without CUDA support does not perform device code compilation, so it does not require the CUDA Toolkit installed. Therefore, using the gpu::getCudaEnabledDeviceCount() function, you can implement a high-level algorithm that will detect GPU presence at runtime and choose an appropriate implementation (CPU or GPU) accordingly.

Compilation for Different NVIDIA* Platforms:
NVIDIA* compiler enables generating binary code (cubin and fatbin) and intermediate code (PTX). Binary code often implies a specific GPU architecture and generation, so the compatibility with other GPUs is not guaranteed. PTX is targeted for a virtual platform that is defined entirely by the set of capabilities or features. Depending on the selected virtual platform, some of the instructions are emulated or disabled, even if the real hardware supports all the features.
At the first call, the PTX code is compiled to binary code for the particular GPU using a JIT compiler. When the target GPU has a compute capability (CC) lower than the PTX code, JIT fails. By default, the OpenCV GPU module includes:
Binaries for compute capabilities 1.3 and 2.0 (controlled by CUDA_ARCH_BIN in CMake)
PTX code for compute capabilities 1.1 and 1.3 (controlled by CUDA_ARCH_PTX in CMake)
This means that for devices with CC 1.3 and 2.0 binary images are ready to run. For all newer platforms, the PTX code for 1.3 is JIT’ed to a binary image. For devices with CC 1.1 and 1.2, the PTX for 1.1 is JIT’ed. For devices with CC 1.0, no code is available and the functions throw Exception. For platforms where JIT compilation is performed first, the run is slow.
On a GPU with CC 1.0, you can still compile the GPU module and most of the functions will run flawlessly. To achieve this, add “1.0” to the list of binaries, for example, CUDA_ARCH_BIN=”1.0 1.3 2.0″ . The functions that cannot be run on CC 1.0 GPUs throw an exception.
You can always determine at runtime whether the OpenCV GPU-built binaries (or PTX code) are compatible with your GPU. The function gpu::DeviceInfo::isCompatible() returns the compatibility status (true/false).

Utilizing Multiple GPUs:
In the current version, each of the OpenCV GPU algorithms can use only a single GPU. So, to utilize multiple GPUs, you have to manually distribute the work between GPUs. Switching the active device can be done using the gpu::setDevice() function. For more details please read the CUDA C Programming Guide.
While developing algorithms for multiple GPUs, note a data passing overhead. For primitive functions and small images, it can be significant, which may eliminate all the advantages of having multiple GPUs. But for high-level algorithms, consider using multi-GPU acceleration. For example, the Stereo Block Matching algorithm has been successfully parallelized using the following algorithm:
Split each image of the stereo pair into two horizontal overlapping stripes.
Process each pair of stripes (from the left and right images) on a separate Fermi* GPU.
Merge the results into a single disparity map.
With this algorithm, a dual-GPU setup gave a 180% performance increase compared to the single Fermi GPU. For a source code example, see https://github.com/opencv/opencv/tree/master/samples/gpu/.