Here’s how to build efficient multi-camera pipelines that take advantage of the hardware accelerators on the NVIDIA Jetson platform, in just a few lines of Python.
Multi-camera applications are becoming increasingly popular; they are essential for enabling autonomous robots, intelligent video analytics (IVA), and AR/VR applications. Regardless of the specific use case, there are common tasks that must always be implemented:
- Capture
- Preprocessing
- Encoding
- Display
In many cases, you also want to deploy a DNN on the camera streams and run custom logic on the detections. Figure 1 shows the general flow of the application.
In this post, I showcase how to realize these common tasks efficiently on the NVIDIA Jetson platform. Specifically, I present jetmulticam, an easy-to-use Python package for creating multi-camera pipelines. I demonstrate a specific use case on a robot with a surround camera system. Finally, I add custom logic (person following) based on the DNN object detections to obtain the results shown in the following video:
Multi-camera hardware
There are many parameters to consider when choosing a camera: resolution, frame rate, optics, global/rolling shutter, interface, pixel size, and so on. For more information about compatible cameras from NVIDIA partners, see the comprehensive list.
In this specific multi-camera setup, you use the following hardware:
- NVIDIA Jetson Xavier NX module
- GMSL2 capable carrier board from Leopard Imaging
- 3 × IMX185 GMSL2 cameras from Leopard Imaging
The IMX185 cameras each have a field of view of approximately 90°. Mount them orthogonally to one another for a total FOV of 270°, as shown in Figure 2.
The cameras use the GMSL interface, which offers a lot of flexibility in positioning the cameras even meters away from the Jetson module. In this case, you can raise the cameras by ~0.5m to achieve a greater vertical FOV.
Getting started with Jetmulticam
First, download and install the NVIDIA JetPack SDK on your Jetson board. Then, install the jetmulticam package:
$ git clone https://github.com/NVIDIA-AI-IOT/jetson-multicamera-pipelines.git
$ cd jetson-multicamera-pipelines
$ bash scripts/install_dependencies.sh
$ pip3 install Cython
$ pip3 install .
Basic multi-camera pipeline
After the setup, you can create a basic pipeline using the CameraPipeline class. Pass a list of cameras to include in the pipeline as the initializer argument. In the following example, the elements [0, 1, 2] correspond to the device nodes /dev/video0, /dev/video1, and /dev/video2.
from jetmulticam import CameraPipeline

p = CameraPipeline([0, 1, 2])
And that’s it: the pipeline has been initialized and started. You can now read images from each camera in the pipeline and access them as numpy arrays.
img0 = p.read(0) # img0 is a np.array
img1 = p.read(1)
img2 = p.read(2)
Often, it is convenient to read from the camera in a loop, as in the following code example. The pipeline runs asynchronously from the main thread, and read always returns the most recent buffer.
import time

while True:
    img0 = p.read(0)
    print(img0.shape)  # >> (1080, 1920, 3)
    time.sleep(1/10)
More complex AI pipeline
Now you can build a more complex AI pipeline. This time, use the CameraPipelineDNN class, together with two pretrained models from the NGC catalog: PeopleNet and DashCamNet.
import time
from jetmulticam import CameraPipelineDNN
from jetmulticam.models import PeopleNet, DashCamNet
if __name__ == "__main__":

    pipeline = CameraPipelineDNN(
        cameras=[2, 5, 8],
        models=[
            PeopleNet.DLA1,
            DashCamNet.DLA0,
            # PeopleNet.GPU
        ],
        save_video=True,
        save_video_folder="/home/nx/logs/videos",
        display=True,
    )

    while pipeline.running():
        arr = pipeline.images[0]  # np.array with shape (1080, 1920, 3)
        dets = pipeline.detections[0]  # Detections from the DNNs
        time.sleep(1/30)
Here’s the breakdown of the pipeline initialization:
- Cameras
- Models
- Hardware acceleration
- Saving video
- Displaying video
- Main loop
Cameras
First, as in the previous example, the cameras argument is a list of sensors. In this case, you use the cameras associated with the following device nodes:
- /dev/video2
- /dev/video5
- /dev/video8
cameras=[2, 5, 8]
Models
The second argument, models, enables you to define pretrained models to run in the pipeline.
models=[
    PeopleNet.DLA1,
    DashCamNet.DLA0,
    # PeopleNet.GPU
],
Here, you deploy two pretrained models from NGC:
- PeopleNet: an object detection model capable of identifying people, faces, and bags.
- DashCamNet: an object detection model that identifies objects in four classes: car, person, road sign, and bicycle.
For more information, see the model cards in NGC.
Hardware acceleration
The models run in real time on the NVIDIA Deep Learning Accelerator (DLA) cores. Specifically, matching the code above, you deploy PeopleNet on DLA core 1 and DashCamNet on DLA core 0.
Distributing the models between the two accelerators helps increase the total throughput of the pipeline. Furthermore, the DLA cores are even more power-efficient than the GPU. As a result, the system consumes as little as ~10 W under full load with the highest clock settings. Finally, in this configuration, the GPU of the Jetson Xavier NX, with its 384 CUDA cores, remains free to accelerate other tasks.
The following code example shows the list of currently supported model/accelerator combinations.
pipeline = CameraPipelineDNN(
    # ...
    models=[
        models.PeopleNet.DLA0,
        models.PeopleNet.DLA1,
        models.PeopleNet.GPU,
        models.DashCamNet.DLA0,
        models.DashCamNet.DLA1,
        models.DashCamNet.GPU,
    ],
    # ...
)
Saving video
The next two arguments specify whether to store the encoded video and define the folder used for storage.
save_video=True,
save_video_folder="/home/nx/logs/videos",
Displaying video
As a final initialization step, configure the pipeline to display the video output on the screen for debugging purposes.
display=True
Main loop
Lastly, define the main loop. During runtime, the images are available under pipeline.images and the detection results under pipeline.detections.
while pipeline.running():
    arr = pipeline.images[0]  # np.array with shape (1080, 1920, 3)
    dets = pipeline.detections[0]  # Detections from the DNNs
    time.sleep(1/30)
The following code example shows the resulting detections. For each detection, you get a dictionary containing the following:
- Object class
- Object position defined as [left, width, top, height] in pixel coordinates
- Detection confidence
>>> pipeline.detections[0]
[
    # ...
    {
        "class": "person",
        "position": [1092.72, 93.68, 248.01, 106.38],  # L-W-T-H
        "confidence": 0.91
    },
    # ...
]
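Because each detection is a plain Python dictionary, consuming the results takes only standard Python. The helper below is a hypothetical sketch (not part of jetmulticam) that keeps only confident person detections from one camera:

```python
def filter_people(detections, min_confidence=0.5):
    # Hypothetical helper for illustration; `detections` is the list of
    # dictionaries returned by pipeline.detections[i], as shown above.
    return [
        det
        for det in detections
        if det["class"] == "person" and det["confidence"] >= min_confidence
    ]

# Example usage inside the main loop:
# people = filter_people(pipeline.detections[0], min_confidence=0.8)
```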
Extending AI pipeline with custom logic
As the last step, you can extend the main loop to build custom logic using the DNN output. Specifically, you use the detection outputs from your cameras to implement basic person-following logic in the robot. The source code is available in the NVIDIA-AI-IOT/jetson-multicamera-pipelines GitHub repo.
- To find the human to follow, parse the pipeline.detections output. This logic is implemented in the find_closest_human function.
- Calculate the robot’s steering angle, based on where the bounding box is located, in dets2steer (see the sketch after this list).
  - If the human is in the left image, turn maximally left.
  - If the human is in the right image, turn maximally right.
  - If the human is in the center image, turn proportionally to the X coordinate of the center of the bounding box.
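The repo’s find_closest_human and dets2steer functions implement this logic; the snippet below is only a minimal sketch of the steering rule described in the list, with an assumed function name, constant, and normalized steering range:

```python
MAX_STEER = 1.0  # assumed normalized steering command in [-1.0, 1.0]

def steer_from_detections(left_dets, center_dets, right_dets, img_width=1920):
    # Sketch of the person-following rule above; not the repo's exact code.
    # Positive values steer right, negative values steer left.
    if left_dets:                                   # person in the left camera
        return -MAX_STEER                           # turn maximally left
    if right_dets:                                  # person in the right camera
        return MAX_STEER                            # turn maximally right
    if center_dets:                                 # person in the center camera
        left, width, _top, _height = center_dets[0]["position"]  # L-W-T-H
        x_center = left + width / 2.0
        # Steer proportionally to the offset from the image center
        return (x_center - img_width / 2.0) / (img_width / 2.0)
    return 0.0                                      # nobody detected: go straight
```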
The resulting video is saved to /home/nx/logs/videos, as you defined during initialization.
Solution overview
Here’s a brief look at how jetmulticam works. The package dynamically creates and launches a GStreamer pipeline with the number of cameras that your application requires. Figure 4 shows how the underlying GStreamer pipeline looks when configured as in the person-following example. As you can see, all critical operations in the system, represented by green boxes, benefit from hardware acceleration.
First, the video is captured with multiple CSI cameras, using the nvarguscamerasrc elements shown on the diagram. Each individual buffer is rescaled and converted to RGBA format with either nvvidconv or nvvideoconvert. Next, the frames are batched using the nvstreammux component provided by the DeepStream SDK. By default, the batch size is equal to the number of cameras in the system.
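For intuition, here is a minimal GStreamer Python sketch of one such capture branch feeding the batcher. This is not jetmulticam’s actual create_pipeline code: the element names are the NVIDIA/DeepStream plugins mentioned above, while the property values and pad names are assumptions about a typical configuration.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)
pipeline = Gst.Pipeline.new("capture-demo")

# One CSI camera source; sensor-id selects the physical sensor (assumed value).
src = Gst.ElementFactory.make("nvarguscamerasrc", "cam0")
src.set_property("sensor-id", 0)

# Hardware-accelerated rescaling/format conversion of the camera buffers.
conv = Gst.ElementFactory.make("nvvideoconvert", "conv0")

# DeepStream batcher; batch-size matches the number of cameras.
mux = Gst.ElementFactory.make("nvstreammux", "mux")
mux.set_property("batch-size", 3)
mux.set_property("width", 1920)
mux.set_property("height", 1080)

for element in (src, conv, mux):
    pipeline.add(element)

src.link(conv)
# nvstreammux exposes request pads named sink_0, sink_1, ...: one per camera.
conv.get_static_pad("src").link(mux.get_request_pad("sink_0"))
```

In the full pipeline, the batched output then continues into the inference, display, and encoding branches described next.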
To deploy the DNN models, take advantage of the nvinfer elements. In the demo, I deployed two models, PeopleNet and DashCamNet, on the two DLA cores (DLA 0 and DLA 1) available on the Jetson Xavier NX. However, it is possible to stack even more models on top of each other, if needed.
After the resulting bounding boxes are overlaid by the nvosd element, the frames are shown on an HDMI display using nvoverlaysink, while the video stream is also encoded with the hardware-accelerated H.264 encoder and saved to an .mkv file.
The images available in the Python code (for example, pipeline.images[0]) are parsed into numpy arrays by a callback function, or probe, registered on each of the video converter elements. Similarly, another callback function registered on the sink pad of the last nvinfer element parses the metadata into a user-friendly list of detections. For more information about the source code or individual component configuration, see the create_pipeline function.
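As a rough idea of what such a probe can look like, here is a self-contained sketch using the GStreamer Python bindings. It assumes the probed pad carries unbatched RGBA frames in regular system memory with no row padding; the real jetmulticam callbacks differ in detail.

```python
import numpy as np
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

def on_buffer(pad, info, images):
    # Pad probe that copies the current RGBA frame into a numpy array.
    buf = info.get_buffer()
    structure = pad.get_current_caps().get_structure(0)
    width = structure.get_value("width")
    height = structure.get_value("height")

    ok, mapinfo = buf.map(Gst.MapFlags.READ)
    if not ok:
        return Gst.PadProbeReturn.OK
    try:
        # Assumes tightly packed RGBA rows (stride == width * 4).
        frame = np.frombuffer(mapinfo.data, dtype=np.uint8).reshape(height, width, 4)
        images["cam0"] = frame[:, :, :3].copy()  # keep RGB, drop alpha
    finally:
        buf.unmap(mapinfo)
    return Gst.PadProbeReturn.OK

# Registering the probe on a converter's source pad (element name is an assumption):
# conv = gst_pipeline.get_by_name("conv0")
# conv.get_static_pad("src").add_probe(Gst.PadProbeType.BUFFER, on_buffer, images)
```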
Conclusion
The hardware acceleration available on the NVIDIA Jetson platform combined with NVIDIA SDKs enables you to achieve outstanding real-time performance. For example, the person-following example runs two object-detection neural networks in real time on three camera streams, all while maintaining CPU utilization under 20%.
The jetmulticam package, showcased in this post, enables you to build your own hardware-accelerated pipelines in Python and to include custom logic on top of the detections.