Visualizing ARKit Sessions

Vlas Voloshin

Senior Developer

  • iOS
  • ARKit

This post continues exploring the topic of ARKit developer tools by looking into ways to visualize the data provided by the AR session on a separate device in real time, and shares the results of our prototyping in this area.

Humans are visual creatures. Presenting information in an interactive graphical form helps viewers understand it more efficiently. Programming is no exception: we may still be writing most of our code in a textual form, but the results of executing this code are often better understood visually, especially if that’s also the primary way users interact with the program.

It’s hard to believe, but until 2013 there were no widespread tools for visual debugging of UIKit view hierarchies. This changed with the release of Reveal, Xcode’s View Debugger, and other similar tools. Quite possibly, ARKit developer tooling is currently going through a similar infancy, and we’ll see the space expand as the demand for AR apps grows. But as the creators of Reveal, we wanted to look into what’s possible in this area right now, and experiment with presenting available information in a way that could be useful to developers.

Our results

A picture is worth a thousand words, so first of all, here’s a recording of our prototype in action:

This video demonstrates an iPhone app running the ARKit session and streaming its data live to an iPad app, where this data is visualized using SceneKit. A macOS client with the same capabilities as the iPad app has also been developed. All these components are available on GitHub; the repository’s README file further describes how to try them.

Theoretically, this approach could allow running the AR session alongside its visualization on the same iPad in split mode, but at the moment ARKit doesn’t allow this due to camera access limitations. Using the ARKit replay mechanism covered in our previous post, however, did bear some fruit:

Now, let’s cover the approach we used in more detail.

Interaction model

From an early point we decided to implement the session visualization capabilities in the form of a separate “client” application, communicating with the instrumented “server” across the network. While this model requires an additional device and introduces extra technical complexity, it has several advantages over an “in-process” alternative:

  • Allows more flexibility for choosing the point of view, i.e. camera position/orientation.
  • Adds no requirements for the rendering technology used by the server app: it could be using SceneKit, SpriteKit, a custom renderer, or even nothing at all.
  • Allows the client app to implement some advanced processing or rendering capabilities without introducing additional performance overhead on the AR session.
  • Allows the client app to be running on a Mac.

But of course the “out-of-process” model is also simply more familiar to us, since Reveal follows it as well.

Available session information

We started off looking into what information we could surface about the session, preferably using just the public ARKit API (note that the prototype doesn’t capture all of it):

  • Session state (e.g. running, paused or interrupted).
  • Session configuration parameters.
    • This could include detectable reference images and objects, but there’s no public API to retrieve the original or even processed graphical representation of reference images at the moment.
  • Per-frame information:
    • Timestamp
    • Captured image (in YCbCr color space)
    • Captured depth data (face tracking sessions only)
    • World mapping status
    • Camera parameters (tracking state, transforms, etc.)
    • Light estimates
    • Raw feature points
  • Standard and custom anchors:
    • Name and identifier
    • Transform
    • Tracking state (for ARTrackable anchors)
    • For plane anchors: alignment, center, extent and geometry
    • For face anchors: geometry, blend shapes, eye transforms and look-at point
    • For image and object anchors: reference image/object information
    • For environment probe anchors: extent and environment cube map texture
  • Data requested on demand:
    • Archived world map (see the sketch after this list)
    • Generated reference object archives
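
As an aside on that last group: requesting the world map is an asynchronous operation on ARSession, and ARWorldMap adopts NSSecureCoding, so it can be archived and transferred directly. Here's a minimal sketch; the send closure is a hypothetical hook into the transport layer rather than part of the prototype's API.

```swift
import ARKit

/// A minimal sketch of fetching and archiving the session's world map on
/// demand. The `send` closure is a hypothetical transport hook.
func sendWorldMap(from session: ARSession, send: @escaping (Data) -> Void) {
    session.getCurrentWorldMap { worldMap, error in
        guard let worldMap = worldMap else {
            print("World map unavailable: \(error?.localizedDescription ?? "unknown error")")
            return
        }
        // ARWorldMap adopts NSSecureCoding, so it can be archived as-is.
        if let data = try? NSKeyedArchiver.archivedData(withRootObject: worldMap,
                                                        requiringSecureCoding: true) {
            send(data)
        }
    }
}
```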

Many important elements of this dataset update in real time, up to 60 times per second. And while most of them can be encoded efficiently, the captured image and depth data would have to be compressed significantly if we were to stream them live.

Additional information can also be derived from the session frames: for example, the height of the estimated horizontal plane can be determined using ARFrame's hit-testing function.
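
Here's a minimal sketch of that idea, hit-testing from the centre of the captured image; the function name and the choice of sample point are ours.

```swift
import ARKit
import UIKit

/// A minimal sketch: estimates the world-space height (Y) of the horizontal
/// plane under the centre of the captured image using ARFrame hit-testing.
/// Returns nil if no estimated plane was hit.
func estimatedHorizontalPlaneHeight(in frame: ARFrame) -> Float? {
    // Hit-test from the centre of the image, in normalized image coordinates.
    let results = frame.hitTest(CGPoint(x: 0.5, y: 0.5),
                                types: .estimatedHorizontalPlane)
    // The last column of the world transform holds the hit position.
    return results.first.map { $0.worldTransform.columns.3.y }
}
```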

Connectivity and transport

For this prototype we wanted to pick a combination of network transport and data protocols that would be quick to set up, while providing enough throughput and low latency on a wireless connection to support visualizing session data in real time.

The Multipeer Connectivity framework satisfied the transport needs: it's readily available on both iOS and macOS, and in our testing provided at least 20 Kb of data transfer budget per frame. As a bonus, it allows streaming session data to multiple clients simultaneously, and even works without connecting to a common Wi-Fi network (by virtue of using the AWDL protocol, which also powers features like AirDrop).
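
As an illustration, here's a minimal sketch of the server-side transport setup. The class and the "arkit-viz" service type are hypothetical; the prototype's actual setup may differ.

```swift
import MultipeerConnectivity
import UIKit

/// A minimal sketch of the server-side transport. The class name and the
/// "arkit-viz" service type are hypothetical.
final class SessionStreamer: NSObject {
    private let peerID = MCPeerID(displayName: UIDevice.current.name)
    private lazy var session = MCSession(peer: peerID,
                                         securityIdentity: nil,
                                         encryptionPreference: .optional)
    private lazy var advertiser = MCNearbyServiceAdvertiser(peer: peerID,
                                                            discoveryInfo: nil,
                                                            serviceType: "arkit-viz")

    func start() {
        advertiser.delegate = self
        advertiser.startAdvertisingPeer()
    }

    /// Sends an encoded frame packet to all connected clients.
    func send(packet: Data) {
        guard !session.connectedPeers.isEmpty else { return }
        // Unreliable delivery is acceptable for per-frame updates.
        try? session.send(packet, toPeers: session.connectedPeers, with: .unreliable)
    }
}

extension SessionStreamer: MCNearbyServiceAdvertiserDelegate {
    func advertiser(_ advertiser: MCNearbyServiceAdvertiser,
                    didReceiveInvitationFromPeer peerID: MCPeerID,
                    withContext context: Data?,
                    invitationHandler: @escaping (Bool, MCSession?) -> Void) {
        // Accept any invitation in this sketch; a real tool might confirm it first.
        invitationHandler(true, session)
    }
}
```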

For data encoding we decided to go with binary property lists: again, due to their platform support, and also because they provide a good compromise between a text-based format like JSON and a purely binary protocol, both in terms of flexibility and size efficiency.

However, relying on the NSSecureCoding implementations provided by ARKit's data classes (such as ARAnchor) was not sufficient: partly because that gives no control over which data gets encoded, and partly because ARKit is not available on macOS, so unarchiving those encoded representations there would require assuming internal implementation details. Instead, we implemented our own data classes which adopt NSSecureCoding.
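
To give a flavour of this, here's a minimal sketch of such a data class. The CameraInfo name, its properties and its coding keys are hypothetical; the prototype's actual classes and archiving details may differ.

```swift
import Foundation
import simd

/// A minimal sketch of a custom data class adopting NSSecureCoding; the
/// CameraInfo name, its properties and its keys are hypothetical.
final class CameraInfo: NSObject, NSSecureCoding {
    static var supportsSecureCoding: Bool { return true }

    let transform: simd_float4x4
    let trackingState: String

    init(transform: simd_float4x4, trackingState: String) {
        self.transform = transform
        self.trackingState = trackingState
        super.init()
    }

    func encode(with coder: NSCoder) {
        // Encode the matrix as a raw blob of 16 floats: compact and plist-friendly.
        var matrix = transform
        let matrixData = Data(bytes: &matrix, count: MemoryLayout<simd_float4x4>.size)
        coder.encode(matrixData as NSData, forKey: "transform")
        coder.encode(trackingState as NSString, forKey: "trackingState")
    }

    required init?(coder: NSCoder) {
        guard let matrixData = coder.decodeObject(of: NSData.self, forKey: "transform") as Data?,
              matrixData.count == MemoryLayout<simd_float4x4>.size,
              let state = coder.decodeObject(of: NSString.self, forKey: "trackingState")
        else { return nil }
        var matrix = simd_float4x4()
        _ = withUnsafeMutableBytes(of: &matrix) { matrixData.copyBytes(to: $0) }
        transform = matrix
        trackingState = state as String
        super.init()
    }
}
```

A binary property list packet can then be produced with NSKeyedArchiver by setting its output format to .binary before encoding the root object.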

Camera tracking

Top-down camera mode

A subset of basic per-frame session information, including camera parameters, is sent to the clients whenever it's received from the ARSession via a delegate callback (a minimal sketch of this follows the list below). While some of this data is displayed in the client app as text, the AR camera transform is visualized as a “camera pyramid” object in the scene. More importantly, the client app also provides different viewing modes, which essentially fall into two categories:

  • In “following” modes (first-person, third-person and top-down), the view of the scene is automatically synchronized with the AR camera’s position and/or orientation. This creates a “second screen” experience where the scene visualization can be observed without having to interact with it. Such modes can be especially useful if you’re walking around with two devices in your hands.
  • In “free” modes (turntable and fly), the view of the scene is controlled independently from the AR camera. These modes are useful for exploring the scene, especially after the AR session has finished.
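
For reference, here's a minimal sketch of the delegate side of this. The SessionObserver class and the sendPacket hook are hypothetical, and a plain plist dictionary stands in for the custom data classes described earlier.

```swift
import ARKit

/// A minimal sketch of receiving per-frame camera parameters via the session
/// delegate. The class and the sendPacket hook are hypothetical.
final class SessionObserver: NSObject, ARSessionDelegate {
    var sendPacket: (Data) -> Void = { _ in }

    func session(_ session: ARSession, didUpdate frame: ARFrame) {
        // The camera transform and tracking state are available on every frame.
        var transform = frame.camera.transform
        let trackingState = String(describing: frame.camera.trackingState)

        let payload: [String: Any] = [
            "timestamp": frame.timestamp,
            "transform": Data(bytes: &transform,
                              count: MemoryLayout.size(ofValue: transform)),
            "trackingState": trackingState
        ]
        if let packet = try? PropertyListSerialization.data(fromPropertyList: payload,
                                                            format: .binary,
                                                            options: 0) {
            sendPacket(packet)
        }
    }
}
```

Assigning an instance like this as the ARSession's delegate (and keeping a strong reference to it, since delegates are typically held weakly) is enough to start receiving these callbacks.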

Feature points

Raw feature points provided by the ARKit session via ARFrame objects are intermediate results of the scene analysis it performs. Since they are extracted from the real-world surfaces and contours that ARKit detects, we can display them in the client app as a way to roughly represent these surfaces and provide visual context for other elements in the scene.

From our testing, even very detailed scenes produce feature point clouds with only about 500 elements. Simply encoding such point clouds as contiguous binary data buffers produced frame packets well below our data transfer budget, so we decided to forgo special optimizations like detecting which points were added, removed or updated between frames and sending only the difference. However, it turned out that the feature point cloud doesn't actually update on every frame, so a simple equality check allowed us to skip duplicates.
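
The following sketch illustrates both ideas, packing a frame's raw feature points into a contiguous buffer and skipping unchanged clouds. The class name and the byte layout are ours, not the prototype's exact format.

```swift
import ARKit

/// A minimal sketch of packing a frame's raw feature points into a contiguous
/// binary buffer, skipping clouds identical to the last one sent.
final class FeaturePointEncoder {
    private var lastIdentifiers: [UInt64] = []
    private var lastPoints: [simd_float3] = []

    /// Returns an encoded buffer, or nil if the cloud hasn't changed.
    func encodeIfChanged(_ cloud: ARPointCloud) -> Data? {
        let identifiers = cloud.identifiers
        let points = cloud.points

        // ARKit doesn't update the point cloud on every frame, so a simple
        // equality check lets us skip duplicates entirely.
        guard identifiers != lastIdentifiers || points != lastPoints else { return nil }
        lastIdentifiers = identifiers
        lastPoints = points

        // Lay out identifiers followed by positions as raw bytes.
        var data = Data()
        identifiers.withUnsafeBufferPointer { data.append(Data(buffer: $0)) }
        points.withUnsafeBufferPointer { data.append(Data(buffer: $0)) }
        return data
    }
}
```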

Single-frame point cloud

Each frame only contains feature points relevant to that frame, so visualizing just one frame's worth of points doesn't provide much context. However, because each feature point has a unique identifier associated with it, the client can accumulate point clouds and produce a historical view of the scene. A bounding box covering all accumulated feature points is also used to visualize the “virtual bounds” of the scene, which expand over time.
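
A minimal sketch of that accumulation, keyed by the points' identifiers; the class and property names are ours.

```swift
import ARKit
import simd

/// A minimal sketch of accumulating feature points across frames, keyed by
/// their unique identifiers, while tracking an overall bounding box.
final class PointCloudAccumulator {
    private(set) var pointsByIdentifier: [UInt64: simd_float3] = [:]
    private(set) var boundingBoxMin = simd_float3(repeating: Float.greatestFiniteMagnitude)
    private(set) var boundingBoxMax = simd_float3(repeating: -Float.greatestFiniteMagnitude)

    func accumulate(_ cloud: ARPointCloud) {
        for (identifier, point) in zip(cloud.identifiers, cloud.points) {
            // Later observations of the same point overwrite earlier ones.
            pointsByIdentifier[identifier] = point
            boundingBoxMin = simd_min(boundingBoxMin, point)
            boundingBoxMax = simd_max(boundingBoxMax, point)
        }
    }
}
```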

It’s worth noting that the implemented accumulation mechanism is fairly simple and doesn’t deal with situations where ARKit internally “forgets” a feature point after detecting it, either because it later classifies it as an outlier, or due to the location drift after moving across a large scene. This could potentially be solved by correcting the accumulated point cloud using, for example, a regularly requested world map.

Accumulated point cloud

The initial implementation derived the color of each feature point simply from its relative height within the accumulated bounding box, mapping that height to hue values from blue to red and thus producing a distribution resembling a heat map. Later we applied a more sophisticated approach: each feature point is projected into the 2D coordinate space of the captured image, and the result is used to sample the image, roughly estimating the color of the real-world surface that the feature point corresponds to. To counteract noise in the captured image, it's downscaled prior to sampling colors from it. The downside of this approach is that it adds real-time overhead on the server, though we try to avoid noticeable performance degradation by executing image sampling on a background serial queue. In return, the resulting point cloud resembles the real-world surfaces somewhat more closely:

Accumulated point cloud with image-sampled colors
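
Here's a minimal sketch of the projection step behind this. The function is ours rather than the prototype's exact implementation, the sampleColor closure stands in for the downscale-and-sample stage (omitted here), and the captured image's native landscape-right orientation is assumed.

```swift
import ARKit
import UIKit
import CoreVideo

/// A minimal sketch of estimating a feature point's color by projecting it
/// into the captured image. `sampleColor` is a hypothetical sampling hook.
func estimatedColor(of point: simd_float3,
                    in frame: ARFrame,
                    sampleColor: (CGPoint, CVPixelBuffer) -> UIColor?) -> UIColor? {
    let imageSize = CGSize(width: CVPixelBufferGetWidth(frame.capturedImage),
                           height: CVPixelBufferGetHeight(frame.capturedImage))
    // Project the world-space point into the captured image's pixel space.
    let projected = frame.camera.projectPoint(point,
                                              orientation: .landscapeRight,
                                              viewportSize: imageSize)
    // Discard points that fall outside the image.
    guard projected.x >= 0, projected.x < imageSize.width,
          projected.y >= 0, projected.y < imageSize.height else { return nil }
    return sampleColor(projected, frame.capturedImage)
}
```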

Anchors

ARAnchor objects represent both ARKit's understanding of the scene and custom content added to the scene by the developer, so visualizing them is important for building a complete picture of the AR session. Among the standard anchors, plane anchors are probably the most well-known and widely used, so we decided to focus on those in the prototype. Luckily, doing so was fairly straightforward, since ARSession provides timely delegate callbacks whenever anchors are added, updated or removed. Visually, planes can be represented by their extents and/or geometry, though in the absence of other information we took the simple approach of filling them with random colors.
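
A minimal sketch of building such a representation from a plane anchor's extent with SceneKit; the node setup and material choices here are ours.

```swift
import ARKit
import SceneKit
import UIKit

/// A minimal sketch of representing a plane anchor's extent as a randomly
/// coloured, semi-transparent SceneKit node.
func makePlaneNode(for anchor: ARPlaneAnchor) -> SCNNode {
    // Model the plane's extent as a very thin box.
    let box = SCNBox(width: CGFloat(anchor.extent.x),
                     height: 0.001,
                     length: CGFloat(anchor.extent.z),
                     chamferRadius: 0)
    box.firstMaterial?.diffuse.contents = UIColor(hue: CGFloat.random(in: 0...1),
                                                  saturation: 0.8,
                                                  brightness: 0.9,
                                                  alpha: 0.6)
    let node = SCNNode(geometry: box)
    // The extent is centred on the anchor's `center`, in the anchor's space.
    node.simdPosition = anchor.center
    return node
}
```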

Plane anchors in a large room scene combined with feature points

Visualizing custom anchors is somewhat more complicated, because the base ARAnchor class doesn’t contain any information about the shape or size of the object represented by the anchor. Furthermore, the exact way each anchored object is rendered in the AR scene (if at all) is specific to each app, so the visualizer would either need to impose requirements on supported object formats (e.g. only encodable SceneKit nodes), or provide an interface for associating extra information with the anchors (e.g. a special protocol that provides the bounding volume). The current version of our prototype doesn’t implement either of these approaches, instead representing any non-plane anchor with a placeholder sphere.
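
To illustrate the second option, here's what such a hypothetical protocol might look like, together with the placeholder-sphere fallback; the protocol and its property are not part of the prototype.

```swift
import ARKit
import SceneKit
import UIKit

/// A hypothetical protocol of the kind described above: custom anchors could
/// adopt it to tell the visualizer how big their content is.
protocol VisualizableAnchor {
    /// Radius of a sphere that bounds the anchored content, in metres.
    var boundingRadius: Float { get }
}

/// Falls back to a small placeholder sphere for anchors that provide no size.
func makeAnchorNode(for anchor: ARAnchor) -> SCNNode {
    let radius = (anchor as? VisualizableAnchor)?.boundingRadius ?? 0.05
    let sphere = SCNSphere(radius: CGFloat(radius))
    sphere.firstMaterial?.diffuse.contents = UIColor.white.withAlphaComponent(0.5)
    let node = SCNNode(geometry: sphere)
    node.simdTransform = anchor.transform
    return node
}
```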

Further work

There’s a variety of other possible ways an ARKit session can be instrumented and visualized, which we didn’t cover in our prototype:

  • Supporting more standard anchor types, like image and object anchors. Ideally, these would be represented by their reference counterparts.
  • Visualizing the history of camera movement in the scene as a track line.
  • Streaming a compressed camera feed to the client.
  • Requesting and downloading world map archives from the session.
  • Saving the session information received by the client to a file on disk for later viewing.
  • Scrubbing through the session history on the client to browse its state in a particular moment in time.
  • Modifying the instrumented AR session via commands from the client, e.g. placing or moving custom anchors. This particular feature would likely require additional support from the instrumented app, since such modifications or state changes may be unexpected to the code running the session and may cause undesirable behavior.
  • Improved integration capabilities, allowing minimal configuration in the instrumented app. Implementing truly transparent integration may require using some Objective-C runtime features like swizzling or method forwarding.

Data visualization is of course a large field that crosses design, engineering and science, so it will take some time to come up with a solution useful for the majority of AR app developers. On the other hand, ARKit imposes some constraints on which aspects of its scene understanding can be reliably retrieved and visualized. Still, we believe that even in its current form our prototype can be used to learn a bit more about ARKit and AR technology in general.

We also hope that releasing our work as open source will inspire others to extend it and implement other ideas in this space.

Acknowledgements

Some approaches described in this post and implemented in the prototype are inspired by the following sessions from WWDC: