1 Introduction

In 1916, Piggly Wiggly opened the first self-service supermarket in Memphis, TN. Shoppers no longer needed to ask store workers behind the counters to retrieve every product for purchase. Friction in the shopping process, caused by factors such as delays and the psychological barrier of being unable to examine and compare products, was greatly reduced. Merchants could offer greater product selection and manage larger store footprints with fewer workers. Over the next 100 years, various technologies, such as shopping carts, Universal Product Code (UPC) bar codes, credit cards, self-checkout counters, and mobile payment, were invented and adopted to further reduce shopping friction and improve store efficiency. But the final bottleneck of checking out at the end of shopping trips remained.

In 2018, Amazon opened its Amazon Go store in Seattle, WA. In Amazon Go, all shopper activities are monitored by the store. Whenever a product is picked up from a shelf, the store knows exactly what it is and who picked it up, and a change of ownership is logged right away. At the end of the trip, no store worker tallies the amount or checks the basket. Shoppers simply walk out of the store with a mobile payment confirmation on their phones. “No lines, no checkout...”

We call this shopping experience, where sales no longer require the involvement of sales staff, Autonomous Retailing. Autonomous retailing is not a new concept. For example, Metro AG tested an RFID-enabled autonomous checkout store in 2003 [16]. IBM illustrated a similar experience in an RFID commercial in 2006. However, Amazon Go is the first known attempt to bring the concept to reality at scale. Since then, several retailers and technology providers, such as BingoBox, Alibaba Tao Cafe, and Standard Cognition, have demonstrated similar proofs of concept. As e-commerce continues to disrupt brick-and-mortar stores with convenience, choice, and savings, the latter must reinvent themselves through digital transformation. In other words, a brick-and-mortar store must focus on reducing friction and offering a superb experience to its shoppers through sensing, intelligence, and actuation. Future stores will become cyber-physical environments for human users.

Due to physical constraints and the dynamics of human behavior, autonomous stores must employ a large number of sensing and processing units. They face all the challenges that are intrinsic to TerraSwarm-like systems [2]. Furthermore, the level of correctness required by retail transactions, together with the possibility that humans will exploit any vulnerability, makes them as much a pinnacle of Cyber-Physical-Human system ambitions as autonomous vehicles.

In this article, we discuss autonomous retailing from shopper and technology perspectives, such as different levels of autonomy, core design space, and critical technology enablers. Although full autonomy is still hard to achieve at scale with its current cost structure, we believe subsets of those technologies can already help brick-and-mortar retailers reduce operation cost and improve shopper experiences.

2 Levels of Autonomy

Retailing is about a single relationship change, which we call Transfer Of Ownership (TrOO) – the ownership of a product changing from the store to a shopper and, occasionally, in the reverse order, if the shopper changes her mind. Currently, TrOO only happens at the checkout counter, assisted by store workers. This creates the main bottleneck in physical shopping, and is the top complaint in shopping experience surveys.

There are many ways to mitigate and ultimately remove this bottleneck, and to give shoppers increasing freedom in stores. Just as with self-driving cars, autonomous retailing is not an “all-or-nothing” concept. Depending on how much cognitive load, deliberate shopper involvement, and shopper-staff interaction are required, there is a spectrum of experiences and solutions. We classify them into six Levels of Autonomy, as shown in Table 1.

Table 1. Levels of autonomous retailing

L0 [Monitored Autonomy]: Self-checkout stations are common in stores today. They support the same scan-and-pay process that store workers used to perform, now conducted by shoppers themselves. Typically, a store worker still oversees 4 to 6 self-checkout stations to make sure all items are scanned, to check for age-restricted items, and to provide any necessary help. During busy times, shoppers still need to wait in line for an available station, especially since shoppers are much less efficient at these stations than store workers are at regular checkout machines.

L1 [Deliberate Autonomy]: Instead of lining up at a checkout station at the end of a trip, shoppers scan products in the aisles as purchase decisions are made. This is sometimes called scan and go. Typically, the scanner is either a dedicated store device or the shopper’s own mobile phone. Whether a shopper indeed scanned every purchased product cannot be known automatically. At the end of the trip, store workers will by default check baskets before shoppers leave the store to ensure correctness. This check can be a full audit or cover a random subset of the products. The store environment, by itself, cannot differentiate who scanned every product from who, intentionally or unintentionally, forgot some.

Although bar code scanning is the predominant form of scan-and-go implementation today, the notion of scanning can be generalized to any deliberate showing of a product to a device. Products can be identified through optical tags, such as bar codes, QR codes, or invisible watermarks (DW Codes) [8], through RF tags such as RFID or NFC, or directly through the shape and look of the packaging by computer vision.

L2 [Assisted Autonomy]: At this level, the store can recognize certain human activities automatically. Although shoppers still need to scan every product they wish to purchase, the store can detect unscanned items in shopping baskets and remind shoppers accordingly. At the end of a trip, if the store believes that all products have been scanned properly, the shopper can walk out without being checked. Otherwise, the shopper will be routed to a store worker. With the reminders assisted by the stores, the cognitive load of remembering to scan every product is reduced.

L3 [Partial Autonomy]: As the store gets smarter, it can recognize and track certain products and their ownership throughout the store. The shoppers’ deliberate showing of products to devices is reduced. As long as a shopper only visits certain parts of the store, or uses certain shopping devices (such as smart shopping carts), no worker auditing is necessary at the exit.

L4 [Conditional Autonomy]: At this level, the shopping and checkout processes are fully automated, as long as shoppers do not intentionally cheat the system. The correctness of checkout may still depend on whether shoppers are honest, and the automated checkout process may impose limitations on the range of available products. No stop-and-check at the exit by store workers is necessary in most cases. Shopping will feel like picking up items from one’s own pantry.

L5 [Full Autonomy]: This is the ultimate autonomous retail capability, where shoplifting is virtually impossible. All ownership changes are understood and reflected in the transactions.

Remarks:

  • This classification is abstract by design, irrespective of implementation details. For example, difficulties vary greatly depending on whether products are tagged with unique identifiers or whether users are assisted with additional devices.

  • Although it may be easiest to think in terms of a grocery store example with shelved products, shopping carts, and baskets, the same key elements are present in most types of open retail spaces.

  • Levels of autonomy tie closely to the type of products that the store sells. For example, computer vision is not good at differentiating different instances of clothes with high accuracy yet. Achieving L3 and above at apparel stores can be very different from how this may be achieved at convenience stores.

  • There are two major quantum leaps in this classification. One is to reduce and remove product scanning, between L2 and L3; and the other is to increase the tolerance of malicious behavior, between L4 and L5. Amazon Go appears to be at L4 and is approaching L5, considering its limited store size and product selection.

  • Scanning serves two purposes: one is to identify the product, and the other is to associate a product instance with the customer who intends to buy it. Once the scanning step is eliminated, the store must identify and track shoppers during their entire visit, and any shopping activity related to the handling of products must be understood.

3 Cyber-Physical Intelligence

Autonomous retailing at L2 and beyond clearly requires cyber-physical environments with sophisticated sensing and processing capabilities to function correctly. In this section, we discuss the key tasks that must be performed by autonomous stores, and the technologies that enable them. We use grocery shopping as an example scenario throughout, but the tasks and technologies apply to other types of retailing as well.

Retailing is concerned with three key pieces of information, as shown in Fig. 1: the identity of a shopper (and ultimately her payment account), the type or model of a product (thus its price), and the possession/ownership relationship between customer and product. Where and how these three pieces of information are established, and how accurate they are, reflect the intelligence level of the store and shoppers’ experience in it.

Recent advances in autonomous retailing are, to a great extent, empowered by advances in deep learning [5] and computer vision. Deep neural networks (DNNs) can classify and recognize faces, objects, and human activities with high accuracy and speed [3, 10]. However, computer vision has its limitations in the real world, due to occlusion, variability in lighting conditions, the sizes and forms of objects, and the complexity of sensing and processing. These challenges put high demands on deployment density, computing power, and network performance, and leave many corner cases unsolved.

Fig. 1. Key entities and relationship in the retail process.

3.1 Core Sensing and Inference Requirements

Shopper Identification. A shopper’s identity is typically established through linkage with an identifier in the digital domain, such as a credit card number, a loyalty ID, or a GUID in a mobile phone app. Thus identifying shoppers may require them to present, and sometimes to prove the legitimacy of, their digital representation.

Biometric identification methods, such as face, iris, and fingerprint recognition, are becoming mature technologies thanks to advances in deep learning. There are two ways of using biometric identification in autonomous retailing:

  • Verification. The verification problem [4] is to test if a biometric measurement is indeed from a given customer. The algorithm only gives yes-no answers with an associated confidence level. Biometric verification is a mature technology. It is widely used in device authentication such as Touch ID on iOS devices and Windows Hello on PCs.

  • Recognition. The recognition problem is to identify a person within a set of candidates, or return “unrecognized” if the person is not in the set. With a dataset of fewer than a few thousand people, recognition correctness, measured by whether the true identity appears among the top 5 returned results, is beyond 99%. The recognition problem is considerably harder and more time consuming than the verification problem, especially with large datasets.

Collecting and managing biometrics at the scale of potential customers creates a legal and operational burden for retailers. One simplification is to only identify in-store shoppers by assigning them unique but anonymous IDs when they enter the store, and only keep the ID persistent through the trip. This requires the recognition system to quickly enroll new faces and retrain the recognizer in real time.
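As a concrete illustration of this anonymous-ID simplification, the minimal sketch below enrolls each entering shopper under a fresh anonymous identifier and matches later observations against the in-store gallery by embedding similarity; thresholding the best match also serves as a verification test. It assumes an external face-embedding model (not described in this article), and the threshold, class, and function names are hypothetical.

```python
# Minimal sketch of anonymous, per-trip shopper identification.
# Assumes a hypothetical external face-embedding model that maps a
# face crop to a feature vector; all names and thresholds are illustrative.
import uuid
from typing import Optional
import numpy as np

class TripGallery:
    """Keeps anonymous shopper IDs only for the duration of a store visit."""

    def __init__(self, match_threshold: float = 0.6):
        self.match_threshold = match_threshold   # cosine-similarity cut-off
        self.ids: list[str] = []
        self.embeddings: list[np.ndarray] = []

    def enroll(self, embedding: np.ndarray) -> str:
        """Assign a fresh anonymous ID when a new shopper enters."""
        shopper_id = uuid.uuid4().hex
        self.ids.append(shopper_id)
        self.embeddings.append(embedding / np.linalg.norm(embedding))
        return shopper_id

    def recognize(self, embedding: np.ndarray) -> Optional[str]:
        """Return the best-matching in-store ID, or None if unrecognized."""
        if not self.embeddings:
            return None
        query = embedding / np.linalg.norm(embedding)
        sims = np.array([g @ query for g in self.embeddings])
        best = int(np.argmax(sims))
        # Thresholding the best match doubles as a verification test.
        return self.ids[best] if sims[best] >= self.match_threshold else None

    def forget_all(self):
        """Drop all biometric data, e.g., at the end of each day."""
        self.ids.clear()
        self.embeddings.clear()
```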

Product Recognition. At L3+, an intelligent store must recognize some or all of the products in it to facilitate transactions. Although UPC bar codes uniquely identify a product type, they cannot be reliably read without deliberate scanning.

Recently, computer vision technologies have made huge advances in image recognition [7, 9]. However, obtaining transaction-level accuracy on the handling of arbitrary products poses many challenges, such as (1) near-identical packaging with only textual differences; (2) very small objects that can easily be occluded by hands or other objects; (3) very large objects that can only be partially captured by a camera; (4) unpackaged goods such as produce, fruits, and meat, where certain types look almost identical; and (5) bulk items.

In addition, as stores introduce new or seasonal products, product recognition models must be updated, and sometimes completely re-trained.

As with shopper recognition, identifying products from a large set of candidates is less accurate and slower than from a small set. Product layout, shopper location, and even past purchases can be used to reduce the search space and improve accuracy.
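As one way this search-space reduction might look, the sketch below re-weights a product classifier’s scores so that SKUs stocked near the shopper’s current location receive most of the probability mass. The classifier interface, planogram lookup, and boost parameter are illustrative assumptions, not a prescribed design.

```python
# Sketch of context-based search-space reduction for product recognition.
# Assumes a classifier that returns one raw score per SKU and a hypothetical
# planogram lookup of which SKUs are stocked near the shopper's location.
import numpy as np

def rerank_by_context(scores: np.ndarray,
                      sku_ids: list[str],
                      nearby_skus: set[str],
                      boost: float = 0.9) -> list[tuple[str, float]]:
    """Renormalize classifier scores so SKUs on nearby shelves receive
    `boost` of the probability mass and all other SKUs share the rest."""
    probs = np.exp(scores - scores.max())     # softmax over raw scores
    probs /= probs.sum()

    in_context = np.array([sku in nearby_skus for sku in sku_ids])
    mass_in = probs[in_context].sum()
    mass_out = probs[~in_context].sum()

    if mass_in == 0 or mass_out == 0:
        adjusted = probs                      # nothing to rebalance
    else:
        adjusted = probs.copy()
        adjusted[in_context] *= boost / mass_in
        adjusted[~in_context] *= (1 - boost) / mass_out

    order = np.argsort(-adjusted)             # most likely SKU first
    return [(sku_ids[i], float(adjusted[i])) for i in order]
```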

Tagging is a way to compensate for the inaccuracy of product recognition from computer vision alone. The retail industry has explored various tagging technologies, from RF to invisible patterns:

  • UHF RFID. Ultra High Frequency RFID (UHF RFID) operates in the 860 MHz to 960 MHz band and uses RF waves to communicate between the reader and tags. It has been used extensively in retail environments and warehouses. While the RF waves have a relatively large range (several meters), UHF RFID systems have fundamental weaknesses in product tracking. First, RF signal propagation is heavily affected by environmental factors such as the presence of human bodies, water, and metal. Second, it is hard to confine RF signals to a well-defined space. The most successful use cases of UHF RFID are apparel retail and pallet-level inventory control, where products are made of RF-friendly materials yet are ill-suited to other recognition methods.

  • HF RFID (NFC). High Frequency RFID (HF RFID) operates at 13.56 MHz and uses inductive coupling (magnetic fields) to communicate between readers and tags. Inductive-coupling-based communication has several important features when used for product tracking. The HF RFID detection range is relatively short (a few inches) and well defined. This short range enables accurate tracking of product locations. In addition, magnetic fields can easily penetrate materials such as the human body, liquid, and even some types of metal. This makes product tracking immune to environmental changes, packaging, and the product itself.

  • DW Codes. Digital Watermark (DW) Codes [8] are image-based encodings that are invisible to the human eye, yet can be detected and interpreted by post-processing images captured with a camera. They are commercialized by Digimarc Inc. and have recently become a GS1 standard for product identification, just like UPC bar codes. Under good lighting conditions and with sufficient image resolution, they can be decoded like bar codes. Since they are invisible to the human eye, they can be replicated throughout product packaging for easy identification. The standardization of DW Codes happened only recently, and their adoption by product manufacturers has been relatively slow. Once they are proven to substantially reduce the cost of autonomous retailing, retailers may have more incentive to adopt them.

It is worth pointing out that recognizing products, tagged or not, is much easier when the products are displayed on shelves or racks than when they are handled by a person. The human body, hands in particular, is likely to obstruct light and RF propagation.

Product Ownership. Beyond identifying shoppers and products, the store must also establish relationships between them. With the elimination of the checkout process at the end of a shopping trip, the semantics of TrOO may differ between an autonomous store and a regular store. In most stores today, a shopper does not own a product until she checks out at the exit, since there is simply no visibility into how the product is handled before then.

In L3+ autonomous retailing, when product scanning is removed from the shopping process, TrOO can no longer be viewed as an atomic action. It is better understood as a transaction process. When a product is picked up from the shelf by a shopper, a transaction concerning that particular product is initiated. It may take multiple cameras and sensors (spread across multiple locations) a period of time to confirm which product it is. During that process, if the shopper changes her mind and puts the product back, the associated transaction is canceled. When, at the end of the trip, the shopper decides to pay for all products, all transactions in the trip are committed.
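To make the transactional view concrete, the following sketch models a shopping trip as a set of long-running TrOO transactions that are initiated on pickup, canceled on put-back, and committed at exit. The class, state, and method names are illustrative assumptions, not a prescribed implementation.

```python
# Sketch of TrOO as a long-running transaction rather than an atomic event.
# Product and shopper identifiers are placeholders; confirmation of the
# product type may arrive later from additional sensor evidence.
from dataclasses import dataclass, field
from enum import Enum, auto

class TrooState(Enum):
    PENDING = auto()     # product picked up, identity/count still uncertain
    CONFIRMED = auto()   # enough evidence gathered to bill this item
    CANCELED = auto()    # product was put back on the shelf
    COMMITTED = auto()   # shopper exited and payment was captured

@dataclass
class TrooTransaction:
    shopper_id: str
    product_guess: str                  # may be revised as evidence arrives
    state: TrooState = TrooState.PENDING

@dataclass
class ShoppingTrip:
    shopper_id: str
    transactions: list[TrooTransaction] = field(default_factory=list)

    def on_pickup(self, product_guess: str) -> TrooTransaction:
        tx = TrooTransaction(self.shopper_id, product_guess)
        self.transactions.append(tx)
        return tx

    def on_putback(self, tx: TrooTransaction):
        tx.state = TrooState.CANCELED

    def on_confirmation(self, tx: TrooTransaction, product_id: str):
        tx.product_guess = product_id
        tx.state = TrooState.CONFIRMED

    def on_exit(self) -> list[TrooTransaction]:
        """Commit every confirmed transaction; anything still PENDING
        must be resolved (e.g., by a store worker) before payment."""
        committed = []
        for tx in self.transactions:
            if tx.state is TrooState.CONFIRMED:
                tx.state = TrooState.COMMITTED
                committed.append(tx)
        return committed
```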

With the transactional model in mind, the key activities to be sensed and inferred in the store are product pickups and returns. One possible way is to track shoppers’ hands and their movements. Alternatively, if we can continuously track the accurate location of each person and each product, and observe that certain products consistently move with a certain person until she exits, we can infer that those products were bought by that person.

Neither precise hand-motion tracking nor accurate location tracking is easy in a real retail environment without very dense sensor instrumentation. TrOO represents the hardest technical problem in this cyber-physical environment.

3.2 Critical Spots and Moments

There are a few particular locations in the store where, and moments in the shopping process during which, the state of transactions can change. The instrumentation of these spots and the timing of information extraction and processing are worth careful examination. These design choices also induce the following possible subsystems in autonomous stores.

  • Store entrance. This is the spot to best identify shoppers and assign them an ID as they walk into the store. Shoppers may also be most open to engagement if there is any need for setting up, for example, by launching a store app, confirming loyalty/club membership, or possibly choosing a preferred payment method for this shopping session. One can employ biometric sensing or deliberate shopper log in (e.g., using a mobile app) at the entrance to establish the identity of the shopper.

  • Shelf edge. Most transaction state changes happen at the shelf edge, when shoppers pick up products or put products back on the shelves. Crowded shelf edges are challenging for the task of assigning ownership; a customer may reach in front of another to pick up a product. Similarly, counting exactly how many products are picked up at one time is difficult using computer vision alone. One way to complement cameras and computer vision at shelf edges is to incorporate weight sensors, which can tell whether, and how many, items are removed from or added to the shelf (see the sketch after this list).

  • Shopping carts. For stores that provide them, shopping carts are natural association points between shoppers and their potential purchases. Socially, it is widely accepted that products in a shopping cart belong to the shopper who uses it. If a shopping cart can register every product that is put in or removed from it, it is an ideal place for L3+ autonomy.

  • Store exit. This is the spot where all transactions are closed and payments are processed. When the system has any unresolved uncertainty, it is also the last chance for a store worker to help or to intervene.
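As a small illustration of how shelf weight sensors can complement vision at the shelf edge, the following sketch infers how many items were removed from (or added to) a shelf lane from a measured weight delta. The unit weight, tolerance, and function name are illustrative assumptions.

```python
# Sketch of inferring item counts from a shelf weight-sensor reading.
# Unit weights and the noise tolerance are illustrative assumptions.
from typing import Optional

def infer_item_delta(weight_before_g: float,
                     weight_after_g: float,
                     unit_weight_g: float,
                     tolerance_g: float = 5.0) -> Optional[int]:
    """Return how many items were removed (positive) or added (negative),
    or None if the change is not a clean multiple of the unit weight."""
    delta = weight_before_g - weight_after_g
    count = round(delta / unit_weight_g)
    if abs(delta - count * unit_weight_g) <= tolerance_g:
        return count
    return None   # ambiguous reading: defer to vision or human assistance

# Example: a lane of 410 g jars drops from 2050 g to 1230 g -> 2 items removed.
assert infer_item_delta(2050.0, 1230.0, 410.0) == 2
```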

4 Human Factors

An autonomous store is not just a physical environment in isolation. Both store workers and shoppers add significant complexity and are a source of dynamism. Human activities are the most complex to recognize, even in this constrained context. But shopper activities and worker assistance also let the store system continuously learn and improve its intelligence level.

4.1 Human Challenges

The thesis of autonomous retailing is to minimize shopping friction. However, human shoppers themselves can be a source of friction when no single, automated process can cover all corner cases.

Signal Obstruction. Human bodies are terrible media for most sensing signals, such as RF, sound, and light. Shoppers can wear, carry, or hold additional objects, and form groups that block signals from arbitrary angles. As they move around in the space, it is hard to provision a sensing system that can handle all possible corner cases. The system must be able to tolerate and track uncertainty over space and time, and resolve it opportunistically or intentionally later.

Groups and Accounts. In most retail environments, it is fairly common for a family to shop together. Members of a group may part ways within the store and reconvene later, carrying different items. In this scenario, not all persons, especially kids, are expected or able to pay. So the group may use a single payment account. However, checking whether everyone entering the store has payment authority, or people who entered separately belong to the same group, brings friction to the shopping experience. Handling group shoppers correctly may require defining an alternative shopper experience, and educating shoppers.

Vulnerability Exploitation. Another unique complexity in retail stores is the potential dishonesty of shoppers. The average shrinkage in the retail industry is about 1.45% [1]; that is, 1.45% of transaction amounts are never paid for. However, this statistic is based on stores with human attendance and regular checkout counters. For autonomous stores, if a vulnerability is discovered, for example, if certain human gestures are not recognized correctly, then malicious shoppers can actively exploit it for personal gain. For this reason, the barrier from L4 to L5 is very high. There must be a safety net that bounds the store’s losses under all corner cases, even unforeseen attacks.

4.2 Human Assistance

While humans bring challenges to autonomous retailing, they also provide hope for progressive store intelligence. Human intelligence complements store intelligence in several ways. Store workers can catch potential errors the automated system makes and correct them before a customer checks out. In fact, the levels of autonomy map directly to the degree of effort from store workers that is necessary to assure checkout correctness, ranging from full auditing (in theory) in L0 to L2, through partial auditing in L3 and L4, to, finally, completely human-free operation in L5.

In addition to assistance at the end of a trip, humans can also help with complex sensing and inference tasks in the inference loop. For example, in the shelf-edge disambiguation case illustrated in Fig. 2, looking from the top, it is hard to correctly infer which customer picked up which product from computer vision alone. To involve human assistance, the store system can stream the video to back-end human monitors who can point out which case it is. Machine intelligence can then integrate the human input into further inference.

Fig. 2. Ambiguity illustrated by two shoppers at the shelf edge. Without fine grained joint skeletal tracking, it is difficult to tell who picked up what product.

A key requirement for an autonomous store is to understand its own intelligence boundaries: how confident it is about inference results, what it needs to do in order to reduce uncertainty, and what raw data or evidence is relevant to present to human assistants [15].
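One way to operationalize such intelligence boundaries is a confidence-threshold escalation path: the sketch below accepts high-confidence inferences and queues the relevant evidence for a back-end human monitor otherwise. The threshold, queue, and function names are illustrative assumptions.

```python
# Sketch of routing low-confidence inferences to a human monitor.
# The confidence threshold and queue structure are illustrative assumptions.
from queue import Queue
from typing import Any, Optional

human_review_queue: Queue = Queue()

def resolve_or_escalate(hypothesis: Any,
                        confidence: float,
                        evidence_clip: Any,
                        threshold: float = 0.95) -> Optional[Any]:
    """Accept high-confidence inferences; otherwise package the relevant
    evidence (e.g., a short video clip) for a back-end human monitor."""
    if confidence >= threshold:
        return hypothesis
    human_review_queue.put({
        "hypothesis": hypothesis,
        "confidence": confidence,
        "evidence": evidence_clip,   # only the data relevant to the question
    })
    return None   # decision deferred until a human label comes back
```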

Human labels, obtained through verified checkout, barcode scanning, or behind-the-scenes disambiguation, offer ground truth for autonomous stores to learn from and improve their intelligence. Modern AI is primarily data driven. Increasing levels of autonomy guide a progressive process of gradually improving a system’s capacity to handle complex corner cases.

5 TerraSwarm Thinking

A mass-market retail store can have tens of thousands of types of products, hundreds of thousands of individual items, and hundreds of shoppers at peak time. Shoppers, products, and store layout change over time. Even putting capital cost considerations aside, such a system must coordinate a large number of distributed sensors with different modalities, orchestrate local and central decisions, and react at different time scales. To some extent, these challenges present themselves in any TerraSwarm-style system (consider smart grids, smart cities, and health care applications), but the transaction-level correctness required for retailing, the large scale of deployment, and the trickiness of tracking the physical maneuvers of humans make this problem unique.

Let us take the shelf-edge inference pipeline (Fig. 3) as an example. To correctly recognize the product, the count, and the picking-up/placing-back action, one may turn the shelf into a smart shelf with pressure sensors on each layer and with cameras pointing at the shelves to identify and count products. If the products are small and hard to recognize by computer vision alone, one may add an NFC reader on the shelves and label the bottom of each product with an NFC tag. To infer product ownership changes (like disambiguating between the two situations illustrated in Fig. 2), one may use depth sensors (like Kinect) to track the arm movements of people within a certain range of the shelf. Whenever a product leaves the shelf, the sensor identifies the hand and traces along the arm to the human body. The smart shelf system needs to further interact with a shopper identification system that may use face detection and recognition [10]. Now imagine scaling this design up to over 1000 shelves in a typical grocery store.

Fig. 3. Example of sensors and fusion at shelf edges for TrOO inference.

From a system design point of view, autonomous retailing also offers many research challenges.

5.1 Uncertainty as a First Class Citizen

As discussed in previous sections, natural conditions (e.g., lighting and RF background noise) and human activities make sensor data noisy and unreliable. Machine-learning-based inference results are themselves probability distributions. For example, face recognition algorithms are typically evaluated in terms of the correctness of their top-5 outputs. These uncertainties need to be carried over space and time. Opportunistically or intentionally, additional sensor observations may be used to help resolve the uncertainty. The decisions at the end of each shopping trip, which determine the amounts charged to customers, have to be deterministic.

Introducing uncertainty as a first-class citizen requires probabilistic models of computation. For example, a particle filter provides a way to represent and process non-parametric, uncertain information with deterministic calculations. Each data object in the system is represented by a set of particles distributed over possible values. These particles can go through different processing paths based on their values. In any probabilistic model, prior knowledge is key to setting up an initial probability distribution. Fortunately, human activity and shopping behaviors are habitual. A person may be left-handed or right-handed, which is a useful prior for tracking product picking. A person may have developed strong preferences for particular categories of groceries [17], which can serve as a useful prior for predicting their shopping paths and product selection.
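As an illustration, the following minimal sketch tracks a single shopper’s 2D position with a particle filter: particles are initialized from a prior near the entrance, diffused to model uncertain movement, and re-weighted and resampled against a noisy camera fix. The noise levels, prior, and function names are illustrative assumptions, not a prescribed design.

```python
# Minimal particle-filter sketch for tracking a shopper's 2D position.
# Motion and measurement noise levels are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def init_particles(n: int, entrance_xy: np.ndarray) -> np.ndarray:
    """Start all hypotheses near the store entrance (the prior)."""
    return entrance_xy + rng.normal(scale=0.5, size=(n, 2))

def predict(particles: np.ndarray, motion_std: float = 0.3) -> np.ndarray:
    """Diffuse particles to model uncertain human movement between frames."""
    return particles + rng.normal(scale=motion_std, size=particles.shape)

def update(particles: np.ndarray, measured_xy: np.ndarray,
           meas_std: float = 0.5) -> np.ndarray:
    """Weight particles by how well they explain a noisy camera fix,
    then resample so likely hypotheses survive."""
    dists = np.linalg.norm(particles - measured_xy, axis=1)
    weights = np.exp(-0.5 * (dists / meas_std) ** 2)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]

# One filtering step: predict motion, then fold in a camera observation.
particles = init_particles(500, entrance_xy=np.array([0.0, 0.0]))
particles = predict(particles)
particles = update(particles, measured_xy=np.array([0.4, 0.1]))
estimate = particles.mean(axis=0)   # point estimate when a decision is needed
```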

5.2 Belief Fusion

Several inference tasks in autonomous stores involve large amounts of data (like streaming video feeds), large models (like deep neural nets), and heavy computation. When distilling high-level information from raw sensor data, the system architecture, i.e., where and when inference is done, requires careful thought.

There is a spectrum of architecture designs. At one end of the spectrum, all data are streamed to the cloud for processing. The benefit is that the cloud is not resource constrained, and it now has a global view of measurements. Model and algorithm updates are also easy since nothing needs to be pushed back to the nodes. Indeed, for many low data rate Internet of Things solutions, this is the default cloud + IoT architecture [11, 12]. We call this approach (global) sensor fusion.

At the other end of the spectrum, each node or subsystem makes a decision, a deterministic event assertion, locally. For example, a camera will assert the identity of a person in its field of view from face or gait recognition. Upper-level aggregation and inference assume that the assertions are correct. We call this approach Decision Fusion. The benefit of decision fusion is that raw data are distilled into high-level information as soon as possible, so the data that needs to be communicated among different tiers of the system is minimal. However, important hypotheses and probabilities may be discarded too early and lead to irreversible wrong decisions.

A trade-off in between is belief fusion, or belief propagation [14], where uncertainty is carried with the data until a decision has to be made. Sensors and intermediate inference may deliver incomplete or inconsistent partial results. Hypotheses are validated over space, time, and sensing modality to refine and revise beliefs. Updates are then diffused to correct other derived hypotheses. This approach is more resource efficient than sensor fusion, and more flexible and reliable than decision fusion. The challenge is to maintain a potentially large set of hypotheses and inferences, and to expand and trim them as new evidence comes in.
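The sketch below illustrates one simple form of belief fusion over product hypotheses: each sensor contributes a likelihood per candidate SKU, and the likelihoods are combined with a prior under a naive independence assumption. The SKU names, likelihood values, and function name are illustrative assumptions.

```python
# Sketch of belief fusion over product hypotheses from independent sensors.
# Each sensor reports a likelihood per candidate SKU; values are illustrative.
import numpy as np

def fuse_beliefs(prior: dict[str, float],
                 sensor_likelihoods: list[dict[str, float]]) -> dict[str, float]:
    """Combine a prior with per-sensor likelihoods (naive independence
    assumption) and renormalize into a posterior over candidate SKUs."""
    skus = list(prior)
    post = np.array([prior[s] for s in skus], dtype=float)
    for likelihood in sensor_likelihoods:
        post *= np.array([likelihood.get(s, 1e-6) for s in skus])
    post /= post.sum()
    return dict(zip(skus, post))

# A camera weakly prefers SKU "A"; the shelf weight sensor strongly prefers "B".
posterior = fuse_beliefs(
    prior={"A": 0.5, "B": 0.5},
    sensor_likelihoods=[{"A": 0.6, "B": 0.4},    # vision
                        {"A": 0.1, "B": 0.9}])   # weight delta
# The posterior is dominated by "B"; the decision can still be deferred
# until its probability clears a transaction-level threshold.
```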

Embedded ML. Clearly, the more we can push inference towards the source of the information, the more likely we are to form strong beliefs about reality and maintain fewer hypotheses. Modern inference is primarily data driven and uses big models. For example, a generic object recognition network like a 50-layer ResNet [7] has 25.5M weights and requires 3.9G multiply-accumulate operations. Such resource requirements are beyond most embedded systems.

Embedded machine learning trades off accuracy against resource requirements. For example, by using fixed-point multipliers and adders, and by trimming the connections between layers of neural networks, one can compress a full-blown DNN by up to 50X without loss of accuracy [6]. These are promising techniques for bringing intelligence into the real world through embedded platforms.
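As a small illustration of the fixed-point idea, the sketch below quantizes one weight tensor to 8-bit signed integers plus a scale factor. This is a generic post-training scheme under illustrative assumptions, not the specific compression pipeline of [6].

```python
# Sketch of post-training fixed-point quantization of one weight tensor.
# The 8-bit width and symmetric-range scheme are illustrative assumptions.
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 8):
    """Map float weights to signed fixed-point integers plus a scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(weights).max() / qmax
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(64, 64).astype(np.float32)     # a toy weight matrix
q, scale = quantize_symmetric(w)
error = np.abs(w - dequantize(q, scale)).max()     # small reconstruction error
```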

Security and Data Integrity. An autonomous store has a large attack surface, on both the digital side and the physical side. It is also a public space. Breaching such a system gives attackers physical, material gains. Among possible attacks, data integrity attacks, which try to fool the sensors so that the system makes seemingly correct but wrong decisions, are uniquely damaging and hard to discover. Researchers have shown that simple eyewear can mislead facial recognition software [13], while hijacking and replaying camera feeds can cover up traces of human activity. These challenges call for new security and data integrity research for cyber-physical-human systems.

6 Conclusions

Over the past 50 years, embedded computing systems have moved from a marginal research topic to an interdisciplinary research area that impacts almost every aspect of human life. First, with sensors and actuators, embedded systems motivated real-time computing for mission-critical tasks. Next, connectivity and networking gave embedded systems the scale of the Internet of Things and cyber-physical systems, and thus gave rise to TerraSwarm-type challenges. In this third wave of AI and data, we believe intelligent embedded systems will make our physical environments smarter. Autonomous retailing is an iconic, challenging scenario for the next wave of cyber-physical-human systems. While there have been several attempts to showcase proofs of concept, we believe that years of fundamental and applied research are still required to achieve fully autonomous environments at acceptable cost.