Using AI and Video to Learn How the World Works

Simple actions like waving hello to a friend, objects falling to the ground and other visual movements are easily understood by humans but would confound machines. Twenty Billion Neurons (TwentyBN), a startup based in Toronto and Berlin, is developing the artificial intelligence capability of imbuing an understanding of the visual world in robots by using videos and deep learning, according to Roland Memisevic, its CEO and chief scientist, who spoke at the recent AI Frontiers conference in Silicon Valley.

Recently, TwentyBN launched Millie, an AI-powered context-aware avatar, at a conference in Montreal. According to the startup’s blog post, Millie is a “life-size helper who interacts with you by observing and understanding the context you’re in and what you’re doing.” The company plans to make context-aware avatars like Millie its main product, serving industries such as retail and education, among others.

In an interview with Knowledge at Wharton, Memisevic talked about his vision for the company and why “video is the best possible window to the world that we can give an AI system to learn about how the world works.”

An edited transcript of the conversation follows.

Knowledge at Wharton: Could you tell us a little about your personal journey and how you got involved with AI?

Roland Memisevic: My interest started when I was 15 or 16 through books by Douglas Hofstadter who wrote some popular science books that focused on AI. I stumbled upon one of his books in a bookstore. I read it and was very intrigued. It was like magic — something special, interesting, weird and cool.

Knowledge at Wharton: What impressed you about it?

Memisevic: It’s what impresses people generally about AI. By pursuing AI, we learn a lot about ourselves. We are such weird creatures. … It’s like putting up a mirror and seeing what humans are and why they are the way they are. Many things are very surprising. That’s where this fascination came from.

It’s also an interesting mix. It’s very mathematical and at the same time it touches philosophical questions, and it can even be artistic in some ways. There’s a lot of creativity that is automatically unleashed when you dig into AI topics.

Knowledge at Wharton: How did Twenty Billion Neurons come about? What is the opportunity that you were trying to address?

Memisevic: I was interested in and attracted to video understanding throughout my career as a Ph.D. student and then later as an assistant professor at the University of Montreal. This was not because of video per se, but because I think that video is the best possible window to the world that we can give an AI system to learn about how the world works, what objects are, how objects behave, what living creatures are, how they behave and why: what people call intuitive physics or common sense understanding of our world. Twenty Billion Neurons is an opportunity for me to pursue this in a way that makes much more sense than how I, and others, have been doing it at the university.

At the company, we have a very large-scale data generation operation where we ask human crowd workers to film videos for us so that we can teach networks about how the world works. It’s an organization of 20 people just trying to nail this one question and getting as good as possible at finding solutions to these problems. The opportunity to work with an organization that is so focused on this one problem is the reason why I’m here.

Knowledge at Wharton: For those who are not familiar with AI and how AI and video work together, could you explain what the company does and what it means for users and consumers?

Memisevic: When humans use language, they often refer to simple daily concepts through analogy in order to make even the most high-level abstract decisions. For example, a CEO might say, “There’s a big storm brewing ahead of us,” and everybody knows what it means. If you look at how people use language and how they think and how they reason, it’s always grounded in basic day-to-day experiences.

Video is the best possible way to get that knowledge into AI systems because it’s a very rich source of information and it conveys a lot about the world that these systems would otherwise not get. What we teach these systems, for example, is if I take a particular object and I don’t hold on to it, it’s going to fall in a specific way, in a very specific direction. … All of this is immediately visible in video. If you try to make prediction tasks around interactions with objects, if you can answer a lot of questions about what’s happening in videos, you must fundamentally understand a lot of those properties.

That’s why we are creating data that shows all kinds of things that can happen in the world and then asking neural networks to make predictions, for example, to describe in words what they see. If they get good at it, it must mean that they have somehow absorbed some of this information.

Knowledge at Wharton: It would be helpful to understand what you’re explaining in the context of examples from industries like, say, health care. Could you give an example of a use case?

Memisevic: Health care is an immense opportunity for video understanding applications. It is a difficult opportunity though, because health care is highly regulated and difficult to penetrate as a market.

There are many use cases. We have, for example, started to work with a hospital in Toronto on using gesture control so that nurses who work with a patient do not have to stop working with the patient to turn off an alarm, take off their gloves, push the button, put on new gloves and then continue working with the patient. So, [there is] touchless interaction as a way to make the workflow much easier for nurses.

Another example is documentation. It’s necessary to log what you are doing when you are dealing with a patient. This is typically considered annoying and a waste of time. It’s very easy for a camera that is watching you to create a document pre-filled with everything you did and the sequence of activities. You then just need to look over it, maybe fix a few things and then approve and say, “OK, this is the log file.”

These are just two possible use cases. There are many, many more, specifically around elder care: for example, detecting when somebody falls, or even just providing companionship by having a conversation with you, keeping you company and just being there to combat loneliness and isolation. This is a very big problem for elderly people, but it’s solvable. The irony is, in health care where it’s the most impactful and good for society, where this technology can make a big difference, it’s the hardest to commercialize because it’s a difficult, long-term market. For a small company like ours, health care is hard to work in.

Knowledge at Wharton: Let’s take another industry that is less regulated than health care, like retail for example. What might be some of the applications of video understanding in that field?

Memisevic: There are many of them. The ones that I’m most interested in, and the ones that we are pursuing as a company, are around the idea of a companion. Think of it as an avatar or a robot that welcomes you to the store, answers your questions around items that you might be looking for, prices, or things like that. Or it makes you smile when you go into a store and have fun interacting with an artificial creature that can actually look at you and engage with you in some way in order to drive engagement and satisfaction and increase foot traffic.

Knowledge at Wharton: What do you mean by “look at you?” That has a specific meaning.

Memisevic: One of the big changes that we are seeing right now — thanks to the technology that we are building and generating data for a system to understand video — is that we can endow these artificial creatures, robots or avatars on a screen, [with the ability] to look back at you and understand what they’re looking at. So, unlike a smart home speaker where you push a button and ask, “Hey, how’s the weather tomorrow?” [Robots can] see that you’re approaching. They can wave at you and tell you, “Come over here. Let me show you something.” They have gaze direction, just like we have. They look in a certain direction in order to focus on certain parts of the world. And they can relay that back to you by just having their eyes point in a certain direction.

A video from TwentyBN about Millie, an AI-powered context-aware avatar, which the company launched at a conference in Montreal.

They can convey to you that they’re looking at you right now. They can understand that you’re looking at them. You can have a much more natural engagement with these artificial intelligence creatures than you could with, say, a screen [that lets you] browse through a directory or something like that.

Knowledge at Wharton: You spoke a little while ago about some of the challenges of commercializing it. What business model or models are you pursuing to make this a viable business?

Memisevic: We license the technology. We license these neural networks that enable, say, a robot in a store to look at you and understand what’s going on. We also analyze this data because we generate an incredible amount of data in the process. It’s a valuable source of information for some companies to train their own systems.

Knowledge at Wharton: Autonomous vehicles seem like a logical use.

Memisevic: We’re not getting into autonomous driving, though.

Knowledge at Wharton: Why not?

Memisevic: It’s a strategic commercialization decision. [Autonomous driving] is a very crowded space. Where we can provide a lot of value in the automotive space is by helping cars become a better assistant. For example, helping you inside the car to use your hand gestures to control parts of the car, or helping you understand what passengers are doing, and so on. This is specific to us. We like to look at the interior of rooms and cars. The average American spends around 93% of the day in the interior, so that’s the focus that we have.

Knowledge at Wharton: What milestones are you looking at over the next 18 to 24 months?

Memisevic: We are looking at scaling up the licensing opportunities that we have. We have some very ambitious projects around these creatures that can interact with you, so there are some technical milestones where we want to put up new technology that enables them to do things they haven’t been able to do so far. For example, teaching you a new skill like a cooking recipe or some dance moves. And on the commercial side, [we are looking at] increasing the number of subscribers and increasing revenue.

Knowledge at Wharton: How do you measure your success?

Memisevic: On the technical side, [we measure] how the technology that we’ve been envisioning works. There are numbers you can attach to that, like the accuracy in certain situations. Commercially, [we measure] the number of deals that we are able to close and their volume, etc.

Knowledge at Wharton: Who are your main competitors and how do you position yourselves against them?

Memisevic: There are a bunch of companies that are somewhat similar, but currently there is no company that is as laser-focused on this particular problem as we are. It’s live vision, and specifically live vision in order to build companions that can look at you and understand what they are looking at. We’re not worried right now that the market is too flooded with this. But there is a huge shortage of talent and this is where the competition really plays out. There are the usual companies, Google, Amazon, Facebook, Microsoft, who are competing for talent. Once in a while, you also see offerings specifically by cloud players that touch upon some of the capabilities that we provide. So there’s a little bit of overlap there. But all in all, currently we’re in a comfortable situation. We are so specific in what we can provide that there isn’t too much competition.

Knowledge at Wharton: What are the principal risks that you see for the company and what are you doing to mitigate those?

Memisevic: Risk is always present on all fronts. There are technical risks. We have very ambitious goals and the risk here is that it takes longer to pursue these ambitions than you would hope. There is market risk. We realize that it is very early days and the ecosystem is still finding its way and reconfiguring itself in order to support a lot of the technological solutions that we provide. For example, the hardware space isn’t really there. Cameras don’t have the compute power behind them to serve many of those tasks. And there is clearly a timing risk. The market may not be ready [for what we have to offer].

Knowledge at Wharton: In your journey so far, what would you say is the biggest leadership challenge you have faced? How did you deal with it and what did you learn from it?

Memisevic: Growing from four people to six people to 10 people to 12 people — every time it’s a new challenge. You require different processes, different cultural setups to keep everybody productive and happy and the organization healthy. I was a professor at the university before, and that’s a very different world. It’s very individual-based. I believe a 20-person team focused on this one problem can make unbelievable progress, but this group of people has to be a functioning conglomerate of interests. That is something I didn’t expect. It was interesting to see these challenges. I grew through that. Right now, we’re in a very good place where this setup works very well.

Knowledge at Wharton: Was it challenging to go from being a professor to an entrepreneur?

Memisevic: Oh, yes. I would say specifically because of the reason that I just mentioned. That is one of the big ones. But you grow through that. You learn a lot. You learn about how humans behave and how groups function together. It’s a fascinating topic all in itself.

Knowledge at Wharton: What’s your dream for the future?

Memisevic: Imagine a world in which an artificial intelligence can look at you and relate to you and talk to you in a way that is not fundamentally different from how another person would. This is not a goal that is fully attainable, I think. There are things that are related to our bodies like pain, sadness, etc. which our AI companions are never going to really understand and relate to. But we can try to approach this and one day sit in front of our robotic friends and have a deep philosophical conversation about the economic situation, or things like that. This I can see happening one day – our AI companions reasoning and thinking.

Knowledge at Wharton: Reasoning and thinking, yes. But do you think computers through AI will ever be capable of feeling emotion?

Memisevic: Not in obvious ways. Maybe there are ways to instill some of that, somehow.

Knowledge at Wharton: They might mimic emotion, but they can’t feel emotion?

Memisevic: This goes back to the question of whether a system can ever be conscious or not. I’m not sure. Do you know that I’m conscious? You can assume it but you can’t really prove it. If you ever sit in front of a device, maybe embodied in some way, that conveys to you that it has emotions, I’m not sure you will really be able to say, “Well, it looks like [it’s conscious], but this is a robot, so I think it probably doesn’t have emotions.”

I feel there’s a fundamental barrier. You can’t feel what another person is feeling in any moment. You can have some kind of empathy, you can sort of relate to it, but you can’t prove it. You don’t know that I’m conscious. And I think that barrier is true in the same way towards devices. So I don’t think it makes any difference in the end. People are going to ascribe some state of mind to the machine and just roll with it. Maybe even feel bad if you do something to the device that makes it feel hurt or something. But this is out there.

Knowledge at Wharton Podcast

December 18, 2018 • 25 min listen
