Tesla AI Day: How Does Tesla's Autopilot Work by @whatsai
433 reads


by Louis Bouchard, August 21st, 2021

Too Long; Didn't Read

A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from image acquisition through its eight cameras to the navigation process on the roads. Learn more in the short video below.

If you wonder how a Tesla car can not only see but also navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from image acquisition through its eight cameras to the navigation process on the roads.

This week, I cover Andrej Karpathy's talk at Tesla AI Day on how Tesla's Autopilot works.

Learn more in this short video

Video Transcript

If you wonder how a Tesla car can not only see but navigate the roads with other vehicles, this is the video you were waiting for. A couple of days ago was the first Tesla AI Day, where Andrej Karpathy, the Director of AI at Tesla, and others presented how Tesla's Autopilot works, from the image acquisition through their eight cameras to the navigation process on the roads.

Tesla's cars have eight cameras, like in this illustration, allowing the vehicle to see its surroundings and far in front. Unfortunately, you cannot simply take all the information from these eight cameras and send it directly to an AI that will tell you what to do, as this would be way too much information to process at once, and our computers aren't this powerful yet. Just imagine trying to do this yourself, having to process everything all around you. Honestly, I find it difficult to turn left when there are no stop signs and you need to check both sides multiple times before taking a decision. Well, it's the same for neural networks, or more precisely, for computing devices like CPUs and GPUs.

To attack this issue, we have to compress the data while keeping the most relevant information, similar to what our brain does with the information coming from our eyes. To do this, Tesla transfers these eight cameras' data into a smaller space they call the much smaller vector space. This space is a three-dimensional space that looks just like this and contains all the relevant information in the world, like the road signs, cars, people, lines, etc. This new space is then used for the many different tasks the car will have to do, like object detection, traffic light detection, lane prediction, etc.
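To make that input-to-output mapping concrete, here is a minimal sketch (in PyTorch, and definitely not Tesla's code) of the tensor shapes involved: eight RGB camera frames going in, one top-down grid of features, the "vector space", coming out. The resolutions and channel counts are placeholders I picked for illustration.

```python
import torch

# Hypothetical shapes, for illustration only (not Tesla's real resolutions).
batch, num_cams = 1, 8
cam_images = torch.rand(batch, num_cams, 3, 480, 640)  # eight RGB camera frames

# The goal of the vision stack: fuse all cameras into one top-down
# "vector space" grid around the car, with C feature channels per cell.
vector_space = torch.zeros(batch, 256, 200, 200)        # placeholder output grid

print(cam_images.shape, "->", vector_space.shape)
```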

But how do they go from eight cameras, which means eight three-dimensional inputs composed of red, green, and blue images, to a single output in three dimensions? This is achieved in four steps and done in parallel for all eight cameras, making it super efficient.

At first, the images are sent into a rectification model, which takes the images and calibrates them by translating them into a virtual representation. This step dramatically improves the Autopilot's performance because it makes the images look more similar to each other when nothing is happening, allowing the network to compare the images more easily and focus on essential components that aren't part of the typical background.
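To give an idea of what calibrating each camera into a shared virtual representation can look like, here is a small sketch using OpenCV's standard undistort-and-rectify routines. The intrinsics, distortion coefficients, and virtual camera matrix are made-up placeholders; the talk only describes this step at a high level, so treat this as an analogy for mapping a physical camera into a common virtual camera, not Tesla's implementation.

```python
import cv2
import numpy as np

# Placeholder calibration values for one camera (illustration only).
K = np.array([[700.0, 0.0, 320.0],
              [0.0, 700.0, 240.0],
              [0.0, 0.0, 1.0]])                  # camera intrinsics
dist = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])      # lens distortion coefficients
K_virtual = np.array([[600.0, 0.0, 320.0],
                      [0.0, 600.0, 240.0],
                      [0.0, 0.0, 1.0]])          # shared "virtual camera"

# Precompute the pixel remapping from the physical camera to the virtual one.
map_x, map_y = cv2.initUndistortRectifyMap(
    K, dist, None, K_virtual, (640, 480), cv2.CV_32FC1)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a camera frame
rectified = cv2.remap(image, map_x, map_y, interpolation=cv2.INTER_LINEAR)
```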

Then these new versions of the images are sent into a first network called RegNet. This RegNet is just an optimized version of the convolutional neural network architecture, CNNs. If you are not familiar with this kind of architecture, you should pause the video and quickly watch the simple explanation I made, appearing on the top right corner right now. Basically, it takes these newly made images and compresses the information iteratively, like a pyramid, where the start of the network is composed of a few neurons representing some variations of the images, focusing on specific objects and telling us where they are spatially. Then, the deeper we get, the smaller these images will be, but they will represent the overall images while also focusing on specific objects. So at the end of this pyramid, you will end up with many neurons, each telling you general information about the overall picture: whether it contains a car, a road sign, etc.

In order to have the best of both worlds, we extract the information at multiple levels of this pyramid, which can also be seen as image representations at different scales focusing on specific features in the original image. We end up with local and general information, all of it together telling us what the images are composed of and where it is.
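As a rough sketch of this pyramid idea (not the RegNet Tesla actually uses), here is a tiny PyTorch backbone that keeps its intermediate feature maps at several resolutions, so both the fine "where" information and the coarse "what" information stay available. All layer sizes are arbitrary.

```python
import torch
import torch.nn as nn

class TinyPyramidBackbone(nn.Module):
    """Toy CNN that keeps intermediate feature maps at several scales."""
    def __init__(self):
        super().__init__()
        # Each stage halves the spatial resolution and widens the channels.
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)   # high resolution: precise "where" information
        f2 = self.stage2(f1)  # medium resolution
        f3 = self.stage3(f2)  # low resolution: general "what" information
        return [f1, f2, f3]   # keep every level of the pyramid

features = TinyPyramidBackbone()(torch.rand(1, 3, 480, 640))
print([f.shape for f in features])
```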

Then this information is sent into a model called BiFPN, which will force this information from different scales to talk together and extract the most valuable knowledge among the general and specific information it contains. The output of this network will be the most interesting and useful information from all these different scales of the eight cameras' information. So it contains both the general information about the images, which is what they contain, and the specific information, such as where it is, its size, etc. For example, it will use the context coming from the general knowledge of deep features extracted at the top of the pyramid to understand that, since these two blurry lights are on the road between two lanes, they are probably attached to a specific object that was identified from one camera in the early layers of the network. Using both this context and knowing it is part of a single object, one could successfully guess that these blurry lights are attached to a car.
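Here is a heavily simplified sketch in the spirit of that cross-scale fusion: each scale is projected to a common width, then information flows from coarse to fine and back. A real BiFPN adds learned weighted fusion and repeats the block several times; the channel counts below simply match the toy backbone from the previous snippet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCrossScaleFusion(nn.Module):
    """Minimal BiFPN-flavoured fusion: every scale is projected to a common
    width, then information flows top-down (coarse -> fine) and bottom-up."""
    def __init__(self, in_channels=(32, 64, 128), width=64):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(c, width, 1) for c in in_channels])

    def forward(self, feats):                       # feats ordered fine -> coarse
        p = [proj(f) for proj, f in zip(self.proj, feats)]
        # Top-down: let coarse, "general" features inform the finer ones.
        for i in range(len(p) - 2, -1, -1):
            p[i] = p[i] + F.interpolate(p[i + 1], size=p[i].shape[-2:], mode="nearest")
        # Bottom-up: let fine, "where exactly" features inform the coarser ones.
        for i in range(1, len(p)):
            p[i] = p[i] + F.adaptive_max_pool2d(p[i - 1], p[i].shape[-2:])
        return p

feats = [torch.rand(1, 32, 240, 320),
         torch.rand(1, 64, 120, 160),
         torch.rand(1, 128, 60, 80)]
fused = SimpleCrossScaleFusion()(feats)
print([f.shape for f in fused])
```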

So now we have the most useful information coming from different scales for all eight cameras. We need to compress this information so we don't have eight different data inputs, and this is done using a transformer block. If you are not familiar with transformers, I will invite you to watch my video covering them in vision applications. In short, this block will take the condensed information we have for the eight different pictures and transfer it into the three-dimensional space we want, the vector space. It will take this general and spatial information, here called the key, calculate the query, which is of the dimension of our vector space, and will try to find what goes where. For example, one of these queries could be seen as a pixel of the resulting vector space looking for a specific part of the car in front of us. The value will merge both of these accordingly, telling us what is where in this new vector space. This transformer can be seen as the bridge between the eight cameras and this new 3D space, understanding all the interrelations between the cameras.
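A minimal sketch of this key/query/value bridge, assuming learned queries that stand in for cells of the output vector space and attend over the flattened features of all eight cameras. The dimensions are arbitrary and this is nowhere near Tesla's implementation, but it shows the mechanism.

```python
import torch
import torch.nn as nn

# Arbitrary illustration sizes: 8 cameras, each reduced to a 20x15 feature
# map with 64 channels, fused into a 50x50 "vector space" grid.
num_cams, feat_h, feat_w, dim = 8, 15, 20, 64
grid_h, grid_w = 50, 50

camera_feats = torch.rand(1, num_cams * feat_h * feat_w, dim)     # keys/values source
bev_queries = nn.Parameter(torch.rand(1, grid_h * grid_w, dim))   # one query per output cell

# Standard multi-head cross-attention: each output-grid query asks
# "which camera features belong at my location?"
attention = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
vector_space, _ = attention(query=bev_queries, key=camera_feats, value=camera_feats)

# Reshape the attended tokens back into a 2D top-down grid of features.
vector_space = vector_space.reshape(1, grid_h, grid_w, dim)
print(vector_space.shape)  # torch.Size([1, 50, 50, 64])
```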

Now that we have finally condensed our data into a 3D representation, we can start the real work. This is the space where they annotate the data they use for training their navigation network, as the space is much less complex than eight cameras and easier to annotate.

OK, so we have an efficient way of representing all our eight cameras now, but we still have a problem: single-frame inputs are not intelligent. If a car on the opposite side is occluded by another car, we need the Autopilot to know it is still there and that it hasn't disappeared just because another car went in front of it for a second. To fix this, we have to use time information, or in other words, use multiple frames. They chose to use a feature queue and a video module. The feature queue will take a few frames and save them in the cache. Then, for every meter the car travels or every 27 milliseconds, it will send the cached frames to the model. Here they use both a time and a distance measure to cover both when the car is moving and when it is stopped.
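As a toy illustration of that push rule, here is what a feature queue that triggers on either elapsed time or distance traveled could look like. The 27 ms and one-meter thresholds come from the talk as retold above; the class name, stored payload, and everything else are invented for illustration.

```python
from collections import deque

class FeatureQueue:
    """Toy feature queue: cache the latest spatial features and re-emit them
    whenever the car has moved ~1 m or ~27 ms have passed, whichever comes first."""
    def __init__(self, maxlen=20, time_step_ms=27.0, dist_step_m=1.0):
        self.cache = deque(maxlen=maxlen)
        self.time_step_ms = time_step_ms
        self.dist_step_m = dist_step_m
        self.last_push_time_ms = 0.0
        self.last_push_odometer_m = 0.0

    def update(self, features, time_ms, odometer_m):
        moved_enough = odometer_m - self.last_push_odometer_m >= self.dist_step_m
        waited_enough = time_ms - self.last_push_time_ms >= self.time_step_ms
        if moved_enough or waited_enough:
            self.cache.append((features, time_ms, odometer_m))
            self.last_push_time_ms = time_ms
            self.last_push_odometer_m = odometer_m
        return list(self.cache)   # the frames handed to the video module

queue = FeatureQueue()
frames = queue.update(features="<vector-space features>", time_ms=30.0, odometer_m=0.4)
print(len(frames))  # 1: the time trigger fired even though the car barely moved
```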

Then these 3D representations of the frames we just processed are merged with their corresponding positions and kinematic data containing the car's acceleration and velocity, informing us how it is moving at each frame. All this precious information is then sent into the video module. This video module uses these to understand the car itself and its environment in the present and the past few frames. This understanding process is made using a recurrent neural network that processes all the information iteratively over all frames to understand the context better and finally build the well-defined map you can see. If you are not familiar with recurrent neural networks, I will again orient you to a video I made explaining them. Since it uses past frames, the network now has much more information to better understand what is happening, which is necessary for temporary occlusions.
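Here is a minimal sketch of such a recurrent video module. I use a plain GRU over per-frame feature vectors concatenated with kinematics purely for illustration; Tesla's module is a spatial RNN over the vector-space grid, and all shapes here are placeholders.

```python
import torch
import torch.nn as nn

# Arbitrary sizes: 10 cached frames, 64 feature channels per frame (flattened
# from the vector-space grid), plus 2 kinematic values (velocity, acceleration).
frames, feat_dim, kin_dim, hidden = 10, 64, 2, 128

spatial_features = torch.rand(1, frames, feat_dim)   # from the feature queue
kinematics = torch.rand(1, frames, kin_dim)          # velocity, acceleration

# Concatenate features with kinematics and let a GRU carry context over time,
# so a briefly occluded car can persist in the hidden state.
video_module = nn.GRU(input_size=feat_dim + kin_dim, hidden_size=hidden, batch_first=True)
outputs, last_state = video_module(torch.cat([spatial_features, kinematics], dim=-1))

print(outputs.shape)  # torch.Size([1, 10, 128]): one temporally-aware feature per frame
```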

This is the final architecture of the vision process, with its output on the right. Below, you can see some of these outputs translated back into the images to show what the car sees in our representation of the world, or rather the eight cameras' representation of it. We finally have this video module output that we can send in parallel to all the car's tasks, such as object detection, lane prediction, traffic lights, etc.

If we summarize this architecture, we first have the eight cameras taking pictures. Then they are calibrated and sent into a CNN condensing the information, which extracts information from them efficiently and merges everything before sending it into a transformer architecture that fuses the information coming from all eight cameras into one 3D representation. Finally, this 3D representation is saved in the cache over a few frames and then sent into an RNN architecture that uses all these frames to better understand the context and output the final version of the 3D space, which is sent to our tasks that can finally be trained individually and may all work in parallel to maximize performance and efficiency.
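Pulling that summary together, here is a hedged, end-to-end toy version of the whole chain: a per-camera CNN, cross-camera attention into one representation, a recurrent module over a few frames, and parallel task heads. Every module choice and size is a stand-in I picked so the example runs, not Tesla's architecture.

```python
import torch
import torch.nn as nn

class ToyAutopilotVision(nn.Module):
    """Illustrative wiring of the summarized pipeline:
    8 cameras -> CNN features -> cross-camera attention -> temporal GRU -> task heads."""
    def __init__(self, num_cams=8, dim=64, grid=20 * 20):
        super().__init__()
        self.backbone = nn.Sequential(                     # per-camera CNN
            nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((10, 10)))
        self.bev_queries = nn.Parameter(torch.rand(1, grid, dim))
        self.fuse = nn.MultiheadAttention(dim, 4, batch_first=True)  # camera fusion
        self.video = nn.GRU(dim, dim, batch_first=True)              # temporal context
        self.heads = nn.ModuleDict({                       # parallel per-task heads
            "objects": nn.Linear(dim, 10),
            "lanes": nn.Linear(dim, 4),
            "traffic_lights": nn.Linear(dim, 3)})

    def forward(self, clips):                # clips: (time, cams, 3, H, W)
        bev_per_frame = []
        for frame in clips:                  # loop over time; the 8 cameras are batched
            feats = self.backbone(frame).flatten(2).transpose(1, 2)  # (cams, 100, dim)
            feats = feats.reshape(1, -1, feats.shape[-1])            # all cameras together
            bev, _ = self.fuse(self.bev_queries, feats, feats)       # (1, grid, dim)
            bev_per_frame.append(bev.mean(dim=1))                    # crude per-frame summary
        temporal, _ = self.video(torch.stack(bev_per_frame, dim=1))  # (1, time, dim)
        last = temporal[:, -1]
        return {name: head(last) for name, head in self.heads.items()}

outputs = ToyAutopilotVision()(torch.rand(5, 8, 3, 160, 160))
print({k: v.shape for k, v in outputs.items()})
```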

As you can see, the biggest challenge for such a task is an engineering challenge: make a car understand the world around us as efficiently as possible through cameras and speed sensors so it can all run in real time and with close-to-perfect accuracy for many complicated human tasks.

Of course, this was just a simple explanation of how Tesla's Autopilot sees our world. I strongly recommend watching the amazing video on Tesla's YouTube channel, linked in the description below, for more technical details about the models they use, the challenges they face, the data labeling and training process with their simulation tool, their custom software and hardware, and the navigation. It is definitely worth the time. Thank you for watching.

References

►Read the full article: https://www.louisbouchard.ai/tesla-autopilot-explained-tesla-ai-day/
►"Tesla AI Day", Tesla, August 19th, 2021,
►My Newsletter (A new AI application explained weekly to your emails!): https://www.louisbouchard.ai/newsletter/