How to define an end-to-end autonomous driving system?
The most common definition is that an "end-to-end" system is one that takes raw sensor information as input and directly outputs the quantities the task cares about. For example, in image recognition, a CNN can be called "end-to-end" compared with the traditional hand-crafted-feature + classifier pipeline.
In autonomous driving, data from various sensors (such as cameras, LiDAR, Radar, or IMU...) are taken as input, and vehicle control signals (such as throttle or steering-wheel angle) are output directly. To accommodate different vehicle models, the output can also be relaxed to the vehicle's planned trajectory.
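To make this narrow definition concrete, here is a minimal sketch (assuming PyTorch; the toy encoder, layer sizes, and waypoint count are illustrative choices, not any particular production architecture) of a model that maps a raw camera image directly to ego-frame waypoints:

```python
# Minimal sketch of a narrowly-defined end-to-end driving model:
# raw sensor data in, a short trajectory (waypoints) out.
# All module choices and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class NaiveEndToEndPlanner(nn.Module):
    def __init__(self, num_waypoints: int = 6):
        super().__init__()
        # Toy image encoder standing in for a real perception backbone.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Head regresses (x, y) waypoints in the ego frame.
        self.head = nn.Linear(64, num_waypoints * 2)
        self.num_waypoints = num_waypoints

    def forward(self, camera_image: torch.Tensor) -> torch.Tensor:
        feat = self.encoder(camera_image)            # (B, 64)
        wp = self.head(feat)                         # (B, num_waypoints * 2)
        return wp.view(-1, self.num_waypoints, 2)    # (B, num_waypoints, 2)

if __name__ == "__main__":
    model = NaiveEndToEndPlanner()
    traj = model(torch.randn(1, 3, 224, 224))
    print(traj.shape)  # torch.Size([1, 6, 2])
```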
On this foundation, modular end-to-end concepts have also emerged, such as UniAD, which improve performance by introducing supervision on relevant intermediate tasks in addition to the final control signals or waypoints. Setting this narrow definition aside, however, the essence of end-to-end should be the lossless transmission of sensory information.
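For the modular flavor, the training objective can be pictured as the final planning loss plus supervision on intermediate tasks. The sketch below only illustrates that idea; the task names, placeholder MSE losses, and weights are my assumptions, not UniAD's actual formulation:

```python
# Sketch of a modular end-to-end training objective: intermediate tasks
# (detection, mapping, motion forecasting) are supervised in addition to
# the final planned trajectory. Task names, placeholder losses, and
# weights are illustrative assumptions, not UniAD's actual loss.
import torch
import torch.nn.functional as F

def modular_e2e_loss(outputs: dict, targets: dict) -> torch.Tensor:
    weights = {"det": 1.0, "map": 1.0, "motion": 1.0, "plan": 2.0}
    total = torch.zeros(())
    for task, w in weights.items():
        # Placeholder per-task loss; real systems use task-specific losses
        # (e.g. focal loss for detection, ADE/FDE for forecasting).
        total = total + w * F.mse_loss(outputs[task], targets[task])
    return total

if __name__ == "__main__":
    fake = {k: torch.randn(4, 8) for k in ("det", "map", "motion", "plan")}
    print(modular_e2e_loss(fake, {k: torch.zeros(4, 8) for k in fake}))
```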
Let us first review the interfaces between the perception and PnC modules in non-end-to-end systems. Usually, we detect whitelisted objects (such as cars, people, etc.) and analyze and predict their attributes. We also model the static environment (such as road structure, speed limits, traffic lights, etc.). More capable systems also detect generic obstacles. In short, the information output by perception constitutes an explicit model of a complex driving scene.
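As a rough picture of such an explicit interface, the dataclasses below sketch what perception typically hands to PnC; all field names and types are illustrative assumptions rather than any specific stack's schema:

```python
# Sketch of an explicit perception -> PnC interface.
# Field names and types are illustrative assumptions only.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class DetectedObject:
    category: str                                   # whitelisted class: "car", "pedestrian", ...
    position: Tuple[float, float]                   # ego-frame x, y (m)
    velocity: Tuple[float, float]                   # m/s
    predicted_trajectory: List[Tuple[float, float]] = field(default_factory=list)

@dataclass
class StaticEnvironment:
    lane_centerlines: List[List[Tuple[float, float]]] = field(default_factory=list)
    speed_limit_mps: float = 0.0
    traffic_light_state: str = "unknown"            # "red" / "yellow" / "green" / "unknown"

@dataclass
class SceneModel:
    objects: List[DetectedObject] = field(default_factory=list)
    static_env: StaticEnvironment = field(default_factory=StaticEnvironment)
    generic_obstacles: List[Tuple[float, float]] = field(default_factory=list)  # occupancy points
```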
However, for some scenarios, this explicit abstraction clearly cannot fully describe every factor that affects driving behavior, or the tasks we would need to define become too fragmented and are difficult to enumerate exhaustively. End-to-end systems therefore provide a (perhaps implicit) comprehensive representation, in the hope that this information can act on PnC automatically and losslessly. In my opinion, any system that meets this requirement can be called end-to-end in the broad sense.
As for other issues, such as optimizing behavior in dynamic interaction scenarios, end-to-end is by no means the only way to solve them, and it may not be the best one. Traditional methods can solve these problems too, although when the amount of data is large enough, end-to-end may offer a better solution.
Some misunderstandings about end-to-end autonomous driving
1. To be end-to-end, a system must directly output control signals or waypoints.
If you agree with the broad end-to-end concept discussed above, this one is easy to resolve. End-to-end should emphasize the lossless transmission of information, not the insistence on directly outputting the final task quantities. A narrowly-defined end-to-end approach creates a lot of unnecessary trouble and requires many fallback mechanisms to ensure safety.
2. An end-to-end system must be based on large models or pure vision.
End-to-end autonomous driving, large-model autonomous driving, and purely vision-based autonomous driving are completely independent concepts with no necessary connection between them; an end-to-end system is not necessarily driven by large models, nor is it necessarily driven by pure vision.
3. In the long run, can the narrowly-defined end-to-end system described above achieve autonomous driving at L3 and above?
The performance of what is currently called pure end-to-end FSD is far from the reliability and stability required at the L3 level. To put it bluntly, if a self-driving system wants to be accepted by the public, the key is whether the public can accept that, in some cases, the machine makes mistakes that a human could easily avoid. Ruling out such mistakes is harder for a pure end-to-end system.
For example, both Waymo and Cruise in North America have had many accidents. Cruise's last accident resulted in injuries; the initial collision itself was arguably hard to avoid and would be forgivable for a human driver. After the collision, however, the system misjudged the location of the accident and of the injured person and downgraded to a pull-over mode, dragging the injured person for a considerable distance. No normal human driver would behave this way, and the consequences were very bad.
Furthermore, this is a wake-up call that we should carefully consider how to avoid this situation during the development and operation of autonomous driving systems.
4. At this moment, what are practical solutions for the next generation of mass-produced assisted driving systems?
According to my current understanding, most so-called end-to-end models used in driving, after outputting a trajectory, still fall back on a solution based on traditional methods. Alternatively, a learning-based planner and traditional trajectory planning algorithms output multiple candidate trajectories in parallel, and a selector then picks one of them, as sketched below.
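A toy version of that "multiple candidates + selector" cascade might look like the following; the cost terms and weights are illustrative assumptions, not any production selector:

```python
# Sketch of a candidate-trajectory selector: a learned planner and a
# rule-based planner each propose trajectories, and a scorer picks one.
# Cost terms and weights are illustrative assumptions.
import math
from typing import List, Sequence, Tuple

Trajectory = List[Tuple[float, float]]   # ego-frame (x, y) waypoints
Obstacles = List[Tuple[float, float]]

def trajectory_cost(traj: Trajectory, obstacles: Obstacles) -> float:
    # Collision proxy: inverse distance to the nearest obstacle point.
    min_dist = min((math.dist(p, o) for p in traj for o in obstacles), default=1e9)
    collision = 1.0 / max(min_dist, 0.1)
    # Comfort proxy: accumulated heading change between segments.
    headings = [math.atan2(b[1] - a[1], b[0] - a[0]) for a, b in zip(traj, traj[1:])]
    comfort = sum(abs(h2 - h1) for h1, h2 in zip(headings, headings[1:]))
    # Progress proxy: distance travelled along x.
    progress = traj[-1][0] - traj[0][0]
    return 10.0 * collision + 1.0 * comfort - 1.0 * progress

def select_trajectory(candidates: Sequence[Trajectory], obstacles: Obstacles) -> Trajectory:
    return min(candidates, key=lambda t: trajectory_cost(t, obstacles))

if __name__ == "__main__":
    learned = [(i * 1.0, 0.1 * i) for i in range(6)]     # from a learned planner
    rule_based = [(i * 1.0, 0.0) for i in range(6)]      # from a traditional planner
    print(select_trajectory([learned, rule_based], obstacles=[(3.0, 0.15)]))
```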
This kind of fallback-and-select scheme caps the performance of the cascaded system if this architecture is adopted. And if the fallback is itself still purely learning-based, unpredictable failures will occur and the goal of being safe will not be achieved at all.
If we re-optimize or re-select the output trajectory with traditional planning methods, the trajectory produced by the learning-driven method is effectively just an initial guess; in that case, why not directly optimize and search for the trajectory ourselves?
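In other words, the learned trajectory becomes nothing more than a warm start. A minimal sketch of that pattern, assuming NumPy and SciPy and with purely illustrative cost terms, could look like this:

```python
# Sketch of using a learned trajectory only as a warm start for a classical
# optimizer. Cost terms and weights are illustrative assumptions; real
# planners enforce dynamics and constraints, not just smoothness.
import numpy as np
from scipy.optimize import minimize

def plan_cost(flat_traj: np.ndarray, obstacles: np.ndarray, ref: np.ndarray) -> float:
    traj = flat_traj.reshape(-1, 2)
    smoothness = np.sum(np.diff(traj, n=2, axis=0) ** 2)             # penalize jerkiness
    clearance = np.sum(1.0 / (np.linalg.norm(traj[:, None] - obstacles[None], axis=-1) + 0.1))
    tracking = np.sum((traj - ref) ** 2)                             # stay near the proposal
    return 10.0 * smoothness + 5.0 * clearance + 0.1 * tracking

def refine(learned_traj: np.ndarray, obstacles: np.ndarray) -> np.ndarray:
    # The learned output is only the initial guess; the optimizer does the rest.
    res = minimize(plan_cost, learned_traj.ravel(), args=(obstacles, learned_traj),
                   method="L-BFGS-B")
    return res.x.reshape(-1, 2)

if __name__ == "__main__":
    learned = np.stack([np.linspace(0, 10, 8), 0.2 * np.random.randn(8)], axis=1)
    print(refine(learned, obstacles=np.array([[5.0, 0.5]])))
```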
Of course, some will say that such an optimization or search problem is non-convex, has a huge state space, and cannot run in real time on an in-vehicle system. I implore everyone to consider this question carefully: over the past ten years, the perception system has enjoyed at least a hundredfold computing-power dividend, but what about our PnC module?
If we also allow the PnC module to use large amounts of computing power, combined with recent advances in advanced optimization algorithms, is this conclusion still valid? For questions like this, we should reason about what is correct from first principles.
5. How should we reconcile data-driven and traditional methods?
Playing chess is a task very similar to autonomous driving. In February of this year, DeepMind published a paper called "Grandmaster-Level Chess Without Search", which studies whether it is feasible to rely purely on data and abandon the MCTS search used in AlphaGo and AlphaZero. As in narrowly-defined end-to-end autonomous driving, a single network directly outputs actions, and all subsequent search steps are discarded.
The paper concludes that with a sufficiently large amount of data and model parameters, fairly reasonable results can be obtained without search, but there is still a significant gap compared with methods that use search, and search is especially helpful in complex endgames.
By analogy, for complex scenarios or corner cases that involve multi-step interaction, it is still difficult to completely abandon traditional optimization or search algorithms. Combining the strengths of different techniques, as AlphaZero does, is the better way to improve performance.
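Transplanted to driving, the same idea (a learned prior ranking options while a shallow search looks several steps ahead before committing) might be sketched as follows; the maneuver set, stubbed policy, and rollout cost are all illustrative assumptions:

```python
# Toy sketch of combining a learned prior with a shallow search over
# discrete maneuvers. The stubbed policy and rollout cost stand in for a
# real network and a real interaction simulation; all of it is illustrative.
from itertools import product
from typing import Dict, Tuple

MANEUVERS = ["keep_lane", "yield", "lane_change_left", "lane_change_right"]

def learned_prior(state: Tuple[float, float]) -> Dict[str, float]:
    # Stand-in for a policy network; returns a preference per maneuver.
    return {m: 1.0 / (i + 1) for i, m in enumerate(MANEUVERS)}

def rollout_cost(state: Tuple[float, float], plan: Tuple[str, ...]) -> float:
    # Stand-in for simulating the plan against predicted agents.
    return sum(0.0 if m == "keep_lane" else 0.5 for m in plan)

def search_best_maneuver(state: Tuple[float, float], depth: int = 2) -> str:
    prior = learned_prior(state)
    best_plan = min(product(MANEUVERS, repeat=depth),
                    key=lambda plan: rollout_cost(state, plan) - prior[plan[0]])
    return best_plan[0]

if __name__ == "__main__":
    print(search_best_maneuver(state=(0.0, 0.0)))
```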
6. Traditional methods = rule-based if-else?
This is a notion I have had to correct over and over again in conversations. Many people believe that anything that is not purely data-driven must be rule-based. Take chess as an example: memorizing opening books and game records by rote is rule-based, but systems like AlphaGo and AlphaZero, which acquire the ability to reason through optimization and search, should not, in my view, be called rule-based.
This kind of reasoning ability is precisely what current large models lack, and researchers are trying to endow learning-driven models with it through methods such as CoT. However, unlike image recognition, which has to be purely data-driven and whose underlying reasons are hard to explain, every action a human takes while driving has a clear motivation behind it.
With an appropriate algorithm and architecture design, decision-making and trajectories should become variables that are jointly optimized under well-defined objectives, rather than being forcibly patched and parameter-tuned case by case. Such a system naturally avoids all kinds of strange hard-coded rules.
Conclusion
In short, end-to-end may well be a promising technical route, but how the concept is applied still needs much more exploration. Piling up data and model parameters is not the only correct path, and if we want to surpass others, we have to keep working hard.
Post time: Apr-24-2024