Figuring out Research
I’ve spent much of the past year and a half trying to figure out what kind of research I want to do. I still haven’t figured it out. Andrew Wiles once described Math as a process of bumbling around in a dark mansion. You start off in an unlit room and have to feel your way around its contours. You might discover a light switch, or you might discover that the light switch is actually a rat’s tail. Eventually, with some luck, you might find a flashlight, or you might have felt around enough to develop a mental layout of the room. On to the next one. You might be able to take some tools with you, but it’s hard to say in advance.
Wiles isn’t wrong about Math. I think Abel Prize winners tend to know what kind of Math they’re talking about. I don’t have an Abel Prize and I don’t do Math research anymore, but bumbling around in a dark mansion is how I’d describe my current research process. I don’t really have any idea what I’m doing a lot of the time. I think many graduate students feel this way, however terrifying that is. Yet it’s still strange to feel so lost after having been in school for so long.
I can guess at a couple of reasons why my confusion persists. First, the vast majority of my education did not teach me how to ask questions. School was always about answering questions correctly and quickly. That job isn’t always easy: analysis and synthesis in an essay are difficult, and understanding the question can sometimes be the biggest barrier. The questions might also get harder over time: once what you do breaches the unknown, nobody knows the answer before you do!
But the skills for answering questions are different from the ones for asking good questions. Nobody grades you for asking good questions. I scarcely remember being told what a good question even was. I’ve doubtless seen many good questions in my lifetime, but because I was never marked on creating them, seeing never became understanding.
What makes a good question? As far as I can gather at the moment, these are some criteria:
- Clarity: if you can’t understand a question, ask it more clearly
- Tractability: if you can’t answer a question, ask a different one
- Precision: if your question tackles too many things, isolate the thing that matters most
- Relevance: if your question isn’t tackling what you care about, move on
- Usefulness: if your question doesn’t (even partially) address the problem, try another question
My research has felt like throwing darts in the dark mansion. In my case, the dark mansion is reducing existential risk from advanced artificial intelligence. AIs that do not understand or act upon our values faithfully, that is, misaligned AIs, would pursue their own goals. However hard we currently try to specify our values for AIs, there is empirical evidence suggesting that misalignment is the “default”: unless we figure things out, the AIs we create will end up with goals different from ours. Conflict between AIs and humans is therefore likely.
The capabilities of AIs are also improving dramatically. There is no reason to believe that human capabilities are an upper bound on the capabilities of AIs. Indeed, AIs are already superhuman in narrow domains, like the game of Go. AIs would also plausibly learn much faster than humans can, as we have already observed in some domains. Thus, in cases of conflict between human goals and AI goals, sufficiently advanced AIs would almost certainly win. AIs do not have commonsense morality by default; they would have no qualms about razing half the globe, and the humans with it, for industrial land. Losing control of our collective future in this way would be an existential catastrophe.
Which questions should I ask to solve AI alignment? I’ve been trying to understand the dark mansion of AI alignment through different research areas in machine learning. Last year, I thought a lot about performativity: when we deploy an AI, how do we account for its impact on the world, and for how that impact feeds back into the AI’s future learning? Do we get dangerous feedback loops? Performativity is relevant to AI alignment because the danger comes from AIs that interact with the world to change it, but the field of performative prediction in machine learning does not have much else to say. The questions aren’t tractable or relevant enough, and we still need to understand what is dangerous.
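To make the feedback loop concrete, here is a minimal toy sketch of performative prediction, assuming a made-up one-parameter model and a made-up response strength `eps` (my own illustration, not taken from any paper I worked on): the data the model is retrained on depends on the model we last deployed.

```python
import numpy as np

# Toy sketch of a performative feedback loop: the data the model sees next
# round depends on the model deployed this round. The setup (one-parameter
# linear model, response strength eps) is a hypothetical illustration.

rng = np.random.default_rng(0)

def sample_data(theta, n=1_000, eps=0.5):
    """The world responds to the deployed prediction rule: the outcome y
    shifts in proportion to the deployed coefficient theta."""
    x = rng.normal(size=n)
    y = (1.0 + eps * theta) * x + rng.normal(scale=0.1, size=n)
    return x, y

def fit(x, y):
    """Ordinary least squares for a single coefficient, no intercept."""
    return float(np.dot(x, y) / np.dot(x, x))

# Repeated retraining: deploy, let the world respond, refit.
theta = 0.0
for t in range(10):
    x, y = sample_data(theta)
    theta = fit(x, y)
    print(f"round {t}: theta = {theta:.3f}")

# With eps < 1 the loop settles at a "performatively stable" point
# (here roughly 1 / (1 - eps) = 2); with eps >= 1 each round of
# retraining amplifies the last one and theta blows up -- a toy version
# of the dangerous feedback loops mentioned above.
```

Whether the loop settles or blows up depends entirely on how strongly the world reacts to the deployed model, which is exactly the kind of quantity that is hard to know in advance for a powerful AI.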
In the process of looking into performative prediction, I also started looking into multi-agent systems, specifically game theory. Game theory seems relevant to AI alignment at first glance because of strategic considerations: how do we incentivize AIs to act in our interests, for example by being honest or by cooperating with us? I wrote a paper on incentivizing forecasters to be honest when forecasts can change the world (e.g., the Federal Reserve’s inflation forecast actually affects inflation, because inflation depends on consumer expectations). I learned a lot in the process of writing it, but I don’t think the work addresses the core alignment issue, because it felt disconnected from how we will actually use, and be able to restrain, AIs.
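To make the incentive problem concrete, here is a toy sketch (my own illustration, not the model from the paper), assuming a hypothetical self-fulfilling event whose probability rises with the published forecast: under a standard Brier score, the report that maximizes the forecaster’s expected score is no longer the self-consistent, honest one.

```python
import numpy as np

# Toy sketch: when the forecast itself shifts the outcome probability,
# a standard proper scoring rule (here the Brier score) can stop
# rewarding honesty. The response function below is hypothetical.

def outcome_prob(report: float) -> float:
    """Self-fulfilling outcome: the event becomes more likely the higher
    the published forecast (think expected inflation driving inflation)."""
    return 0.2 + 0.4 * report

def expected_brier(report: float) -> float:
    """Expected (negated) Brier score when the outcome depends on the report."""
    q = outcome_prob(report)
    return -(q * (report - 1) ** 2 + (1 - q) * report ** 2)

reports = np.linspace(0.0, 1.0, 1001)
scores = [expected_brier(p) for p in reports]
best = reports[int(np.argmax(scores))]

# The self-consistent ("honest") report is the p with outcome_prob(p) == p.
honest = 0.2 / (1 - 0.4)

print(f"score-maximizing report: {best:.2f}")   # ~0.00
print(f"self-consistent report:  {honest:.2f}")  # ~0.33
```

In this toy setup, the score-maximizing forecaster deliberately understates the event’s probability, so the usual guarantee that proper scoring rules elicit honest beliefs breaks down once forecasts are performative.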
I’ve recently been working on aligning language models. Working with AIs that already exist, rather than with abstract mathematical models of AIs, grounds my work. It also helps that language will probably be an important way in which we interact with AIs in the future. It has become easier to ask when misalignment might occur and how to prevent it. I’m currently asking: when might language models mislead us in answering our questions? How do we measure this misleadingness, and what can we do to prevent it? I’m also planning to investigate whether general social understanding can help language models better understand human values and cooperate with us.
At the same time, I think I might soon return to multi-agent systems because of a recent post on how multiple AIs could cause an existential catastrophe.
I might drop these areas of research; I might discover areas completely out of left field. I used to be more afraid of this process because I felt like I should be doing impactful work immediately. I would of course like to, but I still have a lot to learn. A PhD is the beginning of a research career, not its culmination. I’ve grown more accepting of how many ideas I pick up and discard. I’ll understand more of the dark mansion in time.