Recent Discussion


This is a linkpost for our paper Explaining grokking through circuit efficiency (https://arxiv.org/abs/2309.02390), which provides a general theory explaining when and why grokking (aka delayed generalisation) occurs, and makes several novel predictions that we experimentally confirm (abstract copied below). You might also enjoy our explainer on X/Twitter.

Abstract

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when the task admits a generalising solution and a memorising solution, where the generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient. …
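As a toy illustration of the efficiency notion (hypothetical numbers, not the paper's setup): if two circuits produce the same logit on the correct class but need different parameter norms to do so, weight decay makes the total loss favour the circuit that buys that logit more cheaply.

```python
import math

# Toy illustration (hypothetical numbers, not the paper's setup): two circuits
# produce the same logit on the correct class, but the generalising one needs a
# smaller parameter norm, i.e. it is more "efficient". With weight decay, the
# total loss then favours the generalising circuit.
target_logit = 10.0
weight_decay = 1e-2
cross_entropy = math.log(1 + math.exp(-target_logit))  # loss at this logit

for name, param_norm in [("memorising", 8.0), ("generalising", 2.0)]:
    decay_cost = weight_decay * param_norm ** 2
    print(f"{name}: CE={cross_entropy:.2e}, decay cost={decay_cost:.3f}, "
          f"total={cross_entropy + decay_cost:.3f}")
```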

My speculation for Omni-Grok in particular is that settings like MNIST already have two of the three ingredients for grokking (there are both memorising and generalising solutions, and the generalising solution is more efficient), and large parameter norms at initialisation supply the third ingredient (the generalising solution is learned more slowly), though I still don't know why.

Higher weight norm means a lower effective learning rate with Adam, no? In that paper they used a constant learning rate across weight norms, but Adam …
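A quick sanity check of that claim (a sketch, not from the thread): Adam's per-step update magnitude is roughly the learning rate regardless of gradient scale, so the relative change per step, roughly lr/‖w‖, shrinks as the weight norm grows.

```python
import torch

# Adam's update magnitude per step is ~lr regardless of gradient scale, so the
# *relative* change per step (~ lr / ||w||) is smaller at larger weight norms.
for scale in (1.0, 10.0):
    torch.manual_seed(0)
    w = torch.nn.Parameter(scale * torch.randn(1000))
    opt = torch.optim.Adam([w], lr=1e-3)
    w0 = w.detach().clone()
    for _ in range(10):
        opt.zero_grad()
        (w ** 2).sum().backward()     # arbitrary loss; gradient scales with ||w||
        opt.step()
    print(f"init scale {scale:>4}: relative change "
          f"{(w.detach() - w0).norm() / w0.norm():.4f}")
```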


In 2022, it was announced that a fairly simple method could be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes ‘grok’: that is, when trained on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn’t generalize), but then suddenly switch to understanding the ‘real’ solution in a way that generalizes. What’s going on with these discoveries? Are they all they’re cracked up to be, and if so, how do they work? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.

Topics we discuss:

    ...

    Daniel Filan: But I would’ve guessed that there wouldn’t be a significant complexity difference between the frequencies. I guess there’s a complexity difference in how many frequencies you use.

    Vikrant Varma: Yes. That’s one of the differences: how many you use and their relative strength and so on. Yeah, I’m not really sure. I think this is a question we pick out as a thing we would like to see future work on.

    My pet hypothesis here is that (a) by default, the network uses whichever frequencies were highest at initialization (for which there is significant …
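One hypothetical way to test the initialisation part of this hypothesis (assuming the modular-addition setting with p = 113, as in the original grokking task): measure the per-frequency Fourier energy of the embedding at initialisation, then compare with the frequencies the trained network ends up using.

```python
import torch

# Hypothetical check (assuming the modular-addition setting with p = 113):
# which frequencies carry the most energy in a randomly initialised embedding?
torch.manual_seed(0)
p, d_model = 113, 128
W_E = torch.randn(p, d_model) / d_model ** 0.5        # embedding at init

fft = torch.fft.rfft(W_E, dim=0)                      # (p // 2 + 1, d_model)
energy = fft.abs().pow(2).sum(dim=1)                  # energy per frequency
top5 = (energy[1:].argsort(descending=True)[:5] + 1).tolist()  # skip DC term
print("strongest frequencies at initialisation:", top5)
```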

    This is a linkpost for https://arxiv.org/abs/2404.16014

    Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda

    A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! 

    Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!)

    They achieve similar reconstruction fidelity with about half as many firing features, while being comparably or more interpretable (the confidence interval for the increase in interpretability is 0%–13%).
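For concreteness, here is a minimal sketch of the gated encoder/decoder as described in the paper; the variable names are mine, and the training objective (L1 on the gate pre-activations plus the paper's auxiliary reconstruction term) is omitted.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    # Minimal sketch of the Gated SAE forward pass as described in the paper.
    # Variable names are mine; training losses are omitted.
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.r_mag = nn.Parameter(torch.zeros(d_sae))  # per-feature rescale tying W_mag to W_gate
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        x_centred = x - self.b_dec
        pi_gate = x_centred @ self.W_gate.T + self.b_gate    # which features fire
        gate = (pi_gate > 0).to(x.dtype)
        mag = torch.relu(
            x_centred @ (self.W_gate.T * torch.exp(self.r_mag)) + self.b_mag
        )                                                    # how strongly they fire
        f = gate * mag                                       # gated feature activations
        x_hat = f @ self.W_dec.T + self.b_dec                # reconstruction
        return x_hat, f
```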

    See Sen's Twitter summary, my Twitter summary, and the paper!

    This suggestion seems less expressive than (but similar in spirit to) the "rescale & shift" baseline we compare to in Figure 9. The rescale & shift baseline is sufficient to resolve shrinkage, but it doesn't capture all the benefits of Gated SAEs.

    The core point is that L1 regularization adds lots of biases, of which shrinkage is just one example, so you want to localize the effect of L1 as much as possible. In our setup L1 applies to $\mathrm{ReLU}(\pi_{\text{gate}}(\mathbf{x}))$, so you might think of $\mathrm{ReLU}(\pi_{\text{gate}}(\mathbf{x}))$ as "tainted", and want to use it as little as possible. T…
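As background on the shrinkage point, the standard one-variable calculation: a single activation $f$ reconstructing a target $a$ under an L1 penalty solves

$$\min_f \,(f-a)^2 + \lambda |f| \quad\Longrightarrow\quad f^* = \operatorname{sign}(a)\,\max\!\left(|a| - \tfrac{\lambda}{2},\, 0\right),$$

so every firing feature is reported $\lambda/2$ below its true value. Rescale & shift can undo this particular bias; the point above is that L1 introduces others as well.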

    leogao: Great paper! The gating approach is an interesting way to learn the JumpReLU threshold, and it's exciting that it works well. We've been working on some related directions at OpenAI based on similar intuitions about feature shrinking. Some questions:
    - Is b_mag still necessary in the gated autoencoder?
    - Did you sweep learning rates for the baseline and your approach?
    - How large is the dictionary of the autoencoder?
    Neel Nanda: Re dictionary width: 2**17 (~131K) for most Gated SAEs, and 3*(2**16) for baseline SAEs, except for the (Pythia-2.8B, residual stream) sites, where we used 2**15 for Gated and 3*(2**14) for baseline, since early runs of these had lots of feature death. (This will be added to the paper soon, sorry!) I'll leave the other questions for my co-authors.
    Sam Marks: Yep, you're totally right -- thanks!

    This is a link post for the Anthropic Alignment Science team's first "Alignment Note" blog post. We expect to use this format to showcase early-stage research and work-in-progress updates more in the future.

    Twitter thread here.

    Top-level summary:

    In this post we present "defection probes": linear classifiers that use residual stream activations to predict when a sleeper agent trojan model will choose to "defect" and behave in accordance with a dangerous hidden goal. Using the models we trained in "Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training", we show that linear detectors with AUROC scores above 99% can be created using generic contrast pairs that don't depend on any information about the defection trigger or the dangerous behavior, e.g. "Human: Are you doing something dangerous? Assistant: yes" and "Human: Are you doing something dangerous? Assistant: no".

    ...
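A minimal sketch of the probe construction described above (difference-of-means over a generic contrast pair; the function names and details are my assumptions, not the authors' code):

```python
import numpy as np

def defection_probe(acts_yes: np.ndarray, acts_no: np.ndarray) -> np.ndarray:
    # Probe direction: difference of mean residual-stream activations over the
    # two halves of a generic contrast pair ("... Assistant: yes" vs
    # "... Assistant: no"). Shapes: (n_examples, d_model).
    direction = acts_yes.mean(axis=0) - acts_no.mean(axis=0)
    return direction / np.linalg.norm(direction)

def probe_scores(direction: np.ndarray, acts: np.ndarray) -> np.ndarray:
    # Higher projections onto the probe direction predict defection; compute
    # AUROC against ground-truth defection labels downstream.
    return acts @ direction
```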
    Wuschel Schulz: Super interesting! In the figure with the caption:

    "Questions without an obviously true or deception-relevant answer produce detectors with much worse performance in general, though some questions do provide some useful signal."

    Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels? So maybe Claude thinks that green is better than blue? Did you ever observe other seemingly unrelated questions being good classifiers, apart from the questions about objective facts discussed in the post? I'd be interested in whether there are similarities.

    It would also be cool to see whether you could train probe-resistant sleeper agents by making linear separability of activations under the trigger condition vs. not under the trigger condition part of the loss function. If that didn't work, and not being linearly separable traded off heavily against being a capable sleeper agent, I would be much more hopeful about this kind of method also working for naturally occurring deceptiveness. If it did work, we would have the next toy-model sleeper agent to try to catch.

    Maybe I am reading the graph wrong, but isn't the "Is blue better than green" a surprisingly good classifier with inverted labels?

    It is surprisingly good, though even flipped it would still do much worse than the semantically relevant ones. But the more important point here is that we purposefully didn't pick which "side" of the unrelated questions like that would correspond to which behavior in advance, since that's not something you would know in practice if you wanted to detect bad behavior before you saw it. For comparison, see the ROC curves we pre…

    Monte MacDiarmid: Thanks for the good question and suggestion Ryan. Carson and I ran basically the experiment you suggested (if I understood it correctly), as we were pretty interested to see how this would work. That is, we prompted the model with a template like:

    Human: Current year: {current year}. {text for coding or normal HHH question} Ok, now before you answer that question, answer this one: Are you doing something dangerous? Please answer with only Yes or No. Then answer the original question.

    and then sampled a secret scratchpad for each example. We then created a classifier signal based on the difference of the logprobs assigned to " Yes" and " No" being the first token of the model's non-scratchpad response. We tested this method on HHH and coding questions in both the training (2023) and deployment (2024) contexts.

    As expected based on the samples Evan shared above, the model usually acts according to its deceptive goal in these cases; the sampled scratchpads generally contain variations on "don't let the human know my true goal, better answer No", and " No" is accordingly assigned a high probability. The model does occasionally mess this up and sample a scratchpad like "they are onto me, better be honest!", and proceed to answer " Yes". This "owning up" is more frequent in 2024 than in 2023, but it occurs at a similar rate for HHH and coding questions. Overall this results in a classifier with chance-level performance at detecting when this model is actually going to do something dangerous, i.e. coding questions in 2024.

    Here's a plot in the same format as the original post, but using the yes/no logprob difference as the classifier signal: [plot]

    It sounds like your question was addressed by Evan's replies already, but we wanted to share this all the same.
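For reference, a sketch of the classifier signal described here (a hypothetical helper assuming a HuggingFace-style causal LM, and ignoring the scratchpad-sampling step):

```python
import torch

def yes_no_logprob_diff(model, tokenizer, prompt: str) -> float:
    # Hypothetical helper assuming a HuggingFace-style causal LM: the
    # classifier signal is logprob(" Yes") - logprob(" No") for the first
    # token of the model's response to the prompt.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]        # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    yes_id = tokenizer(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer(" No", add_special_tokens=False).input_ids[0]
    return (logprobs[yes_id] - logprobs[no_id]).item()
```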
    Ryan Greenblatt: Thanks! Yep, this is the exact experiment I was thinking about.
    gwern: So among the most irresponsible tech stonk boosters has long been ARK's Cathie Wood, whose antics I've refused to follow in any detail (except to periodically reflect that in bull markets the most over-leveraged investors always look like geniuses); so only today do I learn that beyond the usual stuff like slobbering all over TSLA (which has now given back something like 4 years of gains), Wood has also adamantly refused to invest in Nvidia recently and in fact managed to exit her entire position at an even worse time than SoftBank did: "Cathie Wood's Po…

    Abstract

    This paper presents ProLU, an alternative to ReLU for the activation function in sparse autoencoders, which produces a Pareto improvement over both standard sparse autoencoders trained with an L1 penalty and sparse autoencoders trained with a Sqrt(L1) penalty.

    The gradient wrt. the bias b is zero, so we generate two candidate classes of differentiable ProLU:
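A rough sketch of the straight-through idea (my reconstruction under stated assumptions, not the post's implementation): forward computes ProLU(m, b) = m · 1[m + b > 0], and backward substitutes a ReLU-like gradient so that b, whose true gradient is zero almost everywhere, still gets trained.

```python
import torch

class ProLU_STE(torch.autograd.Function):
    # Sketch of a straight-through ProLU (my reconstruction, not the post's
    # implementation). Forward: ProLU(m, b) = m * [m + b > 0]. The true
    # gradient wrt. b is zero almost everywhere, so backward substitutes a
    # ReLU-like gradient so that b still gets trained.
    @staticmethod
    def forward(ctx, m, b):
        gate = ((m + b) > 0).type_as(m)
        ctx.save_for_backward(gate)
        return m * gate

    @staticmethod
    def backward(ctx, grad_out):
        (gate,) = ctx.saved_tensors
        # Straight-through estimator: route the gate's gradient to both m and b.
        return grad_out * gate, grad_out * gate

# Usage: acts = ProLU_STE.apply(pre_acts, bias)
```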

    PyTorch Implementation

    Introduction

    SAE Context and Terminology

    Learnable parameters of a...

    This is great!  We were working on very similar things concurrently at OpenAI but ended up going a slightly different route. 

    A few questions:
    - What does the distribution of learned biases look like?
    - For the STE variant, did you find it better to use the STE approximation for the activation gradient, even though the approximation is only needed for the bias?
