Data were analysed using MATLAB software (MathWorks). All error bars denote s.e.m.

…action-value learning accounts, in which DA reward signals should promote the emergence of… Video was post-processed with custom MATLAB code available on request. We measured mDA activity in the above mice, which were DAT-Cre::ai32 mice that transgenically expressed ChR2 under control of the dopamine transporter promoter, by injecting a Cre-dependent jRCaMP1b virus across the ventral midbrain9 (Fig. …). Following at least 4 days of recovery from headcap implantation surgery, animals' water consumption was restricted to 1.2 ml per day for at least 3 days before training. Second, under a wide set of conditions policy learning is the most parsimonious reinforcement learning model that explains learned behaviour5. ACTR's predicted phasic mDA photometry signal corresponds closely to experimentally measured NAc DA and VTA DA activity across training (Fig. …). We thus performed this experiment in mice, selectively increasing dopamine reward signals through optogenetic stimulation in the VTA contingent on preparatory cued behaviour. Second, while optimal RNNs are relatively indifferent to some parameters (sparsity of connectivity), they tend to require a strong coupling coefficient, which is known to determine the capacity of an RNN to develop sustained dynamics86; thus optimal policies were observed uniquely in RNNs with large leading eigenvalues (that is, long time constant dynamics87). Intertrial intervals were chosen from a randomly permuted exponential distribution with a mean of about 25 s. Ambient room noise was 50–55 dB; an audible click of about 53 dB accompanied solenoid opening on water delivery, and the predictive tone was about 65 dB. For all constants, a range of values was tried with qualitatively similar results. Learning trajectories and cost surfaces calculated from a range of model initializations compared well qualitatively and quantitatively to those inferred from mouse data (Fig. …). There are two key classes of weight changes governed by distinct learning rules within the ACTR model. The significantly different time courses of preparatory and reactive learning (Fig. …)… A time-division multiplexing strategy was used in which LEDs were controlled at a frequency of 100 Hz (1 ms on, 10 ms off), offset from each other to avoid crosstalk between channels. An analogous method was used to reduce the dimensionality of reactive variables down to a single reactive dimension that captures most variance in reactive behavioural variables across animals (final reactive behav., Extended Data Fig. …). We also computed the Akaike information criterion68 (the sum of ln(sum(residuals²))), as preferred in some previous work69.

c) Cued licking was correlated with the size of NAc DA cue responses across animals (Pearson's r = 0.78, P = 0.01), even though manipulations did not support a causal relationship.
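The intertrial-interval procedure described above can be illustrated with a short MATLAB sketch. Everything here other than the ~25 s mean is an assumption for illustration (the trial count, the fixed-quantile construction, the absence of truncation), not the published implementation:

nTrials = 200;                              % trials per session (assumed)
meanITI = 25;                               % mean intertrial interval in seconds (from the text)
q       = ((1:nTrials) - 0.5) / nTrials;    % evenly spaced quantiles
itis    = -meanITI * log(1 - q);            % inverse exponential CDF gives exponentially distributed values
itis    = itis(randperm(nTrials));          % 'randomly permuted': shuffle the fixed set across trials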
This suggests that policy learning, and specifically the reactive component in our ACTR model, may be a useful way to model the acquisition of incentive salience41 (although not its expression, as phasic dopamine signals could be shown to modulate only learning, and not apparent incentive salience on the current trial (Extended Data Fig. 4a–d)). Excitation-specific signals were recovered in post-processing by only keeping data from each channel when its LED output power was high. Recent success in training artificial agents and robots derives from a combination of direct learning of behavioural policies and indirect learning through value functions1-3. Representative curves of these simulations are shown in Fig. …. Consistent with direct policy updates by a PE related to reward collection latency, our observations showed that updates to both the reactive and preparatory behaviour on each current trial were significantly related to reward collection latency on the previous trial (Fig. …). This preserved local slow signal changes while correcting for photobleaching. The first principal component of this matrix was calculated, and loading onto PC1 was defined as a measure of an inferred underlying preparatory component of the behavioural policy. The experimenter was not blinded to group identity during data collection. …this belied heterogeneity in the dynamics of learning across individuals (Extended Data Fig. …). …was well supported by comparing end performance of the full model to modified versions (Fig. …).

f) Comparison of median parameterized policy and value models for each mouse, quantified by −log likelihood (left) or Akaike information criterion per trial (right). No stimulation controls (white, n=9), stimLick− (green, n=6), stimLick+ (dark purple, n=5), stim+Lick+ (light purple, n=4). This suggests that dopamine inhibition observed following omission of expected rewards may depend on concurrent control9,13 or evaluation17 of action. A close examination of the learning signals on lick− versus lick+ trials indicates that those trial types capture different distributions of PEs, as estimated from reward collection latency and anticipatory licking (Methods). The discovery that the phasic activity of mDA neurons in several species correlated with a key quantity (RPE) in value learning algorithms has been a marked and important advance, suggesting that the brain may implement analogous processes4,7.

j) Preparatory cued licking for simulations with low (cyan) and high (magenta) initial reward-related sensory input.
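The reduction of the preparatory behavioural variables to a single dimension via PC1 loading, described above, can be sketched in MATLAB as follows. The variable names and the z-scoring step are assumptions for illustration; the reactive dimension is computed analogously from the reactive variables:

% prepVars: trials x variables matrix of preparatory measures (licking, whisking,
% body movement, pupil diameter), concatenated across animals (assumed layout)
prepZ      = zscore(prepVars);     % put variables on a common scale (assumed preprocessing)
[~, score] = pca(prepZ);           % principal component analysis
prepDim    = score(:, 1);          % loading onto PC1 = inferred preparatory component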
a) Dopamine responses measured through the same fibre for the low (black), medium (dark red) and high (bright red) stim parameters indicated at right, inserted either in the mPFC or at the indicated depth in the striatum. By contrast, if the policy is as bad as possible, trials are lick− and stochastic initiation of the lick plant right after reward delivery, or stochastic fluctuations in the underlying policy (even without inducing a pre-reward lick), can only improve reward collection relative to the expected latency of a poor policy (that is, resulting in a bias towards positive PEs). This is surprising given that results in rodents43,44 and monkeys45 provide specific evidence for value learning effects following exogenous VTA DA stimulation.

a) To test for a causal connection between the size of mesolimbic dopamine cue responses and cued behaviour, in a new session after regular training was complete, we delivered large, uncalibrated VTA DA stimulation on a random subset of cued reward trials (light green). The RNN was constructed largely as described in ref. …. We compared acquisition learning in the complete ACTR model to observed mouse behaviour using a variety of approaches. Imaging began >20 days post-injection using custom-built fibre photometry systems (Fig. …). eij, eligibility trace for node perturbation at the synapse between the ith and jth neurons. Intertrial intervals were chosen from randomly permuted exponential distributions with a mean of about 13 s. Supplementary Table 1 lists the experimental groups each mouse was assigned to, in the order in which experiments were experienced. The reverse transition rate was a constant that depended on the presence of reward (5 × 10−3 ms−1 without reward, 5 × 10−1 ms−1 with reward). As noted in the main text, the RNN component of the model and the learning rules used for training drew on inspiration from ref. ….

a, Left: experimental design for VTA DA stimulation (stim.). …by 2 where noted, and 𝒮 is a sigmoid function mapping inputs from {0,10} to {0,3} with parameters 1.25 and 7 (for example, line 259 in dlRNN-train_learnDA.m). As noted in the description of the behavioural data in Fig. …. To test this possibility, we repeated the stimLick+ experiment with a new set of mice, but this time augmented rewards on lick+ trials with large, uncalibrated VTA DA stimulation (500 ms at 30 Hz and about 10 mW power; Fig. …). For large stimulations, steady-state laser output was set to 10 mW. …(Fig. 3j), mirroring our results in mice and demonstrating that stronger reactive responses to sensory information can impair the learned development of preparatory responses (and thus ultimately impair performance). Mice were randomly assigned to a stimulation group (control, stimLick−, stimLick+) before training. A multidimensional dataset of behavioural changes during acquisition could be seen to drive improvements in reward collection performance, and a novel policy learning model quantitatively accounted for the diverse learned behaviour of individual animals.

Large mesolimbic dopamine manipulations drive value-like learning

…in which T is the trial index. The reverse transition rate reflects a bias towards quiescence that decreases in the presence of water, such that licking is sustained until collection is complete (Methods).
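The sigmoid mapping quoted above can be written out explicitly. A minimal MATLAB sketch, assuming a logistic form in which the two quoted parameters are the slope (1.25) and midpoint (7) and the output is scaled to a maximum of 3; the exact parameterization in dlRNN-train_learnDA.m may differ:

sigm    = @(x, slope, mid, ymax) ymax ./ (1 + exp(-slope .* (x - mid)));  % logistic, scaled to ymax
x       = linspace(0, 10, 101);      % inputs span {0,10}
alphaDA = sigm(x, 1.25, 7, 3);       % outputs approximately span {0,3}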
Two-way ANOVA, **P=0.01, ***P=0.001. c, Predicted cue responses (top) and cued licking (bottom) in a version of the ACTR model altered for the large-amplitude stimulations to bias PE in addition to setting the adaptive learning rate (n=9 simulations).

To assess the relationship between learned behaviours and reward collection performance on an individual basis, we built generalized linear models (GLMs) to predict reward collection latency across training in each mouse (Fig. …). We found a remarkable agreement between policy learning model-based predictions and experimental observations. Although we explored a range of parameters governing RNN construction, many examples of which are shown in Extended Data Fig. …. …(Fig. 1h) that captured the statistics of rodent licking behaviour as a state model that transitions between quiescence and a licking state that emits a physiological lick frequency. We modelled a simple fixed-rate plant with an active lick state that emitted observed licks at a fixed time interval of 150 ms.

Extended Data Fig. 9 Baseline licking across training in all animals.

Such parallel functions could be complementary, intriguingly mirroring the system of parallel policy and value learning networks implemented in AlphaGo50, a landmark achievement in modern artificial intelligence. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. Our data indicate that both strongly enhanced dopamine signalling and oversensitivity to sensory input can bias towards value-like learning that leads an animal to exhibit excessively strong reactive responses to cues at the expense of optimal behaviour. Trial-by-trial reward collection latencies and predictor variables (preparatory licking, whisking, body movement and pupil diameter; and reactive nose motions, whisking and body movement) were median filtered (MATLAB medfilt1(signal,10)) to minimize trial-to-trial variance in favour of variance due to learning across training. n, trial number. b, Fibre paths and virus expression from an example experiment. For each of these three conditions, we ran 9 simulations (3 different initializations, 3 replicates) for 27 total learning simulations (800 trials). …(Fig. 1h), and was proportional to the relative change in PE (customary in many policy learning algorithms21). This led us to develop a biologically plausible network-based formulation of policy learning that is consistent with many aspects of individual behavioural trajectories, but also closely matches observed mDA neuron activity during naive learning. Preparatory learning was modelled as changes to internal weights in the RNN (Wij; Fig. …). Reactive responses in the whisking, nose motion and body movement were measured as the latency to the first response following reward delivery. However, at the end of regular training some mice experienced an extra session in which VTA DA stimulation was triggered on cue presentation on a random subset of trials (Extended Data Fig. …).
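The two-state lick plant summarized above (quiescent and active states, licks emitted at a fixed 150 ms interval, and a reverse transition whose bias towards quiescence weakens once water is present) can be sketched as a small MATLAB simulation. The forward rate, reward time and the specific rate constants below are illustrative placeholders rather than the published values:

dt = 1; T = 6000;                    % 1 ms time steps, 6 s of simulated time (assumed)
pFwd     = 1e-3;                     % quiescent -> licking per ms (policy-dependent; assumed)
pRevDry  = 5e-3;                     % licking -> quiescent per ms, no water present (assumed)
pRevWet  = 5e-5;                     % licking -> quiescent per ms, water present (assumed)
rewardAt = 2000;                     % water delivered at t = 2 s (assumed)
state = 0; lastLick = -inf; lickTimes = [];
for t = dt:dt:T
    pRev = pRevDry; if t >= rewardAt, pRev = pRevWet; end
    if state == 0 && rand < pFwd*dt, state = 1; end      % stochastic initiation of licking
    if state == 1 && rand < pRev*dt, state = 0; end      % stochastic return to quiescence
    if state == 1 && (t - lastLick) >= 150               % emit a lick every 150 ms while active
        lickTimes(end+1) = t; lastLick = t;               %#ok<AGROW>
    end
end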
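The per-mouse GLM predicting reward collection latency from the median-filtered behavioural predictors can likewise be sketched in MATLAB. The variable names, table layout and default linear model with normal errors are assumptions for illustration; only the medfilt1(signal,10) filtering is taken from the text:

% predictors: trials x 7 matrix of behavioural measures for one mouse (assumed layout)
P = array2table(predictors, 'VariableNames', ...
    {'PrepLick','PrepWhisk','PrepBody','PrepPupil','ReactNose','ReactWhisk','ReactBody'});
P{:,:}  = medfilt1(P{:,:}, 10);               % 10-trial median filter on each predictor
latency = medfilt1(latencyRaw, 10);           % same filtering for the response
mdl = fitglm([P table(latency)], 'linear');   % last table variable (latency) is the response by default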
Thus, naive trace conditioning is well described as the optimization of reward collection behaviour, and is best approximated when mDA-like signals act not as signed errors directing changes to the policy, but instead adapt the size of the learned update on each trial.

f) The average change in performance error (PE, the learning signal for preparatory learning in the ACTR model (Fig. 1i)) for each control mouse (n=9) switched sign on lick+ versus lick− trial types (two-tailed signed rank test P<0.001).

Mice were then randomly assigned to groups for a new experiment in which a light cue predicted VTA DA stimulation with no concurrent liquid water reward (5–7 days, 150–200 trials per day). Significance testing: rank sum. Typical values for b were 0.25 (although a range of different calculations for b, including b=0, yield consistent results, as noted previously21). The holding room temperature was maintained at 21 ± 1 °C with a relative humidity of 30% to 70%. Mice underwent daily health checks, and water restriction was eased if mice fell below 75% of their original body weight. These insights from the ACTR model suggest that, in real mice, initial sensitivity to reward-related sensory stimuli is reported by increased mDA reward signalling, and this initial condition can explain meaningful individual differences in the courses of future learning.

d, Top: behavioural measures predicted reward collection latency in a GLM for each mouse. These data are presented in Fig. 2f. …(significantly worse performance, rank sum P < 2 × 10−7); learning rate was globally reduced (akin to dopamine depletion37; significantly worse performance, rank sum P < 2 × 10−6); a basal learning rate was intact but there was no adaptive component (akin to disruption of phasic mDA reward signalling38; significantly worse performance, rank sum P=0.02). Linear regression to estimate the contribution of fibre position to variance in mDA reward signals was fitted using MATLAB fitlm. Together, these results define a novel function for mesolimbic dopamine in adapting the learning rate of direct policy learning (summarized in Extended Data Fig. …). This reactive learning was also updated according to PEs.

d, NAc DA cue responses (top) and cued licking (bottom) for stimLick− (green, n=6) and stimLick+ (purple, n=5) mice across training, shown as the difference from control mice. e) (left) No correlation between baseline licking and final latency to collect reward (a measure of learned performance) for all mice (Pearson's P=0.46, n=20).

The emergence of cue signals following uncalibrated dopamine stimulation was captured in ACTR by introducing a nonlinearity in which larger, more sustained dopamine activation was modelled as a large modulation of learning rate coupled with a change in PE encoding (Fig. …). Stimulations and cue deliveries were coordinated with custom-written software using Arduino Mega hardware (https://www.arduino.cc).
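The fibre-position regression mentioned above (fitting the contribution of fibre placement to variance in mDA reward signals with fitlm) might look like the following sketch; the variable names and table construction are assumptions, with one row per animal:

% AP, ML, DV: fibre-tip coordinates per animal; rewardResp: initial NAc DA reward response (assumed names)
tbl = table(AP, ML, DV, rewardResp, 'VariableNames', {'AP','ML','DV','RewardResp'});
mdl = fitlm(tbl, 'RewardResp ~ AP + ML + DV');    % multiple linear regression across all axes
disp(mdl.Coefficients)                            % per-axis estimates and P values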
b, Top: mean NAc DA reward responses across training (trials 1–800) for each mouse, with (coloured traces) and without (black traces) exogenous stimulation, for stimLick− (left, green, n=6) and stimLick+ (right, purple, n=5) cohorts of mice.

Code and data: https://github.com/neurojak/pySpinCapture, https://janelia.figshare.com/account/collections/6369111, https://doi.org/10.25378/janelia.c.6369111, https://www.youtube.com/watch?v=KHZVXao4qXs.

Adjusting the strength of this sensory input at model initialization scales the initial dopamine reward response magnitudes similarly to the range observed in mice (Fig. …). The network stability cost (costN) penalizes high-frequency oscillatory dynamics that can emerge in some (but not all) simulations. Owing to their average negative PE, increasing the learning rate exogenously (by increasing the dopamine signal) only on lick+ trials should bias away from robust preparatory policies and decrease final learned preparatory licking. Multiple comparisons with repeated measures were made using Friedman's test (MATLAB friedman). Preparatory behaviour was assayed across lick, body, whisker and pupil measurements as the total amount of activity during the delay period between cue and reward. NAc DA signals predicted by ACTR and temporal difference (TD) value learning can be visualized by convolving their dopamine signals (the adaptive rate signal in ACTR and the RPE signal from the optimized TD model parameters; Extended Data Fig. …). For simplicity, we choose a soft nonlinearity (half-Gaussian) for convenience of the simple policy gradient that results. To solve the temporal credit assignment problem, we used eligibility traces similar to those described previously36. Data visualizations were created in MATLAB or GraphPad Prism. …(Fig. 2c,d; initial NAc DA reward: anterior–posterior, P=0.5; medial–lateral, P=0.4; dorsal–ventral, P=0.5; multiple linear regression across all axes, P=0.7). …stimulation (top; 30 Hz, 12 mW for 500 ms) or stimulation calibrated to reward responses (bottom; 30 Hz, 13 mW for 150 ms). All mice tested in our experiments began training with no preparatory licking to cues and a long latency (about 1 s or more) to collect water rewards.

c) Example simultaneous recordings from NAc+VTA (top) and NAc+DS (bottom).

However, to facilitate formal comparison between value learning and direct policy learning models, we sought to develop a simplified model that captures a key aspect of ACTR (the specific gradient it uses) and allows for explicit comparison against existing value learning models with the same number of free parameters.
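The half-Gaussian 'soft' nonlinearity mentioned above is convenient because its gradient is simple. A minimal MATLAB sketch under assumed conventions (the nonlinearity maps a scalar policy variable x onto (0,1], saturating at 1 for x <= 0; the width s and this exact form are illustrative, not the published parameterization):

halfGauss  = @(x, s) exp(-max(x, 0).^2 ./ (2 * s.^2));        % 1 for x <= 0, Gaussian fall-off for x > 0
dHalfGauss = @(x, s) -(max(x, 0) ./ s.^2) .* halfGauss(x, s); % analytic derivative used in the policy gradient
x = linspace(-1, 3, 401);
y = halfGauss(x, 0.5);                                        % example evaluation with width s = 0.5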
Regardless of the specific reinforcement learning algorithm favoured, our analyses and experiments discriminate between two potential biological functions of dopamine: a signed error signal that governs the direction of learned changes, and an unsigned adaptive rate signal that governs how much of the error is captured on a given trial. …expectations of reward that indirectly guide action. We refer to these distinct…

To simulate calibrated stimulation of mDA neurons, we multiplied the adaptive rate parameter, αDA, by 2 on the appropriate trials. For simulations reported in Fig. …. Our implementation kept this core aspect of the computation, but several critical updates were made and will be described. …36, the function 𝒮 need only be a signed, nonlinear function. Initial NAc DA signals were predicted from trained behaviour at trials 700–800 by multiple regression (specifically, the pseudoinverse of the data matrix of reactive and preparatory variables at the end of training multiplied by the data matrix of physiological signals for all animals). To this end, we focused on licking, which in the context of this task is the unique aspect of behaviour critical for reward collection.

d) Mean cross-correlations for simultaneously measured NAc+VTA signals (top row, n=3) and NAc+DS signals (bottom row, n=6) in trials 1–100 (left) and trials 700–800 (right) within trial periods (1 s before cue to 3 s after reward).

These rates were also scaled by the same αDA adaptive learning rate parameter (for example, line 346 in dlRNN-train_learnDA.m): …in which αI is the baseline reactive learning rate and typical values were about 0.02 in the presented simulations (again, a range of different initializations was tested). h) 3D learning trajectories as in Fig. …. The data we aimed to evaluate were the frequency of anticipatory licking during the delay period over the first approximately 1,000 trials of naive learning for each mouse. Two-way ANOVA, **P=0.002. e, Modelling results for the difference of dopamine cue responses (top) and cued licking (bottom) from control in stimLick− and stimLick+ contingencies, compared for five possible functions of phasic dopamine signals. A powerful solution is to set an optimal update size for each trial according to some heuristic for how useful each trial could be for learning2. Below we describe the learning rule as implemented here; readers may also examine the commented open-source code for further clarification.
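The pseudoinverse regression used above to predict initial NAc DA signals from end-of-training behaviour has a compact MATLAB form. The matrix names and layout are assumptions (rows are animals):

% B: animals x behavioural variables (reactive + preparatory measures at trials 700-800)
% D: animals x physiological signals (initial NAc DA responses)
W    = pinv(B) * D;    % least-squares regression weights via the pseudoinverse
Dhat = B * W;          % predicted initial signals from trained behaviour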
Intriguingly, we also discovered that uncalibrated mDA stimulation 3–5 times stronger than endogenous mDA activity (but with parameters common in the field at present) was well explained by our model as a bias in a signed error in addition to modulating learning rate. …(Fig. 1j) in which: the mDA-like feedback unit signalled PE instead of rate (Extended Data Fig. 1e); …. In animals, behavioural learning and the role of mesolimbic dopamine signalling have been extensively evaluated with respect to reward prediction4; however, so far there has been little consideration of how direct policy learning might inform our understanding5.

g) Average PE on all stimulated trials in each mouse that received VTA DA stimulation, depending on whether they licked during the delay period preceding reward (stimLick+, n=5) or did not lick during the delay period (stimLick−, n=6) (two-tailed signed rank P<0.004, stimLick+ versus stimLick−).

For behaviour and juxtacellular recordings, we used 24 adult male DAT-Cre::ai32 mice (3–9 months old) resulting from the cross of DAT-IRES-Cre (The Jackson Laboratory stock 006660) and Ai32 (The Jackson Laboratory stock 012569) lines of mice, such that a ChR2–EYFP fusion protein was expressed under control of the endogenous dopamine transporter Slc6a3 locus to specifically label dopaminergic neurons. Observed trajectories of preparatory versus reactive behaviour were superimposed on this surface by finding the nearest corresponding point on the fitted two-dimensional surface for the parametric preparatory and reactive trajectories. To model a low-parameter (as compared to ACTR) policy learning equivalent of the TD value learning model from ref. …. For each pair of values, a policy was computed and passed through the behaviour plant 50 times to get an estimate of the mean performance cost. A filtered average cost, R̄, was computed as before36 with αR = 0.75 and used in the update equation for changing network weights through the learning rule described below. Pupil diameter was estimated as the mean of the major and minor axes of the object detected with the MATLAB regionprops function, following noise removal by thresholding the image to separate light and dark pixels, then applying a circular averaging filter and then dilating and eroding the image. This feedback scheme has a direct and intentional parallel to the phasic activity of mDA neurons in this task, which is well described as the sum of action- and sensory-related components of reward prediction9,13 and occurs in parallel to direct sensorimotor outputs34. …and dopamine sets an adaptive learning rate rather than an error-like teaching signal. b, Ten-trial binned behavioural quantification across the first 800 training trials. Reactive behaviour was assayed across nose, body and whisker measurements as the latency to initiate following reward delivery. This range corresponded to the space of all possible policy outputs realizable by the ACTR network.
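The weight update built from the filtered average cost R̄, the eligibility traces eij and the adaptive rate αDA can be sketched as a small MATLAB function. This is a node-perturbation style rule in the spirit of the approach described in the text (and of ref. 36); the learning-rate factor eta and the specific half-Gaussian-based signed nonlinearity are assumptions, not the published implementation:

function [W, Rbar] = updateWeights(W, e, R, Rbar, alphaDA, eta)
% W: recurrent weight matrix; e: eligibility traces e_ij accumulated over the trial
% R: this trial's performance cost; Rbar: running average cost; alphaDA: adaptive rate
alphaR = 0.75;                                              % filtering constant (from the text)
Rbar   = alphaR * Rbar + (1 - alphaR) * R;                  % update the average cost
S      = @(x) sign(x) .* (1 - exp(-x.^2 ./ (2 * 0.5^2)));   % signed, saturating nonlinearity (assumed form)
W      = W + eta * alphaDA * e .* S(Rbar - R);              % better-than-average trials reinforce the perturbation
end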
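The pupil-diameter estimate described above maps onto a short image-processing sketch. The threshold choice, filter radius and structuring-element size below are assumptions; only the sequence of steps and the use of regionprops follow the text:

function pupilDiam = estimatePupilDiameter(frame)
% frame: grayscale eye-camera image (uint8)
bw = frame < graythresh(frame) * 255;                              % keep dark pixels (candidate pupil)
bw = imfilter(double(bw), fspecial('disk', 5)) > 0.5;              % circular averaging filter (radius assumed)
bw = imerode(imdilate(bw, strel('disk', 3)), strel('disk', 3));    % dilate then erode to clean the mask
stats = regionprops(bw, 'MajorAxisLength', 'MinorAxisLength', 'Area');
[~, k] = max([stats.Area]);                                        % take the largest detected object
pupilDiam = mean([stats(k).MajorAxisLength, stats(k).MinorAxisLength]);
end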